Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games 9783031452512, 9783031452529

Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games develops its specific learning techniq

107 75 9MB

English Pages 287 [278] Year 2024

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games
 9783031452512, 9783031452529

Table of contents :
Series Editor’s Foreword
Preface
Acknowledgements
Contents
Abbreviations and Notation
Abbreviations
Notation
1 Introduction
1.1 Motivation
1.2 Optimal Control
1.3 Integral Reinforcement Learning
1.4 Inverse Optimal Control
1.5 Inverse Reinforcement Learning
1.6 Outline of This Book
References
2 Background on Integral and Inverse Reinforcement Learning for Feedback Control
2.1 Integral Reinforcement Learning for Continuous-Time Systems
2.1.1 Linear Quadratic Regulators
2.1.2 Integral Reinforcement Learning
2.2 Inverse Optimal Control for Continuous-Time Systems
2.2.1 Inverse Optimal Control for Linear Systems
2.2.2 Inverse Optimal Control for Nonlinear Systems
References
Part I Integral Reinforcement Learning for Optimal Control Systems and Games
3 Integral Reinforcement Learning for Optimal Regulation
3.1 Introduction
3.2 On-Policy Synchronous Integral Reinforcement Learning with Experience Replay for Nonlinear Constrained Systems
3.2.1 Problem Formulation
3.2.2 Offline Integral Reinforcement Learning Policy Iteration
3.2.3 Value Function Approximation
3.2.4 Synchronous Online Integral Reinforcement Learning for Nonlinear Constrained Systems
3.2.5 Simulation Examples
3.3 Off-Policy Integral Reinforcement Learning for Linear Quadratic Regulators with Input–Output Data
3.3.1 Discounted Optimal Control Problem
3.3.2 State-Feedback Off-Policy RL with Input-State Data
3.3.3 Output-Feedback Off-Policy RL with Input–Output Data
3.3.4 Simulation Examples
References
4 Integral Reinforcement Learning for Optimal Tracking
4.1 Introduction
4.2 Integral Reinforcement Learning Policy Iteration for Linear …
4.2.1 Problem Formulation
4.2.2 Augmented Algebraic Riccati Equation for Causal Solution
4.2.3 Integral Reinforcement Learning for Online Linear Quadratic Tracking
4.2.4 Simulation Examples
4.3 Online Actor–Critic Integral Reinforcement Learning …
4.3.1 Standard Problem Formulation and Solution
4.3.2 New Formulation for the Optimal Tracking Control Problem of Constrained-Input Systems
4.3.3 Tracking Bellman and Hamilton–Jacobi–Bellman Equations
4.3.4 Offline Policy Iteration Algorithms
4.3.5 Online Actor–Critic-Based Integral Reinforcement Learning
4.4 Simulation Results
References
5 Integral Reinforcement Learning for Zero-Sum Games
5.1 Introduction
5.2 Off-Policy Integral Reinforcement Learning for upper H Subscript normal infinityHinfty Tracking Control
5.2.1 Problem Formulation
5.2.2 Tracking Hamilton–Jacobi–Isaacs Equation and the Solution Stability
5.2.3 Off-Policy Integral Reinforcement Learning for Tracking Hamilton–Jacobi–Isaacs Equation
5.2.4 Simulation Examples
5.3 Off-Policy Integral Reinforcement Learning for Distributed …
5.3.1 Formulation of Distributed Minmax Strategy
5.3.2 Stability and Robustness of Distributed Minmax Strategy
5.3.3 Off-Policy Integral Reinforcement Learning for Distributed Minmax Strategy
5.3.4 Simulation Examples
References
Part II Inverse Reinforcement Learning for Optimal Control Systems and Games
6 Inverse Reinforcement Learning for Optimal Control Systems
6.1 Introduction
6.2 Off-Policy Inverse Reinforcement Learning for Linear Quadratic Regulators
6.2.1 Problem Formulation
6.2.2 Inverse Reinforcement Learning Policy Iteration
6.2.3 Model-Free Off-Policy Inverse Reinforcement Learning
6.2.4 Simulation Examples
6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …
6.3.1 Problem Formulation
6.3.2 Model-Based Inverse Reinforcement Learning
6.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning
6.3.4 Simulation Examples
References
7 Inverse Reinforcement Learning for Two-Player Zero-Sum
Games
7.1 Introduction
7.2 Inverse Q-Learning for Linear Two-Player Zero-Sum Games
7.2.1 Problem Formulation
7.2.2 Model-Free Inverse Q-Learning
7.2.3 Implementation of Inverse Q-Learning Algorithm
7.2.4 Simulation Examples
7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games
7.3.1 Problem Formulation
7.3.2 Inverse Reinforcement Learning Policy Iteration
7.3.3 Model-Free Off-Policy Integral Inverse Reinforcement
Learning
7.3.4 Simulation Examples
7.4 Online Adaptive Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games
7.4.1 Integral RL-Based Off line Inverse Reinforcement Learning
7.4.2 Online Inverse Reinforcement Learning with Synchronous Neural Networks
7.4.3 Simulation Examples
References
8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games
8.1 Introduction
8.2 Off-Policy Inverse Reinforcement Learning for Linear Multiplayer Non-Zero-Sum Games
8.2.1 Problem Formulation
8.2.2 Inverse Reinforcement Learning Policy Iteration
8.2.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning
8.2.4 Simulation Examples
8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Multiplayer Non-Zero-Sum Games
8.3.1 Problem Formulation
8.3.2 Inverse Reinforcement Learning Policy Iteration
8.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning
8.3.4 Simulation Examples
References
Appendix A Some Useful Facts in Matrix Algebra
Index

Citation preview

Advances in Industrial Control

Bosen Lian · Wenqian Xue · Frank L. Lewis · Hamidreza Modares · Bahare Kiumarsi

Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games

Advances in Industrial Control Series Editor Michael J. Grimble, Industrial Control Centre, University of Strathclyde, Glasgow, UK Editorial Board Graham Goodwin, School of Electrical Engineering and Computing, University of Newcastle, Callaghan, NSW, Australia Thomas J. Harris, Department of Chemical Engineering, Queen’s University, Kingston, ON, Canada Tong Heng Lee , Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore Om P. Malik, Schulich School of Engineering, University of Calgary, Calgary, AB, Canada Kim-Fung Man, City University Hong Kong, Kowloon, Hong Kong Gustaf Olsson, Department of Industrial Electrical Engineering and Automation, Lund Institute of Technology, Lund, Sweden Asok Ray, Department of Mechanical Engineering, Pennsylvania State University, University Park, PA, USA Sebastian Engell, Lehrstuhl für Systemdynamik und Prozessführung, Technische Universität Dortmund, Dortmund, Germany Ikuo Yamamoto, Graduate School of Engineering, University of Nagasaki, Nagasaki, Japan

Advances in Industrial Control is a series of monographs and contributed titles focusing on the applications of advanced and novel control methods within applied settings. This series has worldwide distribution to engineers, researchers and libraries. The series promotes the exchange of information between academia and industry, to which end the books all demonstrate some theoretical aspect of an advanced or new control method and show how it can be applied either in a pilot plant or in some real industrial situation. The books are distinguished by the combination of the type of theory used and the type of application exemplified. Note that “industrial” here has a very broad interpretation; it applies not merely to the processes employed in industrial plants but to systems such as avionics and automotive brakes and drivetrain. This series complements the theoretical and more mathematical approach of Communications and Control Engineering. Indexed by SCOPUS and Engineering Index. Proposals for this series, composed of a proposal form (please ask the in-house editor below), a draft Contents, at least two sample chapters and an author cv (with a synopsis of the whole project, if possible) can be submitted to either of the: Series Editor Professor Michael J. Grimble: Department of Electronic and Electrical Engineering, Royal College Building, 204 George Street, Glasgow G1 1XW, United Kingdom; e-mail: [email protected] or the In-house Editor Mr. Oliver Jackson: Springer London, 4 Crinan Street, London, N1 9XW, United Kingdom; e-mail: [email protected] Proposals are peer-reviewed. Publishing Ethics Researchers should conduct their research from research proposal to publication in line with best practices and codes of conduct of relevant professional bodies and/or national and international regulatory bodies. For more details on individual ethics matters please see: https://www.springer.com/gp/authors-editors/journal-author/journal-author-helpdesk/ publishing-ethics/14214

Bosen Lian · Wenqian Xue · Frank L. Lewis · Hamidreza Modares · Bahare Kiumarsi

Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games

Bosen Lian Department of Electrical and Computer Engineering Auburn University Auburn, AL, USA

Wenqian Xue State Key Laboratory of Synthetical Automation for Process Industries Northeastern University Shenyang, Liaoning, China

Frank L. Lewis UTA Research Institute University of Texas at Arlington Arlington, TX, USA

Hamidreza Modares Department of Mechanical Engineering Michigan State University East Lansing, MI, USA

Bahare Kiumarsi Department of Electrical and Computer Engineering Michigan State University East Lansing, MI, USA

ISSN 1430-9491 ISSN 2193-1577 (electronic) Advances in Industrial Control ISBN 978-3-031-45251-2 ISBN 978-3-031-45252-9 (eBook) https://doi.org/10.1007/978-3-031-45252-9 Mathematics Subject Classification: 49L20, 93-08, 49N05, 49N45, 91A23 MATLAB is a registered trademark of The MathWorks, Inc. See https://www.mathworks.com/trademarks for a list of additional trademarks. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

To my family, who support me every day —Bosen Lian To my family and supervisors —Wenqian Xue To Galina —Frank L. Lewis To my parents —Hamidreza Modares To my parents —Bahare Kiumarsi

Series Editor’s Foreword

Control engineering is considered rather differently by researchers who often produce very general design approaches and engineers who must implement and maintain specific industrial control systems. It is of course valuable to develop algorithms for control problems with a well-defined mathematical basis, but engineers often have different concerns over the limitations of equipment, quality of control, safety, security, and possible system downtime. The monograph series Advances in Industrial Control (AIC) attempts to bridge this divide by encouraging the consideration of advanced control techniques where they offer real benefits and where they also address some of the more practical everyday problems. The rapid development of new control theory, techniques, and technology has an impact on all areas of engineering. This series focuses on applications of advanced control that may even stimulate the development of new industrial control paradigms. This is desirable if the different aspects of the “control design” problem are to be explored with the same dedication that “analysis and synthesis” problems have received in the past. The series enables researchers to introduce new ideas motivated by challenging problems in interesting applications. It raises awareness of the various benefits that advanced control can provide whilst also covering the challenges that can arise. This monograph has the role of introducing the state of the art in the exciting and rapidly expanding area of artificial intelligence—(AI)-based, data-driven control. This monograph for the series is concerned with “reinforcement learning and inverse reinforcement learning” for “optimal” control systems. This is one of the most useful areas for AI to be employed in control engineering applications. Reinforcement learning is based on the simple idea that animals adapt their behaviour based on rewards or punishments received. Chapter 1 introduces the ideas of reinforcement learning and integral reinforcement learning (IRL) with a view to reducing steadystate errors. Inverse reinforcement learning is the inverse problem of reinforcement learning, i.e., the problem of inferring the “reward function” of an expert or agent, given its policy, or observed behaviour, and without prior knowledge of the system model. The subject has much in common with a general form of adaptive control, and

vii

viii

Series Editor’s Foreword

indeed, a family of optimal adaptive controllers can be developed using traditional reinforcement-learning methods. Chapter 2 discusses several areas of optimal control that are reasonably wellknown but these are needed for later use. It begins with the topic of IRL using linear quadratic regulator (LQR) ideas. Chapter 3 concerns optimal regulation problems. IRL for optimal regulation problems provides a type of adaptive controller. These can learn about the systems dynamics in the absence of complete plant knowledge and they enable the online learning of the optimal gains in real time. Chapter 4 considers IRL for optimal control-based tracking problems. The design approach provides optimal tracking performance and at the same time allows system constraints to be satisfied. The combination with the very well-known and familiar linear quadratic optimal control philosophy is helpful since it makes the material on “learning policy” easier to understand. Chapter 5 considers IRL for “zero-sum games” (where the sum of the amounts won equals the combined losses of other players). This is a useful approach for the solution of H∞ optimal tracking control problems. Unlike the usual H∞ optimal control problems where the system dynamics are assumed to be known, in this case, robust controllers can be found to handle disturbances and uncertainties in the system dynamics. In Part II of the text the problem of “inverse reinforcement learning” for optimal control systems is discussed. Inverse reinforcement learning infers the implicit reward function or cost function from the behaviour experienced. Given the reward function and a system model, an optimal controller can, of course, be computed. Chapter 6 begins with reinforcement learning for LQRs assuming no external disturbances. Both model-based and model-free data-driven inverse reinforcement learning algorithms are described along with their properties. Chapter 7 deals with systems that have noncooperative and adversarial inputs such as disturbances. The optimal control solution is obtained by minimizing the effects of the worst-case adversarial input, using the Hamilton–Jacobi–Isaacs equation for nonlinear systems or the usual game algebraic Riccati equation for linear systems. The focus is on model-free inverse solutions for dynamic systems. Chapter 8 turns to the more difficult problem of inverse reinforcement learning for multiplayer non-zero-sum games. There is a range of applications that involve multiple control inputs, such as in road vehicles, where there are various controls (steering wheel, pedals for acceleration and braking, turning signals, and gear selectors). This text necessarily involves significant theoretical analysis and synthesis theory, but the results provided have a rather practical aim. The theoretical problems and solutions are probably the simplest that could be chosen to illustrate the potential of the algorithms developed. The aim to reduce dependency on models and rely more on data obtained for the control design process is quite challenging. The need to improve the quality of control through learning is also an ambitious target. The text includes simulation examples in the chapters to illustrate the ideas and it has a set of references and an appendix to help the reader.

Series Editor’s Foreword

ix

The monograph is very suitable for students and researchers working on the use of AI in control systems since it provides a mathematical basis in an area where ideas are often presented in an intuitive style. Engineers wishing to use AI and machine learning methods in areas such as robotics, automotive applications, aircraft systems, power systems, and chemical processes can be comforted that the approach described in this AIC monograph is relatively simple given the challenge and complexity of the problem. Glasgow, UK August 2023

Michael J. Grimble

Preface

Reinforcement learning (RL) is a powerful learning approach that empowers agents or decision makers to learn optimal policies through interactions with the environment, aiming at optimizing a prescribed performance function such as cost, reward, or profit. In the context of continuous-time systems, integral RL (IRL) has emerged as a framework built upon policy iteration and value iteration. IRL leverages these methods to solve Hamilton–Jacobi–Bellman and Hamilton–Jacobi–Isaacs equations for optimal controllers online and forward in time with completely or partially unknown system dynamics. It is crucial to emphasize that IRL operates based on a predefined performance function and focuses on goal-oriented learning. Inverse RL, on the other hand, tackles the inverse problem of RL, wherein it observes the behavior of an expert with optimal outcomes and deduces the motivating performance function driving their actions and states, all without any prior knowledge of the system dynamics. This book offers cutting-edge insights into IRL and inverse RL for optimal control systems and games, presented in two parts. The IRL part addresses the challenges posed by the continuous-time Hamiltonian in optimal control systems and games, which incorporates system dynamics. The part shows a powerful methodology that overcomes the limitations of applying RL to continuous-time systems and unlocks the potential for online solutions of HJB equations, fueling advancements in the field of optimal control. In contrast, the inverse RL part of the book introduces modelfree capabilities to unravel the underlying performance functions that drive observed optimal behaviors in the realm of system and control theory, distinguishing it from existing model-based inverse optimal control approaches. The motivation behind this book stems from extensive research conducted by the authors in the field of IRL and inverse RL applied to various domains, including linear quadratic regulators, linear quadratic trackers, nonlinear systems, zero-sum games, and multiplayer noncooperative games. By providing real-world application examples, the authors bring the concepts to life and demonstrate the relevance of IRL and inverse RL across diverse domains. With its state-of-the-art insights, this book is at the forefront of the field, making it an essential resource for researchers, practitioners, and students interested in IRL, inverse RL, and their applications in optimal control systems and games. xi

xii

Preface

Chapter 1 provides a comprehensive introduction to optimal control, IRL, inverse optimal control, and inverse RL. Following that, Chap. 2 delves into the fundamental background knowledge of IRL and inverse RL for continuous-time systems. The IRL Part I comprises three chapters, each presenting IRL-based algorithms for designing optimal control in the context of optimal regulation, tracking, and zero-sum games (or H∞ control), respectively. In Inverse RL Part II, three chapters highlight state-of-theart inverse RL algorithms for optimal linear and nonlinear systems, two-player zerosum games, and multiplayer non-zero-sum games. Throughout the book, rigorous convergence and stability analyses accompany each main result. This book serves as a focused and comprehensive introduction and background resource for IRL and inverse RL, making it suitable for students and newcomers to the topics. Additionally, researchers in adaptive dynamic programming, intelligent control, machine learning, and artificial intelligence will find value in this book, which not only offers mathematically rigorous developments, but also presents comprehensive and in-depth IRL results and the latest advancements in inverse RL for optimal systems and games. Moreover, the theoretical insights provided in this book can guide those working on industrial applications such as aircraft, robotics, power systems, and communication networks. By delving into the depths of IRL and inverse RL, this book equips readers with the knowledge and tools necessary to tackle real-world challenges and drive innovation in various domains. Auburn, USA Shenyang, China Fort Worth, USA East Lansing, USA East Lansing, USA

Bosen Lian Wenqian Xue Frank L. Lewis Hamidreza Modares Bahare Kiumarsi

Acknowledgements

We thank all the people who have contributed to the development of the main results presented in this book: Ali Davoudi, Mohammad-Bagher Naghibi-Sistani, Tianyou Chai, and Vrushabh S. Donge. We express our gratitude to Li Zhang, Oliver Jackson, and Bhagyalakkshme Sreenivasan for their invaluable assistance in proofreading the book. Special appreciation goes to Oliver for providing meticulous corrections and detailed feedback. The research presented in this book was supported by National Science Foundation grant ECCS-1128050, National Science Foundation grant IIS-1208623, AFOSR EOARD Grant# 13-3055, Office of Naval Research under Grants N00014-13-1-0562, N00014-14-1-0718, N00014-18-1-2221, and the Army Research Office under Grants W911NF-11-D-0001, W911NF-20-1-0132.

xiii

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Integral Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Inverse Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Outline of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Background on Integral and Inverse Reinforcement Learning for Feedback Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Integral Reinforcement Learning for Continuous-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Linear Quadratic Regulators . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Integral Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . 2.2 Inverse Optimal Control for Continuous-Time Systems . . . . . . . . . . 2.2.1 Inverse Optimal Control for Linear Systems . . . . . . . . . . . . . 2.2.2 Inverse Optimal Control for Nonlinear Systems . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Part I

1 1 2 3 4 5 6 7 11 11 12 18 27 28 31 35

Integral Reinforcement Learning for Optimal Control Systems and Games

3 Integral Reinforcement Learning for Optimal Regulation . . . . . . . . . . 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 On-Policy Synchronous Integral Reinforcement Learning with Experience Replay for Nonlinear Constrained Systems . . . . . . 3.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Offline Integral Reinforcement Learning Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . .

39 39 39 40 42 43

xv

xvi

Contents

3.2.4 Synchronous Online Integral Reinforcement Learning for Nonlinear Constrained Systems . . . . . . . . . . . . . . . . . . . . . 3.2.5 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Off-Policy Integral Reinforcement Learning for Linear Quadratic Regulators with Input–Output Data . . . . . . . . . . . . . . . . . . 3.3.1 Discounted Optimal Control Problem . . . . . . . . . . . . . . . . . . . 3.3.2 State-Feedback Off-Policy RL with Input-State Data . . . . . . 3.3.3 Output-Feedback Off-Policy RL with Input–Output Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

45 54 55 56 60 61 67 69

4 Integral Reinforcement Learning for Optimal Tracking . . . . . . . . . . . . 71 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.2 Integral Reinforcement Learning Policy Iteration for Linear Quadratic Tracking Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2.2 Augmented Algebraic Riccati Equation for Causal Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.2.3 Integral Reinforcement Learning for Online Linear Quadratic Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.2.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.3 Online Actor–Critic Integral Reinforcement Learning for Optimal Tracking Control of Nonlinear Systems . . . . . . . . . . . . . 84 4.3.1 Standard Problem Formulation and Solution . . . . . . . . . . . . . 84 4.3.2 New Formulation for the Optimal Tracking Control Problem of Constrained-Input Systems . . . . . . . . . . . . . . . . . . 86 4.3.3 Tracking Bellman and Hamilton–Jacobi–Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.3.4 Offline Policy Iteration Algorithms . . . . . . . . . . . . . . . . . . . . . 94 4.3.5 Online Actor–Critic-Based Integral Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5 Integral Reinforcement Learning for Zero-Sum Games . . . . . . . . . . . . 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Off-Policy Integral Reinforcement Learning for H∞ Tracking Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Tracking Hamilton–Jacobi–Isaacs Equation and the Solution Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.3 Off-Policy Integral Reinforcement Learning for Tracking Hamilton–Jacobi–Isaacs Equation . . . . . . . . . . . 5.2.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

109 109 110 110 112 121 126

Contents

xvii

5.3 Off-Policy Integral Reinforcement Learning Reinforcement Learning (RL) for Distributed Minmax Strategy Distributed minmax strategy of Multiplayer Games . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Formulation of Distributed Minmax Strategy . . . . . . . . . . . . . 5.3.2 Stability and Robustness of Distributed Minmax Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.3 Off-Policy Integral Reinforcement Learning for Distributed Minmax Strategy . . . . . . . . . . . . . . . . . . . . . . . 5.3.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Part II

127 128 131 138 144 146

Inverse Reinforcement Learning for Optimal Control Systems and Games

6 Inverse Reinforcement Learning for Optimal Control Systems . . . . . . 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Off-Policy Inverse Reinforcement Learning for Linear Quadratic Regulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Inverse Reinforcement Learning Policy Iteration . . . . . . . . . . 6.2.3 Model-Free Off-Policy Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Optimal Control Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.2 Model-Based Inverse Reinforcement Learning . . . . . . . . . . . 6.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Inverse Q-Learning for Linear Two-Player Zero-Sum Games . . . . . 7.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.2 Model-Free Inverse Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . 7.2.3 Implementation of Inverse Q-Learning Algorithm . . . . . . . . 7.2.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Inverse Reinforcement Learning Policy Iteration . . . . . . . . . . 7.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

151 151 152 152 154 159 163 164 164 166 173 177 180 183 183 184 184 187 193 194 196 196 199 202

xviii

Contents

7.3.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Online Adaptive Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games . . . . . . . . . . . . . . . . . . . . 7.4.1 Integral RL-Based Offline Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.2 Online Inverse Reinforcement Learning with Synchronous Neural Networks . . . . . . . . . . . . . . . . . . . . 7.4.3 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Off-Policy Inverse Reinforcement Learning for Linear Multiplayer Non-Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.2 Inverse Reinforcement Learning Policy Iteration . . . . . . . . . . 8.2.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Multiplayer Non-Zero-Sum Games . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.2 Inverse Reinforcement Learning Policy Iteration . . . . . . . . . . 8.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3.4 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

206 207 208 208 220 223 225 225 226 226 229 234 239 243 243 245 249 254 256

A Some Useful Facts in Matrix Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

Abbreviations and Notation

Abbreviations ARE BLS CT GARE GM HJB HJI IOC IRL LQR LQT NN OPFB OTCP PDE PE PI RL RLS TD UUB VFA

Algebraic Riccati Equation Batch Least Squares Continuous Time Game Algebraic Riccati Equation Gain Margin Hamilton–Jacobi–Bellman Hamilton–Jacobi–Isaacs Inverse Optimal Control Integral Reinforcement Learning Linear Quadratic Regulator Linear Quadratic Tracking Neural Networks Output Feedback Optimal Tracking Control Problem Partial Differential Equation Persistent Of Excitation Policy Iteration Reinforcement Learning Recursive Least Squares Temporal Difference Uniformly Ultimate Bounded Value Function Approximation

Notation R Rn

Set of Real Numbers Set of Real Column Vectors xix

xx

Rn×m || · || |·| In 0n rank(A) A>0 A≥0 A 0 the penalty/cost weights. This performance index includes the energy .x T Qx of the state and the energy .u T Ru of the input. (Recall that energy is a quadratic form, e.g., the kinetic energy of motion with velocity .v(t) is . K = 21 mv.2 .) The matrices . Q and . R are parameters selected by the design engineer to trade off the weighting of the state energy and the control energy. Optimal Control Problem The linear quadratic regulator problem is to find the control input .u(t) that minimizes the performance index . J as the state moves along the trajectories prescribed by the dynamics (2.1). Stabilizability and Detectability The matrix pair .(A, B) is said to be stabilizable if there exists a control input .u(t) such that the state √ goes to zero, that is, .x(t) → 0 as time √.t → ∞ in (2.1). Defining an output . y = Qx for the dynamics (2.1), the pair .(A, Q) is said to be detectable if output . y(t) → 0 as .t → ∞ implies that the full state .x(t) → 0. The detectability condition means that all state excursions away from zero are eventually perceptible through the performance index. Controllability and Observability These are stronger and more familiar properties than stabilizability and detectability. Let .u = −K x be a state-feedback control. A system .(A, B) is said to be controllable if the eigenvalues of the closed-loop system .x˙ = (A − B K )x can be assigned to any desired values by selecting a proper feedback matrix . K . Note that stabilizability means that can be selected to make .x˙ = (A − B K )x stable. Hence, controllability implies stabilizability. In fact, stabilizable means that the unstable modes are controllable. .(A, B) is controllable if the input-coupling matrix .[s I − A, B] has full row rank for all values of .s. .(A, B) is stabilizable if .[s I − A, B] has full row rank except at the stable eigenvalues of . A. A system .(A, C) is said to be observable if the eigenvalues of .(A − LC) can be assigned to any desired values by selecting a proper observer matrix . L. Detectability means that the eigenvalues of .(A − LC) can be selected to be stable. Hence, observability [ ] implies detectability..(A, C) is observable if the output-coupling matrix sI − A . has full column rank for all values of .s. .(A, C) is detectable if the outputC coupling matrix has full column rank except at the stable eigenvalues of . A. Optimality and Stability A continuous control input .u(t) that minimizes .V (x(t)) is called optimal. A milder requirement is that .u(t) is stabilizing. This stability means that .u(t) applied to the dynamics (2.1) results in a state .x(t) that goes to zero with time .t. Suppose .u(t) is continuous and optimal. Then .u(t) yields a minimum value of . V (x(t)) in (2.3). Then infinite integral . V (x(t)) takes on a finite value and the integrand is continuous. Hence, the√integrand .x T Qx + u T Qu = y T y = u T Ru goes to zero. Since . R > 0, both . y(t) = Qx and .u(t) go to zero with time. This implies that

14

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

√ input .u(t) → 0, and .x(t) → 0 if .(A, Q) is detectable. Consequently, optimality implies stability under these conditions. Admissibility Control input .u(t) is said to be admissible if it is continuous, stabilizes system (2.1), and yields a finite value of V (x(t)) in (2.3). Value Function If .u(t) is any stabilizing control, then the associated function .V (x(t)) with (2.3), i.e., ʃ ∞ ʃ ∞ . V (x(t)) = r (x, u)dτ = (x T Qx + u T Ru)dτ (2.3) t

t

is finite for the dynamics (2.3) and is known as the Value Function. It represents V (x(t)) computed along the trajectories given by (2.1) when the input is .u(t). It is termed as the value of using that control. There is one basic approach to solving the optimal control problem and finding the optimal LQR control: the Minimum Principle. We shall now consider it. To find the LQR by the Minimum Principle, the first step is to differentiate the value function.

.

Leibniz’s Formula Recall that, given a function ʃ ϕ(z, t) =

β(t)

.

α(t)

φ(z, t)dz,

its time derivative is given by Leibniz’s formula (Olver 1993) ʃ β(t) d ∂ ' ' . ϕ(z, t) = β φ(β, t) − α φ(α, t) + φ(z, t)dz dt ∂t α(t)

(2.4)

(2.5)

where prime denotes the time derivative. Bellman Equation Using Leibniz’s formula (2.5) in (2.3) yields .

1 V˙ (x(t)) = − (x T (t)Qx(t) + u T (t)Ru(t)). 2

(2.6)

On the other hand, using the chain rule one can express .V˙ (x(t)) in terms of .x˙ so that .

1 V˙ (x(t)) + (x T (t)Qx(t) + u T (t)Ru(t)) 2 1 ∂V T ) x˙ + (x T (t)Qx(t) + u T (t)Ru(t)) = 0. =( ∂t 2

(2.7)

This object is important enough that it deserves its own name. Consequently, defining the Hamiltonian function . H (x, ∇V, u) and substituting from (2.1), one writes the Bellman equation

2.1 Integral Reinforcement Learning for Continuous-Time Systems

.

15

1 ∆ H (x, ∇V, u) = ∇V T (x(t))(Ax(t) + Bu(t)) + (x T (t)Qx(t) + u T (t)Ru(t)) = 0, 2 (2.8)

where the gradient is defined as .∇V (x(t)) = ∇V /∇x. This is a partial differential equation (PDE). Its initial condition is .V (x(0)) = 0 with .t = 0 the initial time. The Hamiltonian function is the central quantity in Hamiltonian dynamics in physics. It combines the system dynamics (2.1) and the performance requirements (2.3) into a single object. In Optimal Control, there is generally no requirement for the Hamiltonian function to be equal to zero (Lewis et al. 2012). However, if it does equal zero, then equations (2.8) and the combination of (2.1) and (2.3) are equivalent. A formal result shows that, given a continuous, stabilizing control .u(t), the .V (x(t)) found by solving the PDE (2.8) is the same as .V (x(t)) found when integrating dynamics (2.1) for an infinite time horizon and evaluating (2.3). The point is that by solving PDE (2.8), we do not need to simulate the system (2.1) over an infinite time horizon to evaluate its performance, or to control the actual system corresponding to (2.1) in real time over an infinite time horizon. This highlights the extreme importance of (2.8) and justifies giving it a name: the Bellman equation. The importance of the Bellman equation is not exploited in the standard LQR solution procedure. We shall see in subsequent sections that it is the basis for reinforcement learning. Hamilton–Jacobi–Bellman (HJB) Equation The Stationarity Condition states that the optimal control is found by minimizing . H (x, ∇V, u). This is a special case of Pontryagin’s Minimum Principle which we discuss in the next chapter. To minimize (2.8), the stationarity condition requires that .

∂ H (x, ∇V, u) = B T ∇V (x(t)) + Ru(t) = 0, ∂u

(2.9)

u ∗ (t) = −R −1 B T ∇V (x(t)).

(2.10)

so that .

Now substitute this control into the Bellman equation (2.8) to obtain the HJB equation 1 ∇V T (x(t))(Ax(t) − B R −1 B T ∇V (x(t))) + x T (t)Qx(t) 2 1 + ∇V T (x(t)))B R −1 B T ∇V (x(t)) = 0 2

.

or 1 1 ∇V T (x(t))Ax(t) + x T (t)Qx(t) − ∇V T (x(t))B R −1 B T ∇V (x(t)) = 0. (2.11) 2 2

.

16

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

The HJB equation is a quadratic PDE with .V (x(0)) = 0 that must be solved for V (x(t)). Its initial condition is .V (x(0)) = 0 with .t = 0 the initial time. Note that the Hessian is given by

.

.

∂ 2 H (x, ∇V, u) = R > 0, ∂u 2

(2.12)

so that the stationarity condition yields a minimum value of . H (x, ∇V, u), not a maximum. A formal theorem says that if the HJB equation (2.11) is solved for a continuous solution .V ∗ (x(t)), then the control (2.10) is continuous and minimizes (2.3). Furthermore, the minimal value of (2.3) is given by .V ∗ (x(t)). Finally, since .V ∗ (x(t)) is finite, the control (2.10) stabilizes the system (2.1). This theorem is proven in the next chapter for the full case of nonlinear dynamics and nonlinear performance index. LQR Solution and Algebraic Riccati Equation (ARE) For the LQR, the sweep method states that the value function has the special quadratic form .

V (x(t)) =

1 T x (t)P x(t) 2

(2.13)

for some matrix. P = P T > 0 ∈ Rn×n . Consequently, the HJB PDE can be simplified. Note that in this case, .∇V = ∂ V /∂ x = P x, so that substituting (2.13) into (2.11) yields .

1 1 x T (t)P Ax(t) + x T (t)Qx(t) − x T (t)P B R −1 B T P x(t) = 0 2 2

(2.14)

or .

1 T x (t)(AT P + P A + Q − P B R −1 B T P)x(t) = 0. 2

(2.15)

Note that .x T (t)P Ax(t) = 21 x T (AT P + P A)x. Now assume that this equation holds for all initial conditions in (2.1), and hence all state trajectories .x(t). Then one has the ARE .

AT P + P A + Q − P B R −1 B T P = 0.

(2.16)

Thus, given the quadratic form (2.13) the HJB PDE (2.11) has been converted into the quadratic matrix equation (2.16). This equation is easily solved for . P using, for instance, MATLAB.® routine .lqr (A, B, Q, R). Note that if (2.13) holds then, according to (2.13), the LQR optimal control is u ∗ (t) = −R −1 B T P x(t).

.

(2.17)

2.1 Integral Reinforcement Learning for Continuous-Time Systems Table 2.1 Linear quadratic regulator Algebraic Riccati equation (ARE) T −1 B T P = 0. .A P + P A + Q − P B R LQR Optimal Feedback Gain ∗ −1 B T P x(t) = −K x(t). .u (t) = −R

17

(2.18) (2.19)

Our results are summarized in Table 2.1 and in the following result, which we deem sufficiently important to formulate as a Theorem. Understanding the proof is an important factor in comprehending the LQR. Theorem 2.1 Given linear dynamics (2.1) √ and quadratic performance index (2.2), suppose .(A, B) is stabilizable and .(A, Q) is observable. Then the ARE (2.18) has a unique positive-definite solution and the closed-loop dynamics .

x(t) ˙ = Ax + Bu = (A − B K )x

(2.20)

are asymptotically stable. The control given by the state feedback (2.19) minimizes the value function (2.3). Moreover, the minimal value of (2.3) is given by .V = 21 x T P x. Proof Equations (2.18)–(2.19) yield (A − B K )T P + P(A − B K ) + Q + K T R K = 0.

.

(2.21)

√ Lancaster and Rodman (1995) shows that when .(A, B) is stabilizable and .(A, Q) is observable, (2.18) has a unique positive-definite solution . P > 0. Then, the closedloop system dynamics (2.20) are asymptotically stable. One rewrites (2.3) as ʃ .



V (x(t)) = =

ʃt ∞

ʃ (x T Qx + u T Ru)dτ + (x T Qx + u T Ru)dτ +

t



ʃt ∞

V˙ dτ − V (x(∞)) + V (x(t)) x T P(Ax + Bu)dτ

t

− V (x(∞)) + V (x(t)).

(2.22)

With the state feedback (2.19), (2.14) becomes 2x T (t)P Ax(t) + x T (t)Qx(t) − u ∗T Ru ∗ = 0.

.

Adding (2.23) to (2.22) yields

(2.23)

18

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

ʃ .



V (x(t)) =

(u − u ∗ )T R(u − u ∗ )dτ − V (x(∞)) + V ∗ (x(t)),

(2.24)

t

where .u denotes general control input. Assuming .u(t) is admissible, one has .V (x(∞)) = 0. Equation (2.24) shows that ∗ .u in (2.19) is the optimal control and minimizes the value function (2.3). The minimal value is .V ∗ (x(t)) = 21 x T P x.

2.1.2 Integral Reinforcement Learning The LQR solution requires one to solve the ARE in Table 2.1 and then compute the optimal gain therein. The ARE is a matrix equation that is easily solved using, for instance, MATLAB.® routine .lqr (A, B, Q, R). A major drawback of this solution procedure is that the full dynamics information .(A, B) in (2.1) must be known to solve the ARE. This requires system identification, or for aircraft, extensive wind tunnel tests. RL is a method of Machine Learning from Computer Science that can be adapted to learn the solution to the LQR problem forward-in-time by measuring system input and output data that is available in real time as the dynamics evolve. The systems dynamics do not have to be known in RL. Reinforcement Learning. The precept of RL is to apply a trial control policy to the system, evaluate the performance outcome, and based on that evaluation update the control to improve the performance. To develop a full reinforcement learning approach to LQR design that avoids solving the HJB equation (2.11) (equivalently, the ARE in Table 2.1), and finds the optimal control without knowing the dynamics .(A, B) by measuring data available moving forward in real time, we need Three Steps To Reinforcement Learning: RL Step 1. First, we have to avoid solving the HJB equation (2.11). This is done by policy iteration. RL Step 2. Second, we need to show how to get rid of the system dynamics in Bellman equation (2.8). This is done by integral reinforcement learning. RL Step 3. Third, we need to show how to implement a solution procedure online in real time. This is done by value function approximation. These three steps are respectively presented in the following three sections.

2.1.2.1

Policy Iteration for LQR

The key to developing RL methods for the LQR is the Bellman equation (2.8) derived using the Minimum Principle. There, the Hamiltonian function is required to be equal to zero to provide a differential equation equivalent to the function (2.3), which is in integral form.

2.1 Integral Reinforcement Learning for Continuous-Time Systems

19

Policy Iteration Solution of HJB Equation. Here, we accomplish RL Step 1. The ARE in Table 2.1 is a special case of the HJB equation (2.11). To avoid solving the HJB equation, consider the Bellman equation (2.8) and the control computation (2.10). Now, consider the following iterative procedure. Select a stabilizing control .u(t). Solve the Bellman equation (2.8) for. V (x(t)). Then update the control according to (2.10). Repeat. That is, repeatedly solve (2.8) followed by (2.10). Then it can be shown that this iterative procedure converges to the solution to the HJB equation (2.11). Algorithm 2.1 Policy iteration Set j = 0. Select a stabilizing initial control policy u 0 (t). 1. Policy evaluation: Given a control input u j (x), solve the PDE Bellman equation for V j (x) ∆

H (x, ∇V j , u) = (∇V j )T (Ax + Bu j +

1 T (x Qx + (u j )T (t)Ru j = 0, V j (0) = 0. 2

(2.25)

2. Policy improvement: Update the control policy using u j+1 = −R −1 B T ∇V j .

(2.26)

Stop if V j = V j−1 . Otherwise, set j = j + 1 and go to Step 1.

This repeated interleaving of Bellman equation (2.8) followed by control update (2.10) is known as Policy Iteration and is detailed as Algorithm 2.1. In this algorithm, supscript . j denotes the iteration number. Note that if the Policy Iteration algorithm converges, then .u j+1 = u j . Then, putting (2.26) into (2.25) yields nothing but the HJB equation (2.11). Control Policy. The input is said to be a Control Policy if it is given as a function of the state. That is .u(t) = h(x(t)) for some, possibly nonlinear, function .h(·). It is important to note that, in Policy Iteration Algorithm 2.1, given a control policy .u j , solving (2.25) gives the value function .V j which is equal to the value of the integral (2.3) when the input is .u j . Hence it is called Policy Evaluation. The control update (2.26) can be shown (see next theorem) to result in an improved policy at iteration . j, and hence is known as Policy Improvement. Theorem 2.2 (Policy Iteration) Let the initial control policy .u 0 (t) in Algorithm 2.1 be stabilizing. Then, the control policy .u j at each iteration is stabilizing. Moreover, j+1 .u has a value .V j+1 that is smaller than the value .V j of policy .u j . Finally, the Policy Iteration algorithm converges monotonically to the solution .V (x) to the HJB equation (2.11). Proof By applying the analysis of Theorem 1 in Abu-Khalaf and Lewis (2005) on the linear case in this section, we conclude that iterating on (2.25) and (2.26), conditioned by an initial admissible policy .u 0 (t), .V (x) ≤ V j+1 (x) ≤ V j (x) and .V j (x) → V (x) where .V (x) is the solution of the HJB equation (2.11) and it is also the optimal one. ◻

20

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

Note that Policy Evaluation in Algorithm 2.1 requires a solution of the partial differential Bellman equation (2.25). This is not easy to do. It also requires the full system dynamics.(A, B). We now show Kleinman’s algorithm converts this to a more amenable equation that is easier to solve. Kleinman’s Algorithm For the LQR case, assuming the quadratic form for the LQR value function (2.13), the Bellman equation (2.8) becomes 1 T [x P(Ax + Bu) + (Ax + Bu)T P x + x T Qx + u T Ru] = 0, 2 (2.27) ∆ −1 T and consider the state feedback .u = −R B P x = −K x for any gain matrix . K , not necessarily the feedback given by . R −1 B T P. Putting .u = −K x into (2.27) yields .

H (x, ∇V, u) =

. H (x, ∇V, u)

=

1 T [x P( A − B K )x + (A − B K )T P x + x T Qx + x T K T R K x] = 0. 2

(2.28)

Since this holds for all trajectories .x(t), one obtains the Lyapunov equation .

P( A − B K ) + (A − B K )T P + Q + K T R K = 0.

(2.29)

It can be shown that the Lyapunov equation has a unique positive-definite solution . P if and only if .(A − B K ) is asymptotically stable. Therefore, for the LQR case Policy Iteration Algorithm 2.1 interleaves (2.29), as detailed in Algorithm 2.2, which was first delivered by David Kleinman in 1963 in Kleinman (1968). The convergence of the algorithm is provided therein. Algorithm 2.2 Kleinman’s algorithm Set j = 0. Select a stabilizing initial control policy K 0 . 1. Policy evaluation: Given a gain K j , solve the Lyapunov equation for P j P j ( A − B K j ) + ( A − B K j )T P j + Q + (K j )T R K j = 0.

(2.30)

2. Policy improvement: Update the feedback gain using K j+1 = R −1 B T P j .

(2.31)

Stop if P j = P j−1 . Otherwise, set j = j + 1 and go to Step 1.

Kleinman’s algorithm was very important when it was invented in 1963. Then, there was no MATLAB.® , and hence no routine .lqr (A, B, Q, R) to solve the matrix quadratic ARE. In fact, it was not known how to efficiently solve the ARE until the work of Alan Laub (1979). By contrast, the Lyapunov equation is linear in the

2.1 Integral Reinforcement Learning for Continuous-Time Systems

21

unknown matrix . P and is quite easy to solve. Kleinman’s algorithm solves the matrix quadratic ARE by repeated solutions of the simpler linear matrix Lyapunov equation.

2.1.2.2

Integral Reinforcement Learning Policy Iteration

The Policy Iteration Algorithm 2.1 avoids solving the HJB equation (2.11) by repeatedly solving Bellman’s equation (2.25) and performing control policy updates (2.31). That is, the second-order HJB PDE is solved by repeatedly solving the simpler firstorder Bellman PDE. Nonetheless, solving a PDE is not easy. Kleinman’s Algorithm 2.2 does not solve PDEs but only repeatedly solves a linear Lyapunov equation. Unfortunately, solving the Lyapunov equation requires full information about the system dynamics .(A, B). Now, we accomplish RL Step 2 by showing show how to write another equation that is equivalent to Bellman equation (2.25) but is not a PDE, and does not contain the system dynamics .(A, B). Consider the value function (2.3). By differentiating .V (x(t)), one obtains the PDE Bellman equation (2.8)/(2.27). Consider instead chopping off the tail of (2.3) and writing the value function as ʃ .

V (x(t)) = t

1 = 2



ʃ

r (x, u)dτ = t+T

1 2

ʃ



(x T Qx + u T Ru)dτ

t

(x Qx + u Ru)dτ + T

T

t

1 2

ʃ



(x T Qx + u T Ru)dτ.

(2.32)

t+T

Now note that, because of the infinite upper limit, the last term in this equation is nothing but the future value .V (x(t + T )). Consequently, we obtain the IRL Bellman equation .

1 2

V (x(t)) =

ʃ

t+T

(x T Qx + u T Ru)dτ + V (x(t + T )).

(2.33)

t

This is a difference equation for the value .V (x(t)). The quantity ρ(t, t + T ) =

.

1 2

ʃ

t+T t

(x T Qx + u T Ru)dτ =

1 2

ʃ

t+T

r (x, u)dτ

(2.34)

t

is known as the Integral Reinforcement over time interval .(t, t + T ). It is the value of the value function (2.3) over time interval .(t, t + T ) for any stabilizing policy applied over that interval. The next result is the key to IRL. Lemma 2.1 The value function.V (x(t)) found by solving the IRL Bellman difference equation (2.14) is the same as the value function obtained by solving the Bellman PDE (2.8)/ (2.27).

22

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

Proof The Bellman PDE (2.27) and (2.6) are the same. Integrate (2.6) to obtain

∫_t^{t+T} (d/dτ) V(x(τ)) dτ = V(x(t + T)) − V(x(t)) = −(1/2) ∫_t^{t+T} (x^T Q x + u^T R u) dτ,    (2.35)

which is the same as the IRL Bellman equation (2.14). ◻

According to this result, we can equivalently replace the PDE Bellman equation (2.30) in Policy Iteration Algorithm 2.2 with the simpler IRL Bellman difference equation (2.33), obtaining the IRL Policy Iteration Algorithm 2.3.

Algorithm 2.3 IRL policy iteration
Set j = 0. Select a stabilizing initial control policy u^0(t).
1. Policy evaluation: Given a control input u^j(x), solve the IRL Bellman equation for V^j(x)

V^j(x(t)) − V^j(x(t + T)) = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ,  V^j(0) = 0.    (2.36)

2. Policy improvement: Update the control policy using

u^{j+1} = −R^{-1} B^T ∇V^j.    (2.37)

Stop if V^j = V^{j−1}. Otherwise, set j = j + 1 and go to Step 1.

For the LQR case, the value function has the quadratic form (2.13). Then, we write IRL Algorithm 2.3 for the LQR as Algorithm 2.4.

Algorithm 2.4 IRL policy iteration for LQR
Set j = 0. Select an initial P^0 so that the control policy u^0 = −R^{-1} B^T P^0 x is stabilizing.
1. Policy evaluation: Given a control input u^j(x), solve the IRL Bellman equation for P^j

x^T(t) P^j x(t) − x^T(t + T) P^j x(t + T) = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ.    (2.38)

2. Policy improvement: Update the control policy using

u^{j+1} = −R^{-1} B^T P^j x.    (2.39)

Stop if P^j = P^{j−1}. Otherwise, set j = j + 1 and go to Step 1.

2.1.2.3 Integral Reinforcement Learning by Value Function Approximation

Our final goal here is to develop a full reinforcement learning approach to LQR design that avoids solving the HJB equation (2.11) (equivalently, the ARE in Table 2.1) and finds the optimal control without knowing the dynamics (A, B), by measuring data available in real time. We have shown in RL Step 1 how to avoid solving the HJB equation (equivalently, the ARE) by using Policy Iteration Algorithm 2.1, which repeatedly solves the Bellman PDE (2.25) instead. Then, in RL Step 2, we showed how to avoid solving the Bellman PDE, which requires knowledge of the dynamics (A, B), by solving the IRL Bellman difference equation in IRL Policy Iteration Algorithm 2.3, which needs no dynamics information. Now, we accomplish RL Step 3 and show how to develop a real-time algorithm that can be implemented online, forward in time, without knowing the dynamics A, by measuring data available in real time. This is accomplished by approximation of the value function V(x(t)) in (2.38).

Function Approximation
A basic result in function approximation is the following.

Weierstrass Approximation Theorem. Any continuous real-valued function f(x) of a scalar x can be approximated according to

f(x) = Σ_{l=1}^{L} w_l φ_l(x) + ε(x)    (2.40)

for suitable coefficients w_l, where {φ_l(x)} is a basis set of polynomials in x, that is, {φ_l(x)} = {1, x, x², x³, x⁴, …}. Moreover, as the number of polynomials L goes to infinity, the approximation error ε(x) goes to zero uniformly (i.e., independently of the value of x). In fact, (2.40) is nothing but the first terms of a Taylor series of f(x). More elaborate approximation results are based on, e.g., neural networks (Stone 1948), where a smooth vector function f(x): R^n → R^p with real-valued components f_i(x) is approximated on a compact set ||x|| ≤ R, R > 0, as

f_i(x) = Σ_{l=1}^{L} w_{il} φ_l(x) + ε_i(x),    (2.41)

where {φ_l} are known as activation functions. Defining a coefficient matrix W^T = [w_{il}] ∈ R^{p×L} allows one to write

f(x) = W^T φ(x) + ε(x).    (2.42)

Note that φ(x) is known and selected by the design engineer. It was shown by Hornik et al. (1990) and Sandberg (1998) that, for suitably chosen activation functions, the approximation error ε(x) = [ε_i(x)] is bounded on a compact set. Moreover, as the number of hidden-layer units L goes to infinity, ε(x) goes to zero. In fact, the activation functions must be chosen to be a basis set for f(x). Typical popular activation functions include sigmoids, tanh(x), radial basis functions, etc.

Real-Time IRL Policy Iteration by Value Function Approximation (VFA)
Suppose we write the value function using the value function approximation (VFA) equation

V(x(t)) = W^T φ(x(t))    (2.43)

with φ(x(t)) a suitably chosen activation function vector. Since the value function is a scalar, W^T = [w_l] ∈ R^{1×L} and the weight vector W is a vector of length L. Substitute this VFA equation into the IRL Bellman equation (2.38) in IRL Policy Iteration Algorithm 2.3 to obtain

(W^j)^T φ(x(t)) = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ + (W^j)^T φ(x(t + T)),    (2.44)

or

(W^j)^T [φ(x(t)) − φ(x(t + T))] = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ.    (2.45)

This is a linear equation in the unknown weights W^j. That is, VFA has allowed us to transform the difference equation (2.38) into a linear equation that is far easier to solve. Moreover, using VFA one has

∇V(x(t)) = (∂φ(x(t))/∂x)^T W    (2.46)

with ∂φ/∂x the activation function derivatives with respect to x. Then, the control update (2.39) can be written in terms of the weight vector. The final form of real-time IRL policy iteration is given as Algorithm 2.5. In summary, Policy Iteration has allowed us to avoid solving the second-order PDE HJB equation (2.11) by repeatedly solving the first-order PDE Bellman equation in Policy Iteration Algorithm 2.1. Then IRL has transformed this into IRL Policy Iteration Algorithm 2.3, which instead requires a repeated solution of the IRL Bellman difference equation (2.38). Finally, VFA has transformed this into Online IRL Policy Iteration Algorithm 2.5, with an IRL Bellman equation (2.47) that is linear in the weights. This algorithm is implemented online, forward in time, by measuring data x(t) and u(t) along the system trajectories and using them to solve (2.47) for the weights W^j. Define the activation function difference at time t as


Algorithm 2.5 Real-time IRL policy iteration using VFA
Set j = 0. Select a stabilizing initial control policy u^0(t).
1. Policy evaluation: Given a control input u^j(x), solve the IRL Bellman equation for W^j

(W^j)^T [φ(x(t)) − φ(x(t + T))] = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ = ρ(t, t + T).    (2.47)

2. Policy improvement: Update the control policy using

u^{j+1} = −R^{-1} B^T (∂φ(x(t))/∂x)^T W^j.    (2.48)

Stop if W^j = W^{j−1}. Otherwise, set j = j + 1 and go to Step 1.

∆φ(x(t)) = φ(x(t)) − φ(x(t + T))    (2.49)

to write the IRL weight equation (2.47) as

(W^j)^T ∆φ(x(t)) = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ = ρ(t, t + T).    (2.50)

The problem with the IRL weight equation (2.47) is that it is a scalar equation for the weights W^j, which are generally given as a vector with L entries. Nevertheless, there are several simple mechanisms for solving for W^j using data x(t), u(t) measured online in real time. In fact, this equation is in the standard form of the system identification equation in adaptive control. As such, it is easily solved using recursive least squares (RLS) algorithms (Engel et al. 2004). In fact, ∆φ in (2.50) is called the regression vector in system identification. The procedures boil down to solving L equations of the form (2.47) for different times t. An alternative solution procedure is given in the next example.

Example 2.1 (Batch Least Squares Solution of IRL Bellman Weight Equation for LQR) Consider the linear time-invariant LQR case with a 2-dimensional state vector x(t) ∈ R². Then the value function is given by (2.13), or

V(x) = x^T(t) P x(t) = [x_1(t)  x_2(t)] [p_11 p_12; p_12 p_22] [x_1(t); x_2(t)],    (2.51)

where the 2 × 2 weight matrix P > 0 is constant. Since P is symmetric, there are 3 unknowns and one can write

V(x) = [p_11  p_12  p_22] [x_1²(t)  2x_1(t)x_2(t)  x_2²(t)]^T ≜ W^T φ(x).    (2.52)

26

2 Background on Integral and Inverse Reinforcement Learning for Feedback Control

It is seen that in the VFA (2.44), a suitable set of activations for the LQR are the polynomials in the vector x(t). Moreover, the weight vector W is constant and consists of the entries of the matrix P. Here, W is a vector of length 3. Now the activation function difference can be written as

∆φ(x(t)) = φ(x(t)) − φ(x(t + T)) = [x_1²(t)  2x_1(t)x_2(t)  x_2²(t)]^T − [x_1²(t + T)  2x_1(t + T)x_2(t + T)  x_2²(t + T)]^T,    (2.53)

which is the polynomial vector evaluated at x(t) minus the polynomial vector evaluated at x(t + T). This can be directly computed if one measures x(t), x(t + T) at each time step. Now take data from three time steps and use (2.47) to write

(W^j)^T ∆φ(x(t)) = (1/2) ∫_t^{t+T} (x^T Q x + (u^j)^T R u^j) dτ = ρ(t, t + T),
(W^j)^T ∆φ(x(t + T)) = (1/2) ∫_{t+T}^{t+2T} (x^T Q x + (u^j)^T R u^j) dτ = ρ(t + T, t + 2T),
(W^j)^T ∆φ(x(t + 2T)) = (1/2) ∫_{t+2T}^{t+3T} (x^T Q x + (u^j)^T R u^j) dτ = ρ(t + 2T, t + 3T).

Note that the weight vector W^j is the same in all three equations, and collect them into the single matrix equation

(W^j)^T [∆φ(x(t))  ∆φ(x(t + T))  ∆φ(x(t + 2T))] = [ρ(t, t + T)  ρ(t + T, t + 2T)  ρ(t + 2T, t + 3T)].

By defining

M(t, t + 2T) ≜ [∆φ(x(t))  ∆φ(x(t + T))  ∆φ(x(t + 2T))] and P(t, t + 2T) ≜ [ρ(t, t + T)  ρ(t + T, t + 2T)  ρ(t + 2T, t + 3T)],    (2.54)

one writes this as

(W^j)^T M(t, t + 2T) = P(t, t + 2T).    (2.55)

These are the least squares normal equations. Note that M(t, t + 2T) is a 3 × 3 matrix, so W^j can be found by a batch least squares solution. In fact, if M(t, t + 2T) has 3 independent columns, then it can be inverted to yield W^j. Suppose the weight vector W^j has L entries, with L the number of hidden-layer units, i.e., the number of entries in the activation vector φ(x). Then one requires at least L equations of the form (2.47) using data collected from L time steps, and the least squares normal equations become

(W^j)^T M(t, t + (L − 1)T) = P(t, t + (L − 1)T),    (2.56)


where M(t, t + (L − 1)T) is an L × L matrix. If this matrix is invertible, then one can solve the least squares normal equations for W^j. This happens if the outputs ∆φ(x(t)) of the hidden-layer units are persistently exciting.

Persistence of Excitation. The sequence {∆φ(x(t))} is persistently exciting (PE) if

Σ_{l=0}^{Λ−1} ∆φ(x(t + lT)) [∆φ(x(t + lT))]^T > 0    (2.57)

for some number of time steps Λ. Clearly, if {∆φ(x(t))} is PE for Λ, then the matrix M(t, t + (Λ − 1)T) M^T(t, t + (Λ − 1)T) is nonsingular, and the least squares normal equations (2.55) over Λ time steps can be solved for W^j using the batch least squares solution

(W^j)^T = P(t, t + (Λ − 1)T) M^T(t, t + (Λ − 1)T) [M(t, t + (Λ − 1)T) M^T(t, t + (Λ − 1)T)]^{−1}.    (2.58)

Obviously, it is required that Λ ≥ L.
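The following Python sketch illustrates Example 2.1 and the batch least squares solution (2.58) numerically. The system matrices, the fixed gain K, the reinforcement interval, and the number of data intervals are illustrative assumptions made only for this sketch; the running cost is integrated by augmenting the state, and np.linalg.lstsq plays the role of the batch least squares solution (2.58). The Lyapunov check at the end uses (A, B) only for verification.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import solve_continuous_lyapunov

# Illustrative 2-state LQR data (assumed for this sketch, not from the text)
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[1.0, 1.0]])               # fixed stabilizing policy u = -Kx at iteration j
T, n_intervals = 0.1, 6                  # reinforcement interval and data intervals (>= L = 3)

phi = lambda x: np.array([x[0]**2, 2.0*x[0]*x[1], x[1]**2])   # quadratic basis of (2.52)
r   = lambda x: 0.5*(x @ Q @ x + (K @ x) @ R @ (K @ x))       # integrand of rho, with the 1/2 of (2.32)

def closed_loop(t, z):
    x = z[:2]
    dx = A @ x + B @ (-K @ x)
    return np.concatenate([dx, [r(x)]])                       # augment state with the running cost

M_cols, rho = [], []
z = np.array([1.0, -1.0, 0.0])                                # initial state, zero accumulated cost
for _ in range(n_intervals):
    sol = solve_ivp(closed_loop, (0.0, T), z, rtol=1e-9, atol=1e-9)
    z_next = sol.y[:, -1]
    M_cols.append(phi(z[:2]) - phi(z_next[:2]))               # Delta-phi over this interval, (2.49)
    rho.append(z_next[2] - z[2])                              # integral reinforcement, (2.34)
    z = z_next

M = np.column_stack(M_cols)                                   # 3 x n_intervals regression matrix
W = np.linalg.lstsq(M.T, np.array(rho), rcond=None)[0]        # batch least squares, cf. (2.58)
P_hat = np.array([[W[0], W[1]], [W[1], W[2]]])

# Verification only: P for this policy solves A_cl^T P + P A_cl + (1/2)(Q + K^T R K) = 0
P_check = solve_continuous_lyapunov((A - B @ K).T, -0.5*(Q + K.T @ R @ K))
print(np.round(P_hat, 4), np.round(P_check, 4))               # the two agree
```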

2.2 Inverse Optimal Control for Continuous-Time Systems

In this section, we introduce the fundamental framework of inverse optimal control, which offers an alternative perspective to traditional optimal control methods. As discussed in Sect. 2.1.1, optimal control aims to design a controller that minimizes a given performance index. This often requires solving the steady-state Hamilton–Jacobi–Bellman (HJB) equation or, for linear systems, the algebraic Riccati equation, which can be intractable in some cases. Inverse optimal control has emerged as a promising approach that circumvents the complexity associated with solving the HJB equation. The key idea behind inverse optimal control is to parameterize a family of stabilizing controllers that minimize a derived cost functional, which offers flexibility in specifying the control law. As illustrated in Haddad and Chellaboina (2011), the performance integrand, which is the quadratic function to be integrated, explicitly depends on the continuous-time system dynamics, the Lyapunov function of the closed-loop system, and the stabilizing feedback control law. This coupling is introduced through the HJB equation. By seeking parameters in the Lyapunov function and the performance integrand, the proposed framework enables the characterization of a class of globally stabilizing controllers that satisfy closed-loop system response constraints. Research on inverse optimal control has advanced significantly in recent years, including works such as Ornelas et al. (2011), Sanchez and Ornelas (2017), Johnson et al. (2011), Molloy et al. (2018a, b, 2020), Ornelas et al. (2010), and Mombaur et al. (2010), where Molloy et al. (2018b, 2020) focus on online iterative inverse optimal control methods, while Mombaur et al. (2010) and Ornelas et al. (2010) address tracking and imitation problems.

Inverse optimal control lays a foundation for inverse reinforcement learning, which has been extensively studied in the context of imitation learning, apprenticeship learning, and learning-from-demonstration problems in the domain of Markov decision processes (Abbeel and Ng 2004; Neu and Szepesvari 2012; Ziebart et al. 2008). In inverse reinforcement learning, an expert, acting as a teacher, provides demonstrated behavior based on an underlying performance index that is minimized. The learner aims to imitate this behavior by reconstructing the expert's unknown underlying performance index and subsequently computing an optimal control policy. Inverse reinforcement learning is a powerful approach that facilitates adaptive solutions and shares inherent principles with inverse optimal control. In recent years, it has been extended to dynamical systems, as exemplified in Kamalapurkar (2018), Self et al. (2020, 2021), Choi et al. (2017), and Tsai et al. (2016).

2.2.1 Inverse Optimal Control for Linear Systems

Building upon the optimal control framework for linear systems discussed in Sect. 2.1.1, this section focuses on the reconstruction of the performance index for continuous-time linear systems using the optimal control framework. Alternatively, this can be interpreted as choosing a controller that ensures that the performance index is minimized and the time derivative of the Lyapunov function is negative along the trajectories of the closed-loop system. Moreover, this approach establishes sufficient conditions for the existence of asymptotically stabilizing solutions to the algebraic Riccati equation (ARE). As a result, a family of globally stabilizing controllers, parameterized by the minimized performance function, can be obtained.

Consider the linear dynamical system

ẋ = Ax + Bu,    (2.59)

where x ∈ R^n is the state, u ∈ R^m is the control input, and the state matrix A and control input matrix B have appropriate dimensions. Let the feedback control law be given by

u = −Kx.    (2.60)


We consider an infinite-horizon quadratic performance index in the form

J(x(t_0), u(·)) = ∫_{t_0}^∞ (x^T Q x + u^T R u) dτ,    (2.61)

where Q = Q^T ∈ R^{n×n} ≥ 0 and R = R^T ∈ R^{m×m} > 0 are penalty/cost weights on the state and control, respectively, to be determined. Note that x^T Q x + u^T R u is the performance integrand.

Inverse Optimal Control Problem. Find penalty weights (Q, R) for (2.61), given trajectories (x, u) generated by some admissible control u of the form (2.60), with respect to which the control u is optimal.

This problem arises in two distinct scenarios. The first is to determine the performance index by finding its penalty weights, given an admissible and optimal control law u. The other focuses on designing an admissible control law u by determining some performance index that satisfies certain conditions; in this case, the performance index should be optimally associated with the designed control law. This framework is not limited to stabilization problems but also extends to trajectory imitation or tracking control problems, which are areas where inverse reinforcement learning is applicable. In summary, when given target or demonstrated trajectories, the objective is to simultaneously design and determine both the controller and the performance index that optimally capture the desired trajectories. By doing so, the optimality of the control law and the fidelity of the imitation performance can be ensured. The next key result shows how to find such a control law and the penalty weights of the performance index using inverse optimal control.

Theorem 2.3 Assume there exists P > 0 satisfying

A^T P + P A + Q − P B R^{-1} B^T P = 0.    (2.62)

Then, the closed-loop system

ẋ = (A − BK) x    (2.63)

is globally asymptotically stable with the feedback control law

u = −Kx = −R^{-1} B^T P x,    (2.64)

and the performance index (2.61) is minimized in the sense of

J(x(t_0), −Kx(·)) = min_u J(x(t_0), u(·))    (2.65)

as

J(x(t_0), −Kx(·)) = x(t_0)^T P x(t_0).    (2.66)


Proof This result follows by specializing the inverse optimal control of linear systems in Haddad and Chellaboina (2011) to the standard linear quadratic performance index. The assumption implies that there exists a continuously differentiable function V given by

V(x) = x^T P x,    (2.67)

which satisfies

V(0) = 0,    (2.68a)
V(x) > 0, x ≠ 0,    (2.68b)
V(x) → ∞ as ||x|| → ∞.    (2.68c)

Note that combining (2.62) and (2.64) yields the HJB equation in the form of a Bellman equation,

H(x, K) = (Ax − BKx)^T P x + x^T P (Ax − BKx) + x^T Q x + (Kx)^T R K x = 0,    (2.69)

and the Lyapunov equation

(A − SP)^T P + P (A − SP) + R̂ = 0    (2.70)

with S = B R^{-1} B^T and R̂ = Q + P S P. Then, it follows from (2.64), (2.62), and (2.70) that

V'(x)[Ax − BKx] = V'(x)[Ax − B R^{-1} B^T P x] = −x^T Q x − x^T P B R^{-1} B^T P x = −x^T R̂ x < 0, x ≠ 0.    (2.71)

With conditions (2.68a)–(2.68c) and (2.71), the conclusions hold. This completes the proof. ◻

Remark 2.1 Any R > 0 together with Q > 0, or Q ≥ 0 with (√Q, A) observable, guarantees the existence of P > 0 satisfying (2.62). This is easy to infer from optimal control theory for the linear quadratic regulator problem.

Theorem 2.3 is the classical inverse optimal control result for linear systems with quadratic performance functions, which is a special case of the result for nonlinear systems. Sect. 2.2.2 presents the inverse optimal control of nonlinear systems, giving more general conditions and formulations for the solution. Theorem 2.3 shows explicitly that inverse optimal control is the inverse process of optimal control. Put simply, optimal control aims to find a control law that is optimal for a given performance index, while inverse optimal control aims to find a


performance index with respect to which the control law is optimal. Compared to optimal control, which solves for the control policy parameter P in the standard ARE (2.62) given cost weights Q and R, in inverse optimal control Q, R, and P in (2.62) are all to be determined. This provides extra degrees of freedom in the solution and thus makes the solving process easier by avoiding the standard ARE solution procedure. Therefore, inverse optimal control is usually applied to a stable system to find the performance index that optimally explains its control action. It is not difficult to see that, when solving the inverse optimal control problem, the Q, R, and corresponding P that meet the conditions above and are optimally associated with the same K are not unique. In fact, there are infinitely many such triples Q, R, and P. For instance, scaling (Q, R) by a common positive factor scales P by the same factor and yields the same K. This is the well-known ill-posedness property of inverse problems.
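The relations of Theorem 2.3 and this ill-posedness can be checked numerically. In the Python sketch below, the system (A, B) and the nominal weights used to generate an "observed" gain K are illustrative assumptions; given K and any choice of (P, R) with K = R^{-1}B^T P, the state penalty is read off by rearranging the ARE (2.62), and scaled pairs (cQ, cR) explain the same gain.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative data (assumed for this sketch): a gain K produced by some LQR design
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q0, R0 = np.diag([2.0, 1.0]), np.eye(1)
P0 = solve_continuous_are(A, B, Q0, R0)
K = np.linalg.solve(R0, B.T @ P0)                   # the "observed" optimal gain, u = -Kx

# Inverse direction: pick R > 0 and P with K = R^{-1} B^T P, then read Q off the ARE (2.62):
# Q = P B R^{-1} B^T P - A^T P - P A.
def recover_Q(P, R):
    return P @ B @ np.linalg.solve(R, B.T @ P) - A.T @ P - P @ A

for c in (1.0, 3.0):                                # scaled pairs (c*Q, c*R) explain the same K
    R, P = c * R0, c * P0
    Q = recover_Q(P, R)
    same_gain = np.allclose(np.linalg.solve(R, B.T @ P), K)
    print(c, same_gain, bool(np.all(np.linalg.eigvalsh(Q) >= -1e-9)))   # True, True for both
```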

2.2.2 Inverse Optimal Control for Nonlinear Systems

The inverse optimal control of linear systems is a special case of that of nonlinear systems. We now present the more general inverse optimal control formulations. Instead of an ARE, an HJB equation is used here. Specifically, consider a nonlinear affine system

ẋ = f(x) + g(x)u    (2.72)

with the performance index

J(x(t_0), u(·)) = ∫_{t_0}^∞ (Q(x) + u^T R(x) u) dτ,    (2.73)

where x ∈ R^n is the state, u ∈ R^m is the control input, f: R^n → R^n is Lipschitz and satisfies f(0) = 0, g: R^n → R^{n×m}, and Q(x) + u^T R(x) u is the performance integrand. Similar to the inverse optimal control problem of linear systems, we pose the following problem for nonlinear systems.

Inverse Optimal Control Problem. Find penalty weights (Q, R) for (2.73), given trajectories (x, u) generated by some admissible control u, with respect to which the control u is optimal.

This problem applies in two situations. One is to determine the performance index by finding its penalty weights, given some admissible control law u, with respect to which the control u is optimal. The other is to design an admissible control law u by determining a performance index that satisfies certain conditions, such that this performance index is optimally associated with the designed control law.


We now state a lemma on the conditions for optimal control of nonlinear systems, which is needed for the inverse optimal control results for nonlinear systems.

Lemma 2.2 (Haddad and Chellaboina 2011) Consider the nonlinear system

ẋ = F(x, u)    (2.74)

with the performance index

J(x(t_0), u(·)) = ∫_{t_0}^∞ L(x, u) dτ,    (2.75)

where x ∈ R^n is the state, u ∈ R^m is the control input, and F satisfies F(0, 0) = 0. Let the set Ω contain the admissible controls u(·). Let D ⊂ R^n be an open set with x ∈ D. Assume there exist a continuously differentiable function V and a control law ψ such that

V(0) = 0,    (2.76a)
V(x) > 0, x ∈ D, x ≠ 0,    (2.76b)
ψ(0) = 0,    (2.76c)
V'(x) F(x, ψ(x)) < 0, x ∈ D, x ≠ 0,    (2.76d)
H(x, ψ(x)) = 0, x ∈ D,    (2.76e)
H(x, u) ≥ 0, x ∈ D,    (2.76f)

where

H(x, u) = L(x, u) + V'(x) F(x, u).    (2.77)

Note that V'(x) denotes the derivative of V with respect to x. Then, with the control law u = ψ(x), the zero solution x(t) ≡ 0 of the closed-loop system

ẋ = F(x, ψ(x))    (2.78)

is locally asymptotically stable, and there exists a neighborhood of the origin D_0 ⊂ D such that

J(x(t_0), ψ(x(·))) = V(x(t_0)), x(t_0) ∈ D_0    (2.79)

and

J(x(t_0), ψ(x(·))) = min_{u ∈ S(x(t_0))} J(x(t_0), u(·)),    (2.80)




where S(x(t_0)) = {u(·): u(·) is admissible and, with initial condition x(t_0), x(t) → 0 as t → ∞}. If

V(x) → ∞ as ||x|| → ∞,    (2.81)

then the zero solution x(t) ≡ 0 of the closed-loop system (2.78) is globally asymptotically stable.

Proof It follows from (2.76d) that

V̇(x) = V'(x) F(x, ψ(x)) < 0, t ≥ t_0, x(t) ≠ 0.    (2.82)

Together with (2.76a) and (2.76b), we conclude that V(x) is a Lyapunov function for the closed-loop system (2.78). This proves local asymptotic stability of the zero solution x(t) ≡ 0 of (2.78) and, hence, x(t) → 0 as t → ∞ for all initial conditions x(t_0) ∈ D_0 for some neighborhood of the origin D_0 ⊂ D. With u = ψ(x) and (2.82), we have

−V̇(x) + V'(x) F(x, ψ(x)) = 0, t ≥ t_0.    (2.83)

Then, (2.76e) and (2.77) imply that

L(x, ψ(x)) = −V̇(x) + L(x, ψ(x)) + V'(x) F(x, ψ(x)) = −V̇(x).    (2.84)

Integrating over [t_0, t] gives

∫_{t_0}^{t} L(x, ψ(x)) dτ = −V(x(t)) + V(x(t_0)).    (2.85)

Letting t → ∞, we know V(x(t)) → 0 for all x(t_0) ∈ D_0, which yields J(x(t_0), ψ(x)) = V(x(t_0)). Finally, for D = R^n, global asymptotic stability is a direct consequence of the radial unboundedness condition (2.81). Now, let x(t_0) ∈ D_0 and u(t) ∈ S(x(t_0)); then we have

−V̇(x) + V'(x) F(x, u) = 0, t ≥ t_0.    (2.86)

With (2.77), we have

L(x, u) = −V̇(x) + L(x, u) + V'(x) F(x, u) = −V̇(x) + H(x, u).    (2.87)

Integrating this and combining it with (2.76f) gives


J(x(t_0), u(·)) = ∫_{t_0}^∞ [−V̇(x) + H(x, u)] dτ
               = −lim_{t→∞} V(x(t)) + V(x(t_0)) + ∫_{t_0}^∞ H(x, u) dτ
               = V(x(t_0)) + ∫_{t_0}^∞ H(x, u) dτ
               ≥ V(x(t_0)) = J(x(t_0), ψ(x(·))),    (2.88)

which yields (2.80). This completes the proof. ◻



Theorem 2.4 Consider the nonlinear affine system (2.72) with the performance index (2.73). Assume there exists a continuously differentiable function V satisfying

V(0) = 0,    (2.89a)
V(x) > 0, x ≠ 0,    (2.89b)
V'(x)[f(x) − (1/2) g(x) R^{-1}(x) g^T(x) V'^T(x)] < 0, x ≠ 0,    (2.89c)
V(x) → ∞ as ||x|| → ∞    (2.89d)

with some positive-definite R(x). Then the zero solution x(t) ≡ 0 of the closed-loop system

ẋ = f(x) + g(x)ψ(x)    (2.90)

is globally asymptotically stable with the feedback control law

u = ψ(x) = −(1/2) R^{-1}(x) g^T(x) V'^T(x),    (2.91)

and the performance index (2.73), with

Q(x) = ψ^T(x) R(x) ψ(x) − V'(x) f(x),    (2.92)

is minimized in the sense of

J(x(t_0), ψ(x(·))) = min_u J(x(t_0), u(·))    (2.93)

as

J(x(t_0), ψ(x(·))) = V(x(t_0)).    (2.94)

Proof This result is a direct consequence of Lemma 2.2 for nonlinear affine systems. Specifically, with the performance integrand in the form

L(x, u) = Q(x) + u^T R(x) u,    (2.95)

the Hamiltonian has the form

H(x, u) = Q(x) + u^T R(x) u + V'(x)[f(x) + g(x)u].    (2.96)

Setting ∂H(x, u)/∂u = 0 yields the control law (2.91). With (2.91), we observe that (2.89a)–(2.89c) imply (2.76a), (2.76b), (2.76d), and (2.81). Moreover, since V(x) is continuously differentiable with x = 0 a local minimum, it follows that V'(0) = 0 and hence ψ(0) = 0. This implies (2.76c). Then, with Q(x) in (2.92) and ψ(x) in (2.91), (2.76e) holds. Finally, since R(x) is positive-definite for all x ∈ R^n and

H(x, u) = H(x, u) − H(x, ψ(x)) = [u − ψ(x)]^T R(x) [u − ψ(x)],    (2.97)

(2.76f) holds. The result then follows directly from Lemma 2.2. This completes the proof. ◻

Combining (2.91) and (2.92) gives

L(x, ψ(x)) = Q(x) + ψ^T(x) R(x) ψ(x) = 2ψ^T(x) R(x) ψ(x) − V'(x) f(x)
           = −V'(x) g(x) ψ(x) − V'(x) f(x) = −V'(x)(f(x) + g(x)ψ(x)),    (2.98)

which, by (2.89c), is positive for x ≠ 0. This is the classical inverse optimal control procedure for nonlinear affine systems (Haddad and Chellaboina 2011). It shares a consistent formulation with optimal control, but it avoids solving the complex HJB equation for the optimal control under some given Q(x) and R(x), since Q(x) and R(x) are themselves to be determined. Moreover, the Q(x), R(x) that meet the conditions above and are optimally associated with the same ψ(x) are not unique; there can be infinitely many such pairs. Similar to the linear case, this is the well-known ill-posedness property of inverse problems.
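A scalar example makes the constructions of Theorem 2.4 concrete. In the sketch below, the choices f(x) = −x³, g(x) = 1, R(x) = 1, and the candidate V(x) = x² are assumptions made only for illustration; the code evaluates the control law (2.91) and the recovered penalty (2.92) and checks conditions (2.89c) and Q(x) > 0 numerically.

```python
import numpy as np

# Scalar illustration of Theorem 2.4 with assumed choices (not from the text)
f  = lambda x: -x**3          # drift dynamics
g  = lambda x: 1.0            # input dynamics
V  = lambda x: x**2           # candidate value / Lyapunov function
dV = lambda x: 2.0*x          # V'(x)
R  = lambda x: 1.0

psi = lambda x: -0.5/R(x)*g(x)*dV(x)                     # control law (2.91): psi(x) = -x
Qx  = lambda x: psi(x)*R(x)*psi(x) - dV(x)*f(x)          # penalty (2.92): Q(x) = x^2 + 2x^4

xs = np.linspace(-2.0, 2.0, 401)
xs = xs[xs != 0.0]
print(bool(np.all(Qx(xs) > 0)))                          # Q(x) > 0 away from the origin
print(bool(np.all(dV(xs)*(f(xs) + g(xs)*psi(xs)) < 0)))  # condition (2.89c) holds
```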

References

Abbeel P, Ng AY (2004) Apprenticeship learning via inverse reinforcement learning. In: The 21st international conference on machine learning, Banff, Canada, pp 1–8
Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
Choi S, Kim S, Kim JH (2017) Inverse reinforcement learning control for trajectory tracking of a multirotor UAV. Int J Control Autom Syst 15(4):1826–1834


Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Trans Signal Process 52(8):2275–2285
Haddad WM, Chellaboina V (2011) Nonlinear dynamical systems and control: a Lyapunov-based approach. Princeton University Press
Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw 3(5):551–560
Johnson M, Aghasadeghi N, Bretl T (2011) Finite-horizon inverse optimal control for discrete-time nonlinear systems. In: The 52nd IEEE conference on decision and control, Firenze, Italy, pp 2906–2913
Kalman RE (1964) When is a linear control system optimal?
Kamalapurkar R (2018) Linear inverse reinforcement learning in continuous time and space. In: American control conference, Milwaukee, USA, pp 1683–1688
Kleinman DL (1968) On an iterative technique for Riccati equation computations. IEEE Trans Autom Control 18(1):114–115
Lancaster P, Rodman L (1995) Algebraic Riccati equations. Clarendon Press
Laub A (1979) A Schur method for solving algebraic Riccati equations. IEEE Trans Autom Control 24(6):913–921
Lewis FL, Vrabie D, Syrmos V (2012) Optimal control, 3rd edn. Wiley, New Jersey
Molloy TL, Ford JJ, Perez T (2018a) Inverse optimal control for deterministic continuous-time nonlinear systems. Automatica 87(1):442–446
Molloy TL, Ford JJ, Perez T (2018b) Online inverse optimal control on finite horizons. In: The 57th IEEE conference on decision and control, Miami, USA, pp 1663–1668
Molloy TL, Ford JJ, Perez T (2020) Online inverse optimal control for control-constrained discrete-time systems on finite and infinite horizons. Automatica 120(10):1–8
Mombaur K, Truong A, Laumond JP (2010) From human to humanoid locomotion—an inverse optimal control approach. Auton Robots 28:369–383
Neu G, Szepesvari C (2012) Apprenticeship learning using inverse reinforcement learning and gradient methods. ArXiv:1206.5264, pp 295–302
Olver PJ (1993) Applications of Lie groups to differential equations. Springer
Ornelas F, Sanchez EN, Loukianov AG (2010) Discrete-time inverse optimal control for nonlinear systems tracking. In: The 49th IEEE conference on decision and control, Atlanta, USA, pp 4813–4818
Ornelas F, Sanchez EN, Loukianov AG (2011) Discrete-time nonlinear systems inverse optimal control: a control Lyapunov function approach. In: IEEE international conference on control applications, Denver, USA, pp 1431–1436
Sanchez EN, Ornelas F (2017) Discrete-time inverse optimal control for nonlinear systems. CRC Press, Boca Raton, USA
Sandberg IW (1998) Notes on uniform approximation of time-varying systems on finite time intervals. IEEE Trans Circuits Syst I: Fundam Theory Appl 45(8):863–865
Self R, Coleman K, He B, Kamalapurkar R (2021) Online observer-based inverse reinforcement learning. IEEE Control Syst Lett 5(6):1922–1927
Self R, Abudia M, Kamalapurkar R (2020) Online inverse reinforcement learning for systems with disturbances. In: American control conference, Denver, USA, pp 1118–1123
Stevens BL, Lewis FL, Johnson EN (2015) Aircraft control and simulation: dynamics, controls design, and autonomous systems. Wiley
Stone MH (1948) The generalized Weierstrass approximation theorem. Math Mag 21(5):237–254
Tsai D, Molloy TL, Perez T (2016) Inverse two-player zero-sum dynamic games. In: The 2016 Australian control conference, Newcastle, Australia, pp 192–196
Young WH (1948) The generalized Weierstrass approximation theorem. Math Mag 21(5):237–254
Ziebart BD, Maas AL, Bagnell JA, Dey AK (2008) Maximum entropy inverse reinforcement learning. In: The 23rd AAAI conference on artificial intelligence, Chicago, USA, pp 1433–1438

Part I

Integral Reinforcement Learning for Optimal Control Systems and Games

Chapter 3

Integral Reinforcement Learning for Optimal Regulation

3.1 Introduction

Reinforcement learning (RL) has emerged as a powerful tool for designing feedback controllers for continuous-time (CT) dynamical systems. This approach enables the development of adaptive controllers that can learn optimal control solutions in a forward-in-time manner, even in the absence of complete knowledge about the system dynamics. On-policy integral reinforcement learning (IRL) and off-policy IRL algorithms have been successfully devised for CT systems, facilitating the online learning of optimal control solutions in real time. In this chapter, we present a unified framework for addressing optimal regulation problems. We demonstrate how on-policy synchronous IRL with experience replay and off-policy IRL algorithms can be developed within this framework, utilizing either state measurements or input–output measurements. By leveraging these algorithms, the optimal control solutions can be learned and updated dynamically as new data become available. Note that from this chapter onward, we formulate the performance index in the form of a value function with some initial time instant t.

3.2 On-Policy Synchronous Integral Reinforcement Learning with Experience Replay for Nonlinear Constrained Systems

This section presents on-policy IRL algorithms to learn the solution to the Hamilton–Jacobi–Bellman equation for partially unknown constrained-input systems. Both offline and online synchronous IRL algorithms are designed. Experience replay is used to update the critic weights in the online method. The technique relies on an easy-to-check condition on the richness of the recorded data, which is sufficient to guarantee convergence to a near-optimal control law, in contrast with the conventional persistence of excitation condition. The stability is analyzed and a simulation example is provided.

3.2.1 Problem Formulation

Consider a nonlinear system

ẋ(t) = f(x(t)) + g(x(t)) u(t),    (3.1)

where x ∈ R^n is the system state vector, f(x) ∈ R^n is the drift dynamics of the system, g(x) ∈ R^{n×m} is the input dynamics of the system, and u(t) ∈ R^m is the control input. We denote by Ω_u = {u | u ∈ R^m, |u_i(t)| ≤ λ, i = 1, ..., m} the set of all inputs satisfying the input constraints, where λ is the saturation bound. It is assumed that f(x) + g(x)u is Lipschitz and the system (3.1) is stabilizable. The goal is to find an optimal constrained policy u* that drives the state of the system (3.1) to the origin by minimizing a performance index given as a function of the state and control variables. The performance index is defined as

V(x(t)) = ∫_t^∞ (Q(x(τ)) + U(u(τ))) dτ,    (3.2)

where Q(x) is a positive-definite, monotonically increasing function and U(u) is a positive-definite integrand function.

Assumption 3.1 The performance functional (3.2) satisfies zero-state observability.

The input constraints can be taken into account by considering the following generalized nonquadratic cost function U(u):

U(u) = 2 ∫_0^u (λ β^{-1}(v/λ))^T R dv,    (3.3)

where v ∈ R^m, β(·) = tanh(·), and R = diag(r_1, ..., r_m) > 0 is assumed to be diagonal for simplicity of analysis. Denote w(v) = (λ β^{-1}(v/λ))^T R = [w_1(v_1) ... w_m(v_m)]. Then, the integral in (3.3) is defined as

U(u) = 2 ∫_0^u w(v) dv = 2 Σ_{i=1}^{m} ∫_0^{u_i} w_i(v_i) dv_i.    (3.4)
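As a quick numerical sanity check of (3.4), the following Python sketch evaluates the nonquadratic cost for a single input channel by quadrature and compares it with one antiderivative of the integrand (easily verified by differentiation). The saturation bound λ and the weight r are illustrative values assumed only for this sketch.

```python
import numpy as np
from scipy.integrate import quad

# Illustrative values for a single channel (assumed for this sketch)
lam, r = 2.0, 1.5
w = lambda v: lam * np.arctanh(v / lam) * r          # integrand w_i(v_i) with beta = tanh

def U_quadrature(u):
    return 2.0 * quad(w, 0.0, u)[0]                  # direct evaluation of (3.4)

def U_closed_form(u):
    # an antiderivative of 2*lam*r*arctanh(v/lam), checked by differentiation
    return 2.0*lam*r*u*np.arctanh(u/lam) + lam**2 * r * np.log(1.0 - (u/lam)**2)

for u in (0.5, 1.0, 1.9):
    print(round(U_quadrature(u), 6), round(U_closed_form(u), 6))   # the two agree
```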


Considering (3.3) in (3.2) results in

V(x(t)) = ∫_t^∞ ( Q(x(τ)) + 2 ∫_0^u (λ β^{-1}(v/λ))^T R dv ) dτ.    (3.5)

By differentiating V along the system trajectories, the following Bellman equation is given:

Q(x(t)) + 2 ∫_0^u (λ β^{-1}(v/λ))^T R dv + ∇V^T (f(x) + g(x)u) = 0,    (3.6)

where ∇V(x) = ∂V(x)/∂x ∈ R^n. Let V*(x) be the optimal cost function, defined as

V*(x(t)) = min_{u(τ)∈π(Ω), t≤τ<∞} ∫_t^∞ (Q(x(τ)) + U(u(τ))) dτ.

Integral Reinforcement Learning
For any time t > T and time interval T > 0, the value function (3.5) satisfies

V(x_{t−T}) = ∫_{t−T}^{t} ( Q(x(τ)) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ + V(x_t),    (3.14)

where x_t and x_{t−T} are short notation for x(t) and x(t − T), and x(τ) is the solution of (3.1) for initial condition x(t − T) and control input {u(x(τ)), τ ≥ t − T}. In Vrabie et al. (2009), it is shown that (3.14) and (3.5) are equivalent and have the same solution for the value function. Therefore, (3.14) can be viewed as a Bellman equation for CT systems. Note that the IRL form of the Bellman equation does not


involve the system dynamics. Using (3.14) instead of (3.5) in Algorithm 3.1, the following PI algorithm is obtained.

Algorithm 3.2 IRL policy iteration
1. Policy evaluation: Given a control input u^i(x), find V^i(x) using the Bellman equation

V^i(x_{t−T}) = ∫_{t−T}^{t} ( Q(x) + 2 ∫_0^{u^i} (λ tanh^{-1}(v/λ))^T R dv ) dτ + V^i(x_t).    (3.15)

2. Policy improvement: Update the control policy using

u^{i+1}(x) = −λ tanh( (1/(2λ)) R^{-1} g^T(x) ∇V^i(x) ).    (3.16)

3. Convergence test: Terminate the algorithm if ||V^i − V^{i−1}|| ≤ ε_1, where the threshold ε_1 > 0. Otherwise, set i = i + 1 and go to Step 1.

The IRL PI Algorithm 3.2 only needs to have knowledge of the input dynamics, i.e., the function .g(x), which is required for the policy improvement in Eq. (3.16); however no knowledge on the drift dynamics, described by . f (x), is required. The online implementation of this PI algorithm will be introduced later.
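The tanh form of the policy improvement (3.16) is what enforces the input constraint: the update can never produce a control larger than the saturation bound. The short Python sketch below evaluates (3.16) for assumed choices of g(x), R, λ, and a value-function gradient, which are placeholders for illustration only.

```python
import numpy as np

# Sketch of the constrained policy improvement (3.16); g, R, lam, and grad_V are assumed
lam = 1.0
R_inv = np.array([[1.0]])
g = lambda x: np.array([[0.0], [1.0]])                             # assumed input dynamics, n = 2, m = 1
grad_V = lambda x: np.array([2.0*x[0] + x[1], x[0] + 4.0*x[1]])    # assumed gradient of V^i

def improved_policy(x):
    # u^{i+1}(x) = -lam * tanh( (1/(2*lam)) R^{-1} g(x)^T grad V^i(x) )
    return -lam * np.tanh((1.0/(2.0*lam)) * R_inv @ g(x).T @ grad_V(x))

for x in ([0.1, -0.2], [5.0, 5.0], [-50.0, 10.0]):
    u = improved_policy(np.array(x))
    print(np.round(u, 4), bool(np.all(np.abs(u) <= lam)))          # saturates smoothly, never exceeds lam
```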

3.2.3 Value Function Approximation

In this subsection, we discuss value function approximation (VFA) (Werbos 1992) to solve for the cost function V(x) in the policy evaluation (3.15). Assuming the value function is a smooth function, then according to the high-order Weierstrass approximation theorem (Finlayson 1990), there exists a single-layer neural network (NN) such that the solution V(x) and its gradient can be uniformly approximated as

V(x) = W_1^T φ(x) + ε_v(x),    (3.17a)
∇V(x) = ∇φ(x)^T W_1 + ∇ε_v(x),    (3.17b)

where φ(x) ∈ R^l is a suitable basis function vector, ε_v(x) is the approximation error, W_1 ∈ R^l is a constant parameter vector, and l is the number of neurons.

Assumption 3.2 The following standard assumptions are made for the NNs in this section.
(a) The NN reconstruction error and its gradient are bounded over the compact set Ω, i.e., ||ε(x)|| ≤ b_ε and ||∇ε(x)|| ≤ b_{εx}.
(b) The NN activation functions and their gradients are bounded, i.e., ||φ(x)|| ≤ b_σ and ||∇φ(x)|| ≤ b_{σx}.


Before presenting the online implementation of Algorithm 3.2, since our objective in this section is to find the solution to the HJB equation corresponding to a constrained optimal control problem, it is necessary to examine the effect of the reconstruction error on the HJB equation. Assuming that the optimal value function is approximated by (3.17a) and using its gradient (3.17b) in the Bellman equation (3.14), we have

∫_{t−T}^{t} ( Q(x(τ)) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ + W_1^T ∆φ(x(t)) = ε_B(t),    (3.18)

where ε_B(t) is the Bellman equation error and

∆φ(x(t)) = φ(x(t)) − φ(x(t − T))    (3.19)

acts as a regression vector. Note that

W_1^T ∆φ(t) = W_1^T ∆φ(x(t)) = ∫_{t−T}^{t} W_1^T ∇φ(x) ẋ dτ = ∫_{t−T}^{t} W_1^T ∇φ(x) (f + gu) dτ.    (3.20)

On the other hand, using (3.17b) in (3.9), the optimal policy is obtained as

u* = −λ tanh( (1/(2λ)) R^{-1} g^T (∇φ^T W_1 + ∇ε_v) ).    (3.21)

Using (3.19)–(3.21) in (3.18), we obtain the following HJB equation:

∫_{t−T}^{t} ( Q + W_1^T ∇φ f + λ² R ln(1 − tanh²(D)) + ε_HJB ) dτ = 0,    (3.22)

where D = (1/(2λ)) R^{-1} g^T ∇φ^T W_1 and ε_HJB is the residual error due to the function reconstruction error. In Abu-Khalaf and Lewis (2005), the authors show that as the number of hidden-layer neurons l increases, the error of the approximate HJB solution converges to zero. Hence, for each constant ε_h, we can construct a NN so that sup_{∀x} ||ε_HJB|| ≤ ε_h. Note that in (3.22) and in the sequel, the argument x is dropped for ease of exposition.


3.2.4 Synchronous Online Integral Reinforcement Learning for Nonlinear Constrained Systems

We now present an online IRL algorithm based on the PI algorithm. The learning structure employs value function approximation (Werbos 1992) using two neural networks: the actor and critic networks. These networks approximate the Bellman equation and its corresponding policy. The offline PI Algorithm 3.2 serves as the foundation for the structure of this online PI algorithm. In contrast to the sequential updates of the critic and actor networks in Algorithms 3.1 and 3.2, the synchronous online PI algorithm updates both networks simultaneously in real time. The learning process is implemented through differential equations for tuning the weights of the neural networks. We refer to this approach as synchronous online PI, which is the continuous-time version of the generalized policy iteration introduced in Sutton and Barto (2018). In generalized PI, the value of a given policy is not fully evaluated at each step; instead, the current estimated value is updated incrementally toward the target value. To reduce the number of interactions with the environment, we apply the technique of experience replay (Lin 1992) for updating the critic network. This involves using recorded past experiences in conjunction with current data to adapt the weights of the critic network concurrently.

3.2.4.1 Critic NN Using Experience Replay

This subsection presents the tuning and convergence of the critic NN weights for a fixed control policy, in effect designing an observer for the unknown value function for use in feedback. It is shown how experience data can be recorded and reused to achieve better convergence of the critic NN weights. Consider a fixed control policy u(x) and assume that its corresponding value function is approximated by (3.17a). Then, using the gradient of the value function approximation (3.17b), the Bellman equation (3.14) becomes

∫_{t−T}^{t} ( Q(x) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ + W_1^T ∆φ(x(t)) = ε_B(t),    (3.23)

where the residual error due to the function reconstruction error is

ε_B = −∫_{t−T}^{t} ∇ε^T (f + gu) dτ.    (3.24)

Under Assumption 3.2, this residual error is bounded on the compact set .Ω, i.e., .sup∀x∈Ω ∥ε B ∥ ≤ εmax .


However, the ideal critic NN weights that provide the best approximate solution of (3.23) are unknown and must be estimated in real time. Hence, the output of the critic NN and the approximate Bellman equation can be written as

V̂(x) = Ŵ_1^T φ(x)    (3.25)

and

∫_{t−T}^{t} ( Q(x) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ + Ŵ_1^T φ(x(t)) − Ŵ_1^T φ(x(t − T)) = e(t),    (3.26)

where the weights Ŵ_1 are the current estimated values of W_1. Equation (3.26) can be written as

e(t) = Ŵ_1^T(t) ∆φ(t) + p(t),    (3.27)

where the integral reinforcement signal

p(t) = ∫_{t−T}^{t} ( Q(x) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ    (3.28)

can be considered as the continuous-time counterpart of the reward signal. Note that the Bellman error e(t) in (3.26) and (3.27) is the continuous-time counterpart of the temporal difference (TD) (Sutton and Barto 2018). The problem of finding the value function is now converted to adjusting the parameters of the critic NN such that the TD error is minimized. To bring the TD error to its minimum value, we consider the following objective function defined from the instantaneous Bellman equation error:

E = (1/2) e^T(t) e(t).    (3.29)

From (3.27) and using the chain rule, the gradient descent algorithm for E is given by

Ŵ̇_1 = −α_1 ∂E/∂Ŵ_1 = −α_1 ( ∆φ(t) / (1 + ∆φ(t)^T ∆φ(t))² ) e(t),    (3.30)

where α_1 > 0 is the learning rate and the term (1 + ∆φ(t)^T ∆φ(t))² is used for normalization. In the standard update rule (3.30), which is used in normal TD learning, every single observation obtained in the latest transition is used to update the critic NN weights, contributes a small change to them, and then becomes unavailable for further use. This kind of learning does not exploit all the information contained in the data and requires a large number of environment steps to obtain a suitable policy. In order to use the data more effectively, the experience replay technique (Lin 1992; Wawrzynski 2009; Adam et al. 2012; Kalyanakrishnan and Stone 2007; Dung et al. 2008) can be employed for updating the critic NN weights. In the following, a real-time learning algorithm using the experience replay technique is applied to update the critic NN weights for continuous-time systems with input constraints, with proof of convergence. The proposed experience replay-based update rule for the critic NN weights stores recent transition samples and repeatedly presents them to the gradient-based update rule. It can be interpreted as a gradient descent algorithm that not only tries to minimize the instantaneous Bellman error (similar to the update rule (3.30)), but also minimizes the Bellman equation error for the stored transition samples evaluated with the current critic NN weights. These samples are stored in a history stack. To collect a history stack, let

∆φ_j = ∆φ(t_j) = φ(x(t_j)) − φ(x(t_j − T))    (3.31)

and

p_j = p(t_j) = ∫_{t_j−T}^{t_j} ( Q(x) + 2 ∫_0^u (λ tanh^{-1}(v/λ))^T R dv ) dτ    (3.32)

denote ∆φ(t) and p(t) evaluated at the recorded times t_j, j = 1, ..., l, which are stored in the history stack. Based on the stored data, define

e_j = Ŵ_1^T(t) ∆φ_j + p_j    (3.33)

as the Bellman equation error at recorded time t_j using the current critic NN weights. Then, using (3.18), (3.27), and (3.33), the Bellman equation errors for the current time and the recorded times become

e_j = W̃_1^T(t) ∆φ_j + ε_B(t_j),    (3.34a)
e(t) = W̃_1^T(t) ∆φ(t) + ε_B(t),    (3.34b)

where W̃_1 = W_1 − Ŵ_1 is the critic weight estimation error and ε_B(t_j) is the reconstruction error obtained from (3.18) at recorded time t_j. The proposed experience replay algorithm for the critic NN is now given as


Ŵ̇_1(t) = −α_1 ( ∆φ(t) / (1 + ∆φ(t)^T ∆φ(t))² ) ( p(t) + ∆φ(t)^T Ŵ_1(t) )
         − α_1 Σ_{j=1}^{l} ( ∆φ_j / (1 + ∆φ_j^T ∆φ_j)² ) ( p_j + ∆φ_j^T Ŵ_1(t) ).    (3.35)
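To illustrate how (3.35) behaves, the following is a minimal Euler-discretized Python sketch of the tuning law. The stored pairs (∆φ_j, p_j) and the current measurements are synthetic placeholders generated to be consistent with a hypothetical weight vector; in practice they come from (3.31)–(3.32) along the system trajectory. The learning rate, step size, and stack size are assumptions for the example.

```python
import numpy as np

# Euler-discretized sketch of the experience-replay critic update (3.35) on synthetic data
rng = np.random.default_rng(0)
l, n_basis = 10, 3                        # history-stack size and number of critic weights
alpha1, dt = 5.0, 1e-3                    # learning rate and integration step (assumed)

W_true = np.array([1.0, 0.4, 2.0])        # hypothetical "ideal" weights, used only to build data
stack_dphi = rng.normal(size=(l, n_basis))
stack_p = -stack_dphi @ W_true            # consistent with W^T dphi_j + p_j = 0 (no recon. error)

def term(dphi, p, W):                     # one normalized-gradient term of (3.35)
    return dphi / (1.0 + dphi @ dphi) ** 2 * (p + dphi @ W)

W_hat = np.zeros(n_basis)
for step in range(20000):
    dphi_t = rng.normal(size=n_basis)     # stand-in for the current regression vector
    p_t = -dphi_t @ W_true
    W_dot = -alpha1 * (term(dphi_t, p_t, W_hat)
                       + sum(term(d, p, W_hat) for d, p in zip(stack_dphi, stack_p)))
    W_hat = W_hat + dt * W_dot            # Euler step of the continuous-time tuning law
print(np.round(W_hat, 3))                 # approaches W_true when the stored data are rich enough
```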

Remark 3.1 Note that in this experience replay tuning law, the last term depends on the history stack of previous activation function differences. It is seen in the simulation example in Sect. 3.2.5 that the history-stack term results in faster convergence. It also makes the choice of probing noise easier.

In the following, it is shown that, using the update law (3.35), the critic NN estimation error converges to zero exponentially fast if the following condition is satisfied.

Condition 3.1 Let Z = [∆φ̄_1, ..., ∆φ̄_l] be the history stack. Then Z contains as many linearly independent elements ∆φ̄ ∈ R^m as the dimension of the basis of the uncertainty, that is, rank(Z) = m.

Theorem 3.1 Let the online critic experience replay tuning law be given by the weight update law (3.35). If the recorded data points satisfy Condition 3.1, then
(a) for ε_B(t) = 0, ∀t (no reconstruction error), Ŵ_1 converges exponentially to the unknown weights W_1;
(b) for bounded ε_B(t), i.e., sup_{∀x} ||ε_B(t)|| ≤ ε_max, ∀t, W̃_1 converges exponentially to the residual set R_s = {W̃_1 | ||W̃_1|| ≤ c ε_max}, where c > 0 is a constant.

Proof Note that, using (3.27) and (3.33)–(3.35) and defining ∆φ̄ = ∆φ/(1 + ∆φ^T ∆φ), ∆φ̄_j = ∆φ_j/(1 + ∆φ_j^T ∆φ_j), m = 1 + ∆φ^T ∆φ, and m_j = 1 + ∆φ_j^T ∆φ_j, we can obtain

W̃̇_1(t) = −Ŵ̇_1(t) = −α_1 ( ∆φ̄(t) ∆φ̄^T(t) + Σ_{j=1}^{l} ∆φ̄_j ∆φ̄_j^T ) W̃_1(t) + α_1 ε_GB,    (3.36)

where

ε_GB = (∆φ̄(t)/m) ε_B(t) + Σ_{j=1}^{l} (∆φ̄_j/m_j) ε_B(t_j).

(a) Consider a Lyapunov function

V = (1/2) W̃_1^T(t) α_1^{-1} W̃_1(t).    (3.37)

Differentiating (3.37) along the trajectories of (3.36) and considering ε_B(t) = 0, we have

V̇ = −W̃_1^T(t) ( ∆φ̄(t) ∆φ̄^T(t) + Σ_{j=1}^{l} ∆φ̄_j ∆φ̄_j^T ) W̃_1(t) ≤ −W̃_1^T(t) ( Σ_{j=1}^{l} ∆φ̄_j ∆φ̄_j^T ) W̃_1(t).    (3.38)

If Condition 3.1 is satisfied, then Σ_{j=1}^{l} ∆φ̄_j ∆φ̄_j^T > 0 and hence V̇ < 0. This concludes

that W̃_1(t) converges to zero exponentially fast.
(b) Viewing (3.36) as a linear time-varying system, the solution W̃_1(t) is given by

W̃_1(t) = ϕ(t, t_0) W̃_1(0) + α_1 ∫_{t_0}^{t} ϕ(τ, t_0) ε_GB dτ    (3.39)

with the state transition matrix defined by

∂ϕ(t, t_0)/∂t = −α_1 ( ∆φ̄(t) ∆φ̄^T(t) + Σ_{j=1}^{l} ∆φ̄_j ∆φ̄_j^T ) ϕ(t, t_0),  ϕ(t_0, t_0) = I.    (3.40)

From the proof of part (a), it can be concluded that ϕ(t, t_0) is exponentially stable provided that Condition 3.1 is satisfied. Therefore, if Condition 3.1 is satisfied, the state transition matrix of the homogeneous part of (3.39) satisfies

||ϕ(t, t_0)|| ≤ η_1 e^{−η_2 (t − t_0)}    (3.41)

for all t, t_0 > 0 and some η_1, η_2 > 0. Using (3.39) and (3.41), we obtain

||W̃_1|| ≤ η_0 e^{−η_2 t} + α_1 ∫_0^t e^{−η_2 (t − τ)} ||ε_GB|| dτ,    (3.42)

where η_0 = η_1 ||W̃_1(0)|| e^{η_2 t_0}. Since sup_{∀x} ||ε_B(t)|| ≤ ε_max and ||∆φ̄(t)/m|| < 1, (3.42) can be written as

||W̃_1|| ≤ η_0 e^{−η_2 t} + (α_1 (l + 1)/η_2) ε_max,    (3.43)

where l is the number of samples stored in the history stack. The first term converges to zero exponentially fast, and this completes the proof of (b). ◻


Remark 3.2 The above proof shows that the only condition required for exponential convergence of the critic NN weight error is Condition 3.1. This condition is related to the persistence of excitation of ∆φ̄(t). However, Condition 3.1 can easily be checked online.

3.2.4.2 Actor NN and Synchronous Policy Iteration

This section presents the main algorithm. To solve the optimal control problem adaptively, an online PI algorithm is given which involves simultaneous and synchronous tuning of the actor and critic NNs. First, the actor NN structure is developed. In the policy improvement step (3.16) of Algorithm 3.2, the actor finds an improved control policy according to the current estimated value function. Assume that Ŵ_1 is the current estimate of the optimal critic NN weights. Then, according to (3.16), one can update the control policy as

u_1 = −λ tanh( (1/(2λ)) R^{-1} g^T ∇φ^T Ŵ_1 ).    (3.44)

However, this policy improvement does not guarantee the stability of the overall system. Therefore, to assure stability in a Lyapunov sense (as will be discussed later), the following policy update law is used:

û_1 = −λ tanh( (1/(2λ)) R^{-1} g^T ∇φ^T Ŵ_2 ),    (3.45)

where Ŵ_2 are the weights of an actor NN, which provide the current estimated values of the unknown optimal critic weights W_1. Define the actor NN estimation error as

W̃_2 = W_1 − Ŵ_2.    (3.46)

Assumption 3.3 (Vamvoudakis and Lewis 2010) The following assumptions are made on the system dynamics:
(a) f(x) is Lipschitz and f(0) = 0, so that ||f(x)|| ≤ b_f ||x||.
(b) g(x) is bounded by a constant, i.e., ||g(x)|| ≤ b_g.

We now present the main theorem, which provides the tuning laws for the actor and critic NNs that assure convergence of the proposed PI algorithm to a near-optimal control law while guaranteeing stability. Define


D̂ = (1/(2λ)) R^{-1} g^T ∇φ^T Ŵ_2,
Û = Ŵ_2^T ∇φ g λ tanh(D̂) + λ² R ln(1 − tanh²(D̂)),
M_2 = ∇φ g λ [sgn(D̂) − tanh(D̂)].

Theorem 3.2 (Stability of NNs and the System) Given the dynamical system (3.1), let the tuning for the critic NN be provided by using experience replay ˙ˆ .W 1 (t) = −α1 ( − α1



∆φ (t) 1 + ∆φ(t)T ∆φ (t)

l ∑ j=1

)2

∆φ j

T

(

) ) T ˆ ˆ Q + U dτ + ∆φ(t) W1 (t)

t−T



( )2 1 + ∆φ Tj ∆φ j

tj t j −T

(

)

)

Q + Uˆ dτ + ∆φ Tj Wˆ 1 (t) . (3.47)

Let Condition 3.1 be satisfied. Let the actor NN be tuned as ˙

ˆ2 .W

] [ ʃ T ( ) ∆φ¯ T ˆ M2 T Wˆ 2 dτ W1 + a M 2 (t) = −α2 Y1 Wˆ 2 + ∇φ g λ tanh Dˆ + M2 m t−T

(3.48) ( ) where .∆φ¯ = ∆φ/ 1 + ∆φ T ∆φ , .m = 1 + ∆φ T ∆φ , and .Y1 is a design parameter. Let Assumptions 3.1–3.3 hold. Let the control law be given by (3.45). Then the closedloop system states, the critic NN error, and the actor NN error are uniformly ultimate bounded (UUB) for the sufficiently large number of NN neurons provided that Y > max 0.5 (1 + a) ∥M2 ∥2 ,

. 1

∀x

(3.49)

and / .

T
0, ∀x, we have .

( ) ( ) Wˆ 2T ∇φ g λ tanh Dˆ = 2λ R Dˆ T tanh Dˆ > 0.

(3.60)

Using (3.58)–(3.60), and the fact that .ε H J B is bounded, i.e., .sup∀x ∥ε H J B ∥ ≤ εh (Abu-Khalaf and Lewis 2005), (3.57) becomes ʃ

T

V˙ (x (τ )) dτ
0 there exists .q such that .x T q x < Q (x) , ∀x ∈ Ω, (3.61) becomes

.

ʃ T .

t−T

V˙ (x (τ )) dτ 0, the control input weight matrix is . R = R T ∈ R p× p > 0, and .γ ≥ 0 is the discount factor. It is assumed that √ .( QC, A) is observable. Consider a fixed admissible state-feedback control policy as u = K x.

.

(3.68)

The value function for a control policy in the form of (3.68) can be written as the quadratic form ʃ .

V (x(t)) =



t T

( ) e−γ (τ −t) x T C T QC + K T R K xdτ

= x (t)P x(t),

(3.69)

3.3 Off-Policy Integral Reinforcement Learning for Linear Quadratic …

57

and the optimal control input is given by u = K ∗x

(3.70)

K ∗ = −R −1 B T P,

(3.71)

.

with .

where . P is the solution to the discounted ARE .

AT P + P A − γ P + C T QC − P B R −1 B T P = 0.

(3.72)

The ARE (3.72) is first solved for . P, and then the optimal gain is obtained by substituting the ARE solution into (3.71). Upper Bound for Discount Factor to Assure Stability Now, an upper bound is found for the discount factor in the performance function (3.67) to assure the stability of the optimal control solution found by solving the ARE (3.72). Note that Modares and Lewis (2014) showed that the control input .u = K ∗ x with ∗ γt . K given in (3.71) makes .e x(t) converge to zero asymptotically. This confirms that, as shown in the following example, the system state may diverge if the discount factor is not chosen appropriately. Example 3.1 Consider a scalar dynamical system .

x˙ = x + u,

(3.73)

.

y = x.

(3.74)

Assume that in the ARE (3.72), we have . Q = R = 1. For this linear system with the quadratic performance function (3.67), the value function is .V (x) = px 2 and, therefore, the optimal solution is u = − px,

(3.75)

(2 − γ ) p − p 2 + 1 = 0.

(3.76)

.

where . p is the solution to the ARE .

Solving this equation gives the optimal solution as ( ) √ u = − (1 − 0.5γ ) + (1 − 0.5γ )2 + 1 x.

.

(3.77)

However, this solution does not make the system stable for all values of the discount factor .γ . In fact, if .γ > 2, then the system is unstable. In this example,

58

3 Integral Reinforcement Learning for Optimal Regulation

γ ∗ = 2 is an upper bound for .γ to assure stability. In the next Theorem 3.3, it is shown how to find an upper bound .γ ∗ to assure stability.

.

Theorem 3.3 Let the system (3.66a)–(3.66b) be stabilizable, and one of the following assumptions be satisfied: (1) .C T QC is positive-definite. √ (2) .C T QC is positive semi-definite and .( QC, A) is observable. Then, the solution to the ARE (3.72) is positive-definite and one has Re(λ) < 0.5γ ,

.

(3.78)

where .λ is an eigenvalue of the closed-loop system . Ac with .

Ac = A − B R −1 B T P.

(3.79)

(3) Moreover, the closed-loop system is asymptotically stable if the following condition is satisfied along with one of Condition 1 or 2: ∥( )T ( 1 )T ∥ ∥ ∥ − 21 γ ≤ γ∗ = ∥ Q2C ∥ B R ∥ ∥.

.

(3.80)

Proof It is first shown that . P > 0 if Condition 1 or 2 is satisfied, to this end, multiplying the left- and right-hand sides of the ARE (3.72) by .x T and .x, respectively, one has 2x T AT P x − γ x T P x + x T C T QC x − x T P B R −1 B T P x = 0.

.

(3.81)

Therefore, if . P x = 0 then .C T QC x = 0. That is, the null space of . P is a subspace of the null space of .C T QC. If Condition 1 is satisfied, then the null space of .C T QC and consequently the null √ space of . P is empty which concludes . P > 0. If Condition 1 is not satisfied, but .( QC, A) is observable, we rewrite the ARE (3.72) as .

A¯ Tc P + P A¯ c + C T QC + P B R −1 B T P = 0,

(3.82)

where . A¯ c = Ac − 0.5γ I , and . Ac is defined in (3.79). Multiplying the left- and right¯T ¯ hand sides of (3.82) by .e Ac t and .e Ac t , respectively, and doing some manipulation, one has .

∥ ∥2 d ( T A¯ Tc t A¯ c t ) 1 ¯ ¯T ¯ ∥ ∥ x e Pe x = − ∥B R − 2 Pe Ac t x ∥ − x T e Ac t C T QCe Ac t x. dt

(3.83)

3.3 Off-Policy Integral Reinforcement Learning for Linear Quadratic …

59

Integrating both sides from $0$ to $t$ gives

$x^T e^{\bar{A}_c^T t} P e^{\bar{A}_c t} x - x^T P x = -\int_0^t \big\| R^{-\frac{1}{2}} B^T P e^{\bar{A}_c \tau} x \big\|^2 d\tau - \int_0^t x^T e^{\bar{A}_c^T \tau} C^T Q C\, e^{\bar{A}_c \tau} x\, d\tau, \qquad (3.84)$

which implies that $0 \le x^T e^{\bar{A}_c^T t} P e^{\bar{A}_c t} x \le x^T P x$. Therefore, if $P x = 0$ for a non-zero vector $x$, one has $P e^{\bar{A}_c t} x = 0$. Since the null space of $P$ is a subspace of the null space of $C^T Q C$, if $P x = 0$ for a non-zero vector $x$, then $C^T Q C\, e^{\bar{A}_c t} x = 0$. However, this contradicts the fact that $(\sqrt{Q}\,C, A)$ is observable. Therefore, if Condition 2 is satisfied, one has $P > 0$.

To show the stability of the closed-loop system, assume that $\lambda$ is an eigenvalue of $A_c$. That is,

$A_c x = \lambda x, \qquad (3.85)$

where $x$ is the eigenvector corresponding to the eigenvalue $\lambda$. Multiplying the left- and right-hand sides of (3.82) by $x^T$ and $x$, respectively, and using (3.85) gives

$2\big( \mathrm{Re}(\lambda) - 0.5\gamma \big)\, x^T P x = -\big\| R^{-\frac{1}{2}} B^T P x \big\|^2 - x^T C^T Q C x. \qquad (3.86)$

Since $P > 0$, if Condition 1 is satisfied, one has $\mathrm{Re}(\lambda) < 0.5\gamma$. If $C^T Q C \ge 0$, then (3.86) yields $\mathrm{Re}(\lambda) \le 0.5\gamma$. However, if $\mathrm{Re}(\lambda) = 0.5\gamma$, then $B^T P x = 0$ and $C^T Q C x = 0$. On the other hand, $C^T Q C\, e^{\bar{A}_c t} x = e^{j\omega t} C^T Q C x = 0$ because, in this case, $e^{\bar{A}_c t} x = e^{j\omega t} x$. This contradicts the fact that $(\sqrt{Q}\,C, A)$ is observable, and thus $\mathrm{Re}(\lambda) < 0.5\gamma$ whenever Condition 1 or 2 is satisfied, even if Condition 3 is not. Moreover, since $P > 0$, canceling $x^T$ and $x$ from both sides of (3.86), multiplying the left- and right-hand sides by $P^{-\frac{1}{2}}$, and using Young's inequality $a^2 + b^2 \ge 2ab$ gives

$\mathrm{Re}(\lambda) \le -\big\| R^{-\frac{1}{2}} B^T P^{\frac{1}{2}} \big\| \big\| \sqrt{Q}\, C\, P^{-\frac{1}{2}} \big\| + 0.5\gamma. \qquad (3.87)$

Using (3.87) and the fact that $\|A\|\|B\| \ge \|AB\|$, one can conclude that the closed-loop system is stable if Condition 3 in (3.80) is satisfied. This completes the proof.

Remark 3.5 Note that for the system in Example 3.1, Condition 3 in (3.80) gives the upper bound $\gamma^* = 2$ to assure stability, which equals the actual bound obtained in Example 3.1. The upper bound in (3.80) is, however, a conservative bound; it indicates that stability of the closed-loop system can be assured by choosing a large $Q$ and/or a small discount factor.
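The scalar case in Example 3.1 is easy to check numerically. The following is a minimal sketch (the helper name `closed_loop_pole` and the sampled values of $\gamma$ are our own illustrative choices, not from the book): it solves the scalar discounted ARE (3.76) in closed form and reports the closed-loop pole $1 - p$, which crosses zero exactly at $\gamma = 2$.

```python
# Sketch: verify the discount-factor bound of Example 3.1 numerically.
import numpy as np

def closed_loop_pole(gamma: float) -> float:
    # p solves (2 - gamma) p - p^2 + 1 = 0 (positive root), cf. (3.76)-(3.77)
    p = (1.0 - 0.5 * gamma) + np.sqrt((1.0 - 0.5 * gamma) ** 2 + 1.0)
    return 1.0 - p  # eigenvalue of A_c = A - B R^{-1} B^T P for A = B = R = 1

for gamma in (0.5, 1.0, 1.9, 2.0, 2.1, 3.0):
    pole = closed_loop_pole(gamma)
    status = "stable" if pole < 0 else "not asymptotically stable"
    print(f"gamma = {gamma:4.1f}   closed-loop pole = {pole:+.4f}   {status}")
```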


3.3.2 State-Feedback Off-Policy RL with Input-State Data

Now a state-feedback model-free off-policy RL algorithm is given to learn the solution to the discounted optimal control problem. This algorithm does not require any knowledge of the system dynamics, but it requires complete knowledge of the system states. In order to find the solution to the ARE (3.72), the following offline policy iteration algorithm is presented in Modares and Lewis (2014).

Algorithm 3.3 Offline policy iteration algorithm
1. Initialization: Start with an admissible control policy $u^0 = K^0 x$. Set a stopping criterion $e$. Let $i = 0$.
2. Policy evaluation: For a fixed control gain $K^i$, solve for $P^i$ using the Lyapunov equation

$(A + B K^i)^T P^i + P^i (A + B K^i) - \gamma P^i + C^T Q C + (K^i)^T R K^i = 0. \qquad (3.87)$

3. Policy improvement: Update the control policy gain using

$K^{i+1} = -R^{-1} B^T P^i. \qquad (3.88a)$

4. Stop if $\|K^{i+1} - K^i\| \le e$. Otherwise, set $i \leftarrow i + 1$ and go to Step 2.
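Below is a minimal sketch of Algorithm 3.3 for known dynamics, assuming $(A, B, C, Q, R, \gamma)$ and an initial stabilizing gain are available. It uses the observation that the discounted Lyapunov equation (3.87) is an ordinary Lyapunov equation for the shifted matrix $A + B K^i - 0.5\gamma I$. All function and variable names are ours, not the book's.

```python
# Sketch of Algorithm 3.3 (offline policy iteration for the discounted ARE).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def offline_policy_iteration(A, B, C, Q, R, gamma, K0, tol=1e-9, max_iter=100):
    n = A.shape[0]
    K = np.atleast_2d(K0)
    for _ in range(max_iter):
        # (A + B K - 0.5*gamma*I)^T P + P (A + B K - 0.5*gamma*I)
        #   + C^T Q C + K^T R K = 0   <=>   discounted Lyapunov equation (3.87)
        A_shift = A + B @ K - 0.5 * gamma * np.eye(n)
        W = C.T @ Q @ C + K.T @ R @ K
        P = solve_continuous_lyapunov(A_shift.T, -W)   # policy evaluation (3.87)
        K_next = -np.linalg.solve(R, B.T @ P)          # policy improvement (3.88a)
        if np.linalg.norm(K_next - K) <= tol:
            K = K_next
            break
        K = K_next
    return P, K
```

Under the conditions discussed in Remark 3.6 below, the kernel iterates produced by such a sketch decrease monotonically toward the ARE solution.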

Remark 3.6 It was shown in Kleinman (1968) that if the initial control policy $K^0$ is stabilizing, then the subsequent control policies $K^i$, $i > 0$, are also stabilizing. It was also shown that Algorithm 3.3 converges to the optimal control policy $K^*$ in (3.71) and to the desired solution $P$ of (3.72), with the monotonicity property $0 < P \le P^{i+1} \le P^i$.

Algorithm 3.3 is performed offline and requires complete knowledge of the system dynamics. In order to obviate this requirement, model-free off-policy RL algorithms were proposed in Lee et al. (2014) and Jiang and Jiang (2012) for solving the LQR problem with an undiscounted performance function. Here, this idea is extended to discounted performance functions. To this end, the system dynamics (3.66a) is first written as

$\dot{x} = A_i x + B(-K^i x + u) \qquad (3.89)$

with $A_i = A + B K^i$. Then, using (3.87) and (3.88a), we have the following off-policy Bellman equation:


$e^{-\gamma \delta t} x^T(t + \delta t) P^i x(t + \delta t) - x^T(t) P^i x(t)$
$\quad = \int_t^{t+\delta t} \frac{d}{d\tau} \big( e^{-\gamma(\tau - t)} x^T P^i x \big)\, d\tau$
$\quad = \int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ x^T ( A_i^T P^i + P^i A_i - \gamma P^i ) x + 2 (u - K^i x)^T B^T P^i x \big] d\tau$
$\quad = -\int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ x^T Q^i x + 2 (u - K^i x)^T R K^{i+1} x \big] d\tau, \qquad (3.90)$

where $Q^i = C^T Q C + (K^i)^T R K^i$. For a fixed control gain $K^i$, (3.90) can be solved for the kernel matrix $P^i$ and the improved gain $K^{i+1}$ simultaneously. The following Algorithm 3.4 uses this Bellman equation to iteratively solve the ARE (3.72).

Algorithm 3.4 Online model-free IRL state-feedback algorithm
1. Initialization: Start with a control policy $u^0 = K^0 x + \epsilon$, where $K^0$ is stabilizing and $\epsilon$ is a probing noise. Set a stopping criterion $e$. Apply the admissible control input $u$. Let $i = 0$.
2. Solve the following Bellman equation for $P^i$ and $K^{i+1}$ simultaneously:

$e^{-\gamma \delta t} x^T(t + \delta t) P^i x(t + \delta t) - x^T(t) P^i x(t) = -\int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ x^T Q^i x + 2 (u - K^i x)^T R K^{i+1} x \big] d\tau. \qquad (3.91)$

3. Stop if $\|K^{i+1} - K^i\| \le e$. Otherwise, set $i \leftarrow i + 1$ and go to Step 2.

Remark 3.7 Algorithm 3.4 is an extension, to discounted performance functions, of the algorithm presented in Jiang and Jiang (2012) for the undiscounted case. The convergence proof of Algorithm 3.4 to the optimal control solution is the same as that of Jiang and Jiang (2012) and is thus omitted.
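In practice, Step 2 of Algorithm 3.4 is solved from data in a least-squares sense. The sketch below shows one way to set (3.91) up as a linear regression in $\mathrm{vec}(P^i)$ and $\mathrm{vec}(K^{i+1})$; the data layout, the trapezoidal quadrature, and all helper names are our own choices and are not prescribed by the book.

```python
# Sketch of one iteration of Algorithm 3.4 from state-input data.
import numpy as np

def irl_iteration(t_grid, x_traj, u_traj, K_i, Q, R, C, gamma, dt_r):
    """t_grid: (T,) uniform time grid; x_traj: (T, n) states; u_traj: (T, m) inputs
    applied to the system; dt_r: reinforcement interval (multiple of the grid step).
    Returns (P_i, K_ip1) solving (3.91) in the least-squares sense."""
    n, m = x_traj.shape[1], u_traj.shape[1]
    h = t_grid[1] - t_grid[0]
    steps = int(round(dt_r / h))
    Qi = C.T @ Q @ C + K_i.T @ R @ K_i
    rows, rhs = [], []
    for k in range(0, len(t_grid) - steps, steps):
        x0, x1 = x_traj[k], x_traj[k + steps]
        # quadratic-in-P part: e^{-gamma dt} x1'P x1 - x0'P x0
        phi_P = np.exp(-gamma * dt_r) * np.kron(x1, x1) - np.kron(x0, x0)
        # integral terms over the reinforcement interval (trapezoid rule)
        phi_K = np.zeros(m * n)
        rho = 0.0
        for j in range(k, k + steps + 1):
            w = h * (0.5 if j in (k, k + steps) else 1.0)
            disc = np.exp(-gamma * (t_grid[j] - t_grid[k]))
            x, u = x_traj[j], u_traj[j]
            phi_K += w * disc * 2.0 * np.kron(x, R @ (u - K_i @ x))
            rho += w * disc * (x @ Qi @ x)
        rows.append(np.concatenate([phi_P, phi_K]))
        rhs.append(-rho)
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P_i = theta[: n * n].reshape(n, n, order="F")
    P_i = 0.5 * (P_i + P_i.T)                      # enforce symmetry of the kernel
    K_ip1 = theta[n * n:].reshape(m, n, order="F")
    return P_i, K_ip1
```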

3.3.3 Output-Feedback Off-Policy RL with Input–Output Data

In this part, it is first shown that for an observable system, the states can be constructed using only a limited number of measured system outputs over the past history of the system trajectory. Then, using these delayed system outputs, an OPFB Bellman equation is presented. A model-free off-policy RL-based OPFB design method is then provided to find the optimal OPFB controller without requiring knowledge of the system dynamics or the system states.


State Reconstruction and Value Function Approximation Using Output Data

Suppose that at time $t$ we have a set of $N$ output values from the history of the system, stored in a history stack $y_N$. That is,

$y_N = \{ y(t - h_i),\; i = 0, 1, \ldots, N-1 \}, \qquad (3.92)$

where $h_i$, $i = 0, \ldots, N-1$, are the delay times and are assumed fixed. Consider that these $N$ output samples are sampled from the system (3.66b) at $N$ time instances stored in the vector $\tau_N$ as

$\tau_N = \{ t - h_i \ge 0,\; i = 0, 1, \ldots, N-1 \}. \qquad (3.93)$

Definition 3.1 System (3.66a)–(3.66b) is said to be $\tau_N$ observable if $x(0)$ can be uniquely determined from the observations $y_N$ on $\tau_N$ (Wang et al. 2011).

Definition 3.2 For a given time interval $[0, \bar{T}]$ and an integer $N_{\bar{T}}$, the system is said to be $N_{\bar{T}}$-sample observable if the system is $\tau_{N_{\bar{T}}}$ observable for any $\tau_{N_{\bar{T}}}$ with $0 \le t - h_i \le \bar{T}$, $i = 1, \ldots, N_{\bar{T}}$.

The following theorem shows that, for the system (3.66a)–(3.66b) with $(A, C)$ observable, one can always find a number $\mu_{\bar{T}}$ such that the system is $N_{\bar{T}}$-sample observable whenever $N_{\bar{T}} > \mu_{\bar{T}}$.

Theorem 3.4 Suppose the matrix $A$ in (3.66a)–(3.66b) has eigenvalues $\lambda_j$, $j = 1, \ldots, n$. Denote $\beta = \max_{1 \le i, j \le n} \{ \mathrm{Im}(\lambda_i - \lambda_j) \}$, where $\mathrm{Im}(z)$ is the imaginary part of $z \in \mathbb{C}$. For a given interval $[0, \bar{T}]$, define

$\mu_{\bar{T}} = 2(n - 1) + \dfrac{\bar{T}\, \beta}{2\pi}. \qquad (3.94)$

Given that $(A, C)$ is observable, if $N_{\bar{T}} > \mu_{\bar{T}}$, then the system is $N_{\bar{T}}$-sample observable.

Proof See Wang et al. (2011).



If the condition of Theorem 3.4 is satisfied, then the system state at each time can be calculated from the knowledge of the system output at $N$ points in its history. The next lemma shows that if the interval $[0, \bar{T}]$ is small enough, one can construct the system state using $n$ previous values of the output for an $n$-dimensional system.

Lemma 3.1 For any given $n$-dimensional observable system, there exists a sufficiently small time interval $[0, \bar{T}]$ such that, for any $n$ sampling times $0 \le t - h_i \le \bar{T}$, $i = 1, \ldots, n$, the system is $n$-sample observable.

Proof See Wang et al. (2011).



Note that in the state-feedback model-free IRL Algorithm 3.4, the control policy that is applied to the system is considered to be fixed. In the following, using Theorem 3.4, a formula is given by which the state knowledge needed in Algorithm 3.4 is obtained from the system output measured at $N$ points in the history of applying that control policy. These $N$ points are collected and stored in the history stack at the reinforcement interval times $t - i\delta t$, $i = 1, \ldots, N$. That is, in (3.93) we have $h_i = i\delta t$ and hence

$\tau_N = \{ t - i\delta t \ge 0,\; i = 0, 1, \ldots, N-1 \}. \qquad (3.95)$

Note that the number of samples $N$ in the history stack is fixed: each new sample is added while the oldest one is removed. Based on Lemma 3.1, for a sufficiently small time interval, the number of samples in the history stack can be taken equal to the dimension of the state space of the system. Now, assume that the control policy applied to the system is given by (3.68). Then, using (3.68) in (3.66a), the closed-loop system dynamics become

$\dot{x}(t) = (A + B K)\, x(t). \qquad (3.96)$

It is now shown that the state needed for solving the Bellman equation (3.90) can be expressed in terms of $N$ measurements of the output collected while the control gain $K$ is applied. To this end, the system state at every time instance stored in the vector $\tau_N$ is first expressed in terms of the system state at the current time $t$. In fact, since the control policy applied to the system is fixed, using the solution of (3.96), the state at an arbitrary time $t - i\delta t$ can be written in terms of the state at the current time $t$ as

$x(t - i\delta t) = e^{-i\delta t (A + B K)}\, x(t). \qquad (3.97)$

Using (3.66b) and (3.97), we have

$y(t - i\delta t) = C e^{-i\delta t (A + B K)}\, x(t). \qquad (3.98)$

Suppose that at the current time $t$, a set of $N$ output values $y_N = \{y(t - h_i),\, i = 0, 1, \ldots, N-1\}$, sampled at the $N$ time instances $\tau_N = \{t - h_i \ge 0,\, i = 0, 1, \ldots, N-1\}$, is stored in a history stack. Then, using (3.98), we have

$\begin{bmatrix} y(t) \\ y(t - \delta t) \\ \vdots \\ y(t - (N-1)\delta t) \end{bmatrix} = \begin{bmatrix} C \\ C e^{-\delta t (A + B K)} \\ \vdots \\ C e^{-(N-1)\delta t (A + B K)} \end{bmatrix} x(t). \qquad (3.99)$

Define

$\bar{y}_t = \big[ y^T(t)\;\; y^T(t - \delta t)\;\; \ldots\;\; y^T(t - (N-1)\delta t) \big]^T, \qquad (3.100a)$

$G = \big[ C^T\;\; e^{-\delta t (A + B K)^T} C^T\;\; \ldots\;\; e^{-(N-1)\delta t (A + B K)^T} C^T \big]^T, \qquad (3.100b)$


where $\bar{y}_t \in \mathbb{R}^{pN \times 1}$ and $G \in \mathbb{R}^{pN \times n}$, with $p$ the dimension of the output. Then, (3.99) becomes

$\bar{y}_t = G\, x(t). \qquad (3.101)$

If the system (3.66a) is observable and the number of samples $N$ is larger than $\mu_{\bar{T}}$ defined in (3.94), then based on Theorem 3.4 the system is $N$-sample observable. Therefore, $G$ has full column rank and, from (3.101), the system state vector is given by

$x(t) = G_N \bar{y}_t = [L_1, \ldots, L_N]\, \bar{y}_t = \sum_{i=1}^{N} L_i\, y(t - (i-1)\delta t), \qquad (3.102)$

where $G_N = (G^T G)^{-1} G^T \in \mathbb{R}^{n \times pN}$ and $L_i = G_N(1{:}n,\, (i-1)p + 1 : i p)$. Note that (3.102) shows that if the system is observable, one can uniquely construct the system state needed to evaluate the Bellman equation using a limited number of measured system outputs collected while the given control policy is applied. Note also that the system dynamics information $A$, $B$, and $C$ must be known to construct the system state from the measured system outputs; in fact, $G$ in (3.100b) depends on $A$, $B$, and $C$. In the next step, it is shown how to use the structural dependence in (3.102) while avoiding knowledge of the system dynamics. We first show that the value function (3.69) can be expressed as a quadratic form in terms of a limited number of measured system outputs in the history of the system. Using (3.102) in (3.69) gives

$V(t) = x^T(t) P x(t) = (G_N \bar{y}_t)^T P (G_N \bar{y}_t) = \bar{y}_t^T G_N^T P G_N \bar{y}_t, \qquad (3.103)$

which is written as

$V(t) = \bar{y}_t^T \bar{P} \bar{y}_t, \qquad (3.104)$

where

$\bar{P} = G_N^T P G_N \qquad (3.105)$

is constant. Using (3.104), (3.69) becomes

$\bar{y}_t^T \bar{P} \bar{y}_t = \int_t^{\infty} e^{-\gamma(\tau - t)} \big( y^T Q y + u^T R u \big)\, d\tau. \qquad (3.106)$

Note that the matrix $\bar{P}$ in (3.105) depends on the system dynamics $A$, $B$, and $C$. In the next subsection, it is shown how to use the model-free off-policy RL method to learn this matrix without knowing the system dynamics.
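For illustration only, the maps above can be formed explicitly when the dynamics are known. The sketch below (helper names are ours; the book's algorithm never builds $G$ explicitly) constructs $G$ from (3.100b), reconstructs the state as in (3.102), and forms the output-feedback kernel (3.105).

```python
# Sketch of the state-reconstruction map (3.101)-(3.102) and the kernel (3.105),
# assuming known (A, B, C, K); in Algorithm 3.5 these are learned from data instead.
import numpy as np
from scipy.linalg import expm

def delayed_output_map(A, B, C, K, dt, N):
    """Stack C e^{-i*dt*(A+BK)}, i = 0..N-1, as in (3.100b)."""
    Ak = A + B @ K
    return np.vstack([C @ expm(-i * dt * Ak) for i in range(N)])

def reconstruct_state(G, y_stack):
    """x(t) = G_N ybar_t with G_N the left pseudo-inverse of G, cf. (3.102)."""
    G_N = np.linalg.pinv(G)          # equals (G^T G)^{-1} G^T for full-column-rank G
    return G_N @ y_stack

def output_kernel(G, P):
    """P_bar = G_N^T P G_N, the value-function kernel in delayed outputs (3.105)."""
    G_N = np.linalg.pinv(G)
    return G_N.T @ P @ G_N
```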


Off-Policy RL with Measured Output Data

Now a model-free OPFB off-policy IRL algorithm is developed that uses measured output data. This algorithm is equivalent to Algorithm 3.4. Algorithm 3.4 is a model-free off-policy RL algorithm in which the policy $u$ applied to the system can be different from the policy $u^i = K^i x$ that is updated and evaluated. We assume that $u$ is the updated policy plus a probing noise. That is,

$u = K^i x + w, \qquad (3.107)$

where $w$ is the probing noise. Based on (3.101), the relation between the state-feedback and OPFB control gains is obtained as

$u^i = K^i x = K^i G_N^i \bar{y}_t = \bar{K}^i \bar{y}_t, \qquad (3.108)$

where $G_N^i = \big( (G^i)^T G^i \big)^{-1} (G^i)^T \in \mathbb{R}^{n \times pN}$ and

$G^i = \big[ C^T\;\; e^{-\delta t (A + B K^i)^T} C^T\;\; \ldots\;\; e^{-(N-1)\delta t (A + B K^i)^T} C^T \big]^T. \qquad (3.109)$

Note that $\bar{K}^i = K^i G_N^i$ is a nonlinear function of the state-feedback gain $K^i$ and the system dynamics. Using (3.102), (3.104), and (3.106) in (3.91), the key equation (3.91) of Algorithm 3.4, which uses the state information to evaluate both the value function and the control policy, can be written in terms of the measured outputs as

$e^{-\gamma \delta t}\, \bar{y}_{t+\delta t}^T \bar{P}^i \bar{y}_{t+\delta t} - \bar{y}_t^T \bar{P}^i \bar{y}_t = -\int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ \bar{y}_t^T \bar{Q}^i \bar{y}_t + 2 (u - \bar{K}^i \bar{y}_t)^T R \bar{K}^{i+1} \bar{y}_t \big] d\tau, \qquad (3.110)$

where $\bar{Q}^i = [\,I\;\,0\;\cdots\;0\,]^T Q\, [\,I\;\,0\;\cdots\;0\,] + (\bar{K}^i)^T R\, \bar{K}^i$, so that $\bar{y}_t^T \bar{Q}^i \bar{y}_t = x^T Q^i x$. We now use the OPFB Bellman equation (3.110) to present an optimal model-free OPFB control design method as follows.

Algorithm 3.5 Model-free off-policy RL-based OPFB control design algorithm
1. Initialization: Start with a control policy $u^0 = \bar{K}^0 \bar{y}_t$, where $K^0$ is stabilizing. Set a stopping criterion $e$. Apply an admissible control input $u$. Let $i = 0$.
2. Solve the following Bellman equation for $\bar{P}^i$ and $\bar{K}^{i+1}$ simultaneously:

$e^{-\gamma \delta t}\, \bar{y}_{t+\delta t}^T \bar{P}^i \bar{y}_{t+\delta t} - \bar{y}_t^T \bar{P}^i \bar{y}_t = -\int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ \bar{y}_t^T \bar{Q}^i \bar{y}_t + 2 (u - \bar{K}^i \bar{y}_t)^T R \bar{K}^{i+1} \bar{y}_t \big] d\tau. \qquad (3.111)$

3. Stop if $\|\bar{K}^{i+1} - \bar{K}^i\| \le e$. Otherwise, set $i \leftarrow i + 1$ and go to Step 2.
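A small bookkeeping sketch for Algorithm 3.5 is given below: it maintains the stack of delayed outputs $\bar{y}_t$ and forms the regressor that multiplies $\mathrm{vec}(\bar{P}^i)$ on the left-hand side of (3.111). The class and function names are ours; the remaining regression terms are assembled exactly as in the state-feedback sketch given after Remark 3.7.

```python
# Sketch of the data bookkeeping used by Algorithm 3.5.
import collections
import numpy as np

class OutputStack:
    """Keeps the N most recent output samples y(t), y(t-dt), ..., y(t-(N-1)dt)."""
    def __init__(self, N: int):
        self.N = N
        self.buf = collections.deque(maxlen=N)

    def push(self, y: np.ndarray) -> None:
        self.buf.appendleft(np.asarray(y, dtype=float))

    def ybar(self) -> np.ndarray:
        if len(self.buf) < self.N:
            raise ValueError("history stack not yet full")
        return np.concatenate(list(self.buf))   # [y(t); y(t-dt); ...; y(t-(N-1)dt)]

def pbar_feature(ybar_t, ybar_next, gamma, delta_t):
    """Regressor multiplying vec(P_bar) on the left-hand side of (3.111)."""
    return (np.exp(-gamma * delta_t) * np.kron(ybar_next, ybar_next)
            - np.kron(ybar_t, ybar_t))
```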


Algorithm 3.5 does not require any knowledge of the system dynamics or the system states. In fact, the requirements of knowing the system dynamics and measuring the system states are replaced by input and output information measured online. The solutions $\bar{P}^i$ and $\bar{K}^{i+1}$ of (3.111) can be found using least-squares methods (see Jiang and Jiang (2012) for more details).

Theorem 3.5 Algorithm 3.5 converges to the optimal OPFB control gain $\bar{K}^*$ and value function kernel matrix $\bar{P}^*$. Moreover, if the condition of Theorem 3.4 is satisfied, one has $u^* = \bar{K}^* \bar{y}_t = K^* x$, where $K^*$ is given in (3.71) and $P$ satisfies the state-feedback ARE (3.72). That is, the optimal OPFB solution gives the optimal state-feedback solution.

Proof Using (3.109), one has

$\bar{K}^{i+1} = K^{i+1} G_N^i = -R^{-1} B^T P^i G_N^i. \qquad (3.112)$

Dividing both sides of (3.111) by $\delta t$ and taking the limit yields

$\lim_{\delta t \to 0} \dfrac{ e^{-\gamma \delta t} \bar{y}_{t+\delta t}^T \bar{P}^i \bar{y}_{t+\delta t} - \bar{y}_t^T \bar{P}^i \bar{y}_t }{\delta t} + \lim_{\delta t \to 0} \dfrac{ \int_t^{t+\delta t} e^{-\gamma(\tau - t)} \big[ \bar{y}_t^T \bar{Q}^i \bar{y}_t + 2 (u - \bar{K}^i \bar{y}_t)^T R \bar{K}^{i+1} \bar{y}_t \big] d\tau }{\delta t} = 0. \qquad (3.113)$

By L'Hopital's rule, (3.113) becomes

$-\gamma\, \bar{y}_t^T \bar{P}^i \bar{y}_t + \dot{\bar{y}}_t^T \bar{P}^i \bar{y}_t + \bar{y}_t^T \bar{P}^i \dot{\bar{y}}_t + \bar{y}_t^T \bar{Q}^i \bar{y}_t + 2 (u - \bar{K}^i \bar{y}_t)^T R \bar{K}^{i+1} \bar{y}_t = 0. \qquad (3.114)$

On the other hand, by differentiating (3.101), we have

$\dot{\bar{y}}_t = G \dot{x}(t) = G (A x + B u) = G A G_N \bar{y}_t + G B u. \qquad (3.115)$

Using (3.115) and (3.112) in (3.114) yields

$G_N^T \big[ (A + B K^i)^T P^i + P^i (A + B K^i) - \gamma P^i + C^T Q C + (K^i)^T R K^i \big] G_N = 0. \qquad (3.116)$

Since $G_N$ is full rank, the Lyapunov equation (3.87) is satisfied. That is, evaluating a fixed OPFB control policy $u = \bar{K}^i \bar{y}_t$ using the Bellman equation (3.111) gives the same value function as evaluating the fixed state-feedback control policy $u = K^i x(t)$, with $\bar{K}^i = K^i G_N^i$, using the state-feedback Lyapunov equation (3.87). Moreover, based on (3.112), the policy improvement $\bar{K}^{i+1}$ corresponds to the state-feedback update $K^{i+1} = -R^{-1} B^T P^i$. Hence, the policy evaluation and improvement steps of Algorithm 3.5 give the same results as those of Algorithm 3.3, and thus Algorithm 3.5 has the same convergence properties. This confirms that the proposed OPFB design method converges to an optimal solution and yields a state-feedback control.


Remark 3.8 The proposed control input is more powerful than a static OPFB of the form $u = K y(t)$. In fact, as shown in the proof of Theorem 3.5, the proposed control input is equivalent to a state-feedback control input as a result of using the delayed outputs. Therefore, in contrast to static OPFB, the proposed controller can stabilize a system that is state-feedback stabilizable but not static OPFB stabilizable. The simulation results below confirm this statement.

Remark 3.9 Note that the matrix $G$ in (3.101)–(3.103), which is given by (3.100b), requires complete knowledge of the system dynamics; it is used to construct the system states. To obviate the need to know the system dynamics, the structure (3.101) is combined with the value function approximation and the policy update law in the Bellman equation (3.111), so that $G$ is absorbed into the value-function kernel matrix and the improved control policy. These quantities are then learned by measuring the system outputs online in real time, without requiring any knowledge of the system dynamics.

3.3.4 Simulation Examples

Consider the following linear CT system:

$\dot{x} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} x + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u, \qquad (3.117a)$

$y = \begin{bmatrix} 1 & 0 \end{bmatrix} x. \qquad (3.117b)$

This system is both controllable and observable. However, it is not static OPFB stabilizable: there is no gain $K$ such that the control input $u = K y$ makes the system asymptotically stable. In contrast, we now show that it can be stabilized using the proposed OPFB design method. Algorithm 3.5 is used to find the optimal OPFB gain. The reinforcement interval is $\delta t = 0.1$ and the number of stored data in the history stack is 2. That is, the control input $u(t)$ is constructed from the current output $y(t)$ and the past output $y(t - \delta t)$. A probing noise is added to the control input to persistently excite the system output. Define $\bar{y}_t = [y(t)\;\; y(t - 0.1)]$ and $u = \bar{K} \bar{y}_t$. Figure 3.3 shows the convergence of $\bar{K}$, which converges to

$\bar{K} = [8.6429,\; -6.1777]. \qquad (3.118)$

Figure 3.4 shows the trajectories of the control input $u$ and the output $y$. The optimal OPFB policy is then given by

$u = 8.6429\, y(t) - 6.1777\, y(t - 0.1). \qquad (3.119)$
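As a quick side check of the claim above that no static output feedback stabilizes (3.117a)–(3.117b): with $u = k y$ the closed-loop matrix is $A + k B C$, whose trace is zero, so its two eigenvalues can never both lie in the open left half-plane. The short sketch below (the gain grid is our own illustrative choice) simply sweeps $k$ and confirms this.

```python
# Sketch: the harmonic-oscillator example is not static-OPFB stabilizable.
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

worst = min(
    max(np.linalg.eigvals(A + k * (B @ C)).real)
    for k in np.linspace(-50.0, 50.0, 2001)
)
print("smallest achievable max Re(eig) over the sweep:", worst)  # never negative
```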

Fig. 3.3 Convergence of OPFB control gain

Fig. 3.4 Trajectories of the output and the control input

References

Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
Adam S, Busoniu L, Babuska R (2012) Experience replay for real-time reinforcement learning control. IEEE Trans Syst Man Cybern Part C: Appl Rev 42:201–212
Dung LT, Komeda T, Takagi M (2008) Efficient experience reuse in non-Markovian environments. In: Proc Int Conf Instrum Control Inf Technol, Tokyo, Japan, pp 3327–3332
Finlayson BA (1990) The method of weighted residuals and variational principles. Academic Press, New York
Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704
Kalyanakrishnan S, Stone P (2007) Batch reinforcement learning in a complex domain. In: Proc 6th Int Conf Auton Agents Multi-Agent Syst, Honolulu, HI, pp 650–657
Kleinman D (1968) On an iterative technique for Riccati equation computations. IEEE Trans Autom Control 13(1):114–115
Lee JY, Park JB, Choi YH (2014) Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations. IEEE Trans Neural Networks Learn Syst 26(5):916–932
Lin LJ (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach Learn 8:293–321
Modares H, Lewis FL (2014) Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Trans Autom Control 59(11):3051–3056
Modares H, Lewis FL, Jiang ZP (2016) Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Trans Cybern 46(11):2401–2410
Sutton RS, Barto AG (2018) Reinforcement learning: an introduction. MIT Press
Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5):878–888
Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
Wang L, Li C, Yin GG, Guo L, Xu CZ (2011) State observability and observers of linear-time-invariant systems under irregular-sampling and sensor limitations. IEEE Trans Autom Control 56(11):2639–2654
Wawrzynski P (2009) Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks 22:1484–1497
Werbos PJ (1992) Approximate dynamic programming for real time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control. Multiscience Press

Chapter 4

Integral Reinforcement Learning for Optimal Tracking

4.1 Introduction

In control system design, a common objective is to find a stabilizing controller that ensures the system's output tracks a desired reference trajectory. Optimal control theory aims to achieve this goal by determining a control law that not only stabilizes the error dynamics but also minimizes a predefined performance index. Reinforcement learning (RL) algorithms have proven to be effective in solving the optimal tracking control problem (OTCP) for both discrete-time (Dierks and Jagannathan 2009; Wang et al. 2012; Modares et al. 2014) and continuous-time systems (Zhang et al. 2011). RL algorithms not only learn optimal tracking control solutions but also stabilize the tracking error systems. This chapter first explores integral reinforcement learning (IRL) algorithms for linear quadratic tracking control. Then, a novel formulation for the OTCP of nonlinear constrained-input systems is introduced. The IRL algorithms are designed to offer a comprehensive approach to the challenges associated with tracking control problems, enabling the design of controllers that achieve optimal tracking performance while satisfying system constraints.

4.2 Integral Reinforcement Learning Policy Iteration for Linear Quadratic Tracking Control

For linear systems accompanied by a quadratic performance index, the optimal tracking problem is called linear quadratic tracking (LQT), which is an important problem in the field of optimal control theory. Traditional solutions to the LQT problem are composed of two components: a feedback term obtained by solving an algebraic Riccati equation (ARE) and a feedforward term obtained by either solving a differential equation (Lewis et al. 2012) or calculating a desired control input a priori using knowledge of the system dynamics (Mannava et al. 2012). The feedback term tries to stabilize the tracking error dynamics, and the feedforward term tries to guarantee perfect tracking. Procedures for computing the feedback and feedforward terms are traditionally based on offline solution methods, which must be carried out in a noncausal manner backward in time and require complete knowledge of the system dynamics. In this chapter, we discuss RL-based control design to solve optimal tracking control problems.

4.2.1 Problem Formulation Consider the linear CT system .

x(t) ˙ = Ax(t) + Bu(t), y(t) = C x(t),

(4.1)

where .x ∈ Rn is a measurable system state vector, . y ∈ R p is the system output, m n×n .u ∈ R is the control input, . A ∈ R gives the drift dynamics of the system, . B ∈ n×m is the input matrix and .C ∈ R p×n is the output matrix. R √ Assumption 4.1 The pair .(A, B) is stabilizable and the pair .( Q C, A) is observable. The goal of the optimal tracking problem is to find the optimal control policy .u ∗ so as to make the system (4.1) track a desired (reference) trajectory y (t) ∈ R p

. d

in an optimal manner by minimizing a predefined performance index. In the infinitehorizon LQT problem, the performance index is usually considered as .

J (x, y¯d ) =



1 2



[

] (C x − yd )T Q (C x − yd ) + u T R u dτ ,

(4.2)

t

where . y¯d = {yd (τ ) , t ≤ τ }, . Q > 0 and . R > 0 are symmetric matrices, and (C x − yd )T Q (C x − yd ) + u T R u is the utility function. The standard solution to the LQT problem is given as (Lewis et al. 2012; Barbieri and Alba-Flores 2000)

.

u = −R −1 B T S x + R −1 B T v SS ,

.

(4.3)

where . S is obtained by solving the Riccati equation 0 = AT S + S A − S B R −1 B T S + C T Q C

.

(4.4)

4.2 Integral Reinforcement Learning Policy Iteration for Linear …

73

and the limiting function .v SS is given by .v SS = lim T →∞ v, with the auxiliary time signal .v satisfies .

( )T − v˙ = A − B R −1 B T S v + C T Q yd , v (T ) = 0.

(4.5)

The first term of the control input (4.3) is a feedback control part that depends linearly on the system state, and the second term is a feedforward control part that depends on the reference trajectory. The feedforward part of the control input is time varying in general and thus a theoretical difficulty arises in the solution of the infinite-horizon LQT problem. In Barbieri and Alba-Flores (2000, 2006), methods for real-time computation of .v SS are provided. These methods are performed offline and require complete knowledge of the system dynamics. Remark 4.1 Note that the performance function (4.2) is unbounded if the reference trajectory does not approach zero as time goes to infinity. This is because the feedforward part of the control input and, consequently, the second term under the integral of the performance function (4.2) depends on the reference trajectory. Therefore, the standard methods presented in Barbieri and Alba-Flores (2000, 2006) can be only used if the reference trajectory is generated by an asymptotically stable system. Because of this shortcoming and the theoretical difficulty arises in solving (4.5), the infinite-horizon LQT problem has received little attention in the literature.

4.2.2 Augmented Algebraic Riccati Equation for Causal Solution In this section, a causal solution to the LQT problem is presented. It is assumed that the reference trajectory is generated by a linear command generator and it is then shown that the value function for the LQT problem is quadratic in the system state and the reference trajectory. An augmented LQT ARE for this system is derived to solve the LQT problem in a casual manner. Assumption 4.2 Assume that the reference trajectory . yd (t) is generated by the command generator system y˙ = F yd ,

. d

(4.6)

where . F is a constant matrix of appropriate dimension. Remark 4.2 Matrix . F is not assumed stable. The command generator dynamics given in (4.6) can generate a large class of useful command trajectories, including unit step (useful, e.g., in position command), sinusoidal waveforms (useful, e.g., in hard disk drive control), damped sinusoids (useful, e.g., in vibration quenching in flexible beams), the ramp (useful in velocity tracking systems, e.g., satellite antenna pointing), and more.

74

4 Integral Reinforcement Learning for Optimal Tracking

The use of the performance function (4.2) for the LQT problem requires the command generator to be asymptotically stable, i.e., . F in (4.6) must be Hurwitz. In order to relax this restrictive assumption, a discounted value function for the LQR problem is introduced as follows .

V (x, y¯d ) =

1 2





[ ] e−γ(τ −t) (C x − yd )T Q (C x − yd ) + u T R u dτ ,

(4.7)

t

where .γ > 0 is the discount factor. Definition 4.1 (Admissible Control) A control policy .μ(x) is said to be admissible with respect to (4.2), if .μ(x) is continuous, .μ(0) = 0, .u(x) = μ(x) stabilizes (4.1), and .V (x(t), y¯d ) is finite .∀ x(t) and . y¯d . Lemma 4.1 (Quadratic Form of the LQT Value Function) Consider the LQT problem with the system dynamics and the reference trajectory dynamics given as (4.1) and (4.6), respectively. Consider the admissible fixed control policy

u = K x + K ' yd .

(4.8)

.

Then, the value function (4.7) for control policy (4.8) can be written as the quadratic form .

V (x(t), y¯d ) = V (x(t), yd (t)) =

] [ ]T 1[ T X (t) ydT (t) P X T (t) ydT (t) 2

(4.9)

for some symmetric . P > 0. Proof Putting (4.8) in the value function (4.7) and performing some manipulations yields ∫ 1 ∞ −γτ [ T x (τ + t)(C T Q C + K T R K ) x(τ + t) e . V (x(t), yd (t)) = 2 0 + 2x T (τ + t)(−C T Q + K T R K ' ) yd (τ + t) ] T + ydT (τ + t)(Q + K ' R K ' ) yd (τ + t) dτ . (4.10) Using (4.8), the solutions for the linear differential Eqs. (4.1) and (4.6) become .

x(τ + t) = e

(A+B K )τ

(∫

τ

x(t) +

e

(A+B K ) τ '

' F τ'

BK e



'

) yd (t)

0 ∆

= L 1 (τ ) x(t) + L 2 (τ ) yd (t), y (τ + t) = e

. d





(4.11a)

yd (t)

= L 3 (τ ) yd (t).

(4.11b)

4.2 Integral Reinforcement Learning Policy Iteration for Linear …

75

Substituting (4.11a) and (4.11b) in (4.10) results in .

[ where . P =

V (x(t), yd (t)) =





P11 = 0

∫ .

P12 =



0

∫ .

P21 =



0

.

P22 = 0



(4.12)

] P11 P12 with P21 P22 .



] [ ]T 1[ T X (t) ydT (t) P X T (t) ydT (t) 2

) ( e−γτ L T1 (τ ) C T Q C + K T R K L 1 (τ ) dτ ,

(4.13)

[ ) ( e−γτ L T1 (τ ) C T Q C + K T R K L 2 (τ ) ) ( ] +L T1 (τ ) −C T Q + K T R K ' L 3 (τ ) dτ ,

(4.14)

[ ) ( e−γτ L T2 (τ ) C T Q C + K T R K L 1 (τ ) ) ( ] T +L T3 (τ ) −Q C + K ' R K L 1 (τ ) dτ ,

(4.15)

[ T e−γτ L T3 (τ )(Q + K ' R K ' ) L 3 (τ )+ L T2 (τ )(C T Q C + K T R K ) L 2 (τ ) ] + 2 L T2 (τ )(−C T Q + K T R K ' ) L 3 (τ ) dτ . (4.16) ◻

This completes the proof.

.

Note that Eq. (4.9) is valid because Assumption (4.2) is imposed. Also, note that because the closed-loop system is stable for an admissible policy, . L 1 and . L 2 in (4.13)–(4.16) are bounded. The boundness of . L 3 and consequently the existence of a solution to the LQT problem is discussed in the following remark Remark 4.3 If the reference trajectory is bounded (i.e., if . F is stable or marginally stable, e.g., tracking a step or sinusoidal waveform), then. L 3 and is bounded for every .γ > 0. However, if the command generator dynamics . F in (4.6) is unstable, then the first and last terms of . P22 in (4.16) can be unbounded for some values of .γ. More specifically, one can conclude form (4.16) that . P22 is bounded if .(F − 0.5γ I ) has all its poles in the left-hand side of the complex plane. Therefore, if . F is unstable, we need to know an upper bound of the real part of unstable poles of the . F to choose .γ large enough to make sure . P22 is bounded and thus a solution to the LQT exists. Now define the augmented system state as .

[ ]T X (t) = X T (t) ydT (t) .

(4.17)

76

4 Integral Reinforcement Learning for Optimal Tracking

Putting (4.1) and (4.6) together construct the augmented system as ˙ = .X

[

A 0 0 F

]

[

B X+ 0

]



u = T X + B1 u.

(4.18)

The value function (4.9) in terms of the augmented system state becomes .

V (X (t)) =

1 T X (t)P X (t). 2

(4.19)

Using value function (4.19) for the left-hand side of (4.7) and differentiating (4.7) along with the trajectories of the augmented system (4.18) gives the augmented LQT Bellman equation 0 = (T X + B1 u)T P X + X T P (T X + B1 u)

.

− γ X T P X + X T C1T Q C1 X + u T R u,

(4.20)

C1 = [C −I ] .

(4.21)

where .

Consider the fixed control input (4.8) as u = K x + K ' yd = K 1 X,

.

(4.22)

'

where. K 1 = [K K ]. Putting (4.19) and (4.22) into (4.20), the LQT Bellman equation gives the augmented LQT Lyapunov equation (T + B1 K 1 )T P + P (T + B1 K 1 ) − γ P + C1T Q C1 + K 1 T R K 1 = 0.

.

(4.23)

Based on (4.20), define the Hamiltonian .

H (X, u, P) = (T X + B1 u)T P X + X T P (T X + B1 u) − γ X T P X + X T C1T Q C1 X + u T R u.

(4.24)

Theorem 4.1 (Causal Solution for the LQT Problem) The optimal control solution for the infinite-horizon LQT problem is given by

u = K 1 X,

.

where

(4.25)

4.2 Integral Reinforcement Learning Policy Iteration for Linear … .

K 1 = −R −1 B1T P

77

(4.26)

and . P satisfies the augmented LQT ARE 0 = T T P + P T − γ P + P B1 R −1 B1T P + C1T Q C1 .

.

(4.27)

Proof A necessary condition for optimality (Lewis et al. 2012) is stationarity condition .

∂H = B1T P X + R u = 0 ∂u

(4.28)

which results in control input (4.25). Substituting (4.19) and (4.25) in the LQT .◻ Bellman equation (4.20) yields (4.27). . Lemma 4.2 (Existence of the Solution to the LQT ARE) The LQT ARE (4.27) has a unique positive semi-definite solution if .(A, B) is stabilizable and the discount factor .γ > 0 is chosen such that . F − 0.5γ I is stable. Proof Note that the LQT ARE (4.27) can be written as 0 = (T − 0.5γ I )T P + P (T − 0.5γ I ) + P B1 R −1 B1T P + C1T Q C1 .

.

(4.29)

This amounts to an ARE without discount factor and with the system dynamics given by .T − 0.5γ I and . B1 . Therefore, a unique solution to the LQT ARE (4.29) and consequently the LQT ARE (4.27) exists if .(T − 0.5γ I, B1 ) is stabilizable. This requires that.(A − 0.5γ I, B) be stabilizable and. F − 0.5γ I be stable. However, since .(A, B) is stabilizable, then .(A − 0.5γ I, B) is also stabilizable for any .γ > 0. This .◻ completes the proof. Remark 4.4 The fact that . F − 0.5γ I should be stable to have a solution to the LQT ARE supports the conclusion followed by Lemma 4.1 (see Remark 4.3) for the existence of a solution to the LQT problem. In Remark 4.3, it is further elaborated on how to choose the discount factor to make sure the LQT problem has a solution. Remark 4.5 The optimal control input (4.25) can be written in form of .u = K x + K ' yd , as in (4.22). Therefore, similar to the standard solution given in Section II, the proposed control solution (4.25) has both feedback feedforward control parts. However, in the proposed method, both control parts are obtained simultaneously by solving an LQT ARE in a causal manner. This causal formulation is a consequence of Assumption (4.2) and the quadratic form (4.9), (4.19). Now a formal proof is given to show that the LQT ARE solution makes the tracking ∆ error .ed = C x − r bounded and it asymptotically stabilizes .e¯d (t) = e−(γ/2)t ed (t). The following key fact is instrumental.

78

4 Integral Reinforcement Learning for Optimal Tracking

Lemma 4.3 (Lewis et al. 2012) For any admissible control policy .u(X ), let . P be the corresponding solution to the Bellman equation (4.20). Define .u ∗ (X ) = −R −1 B1T P X . Then .

H (X, u, P) = H (X, u ∗ , P) + (u − u ∗ )T R (u − u ∗ ),

(4.30)

where . H is the Hamiltonian function defined in (4.24). Theorem 4.2 (Stability of the LQT ARE Solution) Consider the LQT problem for the system (4.1) with performance function (4.7). Suppose that . P ∗ is a smooth positive-definite solution to the tracking LQT ARE (4.27) and define the control input ∆ ∗ −1 T ∗ .u = −R B1 P X . Then, .u ∗ makes .e¯d (t) = e−(γ/2)t ed (t) asymptotically stable. Proof .V (X ) = X T P X , by differentiating .V (X ) along the augmented system trajectories, one has .

d V (X ) = (T X + B1 u)T P X + X T P(T X + B1 u), dt

(4.31)

so that .

H (X, u, P) =

d V (X ) − γV (X ) + X T C1T Q C1 X + u T R u. dt

(4.32)

Suppose now that . P ∗ satisfies the LQT ARE (4.27). Then, using (4.30) and since H (X ∗ , u ∗ , P ∗ ) = 0, one has

.

.

d V (X ) − γV (X ) + X T C1T Q C1 X + u T R u = (u − u ∗ )T R (u − u ∗ ). dt

(4.33)

Selecting .u = u ∗ = K 1 X gives .

d V (X ) − γV (X ) + X T (C1T Q C1 + K 1T R K 1 ) X = 0, dt

(4.34)

where . K 1 is the control gain obtained by solving the LQT ARE and it is given in (4.26). Multiplying .e−γt to the both sides of (4.34) and using .V (X ) = X T P X gives .

d −γt T (e X P X ) = −e−γt X T (C1T Q C1 + K 1T R K 1 )X ≤ 0. dt

(4.35)

Now define the new state . X¯ (t) = e−(γ/2)t X (t) and consider the Lyapunov function ¯ ) = X¯ T P X¯ . Then using (4.35) one has .V ( X .

V˙ ( X¯ ) = − X¯ T (C1T Q C1 + K 1T R K 1 ) X¯ < 0.

(4.36)

Therefore . X¯ (t) is asymptotically stable. On the other hand, since .e¯d = C1 X¯ and .C 1 / = 0, thus .e ¯ is also asymptotically stable. .◻

4.2 Integral Reinforcement Learning Policy Iteration for Linear …

79

Remark 4.6 According to Theorem (4.2), the tracking error is bounded for the optimal control solution. Moreover, the larger the . Q is, the more negative the Lyapunov functions (4.36) and consequently the faster the tracking error decreases. Also, the smaller the discount factor is, the faster the tracking error decreases. The discount factor .γ and the weight matrix . Q in (4.7) are design parameters and they can be chosen appropriately to make the system state go to a very small region around zero.

4.2.3 Integral Reinforcement Learning for Online Linear Quadratic Tracking In this section, first, an offline solution to the LQT ARE is presented. Then, a CT Bellman equation is developed based on integral reinforcement learning (IRL). Based on this, an RL technique is employed to solve the LQT problem online in real-time and without the need for the knowledge of the drift dynamics of the system . A and command generator dynamics . F. Offline PI for solving the LQT ARE The LQT Lyapunov equation (4.23), which can be solved to evaluate a fixed control policy, is linear in . P and is easier to solve than the LQT ARE (4.27). This is the motivation for introducing an iterative technique to solve the LQT problem. An iterative Lyapunov method for solving the LQT problem is given as follows. Algorithm 4.1 Offline policy iteration for solving the LQT problem 1. Initialization: Start with an admissible control input u = K 1 0 X . 2. Policy evaluation: Given a control gain K 1i , find P i using the LQT Lyapunov equation (T + B1 K 1i )T P i + P i (T + B1 K 1i ) − γ P i + C1T Q C1 + (K 1i )T R (K 1i ) = 0.

(4.37)

3. Policy improvement: Update the control gain using K 1 i+1 = −R −1 B1T P i .

(4.38)

Algorithm 4.1 is an offline algorithm that extends Kleinman’s algorithm (Kleinman 1968) to the LQT problem. Convergence of Kleinman’s algorithm to the solution of the ARE is shown in Kleinman (1968). Online IRL for Solving the LQT Problem To obviate the need for complete knowledge of the system dynamics, the IRL algorithm (Vrabie and Lewis 2009; Vrabie et al. 2009) can be extended to the LQT problem. The IRL is a PI algorithm that uses an equivalent formulation of the Lyapunov equation that does not involve the system dynamics. Hence, it is central to the

80

4 Integral Reinforcement Learning for Optimal Tracking

development of model-free RL algorithms for CT systems. This Bellman equation uses only the information given by measuring the system state and an integral of the utility function in finite reinforcement intervals to evaluate a control policy. To obtain the IRL Bellman equation for the LQT problem, note that for time interval .∆t > 0, the value function (4.7) satisfies ∫ ] 1 t+∆t −γ(τ −t) [ T X (t)C1T Q C1 X (t) + u T R u dτ e . V (X (t)) = 2 t + e−γ ∆t V (X (t + ∆t)) ,

(4.39)

where .C1 is defined in (4.21). Using (4.19) in (4.39) yields the LQT IRL Bellman equation ∫ .

X (t)P X (t) =

t+∆t

T

t

[ ] e−γ(τ −t) X T (t)C1T Q C1 X (t)+ u T R u dτ

+ e−γ ∆t X T (t + ∆t)P X (t + ∆t).

(4.40)

The first term of (4.40) is known as the integral reinforcement (Vrabie et al. 2009). Lemma 4.4 (Equivalence of the Lyapunov Equation (4.23) and the IRL Bellman Equation (4.40)) The LQT IRL Bellman equation (4.40) and the LQT Lyapunov equation (4.23) have the same positive semi-definite solution for value function. Proof Dividing both sides of (4.40) by .∆t and taking limit yields .

e−γ ∆t X T (t + ∆t)P X (t + ∆t) − X T (t)P X (t) ∆t→0 ∆t ∫ t+∆t −γ(τ −t) [ T ] e X (t)C1T Q C1 X (t) + u T R u dτ t = 0. + lim ∆t→0 ∆t lim

(4.41)

By L’Hopital’s rule, then ∫ t+∆t

.

[ ] e−γ(τ −t) X T (t)C1T Q C1 X (t) + u T R u dτ ∆t→0 ∆t T T T = X (t)C1 Q C1 X (t) + u R u lim

t

(4.42)

and also .

X T (t)P X (t) − e−γ ∆t X T (t + ∆t)P X (t + ∆t) ∆t→0 ∆t { −γ ∆t T X (t + ∆t)P X (t + ∆t) + e−γ ∆t X˙ T (t + ∆t)P X (t + ∆t) = lim −γe ∆t→0 } + e−γ ∆t X T (t + ∆t)P X˙ (t + ∆t) (4.43) = −γ X T (t)P X (t) + X˙ T (t)P X (t) + X T (t)P X˙ (t). lim

4.2 Integral Reinforcement Learning Policy Iteration for Linear …

81

Using the system dynamics (4.18) in (4.43) and putting (4.42) and (4.43) in (4.41) gives the Bellman equation (4.20). On the other hand, the Bellman equation (4.20) has the same value function solution as the Lyapunov equation (4.23) and this completes .◻ the proof. Using (4.40) instead of (4.23) in the policy evaluation step of Algorithm 4.1, the following IRL-based algorithm is obtained. Algorithm 4.2 Online IRL algorithm for solving the LQT problem 1. Initialization: Start with an admissible control input u 0 = K 1 0 X . 2. Policy evaluation: Given a control policy u i find P i using the Bellman equation X T (t)P i X (t) =

1 2



t+∆t t −γ ∆t

+e

e−γ(τ −t) [X T (t)C1T Q C1 X (t)+ (u i )T R (u i )] dτ X T (t + ∆t)P i X (t + ∆t).

(4.44)

3. Policy improvement: Update the control input using u i+1 = −R −1 B1T P i X.

(4.45)

The policy evaluation and improvement steps (4.44) and (4.45) are repeated until the policy improvement step no longer changes the present policy,‖thus convergence ‖ to the optimal controller is achieved. That is, until .‖ P i+1 − P i ‖ ≤ ε is satisfied, where is a small constant. Algorithm 4.2 does not require knowledge of . A and . F. Note that the method of Jiang and Jiang (2012) can be used to avoid knowledge of . B. According to Lemma 4.4, the IRL Bellman equation (4.44) in Algorithm 4.2 has the same value function solution as the Lyapunov equation (4.37) in Algorithm 4.1. Therefore, Algorithm 4.2 has the same convergence properties as Algorithm 4.1. Remark 4.7 As discussed in Vrabie et al. (2009), a solution . P i can be uniquely determined under some persistence of excitation (PE) condition. The PE condition can be satisfied by injecting a probing noise into the control input. This can cause biased results. However, it was shown in Lewis and Vamvoudakis (2011) that discounting the performance function can significantly reduce the deleterious effects of probing noise. Moreover, since the probing noise is known a priori, one can consider its effect into the IRL Bellman equation, as in Lee et al. (2012), to avoid affecting the convergence of the learning process. Remark 4.8 The proposed IRL Algorithm 4.2 has the same structure as the IRL algorithm in Vrabie et al. (2009) for solving the LQR problem. However, in the proposed algorithm, the augmented system state involves the reference trajectory in it and also a discount factor is used in the IRL Bellman equation of Algorithm 4.2. In

82

4 Integral Reinforcement Learning for Optimal Tracking

fact, using Assumption (4.2) and developing Lemmas (4.1) and (4.4) and Theorem (4.1) allows us to extend the IRL algorithm presented in Vrabie et al. (2009) to the LQT problem. Remark 4.9 The solution for . P i in the policy evaluation step (4.44) is generally carried out in the least squares (LS) sense. In fact, (4.44) is a scalar equation and P is a symmetric .nn matrix with .n(n + 1)/2 independent elements and therefore, at least .n(n + 1)/2 data sets are required before (4.44) can be solved using LS. Both batch LS and recursive LS methods can be used to perform policy evaluation step (4.44). Remark 4.10 The proposed policy iteration Algorithm 4.2 requires an initial admissible policy. If one knows that the open-loop system is stable a priori, then the initial policy can be chosen as .u = 0 and the admissibility of the initial policy is guaranteed without requiring any knowledge of . A. Otherwise, the initial admissible policy can be obtained by using some knowledge of A. Suppose the system (4.1) has a nominal model . A N satisfying . A = A N + ∆A, where .∆A is unknown part of . A. In this case, one can use robust control methods such as . H∞ control with the nominal model . A N to yield an initial admissible policy. Note that the learning process does not require any knowledge of . A. Finally, Algorithm 4.2 is a policy iteration algorithm and IRL value iteration can be used to avoid the need for an initial admissible policy.
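As noted in Remark 4.9, the policy-evaluation step (4.44) is solved in a least-squares sense from measured data. Below is a compact sketch of that regression for one iteration, using a simple quadrature for the integral reinforcement; the data layout and all names are our own illustrative choices (the constant 1/2 scaling used in the book's value-function convention only rescales the kernel and is omitted here).

```python
# Sketch: batch least-squares policy evaluation for the LQT IRL step (4.44).
import numpy as np

def evaluate_policy(X_traj, u_traj, t_grid, C1, Q, R, gamma, dt_r):
    """Estimate the kernel P^i of V(X) = X^T P X from data collected under u^i.
    X_traj: (T, n1) augmented states; u_traj: (T, m) inputs; t_grid: (T,) uniform times."""
    n1 = X_traj.shape[1]
    h = t_grid[1] - t_grid[0]
    steps = int(round(dt_r / h))
    rows, rhs = [], []
    for k in range(0, len(t_grid) - steps, steps):
        X0, X1 = X_traj[k], X_traj[k + steps]
        rows.append(np.kron(X0, X0) - np.exp(-gamma * dt_r) * np.kron(X1, X1))
        # integral reinforcement over [t_k, t_k + dt_r], trapezoid rule
        rho = 0.0
        for j in range(k, k + steps + 1):
            w = h * (0.5 if j in (k, k + steps) else 1.0)
            X, u = X_traj[j], u_traj[j]
            rho += w * np.exp(-gamma * (t_grid[j] - t_grid[k])) * (
                X @ C1.T @ Q @ C1 @ X + u @ R @ u)
        rhs.append(rho)
    vecP, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = vecP.reshape(n1, n1)
    return 0.5 * (P + P.T)          # symmetric value-function kernel
```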

4.2.4 Simulation Examples In this section, an example is provided to verify the correct performance of Algorithm 4.2 for solving the LQT problem. Consider the unstable continuous-time linear system [

0.5 1.5 .x ˙ (t) = 2.0 −2

]

[ ] 5 x (t) + u (t) , 1

[ ] y (t) = 1 0 x (t)

(4.46)

and suppose that the desired trajectory is generated by the command generator system y˙ = 0

. d

(4.47)

with the initial value . yd (0) = 3. So, the reference trajectory is a step input with amplitude 3. The performance index is given as (4.2) with . Q = 10 and . R = 1 and the discount factor is chosen as .γ = 0.1. The solution obtained by directly solving the LQT ARE (4.27) using known dynamics .(T, B1 ) is given by ⎡

⎤ 0.6465 0.0524 −0.6221 ∗ 0.0191 −0.0244 ⎦ . P = ⎣ 0.0524 −0.6221 −0.0244 1.7360 and hence using (4.26) the optimal control gain becomes

(4.48)

4.2 Integral Reinforcement Learning Policy Iteration for Linear … .

[ ] K 1 ∗ = −3.2851 −0.2813 3.1347 .

83

(4.49)

It is now assumed that the system drift dynamics and the command generator dynamics are unknown and Algorithm 4.2 is implemented online to solve the LQT problem for the system. The simulation was conducted using data obtained from the augmented system at every .0.05 s. A batch least squares problem is solved after 6 data samples and thus, the controller is updated every 0.3s. The initial control policy is chosen as. K 1 = [ −5.0 −1.0 −0.5] . Figure 4.1 shows how the norm of the difference between the optimal P matrix and the P matrix obtained by the online learning algorithm converges to zero. Also, Fig. 4.2 depicts the norm of the difference between the optimal control gain and the control gain obtained by the learning algorithm. From Figs. 4.1 and 4.2, it is clear that the value function and control gain parameters converge to their optimal values in (4.48) and (4.49) after four iterations. Thus, the solution of the LQT ARE is obtained at time .t = 1.2 s. Figure 4.3 shows the output and the desired trajectory during simulation. It can be seen that the output tracks the desired trajectory after the optimal control is found.

Fig. 4.1 Convergence of . P to . P ∗

Fig. 4.2 Convergence of . K 1 to . K 1∗

84

4 Integral Reinforcement Learning for Optimal Tracking

Fig. 4.3 Green curve is the reference trajectory and the blue curve is the output of the system

4.3 Online Actor–Critic Integral Reinforcement Learning for Optimal Tracking Control of Nonlinear Systems In this section, a review of the optimal tracking control problem (OTCP) for continuous-time nonlinear systems is first given. It is pointed out that the standard solution to the given problem requires complete knowledge of the system dynamics. It is also pointed out that the input constraints caused by the actuator saturation cannot be encoded into the standard performance function a priori. A new formulation of the OTCP is given in the next sections to overcome these shortcomings associated with online RL algorithms.

4.3.1 Standard Problem Formulation and Solution Consider the affine CT dynamical system describe by .

x(t) ˙ = f (x(t)) + g(x(t)) u(t),

(4.50)

where .x ∈ Rn is the measurable system state vector, . f (x) ∈ Rn is the drift dynamics of the system, .g(x) ∈ Rn×m is the input dynamics of the system, and .u(t) ∈ Rm is the control input. The elements of .u(t) are defined by .u i (t), i = 1, . . . , m. Assumption 4.3 It is assumed that . f (0) = 0 and . f (x) and .g(x) are lipschitz, and that the system (4.50) is controllable in the sense that there exists a continuous control on a set .Ω ⊆ Rn which stabilizes the system. Assumption 4.4 Bhasin et al. (2010); Vamvoudakis and Lewis (2010) The following assumptions are considered on the system dynamics 1. .∥ f (x)∥ ≤ b f ∥x∥ for some constant .b f . 2. .g(x) is bounded by a constant .bg , i.e., .∥g(x)∥ ≤ bg . Note that Assumption 4.4 requires . f (x) be lipschitz and . f (0) = 0 (see Assumption 4.3) which is a standard assumption to make sure the solution .x(t) of the system

4.3 Online Actor–Critic Integral Reinforcement Learning …

85

(4.50) is unique for any finite initial condition. On the other hand, although Assumption 4.4 restricts the considered class of nonlinear systems, many physical systems, such as robotic systems (Slotine and Li 1991) and aircraft systems (Sastry 2013) fulfill such a property. The goal of the optimal tracking problem is to find the optimal control policy so as to make the system (4.50) track a desired (reference) trajectory in an optimal manner by minimizing a predefined performance function. Moreover, the input must be constrained to remain within predefined limits. Define the tracking error as ∆

e (t) = x(t) − xd (t).

. d

(4.51)

Assumption 4.5 The desired reference trajectory .xd (t) is bounded and there exists a Lipschitz continuous command generator function .h d (xd (t)) ∈ Rn such that x˙ (t) = h d (xd (t))

. d

(4.52)

and .h d (0) = 0. Note that the reference dynamics needs only be stable in the sense of Lyapunov, not necessarily asymptotically stable. A general performance function leading to the optimal tracking controller can be expressed as ∫ .

V (ed (t), xd (t)) =



e−γ(τ −t) [E(ed (τ )) + U (u(τ ))] dτ ,

(4.53)

t

where . E(ed ) is a positive-definite function, .U (u) is a positive-definite integrand function, and .γ is the discount factor. Note that the performance function (4.53) contains both the tracking error cost and the whole control input energy cost. The following assumption is made in accordance with other work in the literature. The standard solution to the OTCP and its shortcomings are discussed as follows. In the existing standard solution to the OTCP, the desired or the steady-state part of the control input.u d (t) is obtained by assuming that the desired reference trajectory satisfies x˙ (t) = f (xd (t)) + g(xd (t)) u d (t).

. d

.

(4.54)

If the dynamics of the system is known and the inverse of the input dynamics g −1 (xd (t)) exists, the steady-state control input which guarantees perfect tracking is given by u (t) = g −1 (xd (t)) (x˙d (t) − f (xd (t)).

. d

(4.55)

86

4 Integral Reinforcement Learning for Optimal Tracking

On the other hand, the feedback part of the control is designed to stabilize the tracking error dynamics in an optimal manner by minimizing the following performance function ∫ ∞ ( T ) ed (τ )Q ed (τ ) + u Te R u e dτ , . V (ed (t)) = (4.56) t

where .u e (t) = u(t) − u d (t) is the feedback control input. The optimal feedback control solution .u ∗e (t) which minimizes (4.56) can be obtained by solving the Hamilton– Jacobi–Bellman equation related to this performance function (Lewis et al. 2012). The standard optimal solution to the OTCP is then constituted using obtained optimal feedback control .u ∗e (t). Remark 4.11 The optimal feedback part of the control input .u ∗e (t) can be learned using the IRL method (Vrabie et al. 2009) to obviate knowledge of the system drift dynamics. However, the exact knowledge of the system dynamics is required to find the steady-state part of the control input given by (4.55), which cancels the usefulness of the IRL technique. Remark 4.12 Because only the feedback part of the control input is obtained by minimizing the performance function (4.56), it is not possible to encode the input constraints into the optimization problem by using a nonquadratic performance function, as has been performed in the optimal regulation problem (Abu-Khalaf and Lewis 2005; Modares et al. 2013, 2014).

4.3.2 New Formulation for the Optimal Tracking Control Problem of Constrained-Input Systems A new formulation for the OTCP is presented in this section. In this formulation, both the steady-state and feedback parts of the control input are obtained simultaneously by minimizing a new discounted performance function in the form of (4.53). The input constraints are also encoded into the optimization problem a priori. A tracking HJB equation for the constrained OTCP is derived and an iterative offline IRL algorithm is presented to find its solution. This algorithm provides a basis to develop an online IRL algorithm for learning the optimal solution to the OTCP for partially-unknown systems, which is discussed later.

4.3.2.1

An Augmented System and a New Discounted Performance Function

An augmented system composed of the tracking error dynamics and the command generator dynamics is constructed. Then, based on this augmented system, a new

4.3 Online Actor–Critic Integral Reinforcement Learning …

87

discounted performance function for the OTCP is presented. It is shown that this performance function is identical to the performance function (4.53). The tracking error dynamics can be obtained by using (4.50) and (4.51), and the result is e˙ (t) = f (x(t)) − h d (xd (t)) + g(x(t)) u(t).

. d

(4.57)

Define the augmented system state .

[ ]T X (t) = edT (t) xdT (t) ∈ R2n .

(4.58)

Then, putting (4.52) and (4.57) together yields the augmented system .

X˙ (t) = F(X (t)) + G(X (t)) u(t),

(4.59)

where .u(t) = u(X (t)) and ] f (ed (t) + xd (t)) − h d (xd (t)) , . F(X (t)) = h d (xd (t)) [ ] g(ed (t) + xd (t)) . G(X (t)) = . 0 [

(4.60a) (4.60b)
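For implementation, the augmented dynamics (4.59)–(4.60b) can be assembled directly from the original drift $f$, input matrix function $g$, and the command generator $h_d$. A minimal sketch (names are ours) is:

```python
# Sketch: build the augmented dynamics F(X), G(X) of (4.59)-(4.60b) from f, g, h_d.
import numpy as np

def make_augmented_dynamics(f, g, h_d, n):
    """f, g, h_d are callables for the original system and the command generator;
    X = [e_d; x_d] is the 2n-dimensional augmented state."""
    def F(X):
        e_d, x_d = X[:n], X[n:]
        return np.concatenate([f(e_d + x_d) - h_d(x_d), h_d(x_d)])
    def G(X):
        e_d, x_d = X[:n], X[n:]
        gx = g(e_d + x_d)
        return np.vstack([gx, np.zeros((n, gx.shape[1]))])
    return F, G
```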

Based on the augmented system (4.59), we introduce the following discounted performance function for the OTCP ∫ ∞ ( ) . V (X (t)) = e−γ(τ −t) X T (τ )Q T X (τ ) + U (u(τ )) dτ , (4.61) t

where .γ > 0 is the discount factor, [ .

QT =

] Q0 , Q > 0, 0 0

(4.62)

In addition, .U (u) is a positive-definite integrand function defined as ∫

u

U (u) = 2

.

(

)T

λ β −1 (v/λ)

R dv,

(4.63)

0

where .v ∈ Rm , .β(.) = tanh(.), .λ is the saturating bound for the actuators and . R = diag(r 1 , . . . , r m ) > 0 is assumed to be diagonal for simplicity of analysis. This nonquadratic performance function is used in the optimal regulation problem of constrained-input systems to deal with the input constraints (Abu-Khalaf and Lewis 2005; Modares et al. 2013, 2014). In fact, using this nonquadratic performance function, the following constraints are always satisfied

88

4 Integral Reinforcement Learning for Optimal Tracking .

|u i (t)| ≤ λ i = 1, . . . , m.

(4.64)
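The nonquadratic integrand (4.63) has, for diagonal $R$, the closed form that reappears in (4.71). The sketch below (function name and the numerical clipping safeguard are ours) evaluates it channel by channel and can be cross-checked against numerical quadrature of (4.63).

```python
# Sketch: evaluate U(u) of (4.63) for diagonal R via the closed form
# 2*lambda*r_i*u_i*atanh(u_i/lambda) + lambda^2*r_i*ln(1 - (u_i/lambda)^2), cf. (4.71).
import numpy as np

def input_penalty(u, R_diag, lam):
    u = np.asarray(u, dtype=float)
    r = np.asarray(R_diag, dtype=float)
    z = np.clip(u / lam, -1 + 1e-12, 1 - 1e-12)   # stay strictly inside the bound
    return float(np.sum(2.0 * lam * r * u * np.arctanh(z)
                        + lam ** 2 * r * np.log(1.0 - z ** 2)))
```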

Definition 4.2 (Admissible Control) A control policy .μ(X ) is said to be admissible with respect to (4.61) on .Ω, denoted by .μ(X ) ∈ π(Ω), if .μ(X ) is continuous on .Ω, .μ(0) = 0, .u(t) = μ(X ) stabilizes the error dynamics (4.57) on .Ω, and . V (X ) is finite .∀x ∈ Ω. Note that from (4.59)–(4.60b) it is clear that, as expected, the command generator dynamics are not under our control. Since they are assumed be bounded, the admissibility of the control input implies the boundness of the states of the augmented system. Remark 4.13 Note that for the first term under the integral we have . X T Q T X = edT Q ed . Therefore, this performance function is identical to the performance function (4.53) with . E(ed (τ )) = edT Q ed and .U (u) given in (4.63). Remark 4.14 The use of the discount factor in the performance function (4.61) is essential. This is because the control input contains a steady-state part which in general makes (4.61) unbounded without using a discount factor, and therefore the meaning of minimality is lost. Remark 4.15 Note that both steady-state and feedback parts of the control input are obtained simultaneously by minimizing the discounted performance function (4.61) along the trajectories of the augmented system (4.59). As is shown in the subsequent sections, this formulation enables us to extend the IRL technique to find the solution to the OTCP without requiring the augmented system dynamics . F. That is, both the system drift dynamics . f and the command generator dynamics .h d are not required

4.3.3 Tracking Bellman and Hamilton–Jacobi–Bellman Equations In this subsection, the optimal tracking Bellman equation and the optimal tracking HJB equation related to the defined performance function (4.61) are given. Using Leibniz’s rule to differentiate .V along the augmented system trajectories (4.59), the following tracking Bellman equation is obtained ˙ (X ) = .V



∞ t

∂ −γ(τ −t) T e (X Q T X + U (u)) dτ − X T Q T X − U (u). ∂t

(4.65)

Using (4.63) in (4.65) and noting that the first term in the right-hand side of (4.65) is equal to .γV (X ), gives

X^T Q_T X + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V(X) + \dot{V}(X) = 0  (4.66)

or, by defining the Hamiltonian function,

H(X, u, \nabla V) = X^T Q_T X + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V(X) + \nabla V^T(X) \left( F(X) + G(X)\, u(X) \right) = 0,  (4.67)

where \nabla V(X) = \partial V(X)/\partial X \in \mathbb{R}^{2n}. Let V^*(X) be the optimal cost function defined as

V^*(X(t)) = \min_{u \in \pi(\Omega)} \int_t^{\infty} e^{-\gamma(\tau-t)} \left[ X^T Q_T X + U(u) \right] d\tau.  (4.68)

Then, based on (4.67), V^*(X) satisfies the tracking HJB equation

H(X, u^*, \nabla V^*) = X^T Q_T X + 2 \int_0^{u^*} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V^*(X) + \nabla V^{*T}(X) \left( F(X) + G(X)\, u^*(X) \right) = 0.  (4.69)

The optimal control input for the given problem is obtained by employing the stationarity condition (Lewis et al. 2012) on the Hamiltonian (4.67). The result is

u^*(X) = \arg\min_{u \in \pi(\Omega)} \left[ H(X, u, \nabla V^*) \right] = -\lambda \tanh\!\left( (1/2\lambda) R^{-1} G^T(X) \nabla V^*(X) \right).  (4.70)

This control is within its permitted bounds ±λ. The nonquadratic cost (4.63) for u^* is

U(u^*) = 2 \int_0^{u^*} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv = 2\lambda \left( \tanh^{-1}(u^*/\lambda) \right)^T R\, u^* + \lambda^2 \bar{R} \ln\!\left( \mathbf{1} - (u^*/\lambda)^2 \right),  (4.71)

where \mathbf{1} is a column vector having all of its elements equal to one, and \bar{R} = [r_1, \ldots, r_m] \in \mathbb{R}^{1 \times m}. Putting (4.70) in (4.71) results in

U(u^*) = \lambda \nabla V^{*T}(X)\, G(X) \tanh(D^*) + \lambda^2 \bar{R} \ln\!\left( \mathbf{1} - \tanh^2(D^*) \right),  (4.72)

where D^* = (1/2\lambda) R^{-1} G(X)^T \nabla V^*(X). Substituting u^* from (4.70) back into (4.69) and using (4.72), the tracking HJB Eq. (4.69) becomes


H(X, u^*, \nabla V^*) = X^T Q_T X - \gamma V^*(X) + \nabla V^{*T}(X) F(X) + \lambda^2 \bar{R} \ln(1 - \tanh^2(D^*)) = 0.  (4.73)
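As a numerical cross-check of the closed-form expressions (4.70)–(4.72) (a sketch only; the dimensions, weights, and value gradient below are arbitrary placeholders), one can verify that the two evaluations of U(u^*) coincide and that the policy respects the bound (4.64):

import numpy as np

rng = np.random.default_rng(0)
lam = 0.25                          # hypothetical saturation bound
m, n2 = 2, 4                        # hypothetical input and augmented-state dimensions
R = np.diag([1.0, 2.0])             # diagonal R as assumed in (4.63)
Rbar = np.diag(R)                   # row vector [r_1, ..., r_m]
G = rng.standard_normal((n2, m))    # hypothetical G(X) evaluated at some X
gradV = rng.standard_normal(n2)     # hypothetical value gradient (nabla V*)

D = (1.0 / (2.0 * lam)) * np.linalg.solve(R, G.T @ gradV)   # D* as in (4.72)
u = -lam * np.tanh(D)                                       # constrained policy (4.70)
assert np.all(np.abs(u) <= lam)                             # bound (4.64) holds by construction

U_471 = 2*lam*np.arctanh(u/lam) @ (R @ u) + lam**2 * Rbar @ np.log(1 - (u/lam)**2)
U_472 = lam * gradV @ G @ np.tanh(D) + lam**2 * Rbar @ np.log(1 - np.tanh(D)**2)
print(U_471, U_472)    # the two evaluations of U(u*) agree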

To solve the OTCP, one solves the HJB Eq. (4.73) for the optimal value V^*; the optimal control is then given as a feedback u(V^*) in terms of the HJB solution using (4.70). Now a formal proof is given that the solution to the tracking HJB equation for constrained-input systems provides the optimal tracking control solution and, when the discount factor is zero, locally asymptotically stabilizes the error dynamics (4.57). The following key fact is instrumental.

Lemma 4.5 For any admissible control policy u(X), let V(X) ≥ 0 be the corresponding solution to the Bellman equation (4.67). Define u^*(X) = u(V(X)) by (4.70) in terms of V(X). Then

H(X, u, \nabla V) = H(X, u^*, \nabla V) + \nabla V^T(X) G(X)(u - u^*) + 2 \int_{u^*}^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv.  (4.74)

Proof The Hamiltonian function is

H(X, u, \nabla V) = X^T Q_T X + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V(X) + \nabla V^T(X) \left( F(X) + G(X)\, u(X) \right).  (4.75)

Adding and subtracting 2 \int_0^{u^*} (\lambda \tanh^{-1}(v/\lambda))^T R \, dv and \nabla V^T(X) G(X) u^*(X) in (4.75) yields

H(X, u, \nabla V) = X^T Q_T X + 2 \int_0^{u^*} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V(X) + \nabla V^T(X) \left( F(X) + G(X)\, u^*(X) \right) + \nabla V^T(X) G(X) \left( u(X) - u^*(X) \right) + 2 \int_{u^*}^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv,  (4.76)

which gives (4.74) and completes the proof. ◻

Theorem 4.3 Consider the optimal tracking control problem for the augmented system (4.59) with performance function (4.61). Suppose that V^* is a smooth positive-definite solution to the tracking HJB Eq. (4.69). Define the control u^* = u(V^*(X)) as given by (4.70). Then, u^* minimizes the performance index (4.61) over all admissible controls constrained to |u_i| ≤ λ, i = 1, ..., m, and the optimal value on [0, ∞) is given by V^*(X(0)). Moreover, when the discount factor is zero, the control input u^* makes the error dynamics (4.57) asymptotically stable.


Proof We first show the optimality of the HJB solution. Note that for any continuous value function V(X), one can write the performance function (4.61) as

V(X(0), u) = \int_0^{\infty} e^{-\gamma\tau} [X^T Q_T X + U(u)] \, d\tau + \int_0^{\infty} \frac{d}{dt}\!\left( e^{-\gamma\tau} V(X) \right) d\tau + V(X(0))
           = \int_0^{\infty} e^{-\gamma\tau} [X^T Q_T X + U(u)] \, d\tau + \int_0^{\infty} e^{-\gamma\tau} \left[ \nabla V(X)^T (F + G u) - \gamma V(X) \right] d\tau + V(X(0))
           = \int_0^{\infty} e^{-\gamma\tau} H(X, u, \nabla V) \, d\tau + V(X(0)).  (4.77)

Now, suppose V(X) satisfies the HJB Eq. (4.73). Then H(X, u^*, \nabla V^*) = 0 and (4.74) yields

V(X(0), u) = \int_0^{\infty} e^{-\gamma\tau} \left( 2 \int_{u^*}^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv + \nabla V^{*T}(X) G(X) (u - u^*) \right) d\tau + V^*(X(0)).  (4.78)

To prove that u^* is the optimal control solution and the optimal value is V^*(X(0)), it remains to show that the integral term on the right-hand side of the above equation is bigger than zero for all u ≠ u^* and attains its minimum value, i.e., zero, at u = u^*. That is, to show that

H = 2 \int_{u^*}^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv + \nabla V^{*T}(X) G(X) (u - u^*)  (4.79)

is bigger than or equal to zero. To show this, note that using (4.70) one has

\nabla V^{*T}(X) G(X) = -2 \left( \lambda \tanh^{-1}(u^*/\lambda) \right)^T R.  (4.80)

Substituting (4.80) in (4.79) and noting \phi^{-1}(\cdot) = (\lambda \tanh^{-1}(\cdot/\lambda))^T yields

H = 2\, \phi^{-1}(u^*)\, R\, (u^* - u) - 2 \int_{u}^{u^*} \phi^{-1}(v)\, R \, dv.  (4.81)

As R is symmetric positive definite, one can rewrite it as R = \Lambda^T \Sigma \Lambda, where \Sigma is a triangular matrix with its values being the singular values of R and \Lambda is an orthogonal symmetric matrix. Substituting for R in (4.81) and applying the coordinate change u = \Lambda^{-1} \bar{u}, one has


H = 2\, \phi^{-1}(\Lambda^{-1} \bar{u}^*) \Lambda^T \Sigma\, (\bar{u}^* - \bar{u}) - 2 \int_{\bar{u}}^{\bar{u}^*} \phi^{-1}(\Lambda^{-1}\xi) \Lambda^T \Sigma \, d\xi
  = 2\, \beta(\bar{u}^*)\, \Sigma\, (\bar{u}^* - \bar{u}) - 2 \int_{\bar{u}}^{\bar{u}^*} \beta(\xi)\, \Sigma \, d\xi,  (4.82)

where \beta(\bar{u}) = \phi^{-1}(\Lambda^{-1}\bar{u}) \Lambda^T. Note that \beta is monotone odd because \phi^{-1} is monotone odd. Since \Sigma is a triangular matrix, one can decouple the transformed input vector as

H = 2 \sum_{k=1}^{m} \Sigma_{kk} \left[ \beta(\bar{u}^*_k)(\bar{u}^*_k - \bar{u}_k) - \int_{\bar{u}_k}^{\bar{u}^*_k} \beta(\xi_k) \, d\xi_k \right],  (4.83)

where \Sigma_{kk} > 0, k = 1, \ldots, m, due to R > 0. To complete the proof it remains to show that the term

L_k = \beta(\bar{u}^*_k)(\bar{u}^*_k - \bar{u}_k) - \int_{\bar{u}_k}^{\bar{u}^*_k} \beta(\xi_k) \, d\xi_k  (4.84)

is bigger than zero for u^* ≠ u and is zero for u^* = u. To show this, first assume that \bar{u}^*_k > \bar{u}_k. Then, using the mean value theorem for integrals, there exists a u_k ∈ (\bar{u}_k, \bar{u}^*_k) such that

\int_{\bar{u}_k}^{\bar{u}^*_k} \beta(\xi_k) \, d\xi_k = \beta(u_k)(\bar{u}^*_k - \bar{u}_k) < \beta(\bar{u}^*_k)(\bar{u}^*_k - \bar{u}_k),  (4.85)

where the inequality is obtained by the fact that \beta is monotone odd, and hence \beta(u_k) < \beta(\bar{u}^*_k). Therefore, L_k > 0 for \bar{u}^*_k > \bar{u}_k. Now suppose that \bar{u}^*_k < \bar{u}_k. Then, using the mean value theorem for integrals, there exists a u_k ∈ (\bar{u}^*_k, \bar{u}_k) such that

\int_{\bar{u}_k}^{\bar{u}^*_k} \beta(\xi_k) \, d\xi_k = -\int_{\bar{u}^*_k}^{\bar{u}_k} \beta(\xi_k) \, d\xi_k = -\beta(u_k)(\bar{u}_k - \bar{u}^*_k) < -\beta(\bar{u}^*_k)(\bar{u}_k - \bar{u}^*_k) = \beta(\bar{u}^*_k)(\bar{u}^*_k - \bar{u}_k),  (4.86)

where the inequality is obtained by the fact that \beta is monotone odd, and hence \beta(u_k) > \beta(\bar{u}^*_k). Therefore L_k > 0 also for \bar{u}^*_k < \bar{u}_k. This completes the proof of optimality. ◻

Now the stability of the error dynamics is shown. Note that for any continuous value function V(X), by differentiating V(X) along the augmented system trajectories, one has

\frac{d V(X)}{dt} = \frac{\partial V(X)}{\partial t} + \frac{\partial V(X)^T}{\partial X} \dot{X} = \frac{\partial V(X)^T}{\partial X} \left( F(X) + G(X) u \right),  (4.87)

so that

H(X, u, \nabla V) = \frac{d V(X)}{dt} - \gamma V(X) + X^T Q_T X + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv.  (4.88)

Suppose now that V(X) satisfies the HJB equation H(X, u^*, \nabla V^*) = 0 and is positive definite. Then, substituting u = u^* gives

\frac{d V(X)}{dt} - \gamma V(X) + X^T Q_T X + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv = 0  (4.89)

or equivalently

\frac{d V(X)}{dt} - \gamma V(X) = -X^T Q_T X - 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv.  (4.90)

Multiplying both sides of (4.90) by e^{-\gamma t} gives

\frac{d}{dt}\!\left( e^{-\gamma t} V(X) \right) = e^{-\gamma t} \left( -X^T Q_T X - 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv \right) \le 0.  (4.91)

Equation (4.91) shows that the tracking error is bounded for the optimal solution, but its asymptotic stability cannot be concluded. However, if γ = 0 (which can be chosen only if the reference input goes to zero), LaSalle's extension can be used to show that the tracking error is locally asymptotically stable. In fact, based on LaSalle's extension, the augmented state X = [e_d^T, x_d^T]^T goes to a region of R^{2n} wherein \dot{V} = 0. Considering that X^T Q_T X = e_d^T Q e_d with Q > 0, \dot{V} = 0 only if e_d = 0 and u = 0. Since u = 0 also requires that e_d = 0, for γ = 0 the tracking error is locally asymptotically stable with Lyapunov function V(X) > 0. This confirms that, in the limit as the discount factor goes to zero, the control input u^* makes the error dynamics (4.57) asymptotically stable.

Note that although for γ ≠ 0 (which is essential if the reference trajectory does not go to zero) only boundedness of the tracking error is guaranteed for the optimal solution, one can make the tracking error as small as desired by choosing a small discount factor and/or a large Q. To see this, assume that the tracking error is nonzero. Then, considering that X^T Q_T X = e_d^T Q e_d with Q > 0, the derivative of the Lyapunov function in (4.91) is negative, so the tracking error decreases until the exponential term e^{-\gamma t} becomes negligible and drives the derivative of the Lyapunov function to zero. After that, one can only conclude that the tracking error does not increase. The larger Q is, the faster the tracking error decreases and the smaller the achievable tracking error. Likewise, the smaller the discount factor is, the more slowly the derivative of the Lyapunov function decays to zero and the smaller the achievable tracking error. Consequently, by choosing a smaller discount factor and/or a larger Q one


can make the tracking error as small as desired before the value of e^{-\gamma t} becomes very small.

Remark 4.16 The use of discounted cost functions is common in optimal regulation control problems, and the same conclusion can be drawn for asymptotic stability of the system state in the optimal regulation problem as is drawn here for asymptotic stability of the tracking error in the OTCP. However, the discount factor is a design parameter and, as shown for optimal regulation problems in the literature, it can be chosen small enough to ensure that the system state converges to a very small region around zero. Simulation results in Sect. 4.4 confirm this conclusion for the OTCP.

4.3.4 Offline Policy Iteration Algorithms

The tracking HJB Eq. (4.73) is a nonlinear partial differential equation which is extremely difficult to solve. In this subsection, two iterative offline policy iteration (PI) algorithms are presented for solving this equation. An IRL-based offline PI algorithm is given which is the basis for our online IRL algorithm presented in the next section. Note that the tracking HJB Eq. (4.73) is nonlinear in the value function derivative \nabla V^*, while the tracking Bellman equation (4.66) is linear in the cost function derivative \nabla V. Therefore, finding the value of a fixed control policy by solving (4.66) is easier than finding the optimal value function by solving (4.73). This is the motivation for introducing an iterative PI algorithm for approximating the tracking HJB solution. The PI algorithm performs the following two-step iterations to find the optimal control policy.

Algorithm 4.3 Offline PI algorithm
1. Policy evaluation: Given a control input u^i(X), find V^i(X) using the following Bellman equation

X^T Q_T X + 2 \int_0^{u^i} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv - \gamma V^i(X) + (\nabla V^i(X))^T \left( F(X) + G(X)\, u^i \right) = 0.  (4.92)

2. Policy improvement: Update the control policy using

u^{i+1}(X) = -\lambda \tanh\!\left( \frac{1}{2\lambda} R^{-1} G^T(X) \nabla V^i(X) \right).  (4.93)


Algorithm 4.3 is an extension of the offline PI algorithm in Abu-Khalaf and Lewis (2005) to the optimal tracking problem. The following theorem shows that this algorithm converges to the optimal solution of the HJB Eq. (4.73).

Theorem 4.4 If u^0 ∈ π(Ω), then u^i ∈ π(Ω), ∀i ≥ 1. Moreover, u^i converges to u^* and V^i converges to V^* uniformly on Ω.

Proof See Abu-Khalaf and Lewis (2005) and Liu et al. (2013) for the same proof. ◻

The tracking Bellman equation (4.92) requires complete knowledge of the system dynamics. In order to find an equivalent formulation of the tracking Bellman equation that does not involve the dynamics, we use the IRL idea introduced in Vrabie et al. (2009) for the optimal regulation problem. Note that for any integral reinforcement interval T > 0, the value function (4.61) satisfies

V(X(t-T)) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + U(u(\tau)) \right] d\tau + e^{-\gamma T} V(X(t)).  (4.94)

This IRL form of the tracking Bellman equation does not involve the system dynamics.

Lemma 4.6 The IRL tracking Bellman equation (4.94) and the tracking Bellman equation (4.66) are equivalent and have the same positive semi-definite solution for the value function.

Proof See Vrabie et al. (2009) and Liu et al. (2013) for the same proof. ◻

Using the IRL tracking Bellman equation (4.94), the following IRL-based PI algorithm can be used to solve the tracking HJB Eq. (4.73) using only partial knowledge of the system dynamics.

Algorithm 4.4 Offline IRL algorithm
1. Policy evaluation: Given a control input u^i(X), find V^i(X) using the tracking Bellman equation

V^i(X(t-T)) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + U(u(\tau)) \right] d\tau + e^{-\gamma T} V^i(X(t)).  (4.95)

2. Policy improvement: Update the control policy using

u^{i+1}(X) = -\lambda \tanh\!\left( \frac{1}{2\lambda} R^{-1} G^T(X) \nabla V^i(X) \right).  (4.96)
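To make the mechanics of Algorithm 4.4 concrete, the following sketch runs the IRL-based policy iteration on a hypothetical one-dimensional tracking problem with a linear-in-parameters critic solved by least squares; the plant, reference model, basis functions, and all numerical values are illustrative assumptions, not part of the text.

import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical plant x_dot = -x + u, command generator xd_dot = 0 (constant reference),
# augmented state X = [e_d, x_d] with e_d = x - x_d.
gamma, T, lam, Q, R = 0.2, 0.1, 1.0, 5.0, 1.0
G = np.array([1.0, 0.0])                       # input only enters the error dynamics

def F(X):                                      # augmented drift (4.60a) for this toy system
    ed, xd = X
    return np.array([-(ed + xd), 0.0])         # f(x) - h_d(x_d) = -x, with h_d = 0

phi  = lambda X: np.array([X[0]**2, X[0]*X[1], X[1]**2])
dphi = lambda X: np.array([[2*X[0], 0.0], [X[1], X[0]], [0.0, 2*X[1]]])  # l x 2n gradient

def policy(w):                                 # policy improvement (4.96)
    return lambda X: -lam * np.tanh((1.0/(2*lam*R)) * G @ (dphi(X).T @ w))

def U(u):                                      # nonquadratic cost (4.63), scalar input
    return 2*lam*np.arctanh(u/lam)*R*u + lam**2*R*np.log(1 - (u/lam)**2)

def evaluate(u_fun, X0s):                      # policy evaluation (4.95) by least squares
    A, b = [], []
    for X0 in X0s:
        def ode(t, z):                         # z = [X, discounted running cost]
            X = z[:2]; u = u_fun(X)
            return np.r_[F(X) + G*u, np.exp(-gamma*t)*(Q*X[0]**2 + U(u))]
        sol = solve_ivp(ode, [0.0, T], np.r_[X0, 0.0], rtol=1e-8, atol=1e-10)
        XT, rho = sol.y[:2, -1], sol.y[2, -1]
        A.append(phi(X0) - np.exp(-gamma*T)*phi(XT))
        b.append(rho)
    return np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]

X0s = [np.array([e, x]) for e in np.linspace(-2, 2, 9) for x in np.linspace(-1, 1, 5)]
w = np.zeros(3)
for i in range(10):                            # PI: evaluate the current policy, then improve
    w = evaluate(policy(w), X0s)
print("critic weights:", w)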


4.3.5 Online Actor–Critic-Based Integral Reinforcement Learning

In this subsection, an online solution to the tracking HJB Eq. (4.73) is presented which only requires partial knowledge of the system dynamics. The learning structure uses value function approximation (Werbos 1992; Finlayson 1990) with two NNs, namely an actor and a critic. Instead of sequentially updating the critic and actor NNs, as in Algorithm 4.4, both are updated simultaneously in real time. We call this synchronous online PI.

Critic NN and Value Function Approximation

Assuming the value function is a smooth function, according to the Weierstrass high-order approximation theorem (Finlayson 1990), there exists a single-layer neural network (NN) such that the solution V(X) and its gradient \nabla V(X) can be uniformly approximated as

V(X) = W_1^T \phi(X) + \varepsilon_v(X),  (4.97a)

\nabla V(X) = \nabla\phi(X)^T W_1 + \nabla\varepsilon_v(X),  (4.97b)

where \phi(X) \in \mathbb{R}^l is a suitable basis function vector, \varepsilon_v(X) is the approximation error, W_1 \in \mathbb{R}^l is a constant parameter vector, and l is the number of neurons. Equation (4.97a) defines a critic NN with weights W_1. It is known that the NN approximation error and its gradient are bounded over the compact set Ω, i.e., ‖\varepsilon_v(X)‖ ≤ b_\varepsilon and ‖\nabla\varepsilon_v(X)‖ ≤ b_{\varepsilon x} (Hornik et al. 1990).

Assumption 4.6 The critic NN activation functions and their gradients are bounded, i.e., ‖\phi(X)‖ ≤ b_\phi and ‖\nabla\phi(X)‖ ≤ b_{\phi x}.

The critic NN (4.97a) is used to approximate the value function related to the IRL tracking Bellman equation (4.94). Using the value function approximation (4.97a) in the IRL tracking Bellman equation (4.94) yields

\varepsilon_B(t) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + 2 \int_0^{u} (\lambda \tanh^{-1}(v/\lambda))^T R \, dv \right] d\tau + W_1^T \Delta\phi(X(t)),  (4.98)

where

\Delta\phi(X(t)) = e^{-\gamma T} \phi(X(t)) - \phi(X(t-T))  (4.99)

and \varepsilon_B is the tracking Bellman equation error due to the NN approximation error. Under Assumption 4.6, this approximation error is bounded on the compact set Ω. That is, there exists a constant bound \varepsilon_{max} for \varepsilon_B such that \sup_t ‖\varepsilon_B‖ ≤ \varepsilon_{max}.

We now present the tuning and convergence of the critic NN weights for a fixed control policy, in effect designing an observer for the unknown value function. As


the ideal critic NN weight vector W_1, which provides the best approximate solution to the tracking Bellman equation (4.98), is unknown, it is approximated in real time as

\hat{V}(X) = \hat{W}_1^T \phi(X),  (4.100)

where \hat{W}_1 is the current estimate of W_1. Therefore, the approximate IRL tracking Bellman equation becomes

e_B(t) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + 2 \int_0^{u} (\lambda \tanh^{-1}(v/\lambda))^T R \, dv \right] d\tau + \hat{W}_1^T \Delta\phi(X(t)).  (4.101)

Equation (4.101) can be written as

e_B(t) = \hat{W}_1^T(t) \Delta\phi(X(t)) + p(t),  (4.102)

where

p(t) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + 2 \int_0^{u} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv \right] d\tau  (4.103)

is the integral reinforcement reward. The tracking Bellman error e_B in Eqs. (4.101) and (4.102) is the continuous-time counterpart of the temporal difference (TD) error. The problem of finding the value function is now converted to adjusting the critic NN weights such that the TD error e_B is minimized. Consider the objective function

E_B = \frac{1}{2} e_B^2.  (4.104)

From (4.104) and using the chain rule, the gradient descent algorithm for E_B is given by

\dot{\hat{W}}_1 = -\frac{\alpha_1}{(1 + \Delta\phi^T \Delta\phi)^2} \frac{\partial E_B}{\partial \hat{W}_1} = -\alpha_1 \frac{\Delta\phi}{(1 + \Delta\phi^T \Delta\phi)^2} e_B,  (4.105)

where \alpha_1 > 0 is the learning rate and (1 + \Delta\phi^T \Delta\phi)^2 is used for normalization. Note that the square of the denominator, i.e., (1 + \Delta\phi^T \Delta\phi)^2, is used in (4.105) to assure the stability of the critic weight error \tilde{W}_1. Define

\bar{\Delta\phi} = \frac{\Delta\phi}{1 + \Delta\phi^T \Delta\phi}.  (4.106)
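In a sampled-data implementation, the reward p(t) in (4.103), the regressor Δφ in (4.99), and one step of the normalized gradient law (4.105) could be computed from measured data roughly as follows (a sketch; the helper arguments phi, Q_T, R, and the trajectory samples are assumed to be supplied by the user, and the integration step is absorbed into the learning rate):

import numpy as np

def critic_step(W1_hat, phi, Xs, us, taus, Q_T, R, lam, gamma, alpha1):
    # closed-form nonquadratic integrand cost for diagonal R, cf. (4.71)
    Rbar = np.diag(R)
    U = lambda u: 2*lam*np.arctanh(u/lam) @ (R @ u) + lam**2 * Rbar @ np.log(1-(u/lam)**2)
    run = np.array([X @ Q_T @ X + U(u) for X, u in zip(Xs, us)])
    w = np.exp(-gamma*(taus - taus[0]))                                  # e^{-gamma(tau-t+T)}
    p = np.sum(0.5*(w[1:]*run[1:] + w[:-1]*run[:-1])*np.diff(taus))      # reward p(t), (4.103)
    dphi = np.exp(-gamma*(taus[-1]-taus[0]))*phi(Xs[-1]) - phi(Xs[0])    # Delta phi, (4.99)
    e_B = W1_hat @ dphi + p                                              # TD error, (4.102)
    return W1_hat - alpha1 * dphi / (1.0 + dphi @ dphi)**2 * e_B         # gradient step, (4.105)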


The proof of convergence of the critic NN weights is given in the following theorem.

Theorem 4.5 Let u be any admissible bounded control policy and consider the adaptive law (4.105) for tuning the critic NN weights. If \bar{\Delta\phi} in (4.106) is persistently exciting (PE), i.e., if there exist \gamma_1 > 0 and \gamma_2 > 0 such that, ∀t > 0,

\gamma_1 I \le \int_t^{t+T_1} \bar{\Delta\phi}(\tau)\, \bar{\Delta\phi}^T(\tau) \, d\tau \le \gamma_2 I,  (4.107)

then:
1. For \varepsilon_B(t) = 0 (no reconstruction error), the critic weight estimation error converges to zero exponentially fast.
2. For bounded reconstruction error, i.e., ‖\varepsilon_B(t)‖ < \varepsilon_{max}, the critic weight estimation error converges exponentially fast to a residual set.

Proof Using the IRL tracking Bellman equation (4.98), one has

\int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ 2 \int_0^{u} (\lambda \tanh^{-1}(v/\lambda))^T R \, dv + X^T(\tau) Q_T X(\tau) \right] d\tau = -W_1^T \Delta\phi(X(t)) + \varepsilon_B(t).  (4.108)

Substituting (4.108) in (4.101), the tracking Bellman equation error becomes

e_B(t) = \tilde{W}_1^T(t) \Delta\phi(t) + \varepsilon_B(t),  (4.109)

where \tilde{W}_1 = W_1 - \hat{W}_1 is the critic weight estimation error. Using (4.109) in (4.105) and denoting m = 1 + \Delta\phi^T \Delta\phi, the critic weight estimation error dynamics becomes

\dot{\tilde{W}}_1(t) = -\alpha_1 \bar{\Delta\phi}(t)\, \bar{\Delta\phi}^T(t)\, \tilde{W}_1(t) + \alpha_1 \frac{\bar{\Delta\phi}(t)}{m(t)} \varepsilon_B(t).  (4.110)

This estimation error is the same as the critic weight estimation error obtained in Vamvoudakis and Lewis (2010), and the remainder of the proof is identical to the proof of Theorem 4.3 in Vamvoudakis and Lewis (2010). ◻

Remark 4.17 The critic estimation error Eq. (4.110) implies that \bar{\Delta\phi}^T \tilde{W}_1 is bounded. However, in general the boundedness of \bar{\Delta\phi}^T \tilde{W}_1 does not imply the boundedness of \tilde{W}_1. Theorem 4.5 shows that if the PE condition (4.107) is satisfied, then the boundedness of \bar{\Delta\phi}^T \tilde{W}_1 implies the boundedness of the state \tilde{W}_1. We shall use this property in the proof of Theorem 4.6.


Synchronous Actor–Critic-Based Integral Reinforcement Learning Algorithm

Now an online IRL algorithm is given which involves simultaneous or synchronous tuning of the actor and critic NNs to find the optimal value function and control policy related to the OTCP adaptively. Assume that the optimal value function solution to the tracking HJB equation is approximated by the critic NN in (4.97a). Then, using (4.97b) in (4.70), the optimal policy is obtained by

u = -\lambda \tanh\!\left( (1/2\lambda) R^{-1} G^T (\nabla\phi^T W_1 + \nabla\varepsilon_v) \right).  (4.111)

To see the effect of the error \nabla\varepsilon_v on the tracking HJB equation, note that using integration by parts we have

\int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \dot{\phi} \, d\tau = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \nabla\phi (F + G u) \, d\tau = \Delta\phi(X) + \gamma \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \phi(X) \, d\tau,  (4.112)

or equivalently

\Delta\phi(X) = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \nabla\phi(X)(F + G u) \, d\tau - \gamma \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \phi(X) \, d\tau.  (4.113)

Also, note that U(u) in (4.71) for the optimal control input given by (4.111) becomes

U(u) = 2 \int_0^{u} (\lambda \tanh^{-1}(v/\lambda))^T R \, dv = W_1^T \nabla\phi\, G \lambda \tanh(D + 0.5\lambda^{-1} R^{-1} G^T \nabla\varepsilon_v) + \lambda^2 \bar{R} \ln\!\left( 1 - \tanh^2(D + 0.5\lambda^{-1} R^{-1} G^T \nabla\varepsilon_v) \right).  (4.114)

Using (4.113) and (4.114) for the third and second terms of (4.98), respectively, the following tracking HJB equation is obtained

\int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left( X^T Q_T X - \gamma W_1^T \phi + W_1^T \nabla\phi\, F + \lambda^2 \bar{R} \ln(1 - \tanh^2(D)) + \varepsilon_{HJB} \right) d\tau = 0,  (4.115)

where D = (1/2\lambda) R^{-1} G^T \nabla\phi^T W_1 and \varepsilon_{HJB}, i.e., the HJB approximation error due to the function approximation error, is

\varepsilon_{HJB} = \int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left( \nabla\varepsilon_v^T F + \lambda^2 \bar{R} \ln\!\left( 1 - \tanh^2(D + 0.5\lambda^{-1} R^{-1} G^T \nabla\varepsilon_v) \right) - \lambda^2 \bar{R} \ln(1 - \tanh^2(D)) - \gamma \varepsilon_v \right) d\tau.  (4.116)


Since the NN approximation error is bounded, there exists a constant error bound \varepsilon_h such that \sup_t ‖\varepsilon_{HJB}‖ ≤ \varepsilon_h. We should note that the choice of the NN structure to make the error bound \varepsilon_h arbitrarily small is commonly carried out by computer simulation in the literature. We assume here that the NN structure is specified by the designer, and the only unknowns are the NN weights. To approximate the solution to the tracking HJB Eq. (4.115), the critic and actor NNs are employed. The critic NN given by (4.100) is used to approximate the unknown optimal value function. Assuming that \hat{W}_1 is the current estimate of the optimal critic NN weights W_1, then using (4.111) the policy update law can be obtained as

u_1 = -\lambda \tanh\!\left( (1/2\lambda) R^{-1} G^T \nabla\phi^T \hat{W}_1 \right).  (4.117)

However, this policy update law does not guarantee the stability of the closed-loop system. It is necessary to use a second neural network with weights \hat{W}_2 for the actor, because the control input must not only satisfy the stationarity condition (4.70) but also guarantee system stability while converging to the optimal solution. This is seen in the Lyapunov proof of Theorem 4.6. Hence, to assure stability in a Lyapunov sense, the following actor NN is employed:

\hat{u}_1 = -\lambda \tanh\!\left( (1/2\lambda) R^{-1} G^T \nabla\phi^T \hat{W}_2 \right),  (4.118)

where \hat{W}_2 is the actor NN weight vector, regarded as the current estimate of W_1. Define the actor NN estimation error as

\tilde{W}_2 = W_1 - \hat{W}_2.  (4.119)

Note that using the actor \hat{u}_1 in (4.118), the IRL Bellman equation error is now given by

\int_{t-T}^{t} e^{-\gamma(\tau-t+T)} \left[ X^T(\tau) Q_T X(\tau) + \hat{U} \right] d\tau + \hat{W}_1^T \Delta\phi(X(t)) = \hat{e}_B(t),  (4.120)

where

\hat{U} = 2 \int_0^{\hat{u}_1} \left( \lambda \tanh^{-1}(v/\lambda) \right)^T R \, dv.  (4.121)

Then, the critic update law (4.105) becomes

\dot{\hat{W}}_1 = -\alpha_1 \frac{\Delta\phi}{(1 + \Delta\phi^T \Delta\phi)^2} \hat{e}_B.  (4.122)


Define the error e_u as the difference between the control input \hat{u}_1 in (4.118) applied to the system and the control input u_1 in (4.117), the latter being an approximation of the optimal control input (4.70) with V^* approximated by (4.100). That is,

e_u = \hat{u}_1 - u_1 = \lambda \left( \tanh\!\left( \frac{1}{2\lambda} R^{-1} G^T \nabla\phi^T \hat{W}_2 \right) - \tanh\!\left( \frac{1}{2\lambda} R^{-1} G^T \nabla\phi^T \hat{W}_1 \right) \right).  (4.123)

The objective function to be minimized by the actor NN is now defined as

E_u = e_u^T R\, e_u.  (4.124)

Then, the gradient descent update law for the actor NN weights becomes

\dot{\hat{W}}_2 = -\alpha_2 \left( \nabla\phi\, G\, e_u + \nabla\phi\, G \tanh^2(\hat{D})\, e_u + Y \hat{W}_2 \right),  (4.125)

where

\hat{D} = \frac{1}{2\lambda} R^{-1} G^T \nabla\phi^T \hat{W}_2,  (4.126)

Y > 0 is a design parameter, and the last term of (4.125) is added to assure stability. Before presenting our main theorem, note that based on Assumption 4.6 and the boundedness of the command generator dynamics h_d, the drift dynamics F of the augmented system satisfies

‖F(X)‖ \le b_{F1} ‖e_d‖ + b_{F2}  (4.127)

for some b_{F1} and b_{F2}.
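A discrete-time sketch of one synchronous update of the critic (4.122) and actor (4.125) weight vectors is given below; the helper functions phi_grad and G, the gain matrix Y, and the Bellman error ê_B supplied as inputs are assumptions for illustration, and the continuous-time laws are approximated by explicit Euler steps:

import numpy as np

def actor_critic_step(W1, W2, X, e_hat_B, dphi_win, phi_grad, G, R, lam, Y, a1, a2, dt):
    # critic: normalized gradient descent on the squared Bellman error, cf. (4.122)
    W1 = W1 - dt * a1 * dphi_win / (1.0 + dphi_win @ dphi_win)**2 * e_hat_B
    # actor: gradient step of (4.125) on E_u plus the stabilizing term Y*W2_hat
    GA = phi_grad(X) @ G(X)                        # l x m matrix: (nabla phi) G
    Rinv = np.linalg.inv(R)
    D1 = (1.0/(2*lam)) * Rinv @ GA.T @ W1          # argument of the critic-based policy (4.117)
    D2 = (1.0/(2*lam)) * Rinv @ GA.T @ W2          # D_hat in (4.126)
    e_u = lam * (np.tanh(D2) - np.tanh(D1))        # actor-critic policy mismatch, cf. (4.123)
    W2 = W2 - dt * a2 * (GA @ e_u + GA @ (np.tanh(D2)**2 * e_u) + Y @ W2)
    return W1, W2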

1 1 J (t) = V (t) + W˜ 1T (t)α1−1 W˜ 1 (t) + W˜ 2T (t)α2−1 W˜ 2 (t), 2 2

(4.128)

102

4 Integral Reinforcement Learning for Optimal Tracking

where .V (t) is the optimal value function. The derivative of the Lyapunov function is given by .

˜ T −1 ˙˜ ˜˙ J˙ = V˙ + W˜ 1T α−1 1 W1 + W2 α2 W2 .

(4.129)

Before evaluating (4.129), note that putting (4.113) and the tracking HJB (4.115) in the IRL tracking Bellman equation (4.120) gives

.eˆ B (t)

= =

∫ T

) ( ˆ dτ e−γ(τ −t+T ) X T Q T X + Uˆ + γ Wˆ 1T (t)φ − Wˆ 1T (t)∇φ(F − Gλ tanh( D))

t−T

∫ T

ˆ e−γ(τ −t+T ) (Uˆ − U − γ Wˆ 1T (t)φ+Wˆ 1 (t)T ∇φ (F − G λ tanh ( D))

t−T

ˆ + ε H J B ) dτ , + γW1T (t)φ − W1T ∇φ (F − G λ tanh ( D))

(4.130)

where .Uˆ is defined in (4.121) and is given by ˆ + λ2 R¯ ln(1 − tanh2 ( D)) ˆ Uˆ = Wˆ 2T ∇φ G λ tanh( D)

(4.131)

U = W1T ∇φ Gλ tanh(D) + λ2 R¯ ln(1 − tanh2 (D))

(4.132)

.

and .

is the cost (4.63) for the optimal control input.u = −λ tanh ((1/2λ)R −1 G T ∇φT W1 ). Using (4.113) and some manipulations, (4.130) becomes ∫

( ˆ e−γ(τ −t+T ) Uˆ − U W1T ∇φ (F − G λ tanh( D)) t−T ) − W1T ∇φ × (F − G λ tanh(D)) + ε H J B dτ − ∆φT W˜ 1 (t).

eˆ (t) =

. B

T

(4.133)

Using (4.131) and (4.132) and some manipulations,.Uˆ − U can be written as Modares et al. (2013) ˆ + W˜ 2T ∇φ Gλ sgn( D) ˆ − W1T ∇φ G λ tanh(D) Uˆ − U = Wˆ 2T ∇φ G λ tanh ( D) [ ] ˆ − sgn(D) + λ2 R¯ (ε ˆ − ε D ), − W1T ∇φ G λ sgn( D) (4.134) D

.

where .ε Dˆ and .ε D are some bounded approximation errors. Substituting (4.134) in (4.133) gives .e ˆ B (t) = − ∆φ W˜ 1 (t) +



T

T

t−T

e−γ(τ −t+T ) W˜ 2T M dτ + E,

(4.135)


where .

ˆ − sgn( D)) ˆ M = ∇φ G λ (tanh( D)

(4.136)

and ∫

.

( ˆ − sgn(D)) e−γ(τ −t+T ) W1T ∇φGλ(sgn( D) t−T ) + λ2 R¯ (ε Dˆ − ε D ) + ε H J B dτ .

E=

T

(4.137)

Note that . M and . E are bounded. We now evaluate the derivative of the Lyapunov function (4.129). For the first term of (4.129), one has .

ˆ + ε0 , V˙ = W1T ∇φ (F − G λ tanh ( D))

(4.138)

ˆ ε (x) = ∇εT (F − G λ tanh( D)).

(4.139)

where . 0

According to Assumption 4.4 and the definition of .G in (4.60b), one has .

∥G∥ ≤ bG .

(4.140)

Using Assumption 4.6, (4.127) and (4.140), and taking norm of .ε0 in (4.139) yields .

∥ ε0 (x) ∥ ≤ bεx b F1 ∥ed ∥ + λ bε x bG + b F2 .

(4.141)

Using the tracking HJB Eq. (4.115), the first term of (4.138) becomes .

W1T ∇φ F = −edT Q ed − U − γ W1T φ + W1T ∇φ G λ tanh(D) + ε H J B ,

(4.142)

where .U > 0 and it is defined in (4.132). Also, using .W1 = Wˆ 2 + W˜ 2 , and the fact x T tanh(x) > 0 ∀x, for the second term of (4.138) one has

.

.

ˆ > W˜ 2T ∇φ G λ tanh( D). ˆ W1T ∇φ G λ tanh( D)

(4.143)

Using (4.141)–(4.143) and Assumption 4.6, (4.138) becomes .

ˆ V˙ < −λmin (Q) ∥e∥2 + k1 ∥e∥ + k2 − W˜ 2T ∇φ G λ tanh( D),

(4.144)

where .k1 = bε x b F1 and .k2 = 2λ bG bφx ∥W1 ∥ + γ ∥W1 ∥ ∥φ∥ + λ bεx bG + b F2 + εh , and .εh is the bound for .ε H J B .


For the second term of (4.129), using (4.135) in (4.105), .W˙˜ 1 (t) becomes .

∫ ∆φ¯ ∆φ¯ T −γ(τ −t+T ) ˜ T E e W2 (τ ) M dτ − α1 W˙˜ 1 (t) = − α1 ∆φ¯ ∆φ¯ T W˜ 1 (t) − α1 m t−T m (4.145)

and therefore J˙ = W˜ 1T (t) α1−1 W˜˙ 1 (t) = − W˜ 1T (t)∆φ¯ ∆φ¯ T W˜ 1 (t) ∫ ∆φ¯ T −γ(τ −t+T ) ˜ T ∆φ¯ T ˜ E. − W1 (t) e W2 (τ )M dτ − W˜ 1T (t) m t−T m

. 1

(4.146)

For small enough reinforcement interval, the integral term of (4.146) can be approximated by the right-hand rectangle method (with only one rectangle) as ∫

T

.

t−T

e−γ(τ −t+T ) W˜ 2T (τ ) M dτ ≈ T e−γT M T W˜ 2 (t).

(4.147)

Using (4.147) in (4.146) gives T ∆φ¯ E − e−γT W˜ 1T (t) ∆φ¯ M T W˜ 2 (t). J˙ = − W˜ 1T (t)∆φ¯ ∆φ¯ T W˜ 1 (t) − W˜ 1T (t) m m (4.148)

. 1

By applying the Young’s inequality to the last term of (4.148), one has .

T −γT ˜ T e W1 (t) ∆φ¯ M T W˜ 2 (t) m T 2 e−2γT ˜ T ε ˜T ≤ W (t)M M T W˜ 2 (t) W1 (t) ∆φ¯ ∆φ¯ T W˜ 1 (t) + 2ε 2m 2 2

(4.149)

for every .ε > 0. Using (4.149) in (4.148) yields ¯ ¯ φ¯ T W˜ 1 (t) − W˜ T (t) ∆φ E + ε W˜ T (t)M M T W˜ 2 (t), J˙ ≤ −d W˜ 1T (t)∆φ∆ 1 m 2m 2 2 (4.150)

. 1

where d =1−

.

T 2 e−2γT . 2ε

(4.151)


Define .T0 as a constant that satisfies T 2 e−2γT0 = 2ε.

(4.152)

. 0

Then .d > 0 if .T < T0 . Finally, for the last term of (4.129), using (4.125), and definitions .Wˆ 1 (t) = W1 − W˜ 1 (t) and .Wˆ 2 (t) = W1 − W˜ 2 (t), one has J˙ = W˜ 2T (t)α2−1 W˙˜ 2 (t)

. 2

ˆ + W˜ 2T (t)k3 , = −W˜ 2T (t)Y W˜ 2 (t) + W˜ 2T (t)λ ∇φ G tanh ( D)

(4.153)

1 ˆ eu . where .k3 = −λ ∇φ G tanh ( 2λ R −1 G T ∇φT Wˆ 1 ) + Y W1 + λ∇φ Gtanh2 ( D) Based on definitions of .eu , .G in (4.123) and Assumptions 4.5 and 4.6, .k3 is bounded. Using (4.144), (4.150) and (4.153) into (4.129), . J becomes

.

J˙ < −λmin (Q)∥e∥2 + k1 ∥e∥ + k2 − d W˜ 1T (t) ∆φ¯ ∆φ¯ T W˜ 1 (t) ∆φ¯ − W˜ 1T (t) E − W˜ 2T (t)N W˜ 2 (t) + W˜ 2T (t)k3 , m

(4.154)

where .

N =Y−

ε M M T. 2m 2

(4.155)

If we choose .T and .Y such that .d in (4.99) and . N in (4.155) are bigger than zero, then . J becomes negative, provided that / .

.

∥e∥ >

k1 2 λmin (Q)

‖ ‖ E ‖ ¯T ˜ ‖ ‖ ∆φ W1 ‖ > , d ‖ ‖ k3 ‖˜ ‖ . . ‖ W2 ‖ > λmin (N )

+

k12 2 4 λmin (Q)

+

k2 , λmin (Q)

(4.156a) (4.156b) (4.156c)

Since (4.156b) holds on the output .∆φ¯ T W˜ 1 of error dynamics (4.110), as it was .◻ shown in Theorem 4.5, the PE signal .∆φ¯ shows that the state .W˜ 1 is UUB. Remark 4.18 The stability analysis in the proof of Theorem 4.6 differs from the stability proof presented in Vamvoudakis and Lewis (2010); Modares et al. (2013) from at least two different perspectives. First, the actor update law in the mentioned papers is derived entirely by the stability analysis whereas the actor update law here is based on the minimization of the error between the actor neural network and the approximate optimal control input. Moreover, the optimal tracking problem is considered here, not the optimal regulation problem, and the tracking HJB equation

106

4 Integral Reinforcement Learning for Optimal Tracking

has an additional term depending on the discount factor compared to the regulation HJB equation considered in the mentioned papers. Remark 4.19 The proof of Theorem 4.6 shows that the integral reinforcement learning time interval .T cannot be too big. Moreover, since .d and . N in Eqs. (4.99) and (4.155) should be bigger than zero for an arbitrary .ε > 0, one can conclude that the bigger the reinforcement interval .T is, the bigger the parameter .Y in learning rule (4.125) should be chosen to assure stability.

4.4 Simulation Results

The system dynamics are given as

\dot{x}_1 = x_2,  (4.157a)

\dot{x}_2 = -x_1^3 - 0.5 x_2 + u(t).  (4.157b)

Suppose the control bound is |u| ≤ 0.25. To find the optimal solution using the proposed method, the critic NN is chosen as a power series neural network with 45 activation functions containing powers of the augmented system state up to order four. The critic's weights and activation functions are

W = [W_1, \ldots, W_{45}]^T,

\phi(X) = [X_1^2, X_1 X_2, X_1 X_3, X_1 X_4, X_2^2, X_2 X_3, X_2 X_4, X_3^2, X_3 X_4, X_4^2, X_1^4, X_1^3 X_2, X_1^3 X_3, X_1^3 X_4, X_1^2 X_2^2, X_1^2 X_2 X_3, X_1^2 X_2 X_4, X_1^2 X_3^2, X_1^2 X_3 X_4, X_1^2 X_4^2, X_1 X_2^3, X_1 X_2^2 X_3, X_1 X_2^2 X_4, X_1 X_2 X_3^2, X_1 X_2 X_3 X_4, X_1 X_2 X_4^2, X_1 X_3^3, X_1 X_3^2 X_4, X_1 X_3 X_4^2, X_1 X_4^3, X_2^4, X_2^3 X_3, X_2^3 X_4, X_2^2 X_3^2, X_2^2 X_3 X_4, X_2^2 X_4^2, X_2 X_3^3, X_2 X_3^2 X_4, X_2 X_3 X_4^2, X_2 X_4^3, X_3^4, X_3^3 X_4, X_3^2 X_4^2, X_3 X_4^3, X_4^4]^T.  (4.158)
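Rather than typing the 45 activation functions in (4.158) by hand, they can be generated programmatically as all monomials of degree two and four in the four augmented-state variables (a convenience sketch, not part of the original setup):

import numpy as np
from itertools import combinations_with_replacement

# 10 quadratic monomials (C(5,2)) plus 35 quartic monomials (C(7,4)) gives the 45 terms of (4.158).
def phi(X):
    terms = []
    for deg in (2, 4):
        for idx in combinations_with_replacement(range(4), deg):
            terms.append(np.prod([X[i] for i in idx]))
    return np.array(terms)

print(phi(np.array([1.0, 2.0, 3.0, 4.0])).shape)   # -> (45,)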

The reinforcement interval .T is selected as .0.1. As no verifiable method exists to ensure PE in nonlinear systems, a small exploratory signal consisting of sinusoids of varying frequencies, i.e., .n(t) = 0.3 sin(8t)2 cos(2t) + 0.3 sin(20t)4 cos(7t), is added to the control input to excite the system states and ensure the PE qualitatively. The critic weights vector finally converges to .W

= [9.04, 3.95, −1.20, −1.64, 2.41, 0.71, −1.06, 14.28, 0.38, 2.93, −2.97, − 0.75, 4.60, −2.40, −3.33, 1.79, 2.18, 3.11, 0.69, −2.45, −2.23, 1.70, 2.02, 0.94, 0.43, 1.21, −0.47, −0.75, 0.54, 1.31, 0.03, 1.70, 0.81, 0.88, − 0.02, −0.76, 0.84, −0.15, −3.14, −0.83, 4.11, 0.29, 0.86, −0.88, 0.07]. (4.159)

The following figures show the performance of the proposed method (Figs. 4.4, 4.5 and 4.6).
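For readers who wish to reproduce a comparable experiment, a minimal sketch of the simulated plant (4.157) with the input saturation |u| ≤ 0.25 and the probing signal quoted above is given below; the feedback gains are arbitrary placeholders rather than the learned actor:

import numpy as np
from scipy.integrate import solve_ivp

lam = 0.25
n = lambda t: 0.3*np.sin(8*t)**2*np.cos(2*t) + 0.3*np.sin(20*t)**4*np.cos(7*t)   # probing signal

def plant(t, x):
    u = np.clip(-0.5*x[0] - 0.5*x[1] + n(t), -lam, lam)   # placeholder policy + exploration, saturated
    return [x[1], -x[0]**3 - 0.5*x[1] + u]                # dynamics (4.157)

sol = solve_ivp(plant, [0.0, 50.0], [0.5, -0.5], max_step=0.01)
print(sol.y[:, -1])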


Fig. 4.4 Control input

Fig. 4.5 The first state of the system and reference trajectory

Fig. 4.6 The second state of the system and reference trajectory

References Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791 Barbieri E, Alba-Flores R (2000) On the infinite-horizon LQ tracker. Syst Control Lett 40(2):77–82 Barbieri E, Alba-Flores R (2006) Real-time infinite horizon linear-quadratic tracking controller for vibration quenching in flexible beams. IEEE conference on systems, man, and cybernetics. Taipei, Taiwan, pp 38–43 Bhasin S, Kamalapurkar R, Johnson M, Vamvoudakis KG, Lewis FL, Dixon WE (2010) A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1):82–92 Dierks T, Jagannathan S (2009). Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics. In: Proceedings of the 48h IEEE conference on decision and control held jointly with 2009 28th Chinese control conference, pp 6750–6755


Finlayson BA (1990) The method of weighted residuals and variational principles. Academic Press, New York Hornik K, Stinchcombe M, White H (1990) Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3(5):551–560 Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704 Kleinman DL (1968) On an iterative technique for Riccati equation computations. IEEE Trans Autom Control 18(1):114–115 Lee JY, Park JB, Choi YH (2012) Integral reinforcement learning with explorations for continuoustime nonlinear systems. IEEE world congress on computational intelligence. Brisbane, Australia, pp 10–15 Lewis FL, Vamvoudakis K (2011) Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data. IEEE Trans Syst Man Cybern Part B: Cybern 41(1):14–23 Lewis FL, Vrabie D, Syrmos V (2012) Optimal control, 3rd edn. Wiley, New Jersey Liu D, Yang X, Li H (2013) Adaptive optimal control for a class of continuous-time affine nonlinear systems with unknown internal dynamics. Neural Comput Appl 23:1843–1850 Mannava A, Balakrishnan SN, Tang L, Landers RG (2012) Optimal tracking control of motion systems. IEEE Trans Control Syst Technol 20(6):1548–1556 Modares H, Sistani MBN, Lewis FL (2013) A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Trans 52(5):611–621 Modares H, Lewis FL, Naghibi-Sistani MB (2014) Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50(1):193–202 Sastry S (2013) Nonlinear systems: Analysis, stability, and control. Springer Science & Business Media Slotine JJE, Li W (1991) Applied nonlinear control. Prentice Hall Englewood Cliffs, NJ Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5):878–888 Vrabie D, Lewis FL (2009) Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks 22(3):237–246 Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for continuoustime linear systems based on policy iteration. Automatica 45(2):477–484 Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Werbos PJ (1992) Approximate dynamic programming for real time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control. Multiscience Press Zhang H, Cui L, Zhang X, Luo Ya (2011) Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Trans Neural Networks 22(12):2226–2236

Chapter 5

Integral Reinforcement Learning for Zero-Sum Games

5.1 Introduction Two-player zero-sum games provide a well-defined framework for addressing . H∞ optimal control problems, which have been extensively utilized in designing feedback controllers to mitigate the impact of disturbances on system performance. The exploration of optimal controllers within this framework gained traction following Zames’ contributions (Zames 1981) to the field of optimal control. The solution to such problems can be equated to finding the Nash equilibrium of the corresponding two-player zero-sum game (Zhang et al. 2011, 2012; Li et al. 2014; Lian et al. 2022), which entails solving the Hamilton–Jacobi–Isaacs (HJI) equation. The first part of this chapter focuses on addressing the problem of . H∞ tracking control for nonlinear continuous-time systems under the influence of disturbances. This problem is formulated as a two-player zero-sum game and is tackled using an integral reinforcement learning (IRL) algorithm, without knowing system dynamics. By leveraging the IRL algorithm, the . H∞ tracking control problem can be effectively solved, allowing for the design of robust controllers that can handle disturbances and uncertainties in the system dynamics. The second half of the chapter explores the distributed minmax strategy for multiplayer games, which is resolved using an off-policy IRL algorithm. This strategy is equivalent to the minmax solution in zero-sum games but is tailored to multiplayer systems. The results presented in this section differ from the traditional Nash equilibrium solution obtained through solving coupled algebraic Riccati equations (Cheng et al. 2021; Mitake and Tran 2014; Vamvoudakis and Lewis 2011; Freiling et al. 1996). The application of the off-policy IRL algorithm enables the derivation of efficient and robust solutions for multiplayer games.


5.2 Off-Policy Integral Reinforcement Learning for . H∞ Tracking Control This section designs . H∞ tracking control of nonlinear continuous-time systems. A general bounded . L 2 -gain tracking problem with a discounted performance function is introduced for the . H∞ tracking. A rigorous analysis of bounded . L 2 -gain and stability of the control solution obtained is provided. An off-policy IRL algorithm is developed to learn. H∞ tracking control solution without using system dynamics. The convergence of the proposed algorithm is shown. Simulation examples are provided to verify the algorithm.

5.2.1 Problem Formulation

Consider the affine nonlinear continuous-time system defined as

\dot{x} = f(x) + g(x) u + k(x) d,  (5.1)

where x ∈ R^n is the state, u = [u_1, ..., u_m] ∈ R^m is the control input, d = [d_1, ..., d_q] ∈ R^q denotes the external disturbance, f(x) ∈ R^n is the drift dynamics, g(x) ∈ R^{n×m} is the input dynamics, and k(x) ∈ R^{n×q} is the disturbance dynamics. It is assumed that the functions f(x), g(x), and k(x) are Lipschitz with f(0) = 0, and that the system (5.1) is controllable in the sense that there exists a continuous control on a set Ω ⊆ R^n which stabilizes the system in the absence of the disturbance. Moreover, it is assumed that the functions f(x), g(x), and k(x) are unknown.

Let r(t) be the bounded reference trajectory and assume that there exists a Lipschitz continuous command generator function h_d(·) ∈ R^n such that

\dot{r} = h_d(r)  (5.2)

and h_d(0) = 0. Define the tracking error

e_d(t) = x(t) - r(t).  (5.3)

Using (5.1)–(5.3), the tracking error dynamics is

\dot{e}_d(t) = f(x(t)) - h_d(r(t)) + g(x(t)) u(t) + k(x(t)) d(t).  (5.4)

The fictitious performance output to be controlled is defined such that it satisfies

‖z(t)‖^2 = e_d^T Q e_d + u^T R u.  (5.5)


The goal of the H∞ tracking is to attenuate the effect of the disturbance input d on the performance output z. Before defining the H∞ tracking control problem, we define the following general L_2-gain or disturbance attenuation condition.

Definition 5.1 (Bounded Gain or Disturbance Attenuation) The nonlinear system (5.1) is said to have L_2-gain less than or equal to γ if the following disturbance attenuation condition is satisfied for all d ∈ L_2[0, ∞):

\frac{\int_t^{\infty} e^{-\alpha(\tau-t)} ‖z(\tau)‖^2 \, d\tau}{\int_t^{\infty} e^{-\alpha(\tau-t)} ‖d(\tau)‖^2 \, d\tau} \le \gamma^2,  (5.6)

112

5 Integral Reinforcement Learning for Zero-Sum Games

both feedback and feedback parts of the control input are obtained simultaneously because of the general version of the. L 2 -gain defined in (5.6) where the whole control input and the tracking error energies are weighted by an exponential discount factor in the performance criterion. In fact, in this way the design of feedforward control input is not separated from the design of the feedback control input. The control solution to the . H∞ tracking problem with the proposed attenuation condition (5.6) is provided in the subsequent sections. We shall see in the subsequent sections that this general disturbance attenuation condition enables us to find both feedback and feedforward parts of control input simultaneously and therefore extends the method of off-policy RL for solving the problem in hand without requiring any knowledge of the system dynamics.

5.2.2 Tracking Hamilton–Jacobi–Isaacs Equation and the Solution Stability In this section, a new formulation for solving the . H∞ tracking control problem is presented. The problem of solving the . H∞ tracking control problem is transformed into a minmax optimization problem subject to an augmented system composed of the tracking error dynamics and the command generator dynamics. A tracking HJI equation is developed for finding the solution to the minmax optimization problem. The stability and . L 2 -gain bound of the control solution obtained by solving the tracking HJI equation are discussed.

5.2.2.1

Tracking Hamilton–Jacobi–Isaacs Equation

In the subsection, an augmented system composed of the tracking error system and the command dynamics is constructed. A discounted performance function in terms of the state of the augmented system is defined, and it is shown that solving the . H∞ optimal tracking is equivalent to solving a minmax optimization problem with the defined discounted performance function. A tracking HJI equation is then developed to give the solution to the optimization problem. Define the augmented system state .

X (t) = [edT (t) r T (t)]T ∈ R2n ,

(5.7)

where .ed (t) is the tracking error defined in (5.3) and .r (t) is the reference trajectory. Putting (5.2) and (5.4) together yields the augmented system .

X˙ (t) = F(X (t)) + G(X (t)) u(t) + K (X (t)) d(t),

(5.8)

5.2 Off-Policy Integral Reinforcement Learning for H∞ Tracking Control

113

where .u(t) = u(X (t)) and [ .

F(X ) =

] [ ] [ ] g(ed + r ) k(ed + r ) f (ed + r ) − h d (r ) , G(X ) = , K (X ) = . h d (r ) 0 0 (5.9)

Using the augmented system (5.8), the disturbance attenuation condition (5.8) becomes ∫ ∞ ∫ ∞ ( T ) −α(τ −t) T 2 X Q T X + u Ru dτ ≤ γ . e e−α(τ −t) (d T d)dτ, (5.10) t

t

where [ .

QT =

] Q0 . 0 0

(5.11)

Based on (5.10), define the performance function ∫ .

J (u, d) =



( ) e−α(τ −t) X T Q T X + u T R u − γ 2 d T d dτ.

(5.12)

t

Remark 5.4 Note that the problem of finding a control policy that satisfies bounded L 2 -gain condition for the optimal tracking problem is equivalent to minimizing the discounted performance function (5.12) subject to the augmented system (5.8).

.

It is well-known that the . H∞ control problem is closely related to the two-player zero-sum differential game theory (Basar and Bernard 1995; Feng et al. 2009). In fact, the solvability of the . H∞ control problem is equivalent to the solvability of the following zero-sum game (Basar and Bernard 1995) .

V ∗ (X (t)) = J (u ∗ , d ∗ ) = min max J (u, d), u

d

(5.13)

where . J is defined in (5.12) and .V ∗ (X (t)) is defined as the optimal value function. This two-player zero-sum game control problem has a unique solution if a game theoretic saddle point exists, i.e., if the following Nash condition holds: .

V ∗ (X (t)) = min max J (u, d) = max min J (u, d). u

d

d

u

(5.14)

Note that differentiating (5.12) and noting that .V (X (t)) = J (u(t), d(t)) give the following Hamiltonian in terms of Bellman equation: .



H (V, u, d) = X T Q T X + u T R u − γ 2 d T d − αV + VXT (F + G u + K d) = 0, (5.15)

114

5 Integral Reinforcement Learning for Zero-Sum Games ∆





where . F = F(X ), G = G(X ), K = K (X ), and .VX = ∂ V /∂ X . Applying stationarity conditions .∂ H (V ∗ , u, d)/∂u = 0, ∂ H (V ∗ , u, d)/∂d = 0 gives the optimal control and disturbance inputs as 1 u ∗ = − R −1 G T VX∗ , 2 1 ∗ .d = K T VX∗ , 2γ 2 .

(5.16a) (5.16b)

where .V ∗ is the optimal value function defined in (5.13). Substituting the control .u ∗ input (5.16a) and the disturbance input .d ∗ (5.16b) into (5.15), the following tracking HJI equation is obtained: 1 ∗T 1 ∆ V K K T VX∗ H (V ∗ , u ∗ , d ∗ ) = X T Q T X + VX∗T F − αVX − VX∗T G T R −1 G VX∗ + 4 4γ 2 X = 0. (5.17) In the following, it is shown that the control solution (5.16a), which is found by solving the HJI Eq. (5.17), solves the . H∞ tracking problem formulated in Definition 5.2.

.

5.2.2.2

Stability of the Solution to the Hamilton–Jacobi–Isaacs Equation

In this subsection, it is first shown that the control solution (5.16a) satisfies the disturbance attenuation condition (5.10) (part (i) of Definition 5.2). Then, the stability of the tracking error dynamics (5.4) without the disturbance is discussed (part (ii) of Definition 5.2). It is shown that there exists an upper bound .α ∗ such that if the discount factor is less than .α ∗ , the control solution (5.16a) makes the system locally asymptotically stable. Theorem 5.1 (Saddle Point Solution) Consider the . H∞ tracking control problem as a two-player zero-sum game problem with the performance function (5.12). Then, the pair of strategies .(u ∗ , d ∗ ) defined in (5.16a)–(5.16b) provides a saddle point solution to the game. Proof See (Abu-Khalaf and Lewis 2008).



.

Theorem 5.2 (. L 2 -Gain of the System for the Solution to the HJI Equation) Assume that there exists a continuous positive semi-definite solution .V ∗ (X ) to the tracking HJI Eq. (5.17). Then .u ∗ in (5.16a) makes the closed-loop system (5.17) to have . L 2 -gain less than or equal to .γ . Proof The Hamiltonian (5.15) for the optimal value function .V ∗ and any control policy .u and disturbance policy .d becomes .

H (V ∗ , u, d) = X T Q T X + u T R u − γ 2 d T d − αV ∗ + VX∗ T(F + G u + K d). (5.18)

5.2 Off-Policy Integral Reinforcement Learning for H∞ Tracking Control

115

On the other hand, using (5.16a)–(5.17) one has .

H (V ∗ , u, d) = H (V ∗ , u ∗ , d ∗ ) + (u − u ∗ )T R (u − u ∗ ) + γ 2 (d − d ∗ )T (d − d ∗ ). (5.19)

Based on the HJI Eq. (5.17), we have . H (V ∗ , u ∗ , d ∗ ) = 0. Therefore, (5.18) and (5.19) give .

X T Q T X + u T Ru − γ 2 d T d − αV ∗ + VX∗ T(F + G u + K d) = −(u − u ∗ )T R (u − u ∗ ) − γ 2 (d − d ∗ )T (d − d ∗ ).

(5.20)

Substituting the optimal control policy .u ∗ in the above equation yields .

X T Q T X + u ∗ TRu ∗ − γ 2 d T d − αV ∗ + VX∗ T(F + Gu ∗ + K d) = −γ 2 (d − d ∗ )T (d − d ∗ ) ≤ 0.

(5.21)

Multiplying both sides of this equation by .e−αt and defining .V˙ ∗ = VX∗T (F + G u ∗ + K d) as the derivative of .V ∗ along the trajectories of the closed-loop system give .

d −αt ∗ (e V (X )) ≤ e−αt (−X T Q T X − u ∗ TR u ∗ + γ 2 d T d). dt

(5.22)

Integrating from both sides of this equation yields e

.

−αT







T

V (X (T )) − V (X (0)) ≤

( ) e−ατ −X T Q T X − u T Ru ∗ + γ 2 d T d dτ

0

(5.23) for every.T > 0 and every.d ∈ L 2 [0, ∞). Since.V ∗ (.) ≥ 0, the above equation yields ∫

T

e

.

−ατ

(



X Q T X + u TRu T

0

This completes the proof.



)

∫ dτ ≤

T

e−ατ (γ 2 d T d)dτ + V ∗ (X (0)). (5.24)

0



.

Theorem 5.2 solves part (i) of the state-feedback . H∞ tracking control problem given in Definition 5.2. In the following, we consider the problem of stability of the closed-loop system without disturbance, which is part (ii) of Definition 5.2. Theorem 5.3 (Stability of the Optimal Solution for .α → 0) Suppose that .V ∗ (X ) is a smooth positive semi-definite and locally quadratic solution to the tracking HJI equation. Then, the control input given by (5.16a) makes the error dynamics (5.4) with .d = 0 asymptotically stable in the limit as the discount factor goes to zero.

116

5 Integral Reinforcement Learning for Zero-Sum Games

Proof Differentiating.V ∗ along the trajectories of the closed-loop system with.d = 0 and using the tracking HJI equation gives .

VX∗T (F + G u ∗ ) = αV ∗ − X T Q T X − u ∗ TR u ∗ + γ 2 d T d

(5.25)

or equivalently, .

( ) d −αt ∗ (e V (X )) = e−αt −X T Q T X − u ∗ TRu ∗ + γ 2 d T d ≤ 0. dt

(5.26)

If the discount factor goes to zero, then LaSalle’s extension can be used to show that the tracking error is locally asymptotically stable. More specifically, if .α → 0, based on LaSalle’s extension,. X (t) = [edT (t) r (t)T ]T goes to a region wherein.V˙ = 0. Since T T ˙ = 0 only if .ed (t) = 0 and . X Q T X = ed (t)Q ed (t) where . Q is positive-definite, . V .u + 0 when .d = 0. On the other hand, .u = 0 also requires that .ed (t) = 0; therefore, .◻ for .γ = 0, the tracking error is locally asymptotically stable. Theorem 5.3 shows that if the discount factor goes to zero, then the optimal control solution found by solving the tracking HJI equation makes the system locally asymptotically stable. However, if the discount factor is non-zero, the local asymptotic stability of the optimal control solution cannot be guaranteed by Theorem 5.3. In the following Theorem 5.4, it is shown that local asymptotic stability of the optimal solution is guaranteed as long as the discount factor is smaller than an upper bound. Before presenting the proof of local asymptotic stability, the following example shows that if the discount factor is not small, the control solution obtained by solving the tracking HJI equation can make the system unstable. Example 5.1 Consider the scalar dynamical system .

X˙ = X + u + d.

(5.27)

Assume that in the HJI Eq. (5.17) we have . Q T = R = 1 and the attenuation level is .γ = 1. For this linear system with quadratic performance, the value function is quadratic. That is, .V (X ) = p X 2 and therefore the HJI equation reduces to (2 − α) p −

.

3 2 p +1=0 4

(5.28)

5.2 Off-Policy Integral Reinforcement Learning for H∞ Tracking Control

117

and the optimal control solution becomes u = − p X.

(5.29)

.

Solving this equation gives the optimal solution as ( u=−

.

4 2 (1 − 0.5 α) + √ 3 3

/

4 (1 − 0.5 α)2 + 1 3

) X.

(5.30)
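A quick numerical check of this scalar example (a sketch, not part of the original text): solving the quadratic (5.28) for its positive root p and inspecting the closed-loop pole 1 − p of the undisturbed system shows where stability is lost as α grows, which matches the threshold quoted next:

import numpy as np

for alpha in (0.5, 1.5, 27/12, 2.5):
    # (5.28) rearranged: 0.75 p^2 - (2 - alpha) p - 1 = 0, positive root
    p = ((2 - alpha) + np.sqrt((2 - alpha)**2 + 3)) / 1.5
    print(f"alpha = {alpha:5.3f}   p = {p:5.3f}   closed-loop pole 1 - p = {1 - p:+.3f}")
# the pole crosses zero exactly at alpha = 27/12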

However, this optimal solution does not make the system stable for all values of the discount factor .α. In fact, if .α > α ∗ = 27/12, then the system is unstable. The next theorem shows how to find an upper bound .α ∗ for the discount factor to assure the stability of the system without disturbance. Before presenting the stability theorem, note that the augmented system dynamics (5.8) can be written as .

¯ ), X˙ = F(X ) + G(X )u + K (X )d = AX + Bu + Dd + F(X

where . AX + Bu + K d is the linearized model with ] [ [ ]T ]T [ Al1 Al1 − Al2 , B = BlT 0T , D = DlT 0T , .A = 0 Al2

(5.31)

(5.32)

where . Al1 and . Al2 are the linearized models of the drift system dynamics . f and the ¯ ) is the remaining nonlinear command generator dynamics .h d , respectively, and . F(X terms. Theorem 5.4 (Stability of the Optimal Solution and Upper Bound for .α) Consider the system (5.8). Define .

L l = Bl R −1 BlT +

1 Dl DlT , γ2

(5.33)

where . Bl and . Dl are defined in (5.32). Then, the control solution (5.16a) makes the error system (5.4) with .d = 0 locally asymptotically stable if ‖ ‖ α ≤ α ∗ = 2 ‖(L l Q)1/2 ‖ .

.

(5.34)
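Evaluating the bound (5.34) only requires the linearized input and disturbance matrices and the weights; a minimal sketch with hypothetical matrices is given below:

import numpy as np
from scipy.linalg import sqrtm

Bl = np.array([[0.5], [1.0]]); Dl = np.array([[0.2], [1.0]])     # hypothetical linearized B_l, D_l
Q  = np.diag([5.0, 1.0]);      R = np.array([[1.0]]);  gamma = 2.0
Ll = Bl @ np.linalg.inv(R) @ Bl.T + (1.0/gamma**2) * Dl @ Dl.T   # L_l in (5.33)
alpha_star = 2.0 * np.linalg.norm(sqrtm(Ll @ Q), 2)              # bound (5.34)
print(alpha_star)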

Proof Given the augmented dynamics (5.8) and the performance function (5.12), the Hamiltonian function in terms of the optimal control and disturbance is defined as .

( ) ( ) H (ρ, u ∗ , d ∗ ) = e−αt X T Q T X + u ∗ TRu ∗ − γ 2 d ∗ Td ∗ + ρ T F + G u ∗ + K d ∗ , (5.35)

118

5 Integral Reinforcement Learning for Zero-Sum Games

where .ρ is known as the costate variable. Using Pontryagin’s maximum principle, the optimal solutions .u ∗ and .d ∗ satisfy the following state and costate equations: .

X˙ = Hρ (X, ρ),

(5.36a)

ρ˙ = −H X (X, ρ).

(5.36b)

.



.

Define the new variable μ = eαt ρ.

.

(5.37)

Based on (5.37), define the modified Hamiltonian function as .H

m

) ( = e−αt H = X T Q T X + u ∗ TRu ∗ − γ 2 d ∗ Td ∗ + μT (F + G u ∗ + K d ∗ ).

(5.38)

Then, conditions (5.36a) and (5.36b) become .

X˙ = Hμm (X, μ),

(5.39a)

μ˙ = α μ − H

(5.39b)

.

m

X (X, μ).

Equation (5.39a) gives the augmented system dynamics (5.8) and Eq. (5.39b) is equivalent to the HJI Eq. (5.17) with .μ = VX∗ . To prove the local stability of the closed-loop system, the stability of the closed-loop linearized system is investigated. Using (5.31) for the system dynamics, (5.38) becomes .H

m

) ( ¯ ) ). = X T Q T X + u ∗ TR u ∗ − γ 2 d ∗ Td ∗ + μT (AX + Bu ∗ + Dd ∗ + F(X

(5.40)

Then, the costate can be written as the sum of a linear and a nonlinear term as

μ = 2PX + φ_0(X) ≜ μ_1 + φ_0(X).    (5.41)

Using ∂H^m/∂u = 0, ∂H^m/∂d = 0, and (5.41), one has

u* = −R^{-1} B^T P X + φ_1(X),    (5.42a)
d* = (1/γ²) D^T P X + φ_2(X),    (5.42b)

for some φ_1(X) and φ_2(X) depending on φ_0(X), F̄(X), and P. Using (5.35)–(5.42b), conditions (5.39a) and (5.39b) become


[Ẋ; μ̇_1] = [ A   −(B R^{-1} B^T − (1/γ²) D D^T) ; −Q_T   −A^T + α I_n ] [X; μ_1] + [F_1(X); F_2(X)] ≜ W [X; μ_1] + [F_1(X); F_2(X)],    (5.43)

for some nonlinear functions F_1(X) and F_2(X). The linear part of the costate lies on the stable manifold of W, and thus, based on the linear part of (5.43), it satisfies the following game algebraic Riccati equation (GARE):

Q_T + A^T P + P A − αP − P B R^{-1} B^T P + (1/γ²) P D D^T P = 0.    (5.44)

Define P = [ P_11  P_12 ; P_12^T  P_22 ]. Then, based on (5.11) and (5.32), the upper left-hand block of the LQT GARE (5.44) becomes

Q + A_l1^T P_11 + P_11 A_l1 − α P_11 − P_11 B_l R^{-1} B_l^T P_11 + (1/γ²) P_11 D_l D_l^T P_11 = 0.    (5.45)

The closed-loop system dynamics for the control input (5.42a) and without the disturbance is

Ẋ = (A − B R^{-1} B^T P)X + F_f(X)    (5.46)

for some nonlinear function F_f(X) with F_f = [F_f1^T, F_f2^T]^T, which gives the following tracking error dynamics

ė_d = (A_l1 − B_l R^{-1} B_l^T P_11) e_d + F_f1 = A_c e_d + F_f1.    (5.47)

Based on the closed-loop error dynamics A_c, the GARE becomes

Q + A_c^T P_11 + P_11 A_c − α P_11 + P_11 B_l R^{-1} B_l^T P_11 + (1/γ²) P_11 D_l D_l^T P_11 = 0.    (5.48)

To find a condition on the discount factor to assure stability of the linearized error dynamics, assume that λ is an eigenvalue of the closed-loop error dynamics A_c. That is, A_c x = λx, with x the eigenvector corresponding to λ. Then, multiplying the left- and right-hand sides of the GARE (5.48) by x^T and x, respectively, one has

2(Re(λ) − 0.5α) x^T P_11 x = −x^T Q x − x^T P_11 L_l P_11 x.    (5.49)

Using the inequality a² + b² ≥ 2ab and since P_11 > 0, (5.49) becomes

(Re(λ) − 0.5α) ≤ −‖(Q P_11^{-1})^{1/2}‖ ‖(L_l P_11)^{1/2}‖,    (5.50)

or equivalently,

Re(λ) ≤ −‖(Q P_11^{-1})^{1/2}‖ ‖(L_l P_11)^{1/2}‖ + 0.5α,    (5.51)

where L_l is defined in (5.33). Using the fact that ‖A‖‖B‖ ≥ ‖AB‖ gives

Re(λ) ≤ −‖(L_l Q)^{1/2}‖ + 0.5α.    (5.52)

Therefore, the linear error dynamics in (5.47) is stable if condition (5.34) is satisfied, and this completes the proof. □

Remark 5.5 Note that the GARE (5.44) can be written as

Q_T + (A − 0.5αI)^T P + P(A − 0.5αI) − P B R^{-1} B^T P + (1/γ²) P D D^T P = 0.

This amounts to a GARE without a discount factor and with the system dynamics given by A − 0.5αI, B, and D. Therefore, the existence of a unique solution to the GARE requires (A − 0.5αI, B) to be stabilizable. Based on the definition of A and B in (5.32), this requires that (A_l1 − 0.5αI, B_l) be stabilizable and (A_l2 − 0.5αI) be stable. However, since (A_l1, B_l) is stabilizable, as the system dynamics in (5.1) are assumed stabilizable, (A_l1 − 0.5αI, B_l) is also stabilizable for any α > 0. Moreover, since the reference trajectory is assumed bounded, the linearized model of the command generator dynamics, i.e., A_l2, is marginally stable, and thus (A_l2 − 0.5αI) is stable. Therefore, the discount factor does not affect the existence of the solution to the GARE.

Remark 5.6 Theorem 5.4 shows that asymptotic stability of only the first n variables of X is guaranteed, which are the error dynamic states. This is reasonable, as the last n variables of X are the reference command generator variables that are not under our control.

Remark 5.7 For Example 5.1, condition (5.34) gives the bound α < √(80/12) to assure stability. This bound is close to the actual bound obtained in Example 5.1; nevertheless, condition (5.34) in general gives a conservative bound for the discount factor.

Remark 5.8 Theorem 5.4 confirms the existence of an upper bound for the discount factor to assure stability of the solution to the tracking HJI equation and relates this bound to the input and disturbance dynamics and the weighting matrices in the performance function. Condition (5.34) is not a restrictive condition even if the system dynamics are unknown. In fact, one can always pick a very small discount factor and/or a large weighting matrix Q (which is a design matrix) to assure that condition (5.34) is satisfied.
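To make Example 5.1 concrete, the following short script (a sketch, not part of the original text) solves the scalar condition (5.28) for p and checks the sign of the closed-loop pole 1 − p as the discount factor α varies; the instability threshold α* = 27/12 of Example 5.1 appears where p drops to 1.

```python
import numpy as np

def hji_gain(alpha):
    # Positive root of (3/4)p^2 - (2 - alpha)p - 1 = 0, i.e. Eq. (5.28) rearranged.
    a, b, c = 0.75, -(2.0 - alpha), -1.0
    return (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)

for alpha in [0.0, 1.0, 2.0, 27/12, 2.5]:
    p = hji_gain(alpha)
    pole = 1.0 - p   # closed-loop dynamics Xdot = (1 - p) X with u = -pX, d = 0
    print(f"alpha = {alpha:5.3f}  p = {p:5.3f}  pole = {pole:+.3f}"
          f"  ({'stable' if pole < 0 else 'not stable'})")
```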


5.2.3 Off-Policy Integral Reinforcement Learning for Tracking Hamilton–Jacobi–Isaacs Equation

In this section, an offline RL algorithm is first given to solve the problem of H∞ optimal tracking by learning the solution to the tracking HJI equation. An off-policy IRL algorithm is then developed to learn the solution to the HJI equation online without requiring any knowledge of the system dynamics. Three neural networks in an actor–critic–disturbance structure are used to implement the proposed off-policy IRL algorithm.

5.2.3.1 Off-Policy RL Algorithm

The Bellman equation (5.15) is linear in the cost function V, while the HJI Eq. (5.17) is nonlinear in the value function V*. Therefore, solving the Bellman equation for V is easier than solving the HJI equation for V*. Instead of directly solving for V*, the policy iteration (PI) algorithm iterates on both control and disturbance players to break the HJI equation into a sequence of differential equations linear in the cost. An offline PI algorithm for solving the H∞ optimal tracking problem is given as follows.

Algorithm 5.1 Offline PI for solving the tracking HJI equation
1. Initialization: Start with an admissible control input u_0.
2. Policy evaluation: Compute V_i using the following Bellman equation

H(V_i, u_i, d_i) = X^T Q_T X + V_{X_i}^T (F + G u_i + K d_i) − αV_i + u_i^T R u_i − γ² d_i^T d_i = 0.    (5.53)

3. Policy improvement: Update the control gain using

u_{i+1} = arg min_u [H(V_i, u, d)] = −(1/2) R^{-1} G^T V_{X_i}    (5.54)

and the disturbance using

d_{i+1} = arg max_d [H(V_i, u_i, d)] = (1/(2γ²)) K^T V_{X_i}.    (5.55)

Algorithm 5.1 extends the results of the simultaneous RL algorithm in Wu and Luo (2012) to the tracking problem. The convergence of this algorithm to the minimal nonnegative solution of the HJI equation was shown in Wu and Luo (2012). In fact, similar to Wu and Luo (2012), the convergence of Algorithm 5.1 can be established by proving that the iteration on (5.54) is essentially a Newton’s iterative sequence which converges to the unique solution of the HJI Eq. (5.17).
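For general nonlinear dynamics, each policy-evaluation step (5.53) is a partial differential equation. In the linear-quadratic special case, with F = AX, G = B, K = D and a quadratic value V_i(X) = X^T P_i X, the iteration (5.53)-(5.55) reduces to one Lyapunov equation per step. The sketch below illustrates only that reduction; the matrices A, B, D, QT, R and the admissible initial gain K0 are placeholders supplied by the user, and this is not the book's implementation.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def offline_pi_lq(A, B, D, QT, R, gamma, alpha, K0, n_iter=30):
    """Policy iteration (5.53)-(5.55) specialized to linear dynamics and
    quadratic value functions V_i(X) = X^T P_i X (illustrative sketch)."""
    K = K0                                       # control gain, u_i = -K X (admissible)
    L = np.zeros((D.shape[1], A.shape[0]))       # disturbance gain, d_i = L X
    for _ in range(n_iter):
        Ac = A - B @ K + D @ L - 0.5 * alpha * np.eye(A.shape[0])
        # Policy evaluation (5.53): Ac^T P + P Ac + QT + K^T R K - gamma^2 L^T L = 0
        rhs = -(QT + K.T @ R @ K - gamma**2 * L.T @ L)
        P = solve_continuous_lyapunov(Ac.T, rhs)
        # Policy improvement: with V_X = 2 P X, (5.54)-(5.55) give
        K = np.linalg.solve(R, B.T @ P)          # u_{i+1} = -R^{-1} B^T P X
        L = (1.0 / gamma**2) * D.T @ P           # d_{i+1} = (1/gamma^2) D^T P X
    return P, K, L
```

In the nonlinear case the same iteration is carried out with function approximation, as discussed in Sect. 5.2.3.2.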

122

5 Integral Reinforcement Learning for Zero-Sum Games

Algorithm 5.1 requires complete knowledge of the system dynamics. In the following, an off-policy IRL algorithm is developed to solve the optimal tracking for systems with completely unknown dynamics. To this end, the system dynamics (5.8) is first written as .

X˙ = F + G u i + K di + G (u − u i ) + K (d − di ),

(5.56)

where u_i = [u_{i,1}, ..., u_{i,m}] ∈ R^m and d_i = [d_{i,1}, ..., d_{i,q}] ∈ R^q are the policies to be updated. Differentiating V_i(X) along the system dynamics (5.56) and using (5.53)–(5.55) gives

V̇_i = V_{X_i}^T(F + G u_i + K d_i) + V_{X_i}^T G(u − u_i) + V_{X_i}^T K(d − d_i)
    = αV_i − X^T Q_T X − u_i^T R u_i + γ² d_i^T d_i − 2u_{i+1}^T R(u − u_i) + 2γ² d_{i+1}^T(d − d_i).    (5.57)

Multiplying both sides of (5.57) by e^{−α(τ−t)} and integrating both sides from t to t + T yields the following off-policy IRL Bellman equation:

e^{−αT} V_i(X(t+T)) − V_i(X(t)) = ∫_t^{t+T} e^{−α(τ−t)} (−X^T Q_T X − u_i^T R u_i + γ² d_i^T d_i) dτ
   + ∫_t^{t+T} e^{−α(τ−t)} (−2u_{i+1}^T R(u − u_i) + 2γ² d_{i+1}^T(d − d_i)) dτ.    (5.58)

Note that for a fixed control policy u (the policy which is applied to the system) and a given disturbance d (the actual disturbance which is applied to the system), Eq. (5.58) can be solved for the value function V_i and the updated policies u_{i+1} and d_{i+1} simultaneously.

Lemma 5.1 The off-policy IRL Eq. (5.58) gives the same solution for the value function as the Bellman equation (5.53) and the same updated control and disturbance policies as (5.54) and (5.55).

Proof Dividing both sides of the off-policy IRL Bellman equation (5.58) by T and taking the limit as T → 0 results in

lim_{T→0} [e^{−αT} V_i(X(t+T)) − V_i(X(t))]/T
  + lim_{T→0} (1/T) ∫_t^{t+T} e^{−α(τ−t)} (X^T Q_T X + u_i^T R u_i − γ² d_i^T d_i) dτ
  + lim_{T→0} (1/T) ∫_t^{t+T} e^{−α(τ−t)} (2u_{i+1}^T R(u − u_i) − 2γ² d_{i+1}^T(d − d_i)) dτ = 0.    (5.59)


By L'Hopital's rule, the first term in (5.59) becomes

lim_{T→0} [e^{−αT} V_i(X(t+T)) − V_i(X(t))]/T = lim_{T→0} [−α e^{−αT} V_i(X(t+T)) + e^{−αT} V̇_i(X(t+T))]
   = −αV_i + V_{X_i}^T (F + G u_i + K d_i + G(u − u_i) + K(d − d_i)),    (5.60)

where the last term on the right-hand side is obtained by using V̇ = V_X^T Ẋ. Similarly, for the second and third terms of (5.59), one has

lim_{T→0} (1/T) ∫_t^{t+T} e^{−α(τ−t)} (X^T Q_T X + u_i^T R u_i − γ² d_i^T d_i) dτ = X^T Q_T X + u_i^T R u_i − γ² d_i^T d_i    (5.61)

and

lim_{T→0} (1/T) ∫_t^{t+T} e^{−α(τ−t)} (2u_{i+1}^T R(u − u_i) − 2γ² d_{i+1}^T(d − d_i)) dτ = 2u_{i+1}^T R(u − u_i) − 2γ² d_{i+1}^T(d − d_i).    (5.62)

Substituting (5.60)–(5.62) in (5.59) yields

−αV_i + V_{X_i}^T(F + G u_i + K d_i + G(u − u_i) + K(d − d_i)) + X^T Q_T X + u_i^T R u_i − γ² d_i^T d_i + 2u_{i+1}^T R(u − u_i) − 2γ² d_{i+1}^T(d − d_i) = 0.    (5.63)

Substituting the updated policies u_{i+1} and d_{i+1} from (5.54) and (5.55) into (5.63) gives the Bellman equation (5.53). This completes the proof. □

Remark 5.9 In the off-policy IRL Bellman equation (5.58), the control input u which is applied to the system can be different from the control policy u_i which is evaluated and updated. The fixed control policy u should be a stabilizing and exploring control policy. Moreover, in this off-policy IRL Bellman equation, the disturbance input d is the actual external disturbance that comes from a disturbance source and is not under our control, whereas the disturbance d_i is the disturbance policy that is evaluated and updated. One advantage of this off-policy IRL Bellman equation is that, in contrast to on-policy RL-based methods, the disturbance input applied to the system does not need to be adjustable.

The following algorithm uses the off-policy tracking Bellman equation (5.58) to iteratively solve the HJI Eq. (5.17) without requiring any knowledge of the system dynamics. The implementation of this algorithm is discussed in the next subsection, where it is shown how the data collected from a fixed control policy u are reused to evaluate many updated control policies u_i sequentially until convergence to the optimal solution is achieved.


Algorithm 5.2 Online off-policy RL algorithm for solving the tracking HJI equation
1. Phase 1 (data collection using a fixed control policy): Apply a fixed control policy u to the system and collect the required system information (state, control input, and disturbance) at N different sampling intervals of length T.
2. Phase 2 (reuse of collected data sequentially to find an optimal policy): Given u_i and d_i, use the information collected in Phase 1 to solve the following Bellman equation for V_i, u_{i+1}, and d_{i+1} simultaneously:

e^{−αT} V_i(X(t+T)) − V_i(X(t)) = ∫_t^{t+T} e^{−α(τ−t)} (−X^T Q_T X − u_i^T R u_i + γ² d_i^T d_i) dτ
   + ∫_t^{t+T} e^{−α(τ−t)} (−2u_{i+1}^T R(u − u_i) + 2γ² d_{i+1}^T(d − d_i)) dτ.    (5.64)

3. Stopping criterion: Stop if a stopping criterion is met; otherwise, set i = i + 1 and go to Step 2.
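The integrals that enter (5.64) can be pre-computed once from the Phase 1 recordings and then reused in every iteration of Phase 2. A minimal sketch of that bookkeeping, assuming the trajectory is stored on a fine sample grid (names and shapes are placeholders), is:

```python
import numpy as np

def discounted_integral(f_vals, tau, t0, alpha):
    """Approximate  int_{t0}^{t0+T} e^{-alpha (tau - t0)} f(tau) dtau  from samples
    f_vals[k] = f(tau[k]) taken over one reinforcement interval."""
    w = np.exp(-alpha * (tau - t0))
    return np.trapz(w * f_vals, tau)

# For each stored window [t, t+T] the integrands needed by (5.64) are, e.g.:
#   reward term:  -X^T Q_T X - u_i^T R u_i + gamma^2 d_i^T d_i
#   actor terms:  basis(X) * (u - u_i)_l  for each input channel l
# evaluated along the recorded trajectory; the resulting numbers feed the
# least-squares solve described in Sect. 5.2.3.2.
```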

Remark 5.10 Algorithm 5.2 has two separate phases. First, a fixed initial exploratory control policy u is applied, and the system information is recorded over time intervals of length T. Second, without requiring any knowledge of the system dynamics, the information collected in Phase 1 is repeatedly used to find a sequence of updated policies u_i and d_i converging to u* and d*. Note that Eq. (5.64) is a scalar equation and can be solved in the least-squares sense after collecting enough data samples from the system. It is shown in the following subsection how to collect the required information in Phase 1 and reuse it in Phase 2 in a least-squares sense to solve (5.64) for V_i, u_{i+1}, and d_{i+1} simultaneously. After the learning is done and the optimal control policy u* is found, it can then be applied to the system.

Theorem 5.5 (Convergence of Algorithm 5.2) The off-policy Algorithm 5.2 converges to the optimal control and disturbance solutions given by (5.16a) and (5.16b), where the value function satisfies the tracking HJI Eq. (5.17).

Proof It was shown in Lemma 5.1 that the off-policy tracking Bellman equation (5.64) gives the same value function as the Bellman equation (5.53) and the same updated policies as (5.54) and (5.55). Therefore, both Algorithms 5.1 and 5.2 have the same convergence properties. The convergence of Algorithm 5.1 is proven in White and Sofge (1992). This confirms that Algorithm 5.2 converges to the optimal solution. □

Remark 5.11 Although both Algorithms 5.1 and 5.2 have the same convergence properties, Algorithm 5.2 is a model-free algorithm that finds an optimal control policy without requiring any knowledge of the system dynamics. This is in contrast to Algorithm 5.1, which requires full knowledge of the system dynamics. Moreover, Algorithm 5.1 is an on-policy RL algorithm that requires the disturbance input to be specified and adjustable. On the other hand, Algorithm 5.2 is an off-policy RL algorithm that obviates this requirement.

5.2.3.2 Implementing Off-Policy RL Algorithm 5.2

In order to implement the off-policy RL Algorithm 5.2, it is required to reuse the collected information found by applying a fixed control policy .u to the system to solve Eq. (5.64) for .Vi , .u i+1 , and .di+1 iteratively. Three neural networks (NNs), i.e., the actor NN, the critic NN, and the disturber NN are used here to approximate the value function and the updated control and disturbance policies in the Bellman equation (5.64). That is, the solution .Vi , .u i+1 , and .di+1 of the Bellman equation (5.64) is approximated by three NNs as .

V̂_i(X) = Ŵ_1^T σ(X),    (5.65a)
û_{i+1}(X) = Ŵ_2^T φ(X),    (5.65b)
d̂_{i+1}(X) = Ŵ_3^T ϕ(X),    (5.65c)

where σ = [σ_1, ..., σ_l1] ∈ R^{l1}, φ = [φ_1, ..., φ_l2] ∈ R^{l2}, and ϕ = [ϕ_1, ..., ϕ_l3] ∈ R^{l3} are suitable basis function vectors, Ŵ_1 ∈ R^{l1}, Ŵ_2 ∈ R^{l2×m}, and Ŵ_3 ∈ R^{l3×q} are constant weight matrices, and l1, l2, and l3 are the numbers of neurons. Define v^1 = [v_1^1, ..., v_m^1]^T = u − u_i and v^2 = [v_1^2, ..., v_q^2]^T = d − d_i, and assume R = diag(r_1, ..., r_m). Then, substituting (5.65a)–(5.65c) in (5.64) yields

e(t) = Ŵ_1^T (e^{−αT} σ(X(t+T)) − σ(X(t)))
     − ∫_t^{t+T} e^{−α(τ−t)} (−X^T Q_T X − u_i^T R u_i + γ² d_i^T d_i) dτ
     + 2 Σ_{l=1}^{m} r_l ∫_t^{t+T} e^{−α(τ−t)} Ŵ_{2,l}^T φ(X) v_l^1 dτ
     − 2γ² Σ_{k=1}^{q} ∫_t^{t+T} e^{−α(τ−t)} Ŵ_{3,k}^T ϕ(X) v_k^2 dτ,    (5.66)

where e(t) is the Bellman approximation error, Ŵ_{2,l} is the l-th column of Ŵ_2, and Ŵ_{3,k} is the k-th column of Ŵ_3. The Bellman approximation error is the continuous-time counterpart of the temporal difference. In order to bring the temporal difference error to its minimum value, the least-squares method is used. To this end, rewrite Eq. (5.66) as

y(t) + e(t) = Ŵ^T h(t),    (5.67)

where

Ŵ = [Ŵ_1^T, Ŵ_{2,1}^T, ..., Ŵ_{2,m}^T, Ŵ_{3,1}^T, ..., Ŵ_{3,q}^T]^T ∈ R^{l1 + m l2 + q l3},    (5.68)




h(t) = [ (e^{−αT} σ(X(t+T)) − σ(X(t)))^T,
         2r_1 ∫_t^{t+T} e^{−α(τ−t)} φ(X)^T v_1^1 dτ, ..., 2r_m ∫_t^{t+T} e^{−α(τ−t)} φ(X)^T v_m^1 dτ,
         −2γ² ∫_t^{t+T} e^{−α(τ−t)} ϕ(X)^T v_1^2 dτ, ..., −2γ² ∫_t^{t+T} e^{−α(τ−t)} ϕ(X)^T v_q^2 dτ ]^T,    (5.69)

y(t) = ∫_t^{t+T} e^{−α(τ−t)} (−X^T Q_T X − u_i^T R u_i + γ² d_i^T d_i) dτ.    (5.70)

The parameter vector Ŵ, which gives the approximated value function, actor, and disturbance in (5.65a)–(5.65c), is found by minimizing the Bellman approximation error (5.66) in the least-squares sense. Assume that the system's state, input, and disturbance information is collected at N ≥ l1 + m l2 + q l3 (the number of independent elements in Ŵ) points t_1 to t_N in the state space, over the same time interval T as in Phase 1. Then, for a given u_i and d_i, one can use this information to evaluate (5.69) and (5.70) at the N points to form

H = [h(t_1), ..., h(t_N)]    (5.71)

and

Y = [y(t_1), ..., y(t_N)]^T.    (5.72)

The least-squares solution to (5.67) is then

Ŵ = (H H^T)^{−1} H Y,    (5.73)

which gives V_i, u_{i+1}, and d_{i+1}.

Remark 5.12 Note that although X(t+T) appears in Eq. (5.66), this equation is solved in the least-squares sense after observing the N samples X(t), X(t+T), ..., X(t+NT). Therefore, knowledge of the system is not required to predict the future state X(t+T) at time t in order to solve (5.66).
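A compact sketch of this least-squares step is given below. It assumes the Phase 1 integrals appearing in (5.69) and (5.70) have already been evaluated for N data windows and stored in arrays; the array names and shapes are placeholders, not the book's code.

```python
import numpy as np

def solve_bellman_lsq(sigma_diff, phi_int, varphi_int, y, r_diag, gamma):
    """One least-squares solve of (5.67)-(5.73).

    sigma_diff : (N, l1)     e^{-alpha T} sigma(X(t_k+T)) - sigma(X(t_k))
    phi_int    : (N, m, l2)  int e^{-alpha(tau-t)} phi(X) (u - u_i)_l dtau, per channel l
    varphi_int : (N, q, l3)  int e^{-alpha(tau-t)} varphi(X) (d - d_i)_k dtau, per channel k
    y          : (N,)        reward integral (5.70) on each window
    """
    N, l1 = sigma_diff.shape
    _, m, l2 = phi_int.shape
    _, q, l3 = varphi_int.shape
    rows = []
    for k in range(N):
        blocks = [sigma_diff[k]]
        blocks += [2.0 * r_diag[l] * phi_int[k, l] for l in range(m)]
        blocks += [-2.0 * gamma**2 * varphi_int[k, j] for j in range(q)]
        rows.append(np.concatenate(blocks))             # h(t_k) of (5.69)
    H = np.stack(rows, axis=1)                          # (l1 + m*l2 + q*l3) x N, Eq. (5.71)
    W = np.linalg.solve(H @ H.T, H @ np.asarray(y))     # Eq. (5.73)
    W1 = W[:l1]                                         # critic weights
    W2 = W[l1:l1 + m*l2].reshape(m, l2).T               # actor weights, columns W2_l
    W3 = W[l1 + m*l2:].reshape(q, l3).T                 # disturbance weights
    return W1, W2, W3
```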

5.2.4 Simulation Examples

In this subsection, the proposed off-policy IRL algorithm is applied to control a nonlinear system. The dynamic equations of the system are given by


Fig. 5.1 The blue curve is the first state of the system x_1 and the black curve is the reference trajectory

ẋ_1 = x_2 + d,
ẋ_2 = −x_1³ + 0.5 x_2 + u,    (5.74)

and the reference signal is generated by the following command generator:

ṙ = [ 0  1 ; −1  0 ] r,    (5.75)

where r = [r_1, r_2]^T. The state vector for the augmented system is X = [e_1, e_2, r_1, r_2]^T with e_1 = x_1 − r_1 and e_2 = x_2 − r_2. A power-series neural network with 45 activation functions is constructed for the critic; it contains the even-degree monomials of the augmented state variables up to order four. The activation functions for the control and disturbance policies are chosen as polynomials of all powers of the states up to order four. We now implement Algorithm 5.2 to find the optimal control solution online. The reinforcement interval is chosen as T = 0.01. The proposed algorithm starts the learning process at the beginning of the simulation and finishes it after about 15 s, when the control policy is updated. The state trajectories of the closed-loop system and the reference trajectory are shown in Figs. 5.1 and 5.2. From these figures, it is clear that the system tracks the reference trajectory after the learning is finished and the optimal controller is found.
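As an illustration of the critic basis described above (its exact construction is an assumption, since only the count of 45 even-degree terms is stated in the text), the monomials of total degree 2 and 4 in the four augmented states can be generated as follows; their number is 10 + 35 = 45.

```python
import itertools
import numpy as np

def even_poly_basis(X):
    """All monomials of total degree 2 and 4 in X = [e1, e2, r1, r2]."""
    feats = []
    for degree in (2, 4):
        for combo in itertools.combinations_with_replacement(range(len(X)), degree):
            term = 1.0
            for idx in combo:
                term *= X[idx]
            feats.append(term)
    return np.array(feats)

print(len(even_poly_basis(np.zeros(4))))   # -> 45
```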

5.3 Off-Policy Integral Reinforcement Learning for Distributed Minmax Strategy of Multiplayer Games

This section introduces a distributed minmax strategy for multiplayer games and proposes an off-policy IRL algorithm to solve this strategy. The key characteristic of the proposed minmax strategy is its distributed nature, where each player determines its optimal control policy by considering the other players as disturbances. This approach involves solving a distributed algebraic Riccati equation (ARE) within


Fig. 5.2 The blue curve is the second state of the system .x2 and the black curve is the reference trajectory

a multiplayer zero-sum game, which extends the concept of two-player zero-sum games. The resulting control policy is devised to counteract the worst-case policies adopted by the other players. The section also investigates the existence of distributed minmax solutions and analyzes their . L 2 and asymptotic stabilities. Through mild conditions, it is demonstrated that the obtained minmax control policies offer improved robust gain and phase margins for multiplayer systems, surpassing the standard linear quadratic regulator controls. To find the distributed minmax solutions, model-based policy iteration and data-driven off-policy IRL algorithms are developed. The formulated concepts and algorithms are validated through simulation examples.

5.3.1 Formulation of Distributed Minmax Strategy

Consider a multiplayer system

ẋ(t) = Ax(t) + Σ_{i=1}^{N} B_i u_i(t),    (5.76)

i=1 ∆

where .x ∈ Rn is the system state and .u i ∈ Rm i is the control input of player .i ∈ N = {1, 2, . . . , N }. The constant system dynamic matrices are. A ∈ Rn×n and. Bi ∈ Rn×m i , where. Bi can be different from. B j for.i /= j ∈ N. The pair.(A, Bi ),.i ∈ N is assumed to be stabilizable. We formulate a distributed minmax strategy for multiplayer games to solve completely distributed ARE without knowing the other players’ strategies. This strategy finds each player the optimal control policy by regarding all the other players as disturbances and solving for their worst-case policies in a distributed ARE. The performance index, also known as cost function, of each player .i ∈ N is defined as


V_i(x(0), u_i, u_{−i}) = ∫_0^∞ ( x^T Q_i x + u_i^T R_ii u_i − γ_i² Σ_{j∈−i} u_j^T R_ij u_j ) dτ = x^T P_i x,    (5.77)

where .−i ≜ {1, . . . , i − 1, i + 1, . . . , N }, . Pi = PiT ∈ Rn×n > 0 is the cost matrix. In addition,. Q i = Q iT ≥ 0,. Rii = RiiT > 0,. Ri j = RiTj > 0, and.(Q i , A) is observable. Definition 5.3 (Minmax Strategy of Multiplayer Games) The minmax strategy of player .i ∈ N in (5.77) is ∆

u_i* = arg min_{u_i} max_{u_{−i}} V_i(x, u_i, u_{−i}).    (5.78)

u −i

Define the Hamiltonian function associated with (5.77) as

H_i(x, u_i, u_{−i}) = x^T Q_i x + u_i^T R_ii u_i − γ_i² Σ_{j∈−i} u_j^T R_ij u_j + 2x^T P_i ( Ax(t) + Σ_{j=1}^{N} B_j u_j ).    (5.79)

The optimal control input of player .i , denoted by .u i∗ , and the worst-case input of i ,u −i ) another player . j, denoted by .vi∗j , satisfy stationary conditions . ∂ Hi (x,u = 0 and ui .

∂ Hi (x,u i ,u −i ) uj

= 0, respectively. Therefore, we have u ∗ = −K i∗ x, ∗ ∗ .vi j = −K i j x, . i

(5.80a) (5.80b)

where ∆

K i∗ = Rii−1 BiT Pi , 1 −1 T ∗ ∆ .Ki j = − R B Pi , γi2 i j j .

(5.81a) (5.81b)

and . Pi is the unique stabilizing solution of the distributed algebraic Riccati equation (ARE) .

AT Pi + Pi A − Pi Bi Rii−1 BiT Pi +

1 ∑ T Pi B j Ri−1 j B j Pi + Q i = 0. γi2 j∈−i

(5.82)

130

5 Integral Reinforcement Learning for Zero-Sum Games

Note that ARE (5.82) is distributed because .u i∗ in (5.80a) and .vi∗j in (5.80b) update by the same . Pi . Remark 5.13 It is shown in van der Schaft (1992) that there exists a critical scalar γ ∗ > 0 such that the ARE (5.82) has a positive-definite solution for any.γi > γi∗ > 0. Therefore, the existence of minmax control policy (5.80a) is easily guaranteed by selecting a large enough .γi . This is in contrast to solving for Nash equilibrium by solving coupled AREs. It is generally shown that a unique stabilizing solution to coupled AREs exists under certain conditions (Basar 1976; Papavassilopoulos and Cruz 1979).

. i

Remark 5.14 Note that .u i∗ (5.80a) is the actual control input for each player .i, j ∈ N, while .vi∗j (5.80b) is the virtual worst control input that player .i assumes for its neighbor . j. This virtual control input is not applied to the system (5.76) by player . j. Remark 5.15 With the minmax strategy policy (5.80a), the ARE (5.82) is distributed in the sense that the solution . Pi does not depend on . P j , j ∈ −i. By contrast, the coupled AREs for the non-distributed Nash equilibrium solve . Pi requiring . P j , j ∈ −i. Moreover, as shown in Lian et al. (2022), it is hard to solve coupled AREs and it is more efficient to compute (5.82) than Nash control strategy using RL algorithms. Remark 5.16 This result is an extension of existing zero-sum games (Chen et al. 2022; Zhu and Zhao 2020; Gokcesu and Kozat 2018; Li et al. 2017; Modares et al. 2015) in two aspects. First, (Chen et al. 2022; Zhu and Zhao 2020; Gokcesu and Kozat 2018) mainly focus on two-player . H∞ control problems that find the optimal control input to reject the worst disturbance, whereas we extend them to more general multiplayer cases. Second, (Li et al. 2017; Modares et al. 2015) prove the asymptotic stability of systems by assuming zero disturbances, but our system (5.76) employs the minmax strategy (5.81a) for each player and the asymptotic stability is guaranteed as shown in later Theorem 5.8. The next result demonstrates that the control policy (5.80a), with . Pi satisfying the distributed ARE (5.82), is the minmax strategy for the player .i. This result assumes the asymptotic stability of the system (5.76) using the control policy (5.80a). This asymptotic stability is proven in Chap. 5.3.2. Theorem 5.6 Let player .i ∈ N in system (5.76) use the control policy (5.80a) where Pi is the unique stabilizing solution of the distributed ARE (5.82). Assume that the control policy (5.80a) asymptotically stabilizes the system (5.76). Then, each player in the game has the minmax strategy as defined in Definition 5.3. Furthermore, the minmax value of each player is .Vi (x(0)).

.

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

131

Proof We rewrite the function .Vi in (5.77) as .

Vi (x(0), u i , u −i ) ) ∫ ∞( ∑ T T 2 T x Q i x + u i Rii u i − γi = u j Ri j u j dτ 0

j∈−i

∫ −



V˙i (x)dτ +



0





2x T Pi ⎝ Ax(t) +

0

N ∑

⎞ B j u j ⎠ dτ.

(5.83)

j=1

Since the closed-loop system∫(5.76) is globally asymptotically stable, i.e., .x(t) → ∞ 0 with .t → ∞, then, we have . 0 V˙i (x)dτ = Vi (x(∞)) − Vi (x(0)) = −Vi (x(0)). Besides, . Pi is the unique stabilizing solution of the distributed ARE (5.82). Then, . Vi (x(0), u i , u −i )

=

∫ ∞( 0

− γi2

u iT Rii u i − 2u iT Rii u i∗ + u i∗ TRii u i∗ − γi2



vi∗j TRi j vi∗j + 2γi2

j∈−i

=

∫ ∞( 0



)



u Tj Ri j u j

j∈−i

u Tj Ri j vi∗j dτ + Vi (x(0))

j∈−i

(u i − u i∗ )∗ TRii (u i − u i∗ ) − γi2



) (u j − vi∗j )T Ri j (u j − vi∗j ) dτ + Vi (x(0)).

j∈−i

(5.84) Select .u j = vi∗j , . j ∈ −i. Then, .

Vi (x(0), u i , u ∗−i ) =

∫ 0



(u i − u i∗ )T Rii (u i − u i∗ )dτ + Vi (x(0)),

(5.85)

which shows that .u i∗ (5.80a) is the control policy following the minmax strategy in Definition 5.3. The player .i has the optimal game value .Vi (x(0)). The proof is completed. .◻

5.3.2 Stability and Robustness of Distributed Minmax Strategy

This section studies the stability of the multiplayer game (5.76) where each player strictly follows the actual distributed minmax strategy policy. Then, we show that this policy leads to improved gain margin (GM) and phase margin (PM) compared to the standard LQR controller under mild conditions.

132

5 Integral Reinforcement Learning for Zero-Sum Games

The next theorem shows that the multiplayer system (5.76) with minmax control policy (5.80a) is (i) . L 2 -gain stable associated with (5.77). Theorem 5.7 (. L 2 Stability) The system (5.76) is . L 2 -stable with . L 2 -gain bound .γi for control input player .i with respect to its minmax control policy .u i∗ in (5.80a) and . Pi satisfying (5.82). Proof It is seen from (5.84) that one has . Vi (x(0), u i , u −i )

= =

.

∫ ∞( 0

∫ ∞( 0

x T Q i x + u iT Rii u i − γi2



) u Tj Ri j u j dτ

j∈−i

(u i − u i∗ )T Rii (u i − u i∗ ) − γi2



) (u j − vi∗j )T Ri j (u j − vi∗j ) dτ + Vi (x(0)).

(5.86)

j∈−i

Since . L 2 stability must hold for all initial conditions, let .x(0) = 0. We have Vi (x(0)) = 0. Using .u i = u i∗ in (5.86) yields ) ∫ ∞( ∑ x T Q i x + u iT Rii u i − γi2 . u Tj Ri j u j dτ 0

j∈−i





=− 0

This implies ∫ .

0



(

γi2



(u j − vi∗j )T Ri j (u j − vi∗j )dτ ≤ 0.

(5.87)

j∈−i

) x T Q i x + u iT Rii u i dτ ≤





0

γi2



u Tj Ri j u j dτ.

(5.88)

j∈−i

T 2 T Define the player .i’s ∑rewardTas .‖z i ‖ = x Q i x + u i Rii u i . Take the other players 2 as rivals and .‖σi ‖ = j∈−i u j Ri j u j will be the antagonistic inputs. Then, based on (5.88), we have

‖z i ‖ ≤ γi ‖σi ‖.

(5.89)

.

This shows that (5.76) is . L 2 -stable with . L 2 -gain bounded by .γi . The proof is com.◻ pleted. Using the minmax strategy in terms of control policy .u i∗ (5.80a) for each player in (5.76), we have the closed-loop dynamics ⎛ .

x˙ = ⎝ A −

N ∑ j=1

⎞ T ⎠ x. B j R −1 j j B j Pj

(5.90)

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

133

The next theorem proves that (5.90) is globally asymptotically stable. Theorem 5.8 (Globally Asymptotic Stability of (5.90)) Consider minmax strategy (5.78). Using the distributed minmax strategy policy (5.80a) in (5.76) for each player, the system (5.90) is globally asymptotically stable.

Proof We now use Lyapunov stability theory to prove that (5.90) is asymptotically ∑ stable. Select Lyapunov candidate function . L(x) = x T (t) Nj=1 P j x(t). Note that ∑N . j=1 P j > 0, where . P j is the unique stabilizing solution in (5.82). Then, we have ˙ . L(t)

= x˙ T

N ∑

Pj x + xT

j=1

N ∑ j=1



= 2x T ⎝ A −

N ∑

P j x˙ ⎞T

B j K ∗j ⎠

j=1

= x T AT

N ∑

= xT

Pj x

j=1

Pj x + xT

j=1 N ∑

N ∑

N ∑

P j Ax − 2x T

j=1

−1 T Pi Bi Rii Bi Pi x − x

N ∑

Q i x − 2x T

i=1

= −x T

N ∑

N ∑

γ2 i=1 j∈−i i

K ∗j TB Tj

j=1

K ∗j TB Tj Pi x −x T

i=1 j∈−i

= −x T

(∑ N i=1

N ∑ ∑ ( N i=1

T Pi B j Ri−1 j B j Pi x

) Pj x

γ2 i=1 j∈−i i N ∑

Pi

i=1

Qi − x T

) Pj x

j=1

N ∑ ∑ 1 T

i=1

− xT

N (∑

N (∑

j=1

−1 T Pi Bi Rii Bi Pi x − x

N ∑ ∑

K ∗j TB Tj

j=1 N ∑ ∑ 1 T

i=1

− xT

N ∑



T Pi B j Ri−1 j B j Pi x

B j K ∗j x −x T

j∈−i

N ∑

Qi x

i=1

N ∑ ) ) −1 T ( P j Bi Rii Bi Pj

j=1

j=1

) ∑ ) −1 T −1 T P B R B P + P B R B P i j ij j i ii j x j i i γ2 j∈−i j∈−i i=1 j∈−i i (∑ N N ∑ ∑ −1 T ≤ −‖x‖2 λmin (Q i ) − ‖ P j ‖2 ‖Bi Rii Bi ‖ −

N ∑ ( ∑ 1

i=1

+

N 1 ∑ ∑



i=1

j∈−i

T ‖Pi ‖2 ‖B j Ri−1 j B j ‖+‖



)

−1 T P j ‖2 ‖Bi Rii Bi ‖

γi2 i=1 j∈−i j∈−i (∑ ) N ∑ N ∑ 1 T‖ λmin (Q i ) + 2 ‖Pi ‖2 ‖B j Ri−1 B ≤ −‖x‖2 j j γi i=1 j∈−i i=1 ≤ 0.

(5.91)

134

5 Integral Reinforcement Learning for Zero-Sum Games

Then, (5.90) is stable. Using LaSalle’s extension (Khalil 2015), it can be verified that . L˙ = 0 if only if .x = 0. The system (5.90) is globally asymptotically stable. .◻ System robustness is shown against gain perturbations or inaccurate phase provided by the minmax control policy. The subsequent analysis is based on the direct manipulation of the distributed ARE (5.82) by using the return difference and analyzing singular value properties. The proposed minmax strategies are shown to improve GM and PM compared to the standard LQR controller under mild conditions. The next assumption and preliminary return difference results are provided first. Assumption 5.1 Assume . Bi , .i ∈ N has full column rank. Assumption 5.2 Under Assumption 5.1, select sufficiently large .λi , such that .

∑ 1 T T −1 (BiT Bi )−1 BiT B j Ri−1 > 0. j B j Bi (Bi Bi ) 2 γi j∈−i

Rii−1 −

(5.92)

Remark 5.17 Note that given large .γi , Remark 5.13 implies .

Bi Rii−1 BiT −

1 ∑ T B j Ri−1 j B j ≥ 0, γi2 j∈−i

(5.93)

which ensures the stabilizing solution . Pi in the ARE (5.82). Given Assumption 5.1, (5.93) directly infers (5.92). Assumption 5.1 is necessary for the robustness analysis using the return difference equations, while Assumptions 5.1 and 5.2 are not necessary for the stability of (5.90) and the later RL algorithms. Theorem 5.9 (Return Difference Equation of (5.82) and Minimum Singular Value) Suppose Assumptions 5.1 and 5.2 hold. Then, given distributed ARE (5.82), we have (I + G¯ i (s)) H R¯ ii (I + G¯ i (s)) = R¯ ii + Fi (s),

.

ˆ i (s)] ≥ 1, s = jω ∈ jR, .σ [I + G

(5.94a) (5.94b)

where . .

∆ G¯ i (s) = R¯ ii−1 Rii G i (s),

G i (s) = Rii−1 BiT Pi (s I − A)−1 Bi , ∑ 1 ∆ T T −1 ¯ ii−1 = B j Ri−1 .R Rii−1 2 (BiT Bi )−1 BiT j B j Bi (Bi Bi ) , γi j∈−i − ∆ Gˆ i (s) = R¯ ii2 G¯ i (s) R¯ ii 2 , 1

.

(5.95a)



.



Fi (s) =

BiT (−s I

(5.95b) (5.95c)

1

T −1

−A )

(5.95d) Q i (s I − A)

−1

Bi .

(5.95e)

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

135

Proof The frequency-domain representation of (5.82) is 0 = −(−s I − AT )Pi − Pi (s I − A) + Q i ) ( 1 ∑ T B j Ri−1 − Pi Bi Rii−1 BiT − 2 j B j Pi . γi j∈−i

.

(5.96)

Based on Assumptions 5.1 and 5.2, there exists . R¯ ii > 0 in (5.95b), such that .

1 ∑ T Bi R¯ ii−1 BiT = Bi Rii−1 BiT − 2 B j Ri−1 j Bj . γi j∈−i

(5.97)

Then, (5.96) is rewritten as 0 = −(−s I − AT )Pi − Pi (s I − A) − Pi Bi R¯ ii−1 BiT Pi + Q i .

.

(5.98)

Multiplying . BiT (−s I − AT )−1 and .(s I − A)−1 Bi on the left and right sides of (5.98), respectively, yields .

BiT (−s I − AT )−1 Q i (s I − A)−1 Bi = BiT Pi (s I − A)−1 Bi + BiT (−s I − AT )−1 Pi Bi + BiT (−s I − AT )−1 Pi BiT R¯ ii−1 Bi Pi (s I − A)−1 Bi .

(5.99)

Adding . R¯ ii to both sides of (5.99) gives R¯ ii + BiT (−s I − AT )−1 Q i (s I − A)−1 Bi [ ]H [ ] = I + R¯ ii−1 BiT Pi (s I − AT )−1 Bi R¯ ii I + R¯ ii−1 BiT Pi (s I − A)−1 Bi [ ]H ] [ (5.100) = I + G¯ i (s) R¯ ii I + G¯ i (s) .

.

With .G¯ i (s) (5.95a) and . Fi (s) (5.95e), we rewrite (5.100) as (5.94a). Multiplying

1 ¯ ii− 2 .R

on both sides of (5.100) gives [ .

− I + R¯ ii2 G¯ i (s) R¯ ii 2

1

1

]H [

− I + R¯ ii2 G¯ i (s) R¯ ii 2 1

1

]

− − = I + R¯ ii 2 BiT (−s I − AT )−1 Q i (s I − A)−1 Bi R¯ ii 2 . 1

1

(5.101)

It is seen that . R¯ ii > 0 (Assumption 5.2) and . Q i ≥ 0. Then, considering minimum singular values, we have −1

−1

σ [ R¯ ii 2 Fi (s) R¯ ii 2 ] ≥ 0.

.

(5.102)

136

5 Integral Reinforcement Learning for Zero-Sum Games

Define .Gˆ i (s) in (5.95c). Then, we obtain (5.94b). The proof is completed.



.

With the perturbation . Fi (s) added to .G i (s), .s = jω ∈ jR, the system becomes G˜ i (s) = Fi (s)G i (s) = Fi (s)Rii−1 R¯ ii G¯ i (s). The next lemma and theorem give stability analysis for .G˜ i (s), and provide the GM and PM of the minmax strategy.

.

Lemma 5.2 The perturbed system .G˜ i (s), s = jω ∈ jR, is closed-loop asymptotically stable if the following hold (Lehtomaki et al. 1981; Lian et al. 2021): (i) The system .Gˆ i (s) is closed-loop asymptotically stable; ˜ have the same number of closed right half plane (ii) det(.s I − A) and det(.s I − A) ˜ then det(. jω0 I − A)=0, where . A˜ is the state zeros and, if det(. jω0 I − A)=0, dynamics of the perturbed system .G˜ i (s); (iii) . R¯ ii Fi (s)Rii−1 R¯ ii + (Fi Rii−1 R¯ ii ) H (s) R¯ ii ≥ R¯ ii ; (iv) .Fi (s)Rii−1 R¯ ii + (Fi Rii−1 R¯ ii ) H (s) ≥ 0. Theorem 5.10 (GM and PM of Distributed Minmax Strategy) Let . Rii be diagonal. Follow Theorem 5.8 and Theorem 5.9. Suppose Lemma 5.2’s Conditions (ii), (iii), and (iv) hold. Then, we have (1) The perturbed system .G˜ i (s) is closed-loop asymptotically stable; (2) For each player .i, we have a < G M < ∞, 2 ◦ .|P M| < 60 + ωi , .

(5.103a) (5.103b)

where ωi

.

a R˜ ii

√ 12 − 3a 2 = ar ccos( ), 4 = 1 − λmax ( R˜ ii ), ∑ 1 T T −1 = 2 (BiT Bi )−1 BiT B j Ri−1 j B j Bi (Bi Bi ) Rii . γi j∈−i a+

Proof (1) The distributed ARE (5.82) has positive-definite solutions. Thus, one guarantees that .G i (s) is closed-loop asymptotically stable. Then, given Conditions (ii), (iii), and (iv), and Lemma 5.2, one infers that .G˜ i (s) is closed-loop asymptotically stable. (2) Note that (iii) in Lemma 5.2 implies .

( 1 ) −1 σ¯ R¯ ii2 (Fi Rii−1 R¯ ii )−1 R¯ ii 2 − I ≤ 1.

(5.104)

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

137

Based on (Lehtomaki et al. 1981; Lian et al. 2021), take .Fi (s) as diagonal given by Fˆ i (s) = diag{ f i,1 (s), . . . , f i,k (s), . . . , f i,m i (s)}, k ∈ m i .

(5.105)

.

As . Rii and .Fi (s) are diagonal, (5.104) is rewritten as .

Note that

.

( ) σ¯ R¯ ii−1 Rii Fi−1 (s) − I ≤ 1.

R¯ ii−1 Rii = I − R˜ ii

with

.

(BiT Bi )−1 Rii . (5.106) becomes .

R˜ ii =

1 (BiT Bi )−1 BiT γi2

( ) σ¯ (I − R˜ ii )Fi−1 (s) − I ≤ 1.

(5.106) ∑ j∈−i

T B j Ri−1 j B j Bi

(5.107)

Condition (iv) in Lemma 5.3 indicates that the real part of . f i,k (s) is positive. To obtain GM, we denote .Re( f i,k (s)) as the real part of . f i,k (s) (Lehtomaki et al. 1981). (5.107) implies | | | | |(Re( f i,k (s)))−1 (1 − λk ( R˜ ii )) − 1| < 1,

.

(5.108)

such that .

1 (1 − λmax ( R˜ ii )) < Re( f i,k (s)) < ∞, 2

(5.109)

1 (1 − λmax ( R˜ ii )) < G M < ∞, 2

(5.110)

i.e., .

where . R˜ ii is defined in (5.103b). Note that . I − R˜ ii > 0, i.e., .1 − λmax ( R˜ ii ) > 0 from Assumption 5.2. We thus yield the GM of the minmax policy (5.81a). Similarly, to obtain PM, let. f i,k (s) = eφi,k (s) , where.φi,k (s) is real (Lehtomaki et al. 1981; Lian et al. 2021). Then, (5.108) becomes .

| | |cos(φi,k (s))| < 1 + ωi , 2

(5.111)

i.e., |P M|
0 be the solution of (5.82). It follows from Vrabie et al. (2009) that the sequence generated by (5.117) is equivalent to Newton’s method and converges to ∞ the optimal solution of distributed ARE (5.82), as .h → ∞. Furthermore, .{K ih }i=1 h ∞ ∗ ∗ and .{K i j }i=1 converge to . K i (5.81a) and . K i j (5.81b), respectively. .◻ 5.3.3.2

Data-Driven Off-Policy IRL for Minmax Solutions

This subsection finds a data-driven algorithm that is equivalent to the model-based Algorithm 5.3, but without the need to know system dynamics . A or . Bi , .∀i ∈ N. The data-driven algorithm solves minmax strategies using only online observed system trajectories .(x, u 1 , . . . , u N ) of (5.76). We use off-policy (Jiang et al. 2012) and IRL techniques (Vrabie et al. 2009) to derive data-driven formulation for (5.114)– (5.115b). Rewrite (5.76) with auxiliary inputs .u ih = −K ih x and .vihj = −K ihj x as

140

5 Integral Reinforcement Learning for Zero-Sum Games .

x˙ = Ax + Bi u ih +



B j vihj

j∈−i

+ Bi (u i − u ih ) +



B j (u j − vihj ).

(5.118)

j∈−i

Putting (5.118) into .V˙ih (x, Pih ) = x˙ T Pih x + x T Pih x˙ yields .

x˙ T Pih x + x T Pih x˙ ( )T ∑ B j K ihj Pih x = 2x T A − Bi K i − j∈−i

+ 2(u i −

u ih )T BiT Pih x

+2



(u j − vihj )T B Tj Pih x

j∈−i

= −x Q i x −

(u ih )T Rii u ih

T

+ γi2



(vihj )T Ri j vihj

j∈−i

+ 2(u i −

u ih )T BiT Pih x

+2



(u j − vihj )T B Tj Pih x.

(5.119)

j∈−i

It is seen from (5.115a) and (5.115b) that .

BiT Pih = Rii K ih+1 ,

T h . B j Pi

=

(5.120a)

−γi2 Ri j K ih+1 j .

(5.120b)

Based on (5.119), (5.120a), and (5.120b), we integrate both sides from .t to .t + ∆t to yield the data-driven formulation of (5.114) as .

x T (t + ∆t)Pih x(t + ∆t) − x T (t)Pih x(t) ⎛ ⎞ ∫ t+∆t ∑ ⎝x T Q i x + (u ih )T Rii u ih − γi2 =− (vihj )T Ri j vihj ⎠ dτ t

∫ +

t+∆t

t

∫ −

t

j∈−i

2(u i − u ih )T Rii K ih+1 xdτ

t+∆t

2γi2



(u j − vihj )T Ri j K ih+1 j dτ,

(5.121)

j∈−i

where .∆t is a small integral time. An online data-driven off-policy IRL algorithm to solve distributed minmax solutions is presented in Algorithm 5.4. Remark 5.19 In Algorithms 5.3 and 5.4, we only need the initial stabilizing policies K i0 without using a stabilizing . K i0j . We do not need stabilizing . K i0j and, instead, set 0 . K i j = 0 because .γi is set large to guarantee the existence of the solution of (5.82). .

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

141

Algorithm 5.4 Off-policy IRL algorithm for solving the distributed minmax strategy 1. Initialization: Select initial stabilizing control policies K i0 and K i0j = 0 for i ∈ N, j ∈ −i. Set small thresholds ei and h = 0. Apply u i = −K i x + ϵi to (5.118), where K i is a fixed stabilizing policy and ϵi is a small probing noise. by 2. Solve control feedback gains K ih+1 and K ih+1 j x T (t + ∆t)Pih x(t + ∆t) − x T (t)Pih x(t) ∫ t+∆t − 2(u i − u ih )T Rii K ih+1 xdτ ∫

t

t+∆t

+ t

∫ =−

2γi2 ⎛

t+∆t



(u j − vihj )T Ri j K ih+1 j xdτ

j∈−i

⎝x Q i x

t

(5.122)

T

+ (u ih )T Rii u ih



− γi2

⎞ (vihj )T Ri j vihj ⎠ dτ.

j∈−i

3. Update control inputs u ih+1 and vih+1 by j u ih+1 = −K ih+1 x, h+1 .vi j

=

(5.123a) (5.123b)

−K ih+1 j x.

4. Stop if ‖Pih+1 − Pih ‖ ≤ ei , ∀ i ∈ N. Otherwise, set h → h + 1, and go to Step 2.

Also, as shown in Lemma 5.3, only initial stabilizing policies . K i0 are needed to guarantee convergence. Lemma 5.4 Algorithm 5.4 is equivalent to Algorithm 5.3 and has the same convergence. Proof We divide both sides of (5.114) by .∆t and take the limit .

lim

∆t→0

x T (t + ∆t)Pih x(t + ∆t) − x T (t)Pih x(t) ∫ t+∆t



∆t

∫ t+∆t − lim

∆t→0

t

2(u i − u ih )T Rii K ih+1 xdτ ∆t

h+1 h T j∈−i (u j − vi j ) Ri j K i j xdτ t + lim ∆t ∆t→0 ) ∫ t+∆t ( T ∫ t+∆t 2 ∑ h T h −x Q i x − (u ih )T Rii u ih dτ γi t j∈−i (vi j ) Ri j vi j dτ t = lim + lim . ∆t ∆t ∆t→0 ∆t→0

2γi2

(5.124)

142

5 Integral Reinforcement Learning for Zero-Sum Games

Based on L’Hopital’s rule, (5.124) is rewritten as ⎛ .

x T Pih ⎝ Ax + Bi u ih +



B j vihj + Bi (u i − u ih ) +

j∈−i

− 2(u i −

u ih )T Rii K ih+1 x

+

2γi2





⎞ B j (u j − vihj )⎠

j∈−i

(u j −

vihj )T Ri j K ih+1 j x

j∈−i

= −x T Q i x − (u ih )T Rii u ih + γi2



(vihj )T Ri j vihj .

(5.125)

j∈−i

Then, putting .u ih+1 (5.123a) and .v h+1 (5.123b) into (5.125), and removing .x j from both sides, gives (5.114). This infers that (5.122) is equivalent to (5.114). Thus, the data-driven Algorithm 5.4 has the same convergence as the model-based .◻ Algorithm 5.3. Online Implementation of Algorithm 5.4 This subsection provides the online implementation of Algorithm 5.4 by using sampled real-time trajectory data. First, we rewrite (5.122) as (vev(x(t + ∆t)) − vev(x(t)))T vem(Pih ) ∫ t+∆t + 2x T ⊗ [(u i − u ih )T Rii ]dτ vec(K ih+1 ) t ∑ ∫ t+∆t 2x T ⊗ [(u j − u hj )T Ri j ]dτ vec(K ih+1 − j )

.

j∈−i



t

t+∆t

=− t

( vev(x(t))T vem(Q i ) + vev(u i )T vem(Rii ) ) ∑ vev(vi j )T vem(Ri j ) dτ. − γi2 j∈−i

To solve (5.126), we use BLSs and define operators d

. xx

Iu i u ih

[ = vev(x(t + ∆t)) − vev(x(t)), . . . ,

]T vev(x(t + l∆t)) − vev(x(t + (l − 1)∆t)) , [ ∫ t+∆t = x ⊗ [Rii (u i − u ih )]dτ, . . . , t

t+l∆t

t+(l−1)∆t t+∆t

[∫ Iu j vihj =



t

]T x ⊗ [Rii (u i − u ih )]dτ

x ⊗ [Ri j (u j − vihj )]dτ, . . . ,

,

(5.126)

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

∫ [∫ [∫ [∫ Ivihj vihj =

]T x ⊗ [Ri j (u i −

t+(l−1)∆t t+∆t

Ix x = Iu ih u ih =

t+l∆t



vihj )]dτ

, ]T

t+l∆t

vev(x)dτ, . . . ,

t

t t+∆t

∫ vev(u ih )dτ, . . . ,

t+l∆t t+(l−1)∆t

∫ vev(vihj )dτ, . . . ,

,

vev(x)dτ t+(l−1)∆t

t+∆t

t

143

t+l∆t t+(l−1)∆t

]T vev(u ih )dτ

, ]T

vev(vihj )dτ

,

(5.127)

where .l is the group number of sampling data. Note that in (5.126), . Pih , . K ih+1 , . K ih+1 for .i ∈ N, . j ∈ −i have .n(n + 1)/2 + j ∑N nm unknown independent elements in total. Small probing noises are added i i=1 to guarantee the persistent excitation of the system (5.118). Then. .l ≥ n(n + 1)/2 + ∑N nm data tuples in (5.127) are collected, such that the batch least squares i i=1 method uniquely solves them by .

[ ] h+1 T T T vem(Pih )T vec(K i1 ) . . . vec(K ii+1 )T . . . vec(K ih+1 N ) = (ηiT ηi )−1 ηiT ξi , i ∈ N,

(5.128)

where .



K i = K ii , [ h ηi = dx x , −2Iu 1 vi1h , . . . , −2Iu i−1 vi(i−1) , 2Iu i u ih , ] h , . . . , −2Iu N vihN , − 2Iu i+1 vi(i+1) ∑ ξi = −Ix x vem(Q i ) − Iu ih u ih vem(Rii ) + Ivihj vihj vem(Ri j ). j∈−i

Theorem 5.11 (Convergence of Online Implementation) Implement Algorithm 5.4 by (5.128). Select probing noise .ϵi in Algorithm 5.4. Then, (5.128) has converged solutions, i.e., . Pih → Pi in (5.82), . K ih → K i∗ in (5.81a), and . K ihj → K i∗j in (5.81b), where .i ∈ N, j ∈ −i. Proof First, note that the data-driven implementation (5.128) is rigorously derived from Algorithm 5.4. The∑persistence excitation condition is satisfied in Algorithm 5.4, N nm i data tuples in (5.127) are collected. Then, (5.128) and .l ≥ n(n + 1)/2 + i=1 solves the same unique solutions . Pi∞ , . K i∞ , . K i∞j as that of Algorithm 5.4. Second, it follows from Lemmas 5.3–5.4 that one concludes that (5.128) is equivalent to Algorithm 5.3 and has the same convergence properties. Therefore, (5.128) ensures .◻ the convergence, . Pih → Pi , . K ih → K i∗ , and . K ihj → K i∗j .

144

5 Integral Reinforcement Learning for Zero-Sum Games

5.3.4 Simulation Examples This section verifies the data-driven Algorithm 5.4 for distributed minmax multiplayer games with two simulation examples: an inverted pendulum system with 4 players. Consider an inverted pendulum system with its plant matrix from Liu et al. (2015) and four control input players [ .

x˙ =

0 6(M+m)g l(4M+m)

] 1 x + B1 u 1 + B2 u 2 + B3 u 3 + B4 u 4 , 0

(5.129)

where .x = [α α] ˙ T ; .α denotes the pendulum angle; .m and . M denote the pendulum mass and the cart mass, respectively; .g is the gravity parameter; and .l is the pendulum length. The parameters from Liu et al. (2015) are given as . M = 1.096kg, 2 .m = 0.109kg,.l = 0.25m, and. g = 9.8m/s . Give control input dynamics of all players in (5.129) as [

[ ] [ ] [ ] ] 0 1 −1 1 . B1 = , B2 = , B3 = , B4 = . −4 −1 1.7 0

(5.130)

Select the cost weights for 4 players as .

R11 = 1, R12 = 2, R13 = 3, R14 = 1, R21 = 2, R22 = 1, R23 = 3, R24 = 1, R31 = 2, R32 = 3, R33 = 1, R34 = 1, R41 = 1, R42 = 2, R43 = 1, R44 = 1, Q 1 = I2 , Q 2 = 2 × I2 , Q 3 = 0.5 × I2 , Q 4 = 0.25 × I2 , γ1 = 6, γ2 = 7, γ3 = 8, γ4 = 9.

(5.131)

The desired cost metrics . Pi in (5.82), and the desired feedback gains . K i∗ in (5.81a) and. K i∗j in (5.81b) for.i ∈ {1, 2, 3, 4}, j ∈ {−i} are obtained using MATLAB.® (command ICARE) as .

[ ] [ ] 26.4314 4.3779 21.8296 4.5006 , P2 = , 4.3779 0.8119 4.5006 1.0866 [ ] [ ] 26.6526 4.9758 11.7616 2.1527 = , P4 = , 4.9758 0.9724 2.1527 0.4163 [ ] [ ] ∗ = −17.5115 −3.2478 , K 12 = −0.3063 −0.0495 , [ ] [ ] ∗ = 0.1758 0.02776 , K 14 = −0.7342 −0.1216 , [ ] [ ] ∗ = 17.3290 3.4139 , K 21 = 0.1837 0.0444 , [ ] [ ] ∗ = 0.09645 0.0180 , K 24 = −0.4455 −0.0918 , [ ] [ ] ∗ = −18.1938 −3.3227 , K 31 = 0.1555 0.0304 ,

P1 = P3

K 1∗ ∗ K 13

K 2∗ ∗ K 23

K 3∗

5.3 Off-Policy Integral Reinforcement Learning for Distributed …

145

[ ] [ ] ∗ ∗ K 32 = −0.1129 −0.0209 , K 34 = −0.4164 −0.0777 , [ ] [ ] ∗ K 41 = 0.1063 0.0206 , K 4∗ = 11.7616 2.1527 , [ ] [ ] ∗ ∗ = −0.0593 −0.0107 , K 43 = 0.1000 0.0178 . K 42

(5.132)

The data-driven Algorithm 5.4 is implemented using (5.128). All elements of initial states are randomly selected in .[−1, 1]. Select tiny integral interval .∆t = 10−3 s (seconds) and thresholds .ei = 10−3 for all players. The probing noises are chosen as −3 .ϵi = 1.1 × 10 × ai , where .ai is randomly selected in .[0, 1]. Figures 5.3 and 5.4 show the norm of the difference between iterative learning values using Algorithm 5.4 and their desired values . Pi , . K i∗ , . K i∗j , .i ∈ {1, 2, 3, 4}, . j ∈ {−i} in (5.132). It is seen that all learning values converge to their desired values in 6 iterations. Figure 5.5 shows the trajectories of states and 4 minmax strategy control players of the inverted pendulum system (5.129).

Fig. 5.3 Convergence of . Pih and . K ih , .i ∈ {1, 2, 3, 4} to their desired values in (5.132) using the model-free Algorithm 5.4

10 5 0

1

2

3

4

5

6

1

2

3

4

5

6

10 5 0

Fig. 5.4 Convergence of h

h

h

h

h

h

h

h

h

h

0.2

. K 12 , . K 13 , . K 14 , . K 21 , . K 23 , . K 24 , . K 31 , . K 32 , . K 34 , . K 41 ,

h

h to their desired and . K 44 values in (5.132) using the model-free Algorithm 5.4

. K 42 ,

0.1

0

1

2

3

4

5

6

1

2

3

4

5

6

0.15 0.1 0.05 0

146 Fig. 5.5 Trajectories of the inverted pendulum system (5.129) using Algorithm 5.4

5 Integral Reinforcement Learning for Zero-Sum Games 0.4 0.2 0 0

0.2

0.4

0.6

0.8

1

0

0.2

0.4

0.6

0.8

1

10 0 -10

References Abu-Khalaf M, Lewis FL (2008) Neurodynamic programming and zero-sum games for constrained control systems. IEEE Trans Neural Networks 19(7):1243–1252 Basar T (1976) On the uniqueness of the Nash solution in linear-quadratic differential games. Int J Game Theory 5:65–90 Basar T, Bernard P (1995) H∞ Optimal control and related minimax design problems. Birkhäuser, Boston, MA Chen Z, Xue W, Li N, Lian B, Lewis FL (2022) A novel Z-function-based completely model-free reinforcement learning method to finite-horizon zero-sum game of nonlinear system. Nonlinear Dyn 107(3):2563–2582 Cheng W, Zhao K, Zhou M. (2021) Multi-objective Herglotz’variational principle and cooperative Hamilton-Jacobi systems. ArXiv preprint arXiv:2104.07546 Feng Y, Anderson BD, Rotkowitz M (2009) A game theoretic algorithm to compute local stabilizing solutions to HJBI equations in nonlinear H1 control. Automatica 45(4):881–888 Freiling G, Jank G, Abou-Kandil H (1996) MOn global existence of solutions to coupled matrix Riccati equations in closed-loop Nash games. IEEE Trans Autom Control 41(2):264–269 Gokcesu K, Kozat SS (2018) An online minimax optimal algorithm for adversarial multiarmed bandit problem. IEEE Trans Neural Networks Learn Syst 29(11):5565–5580 Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704 Khalil HK (2015) Nonlinear control. Pearson New York Lehtomaki N, Sandell N, Athans M (1981) Robustness results in linear-quadratic Gaussian based multivariable control designs. IEEE Trans Autom Control 26(1):75–93 Li H, Liu D, Wang D (2014) Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics. IEEE Trans Autom Sci Eng 11(3):706–714 Li J, Kiumarsi B, Chai T, Lewis FL, Fan J (2017) Off-policy reinforcement learning: Optimal operational control for two-time-scale industrial processes. IEEE Trans Cybern 47(12):4547– 4558 Lian B, Lewis FL, Hewer GA, Estabridis K, Chai T (2021) Robustness analysis of distributed kalman filter for estimation in sensor networks. IEEE Trans Cybern 52(11):12479–12490

References

147

Lian B, Donge VS, Xue W, Lewis FL, Davoudi A (2022) Distributed minmax strategy for multiplayer games: Stability, robustness, and algorithms. IEEE Trans Neural Networks Learn Syst. https:// doi.org/10.1109/TNNLS.2022.3215629 Lian B, Lewis FL, Hewer GA, Estabridis K, Chai T (2022) Online learning of minmax solutions for distributed estimation and tracking control of sensor networks in graphical games. IEEE Trans Control Network Syst 9(4):1923–1936 Liu A, Zhang W, Yu L, Liu S, Chen MZQ (2015) New results on stabilization of networked control systems with packet disordering. Automatica 52:255–259 Mitake H, Tran HV (2014) Homogenization of weakly coupled systems of Hamilton-Jacobi equations with fast switching rates. Archive Rational Mech Anal 211(3):733–769 Modares H, Lewis FL, Jiang ZP (2015) H∞ Tracking control of completely unknown continuoustime systems via off-policy reinforcement learning. IEEE Trans Neural Networks Learn Syst 26(10):2550–2562 Papavassilopoulos G, Cruz JJ (1979) On the uniqueness of Nash strategies for a class of analytic differential games. J Optim Theory Appl 27(2):309–314 van der Schaft AJ (1992) L2-gain analysis of nonlinear systems and nonlinear state-feedback H∞ control. IEEE Trans Autom Control 37(6):770–784 Vamvoudakis KG, Lewis FL (2011) Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569 Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for continuoustime linear systems based on policy iteration. Automatica 45(2):477–484 White DA, Sofge DA (1992) Handbook of intelligent control. Van Nostrand Reinhold, New York Wu HN, Luo B (2012) Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control. IEEE Trans Neural Networks Learn Syst 23(12):1884–1895 Zames G (1981) Feedback and optimal sensitivity: Model reference transformations, multiplicative seminorms, and approximate inverses. IEEE Trans Autom Control 26(1):301–320 Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games. Automatica 47(1):207–214 Zhang H, Wei Q, Liu D (2012) Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H ∞ control. IEEE Trans Neural Networks Learn Syst 23(12):1884–1895 Zhu Y, Zhao D (2020) Online minimax Q network learning for two-player zero-sum Markov games. IEEE Trans Neural Networks Learn Syst 33(3):1228–1241

Part II

Inverse Reinforcement Learning for Optimal Control Systems and Games

Chapter 6

Inverse Reinforcement Learning for Optimal Control Systems

6.1 Introduction To overcome the need for manual specification of cost functions in an agent, inverse reinforcement learning (inverse RL) (Ng and Russell 2000) has been introduced to infer the hidden cost functions from demonstrated behaviors. Inverse RL is commonly employed in apprenticeship learning scenarios (Abbeel and Ng 2004; Chu et al. 2020; Lin et al. 2019; Self et al. 2020; Song et al. 2018; Syed and Schapire 2007), where a learner leverages observations of an expert’s behavior to uncover the unknown expert cost functions and replicate the expert’s behavior. In contrast, there are alternative approaches (Atkeson and Schaal 1997; Ho and Ermon 2016; Sammut et al. 1992) that focus on directly learning the mapping from states to control inputs, without explicitly considering the cost function. These approaches are typically applicable when the objective is to mimic a specific expert behavior. However, in practice, cost functions exhibit inherent adaptability, transferability, and robustness to environmental changes and variations in expert behavior, making them more suitable for tasks such as autonomous driving, where traffic conditions can vary dynamically (Abbeel and Ng 2004; Arora and Doshi 2021). It is worth noting that most existing inverse RL studies (Abbeel and Ng 2004; Brown and Niekum 2018; Chu et al. 2020; Imani and Braga-Neto 2018; Imani and Ghoreishi 2021; Levine et al. 2011; Lin et al. 2019; Song et al. 2018; Syed and Schapire 2007) primarily focus on the context of state dynamics described by Markov decision processes (MDPs). Similar to inverse RL, the inverse optimal control (IOC) technique enables reconstructing a cost function given states and control inputs of a system. However, IOC primarily focuses on differential systems and aims to ensure Lyapunov stability (Ab Azar et al. 2020; Haddad and Chellaboina 2011; Johnson et al. 2013; Kalman 1964; Sanchez and Ornelas-Tellez 2017; Vega et al. 2018). Considering that system dynamics might be unknown in differential systems, we are aiming to explore completely model-free inverse RL algorithms to compute

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 B. Lian et al., Integral and Inverse Reinforcement Learning for Optimal Control Systems and Games, Advances in Industrial Control, https://doi.org/10.1007/978-3-031-45252-9_6

151

152

6 Inverse Reinforcement Learning for Optimal Control Systems

the cost function and the optimal control policy for the trajectory tracking/imitation problem in continuous-time linear and nonlinear systems.

6.2 Off-Policy Inverse Reinforcement Learning for Linear Quadratic Regulators This section introduces continuous-time (CT) linear learner and expert systems described by differential dynamic equations in terms of linear quadratic regulators (LQR) and provides inverse RL algorithms for the trajectory imitation of the learner to the expert. Given the demonstrated expert states and control inputs, inverse RL algorithms allow the learner to reconstruct the expert’s cost function that yields the expert’s optimal control policy and generates the same demonstrated trajectories. Model-based and model-free data-driven inverse RL algorithms are provided associated with their theoretical guarantees including convergence of the algorithm and stability of the learned policy.

6.2.1 Problem Formulation Expert System An expert system is described as x˙ = Axe + Bu e ,

. e

(6.1)

where .xe ∈ Rn and .u e ∈ Rm are expert’s states and control inputs, respectively. . A ∈ Rn×n and . B ∈ Rn×m are state matrix and control input matrix, respectively. Assumption 6.1 The pair .(A, B) is controllable. Assumption 6.2 The control input .u e in (6.1) is optimal at the corresponding state x .

. e

The expert’s control input in terms of .u e = −K e xe minimizes an integrated cost function ʃ ∞ . Ve (x e ) = (xeT Q e xe + u Te Re u e ) dτ, (6.2) t

where. Q e = Q Te ∈ Rn×n ≥ 0 and. Re = ReT ∈ Rm×m > 0 are state- and input- penalty weight, respectively. .(A, Q e ) is observable. In the optimal control theory (Lewis et al. 2012; Lewis and Vrabie 2009), the cost function (6.2) can be represented by T . Ve (x) = x e Pe x e , where . Pe is a positive definite matrix. The optimal control input .u e in terms of control feedback gain . K e is given by u = −K e xe = −Re−1 B T Pe xe ,

. e

(6.3)

6.2 Off-Policy Inverse Reinforcement Learning for Linear Quadratic Regulators

153

in which . Pe satisfies the algebraic Riccati equation (ARE) .

AT Pe + Pe A + Q e − Pe B Re−1 B T Pe = 0.

(6.4)

Note that (6.4) has a unique stabilizing solution $P_e$. This defines the expert system (6.1).

Learner System
A learner system is described as

$$\dot{x} = A x + B u, \qquad (6.5)$$

where $x \in \mathbb{R}^n$ is the learner state, $u \in \mathbb{R}^m$ is the learner control input, and $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ are the state matrix and the control input matrix, respectively. The learner and the expert are homogeneous. The learner (6.5) seeks an optimal control input that minimizes the integrated cost function

$$V(x) = \int_t^{\infty} \left( x^T Q x + u^T R u \right) d\tau, \qquad (6.6)$$

where $Q = Q^T \in \mathbb{R}^{n \times n} \ge 0$ and $R = R^T \in \mathbb{R}^{m \times m} > 0$ are given penalty weight matrices and $(A, Q)$ is observable. With the state-feedback control $u = -K x$, the cost (6.6) is represented by $V(x) = x^T P x$, where $P > 0$. To find the optimal solution, we form the Hamiltonian obtained by differentiating (6.6):

$$H(x, u) = (A x + B u)^T P x + x^T P (A x + B u) + x^T Q x + u^T R u. \qquad (6.7)$$

By the stationarity condition $\partial H / \partial u = 0$, the optimal control input and its corresponding optimal control gain $K$ are given by

$$u^* = -K x, \qquad (6.8a)$$
$$K = R^{-1} B^T P, \qquad (6.8b)$$

where $P > 0$ is the unique solution of the ARE

$$A^T P + P A + Q - P B R^{-1} B^T P = 0. \qquad (6.9)$$
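As a quick illustration of (6.8b)–(6.9) (and equally of the expert's (6.3)–(6.4)), the following sketch solves the continuous-time ARE numerically with SciPy. It is not part of the book's algorithms; the matrices are simply the values used later in the simulation example of Sect. 6.2.4.

```python
# Minimal sketch: solve the ARE (6.9) and compute the gain (6.8b) numerically.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-1.0, 2.0], [2.2, 1.7]])   # system matrices from Sect. 6.2.4
B = np.array([[2.0], [1.6]])
Q = 5.0 * np.eye(2)                       # state-penalty weight Q >= 0
R = np.array([[1.0]])                     # input-penalty weight R > 0

P = solve_continuous_are(A, B, Q, R)      # stabilizing solution of (6.9)
K = np.linalg.solve(R, B.T @ P)           # K = R^{-1} B^T P, Eq. (6.8b)
print(P)
print(K)                                  # with these weights this is the expert gain K_e
```

With $Q = Q_e = 5 I_2$ and $R = R_e = 1$, this reproduces the expert's $P_e$ and $K_e$ reported in Sect. 6.2.4.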

Given the above expert and learner formulations, we now provide an assumption and a definition that set up the inverse RL problem.

Assumption 6.3 The expert's penalty weights $Q_e$ and $R_e$ in (6.2) and the control parameters $P_e$ and $K_e$ in (6.3)–(6.4) are unknown to the learner system (6.5). The system dynamics $A$, $B$ in (6.1) and (6.5) are also unknown.


Definition 6.1 Given an $R \in \mathbb{R}^{m \times m} > 0$, if there exists a $Q \in \mathbb{R}^{n \times n} \ge 0$ in (6.9) such that $K = K_e$, then $Q$ is called an equivalent weight to $Q_e$.

Problem 6.1 Let Assumptions 6.1–6.3 hold. Selecting any $R \in \mathbb{R}^{m \times m} > 0$, the learner (6.5) aims to find an equivalent weight $Q$ to $Q_e$, i.e., one that yields the same control gain $K = K_e$, so as to imitate the expert's trajectories $(x_e, u_e)$. Note that this is a data-driven, model-free inverse RL control problem.

6.2.2 Inverse Reinforcement Learning Policy Iteration

We now provide a model-based inverse RL policy iteration (PI) algorithm that finds an equivalent weight to $Q_e$ by combining optimal control learning and inverse optimal control (IOC) learning. The optimal control process was shown in Sect. 6.2.1; based on it, the IOC process is as follows.

In Sect. 6.2.1, (6.8b) and (6.9) compute the optimal control solution for given penalty weights $Q$ and $R$. However, $Q$ may not be an equivalent weight to $Q_e$, and $K$ in (6.8b) may not equal $K_e$ in (6.3). To this end, keeping $R$ fixed, we use IOC to correct the state-penalty weight $Q$ toward one that is equivalent to $Q_e$, so that $K = K_e$.

First, to see the difference between the current control gain $K$ and the expert's control gain $K_e$, we estimate the unknown $K_e$ in (6.3) from the expert's data $x_e$ and $u_e$ using

$$[u_e(t-(s-1)T), \ldots, u_e(t-T), u_e(t)] = -K_e [x_e(t-(s-1)T), \ldots, x_e(t-T), x_e(t)], \qquad (6.10)$$

where $s > mn$ is the number of data groups. Define $\bar{u}_e = [u_e(t-(s-1)T), \ldots, u_e(t-T), u_e(t)] \in \mathbb{R}^{m \times s}$ and $\bar{x}_e = [x_e(t-(s-1)T), \ldots, x_e(t-T), x_e(t)] \in \mathbb{R}^{n \times s}$. By the batch least-squares method, the estimate of $K_e$ is given by

$$\hat{K}_e = -\bar{u}_e \bar{x}_e^T (\bar{x}_e \bar{x}_e^T)^{-1}. \qquad (6.11)$$
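The estimate (6.11) can be computed directly from recorded expert data; the helper below is a minimal sketch (assumed function name, not from the book), with $\bar{x}_e$ stored as an $n \times s$ array and $\bar{u}_e$ as an $m \times s$ array.

```python
# Sketch: batch least-squares estimate of the expert gain, Eq. (6.11).
import numpy as np

def estimate_expert_gain(x_bar: np.ndarray, u_bar: np.ndarray) -> np.ndarray:
    """Return K_e_hat solving u_bar ~ -K_e_hat @ x_bar in the least-squares sense."""
    # Requires rank(x_bar) = n, i.e., sufficiently rich (persistently exciting) expert data.
    return -u_bar @ x_bar.T @ np.linalg.inv(x_bar @ x_bar.T)
```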

Note that (6.10) is a standard approximation problem. Based on (6.11), the error between $K$ and $\hat{K}_e$ is defined as

$$\tilde{E} = \operatorname{trace}(\tilde{e}^T \tilde{e}), \quad \tilde{e} = K - \hat{K}_e, \qquad (6.12)$$

where $\tilde{E}$ is a scalar. Note that $\tilde{E} = 0$ if and only if $\tilde{e} = 0$, i.e., $K = \hat{K}_e$. We correct the weight $Q$ by minimizing the error $\tilde{E}$ in (6.12). Applying the gradient descent method (Bertsekas 1997) to tune $P$ yields the weight correction factor $f(P)$

$$f(P) = P - \bar{\alpha} \frac{\partial \tilde{E}}{\partial P} = P - \bar{\alpha} \left( B R^{-1} \tilde{e} + \tilde{e}^T R^{-1} B^T \right), \qquad (6.13)$$

where $\bar{\alpha} > 0$ is a learning step size, and $f(P)$ is an improved estimate of $P$ that moves the estimated $K$ in (6.8b) closer to the target $K_e$ in (6.3). Then, based on IOC theory (Haddad and Chellaboina 2011), the state-penalty weight $Q$ can be corrected by

$$Q = -\left( A^T f(P) + f(P) A - f(P) B R^{-1} B^T f(P) \right). \qquad (6.14)$$
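The pair (6.13)–(6.14) is a simple one-step correction. A minimal sketch (my own illustration, assuming the model $A$, $B$ is known at this stage) is:

```python
# One step of the weight correction, Eqs. (6.13)-(6.14).
import numpy as np

def correct_state_penalty(A, B, R, P, e_tilde, alpha_bar):
    R_inv = np.linalg.inv(R)
    fP = P - alpha_bar * (B @ R_inv @ e_tilde + e_tilde.T @ R_inv @ B.T)   # Eq. (6.13)
    return -(A.T @ fP + fP @ A - fP @ B @ R_inv @ B.T @ fP)                # Eq. (6.14)
```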

The complete iteration cycle is given in Algorithm 6.1 below. The correction (6.17) for $Q_{i+1}$ is a standard model-based IOC step (Haddad and Chellaboina 2011). The algorithm has a two-loop iteration structure, with the control policy computed in the inner-loop iterations and the state-penalty weight updated in the outer-loop iterations.

Algorithm 6.1 Model-based inverse RL PI for LQR
1. Initialization: Select $R \in \mathbb{R}^{m \times m} > 0$, an initial weight $Q_0 \ge 0$, and a small threshold $\epsilon_e > 0$. Set $i = 0$ and compute $\hat{K}_e$ by (6.11).
2. Optimal control: Calculate $P_i$ and update $u_i$ and $K_i$ by
   $$0 = A^T P_i + P_i A + Q_i - P_i B R^{-1} B^T P_i, \qquad (6.15a)$$
   $$u_i = -K_i x = -R^{-1} B^T P_i x. \qquad (6.15b)$$
3. Measure $\tilde{e}_i$ and calculate the correction factor $f(P_i)$ by
   $$\tilde{e}_i = K_i - \hat{K}_e, \qquad (6.16a)$$
   $$f(P_i) = P_i - \bar{\alpha} \left( B R^{-1} \tilde{e}_i + \tilde{e}_i^T R^{-1} B^T \right). \qquad (6.16b)$$
4. State-penalty weight improvement: Update the state-penalty weight $Q_{i+1}$ by
   $$Q_{i+1} = -A^T f(P_i) - f(P_i) A + f(P_i) B R^{-1} B^T f(P_i). \qquad (6.17)$$
5. Set $i = i + 1$ and repeat Steps 2–4, stopping when $\tilde{E}_i < \epsilon_e$.
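The loop structure of Algorithm 6.1 is straightforward to prototype when a model is available. The following sketch is one possible implementation (not code from the book); the step size, tolerance, and iteration cap are illustrative choices.

```python
# Sketch of Algorithm 6.1: model-based inverse RL policy iteration for LQR.
import numpy as np
from scipy.linalg import solve_continuous_are

def inverse_rl_pi_lqr(A, B, R, K_e_hat, Q0, alpha_bar=0.1, eps_e=1e-8, max_iter=1000):
    Q = Q0.copy()
    R_inv = np.linalg.inv(R)
    for _ in range(max_iter):
        P = solve_continuous_are(A, B, Q, R)          # Step 2: ARE (6.15a)
        K = R_inv @ B.T @ P                           # Step 2: gain (6.15b)
        e = K - K_e_hat                               # Step 3: error (6.16a)
        if np.trace(e.T @ e) < eps_e:                 # Step 5: stop when E_tilde < eps_e
            break
        fP = P - alpha_bar * (B @ R_inv @ e + e.T @ R_inv @ B.T)   # correction (6.16b)
        Q = -(A.T @ fP + fP @ A - fP @ B @ R_inv @ B.T @ fP)       # Step 4: update (6.17)
    return Q, P, K
```

Calling it with the Sect. 6.2.4 matrices and the estimate $\hat{K}_e$ from (6.11) mirrors the simulation study reported later in this section.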

The next theorems analyze the convergence, the stability of the learned policy, and the non-unique solutions of Algorithm 6.1.

Theorem 6.1 (Convergence) Given $R > 0$ and initial $Q_0 \ge 0$, the state penalty $Q_i$ obtained by Algorithm 6.1 converges to a $Q^*$ as $i \to \infty$ that optimally produces the gain $K_e$. That is,

$$A^T P^* + P^* A + Q^* - P^* B R^{-1} B^T P^* = 0, \qquad (6.18)$$
$$K^* = R^{-1} B^T P^* = K_e. \qquad (6.19)$$


Proof In Algorithm 6.1, given $Q_i \ge 0$, the $P_i$ obtained from (6.15a) is the unique ARE solution (Vrabie et al. 2009). Denote $g(\tilde{e}_i) = B R^{-1} \tilde{e}_i + \tilde{e}_i^T R^{-1} B^T$, which satisfies $g(\tilde{e}_i) = g(\tilde{e}_i)^T$. Substituting it and (6.16b) into (6.17) and (6.15a) yields

$$-\left( A^T P_{i+1} + P_{i+1} A - P_{i+1} B R^{-1} B^T P_{i+1} \right) = -\left( A^T P_i + P_i A - P_i B R^{-1} B^T P_i \right) + \bar{\alpha} \left( A^T g(\tilde{e}_i) + g(\tilde{e}_i) A \right) - \bar{\alpha} g(\tilde{e}_i) B R^{-1} B^T P_i - \bar{\alpha} P_i B R^{-1} B^T g(\tilde{e}_i) + \bar{\alpha}^2 g(\tilde{e}_i) B R^{-1} B^T g(\tilde{e}_i). \qquad (6.20)$$

Using the gradient descent method (Bertsekas 1997) in (6.16b) to tune $P_i$ guarantees $0 \le \tilde{E}_{i+1} < \tilde{E}_i$, from which $\lim_{i\to\infty} \tilde{E}_i = 0$ and $\lim_{i\to\infty} \tilde{e}_i = 0$. Then (6.16b) gives

$$\lim_{i\to\infty} f(P_i) = \lim_{i\to\infty} \left( P_i - \bar{\alpha} g(\tilde{e}_i) \right) = \lim_{i\to\infty} P_i, \qquad (6.21)$$

where $\lim_{i\to\infty} g(\tilde{e}_i) = 0$. Assuming the least-squares estimate is exact, i.e., $\hat{K}_e = K_e$, together with (6.16a) we have

$$\lim_{i\to\infty} K_i = \lim_{i\to\infty} R^{-1} B^T P_i = K_e. \qquad (6.22)$$

Considering (6.20) and (6.21), we obtain

$$\lim_{i\to\infty} Q_{i+1} = \lim_{i\to\infty} -\left( A^T P_i + P_i A - P_i B R^{-1} B^T P_i \right) = \lim_{i\to\infty} Q_i, \qquad (6.23)$$

which consequently implies the convergence $\lim_{i\to\infty} P_{i+1} = \lim_{i\to\infty} P_i$. Let $\lim_{i\to\infty} Q_i = Q^*$ and $\lim_{i\to\infty} P_i = P^*$; then (6.22) and (6.23) become (6.19) and (6.18), respectively. This completes the proof. ◻

Theorem 6.2 (Stability) With initial $Q_0 \ge 0$ and $R > 0$, there exists $\bar{\alpha} > 0$ such that each updated control input $u_i$, $i = 0, 1, \ldots$, in (6.15b) obtained by Algorithm 6.1 globally exponentially stabilizes the learner (6.5).

Proof With $Q_i \ge 0$, $i = 0$, and $R > 0$, the solution $P_i > 0$ of the ARE (6.15a) produces the control gain $K_i$ in (6.15b) satisfying the Lyapunov equation

$$(A - B K_i)^T P_i + P_i (A - B K_i) = -Q_i - P_i B R^{-1} B^T P_i < 0, \qquad (6.24)$$

where $(A - B K_i)$, the closed-loop dynamics of the learner system (6.5), is exponentially stable. This implies that the learned $K_i$ exponentially stabilizes the learner system for all $i$ as long as $Q_i \ge 0$ holds for all $i$. We observe that $Q_{i+1}$ is updated via (6.17) with the correction factor $f(P_i)$ in (6.16b). By defining

$$G(\tilde{e}_i) = A^T g(\tilde{e}_i) + g(\tilde{e}_i) A - g(\tilde{e}_i) B R^{-1} B^T P_i - P_i B R^{-1} B^T g(\tilde{e}_i) + \bar{\alpha} g(\tilde{e}_i) B R^{-1} B^T g(\tilde{e}_i), \qquad (6.25)$$


where $g(\tilde{e}_i) = B R^{-1} \tilde{e}_i + \tilde{e}_i^T R^{-1} B^T$, one can write

$$Q_{i+1} = Q_i + \bar{\alpha} G(\tilde{e}_i). \qquad (6.26)$$

Theorem 6.1 shows that $\tilde{e}_i \to 0$ and $g(\tilde{e}_i) \to 0$ as $i \to \infty$, and hence $G(\tilde{e}_i) \to 0$ in (6.25) as $i \to \infty$. Since $K_i$ and the goal $\hat{K}_e$ are both stabilizing, and $P_i$ linearly affects $K_i$, there exists $\bar{\alpha} > 0$ that places $K_{i+1}$ between $K_i$ and $K_e$, which is therefore also stabilizing. That is, with $Q_i \ge 0$ and $Q_e \ge 0$, there exists $\bar{\alpha} > 0$ that keeps $Q_{i+1} \ge 0$ in (6.26), so that the resulting $K_{i+1}$ and $u_i$ obtained via (6.15a)–(6.15b) exponentially stabilize the learner system (6.5). This completes the proof. ◻

Ro R −1 B T (Pe + Po ) + B T Po = 0,

(6.27)

.

Q o + A Po + Po A +

(6.28)

T

K eT Ro K e

= 0,

where $R_o = R_e - R$ and $K_e = R_e^{-1} B^T P_e$ as in (6.3). Then, any $Q^* = Q_e + Q_o$ and $P^* = P_e + P_o$ satisfy (6.18) and give the target control gain $K_e$ in (6.19).

Proof Recalling $K_e$ in (6.3) and $P_o$ in (6.27), and using $P^* = P_e + P_o$ and $R_o = R_e - R$, we have

$$B^T P^* = B^T (P_e + P_o) = B^T P_e - R_e R^{-1} B^T P_e - R_e R^{-1} B^T P_o + B^T P_e + B^T P_o, \qquad (6.29)$$

which shows that

$$B^T P_e - R_e R^{-1} B^T P_e - R_e R^{-1} B^T P_o = 0. \qquad (6.30)$$

Since $R_e > 0$, multiplying both sides of (6.30) by $R_e^{-1}$ yields

$$K_e = R_e^{-1} B^T P_e = R^{-1} B^T P_e + R^{-1} B^T P_o = R^{-1} B^T P^*, \qquad (6.31)$$

which is actually (6.19).


With $\tilde{e}_i$ in (6.16a) and Theorem 6.1, (6.31) implies

$$\tilde{E}(P^*) = 0, \quad \tilde{e}(P^*) = 0, \qquad (6.32)$$

which means $P^* = P_e + P_o$ is a solution minimizing $\tilde{E}$. Then, we know from (6.16b) that

$$f(P^*) = P^*. \qquad (6.33)$$

Using $Q^* = Q_e + Q_o$ and $P^* = P_e + P_o$, one obtains

$$Q^* + A^T P^* + P^* A - P^* B R^{-1} B^T P^* = Q_e + A^T P_e + P_e A + Q_o + A^T P_o + P_o A - P^* B R^{-1} B^T P^*. \qquad (6.34)$$

Using $R = R_e - R_o$ and (6.31) in (6.34) yields

$$Q^* + A^T P^* + P^* A - P^* B R^{-1} B^T P^* = Q_e + A^T P_e + P_e A + Q_o + A^T P_o + P_o A - P_e B R_e^{-1} B^T P_e + P_e B R_e^{-1} R_o R_e^{-1} B^T P_e. \qquad (6.35)$$

When the ARE (6.4) and (6.28) hold, (6.35) becomes

$$Q^* + A^T P^* + P^* A - P^* B R^{-1} B^T P^* = 0, \qquad (6.36)$$

which is actually (6.18). Considering (6.33) and (6.17), the $Q^*$ obtained by Algorithm 6.1 is

$$Q^* = -A^T f(P^*) - f(P^*) A + f(P^*) B R^{-1} B^T f(P^*) = -A^T P^* - P^* A + P^* B R^{-1} B^T P^*. \qquad (6.37)$$

We have thus shown that the $Q^*$ and $P^*$ obtained by Algorithm 6.1 not only satisfy $Q^* = Q_e + Q_o$ and $P^* = P_e + P_o$, with $Q_e, P_e$ given by (6.4) and $Q_o, P_o$ given by (6.27)–(6.28), but also yield the target control gain $K_e$ in (6.3). This completes the proof. ◻

Remark 6.1 If and only if $Q_o$ and $P_o$ in (6.27)–(6.28) are zero, the $Q^*, P^*$ obtained by Algorithm 6.1 are exactly the actual target solutions $Q_e, P_e$, respectively. If $Q_o \ne 0$ and $P_o \ne 0$, let $\Omega$ be the set containing all possible solutions to (6.18)–(6.19); each pair $Q^*, P^*$ is an element of $\Omega$. It can be deduced that if $R_o = 0$ and $Q_e$ and $Q^*$ are both diagonal, then the solution $Q^*, P^*$ to (6.18)–(6.19) is unique and equal to $Q_e, P_e$, respectively.


6.2.3 Model-Free Off-Policy Inverse Reinforcement Learning

In real-world applications, system models may not be known. To solve the inverse RL control Problem 6.1, we present a model-free, data-driven inverse RL algorithm, based on the model-based Algorithm 6.1, that requires no knowledge of the system model $(A, B)$.

Model-Free Optimal Control Update
First, to remove the need for $(A, B)$ from Step 2 of Algorithm 6.1, i.e., (6.15a)–(6.15b), off-policy integral RL (Jiang et al. 2012; Xue et al. 2020) is applied. For each current state-penalty weight $Q_i$, $i = 0, 1, \ldots$, we introduce an auxiliary input at iteration $j = 0, 1, \ldots$,

$$u_i^j = -K_i^j x_e \qquad (6.38)$$

to the expert system (6.1) to obtain

$$\dot{x}_e = A x_e + B u_e - B u_i^j + B u_i^j = A_c x_e + B (u_e - u_i^j), \qquad (6.39)$$

where $A_c = A - B K_i^j$, $u_e$ is the demonstrated target control input, and $u_i^j$ is the updating control input.

Multiplying both sides of (6.15a) by $x_e$ and using $K_i^{j+1} = R^{-1} B^T P_i^j$ yields

$$x_e^T Q_i x_e + x_e^T (A_c^T P_i^j + P_i^j A_c) x_e + x_e^T (K_i^j)^T R K_i^j x_e = \left( A_c x_e + B(u_e + K_i^j x_e) \right)^T P_i^j x_e + x_e^T P_i^j \left( A_c x_e + B(u_e + K_i^j x_e) \right) + x_e^T Q_i x_e + x_e^T (K_i^j)^T R K_i^j x_e - 2 (u_e + K_i^j x_e)^T R K_i^{j+1} x_e = 0. \qquad (6.40)$$

Considering (6.39) and integrating both sides of (6.40) from $t$ to $t + T_I$ gives

$$x_e(t+T_I)^T P_i^j x_e(t+T_I) - x_e(t)^T P_i^j x_e(t) - 2 \int_t^{t+T_I} (u_e + K_i^j x_e)^T R K_i^{j+1} x_e \, d\tau = - \int_t^{t+T_I} x_e^T \left( Q_i + (K_i^j)^T R K_i^j \right) x_e \, d\tau, \qquad (6.41)$$

where $T_I > 0$ is a small integration time interval. $P_i^j$ and $K_i^{j+1}$ are computed from measured trajectory data $(x_e, u_e)$ via (6.41) without knowing $(A, B)$. When $P_i^j$ and $K_i^{j+1}$ converge, the optimal $P_i$ and $K_i$ corresponding to $Q_i$ are given by

$$K_i = K_i^{j+1}, \quad P_i = P_i^j. \qquad (6.42)$$


Model-Free IOC Update
To remove the need for $(A, B)$ from the correction of $Q_{i+1}$ in Steps 3–4 of Algorithm 6.1, i.e., (6.16b)–(6.17), we propose an off-policy IOC procedure. The off-policy Bellman equation (6.41) always gives a solution $P_i > 0$. Since $K_i = R^{-1} B^T P_i$, we replace $R^{-1} B^T$ in (6.16b) by $K_i P_i^{-1}$; then (6.16b) becomes

$$f(P_i) = P_i - \bar{\alpha} \left( P_i^{-1} K_i^T \tilde{e}_i + \tilde{e}_i^T K_i P_i^{-1} \right). \qquad (6.43)$$

Then, we define a new operator

$$K_i' = K_i P_i^{-1} f(P_i). \qquad (6.44)$$
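The model-free correction is again a short computation once $P_i$, $K_i$, and $\tilde{e}_i$ are available; the sketch below (my own illustration) mirrors (6.43)–(6.44).

```python
# Model-free counterparts of the correction step, Eqs. (6.43)-(6.44).
import numpy as np

def correct_weight_model_free(P_i, K_i, e_tilde, alpha_bar):
    P_inv = np.linalg.inv(P_i)
    fP = P_i - alpha_bar * (P_inv @ K_i.T @ e_tilde + e_tilde.T @ K_i @ P_inv)  # Eq. (6.43)
    K_prime = K_i @ P_inv @ fP                                                   # Eq. (6.44)
    return fP, K_prime
```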

Now we show how to calculate $Q_{i+1}$ by (6.47) using only data. Introducing $u_i' = -K_i' x_e$ into system (6.1) yields

$$\dot{x}_e = A x_e + B u_e - B u_i' + B u_i' = A_c' x_e + B (u_e - u_i'), \qquad (6.45)$$

where $A_c' = A - B K_i'$ and $u_e$ is the demonstrated expert control input. Multiplying both sides of (6.17) by $x_e$ yields

$$x_e^T Q_{i+1} x_e + x_e^T \left( A^T f(P_i) + f(P_i) A \right) x_e - x_e^T (K_i')^T R K_i' x_e = x_e^T Q_{i+1} x_e + \left( A_c' x_e + B(u_e + K_i' x_e) \right)^T f(P_i) x_e + x_e^T f(P_i) \left( A_c' x_e + B(u_e + K_i' x_e) \right) - x_e^T (K_i')^T R K_i' x_e - 2 u_e^T R K_i' x_e = 0. \qquad (6.46)$$

Integrating both sides of (6.46) from $t$ to $t + T_I$ and using (6.45) gives

$$x_e(t+T_I)^T f(P_i) x_e(t+T_I) - x_e(t)^T f(P_i) x_e(t) - 2 \int_t^{t+T_I} u_e^T R K_i' x_e \, d\tau - \int_t^{t+T_I} x_e^T (K_i')^T R K_i' x_e \, d\tau = - \int_t^{t+T_I} x_e^T Q_{i+1} x_e \, d\tau. \qquad (6.47)$$

The above derivations are summarized in Algorithm 6.2. There is an alternative learning scheme that reduces the use of the expert's data: by applying the auxiliary input $u_i^j$ in (6.38) to the learner system (6.5) and carrying out derivations similar to (6.39)–(6.41), one can obtain


Algorithm 6.2 Data-driven model-free off-policy inverse RL algorithm
1. Initialization: Select $R > 0$, $Q_0 \ge 0$, an arbitrary stabilizing control gain $K_0$, and small thresholds $\epsilon_p$, $\epsilon_q$, $\epsilon_e$. Set $i = 0$, $j = 0$.
2. Optimal control: Calculate $P_i^j$ and $K_i^{j+1}$ using (6.41).
3. Set $j = j + 1$ and repeat Step 2 until $\|P_i^j - P_i^{j-1}\| < \epsilon_p$. Set $j = 0$ and compute $P_i$ and $K_i$ by (6.42).
4. Measure $\tilde{E}_i$ and $\tilde{e}_i$ in (6.16a), and calculate $f(P_i)$ by (6.43).
5. State-penalty weight improvement: Calculate $K_i'$ by (6.44) and update $Q_{i+1}$ by (6.47).
6. Set $i = i + 1$ and repeat Steps 2–5, stopping when $\tilde{E}_i \le \epsilon_e$ and $\|Q_{i+1} - Q_i\| < \epsilon_q$.

$$x(t+T_I)^T P_i^j x(t+T_I) - x(t)^T P_i^j x(t) - 2 \int_t^{t+T_I} (u + K_i^j x)^T R K_i^{j+1} x \, d\tau = - \int_t^{t+T_I} x^T \left( Q_i + (K_i^j)^T R K_i^j \right) x \, d\tau, \qquad (6.48)$$

which is equivalent to (6.41) but uses $x$ and $u$. By applying $u_i' = -K_i' x$ from (6.44) to system (6.5) and carrying out derivations similar to (6.45)–(6.47), one can obtain

$$x(t+T_I)^T f(P_i) x(t+T_I) - x(t)^T f(P_i) x(t) - 2 \int_t^{t+T_I} u^T R K_i' x \, d\tau - \int_t^{t+T_I} x^T (K_i')^T R K_i' x \, d\tau = - \int_t^{t+T_I} x^T Q_{i+1} x \, d\tau, \qquad (6.49)$$

which is equivalent to (6.47) but uses $x$ and $u$. More discussion of this method can be found in Xue et al. (2021).

Algorithm Implementation
Steps 2 and 5 in Algorithm 6.2 can be implemented as follows. Define the following operators for (6.41):

$$d_{x_e} = \left[ \operatorname{vev}(x_e(t+T_I)) - \operatorname{vev}(x_e(t)), \; \ldots, \; \operatorname{vev}(x_e(t+lT_I)) - \operatorname{vev}(x_e(t+(l-1)T_I)) \right]^T,$$
$$I_{x_e} = \left[ \int_t^{t+T_I} x_e \otimes x_e \, d\tau, \; \ldots, \; \int_{t+(l-1)T_I}^{t+lT_I} x_e \otimes x_e \, d\tau \right]^T,$$
$$I_{xu} = \left[ \int_t^{t+T_I} x_e \otimes u_e \, d\tau, \; \ldots, \; \int_{t+(l-1)T_I}^{t+lT_I} x_e \otimes u_e \, d\tau \right]^T, \qquad (6.50)$$

where $l$ is the number of sampled data groups. Then, (6.41) can be rewritten as

$$\eta_i \begin{bmatrix} \operatorname{vem}(P_i^j) \\ \operatorname{vecl}(K_i^{j+1}) \end{bmatrix} = \xi_i, \qquad (6.51)$$


where

$$\eta_i = \left[ d_{x_e}, \; -2 I_{x_e} \left( I_n \otimes (K_i^j)^T R \right) - 2 I_{xu} (I_n \otimes R) \right], \quad \xi_i = -I_{x_e} \operatorname{vecl}\left( Q_i + (K_i^j)^T R K_i^j \right). \qquad (6.52)$$

Using the batch least-squares method, $P_i^j$ and $K_i^{j+1}$ can be solved by

$$\begin{bmatrix} \operatorname{vem}(P_i^j) \\ \operatorname{vecl}(K_i^{j+1}) \end{bmatrix} = (\eta_i^T \eta_i)^{-1} \eta_i^T \xi_i. \qquad (6.53)$$
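A direct numerical implementation of this batch least-squares step is sketched below; the function name is my own, and for simplicity it regresses the full vec($P$) and vec($K$) rather than the book's half-vectorized vem()/vecl() bookkeeping in (6.50)–(6.53), which is an equivalent parametrization. It requires sufficiently many, persistently exciting data windows (cf. Remark 6.2 below).

```python
# Sketch: batch least-squares solve of the off-policy Bellman equation (6.41) from data.
import numpy as np

def solve_bellman_6_41(t, x_e, u_e, Q_i, K_ij, R, n_windows, samples_per_window):
    """t: (N,) times, x_e: (N, n) expert states, u_e: (N, m) expert inputs."""
    n, m = x_e.shape[1], u_e.shape[1]
    rows, rhs = [], []
    for w in range(n_windows):
        s, e = w * samples_per_window, (w + 1) * samples_per_window + 1
        ts, xs, us = t[s:e], x_e[s:e], u_e[s:e]
        # Regressor for vec(P): x(t+T_I)^T P x(t+T_I) - x(t)^T P x(t)
        dP = np.kron(xs[-1], xs[-1]) - np.kron(xs[0], xs[0])
        # Regressor for vec(K^{j+1}): -2 * integral of (u + K^j x)^T R K^{j+1} x
        wgt = (us + xs @ K_ij.T) @ R.T                      # rows are [R (u + K^j x)]^T
        dK = -2.0 * np.trapz([np.kron(wgt[k], xs[k]) for k in range(len(ts))], ts, axis=0)
        rows.append(np.concatenate([dP, dK]))
        # Known right-hand side of (6.41)
        M = Q_i + K_ij.T @ R @ K_ij
        rhs.append(-np.trapz(np.einsum('ki,ij,kj->k', xs, M, xs), ts))
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P = theta[:n * n].reshape(n, n)
    P = 0.5 * (P + P.T)                                      # symmetrize P_i^j
    K_next = theta[n * n:].reshape(m, n)                     # K_i^{j+1}
    return P, K_next
```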

For (6.47), define the following operators:

$$d_{x_e}' = \left[ \operatorname{vev}(x_e(t+T_I)) - \operatorname{vev}(x_e(t)), \; \ldots, \; \operatorname{vev}(x_e(t+kT_I)) - \operatorname{vev}(x_e(t+(k-1)T_I)) \right]^T,$$
$$I_{\hat{x}_e}' = \left[ \int_t^{t+T_I} \operatorname{vev}(x_e) \, d\tau, \; \ldots, \; \int_{t+(k-1)T_I}^{t+kT_I} \operatorname{vev}(x_e) \, d\tau \right]^T,$$
$$I_{x_e}' = \left[ \int_t^{t+T_I} x_e \otimes x_e \, d\tau, \; \ldots, \; \int_{t+(k-1)T_I}^{t+kT_I} x_e \otimes x_e \, d\tau \right]^T,$$
$$I_{xu}' = \left[ \int_t^{t+T_I} x_e \otimes u_e \, d\tau, \; \ldots, \; \int_{t+(k-1)T_I}^{t+kT_I} x_e \otimes u_e \, d\tau \right]^T,$$

where $k$ is the number of data groups. Similarly, $Q_{i+1}$ in (6.47) can be solved by

$$\operatorname{vem}(Q_{i+1}) = (\bar{\eta}_i^T \bar{\eta}_i)^{-1} \bar{\eta}_i^T \bar{\xi}_i, \qquad (6.54)$$

where

$$\bar{\eta}_i = -I_{\hat{x}_e}', \quad \bar{\xi}_i = (d_{x_e}')^T \operatorname{vem}(f(P_i)) - 2 (I_{xu}')^T \operatorname{vecl}(R K_i') - (I_{x_e}')^T \operatorname{vecl}\left( (K_i')^T R K_i' \right). \qquad (6.55)$$

Remark 6.2 Probing noise needs to be injected to satisfy the persistence-of-excitation condition (Jiang et al. 2012; Kiumarsi et al. 2015; Xue et al. 2020); white noise can be used. Note that $l$ should be no less than the number of unknowns to be solved in (6.53), i.e., $l \ge n(n+1)/2 + mn$, and similarly $k \ge n(n+1)/2$ in (6.54), so that $P_i$ and $Q_{i+1}$ in Algorithm 6.2 can be solved uniquely by (6.53) and (6.54). Other methods, e.g., recursive least squares, can also be used to solve for $P_i$ and $Q_{i+1}$.

Remark 6.3 At each iteration, the solution of the model-based Algorithm 6.1 satisfies the equations of the model-free Algorithm 6.2. Moreover, Algorithm 6.2 yields a unique solution at each iteration. We therefore conclude that Algorithm 6.2 recovers the solutions of the model-based Algorithm 6.1.


6.2.4 Simulation Examples

A simulation example is given to verify the effectiveness of the data-driven model-free Algorithm 6.2. Consider the learner and expert systems with the model

$$A = \begin{bmatrix} -1 & 2 \\ 2.2 & 1.7 \end{bmatrix}, \quad B = \begin{bmatrix} 2 \\ 1.6 \end{bmatrix}.$$

The expert's penalty weights $Q_e$ and $R_e$, the corresponding ARE solution $P_e$, and the control gain $K_e$ are

$$Q_e = \begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}, \quad R_e = 1, \quad P_e = \begin{bmatrix} 0.8886 & 0.0461 \\ 0.0461 & 2.1523 \end{bmatrix}, \quad K_e = \begin{bmatrix} 1.8509 & 3.5358 \end{bmatrix}.$$

The initial penalty weights $Q_0$ and $R$, control gain $K_0$, integral time interval $T_I$, and learning step $\bar{\alpha}$ for Algorithm 6.2 are

$$Q_0 = \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix}, \quad R = 1, \quad K_0 = \begin{bmatrix} 1.2292 & 2.1684 \end{bmatrix}, \quad T_I = 0.0008, \quad \bar{\alpha} = 0.1.$$

Q∗ =

Fig. 6.1 Convergence of and . K i using Algorithm 6.2

[

] 1.2896 2.2728 , K ∗ = [1.8523 5.5291]. 2.2728 5.1335

3.5

. Q i , . Pi ,

3 2.5 2 1.5 1 0.5 0

0

10

20

30

40

164 Fig. 6.2 Learner’s state and control input trajectories using Algorithm 6.2

6 Inverse Reinforcement Learning for Optimal Control Systems

10 0 -10 0

0.2

0.4

0.6

0.8

0

0.2

0.4

0.6

0.8

20 10 0

Figure 6.2 shows that the learner has the same trajectories of states and control inputs as the expert under the learned control gain . K ∗ .

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Optimal Control Systems This section studies inverse RL for nonlinear systems in the continuous-time domain. We start with optimal control of nonlinear expert and learner systems with quadratic penalty functions. Then, both model-based and model-free inverse RL algorithms are provided to reconstruct the expert’s cost function for a learner. The convergence, stability of the learned policy, and non-unique nature of algorithms are guaranteed.

6.3.1 Problem Formulation Expert System Consider the expert with the nonlinear time-invariant affine dynamics x˙ = f (xe ) + g(xe )u e ,

. e

(6.56)

where .xe ∈ Rn and .u e ∈ Rm are the expert’s state and control input, respectively, . f ∈ Rn and .g ∈ Rn×m are the drift function and the control input function, respectively. Assumption 6.4 The functions . f and .g are Lipschitz with . f (0) = 0.

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

165

Assumption 6.5 The control input .u e in (6.56) is optimal at the corresponding state x .

. e

The expert’s input .u e is optimal by minimizing the integrated cost function ʃ .

Ve (xe , u e ) =



t

(

) Q e (xe ) + u Te Re u e dτ,

(6.57)

where . Q e (xe ) = q T (xe )Q e q(xe ) ∈ R is expert’s state penalty with the state-penalty weight . Q e = Q Te ∈ Rn×n > 0 and a function .q(xe ) = [xes1 xes2 . . . xesn ] ∈ Rn with the power .s. . Re = ReT ∈ Rm×m > 0 is the control input-penalty weight. In the optimal control theory (Lewis et al. 2012), .u e is given by 1 u = − Re−1 g T (xe )∇Ve∗ (xe ), 2

. e

(6.58)

where .∇Ve∗ satisfies the HJB equation 1 0 = Q e (xe ) − ∇VeT (xe )g(xe )Re−1 g T (xe )∇Ve (xe ) + ∇VeT (xe ) f (xe ). 4

.

(6.59)

This defines the expert system. Learner System The learner system is described by .

x˙ = f (x) + g(x)u,

(6.60)

where .x ∈ Rn is the learner state and .u ∈ Rm is the control input. The dynamics . f and .g are identical with those in (6.56) and satisfy Assumption 6.5. The integrated cost function to be minimized is defined as ʃ .



V (x, u, Q) =

(Q(x) + u T Ru) dτ,

(6.61)

t

where. Q(x) = q T (x)Qq(x) ∈ R is learner’s state penalty with a state-penalty weight T n×n .Q = Q ∈ R > 0 and a function .q(x) ∈ Rn of the state .x, and . R = R T ∈ m×m > 0 is the penalty weight on control input. The function .q(·) in (6.57) and R (6.61) has the same mapping. To find the optimal .u that minimizes .V in (6.61), differentiating (6.61) yields the learner’s Bellman equation .



H (x, u) = Q(x) + u T Ru + ∇V T ( f (x) + g(x)u) = 0.

(6.62)

166

6 Inverse Reinforcement Learning for Optimal Control Systems

By . ∂∂uH = 0, the learner’s optimal control is given by 1 u ∗ = − R −1 g T (x)∇V ∗ (x), 2

.

(6.63)

which satisfies the HJB equation 1 0 = Q(x) − ∇V ∗T (x)g(x)R −1 g T (x)∇V ∗ (x) + ∇V ∗T (x) f (x), 4

.

(6.64)

where .V ∗ is the minimal cost. We provide the next standard assumption and definition before proposing an inverse RL control problem. Assumption 6.6 The learner does not know the expert’s cost function weights (Q e , Re ) in (6.57) but the expert’s trajectory .(xe , u e ). The system dynamics . f , .g in (6.56) and (6.60) are unknown.

.

Definition 6.2 If . R ∈ Rm×m > 0 and . Q > 0 in learner’s HJB Eq. (6.64) optimally correspond to a .V ∗ (x) such that .u ∗ in (6.63) is equal to .u e in (6.58) and .x = xe with the same initials .x(t0 ) = xe (t0 ), then we call that . Q is an equivalent weight to . Q e in (6.59). Problem 6.2 Let Assumptions 6.4–6.6 hold. Selecting a. R ∈ Rm×m > 0, the learner aims to learn an equivalent to . Q e in (6.59) to perform the same trajectories as the expert, i.e., (.x, .u ∗ ) = (.xe , .u e ) and its learned policy is stabilizing. This is a data-driven model-free inverse RL problem.

6.3.2 Model-Based Inverse Reinforcement Learning This section presents a model-based inverse RL algorithm that combines the optimal control learning process and the inverse optimal control (IOC) learning process using . f and . g. Then, based on this algorithm, we further come up with a data-driven modelfree inverse RL algorithm to solve Problem 6.2. Optimal Control Learning Given a fixed. R and the current estimate. Q in (6.61), the policy iteration (PI) (Johnson et al. 2011) is used to solve (6.63) and (6.64) for optimal control solution of the learner. This process is shown in the right block of the schematic diagram Fig. 6.3 and is performed in the inner-loop iteration of Algorithm 6.3. That is, given. R and. Q j where . j = 0, 1, . . . denotes the outer-loop iterations, we obtain the corresponding optimal control solution .u j by inner-loop interactions. If the current estimate . Q j is not the equivalent weight to . Q e , then . Q j needs to be improved, as prescribed in the following IOC process.

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

167

Fig. 6.3 Schematic diagram of inverse RL control problem

IOC for State-Penalty Weight Improvement Given the expert’s demonstration .u e in (6.58) and learner’s optimal trajectories ∗ .(x, u ), we improve . Q toward the equivalent weight to . Q e based on IOC (Haddad and Chellaboina 2011). This process is shown in the middle and left blocks of the schematic diagram Fig. 6.3. It is performed in the outer-loop iteration of Algorithm 6.3, e.g., the current . Q j is improved to be an equivalent weight to . Q e using j .u e and the converged .(x, u ) from inner loops by (6.65). Then, the learner will use j+1 to update the next optimal .u j+1 . Repeat the inner and outer loops this revised . Q j until . Q converges to an equivalent weight to . Q e when .‖u j+1 − u e ‖ is smaller than a small threshold. As a result, the learner obtains the same behavior (.xe , .u e ). Algorithm 6.3 Model-based inverse RL algorithm 1. Initialization: select Q 0 > 0, R > 0, initial stabilizing u 00 , and small thresholds ε1 and ε2 . Set j = 0. 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: Given j and initial stabilizing u j0 , set i = 0 4. Policy evaluation: Compute V ji by ) ( )T ( f (x) + g(x)u ji . 0 = q T (x)Q j q(x) + (u ji )T Ru ji + ∇V ji (x) 5.

(6.65)

Policy improvement: Compute u j (i+1) and v j (i+1) by 1 u j (i+1) = − R −1 g T (x)∇V ji (x). 2

(6.66)

Stop if ‖∇V ji − ∇V j (i−1) ‖ ≤ ε1 , then set ∇V j (x) = ∇V ji (x) and u j = u ji . Otherwise, set i ← i + 1 and go to Step 4. 7. State-penalty weight improvement: Update Q j+1 using the expert’s demonstration u e ( ) (6.67) q T (x)Q j+1 q(x) = u Te Ru e − 2u Te Ru j − (∇V j (x))T f (x) + g(x)u j . 6.

8. Stop if ‖u j+1 − u e ‖ ≤ ε2 . Otherwise, set u ( j+1)0 = u j and j ← j + 1, then go to Step 3.

168

6 Inverse Reinforcement Learning for Optimal Control Systems

Remark 6.4 It is seen that Algorithm 6.3 provides a numerical solution and has less complexity than directly solving the HJB Eq. (6.64). The smaller the initial weights are, the larger the exploration domain can be reached, which benefits successful learning. Moreover, unlike the inverse RL in Lin et al. (2017, 2019); Song et al. (2018) that involve stochastic and complex MDPs and require a large data set, Algorithm 6.3 is a simpler deterministic algorithm and does not need large-scale data. Next, we analyze the convergence, non-unique solutions, and stability of the learned policies of Algorithm 6.3. Theorem 6.4 (Convergence of Algorithm 6.3) Select . R > 0 and small . Q 0 > 0. Consider the Algorithm 6.3 for solving the Imitation Learning Problem. If Algorithm 6.3 is convergent, then the learner has .(x, u j ) = (xe , u e ) for . j → ∞, where .u j and .u e are given by (6.66) and (6.58), respectively. The state-penalty weight . Q j , ∞ . j = 0, 1, . . . converges to the weight . Q which is equivalent to . Q e . Proof First, note that if the initial state-penalty weight . Q 0 is semi-positive definite, then the nonlinear Lyapunov function (6.65) has a unique and positive-definite solution at step 4. With step 5 for policy improvement, there exists a converged solution 0 .u . This is the optimal control learning process (also called RL PI), and is proven to be convergent (Abu-Khalaf and Lewis 2005). Given the state-penalty weight . Q j , the optimal control learning (steps 4 and 5) gives ( )T ( ) 0 = q T (x)Q j q(x) + (u j )T Ru j + ∇V j (x) f (x) + g(x)u j .

.

(6.68)

Substituting (6.68) into (6.67) in Algorithm 6.3 yields q T (x)Q j+1 q(x) = u Te Ru e − 2u Te Ru j + (u j )T Ru j + Q j (x)

.

= (u e − u j )T R(u e − u j ) + q T (x)Q j q(x) ≥ 0.

(6.69)

This shows that at any outer-loop iteration . j, . Q j is positive definite where . j = 0, 1, . . ., and .‖Q j ‖ increases as . j increases. ∆ Denote .∆ j = (u e − u j )T R(u e − u j ). It is seen from Definition 6.2 that there exist different .(Q ∞ , R) to .(Q e , Re ) that result in .u ∞ = u e . Also, it is known that j j .u is uniquely determined by . Q . If the algorithm is convergent, given a small 0 0 ∞ initial . Q such that . Q ≤ Q , . Q j increases to . Q ∞ and .u k converges to .u e . Then, ∞ j j .‖∆ ‖ converges to 0. This means that as . j → ∞, one has .‖∆i (x)‖ = 0, .u → u e , j ∞ .Q → Q and u∞ = ue.

.

(6.70)

With this learned result, if the learner and expert systems have the same initial states x ∞ (t0 ) = xe (t0 ), then they have the same state .x = xe for all time.

.

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

169

It is seen that the convergent behaviors .(x, u ∞ ) = (xe , u e ) and . Q j → Q ∞ are achieved simultaneously. Then, one rewrites the learner’s HJB Eq. (6.64) as 1 0 = q T (xe )Q ∞ q(xe ) − (∇V ∞ )T (xe )g(xe )R −1 g T (xe )∇V ∞ (xe ) 4 + (∇V ∞ )T (xe ) f (xe ),

.

(6.71)

which gives converged .V ∞ (xe ) with . Q ∞ , such that .



1 −1 T 1 R g (xe )∇V ∞ (xe ) = − Re−1 g T (xe )∇Ve (xe ). 2 2

(6.72)

Therefore, one has the convergence . Q j → Q ∞ and .(x, u j ) → (xe , u e ). However, one cannot guarantee . Q ∞ = Q e because . R can be different from . Re . . Q ∞ is then the ◻ equivalent weight to . Q e as shown in Definition 6.2. Algorithm 6.3 can find an equivalent weight to the expert’s for the learner to solve the imitation problem. However, this equivalent weight may not be unique. Note that the uniqueness discussed here means that the learned penalty weight . Q ∞ and the selected . R are exactly the expert’s . Q e , Re , respectively. Otherwise, we say the learned solution is not unique. Theorem 6.5 (Non-uniqueness of . Q ∞ ) Let . Q j and .∇V j in Algorithm 6.3 converge to . Q ∞ and .∇V ∞ , respectively. Then, . Q ∞ satisfies q T (xe )(Q ∞ − Q e )q(xe ) 1 = ∇V ∞T (xe )g(xe )R −1 (R − Re )R −1 g T (xe )∇V ∞ (xe ) 4 + (∇Ve∗ (xe ) − ∇V ∞T (xe ))T f (xe ),

.

(6.73)

where .∇Ve is uniquely solved by (6.59), and .∇V ∞ (xe ) satisfies .

g T (xe )∇V ∞ (xe ) = R Re−1 g T (xe )∇Ve (xe ),

(6.74)

which implies that . Q ∞ may not be unique. Proof As shown in Theorem 6.4, when the learner obtains the convergence (x, u ∞ ) = (xe , u e ), we follow (6.72) and have

.

.

g T (xe )∇V ∞ (xe ) = R Re−1 g T (xe )∇Ve (xe ).

Then, subtracting (6.71) from (6.59) yields

(6.75)

170

6 Inverse Reinforcement Learning for Optimal Control Systems

q T (xe )(Q ∞ − Q e )q(xe ) 1 = ∇V ∞T (xe )g(xe )R −1 (R − Re )R −1 g T (xe )∇V ∞ (xe ) 4 + (∇VeT (xe ) − (∇V ∞ )T (xe )) f (xe ).

.

(6.76)

In (6.75), if one lets.g T (xe )X = R Re−1 g T (xe )∇Ve (xe ), there will be infinite number of solutions for. X unless rank.(g(xe )) = n. This means that one may find many solutions of.∇V ∞ (xe ) that make (6.76) hold. However, one cannot guarantee rank.(g(xe )) = n. Thus, one may obtain non-unique .∇V ∞ (xe ) that is different from .∇Ve (xe ). Besides, ∞ . R can be different from . Re . From (6.76), the equivalent weight . Q − Q e may ∞ be nonzero and . Q may be different from . Q e . There may be infinite number of solutions for . Q. All possible and non-unique solutions for . Q ∞ satisfy (6.76) with ∞ .∇V satisfying (6.75). ◻ Theorem 6.6 (Stability of the Learner Using Algorithm 6.3) Give the learner dynamics (6.60) with the cost function (6.61). Give small . Q 0 > 0 and . R > 0, and use Algorithm 6.3. Then, at each iteration of Algorithm 6.3, the learner dynamics (6.60) is asymptotically stable. Proof To prove the stability of (6.60), we prove that .V˙ j (x) ≤ 0 for any . j ≥ 0. It follows from (6.64) that at outer-loop iteration . j, one has 1 0 = q T (x)Q j q(x) − (∇V j (x))T g(x)R −1 g T (x)∇V j (x) + (∇V j (x))T f (x) 4 = q T (x)Q j−1 q(x) + (u e − u j−1 )T R(u e − u j−1 ) 1 + (∇V j )T (x)g(x)R −1 g T (x)∇V j (x) 4 1 + (∇V j )T (x)( f (x) − R −1 g T (x)∇V j (x)) 2 = q T (x)Q j−1 q(x) + (u e − u j−1 )T R(u e − u j−1 ) + (u j )T Ru j + V˙ j (x). (6.77)

.

With . Q j−1 > 0 proven in Theorem 6.4, it follows from (6.77) that .V˙ j (x) ≤ 0 where V˙ j (x) = 0 holds only when .x = 0, u = u e = 0. Thus, the learner is asymptotically ◻ stable during iterations of Algorithm 6.3.

.

Remark 6.5 Note that Algorithm 6.3 only learns state-penalty weight . Q but not the input-penalty weight . R. . R > 0 is arbitrarily selected for the learner and can be different from . Re > 0. This does not stop the learner from obtaining the expert control policy in (6.74) and . Q ∞ in (6.73) as shown by Theorems 6.5 and 6.4. Algorithm Implementation via Neural Networks (NNs) We now present an NN-based method to implement the inverse RL Algorithm 6.3 online. First, the learner approximates the performance cost function .V ji at step 4

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

171

of Algorithm 6.3 given the current . Q j and .u ji . Then, .u j (i+1) updates using .V ji at step 5. Third, the learner updates . Q j+1 at step 7 using the expert’s control input .u e and the converged solutions from inner-loop iterations, i.e., .u j and .V j . According to Weierstrass approximation (Werbos 1974) using polynomial approximation, there exists an activation vector function .ϕ(x) = ]T [ ϕ1 (x) ϕ2 (x) . . . ϕ N1 (x) with . N1 hidden-layer neurons, such that the cost function .V ji (x) and its gradient can be uniformly approximated as Vˆ ji (x) = (C ji )T ϕ(x),

(6.78a)

∇ Vˆ ji (x) = ∇ϕ T (x)C ji ,

(6.78b)

. .

[ ]T ji ji ji where .C ji = C1 C2 . . . C N1 ∈ R N1 is the weight vector, .∇ϕ(x) = [∇ϕ1 (x) ∇ϕ2 (x) . . . ∇ϕ N1 (x)]T is the Jacobian of .ϕ(x). Given the NNs (6.78a)–(6.78b), (6.65) is rewritten as

.

) ( e ji (t) = q T (x) Qˆ j q(x) + (uˆ ji )T R uˆ ji + (C ji )T ∇ϕ(x) f (x) + g(x)uˆ ji , (6.79)

.

where .e ji (t) is the approximation error and is forced to be zero in average sense (Luo et al. 2014) to find .C ji , . Qˆ j , and .uˆ ji are approximation values using NN (6.78a). It follows from (6.78a) that the NN weight .C ji has . N1 unknown constants. In order to solve the unique .C ji , the batch least squares (BLS) method is used to construct ¯ 1 ≥ N1 equations for (6.78b) from . N¯ 1 different time points. The continuous-time .N data can be well approximated by discretization with a small interval .T > 0. Define ⎤ ⎤ ⎡ α(x(T )) ψ(x(T )) ⎢ α(x(2T )) ⎥ ⎢ ψ(x(2T )) ⎥ ⎥ ⎥ ⎢ ⎢ ji ji , 𝚿 .ϕ =⎢ = ⎥ ⎥, ⎢ .. .. ⎦ ⎦ ⎣ ⎣ . . ¯ ¯ α(x( N1 T )) ψ(x( N1 T )) ⎡

(6.80)

where ( ) α(x(T )) = ∇ϕ(x(T )) f (x(T )) + g(x(T ))uˆ ji (T ) ,

.

ψ(x(T )) = −q T (x(T ))Q j q(x(T )) − (uˆ ji (T ))T R uˆ ji (T ). Then, .C ji is uniquely solved by )−1 ji T ji ( C ji = (ϕ ji )T ϕ ji (ϕ ) 𝚿 .

.

(6.81)

The optimal control input (6.66) is approximated as 1 uˆ j (i+1) = − R −1 g T (x)∇ϕ(x)C ji . 2

.

(6.82)

172

6 Inverse Reinforcement Learning for Optimal Control Systems

When the solution set (.C ji , .uˆ j (i+1) ) converges to a set (.C j , .uˆ j ), according to (6.67), the estimated state-penalty weight . Qˆ j+1 is then updated as ) ( q T (x) Qˆ j+1 q(x) = u Te Ru e − 2u Te R uˆ j − (C j )T ∇ϕ(x) f (x) + g(x)uˆ j .

.

(6.83)

The matrix . Qˆ j+1 in (6.83) can be computed by using batch least squares (BLS) method. . Qˆ j+1 has .(n + 1)n/2 unknown parameters. . Qˆ j+1 is uniquely solved by constructing . N¯ 2 ≥ (n + 1)n/2 equations. Define ⎤ ⎤ ⎡ j vev(q(T ))T δ (T ) ⎢ vev(q(2T ))T ⎥ ⎢ δ j (2T ) ⎥ ⎥ ⎥ ⎢ ⎢ .┌ = ⎢ ⎥, ⎥ , ∆j = ⎢ .. .. ⎦ ⎦ ⎣ ⎣ . . vev(q( N¯ 2 T ))T δ j ( N¯ 2 T ) ⎡

(6.84)

where ( ) δ j (T ) = −(C j )T ∇ϕ(x(T )) f (x(T )) + g(x(T ))uˆ j (T )

.

+ u Te (T )Ru e (T ) − 2u Te (T )R uˆ j (T ). Then, . Qˆ j+1 is uniquely solved by ( )−1 T j vem( Qˆ j+1 ) = ┌ T ┌ ┌ ∆ .

.

(6.85)

See .vem(·) and .vev(·) as defined in “Abbreviations and Notation”. Now, a NN-based inverse RL Algorithm 6.4 is summed up below to implement the inverse RL Algorithm 6.3 using online data. Algorithm 6.4 Model-based inverse RL algorithm via NNs 1. Initialization: select Qˆ 0 > 0, R > 0, initial stabilizing uˆ 00 , and small thresholds ε1 , ε2 . Set j = 0. 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: given j and initial stabilizing uˆ j0 , set i = 0. 4. Policy evaluation: Compute C ji by (6.81). 5. Policy improvement: Compute uˆ j (i+1) by (6.82). 6. Stop if ‖C ji − C j (i−1) ‖ ≤ ε1 , then set C j = C ji and uˆ j = uˆ ji . Otherwise, set i ← i + 1 and go to Step 4. 7. State-penalty weight improvement: update Qˆ j+1 by (6.85) using the expert’s demonstration ue . 8. Stop if ‖ Qˆ j+1 − Qˆ j ‖ ≤ ε2 . Otherwise, set uˆ ( j+1)0 = uˆ j and j ← j + 1, then go to Step 3.

The next result shows the same convergence between Algorithms 6.4 and 6.3. Theorem 6.7 (Convergence of Algorithm 6.4) Algorithm 6.4 converges to Algorithm 6.3 and obtains a convergence (.x, .uˆ j ) .→ (.xe , .u e ) as . j → ∞.

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

173

Proof In inner-loop iteration, given . j, . Qˆ j > 0, and . R > 0, the NN weight .C ji can be uniquely determined if rank(.(ϕi j )T ϕi j ) .≥ Ni . This can be satisfied by letting the number of collected data in (6.80) satisfy. N¯ 1 ≥ N1 . Thus,.Vˆ ji is solved by BLS (6.81) and converges to .V ji solved by (6.65). It has been proven in Abu-Khalaf et al. (2008) that .Vˆ ji in (6.78a) uniformly approximates to .V j . By repeating (6.81)–(6.82), when ji .C converges to .C j , the learner has the input .uˆ j . The approximated input uniformly converges to (6.66), i.e., .uˆ ji → u ji . In outer-loop iterations, the learner can use (6.85) to uniquely solve the . Qˆ j+1 with the condition of rank(.┌ T ┌) .≥ (n + 1)n/2. This can be satisfied by letting the number of collected data have . N¯ 2 ≥ (n + 1)n/2 in (6.84). Due to the convergence ˆ ji → V ji and .uˆ ji → u ji , one thus has that . Qˆ j+1 is uniquely solved by (6.85) and .V converges to . Q j+1 . It is seen that (6.81)–(6.82) and (6.83) in Algorithm 6.4 are rigorously derived from (6.65)–(6.67) of Algorithm 6.3 using NN (6.78a). Algorithm 6.4 obtains unique solutions in (6.81)–(6.85). These solutions converge to the solutions of (6.65)–(6.67) in Algorithm 6.3. We conclude that Algorithm 6.4 converges to Algorithm 6.3. Thus, ◻ the learner obtains a convergence (.x, .uˆ j ) .→ (.xe , .u e ) as . j → ∞.

6.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning Inverse RL Algorithms 6.3–6.4 require the knowledge of system dynamics. f and.g. In this section, we present an online model-free off-policy integral inverse RL algorithm to solve Problem 6.2 without using . f and .g. This algorithm is then implemented via NNs. Model-Free Integral Inverse RL Algorithm First, we use the off-policy integral RL (Luo et al. 2014) in inner-loop iterations of Algorithm 6.3, which allows to find model-free equations equivalent to (6.65)–(6.67). Rewrite (6.56) as .

x˙ = f (x) + g(x)u ji + g(x)(u − u ji ),

(6.86)

where .u ji ∈ Rm is the updated control input at inner-loop iteration .i given . j. Differentiating .V ji along with (6.86) yields .

V˙ ji = (∇V ji )T ( f + gu ji ) + (∇V ji )T g(x)(u − u ji ) = −q T (x)Q j q(x) − (u ji )T Ru ji − 2(u j(i+1) )T R(u − u ji ).

(6.87)

174

6 Inverse Reinforcement Learning for Optimal Control Systems

Then, integrating both sides of (6.87) from .t to .t + T yields the off-policy Bellman equation ʃ .

V (x(t + T )) − V (x(t)) − ji

ji

ʃ

t+T

=

(

t+T

)T ( ) ( 2 u j (i+1) R u − u ji dτ

t

) − q T (x)Q j q(x) − (u ji )T Ru ji dτ,

(6.88)

t

which will yield the converged .V j and .u j given . Q j . Next, the integral RL is used in outer-loop iteration . j of Algorithm 6.3 to find a model-free equation equivalent to (6.67). To update the . Q j+1 , one integrates both sides of (6.67) from .t to .t + T to yield ʃ

t+T

.

t

ʃ

q T (x)Q j+1 q(x) dτ

t+T

= t

(

) u Te Ru e − 2u Te Ru j dτ − V j (x(t + T )) + V j (x(t)).

(6.89)

The model-free off-policy integral inverse RL algorithm is summarized below. Algorithm 6.5 Model-free integral inverse RL algorithm 1. Initialization: select Q 0 > 0, R > 0, initial stabilizing u 00 , and small thresholds ε1 and ε2 . Set j = 0. Apply the stabilizing u to the learner dynamics (6.86). 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: given j and initial stabilizing u j0 , set i = 0. 4. Off-policy Integral RL for computing V j and u j+1 by (6.88). 5. Stop if ‖V ji − V j (i−1) ‖ ≤ ε1 , then set V j = V ji , and u j = u ji Otherwise, set i ← i + 1 and go to Step 4. 6. State-penalty weight improvement: update Q j+1 by (6.89) using the expert’s demonstration ue . 7. Stop if ‖Q j+1 − Q j ‖ ≤ ε2 . Otherwise, set u ( j+1)0 = u j and j ← j + 1, then go to Step 3.

The next theorem shows the convergence of Algorithms 6.5 to 6.3. Theorem 6.8 (Convergence of Algorithm 6.5) Algorithm 6.5 converges to Algorithm 6.3 such that the learner achieves a convergence (.x, .u j ) .→ (.xe , .u e ) as . j → ∞. Proof First, given . j, . j = 0, 1, . . . , ∞ and . Q j , one divides both sides of (6.88) by . T and takes the limit to be

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

175

ʃ t+T ( j (i+1) )T ( ) u 2 t R u − u ji dτ V ji (x(t + T )) − V ji (x(t)) − lim . lim T →0 T →0 T T ʃ t+T ( T ) j ji T ji q (x)Q q(x) + (u ) Ru dτ + lim t T →0 T = 0. (6.90) By L’Hopital’s rule, (6.90) becomes ( )T ( ) ) ( (∇V ji )T f + gu ji + g(x)(u − u ji ) − u j (i+1) R u − u ji

.

+ q T (x)Q j q(x) + (u ji )T Ru ji = 0.

(6.91)

Then, submitting .u j (i+1) (6.66) into (6.91) yields (6.90). This implies that (6.88) gives the same solution as the Lyapunov function (6.65) with the input (6.66). Similarly, one divides both the sides of (6.89) by .T and takes the limit operation. This yields (6.67) and shows that (6.89) gives the same solution as (6.67). Thus, the ◻ learner obtains a convergence (.x, .u j ) .→ (.xe , .u e ) as . j → ∞. Remark 6.6 For the learner, Algorithm 6.5 is completely model-free in both outerloop iteration . j and inner-loop iteration .i, which is compared with Choi et al. (2017); Ornelas et al. (2011) that require the knowledge of system dynamics for reconstructing cost functions. This is also in contrast to the Algorithms 6.3 and 6.4 that require the full knowledge of system dynamics in both iteration loops. Algorithm Implementation via NNs This subsection introduces an NN-based model-free off-policy inverse RL algorithm that implements the Algorithm 6.5 using online data without knowing any knowledge of system dynamics. Three NN-based approximators are designed for the learner’s value function .V ji and the updated control input .u j (i+1) in the Bellman equation (6.88) in Algorithm 6.5. Three approximators are .



.

Vˆ ji = (C ji )T ϕ(x),

j (i+1)

= (W ) φ(x), ji T

(6.92a) (6.92b)

[ ]T where .ϕ(x) in (6.92a) and .φ(x) = φ1 (x) φ2 (x) . . . φ N2 (x) are activation vector functions of two NNs, respectively. Moreover, .C ji ∈ R N1 and .W ji ∈ Rm×N2 . . N1 in (6.92a) and . N2 are hidden-layer neuron numbers of two NNs, respectively. ]T [ ∆ ji ji ji Define.u − u¯ ji = u˜ 1 u˜ 2 . . . u˜ m . Assume that weight. R is given in the form of .

R = diag{r1 , r2 , . . . , rm }T . Then, together with (6.92a)–(6.92b), (6.88) is expressed as

176

6 Inverse Reinforcement Learning for Optimal Control Systems

(C ji )T [ϕ(x(t + T ) − ϕ(x(t))] + 2

m ∑

.

ʃ = e¯ ji (t) −

ʃ ji

rh (Wh )T

h=1

t+T(

t+T

t

ji

φ(x)u˜ h dτ

) q T (x) Qˆ j q(x) + (u¯ ji )T R u¯ ji dτ,

(6.93)

t

where.e¯ ji (t) is the Bellman approximation error and is forced to be zero in the average sense to find .C ji and .W ji (Abu-Khalaf and Lewis ( 2005; Luo et al.)2014). One uses ji ji BLS method to solve the unique solution set . C ji W1 . . . Wm given . Q j . This solution set provides information for the update of . Qˆ j+1 in (6.89). Thus, one defines

∑ ji

.

⎡ʃ t+T t ⎢ ʃ t+2T ⎢ t+T =⎢ ⎢ ⎣ ʃ t+ιT

σ ji σ ji .. .

t+(ι−1)T σ

ji

⎤ ⎡ 1 1 dτ πx πu ⎥ ⎢πx2 πu2 dτ ⎥ ⎥ , ∏i j = ⎢ ⎢ .. .. ⎥ ⎣ . . ⎦ πxι πuι dτ

⎤ ⎥ ⎥ ⎥, ⎦

(6.94)

where σ ji = −q T (x) Qˆ j q(x) − (u¯ ji )T R u¯ ji ,

.

πxι = ϕ T (x(t + ιT )) − ϕ T (x(t + ιT − T )), [ ʃ t+ιT ] ʃ t+ιT ji φ T (x)u˜ 1 dτ . . . rm φ T (x)u˜ mji dτ . πuι = r1 t+ιT −T

(6.95)

t+ιT −T

The unknown parameters in (6.93) can be uniquely solved by using BLS when (∏ ji )T ∏ ji has full rank. It is required to satisfy the condition of persistent excitation that needs .ι groups of data collection. The positive integer .ι is no less than the number of unknown parameters in (6.93), i.e., .ι ≥ N1 + m N2 . Then, the unknown weights ji .C and .W ji in (6.94) are uniquely solved by .

.

[ ]T ji (C ji )T (W1 )T . . . (Wmji )T = ((∏ ji )T ∏ ji )−1 (∏ ji )T ∑ ji .

(6.96)

( ) ( ) When . C ji , W ji converges to . C j , W j , given the expert’s control input .u e and using this convergent solution set, the learner solves . Qˆ j+1 by ʃ

t+T

.

t

ʃ

( ) u Te Ru e − 2u Te RW j φ(x) dτ t ( ) − (C j )T ϕ(x(t + T ) − ϕ(x(t)) ,

q (x) Qˆ j+1 q(x) = T

t+T

(6.97)

where the unique . Qˆ j+1 is computed by constructing . N¯ 2 ≥ (n + 1)n/2 equations. Define

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

⎡ ʃ t+T ⎤ y j (T ) t ⎢ ʃ t+2T ⎢ y j (2T ) ⎥ ⎢ t+T ⎥ ¯ ⎢ j .Y = ⎢ ⎥,┌ = ⎢ .. ⎢ ⎦ ⎣ . ⎣ ʃ t+ N¯ 2 T y j ( N¯ 2 T )

⎤ vev(q(x(τ )))T dτ vev(q(x(τ )))T dτ ⎥ ⎥ ⎥, .. ⎥ . ⎦ T vev(q(x(τ ))) dτ t+( N¯ 2 −1)T



177

(6.98)

where ʃ .

y j (2T ) =

t+2T t+T

(

( ) ) u Te Ru e −2u Te RW j φ dτ −(C j )T ϕ(x(t + 2T )) − ϕ(x(t + T )) .

Then, . Qˆ j+1 is solved by ( )−1 T j vem( Qˆ j+1 ) = ┌¯ T ┌¯ ┌¯ Y .

.

(6.99)

Based on the above derivation, an online NN-based model-free integral inverse RL Algorithm 6.6 is summed up as follows. Algorithm 6.6 Model-free integral inverse RL algorithm via NNs 1. Initialization: Select Qˆ 0 > 0, R > 0, stabilizing u¯ 00 , and small thresholds ε1 , ε2 . Apply stabilizing u to (6.86). Set j = 0. 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: Given j and initial stabilizing u¯ j0 , set i = 0. 4. Off-policy integral RL for computing C ji and W ji by (6.96). 5. Stop if ‖Vˆ ji − Vˆ j (i−1) ‖ ≤ ε1 , then set (C j , W j ) = (C ji , W ji ). Otherwise, set i ← i + 1 and go to Step 4. 6. State-penalty weight improvement: Update Qˆ j+1 by (6.99) using the expert’s demonstration ue . 7. Stop if ‖ Qˆ j+1 − Qˆ j ‖ ≤ ε2 . Otherwise, set u¯ ( j+1)0 = u¯ j and j ← j + 1 and go to Step 3.

Remark 6.7 The convergence of Algorithms 6.6 to 6.3 can be obtained by the convergence of Algorithms 6.6 to 6.5. Referring to Theorem 6.7, it follows from Theorem 6.8 that Algorithm 6.5 convergences to Algorithm 6.3. Thus, one concludes that Algorithm 6.6 convergences to Algorithm 6.3.

6.3.4 Simulation Examples Consider the draft function . f (·) and control input function .g(·) for two systems as [ .

f (s) =

] [ ] −s1 + s2 0 , g(s) = , s2 −s13 − 21 s12 s2

(6.100)

178

6 Inverse Reinforcement Learning for Optimal Control Systems

where .s represents .x or .xe , .sn denotes the .n-th element of .s. The expert’s and ] [ ]T [ learner’s state penalties are in the form of . Q e (x) = x12 x22 Q e x12 x22 and . Q(x) = [ 2 2 ] [ 2 2 ]T x1 x2 Q x1 x2 with the penalty weights as [ .

Qe =

] [ ] 2 0.5 0.125 0.125 , Q0 = , 0.5 1 0.125 0.700

(6.101)

respectively. The control input weights are selected as . R = 21 and . Re = 21 . Based on the converse Hamilton–Jacobi–Bellman approach in Nevisti´c and Primbs (1996), 4 2 + xe2 . Then, we have .u e = −x24 . the expert’s optimal value function is .Ve = 21 xe1 ]T [ We select the activation function for the learner as .ϕ(x) = x14 x22 . Given the above parameters, we now show simulation results using the model-based inverse RL Algorithm 6.4. Set .T = 0.005. The initial states are .x(0) = xe (0) = [1, −1]T . j Figure 6.4 shows the convergence of the state penalty weight . Q j and .C2 . It is observed that . Q j converges to . Q ∞ which is equivalent to . Q e . They are .

Q



[

] [ ]T 0.3841 0.0852 = , C ∞ = 0.1385 0.5000 , 0.0852 0.5533

(6.102)

which results in .u ∗ = −x24 . This is the same as the expert’s control input. Figure 6.5 shows that the learner mimics the expert’s behavior (.xe , .u e ) with the learner using the learned policy. Note that . Q j in Fig. 6.4 does not converge to . Q e because . Q j converges to an equivalent weight. Q ∞ that is not equal to. Q e as shown in Definition 6.2 and Theorem 6.5. This equivalent weight can define the learner the same optimal behavior as the expert. The existence and the convergence of the equivalent weight are shown in Theorems 6.4 and 6.5. Fig. 6.4 Convergence of . Q j j and .C2 using Algorithm 6.4

2 1.9 1.8 1.7 0

2

4

6

8

10

12

14

16

18

0

2

4

6

8

10

12

14

16

18

0.54 0.53 0.52 0.51 0.5

6.3 Off-Policy Inverse Reinforcement Learning for Nonlinear … 1

States

Fig. 6.5 Trajectories of the expert and learner using Algorithm 6.4

179

0.5 0 -0.5 -1 0

1

2

3

4

5

6

4

5

6

Control inputs

Time (s) 0 -0.5 -1 -1.5 0

1

2

3

Time (s)

Fig. 6.6 Convergence of . Q j j and .W2 using Algorithm 6.6

1.6

1.55

1.5 0

2

4

6

8

10

12

14

16

18

0

2

4

6

8

10

12

14

16

18

1.1 1.05 1

Given the above parameters, we now show simulation results using the modelfree inverse RL Algorithm 6.6 for the nonlinear system example. Select the activation ]T ]T [ [ function for the learner as .ϕ(x) = x14 x22 x24 and .φ(x) = x14 x22 x24 . j Figure 6.6 shows the convergence of the state penalty weight . Q j and .W2 . Again, j ∞ . Q converges to . Q which is equivalent to . Q e . They are .

Q∞ =

[

] [ ]T 0.6078 0.1450 , W ∞ = 0.0002 1.0001 0.0001 , 0.1450 0.5340

(6.103)

which results in approximate .u ∗ = −x24 . This is the same as the expert’s control input. Figure 6.7 shows that the learner mimics the expert’s behavior (.xe , .u e ) with the learner using the learned policy.

Fig. 6.7 Trajectories of the expert and learner using Algorithm 6.6

6 Inverse Reinforcement Learning for Optimal Control Systems 1

States

180

0.5 0 -0.5 -1 0

1

2

3

4

5

6

4

5

6

Control inputs

Time (s) 0 -0.5 -1 -1.5 0

1

2

3

Time (s)

References Ab Azar N, Shahmansoorian A, Davoudi M (2020) From inverse optimal control to inverse reinforcement learning: A historical review. Ann Rev Control 50:119–138 Abbeel P, Ng AY (2004) Apprenticeship learning via inverse reinforcement learning. In: Proceedings of the twenty-first ICML Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791 Abu-Khalaf M, Lewis FL, Huang J (2008) Neurodynamic programming and zero-sum games for constrained control systems. IEEE Trans Neural Networks 19(7):1243–1252 Arora S, Doshi P (2021) A survey of inverse reinforcement learning: Challenges, methods and progress. Artif Intell 297:103500 Atkeson CG, Schaal S (1997) Robot learning from demonstration. ICML 97:12–20 Bertsekas DP (1997) Nonlinear programming. J Oper Res Soc 48(3):334–334 Brown D, Niekum S (2018) Efficient probabilistic performance bounds for inverse reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence, 32(1) Choi S, Kim S, Jin Kim H (2017) Inverse reinforcement learning control for trajectory tracking of a multirotor UAV. Int J Control Autom Syst 15(4):1826–1834 Chu KF, Lam AY, Fan C, Li VO (2020) Disturbance-aware neuro-optimal system control using generative adversarial control networks. IEEE Trans Neural Networks Learn Syst 32(10):4565– 4576 Haddad WM, Chellaboina V (2011) Nonlinear dynamical systems and control. Princeton University Press Ho J, Ermon S (2016) Generative adversarial imitation learning. In: Advances in neural information processing systems Imani M, Braga-Neto UM (2018) Control of gene regulatory networks using Bayesian inverse reinforcement learning. IEEE/ACM Trans Comput Biol Bioinf 16(4):1250–1261 Imani M, Ghoreishi SF (2021) Scalable inverse reinforcement learning through multifidelity Bayesian optimization. IEEE Trans Neural Networks Learn Syst. https://doi.org/10.1109/ TNNLS.2021.3051012 Jean F, Maslovskaya S (2018). Inverse optimal control problem: the linear-quadratic case. In: 2018 IEEE conference on decision and control, pp 888–893 Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704



Johnson M, Aghasadeghi N, Bretl T (2013) Inverse optimal control for deterministic continuoustime nonlinear systems. In: 52nd IEEE conference on decision and control, pp 2906–2913 Johnson M, Bhasin S, Dixon WE (2011) Nonlinear two-player zero-sum game approximate solution using a policy iteration algorithm. In: 2011 50th IEEE conference on decision and control and european control conference, pp 142–147 Kalman RE (1964) When is a linear control system optimal? Kiumarsi B, Lewis FL, Naghibi-Sistani MB, Karimpour A (2015) Optimal tracking control of unknown discrete-time linear systems using input-output measured data. IEEE Trans Cybern 45(12):2770–2779 Levine S, Koltun V (2012) Continuous inverse optimal control with locally optimal examples. ArXiv preprint arXiv:1206.4617 Levine S, Popovic Z, Koltun V (2011) Nonlinear inverse reinforcement learning with gaussian processes. Adv Neural Inf Process Syst 24 Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control, 3rd edn. John Wiley & Sons Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ Syst Mag 9(3):32–50 Lin X, Beling PA, Cogill R (2017) Multiagent inverse reinforcement learning for two-person zerosum games. IEEE Trans Games 10(1):56–68 Lin X, Adams SC, Beling PA (2019) Multi-agent inverse reinforcement learning for certain generalsum stochastic games. J Artif Intell Res 66:473–502 Luo B, Wu HN, Huang T (2014) Off-policy reinforcement learning for . H∞ control design. IEEE Trans Cybern 45(1):65–76 Nevisti´c V, Primbs JA (1996) Constrained nonlinear optimal control: a converse HJB approach. California Institute of Technology, Pasadena, CA, USA, Tech. Rep. TR96-021, 1996 Ng AY, Russell SJ (2000) Algorithms for inverse reinforcement learning. ICML 1(2) Ornelas F, Sanchez EN, Loukianov AG (2011) Discrete-time nonlinear systems inverse optimal control: A control Lyapunov function approach. In: 2011 IEEE international conference on control applications, pp 1431–1436 Sammut C, Hurst S, Kedzier D, Michie D (1992) Learning to fly. In: Machine learning proceedings 1992. Morgan Kaufmann, pp 385–393 Sanchez EN, Ornelas-Tellez F (2017) Discrete-time inverse optimal control for nonlinear systems. CRC Press Self R, Abudia M, Kamalapurkar R (2020) Online inverse reinforcement learning for systems with disturbances. In: 2020 American control conference, pp 1118–1123 Song J, Ren H, Sadigh D, Ermon S (2018) Multi-agent generative adversarial imitation learning. Adv Neural Inf Process Syst 31 Song J, Ren H, Sadigh D, Ermon S (2018) Multi-agent generative adversarial imitation learning. In: Advances in neural information processing systems, p 31 Syed U, Schapire RE (2007) A game-theoretic approach to apprenticeship learning. In: Advances in neural information processing systems, p 20 Vega CJ, Suarez OJ, Sanchez EN, Chen G, Elvira-Ceja S, Rodriguez-Castellanos D (2018) Trajectory tracking on complex networks via inverse optimal pinning control. IEEE Trans Autom Control 64(2):767–774 Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for continuoustime linear systems based on policy iteration. Automatica 45(2):477–484 Werbos P (1974) New tools for prediction and analysis in the behavioral sciences. Ph. D. Dissertation, Harvard University Xue W, Fan J, Lopez VG, Jiang Y, Chai T, Lewis FL (2020) Off-policy reinforcement learning for tracking in continuous-time systems on two time scales. 
IEEE Trans Neural Networks Learn Syst 32(10):4334–4346 Xue W, Kolaric P, Fan J, Lian B, Chai T, Lewis FL (2021) Inverse reinforcement learning in tracking control based on inverse optimal control. IEEE Trans Cybern 52(10):10570–10581

Chapter 7

Inverse Reinforcement Learning for Two-Player Zero-Sum Games

7.1 Introduction

In Chap. 6, we demonstrated how to solve inverse RL or inverse optimal control problems for linear and nonlinear systems in a data-driven and model-free manner, assuming no external disturbances. In this chapter, we extend our focus to systems that are subject to non-cooperative and adversarial inputs. Such systems are commonly found in various applications, including aircraft, automobiles, electric power systems, economic entities, computer networks, manufacturing, and industrial systems. In control theory, the objective is to find control inputs that counteract disturbances and stabilize these systems. The framework of zero-sum games (Lewis et al. 2012) provides a powerful method to achieve this goal. By defining a cost function based on zero-sum game principles, the control input minimizes the cost, while the adversarial input maximizes it. The optimal control solution is obtained by minimizing the effects of the worst-case adversarial input, which involves solving the Hamilton–Jacobi–Isaacs (HJI) equation for nonlinear systems or the game algebraic Riccati equation for linear systems. This approach is widely used for robust optimal control, $H_\infty$ control, and finding the Nash equilibrium solution of zero-sum games. It is worth noting that RL offers a numerical approach to solving the generalized algebraic Riccati equation and the HJI equation more easily (Kiumarsi et al. 2012; Lewis et al. 2012; Modares et al. 2012).

Standard inverse zero-sum games (Tsai et al. 2016) involve recovering the corresponding cost function by utilizing the system's dynamics information, behavior, or trajectory data when the system is at a Nash equilibrium. Similar principles apply to the inverse $H_\infty$ problem (Fujii and Khargonekar 1988; Luo et al. 2012). Traditional approaches typically require knowledge of mathematical models that accurately reflect the system dynamics, which may not always be readily available. To fully exploit the information contained in the data, we delve deeper into the concepts of differential games and extend the results of Chap. 6 to two-player zero-sum games in this chapter. We propose a data-driven and model-free inverse solution for zero-sum games, which is compared with model-based inverse differential games (Lin et al. 2019; Self et al. 2020). The focus of this chapter is on model-free inverse solutions for two-player zero-sum games in both linear and nonlinear dynamic systems.

7.2 Inverse Q-Learning for Linear Two-Player Zero-Sum Games

In contrast to the optimal control systems discussed in Chap. 6, this section focuses on inverse RL for linear expert and learner systems that are affected by external disturbances. Both systems are modeled as two-player zero-sum games and have their respective Nash equilibria. The objective of the learner is to reconstruct the expert's cost function in order to mimic the expert's states and control inputs based on the trajectories of states and controls provided by the expert.

7.2.1 Problem Formulation

Expert System

The expert system is modeled as
$$\dot{x}_e = A x_e + B u_e + D v_e, \qquad (7.1)$$

where $x_e \in \mathbb{R}^n$, $u_e \in \mathbb{R}^m$, and $v_e \in \mathbb{R}^k$ denote the expert's state, control input, and adversarial input, respectively. The constant system dynamics matrices $A$, $B$, and $D$ have appropriate dimensions.

Assumption 7.1 The pair $(A, B)$ is controllable.

Assumption 7.2 The control input $u_e$ is optimal at the corresponding $x_e$.

The control input $u_e$ and the disturbance $v_e$ are non-cooperative; they are the two players in a zero-sum game. That is, $u_e$ minimizes the following integrated cost function, while $v_e^*$, the worst case of $v_e$, maximizes it:
$$V_e(x_e) = \int_t^{\infty} \left( x_e^T Q_e x_e + u_e^T R_e u_e - \gamma_e^2 v_e^T v_e \right) d\tau, \qquad (7.2)$$

where $Q_e = Q_e^T \in \mathbb{R}^{n \times n} \geq 0$ and $R_e = R_e^T \in \mathbb{R}^{m \times m} > 0$ are penalty weights, and $\gamma_e > 0$ is the attenuation factor. According to the standard two-player zero-sum game (Lewis et al. 2012), the optimal $u_e$ and the worst-case $v_e^*$ are given by
$$u_e = -R_e^{-1} B^T P_e x_e \triangleq -K_e x_e, \qquad (7.3a)$$
$$v_e^* = \frac{1}{\gamma_e^2} D^T P_e x_e, \qquad (7.3b)$$

where $K_e \triangleq R_e^{-1} B^T P_e$ and $P_e > 0$ satisfies the algebraic Riccati equation (ARE)
$$A^T P_e + P_e A + Q_e - P_e B R_e^{-1} B^T P_e + \frac{1}{\gamma_e^2} P_e D D^T P_e = 0. \qquad (7.4)$$
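For reference, a stabilizing solution of a game ARE of the form (7.4) can also be computed directly when a model is available, which is useful later as a baseline for checking learned quantities. The following is a minimal sketch (an added illustration, not the book's code) based on the stable invariant subspace of the associated Hamiltonian matrix; the function name and the reuse of the Sect. 7.2.4 system matrices are illustrative assumptions.

```python
import numpy as np

def solve_game_are(A, B, D, Q, R, gamma):
    """Sketch: stabilizing P of A'P + PA + Q - P B R^{-1} B' P + (1/gamma^2) P D D' P = 0,
    obtained from the stable invariant subspace of the Hamiltonian matrix."""
    n = A.shape[0]
    # Effective quadratic term combining the minimizing and maximizing players
    S = B @ np.linalg.inv(R) @ B.T - (1.0 / gamma**2) * D @ D.T
    H = np.block([[A, -S], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    stable = eigvecs[:, eigvals.real < 0]        # n stable eigenvectors (assumed to exist)
    X, Y = stable[:n, :], stable[n:, :]
    P = np.real(Y @ np.linalg.inv(X))            # P = Y X^{-1}
    return 0.5 * (P + P.T)                       # symmetrize against round-off

# Illustrative check with the matrices used later in Sect. 7.2.4 (system (7.46))
A = np.array([[-3.0, -2.0], [2.0, -3.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[1.0], [0.0]])
Pe = solve_game_are(A, B, D, Q=8.0 * np.eye(2), R=np.array([[2.0]]), gamma=5.0)
Ke = np.linalg.solve(np.array([[2.0]]), B.T @ Pe)   # K_e = R_e^{-1} B' P_e
print(Pe, Ke)
```

Under these assumptions, the printed gain can be compared against the expert gain reported later in (7.47).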

Here, $\gamma_e \geq \gamma_e^*$ for some $\gamma_e^*$ such that (7.4) has a unique stabilizing solution $P_e$ (van der Schaft 1992). The pair $(u_e, v_e^*)$ is the saddle point. With this, the cost function $V_e$ in (7.2) can be represented as $V_e(x_e) = x_e^T P_e x_e$, and the expert (7.1) reaches the Nash equilibrium
$$V_e(x_e(t_0), u_e, v_e) \leq V_e(x_e(t_0), u_e, v_e^*) \leq V_e(x_e(t_0), u_e^-, v_e^*) \qquad (7.5)$$
for any $u_e^-$, where $u_e^- \neq u_e$.

Learner System

The learner system that aims to imitate the expert's trajectory is modeled as
$$\dot{x} = A x + B u + D v, \qquad (7.6)$$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^m$, and $v \in \mathbb{R}^k$ denote the learner's state, control input, and disturbance input, respectively. The control input $u$ and the disturbance $v$ are non-cooperative. The learner plays the standard two-player zero-sum game (Lewis et al. 2012). Given the integrated cost function
$$V(x, u, v) = \int_t^{\infty} \left( x^T Q x + u^T R u - \gamma^2 v^T v \right) d\tau, \qquad (7.7)$$

where $Q = Q^T \in \mathbb{R}^{n \times n} \geq 0$, $R = R^T \in \mathbb{R}^{m \times m} > 0$, and $\gamma > 0$ are the state-penalty weight, the control-input-penalty weight, and the disturbance attenuation factor, respectively, the control input $u$ seeks to minimize $V$ while the adversarial input $v$ seeks to maximize it. To find such a solution, we start with the Hamiltonian function
$$H(V, u, v) = x^T Q x + u^T R u - \gamma^2 v^T v + \nabla V^T (A x + B u + D v). \qquad (7.8)$$

By the stationary conditions $\partial H / \partial u = 0$ and $\partial H / \partial v = 0$, the optimal input $u^*$ and the worst-case disturbance $v^*$ are given by
$$u^* = -R^{-1} B^T P x \triangleq -K x, \qquad (7.9a)$$
$$v^* = \frac{1}{\gamma^2} D^T P x. \qquad (7.9b)$$

The optimized value function is represented as
$$V^*(x) = \min_u \max_v V(x, u, v) = x^T P x, \qquad (7.10)$$

where $P > 0$ satisfies the ARE
$$A^T P + P A + Q - P B R^{-1} B^T P + \frac{1}{\gamma^2} P D D^T P = 0, \qquad (7.11)$$

or, equivalently, the Bellman equation
$$x^T Q x + u^T R u - \gamma^2 v^T v + \nabla V^{*T} (A x + B u + D v) = 0. \qquad (7.12)$$
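As a quick sanity check connecting (7.11) and (7.12), the following sketch (an illustration added here, not the book's code) evaluates the left-hand side of (7.12) at the saddle-point policies (7.9a)-(7.9b); it reduces algebraically to $x^T(\text{ARE residual})\,x$, so it vanishes exactly when $P$ solves (7.11).

```python
import numpy as np

def bellman_residual_at_saddle(P, A, B, D, Q, R, gamma, x):
    """Evaluate the left-hand side of (7.12) at u = u*, v = v* from (7.9);
    with V*(x) = x'Px this equals x' (ARE residual) x."""
    u = -np.linalg.solve(R, B.T @ P @ x)          # (7.9a)
    v = (1.0 / gamma**2) * D.T @ P @ x            # (7.9b)
    grad_V = 2.0 * P @ x                          # gradient of x'Px
    return (x @ Q @ x + u @ R @ u - gamma**2 * (v @ v)
            + grad_V @ (A @ x + B @ u + D @ v))
```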

The pair $(u^*, v^*)$ is called the saddle point, and the learner reaches the Nash equilibrium
$$V(x(t_0), u^*, v) \leq V(x(t_0), u^*, v^*) \leq V(x(t_0), u, v^*). \qquad (7.13)$$

Provided the above expert–learner systems, we present the following assumptions, definition, and research problem.

Assumption 7.3 The system dynamics $A$, $B$, $D$ in (7.1) and (7.6), the expert's penalty weights $Q_e$, $R_e$, the adversarial input $v_e$, and the parameter $P_e$ are unknown. The learner knows the expert's trajectory data $(x_e, u_e)$.

Definition 7.1 Given $R$ and $\gamma$, if a $Q$ in the learner's ARE (7.11) yields a $K$ in (7.9a) such that $K = K_e$ in (7.3a), then we call such a $Q$ an equivalent weight to $Q_e$ in (7.2).

Assumption 7.4 Assume $\gamma$ in Definition 7.1 satisfies $\gamma > \bar{\gamma}$, where $\bar{\gamma} > 0$ is the minimum attenuation level corresponding to the equivalent weight $Q$ to $Q_e$ given $R$.

Remark 7.1 Assumption 7.4 ensures the existence of a unique stabilizing solution $P$ of (7.11) when $Q$ is the equivalent weight to $Q_e$ given an $R$. This is achieved by selecting $\gamma$ as large as possible.

Problem 7.1 Let Assumptions 7.1–7.4 hold. Given a selected $R \in \mathbb{R}^{m \times m} > 0$ and $\gamma > 0$, the learner aims to learn a weight equivalent to $Q_e$ so that it reproduces $(x_e, u_e)$ when $v = v_e$ while keeping its own system (7.6) stabilized. This is essentially a data-driven inverse RL zero-sum problem.


7.2.2 Model-Free Inverse Q-Learning

In order to address Problem 7.1, we present an inverse Q-learning algorithm in this section. The algorithm combines an inner Q-learning procedure, aimed at solving the Nash equilibrium of the zero-sum game, with an outer inverse optimal control procedure that updates the learner's state-penalty weight. The algorithm is model-free because the inner Q-learning procedure works with a Q-function that incorporates the state, the control, and the disturbance, so the problem can be solved without explicit models of the system dynamics.

Rewrite the learner's value function in (7.10), by adding the learner Hamiltonian equation (7.12), as the Q-function $\mathcal{Q}(x, u, v)$:
$$
\begin{aligned}
\mathcal{Q}(x, u, v) &\triangleq H(V^*, u, v) + V^*(x) \\
&= x^T P (A x + B u + D v) + (A x + B u + D v)^T P x + x^T Q x + u^T R u - \gamma^2 v^T v + x^T P x \\
&= \begin{bmatrix} x \\ u \\ v \end{bmatrix}^T
\begin{bmatrix} Q + P + A^T P + P A & P B & P D \\ B^T P & R & 0 \\ D^T P & 0 & -\gamma^2 I_k \end{bmatrix}
\begin{bmatrix} x \\ u \\ v \end{bmatrix}
= X^T \begin{bmatrix} \bar{Q}_{xx} & \bar{Q}_{xu} & \bar{Q}_{xv} \\ \bar{Q}_{ux} & \bar{Q}_{uu} & \bar{Q}_{uv} \\ \bar{Q}_{vx} & \bar{Q}_{vu} & \bar{Q}_{vv} \end{bmatrix} X
\triangleq X^T \bar{Q} X,
\end{aligned} \qquad (7.14)
$$
where
$$X \triangleq \begin{bmatrix} x \\ u \\ v \end{bmatrix}, \qquad
\bar{Q} \triangleq \begin{bmatrix} \bar{Q}_{xx} & \bar{Q}_{xu} & \bar{Q}_{xv} \\ \bar{Q}_{ux} & \bar{Q}_{uu} & \bar{Q}_{uv} \\ \bar{Q}_{vx} & \bar{Q}_{vu} & \bar{Q}_{vv} \end{bmatrix},$$
$$\bar{Q}_{xx} = A^T P + P A + Q + P, \quad \bar{Q}_{xu} = \bar{Q}_{ux}^T = P B, \quad \bar{Q}_{xv} = \bar{Q}_{vx}^T = P D, \quad \bar{Q}_{uu} = R, \quad \bar{Q}_{uv} = \bar{Q}_{vu}^T = 0, \quad \bar{Q}_{vv} = -\gamma^2 I_k.$$
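To make the block structure in (7.14) concrete, the following sketch (an added illustration; the helper name is an assumption) assembles $\bar{Q}$ from a given $P$ and the model matrices.

```python
import numpy as np

def q_kernel(P, A, B, D, Q, R, gamma):
    """Assemble the Q-function kernel matrix of (7.14) from P and the model."""
    m, k = B.shape[1], D.shape[1]
    Qxx = A.T @ P + P @ A + Q + P
    top = np.hstack([Qxx, P @ B, P @ D])
    mid = np.hstack([B.T @ P, R, np.zeros((m, k))])
    bot = np.hstack([D.T @ P, np.zeros((k, m)), -gamma**2 * np.eye(k)])
    return np.vstack([top, mid, bot])
```

Note that the gain recovered from the blocks, $\bar{Q}_{uu}^{-1}\bar{Q}_{ux} = R^{-1}B^T P$, coincides with the feedback gain in (7.9a); this is the property the learning scheme exploits below in (7.18a).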

Here $\bar{Q}$ is called the Q-function kernel matrix. By using the integral RL technique (Li et al. 2014), one obtains the integral Bellman equation from (7.7) as
$$V(x(t+T)) = V(x(t)) - \int_t^{t+T} \left( x^T Q x + u^T R u - \gamma^2 v^T v \right) d\tau, \qquad (7.15)$$

where $T > 0$ is a small integral time interval. As shown in Vamvoudakis (2017), one has $V^* = \mathcal{Q}^*(x, u^*, v^*) = \min_u \max_v \mathcal{Q}(x, u, v)$, and the above integral Bellman equation in terms of $\mathcal{Q}^*(x, u^*, v^*)$ can be equivalently rewritten as
$$\mathcal{Q}^*(x(t+T), u(t+T), v(t+T)) = \mathcal{Q}^*(x(t), u(t), v(t)) - \int_t^{t+T} \left( x^T Q x + u^T R u - \gamma^2 v^T v \right) d\tau. \qquad (7.16)$$

Furthermore, (7.16) is re-expressed in terms of $X^T \bar{Q} X$ in (7.14) as
$$X^T(t+T)\, \bar{Q}\, X(t+T) = X^T(t)\, \bar{Q}\, X(t) - \int_t^{t+T} \left( x^T Q x + u^T R u - \gamma^2 v^T v \right) d\tau. \qquad (7.17)$$

The optimal control input $u^*$ in (7.9a) and the worst-case disturbance $v^*$ in (7.9b) are computed by
$$u^* = \arg\min_u \mathcal{Q}(x, u, v) = -\bar{Q}_{uu}^{-1} \bar{Q}_{ux} x, \qquad (7.18a)$$
$$v^* = \arg\max_v \mathcal{Q}(x, u, v) = -\bar{Q}_{vv}^{-1} \bar{Q}_{vx} x. \qquad (7.18b)$$

7.2 Inverse Q-Learning for Linear Two-Player Zero-Sum Games

189

Algorithm 7.1 Model-free inverse Q-learning for linear two-player zero-sum game 1. Initialization: select Q 0 ≥ 0, R > 0, γ > 0 and initial stabilizing u 00 . Set j = 0 and small ∈ Q¯ > 0 and ∈ Q > 0. 2. Outer-loop iteration j 3. Inner-loop iteration i: set i = 0 and use initial stabilizing u j0 4. Policy evaluation: Compute Q¯ ji by X T (t) Q¯ ji X (t) − X T (t + T ) Q¯ ji X (t + T ) ∫ t+T ( ) x T Q j x + (u ji )T Ru ji − γ 2 (v ji )T v ji dτ. =

(7.19)

t

5.

Policy improvement: Compute u j (i+1) and v j (i+1) by u j (i+1) = −( Q¯ uu )−1 Q¯ ux x,

(7.20a)

ji −1 ¯ ji −( Q¯ vv ) Q vx x.

(7.20b)

ji

v

j (i+1)

=

ji

Stop if || Q¯ ji − Q¯ j (i−1) || ≤ ∈ Q¯ , then set Q¯ j = Q¯ ji , u j = u ji , v j = v ji and go to Step 7, otherwise set i ← i + 1 and go to Step 4. 7. State-penalty weight correction: Update Q j+1 by 6.

∫ t

t+T

∫ xeT Q j+1 xe dτ =

t+T

(u e − u j (xe ))T R(u e − u j (xe )) dτ

t



+ t

t+T

xeT Q j xe dτ,

(7.21)

j j where u j (xe ) = −( Q¯ uu )−1 Q¯ ux xe . j+1 j − Q || ≤ ∈ Q , otherwise set u ( j+1)0 = u j and j ← j + 1, then go to Step 3. 8. Stop if ||Q

Equation (7.21) shows that the state-penalty weight . Q j increases as . j increases. Based on the fact that there are infinite equivalent state-penalty weights to the expert’s, j . Q would increase to the neighborhood of one equivalent penalty weight and simultaneously .u j approaches .u e , which in turn results in a smaller incremental of . Q j to be closer to the equivalent penalty weight. This outer-loop iterative procedure shares the same essential iteration principle as the iterative procedures in Lewis and Vrabie (2009), Moerder and Calise (1985), Nelder and Mead (1965), Lian et al. (2022a), Xue et al. (2023) and is convergent. As . j → ∞, it follows from (7.23) that one has ∫

t+T

.

t

xeT Q ∞ xe dτ =



t+T

(u e − u ∞ (xe ))T R(u e − u ∞ (xe ))dτ

t



+ t

t+T

xeT Q ∞ xe dτ,

(7.22)

and thus u ∞ (xe ) = u e .

.

(7.23)

190

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

That is Δ

u ∞ (xe ) = −R −1 B T P ∞ xe = −K ∞ xe ,

.

.



K xe =

Re−1 B T Pe xe .

(7.24a) (7.24b)

This shows that the learner obtains the expert control feedback gain, i.e., . K ∞ = −Re−1 B T Pe . Note that given (7.23) and (7.19), one has ⎤ Q ∞ + AT P ∞ + P ∞ A + P ∞ P ∞ B P ∞ D R 0 ⎦ BT P ∞ =⎣ 0 −γ 2 Ik DT P ∞ ⎡

.

Q¯ ∞

(7.25)

and −1 ∞ −1 T ∞ u ∞ (xe ) = −(Q ∞ B P xe . uu ) (Q ux )x e = −R

.

(7.26)

This is equivalent to the learner’s ARE .

AT P ∞ + P ∞ A + Q ∞ − P ∞ B R −1 B T P ∞ +

1 ∞ P D D T P ∞ = 0. γ2

(7.27)

The expert has a Q-function Kernel matrix ⎤ Q e + AT Pe + Pe A + Pe Pe B Pe D ¯e = ⎣ Re 0 ⎦, B T Pe .Q 0 −γe2 Ik D T Pe ⎡

(7.28)

which gives Δ

u = −(Q u e u e )−1 (Q u e xe )xe = −Re−1 B T Pe xe = −K e xe ,

. e

(7.29)

where $K_e = R_e^{-1} B^T P_e$. This implies that the converged Q-function yields the actual expert control gain. Referring to Definition 7.1, we conclude that a penalty weight $Q^\infty$ equivalent to the expert's has been found. Therefore, as $j \to \infty$, the learner's optimal control input $u^j$ and state $x$ imitate the expert's $u_e$ and $x_e$ when $v = v_e$. $\blacksquare$

Remark 7.2 Algorithm 7.1 can find a weight equivalent to the expert's state-penalty weight, which allows the learner to solve the problem. However, this equivalent weight may not be unique. Note that uniqueness here means that the learned penalty weight $Q^\infty$ and the selected $R$ and $\gamma$ are exactly the same as the expert's $Q_e$, $R_e$, and $\gamma_e$, respectively. The non-unique solutions of inverse RL zero-sum games are characterized as follows.


Theorem 7.2 (Non-unique Solutions of Algorithm 7.1) Consider the solution (7.24a), (7.24b) and the ARE (7.27) obtained by Algorithm 7.1. The control gain $K^\infty$ is equal to the expert's $K_e$, while $Q^\infty$ satisfies
$$Q^\infty - Q_e = A^T (P_e - P^\infty) + (P_e - P^\infty) A + \frac{1}{\gamma_e^2} P_e D D^T P_e - \frac{1}{\gamma^2} P^\infty D D^T P^\infty - P_e B R_e^{-1} (R - R_e) R_e^{-1} B^T P_e, \qquad (7.30)$$
and $P^\infty$ satisfies
$$B^T P^\infty = R R_e^{-1} B^T P_e, \qquad (7.31)$$

where $Q^\infty$ is an equivalent weight but is not unique.

Proof As shown in Theorem 7.1, the learner obtains the converged solution
$$-R^{-1} B^T P^\infty x_e = -R_e^{-1} B^T P_e x_e, \qquad (7.32)$$

which yields (7.31). Then, subtracting (7.27) from the expert's ARE (7.4) yields (7.30). Clearly, (7.31) and (7.32) show the relations between the learned parameters and the expert's parameters. For (7.31), suppose $B^T X = R R_e^{-1} B^T P_e$ with unknown variable $X$. There are infinitely many solutions for $X$ if $\mathrm{rank}(B) \neq n$. That is, Algorithm 7.1 can find infinitely many solutions $P^\infty$ satisfying (7.31) such that the actual expert control gain is obtained. It is also seen from (7.30) that one cannot guarantee $R = R_e$ or $\gamma = \gamma_e$. Thus, the learned solution can differ from the expert's, and it is not unique. The solution is unique if and only if $\mathrm{rank}(B) = n$, $R = R_e$, and $\gamma = \gamma_e$. Note, however, that all solutions satisfying (7.30) and (7.31) allow the learner to mimic the expert's trajectories, whether or not they are unique. $\blacksquare$

Remark 7.3 Algorithm 7.1 is an online learning algorithm in the sense that it updates its parameters using online data. The stability of the learner under its online learning policy should therefore be guaranteed. It is analyzed as follows.

Theorem 7.3 (Stability of the Learner under Algorithm 7.1) The learner system (7.6), under the control solution at each iteration of Algorithm 7.1, is asymptotically stable.

Proof First, we analyze the stability of the learner system in the inner-loop iterations of Algorithm 7.1. Based on Vamvoudakis (2017), (7.19) is equivalent to

$$V^{ji}(t) - V^{ji}(t+T) = \int_t^{t+T} \left( x^T Q^j x + (u^{ji})^T R u^{ji} - \gamma^2 (v^{ji})^T v^{ji} \right) d\tau. \qquad (7.33)$$

Note that the learner system $\dot{x} = A x + B u^{ji} + D v^{ji}$ is dissipative. It follows from van der Schaft (1992), Abu-Khalaf et al. (2008) that if $Q^j \geq 0$, $R > 0$, $\gamma > 0$, and $(A, B)$ is controllable, then there exist $F^j(x) \geq 0$ for $j = 0, 1, \ldots$, such that
$$(\nabla F^j(x))^T (A x + B u^{ji}) + x^T Q^j x + (u^{ji})^T R u^{ji} + \frac{1}{\gamma^2} (\nabla F^j(x))^T D D^T \nabla F^j(x) \triangleq M^j(x) \leq 0, \qquad (7.34)$$

where, given $j$ and for all $i$, $F^j(x) \geq V^{ji}(x)$ holds. Also, one can rewrite
$$(\nabla V^{j(i+1)}(x))^T \Big( A x + B u^{ji} + \frac{1}{\gamma^2} D D^T \nabla V^{ji}(x) \Big) = -x^T Q^j x - (u^{ji})^T R u^{ji} + \frac{1}{\gamma^2} (\nabla V^j(x))^T D D^T \nabla V^j(x). \qquad (7.35)$$

With (7.34) and (7.35), one has
$$(\nabla F^j(x) - \nabla V^{j(i+1)}(x))^T (A x + B u^{ji} + D v^{ji}) = M^j(x) - \frac{1}{\gamma^2} (\nabla F^j - \nabla V^{j(i+1)})^T D D^T (\nabla F^j - \nabla V^{j(i+1)}) \leq 0, \qquad (7.36)$$

and $F^j(x) - V^{j(i+1)}(x) \geq 0$. Moreover, $F^j(x) - V^{j(i+1)}(x) = 0$ and $\dot{F}^j(x) - \dot{V}^{j(i+1)}(x) = 0$ hold only when $x = 0$. Thus, one can select $F^j(x) - V^{j(i+1)}(x) \geq 0$ as a Lyapunov function for the system $\dot{x} = A x + B u^{ji} + D v^{ji}$, thereby proving the asymptotic stability of the inner-loop iteration. At outer-loop iteration $j$, we only need $Q^j \geq 0$ for all $j = 0, 1, \ldots$, which has been proved in Theorem 7.1. Therefore, the learner system (7.6) is asymptotically stable under Algorithm 7.1. $\blacksquare$

The next corollary analyzes inverse optimality.

Corollary 7.1 Using Algorithm 7.1, the learner system (7.6) achieves inverse optimality under the converged solution.

Proof By using Algorithm 7.1, the learner obtains the converged solution $(x, u^j) \to (x, u^\infty) = (x_e, u_e)$ as $j \to \infty$. The corresponding penalty weight $Q^\infty$ in Algorithm 7.1 satisfies
$$\int_t^{t+T} x^T Q^\infty x\, d\tau = \int_t^{t+T} \left( -(u^\infty)^T R u^\infty + \gamma^2 (v^\infty)^T v^\infty \right) d\tau + X(t)^T \bar{Q}^\infty X(t) - X(t+T)^T \bar{Q}^\infty X(t+T), \qquad (7.37)$$

where $\bar{Q}^\infty$ is given in (7.25). One can rewrite the Q-function form (7.37) in terms of the value function and take the derivative (Vamvoudakis 2017), such that

$$x^T Q^\infty x = -(u^\infty)^T R u^\infty + \gamma^2 (v^\infty)^T v^\infty - 2 x^T P^\infty (A x + B u^\infty + D v^\infty), \qquad (7.38)$$
where
$$u^\infty = -R^{-1} B^T P^\infty x, \qquad (7.39a)$$
$$v^\infty = \frac{1}{\gamma^2} D^T P^\infty x_e. \qquad (7.39b)$$

One writes (7.38) as the Hamilton–Jacobi–Bellman equation
$$H(x, u^\infty, v^\infty) = x^T Q^\infty x + (u^\infty)^T R u^\infty - \gamma^2 (v^\infty)^T v^\infty + \nabla (V^\infty)^T(x) (A x + B u^\infty + D v^\infty) = 0. \qquad (7.40)$$
Based on Lewis et al. (2012), one then has
$$H(x(0), u^\infty, v) \leq H(x(0), u^\infty, v^\infty) \leq H(x(0), u, v^\infty), \qquad (7.41)$$

where $x = x_e$ and $u^\infty = u_e$. It is seen that the learner system (7.6) reaches the Nash equilibrium and inverse optimality (Haddad and Chellaboina 2011). $\blacksquare$

7.2.3 Implementation of Inverse Q-Learning Algorithm

We now provide an implementation method for Algorithm 7.1. The method uses NN-based functional approximators: a critic approximator and a state-penalty weight approximator.

Critic Approximator

Rewrite the Q-function in (7.19) in NN form as
$$X^T \bar{Q}^{ji} X = \mathrm{vem}(\bar{Q}^{ji})^T \mathrm{vev}(X) = (W_c^{ji})^T \mathrm{vev}(X), \qquad (7.42)$$
where $\mathrm{vev}(X) \in \mathbb{R}^{(m+n+k)(m+n+k+1)/2}$ and $W_c^{ji} = \mathrm{vem}(\bar{Q}^{ji}) = [\bar{Q}^{ji}_{11} : \bar{Q}^{ji}_{1,m+n+k},\ \bar{Q}^{ji}_{22} : \bar{Q}^{ji}_{2,m+n+k},\ \ldots,\ \bar{Q}^{ji}_{m+n+k,m+n+k}]^T \in \mathbb{R}^{(m+n+k)(m+n+k+1)/2}$ denotes the corresponding critic weight vector. This is the critic approximator. Substituting it into (7.19) gives
$$(W_c^{ji})^T \left[ \mathrm{vev}(X(t)) - \mathrm{vev}(X(t+T)) \right] = \int_t^{t+T} \left( x^T Q^j x + (u^{ji})^T R u^{ji} - \gamma^2 (v^{ji})^T v^{ji} \right) d\tau, \qquad (7.43)$$

which is a typical approximation problem. It can be uniquely solved by, for example, the batch least-squares method if probing noise is injected into the control input and enough input–output data are collected. The probing noise can be white noise, which is used to ensure a persistent-excitation condition. The number of collected data groups should be no less than $(m+n+k)(m+n+k+1)/2$. After solving for $W_c^{ji}$ and $\bar{Q}^{ji}$, the optimal control input $u^{j(i+1)}$ and the worst-case disturbance input $v^{j(i+1)}$ can be obtained by using (7.20a) and (7.20b), respectively.

State-Penalty Weight Approximator

Rewrite the state penalty in (7.21) using an NN as
$$\int_t^{t+T} x_e^T Q^{j+1} x_e\, d\tau = \mathrm{vem}(Q^{j+1})^T \int_t^{t+T} \mathrm{vev}(x_e)\, d\tau = (W_q^{j+1})^T \int_t^{t+T} \mathrm{vev}(x_e)\, d\tau, \qquad (7.44)$$
where
$$W_q^{j+1} = \mathrm{vem}(Q^{j+1}) = [Q^{j+1}_{11}, \ldots, Q^{j+1}_{1n}, Q^{j+1}_{22}, \ldots, Q^{j+1}_{2n}, \ldots, Q^{j+1}_{nn}]^T \in \mathbb{R}^{\frac{n(n+1)}{2}},$$
such that (7.21) is rewritten as
$$(W_q^{j+1})^T \int_t^{t+T} \mathrm{vev}(x_e)\, d\tau = \int_t^{t+T} (u_e - u^j(x_e))^T R\, (u_e - u^j(x_e))\, d\tau + \int_t^{t+T} x_e^T Q^j x_e\, d\tau. \qquad (7.45)$$

This is also a typical approximation problem, and its solution follows the same principle as the solution for $W_c^{ji}$ in (7.43). Please refer to the algorithm implementation in Sect. 6.3.3 for the computational operations with data.
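The operators $\mathrm{vev}(\cdot)$ and $\mathrm{vem}(\cdot)$ and the two batch least-squares problems (7.43) and (7.45) can be sketched as follows. This is an added illustration: the placement of the factor 2 on off-diagonal terms may differ from the book's convention, and the data arrays are assumed to have been logged as described in the text.

```python
import numpy as np

def vev(z):
    """Quadratic regressor of a vector z: upper-triangular products of z z',
    off-diagonal terms doubled so that vem(M) . vev(z) = z' M z."""
    n = len(z)
    return np.array([z[i] * z[j] * (1.0 if i == j else 2.0)
                     for i in range(n) for j in range(i, n)])

def vem(M):
    """Half-vectorization of a symmetric matrix M, matching vev above."""
    n = M.shape[0]
    return np.array([M[i, j] for i in range(n) for j in range(i, n)])

def unvem(w, n):
    """Rebuild the symmetric n x n matrix from its half-vectorization."""
    M = np.zeros((n, n))
    idx = 0
    for i in range(n):
        for j in range(i, n):
            M[i, j] = M[j, i] = w[idx]
            idx += 1
    return M

def critic_ls(X_t, X_tT, rho):
    """Batch least squares for (7.43): X_t[k], X_tT[k] are the stacked [x;u;v] at
    t_k and t_k + T; rho[k] is the integral of the utility over [t_k, t_k + T]."""
    Phi = np.array([vev(a) - vev(b) for a, b in zip(X_t, X_tT)])
    Wc, *_ = np.linalg.lstsq(Phi, np.asarray(rho), rcond=None)
    return Wc                      # = vem(Q_bar^{ji})

def penalty_ls(xe_int, rhs, n):
    """Batch least squares for (7.45): xe_int[k] is the integral of vev(x_e) over
    the k-th interval and rhs[k] is the corresponding right-hand side of (7.45)."""
    Wq, *_ = np.linalg.lstsq(np.asarray(xe_int), np.asarray(rhs), rcond=None)
    return unvem(Wq, n)            # Q^{j+1}
```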

7.2.4 Simulation Examples

We shall verify the effectiveness of the proposed inverse Q-learning Algorithm 7.1 by conducting a simulation experiment. The expert is considered to be the following continuous-time linear system
$$\dot{x}_e = \begin{bmatrix} -3 & -2 \\ 2 & -3 \end{bmatrix} x_e + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_e + \begin{bmatrix} 1 \\ 0 \end{bmatrix} v_e, \qquad (7.46)$$

which is associated with the cost function (7.2) with the given weights $Q_e = \mathrm{diag}\{8, 8\}$, $R_e = 2$, and $\gamma_e = 5$. Select $v_e = 0.01 \sin(x_e)$. The expert's control feedback gain $K_e$ and the weight $P_e$ are
$$K_e = R_e^{-1} B^T P_e = \begin{bmatrix} -0.0149 & 0.6138 \end{bmatrix}, \qquad P_e = \begin{bmatrix} 1.3251 & -0.0297 \\ -0.0297 & 1.2276 \end{bmatrix}. \qquad (7.47)$$
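As a quick consistency check of (7.47) (an added illustration, not part of the original text), the reported $P_e$ reproduces the reported gain through $K_e = R_e^{-1} B^T P_e$:

```python
import numpy as np

B  = np.array([[0.0], [1.0]])
Re = np.array([[2.0]])
Pe = np.array([[1.3251, -0.0297],
               [-0.0297, 1.2276]])
Ke = np.linalg.solve(Re, B.T @ Pe)   # R_e^{-1} B' P_e
print(Ke)                            # approximately [[-0.0149, 0.6138]]
```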


The value of the expert's Q-function kernel matrix $\bar{Q}_e$ in (7.28) is
$$\bar{Q}_e = \begin{bmatrix} 1.2553 & -0.0464 & -0.0297 & 1.3251 \\ -0.0464 & 1.9810 & 1.2276 & -0.0297 \\ -0.0297 & 1.2276 & 2 & 0 \\ 1.3251 & -0.0297 & 0 & -25 \end{bmatrix}.$$

The learner shares the same dynamics as the expert, i.e.,
$$\dot{x} = \begin{bmatrix} -3 & -2 \\ 2 & -3 \end{bmatrix} x + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u + \begin{bmatrix} 1 \\ 0 \end{bmatrix} v. \qquad (7.48)$$

The initial states of the learner and the expert are $x_e(0) = x(0) = [3,\ -1]$. The initial state-penalty weight $Q^0$ and the selected $R$ and $\gamma$ are $Q^0 = \mathrm{diag}\{1, 1\}$, $R = 1$, and $\gamma = 6$, respectively. The integral interval is $T = 0.0025$. Select $\epsilon_Q = 0.004$ and $v = 0.01 \sin(x)$.

Given the optimal control input $u_e = -K_e x_e$ with $K_e$ in (7.47), the learner (7.48) then applies the proposed inverse Q-learning Algorithm 7.1. The simulation results are presented in Figs. 7.1 and 7.2. We see that the learner obtains the converged $Q^\infty$, $\bar{Q}^\infty$, and $K^\infty$. Their converged values are
$$\bar{Q}^\infty = \begin{bmatrix} 0.2283 & -0.0141 & -0.0088 & 0.2304 \\ -0.0141 & 0.9852 & 0.6114 & -0.0088 \\ -0.0088 & 0.6114 & 1 & 0 \\ 0.2304 & -0.0088 & 0 & -25 \end{bmatrix},$$
$$Q^\infty = \begin{bmatrix} 1.4152 & -0.8201 \\ -0.8201 & 4.0070 \end{bmatrix}, \qquad K^\infty = \begin{bmatrix} 0.0178 & -0.6114 \end{bmatrix}.$$

Fig. 7.1 Convergence of $Q^j$, $\bar{Q}^j$, and $K^j$ using Algorithm 7.1


Fig. 7.2 Trajectories of the expert and learner using Algorithm 7.1

(Top panel: state trajectories; bottom panel: control inputs; horizontal axes: time in seconds.)

It is seen that $\bar{Q}^\infty$ is not equal to $\bar{Q}_e$ because the state-penalty weight $Q$ converges to a $Q^\infty$ that is equivalent, but not identical, to $Q_e$. However, the learner learns the same control feedback gain, i.e., $K^\infty$ is close to $K_e$. As a result, as shown in Fig. 7.2, the learner mimics the expert's states and control inputs within a short time period.

7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games

In this section, we extend the results presented in Sect. 7.2 to zero-sum games involving nonlinear systems. Specifically, we consider the scenario where both the expert and learner systems experience adversarial inputs, such as uncontrollable external noise, and exhibit affine dynamics. To tackle this problem, we introduce a model-free off-policy inverse RL algorithm that reconstructs the nonlinear expert's cost function without relying on any knowledge of the system dynamics. By leveraging this algorithm, we can effectively learn and replicate the expert's behavior in the face of adversarial inputs.

7.3.1 Problem Formulation

Expert System

Consider an expert with the nonlinear affine dynamics
$$\dot{x}_e = f(x_e) + g(x_e) u_e + h(x_e) v_e, \qquad (7.49)$$

where $x_e \in \mathbb{R}^n$, $u_e \in \mathbb{R}^m$, and $v_e \in \mathbb{R}^k$ denote the expert's state, control input, and adversarial input, respectively; $f \in \mathbb{R}^n$, $g \in \mathbb{R}^{n \times m}$, and $h \in \mathbb{R}^{n \times k}$ denote the drift function, the control input function, and the adversarial input function, respectively.

Assumption 7.5 The functions $f$, $g$, and $h$ are Lipschitz, $f(0) = 0$, and $x = 0$ is an equilibrium state. The system (7.49) is controllable on a compact set $\Omega$ containing the origin.

.

Assumption 7.6 The control input $u_e$ is optimal at the corresponding $x_e$.

The expert's control input $u_e$ optimizes the following performance cost function and simultaneously attenuates the effects of the adversary:

Ve (xe , u e , ve ) = t



(

) Q e (xe ) + u Te Re u e − veT Se ve dτ,

(7.50a)

where $Q_e(x_e) = q^T(x_e) Q_e q(x_e) \in \mathbb{R}$ is the expert's state penalty with state-penalty weight $Q_e = Q_e^T \in \mathbb{R}^{n \times n} \geq 0$ and a function $q(x_e) = [x_{e1}^s\ x_{e2}^s\ \ldots\ x_{en}^s] \in \mathbb{R}^n$ with power $s$; $R_e = R_e^T \in \mathbb{R}^{m \times m} > 0$ and $S_e = S_e^T \in \mathbb{R}^{k \times k} > 0$ are the weights of the inputs $u_e$ and $v_e$, respectively. The expert's control input $u_e$ and the adversarial input $v_e$ are non-cooperative. The input $u_e$ is optimal and minimizes $V_e$, while the worst-case disturbance, denoted by $v_e^*$, maximizes it. They are given by
$$u_e = -\frac{1}{2} R_e^{-1} g^T(x_e) \nabla V_e^*(x_e), \qquad (7.51a)$$
$$v_e^* = \frac{1}{2} S_e^{-1} h^T(x_e) \nabla V_e^*(x_e), \qquad (7.51b)$$

(7.51a) (7.51b)

where $V_e^*$ satisfies the expert's Hamilton–Jacobi–Isaacs (HJI) equation
$$0 = Q_e(x_e) - \frac{1}{4} \nabla V_e^{*T}(x_e) g(x_e) R_e^{-1} g^T(x_e) \nabla V_e^*(x_e) + \frac{1}{4} \nabla V_e^{*T}(x_e) h(x_e) S_e^{-1} h^T(x_e) \nabla V_e^*(x_e) + \nabla V_e^{*T}(x_e) f(x_e). \qquad (7.52)$$

.

(7.52)

With the saddle point (.u e , .ve∗ ), one has .

Ve∗ (xe ) = Ve∗ (u e , ve∗ ) = min max Ve (u e , ve ), ue

ve

(7.53)

where .Ve∗ is the optimal cost satisfying the Nash equilibrium condition (Lewis et al. ∗ − 2012), i.e., .Ve (xe (t0 ), u e , ve ) ≤ Ve (xe (t0 ), u e , ve∗ ) ≤ Ve (xe (t0 ), u − e , ve ) where .u e is − the control input that .u e /= u e . The control input .u e in (7.51a) can stabilize the expert system and reject the effects of adversarial disturbances. Learner System Consider the learner with nonlinear time-invariant affine dynamics

198

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games .

x˙ = f (x) + g(x)u + h(x)v,

(7.54)

where .x ∈ Rn denotes the learner state, .u ∈ Rm denotes its control input and .v ∈ Rk denotes its adversarial input. Note that the expert and learner have identical dynamics. The zero-sum game solution of the nonlinear learner system is reviewed here. Given an integrated cost function ∫ .

V (x(t0 ), u(x(t0 )), v, Q) =



(Q(x) + u T Ru − v T Sv)dτ,

(7.55)

t0

where. Q(x) = q T (x)Qq(x) ∈ R is learner’s state penalty with a state-penalty weight T n×n .Q = Q ∈ R ≥ 0 and a function .q(x) ∈ Rn of the state .x; . R = R T ∈ Rm×m > 0 T k×k and . S = S ∈ R > 0 are arbitrarily selected weights. The function .q(·) in (7.50a) and (7.55) have the same mapping. The control input .u expects to minimize it while the adversarial input .v expects to maximize it. To find the optimal solution, define the Hamiltonian function as .

H (V, u, v) = Q(x) + u T Ru − v T Sv + ∇V T ( f (x) + g(x)u + h(x)v).

(7.56)

By . ∂∂uH = 0 and . ∂∂vH = 0, the learner’s optimal control input .u ∗ and the worst .v ∗ are given by 1 u ∗ = − R −1 g T (x)∇V ∗ (x), 2 1 −1 T ∗ S h (x)∇V ∗ (x), .v = 2

.

(7.57a) (7.57b)

which satisfy the learner’s HJI equation 1 0 = Q(x) − ∇V ∗T (x)g(x)R −1 g T (x)∇V ∗ (x) 4 1 + ∇V ∗T (x)h(x)S −1 h T (x)∇V ∗ (x) + ∇V ∗T (x) f (x), 4

.

(7.58)

and equivalently the Bellman equation .

Q(x) + u T Ru − v T SV + ∇V T ( f (x) + g(x)u + h(x)v) = 0.

(7.59)

Therein, the pair (.u ∗ ,.v ∗ ) is called saddle point, and.V ∗ is the optimal cost and reaches the Nash equilibrium, i.e., .V ∗ (x) = V (u ∗ , v ∗ ) = minu maxv V (u, v). With the above nonlinear two-player zero-sum game expert and learner, we formulate an inverse RL problem. Assumption 7.7 The learner does not know the expert’s cost function weights, i.e., Q e , . Re , and . Se in (7.50a), but it knows the expert’s trajectory data of .u e .

.

7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

199

Definition 7.2 Select any . R ∈ Rn×n > 0 and . S ∈ Rk×k > 0. If . Q in learner’s HJI equation (7.58) yields the .u ∗ in (7.57a) that equals to .u e in (7.51a) and .x = xe holds with the same initial states .x(t0 ) = xe (t0 ), then we call that . Q is an equivalent weight to . Q e . Assumption 7.8 For a given . R and an equivalent weight . Q, . S should satisfy λ (S) ≥ s ∗ , where .s ∗ is a bound scalar and .λmin denotes the minimum eigenvalue.

. min

Remark 7.4 Similar to Assumption 7.4 and according to Ba¸sar and Olsder (1998), van der Schaft (1992), Assumption 7.8 ensures the existence of a positive semidefinite solution in HJI (7.58) for .(Q, R) in Definition 7.2. Problem 7.2 Under Assumptions 7.5–7.8, the learner selects any . R ∈ Rn×n > 0 and . S ∈ Rk×k > 0 and learns an equivalent weight to . Q e , such that (1) it imitates the expert’s trajectory, i.e., (.x, .u ∗ ) = (.xe , .u e ) when .v = ve , and (2) stabilizes itself (7.54).

7.3.2 Inverse Reinforcement Learning Policy Iteration In order to solve the Problem 7.2, we provide the following model-based inverse RL Algorithm 7.2 policy iteration. It is an extension of inverse RL Algorithm 6.3 in Sect. 7.2 to two-player zero-sum games. Specifically, given an estimate . Q j ≥ 0 (. j is the iteration step) and fixed . R > 0 and . S > 0, the inner-loop iterations of Algorithm 7.2 are known as policy iteration process that solves the Nash equilibrium solutions j j j j .∇V (x), u , v . Then, the current estimate. Q is revised in the outer-loop iteration of Algorithm 7.2 using the demonstrated.u e and the converged solutions.∇V j (x), u j , v j from the inner-loop iterations. Repeat two loops until . Q j converges to an equivalent weight to . Q e . As a result, the learner obtains the behavior (.xe , .u e ) when .v = ve . The implementation of Algorithm 7.2 can be obtained by extending the implementation in Sect. 6.3.2 to the two-player zero-sum games. The next results provide the convergence, non-unique of equivalent weights, and the stability of Algorithm 7.2. Theorem 7.4 (Convergence of Algorithm 7.2) Algorithm 7.2 is convergent, and then the learner has .(x, u j ) = (xe , u e ) for . j → ∞ with actual .v = ve in (7.54) and (7.49), where .u j and .u e are given by (7.61a) and (7.51a), respectively. The penalty weight . Q j , . j = 0, 1, . . . converges to the weight . Q ∞ which is equivalent to . Q e . Proof The proof can be referred to Lian et al. (2021b) and is similar to that in Theorem 7.1 by extending it to nonlinear two-player zero-sum games. .∎ Theorem 7.5 (Non-uniqueness of . Q ∞ ) Let . Q j and .∇V j in Algorithm 7.2 converge to . Q ∞ and .∇V ∞ , respectively. Then, . Q ∞ satisfies

200

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

Algorithm 7.2 Model-based inverse RL PI for nonlinear two-player zero-sum games 1. Initialization: Select Q 0 ≥ 0, R > 0, S > 0, initial stabilizing u 00 , v 00 = 0, and small thresholds ε1 and ε2 . Set j = 0. 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: Given j, set i = 0, and use stabilizing u j0 . 4. Policy evaluation: Compute V ji by 0 = q T (x)Q j q(x) + (u ji )T Ru ji − (v ji )T Sv ji ) ( )T ( f (x) + g(x)u ji + h(x)v ji . + ∇V ji (x) 5.

(7.60)

Policy improvement: Compute u j (i+1) and v j (i+1) by 1 u j (i+1) = − R −1 g T (x)∇V ji (x), 2 1 v j (i+1) = S −1 h T (x)∇V ji (x). 2

(7.61a) (7.61b)

Stop if ||∇V ji − ∇V j (i−1) || ≤ ε1 , then set ∇V j (x) = ∇V ji (x), u j = u ji and v j = v ji . Otherwise, set i ← i + 1 and go to Step 4. 7. State-penalty weight improvement: Update Q j+1 using the expert’s u e 6.

q T (x)Q j+1 q(x) = u Te Ru e − 2u Te Ru j + (v j )T Sv j ( ) − (∇V j (x))T f (x) + g(x)u j + h(x)v j .

(7.62)

8. Stop if ||Q j+1 − Q j || ≤ ε2 . Otherwise, set u ( j+1)0 = u j and j ← j + 1, then go to Step 3.

q T (xe )(Q ∞ − Q e )q(xe ) 1 = (∇V ∞ )T (xe )g(xe )R −1 (R − Re )R −1 g T (xe )∇V ∞ (xe ) 4 1 + (∇Ve∗ )T (xe )h(xe )Se−1 h T (xe )∇Ve∗ (xe ) 4 1 − (∇V ∞ )T (xe )h(xe )S −1 h T (xe )∇V ∞ (xe ) 4 + (∇Ve∗ (xe ) − ∇V ∞ (xe ))T f (xe ),

.

(7.63)

where .∇Ve∗ is uniquely solved by (7.52), and .∇V ∞ (xe ) satisfies .

g T (xe )∇V ∞ (xe ) = R Re−1 g T (xe )∇Ve∗ (xe ),

(7.64)

which implies that . Q ∞ may not be unique. Proof If Algorithm 7.2 converges, let. Q j and.∇V j in Algorithm 7.2 converge to. Q ∞ and .∇V ∞ , respectively. Then, one has the converged learner’s HJI equation (7.58) as

7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

1 0 = q T (xe )Q ∞ q(xe ) − (∇V ∞ )T (xe )g(xe )R −1 g T (xe )∇V ∞ (xe ) 4 1 ∞ T + (∇V ) (xe )h(xe )S −1 h T (xe )∇V ∞ (xe ) + (∇V ∞ )T (xe ) f (xe ), 4

201

.

(7.65)

and .u ∞ = u e , i.e., .



1 1 −1 T R g (xe )∇V ∞ (xe ) = − Re−1 g T (xe )∇Ve∗ (xe ), 2 2

(7.66)

which yields (7.64). Then, subtracting (7.65) from (7.52) yields (7.63). In (7.63), if one lets.g T (xe )X = R Re−1 g T (xe )∇Ve∗ (xe ), there will be infinite solutions for . X unless rank.(g(xe )) = n. This means that one may find many solutions of .∇V ∞ (xe ) that make (7.63) hold. When rank.(g(xe )) = n is not guaranteed, one may obtain non-unique .∇V ∞ (xe ) that is different from .∇Ve∗ (xe ). In addition, the selected . R and . S can be different from ∞ . Re and . Se , respectively. From (7.64), the difference . Q − Q e may be nonzero, i.e., ∞ .Q may be different from . Q e . There may be an infinite number of solutions for . Q ∞ . All possible and non-unique solutions for . Q ∞ satisfy (7.64) with .∇V ∞ satisfying .∎ (7.63). Theorem 7.6 (Stability of the learner using Algorithm 7.2) Each iteration of Algorithm 7.2 ensures that the learner (7.54) is asymptotically stable when the adversarial input .v = 0. Proof To prove the stability of (7.54) with .v = 0, we prove that .V˙ j (x) ≤ 0 holds for any . j ≥ 0 with .v = 0. It follows from (7.58) that at outer-loop iteration . j, one has 1 0 = q T (x)Q j q(x) − (∇V j (x))T g(x)R −1 g T (x)∇V j (x) 4 1 + (∇V j (x))T h(x)S −1 h T (x)∇V j (x) + (∇V j (x))T f (x) 4 = q T (x)Q j−1 q(x) + (u e − u j−1 )T R(u e − u j−1 ) 1 + (∇V j (x))T (g(x)R −1 g T (x) + h(x)S −1 h T (x))∇V j (x) 4 1 + (∇V j )T (x)( f (x) − R −1 g T (x)∇V j (x)) 2 = q T (x)Q j−1 q(x) + (u e − u j−1 )T R(u e − u j−1 ) + (u j )T Ru j + (v j )T Sv j + V˙ j (x).

.

(7.67)

As . Q j ≥ 0 holds for all . j = 0, 1, . . ., it follows from (7.67) that .V˙ j (x) ≤ 0. Note that .V˙ j (x) = 0 if and only if .x = 0. Thus, the learner, with .v = 0, is asymptotically .∎ stable during iterations of inverse RL Algorithm 7.2.

202

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

7.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning Inverse RL Algorithm 7.2 requires the knowledge of system dynamics . f , .g, and h. In this section, we provide a model-free off-policy integral inverse RL algorithm without using . f , .g, and .h but only the trajectories of the learner and expert. Then, this model-free inverse RL algorithm will be implemented using neural networks (NNs). First, we use the off-policy integral RL technique in inner-loop iteration .i of Algorithm 7.2, which allows for the model-free forms of (7.60)–(7.61b). Rewrite (7.54) as

.

.

x˙ = f (x) + g(x)u ji + h(x)v ji + g(x)(u − u ji ) + h(x)(v − v ji ),

(7.68)

where .u ji ∈ Rm and .v ji ∈ Rk are inputs updated at inner-loop iteration .i given outerloop iteration . j. Differentiating .V ji along with (7.68) and (7.60)–(7.61b) yields .

V˙ ji = (∇V ji )T ( f + gu ji + hv ji ) + (∇V ji )T g(x)(u − u ji ) + (∇V ji )T h(x)(v − v ji ) = −q T (x)Q j q(x) − (u ji )T Ru ji + (v ji )T Sv ji − 2(u j(i+1) )T R(u − u ji ) + 2(v j(i+1) )T S(v − v ji ).

(7.69)

Integrating both sides of (7.69) from.t to.t + T yields the off-policy Bellman equation .

V ji (x(t + T )) − V ji (x(t)) ∫ t+T ( ( )T ( ) ( )T ( )) 2 u j (i+1) R u − u ji − v j (i+1) S v − v ji dτ + t



t+T

=

(

) − q T (x)Q j q(x) − (u ji )T Ru ji + (v ji )T Sv ji dτ.

(7.70)

t

It solves the converged .V j , .u j , and .v j given . Q j . Next, the integral RL is used in outer-loop iteration . j of Algorithm 7.2 to find a model-free equation equivalent to (7.62) for the update of . Q j+1 . We integrate both sides of (7.62) from .t to .t + T to yield ∫

t+T

.

t



= t

q T (x)Q j+1 q(x) dτ

t+T

( ) u Te Ru e − 2u Te Ru j + (v j )T Sv j dτ − V j (x(t + T )) + V j (x(t)). (7.71)

The model-free off-policy inverse RL algorithm is presented below.

7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

203

Algorithm 7.3 Model-free off-policy inverse RL algorithm 1. Initialization: Select Q 0 ≥ 0, R > 0, S > 0, initial stabilizing u 00 , v 00 = 0 and small thresholds ε1 and ε2 . Set j = 0. Apply stabilizing u to the learner (7.68). 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: given j, set i = 0, and use stabilizing u j0 . 4. Off-policy Integral RL for computing V j , u j and v j by (7.70). 5. Stop if ||V ji − V j (i−1) || ≤ ε1 , then set V j = V ji , u j = u ji , and v j = v ji . Otherwise, set i ← i + 1 and go to Step 4. 6. State-penalty weight improvement: Update Q j+1 using the demonstration u e by (7.71). 7. Stop if ||Q j+1 − Q j || ≤ ε2 . Otherwise, set u ( j+1)0 = u j and j ← j + 1, then go to Step 3.

The next result shows the convergence of Algorithms 7.3–7.2. Theorem 7.7 (Convergence of Algorithm 7.3) Algorithm 7.3 converges to Algorithm 7.2 such that the learner achieves a convergence (.x, .u j ) .→ (.xe , .u e ) as . j → ∞. Proof The proof can be obtained by extending the proof of Theorem 6.8 to twoplayer zero-sum games. .∎ Algorithm Implementation via Neural Networks (NNs) Now we introduce an NN-based model-free off-policy inverse RL algorithm for implementing Algorithm 7.3 using online data without knowing any knowledge of system dynamics. Three NN-based approximators are designed for the learner’s value function .V ji , the updated control input .u j (i+1) and the updated adversarial input .v j (i+1) in the Bellman equation (7.70) in Algorithm 7.3. Three approximators are .

Vˆ ji = (C ji )T ϕ(x),

(7.72a)

.



j (i+1)

ji T

= (W ) φ(x),

(7.72b)

.



j (i+1)

= (H ) ρ(x),

(7.72c)

ji T

[ ]T [ ]T where .ϕ(x) = ϕ1 (x), ϕ2 (x), . . . , ϕ N1 (x) , .φ(x) = φ1 (x), φ2 (x), . . . , φ N2 (x) , [ ]T and .ρ(x) = ρ1 (x), ρ2 (x), . . . , ρ N3 (x) are activation vector functions of three NNs, respectively. Moreover, .C ji ∈ R N1 , .W ji ∈ Rm×N2 , and . H ji ∈ Rk×N3 . . N1 , . N2 , and . N3 are hidden-layer neuron numbers of three NNs, respectively. [ [ ]T ]T Δ Δ ji ji ji ji ji ji Define .u − u¯ ji = u˜ 1 , u˜ 2 , . . . , u˜ m and .v − v¯ ji = v˜1 , v˜2 , . . . , v˜k . Assume that weights . R and . S are given in the form of . R = diag{r1 , r2 , . . . , rm }T and . S = diag{s1 , s2 , . . . , sk }T , respectively. Then, together with (7.72a)–(7.72c), (7.70) is expressed as

204

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

(C ji )T [ϕ(x(t + T ) − ϕ(x(t))] ∫ t+T ∫ m k ∑ ∑ ji T ji ji T +2 rh (Wh ) φ(x)u˜ h dτ − 2 s p (H p )

.

t

h=1



t+T

t

p=1

ρ(x)v˜ pji dτ

t+T(

) q T (x) Qˆ j q(x) + (u¯ ji )T R u¯ ji − (v¯ ji )T S v¯ ji dτ,

= e¯ ji (t) −

(7.73)

t

where .e¯ ji (t) is the Bellman approximation error and is forced to be zero in some ji ji ji average sense ( to find .C , .W , and . H . One uses )BLS method to solve the unique ji ji ji ji ji solution set. C , W1 , . . . , Wm , H1 , . . . , Hk given. Q j . This solution set provides information for the update of . Qˆ j+1 in (7.71). Thus, one defines

∑ ji

.

⎡ ∫ t+T ⎤ ⎡ 1 1 σ ji dτ πx πu t ∫ ⎢ t+2T ji ⎥ 2 2 π ⎢ t+T σ dτ ⎥ i j ⎢ ⎢ x πu ⎢ ⎥ =⎢ . . .. ⎥ , ∏ = ⎢ ⎣ .. .. . ⎣ ⎦ ∫ t+ιT ji πxι πuι t+(ι−1)T σ dτ

⎤ πv1 πv2 ⎥ ⎥ .. ⎥ , . ⎦

(7.74)

πvι

where σ ji = −q T (x) Qˆ j q T (x) − (u¯ ji )T R u¯ ji + (v¯ ji )T S v¯ ji ,

.

πxι = ϕ T (x(t + ιT )) − ϕ T (x(t + ιT − T )), [ ∫ t+ιT ] ∫ t+ιT ji φ T (x)u˜ 1 dτ, . . . , rm φ T (x)u˜ mji dτ , πuι = r1 t+ιT −T t+ιT

[ ∫ ι πd = s1

t+ιT −T

∫ ji ρ T (x)v˜1

dτ, . . . , sk

t+ιT −T t+ιT

t+ιT −T

] ρ

T

ji (x)v˜k

dτ .

(7.75)

The unknown parameters in (7.73) can be uniquely solved by using BLS when ( ∏ ji )T ∏ ji has full rank. It is required to satisfy a persistently exciting (PE) condition which needs .ι groups of data collection. The positive integer .ι is no less than the number of unknown parameters in (7.73), i.e., .ι ≥ N1 + m N2 + k N3 . Then, the unknown weights .C ji , .W ji , and . H ji in (7.73) are uniquely solved by

.

.

[ ]T ji ji ji (C ji )T , (W1 )T , . . . , (Wmji )T , (H1 )T , . . . , (Hk )T = (( ∏ ji )T ∏ ji )−1 ( ∏ ji )T ∑ ji .

(7.76)

( ( ) ) When . C ji , W ji , H ji , converges to . C j , W j , H j , given the expert’s control input .u e and using this convergent solution set, the learner solves . Qˆ j+1 by

7.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …



t+T

.

t



q T (x) Qˆ j+1 q(x)

) u Te Ru e − 2u Te RW j φ + ρ T H j S(H j )T ρ dτ t ( ) − (C j )T ϕ(x(t + T ) − ϕ(x(t)) ,

=

205

t+T

(

(7.77)

where the unique . Qˆ j+1 is computed by constructing . N¯ 1 ≥ (n + 1)n/2 equations. Define ⎤ ⎡ ∫ t+T ⎤ ⎡ j vev(q(x(τ )))T dτ y (T ) t ∫ t+2T ⎥ ⎢ T ⎢ y j (2T ) ⎥ ⎢ t+T vev(q(x(τ ))) dτ ⎥ ⎥ ¯ ⎢ j ⎥, ⎢ .Y = ⎢ (7.78) ⎥, [ = ⎢ .. .. ⎥ ⎦ ⎣ . . ⎦ ⎣ ∫ t+ N¯ 1 T y j ( N¯ 1 T ) vev(q(x(τ )))T dτ ¯ t+( N1 −1)T

where ∫

.

(

) u Te Ru e − 2u Te RW j φ + ρ T H j S(H j )T ρ dτ t+(a−1)T ( ) − (C j )T ϕ(x(t + aT )) − ϕ(x(t + (a − 1)T )) , a ∈ {1, 2, . . . , N¯ 1 }.

y j (aT ) =

t+aT

Then, . Qˆ j+1 is solved by ( )−1 T j vem( Qˆ j+1 ) = [¯ T [¯ [¯ Y .

.

(7.79)

Based on the above derivations, an online NN-based model-free integral inverse RL Algorithm 7.4 is summed up as follows. Algorithm 7.4 Model-free integral inverse RL algorithm via NNs 1. Initialization: select Qˆ 0 ≥ 0, stabilizing u¯ 00 , v 00 = 0, and small thresholds ε1 , ε2 . Apply stabilizing u to (7.68). Set j = 0. 2. Outer-loop iteration j based on IOC 3. Inner-loop iteration i using optimal control: Given j, set i = 0, and use stabilizing u¯ j0 . 4. Off-policy integral RL for computing C ji , W ji and H ji by (7.76). 5. Stop if ||C ji − C j (i−1) || ≤ ε1 , then set (C j , W j , H j ) =(C ji , W ji , H ji ). Otherwise, set i ← i + 1 and go to Step 4. 6. State-penalty weight improvement: Update Qˆ j+1 using the demonstration u e by (7.79). 7. Stop if || Qˆ j+1 − Qˆ j || ≤ ε2 . Otherwise, set u¯ ( j+1)0 = u¯ j and j ← j + 1 and go to Step 3.

Remark 7.5 The convergence of Algorithms 7.4–7.2 can be obtained by proving the convergence of Algorithms 7.4–7.3. Then, it follows from Theorem 7.7 that

206

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

Algorithm 7.3 convergences to Algorithm 7.2. Thus, one concludes that Algorithm 7.4 convergences to Algorithm 7.2.

7.3.4 Simulation Examples Consider that the draft function. f (·), control input function.g(·) and adversarial input function .h(·) for two agents are given by [ .

f (x) =

] [ ] [ ] 0 0 −x1 + x2 , g(x) = , h(x) = . −x13 x2 x1

Note that the expert shares the same dynamics for .xe . Consider the state [ 2 2 ] [ 2 2 ]T xe2 Q xe1 xe2 for the expert and . Q(x) = penalty in the form of . Q e (x) = xe1 [ 2 2 ] [ 2 2 ]T x1 x2 Q x1 x2 for the learner with [

[ ] ] 2 −0.5 1 −0.125 0 .Qe = , Q = −0.5 1 −0.125 0.25 and the control input weights . Re = R = 1. Set . Se = S = 64. Based on the converse Hamilton–Jacobi–Bellman approach in Nevisti´c and Primbs (1996), the expert’s 2 4 . Set .T = 0.003. + 21 xe2 optimal value function is .Ve∗ = xe1 For Algorithm 7.4, we select the activation functions for .V , .u, and .v as .ϕ(x) = ]T [ [ [ 2 2 4 ]T x1 , x2 , x1 , .φ(x) = x1 x2 , x22 , x1 x23 , x13 x2 , and .ρ(x) = x1 x2 , x22 , x1 x23 , ]T x13 x2 , respectively. Figure 7.3 shows that the learner obtains a convergence behavior .(x, u ∗ ) = (xe , u e ) when .v = ve = 0.002cos(t). Figure 7.4 shows the convergence of statepenalty weight . Q j . One obtains the converged .C ∞ , .W ∞ , . H ∞ , and . Q ∞ as .

Q∞ =

[

] 1.7446 −0.5876 , −0.5876 1.0336

[ ]T C ∞ = 0.0002 0.4810 0.9557 , [ ]T W ∞ = 0.0002 −0.4950 0.0001 0 , [ ]T H ∞ = 0.0155 0.0001 0 0.0012 . Note that . Q j in Fig. 7.4 does not converge to . Q e because . Q j converges to an equivalent weight. Q ∞ that is not equal to. Q e as shown in Definition 7.2 and Theorem 7.5. This equivalent weight can define the learner with the same optimal behavior as the expert.

7.4 Online Adaptive Inverse Reinforcement Learning … 4

States

Fig. 7.3 Trajectories of states and control inputs of the nonlinear learner using Algorithm 7.4

207

2 0 -2 -4 0

0.5

1

1.5

2

1.5

2

Time (s)

Control inputs

0 -5 -10 -15 0

0.5

1

Time (s)

Fig. 7.4 Convergence of the nonlinear learner’s state-penalty weight . Q j using Algorithm 7.4

1.4

1.2

1

0.8

0.6

0.4

0.2 0

20

40

60

80

100

7.4 Online Adaptive Inverse Reinforcement Learning for Nonlinear Two-Player Zero-Sum Games For the Problem 7.2 in Sect. 7.3, this section presents an online adaptive inverse RL method using synchronous tuning neural networks (NNs). We consider .λmin (Se ) = γe2 > 0 in (7.50a) for the expert in (7.49) and .λmin (S) = γ 2 > 0 in (7.55) for the learner in (7.54). Note that .γe and .γ are scalars. The next assumption is needed in addition to Assumption 7.5. Assumption 7.9 There exists a compact set .Ω such that in this compact, the learner system (7.54) is stabilizable and .|| f (x)|| ≤ b f ||x|| with a constant .b f . Additionally, input dynamics .g(x) and .k(x) are bounded, i.e., .||g(x)|| ≤ bg and .||h(x)|| ≤ bk .

208

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

Remark 7.6 Similar to Assumption 7.4, .γ should satisfy .γ ≥ γ ∗ , such that the corresponding HJI equation has a positive semi-definite solution.

7.4.1 Integral RL-Based Offline Inverse Reinforcement Learning To find an equivalent equation to the learner’s Bellman equation (7.59) that does not need the drift dynamics . f , applying the integral RL technique (Vrabie et al. 2009) to (7.59), i.e., integrating both sides of it over the time interval .T > 0 gives the integral Bellman equation ∫ .

t+T

V (x(t)) =

(

) Q(x) + u T Ru − γ 2 v T v dτ + V (x(t + T )).

(7.80)

t

The learner finds the same .V (x) with that in the learner’s Bellman equation (7.59) by using (7.80). Together with the solution of.u ∗ in (7.57a) and.v ∗ (7.57b), the inner-loop iteration is formulated. See steps 3–6 in Algorithm 7.5. This is the standard integral RL process for zero-sum games. The process yields the saddle point solution .(u j , v j ) for each outer-loop iteration . j. The outer-loop iteration to correct penalty weight in (7.71) of Algorithm 7.2 is still used. See step 7 in Algorithm 7.5. Algorithm 7.5 is equivalent to Algorithm 7.2, which can be proved by taking the limit operator to (7.81) and (7.83). Therefore, they solve the same solutions, thereby sharing the same convergence, stability, and non-uniqueness properties.

7.4.2 Online Inverse Reinforcement Learning with Synchronous Neural Networks This subsection develops an online learning method utilizing neural networks (NNs) based on Algorithm 7.5 to address the imitation problem by leveraging real-time trajectory data from the learner and control input trajectory data from the expert Lian et al. (2021a). The proposed approach uses four NN-based approximators, namely a critic NN for .V (x) in (7.81), an actor NN for .u in (7.82a), an adversary NN for .v in (7.82b), and a state-penalty NN for . Q(x) in (7.83). This selection of NN-based approximators is inspired by actor–critic RL methods (Bhasin et al. 2010; Konda and Tsitsiklis 1999; Lian et al. 2022b; Sutton and Barto 2018; Vamvoudakis and Lewis 2010). The online inverse RL algorithm uses partial knowledge of the system dynamics.

7.4 Online Adaptive Inverse Reinforcement Learning …

209

Algorithm 7.5 Integral RL-based offline inverse RL algorithm 1. Initialization: select Q 0 ≥ 0, R > 0 and initial stabilizing u 00 and v 00 . Set j, i = 0 and small thresholds μv > 0 and μ Q > 0. 2. Outer-loop iteration j based on inverse optimal control 3. Inner-loop iteration i using optimal control: Given j, set i = 0. 4. Policy evaluation: Compute V ji by ∫ V ji (x(t)) =

t+T

( ) q T (x)Q j q(x) + (u ji )T Ru ji − γ 2 (v ji )T v ji dτ

t

+ V ji (x(t + T )). 5.

(7.81)

Policy improvement: Compute u j (i+1) and v j (i+1) by

1 u j (i+1) = − R −1 g T (x)∇V ji (x), (7.82a) 2 1 T v j (i+1) = h (x)∇V ji (x). (7.82b) 2γ 2 || || 6. Stop if ||V ji − V j (i−1) || ≤ μv , then set V j (x) = V ji (x), u j = u ji and v j = v ji , otherwise, set i ← i + 1 and go to Step 4. 7. State-penalty weight improvement: Update Q j+1 using expert’s control input trajectory u e by ∫

t+T

∫ q T (x)Q j+1 q(x) =

t

t

t+T

( ) u Te Ru e − 2u Te Ru j + γ 2 (v j )T v j dτ

− V j (x(t + T )) + V j (x(t)).

(7.83)

|| || 8. Stop if || Q j+1 − Q j || ≤ μ Q , otherwise set u ( j+1)0 = u j , d ( j+1)0 = v j and j ← j + 1, then go to Step 3.

Critic Neural Network According to the Weierstrass higher-order approximation theorem (Finlayson 2013), there exists a critic NN such that the learner’s value function .V (x) in (7.81) is uniformly approximated as .

V (x) = WcT φc (x) + ∈c (x),

(7.84)

where .Wc ∈ R Nc is the ideal critic NN weight parameter, . Nc is the number of NN neurons, .φc (x) ∈ R Nc is the NN activation function vector, and .∈c (x) is the NN approximation error. The gradient of .V (x) is then expressed as ∇V (x) = ∇φcT (x)Wc + ∇∈c (x).

.

(7.85)

210

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

The activation function is a completely independent basis set such that .V (x) and ∇V (x) are uniformly approximated. As . Nc → ∞, the approximation errors .∈c → 0, .∇∈c → 0. Given the current estimate . Q of . Q e and the estimate of .V (x) in (7.84), the integral RL form of learner’s Bellman equation (7.80) becomes .



t+T

(

.

t

) Δ Q(x) + u T Ru − γ 2 v T v dτ + WcT Δφc (x(t)) = ∈ B (t),

(7.86)

where .Δφc (x(t)) = φc (x(t + T )) − φc (x(t)), and .∈ B is the learner’s Bellman equation approximation error due to the critic NN approximation error. Assumption 7.10 Given the compact set .Ω, the critic NN shows that: 1. The ideal critic NN weight is bounded, i.e., .||Wc || ≤ Wcmax . 2. The critic NN approximation error and its gradient are bounded, i.e., .||∈c || ≤ ∈cmax and .||∇∈c || ≤ ∈cdmax . 3. The critic NN activation function and its gradient are bounded, i.e., .||φc || ≤ φcmax and .||∇φc || ≤ φcdmax . 4. The Bellman equation error is bounded, i.e., .||∈ B || ≤ ∈ Bmax . During online tuning, the ideal critic NN weight vector .Wc is unknown because the current NN weight applied to the system is assumed to be .Wˆ c and the real-time critic NN used by the learner system is .

Vˆ (x) = Wˆ cT φc (x).

(7.87)

Then, the real-time integral RL form of the learner’s Bellman equation is ∫ .

t

t+T

(

) Q(x) + u T Ru − γ 2 v T v dτ + Wˆ cT Δφc (x(t)) = e B (t),

(7.88)

which is rewritten as .

Wˆ cT Δφc (x(t)) + r (t) = e B (t),

(7.89)

∫ t+T ( ) Q(x) + u T Ru − γ 2 v T v dτ . Now the objective is to find a NN where .r (t) = t weight .Wˆ c such that the Bellman equation error .e B (t) is minimized. We do this by minimizing the squared residual error .

EB =

1 T e eB . 2 B

(7.90)

7.4 Online Adaptive Inverse Reinforcement Learning …

211

The normalized gradient descent rule is used to derive the tuning law of the critic NN as .

∂ EB W˙ˆ c = −αc ∂ Wˆ c

Δφc (t) T (Δφc (t)Δφc (t) (∫ t+T (

= −αc ×

t

+ 1)2

) ) Q(x) + u T Ru − γ 2 v T v dτ + Wˆ cT Δφc (x(t)) ,

(7.91)

where the scalar parameter .αc > 0 is the learning rate. Define the estimation error of Δ the critic NN weight as .W˜ c = Wc − Wˆ c . Then, the dynamics of the critic NN weight estimation error is ( ) ∈B ˙ T ˜ ˜ ¯ ¯ (7.92) . Wc = −αc Δφc Δφ c W c + , ΔφcT Δφc + 1 where .Δφ¯ c = Δφc /(ΔφcT Δφc + 1). The boundedness of .W˜ c is guaranteed by the following PE condition. Lemma 7.1 (Vamvoudakis and Lewis 2012) Let .u and .v be admissible bounded input policies. Select the critic NN tuning law as (7.91). If .Δφ¯ c satisfies the PE condition, i.e., if there exist .β1 and .β2 such that over the interval .[t, t + T0 ), one has ∫ t+T0 .β1 I ≤ Δφ¯ c Δφ¯ cT dτ ≤ β2 I , then .W˜ c exponentially converges to a residual set. t Actor Neural Network Now a synchronous actor NN is constructed to uniformly approximate the optimal control .u in (7.82a). The actor NN is defined as u(x) = WuT φu (x) + ∈u (x),

.

(7.93)

where .Wu ∈ R Nu ×m is the unknown ideal weight of the actor NN, .φu (x) ∈ R Nu is the activation function with . Nu neurons, and .∈u (x) is the actor NN approximation error. Assumption 7.11 Given the compact set .Ω, the actor NN shows that: 1. The actor NN approximation error and its gradient are bounded, i.e., .||∈u || ≤ ∈umax and .||∇∈u || ≤ ∈udmax . 2. The actor NN activation function and its gradient are bounded, i.e., .||φu || ≤ φumax and .||∇φu || ≤ φudmax . The real-time actor NN applied to approximate the control input .u is u(x) ˆ = Wˆ uT φu (x).

.

(7.94)

Given the real-time critic NN (7.87), the control input policy can be obtained as

212

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

1 uˆ (x) = − R −1 g T (x)∇φcT Wˆ c . 2

. 1

(7.95)

It follows from (7.94) and (7.95) that one defines the error for the real-time actor 1 e = u(x) ˆ − uˆ 1 (x) = Wˆ uT φu (x) + R −1 g T (x)∇φcT Wˆ c . 2

. u

(7.96)

Select the weight .Wˆ u such that the following squared residual error is minimized .

Eu =

1 T e eu , 2 u

(7.97)

which follows the gradient descent method and yields the tuning law of the actor NN ( )T ∂ Eu 1 −1 T ˙ T ˆ ˆ ˆ = −αu φu Wu φu (x) + R g (x)∇φc Wc , . Wu = −αu 2 ∂ Wˆ u

(7.98)

where the scalar parameter.αu > 0 is the learning rate of the actor NN. The estimation Δ error of actor NN weight is defined as.W˜ u = Wu − Wˆ u . It follows from (7.84), (7.93), (7.96), and (7.98) that one obtains the dynamics of .W˜ u as ( )T 1 −1 T 1 −1 T ˙ T T ˜ ˜ ˜ . Wu = −αu φu Wu φu + ∈u + R g (x)∇φc Wc + R g (x)∇∈c . (7.99) 2 2 Adversary Neural Network A synchronous adversary NN is constructed to uniformly approximate the worst-case adversarial attack input .v in (7.82b). The adversary NN is defined as v(x) = WvT φv (x) + ∈v (x),

.

(7.100)

where .Wv ∈ R Nv ×l is the unknown ideal weight of the adversary NN, .φv (x) ∈ R Nv is the activation function with. Nv neurons, and.∈v (x) is the adversary NN approximation error. Assumption 7.12 Given the compact set .Ω, the adversary NN shows that: 1. The adversary NN approximation error and its gradient are bounded, i.e., .||∈v || ≤ ∈vmax and .||∇∈v || ≤ ∈vvmax . 2. The adversary NN activation function and its gradient are bounded, i.e., .||φv || ≤ φvmax and .||∇φv || ≤ φvvmax . The real-time adversary NN applied to approximate the adversarial input .c is v(x) ˆ = Wˆ vT φv (x).

.

(7.101)

Given the real-time critic NN (7.87), the adversarial input update law is obtained as

7.4 Online Adaptive Inverse Reinforcement Learning …

vˆ =

. 1

1 T k (x)∇φcT Wˆ c . 2γ 2

213

(7.102)

It follows from (7.101) and (7.102) that one defines the error for the real-time adversary 1 T e = v(x) ˆ − vˆ1 (x) = Wˆ vT φv (x) − k (x)∇φcT Wˆ c . 2γ 2

. v

(7.103)

Select the weight .Wˆ v such that the following squared residual error is minimized using the gradient descent method .

Ev =

1 T e ev , 2 v

(7.104)

which yields the tuning law of the adversary NN as

.

( )T ∂ Ev 1 T T ˆ = −αv φv Wˆ vT φv (x) − k (x)∇φ , W W˙ˆ v = −αv c c 2γ 2 ∂ Wˆ v

(7.105)

where the scalar parameter .αv > 0 is the learning rate of the adversary NN. Define Δ the estimation error of adversary NN weight as.W˜ v = Wˆ v − Wˆ v . Then it follows from (7.84), (7.100), (7.103), and (7.105) that one obtains the dynamics of .W˜ v as ( )T 1 T 1 T ˙ T T ˜ ˜ ˜ k (x)∇φc Wc − k (x)∇∈c . . Wv = −αv φv W v φ v + ∈v − 2γ 2 2γ 2

(7.106)

State-Penalty Neural Network A synchronous state-penalty NN is constructed to approximate the state-penalty weight . Q in (7.83) using the above critic NN of .Vˆ , actor NN of .uˆ and adversary NN ˆ The state-penalty NN is defined as of .v. .

Q(x) = WqT φq (x) + ∈q (x),

(7.107)

where .Wq ∈ R Nq is the unknown ideal state-penalty NN weight, .φq (x) ∈ R Nq is the activation function with. Nq neurons, and.∈q (x) is the state-penalty NN approximation error. Assumption 7.13 Given the compact set .Ω, the state-penalty NN shows that: 1. The state-penalty NN approximation error and its gradient are bounded, i.e.,.||∈q || ≤ ∈qmax and .||∇∈q || ≤ ∈qdmax . 2. The state-penalty NN activation function and its gradient are bounded, i.e.,.||φq || ≤ φqmax and .||∇φq || ≤ φqdmax . The real-time state-penalty NN applied to the learner’s integral RL Bellman equation (7.88) is

214

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games .

ˆ Q(x) = Wˆ qT φq (x).

(7.108)

With real-time critic NN (7.87), the actor NN (7.94) and the adversary NN (7.101), the state penalty is revised as ∫

t+T

.

Qˆ 1 (x) dτ =



t

t

t+T

( T ) u e Ru e − 2u Te R uˆ + γ 2 vˆ T vˆ dτ − Wˆ cT Δφ(x(t)). (7.109)

Integrating on the state-penalty NN (7.108) from .t to .t + T and subtracting from (7.109) yields the error for the real-time state penalty ∫

t+T

Wˆ qT φq (x) dτ − p(t) + Wˆ cT Δφ(x) = eq ,

.

t

(7.110)

∫ t+T where . p(t) = t (u Te Ru e − 2u Te R uˆ + γ 2 vˆ T v) ˆ dτ . Select .Wˆ q to minimize the squared residual error using the gradient descent .

Eq =

1 T e eq , 2 q

(7.111)

which derives the tuning law of state-penalty NN weight as .

( )T ∂ Eq = −αq Δφq Wˆ qT Δφq − p(t) + Wˆ cT Δφc , W˙ˆ q = −αq ∂ Wˆ q

where .Δφq =

(7.112)

∫ t+T

φq (x(τ )) dτ and the scalar parameter of learning rate .αq > 0. Δ Define the estimation error of the state-penalty NN weight as .W˜ q = Wq − Wˆ q . Then, the dynamics of .W˜ q is given by .

t

( )T ˆ + W˜ cT Δφc + Δ∈c , W˙˜ q = −αq Δφq W˜ qT Δφq + Δ∈q + p(t) + p(t)

(7.113)

where ∫ Δ∈q =

t+T

.

∈q dτ,

(7.114a)

(uˆ T R uˆ − γ 2 vˆ T v) ˆ dτ,

(7.114b)

t

∫ .

p(t) ˆ =

t+T

t

Δ∈c = ∈c (x(t + T )) − ∈c (x(t)).

.

(7.114c)

ˆ Note that based on the real-time state-penalty NN . Q(x) (7.108), the actor NN u(x) ˆ (7.94) and the adversary NN .v(x) ˆ (7.101), the learner’s integral RL form of

.

7.4 Online Adaptive Inverse Reinforcement Learning …

215

Bellman equation error is now given by Δ eˆ (t) = Wˆ cT Δφc (x(t)) + rˆ (t),

. B

(7.115)

) ∫ t+T ( Wˆ q φq (x) + uˆ T R uˆ − γ 2 vˆ T vˆ dτ . Then, the critic NN tuning where .rˆ (t) = t law (7.91) becomes .

Δφc (t) ∂ EB = −αc eˆ . W˙ˆ c = −αc T (t)Δφ (t) + 1)2 B ˆ (Δφ c ∂ Wc c

(7.116)

To guarantee the asymptotic stability of the equilibrium of the learner’s closedloop system, additional robust terms are added into the real-time actor NN .uˆ in (7.94) and real-time adversary NN .vˆ in (7.101). Then, the real-time actor NN .uˆ and adversary NN .vˆ become uˆ = Wˆ uT φu (x) + ηu ,

.

vˆ =

.

Wˆ vT φv (x)

+ ηv ,

(7.117) (7.118)

||x|| 1m ||x|| 1m where.ηu = −cauu +x and.ηv = −cavv +x Tx T x . The scalars.au ,.cu ,.av , and.cv are parameters to be designed in Theorem 7.8. 2

2

Remark 7.7 Note that now the real-time actor NN of .uˆ and adversarial NN of .vˆ applied to the learner’s system are (7.117) and (7.118), respectively. Thus, the critic tuning law (7.116) and the state-penalty NN tuning law (7.112) will use the real-time actor and adversary in terms of (7.117) and (7.118). The diagram of the online adaptive inverse RL is depicted in Fig. 7.5. It is seen that the tuning of critic NN uses the information of the actor NN, adversary NN, and state-penalty NN. The actor NN and adversary NN update tuning laws by using the

Fig. 7.5 Diagram of online adaptive inverse RL with synchronous NNs

216

7 Inverse Reinforcement Learning for Two-Player Zero-Sum Games

information of critic NN, respectively. The state-penalty NN updates its own tuning law by using the information of the other three NNs. Theorem 7.8 (Stability of the System and NNs) Consider the expert system (7.49) and the learner system (7.54). Let the tuning laws for the learner’s critic NN, actor NN, adversary NN, and state-penalty NN be given by (7.116), (7.98), (7.105), and (7.112), respectively. Let Assumptions 7.9–7.13 hold and .Δφ¯ c in (7.91) satisfy the PE condition. Then, the learner’s closed-loop system state .x, the critic NN weight error .W˜ c , the actor NN weight error .W˜ u , the adversary NN weight error .W˜ v , and the state-penalty NN weight error .W˜ q converge asymptotically to zero for all .x(0) ∈ Ω, ˜ c (0), .W˜ u (0), .W˜ v (0), and .W˜ q (0), by letting the following inequalities hold .W ( .

) ρ − bg ||ηu || − bk ||ηv || < 0

(7.119a)

( αc + bg λmin (R −1 )φumax φcdmax − 4αc λmin (Δφ¯c Δφ¯ cT )

.

) bk φvmax φcdmax + φ φ 0 is the control input penalty weight, .i, j ∈ N . The expert’s input .u ie are the Nash solutions given by △

−1 T u = −Rie Bi Pie xe = −K ie xe , ∀i ∈ N ,

. ie

(8.3)



−1 T where . K ie = Rie Bi Pie , and . Pie is the unique stabilizing solution of expert’s . N -coupled Riccati equations

⎛ 0 = ⎝A −

N ∑

.

⎞T



T ⎠ Pie + Pie ⎝ A − B j R −1 je B j P je

j=1

+

N ∑

N ∑

⎞ T ⎠ B j R −1 je B j P je

j=1

T P je B j R −1 je B j P je + Q ie , i ∈ N .

(8.4)

j=1

The solution .(u ie , u −ie ) is that Nash equilibrium (Isaacs 1999; Lewis et al. 2012), such that .



− Vie∗ = xeT Pie xe = Vie (xe , u ie , u −ie ) ≤ Vie (xe , u ie , u −ie ), i ∈ N ,

(8.5)

− − where .u ie denotes the non-optimal control inputs, i.e., .u ie /= u ie .

Learner System Consider the learner with . N players as

.

x(t) ˙ = Ax(t) +

N ∑

Bi u i (t),

(8.6)

i=1

where .x ∈ Rn and .u i ∈ Rm i are the states and control inputs of the outcome of the learner player .i, .i ∈ N . Note that the learner system (8.6) has identical dynamics (. A, B1 , . . . , B N ) as the expert in (8.1). Define the integrated cost function for player .i of the learner system (8.6) as ʃ .



Vi (x(0), u i , u −i ) = 0

⎛ ⎝x T Q i x +

N ∑ j=1

⎞ u Tj R j u j ⎠ dτ, i ∈ N ,

(8.7)

228

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

where . Q i = Q iT ∈ Rn×n ≥ 0 and . R j = R Tj ∈ Rm j ×m j > 0, .∀i, j ∈ N are arbitrarily selected trail weights, then, we would find the control inputs to minimize .Vi , i.e., .

.

Vi∗ (x(t)) = min Vi (x(t)), i ∈ N .

(8.8)

ui

With the feedback control given by the form.u i = −K i x, the optimal cost function Vi∗ can be represented in terms of the following quadratic form: .

Vi∗ (x(t)) = x T Pi x, i ∈ N ,

(8.9)

where . Pi = PiT ∈ Rn×n > 0 are the control policy matrices. To find the optimal control solution, we define the following Bellman equation: . Hi (x,

Pi , u 1 , . . . , u N ) = x Q i x + T

N ∑

⎛ u Tj R j u j

+ 2x Pi ⎝ Ax(t) + T

j=1

N ∑

⎞ B j u j (t)⎠ = 0.

j=1

(8.10) By the stationary condition . ∂∂uHii = 0, the optimal control solution is found as △

u ∗ = −Ri−1 BiT Pi x = −K i x, ∀i ∈ N ,

. i

(8.11)



where . K i = Ri−1 BiT Pi and . Pi satisfies learner’s . N -coupled Riccati equations ⎛ 0 = ⎝A −

N ∑

.

⎞T



T ⎠ Pi + Pi ⎝ A − B j R −1 j B j Pj

j=1

+

N ∑

T P j B j R −1 j B j Pj + Qi , i ∈ N .

N ∑

⎞ T ⎠ B j R −1 j B j Pj

j=1

(8.12)

j=1

Now we show the inverse RL problem for linear multiplayer non-zero-sum games. Assumption 8.2 The expert’s system dynamics information .(A, B1 , . . . , B N ), cost function weights . Q ie , . R je in (8.2), control policy matrix . Pie in (8.5), and control feedback gain . K ie in (8.3), .∀i, j ∈ N , are unknown to the learner. The learner can observe the expert’s trajectories, i.e., .xe (t), .u ie (t), for all .i ∈ N in (8.5). Definition 8.1 Give . R1 > 0, . . . , R N > 0 with appropriate dimensions. If the weights . Q i in coupled Riccati equations (8.12) yield . K i = K ie in (8.3) for all .i ∈ N , then . Q i is called an equivalent weight to . Q ie , .i ∈ N in coupled Riccati equations. Assumption 8.3 For the selected equivalent weight . Q i and .(R1 , . . . , R N ), assume there exist solutions in the coupled Riccati equations (8.12) that further satisfy the conditions that for each .i ∈ N , the pairs

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

⎛ . ⎝A



∑ j/=i

⎞ ⎛ T ⎠ ⎝ B j R −1 j B j P j , Bi , A −

∑ j/=i

T B j R −1 j B j Pj ,

229

/

Qi +

∑ j/=i

⎞ T ⎠ P j B j R −1 j B j Pj

are respectively stabilizable and detectable. Remark 8.1 The above assumption ensures that the control policy (8.11) provides a Nash equilibrium solution for the . N -player learner system and with the policy, the closed-loop system of (8.6) is asymptotically stable. Problem 8.1 Select . R j ∈ Rm j ×m j > 0, j ∈ N for (8.7). Under Assumptions 8.1– 8.3, the learner aims to find each player .i an equivalent weight . Q i to . Q ie such that (8.11)–(8.12) result in . K i = K ie and imitate the expert’s trajectories, i.e., ∗ ∗ .(x, u 1 , . . . , u N ) = (x e , u 1e , . . . , u N e ). This is called a model-free data-driven inverse RL problem.

8.2.2 Inverse Reinforcement Learning Policy Iteration This section shows a model-based inverse RL PI algorithm to find each player an equivalent weight to . Q ie . The algorithm will be developed based on the following proposition which provides new Lyapunov functions to find a. Q i that yields. K i = K ie holds for each .i ∈ N . Proposition 8.1 If . Q i satisfies both . N -coupled Riccati equations (8.12) and the following Lyapunov equations: ⎛ .

⎝A −

⎞T

N ∑



B j K j ⎠ Pi + Pi ⎝ A −

j=1

N ∑

⎞ Bj K j⎠ +

j=1

N ∑

K Tj R j K j

j=1

= −αi (K i − K ie ) Ri (K i − K ie ) − Q i , i ∈ N , T

(8.13)

where .αi ∈ (0, 1] is a scalar, then, . K i in (8.11) will be equal to . K ie given by (8.3). That is, . Q i is equivalent to . Q ie as stated in Definition 8.1. Proof Subtracting (8.13) from (8.12) yields 0 = αi (K ie − K i )T Ri (K ie − K i ), i ∈ N .

.

(8.14)

Since .αi ∈ (0, 1] and . Ri > 0, (8.14) implies that .

K i = K ie , i ∈ N .

Then, . Q i is equivalent to . Q ie . This completes the proof.

(8.15) ◻

230

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

Algorithm 8.1 Model-based inverse RL PI for linear N -player games 1. Initialization: select initial Q i0 = 0, arbitrary R j = R Tj ∈ Rm j ×m j > 0, initial stabilizing K i0 , and small thresholds ei for i, j ∈ N . Set h = 0. 2. Policy evaluation: Update Pih by the Lyapunov functions ⎛ ⎝A −

N ∑

⎞T



B j K hj ⎠ Pih + Pih ⎝ A −

j=1

=−

N ∑

⎞ B j K hj ⎠

j=1

N ∑ (K hj )T R j K hj − αi (K ih − K ie )T Ri (K ih − K ie ) − Q ih , i ∈ N .

(8.16)

j=1

3. State-penalty weight improvement: Update Q ih+1 by ⎛ Q ih+1

= − ⎝A −

N ∑

⎞T B j K hj ⎠

⎛ Pih



Pih

⎝A −

j=1

N ∑

⎞ B j K hj ⎠ −

j=1

N ∑

(K hj )T R j K hj , i ∈ N .

(8.17)

j=1

4. Policy improvement: Update K ih+1 by K ih+1 = Ri−1 BiT Pih , i ∈ N .

(8.18)

5. Stop if ‖K ih+1 − K ie ‖ ≤ ei , ∀i ∈ N . Otherwise, set h → h + 1 and go to Step 2.

Based on Proposition 8.1, a model-based inverse RL Algorithm 8.1 is given above for linear . N -player systems. In this algorithm, selecting fixed . R j , j ∈ N , Step 2 is the policy evaluation based on (8.13) to reconstruct the control policy matrix . Pih given the current . K hj , . K ie , and . Q ih at iteration .h, h = 0, 1, . . .. The scalar .αi ∈ (0, 1] is the learning rate to guarantee convergence. Step 3 is the state-penalty weight improvement. It uses IOC to improve the state-penalty weight . Q ih+1 toward the equivalent weight to . Q ie . Step 4 is the policy improvement of control feedback gain h+1 .Ki using optimal control. Remark 8.2 The standard IOC procedure (Step 3) and optimal control (Step 4) are solved as subproblems of the inverse RL problem. The next theorems analyze the convergence and the stability of Algorithm 8.1. We also show that the learned equivalent state-penalty weight to . Q ie may not be unique. All possible equivalent weights are characterized in terms of coupled Riccati equations. Theorem 8.1 (Convergence of Algorithm 8.1) With initial stabilizing, . K i0 , . R j > 0, 0 .∀ j ∈ N , and . Q i = 0 .∀i ∈ N , let . Q i ≥ 0 denote the equivalent weight to the expert’s . Q ie , then the learner, using Algorithm 8.1, obtains converged solutions as .h → ∞, such that

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

231

(1) .(K 1h , K 2h , . . . , K Nh ) → .(K 1e , K 2e , . . . , K N e ) where . K ih and . K ie are given by (8.18) and (8.3), respectively; (2) The state-penalty weights . Q ih → Q i , where Q i is the equivalent weight to . Q ie , and the learner exhibits the same trajectories as the expert for all time while starting from the same initial states. Proof Subtracting (8.16) from (8.17) obtains .

Q ih+1 = αi (K ih − K ie )T Ri (K ih − K ie ) + Q ih ,

(8.19)

h which implies that .{Q ih }∞ h=0 are increasing sequences, and . Q i ≥ 0 hold for all .h = 0 0, 1, . . . given an initial . Q i = 0, i.e.,

0 = Q i0 ≤ Q i1 ≤ · · · ≤ Q ih ≤ Q ih+1 .

.

(8.20)

Select . Q i0 = 0 and .αi ∈ (0, 1], such that . Q i0 + αi (K i0 − K ie )T Ri (K i0 − K ie ) ≤ Q i where . Q i is an equivalent weight to . Q ie given . R j , . j ∈ N (see Definition 8.1). From (8.19), we know that . Q i ≥ Q i1 ≥ 0, and the increment of . Q i2 can be adjusted by .αi ∈ (0, 1]. Thus, by analogy, given a . Q i ≥ Q ih+1 ≥ 0, we can select .αi such that .

Q i ≥ Q ih+2 = αi (K ih − K ie )T Ri (K ih − K ie ) + Q ih+1 .

(8.21)

It is known that one state-input weight matrix defines one unique control feedback gain in a Riccati equation (Bittanti et al. 2012). We conclude that . Q ih+2 in (8.21) corresponds to a . K ih+2 which is closer to . K ie than . K ih+1 . Denote .△ih = (K ih − K ie )T Ri (K ih − K ie ) and we have 0 ≤ △ih+2 ≤ △ih+1 .

.

(8.22)

In turn, the decreased .△ih+2 leads to a . Q ih+3 ≥ 0 such that .△ih+3 ≤ △ih+2 . By deduction, as . Q ih increases, .△ih decreases and . K ih approaches . K ie . Hence, we have h h h .lim h→∞ △i = 0, .lim h→∞ K i = K ie , and .lim h→∞ Q i = Q i . Therefore, the learner has the converged behavior, i.e., .(K 1∞ , K 2∞ , . . . , K N∞ ) = .(K 1e , K 2e , . . . , K N e ). Starting from the same initials as the expert, the learner will exhibit the expert’s trajectories for all time, i.e., .(x, u 1 , u 2 , . . . , u N ) = (xe , u 1e , u 2e , . . . , .u N e ). Theorem 8.2 (Non-unique Solution of Algorithm 7.1) Denote the converged solutions obtained by Algorithm 8.1 as . Q i∞ and . Pi∞ , .∀i ∈ N . Then, . Q i∞ satisfies

.

Q i∞ − Q ie =

N ∑

−1 T ∞ P j∞ B j R −1 j (R je − R j )R j B j P j

j=1



+ ⎝A −

N ∑ j=1

⎞T T ∞⎠ B j R −1 (Pie − Pi∞ ) j B j Pj

232

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

⎛ + (Pie − Pi∞ ) ⎝ A −

N ∑

⎞ T ∞⎠ , i ∈ N, B j R −1 j B j Pj

(8.23)

j=1

where . Pie satisfies (8.4), and . Pi∞ satisfies .

−1 T BiT Pi∞ = Ri Rie Bi Pie , i ∈ N .

(8.24)

This implies that the learner obtains the expert control gain . K ie in (8.3), but the converged equivalent weight . Q i∞ may not be unique. Proof As shown in Theorem 8.1, the learner learns the . Q i∞ that is equivalent to . Q ie , and the actual control gain of the expert players, i.e., . K i∞ = K ie for .i ∈ N . That is, the converged solution satisfies the . N -coupled Riccati equations ⎛ 0 = ⎝A −

N ∑

.

⎞T



T ∞⎠ B j R −1 Pi∞ + Pi∞ ⎝ A − j B j Pj

j=1

+

N ∑

N ∑

⎞ T ∞⎠ B j R −1 j B j Pj

j=1

T ∞ ∞ P j∞ B Tj R −1 j B j Pj + Qi , i ∈ N ,

(8.25)

j=1

and .

−1 T − Ri−1 BiT Pi∞ = −Rie Bi Pie , ∀i ∈ N .

(8.26)

Note that (8.26) can be rewritten as (8.24). Subtracting expert’s . N -coupled Riccati equations (8.4) from (8.25) yields (8.23), where . Pi∞ satisfies (8.26). Note that in (8.24), if rank.(Bi ) /= n, there are infinite solutions for . Pi∞ . Moreover, . Ri /= Rie is allowed. Thus, there exist infinite groups of . Pi∞ that are not equal to . Pie . Correspondingly, there will be infinite number of ∞ T . Q i that satisfy (8.23) and are not equal to . Q ie . If rank.(Bi ) = n and . Ri = Rie , then ∞ ∞ . Pi = Pie and . Q i = Q ie , which is unique. ◻ The next example illustrates the non-uniqueness of . Q i∞ shown in Theorem 8.2. Example 8.1 Non-uniqueness of. Q i∞ Assume the expert and learner have the dynamics matrices as [ [ ] [ ] ] −3 −2 1 2 .A = (8.27) , B1 = , B2 = . 2 −3 2 −2 The expert has parameters . Q 1e = I2 , . Q 2e = 2 ∗ I2 , . R1e = 1, and . R2e = 2. Then, the learner has optimal control feedback gains .

[ ] [ ] K 1e = 0.1457 0.2979 , K 2e = 0.2943 −0.2718 .

(8.28)

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

233

Select . R1 = 2 and . R2 = 1 for the learner. From (8.24), we have [ ] ∞ [ ] 1 2 P1 = −0.1457 −0.2979 , [ ] ∞ [ ] . 2 −2 P2 = −0.2943 0.2718 . .

(8.29) (8.30)

[ ] Since rank.( 1 2 )= 1, . P1∞ has many positive-definite solutions, so does . P2∞ . Corre∞ spondingly, it is seen from (8.23) that . Q ∞ 1 and . Q 2 have non-unique values. Theorem 8.3 (Stability and Optimality) Using Algorithm 8.1, the learner system (8.6) is asymptotically stable at each iteration .h. Furthermore, with the converged control gain . K i∞ obtained by Algorithm 8.1, the learner obtains the same Nash equilibrium .(u ie , u −ie ), .i ∈ N as the expert. Proof Theorem 8.1 indicates that. Q ih ≥ 0, ∀i ∈ N while using Algorithm 8.1. Then, we can write (8.16) as ( .

A−

N ∑

B j K hj

)T

j=1

N ( ) ∑ Pih + Pih A − B j K hj j=1

N ∑

=−

(K hj )T R j K hj −(K ih − K ie )T Ri (K ih − K ie ) −Q ih ≤ 0.

(8.31)

j=1

Taking the derivative of .Vih (x) = x T Pih x yields N N ( )T ( ) ∑ ∑ T h h T h ˙ A− . Vi (x) = x B j K j Pi x + x Pi A − B j K hj x ≤ 0. j=1

(8.32)

j=1

Note that .Vih = 0 and .V˙ih = 0 hold only when .x = 0. Thus, the learner system (8.6) is asymptotically stable at each .h. This also implies that (8.6) is asymptotically stable with the converged . K i∞ . Theorem 8.1 shows that . K i∞ = K ie . Since the expert and the learner have identical system dynamics, the learner hence obtains .(u i , u −i ) = (u ie , u −ie ), .∀i ∈ N , where .(u ie , u −ie ) is the Nash Equilibrium solution of the expert. ◻ Remark 8.3 Algorithm 8.1 learns. Q i instead of. R j . The reason lands on two aspects. First, it is known that . R j > 0 and . Q ih ≥ 0 provide extra freedom for solving the . N coupled ARE equation, and it is allowed to find . Q ih ≥ 0 by selecting . R j > 0 for the goal gain . K ie . Theorem 8.1 has shown that the existence of the solutions to (8.16) can be guaranteed (Lancaster and Rodman 1995). Second, this avoids the existence of trivial solutions (Johnson et al. 2013). Remark 8.4 Equations (8.16)–(8.18) formulate a new framework based on policy iteration for inverse RL problem. Different from the conventional RL policy iteration (Lewis and Vrabie 2020), Algorithm 8.1 has an extra IOC step, i.e., the state-penalty

234

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

weight improvement (8.17). This is the key to addressing the requirement of RL, e.g., assuming appropriate predefined cost functions in computing optimal control inputs.

8.2.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning Sometimes, accurate system dynamics knowledge .(A, .[B1 , .. . . , . B N ]) used in Algorithm 8.1 is hard to infer in practical applications. This section shows an off-policy inverse RL algorithm that removes this requirement from the learning process and uses only the measurement data of expert trajectories to solve Problem 8.1. This algorithm is model-free which takes advantage of the integral RL technique (Vrabie et al. 2009). Unlike the structure of the inverse RL algorithms in Chaps. 6 and 7, this algorithm has a single-loop iterative structure that does not involve inner-loop iterations. Firstly, the off-policy technique is introduced by rewriting the expert’s dynamics (8.1) as x˙ = Axe +

N ∑

. e

B j u hje +

j=1

N ∑

B j (u je − u hje ),

(8.33)

j=1

where .u hje = −K hj xe are auxiliary inputs and . K hj is to be updated. We use it for each player .i to write x˙ T Pih xe + xeT Pih x˙e ⎛ ⎞T ⎛ ⎞ N N ∑ ∑ = xeT ⎝ A − B j K hj ⎠ Pih xe + xeT Pih ⎝ A − B j K hj ⎠ xe

. e

j=1

+2

j=1

N ∑

(u je − u hje )T B Tj Pih xe .

(8.34)

j=1

Substituting (8.16) into (8.34) yields x˙ T Pih xe + xeT Pih x˙e ∑ h T =2 (u je − u hje )T B Tj Pih xe + 2(u ie − u ie ) Ri K ih+1 xe

. e

j∈−i N ∑ h T h − (u hje )T R j u hje − xeT Q ih xe − αi (u ie − u ie ) Ri (u ie − u ie ), j=1

(8.35)

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

235

where .x˙eT Pih xe + xeT Pih x˙e is the derivative of a crossed value function known as T h . Vc = x e Pi x e using expert’s data and learner’s control parameters. It is a result of multiplying both sides of (8.16) by .xe and is an operation term to compute . P jh using data. Integrating both sides of (8.35) from .t to .t + T (.T > 0 is a small time period) yields x T (t + T )Pih xe (t + T ) − xeT (t)Pih xe (t) ⎛ ⎞ ʃ t+T ∑ ʃ ⎝ =2 (u je − u hje )T B Tj Pih xe ⎠ dτ + 2

. e

t



ʃ

t+T

− t

ʃ t

⎝xeT Q ih xe +

N ∑

t

h T (u ie − u ie ) Ri K ih+1 xe dτ

⎞ (u hje )T R j u hje ⎠ dτ

j=1 t+T



j∈−i

t+T

h T h αi (u ie − u ie ) Ri (u ie − u ie ) dτ.

(8.36)

This iterative form for the inverse RL problem is inspired by the off-policy integral RL technique. It is the data-driven version of policy evaluation (8.16) and policy improvement (8.18). It computes the unknowns . Pih , . B Tj Pih , and . K ih+1 by using the measurement data of expert’s trajectory (.xe , u 1e , . . . , u N e ). Similarly, we rewrite (8.6) with auxiliary inputs .u hj = −K hj x as

.

x˙ = Ax +

N ∑

B j u hj +

j=1

N ∑

B j (u j − u hj ).

(8.37)

j=1

Use (8.37) and (8.17) to write .V˙ih (x, Pih ) = x˙ T Pih x + x T Pih x˙ as . x˙

Pih x + x T Pih x˙ ⎛ ⎞T ⎛ ⎞ N N N ∑ ∑ ∑ = xT ⎝A − B j K hj ⎠ Pih x + x T Pih ⎝ A − B j K hj ⎠ x + 2 (u j − u hj )T B Tj Pih x T

j=1

= −x T Q ih+1 x −

j=1 N ∑

N ∑

j=1

j=1

(u hj )T R j u hj + 2

(u j − u hj )T B Tj Pih x,

which is integrated on both sides to yield

j=1

(8.38)

236

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games .

x T (t + T )Pih x(t + T ) − x T (t)Pih x(t) ⎞ ⎛ ʃ t+T ∑ ⎝x T Q ih+1 x − 2 =− (u j − u hj )T B Tj Pih x ⎠ dτ t

ʃ

j∈−i

ʃ

t+T

+

2(u i −

t

u ih )T Ri K ih+1 x

dτ − t

t+T

N ∑ (u hj )T R j u hj dτ.

(8.39)

j=1

This is the data-driven version of (8.17) to compute . Q ih+1 . In order to compute . Pih , . K ih+1 , and . Q ih+1 together, (8.36) and (8.39) are combined to yield T (t + T )P h x (t + T ) − x T (t)P h x (t) + x T (t + T )P h x(t + T ) − x T (t)P h x(t) e i e i e i i

.xe

−2 −2

∑ ʃ t+T (

j∈−i t

ʃ t+T ( t

=− −

) (u je − u hje )T B Tj Pih xe + (u j − u hj )T B Tj Pih x dτ

ʃ t+T (

ʃ t+T ) h )T R K h+1 x + (u − u h )T R K h+1 x dτ + (u ie − u ie x T Q ih+1 x dτ e i i i i i i t

h )T R (u − u h ) + x T Q h x ) dτ αi (u ie − u ie i ie e i e ie

t N ∑ ʃ t+T ( j=1 t

) (u hj )T R j u hj + (u hje )T R j u hje dτ, i ∈ N ,

(8.40)



where .−i = {1, . . . , i − 1, i + 1, . . . , N }. The terms of the right side of (8.40) are known along with the measured trajectory data online. The data-driven model-free off-policy integral inverse RL algorithm for linear multiplayer games is given in Algorithm 8.2. Algorithm 8.2 Off-policy inverse RL algorithm for linear multiplayer games 1. Initialization: Select Q i0 = 0n , arbitrary R j = R Tj ∈ Rm j ×m j > 0, initial stabilizing K 0j , and small thresholds ei for i, j ∈ N . Set h = 0. Apply u j = −K j x + ϵ j to (8.37), where K j is stabilizing policy and ϵ j are probing noises. Measure expert trajectories, i.e., (xe , u 1e , . . . , u N e ). 2. Policy and penalty weight evaluation and update: Update penalty weight Q ih+1 and control input u ih+1 , i.e., K ih+1 , ∀i ∈ N by (8.40). 3. Stop if ‖K ih+1 + u ie xeT (xe xeT )−1 ‖ ≤ ei , ∀ i ∈ N . Otherwise set h → h + 1, and go to Step 2.

Remark 8.5 Note that .−u ie xeT (xe xeT )−1 provides an estimate of . K ie . It is computed by collecting enough data of .(xe , u ie ). To let the system be persistently exciting, probing noise, such as random noise and sinusoidal, should be injected into .u j in the learner system (8.37), for uniquely

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

237

solving unknowns in (8.40). The next lemma guarantees the unbiased solutions in Algorithm 8.2 due to probing noises. Lemma 8.1 Let . Pˆih be the solution of (8.40) with .uˆ j = u j + ϵ j , where .ϵ j /= 0 is the probing noise. Let . P¯ih be the solution of (8.40) with .u¯ j = u j . Then, one has unbiased solutions, i.e., . Pˆih = P¯ih , . Qˆ ih+1 = Q¯ ih+1 , and . Kˆ ih+1 = K¯ ih+1 . ◻

Proof Please see proofs of Theorem 4 in Lian et al. (2022a).

Algorithm Implementation An online implementation method for Algorithm 8.2 using real-time trajectory data of the learner and the expert is shown below. First, we rewrite (8.40) as . (vev(x e (t

+ T )) + vev(x(t + T )) − vev(xe (t)) − vev(x(t)))T vem(Pih ) ) (ʃ ʃ t+T t+T ∑ 2(u je − u hje )T ⊗ xeT dτ + 2(u j − u hj )T ⊗ x T dτ vec(B Tj Pih ) − t

j∈−i



t+T

− (ʃ − ʃ =− −

t t+T t t+T

t

) h T 2((u ie − u ie ) Ri ) ⊗ xeT dτ vec(K ih+1 )

) (ʃ 2((u i − u ih )T Ri ) ⊗ x T dτ vec(K ih+1 ) +

(

h T αi (u ie − u ie ) Ri (u ie

t N ʃ t+T ∑ j=1 t

h − u ie ) + xeT Q ih xe

)

t+T t

) vev(x)T dτ vem(Q ih+1 )



(

) (u hj )T R j u hj + (u hje )T R j u hje dτ, i ∈ N ,

(8.41)

where .vev, .vem, and .vec are as defined in “Abbreviations and Notation”. Note that (8.41) is used to solve the unknowns .vem(Pih ), .vec(B Tj Pih ), .vec(K ih+1 ), and .vem(Q ih+1 ). One can do this by using the batch least squares (BLS) method. Toward this end, define ⎡

⎡ h⎤ ⎤ c1 d1h1 . . . d hj1 . . . d Nh 1 z ih1 ax1 qi 1 ⎢c2 d1h . . . d hj . . . d Nh z ih ax2 ⎥ ⎢qih ⎥ 2 2 2 2 ⎢ ⎥ h ⎢ 2⎥ h .ϕi = ⎢ ⎥ , ρi = ⎢ .. ⎥ , .. ⎣ ⎣.⎦ ⎦ . qihι cι d1hι . . . d hjι . . . d Nh ι z ihι axι where

(8.42)

238

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

c = [vev(xe (t + ιT )) + vev(x(t + ιT ))

. ι

− vev(xe (t + (ι − 1)T )) − vev(x(t + (ι − 1)T ))]T ʃ t+ιT ) ( d hjι = 2 (u hje − u je )T ⊗ xeT +(u hj − u j )T ⊗ x T dτ, t+(ι−1)T t+ιT

ʃ z ihι =

t+(ι−1)T t+ιT

) ( h 2 ((u ie − u ie )T Ri ) ⊗ xeT + ((u ih − u i )T Ri ) ⊗ x T dτ,

ʃ axι =

vev(x)T dτ, t+(ι−1)T ʃ t+ιT

qihι = −

(

t+(ι−1)T

+

h T h xeT Q ih xe + αi (u ie − u ie ) Ri (u ie − u ie )

N ∑

) (u hj )T R j u j + (u hje )T R j u je dτ.

j=1

Note that in (8.41), the unknowns . Pih , . Q ih+1 , . K ih+1 , . B Tj Pih , . j ∈ −i have .n(n + 1) ∑N ∑N .+ j=1 nm j independent elements in total. Thus, .ι ≥ n(n + 1) + j=1 nm j data tuples in (8.42) need to be collected, such that h T h .rank((ϕi ) ϕi )

= n(n + 1) +

N ∑

nm j .

(8.43)

j=1

Then, the vectors .vem(Pih ), .vec(B Tj Pih ), .vec(K ih+1 ), and .vem(Q ih+1 ) where . j ∈ −i are uniquely solved by [ .

T vem(Pih )T vec(B1T Pih )T . . . vec(B(i−1) Pih )T

T vec(B(i+1) Pih )T . . . vec(B NT Pih )T vec(K ih+1 )T vem(Q ih+1 )T )−1 h T ( (ϕi ) ρi , i ∈ N . = (ϕih )T ϕih

] (8.44)

Remark 8.6 The condition (8.43) is necessary for using the BLSs to solve the ∑ N unknowns in (8.41) uniquely. To make (8.43) hold, besides .ι ≥ n(n + 1) + j=1 nm j , the persistence of excitation condition should also be guaranteed. To this end, probing noise .ϵ, such as random white noise and sinusoidal, needs to be injected into the learner’s control inputs to generate data. The off-policy inverse RL framework in Algorithm 8.2 allows unbiased learning results when applying probing noise as shown in Lemma 8.1. Similar analyses are also seen in off-policy RL work (Jiang and Jiang 2012; Kiumarsi et al. 2012; Modares et al. 2012). Remark 8.7 Algorithm 8.2 is online in the sense that the learner updates its statepenalty weight, cost matrix, and control policy along with real-time measurements and collection of expert trajectories .(xe , u 1e , . . . , u N e ).

8.2 Off-Policy Inverse Reinforcement Learning for Linear …

239

Remark 8.8 In Algorithm 8.1, . Pih is the unique solution of (8.16). . Q ih+1 and . K ih+1 are then uniquely determined by (8.17) and (8.18), respectively. Moreover, from the derivations in this section, the . Pih and . K ih+1 also satisfy (8.36), and the . Q ih+1 also satisfies (8.39). Therefore, all of them satisfy (8.40). As claimed in Remark 8.6, the . Pih , . Q ih+1 , and . K ih+1 in (8.40) can be uniquely solved by using (8.44) when (8.43) holds. It leads to the conclusion that Algorithm 8.2 is equivalent to Algorithm 8.1 and solves the same solution as Algorithm 8.1 if given same initials. Therefore, they have the same convergence. Theorem 8.1 has proven the convergence of Algorithm 8.1. It hence concludes that Algorithm 8.2 has .(K 1h , K 2h , . . . , K Nh ) → (K 1∞ , K 2∞ , . . . , K N∞ ) = .(K 1e , K 2e , . . . , K N e ) and . Q ih .→ . Q i∞ as .h → ∞, where . Q i∞ is equivalent to . Q ie , .∀i ∈ N . With the control gains . K i∞ for all inputs .u i , the learner performs the expert’s trajectories.(xe , u 1e , u 2e , . . . , u N e ) when it has the same initials as the expert (Lian et al. 2022a, d).

8.2.4 Simulation Examples Two examples are provided to verify the model-free off-policy inverse RL Algorithm 8.2. Example 8.2 Consider both the learner and the expert as three-player non-zero-sum game systems with the system dynamics [ ] [ ] [ ] [ ] −3 −2 1 2 −1 .A = , B1 = , B2 = , B3 = . 2 −3 2 −2 1.7 For each player’s input of the three-player expert system, the cost function weights are set as [ [ [ ] ] ] 6.9624 5.9493 7.5661 2.2652 7.6868 1.5283 . Q 1e = , Q 2e = , Q 3e = , 5.9493 12.9912 2.2652 12.1583 1.5283 11.9917 and . R1e = 1.5, R2e = 2, R3e = 3, which yield the expert’s optimal control feedback gains as .

[ ] [ ] [ ] K 1e = 1.3333 1.6667 , K 2e = 0.7500 −0.7500 , K 3e = −0.2200 0.5000 .

To implement Algorithm 8.2, we select . Q 01 = Q 02 = Q 03 = 0, . R1 = 1.2, . R2 = 1.5, . R3 = 2, .α1 = α2 = α3 = 1, .T = 0.0015, .e1 = e2 = e3 = 3 × 10−3 , and small probing noises .ϵ1 , ϵ2 , ϵ3 are random white noises within .[0, 1]. Figures 8.1 and 8.2 show the convergence of . Q ih and . K ih to . K ie , respectively, .∀i ∈ {1, 2, 3}. The converged values are

240

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

Fig. 8.1 Convergence of h h and . Q h using 3 Algorithm 8.2

14

.Q1 , .Q2 ,

12 10 8 6 4 2 0 0

Fig. 8.2 Convergence of . K ih to . K ie , .∀i ∈ {1, 2, 3} using Algorithm 8.2

5

10

15

20

25

30

35

40

5

10

15

20

25

30

35

40

1.5

1

0.5

0 0



.Q1

K 1∞

] ] ] [ [ [ 4.1352 5.4445 3.0443 −2.3692 1.6728 −1.5499 ∞ , Q∞ , Q , = = 2 3 5.4445 10.2645 −2.3692 3.6807 −1.5499 4.4888 [ ] [ ] [ ] = 1.3375 1.6798 , K 2∞ = 0.7505 −0.74921 , K 3∞ = −0.2171 0.5005 . =

Figure 8.3 shows that the learner exhibits the same trajectories as the expert under the converged control feedback gains. Again, as shown in Fig. 8.1 and the above converged . Q i∞ value, . Q i∞ for each learner player is not equal to . Q ie for each expert player. However, this converged weight is the one that can be viewed as equivalent to . Q ie as it produces the control gain . K i∞ that is quite close to . K ie . Example 8.3 Consider a load frequency control problem of a two-area interconnected power system shown in Lian et al. (2022a) as a two-player non-zero-sum game

8.2 Off-Policy Inverse Reinforcement Learning for Linear … 5

States

Fig. 8.3 Trajectories of the learner using converged polices in Algorithm 8.2

241

0

-5 0

0.5

1

1.5

2

1.5

2

Control inputs

Time (s) 5 0 -5 -10 0

0.5

1

Time (s)

without disturbance. Both the expert and the learner have system dynamics matrices as ⎡ ⎤ ⎡ ⎤ a11 a12 0 0 0 0 0 a18 0 ⎢ 0 a22 a23 0 0 0 0 0 ⎥ ⎢ 0 ⎥ ⎢ ⎥ ⎢ ⎢ 0 0 a33 0 0 0 0 0 ⎥ ⎥ ⎢1/TG 1 ⎥ ⎢ ⎥ ⎢ ⎢ 0 0 0 a44 a45 0 0 a48 ⎥ ⎥ ⎢ 0 ⎥ ⎢ ⎥ ⎢ ⎥ ⎥ 0 0 0 a a 0 0 , B .A = ⎢ 0 = B = 55 56 2 ⎢ 0 ⎥, ⎢ ⎥ 1 ⎢ ⎢ 0 0 0 0 a55 a56 0 0 ⎥ ⎥ ⎢ 0 ⎥ ⎢ ⎥ ⎢ ⎢ 0 0 0 a64 0 a66 a67 0 ⎥ ⎥ ⎢ ⎥ ⎣ 0 ⎦ ⎣a71 0 0 a74 a75 0 0 a78 ⎦ 0 a81 0 0 a84 0 0 0 0 where a = −1/T p1 , a12 = K P1 /T p1 , a18 = −K P1 /T p1 , a22 = −1/TT1 , a23 = 1/TT1 , a33 = −1/TG 1 , a44 = −1/TP1 , a45 = K P2 /T p2 ,

. 11

a48 = K P2 /T p2 , a55 = −1/TT2 , a56 = 1/TT2 , a64 = −1/r2 TG 2 , a66 = −1/TG 2 , a67 = −1/TG 2 , a71 = −2π T12 K T2 , a74 = K 2 /r2 − K t2 /r2 TP2 , a75 = K T2 K P2 /r2 TP2 , a78 = K T2 K P2 /r2 TP2 − K 2 , a81 = 2π T12 , a84 = −2π T12 . Set the parameters: .TG 1 = TG 2 = 0.08sec, .TT1 = TT2 = 0.08sec, .TP1 = TP2 = 0.08sec, . K P1 = K P2 = 0.08 Hz, .T12 = 0.00545p.u., .r2 = 2.4 Hz, . K 1 = K 2 = 0.15, and . K T1 = K T2 = 0.3. For the expert, set . Q 1e = diag{1, 0, 0, 0, 0, 0, 0, 0}, . Q 2e =

242

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

diag{1, 0, 0, 1, 0, 0, 0, 2}, . R1e = 2, . R2e = 1, and the expert’s control feedback gains are .

[ ] K 1e = 10−3 × 5 0.6 0.6 0 0 0 0.2 2.8 , [ ] K 2e = 10−3 × 10.8 1.3 1.3 −1.5 0 0 5.8 292.6 .

For the learner, we select . Q 01 = Q 02 = 08 , . R1 = 5, and . R2 = 2. Figure 8.4 shows the convergence of . K ih to . K ie , and . Q ih . Again, . Q ih is the equivalent weight to . Q ie . With the converged control gains, as shown in Fig. 8.5, the learner has the same trajectories as the expert.

Fig. 8.4 Convergence of . K ih to . K ie , and . Q ih , .∀i ∈ {1, 2, 3} using Algorithm 8.2

0.6 0.4 0.2 0 0

5

10

15

20

25

30

35

40

0

5

10

15

20

25

30

35

40

4

2

0

Fig. 8.5 Trajectories of the learner LFC system using converged polices by Algorithm 8.2

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

243

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear Multiplayer Non-Zero-Sum Games Section 8.3.1 presents an inverse RL problem involving an expert and a learner in the context of nonlinear . N -player non-zero-sum game systems. The objective is to design effective inverse RL algorithms that enable the learner to reconstruct the expert’s cost function and mimic their behaviors. In Sect. 8.3.2, we introduce a model-based inverse RL algorithm specifically tailored for . N -player nonlinear non-zero-sum game systems. This algorithm is applicable to both homogeneous control inputs, where all players have the same type of control input, and heterogeneous control inputs, where players have different control input characteristics. The model-based approach leverages the system dynamics and employs appropriate mathematical formulations to solve the inverse RL problem. Section 8.3.3 focuses on the development of a model-free inverse RL algorithm utilizing neural networks. This algorithm eliminates the need for explicit knowledge of the system dynamics and instead utilizes neural networks as function approximators. The model-free approach provides flexibility and adaptability to a wide range of . N -player nonlinear non-zero-sum game systems.

8.3.1 Problem Formulation Expert System Consider the expert with a nonlinear . N -player system x˙ = f (xe ) +

N ∑

. e

gi (xe )u ie ,

(8.45)

i=1

where .xe ∈ Rn , .u ie ∈ Rm i , . f (xe ) ∈ Rn , and .gi (xe ) ∈ Rn×m i denote expert’s state, △ control input of player .i, .i ∈ N = {1, 2, . . . , N }, state dynamics, and control input dynamics, respectively. ∑N Assumption 8.4 The dynamics function . f + i=1 gi u ie is Lipschitz continuous on a set .Θ containing the origin such that there exist continuous controls on .∏ that stabilize (8.45). Definition 8.2 (Lian et al. 2022b) The players in (8.45) are said to be heterogeneous if .gi /= g j for some .i /= j in (8.45) where .i, j ∈ N . The players in (8.45) are said to be homogeneous if .gi = g, .∀i ∈ N . The control inputs of the expert are Nash solutions of a non-zero-sum game (Ba¸sar and Olsder 1998) which minimize the integrated cost function for player .i given by

244

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games .

Vie (xe (t0 ), u 1e , . . . , u N e ) ⎛ ⎞ ʃ ∞ N ∑ ⎝q T (xe )Q ie q(xe ) + = u Tje R je u je ⎠ dτ, i, j ∈ N , t0

(8.46)

j=1

where .q(xe ) = [xes1 . . . xesn ]T ∈ Rn is the state function with the power of .s, . Q ie = T Q ie ∈ Rn×n > 0 are state-penalty weights, and. R je = R Tje ∈ Rm j ×m j > 0 are control input-penalty weights. The expert’s optimal control input of player .i is given by 1 −1 T u = − Rie gi (xe )∇Vie (xe ), i ∈ N , 2

(8.47)

. ie

where .∇Vie satisfies the . N -coupled Hamilton–Jacobi equations 1∑ T T ∇V je (xe )g j (xe )R −1 je g j (x e )∇V je (x e ) 4 j=1 ⎞ ⎛ N ∑ 1 −1 g j (xe )R je g Tj (xe )∇V je (xe )⎠ , i ∈ N . + ∇VieT (xe ) ⎝ f (xe ) − 2 j=1 N

0 = q T (xe )Q ie q(xe ) +

.

(8.48)

With the control inputs in (8.47), a Nash equilibrium is reached, i.e., .

Vie∗ = Vie (xe (t), u ie , u −ie ) ≤ Vie (xe (t), u¯ ie , u −ie ),

where .u¯ ie denotes a non-optimal control input and .−i denotes the set of neighbors △ of .i, i.e., .−i = {1, . . . , i − 1, i + 1, . . . N }. Learner System Consider the learner with a nonlinear . N -player system that has the same dynamic functions as the expert (8.45) as .

x˙ = f (x) +

N ∑

gi (x)u i ,

(8.49)

i=1

where .x ∈ Rn denotes the learner states and .u i ∈ Rm i denotes the control inputs of player .i, .i ∈ N . If we, based on the non-zero-sum games (Ba¸sar and Olsder 1998), define a integrated cost function of player .i to be minimized as ʃ .

Vi (x(t0 ), u i , u −i ) = t0



⎛ ⎝q T (x)Q i q(x) +

N ∑ j=1

⎞ u Tj R j u j ⎠ dτ,

(8.50)

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

245

where .q(·) ∈ Rn is the same state function as that in (8.46), state-penalty weights are . Q i = Q iT ∈ Rn×n > 0, control input-penalty weights are . R j ∈ Rm j ×m j > 0, for all .i, j ∈ N , then, the optimal control input for learner player .i is given by 1 u ∗ = − Ri−1 giT (x)∇Vi (x), i ∈ N 2

. i

(8.51)

which satisfies the . N -coupled Hamilton–Jacobi equations 1∑ T ∇V jT (x)g j (x)R −1 j g j (x)∇V j (x) 4 j=1 ⎞ ⎛ N ∑ 1 T ⎠ g j (x)R −1 + ∇ViT (x) ⎝ f (x) − j g j (x)∇V j (x) , i ∈ N . 2 j=1 N

0 = q T (x)Q i q(x) +

.

(8.52)

Assumption 8.5 The learner does not know the expert’s system dynamics.( f ,.g1 ,.. . ., g ), control policy parameters in (8.52), and penalty weights, i.e., . Q ie , . Rie , i ∈ N in (8.46), but knows the expert’s trajectory data, i.e., control inputs .u ie , .i ∈ N and state .xe .

. N

Definition 8.3 If . Q¯ i given . R1 , . . . , R N in (8.52) yield the same control inputs such that .u i∗ = u ie with .u i∗ defined in (8.51), then, . Q¯ i is an equivalent weight to . Q ie , .i ∈ N . Problem 8.2 Select . R j ∈ Rm j ×m j > 0, j ∈ N for (8.50). Under Assumptions 8.4– 8.5, the learner aims to find each player .i an equivalent weight . Q¯ i such that (8.51)– (8.52) result in . K i = K ie and imitate the expert’s trajectories, i.e., .(x, u ∗1 , . . . , u ∗N ) = (xe , u 1e , . . . , u N e ). This problem is similar to that in Sect. 8.2, but for nonlinear systems. It is a data-driven inverse RL problem for multiplayer games.

8.3.2 Inverse Reinforcement Learning Policy Iteration A model-based inverse RL policy iteration (PI) Algorithm 8.3 is shown below to find an equivalent weight . Q¯ i , .i ∈ N . It combines inner optimal control learning loops and outer inverse optimal control (IOC) learning loops. This is a two-loop learning structure and is different from the algorithm in Sect. 8.2. Algorithm 8.3 has multiple outer-loop iterations and inner-loop iterations. At the outer-loop iteration .k, the inner-loop iteration .l, .l = 0, 1, . . . , ∞ (see Steps 3–6) is the optimal control learning for multiplayer games. It is a policy iteration process with game control policy evaluation and control input update. Given the current estimate k k k . Q i of . Q ie , this process can solve the optimal solution (.u i , . Vi ) satisfying (8.51)– (8.52) for player .i. The outer-loop iteration .k is to update the penalty weights based

246

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

Algorithm 8.3 Model-based inverse RL PI 1. Initialization: Select initial Q i0 > 0, any Ri > 0, stabilizing u i00 , and small thresholds εi > 0 and ei > 0, ∀i ∈ N . Set k = 0; 2. Outer-loop iteration k 3. Inner-loop iteration l using optimal control: Given k, set l = 0. 4. Policy evaluation: Compute Vikl , i ∈ N using 0=q

T

(x)Q ik q(x) +

⎛ ⎞ N N ( )T ∑ ∑ kl T kl kl kl ⎠ ⎝ f (x) + (u j ) R j u j + ∇Vi (x) g j (x)u j . (8.53) j=1

5.

j=1

Policy improvement: Compute the control policies using k(l+1)

ui

1 = − Ri−1 giT (x)∇Vikl (x). 2

(8.54)

k(l−1)

Stop if ‖Vikl − Vi ‖ ≤ εi , then set Vik = Vikl , u ik = u ikl , otherwise set l ← l + 1 and go to step 4; 7. Outer-loop iteration k based on IOC: Update the state-penalty weight Q ik+1 by

6.

q T (xe )Q ik+1 q(xe ) ( )T ( ) = αi u ie (xe ) − u ik (xe ) Ri u ie (xe ) − u ik (xe ) + q T (xe )Q ik q(xe ). (k+1)0

8. Stop if ‖u ik − u ie ‖ ≤ ei , otherwise, set u i

(8.55)

= u ik , k ← k + 1 and go to step 3.

on IOC. Using.xe ,.u ie , and.u ik obtained by inner-loop iterations, the learner revises the player .i’s state-penalty weight . Q ik+1 toward . Q¯ i . Repeating these steps for all players, the learner will eventually obtain the expert’s behavior, i.e., .(xe , u 1e , u 2e , . . . , u N e ). The stop condition .‖u ik − u ie ‖ ≤ ei can also be .‖Q ik+1 − Q ik ‖ ≤ e¯i with some small thresholds .e¯i , .i ∈ N . The next results show that Algorithm 8.3 is convergent and the learner eventually obtains the expert’s behavior, i.e., .(xe , u 1e , u 2e , . . . , u N e ). That is, an equivalent weight can be learned by Algorithm 8.3. However, the learned equivalent weight can be different from the expert’s actual weight values. This is the non-uniqueness of the learned solution. We would also note that the learner system (8.49) is stable during the learning process. Definition 8.4 The vector .x is said to be a uniformly approximate solution of . y if there exists a small threshold .e such that .‖x − y‖ ≤ e holds. Lemma 8.2 (Liu et al. 2014) Given . Q ik > 0 and admissible .u ik0 , the nonlinear Lyapunov equations (8.53) have solutions .Vik ≥ 0 for all .i ∈ N and .k = 0, 1, . . .. Theorem 8.4 (Convergence of Algorithm 8.3) Algorithm 8.3 terminates at a limited iteration step, and the learner can obtain the uniformly approximate solutions of an equivalent weight . Q¯ i .

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

247

Proof The inner loops are the standard RL policy iterations. The convergence of it at any inner iteration loops can be referred to in Liu et al. (2014). The policy iteration is proved to be quasi-Newton’s method in a Banach space (Deimling 2010; Kantorovitch 1939; Wu and Luo 2012) that yields the converged (.u ik , .Vik ), where .Vik is the optimal value and .u ik is the optimal control input at outer .k loop. Note that .u ik0 is obtained from Step 8, and is stabilizing. The stability is analyzed in Theorem 8.6. The positive definiteness of . Q ik is proved as below. We now prove the convergence of outer loops. Equation (8.55) implies that q T (xe )(Q ik+1 − Q ik )q(xe ) ( )T ( ) = αi u ie (xe ) − u ik (xe ) Ri u ie (xe ) − u ik (xe ) ≥ 0.

.

(8.56)

This is indeed an update that shares similar principles of sequential iteration in, for example, numerical algorithms (Moerder and Calise 1985). Given initial . Q i0 > 0, (8.56) implies 0 < Q i0 ≤ · · · ≤ Q ik ≤ Q ik+1 , ∀i, k

.

(8.57)

and q T (xe )Q ik q(xe )

.

1/2

1/2

= αi ‖Ri (u ie (xe ) − u ik−1 (xe ))‖2 + ρi ‖Ri (u ie (xe ) − u ik−2 (xe ))‖2 1/2

+ · · · + αi ‖Ri (u ie (xe ) − u i0 (xe ))‖2 + q T (xe )Q i0 q(xe ),

(8.58)

which implies that .{Q ik }∞ k=0 is an increasing sequence. The next theorem will show that there are infinitely many equivalent weights . Q¯ i . Hence, we select initial small . Q i0 > 0 such that there exists at least one . Q¯ i such that ¯ i ≥ Q i0 . Then, the tuning parameter .αi can adjust the increment of each . Q ik so that .Q k ¯ i . That is, there exists a small threshold .βi , such . Q i can increase to the neighbor of . Q that .‖Q ik − Q¯ i ‖ ≤ βi holds, as well as .‖q T (xe )Q ik q T (xe ) − q T (xe ) Q¯ i q(xe )‖ ≤ ρi with some small .ρi > 0. Note that as . Q ik approaches . Q¯ i , the increment of . Q ik in quadratic form (see (8.56)) decreases. Therefore, with updated . Q ik by outer loops at some .k, the inner loops will yield the approximate solution of .u ie , i.e., .‖u ik − u ie ‖ ≤ ei . As a result, Algorithm 8.3 will stop at .k and then one has ‖q T (xe ) Q¯ i q(xe )‖ + ρi ≥ ‖q T (xe )Q ik q(xe )‖

.

1/2

1/2

= αi ‖Ri (u ie − u ik )‖2 + · · · + αi ‖Ri (u ie − u i0 )‖2 + ‖q T (xe )Q i0 q(xe )‖ ≥ αi (k + 1)ei2 ‖Ri ‖ + ‖q T (xe )Q i0 q(xe )‖ ≥ αi (k + 1)ei2 λmin (Ri ) − ‖q(xe )‖2 ‖Q i0 ‖, which implies

(8.59)

248

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

(

‖q‖2 ‖ Q¯ i ‖ + ‖q‖2 ‖Q i0 ‖ + ρi .k ≤ ceil αi ei2 λmin (Ri )

) ≡ k0 .

(8.60)

Please note that .q T (xe )Q ik q(xe ) and .q T (xe ) Q¯ i q(xe ) are scalars. It concludes that Algorithm 8.3 will stop after at most outer .k0 iterations and obtain the uniformly approximate solutions of . Q¯ i . Given the same initial states, i.e., .x(0) = xe (0), the learned control policy will produce the approximate same trajectories of the expert’s .(xe , u 1e , . . . , u N e ). Note that . Q¯ i might be different . Q ie for any .i ∈ N because . Q¯ i is only guaranteed to be the equivalent weight to . Q ie . Theorem 8.5 (Non-unique Solutions of Algorithm 8.3) The equivalent solution . Q¯ i that Algorithm 8.3 aims to approximate may be non-unique, which can be different from the expert’s actual . Q ie . All such possible equivalent weights . Q¯ i and the optimally corresponding .∇ V¯i (xe ) satisfy −1 T g T (xe )∇ V¯i (xe ) = Ri Rie gi (xe )∇Vie (xe ), i ∈ N ,

. i

(8.61)

and q T (xe )(Q ie − Q¯ i )q(xe )

.

1 ∑ ¯T −1 T ¯ ∇ V j (xe )g j (xe )R −1 j (R j − R je )R j g j (x e )∇ V j (x e ) 4 j=1 ⎛ ⎞ N ∑ ) ( 1 T ¯ ⎠ g j (xe )R −1 + ∇ V¯iT (xe ) − ∇VieT (xe ) ⎝ f (xe ) − je g j (x e )∇ V j (x e ) . 2 j=1 N

=

(8.62) Proof The solutions that Algorithm 8.3 aims to approximate, denoted as . Q¯ i and ¯i , satisfy .∇ V 1 −1 T ¯i (xe ), i ∈ N , .u ie = − Ri gi (x e )∇ V (8.63) 2 where .∇ V¯i (xe ) satisfies 0 = q T (xe ) Q¯ i q(xe ) +

.

1 ∑ ¯T ∇ V j (xe )g¯ j (xe )∇ V¯ j (xe ) 4 j=1 N

⎞ N ∑ 1 T ¯ ⎠ g j (xe )R −1 + ∇ V¯iT (xe ) ⎝ f (xe ) − j g j (x e )∇ V j (x e ) . 2 j=1 ⎛

(8.64)

It follows from Definition 8.3 that . Q¯ i are the equivalent weights. (8.63) and (8.47) indicate (8.61). Subtracting (8.64) from (8.48) yields (8.62).

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

249

−1 T If we view (8.61) as .g T (xe )X i = Yi with .Yi = Ri Rie g (xe )∇Vie (xe ) and .xe /= 0, there will be infinite solutions of . X i with rank.(g(xe )g T (xe )) < n. The system dynamics are not guaranteed to be rank.(g(xe )g T (xe )) = n. Thus, the value of is nonunique.∇ V¯i (xe ) in (8.61) and (8.64) are non-unique. Then,. Q¯ i in (8.62) are not unique. Note that .q(xe ) is non-zero in (8.62). Moreover, . Q¯ i is unique if rank.(g(xe )g T (xe )) = n. Thus, the equivalent weight . Q¯ i might not be unique, which can be infinitely many. Note that as shown in Theorem 8.4, Algorithm 8.3 stops as long as one of them is approximated. ◻

Theorem 8.6 (Stability of Algorithm 8.3) When applying Algorithm 8.3, the learner system (8.49) is asymptotically stable with .u ikl updated by (8.54) at each outer-loop iteration .k and inner-loop iteration .l, where .k = 0, 1, . . ., .l = 0, 1, . . ., and .i ∈ N . Proof With the initials of Algorithm 8.3 at outer-loop iteration .k = 0, the inner-loop iterations make sure that each updated.u ikl asymptotically stabilizes the learner system (8.49) (Lian et al. 2022b). Theorem 8.4 shows that . Q ik > 0 holds for all .k = 0, 1, . . .. This implies that the inner-loop iterations at all the outer-loop iterations .k yield the control inputs .u ikl that asymptotically stabilize the learner system (8.49). Therefore, the learner system (8.49) is asymptotically stable when Algorithm 8.3 is applied. ◻

8.3.3 Model-Free Off-Policy Integral Inverse Reinforcement Learning Here, a completely model-free inverse RL algorithm based on the model-based Algorithm 8.3 is to solve Problem 8.2. The model-free inverse RL algorithm has the following two steps. Step 1: Find a model-free equation that replaces (8.53) and (8.54). Off-policy integral RL (Jiang and Jiang 2012) is used in Algorithm 8.3’s inner loops. Toward this end, the learner system can be rewritten as

.

x˙ = f (x) +

N ∑

g j (x)u klj +

j=1

N ∑

) ( g j (x) u j − u klj ,

(8.65)

j=1

where .u klj ∈ Rm is the inputs to be updated, and .u j is the behavior inputs to generate data. Substituting (8.65) into (8.53) yields 0=q

.

T

(x)Q ik q(x)

+

N ∑

(u klj )T R j u klj

j=1

⎞ N N ∑ ∑ ) ( ) ( T + ∇Vikl (x) ⎝ f (x) + g j (x)u klj + g j (x) u j − u klj ⎠ , ⎛

j=1

j=1

(8.66)

250

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

which is integrated from .t to .t + T to yield ʃ 0=

t+T

.

t

N ( ) ∑ T k q (x)Q i q(x) + (u klj )T R j u klj dτ j=1

+ Vikl (x(t + T )) − Vikl (x(t))τ ʃ t+T N ( ) )T ∑ ( + ∇Vikl (x) g j (x) u j − u klj dτ. t

(8.67)

j=1

We define the following additional control inputs 1 u k(l+1) = − R −1 g T (x)∇Vikl (x). 2 j j

(8.68)

. ij



Note that when . j = i, .u ik(l+1) = u iik(l+1) . With such a denotation, we enable to solve both heterogeneous and homogeneous multiplayer systems as shown in Definition 8.2. Further substituting (8.68) into (8.67) yields ʃ kl . Vi (x(t

ʃ

+ T )) −

t+T

= t

(

Vikl (x(t))

+

− q T (x)Q ik q(x) −

t+T

t N ∑

2(u ik(l+1) )T R j j

N ∑

(u j − u klj ) dτ

j=1

)

(u klj )T R j u klj dτ.

(8.69)

j=1

This equation allows to solve the solution .(Vikl , .u ik(l+1) ) .→ .(Vik , u ik ) for all .i, j ∈ N given the current estimate . Q ik . Note that when . j /= i, .u ik(l+1) can be solved in (8.69) j but is not used in the algorithm. Step 2: Use the converged control policy .u ik from Step 1 with observed .(xe , u ie ); (8.55) is used to update . Q ik+1 . Note that the computation of .u ik (xe ) will be shown as follows. The model-free off-policy integral inverse RL algorithm for nonlinear . N -player non-zero-sum game systems is shown below. The next theorem shows that Algorithm 8.4 has the same convergence property as Algorithm 8.3. Theorem 8.7 (Convergence of Algorithm 8.4) Algorithm 8.4 converges to Algorithm 8.3 and obtains the same convergence.

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

251

Algorithm 8.4 Model-free integral inverse RL algorithm for nonlinear N -player games 1. Initialization: Select Q i0 > 0 ∈ Rn×n , any R1 ∈ Rm 1 ×m 1 , . . . , R N ∈ Rm N ×m N , any stabilizing u i00j , and small thresholds εi , ei , ∀i, j ∈ N . Set k = 0. Use any stabilizing u j in (8.65); 2. Outer-loop iteration k based on IOC 3. Inner-loop iteration l using optimal control: given k, set l = 0; 4. Solve the N -tuple of costs and inputs using (8.69); k(l−1) ‖ ≤ εi , then set u ik = u ikl and Vik = Vikl , otherwise set l ← l + 1 5. Stop if ‖Vikl − Vi and go to step 4; 6. Outer-loop iteration k based on IOC: Update the state-penalty weight Q ik+1 using (xe , u ie ) by (8.55); (k+1)0 = u ik , k ← k + 1 and go to step 3. 7. Stop if ‖u ik − u ie ‖ ≤ ei , otherwise set u i

Proof Taking the limit to (8.69) of the time interval .T yields ʃ t+T ∑ N k(l+1) T ) R j (u j − u klj ) dτ Vikl (x(t + T )) − Vikl (x(t)) j=1 (u i j . lim − lim t T →0 T →0 T T ) ʃ t+T ( T ∑N k kl T kl q (x)Q i q(x) + j=1 (u j ) R j u j dτ t + lim T →0 T = 0. (8.70) One has ⎛ ⎞ N N ∑ ∑ ˙ikl (x(t)) − .V (u ik(l+1) )T R j (u j − u klj ) + ⎝q T (x)Q ik q(x) + (u klj )T R j u klj ⎠ j j=1

(

= ∇Vikl (x)

)T

j=1

⎛ ⎝ f (x) +

N ∑ j=1



N ∑

g j (x)u klj + ⎛

N ∑



g j (x)(u j − u klj )⎠

j=1

(u ik(l+1) )T R j (u j − u klj ) + ⎝q T (x)Q ik q(x) + j

j=1

= q T (x)Q ik q(x) +

⎞ (u klj )T R j u klj ⎠

j=1 N ∑

)T (

( (u klj )T R j u klj + ∇Vikl (x)

j=1

= 0,

N ∑

f (x) +

N ∑

g j (x)u klj

)

j=1

(8.71)

which implies that (8.69) is equivalent to (8.53). In addition, the outer-loop formulation is the same. This shows that Algorithm 8.4 and Algorithm 8.3 have the same ◻ solutions.

252

8 Inverse Reinforcement Learning for Multiplayer Non-Zero-Sum Games

Algorithm Implementation via Neural Networks (NNs) We now implement Algorithm 8.4 via NNs. We design two NN-based approximain (8.68). According to Werbos (1974), the two tors for .Vikl in (8.69) and .u ik(l+1) j approximators are defined as .

Vˆikl = (Wikl )T ϕi (x),

.u ˆ ik(l+1) j

(8.72a)

= (Z iklj )T φi j (x),

(8.72b)

where .Wikl ∈ R L i and . Z iklj ∈ R Si j ×m j are weights. .ϕi (x) ∈ R L i = [ϕi1 (x), ϕi2 (x), T S T .. . ., .ϕi L (x)] and .φi j (x) ∈ R i j = [φi 1 (x), φi 2 (x), . . . , φi S (x)] are basis function i ij △





vectors. In addition, if .i = j, .uˆ ik(l+1) = uˆ ik(l+1) , . Z ikl = (Z iklj ), .ϕi (x) = ϕi j (x), and j △

S = Si . With approximators (8.72a) and (8.72b), we rewrite (8.69) with residual errors .e ˜klj as . ij

ʃ [ϕi (x(t + T )) − ϕi (x(t))]T Wikl −

t+T

N ∑

.



ʃ

t+T

=− t

N ∑

⎝q T (x)Q ik q(x) +

t

j=1

kj

(φi j )T Z iklj R j (u j − u j ) dτ ⎞

(u klj )T R j u klj ⎠ dτ + e˜ikl .

(8.73)

j=1

Using Kronecker product, it can be rewritten as ʃ [ϕi (x(t + T )) − ϕi (x(t))]

.

T

Wikl



ʃ

t+T

=− t

⎝q T (x)Q ik q(x) +

− t

N ∑

t+T

N ∑

(φi j ⊗ (R j (u j − u klj )))T dτ vec(Z iklj )

j=1



(u klj )T R j u klj ⎠ dτ + e˜ikl .

(8.74)

j=1

( ) kl This is used to solve the unknown weights . Wikl , Z i1 , . . . , Z iklN for all .i ∈ N at inner loops given . Q ik for outer .k-th loop by using batch least squares. Based on the method of weighted residuals (Luo et al. 2014; Moerder and Calise 1985), these weights are determined letting .ϵ˜ikl ≡ 0 in (8.74). In detail, we define the following data sets with . yi data tuples ⎡ ⎡ ⎡ ⎤ ⎤ ⎤ h iklj |t+T δi |t+T gg i |t+T t t t t+2T ⎥ t+2T ⎥ t+2T ⎥ ⎢ δi |t+T ⎢ h iklj |t+T ⎢ ⎢ ⎢ ⎥ ⎥ kl ⎢ gg i |t+T ⎥ , G .△i = ⎢ = .. . ⎢ ⎥ , Hikkl = ⎢ ⎥ ⎥, .. .. ⎣ ⎣ ⎣ ⎦ ⎦ i ⎦ . . t+y T

δi |t+(yi i −1)T

[ Hikl = △i Hi1kl . . . Hiklj

t+y T

h iklj |t+(yi i −1)T ] . . . HiklN ,

t+y T

gg i |t+(yi i −1)T

(8.75)

8.3 Off-Policy Inverse Reinforcement Learning for Nonlinear …

253

where δ |

t+yi T

. j t+(y −1)T i

t+y T

h iklj |t+(yi i −1)T

= ϕ Tj (x(t + yi T )) − ϕ Tj (x(t + (yi − 1)T )), ʃ t+yi T =− (φi j ⊗ (Ri j (u j − u klj )))T dτ, ʃ

t+y T

gg i |t+(yi i −1)T = −

t+(yi −1)T t+yi T t+(yi −1)T

N ( ) ∑ q T (x)Q ik q(x) + (u klj )T R j u klj dτ. j=1

Based on (8.74)–(8.75), (8.73) can be solved using batch least squares as [ ]T ( )−1 kl T (Wikl )T vec(Z i1 ) . . . vec(Z iklN )T = (Hikl )T Hikl (Hikl )T G ikl

.

(8.76)

( ) kl to solve . Wikl , Z i1 , . . . , Z iklN at each .l. The uniqueness of the solution is guaranteed by injecting probing ∑ noise for the persistence of excitation of system data and collecting. yi ≥ L i + Nj=1 m j × Si j data tuples in (8.75), such that.(Hikl )T Hikl in (8.76) ∑ has full rank of . L i + Nj=1 m j × Si j . At each outer .k-th loop, inner loops have con( ) ( ) kl k verged weights, i.e., . Wikl , Z i1 , . . . , Z iklN → Wik , Z i1 , . . . , Z ikN . This solves the estimated .Vik and .u ikj . Then, using measurement data of expert’s demonstrations .(xe , u 1e , . . . , u N e ), we use (8.72b) to solve . Q ik+1 in (8.55) by q T (xe )Q ik+1 q(xe ) ( )T ( ) = αi u ie (xe ) − (Z ik )T φi (xe ) Ri u ie (xe ) − (Z ik )T φi (xe ) + q T (xe )Q ik q(xe ), (8.77)

which can be rewritten as
$$\mathrm{vev}\big(q(x_e(t))\big)^T\,\mathrm{vem}\big(Q_i^{k+1}\big) = \gamma_i, \qquad (8.78)$$

where
$$\gamma_i(t) = \alpha_i\big(u_{ie}(x_e(t)) - (Z_i^k)^T\varphi_i(x_e(t))\big)^T R_i\big(u_{ie}(x_e(t)) - (Z_i^k)^T\varphi_i(x_e(t))\big).$$

To use the batch least squares method to solve $Q_i^{k+1}$, we define
$$X_e = \begin{bmatrix} \mathrm{vev}\big(q(x_e(t))\big)^T \\ \mathrm{vev}\big(q(x_e(t+T))\big)^T \\ \vdots \\ \mathrm{vev}\big(q(x_e(t+(\omega-1)T))\big)^T \end{bmatrix},\quad \Gamma_i = \begin{bmatrix} \gamma_i(t) \\ \gamma_i(t+T) \\ \vdots \\ \gamma_i(t+(\omega-1)T) \end{bmatrix}. \qquad (8.79)$$


Similarly, we sample $\omega \ge n(n+1)/2$ data tuples in (8.79), such that $\mathrm{rank}(X_e^T X_e) = n(n+1)/2$. We thus uniquely solve $Q_i^{k+1}$ by
$$\mathrm{vem}\big(Q_i^{k+1}\big) = \big(X_e^T X_e\big)^{-1} X_e^T \Gamma_i. \qquad (8.80)$$
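For concreteness, the sketch below shows how the two batch least-squares solves could be coded once the data matrices have been assembled from measured trajectories. It is a minimal, hypothetical illustration: `H_i` and `G_i` stand for $H_i^{kl}$ and $G_i^{kl}$ in (8.75)–(8.76), `X_e` and `Gamma_i` stand for $X_e$ and $\Gamma_i$ in (8.79)–(8.80), and the integration and data-collection steps are assumed to be done elsewhere.

```python
import numpy as np

def inner_loop_bls(H_i, G_i, L_i, sizes):
    """Solve (8.76): recover W_i^{kl} and the Z_ij^{kl} by batch least squares.

    H_i   : (y_i, L_i + sum_j m_j*S_ij) regressor matrix built as in (8.75)
    G_i   : (y_i,) vector of integrated costs gg_i from (8.75)
    sizes : list of (S_ij, m_j) pairs, one per player, used to unstack each Z_ij
    """
    theta, *_ = np.linalg.lstsq(H_i, G_i, rcond=None)  # equals (H^T H)^{-1} H^T G under full column rank
    W, Z, k = theta[:L_i], [], L_i
    for S_ij, m_j in sizes:
        Z.append(theta[k:k + S_ij * m_j].reshape(S_ij, m_j))
        k += S_ij * m_j
    return W, Z

def outer_loop_bls(X_e, Gamma_i):
    """Solve (8.80): vem(Q_i^{k+1}) = (X_e^T X_e)^{-1} X_e^T Gamma_i."""
    vem_Q, *_ = np.linalg.lstsq(X_e, Gamma_i, rcond=None)
    return vem_Q

# Hypothetical usage with placeholder data: L_i = 3 value-basis terms and
# three players, each with S_ij = 3 input-basis terms and m_j = 1 input.
rng = np.random.default_rng(1)
L_i, sizes = 3, [(3, 1), (3, 1), (3, 1)]
y_i = 40                                   # y_i >= L_i + sum_j m_j * S_ij data tuples
H_i = rng.standard_normal((y_i, L_i + 9))
G_i = rng.standard_normal(y_i)
W, Z = inner_loop_bls(H_i, G_i, L_i, sizes)

omega, n = 10, 2                           # omega >= n(n+1)/2 expert data tuples
X_e = rng.standard_normal((omega, n * (n + 1) // 2))
Gamma_i = rng.standard_normal(omega)
vem_Q = outer_loop_bls(X_e, Gamma_i)
print(W.shape, [z.shape for z in Z], vem_Q.shape)
```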

Remark 8.9 Referring to the convergence of neural network approximation in Luo et al. (2014), we can infer that $\hat V_i^{kl} \to V_i^k$ and $\hat u_{ij}^{k(l+1)} \to u_{ij}^k$ as $L_i \to \infty$ and $S_{ij} \to \infty$ in (8.72). Then, $Q_i^{k+1}$ in (8.80) uniformly converges to that in (8.55). Thus, the solutions of the implementations (8.76) and (8.80) converge to those of Algorithm 8.4 and Algorithm 8.3 based on Theorem 8.7. This also implies that the neural network solutions are stabilizing (Luo et al. 2014).

8.3.4 Simulation Examples

An example of a heterogeneous multiplayer game system is provided to verify Algorithm 8.4. Consider both the learner and the expert to be the three-player nonlinear system
$$\dot s = f(s) + g_1(s)v_1 + g_2(s)v_2 + g_3(s)v_3, \qquad (8.81)$$
where $s$ denotes $x$ or $x_e$, $v_i$ denotes $u_i$ or $u_{ie}$ for $i \in \{1, 2, 3\}$, and
$$f(s) = \begin{bmatrix} -s_1 \\ -s_2^3 + s_2 \end{bmatrix}, \quad g_1(s) = \begin{bmatrix} 0 \\ s_1 \end{bmatrix}, \quad g_2(s) = \begin{bmatrix} 0 \\ s_2 \end{bmatrix}, \quad g_3(s) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$
with $s_n$ the $n$-th element of the state $x$ or $x_e$.

The cost functions of the expert's three players are $V_{1e} = 0.25x_{e1}^4 + x_{e2}^2$, $V_{2e} = 0.25x_{e1}^4 + x_{e2}^2$, and $V_{3e} = 0.25x_{e1}^4 + 0.5x_{e2}^2$, with $q(x_e) = [x_{e1}^2 \;\; x_{e2}^2]^T$ and $R_{1e} = R_{2e} = R_{3e} = 1$. The corresponding optimal control inputs are $u_{1e} = -x_{e1}x_{e2}$, $u_{2e} = -x_{e2}^2$, and $u_{3e} = -0.5x_{e2}$. Set $\varphi_1 = x_1x_2$, $\varphi_{12} = x_2^2$, $\varphi_{13} = x_2$, $\varphi_{21} = x_1x_2$, $\varphi_2 = x_2^2$, $\varphi_{23} = x_2$, $\varphi_{31} = x_1x_2$, $\varphi_{32} = x_2^2$, and $\varphi_3 = x_2$. Set $R_1 = R_2 = R_3 = 1$, $\alpha_1 = \alpha_2 = \alpha_3 = 1$, $T = 0.001$ s, $e_1 = e_2 = e_3 = 0.003$, and $\varepsilon_1 = \varepsilon_2 = \varepsilon_3 = 0.003$.

Figure 8.6 shows the convergence of the learned control input NN weights $Z_i^k$. Figure 8.7 shows the convergence of the learned cost penalty weights $Q_i^k$. Figure 8.8 shows that the learner satisfactorily imitates the expert using the learned control inputs. The learned values are

$$Q_1^\infty = \begin{bmatrix} 0.6173 & 0.5456 \\ 0.5456 & 2.9792 \end{bmatrix}, \quad Q_2^\infty = \begin{bmatrix} 0.1671 & 0.5479 \\ 0.5479 & 2.9878 \end{bmatrix}, \quad Q_3^\infty = \begin{bmatrix} 1.3023 & -0.0123 \\ -0.0123 & 0.9490 \end{bmatrix}$$
and
$$Z_1^\infty = -0.9965, \quad Z_{12}^\infty = -0.9968, \quad Z_{13}^\infty = -0.9945,$$
$$Z_{21}^\infty = -0.9975, \quad Z_2^\infty = -0.9970, \quad Z_{23}^\infty = -0.9985,$$
$$Z_{31}^\infty = -0.4851, \quad Z_{32}^\infty = -0.4758, \quad Z_3^\infty = -0.4857.$$

[Fig. 8.6 Convergence of the control input NN weights $Z_i^k$]

[Fig. 8.7 Convergence of the cost function weights $Q_i^k$, where $i \in \{1, 2, 3\}$]
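As a minimal reproduction aid (a sketch, not the code used for the chapter), the expert side of (8.81) can be simulated directly under the stated optimal policies $u_{1e} = -x_{e1}x_{e2}$, $u_{2e} = -x_{e2}^2$, and $u_{3e} = -0.5x_{e2}$; the initial condition and time grid below are assumed for illustration only, so the resulting trajectories will only qualitatively resemble Fig. 8.8.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(s):
    # drift dynamics of (8.81)
    return np.array([-s[0], -s[1]**3 + s[1]])

def g(s):
    # input matrices g1(s), g2(s), g3(s) stacked as columns
    return np.array([[0.0, 0.0, 0.0],
                     [s[0], s[1], 1.0]])

def expert_policy(s):
    # closed-form expert optimal controls stated in the text
    return np.array([-s[0] * s[1], -s[1]**2, -0.5 * s[1]])

def expert_dynamics(t, s):
    return f(s) + g(s) @ expert_policy(s)

# assumed initial state and horizon (Fig. 8.8 spans roughly 0-0.2 s)
sol = solve_ivp(expert_dynamics, (0.0, 0.2), [2.0, -1.0], max_step=1e-3)
print(sol.y[:, -1])   # expert state at t = 0.2 s
```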


[Fig. 8.8 Trajectories of the expert and the learner using the converged control policies (states versus time over 0–0.2 s)]

References

Başar T, Olsder GJ (1998) Dynamic noncooperative game theory. SIAM
Bittanti S, Laub AJ, Willems JC (2012) The Riccati equation. Springer
Deimling K (2010) Nonlinear functional analysis. Courier Corporation
Isaacs R (1999) Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization. Courier Corporation
Jiang Y, Jiang ZP (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10):2699–2704
Johnson M, Aghasadeghi N, Bretl T (2013) Inverse optimal control for deterministic continuous-time nonlinear systems. In: IEEE conference on decision and control, pp 2906–2913
Kantorovitch L (1939) The method of successive approximation for functional equations. Acta Math 71(1):63–97
Kiumarsi B, Lewis FL, Jiang ZP (2012) H∞ control of linear discrete-time systems: off-policy reinforcement learning. Automatica 78(4):144–152
Lancaster P, Rodman L (1995) Algebraic Riccati equations. Clarendon Press
Lewis FL, Vrabie D (2020) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 9(3):32–50
Lewis FL, Vrabie D, Syrmos VL (2012) Optimal control. Wiley
Li J, Xiao Z, Fan J, Chai T, Lewis FL (2022) Off-policy Q-learning: solving Nash equilibrium of multi-player games with network-induced delay and unmeasured state. Automatica 136:110076
Lian B, Donge VS, Lewis FL, Chai T, Davoudi A (2022a) Data-driven inverse reinforcement learning control for linear multiplayer games. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3186229
Lian B, Xue W, Lewis FL, Chai T (2022b) Inverse reinforcement learning for multi-player noncooperative apprentice games. Automatica 145:110524
Lian B, Donge VS, Xue W, Lewis FL, Davoudi A (2022c) Distributed minmax strategy for multiplayer games: stability, robustness, and algorithms. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3215629
Lian B, Donge VS, Lewis FL, Chai T, Davoudi A (2022d) Inverse reinforcement learning control for linear multiplayer games. In: IEEE 61st conference on decision and control, pp 2839–2844


Liu D, Li H, Wang D (2014) Online synchronous approximate optimal learning algorithm for multi-player non-zero-sum games with unknown dynamics. IEEE Trans Syst Man Cybern Syst 44(8):1015–1027
Luo B, Wu HN, Huang T (2014) Off-policy reinforcement learning for H∞ control design. IEEE Trans Cybern 45(1):65–76
Modares H, Lewis FL, Jiang ZP (2012) H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans Neural Netw Learn Syst 26(10):2550–2562
Moerder DD, Calise AJ (1985) Convergence of a numerical algorithm for calculating optimal output feedback gains. IEEE Trans Autom Control AC-30(9):900–903
Odekunle A, Gao W, Davari M, Jiang ZP (2020) Reinforcement learning and non-zero-sum game output regulation for multi-player linear uncertain systems. Automatica 112:108672
Starr AW, Ho YC (1969) Nonzero-sum differential games. J Optim Theory Appl 3(3):184–206
Vamvoudakis KG, Lewis FL (2014) Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations. Automatica 47(8):1556–1569
Vrabie D, Pastravanu O, Abu-Khalaf M, Lewis FL (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2):477–484
Wei Q, Li H, Y X, He H (2020) Continuous-time distributed policy iteration for multicontroller nonlinear systems. IEEE Trans Cybern 51(5):2372–2383
Werbos P (1974) New tools for prediction and analysis in the behavioral sciences. Ph.D. Dissertation, Harvard University
Wu HN, Luo B (2012) Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control. IEEE Trans Neural Netw Learn Syst 23(12):1884–1895
Yang Y, Vamvoudakis KG, Modares H (2020) Safe reinforcement learning for dynamical games. Int J Robust Nonlinear Control 30(9):3706–3726

Appendix A

Some Useful Facts in Matrix Algebra

The Kronecker product of two matrices $A = [a_{ij}] \in \mathbb{R}^{m \times n}$ and $B = [b_{ij}] \in \mathbb{R}^{p \times q}$ is
$$A \otimes B = [a_{ij}B] \in \mathbb{R}^{mp \times nq} \qquad (A.1)$$
or
$$A \otimes B = [A b_{ij}] \in \mathbb{R}^{mp \times nq}. \qquad (A.2)$$
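A quick numerical illustration of (A.1) (a sketch added here, not part of the original text): for a $2 \times 2$ matrix $A$ and a $2 \times 3$ matrix $B$, `np.kron(A, B)` returns the $4 \times 6$ block matrix $[a_{ij}B]$.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1, 2],
              [3, 4, 5]])

K = np.kron(A, B)   # block matrix [a_ij * B] of shape (mp, nq) = (4, 6)
print(K.shape)      # (4, 6)
print(np.array_equal(K[:2, :3], 1 * B),   # top-left block is a_11 * B
      np.array_equal(K[2:, 3:], 4 * B))   # bottom-right block is a_22 * B
```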

Define $x = [x_1 \;\, x_2 \;\, \ldots \;\, x_n]^T \in \mathbb{R}^n$ as a vector, $s \in \mathbb{R}$ as a scalar, and $f(x) \in \mathbb{R}^m$ as an $m$-vector function of $x$. The differential of $x$ is
$$dx = \begin{bmatrix} dx_1 \\ dx_2 \\ \vdots \\ dx_n \end{bmatrix}. \qquad (A.3)$$
The derivative of $x$ with respect to time $t$ is
$$\frac{dx}{dt} = \begin{bmatrix} dx_1/dt \\ dx_2/dt \\ \vdots \\ dx_n/dt \end{bmatrix}. \qquad (A.4)$$

The partial derivative of the scalar $s$ with respect to $x$ (where $s$ is a function of $x$) is
$$\frac{\partial s}{\partial x} = \begin{bmatrix} \partial s/\partial x_1 \\ \partial s/\partial x_2 \\ \vdots \\ \partial s/\partial x_n \end{bmatrix}. \qquad (A.5)$$


The differential of a scalar function $y$ of $x$ is
$$dy = \Big(\frac{\partial y}{\partial x}\Big)^T dx = \sum_{i=1}^{n}\frac{\partial y}{\partial x_i}\,dx_i. \qquad (A.6)$$

Let $z$ be a function of two vector variables $x$ and $y$. Then
$$dz = \Big(\frac{\partial z}{\partial x}\Big)^T dx + \Big(\frac{\partial z}{\partial y}\Big)^T dy. \qquad (A.7)$$
The second derivative is
$$\frac{\partial^2 z}{\partial x^2} = \Big[\frac{\partial^2 z}{\partial x_i\,\partial x_j}\Big]. \qquad (A.8)$$
The Jacobian matrix of $H \in \mathbb{R}^m$ with respect to $x$ is defined as
$$\frac{\partial H}{\partial x} = \Big[\frac{\partial H}{\partial x_1} \;\; \frac{\partial H}{\partial x_2} \;\; \cdots \;\; \frac{\partial H}{\partial x_n}\Big] \in \mathbb{R}^{m \times n} \qquad (A.9)$$

and
$$\frac{\partial H^T}{\partial x} \triangleq \Big(\frac{\partial H}{\partial x}\Big)^T \in \mathbb{R}^{n \times m}. \qquad (A.10)$$
Furthermore,
$$dH = \frac{\partial H}{\partial x}\,dx = \sum_{i=1}^{n}\frac{\partial H}{\partial x_i}\,dx_i. \qquad (A.11)$$
Let $y$ be a vector and $A$ be a matrix. Then one has
$$\frac{\partial}{\partial x}(y^T x) = \frac{\partial}{\partial x}(x^T y) = y \qquad (A.12)$$
$$\frac{\partial}{\partial x}(y^T A x) = \frac{\partial}{\partial x}(x^T A^T y) = A^T y \qquad (A.13)$$
$$\frac{\partial}{\partial x}(y^T f(x)) = \frac{\partial}{\partial x}(f^T(x) y) = f_x^T y \qquad (A.14)$$
$$\frac{\partial}{\partial x}(x^T A x) = A x + A^T x. \qquad (A.15)$$

Let $Q$ be a symmetric matrix. Then, one has
$$\frac{\partial}{\partial x}(x^T Q x) = 2 Q x \qquad (A.16)$$
$$\frac{\partial}{\partial x}(x - y)^T Q (x - y) = 2 Q (x - y) \qquad (A.17)$$
$$\frac{\partial^2}{\partial x^2}(x^T Q x) = 2 Q \qquad (A.18)$$
$$\frac{\partial^2}{\partial x^2}(x - y)^T Q (x - y) = 2 Q. \qquad (A.19)$$


For any matrix $Q$:
• If $x^T Q x > (\ge)\;0$ holds for all nonzero $x$, then $Q > (\ge)\;0$.
• If $x^T Q x < (\le)\;0$ holds for all nonzero $x$, then $Q < (\le)\;0$.

If $x \in \mathbb{R}^n$ is a vector, then the square of the Euclidean norm is
$$\|x\|^2 = x^T x. \qquad (A.20)$$

Index

A Activation functions, 23, 24, 26, 43, 48, 96, 106, 127, 178, 179, 206, 209–213, 220 Actor NN, 50, 51, 54, 96, 100, 101, 125, 208, 211–216, 220, 221 Admissibility, 14, 82, 88 Admissible control law, 29, 31 Adversarial input, 183–185, 196–198, 201, 203, 206, 212, 217 Adversary neural network, 212 Aircraft control systems, 12 Algebraic Riccati equation (ARE), 1, 3, 11, 16–21, 23, 27, 28, 31, 56–58, 60, 61, 66, 71, 73, 77–79, 82, 83, 109, 119, 127– 131, 134, 136, 138, 139, 153, 156, 158, 163, 183, 185, 186, 190, 191, 233 Algorithm implementation, 237, 252 Approximation error, 23, 24, 43, 54, 96, 99, 100, 102, 125, 171, 176, 204, 209–213 Asymptotically stable, 17, 20, 29, 32–34, 58, 67, 73, 74, 78, 85, 90, 93, 114–117, 131, 133, 134, 136, 170, 191, 192, 201, 229, 233, 249 Attenuation condition, 111–114 Augmented system, 75, 76, 78, 81, 83, 86– 88, 90, 92, 106, 112, 113, 117, 118, 127 Auxiliary input, 139, 159, 160, 234, 235

B Basis function, 24, 43, 96, 125, 252 Batch least squares (BLS), 83, 142, 143, 154, 162, 171–173, 176, 193, 204, 237, 252, 253

Bellman equation, 5, 14, 15, 18–22, 24, 30, 39, 41–47, 56, 60, 61, 63–67, 76–81, 86, 88, 90, 94–98, 100, 102, 113, 121– 125, 160, 165, 174, 175, 186, 187, 193, 198, 202, 203, 208, 210, 213, 215, 217, 228

C Causal solution, 73, 76 Closed-loop system, 17, 27–29, 32–34, 51, 54, 58, 59, 63, 75, 100, 111, 114–116, 118, 119, 127, 131, 215, 216, 229 dynamics, 17, 119 Coefficient matrix, 23 Compact set, 23, 24, 43, 45, 96, 197, 207, 210–213, 219 Control input, 12–14, 18, 19, 22, 24, 28, 31, 32, 40–43, 52, 55–57, 61, 65, 67, 68, 71–73, 76–79, 81, 84–86, 88–90, 93– 95, 99–102, 105–107, 110–112, 115, 119, 121, 123, 124, 128–130, 132, 141, 144, 151–153, 156, 159, 160, 164, 165, 171, 173, 175–179, 183–185, 188, 190, 193–198, 203, 204, 206–209, 211, 220, 221, 223, 225–228, 234, 236, 238, 243– 245, 247, 249, 250, 254, 255 Control policy, 1, 3, 5, 18–22, 24, 28, 31, 42, 43, 45, 50, 56, 60–67, 72, 74, 78– 81, 83, 85, 88, 90, 94–96, 98, 99, 111, 113–115, 122–125, 127, 128, 130–132, 134, 137, 139, 141, 152, 155, 156, 170, 228–230, 238, 245, 246, 248, 250, 256



Controllability, 13 Convergence test, 42, 43 Coupled Hamilton–Jacobi equations, 244, 245 Coupled Riccati equations, 227–230, 232 Critic approximator, 193 Critic NN, 45–48, 50, 51, 54, 55, 96–101, 106, 125, 208–217, 220, 221, 225

D Data-driven off-policy IRL, 128, 138–140 Derivative, 14, 24, 28, 32, 52, 93, 94, 102, 103, 115, 153, 192, 217, 233, 235, 259, 260 Detectability, 13 Difference equation, 5, 21–24, 134 Discounted algebraic Riccati equation (Discounted ARE), 18, 19, 56, 57 Discounted optimal control, 56, 60 Discounted performance function, 55, 56, 60, 61, 86–88, 110, 112, 113 Distributed minmax strategy, 6, 109, 127, 128, 131, 133, 136, 139, 141 Disturbance attenuation, 111–114, 185 Drift dynamics, 40, 43, 72, 79, 83, 84, 86, 88, 101, 110, 208, 220

E Energy, 6, 12, 13, 85, 112 Equivalent weight, 154, 166, 167, 169, 170, 178, 186, 190, 191, 199, 206, 228–232, 242, 245–249 Euclidean norm, 261 Experience replay, 39, 45, 47, 48, 51, 54 Expert system, 152, 153, 159, 163–165, 168, 184, 196, 197, 216, 226, 239, 243 External disturbance, 110, 123, 183, 184

F Feedback control, 1–6, 11, 13, 27–29, 34, 67, 73, 86, 112, 153, 228 Forward-in-time, 39 Full system dynamics, 11, 20

G Gain margin (GM), 12, 131, 134, 136–138 Game algebraic Riccati equation (GARE), 119, 120 Generalized policy iteration, 45 Gradient descent, 46, 97, 154, 156, 211–214

Index H Hamilton–Jacobi–Bellman (HJB), 1–4, 15, 16, 18, 19, 21, 23, 24, 27, 30, 31, 35, 41, 42, 44, 52–54, 86, 88–91, 93–96, 99, 100, 102, 103, 105, 106, 165, 166, 168, 169, 178, 206 equation, 2, 3, 15, 16, 18, 19, 21, 23, 24, 27, 30, 31, 35, 41, 42, 44, 52, 86, 88–91, 93–96, 99, 100 PDE, 16, 21 Hamilton–Jacobi–Isaacs (HJI), 109, 112, 114–116, 118, 120, 121, 123, 124, 183, 197–200, 208 Hamiltonian dynamics, 15 Hamiltonian function, 14, 15, 18, 78, 89, 90, 117, 118, 129, 153, 185, 198 . H∞ tracking control, 6, 109–112, 114, 115 History stack, 47–49, 62, 63, 67

I Ill-posedness, 4, 5, 31, 35 Infinite-horizon quadratic performance index, 12, 29 Inner-loop iteration, 188, 189, 191, 192, 199, 200, 202, 203, 205, 208, 209, 234, 245, 246, 249, 251 Input constraints, 40, 84, 86, 87 Input–output data, 55, 193 Integral Bellman equation, 187, 208 Integral inverse RL, 173, 174, 177, 202, 205, 236, 250, 251 Integral reinforcement learning (IRL), 1, 3, 11, 18, 21, 23, 39, 42, 45, 55, 71, 79, 84, 96, 99, 106, 109, 110, 127, 138 Bellman equation, 21, 22, 24, 80, 81, 100, 122, 123 Integral reinforcement signal, 46 Inverse optimal control (IOC), 2, 4–6, 11, 27–32, 35, 151, 154, 155, 160, 161, 166, 167, 172, 174, 177, 183, 187, 188, 200, 203, 205, 209, 226, 230, 233, 245, 246, 251 Inverse optimality, 192, 193 Inverse Q-learning, 7, 184, 187–189, 193– 195 Inverse reinforcement learning policy iteration (Inverse RL PI), 154, 199, 229, 230, 245, 246

K Kleinman’s algorithm, 79

Kronecker product, 252, 259

L LaSalle’s extension, 93, 116, 134 Learner system, 153, 156, 157, 160, 164, 165, 184–186, 191–193, 196–198, 207, 210, 216, 227, 229, 233, 236, 244, 246, 249 Leibniz’s Formula, 14 L’Hopital’s rule, 66, 80, 123, 142, 175 Linear dynamics, 1, 17 Linear quadratic regulator (LQR), 2, 3, 6, 11–18, 20, 22, 23, 25, 26, 30, 55, 60, 74, 81, 131, 134, 138, 152, 155 Linear quadratic tracking (LQT), 71–83, 119, 121 ARE, 73, 77–79, 82, 83 Lipschitz functions, 197 . L 2 -gain, 114 . L 2 -gain, 110–114, 132 Lyapunov equation, 5, 20, 21, 30, 60, 66, 76, 79–81, 138, 139, 156, 229, 246

M Markov decision processes, 4, 5, 28, 151 MATLAB.® routine, 16, 18 Minimum attenuation level, 186 Minimum eigenvalue, 199 Minimum principle, 11, 14, 15, 18 Minimum singular value, 135 Minmax control, 128, 130, 132, 134, 137 Minmax strategy, 6, 109, 127–134, 136, 138, 139, 141, 145 Model-free off-policy, 55, 56, 60, 61, 64, 65, 159, 161, 173, 175, 196, 202, 203, 226, 234, 236, 239, 249 Multiplayer games, 6, 11, 109, 127–129, 131, 139, 144, 225, 236, 245, 254 Multiplayer non-zero-sum games, 7, 225, 226, 228, 243

N Nash condition, 113 Nash equilibrium, 109, 130, 183–188, 193, 197–199, 225, 227, 229, 233, 244 Near-optimal control law, 40, 50 Neural networks (NN), 7, 23, 43, 45, 96, 100, 105, 106, 121, 125, 127, 170, 202, 203, 207–209, 211–213, 243, 252, 254 Nonlinear dynamics, 1, 16, 220 Nonlinear performance index, 16

265 Nonlinear system, 3, 4, 6, 7, 12, 30–32, 40, 84, 85, 106, 111, 126, 152, 164, 179, 183, 196, 245, 254 Nonquadratic cost function, 40 Non-unique solutions, 155, 157, 168, 170, 190, 191, 201, 231, 248 Non-uniqueness, 199, 208, 232, 246 O Observability, 13, 40 Off-policy, 4, 6, 7, 39, 55, 56, 60, 61, 64, 65, 109, 110, 112, 121–128, 138–141, 152, 159–161, 164, 173–175, 177, 196, 202, 203, 205, 225, 226, 234–236, 238, 239, 243, 249, 250 Bellman equation, 60, 160, 174, 202 integral inverse RL, 173, 174, 202, 236, 250 inverse reinforcement learning (Offpolicy IRL), 6, 7, 109, 110, 121–123, 126–128, 138–141, 152, 159, 161, 164, 175, 196, 202, 203, 226, 234, 236, 238, 239, 243 IOC, 160 RL, 56, 60, 61, 64, 65, 112, 121, 124, 125, 238 Offline policy iteration (offline PI), 45, 79, 94, 95, 121 On-policy, 123, 124 Online Actor–Critic, 6, 84, 96, 225 Online Adaptive Inverse Reinforcement Learning, 207 Online IRL, 6, 24, 45, 79, 81, 86, 94, 99 Optimal control, 1, 2, 4–7, 11–16, 18, 23, 27–32, 35, 39, 41, 42, 44, 50, 54, 56, 57, 60, 61, 71, 72, 77, 79, 82, 83, 85, 89–91, 94, 99, 101, 102, 105, 109, 114– 117, 124, 127–130, 151–155, 159–161, 164–168, 171, 177, 183, 184, 188, 190, 194, 195, 198, 203, 209, 211, 220, 225, 226, 228, 230, 232, 234, 239, 244–247, 251, 254 learning, 154, 166, 245 Optimality, 1, 2, 12–14, 29, 77, 92, 111, 192, 193, 233 Optimal output-feedback, 55 Optimal tracking control problem (OTCP), 71, 72, 84–88, 90, 94, 99, 111 Orthogonal symmetric matrix, 91 Outer-loop iteration, 155, 166–168, 170, 172–175, 188, 189, 192, 199–203, 205, 208, 209, 245, 246, 249, 251 Output-feedback (OPFB), 6, 56, 61, 65–68

control, 65, 66, 68

P Partial derivative, 259 Partial differential equation (PDE), 15, 16, 19, 21–24, 42, 94 Bellman equation, 19, 21, 22, 24 Penalty weights, 29, 31, 152–155, 157, 163, 165, 169, 178, 179, 184, 186, 188–190, 192, 199, 208, 227, 230, 231, 236, 244– 246, 251, 254 Performance index, 12, 13, 17, 27–32, 34, 39, 40, 71, 72, 82, 90, 128 Persistence of the excitation condition, 40 Persistent of excitation (PE), 27, 81, 98, 105, 106, 204, 211, 216, 220 Policy evaluation, 3, 19, 42, 43, 60, 66, 79, 81, 82, 94, 121, 139, 167, 172, 188, 189, 200, 209, 226, 230, 235, 246 Policy evaluation:, 95, 200 Policy improvement, 3, 19, 20, 22, 24, 42, 43, 50, 60, 66, 79, 81, 94, 95, 121, 139, 167, 168, 172, 188, 189, 200, 209, 226, 230, 235, 246 Policy iteration (PI), 3, 6, 18–24, 42, 43, 45, 50, 54, 60, 71, 79, 82, 94, 121, 128, 138, 154, 166, 199, 225, 226, 229, 233, 245, 247 Pontryagin’s Minimum Principle, 2, 15, 118 Positive definite, 152, 168 Positive semi-definite, 199, 208 Prescribed performance function, 1, 2, 12 Probing noises, 48, 55, 61, 65, 67, 81, 141, 143, 145, 162, 193, 220, 226, 236–239, 253

Q Q-function, 187, 188, 190, 192, 193, 195 Quadratic form, 228, 247 Quadratic matrix equation, 16

R Recursive least squares (RLS), 25, 162 Reference trajectory, 71, 73–75, 81, 82, 84, 85, 93, 107, 110–112, 120, 127, 128 Regression vector, 25, 44 Reinforcement learning (RL), 1, 3, 5, 11, 15, 18, 21, 23, 28, 29, 39, 45, 55, 71, 79, 84, 96, 99, 106, 109, 110, 127, 138, 151, 154, 159, 164, 166, 173, 183, 196, 199,

Index 202, 207, 208, 225, 226, 229, 234, 243, 245, 249 Return difference equation, 134 Riccati equation, 1, 2, 11, 12, 16, 17, 27, 28, 71–73, 109, 119, 127, 129, 153, 183, 185, 227–229, 231, 232 Robustness, 12, 131, 134, 151, 226 S Saddle point, 113, 114, 185, 186, 197, 198, 208 Simulation examples, 7, 40, 48, 54, 67, 82, 110, 126, 128, 144, 163, 177, 194, 206, 220, 239, 254 Single-loop iterative structure, 234 Smooth function, 43, 96, 111 Stability, 5, 7, 12–14, 33, 50, 51, 54, 56–59, 78, 92–94, 97, 100, 101, 105, 106, 110, 112, 114–120, 128, 130–134, 136, 151, 152, 155, 156, 164, 168, 170, 191, 192, 199, 201, 208, 215, 216, 226, 230, 233, 247, 249 Stabilizability, 13 State feedback, 13, 17, 20, 65, 153 State reconstruction, 62 State-penalty neural network, 213 State-penalty weight approximator, 193, 194 State-penalty weight improvement, 155, 161, 167, 172, 174, 177, 200, 203, 205, 209, 226, 230, 234 State-space description, 11 State-space dynamics, 12 Static OPFB, 56, 67 Stationarity condition, 15, 16, 77, 89, 100, 114 System dynamic matrices, 56, 128 System identification, 12, 18, 25 T Taylor series, 23 Temporal difference (TD), 46, 97, 125 Tracking Bellman equation, 88, 94–98, 102, 123, 124 Tracking error, 71, 72, 77, 79, 85–87, 93, 94, 101, 110–112, 114, 116, 119 Transfer function, 12 Two-loop iteration structure, 155 Two-player zero-sum game, 6, 7, 11, 109, 113, 114, 128, 183–185, 188, 189, 196, 198–200, 203, 207

U Uniformly approximate solution, 246, 248 Uniformly ultimate bounded (UUB), 51, 54, 101, 105 stability, 54 Utility function, 72, 80

V Value function, 14, 16–25, 39, 42, 43, 45, 46, 50, 52, 54, 56, 57, 64–67, 73, 74, 76, 78, 80, 81, 83, 91, 92, 94–97, 99, 100, 102, 113, 114, 116, 121, 122, 124–126, 175, 178, 186–188, 192, 203, 206, 209, 220, 235

Value function approximation (VFA), 18, 23, 24, 26, 43, 45, 62, 67, 96 Value iteration, 3, 82, 188

W Weierstrass approximation theorem, 23 Weight vector, 24, 26, 125, 171, 193, 210

Y Young’s inequality, 59, 104, 218, 219

Z Zero-state observability, 40