Advanced Optimal Control and Applications Involving Critic Intelligence 9789811972904, 9789811972911


English Pages [283] Year 2023


Table of contents :
Preface
Acknowledgements
Contents
1 On the Critic Intelligence for Discrete-Time Advanced Optimal Control Design
1.1 Introduction
1.2 Discrete-Time Optimal Regulation Design
1.3 Discrete-Time Trajectory Tracking Design
1.4 The Construction of Critic Intelligence
1.4.1 Basis of Reinforcement Learning
1.4.2 The Neural Network Approximator
1.4.3 The Critic Intelligence Framework
1.5 The Iterative Adaptive Critic Formulation
1.6 Significance and Prospects
References
2 Event-Triggered Adaptive Optimal Regulation of Constrained Affine Systems
2.1 Introduction
2.2 Problem Description
2.3 Stability Analysis of the Event-Triggered Control System
2.4 Event-Triggered Constrained Control Design via HDP
2.4.1 The Proposed Control Structure
2.4.2 Neural Network Implementation
2.4.3 The Whole Design Procedure
2.5 Simulation Experiments
2.6 Conclusions
References
3 Self-Learning Optimal Regulation with Event-Driven Iterative Adaptive Critic
3.1 Introduction
3.2 Problem Description
3.3 Event-Driven Iterative Adaptive Critic Design via DHP
3.3.1 Derivation and Convergence Discussion
3.3.2 Neural Network Implementation
3.4 Event-Based System Stability Analysis
3.5 Special Discussion of the Affine Nonlinear Case
3.6 Highlighting the Mixed-driven Control Framework
3.7 Simulation Experiments
3.8 Conclusions
References
4 Near-Optimal Regulation with Asymmetric Constraints via Generalized Value Iteration
4.1 Introduction
4.2 Problem Statement
4.3 Properties of the Generalized Value Iteration Algorithm with the Discount Factor
4.3.1 Derivation of the Generalized Value Iteration Algorithm
4.3.2 Properties of the Generalized Value Iteration Algorithm
4.3.3 Implementation of the Generalized Value Iteration Algorithm
4.4 Simulation Studies
4.5 Conclusion
References
5 Nonaffine Neuro-Optimal Tracking Control with Accuracy and Stability Guarantee
5.1 Introduction
5.2 Problem Formulation
5.3 Neuro-Optimal Tracking Control Based on the Value Iteration Algorithm
5.3.1 The Value Iteration Algorithm for Tracking Control
5.3.2 Convergence of the Value Iteration Algorithm
5.3.3 Closed-Loop Stability Analysis with Value Iteration
5.4 Neural Network Implementation of the Iterative HDP Algorithm
5.4.1 The Critic Network
5.4.2 The Action Network
5.5 Simulation Experiments
5.5.1 Example 1
5.5.2 Example 2
5.6 Conclusions
References
6 Data-Driven Optimal Trajectory Tracking via a Novel Self-Learning Approach
6.1 Introduction
6.2 Problem Description
6.3 The Optimal Tracking Control Based on the Iterative DHP Algorithm
6.3.1 Derivation of the Iterative ADP Algorithm
6.3.2 Derivation of the Iterative DHP Algorithm
6.4 Data-Based Iterative DHP Implementation
6.4.1 Neuro-Identifier for Estimation of Nonlinear Dynamics
6.4.2 The Critic Network
6.4.3 The Action Network
6.5 Simulation Studies
6.6 Conclusion
References
7 Adaptive Critic with Improved Cost for Discounted Tracking and Novel Stability Proof
7.1 Introduction
7.2 Problem Formulation and VI-Based Adaptive Critic Scheme
7.3 Novel Stability Analysis of VI-Based Adaptive Critic Designs
7.4 Discounted Tracking Control for the Special Case of Linear Systems
7.5 Simulation Studies
7.5.1 Example 1
7.5.2 Example 2
7.6 Conclusions
References
8 Iterative Adaptive Critic Control Towards an Urban Wastewater Treatment Plant
8.1 Introduction
8.2 Platform Description with Control Problem Statement
8.2.1 Platform Description
8.2.2 Control Problem Statement
8.3 The Data-Driven IAC Control Method
8.4 Application to the Proposed Wastewater Treatment Plant
8.5 Revisiting Wastewater Treatment via Mixed Driven NDP
8.6 Conclusions
References
9 Constrained Neural Optimal Tracking Control with Wastewater Treatment Applications
9.1 Introduction
9.2 Problem Statement with Asymmetric Control Constraints
9.3 Intelligent Optimal Tracking Design
9.3.1 Description of the Iterative ADP Algorithm
9.3.2 DHP Formulation of the Iterative Algorithm
9.4 Neural Network Implementation
9.5 A Wastewater Treatment Application
9.6 Conclusion
References
10 Data-Driven Hybrid Intelligent Optimal Tracking Design with Industrial Applications
10.1 Introduction
10.2 Problem Statement
10.3 Offline Learning of the Pre-designed Controller
10.4 Online Near-Optimal Tracking Control with Stability Analysis
10.4.1 The Critic Network
10.4.2 The Action Network
10.4.3 Uniformly Ultimately Bounded Stability of Weight Estimation Errors
10.4.4 Summary of the Proposed Tracking Approach
10.5 Experimental Simulation
10.5.1 Application to a Torsional Pendulum Device
10.5.2 Application to a Wastewater Treatment Plant
10.6 Concluding Remarks
References
Appendix Index
Index


Intelligent Control and Learning Systems 6

Ding Wang · Mingming Ha · Mingming Zhao

Advanced Optimal Control and Applications Involving Critic Intelligence

Intelligent Control and Learning Systems Volume 6

Series Editor: Dong Shen, School of Mathematics, Renmin University of China, Beijing, China

The Springer book series Intelligent Control and Learning Systems addresses the emerging advances in intelligent control and learning systems from both mathematical theory and engineering application perspectives. It is a series of monographs and contributed volumes focusing on the in-depth exploration of learning theory in control such as iterative learning, machine learning, deep learning, and others sharing the learning concept, and their corresponding intelligent system frameworks in engineering applications. This series is featured by the comprehensive understanding and practical application of learning mechanisms. This book series involves applications in industrial engineering, control engineering, and material engineering, etc. The Intelligent Control and Learning System book series promotes the exchange of emerging theory and technology of intelligent control and learning systems between academia and industry. It aims to provide a timely reflection of the advances in intelligent control and learning systems. This book series is distinguished by the combination of the system theory and emerging topics such as machine learning, artificial intelligence, and big data. As a collection, this book series provides valuable resources to a wide audience in academia, the engineering research community, industry and anyone else looking to expand their knowledge in intelligent control and learning systems.


Ding Wang
Faculty of Information Technology
Beijing University of Technology
Beijing, China

Mingming Ha
School of Automation and Electrical Engineering
University of Science and Technology Beijing
Beijing, China

Mingming Zhao
Faculty of Information Technology
Beijing University of Technology
Beijing, China

ISSN 2662-5458    ISSN 2662-5466 (electronic)
Intelligent Control and Learning Systems
ISBN 978-981-19-7290-4    ISBN 978-981-19-7291-1 (eBook)
https://doi.org/10.1007/978-981-19-7291-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Nowadays, we are going through a profound revolution in all walks of life, due to the rapid development of artificial intelligence and intelligent techniques. Among the numerous amazing achievements, intelligent optimization methods are commonly applied, not only to scientific research but also to practical engineering. Those application areas cover cybernetics, computer science, computational mathematics, and so on. Remarkably, the idea of optimization plays an important role in artificial-intelligence-based advanced control design and is significant to construct various intelligent systems. Unlike the ordinary linear case, nevertheless, the nonlinear optimal control is often difficult to address. Particularly, with the wide popularity of networked techniques and the extension of computer control scales, more and more dynamical systems are encountered with the difficulty of building mathematical models accurately and are operated based on increasing communication resources. For example, the intelligent optimal control of wastewater treatment systems is an important avenue of cyclic resource utilization when coping with modern urban problems. However, there always exist obvious nonlinearities and uncertainties within wastewater treatment processes. Because such dynamics complexity is more and more common, it is always difficult to achieve direct optimization design and the related control efficiencies are often low. Therefore, it is necessary to establish advanced optimal control strategies for complex discrete-time nonlinear systems. Characterized by agent-environment interaction, reinforcement learning is closely related to dynamic programming when conducting intelligent optimization design. Within the adaptive critic framework, reinforcement learning is combined with the neural network approximator to cope with complex optimization problems approximately. In the last two decades, the adaptive critic mechanism has been widely used to solve complex optimal control problems and many excellent results have been developed in the sense of adaptive optimal control design. This book intends to report new optimal control results with critic intelligence for complex discrete-time systems, which covers the novel control theory, advanced control methods, and typical applications for wastewater treatment systems. Therein, combining with artificial intelligence techniques, such as neural networks and reinforcement learning, the novel intelligent critic control theory as well as a series of advanced optimal regulation and


trajectory tracking strategies are established for discrete-time nonlinear systems, followed by application verifications to complex wastewater treatment processes. Consequently, developing such kinds of critic intelligence approaches is of great significance for nonlinear optimization and wastewater recycling. Overall, ten chapters are included in this book, focused on background introduction (Chap. 1), optimal regulation (Chaps. 2–4), trajectory tracking (Chaps. 5–7), and industrial applications, particularly wastewater treatment (Chaps. 8–10). Prof. Ding Wang contributes to each of the ten chapters. Dr. Mingming Ha contributes to Chaps. 1–3, 5–8. Dr. Mingming Zhao contributes to Chaps. 1, 4, 9, and 10. They perform the discussion, revision, and improvement for all ten chapters.
In Chap. 1, considering learning approximators and the reinforcement formulation, a learning-based control framework is established and applied to intelligent critic learning and control for complex nonlinear systems under unknown dynamics within the discrete-time domain. In addition, the bases, the derivations, and recent progress of critic intelligence for discrete-time advanced optimal control design are presented. In terms of normal regulation and trajectory tracking, the advanced optimal control methods are also verified via simulation experiments and wastewater treatment applications, which effectively address unknown factors for complex nonlinear systems, observably enhance control efficiencies, and markedly improve intelligent optimization performances.
In Chap. 2, based on an effective heuristic dynamic programming algorithm, the adaptive event-triggered controller is designed for a class of discrete-time nonlinear systems with constrained inputs. First, to conquer control constraints, a nonquadratic performance index is introduced and the triggering threshold is provided with stability analysis using the Lyapunov technique. Second, three neural networks are constructed in the algorithm scheme and a novel weight initialization approach is developed to improve approximation accuracy of the model network. Simulation results further demonstrate the validity of the proposed strategy by comparison with the traditional method.
In Chap. 3, the event-based self-learning optimal regulation is developed for discrete-time nonlinear systems based on the iterative dual heuristic dynamic programming algorithm, which substantially decreases the computation cost. First, during the iterative process, the convergence of the event-based adaptive critic algorithm is discussed. Second, an appropriate triggering condition is established so as to ensure the input-to-state stability of the event-based system. In addition, the mixed-driven control framework is clarified with data and event considerations. Simulation examples are conducted to demonstrate the effectiveness and superiority of the proposed approach when compared with the traditional technique.
In Chap. 4, an effective generalized value iteration algorithm is established to deal with the discounted near-optimal control issue for systems with control constraints. First, a nonquadratic performance function is introduced to overcome saturation and the initial cost function is selected as an arbitrary positive semidefinite function instead of zero. Considering constrained systems with the discount factor, the monotonicity and convergence of the iterative cost function sequence are discussed. Then, in order to implement the proposed algorithm, two neural networks are constructed


to approximate the cost function and the control policy. Additionally, two simulation examples are conducted to certify the validity of the proposed method.
In Chap. 5, a novel neuro-optimal tracking controller is developed based on value iteration for discrete-time nonlinear systems. The optimal trajectory tracking problem is transformed into the optimal regulation problem through constructing a new augmented system. Then, the convergence of the iterative cost function for the value-iteration-based tracking control algorithm is provided and the uniformly ultimately bounded stability of the closed-loop system is discussed. In addition, the heuristic dynamic programming algorithm is utilized to implement the proposed method and two simulations are conducted to verify the effectiveness of the proposed strategy.
In Chap. 6, a data-based optimal tracking control technique is established based on the iterative dual heuristic dynamic programming algorithm for discrete-time nonaffine systems. In order to implement the proposed algorithm, three neural networks are constructed to approximate the system model, the costate function, and the control strategy, respectively. In addition, when the model network is trained, biases are introduced to improve identification accuracy and the gradient descent algorithm is utilized to update weights and biases of all layers. Through the Lyapunov approach, the uniformly ultimately bounded stability is discussed and the simulation is carried out to demonstrate the validity of the proposed method.
In Chap. 7, the discounted optimal control design is developed based on the adaptive critic scheme with a novel performance index to solve the tracking control problem for both nonlinear and linear systems. First, the tracking errors cannot be eliminated completely in previous methods with a traditional performance index. To deal with the problem, a novel cost function is introduced. Second, the cost function in optimal tracking control cannot be deemed as a Lyapunov function, and therefore the new stability analysis is discussed to ensure the tracking error tends to zero as the number of time steps increases. Two numerical simulations are performed to verify the effectiveness with a comparison of the tracking performance for the iterative adaptive critic designs under different performance index functions.
In Chap. 8, a data-driven iterative adaptive critic scheme is developed to deal with the nonlinear optimal feedback control problem in wastewater treatment systems. In order to ensure the dissolved oxygen concentration and the nitrate level are maintained at their desired setting points, the iterative adaptive critic control framework is established. In this way, faster response and less oscillation are obtained using the scheme compared with the incremental proportional–integral–derivative method and the convergence is discussed. In addition, the mixed-driven control framework is also introduced and then applied to the wastewater treatment plant. The simulation demonstrates the effectiveness of the proposed intelligent controller for nonlinear optimization and wastewater recycling.
In Chap. 9, a data-driven iterative adaptive tracking controller involving the dual heuristic dynamic programming structure is established to improve the control performance of the dissolved oxygen concentration and the nitrate nitrogen concentration in the constrained nonlinear plant of wastewater treatment.
First, to address asymmetric constraints of the control input, a nonquadratic performance function is introduced.


Then, the steady control strategy is obtained and the next system state is evaluated by the model network. Through application to the wastewater treatment plant, the proposed method is verified to be feasible and efficient.
In Chap. 10, based on the accelerated generalized value iteration algorithm, a hybrid intelligent tracking control strategy is developed to achieve optimal tracking for a class of nonlinear discrete-time systems. Both offline and online training are utilized, where the former can obtain the admissible tracking control law and the latter can enhance the control performance. In addition, the acceleration factor is introduced to improve the performance of value iteration and the admissible tracking control is obtained. The input–output data of the unknown system is collected to construct the model neural network, so as to attain the steady control and the approximate control matrix. Considering approximation errors of neural networks, the uniformly ultimately bounded stability is discussed via the Lyapunov approach. Two examples with industrial application backgrounds are involved to demonstrate the availability and effectiveness of the proposed method.

Beijing, China
September 2022

Ding Wang
Mingming Ha
Mingming Zhao

Acknowledgements

The authors would like to thank Prof. Derong Liu, Prof. Junfei Qiao, and Prof. Long Cheng for providing valuable discussions when conducting related research. The authors also would like to thank Jin Ren, Xin Xu, and Huiling Zhao, for preparing some basic materials of this book. The authors also would like to thank Ning Gao, Peng Xin, Lingzhi Hu, Junlong Wu, Jiangyu Wang, Xin Li, Zihang Zhou, Wenqian Fan, Haiming Huang, Ao Liu, Yuan Wang, and Hongyu Ma, for checking and improving some chapters of the book. The authors are very grateful to National Key Research and Development Program of China (Grant 2021ZD0112302), the National Natural Science Foundation of China (Grant 62222301, 61773373), and Beijing Natural Science Foundation (Grant JQ19013), for providing necessary financial support to our research in the past 4 years.



Chapter 1

On the Critic Intelligence for Discrete-Time Advanced Optimal Control Design

Abstract The idea of optimization can be regarded as an important basis of many disciplines and hence is extremely useful for a large number of research fields, particularly for artificial-intelligence-based advanced control design. Due to the difficulty of solving optimal control problems for general nonlinear systems, it is necessary to establish a kind of novel learning strategy with intelligent components. Besides, the rapid development of computer and networked techniques promotes the research on optimal control within the discrete-time domain. In this chapter, the bases, derivations, and recent progress of critic intelligence for discrete-time advanced optimal control design are presented with an emphasis on the iterative framework. Among them, the so-called critic intelligence methodology is highlighted, which integrates learning approximators and the reinforcement formulation.

Keywords Adaptive critic · Advanced optimal control · Dynamic systems · Intelligent critic · Reinforcement learning

1.1 Introduction

Intelligent techniques are attracting tremendous attention nowadays, because of the spectacular promotions that artificial intelligence brings to all walks of life (Silver et al. 2016). Within the plentiful applications related to artificial intelligence, an obvious feature is the possession of intelligent optimization. As an important foundation of several disciplines, such as cybernetics, computer science, and applied mathematics, optimization methods are commonly used in many research fields and engineering practices. Note that optimization problems may be proposed with respect to minimum fuel, minimum energy, minimum penalty, maximum reward, and so on. Actually, most organisms in nature desire to act in optimal fashions for conserving limited resources and meanwhile achieving their goals. Without exception, the idea of optimization also plays a key role in artificial-intelligence-based advanced control design and the construction of intelligent systems. However, with the wide popularity of networked techniques and the extension of computer control scales, an increasing number of dynamical systems are encountered with increasing communication burdens, the difficulty


of building accurate mathematical models, and the existence of various uncertain factors. As a result, it is always not an easy task to achieve optimization design for these systems and the related control efficiencies are often low. Hence, it is extremely necessary to establish novel, advanced, and effective optimal control strategies, especially for complex discrete-time nonlinear systems. Unlike solving the Riccati equation for linear systems, the optimal control design of nonlinear dynamics often contains the difficulty of addressing the Hamilton–Jacobi–Bellman (HJB) equation. Although dynamic programming provides an effective pathway to deal with the problems, it is often computationally untenable to run this method to obtain optimal solutions due to the "curse of dimensionality" (Bellman 1957). Moreover, the backward searching direction obviously precludes the use of dynamic programming in real-time control. Therefore, considering how frequently nonlinear optimal control problems are encountered, some numerical methods have been proposed to overcome the difficulty of solving HJB equations, particularly under the dynamic programming formulation. Among them, the adaptive-critic-related framework is an important avenue and artificial neural networks are often taken as a kind of supplementary approximation tool (Werbos 1974, 1977, 1992, 2008). Although other computational intelligence methods, such as fuzzy logic and evolutionary computation, can also be adopted, neural networks are employed more frequently to serve as the function approximator. In fact, there are several synonyms included within the framework, such as adaptive dynamic programming, approximate dynamic programming, neural dynamic programming, and neuro-dynamic programming (Bertsekas 2017; Bertsekas and Tsitsiklis 1996; Fu et al. 2020; He et al. 2012; He and Zhong 2018; Liu et al. 2012; Prokhorov and Wunsch 1997; Si et al. 2004; Si and Wang 2001). For both continuous-time and discrete-time systems, these methods have been used to solve optimal control problems including output regulation (Gao et al. 2022; Jiang et al. 2020a, b; Luo et al. 2018), multi-player games (Li et al. 2022; Lv and Ren 2019; Narayanan et al. 2020; Zhang et al. 2017a), zero-sum games (Wang et al. 2022g; Wei et al. 2018; Xue et al. 2020; Zhu et al. 2017), and so on. Remarkably, the well-known reinforcement learning is also closely related to such methods, which provides the important property of reinforcement. Actually, classical dynamic programming is deemed to have limited utilities in the field of reinforcement learning due to the common assumption of exact models and the vast computational expense, but it is still significant to boost the development of reinforcement learning in the sense of theory (Sutton and Barto 2018). Most of the strategies of reinforcement learning can be regarded as active attempts to accomplish much the same performances as dynamic programming, without directly relying on perfect models of the environment and making use of superabundant computational resources. At the same time, an essential and pivotal foundation can be provided by traditional dynamic programming to better understand various reinforcement learning techniques. In other words, they are highly related to each other, and both of them are useful to address optimization problems by employing the principle of optimality.
In this chapter, we name the effective integration of learning approximators and the reinforcement formulation as critic intelligence (Wang et al. 2022c). Within this new component, dynamic programming is taken to provide the theoretical foundation of


optimization, reinforcement learning is regarded as the key learning mechanism, and neural networks are adopted to serve as an implementation tool. Then, considering complex nonlinearities and unknown dynamics, the so-called critic intelligence methodology is deeply discussed and comprehensively applied to optimal control design within the discrete-time domain. Hence, by involving critic intelligence, a learning-based intelligent critic control framework is constructed for complex nonlinear systems under unknown environment. Specifically, combining with artificial intelligence techniques, such as neural networks and reinforcement learning, the novel intelligent critic control theory as well as a series of advanced optimal regulation and trajectory tracking strategies are established for discrete-time nonlinear systems (Dong et al. 2017; Ha et al. 2020, 2021a, b, 2022a, b, c; Li et al. 2021; Liang et al. 2020a, b; Lincoln and Rantzer 2006; Liu et al. 2015; Luo et al. 2021; Liu et al. 2012, 2018; Luo et al. 2020b; Mu et al. 2018; Na et al. 2022; Song et al. 2021; Wang et al. 2011, 2012, 2020a, b, 2021a, b, c, 2022c; Wei et al. 2015, 2020, 2022a, b; Zhang et al. 2014; Yan et al. 2017; Zhao et al. 2015; Zhong et al. 2018, 2016; Zhu and Zhao 2022; Zhu et al. 2019), followed by application verifications to complex wastewater treatment processes (Wang et al. 2020b, 2021a, b, 2022d; Yang et al. 2022a). That is, the advanced optimal regulation and trajectory tracking of discrete-time affine nonlinear systems and general nonaffine plants are addressed with applications, respectively. It is worth mentioning that, in this chapter, we put emphasis on discrete-time nonlinear optimal control. Besides, there are six major features included in this chapter and they are listed as follows:

• The nonlinear optimal regulation and trajectory tracking designs are discussed.
• Both affine nonlinear plants and nonaffine systems are covered.
• Normal optimal control and constrained optimal control problems are addressed.
• Data-driven and event-driven techniques are combined to become the novel mixed-driven mechanism.
• The traditional adaptive critic with stability proof and the iterative adaptive critic framework with convergence guarantee are displayed.
• Complete theoretical analysis, advanced design approaches, and some typical wastewater treatment applications are all involved.

Here, we incidentally point out that the adaptive-critic-based optimal control design for continuous-time dynamics has also achieved great progress, in terms of normal regulation, trajectory tracking, disturbance attenuation, and other aspects (Abu-Khalaf and Lewis 2005; Beard et al. 1997; Bian and Jiang 2016; Fan et al. 2022; Fan and Yang 2016; Gao and Jiang 2016, 2019; Han et al. 2022; Huo et al. 2022; Jiang and Jiang 2015; Luo et al. 2020a; Modares and Lewis 2014a, b; Mu and Wang 2017; Murray et al. 2002; Pang and Jiang 2021; Song et al. 2016; Vamvoudakis 2017; Vamvoudakis and Lewis 2010; Wang et al. 2022a, b, e, 2017; Wang and Qiao 2019; Wang and Xu 2022g; Wang and Liu 2018; Wang et al. 2016; Xue et al. 2022a, b; Yang et al. 2021a, b, 2022b, c; Yang and He 2021; Zhang et al. 2017b, 2018; Zhao and Liu 2020; Zhao et al. 2016, 2018; Zhu and Zhao 2018). There always exist some monographs and survey papers discussing most of these topics (Lewis and Liu 2013; Lewis et al. 2012; Liu et al. 2013, 2017; Kiumarsi et al. 2018; Liu et al. 2021; Vrabie et al. 2013; Wang et al. 2009; Zhang et al. 2013a, b). The similar idea and design architectures are contained in these two cases. Actually, they are considered together as an integrated framework of critic intelligence. However, the adaptive critic control for discrete-time systems is different from the continuous-time case. These differences, principally, come from the dynamic programming foundation, the learning mechanism, and the implementation structure. Needless to say, stability and convergence analysis of the two cases are also not the same.
At the end of this section, we present a simple structure of critic-intelligence-based discrete-time advanced nonlinear optimal control design in Fig. 1.1, which also displays the fundamental idea of this chapter. Remarkably, the whole component highlighted in the dotted box of Fig. 1.1 clearly reflects critic intelligence, which is a combination of dynamic programming, reinforcement learning, and neural networks. The arrows in Fig. 1.1 indicate that by using the critic intelligence framework with three different components, the control problem of discrete-time dynamical systems can be addressed under nonlinear and unknown environments. What we can construct through this chapter is a class of discrete-time advanced nonlinear optimal control systems with critic intelligence.

Fig. 1.1 Structure of critic-intelligence-based advanced optimal control design


1.2 Discrete-Time Optimal Regulation Design

Optimal regulation is an indispensable component of modern control theory and is also useful for feedback control design in engineering practice (Abu-Khalaf and Lewis 2005; Ha et al. 2021a; Lincoln and Rantzer 2006; Liu et al. 2012; Wang et al. 2020a, 2012, 2022f; Wei et al. 2015). Dynamic programming is a basic and important tool to solve such kind of design problems. Consider the general formulation of nonlinear discrete-time systems described by

x(k + 1) = F(x(k), u(k)),        (1.1)

where the time step k = 0, 1, 2, . . ., x(k) ∈ R^n is the state vector, and u(k) ∈ R^m is the control input. In general, we assume that the function F : R^n × R^m → R^n is continuous, and without loss of generality, that the origin x = 0 is a unique equilibrium point of system (1.1) under u = 0, i.e., F(0, 0) = 0. Besides, we assume that the system (1.1) is stabilizable on a prescribed compact set Ω ⊂ R^n.

Definition 1.1 (cf. Al-Tamimi et al. 2008) A nonlinear dynamical system is defined to be stabilizable on a compact set Ω ⊂ R^n if there exists a control input u ∈ R^m such that, for all initial conditions x(0) ∈ Ω, the state x(k) → 0 as k → ∞.

For the infinite horizon optimal control problem, it is desired to find the control law u(x) which can minimize the cost function given by

J(x(k), u(k)) = ∑_{p=k}^{∞} U(x(p), u(p)),        (1.2)

where U(·, ·) is called the utility function, U(0, 0) = 0, and U(x, u) ≥ 0 for all x and u. Note that the cost function J(x(k), u(k)) can be written as J(x(k)) for short. Particularly, the cost function starting from k = 0, i.e., J(x(0)), is often paid more attention. When considering the discount factor γ, where 0 < γ ≤ 1, the infinite horizon cost function is described by

J̄(x(k)) = ∑_{p=k}^{∞} γ^{p−k} U(x(p), u(p)).        (1.3)

Note that with the discount factor, we can modulate the convergence speed of regulation design and reduce the value of the optimal cost function. In this chapter, we mainly discuss the undiscounted optimal control problem. Generally, the designed feedback control must not only stabilize the system on Ω but also guarantee that (1.2) is finite, i.e., the control law must be admissible.

Definition 1.2 (cf. Al-Tamimi et al. 2008; Wang et al. 2012; Zhang et al. 2009) A control law u(x) is defined to be admissible with respect to (1.2) on Ω, if u(x) is continuous on a compact set Ω_u ⊂ R^m, u(0) = 0, u(x) stabilizes (1.1) on Ω, and ∀x(0) ∈ Ω, J(x(0)) is finite.

With this definition, the designed feedback control law u(x) ∈ Ω_u, where Ω_u is called the admissible control set. Note that admissible control is a basic and important concept of the optimal control field. However, it is often difficult to determine whether a specified control law is admissible or not. Thus, it is meaningful to find advanced methods that do not rely on the requirement of admissible control laws. The cost function (1.2) can be written as

J(x(k)) = U(x(k), u(k)) + ∑_{p=k+1}^{∞} U(x(p), u(p))
        = U(x(k), u(k)) + J(x(k + 1)).        (1.4)
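To make the recursion (1.4) tangible, the short sketch below approximates the infinite-horizon cost by truncating the sum at a long horizon and checks numerically that J(x(k)) equals U(x(k), u(k)) + J(x(k + 1)) along a trajectory. The scalar dynamics, utility, and feedback law are illustrative assumptions, not taken from the book.

```python
import numpy as np

# Numerical check of the recursion (1.4) under a fixed feedback law.
F = lambda x, u: 0.8 * x + u          # x(k+1) = F(x(k), u(k)), illustrative
U = lambda x, u: x**2 + u**2          # utility U(x, u)
policy = lambda x: -0.4 * x           # some stabilizing control law u(x)

def J(x, horizon=500):
    """Approximate the infinite-horizon cost (1.2) by truncating the sum."""
    total = 0.0
    for _ in range(horizon):
        u = policy(x)
        total += U(x, u)
        x = F(x, u)
    return total

x0 = 1.5
u0 = policy(x0)
lhs = J(x0)                            # J(x(k))
rhs = U(x0, u0) + J(F(x0, u0))         # U(x(k), u(k)) + J(x(k+1))
print(lhs, rhs)                        # the two values agree (up to truncation), as (1.4) states
```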

Denote the control signal as u(∞) when the time step approaches ∞, i.e., k → ∞. According to Bellman's optimality principle, the optimal cost function defined as

J*(x(k)) = min_{u(k), u(k+1), ..., u(∞)} ∑_{p=k}^{∞} U(x(p), u(p))        (1.5)

can be rewritten as

J*(x(k)) = min_{u(k)} { U(x(k), u(k)) + min_{u(k+1), u(k+2), ..., u(∞)} ∑_{p=k+1}^{∞} U(x(p), u(p)) }.        (1.6)

In other words, J*(x(k)) satisfies the discrete-time HJB equation

J*(x(k)) = min_{u(k)} { U(x(k), u(k)) + J*(x(k + 1)) }.        (1.7)

The above expression (1.7) is called the optimality equation of dynamic programming and is also taken as the basis to implement the dynamic programming technique. The corresponding optimal control law u* can be derived by

u*(x(k)) = arg min_{u(k)} { U(x(k), u(k)) + J*(x(k + 1)) }.        (1.8)

Using the optimal control formulation, the discrete-time HJB equation becomes

J*(x(k)) = U(x(k), u*(x(k))) + J*(x(k + 1)),        (1.9)

which, observing the system dynamics, is specifically

J*(x(k)) = U(x(k), u*(x(k))) + J*(F(x(k), u*(x(k)))).        (1.10)
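A standard way to solve the optimality equation (1.7) numerically is value iteration, which later chapters build upon. The following minimal sketch assumes a one-dimensional example system, gridded state and control spaces, and interpolation of the cost function between grid points; all of these concrete choices are illustrative rather than taken from the book.

```python
import numpy as np

# Value iteration sketch for the optimality equation (1.7):
#   J_{i+1}(x) = min_u [ U(x, u) + J_i(F(x, u)) ],  starting from J_0 = 0.
F = lambda x, u: 0.9 * np.sin(x) + u   # example nonlinear dynamics (assumed)
U = lambda x, u: x**2 + u**2           # quadratic utility

xs = np.linspace(-2.0, 2.0, 201)       # sampled state space
us = np.linspace(-1.0, 1.0, 101)       # sampled control space
J = np.zeros_like(xs)                  # J_0(x) = 0

for _ in range(200):
    J_new = np.empty_like(J)
    for i, x in enumerate(xs):
        x_next = F(x, us)              # next state for every candidate control
        # evaluate J_i at the next states by interpolation on the grid
        J_next = np.interp(np.clip(x_next, xs[0], xs[-1]), xs, J)
        J_new[i] = np.min(U(x, us) + J_next)   # Bellman backup at this state
    if np.max(np.abs(J_new - J)) < 1e-6:       # stop once the iteration settles
        J = J_new
        break
    J = J_new
```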


As a special case, we consider a class of discrete-time nonlinear systems with input-affine form

x(k + 1) = f(x(k)) + g(x(k))u(k),        (1.11)

where f(·) and g(·) are differentiable in their argument with f(0) = 0. Similarly, we assume that f + gu is Lipschitz continuous on a set Ω in R^n containing the origin, and that the system (1.11) is controllable in the sense that there exists a continuous control on Ω that asymptotically stabilizes the system. For this affine nonlinear system, if the utility function is specified as

U(x(p), u(p)) = x^T(p)Qx(p) + u^T(p)Ru(p),        (1.12)

where Q and R are positive definite matrices with suitable dimensions, then the optimal control law is calculated by

u*(x(k)) = −(1/2) R^{-1} g^T(x(k)) ∂J*(x(k + 1))/∂x(k + 1).        (1.13)

With this special formulation, the discrete-time HJB equation for the affine nonlinear plant (1.11) is written as

J*(x(k)) = x^T(k)Qx(k) + (1/4) [∂J*(x(k + 1))/∂x(k + 1)]^T g(x(k)) R^{-1} g^T(x(k)) ∂J*(x(k + 1))/∂x(k + 1) + J*(x(k + 1)).        (1.14)

This is also a special expression of (1.10), when considering the affine dynamics and the quadratic utility. When studying the classical linear quadratic regulator problem, the discrete-time HJB equation is reduced to the Riccati equation that can be solved efficiently. However, for the general nonlinear problem, it is not the case. Furthermore, we observe from (1.13) that the optimal control u*(x(k)) is related to x(k + 1) and J*(x(k + 1)), which cannot be determined at the present time step k. Hence, it is necessary to employ approximate strategies to address this kind of discrete-time HJB equations and the adaptive critic method is a good choice. In other words, it is helpful to adopt the adaptive critic framework to deal with optimal control design under nonlinear dynamics environment.
At the end of this section, we recall the optimal control basis of continuous-time nonlinear systems (Vamvoudakis and Lewis 2010; Wang et al. 2017, 2016; Zhu and Zhao 2018). Consider a class of affine nonlinear plants given by

ẋ(t) = f(x(t)) + g(x(t))u(t),        (1.15)


where x(t) is the state vector and u(t) is the control vector. Similarly, we introduce the quadratic utility function formed as (1.12) and define the cost function as

J(x(t)) = ∫_t^∞ U(x(τ), u(τ))dτ.        (1.16)

Note that in the continuous-time case, the definition of admissible control laws can be found in (Abu-Khalaf and Lewis 2005). For an admissible control law u(x), if the related cost function (1.16) is continuously differentiable, the infinitesimal version is the nonlinear Lyapunov equation

0 = U(x, u(x)) + (∇J(x))^T [f(x) + g(x)u(x)]        (1.17)

with J(0) = 0. Define the Hamiltonian function of system (1.15) as

H(x, u(x), ∇J(x)) = U(x, u(x)) + (∇J(x))^T [f(x) + g(x)u(x)].        (1.18)

Using Bellman's optimality principle, the optimal cost function defined as

J*(x) = min_{u∈A(Ω)} ∫_t^∞ U(x(τ), u(τ))dτ        (1.19)

satisfies the continuous-time HJB equation

min_{u∈A(Ω)} H(x, u(x), ∇J*(x)) = 0.        (1.20)

The optimal feedback control law is computed by

u*(x) = arg min_u H(x, u(x), ∇J*(x)) = −(1/2) R^{-1} g^T(x) ∇J*(x).        (1.21)

Using the optimal control expression (1.21), the continuous-time HJB equation for the affine nonlinear plant (1.15) turns out to be the form

0 = x^T Qx + (∇J*(x))^T f(x) − (1/4) (∇J*(x))^T g(x) R^{-1} g^T(x) ∇J*(x)        (1.22)

with J*(0) = 0. Although the general nonaffine dynamics are not discussed, it is clear to observe the differences between continuous-time and discrete-time optimal control formulations from the above affine case. The optimal control laws (1.13) and (1.21) are not identical, while the HJB equations (1.14) and (1.22) are also different. In particular, the optimal control expression of the continuous-time case depends on the state vector and the optimal cost function of the same time instant. Hence, if the optimal cost function is approximated by function approximators like neural networks, the optimal controller can be calculated directly, which is quite distinctive from the discrete-time case.
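This distinction can be made concrete with a small sketch of the discrete-time law (1.13). Because the gradient of the cost is required at x(k + 1), a model prediction of the next state is needed; here a hypothetical quadratic critic J(x) ≈ x^T P x and simple affine dynamics (both illustrative assumptions, not the book's design) are used, and the implicit relation is resolved by a short fixed-point iteration. In the continuous-time case (1.21), by contrast, the same computation would only require ∇J*(x) at the current state.

```python
import numpy as np

# Evaluating the discrete-time control law (1.13) with a hypothetical critic.
# A, g, and P are illustrative stand-ins; P plays the role of a trained critic
# with J(x) ≈ x^T P x, so ∂J(x)/∂x = 2 P x.
A = np.array([[0.99, 0.10],
              [-0.10, 0.95]])
f = lambda x: A @ x
g = lambda x: np.array([[0.0], [0.1]])
R = np.eye(1)
P = np.eye(2)
grad_J = lambda x: 2.0 * P @ x

def control(x, n_iter=20):
    """u = -0.5 R^{-1} g(x)^T ∂J(x(k+1))/∂x(k+1); since the gradient is needed
    at the next state, the implicit relation is solved by fixed-point iteration
    with a model prediction of x(k+1)."""
    u = np.zeros(1)
    for _ in range(n_iter):
        x_next = f(x) + g(x) @ u
        u = -0.5 * np.linalg.solve(R, g(x).T @ grad_J(x_next))
    return u

print(control(np.array([1.0, -0.5])))
```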

1.3 Discrete-Time Trajectory Tracking Design

The tracking control problem is common in many areas, where a dynamical system is forced to track a desired trajectory (Li et al. 2021; Lu et al. 2020; Modares and Lewis 2014b; Song and Zhu 2019; Wang et al. 2021b, c; Wei et al. 2016). For the discrete-time system (1.1), we define the reference trajectory r(k) as

r(k + 1) = ζ(r(k)).        (1.23)

The tracking error is defined as

e(k) = x(k) − r(k).        (1.24)

In many situations, it is supposed that there exists a steady control u_d(k) which satisfies the following equation:

r(k + 1) = F(r(k), u_d(k)).        (1.25)

The feedforward or steady-state part of the control input is used to assure perfect tracking. If x(k) = r(k), i.e., e(k) = 0, the steady-state control u_d(k) corresponding to the reference trajectory can be directly used to make x(k + 1) reach the desired point r(k + 1). If there does not exist a solution of (1.25), the system state x(k + 1) cannot track the desired point r(k + 1). Here, we assume that the function of u_d(k) about r(k) is not implicit and u_d(k) is unique. Then, we define the steady control function as

u_d(k) = ξ(r(k)).        (1.26)

By denoting

μ(k) = u(k) − u_d(k)        (1.27)

and using (1.1), (1.23), and (1.24), we derive the following augmented system:

e(k + 1) = F(x(k), u(k)) − r(k + 1),
r(k + 1) = ζ(r(k)).        (1.28)


Based on (1.23), (1.24), (1.26), and (1.27), we can write (1.28) as

e(k + 1) = F(e(k) + r(k), μ(k) + ξ(r(k))) − ζ(r(k)),
r(k + 1) = ζ(r(k)).        (1.29)

By defining

F̄(e(k), r(k), μ(k)) = [F(e(k) + r(k), μ(k) + ξ(r(k))) − ζ(r(k)); ζ(r(k))]        (1.30)

(where [·; ·] denotes vertical stacking) and X(k) = [e^T(k), r^T(k)]^T, the new augmented system (1.29) can be written as

X(k + 1) = F̄(X(k), μ(k)),        (1.31)

which also takes the nonaffine form. With such a system transformation and a proper definition of the cost function, the trajectory tracking problem can always be formulated as the regulation design of the augmented plant.
Similarly, in the following, we discuss the optimal tracking design of affine nonlinear systems formed as (1.11), with respect to the reference trajectory (1.23). Here, we discuss the case with x(k) = r(k). For x(k + 1) = r(k + 1), we need to find the steady control input u_d(k) of the desired trajectory to satisfy

r(k + 1) = f(r(k)) + g(r(k))u_d(k).        (1.32)

If the system dynamics and the desired trajectory are known, u_d(k) can be solved by

u_d(k) = g^+(r(k))[ζ(r(k)) − f(r(k))],        (1.33)

where g^+(r(k)) is called the Moore–Penrose pseudoinverse matrix of g(r(k)). According to (1.11), (1.23), (1.24), (1.27) and (1.33), the augmented system dynamics is given as follows:

e(k + 1) = f(e(k) + r(k)) + g(e(k) + r(k))g^+(r(k))[ζ(r(k)) − f(r(k))] − ζ(r(k)) + g(e(k) + r(k))μ(k),
r(k + 1) = ζ(r(k)).        (1.34)

By denoting

ℱ(e(k), r(k)) = f(e(k) + r(k)) + g(e(k) + r(k))g^+(r(k))[ζ(r(k)) − f(r(k))] − ζ(r(k)),
𝒢(e(k), r(k)) = g(e(k) + r(k)),        (1.35)

the augmented plant (1.34) can be rewritten as


[e(k + 1); r(k + 1)] = [ℱ(e(k), r(k)); ζ(r(k))] + [𝒢(e(k), r(k)); 0] μ(k).        (1.36)

In this case, through observing X(k) = [e^T(k), r^T(k)]^T, the affine augmented system is established by

X(k + 1) = F(X(k)) + G(X(k))μ(k),        (1.37)

where the system matrices are

F(X(k)) = [ℱ(e(k), r(k)); ζ(r(k))],        (1.38a)
G(X(k)) = [𝒢(e(k), r(k)); 0].        (1.38b)

For the augmented system (1.37), we define the cost function as

J(X(k), μ(k)) = \sum_{p=k}^{∞} U(X(p), μ(p)),   (1.39)

where U(X(p), μ(p)) ≥ 0 is the utility function. Here, considering the quadratic utility formed as (1.12), it is found that

U(X(p), μ(p)) = [e^T(p), r^T(p)] \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} e(p) \\ r(p) \end{bmatrix} + μ^T(p)Rμ(p)
             = e^T(p)Qe(p) + μ^T(p)Rμ(p) = U(e(p), μ(p)).   (1.40)

Then, we can write the cost function in the following form:

J(e(k)) = \sum_{p=k}^{∞} U(e(p), μ(p)).   (1.41)

Note that the controlled plant related to (1.41) can be regarded as the tracking error dynamics of the augmented system (1.37) without involving the part of the desired trajectory. For clarity, we express it as follows:

e(k + 1) = F(e(k), r(k)) + G(e(k), r(k))μ(k),   (1.42)

where G(e(k), r(k)) = g(e(k) + r(k)). Since r(k) is not relevant to e(k), the tracking error dynamics (1.42) can be simply rewritten as

e(k + 1) = F(e(k)) + G(e(k))μ(k).   (1.43)

In this sense, we should study the optimal regulation of the error dynamics (1.43) with respect to the cost function (1.41). It means that the trajectory tracking problem has been transformed into a nonlinear regulation design. Based on Bellman's optimality principle, the optimal cost function J*(e(k)) satisfies the HJB equation:

J*(e(k)) = min_{μ(k)} {e^T(k)Qe(k) + μ^T(k)Rμ(k) + J*(e(k + 1))}.   (1.44)

Then, the corresponding optimal control is obtained by

μ*(e(k)) = arg min_{μ(k)} {e^T(k)Qe(k) + μ^T(k)Rμ(k) + J*(e(k + 1))}.   (1.45)

Observing (1.42), μ*(e(k)) is solved by

μ*(e(k)) = −(1/2) R^{−1} (∂e(k + 1)/∂μ(e(k)))^T ∂J*(e(k + 1))/∂e(k + 1)
         = −(1/2) R^{−1} g^T(e(k) + r(k)) ∂J*(e(k + 1))/∂e(k + 1).   (1.46)

If the optimal control law μ*(e(k)) is derived via adaptive critic, the feedback control u*(k) that applies to the original system (1.11) can be computed by

u*(k) = μ*(e(k)) + u_d(k).   (1.47)
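As a rough illustration of how (1.46) and (1.47) combine the feedback and feedforward parts, the sketch below evaluates the tracking controller once an approximation of ∂J*/∂e is available. The quadratic stand-in critic J*(e) ≈ e^T P e, the matrix P, the input matrix g, and the feedforward term u_d are hypothetical placeholders, not quantities from this chapter.

```python
import numpy as np

# Continuing the hypothetical example above: g(x) and the feedforward u_d(r)
# are restated here as placeholders so the sketch is self-contained.
def g(x):
    return np.array([[0.0], [1.0 + 0.1 * x[0] ** 2]])

def u_d(r):
    # Placeholder feedforward term; in the chapter it is g^+(r)[zeta(r) - f(r)], Eq. (1.33).
    return np.zeros(1)

R_inv = np.eye(1)                 # R = I for simplicity
P = np.diag([2.0, 1.5])           # stand-in critic: J*(e) ~ e^T P e (an assumption)

def tracking_control(e, r, e_next_pred):
    grad_J = 2.0 * P @ e_next_pred                    # dJ*/de evaluated at e(k+1)
    mu = -0.5 * R_inv @ g(e + r).T @ grad_J           # feedback part, Eq. (1.46)
    return mu + u_d(r)                                # total control, Eq. (1.47)

print(tracking_control(np.array([0.3, -0.2]), np.array([0.1, 0.4]),
                       np.array([0.25, -0.15])))
```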

By using such a system transformation and the adaptive critic framework, the trajectory tracking problem of general nonlinear plants can be addressed. Overall, the idea of critic intelligence is helpful to cope with both optimal regulation and trajectory tracking problems, where the former is the basis and the latter is an extension.

1.4 The Construction of Critic Intelligence

For constructing the critic intelligence framework, it is necessary to provide the bases of reinforcement learning and neural networks. They are introduced to solve nonlinear optimal control problems under the dynamic programming formulation.


1.4.1 Basis of Reinforcement Learning The learning ability is an important property and one of the bases of intelligence. In a reinforcement learning system, several typical elements are generally included: the agent, the environment, the policy, the reward signal, the value function, and optionally, the model of the environment (Sutton and Barto 2018; Li et al. 2018). In simple terms, the reinforcement learning problem is meant to learn through the interaction to achieve a goal. The interacting process between an agent and the environment consists of selecting actions by the agent and responding to those actions by the environment as well as presenting new situations to the agent. Besides, the environment gives rise to rewards or penalties, which are special numerical values that the involved agent tries to maximize or minimize over time. Hence, such a process is closely related to dynamic optimization. Different from general supervised learning and unsupervised learning, reinforcement learning is inspired by natural learning mechanisms and is considered as a relatively new machine learning paradigm. It is actually a behavioral learning formulation and belongs to the learning category without a teacher. As the core idea of reinforcement learning, the agent-environment interaction is characterized by the adopted action of the agent through a corresponding numerical reward signal generated by the environment. It is worth mentioning that actions may affect not only the current reward but also the situation in the next time step and even all subsequent rewards. Within the field of reinforcement learning, the real-time evaluative information is required to explore the optimal policy. As mentioned in (Sutton and Barto 2018), the challenge of reinforcement learning lies in how to reach a compromise between exploration and exploitation, so as to maximize the reward signal. In the learning process, the agent needs to determine which actions yield the largest reward. Therefore, the agent is able to sense and control states of the environment or the system. As stated in (Haykin 2009; Jiang and Jiang 2013; Sutton and Barto 2018; Werbos 2009), dynamic programming provides the mathematical basis of reinforcement learning and hence lies at the core of reinforcement learning. In many practical situations, explicit system models are always unavailable, which diminishes the application range of dynamic programming. Reinforcement learning can be considered as an approximate form of dynamic programming and is greatly related to the framework of ADP. One of their common focuses is how to solve the optimality equation effectively. There exist some resultful ways to compute the optimal solution, where policy iteration and value iteration are two basic ones. When the state and action spaces are small enough, the value functions can be represented as tables to exactly find the optimal value function and the optimal policy, such as for Gridworld and FrozenLake problems. In this case, the policy iteration, value iteration, Monte Carlo, and temporal-difference methods have been developed to address these problems (Sutton and Barto 2018). However, it is difficult to find accurate solutions for other problems with arbitrarily large state spaces. Needless to say, some new techniques are required to effectively solve such complex problems.


Therefore, a series of representative approaches have been adopted, including policy gradient and Q-learning (Sutton and Barto 2018). In addition, reinforcement learning with function approximation has been widely applied to various aspects of decision and control system design (Bertsekas 2019). The adaptive critic also belongs to these approximate strategies and serves as the basis of advanced optimal control design in this chapter. In particular, for various systems containing large state spaces, the approximate optimal policy can be iteratively obtained by using value iteration and policy iteration with function approximation.
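For the tabular case mentioned above, a minimal value iteration sketch on a toy one-dimensional gridworld (an illustrative stand-in, not one of the benchmark problems cited) may look as follows.

```python
import numpy as np

# Tabular value iteration on a toy 1-D gridworld. States 0..4, actions move
# left/right; state 4 is an absorbing goal with zero cost.
n_states, actions = 5, (-1, +1)
cost = np.array([1.0, 1.0, 1.0, 1.0, 0.0])   # per-step cost (goal is free)
V = np.zeros(n_states)

for _ in range(100):                          # value iteration sweeps
    V_new = np.empty_like(V)
    for s in range(n_states):
        if s == n_states - 1:                 # absorbing goal state
            V_new[s] = 0.0
            continue
        # Bellman backup: V(s) <- min_a [ cost(s) + V(next(s, a)) ]
        V_new[s] = min(cost[s] + V[np.clip(s + a, 0, n_states - 1)] for a in actions)
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

print(V)   # expected: [4, 3, 2, 1, 0] steps-to-go
```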

1.4.2 The Neural Network Approximator As an obbligato branch of computational intelligence, neural networks are rooted in many disciplines, such as neurosciences, mathematics, statistics, physics, computer science, and engineering (Haykin 2009). Traditionally, the term neural network is used to refer to a network or a circuit of biological neurons. The modern usage of the term often represents artificial neural networks, which are composed of artificial neurons or nodes. Artificial neural networks are composed of interconnecting artificial neurons, or namely, programming constructs that can mimic the properties of biological neurons. They are used either to gain an understanding of biological neural networks, or to solve artificial intelligence problems without necessarily creating authentic models of biological systems. Neural networks possess the massively parallel distributed structure and the ability to learn and generalize. The generalization denotes the reasonable output production of neural networks, with regard to inputs not encountered during the learning process. Since real biological nervous systems are highly complex, artificial neural network algorithms attempt to abstract the complexity and focus on what may hypothetically matter most from an information processing point of view. Good performance, including good predictive ability and low generalization error, can be regarded as one source of evidence towards supporting the hypothesis that the abstraction really captures something important from the perspective of brain information processing. Another incentive for these abstractions is to reduce the computation amount when simulating artificial neural networks, so as to allow one to experiment with larger networks and train them on larger data sets (Doya et al. 2001; Haykin 2009; Schultz 2004). There exist many kinds of neural networks in the literature, such as single-layer neural networks, multilayer neural networks, radial-basis function networks, and recurrent neural networks. Multilayer perceptrons represent a frequently used neural network structure, where a nonlinear differentiable activation function is included in each neuron model and one or more layers hidden from both the input and output modes are contained. Besides, a high degree of connectivity is possessed and the connection extent is determined by synaptic weights of the network. A computationally effective method for training multilayer perceptrons is the backpropagation algorithm, which is regarded as a landmark during the development of neural


networks (Haykin 2009). In recent years, some new structures of neural networks have been proposed, where convolutional neural networks provide an efficient method to constrain the complexity of feedforward neural networks by weight sharing and restriction to local connections. Convolutional neural networks are a truly successful deep learning approach where many layers of a hierarchy are successfully trained in a robust manner (LeCun et al. 2015). Till now, neural networks are still a hot topic, especially under the background of artificial intelligence. Due to remarkable properties of nonlinearity, adaptivity, self-learning, fault tolerance, and universal approximation of input–output mapping, neural networks can be extensively applied to various research areas of different disciplines, such as dynamic modeling, time series analysis, pattern recognition, signal processing, and system control. In this chapter, neural networks are most importantly taken as an implementation tool or a function approximator.
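As a small illustration of the function-approximator role discussed above, the following sketch trains a one-hidden-layer perceptron by plain gradient descent (backpropagation) to fit a scalar mapping. The network size, learning rate, and target function are arbitrary choices made only for this example.

```python
import numpy as np

# A one-hidden-layer perceptron trained by gradient descent (backpropagation)
# to approximate y = sin(x); all sizes and step lengths are illustrative.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(x)                                  # target mapping to be approximated

W1, b1 = rng.normal(0, 0.5, (1, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr = 0.05

for _ in range(2000):
    h = np.tanh(x @ W1 + b1)                   # hidden layer
    y_hat = h @ W2 + b2                        # network output
    err = y_hat - y                            # prediction error
    # Backpropagation of the mean-squared error through both layers.
    dW2 = h.T @ err / len(x)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)
    dW1 = x.T @ dh / len(x)
    db1 = dh.mean(axis=0)
    W1, b1, W2, b2 = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

print(float(np.mean((y_hat - y) ** 2)))        # mean-squared training error after learning
```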

1.4.3 The Critic Intelligence Framework The combination of dynamic programming, reinforcement learning, and neural networks is the so-called critic intelligence framework. The advanced optimal control design based on critic intelligence is named as intelligent critic control. It is almost a same concept as the existing adaptive critic method, only note that the intelligence property is highlighted. The basic idea of intelligent critic design is depicted in Fig. 1.2, where three main components are included, i.e., critic, action, and environment. In line with the general reinforcement learning formulation, the components of critic and action are integrated into an individual agent. When implementing this technique in the sense of feedback control design, three kinds of neural networks are built to approximate the cost function, the control, and the system. They are called the critic network, the action network, and the model network, performing functions of evaluation, decision, and prediction, respectively. Before implementing the adaptive critic technique, it is necessary to determine which structure should be adopted. Different advantages are contained in different implementation structures. Heuristic dynamic programming (HDP) (Dong et al. 2017) and dual heuristic dynamic programming (DHP) (Zhang et al. 2009) are two basic, but commonly used structures for adaptive critic design. The globalized dual heuristic dynamic programming (globalized DHP or GDHP) (Liu et al. 2012; Wang et al. 2012) is an advanced structure with an integration of HDP and DHP. Besides, the action-dependent versions of these structures are also used sometimes. It should be pointed out that neural dynamic programming (NDP) was also adopted in (Si and Wang 2001), with an emphasis on the new idea for training the action network. Considering the classical structures with critic, action, and model networks, the primary difference is reflected by the outputs of their critic networks. After the state variable is input to the critic network, only the cost function is approximated as the critic output in HDP while the derivative of the cost function is approximated in DHP. However, for GDHP, both the cost function and its derivative are approximated as

the critic output. Therefore, the structure of GDHP is somewhat more complicated than those of HDP and DHP. Since the derivative of the cost function can be directly used to obtain the control law, the computational efficiency of DHP and GDHP is obviously higher than that of HDP. However, because of the inclusion of the cost function itself, the convergence process of HDP and GDHP is more intuitive than that of DHP. Overall, when pursuing a simple architecture and a low computational burden, the DHP strategy is a good choice. In addition, the HDP and GDHP strategies can be selected to intuitively observe the evolution tendency of the cost function, but HDP is simpler than GDHP in architecture. For clarity, comparisons of the three main structures of adaptive critic are given in Table 1.1.

Fig. 1.2 Basic idea of intelligent critic design

Table 1.1 Comparisons of three main structures of adaptive critic

| Structure | Critic input | Critic output (approximated value) | Main advantages |
|---|---|---|---|
| HDP | State variable | Only the cost function | Simplicity and intuitiveness |
| DHP | State variable | Derivative of the cost function | Simplicity and high efficiency |
| GDHP | State variable | The cost function and its derivative | High efficiency and intuitiveness |

As a summary, the major characteristics of the critic intelligence framework are highlighted as follows:

• The theoretical foundation of optimization is considered under the formulation of dynamic programming, with the system, cost function, and control being included.


• A behavioral interaction between the environment and an agent (critic and action) along with reinforcement learning is embedded as the key learning mechanism.
• Neural networks are constructed to serve as an implementation tool of the main components, i.e., environment (system), critic (cost function), and action (control).

1.5 The Iterative Adaptive Critic Formulation

As two basic iterative frameworks in the field of reinforcement learning, value iteration and policy iteration provide great inspiration for adaptive critic control design. Although the adaptive critic approach has been applied to deal with optimal control problems, initially stabilizing control policies are often required, for instance, in the policy iteration process. It is often difficult to obtain such control policies, particularly for complex nonlinear systems. In addition, since much attention is paid to system stability, the convergence proofs of adaptive critic schemes are quite limited. Hence, it is necessary to develop an effective iterative framework with respect to adaptive critic. The main focuses include how to construct the iterative process to solve the HJB equation (1.9) and then prove its convergence (Al-Tamimi et al. 2008; Dierks et al. 2009; Heydari 2014; Jiang and Zhang 2018; Liu et al. 2012; Wang et al. 2012; Zhang et al. 2009).

In fact, the derivation of the iterative adaptive critic formulation is inspired by the fixed-point iteration method of numerical analysis. Consider an algebraic equation

z = ϕ(z),   (1.48)

which is difficult to solve analytically. By denoting i as the iteration index, where i = 0, 1, 2, . . . , we can solve it iteratively from an initial value z^(0). Then, we conduct an iterative process with respect to z^(i) until the iteration index reaches infinity. Using mathematical expressions, the iterative process is written as follows:

z^(i+1) = ϕ(z^(i))  ⟹  z^(∞) = ϕ(z^(∞)).   (1.49)
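A minimal numerical sketch of this fixed-point idea, using the illustrative equation z = cos(z) rather than anything from the control setting, is given below.

```python
import math

# Fixed-point iteration z^(i+1) = phi(z^(i)) from (1.49), for an arbitrary
# contraction phi chosen only for illustration.
phi = lambda z: math.cos(z)        # example equation z = cos(z)
z = 0.0                            # initial value z^(0)
for i in range(100):
    z_next = phi(z)
    if abs(z_next - z) < 1e-12:    # stop once the iteration has converged
        break
    z = z_next
print(z)                           # approximately 0.739085 (the fixed point)
```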

From the above formulation, the convergence result can be obtained. Employing a similar idea, we construct two sequences to iteratively solve the optimal regulation problem in terms of the cost function and the control law. They are called the iterative cost function sequence {J^(i)(x(k))} and the iterative control law sequence {u^(i)(x(k))}, respectively. When using the HDP structure, the successive iteration mode is described as follows:

J^(0)(x(k)) → u^(0)(x(k)) → J^(1)(x(k)) → · · · → u^(i)(x(k)) → J^(i+1)(x(k)) → · · ·   (1.50)

Note that this is the common value iteration process, which begins from a cost function, rather than policy iteration.


Specifically, for the nonaffine system (1.1) and the HJB equation (1.9), the iterative adaptive critic algorithm is performed as follows. First, we start with the initial cost function J^(0)(·) = 0 and solve

u^(0)(x(k)) = arg min_{u(k)} {U(x(k), u(k)) + J^(0)(x(k + 1))}
            = arg min_{u(k)} {U(x(k), u(k)) + J^(0)(F(x(k), u(k)))}.   (1.51)

Then, we update the cost function by

J^(1)(x(k)) = min_{u(k)} {U(x(k), u(k)) + J^(0)(x(k + 1))}
            = U(x(k), u^(0)(x(k))) + J^(0)(F(x(k), u^(0)(x(k)))).   (1.52)

Next, for i = 1, 2, . . . , the algorithm iterates between

u^(i)(x(k)) = arg min_{u(k)} {U(x(k), u(k)) + J^(i)(x(k + 1))}
            = arg min_{u(k)} {U(x(k), u(k)) + J^(i)(F(x(k), u(k)))}   (1.53)

and

J^(i+1)(x(k)) = min_{u(k)} {U(x(k), u(k)) + J^(i)(x(k + 1))}
             = U(x(k), u^(i)(x(k))) + J^(i)(F(x(k), u^(i)(x(k)))).   (1.54)
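A compact sketch of the value iteration (1.51)–(1.54) on a scalar nonlinear plant is given below; the plant, utility, and the state/control grids are illustrative assumptions, and the minimization over u(k) is simply carried out over a discretized control set.

```python
import numpy as np

# Value iteration (1.51)-(1.54) on a scalar nonlinear plant, with the state and
# control discretized on grids; the plant, utility, and grid ranges are
# illustrative assumptions rather than the chapter's examples.
F = lambda x, u: 0.9 * np.sin(x) + u            # x(k+1) = F(x(k), u(k))
U = lambda x, u: x ** 2 + u ** 2                # quadratic utility

x_grid = np.linspace(-2.0, 2.0, 201)
u_grid = np.linspace(-1.0, 1.0, 41)
J = np.zeros_like(x_grid)                       # J^(0)(.) = 0

def J_of(x_next):
    # Nearest-grid-point evaluation of the current cost-function estimate.
    idx = np.abs(x_grid[:, None] - x_next[None, :]).argmin(axis=0)
    return J[idx]

for i in range(200):
    # Minimize U(x, u) + J^(i)(F(x, u)) over the control grid for every state.
    q_values = U(x_grid[:, None], u_grid[None, :]) + \
               np.stack([J_of(F(x_grid, u)) for u in u_grid], axis=1)
    J_new = q_values.min(axis=1)                # cost update, Eq. (1.54)
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new

u_greedy = u_grid[q_values.argmin(axis=1)]      # control update, Eq. (1.53)
print(J[100], u_greedy[100])                    # converged cost and control at x = 0
```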

How to guarantee convergence of the iterative algorithm is an important topic in the adaptive critic field. The convergence proof of the iterative process (1.51)–(1.54) has been presented in (Abu-Khalaf and Lewis 2005; Wang et al. 2012), where the cost function J^(i)(x(k)) → J*(x(k)) and the control law u^(i)(x(k)) → u*(x(k)) as i → ∞.

When implementing the DHP scheme, we often introduce a new costate function sequence to denote the derivative of the cost function (Wang et al. 2020b; Zhang et al. 2009). By letting

λ^(i+1)(x(k)) = ∂J^(i+1)(x(k))/∂x(k),   λ^(i)(x(k + 1)) = ∂J^(i)(x(k + 1))/∂x(k + 1),   (1.55)

the derivative of the iterative cost function (1.54), i.e.,

∂J^(i+1)(x(k))/∂x(k) = ∂U(x(k), u^(i)(x(k)))/∂x(k) + (∂x(k + 1)/∂x(k))^T ∂J^(i)(x(k + 1))/∂x(k + 1),   (1.56)

can be concisely written as

λ^(i+1)(x(k)) = ∂U(x(k), u^(i)(x(k)))/∂x(k) + (∂x(k + 1)/∂x(k))^T λ^(i)(x(k + 1)).   (1.57)
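The following sketch performs one sweep of the costate update (1.57) on a grid of states for a scalar plant; the plant, the utility, the current control law u^(i)(·), and the finite-difference evaluation of ∂x(k + 1)/∂x(k) are all simplifying assumptions made only for illustration.

```python
import numpy as np

# One sweep of the costate update (1.57) for a scalar plant, evaluated on a state
# grid. The plant, utility, current control law u^(i)(.), and the finite-difference
# model derivative are illustrative stand-ins.
F = lambda x, u: 0.9 * np.sin(x) + u
U_x = lambda x, u: 2.0 * x                      # partial dU/dx for U = x^2 + u^2
policy = lambda x: -0.45 * np.sin(x)            # a given control-law iterate u^(i)(x)

x_grid = np.linspace(-2.0, 2.0, 201)
lam = np.zeros_like(x_grid)                     # costate iterate lambda^(i)(.)

def lam_of(x_next):
    idx = np.abs(x_grid[:, None] - x_next[None, :]).argmin(axis=0)
    return lam[idx]

eps = 1e-5
u_now = policy(x_grid)
x_next = F(x_grid, u_now)
# dx(k+1)/dx(k) with the control held fixed, by central finite differences.
dxnext_dx = (F(x_grid + eps, u_now) - F(x_grid - eps, u_now)) / (2 * eps)
lam_new = U_x(x_grid, u_now) + dxnext_dx * lam_of(x_next)   # Eq. (1.57)
print(lam_new[110])                             # updated costate at x = 0.2
```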

Using the costate function, the iterative control law can be obtained more directly, since the partial derivative computation of J^(i)(x(k + 1)) with respect to x(k + 1) is eliminated. Note that (1.57) is an important expression when implementing the iterative DHP algorithm in the following mode:

λ^(0)(x(k)) → u^(0)(x(k)) → λ^(1)(x(k)) → · · · → u^(i)(x(k)) → λ^(i+1)(x(k)) → · · ·   (1.58)

After establishing the iterative adaptive critic framework, three kinds of neural networks are built to approximate the iterative cost function, the iterative control law, and the controlled system. Here, the output of the critic network is denoted as Ĉ^(i+1)(x(k)), which is a uniform expression including the cases of HDP, DHP, and GDHP. It can be specified to represent the approximate cost function Ĵ^(i+1)(x(k)) in HDP, the approximate costate function λ̂^(i+1)(x(k)) in DHP, or both of them in GDHP. Clearly, the dimension of the critic output in the iterative GDHP algorithm is n + 1. Besides, the output of the action network is the approximate iterative control law û^(i)(x(k)). The controlled plant is approximated by using the model network, so that its output is x̂(k + 1). For clarity, the critic outputs of the main iterative adaptive critic algorithms are given in Table 1.2.

At the end of this section, the general structure of discrete-time optimal control via iterative adaptive critic is depicted in Fig. 1.3. There are two critic networks included in Fig. 1.3, which output cost functions at different iteration steps and time steps. The two critic networks possess the same architecture and are connected by weight transmission. The model network is often trained before carrying out the main iterative process, and the final converged weight matrices should be recorded. The critic network and action network are trained according to their error functions, namely the critic error function and the action error function. These error functions can be defined as different formulas in accordance with the design purpose. Overall, there exist several main features of the above iterative structure, which are listed as follows:

Table 1.2 The critic outputs of the main iterative adaptive critic algorithms

| Iterative algorithm | Specific value of the critic output Ĉ^(i+1)(x(k)) | Dimension |
|---|---|---|
| Iterative HDP (Al-Tamimi et al. 2008) | Ĵ^(i+1)(x(k)) | 1 |
| Iterative DHP (Zhang et al. 2009) | λ̂^(i+1)(x(k)) | n |
| Iterative GDHP (Wang et al. 2012) | [Ĵ^(i+1)(x(k)), λ̂^(i+1)T(x(k))]^T | n + 1 |


Fig. 1.3 General structure of the iterative adaptive critic method

• The iteration index is always embedded in the expressions of the cost function and the control law function.
• Different neural network structures can be employed, where multilayer perceptrons are most commonly used with gradient descent.
• Error functions of the critic network and action network can be determined with the specific choice of implementation structures, such as HDP, DHP, and GDHP.
• Three neural networks are built and integrated into a whole formulation, even though their training sequences are different.
• It is more readily implemented in an offline manner, since the final action weight matrices are adopted to construct the available control law.
• It is also applicable to other control problems that can be transformed to and considered as the regulation design.

The fundamental objective of employing the iterative adaptive critic framework is to solve the HJB equation approximately, which, for instance, takes the form of (1.44) with regard to the proposed tracking control problem. It should be pointed out that, when addressing trajectory tracking problems, the tracking error e(k) can be regarded as the new state in Fig. 1.3 and the same iterative formulation can be conducted for the transformed optimal regulation design expediently. That is to say, both the optimal regulation and trajectory tracking problems can be effectively handled under the iterative adaptive critic framework.
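A compressed numerical sketch of this structure is given below: the target for the trained critic is computed with a frozen copy of the previous critic weights (mimicking the weight transmission between the two critic networks), and a linear-in-features critic with a least-squares update stands in for a neural network trained by gradient descent. The plant, features, and grids are hypothetical choices for illustration only.

```python
import numpy as np

# A compressed HDP-style iteration illustrating the "weight transmission" idea:
# the value target uses a frozen copy of the previous critic's weights.
F = lambda x, u: 0.9 * np.sin(x) + u
U = lambda x, u: x ** 2 + u ** 2
phi = lambda x: np.stack([x ** 2, x ** 4], axis=-1)      # critic features, J_hat = phi(x) @ w

x_samples = np.linspace(-2.0, 2.0, 401)
u_candidates = np.linspace(-1.0, 1.0, 41)
w = np.zeros(2)                                          # critic weights, J_hat^(0) = 0

for i in range(50):
    w_frozen = w.copy()                                  # weight transmission to the target critic
    # Evaluate U(x,u) + J_hat^(i)(F(x,u)) over a control grid (cf. Eq. (1.53)).
    q_values = U(x_samples[:, None], u_candidates[None, :]) + \
               phi(F(x_samples[:, None], u_candidates[None, :])) @ w_frozen
    targets = q_values.min(axis=1)                       # value targets, Eq. (1.54)
    # Critic update: least-squares fit of phi(x) @ w to the targets
    # (a batch stand-in for gradient descent on the critic error function).
    w, *_ = np.linalg.lstsq(phi(x_samples), targets, rcond=None)

print(w)   # converged critic weights of the sketch
```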

1.6 Significance and Prospects

Optimization methods have been widely used in many research areas. As a fundamental element of artificial intelligence techniques, the idea of optimization is also
being paid large attentions at present. When combining with automatic control, it is necessary to establish a series of intelligent methods to address discrete-time optimal regulation and trajectory tracking problems. This is more important because of the increasing complexity of controlled plants, the augmentation of available data resources, and the generalization of unknown dynamics (Wang et al. 2017). In fact, various advanced-control-based applications have been conducted on transportation systems, power systems, chemical processes, and so on. However, they also should be addressed via critic-intelligence-based methods due to the complexity of the practical issues. For example, complex wastewater treatment problems need to be considered under the intelligent environment and more advanced control strategies are required. It is an indispensable part during the process of accomplishing smart environmental protection. In this survey, wastewater treatment process control is regarded as a typical application of the critic intelligence approach. Developing other application fields with critic intelligence is also an interesting topic of the future research. In this chapter, by involving the critic intelligence formulation, advanced optimal control methods towards discrete-time nonlinear systems are developed in terms of normal regulation and trajectory tracking. The given strategies are also verified via simulation experiments and wastewater treatment applications. Through providing advanced solutions for nonlinear optimal control problems, we guide the development of intelligent critic learning and control for complex systems, especially the discrete-time case. It is important to note that the given strategies can not only strengthen the theoretical results of adaptive critic control but also provide new avenues to intelligent learning control design of complex discrete-time systems, so as to effectively address unknown factors, observably enhance control efficiencies, and really improve intelligent optimization performances. Additionally, it will be beneficial for the construction of advanced automation techniques and intelligent systems as well as be of great significance both in theory and application. In particular, it is practically meaningful to enhance the level of wastewater treatment techniques and promote the recycling of water resources, and therefore, to the sustainable development of our economy and society. As described in (Alex et al. 2008; Han et al. 2019), the primary control objective of the common wastewater treatment platform, i.e., Benchmark Simulation Model No. 1, is to ensure that the dissolved oxygen concentration in the fifth unit and the nitrate level in the second unit are maintained at their desired values. In this case, such desired values can be regarded as the reference trajectory. Note that the control parameters are, respectively, the oxygen transfer coefficient of the fifth unit and the internal recycle flow rate of the fifth-second units. In fact, the control design of the proper dissolved oxygen concentration and nitrate level is actually a trajectory tracking problem. Thus, the intelligent critic framework can be constructed for achieving effective control of wastewater treatment processes. There have been some basic conclusions in (Wang et al. 2020b, 2021a, b, c) and more results will be reported in the future. Reinforcement learning is an important branch of machine learning and is undergoing rapid development. 
It is meaningful to introduce more advanced learning approaches to the automatic control field. Particularly, the consideration of reinforcement learning under the deep neural network formulation can result in dual superiorities of perception and decision in high-dimensional state-action space. Moreover, it is also necessary to utilize big data information more sufficiently and establish advanced data-driven schemes for optimal regulation and trajectory tracking. Additionally, since we only consider discrete-time optimal control problems, it is necessary to propose advanced methods for continuous-time nonlinear systems in the future. Using proper system transformations, the advanced optimal control schemes can also be extended to other fields, such as robust stabilization, distributed control, and multi-agent systems. Besides wastewater treatment, critic intelligence approaches can be applied to more practical systems in engineering and society. With developments in theory, methods, and applications, it is beneficial to constitute a unified framework for intelligent critic learning and control. In summary, more fantastic achievements will be generated through the involvement of critic intelligence.

References Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791 Alex J, Benedetti L, Copp J, Gernaey KV, Jeppsson U, Nopens I, Pons MN, Rieger L, Rosen C, Steyer JP, Vanrolleghem P, Winkler S (2008) Benchmark simulation model no. 1 (BSM1), IWA task group on benchmarking of control strategies for WWTPs, London Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans Syst Man Cybern Part B: Cybern 38(4):943–949 Beard RW, Saridis GN, Wen JT (1997) Galerkin approximations of the generalized HamiltonJacobi-Bellman equation. Automatica 33(12):2159–2177 Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton, New Jersey Bertsekas DP (2017) Value and policy iterations in optimal control and adaptive dynamic programming. IEEE Trans Neural Networks Learn Syst 28(3):500–509 Bertsekas DP (2019) Feature-based aggregation and deep reinforcement learning: A survey and some new implementations. IEEE/CAA J Automat Sinica 6(1):1–31 Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific, Belmont, Massachusetts Bian T, Jiang ZP (2016) Value iteration and adaptive dynamic programming for data-driven adaptive optimal control design. Automatica 71:348–360 Dierks T, Thumati BT, Jagannathan S (2009) Optimal control of unknown affine nonlinear discretetime systems using offline-trained neural networks with proof of convergence. Neural Networks 22(5–6):851–860 Dong L, Zhong X, Sun C, He H (2017) Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 28(7):1594–1605 Doya K, Kimura H, Kawato M (2001) Neural mechanisms of learning and control. IEEE Control Syst Mag 21(4):42–54 Fan QY, Wang D, Xu B (2022) H∞ codesign for uncertain nonlinear control systems based on policy iteration method. IEEE Trans Cybern 52(10):10101–10110 Fan QY, Yang GH (2016) Adaptive actor-critic design-based integral sliding-mode control for partially unknown nonlinear systems with input disturbances. IEEE Trans Neural Networks Learn Syst 27(1):165–177


Fu H, Chen X, Wang W, Wu M (2020) MRAC for unknown discrete-time nonlinear systems based on supervised neural dynamic programming. Neurocomputing 384:30–141 Gao W, Jiang ZP (2016) Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Trans Automat Control 61(12):4164–4169 Gao W, Jiang ZP (2019) Adaptive optimal output regulation of time-delay systems via measurement feedback. IEEE Trans Neural Networks Learn Syst 30(3):938–945 Gao W, Mynuddin M, Wunsch DC, Jiang ZP (2022) Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Trans Neural Networks Learn Syst 33(10):5229–5240 Han H, Wu X, Qiao J (2019) A self-organizing sliding-mode controller for wastewater treatment processes. IEEE Trans Control Syst Technol 27(4):1480–1491 Han X, Zhao X, Karimi HR, Wang D, Zong G (2022) Adaptive optimal control for unknown constrained nonlinear systems with a novel quasi-model network. IEEE Trans Neural Networks Learn Syst 33(7):2867–2878 Ha M, Wang D, Liu D (2020) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern: Syst 50(9):3158–3168 Ha M, Wang D, Liu D (2021a) Generalized value iteration for discounted optimal control with stability analysis. Syst Control Lett 147(104847):1–7 Ha M, Wang D, Liu D (2021b) Neural-network-based discounted optimal control via an integrated value iteration with accuracy guarantee. Neural Networks 144:176–186 Ha M, Wang D, Liu D (2022a) A novel value iteration scheme with adjustable convergence rate. IEEE Trans Neural Networks Learn Syst (in press) Ha M, Wang D, Liu D (2022b) Discounted iterative adaptive critic designs with novel stability analysis for tracking control. IEEE/CAA J Automat Sinica 9(7):1262–1272 Ha M, Wang D, Liu D (2022c) Offline and online adaptive critic control designs with stability guarantee through value iteration. IEEE Trans Cybern 52(12):13262–13274 Haykin S (2009) Neural networks and learning machines, 3rd edn. Pearson Prentice Hall, Upper Saddle River, New Jersey He H, Ni Z, Fu J (2012) A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 78:3–13 Heydari A (2014) Revisiting approximate dynamic programming and its convergence. IEEE Trans Cybern 44(12):2733–2743 He H, Zhong X (2018) Learning without external reward. IEEE Comput Intell Mag 13(3):48–54 Huo Y, Wang D, Qiao J (2022) Adaptive critic optimization to decentralized event-triggered control of continuous-time nonlinear interconnected systems. Optimal Control Appl Methods 43(1):198– 212 Jiang Y, Fan J, Gao W, Chai T, Lewis FL (2020a) Cooperative adaptive optimal output regulation of nonlinear discrete-time multi-agent systems. Automatica 121:109149 Jiang Y, Kiumarsi B, Fan J, Chai T, Li J, Lewis FL (2020b) Optimal output regulation of linear discrete-time systems with unknown dynamics using reinforcement learning. IEEE Trans Cybern 50(7):3147–3156 Jiang Z, Jiang Y (2013) Robust adaptive dynamic programming for linear and nonlinear systems: An overview. Eur J Control 19(5):417–425 Jiang Y, Jiang ZP (2015) Global adaptive dynamic programming for continuous-time nonlinear systems. IEEE Trans Automat Control 60(11):2917–2929 Jiang H, Zhang H (2018) Iterative ADP learning algorithms for discrete-time multi-player games. 
Artif Intell Rev 50(1):75–91 Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: A survey. IEEE Trans Neural Networks Learn Syst 29(6):2042–2062 LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444 Lewis FL, Liu D (2013) Reinforcement learning and approximate dynamic programming for feedback control. John Wiley, New Jersey


Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag 32(6):76– 105 Liang M, Wang D, Liu D (2020a) Improved value iteration for neural-network-based stochastic optimal control design. Neural Networks 124:280–295 Liang M, Wang D, Liu D (2020b) Neuro-optimal control for discrete stochastic processes via a novel policy iteration algorithm. IEEE Trans Syst Man Cybern: Syst 50(11):3972–3985 Li H, Liu D, Wang D (2018) Manifold regularized reinforcement learning. IEEE Trans Neural Networks Learn Syst 29(4):932–943 Li C, Ding J, Lewis FL, Chai T (2021) A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 129(109687):1–9 Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Automat Control 51:1249–1260 Liu D, Wang D, Zhao D, Wei Q, Jin N (2012) Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming. IEEE Trans Automat Sci Eng 9(3):628–634 Liu D, Li H, Wang D (2013) Data-based self-learning optimal control: Research progress and prospects. Acta Automat Sinica 39(11):1858–1870 Liu D, Li H, Wang D (2015) Error bounds of adaptive dynamic programming algorithms for solving undiscounted optimal control problems. IEEE Trans Neural Networks Learn Syst 26(6):1323– 1334 Liu D, Wei Q, Wang D, Yang X, Li H (2017) Adaptive dynamic programming with applications in optimal control. Springer, London Liu D, Xu Y, Wei Q, Liu X (2018) Residential energy scheduling for variable weather solar energy based on adaptive dynamic programming. IEEE/CAA J Automat Sinica 5(1):36–46 Liu D, Xue S, Zhao B, Luo B, Wei Q (2021) Adaptive dynamic programming for control: A survey and recent advances. IEEE Trans Syst Man Cybern: Syst 51(1):142–160 Li J, Xiao Z, Fan J, Chai T, Lewis FL (2022) Off-policy Q-learning: Solving Nash equilibrium of multi-player games with network-induced delay and unmeasured state. Automatica 136:1–7 Luo B, Yang Y, Liu D (2018) Adaptive Q-learning for data-based optimal output regulation with experience replay. IEEE Trans Cybern 48(12):3337–3348 Luo B, Yang Y, Liu D, Wu HN (2020a) Event-triggered optimal control with performance guarantees using adaptive dynamic programming. IEEE Trans Neural Networks Learn Syst 31(1):76–88 Luo B, Yang Y, Wu HN, Huang T (2020b) Balancing value iteration and policy iteration for discretetime control. IEEE Trans Syst Man Cybern: Syst 50(11):3948–3958 Luo B, Yang Y, Liu D (2021) Policy iteration Q-learning for data-based two-player zero-sum game of linear discrete-time systems. IEEE Trans Cybern 51(7):3630–3640 Lu J, Wei Q, Wang FY (2020) Parallel control for optimal tracking via adaptive dynamic programming. IEEE/CAA J Automat Sinica 7(6):1662–1674 Lv Y, Ren X (2019) Approximate Nash solutions for multiplayer mixed-zero-sum game with reinforcement learning. IEEE Trans Syst Man Cybern: Syst 49(12):2739–2750 Modares H, Lewis FL (2014a) Linear quadratic tracking control of partially-unknown continuoustime systems using reinforcement learning. IEEE Trans Automat Control 59(11):3051–3056 Modares H, Lewis FL (2014b) Optimal tracking control of nonlinear partially-unknown constrainedinput systems using integral reinforcement learning. Automatica 50(7):1780–1792 Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. 
IEEE Trans Syst Man Cybern-Part C: Appl Rev 32(2):140–153 Mu C, Wang D (2017) Neural-network-based adaptive guaranteed cost control of nonlinear dynamical systems with matched uncertainties. Neurocomputing 245:46–54 Mu C, Wang D, He H (2018) Data-driven finite-horizon approximate optimal control for discretetime nonlinear systems using iterative HDP approach. IEEE Trans Cybern 48(10):2948–2961


Na J, Lv Y, Zhang K, Zhao J (2022) Adaptive identifier-critic based optimal tracking control for nonlinear systems with experimental validation. IEEE Trans Syst Man Cybern: Syst 52(1):459– 472 Narayanan V, Modares H, Jagannathan S (2020) Event-triggered control of input-affine nonlinear interconnected systems using multiplayer game. Int J Robust Nonlinear Control 31:950–970 Pang B, Jiang ZP (2021) Adaptive optimal control of linear periodic systems: An off-policy value iteration approach. IEEE Trans Automat Control 66(2):888–894 Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Networks 8(5):997– 1007 Schultz W (2004) Neural coding of basic reward terms of animal learning theory, game theory, microeconomics and behavioural ecology. Curr Opin Neurobiol 14(2):139–147 Si J, Barto AG, Powell WB, Wunsch DC (2004) Handbook of learning and approximate dynamic programming. Wiley-IEEE Press, New Jersey Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, Hassabis D (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529:484–489 Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans Neural Networks 12(2):264–276 Song R, Lewis FL, Wei Q, Zhang H (2016) Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans Cybern 46(5):1041–1050 Song R, Wei Q, Zhang H, Lewis FL (2021) Discrete-time non-zero-sum games with completely unknown dynamics. IEEE Trans Cybern 51(6):2929–2943 Song R, Zhu L (2019) Optimal fixed-point tracking control for discrete-time nonlinear systems via ADP. IEEE/CAA J Automat Sinica 6(3):657–666 Sutton RS, Barto AG (2018) Reinforcement learning: an introduction, 2nd edn. The MIT Press, Cambridge, Massachusetts Vamvoudakis KG (2017) Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach. Syst Control Lett 100:14–20 Vamvoudakis KG, Lewis FL (2010) Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5):878–888 Vrabie D, Vamvoudakis KG, Lewis FL (2013) Optimal adaptive control and differential games by reinforcement learning principles. IET, London Wang FY, Zhang H, Liu D (2009) Adaptive dynamic programming: an introduction. IEEE Comput Intell Mag 4(2):39–47 Wang FY, Jin N, Liu D, Wei Q (2011) Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans Neural Networks 22(1):24–36 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlinear robust optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern: Syst 46(11):1544–1555 Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: A survey. IEEE Trans Cybern 47(10):3429–3451 Wang D, Ha M, Qiao J (2020a) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Automat Control 65(3):1272–1279 Wang D, Ha M, Qiao J, Yan J, Xie Y (2020b) Data-based composite control design with critic intelligence for a wastewater treatment platform. 
Artif Intell Rev 53(5):3773–3785 Wang D, Ha M, Qiao J (2021a) Data-driven iterative adaptive critic control towards an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, Zhao M, Ha M, Ren J (2021b) Neural optimal tracking control of constrained nonaffine systems with a wastewater treatment application. Neural Networks 143:121–132


Wang D, Zhao M, Qiao J (2021c) Intelligent optimal tracking with asymmetric constraints of a nonlinear wastewater treatment system. Int J Robust Nonlinear Control 31(14):6773–6787 Wang D, Cheng L, Yan J (2022a) Self-learning robust control synthesis and trajectory tracking of uncertain dynamics. IEEE Trans Cybern 52(1):278–286 Wang D, Ha M, Cheng L (2022b) Neuro-optimal trajectory tracking with value iteration of discretetime nonlinear dynamics. IEEE Trans Neural Networks Learn Syst (in press) Wang D, Ha M, Zhao M (2022c) The intelligent critic framework for advanced optimal control. Artif Intell Rev 55(1):1–22 Wang D, Hu L, Zhao M, Qiao J (2022d) Adaptive critic for event-triggered unknown nonlinear optimal tracking design with wastewater treatment applications. IEEE Trans Neural Networks Learn Syst (in press) Wang D, Qiao J, Cheng L (2022e) An approximate neuro-optimal solution of discounted guaranteed cost control design. IEEE Trans Cybern 52(1):77–86 Wang D, Ren J, Ha M, Qiao J (2022f) System stability of learning-based linear optimal control with general discounted value iteration. IEEE Trans Neural Networks Learn Syst (in press) Wang D, Zhao M, Ha M, Qiao J (2022g) Stability and admissibility analysis for zero-sum games under general value iteration formulation. IEEE Trans Neural Networks Learn Syst (in press) Wang D, Liu D (2018) Learning and guaranteed cost control with event-based adaptive critic implementation. IEEE Trans Neural Networks Learn Syst 29(12):6004–6014 Wang D, Qiao J (2019) Approximate neural optimal control with reinforcement learning for a torsional pendulum device. Neural Networks 117:1–7 Wang D, Xu X (2022g) A data-based neural policy learning strategy towards robust tracking control design for uncertain dynamic systems. Int J Syst Sci 53(8):1719–1732 Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems. IEEE Trans Neural Networks Learn Syst 26(4):866–879 Wei Q, Liu D, Xu Y (2016) Neuro-optimal tracking control for a class of discrete-time nonlinear systems via generalized value iteration adaptive dynamic programming approach. Soft Comput 20(2):697706:1–10 Wei Q, Liu D, Lin Q, Song R (2018) Adaptive dynamic programming for discrete-time zero-sum games. IEEE Trans Neural Networks Learn Syst 29(4):957–969 Wei Q, Song R, Liao Z, Li B, Lewis FL (2020) Discrete-time impulsive adaptive dynamic programming. IEEE Trans Cybern 50(10):4293–4306 Wei Q, Wang L, Lu J, Wang FY (2022a) Discrete-time self-learning parallel control. IEEE Trans Syst Man Cybern: Syst 52(1):192–204 Wei Q, Zhu L, Li T, Liu D (2022b) A new approach to finite-horizon optimal control of discrete-time affine nonlinear systems via a pseudo-linear method. IEEE Trans Automat Control 67(5):2610– 2617 Werbos PJ (1974) Beyond regression: New tools for prediction and analysis in the behavioural sciences. Ph.D. dissertation, Harvard University Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. Neural Fuzzy Adapt Approach Handbook Intell Control 493–526 Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of intelligence. Gen Syst Yearbook 22:25–38 Werbos PJ (2008) ADP: The key direction for future research in intelligent control and understanding brain intelligence. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):898–900 Werbos PJ (2009) Intelligence in the brain: A theory of how it works and how to build it. 
Neural Networks 22(3):200–212 Xue S, Luo B, Liu D (2020) Event-triggered adaptive dynamic programming for zero-sum game of partially unknown continuous-time nonlinear systems. IEEE Trans Syst Man Cybern: Syst 50(9):3189–3199 Xue S, Luo B, Liu D, Gao Y (2022a) Event-triggered ADP for tracking control of partially unknown constrained uncertain systems. IEEE Trans Cybern 52(9):9001–9012


Xue S, Luo B, Liu D, Yang Y (2022b) Constrained event-triggered H∞ control based on adaptive dynamic programming with concurrent learning. IEEE Trans Syst Man Cybern: Syst 52(1):357– 369 Yang X, He H (2021) Event-driven H∞ -constrained control using adaptive critic learning. IEEE Trans Cybern 51(10):4860–4872 Yang X, He H, Zhong X (2021a) Approximate dynamic programming for nonlinear-constrained optimizations. IEEE Trans Cybern 51(5):2419–2432 Yang Y, Vamvoudakis KG, Modares H, Yin Y, Wunsch DC (2021b) Hamiltonian-driven hybrid adaptive dynamic programming. IEEE Trans Syst Man Cybern: Syst 51(10):6423–6434 Yang R, Wang D, Qiao J (2022a) Policy gradient adaptive critic design with dynamic prioritized experience replay for wastewater treatment process control. IEEE Trans Ind Inf 18(5):3150–3158 Yang X, Zeng Z, Gao Z (2022b) Decentralized neuro-controller design with critic learning for nonlinear-interconnected systems. IEEE Trans Cybern 52(11):11672–11685 Yang Y, Gao W, Modares H, Xu CZ (2022c) Robust actor-critic learning for continuous-time nonlinear systems with unmodeled dynamics. IEEE Trans Fuzzy Syst 30(6):2101–2112 Yan J, He H, Zhong X, Tang Y (2017) Q-learning-based vulnerability analysis of smart grid against sequential topology attacks. IEEE Trans Inf Forensics Secur 12(1):200–210 Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discretetime affine nonlinear systems with control constraints. IEEE Trans Neural Networks 20(9):1490– 1503 Zhang H, Liu D, Luo Y, Wang D (2013a) Adaptive dynamic programming for control: algorithms and stability. Springer, London Zhang H, Zhang X, Luo Y, Yang J (2013b) An overview of research on adaptive dynamic programming. Acta Automatica Sinica 39(4):303–311 Zhang H, Qin C, Jiang B, Luo Y (2014) Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Trans Cybern 44(12):2706–2718 Zhang H, Jiang H, Luo C, Xiao G (2017a) Discrete-time nonzero-sum games for multiplayer using policy-iteration-based adaptive dynamic programming algorithms. IEEE Trans Cybern 47(10):3331–3340 Zhang Q, Zhao D, Zhu Y (2017b) Event-triggered H∞ control for continuous-time nonlinear system via concurrent learning. IEEE Trans Syst Man Cybern: Syst 47(7):1071–1081 Zhang Q, Zhao D, Wang D (2018) Event-based robust control for uncertain nonlinear systems using adaptive dynamic programming. IEEE Trans Neural Networks Learn Syst 29(1):37–50 Zhao B, Liu D (2020) Event-triggered decentralized tracking control of modular reconfigurable robots through adaptive dynamic programming. IEEE Trans Ind Electron 67(4):3054–3064 Zhao Q, Xu H, Jagannathan S (2015) Neural network-based finite-horizon optimal control of uncertain affine nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 26(3):486– 499 Zhao D, Zhang Q, Wang D, Zhu Y (2016) Experience replay for optimal control of nonzero-sum game systems with unknown dynamics. IEEE Trans Cybern 46(3):854–865 Zhao B, Wang D, Shi G, Liu D, Li Y (2018) Decentralized control for large-scale nonlinear systems with unknown mismatched interconnections via policy iteration. IEEE Trans Syst Man Cybern: Syst 48(10):1725–1735 Zhong X, Ni Z, He H (2016) A theoretical foundation of goal representation heuristic dynamic programming. IEEE Trans Neural Networks Learn Syst 27(12):2513–2525 Zhong X, He H, Wang D, Ni Z (2018) Model-free adaptive control for unknown nonlinear zero-sum differential game. 
IEEE Trans Cybern 48(5):1633–1646 Zhu Y, Zhao D (2018) Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artif Intell Rev 49(4):531–547 Zhu Y, Zhao D (2022) Online minimax Q network learning for two-player zero-sum Markov games. IEEE Trans Neural Networks Learn Syst 33(3):1228–1241


Zhu Y, Zhao D, Li X (2017) Iterative adaptive dynamic programming for solving unknown nonlinear zero-sum game based on online data. IEEE Trans Neural Networks Learn Syst 28(3):714–725 Zhu Y, Zhao D, Li X, Wang D (2019) Control-limited adaptive dynamic programming for multibattery energy storage systems. IEEE Trans Smart Grid 10(4):4235–4244

Chapter 2

Event-Triggered Adaptive Optimal Regulation of Constrained Affine Systems

Abstract In this chapter, the event-triggered constrained near-optimal control problem for a class of discrete-time affine nonlinear systems is solved through heuristic dynamic programming (HDP). The proposed method can signally reduce the computational amount without deteriorating system stability. A non-quadratic performance index is introduced to overcome control constraints. Then, stability analysis of the event-triggered system with control constraints is investigated and a practical event-triggered control design algorithm is given. Three neural networks are built in the HDP scheme, which are designed to approximate the unknown dynamics, the value function, and the control law, respectively. Remarkably, an effective strategy is developed to initialize the weight matrices of the model network. Two examples are finally included to demonstrate the effectiveness of the present approach.

Keywords Adaptive dynamic programming · Control constraints · Event-triggered control · Heuristic dynamic programming · Neural networks · Nonlinear discrete-time system

2.1 Introduction

The actuator saturation is a universal phenomenon in practical control applications. Because of the existence of physical limitations, the control inputs need to satisfy some constraints in almost all practical systems. Therefore, if the controller is designed without considering the actuator saturation, the dynamic performance of the system will be poor or even the stability will not be guaranteed. The actuator of nonlinear dynamics possesses the nonanalytic nature and the exact system functions are unknown, which puts forward a challenge to control engineers (Zhang et al. 2009). How to design effective methods to overcome the actuator saturation has attracted more and more attention recently (Saberi et al. 1996; Sussmann et al. 1994; Bitsoris and Gravalou 1999). In (Berastein 1995), the actuator saturation was considered and the corresponding control policy was given. A non-quadratic performance index was introduced to deal with the control constraint for nonlinear systems by (Lyshevski 1998a, b). In (Zhang et al. 2009), the near-optimal control problem for nonlinear
discrete-time systems with control constraints was studied by using iterative adaptive dynamic programming (ADP) algorithm. Moreover, the iterative ADP algorithm was developed to search for the cost function of the constrained near-optimal control problem and the related convergence analysis was elaborated. As an effective method for solving optimal control forward in time, the ADP framework has been widely studied in (Werbos 1992; Al-Tamimi et al. 2008; Prokhorov and Wunsch 1997; Si and Wang 2001; Liu et al. 2017; Li et al. 2017; Wang et al. 2012). There are several synonyms used for ADP, including “approximate dynamic programming” (Werbos 1992; Al-Tamimi et al. 2008), “neuro-dynamic programming”, “adaptive critic designs” (Prokhorov and Wunsch 1997), “neural dynamic programming” (Si and Wang 2001), “reinforcement learning” (Li et al. 2017), and “adaptive dynamic programming” (Wang et al. 2012; Liu et al. 2017). In (Werbos 1992), ADP approaches were classified as heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), and globalized dual heuristic dynamic programming (GDHP). An enormous amount of studies have been carried out on ADP to solve optimal control problems. In order to improve electricity efficiency of the residential microgrid and to solve the residential energy scheduling problem, an action dependent HDP method was described in detail in (Liu et al. (2018)). Main results of adaptive-critic-based (or ADP-based) robust control of nonlinear continuous-time systems were reviewed in (Wang et al. 2017). In (Al-Tamimi et al. 2008), a critic network was established to approximate the value function, and an action network was employed to approximate the optimal control input. A single network DHP scheme which eliminates the action network was elaborated in (Wang and Liu 2013). In (Wang et al. 2012), a finite-horizon neuro-optimal tracking control policy for discrete-time nonlinear systems via the HDP technique was proposed. The finite-horizon optimal tracking controller was obtained, which ensured the critic network to approximate the optimal cost within an ε-error bound. The infinite-horizon robust optimal control problem for uncertain nonlinear systems was investigated by applying data-based adaptive critic designs in (Wang et al. 2016). In (Zhang et al. 2011), an innovative iterative ADP technique was investigated to solve continuous-time two-person nonlinear zero-sum differential games, where the optimal control pair was obtained and the performance index function was guaranteed to achieve the saddle point. A novel iterative two-stage DHP method was presented in (Zhang et al. 2014b), which was applied to solve the optimal control problem of discrete-time switched nonlinear systems with constrained inputs. The H∞ control design was presented in (Zhang et al. 2014a; Qin et al. 2016), where the ADP technique was employed to affine nonlinear discrete-time systems. In (Qin et al. 2014), a new framework was obtained by using ADP, in order to address the online optimal tracking control problem for continuous-time linear systems with unknown dynamics. With obvious advantages of saving network resources and lessening the computational cost, the event-triggered control method is a practical implementation for advanced control design without affecting system stability. Therefore, the eventtriggered control method is always proposed to improve resource utilization, instead of the time-triggered design. There are some existing results about event-triggered

controller design based on ADP algorithms for systems with unknown dynamics. The nonlinear discounted optimal regulation was handled under the event-driven adaptive critic framework in (Wang et al. 2017). In (Sahoo et al. 2013), a neural-networkbased event-triggered controller for a class of nonlinear discrete-time systems was proposed. In (Dong et al. 2017a), a novel adaptive event-triggered control method for nonlinear discrete-time systems with unknown dynamics was presented based on the HDP technique. In (Wang et al. 2017), using a neural dynamic programming approach, an event-triggered adaptive robust control method for continuous-time nonlinear systems was investigated. The event-triggered adaptive H∞ control problem for continuous-time nonlinear systems was addressed in (Zhang et al. 2017). Then, an event-triggered near-optimal control structure was developed for nonlinear continuous-time systems with control constraints in (Dong et al. 2017b). There exist various methods for control design of dynamical systems. However, many of these papers do not consider the event-triggered control technique to reduce the computational burden for general nonlinear discrete-time systems with control constraints. In this chapter, we investigate the event-triggered control design for discrete-time-constrained nonlinear systems (Ha et al. 2020). The contributions of this chapter can be summarized as follows: (1) Introduce a novel event-triggered technique to substantially reduce the computational burden for a class of discrete-time affine nonlinear systems with control constraints and apply a non-quadratic functional to solve the actuator saturation problem. (2) Establish a new event-triggering condition with control constraints and conduct stability analysis towards the eventtriggered system with actuator saturation. (3) Propose a novel weight initialization approach, which greatly improves the learning performance of the model neural network. (4) Implement the event-triggered HDP algorithm with control constraints, which acquires the near-optimal control input as well as reduces the computational cost markedly.

2.2 Problem Description

Owing to the simple structure and the representative role of this class of systems, it is useful to study the control design of nonlinear systems possessing the input-affine form. In this chapter, we consider discrete-time affine nonlinear systems described by

x(k + 1) = f(x(k)) + g(x(k))u(k),    (2.1)

where x(k) ∈ Rn is the state vector, u(k) ∈ Rm is the control vector, and f : Rn → Rn and g : Rn → Rn×m are differentiable with respect to their arguments with f (0) = 0. Assumption 2.1 Assume that the nonlinear system (2.1) is controllable and observable. Assume that f + gu is Lipschitz continuous on a set Ω in Rn containing the origin. Assume that x(k) = 0 is a unique equilibrium state of system (2.1) under the control u(k) = 0.


Remark 2.1 The system (2.1) is controllable with Assumption 2.1. This means there exists at least one continuous state feedback control policy u(k) = μ(x(k)) that can asymptotically stabilize the system.
Denote the set of control signals with input constraints as

Ωu = {u(k) | u(k) = [u_1(k), u_2(k), . . . , u_m(k)]^T ∈ R^m, |u_q(k)| ≤ ū_q, q = 1, 2, . . . , m},    (2.2)

where ū_q is the saturation bound of the qth actuator. Following the handling method of (Zhang et al. 2009), we let Ū ∈ R^{m×m} be the constant diagonal matrix given by Ū = diag{ū_1, ū_2, . . . , ū_m}. In the event-triggered control design, we introduce a monotonically increasing subsequence of time instants {k_i}_{i=0}^{∞} and call them sampling instants (Eqtami et al. 2010). The control inputs are only updated at the sampling instants k_0, k_1, k_2, . . . . It means that the control signals applied to the actuators are not changed between k_i and k_{i+1}. Therefore, the feedback control policy can be rewritten as

u(k) = μ(x(k_i)),    (2.3)

where x(ki ) is the state vector at the time instant ki and ki ≤ k < ki+1 , i = 0, 1, 2, . . . . Define the triggering error as e(k) = x(ki ) − x(k)

(2.4)

for ki ≤ k < ki+1 with i = 0, 1, 2, . . . , where x(k) is the current state and x(ki ) is the sampled state held by the Zero-Order Hold (ZOH). By substituting (2.4) into (2.3), we obtain u(k) = μ(e(k) + x(k)).

(2.5)

Therefore, the feedback control system can be derived as x(k + 1) = f (x(k)) + g(x(k))μ(e(k) + x(k)).

(2.6)

In this chapter, our objective is to find an optimal state feedback controller for system (2.1) that can minimize the generalized performance functional

J(x(k), μ(·)) = Σ_{j=k}^{∞} U(x(j), μ(e(j) + x(j))),    (2.7)

where U(x(j), μ(e(j) + x(j))) is the utility function and μ(e(j) + x(j)) = μ(x(k_i)) is the control policy held by the ZOH. The utility function is defined as


U(x(k), μ(e(k) + x(k))) = U(x(k), μ(x(k_i))) = x^T(k)Qx(k) + W(μ(x(k_i)))    (2.8)

with U(0, 0) = 0, where W(μ(x(k_i))) ∈ R is positive definite and the weight matrix Q is also positive definite. For the control problem without considering input constraints, the utility component W(μ(x(k_i))) is often chosen as the quadratic form of the control input μ(x(k)). For the purpose of solving the constrained control problem and inspired by (Beard 1995), a non-quadratic functional is defined as follows:

W(μ(x(k_i))) = 2 ∫_0^{μ(x(k_i))} ϕ^{−T}(Ū^{−1}s) Ū R ds,    (2.9)

ϕ^{−1}(μ(x(k_i))) = [φ^{−1}(μ_1(x(k_i))), φ^{−1}(μ_2(x(k_i))), . . . , φ^{−1}(μ_m(x(k_i)))]^T,    (2.10)

where s ∈ R^m, ϕ ∈ R^m, ϕ^{−T} denotes (ϕ^{−1})^T, and φ(·) is a bounded one-to-one function satisfying |φ(·)| ≤ 1 and belonging to C_p (p ≥ 1) and L_2(Ω). Furthermore, it is a monotonically increasing odd function, whose first derivative is bounded by a constant M. Actually, it is not difficult to find such a function. One example is the hyperbolic tangent function φ(·) = tanh(·). In addition, the matrix R is positive definite and assumed to be diagonal for simplicity of analysis. In this situation, it is evident that W(μ(x(k_i))) is positive definite since φ^{−1}(·) is a monotonic odd function and R is positive definite (Zhang et al. 2009). On the basis of Bellman's optimality principle (Lyshevski 1998b), the optimal value function J*(x) conforms to the following HJB equation:

J*(x(k)) = min_{μ(x(k_i))} {x^T(k)Qx(k) + W(μ(x(k_i))) + J*(x(k + 1))}.    (2.11)

The optimal control law μ*(x(k_i)) at time instant k_i is defined as

μ*(x(k_i)) = arg min_{μ(x(k_i))} {x^T(k)Qx(k) + W(μ(x(k_i))) + J*(x(k + 1))}.    (2.12)
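To make the constrained utility concrete, a minimal numerical sketch of (2.9) is given below for a scalar input channel with φ = tanh (so φ⁻¹ = arctanh); the function names, the bound ū = 0.3, and R = 1 are illustrative assumptions rather than settings of this chapter's design.

```python
import numpy as np

def W_nonquadratic(mu, u_bar=0.3, R=1.0, n_grid=2000):
    # W(mu) = 2 * integral_0^mu arctanh(s / u_bar) * u_bar * R ds, valid for |mu| < u_bar
    s = np.linspace(0.0, mu, n_grid)
    integrand = np.arctanh(s / u_bar) * u_bar * R
    # trapezoidal rule
    return 2.0 * np.sum((integrand[1:] + integrand[:-1]) * np.diff(s)) / 2.0

def W_closed_form(mu, u_bar=0.3, R=1.0):
    # Closed form of the same integral for the tanh case, used only as a check.
    r = mu / u_bar
    return 2.0 * R * (u_bar * mu * np.arctanh(r) + 0.5 * u_bar**2 * np.log(1.0 - r**2))

print(W_nonquadratic(0.2), W_closed_form(0.2))  # both positive and in close agreement
```

The closed form confirms that the numerically evaluated term is positive for any nonzero admissible input, in line with the positive definiteness discussion above.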

In the next section, we prove that the event-triggered system (2.6) is asymptotically stable under a proper event-triggering condition.

2.3 Stability Analysis of the Event-Triggered Control System

A control law u(k) is defined to be admissible with regard to (2.7) on Ω, if u(k) is continuous with μ(x(k)) ∈ Ωu, ∀x(k) ∈ Ω, stabilizes (2.6) on Ω, satisfies u(k) = 0 if x(k) = 0, and renders J(x_0) finite for all x_0 ∈ Ω (Abu-Khalaf and Lewis 2005). For simplicity of analysis, we discuss the constant diagonal saturation bound matrix Ū ∈ R^{m×m} with m = 1 of system (2.6). Then, we have ū = Ū and hence ‖μ(x(k_i))‖ ≤ ū. In this chapter, we define an event-triggering condition

‖e(k)‖ ≤ e_T,    (2.13)

where e_T is the threshold. The control action will be updated when the event occurs in the sense that the event-triggering condition (2.13) is violated. Meanwhile, the event-triggered error (2.4) will be reset to zero. Therefore, a ZOH is needed to maintain the event-triggered control input μ(x(k_i)) during k_i ≤ k < k_{i+1} until the next event occurs. By substituting (2.4) into the system functions f(x(k)) and g(x(k)), we have

f(x(k)) = f(x(k_i) − e(k)),    (2.14)
g(x(k)) = g(x(k_i) − e(k)).    (2.15)

Here, we give some assumptions as follows.

Assumption 2.2 For system (2.6), there exist two positive constants F and G, such that

‖f(x(k) − e(k))‖ ≤ F‖x(k)‖ + F‖e(k)‖,    (2.16)
‖g(x(k) − e(k))‖ ≤ G‖x(k)‖ + G‖e(k)‖.    (2.17)

Assumption 2.3 There exist positive constants L_1, α, and β, a continuously differentiable function V : R^n → R≥0, and class K∞ functions α_1 and α_2, such that

α_1(‖x‖) ≤ V(x(k)) ≤ α_2(‖x‖), ∀x ∈ R^n,    (2.18)
V(f(x(k)) + g(x(k))μ(e(k) + x(k))) − V(x(k)) ≤ −αV(x(k)) + β‖e(k)‖,    (2.19)
α_1^{−1}(‖x‖) ≤ L_1‖x‖,    (2.20)

where V is called an input-to-state stable (ISS) Lyapunov function for system (2.6) (Jiang and Wang 2001).

Remark 2.2 A continuous function z(a) with z : [0, a) → [0, ∞) belongs to class K∞ functions, if it is a strictly monotonically increasing function, z(0) = 0, and

lim_{r→∞} z(r) = ∞.    (2.21)

Remark 2.3 As is illustrated in classic Lyapunov theory, a continuous function V : Rn → R ≥ 0 is an ISS-Lyapunov function for system (2.6), if the K ∞ functions α1 and α2 satisfy (2.18) and (2.20). Moreover, for the positive constants α and β, V (x) needs to satisfy (2.19).


Lemma 2.1 The event-triggered error for the proposed discrete-time constrained nonlinear system needs to satisfy

‖e(k)‖ ≤ (F + Gū) · (1 − (F + Gū)^{k−k_i}) / (1 − (F + Gū)) · ‖x(k_i)‖.    (2.22)

Proof Based on (Eqtami et al. 2010), we have

‖e(k + 1)‖ ≤ ‖x(k + 1)‖,    (2.23)

where k ∈ [k_i, k_{i+1}). According to (2.4), we can get e(k + 1) = x(k_i) − x(k + 1). If k = k_i, it is clear that e(k) = 0 holds. Based on Assumption 2.2 and substituting (2.4) and (2.6) into (2.23), we obtain

‖e(k + 1)‖ ≤ ‖f(x(k)) + g(x(k))μ(x(k_i))‖ = ‖f(x(k_i) − e(k)) + g(x(k_i) − e(k))μ(x(k_i))‖ ≤ ‖f(x(k_i) − e(k))‖ + ‖g(x(k_i) − e(k))‖ū ≤ (F + Gū)‖x(k_i)‖ + (F + Gū)‖e(k)‖.    (2.24)

Hence, by further expanding (2.24), we can derive that

‖e(k)‖ ≤ (F + Gū)‖x(k_i)‖ + (F + Gū)‖e(k − 1)‖
       ≤ (F + Gū)((F + Gū)‖x(k_i)‖ + (F + Gū)‖e(k − 2)‖) + (F + Gū)‖x(k_i)‖
       ≤ (F + Gū)^{k−k_i}‖e(k_i)‖ + (F + Gū)^{k−k_i}‖x(k_i)‖ + (F + Gū)^{k−k_i−1}‖x(k_i)‖ + · · · + (F + Gū)‖x(k_i)‖.    (2.25)

The corresponding threshold e_T can be obtained by solving (2.25) with ‖e(k_i)‖ = 0. Then, the event-triggering condition (2.13) can be rewritten as

‖e(k)‖ ≤ e_T = (F + Gū) · (1 − (F + Gū)^{k−k_i}) / (1 − (F + Gū)) · ‖x(k_i)‖.    (2.26)
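The threshold rule (2.26) and the induced triggering test translate directly into code. The sketch below is a minimal illustration; F, G, and ū are assumed design constants chosen so that 0 < F + Gū < 1, and the function names are hypothetical.

```python
import numpy as np

def triggering_threshold(k, k_i, x_ki, F=0.05, G=0.05, u_bar=1.0):
    # e_T of (2.26), using the state sampled at the last triggering instant k_i
    a = F + G * u_bar
    return a * (1.0 - a ** (k - k_i)) / (1.0 - a) * np.linalg.norm(x_ki)

def event_occurs(x_k, x_ki, k, k_i, **kwargs):
    # An event occurs (the control must be updated) when ||e(k)|| exceeds e_T,
    # i.e., when condition (2.13) is violated.
    return np.linalg.norm(x_ki - x_k) > triggering_threshold(k, k_i, x_ki, **kwargs)
```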

In the following part, we prove that the system (2.6) is asymptotically stable under the above condition.


Theorem 2.1 The system (2.6) with event-triggered constrained control is asymptotically stable under Assumptions 2.2 and 2.3, if 0 < F + Gū < 1 and the ISS-Lyapunov function V(x(k)) satisfies

V(x(k)) ≤ −ζαV(x(k_i))(k − k_i) + V(x(k_i)),    (2.27)

where α ∈ (0, 1) and 0 < ζ ≤ 1. Since k − k_i ≥ 1, we find that

(1 − α)^{k−k_i} < 1 − α.    (2.37)

Based on (2.37), we can further get

(1 − (1 − α)^{k−k_i}) / (1 − (1 − α)) · α² = α(1 − (1 − α)^{k−k_i}),

where α_a > 0 is the learning rate of the action network used in the weight update (2.61).


2.4.3 The Whole Design Procedure

In this part, we present the detailed design procedure towards the HDP-based event-triggered constrained control and summarize it in Algorithm 1.

Algorithm 1 Event-Triggered Constrained Control Design Algorithm
1: Initialize the matrices Q and R, the control constraint Ū, and the data set xm(k) randomly chosen for training the model network. Select the initial state vector x(0) and the initial control vector u(0).
2: Construct the model network and pre-train the weight matrix ω1m from the input layer to the hidden layer by using the data set xm(k).
3: Keep ω1m unchanged after the pre-training process. Then, train the model network as usual, where only the weight matrix ω2m from the hidden layer to the output layer is updated.
4: Construct the critic network and the action network.
5: Set the initial time step as k = 0 and the maximal number of time steps as kmax. Select two appropriate positive constants F and G.
6: while k ≤ kmax do
7:   Compute the triggering threshold eT according to (2.26).
8:   Obtain the system state vector x(k) through the model network.
9:   if ‖x(k) − x(ki)‖ ≥ eT then
10:    Update the critic weight matrix from the hidden layer to the output layer via (2.57).
11:    Update the action weight matrix from the hidden layer to the output layer via (2.61).
12:    Set x(ki) = x(k) and update the control input μ(x(ki)) through the action network.
13:  else
14:    Keep the control input μ(x(ki)) and the sampled state x(ki) unchanged by using a ZOH.
15:  end if
16:  Let k ← k + 1.
17: end while
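To show how the runtime portion of Algorithm 1 (steps 6-17) fits together, a schematic Python sketch follows. The callables action_network and model_network are hypothetical stand-ins for the trained networks, all constants are illustrative, and the training of the three networks is omitted.

```python
import numpy as np

def run_event_triggered_control(x0, action_network, model_network,
                                k_max=500, F=0.05, G=0.05, u_bar=0.3):
    a = F + G * u_bar
    x, x_sampled, k_i = np.asarray(x0, float), np.asarray(x0, float), 0
    u = np.clip(action_network(x_sampled), -u_bar, u_bar)   # saturated control input
    updates, trajectory = 0, [x.copy()]
    for k in range(k_max):
        e_T = a * (1.0 - a ** (k - k_i)) / (1.0 - a) * np.linalg.norm(x_sampled)
        if np.linalg.norm(x_sampled - x) >= e_T:             # event: condition (2.13) violated
            x_sampled, k_i = x.copy(), k                     # sample the current state
            u = np.clip(action_network(x_sampled), -u_bar, u_bar)
            updates += 1
        # between events the ZOH keeps u unchanged
        x = model_network(x, u)                              # one-step state prediction
        trajectory.append(x.copy())
    return np.array(trajectory), updates
```

The returned counter mirrors the update statistics reported in the simulations below (number of control updates versus total time steps).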

Remark 2.4 As is stated in Algorithm 1, the weight matrix of the model network is initialized by applying the proposed pre-training approach, which greatly improves the accuracy of the model network. Additionally, Algorithm 1 copes with the constrained input while considerably reducing the computational burden. If the triggering condition ‖x(k) − x(ki)‖ ≥ eT is satisfied, then the system states are sampled and the control inputs are updated. In other words, Algorithm 1 only estimates the system states through the model network and maintains the control inputs by a ZOH when the triggering condition is not violated.

2.5 Simulation Experiments

To support the theoretical analysis and show the advantages of the proposed method, two simulation examples are provided in this section.

Example 2.1 Consider the following affine discrete-time system:


x(k + 1) = [x1(k) + 0.01x2(k), −0.09x1(k) + 0.97x2(k)]^T + [0, 0.0099]^T u(k),    (2.62)

where x(k) = [x1(k), x2(k)]^T ∈ R² and u(k) ∈ R are the state and control variables, respectively. The initial state vector is chosen as x(0) = [−1, 1]^T and the control constraint is set to |u| ≤ 0.3 so that Ū = 0.3. In order to form the utility of the cost function, we select the involved weight matrices as Q = I2 and R = I, respectively, where I denotes the identity matrix with suitable dimensions. Then, we have

U(k) = x^T(k) I2 x(k) + 2 ∫_0^{μ(x(ki))} tanh^{−T}(0.3^{−1}s) · 0.3 ds.    (2.63)

To begin with, we construct a model network with the structure of 3–8–2 (i.e., three input neurons, eight hidden neurons, and two output neurons). Under the learning rate αm = 0.1, the model network is trained by the proposed method. We employ 500 samples selected randomly as −1 ≤ x1 ≤ 1 and −1 ≤ x2 ≤ 1 to train the model network for 30 steps and use another 500 samples chosen in these sets to test the performance of the model network. The weight matrix from the input layer to the hidden layer is pre-trained by applying the gradient descent rule and then is fixed. Next, we start to train the weight matrix from the hidden layer to the output layer as usual. In order to ensure a fair comparison, we pre-train the weight matrix ω1m of the model network for only one epoch. Then, we use the general method to keep ω1m fixed and update ω2m for 20 epochs. Since we employ 500 training samples, the result has been improved greatly after one training epoch. The sum of square errors defined as (2.52) of each iteration step is displayed in Fig. 2.2, where the proposed weight pre-training approach and the traditional random initialization method are used, respectively. As is shown in Fig. 2.2, the weight matrix ω2m is updated for 30 times by the traditional method. However, the pre-training approach optimizes the initialization process for ω1m , which makes the convergence rate faster than the traditional method. We can obviously observe that in Fig. 2.2, the error E m of the proposed strategy comes to a small value more quickly than the traditional training method after several iterations. Remarkably, the beginning value of E m is smaller than the traditional method due to the pre-training process. After constructing the model network and training it, the network weight matrices are kept unchanged. Next, the critic network and the action network are constructed with the same structure of 2–8–1, where the learning rates are αc = 0.05 and αa = 0.05. Each network is trained for 200 iterations with each iteration of 1000 training steps, which ensures the given error bound  = 10−6 is reached. According to (2.26), we let F + G u¯ = 0.1 and obtain the event-triggering threshold as follows: eT =

(1 − 0.1^{k−ki}) / (1 − 0.1) · 0.1 · ‖x(ki)‖.    (2.64)


Fig. 2.2 Sum of square error of the model network (Example 2.1)

After training the critic network and the action network, we apply the obtained control law to the nonlinear plant for 500 time steps. For comparison, we also apply the traditional HDP method without considering control constraints to this discrete-time system and set the parameters the same as in our proposed method. Note that in these two situations, we adopt completely different control policies and performance functions. Specifically, the state trajectories obtained via the traditional HDP approach without considering the actuator saturation and by using the proposed event-triggered constrained control method are shown in Fig. 2.3. Here, similar state performances can be observed with these two methods. For the control trajectory of the traditional method, it is shown in Fig. 2.4 that there exists the phenomenon of actuator saturation from 10 to 80 time steps. However, such saturation does not exist in the control trajectory of the event-triggered constrained strategy. Besides, the changing trend of the threshold is illustrated in Fig. 2.5. In this simulation, the action network under the traditional HDP framework updates the control input at each time step (that is, 500 times in total), while the action network of the proposed method only updates 114 times. Note that the actuator saturation actually exists in this system. Using the traditional method, if the control input generated by the action network overruns the control constraint, it is limited to the bounded value. However, by adopting the event-triggered constrained control method, the control input is updated and the sampled state is reset to the current state only when the state error e(k) triggers an event. Before the occurrence of the next event, the control input is maintained by a ZOH. Overall, the proposed method substantially reduces the computational cost as well as overcomes the problem of actuator saturation, since the control input is only updated at the triggering event.


Fig. 2.3 State trajectories under two aforementioned methods (Example 2.1)


Fig. 2.4 Control trajectories under two aforementioned methods (Example 2.1)



Fig. 2.5 The triggering threshold curve (Example 2.1)

Example 2.2 Consider the dynamics of a classical torsional pendulum used in (Dong et al. 2017a):

dθ/dt = ω,
K dω/dt = u − Mgl sin θ − f_d dθ/dt,    (2.65)

where K = (4/3)Ml², the mass is M = 1/3 kg, the length of the pendulum bar is l = 3/2 m, the acceleration of gravity is g = 9.8 m/s², and the frictional factor is f_d = 0.2. Letting x1(k) = θ(k) and x2(k) = ω(k), we discretize the above dynamics with a sampling period Δt = 0.1 s (Liu and Wei 2014). Then, the derived nonlinear discrete-time system can be rewritten as follows:

x(k + 1) = [x1(k) + 0.1x2(k), −0.49 sin(x1(k)) + 0.98x2(k)]^T + [0, 0.1]^T u(k),    (2.66)

where x(k) = [x1 (k), x2 (k)]T ∈ R2 and u(k) ∈ R are the state variable and the control variable, respectively. In this simulation, we set weight matrices as Q = I2 and R = 0.01I and initialize the state vector to be x(0) = [−1, 1]T . The control constraint is chosen as |u| ≤ 2 and hence U¯ = 2. The utility function is formulated as follows:


Fig. 2.6 Sum of square error of the model network (Example 2.2)

U(k) = x^T(k) I2 x(k) + 2 ∫_0^{μ(x(ki))} tanh^{−T}(2^{−1}s) · 0.02 ds.    (2.67)
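Before turning to the training results, the passage from (2.65) to (2.66) can be checked numerically: with the stated parameters, a forward-Euler step of length Δt = 0.1 s reproduces the discrete-time model. The following sketch performs this check; the test point is arbitrary.

```python
import numpy as np

M, l, g, fd, dt = 1.0 / 3.0, 1.5, 9.8, 0.2, 0.1
K = 4.0 / 3.0 * M * l**2          # equals 1 for these parameters

def pendulum_euler_step(x, u):
    theta, omega = x
    domega = (u - M * g * l * np.sin(theta) - fd * omega) / K
    return np.array([theta + dt * omega, omega + dt * domega])

def model_266(x, u):
    # the stated discrete-time model (2.66)
    x1, x2 = x
    return np.array([x1 + 0.1 * x2, -0.49 * np.sin(x1) + 0.98 * x2 + 0.1 * u])

x, u = np.array([-1.0, 1.0]), 0.5
print(pendulum_euler_step(x, u), model_266(x, u))   # the two results agree
```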

First, we apply the initialization approach to establish the model network. The sampling numbers and the training times are set the same as in Example 2.1. The sum of square error results of the traditional approach and the proposed initialization method are presented in Fig. 2.6, which further verifies the effectiveness of the weight initialization method proposed in this chapter. After training the model network, the weight matrices of the model network are fixed. Then, we start to train the critic network and the action network. The structures of the critic network and the action network are both 2–8–1 under the learning rates αc = 0.05 and αa = 0.05. In this case, we still let F + Gū = 0.1 and also obtain the triggering threshold in the form of (2.64). In order to compare with the traditional HDP method, we also construct a control scheme regardless of the actuator saturation. Fig. 2.7 shows the state trajectories obtained, respectively, with the traditional HDP framework and the proposed method. We can observe that the state convergence rate of our approach is very similar to that of traditional HDP. From Fig. 2.8, we can see the control convergence rates of the two methods are almost the same. In this example, the traditional HDP technique needs to update the control input at all 200 time steps, even if the control signal at the previous moment is the same as that at the latter moment. However, the proposed method only needs to update the control input 74 times. Observing Fig. 2.8, there exists the



Fig. 2.7 State trajectories under two aforementioned methods (Example 2.2)

phenomenon of actuator saturation, which is presented in the first several time steps. However, the obtained controller of our method is not affected by the constrained input in this example. From Fig. 2.9, we can see that the changing threshold curve converges to zero and is related to the sampled state x(ki ). These simulation results verify the effectiveness of the present approach.

2.6 Conclusions

In this chapter, we propose an effective HDP algorithm to design the adaptive event-triggered controller for a class of discrete-time nonlinear systems with constrained inputs. The non-quadratic performance index is introduced to conquer the control constraints. The triggering threshold for discrete-time nonlinear systems with constrained control is discussed and the stability analysis is presented by using the Lyapunov technique. Additionally, we develop a novel weight initialization approach, which greatly improves the approximation accuracy of the model network. Through comparison with traditional HDP, simulation results confirm the effectiveness of the proposed method and a significant reduction of the computational cost. It is worth noting that the above strategy is established only for affine nonlinear systems. Hence, it is also important to investigate how to extend the current results


Fig. 2.8 Control trajectories under two aforementioned methods (Example 2.2)


Fig. 2.9 The triggering threshold curve (Example 2.2)



to optimal regulation and trajectory tracking of nonaffine nonlinear systems. We will conduct further discussions on these topics in the following chapters.

References Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791 Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):943–949 Beard R (1995) Improving the closed-loop performance of nonlinear systems. Ph.D. dissertation, Electr. Eng. Dept., Rensselaer Polytech. Inst., Troy, NY Berastein DS (1995) Optimal nonlinear, but continuous, feedback control of systems with saturating actuators. Int J Control 62(5):1209–1216 Bitsoris G, Gravalou E (1999) Design techniques for the control of discrete-time systems subject to state and control constraints. IEEE Trans Automat Control 44(5):1057–1061 Dong L, Zhong X, Sun C, He H (2017) Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 28(7):1594–1605 Dong L, Zhong X, Sun C, He H (2017) Event-triggered adaptive dynamic programming for continuous-time systems with control constraints. IEEE Trans Neural Networks Learn Syst 28(8):1941–1952 Eqtami A, Dimarogonas DV, Kyriakopoulos KJ, (2010) Event-triggered control for discrete-time systems. In: Proceedings of the American control conference. pp 4719–4724 Ha M, Wang D, Liu D (2020) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern: Syst 50(9):3158–3168 Jiang ZP, Wang Y (2001) Input-to-state stability for discrete-time nonlinear systems. Automatica 37(6):857–869 Li J, Modares H, Chai T, Lewis FL, Xie L (2017) Off-policy reinforcement learning for synchronization in multi-agent graphical games. IEEE Trans Neural Networks Learn Syst 28(10):2434–2445 Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Networks Learn Syst 25(3):621–634 Liu D, Wei Q, Wang D, Yang X, Li H (2017) Adaptive dynamic programming with applications in optimal control. Springer, Cham, Switzerland Liu D, Xu Y, Wei Q, Liu X (2018) Residential energy scheduling for variable weather solar energy based on adaptive dynamic programming. IEEE/CAA J Automat Sinica 5(1):36–46 Lyshevski SE (1998) Optimal control of nonlinear continuous-time systems: Design of bounded controllers via generalized nonquadratic functionals. In: Proceedings of the American control conference. pp 205–209 Lyshevski SE (1998) Nonlinear discrete-time systems: Constrained optimization and application of nonquadratic costs. In: Proceedings of the American control conference. pp 3699–3703 Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Networks 8(5):997– 1007 Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming. Int J Control 87(5):1000–1009 Qin C, Zhang H, Wang Y, Luo Y (2016) Neural network-based online H∞ control for discrete-time affine nonlinear system using adaptive dynamic programming. Neurocomputing 198:91–99 Saberi A, Lin Z, Teel A (1996) Control of linear systems with saturating actuators. IEEE Trans Automat Control 41(3):368–378


Sahoo A, Xu H, Jagannathan S (2013) Neural network-based adaptive event-triggered control of affine nonlinear discrete time systems with unknown internal dynamics. In: Proceedings of the American control conference. pp 6418–6423 Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans Neural Networks 12(2):264–276 Sussmann H, Sontag ED, Yang Y (1994) A general result on the stabilization of linear systems using bounded controls. IEEE Trans Automat Control 39(12):2411–2425 Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: A survey. IEEE Trans Cybern 47(10):3429–3451 Wang D, He H, Zhong X, Liu D (2017) Event-driven nonlinear discounted optimal regulation involving a power system application. IEEE Trans Ind Electron 64(10):8177–8186 Wang D, Liu D (2013) Neuro-optimal control for a class of unknown nonlinear dynamic systems using SN-DHP technique. Neurocomputing 121:218–225 Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wang D, Liu D, Zhang Q, Zhao D (2016) Data-based adaptive critic designs for nonlinear robust optimal control with uncertain dynamics. IEEE Trans Syst Man Cybern: Syst 46(11):1544–1555 Wang D, Mu C, He H, Liu D (2017) Event-driven adaptive robust control of nonlinear systems with uncertainties through NDP strategy. IEEE Trans Syst Man Cybern: Syst 47(7):1358–1370 Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy and adaptive approaches (chapter 13). Van Nostrand Reinhold, New York Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discretetime affine nonlinear systems with control constraints. IEEE Trans Neural Networks 20(9):1490– 1503 Zhang H, Qin C, Jiang B, Luo Y (2014) Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Trans Cybern 44(12):2706–2718 Zhang H, Qin C, Luo Y (2014) Neural-network-based constrained optimal control scheme for discrete-time switched nonlinear system using dual heuristic programming. IEEE Trans Automat Sci Eng 11(3):839–849 Zhang H, Wei Q, Liu D (2011) An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games. Automatica 47(1):207–214 Zhang Q, Zhao D, Zhu Y (2017) Event-triggered H∞ control for continuous-time nonlinear system via concurrent learning. IEEE Trans Syst Man Cybern: Syst 47(7):1071–1081

Chapter 3

Self-Learning Optimal Regulation with Event-Driven Iterative Adaptive Critic

Abstract In this chapter, the self-learning optimal regulation for discrete-time nonaffine nonlinear systems under event-driven formulation is investigated. An eventbased adaptive critic algorithm is developed with a convergence discussion of the iterative process. The input-to-state stability (ISS) analysis for the present nonlinear plant is established. Then, a suitable triggering condition is proved to ensure the ISS of the controlled system. An iterative dual heuristic dynamic programming (DHP) strategy is adopted to implement the event-driven framework. As a special case discussion, the event-driven iterative adaptive critic design of affine nonlinear systems is also presented. Additionally, the mixed-driven control framework is highlighted involving data and event considerations and is integrated with the iterative adaptive critic algorithm. Simulation examples are carried out to demonstrate the applicability of the constructed event-driven method, respectively, implemented via DHP, heuristic dynamic programming, and neural dynamic programming. Remarkably, compared with the traditional DHP algorithm, the event-based algorithm is able to substantially reduce the updating times of the control input while still maintaining an impressive performance. Keywords Discrete-time nonlinear dynamics · Event-driven formulation · Iterative adaptive critic · Neural networks · Optimal regulation · Self-learning control

3.1 Introduction

Hamilton–Jacobi–Bellman (HJB) equations are always encountered when coping with nonlinear optimal control problems. How to effectively solve HJB equations is a challenging issue in the field of optimal control. Till now, there are rare efficient ways to obtain their analytical solutions. Nevertheless, adaptive dynamic programming (ADP) methods are able to obtain satisfying numerical solutions of general HJB equations. Over the past three decades, ADP has been successfully developed and widely adopted in several communities like control and decision. Originally speaking, ADP was proposed in (Werbos 1974) to approximately address optimal control problems, by using


adaptive critic structures and function approximator tools. The ADP algorithms can be categorized into several major groups, including heuristic dynamic programming (HDP), dual HDP (DHP), and globalized DHP (GDHP) (Werbos 1992). Then, the neural dynamic programming (NDP) technique was proposed in (Si and Wang 2001) with an emphasis on the action-dependent feature. Because of the self-learning-based intelligence property, ADP and its relevant methods have received great attention (Werbos 1992; Si and Wang 2001; Al-Tamimi et al. 2008; Wang et al. 2012, 2020e; Zhu and Zhao 2018; Zhang et al. 2009; Li et al. 2017, 2018; Wang et al. 2017). Particularly, the iterative adaptive critic algorithms with three main structures were constructed in (Al-Tamimi et al. 2008; Wang et al. 2012; Zhang et al. 2009). Al-Tamimi et al. (2008) shed light on the convergence proof and implementation procedure of the iterative HDP algorithm. In (Zhang et al. 2009), an iterative DHP approach was applied to adaptively control the constrained nonlinear systems by employing a nonquadratic performance index. It can be observed that the costate function was employed to update the iterative control law, which made the DHP technique surpass HDP. In (Wang et al. 2012), an iterative method based on the advanced GDHP technique was developed, which aimed at tackling the optimal control problem for unknown nonaffine systems. Wang et al. (2020e) successfully applied the iterative adaptive critic method to a complex wastewater treatment system. Zhu et al. (2018) reviewed the advantages of various ADP algorithms, provided comparison results, and further highlighted the important property of online learning. Moreover, a model-free optimal solution was elaborated in (Li et al. 2017) by using off-policy reinforcement learning and was applied to two-scale industrial processes. Li et al. (2018) also proposed a novel off-policy Q-learning algorithm to optimize the operational process for rougher flotation circuits. In recent years, the robustness of adaptive critic methods has also been studied, displaying a close relationship between robust and optimal control formulations (Wang et al. 2017; Wang and Liu 2018; Wang 2019, 2020a, b; Wang et al. 2020f; Zhang et al. 2018). This is helpful to extend the adaptive-critic-based optimal regulation schemes to address complex robust control problems. As mentioned by the survey (Wang et al. 2017), in order to enhance the resource utilization rate and reduce the computation cost, event-driven approaches are becoming more and more important to serve as an effective supplement to the classical time-based control manner. The key attribute of the event-based framework lies in that the control inputs are updated based on a certain triggering condition. To the best of our knowledge, considerable attention has been paid to developing various event-driven methods (Tallapragada and Chopra 2013; Batmani et al. 2017; Tabuada 2007; Dong et al. 2017; Zhong and He 2017; Vamvoudakis et al. 2017; Postoyan et al. 2015; Wang et al. 2017; Eqtami et al. 2010). Among them, two classical event-based frameworks were developed for stabilizing control scheduling (Tabuada 2007) and nonlinear tracking design (Tallapragada and Chopra 2013). Recently, two event-triggered adaptive algorithms were proposed by Zhong and He (2017) as well as Wang et al. (2017), which aimed at aperiodically controlling continuous-time systems according to certain triggering conditions. In addition, some researchers employed the event-triggered technique to solve the optimal tracking control problem (Vamvoudakis et al. 2017). Postoyan et al. (2015) brought the event-driven approach into the track-


ing controllers of unicycle mobile robots and investigated the related stabilization issue. Batmani et al. (2017) proposed an event-triggered suboptimal tracking control strategy for discrete-time systems and applied it to a laboratory three-tank system. Note the event-triggered approaches in (Postoyan et al. 2015; Batmani et al. 2017) were not related to the adaptive critic framework. Also for the discrete-time systems, Dong et al. (2017) described the event-based HDP algorithm, designed the adaptive controller, and analyzed the Lyapunov stability, showing that the weight estimation errors of the critic and action networks are uniformly ultimately bounded (UUB). Note that the consideration of adaptive critic under event-driven situation has been proposed in (Dong et al. 2017; Zhong and He 2017; Vamvoudakis et al. 2017; Wang et al. 2017), where only Dong et al. (2017) is for discrete-time systems. So far, no research has been conducted on the topic of iterative adaptive critic combined with discrete-time event-based formulation. Stability is the foundation of controlled systems and there exist abundant results including the input-to-state stability (ISS) (Sontag 2008). In (Jiang and Wang 2001) and (Huang et al. 2005), the ISS was extended to the discrete-time case, which provided us a powerful tool to analyze the discrete-time system stability. Under eventbased environment, it is also necessary to study the stability of discrete-time plants. In this chapter, we pay more attention to the ISS of the event-driven discrete-time system. Besides, we choose the iterative DHP algorithm to implement the proposed event-based method since the DHP technique often results in preferable performance against the basic HDP algorithm (Prokhorov et al. 1995). Overall, compared with existing results, the major contributions of this chapter lie in the development of the discrete-time iterative adaptive critic method under event-driven formulation and the ISS analysis of the event-based discrete-time system (Wang and Ha 2020c; Wang et al. 2020d). Notations: The following notations will be used throughout the chapter. R is the set of all real numbers while R+ denotes the nonnegative part. Rn is the Euclidean space of all n-dimensional real vectors. Rn×m is the space of all n × m real matrices.  ·  denotes the vector norm of a vector in Rn or the matrix norm of a matrix in Rn×m . In represents the n × n identity matrix. Let Ω be a compact subset of Rn and (Ω) be the set of admissible control laws on Ω. N = {0, 1, 2, . . . } denotes the set of all non-negative integers. In addition, the superscript “T” is taken to represent the transpose operation while f (g(·)) is denoted as f ◦ g(·). At last, the term I represents the identity function. Finally, we clarify the main expressions related to the control signal as in Table 3.1.

3.2 Problem Description

In view of the wide existence and the general form, it is necessary to study the feedback control problem of nonaffine nonlinear systems. In this chapter, we consider a type of discrete-time nonlinear systems with nonaffine structure given by


Table 3.1 Main expressions related to the control signal of this chapter (Symbol: Meaning)

u(k): A general control input at the time step k
μ(x(k)): A general feedback control law related to the state x
μ(x(s_j)): A general event-driven control law at the sampling instant s_j
μ*(x(s_j)): The desired event-driven optimal control law
μ^(i)(x(s_j)): The event-driven iterative control law of the i-th iteration
μ̂^(i)(x(s_j)): The approximate event-driven iterative control law
μ̂^(i)(x(k)): The approximate iterative control law with the zero-order hold
μ̂*(x(s_j)): The approximate event-driven optimal control law
μ̂*(x(k)): The approximate optimal control law with the zero-order hold

x(k + 1) = F(x(k), u(k)), k ∈ N,

(3.1)

where x(k) = [x1 (k), x2 (k), . . . , xn (k)]T ∈ Rn represents the state vector and u(k) = [u 1 (k), u 2 (k), . . . , u m (k)]T ∈ Rm denotes the control input. We let x(0) = x¯ be the initial state. Assume that the function F : Rn × Rm → Rn is continuous, and without loss of generality, that the origin x = 0 is a unique equilibrium point of system (3.1) under u = 0, i.e., F(0, 0) = 0. Besides, assume that system (3.1) can be stabilized on a set Ω ⊂ Rn by a continuous feedback law with the form u(k) = μ(x(k)). Under the event-triggered mechanism (Dong et al. (2017); Zhong and He (2017); Vamvoudakis et al. (2017); Wang et al. (2017)), we define a monotonically increasing sequence composed of different sampling instants and written as {s j }∞ j=0 with j ∈ N. The event-based control signal is only updated at discrete-time instants s0 , s1 , s2 , . . . . Then, the feedback control law can be denoted as u(k) = μ(x(s j )), where x(s j ) represents the state vector at the time instant k = s j , k ∈ [s j , s j+1 ), j ∈ N. In this situation, a zero-order hold is needed to keep the event-triggered control input at the time instant k = s j , until the next event occurs. Now, we define the event-triggered error vector as e(k) = x(s j ) − x(k), k ∈ [s j , s j+1 )

(3.2)

where x(s_j) is the sampled state and x(k) is the current state. Obviously, we have x(k) = x(s_j) and then e(k) = 0 at k = s_j, j ∈ N. Using this notation, the feedback control law can be rewritten as u(k) = μ(x(s_j)) = μ(x(k) + e(k)). Then, the closed-loop form of system (3.1) turns to be

x(k + 1) = F(x(k), μ(x(k) + e(k))), k ∈ N.    (3.3)

The stability of system (3.3) is the main focus of this chapter. During the optimal control design, we aim to find a feedback control law μ ∈ (Ω) to minimize the cost function


J(x(k)) = Σ_{p=k}^{∞} U(x(p), μ(x(s_j))) = Σ_{p=k}^{∞} U(x(p), μ(x(p) + e(p))),    (3.4)

where j ∈ N, U is the utility function, U (0, 0) = 0, and U (x, u) ≥ 0 for all x and u. In this chapter, the utility function is specifically selected as   U x(k), μ(x(s j )) = x T(k)Qx(k)+μT (x(s j ))Rμ(x(s j ))

(3.5)

and the positive definite matrices Q and R satisfy Q ∈ R^{n×n} and R ∈ R^{m×m}. According to Bellman's optimality principle, the optimal cost function defined as

J*(x(k)) = min_{μ(·)} Σ_{p=k}^{∞} U(x(p), μ(x(s_j)))    (3.6)

can be rewritten as

J*(x(k)) = min_{μ(x(s_j))} U(x(k), μ(x(s_j))) + min_{μ(·)} Σ_{p=k+1}^{∞} U(x(p), μ(x(s_j))).    (3.7)

That’s to say, the optimal cost J ∗ (x(k)) satisfies the discrete-time HJB equation     J ∗ (x(k)) = min U x(k), μ(x(s j )) + J ∗ (x(k + 1)) . μ(x(s j ))

(3.8)

The corresponding optimal control μ∗ (x(s j )) is denoted by μ∗ (x(s j )) = arg min

μ(x(s j ))

    U x(k), μ(x(s j )) + J ∗ (x(k + 1)) .

(3.9)

Observing (3.8) and (3.9), we find that the value of the next time step J*(x(k + 1)) and the real system model are needed to obtain the optimal cost J*(x(k)) and then the optimal control μ*(x(s_j)). Clearly, it is difficult or impossible to conduct for the nonlinear case with unknown dynamics. Hence, we will introduce an iterative adaptive critic algorithm to solve the discrete-time HJB equation in the next section. For system (3.3) and under the event-driven formulation, we define an event-triggering condition as follows:

‖e(k)‖ ≤ e_thr,    (3.10)


where ethr is the threshold for improving the event-driven control. It is important to note that the control input can only be updated in case that an event is triggered. This, actually, means that the corresponding event-triggering condition is violated. The primary problem of event-based control lies in how to determine a suitable triggering threshold. This also will be the main task of the next section.

3.3 Event-Driven Iterative Adaptive Critic Design via DHP

In this section, the iterative adaptive critic algorithm with convergence discussion and neural network implementation is derived, where the advanced DHP technique is employed.

3.3.1 Derivation and Convergence Discussion

The iterative adaptive critic algorithm of discrete-time nonlinear systems is performed in the following Algorithm 2. The main idea is to construct two iterative sequences {J^(i)(x(k))} and {μ^(i)(x(s_j))}, where i denotes the iteration index and i ∈ N, so as to conduct the value iteration process.

Algorithm 2 Iterative Adaptive Critic Algorithm
1: Choose a small positive number ε. Initialize the iteration index as i = 0 and the cost function as J^(0)(·) = 0. Set the maximal iteration number as i_max.
2: while i ≤ i_max do
3:   Solve the iterative control function by
       μ^(i)(x(s_j)) = arg min_{μ(x(s_j))} {U(x(k), μ(x(s_j))) + J^(i)(x(k + 1))}
                     = arg min_{μ(x(s_j))} {U(x(k), μ(x(s_j))) + J^(i)(F(x(k), μ(x(s_j))))}.    (3.11)
4:   Update the iterative cost function by
       J^(i+1)(x(k)) = min_{μ(x(s_j))} {U(x(k), μ(x(s_j))) + J^(i)(x(k + 1))}
                     = U(x(k), μ^(i)(x(s_j))) + J^(i)(F(x(k), μ^(i)(x(s_j)))).    (3.12)
5:   if |J^(i+1)(x(k)) − J^(i)(x(k))| ≤ ε then
6:     Stop the algorithm and obtain the approximate optimal control law μ*(x(s_j)).
7:   else
8:     Let i ← i + 1.
9:   end if
10: end while
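As a concrete illustration of the iteration (3.11)-(3.12), the following toy Python sketch runs value iteration for an assumed scalar linear plant x(k + 1) = 0.9x(k) + u(k) with utility U = Qx² + Ru²; the state and control grids, the plant, and the tolerance are illustrative assumptions, and the event-driven sampling is omitted for brevity.

```python
import numpy as np

Q, R = 1.0, 1.0
xs = np.linspace(-2, 2, 201)          # state grid
us = np.linspace(-2, 2, 201)          # control grid
J = np.zeros_like(xs)                 # J^(0)(.) = 0

def interp_J(J, x_next):
    # evaluate the current cost approximation at successor states (clipped to the grid)
    return np.interp(np.clip(x_next, xs[0], xs[-1]), xs, J)

for i in range(200):                  # iteration index i
    X, Uc = np.meshgrid(xs, us, indexing="ij")
    X_next = 0.9 * X + Uc
    cost = Q * X**2 + R * Uc**2 + interp_J(J, X_next)
    J_new = cost.min(axis=1)          # cost update, cf. (3.12)
    if np.max(np.abs(J_new - J)) <= 1e-6:
        break
    J = J_new

mu = us[np.argmin(cost, axis=1)]      # greedy control on the grid, cf. (3.11)
```

The nondecreasing, bounded behavior of the stored cost values over the iterations mirrors the convergence property (3.13) discussed next.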


The convergence result of the above iterative adaptive critic algorithm can be obtained by using a similar procedure as (Al-Tamimi et al. 2008; Wang et al. 2012; Zhang et al. 2009). By showing {J (i) (x(k))} is a nondecreasing sequence with an upper bound, i.e., J (0) (·) ≤ J (1) (·) ≤ · · · ≤ J (∞) (·),

(3.13)

it can be derived that J (i) (x(k)) → J (∞) (x(k)) = J ∗ (x(k)) and then μ(i) (x(s j )) → μ∗ (x(s j )) as i → ∞.

3.3.2 Neural Network Implementation

For implementing the iterative DHP algorithm, three neural networks are constructed with the same feedforward architectures as in (Wang et al. 2012; Zhang et al. 2009) but possessing different roles. They are: the model network employed for prediction, the critic network built for evaluation, and the action network used for control. By inputting x(k) and the approximate control μ̂^(i)(x(k)) that will be given later, we obtain the model network output

x̂(k + 1) = ω_m^T σ(ν_m^T [x^T(k), μ̂^(i)T(x(k))]^T),    (3.14)

where ω_m and ν_m are weight matrices and σ(·) is the hyperbolic tangent activation function. Note the neuron numbers of the input and output layers are m + n and n, respectively, while for the hidden layer, it can be selected experimentally. Using (3.14) and state information of the controlled plant, i.e., x(k + 1), the training performance measure of the model network is

E_m(k) = (1/2)(x̂(k + 1) − x(k + 1))^T (x̂(k + 1) − x(k + 1)).    (3.15)

If we employ the gradient-based adaptation rule, the weight matrices of the model network can be updated according to

Δω_m = −α_m ∂E_m(k)/∂ω_m,    (3.16a)
Δν_m = −α_m ∂E_m(k)/∂ν_m,    (3.16b)

where αm > 0 is the learning rate and Δωm is the difference value of two orderly updating steps. After training the model network, its weight matrices are recorded and kept unchanged.
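As a concrete illustration of the update rules (3.15)-(3.16), a compact Python sketch for one training step of a single-hidden-layer tanh model network is given below; the sizes, the learning rate, and the random initialization are illustrative assumptions rather than the settings used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, hidden, alpha_m = 2, 1, 8, 0.1
v_m = rng.normal(scale=0.1, size=(n + m, hidden))   # input-to-hidden weights
w_m = rng.normal(scale=0.1, size=(hidden, n))       # hidden-to-output weights

def model_update(x_k, u_k, x_next):
    global v_m, w_m
    z = np.concatenate([x_k, u_k])                   # network input [x(k); u(k)]
    h = np.tanh(v_m.T @ z)                           # hidden activation
    x_hat = w_m.T @ h                                # prediction of x(k+1)
    err = x_hat - x_next                             # E_m = 0.5 * err^T err, cf. (3.15)
    grad_w = np.outer(h, err)                        # dE_m/dw_m
    grad_v = np.outer(z, (w_m @ err) * (1.0 - h**2)) # dE_m/dv_m via backprop through tanh
    w_m -= alpha_m * grad_w                          # cf. (3.16a)
    v_m -= alpha_m * grad_v                          # cf. (3.16b)
    return float(0.5 * err @ err)
```

A call such as model_update(np.array([-1.0, 1.0]), np.array([0.1]), np.array([-0.9, 0.8])) performs one gradient step and returns the current value of the performance measure.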


Unlike the simple HDP structure, the critic network of DHP is used to approximate ∂J^(i)(x(k))/∂x(k) of the ith iteration, which is denoted as λ^(i)(x(k)). That is, denoting Ĵ^(i+1)(x(k)) as the approximate iterative cost function, the critic network of DHP outputs an approximate vector as follows:

λ̂^(i+1)(x(k)) = ∂Ĵ^(i+1)(x(k))/∂x(k) = ω_c^(i+1)T σ(ν_c^(i+1)T x(k)).    (3.17)

Using (3.12), the target function is derived by

λ^(i+1)(x(k)) = ∂[x^T(k)Qx(k) + μ̂^(i)T(x(k))Rμ̂^(i)(x(k))]/∂x(k) + (∂x̂(k + 1)/∂x(k))^T ∂Ĵ^(i)(x̂(k + 1))/∂x̂(k + 1)
              = 2Qx(k) + (∂x̂(k + 1)/∂x(k))^T λ̂^(i)(x̂(k + 1)).    (3.18)
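The target (3.18) is straightforward to form in code once the model Jacobian ∂x̂(k + 1)/∂x(k) is available (e.g., by backpropagating through the trained model network). The following sketch assumes hypothetical callables model_jacobian and lam_prev standing in for that Jacobian and for the i-th critic output, respectively.

```python
import numpy as np

def dhp_critic_target(x_k, x_next, Q, model_jacobian, lam_prev):
    # lambda^(i+1)(x(k)) = 2*Q*x(k) + (d x(k+1)/d x(k))^T * lambda^(i)(x(k+1)), cf. (3.18)
    dxnext_dx = model_jacobian(x_k)          # n x n Jacobian obtained from the model network
    return 2.0 * Q @ x_k + dxnext_dx.T @ lam_prev(x_next)
```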

Hence, the training performance measure is written as

E_c^(i+1)(k) = (1/2)(λ̂^(i+1)(x(k)) − λ^(i+1)(x(k)))^T (λ̂^(i+1)(x(k)) − λ^(i+1)(x(k))).    (3.19)

Similarly, the weight matrices are updated in light of Δωc(i+1) = −αc

Δνc(i+1) = −αc





∂ E c(i+1) (k) ∂ωc(i+1) ∂ E c(i+1) (k) ∂νc(i+1)

,

(3.20a)

,

(3.20b)

where α_c > 0 is the learning rate of the critic network. The action network generates the approximate optimal control, which can be formulated as

μ̂^(i)(x(s_j)) = ω_a^(i)T σ(ν_a^(i)T x(s_j)).    (3.21)

Note that the involved parameters of the critic and action networks can be stated similarly to those of the model network. Based on the calculation of (3.11) and the output of (3.21), the performance measure for training the action network is


E a(i) (s j ) =


T 1 (i) μˆ (x(s j )) − μ(i) (x(s j )) 2

× μˆ (i) (x(s j )) − μ(i) (x(s j )) .

(3.22)

Hence, the weight-updating algorithm is performed based on Δωa(i)

= −αa

Δνa(i) = −αa

∂ E a(i) (s j ) ∂ωa(i)



∂ E a(i) (s j ) ∂νa(i)

,

(3.23a)

,

(3.23b)

where αa > 0 is the learning rate of the action network. Here, by conducting a training process based on the error between μˆ (i) (x(s j )) and μ(i) (x(s j )), we can obtain suitable weight matrices of the action network, which are necessary to train the critic network and update the output of the model network. Remark 3.1 The original plant is taken to generate real states in the process of training the model network. In the developed adaptive critic framework, the converged weight matrices of the model network are used to get the necessary partial derivative information when training critic and action networks. For the general nonaffine dynamics, if we adopt the known system model F directly, the backpropagation process of critic and action networks can not be performed well, and the application scope of the proposed method can not cover unknown plants. Overall, the simple diagram of the event-driven iterative adaptive critic control framework via DHP is displayed in Fig. 3.1, where the solid line is the signal flow and the dashed line is the backpropagating path of critic and actions networks. The state vectors are input to the model network and the critic network directly, and after event-driven transformation, to the action network. The event-based control signal μˆ (i) (x(s j )) becomes μˆ (i) (x(k)) with the action of the zero-order hold. It means that, the output of the action network is related to the event-triggered state x(s j ), and after the involvement of the zero-order hold, the control function becomes μˆ (i) (x(k)), which is used to the model network. Besides, the weight transmission occurs between two critic networks in each iteration. Remark 3.2 Since J (i) → J ∗ as i → ∞ and λ(i) (x(k)) = ∂ J (i) (x(k))/∂ x(k), we can conclude the sequence {λ(i) } is also convergent such that λ(i) → λ∗ as i → ∞, where λ∗ is the optimal value. However, this is just the ideal case. Practically speaking, the gradient-based adaptation rule is persistently adopted by three networks until satisfying convergence results are observed. This phenomenon often occurs when the iteration index i is sufficient large, rather than infinite.


Fig. 3.1 Simple diagram of the event-driven iterative adaptive critic framework via DHP

3.4 Event-Based System Stability Analysis

In this section, the stability of the event-based closed-loop system is discussed rigorously. The event-based discrete-time nonlinear system is proved to be input-to-state stable, which brings in a kind of stability, namely ISS. To this end, we first revisit the definition of ISS and then redefine the ISS-Lyapunov function. Note that some important functions, including the K-function, K∞-function, and KL-function (Jiang and Wang 2001), are used in these definitions.

Definition 3.1 (cf. Jiang and Wang 2001) The system (3.3) is called input-to-state stable if there exist a KL-function θ and a K-function δ, such that for each initial state x(0) = x̄ ∈ R^n and each input u ∈ R^m, the following inequality holds:

‖x(k, x̄, u)‖ ≤ θ(‖x̄‖, k) + δ(‖u‖_sup),    (3.24)

where x(k, x, ¯ u) is the trajectory of system (3.3) with the initial state x(0) = x¯ and the input u while u sup = sup{u(k) : k ∈ N} < ∞. Considering the event-based background, the ISS-Lyapunov function for discretetime systems should be redefined, which is different with that in (Jiang and Wang 2001). Definition 3.2 A continuous function V : Rn → R+ is named as an ISS-Lyapunov function for the nonlinear system (3.3) if there exist K∞ -functions β1 , β2 , β3 , a positive constant L, and a K -function ρ, such that β1 (x(k)) ≤ V (x(k)) ≤ β2 (x(k)),

(3.25)

β1−1 (x(k)) ≤ L · x(k),

(3.26)


and    V F x(k), μ(x(k) + e(k)) − V (x(k)) ≤ −β3 (x(k)) + ρ(e(k))

(3.27)

hold for all x(k) ∈ Rn and all e(k) ∈ Rn . The following lemma presents an alternative expression for the above inequality. Lemma 3.1 If there exists a function β4 satisfying β4 (·) = β3 (β2−1 (·)), then the inequality    V F x(k), μ(x(k) + e(k)) − V (x(k))   ≤ −β4 V (x(k)) + ρ(e(k))

(3.28)

can be derived from the formula (3.27) of Definition 3.2. Proof As described in Definition 3.2, the function β2 is a K∞ -function. According to (3.25), we easily get   β2−1 V (x(k)) ≤ x(k).

(3.29)

Since β3 is also a K∞ -function, we have    β3 β2−1 V (x(k)) ≤ β3 (x(k)).

(3.30)

Combining (3.27) with (3.30), we derive the following inequality:    V F x(k), μ(x(k) + e(k)) − V (x(k))    ≤ −β3 (β2−1 V x(k)) + ρ(e(k)).

(3.31)

Through letting β4 (·) = β3 (β2−1 (·)) = β3 ◦ β2−1 (·), we can obtain that (3.28) is true. The next lemma about a special state set is useful to deduce the main conclusion of ISS. Lemma 3.2 Consider a set defined by X = {x(k) : V (x(k)) ≤ V },

(3.32)

where V = β4−1 ◦ ϕ −1 ◦ ρ(e(k)). If there exists a certain integer k0 ∈ N such that x(k0 ) ∈ X, then x(k) ∈ X, k ≥ k0 . Proof Without loss of generality, we assume I − β4 is a K -function. Besides, let ϕ be any K∞ -function such that I − ϕ is also a K∞ -function.


For performing mathematical induction, we first assume x(k0 ) ∈ X. Then, we have V (x(k0 )) ≤ V . According to Lemma 3.1, we have    V F x(k0 ), μ(x(k0 ) + e(k0 )) − V (x(k0 ))   ≤ −β4 V (x(k0 )) + ρ(e(k0 )).

(3.33)

Due to the property of I − β4 , we can derive   V (x(k0 + 1)) ≤ (I − β4 ) V (x(k0 )) + ρ(e(k0 )) ≤ (I − β4 )(V ) + ρ(e(k0 )).

(3.34)

Note the definition of V implies that ϕ ◦ β4 (V ) = ρ(e(k)).

(3.35)

Combining (3.34) with (3.35), we can further obtain V (x(k0 + 1)) ≤ (I − β4 )(V ) + ρ(e(k0 )) = − (I − ϕ) ◦ β4 (V ) + V − ϕ ◦ β4 (V ) + ρ(e(k0 )) = − (I − ϕ) ◦ β4 (V ) + V ≤V,

(3.36)

which shows that x(k0 + 1) ∈ X. Via mathematical induction, we can prove V (x(k0 + j)) ≤ V , ∀ j ∈ N. This indicates the conclusion x(k) ∈ X for all k ≥ k0 is true. With the help of the above definitions and lemmas, we turn to the check the ISS of the event-based nonlinear system. Theorem 3.1 The event-driven nonlinear system (3.3) is input-to-state stable, if it admits a continuous ISS-Lyapunov function V and there exists a continuous function κ such that e(k) ≤ κ(u sup )

(3.37)

is satisfied. Proof Assume that the event-triggered nonlinear system (3.3) admits an ISSLyapunov function V . Define k1 = min{k ∈ N : x(k) ∈ X} ≤ ∞. Then, based on Lemma 3.2, we have V (x(k)) ≤ τ (e(k)), ∀k ≥ k1

(3.38)

with τ (·) = β4−1 ◦ ϕ −1 ◦ ρ(·), which implies ϕ ◦ β4 (V (x(k))) ≤ ρ(e(k)). However, when k < k1 , the expression


ρ(e(k)) < ϕ ◦ β4 (V (x(k)))


(3.39)

holds. Therefore, according to Lemma 3.1 and employing (3.39), the following inequality can be obtained:    V F x(k), μ(x(k) + e(k)) − V (x(k)) ≤ − β4 (V (x(k))) + ρ(e(k)) = − (I − ϕ) ◦ β4 (V (x(k))) − ϕ ◦ β4 (V (x(k))) + ρ(e(k)) ≤ − (I − ϕ) ◦ β4 (V (x(k))).

(3.40)

Let ξ(¯z , k) be the solution of the scalar difference equation z(l + 1) = z(l) − (I − ϕ) ◦ β4 (z(l))

(3.41)

with an initial condition z(0) = z¯ . Since for any z¯ > 0, the sequence z(l) decreases to zero, the solution ξ(¯z , k) is a K L -function. Note that the formula (3.40) implies that    V F x(k), μ(x(k) + e(k))

  ≤ V (x(k)) − (I − ϕ) ◦ β4 V (x(k)) .

(3.42)

Then, by performing induction on k, we obtain   V (x(k)) ≤ ξ V (x(0)), k

(3.43)

for all 0 ≤ k ≤ k1 + 1. Considering (3.38) and (3.43), we find that V (x(k)) satisfies ⎧   ⎨ ξ V (x(0)),   k ,   0 ≤ k < k1 ; V (x(k)) ≤ max ξ V (x(0)), k , τ (e(k)) , k1 ≤ k ≤ k1 + 1; ⎩ τ (e(k)), k > k1 + 1.

(3.44)

Therefore, we derive     V (x(k)) ≤ max ξ V (x(0)), k , τ (e(k))

(3.45)

for all k ∈ N. According to (3.37), we can further get that     V (x(k)) ≤ max ξ V (x(0)), k , τ ◦ κ(u sup ) . Based on the inequality (3.25) of Definition 3.2, we acquire

(3.46)


  V (x(k)) ≤ max ξ(β2 (x(0)), k), τ ◦ κ(u sup ) .

(3.47)

Since β1 (x(k)) ≤ V (x(k)) and inequality (3.26), we derive that   x(k) ≤ β1−1 V (x(k))   ≤ L · max ξ(β2 (x(0)), k), τ ◦ κ(u sup ) .

(3.48)

Then, we finally obtain that x(k) ≤ L · ξ(β2 (x(0)), k) + L · τ ◦ κ(u sup ).

(3.49)

Consequently, we obtain the condition (3.24) with ¯ k) θ (x, ¯ k) = L · ξ(β2 (x),

(3.50)

δ(u sup ) = L · τ ◦ κ(u sup ).

(3.51)

and

In light of Definition 3.1, we have proved that system (3.3) possesses ISS.
Finally, we specify the triggering condition pointed out in (3.10). The triggering threshold will be given for the proposed nonlinear system satisfying appropriate stability conditions. To this end, an assumption presented in (Dong et al. 2017) is recalled with proper modification.

Assumption 3.1 For system (3.3), there exist a positive constant C and a continuous function κ̃ such that

‖F(x(k), μ(x(k) + e(k)))‖ ≤ C‖x(k)‖ + C‖e(k)‖    (3.52)

and

‖e(k)‖ ≤ ‖x(k)‖ < κ̃(‖u‖_sup)    (3.53)

hold.

Theorem 3.2 If Assumption 3.1 holds and the nonlinear system (3.3) admits a continuous ISS-Lyapunov function, then the triggering condition

‖e(k)‖ ≤ e_thr(C) = (1 − (2C)^{k−s_j}) / (1 − 2C) · C‖x(s_j)‖,  C ≠ 0.5,    (3.54)

ensures that system (3.3) is input-to-state stable.


Proof Noticing system (3.3) and according to the two inequalities of Assumption 3.1, we have

‖e(k)‖ ≤ ‖x(k)‖ ≤ C‖x(k − 1)‖ + C‖e(k − 1)‖.    (3.55)

The relationship in (3.2) yields

‖x(k − 1)‖ ≤ ‖e(k − 1)‖ + ‖x(s_j)‖.    (3.56)

Considering (3.55) and (3.56), we derive that

‖e(k)‖ ≤ 2C‖e(k − 1)‖ + C‖x(s_j)‖.    (3.57)

Consequently, we attain

‖e(k − 1)‖ ≤ 2C‖e(k − 2)‖ + C‖x(s_j)‖.    (3.58)

Then, the following inequality can be obtained:

‖e(k)‖ ≤ 2C(2C‖e(k − 2)‖ + C‖x(s_j)‖) + C‖x(s_j)‖
       ≤ · · ·
       ≤ (2C)^(k−s_j)‖e(s_j)‖ + (2C)^(k−s_j−1)C‖x(s_j)‖ + (2C)^(k−s_j−2)C‖x(s_j)‖ + · · · + (2C)C‖x(s_j)‖ + C‖x(s_j)‖,    (3.59)

where ‖e(s_j)‖ = 0. Note that the last k − s_j terms of (3.59) compose a geometric sequence. Hence, we can get the triggering condition ‖e(k)‖ ≤ ethr, or specifically ‖e(k)‖ ≤ ethr(C), where the threshold is

ethr(C) = [(1 − (2C)^(k−s_j)) / (1 − 2C)] · C‖x(s_j)‖,    (3.60)

and C ≠ 0.5. According to the condition (3.53) in Assumption 3.1, the inequality

[(1 − (2C)^(k−s_j)) / (1 − 2C)] · C‖x(k)‖ ≤ [(1 − (2C)^(k−s_j)) / (1 − 2C)] · Cκ̃(‖u‖sup)    (3.61)

can be further derived. Then, we have

‖e(k)‖ ≤ [(1 − (2C)^(k−s_j)) / (1 − 2C)] · Cκ̃(‖u‖sup).    (3.62)

Therefore, we can find a continuous function κ given as

κ(q) = [(1 − (2C)^(k−s_j)) / (1 − 2C)] · Cκ̃(q)    (3.63)

to satisfy the inequality (3.37) of Theorem 3.1. According to the result of Theorem 3.1, the triggering condition (3.54) guarantees the ISS of system (3.3). This ends the proof.

Remark 3.3 The triggering condition for designing the event-driven controller is presented in Theorem 3.2, which is convenient to use in practice. Note that the value of the threshold is non-unique. Actually, it is related to the sampled state x(s_j) and the experimental choice of the constant C.
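For readers who wish to experiment with the triggering rule, a minimal Python sketch of the threshold (3.60) and the event check is given below; the function names and the way the sampled state is supplied are illustrative assumptions rather than part of the original design.

```python
import numpy as np

def event_threshold(C, k, s_j, x_sj):
    """Triggering threshold e_thr(C) of (3.60); valid for C != 0.5."""
    ratio = (1.0 - (2.0 * C) ** (k - s_j)) / (1.0 - 2.0 * C)
    return ratio * C * np.linalg.norm(x_sj)

def keep_previous_control(x_k, x_sj, C, k, s_j):
    """The event is NOT triggered while the gap ||e(k)|| = ||x(s_j) - x(k)||
    stays below the threshold; in that case the last control is held (ZOH)."""
    e_k = np.linalg.norm(np.asarray(x_sj) - np.asarray(x_k))
    return e_k <= event_threshold(C, k, s_j, x_sj)
```

In Example 3.1 below, the choice C = 0.1 turns this threshold into the specialization (3.72).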

3.5 Special Discussion of the Affine Nonlinear Case As a special case study, in this section, we discuss the event-based iterative adaptive critic method for discrete-time affine nonlinear dynamics. Consider a type of affine nonlinear systems formulated by x(k + 1) = f (x(k)) + g(x(k))u(k),

(3.64)

where f(·) and g(·) are differentiable in their arguments with f(0) = 0. Assume that f + gu is Lipschitz continuous on a set Ω ⊂ Rn containing the origin. Similarly, we also assume that system (3.64) can be stabilized on the set Ω ⊂ Rn by a continuous state feedback control law u(k) = μ(x(k)). Under the proposed event-based formulation, the feedback control law can be denoted as u(k) = μ(x(s_j)). Considering the affine dynamics (3.64) and the quadratic utility (3.5), the event-based optimal control law (3.9) can be further expressed as

μ*(x(s_j)) = −(1/2) R^(−1) g^T(x(k)) ∂J*(x(k + 1))/∂x(k + 1).    (3.65)

In this case, the main iterative process between (3.11) and (3.12) is written as

μ^(i)(x(s_j)) = −(1/2) R^(−1) g^T(x(k)) ∂J^(i)(x(k + 1))/∂x(k + 1)    (3.66)

and

J^(i+1)(x(k)) = U(x(k), μ^(i)(x(s_j))) + J^(i)(f(x(k)) + g(x(k))μ^(i)(x(s_j))).    (3.67)

In this section, HDP is taken as the implementation technique and the model network is not required because of the known dynamics. Note that when considering the iterative HDP method, the critic network outputs the approximate iterative cost


Fig. 3.2 Simple diagram of the event-based iterative HDP framework with discrete-time plants

function Ĵ^(i+1)(x(k)). Here, different from (3.19), the training performance measure of the critic network is chosen as

Ē_c^(i+1)(k) = (1/2)[Ĵ^(i+1)(x(k)) − J^(i+1)(x(k))]^2.    (3.68)

Note that the action network is constructed and trained the same as the previous DHP case. Needless to say, when emphasizing the model-free property, a model network should be built to serve as a neural identifier. For clarity, the simple diagram of the event-based iterative HDP framework with discrete-time nonlinear systems is displayed in Fig. 3.2. Therein, the solid line is the signal flow and the dashed line is the backpropagating path of the two neural networks. The state information is transmitted to the event-based module for transforming the signal status, to the controlled plant for updating the system state, and to the critic network for computing the cost function, respectively. Hence, threefold important roles are included within the system state component. For clarifying the effect of the triggering condition, the event-based implementation after the iterative HDP algorithm is illustrated in Fig. 3.3, where μˆ ∗ (x(k)) is the obtained approximate optimal controller. It is just the practical control law used for the event-based design. The green dashed line in Fig. 3.3 shows the state of the next time step, distinct from the current state. When the triggering condition is satisfied (switch to “Y”), the control signal is kept as the previous value μˆ ∗ (x(s j−1 )). However, when the triggering condition is violated (switch to “N”), the control signal is updated as μˆ ∗ (x(s j )) via the action network. After the function of zero-order hold, either the event-based control signal μˆ ∗ (x(s j−1 )) or μˆ ∗ (x(s j )) is turned to μˆ ∗ (x(k)) and then is applied to the controlled plant.
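To make the critic update of this section concrete, the following minimal Python sketch assembles the one-step HDP target in (3.67) and the performance measure (3.68); the callables critic, f, and g, and the weight matrices Q and R, are placeholders assumed for illustration.

```python
import numpy as np

def hdp_critic_target(x, u, critic, f, g, Q, R):
    """One-step HDP target from (3.67): U(x, u) + J_hat(x(k+1)).
    x and u are state/control vectors; critic, f, g are assumed callables."""
    x_next = f(x) + g(x) @ u
    utility = x @ Q @ x + u @ R @ u      # quadratic utility as in (3.5)
    return utility + critic(x_next)

def critic_loss(critic_output, target):
    """Training performance measure (3.68) of the critic network."""
    return 0.5 * (critic_output - target) ** 2
```

The critic weights would then be adjusted by gradient descent on this loss, exactly as for the DHP case, while the action network training is unchanged.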


Fig. 3.3 Event-based implementation after the iterative HDP algorithm

Fig. 3.4 Simple diagram of the developed mixed-driven architecture

3.6 Highlighting the Mixed-driven Control Framework

In this chapter, we combine the data-driven iterative adaptive critic algorithm with the event-driven structure. Here, the integration of data-driven and event-driven control mechanisms is called the mixed-driven control framework. The simple diagram of the mixed-driven architecture is illustrated in Fig. 3.4, where data-driven learning for the complex dynamics and event-based triggering under the network environment are naturally included. The primary advantages of the mixed-driven control framework are described as follows:
• The dynamics information can be excavated sufficiently during the data-driven learning process.
• The practical control updating times can be reduced markedly with the event-triggering mechanism.
• The advanced mixed-driven controller can be established including efficient data and event utilizations.
The primary purpose of this section lies in highlighting the mixed-driven iterative adaptive critic control approach towards discrete-time nonlinear dynamics. This is a novel mixed framework in the discrete-time domain along with the idea of iterative adaptive critic algorithms. With the help of this framework, the data and communication resources are both optimized during the iterative control design process. Most of the adaptive critic techniques including HDP, DHP, GDHP, and NDP can be combined with the mixed-driven control framework, where the major differences lie in the construction of the critic network and the training method of the


action network. For example, by combining the NDP technique with the mixed-driven framework, the knowledge of the controlled plant is needless and the number of control input updates is signally reduced when dealing with the near-optimal regulation (Wang and Ha 2020c). Note that the critic output of NDP is the same as HDP, but their action training processes are different. When using the NDP method, the action network is trained according to the performance measure

Ē_a^(i)(x(k)) = (1/2)[Ĵ^(i)(x̂(k + 1)) − V0]^2,    (3.69)

where Ĵ^(i)(x̂(k + 1)) is the critic network output and V0 is always set as zero in light of (Dong et al. 2017; Si and Wang 2001). It is very important to note that the action training error is defined as Ĵ^(i)(x̂(k + 1)) − V0 within the NDP formulation. This is quite different from the training strategy given in (Zhang et al. 2009; Wang et al. 2012) and the previous one (3.22), where the vector error between the iterative controller and the optimal controller is considered. By virtue of this manner, the model-free property of the NDP technique is emphasized, which is very significant to address the optimal feedback control design of nonaffine dynamics. Specifically, under the NDP formulation, the hidden-output weight increment of the action network is calculated by

Δω_a^(i) = −α_a Ĵ^(i)(x̂(k + 1)) [∂Ĵ^(i)(x̂(k + 1))/∂x̂(k + 1)]^T [∂x̂(k + 1)/∂μ̂^(i)]^T [∂μ̂^(i)/∂ω_a^(i)],    (3.70)

which is a backpropagation process through the critic network, the model network, ˆ + 1))/∂ x(k ˆ + 1) can be obtained and the action network. Here, the term ∂ Jˆ(i) (x(k via the critic network. Besides, the term ∂ x(k ˆ + 1)/∂ μˆ (i) is gotten through the model network while ∂ μˆ (i) /∂ωa(i) is acquired from the action network. Similarly, the inputhidden weight increment Δνa(i) also can be derived based on the neural network backpropagation operation.
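A compact Python sketch of the weight increment (3.70) is given below, assuming that the required partial derivatives are returned by the critic, model, and action networks; the argument names and shapes are illustrative assumptions.

```python
import numpy as np

def ndp_action_weight_increment(alpha_a, J_hat_next, dJ_dxnext, dxnext_dmu, dmu_domega):
    """Hidden-output weight increment of the action network, a sketch of (3.70).
    Assumed shapes: dJ_dxnext (n,), dxnext_dmu (n, m), dmu_domega (m, p),
    supplied by the critic, model, and action networks, respectively."""
    # Error signal J_hat(x(k+1)) - V0 with V0 taken as zero, propagated
    # backwards through critic -> model -> action network (chain rule).
    error = J_hat_next
    grad = dJ_dxnext @ dxnext_dmu @ dmu_domega   # shape (p,)
    return -alpha_a * error * grad
```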

3.7 Simulation Experiments

In this section, we apply the proposed event-based method to some specific systems to verify its practical performance, where the first three examples use the DHP structure, while the last two cases use HDP and NDP.

Example 3.1 Consider a discretized version of the Mass–Spring–Damper system derived from (Dong et al. 2017) and written as

x1(k + 1) = 0.9996x1(k) + 0.0099x2(k),
x2(k + 1) = −0.0887x1(k) + 0.97x2(k) + 0.0099u(k).    (3.71)

Fig. 3.5 The sum of square error (Example 3.1)

This is a discrete-time system with the state vector x(k) = [x1 (k), x2 (k)]T and the control variable u(k). In order to handle the self-learning optimal regulation, the utility term of the cost function is selected as the form of (3.5) with weight matrices Q = 0.5I2 and R = I . As for the adaptive critic implementation, the model network with the structure 3–8–2 (number of the input, hidden, and output layers) is employed to identify the unknown system. The initial weight terms ωm and νm are set randomly in [−0.1, 0.1] and the learning rate αm is selected as 0.1. Based on (3.16), we collect 500 data samples with 50 training epochs for each sample to learn the dynamics information. Then, another 500 data samples are used to verify the approximation performance of the model network. The sum of square error of the state is shown in Fig. 3.5. It is evidently observed that the sum of square error of the state is less than 1.2 × 10−3 . Therefore, the model network with high accuracy has been obtained. After training the model network, the final weight matrices ωm and νm are recorded and maintained. Next, we start to train the critic and action networks according to (3.20) and (3.23). We apply the iterative adaptive critic algorithm to update these two networks with structures 2–8–2 and 2–8–1, respectively. The initial state vector is chosen as x(0) = [0.5, 0.5]T . The initial weights of the critic network ωc and νc are chosen randomly in [−0.1, 0.1]. Besides, ωa and νa are chosen randomly in [−1, 1]. Similar to the model network, the critic and action networks are required to train sufficiently. Moreover, the event-based mechanism is applied to the action network. In the neural

Fig. 3.6 The state trajectory (Example 3.1)

training process, we experimentally let the learning rates be αc = αa = 0.05. For ensuring sufficient learning and getting satisfying performance, if the prespecified accuracy  = 10−6 is reached during the iterative algorithm, then the training process of critic and action networks can be terminated. Here, we employ the iterative DHP algorithm for 200 iterations and set 2000 training times for each iteration, so as to find that the prespecified accuracy can be satisfied. In the sequel, the event-driven approach is applied to the optimization algorithm. Letting C = 0.1, we can specify the threshold (3.60) as follows: ethr (0.1) =

[(1 − (0.2)^(k−s_j)) / (1 − 0.2)] · 0.1‖x(s_j)‖,    (3.72)

which depends on the sampled state x(s j ). The state response with event-driven iterative adaptive critic is given in Fig. 3.6. We can clearly see that the state variables x1 and x2 eventually converge to zero without deteriorating the convergence rate, when compared with the traditional iterative DHP algorithm. The control curve of the proposed approach is stair-stepping, which is shown in Fig. 3.7. The control input with the event-based formulation is only updated by 173 time steps, while the control variable of the time-based case is operated by 500 time steps. Therefore, the event-based method greatly improves the resource utilization. During the simulation, the evolution curve of the triggering threshold is depicted in Fig. 3.8, which involves a trend to zero along with the state signal.
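As an illustration of how Example 3.1 can be reproduced in outline, the following Python sketch simulates the plant (3.71) under an event-driven controller with the threshold (3.72); the trained action network itself is not reproduced here, so the controller argument is a stand-in callable and the update count merely mirrors the bookkeeping described above.

```python
import numpy as np

def plant(x, u):
    """Mass-Spring-Damper dynamics (3.71)."""
    x1, x2 = x
    return np.array([0.9996 * x1 + 0.0099 * x2,
                     -0.0887 * x1 + 0.97 * x2 + 0.0099 * u])

def e_thr(C, k, s_j, x_sj):
    """Threshold (3.60); for C = 0.1 this is exactly (3.72)."""
    return (1 - (2 * C) ** (k - s_j)) / (1 - 2 * C) * C * np.linalg.norm(x_sj)

def run_event_driven(controller, C=0.1, steps=500, x0=(0.5, 0.5)):
    """Event-driven closed loop with zero-order hold; `controller` stands in
    for the trained action network and is an assumption of this sketch."""
    x = np.array(x0, dtype=float)
    s_j, x_sj = 0, x.copy()
    u, updates = controller(x), 1
    for k in range(1, steps + 1):
        x = plant(x, u)
        if np.linalg.norm(x_sj - x) > e_thr(C, k, s_j, x_sj):
            s_j, x_sj = k, x.copy()          # event triggered: sample the state
            u, updates = controller(x), updates + 1
        # otherwise the previous control value is held (zero-order hold)
    return x, updates
```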

Fig. 3.7 The control input (Example 3.1)

Fig. 3.8 The triggering threshold (Example 3.1)


Fig. 3.9 The sum of square error (Example 3.2)

Example 3.2 Consider the following discrete-time nonlinear system: 

x1 (k + 1) = x1 (k) + 0.1x2 (k), x2 (k + 1) = −0.17 sin(x1 (k)) + 0.98x2 (k) + 0.1u(k).

(3.73)

The weight matrices of the utility are set as Q = I2 and R = 0.01I . The network structure, the learning rate αm , and other relevant parameters of the model component are chosen the same as Example 1. By conducting an effective learning stage, the sum of square error of the state is displayed in Fig. 3.9. It is clear that the sum of square error of the state is less than 2.5 × 10−3 . Then, the final values of weight parameters ωm and νm are kept unchanged. Similar to Example 3.1, we apply the iterative algorithm to train the critic and action networks. The main structures, the learning rates, and the initial weights of these two networks are selected the same as that in Example 3.1. The initial state vector is also chosen as x(0) = [0.5, 0.5]T . The iterative DHP algorithm is adopted for 200 iterations with 2000 training times being included in each iteration. In order to employ the event-based mechanism, we set C = 0.2 and establish the following threshold: ethr (0.2) =

[(1 − (0.4)^(k−s_j)) / (1 − 0.4)] · 0.2‖x(s_j)‖.    (3.74)


Fig. 3.10 The state trajectory (Example 3.2)

With the function of the related triggering condition, state and control trajectories can be derived. The convergence time of state variables x1 and x2 is very close to the traditional iterative DHP algorithm, as compared in Fig. 3.10. In addition, the control input is only updated 61 time steps, whose curve is shown in Fig. 3.11. As a comparison, the control variable of the traditional DHP algorithm is needed to update 200 time steps, which indicates that a great improvement in the resource utilization has been obtained under event-driven formulation. Here, the changing trend of the triggering threshold is displayed in Fig. 3.12. From these results, we find that the control efficiency has been evidently enhanced and the regulation performance still can be maintained. Example 3.3 Consider the discrete-time nonlinear system ⎧ ⎨ x1 (k + 1) = x1 (k) + 0.1x2 (k), x2 (k + 1) = −0.17 sin(x1 (k)) + 0.98x2 (k) + 0.1u 1 (k), ⎩ x3 (k + 1) = 0.1x1 (k) + 0.2x2 (k) + x3 (k) cos(u 2 (k)),

(3.75)

where the state vector is x(k) = [x1 (k), x2 (k), x3 (k)]T and the control variable is u(k) = [u 1 (k), u 2 (k)]T . The weight matrices of the utility are set as Q = I3 and R = 0.01I2 . The learning rate αm and other relevant parameters of the model component are chosen the same as Example 3.1, but with the structure 5–8–3. By conducting an effective learning stage, the sum of square error of the state is displayed in Fig. 3.13.


Fig. 3.11 The control input (Example 3.2)

Fig. 3.12 The triggering threshold (Example 3.2)


Fig. 3.13 The sum of square error (Example 3.3)

It is clear that the sum of square error of the state is less than 0.06. Then, the final values of weight parameters ωm and νm are kept unchanged. Finally, we apply the developed algorithm to train the critic network (3–8–3) and the action network (3–8–2). The iteration times, the learning rates, and the initial weights of these two networks are selected the same as that in Example 3.1. Here, the initial state vector is chosen as x(0) = [0.5, 0.5, 0.5]T . For adopting the eventbased mechanism, we establish the threshold as the form of Example 3.1. Then, the state trajectories of the developed method and the traditional algorithm are shown in Fig. 3.14. In addition, the control input is only updated 138 time steps, whose curve is shown in Fig. 3.15. As a comparison, the control variable of the traditional DHP algorithm is needed to update 200 time steps. Remarkably, an evident improvement in the resource utilization has been obtained under event-driven formulation. From these results, we observe that the regulation performance can be maintained while the control efficiency has been signally enhanced, which demonstrates the effectiveness of the event-driven iterative adaptive critic approach. Example 3.4 In this example, we apply the event-driven iterative HDP algorithm to an affine nonlinear system 

x1 (k + 1) = x1 (k) + 0.03x2 (k) − 0.5 cos(1.4x2 (k)) sin(0.4x1 (k)), x2 (k + 1) = −0.1x1 (k) + x2 (k) + 0.1x22 (k) + 0.008u(k),

(3.76)


Fig. 3.14 The state trajectory (Example 3.3)

where the state vector is x(k) = [x1 (k), x2 (k)]T and the control variable is u(k). For dealing with the event-based optimal control problem, the main parameters are set as Q = 0.01I2 , R = 2I , and x(0) = [1, −1]T . By presetting the network structures as 2–8–1 and 2–8–1, we train the critic and action networks within the iterative framework, respectively. The initial weights of the critic network and the action network are chosen randomly in [−0.1, 0.1] and [−1, 1], respectively. The learning rates are selected as αc = αa = 0.1 and the triggering threshold is chosen as (3.72). After conducting an iterative process of 300 times, the convergence of the cost function can be derived. Different from the DHP technique, the convergence of the iterative cost function can be well checked in HDP, which is meaningful when we pay attention to the value learning stage. For comparison, the iterative HDP algorithm of two cases is compiled, where Case 1 is the developed event-based situation and Case 2 is the traditional version proposed in (Al-Tamimi et al. 2008). With event-based and time-based formulations, the state trajectories of two case studies are given in Fig. 3.16. We can observe from Fig. 3.16 that the two trajectories are very close, and both contain good stability results. Besides, the triggering threshold and the control input are displayed in Figs.


Fig. 3.15 The control input (Example 3.3)

3.17 and 3.18, respectively. Unlike the state curves, the control trajectories of the two cases are with evident distinction. In this example, the control inputs of the timebased and event-based formulations are updated with 300 and 85 times, respectively. Exactly speaking, the event-based framework brings in a decrease in updating times up to 71.67%. These simulation results also verify that the event-based strategy can signally reduce updating times of the control input while remain an acceptable control performance. Example 3.5 In this example, we apply the mixed-driven iterative NDP algorithm to a specified nonaffine system, in order to verify the near-optimal control performance Wang and Ha (2020c). We consider a discrete-time nonlinear plant formulated as 

x1 (k + 1) = 0.3x2 (k) − 0.5 cos(x2 (k)) sin(0.6x1 (k)) + x1 (k) tanh(u(k)), x2 (k + 1) = −0.1x1 (k) + x2 (k) + 0.1x22 (k) + 0.8u 3 (k), (3.77)

where the involved state vector is x(k) = [x1 (k), x2 (k)]T and the control variable is u(k). In order to handle the approximate optimal regulation, the utility term of the cost function is selected with Q = 0.2I2 , R = I and then three neural networks are constructed within the proposed mixed driven framework. We first train the model network with an architecture of 3–8–2 by choosing the learning rate as αm = 0.2. After a data-driven learning stage with a randomly initial


Fig. 3.16 The state trajectories of Case 1 and Case 2 (Example 3.4)

Fig. 3.17 The triggering threshold (Example 3.4)


Fig. 3.18 The control inputs of Case 1 and Case 2 (Example 3.4)

choice in [−0.1, 0.1], the weight variables of the identifier finally converge to two constant matrices:

ν̂m* = [  0.3104   0.5278   0.2298
         −0.3585  −0.5736  −0.2770
         −0.4393  −0.3610   0.5760
          0.1182   0.5631   0.3486
         −0.1591  −0.4354   0.3653
          0.5083  −0.7035  −0.0559
         −0.6490   0.7313   0.0933
          0.1475   0.2251   0.5993 ]^T,    (3.78a)

ω̂m* = [  0.1127   0.3233
         −0.1305  −0.3763
         −0.0437   0.0046
          0.0375   0.3485
          0.0721  −0.0758
          0.2481  −0.2854
         −0.2622   0.2691
          0.0246   0.1885 ].    (3.78b)

Note that the converged weights νˆ m∗ and ωˆ m∗ can be employed to update the system states of next time steps, which replaces the usage of the controlled plant during


the following main training process. Thus, the modeling accuracy is greatly relevant to the latter control performance. Then, we determine the structures of the critic and action networks as 2–8–1 and 2–8–1, respectively, and train the action network according to (3.69). During the learning process, we choose the initial state as x(0) = [1, −1]^T and set the learning rates as αc = αa = 0.2. Similar to the model network, the initial weights of critic and action networks are both chosen randomly in [−0.1, 0.1]. In this situation, we employ the iterative NDP algorithm for 28 iterations and then the prespecified accuracy ε = 10^(−6) is reached, where 2000 training times are involved for each iteration. Next, for turning to the event-driven design, we severally let C = 0.2, C = 0.1, and C = 0.3, so as to specify the triggering threshold (3.54) as the following cases:

ethr(0.2) = [(1 − (0.4)^(k−s_j)) / 3] ‖x(s_j)‖,    (3.79a)
ethr(0.1) = [(1 − (0.2)^(k−s_j)) / 8] ‖x(s_j)‖,    (3.79b)
ethr(0.3) = [3(1 − (0.6)^(k−s_j)) / 4] ‖x(s_j)‖.    (3.79c)

In the sequel, several case studies are implemented based on the learnt weights of the iterative NDP algorithm for 300 time steps. In Case 1A, we apply the adaptive critic controller with C = 0.2 and the event-driven threshold being selected as (3.79a). In Case 2, we revisit the traditional time-based controller design method like in (Zhang et al. 2009). Using the corresponding controllers, the state trajectories of two case studies are given in Fig. 3.19. Note the state responses therein are almost the same. Besides, the triggering threshold related to Case 1A is depicted in Fig. 3.20. Remarkably, the control trajectories of the two cases are illustrated in Fig. 3.21, where a stair-stepping curve is clearly observed, just affected by the event-based mechanism. At last, for checking the event-driven control performance and updating times along with the variation of the threshold, we severally set C = 0.1 (Case 1B) and C = 0.3 (Case 1C) and conduct other two case studies. To this end, we change the corresponding thresholds to (3.79b) and (3.79c) and compare the obtained control curves with the time-based case of Case 2 via Figs. 3.22 and 3.23, respectively. It is observed from Figs. 3.21–3.23 that the stair-stepping phenomenon of the eventdriven control curves become more and more obvious as the enlargement of the constant C. Additionally, we let the control updating times of event-based and timebased formulations be denoted as T1 and T2 , respectively. In this example, we apply the traditional algorithm for 300 time steps, so that T2 = 300. However, with the involvement of the event-based mechanism, the related updating times of control signals are overtly reduced. For Cases 1A, 1B, and 1C, the updating times are T1 = 25, T1 = 80, and T1 = 8, respectively. For clarity, the comparison of the updating times of four case studies is shown in Table 3.2. Such simulation results verify


Fig. 3.19 The state trajectories of Case 1A and Case 2 (Example 3.5)

Fig. 3.20 The triggering threshold of Case 1A (Example 3.5)


Fig. 3.21 The control inputs of Case 1A and Case 2 (Example 3.5)

Table 3.2 Control updating times of four case studies in Example 3.5

                 Case 1A    Case 1B    Case 1C    Case 2
Constant C       C = 0.2    C = 0.1    C = 0.3    None
Updating times   T1 = 25    T1 = 80    T1 = 8     T2 = 300

that the event-driven scheme can greatly lessen control updating times while still guaranteeing a satisfying control performance.

Fig. 3.22 The control inputs of Case 1B and Case 2 (Example 3.5)

Fig. 3.23 The control inputs of Case 1C and Case 2 (Example 3.5)

3.8 Conclusions

In this chapter, we study the event-based optimal regulation for discrete-time nonlinear systems. The event-driven method based on the iterative DHP algorithm is proposed to substantially decrease the computation cost. An appropriate event-triggering condition is developed and the event-based system is proved to possess ISS. Then, the special case study of affine nonlinear plants is also presented with event-driven iterative adaptive critic. In addition, the mixed-driven control framework is clarified with data and event considerations. Simulation examples are conducted to verify that the proposed approach can greatly improve the resource utilization when compared with

the traditional DHP technique. How to extend the developed framework to address trajectory tracking and robust stabilization is within our future work. It is worth pointing out that the mixed-driven adaptive critic approach can be applied to address complex optimal control problems with effectiveness control performances. Both the general structure and the iterative framework can be considered by integrating with mixed-driven adaptive critic design. It is one of the hot topics of the adaptive critic control field and more new results are called for in this aspect.

References Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):943–949 Batmani Y, Davoodi M, Meskin N (2017) Event-triggered suboptimal tracking controller design for a class of nonlinear discrete-time systems. IEEE Trans Ind Electron 64(10):8079–8087 Dong L, Zhong X, Sun C, He H (2017) Adaptive event-triggered control based on heuristic dynamic programming for nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 28(7):1594–1605 Eqtami A, Dimarogonas DV, Kyriakopoulos KJ, (2010) Event-triggered control for discrete-time systems. In: Proceedings of the American control conference. pp 4719–4724 Huang S, James MR, Nesic D, Dower PM (2005) Analysis of input-to-state stability for discrete time nonlinear systems via dynamic programming. Automatica 41(12):2055–2065 Jiang ZP, Wang Y (2001) Input-to-state stability for discrete-time nonlinear systems. Automatica 37(6):857–869 Li J, Chai T, Lewis FL, Fan J, Ding Z, Ding J (2018) Off-policy Q-learning: Set-point design for optimizing dual-rate rougher flotation operational processes. IEEE Trans Ind Electron 65(5):4092– 4102 Li J, Kiumarsi B, Chai T, Lewis FL, Fan J (2017) Off-policy reinforcement learning: Optimal operational control for two-time-scale industrial processes. IEEE Trans Cybern 47(12):4547– 4558 Postoyan R, Bragagnolo MC, Galbrun E, Daafouz J, Nesic D, Castelan EB (2015) Event-triggered tracking control of unicycle mobile robots. Automatica 52:302–308 Prokhorov DV, Santiago RA, Wunsch DC (1995) Adaptive critic designs: A case study for neurocontrol. Neural Networks 8(9):1367–1372 Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans Neural Networks 12(2):264–276 Sontag ED (2008) Input to state stability: Basic concepts and results. In: Nistri P, Stefani G (eds) Nonlinear and optimal control theory. Springer, Berlin, Heidelberg, pp 163–220 Tabuada P (2007) Event-triggered real-time scheduling of stabilizing control tasks. IEEE Trans Automat Control 52(9):1680–1685 Tallapragada P, Chopra N (2013) On event triggered tracking for nonlinear systems. IEEE Trans Automat Control 58(9):2343–2348 Vamvoudakis KG, Mojoodi A, Ferraz H (2017) Event-triggered optimal tracking control of nonlinear systems. Int J Robust Nonlinear Control 27(4):598–619 Wang D (2019) Research progress on learning-based robust adaptive critic control. Acta Automat Sinica 45(6):1031–1043 Wang D (2020a) Robust policy learning control of nonlinear plants with case studies for a power system application. IEEE Trans Ind Inf 16(3):1733–1741


Wang D (2020b) Intelligent critic control with robustness guarantee of disturbed nonlinear plants. IEEE Trans Cybern 50(6):2740–2748 Wang D, Ha M (2020c) Mixed driven iterative adaptive critic control design towards nonaffine discrete-time plants. In: Proceedings of 21st IFAC world congress 53(2):3803–3808 Wang D, Ha M, Qiao J (2020d) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Automat Control 65(3):1272–1279 Wang D, Ha M, Qiao J, Yan J, Xie Y (2020e) Data-based composite control design with critic intelligence for a wastewater treatment platform. Artif Intell Rev 53(5):3773–3785 Wang D, Xu X, Zhao M (2020f) Neural critic learning toward robust dynamic stabilization. Int J Robust Nonlinear Control 30(5):2020–2032 Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: A survey. IEEE Trans Cybern 47(10):3429–3451 Wang D, He H, Zhong X, Liu D (2017) Event-driven nonlinear discounted optimal regulation involving a power system application. IEEE Trans Ind Electron 64(10):8177–8186 Wang D, Liu D (2018) Learning and guaranteed cost control with event-based adaptive critic implementation. IEEE Trans Neural Networks Learn Syst 29(12):6004–6014 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioural sciences. Ph.D. dissertation, Harvard University, Cambridge, MA Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy and adaptive approaches (chapter 13). Van Nostrand Reinhold, New York Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discretetime affine nonlinear systems with control constraints. IEEE Trans Neural Networks 20(9):1490– 1503 Zhang Q, Zhao D, Wang D (2018) Event-based robust control for uncertain nonlinear systems using adaptive dynamic programming. IEEE Trans Neural Networks Learn Syst 29(1):37–50 Zhong X, He H (2017) An event-triggered ADP control approach for continuous-time system with unknown internal states. IEEE Trans Cybern 47(3):683–694 Zhu Y, Zhao D (2018) Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artif Intell Rev 49(4):531–547

Chapter 4

Near-Optimal Regulation with Asymmetric Constraints via Generalized Value Iteration

Abstract In this chapter, a generalized value iteration algorithm is developed to address the discounted near-optimal control problem for discrete-time systems with control constraints. The initial cost function is permitted to be an arbitrary positive semi-definite function without being zero. First, a nonquadratic performance functional is utilized to overcome the challenge caused by saturating actuators. Then, the monotonicity and convergence of the iterative cost function sequence with the discount factor are analyzed. For facilitating the implementation of the iterative algorithm, two neural networks with the Levenberg–Marquardt training algorithm are constructed to approximate the cost function and the control law. Furthermore, the initial control law is obtained by employing the fixed point iteration approach. Finally, two simulation examples are provided to validate the feasibility of the present strategy. It is emphasized that the established control laws are successfully constrained for randomly given initial state vectors. Keywords Adaptive critic · Control constraints · Convergence analysis · Discounted optimal control · Generalized value iteration

4.1 Introduction It is well known that the actuator saturation is a universal nonlinear phenomenon in practical engineering applications. The saturation nonlinearity of the actuator is inevitable and leads to system performance degradation or even system instability. Due to the nonanalytic nature, the challenge derived from nonlinear systems with actuator saturations is always encountered. These features require that the designed strategy can effectively confront the problem of constrained nonlinear control. For the actuator constraint problem, it is formidable to solve the nonlinear Hamilton– Jacobi–Bellman (HJB) equation and obtain its analytical solutions because of its inherently nonlinear nature. In recent decades, a great deal of methods have been developed to derive control laws with input constraints in (Zhang et al. 2009; Ha et al. 2020a, b; Modares et al. 2013). By using the nonquadratic performance functional, the iterative adaptive dynamic programming (ADP) algorithm was developed to search © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D. Wang et al., Advanced Optimal Control and Applications Involving Critic Intelligence, Intelligent Control and Learning Systems 6, https://doi.org/10.1007/978-981-19-7291-1_4


for near-optimal control laws for constrained nonlinear systems in (Zhang et al. 2009). In order to enhance the resource utilization rate, the approximate optimal control problems for constrained nonlinear affine and nonaffine systems with the eventtriggered mechanism were investigated in (Ha et al. 2020a, b). However, the algorithm implementation in the above studies is based on a specific initial state, which limits its practicability and adaptability. In addition, the control laws of practical industrial systems are generally constrained in the asymmetric range. Therefore, it is significant to propose more sophisticated techniques to solve the symmetric and asymmetric control constraints for all states in the operation region of the controlled system. In this chapter, the framework of the HJB equation is utilized to deal with the control problem for constrained systems. As an approximate method, ADP can obtain satisfying numerical solutions of HJB equations and has been extensively adopted in solving optimal control problems. To the best of our knowledge, the ADP algorithms are categorized into several schemes, including heuristic dynamic programming (HDP), dual HDP (DHP), action-dependent HDP (ADHDP), ADDHP, globalized DHP (GDHP), and ADGDHP (Prokhorov and Wunsch 1997; Werbos 1992). Till now, there have been abundant results of ADP to tackle control problems, such as event-based control (Wang et al. 2020a; Dong et al. 2017; Wang and Liu 2018; Fan and Yang 2018), robust control (Wang et al. 2020b; Wang 2020; Yang et al. 2022), optimal tracking (Zhang et al. 2008; Kiumarsi and Lewis 2015; Song et al. 2019; Wang et al. 2021a), and control constraints (Luo et al. 2018; Wang et al. 2021c, d), which strongly show the good benefits and great potential of ADP algorithms. Under the event-driven formulation, the self-learning optimal regulation was studied and the input-to-state stability analysis of the discrete-time system was given in (Wang et al. 2020a). It should be noted that the triggering threshold of the algorithm is essential to the design process of optimal feedback controllers. A general intelligent critic control design with robustness guarantee was investigated for disturbed nonaffine nonlinear systems in (Wang 2020). In order to achieve effective tracking, an actor-critic-based reinforcement learning algorithm was applied to obtain the solution of the tracking HJB equation online without knowing the system dynamics in (Kiumarsi and Lewis 2015). By using the value iteration-based Q-learning with the critic-only structure, the constrained control problem of nonaffine nonlinear discretetime systems was investigated in (Luo et al. 2018). By means of the model network, a new optimal tracking control approach was proposed for the wastewater treatment plant with control constraints consideration in (Wang et al. 2021c, d). As revealed in (Kiumarsi et al. 2018), most of basic iterative ADP methods were divided into policy iteration (Liu and Wei 2014; Yang et al. 2021; Fan et al. 2022) and value iteration (Al-Tamimi et al. 2008; Wang et al. 2012; Mu et al. 2017). The convergence analysis and stability properties of the policy iteration approach for discrete-time nonlinear systems were provided in (Liu and Wei 2014). As an indispensable branch of iterative ADP algorithms, the value iteration algorithm has gained much attention. Particularly, convergence of the value iteration algorithm by adopting the HDP framework was proven in (Al-Tamimi et al. 2008). 
By using system data, a novel iterative ADP strategy was built to obtain the control laws by directly minimizing the iterative cost function in (Mu et al. 2017). However, it is worth considering


the selection of the initial cost function in the iterative algorithm. As mentioned in (Wei et al. 2016), the traditional value iteration algorithm started from a zero initial cost function, which limited its application. This is because the initial cost function is not always zero in practical applications. Therefore, a class of generalized value iteration algorithms were proposed in (Wei et al. 2016; Li and Liu 2012; Ha et al. 2021; Wei et al. 2017, 2018). It is emphasized that the initial cost function can be any positive semi-definite function including zero initial cost function. Wei et al. (2017, 2018) shed light on the admissibility and convergence properties for the undiscounted discrete-time local value iteration algorithm with different initial cost functions. However, there are rare relevant studies to take control constraints into account based on the generalized value iteration algorithm. Remarkably, in this chapter, it is the first time that the symmetric and asymmetric control constraints are investigated by introducing the generalized value iteration algorithm with the discount factor. Moreover, the iterative process of the developed algorithm is conducted in an operation region, which is called multiple initial states. The key contribution of multiple initial states is to improve the adaptability of the algorithm because the corresponding control input can be constrained with an arbitrary initial state vector. It is exactly the advantage that is not reflected in (Ha et al. 2020a, b). In this chapter, a generalized value iteration algorithm, including zero and nonzero initial cost functions, is developed to deal with the discounted optimal control problems for discrete-time constrained systems (Wang et al. 2021b). We discuss symmetric control constraints and asymmetric control constraints. With the discount factor and the control constrains, some properties of the proposed generalized value iteration algorithm for nonlinear systems are investigated under different initial conditions, which involve the monotonicity and convergence. The application scope of the designed controller is broadened by conducting the algorithm with multiple initial states.

4.2 Problem Statement

Consider the discrete-time dynamical systems described as

x(k + 1) = F(x(k)) + G(x(k))u(k),    (4.1)

where x(k) ∈ Rn is the n-dimensional state vector and u(k) ∈ Rm is the m-dimensional control vector. Let the system functions F(·) ∈ Rn and G(·) ∈ Rn×m be differentiable on a compact set Ω ⊂ Rn. Let x(0) be the initial state and x(k) = 0 ∈ Ω be an equilibrium state of (4.1). Assume that the system function F(·) satisfies F(0) = 0. Assume that system (4.1) is controllable and can be stabilized by a continuous control law u(k) = u(x(k)) on the set Ω ⊂ Rn with u(0) = 0. Define Ωu = {u(k) | u(k) = [u1(k), u2(k), . . . , um(k)]^T, −ū_j ≤ u_j(k) ≤ ū_j, j = 1, 2, . . . , m}, where −ū_j is the lower bound and ū_j is the upper bound of the jth actuator. Let Ū ∈ Rm×m be the constant diagonal matrix given by Ū = diag{ū1, ū2, . . . , ūm}. If the saturating upper and lower bound norms are not equivalent, then we have ū_j^min ≤ u_j(x(k)) ≤ ū_j^max, j = 1, 2, . . . , m, where ū_j^min is the lower bound and ū_j^max is the upper bound of the jth actuator. For the optimal control problem, we focus on designing an optimal state feedback controller for the system (4.1) that ensures the state x(k) → 0 as k → ∞. The objective also contains finding the optimal control law u(x(k)) that can minimize the cost function with a discount factor γ as follows:

J(x(k)) = Σ_{ζ=k}^{∞} γ^(ζ−k) U(x(ζ), u(x(ζ))),    (4.2)

where 0 < γ ≤ 1, and U is the utility function with U(0, 0) = 0 and U(x(ζ), u(x(ζ))) ≥ 0, ∀x(ζ), u(x(ζ)). In this chapter, the utility function is selected as the form U(x(ζ), u(x(ζ))) = x^T(ζ)Qx(ζ) + W(u(ζ)) = Q(x(ζ)) + W(u(ζ)). It is worth noting that the weight matrix Q ∈ Rn×n is positive definite and that W(u(ζ)) ∈ R is positive definite.

Definition 4.1 A control law u(x(k)) is said to be admissible (Al-Tamimi et al. 2008; Zhang et al. 2009) with respect to (4.2) on Ω if u(x(k)) is continuous on Ω, ∀x(k) ∈ Ω, u(0) = 0, u(x(k)) stabilizes (4.1) on Ω, and J(x(0)) is finite, ∀x(0) ∈ Ω.

In order to tackle the constrained control problem, the nonquadratic functional is introduced as follows:

W(u(ζ)) = 2 ∫_0^{u(ζ)} ψ^(−T)(Ū^(−1) z) Ū R dz,  z ∈ Rm,
ψ^(−1)(u(ζ)) = [ψ^(−1)(u1(ζ)), ψ^(−1)(u2(ζ)), . . . , ψ^(−1)(um(ζ))]^T,    (4.3)

where R is a positive definite and diagonal matrix given by R = diag{r1, r2, . . . , rm} and z = [z1, z2, . . . , zm]^T. Note that ψ(·) is a bounded one-to-one function satisfying |ψ(·)| ≤ 1 and belonging to C^p (p ≥ 1) and L2(Ω) (Zhang et al. 2009). Generally speaking, the nonquadratic functional in (4.3) is used to deal with symmetric control constraints whose upper and lower bound norms are the same. Therefore, we always choose the hyperbolic function ψ(·) = tanh(·) as the bounded one-to-one function.

Remark 4.1 The hyperbolic function tanh(·) is invalid and useless for asymmetric control constraints whose upper and lower bound norms are not equivalent. Therefore, a new nonquadratic functional is introduced in the form of W(u(ζ)) = 2 ∫_0^{u(ζ)} ψ^(−T)(z) R dz with the function

ψ(x) = (e^x − e^(−x)) / ((1/ū_j^max) e^x − (1/ū_j^min) e^(−x)),

where ū_j^max is the upper bound and ū_j^min is the lower bound, ū_j^max ≠ 0, ū_j^min ≠ 0. Compared with symmetric control constraints, the main difficulties of solving asymmetric control constraints are to calculate the inverse operation of the function ψ(x) and the integration of the inverse operation ψ^(−1)(·) in the nonquadratic functional. In addition, it requires ū_j^max ≠ 0 and ū_j^min ≠ 0, otherwise the function ψ(x) tends to zero. In order to overcome the deficiency, we let ū_j^max = τ when ū_j^max = 0 and let ū_j^min = τ when ū_j^min = 0, where τ is an arbitrarily small number. Whether the number τ is positive or negative depends on the actual situation. For convenience, the nonquadratic functional is uniformly represented by the form of W(u(k)) in the sequel.

To further explain the cost function, Eq. (4.2) is written as follows:

J(x(k)) = Q(x(k)) + W(u(k)) + γ Σ_{ζ=k+1}^{∞} γ^(ζ−k−1) U(x(ζ), u(x(ζ)))
        = Q(x(k)) + W(u(k)) + γ J(x(k + 1)).    (4.4)

Recalling the well-known Bellman's optimality principle, the optimal cost function satisfies the following HJB equation:

J*(x(k)) = min_{u(k)} {Q(x(k)) + W(u(k)) + γ J*(x(k + 1))}.    (4.5)

The optimal feedback control u*(x(k)) should satisfy

u*(x(k)) = arg min_{u(k)} {Q(x(k)) + W(u(k)) + γ J*(x(k + 1))}.    (4.6)

It is well known that the analytic solution of the HJB equation is difficult to obtain. In (Zhang et al. 2009), an iterative DHP algorithm was proposed to solve the nearoptimal control problem for discrete-time systems with control constrains. To guarantee the convergence of the iterative cost function, the initial cost function was set to zero, which greatly reduced its versatility. Therefore, in the following, we shall develop a new generalized value iteration algorithm to seek the near-optimal control solution.
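For illustration, the following Python sketch evaluates the asymmetric saturation function of Remark 4.1 and a scalar instance of the nonquadratic functional (4.3) with ψ = tanh, using a simple trapezoidal quadrature; the function names and the quadrature are assumptions of this sketch, not part of the algorithm derivation.

```python
import numpy as np

def psi_asym(x, u_max, u_min):
    """Asymmetric saturation function from Remark 4.1 (scalar channel); it tends
    to u_max as x -> +inf and to u_min as x -> -inf (u_max > 0 > u_min assumed)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) / u_max - np.exp(-x) / u_min)

def W_symmetric(u, u_bar, r, n=400):
    """Scalar version of the nonquadratic functional (4.3) with psi = tanh:
    W(u) = 2 * int_0^u arctanh(z / u_bar) * u_bar * r dz, valid for |u| < u_bar."""
    z = np.linspace(0.0, u, n)
    f = np.arctanh(z / u_bar) * u_bar * r
    return 2.0 * float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))
```

Such a utility term grows rapidly as u approaches the saturation bound, which is exactly what forces the minimizing control in (4.6) to remain inside the admissible range.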

4.3 Properties of the Generalized Value Iteration Algorithm with the Discount Factor This section consists of three parts. First, the generalized value iteration algorithm with the discount factor is derived for the constrained plant. Second, the corresponding properties of the generalized value iteration algorithm are discussed. Third, the implementation of the proposed control approach is presented.


4.3.1 Derivation of the Generalized Value Iteration Algorithm In this part, we construct two sequences, i.e., {Vi (x(k))} and {νi (x(k))}, where i = 0, 1, . . . is the iteration index. In the traditional value iteration algorithm, the initial cost function and the initial control law are assumed to be V0 (·) = 0 and ν0 (·) = 0, and then the iterative algorithm is carried out. Inspired by the pioneering work in (Li and Liu 2012), we start with the initial cost function V0 = x T (k)Γ0 x(k), where Γ0 is a positive semi-definite matrix. Furthermore, the initial control vector ν0 (x(k)) is computed by

ν0(x(k)) = arg min_{u(k)} {Q(x(k)) + W(u(k)) + γ V0(x(k + 1))}.    (4.7)

Then, we update the cost function as

V1(x(k)) = min_{u(k)} {Q(x(k)) + W(u(k)) + γ V0(x(k + 1))}
         = Q(x(k)) + W(ν0(x(k))) + γ V0(x(k + 1)).    (4.8)

Next, for i = 1, 2, . . ., the generalized value iteration algorithm iterates between

νi(x(k)) = arg min_{u(k)} {Q(x(k)) + W(u(k)) + γ Vi(x(k + 1))}    (4.9)

and

V_{i+1}(x(k)) = min_{u(k)} {Q(x(k)) + W(u(k)) + γ Vi(x(k + 1))}
             = Q(x(k)) + W(νi(x(k))) + γ Vi(x(k + 1)),    (4.10)

where x(k + 1) = F(x(k)) + G(x(k))νi (x(k)). In the above iteration process, we update the cost function sequence {Vi } and the control law sequence {νi } by making i = i + 1 until i → ∞. Particularly, it is desired that the cost function sequence and the control law sequence converge to the optimum values when i → ∞, i.e., Vi → J ∗ and νi → u ∗ . For the infinite-horizon problem, in the following, we will discuss the monotonicity, boundedness, and convergence of the iterative algorithm with the discount factor being considered.
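A minimal Python sketch of the iteration (4.7)–(4.10) on a finite grid of states and candidate controls is given below; the grids, the callables F, G, Q_fn, and W_fn, and the nearest-neighbor projection of x(k + 1) onto the grid are illustrative assumptions made only so the sketch is runnable.

```python
import numpy as np

def generalized_value_iteration(states, controls, F, G, Q_fn, W_fn,
                                gamma, Gamma0, iterations=50):
    """Sketch of (4.7)-(4.10): value iteration started from the nonzero initial
    cost V0(x) = x^T Gamma0 x, swept over sampled states and candidate controls."""
    V = {tuple(x): float(x @ Gamma0 @ x) for x in states}     # V0
    policy = {}

    def nearest(x):
        # crude projection of the successor state onto the state grid
        return tuple(min(states, key=lambda s: np.linalg.norm(s - x)))

    for _ in range(iterations):
        V_new = {}
        for x in states:
            costs = []
            for u in controls:
                x_next = F(x) + G(x) @ u
                costs.append(Q_fn(x) + W_fn(u) + gamma * V[nearest(x_next)])
            best = int(np.argmin(costs))
            policy[tuple(x)] = controls[best]     # control update (4.9)
            V_new[tuple(x)] = costs[best]         # cost update (4.10)
        V = V_new
    return V, policy
```

In the chapter itself the minimization and the cost representation are carried out by neural networks rather than by grids, but the sweep above mirrors the logical structure of the generalized value iteration with multiple initial states.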

4.3.2 Properties of the Generalized Value Iteration Algorithm Lemma 4.1 Let {ξi } be an arbitrary sequence of control laws and {νi } be the control law sequence as in (4.9). Define Vi as in (4.10) and define Ψi as


    Ψi+1 (x(k)) = Q(x(k)) + W ξi (x(k)) + γ Ψi (F(x(k)) + G x(k))ξi (x(k)) . (4.11) If V0 (·) = Ψ0 (·) = x T (k)Γ0 x(k), then Vi (x(k)) ≤ Ψi (x(k)), ∀i. Proof It is clear from the fact that Vi+1 is the minimum value of the right-hand side of (4.10) with respect to the control law u(k), while Ψi+1 (·) is the result of an arbitrary control law.  Theorem 4.1 Define the control law sequence {νi } as in (4.9) and the cost function sequence {Vi } as in (4.10) with V0 = x T (k)Γ0 x(k). If V0 (x(k)) ≥ V1 (x(k)) holds for all x(k) ∈ Ω, then the cost function satisfies Vi (x(k)) ≥ Vi+1 (x(k)), ∀i ≥ 0. If V1 (x(k)) ≥ V0 (x(k)) holds for all x(k) ∈ Ω, then the cost function satisfies Vi+1 (x(k)) ≥ Vi (x(k)), ∀i ≥ 0. Proof The proof is cut into two parts. (1) Suppose that V0 (x(k)) ≥ V1 (x(k)) holds for all x(k). Let {Di (x(k))} be a new sequence defined by

    D1 (x(k)) = Q(x(k)) + W ν0 (x(k)) + γ D0 F(x(k)) + G(x(k))ν0 (x(k))     Di+1 (x(k)) = Q(x(k)) + W νi−1 (x(k)) + γ Di F(x(k)) + G(x(k))νi−1 (x(k)) , (4.12)

  where D0 (·) = V0 (·) = x T (k)Γ0 x(k) and i ≥ 1. Note that W νi (x(k)) is the nonquadratic functional with respect to the control law νi (x(k)). Furthermore, we find that D1 (x(k)) = V1 (x(k)) and further obtain V0 (x(k)) − D1 (x(k)) = V0 (x(k)) − V1 (x(k)) ≥ 0.

(4.13)

Hence, V0 (x(k)) ≥ D1 (x(k)) holds. Next, we assume that Vi−1 (x(k)) ≥ Di (x(k)) holds, ∀i ≥ 1 and ∀x(k). Combining (4.10) and (4.12) yields Vi (x(k)) − Di+1 (x(k)) = γ Vi−1 (x(k + 1)) − γ Di (x(k + 1)).

(4.14)

Considering Vi−1 (x(k + 1)) ≥ Di (x(k + 1)), we have Vi (x(k)) ≥ Di+1 (x(k)), i ≥ 0.

(4.15)

According to Lemma 4.1, it is observed that Vi+1 (x(k)) ≤ Di+1 (x(k)), which means that Vi+1 (x(k)) ≤ Vi (x(k)), i ≥ 0, ∀x(k) ⊂ Ω.

(4.16)


Therefore, the first part of the proof is completed. (2) Suppose that V0 (x(k)) ≤ V1 (x(k)) holds for all x(k). Let {Bi (x(k))} be a new sequence defined by   Bi (x(k)) = Q(x(k)) + W νi (x(k)) + γ Bi−1 (x(k + 1)),

(4.17)

where B0 (x(k)) = V0 (x(k)) = x T (k)Γ0 x(k). For V1 (x(k)) and B0 (x(k)), we have the following inequality: V1 (x(k)) − B0 (x(k)) = V1 (x(k)) − V0 (x(k)) ≥ 0.

(4.18)

Then, we assume that it holds for i − 1, that is, Vi (x(k)) ≥ Bi−1 (x(k)), ∀i ≥ 1 and x(k) ∈ Ω. Combining (4.10) and (4.17), one has Vi+1 (x(k)) − Bi (x(k)) = γ Vi (x(k + 1)) − γ Bi−1 (x(k + 1)) ≥ 0,

(4.19)

which implies Vi+1 (x(k)) ≥ Bi (x(k)), ∀i ≥ 0.

(4.20)

Similarly, Vi (·) is the minimum value of the right-hand side of (4.10), whereas Bi (·) is the result of an arbitrary control law. Therefore, Vi (x(k)) ≤ Bi (x(k)) holds. Then, we can obtain Vi (x(k)) ≤ Vi+1 (x(k)), ∀i ≥ 0.

(4.21)

Therefore, the second part of the proof is completed. We can conclude that {Vi } is a monotonic cost function sequence.  Theorem 4.2 Let δ(x(k)) be an arbitrary control law with δ(0) = 0. Then, we define a new iterative cost function as     Λi+1 (x(k)) = Q(x(k)) + W δ(x(k)) + γ Λi F(x(k)) + G(x(k))δ(x(k)) , (4.22) where Λ0 (x(k)) = V0 (x(k)). If δ(x(k)) is an admissible control law, then limi→∞ Λi (x(k)) is finite. Proof In the light of (4.22), Λi+1 (x(k)) can further be written as


Λ_{i+1}(x(k)) = Q(x(k)) + W(δ(x(k))) + γ Q(x(k + 1)) + γ W(δ(x(k + 1))) + γ^2 Λ_{i−1}(x(k + 2))
             = Q(x(k)) + W(δ(x(k))) + γ Q(x(k + 1)) + γ W(δ(x(k + 1))) + γ^2 Q(x(k + 2)) + γ^2 W(δ(x(k + 2))) + γ^3 Λ_{i−2}(x(k + 3))
             ⋮
             = Q(x(k)) + W(δ(x(k))) + γ Q(x(k + 1)) + γ W(δ(x(k + 1))) + γ^2 Q(x(k + 2)) + γ^2 W(δ(x(k + 2))) + · · · + γ^i Q(x(k + i)) + γ^i W(δ(x(k + i))) + γ^(i+1) Λ0(x(k + i + 1))
             = Σ_{ℓ=0}^{i} γ^ℓ [Q(x(k + ℓ)) + W(δ(x(k + ℓ)))] + γ^(i+1) Λ0(x(k + i + 1)).    (4.23)

Define the limit of Λi(x(k)) as

lim_{i→∞} Λi(x(k)) = lim_{i→∞} { Σ_{ℓ=0}^{i−1} γ^ℓ [Q(x(k + ℓ)) + W(δ(x(k + ℓ)))] + γ^i Λ0(x(k + i)) }.    (4.24)

Note that δ(x(k)) is an admissible control law and x(k) is an arbitrary finite state, which means that the terms in the right-hand side of (4.24) are finite. Therefore, there exists an upper bound Y such that

lim_{i→∞} Λi(x(k)) ≤ Y.    (4.25)

Therefore, we complete the proof.

According to Theorem 4.1, we can observe that the monotonicity of the generalized value iteration sequence is affected by the initial cost function V0 . Note that different initial cost functions make the iterative sequence have different monotonicities, or even non-monotonicities. Therefore, the traditional convergence analysis in Al-Tamimi et al. (2008) is not feasible for the non-zero initial cost function. In Wei et al. (2016), the convergence of the value iteration towards the undiscounted regulation problem was discussed. Nevertheless, when the discount factor is introduced, the generalized value iteration for constrained problems has not been studied. Therefore, it is necessary to analyze the uniform convergence property of the developed algorithm with the discount factor. In what follows, the convergence proof of the generalized value iteration algorithm with a discount factor is provided. Theorem 4.3 Let Vi (x(k)) and νi (x(k)) be obtained by (4.7)–(4.10) with V0 = x T (k)Γ0 x(k), i = 0, 1, . . . . Let a, b, c, and d be constants that satisfy


…, we use (5.23) to obtain

[1/(1 + ς^(−1))] V^(ℓ)(e_k) ≤ [((1 + ς^(−1)) − 1)/(1 + ς^(−1))] ζ.    (5.25)

Note that (5.25) can be reformatted as

V^(ℓ)(e_k) − V^(ℓ)(e_k) + [1/(1 + ς^(−1))] V^(ℓ)(e_k)
    = V^(ℓ)(e_k) − [1 − 1/(1 + ς^(−1))] V^(ℓ)(e_k)
    ≤ [1 − 1/(1 + ς^(−1))] ζ.    (5.26)

Then, we consider the last inequality of (5.26) and get V () (ek ) − V () (ek ) ≤ ζ. 1 1− (1 + ς −1 )

(5.27)

According to Lemmas 5.1 and 5.2, the following inequalities can be derived: V () (ek ) ≤ V ∗ (ek ) ≤

V () (ek ) . 1 1− (1 + ς −1 )

(5.28)

Combining (5.27) with (5.28), we can finally derive that V () (ek ) − V () (ek ) 1 1− (1 + ς −1 ) ≤ ζ,

V ∗ (ek ) − V () (ek ) ≤

which ends the proof.

(5.29)


Remark 5.1 As mentioned in previous works (Wang et al. 2012; Rantzer 2006; Al-Tamimi et al. 2008; Li and Liu 2012), the iterative value function reaches the optimal value function as ℓ → ∞. However, it is impossible to perform infinitely many iteration steps to approximate V* in practical applications. Theorem 5.1 quantifies the approximation error between V* and V^(ℓ) when the iterative value function is used.

5.3.3 Closed-Loop Stability Analysis with Value Iteration

In this part, we discuss the UUB stability of the closed-loop system (5.13) under value iteration.

Definition 5.1 (cf. Khalil 2002) If there exists a compact set ΩS ⊂ Rⁿ such that, for all ek ∈ ΩS, there exist a positive constant B̄ ≥ 0 and a lapsed time T(B̄, ek) with ‖eK‖ ≤ B̄, ∀K ≥ k + T, then the equilibrium point eq is UUB.

From Definition 5.1, the equilibrium point of the controlled plant is UUB in the sense that, for all initial states in ΩS, the system trajectory eventually reaches the neighborhood ‖eK‖ ≤ B̄ of eq after T time steps.

Theorem 5.2 Under Assumption 5.1, if {V^(ℓ)} with V^(0)(·) = 0 and {u^(ℓ)} are iteratively updated by (5.20) and (5.21) and there exists a positive constant b̄ satisfying

V^(ℓ)(ek)/((1 + ς⁻¹)^ℓ − 1) ≤ b̄,    (5.30)

where ℓ ∈ N+, then the tracking error under the control policy u^(ℓ) is UUB.

Proof Since V^(0)(·) = 0 and u^(0)(ek) = 0 as mentioned in Sect. 5.3.1, we acquire V^(1)(ek) = ekᵀQek. Obviously, V^(1)(ek) is a positive definite function. According to (5.20), we have

V^(2)(ek) = ekᵀQek + u^(1)T(ek)Ru^(1)(ek) + V^(1)(F(ek, u^(1)(ek))).    (5.31)

Under Assumption 5.1, if ek = 0, then u(ek) = 0, F(0, 0) = 0, and U(0, 0) = 0. The value function V^(2)(ek) satisfies V^(2)(ek) > 0 for any ek ≠ 0. Using mathematical induction, it can be concluded that, for ℓ ∈ N+, V^(ℓ)(ek) is a positive definite function. Therefore, there must exist two class-K functions h1(·) and h2(·) satisfying

h1(‖ek‖) ≤ V^(ℓ)(ek) ≤ h2(‖ek‖).    (5.32)

Next, considering (5.30) and Theorem 5.1, we obtain

V*(ek) − V^(ℓ)(ek) ≤ b̄.    (5.33)


According to Lemma 5.2, we have V*(ek) ≥ V^(ℓ)(ek). Then, from (5.33), the following inequality can be derived:

V^(ℓ+1)(ek) − V^(ℓ)(ek) ≤ b̄.    (5.34)

Considering (5.20), we can find that

V^(ℓ)(ek+1) − V^(ℓ+1)(ek) = −U(ek, u^(ℓ)(ek)).    (5.35)

The relationships in (5.34) and (5.35) imply

V^(ℓ)(ek+1) − V^(ℓ)(ek) ≤ −U(ek, u^(ℓ)(ek)) + b̄.    (5.36)

Define a new state set as

Θe = {ek | U(ek, u^(ℓ)(ek)) ≤ b̄}.    (5.37)

For any ek ∈ Θe, ‖ek‖ is bounded. Denote the supremum of ‖ek‖ in this region as

γ = sup_{ek∈Θe} {‖ek‖}.    (5.38)

According to (5.37) and (5.38), there exists a positive constant Υ which satisfies

h1(Υ) ≥ h2(γ), Υ ≥ γ.    (5.39)

For all r ≥ Υ, there exists a function h3(·) satisfying

h2 ∘ h3(r) ≤ h1(r), h3(r) ≥ γ,    (5.40)

where h2 ∘ h3(·) means h2(h3(·)). We can easily find a state ek satisfying γ < ‖ek‖ ≤ h3(r). Hence, considering (5.32) results in

V^(ℓ)(ek) ≤ h2 ∘ h3(r) ≤ h1(r).    (5.41)

According to (5.41), when γ < ‖ek‖ ≤ h3(r), the inequality V^(ℓ)(ek+1) − V^(ℓ)(ek) < 0 holds and there exists a T > 0 ensuring

h1(‖ek+T‖) ≤ V^(ℓ)(ek+T) ≤ V^(ℓ)(ek) ≤ h1(r),    (5.42)

which yields ‖ek+T‖ ≤ r. Then, for any ‖ek‖ > γ, there exists a T such that ‖ek+T‖ ≤ γ ≤ Υ. According to Definition 5.1, the tracking error is UUB. □

Remark 5.2 The parameter ζ in Theorem 5.1 reflects the approximation accuracy achieved by using the iterative value function, while b̄ in Theorem 5.2 determines the state set Θe in (5.37). Observing the state set Θe, the lapsed time T is related to b̄ and ek, as mentioned in Definition 5.1.

5.4 Neural Network Implementation of the Iterative HDP Algorithm

In this section, we introduce the implementation procedure of the value iteration algorithm in detail. In the iterative HDP algorithm, two neural networks are used to approximate the optimal value function and the optimal control policy: the critic network and the action network. A lemma is proposed to develop the novel updating rules of the action network, which avoids establishing a model network for the tracking error dynamics when the original system function is known.

5.4.1 The Critic Network

The critic network is trained to approximate the value function V^(ℓ). We use a three-layer neural network with L hidden-layer neurons to construct the critic network. The output of the critic network is denoted as

V̂^(ℓ)(ek) = wc2^(ℓ)T σ(wc1^(ℓ)T ek + bc1^(ℓ)),    (5.43)

where ek is the input of the critic network, σ(·) is the activation function of the neural network, wc1^(ℓ) ∈ R^{n×L} and wc2^(ℓ) ∈ R^{L} are weight matrices, and bc1^(ℓ) ∈ R^{L} is the threshold vector at the ℓ-th iteration step. Here, we choose sigmoid(·) as the activation function. According to (5.20), the target value function can be computed by

V^(ℓ)(ek) = ekᵀQek + û^(ℓ−1)T(ek)Rû^(ℓ−1)(ek) + V^(ℓ−1)(ek+1),    (5.44)

where V^(ℓ−1)(ek+1) is obtained from the critic network and ek+1 = F(xk, û^(ℓ−1)(ek) + μck) − ρ(ck). Note that, according to (5.5), μck can be obtained by various numerical methods. Next, we define the approximation error as

Ec = (1/2) [V̂^(ℓ)(ek) − V^(ℓ)(ek)]ᵀ [V̂^(ℓ)(ek) − V^(ℓ)(ek)].    (5.45)

Here, based on a proper learning rate θ ∈ (0, 1), we employ the Levenberg–Marquardt training algorithm (Hagan and Menhaj 1994; Fu et al. 2011) to update the weight matrices and the threshold vector.
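A minimal NumPy sketch of how such a critic network can be realized is given below. It only illustrates the forward pass (5.43) and the target computation (5.44); for brevity it uses a plain gradient step on (5.45) instead of the Levenberg–Marquardt routine adopted in the text, and all class and variable names are illustrative rather than part of the original design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class CriticNetwork:
    """Three-layer critic: V_hat(e) = wc2' * sigmoid(wc1' * e + bc1), cf. (5.43)."""
    def __init__(self, n, L, rng):
        self.wc1 = 0.1 * rng.standard_normal((n, L))   # input-to-hidden weights
        self.bc1 = np.zeros(L)                         # hidden threshold vector
        self.wc2 = 0.1 * rng.standard_normal(L)        # hidden-to-output weights

    def value(self, e):
        h = sigmoid(self.wc1.T @ e + self.bc1)
        return self.wc2 @ h

    def train_step(self, e, target, lr=0.01):
        """One gradient step on Ec = 0.5*(V_hat - target)^2, cf. (5.45)."""
        h = sigmoid(self.wc1.T @ e + self.bc1)
        err = self.wc2 @ h - target
        grad_pre = err * self.wc2 * h * (1.0 - h)      # backprop through sigmoid
        self.wc2 -= lr * err * h
        self.bc1 -= lr * grad_pre
        self.wc1 -= lr * np.outer(e, grad_pre)
        return 0.5 * err ** 2

def critic_target(e, u_prev, e_next, Q, R, critic):
    """Target value in (5.44): e'Qe + u'Ru + V^(l-1)(e_{k+1})."""
    return e @ Q @ e + u_prev @ R @ u_prev + critic.value(e_next)
```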


5.4.2 The Action Network

The action network is used to approximate the iterative control input u^(ℓ). Similarly, a three-layer neural network with M hidden-layer neurons is applied to establish the action network. However, for nonaffine systems, the expression of u^(ℓ) cannot be obtained directly, so the gradient descent algorithm is used to find the control input u^(ℓ) in (5.21). First, the output of the action network is defined as follows:

û^(ℓ)(ek) = wa2^(ℓ)T δ(wa1^(ℓ)T ek),    (5.46)

where wa1^(ℓ) ∈ R^{n×M} and wa2^(ℓ) ∈ R^{M×m} are the weight matrices at the ℓ-th iteration and δ(·) is the activation function of the action network. Here, tanh(·) is chosen as the activation function. In order to find the control policy that minimizes the value function, we define the approximation error of the action network as

Ea = (1/2) εaᵀ εa,    (5.47)

where

εa = V^(ℓ)(ek+1) + U(ek, û^(ℓ−1)(ek)) − Uc    (5.48)

and Uc is the desired training objective, often specified as zero, i.e., Uc = 0 (Si and Wang 2001). The action network weights are updated to solve (5.21). Considering (5.46) and denoting the action network weights as W_A^(ℓ), (5.21) can be rewritten as

W_A^(ℓ) = arg min_{W_A^(ℓ)} {ekᵀQek + ûkᵀRûk + V^(ℓ)(ek+1)}.    (5.49)

For simplicity, we collect the action network weight matrices wa1^(ℓ) and wa2^(ℓ) into two column vectors Wa1^(ℓ) ∈ R^{nM} and Wa2^(ℓ) ∈ R^{Mm}, respectively. Next, based on the gradient descent method, the updating rules are given as follows:

Wa1^(ℓ) := Wa1^(ℓ) − ϑ (∂û^(ℓ−1)(ek)/∂Wa1^(ℓ))ᵀ [ (∂Ea/∂U(ek, û^(ℓ−1)(ek))) (∂U(ek, û^(ℓ−1)(ek))/∂û^(ℓ−1)(ek)) + (∂Ea/∂V^(ℓ)(ek+1)) (∂ek+1/∂û^(ℓ−1)(ek))ᵀ (∂V^(ℓ)(ek+1)/∂ek+1) ],

Wa2^(ℓ) := Wa2^(ℓ) − ϑ (∂û^(ℓ−1)(ek)/∂Wa2^(ℓ))ᵀ [ (∂Ea/∂U(ek, û^(ℓ−1)(ek))) (∂U(ek, û^(ℓ−1)(ek))/∂û^(ℓ−1)(ek)) + (∂Ea/∂V^(ℓ)(ek+1)) (∂ek+1/∂û^(ℓ−1)(ek))ᵀ (∂V^(ℓ)(ek+1)/∂ek+1) ],    (5.50)

where the symbol := denotes the assignment operation and ϑ ∈ (0, 1) is the learning rate of the action network. Note that, at each iteration, the neural network weights are trained by a complete set of iterations on (5.50), called the inner loop in (Heydari 2014). Before updating the action network, we need to obtain ∂ek+1/∂û^(ℓ−1)(ek). Generally, it is necessary to construct a model network for the tracking error dynamics so that ∂ek+1/∂û^(ℓ−1)(ek) can be computed from the mathematical expression of the model network, even if the original system function is known. In the following, we give a new approach to compute ∂ek+1/∂û^(ℓ−1)(ek) without using the model network.

Lemma 5.3 Define the control input μsk for the original system as in (5.2) and the new control input μk with respect to the augmented system as in (5.7). Then, the expression

∂ek+1/∂μk = ∂F(xk, μsk)/∂μsk = ∂xk+1/∂μsk    (5.51)

is true, which establishes an equivalent relationship between different partial derivatives.

Proof Since ψ(ck) and ρ(ck) are related only to ck, according to the augmented system (5.8), the term ∂ek+1/∂μk satisfies the following condition:

∂ek+1/∂μk = ∂(F(ek + ck, μk + ψ(ck)) − ρ(ck))/∂μk = ∂F(ek + ck, μk + ψ(ck))/∂(μk + ψ(ck)).    (5.52)

On the other hand, based on the original system (5.2), we have

∂F(xk, μsk)/∂μsk = ∂F(ek + ck, μk + ψ(ck))/∂(μk + ψ(ck)).    (5.53)

Then, according to (5.52) and (5.53), Eq. (5.51) holds and the proof is completed. □

Therefore, based on Lemma 5.3, the updating rules can be further rewritten as follows:


Wa1^(ℓ) := Wa1^(ℓ) − ϑεa (∂û^(ℓ−1)(ek)/∂Wa1^(ℓ))ᵀ [ ∂U(ek, û^(ℓ−1)(ek))/∂û^(ℓ−1)(ek) + (∂F(xk, μsk^(ℓ−1))/∂μsk^(ℓ−1))ᵀ (∂V^(ℓ)(ek+1)/∂ek+1) ],

Wa2^(ℓ) := Wa2^(ℓ) − ϑεa (∂û^(ℓ−1)(ek)/∂Wa2^(ℓ))ᵀ [ ∂U(ek, û^(ℓ−1)(ek))/∂û^(ℓ−1)(ek) + (∂F(xk, μsk^(ℓ−1))/∂μsk^(ℓ−1))ᵀ (∂V^(ℓ)(ek+1)/∂ek+1) ],    (5.54)

where μsk^(ℓ−1) = û^(ℓ−1)(ek) + μck. The iterative control input corresponding to the original system is also updated accordingly and is used to control the original system (5.2). Then, the state xk+1 can be obtained and the tracking error ek+1 can be computed by (5.4). The detailed design procedure is summarized as Algorithm 4.

Algorithm 4 HDP-Based Iterative Tracking Control Algorithm
1: Initialize the matrices Q and R, the initial desired trajectory c0, and the stopping error ε of the value iteration. Let V^(0)(·) = 0 and u^(0)(·) = 0. Set the maximum number of iterations as ℓmax.
2: Evenly select an array of S state vectors {xk^(1), . . . , xk^(S)}.
3: Compute the corresponding tracking errors {ek^(1), . . . , ek^(S)} and set the iteration index ℓ = 1.
4: Construct and train the critic network to compute V^(1)(ek^(1)), . . . , V^(1)(ek^(S)) under V^(0)(·) = 0 and u^(0)(·) = 0.
5: while ℓ ≤ ℓmax do
6:   Train the action network to compute the corresponding control inputs u^(ℓ)(ek^(1)), . . . , u^(ℓ)(ek^(S)) according to (5.21).
7:   Compute μck based on (5.6) and obtain us^(ℓ)(ek^(1)), . . . , us^(ℓ)(ek^(S)) according to (5.7).
8:   Compute the states xk+1^(1), . . . , xk+1^(S) according to the original system and obtain the errors ek+1^(1), . . . , ek+1^(S) based on (5.4).
9:   Obtain V^(ℓ)(ek+1^(1)), . . . , V^(ℓ)(ek+1^(S)) through the critic network.
10:  Train the critic network to approximate the value functions V^(ℓ+1)(ek^(1)), . . . , V^(ℓ+1)(ek^(S)) according to (5.20).
11:  if max{|ΔV^(ℓ)(ek^(1))|, . . . , |ΔV^(ℓ)(ek^(S))|} < ε then
12:    Stop.
13:  end if
14:  Let ℓ ← ℓ + 1.
15: end while


Remark 5.3 The increment of the value function is denoted as |ΔV^(ℓ)(ek^(p))| = |V^(ℓ)(ek^(p)) − V^(ℓ−1)(ek^(p))| in Algorithm 4, where ek^(p) represents the p-th tracking error sample. The proposed Algorithm 4 has three key points. First, the original system function F(·) in (5.2) and the desired trajectory ρ(·) are known. Second, μck is required in this algorithm; if there is no analytic solution to Eq. (5.5), its numerical solution can readily be obtained by a numerical method. Third, the algorithm utilizes the relationship between ∂ek+1/∂μk and ∂F(xk, μsk)/∂μsk to update the weights of the action network without establishing a model network for the tracking error dynamics.
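The following sketch illustrates the third key point of Remark 5.3, namely an action-network update in the spirit of (5.54). It assumes that the utility gradient ∂U/∂û, the system Jacobian ∂F(xk, μsk)/∂μsk from Lemma 5.3, and the critic gradient ∂V^(ℓ)(ek+1)/∂ek+1 are supplied by the caller (for example, by analytic differentiation or finite differences). The class and argument names are illustrative, and the code is a simplified stand-in for the full training procedure described above.

```python
import numpy as np

class ActionNetwork:
    """Three-layer action network u_hat(e) = wa2' * tanh(wa1' * e), cf. (5.46)."""
    def __init__(self, n, M, m, rng):
        self.wa1 = 0.1 * rng.standard_normal((n, M))
        self.wa2 = 0.1 * rng.standard_normal((M, m))

    def control(self, e):
        h = np.tanh(self.wa1.T @ e)
        return self.wa2.T @ h

    def update(self, e, eps_a, dU_du, dF_dmu, dV_de_next, lr=0.02):
        """One gradient step of the form (5.54).

        The output-side gradient eps_a * [dU/du + (dF/dmu_s)^T dV/de_{k+1}]
        is backpropagated through both weight layers.
        """
        h = np.tanh(self.wa1.T @ e)                        # hidden activations
        g_out = eps_a * (dU_du + dF_dmu.T @ dV_de_next)    # (m,)
        grad_wa2 = np.outer(h, g_out)                      # dE/dwa2
        g_hid = (self.wa2 @ g_out) * (1.0 - h ** 2)        # back through tanh
        grad_wa1 = np.outer(e, g_hid)                      # dE/dwa1
        self.wa2 -= lr * grad_wa2
        self.wa1 -= lr * grad_wa1
```

Here eps_a corresponds to εa in (5.48), dU_du to 2Rû^(ℓ−1)(ek), dF_dmu to the Jacobian of the known original dynamics, and dV_de_next to the critic gradient at ek+1; the learning rate matches ϑ in Table 5.1 only for illustration.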

Remark 5.4 The iterative value function plays an important role in the approximation process of the optimal value function, and introducing the threshold vector into the critic network improves its fitting performance.

5.5 Simulation Experiments

In this section, we employ two simulation examples to demonstrate the effectiveness of the neuro-optimal tracking control algorithm. A linear system is used to verify the optimality of the controller, and an inverted pendulum plant with a hyperbolic tangent input is employed to test the applicability of the proposed method.

5.5.1 Example 1

Consider the following linear system:

xk+1 = Axk + Bμsk = [0.9996  0.0099; −0.0887  0.97] xk + [1; 0] μsk,    (5.55)

where xk ∈ R² and μsk ∈ R. We define the reference trajectory function as

ck+1 = Hck = [0.9963  0.4; −0.0887  0.97] ck.    (5.56)

The parameter values used in the proposed algorithm are given in Table 5.1. The state space is selected as Ωx = {(x1k, x2k) : −0.5 ≤ x1k ≤ 0.5, −0.5 ≤ x2k ≤ 0.5} ⊂ R². In this region, 441 state vector samples are evenly selected, and the initial point of the reference trajectory is c0 = [−0.1, 0.2]ᵀ. Then, 441 corresponding tracking error samples are obtained as {ek^(1), . . . , ek^(441)}. With this operation, we have the state space of the tracking error Ωe ⊂ R². Then, we construct the critic and action networks with the structure 2–12–1. The learning rates θ and ϑ of these two


Table 5.1 Parameter values of the iteration algorithm in Example 1

Parameters | θ    | ϑ    | Q  | R  | ℓmax | ε
Values     | 0.01 | 0.02 | I2 | I1 | 30   | 0.01

networks are also shown in Table 5.1. Note that 100 training epochs are set to ensure the approximation accuracy in each iteration. In the iteration process, once the iterative stopping error ε is reached, the learning procedure of the iteration algorithm is finished. On the other hand, according to the system function (5.55) and Eq. (5.5), the steady control μck corresponding to the desired trajectory is solved by the following equation:

μck = c1(k+1) − 0.9996c1k − 0.0099c2k.    (5.57)

After obtaining the control input uk and the steady control μck, we can obtain the control input μsk of the original system by considering the relationship in (5.7).
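As a small illustration, the steady control (5.57) can be evaluated directly from the reference dynamics (5.56); the snippet below is a minimal sketch with illustrative names.

```python
import numpy as np

# Steady control for the linear example, cf. (5.56)-(5.57):
# mu_ck = c1(k+1) - 0.9996*c1k - 0.0099*c2k, with c(k+1) = H c(k).
H = np.array([[0.9963, 0.4],
              [-0.0887, 0.97]])

def steady_control(c):
    c_next = H @ c
    return c_next[0] - 0.9996 * c[0] - 0.0099 * c[1]

c0 = np.array([-0.1, 0.2])      # initial reference point
print(steady_control(c0))       # steady control at the first step
```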

(5.58)

5.5 Simulation Experiments

135

( p)

( p)

( p)

Fig. 5.1 Iterative value functions: a V () (ek ); b ΔV () (ek ); c ΔV (30) (ek ); d V () (e0 ) 25 20 15 10 5 0 0

5

10

15

20

25

30

5

10

15

20

25

30

1.1 1 0.9 0.8 0.7 0.6 0

Fig. 5.2 The convergence trajectories of parameters of the critic and action networks

136

5 Nonaffine Neuro-Optimal Tracking Control with Accuracy and Stability Guarantee

The system state x1k

1.5

The optimal state x1k The reference trajectory c1k

1

The state x1k controlled by the proposed method

0.5 0 -0.5 0

10

20

30

40

50

60

70

80

90

100

The system state x2k

Time steps The optimal state x2k

0.6

The reference trajectory c2k

0.4

The state x2k controlled by the proposed method

0.2 0 -0.2 -0.4 0

10

20

30

40

50

60

70

80

90

100

Time steps

Fig. 5.3 The optimal state trajectories of the original system

ek+1 = Axk + B(μck + μk) − Hck = Axk − Ack + Bμk + (Ack + Bμck − Hck) = Aek + Bμk.    (5.59)

Since the tracking error plant is a linear system, the linear quadratic regulator (LQR) problem can be considered here. To verify the optimality of the proposed method, by directly solving the algebraic Riccati equation of the system (5.59), we can obtain the optimal solution P* and the optimal feedback gain G* as follows:

P* = [1.7823  −1.1481; −1.1481  9.2451],  G* = [0.6769  −0.3939].    (5.60)

The state curves of the original system and the tracking error trajectories are shown in Figs. 5.3 and 5.4, respectively. Based on the LQR method, the optimal control computed by u*(ek) = −G*ek and the optimal value function computed by V*(ek) = ekᵀP*ek are shown in Figs. 5.5 and 5.6, respectively. Compared with these optimal state, control, and value function trajectories, the proposed method achieves competitive performance and high precision.
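For reference, the LQR solution (5.60) can be cross-checked with a few lines of code. The sketch below solves the discrete-time algebraic Riccati equation of (5.59) with SciPy and is meant only as a numerical sanity check.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Error dynamics (5.59) with Q = I2 and R = I1 from Table 5.1.
A = np.array([[0.9996, 0.0099],
              [-0.0887, 0.97]])
B = np.array([[1.0],
              [0.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_discrete_are(A, B, Q, R)
G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback gain
print(P)   # should be close to P* in (5.60)
print(G)   # should be close to G* in (5.60)
```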

Fig. 5.4 The tracking error trajectories

Fig. 5.5 The optimal control of tracking error dynamics (the optimal control input and the control input generated by the action network)

Fig. 5.6 The optimal value function (the optimal value function and the value function generated by the critic network)


5.5.2 Example 2

Consider a type of inverted pendulum plant with a hyperbolic tangent input derived from (Wang and Qiao 2019). The system dynamics are given as follows:

ρ̇ = ω,
ω̇ = −(Mgl/κ1) sin(ρ) − (κ2/κ1) ρ̇ + (κ3/κ1)(tanh(μs) + μs),    (5.61)

where ρ, ω, and μs are the current angle, the associated angular velocity, and the control input, respectively. The parameter values of the mass M, the length of the pendulum l, the gravitational acceleration g, the rotary inertia κ1, the frictional factor κ2, and the parameter κ3 of the control input are given in Table 5.2.

Table 5.2 Parameter values of the controlled plant in Example 2

Parameters | g   | M   | l | κ1  | κ2  | κ3
Values     | 9.8 | 0.5 | 1 | 0.8 | 0.2 | 1


Table 5.3 Parameter values of the iteration algorithm in Example 2

Parameters | θ    | ϑ    | Q  | R     | ℓmax | ε
Values     | 0.01 | 0.02 | I2 | 0.1I1 | 30   | 2 × 10⁻³

Regarding x = [ρ, ω]ᵀ as the state vector, we discretize the inverted pendulum plant by using the Euler method with the sampling interval Δt = 0.1 s. Then, the discrete-time state space formulation is obtained as follows:

xk+1 = [x1k + 0.1x2k; −0.6125 sin(x1k) + 0.975x2k] + [0; 0.125(μsk + tanh(μsk))].    (5.62)

We set the desired reference trajectory as

ck+1 = [c1k + 0.1c2k; −0.2492c1k + 0.9888c2k].    (5.63)

The parameter values used in the algorithm are shown in Table 5.3. The state space is selected as Ωx = {(x1k, x2k) : −0.4 ≤ x1k ≤ 0.4, −0.4 ≤ x2k ≤ 0.4} ⊂ R². Similarly, 441 state vector samples are evenly chosen and c0 = [−0.1, 0.2]ᵀ is set as the initial reference point. The critic and action networks have the same structures and training epochs as those in Example 1. Besides, the steady control μck corresponding to the desired trajectory needs to satisfy the following equation:

c2(k+1) = −0.6125 sin(c1k) + 0.975c2k + 0.125(μck + tanh(μck)).    (5.64)

Obviously, Eq. (5.64) is implicit in μck. In this case, we use the function "fsolve" in MATLAB to solve for μck. Then, the critic network starts to iteratively learn the optimal value function. The convergence process of the iterative value functions is given in Fig. 5.7a. Increments of the iterative value functions and ΔV^(30)(ek^(p)) are plotted in Fig. 5.7b, c, respectively. The initial state vector is set as x0 = [0.3, −0.3]ᵀ. In the learning process, the value function trajectory of the initial state is presented in Fig. 5.7d. Besides, the norms of the weight matrices and the threshold vector in the iteration process are plotted in Fig. 5.8. Furthermore, starting from the initial state, the trained critic network is used to obtain the optimal value function and the optimal control. The state trajectories of the original system and the tracking error curves are shown in Figs. 5.9 and 5.10, respectively.
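The implicit equation (5.64) can be solved numerically in the same spirit; the snippet below is a minimal sketch that uses SciPy's fsolve in place of the MATLAB routine mentioned above, with illustrative function names.

```python
import numpy as np
from scipy.optimize import fsolve

# Reference dynamics (5.63) and the implicit steady-control equation (5.64).
H = np.array([[1.0, 0.1],
              [-0.2492, 0.9888]])

def steady_control(c):
    c2_next = (H @ c)[1]
    residual = lambda mu: (-0.6125 * np.sin(c[0]) + 0.975 * c[1]
                           + 0.125 * (mu + np.tanh(mu)) - c2_next)
    return fsolve(residual, x0=0.0)[0]

c0 = np.array([-0.1, 0.2])
print(steady_control(c0))
```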

Fig. 5.7 Iterative value functions: a V^(ℓ)(ek^(p)); b ΔV^(ℓ)(ek^(p)); c ΔV^(30)(ek^(p)); d V^(ℓ)(e0)

Fig. 5.8 The convergence trajectories of parameters of the critic and action networks


Fig. 5.9 State trajectories of the original system

Fig. 5.10 The tracking error trajectories

Fig. 5.11 Control inputs (uk, uck, and usk)

The proposed method ensures that the tracking error dynamics converge to zero within 40 time steps. The control input curves of the tracking error system, the reference trajectory, and the original system are all plotted in Fig. 5.11. These results demonstrate the effectiveness of the developed trajectory tracking design strategy.

5.6 Conclusions

In this chapter, an effective approach is developed to solve the steady control input μck for nonaffine systems, and novel updating rules are provided to find the optimal control policy. In addition, the value iteration algorithm is introduced to solve the optimal tracking control problem, and a detailed stability analysis of the closed-loop system is provided. Furthermore, the proposed algorithm is implemented by using the critic and action networks, and its design procedure is developed. At the end of this chapter, two simulation examples are conducted to verify the optimality and applicability of the tracking control method for nonaffine systems. However, the controlled plant is required to be known in this chapter, so it is of interest to extend the present tracking control method to the model-free tracking problem. Besides, stronger stability properties of closed-loop systems with noise will be discussed in future work. More advanced adaptive critic tracking control methods are also needed for practical complex systems with unknown dynamics.

References Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B Cybern 38(4):943–949 Cheng L, Liu Y, Hou ZG, Tan M, Du D, Fei M (2021) A rapid spiking neural network approach with an application on hand gesture recognition. IEEE Trans Cogn Dev Syst 13(1):151–161 Esfandiari K, Abdollahi F, Talebi HA (2017) Adaptive near-optimal neuro controller for continuoustime nonaffine nonlinear systems with constrained input. Neural Netw 93:195–204 Fan QY, Wang D, Xu B (2022) H∞ codesign for uncertain nonlinear control systems based on policy iteration method. IEEE Trans Cybern 52(10):10101–10110 Fan QY, Yang GH (2018) Policy iteration based robust co-design for nonlinear control systems with state constraints. Inf Sci 467(9):256–270 Fu J, He H, Zhou X (2011) Adaptive learning and control for MIMO system based on adaptive dynamic programming. IEEE Trans Neural Netw 22(7):1133–1148 Gao W, Jiang Z, Lewis FL, Wang Y (2018) Leader-to-formation stability of multiagent systems: an adaptive optimal control approach. IEEE Trans Autom Control 63(10):3581–3587 Gao W, Mynuddin M, Wunsch DC, Jiang ZP (2022) Reinforcement learning-based cooperative optimal output regulation via distributed adaptive internal model. IEEE Trans Neural Netw Learn Syst 33(10):5229–5240 Ha M, Wang D, Liu D (2020) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern Syst 50(9):3158–3168 Ha M, Wang D, Liu D (2021) Generalized value iteration for discounted optimal control with stability analysis. Syst Control Lett 147:104847 Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993 Han X, Zheng Z, Liu L, Wang B, Cheng Z, Fan H, Wang Y (2020) Online policy iteration ADP-based attitude-tracking control for hypersonic vehicles. Aerosp Sci Technol 106:106233 Ha M, Wang D, Liu D (2022) Offline and online adaptive critic control designs with stability guarantee through value iteration. IEEE Trans Cybern 52(12):13262–13274 Heydari A (2014) Revisiting approximate dynamic programming and its convergence. IEEE Trans Cybern 44(12):2733–2743 Heydari A (2016) Theoretical and numerical analysis of approximate dynamic programming with approximation errors. J Guidance Control Dyn 39(2):301–311 Heydari A (2018) Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy. IEEE Trans Neural Netw Learn Syst 29(9):4522–4527 Heydari A, Balakrishnan S (2014) Adaptive critic based solution to an orbital rendezvous problem. J Guidance Control Dyn 37:344–350 Jiang H, Zhang H (2018) Iterative ADP learning algorithms for discrete-time multi-player games. Artif Intell Rev 50(1):75–91 Jiang H, Zhang H, Luo Y, Han J (2019) Neural-network-based robust control schemes for nonlinear multiplayer systems with uncertainties via adaptive dynamic programming. IEEE Trans Syst Man Cybern Syst 49(3):579–588 Kamalapurkar R, Dinhb H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking for continuous-time nonlinear systems. Automatica 51:40–48 Khalil HK (2002) Nonlinear system. Prentice-Hall Press, New York, NY, USA


Kiumarsi B, Lewis FL (2015) Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans Neural Netw Learn Syst 26(1):140–151 Kiumarsi B, Lewis FL, Naghibi-Sistani MB, Karimpour A (2015) Optimal tracking control of unknown discrete-time linear systems using input-output measured data. IEEE Trans Cybern 45(12):2770–2779 Kiumarsi B, Vamvoudakis KG, Modares H, Lewis FL (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst 29(6):2042–2062 Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general value iteration. IET Control Theory Appl 6(18):2725–2736 Li J, Ding J, Chai T, Lewis FL, Jagannathan S (2021a) Adaptive interleaved reinforcement learning: robust stability of affine nonlinear systems with unknown uncertainty. IEEE Trans Neural Netw Learn Syst 33(1):270–280 Li C, Ding J, Lewis FL, Chai T (2021b) A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 129:109687 Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Autom Control 51(8):1249–1260 Liu D, Wei Q, Wang D, Yang X, Li H (2017) Adaptive dynamic programming with applications in optimal control. Springer Press, Gewerbestrasse, Cham, Switzerland Liu YJ, Zeng Q, Tong S, Chen CLP, Liu L (2020) Actuator failure compensation-based adaptive control of active suspension systems with prescribed performance. IEEE Trans Ind Electron 67(8):7044–7053 Liu L, Liu YJ, Tong S (2018a) Neural networks-based adaptive finite-time fault-tolerant control for a class of strict-feedback switched nonlinear systems. IEEE Trans Cybern 49(7):2536–2545 Liu L, Wang Z, Zhang H (2018b) Neural-network-based robust optimal tracking control for MIMO discrete-time systems with unknown uncertainty using adaptive critic design. IEEE Trans Neural Netw Learn Syst 29(4):1239–1251 Liu YJ, Lu S, Tong S, Chen X, Chen CLP, Li D (2018c) Adaptive control-based Barrier Lyapunov Functions for a class of stochastic nonlinear systems with full state constraints. Automatica 87:83–93 Luo B, Liu D, Huang T, Wang D (2016) Model-free optimal tracking control via critic-only Qlearning. IEEE Trans Neural Netw Learn Syst 27(10):2134–2144 Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown constrainedinput systems using integral reinforcement learning. Automatica 50(7):1780–1792 Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming. Int J Control 87(5):1000–1009 Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control Theory Appl 153(5):567–574 Si J, Wang YT (2001) Online learning control by association and reinforcement. IEEE Trans Neural Netw 12(2):264–276 Wang D (2020) Intelligent critic control with robustness guarantee of disturbed nonlinear plants. IEEE Trans Cybern 50(6):2740–2748 Wang D, Qiao J (2019) Approximate neural optimal control with reinforcement learning for a torsional pendulum device. Neural Netw 117:1–7 Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: a survey. 
IEEE Trans Cybern 47(10):3429–3451 Wang D, Ha M, Qiao J (2020) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Autom Control 65(3):1272–1279 Wang D, Cheng L, Yan J (2022a) Self-learning robust control synthesis and trajectory tracking of uncertain dynamics. IEEE Trans Cybern 52(1):278–286


Wang D, Ha M, Cheng L (2022b) Neuro-optimal trajectory tracking with value iteration of discretetime nonlinear dynamics. IEEE Trans Neural Netw Learn Syst (in press) Wang D, Qiao J, Cheng L (2022c) An approximate neuro-optimal solution of discounted guaranteed cost control design. IEEE Trans Cybern 52(1):77–86 Wang D, Ha M, Qiao J (2021a) Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, Zhao M, Ha M, Ren J (2021b) Neural optimal tracking control of constrained nonaffine systems with a wastewater treatment application. Neural Netw 143:121–132 Wei Q, Lewis FL, Liu D, Song R, Lin H (2018) Discrete-time local value iteration adaptive dynamic programming: convergence analysis. IEEE Trans Syst Man Cybern Syst 48(6):875–891 Werbos PJ (1977) Advanced forecasting methods for global crisis warning and models of intelligence. Gen Syst Yearb 22:25–38 Yang X, He H (2021) Decentralized event-triggered control for a class of nonlinear-interconnected systems using reinforcement learning. IEEE Trans Cybern 51(2):635–648 Yang X, He H, Zhong X (2021) Approximate dynamic programming for nonlinear-constrained optimizations. IEEE Trans Cybern 51(5):2419–2432 Yang X, Liu D, Wei Q (2013) Neuro-optimal control of unknown nonaffine nonlinear systems with saturating actuators. In: Proceedings of IFAC international conference on intelligent control and automation science, vol 46(20), pp 574–579 Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern Part B Cybern 38(4):937–942 Zhang H, Zhang K, Cai Y (2019) Adaptive fuzzy fault-tolerant tracking control for partially unknown systems with actuator faults via integral reinforcement learning method. IEEE Trans Fuzzy Syst 27(10):1986–1998 Zhu Y, Zhao D, Li X (2016) Using reinforcement learning techniques to solve continuoustime non-linear optimal tracking problem without system dynamics. IET Control Theory Appl 10(12):1339–1347 Zhu Y, Zhao D, He H (2020) Invariant adaptive dynamic programming for discrete-time optimal control. IEEE Trans Syst Man Cybern Syst 50(11):3959–3971

Chapter 6

Data-Driven Optimal Trajectory Tracking via a Novel Self-Learning Approach

Abstract In this chapter, the data-based optimal tracking control approach is developed by involving the iterative dual heuristic dynamic programming (DHP) algorithm for nonlinear systems. According to the iterative DHP method, the updating formula of the costate function and a new optimal control policy for unknown nonlinear systems are provided to solve the optimal tracking control problem. Moreover, three neural networks are used to facilitate the implementation of the proposed algorithm. The unknown nonlinear system dynamics is first identified by establishing a model neural network. To improve the identification precision, biases are introduced into the model network. The model network with biases is trained by the gradient descent algorithm, where the weights and biases across all layers are updated. The uniformly ultimately bounded stability with a proper learning rate is analyzed by using the Lyapunov approach. The effectiveness of the proposed method is demonstrated through a simulation example.

Keywords Adaptive dynamic programming · Data-based optimal tracking control · Lyapunov method · Neural network · Uniformly ultimately bounded stability · Value iteration

6.1 Introduction

Adaptive/approximate dynamic programming (ADP) methods have enjoyed rather remarkable successes in a wide range of fields involving continuous-time (Liu et al. 2014, 2021; Zhang et al. 2013; Wang et al. 2021) and discrete-time (Al-Tamimi et al. 2008; Liu and Wei 2013; Ha et al. 2021a) optimal control, robust control (Liu et al. 2015; Wang et al. 2017a), and so forth. Also, the ADP technique has been applied to various engineering problems such as power systems (Liu et al. 2018; Foruzan et al. 2018; Wang et al. 2017b; Wei et al. 2017), sensor networks (Heydari 2019), etc. There are generally two learning algorithms in the iterative ADP framework, namely, policy iteration (Liu and Wei 2014; Bertsekas 2017) and value iteration (Liu et al. 2017; Heydari 2018a, b). In the value iteration scheme, the convergence of the iterative value function and the stability of the closed-loop systems using iterative control policies have always been a focus of close attention. Al-Tamimi et al. (2008), Lincoln and Rantzer (2006), and Rantzer (2006) gave a series of convergence and monotonicity properties of the iterative value function with the undiscounted cost and pointed out that, as the number of iteration steps increases, the iterative value function converges to the optimal value function. On this foundation, Wang et al. (2012b) developed the convergence properties of the discounted cost and implemented the iterative ADP algorithm by using globalized dual heuristic programming (GDHP). In (Heydari 2018b), considering the approximation errors, Heydari proposed a stabilizing value iteration algorithm and provided the stability analysis of controlled systems using the control policy derived from this algorithm. In (Postoyan et al. 2017; Granzotto et al. 2020), inspired by ADP and model predictive control, Postoyan, Granzotto, and their coauthors analyzed the stability of nonlinear plants for the infinite-horizon and finite-horizon discounted optimal control problems. The robust stability could also be ensured under some additional conditions in (Postoyan et al. 2017), which implied that near-optimal control inputs could also stabilize the controlled plant. In (Granzotto et al. 2020), some stronger stability properties were obtained under certain initial conditions. For the event-based ADP scheme, Wang et al. (2020) studied self-learning optimal regulation based on dual heuristic programming (DHP) and established the input-to-state stability analysis under the event-driven formulation. The works on unknown nonlinear systems are extensive (Wang et al. 2012b; Luo et al. 2017; Ni et al. 2015). In (Luo et al. 2017), a model-free policy gradient ADP technique was investigated to improve the control policy by using online and offline data samples. Ni et al. (2015) developed a model-free DHP method to avoid the extra computational cost caused by offline training of the model network. In general, for nonlinear plants with unknown system functions, three neural networks are adopted to constitute the ADP scheme. Therefore, the stability of the neural-network-based controller is significant. Some papers (Mu et al. 2017, 2018; Zhang et al. 2013) have rigorously analyzed the stability of weight estimation errors. Meanwhile, for affine systems, the H∞ state feedback controller was designed and the stability analysis was elaborated in (Zhang et al. 2014). Nevertheless, these studies focused on the weights from the hidden layer to the output layer; in the training process, the weights from the input layer to the hidden layer are fixed, which might weaken the performance of the neural network. With this operation, the fitting precision depends largely on the initialization of the weights. Sokolov et al. (2015) made a tremendous contribution in this respect: they extended previous results to the case of multi-layer neural networks with updates across all layers. Additionally, the proposed control method based on action-dependent heuristic dynamic programming is uniformly ultimately bounded (UUB) under specific conditions. Motivated by the previous works, we focus on the stability of the model network, where the weights and biases across all layers are updated by the gradient descent algorithm. The UUB stability of the weight and bias estimation errors can be guaranteed by designing different Lyapunov functions for the weights and biases of different layers.
On the other hand, optimal tracking control is a significant topic in the control community. Its objective is to make the controlled systems track desired trajectories. Some works (Kiumarsi and Lewis 2015; Luo et al. 2016; Qin et al. 2014; Wang et al. 2012a; Zhang et al. 2008) have been reported to solve the tracking control problem by using ADP in the last decades. For the affine nonlinear system x(k + 1) = f(x(k)) + g(x(k))μx(x(k)), where x(k) is the system state, μx(x(k)) is the control input, and f(·) and g(·) are system functions, there exist abundant works to solve the optimal tracking control problem. The heuristic dynamic programming (HDP) algorithm was employed to make discrete-time nonlinear affine systems track desired trajectories in (Zhang et al. 2008), and a rigorous convergence analysis was provided. Additionally, Wang et al. (2012a) investigated the finite-horizon neuro-optimal tracking control by transforming the controlled affine systems into augmented systems. However, these two methods (Wang et al. 2012a; Zhang et al. 2008) need to establish a model of the error dynamics, which reduces the accuracy of the model. Moreover, the system functions f(·) and g(·) need to be known to compute the steady control input μd(d(k)) corresponding to the reference trajectory, such as μd(d(k)) = g⁺(d(k))(d(k + 1) − f(d(k))), where d(k) is the desired trajectory and g⁺(·) is the generalized inverse of g(·). Later, for affine systems with input constraints, the policy evaluation and the policy improvement were applied to solve the tracking problem in (Kiumarsi and Lewis 2015). For continuous-time nonlinear systems, by using reinforcement learning, Modares and Lewis (2014) trained the learning algorithm to learn the optimal tracking control input and studied the convergence and stability problems of the whole system. A model-free optimal tracking controller was designed by introducing reinforcement learning in (Luo et al. 2016), which learns the optimal tracking control policy from real system data samples. Nevertheless, that algorithm needs to be given an initial admissible control policy, and a series of activation functions in the neural networks needs to be manually designed. In this chapter, we focus on these difficulties and solve the data-based optimal tracking control problem (Ha et al. 2020a, 2021b). Furthermore, some superior results are obtained.

6.2 Problem Description

Consider the following nonlinear system:

x(k + 1) = F(x(k), u(k)), k ∈ N,    (6.1)

where x(k) ∈ Rⁿ and u(k) ∈ Rᵐ are the n-dimensional state vector and the m-dimensional control vector, respectively, N = {0, 1, 2, . . . }, and F(·) is the nonlinear system function, differentiable with respect to its arguments. For the optimal tracking control problem, the objective is to obtain an optimal state feedback control policy μx*(x(k)) such that the nonlinear system (6.1) tracks a desired trajectory. Next, we define the desired trajectory in the following form:

d(k + 1) = κ(d(k)),    (6.2)


where d(k + 1) ∈ Rⁿ, and κ(·) : Rⁿ → Rⁿ is a differentiable function of d(k). Define the tracking error vector as follows:

e(k) = x(k) − d(k).    (6.3)

Next, the steady control corresponding to the desired trajectory is defined as μd(d(k)). It is assumed that this control input exists and satisfies

d(k + 1) = F(d(k), μd(d(k))).    (6.4)

μd(d(k)) can be obtained by solving (6.4). In order to facilitate the analysis, we assume that μd(d(k)) can be denoted as follows:

μd(d(k)) = ϕ(d(k)).    (6.5)

In other words, μd(d(k)) is the solution of (6.4), and it can be obtained by the analytic method or by various numerical methods. Then, we define a new control input as

μe(e(k)) = μx(x(k)) − μd(d(k)).    (6.6)

According to (6.1)–(6.6), we obtain a new augmented system as

e(k + 1) = F(e(k) + d(k), μe(e(k)) + ϕ(d(k))) − κ(d(k)),
d(k + 1) = κ(d(k)).    (6.7)

Therefore, the augmented system (6.7) can be rewritten as

X(k + 1) = ℱ(X(k), μe(e(k))),    (6.8)

where ℱ : R²ⁿ → R²ⁿ, and X(k) = [eᵀ(k), dᵀ(k)]ᵀ and μe(e(k)) are the 2n-dimensional state vector and the m-dimensional control input of the new system, respectively. Since the functions F(·) and κ(·) are differentiable, ℱ is also differentiable in its arguments. It is assumed that the system (6.8) is controllable on the set Ω ⊂ R²ⁿ and ℱ is Lipschitz continuous on Ω. In order to solve the optimal tracking control problem, we define the following performance index and need to find a control input sequence that minimizes it:

J(X(k), μe(e(k))) = Σ_{l=k}^{∞} U(X(l), μe(e(l))),    (6.9)


where U(X(l), μe(e(l))) is the positive definite utility function and satisfies U(0, 0) = 0. Inspired by the works (Kiumarsi and Lewis 2015; Wang et al. 2012a; Zhang et al. 2008), the utility function is selected as follows:

U(X(l), μe(e(l))) = [eᵀ(l) dᵀ(l)] [Q 0; 0 0] [e(l); d(l)] + μeᵀ(e(l))Rμe(e(l))
 = eᵀ(l)Qe(l) + μeᵀ(e(l))Rμe(e(l))
 = U(e(l), μe(e(l))),    (6.10)

where Q ∈ R^{n×n} and R ∈ R^{m×m} are symmetric positive definite matrices. In view of the form of (6.10), the performance index of the augmented system can be simplified as

J(e(k), μe(e(k))) = Σ_{l=k}^{∞} [eᵀ(l)Qe(l) + μeᵀ(e(l))Rμe(e(l))].    (6.11)

Therefore, the main part of system (6.8) can be considered as

e(k + 1) = G(e(k), μe(e(k))).    (6.12)

Suppose that G(0, 0) = 0 holds. Furthermore, the performance index is rewritten as

J(e(k), μe(e(k))) = eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + Σ_{l=k+1}^{∞} [eᵀ(l)Qe(l) + μeᵀ(e(l))Rμe(e(l))]
 = eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + J(e(k + 1), μe(e(k + 1))).    (6.13)

According to Bellman's optimality principle, the optimal value function J* satisfies the following equation:

J*(e(k)) = min_{μe(e(k))} {U(e(k), μe(e(k))) + J*(e(k + 1))}.    (6.14)

The optimal control policy μe*(e(k)) should satisfy

μe*(e(k)) = arg min_{μe(e(k))} {U(e(k), μe(e(k))) + J*(e(k + 1))}.    (6.15)

Then, the optimal tracking control of the original system is given by

μx*(x(k)) = μd(d(k)) + μe*(e(k)),    (6.16)


where μd(d(k)) is obtained from Eq. (6.5). Thus, the tracking problem for the nonlinear system (6.1) is transformed into an optimal regulation problem. Note that the tracking performance of the control policy μx*(·) is also determined by the steady control corresponding to the desired trajectory.
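To make the transformation concrete, the following minimal sketch assembles the tracking error (6.3), one step of the augmented system (6.7), and the control composition (6.16). The plant F, the reference map κ, and the steady-control map ϕ used here are simple placeholders chosen only so that (6.4) holds; they are not the systems studied later in this chapter.

```python
import numpy as np

def F(x, u):                 # placeholder plant x(k+1) = F(x(k), u(k))
    return 0.9 * x + np.array([0.0, 0.1]) * u

def kappa(d):                # placeholder reference dynamics d(k+1) = kappa(d(k))
    return 0.9 * d

def phi(d):                  # steady control mu_d(d(k)) solving (6.4);
    return 0.0               # zero works here because F(d, 0) = kappa(d)

def augmented_step(e, d, mu_e):
    """One step of the augmented system (6.7)."""
    e_next = F(e + d, mu_e + phi(d)) - kappa(d)
    d_next = kappa(d)
    return e_next, d_next

def tracking_control(e, d, mu_e_policy):
    """Optimal tracking control of the original system, cf. (6.16)."""
    return phi(d) + mu_e_policy(e)
```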

6.3 The Optimal Tracking Control Based on the Iterative DHP Algorithm

This section includes two subsections. In the first subsection, the value iteration algorithm is elaborated and a new optimal control policy for nonlinear systems is obtained. Then, the novel iterative DHP algorithm for nonlinear tracking systems is derived in the second subsection.

6.3.1 Derivation of the Iterative ADP Algorithm

Since it is difficult to solve the Bellman equation directly, we present the value iteration algorithm to obtain its numerical solution. The value function is initialized as J0(·) = 0. The corresponding control input is denoted as follows:

μe0(e(k)) = arg min_{μe(e(k))} {eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + J0(e(k + 1))}.    (6.17)

Furthermore, the value function is updated as follows:

J1(e(k)) = min_{μe(e(k))} {eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + J0(e(k + 1))}
 = eᵀ(k)Qe(k) + μe0ᵀ(e(k))Rμe0(e(k)) + J0(e(k + 1)).    (6.18)

Therefore, the value iteration algorithm can be implemented by iterating between the control policy improvement

μei(e(k)) = arg min_{μe(e(k))} {eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + Ji(e(k + 1))}    (6.19)

and the value function update


Ji+1(e(k)) = min_{μe(e(k))} {eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + Ji(e(k + 1))}
 = eᵀ(k)Qe(k) + μeiᵀ(e(k))Rμei(e(k)) + Ji(G(e(k), μei(e(k)))),    (6.20)

where i = 1, 2, . . . . It is significant to obtain the solution of (6.19) for nonlinear systems. For affine systems, the optimal control law can be obtained by letting ∂[eᵀ(k)Qe(k) + μeᵀ(e(k))Rμe(e(k)) + Ji(e(k + 1))]/∂μe(e(k)) = 0. However, this is not available for nonaffine systems. Therefore, it is necessary to develop a novel method to obtain the optimal control policy, one that is available for both affine and nonaffine systems. We use the gradient-based method to find μei(e(k)) and minimize Ji+1(e(k)). First, we randomly initialize μei(e(k)) as μei^(0)(e(k)). Then, the updating rule of μei(e(k)) is a gradient-based adaptation rule formulated by

μei^(j+1)(e(k)) = μei^(j)(e(k)) − αμ ∂[eᵀ(k)Qe(k) + μeiᵀ(e(k))Rμei(e(k))]/∂μei(e(k)) − αμ ∂Ji(e(k + 1))/∂μei(e(k))
 = μei^(j)(e(k)) − 2αμ Rμei^(j)(e(k)) − αμ (∂e(k + 1)/∂μei^(j)(e(k)))ᵀ (∂Ji(e(k + 1))/∂e(k + 1)),    (6.21)

where αμ ∈ (0, 1) is the learning rate with respect to μe(e(k)) and j, unlike i, is the iteration index of (6.21). Next, by introducing a theorem, we discuss how to obtain ∂e(k + 1)/∂μei^(j)(e(k)) in (6.21) without modeling the error dynamics.

Theorem 6.1 Define the control input of the original system as μx(x(k)) in (6.1) and the control input of the new system as μe(e(k)) in (6.12). Then, ∂e(k + 1)/∂μe(e(k)) and ∂e(k + 1)/∂e(k) satisfy the following equations:

∂e(k + 1)/∂μe(e(k)) = ∂x(k + 1)/∂μx(x(k)),  ∂e(k + 1)/∂e(k) = ∂x(k + 1)/∂x(k).    (6.22)

Proof According to (6.4), ϕ(d(k)) is related to d(k) and the system function F(·). Also, according to the augmented system (6.7), the term ∂e(k + 1)/∂μe(e(k)) satisfies the following condition:

∂e(k + 1)/∂μe(e(k)) = ∂(F(e(k) + d(k), μe(e(k)) + ϕ(d(k))) − κ(d(k)))/∂μe(e(k)) = ∂F(e(k) + d(k), μe(e(k)) + ϕ(d(k)))/∂(μe(e(k)) + ϕ(d(k))).    (6.23)


On the other hand, based on the original system (6.1), we have

∂x(k + 1)/∂μx(x(k)) = ∂F(x(k), μx(x(k)))/∂μx(x(k)) = ∂F(e(k) + d(k), μe(e(k)) + ϕ(d(k)))/∂(μe(e(k)) + ϕ(d(k))).    (6.24)

Then, ∂e(k + 1)/∂μe(e(k)) = ∂x(k + 1)/∂μx(x(k)) holds. Similarly, ∂e(k + 1)/∂e(k) = ∂x(k + 1)/∂x(k) also holds. The proof is completed. □

According to (6.21) and Theorem 6.1, the updating rule of μei(e(k)) can be rewritten as follows:

μei^(j+1)(e(k)) = μei^(j)(e(k)) − 2αμ Rμei^(j)(e(k)) − αμ (∂x(k + 1)/∂(μei^(j)(e(k)) + ϕ(d(k))))ᵀ (∂Ji(e(k + 1))/∂e(k + 1)).

6.3.2 Derivation of the Iterative DHP Algorithm First, we assume that the value function Ji (e(k)) is smooth so that ∂ Ji (e(k))/∂e(k) exists. According to (6.25), we find that the control law μei (e(k)) at each step of iteration has to be computed by ∂ Ji (e(k + 1))/∂e(k + 1), which is not an easy task. Therefore, in the following, we will present the iterative DHP algorithm to implement the iterative ADP algorithm. Define the costate function as follows: λi (e(k)) =

∂ Ji (e(k)) . ∂e(k)

(6.26)

First, we start with an initial costate function λ0 (·) = 0. Then, for λi+1 (e(k)) = ∂ Ji+1 (e(k))/∂e(k), according to (6.20), the following equation can be deduced:

6.4 Data-Based Iterative DHP Implementation

155

  ∂U e(k), μei (e(k)) ∂ Ji (e(k + 1)) λi+1 (e(k)) = + ∂e(k) ∂e(k)  T ∂μei (e(k)) ∂μTei (e(k))Rμei (e(k)) ∂eT (k)Qe(k) = + ∂e(k) ∂e(k) ∂μei (e(k)) T  ∂e(k + 1) ∂ Ji (e(k + 1)) + ∂e(k) ∂e(k + 1) T  T  ∂e(k + 1) ∂ Ji (e(k + 1)) ∂μei (e(k)) + ∂e(k) ∂μei (e(k)) ∂e(k + 1) T  ∂μei (e(k)) Rμei (e(k)) = 2Qe(k) + 2 ∂e(k) T  ∂e(k + 1) ∂μei (e(k)) ∂e(k + 1) + + λi (e(k + 1)). ∂e(k) ∂μei (e(k)) ∂e(k)

(6.27)

Next, using Theorem 6.1, we eventually obtain  λi+1 (e(k)) = 2Qe(k) + 2  +

∂μei (e(k)) ∂e(k)

T Rμei (e(k))

∂ x(k + 1) ∂μei (x(k)) ∂ x(k + 1) + ∂ x(k) ∂μxi (x(k)) ∂e(k)

T λi (e(k + 1)).

(6.28)

Therefore, in the iterative DHP scheme, the costate function sequence {λi } and the control sequence {μei } are updated by implementing the iteration between (6.25) and (6.28). From (6.25) and (6.28), the control policy can directly be computed by the costate function and the costate function can be obtained by solving the system model rather than the error dynamic model.

6.4 Data-Based Iterative DHP Implementation Three subsections are included in this section, namely, the model neural network, the critic neural network, and the action neural network. First, the neuro-identifier, i.e., model network, is used to estimate the nonlinear dynamics. The stability analysis of the neuro-identifier with reconstruction errors is provided. Second, the training process of the critic network is shown. Finally, the construction of the action network is elaborated in the last subsection. The flowchart of the proposed algorithm is displayed in Fig. 6.1.

156

6 Data-Driven Optimal Trajectory Tracking via a Novel Self-Learning Approach

Fig. 6.1 The flowchart of the optimal tracking control algorithm

6.4.1 Neuro-Identifier for Estimation of Nonlinear Dynamics Since the nonlinear plant (6.1) is unknown, in this section, a model network is established to identify the system dynamics and the stability analysis of the model network with reconstruction errors is discussed. The input–output data samples generated by (6.1) are used to train the model network. We construct a three-layer neural network with an N -neuron hidden layer. The estimation of the state vector is formulated as   x(k ˆ + 1) = Wˆ 2T (k)ρ Wˆ 1T (k)xm (k) + bˆ1 (k) + bˆ2 (k),

(6.29)

 T where xm (k) = x T (k), u T (k) ∈ R(n+m) is the input of the model network, Wˆ 1 (k) ∈ R(n+m)×N and Wˆ 2 (k) ∈ R N ×n are weight matrices, bˆ1 (k) ∈ R N and bˆ2 (k) ∈ Rn are bias or threshold vectors, and ρ(·) is the activation function. The activation function

6.4 Data-Based Iterative DHP Implementation

157

is assumed to be bounded, i.e., ρ(·) ≤ ρ, ¯ where ρ¯ is the upper boundary. Here, we choose tanh(·) as the activation function. Remark 6.1 In our study, we find that the performance of the model network directly determines the effectiveness of the controller. Therefore, it is necessary to introduce the biases into the model network to improve the approximation performance of the neural network. In general, an appropriate number of hidden layer neurons need to be selected. Too many hidden layer neurons will lead to overfitting while too few will result in underfitting. Note that the selected activation function and its derivative need to be bounded. For convenience of analysis, the hyperbolic tangent function is selected in this chapter. Also, sigmoid function satisfies this condition. According to (6.29), the input of the hidden layer can be regarded as follows:   xm (k)  . Wˆ 1T (k)xm (k) + bˆ1 (k) = Wˆ 1T (k), bˆ1 (k) 1

(6.30)

Define Wˆ 1 (k) = [Wˆ 1 (k), bˆ1T (k)]T ∈ R(n+m+1)×N and x m (k) = [xmT (k), 1]T ∈ Rn+m+1 . Therefore, the output of the model network can be rewritten as T x(k ˆ + 1) = Wˆ 2T (k)ρ(Wˆ 1 (k)x m (k)) + bˆ2 (k).

(6.31)

Define the estimation error as x(k ˜ + 1) = x(k ˆ + 1) − x(k + 1),

(6.32)

where x(k + 1) is generated by the unknown system. Assume that the optimal weight matrices and bias vector as W ∗1 , W2∗ , and b2∗ , respectively. Then the system state vector can be reconstructed as ∗ x(k + 1) = W2∗T ρ(W ∗T 1 x m (k)) + b2 + ξ(k),

(6.33)

where ξ(k) is the reconstruction error. ρ ∗ (k) is used to denote ρ(W ∗T 1 x m (k)). The training objective is to minimize the following performance measure: E (k + 1) =

1 T x˜ (k + 1)x(k ˜ + 1). 2

(6.34)

Therefore, the gradient-based updating rules are used to tune weight matrices and the bias vector ∂E (k + 1) Wˆ 1 (k + 1) = Wˆ 1 (k) − ηm ∂ Wˆ 1 (k) = Wˆ 1 (k) − ηm x m (k)(Ψ (k)Wˆ 2 (k)x(k ˜ + 1))T ,

(6.35)

158

6 Data-Driven Optimal Trajectory Tracking via a Novel Self-Learning Approach

∂E (k + 1) Wˆ 2 (k + 1) = Wˆ 2 (k) − ηm ∂ Wˆ 2 (k) = Wˆ 2 (k) − ηm ρ(k)x˜ T (k + 1),

(6.36)

∂E (k + 1) bˆ2 (k + 1) = bˆ2 (k) − ηm ∂ bˆ2 (k) ˆ = b2 (k) − ηm x(k ˜ + 1),

(6.37)

where ηm ∈ (0, 1) is the learning rate, Ψ(k) = diag{1 − ρ1²(k), 1 − ρ2²(k), . . . , 1 − ρN²(k)} in (6.35), and ρ(k) = ρ(W̄̂1ᵀ(k)x̄m(k)) is the output of the hidden layer.
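A minimal NumPy sketch of one identification step implementing (6.31) together with the update rules (6.35)–(6.37) is given below, assuming the tanh activation used in the text; the class and variable names are illustrative.

```python
import numpy as np

class ModelNetwork:
    """Neuro-identifier x_hat(k+1) = W2' tanh(W1bar' xm_bar) + b2, cf. (6.31).

    W1bar stacks the input-to-hidden weights and the hidden bias as in (6.30).
    """
    def __init__(self, n, m, N, rng, eta=0.05):
        self.W1 = 0.1 * rng.standard_normal((n + m + 1, N))   # W1bar
        self.W2 = 0.1 * rng.standard_normal((N, n))
        self.b2 = np.zeros(n)
        self.eta = eta                                         # eta_m

    def predict(self, x, u):
        xm = np.concatenate([x, np.atleast_1d(u), [1.0]])      # x_bar_m(k)
        rho = np.tanh(self.W1.T @ xm)                          # hidden output rho(k)
        return self.W2.T @ rho + self.b2, xm, rho

    def train_step(self, x, u, x_next):
        x_hat, xm, rho = self.predict(x, u)
        x_tilde = x_hat - x_next                    # estimation error (6.32)
        psi = 1.0 - rho ** 2                        # diagonal of Psi(k) for tanh
        # (6.35): W1bar update through the hidden layer (uses current W2)
        self.W1 -= self.eta * np.outer(xm, psi * (self.W2 @ x_tilde))
        # (6.36): W2 update
        self.W2 -= self.eta * np.outer(rho, x_tilde)
        # (6.37): output bias update
        self.b2 -= self.eta * x_tilde
        return 0.5 * x_tilde @ x_tilde              # identification error (6.34)
```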

Before the stability analysis, we introduce the following necessary assumption.

Assumption 6.1 It is assumed that $\boldsymbol{W}_1^*$, $W_2^*$, $b_2^*$, $\rho(k)$, and $\xi(k)$ are bounded such that
1. The optimal constant weight matrices and the bias vector are upper bounded by $\|\boldsymbol{W}_1^*\| \le \bar{W}_1$, $\|W_2^*\| \le \bar{W}_2$, and $\|b_2^*\| \le \bar{b}_2$, where $\bar{W}_1$, $\bar{W}_2$, and $\bar{b}_2$ are the corresponding upper bounds.
2. The output vector of the hidden layer, i.e., $\rho(k)$, is upper bounded by $\|\rho(k)\| \le \bar{\rho}$.
3. The reconstruction error vector is bounded by $\|\xi(k)\| \le \bar{\xi}$.

According to the universal approximation theorem, there exists a set of optimal weight matrices and bias vectors $S \triangleq \{(\boldsymbol{W}_1^*, W_2^*, b_2^*) \mid \boldsymbol{W}_1^* \in \mathbb{R}^{(n+m+1)\times N}, W_2^* \in \mathbb{R}^{N\times n}, b_2^* \in \mathbb{R}^{n}\}$ satisfying (6.33). The optimal weight matrices and the bias vector are constant. On the other hand, the selected activation function is bounded. Therefore, Assumption 6.1 is reasonable. Define the weight and bias estimation errors as
$$\tilde{\boldsymbol{W}}_1(k) = \hat{\boldsymbol{W}}_1(k) - \boldsymbol{W}_1^*, \tag{6.38a}$$
$$\tilde{W}_2(k) = \hat{W}_2(k) - W_2^*, \tag{6.38b}$$
$$\tilde{b}_2(k) = \hat{b}_2(k) - b_2^*. \tag{6.38c}$$

Next, we discuss the stability of the weight and bias estimation errors. A lemma is introduced before presenting the main theorem.

Lemma 6.1 Consider the positive definite Lyapunov function candidates $L_1(k) = \mathrm{tr}\{\tilde{\boldsymbol{W}}_1^T(k)\tilde{\boldsymbol{W}}_1(k)\}$, $L_2(k) = \frac{1}{\eta_m}\mathrm{tr}\{\tilde{W}_2^T(k)\tilde{W}_2(k)\}$, and $L_3(k) = \frac{1}{\eta_m}\mathrm{tr}\{\tilde{b}_2^T(k)\tilde{b}_2(k)\}$. Under Assumption 6.1, the first differences of $L_1(k)$, $L_2(k)$, and $L_3(k)$ satisfy the following equations:
$$\Delta L_1(k) = \eta_m^2\big\|\boldsymbol{x}_m(k)\big(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1)\big)^T\big\|^2 - 2\eta_m\boldsymbol{x}_m^T(k)\tilde{\boldsymbol{W}}_1(k)\Psi(k)\hat{W}_2(k)\tilde{x}(k+1), \tag{6.39}$$


$$\Delta L_2(k) = \eta_m\big\|\tilde{x}(k+1)\rho^T(k)\big\|^2 - 2\rho^T(k)\tilde{W}_2(k)\tilde{x}(k+1), \tag{6.40}$$
and
$$\Delta L_3(k) = \eta_m\big\|\tilde{x}(k+1)\big\|^2 - 2\tilde{x}^T(k+1)\tilde{b}_2(k), \tag{6.41}$$
where $\eta_m\tilde{x}^T(k+1)\hat{W}_2^T(k)\Psi^T(k)\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)$ in (6.39), $\rho^T(k)\tilde{W}_2(k)\tilde{x}(k+1)$ in (6.40), and $\tilde{x}^T(k+1)\tilde{b}_2(k)$ in (6.41) are scalars.

Proof According to (6.35) and (6.38a), the estimation error $\tilde{\boldsymbol{W}}_1(k+1)$ is formulated as

$$\tilde{\boldsymbol{W}}_1(k+1) = \hat{\boldsymbol{W}}_1(k+1) - \boldsymbol{W}_1^* = \tilde{\boldsymbol{W}}_1(k) - \eta_m\boldsymbol{x}_m(k)\big(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1)\big)^T. \tag{6.42}$$

Hence, according to the definition of $L_1(k)$, the first difference of $L_1(k)$ is computed by
$$\begin{aligned}
\Delta L_1(k) &= \mathrm{tr}\big\{\tilde{\boldsymbol{W}}_1^T(k+1)\tilde{\boldsymbol{W}}_1(k+1) - \tilde{\boldsymbol{W}}_1^T(k)\tilde{\boldsymbol{W}}_1(k)\big\} \\
&= \mathrm{tr}\Big\{\big[\tilde{\boldsymbol{W}}_1(k) - \eta_m\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big]^T\big[\tilde{\boldsymbol{W}}_1(k) - \eta_m\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big] - \tilde{\boldsymbol{W}}_1^T(k)\tilde{\boldsymbol{W}}_1(k)\Big\} \\
&= \mathrm{tr}\Big\{\big[\eta_m\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big]^T\big[\eta_m\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big] \\
&\qquad - \eta_m\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T - \eta_m\Psi(k)\hat{W}_2(k)\tilde{x}(k+1)\boldsymbol{x}_m^T(k)\tilde{\boldsymbol{W}}_1(k)\Big\}.
\end{aligned} \tag{6.43}$$

We use the properties of the matrix trace, i.e., $\mathrm{tr}\{A^T\} = \mathrm{tr}\{A\}$ and $\mathrm{tr}\{BCD\} = \mathrm{tr}\{DBC\}$, where $A_{r\times r}$, $B_{p\times q}$, $C_{q\times r}$, and $D_{r\times p}$ denote $r\times r$-, $p\times q$-, $q\times r$-, and $r\times p$-dimensional matrices, respectively. Then, considering these properties, it results in
$$\mathrm{tr}\big\{\eta_m\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big\} = \mathrm{tr}\big\{\eta_m\Psi(k)\hat{W}_2(k)\tilde{x}(k+1)\boldsymbol{x}_m^T(k)\tilde{\boldsymbol{W}}_1(k)\big\} = \eta_m\boldsymbol{x}_m^T(k)\tilde{\boldsymbol{W}}_1(k)\Psi(k)\hat{W}_2(k)\tilde{x}(k+1). \tag{6.44}$$

Therefore, Eq. (6.39) in Lemma 6.1 holds. Next, considering the estimation error $\tilde{W}_2(k+1)$ and according to (6.36) and (6.38b), we have
$$\tilde{W}_2(k+1) = \hat{W}_2(k+1) - W_2^* = \tilde{W}_2(k) - \eta_m\rho(k)\tilde{x}^T(k+1). \tag{6.45}$$


Then, the first difference of $L_2(k)$ is obtained as follows:
$$\begin{aligned}
\Delta L_2(k) &= \frac{1}{\eta_m}\mathrm{tr}\big\{\tilde{W}_2^T(k+1)\tilde{W}_2(k+1) - \tilde{W}_2^T(k)\tilde{W}_2(k)\big\} \\
&= \frac{1}{\eta_m}\mathrm{tr}\Big\{\big[\tilde{W}_2(k) - \eta_m\rho(k)\tilde{x}^T(k+1)\big]^T\big[\tilde{W}_2(k) - \eta_m\rho(k)\tilde{x}^T(k+1)\big] - \tilde{W}_2^T(k)\tilde{W}_2(k)\Big\} \\
&= \frac{1}{\eta_m}\mathrm{tr}\Big\{\big[\eta_m\rho(k)\tilde{x}^T(k+1)\big]^T\eta_m\rho(k)\tilde{x}^T(k+1) - \eta_m\tilde{W}_2^T(k)\rho(k)\tilde{x}^T(k+1) - \eta_m\tilde{x}(k+1)\rho^T(k)\tilde{W}_2(k)\Big\}.
\end{aligned} \tag{6.46}$$
Similarly, we have
$$\mathrm{tr}\big\{\tilde{W}_2^T(k)\rho(k)\tilde{x}^T(k+1)\big\} = \mathrm{tr}\big\{\tilde{x}(k+1)\rho^T(k)\tilde{W}_2(k)\big\} = \rho^T(k)\tilde{W}_2(k)\tilde{x}(k+1). \tag{6.47}$$
Combining (6.46) and (6.47), it leads to (6.40) in Lemma 6.1. Lastly, according to (6.37) and (6.38c), $\tilde{b}_2(k+1)$ is computed by
$$\tilde{b}_2(k+1) = \hat{b}_2(k+1) - b_2^* = \tilde{b}_2(k) - \eta_m\tilde{x}(k+1). \tag{6.48}$$

Substituting (6.48) into the Lyapunov function candidate $L_3(k)$, the first difference of $L_3(k)$ is obtained as
$$\begin{aligned}
\Delta L_3(k) &= \frac{1}{\eta_m}\mathrm{tr}\big\{\tilde{b}_2^T(k+1)\tilde{b}_2(k+1) - \tilde{b}_2^T(k)\tilde{b}_2(k)\big\} \\
&= \frac{1}{\eta_m}\mathrm{tr}\Big\{\big[\tilde{b}_2(k) - \eta_m\tilde{x}(k+1)\big]^T\big[\tilde{b}_2(k) - \eta_m\tilde{x}(k+1)\big] - \tilde{b}_2^T(k)\tilde{b}_2(k)\Big\} \\
&= \frac{1}{\eta_m}\mathrm{tr}\Big\{\eta_m^2\tilde{x}^T(k+1)\tilde{x}(k+1) - \tilde{b}_2^T(k)\eta_m\tilde{x}(k+1) - \eta_m\tilde{x}^T(k+1)\tilde{b}_2(k)\Big\} \\
&= \eta_m\big\|\tilde{x}(k+1)\big\|^2 - 2\tilde{x}^T(k+1)\tilde{b}_2(k).
\end{aligned} \tag{6.49}$$
Therefore, (6.41) in Lemma 6.1 holds and the proof is finished.



In what follows, a theorem is presented to demonstrate that the estimation errors are uniformly ultimately bounded (UUB).

Theorem 6.2 Let the weight matrices and the bias vector of the model network be tuned according to the gradient-based adaptation rules (6.35), (6.36), and (6.37). Then, $\tilde{x}(k)$ as well as $\tilde{\boldsymbol{W}}_1(k)$, $\tilde{W}_2(k)$, and $\tilde{b}_2(k)$ defined in (6.32), (6.38a), (6.38b), and (6.38c) are UUB, if the following condition is fulfilled:

$$0 < \eta_m < \min_k\frac{-\beta(k) + \sqrt{\beta^2(k) - 4\alpha(k)\gamma(k)}}{2\alpha(k)}, \tag{6.50}$$

where
$$\alpha(k) = \big\|\Psi(k)\hat{W}_2(k)\big\|^2\big\|\boldsymbol{x}_m(k)\big\|^2, \tag{6.51}$$
$$\beta(k) = \big\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\big\|^2\big\|\hat{W}_2(k)\big\|^2 + \big\|\rho(k)\big\|^2 + 1, \tag{6.52}$$
and
$$\gamma(k) = \frac{\big\|2W_2^{*T}\big(\rho(k) - \rho^*(k)\big) - \xi(k)\big\|}{\big\|\tilde{x}(k+1)\big\|} - 1. \tag{6.53}$$

Proof Define $L(k) = \tilde{x}^T(k)\tilde{x}(k) + L_1(k) + L_2(k) + L_3(k)$. Then, according to Lemma 6.1, the first difference of $L(k)$ satisfies the following equation:
$$\begin{aligned}
\Delta L(k) &= \tilde{x}^T(k+1)\tilde{x}(k+1) - \tilde{x}^T(k)\tilde{x}(k) + \Delta L_1(k) + \Delta L_2(k) + \Delta L_3(k) \\
&= \tilde{x}^T(k+1)\tilde{x}(k+1) - \tilde{x}^T(k)\tilde{x}(k) + \eta_m\|\tilde{x}(k+1)\|^2 - 2\tilde{x}^T(k+1)\tilde{b}_2(k) \\
&\quad + \eta_m^2\big\|\boldsymbol{x}_m(k)(\Psi(k)\hat{W}_2(k)\tilde{x}(k+1))^T\big\|^2 + \eta_m\big\|\tilde{x}(k+1)\rho^T(k)\big\|^2 \\
&\quad - 2\eta_m\boldsymbol{x}_m^T(k)\tilde{\boldsymbol{W}}_1(k)\Psi(k)\hat{W}_2(k)\tilde{x}(k+1) - 2\rho^T(k)\tilde{W}_2(k)\tilde{x}(k+1).
\end{aligned} \tag{6.54}$$
Since $\|\tilde{x}(k+1)\rho^T(k)\|^2 \le \|\tilde{x}(k+1)\|^2\|\rho(k)\|^2$, (6.54) satisfies the following inequality:
$$\begin{aligned}
\Delta L(k) &\le 2\tilde{x}^T(k+1)\tilde{x}(k+1) - \tilde{x}^T(k)\tilde{x}(k) - 2\tilde{x}^T(k+1)\tilde{W}_2^T(k)\rho(k) \\
&\quad - \big(1 - \eta_m\|\rho(k)\|^2 - \eta_m - \eta_m^2\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2\big)\|\tilde{x}(k+1)\|^2 \\
&\quad - 2\eta_m\tilde{x}^T(k+1)\hat{W}_2^T(k)\Psi^T(k)\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k) - 2\tilde{x}^T(k+1)\tilde{b}_2(k).
\end{aligned} \tag{6.55}$$

According to (6.31), (6.32), and (6.33), the first term in (6.55) satisfies
$$\begin{aligned}
\tilde{x}^T(k+1)\tilde{x}(k+1) &= \tilde{x}^T(k+1)\big(\hat{W}_2^T(k)\rho(k) + \hat{b}_2(k) - (W_2^{*T}\rho^*(k) + b_2^* + \xi(k))\big) \\
&= \tilde{x}^T(k+1)\big(\tilde{W}_2^T(k)\rho(k) + \tilde{b}_2(k) + W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\big) \\
&= \tilde{x}^T(k+1)\tilde{W}_2^T(k)\rho(k) + \tilde{x}^T(k+1)\big(W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\big) + \tilde{x}^T(k+1)\tilde{b}_2(k).
\end{aligned} \tag{6.56}$$


Moreover, the second-to-last term in (6.55) fulfills the following inequality:
$$\begin{aligned}
-2\eta_m\tilde{x}^T(k+1)\hat{W}_2^T(k)\Psi(k)\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k) &\le \eta_m\big\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\tilde{x}^T(k+1)\hat{W}_2^T(k)\big\|^2 + \eta_m\big\|\Psi(k)\big\|^2 \\
&\le \eta_m\big\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\big\|^2\big\|\tilde{x}(k+1)\big\|^2\big\|\hat{W}_2(k)\big\|^2 + \eta_m\big\|\Psi(k)\big\|^2.
\end{aligned} \tag{6.57}$$
Since $\tilde{x}^T(k)\tilde{x}(k)$ is a scalar and $\tilde{x}^T(k)\tilde{x}(k) = \mathrm{tr}\{\tilde{x}^T(k)\tilde{x}(k)\} = \|\tilde{x}(k)\|^2$, combining (6.55)–(6.57) results in
$$\begin{aligned}
\Delta L(k) &\le -\|\tilde{x}(k)\|^2 + \eta_m\|\Psi(k)\|^2 - \big(1 - \eta_m\|\rho(k)\|^2 - \eta_m - \eta_m\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\|^2\|\hat{W}_2(k)\|^2 \\
&\quad - \eta_m^2\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2\big)\|\tilde{x}(k+1)\|^2 + 2\tilde{x}^T(k+1)\big(W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\big).
\end{aligned} \tag{6.58}$$

According to (6.58), $\Delta L(k)$ satisfies
$$\begin{aligned}
\Delta L(k) &\le -\|\tilde{x}(k)\|^2 + \eta_m\|\Psi(k)\|^2 - \big(1 - \eta_m\|\rho(k)\|^2 - \eta_m - \eta_m\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\|^2\|\hat{W}_2(k)\|^2 \\
&\quad - \eta_m^2\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2\big)\|\tilde{x}(k+1)\|^2 + \big\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\big\|\,\|\tilde{x}(k+1)\| \\
&= -\|\tilde{x}(k)\|^2 + \eta_m\|\Psi(k)\|^2 - \Big(1 - \eta_m\|\rho(k)\|^2 - \eta_m - \eta_m^2\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2 \\
&\quad - \eta_m\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\|^2\|\hat{W}_2(k)\|^2 - \frac{\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\|}{\|\tilde{x}(k+1)\|}\Big)\|\tilde{x}(k+1)\|^2.
\end{aligned} \tag{6.59}$$

To guarantee that the last term in (6.59) is negative, the learning rate needs to be chosen such that
$$1 - \eta_m\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\|^2\|\hat{W}_2(k)\|^2 - \eta_m^2\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2 - \eta_m\|\rho(k)\|^2 - \eta_m - \frac{\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\|}{\|\tilde{x}(k+1)\|} > 0. \tag{6.60}$$

Therefore, (6.60) can be regarded as a quadratic inequality in the learning rate $\eta_m$:
$$\|\Psi(k)\hat{W}_2(k)\|^2\|\boldsymbol{x}_m(k)\|^2\eta_m^2 + \big(\|\tilde{\boldsymbol{W}}_1^T(k)\boldsymbol{x}_m(k)\|^2\|\hat{W}_2(k)\|^2 + \|\rho(k)\|^2 + 1\big)\eta_m - \Big(1 - \frac{\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\|}{\|\tilde{x}(k+1)\|}\Big) < 0. \tag{6.61}$$
Here, define $\alpha(k)$, $\beta(k)$, and $\gamma(k)$ as given in (6.51), (6.52), and (6.53), where $\alpha(k) > 0$ and $\beta(k) > 1$. To ensure that the quadratic inequality is solvable, we let $\gamma(k) < 0$ such that
$$1 - \frac{\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\|}{\|\tilde{x}(k+1)\|} > 0. \tag{6.62}$$

In this way, $\beta^2(k) - 4\alpha(k)\gamma(k) > 0$ holds. Then, (6.62) holds if the following condition is fulfilled:
$$\|\tilde{x}(k+1)\| > \big\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\big\|. \tag{6.63}$$

According to Assumption 6.1, there exists an upper bound $\bar{\Theta}$ such that $\|2W_2^{*T}(\rho(k) - \rho^*(k)) - \xi(k)\| \le \bar{\Theta}$ holds. Therefore, the learning rate needs to satisfy (6.50). On the other hand, to guarantee that $\Delta L(k)$ is negative, $\tilde{x}(k)$ needs to satisfy the following inequality:
$$\|\tilde{x}(k)\| > \sqrt{\eta_m}\,\|\Psi(k)\|. \tag{6.64}$$
Since $\bar{\rho}$ is the upper bound of $\rho(k)$, it is not difficult to show that $\Psi(k)$ is bounded such that $\|\Psi(k)\| \le \bar{\Psi}$. Then, we define
$$M = \max\big\{\bar{\Theta}, \sqrt{\eta_m}\,\bar{\Psi}\big\}. \tag{6.65}$$
Therefore, if $\eta_m$ satisfies (6.50), then $\Delta L(k) < 0$ with $\|\tilde{x}(k)\| > M$.

This means that the state estimation error $\tilde{x}(k)$ and the weight and bias estimation errors $\tilde{\boldsymbol{W}}_1(k)$, $\tilde{W}_2(k)$, and $\tilde{b}_2(k)$ are UUB.

Remark 6.2 Here, we discuss the rationality of condition (6.50). According to the above analysis, we have $\alpha(k) > 0$, $\beta(k) > 0$, and $\gamma(k) < 0$. On the one hand, $\beta^2(k) - 4\alpha(k)\gamma(k) > \beta^2(k)$ holds. On the other hand, if $\|\hat{W}_2(k)\| \ne \infty$ in (6.51), then $\alpha(k) \ne \infty$ and $\min_k\big(-\beta(k) + \sqrt{\beta^2(k) - 4\alpha(k)\gamma(k)}\big)/(2\alpha(k))$ is greater than zero. Hence, we reasonably conclude that there exists a proper learning rate $\eta_m$ satisfying condition (6.50).

Since the system function is unknown, it is difficult to solve (6.4). Therefore, the expression of the model network is utilized to obtain $\mu_d(d(k))$, and (6.4) can be rewritten as follows:
$$\hat{d}(k+1) = \hat{W}_2^T(k)\rho\big(\hat{W}_1^T(k)d_m(k) + \hat{b}_1(k)\big) + \hat{b}_2(k), \tag{6.66}$$


where $d_m(k) = [d^T(k), \mu_d^T(d(k))]^T$. Then, various numerical methods can be applied to obtain the solution of (6.66). It is noteworthy that the model performance directly determines the accuracy of $\mu_d(d(k))$. Therefore, thresholds are added in the model network to improve the modeling precision.
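As a concrete illustration of such a numerical solution, the following is a minimal Python sketch. The chapter itself uses MATLAB's "fsolve"; here SciPy's `least_squares` is used as a stand-in (it also handles the case where the control dimension differs from the state dimension). The function and argument names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def steady_control(W1, b1, W2, b2, d_k, d_next, m):
    """Recover mu_d(d(k)) from the trained model network, i.e., solve (6.66)
    so that the predicted d_hat(k+1) matches the reference point d(k+1)."""
    def residual(u):
        d_m = np.concatenate([d_k, u])                  # d_m(k) = [d(k); mu_d(d(k))]
        d_hat = W2.T @ np.tanh(W1.T @ d_m + b1) + b2    # model-network output (6.66)
        return d_hat - d_next                           # drive prediction toward d(k+1)
    sol = least_squares(residual, x0=np.zeros(m))       # nonlinear least-squares solve
    return sol.x                                        # steady control estimate
```

Since the residual is evaluated through the trained network, the accuracy of the recovered steady control is limited by the modeling precision, which is exactly why the biases (thresholds) are retained in the model network.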

6.4.2 The Critic Network

According to the iterative DHP algorithm, the critic network is used to approximate the costate function. The input of the critic network is the tracking error vector $e(k)$. The output is denoted as
$$\hat{\lambda}_{i+1}(e(k)) = w_{c2}^T\delta\big(w_{c1}^Te(k)\big). \tag{6.67}$$

Define the approximation error as follows:
$$E_c = \frac{1}{2}\big(\hat{\lambda}_{i+1}(e(k)) - \lambda_{i+1}(e(k))\big)^T\big(\hat{\lambda}_{i+1}(e(k)) - \lambda_{i+1}(e(k))\big), \tag{6.68}$$

where $\lambda_{i+1}(e(k))$ is computed by (6.38b). Similarly, the updating rules of the weight matrices are given by using the gradient descent algorithm
$$w_{c1} := w_{c1} - \eta\frac{\partial E_c}{\partial w_{c1}}, \qquad w_{c2} := w_{c2} - \eta\frac{\partial E_c}{\partial w_{c2}}, \tag{6.69}$$

where η ∈ (0, 1) is the learning rate of the critic network.

6.4.3 The Action Network

In the action network, the output is formulated as
$$\hat{\mu}_{ei}(e(k)) = w_{a2}^T\delta\big(w_{a1}^Te(k)\big). \tag{6.70}$$

Define the approximation error in the following:
$$E_a = \frac{1}{2}\big(\hat{\mu}_{ei}(e(k)) - \mu_{ei}(e(k))\big)^T\big(\hat{\mu}_{ei}(e(k)) - \mu_{ei}(e(k))\big). \tag{6.71}$$


The optimal control policy $\mu_{ei}(e(k))$ at each iteration step can be obtained through iterative updating. According to (6.37) and (6.38a), the updating rule is given as follows:
$$\mu_{ei}^{(j+1)}(e(k)) = \mu_{ei}^{(j)}(e(k)) - 2\alpha_\mu R\mu_{ei}^{(j)}(e(k)) - \alpha_\mu\bigg[\frac{\partial x(k+1)}{\partial\big(\mu_{ei}(e(k)) + \varphi(d(k))\big)}\bigg]^T\lambda_i(e(k+1)). \tag{6.72}$$

The updating rules for the weights are obtained as
$$w_{a1} := w_{a1} - \zeta\frac{\partial E_a}{\partial w_{a1}}, \qquad w_{a2} := w_{a2} - \zeta\frac{\partial E_a}{\partial w_{a2}}, \tag{6.73}$$

where ζ ∈ (0, 1) is the learning rate of the action network.
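Both the critic and action networks above are single-hidden-layer approximators trained by plain gradient descent on a squared-error target, cf. (6.68)–(6.69) and (6.71)–(6.73). The following is a minimal Python sketch of that shared update step; it assumes $\delta(\cdot) = \tanh(\cdot)$ and uses illustrative variable names rather than the authors' implementation.

```python
import numpy as np

def net_forward(w1, w2, e):
    """Single-hidden-layer network y = w2^T * delta(w1^T * e), with delta = tanh (assumed)."""
    h = np.tanh(w1.T @ e)
    return w2.T @ h, h

def gradient_step(w1, w2, e, target, lr):
    """One gradient-descent step on 0.5 * ||y - target||^2.

    For the critic, target is the costate lambda_{i+1}(e(k)); for the action
    network, target is the iterated control mu_{ei}(e(k)) from (6.72)."""
    y, h = net_forward(w1, w2, e)
    err = y - target
    grad_w2 = np.outer(h, err)                          # dE/dw2
    grad_w1 = np.outer(e, (1.0 - h ** 2) * (w2 @ err))  # dE/dw1, backprop through tanh
    return w1 - lr * grad_w1, w2 - lr * grad_w2
```

Here `lr` plays the role of $\eta$ for the critic and $\zeta$ for the action network.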

6.5 Simulation Studies

In this section, the performance of the proposed tracking algorithm is demonstrated through a simulation example. The example, derived from (Luo et al. 2016) with some modifications, is considered as
$$x(k+1) = \begin{bmatrix} \tanh(x_1(k)) + 0.05\tanh(x_2(k)) \\ -0.3\tanh(x_1(k)) + \tanh(x_2(k)) \end{bmatrix} + \begin{bmatrix} 0 \\ \sin(u(k)) \end{bmatrix}, \tag{6.74}$$
where $x(0) = [0.9, -0.8]^T$. First, the model network needs to be constructed. In order to improve the accuracy of the model network, we set the number of hidden layer neurons as 40. In the implementation, we use the MATLAB neural network toolbox. The learning rate is $\eta_m = 0.02$. Additionally, the initial values of the weight matrices and threshold vectors are set by default. Then, 1000 data samples generated by the nonlinear system are used to train the model network for 150 iteration steps.

Fig. 6.2 The testing error of the model network (testing errors $e_{m1}$ and $e_{m2}$ over 500 testing data samples)

Figure 6.2 demonstrates the performance of the model network by using 500 data samples for testing. Here, the training and testing data samples are randomly selected in $x \in [-1, 1]$ and $\mu_x \in [-1, 1]$. In this example, we employ the performance error measure $e_m = \mathrm{abs}(\hat{x}(k+1) - x(k+1))$ to help us clearly verify the performance of the model network, where $\mathrm{abs}(\cdot)$ denotes the absolute values of the elements in $\hat{x}(k+1) - x(k+1)$. Next, we define the desired trajectory as follows:
$$d(k+1) = \begin{bmatrix} 0.9963d_1(k) + 0.0498d_2(k) \\ -0.2492d_1(k) + 0.9888d_2(k) \end{bmatrix}, \tag{6.75}$$

where $d(0) = [0.1, 0.2]^T$. We use the proposed method to make the unknown system (6.74) track the desired trajectory. For this purpose, we construct the critic and action networks with structures 2–8–2 and 2–8–1, respectively. The learning rates are set as $\eta = \zeta = 0.05$ and the weights are initialized randomly. In the DHP algorithm, the weight matrices in the utility function are chosen as $Q = 0.1I_{2\times 2}$ and $R = I_{1\times 1}$, and the learning rate in the updating rule of $\mu_{ei}(e(k))$ is also set as $\alpha_\mu = 0.05$. After learning is completed, the action network is regarded as the tracking controller to control the nonlinear system (6.74). The convergence curves of the state and the tracking error are shown in Figs. 6.3 and 6.4, respectively.

Fig. 6.3 The state trajectory of the original system and the reference trajectory

Fig. 6.4 The tracking error


Fig. 6.5 The control input of the original system and the augmented system

The results demonstrate that the state vector reaches the desired trajectory quickly, within 30 time steps. The curves of the control inputs corresponding to the original and augmented systems are shown in Fig. 6.5. Additionally, it is worth mentioning that $\mu_d(d(k))$ needs to be obtained in the learning process. According to the approach described in Sect. 6.4.1, the steady control corresponding to the desired trajectory can be computed from the expression of the model network. Here, we employ the function “fsolve” in MATLAB to solve for $\mu_d(d(k))$. The curves of $\mu_d(d(k))$ obtained by solving (6.4) and (6.66) are displayed in Fig. 6.6, which verifies the effectiveness of the developed approach.
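For readers who want to reproduce the closed-loop evaluation, the following is a small Python sketch of the benchmark plant (6.74) and the reference generator (6.75), with the initial conditions used above. The `policy` callable stands in for a trained action network and is an assumption, not part of the original MATLAB experiment.

```python
import numpy as np

def plant(x, u):
    """Benchmark nonlinear system (6.74)."""
    return np.array([np.tanh(x[0]) + 0.05 * np.tanh(x[1]),
                     -0.3 * np.tanh(x[0]) + np.tanh(x[1]) + np.sin(u)])

def reference(d):
    """Desired trajectory generator (6.75)."""
    A_d = np.array([[0.9963, 0.0498], [-0.2492, 0.9888]])
    return A_d @ d

def rollout(policy, steps=300):
    """Closed-loop evaluation of a trained tracking controller (hypothetical `policy`)."""
    x, d = np.array([0.9, -0.8]), np.array([0.1, 0.2])   # x(0) and d(0) from the example
    errors = []
    for _ in range(steps):
        u = float(policy(x - d))          # action network acts on the tracking error e(k)
        x, d = plant(x, u), reference(d)
        errors.append(x - d)
    return np.array(errors)               # tracking-error history for plotting
```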

6.6 Conclusion

In this chapter, for unknown nonlinear systems, a new updating rule of the costate function based on DHP is proposed. The model network is used to estimate the system state, and its UUB stability is discussed in detail, which extends the previous works to multi-layer neural networks with biases across all layers. Additionally, the steady control corresponding to the desired trajectory is obtained by solving the expression of the model network.

Fig. 6.6 The steady control input corresponding to the desired trajectory of the proposed approach and its real value

It should be noted that the developed approach to estimate $\mu_d(d(k))$ is actually a model-based technique. The precision of the model network and the effectiveness of the developed algorithm are demonstrated by the simulation results. It is also interesting to further generalize the proposed approach by considering the helpful results for event-triggered control (Wang et al. 2020; Ha et al. 2020b; Wang and Liu 2018), robust control (Wang et al. 2017a; Wang 2020), sensor networks (Ding et al. 2020, 2021), and wastewater treatment plants (Wang et al. 2022).

References

Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B (Cybern) 38(4):943–949
Bertsekas DP (2017) Value and policy iterations in optimal control and adaptive dynamic programming. IEEE Trans Neural Netw Learn Syst 28(3):500–509
Ding D, Wang Z, Han Q (2020) A set-membership approach to event-triggered filtering for general nonlinear systems over sensor networks. IEEE Trans Autom Control 65(4):1792–1799
Ding D, Han QL, Ge X, Wang J (2021) Secure state estimation and control of cyber-physical systems: a survey. IEEE Trans Syst Man Cybern Syst 51(1):176–190


Foruzan E, Soh L, Asgarpoor S (2018) Reinforcement learning approach for optimal distributed energy management in a microgrid. IEEE Trans Power Syst 33(5):5749–5758 Granzotto M, Postoyan R, Busoniu L, Nesic D, Daafouz J (2020) Finite-horizon discounted optimal control: stability and performance. IEEE Trans Autom Control 66(2):550–565 Ha M, Wang D, Liu D (2020a) Data-based nonaffine optimal tracking control using iterative DHP approach. In: Proceedings of 21st IFAC world congress, vol 53(2), pp 4246–4251 Ha M, Wang D, Liu D (2020b) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern Syst 50(9):3158–3168 Ha M, Wang D, Liu D (2021a) Generalized value iteration for discounted optimal control with stability analysis. Syst Control Lett 147:104847 Ha M, Wang D, Liu D (2021b) Neural-network-based discounted optimal control via an integrated value iteration with accuracy guarantee. Neural Netw 144:176–186 Heydari A (2018a) Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy. IEEE Trans Neural Netw Learn Syst 29(9):4522–4527 Heydari A (2018b) Stability analysis of optimal adaptive control using value iteration with approximation errors. IEEE Trans Autom Control 63(9):3119–3126 Heydari A (2019) Optimal codesign of control input and triggering instants for networked control systems using adaptive dynamic programming. IEEE Trans Ind Electron 66(1):482–490 Kiumarsi B, Lewis FL (2015) Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans Neural Netw Learn Syst 26(1):140–151 Lincoln B, Rantzer A (2006) Relaxed dynamic programming. IEEE Trans Autom Control 51(8):1249–1260 Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems. IEEE Trans Cybern 43(2):779–789 Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634 Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach. IEEE Trans Neural Netw Learn Syst 25(2):418–428 Liu D, Yang X, Wang D, Wei Q (2015) Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Trans Cybern 45(7):1372–1385 Liu D, Xu Y, Wei Q, Liu X (2018) Residential energy scheduling for variable weather solar energy based on adaptive dynamic programming. IEEE/CAA J Automatica Sinica 5(1):36–46 Liu D, Xue S, Zhao B, Luo B, Wei Q (2021) Adaptive dynamic programming for control: a survey and recent advances. IEEE Trans Syst Man Cybern Syst 51(1):142–160 Liu D, Wei Q, Wang D, Yang X, Li H (2017) Adaptive dynamic programming with applications in optimal control. Springer Press, Gewerbestrasse, Cham, Switzerland Luo B, Liu D, Huang T, Wang D (2016) Model-free optimal tracking control via critic-only Qlearning. IEEE Trans Neural Netw Learn Syst 27(10):2134–2144 Luo B, Liu D, Wu H, Wang D, Lewis FL (2017) Policy gradient adaptive dynamic programming for data-based optimal control. IEEE Trans Cybern 47(10):3341–3354 Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown constrainedinput systems using integral reinforcement learning. 
Automatica 50(7):1780–1792 Mu C, Wang D, He H (2017) Novel iterative neural dynamic programming for data-based approximate optimal control design. Automatica 81:240–252 Mu C, Wang D, He H (2018) Data-driven finite-horizon approximate optimal control for discretetime nonlinear systems using iterative HDP approach. IEEE Trans Cybern 48(10):2948–2961 Ni Z, He H, Zhong X, Prokhorov DV (2015) Model-free dual heuristic dynamic programming. IEEE Trans Neural Netw Learn Syst 26(8):1834–1839 Postoyan R, Busoniu L, Nesic D, Daafouz J (2017) Stability analysis of discrete-time infinitehorizon optimal control with discounted cost. IEEE Trans Autom Control 62(6):2736–2749


Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming. Int J Control 87:1000–1009 Rantzer A (2006) Relaxed dynamic programming in switching systems. IEE Proc Control Theory Appl 153(5):567–574 Sokolov Y, Kozma R, Werbos LD, Werbos PJ (2015) Complete stability analysis of a heuristic approximate dynamic programming control design. Automatica 59:9–18 Wang D (2020) Intelligent critic control with robustness guarantee of disturbed nonlinear plants. IEEE Trans Cybern 50(6):2740–2748 Wang D, Liu D (2018) Learning and guaranteed cost control with event-based adaptive critic implementation. IEEE Trans Neural Netw Learn Syst 29(12):6004–6014 Wang D, Ha M, Qiao J (2020) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Autom Control 65(3):1272–1279 Wang D, Qiao J, Cheng L (2021) An approximate neuro-optimal solution of discounted guaranteed cost control design. IEEE Trans Cybern. https://doi.org/10.1109/TCYB.2020.2977318 Wang D, Ha M, Qiao J (2022) Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, He H, Liu D (2017a) Adaptive critic nonlinear robust control: a survey. IEEE Trans Cybern 47(10):3429–3451 Wang D, He H, Zhong X, Liu D (2017b) Event-driven nonlinear discounted optimal regulation involving a power system application. IEEE Trans Ind Electron 64(10):8177–8186 Wang D, Liu D, Wei Q (2012a) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012b) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wei Q, Liu D, Liu Y, Song R (2017) Optimal constrained self-learning battery sequential management in microgrid via adaptive dynamic programming. IEEE/CAA J Automatica Sinica 4(2):168– 176 Zhang H, Cui L, Luo Y (2013) Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern 43(1):206– 216 Zhang H, Qin C, Jiang B, Luo Y (2014) Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Trans Cybern 44(12):2706–2718 Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern Part B Cybern 38(4):937–942 Zhu Y, Zhao D, Li X (2016) Using reinforcement learning techniques to solve continuoustime non-linear optimal tracking problem without system dynamics. IET Control Theory Appl 10(12):1339–1347

Chapter 7

Adaptive Critic with Improved Cost for Discounted Tracking and Novel Stability Proof

Abstract The core task of tracking control is to make the controlled plant track a desired trajectory. The traditional performance index used in previous studies cannot completely eliminate the tracking error as the number of time steps increases. In this chapter, a new cost function is introduced to develop the value-iteration-based adaptive critic framework to solve the tracking control problem. Unlike the regulator problem, the iterative value function of the tracking control problem cannot be regarded as a Lyapunov function. A novel stability analysis method is developed to guarantee that the tracking error converges to zero. The discounted iterative scheme under the new cost function for the special case of linear systems is elaborated. Finally, the tracking performance of the present scheme is demonstrated by numerical results and compared with that of the traditional approaches.

Keywords Adaptive critic design · Approximate dynamic programming · Discrete-time nonlinear systems · Reinforcement learning · Stability analysis · Tracking control · Value iteration

7.1 Introduction

Recently, adaptive critic methods, known as approximate or adaptive dynamic programming (ADP) (Liu et al. 2017; Liu and Wei 2013; Wang et al. 2022; Wei and Liu 2014; Ha et al. 2020b; Wang et al. 2022; Ha et al. 2020c; Wang et al. 2017), have enjoyed rather remarkable successes in a wide range of fields, including energy scheduling (Wei et al. 2017; Liu et al. 2018), orbital rendezvous (Heydari and Balakrishnan 2014; Heydari 2016), urban wastewater treatment (Wang et al. 2021), attitude-tracking control for hypersonic vehicles (Han et al. 2020), and so forth. Adaptive critic designs have close connections to both adaptive control and optimal control (Lewis and Vrabie 2009; Lewis et al. 2012). For nonlinear systems, it is difficult to obtain the analytical solution of the Hamilton–Jacobi–Bellman (HJB) equation. Iterative adaptive critic techniques, mainly including value iteration (VI) (Li and Liu 2012; Wei et al. 2016; Heydari 2018b) and policy iteration (PI) (Liu and Wei 2014; Bertsekas 2017; Wang and Zhong 2019), have been extensively
studied and successfully applied to iteratively approximate the numerical solution of the HJB equation (Liu et al. 2014; Al-Tamimi et al. 2008; Liu et al. 2015, 2021). In (Lincoln and Rantzer 2006), relaxed dynamic programming was introduced to overcome the “curse of dimensionality” problem by relaxing the demand for optimality. The upper and lower bounds of the iterative value function were first determined and the convergence of VI was revealed. For ensuring stability of the undiscounted VI, Heydari (2018a) developed a stabilizing VI algorithm initialized by a stabilizing policy. With this operation, the stability of the closed-loop system using the iterative control policy can be guaranteed. In (Wang et al. 2012), the convergence and monotonicity of discounted value function were investigated. The discounted iterative scheme was implemented by the neural-network-based globalized dual heuristic programming. Afterwards, Ha et al. (2021) discussed the effect of the discount factor on the stability of the iterative control policy. Several stability criteria with respect to the discount factor were established. Wang et al. (2020) developed an event-based adaptive critic scheme and presented an appropriate triggering condition to ensure the stability of the controlled plant. Optimal tracking control is a significant topic in the control community, which mainly aims at designing a controller to make the controlled plant track a reference trajectory. The literature on this problem is extensive (Zhang et al. 2008; Kiumarsi et al. 2014; Kiumarsi and Lewis 2015; Cao et al. 2019; Ha et al. 2020a, d) and reflects considerable current activity. Wang et al. (2012) developed a finite-horizon optimal tracking control strategy with convergence analysis for affine discrete-time systems by employing the iterative heuristic dynamic programming approach. For the linear quadratic output tracking control problem, Kiumarsi et al. (2015) presented a novel Bellman equation, which allows policy evaluation by using only the input, output, and reference trajectory data. Liu et al. (2018) concerned the robust optimal tracking control problem and introduced the adaptive critic design scheme into the controller to overcome the unknown uncertainty caused by multi-input multi-output discretetime systems. Luo et al. (2016) designed the model-free optimal tracking controller for nonaffine systems by using a critic-only Q-learning algorithm, while the proposed method needs to be given an initial admissible control policy. In (Li et al. 2021), a novel cost function was proposed to eliminate the tracking error. The convergence and monotonicity of the new value function sequence were investigated. On the other hand, some methods to solve the tracking problem for affine continuous-time systems can be found in (Modares and Lewis 2014; Qin et al. 2014; Kamalapurkar et al. 2015; Chen et al. 2019). For affine nonlinear partially unknown constraint-input systems, the integral reinforcement learning technique was studied to learn the solution to the optimal tracking control problem in (Modares and Lewis 2014), which does not require to identify the unknown systems. In general, the majority of adaptive critic tracking control methods need to solve the feedforward control input of the reference trajectory. Then, the tracking control problem can be transformed into a regulator problem. 
However, for some nonlinear systems, the feedforward control input corresponding to the reference trajectory might be nonexistent or not unique, which makes these methods unavailable. To avoid solving the feedforward control input, some tracking control approaches establish a performance index function of the tracking error and the control input. Then, the adaptive critic design is employed to minimize the performance index. With this operation, the tracking error cannot be eliminated because the minimization of the control input cannot always lead to the minimization of the tracking error. Moreover, as mentioned in (Ha et al. 2021), the introduction of the discount factor will affect the stability of the optimal control policy. If an inappropriate discount factor is selected, the stability of the closed-loop system cannot be guaranteed. Besides, unlike the regulator problem, the iterative value function of tracking control is not a Lyapunov function. Until now, few studies have focused on this problem. In this chapter, inspired by (Li et al. 2021), the new performance index is adopted to avoid solving the feedforward control and to eliminate the tracking error (Ha et al. 2022). Based on the new performance index function, a novel stability analysis method for the tracking control problem is established. It is guaranteed that the tracking error can be eliminated completely. Then, the effect of the approximation errors derived from the value function approximator on the stability of the controlled systems is discussed. In addition, for linear systems, the new VI-based adaptive critic scheme between the kernel matrix and the state feedback gain is developed.

Notations: Throughout this chapter, $\mathbb{N}$ and $\mathbb{N}^+$ are the sets of all nonnegative and positive integers, respectively, i.e., $\mathbb{N} = \{0, 1, 2, \ldots\}$ and $\mathbb{N}^+ = \{1, 2, \ldots\}$. $\mathbb{R}$ denotes the set of all real numbers and $\mathbb{R}^+$ is the set of nonnegative real numbers. $\mathbb{R}^n$ is the Euclidean space of all $n$-dimensional real vectors. $I_n$ and $0_{m\times n}$ represent the $n\times n$ identity matrix and the $m\times n$ zero matrix, respectively. $C \preceq 0$ means that the matrix $C$ is negative semidefinite.

7.2 Problem Formulation and VI-Based Adaptive Critic Scheme

Consider the following affine nonlinear systems given by
$$x_{k+1} = F(x_k) + G(x_k)u_k \tag{7.1}$$

with the state $x_k \in \mathbb{R}^n$ and input $u_k \in \mathbb{R}^m$, where $n, m \in \mathbb{N}^+$ and $k \in \mathbb{N}$. $F : \mathbb{R}^n \to \mathbb{R}^n$ and $G : \mathbb{R}^n \to \mathbb{R}^{n\times m}$ are the drift and control input dynamics, respectively. The tracking error is defined as
$$e_k = x_k - d_k, \tag{7.2}$$

where $d_k$ is the reference trajectory at stage $k$. Suppose that $d_k$ is bounded and satisfies
$$d_{k+1} = M(d_k), \tag{7.3}$$

where $M(\cdot)$ is the command generator dynamics. The objective of the tracking control problem is to design a controller to track the desired trajectory. Let $\boldsymbol{u}_k = \{u_k, u_{k+1}, \ldots\}$, $k \in \mathbb{N}$, be an infinite-length sequence of control inputs. Assume that there exists a control sequence $\boldsymbol{u}_0$ such that $e_k \to 0$ as $k \to \infty$. In general, in the previous works (Wang et al. 2012; Kiumarsi and Lewis 2015), it is assumed that there exists a feedforward control input $\eta_k$ satisfying $d_{k+1} = F(d_k) + G(d_k)\eta_k$ to achieve perfect tracking. However, for some nonlinear systems, the feedforward control input might be nonexistent. To avoid computing the feedforward control input $\eta_k$, the performance index (Kiumarsi and Lewis 2015; Kiumarsi et al. 2014) is generally designed as
$$\breve{J}(e_0, \boldsymbol{u}_0) = \sum_{p=0}^{\infty}\gamma^p U(e_p, u_p) = \sum_{p=0}^{\infty}\gamma^p\big(Q(e_p) + R(u_p)\big), \tag{7.4}$$

where $\gamma \in (0, 1]$ is the discount factor and $U(\cdot, \cdot)$ is the utility function. The terms $Q : \mathbb{R}^n \to \mathbb{R}^+$ and $R : \mathbb{R}^m \to \mathbb{R}^+$ in the utility function are positive definite continuous functions. With this operation, both the tracking error and the control input in the performance index (7.4) are minimized. To the best of our knowledge, the minimization of the control input does not always result in the minimization of the tracking error unless the reference trajectory is assumed to satisfy $d_k \to 0$ as $k \to \infty$. Such an assumption greatly reduces the application scope of the approach. Therefore, for the majority of desired trajectories, the tracking error cannot be eliminated (Li et al. 2021) by adopting the performance index (7.4). According to (Li et al. 2021), under the control sequence $\boldsymbol{u}_0$, the new discounted cost function for the initial tracking error $e_0$ and reference point $d_0$ is introduced as
$$J(e_0, d_0, \boldsymbol{u}_0) = \sum_{p=0}^{\infty}\gamma^p U(e_p, d_p, u_p) = \sum_{p=0}^{\infty}\gamma^p Q\big(F(e_p + d_p) + G(e_p + d_p)u_p - M(d_p)\big). \tag{7.5}$$

The adopted cost function (7.5) not only avoids computing the feedforward control input, but also eliminates the tracking error. The objective of this chapter is to find a feedback control policy $\pi(e, d)$, which both makes the dynamical system (7.1) track the reference trajectory and minimizes the cost function (7.5). According to (7.5), the state value function can be obtained as
$$V(e_k, d_k) = U(e_k, d_k, \pi(e_k, d_k)) + \gamma\sum_{p=k+1}^{\infty}\gamma^{p-k-1}U(e_p, d_p, \pi(e_p, d_p)) = U(e_k, d_k, \pi(e_k, d_k)) + \gamma V(e_{k+1}, d_{k+1}) \tag{7.6}$$

and its optimal value is $V^*(e_k, d_k)$. According to Bellman’s principle of optimality, the optimal value function for tracking control satisfies
$$V^*(e_k, d_k) = \min_{\pi}\big\{U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^*(e_{k+1}, d_{k+1})\big\}, \tag{7.7}$$

where $e_{k+1} = F(e_k + d_k) + G(e_k + d_k)\pi(e_k, d_k) - M(d_k)$. The corresponding optimal control policy is computed by
$$\pi^*(e_k, d_k) = \arg\min_{\pi}\big\{U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^*(e_{k+1}, d_{k+1})\big\}. \tag{7.8}$$

Therefore, the Hamiltonian function for tracking control can be obtained as
$$H(e_k, d_k) = U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^*(e_{k+1}, d_{k+1}) - V^*(e_k, d_k). \tag{7.9}$$

The optimal control policy $\pi^*$ satisfies the first-order necessary condition for optimality, i.e., $\partial H/\partial\pi = 0_m$ (Li et al. 2021). The gradient of (7.9) with respect to $\pi$ is given as
$$\frac{\partial U(e_k, d_k, \pi)}{\partial\pi} + \bigg(\frac{\partial e_{k+1}}{\partial\pi}\bigg)^T\frac{\partial\big(\gamma V^*(e_{k+1}, d_{k+1})\big)}{\partial e_{k+1}} = 0_m. \tag{7.10}$$

In general, the positive definite function $Q$ is chosen as the following quadratic form:
$$Q(e_k, d_k) = \big(F(e_k + d_k) + G(e_k + d_k)u_k - M(d_k)\big)^T Q\big(F(e_k + d_k) + G(e_k + d_k)u_k - M(d_k)\big), \tag{7.11}$$

where $Q \in \mathbb{R}^{n\times n}$ is a positive definite matrix. Then, the expression of the optimal control policy can be obtained by solving (7.10) (Li et al. 2021). Since it is difficult or impossible to directly solve the Bellman equation (7.7), iterative adaptive critic methods are widely adopted to obtain its numerical solution. Here, the VI-based adaptive critic scheme for the tracking control problem is employed to approximate the optimal value function $V^*(e_k, d_k)$ formulated in (7.7). The VI-based adaptive critic algorithm starts from a positive semidefinite continuous value function $V^{(0)}(e_k, d_k)$. Using the initial value function $V^{(0)}(e_k, d_k)$, the initial control policy is computed by
$$\pi^{(0)}(e_k, d_k) = \arg\min_{\pi}\big\{U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^{(0)}(e_{k+1}, d_{k+1})\big\}, \tag{7.12}$$

where $e_{k+1} = F(e_k + d_k) + G(e_k + d_k)\pi(e_k, d_k) - M(d_k)$. For the iteration index $\ell \in \mathbb{N}^+$, the VI-based adaptive critic algorithm is implemented between the value function update
$$V^{(\ell)}(e_k, d_k) = \min_{\pi}\big\{U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^{(\ell-1)}(e_{k+1}, d_{k+1})\big\} = U(e_k, d_k, \pi^{(\ell-1)}(e_k, d_k)) + \gamma V^{(\ell-1)}(e_{k+1}, d_{k+1}), \tag{7.13}$$


and the policy improvement
$$\pi^{(\ell)}(e_k, d_k) = \arg\min_{\pi}\big\{U(e_k, d_k, \pi(e_k, d_k)) + \gamma V^{(\ell)}(e_{k+1}, d_{k+1})\big\}. \tag{7.14}$$

In the iterative learning process, two sequences, namely, the iterative value function sequence $\{V^{(\ell)}\}$ and the corresponding control policy sequence $\{\pi^{(\ell)}\}$, are obtained. The convergence and monotonicity of the undiscounted value function sequence have been investigated in (Li et al. 2021). Inspired by (Li et al. 2021), the corresponding convergence and monotonicity properties of the discounted value function can be obtained.

Lemma 7.1 (cf. Li et al. 2021) Let the value function and control policy sequences be tuned by (7.13) and (7.14), respectively. For any $e_k$ and $d_k$, the value function starts from $V^{(0)}(\cdot, \cdot) = 0$.
1. The value function sequence $\{V^{(\ell)}(e_k, d_k)\}$ is monotonically nondecreasing, i.e., $V^{(\ell)}(e_k, d_k) \le V^{(\ell+1)}(e_k, d_k)$, $\ell \in \mathbb{N}$.
2. Suppose that there exists a constant $\kappa \in (0, \infty)$ such that $0 \le \gamma V^*(e_{k+1}, d_{k+1}) \le \kappa U(e_k, d_k, u_k)$, where $e_{k+1} = F(e_k + d_k) + G(e_k + d_k)u_k - M(d_k)$. Then, the iterative value function approaches the optimal value function in the following manner:
$$\bigg(1 - \frac{1}{(1 + \kappa^{-1})^{\ell}}\bigg)V^*(e_k, d_k) \le V^{(\ell)}(e_k, d_k) \le V^*(e_k, d_k). \tag{7.15}$$

It can be guaranteed that the discounted value function and the corresponding control policy sequences approximate the optimal value function and the optimal control policy as the number of iterations increases, i.e., $\lim_{\ell\to\infty}V^{(\ell)}(e_k, d_k) = V^*(e_k, d_k)$ and $\lim_{\ell\to\infty}\pi^{(\ell)}(e_k, d_k) = \pi^*(e_k, d_k)$. Note that the introduction of the discount factor will affect the stability of the optimal and iterative control policies. If the discount factor is chosen too small, the optimal control policy might be unstable. For the tracking control problem, such a policy $\pi^*(e_k, d_k)$ cannot make the controlled plant track the desired trajectory, and it is then meaningless to design various iterative methods to approximate the optimal control policy. On the other hand, for the regulation problem, the iterative value function is a Lyapunov function to judge the stability of the closed-loop systems (Wei et al. 2016). However, for the tracking control problem, the iterative value function cannot be regarded as a Lyapunov function since the iterative value function does not depend only on the tracking error $e$. Therefore, it is necessary to develop a novel stability analysis approach for tracking control problems.
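Before turning to the stability analysis, the following Python sketch illustrates how the recursion (7.13)–(7.14) could be organized in practice. It is a schematic, tabular stand-in under stated assumptions: the dynamics `F`, `G`, `M`, the sampled $(e, d)$ pairs, and the finite candidate-action set are placeholders, and a nearest-sample lookup replaces the smooth value approximator used later in the chapter.

```python
import numpy as np

def value_iteration(F, G, M, Q, samples, actions, gamma=0.98, sweeps=50):
    """Tabular stand-in for the VI recursion (7.13)-(7.14).

    samples : list of (e, d) pairs covering the operating region
    actions : finite set of candidate control-input vectors (crude arg-min search)
    """
    V = np.zeros(len(samples))                      # V^(0)(., .) = 0

    def v_hat(e, d):
        # nearest-sample lookup standing in for a smooth value approximator
        dists = [np.linalg.norm(e - es) + np.linalg.norm(d - ds) for es, ds in samples]
        return V[int(np.argmin(dists))]

    for _ in range(sweeps):
        V_new = np.empty_like(V)
        for i, (e, d) in enumerate(samples):
            costs = []
            for u in actions:
                e_next = F(e + d) + G(e + d) @ u - M(d)          # tracking error, cf. (7.5)
                costs.append(e_next @ Q @ e_next + gamma * v_hat(e_next, M(d)))
            V_new[i] = min(costs)                                # value update (7.13)
        V = V_new
    return V
```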


7.3 Novel Stability Analysis of VI-Based Adaptive Critic Designs

In this section, the stability of the tracking error system is discussed. It is guaranteed that the tracking error under the iterative control policy converges to zero as the number of time steps increases.

Theorem 7.1 Suppose that there exists a control sequence $\boldsymbol{u}_0$ for the system (7.1) and the desired trajectory (7.3) such that $e_k \to 0$ as $k \to \infty$. If the discount factor satisfies
$$(1 - \gamma)V^*(e_k, d_k) \le c\,U(e_k, d_k, \pi^*(e_k, d_k)), \tag{7.16}$$

where $c \in (0, 1)$ is a constant, then the tracking error under the optimal control $\pi^*(e_k, d_k)$ converges to zero as $k \to \infty$.

Proof According to (7.7) and (7.8), the Bellman equation can be rewritten as
$$V^*(e_k, d_k) = U(e_k, d_k, \pi_k^*) + \gamma V^*(e_{k+1}^*, d_{k+1}), \tag{7.17}$$

where $\pi_k^* = \pi^*(e_k, d_k)$ and $e_{k+1}^* = F(e_k + d_k) + G(e_k + d_k)\pi_k^* - M(d_k)$. Considering condition (7.16) and the Bellman equation (7.17), we obtain
$$\gamma\big(V^*(e_{k+1}^*, d_{k+1}) - V^*(e_k, d_k)\big) \le (c - 1)U(e_k, d_k, \pi_k^*), \tag{7.18}$$

which is equivalent to
$$V^*(e_k, d_k) - V^*(e_{k+1}^*, d_{k+1}) \ge \frac{1 - c}{\gamma}U(e_k, d_k, \pi_k^*). \tag{7.19}$$

Applying (7.19) to the tracking errors $e_0, e_1, \ldots, e_N$ and the corresponding reference points $d_0, d_1, \ldots, d_N$, it results in
$$\begin{aligned}
V^*(e_0, d_0) - V^*(e_1^*, d_1) &\ge \frac{1 - c}{\gamma}U(e_0, d_0, \pi_0^*), \\
V^*(e_1^*, d_1) - V^*(e_2^*, d_2) &\ge \frac{1 - c}{\gamma}U(e_1^*, d_1, \pi_1^*), \\
&\ \,\vdots \\
V^*(e_N^*, d_N) - V^*(e_{N+1}^*, d_{N+1}) &\ge \frac{1 - c}{\gamma}U(e_N^*, d_N, \pi_N^*).
\end{aligned} \tag{7.20}$$
Combining the inequalities in (7.20), we have


$$V^*(e_0^*, d_0) - V^*(e_{N+1}^*, d_{N+1}) \ge \sum_{j=0}^{N}\frac{1 - c}{\gamma}U(e_j^*, d_j, \pi_j^*) \ge 0 \tag{7.21}$$

with $e_0^* = e_0$. According to (7.21), $V^*(e_0, d_0) \ge V^*(e_{N+1}^*, d_{N+1})$. Considering (7.21), we conclude that the sequence of partial sums $\sum_{j=0}^{N}\frac{1-c}{\gamma}U(e_j^*, d_j, \pi_j^*)$ is finite. In view of the positive definiteness of $U(e_j^*, d_j, \pi_j^*)$, the sequence of partial sums is nondecreasing. Then, $\sum_{j=0}^{N}\frac{1-c}{\gamma}U(e_j^*, d_j, \pi_j^*)$ is convergent as $N \to \infty$. Therefore, $U(e_k^*, d_k, \pi_k^*) \to 0$ as $k \to \infty$, which implies that, from the definition of the utility function (7.5), the tracking error $e_{k+1}^*$ under the optimal control $\pi^*(e_k^*, d_k)$ approaches zero as $k \to \infty$.

For the discounted iterative adaptive critic tracking control, the condition (7.16) is important. Otherwise, the stability of the optimal control policy cannot be guaranteed. Theorem 7.1 reveals the effect of the discount factor on the convergence of the tracking error. However, the optimal value function is unknown in advance. In what follows, a practical stability condition is provided to guarantee that the tracking error converges to zero under the iterative control policy.

Theorem 7.2 Let the value function with $V^{(0)}(\cdot, \cdot) = 0$ and the control policy be updated by (7.13) and (7.14), respectively. If the iterative value function satisfies
$$V^{(\ell+1)}(e_k, d_k) - \gamma V^{(\ell)}(e_k, d_k) \le c\,U(e_k, d_k, \pi_k^{(\ell)}), \tag{7.22}$$
where $\pi_k^{(\ell)} = \pi^{(\ell)}(e_k, d_k)$, then the tracking error under the iterative control policy $\pi_k^{(\ell)}$ satisfies $e_k \to 0$ as $k \to \infty$.

Proof Let the tracking error at the time $k+1$, generated through the scenario of applying $\pi^{(\ell)}(e_k, d_k)$, be denoted as $e_{k+1}^+$, with $e_{k+1}^+ = F(e_k + d_k) + G(e_k + d_k)\pi_k^{(\ell)} - M(d_k)$. According to the value function update rule (7.13) and the stability condition (7.22), for the initial points $e_0$ and $d_0$, it leads to
$$V^{(\ell)}(e_1^+, d_1) - V^{(\ell)}(e_0, d_0) \le \frac{c - 1}{\gamma}U(e_0, d_0, \pi_0^{(\ell)}). \tag{7.23}$$
Evaluating (7.23) at $e_j^+$ and $d_j$, we obtain
$$V^{(\ell)}(e_{j+1}^+, d_{j+1}) - V^{(\ell)}(e_j^+, d_j) \le \frac{c - 1}{\gamma}U(e_j^+, d_j, \pi_j^{(\ell)}), \tag{7.24}$$
which implies, for $j = 1, 2, \ldots, N$,
$$\begin{aligned}
V^{(\ell)}(e_2^+, d_2) - V^{(\ell)}(e_1^+, d_1) &\le \frac{c - 1}{\gamma}U(e_1^+, d_1, \pi_1^{(\ell)}), \\
V^{(\ell)}(e_3^+, d_3) - V^{(\ell)}(e_2^+, d_2) &\le \frac{c - 1}{\gamma}U(e_2^+, d_2, \pi_2^{(\ell)}), \\
&\ \,\vdots \\
V^{(\ell)}(e_{N+1}^+, d_{N+1}) - V^{(\ell)}(e_N^+, d_N) &\le \frac{c - 1}{\gamma}U(e_N^+, d_N, \pi_N^{(\ell)}).
\end{aligned} \tag{7.25}$$
Combining (7.23) and (7.25), the following relationship can be obtained as
$$V^{(\ell)}(e_{N+1}^+, d_{N+1}) - V^{(\ell)}(e_0, d_0) \le \sum_{j=0}^{N}\frac{c - 1}{\gamma}U(e_j^+, d_j, \pi_j^{(\ell)}), \tag{7.26}$$
where $e_0^+ = e_0$. Note that inequality (7.26) is equivalent to
$$V^{(\ell)}(e_0, d_0) - V^{(\ell)}(e_{N+1}^+, d_{N+1}) \ge \sum_{j=0}^{N}\frac{1 - c}{\gamma}U(e_j^+, d_j, \pi_j^{(\ell)}) \ge 0. \tag{7.27}$$
Since the iterative value function is positive definite and satisfies (7.15), considering (7.27), it results in $V^{(\ell)}(e_0, d_0) \ge V^{(\ell)}(e_{N+1}^+, d_{N+1}) \ge 0$, in the sense that $V^{(\ell)}(e_{N+1}^+, d_{N+1})$ is upper bounded by the constant $V^{(\ell)}(e_0, d_0)$. Note that (7.27) and $V^{(\ell)}(e_0, d_0) \ge V^{(\ell)}(e_{N+1}^+, d_{N+1})$ show that the sequence of partial sums $\sum_{j=0}^{N}\frac{1-c}{\gamma}U(e_j^+, d_j, \pi_j^{(\ell)})$ is finite. Since the utility function is positive definite and $\frac{1-c}{\gamma}\sum_{j=0}^{N}U(e_j^+, d_j, \pi_j^{(\ell)})$ is nondecreasing, the sequence of partial

182

each basis function. The introduction of the function approximator inevitably leads to the approximation error. Define the approximation error at the -th iteration as ε() (ek , dk ). According to the value function update Eq. (7.13), the approximate value function is obtained as   (−1) (ek+1 , dk+1 ) + ε(−1) (ek , dk ) () (ek , dk ) = min U (ek , dk , μ(ek , dk )) + γ V V μ

(−1) (ek+1 , dk+1 ) + ε(−1) (ek , dk ), = U (ek , dk , μ(−1) (ek , dk )) + γ V (7.28) where ek+1 = F(ek + dk ) + G(ek + dk )μ(ek , dk ) − M(dk ) and the corresponding control policy μ(ek , dk ) is computed by   () (ek+1 , dk+1 ) . μ() (ek , dk ) = arg min U (ek , dk , μ(ek , dk )) + γ V μ

(7.29)

Note that the approximation error ε(−1) (ek , dk ) is not the error between the approx() (ek , dk ) and the exact value function V () (ek , dk ). Next, imate value function V considering the approximation error of the function approximator, we further discuss the stability of the closed-loop system using the control policy derived from the approximate value function. Theorem 7.3 Let the iterative value function with V (0) (·, ·) = 0 be approximated by a smooth function approximator. The approximate value function and the corresponding control policy are updated by (7.28) and (7.29), respectively. If the  approximate value function with the approximation error ε() (ek , dk ) ≤ αU (ek , dk , μ() (ek , dk )) is finite and satisfies () (ek , dk ) ≤ c U (ek , dk , μ() (ek , dk )), (+1) (ek , dk ) − γ V V

(7.30)

where α ∈ (0, 1) and c ∈ (0, 1 − α) are constants, then the tracking error under the control policy μ() (ek , dk ) satisfies ek → 0 as k → ∞. Proof For convenience, in the sequel, μ() (ek , dk ) is written as μ() k . According to (7.28) and condition (7.30), it leads to () (ek , dk ) = U (ek , dk , μ() () (e˘k+1 , dk+1 ) (+1) (ek , dk ) − γ V V k )+γV () (ek , dk ) + ε() (ek , dk ) − γ V ≤ c U (ek , dk , μ() k ),

(7.31)

where e˘k+1 represents the tracking error at the time k + 1, generated  () through    the scenario of applying μ() k . Considering (7.31) and condition ε (ek , dk ) ≤ () α U (ek , dk , μk ), we obtain

7.4 Discounted Tracking Control for the Special Case of Linear Systems

1−c−α () (ek , dk ) − V () (e˘k+1 , dk+1 ). U (ek , dk , μ() k )≤ V γ

183

(7.32)

Evaluating (7.32) at the time steps k = 0, 1, 2, . . . , N , it results in 1−c−α () (e0 , d0 ) − V () (e˘1 , d1 ), U (e0 , d0 , μ() 0 )≤ V γ 1−c−α () (e1 , d1 ) − V () (e˘2 , d2 ), U (e˘1 , d1 , μ() 1 )≤ V γ .. . 1−c−α () (e˘ N , d N ) − V () (e˘ N +1 , d N +1 ). U (e˘ N , d N , μ() N )≤ V γ

(7.33)

Combining inequalities in (7.33), we obtain 0≤

N 1−c−α  () (e˘0 , d0 ) − V () (e˘ N +1 , d N +1 ), U (e˘ j , d j , μ() j )≤ V γ j=0

(7.34)

() (e0 , d0 ) ≥ V () (e˘ N +1 , d N +1 ) ≥ 0 holds. where e˘0 = e0 . According to (7.34), V ()  Since the approximate  value function V (e, d) is finite and the utility function is by the positive definite, Nj=0 U (e˘ j , d j , μ() j ) is nondecreasing and upper bounded   N γ () (e0 , d0 ). Therefore, the sequence of partial sums constant 1−c−α V j=0 U (e˘ j ,  d j , μ() is convergent as N → ∞ in the sense that U (e˘k , dk , μ() j ) k ) → 0 as k → ∞. According to the definition of the utility function, Q(F(e˘k + dk ) + G(e˘k + + dk )μ() k − M(dk )) → 0 as k → ∞, which results in e˘k+1 → 0 as k → ∞.

7.4 Discounted Tracking Control for the Special Case of Linear Systems In this section, the VI-based adaptive critic scheme for linear systems and the stability properties are investigated. Consider the following discrete-time linear systems given by xk+1 = Axk + Bu k ,

(7.35)

where A ∈ Rn×n and B ∈ Rn×m are system matrices. Here, we assume that the reference trajectory satisfies dk+1 = dk , where ∈ Rn×n is a constant matrix. This form is used because its analysis is convenient. According to the new cost function (7.5),

7 Adaptive Critic with Improved Cost for Discounted Tracking . . .

184

for the linear system (7.35), a quadratic performance index with a positive definite weight matrix Q is formulated as follows: J (x0 , d0 , u0 ) =

∞ 

 T   γ p Ax p + Bu p − d p Q Ax p + Bu p − d p .

(7.36)

p=0

Combining the dynamical system (7.35) and the desired trajectory dk , we can obtain an augmented system as 

x  X k+1 = k+1 dk+1 

  B A 0n×n xk + uk = 0n×n

dk 0n×m  A Xk +  Bu k ,

(7.37)

where  X is the augmented state. The linear state feedback control policy for linear  systems is defined as u k = π(  Xk ) = −K X k . According to (7.6), the state value function for the system (7.37) can be obtained as  X k , π(  X k )) + γ V ( A Xk +  Bπ(  X k )) V ( Xk) = U ( T  X k+1 ) X k+1 + γ V (  = X k+1 Q  A   Bπ(  X k ))T Q( Bπ(  X k )) + γ V (  X k+1 ), Xk +  = (A Xk + 

(7.38)

 satisfies where the new weight matrix Q = Q

 Q −Q . −Q Q

(7.39)

As mentioned in (Lewis and Vrabie 2009; Lewis et al. 2012; Kiumarsi et al. 2015), the value function can be regarded as the quadratic form in the state, i.e., V (  Xk) =   Then, the Bellman equation of linear quadratic  X kT P X k for some kernel matrix P. tracking is obtained by T  T  X k+1 X k+1 Q X k+1 + γ  P X k+1 V ( Xk) =  T   + γ P)(  A  = (A Xk +  Bπ(  X k )) ( Q Xk +  Bπ(  X k )).

(7.40)

The Hamiltonian function of linear quadratic tracking control is defined as   + γ P)(  A   Xk +  Bπ(  X k ))T ( Q Xk +  Bπ(  X k )) −  X kT P Xk. H ( Xk) = ( A

(7.41)

 Considering the state feedback policy π(  Xk ) = −K X k and Eq. (7.40), it results in

7.4 Discounted Tracking Control for the Special Case of Linear Systems

 V ( Xk) =  X kT P Xk T −  )T ( Q  + γ P)(  A −  )  = Xk ( A BK BK Xk.

185

(7.42)

Therefore, the linear quadratic tracking problem can be solved by using the following equation:  + γ P)(  A −  ) − P  = 0. −  )T ( Q BK (A BK

(7.43)

Considering the Hamiltonian function (7.41), a necessary condition for optimality X k ))/∂π(  X k ) = 0 (Lewis and Vrabie 2009; is the stationarity condition ∂ H (  X k , π(  Lewis et al. 2012). The optimal control policy is computed by  −1  + γ P)   + γ P) A   Xk) = −  BT(Q B u k = π(  BT(Q Xk.   

(7.44)

 K

Here, VI is used to iteratively solve Eq. (7.43). The VI algorithm starts with an (0) . Then, the initial state feedback gain initial positive semidefinite kernel matrix P T −1 T (0) (0) +γ P  ) +γ P (0) ) A.  For  = 1, 2, . . . ,  =  B (Q B  B (Q is computed by K the kernel matrix and the state feedback gain are updated between      +γ P (−1) A −  (−1) () = A −  (−1) T Q BK P BK

(7.45)

 T −1 T  +γ P () A.  +γ P () )  () =  B (Q B  B Q K

(7.46)

and

Note that Lemma 7.1 is also applicable to linear systems. If the initial kernel matrix is set as the zero matrix, then the value function sequence {V (  X k )} is monotonically (+1)  0. From Lemma 7.1, we can obtain () − P nondecreasing in the sense that P ∗ and lim→∞ K ∗ as the number of iterations increases. () = P () = K lim→∞ P Theorem 7.4 Let the kernel matrix and the state feedback gain be iteratively updated by (7.45) and (7.46), respectively. If the iterative kernel matrix and state feedback gain satisfy () − c( A −  () )T Q(  A −  () )  0, (+1) − γ P BK BK P

(7.47)

then, for all xk and dk , the tracking error under the iterative control policy π () (  Xk) satisfies ek → 0 as k → ∞. Proof Considering condition (7.47), ∀  X k , we have     −  () )T Q( (+1) − γ P ()   A −  () )   BK Xk, Xk ≥  X kT P BK X kT c( A

(7.48)

7 Adaptive Critic with Improved Cost for Discounted Tracking . . .

186

which implies X k ) − γ V () (  X k ) ≤ cU (  X k , π () (  X k )). V (+1) ( 

(7.49)

According to Theorem 7.2, we can obtain U (  X k , π () (  X k )) → 0 as k → ∞, which X k ) approaches shows that the tracking error under the iterative control policy π () (  zero as k → ∞. For linear systems, if the system matrices A and B are known, it is not necessary to use the function approximator to estimate the iterative value function. According to the iterative algorithm (7.45) and (7.46), there is no approximation error derived from the approximate value function in the iteration procedure.

7.5 Simulation Studies In this section, two numerical simulations with physical background are conducted to verify the effectiveness of the discounted adaptive critic design. Compared with the cost function (7.4) proposed by the traditional studies, the adopted performance index can eliminate the tracking error.

7.5.1 Example 1 As shown in Fig. 7.1, the spring–mass–damper system is used to validate the present results and compare the performance of the present and the traditional adaptive critic tracking control approaches. Let M, s, and da be the mass of the object, the stiffness constant of the spring, and the damping, respectively. The system dynamics is given as

Fig. 7.1 Diagrammatic sketch of the spring–mass–damper system

7.5 Simulation Studies

187

x˙ = v, v˙ = −

da f s x− v+ , M M M

(7.50)

where x denotes the position, v stands for the velocity, and f is the force applied to the object. Let the system state vector be X = [x, v]T ∈ R2 and the control input be u = f ∈ R. The continuous-time system dynamics (7.50) is discretized using the Euler method with sampling interval t = 0.01 sec. Then, the discrete-time state space equation is obtained as xk+1 = Axk + Bu k     1 t 0 ts tda xk + t u k . = − 1− M M M

(7.51)

In this example, the practical parameters are selected as M = 1 kg, s = 5 N/m, and da = 0.5 Ns/m. The reference trajectory is defined as

dk+1 = dk =

 1 0.01 dk . −0.01 1

(7.52)

Combining the original system (7.51) and the reference trajectory (7.52), the augmented system is formulated as   Xk +  Bu k X k+1 = A ⎡ ⎤ ⎤ ⎡ 0 1 0.01 0 0

 xk ⎢ 0.01 ⎥ ⎢ −0.05 0.995 0 0 ⎥ = ⎣ ⎦s d + ⎣ ⎦ uk . 0 0 1 0.01 0 k 0 0 −0.01 1 0

(7.53)

(0) = 04×4 and the state feedback gain are The iterative kernel matrix with P updated by (7.45) and (7.46), where Q = I2 and the discount factor is chosen as γ = 0.98. On the other hand, considering the following traditional cost function: J (x0 , d0 , u0 ) =

∞ 

T  γ p (x p − d p Q(x p − d p + u Tp Ru p ,

(7.54)

p=0

the corresponding VI-based adaptive critic control algorithm for system (7.53) is implemented between  (−1)    T() = Q −  T(−1) (7.55) + K T(−1)T R K T(−1) + γ A −  T(−1) T P T P A BK BK and

7 Adaptive Critic with Improved Cost for Discounted Tracking . . .

188

−1 T () T() = R + γ  T()  T A,  BT P B P K B γ

(7.56)

where $R \in \mathbb{R}^{m\times m}$ is a positive definite matrix. As defined in (7.54), the objective of the cost function is to minimize both the tracking error and the control input. The role of the cost function (7.54) is to balance the minimization of the tracking error against that of the control input through the selection of the matrices Q and R. To compare the tracking performance under different cost functions, we carry out the new VI-based adaptive critic algorithm and the traditional approach for 400 iterations. Three traditional cost functions with different weight matrices $Q_i$ and $R_i$, $i = 1, 2, 3$, are selected to implement algorithms (7.55) and (7.56), where $Q_{1,2,3} = I_2$ and $R_{1,2,3} = 1, 0.1, 0.01$, respectively. After 400 iteration steps, the obtained optimal kernel matrices and state feedback gains are given as follows:

$$\tilde{P}^{(400)} = \begin{bmatrix} 41.54 & 0.2076 & -41.617 & -0.237 \\ 0.2076 & 0.00104 & -0.208 & -0.0012 \\ -41.617 & -0.2079 & 41.691 & 0.2376 \\ -0.2371 & -0.0012 & 0.2376 & 0.00144 \end{bmatrix}, \quad \tilde{K}^{(400)} = [36.151,\ 99.939,\ -40.223,\ -100.49], \tag{7.57}$$

$$\tilde{P}_1^{(400)} = \begin{bmatrix} 97.353 & -9.656 & -48.821 & 37.997 \\ -9.656 & 25.214 & -8.2838 & -25.967 \\ -48.821 & -8.2838 & 49.701 & 0.19244 \\ 37.997 & -25.967 & 0.19244 & 47.227 \end{bmatrix}, \quad \tilde{K}_1^{(400)} = [-0.10162,\ 0.24321,\ -0.080643,\ -0.25218], \tag{7.58}$$

$$\tilde{P}_2^{(400)} = \begin{bmatrix} 72.596 & -4.7887 & -46.756 & 20.952 \\ -4.7887 & 19.103 & -6.1303 & -20.056 \\ -46.756 & -6.1303 & 47.115 & 1.2794 \\ 20.952 & -20.056 & 1.2794 & 32.681 \end{bmatrix}, \quad \tilde{K}_2^{(400)} = [-0.51589,\ 1.8171,\ -0.5916,\ -1.9207], \tag{7.59}$$

and

$$\tilde{P}_3^{(400)} = \begin{bmatrix} 48.071 & 0.38106 & -43.383 & 2.7857 \\ 0.38106 & 9.1906 & -3.0517 & -9.5405 \\ -43.383 & -3.0517 & 43.603 & 2.0814 \\ 2.7857 & -9.5405 & 2.0814 & 12.189 \end{bmatrix}, \quad \tilde{K}_3^{(400)} = [0.14624,\ 8.2085,\ -2.8468,\ -8.5751]. \tag{7.60}$$
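A minimal sketch of how the traditional discounted recursion (7.55)–(7.56) can be iterated numerically is given below. It assumes the augmented matrices of (7.53), takes the augmented state weighting as [[Q, −Q], [−Q, Q]] so that the quadratic term reproduces (x_p − d_p)^T Q (x_p − d_p) in (7.54), and is a hedged reconstruction for illustration rather than the code that produced (7.57)–(7.60).

```python
import numpy as np

def traditional_vi(A_aug, B_aug, Q_aug, R, gamma, num_iters=400):
    """Iterate kernel matrix P and feedback gain K, cf. (7.55)-(7.56).

    Illustrative sketch of the discounted value-iteration recursion for
    the traditional quadratic cost (7.54); not the authors' original code.
    """
    n, m = A_aug.shape[0], B_aug.shape[1]
    P = np.zeros((n, n))                      # P^(0) = 0
    K = np.zeros((m, n))                      # K^(0) = 0
    for _ in range(num_iters):
        Acl = A_aug - B_aug @ K               # closed-loop matrix with current gain
        P = Q_aug + K.T @ R @ K + gamma * Acl.T @ P @ Acl          # (7.55)
        K = gamma * np.linalg.solve(R + gamma * B_aug.T @ P @ B_aug,
                                    B_aug.T @ P @ A_aug)           # (7.56)
    return P, K

# Augmented matrices of (7.53) and an example weighting (assumptions of this sketch).
A_aug = np.array([[1.0,   0.01,  0.0,   0.0],
                  [-0.05, 0.995, 0.0,   0.0],
                  [0.0,   0.0,   1.0,   0.01],
                  [0.0,   0.0,  -0.01,  1.0]])
B_aug = np.array([[0.0], [0.01], [0.0], [0.0]])
Q0 = np.eye(2)
Q_aug = np.block([[Q0, -Q0], [-Q0, Q0]])      # encodes (x - d)^T Q (x - d)
P1, K1 = traditional_vi(A_aug, B_aug, Q_aug, R=np.array([[1.0]]), gamma=0.98)
```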

Let the initial system state and reference point be x0 = [0.1, 0.14]T and d0 = [−0.3, 0.3]T . Then, the obtained state feedback gains are applied to generate the


Fig. 7.2 The reference trajectory and system state curves under different cost functions (Example 1)

control inputs of the controlled plant (7.53). The system state and tracking error trajectories under different weight matrices are shown in Figs. 7.2 and 7.3. It can be observed that a smaller R leads to a smaller tracking error. The weight matrices Q and R reflect the relative importance of minimizing the tracking error and the control input. The tracking performance of the traditional cost function with a smaller R is similar to that of the new tracking control approach. From (7.56), the matrix R cannot be a zero matrix; otherwise, the inverse of the matrix $R + \gamma \tilde{B}^{T}\tilde{P}^{(\ell)}\tilde{B}$ does not exist. The corresponding control input curves are plotted in Fig. 7.4.

7.5.2 Example 2

Consider the single link robot arm given in (Zhong et al. 2016). Let M, g, L, J, and $f_r$ be the mass of the payload, acceleration of gravity, length of the arm, moment of inertia, and viscous friction, respectively. The system dynamics is formulated as

$$\ddot{\alpha} = -\frac{MgL}{J}\sin(\alpha) - \frac{f_r}{J}\dot{\alpha} + \frac{1}{J}u, \tag{7.61}$$


Fig. 7.3 The tracking error curves under different cost functions (Example 1)


Fig. 7.4 The control input curves under different cost functions (Example 1)


where α and u denote the angle position of robot arm and control input, respectively. Let the system state vector be $x = [\alpha, \dot{\alpha}]^{T} \in \mathbb{R}^2$. Similar to Example 1, the single link robot arm dynamics is discretized using the Euler method with sampling interval $\Delta t = 0.05$ sec. Then, the discrete-time state space equation of (7.61) is obtained as

$$x_{k+1} = F(x_k) + G(x_k)u_k = \begin{bmatrix} x_{1k} + \Delta t\, x_{2k} \\ -\Delta t\, \dfrac{MgL}{J}\sin(x_{1k}) + x_{2k} - \Delta t\, \dfrac{f_r}{J}\,x_{2k} \end{bmatrix} + \begin{bmatrix} 0 \\ \dfrac{\Delta t}{J} \end{bmatrix} u_k. \tag{7.62}$$
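For later reference, the discretized drift and input maps of (7.62) can be written as two small functions. This is an illustrative sketch only: the parameter values are those quoted in the next sentence, and the function names F_dyn and G_dyn are choices made here, not notation from the text.

```python
import numpy as np

# Parameters of the single-link robot arm (quoted below in the text).
M, g, L, J, fr, dt = 1.0, 9.8, 1.0, 5.0, 2.0, 0.05

def F_dyn(x):
    """Drift term F(x_k) of the Euler-discretized dynamics (7.62)."""
    x1, x2 = x
    return np.array([x1 + dt * x2,
                     -dt * M * g * L / J * np.sin(x1) + x2 - dt * fr / J * x2])

def G_dyn(x):
    """Input map G(x_k) of (7.62); constant for this plant."""
    return np.array([[0.0],
                     [dt / J]])
```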

In this example, the practical parameters are set as M = 1 kg, g = 9.8 m/s2 , L = 1 m, J = 5 kg · m2 , and fr = 2. The desired trajectory is defined as

$$d_{k+1} = \begin{bmatrix} 1 & 0.05 \\ -0.025 & 1 \end{bmatrix} d_k. \tag{7.63}$$

The cost function (7.5) is set as the quadratic form, where Q and γ are selected as $Q = I_2$ and $\gamma = 0.97$, respectively. In this example, since $e_k$ and $d_k$ are the independent variables of the value function, the function approximator of the iterative value function is selected as the following form:

$$\hat{V}^{(\ell)}(e_k, d_k) = W^{(\ell)T}\big[e_{1k}^2, e_{2k}^2, d_{1k}^2, d_{2k}^2, e_{1k}^3, e_{2k}^3, d_{1k}^3, d_{2k}^3, e_{1k}e_{2k}, e_{1k}d_{1k}, e_{1k}d_{2k}, e_{2k}d_{1k}, e_{2k}d_{2k}, d_{1k}d_{2k}, e_{1k}^2 e_{2k}, e_{1k}^2 d_{1k}, e_{1k}^2 d_{2k}, e_{2k}^2 e_{1k}, e_{2k}^2 d_{1k}, e_{2k}^2 d_{2k}, d_{1k}^2 e_{1k}, d_{1k}^2 e_{2k}, d_{1k}^2 d_{2k}, d_{2k}^2 e_{1k}, d_{2k}^2 e_{2k}, d_{2k}^2 d_{1k}\big]^{T}, \tag{7.64}$$

where $W^{(\ell)} \in \mathbb{R}^{26}$ is the parameter vector. In the iteration process, 300 random samples in the region $\{(e \in \mathbb{R}^2, d \in \mathbb{R}^2): -1 \le e_1 \le 1,\ -1 \le e_2 \le 1,\ -1 \le d_1 \le 1,\ -1 \le d_2 \le 1\}$ are chosen to learn the iterative value function $V^{(\ell)}(e, d)$ for 200 iteration steps. The value function is initialized as zero. In the iteration process, considering the first-order necessary condition for optimality, the iterative control policy can be computed by the following equation:

$$\mu^{(\ell)}(e_k, d_k) = -\big[G^{T}(e_k + d_k)\, Q\, G(e_k + d_k)\big]^{-1} G^{T}(e_k + d_k)\Big[\frac{\partial V^{(\ell)}(e_{k+1}, d_{k+1})}{\partial e_{k+1}} + Q\big(F(e_k + d_k) - M(d_k)\big)\Big]. \tag{7.65}$$
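Since the next tracking error $e_{k+1}$ on the right-hand side of (7.65) itself depends on the control being computed, the control law has to be resolved by successive approximation, as noted next. The sketch below illustrates that inner fixed-point loop for the form of (7.65) as printed; it assumes callables F_dyn and G_dyn (e.g., as sketched after (7.62)), a reference map M_ref implementing $d_{k+1} = M(d_k)$, and value_grad(e, d) returning the gradient of the approximate value function with respect to e — all of these names are assumptions made for this sketch.

```python
import numpy as np

def control_successive_approx(e_k, d_k, Q, F_dyn, G_dyn, M_ref, value_grad,
                              num_inner=20):
    """Resolve the implicit control equation (7.65) by successive approximation.

    Illustrative sketch: starting from u = 0, the right-hand side of (7.65)
    is evaluated with the current control guess and taken as the next guess.
    """
    x_k = e_k + d_k                      # original state recovered from error + reference
    d_next = M_ref(d_k)                  # reference point at the next step
    G = G_dyn(x_k)
    GTQG_inv = np.linalg.inv(G.T @ Q @ G)
    u = np.zeros(G.shape[1])
    for _ in range(num_inner):
        e_next = F_dyn(x_k) + G @ u - d_next              # predicted next tracking error
        rhs = value_grad(e_next, d_next) + Q @ (F_dyn(x_k) - d_next)
        u = -GTQG_inv @ G.T @ rhs                         # evaluate (7.65) with the current guess
    return u
```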

Note that the unknown control input $\mu^{(\ell)}(e_k, d_k)$ exists on both sides of (7.65). Then, at each iteration step, $\mu^{(\ell)}(e_k, d_k)$ is iteratively obtained by using the successive approximation approach. After the iterative learning process, the parameter vector is obtained as follows:


Fig. 7.5 The reference trajectory and system state curves under different cost functions (Example 2)

$$W^{(200)} = [7.7011, 0.019253, 0, 0, 0, 0, 0, 0, 0.77011, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]^{T}. \tag{7.66}$$

Next, we compare the tracking performance of the new and the traditional methods. The traditional cost function is also selected as the quadratic form. Three traditional cost functions with $Q_{1,2,3} = I_2$ and $R_{1,2,3} = 0.1, 0.01, 0.001$ are selected. The initial state and initial reference point are set as $x_0 = [-0.32, 0.12]^{T}$ and $d_0 = [0.12, -0.23]^{T}$, respectively. The obtained parameter vectors derived from the present and the traditional adaptive critic methods are employed to generate the near-optimal control policy. The controlled plant state trajectories using these near-optimal control policies are shown in Fig. 7.5. The corresponding tracking error and control input curves are plotted in Figs. 7.6 and 7.7. From Figs. 7.6 and 7.7, it is observed that both the tracking error and the control input derived from the traditional approach are minimized. However, it is not necessary to minimize the control input by deteriorating the tracking performance for tracking control.


Fig. 7.6 The tracking error curves under different cost functions (Example 2)


Fig. 7.7 The control input curves under different cost functions (Example 2)


7.6 Conclusions

In this chapter, for the tracking control problem, the stability of the discounted VI-based adaptive critic method with a new performance index is investigated. Based on the new performance index, the iterative formulation for the special case of linear systems is given. Some stability conditions are provided to guarantee that the tracking error approaches zero as the number of time steps increases. Moreover, the effect of the presence of approximation errors of the value function is discussed. Two numerical simulations are performed to compare the tracking performance of the iterative adaptive critic design under different performance index functions. It is also interesting to further extend the present tracking control method to nonaffine systems, data-based tracking control, output tracking control, various practical applications, and so forth. Online adaptive critic designs for practical complex systems with noise will be addressed in future work.

References Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):943–949 Bertsekas DP (2017) Value and policy iterations in optimal control and adaptive dynamic programming. IEEE Trans Neural Networks Learn Syst 28(3):500–509 Cao Y, Song Y, Wen C (2019) Practical tracking control of perturbed uncertain nonaffine systems with full state constraints. Automatica 110:108608 Chen C, Modares H, Xie K, Lewis FL, Wan Y, Xie S (2019) Reinforcement learning-based adaptive optimal exponential tracking control of linear systems with unknown dynamics. IEEE Trans Automat Control 64(11):4423–4438 Ha M, Wang D, Liu D (2020a) Data-based nonaffine optimal tracking control using iterative DHP approach. In: Proceedings of 21st IFAC world congress 53(2):4246–4251 Ha M, Wang D, Liu D (2020b) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern: Syst 50(9):3158–3168 Ha M, Wang D, Liu D (2020c) Event-triggered constrained control with DHP implementation for nonaffine discrete-time systems. Inf Sci 519:110–123 Ha M, Wang D, Liu D (2020d) Value-iteration-based neuro-optimal tracking control for affine systems with completely unknown dynamics. In: Proceedings of 39th Chinese control conference. pp 1951–1956 Ha M, Wang D, Liu D (2021) Generalized value iteration for discounted optimal control with stability analysis. Syst Control Lett 147(104847):1–7 Ha M, Wang D, Liu D (2022) Discounted iterative adaptive critic designs with novel stability analysis for tracking control. IEEE/CAA J Automat Sinica 9(7):1262–1272 Han X, Zheng Z, Liu L, Wang B, Cheng Z, Fan H, Wang Y (2020) Online policy iteration ADP-based attitude-tracking control for hypersonic vehicles. Aerosp Sci Technol 106:106233 Heydari A (2016) Theoretical and numerical analysis of approximate dynamic programming with approximation errors. J Guidance Control Dyn 39(2):301–311 Heydari A (2018) Stability analysis of optimal adaptive control under value iteration using a stabilizing initial policy. IEEE Trans Neural Networks Learn Syst 29(9):4522–4527


Heydari A (2018) Stability analysis of optimal adaptive control using value iteration with approximation errors. IEEE Trans Automat Control 63(9):3119–3126 Heydari A, Balakrishnan S (2014) Adaptive critic based solution to an orbital rendezvous problem. J Guidance Control Dyn 37(1):344–350 Kamalapurkar R, Dinhb H, Bhasin S, Dixon WE (2015) Approximate optimal trajectory tracking for continuous-time nonlinear systems. Automatica 51:40–48 Kiumarsi B, Lewis FL (2015) Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 26(1):140–151 Kiumarsi B, Lewis FL, Modares H, Karimpour A, Naghibi-Sistani MB (2014) Reinforcement Qlearning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 50(4):1167–1175 Kiumarsi B, Lewis FL, Naghibi-Sistani MB, Karimpour A (2015) Optimal tracking control of unknown discrete-time linear systems using input-output measured data. IEEE Trans Cybern 45(12):2770–2779 Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 9(3):32–50 Lewis FL, Vrabie D, Vamvoudakis KG (2012) Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst Mag 32(6):76– 105 Li C, Ding J, Lewis FL, Chai T (2021) A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 129:109687 Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general value iteration. IET Control Theory Appl 6(18):2725–2736 Lincoln B, Rantzer A (2006) Relaxing dynamic programming. IEEE Trans Automat Control 51(8):1249–1260 Liu D, Wang D, Li H (2014) Decentralized stabilization for a class of continuous-time nonlinear interconnected systems using online learning optimal control approach. IEEE Trans Neural Networks Learn Syst 25(2):418–428 Liu L, Wang Z, Zhang H (2018) Neural-network-based robust optimal tracking control for MIMO discrete-time systems with unknown uncertainty using adaptive critic design. IEEE Trans Neural Networks Learn Syst 29(4):1239–1251 Liu D, Wei Q (2013) Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems. IEEE Trans Cybern 43(2):779–789 Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Networks Learn Syst 25(3):621–634 Liu D, Wei Q, Wang D, Yang X, Li H (2017) Adaptive dynamic programming with applications in optimal control. Springer, London Liu D, Xu Y, Wei Q, Liu X (2018) Residential energy scheduling for variable weather solar energy based on adaptive dynamic programming. IEEE/CAA J Automat Sinica 5(1):36–46 Liu D, Xue S, Zhao B, Luo B, Wei Q (2021) Adaptive dynamic programming for control: A survey and recent advances. IEEE Trans Syst Man Cybern: Syst 51(1):142–160 Liu D, Yang X, Wang D, Wei Q (2015) Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Trans Cybern 45(7):1372–1385 Luo B, Liu D, Huang T, Wang D (2016) Model-free optimal tracking control via critic-only Qlearning. IEEE Trans Neural Networks Learn Syst 27(10):2134–2144 Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown constrainedinput systems using integral reinforcement learning. 
Automatica 50(7):1780–1792 Qin C, Zhang H, Luo Y (2014) Online optimal tracking control of continuous-time linear systems with unknown dynamics by using adaptive dynamic programming. Int J Control 87(5):1000–1009 Wang D, Ha M, Qiao J (2020) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Automat Control 65(3):1272–1279


Wang D, Ha M, Qiao J (2021) Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, Ha M, Zhao M (2022) The intelligent critic framework for advanced optimal control. Artif Intell Rev 55(1):1–22 Wang D, He H, Liu D (2017) Adaptive critic nonlinear robust control: A survey. IEEE Trans Cybern 47(10):3429–3451 Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wang D, Qiao J, Cheng L (2022) An approximate neuro-optimal solution of discounted guaranteed cost control design. IEEE Trans Cybern 52(1):77–86 Wang D, Zhong X (2019) Advanced policy learning near-optimal regulation. IEEE/CAA J Automat Sinica 6(3):743–749 Wei Q, Liu D (2014) A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Trans Automat Sci Eng 11(4):1176–1190 Wei Q, Liu D, Lin H (2016) Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Trans Cybern 46(3):840–853 Wei Q, Liu D, Liu Y, Song R (2017) Optimal constrained self-learning battery sequential management in microgrid via adaptive dynamic programming. IEEE/CAA J Automat Sinica 4(2):168– 176 Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):937–942 Zhong X, Ni Z, He H (2016) A theoretical foundation of goal representation heuristic dynamic programming. IEEE Trans Neural Networks Learn Syst 27(12):2513–2525

Chapter 8

Iterative Adaptive Critic Control Towards an Urban Wastewater Treatment Plant

Abstract Wastewater treatment is an important avenue of cyclic resource utilization when coping with modern urban diseases. However, there always exist obvious nonlinearities and uncertainties within wastewater treatment systems, such that it is difficult to accomplish proper optimization objectives for these complex unknown platforms. In this chapter, a data-driven iterative adaptive critic (IAC) strategy is developed to address the nonlinear optimal control problem. The iterative algorithm is constructed within a general framework, followed by convergence analysis and neural network implementation. Remarkably, the derived IAC control policy with an additional steady control input is also applied to a typical wastewater treatment plant, so that the dissolved oxygen concentration and the nitrate level are maintained at the desired setting points. When compared with the incremental proportional–integral–derivative method, it is found that a faster response and less oscillation can be obtained during the IAC control process. Finally, the wastewater treatment problem is revisited by using the previous mixed driven control framework, with an emphasis on significantly reducing the control updating times. Keywords Data-driven control · Iterative adaptive critic · Learning systems · Optimal regulation · Wastewater treatment

8.1 Introduction

Because of the serious shortage of freshwater resources, how to effectively treat wastewater has become more and more significant around the world. Hence, designing advanced controllers for wastewater treatment plants has received great attention for many years (Iratni and Chang 2019; Revollar et al. 2017; Han and Qiao 2014; Yang et al. 2014; Farahani et al. 2018; Han et al. 2019, 2020). However, due to the existence of nonlinearities and disturbances, plus the consideration of certain optimization objectives for unknown platforms, traditional proportional–integral–derivative (PID) controllers cannot be widely used for complex wastewater treatment processes (WWTP). Actually, there is a lack of effective optimization methods under unknown and nonlinear environments, by using


artificial intelligence techniques. Thus, how to obtain intelligent optimal controllers towards complex nonlinear systems has been regarded as an important direction of the advanced control field. As an indispensable branch of modern control theory, there are many mature results on optimal feedback design for linear systems. However, the main difficulty in nonlinear optimal control design is addressing the complex Hamilton–Jacobi– Bellman (HJB) equations. Considering the rarity of analytical methods, adaptive critic algorithms combining with neural networks were developed to obtain approximate solutions (Werbos 1992). Heuristic dynamic programming (HDP), dual HDP (DHP), and globalized DHP were basic implementation tools of the adaptive critic field as described in (Werbos 1992). After this pioneering work, the online learning optimal control design was paid great attention particularly in (Si and Wang 2001). A direct adaptive state feedback control method was proposed for a class of nonlinear systems in the discrete-time environment (Yang et al. 2015). A data-based adaptive dynamic programming approach for discrete-time systems with multiple delays was given in (Zhang et al. 2020). The optimal control design of affine nonlinear systems with off-policy interleaved Q-learning was studied in (Li et al. 2019). An improved value iteration algorithm for neural-network-based stochastic optimal control design was given in (Liang et al. 2020). Moreover, the event-based adaptive critic controllers for various discrete-time systems were designed in (Ha et al. 2020; Dhar et al. 2018). Except these progresses, it is worth noting that some iterative algorithms in discretetime domain were developed to solve approximate optimal control problems, by adopting HDP (Al-Tamimi et al. 2008), DHP (Zhang et al. 2009), globalized DHP (Wang et al. 2012), and the goal representation HDP (Zhong et al. 2016) techniques, respectively. Therein, an additional goal network was integrated into the conventional actor-critic structure (Al-Tamimi et al. 2008; Zhang et al. 2009; Wang et al. 2012) to obtain an internal reinforcement signal for effective learning and control (Zhong et al. 2016). Besides, the iterative learning strategy was also applied to address the discrete-time H∞ tracking control problem recently (Hou et al. 2019). It is believed that the adaptive critic control for discrete-time complex systems will still be a hot topic of intelligent control design. Note that the idea of data-driven design is frequently emphasized in many existing results of the approximate optimal control synthesis (Zhang et al. 2020; Wang et al. 2012; Zhang et al. 2017). For example, the learning-based optimal consensus design for unknown discrete-time multi-agent systems was studied in (Zhang et al. 2017). Effective learning from the big data information is an important property of data-driven adaptive critic algorithms. Though possessing excellent self-learning and adaptivity performances, the application of adaptive critic algorithms to wastewater treatment has been rarely considered. An important progress was reported in (Bo and Qiao 2015), where the HDP framework combined with the echo state network was proposed for WWTP. Moreover, an optimal control method based on iterative HDP was presented in (Qiao et al. 2018) for WWTP. Note that the control performance has been improved in that work when compared with the traditional PID controller, but the monotonic convergence for the control law of the given algorithms was not proven. 
Hence, it is worth investigating the convergence of the iterative algorithm


with wastewater treatment applications. In addition, the optimality of the proposed control algorithm can be ensured under the framework of value iteration. In this chapter, a data-driven iterative adaptive critic (IAC) method is developed to deal with the nonlinear optimal feedback control problem (Wang et al. 2021). The iterative framework of adaptive critic is established with convergence proof and neural network implementation. Since containing excellent self-learning abilities under unknown and nonlinear environment, the adaptive critic technique is promising to design intelligent optimal controllers for WWTP. Hence, by considering an additional steady control input, the developed intelligent critic control strategy is successfully applied to a typical wastewater treatment plant, which further verifies the theoretical results (Wang et al. 2021). In other words, the major contributions of the chapter are twofold, i.e., the construction of the general IAC control framework with convergence guarantee and the application to the wastewater treatment plant involving the steady control input. Both the theoretical derivation and application results are contained in this chapter. Overall, constructing such an effective composite control approach is greatly meaningful for advanced nonlinear optimization design and also for environmental protection in the aspect of wastewater recycling. The notations utilized throughout the chapter are described as follows. R is the set of all real numbers. Rn is the Euclidean space of all n-dimensional real vectors. Let  be a compact subset of Rn and () be the set of admissible control laws (Al-Tamimi et al. 2008; Zhang et al. 2009; Wang et al. 2012) on . Rn×m is the space of all n × m real matrices. N denotes the set {0, 1, 2, . . . }. In is the n × n identity matrix and “T” is the transpose operation. Note the symbols k and i are used to represent the time step and the iteration index, respectively.

8.2 Platform Description with Control Problem Statement

In this section, we first describe a typical platform of the wastewater treatment field and then state the related optimal control problem with respect to the dissolved oxygen concentration and the nitrate level.

8.2.1 Platform Description

In this chapter, we study the data-driven IAC control design towards a typical plant, namely, Benchmark Simulation Model No. 1 (BSM1). Note that the BSM1 is a classical experimental platform of WWTP (Alex et al. 2008). The plant is composed of an integrated biochemical reaction tank with five units and a secondary sedimentation tank with ten layers, as shown in Fig. 8.1. The involved main parameters are listed in Table 8.1. After the process of primary treatment, the sludge of the wastewater is directly put into a specific tank and the remaining part is poured into the biochemical reactor with five compartments. Note that two anaerobic areas and three oxic areas


Fig. 8.1 A simple schematic diagram of the BSM1, where an integrated biochemical reaction tank with five units and a secondary sedimentation tank with ten layers are included

Table 8.1 Some main parameters of the BSM1

Symbol    | Meaning
S_O,5     | The dissolved oxygen concentration in the fifth unit
S_NO,2    | The nitrate level in the second unit
K_L a_5   | The oxygen transfer coefficient of the fifth unit
Q_a       | The internal recycle flow rate of the fifth-second units

are contained in the biochemical reaction tank. After the whole treating process, the upper clarified water flows into receiving bodies such as a river, while the lower sludge is discharged or sent back to the initial stage via the external recycle. As stated in (Bo and Qiao 2015), the primary control objective of the BSM1 is to ensure that S_O,5 and S_NO,2 are maintained at desired setting points. For this platform, the manipulated variables with respect to S_O,5 and S_NO,2 are, respectively, K_L a_5 and Q_a, which are directly related to the control action to be designed.

8.2.2 Control Problem Statement

Assume that the nonlinear dynamics of the proposed wastewater treatment problem can be formulated as

$$x(k+1) = F(x(k), u(k)), \quad k \in \mathbb{N}, \tag{8.1}$$

where $F(\cdot,\cdot)$ is a continuous and unknown system function, $x(k) \in \mathbb{R}^n$ is the state variable, and $u(k) \in \mathbb{R}^m$ is the control input. We let x(0) be the initial state and assume that x = 0 is the unique equilibrium point of system (8.1) under u = 0, i.e., F(0, 0) = 0. Generally, we also assume that system (8.1) can be stabilized on a compact set of $\mathbb{R}^n$ by a state feedback controller u(x(k)). In this chapter, we only consider the infinite-horizon optimal control problem and want to find an admissible feedback control law to minimize the cost function

$$J(x(k)) = \sum_{h=k}^{\infty} U\big(x(h), u(x(h))\big) = U\big(x(k), u(x(k))\big) + \sum_{h=k+1}^{\infty} U\big(x(h), u(x(h))\big), \tag{8.2}$$

where $U(x, u) \ge 0$, $\forall x, u$, is the utility function ensuring U(0, 0) = 0. Normally, the utility function is selected as

$$U\big(x(k), u(x(k))\big) = x^{T}(k) Q x(k) + u^{T}(x(k)) R u(x(k)), \tag{8.3}$$

where Q and R are positive definite matrices of $\mathbb{R}^{n\times n}$ and $\mathbb{R}^{m\times m}$, respectively. Note this is the specific choice of the utility in our case study, which is also common in the literature. Here, the term $x^{T} Q x$ is called the state utility while $u^{T} R u$ is the control utility. According to the optimality principle and the form of (8.2), the optimal cost function defined as

$$J^{*}(x(k)) = \min_{\{u(\cdot)\}} \sum_{h=k}^{\infty} U\big(x(h), u(x(h))\big) \tag{8.4}$$

satisfies the discrete-time HJB equation

$$J^{*}(x(k)) = \min_{u(x(k))} \Big\{ U\big(x(k), u(x(k))\big) + J^{*}(x(k+1)) \Big\}. \tag{8.5}$$

It is hard to solve the above HJB equation with traditional manners, because the value of $J^{*}(x(k+1))$ is unknown in advance and the controlled plant is nonaffine. This difficulty also exists when deriving the exact optimal control by using

$$u^{*}(x(k)) = \arg\min_{u(x(k))} \Big\{ U\big(x(k), u(x(k))\big) + J^{*}(x(k+1)) \Big\}. \tag{8.6}$$

Therefore, it is necessary to pursue the near-optimal control design in the nonlinear discrete-time domain. For solving the discrete-time HJB equation approximately, an advanced iterative form of data-driven adaptive critic design can be employed in the sequel.

8.3 The Data-Driven IAC Control Method

In this part, we describe the iterative self-learning algorithm step by step. Before carrying out the main iteration process, we should set a small positive number ε and construct two sequences $\{J^{(i)}(x(k))\}$ and $\{u^{(i)}(x(k))\}$, where i denotes the iteration


index and $i \in \mathbb{N}$. The iteration process starts with i = 0 and the initial cost function is chosen as $J^{(0)}(\cdot) = 0$. According to the current value of the cost function, the iterative control function is solved by

$$u^{(i)}(x(k)) = \arg\min_{u(x(k))} \Big\{ U\big(x(k), u(x(k))\big) + J^{(i)}(x(k+1)) \Big\}, \tag{8.7}$$

where the involved state vector $x(k+1) = F\big(x(k), u(x(k))\big)$ can be approximated by using a neural-network-based learning module. Incidentally, note that since $J^{(0)}(\cdot) = 0$, we can easily compute that $u^{(0)}(\cdot) = 0$. Based on the new control law, the iterative cost function is updated according to

$$J^{(i+1)}(x(k)) = \min_{u(x(k))} \Big\{ U\big(x(k), u(x(k))\big) + J^{(i)}(x(k+1)) \Big\}, \tag{8.8}$$

which, combining with the expression of $u^{(i)}(x(k))$, is also

$$J^{(i+1)}(x(k)) = U\big(x(k), u^{(i)}(x(k))\big) + J^{(i)}\Big(F\big(x(k), u^{(i)}(x(k))\big)\Big). \tag{8.9}$$

Once the newest cost function is updated, we should check the stopping criterion related to ε and decide whether the next iteration is necessary. In case that |J (i+1) (x(k)) − J (i) (x(k))| ≤ ε,

(8.10)

we can stop the iteration process and thus derive the near-optimal control law. Otherwise, we increase the iteration index using i = i + 1 and continue to implement the above steps as (8.7) and (8.8). In a word, the whole iterative process is carried out according to the following sequence: J (0) → u (0) → J (1) → · · · → u (i) → J (i+1) → · · · .

(8.11)
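A compact sketch of this outer iteration (8.7)–(8.11) is given below. The routines policy_improve and cost_update stand for the minimizations in (8.7) and (8.8), which in practice are carried out through the neural networks described later in this section; all names in the sketch are illustrative assumptions.

```python
def iterative_adaptive_critic(states, cost_update, policy_improve, eps=1e-8,
                              max_iters=2000):
    """Outer value-iteration loop of the IAC algorithm, cf. (8.7)-(8.11).

    Illustrative sketch: J and u are represented abstractly as functions
    returned by the caller-supplied update routines.
    """
    J = lambda x: 0.0            # J^(0)(.) = 0
    u = lambda x: 0.0 * x        # hence u^(0)(.) = 0
    for _ in range(max_iters):
        u = policy_improve(J)                       # (8.7): greedy control w.r.t. J^(i)
        J_next = cost_update(J, u)                  # (8.8)-(8.9): updated cost J^(i+1)
        # Stopping criterion (8.10), checked over the sampled states.
        if max(abs(J_next(x) - J(x)) for x in states) <= eps:
            J = J_next
            break
        J = J_next
    return J, u
```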

The convergence of this iterative algorithm is reflected by considering two aspects, namely boundedness and monotonicity.

Theorem 8.1 The sequences of the iterative cost function and the iterative control law are both convergent with respect to the optimal cost function and the optimal control law.

Proof The whole proof process is cut into two parts according to boundedness and monotonicity.

(1) Show the boundedness of the iterative cost function sequence. Let η(x(k)) be any admissible control input and $\{V_1^{(i)}\}$ be a new sequence defined by

$$V_1^{(i)}(x(k)) = U\big(x(k), \eta(x(k))\big) + V_1^{(i-1)}(x(k+1)) \tag{8.12}$$


with $V_1^{(0)}(\cdot) = 0$. Initially, we can find that $V_1^{(1)}(x(k)) = U(x(k), \eta(x(k)))$. By expanding $V_1^{(i)}(x(k)) - V_1^{(i-1)}(x(k))$ iteration by iteration and considering $V_1^{(0)}(\cdot) = 0$, we derive

$$V_1^{(i)}(x(k)) - V_1^{(i-1)}(x(k)) = V_1^{(i-1)}(x(k+1)) - V_1^{(i-2)}(x(k+1)) = \cdots = V_1^{(1)}(x(k+i-1)) \tag{8.13}$$

and finally get

$$V_1^{(i)}(x(k)) = V_1^{(i-1)}(x(k)) + V_1^{(1)}(x(k+i-1)) = V_1^{(i-2)}(x(k)) + V_1^{(1)}(x(k+i-2)) + V_1^{(1)}(x(k+i-1)) = \cdots = \sum_{\ell=0}^{i-1} V_1^{(1)}(x(k+\ell)). \tag{8.14}$$

Considering the admissibility of η(x(k)), there exists a positive constant B such that $V_1^{(i)}(x(k)) \le B$, $\forall i$. Because the cost $J^{(i)}(x(k))$ as in (8.8) is a minimization result, we have $J^{(i)}(x(k)) \le V_1^{(i)}(x(k)) \le B$. Since the cost function is nonnegative, we obtain that $0 \le J^{(i)}(x(k)) \le B$, $i \in \mathbb{N}$.

(2) Show the monotonicity of the iterative cost function sequence via mathematical induction. Here, we define another new sequence $\{V_2^{(i)}\}$ as

$$V_2^{(i)}(x(k)) = U\big(x(k), u^{(i)}(x(k))\big) + V_2^{(i-1)}(x(k+1)) \tag{8.15}$$

with $V_2^{(0)}(\cdot) = 0$. Initially, we observe $V_2^{(0)}(x(k)) = 0 \le J^{(1)}(x(k))$. Then, we assume that the inequality $V_2^{(i)}(x(k)) \le J^{(i+1)}(x(k))$ holds for all $k \in \mathbb{N}$ and $i = 1, 2, \dots$. Combining (8.9) and (8.15), we develop

$$\begin{cases} J^{(i+2)}(x(k)) = U\big(x(k), u^{(i+1)}(x(k))\big) + J^{(i+1)}(x(k+1)), \\ V_2^{(i+1)}(x(k)) = U\big(x(k), u^{(i+1)}(x(k))\big) + V_2^{(i)}(x(k+1)), \end{cases} \tag{8.16}$$

so that the following inequality holds:

$$V_2^{(i+1)}(x(k)) - J^{(i+2)}(x(k)) = V_2^{(i)}(x(k+1)) - J^{(i+1)}(x(k+1)) \le 0, \tag{8.17}$$

which means V2(i+1) (x(k)) ≤ J (i+2) (x(k)), i = 1, 2, . . . . Hence, we conclude that V2(i) (x(k)) ≤ J (i+1) (x(k)), ∀i ∈ N. Considering the computation of J (i) (x(k)) as the


form in (8.8), we obtain $J^{(i)}(x(k)) \le V_2^{(i)}(x(k))$. Therefore, we have $J^{(i)}(x(k)) \le J^{(i+1)}(x(k))$ with $i \in \mathbb{N}$.

Combining the above two parts, we can finally obtain that $J^{(i)}(x(k)) \to J^{*}(x(k))$ and $u^{(i)}(x(k)) \to u^{*}(x(k))$ hold when $i \to \infty$, which displays the convergence of the proposed learning algorithm.

Remark 8.1 Since the controlled plant (8.1) is nonaffine, it is difficult to directly obtain the partial derivative of x(k+1) with respect to u(k). Note that in the affine case, this derivative is just the control matrix. Hence, in order to get the solution of the iterative control law (8.7) by using

$$u^{(i)}(x(k)) = -\frac{1}{2} R^{-1} \left[\frac{\partial x(k+1)}{\partial u(k)}\right]^{T} \frac{\partial J^{(i)}(x(k+1))}{\partial x(k+1)}, \tag{8.18}$$

we also should rely on the neural-network-based approximate expression.

In the sequel, the detailed learning procedure of the above iterative algorithm is presented via the adaptive critic technique. This is a data-driven learning control process containing the approximate state $\hat{x}(k+1)$, the approximate cost $\hat{J}^{(i)}(x(k))$, and the approximate control $\hat{u}^{(i)}(x(k))$, which are just the outputs of three neural networks. For learning the nonlinear system dynamics, a neural network identifier is first constructed via data-driven processing. By inputting the state x(k) and the control $\hat{u}^{(i)}(x(k))$, we can express the output of the neural identifier as

$$\hat{x}(k+1) = \omega_1^{T}\,\sigma\big(\nu_1^{T}[x^{T}(k), \hat{u}^{(i)T}(x(k))]^{T} + b_1\big) + b_2, \tag{8.19}$$

where $\omega_1$ and $\nu_1$ are the involved weight variables, $b_1$ and $b_2$ are the thresholds of the model network, and $\sigma(\cdot)$ is the activation function. Combining with the real updated state x(k+1), the training performance measure related to the neural identifier is defined as

$$E_1(k+1) = \frac{1}{2}\big[\hat{x}(k+1) - x(k+1)\big]^{T}\big[\hat{x}(k+1) - x(k+1)\big]. \tag{8.20}$$

Establishing the neural identifier is a pre-training procedure that should be conducted before the main iteration of critic and action networks. In this chapter, we use the MATLAB neural network toolbox to train the model network. The critic network approximates the iterative cost function with weight matrices $\omega_2$ and $\nu_2$ and the formulation

$$\hat{J}^{(i+1)}(x(k)) = \omega_2^{(i+1)T}\,\sigma\big(\nu_2^{(i+1)T} x(k)\big). \tag{8.21}$$

Combining (8.9) with (8.21), the training performance measure is

$$E_2(k) = \frac{1}{2}\big[\hat{J}^{(i+1)}(x(k)) - J^{(i+1)}(x(k))\big]^2. \tag{8.22}$$


Note that the iteration index i is omitted in $E_2(k)$ for simplicity. Actually, this performance measure varies along with different iteration numbers. Using the state variable x(k) and the weight variables $\omega_3$ and $\nu_3$, the action neural network approximates the iterative control law as follows:

$$\hat{u}^{(i)}(x(k)) = \omega_3^{(i)T}\,\sigma\big(\nu_3^{(i)T} x(k)\big). \tag{8.23}$$

The performance measure for tuning action parameters is

$$E_3(k) = \frac{1}{2}\big[\hat{u}^{(i)}(x(k)) - u^{(i)}(x(k))\big]^{T}\big[\hat{u}^{(i)}(x(k)) - u^{(i)}(x(k))\big], \tag{8.24}$$

where the iterative control law $u^{(i)}(x(k))$ can be directly obtained via (8.7). By adopting the gradient-based adaptation rule, the weight matrices of the critic network and the action network can be updated with a unified criterion

$$\Delta\omega_l = -\alpha_l \frac{\partial E_l(k)}{\partial \omega_l}, \tag{8.25a}$$

$$\Delta\nu_l = -\alpha_l \frac{\partial E_l(k)}{\partial \nu_l}, \quad l = 2, 3, \tag{8.25b}$$

where $\alpha_l > 0$, $l = 2, 3$, are the learning rates of the two networks and $\Delta\omega_l$ and $\Delta\nu_l$ with $l = 2, 3$ are the difference values between two successive updating steps.

Remark 8.2 For keeping consistent with the initial setting of the MATLAB toolbox, the thresholds of the neural network are considered when modeling the controlled plant. However, with regard to the other two neural networks, the thresholds are not involved, so as to simplify the training activity.

After carrying out the data-driven algorithm with sufficient iteration steps, the practical near-optimal control law written as $\hat{u}^{*}(x(k))$ is derived, which is distinguished from the unknown ideal optimal controller $u^{*}(x(k))$. Therefore, we can apply this practical control policy to appropriate plants for attaining the desired control performance.
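To make the gradient rule (8.25) concrete, the sketch below performs one critic update for the single-hidden-layer structure (8.21) with a tanh activation, where the target value $J^{(i+1)}(x(k))$ is assumed to have been computed from (8.9) via the model network. The array names and shapes are assumptions of this sketch, and the action network would be updated analogously with $E_3(k)$ from (8.24).

```python
import numpy as np

def critic_gradient_step(omega2, nu2, x, J_target, alpha2=0.07):
    """One update of the critic weights via (8.21), (8.22), and (8.25).

    omega2: hidden-to-output weights, shape (h, 1); nu2: input-to-hidden
    weights, shape (n, h). Illustrative sketch with a tanh activation.
    """
    z = nu2.T @ x                       # hidden pre-activation
    sigma = np.tanh(z)                  # hidden output
    J_hat = float(omega2.T @ sigma)     # critic output, cf. (8.21)
    err = J_hat - J_target              # error defining E_2(k) in (8.22)

    grad_omega2 = err * sigma.reshape(-1, 1)                            # dE2/d(omega2)
    grad_nu2 = err * np.outer(x, (1.0 - sigma**2) * omega2.ravel())     # dE2/d(nu2)

    omega2 = omega2 - alpha2 * grad_omega2      # gradient rule (8.25a)
    nu2 = nu2 - alpha2 * grad_nu2               # gradient rule (8.25b)
    return omega2, nu2
```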

8.4 Application to the Proposed Wastewater Treatment Plant

In this section, we specify the parameters of the wastewater treatment plant given in Sect. 8.2, i.e., BSM1, and conduct data-driven IAC control involving comparative experiments with the incremental PID method. It will be observed that for this plant, the system state and the control input are both two-dimensional vectors (i.e., n = m = 2). Besides, the system function of the proposed control problem is unknown.


In the following, three stages are included for the data-driven IAC control design of the practical plant.

First, we define suitable state and control vectors for describing the optimal feedback regulation under the wastewater treatment application background. Let $\bar{x}(k) = [\bar{x}_1(k), \bar{x}_2(k)]^{T}$, where $\bar{x}_1(k)$ and $\bar{x}_2(k)$ are practical values of S_O,5 and S_NO,2 measured at the time step k. Besides, we let $\bar{x}_s = [\bar{x}_{s1}, \bar{x}_{s2}]^{T}$, where $\bar{x}_{s1} = 2$ (mg/L) and $\bar{x}_{s2} = 1$ (mg/L) are setting points associated with S_O,5 and S_NO,2 (Alex et al. 2008). We define $x(k) = \bar{x}(k) - \bar{x}_s$ as the state vector of the plant, implying that $x_1(k) = \bar{x}_1(k) - \bar{x}_{s1}$ and $x_2(k) = \bar{x}_2(k) - \bar{x}_{s2}$. Here, we aim at designing a practical control law to make sure that x(k) approaches zero gradually.

Remark 8.3 Incidentally, we can also regard the above process as a tracking control problem, where the reference signal is a constant vector. Note that the accurate dynamics of the controlled plant is unknown, which results in a data-driven tracking control design. Hence, x(k) can also be regarded as the tracking error, in the sense of trajectory tracking.

We let $\bar{u}_1(k)$ and $\bar{u}_2(k)$ be the practical actions related to K_L a_5 and Q_a at the time step k, so that the control vector of the plant is $\bar{u}(k) = [\bar{u}_1(k), \bar{u}_2(k)]^{T}$. In order to make S_O,5 and S_NO,2 approximate the constant setting points $\bar{x}_{s1}$ and $\bar{x}_{s2}$, we define the steady control corresponding to the setting points as $\bar{u}_s = [\bar{u}_{s1}, \bar{u}_{s2}]^{T}$. Then, we further define the control increment as $u(k) = [u_1(k), u_2(k)]^{T}$, where $u_1(k) = \bar{u}_1(k) - \bar{u}_{s1}$ and $u_2(k) = \bar{u}_2(k) - \bar{u}_{s2}$. Note that u(k) is just the control law for stabilizing the tracking error, namely x(k). Thus, x(k) and u(k) are regarded as the state and control vectors of the transformed tracking control problem, where the corresponding state space model $F(\cdot,\cdot)$ is unknown. Note that it is difficult to obtain the detailed value of $\bar{u}_s$ by using the BSM1 directly. Hence, we construct a neural identifier of the BSM1, making sure that $\bar{u}_s$ can be computed through the mathematical expression of this neural network. Similar to (Wang et al. 2012), here the steady control input needs to satisfy the following equation:

$$\bar{x}_s = \hat{F}(\bar{x}_s, \bar{u}_s), \tag{8.26}$$

where $\hat{F}(\cdot,\cdot)$ denotes the approximate relationship reflecting the dynamics of the BSM1. Since the setting point vector $\bar{x}_s$ is known, the parameter $\bar{u}_s$ can be obtained via this expression. We choose a three-layer feedforward neural network with the structure 4–8–2 (input–hidden–output). The input vector of the neural identifier is $[\bar{x}^{T}(k), \bar{u}^{T}(k)]^{T}$ and the output is $\hat{\bar{x}}(k+1)$. Similar to the form of (8.20), the training error of this practical problem is taken as follows:

$$\bar{E}_1(k) = \frac{1}{2}\big[(\hat{\bar{x}}_1(k) - \bar{x}_1(k))^2 + (\hat{\bar{x}}_2(k) - \bar{x}_2(k))^2\big]. \tag{8.27}$$


Fig. 8.2 The training performance of S O,5 and S N O,2 : a S O,5 ; b S N O,2

Then, we train this identifier for 300 iteration steps with 16000 data samples and test them with 10000 data samples by using the MATLAB neural network toolbox with the learning rate α1 = 0.02. The training and testing results are given in Figs. 8.2–8.5. In detail, the training performance of SO,5 and S N O,2 is shown in Fig. 8.2 while the training error of the neural identifier is displayed in Fig. 8.3. The testing performance of SO,5 and S N O,2 is shown in Fig. 8.4 while the testing error of the neural identifier is depicted in Fig. 8.5. In Figs. 8.2 and 8.4, the solid green lines denote the neural-network-based outputs while the dotted pink lines denote the practical values of the BSM1. When the neural identifier is trained completely, it will not be updated anymore. Then, we use the “fsolve” function of MATLAB to solve the nonlinear Eq. (8.26). In fact, since the setting points x¯s1 and x¯s2 are invariant values, u¯ s1 and u¯ s2 are also constant. Here, the steady control input is derived as u¯ s = [204.8, 52940]T , which is useful to acquire the practical controller of the wastewater treatment plant. Second, we implement the data-driven IAC control method. Considering the application background, the cost function is set as (8.2), where the utility matrices are selected as Q = 0.01I2 and R = 0.01I2 . From the practical plant, we can observe that the initial values of SO,5 and S N O,2 are x¯1 (0) = 0.5 and x¯2 (0) = 3.7, respectively, so that the initial state vector of the transformed problem is x(0) = [−1.5, 2.7]T . For clarity, the data-driven IAC control framework towards the wastewater treatment plant is given in Fig. 8.6, where three kinds of neural networks are included. Note the composite control law applied to the practical biochemical reaction tank is obtained


Fig. 8.3 The training error of the neural identifier


Fig. 8.4 The testing performance of S O,5 and S N O,2 : a S O,5 ; b S N O,2


Fig. 8.5 The testing error of the neural identifier

Fig. 8.6 The data-driven IAC control framework towards the proposed wastewater treatment plant, including the model network, the critic network, and the action network (x¯s denotes the setting vector associated with S O,5 and S N O,2 while u¯ s denotes the computed steady control input)


Table 8.2 Structure of the critic and action networks

Network category | Learning rate | Network structure
Critic network   | α2 = 0.07     | 2–8–1
Action network   | α3 = 0.2      | 2–10–2


Fig. 8.7 The convergence trend of the iterative cost function with data-driven IAC

by the IAC controller plus the steady control input u¯ s . In the sequel, we describe how to train the three networks. We construct the model network with the state vector x(k) and the control signal ˆ + 1). u(k). The input of the model network is [x T (k), u T (k)]T while the output is x(k After the training process is completed with the same method as the previous neural identifier, their weights are kept unchanged. Then, setting learning rates and network structures as in Table 8.2, we train the critic network and the action network for 1428 iteration steps with each iteration of 1000 training steps, in order to make sure the given error bound 10−8 is reached. Via conducting the IAC algorithm, the evolution trend of the iterative cost function is provided in Fig. 8.7, which verifies the theoretical convergence result. Using the converged weights, we can derive the IAC control law that should be applied to the wastewater treatment plant. Clearly, the control trajectory with respect to the transformed system is described in Fig. 8.8, which shows the good performance of the IAC control strategy.


Fig. 8.8 The control input of the transformed system with data-driven IAC: a u 1 ; b u 2

Remark 8.4 It is observed that we construct a neural identifier and a model network in the previous two stages. Though possessing same network structures, the former is used for identifying the original platform while the latter is taken to model the transformed system. Actually, both of them can be called neural identifiers or model networks. Third, we evaluate the control performance of the aforementioned IAC strategy. As indicated in Fig. 8.6, it should be noticed that the practical controller applied to the BSM1 is u(k) ¯ = u(k) + u¯ s , where u(k) is derived approximately after the IAC learning process and, actually, it can be denoted as uˆ ∗ (x(k)). For comparison, we also employ the traditionally incremental PID method to control the wastewater treatment system. According to the parameter settings of the BSM1 (Alex et al. 2008), the initial control vector used to the original platform before applying the PID approach is [84, 55338]T . The control performances under data-driven IAC and incremental PID methods are given in Figs. 8.9–8.11. Here, the comparison results between IAC and PID strategies are provided. The state trajectory and the control input of the original system are depicted in Figs. 8.9 and 8.10, respectively. In Fig. 8.9, the dotted lines labeled as “Con2” and “Con1” represent the constant setting points with respect to SO,5 and S N O,2 . For clarity, the state trajectory of the transformed system, i.e., the tracking error, with the above two methods is shown in Fig. 8.11.


Fig. 8.9 The state trajectory of the original system: a S_O,5; b S_NO,2

From these plots, we find that the convergence speed of the IAC algorithm is faster than that of the incremental PID approach. Besides, the state trajectories of IAC possess less oscillation than those of the incremental PID method. Furthermore, there exists an apparent saturation phenomenon in the control curve of the incremental PID approach, while it does not exist for IAC. Consequently, the data-driven IAC strategy is effective in terms of response speed and control quality.

Remark 8.5 The practical controller is actually a composite policy, containing the steady control input and the part derived from the data-driven IAC algorithm. Hence, we should not only pay attention to the IAC control design, but also to how to compute the steady control vector, which reflects one of the main contributions of this chapter compared with (Al-Tamimi et al. 2008; Zhang et al. 2009; Wang et al. 2012; Bo and Qiao 2015; Qiao et al. 2018; Wang et al. 2012). It means that two different control inputs are considered during the composite formulation design, so as to achieve the desired performance towards the wastewater treatment plant.


Fig. 8.10 The control input of the original system: a K L a5 ; b Q a


Fig. 8.11 The state trajectory of the transformed system: a x_1; b x_2


Fig. 8.12 The simple structure of the wastewater treatment platform involving mixed driven intelligent critic control design

8.5 Revisiting Wastewater Treatment via Mixed Driven NDP

In this section, we revisit and address the above wastewater treatment problem by introducing the event-driven mechanism, i.e., using the mixed driven control framework highlighted in Chap. 3. Specifically, different from HDP, the mixed driven NDP technique is integrated into the IAC structure, in order to solve the proposed wastewater treatment problem. The purpose of this part is to construct a practical mixed driven intelligent critic control framework displayed in Fig. 8.12. By setting the network structures and learning rates the same as in the previous section, we apply the mixed driven intelligent critic control algorithm to the transformed optimal regulation problem, i.e., the original tracking control problem. Note that for the event-driven design, the triggering threshold is also chosen as (3.54). After a sufficient iterative process, the involved neural networks are well trained. Based on the converged weights, the triggering threshold and the tracking control input are shown in Figs. 8.13 and 8.14, respectively. Similar to Chap. 3, we also let the control updating times of the event-based and the corresponding time-based formulations be denoted as T1 and T2. Based on the conducted experiment, it is observed that T1 = 64 and T2 = 400, showing that the control updating times are evidently reduced. From these plots, we can also observe excellent application results for treating wastewater in terms of the dissolved oxygen concentration and the nitrate level. Note that the tracking control input derived from mixed driven NDP is not the same as that of the previous data-driven IAC approach, which is based on the HDP technique. Actually, for general control performance, there always exist some distinctions when using HDP, DHP, GDHP, and NDP techniques. As discussed in Chap. 1, different approaches have their own shortcomings and advantages. Anyway, they are all helpful to establish learning-based intelligent wastewater treatment systems.


Fig. 8.13 The triggering threshold with mixed driven NDP


Fig. 8.14 The control input of the transformed system with mixed driven NDP: a u 1 ; b u 2


8.6 Conclusions

For solving the nonlinear optimal feedback control problem, a data-driven IAC strategy has been constructed in this chapter and then applied to a typical wastewater treatment system. By utilizing the IAC control framework, the dissolved oxygen concentration and the nitrate level have both been maintained at their desired setting points, with faster response and less oscillation. The previous mixed driven control framework has also been applied to this wastewater treatment problem. Developing such an intelligent controller has been proved to be effective in accomplishing nonlinear optimization and wastewater recycling.

References Alex J, Benedetti L, Copp J, Gernaey KV, Jeppsson U, Nopens I, Pons MN, Rieger L, Rosen C, Steyer JP, Vanrolleghem P, Winkler S (2008) Benchmark simulation model no. 1 (BSM1). IWA task group on benchmarking of control strategies for WWTPs, London Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof. IEEE Trans Syst Man Cybern-Part B: Cybern 38(4):943–949 Bo Y, Qiao J (2015) Heuristic dynamic programming using echo state network for multivariable tracking control of wastewater treatment process. Asian J Control 17(5):1654–1666 Dhar NK, Verma NK, Behera L (2018) Adaptive critic-based event-triggered control for HVAC system. IEEE Trans Ind Inf 14(1):178–188 Farahani SS, Soudjani S, Majumdar R, Ocampo-Martinez C (2018) Formal controller synthesis for wastewater systems with signal temporal logic constraints: The Barcelona case study. J Process Control 69:179–191 Ha M, Wang D, Liu D (2020) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern: Syst 50(9):3158–3168 Han H, Liu Z, Hou Y, Qiao J (2020) Data-driven multiobjective predictive control for wastewater treatment process. IEEE Trans Ind Inf 16(4):2767–2775 Han H, Qiao J (2014) Nonlinear model-predictive control for industrial processes: An application to wastewater treatment process. IEEE Trans Ind Electron 61(4):1970–1982 Han H, Wu X, Qiao J (2019) A self-organizing sliding-mode controller for wastewater treatment processes. IEEE Trans Control Syst Technol 27(4):1480–1491 Hou J, Wang D, Liu D, Zhang Y (2019) Model-free optimal tracking control of constrained nonlinear systems via an iterative adaptive learning algorithm. IEEE Trans Syst Man Cybern: Syst 50(11):4097–4108 Iratni A, Chang NB (2019) Advances in control technologies for wastewater treatment processes: Status, challenges, and perspectives. IEEE/CAA J Automat Sinica 6(2):337–363 Liang M, Wang D, Liu D (2020) Improved value iteration for neural-network-based stochastic optimal control design. Neural Networks 124:280–295 Li J, Chai T, Lewis FL, Ding Z, Jiang Y (2019) Off-policy interleaved Q-learning: Optimal control for affine nonlinear discrete-time systems. IEEE Trans Neural Networks Learn Syst 30(5):1308– 1320 Qiao J, Wang Y, Chai W (2018) Optimal control based on iterative ADP for wastewater treatment process. J Beijing Univ Technol 44(2):200–206 Revollar S, Vega P, Vilanova R, Francisco M (2017) Optimal control of wastewater treatment plants using economic-oriented model predictive dynamic strategies. Appl Sci 7(8):1–21


Si J, Wang YT (2001) On-line learning control by association and reinforcement. IEEE Trans Neural Networks 12(2):264–276 Wang D, Ha M, Qiao J (2021) Data-driven iterative adaptive critic control towards an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78(1):14–22 Werbos U, PJ, (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive approaches (chapter 13). Van Nostrand Reinhold, New York Yang T, Qiu W, Ma Y, Chadli M, Zhang L (2014) Fuzzy model-based predictive control of dissolved oxygen in activated sludge processes. Neurocomputing 136:88–95 Yang X, Liu D, Wei Q, Wang D (2015) Direct adaptive control for a class of discrete-time unknown nonaffine nonlinear systems using neural networks. Int J Robust Nonlinear Control 25:1844–1861 Zhang H, Jiang H, Luo Y, Xiao G (2017) Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans Ind Electron 64(5):4091–4100 Zhang H, Liu Y, Xiao G, Jiang H (2020) Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Trans Syst Man Cybern: Syst 50(2):432–441 Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discretetime affine nonlinear systems with control constraints. IEEE Trans Neural Networks 20(9):1490– 1503 Zhong X, Ni Z, He H (2016) A theoretical foundation of goal representation heuristic dynamic programming. IEEE Trans Neural Networks Learn Syst 27(12):2513–2525

Chapter 9

Constrained Neural Optimal Tracking Control with Wastewater Treatment Applications

Abstract The wastewater treatment is an effective method for alleviating the shortage of water resources. In this chapter, a data-driven iterative adaptive tracking control approach is developed to improve the control performance of the dissolved oxygen concentration and the nitrate nitrogen concentration in the nonlinear wastewater treatment plant. First, the model network is established to obtain the steady control and evaluate the new system state. Then, a nonquadratic performance functional is provided to handle asymmetric control constraints. Moreover, the new costate function and the tracking control policy are derived by using the dual heuristic dynamic programming algorithm. In the present control scheme, two neural networks are constructed to approximate the costate function and the tracking control law. Finally, the feasibility of the proposed algorithm is confirmed by applying the designed strategy to the wastewater treatment plant. Keywords Adaptive critic · Asymmetric control constraints · Intelligent optimal tracking · Nonlinear control · Wastewater treatment

9.1 Introduction

Nowadays, the shortage and pollution of fresh water resources are becoming more and more serious. Meanwhile, wastewater treatment plays an important role in improving water quality. In the past few years, various methods have been proposed to deal with complex optimization and control problems in wastewater treatment processes (Qiao and Zhang 2018; Han et al. 2019, 2020; Zhang and Qiao 2020; Bo and Qiao 2015; Bo and Zhang 2018). However, wastewater treatment is a highly nonlinear industrial process with strong coupling, large time-varying behavior, and strong interference. In addition, an accurate mathematical mechanism model is difficult to establish, which increases the difficulty of wastewater treatment control. These features require that the designed controller be independent of the system model and effective for complex nonlinear dynamics. As revealed in (Bo and Qiao 2015), the dissolved oxygen concentration in the aerated section and the nitrate nitrogen concentration in the anoxic section have important effects on nitrogen removal.


Generally speaking, the two main controlled variables of wastewater treatment, namely the concentrations of dissolved oxygen and nitrate nitrogen, are expected to be maintained at their optimal set values. Moreover, the relevant control parameters need to be constrained to a certain range due to energy consumption considerations. Consequently, designing an intelligent method to effectively control these two variables is critical for the wastewater treatment process. Owing to poor adaptive ability and fixed parameters, traditional controller design techniques are not suited to the complex characteristics of wastewater and can result in performance degradation. Therefore, many researchers have conducted adaptive control and intelligent control designs in the past decades (Niu et al. 2019, 2021; Sui et al. 2020, 2021). Especially, reinforcement learning based on the action-critic structure has attracted intensive attention because of the idea of agent–environment interaction (Wen et al. 2019, 2020, 2018b). As an advanced control method with strong adaptive ability, adaptive dynamic programming (ADP) integrates reinforcement learning, dynamic programming, and function approximation (Liu et al. 2013), which shows great potential in solving optimal control problems of nonlinear systems. Generally, researchers use the ADP technique to obtain the approximate optimal solution of nonlinear Hamilton–Jacobi–Bellman (HJB) equations, since it is formidable to obtain analytical solutions. The ADP mechanism mainly includes six structures, among which heuristic dynamic programming (HDP) and dual HDP (DHP) are the most widely used (Prokhorov and Wunsch 1997; Werbos 1992). Nowadays, the ADP mechanism has been extensively studied and applied to address optimal regulation (Wen et al. 2018a; Wang et al. 2020a), tracking control (Hou et al. 2020; Zhang et al. 2008; Wang et al. 2012), control constraints (Abu-Khalaf and Lewis 2005; Modares et al. 2013), robust stabilization (Wang et al. 2020b; Wang and Liu 2018), power system control (Liang et al. 2012), and so on.

In this chapter, the iterative ADP approach is introduced for the wastewater treatment plant, concentrating on two issues: optimal tracking and control constraints (Wang et al. 2021b). For finite-horizon optimal tracking, the control strategy for completely unknown discrete-time nonlinear affine systems was elaborated by using the HDP algorithm in (Song et al. 2019). In addition, via the ADP approach, the infinite-time optimal tracking control scheme for discrete-time nonlinear systems was elucidated in (Kiumarsi and Lewis 2015). It should be noted that the above optimal tracking control results are limited to affine systems. However, due to the lack of an accurate mathematical model, wastewater treatment systems can only be considered in nonaffine form. Therefore, a data-driven neural identifier was constructed to facilitate the implementation of adaptive critic control for the nonaffine plant, as described in (Wang et al. 2021a). Hence, it is worth investigating how to design a tracking controller so that the controlled state can track the expected value for the wastewater treatment plant. For the actuator constraint problem, an excellent algorithm for finding the near-optimal controller with convergence analysis was presented in (Zhang et al. 2009).
By introducing the event-triggered mechanism, the approximate optimal solutions for a class of constrained nonlinear discrete-time affine and nonaffine systems were investigated in (Ha et al. 2020a, b). However, there are no systematic results for asymmetric constraint problems, in which the upper bound and the lower bound are not equal in magnitude. By adopting the HDP technology, a nonquadratic function for solving asymmetric control constraints was developed in (Song et al. 2013), but the corresponding asymmetric simulation result was not provided. Moreover, the control inputs of the wastewater treatment platform take very large values, which increases the computational burden during the iterative process, and the integral term in the costate function is more difficult to compute when asymmetric constraints are considered. It should be noted that the control inputs of the wastewater treatment plant are constrained to an asymmetric range. Hence, we shall provide a new approach to obtain the control law by using the DHP framework. Remarkably, it is the first time that asymmetric control constraints are investigated for the wastewater treatment plant by introducing the DHP algorithm. Under asymmetric control constraints, the new costate function and the tracking control law based on the DHP algorithm are given to control the dissolved oxygen concentration and the nitrate nitrogen concentration in the wastewater treatment plant. By collecting the input and output data, the proposed method shows how to obtain the steady control for the wastewater treatment plant with unknown system dynamics. The improved iterative process ensures that the experiment can be carried out with multiple initial states instead of only one initial state.

9.2 Problem Statement with Asymmetric Control Constraints

Considering the common Benchmark Simulation Model No. 1 (BSM1) (Alex et al. 2008), the intelligent ADP control method is proposed for the wastewater treatment process. At present, the activated sludge process is generally used to treat wastewater in BSM1. The model of the activated sludge treatment process is mainly composed of two anoxic sections, three aerated sections, and a secondary sedimentation tank, where the anoxic sections and aerated sections are collectively regarded as the biochemical reaction component. Here, the simple structure diagram of BSM1 with the biochemical reaction tank and the secondary sedimentation tank is shown in Fig. 9.1. The biochemical reaction tank is divided into five units, i.e., the first to the fifth unit from left to right. After a series of nitrification and denitrification reactions, the wastewater from the biochemical reaction tank enters the secondary sedimentation tank. Then the separated sludge is discharged or returned to the anoxic area as the carrier of the biochemical reaction, and the separated water can be directly discharged into the river. During the wastewater treatment process, it is shown that the dissolved oxygen concentration S_{O,5} in the fifth unit and the nitrate nitrogen concentration S_{NO,2} in the second unit of the biochemical reaction tank directly affect the effluent quality (Bo and Qiao 2015). Therefore, the main purpose is to control S_{O,5} through the oxygen transfer coefficient K_{La,5} in the fifth unit, and to control S_{NO,2} through the internal recycle Q_a in the second unit. Specifically, S_{O,5} and S_{NO,2} are required


Fig. 9.1 The schematic diagram of BSM1 involving feedback controllers, the biochemical reaction tank, and the secondary sedimentation tank

to track the desired points, and the control parameters K_{La,5} and Q_a are constrained within a certain range, which is a constrained tracking design problem.

In this chapter, we assume that the nonlinear dynamics of the above wastewater treatment problem belongs to the form

x(k+1) = F(x(k), u(k)),  (9.1)

where x(k) ∈ R^n is the state vector, u(k) ∈ R^m is the control vector, and F(·,·) is differentiable in its arguments with F(0, 0) = 0. Note that x(0) is the initial state. Assume that system (9.1) can be stabilized on the set Ω ⊂ R^n by the feedback controller u(x(k)). Define u(x(k)) = [u_1(x(k)), u_2(x(k)), …, u_m(x(k))]^T and u_j^{min}(x) ≤ u_j(x(k)) ≤ u_j^{max}(x), j = 1, 2, …, m, where u_j^{min}(x) is the lower bound and u_j^{max}(x) is the upper bound of the jth actuator. In the following, we design the optimal tracking controller of the plant with constrained inputs. The reference trajectory is defined as follows:

r(k+1) = Ψ(r(k)),  (9.2)

where r(k+1) ∈ R^n and Ψ(·) is differentiable. The main goal is to design the feedback controller u(x(k)) that makes the system state x(k) track the reference trajectory r(k). Meanwhile, the control vector u(x(k)) is constrained to the appropriate range. It is desired to find a steady control vector u(r(k)) = [u_1(r(k)), u_2(r(k)), …, u_m(r(k))]^T with respect to the reference trajectory, which satisfies r(k+1) = F(r(k), u(r(k))) and can be solved by a mathematical method. For convenience, we assume that the steady control is expressed as

u(r(k)) = Υ(r(k)).  (9.3)


In order to convert the tracking problem into a regulation problem, we define the tracking error as

ẽ(k) = x(k) − r(k).  (9.4)

The tracking error ẽ(k) = 0 when perfect tracking is realized with r(k) = x(k). The corresponding tracking control is computed by

u(ẽ(k)) = u(x(k)) − u(r(k)),  (9.5)

where u_j^{min}(ẽ) ≤ u_j(ẽ(k)) ≤ u_j^{max}(ẽ). In view of the form of (9.5), we can find that u_j^{min}(ẽ) and u_j^{max}(ẽ) are related to u(x(k)) and u(r(k)), and are given by u_j^{min}(ẽ) = u_j^{min}(x) − u_j(r(k)) and u_j^{max}(ẽ) = u_j^{max}(x) − u_j(r(k)). By substituting the above equations into (9.4), one has

ẽ(k+1) = x(k+1) − r(k+1) = F(ẽ(k) + r(k), u(ẽ(k)) + Υ(r(k))) − Ψ(r(k)).  (9.6)
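To make the transformation (9.4)–(9.6) concrete, a minimal Python sketch is given below; the plant F, the reference map Ψ, and the steady-control map Υ used here are hypothetical stand-ins chosen only so that the script runs, not the wastewater dynamics.

```python
import numpy as np

def F(x, u):
    # Hypothetical nonaffine plant standing in for F(x(k), u(k)) in (9.1).
    return np.tanh(x) + 0.5 * np.tanh(u)

def Psi(r):
    # Hypothetical reference dynamics (9.2); here a constant set point.
    return r

def Upsilon(r):
    # Hypothetical steady control satisfying r(k+1) = F(r(k), u(r(k))) for this toy plant.
    return np.arctanh(2.0 * (r - np.tanh(r)))

def error_step(e, u_e, r):
    # Tracking-error dynamics (9.6): e~(k+1) = F(e~(k)+r(k), u(e~(k))+Upsilon(r(k))) - Psi(r(k)).
    return F(e + r, u_e + Upsilon(r)) - Psi(r)

r = np.array([0.3, 0.1])                 # reference point
x = np.array([0.8, -0.4])                # current state
e = x - r                                # tracking error (9.4)
print(error_step(e, np.zeros(2), r))     # error propagation under zero tracking control
```

With u(ẽ(k)) = 0 the printed error simply propagates under the steady control, which is the situation the iterative design below starts from.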

According to (9.2) and (9.6), a new augmented system is written as

X(k+1) = Φ(X(k), u(ẽ(k))),  (9.7)

where X(k) = [ẽ^T(k), r^T(k)]^T ∈ R^{2n} is the 2n-dimensional state vector and u(ẽ(k)) ∈ R^m is the m-dimensional control vector. Next, we define the cost function as follows:

J(X(k), u(ẽ(k))) = Σ_{l=k}^{∞} U(X(l), u(ẽ(l))).  (9.8)

In the following, a block matrix is given in the state-based utility because of the introduction of the augmented state. Inspired by the pioneering work in (Ha et al. 2020), we define the specified form of the utility function and further obtain

U(X(l), u(ẽ(l))) = X^T(l) [Q 0; 0 0] X(l) + W(u(ẽ(l))) = ẽ^T(l)Q ẽ(l) + W(u(ẽ(l))),  (9.9)

where Q ∈ R^{n×n} is a positive definite matrix and W(u(ẽ(l))) ∈ R is positive definite. It is clear that the utility function of (9.9) is determined by ẽ(l) and u(ẽ(l)). Therefore, the cost function of the augmented system is rewritten as follows:

J(ẽ(k), u(ẽ(k))) = Σ_{l=k}^{∞} [ ẽ^T(l)Q ẽ(l) + W(u(ẽ(l))) ].  (9.10)


In the following, the nonquadratic functional W(u(ẽ(l))) is offered for the asymmetric-constrained problem. According to (Song et al. 2013), the nonquadratic functional W(u(ẽ(l))) is defined as

W(u(ẽ(l))) = 2 ∫_0^{u(ẽ(l))} ϕ^{−T}(s) R ds,  s ∈ R^m,  (9.11)

where ϕ^{−1}(u(ẽ(l))) = [ϕ^{−1}(u_1(ẽ(l))), ϕ^{−1}(u_2(ẽ(l))), …, ϕ^{−1}(u_m(ẽ(l)))]^T and s = [s_1, s_2, …, s_m]^T. Moreover, R is the positive definite diagonal matrix given by R = diag{r_{11}, r_{22}, …, r_{mm}}. Note that ϕ(·) is a bounded one-to-one function satisfying |ϕ(·)| ≤ 1 and belonging to C^p (p ≥ 1) and L_2(Ω) (Zhang et al. 2009). Furthermore, the nonquadratic functional W(u(ẽ(l))) is expanded as follows:

W(u(ẽ(l))) = 2 ∫_0^{u_1(ẽ(l))} ϕ^{−1}(s_1) r_{11} ds_1 + 2 ∫_0^{u_2(ẽ(l))} ϕ^{−1}(s_2) r_{22} ds_2 + ⋯ + 2 ∫_0^{u_m(ẽ(l))} ϕ^{−1}(s_m) r_{mm} ds_m.  (9.12)

To solve the problem of asymmetric constraints, the function ϕ(δ) is defined as follows:

ϕ(δ) = (e^δ − e^{−δ}) / (α_j e^δ − β_j e^{−δ}),  (9.13)

where α_j = 1/u_j^{max}(ẽ), β_j = 1/u_j^{min}(ẽ), and δ ∈ R. Furthermore, the inverse function of ϕ(δ) is solved as follows:

ϕ^{−1}(δ) = (1/2) ln[ (1 − β_j δ)/(1 − α_j δ) ].  (9.14)
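As a quick numerical illustration of (9.11)–(9.14) for a single input channel, the following hedged sketch evaluates ϕ, its inverse, and the nonquadratic term W(u) by quadrature; the bounds u_max and u_min are arbitrary illustrative numbers, not plant values.

```python
import numpy as np
from scipy.integrate import quad

u_max, u_min = 82.0, -158.0               # asymmetric bounds of one channel (illustrative values)
alpha, beta = 1.0 / u_max, 1.0 / u_min    # alpha_j = 1/u_j^max(e~), beta_j = 1/u_j^min(e~)
r_jj = 1.0                                # a diagonal entry of R

def phi(delta):
    # Asymmetric saturation function (9.13); its range is (u_min, u_max) and phi(0) = 0.
    return (np.exp(delta) - np.exp(-delta)) / (alpha * np.exp(delta) - beta * np.exp(-delta))

def phi_inv(s):
    # Inverse function (9.14).
    return 0.5 * np.log((1.0 - beta * s) / (1.0 - alpha * s))

def W(u):
    # Single-channel nonquadratic term from (9.11)/(9.12), evaluated by numerical quadrature.
    val, _ = quad(lambda s: phi_inv(s) * r_jj, 0.0, u)
    return 2.0 * val

print(phi(3.0), phi_inv(phi(3.0)))        # the second value recovers 3.0
print(W(50.0), W(-50.0))                  # the two values differ because the bounds are asymmetric
```

The two printed values of W differ, reflecting the asymmetry of the admissible control range; evaluating such integrals at every iteration is exactly the burden that the DHP formulation below is introduced to reduce.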

At each iteration step, it is a large computing burden to obtain ϕ^{−1}(s) and the integral term of (9.11). Therefore, how to reduce the calculation pressure is the key point to be considered.

Remark 9.1 In light of (9.13), we find that u_j^{min}(ẽ) ≠ 0 and u_j^{max}(ẽ) ≠ 0; otherwise, α_j and β_j tend to infinity. Actually, the control input u(x(k)) satisfies u_j^{min}(x) ≤ u_j(r(k)) ≤ u_j^{max}(x). Hence, we can derive the inequalities u_j^{min}(ẽ) = u_j^{min}(x) − u_j(r(k)) ≤ 0 and u_j^{max}(ẽ) = u_j^{max}(x) − u_j(r(k)) ≥ 0. Herein, we let u_j^{min}(ẽ) = τ when u_j^{min}(ẽ) = 0, where τ is an arbitrarily small negative number. In the same way, we let u_j^{max}(ẽ) = −τ when u_j^{max}(ẽ) = 0.

Recalling the well-known Bellman's optimality principle, the optimal cost function satisfies the following HJB equation:

J*(ẽ(k)) = min_{u(·)} Σ_{l=k}^{∞} [ ẽ^T(l)Q ẽ(l) + 2 ∫_0^{u(ẽ(l))} ϕ^{−T}(s) R ds ]
         = min_{u(ẽ(k))} { ẽ^T(k)Q ẽ(k) + 2 ∫_0^{u(ẽ(k))} ϕ^{−T}(s) R ds + J*(ẽ(k+1)) }.  (9.15)

The optimal tracking control u*(ẽ(k)) is computed by

u*(ẽ(k)) = arg min_{u(ẽ(k))} { ẽ^T(k)Q ẽ(k) + 2 ∫_0^{u(ẽ(k))} ϕ^{−T}(s) R ds + J*(ẽ(k+1)) }.  (9.16)

The J*(ẽ(k)) and u*(ẽ(k)) can be obtained by solving Eqs. (9.15) and (9.16). Then, we can derive the optimal control law as

u*(x(k)) = u*(ẽ(k)) + u(r(k)).  (9.17)

In general, the HJB equation cannot be solved precisely. Therefore, in the following, the iterative ADP algorithm is utilized to derive the near-optimal control solution.

9.3 Intelligent Optimal Tracking Design

This section consists of two subsections. In the first subsection, we show how to derive the iterative ADP algorithm. In the second subsection, the data-driven iterative DHP algorithm is used to obtain the near-optimal tracking control law for nonaffine constrained tracking systems.

9.3.1 Description of the Iterative ADP Algorithm

Since the analytic solution of the HJB equation is difficult to obtain, in the following, the iterative ADP algorithm is provided to obtain its numerical solution. We construct two iterative sequences {J^{(i)}(ẽ(k))} and {u^{(i)}(ẽ(k))}, and introduce a small positive number ς, where i denotes the iteration index. Let i = 0 and the initial cost function J^{(0)}(·) = 0. Since J^{(0)}(ẽ(k)) = 0, we can obtain u^{(0)}(ẽ(k)) = 0 and J^{(1)}(ẽ(k)) = ẽ^T(k)Q ẽ(k). Furthermore, for i = 1, 2, …, the iterative ADP algorithm iterates between

u^{(i)}(ẽ(k)) = arg min_{u(ẽ(k))} { ẽ^T(k)Q ẽ(k) + 2 ∫_0^{u(ẽ(k))} ϕ^{−T}(s) R ds + J^{(i)}(ẽ(k+1)) },  (9.18)


and

J^{(i+1)}(ẽ(k)) = min_{u(ẽ(k))} { ẽ^T(k)Q ẽ(k) + 2 ∫_0^{u(ẽ(k))} ϕ^{−T}(s) R ds + J^{(i)}(ẽ(k+1)) }
              = ẽ^T(k)Q ẽ(k) + 2 ∫_0^{u^{(i)}(ẽ(k))} ϕ^{−T}(s) R ds + J^{(i)}(ẽ(k+1)),  (9.19)

where the involved state ẽ(k+1) is given by (9.6). Note that the iterative process is stopped when

|J^{(i+1)}(ẽ(k)) − J^{(i)}(ẽ(k))| < ς.  (9.20)
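For intuition about the recursion (9.18)–(9.20), the hedged sketch below runs a few sweeps on a scalar toy error system, performing the inner minimization by brute-force search over a bounded control grid; the dynamics, bounds, and grids are illustrative assumptions, and the neural-network implementation described later is not used here.

```python
import numpy as np

# A hedged sketch of a few sweeps of (9.18)-(9.20) on a scalar toy error system. The inner
# minimization is a brute-force search over a bounded control grid; the dynamics, the
# asymmetric bounds, and the grids are illustrative assumptions, not the wastewater plant.
f = lambda e, u: 0.8 * e + 0.5 * u
u_min, u_max = -1.5, 0.5
alpha, beta = 1.0 / u_max, 1.0 / u_min
Q, r11, zeta = 1.0, 1.0, 1e-6

phi_inv = lambda s: 0.5 * np.log((1.0 - beta * s) / (1.0 - alpha * s))

def W(u):
    # Nonquadratic control utility (9.11)/(9.12) for one channel, via a crude Riemann sum.
    s = np.linspace(0.0, u, 400)
    return 2.0 * np.sum(phi_inv(s) * r11) * (s[1] - s[0])

e_grid = np.linspace(-2.0, 2.0, 81)
u_grid = np.linspace(0.95 * u_min, 0.95 * u_max, 41)   # stay strictly inside the bounds
W_grid = np.array([W(u) for u in u_grid])

J = np.zeros_like(e_grid)                              # J^(0)(.) = 0
for i in range(300):
    e_next = f(e_grid[:, None], u_grid[None, :])
    targets = Q * e_grid[:, None] ** 2 + W_grid[None, :] + np.interp(e_next, e_grid, J)
    J_new = targets.min(axis=1)
    if np.max(np.abs(J_new - J)) < zeta:               # stopping rule (9.20)
        break
    J = J_new

print(i, J[40], u_grid[targets.argmin(axis=1)][40])    # cost and control at e~ = 0
```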

In the following, the convergence of the iterative algorithm is discussed briefly. First, let ζ(ẽ(k)) be any admissible control input and {φ^{(i)}} be a new sequence defined by

φ^{(i)}(ẽ(k)) = U(ẽ(k), ζ(ẽ(k))) + φ^{(i−1)}(ẽ(k+1)),  (9.21)

where φ^{(0)}(·) = 0. Considering φ^{(1)}(ẽ(k)) = U(ẽ(k), ζ(ẽ(k))), one has

φ^{(i)}(ẽ(k)) = Σ_{η=0}^{i−1} φ^{(1)}(ẽ(k+η)).  (9.22)

Since ζ(ẽ(k)) is an admissible control input, there exists a positive constant Y such that φ^{(i)}(ẽ(k)) ≤ Y, ∀i. Because J^{(i)}(ẽ(k)) is the minimization result of (9.19), it follows that 0 ≤ J^{(i)}(ẽ(k)) ≤ φ^{(i)}(ẽ(k)) ≤ Y, i = 1, 2, …. Second, define a new sequence {Θ^{(i)}} as

Θ^{(i)}(ẽ(k)) = U(ẽ(k), u^{(i)}(ẽ(k))) + Θ^{(i−1)}(ẽ(k+1)),  (9.23)

where Θ^{(0)}(ẽ(k)) = 0. Since Θ^{(0)}(ẽ(k)) ≤ J^{(1)}(ẽ(k)), we assume that the inequality Θ^{(i−1)}(ẽ(k)) ≤ J^{(i)}(ẽ(k)) holds, i = 1, 2, …. By induction, we can derive that Θ^{(i)}(ẽ(k)) ≤ J^{(i+1)}(ẽ(k)). Therefore, we can conclude that J^{(i)}(ẽ(k)) ≤ Θ^{(i)}(ẽ(k)) ≤ J^{(i+1)}(ẽ(k)). Considering the boundedness and the monotonicity, we can declare that J^{(i)}(ẽ(k)) → J*(ẽ(k)) and u^{(i)}(ẽ(k)) → u*(ẽ(k)) as i → ∞.

9.3.2 DHP Formulation of the Iterative Algorithm

In this part, the DHP algorithm for optimal tracking control of nonaffine systems with asymmetric constraints is described. To the best of our knowledge, for the nonaffine plant, the partial derivatives of x(k+1) with respect to u(ẽ(k)) and x(k) cannot be obtained from the system dynamics. Therefore, it is necessary to rely on the model


network, which is established using the system data generated by the controlled plant. Especially, the model network is constructed as a neural identifier to directly approximate the controlled plant instead of the error dynamics.

Theorem 9.1 Considering the control input u(x(k)) in (9.1) and the tracking control u(ẽ(k)) in (9.5), we have the following equations:

∂x(k+1)/∂u(x(k)) = ∂ẽ(k+1)/∂u(ẽ(k)),  ∂x(k+1)/∂x(k) = ∂ẽ(k+1)/∂ẽ(k).  (9.24)

Proof Considering the system (9.1), one has

∂x(k+1)/∂u(x(k)) = ∂F(x(k), u(k))/∂u(x(k))
 = ∂F(ẽ(k) + r(k), u(ẽ(k)) + Υ(r(k))) / ∂(u(ẽ(k)) + Υ(r(k)))
 = ∂[F(ẽ(k) + r(k), u(ẽ(k)) + Υ(r(k))) − Ψ(r(k))] / ∂u(ẽ(k))
 = ∂ẽ(k+1)/∂u(ẽ(k)).  (9.25)

Similarly, we can obtain that

∂x(k+1)/∂x(k) = ∂F(x(k), u(k))/∂x(k)
 = ∂F(ẽ(k) + r(k), u(ẽ(k)) + Υ(r(k))) / ∂(ẽ(k) + r(k))
 = ∂[F(ẽ(k) + r(k), u(ẽ(k)) + Υ(r(k))) − Ψ(r(k))] / ∂ẽ(k)
 = ∂ẽ(k+1)/∂ẽ(k).  (9.26)

This ends the proof.
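The identity (9.24) can also be checked numerically. The hedged sketch below compares finite-difference derivatives of a hypothetical scalar plant with those of the induced error dynamics; the plant, reference, and steady control are assumptions made only for this check.

```python
import numpy as np

F = lambda x, u: np.tanh(x) + 0.3 * u * (1.0 + 0.1 * x)         # hypothetical scalar nonaffine plant
Psi = lambda r: r                                               # constant reference trajectory
Upsilon = lambda r: (r - np.tanh(r)) / (0.3 * (1.0 + 0.1 * r))  # steady control of this toy plant

def diff(fun, z, eps=1e-6):
    # Central finite difference for a scalar-to-scalar map.
    return (fun(z + eps) - fun(z - eps)) / (2.0 * eps)

x, r = 0.7, 0.2
e, u_r = x - r, Upsilon(r)

lhs = diff(lambda u: F(x, u), u_r)                              # d x(k+1) / d u(x(k))
rhs = diff(lambda ue: F(e + r, ue + u_r) - Psi(r), 0.0)         # d e~(k+1) / d u(e~(k))
print(np.isclose(lhs, rhs))                                     # True, as stated in (9.24)
```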

According to (9.11) and (9.12), we have

∂[ 2 ∫_0^{u(ẽ(l))} ϕ^{−T}(s) R ds ] / ∂u(ẽ(k)) = 2Rϕ^{−1}(u(ẽ(k))).  (9.27)

Remarkably, the iterative tracking control law is solved by letting the gradient of the right-hand side of (9.19) with respect to u(ẽ(k)) equal zero. Therefore, based on (9.24), the corresponding control law u^{(i)}(ẽ(k)) is obtained as

u^{(i)}(ẽ(k)) = ϕ( −(1/2) R^{−1} (∂x(k+1)/∂u(x(k)))^T ∂J^{(i)}(ẽ(k+1))/∂ẽ(k+1) ).  (9.28)

As mentioned above, computing the integral term of (9.11) is not an easy task. In addition, at each step of the iteration, we have to compute ∂J^{(i)}(ẽ(k+1))/∂ẽ(k+1). In order to reduce the computation burden, we shall present the DHP algorithm for obtaining the near-optimal tracking control law. We assume that the iterative cost function J^{(i)}(ẽ(k)) is smooth. The costate function is defined as follows:

λ(ẽ(k)) = ∂J(ẽ(k))/∂ẽ(k).  (9.29)

Furthermore, we have λ^{(i)}(ẽ(k)) = ∂J^{(i)}(ẽ(k))/∂ẽ(k) and λ^{(i+1)}(ẽ(k)) = ∂J^{(i+1)}(ẽ(k))/∂ẽ(k). Starting with λ^{(0)}(·) = 0 and substituting λ^{(i)}(ẽ(k+1)) into (9.28), the update rule of the iterative tracking control law u^{(i)}(ẽ(k)) can be written as

u^{(i)}(ẽ(k)) = ϕ( −(1/2) R^{−1} (∂x(k+1)/∂u(x(k)))^T λ^{(i)}(ẽ(k+1)) ).  (9.30)

According to (9.19) and (9.24), the costate function λ^{(i+1)}(ẽ(k)) is deduced as

λ^{(i+1)}(ẽ(k)) = ∂U(ẽ(k), u^{(i)}(ẽ(k)))/∂ẽ(k) + ∂J^{(i)}(ẽ(k+1))/∂ẽ(k)
 = ∂(ẽ^T(k)Q ẽ(k))/∂ẽ(k) + (∂u^{(i)}(ẽ(k))/∂ẽ(k))^T ∂W(u(ẽ(k)))/∂u^{(i)}(ẽ(k))
   + (∂u^{(i)}(ẽ(k))/∂ẽ(k))^T (∂ẽ(k+1)/∂u^{(i)}(ẽ(k)))^T ∂J^{(i)}(ẽ(k+1))/∂ẽ(k+1) + (∂ẽ(k+1)/∂ẽ(k))^T ∂J^{(i)}(ẽ(k+1))/∂ẽ(k+1)
 = 2Q ẽ(k) + 2(∂u^{(i)}(ẽ(k))/∂ẽ(k))^T Rϕ^{−1}(u^{(i)}(ẽ(k)))
   + ( ∂x(k+1)/∂u^{(i)}(x(k)) · ∂u^{(i)}(ẽ(k))/∂ẽ(k) + ∂x(k+1)/∂x(k) )^T λ^{(i)}(ẽ(k+1)).  (9.31)

By performing the iteration process between (9.30) and (9.31), we can update the tracking control sequence u^{(i)}(ẽ(k)) and the costate function λ^{(i+1)}(ẽ(k)). In the sequel, we shall show how to implement the learning process of the above iterative DHP algorithm by constructing three neural networks.
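A minimal sketch of one pass through (9.30) and (9.31) is given below for a toy two-input error system. The Jacobian matrices A and B stand in for the model-network derivatives ∂x(k+1)/∂x(k) and ∂x(k+1)/∂u(x(k)), the costate is a plain function rather than the critic network, and asymmetric bounds of the same magnitude as those appearing later in the application are used purely for illustration.

```python
import numpy as np

# A hedged sketch of the DHP recursion (9.30)-(9.31) on a toy linear error model with
# asymmetric input bounds. A and B stand in for the model-network Jacobians; the costate
# lambda is a plain Python function here, whereas the chapter carries it with the critic network.
A = np.array([[0.9, 0.1], [0.0, 0.8]])        # stands in for d x(k+1)/d x(k) = d e~(k+1)/d e~(k)
B = np.array([[0.2, 0.0], [0.1, 0.3]])        # stands in for d x(k+1)/d u(x(k)) = d e~(k+1)/d u
Q = 10.0 * np.eye(2)
R = np.diag([1.0, 5.0])
u_max = np.array([82.0, 69891.0])             # illustrative asymmetric bounds
u_min = np.array([-158.0, -22339.0])
alpha, beta = 1.0 / u_max, 1.0 / u_min

phi = lambda d: (np.exp(d) - np.exp(-d)) / (alpha * np.exp(d) - beta * np.exp(-d))
phi_inv = lambda s: 0.5 * np.log((1.0 - beta * s) / (1.0 - alpha * s))

def control(e_next, lam):
    # (9.30): u^(i)(e~(k)) = phi(-1/2 R^{-1} (dx(k+1)/du)^T lambda^(i)(e~(k+1))).
    return phi(-0.5 * np.linalg.solve(R, B.T @ lam(e_next)))

def costate(e, e_next, u, du_de, lam):
    # (9.31): lambda^(i+1)(e~(k)) from the utility gradient plus the chain-rule terms.
    return (2.0 * Q @ e
            + 2.0 * du_de.T @ (R @ phi_inv(u))
            + (B @ du_de + A).T @ lam(e_next))

lam0 = lambda e: np.zeros(2)                  # lambda^(0)(.) = 0
e = np.array([-1.5, 2.7])
e_next = A @ e                                # predicted next error under the zero initial policy
u0 = control(e_next, lam0)                    # equals zero because phi(0) = 0
du_de = np.zeros((2, 2))                      # du^(0)/de~ for the zero initial policy
print(u0, costate(e, e_next, u0, du_de, lam0))
```

In the chapter these quantities are produced and consumed by the three neural networks constructed next.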


9.4 Neural Network Implementation

In this part, we choose three neural networks to approximate the corresponding variables. First, the model neural network is employed to learn the nonlinear system for obtaining the state x̂(k+1). Then, the critic neural network and the action neural network are used to estimate the costate function λ^{(i+1)}(ẽ(k)) and the tracking control law u^{(i)}(ẽ(k)), respectively. All neural networks are composed of three layers: the input layer, the hidden layer, and the output layer.

First, the model network is constructed via the collected inputs x(k) and u(x(k)) from the controlled plant. The output x̂(k+1) of the model network is given as follows:

x̂(k+1) = ω_{m2}^T σ_m(ω_{m1}^T [x^T(k), u^T(x(k))]^T + b_{m1}) + b_{m2},  (9.32)

where ω_{m1} is the input-to-hidden weight matrix and ω_{m2} is the hidden-to-output weight matrix, b_{m1} and b_{m2} are the threshold vectors, and σ_m(·) is the bounded activation function. By defining the error function x̃_{k+1} = x̂_{k+1} − x_{k+1}, the corresponding performance measure related to the model network is given as E_m = (1/2) x̃_{k+1}^T x̃_{k+1}. In light of the gradient descent algorithm, updating rules for the weights and thresholds are shown as follows:

ω_{mq} := ω_{mq} − α_m ∂E_m/∂ω_{mq},
b_{mq} := b_{mq} − α_m ∂E_m/∂b_{mq},  q = 1, 2,  (9.33)

where α_m ∈ (0, 1) is the learning rate of the model network. After training the model network, weights and thresholds are kept unchanged. It is assumed that the steady control u(r(k)) exists and can be solved. Herein, the trained model network expression can be utilized to obtain the steady control of the reference trajectory in terms of r(k+1) = F(r(k), u(r(k))). Next, r(k+1) is rewritten as

r̂(k+1) = ω_{m2}^T σ_m(ω_{m1}^T [r^T(k), û^T(r(k))]^T + b_{m1}) + b_{m2}.  (9.34)

If the current state r(k) and the next state r(k+1) are known, the control law û(r(k)) can be obtained by solving this nonlinear equation.

Second, the critic network approximates the iterative costate function as

λ̂^{(i+1)}(ẽ(k)) = ω_{c2}^T σ_c(ω_{c1}^T ẽ(k)),  (9.35)

where ω_{c1} and ω_{c2} are the input-to-hidden and the hidden-to-output weight matrices, and σ_c(·) is the bounded activation function. Similarly, with the error function λ̃^{(i+1)}(ẽ(k)) = λ̂^{(i+1)}(ẽ(k)) − λ^{(i+1)}(ẽ(k)), the performance measure of the critic network is defined as


E_c = (1/2) λ̃^{(i+1)T}(ẽ(k)) λ̃^{(i+1)}(ẽ(k)).  (9.36)

Further, the weight matrices of the critic network are updated by

ω_{cq} := ω_{cq} − α_c ∂E_c/∂ω_{cq},  q = 1, 2,  (9.37)

where α_c ∈ (0, 1) is the learning rate of the critic network.

Third, for the input ẽ(k), the output of the action network is computed as

û^{(i)}(ẽ(k)) = ω_{a2}^T σ_a(ω_{a1}^T ẽ(k)),  (9.38)

where ω_{a1} and ω_{a2} are the related weight matrices, and σ_a(·) is the activation function. Let the error function of the action network be ũ^{(i)}(ẽ(k)) = û^{(i)}(ẽ(k)) − u^{(i)}(ẽ(k)); we have the objective function to be minimized as

E_a = (1/2) ũ^{(i)T}(ẽ(k)) ũ^{(i)}(ẽ(k)).  (9.39)

According to the gradient descent criterion, the weight updating rules of the action network are given by

ω_{aq} := ω_{aq} − α_a ∂E_a/∂ω_{aq},  q = 1, 2,  (9.40)

where α_a ∈ (0, 1) is the learning rate of the action network. Meanwhile, the whole diagram of the DHP algorithm involving the three neural network structures is shown in Fig. 9.2, where the estimated tracking error used therein is computed as x̂(k+1) − r(k+1). Based on the neural networks, the main procedure of the DHP algorithm is summarized as Algorithm 5.

Fig. 9.2 The flowchart of the proposed DHP algorithm for the wastewater treatment plant with the model network, the critic network and the action network


Algorithm 5 The Iterative Process Between the Critic and Action Networks
1: Select the necessary parameters L_k, L_i, L_c, L_a, ς, ε_c, ε_a, and the weight matrices Q and R. Let the iteration index i = 0 and the costate function λ^{(0)}(·) = 0.
2: Construct the critic and the action networks as (9.35) and (9.38).
3: Set k = 0 and give an initial state x(k). Compute the steady control û(r(k)) according to (9.34) and the tracking error ẽ(k) according to (9.4).
4: for i ← i + 1 do
5:   Compute the tracking control û^{(i)}(ẽ(k)) by (9.30) and train the action network with L_a epochs until the given accuracy ε_a is reached.
6:   Compute the costate function λ^{(i+1)}(ẽ(k)) by (9.31) and train the critic network with L_c epochs until the given accuracy ε_c is reached.
7:   Compute the system state x̂(k+1) according to (9.32) and the tracking error ẽ(k+1) according to (9.4).
8:   if |λ^{(i+1)}(ẽ(k)) − λ^{(i)}(ẽ(k))| < ς or i ≥ L_i then
9:     Stop
10:  end if
11: end for
12: Let k = k + 1, ẽ(k) = ẽ(k+1), and x̂(k) = x̂(k+1). If k ≤ L_k, go to Step 4; otherwise, go to Step 13.
13: Stop.
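Steps 5 and 6 of Algorithm 5 amount to supervised updates of two small networks. The hedged numpy sketch below shows the critic update (9.35)–(9.37) toward a given DHP target; the action network (9.38)–(9.40) is handled in exactly the same way, and all sizes, seeds, and targets here are illustrative rather than the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, H = 2, 2, 20                                   # error dim, control dim, hidden neurons
w_c1, w_c2 = rng.uniform(-0.5, 0.5, (n, H)), rng.uniform(-0.5, 0.5, (H, n))
w_a1, w_a2 = rng.uniform(-0.5, 0.5, (n, H)), rng.uniform(-0.5, 0.5, (H, m))
alpha_c = 0.05                                       # critic learning rate

def critic(e):
    # lambda_hat^(i+1)(e~) = w_c2^T sigma(w_c1^T e~), cf. (9.35), with tanh activation.
    return w_c2.T @ np.tanh(w_c1.T @ e)

def action(e):
    # u_hat^(i)(e~) = w_a2^T sigma(w_a1^T e~), cf. (9.38); trained the same way on (9.30) targets.
    return w_a2.T @ np.tanh(w_a1.T @ e)

def train_critic(e, lam_target, epochs=100, eps=1e-5):
    # Gradient descent on E_c = 1/2 ||lambda_hat - lambda_target||^2, cf. (9.36)-(9.37).
    global w_c1, w_c2
    for _ in range(epochs):
        h = np.tanh(w_c1.T @ e)
        err = w_c2.T @ h - lam_target
        if 0.5 * err @ err < eps:
            break
        g2 = np.outer(h, err)
        g1 = np.outer(e, (w_c2 @ err) * (1.0 - h ** 2))
        w_c2 -= alpha_c * g2
        w_c1 -= alpha_c * g1
    return err

e = np.array([0.5, -0.3])                            # a sample tracking error
lam_target = np.array([1.0, -2.0])                   # an illustrative costate target from (9.31)
print(train_critic(e, lam_target), critic(e), action(e))
```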

9.5 A Wastewater Treatment Application

In this section, the proposed optimal tracking control approach is applied to the typical wastewater treatment process, i.e., BSM1. In the following, the detailed control process will be described by linking the variables in the wastewater treatment plant with the parameters of the proposed algorithm. The main objective of designing the controller is to enable S_{O,5} and S_{NO,2} to track the set value r(k) = [2, 1]^T. The analysis of the wastewater treatment process shows that keeping the concentrations of dissolved oxygen and nitrate nitrogen at 2 (mg/L) and 1 (mg/L) is an overall optimization choice. Naturally, the state vector is defined as x(k) = [S_{O,5}, S_{NO,2}]^T and the control vector is defined as u(x(k)) = [K_{La,5}, Q_a]^T. According to the actual working conditions, the control input u(x(k)) should satisfy 0 ≤ K_{La,5} ≤ 240 and 0 ≤ Q_a ≤ 92230. Based on Eq. (9.17), we can find that obtaining the control u(x(k)) requires solving the steady control u(r(k)) and the tracking control u(ẽ(k)). Incidentally, the optimal tracking problem can be transformed into the optimal regulation problem. When the tracking control u(ẽ(k)) is adjusted to the zero vector, x(k) = r(k) can be achieved with u(x(k)) = u(r(k)). Hence, we aim to design a feedback controller to make the tracking control tend to zero gradually. According to (9.4) and (9.5), we have

ẽ(k) = [S_{O,5}, S_{NO,2}]^T − [2, 1]^T,  (9.41)

and

u(ẽ(k)) = [K_{La,5}, Q_a]^T − u(r(k)),  (9.42)

Fig. 9.3 The inflow data in rainstorm weather (influent flow versus time in days)

where u(r(k)) can be solved by using the model network expression. Before executing the iterative DHP algorithm, the model network should be trained in advance based on the practical influent data of rainstorm days. For clarity, the influent flow under rainstorm conditions is shown in Fig. 9.3. According to the data of rainstorm weather, we can find that the influent flow increased significantly on the 9th and 11th days. It should be noted that the model network is an approximation of the system dynamics. By using the MATLAB neural network toolbox, the model network is trained for 1500 iteration steps with 26880 data samples. In order to ensure the accuracy of the model network, we set the number of hidden layer neurons to 40. Hence, the structure of the model network is 4–40–2 and the learning rate is 0.01. In addition, 335 data samples are used to test the performance of the model network. The corresponding testing performances of S_{NO,2} and S_{O,5} are shown in Figs. 9.4 and 9.5. Considering the known weights and thresholds of the trained model network as well as r(k) = r(k+1) = [2, 1]^T, we can solve Eq. (9.34) by using the "fsolve" function in MATLAB. Meanwhile, the obtained steady control is u(r(k)) = [u_1(r(k)), u_2(r(k))]^T = [158, 22339]^T. Combining formulas (9.5) and (9.42), the constrained boundaries of the tracking control u(ẽ(k)) are given as u_1(ẽ(k)) ∈ [−158, 82] and u_2(ẽ(k)) ∈ [−22339, 69891], with u_1^{min}(x) = 0, u_1^{max}(x) = 240, u_2^{min}(x) = 0, u_2^{max}(x) = 92230. At this point, the parameters α_j and β_j of formulas (9.13) and (9.14) are clear.
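The steady-control step can be reproduced with any nonlinear root finder. The hedged sketch below mimics the "fsolve" computation on a stand-in model function (the trained 4–40–2 network itself is not reproduced here) and then forms the asymmetric-constraint parameters from the plant input bounds; consequently the printed numbers are illustrative and do not coincide with the values reported above.

```python
import numpy as np
from scipy.optimize import fsolve

def model(r, u):
    # Stand-in for the trained model network expression (9.34); the real mapping is the
    # 4-40-2 network learned from BSM1 data and is not reproduced here.
    return np.array([0.9 * r[0] + 0.004 * u[0], 0.95 * r[1] + 2.0e-5 * u[1]])

r_set = np.array([2.0, 1.0])                            # r(k) = r(k+1) = [2, 1]^T
u_steady = fsolve(lambda u: model(r_set, u) - r_set, x0=np.array([100.0, 20000.0]))
print(u_steady)                                         # steady control of the stand-in model

# Asymmetric constraint parameters of (9.13)-(9.14) from the plant input bounds:
u_max = np.array([240.0, 92230.0]) - u_steady           # u_j^max(e~) = u_j^max(x) - u_j(r(k))
u_min = np.array([0.0, 0.0]) - u_steady                 # u_j^min(e~) = u_j^min(x) - u_j(r(k))
alpha_j, beta_j = 1.0 / u_max, 1.0 / u_min
print(u_max, u_min, alpha_j, beta_j)
```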

Fig. 9.4 The testing performance of S_{O,5} (approximation versus practical data)

Fig. 9.5 The testing performance of S_{NO,2} (approximation versus practical data)


Table 9.1 Parameter values used in the DHP algorithm

L_k    L_i    L_c    L_a    α_c     α_a     ς        ε_c      ε_a
250    25     100    100    0.05    0.05    10^{-5}  10^{-5}  10^{-5}

In the following, the critic network and the action network are trained to implement the iterative DHP algorithm. The parameter values used in the DHP algorithm are listed in Table 9.1. According to the common practice in the adaptive critic field, the tuning parameters mainly depend on experience and experiment. As for the cost function in (9.10), the positive definite matrices are chosen as Q = 10I_2 and R = [1 0; 0 5], where I_2 is the 2 × 2 identity matrix. In terms of (Mu et al. 2017), the estimation error of neural networks can be made arbitrarily small with a sufficiently large number of hidden layer neurons and bounded activation functions. Therefore, we construct the critic and action networks with the structures 2–20–2 and 2–20–2. The weights ω_{c1}, ω_{c2}, ω_{a1}, and ω_{a2} are randomly initialized in [−0.5, 0.5], and the learning rates are selected as α_c = α_a = 0.05. The number of outer-loop time steps L_k is used to improve the adaptive ability of the actuator. The iteration cycle continues until the condition of (9.20) is satisfied or the maximal number of iterations L_i is reached. For each iteration, the critic and action networks are trained with L_c = L_a epochs to make sure the tolerant errors ε_c = ε_a are reached. After training the critic and action networks, the action network is used to control the system plant.

Usually, the initial values of S_{O,5} and S_{NO,2} are x(0) = [0.5, 3.7]^T. It is worth noting that x(k) experiences various states as the time k increases in the training process, which means that appropriate initial values can be stabilized through the feedback controller. Therefore, although the training initial vector is [0.5, 3.7]^T, the action network can make some suitable initial states track the set value [2, 1]^T. Defining x^z(0) = [x_1^z(0), x_2^z(0)]^T, z ∈ {1, 2, …, 8}, the initial state vectors are selected as x^1(0) = [0.5, 3.7]^T, x^2(0) = [0.6, 3.5]^T, x^3(0) = [0.7, 3.3]^T, x^4(0) = [0.8, 3.1]^T, x^5(0) = [0.9, 2.9]^T, x^6(0) = [1.0, 2.7]^T, x^7(0) = [1.1, 2.5]^T, x^8(0) = [1.2, 2.3]^T. By conducting the experiment, the curves of system states and control inputs are shown in Figs. 9.6 and 9.7. Clearly, the system states S_{O,5} and S_{NO,2} quickly track the reference trajectory. Meanwhile, the control inputs K_{La,5} and Q_a also meet the constrained requirements. Furthermore, the trajectory tracking errors and tracking control inputs are displayed in Figs. 9.8 and 9.9, which present the excellent performance of the iterative tracking algorithm. We can see that the tracking errors tend to zero and the tracking control inputs are also constrained to the required range. In short, the good application results demonstrate the effectiveness of the constrained tracking control algorithm with the DHP technology.

Fig. 9.6 The concentration of dissolved oxygen and nitrate nitrogen

Fig. 9.7 The oxygen transfer coefficient and the internal recycle

Fig. 9.8 The tracking error curves with the initial state x^z(0): (a) x^1(0); (b) x^2(0); (c) x^3(0); (d) x^4(0); (e) x^5(0); (f) x^6(0); (g) x^7(0); (h) x^8(0), where the green line represents x_1^z(0) and the pink line represents x_2^z(0)

9.6 Conclusion

In this chapter, a new tracking control strategy is proposed to derive the optimal control law of the constrained wastewater treatment plant by using the DHP structure. The nonquadratic functional is given to deal with the asymmetric constraint problems and reduce the amount of computation. Three neural networks are constructed to approximate the corresponding variables. The experimental results of the wastewater treatment plant illustrate the effectiveness of the given method. However, more accurate model networks are required to obtain the steady control u(r(k)), which will be discussed in future work.

Fig. 9.9 The tracking control inputs with the initial state x^z(0): (a) x^1(0); (b) x^2(0); (c) x^3(0); (d) x^4(0); (e) x^5(0); (f) x^6(0); (g) x^7(0); (h) x^8(0), where the green line represents x_1^z(0) and the pink line represents x_2^z(0)

References

Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
Alex J, Benedetti L, Copp J, Gernaey KV, Jeppsson U, Nopens I, Pons MN, Rieger L, Rosen C, Steyer JP, Vanrolleghem P, Winkler S (2008) Benchmark simulation model no. 1 (BSM1). IWA task group on benchmarking of control strategies for WWTPs, London
Bo Y, Qiao J (2015) Heuristic dynamic programming using echo state network for multivariable tracking control of wastewater treatment process. Asian J Control 17(5):1654–1666
Bo Y, Zhang X (2018) Online adaptive dynamic programming based on echo state networks for dissolved oxygen control. Appl Soft Comput 62:830–839
Ha M, Wang D, Liu D (2020a) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern Syst 50(9):3158–3168
Ha M, Wang D, Liu D (2020b) Event-triggered constrained control with DHP implementation for nonaffine discrete-time systems. Inf Sci 519:110–123
Han H, Wu X, Qiao J (2019) A self-organizing sliding-mode controller for wastewater treatment processes. IEEE Trans Control Syst Technol 27(4):1480–1491
Han H, Liu Z, Hou Y, Qiao J (2020) Data-driven multiobjective predictive control for wastewater treatment process. IEEE Trans Ind Inform 16(4):2767–2775
Ha M, Wang D, Liu D (2020) Data-based nonaffine optimal tracking control using iterative DHP approach. In: Proceedings of 21st IFAC world congress, vol 53(2), pp 4246–4251


Hou J, Wang D, Liu D, Zhang Y (2020) Model-free H∞ optimal tracking control of constrained nonlinear systems via an iterative adaptive learning algorithm. IEEE Trans Syst Man Cybern Syst 50(11):4097–4108
Kiumarsi B, Lewis FL (2015) Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans Neural Netw Learn Syst 26(1):140–151
Liang J, Venayagamoorthy GK, Harley RG (2012) Wide-area measurement based dynamic stochastic optimal power flow control for smart grids with high variability and uncertainty. IEEE Trans Smart Grid 3(1):59–69
Liu D, Li H, Wang D (2013) Data-based self-learning optimal control: research progress and prospects. Acta Automatica Sinica 39(11):1858–1870
Modares H, Lewis FL, Naghibi-Sistani M (2013) Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw Learn Syst 24(10):1513–1525
Mu C, Wang D, He H (2017) Novel iterative neural dynamic programming for data-based approximate optimal control design. Automatica 81:240–252
Niu B, Wang D, Alotaibi ND, Alsaadi FE (2019) Adaptive neural state-feedback tracking control of stochastic nonlinear switched systems: an average dwell-time method. IEEE Trans Neural Netw Learn Syst 30(4):1076–1087
Niu B, Duan P, Li J, Li X (2021) Adaptive neural tracking control scheme of switched stochastic nonlinear pure-feedback nonlower triangular systems. IEEE Trans Syst Man Cybern Syst 51(2):975–986
Prokhorov DV, Wunsch DC (1997) Adaptive critic designs. IEEE Trans Neural Netw 8(5):997–1007
Qiao J, Zhang W (2018) Dynamic multi-objective optimization control for wastewater treatment process. Neural Comput Appl 29:1261–1271
Song R, Xiao W, Sun C (2013) Optimal tracking control for a class of unknown discrete-time systems with actuator saturation via data-based ADP algorithm. Acta Automatica Sinica 39(9):1413–1420
Song R, Xie Y, Zhang Z (2019) Data-driven finite-horizon optimal tracking control scheme for completely unknown discrete-time nonlinear systems. Neurocomputing 356:206–216
Sui S, Chen CLP, Tong S, Feng S (2020) Finite-time adaptive quantized control of stochastic nonlinear systems with input quantization: a broad learning system based identification method. IEEE Trans Ind Electron 67(10):8555–8565
Sui S, Chen CLP, Tong S (2021) Event-trigger-based finite-time fuzzy adaptive control for stochastic nonlinear system with unmodeled dynamics. IEEE Trans Fuzzy Syst 29(7):1914–1926
Wang D, Liu D (2018) Neural robust stabilization via event-triggering mechanism and adaptive learning technique. Neural Netw 102:27–35
Wang D, Liu D, Wei Q (2012) Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78:14–22
Wang D, Ha M, Qiao J (2020a) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Autom Control 65(3):1272–1279
Wang D, Xu X, Zhao M (2020b) Neural critic learning toward robust dynamic stabilization. Int J Robust Nonlinear Control 30(5):2020–2032
Wang D, Ha M, Qiao J (2021a) Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369
Wang D, Zhao M, Qiao J (2021b) Intelligent optimal tracking with asymmetric constraints of a nonlinear wastewater treatment system. Int J Robust Nonlinear Control 31(14):6773–6787
Wen G, Chen CLP, Feng J, Zhou N (2018a) Optimized multi-agent formation control based on an identifier-actor-critic reinforcement learning algorithm. IEEE Trans Fuzzy Syst 26(5):2719–2731
Wen G, Ge SS, Tu F (2018b) Optimized backstepping for tracking control of strict-feedback systems. IEEE Trans Neural Netw Learn Syst 29(8):3850–3862
Wen G, Chen CLP, Ge SS, Yang H, Liu X (2019) Optimized adaptive nonlinear tracking control using actor-critic reinforcement learning strategy. IEEE Trans Ind Inform 15(9):4969–4977


Wen G, Chen CLP, Li B (2020) Optimized formation control using simplified reinforcement learning for a class of multiagent systems with unknown dynamics. IEEE Trans Ind Electron 67(9):7879–7888
Werbos PJ (1992) Approximate dynamic programming for real-time control and neural modeling. In: White DA, Sofge DA (eds) Handbook of intelligent control: neural, fuzzy, and adaptive approaches (chapter 13). Van Nostrand Reinhold, New York
Zhang W, Qiao J (2020) Multi-variable direct self-organizing neural network control for wastewater treatment process. Asian J Control 22(2):716–728
Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern Part B Cybern 38(4):937–942
Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints. IEEE Trans Neural Netw 20(9):1490–1503

Chapter 10

Data-Driven Hybrid Intelligent Optimal Tracking Design with Industrial Applications

Abstract In this chapter, a hybrid intelligent tracking control approach is developed to address optimal tracking problems for a class of nonlinear discrete-time systems. The generalized value iteration algorithm is utilized to attain the admissible tracking control with offline training, while the online near-optimal control method is established to enhance the control performance. It is emphasized that the value iteration performance is improved by introducing the acceleration factor. By collecting the input–output data of the unknown system plant, the model neural network is constructed to provide the partial derivative of the system state with respect to the control law as the approximate control matrix. A novel computational strategy is introduced to obtain the steady control of the reference trajectory. The critic and action neural networks are utilized to approximate the cost function and the tracking control policy, respectively. Considering approximation errors of neural networks, the stability analysis of the specific systems is provided via the Lyapunov approach. Finally, two numerical examples with industrial application backgrounds are involved for verifying the effectiveness of the proposed approach.

Keywords Accelerated generalized value iteration · Adaptive critic · Industrial applications · Intelligent optimal tracking control · Neural networks

10.1 Introduction

The control and optimization problems of complex nonlinear systems widely exist in the field of industry and also in our life (Wang et al. 2021a; Roman et al. 2021; Turnip and Panggabean 2020; Roman et al. 2019; Du et al. 2014). Nowadays, various methods have been proposed to address the control problems of nonlinear systems (Precup et al. 2008; Kang et al. 2014). Different from the general control strategy, optimal control studies how to design the controller to ensure the optimal performance of the system under the premise of stability. Usually, the essence of general optimal control problems is to solve the nonlinear Hamilton–Jacobi–Bellman (HJB) equation (Al-Tamimi et al. 2008; Zhang et al. 2009; Wang et al. 2012b). As an intelligent control approach, adaptive dynamic programming (ADP) shows the powerful


ability in obtaining the approximate solution of the nonlinear HJB equation. The controller designed by ADP algorithms can asymptotically stabilize the system in the near-optimal sense based on the combination of reinforcement learning and neural networks (Ha et al. 2021). According to the iteration form, ADP algorithms are mainly divided into value iteration (Wang et al. 2020; Ha et al. 2020b; Mu et al. 2017; Wei et al. 2017, 2018a, 2016; Li and Liu 2012) and policy iteration (Wei et al. 2018b; Wang and Liu 2018; Liu and Wei 2014; Zhao et al. 2015; Wang et al. 2021b). It is worth mentioning that policy iteration and value iteration were integrated in (Wei et al. 2015), which illustrated that the difference between value iteration and policy iteration lies in the values of the iteration indices. The convergence of the traditional and generalized value iteration algorithms was discussed in (Al-Tamimi et al. 2008; Wei et al. 2016; Li and Liu 2012). Generally speaking, the value iteration algorithm is more inclined to offline training and is easier to implement with a positive semidefinite initial cost function. However, the inherent disadvantage of the value iteration algorithm is that the iterative cost function converges slowly, which also leads to a large computational pressure (Luo et al. 2020). It is noteworthy that work on accelerating the convergence process of value iteration is still scarce. Therefore, it is significant to come up with a new technique to overcome the deficiency that value iteration converges slowly.

As a significant field, tracking control problems of nonlinear systems have been extensively investigated and many methods have been proposed, including proportional–integral–derivative (PID) control, fuzzy control, model predictive control, and so on. However, for some complex industrial systems, such as the wastewater treatment system, these methods cannot be perfectly applied due to the poor adaptive ability, the requirement of an accurate mechanism model, and the dependence on experience. As an adaptive method, ADP has been widely used to deal with tracking problems without the system model information (Zhang et al. 2008; Wang et al. 2012a; Modares and Lewis 2014; Kiumarsi and Lewis 2015; Song et al. 2019; Hou et al. 2020; Sun and van Kampen 2020). By means of the system transformation, the heuristic dynamic programming iteration algorithm was adopted to address the optimal tracking control problem for discrete-time nonlinear systems in (Zhang et al. 2008). The key idea of (Wang et al. 2012a) was to design an optimal regulator for the tracking error dynamics instead of the original system. Based on the action-critic structure, a partially model-free adaptive optimal control approach was proposed to solve the tracking problem online without the known system dynamics (Kiumarsi and Lewis 2015). By establishing an augmented system, an H∞ optimal tracking controller was designed with convergence analysis in (Hou et al. 2020). In order to eliminate the tracking error, a new cost function was proposed to handle the tracking problem in (Li et al. 2021). However, the deficiency of the above work is that the controlled systems are affine. Therefore, it is still formidable to deal with the tracking problem for completely unknown nonlinear plants. In order to confront this challenge, a novel mathematical method was developed to obtain the steady control for complex nonlinear systems in (Ha et al. 2020a).
In addition, this handling method was applied to overcome the difficulties caused by the nonanalytic nature of the wastewater treatment system in the presence of symmetric and asymmetric control


constraints (Wang et al. 2021d, e). To the best of our knowledge, that technology was implemented via iterative algorithms with offline training, and reconstruction errors were ignored in (Wang et al. 2021d). Since the challenge derived from the disturbance input is always encountered, it is required that the designed controller can tune parameters adaptively by interacting with the environment. Compared with Wang et al. (2021d, e), we not only accelerate the iterative process of the algorithm offline, but also realize the online control in this chapter. In the last decades, there have been abundant works solving online optimal regulation problems (Zhang et al. 2014; Dierks and Jagannathan 2012; Zhao et al. 2022). Based on the capabilities of neural networks, the prevailing idea is to find approximations of the optimal cost function and the optimal control law. However, there are few methods to deal with tracking problems online for complex nonlinear systems, including the wastewater treatment process, via adaptive critic.

Under the above backgrounds, in this chapter, we develop a hybrid near-optimal tracking control strategy to surmount the tracking control problem for a class of unknown nonaffine systems (Wang et al. 2021c). In consideration of the iteration algorithm, updating formulas of the novel cost function and the control law are derived. Remarkably, we shed light on the fast convergence of the generalized value iteration algorithm by introducing an acceleration factor. It is highlighted that the model network is used to obtain the steady control via the numerical method for solving tracking problems online. Additionally, based on neural networks and the new tracking control, the online tracking implementation is given and the uniformly ultimately bounded stability of weight estimation errors is analyzed via the Lyapunov theory. The proposed tracking approach is applied to the wastewater treatment process for the first time.

Herein, the notations used throughout the chapter are summarized. Let N = {0, 1, 2, …} be the set of all nonnegative integers. Let R be the set of all real numbers and R^n be the Euclidean space of all n-dimensional real vectors. Define Ω as a compact subset of R^n and Ψ(Ω) as the set of admissible control laws (Al-Tamimi et al. 2008; Zhang et al. 2009; Wei et al. 2016) on Ω. I_n represents the n × n identity matrix and the superscript "T" represents the transpose operation.

10.2 Problem Statement

In this chapter, we consider a type of discrete-time unknown nonlinear systems defined as

x_{k+1} = F(x_k, μ_k),  k ∈ N,  (10.1)

where F(·,·) is a continuous and unknown system function, x_k ∈ R^n is the state variable, and μ_k ∈ R^m is the control law. Without loss of generality, two reasonable assumptions are given as follows.


Assumption 10.1 The system function F(·) is controllable and differentiable with respect to its arguments in the sense that there exists a continuous control law on Ω that asymptotically stabilizes the system.

Assumption 10.2 For the system (10.1), all system states x_k and control inputs μ_k are observable and measurable.

Considering the tracking control problem, we aim to design an optimal feedback controller that generates the control law μ(x_k) so that the system state x_k can track the desired trajectory. In the following, we define the reference trajectory as

β_{k+1} = ψ(β_k),  (10.2)

where β_k ∈ R^n and ψ(·) is a differentiable function with regard to β_k. We assume that there exists a steady control vector μ̆_k with respect to the reference trajectory, which satisfies β_{k+1} = F(β_k, μ̆_k) and can be solved. Our goal is to obtain the value of the steady control μ(β_k). For known affine systems, the steady control is easy to obtain. Unfortunately, we find that achievements are few for solving the steady control of the nonaffine system with unknown system dynamics. In this chapter, we shall use a mathematical method to obtain the steady control μ(β_k). For facilitating the implementation of the near-optimal tracking, we define the tracking error as follows:

e_k = x_k − β_k.  (10.3)

Then, we define the corresponding tracking control as

μ(e_k) = μ(x_k) − μ(β_k).  (10.4)

By integrating the above formulas, one has

e_{k+1} = F(e_k + β_k, μ(e_k) + μ(β_k)) − ψ(β_k).  (10.5)

The key idea of the optimal tracking control is to attenuate the tracking error to zero. Therefore, the tracking problem of the original system can be transformed into the regulation problem of the error system (10.5). Considering the optimal regulation problem, we assume that the system (10.5) is controllable in the sense that there is at least one continuous control law μ(e_k) to asymptotically stabilize the system. Our goal is to find a state feedback control sequence that ensures e_k → 0 as k → ∞ and minimizes the cost function described by

T(e_k) = Σ_{j=k}^{∞} U(e_j, μ(e_j)),  (10.6)

where k is the current time, j = k, k+1, k+2, … represents the time sequence after the kth time, U(e_j, μ(e_j)) > 0 is the positive definite utility function at the jth time, and T(e_k) is the sum of all utility functions at and after the time step k. Herein, we let U(e_j, μ(e_j)) = e_j^T Q e_j + μ^T(e_j) R μ(e_j), where the matrices Q ∈ R^{n×n} and R ∈ R^{m×m} are the positive definite state matrix and control matrix, respectively. It is worth mentioning that the values of Q and R are not unique. The values of Q and R are required to ensure the convergence of the cost function. For convenience, we let Q(e_j) = e_j^T Q e_j and R(μ(e_j)) = μ^T(e_j) R μ(e_j). The cost function T(e_k) serves as a Lyapunov function with T(0) = 0.

Definition 10.1 A control law μ(e_k) is said to be admissible (Al-Tamimi et al. 2008; Zhang et al. 2009; Li and Liu 2012) with respect to (10.6) on Ω if μ(e_k) is continuous on Ω, ∀e_k ∈ Ω, μ(0) = 0, μ(e_k) stabilizes (10.5) on Ω, and T(e_0) is finite, ∀e_0 ∈ Ω.

Assumption 10.3 For the system (10.5), there exists at least one admissible control law μ(e_k), i.e., Ψ(Ω) ≠ ∅, ∀e ∈ Ω (Li and Liu 2012).

Extending Eq. (10.6) yields

T(e_k) = Q(e_k) + R(μ(e_k)) + Σ_{j=k+1}^{∞} U(e_j, μ(e_j)) = Q(e_k) + R(μ(e_k)) + T(e_{k+1}).  (10.7)

The minimal value of T(e_k) in (10.7) is the optimal cost function T*(e_k). In this case, the tracking control minimizing the cost function T(e_k) is called the optimal tracking control policy μ*(e_k), which can stabilize the error system (10.5). Therefore, the optimization problem is to find the optimal tracking control μ*(e_k). Based on the well-known Bellman's optimality principle, the optimal cost function T*(e_k) satisfies the following HJB equation:

T*(e_k) = min_{μ(e_k)} { U(e_k, μ(e_k)) + T*(e_{k+1}) }.  (10.8)

Correspondingly, the gradient of the right-hand side of (10.8) with respect to μ(e_k) is given as

∂U(e_k, μ(e_k))/∂μ(e_k) + (∂e_{k+1}/∂μ(e_k))^T ∂T*(e_{k+1})/∂e_{k+1} = 0,  (10.9)

which gives the optimal tracking control policy as follows:

μ*(e_k) = −(1/2) R^{−1} (∂e_{k+1}/∂μ(e_k))^T ∂T*(e_{k+1})/∂e_{k+1}.  (10.10)

In the light of (Zhang et al. 2014; Dierks and Jagannathan 2012), the Hamiltonian of the optimal tracking control problem is defined as

H(T, e, μ(e_k)) = T(e_{k+1}) + U(e_k, μ(e_k)) − T(e_k).  (10.11)


Furthermore, the optimal cost function T*(e_k) and the optimal tracking control input μ*(e_k) satisfy the discrete-time nonlinear Lyapunov equation

H(T*, e, μ*(e_k)) = T*(e_{k+1}) + U(e_k, μ*(e_k)) − T*(e_k),  (10.12)

where H(T*, e, μ*(e_k)) = 0. As is known to all, the optimal cost function and the optimal tracking control policy cannot be solved precisely. On this occasion, the near-optimal solution of the HJB equation is pursued instead of the analytical optimal solution. In the following, we shall propose a composite control approach to obtain the approximate solution of the HJB equation. First, the accelerated generalized value iteration algorithm is applied to obtain the suboptimal tracking control in an offline manner, which ensures that the tracking control is admissible. Then, the online near-optimal control mechanism is utilized to ameliorate the offline tracking control and enhance the adaptive ability.

10.3 Offline Learning of the Pre-designed Controller

It is knotty to attain the initial admissible tracking control for unknown industrial systems. Therefore, we decide to directly obtain the near-optimal tracking control policy by using the value iteration algorithm. However, the traditional value iteration algorithm converges slowly and requires a zero initial cost function. Consequently, an accelerated generalized value iteration algorithm is described to confront the above defects. In the following, we let the sequences of the iterative cost function and the iterative tracking control policy be expressed as {V_i(e_k)} and {ν_i(e_k)}, where the iteration index i ∈ N. Define the initial cost function as V_0 = e_k^T Λ e_k, where Λ is a positive semi-definite matrix. Then, the initial tracking control policy ν_0(e_k) is derived as

ν_0(e_k) = arg min_{μ(e_k)} { U(e_k, μ(e_k)) + V_0(e_{k+1}) }.  (10.13)

The iterative cost function is computed by

V_1(e_k) = min_{μ(e_k)} { U(e_k, μ(e_k)) + V_0(e_{k+1}) } = U(e_k, ν_0(e_k)) + V_0(x_{k+1} − β_{k+1}),  (10.14)

where x_{k+1} = F(x_k, ν_0(e_k) + μ(β_k)). Considering i = 1, 2, …, the generalized value iteration algorithm iterates between

ν_i(e_k) = arg min_{μ(e_k)} { U(e_k, μ(e_k)) + V_i(e_{k+1}) }  (10.15)

and

10.3 Offline Learning of the Pre-designed Controller

247

Vi+1 (ek ) = min {U (ek , μ(ek )) + Vi (ek+1 )} μ(ek )

= U (ek , νi (ek )) + Vi (xk+1 − βk+1 ),

(10.16)

where xk+1 = F (xk , νi (ek ) + μ(βk )). Theorem 10.1 Assume that 0 ≤ V ∗ (ek+1 ) ≤ δU (e, μ(e)) (δ < ∞) holds uniformly and the initial cost function satisfies δV ∗ (ek ) ≤ V0 (ek ) ≤ δV ∗ (ek ), where 0 ≤ δ ≤ 1 ≤ δ < ∞. If Vi (ek ) and νi (ek ) are updated in the light of (10.13)–(10.16), then the sequence {Vi (ek )} approximates the optimal cost function according to the inequality  1+

 δ−1  ∗ δ−1  ∗ V (ek ) ≤ Vi (ek ) ≤ 1 + V (ek ). −1 i (1 + δ ) (1 + δ −1 )i

(10.17)

Proof We prove the left-hand side of (10.17) by mathematical induction. For i = 0, it is observed that δV ∗ (ek ) ≤ V0 (ek ). Considering i = 1, we have V1 (ek ) = min {U (ek , μ(ek )) + V0 (ek+1 )} μ(ek )   ≥ min U (ek , μ(ek )) + δV ∗ (ek+1 ) μ(ek )

   δ − 1 1 − δ ∗ U (ek , μ(ek )) + δ + V (ek+1 ) 1+δ μ(ek ) 1+δ 1+δ  δ−1  ∗ V (ek ). (10.18) = 1+ 1 + δ −1 ≥ min

Assume that the relation in (10.17) holds for i − 1. Considering i reveals   Vi (ek ) = min U (ek , μ(ek )) + Vi−1 (ek+1 ) μ(ek )

  ≥ min U (ek , μ(ek )) + 1 +

 δ−1 V ∗ (ek+1 ) μ(ek ) (1 + δ −1 )i−1  δ i−1 (δ − 1)  ∗ δU (e + , μ(e )) − V (e ) k k k+1 (1 + δ)i    δ−1  min U (ek , μ(ek )) + V ∗ (ek+1 ) = 1+ −1 i (1 + δ ) μ(ek )  δ−1  ∗ V (ek ). = 1+ (1 + δ −1 )i

(10.19)

Note that the right-hand side can be derived by the same procedure and the  proof is omitted. According to (10.17), when i → ∞, we can obtain limi→∞ 1 + 

∗   δ−1

δ−1 V ∗ (ek ) = limi→∞ 1 + (1+δ = V ∗ (ek ). Define V∞ (ek ) = −1 )i V (ek ) (1+δ −1 )i ∗ limi→∞ Vi (ek ), and we can get V∞ (ek ) = V (ek ). Since the Ω is compact, we can obtain the uniform convergence of the cost function (Li and Liu 2012). Therefore, the proof is completed. 

248

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

In this chapter, in order to accelerate the value iteration process, an acceleration factor is introduced to update the cost function as follows:   V˘i+1 (ek ) = min U (ek , μ(ek )) + V˘i (ek+1 ) μ(ek )     + (i) min U (ek , μ(ek )) + V˘i (ek+1 ) − V˘i (ek ) , μ(ek )

(10.20)

where (i) ≥ 0 is a variable function related to i. It should be noted that (i) tends to zero quickly as the iteration index i increases. In this chapter, we focus on obtaining the near-optimal initial admissible tracking control more quickly. The acceleration factor is only an auxiliary tool, which does not affect the value of the optimal cost function. The acceleration factor plays an important role in ameliorating the iteration process, which broadens the scope of practical application, especially the optimal tracking problem. Correspondingly, the iterative tracking control policy ν˘ i (ek ) is obtained by replacing Vi (ek+1 ) in (10.15) with V˘i (ek+1 ). In the offline phase, the accelerated generalized value iteration algorithm is implemented based on the system data without the accurate system dynamics. Therefore, the model neural network is established to learn the system dynamics. Note that the steady control is solved based on the expression of the model network. Via datadriven process, the model network with Hm hidden layer neurons is used to learn the dynamics of the controlled system. Considering the state vector xk and the control input μ(xk ), the model network approximates the corresponding state xk+1 as

xˆk+1 = (Wk2 )T Θm (Wk1 )T [xkT , μT (xk )]T + Bk1 + Bk2 ,

(10.21)

where Wk1 ∈ R(n+m)×H m is the input-to-hidden matrix, Wk2 ∈ RH m ×n is the hiddento-output weight matrix, Θm (·) is the activation function, and Bk1 as well as Bk2 are the threshold vectors. Define the system identification error of the model network as x˜k+1 = xˆk+1 − xk+1 . The performance measure is expressed as follows: Em =

1 T x˜ x˜k+1 . 2 k+1

(10.22)

In order to improve the modeling precision, the MATLAB neural network toolbox is adopted to train the model network with the Levenberg–Marquardt training algorithm. The technology and theoretical foundation of the neural network toolbox are very mature and are not discussed in detail here. After ending the pre-training procedure, weights and thresholds of the model network are not updated. Based on the expression of the trained model network, the desired trajectory βk+1 is rewritten as

βk+1 = (Wk2 )T Θm (Wk1 )T [βkT , μT (βk )]T + Bk1 + Bk2 .

(10.23)

Since other parameters except μ(βk ) are known, we can utilize the numerical method to obtain the steady control based on (10.23). It is worth mentioning that the

10.4 Online Near-Optimal Tracking Control with Stability Analysis

249

pre-designed controller is not used to control the system in the offline stage, but only provides an initial admissible tracking control for online control.

10.4 Online Near-Optimal Tracking Control with Stability Analysis In the online phase, the designed controller is utilized to control the system for perfect tracking, and its parameters are updated as the time step k increases. The critic and action neural networks are constructed to approximate the cost function and the tracking control policy, respectively. In the following, we illustrate the updating process of weight parameters and the uniformly ultimately bounded stability of weight estimation errors for the critic and action networks.

10.4.1 The Critic Network The critic neural network with Hc neurons of the hidden layer is utilized to estimate the cost function. The target cost function is calculated by T ˘ H e ) + k , T (ek ) = T ϑ( c k

(10.24)

where ∈ Rn and H c ∈ Rn×H c are the ideal hidden-to-output and input-to˘ hidden weights, respectively. ϑ(·) is the bounded activation function, and k ∈ R is the bounded approximation error. The input-to-hidden weight is randomly initialT ˘ ized and not updated. For simplicity, we let ϑ( H c ek ) be denoted by ϑ(ek ). Letting ˆ be the estimated value of , then the estimated output of the critic network is ˆ kT ϑ(ek ). Tˆ (ek ) =

(10.25)

In terms of the relation between (10.12) and (10.25), we have the approximate Hamiltonian error ˆ kT (ϑ(ek+1 ) − ϑ(ek )) ηk = U (ek , μ(ek )) + ˆ kT Δϑ(ek ). = U (ek , μ(ek )) +

(10.26)

Correspondingly, the performance measure of the critic network is expressed as E c = 21 ηkT ηk . Then, we tune the weight of the critic network as follows:

250

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

∂ Ec ∂ ˆk αc Δϑ(ek ) ηT , = ˆk − 1 + Δϑ T (ek )Δϑ(ek ) k

ˆ k+1 = ˆ k − αc

(10.27)

where αc ∈ (0, 1) is the learning rate of the critic network. Substituting (10.24) into (10.7) yields (10.28) U (ek , μ(ek )) = − T Δϑ(ek ) − Δk , where Δk = k+1 − k . Letting the weight estimation error of the critic network be ˆ k − , we have the following equation: ˜k = ˜ k+1 =



I−

αc Δϑ(ek )Δϑ T (ek )  αc Δϑ(ek )Δk ˜k + . 1 + Δϑ T (ek )Δϑ(ek ) 1 + Δϑ T (ek )Δϑ(ek )

(10.29)

10.4.2 The Action Network The tracking control law is approximated by the action neural network with the structure of n–Ha –m as follows: T e ) + ςk , μ(ek ) = ωT θ˘ (ωH a k

(10.30)

where ω ∈ RH a ×m is the hidden-to-output weight vector, ωH a ∈ Rn×H a is the inputto-hidden weight and remains unchanged. Note that the initial ω and ωH a are admissible and obtained from the offline phase. θ˘ (·) is the bounded activation function and ςk is the approximation error of the action network. For simplicity, we replace T θ˘ (ωH ek ) with θ (ek ). Define ωˆ as the approximate value of ω and we can obtain the a output of the action network as μ(e ˆ k ) = ωˆ kT θ (ek ).

(10.31)

According to (10.10), the ideal tracking control is derived as 1 μ(ek ) = − R −1 2



∂ek+1 ∂μ(ek )

T

∂ Tˆ (ek+1 ) . ∂ek+1

(10.32)

Remark 10.1 In this chapter, the partial derivative of ek+1 with respect to μ(ek ) is regarded as the approximate control matrix since the controlled plant is nonaffine and unknown. Moreover, the tracking problem is transformed to the regulation problem in the light of the tracking error dynamics. However, we only model the original system to evaluate the system dynamic and generate the approximate steady control in (10.21) and (10.23). Therefore, it is still difficult to obtain the approximate control

10.4 Online Near-Optimal Tracking Control with Stability Analysis

251

matrix. In order to surmount the difficulty of solving ∂ek+1 /∂μ(ek ), a converted relation is introduced based on the model network in the sequel. Lemma 10.1 Considering the control input μ(xk ) in (10.1) and the tracking control μ(ek ) in (10.4), we have the following equation: ∂ xk+1 ∂ek+1 = . ∂μ(ek ) ∂μ(xk )

(10.33)

Proof Considering (10.1) and (10.5), we have

∂ F (ek + βk , μ(ek ) + μ(βk )) − ψ(βk ) ∂ek+1 = ∂μ(ek ) ∂μ(ek ) ∂F (ek + βk , μ(ek ) + μ(βk )) ∂(μ(ek ) + μ(βk )) = ∂(μ(ek ) + μ(βk )) μ(ek ) ∂F (ek + βk , μ(ek ) + μ(βk )) = ∂(μ(ek ) + μ(βk )) ∂ xk+1 . = ∂μ(xk )

(10.34)

Note that the steady control is solved in advance and has nothing to do with the tracking control. This completes the proof.  In the light of (10.34), the practical error function of the action network is derived as ζk = where Zm = 21 R −1

ωˆ kT θ (ek )



∂ x˜k+1 ∂μ(xk )

T

1 + R −1 2



∂ xk+1 ∂μ(xk )

T

∂ Tˆ (ek+1 ) + Zm , ∂ek+1

(10.35)

∂ Tˆ (ek+1 ) . Our goal is to minimize the error performance ∂ek+1

measure E a = 21 ζkT ζk . According to the gradient-based adaptation, we update the weight of the action network in the form ∂ Ea ∂ ωˆ k αa θ (ek ) ζ T, = ωˆ k − 1 + θ T (ek )θ (ek ) k

ωˆ k+1 = ωˆ k − αa

(10.36)

where αa ∈ (0, 1) is the learning rate of the action network. Considering (10.24) and (10.30), one has 1 −1 R 2



∂ xk+1 ∂μ(xk )

T

∂ϑ T (ek+1 ) + k+1 + ωT θ (ek ) + ςk = 0. ∂ek+1

(10.37)

252

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

The weight estimation error of the action network is denoted as ω˜ k = ωˆ k − ω. Extending Eq. (10.36) yields αa θ (ek )θ T (ek ) αa θ (ek ) ωˆ k − = I− T 1 + θ (ek )θ (ek ) 1 + θ T (ek )θ (ek ) T 1 −1  ∂ xk+1 T ∂ϑ T (ek+1 ) ac ˜ k − ςk , × R 2 ∂μ(xk ) ∂ek+1

ω˜ k+1

(10.38)

 T ∂ xk+1 where ςkac = 21 R −1 ∂μ(x k+1 + ςk − Zm . For convenience, we define G k as k) the control matrix with G k = ∂ xk+1 /∂μ(xk ) in the sequel. The system function F (xk , μk ) is differentiable with respect to its arguments on Ω in the sense that G k is bounded.

10.4.3 Uniformly Ultimately Bounded Stability of Weight Estimation Errors In what follows, we consider the unknown closed-loop system (10.1), where system states and control inputs are assumed to be observable. According to (Mu et al. 2017; Igelnik and Pao 1995), it is concluded that approximate errors of neural networks can be arbitrarily small with bounded activation functions, the sufficiently large number of hidden layer neurons, and fixed input-to-hidden weights. In the following, some commonly used assumptions and lemmas of the adaptive critic field are provided to analyze the uniformly ultimately bounded stability of weight estimation errors via the Lyapunov theory. Lemma 10.2 According to properties of the matrix trace operation, we can obtain the following equalities: (1) tr{AT } = tr{A}, where A ∈ Ra×a ; (2) tr{BCD} = tr{CDB}, where B ∈ Rb×c , C ∈ Rc×d and D ∈ Rd×b ; (3) tr{L1T L2 } = tr{L2T L1 } = L1T L2 = L2T L1 , where L1 ∈ Rn and L2 ∈ Rn . Lemma 10.3 Based on the Cauchy–Schwarz inequality, the vectors L1 and L2 satisfy (L1 + L2 )T (L1 + L2 ) ≤ 2(L1T L1 + L2T L2 ). Assumption 10.4 (1) The ideal constant weights of the critic and action networks are upper bounded, i.e., ≤ M and ω ≤ ω M , where M and ω M are positive constants. (2) The activation functions ϑ, θ , and approximation errors k as well as ςk are bounded, i.e., ϑ(·) ≤ ϑ M , θ (·) ≤ θ M , k ≤  M and ςk ≤ ς M , where ϑ M , θ M ,  M , and ς M are positive constants. (3) For the critic network, the gradients of the approximation error and the activation and function with respect to their parameters are upper bounded by k ≤  M

10.4 Online Near-Optimal Tracking Control with Stability Analysis

253

ϑk ≤ ϑ M , respectively. For the system dynamic and the model network, we assume that the control matrix G k is upper bounded with G k ≤ G M .

Theorem 10.2 Let the cost function and the tracking control input be provided by (10.25) and (10.31). Let the weight update laws for the critic and action networks be provided by (10.27) and (10.36). Then, the associated weight estimation errors of neural networks ˜ k and ω˜ k are uniformly ultimately bounded. Moreover, the tracking error system is regulated by using the near-optimal control method. Proof In this part, we define a Lyapunov function as follows: V = αc−1 Vc +

(2 +

2αc Δϑ 2 2 αa )Ξa (1 +

Δϑ¯ 2 )

Va ,

(10.39)

2 ˜ kT ˜ k }, Va = tr{ω˜ kT ω˜ k }, Ξa = (λmax (R −1 )G M ϑ M ) , and Δϑ ≤ where Vc = tr{ Δϑk ≤ Δϑ. The difference equation of the Lyapunov function is deduced as follows:

ΔV = αc−1 ΔVc +

2αc Δϑ 2 ΔVa . (2 + αa2 )Ξa (1 + Δϑ¯ 2 )

(10.40)

In practical application, the Lyapunov function is not unique and difficult to define. Via design experience, we usually define the form of ΔV after obtaining the ΔVc and ΔVa . In what follows, two parts are provided to deduce the difference equations of the Lyapunov function. Considering the weight evaluation error ˜ k+1 in (10.29), then the difference of Vc is derived as T ˜ k+1 ˜ k+1 } − tr{ ˜ kT ˜ k} ΔVc = tr{  T ˜ k Δϑ(ek )Δϑ T (ek ) ˜k 2αc = tr ˜ kT ˜k − 1 + Δϑ T (ek )Δϑ(ek ) ˜ T Δϑ(ek )Δϑ T (ek )Δϑ(ek )Δϑ T (ek ) ˜k α2 + c k T 2 (1 + Δϑ (ek )Δϑ(ek )) ˜ kT Δϑ(ek )Δk ˜ kT Δϑ(ek )Δϑ T (ek )Δϑ(ek )Δk 2αc2 2αc − + 1 + Δϑ T (ek )Δϑ(ek ) (1 + Δϑ T (ek )Δϑ(ek ))2  T 2 T α Δ Δϑ (ek )Δϑ(ek )Δk − tr{ ˜ kT + c k ˜ k }. (10.41) (1 + Δϑ T (ek )Δϑ(ek ))2

By using Lemmas 10.1 and 10.2, we can derive the following inequality:

254

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

ΔVc ≤ − + + + ≤ −

αc2 ˜ kT Δϑ(ek ) 2 ˜ kT Δϑ(ek ) 2 2αc + 1 + Δϑ T (ek )Δϑ(ek ) 1 + Δϑ T (ek )Δϑ(ek ) αc ΔkT Δk αc ˜ kT Δϑ(ek ) 2 + 1 + Δϑ T (ek )Δϑ(ek ) 1 + Δϑ T (ek )Δϑ(ek ) αc2 αc2 ΔkT Δk ˜ kT Δϑ(ek ) 2 + 1 + Δϑ T (ek )Δϑ(ek ) 1 + Δϑ T (ek )Δϑ(ek ) αc2 ΔkT Δk 1 + Δϑ T (ek )Δϑ(ek ) αc (1 − 2αc )Δϑ 2 2 ˜ k 2 + αc (1 + 2αc )Δ M , 2 1 + Δϑ

(10.42)

where ˜ kT Δϑ(ek ) 2 = ( ˜ kT Δϑ(ek ))T Δϑ T (ek ) ˜ k and Δk ≤ Δ M . T 2 ˜ k 2 = ˜ kT ˜ k , then the difLetting ω˜ k θ (ek ) = (ω˜ kT θ (ek ))T (ω˜ kT θ (ek )) and ference of Va is deduced by T ω˜ k+1 } − tr{ω˜ kT ω˜ k } ΔVa = tr{ω˜ k+1

2αa ω˜ kT θ (ek )θ T (ek )ω˜ k = tr ω˜ kT ω˜ k − 1 + θ T (ek )θ (ek )

αa2 ω˜ kT θ (ek )θ T (ek )θ (ek )θ T (ek )ω˜ k

2 (1 + θ T ek )θ (ek ) T 2αa ω˜ kT θ (ek ) 1 −1 T ∂ϑ T (ek+1 ) ac R Gk − ˜ k − ςk 1 + θ T (ek )θ (ek ) 2 ∂ek+1 T 2α 2 ω˜ T θ (ek )θ T (ek )θk 1 −1 T ∂ϑ T (ek+1 ) ac R + a k G ˜ − ς k

2 k k 2 ∂ek+1 1 + θ T (ek )θ (ek ) α 2 θ T (ek )θ (ek ) 1 −1 T ∂ϑ T (ek+1 ) ac R Gk + ˜ k − ςk a

2 2 ∂ek+1 1 + θ T (ek )θ (ek ) T  1 −1 T ∂ϑ T (ek+1 ) − tr{ω˜ kT ω˜ k }. R Gk × ˜ k − ςkac (10.43) 2 ∂ek+1 +

Based on Lemmas 10.1 and 10.2, we can further obtain the following inequality:

10.4 Online Near-Optimal Tracking Control with Stability Analysis

255

αa2 ω˜ kT θ (ek ) 2 2αa ω˜ kT θ (ek ) 2 + 1 + θ T (ek )θ (ek ) 1 + θ T (ek )θ (ek ) α 2 ω˜ T θ (ek ) 2 + 4(ςkac )T ςkac + a Tk 1 + θ (ek )θ (ek ) T ∂ϑ T (ek+1 ) ∂ϑ T (ek+1 ) R −1 G Tk + R −1 G Tk ˜ k − ςkac ˜ k − ςkac ∂ek+1 ∂ek+1

ΔVa ≤ −

αa2 ω˜ kT θ (ek ) 2 + 2αa2 (ςkac )T ςkac 1 + θ T (ek )θ (ek ) T T T αa2 −1 T ∂ϑ (ek+1 ) ac −1 T ∂ϑ (ek+1 ) ac R Gk R Gk + ˜ k − ςk ˜ k − ςk 2 ∂ek+1 ∂ek+1

−αa (2 − 3αa ) T αa2 2 2 λmax (R −1 )G M ϑ M ≤ ω ˜ θ (e ) + 1 + ˜ k 2 k k 2 2 1 + θM +

+ (4 + 2αa2 )(ςkac )T ςkac ,

(10.44)

ac . where ςkac satisfies ςkac ≤ ς M Substituting (10.42) and (10.44) into (10.40) yields

ΔV =

2αc Δϑ 2 ΔVa + αc−1 ΔVc (2 + αa2 )Ξa (1 + Δϑ¯ 2 )

≤ − +

2αc αa (2 − 3αa )Δϑ 2 ω˜ kT θ (ek ) 2 2 (2 + αa2 )Ξa (1 + Δϑ¯ 2 )(1 + θ M ) αc Δϑ 2 2

˜ k 2 +

ac 2 ) 4αc Δϑ 2 (ς M Ξa (1 + Δϑ¯ 2 )

1 + Δϑ (1 − 2αc )Δϑ 2 2 − ˜ k 2 + (1 + 2αc )Δ M . 1 + Δϑ¯ 2

(10.45)

4α Δϑ 2 (ς ac )2

2 , then the difference of V satisfies ΔV ≤ 0 Letting W = Ξc (1+Δϑ¯M2 ) + (1 + 2αc )Δ M a when the following inequality

 ω˜ kT θ (ek ) ≥

(2 + αa2 )Ξa (1 + Δϑ¯ 2 )(1 + θ¯ 2 )W 2αc αa (2 − 3αa )Δϑ 2

(10.46)

or  ˜ k ≥

(1 + Δϑ¯ 2 )W (1 − 3αc )Δϑ 2

(10.47)

holds. Note that the learning rates of the critic and action networks satisfy 0 < αc < 1/3 and 0 < αa < 2/3. In terms of the Lyapunov extension theorem, it is demon-

256

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

strated that weight estimation errors of the critic and action networks are uniformly ultimately bounded. Moreover, the error system is regulated in a near-optimal manner given by  μ(e ˆ k )−μ∗ (ek ) ≤

(2 + αa2 )Ξa (1 + Δϑ¯ 2 )(1 + θ¯ 2 )W + ςM . 2αc αa (2 − 3αa )Δϑ 2

(10.48)

Therefore, it is observed that the tracking control problem is solved by using the online near-optimal control mechanism. This completes the proof. 

10.4.4 Summary of the Proposed Tracking Approach In what follows, we summarize the implementation process of the proposed hybrid intelligent control approach. The hybrid intelligent control algorithm consists of two parts: obtaining the admissible tracking control offline and updating the parameters online. First, in the offline phase, we can obtain the near-optimal cost function and the tracking control policy in (10.8) and (10.10) by iteratively calculating the tracking control policy and the cost function. Note that the near-optimal tracking control policy is regarded as the initial admissible tracking control policy of online control. Second, in the online phase, we use the initial admissible tracking control policy to the system at k = 0 and update the action network weight of the tracking control as shown in (10.36) with the increase of time k. Based on neural networks, the online control process of the proposed method is summarized in Algorithm 6. Algorithm 6 The Procedure of Online Control Design 1: Let k = 0. Define Nk as the maximum time step. Give the initial state xk and the initial reference trajectory βk . 2: Compute the βk+1 according to (10.2) and the steady control μ(βk ) according to (10.23). 3: Compute the error ek = xk − βk according to (10.3). 4: Compute the cost function Tˆ (ek ) according to (10.25) and tune the weight of the critic network according to (10.27). 5: Compute the tracking control μ(e ˆ k ) according to (10.31) and tune the weight of the action network according to (10.36). 6: Compute the control law μ(xk ) = μ(e ˆ k ) + μ(βk ) according to (10.4). 7: Compute the new state xk+1 according to (10.1). 8: Let k = k + 1, xk = xk+1 , and rk = rk+1 . If k ≤ Nk , go to Step 2; otherwise, go to Step 9. 9: Stop.

10.5 Experimental Simulation

257

10.5 Experimental Simulation In this section, two numerical examples with industrial application backgrounds are involved to illustrate the control performance of the proposed hybrid intelligent control approach. Note that there is no prior knowledge about the internal dynamics of complex systems but only system data. In this chapter, the simulation software is MATLAB R2019a.

10.5.1 Application to a Torsional Pendulum Device We consider a class of inverted pendulum plants with hyperbolic tangent input as follows (Wang and Qiao 2019): ρ˙ = τ, τ˙ = −

κ2 κ3 Mgl sin(ρ) − ρ˙ + (tanh(μ(x)) + μ(x)), κ1 κ1 κ1

(10.49)

where M = 0.5 is the mass; g = 9.8 is the acceleration due to gravity; l = 1 is the length of the pendulum; κ1 = 0.8 is the rotary inertia; κ2 = 0.2 is the frictional factor; κ3 = 1 is the parameter of the control input; ρ, τ , and μ(x) are the current angle, the relative angular velocity, and the control input vector, respectively. Let the sampling interval Δt = 0.1 and x = [ρ, τ ]T . Then, the discretized version is given by  xk[1] + 0.1xk[2] −0.6125sin(xk[1] ) + 0.975xk[2]   0 , + 0.125(tanh(μ(xk )) + μ(xk ))

 xk+1 =

(10.50)

where xk = [xk[1] , xk[2] ]T ∈ R2 is the system state, μ(xk ) ∈ R is the control input, and x0 = [−0.2, 0.8]T . The reference trajectory is selected as  βk[1] + 0.1βk[2] , = −0.2492βk[1] + 0.9888βk[2] 

βk+1

(10.51)

where β0 = [−0.1, 0.2]T . According to the uniformly ultimately bounded stability, the learning parameters of the proposed algorithm are elaborated in Table 10.1. In the light of the common method of the adaptive critic field, the tuning parameters depend on experience and experiment, and are not unique. In the following, all the activation functions are set as hyperbolic functions, that is, ϑ(·) = θ (·) = tanh(·).

258

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

Table 10.1 Parameter values used in the hybrid intelligent control algorithm Q R Λ Hm Hc αm αc I2

I1

2I2

10

8

0.02

αa

0.05

0.02

10-7

2.5

The training error

2

1.5

1

0.5

0 0

100

200

300

400

500

600

700

800

900

1000

Samples

Fig. 10.1 The training error of the model network

For the critic and action networks, the weights of input-to-hidden are randomly initialized in [−1, 1] and then remain unchanged. In the offline phase, we train the model network for 400 iteration steps with 1000 samples by using the MATLAB neural network toolbox. Then, 500 data samples are used to test the trained model network. The training and testing results of the model network are displayed in Figs. 10.1 and 10.2. According to the new accelerated generalized value iteration algorithm, we iteratively calculate the tracking control policy and the cost function in (10.20) to obtain the initial admissible tracking control policy. Note that the stopping criterion of outer loop iteration is |V˘i+1 (ek ) − V˘i (ek )| < 10−4 . After ending the iterative process, the traditional cost function ( (i) = 0) and the accelerated cost function ( (i) = 6e−0.05i ) are shown in Fig. 10.3. It is seen that faster convergence exists in the accelerated algorithm than in the traditional one, i.e., 29 iterations versus 123 iterations, which indicates the effectiveness of the accelerated generalized value iteration algorithm. After 29 iterations, the corresponding tracking control policy μ29 (ek ) is used as the initial admissible tracking control to regulate the error system online.

10.5 Experimental Simulation 10-7

1.5

The testing error

259

1

0.5

0 0

50

100

150

200

250

300

350

400

450

500

Samples Fig. 10.2 The testing error of the model network 18 16 The traditional cost function The accelerated cost function

14

Cost functions

12 10 8 6 4 2 0 0

50

100

Iteration steps Fig. 10.3 Convergence curves of iterative cost functions

150

10 Data-Driven Hybrid Intelligent Optimal Tracking Design … Weights of the critic network

260

0.8 0.6 0.4 0.2 0 0

20

40

60

80

100

120

140

160

180

120

140

160

180

Weights of the action network

Time steps

0.5

0

-0.5 0

20

40

60

80

100

200

Time steps

Fig. 10.4 Weights of the critic network and the action network

In the online adaptive phase, we utilize the online near-optimal tracking control approach to drive the system state to track the desired trajectory for Nk = 200 time steps. The implementation process of the online near-optimal tracking control approach is given in Algorithm 6. For the weight update in (10.27) and (10.36), the critic and action networks are updated until the number of iterations satisfy the maximum iteration value (40 iterations for the critic network and 30 iterations for the action network) or the tolerant errors go down below 10−6 . When the error is very small, it is considered that the critic and action networks realize the effective approximation. Moreover, an additional disturbance is added to the control input with the form μ(ek ) + μ(βk ) + μ(dk ) for the first 60 time steps, where μ(dk ) is randomly selected in [−0.2, 0.2] over time. The trajectories of weights, system states, and tracking errors are shown in Figs. 10.4 and 10.5, which presents the good performance of the proposed approach. Based on the numerical method in (10.23), the function “fsolve” in MATLAB is applied to attain the steady control. Then, the steady control μ(βk ), the control input μ(xk ), and the tracking control μ(ek ) are described in Fig. 10.6. The trend of the tracking control and the error rapidly to zero illustrates that original system states successfully track the reference trajectory. For comparison, we also employ the neural optimal tracking control to the torsional pendulum device in (Wang et al. 2021d). In that chapter, the controller was designed without considering the acceleration factor and the online update. In a word, the design process of the controller is only to iteratively calculate the tracking control policy in (10.15) and the cost function in (10.16). Using the same parameters and

10.5 Experimental Simulation

261

x

0.5

(b) [1]

The system state x [2]

The system state x [1]

(a) [1]

0

-0.5 0

50

100

150

x

0 -0.5 -1 0

50

The tracking error e[2]

0.05 0 -0.05 -0.1

100

150

200

150

200

Time steps (d)

0.6

0.1

[2]

0.5

200

Time steps (c) The tracking error e[1]

1

[2]

0.4 0.2 0 -0.2

0

50

100

Time steps

150

200

0

50

100

Time steps

Fig. 10.5 System states and tracking errors: a xk[1] ; b xk[2] ; c ek[1] ; d ek[2]

iterations, we can obtain the tracking control strategy μ(ek ). Considering the same additional disturbance, the tracking error curves are shown in Fig. 10.7. Clearly, we can see that the tracking errors of the hybrid intelligent optimal tracking control have less oscillation than that of the neural optimal tracking control. From this point, we can conclude that the proposed tracking approach has better adaptability and usability.

10.5.2 Application to a Wastewater Treatment Plant Wastewater treatment plays a significant role in handling water pollution and ameliorating water shortage. The Benchmark Simulation Model No. 1 (BSM1) (Alex et al. 2008) provides an ideal platform to verify the technologies and strategies designed for wastewater treatment. The simple structure diagram of the BSM1 is displayed in Fig. 10.8. The biochemical reaction tank and the secondary sedimentation tank are core components of the BSM1 model. The biochemical reaction tank consists of five units, where the concentration of dissolved oxygen in the fifth unit SO,5 and the nitrate nitrogen concentration in the second unit S N O,2 have a tremendous impact on nitrogen removal. In general, the designed controller is required to guarantee that SO,5 = 2 and S N O,2 = 1 by the oxygen transfer coefficient of the fifth unit K La,5 and the internal recycle Q a during the wastewater treatment process (Bo and Qiao 2015). It is worth

262

10 Data-Driven Hybrid Intelligent Optimal Tracking Design … (a)

( )

1 0 -1 0

20

40

60

80

1

(x)

100

120

140

160

180

200

120

140

160

180

200

120

140

160

180

200

Time steps (b)

0 -1 0

20

40

60

80

100

Time steps (c) (e)

0 -0.2 -0.4 0

20

40

60

80

100

Time steps

Fig. 10.6 Control inputs: a The steady control; b The control input; c The tracking control

The tracking error e[1]

(a) The neural optimal tracking control The hybrid intelligent optimal tracking control

0.1 0.05 0 -0.05 -0.1 0

20

40

60

80

100

120

140

160

180

200

Time steps (b)

The tracking error e[2]

0.6

The neural optimal tracking control The hybrid intelligent optimal tracking control

0.4 0.2 0 -0.2 0

20

40

60

80

100

Time steps

Fig. 10.7 Tracking errors: a ek[1] ; b ek[2]

120

140

160

180

200

10.5 Experimental Simulation

263

Fig. 10.8 The structure diagram of the BSM1

mentioning that the range of control law is relatively large, which is 0 < K La,5 < 240 and 0 < Q a < 92230. In the biochemical reaction tank, there are many complex nitrification and denitrification reactions. In addition, the mathematical model is impossible to construct because the complex properties including strong coupling, large timevarying, and strong interference exist in the industrial process. Therefore, the model network is established to reflect the dynamics of the BSM1. By collecting the input– output data, the model network is trained for 800 iteration steps with 26880 training data samples and 10000 testing samples via the MATLAB neural network toolbox. The corresponding training and testing results are shown in Figs. 10.9 and 10.10. Moreover, the relevant parameters are given in Table 10.2. After training the model network completely, we can get the steady control μ(βk ) by using the function “fsolve”. In order to obtain the admissible tracking control policy, we perform the accelerated generalized value iteration algorithm with V0 = ekT Λek , Λ = 4I2 . After ending the iteration, the traditional cost function ( (i) = 0) with 129 iterations and the accelerated cost function ( (i) = 8e−0.05i ) with 70 iterations are presented in Fig. 10.11, which reduces the amount of computation and further verifies the better performance of the new method. Based on the initial admissible tracking control policy, the online near-optimal tracking control is implemented to balance the controlled plant. For each update, the critic network and the action network are tuned with 30 training steps to ensure that the tolerant errors are below 10−6 . In order to verify the adaptability of the designed controller, we add a big interference term to the control input with the form μ(ek ) + μ(βk ) + μ(dk ) for the first 200 time steps, where μ(dk[1] ) and μ(dk[2] ) are randomly selected in [−20, 20] and [−100, 100] over time, respectively. After carrying out the experiment, the curves of weights, system states as well as tracking errors with Nk = 600 time steps are displayed in Figs. 10.12 and 10.13. Considering the weight ω ∈ R8×2 , we calculate the norm for the two elements of each row.

264

10 Data-Driven Hybrid Intelligent Optimal Tracking Design … 0.01 0.009 0.008

The training error

0.007 0.006 0.005 0.004 0.003 0.002 0.001 0 0

0.5

1

1.5

2

2.5 104

Samples

Fig. 10.9 The training performance of S O,5 and S N O,2 10-3

7

6

The testing error

5

4

3

2

1

0 0

1000

2000

3000

4000

5000

6000

Samples Fig. 10.10 The testing performance of S O,5 and S N O,2

7000

8000

9000 10000

10.5 Experimental Simulation

265

Table 10.2 Parameter values used in the hybrid intelligent control algorithm Q R Hm Hc αm αc 0.01I2

0.01I2

12

8

0.02

0.2

αa 0.05

5.8 5.6 The traditional cost function The accelerated cost function

5.4

Cost functions

5.2 5 4.8 4.6 4.4 4.2 4 3.8 0

100

200

300

400

500

600

Iteration steps Fig. 10.11 Convergence curves of iterative cost functions

Hence, there are eight curves in the action weight of Fig. 10.12. Meanwhile, the control inputs as well as the tracking controls are displayed in Fig. 10.14, which further demonstrates the feasibility of the developed intelligent control approach. In order to implement the comparison, we use the PID method to control the concentrations of dissolved oxygen and nitrate nitrogen. According to the trial-and-error method, the setting parameters of PID include: the proportion part is [10000, 30], the integration part is [1000, 5], and the differentiation part is [100, 1]. The tracking errors are displayed in Fig. 10.15. It is clear that the errors of the PID approach have bigger oscillations in some responses than that of the proposed method, which further verifies the superiority of the developed algorithm.

10 Data-Driven Hybrid Intelligent Optimal Tracking Design … Weights of the critic network

266

0.4 0.2 0 0

100

200

300

400

500

400

500

Weights of the action network

Time steps

0.6 0.4 0.2 0 0

100

200

300

600

Time steps

Fig. 10.12 Weights of the critic network and the action network (a)

(b)

4

The system state x [2]

The system state x [1]

3

[1]

x[1]

2

1

[2]

x[2]

3

2

1

0 0

400

0

600

The tracking error e[2]

0

-1

-2 0

200

400

Time steps

200

400

600

Time steps (d)

Time steps (c)

1

The tracking error e[1]

200

600

2

1

0 0

200

400

Time steps

Fig. 10.13 System states and tracking errors: a xk[1] ; b xk[2] ; c ek[1] ; d ek[2]

600

10.5 Experimental Simulation

267

(a)

240

(x[2])

(x[1])

(b)

2.925

220 200

2.92 2.915

180

2.91 0

200

400

0

600

Time steps (c)

0.1

200

400

600

Time steps (d)

0.02 0.01

(e[2])

0.05

(e[1])

104

2.93

0

0 -0.01

-0.05

-0.02

-0.1 0

200

400

600

0

200

Time steps

400

600

Time steps

The tracking error e[1]

Fig. 10.14 Control inputs: a μ(xk[1] ); b μ(xk[2] ); c μ(ek[1] ); d μ(ek[2] ) 1 The PID control The hybrid intelligent optimal tracking control

0

-1

0

100

200

300

400

500

600

Time steps The tracking error e[2]

3 The PID control The hybrid intelligent optimal tracking control

2 1 0 0

100

200

300

Time steps

Fig. 10.15 Tracking errors: a ek[1] ; b ek[2]

400

500

600

268

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

10.6 Concluding Remarks For a class of unknown nonlinear systems, a hybrid intelligent tracking control approach including the accelerated generalized value iteration algorithm and the online near-optimal tracking control is developed to solve the tracking control problem. First, we convert the tracking problem to the regulation problem. Then, we can obtain the steady control and the approximate control matrix based on the model network. Next, the admissible tracking control is obtained via the accelerated generalized value iteration algorithm. Finally, we use the online regulatory mechanism to tune network parameters in real time. Meanwhile, the uniformly ultimately bounded stability of approximation errors of neural networks is provided by the Lyapunov approach. Two simulation examples are involved to verify the feasibility and availability of the developed approach.

References Alex J, Benedetti L, Copp J, Gernaey KV, Jeppsson U, Nopens I, Pons MN, Rieger L, Rosen C, Steyer JP, Vanrolleghem P, Winkler S (2008) Benchmark simulation model no. 1 (BSM1), IWA task group on benchmarking of control strategies for WWTPs, London Al-Tamimi A, Lewis FL, Abu-Khalaf M (2008) Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B Cybern 38(4):943–949 Bo Y, Qiao J (2015) Heuristic dynamic programming using echo state network for multivariable tracking control of wastewater treatment process. Asian J Control 17(5):1654–1666 Dierks T, Jagannathan S (2012) Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Netw Learn Syst 23(7):1118–1129 Du J, Abraham A, Yu S, Zhao J (2014) Adaptive dynamic surface control with Nussbaum gain for course-keeping. Eng Appl Artif Intell 27:236–240 Ha M, Wang D, Liu D (2020a) Data-based nonaffine optimal tracking control using iterative DHP approach. In: Proceedings of 21st IFAC world congress, vol 53(2), pp 4246–4251 Ha M, Wang D, Liu D (2020b) Event-triggered adaptive critic control design for discrete-time constrained nonlinear systems. IEEE Trans Syst Man Cybern Syst 50(9):3158–3168 Ha M, Wang D, Liu D (2021) Generalized value iteration for discounted optimal control with stability analysis. Syst Control Lett 147:104847 Hou J, Wang D, Liu D, Zhang Y (2020) Model-free H ∞ optimal tracking control of constrained nonlinear systems via an iterative adaptive learning algorithm. IEEE Trans Syst Man Cybern Syst 50(11):4097–4108 Igelnik B, Pao Y (1995) Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans Neural Netw 6(6):1320–1329 Kang J, Meng W, Abraham A, Liu H (2014) An adaptive PID neural network for complex nonlinear system control. Neurocomputing 135:79–85 Kiumarsi B, Lewis FL (2015) Actor-critic-based optimal tracking for partially unknown nonlinear discrete-time systems. IEEE Trans Neural Netw Learn Syst 26(1):140–151 Li C, Ding J, Lewis FL, Chai T (2021) A novel adaptive dynamic programming based on tracking error for nonlinear discrete-time systems. Automatica 129:109687 Li H, Liu D (2012) Optimal control for discrete-time affine non-linear systems using general value iteration. IET Control Theory Appl 6(18):2725–2736

References

269

Liu D, Wei Q (2014) Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst 25(3):621–634 Luo B, Yang Y, Wu H, Huang T (2020) Balancing value iteration and policy iteration for discretetime control. IEEE Trans Syst Man Cybern Syst 50(11):3948–3958 Modares H, Lewis FL (2014) Optimal tracking control of nonlinear partially-unknown constrainedinput systems using integral reinforcement learning. Automatica 50(7):1780–1792 Mu C, Wang D, He H (2017) Novel iterative neural dynamic programming for data-based approximate optimal control design. Automatica 81:240–252 Precup R, Preitl S, Tar JK, Tomescu ML, Takacs M, Korondi P, Baranyi P (2008) Fuzzy control system performance enhancement by iterative learning control. IEEE Trans Ind Electron 55(9):3461–3475 Roman RC, Precup RE, Bojan-Dragos CA, Szedlak-Stinean AI (2019) Combined model-free adaptive control with fuzzy component by virtual reference feedback tuning for tower crane systems. Procedia Comput Sci 162:267–274 Roman RC, Precup RE, Petriu Emil M (2021) Hybrid data-driven fuzzy active disturbance rejection control for tower crane systems. Eur J Control 58:373–387 Song R, Xie Y, Zhang Z (2019) Data-driven finite-horizon optimal tracking control scheme for completely unknown discrete-time nonlinear systems. Neurocomputing 356:206–216 Sun B, van Kampen E (2020) Incremental model-based global dual heuristic programming with explicit analytical calculations applied to flight control. Eng Appl Artif Intell 89:103425 Turnip A, Panggabean Jonny H (2020) Hybrid controller design based magneto-rheological damper lookup table for quarter car suspension. Int J Artif Intell 18(1):193–206 Wang D, Liu D (2018) Learning and guaranteed cost control with event-based adaptive critic implementation. IEEE Trans Neural Netw Learn Syst 29(12):6004–6014 Wang D, Ha M, Qiao J (2020) Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation. IEEE Trans Autom Control 65(3):1272–1279 Wang D, Liu D, Wei Q (2012a) Finite-horizon neuro-optimal tracking control for a class of discretetime nonlinear systems using adaptive dynamic programming approach. Neurocomputing 78:14– 22 Wang D, Liu D, Wei Q, Zhao D, Jin N (2012b) Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8):1825–1832 Wang D, Qiao J (2019) Approximate neural optimal control with reinforcement learning for a torsional pendulum device. Neural Netw 117:1–7 Wang D, Ha M, Qiao J (2021a) Data-driven iterative adaptive critic control toward an urban wastewater treatment plant. IEEE Trans Ind Electron 68(8):7362–7369 Wang D, Qiao J, Cheng L (2021b) An approximate neuro-optimal solution of discounted guaranteed cost control design. IEEE Trans Cybern 52(1):77–86 Wang D, Zhao M, Ha M, Hu L (2021c) Adaptive-critic-based hybrid intelligent optimal tracking for a class of nonlinear discrete-time systems. Eng Appl Artif Intell 105:104443 Wang D, Zhao M, Ha M, Ren J (2021d) Neural optimal tracking control of constrained nonaffine systems with a wastewater treatment application. Neural Netw 143:121–132 Wang D, Zhao M, Qiao J (2021e) Intelligent optimal tracking with asymmetric constraints of a nonlinear wastewater treatment system. Int J Robust Nonlinear Control 31(14):6773–6787 Wei Q, Liu D, Yang X (2015) Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems. 
IEEE Trans Neural Netw Learn Syst 26(4):866–879 Wei Q, Lewis FL, Liu D, Song R, Lin H (2018a) Discrete-time local value iteration adaptive dynamic programming: convergence analysis. IEEE Trans Syst Man Cybern Syst 48(6):875–891 Wei Q, Li B, Song R (2018b) Discrete-time stable generalized self-learning optimal control with approximation errors. IEEE Trans Neural Netw Learn Syst 29(4):1226–1238 Wei Q, Liu D, Lin Q (2017) Discrete-time local value iteration adaptive dynamic programming: admissibility and termination analysis. IEEE Trans Neural Netw Learn Syst 28(11):2490–2502 Wei Q, Liu D, Lin H (2016) Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE Trans Cybern 46(3):840–853

270

10 Data-Driven Hybrid Intelligent Optimal Tracking Design …

Zhang H, Luo Y, Liu D (2009) Neural-network-based near-optimal control for a class of discretetime affine nonlinear systems with control constraints. IEEE Trans Neural Netw 20(9):1490–1503 Zhang H, Qin C, Jiang B, Luo Y (2014) Online adaptive policy learning algorithm for H∞ state feedback control of unknown affine nonlinear discrete-time systems. IEEE Trans Cybern 44(12):2706–2718 Zhang H, Wei Q, Luo Y (2008) A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern Part B Cybern 38(4):937–942 Zhao D, Xia Z, Wang D (2015) Model-free optimal control for affine nonlinear systems with convergence analysis. IEEE Trans Autom Sci Eng 12(4):1461–1468 Zhao Q, Si J, Sun J (2022) Online reinforcement learning control by direct heuristic dynamic programming: from time-driven to event-driven. IEEE Trans Neural Netw Learn Syst 33(8):4139– 4144

Index

A Action network, 15, 19, 20, 30, 39, 42–45, 48, 59–61, 69, 71, 72, 78, 79, 83, 103, 104, 119, 121, 129–132, 134, 155, 164–166, 205, 209, 210, 230, 234, 250–252, 256, 260, 266 Activation function, 14, 59, 103, 129, 130, 156–158, 204, 229, 230, 248, 249, 252 Adaptive critic, 3, 7, 12, 14–19, 30, 31, 104, 120, 121, 143, 173–175, 177, 179, 183, 186, 188, 192, 194, 198, 199, 204, 234, 243, 252, 257 Adaptive critic control, 17, 21, 87, 187, 197, 220 Adaptive critic design, 15, 87, 120, 174, 175, 201 Adaptive dynamic programming, 2, 30, 53, 89, 120, 173, 198, 220, 241 Admissible control, 6, 8, 96, 97, 120, 121, 149, 199, 202, 226, 243 Advanced optimal control, 1, 14, 15, 21, 22 Affine nonlinear systems, 3, 10, 53, 68, 119, 120, 175, 198 Approximate dynamic programming, 30, 147 Approximate optimal control, 90, 198 Approximation error, 119, 125, 127, 129, 130, 164, 182, 186, 249, 250, 252 Artificial intelligence, 3, 14, 15, 20 Asymmetric control constraints, 90–92, 102, 104, 112, 221 Asymptotically stable, 33, 35, 36, 39

Augmented system, 9–11, 120, 123, 131, 150, 151, 153, 168, 184, 187, 223 B Bellman equation, 174, 177, 179, 184 Benchmark Simulation Model No. 1, 199, 221, 261 Biochemical reaction tank, 199, 200, 207, 221, 261, 263 BSM1, 199, 200, 205–207, 211, 221, 222, 231, 261, 263 C Complex systems, 21, 143, 194, 198, 257 Computational intelligence, 2, 14 Constrained input, 43, 49 Control constraints, 29–31, 49, 89–92, 101, 102, 105, 112, 115, 220 Convergence property, 97 Cost function, 5, 6, 8, 10–12, 15–20, 30, 44, 56–58, 60, 69, 72, 79, 80, 89, 92–96, 101–103, 112, 116, 120, 123, 173, 174, 176, 183, 186–189, 191, 192, 200, 202, 207, 223, 234, 241, 243– 245, 247–249, 253, 256, 258, 260, 263 Critic intelligence, 1, 3, 4, 12, 15, 21, 22 Critic learning, 21, 22 Critic network, 15, 19, 20, 30, 39, 41–45, 48, 59–61, 68–72, 78, 79, 103, 104, 129, 132, 134, 139, 155, 164, 204, 205, 209, 210, 229, 230, 234, 249, 250, 252, 256, 260, 263, 266

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D. Wang et al., Advanced Optimal Control and Applications Involving Critic Intelligence, Intelligent Control and Learning Systems 6, https://doi.org/10.1007/978-981-19-7291-1

271

272 D Data-driven learning, 70, 80, 204 Desired trajectory, 9–11, 120, 122, 133, 134, 139, 149, 150, 166, 168, 169, 173, 175, 178, 179, 181, 184, 191, 248, 260 Discounted optimal control, 91, 148 Discount factor, 5, 89, 91–94, 97, 103, 112, 115, 174–176, 178–180, 187 Dissolved oxygen concentration, 21, 199, 200, 214, 216, 219, 221 Dual heuristic dynamic programming, 15, 30, 53, 147 Dynamical systems, 1, 4, 31, 91 Dynamic programming, 2, 4–6, 12, 13, 15, 16, 121, 174, 220

E Event-driven control, 56, 58, 70, 83 Event-driven formulation, 90, 148 Event-triggered control, 29–34, 39, 169

F Feedback control, 5, 6, 8, 12, 15, 32, 93, 123, 124, 149, 176, 184, 198, 200, 216, 244 Function approximation, 14 Function approximator, 2, 15, 54, 175, 181, 182, 186, 191

G Generalized value iteration, 89, 91, 93, 94, 97, 101, 102, 104, 109, 115, 241, 246, 248, 258, 263, 268 General value iteration, 121, 242 Globalized dual heuristic dynamic programming, 15, 30 Gradient descent algorithm, 130, 148, 164, 229

H Hamilton-Jacobi-Bellman equation, 53 Heuristic dynamic programming, 15, 29, 30, 53, 120, 148, 149, 174, 198

I Initial admissible control, 174 Initial cost function, 91, 94, 97, 101, 104, 109, 115, 202, 225, 242, 246, 247

Index Input-to-state stability, 53, 55, 90, 148 Input-to-state stable, 34 Intelligent control, 198, 220, 241, 256–258, 265 Intelligent critic control, 3, 15, 90, 199, 214 Intelligent critic framework, 21 Intelligent optimal tracking, 225, 241 Intelligent systems, 1 Internal recycle, 21, 200 Iterative adaptive critic, 3, 17–20, 53–55, 57–59, 61, 62, 68, 70, 72, 73, 78, 85, 173, 177, 180, 194, 197, 199 Iterative control law, 19, 202, 204, 205 Iterative cost function, 17–19, 89, 90, 93, 96, 98, 100, 101, 202–204, 210, 228, 242, 246 Iterative value function, 119, 125, 127, 128, 133, 147, 148, 173–175, 178, 180– 182, 186, 191

L Learning rate, 41, 42, 44, 147, 153, 158, 162– 166, 207, 210, 229, 230, 232, 250, 251 Lipschitz continuous, 31, 150 Lyapunov equation, 8, 246 Lyapunov function, 34, 158, 160, 173, 175, 178, 245, 253

M Mixed driven adaptive critic, 87 Model network, 19, 29, 39–45, 48, 49, 59– 61, 68, 69, 71, 72, 80, 83, 90, 103, 120, 121, 131, 133, 147, 148, 155– 157, 160, 163–166, 168, 169, 204, 209, 210, 219, 227, 229, 230, 232, 243, 248, 251, 253, 258, 259, 263, 268

N Near-optimal control, 29–31, 80, 89, 90, 93, 102, 115, 148, 201, 202, 205, 225, 241, 246, 253, 256 Neural dynamic programming, 2, 15, 30, 31, 53 Neural identifier, 204, 206–211, 227 Neural networks, 2–4, 9, 12, 14, 15, 17, 19, 20, 29, 39, 89, 103, 115, 129, 134, 148, 149, 168, 181, 198, 205, 207, 214, 228, 229, 234, 236, 241, 243, 249, 252, 253, 256, 268

Index Nitrate nitrogen concentration, 219, 221 Nonaffine nonlinear systems, 51, 53, 55, 90 Numerical method, 243, 248 O Online control, 243, 256 Optimal control, 1–4, 6–8, 12, 17, 19, 21, 22, 30, 53, 54, 56–58, 60, 68, 79, 87, 121, 129, 136, 137, 139, 142, 147, 151– 153, 165, 173, 175, 177–180, 185, 192, 198, 201, 220, 241, 242 Optimal control law, 6, 7, 12, 33, 92, 101, 124, 153, 202, 225, 236, 243 Optimal control problem, 5, 30, 92, 197 Optimal cost function, 5, 6, 8, 9, 12, 93, 98, 100, 101, 201, 202, 224, 245–248 Optimal regulation, 3, 5, 12, 17, 21, 29, 31, 51, 53, 54, 72, 80, 85, 89, 90, 119, 123, 148, 214, 220 Optimal tracking control, 30, 90, 119–121, 123, 142, 147–152, 156, 174, 219, 220, 225, 226, 231, 242, 244–246, 260, 261 Optimal value function, 13, 33, 125, 127, 129, 133, 136, 138, 139, 148, 176– 178, 180 Oxygen transfer coefficient, 200, 221, 235 P Performance index, 120, 121, 150, 151, 173, 175, 176, 184, 186, 194 Performance measure, 103, 157, 204, 205, 229, 248, 249, 251 Policy evaluation, 120, 121, 174 Policy improvement, 120, 121, 125, 178 Policy iteration, 13, 14, 17, 90, 121, 147, 173, 242 Positive definite function, 127, 177 Process control, 21 Proportional–integral–derivative, 197, 242 Q Q-learning, 14, 120, 174, 198 Quadratic function, 162, 163 Quadratic utility function, 8 R Reference trajectory, 9, 10, 21, 119–123, 133, 134, 139, 142, 149, 167, 174– 176, 183, 187, 189, 192, 222, 229, 234, 241, 244, 256, 257

273 Reinforcement learning, 2–4, 13–15, 17, 21, 90, 120, 149, 174, 220, 242

S Secondary sedimentation tank, 199, 200, 221 Steady control, 9, 10, 119–122, 134, 142, 149, 150, 168, 169, 199, 206, 207, 209, 210, 212, 222, 229, 231, 232, 236, 241–244, 248, 251, 256, 260, 262, 263, 268 Symmetrical control constraints, 92

T Tracking control, 9, 20, 30, 119, 120, 122– 124, 132, 133, 142, 143, 149, 173– 178, 180, 183, 184, 186, 189, 192, 194, 206, 214, 219–221, 223, 225, 227–229, 231, 232, 234, 236, 237, 241–246, 248–251, 253, 256, 258, 260, 263, 265, 268 Tracking error, 9, 11, 120–123, 131–134, 136, 137, 139, 141, 142, 150, 164, 166, 167, 173–176, 178–182, 185, 186, 188–190, 192–194, 206, 211, 212, 223, 231, 236, 242, 244, 250, 253, 261 Tracking problem, 120, 142, 149, 174, 185, 223, 231, 242, 244, 248, 268 Trajectory tracking, 3, 9, 12, 20–22, 51, 87, 119, 142, 147, 206, 234 Trajectory tracking problem, 10, 12, 21 Triggering condition, 43, 53, 54, 66–69, 76, 174

U Uniformly ultimately bounded, 119, 121, 243, 249, 252, 253 Unknown environment, 4 Utility function, 5, 7, 11, 32, 47, 92, 112, 123, 151, 166, 176, 180, 181, 183, 201, 223

V Value function, 29, 30, 39, 41, 119–121, 124, 125, 127, 129, 133, 134, 139, 152, 154, 174–178, 180–186, 191, 194 Value iteration, 13, 14, 17, 90, 91, 94, 97, 101, 104, 105, 121, 124, 125, 127, 129, 134, 142, 147, 148, 152, 173, 198, 199, 241, 242, 246, 248

274 W

Wastewater treatment, 197

Index Wastewater treatment plant, 90, 199, 205, 207, 209, 210, 212, 219–221, 230, 231, 236, 261 Wastewater treatment processes, 3