Modeling, Simulation and Optimization of Complex Processes HPSC 2018: Proceedings of the 7th International Conference on High Performance Scientific Computing, Hanoi, Vietnam, March 19-23, 2018 [1st ed.] 9783030552398, 9783030552404

This proceedings volume highlights a selection of papers presented at the 7th International Conference on High Performan

English Pages VIII, 405 [402] Year 2021

Table of contents :
Front Matter ....Pages i-viii
Global Optimization Approach for the Ascent Problem of Multi-stage Launchers (O. Bokanowski, E. Bourgeois, A. Désilles, H. Zidani)....Pages 1-42
A Robust Predictive Control Formulation for Heliogyro Blade Stability (Adonis Pimienta-Penalver, John L. Crassidis, Jer-Nan Juang)....Pages 43-61
Piecewise Polynomial Taylor Expansions—The Generalization of Faà di Bruno’s Formula (Tom Streubel, Caren Tischendorf, Andreas Griewank)....Pages 63-82
Grid-Enhanced Polylithic Modeling and Solution Approaches for Hard Optimization Problems (Josef Kallrath, Robert Blackburn, Julius Näumann)....Pages 83-96
Model Predictive Q-Learning (MPQ-L) for Bilinear Systems (Minh Q. Phan, Seyed Mahdi B. Azad)....Pages 97-115
SCOUT: Scheduling Core Utilization to Optimize the Performance of Scientific Computing Applications on CPU/Coprocessor-Based Cluster (Minh Thanh Chung, Kien Trung Pham, Manh-Thin Nguyen, Nam Thoai)....Pages 117-131
Chainer-XP: A Flexible Framework for ANNs Run on the Intel® Xeon Phi™ Coprocessor (Thanh-Dang Diep, Minh-Tri Nguyen, Nhu-Y Nguyen-Huynh, Minh Thanh Chung, Manh-Thin Nguyen, Nguyen Quang-Hung et al.)....Pages 133-147
Inverse Problems in Designing New Structural Materials (Daniel Otero Baguer, Iwona Piotrowska-Kurczewski, Peter Maass)....Pages 149-163
Coupled Electromagnetic Field and Electric Circuit Simulation: A Waveform Relaxation Benchmark (Christian Strohm, Caren Tischendorf)....Pages 165-200
SCIP-Jack: An Exact High Performance Solver for Steiner Tree Problems in Graphs and Related Problems (Daniel Rehfeldt, Yuji Shinano, Thorsten Koch)....Pages 201-223
Physical Parameter Identification with Sensors or Actuators Spanning Multiple DOF’s (Dong-Huei Tseng, Minh Q. Phan, Richard W. Longman)....Pages 225-245
Monotonization of a Family of Implicit Schemes for the Burgers Equation (Alexander Kurganov, Petr N. Vabishchevich)....Pages 247-256
The Insensitivity of the Iterative Learning Control Inverse Problem to Initial Run When Stabilized by a New Stable Inverse (Xiaoqiang Ji, Richard W. Longman)....Pages 257-275
Strategy Optimization in Sports via Markov Decision Problems (Susanne Hoffmeister, Jörg Rambau)....Pages 277-322
An Application of RASPEN to Discontinuous Galerkin Discretisation for Richards’ Equation in Porous Media Flow (Peter Bastian, Chaiyod Kamthorncharoen)....Pages 323-335
On the Development of Batch Stable Inverse Indirect Adaptive Control of Systems with Unstable Discrete-Time Inverse (Bowen Wang, Richard W. Longman)....Pages 337-355
An Improved Conjugate Gradients Method for Quasi-linear Bayesian Inverse Problems, Tested on an Example from Hydrogeology (Ole Klein)....Pages 357-385
Idiosyncrasies of the Frequency Response of Discrete-Time Equivalents of Continuous-Time System (Pitcha Prasitmeeboon, Richard W. Longman)....Pages 387-405

Hans Georg Bock · Willi Jäger · Ekaterina Kostina · Hoang Xuan Phu Editors

Modeling, Simulation and Optimization of Complex Processes HPSC 2018

Modeling, Simulation and Optimization of Complex Processes HPSC 2018 Proceedings of the 7th International Conference on High Performance Scientific Computing, Hanoi, Vietnam, March 19–23, 2018

Editors Hans Georg Bock Interdisciplinary Center for Scientific Computing (IWR) Heidelberg University Heidelberg, Germany

Willi Jäger Interdisciplinary Center for Scientific Computing (IWR) Heidelberg University Heidelberg, Germany

Ekaterina Kostina Interdisciplinary Center for Scientific Computing (IWR) Heidelberg University Heidelberg, Germany

Hoang Xuan Phu Institute of Mathematics Vietnam Academy of Science and Technology Hanoi, Vietnam

ISBN 978-3-030-55239-8 ISBN 978-3-030-55240-4 (eBook) https://doi.org/10.1007/978-3-030-55240-4

Mathematics Subject Classification: 34B15, 35Q35, 35Q92, 49K15, 49J15, 49M30, 65K05, 65L05, 70E60, 93B30, 93B40 © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Front cover picture: Performance at the Water Puppet Theatre, Hanoi, courtesy of Johannes P. Schlöder. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

This volume contains a selection of papers referring to lectures presented at the 7th International Conference on High Performance Scientific Computing held at the Vietnam Institute for Advanced Study in Mathematics (VIASM) in Hanoi, on March 19–23, 2018. The conference has been organized by the Institute of Mathematics of the Vietnam Academy of Science and Technology, the Interdisciplinary Center for Scientific Computing (IWR) of Heidelberg University and the Vietnam Institute for Advanced Study in Mathematics. High Performance Scientific Computing is an interdisciplinary area that combines many fields such as mathematics, computer science and scientific and engineering applications. It is a key high-technology for competitiveness in industrialized countries, as well as for speeding up development in emerging countries. High performance scientific computing develops methods for computational simulation and optimization for systems and processes. In practical applications in industry and commerce, science and engineering, it helps to save resources, to avoid pollution, to reduce risks and costs, to improve product quality, to shorten development times or simply to operate systems better. The conference had about 300 participants from countries all over the world. The scientific program consisted of 210 talks, presented in 16 invited mini-symposia and contributed sessions. Eight talks were invited plenary lectures given by Mihai Anitescu (Argonne National Laboratory), Jose Antonio Carrillo (Imperial College London, now University of Oxford), William Cook (University of Waterloo), Ekaterina Kostina (Heidelberg University), Nils Henrik Risebro (University of Oslo), Adelia Sequeira (University of Lisbon), Eitan Tadmor (University of Maryland) and Fredi Tröltzsch (Technische Universität Berlin). 
Topics were mathematical modeling, numerical simulation, methods for optimization and control, parallel computing including computer architectures, algorithms, tools and environments, software development, applications of scientific computing in physics, mechanics, hydrology, chemistry, biology, medicine, transport, logistics, location planning, communication, scheduling, industry, business and finance.

The participants enjoyed not only an intensive scientific program and discussions but also varied social activities, such as a snake dinner, a performance at the water puppet theatre, a boat tour around Trang An at Ninh Binh and excursions to the Imperial Citadel of Thăng Long and the Vietnam Museum of Ethnology. As at the previous conferences, a satellite workshop on Scientific Computing for the Cultural Heritage, including a field visit to the temples of Angkor, was jointly organized by the Royal University of Phnom Penh and the IWR at the Conservation d’Angkor Centre, Siem Reap, Cambodia.

The submitted manuscripts have been carefully reviewed and 18 of the contributions have been selected for publication in these proceedings. We would like to thank all contributors and referees. Our special thanks go to Nam-Dũng Hoang for his invaluable assistance in preparing this proceedings volume. We would also like to use the opportunity to thank the sponsors whose support significantly contributed to the success of the conference:

• Berlin-Brandenburg Academy of Sciences and Humanities (BBAW),
• Center for Modeling and Simulation in the Biosciences (BIOMS) of Heidelberg University,
• Faculty of Computer Science and Engineering of the Ho Chi Minh City University of Technology,
• Institute of Mathematics of the Vietnam Academy of Science and Technology,
• Interdisciplinary Center for Scientific Computing (IWR) of Heidelberg University,
• International Council for Industrial and Applied Mathematics (ICIAM),
• Vietnam Academy of Science and Technology (VAST),
• Vietnam Institute for Advanced Study in Mathematics (VIASM),
• Zentrum für Technomathematik (ZeTeM) of the University of Bremen.

Heidelberg, Germany
June 2020

Hans Georg Bock Willi Jäger Ekaterina Kostina Hoang Xuan Phu

Global Optimization Approach for the Ascent Problem of Multi-stage Launchers

O. Bokanowski, E. Bourgeois, A. Désilles, and H. Zidani

Abstract This paper deals with a trajectory optimization problem for a three-stage launcher, with the aim of minimizing the consumption of propellant needed to steer the launcher from the Earth to the GEO (geostationary orbit). Here we use a global optimization procedure based on the Hamilton-Jacobi-Bellman approach and consider a complete model including the transfer from a GTO (geostationary transfer orbit) to the GEO. This model leads to an optimal control problem that also involves some optimization parameters appearing in the flight phases. First, an adequate formulation of the control problem is introduced. Then, we discuss some theoretical results related to the value function and to the reconstruction of an optimal trajectory. Numerical simulations are given to show the relevance of the global optimization approach. This work has been undertaken in the frame of the CNES Launchers’ Research and Technology program.

O. Bokanowski (corresponding author): Lab. J.-L. Lions, University Paris Diderot, 5 rue Thomas Mann, 75013 Paris, France
E. Bourgeois: CNES Launcher Directorate, 52 rue Jacques Hillairet, 75012 Paris, France
A. Désilles · H. Zidani: Unité de Mathématiques Appliquées (UMA), Ensta ParisTech, 828 Bd des Maréchaux, 91762 Palaiseau Cedex, France

© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_1

1 Introduction

This paper concerns the design of a trajectory optimization procedure for launchers of Ariane 5 type, with the aim of steering a payload from the Earth to the GEO.

Trajectory optimization for aerospace launchers has been extensively studied in the literature; see for instance [7, 9, 16, 32, 40] and the references therein. The pioneering Goddard problem [24] is perhaps the simplest model. It consists in maximizing the final altitude of a rocket in vertical flight, for a given initial propellant allocation. In one dimension, this model is described by three state variables: the altitude $r$ of the launcher, its velocity $v$ and its mass $m$. The system is subject to the aerodynamic force (the drag $\vec{F}_D$) and is controlled via the thrust force $\vec{F}_T$. Since this work, many studies have addressed the theoretical properties of the optimal trajectories [16, 17, 27] and numerical methods for computing these trajectories [8, 9, 16, 27, 32, 34, 35, 40], in particular [18, 28, 30, 38] for the ascent problem.

Over the last five decades, several numerical approaches have been proposed and analyzed, leading to efficient methods for computing optimal trajectories. These methods are essentially based on two approaches. The first one, called “discretize-and-optimize”, consists of discretizing the control problem in order to recast it as a finite-dimensional problem that can be solved by a numerical optimization solver. This approach is easy to apply and its implementation is widely accessible, as many efficient optimization solvers are available (Ipopt, WORHP, …). However, in general, the optimization algorithms require some smoothness of the objective function and some qualification conditions on the constraints. Moreover, the initialization of the iterative process may be challenging in some cases, and there is no guarantee that the computed solution is a global solution unless the optimization problem enjoys some convexity properties.

Another widely used approach, the shooting method, consists in solving the optimality system derived from the continuous problem (i.e., before discretization). This optimality system leads to a two-point boundary value problem that can be solved by a Newton-type method. While the implementation of the method is quite simple, it may require a priori knowledge of the control structure (number of switching times, existence of singular arcs, etc.). Except for specific control problems, determining the structure of the control law is a challenging question.

In this work, we investigate the resolution of the ascent problem by a third approach: the Hamilton-Jacobi-Bellman (HJB) approach. It is based on the Dynamic Programming Principle (DPP) studied by R. Bellman [6]. It leads to a characterization of the value function as the solution of an HJB equation, which is a first-order nonlinear partial differential equation (PDE) in dimension $d$, where $d$ is the number of state variables involved in the problem. The HJB equation may be viewed as a differential form of the DPP. An important breakthrough for this approach occurred in the 1980s, when the notion of viscosity solutions of nonlinear PDEs was introduced by Crandall and Lions [19–21]. This theory provides a rigorous framework for the theoretical and numerical study of the HJB equations arising in optimal control theory. Contributions in this direction continue to grow; see the book of Bardi and Capuzzo-Dolcetta [5].

An interesting by-product of the HJB approach is the synthesis of the optimal control in feedback form. Once the HJB equation is solved, for any starting point, the reconstruction of the optimal trajectory can be performed in real time. Furthermore,
the method gives a global optimum and does not need any initialization procedure, see [13].

Although the theoretical framework of the HJB approach is well known, and despite its advantages (global optimization, easy management of state constraints, no initialization required), this approach is seldom used in real control problems, because it requires the resolution of an HJB equation (a nonlinear PDE) in high dimension. That being said, the numerical analysis of HJB equations has attracted much attention in the literature, and significant numerical contributions have been made on the subject (see e.g. [22, 25, 33]).

In this work, we consider a complete model for steering the launcher from the Earth to the GEO. The flight sequence is composed of 4 important phases (atmospheric phase, propulsion with first stage until exhaustion of the propellant, propulsion with the first stage until injection on a GTO, ballistic flight until injection on GEO). During the atmospheric phase, the control law is determined by two parameters: the azimuth angle of launch and the angular velocity. The choice of the GTO is also an optimization parameter (more precisely, this parameter represents the angle of inclination of the GTO with respect to the GEO). Finally, the optimal time to reach the final destination is also an optimization parameter. Therefore, the mathematical formulation of the problem leads to a control problem with a 6-dimensional state and 4 parameters. This challenging problem presents several difficulties for the implementation of the HJB approach. In particular, the presence of parameters would require extending the state space and treating these parameters as new state variables with zero dynamics. This would increase the dimension of the state space, making the resolution of the problem even more difficult.
Here we use the structure of the control problem to propose an adequate mathematical formulation, which makes it possible to handle the parameters rigorously and without increasing the dimension of the state space. This mathematical formulation is the first contribution of the paper.

The HJB approach often faces three major difficulties, namely the discontinuity of the value function when the control problem involves state constraints, the evaluation of the Hamiltonian at each grid point, and the lack of boundary conditions, which are needed for solving the HJB equation numerically on a computational domain. Here we show how to handle these difficulties in a rigorous mathematical way. For the numerical simulations, we use an efficient numerical scheme that is implemented with OpenMP parallelization techniques. This scheme gives relevant results in reasonable computational times.

The paper is organized as follows. Section 2 is devoted to the presentation of the physical model and its mathematical formulation. The global optimization procedure is presented in Sect. 3. The results of numerical simulations are presented in Sect. 4. Finally, some technical definitions and results are given in Appendices 1 and 2.

Notations Throughout this paper, $\mathbb{R}$ denotes the set of real numbers, $|\cdot|$ is the Euclidean norm and $\langle \cdot, \cdot \rangle$ is the Euclidean inner product on $\mathbb{R}^N$, $B$ is the open unit ball $\{x \in \mathbb{R}^N : |x| < 1\}$ and $B(x, r) = x + rB$. For a set $S \subseteq \mathbb{R}^N$, $\overline{S}$, $\mathring{S}$, $\partial(S)$ and $\mathrm{co}(S)$ denote its closure, interior, boundary and convex hull, respectively. The distance function to $S$ is $\mathrm{dist}(x, S) = \inf\{|x - y| : y \in S\}$. We will also use the notation $d_S$ for the signed distance to $S$ (i.e., $d_S(x) = -\mathrm{dist}(x, S)$ if $x \in S$, otherwise $d_S(x) = \mathrm{dist}(x, S)$).
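The distance and signed distance functions defined above can be illustrated on a set for which both have closed forms. The following minimal Python sketch takes $S$ to be a closed ball; the ball, its center and its radius are illustrative choices and do not come from the paper. It assumes the usual convention in which a point inside $S$ receives the negative of its distance to the boundary $\partial S$, which is one reading of the sign convention stated in the text.

```python
import math


def dist_to_ball(x, center, radius):
    """dist(x, S) = inf{|x - y| : y in S} for the closed ball S (illustrative set)."""
    return max(math.dist(x, center) - radius, 0.0)


def signed_dist_to_ball(x, center, radius):
    """Signed distance d_S(x): negative inside S, zero on the boundary, positive outside.

    Assumed convention: inside S the value is minus the distance to the boundary.
    """
    return math.dist(x, center) - radius


if __name__ == "__main__":
    center, radius = (0.0, 0.0), 1.0  # hypothetical example set
    for point in [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]:
        print(point,
              dist_to_ball(point, center, radius),
              signed_dist_to_ball(point, center, radius))
```

A sign change of the signed distance marks the boundary of the set, which is why such functions are convenient for encoding state constraints on a grid.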

2 Problem Statement. Mathematical Formulation

This section presents the physical problem and its mathematical formulation. Several frames will be defined to describe the motion of the launcher in the most suitable way. Then the list of forces involved during the flight sequence will be defined. The differential system that governs the trajectory will be obtained by Newton’s law.

2.1 Physical Model

Let $O$ denote the center of the Earth. We define a first frame $R_I = (O, i_I, j_I, k_I)$, considered as inertial. The vector $k_I$ is collinear with the North-South axis of rotation, and the vector $i_I$ lies in the equatorial plane of the Earth and points to the Greenwich meridian at a chosen date, set here as $t = 0$. The vector $j_I$ completes the orthonormal frame (see Fig. 1a). Consider also the frame $R_R = (O, i_R, j_R, k_R)$ (see Fig. 1b), which coincides with $R_I$ at time $t = 0$ and rotates with the Earth around the axis $k_I = k_R$ with angular velocity $\Omega$.

Fig. 1 Quasi-inertial and geocentric frames: (a) quasi-inertial frame $R_I$; (b) geocentric frame $R_R$; (c) the vehicle’s center of mass

In the sequel, we denote by $r_T$ the Earth’s mean radius and by $G$ the center of mass of the vehicle. The spherical coordinates of $G$ are $(r, L, \ell)$, where $r$ is the distance between $G$ and $O$, $L$ is the longitude and $\ell$ is the latitude (see Fig. 2a). The vehicle’s position at time $t = 0$ is denoted $G_0$ and its spherical coordinates are $(r_0, L_0, \ell_0)$.

Two other local frames will also be used. The first is a local vertical frame $R_V = (G, i_V, j_V, k_V)$ centered at $G$ and defined such that $k_V$ is collinear with $\vec{r}_G$ and points in the same direction. The vector $j_V$ lies in the plane orthogonal to $k_V$ and points to the local North. The third vector $i_V := j_V \wedge k_V$ completes the orthonormal frame (see Fig. 2). The frame $R_V$ is attached to the center of mass of the launcher and therefore evolves over time along the trajectory of the launcher. We will also need the local inertial frame $R_{IL}$, defined as the local vertical frame $(G_0, i_V, j_V, k_V)$ at time $t = 0$.

Fig. 2 Local vertical and dynamic frames: (a) orientation of the local vertical frame $R_V$; (b) dynamic frame $R_D$

Let $X : t \longrightarrow (x(t), y(t), z(t))$ be the trajectory of $G$ in the geocentric frame $R_R$, and let $\vec{V} := \dot{x}\, i_R + \dot{y}\, j_R + \dot{z}\, k_R$ be the relative velocity. We define $\vec{V}$ in the local frame $R_V$ by its polar coordinates: the modulus $v$, the azimuth $\chi$, which is the angle between $i_V$ and the projection of $\vec{V}$ on $(i_V, j_V)$, and the path inclination (flight angle) $\gamma$, which is the angle between the projection of $\vec{V}$ on $(i_V, j_V)$ and $i_V$ (see Fig. 2).

In the same way as the local frame attached to the center of mass, we also define a local tangential frame related to the velocity of the launcher. For this we introduce the “dynamical” orthonormal frame $R_D = (G, i_D, j_D, k_D)$, defined such that $i_D$ has the same direction as the velocity $\vec{V}$ (i.e., $i_D = \vec{V} / \|\vec{V}\|$), $k_D$ is the unit vector in the plane $(i_D, k_V)$ perpendicular to $i_D$ and satisfying $k_V \cdot k_D < 0$, and $j_D = k_D \wedge i_D$ (see Fig. 2). The frame $R_D$ will be useful to express some of the forces that act on the launcher.

Depending on the flight phase of the launcher, it may be appropriate to consider the Cartesian coordinates of the position of $G$ and of its velocity $(\vec{X}, \vec{V})$ in the frame $R_R$, with

$$\vec{X} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}, \qquad \vec{V} = \begin{pmatrix} V_x \\ V_y \\ V_z \end{pmatrix}.$$

The launcher may also be represented by the spherical coordinates of the position of $G$ in $R_R$ and those of the velocity in the frame $R_V$. These coordinates will be denoted by $(r, L, \ell, v, \chi, \gamma)$. The formulas for passing from one coordinate system to the other are classical and can be found in [31].
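As an illustration of the change of coordinates for the position of $G$, the following Python sketch converts between the spherical coordinates (distance, longitude, latitude) and Cartesian coordinates in the same Earth-centered frame. It assumes the standard geocentric convention, with the longitude measured in the equatorial plane from the first axis and the latitude measured from the equatorial plane; the exact formulas used by the authors are those of [31] and may be written differently. The function names are hypothetical, and the velocity conversion is not covered.

```python
import math


def spherical_to_cartesian(r, lon, lat):
    """Map (r, L, l) to Cartesian (x, y, z); angles in radians (assumed convention)."""
    x = r * math.cos(lat) * math.cos(lon)
    y = r * math.cos(lat) * math.sin(lon)
    z = r * math.sin(lat)
    return x, y, z


def cartesian_to_spherical(x, y, z):
    """Inverse map; requires (x, y, z) != (0, 0, 0)."""
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(y, x)
    # Clamp to guard against rounding slightly outside [-1, 1] near the poles.
    lat = math.asin(max(-1.0, min(1.0, z / r)))
    return r, lon, lat


if __name__ == "__main__":
    # Hypothetical sample point: 7000 km from the Earth's center.
    point = spherical_to_cartesian(7.0e6, math.radians(-52.8), math.radians(5.2))
    print(point)
    print(cartesian_to_spherical(*point))
```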

2.2 Axis, Angles, and Forces

During the phases of flight, the launcher is subject to different forces. We will express each of these forces in the most appropriate frame. For this, we need some additional notation for the angles between the velocity and the axis of the launcher:

• incidence $\alpha$: the angle between the velocity vector and the axis of the launcher in the plane $(i_V, k_V)$,
• sideslip $\delta$: the angle measured in the plane $(i_V, j_V)$,
• bank angle $\mu$: the angle between the axis of the shuttle and the axis $k_V$.

Fig. 3 Angles of the launcher: angle of attack $\alpha$, sideslip angle $\delta$, bank angle $\mu$

The launcher is subject to the forces described below:

• Gravitational force: $\vec{F}_g = m \vec{g}$, where $m$ is the mass of the vehicle and

$$\vec{g} = -\frac{c_0}{r^2} \left( I + J_2 \left( \frac{r_T}{r} \right)^2 M \right) k_V$$

is the gravitational field. Here $c_0$ is Earth’s gravitational constant, $J_2$ is the second-order term of the harmonic expansion of the gravitational field, $I$ is the identity matrix and $M$ is the following matrix:

$$M = \begin{pmatrix} 1 - 5\sin^2(\ell) & 0 & 0 \\ 0 & 1 - 5\sin^2(\ell) & 0 \\ 0 & 0 & 3 - 5\sin^2(\ell) \end{pmatrix}$$

• Aerodynamic forces: The frame best suited to express the aerodynamic forces is $R_D$. We will assume that the plane of symmetry of the vehicle coincides with the plane $(x_V, z_V)$ of the dynamical reference frame (i.e., the sideslip and bank angles are zero during atmospheric flight, see Fig. 3). Under this assumption, the aerodynamic forces are:

– The drag force $\vec{F}_D = -F_D(r, v, \alpha)\, i_D$, opposite to the velocity $\vec{V}$. In this paper, we consider that $F_D$ is given by $F_D(r, v, \alpha) = S_r\, Q(r, v)\, C_x(r, v)$, where $Q(r, v) = \frac{1}{2} \rho(r) v^2$ is the dynamic pressure, $\rho(r)$ is the atmospheric density, $S_r$ is the reference surface and $C_x$ is a given aerodynamic coefficient.
– The lift force $\vec{F}_L = -F_L(r, v, \alpha)\, k_D$, which is neglected in this application due to the technical specifications of the considered launcher.

• Thrust force: It is assumed that the direction of the thrust force coincides with the axis of the launcher. Assuming that $\mu = 0$, the orientation of the thrust in the dynamical reference frame is defined by the incidence angle $\alpha$ and the sideslip angle $\delta$. Then the thrust force is given by

$$\vec{F}_T = F_T(r) \left( \cos\alpha \cos\delta \; i_D - \cos\alpha \sin\delta \; j_D + \sin\alpha \; k_D \right),$$

with $F_T(r) = \beta g_0 I_{sp} - S\, P(r)$, where $g_0 = 9.81\ \mathrm{m\,s^{-2}}$, $P(r)$ is the atmospheric pressure, and $\beta$ (flow rate), $I_{sp}$ (specific impulse) and $S$ (surface) depend on the flight phases (see Sect. 2.4).

• Coriolis force $\vec{F}_C$ and centripetal force $\vec{F}_P$. These forces are defined by

$$\vec{F}_C = -2m\, \vec{\Omega} \wedge \vec{V} \quad \text{and} \quad \vec{F}_P = -m\, \vec{\Omega} \wedge (\vec{\Omega} \wedge \overrightarrow{OG}),$$

where $\vec{\Omega}$ is the Earth’s angular velocity. These two forces must be taken into account because the launcher’s trajectory is represented in a relative reference frame and not in the inertial one.

Remark 1 The aerodynamic forces are generally defined with the launcher’s speed relative to the air (or equivalently, relative to the ground when the atmosphere is considered static and therefore rigidly driven by the Earth’s rotation).

Remark 2 The sideslip angle is assumed to be zero during the atmospheric flight (i.e., the launcher flies with zero incidence angle). To be more rigorous, if we consider a static atmosphere, rigidly driven by the Earth’s rotation, the launcher has a nonzero out-of-plane velocity component relative to the air or to the ground (except in the limiting case of a launch from zero latitude). This assumption seems reasonable for a launch close to the equator, since the sideslip angle generally remains very low in the atmospheric phase.
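The drag and thrust magnitudes described above translate directly into code. The Python sketch below evaluates the dynamic pressure, the drag magnitude, the thrust magnitude and the components of the thrust in the dynamical frame. The atmospheric density, the atmospheric pressure and the aerodynamic coefficient are passed in as plain numbers because their models are not specified in this section; the function names and all sample values are hypothetical.

```python
import math

G0 = 9.81  # m/s^2, the value of g_0 quoted in the text


def dynamic_pressure(rho, v):
    """Q = 0.5 * rho * v^2, with rho the atmospheric density at the current altitude."""
    return 0.5 * rho * v ** 2


def drag_magnitude(s_ref, rho, v, c_x):
    """F_D = S_r * Q * C_x; the drag vector is -F_D * i_D (opposite to the velocity)."""
    return s_ref * dynamic_pressure(rho, v) * c_x


def thrust_magnitude(beta, i_sp, s, p):
    """F_T = beta * g0 * Isp - S * P, with P the atmospheric pressure at the current altitude."""
    return beta * G0 * i_sp - s * p


def thrust_components(f_t, alpha, delta):
    """Components of the thrust along (i_D, j_D, k_D), assuming a zero bank angle."""
    return (
        f_t * math.cos(alpha) * math.cos(delta),
        -f_t * math.cos(alpha) * math.sin(delta),
        f_t * math.sin(alpha),
    )


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    f_d = drag_magnitude(s_ref=20.0, rho=1.2, v=100.0, c_x=0.3)
    f_t = thrust_magnitude(beta=1000.0, i_sp=300.0, s=2.0, p=101325.0)
    print(f_d, f_t, thrust_components(f_t, math.radians(2.0), 0.0))
```

With zero incidence and zero sideslip the thrust is aligned with $i_D$, which is the situation assumed during the zero-incidence part of the atmospheric flight.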

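2.3 Motion’s Equations

Taking into account all these forces, and using Newton’s laws of motion, we get:

$$\frac{d\vec{X}}{dt} = \vec{V}, \tag{1a}$$

$$m \frac{d\vec{V}}{dt} = \vec{F}_g + \vec{F}_D + \vec{F}_L + \vec{F}_T - 2m\, \vec{\Omega} \wedge \vec{V} - m\, \vec{\Omega} \wedge (\vec{\Omega} \wedge \overrightarrow{OG}). \tag{1b}$$

Straightforward calculations yield the equations of motion in spherical coordinates:

$$\frac{dr}{dt} = v \sin\gamma \tag{2a}$$

$$\frac{dL}{dt} = \frac{v \cos\gamma \sin\chi}{r \cos\ell} \tag{2b}$$

$$\frac{d\ell}{dt} = \frac{v}{r} \cos\gamma \cos\chi \tag{2c}$$

$$\frac{dv}{dt} = -g_r \sin\gamma + g_\ell \cos\gamma \cos\chi - \frac{F_D(r, v, \alpha)}{m} + \frac{F_T(r)}{m} \cos\alpha \cos\delta + \Omega^2 r \cos\ell\, (\sin\gamma \cos\ell - \cos\gamma \sin\ell \cos\chi) \tag{2d}$$

$$\frac{d\gamma}{dt} = -\cos\gamma \left( \frac{g_r}{v} - \frac{v}{r} \right) - \frac{g_\ell}{v} \sin\gamma \cos\chi - \frac{F_T(r)}{m v} \sin\alpha + 2\Omega \cos\ell \sin\chi + \Omega^2 \frac{r}{v} \cos\ell\, (\cos\gamma \cos\ell + \sin\gamma \sin\ell \cos\chi) \tag{2e}$$

$$\frac{d\chi}{dt} = \frac{-g_\ell \sin\chi}{v \cos\gamma} - \frac{v}{r} \cos\gamma \tan\ell \sin\chi - 2\Omega\, (\sin\ell - \tan\gamma \cos\ell \cos\chi) + \Omega^2 \frac{r}{v} \frac{\sin\ell \cos\ell \sin\chi}{\cos\gamma} + \frac{F_T(r)}{m v \cos\gamma} \cos\alpha \sin\delta \tag{2f}$$

with

$$g_r := \frac{c_0}{r^2} \left( 1 + J_2 \left( \frac{r_T}{r} \right)^2 (1 - 3 \sin^2 \ell) \right) \tag{3a}$$

$$g_\ell := -2 \frac{c_0}{r^2} J_2 \left( \frac{r_T}{r} \right)^2 \sin\ell \cos\ell \tag{3b}$$

The evolution of the mass $m(t)$ is given by the following equation:

$$\dot{m}(t) = \beta(t), \tag{4}$$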

where the function $\beta$ is known, represents the consumption flow rate and depends on the launcher’s parameters. Furthermore, the mass jumps at particular times corresponding to the jettisoning of the boosters or of the different stages. The mass of the launcher includes the mass of the structure, the payload, and the mass of the propellant (see next section). Equations (1) and (2) are two different formulations of the same motion. In the sequel, we will see that the first formulation is convenient to describe the atmospheric phase, while the second is more suitable for the flight outside the atmosphere.

2.4 The Flight’s Phases

The launcher considered in this paper is of Ariane-5 type with three stages. The latter are parts of the launcher that contain propellant and provide propulsion for the launcher. We denote by $\beta_{EAP}$, $\beta_{E1}$ and $\beta_{E2}$ the mass flow rates for the boosters, the first and the second stage respectively. The rates $\beta_{EAP}$ and $\beta_{E1}$ depend on the time variable (see Fig. 14, Appendix 2), while $\beta_{E2}$ is a constant given in Table 5. Our aim is to minimize the propellant consumption while steering the vehicle from a given initial position on the Earth to the GEO. The launcher evolves according to the following phases.

Phase 0: This phase starts when the vehicle leaves the launch base. Both boosters along with the stage E1 are ignited and consume propellant with flow rates $\beta_{EAP}$ and $\beta_{E1}$ respectively.

• Phase 0–1: vertical flight for a fixed time $\tau_0$. The flight is vertical relative to the ground in the approximate initial $R_{IL}$. Thus, during this phase, the orientation angles of the thrust are constant:

$$\theta(t) := \theta_0 = \frac{\pi}{2}, \quad \psi(t) := \psi_0 \quad \text{for } t \in [0, \tau_0],$$

where $\psi_0$ is the shooting azimuth.

• Phase 0–2: the launcher rotates with constant angular velocity $\omega_{basc}$, changing its orientation during a time interval $\tau_{basc}$:

$$\theta(t) = \frac{\pi}{2} + \omega_{basc}\,t, \quad \psi(t) = \psi_0 \quad \text{for } t \in [\tau_0, \tau_0 + \tau_{basc}].$$

• Phase 0–3: The direction of the thrust is then fixed at the final values of the previous sub-phase until the angle of incidence is zero, i.e. when α = 0 or equivalently

$$\cos(\alpha) = \frac{\vec{F}_T \cdot \vec{V}}{\|\vec{F}_T\|\,\|\vec{V}\|} \equiv 1. \tag{5}$$

• Phase 0–4: Zero incidence flight until jettisoning of the boosters. In this paper, we consider that the boosters are ejected after the exhaustion of the propellants, and since the propellant flow $\beta_{EAP}$ is known, the jettisoning time is known and will be denoted $t_1$.

Fig. 4 GEO and GTO orbits. (a) Projection in the equatorial plane. (b) GTO orbit inclination relative to GEO

The durations $\tau_0$, $\tau_{basc}$ and $t_1$ are fixed, while the values of the parameters $\psi_0$ and $\omega_{basc}$ are unknowns that must be determined so as to optimize the launcher’s consumption. The set of possible positions corresponding to a large sample of these parameters can be obtained by a simple integration of the equations of motion. The computation of the trajectories can be performed in parallel and with a high accuracy (more details are given in Sect. 4).

Phase 1. During this phase the propulsion is provided by the first stage E1. The fairing is jettisoned when a threshold heat flux is reached, which is estimated to occur at time $t_1 + \tau_1$ (where $\tau_1$ is known). The engine E1 is on and consumes propellant with the flow rate $\beta_{E1}$ until exhaustion at time $t_2$. After time $t_2$ follows a short ballistic flight of duration $\tau_2$. Then, at time $t_2 + \tau_2$, the engine E1 is ejected. We denote by $t_2 := t_2 + \tau_2$ the end of Phase 1.

Phase 2. This phase starts at time $t_2$, when the second stage is ignited. It ends at the time of injection of the launcher on a GTO orbit near its perigee. The GTO is required to have its ascending node (one of the two intersection points of the orbit with the equatorial plane) located on the GEO orbit. Moreover, some parameters of the GTO are fixed, namely the distances from the center of the Earth to the perigee $r_p$ and to the apogee $r_a$ (in fact $r_a$ is equal to the radius of the GEO). However, the inclination i of the GTO relative to the GEO is considered as an optimization parameter. Furthermore, the time of injection $t_f$ is not fixed and will be considered as an optimization variable.

Phase 3. Once the launcher has reached a GTO, the engine of the second stage is off. Then follows a ballistic flight phase until it reaches the apogee of the GTO. The duration of this flight is not fixed and depends on the GTO parameters. In our study, we assume that the GTO-GEO orbital transfer is performed through an impulsive boost (to change the velocity’s modulus and direction).

The amount of propellant required for the orbital transfer depends on the GTO parameters and the launcher’s mass. It can be determined via the Tsiolkovsky formula, see Appendix 1. In particular, we note that, for a given launcher’s mass, this amount of propellant depends on the GTO only through its orbital inclination i relative to the GEO (see Fig. 4b).


The definition of the mass depends on the phases as follows:

Phase 0:
$$m(t) := \underbrace{M_{EAP} + M_{E1} + M_{E2} + m_F}_{\text{Structure}} + \underbrace{m_{CU}}_{\text{Payload}} + \underbrace{M_{P,EAP}(t) + M_{P,E1}(t) + M_{P,E2}(t)}_{\text{Propellant}}$$

Phase 1:
$$m(t) := M_{E1} + M_{E2} + m_F + m_{CU} + M_{P,E1}(t) + M_{P,E2}(t)$$

Phase 2:
$$m(t) := M_{E2} + m_F + m_{CU} + M_{P,E2}(t).$$

The constants $M_{EAP}$, $M_{E1}$, $M_{E2}$, $m_F$ denote respectively the mass of the boosters, of the first and second stages, and of the fairing; $m_{CU}$ denotes the mass of the payload; $M_{P,EAP}(t)$, $M_{P,E1}(t)$, $M_{P,E2}(t)$ denote, respectively, the mass of propellant at time t ≥ 0 in the boosters, the first and the second stages.

Remark 3 Throughout the paper, the times $t_1$, $t_1 + \tau_1$, $t_2$ are supposed to be fixed. The whole study can be extended to other situations where the jettisoning of the boosters and/or the stage E1 occurs after reaching some accelerometric threshold or a threshold ratio of propulsive forces (even if there remains some propellant). This case will be investigated in a future work.

Remark 4 Another assumption made in this paper concerns the aerodynamic forces $F_D$ and $F_L$, which will be considered negligible after Phase 0. Indeed, after this phase, the atmospheric density becomes very small.

2.5 Optimal Control Problem

The optimization problem that we aim to study consists in:

Minimizing the launcher’s consumption for steering a given payload $m_{CU}$ until the GEO.

Control laws. During Phases 1 and 2, the launcher is controlled by the incidence angle α(·) and the sideslip angle δ(·). In the following, we will use the notation u = (α, δ) for the control law. The admissible controls are measurable functions taking their values in the compact set $U := [\alpha_{min}, \alpha_{max}] \times [\delta_{min}, \delta_{max}]$, with constants $\alpha_{min/max}$ and $\delta_{min/max}$ such that $[\alpha_{min}, \alpha_{max}] \subset [-\frac{\pi}{2}, \frac{\pi}{2}]$ and $[\delta_{min}, \delta_{max}] \subset [-\pi, \pi]$. We also denote by $U_{ad}$ the set of all admissible controls:

$$U_{ad} := \big\{ u := (\alpha(\cdot), \delta(\cdot)) : \text{measurable functions on } [0, +\infty) \text{ that take values in } U \big\}.$$

Parameters. We distinguish the previous control variables, which are time dependent functions, from control parameters that do not evolve in time but are optimization parameters:

• $p_1 := (\psi_0, \omega_{basc})$, where $\psi_0$ is the shooting azimuth and $\omega_{basc}$ is the angular velocity. These parameters define the launcher’s trajectory during Phase 0. In the sequel, we assume that $p_1$ belongs to a given set $P_{Ini}$.
• the inclination angle i of the GTO from the GEO. This parameter lies in a given interval I.

State variables. The launcher’s trajectory is described by Newton’s law. Throughout the sequel, we will denote by y(·) the position-velocity of the launcher. The state equation is given by (1) or (2). Actually, we will see later on that it is more convenient to use the system (1) during Phase 0. One reason for this choice is that the dynamics in (2) requires a division by r and v (which are respectively the altitude and the modulus of the velocity, and which are 0 at time 0), while the dynamics in (1) is well defined everywhere. During Phases 1 & 2, r and v become large enough and the dynamics in (2) becomes well defined and smooth enough. That said, a valuable advantage of using (2) is that the dynamics does not depend on the longitude angle L, so the position-velocity vector can be described by only 5 components, which are (r, ℓ, v, γ, χ) (the sixth component L can be deduced from the other ones by solving its corresponding ODE).

The mass variable will not be considered as a state variable, as its profile is completely determined during Phases 0, 1 and 2. As will be explained in Sect. 3, the only unknown parameter related to the mass profile is the duration of Phase 2. This parameter, denoted ζ(t), will also be considered as a state variable with zero velocity.

State constraints. As already mentioned, the state variable satisfies the differential equation derived from Newton’s law.
In the formulation of the optimal control problems, we shall introduce some additional requirements that are physically justified. These requirements aim to reduce the domain of computation in order to focus only on a region of interest. For instance, from Newton’s law, one can easily check that the region of small altitudes with high velocities is not feasible. The set of state constraints will be denoted by K. At the end of Phase 2, the launcher is injected near the perigee of a GTO with orbital inclination i. For i ∈ I, we shall denote by $C_i$ a segment around the perigee of the GTO of inclination i (see Appendix 1.1 for more details).

Objective function. First, notice that the shape of the stage E2 is fixed and its maximum load along with the flow rate are known. So, the maximum time for the consumption of the entire propellant load is a given T. In this paper, the objective is to maximize the mass of remaining propellant after the injection on the GEO orbit (or, equivalently, to minimize the consumption of stage E2 while steering the launcher until the GEO).

Every GTO can be parameterized by its inclination angle i with respect to the equatorial plane of the Earth (see Appendix 1.1). If $t_f$ denotes a time of injection on the GTO with inclination angle i, then the remaining propellant in the stage E2 at time $t_f$ is $M_{P,E2}(t_f)$. Afterwards, there will still be an additional consumption for the injection on the GEO. We point out that the modulus of the velocity increment ΔV(i) that a vehicle has to provide for a transfer from the GTO with angle i to the GEO depends only on i (see Appendix 1.2). Furthermore, the corresponding amount of propellant $M(t_f, i)$ is explicitly given by Tsiolkovsky’s formula (or William Moore’s formula [26], see also Appendix 1), stating that

$$M(t_f, i) = m(t_f)\cdot\left(1 - e^{-\frac{\Delta V(i)}{g\,I_{sp}}}\right),$$

where $m(t_f)$ is the global launcher’s mass at the time $t_f$: $m(t_f) := m_{CU} + M_{E2} + M_{P,E2}(t_f)$. In this paper, we assume that the transfer is impulsive. Consequently, the final remaining mass on the GEO can be expressed as:

$$M_{P,E2}(t_f) - M(t_f, i) = M_{P,E2}(t_f)\cdot e^{-\frac{\Delta V(i)}{g\,I_{sp}}} - (m_{CU} + M_{E2})\left(1 - e^{-\frac{\Delta V(i)}{g\,I_{sp}}}\right). \tag{6}$$

In the sequel we define the cost function Φ : [0, T] × I → R by

$$\Phi(t_f, i) = M_{P,E2}(t_f)\cdot e^{-\frac{\Delta V(i)}{g\,I_{sp}}} - (m_{CU} + M_{E2})\cdot\left(1 - e^{-\frac{\Delta V(i)}{g\,I_{sp}}}\right). \tag{7}$$

From this definition, we can easily see that the function Φ is decreasing with respect to its first variable.
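The cost (7) is simple enough to check numerically. The sketch below is illustrative only: the function names are hypothetical, g is assumed to be the standard gravity 9.80665 m/s² (the section only writes "g Isp"), and the caller passes the remaining propellant $M_{P,E2}(t_f)$ and the velocity increment ΔV(i) directly instead of the pair (t_f, i).

```python
import math

G0 = 9.80665  # standard gravity [m/s^2]; assumed meaning of g in "g Isp"

def transfer_propellant(m_tf, delta_v, isp):
    """Tsiolkovsky estimate of the propellant burnt by an impulsive boost."""
    return m_tf * (1.0 - math.exp(-delta_v / (G0 * isp)))

def remaining_propellant(mp_e2_tf, m_cu, m_e2, delta_v, isp):
    """Cost of Eq. (7): propellant left in stage E2 once on the GEO."""
    decay = math.exp(-delta_v / (G0 * isp))
    return mp_e2_tf * decay - (m_cu + m_e2) * (1.0 - decay)
```

Since $M_{P,E2}(t_f)$ decreases as the engine burns, and the function above is increasing in its first argument, the value decreases with $t_f$, in line with the remark above.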

3 Global Optimization Approach

As described in the previous sections, the launcher’s motion follows a dynamical system that we can write in a general form as:

$$\dot{y}(t) = G(t, y(t), u(t), p) \ \text{ for } t \ge 0, \qquad y(0) = y_0, \tag{8}$$

where $u \in U_{ad}$ and $y_0$ is the initial condition (position, velocity) of the launcher on the Earth. For a fixed positive time t, we will say that a solution $y_x^{u,p}$ of (8) is admissible on [0, t] if it is associated to an admissible control $u \in U_{ad}$ and a feasible parameter $p \in P_{Ini}$.


The control problem is of the following form:

$$\begin{cases} \sup \Phi(t_f, i) \\ y^{u,p} \text{ is the solution of (8) associated to } (u(\cdot), p), \\ u(\cdot) \in U_{ad},\ p \in P_{Ini},\ i \in I,\ t_f > 0, \\ y^{u,p}(\tau) \in K \ \ \forall \tau \in [0, t_f], \text{ and } y^{u,p}(t_f) \in C_i, \end{cases} \tag{P}$$

where Φ is defined as in (7) and corresponds to the remaining mass at the GEO. The above optimization problem depends on several parameters (namely, the final time $t_f$, $p = (\psi_0, \omega_{basc})$, and the parameter i) that have to be chosen to get the optimal cost. In general, to solve this problem with the HJB approach, the vector of parameters Π could be considered as an additional state variable evolving under the simple dynamics $\dot{\Pi}(t) = 0$. This would set back the problem into a general framework where the state is $(y, \Pi)^T$. However, in this formulation the dimension of the state would become d + 4, therefore increasing the dimension of the HJB equation associated to the control problem, which we want to avoid.

For the launcher’s path optimization problem, we are concerned with a control problem where the dynamics depends on some parameters in a very specific way. More precisely, on the time interval $[0, t_1]$ (where $t_1$ is known and corresponds to the end of Phase 0), the function G depends only on two parameters in $P_{Ini}$. Then, on the time interval $[t_1, t_f]$, the function depends only on the control variable and has no explicit dependency on the parameter p. In other terms, for every $(x, u) \in R^d \times U$ and every $p \in P_{Ini}$, the dynamics has the following form:

$$G(t, x, u, p) = \begin{cases} f_0(t, x, p) & \text{for } t \in (0, t_1) \\ f(t, x, u) & \text{for } t \in (t_1, t_f). \end{cases}$$

Let us introduce $X_0$, the set of all positions that can be reached from $y_0$ with parameters $p \in P_{Ini}$:

$$X_0 = \{ y^p(t_1) \mid p \in P_{Ini},\ \dot{y}^p(t) = f_0(t, y^p(t), p),\ y^p(0) = y_0 \}. \tag{9}$$

Then the problem (P) can be re-written as follows:

$$\begin{cases} \sup \Phi(t_f, i) \\ \dot{y}^u_{t_1,x}(t) = f(t, y^u_{t_1,x}(t), u(t)), \ t \in (t_1, t_f), \\ y^u_{t_1,x}(t_1) = x \in X_0, \\ u \in U_{ad},\ i \in I,\ t_f \in [t_1, T], \\ y^u_{t_1,x}(\tau) \in K, \ \forall \tau \in [t_1, t_f], \text{ and } y^u_{t_1,x}(t_f) \in C_i. \end{cases} \tag{10}$$


As mentioned in the previous section, for a given parameter i ∈ I, the objective function $t_f \mapsto \Phi(t_f, i)$ is decreasing. Hence, for any $x \in X_0$ and i ∈ I, if we define the minimal time function T(x, i) as follows:

$$T(x, i) := \begin{cases} \inf t_f \\ \dot{y}^u_{t_1,x}(t) = f(t, y^u_{t_1,x}(t), u(t)), \ \text{a.e. } t \in (t_1, t_f), \\ y^u_{t_1,x}(t_1) = x, \\ u \in U_{ad},\ t_f \in [t_1, T], \\ y^u_{t_1,x}(\theta) \in K, \ \forall \theta \in [t_1, t_f], \text{ and } y^u_{t_1,x}(t_f) \in C_i, \end{cases} \tag{11}$$

then we have the following straightforward result:

Lemma 1 The optimal value of the problem (P) is given by:

$$\sup(P) = \sup\{ \Phi(T(x, i), i) \mid i \in I \text{ and } x \in X_0 \}. \tag{12}$$

From this result we derive an optimization algorithm that we describe in the next subsection.

3.1 A Global Optimization Procedure

Consider $P^{\Delta}_{Ini}$ (resp. $I^{\Delta}$), a finite subset of $P_{Ini}$ (resp. of I). Here, Δ represents a discretization mesh step and we assume that when Δ tends to 0, the sets $P^{\Delta}_{Ini}$ and $I^{\Delta}$ tend respectively to $P_{Ini}$ and I, in the sense of convergence of sets. The algorithm for solving the optimal control problem (P) is the following:

Step 1. For any $p \in P^{\Delta}_{Ini}$, solve the dynamical system

$$\dot{y}^p(t) = f_0(t, y^p(t), p), \qquad y^p(0) = y_0,$$

and compute $y^p(t_1)$, which is stored in a table $X_0^{\Delta}$.

Step 2. For any $i \in I^{\Delta}$, solve the inner problem (10) and compute T(x, i) for every $x \in X_0^{\Delta}$.

Step 3. Compare the values Φ(T(x, i), i) for $x \in X_0^{\Delta}$ and $i \in I^{\Delta}$, and define the optimal parameters $(\bar{p}, \bar{i})$ and the corresponding optimal solution $(\bar{u}, \bar{y}, \bar{t}_f)$.

To be more precise, Step 1 corresponds to the computation of the launcher’s trajectory during Phase 0 for a sample of parameters $p = (\psi_0, \omega_{basc}) \in P^{\Delta}_{Ini}$. Recall that the launcher’s motion during this phase is described in Sect. 2.4. Therefore, for a given parameter $p = (\psi_0, \omega_{basc})$, the trajectory is uniquely defined and can be numerically computed by solving the differential equation of the motion. This step is performed by an accurate numerical scheme for ODEs (a Runge-Kutta approximation). It is possible to compute a bundle of trajectories in parallel.


In Step 2, for every $i \in I^{\Delta}$, the minimum time function T(·, i) will be characterized and computed by using an auxiliary control problem and the HJB approach. This step will be presented in Sects. 3.2–3.5. For Step 3, a simple comparison of the values Φ(T(x, i), i) leads to the optimal pair $(p, i) \in P^{\Delta}_{Ini} \times I^{\Delta}$. The optimal trajectory and its corresponding control law are then given by a trajectory reconstruction algorithm using the minimum time function T(·, i). This will be described in Sect. 3.6.
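The three steps can be summarized by the following Python sketch. It is not the authors’ implementation: `integrate_phase0`, `min_time` and `phi` are hypothetical callables standing for the Phase 0 ODE solver, the HJB-based minimum time function T(x, i) and the cost Φ, and the parameters p are assumed to be hashable (for example tuples).

```python
import itertools
import math

def global_search(p_grid, i_grid, integrate_phase0, min_time, phi):
    """Exhaustive comparison of Phi(T(x, i), i) over the discrete samples.

    Returns (best value, p, i, t_f), or None if no sample is feasible.
    """
    # Step 1: end-of-Phase-0 states, one per parameter p (the table X_0^Delta).
    X0 = {p: integrate_phase0(p) for p in p_grid}

    best = None
    for (p, x), i in itertools.product(X0.items(), i_grid):
        # Step 2: minimum time to reach C_i from x (math.inf if unreachable).
        t_f = min_time(x, i)
        if math.isinf(t_f):
            continue
        # Step 3: keep the pair (p, i) with the largest remaining mass.
        value = phi(t_f, i)
        if best is None or value > best[0]:
            best = (value, p, i, t_f)
    return best
```

The loop over p in Step 1 is independent from one parameter to the next, which is why the trajectories of Phase 0 can be computed in parallel.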

3.2 Properties of the Minimal Time Function T. Existence of Solutions for the Inner Problem (11)

The computation of the minimum time function concerns only Phases 1 & 2. As mentioned earlier, it is convenient during these phases to represent the position-velocity state y by its spherical coordinates x := (r, ℓ, v, γ, χ) (the L variable is omitted). The corresponding dynamics f is, for any control u = (α, δ):

$$f(t, x, u) = (f_1, f_2, f_3, f_4, f_5)^T,$$

where:

$$f_1 := v\sin\gamma$$

$$f_2 := \frac{v}{r}\cos\gamma\cos\chi$$

$$f_3 := -g_r\sin\gamma + g_\ell\cos\gamma\cos\chi + \frac{F_T(r)}{m(t)}\cos\alpha\cos\delta + \Omega^2 r\cos\ell\,(\sin\gamma\cos\ell - \cos\gamma\sin\ell\cos\chi)$$

$$f_4 := -\cos\gamma\left(\frac{g_r}{v} - \frac{v}{r}\right) - \frac{g_\ell}{v}\sin\gamma\cos\chi + \frac{F_T(r)}{m(t)\,v}\sin\alpha + 2\Omega\cos\ell\sin\chi + \Omega^2\frac{r}{v}\cos\ell\,(\cos\gamma\cos\ell + \sin\gamma\sin\ell\cos\chi)$$

$$f_5 := -\frac{g_\ell}{v}\frac{\sin\chi}{\cos\gamma} - \frac{v}{r}\cos\gamma\tan\ell\sin\chi - 2\Omega\,(\sin\ell - \tan\gamma\cos\ell\cos\chi) + \Omega^2\frac{r}{v}\frac{\sin\ell\cos\ell\sin\chi}{\cos\gamma} - \frac{F_T(r)}{m(t)\,v\cos\gamma}\cos\alpha\sin\delta$$

where m(t) is the total mass at time t. First notice that the dynamics f is well defined everywhere except when r = 0 or v = 0. These regions are clearly not of interest for the flight in Phases 1 & 2, and the admissible set K will be chosen in order to avoid these regions. Moreover, in the sequel, we will need to use a larger set $K_\eta := K + \eta B$, where η > 0 is a constant that is sufficiently small for $K_\eta$ to still avoid the regions r = 0 and v = 0.


The function f is a smooth function except at $t_1 + \tau_1$ and $t_2$ (corresponding to jumps in the launcher’s mass m(t)). More precisely, f satisfies the following properties:

• for each interval $J \in \{[t_1, t_1 + \tau_1), (t_1 + \tau_1, t_2), (t_2, T]\}$, the function f is Lipschitz continuous on $J \times K_\eta \times U$;
• for every $t \in [t_1, T] \setminus \{t_1 + \tau_1, t_2\}$ and $x \in K_\eta$, the set f(t, x, U) is closed and bounded. For $t \in \{t_1 + \tau_1, t_2\}$, the sets of left and right limits $f(t^{\pm}, x, U)$ are also closed and bounded.

Remark 5 By using McShane-Whitney’s theorem, the dynamics f admits Lipschitz extensions on $[t_1, t_1 + \tau_1] \times R^d \times U$, on $[t_1 + \tau_1, t_2] \times R^d \times U$ and on $[t_2, T] \times R^d \times U$, which will still be denoted by f.

Taking into account the properties of the dynamics f, for any control $u(\cdot) \in U_{ad}$ and for every $(t, x) \in [t_1, T] \times R^d$, there exists a unique absolutely continuous function $y^u_{t,x}$, solution in the Carathéodory sense of the equation

$$\dot{y}^u_{t,x}(s) = f(s, y^u_{t,x}(s), u(s)) \quad \text{a.e. } s \in (t, T), \tag{13a}$$

$$y^u_{t,x}(t) = x. \tag{13b}$$

In the sequel, we will denote

$$S_{[t,T]}(x) := \big\{ y^u_{t,x} \in W^{1,1}(t, T) \mid y^u_{t,x} \text{ is a solution of (13) corresponding to } u \in U_{ad} \big\}.$$

This set may not be closed, and the control problem (11) may have no optimal solution. To define the closure of $S_{[t,T]}(x)$, one should consider the classical convexification of the dynamics f. For this, let us introduce the convexified set-valued dynamics $F^{\#}$ with:

$$F^{\#}(s, x) := \mathrm{co}(f(s, x, U)) \quad \text{for } (s, x) \in [t_1, T] \times R^d.$$

The set-valued map $F^{\#}$ has closed convex images. Moreover, $F^{\#}$ is measurable with respect to (w.r.t.) the time variable, and Lipschitz continuous w.r.t. the space variable. Let $S^{\#}_{[t,T]}(x)$ be the set of absolutely continuous solutions to the differential inclusion:

$$\dot{y}^{\#}(s) \in F^{\#}(s, y^{\#}(s)), \ \text{a.e. } s \in (t, T), \qquad y^{\#}(t) = x.$$

It is known, by Filippov’s theorem [4], that for every $x \in R^d$, $S^{\#}_{[t,T]}(x)$ is a compact set in $W^{1,1}(t, T)$ endowed with the $C^0$-topology. Moreover, $S^{\#}_{[t,T]}(x)$ is the closure of $S_{[t,T]}(x)$. Now, let us also define the minimal time $T^{\#}(x, i)$ associated to this

convexified dynamics:

$$T^{\#}(x, i) := \begin{cases} \min t_f, \\ y^{\#} \in S^{\#}_{[t_1,T]}(x), \\ t_f \in [t_1, T], \\ y^{\#}(\theta) \in K \ \ \forall \theta \in [t_1, t_f], \\ \text{and } y^{\#}(t_f) \in C_i. \end{cases} \tag{15}$$

Therefore, we have the following Proposition.

Proposition 1 For every fixed i, the following assertions hold.

(i) The function $T^{\#}(\cdot, i)$ is lower semi-continuous on $R^d$.
(ii) For every $x \in R^d$ such that $T^{\#}(x, i) < +\infty$, the corresponding optimal control problem (15) admits an optimal trajectory.
(iii) Let $x \in R^d$, and assume that $t_f := T^{\#}(x, i) < \infty$. Then for any ε > 0, there exists an admissible control $u_\varepsilon \in U_{ad}$ and its corresponding trajectory $y^{u_\varepsilon}$, solution of (13) on $(t_1, T)$, such that

$$\mathrm{dist}(y^{u_\varepsilon}(t_f), C_i) \le \varepsilon \tag{16}$$

and

$$\forall t \in [t_1, t_f], \quad \mathrm{dist}(y^{u_\varepsilon}(t), K) \le \varepsilon. \tag{17}$$

Proof The proof of claims (i) and (ii) follows from the compactness of the set of relaxed trajectories $S^{\#}_{[t_1,T]}(x)$, see [37]. To prove (iii), let $y^{\#} \in S^{\#}_{[t_1,T]}(x)$ denote an optimal trajectory for the convexified dynamics. In particular, $\mathrm{dist}(y^{\#}(t_f), C_i) = 0$ and $\mathrm{dist}(y^{\#}(t), K) = 0$ for all $t \in [t_1, t_f]$. Since the closure of $S_{[t_1,t_f]}(x)$ for the $C^0$-topology is $S^{\#}_{[t_1,t_f]}(x)$, we conclude that for any ε > 0, there exists a trajectory $y_\varepsilon \in S_{[t_1,t_f]}(x)$ such that $\max_{t \in [t_1,t_f]} |y_\varepsilon(t) - y^{\#}(t)| \le \varepsilon$. From these inequalities we deduce the desired result.

Recall that the set of constraints K is introduced only to reduce the computations to a region of interest. However, the final constraint $y^u(t_f) \in C_i$ is a “hard constraint”. Equation (16) means that for any ε > 0, there exists an admissible trajectory, solution of (13), that reaches the set $C_i + \varepsilon B$ at time $t_f$. Typically, in the numerical simulations, we will choose ε to be of the order of the mesh size of the space discretization.

3.3 The HJB Equation

In this section, we fix a parameter i, and aim at characterizing the minimal time function $T^{\#}(\cdot, i)$ by using the HJB approach. First, notice that because the dynamics $F^{\#}$ is non-autonomous, the minimal time function $T^{\#}(\cdot, i)$ does not satisfy the dynamic programming principle and cannot be characterized directly by an HJB equation, see [12]. Moreover, $T^{\#}(\cdot, i)$ is merely lower semi-continuous and it may take infinite values in some regions where no admissible trajectories exist. For these reasons, and following some ideas introduced in [14, 29], we shall use an auxiliary control problem where the constraints are penalized exactly and from which we will be able to recover the minimal time function $T^{\#}(\cdot, i)$. But prior to this, we consider the classical change of time variable, for every $z = t_f > 0$:

$$t_z(s) := \begin{cases} t_1 + s\,(t_2 - t_1) & s \in [0, 1] \\ t_2 + (s - 1)(z - t_2) & s \in [1, 2]. \end{cases}$$

We introduce the new dynamics:

$$\mathbf{f}(s, y, z, u) := \begin{cases} (t_2 - t_1)\, f(s, y, u) & s \in (0, 1) \\ (z - t_2)\, f(s, y, u) & s \in (1, 2). \end{cases} \tag{18}$$
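The change of time variable maps the fixed interval [0, 2] onto $[t_1, z]$, with s = 1 sent to $t_2$. A minimal sketch, with hypothetical function names, of the map $t_z$ and of the scaling factor that multiplies f in (18):

```python
def rescaled_time(s, z, t1, t2):
    """Physical time t_z(s) for a normalized time s in [0, 2]."""
    if not 0.0 <= s <= 2.0:
        raise ValueError("s must lie in [0, 2]")
    if s <= 1.0:
        return t1 + s * (t2 - t1)
    return t2 + (s - 1.0) * (z - t2)

def time_scale(s, z, t1, t2):
    """Factor multiplying f in (18): the derivative of t_z with respect to s."""
    return (t2 - t1) if s < 1.0 else (z - t2)
```

For instance, with t1 = 100, t2 = 300 and z = 500, the normalized times s = 0, 1, 2 correspond to the physical times 100, 300 and 500.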

This change of variable sets back the problem to a fixed final horizon problem (where here the final horizon is 2). The final time is then considered as a new state variable. It is clear that to each control $u \in U_{ad}$ corresponds a measurable function $\mathbf{u}$ such that $\mathbf{u}(s) \in U$ for a.e. s ∈ (0, 2). We shall keep the same notation $U_{ad}$ for the set of all admissible control functions $\mathbf{u}(\cdot)$. We introduce the state equation:

$$\begin{cases} \dot{\mathbf{y}}(s) = \mathbf{f}(s, \mathbf{y}(s), \zeta(s), \mathbf{u}(s)), & \text{a.e. } s \in (\kappa, 2), \\ \dot{\zeta}(s) = 0, & \text{a.e. } s \in (\kappa, 2), \\ \mathbf{y}(\kappa) = x, \quad \zeta(\kappa) = z. \end{cases} \tag{19}$$

Remark 6 To each triplet $(u, y^u_{t,x}, t_f)$ of control-state satisfying (13) on $(t, t_f)$ corresponds a triplet $(\mathbf{u}, \mathbf{y}^{\mathbf{u}}_{\kappa,x}, \zeta_{\kappa,z})$, solution of (19) on (κ, 2), with

$$z = t_f, \quad t_z(\kappa) = t, \quad \mathbf{u}(s) = u(t_z(s)), \quad \mathbf{y}^{\mathbf{u}}_{\kappa,x}(s) = y^u_{t,x}(t_z(s)).$$

Similarly, to each triplet $(\mathbf{u}, \mathbf{y}^{\mathbf{u}}_{\kappa,x}, \zeta_{\kappa,z})$ that satisfies (19) on (κ, 2) corresponds a triplet $(u, y^u_{t,x}, t_f)$ satisfying (13) on $(t, t_f)$ with $t_f = z$ and $t = t_z(\kappa)$.

Let η > 0 be a given constant (that we fix from now on), and consider $v_0 : R^d \to R$ and $g : R^d \to R$, the two Lipschitz functions defined by:

$$v_0(x) := \min(d_{C_i}(x), \eta), \qquad g(x) := \min(d_K(x), \eta).$$

It is clear by definition that we have:

$$v_0(x) \le 0 \iff x \in C_i, \qquad g(x) \le 0 \iff x \in K, \tag{20}$$

and g(x) = η for any $x \notin \mathring{K}_\eta$ (we recall that $K_\eta = K + \eta B$ as defined in Sect. 3.2). For simplicity of notation we have omitted the dependence on i in the function $v_0$. Now, the auxiliary control problem is the following, for any $(\kappa, x, z) \in [0, 2] \times R^d \times [t_1, T]$:

$$\vartheta(\kappa, x, z) := \begin{cases} \inf \ v_0(\mathbf{y}^{\mathbf{u}}_{\kappa,x}(2)) \vee \max_{s \in [\kappa, 2]} g(\mathbf{y}^{\mathbf{u}}_{\kappa,x}(s)) \\ \text{with } (\mathbf{y}^{\mathbf{u}}_{\kappa,x}, \zeta_{\kappa,z}) \text{ solution of (19)} \\ \text{and } \mathbf{u}(\cdot) \in U_{ad}, \end{cases} \tag{21}$$

where the notation a ∨ b stands for max(a, b). Problem (21) corresponds to an optimal control problem with finite terminal time 2. Here again, in the definition of ϑ, the infimum is not necessarily achieved. However, the infimum is always finite, since the set of trajectories satisfying (19) is nonempty. Moreover, by definition of ϑ and using Remark 6, we have the following result.

Lemma 2 For any $(\kappa, x, z) \in [0, 2] \times R^d \times [t_1, T]$, the three following statements are equivalent:

(i) ϑ(κ, x, z) ≤ 0;
(ii) for any ε > 0, there exists a control function $\mathbf{u}(\cdot)$ and its corresponding trajectory $\mathbf{y}^{\mathbf{u}}_{\kappa,x}$ such that $\mathrm{dist}(\mathbf{y}^{\mathbf{u}}_{\kappa,x}(2), C_i) \le \varepsilon$ and $\mathrm{dist}(\mathbf{y}^{\mathbf{u}}_{\kappa,x}(\theta), K) \le \varepsilon$ for any θ ∈ [κ, 2];
(iii) for any ε > 0, there exists a control function u(·) and a trajectory $y^u_{t,x}$, with $t = t_z(\kappa)$, such that $\mathrm{dist}(y^u_{t,x}(z), C_i) \le \varepsilon$ and $\mathrm{dist}(y^u_{t,x}(\theta), K) \le \varepsilon$ for any θ ∈ [t, z].

On the other hand, the function ϑ satisfies the following principle (see for instance [15]):

Proposition 2 (Dynamic programming principle) For any 0 ≤ κ ≤ κ + h ≤ 2, with h ≥ 0, it holds:

$$\vartheta(\kappa, x, z) = \inf_{\mathbf{u} \in U_{ad}} \ \vartheta(\kappa + h, \mathbf{y}^{\mathbf{u}}_{\kappa,x}(\kappa + h), z) \vee \max_{s \in [\kappa, \kappa + h]} g(\mathbf{y}^{\mathbf{u}}_{\kappa,x}(s)). \tag{22}$$

The main features of the auxiliary problem are given in the following theorem.

Theorem 1 (HJB equation) Let $v_0$ and g be Lipschitz continuous functions satisfying (20). Let $T^{\#}(\cdot, i)$ and ϑ be the minimal time function and the value function defined, respectively, by (15) and (21). The following statements hold.

(i) The value function ϑ is Lipschitz continuous on $[0, 2] \times R^d \times [t_1, T]$.
(ii) ϑ is the unique continuous viscosity solution of the following HJB equation:

$$\min\big(-\partial_\kappa \vartheta + H(\kappa, x, z, \nabla_x \vartheta),\ \vartheta(\kappa, x, z) - g(x)\big) = 0 \quad \text{for } \kappa \in [0, 2),\ x \in R^d,\ z \in [t_1, T], \tag{23a}$$

$$\vartheta(2, x, z) = \max(v_0(x), g(x)) \quad \text{for } x \in R^d,\ z \in [t_1, T], \tag{23b}$$

$$\vartheta(\kappa, x, z) = \eta \quad \text{for } \kappa \in [0, 2),\ x \notin \mathring{K}_\eta,\ z \in [t_1, T], \tag{23c}$$

where $\nabla_x \vartheta$ stands for the gradient of ϑ with respect to x, and the Hamiltonian function H is defined, for $(\kappa, x, z, q) \in [0, 2] \times R^d \times [t_1, T] \times R^d$, by:

$$H(\kappa, x, z, q) := \max_{u \in U} \big( -\mathbf{f}(\kappa, x, z, u) \cdot q \big). \tag{24}$$

(iii) Moreover, for every $x \in R^d$, we have:

$$T^{\#}(x, i) = \min\{ z \in [t_1, T],\ \vartheta(0, x, z) \le 0 \}. \tag{25}$$

Proof The Lipschitz continuity of ϑ is a classical result, see for instance [2, 23], where the HJB equation (23a)–(23b) is also derived. A comparison principle for (23a)–(23b), in a general case when the dynamics is measurable with respect to the time variable, is given in [23]. The equation (23c) is an additional information that comes directly from the definition of ϑ, since $v_0(x) \vee g(x) = \eta$ whenever $x \notin \mathring{K}_\eta$. It remains to prove the claim (iii). Let $x \in R^d$ and $z \in [t_1, T]$ be such that ϑ(0, x, z) ≤ 0. By Lemma 2, for any ε > 0, there exists a trajectory $y_\varepsilon \in S_{[t_1,z]}(x)$ such that

$$\mathrm{dist}(y_\varepsilon(z), C_i) \le \varepsilon \quad \text{and} \quad \max_{\theta \in [t_1, z]} \mathrm{dist}(y_\varepsilon(\theta), K) \le \varepsilon.$$

Since $\overline{S_{[t_1,z]}(x)} = S^{\#}_{[t_1,z]}(x)$, and by using the definition of $T^{\#}(\cdot, i)$, we conclude that $T^{\#}(x, i) \le z$. On the other hand, if $z := T^{\#}(x, i) < +\infty$, then by Proposition 1 (iii), there exists a sequence of trajectories $(y_n)_{n \ge 1}$ of $S_{[t_1,z]}(x)$ such that

$$\mathrm{dist}(y_n(z), C_i) \le \frac{1}{n} \quad \text{and} \quad \max_{t \in [t_1, z]} \mathrm{dist}(y_n(t), K) \le \frac{1}{n}.$$

Considering the corresponding trajectory $\mathbf{y}_n$ as in Remark 6, we obtain a sequence of trajectories, solutions of (19) on (0, 2), such that $\mathrm{dist}(\mathbf{y}_n(2), C_i) \le \frac{1}{n}$ and $\max_{\theta \in [0,2]} \mathrm{dist}(\mathbf{y}_n(\theta), K) \le \frac{1}{n}$. By using (20) and the fact that $v_0$ and g are Lipschitz continuous, we conclude that ϑ(0, x, z) ≤ 0.

Finally, if $T^{\#}(x, i) = +\infty$, then there is no admissible trajectory that can reach $C_i$ before the given time T. On the other hand, if ϑ(0, x, z) ≤ 0 for some $z \in [t_1, T]$, that would imply that $T^{\#}(x, i) \le z < \infty$, which is not possible. Hence there is no $z \in [t_1, T]$ such that ϑ(0, x, z) ≤ 0, which means that the right hand side of (25) is equal to +∞. This proves that (25) also holds true in that case.

Before we finish this subsection, let us notice that the HJB equation (23) includes a simple Dirichlet condition (23c). This condition is very useful for numerical purposes.
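Formula (25) translates into a simple lookup once ϑ(0, x, ·) is available on a grid of final times. The sketch below is an illustration under stated assumptions: `z_grid` is a hypothetical increasing list of candidate final times in $[t_1, T]$, and `theta0` holds the corresponding (approximate) values of ϑ(0, x, z) for a fixed x.

```python
import math

def minimal_time_from_value(z_grid, theta0):
    """Smallest z in z_grid with theta(0, x, z) <= 0, or +inf if none exists.

    z_grid is assumed to be sorted in increasing order.
    """
    for z, value in zip(z_grid, theta0):
        if value <= 0.0:
            return z
    return math.inf
```

Returning +∞ when no grid value is non-positive mirrors the convention used in the proof for the case where $C_i$ cannot be reached before T.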


3.4 Numerical Approximation of ϑ

In order to get an approximation of the value function ϑ in space dimension d + 1 = 6 for the variables (x, z), we use finite difference approximations for the HJB equation. First we introduce the notation $i = (i_1, \dots, i_d)$ for a multi-index of $Z^d$. We consider a uniform space grid $G = \{(x_i, z_j)\}$ on $K_\eta \times [t_1, T]$, with mesh steps $(\Delta x_k)_{k=1,\dots,d}$ for the x variable and Δz for the z variable. We also denote Δt > 0 and $t_n = n\Delta t$ for the approximation of the time variable. For each time $t_n$ and at each node point $(x_i, z_j) \in G$, the value $V^n_{i,j}$ denotes an approximation of $\vartheta(t_n, x_i, z_j)$.

We follow the ENO approach for Hamilton-Jacobi equations as in [33] in order to obtain a consistent approximation of (23) that is first order in time (Euler forward) and second order in space. First, a monotone Lax-Friedrichs numerical Hamiltonian is considered:

$$H_{num}(t, x, z, p^-, p^+) := H\Big(t, x, z, \frac{p^- + p^+}{2}\Big) - \sum_{k=1}^{d} \frac{C_k}{2}\,(p_k^+ - p_k^-), \tag{26}$$

with $p^{\pm} := (p_1^{\pm}, \dots, p_d^{\pm})$. For 1 ≤ k ≤ d, the constant $C_k > 0$ is chosen to be an upper bound of $\big|\frac{\partial H}{\partial p_k}(t, x, z, p)\big|$ for t ∈ [0, T], x ∈ B, $z \in [t_1, T]$ and $p \in R^5$. The time step Δt > 0 is chosen constant for simplicity, and for stability reasons it is assumed to satisfy the following CFL condition:

$$\Delta t \sum_{k=1}^{d} \frac{C_k}{\Delta x_k} \le \frac{1}{2}.$$

Let $e_k \in \{0, 1\}^d$ be the unit vector whose k-th component equals 1 and whose other components are 0. For a given multi-index $i = (i_1, \dots, i_d)$, we thus have $i \pm e_k := (i_1, \dots, i_{k-1}, i_k \pm 1, i_{k+1}, \dots, i_d)$. For a given time $t_n$, for any 1 ≤ k ≤ d we define

$$DV^{n,\pm k}_{i,j} := \pm\frac{V^n_{i \pm e_k, j} - V^n_{i,j}}{\Delta x_k}, \qquad D^2_k V^n_{i,j} := \frac{V^n_{i+e_k,j} - 2V^n_{i,j} + V^n_{i-e_k,j}}{\Delta x_k^2},$$

$$p^{n,\pm k}_{i,j} := DV^{n,\pm k}_{i,j} \mp \frac{1}{2}\,\Delta x_k\ \mathrm{minmod}\big(D^2_k V^n_{i \pm e_k, j},\ D^2_k V^n_{i,j}\big),$$

and

$$p^{n,\pm}_{i,j} := \big(p^{n,\pm k}_{i,j}\big)_{1 \le k \le d},$$


where the minmod function is defined here by minmod(a, b) := a if (ab > 0 and |a| ≤ |b|), minmod(a, b) := b if (ab > 0 and |b| < |a|), and minmod(a, b) := 0 otherwise. Then the discrete approximation of (23) is given by:

$$\min\left( \frac{V^{n+1}_{i,j} - V^n_{i,j}}{\Delta t} + H_{num}\big(t_n, x_i, z_j, p^{n,-}_{i,j}, p^{n,+}_{i,j}\big),\ \ V^{n+1}_{i,j} - g(x_i) \right) = 0, \quad (x_i, z_j) \in G,\ n \ge 0, \tag{27a}$$

$$V^0_{i,j} = \max(v_0(x_i), g(x_i)), \quad (x_i, z_j) \in G, \tag{27b}$$

$$V^n_{i,j} = \eta, \quad x_i \in \partial K_\eta. \tag{27c}$$
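To make the structure of (26)–(27) concrete, here is a one-dimensional sketch in Python with NumPy. It is a simplified illustration and not the authors’ code: the z variable is dropped, the Hamiltonian `H` and the bound `C` are supplied by the caller, ghost values outside the grid are set to η as suggested by (23c), and the update is written in the explicit form V − Δt·H_num capped from below by g.

```python
import numpy as np

def minmod(a, b):
    """Argument of smaller modulus when a and b share the same sign, else 0."""
    return np.where(a * b > 0, np.where(np.abs(a) <= np.abs(b), a, b), 0.0)

def eno2_derivatives(V, dx, eta):
    """Second-order ENO one-sided derivatives (p_minus, p_plus) on a 1-D grid."""
    n = len(V)
    Vp = np.concatenate(([eta, eta], V, [eta, eta]))   # two ghost nodes per side
    d1 = np.diff(Vp) / dx                               # first differences
    d2 = (Vp[2:] - 2.0 * Vp[1:-1] + Vp[:-2]) / dx**2    # second differences
    D_minus, D_plus = d1[1:n + 1], d1[2:n + 2]
    D2_left, D2_here, D2_right = d2[0:n], d2[1:n + 1], d2[2:n + 2]
    p_minus = D_minus + 0.5 * dx * minmod(D2_left, D2_here)
    p_plus = D_plus - 0.5 * dx * minmod(D2_right, D2_here)
    return p_minus, p_plus

def scheme_step(V, g, dx, dt, H, C, eta):
    """One explicit step: Lax-Friedrichs Hamiltonian, then the obstacle g."""
    p_minus, p_plus = eno2_derivatives(V, dx, eta)
    H_num = H(0.5 * (p_minus + p_plus)) - 0.5 * C * (p_plus - p_minus)
    return np.maximum(V - dt * H_num, g)
```

In this sketch the time step should satisfy the one-dimensional CFL condition Δt·C/Δx ≤ 1/2; for example, with H(p) = |p| one may take C = 1 and Δt = 0.4·Δx.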

By the Dirichlet condition (23c), we know that the function ϑ takes the value η outside $K_\eta$. This means that in (27a), whenever the value of $V^n_{\cdot}$ is needed outside $K_\eta$, this value can be set to η. By using the fact that Δt > 0, it is easy to see that (27a) is equivalent to

$$V^{n+1}_{i,j} = \max\Big( V^n_{i,j} - \Delta t\, H_{num}\big(t_n, x_i, z_j, p^{n,-}_{i,j}, p^{n,+}_{i,j}\big),\ g(x_i) \Big), \quad (x_i, z_j) \in G,\ n \ge 0.$$

Therefore $V^{n+1}_{i,j}$ is an explicit expression in terms of the values $(V^n_{k,j})_k$. Furthermore, we point out that for the model studied in this work, an analytic expression for the Hamiltonian function H can be derived (see next subsection). This makes the scheme (27) straightforward to implement. On the other hand, the convergence and error estimates of this scheme can be found in [15, 33].

3.5 Analytic Expression of the Hamiltonian Function

Let $t > 0$, $z \in [t_1, T]$, $x = (r, \ell, v, \gamma, \chi)$ and $q = (q_1, \dots, q_5)$. The value of the Hamiltonian $H(t, x, z, q)$, as defined in (24), can be rewritten as follows:

$$H(t,x,z,q) = \max_{\substack{\alpha \in [\alpha_{min}, \alpha_{max}] \\ \delta \in [\delta_{min}, \delta_{max}]}} \Big( b_1 \cos(\alpha)\cos(\delta) + b_2 \cos(\alpha)\sin(\delta) + b_3 \sin(\alpha) \Big) + C(t,x,z,q),$$

where $C(t,x,z,q)$ depends neither on α nor on δ, and

$$b_1 = -\frac{F_T(r)}{m}\, q_3, \qquad b_2 = -\frac{F_T(r)}{m v}\, q_5, \qquad b_3 = \frac{F_T(r)}{m v \cos\gamma}\, q_4. \tag{28}$$

By setting $B := \sqrt{b_1^2 + b_2^2}$ and $\delta_B := \arctan(b_2/b_1) \in [-\pi, \pi]$, we get:

$$H(t,x,z,q) = \max_{\alpha \in [\alpha_{min}, \alpha_{max}]} \Big( B \cos(\alpha) \max_{\delta \in [\delta_{min}, \delta_{max}]} \cos(\delta - \delta_B) + b_3 \sin(\alpha) \Big) + C(t,x,z,q)$$

(where we have also used the fact that $\cos(\alpha) \ge 0$ since $\alpha \in [-\pi/2, \pi/2]$). Since $\delta - \delta_B \in [-2\pi, 2\pi]$, the following analytic formula holds:

$$\max_{\delta \in [\delta_{min}, \delta_{max}]} \cos(\delta - \delta_B) = \max_{j=1,2,3} \cos(\delta_j) \tag{29}$$

with

$$\delta_1 := \delta_{min} - \delta_B, \qquad \delta_2 := \delta_{max} - \delta_B, \qquad \delta_3 := \min(\max(0, \delta_1), \delta_2).$$

By setting $\delta^* := \mathrm{argmax}_{j=1,2,3} \cos(\delta_j)$, $R := \sqrt{B^2 \cos(\delta^*)^2 + b_3^2}$, and $\alpha_R := \arctan\big(b_3/(B\cos(\delta^*))\big) \in [-\pi, \pi]$, we obtain

$$\begin{aligned} H(t,x,z,q) &= \max_{\alpha \in [\alpha_{min}, \alpha_{max}]} \Big( B\cos(\delta^*)\cos(\alpha) + b_3 \sin(\alpha) \Big) + C(t,x,z,q) \\ &= R \max_{\alpha \in [\alpha_{min}, \alpha_{max}]} \cos(\alpha - \alpha_R) + C(t,x,z,q) \\ &= R \max_{j=1,2,3} \cos(\alpha_j) + C(t,x,z,q) \end{aligned} \tag{30}$$

where, as for (29),

$$\alpha_1 := \alpha_{min} - \alpha_R, \qquad \alpha_2 := \alpha_{max} - \alpha_R, \qquad \alpha_3 := \min(\max(0, \alpha_1), \alpha_2).$$

This straightforward calculation shows that the Hamiltonian function can be computed exactly in a simple way.
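The two-stage maximization of this subsection can be sketched in a few lines of Python. The sketch below computes only the control-dependent part of the Hamiltonian (the term $C(t,x,z,q)$ is left out), takes the coefficients $b_1, b_2, b_3$ as given numbers, and uses the three-candidate rule of (29) for both angles. The function names are hypothetical, and `math.atan2` is used as an assumed reading of the arctangent with values in $[-\pi, \pi]$. The clamping rule is valid here because each shifted interval stays strictly inside $(-2\pi, 2\pi)$.

```python
import math


def max_cos_on_interval(lo, hi):
    """Maximum of cos over [lo, hi] from three candidates: both endpoints
    and 0 clamped to the interval. Assumes [lo, hi] lies in (-2*pi, 2*pi)."""
    candidates = (lo, hi, min(max(0.0, lo), hi))
    return max(math.cos(c) for c in candidates)


def hamiltonian_control_part(b1, b2, b3, alpha_rng, delta_rng):
    """Maximum over (alpha, delta) of
        b1*cos(alpha)*cos(delta) + b2*cos(alpha)*sin(delta) + b3*sin(alpha),
    assuming alpha_rng is contained in [-pi/2, pi/2] so that cos(alpha) >= 0."""
    # Inner maximization over delta: B * cos(delta - delta_B).
    big_b = math.hypot(b1, b2)
    delta_b = math.atan2(b2, b1)  # assumption: two-argument arctangent
    cos_star = max_cos_on_interval(delta_rng[0] - delta_b,
                                   delta_rng[1] - delta_b)
    # Outer maximization over alpha: R * cos(alpha - alpha_R).
    m = big_b * cos_star
    r = math.hypot(m, b3)
    alpha_r = math.atan2(b3, m)
    return r * max_cos_on_interval(alpha_rng[0] - alpha_r,
                                   alpha_rng[1] - alpha_r)
```

With the control bounds of Table 5, a call would look like `hamiltonian_control_part(b1, b2, b3, (-0.5, 0.2), (-0.07, 0.0))`; a brute-force search over a fine grid of (α, δ) is a convenient way to cross-check the result.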

3.6 Optimal Trajectory Reconstruction Procedure

Once an approximation $\vartheta^h$ of ϑ is computed, we can deduce an approximation of the minimum time function by

$$T^{\#,h}(x, i) := \min\big\{ z \in [t_1, T] \ :\ \vartheta^h(0, x, z) \le 0 \big\},$$

and reconstruct a corresponding optimal trajectory along with an optimal feedback control law. In the case of Bolza or Mayer optimal control problems, reconstruction algorithms were proposed for instance in [5, Appendix A] or in [36]. In our setting, the control problem involves a maximum cost function. We shall therefore use the following algorithm studied in [3].

Algorithm 1. Set $\bar z^h = T^{\#,h}(x, i)$. For a given $N \in \mathbb{N}$ we consider a time step $h = 2/N$ and a uniform partition $s_k = 2k/N$ of $[0, 2]$. We define points $(y^h_k)_{k=0,\dots,N}$ and controls $(u^h_k)_{k=0,\dots,N-1}$ by recursion as follows. First we set $y^h_0 := x$. Then for $k = 0, \dots, N-1$, knowing the state $y^h_k$, we define

(i) an optimal control value $u^h_k \in U$ such that

$$u^h_k \in \arg\min_{u \in U}\ \vartheta^h\big(s_k,\ y^h_k + h\, f(s_k, y^h_k, \bar z^h, u),\ \bar z^h\big) \vee g(y^h_k) \tag{31}$$

(ii) a new state position $y^h_{k+1}$:

$$y^h_{k+1} := y^h_k + h\, f(s_k, y^h_k, \bar z^h, u^h_k).$$

Note that in (31) the value of $u^h_k$ can also be defined as a minimizer of $\vartheta^h\big(s_k, y^h_k + h f(s_k, y^h_k, \bar z^h, u), \bar z^h\big)$, since such a value is in turn a minimizer of (31). We also associate to this sequence of controls a piecewise constant control $u^h(s) := u^h_k$ on $s \in (s_k, s_{k+1})$, and an approximate trajectory $y^h$ such that

$$\dot y^h(s) = f(s_k, y^h_k, \bar z^h, u^h_k) \quad \text{a.e. } s \in (s_k, s_{k+1}), \tag{32}$$

$$y^h(s_k) = y^h_k. \tag{33}$$

Following [3], any cluster point $(\bar y, \bar u, \bar z)$ of $(y^h, u^h, \bar z^h)_{h>0}$ is an optimal trajectory that realizes a minimum in the definition of $\vartheta(0, x, \bar z)$. Hence, after an adequate change of variable, $y(t_{\bar z}(s)) = \bar y(s)$ and $u(t_{\bar z}(s)) = \bar u(s)$, we obtain an optimal trajectory $(\bar y, \bar u)$ corresponding to the minimal time $T^{\#}(x, i)$. This claim is based on some arguments introduced in [36]. The precise statement and proof are given in [3].

For numerical reasons, in order to reduce the possible variations in the computed control value $u_k$, we also consider the following modified algorithm.

Algorithm 2. Let $\lambda_h > 0$ be a (small) positive constant. Set $\bar z^h = T^{\#,h}(x, i)$. We define points $(y^h_k)_{k=0,\dots,N}$ and controls $(u^h_k)_{k=0,\dots,N-1}$ by recursion as follows. First we set $y^h_0 := x$. For $k = 0$, we compute $u^h_0$ and $y^h_1$ as in Algorithm 1. Then, for $k \ge 1$ we compute:

(i) an optimal control value $u^h_k \in U$ such that

$$u^h_k \in \arg\min_{u \in U}\ \vartheta^h\big(s_k,\ y^h_k + h\, f(s_k, y^h_k, \bar z^h, u),\ \bar z^h\big) \vee g(y^h_k) + \lambda_h \| u - u^h_{k-1} \| \tag{34}$$

(ii) a new state position $y^h_{k+1}$ as follows:

$$y^h_{k+1} := y^h_k + h\, f(s_k, y^h_k, \bar z^h, u^h_k).$$

This algorithm also provides a minimizing sequence for $T^{\#}$ as soon as $\lambda_h / h \to 0$ as $h \to 0$. For instance, one can choose $\lambda_h$ to be of the order of $h^2$. See [3] for more details.
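The recursion shared by Algorithms 1 and 2 can be sketched as follows. This is a toy illustration under strong assumptions: the state and the control are scalars, the control set U is replaced by a finite list of candidates, and the value function and dynamics are supplied as plain Python callables (all names are hypothetical). The term involving $g(y^h_k)$ is omitted from the minimization since, as noted above, it does not depend on u. Setting `lam=0.0` gives Algorithm 1, and `lam > 0` adds the penalty on the control variation of Algorithm 2.

```python
def reconstruct_trajectory(value_fn, dynamics, controls, x0, z_bar,
                           n_steps, lam=0.0, horizon=2.0):
    """Greedy reconstruction: at each step pick the control minimizing the
    value function at the Euler-predicted next state, optionally penalized
    by lam * |u - u_previous| (for k >= 1)."""
    h = horizon / n_steps
    y = x0
    states, chosen = [y], []
    u_prev = None
    for k in range(n_steps):
        s_k = k * h

        def cost(u):
            y_next = y + h * dynamics(s_k, y, z_bar, u)
            penalty = 0.0 if u_prev is None else lam * abs(u - u_prev)
            return value_fn(s_k, y_next, z_bar) + penalty

        u_k = min(controls, key=cost)                 # step (i)
        y = y + h * dynamics(s_k, y, z_bar, u_k)      # step (ii)
        states.append(y)
        chosen.append(u_k)
        u_prev = u_k
    return states, chosen
```

For instance, with the toy data `value_fn = lambda s, y, z: abs(y)`, `dynamics = lambda s, y, z, u: u` and `controls = [-1.0, 0.0, 1.0]`, the reconstructed trajectory starting from `x0 = 1.0` moves toward 0 and then stays there.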

4 Numerical Simulations

The numerical data used in the simulations were provided by CNES, together with a corresponding reference trajectory computed by another software package (this trajectory is considered by CNES as an optimal solution to the optimization problem). We will compare our results with the reference trajectory. All the parameters involved in the model are given in Appendix 2.

4.1 Approximation of the Set $X_0$

The first step in the global optimization procedure (Sect. 3.1) consists in computing an approximation $X_0^\Delta$ of $X_0$. In all the simulations presented in this paper, the set of initial parameters $P_{Ini}$ is as follows:

$$P_{Ini} := [\psi_{min}, \psi_{max}] \times [\omega_{min}, \omega_{max}] \equiv [-8.50, -7.30] \times [0.90, 0.94]. \tag{35}$$

We consider a uniform sample of $(N_I)^2$ points:

$$P^\Delta_{Ini} = \{(\psi_i, \omega_j),\ i = 1, \dots, N_I,\ j = 1, \dots, N_I\}$$

and define the corresponding discrete set $X_0^\Delta$ as follows:

$$X_0^\Delta = \{ y_p(t_1) \ |\ \exists\, p \in P^\Delta_{Ini},\ \dot y_p(t) = f_0(t, y_p(t), p),\ y_p(0) = y_0 \}.$$

Figure 5 shows a set of trajectories for $t \in (0, t_1)$ for different values of $p \in P^\Delta_{Ini}$ (in this figure, $N_I := 21$). Here the simulations are performed by using a 4th order Runge-Kutta method (RK4 Gill).
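The construction of $P^\Delta_{Ini}$ and of the end states $y_p(t_1)$ can be sketched as below. The sketch is only illustrative: the atmospheric dynamics $f_0$ is not reproduced here and must be supplied by the caller, the function names are hypothetical, and the integrator is the classical fourth-order Runge-Kutta scheme, used as a stand-in for the Gill variant mentioned in the text.

```python
def uniform_grid(lo, hi, n):
    """n equally spaced points on [lo, hi], endpoints included (n >= 2)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def sample_parameters(psi_rng, omega_rng, n_i):
    """Uniform sample of n_i * n_i parameter pairs (psi, omega)."""
    return [(psi, omega)
            for psi in uniform_grid(psi_rng[0], psi_rng[1], n_i)
            for omega in uniform_grid(omega_rng[0], omega_rng[1], n_i)]


def rk4_integrate(f, y0, t0, t1, n_steps, p):
    """Integrate y' = f(t, y, p) from t0 to t1 with classical RK4.
    y0 and the return value of f are lists of floats."""
    h = (t1 - t0) / n_steps
    t, y = t0, list(y0)
    for _ in range(n_steps):
        k1 = f(t, y, p)
        k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)], p)
        k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)], p)
        k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)], p)
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y
```

With the bounds of (35), `sample_parameters((-8.50, -7.30), (0.90, 0.94), 21)` returns the 441 parameter pairs used for Fig. 5, and the discrete set $X_0^\Delta$ is obtained by calling `rk4_integrate` once per pair with the user-supplied dynamics.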

Fig. 5 A bundle of atmospheric trajectories corresponding to a sample of initial parameters $p \in P^\Delta_{Ini}$ (panels, each plotted against time [s]: r [km], L [rad], l [rad], V [m/s], χ [rad], γ [rad])

4.2 Numerical Computation of the Minimal Time Function T

As mentioned earlier, some physical arguments can justify that the feasible trajectories lie within a restricted domain K. For instance, the region of small altitude with high velocity is clearly not interesting. Here, we shall consider a set of state-constraints K as the set of points $(r, \ell, v, \gamma, \chi)$ such that

$$v_{min}(r) \le v \le v_{max}(r)$$

with

$$v_{min}(r) = \max\big(0,\ (r - b_1)/a_1,\ (r - b_2)/a_2\big), \qquad v_{max}(r) = \min\big(11000,\ (r - b_3)/a_3,\ (r - b_4)/a_4\big),$$

and where the constants $a_1, \dots, a_4$ and $b_1, \dots, b_4$ are given by:

$$\begin{aligned} a_1 &= 12, & b_1 &= 1.72 \times 10^5 + r_T, & a_3 &= 16, & b_3 &= -0.08 \times 10^5 + r_T, \\ a_2 &= 40, & b_2 &= 0.72 \times 10^5 + r_T, & a_4 &= 30, & b_4 &= -1.28 \times 10^5 + r_T. \end{aligned} \tag{36}$$

Figure 6 shows the restricted domain K.
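The velocity bounds defining K translate directly into code. The following sketch uses the constants listed above and the value of $r_T$ from Table 5; it assumes that r is expressed in metres and v in metres per second, and the function names are illustrative.

```python
R_T = 6378135.0  # Earth's mean radius in metres (Table 5)

# Constants a_k and offsets (b_k - r_T) of the state-constraint set K.
A1, A2, A3, A4 = 12.0, 40.0, 16.0, 30.0
B1 = 1.72e5 + R_T
B2 = 0.72e5 + R_T
B3 = -0.08e5 + R_T
B4 = -1.28e5 + R_T


def v_min(r):
    """Lower velocity bound at radius r."""
    return max(0.0, (r - B1) / A1, (r - B2) / A2)


def v_max(r):
    """Upper velocity bound at radius r."""
    return min(11000.0, (r - B3) / A3, (r - B4) / A4)


def in_K(r, v):
    """True if the pair (r, v) satisfies v_min(r) <= v <= v_max(r)."""
    return v_min(r) <= v <= v_max(r)
```

For example, at an altitude of 200 km (`r = R_T + 2.0e5`) the admissible velocities range from 3200 m/s to about 10933 m/s.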

Fig. 6 State-constraints set K in the (r, v) plane (axes: r − r_T [km], v [m s−1])

A numerical approximation of the minimum time function T(·, i), for a given angle i, can be obtained by solving the HJB equation (27) (see Theorem 1). For this we define the obstacle function g, for $x = (r, \ell, v, \gamma, \chi)$, as follows:

$$g(x) := \min\Big(\eta,\ \max\big(h^{(1)}_{min}(r) - v,\ v - h^{(2)}_{min}(r)\big)\Big),$$

where η > 0 is a small constant that is kept fixed in all the numerical simulations (the use of η is justified by the Dirichlet condition (23c)). The function $v_0$ is defined, for $x = (r, \ell, v, \gamma, \chi)$, as:

$$v_0(x) := \min(d_{C_i}(x), \eta),$$

where the target $C_i$ is a segment of the GTO (parameterized by the angle i) around its perigee (see Appendix 1.1 for details). The numerical approximation of the function ϑ can then be obtained by using the scheme presented in Sect. 3.5. Here the simulations are performed by using the general software ROC-HJ [1] for solving HJB equations (a C++ code with OpenMP parallelization that works in any dimension, limited only by the machine capacity). The computations are done on different space grids in dimension 6 (for the variables $r, \ell, v, \gamma, \chi, z$) on $K_\eta \times [t_1, T]$. Of course, computations on very refined grids require a very large CPU time and a huge memory capacity. However, the different experiments show that the computation of the value function on different grids is stable. Also, we have noticed that this computation is not very sensitive to the discretization of the last variable z.

To show the relevance of the approach, we compare the optimal trajectories obtained by our approach with a reference trajectory provided by CNES. This comparison will be performed in Sect. 4.4. Before that, in the next subsection, we analyze the numerical sensitivity of T(·, i), for a given i, on the set $X_0^\Delta$.

4.3 Sensitivity Analysis with Respect to the Initial Parameters

In this subsection, we fix the angle i and compute an approximation $T^{\#,h}(\cdot, i)$ of the minimum time function $T^{\#}(\cdot, i)$. It would be hard to represent this minimum time function on the domain $X_0$ since the dimension of the space is 5. However, it is possible to analyze the values of the minimum time with respect to the parameters $p \in P_{Ini}$. More precisely, set $P^\Delta_{Ini} = \{(\psi_i, \omega_j) \in P_{Ini},\ i = 1, \dots, N_I,\ j = 1, \dots, N_I\}$. For any $p \in P^\Delta_{Ini}$, consider $y_p(t_1) \in X_0^\Delta$ the corresponding state defined by (9). Now, define:

$$\sigma(p) := T^{\#,h}(y_p(t_1), i), \quad \text{for } p \in P^\Delta_{Ini}.$$

Figure 7 shows the graph of σ(p) (left) and its contour plot (right) obtained using $N_I = 21$, and a minimum time function computed on a grid of 40 × 16 × 40 × 40 × 32 × 5 nodes. This numerical result gives an idea of the behavior of σ(·) in the region $X_0^\Delta$. This behavior seems to be smooth enough and suggests that the minimization with respect to the parameters is achieved at a unique value $\bar p$. Moreover, from Fig. 7, we can see that the minimum time function is only weakly sensitive to the azimuth angle. This remark is consistent with the physical characteristics of the mission: while targeting a GEO orbit, it is very efficient to benefit from the natural velocity given by the rotation of the Earth around its own axis, and consequently to fly toward the East. The latitude of the launchpad being not so far from the Equatorial plane where the GEO orbit lies, a small variation of the azimuth around the optimal launch direction has a very limited impact on the launcher efficiency. On the contrary, the initial pitch maneuver has a strong impact on the global shape of the ascent phase of the trajectory, along which gravity and aerodynamic losses are very strong. Hence, a small variation around the optimal pitch rate may significantly disturb the global efficiency of the flight.

Let us also stress that, once a computation of the minimum time function is provided, it is very easy and straightforward to analyze numerically the behavior of the minimum time on a larger set of parameters. From different experiments, it appeared that a relevant set of trajectories corresponds to $P_{Ini}$ as defined in (35).

4.4 Numerical Results

In order to validate the relevance of the algorithm, we propose here three numerical tests.

Fig. 7 The function σ(p) (left panel: approximation of the minimum time as a function of the initial parameters, σ(p); right panel: contour plot of σ(p); axes: ψ [deg], ω [deg])

Test 1: optimization for a given GTO orbit. In this first test we aim to check the quality of the trajectories obtained with the HJB approach. We also want to show the convergence of the HJB computations with respect to grid refinements. This test is carried out under the following hypotheses:

• The inclination parameter i is fixed as for the CNES reference trajectory: i = 7.61 degrees.
• The parameters ψ and $\omega_{basc}$ (the azimuth and the angular speed) vary in $P_{Ini}$ as defined in (35).

We have tested several samples $P^\Delta_{Ini}$ with $N_I \times N_I$ values and have observed that for $N_I \ge 9$ the results are very similar. Therefore in what follows we fix $N_I = 9$. Figure 8 shows the optimal trajectories obtained from the computation of the value function ϑ using different space grids (as defined in Table 1) for $(r, \ell, v, \gamma, \chi, z)$ on $K_\eta \times [t_1, T]$. We also compare these trajectories to the CNES reference trajectory. All trajectories are shown here in spherical coordinates. Table 1 compares some parameters corresponding to the optimal trajectories computed on different space grids. Column two gives the optimal azimuth angle (in degrees), the third column gives the optimal angular speed $\omega_{basc}$ (in deg s−1) and the fourth column gives the optimal time $t_f$ (in seconds) for an injection on the GTO $C_i$ (we recall that for this test the angle i is fixed to the value 7.61 degrees). The last column gives the CPU time (in seconds) needed to perform the whole optimization process as described in Sect. 3.1.

Fig. 8 (Test 1) Optimal trajectories obtained with different grids in the spherical coordinates (panels, each plotted against time [s]: r [km], L [rad], l [rad], V [m/s], χ [rad], γ [rad]; curves: grid 1, grid 2, grid 3, grid 4, CNES)

Table 1 (Test 1) Optimal shooting parameters obtained for different mesh sizes

| Grid | Nodes in $(r, \ell, v, \gamma, \chi, z)$ | Azimuth ψ (deg) | $\omega_{basc}$ (deg s−1) | $t_f$ on GTO (s) | CPU (s) |
|---|---|---|---|---|---|
| grid 1 | 20 × 8 × 20 × 20 × 16 × 5 | −8.38 | 0.910 | 1011.80 | 205 |
| grid 2 | 30 × 12 × 30 × 30 × 24 × 5 | −8.50 | 0.904 | 1015.55 | 1638 |
| grid 3 | 40 × 16 × 40 × 40 × 32 × 5 | −8.04 | 0.904 | 1016.06 | 7651 |
| grid 4 | 50 × 20 × 50 × 50 × 40 × 5 | −8.40 | 0.904 | 1027.30 | 26673 |
| CNES reference trajectory | | −7.50 | 0.920 | 1031.00 | NA |

Figure 9 represents (in red) the trajectory $y^h$ obtained from the computation on the finest grid along with the CNES reference trajectory $y^{ref}$ (in black). This comparison is made in the inertial reference frame. The latter does not depend on the trajectories and is therefore better suited for comparison purposes. In Fig. 10, we have plotted the relative differences between $y^h$ and $y^{ref}$ in the same inertial reference frame (some parts of the figure have been truncated for clarity of presentation). The differences are calculated as $|y^h(t) - y^{ref}(t)| / |y^{ref}(t)|$ at every time $t \in (0, t_f)$. Table 2 gathers the performances computed on different grid sizes. The final line of the table presents the optimal values associated to the reference trajectory. Column 1 shows the number of grid nodes used in the computation of the HJB equation.

Fig. 9 (Test 1) Optimal trajectory obtained with the finest grid (in red) compared to the CNES reference trajectory (black), in the coordinates of the inertial reference frame (panels, each plotted against time [s]: x [km], y [km], z [km], Vx [m/s], Vy [m/s], Vz [m/s]; curves: HJB, CNES)

Fig. 10 (Test 1) Relative differences (panels, each plotted against time [s]: relative error x, y, z, Vx, Vy, Vz)

Table 2 (Test 1) Comparison of propellant consumption using optimized initial parameters, for different mesh sizes

| Grid $(r, \ell, v, \gamma, \chi, z)$ | Propellant consumed, Phase 2 (kg) | Propellant consumed, Phase 3 (kg) | Mass gain $\Phi(t_f, i)$ (kg) | Ratio $\Phi(t_f, i) / M_{P,E2}(0)$ (%) |
|---|---|---|---|---|
| 20 × 8 × 20 × 20 × 16 × 5 | 18994 | 5015 | 474.0 | 1.93 |
| 30 × 12 × 30 × 30 × 24 × 5 | 19139 | 4974 | 382.9 | 1.56 |
| 40 × 16 × 40 × 40 × 32 × 5 | 19164 | 4967 | 368.5 | 1.50 |
| 50 × 20 × 50 × 50 × 40 × 5 | 19617 | 4839 | 42.9 | 0.18 |
| CNES reference trajectory | 19746 | 4754 | 0.0 | 0.00 |

Column 2 gives the quantity of propellant in the second stage $E_2$ consumed before the injection into $C_i$. Column 3 gives the quantity of propellant $\Delta M_{P,E2}(t_f, i)$, estimated by Tsiolkovsky's formula (see Appendix 1), that is needed for an impulsive transfer from the GTO to the GEO. Column 4 gives the maximal final cost (mass gain), and column 5 gives the ratio between the mass gain and the initial propellant in the second stage. Note that in this test, the reference trajectory is optimized in such a way that the final mass gain is 0. From the results presented in Table 2, we see that the optimal performances converge to the reference values.

In order to check the quality of the computed optimal trajectory we have integrated a ballistic flight phase that may follow the GTO injection to join the GEO. Our goal is to check whether it is possible to reach the GEO following the ballistic flight starting from the final point of the approximated optimal trajectory obtained by our approach. We have used the RK4 Gill scheme to integrate the equation of motion during the ballistic flight, governed here only by the gravitational force (all launcher engines are off). Figure 11 shows the optimal trajectory obtained with the finest grid together with the corresponding ballistic flight. The cyan curve indicates the targeted GEO. The red segment of the trajectory corresponds to the part obtained by our optimization approach. The green segment corresponds to the ballistic part of the flight. As can be seen from this plot, our approach provides an optimal trajectory that can accomplish the entire mission (injection on the GTO $C_i$ and then transfer to the GEO).

Test 2 (global optimization). Let us recall that the cost function $\Phi(t_f, i)$ defined by (7) represents the estimation of the remaining mass of the propellant in the second stage of the launcher when it has reached the target GEO orbit. The goal is to maximize this final mass.
In this test we are looking for the optimal inclination of the GTO orbit that can be used to reach the GEO orbit. In order to solve the global optimization problem (P), as explained in Sect. 3, we first discretize the set I = [4.61, 10.61], and then solve the sub-problem $(P_i)$ defined by (15) for each $i \in I^\Delta$ by using the HJB approach. For this test, all computations were carried out on the grid of size 40 × 16 × 40 × 40 × 32 × 5 (in Test 1, this grid gives a quite accurate solution in a very reasonable CPU time). In Table 3, we show some optimal trajectory parameters (the shooting azimuth ψ and the angular velocity $\omega_{basc}$), and the value of the cost function $\Phi(t_f, i)$ for different values of the inclination parameter i of the GTO orbit. One can observe clearly that the maximum of the cost function $\Phi(t_f, i)$ is attained close to the inclination i = 7.61 degrees, which is the optimal value for the CNES reference trajectory. Note that for values i ≤ 5 or i ≥ 9.5 (rows marked − in Table 3), no feasible solution is found by our approach.

Fig. 11 (Test 1) Optimal trajectory obtained with the finest grid and the corresponding ballistic flight phase (curves: GEO, HJB ascent trajectory, ballistic flight trajectory)

Table 3 (Test 2) Optimal parameters with respect to the GTO inclination

| GTO inclination (deg) | Azimuth ψ (deg) | Ang. speed $\omega_{basc}$ (deg s−1) | Final time $t_f$ (s) | $\Phi(t_f, i)$ (kg) |
|---|---|---|---|---|
| 5.00 | − | − | − | − |
| 6.00 | −7.78 | 0.900 | 1029.20 | 20.71 |
| 6.50 | −8.14 | 0.904 | 1021.90 | 242.70 |
| 7.00 | −7.90 | 0.920 | 1017.18 | 348.60 |
| 7.50 | −7.90 | 0.904 | 1016.06 | 366.76 |
| 7.61 | −8.02 | 0.904 | 1016.06 | 368.50 |
| 8.00 | −8.14 | 0.928 | 1016.06 | 360.13 |
| 8.50 | −8.14 | 0.93 | 1018.94 | 266.17 |
| 9.00 | −7.90 | 0.904 | 1027.18 | 18.11 |
| 9.50 | − | − | − | − |
| Reference traj. (7.61) | −7.50 | 0.92 | 1031.00 | 0.00 |
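The outer loop of Test 2 amounts to solving one sub-problem per discretized inclination and keeping the inclination with the largest cost Φ. The short sketch below only illustrates this final selection step, using the values of $\Phi(t_f, i)$ reported in Table 3; infeasible inclinations are encoded as `None`, and the function name is a hypothetical choice.

```python
def best_inclination(results):
    """Return (i, phi) maximizing phi among feasible inclinations.
    `results` maps an inclination in degrees to its cost phi in kg,
    or to None when no feasible solution was found."""
    feasible = {i: phi for i, phi in results.items() if phi is not None}
    i_best = max(feasible, key=feasible.get)
    return i_best, feasible[i_best]


# Values of Phi(t_f, i) taken from Table 3 (HJB rows only).
table3 = {
    5.00: None, 6.00: 20.71, 6.50: 242.70, 7.00: 348.60, 7.50: 366.76,
    7.61: 368.50, 8.00: 360.13, 8.50: 266.17, 9.00: 18.11, 9.50: None,
}
i_star, phi_star = best_inclination(table3)
```

On these data the selected inclination is 7.61 degrees, in agreement with the CNES reference value.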

Fig. 12 (Test 3) Optimal control laws obtained with Algorithm 1 (red) and Algorithm 2 (blue) using a regularization term (panels: control α(t) [rad] and control δ(t) [rad] against time; curves: regularized control, reference control, optimal control)

Table 4 (Test 3) Optimal parameters for different trajectory reconstruction algorithms

| Algorithm | Azimuth ψ (deg) | Angular speed w (deg s−1) | Final time $t_f$ (s) | $\Phi(t_f, i)$ (kg) |
|---|---|---|---|---|
| 1 | −7.50 | 1.02 | 1004.12 | 718.097 |
| 2 | −7.50 | 1.02 | 1006.05 | 656.513 |
| Reference traj. | −7.50 | 0.92 | 1031.00 | 0.000 |

Test 3. Control law regularization. The aim of this last test is to investigate the quality of the optimal control law obtained with our trajectory reconstruction algorithms. Figure 12 shows the optimal control laws corresponding to the reference trajectory (black) and to our approach using Algorithm 1 (in red) or Algorithm 2 (in blue). The finest grid 50 × 20 × 50 × 50 × 40 × 5 has been used. For Algorithm 2, the regularization parameter was set to λ := 0.05. In general this parameter should be chosen to be a small constant. The corresponding optimized values are given in Table 4.

One can first remark that the control law obtained by Algorithm 1 is rather irregular. This behavior (or small oscillations of the trajectories) is often observed for control laws directly issued from numerical discretization (see also [39] and the references therein). For the HJB approach, there are several reasons for this chattering behavior. Firstly, let us recall that there is a priori no uniqueness result for the optimal trajectory related to the optimal control problem. Hence there can exist multiple control laws corresponding to the same optimal value. The convergence result previously cited from [3], for the trajectory reconstruction algorithm, concerns only the trajectory and not the control law itself. Secondly, the choice of the optimal control value as defined in (31) is numerically delicate when the gradient of the value function is close to zero. This may produce spurious oscillations in the control law.

The aim of Algorithm 2 is to penalize the control variation. In Fig. 12, we see that it helps reduce the oscillations of the controls. Table 4 shows that there are only rather minor differences in the optimized parameters, while the control laws are more regular.

The advantage of the HJB approach is that it does not require any initialization procedure (which can be a problem for direct approaches for trajectory optimization), and that it gives a global optimum. For the mission considered in this paper, the set of dynamics is non convex, and the set of trajectories $S_{[0,T]}(x)$ is non compact, so the HJB approach may generate non-physical control values and therefore non-physical trajectories. However, it is possible to use some regularization procedures for the control (we have given a basic example here, but the regularization procedures should be studied further).

4.5 Conclusion and Future Work

In this work we present some theoretical and numerical results obtained using the HJB framework for the problem of trajectory optimization for space launchers. From the theoretical point of view we show that the problem of propellant optimization can be formalized as a minimum time problem. The main difficulty in this problem is the fact that the dynamics is not autonomous. To deal with this problem we consider the final time as an additional state variable. This allows us to apply some recent results on trajectory reconstruction using the value function of the corresponding optimal control problem and the dynamic programming principle.

From the numerical point of view the main challenge of this work is to apply the HJB approach to a real physical system of large dimension. The numerical tests presented here show that the approach is quite promising (although the CPU time can still be very large, up to a few hours for the computation on a refined grid). The numerical results are validated by theoretical studies and the HJB approach does not require any initial guess of the solution. Moreover, our approach guarantees the global optimality of the numerical solution.

Ongoing work on this subject is concerned with a more complete formulation of the problem of trajectory optimization for space launchers. Indeed, for this first study, we have assumed that the last boost, used to perform the orbital transfer between the GTO and the final GEO, is impulsive. Making this assumption, we have estimated the needed quantity of propellant using Tsiolkovsky's formula. To make our model more precise, we are working on a new formulation where the last boost is also a solution of an optimal control problem [10]. Some numerical results for an SSO (Sun-Synchronous Orbit) mission are given in [11].

Acknowledgements We thank the anonymous referees for their useful comments. This work is partially supported by Centre National d’Études Spatiales (CNES), under the grant RT-CR-4301304-CNES. It is also supported by a public grant overseen by the French National Research Agency (ANR), through the “iCODE Institute project” funded by the IDEX Paris-Saclay, ANR-11-IDEX0003-02.


Appendix 1 Definition of Orbital Parameters

For phase 3 of the flight sequence there are two kinds of orbital maneuvers:

• Injection on a GTO orbit
• Orbital transfer from GTO to the GEO.

We define here all necessary parameters to describe these maneuvers.

Appendix 1.1 GTO Injection Parameters

A GTO orbit is represented in Fig. 13. We denote by A and P the apogee and the perigee of the orbit respectively. The line of nodes is the line of intersection of the orbital plane with the equatorial plane of the Earth. Its extremal points are respectively the ascending (NA) and descending (ND) nodes of the orbit. A GTO orbit can be fully defined with the following set of parameters $(r_a, r_p, \Upsilon, \omega, i)$, where:

• $r_a$ and $r_p$ are respectively the distances from the center of the Earth of the apogee and the perigee of the orbit.
• ϒ is the polar position of the perigee P, measured positively from the ascending node NA.
• i is the inclination of the orbital plane with respect to the equatorial plane of the Earth.

One can deduce from these parameters the semi-major axis a and the eccentricity e by:

$$a = \frac{r_a + r_p}{2}, \qquad e = \frac{r_a - r_p}{r_a + r_p}.$$

A point Q on the orbit is defined by its angular position ς, called true anomaly (see Fig. 13). Let $c_0 = 3.986 \cdot 10^{14}\ \mathrm{m^3\, s^{-2}}$ be the Earth's gravitational constant. Given the orbital parameters defined above and the true anomaly of a point Q on the orbit, one can deduce the corresponding orbit radius $r_Q(\varsigma)$ and orbit velocity $v_Q(\varsigma)$ as follows:

$$r_Q(\varsigma) = \frac{2\, r_a r_p}{r_a (1 + \cos(\varsigma)) + r_p (1 - \cos(\varsigma))}, \qquad v_Q(\varsigma) = \sqrt{\frac{c_0}{a}\, \frac{1 + e^2 + 2 e \cos(\varsigma)}{1 - e^2}}. \tag{37}$$

Fig. 13 GEO and GTO orbits (labels: Earth, perigee P, apogee A, line of nodes with ascending node NA and descending node ND, perigee position ϒ, true anomaly ς of the launcher position Q, radii $r_p$ and $r_a$)

In this study we consider a family of GTOs with a fixed perigee altitude $r_p^0$ and that intersect a given GEO. Recall that a GEO is a circular orbit in the equatorial plane of the Earth. It is fully defined by its radius $r_{GEO}$. If a GTO intersects a GEO of radius $r_{GEO}$, this fixes the orbit radius of its ascending node NA. Note that by definition (see Fig. 13) the true anomaly of the point NA is −ϒ. Then we have the following condition:

$$r_{NA} = r_Q(-\Upsilon) = r_{GEO},$$

from which we deduce the apogee altitude:

$$r_a = \frac{r_{GEO}\, r_p\, (1 - \cos(\Upsilon))}{2 r_p - r_{GEO} (1 + \cos(\Upsilon))}.$$

So, if we fix the perigee altitude $r_p$ and the GEO radius $r_{GEO}$, the parameter $r_a$ is also fixed. The perigee orientation ϒ is fixed as well, so the set of considered GTO orbits is defined by the inclination $i \in [i_{min}, i_{max}]$. For each GTO orbit we define a segment of injection $C_i$ as a neighborhood of the perigee.
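The relations of this appendix can be checked numerically with a short Python sketch. It implements the semi-major axis and eccentricity, the orbit radius and velocity of (37) (with the factor 2 in the numerator of the radius, as required for $r_Q(0) = r_p$), and the apogee radius deduced from the GEO intersection condition. Function names are illustrative; radii are assumed to be measured from the Earth's center in metres and angles in radians.

```python
import math

C0 = 3.986e14  # Earth's gravitational constant in m^3 s^-2


def semi_major_axis(ra, rp):
    return 0.5 * (ra + rp)


def eccentricity(ra, rp):
    return (ra - rp) / (ra + rp)


def orbit_radius(ra, rp, nu):
    """Radius at true anomaly nu (nu = 0 at perigee)."""
    return 2.0 * ra * rp / (ra * (1.0 + math.cos(nu)) + rp * (1.0 - math.cos(nu)))


def orbit_velocity(ra, rp, nu):
    """Speed at true anomaly nu, as in (37)."""
    a, e = semi_major_axis(ra, rp), eccentricity(ra, rp)
    return math.sqrt(C0 / a * (1.0 + e * e + 2.0 * e * math.cos(nu)) / (1.0 - e * e))


def apogee_radius(r_geo, rp, upsilon):
    """Apogee radius such that the ascending node (true anomaly -upsilon)
    lies on the GEO of radius r_geo."""
    c = math.cos(upsilon)
    return r_geo * rp * (1.0 - c) / (2.0 * rp - r_geo * (1.0 + c))
```

As a sanity check, the velocity returned by `orbit_velocity` agrees with the vis-viva relation $v^2 = c_0 (2/r - 1/a)$ at any true anomaly, and with the altitudes of Table 6 the perigee speed is about 10.26 km/s.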

Appendix 1.2 Orbital Transfer Parameters

In our study, we assume that the GTO-GEO orbital transfer is performed through an impulsive boost (to change the velocity's modulus and direction). The amount of propellant required for the orbital transfer is determined by the choice of the GTO via Tsiolkovsky's formula. Let $\vec V_{GTO}$ be the velocity at the ascending node of the GTO, and $\vec V_{GEO}$ be the desired GEO velocity. Then the velocity increment to be provided to the vehicle is the vector difference $\Delta \vec V = \vec V_{GEO} - \vec V_{GTO}$, with the modulus:

$$\Delta V = \sqrt{V_{GEO}^2 + V_{GTO}^2 - 2 V_{GTO} V_{GEO} \cos(i)}. \tag{38}$$

Note that the speed $V_{GTO}$ can be expressed using (37) for the ascending node: $V_{GTO} = v_{NA} = v_Q(-\Upsilon)$.

Table 5 Numerical data

| Notation | Value | Units | Comments |
|---|---|---|---|
| $C_x$ | See Fig. 14 | | Aerodynamic coefficients |
| $F_T$ | See Fig. 14 | N | Thrust force |
| $M_{EAP}$ | 2 × 37 × 10^3 | kg | Mass of boosters without propellant |
| $M_{E1}$ | 18 × 10^3 | kg | Mass of first stage without propellant |
| $M_{E2}$ | 7 × 10^3 | kg | Mass of second stage without propellant |
| $M_C$ | 2500 | kg | Mass of cap |
| $M_{P,EAP}$ | 2 × 237 000 | kg | Mass of propellant in boosters |
| $M_{P,E1}$ | 170 000 | kg | Mass of propellant in first stage |
| $M_{P,E2}$ | 24 500 | kg | Mass of propellant in second stage |
| $M_{PL}$ | 5 307 | kg | Mass of the payload |
| $r_T$ | 6378135 | m | Earth's mean radius |
| $S_r$ | 23 | m^2 | Reference surface |
| $S_{EAP}$ | 2 × 7.0 | m^2 | Exit nozzle area of boosters |
| $S_{E1}$ | 4.0 | m^2 | Exit nozzle area of first stage |
| $S_{E2}$ | 0 | m^2 | Exit nozzle area of second stage |
| $\beta_{EAP}(.)$ | See Fig. 14 | kg · s−1 | Flow rate for boosters |
| $\beta_{E1}(.)$ | See Fig. 14 | kg · s−1 | Flow rate for first stage |
| $I_{SP,E2}$ | 465 | s | Specific impulse of second stage |
| $\beta_{E2}$ | 40 | kg · s−1 | Flow rate for second stage |
| $c_0$ | 3.986 × 10^14 | m^3 · s−2 | Earth's gravitational constant |
| ρ(.) | See Fig. 14 | | Atmospheric density |
| Ω | 7.292 × 10^−5 | rad · s−1 | Earth's angular velocity |
| $[\alpha_{min}, \alpha_{max}]$ | [−0.5, 0.2] | rad | Interval for the incidence control α(.) |
| $[\delta_{min}, \delta_{max}]$ | [−0.07, 0] | rad | Interval for the sideslip control δ(.) |

Tsiolkovsky's formula relates the velocity increment to be provided to the required propellant mass:

$$\Delta V = g\, I_{sp} \ln\!\left(\frac{m_{init}}{m_{final}}\right), \tag{39}$$

where $m_{init}$ and $m_{final}$ are respectively the launcher's masses before and after the maneuver. Thus one can calculate the mass of propellant required for the GTO-GEO orbital transfer in a single impulse, knowing the speed at the ascending node of the GTO and its inclination with respect to the GEO.
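Formulas (38) and (39) combine into a two-line propellant estimate, sketched below in Python. The value of the standard gravity `G0` is an assumption (the text does not state the value of g used), the function names are illustrative, speeds are in m/s, the inclination is in radians and masses are in kg.

```python
import math

G0 = 9.80665  # assumed standard gravity in m s^-2 (not given in the text)


def delta_v(v_gto, v_geo, incl_rad):
    """Modulus of the velocity increment (38) for an impulsive transfer
    combining a speed change and a plane change of angle incl_rad."""
    return math.sqrt(v_geo ** 2 + v_gto ** 2
                     - 2.0 * v_gto * v_geo * math.cos(incl_rad))


def propellant_mass(m_init, dv, isp, g=G0):
    """Propellant mass m_init - m_final obtained by inverting (39):
    m_final = m_init * exp(-dv / (g * isp))."""
    return m_init * (1.0 - math.exp(-dv / (g * isp)))
```

With zero inclination the increment reduces to the plain speed difference, and inserting the returned propellant mass back into (39) recovers the prescribed ΔV.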

Fig. 14 Tabulated data used in the computations (panels: drag coefficient, air density, pressure, air speed, boosters thrust, first stage thrust; horizontal axes: Mach, altitude [km], time [s])

Table 6 Numerical data corresponding to the CNES reference trajectory

| Notation | Value | Units | Comments |
|---|---|---|---|
| $r_p$ | 180 | km | GTO perigee altitude |
| $r_a$ | 35786 | km | GTO apogee altitude |

Appendix 2 Numerical Data Used for Simulations

Table 5 lists the numerical constants and data functions describing the dynamical properties of the launcher and the atmospheric model used in our computations. Furthermore, Fig. 14 shows the air density (and sound speed) models, the thrust (for boosters and stage E1), and the drag function. All these data were provided by CNES (Table 6).

References 1. “ROC-HJ” software, a parallel d-dimensional c++ solver for reachability and optimal control using Hamilton-Jacobi equations. http://uma.ensta-paristech.fr/soft/ROC-HJ 2. Altarovici, A., Bokanowski, O., Zidani H.: A general Hamilton-Jacobi framework for non-linear state-constrained control problems. ESAIM: Cont., Optim. Calc. Var. 19(337–357), 2013 (2012 electronic)

Global Optimization Approach for the Ascent Problem of Multi-stage Launchers

41

3. Assellaou, M., Bokanowski, O., Desilles, A., Zidani, H.: HJB approach for state constrained control problems with maximum cost. Preprint (2016) 4. Aubin, J.-P., Cellina, A.: Differential inclusions. Comprehensive Studies in Mathematics, vol. 264. Springer, Berlin, Heidelberg, New York, Tokyo (1984) 5. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and viscosity solutions of Hamilton-JacobiBellman Equations. Birkhäuser Boston (1997) 6. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1961) 7. Bérend, N., Bonnans, J.F., Laurent-Varin, J., Haddou, M., Talbot, C.: An interior point approach to trajectory optimization. J. Guid. Cont. Dyn. 30(5), 1228–1238 (2005) 8. Betts, J.T.: Survey of numerical methods for trajectory optimization. J. Guid. Control Dyn. 21(2), 193–207 (1998) 9. Betts, J.T.: Practical Methods for Optimal Control Using Nonlinear Programming. Society for Industrial and Applied Mathematics, Philadelphia (2001) 10. Bokanowski, O., Bourgeois, E., Desilles, A., Zidani, H.: HJB approach for a multi-boost launcher trajectory optimization problem. In: IFAC Proceedings, 20th IFAC Symposium on Automatic Control in Aerospace (ACA), Vol. 49(17), pp. 456–461 (2016) 11. Bokanowski, O., Bourgeois, E., Desilles, A., Zidani, H.: Payload optimization for multi-stage launchers using hjb approach and application to a sso mission. In: IFAC Proceedings, 20th IFAC World Congress, Vol. 50(1), pp. 2904–2910 (2017) 12. Bokanowski, O., Briani, A., Zidani, H.: Minimum time control problems for non-autonomous differential equations. Syst. Cont. Lett. 58(10–11), 742–746 (2009) 13. Bokanowski, O., Cristiani, E., Zidani, H.: An efficient data structure and accurate scheme to solve front propagation problems. J. Sci. Comput. 42(10), 251–273 (2010) 14. Bokanowski, O., Forcadel, N., Zidani, H.: L 1 -error estimate for numerical approximations of Hamilton-Jacobi-Bellman equations in dimension 1. Math. Comput. 79, 1395–1426 (2010) 15. 
Bokanowski, O., Forcadel, N., Zidani, H.: Reachability and minimal times for state constrained nonlinear problems without any controllability assumption. SIAM J. Control Optim. 48(7), 4292–4316 (2010) 16. Bonnans, F., Martinon, P., Trélat, E.: Singular arcs in the generalized Goddard’s problem. J. Optim. Theory Appl. 139(2), 439–461 (2008) 17. Bonnard, B., Faubourg, L., Trélat, E.: Optimal control of the atmospheric arc of a space shuttle and numerical simulations with multiple shooting method. Math. Models Methods Appl. Sci. 15(1), 109–140 (2005) 18. Calise, A.J., Gath, P.F.: Optimization of launch vehicle ascent trajectories with path constraints and coast arcs. J. Guid., Control, Dyn. 24(2), 296–304 (2001) 19. Crandall, M.G., Evans, L.C., Lions, P.-L.: Some properties of viscosity solutions of HamiltonJacobi equations. Trans. Amer. Math. Soc 282(2), 487–502 (1984) 20. Crandall, M.G., Lions, P.-L.: Viscosity solutions of Hamilton Jacobi equations. Bull. Am. Math. Soc. 277(1), 1–42 (1983) 21. Crandall, M.G., Lions, P.-L.: Two approximations of solutions of Hamilton-Jacobi equations. Math. Comput. 43(167), 1–19 (1984) 22. Falcone, M., Ferretti, R.: Semi-Lagrangian Approximation Schemes for Linear and HamiltonJacobi Equations. SIAM - Society for Industrial and Applied Mathematics, Philadelphia (2014) 23. Forcadel, Nicolas., Rao, Zhiping, Zidani, Hasnaa: State-constrained optimal control problems of impulsive differential equations. Appl. Math. Optim. 68(1), 1–19 (2013) 24. Goddard, R.H.: A method of reaching extreme altitudes. Smithson. Miscellaneous Coll. 71(4), (1919) 25. Jiang, G.-S., Peng, D.: Weighted ENO schemes for Hamilton-Jacobi equations. SIAM J. Sci. Comput. 21(6), 2126–2143 (2000) 26. Johnson, W.: Contents and commentary on william moore’s a treatise on the motion of rockets and an essay on naval gunnery. Int. J. Impact Eng 16(3), 499–521 (1995) 27. Laurent-Varin, J.: Optimal Ascent and Reentry of Reusable Rockets. PhD thesis, Ecole Polytechnique, 2005. 
Supported by CNES and ONERA

42

O. Bokanowski et al.

28. Lu, P., Sun, H., Tsai, B.: Closed-loop endoatmospheric ascent guidance. J. Guid., Control, Dyn. 26(2), 283–294 (2003) 29. Margellos, K., Lygeros, J.: Hamilton-Jacobi Formulation for Reach-Avoid Differential Games. IEEE Trans. Autom. Control 56(8), 1849–1861 (2011) 30. Martinon, P., Bonnans, J.F., Laurent-Varin, J., Trélat, E.: Numerical study of optimal trajectories with singular arcs for an Ariane 5 launcher. J. Guid. Control Dyn. 32(1), 51–55 (2009) 31. Mooij, E.: The Motion of a Vehicle in a Planetary Atmosphere. Technical report, TU Delft (1994) 32. Oberle, H.J.: Numerical computation of singular control functions in trajectory optimization problems. J. Guid. Control Dyn. 13, 153–159 (1990) 33. Osher, S., Shu, C.-W.: High essentially nonoscillatory schemes for Hamilton-Jacobi equations. SIAM J. Numer. Anal. 28(4), 907–922 (1991) 34. Pesch, H.J.: Real-time computation of feedback controls for constrained optimal control problems, Part 2: a correction method based on multiple shooting. Optim. Control, Appl. Methods 10(2), 147–171 (1989) 35. Pesch, H.J.: A practical guide to the solution of real-life optimal control problems. Control Cybern. 23(1–2), 7–60 (1994) 36. Rowland, J.D.L., Vinter, R.B.: Construction of optimal feedback controls. Syst. Control Lett. 16(5), 357–367 (1991) 37. Vinter, R.: Optimal Control. Modern Birkhäuser Classics. Birkhäuser Boston, Inc., Boston, MA (2010). (Paperback reprint of the 2000 edition) 38. Zhang, L., Lu, P.: Fixed-point algorithms for optimal ascent trajectories of launch vehicles. Eng. Optim. 40(4), 361–381 (2008) 39. Zhu, J., Trélat, E., Cerf, M.: Minimum time control of the rocket attitude reorientation associated with orbit dynamics. SIAM J. Control Optim. 54(1), 391–422 (2016) 40. Zondervan, K.P. Bauer, T.P., Betts, J.T., Huffman, W.P.: Solving the optimal control problem using a nonlinear programming technique. part 3: Optimal shuttle reentry trajectories. 
Proceedings of the AIAA/AAS Astrodynamics conference, Seattle (1984)

A Robust Predictive Control Formulation for Heliogyro Blade Stability Adonis Pimienta-Penalver, John L. Crassidis, and Jer-Nan Juang

Abstract The Generalized Predictive Control (GPC) algorithm provides a framework with which to approach challenging systems of considerable non-linearity. Its implementation of on-line system identification techniques on time-domain response histories gives it the ability to adapt to changes in the system and dynamic environment. This paper presents the relationship between the GPC design parameters and the two conditions to achieve a robust system: (1) the closed-loop system must be asymptotically stable, and (2) the GPC controller itself must also be asymptotically stable. In order to achieve the aforementioned conditions, an optimization procedure is implemented to obtain the parameters related to the weighting in the controller synthesis, and the prediction/control horizons associated with future behavior. Additionally, this paper briefly presents a discretized representation of a heliogyro-type solar sail system, upon which the derived control method is demonstrated in real-time and shown to improve structural stability under measurement noise and changing environmental conditions.

1 Introduction

The strength of the predictive control method resides in the output prediction step, making it an attractive strategy with which to approach flexible systems. Generalized Predictive Control (GPC) is a type of model-based predictive control which identifies an input-output (I/O) model in order to make an output prediction to a desired horizon in the future [1, 2]. Attempts to add robustness to GPC are few; most recently, Lew and Juang demonstrated a modification to make GPC robust to model uncertainty [3].

A. Pimienta-Penalver (B) · J. L. Crassidis, Department of Mechanical and Aerospace Engineering, University at Buffalo, Amherst, NY, USA
J.-N. Juang, Department of Engineering Science, National Cheng Kung University, Tainan, Taiwan
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_2


A. Pimienta-Penalver et al.

Fig. 1 HELIOS in deployment. Top, left to right: Hub and sail reels, extending truss. Bottom, left to right: Extended reel truss; deployment

The approach investigated in this paper employs the built-in tuning parameters to promote stability in the controller and closed-loop systems. In GPC, the I/O time histories of the system are collected into Markov parameters, and cast into an Auto-Regressive eXogenous (ARX) model which is readily converted to state-space observable or controllable canonical form. Specifically, the closed-loop and controller state matrices can be realized into a canonical form and, if the stability of both these matrices is achieved by tuning the GPC parameters, then the robustness of the controller can be guaranteed. Past research has shown GPC systems to be effective for structural control and flutter suppression [3–5] problems. The solar sail structure described in this paper provides yet another highly nonlinear flexible system upon which to showcase GPC and its adaptive properties. The concept of a heliogyro solar sail was first envisioned by Richard MacNeal in 1967 [6]. The structure makes use of long, thin, flexible membranes— the sails—that rotate about a central spacecraft hub. Currently, an effort is being undertaken at NASA Langley Research Center in order to develop a low-cost, CubeSat-based, 440-meter diameter heliogyro demonstration mission named HELIOS (High-performance, Enabling, Low-cost, Innovative, Operational Solar sail) [7]. Due to the gossamer nature of the heliogyro sails, active control strategies are needed to ensure structural stability and command the attitude changes of the spacecraft. For the sake of visualization, Fig. 1 depicts the HELIOS system during the deployment process. Recently, GPC has been shown to be a promising control scheme to achieve the aforementioned goals [8]. The present paper represents an extension of that work in the application of a Robust GPC algorithm to the heliogyro blade control problem.


The organization of this paper is as follows: the formulation for GPC is derived in Sect. 2, including the necessary and sufficient conditions for robustness. Sections 3 and 4 briefly describe the heliogyro dynamics utilized in this study, as well as its proposed control strategy. Section 5 shows relevant results from the search for robustness in the GPC method utilizing the heliogyro system model.

2 Control Formulation

For a system with $r_c$ control inputs, $r_d$ disturbance inputs, $n$ states, and $m$ outputs, the finite difference model at time $k$ is given by

$$ y(k) + \alpha_1 y(k-1) + \cdots + \alpha_p y(k-p) = \beta_0 u(k) + \beta_1 u(k-1) + \cdots + \beta_p u(k-p) + \gamma_0 d(k) + \gamma_1 d(k-1) + \cdots + \gamma_p d(k-p) \tag{1} $$

where $p$ is the ARX model order, $u(k)$ is the $r_c \times 1$ control input vector, which could also include the effect of Solar Radiation Pressure (SRP) if it is not already subsumed within the model, $d(k)$ is the $r_d \times 1$ disturbance input vector, and $y(k)$ is the $m \times 1$ output vector at time $k$. In Eq. (1), the coefficient matrices $\alpha_i$ and $\beta_i$, of sizes $m \times m$ and $m \times r_c$, respectively, for $i \in \{0 \to p\}$, are known as the observer Markov parameters. The matrices $\gamma_i$, of size $m \times r_d$, are the weighting terms for the disturbance inputs, which are assumed to be unknown or unmeasurable. Thus, the disturbances are not included in the development of the controller, but are later modeled as Gaussian zero-mean white noise in simulations. When a system identification algorithm is used, such as those of Chap. 10 in Ref. [2], the ARX parameters $\alpha_i$ and $\beta_i$ can be obtained from I/O data. The data matrices $Y$ and $V$ are made up of collected I/O histories:

$$ Y = \begin{bmatrix} y(k-p) & y(k-p+1) & \cdots & y(k-1) \end{bmatrix}, \qquad V = \begin{bmatrix} u(k-p+1) & u(k-p+2) & \cdots & u(k) \\ v(k-p) & v(k-p+1) & \cdots & v(k-1) \\ \vdots & \vdots & \ddots & \vdots \\ v(2) & v(3) & \cdots & v(k-p+1) \\ v(1) & v(2) & \cdots & v(k-p) \end{bmatrix}, \qquad \text{where } v(k) = \begin{bmatrix} u(k) \\ y(k) \end{bmatrix} \tag{2} $$

With $(\cdot)^\dagger$ denoting a pseudo-inverse operation, the ARX coefficients are given as

$$ P = \begin{bmatrix} \beta_0 & \beta_1 & -\alpha_1 & \beta_2 & -\alpha_2 & \cdots & \beta_p & -\alpha_p \end{bmatrix} = Y V^\dagger \tag{3} $$

A recursive technique is utilized to predict $y_s(k)$, the future outputs:
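The pseudo-inverse fit of Eq. (3) can be sketched in a few lines of NumPy. This is an illustrative batch least-squares arrangement under stated assumptions, not the authors' identification code: each regressor column stacks $u(k)$ followed by the pairs $[u(k-i); y(k-i)]$ for $i = 1, \ldots, p$, which is one common way to index the data and may differ from the exact indexing of Eq. (2). The function name `identify_arx` is hypothetical.

```python
import numpy as np


def identify_arx(u, y, p):
    """Batch least-squares ARX fit in the spirit of Eq. (3).

    u: (N, r) input history, y: (N, m) output history, p: ARX order.
    Returns alpha (p arrays, m x m) and beta (p + 1 arrays, m x r) such that
    y(k) + sum_i alpha_i y(k-i) = sum_i beta_i u(k-i), disturbances ignored.
    """
    u = np.asarray(u, dtype=float).reshape(len(u), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    N, r = u.shape
    m = y.shape[1]
    cols = []
    for k in range(p, N):
        col = [u[k]]                      # current input
        for i in range(1, p + 1):
            col += [u[k - i], y[k - i]]   # past input/output pair v(k - i)
        cols.append(np.concatenate(col))
    V = np.array(cols).T                  # regressors, one column per time step
    Y = y[p:].T                           # outputs aligned with the columns of V
    P = Y @ np.linalg.pinv(V)             # [b0, b1, -a1, ..., bp, -ap]
    beta, alpha = [P[:, :r]], []
    for i in range(p):
        blk = P[:, r + i * (r + m): r + (i + 1) * (r + m)]
        beta.append(blk[:, :r])
        alpha.append(-blk[:, r:])
    return alpha, beta
```

With noise-free data and a sufficiently rich input, the fit recovers the coefficients of the generating ARX model; with noisy data it returns the least-squares estimate.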


$$ y_s(k) = \mathcal{T} u_s(k) + \mathcal{B} u_p(k-p) - \mathcal{A} y_p(k-p) \tag{4} $$

where $y_p(k-p)$ represents the vector of past outputs, and $u_s(k)$ and $u_p(k-p)$ are the future and past input vectors, respectively, which are structured analogously to

$$ y_s(k) = \begin{bmatrix} y(k) \\ y(k+1) \\ \vdots \\ y(k+s-1) \end{bmatrix}, \qquad y_p(k-p) = \begin{bmatrix} y(k-p) \\ y(k-p+1) \\ \vdots \\ y(k-1) \end{bmatrix} \tag{5} $$

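where the integer $s$ is related to the desired prediction horizon by $h_p = s - 1$. The matrices $\mathcal{T}$, $\mathcal{A}$, and $\mathcal{B}$ are given by Eq. (6), where $\mathcal{T}$ is the Toeplitz matrix formed by the ARX parameters. Note that $\mathcal{B}$ has the same structure as $\mathcal{A}$, but with $\beta$ in place of $\alpha$.

$$ \mathcal{T} = \begin{bmatrix} \beta_0 & & & \\ \beta_0^{(1)} & \beta_0 & & \\ \vdots & \vdots & \ddots & \\ \beta_0^{(s-1)} & \beta_0^{(s-2)} & \cdots & \beta_0 \end{bmatrix}, \qquad \mathcal{A} = \begin{bmatrix} \alpha_p & \alpha_{p-1} & \cdots & \alpha_1 \\ \alpha_p^{(1)} & \alpha_{p-1}^{(1)} & \cdots & \alpha_1^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_p^{(s-1)} & \alpha_{p-1}^{(s-1)} & \cdots & \alpha_1^{(s-1)} \end{bmatrix} \tag{6} $$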

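The prediction ARX parameters are found from

$$ \begin{aligned} \alpha_1^{(j)} &= \alpha_2^{(j-1)} - \alpha_1^{(j-1)} \alpha_1 & \qquad \beta_0^{(j)} &= \beta_1^{(j-1)} - \alpha_1^{(j-1)} \beta_0 \\ \alpha_2^{(j)} &= \alpha_3^{(j-1)} - \alpha_1^{(j-1)} \alpha_2 & \qquad \beta_1^{(j)} &= \beta_2^{(j-1)} - \alpha_1^{(j-1)} \beta_1 \\ &\;\;\vdots & &\;\;\vdots \\ \alpha_p^{(j)} &= -\alpha_1^{(j-1)} \alpha_p & \qquad \beta_p^{(j)} &= -\alpha_1^{(j-1)} \beta_p \end{aligned} \tag{7} $$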
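The recursion of Eq. (7) and the assembly of the matrices in Eqs. (4) and (6) can be sketched as follows. This is a minimal NumPy illustration, assuming that the superscript $(0)$ denotes the identified ARX coefficients themselves and that $\alpha_{p+1}^{(j)} = \beta_{p+1}^{(j)} = 0$; the function name `prediction_matrices` is hypothetical.

```python
import numpy as np


def prediction_matrices(alpha, beta, s):
    """Build T, A, B of Eq. (4) from ARX coefficients using the recursion of Eq. (7).

    alpha: list of p arrays (m x m), alpha[i - 1] = alpha_i.
    beta:  list of p + 1 arrays (m x r), beta[i] = beta_i.
    s:     number of predicted steps (h_p = s - 1).
    """
    alpha = [np.asarray(a, dtype=float) for a in alpha]
    beta = [np.asarray(b, dtype=float) for b in beta]
    p = len(alpha)
    m, r = beta[0].shape
    a, b = list(alpha), list(beta)        # alpha_i^(j), beta_i^(j) for the current j
    b0, A_rows, B_rows = [], [], []
    for j in range(s):
        if j > 0:
            a1 = a[0]                     # alpha_1^(j-1)
            a_next = [(a[i + 1] if i + 1 < p else 0.0) - a1 @ alpha[i]
                      for i in range(p)]
            b_next = [(b[i + 1] if i + 1 <= p else 0.0) - a1 @ beta[i]
                      for i in range(p + 1)]
            a, b = a_next, b_next
        b0.append(b[0])                       # beta_0^(j), used to fill T
        A_rows.append(np.hstack(a[::-1]))     # [alpha_p^(j) ... alpha_1^(j)]
        B_rows.append(np.hstack(b[:0:-1]))    # [beta_p^(j) ... beta_1^(j)]
    T = np.zeros((m * s, r * s))
    for i in range(s):
        for l in range(i + 1):                # lower block-triangular Toeplitz
            T[i * m:(i + 1) * m, l * r:(l + 1) * r] = b0[i - l]
    return T, np.vstack(A_rows), np.vstack(B_rows)
```

The block rows of `A` and `B` are ordered oldest sample first so that they multiply $y_p(k-p)$ and $u_p(k-p)$ as stacked in Eq. (5). In the disturbance-free case, the prediction `T @ u_s + B @ u_p - A @ y_p` reproduces the output obtained by stepping Eq. (1) forward.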

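In order for future outputs to be predicted using the finite difference model, the controlling inputs are applied only over a finite time-span, beyond which the inputs are assumed to be zero. The cost function to be minimized in the GPC is defined as

$$ J(k) = \frac{1}{2} \left\{ \left[ y_s(k) - \tilde{y}_s(k) \right]^T \left[ y_s(k) - \tilde{y}_s(k) \right] + u_s^T(k) \Lambda u_s(k) \right\} \tag{8} $$

where $\Lambda = \lambda I$ is a diagonal matrix of weighting penalty terms. The term $\tilde{y}_s$ is the desired vector of future outputs. Optimizing Eq. (8) with respect to $u_s(k)$ produces

$$ \frac{\partial J(k)}{\partial u_s(k)} = \left[ \frac{\partial y_s(k)}{\partial u_s(k)} \right]^T \left[ y_s(k) - \tilde{y}_s(k) \right] + \Lambda u_s(k) = 0 \tag{9} $$

where

$$ \frac{\partial y_s(k)}{\partial u_s(k)} = \begin{bmatrix} \frac{\partial y(k)}{\partial u(k)} & \frac{\partial y(k)}{\partial u(k+1)} & \cdots & \frac{\partial y(k)}{\partial u(k+s-1)} \\ \frac{\partial y(k+1)}{\partial u(k)} & \frac{\partial y(k+1)}{\partial u(k+1)} & \cdots & \frac{\partial y(k+1)}{\partial u(k+s-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y(k+s-1)}{\partial u(k)} & \frac{\partial y(k+s-1)}{\partial u(k+1)} & \cdots & \frac{\partial y(k+s-1)}{\partial u(k+s-1)} \end{bmatrix} \tag{10} $$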

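The control sequence to be applied to the system over some predetermined control horizon, $h_c$, is synthesized by introducing Eqs. (4) and (10) into Eq. (9):

$$ u_s(k) = -\left( \mathcal{T}^T \mathcal{T} + \Lambda \right)^{-1} \mathcal{T}^T \left[ \mathcal{B} u_p(k-p) - \mathcal{A} y_p(k-p) - \tilde{y}_s(k) \right] \tag{11} $$

The control inputs $u_s(k)$ are calculated for each time-step as a weighted sum of I/O data back to time-step $k - p$. In the relation of Eq. (11), the first $r_c$ values of the resolved control sequence are collected, the rest are discarded, and a new control vector is synthesized for the next time step. For the regulation problem:

$$ \begin{aligned} u(k) &= \text{first } r_c \text{ rows of } -\left( \mathcal{T}^T \mathcal{T} + \Lambda \right)^{-1} \mathcal{T}^T \left[ \mathcal{B} u_p(k-p) - \mathcal{A} y_p(k-p) \right] \\ &= \alpha_1^c y(k-1) + \cdots + \alpha_p^c y(k-p) + \beta_1^c u(k-1) + \cdots + \beta_p^c u(k-p) \end{aligned} \tag{12} $$

where

$$ \begin{bmatrix} \alpha_1^c & \alpha_2^c & \cdots & \alpha_p^c \end{bmatrix} = \text{the first } r_c \text{ rows of } \left( \mathcal{T}^T \mathcal{T} + \Lambda \right)^{-1} \mathcal{T}^T \mathcal{A}, \qquad \begin{bmatrix} \beta_1^c & \beta_2^c & \cdots & \beta_p^c \end{bmatrix} = \text{the first } r_c \text{ rows of } -\left( \mathcal{T}^T \mathcal{T} + \Lambda \right)^{-1} \mathcal{T}^T \mathcal{B} $$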
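The receding-horizon law of Eqs. (11) and (12) amounts to one regularized least-squares solve per design. The sketch below assumes $\Lambda = \lambda I$ with a scalar $\lambda$ and that the prediction matrices are already available as NumPy arrays; the function names are hypothetical.

```python
import numpy as np


def gpc_sequence(T, A, B, lam, u_past, y_past, y_des=None):
    """Future input sequence u_s(k) of Eq. (11) for Lambda = lam * I."""
    if y_des is None:
        y_des = np.zeros(T.shape[0])      # regulation problem: desired output is zero
    G = np.linalg.solve(T.T @ T + lam * np.eye(T.shape[1]), T.T)
    return -G @ (B @ u_past - A @ y_past - y_des)


def gpc_gains(T, A, B, lam, r_c):
    """Controller gains of Eq. (12): the first r_c rows of the two gain products.

    The returned blocks multiply y_p(k-p) and u_p(k-p) as stacked in Eq. (5),
    i.e. their columns are ordered from the oldest sample to the newest.
    """
    G = np.linalg.solve(T.T @ T + lam * np.eye(T.shape[1]), T.T)
    return (G @ A)[:r_c], (-G @ B)[:r_c]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical convenience. For $\lambda > 0$ the matrix being solved is positive definite, so the solve is well posed even when $\mathcal{T}$ is rank deficient.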

2.1 Stability Considerations

It is important to note that, in Eq. (11), the matrix inverse of $(\mathcal{T}^T \mathcal{T} + \Lambda)$ for $\Lambda = 0_{r_c \times r_c}$ exists only when the prediction horizon is large enough so that the Toeplitz matrix $\mathcal{T}$ is full rank. Arranging Eqs. (1) and (11) yields

$$ \Phi_0 v(k) + \Phi_1 v(k-1) + \cdots + \Phi_p v(k-p) = \gamma_0 d(k) + \gamma_1 d(k-1) + \cdots + \gamma_p d(k-p) \tag{13} $$

where

$$ v(k) = \begin{bmatrix} y(k) \\ u(k) \end{bmatrix}; \qquad \Phi_0 = \begin{bmatrix} I_m & -\beta_0 \\ 0 & I_{r_c} \end{bmatrix}; \qquad \Phi_i = \begin{bmatrix} \alpha_i & -\beta_i \\ -\alpha_i^c & -\beta_i^c \end{bmatrix}; \qquad i = 1, \ldots, p \tag{14} $$

This helps to define the resulting closed-loop state matrix in observable canonical form, which is oversized to include closed-loop system eigenvalues and additional computational eigenvalues that do not affect the system's behavior. Normally, the computational modes are observable but not controllable.

$$ A_{cl} = \begin{bmatrix} 0 & 0 & \cdots & 0 & -\Phi_0^{-1} \Phi_p \\ I_{m+r_c} & 0 & \cdots & 0 & -\Phi_0^{-1} \Phi_{p-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & I_{m+r_c} & -\Phi_0^{-1} \Phi_1 \end{bmatrix} \tag{15} $$

Generally, for a given prediction and control horizon, with $h_c \le h_p$, the smaller the penalty $\lambda$, the larger the magnitude of the generated control inputs, which increases the stability of the closed loop. Precisely, the closed loop will be asymptotically stable only if every eigenvalue of the closed-loop state matrix $A_{cl}$ has magnitude less than one. When $\lambda$ approaches zero, the closed-loop system eigenvalues also approach zero, and the response becomes deadbeat. Similarly, for the isolated controller, asymptotic stability exists only if the magnitude of each eigenvalue of the matrix in Eq. (16) is less than one.

$$ A_c = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & \beta_p^c \\ I_{r_c} & 0 & 0 & \cdots & 0 & \beta_{p-1}^c \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & I_{r_c} & \beta_1^c \end{bmatrix} \tag{16} $$

For a given prediction horizon, $h_p$, and some control horizon $h_c \le h_p$, the larger the value of $\lambda$, the smaller the control magnitude, thus improving controller stability.

2.2 Robustness

The closed-loop system stability and controller stability can be negotiated by carefully tuning the weighting parameters. If these two conditions are satisfied, the robustness of the GPC will be guaranteed. This can be achieved by manually tuning $h_p$ and $\lambda$, or by applying an optimization procedure to find the best trade-off. Consider the following cost index, which essentially accounts for the energy applied in control:

$$ J(h_p, \lambda) = \left\| \begin{bmatrix} \alpha_1^c & \alpha_2^c & \cdots & \alpha_p^c & \beta_1^c & \beta_2^c & \cdots & \beta_p^c \end{bmatrix} \right\|_2 \tag{17} $$

with the constraint for the GPC closed-loop matrix

$$ |\text{each eigenvalue of } A_{cl}| \le \delta < 1 \tag{18} $$

and for the GPC controller matrix

$$ |\text{each eigenvalue of } A_c| < 1 \tag{19} $$

where $\delta$ refers to the desired closed-loop performance. Minimizing the cost index with respect to the parameters $h_p$ and $\lambda$ subject to the two constraints will yield a robust GPC controller. The first constraint, in Eq. (18), gives the desired performance with pre-assigned value $\delta$. For the case of $\delta \approx 1$ in a stable open-loop system, the coefficient matrices $\alpha_1^c, \alpha_2^c, \ldots, \alpha_p^c, \beta_1^c, \beta_2^c, \ldots, \beta_p^c$ approach zero matrices, meaning that GPC control is not actually needed.

Computational Steps

A tentative sequence of steps to achieve robustness in the GPC follows:

1. Use any system identification technique (batch or recursive) to determine the open-loop observer Markov (ARX) parameters ($\alpha_1, \alpha_2, \ldots, \alpha_p, \beta_1, \beta_2, \ldots, \beta_p$) before the control action is turned on. The integer $p$ must be chosen such that $pm \ge n$, where $n$ is the system order and $m$ is the number of outputs.
2. Compute the system Markov parameters with Eq. (3).
3. Form the Toeplitz matrix $\mathcal{T}$ of size $ms \times r_c s$, $\mathcal{A}$ of size $ms \times mp$, and $\mathcal{B}$ of size $ms \times r_c p$, as shown in Eq. (6).
4. According to a desired performance metric, $\delta$, minimize the cost index of Eq. (17) over the $(h_p, \lambda)$ space, subject to Eqs. (18) and (19):
   a. Use the Toeplitz matrix, $\mathcal{T}$, and the chosen control penalties, $\lambda_i$, to find the control Markov parameters ($\alpha_1^c, \alpha_2^c, \ldots, \alpha_p^c, \beta_1^c, \beta_2^c, \ldots, \beta_p^c$).
   b. If the cost index of Eq. (17) is not minimum, then return to step 4. Otherwise, stop the optimization process.

A rigorous definition of robustness dictates that a controller be designed to cope with parameter changes by stabilizing the desired process provided that such parameters are within a specified range [9]. It should be noted that, in this paper, we omit this last requirement because we are interested in generating a general framework for robust heliogyro blade control and foresee a more in-depth study on the influence of a range of SRP, vehicle size, and radial distance from the Sun on an appropriate choice for $\delta$ in order to guarantee stability.

3 Heliogyro Attitude Dynamics

The following sections briefly describe the derivation of the blade and attitude dynamics of the heliogyro. A discrete-mass approach is chosen to preserve coupling among all system coordinates in a straightforward procedure. The blade is divided into equal sections, where the mass is lumped into discrete points located at the edges of the sections and separated by massless but rigid rods. The hub is shaped as a thick disk of mass $m_h$, radius $r_h$, and thickness $t$, as shown in Fig. 2.


Fig. 2 Hybrid model: continuous hub and discrete masses. Note that r stands for the length of each segment, while a represents the half-width of the sail blade

Fig. 3 Deformations on each segment: axial ($u$, $v$, and $w$, for the $x$, $y$, and $z$ directions, respectively) and torsional, $\phi$, elastic deformations due to material strain

Fig. 4 Blade ($e_{x_i}, e_{y_i}, e_{z_i}$), root ($e_{x_r}, e_{y_r}, e_{z_r}$), hub ($e_x, e_y, e_z$), and inertial ($e_X, e_Y, e_Z$) frames

3.1 Position and Velocity Formulation

Figure 3 shows the three fundamental motions that describe the vibration of each blade segment, all of which are constrained by the length of the massless rods. The generalized coordinates are collected in the following vector:

$$ q = \begin{bmatrix} w_{ji} & v_{ji} & \phi_{ji} & \cdots \end{bmatrix}^T \quad \text{for } j \in \{1 \to n_b\} \text{ and } i \in \{1 \to n_s\} \tag{20} $$

The axial deformation, $u$, is ignored in this definition because it can be defined entirely in terms of $w$, $v$, and $r$. Note that $n_b$ and $n_s$ refer to the number of blades in the heliogyro and the number of segments per blade, respectively. Figure 4 shows the coordinate systems used: the segment-fixed frames are defined by $[e_{x_i}, e_{y_i}, e_{z_i}] \in \mathcal{B}$; the root frame is $[e_{x_r}, e_{y_r}, e_{z_r}] \in \mathcal{R}$, which is fixed to the hub-blade interface; the body-fixed hub coordinate system is $[e_x, e_y, e_z] \in \mathcal{H}$; and finally, an inertial frame is defined as $[e_X, e_Y, e_Z] \in \mathcal{I}$. The transcendental functions operating on the out-of-plane and in-plane bending angles are converted into expressions carrying only the deflection coordinates in Eq. (20) by way of geometry. Thus, from the $i$th blade frame, $\mathcal{B}$, to the root frame, $\mathcal{R}$, the rotation is

$$ T_{RB} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi_i & \sin\phi_i \\ 0 & -\sin\phi_i & \cos\phi_i \end{bmatrix} \begin{bmatrix} \frac{\sqrt{r^2 - w_i^2}}{r} & 0 & \frac{w_i}{r} \\ 0 & 1 & 0 \\ -\frac{w_i}{r} & 0 & \frac{\sqrt{r^2 - w_i^2}}{r} \end{bmatrix} \begin{bmatrix} \frac{\sqrt{r^2 - w_i^2 - v_i^2}}{\sqrt{r^2 - w_i^2}} & \frac{v_i}{\sqrt{r^2 - w_i^2}} & 0 \\ -\frac{v_i}{\sqrt{r^2 - w_i^2}} & \frac{\sqrt{r^2 - w_i^2 - v_i^2}}{\sqrt{r^2 - w_i^2}} & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{21} $$

Note that this also serves as the transformation from the $i$th to the $(i-1)$th blade frame. The transformation from the root frame $\mathcal{R}$ to the hub frame $\mathcal{H}$, $T_{HR}$, is

$$ T_{HR} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix} \tag{22} $$

The root frame, $\mathcal{R}$, and the hub frame, $\mathcal{H}$, share the same $x$-direction; the pitching angle $\theta$ is the only means by which a control command can be transferred onto the blade. Finally, the $i$th tip mass positions are

$$ R_i = \begin{bmatrix} \rho_X \\ \rho_Y \\ \rho_Z \end{bmatrix}_{\mathcal{I}} + T_{HR} T_{RB_i} \begin{bmatrix} r \\ a \\ 0 \end{bmatrix}_{\mathcal{B}} + R_{(i-1)H}, \quad \text{where } R_0 = T_{HR} \begin{bmatrix} r_h \\ a \\ 0 \end{bmatrix}_{\mathcal{R}} \tag{23} $$

Note that the inertial distance terms (ρ X,Y,Z ), superfluous in the attitude dynamics, are ignored for the scope of this paper. The conversion from the hub frame, H, to the inertial frame, I, follows a 3-2-1 Euler sequence with angles o1 , o2 , and o3 . The expressions of velocity can then be completely defined in terms of the generalized coordinates of the system in Eq. (20), Euler angles, and body rates (ωx , ω y , ωz ) [8].

3.2 Energy Formulation

The kinetic and potential energies for the attitude dynamics are given by:

$$ T = \sum_{i=1}^{n} \frac{1}{2} m \left( \dot{R}_{iB} \cdot \dot{R}_{iB} \right) + \frac{1}{2} \left( I_{xx} \omega_x^2 + I_{yy} \omega_y^2 + I_{zz} \omega_z^2 \right) \tag{24a} $$

$$ V = \sum_{i=0}^{n} \int_{-a}^{a} \left( \frac{1}{2} E \varepsilon_{ixx}^2 + \frac{1}{2} G \varepsilon_{ixy}^2 + \frac{1}{2} G \varepsilon_{ixz}^2 \right) dy \tag{24b} $$

where $E$ and $G$ are the elastic and shear moduli, respectively. The axial strain is $\varepsilon_{xx}$, while $\varepsilon_{xy}$ and $\varepsilon_{xz}$ are the shear strains along the in-plane and out-of-plane directions. It is necessary to cast the strain variables, $\varepsilon_{xx}$, $\varepsilon_{xy}$, and $\varepsilon_{xz}$, into functions of the generalized coordinates as well. In Ref. [10], Hodges and Dowell derive the aforementioned strain deformations for a continuous rotor blade; the authors of the present paper then assimilate these quantities into the discretized blade. The underlying hypothesis is that the same set of strain-displacement relationships may be used on the heliogyro blade if it is imagined as a rotor blade with minimal thickness. The relative displacements in the $e_{x_i}$, $e_{y_i}$, and $e_{z_i}$ directions of the $i$th segment frame are defined as $\bar{u}$, $\bar{v}$, and $\bar{w}$, respectively. Additionally, $\bar{\varphi}$ corresponds to the total angle of twist along the elastic axis ($e_{x_i}$). The following equation shows how these quantities are obtained from the present model:

$$ \begin{aligned} \bar{w}_i &= R_{iH,z} \big|_{a=0} \\ \bar{v}_i &= R_{iH,y} \big|_{a=0} \\ \bar{u}_i &= R_{iH,x} \big|_{a=0} - r \cdot i \\ \bar{\varphi}_i &= \textstyle\sum_{k=1}^{i} \phi_k \end{aligned} \qquad \text{with} \qquad \begin{aligned} \bar{w}_0 &= 0, & \bar{v}_0 &= 0, & \bar{\varphi}_0 &= 0 \\ \bar{w}'_0 &= 0, & \bar{v}'_0 &= 0 & & \\ \bar{w}''_n &= 0, & \bar{v}''_n &= 0, & \bar{\varphi}'_n &= 0 \end{aligned} \tag{25} $$

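Note that, since symmetry is assumed about the elastic axis ($e_{x_i}$), the point masses along either edge of the discretized blade may be used for the following definitions. Forward and backward finite differences are used for the spatial derivatives:

$$ \begin{aligned} \bar{w}'_i &= \frac{\bar{w}_i - \bar{w}_{i-1}}{r}, & \bar{w}''_i &= \frac{\bar{w}'_i - \bar{w}'_{i-1}}{r}, & \bar{v}''_i &= \frac{\bar{v}'_i - \bar{v}'_{i-1}}{r} \\ \bar{v}'_i &= \frac{\bar{v}_i - \bar{v}_{i-1}}{r}, & \bar{u}'_i &= \frac{\bar{u}_i - \bar{u}_{i-1}}{r}, & \bar{\varphi}'_i &= \frac{\bar{\varphi}_{i+1} - \bar{\varphi}_i}{r} \end{aligned} \tag{26} $$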

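If the heliogyro membrane blade is imagined as a very thin rotor blade, the elastic strain-displacement relations found in Ref. [10] may be used. The strains on the blade are thus approximated as

$$ \begin{aligned} \varepsilon_{ixx} &= \bar{u}'_i + \frac{\bar{v}_i'^2}{2} + \frac{\bar{w}_i'^2}{2} + y^2 \frac{\bar{\varphi}_i'^2}{2} - \bar{v}''_i \, y \cos\left( \bar{\varphi}_i + \theta \right) - \bar{w}''_i \, y \sin\left( \bar{\varphi}_i + \theta \right) \\ \varepsilon_{ixy} &= y \, \bar{\varphi}'_i, \qquad \varepsilon_{ixz} = 0 \end{aligned} \tag{27} $$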

where the variable $y$ represents an arbitrary position along the blade cross section. Note that the terms corresponding to the out-of-plane direction, which produce relations associated with the thickness of the blade, are neglected in this approximation due to the extremely thin nature of a heliogyro membrane blade. As the physical system is discretized into lumped masses, an equivalent discretization for the moduli, $E$ and $G$, is not readily found. These quantities are therefore treated as design parameters. In this paper, the modal performance at 0 SRP is used as a calibration point to choose these values. Following the Lagrangian approach ($L = T - V$), the equations of motion that describe the fully-coupled nonlinear blade behavior can be found using

$$ \frac{d}{dt} \left( \frac{\partial L}{\partial \dot{q}_i} \right) - \frac{\partial L}{\partial q_i} = \tau_{q_i}, \quad \text{where } q = \begin{bmatrix} w_i & v_i & \phi_i \end{bmatrix}^T \tag{28} $$


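while the corresponding form of Lagrange's equation that describes the attitude dynamics of the system in terms of quasi-coordinates is given by

$$ \frac{d}{dt} \left( \frac{\partial L}{\partial \omega_l} \right) - [\times] \frac{\partial L}{\partial \omega_l} - B^T \frac{\partial L}{\partial o_l} = B^T \tau_{o_l}, \quad \text{where } l \in \begin{cases} \{x, y, z\} & \text{for } \omega \\ \{1, 2, 3\} & \text{for } o \end{cases} \tag{29} $$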

In Eq. (29), the body angular rates $\omega_x$, $\omega_y$, and $\omega_z$ are utilized in place of the generalized coordinates $\dot{o}_1$, $\dot{o}_2$, and $\dot{o}_3$. This is derived by Meirovitch in Ref. [11], and facilitates the formulation of the kinetic energy, as well as any feedback control law, based on angular velocities about the orthogonal body axes $e_x$, $e_y$, and $e_z$ as opposed to the Euler angles $o_1$, $o_2$, and $o_3$. Note that $[\times]$ represents the cross-product matrix of the body rates ($\omega_x$, $\omega_y$, and $\omega_z$) [12]. The matrix $B$ is the kinematic matrix associated with the 3-2-1 rotation from the hub frame, $\mathcal{H}$, to the inertial frame, $\mathcal{I}$, and is given by

$$ B = \frac{1}{\cos o_2} \begin{bmatrix} 0 & \sin o_1 & \cos o_1 \\ 0 & \cos o_2 \cos o_1 & -\cos o_2 \sin o_1 \\ \cos o_2 & \sin o_2 \sin o_1 & \sin o_2 \cos o_1 \end{bmatrix} \tag{30} $$

The inversion of $B$ for a 3-2-1 rotation shows that, for small Euler angles $o_1$ and $o_2$, the quasi-coordinate body rates may be approximated as Euler angle rates:

$$ \begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} = B^{-1} \begin{bmatrix} \dot{o}_3 \\ \dot{o}_2 \\ \dot{o}_1 \end{bmatrix} \tag{31} $$

If $o_2, o_1 \ll 1$, then $\omega_x \approx \dot{o}_1$, $\omega_y \approx \dot{o}_2$, and $\omega_z \approx \dot{o}_3$. This allows the body rates $\omega_x$, $\omega_y$, and $\omega_z$ to be interpreted as the derivatives of the Euler angles, which in turn allows the system's state vector to be configured to reflect this equivalence when a linearization is applied.
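The small-angle statement attached to Eqs. (30) and (31) can be checked numerically. The sketch below builds the kinematic matrix as written in Eq. (30) and solves Eq. (31) for the body rates; the function names and the ordering of the rate vector as $[\dot{o}_3, \dot{o}_2, \dot{o}_1]$ follow Eq. (31), and everything else is an illustrative assumption.

```python
import numpy as np


def kinematic_matrix(o1, o2):
    """Kinematic matrix B of Eq. (30) for the 3-2-1 Euler sequence (angles in rad)."""
    s1, c1 = np.sin(o1), np.cos(o1)
    s2, c2 = np.sin(o2), np.cos(o2)
    return np.array([[0.0, s1, c1],
                     [0.0, c2 * c1, -c2 * s1],
                     [c2, s2 * s1, s2 * c1]]) / c2


def body_rates(o1, o2, euler_rates_321):
    """Body rates [wx, wy, wz] from Eq. (31); euler_rates_321 = [o3_dot, o2_dot, o1_dot]."""
    return np.linalg.solve(kinematic_matrix(o1, o2), np.asarray(euler_rates_321, float))


if __name__ == "__main__":
    # For small o1 and o2 the body rates approach [o1_dot, o2_dot, o3_dot].
    print(body_rates(1e-3, 1e-3, [0.3, 0.2, 0.1]))
```

The matrix is singular when $\cos o_2 = 0$, which is the usual limitation of a 3-2-1 Euler sequence; the small-angle regime used for the linearization stays far from that condition.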

3.3 SRP Forcing and Linearization

The solution by Wie [13] is used to approximate the SRP force along the normal:

$$ f_n = -f_{srp} \cos^2\alpha \; e_n = -f_{srp} \left( e_i \cdot e_n \right)^2 e_n \tag{32} $$

where ei is the incident unit vector, en is the unit normal at an arbitrary point on the segment surface, and α represents the inscribed half angle. In the discrete-mass model, the SRP is assumed to be equally distributed over all point masses, yielding the following external torque vector by applying the principle of virtual work:

$$ \tau_q = \sum_{i=1}^{n_s} \left( f_{n_i} \cdot \frac{\delta R_1}{\delta q} + f_{n_i} \cdot \frac{\delta R_2}{\delta q} \right) \tag{33} $$

where R1 and R2 locate the point mass pairs to the corners of each segment. As the entire heliogyro spins up, the centrifugal force generated by the rotation will reach equilibrium with the normal force that the SRP exerts on each blade. This static deflection is obtained by setting all displacement variables (other than out-of-plane bending) and derivatives to zero, equating the results to the constant terms of the SRP forcing, and solving for the permanent out-of-plane bending, ws . The system is permitted to oscillate in the out-of-plane direction through the w coordinate. Linearizing around a constant spin rate about the ez -axis in the body frame defines a nominal rpm value, ω0 , around which variations, ω, are allowed. Both the fully non-linear equations of motion and the linearized versions are too large to be included in this paper. If the reader is so inclined, Appendix C of reference [8] contains the linearized equations of motion for a small heliogyro system and can be found online free of charge.
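Returning to the normal-force approximation of Eq. (32), a minimal sketch of its evaluation is given here. It assumes that $f_{srp}$ is a scalar force magnitude and that the two direction vectors may be supplied unnormalized; the function name is hypothetical.

```python
import numpy as np


def srp_normal_force(f_srp, e_i, e_n):
    """SRP force along the surface normal, Eq. (32): -f_srp (e_i . e_n)^2 e_n."""
    e_i = np.asarray(e_i, dtype=float)
    e_n = np.asarray(e_n, dtype=float)
    e_i = e_i / np.linalg.norm(e_i)       # incident unit vector
    e_n = e_n / np.linalg.norm(e_n)       # unit normal of the segment surface
    cos_alpha = e_i @ e_n
    return -f_srp * cos_alpha**2 * e_n
```

Because the cosine enters squared, the force magnitude falls to one quarter of its normal-incidence value when the incident direction is 60 degrees away from the normal, and vanishes at grazing incidence.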

4 HELIOS Control Strategy

The stability and attitude control in all six degrees of freedom of the heliogyro is achieved through coordinating pitching profiles commonly associated with helicopter rotor controls, as seen in Fig. 5. The heliogyro sails will be monitored using a dedicated camera assembly positioned above the plane of rotation by a deployable mast, seen in Fig. 6. Videogrammetry techniques will be applied to the images obtained from the cameras in order to measure deflections at one or several points along each of the blades. Furthermore, a set of dedicated motors is to be housed at the root of each of the blades in order to provide pitch actuation according to some desired maneuver profile. This setup has been demonstrated experimentally at NASA Langley Research Center, where investigations are being carried out on a hanging blade mock-up to obtain sufficiently good measurements for control purposes. In order to comply with attitude control requirements, heliogyros are planned to follow conventional helicopter controls, where a set of pitching profiles is applied to each blade to obtain the desired rotational moments or translational forces. To this end, combinations of cyclic, collective, and half-pitch maneuvers are integrated together as shown by Guerrant and Blomquist in [14] and [15], respectively. Therefore, in concept, the controller that commands each blade root actuator is designed to accept desired forces and/or moments (see Fig. 5), and the algorithm then provides the appropriate amount of pitching to be coordinated.

Fig. 5 Collective, Cyclic, and Half-Pitch Maneuvers

Fig. 6 HELIOS system with deployed truss structure, and before unfurling sails. Note the location of the deployed mast camera and root actuator

Table 1 HELIOS structural parameters

Parameter        Symbol           Value
Blade mass       m · (2n_s + 2)   0.674 kg
Blade length     r · n_s          220 m
Blade width      2a               0.75 m
Hub mass         m_h              7.6 kg
Hub diameter     r_h              0.5 m
Hub thickness    t                0.5 m

5 Analysis Table 1 shows the parameters utilized in the simulations presented in this paper. These values are taken directly from the HELIOS mission specifications [16]. The five-segment (ns = 5) heliogyro model is utilized and set to a nominal spin rate of ω0 = 1 rpm, with an SRP of 9.1 × 10^−6 Pa, which corresponds to a distance of 1 AU from the Sun. A modest additive noise term with a standard deviation of 5% of the largest deflection for each direction is applied; this corresponds to d(k) in Eq. 2. The u(k) term corresponds to a generated random excitation seed. This is admittedly not an accurate representation of all possible cases in the probabilistic sense, but the goal of this study is to examine the general trends over the (hp, λ) design space. The output y(k) consists of tip twist angles and rates, as well as those at the half-blade location, when specified.

Fig. 7 Comparison of closed-loop stability over several p-sizes. Panels: (a) Closed-loop space stability for p = 50, (b) Closed-loop space stability for p = 300, (c) Closed-loop space stability for p = 500, (d) Closed-loop space stability for p = 700. Each panel plots | eig( Acl ) | − 1 over the prediction horizon ( hp ) and the control penalty ( λ )

5.1 Model Size The central design element of the GPC is the ARX size, p. An algebraic analysis reveals that, for a sufficiently large value of p, the GPC algorithm used to identify the ARX coefficients from I/O data approaches the optimal solution of a Kalman filter [8]. This conclusion highlights the necessity to set a large enough value for p so that the GPC method is able to accurately predict future outputs. The controller's ability to synthesize the appropriate future control commands hinges on the accuracy of the ARX model identification. In relation to robustness, however, the p-size is only relevant insofar as it is sufficient to allow the system identification portion of GPC to construct an accurate linear ARX representation of the system being controlled. This is demonstrated in Fig. 7, where GPC schemes of varying p-size are compared in the achievable closed-loop stability over the (hp, λ)-space. The surface plots of Fig. 7 each show the signed distance to the unit circle of the largest-magnitude eigenvalue of the constructed closed-loop matrix from Eq. (15), computed as max|eig(Acl)| − 1. This means that positive values on the vertical axis indicate at least one unstable mode in the closed-loop matrix, while negative values indicate stability at that particular instance


of hp and λ. Note that the studies are carried out using the same random number seed as excitation before turning on the GPC. The figure shows that as the ARX size increases, the system identification portion of the GPC algorithm is able to fully identify all the modes of the heliogyro system, which increases the quality of the multi-step ahead prediction ys. Nevertheless, there appears to be a limit to how much the user can increase the ARX size to get better stability results in the closed loop. This is shown by Fig. 7c and d, where there is very little improvement in the stability properties of the Acl matrix as the ARX size goes from p = 500 to 700.
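The stability measure max|eig(Acl)| − 1 used in Fig. 7 can be illustrated with a minimal sketch. The two 2×2 matrices below are hypothetical stand-ins and are not the closed-loop matrix of Eq. (15); the helper names `eig2x2` and `stability_margin` are likewise assumptions made for illustration only.

```python
import cmath

def eig2x2(a):
    """Eigenvalues of a real 2x2 matrix from its characteristic polynomial."""
    (a11, a12), (a21, a22) = a
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + root) / 2.0, (tr - root) / 2.0

def stability_margin(a):
    """max|eig(A)| - 1: negative if all discrete-time modes lie inside the unit circle."""
    return max(abs(lam) for lam in eig2x2(a)) - 1.0

# Hypothetical matrices (not from the paper): a decaying and a slowly growing rotation.
stable = [[0.9, -0.1], [0.1, 0.9]]
unstable = [[1.0, -0.1], [0.1, 1.0]]

print(stability_margin(stable))    # negative value: stable
print(stability_margin(unstable))  # positive value: at least one unstable mode
```

For a matrix of realistic size, a numerical eigenvalue routine would replace the closed-form 2×2 solution; the sign convention of the margin stays the same.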

5.2 Sensor Layout Another important feature to study is the possibility of using videogrammetry to track targets from additional locations along the blade, as opposed to limiting the measurements to the tip only. The plots of Fig. 8 investigate how using different sensor setups can affect the optimization problem. The left column (Fig. 8c, e, g) corresponds to twist angle videogrammetry at the sail tip, while the right column (Fig. 8d, f, h) corresponds to additional twist angle measurements halfway along the blade. Results show that Layout # 2 slightly improves the controller stability, while generally halving the cost function evaluation. This is an expected result, since doubling the number of measurements available for the GPC to identify the same number of modes is essentially equivalent to doubling the size of p. Thus, with regard to the control of the heliogyro, utilizing as many measurements as possible on a single blade, limited only by on-board computational capabilities, might be a suitable way to improve the stability of the system and decrease the effort associated with poorly identified modes.

5.3 Application onto Nonlinear System Until now, the results shown in this paper have been carried out on the identified ARX system. The following briefly describes the application of the GPC control method onto the fully non-linear system. A set of 50 experiments of varying random number streams is used to test the closed-loop stability of the controlled non-linear system using the synthesized inputs from the GPC. Results show that closed-loop stability of the non-linear system is improved by the application of GPC. In fact, even in cases when the open-loop system is erroneously identified as being unstable, the GPC still delivers some region of closed-loop stability, as shown in Fig. 9. Figure 9 shows the closed-loop stability space in the non-linear system for an experiment with p = 500, where the contour lines show the magnitude of the largest eigenvalue found for that particular instance of (hp, λ). In this particular experiment run, the identified open-loop system appears to be unstable, which should not be the case, since the heliogyro system is, at worst, marginally stable [8]. It is evident, nevertheless, that GPC is still an effective tool for control of the heliogyro structure even when the identification size is not sufficient, as is shown here for p = 500. All experiments for p = 700 perform similarly to the results of Fig. 8e and f. Figure 10a and b show closed-loop non-linear system results for different measurement layouts (see Fig. 8a and b). The bottom two plots of both simulations show the closed-loop pitching history at mid-sail and tip (control is turned on at 400 s), indicating that although the settling time performance suffers slightly, Layout # 2 greatly improves blade flatness. It is evident that augmenting the measurement channels provided to the GPC routine not only improves the solution to the optimization problem for robustness, but is also helpful in promoting the level of blade uniformity that is crucial for maneuverability of the whole system.

Fig. 8 Constraint and objective function spaces for different sensor layouts. Panels: (a) Layout # 1: Tip targets only, (b) Layout # 2: Tip and halfway targets, (c) Controller constraint space (Layout # 1), (d) Controller constraint space (Layout # 2), (e) Closed-loop constraint space (Layout # 1), (f) Closed-loop constraint space (Layout # 2), (g) Objective function space (Layout # 1), (h) Objective function space (Layout # 2). Panels (c) and (d) plot | eig( Ac ) | − 1, panels (e) and (f) plot | eig( Acl ) | − 1, and panels (g) and (h) plot the objective J( hp , λ ), each over the prediction horizon ( hp ) and the control penalty ( λ )

Fig. 9 Region of closed-loop stability, p = 500 (using nonlinear system). Contour plot over the control penalty ( λ ) and the prediction horizon ( hp ), with contour levels between 0.96 and 1.12

6 Conclusions This paper derives a formulation to introduce robustness into the design of Generalized Predictive Control, which may be applied in an iterative manner over the (h p , λ)-space. The idea hinges on guaranteeing the asymptotic stability of both the controller and the closed-loop systems, the latter of which can be used to define some desired performance metric for the applied control, while minimizing an objective function which accounts for control effort. The application of the GPC, with a focus on robustness and closed-loop stability, is investigated on the heliogyro solar sail system, which is shown to need active control attention in order to accomplish its intended capabilities. Analysis shows that an increasing ARX size, which defines the quality of the identification step within the GPC, has a positive effect on the closed-loop stability up until the point when the system’s principal modes are fully identified. This leads

Fig. 10 Full nonlinear simulations. Sensor layout comparison. Panels: (a) GPC using Layout # 1, (b) GPC using Layout # 2. Each panel shows the open-loop and closed-loop response (deg) over time (sec) at the Blade 1 tip and the Blade 2 tip, with mid-sail pitch and tip pitch traces

to a trade-off between computational effort and closed-loop stability. The closed-loop stability of the system improves with an increase in the prediction horizon, hp, while the inverse is true with regard to the control penalty, λ. Conversely, the controller system stability is seen to be improved with increasing control penalty, while a similar trend is true of the cost function. Additionally, the paper also highlights the benefit of increasing the measurement channels available to the GPC: the cost function evaluation is seen to proportionally decrease, while the closed-loop stability and the sail blade flatness are also seen to improve.


Acknowledgements The authors would like to thank the participating reviewers for the Proceedings of the 7th International Conference on High Performance Scientific Computing for helping increase the quality of this paper.

References
1. Clarke, D., Mohtadi, C., Tuffs, P.: Generalized predictive control - part I. The basic algorithm. Automatica 23(2), 137–148 (1987)
2. Juang, J., Phan, M.: Identification and Control of Mechanical Systems. Cambridge University Press, Cambridge (2001)
3. Lew, J., Juang, J.-N.: Robust generalized predictive control with uncertainty quantification. J. Guid. Control Dyn. 35(3), 930–937 (2012)
4. Juang, J., Eure, K.: Predictive feedback and feedforward control for systems with unknown disturbances. NASA Technical Memorandum, 1998–208744 (1998)
5. Chen, C.-W., Huang, J.-K., Phan, M., Juang, J.-N.: Integrated system identification and modal state estimation for control of large flexible space structures. J. Guid. Control Dyn. 15(1), 88–95 (1992)
6. MacNeal, R.: The Heliogyro: an interplanetary flying machine. NASA Contractor Report CR 84460 (1967)
7. Wilkie, W., Warren, J., Thomson, M., Lisman, P., Walkemeyer, P., Guerrant, D., Lawrence, D.: The Heliogyro reloaded. JANNAF 5th Spacecraft Propulsion Subcommittee Joint Meeting, Huntsville, AL (2011)
8. Pimienta-Penalver, A.: Attitude Dynamics, Stability, and Control of a Heliogyro Solar Sail. Ph.D. Dissertation, Department of Mechanical and Aerospace Engineering, University at Buffalo (2017)
9. Ioannou, P.A., Sun, J.: Robust Adaptive Control. Dover Publications, Mineola (2012)
10. Hodges, D., Dowell, E.: Nonlinear equation of motion for the elastic bending and torsion of twisting nonuniform rotor blades. NASA Technical Note D-7818 (1974)
11. Meirovitch, R.: Hybrid state equations of motion for flexible bodies in terms of quasicoordinates. J. Guid. Control Dyn. 14, 1008–1013 (1991). https://doi.org/10.2514/3.20743
12. Markley, F.L., Crassidis, J.L.: Fundamentals of Spacecraft Attitude Determination and Control. Springer, New York (2014)
13. Wie, B.: Solar sail attitude control and dynamics, part 1. J. Guid. Control Dyn. 27(4), 526–535 (2004). https://doi.org/10.2514/1.11134
14. Guerrant, D., Lawrence, D.: Tactics for Heliogyro solar sail attitude control via blade pitching. J. Guid. Control Dyn. 1–15 (2015)
15. Blomquist, R.: Heliogyro Control. Ph.D. Dissertation, The Robotics Institute, Carnegie Mellon University (2009)
16. Wilkie, W., Warren, J., Horta, L., Lyle, K., Juang, J., Littell, J., Bryant, R., Thomson, M., Walkemeyer, P., Guerrant, D., Lawrence, D., Gibbs, S., Dowell, E., Heaton, A.: Heliogyro solar sail research at NASA. Advances in Solar Sailing, pp. 631–650. Springer (2014). https://doi.org/10.1007/978-3-642-34907-2_39

Piecewise Polynomial Taylor Expansions—The Generalization of Faà di Bruno’s Formula Tom Streubel, Caren Tischendorf, and Andreas Griewank

Abstract We present an extension of Taylor's Theorem for the piecewise polynomial expansion of non-smooth evaluation procedures involving absolute value operations. Evaluation procedures are computer programs of mathematical functions in closed form expression and allow a different treatment of smooth operations or calls to the absolute value function. The well known classical Theorem of Taylor defines polynomial approximations of sufficiently smooth functions and is widely used for the derivation and analysis of numerical integrators for systems of ordinary differential or differential-algebraic equations, for the construction of solvers for continuous non-linear optimization of finite dimensional objective functions and for root solving of non-linear systems of equations. The long term goal is the stabilization and acceleration of already known methods and the derivation of new methods by incorporating piecewise polynomial Taylor expansions. The herein provided proof of the higher order approximation quality of the new generalized expansions is constructive and allows efficiently designed algorithms for the execution and computation of the piecewise polynomial expansions. As a demonstration towards the ultimate goal we will derive a prototype of a k-step method on the basis of polynomial interpolation and the proposed generalized expansions.

T. Streubel (corresponding author), Zuse Institute Berlin, Humboldt University of Berlin, Berlin, Germany, e-mail: [email protected]
C. Tischendorf, Humboldt University of Berlin, Berlin, Germany
A. Griewank, School of Mathematical Sciences and Information Technology, Yachaytech, Ecuador
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_3

1 Introduction, Preliminaries and Notions In [6] a piecewise linear generalization of first order Taylor expansions as alternative but also related to Bouligand or B-subdifferentials has been introduced and analyzed in detail. Furthermore several efficient approaches for numerical integration, root solving and optimization have been proposed. The concept originally evolved


among many researchers from several countries with different backgrounds and from different areas within the field of applied mathematics to solve practical non-smooth problems arising from industry, finance and economy. Many subsequent papers have been published, e.g. regarding numerical integration [1, 8, 22], root solving [2, 7, 9, 18, 19, 21] and optimization [3–5, 11, 12, 14]. All of them deal with the construction of new methods, their analysis and/or promising numerical experiments. This work derives and proves the existence of higher order piecewise polynomial expansions of certain non-smooth functions. In the spirit of [6] this paper continues and extends the original concept. The algorithmic piecewise linearization coincides with the first order expansion in our sense. Thus we consider the herein presented framework as part of a new discipline called algorithmic piecewise differentiation (APD) which is part of a parent discipline the algorithmic differentiation (AD). Towards the end and as a demonstration we will also provide a prototype of a class of k-step methods for solving non-smooth differential algebraic equations based on the generalized Taylor expansion. For the purpose of introducing new notions we will firstly formulate the classical Theorem of Taylor before its generalization. To that end we introduce the so called big-O from Landau-notation. Let therefor $0 < n, m, k, \bar{d} \in \mathbb{N}$ be some positive integers, $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ and $g : D \subseteq \mathbb{R}^n \to \mathbb{R}^k$ functions, with $0 \in D$, we then say that

\[ f(x) = O(\|g(x)\|^{\bar{d}}) \]

if and only if

\[ \exists C > 0,\ \exists \varepsilon > 0,\ \forall x \in D : \|x\| < \varepsilon \implies \|f(x)\| \le C \cdot \|g(x)\|^{\bar{d}} \]

for consistent choices of p-vector norms $\|x\|_p \equiv \left( \sum_{x_i \in x} |x_i|^p \right)^{1/p}$. By consistent is meant that we choose the p-vector norm in the range of f for the same p as the one in the range of g.

Let again $0 < n, m, \bar{d} \in \mathbb{N}$ be positive integers, then $f \in C^{\bar{d},1}(D \subseteq \mathbb{R}^n, \mathbb{R}^m)$ denotes a function $f : D \subseteq \mathbb{R}^n \to \mathbb{R}^m$ which is $\bar{d}$ times Lipschitz-continuously differentiable whereas $f \in C^{\bar{d}}(D \subseteq \mathbb{R}^n, \mathbb{R}^m)$ will refer to a $\bar{d}$-times continuously differentiable function. We will also make use of multi-indices $\delta = (\delta_0, \dots, \delta_{n-1}) \in (\mathbb{N} \cup \{0\})^n$ that can be used to abbreviate expressions such as $|\delta| = \sum_{i=0}^{n-1} \delta_i$ and $\delta! = \prod_{i=0}^{n-1} \delta_i!$ as well as for $x \in \mathbb{R}^n$: $x^\delta \equiv \prod_{i=0}^{n-1} x_i^{\delta_i}$ and $\partial^\delta f(x) \equiv (\partial^{\delta_0} / \partial x_0^{\delta_0}) \dots (\partial^{\delta_{n-1}} / \partial x_{n-1}^{\delta_{n-1}}) f(x_0, \dots, x_{n-1})$.
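The multi-index abbreviations can be made concrete with a small sketch. The function names `order`, `multi_factorial`, `power` and `multi_indices` are hypothetical helpers chosen for illustration; only the quantities |δ|, δ! and x^δ themselves come from the text.

```python
from math import factorial, prod

def order(delta):
    """|delta|: the sum of all entries of the multi-index."""
    return sum(delta)

def multi_factorial(delta):
    """delta!: the product of the factorials of all entries."""
    return prod(factorial(d) for d in delta)

def power(x, delta):
    """x^delta: the product of x_i ** delta_i over all components."""
    return prod(xi ** di for xi, di in zip(x, delta))

def multi_indices(n, total):
    """All multi-indices with n non-negative entries and |delta| == total."""
    if n == 1:
        return [(total,)]
    return [(k,) + rest
            for k in range(total + 1)
            for rest in multi_indices(n - 1, total - k)]

delta = (2, 0, 1)
print(order(delta), multi_factorial(delta), power((2.0, 5.0, 3.0), delta))
print(multi_indices(2, 2))
```

Enumerating all δ with |δ| = i in this way is what a sum of mixed partial derivatives of order i ranges over.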

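Theorem 1 (Theorem of Taylor) Let $f \in C^{\bar{d},1}(D \subseteq \mathbb{R}^n, \mathbb{R})$ be a sufficiently smooth function, $\mathring{x} \in D$ a reference point and $\Delta x = x - \mathring{x} \in D - \mathring{x} \equiv \{x - \mathring{x} \mid x \in D\}$ an offset in the domain of f. Then the following recursive statement holds true

\[ \forall 1 \le d \le \bar{d} : \quad f(\mathring{x} + \Delta x) - f(\mathring{x}) = \sum_{i=1}^{d} \Delta^{(i)} f(\mathring{x}; \Delta x) + O(\|\Delta x\|^{d+1}), \tag{1} \]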

where $\Delta^{(i)} f(\mathring{x}; \Delta x) = \sum_{|\delta| = i} \frac{\partial^\delta f(\mathring{x})}{\delta!} \cdot \Delta x^\delta$ is a function in $\mathring{x}$ and $\Delta x$ evaluating the sum of all mixed partial derivatives of f of order i in direction $\Delta x$.

The Theorem of Taylor defines polynomial approximations of sufficiently smooth functions. However, many practical problems and most algorithms are not smooth everywhere. Instead, evaluations of max(a, b) = (a + b + abs(b − a))/2 and min(a, b) = (a + b − abs(b − a))/2 or of the absolute value function may be included. Also the comparison of more than 2 arguments with a max or min statement can be reformulated into a nested series of binary comparisons. With this in mind consider some piecewise smooth function $f : \mathbb{R}^n \to \mathbb{R}$ in the sense of [20] that does not necessarily satisfy the prerequisites of Theorem 1. We then call an ordered family of $\bar{d} + 1 \ge 1$ Lipschitz-continuous functions $f \equiv [f(\mathring{x}), \Delta^{(1)} f(\mathring{x}; \Delta x), \dots, \Delta^{(\bar{d})} f(\mathring{x}; \Delta x)]$ a generalized Taylor expansion of order $\bar{d}$ of f on a non-empty open set $\bar{D} \subseteq D$ if f satisfies statement (1) for any $\mathring{x} \in \bar{D}$ and any $\Delta x \in \bar{D} - \mathring{x}$ despite the prerequisites of Theorem 1. The ith element $\Delta^{(i)} f(\mathring{x}; \Delta x)$ of f is called increment of order i. Of course increments cannot be defined via partial derivatives of f in general, since they might not exist. For the sake of convenience we define $\Delta^{(0)} f(\mathring{x}; \Delta x) \equiv f(\mathring{x})$ as point evaluation and we will use the abbreviation $\Delta^{(i)} f \equiv \Delta^{(i)} f(\mathring{x}; \Delta x)$ when the arguments are clear within the context. We will give algorithmic schemes for the evaluation of generalized Taylor expansions of piecewise smooth computer programs and prove the claimed approximation quality (1) for them. We will refer to the computer programs as composite piecewise smooth evaluation procedures or short evaluation procedures.

The next sections are organized as follows. In Sect. 2 the concept of composite piecewise smooth evaluation procedures will be introduced and a propagation scheme for the algorithmic evaluation of their generalized Taylor expansions will be provided. In the subsequent Sect. 3 the connection of the generalized Taylor expansion and piecewise polynomials or splines will be investigated. Afterwards in Sect. 4 the formula of Faà di Bruno will be proven in the context of this generalization. In Sect. 5 a prototype integrator for differential-algebraic equations in semi-explicit form will be derived on the basis of the proposed generalized Taylor expansions and a numerical example will be provided. The derived method is closely related to linear k-step methods such as Adams-Moulton (sometimes referred to as implicit Adams) or BDF methods (see e.g. [13]).
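The reformulation of max and min through the absolute value function, and of a comparison of more than two arguments as a nested series of binary comparisons, can be sketched as follows. The function names `max_abs`, `min_abs`, `max_many` and `min_many` are hypothetical and serve illustration only.

```python
from functools import reduce

def max_abs(a, b):
    """max(a, b) written with a single call to abs."""
    return (a + b + abs(b - a)) / 2

def min_abs(a, b):
    """min(a, b) written with a single call to abs."""
    return (a + b - abs(b - a)) / 2

def max_many(values):
    """Maximum of several arguments as nested binary comparisons."""
    return reduce(max_abs, values)

def min_many(values):
    """Minimum of several arguments as nested binary comparisons."""
    return reduce(min_abs, values)

print(max_abs(3, -2), min_abs(3, -2))
print(max_many([1, 7, -4, 5]), min_many([1, 7, -4, 5]))
```

Written this way, abs remains the only non-smooth operation that an evaluation procedure has to handle.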


2 Propagation Scheme of Expansions for Non-smooth Evaluation Procedures In this section an algorithm for the point evaluation of the generalized Taylor expansions will be provided. An evaluation procedure is a finite composition of so called unary $\varphi : D_\varphi \subseteq \mathbb{R} \to \mathbb{R}$ and binary $\psi : D_\psi \subseteq \mathbb{R}^2 \to \mathbb{R}$ elementary operations, which are aggregated as a library $\Phi_{abs} = \Phi \cup \{\mathrm{abs}\}$ in their symbolic form and thus make up the atomic constituents of complex and possibly vector-valued functions. With the absolute value function as the only exception, every other elementary operation has to be at least $\bar{d}$-times Lipschitz-continuously differentiable. This assumption is called elementary differentiability (ED). This means any evaluation procedure consisting solely of operations from $\Phi$, which excludes the absolute value function, inherits their order $\bar{d}$ differentiability by the chain rule. In our framework any unary operation complying to (ED) can be added to $\Phi$ by the user. But for the time being we want to restrict the selection of binary operations to $\{+, -, \cdot, /\}$, that is sum, difference, product and division. The dependencies among operations within some evaluation procedure define a partial ordering which is called data dependence relation. This relation corresponds to a directed acyclic graph or evaluation graph of the procedure.

Example 1 (evaluation procedure, elementary code instruction, evaluation graph) Consider the following example evaluation procedure

\[ f_{ex}(x_0, x_1, x_2) = x_2 + x_1 + |\sin(x_1 + |3 \cdot x_0|)| \]

Example 1.a: closed form expression

which maps $\mathbb{R}^3$ to $\mathbb{R}$. The acyclic directed evaluation graph is depicted in figure Example 1.c and it is not an expression tree due to the 2 arcs leaving node $x_1$. Each node defines a subgraph by the closure of its dependencies and thus can be interpreted as partial evaluation function of $f_{ex}$ up to this node within the full graph.

\[
\begin{aligned}
v_0 &\equiv x_2 \\
v_1 &\equiv x_1 \\
v_2 &\equiv v_0 + v_1 \\
v_3 &\equiv x_0 \\
z_0 = v_4 &\equiv 3 \cdot v_3 \\
v_5 &\equiv \mathrm{abs}(z_0) \\
v_6 &\equiv v_1 + v_5 \\
z_1 = v_7 &\equiv \sin(v_6) \\
v_8 &\equiv \mathrm{abs}(v_7) \\
v_9 &\equiv v_2 + v_8 \\
f_{ex}(x_0, x_1, x_2) &\equiv v_9
\end{aligned}
\]

Example 1.b: elementary instructions
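The correspondence between the closed form of Example 1 and its list of elementary instructions can be checked with a short sketch. The function names are hypothetical; the instruction order and the variable names v0 to v9, z0 and z1 follow Example 1.b.

```python
import math

def f_ex_closed(x0, x1, x2):
    """Closed form expression of Example 1.a."""
    return x2 + x1 + abs(math.sin(x1 + abs(3 * x0)))

def f_ex_instructions(x0, x1, x2):
    """Elementary instructions of Example 1.b, one assignment per node."""
    v0 = x2
    v1 = x1
    v2 = v0 + v1
    v3 = x0
    v4 = 3 * v3          # z0: argument of the first abs call
    v5 = abs(v4)
    v6 = v1 + v5
    v7 = math.sin(v6)    # z1: argument of the second abs call
    v8 = abs(v7)
    v9 = v2 + v8
    return v9, (v4, v7)  # function value and the two abs arguments

value, switching = f_ex_instructions(-0.5, 2.0, 1.0)
print(value, f_ex_closed(-0.5, 2.0, 1.0))
print(switching)
```

Returning the arguments of the two abs calls alongside the value makes visible which intermediate variables carry the non-smoothness.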


Furthermore the closed form expression corresponds uniquely to an ordered list of elementary instructions when read and interpreted strictly from left to right. The total ordering of the list matches one out of the possible extensions of the partial ordering among the dependencies. The left-hand-sides within the elementary code instructions are intermediate variables of f ex . Those of them which are arguments of absolute value operations deserve special attention. They will be referred to as switching variables throughout the text. So any evaluation procedure within our scope can be expressed as ordered list of elementary instructions and a data dependence relation denoted by ≺ and thus any intermediate variable vi for some integer i ∈ N matches either one of the cases described by Table 1. Variable initializations vi ≡ x j can be interpreted as identity operation vi = id(x j ) applied to some scalar-component x j of the input variable vector to fit into the scheme of Table 1. We will use the notion f ∈ span(Φabs ) whenever a function f : D ⊆ Rn → Rm has at least one symbolic representation or closed form expression as evaluation procedure of elementary operations from Φabs . Point evaluations of evaluation procedures can be carried out one by one going through the elementary instructions w.r.t. the ordering by the dependence relation. Likewise certain properties of the evaluation procedure can be proven by applying corresponding versions of the chain rule repeatedly in an induction like process. These processes are called propagation. For instance, the Lipschitz-continuity of all elementary operations and repetitive application of the chain rule implies the Lipschitz-continuity of all evaluation procedures f ∈ span(Φabs ) by propagation. 
Table 1 Generic table for elementary instructions and $\Phi_1 \subseteq \Phi$ being the sub-library of all unary operations $\varphi : D_\varphi \subseteq \mathbb{R} \to \mathbb{R}$ that comply to assumption (ED)

With the concept of propagation at hand we can implicitly define more complex processes and algorithms by imposing propagation rules to all occurring cases


of Table 1. For the application of operator overloading we will extend the interpretation of intermediate variables. From now on they consist of an ordered list $[\Delta^{(0)} v_i, \Delta^{(1)} v_i, \dots, \Delta^{(\bar{d})} v_i] \in \mathbb{R}^{\bar{d}+1}$ of increments representing the point evaluation of the generalized Taylor expansion of the partial evaluation function up to $v_i$ within a corresponding evaluation procedure to expand. The first entry of that list also is an intermediate value $\Delta^{(0)} v_i = v_i(\mathring{x}_0, \mathring{x}_1, \dots, \mathring{x}_{n-1}) \in \mathbb{R}$ of the actual primal evaluation. Thus we will also use an alias $\mathring{v}_i = \Delta^{(0)} v_i$ because of its special role. The propagation rules for the primal evaluation and for the expansions are defined as follows

Variable and constant initialization
Let $\gamma, \mathring{x}, \Delta x \in \mathbb{R}$ and consider $v = x$ as well as $\tilde{v} = \gamma$, then the propagation rules are defined as follows

\[ v(\mathring{x}) = \mathring{x}, \quad \Delta^{(1)} v = \Delta x \quad \text{and} \quad \forall 2 \le d \le \bar{d} : \Delta^{(d)} v = 0, \]
\[ \tilde{v}(\mathring{x}) = \gamma, \quad \forall 1 \le d \le \bar{d} : \Delta^{(d)} \tilde{v} = 0. \]

Sum, difference and linearity
Let $\alpha, \beta \in \mathbb{R}$, $\mathring{x}, \Delta x \in \mathbb{R}^n$ and u, w two intermediate variables. Consider $v = \alpha \cdot u \pm \beta \cdot w$, then $v(\mathring{x}) = \alpha \cdot \mathring{u} \pm \beta \cdot \mathring{w}$ and the increment propagation is

\[ \forall 1 \le d \le \bar{d} : \quad \Delta^{(d)} v = \alpha \cdot \Delta^{(d)} u \pm \beta \cdot \Delta^{(d)} w. \]

Product
Let u, w be some intermediate variables, $\mathring{x}, \Delta x \in \mathbb{R}^n$ and consider $v = u \cdot w$, then the primal evaluation is $v(\mathring{x}) = \mathring{u} \cdot \mathring{w}$ and the increment operation is given by the Leibniz-formula in incremental notion

\[ \forall 1 \le d \le \bar{d} : \quad \Delta^{(d)} v = \sum_{i=0}^{d} \Delta^{(i)} u \cdot \Delta^{(d-i)} w \]

Division
Let u, w be some intermediate variables, $\mathring{x}, \Delta x \in \mathbb{R}^n$ and consider $v = \frac{u}{w}$, then the primal evaluation is $v(\mathring{x}) = \frac{\mathring{u}}{\mathring{w}}$ and the increment operation reads as follows

\[ \forall 1 \le d \le \bar{d} : \quad \Delta^{(d)} v = \frac{1}{\mathring{w}} \left( \Delta^{(d)} u - \sum_{i=0}^{d-1} \Delta^{(i)} v \cdot \Delta^{(d-i)} w \right) \]

Smooth unary operation
Let u be some intermediate variable, $\mathring{x}, \Delta x \in \mathbb{R}^n$ and consider $v = \varphi(u)$ for some elementary operation $\varphi \in \Phi_1$ complying with assumption (ED). The primal evaluation is executed by the evaluation of the symbolic instruction $v(\mathring{x}) = \varphi(\mathring{u})$.


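The formula of Faà di Bruno can be utilized for the formal definition of the increment propagation

\[ \forall 1 \le d \le \bar{d} : \quad \Delta^{(d)} v = \sum_{(k_1, \dots, k_d) \in T_d} \varphi^{(k_1 + \dots + k_d)}(\mathring{u}) \cdot \prod_{i=1}^{d} \frac{1}{k_i!} \cdot (\Delta^{(i)} u)^{k_i}, \tag{2} \]

where $(k_1, \dots, k_d) \in T_d \iff o(k_1, \dots, k_d) \equiv \sum_{i=1}^{d} i \cdot k_i = d$ and $\varphi^{(k_1 + \dots + k_d)}$ is the $(k_1 + \dots + k_d)$th derivative of $\varphi$.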
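The product rule, the division rule and formula (2) can be sketched on plain Python lists of increments. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mul`, `div`, `index_set_T` and `unary` are hypothetical, an intermediate variable is assumed to be stored as the list [Δ^(0)v, ..., Δ^(d̄)v], and the caller is assumed to supply the derivatives of ϕ at the reference value.

```python
from math import exp, factorial

def mul(u, w):
    """Leibniz rule: increment d of u*w is the sum of u[i] * w[d-i]."""
    return [sum(u[i] * w[d - i] for i in range(d + 1)) for d in range(len(u))]

def div(u, w):
    """Increments of v = u/w, obtained recursively from u = v*w."""
    v = [u[0] / w[0]]
    for d in range(1, len(u)):
        v.append((u[d] - sum(v[i] * w[d - i] for i in range(d))) / w[0])
    return v

def index_set_T(d):
    """All (k_1, ..., k_d) of non-negative integers with sum of i*k_i equal to d."""
    def rec(i, remaining):
        if i > d:
            if remaining == 0:
                yield ()
            return
        for k in range(remaining // i + 1):
            for rest in rec(i + 1, remaining - i * k):
                yield (k,) + rest
    return list(rec(1, d))

def unary(derivs, u):
    """Formula (2); derivs[k] must hold the k-th derivative of phi at u[0]."""
    v = [derivs[0]]
    for d in range(1, len(u)):
        total = 0.0
        for ks in index_set_T(d):
            term = derivs[sum(ks)]
            for i, k in enumerate(ks, start=1):
                term *= u[i] ** k / factorial(k)
            total += term
        v.append(total)
    return v

# Hypothetical numbers: x = 1 + dx with dx = 0.1, expanded up to order 3.
x = [1.0, 0.1, 0.0, 0.0]
square = mul(x, x)                        # increments of x^2
g = unary([exp(square[0])] * 4, square)   # increments of exp(x^2)
print(sum(square), sum(g), exp(1.21))
```

Enumerating T_d directly is meant only to mirror formula (2); its size grows like the number of integer partitions of d, which is one reason specialized recurrences are preferred in practice.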

But note that for the propagation of frequently used functions (such as sin, cos, exp, log, . . . and other smooth operations from the standard cmath-library of C++) specialized formulas have been developed for the smooth algorithmic propagation of Taylor polynomial expansions. These formulas can be found e.g. in [10] or in [17] and are more efficient w.r.t. run-time and memory. Furthermore they are equivalent to the application of (2), but still need to be slightly adjusted in accordance to the incremental notion we use here. So far the propagation rules fully comply with standard Taylor arithmetic. Thus the generalized Taylor expansion becomes a polynomial Taylor expansion in the sense of Theorem 1 for evaluation procedures $f \in \mathrm{span}(\Phi) = \mathrm{span}(\Phi_{abs} \setminus \{\mathrm{abs}\})$ that fully comply with assumption (ED) in that all elementary instructions of f do. But now the absolute value operation is finally the last addition to our listing and the seed for non-smoothness in evaluation procedures.

Absolute value
Let u be some intermediate variable and consider $v = \mathrm{abs}(u)$, then $v(\mathring{x}) = |\mathring{u}|$ and the propagation rule is defined recursively

\[ \forall 1 \le d \le \bar{d} : \quad \sum_{i=0}^{d} \Delta^{(i)} v = \left| \sum_{i=0}^{d} \Delta^{(i)} u \right| \implies \Delta^{(d)} v \equiv \left| \sum_{i=0}^{d} \Delta^{(i)} u \right| - \sum_{i=0}^{d-1} \Delta^{(i)} v. \]
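The recursive absolute value rule can be sketched in a few lines. The function name `abs_increments` and the sample increments are hypothetical; the list layout [Δ^(0)u, ..., Δ^(d̄)u] is an assumption carried over from the bracket notation of the text.

```python
def abs_increments(u):
    """Increments of v = abs(u): each partial sum of v equals |partial sum of u|."""
    v = [abs(u[0])]
    partial_u = u[0]   # running sum of the increments of u
    partial_v = v[0]   # running sum of the increments of v
    for d in range(1, len(u)):
        partial_u += u[d]
        increment = abs(partial_u) - partial_v
        v.append(increment)
        partial_v += increment
    return v

# Hypothetical increments whose partial sums change sign between order 0 and 1.
u = [-0.2, 0.5, 0.1]
v = abs_increments(u)
print(v)
print(sum(v), abs(sum(u)))
```

Because every partial sum of the v-increments equals the absolute value of the matching partial sum of the u-increments, the kink is carried along with the truncated expansion of u.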

Having established all propagation rules we want to introduce a shorthand for the generalized Taylor expansion. Let $f \in \mathrm{span}(\Phi_{abs})$ then $T_f^{(d)}[\mathring{x}](x)$ is defined as

\[ T_f^{(d)}[\mathring{x}](x) \equiv f(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} f(\mathring{x}; x - \mathring{x}) \]

that is the generalized Taylor expansion up to order d at a reference point $\mathring{x}$. We want to conclude this section with an example expansion of some non-smooth evaluation procedure.

Example 2 Consider a non-linear, non-smooth evaluation procedure $f \in \mathrm{span}(\Phi_{abs})$, with $f(x_0, x_1) = |\exp(x_0) - |x_1||$. There are 3 sets of non-differentiabilities $M_0 = \{(x, \exp(x)) \mid x \in \mathbb{R}\}$, $M_1 = \{(x, -\exp(x)) \mid x \in \mathbb{R}\}$ and $M_2 = \{(x, 0) \mid x \in \mathbb{R}\}$.


Fig. 1 Plots of f and its expansions $T_f^{(i)}[\mathring{x}]$ at a reference point $\mathring{x} = (0, 1)$ from order 1 to order 5

Plots of the function f and of its generalized expansions $T_f^{(i)}[\mathring{x}]$ at $\mathring{x} = (0, 1)$ as reference point ranging from order 1 (or piecewise linearization) up to order 5 can be found in Fig. 1. The solid and dashed lines in Fig. 1 in the $x_0$-$x_1$-planes are Taylor polynomial expansions of $x_1 = \pm \exp(x_0)$ and $x_1 = 0$. The manifolds of non-differentiabilities of the generalized expansions coincide with the polynomial expansions.
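The expansions of Example 2 can be reproduced numerically with a minimal sketch. The helper names and the evaluation point (0.2, 0.9) are assumptions for illustration; the propagation rules follow Sect. 2, with the exp rule specialized to an input variable, for which formula (2) reduces to exp(x̊0) · Δx0^d / d!.

```python
from math import exp, factorial

def var_inc(x_ref, dx, d_bar):
    """Increments of an input variable: [x_ref, dx, 0, ..., 0]."""
    return [x_ref, dx] + [0.0] * (d_bar - 1)

def exp_inc(x_ref, dx, d_bar):
    """Increments of exp applied directly to an input variable."""
    return [exp(x_ref) * dx ** d / factorial(d) for d in range(d_bar + 1)]

def sub_inc(u, w):
    """Difference rule: increments are subtracted order by order."""
    return [a - b for a, b in zip(u, w)]

def abs_inc(u):
    """Absolute value rule: partial sums of v equal |partial sums of u|."""
    v, partial_u, partial_v = [abs(u[0])], u[0], abs(u[0])
    for d in range(1, len(u)):
        partial_u += u[d]
        v.append(abs(partial_u) - partial_v)
        partial_v += v[-1]
    return v

def f(x0, x1):
    return abs(exp(x0) - abs(x1))

def expansion(ref, x, d):
    """Generalized Taylor expansion of f of order d at ref, evaluated at x."""
    inner = abs_inc(var_inc(ref[1], x[1] - ref[1], d))
    outer = abs_inc(sub_inc(exp_inc(ref[0], x[0] - ref[0], d), inner))
    return sum(outer)

ref, x = (0.0, 1.0), (0.2, 0.9)
errors = [abs(f(*x) - expansion(ref, x, d)) for d in range(1, 6)]
print(errors)
```

In this sketch the error shrinks with every additional order, and evaluating at (0.2, −0.9) instead, on the other side of the kink x1 = 0, gives the same expansion values because the inner absolute value is reproduced exactly.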

3 Connections to Splines In this section we will investigate some properties of the generalized Taylor expansions and their connection to piecewise polynomials. Consider an ordered tuple $x_0 < x_1 < \dots < x_\eta$ of real valued numbers. We then call $S_p^d(x_0, x_1, \dots, x_\eta) \subseteq C^p([x_0, x_\eta], \mathbb{R})$ the space of splines. Its elements are functions that are identical to polynomials of order smaller or equal to $d > 0$ on all sub intervals $[x_i, x_{i+1}]$ for $i \in \{0, 1, \dots, \eta - 1\}$.


We can firstly observe $S_p^d(x_0, x_1, \dots, x_\eta) \subseteq S_0^d(x_0, x_1, \dots, x_\eta)$ and conclude that the latter is spanned linearly by functions $b_{i,j}(x) = \max(0, x - x_i)^j$, for any pair of indices $i \in \{0, 1, \dots, \eta\}$ and $j \in \{0, 1, \dots, d\}$.

Lemma 1 The generalized Taylor expansion $T_s^d[\mathring{x}]$ of $s \in S_p^d(x_0, x_1, \dots, x_\eta)$ of order d satisfies $s(x) - T_s^d[\mathring{x}](x) = 0$ for any $\mathring{x}, x \in [x_0, x_\eta]$.

Proof It is sufficient to prove $b_{i,j}(x) - T_{b_{i,j}}^d[\mathring{x}](x) = 0$ for any $\mathring{x}, x \in [x_0, x_\eta]$ as well as any $i \in \{0, 1, \dots, \eta\}$ and $j \in \{0, 1, \dots, d\}$, since s is a linear combination of all the $b_{i,j}$. As a first step we prove that piecewise linearization or higher order expansion of $\varphi(x) = \max(0, x - a) = (x - a + |x - a|)/2$ is identical to $\varphi$ for any $\mathring{x} \in \mathbb{R}$ as reference point. Let $\bar{d} > 0$ and we will make use of the bracket-tuple notation of increments for the generalized Taylor expansion and apply the propagation rules as established in Sect. 2. We then find expansions for the nodes of the evaluation graph of $\varphi$

\[
\begin{aligned}
x &: [\mathring{x}, \Delta x, 0, \dots, 0], \\
x - a &: [\mathring{x} - a, \Delta x, 0, \dots, 0], \\
|x - a| &: [|\mathring{x} - a|, |\mathring{x} + \Delta x - a| - |\mathring{x} - a|, 0, \dots, 0], \\
\varphi(x) &: [\varphi(\mathring{x}), \tfrac{1}{2}(\Delta x + |\mathring{x} + \Delta x - a| - |\mathring{x} - a|), 0, \dots, 0].
\end{aligned}
\]

The generalized Taylor expansion of $\varphi$ is the sum of its increments

\[
\begin{aligned}
T_\varphi^d[\mathring{x}](x) &= \varphi(\mathring{x}) + \tfrac{1}{2}(\Delta x + |\mathring{x} + \Delta x - a| - |\mathring{x} - a|) + 0 + \dots + 0 \\
&= \tfrac{1}{2}(\mathring{x} - a + |\mathring{x} - a|) + \tfrac{1}{2}(\Delta x + |\mathring{x} + \Delta x - a| - |\mathring{x} - a|) \\
&= \tfrac{1}{2}(\mathring{x} - a + \Delta x + |\mathring{x} + \Delta x - a|) = \max(0, \mathring{x} + \Delta x - a).
\end{aligned}
\]

From the identity $x = \mathring{x} + \Delta x$ we can finally conclude $\varphi(x) = T_\varphi^d[\mathring{x}](x)$. All functions $b_{i,j}$ are of the form $(\varphi(x))^m$, for some $m \le d$. Furthermore we already know $T_{x^m}^d[\mathring{x}](x) = x^m$ since the power function is smooth and complying to assumption (ED). Applying the Formula of Faà di Bruno is equivalent to simply chaining both expansions $T_{\varphi^m}^d[\mathring{x}](x) = T_{x^m}^d[\varphi(\mathring{x})](T_\varphi^d[\mathring{x}](x)) = (\max(0, x - a))^m$. This is the case because the expansion of $\varphi$ has no non-zero increment of order 2 or higher. □

Beyond Lemma 1 we can conclude from the propagation rules of the generalized Taylor expansion that all non-linear operations during the expansion process are restricted to evaluations related to the reference point of expansion. Evaluating a generalized Taylor expansion finally only involves linear operations, products, powers and calls to the absolute value functions. From this we can conclude that such expansions are piecewise polynomials once their propagation is completed. Thus it is justified to consider the expansion process itself as implicit spline or more precisely implicit piecewise polynomial generation. With implicit we refer to the fact that the positions of restricted differentiabilities of the expansions are unknown right after

72

T. Streubel et al.

their creation. In contrast to this splines are composed of a base that has been constructed from these positions. It is explicitly known where a spline e.g. in B-Spline basis representation will have its transitions from one polynomial to another.
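The statement of Lemma 1 can be checked numerically for the kink function. The following is a minimal Python sketch, not the authors' implementation: the names `phi`, `expand_phi` and `max_deviation`, the kink position a = 0.3 and the test grid are illustrative assumptions. It builds the bracket tuple of φ exactly as in the proof and compares powers of the summed expansion with powers of φ.

```python
def phi(x, a):
    """Kink function phi(x) = max(0, x - a) = (x - a + |x - a|) / 2."""
    return max(0.0, x - a)


def expand_phi(x_ref, dx, a, order):
    """Bracket tuple [phi(x_ref), first increment, 0, ..., 0] as in the proof of Lemma 1."""
    first = 0.5 * (dx + abs(x_ref + dx - a) - abs(x_ref - a))
    return [phi(x_ref, a), first] + [0.0] * (order - 1)


def max_deviation(a=0.3, order=3):
    """Largest gap between phi(x)**m and the chained expansion over a small grid."""
    grid = [i / 4.0 - 1.0 for i in range(9)]  # -1.0, -0.75, ..., 1.0
    worst = 0.0
    for x_ref in grid:          # reference points
        for x in grid:          # evaluation points
            t = sum(expand_phi(x_ref, x - x_ref, a, order))  # T_phi[x_ref](x)
            for m in range(order + 1):
                worst = max(worst, abs(phi(x, a) ** m - t ** m))
    return worst


print(max_deviation())
```

The printed deviation should be at the level of floating-point round-off, independent of which side of the kink the reference point lies on.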

4 The Generalization of Taylor's Theorem

In this section we will prove that the propagation rules provided in Sect. 2 generate an approximation in the sense of Eq. (1). Due to the concept of propagation introduced in the same section we only have to give a proof for each elementary instruction of Table 1. However, most of the binary operations can be represented in terms of unary operations. The propagation by linear combinations $v = \alpha \cdot u \pm \beta \cdot w$, for intermediate variables u, w with generalized Taylor expansions (induction hypothesis) and $\alpha, \beta \in \mathbb{R}$, combines the sum $v = u + w$, the difference $v = u - w$ and the unary scalar multiplication $v = \alpha \cdot u$ into one rule. The scalar multiplication, along with the variable and constant initialization, can be proven in a straightforward manner. The difference can be expressed as a sum followed by a scalar multiplication by −1, namely $v = u - w = u + ((-1) \cdot w)$. For the sum $v = u + w$ we can deduce from the generalized expansions of u and w that, for all $1 \le d \le \bar{d}$,

\[
\begin{aligned}
v(\mathring{x} + \Delta x) - \mathring{v} &= u(\mathring{x} + \Delta x) - \mathring{u} + w(\mathring{x} + \Delta x) - \mathring{w} \\
&= \sum_{i=1}^{d} \Delta^{(i)} u + O(\|\Delta x\|^{d+1}) + \sum_{i=1}^{d} \Delta^{(i)} w + O(\|\Delta x\|^{d+1}) \\
&= \sum_{i=1}^{d} \Delta^{(i)} v + O(\|\Delta x\|^{d+1}), \quad \text{where } \Delta^{(i)} v = \Delta^{(i)} u + \Delta^{(i)} w .
\end{aligned}
\]

The division can be expressed in terms of a multiplication and a univariate inversion $\operatorname{inv}(w) = w^{-1}$ as $v = u / w = u \cdot \operatorname{inv}(w)$. The multiplication can be represented by the so-called Apollonius identity

\[
v = u \cdot w = \tfrac{1}{4}\left[(u + w)^2 - (u - w)^2\right],
\]

and with that every binary operation has been covered. For the absolute value we will make use of the definition of the big-O and hence we can deduce

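\[
\forall 1 \le d \le \bar{d}: \quad u(\mathring{x} + \Delta x) - u(\mathring{x}) - \sum_{i=1}^{d} \Delta^{(i)} u = O(\|\Delta x\|^{d+1})
\]
\[
\iff \quad \forall 1 \le d \le \bar{d}: \quad \Bigl| u(\mathring{x} + \Delta x) - u(\mathring{x}) - \sum_{i=1}^{d} \Delta^{(i)} u \Bigr| \le C \cdot \|\Delta x\|^{d+1} .
\]

A single application of the inverse triangle inequality provides

\[
\forall 1 \le d \le \bar{d}: \quad \Bigl|\, |u(\mathring{x} + \Delta x)| - \Bigl| u(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} u \Bigr| \,\Bigr| \le C \cdot \|\Delta x\|^{d+1}
\]
\[
\iff \quad \forall 1 \le d \le \bar{d}: \quad |u(\mathring{x} + \Delta x)| - \Bigl| u(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} u \Bigr| = O(\|\Delta x\|^{d+1}) .
\]

From the propagation rules for the absolute value function we know $v(\mathring{x} + \Delta x) = |u(\mathring{x} + \Delta x)|$ as well as $v(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} v = \bigl| u(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} u \bigr|$, and thus we have

\[
\forall 1 \le d \le \bar{d}: \quad v(\mathring{x} + \Delta x) = v(\mathring{x}) + \sum_{i=1}^{d} \Delta^{(i)} v + O(\|\Delta x\|^{d+1}) .
\]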
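The rules for linear combinations and for the absolute value can be illustrated on numerical bracket tuples. The sketch below is a hedged Python illustration with hypothetical helper names (`linear_combination`, `abs_rule`). It assumes that the absolute-value rule of Sect. 2 assigns to increment i of v the change of the absolute value of the partial sum of u caused by adding increment i of u; this is the only choice for which the partial-sum identity quoted above holds for every order d.

```python
def linear_combination(alpha, u, beta, w):
    """Bracket tuple of v = alpha * u + beta * w, formed increment by increment."""
    return [alpha * ui + beta * wi for ui, wi in zip(u, w)]


def abs_rule(u):
    """Bracket tuple of v = |u| for u = [u0, d1, ..., dd].

    Assumed rule: partial sums of v equal the absolute values of the
    partial sums of u, so each increment of v is a telescoping difference.
    """
    v = [abs(u[0])]
    partial = u[0]
    for du in u[1:]:
        v.append(abs(partial + du) - abs(partial))
        partial += du
    return v


# Example: u(x) = x**2 - 1 expanded at x_ref = 0.9 with dx = 0.3.
x_ref, dx = 0.9, 0.3
u = [x_ref ** 2 - 1.0, 2.0 * x_ref * dx, dx ** 2]   # exact for a quadratic
v = abs_rule(u)                                      # expansion of |u|
relu = linear_combination(0.5, u, 0.5, v)            # max(0, u) = (u + |u|) / 2
print(sum(v), sum(relu))
```

In this example u changes sign between the reference point and the evaluation point, and the summed tuple of |u| still reproduces |u(1.2)| because the expansion of the quadratic u is exact.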

Finally unary operations are left. Corresponding to the propagation rules defined in Sect. 2, it is sufficient to prove that Faà di Bruno's formula still applies in the context of piecewise polynomial expansions. To that end the following identity will be useful; it is obtained by applying the multinomial expansion twice. Given values $a_1, a_2, \dots, a_d, b \in \mathbb{R}$ and $A \equiv \sum_{k=1}^{d} a_k$, then

\[
\begin{aligned}
\Bigl( \sum_{k=1}^{d} a_k + b \Bigr)^{i} = (A + b)^{i}
&= \sum_{l=0}^{i} \binom{i}{l} A^{l} b^{i-l}
 = \sum_{l=0}^{i} \binom{i}{l} \Bigl( \sum_{k=1}^{d} a_k \Bigr)^{l} b^{i-l} \\
&= \sum_{l=0}^{i} \frac{i!}{l!} \Biggl[ \sum_{k_1 + \dots + k_d = l} \binom{l}{k_1, \dots, k_d} \prod_{j=1}^{d} a_j^{k_j} \Biggr] \frac{b^{i-l}}{(i-l)!} \\
&= i! \sum_{l=0}^{i} \Biggl( \sum_{k_1 + \dots + k_d = l} \prod_{j=1}^{d} \frac{a_j^{k_j}}{k_j!} \Biggr) \frac{b^{i-l}}{(i-l)!} . \qquad (3)
\end{aligned}
\]
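Identity (3) is easy to confirm numerically for small d and i. The following Python sketch evaluates its right-hand side by brute-force enumeration of the index tuples; the helper names `compositions` and `rhs_identity3` and the sample values are illustrative assumptions.

```python
from itertools import product
from math import factorial, prod


def compositions(l, d):
    """All tuples (k_1, ..., k_d) of non-negative integers with k_1 + ... + k_d = l."""
    return [k for k in product(range(l + 1), repeat=d) if sum(k) == l]


def rhs_identity3(a, b, i):
    """Right-hand side of identity (3) for the values a = (a_1, ..., a_d) and b."""
    d = len(a)
    total = 0.0
    for l in range(i + 1):
        inner = sum(
            prod(a[j] ** k[j] / factorial(k[j]) for j in range(d))
            for k in compositions(l, d)
        )
        total += inner * b ** (i - l) / factorial(i - l)
    return factorial(i) * total


a, b, i = [0.5, -1.2, 2.0], 0.7, 4
print((sum(a) + b) ** i, rhs_identity3(a, b, i))
```

Both printed numbers should agree up to round-off; the enumeration is exponential in d and only meant for checking small cases.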

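Theorem 2 (formula of Faà di Bruno) Let u be some intermediate variable with a generalized Taylor expansion at $\mathring{x} \in \mathbb{R}^n$, i.e. there is an ordered list of $\bar{d} + 1$ Lipschitz-continuous functions

\[
u : [\mathring{u} = \Delta^{(0)} u(\mathring{x}; \Delta x),\ \Delta^{(1)} u(\mathring{x}; \Delta x),\ \dots,\ \Delta^{(\bar{d})} u(\mathring{x}; \Delta x)]
\]

that satisfies the following statement (which is identical to Eq. (1)):

\[
\forall 1 \le d \le \bar{d}: \quad u(\mathring{x} + \Delta x) - \mathring{u} = \sum_{i=1}^{d} \Delta^{(i)} u(\mathring{x}; \Delta x) + O(\|\Delta x\|^{d+1}) . \qquad (4)
\]

Furthermore let $\varphi \in \Phi_1$ be some unary function $\varphi : D \subseteq \mathbb{R} \to \mathbb{R}$ which complies with assumption (ED), i.e. it is $\bar{d}$ times Lipschitz-continuously differentiable. Note that $\varphi$ has polynomial Taylor expansions

\[
\forall 1 \le d \le \bar{d}: \quad \varphi(\mu) = \varphi(\mathring{\mu}) + \sum_{i=1}^{d} \frac{\varphi^{(i)}(\mathring{\mu})}{i!} (\mu - \mathring{\mu})^{i} + O(|\mu - \mathring{\mu}|^{d+1}) . \qquad (5)
\]

Then the composition $v = \varphi(u)$ has a generalized Taylor expansion, i.e. an ordered list of Lipschitz-continuous functions generated by the formula of Faà di Bruno

\[
\forall 1 \le d \le \bar{d}: \quad \Delta^{(d)} v(\mathring{x}; \Delta x) = \sum_{(k_1, \dots, k_d) \in T_d} \varphi^{(k_1 + \dots + k_d)}(\mathring{u}) \cdot \prod_{i=1}^{d} \frac{1}{k_i!} \bigl(\Delta^{(i)} u\bigr)^{k_i},
\]

such that the same statement holds true for v:

\[
\forall 1 \le d \le \bar{d}: \quad v(\mathring{x} + \Delta x) - v(\mathring{x}) = \sum_{i=1}^{d} \Delta^{(i)} v(\mathring{x}; \Delta x) + O(\|\Delta x\|^{d+1}) . \qquad (6)
\]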

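Proof This proof is similar to a proof of the polynomial version of Faà di Bruno's formula, which can be found as Theorem 2.3 in [16]. By choosing $\mu \equiv u(\mathring{x} + \Delta x)$ and $\mathring{\mu} = \mathring{u}$, the generalized Taylor expansion of u (Eq. (4)) can be substituted into the Taylor polynomial expansion of $\varphi$ (Eq. (5)). Thus for all $d \le \bar{d}$ one has

\[
\varphi(u(\mathring{x} + \Delta x)) - \varphi(\mathring{u}) = \sum_{i=1}^{d} \frac{\varphi^{(i)}(\mathring{u})}{i!} \Bigl( \sum_{k=1}^{d} \Delta^{(k)} u + O(\|\Delta x\|^{d+1}) \Bigr)^{i} + O(|u(\mathring{x} + \Delta x) - \mathring{u}|^{d+1}) \qquad (7)
\]

and, applying identity (3) to Eq. (7), the right-hand side of the latter expands to

\[
\sum_{i=1}^{d} \varphi^{(i)}(\mathring{u}) \Biggl[ \sum_{l=0}^{i} \Biggl( \sum_{k_1 + \dots + k_d = l} \prod_{j=1}^{d} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} \Biggr) \frac{O(\|\Delta x\|^{d+1})^{i-l}}{(i-l)!} \Biggr] + O(\|\Delta x\|^{d+1}) . \qquad (8)
\]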


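For the term within the brackets in Eq. (8) we can deduce

\[
\forall l < i: \quad \Biggl( \sum_{k_1 + \dots + k_d = l} \prod_{j=1}^{d} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} \Biggr) \frac{O(\|\Delta x\|^{d+1})^{i-l}}{(i-l)!} = O(\|\Delta x\|^{d+1}) \qquad (9)
\]

since the factors in the parentheses are of O(1) or smaller. By taking a closer look at the summands in the parentheses of Eq. (9) we can deduce further

\[
o(k_1, \dots, k_d) \equiv \sum_{j=1}^{d} j \cdot k_j \ge d + 1 \quad \Longrightarrow \quad \prod_{j=1}^{d} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} = O(\|\Delta x\|^{d+1}) .
\]

Thus Eq. (8) can be rearranged to

\[
\begin{aligned}
\varphi(u(\mathring{x} + \Delta x)) - \varphi(\mathring{u})
&= \sum_{i=1}^{d} \sum_{\substack{o(k_1, \dots, k_d) \le d \\ k_1 + \dots + k_d = i}} \varphi^{(i)}(\mathring{u}) \prod_{j=1}^{d} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} + O(\|\Delta x\|^{d+1}) && (10) \\
&= \sum_{i=1}^{d} \sum_{\vartheta=1}^{d} \sum_{\substack{o(k_1, \dots, k_d) = \vartheta \\ k_1 + \dots + k_d = i}} \varphi^{(i)}(\mathring{u}) \prod_{j=1}^{d} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} + O(\|\Delta x\|^{d+1}) && (11) \\
&= \sum_{i=1}^{d} \sum_{\vartheta=1}^{d} \sum_{\substack{o(k_1, \dots, k_\vartheta) = \vartheta \\ k_1 + \dots + k_\vartheta = i}} \varphi^{(i)}(\mathring{u}) \prod_{j=1}^{\vartheta} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} + O(\|\Delta x\|^{d+1}) && (12) \\
&= \sum_{i=1}^{d} \sum_{(k_1, \dots, k_i) \in T_i} \varphi^{(k_1 + \dots + k_i)}(\mathring{u}) \prod_{j=1}^{i} \frac{(\Delta^{(j)} u)^{k_j}}{k_j!} + O(\|\Delta x\|^{d+1}) && (13) \\
&= \sum_{i=1}^{d} \Delta^{(i)} [\varphi \circ u](\mathring{x}; \Delta x) + O(\|\Delta x\|^{d+1}) . && (14)
\end{aligned}
\]

For the rearrangement from line (11) to (12) the following insight has been used:

\[
o(k_1, \dots, k_d) = \vartheta < d \quad \Longrightarrow \quad \forall j > \vartheta : k_j = 0 \quad \Longrightarrow \quad (k_1, \dots, k_\vartheta) \in T_\vartheta .
\]

Altogether the recursive property (6) holds true for the composition $v = \varphi(u)$. □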
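The propagation rule of Theorem 2 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: it assumes, consistent with the proof above, that $T_d$ is the set of tuples $(k_1, \dots, k_d)$ of non-negative integers with $1 \cdot k_1 + 2 \cdot k_2 + \dots + d \cdot k_d = d$, and the names `index_set`, `faa_di_bruno` and `residual` are hypothetical. Increments are stored as numbers for one fixed Δx.

```python
from math import exp, factorial


def index_set(d):
    """Tuples (k_1, ..., k_d) with 1*k_1 + 2*k_2 + ... + d*k_d = d (assumed definition of T_d)."""
    def rec(j, remaining):
        # yields (k_1, ..., k_j) with 1*k_1 + ... + j*k_j = remaining
        if j == 0:
            if remaining == 0:
                yield ()
            return
        for k in range(remaining // j + 1):
            for head in rec(j - 1, remaining - j * k):
                yield head + (k,)
    return list(rec(d, d))


def faa_di_bruno(dphi, u):
    """Increments of v = phi(u) from u = [u0, d1, ..., dd].

    dphi(i, mu) must return the i-th derivative of phi at mu.
    """
    u0, order = u[0], len(u) - 1
    v = [dphi(0, u0)]
    for d in range(1, order + 1):
        inc = 0.0
        for k in index_set(d):
            term = dphi(sum(k), u0)
            for i, ki in enumerate(k, start=1):
                term *= u[i] ** ki / factorial(ki)
            inc += term
        v.append(inc)
    return v


def residual(dx):
    """Error of the order-4 expansion of exp(x**2) at x_ref = 1 for a step dx."""
    u = [1.0, 2.0 * dx, dx * dx, 0.0, 0.0]   # exact increments of u(x) = x**2
    v = faa_di_bruno(lambda i, mu: exp(mu), u)
    return abs(exp((1.0 + dx) ** 2) - sum(v))


print(residual(0.1), residual(0.05))
```

Halving Δx should reduce the residual by a factor of roughly $2^5$, in line with the $O(\|\Delta x\|^{d+1})$ statement (6) for d = 4. The example is smooth; for non-smooth u the increments would come from the absolute-value rule instead.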


5 A Generalized Integrator for Semi-explicit DAEs

In this section an implicit k-step integration method for differential equations will be derived, using a combined approach of polynomial interpolation, similar to the Adams-Moulton family of methods, and generalized Taylor expansions. First of all a definition of the problem class is required, and thus we define semi-explicit systems of differential-algebraic equations similarly to [15].

Definition 1 (partially non-smooth semi-explicit DAE) A semi-explicit system of differential-algebraic equations (DAE) with potentially non-smooth differential equations is a system of equations of the form

\[
\dot{x}_1(t) \equiv \frac{d}{dt} x_1(t) = f_1(x_1(t), x_2(t), t), \qquad (15a)
\]
\[
0 = f_2(x_1(t), x_2(t), t), \qquad (15b)
\]

where $x_1 : [0, T] \to \mathbb{R}^n$, $x_2 : [0, T] \to \mathbb{R}^m$, $f_1 : \mathbb{R}^n \times \mathbb{R}^m \times [0, T] \to \mathbb{R}^n$ is the system function of the differential equations (15a), $f_2 : \mathbb{R}^n \times \mathbb{R}^m \times [0, T] \to \mathbb{R}^m$ is the system function of the algebraic equations (15b), and $T > 0$ is some time horizon. Furthermore, let $f_1 \in \operatorname{span}(\Phi_{abs})$ denote the evaluation procedure of a corresponding piecewise smooth function and let $f_2 \in C^{1,1}$ be at least once Lipschitz-continuously differentiable.

In an attempt to approximate the solution of a system (15) we try to generate a table of data points

\[
\begin{array}{c|c|c|c|c|c}
t_0 = 0 & t_1 = h & \dots & t_i = i \cdot h & \dots & t_\vartheta = \vartheta \cdot h \\ \hline
x_1^{(0)}, x_2^{(0)} & x_1^{(1)}, x_2^{(1)} & \dots & x_1^{(i)}, x_2^{(i)} & \dots & x_1^{(\vartheta)}, x_2^{(\vartheta)}
\end{array}
\]

where $0 < h \ll T$ and $\vartheta \in \mathbb{N}$ such that $(\vartheta - 1) \cdot h < T \le \vartheta \cdot h$. The first pair is assumed to be a consistent initial value in that it satisfies $x_1^{(0)} = x_1(0)$ as well as $x_2^{(0)} = x_2(0)$ for some exact solution $x_1 : [0, T] \to \mathbb{R}^n$, $x_2 : [0, T] \to \mathbb{R}^m$ of system (15). Any other data point $1 \le i \le \vartheta$ is an approximation $x_1^{(i)} \approx x_1(t_i)$ and $x_2^{(i)} \approx x_2(t_i)$ of the same solution. Interpolation polynomials of order $k \in \mathbb{N}$ in Newton basis representation can be used to interpolate the data on each subinterval $[i \cdot h, (i + 1) h]$ for $k \le i < \vartheta$:

\[
p_l(t_{i+1}; t) \equiv \sum_{j=0}^{k} \frac{\nabla_{i+1}^{(j)} x_l}{j!} \prod_{s=0}^{j-1} (t - t_{i+1-s}),
\]

where $l \in \{1, 2\}$ and $\nabla^{(j)}$ is the backward finite difference operator defined as

\[
\nabla_k^{(0)} x_l \equiv x_l^{(k)}, \qquad \nabla_k^{(j)} x_l \equiv \frac{\nabla_k^{(j-1)} x_l - \nabla_{k-1}^{(j-1)} x_l}{h} .
\]
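The Newton interpolation with backward finite differences defined above can be sketched as follows in Python. The function names `backward_differences` and `newton_polynomial` and the sample data are illustrative assumptions; the code treats a single scalar component on an equidistant grid.

```python
from math import factorial


def backward_differences(values, h, order):
    """Return [nabla^(0), ..., nabla^(order)] taken at the last data point.

    values holds equidistant samples ending at t_{i+1}; h is the spacing.
    Each difference level is divided by h once, as in the definition above.
    """
    table = list(values)
    result = [table[-1]]
    for _ in range(order):
        table = [(table[n] - table[n - 1]) / h for n in range(1, len(table))]
        result.append(table[-1])
    return result


def newton_polynomial(values, t_last, h, t):
    """Evaluate the interpolation polynomial through all samples at time t."""
    order = len(values) - 1
    nabla = backward_differences(values, h, order)
    total, node_product = 0.0, 1.0
    for j in range(order + 1):
        total += nabla[j] / factorial(j) * node_product
        node_product *= t - (t_last - j * h)   # appends the factor with s = j
    return total


# Samples of x(t) = t**2 - 3*t + 1 at t = 0.0, 0.5, 1.0 (h = 0.5).
h = 0.5
samples = [s * s - 3.0 * s + 1.0 for s in (0.0, 0.5, 1.0)]
print(newton_polynomial(samples, 1.0, h, 0.8))
```

Three samples determine a quadratic uniquely, so the printed value should coincide with $x(0.8) = -0.76$ up to round-off.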


Now consider the integrated version of the differential equations of system (15) on the same subinterval:

\[
\begin{aligned}
\int_0^h \dot{x}_1(t_i + t)\, dt &= h \int_0^1 f_1(x_1(t_i + h\tau), x_2(t_i + h\tau), t_i + h\tau)\, d\tau \\
&\approx h \int_0^1 f_1(p_1(t_{i+1}; t_i + h\tau), p_2(t_{i+1}; t_i + h\tau), t_i + h\tau)\, d\tau .
\end{aligned}
\]

Henceforth we will use the alias $y_i(\tau) \equiv f_1(p_1(t_{i+1}; t_i + h\tau), p_2(t_{i+1}; t_i + h\tau), t_i + h\tau)$ for the integrand on each subinterval. Clearly $y_i \in \operatorname{span}(\Phi_{abs})$, since $f_1$, $p_1$ and $p_2$ are as well, and thus we can apply a generalized Taylor expansion:

\[
\int_0^h \dot{x}_1(t_i + t)\, dt = x_1(t_i + h) - x_1(t_i) \approx h \int_0^1 \Bigl[ y_i(1) + \sum_{j=1}^{k} \Delta^{(j)} y_i(1; \tau - 1) \Bigr] d\tau .
\]

The generalized Taylor expansion defines piecewise polynomial approximations of $f_1$, which can be integrated exactly up to machine precision. This motivates Pseudo-Algorithm 1 for a single numerical integration step. Let $\check{x}_1 \equiv x_1^{(i)}$ and $\check{x}_2 \equiv x_2^{(i)}$; then Pseudo-Algorithm 1 calculates iterates $\hat{x}_1^{(s)}$ and $\hat{x}_2^{(s)}$ of a potentially converging sequence towards $x_1^{(i+1)}$ and $x_2^{(i+1)}$.

Pseudo-Algorithm 1 (generalized order k + 1 combined Newton-Taylor method)
• predict starting values ($\hat{x}_1^{(0)} \in \mathbb{R}^n$ and $\hat{x}_2^{(0)} \in \mathbb{R}^m$)
• iterate over the loop variable s = 0, 1, 2, . . . and terminate at $S \in \mathbb{N}$ where either $\|\hat{x}_l^{(S)} - \hat{x}_l^{(S-1)}\| < \mathrm{atol}_l$ or $\|\hat{x}_l^{(S)} - \hat{x}_l^{(S-1)}\| < \mathrm{rtol}_l \cdot \|\check{x}_l\|$, for both $l \in \{1, 2\}$, is met
  – generate interpolation polynomials $p_l^{(s)}$, $l \in \{1, 2\}$, of $\hat{x}_l^{(s)}, \check{x}_l, x_l^{(i-1)}, \dots, x_l^{(i-k)}$
  – define $y_i^{(s)}(\tau) \equiv f_1(p_1^{(s)}(t_{i+1}; t_i + h\tau), p_2^{(s)}(t_{i+1}; t_i + h\tau), t_i + h\tau)$
  – calculate the generalized quadrature¹: $b_s \equiv h \int_0^1 \sum_{j=1}^{k} \Delta^{(j)} y_i^{(s)}(1; \tau - 1)\, d\tau$
  – calculate the next iterate $(\hat{x}_1^{(s+1)}, \hat{x}_2^{(s+1)}) \in \mathbb{R}^{n+m}$ by solving, for $\hat{x}_1^{(s+1)} \in \mathbb{R}^n$ and $\hat{x}_2^{(s+1)} \in \mathbb{R}^m$,
\[
\begin{bmatrix} \hat{x}_1^{(s+1)} - \check{x}_1 - h \cdot f_1(\hat{x}_1^{(s+1)}, \hat{x}_2^{(s+1)}, t_{i+1}) \\ f_2(\hat{x}_1^{(s+1)}, \hat{x}_2^{(s+1)}, t_{i+1}) \end{bmatrix} = \begin{bmatrix} b_s \\ 0 \end{bmatrix}
\]
• set $x_1^{(i+1)} \equiv \hat{x}_1^{(S)}$, $x_2^{(i+1)} \equiv \hat{x}_2^{(S)}$ and the numerical integration step is finished.

Remark 1 We did not require any index assumption on the DAE system (15), since the existing index concepts do not cover systems where $f_1$ or $f_2$ may be non-differentiable. Consequently, it is not guaranteed that (15) is uniquely solvable, and Pseudo-Algorithm 1 could also fail.

¹ where $b_s = h \int_0^1 \bigl[ y_i^{(s)}(1) + \sum_{j=1}^{k} \Delta^{(j)} y_i^{(s)}(1; \tau - 1) \bigr] d\tau - h f_1(\hat{x}_1^{(s)}, \hat{x}_2^{(s)}, t_{i+1})$ holds true.


If the generalized expansion is replaced by a Taylor polynomial expansion in the sense of Theorem 1, Pseudo-Algorithm 1 turns into a linear k-step method in the classical sense as defined e.g. in [13]. The same happens whenever $f_1 \in \operatorname{span}(\Phi) = \operatorname{span}(\Phi_{abs} \setminus \{\mathrm{abs}\})$ satisfies assumption (ED), since the generalized and the Taylor polynomial expansions coincide for such functions. On the other hand, when the generalized quadrature rule is substituted by the simplified rule $b_s \equiv 0$ in every step, Pseudo-Algorithm 1 turns into the well-known implicit Euler method.

Example 3 (Newton-Taylor integration) Consider the following system of ordinary differential equations

\[
\dot{x}_0(t) = |x_0(t)| + 1, \qquad (16a)
\]
\[
\dot{x}_1(t) = x_0(t) + 1, \qquad (16b)
\]

with exact solution

\[
x_0(t) = \begin{cases} 1 - \exp(-t) & \text{if } t < 0, \\ \exp(t) - 1 & \text{else,} \end{cases}
\qquad
x_1(t) = \begin{cases} \exp(-t) - 1 + 2t & \text{if } t < 0, \\ \exp(t) - 1 & \text{else,} \end{cases}
\]

for an initial value² $(x_0^{(0)}, x_1^{(0)}) = (x_0(t_0), x_1(t_0))$ on a time interval $[t_0, T] = [-1, 1]$. Thus the exact solution is infinitely often differentiable everywhere except at t = 0. At the origin $x_0$ is exactly once, whereas $x_1$ is exactly twice, continuously differentiable.

We defined the initial value with the exact solution at hand, but since the Newton-Taylor method is a k-step method we also took the k − 1 further starting iterates it needs from the exact solution. We applied the Newton-Taylor method and produced the convergence diagram of Fig. 2, from which we can make the following observations. There are three phases from right to left. In phase one the order 2 method shows order 2 convergence with respect to a shrinking time step length h. Both the order 3 and the order 4 method show order 3 convergence in the beginning. Traditional solvers would be expected to achieve order 2 convergence. In the second phase the convergence of all three methods breaks down, for the higher order methods slightly sooner than for the order 2 method, but for all of them at still relatively coarse time step sizes. Thus the results get worse with further shrinking time step length. In the last phase the convergence of all three stabilizes again at order 1.

The convergence breakdown in particular attracts attention. It is reminiscent of another phenomenon known from recursive finite differences used to approximate higher order derivatives of a function. Figure 3 shows the error of such recursive finite differences. We can observe that the convergence behavior in the diagram suffers quite early from accumulating round-off errors. This effect gets worse with increasing order of differentiation. We indeed implemented the Newton interpolation using finite differences and monomial basis representation. The fact that we made use of Newton polynomial interpolations of $x_0$ and $x_1$ could be responsible for the almost identical convergence behavior of the order 2 and order 3 method. This complies with our expectations, since polynomial interpolations cannot reflect non-differentiabilities in higher order derivatives of the actual solution, and therefore the convergence order should indeed be limited.

² We use the phrase starting value within Pseudo-Algorithm 1 to refer to the beginning of an iteration for solving a nonlinear root problem, and we use the phrase initial value for the beginning of a time iteration.

Fig. 2 Convergence diagram for ODE (16) integrated with Newton-Taylor method for different orders of expansions and for several time steplengths h in a log-log plot. The convergence curves are displayed for each component of the ODE separately

Fig. 3 Figure displaying the same graphs twice but with restricted y-axis in the latter image. Here the error between recursive finite differences and symbolic differentiation of sin(x) at x = 1 is displayed against various step lengths on the x-axis


The overall convergence behavior requires an in-depth analysis, but the preliminary results of our prototype implementation justify our hope of achieving higher order convergence when using generalized Taylor expansions during method derivation. We take this as a starting point for the formulation of improved versions of the Newton-Taylor method demonstrated here.
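The simplest special case mentioned in this section, the implicit Euler method obtained when the generalized quadrature term is set to zero, can be tried on ODE (16) with a few lines of Python. The sketch below is not the Newton-Taylor method and not the authors' prototype; the function names `exact` and `implicit_euler` are hypothetical, and the implicit equation for $x_0$ is solved in closed form, which is possible here because the system is so small.

```python
from math import exp


def exact(t):
    """Exact solution (x0(t), x1(t)) of ODE (16) as stated in Example 3."""
    if t < 0:
        return 1.0 - exp(-t), exp(-t) - 1.0 + 2.0 * t
    return exp(t) - 1.0, exp(t) - 1.0


def implicit_euler(h, t0=-1.0, t_end=1.0):
    """Integrate ODE (16) with implicit Euler; return the largest grid errors in x0 and x1."""
    steps = round((t_end - t0) / h)
    x0, x1 = exact(t0)
    worst0 = worst1 = 0.0
    for n in range(1, steps + 1):
        # Solve x0_new = x0 + h * (|x0_new| + 1). For 0 < h < 1 the sign of
        # x0_new equals the sign of x0 + h, which selects the branch.
        rhs = x0 + h
        x0 = rhs / (1.0 - h) if rhs >= 0.0 else rhs / (1.0 + h)
        x1 = x1 + h * (x0 + 1.0)          # x1 equation is linear in the new x0
        e0, e1 = exact(t0 + n * h)
        worst0 = max(worst0, abs(x0 - e0))
        worst1 = max(worst1, abs(x1 - e1))
    return worst0, worst1


for step in (0.02, 0.01, 0.005):
    print(step, implicit_euler(step))
```

The largest error over the grid should shrink roughly in proportion to h, i.e. first order. Measuring the error only at the final time t = 1 would be misleading for this particular problem, because the errors accumulated before and after the kink at t = 0 partly cancel there.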

6 Conclusion and Outlook

We have derived and presented a Taylor expansion in terms of piecewise polynomials for piecewise smooth functions incorporating absolute value operations in their evaluation graphs. Furthermore, we have proven the residual to be of $O(\|x - \mathring{x}\|^{\bar{d}+1})$, where x is the point of evaluation, $\mathring{x}$ the reference point of the expansion and $\bar{d}$ a user-definable order, supposing that certain parts of the underlying evaluation procedure are sufficiently smooth. Applications in optimization, root solving and numerical integration are possible, and their derivation and analysis seem to be logical next steps. The presented expansion concept allows the generalization of already existing methods in such a way that the original functionality remains unchanged for elementary differentiable objective functions or systems. A prototype of a generalized integration method for semi-explicit DAEs based on Newton interpolations and generalized Taylor expansions was derived for the purpose of demonstration.

One concrete next step is the extension of the presented method to semi-explicit DAEs with non-smooth algebraic equations $f_2 \in \operatorname{span}(\Phi_{abs})$, which have been excluded from our considerations so far. Furthermore the Newton polynomial interpolation can and should be replaced by some generalized Hermite interpolation scheme to achieve higher orders of convergence. The concept of Hermite extrapolations can be extended, and piecewise polynomials can be propagated through evaluation graphs of non-smooth functions f similarly to generalized Taylor expansions. By doing so we can calculate higher order approximations in one or several reference points $\mathring{x}_0, \mathring{x}_1, \dots \in \mathbb{R}^n$ satisfying

\[
f(x) - H_f[\mathring{x}_0, \dots](x) = O(\|x - \mathring{x}_0\|^{\theta_0} \cdot \|x - \mathring{x}_1\|^{\theta_1} \cdots),
\]

where the exponents $\theta_0, \theta_1, \dots$ are natural numbers and H denotes such a generalized Hermite interpolation of f.

Another possible starting point for future work is to consider and generalize the multivariate formula of Faà di Bruno. The restriction to mostly unary elementary operations would then no longer be necessary. This allows the generalized Taylor expansion of multivariate functions $f : \mathbb{R}^n \to \mathbb{R}^n$ in so-called Abs-Normal Form (ANF) representations

\[
\begin{bmatrix} z \\ f(x) \end{bmatrix} = \begin{bmatrix} G(x, |z|) \\ F(x, |z|) \end{bmatrix},
\]

where the functions $F : \mathbb{R}^{n+s} \to \mathbb{R}^n$ and $G : \mathbb{R}^{n+s} \to \mathbb{R}^s$ are assumed to be sufficiently smooth and the partial derivative matrix $\frac{\partial}{\partial |z|} G(x, |z|)$ is of strictly lower triangular form everywhere. The latter condition allows the explicit computation of all the switching variables $z = (z_i)_{i=0}^{s-1}$ one by one. More precisely, F and G may be black-box functions and not necessarily evaluation procedures in that context. Thus providing arbitrarily designed computation routines for the point evaluation of F, G and their Taylor polynomial expansions would be sufficient, since the propagation process can be applied on a higher abstraction level by considering their vector components as new elementary operations.

Acknowledgements We want to thank Dr. Lutz Lehmann, Christian Strohm and the anonymous referees for their constructive criticism of the manuscript. Furthermore the work for the article has been conducted within the Research Campus MODAL funded by the German Federal Ministry of Education and Research (BMBF) (fund number 05M14ZAM).

References

1. Boeck, P., Gompil, B., Griewank, A., Hasenfelder, R., Strogies, N.: Experiments with generalized midpoint and trapezoidal rules on two nonsmooth ODE's. Mongol. Math. J. 17, 39–49 (2013)
2. Bosse, T., Narayanan, S.H.K., Hascoet, L.: Piecewise linear AD via source transformation. Published as preprint via Argonne National Laboratory (2016)
3. Clees, T., Nikitin, I., Nikitina, L.: Making network solvers globally convergent. In: Obaidat, M.S., Ören, T., Merkuryev, Y. (eds.) Simulation and Modeling Methodologies, Technologies and Applications, pp. 140–153. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-69832-8_9
4. Fiege, S., Walther, A., Griewank, A.: An algorithm for nonsmooth optimization by successive piecewise linearization. Math. Program. (2018). https://doi.org/10.1007/s10107-018-1273-5
5. Fiege, S., Walther, A., Kulshreshtha, K., Griewank, A.: Algorithmic differentiation for piecewise smooth functions: A case study for robust optimization. Optim. Methods Softw. 33(4–6), 1073–1088 (2018). https://doi.org/10.1080/10556788.2017.1333613
6. Griewank, A.: On stable piecewise linearization and generalized algorithmic differentiation. Optim. Methods Softw. 28(6), 1139–1178 (2013). https://doi.org/10.1080/10556788.2013.796683
7. Griewank, A., Bernt, J.U., Radons, M., Streubel, T.: Solving piecewise linear systems in abs-normal form. Linear Algebra Appl. 471, 500–530 (2015). https://doi.org/10.1016/j.laa.2014.12.017
8. Griewank, A., Hasenfelder, R., Radons, M., Lehmann, L., Streubel, T.: Integrating Lipschitzian dynamical systems using piecewise algorithmic differentiation. Optim. Methods Softw. 33(4–6), 1089–1107 (2018). https://doi.org/10.1080/10556788.2017.1378653
9. Griewank, A., Streubel, T., Lehmann, L., Hasenfelder, R., Radons, M.: Piecewise linear secant approximation via algorithmic piecewise differentiation. Optim. Methods Softw. 33(4–6), 1108–1126 (2018). https://doi.org/10.1080/10556788.2017.1387256
10. Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM) (2008). https://doi.org/10.1137/1.9780898717761
11. Griewank, A., Walther, A.: First- and second-order optimality conditions for piecewise smooth objective functions. Optim. Methods Softw. 31(5), 904–930 (2016). https://doi.org/10.1080/10556788.2016.1189549


12. Griewank, A., Walther, A., Fiege, S., Bosse, T.: On Lipschitz optimization based on gray-box piecewise linearization. Math. Program. 158(1), 383–415 (2016). https://doi.org/10.1007/s10107-015-0934-x
13. Hairer, E., Nørsett, S.P., Wanner, G.: Solving Ordinary Differential Equations I: Nonstiff Problems. Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78862-1
14. Kubota, K.: Enumeration of subdifferentials of piecewise linear functions with abs-normal form. Optim. Methods Softw. 33(4–6), 1156–1172 (2018). https://doi.org/10.1080/10556788.2018.1458848
15. Lamour, R., März, R., Tischendorf, C.: Differential-Algebraic Equations: A Projector Based Analysis. Differential-Algebraic Equations Forum. Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-27555-5
16. Leipnik, R.B., Pearce, C.E.M.: The multivariate Faà di Bruno formula and multivariate Taylor expansions with explicit integral remainder term. The ANZIAM J. 48(3), 327–341 (2007). https://doi.org/10.1017/S1446181100003527
17. Naumann, U.: The Art of Differentiating Computer Programs. Society for Industrial and Applied Mathematics (2011). https://doi.org/10.1137/1.9781611972078
18. Radons, M.: Direct solution of piecewise linear systems. Theor. Comput. Sci. 626, 97–109 (2016). https://doi.org/10.1016/j.tcs.2016.02.009
19. Radons, M.: A note on surjectivity of piecewise affine mappings. Optimization Letters (2018). https://doi.org/10.1007/s11590-018-1271-9
20. Scholtes, S.: Introduction to Piecewise Differentiable Equations. Springer Briefs in Optimization. Springer, New York (2012). https://doi.org/10.1007/978-1-4614-4340-7
21. Streubel, T., Griewank, A., Radons, M., Bernt, J.U.: Representation and analysis of piecewise linear functions in abs-normal form. In: Pötzsche, C., Heuberger, C., Kaltenbacher, B., Rendl, F. (eds.) System Modeling and Optimization, pp. 327–336. Springer Berlin Heidelberg, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45504-3_32
22. Streubel, T., Strohm, C., Trunschke, P., Tischendorf, C.: Generic construction and efficient evaluation of flow network DAEs and their derivatives in the context of gas networks. In: Kliewer, N., Ehmke, J.F., Borndörfer, R. (eds.) Operations Research Proceedings 2017, pp. 627–632. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-89920-6_83

Grid-Enhanced Polylithic Modeling and Solution Approaches for Hard Optimization Problems Josef Kallrath, Robert Blackburn, and Julius Näumann

Abstract We present a grid enhancement approach (GEA) for hard mixed integer or nonlinear non-convex problems to improve and stabilize the quality of the solution if only short time is available to compute it, e.g., in operative planning or scheduling problems. Branch-and-bound algorithms and polylithic modeling & solution approaches (PMSA)—tailor-made techniques to compute primal feasible points—usually involve problem-specific control parameters p. Depending on data instances, different choices of p may lead to variations in run time or solution quality. It is not possible to determine optimal settings of p a priori. The key idea of the GEA is to exploit parallelism on the application level and to run the polylithic approach on several cores of the CPU, or on a cluster of computers in parallel for different settings of p. Especially scheduling problems benefit strongly from the GEA, but it is also useful for computing Pareto fronts of multi-criteria problems or computing minimal convex hulls of circles and spheres. In addition to improving the quality of the solution, the GEA helps us maintain a test suite of data instances for the real world optimization problem, to improve the best solution found so far, and to calibrate the tailor-made polylithic approach.

J. Kallrath (B) Department of Astronomy, University of Florida, Gainesville, FL 32611, USA
R. Blackburn Discrete Optimization and Logistics, Karlsruhe Institute of Technology, 76133 Karlsruhe, Germany
J. Näumann Technical University of Darmstadt, 64289 Darmstadt, Germany
© Springer Nature Switzerland AG 2021 H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_4

1 Introduction

The term polylithic modeling and solution approaches (PMSA, for short) has been coined by [21] and refers to a framework or tailor-made techniques for solving hard

mixed integer optimization (MIP) or non-convex nonlinear programming (ncNLP) problems exploiting several models and their solutions to establish primal feasible points, and sometimes even dual bounds. These models can be relaxations of the original MIP problem, or auxiliary models to obtain, for instance, better lower bounds on the original problem, or bounds on integer variables. The key idea of PMSA is that exact optimization algorithms and heuristics are both used to solve a MIP or ncNLP problem. Related or similar are matheuristics connecting mathematical programming and metaheuristics; cf. [27]. Note that PMSA go beyond metaheuristics, i.e., master strategies such as evolutionary algorithms (in this group we find genetic algorithms), or local search techniques such as simulated annealing or tabu search. Especially when it comes to constrained optimization, PMSA become superior. PMSA can also establish algorithms in their own right, e.g., variants of Fix-and-Relax; cf. Section 3.6.1 in [30].

Usually, such techniques involve tuning parameters p controlling the selection of auxiliary models or the conditions under which to operate them. Depending on data instances, different choices of p may lead to different running times or solution quality. It is not possible to determine optimal settings of p a priori. To weaken the dependence of the running time and solution quality on p, and to improve the quality of the solution, in this paper we propose to enhance PMSA by a multi-grid or multi-start parameter approach called grid enhanced approach (GEA) or parallel PMSA (pPMSA). The essence of this approach is to run the whole PMSA on several cores of the CPU, or on a cluster of computers, in parallel for a full or partial list of parameter combinations p ∈ P. We can either let each job for a certain parameter combination run for a certain time and extract the best solution. Alternatively, if we have a problem with a single objective function, or if we are able to qualify the goodness of the solution in the multi-criteria case, we can terminate jobs if they are dominated by the current best solution.

Note the difference between the multi-start parameter approach and multi-start techniques. While the latter use multi-starts to find initial feasible points when solving ncNLP problems, or when applying local search techniques, by using different initial values of the variables x inherent to the optimization problem, the former uses different parameters selecting algorithms, sub-models or solvers, or parameters inherent to algorithms or solvers.

Optimization is struggling to make use of today's and tomorrow's multi-core computing architectures, as many of the optimization community's algorithms run inherently sequentially, and even for algorithms that are suitable for parallel processing (e.g., branch-and-bound) the actual speed-up is limited. While one usually finds parallelization techniques deep on the level of the solver technology, cf. [7, 25, 34, 36, 37] or [38], or exploiting multiple threads (cf. [16] or [35]) when implementing branch-and-bound based methods, our parallelization attacks at the higher level of the application itself and is thus very problem-specific. Running the same problem with different algorithms, parameters, etc. and choosing the fastest one, also known as concurrent optimization, is one way (actually the easiest one, called embarrassingly parallel or perfectly parallel by [17]) to utilize the parallel computing power. GEA takes this one step further by a) applying the multi-grid approach on the level of the application itself by exploiting the control parameters of the PMSA, and b) allowing


communication among the parallel runs as illustrated in the scheduling example in Sect. 6.2. To give a brief overview, the list below presents a few meanings of the parameters referred to by p:

1. choice of algorithm within a solver or PMSA,
2. control and tuning parameters of algorithms of PMSA,
3. choice of a solver,
4. control and tuning parameters of solvers,

where solver refers to commercial MILP solvers CPLEX [18], GUROBI [15] or XPRESS [16], or NLP/MINLP solvers such as BARON [13], ANTIGONE [28], or LINDO [26], to name a few. Algorithm could be, for instance, primal simplex, dual simplex or barrier in the MILP solvers. It could also be MILP or genetic algorithm for solving large traveling salesman problems. Inner parameters of solvers could be upper limits on CPU time, or numeric tolerances for satisfying constraints.

Highlights of this contribution:

1. With PMSA we present a generic framework for extending the set of hard MILP, non-convex NLP, or MINLP problems which can be solved in reasonable time.
2. The GEA on multicore platforms and clusters is generic and timeless and allows us to implement PMSA with reduced dependence on tuning and control parameters, increasing the quality of solutions if only limited time is available. Implementation issues are explained using the algebraic modeling language GAMS.
3. In addition to improving the quality of the solution, the GEA helps us maintain a test suite of data instances for the real world optimization problem at hand, to improve the best solution found so far, and last but not least, to calibrate the tailor-made polylithic approach.

We try not to get lost in the details of the various application examples and rather focus on the generic principles.
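The core mechanism of the GEA described in this introduction, running the same approach for every parameter combination p in a grid P and keeping the best result, can be sketched generically in Python. Everything in this sketch is a stand-in: `toy_heuristic` is a hypothetical randomized local search on a toy objective, not a PMSA, and the parameter tuple (seed, step) merely plays the role of p. A thread pool is used only to keep the sketch self-contained; a real setup would dispatch solver runs to separate processes, cores or cluster nodes, and the paper itself explains its implementation in GAMS.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_heuristic(params):
    """Hypothetical stand-in for one run with control parameters p = (seed, step)."""
    seed, step = params
    rng = random.Random(seed)
    x = 10.0
    best = (x - 3.0) ** 2              # toy objective to be minimized
    for _ in range(200):
        candidate = x + rng.uniform(-step, step)
        value = (candidate - 3.0) ** 2
        if value < best:               # accept only improving moves
            x, best = candidate, value
    return best, params


def grid_enhanced_run(parameter_grid, workers=4):
    """Run the heuristic for every p in the grid concurrently and return the best result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(toy_heuristic, parameter_grid))
    return min(results), results


grid = list(product([1, 2, 3], [0.1, 1.0]))   # seeds x step sizes
winner, all_results = grid_enhanced_run(grid)
print(winner)
```

This corresponds to the first variant described above, in which every job runs to completion and the best solution is extracted afterwards. Terminating dominated jobs early or letting runs communicate would require shared state between the workers, which this sketch deliberately leaves out.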

2 Literature Review

There are ideas and frameworks in the literature which are similar to the GEA or pPMSA but often with a different focus or motivation. Therefore, we try to cover books and articles and outline the ideas without claiming that this is complete. We identify three major areas where similar ideas or approaches are used:

1. Parallel algorithmic techniques (concurrent, concurrent-distributed-concurrent, distributed) within the MILP solvers CPLEX, GUROBI and XPRESS,
2. Parallel metaheuristics, and
3. Machine learning and hyper-parameter optimization.


Prior to going into more depth for these three fields, we point the reader to a very useful taxonomy of parallel architectures provided by [39], pp. 522. The advantages of using multi-core platforms versus clusters of computers are discussed by [3], pp. 13.

All commercial MILP solvers allow concurrent runs with various flavors: concurrent, concurrent-distributed-concurrent, and distributed. Concurrent optimization for MILP can be understood as the simplest realization of the GEA and is available in CPLEX, GUROBI or XPRESS. The next level is concurrent-distributed-concurrent, which allows communication and interaction between parallel runs on cores or threads. Distributed MILP means that each B&B search is started with different parameter settings, a permutation of the columns/rows, or just another random seed. The best one wins, or one even allows restarts and only continues with those settings that perform best so far. CPLEX, for instance, offers distributed with the following tasks: (i) work on the lower bound on one thread; (ii) work on the primal bound (heuristics!) on another; and (iii) have a third thread manage the search tree. Impressive results are provided by [34] using a parallel enhanced version of the solver SCIP (cf. [33] or [14]) and 80,000 cores in parallel on the Titan supercomputer to solve 12 previously unsolved MILP problems from the MIPLIB benchmark set.

Although PMSA go beyond metaheuristics, it is worthwhile to be aware of what is going on in the field of parallel metaheuristics; cf. [1, 2, 4], various chapters in [1] about parallel versions of genetic algorithms, simulated annealing, and tabu search, the early work by [12, 29], or [10]. If we follow [1] in his book Parallel Metaheuristics on p. 112, in many cases, pPMSA would fall into the class of independent run models.

As we provide in Sect. 6.1 an example in which we have used pPMSA to compute the Pareto front, we point out that there exists a vast body of literature related to parallel techniques for solving multi-objective optimization problems. This requires constructing a set of solutions called the Pareto front. [11] favor evolutionary algorithms for this. [20] construct a specially defined parallel tabu search applied to the Pareto front reached by an evolutionary algorithm.

A different community and field where parallel solution approaches have an impact is machine learning and hyper-parameter optimization in the context of Bayesian optimization. In machine learning, hyper-parameter optimization or tuning is the task of selecting a set of optimal hyper-parameters for a learning algorithm. Hyper-parameters are parameters whose values are used to control the learning process, while the values of other parameters (usually node weights) are learned. Grid search and random search (cf. [6]) are easy to implement in parallel. [5] let a Gaussian process algorithm and a tree-structured Parzen estimator run asynchronously in order to make use of multiple compute nodes and to avoid wasting time waiting for trial evaluations to complete.

What comes closest to the ideas of our paper are the possibilities presented by [9, 16] for problem decomposition and concurrent solving from a modeling point of view with example implementations in Mosel that show handling of multiple models, multiple problems within a model, and as a new feature, distributed computation using a heterogeneous network of computers. In 2004 and 2012, the XPRESS module mmjobs probably focussed on solving MILP problems or solving NLP


problems with multi-start techniques. This module allows one on the modeling level to determine what to parallelize and how to distribute jobs (whole model or submodels).

3 Mathematical Structure of the Grid Approach

We want to solve a MILP, NLP, or MINLP problem P(x) in a vector x of variables (continuous or integer) defined in the most general case as

$$\min f(x) \quad \text{s.t.} \quad g(x) = 0 \;\wedge\; h(x) \ge 0 .$$

Let us discuss first why we want to use a grid approach for solving P(x).

Reason 1: One relevant practical requirement is that we have a limit on the time available for returning a solution back to the user, i.e., we usually cannot solve the problem to optimality. In this situation, we want to get the best solution within the available time.

Reason 2: Problem P(x) is a multi-criteria optimization problem, but it is hard to qualify what a good solution means for the owner of the problem. Therefore, we want to offer various solutions enabling the user to select by inspection the best solution.

Note that both reasons can also show up in combination. For both reasons, we can distinguish two cases.

Case 1 (P(x) is not too hard): We may want to solve P(x) using different solvers, or use a specific solver with different settings of some of its tuning parameters.

Case 2 (P(x) is very hard to solve): We may want to resort to PMSA.

In both cases, it is not possible to determine a priori optimal settings of the tuning or control parameters. Therefore, we consider variations of P(x) named $P_p(x)$ defined as

$$\min f_p(x) \quad \text{s.t.} \quad g_p(x) = 0 \;\wedge\; h_p(x) \ge 0 ,$$

where $p \in \mathcal{P}$ is a set of string-valued parameters specifying an instance of P(x) or providing instructions on which solution approach or MINLP solver to use for solving $P_p(x)$. Typical examples for such instructions could be related to whether to use the primal or dual Simplex algorithm when solving the MILP subproblems of P(x), use BARON or LINDO as a deterministic global solver, problem-specific input parameters, and controlling parameters of the tailor-made polylithic modeling and solution approach. Let n denote the number of instances to be evaluated, i.e., $n = \#\mathcal{P}$. Each instance $P_p(x)$ is solved on a core of the CPU or on a machine within a cluster of computers. Instances may have different run times. By k we denote the number of cores or


machines. If n ≤ k, we can solve all instances in parallel. In case n > k, we have up to k instances running in parallel. If instance $P_p(x)$ has finished, we submit the next instance $P_{p+1}(x)$ to be solved. The result of $P_p(x)$ may enable us to terminate an actively running instance $P_{\tilde{p}}(x)$.
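The dispatch rule described here (at most k instances at a time, with a finished instance possibly ending the others) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' GAMS implementation: `solve_instance` is a hypothetical stand-in with a toy objective, threads replace solver processes, and the threshold `good_enough` is an assumed acceptance criterion.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def solve_instance(p):
    """Hypothetical stand-in for solving P_p(x); returns (p, objective value)."""
    return p, (p - 3) ** 2  # toy objective, smaller is better


def run_grid(params, k, good_enough):
    """Run the instances on at most k workers; stop early on a good result."""
    best = None
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(solve_instance, p) for p in params]
        for fut in as_completed(futures):
            p, value = fut.result()
            if best is None or value < best[1]:
                best = (p, value)
            if value <= good_enough:
                # Cancel instances that have not started yet. Solver
                # processes already running would need an explicit kill,
                # which is omitted in this sketch.
                for other in futures:
                    other.cancel()
                break
    return best


print(run_grid(range(7), k=3, good_enough=0))
```

With n = 7 parameter values and k = 3 workers, the pool queues the remaining instances and starts each one as soon as a worker becomes free, which mirrors the case n > k above.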

4 Optimization Problems and Optimization Algorithms Suitable for the Grid Approach

Structurally, there are two categories of optimization problems which benefit from the GEA: problems where it is difficult to find a good feasible point in a short time, and multi-criteria optimization problems. An example of a multi-criteria optimization problem is the simultaneous minimization of trimloss and the number of patterns in 1D cutting stock problems as described and solved in [24], hereafter referred to as PCSP. Scheduling problems in the process industry (cf. [19, 32], or [8]) or lock scheduling problems (cf. [40]) are solved exploiting polylithic techniques; they are also multi-criteria in nature and therefore very suitable for the GEA. The convex hull minimization problems treated in [22, 23] have also been solved by polylithic approaches using various homotopy techniques in which a preliminary model is exploited to generate a feasible starting point which is then improved. The stronger the sensitivity of a problem w.r.t. some tuning parameters, the more suitable and efficient the GEA becomes to enhance the PMSA.

In addition to the inherent structure of an optimization problem regarding the suitability of the GEA, the algorithms used to solve a difficult optimization problem can also make the usage of a GEA attractive. Very suitable are metaheuristics, e.g., genetic algorithms. Hyper-parameter grid search in machine learning has already been mentioned in Sect. 2. Parallel search in constraint programming is also a suitable technique for the GEA; cf. [31].

5 IT-Aspects and Implementation

5.1 Generic Structure

As in Fig. 1, the grid approach can be structured into the following modules (larger pieces or collections of several pieces of programming code):

1. Model M0 corresponding to P(x).
2. Module Mv creating variations Vn of model M0, e.g., relaxations of P(x), or related, auxiliary models of P(x), combined with variations of solver configuration.
3. Module Mg defining and generating the instances.
4. Module Mc submitting, controlling, evaluating and possibly terminating optimization runs of the instances.
5. Module Mr collecting the results.

Fig. 1 The modules of the grid approach. Module Mc is the master module with over-all control. Variations Vn are generated by Mv, possibly a priori by a Python script, and run by Mg. Results are collected by Mr reading from the directories generated by Mg. [Diagram: Problem P(x) → Model M0 corresponding to P(x) → Mv → Variations V1, ..., Vn → Mg → Instances of V1, ..., Vn; Mc: submit, control, evaluate and terminate; Mr: collect results.]

5.2 Implementation in GAMS

We have implemented the GEA in GAMS, but in principle, it could be implemented in any algebraic modeling or programming language. Here we provide an explanation involving some GAMS flavor. The GAMS program application.gms is the main GAMS file to be executed. It contains the monolithic model and the PMSA controlled by cntrl.txt or compile.par containing various scalars and compile-time parameters. It calls multi-start.gms, which triggers Mv to generate variations. Regarding our GAMS implementation, the varied instance parameters can be either string-valued compile-time parameters or numerical scalars. In turn, multi-start.gms calls application.gms asynchronously for each variation to create an instance and run it (module Mg). The


maximum number of cores to be used can be configured through a parameter in cntrl.txt or compile.par read by application.gms. All runs are administered (module Mc ) in multi-start.gms, which also collects the final results (module Mr ).

5.2.1 Module Mv

There are two ways to generate and handle the variations, either automatically or manually. If generated a priori by a Python script, each generated variation is assigned a number and stored in a .gmi file named accordingly. This approach is preferred if we want to generate all combinations of tuning and control parameters. The Python script creates variations based on input from a file. The following is an example of such a file, containing instructions to vary the solver to be used and a scalar parameter:

    SOLVER , C, L {GUROBI,CBC}
    Param_A, S, L {0.250,0.500}
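To illustrate how all combinations could be enumerated from such an input file, here is a minimal Python sketch. It is not the authors' script. The reading of the columns is an assumption: `C` is taken to mean a string-valued compile-time parameter, `S` a numerical scalar, and `L` a list of values in braces. The function names `parse_spec` and `variations` are hypothetical.

```python
import itertools
import re

SPEC = """\
SOLVER , C, L {GUROBI,CBC}
Param_A, S, L {0.250,0.500}
"""

# name , kind , mode {value,value,...}
LINE = re.compile(r"^\s*(\w+)\s*,\s*(\w)\s*,\s*(\w)\s*\{([^}]*)\}\s*$")


def parse_spec(text):
    """Return a list of (name, kind, values) tuples, one per non-empty line."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, kind, _mode, raw = match.groups()
        values = [v.strip() for v in raw.split(",")]
        if kind == "S":  # assumed meaning: numerical scalar
            values = [float(v) for v in values]
        entries.append((name, kind, values))
    return entries


def variations(entries):
    """Number all combinations of the listed values, starting at 1."""
    names = [name for name, _kind, _values in entries]
    combos = itertools.product(*(values for _n, _k, values in entries))
    return {i: dict(zip(names, combo)) for i, combo in enumerate(combos, start=1)}


for number, setting in variations(parse_spec(SPEC)).items():
    print(number, setting)
```

For the two-line example this yields four numbered variations, one per solver and scalar value pair; writing each one to its own numbered file would be the next step and is left out here.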

Alternatively, variations can also be supplied directly as stored or programmed, respectively, in application_ParSet.gms. This approach has advantages when it is not possible to generate all combinations of parameters, e.g., when dealing with solver parameters not available in all solvers.

5.3 Module Mg

Module Mg prepares each variation for being run with GAMS by creating subdirectories containing the corresponding .gmi files.

5.3.1 Modules Mc and Mr

Modules Mc and Mr are both implemented in multi-start.gms. Mc asynchronously submits each run and constantly watches for results. If a sufficient solution has been returned by a run, it will terminate the others. Module Mr has the function of specifying the best solution. For single-objective function problems, the best solution is obvious. For multi-criteria problems we need to proceed differently. We define and construct a solution metric which allows us to decide automatically which solution is the best.
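The text does not specify the solution metric, so the following Python sketch only shows one possible form: a weighted sum of criteria, where smaller is better. The criteria names, weights and run results are all hypothetical values chosen for illustration.

```python
def solution_metric(criteria, weights):
    """Weighted sum of criteria values; smaller means better."""
    return sum(weights[name] * value for name, value in criteria.items())


def best_run(results, weights):
    """Return the run identifier with the smallest metric."""
    return min(results, key=lambda run: solution_metric(results[run], weights))


# Hypothetical weights and results of three finished runs.
weights = {"underproduction": 10.0, "delay": 1.0}
results = {
    "run_1": {"underproduction": 0.02, "delay": 3.0},
    "run_2": {"underproduction": 0.10, "delay": 1.0},
    "run_3": {"underproduction": 0.05, "delay": 2.0},
}

print(best_run(results, weights))
```

A weighted sum is only one choice; a lexicographic ordering of the criteria would serve the same purpose if the problem owner can rank them.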


6 Real World Examples

6.1 Cutting Stock Pareto Front

This problem consists of the simultaneous minimization of trim loss and patterns in cutting stock problems (CSPs). The problem is solved by a PMSA described in [24] which is essentially an exhaustion approach combining a greedy approach (maximize pattern multiplicity) with a MILP formulation of the CSP containing various tuning parameters, among them wmax, the maximal permissible waste per pattern, as the most important control parameter. Based on the GEA, the Pareto front is computed as a function of wmax. In this simple case, the GEA only exploits parallelism by computing the Pareto front simultaneously for six different values of wmax: 20, 15, 10, 8, 6 and 4%, which results in a computational speedup of approximately a factor of 6 as the jobs are independent. If the Pareto front of any multi-criteria optimization problem can be generated as a function of one or several parameters, it can be generated exploiting the GEA.
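Once the independent runs have returned, the non-dominated solutions can be filtered out as in the following Python sketch. The data points are hypothetical pairs of trim loss in percent and number of patterns, not results from [24], and both criteria are assumed to be minimized.

```python
def pareto_front(points):
    """Keep the points not dominated by any other (both criteria minimized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical results (trim loss in %, number of patterns), one per run.
runs = [(1.2, 14), (1.5, 11), (2.1, 9), (2.0, 12), (3.4, 9), (4.0, 7)]
print(pareto_front(runs))
```

The quadratic comparison is adequate for a handful of runs such as the six values of wmax used here.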

6.2 Scheduling Problem in the Process Industry

Consider the plant system with many reactors, tanks, continuous units etc. leading to a multi-criteria scheduling problem solved by [19]. The core of the problem is a MILP model exploiting a continuous time formulation involving event points. It is solved polylithically by a moving time window approach, in which the tunable parameters are maximum delay, maximum underproduction, and the maximum number of event points, as well as penalty coefficients on time and production target deviations. In the GEA we combine the tuning parameters of the approach and distinguish solution metrics for

1. make-to-order (bulk articles, mtoB),
2. make-to-order (packed articles, mtoP),
3. make-to-stock articles (mts), and
4. all products.

Within the GEA we trace the following criteria c:

    c       Description
    01-04   Relative underproduction (mtoB, mtoP, mts, all)
    05-08   Relative overproduction in %
    09-12   Delays
    13-16   Earliness
    17      Number of changeovers
    18-21   Ratio of priority 0/1/2/3 over all tasks
    22      Elapsed time (more for just knowing it)


Jobs run for 30 min at most. If the deviations from time and production targets, or a combined metric of them, become sufficiently small for a job, the solution is considered good and all other jobs not yet finished are terminated. An explorative test performed on a cluster ran with up to 1,000 jobs providing good schedules. In practical operative situations with running time constraints, one should not create more parameter combinations than cores or computing units available.

6.3 2D Minimal Perimeter Convex Hulls

Given a set of n circles with radii Ri and a rectangle with length L and width W, find a configuration of non-overlapping circles (specified by their center coordinates) which fit into the rectangle and lead to a convex hull whose perimeter has minimal length. The monolithic model M has been developed by [22]; it is a MINLP problem with bilinear, 4th-order polynomial, and trigonometric terms. The PMSA consists of different approaches to solve the MINLP problem with simplified models providing initial starting values for M:

1. P1: Minimize area or perimeter of a rectangle hosting the circles to produce initial values for M.
2. P2: Minimize the weighted distances of circles to the center of circles to produce initial values for M.
3. T: Tour-specified approach.

The GEA consists of parallel runs over M, P1, P2, and T. In this example, the GEA is helpful in developing efficient numeric schemes and in exploiting the current hardware as well as possible.

6.4 3D Minimal Surface Area Convex Hulls

Given a set of n spheres with radii Ri, find a configuration of non-overlapping spheres (specified by their center coordinates) which lead to a convex hull whose boundary has minimal area. The monolithic model M, dominated by bilinear terms, has been developed by [22]. The PMSA consists of various approaches to solve the NLP problem with simplified models providing initial starting values for M:

1. P1: Minimize the volume of a sphere or a rectangular box hosting the spheres to produce initial values for M.
2. P2: Minimize the weighted distances of spheres to the center of spheres to produce initial values for M.

The GEA consists of parallel runs over M, P1, P2. As above, the GEA is helpful in developing efficient numeric schemes and in exploiting the current hardware as well as possible.


7 Test Suite of Data Instances

In our development work, the GEA helps us to maintain a test suite of data instances for the real world optimization problem at hand, to improve the best solution found so far, and last but not least, to calibrate the tailor-made polylithic approach. The test suite is populated by real world data instances. Accumulated over several months and years, the collected data instances represent real world situations more and more appropriately. Test runs over this test suite take longer and longer. Thus, the GEA is of great help in covering as many parameter combinations as possible. Eventually, we learn that certain parameter combinations are dominated by others. Especially for scheduling problems, it is usually computationally prohibitive to compute the strict global optimum. Therefore, at best, we find a good feasible solution. Script files automatically evaluate the results of the individual data instances involved in the test runs, detect whether we have obtained a solution better than the previous best solution, and thus improve our test suite.

8 Conclusions and Discussions

We have constructed a generic grid enhancement approach (GEA) in connection with polylithic modeling and solution approaches (PMSA) to solve a hard mixed integer or non-convex nonlinear optimization problem when a solution has to be returned within a given time limit, or when the user wants to inspect various solutions, for instance, in a multi-criteria optimization problem. The GEA works on one computer with several cores or on a cluster of computers. On each core or cluster computer we solve the problem at hand with possibly different solvers, different solver parameters, different algorithms in the sense of PMSA or different tuning parameters of the algorithms at hand. The approach is very useful for operative planning or scheduling problems. Although the idea of the GEA is simple in nature, great care has to be paid to implementation in whatever programming language regarding robustness and maintainability of the code.

A valid point of discussion is the observation that PMSA increases the complexity of implemented decision systems, and thus also the maintenance effort. The crucial question to be answered is: Should everything which is possible also be done in mathematical optimization? The answer depends on the importance and the value of the decision problem. We have just tried to keep the approach as generic and clean as possible. Implementing the PMSA may take weeks or a month. The implementation for enhancing PMSA using the grid approach requires rather days, weeks, or possibly a month, including testing. However, the GEA provides another important advantage. For real world optimization problems, testing is very important for robustness. Therefore, we usually develop a test suite in which various problem instances with their optimal or best found solutions are stored. Grid enhanced PMSA allow us to improve a test suite, and,


if dominant parameter combinations can be identified, to calibrate the tailor-made PMSA.

For the future, there is room for improvement: If it is possible to specify a priori what is a good solution acceptable for the user, we have a weak interaction between parallel jobs in the sense of jobs already being dominated by the one just yielding an acceptable solution. The interaction could be deepened by evaluating pre-processing models in parallel, extracting and considering the information obtained from them, and exploiting this in follow-up computations, also in parallel.

Acknowledgements The authors are indebted to the anonymous referees whose comments helped to improve this paper. We thank Dr. Michael Bussieck (GAMS GmbH, Frechen, Germany) for discussion on parallelism used in optimization, Dr. Jens Schulz & Dr. Susanne Heipcke (FICO, Berlin, Germany & Marseille, France) for hints and details on parallelization in XPRESS, and Prof. Dr. Michael Torsten Koch (ZIB Berlin, Berlin, Germany), Dr. Jens Schulz and Dr. Steffen Klosterhalfen (Mannheim, Germany) for their careful reading of and feedback on the manuscript.

References

1. Alba, E.: Parallel Metaheuristics: A New Class of Algorithms. Wiley-Interscience, New York, NY, USA (2005)
2. Alba, E., Luque, G.: In: Alba, E. (Ed.) Parallel Metaheuristics: A New Class of Algorithms, Wiley Series on Parallel and Distributed Computing, chap. 2. Measuring the Performance of Parallel Metaheuristics, pp. 43–62. Wiley (2005)
3. Alba, E., Luque, G., Nesmachnow, S.: Parallel metaheuristics: recent advances and new trends. ITOR 20(1), 1–48 (2013)
4. Alba, E., Talbi, E.G., Luque, G., Melab, N.: In: Alba, E. (Ed.) Parallel Metaheuristics: A New Class of Algorithms, Wiley Series on Parallel and Distributed Computing, chap. 4. Metaheuristics and Parallelism, pp. 79–104. Wiley (2005)
5. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for Hyper-parameter Optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pp. 2546–2554. Curran Associates Inc., USA (2011)
6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
7. Berthold, T., Farmer, J., Heinz, S., Perregaard, M.: Parallelization of the FICO xpress-optimizer. Optim. Methods Soft. 33(3), 518–529 (2018)
8. Borisovsky, P.A., Eremeev, A.V., Kallrath, J.: Reducing the Number of Changeover Constraints in a MIP Formulation of a Continuous-Time Scheduling Problem. arXiv e-prints arXiv:1408.5832 (2014)
9. Colombani, Y., Heipcke, S.: Multiple Models and Parallel Solving with Mosel. Tech. rep., FICO Xpress Optimization, Birmingham, UK. http://www.fico.com/fico-xpress-optimization/docs/latest/mosel/mosel_parallel/dhtml
10. Crainic, T.G.: Parallel metaheuristics and cooperative search. In: Gendreau, M., Potvin, J.Y. (Eds.) Handbook of Metaheuristics, pp. 419–451. Springer (2019)
11. Figueira, J., Liefooghe, A., Talbi, E.G., Wierzbicki, A.: A Parallel Multiple Reference Point Approach for Multi-objective Optimization. Eur. J. Oper. Res. 205(2), 390–400 (2010). https://doi.org/10.1016/j.ejor.2009.12.027. http://www.sciencedirect.com/science/article/pii/S0377221710000081
12. Gendreau, M., Potvin, J.Y.: Handbook of Metaheuristics, 2nd edn. Springer Publishing Company, Incorporated (2010)
13. Ghildyal, V., Sahinidis, N.V.: Solving global optimization problems with BARON. In: Migdalas, A., Pardalos, P., Varbrand, P. (Eds.) From Local to Global Optimization. A Workshop on the Occasion of the 70th Birthday of Professor Hoang Tuy, chap. 10, pp. 205–230. Kluwer Academic Publishers, Boston, MA (2001)
14. Gleixner, A., Bastubbe, M., Eifler, L., Gally, T., Gamrath, G., Gottwald, R.L., Hendel, G., Hojny, C., Koch, T., Lübbecke, M.E., Maher, S.J., Miltenberger, M., Müller, B., Pfetsch, M.E., Puchert, C., Rehfeldt, D., Schlösser, F., Schubert, C., Serrano, F., Shinano, Y., Viernickel, J.M., Walter, M., Wegscheider, F., Witt, J.T., Witzig, J.: The SCIP Optimization Suite 6.0. Technical report, Optimization Online (2018). http://www.optimization-online.org/DB_HTML/2018/07/6692.html
15. Gurobi Optimization, L.: Gurobi Optimizer Reference Manual (2019). http://www.gurobi.com
16. Heipcke, S.: Xpress-Mosel: Multi-Solver, Multi-Problem, Multi-Model, Multi-Node Modeling and Problem Solving. In: Kallrath, J. (Ed.) Algebraic Modeling Systems: Modeling and Solving Real World Optimization Problems, pp. 77–110. Springer, Heidelberg, Germany (2012)
17. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2012)
18. IBM: IBM ILOG CPLEX Optimization Studio (2017) CPLEX Users Manual (2017). http://www.ibm.com
19. Janak, S.L., Floudas, C.A., Kallrath, J., Vormbrock, N.: Production Scheduling of a Large-Scale Industrial Batch Plant: I. Short-Term and Medium-Term Scheduling. Industrial and Engineering Chemistry Research 45, 8234–8252 (2006)
20. Jozefowiez, N., Semet, F., Talbi, E.G.: Parallel and hybrid models for multi-objective optimization: application to the vehicle routing problem. In: Guervós, J.J.M., Adamidis, P., Beyer, H.G., Schwefel, H.P., Fernández-Villacañas, J.L. (Eds.) Parallel Problem Solving from Nature – PPSN VII, pp. 271–280. Springer, Berlin Heidelberg, Berlin, Heidelberg (2002)
21. Kallrath, J.: Polylithic modeling and solution approaches using algebraic modeling systems. Optim. Lett. 5, 453–466 (2011). https://doi.org/10.1007/s11590-011-0320-4
22. Kallrath, J., Frey, M.M.: Minimal surface convex hulls of spheres. Vietnam J. Math. 46, 883–913 (2018)
23. Kallrath, J., Frey, M.M.: Packing circles into perimeter-minimizing convex hulls. J. Global Optim. 73(4), 723–759 (2019). https://doi.org/10.1007/s10898-018-0724-0
24. Kallrath, J., Rebennack, S., Kallrath, J., Kusche, R.: Solving real-world cutting stock-problems in the paper industry: mathematical approaches, experience and challenges. Eur. J. Oper. Res. 238, 374–389 (2014)
25. Laundy, R.S.: Implementation of parallel branch-and-bound algorithms in xpress-MP. In: Ciriani, T.A., Gliozzi, S., Johnson, E.L., Tadei, R. (Eds.) Operational Research in Industry. MacMillan, London (1999)
26. Systems, L.: Lindo API: User's Manual. Lindo Systems Inc, Chicago (2004)
27. Maniezzo, V., Stützle, T., Voß, S.: Matheuristics: Hybridizing Metaheuristics and Mathematical Programming, 1st edn. Springer Publishing Company, Incorporated (2009)
28. Misener, R., Floudas, C.: ANTIGONE: algorithms for coNTinuous/integer global optimization of nonlinear equations. J. Global Optim. 59, 503–526 (2014). https://doi.org/10.1007/s10898-014-0166-2
29. Pardalos, P.M., Pitsoulis, L.S., Mavridou, T.D., Resende, M.G.C.: Parallel Search for Combinatorial Optimization: Genetic Algorithms, Simulated Annealing, Tabu Search and GRASP. In: Parallel Algorithms for Irregularly Structured Problems, Second International Workshop, IRREGULAR '95, Lyon, France, September 4-6, 1995, Proceedings, pp. 317–331 (1995). https://doi.org/10.1007/3-540-60321-2_26
30. Pochet, Y., Wolsey, L.A.: Production Planning by Mixed Integer Programming. Springer, New York (2006)
31. Régin, J.C., Malapert, A.: Parallel constraint programming. In: Hamadi, Y., Sais, L. (Eds.) Handbook of Parallel Constraint Reasoning, pp. 337–379. Springer International Publishing (2018)
32. Shaik, M.A., Floudas, C.A., Kallrath, J., Pitz, H.J.: Production scheduling of a large-scale industrial continuous plant: short-term and medium-term scheduling. Comput. Chem. Eng. 33, 670–686 (2009)
33. Shinano, Y., Achterberg, T., Berthold, T., Heinz, S., Koch, T.: ParaSCIP: a parallel extension of SCIP. In: Competence in High Performance Computing 2010 - Proceedings of an International Conference on Competence in High Performance Computing, Schloss Schwetzingen, Germany, June 2010, pp. 135–148 (2010)
34. Shinano, Y., Achterberg, T., Berthold, T., Heinz, S., Koch, T., Winkler, M.: Solving open MIP instances with ParaSCIP on supercomputers using up to 80,000 cores. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 770–779 (2016)
35. Shinano, Y., Berthold, T., Heinz, S.: A First Implementation of ParaXpress: combining internal and external parallelization to solve MIPs on Supercomputers. In: International Congress on Mathematical Software, pp. 308–316. Springer (2016)
36. Shinano, Y., Berthold, T., Heinz, S.: ParaXpress: an experimental extension of the FICO xpress-optimizer to solve hard MIPs on supercomputers. Optimization Methods & Software (2018). https://doi.org/10.1080/10556788.2018.1428602. Accepted for publication on 2018-01-1
37. Shinano, Y., Fujie, T., Kounoike, Y.: Effectiveness of parallelizing the ILOG-CPLEX mixed integer optimizer in the PUBB2 framework. In: Kosch, H., Böszörményi, L., Hellwagner, H. (Eds.) Euro-Par 2003 Parallel Processing. Euro-Par 2003, Lecture Notes in Computer Science, vol. 2790, pp. 770–779 (2003). https://doi.org/10.1109/IPDPS.2016.56
38. Shinano, Y., Heinz, S., Vigerske, S., Winkler, M.: FiberSCIP - a shared memory parallelization of SCIP. INFORMS J. Comput. 30(1), 11–30 (2018). https://doi.org/10.1287/ijoc.2017.0762
39. Trelles, O., Rodriguez, A.: In: Alba, E. (Ed.) Parallel Metaheuristics: A New Class of Algorithms, Wiley Series on Parallel and Distributed Computing, chap. 21. Bioinformatics and Parallel Metaheuristics, pp. 517–549. Wiley (2005)
40. Verstichel, J., De Causmaecker, P., Spieksma, F., Vanden Berghe, G.: Exact and heuristic methods for placing ships in locks. Eur. J. Oper. Res. 235(2), 387–398 (2014). https://doi.org/10.1016/j.ejor.2013.06.045. https://lirias.kuleuven.be/handle/123456789/403645

Model Predictive Q-Learning (MPQ-L) for Bilinear Systems

Minh Q. Phan and Seyed Mahdi B. Azad

Abstract This paper provides a conceptual framework to design an optimal controller for a bilinear system by reinforcement learning. Model Predictive Q-Learning (MPQ-L) combines Model Predictive Control (MPC) with Q-Learning. MPC finds an initial sub-optimal controller from which a suitable parameterization of the Q-function is determined. The Q-function and the controller are then updated by reinforcement learning to optimality.

1 Introduction

Model Predictive Control (MPC) emerged as one of the most powerful modern control methods [4]. MPC minimizes a finite-time receding cost function. As the prediction horizon increases, the MPC solution approaches the infinite-time optimal control solution, thus providing a compromise between optimal performance and computational complexity. MPC can incorporate system identification, leading to data-based control methods that design feedback controllers for stabilization, tracking, and/or disturbance rejection directly from input-output measurements.

In parallel with the development of modern control theory, Reinforcement Learning (RL) offers a seemingly different framework to solve problems that are often not treated in standard optimal control theory textbooks, such as playing games or solving puzzles [22, 26]. However, RL can also be applied to solving optimal control problems. Recently, a number of publications appeared in the literature to bridge the gap between optimal control and reinforcement learning [1–3, 5–8, 10, 15–18, 23–25, 27–29]. This paper is a contribution in this regard. In [20] we related a standard RL method called Q-Learning to a standard modern control method called LQR (Linear Quadratic Regulator) through MPC. We demonstrated when and how LQR, MPC, and Q-Learning produce identical feedback gains.

M. Q. Phan (B) · S. M. B. Azad
Thayer School of Engineering, Dartmouth College, Hanover, NH 03755, USA

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_5



In [21] we developed an input-decoupled version of Q-Learning with two basic features: (1) Instead of finding the traditional Q-function which is a function of state and input, $Q(x(k), u(k))$, the algorithm is re-expressed to find $Q^*(x(k))$, which is the minimum of $Q(x(k), u(k))$ over $u(k)$ directly, and (2) For a multiple-input problem, the optimal controller is found for each input variable at a time, thus replacing a more challenging multiple-variable optimization problem by a series of much simpler single-variable optimization problems without losing optimality. The decoupled solution is made possible by the newly derived input-decoupled Q-functions, which are functions of the state and one input variable at a time, $Q_i(x(k), u_i(k))$ for the i-th input variable. In this paper, we consider bilinear systems where MPC provides an initial sub-optimal controller for Q-Learning to iterate to optimality. Our previous work in Q-Learning applied to the magneto-hydrodynamic control problem and its relationship to MPC can be found in [11–14].

2 Model Predictive Control

Consider an n-state, m-input discrete-time bilinear system of the form:

$$x(k+1) = A x(k) + \sum_{i=1}^{m} b_i u_i(k) + \sum_{i=1}^{m} N_i x(k) u_i(k) \tag{1}$$

where x(k) denotes the n-by-1 state vector, and $u_i(k)$ denotes the i-th scalar control input, which is the i-th element of the m-by-1 input vector u(k). In the above equation, A is an n-by-n matrix, $b_i$ an n-by-1 column vector, and $N_i$ an n-by-n matrix describing the bilinear coupling between the state and the input variables. We look for a state-feedback controller u(k), which is some unknown function of x(k), denoted as u(k) = g(x(k)), that minimizes the following infinite-time cost function:

$$J = \sum_{k=0}^{\infty} \gamma^k \left[ x(k+1)^T Q x(k+1) + u(k)^T R u(k) \right] \tag{2}$$

where 0 < γ ≤ 1 is a discount factor. In standard optimal control theory there is typically no discount factor (γ = 1), but in reinforcement learning, 0 < γ < 1 is used to ensure convergence of the iterative solution. Instead of working with the infinite-time cost function, MPC works with a finite-duration receding-horizon cost function defined as follows:

$$V(k) = \sum_{i=0}^{r-1} \gamma^i \left[ x(k+1+i)^T Q x(k+1+i) + u(k+i)^T R u(k+i) \right] \tag{3}$$


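The parameter r is called the prediction horizon. Define $Q_\gamma = \Lambda_n \mathbf{Q} \Lambda_n$ and $R_\gamma = \Lambda_m \mathbf{R} \Lambda_m$, where

$$\Lambda_n = \begin{bmatrix} I_n & & & \\ & \sqrt{\gamma}\, I_n & & \\ & & \ddots & \\ & & & \sqrt{\gamma}^{\,r-1} I_n \end{bmatrix}, \qquad \Lambda_m = \begin{bmatrix} I_m & & & \\ & \sqrt{\gamma}\, I_m & & \\ & & \ddots & \\ & & & \sqrt{\gamma}^{\,r-1} I_m \end{bmatrix} \tag{4}$$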

$\mathbf{Q}$ and $\mathbf{R}$ are nr-by-nr and mr-by-mr block diagonal matrices with Q and R on the main block diagonal, respectively. Our strategy is to view the bilinear system (which is nonlinear) as a linear system with a state-dependent input influence matrix:

$$x(k+1) = A x(k) + B(x(k)) u(k) \tag{5}$$

where the n-by-m matrix $B(x(k)) = \left[\, b_1 + N_1 x(k),\ b_2 + N_2 x(k),\ \cdots,\ b_m + N_m x(k) \,\right]$, or B(k) for short, and $u(k) = \left[\, u_1(k),\ u_2(k),\ \cdots,\ u_m(k) \,\right]^T$ is a column vector of the m inputs $u_i(k)$. Propagating the state equation r time steps into the future gives

$$x_r(k+1) = P_1 x(k) + P_2(k) u_r(k) \tag{6}$$

where for simplicity, the following supervectors are defined:

$$x_r(k+1) = \begin{bmatrix} x(k+1) \\ x(k+2) \\ \vdots \\ x(k+r) \end{bmatrix}, \qquad u_r(k) = \begin{bmatrix} u(k) \\ u(k+1) \\ \vdots \\ u(k+r-1) \end{bmatrix} \tag{7}$$

and the matrices $P_1$ and $P_2(k)$ are given as

$$P_1 = \begin{bmatrix} A \\ A^2 \\ \vdots \\ A^r \end{bmatrix}, \qquad P_2(k) = \begin{bmatrix} B(k) & & & \\ A B(k) & B(k+1) & & \\ \vdots & \vdots & \ddots & \\ A^{r-1} B(k) & \cdots & A B(k+r-2) & B(k+r-1) \end{bmatrix} \tag{8}$$

In the following we derive various state-feedback controllers based on the model predictive control principle.
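As a check on the stacked prediction model in Eqs. (6) to (8), the following minimal sketch builds $P_1$ and $P_2(k)$ for a small bilinear system and compares the stacked prediction with a direct simulation of Eq. (5). The numerical values of A, b_i, N_i, the horizon, and the helper names (`B_of_x`, `simulate`, `build_P1_P2`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def B_of_x(b, N, x):
    """State-dependent input matrix B(x) = [b_1 + N_1 x, ..., b_m + N_m x]."""
    return np.column_stack([b[i] + N[i] @ x for i in range(len(b))])

def simulate(A, b, N, x0, u_seq):
    """Propagate x(k+1) = A x(k) + B(x(k)) u(k); returns [x(k), ..., x(k+r)]."""
    xs = [np.asarray(x0, dtype=float)]
    for u in u_seq:
        xs.append(A @ xs[-1] + B_of_x(b, N, xs[-1]) @ u)
    return xs

def build_P1_P2(A, b, N, xs, r):
    """Stack P1 (nr-by-n) and the block lower-triangular P2(k) (nr-by-mr) of Eq. (8)."""
    n, m = A.shape[0], len(b)
    P1 = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(r)])
    P2 = np.zeros((n * r, m * r))
    for row in range(r):
        for col in range(row + 1):
            A_pow = np.linalg.matrix_power(A, row - col)
            P2[row * n:(row + 1) * n, col * m:(col + 1) * m] = A_pow @ B_of_x(b, N, xs[col])
    return P1, P2

# Illustrative 2-state, 2-input system (values are assumptions for this sketch).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
N = [0.1 * np.eye(2), np.array([[0.0, 0.05], [0.05, 0.0]])]
r = 4
x0 = np.array([1.0, -0.5])
u_seq = [0.1 * rng.normal(size=2) for _ in range(r)]

xs = simulate(A, b, N, x0, u_seq)
P1, P2 = build_P1_P2(A, b, N, xs, r)
x_r = np.concatenate(xs[1:])          # supervector x_r(k+1)
u_r = np.concatenate(u_seq)           # supervector u_r(k)
residual = np.max(np.abs(x_r - (P1 @ x0 + P2 @ u_r)))
print("max |x_r - (P1 x + P2 u_r)| =", residual)
```

Note that $P_2(k)$ is built from the states along the trajectory, which is why it depends implicitly on $u_r(k)$ in the designs that follow.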

2.1 Design 1: Zero r-Step Ahead State

From Eq. (6), we have

$$x(k+r) = A^r x(k) + C_r(k) u_r(k) \tag{9}$$


where $C_r(k) = \left[\, A^{r-1} B(k) \ \cdots \ A B(k+r-2) \ \ B(k+r-1) \,\right]$. This design simply finds a sequence of control actions, u(k), u(k + 1), ..., u(k + r − 1), contained in $u_r(k)$ to bring the r-step ahead state x(k + r) to zero. Solving for $u_r(k)$ produces

$$u_r(k) = -C_r(k)^{+} A^r x(k) \tag{10}$$

where $C_r(k)^{+}$ denotes the pseudo-inverse of $C_r(k)$. Because the x(k)-dependent B(x(k)), B(x(k + 1)), ..., B(x(k + r − 1)) in $C_r$ are functions of u(k), u(k + 1), ..., u(k + r − 2) subsumed in $u_r(k)$, the right hand side of Eq. (10) is a function of x(k) and $u_r(k)$. Let this function be denoted by $f_1(\cdot)$, so that $u_r(k) = f_1(u_r(k), x(k))$. Thus, the $u_r(k)$ that produces x(k + r) = 0 is a fixed point of the function $f_1$, and can be found by fixed-point iteration, $u_r^{(j+1)}(k) = f_1(u_r^{(j)}(k), x(k))$. The input action u(k) to use at time step k is extracted from the first entry of the fixed-point $u_r(k)$.

2.2 Design 2: Input-Weighted r-Step Ahead State

Design 2 is an enhancement of Design 1 to include a penalty on the magnitude of $u_r(k)$ by minimizing the following cost function:

$$V(k) = x(k+r)^T Q x(k+r) + u_r(k)^T R_\gamma u_r(k) \tag{11}$$

Substituting in the expression for x(k + r) in Eq. (9), taking the partial derivative of V(k) with respect to $u_r(k)$, and setting the result to zero produces

$$u_r(k) = -\left[ C_r(k)^T Q C_r(k) + R_\gamma \right]^{-1} C_r(k)^T Q A^r x(k) \tag{12}$$

In taking the above partial derivative we ignore the variations of the states x(k + 1), ..., x(k + r − 1) embedded in $C_r(k)$ due to variations about the fixed point $u_r(k)$. In reality, these states vary with $u_r(k)$. Thus the above solution for $u_r(k)$ is sub-optimal. Because $C_r(k)$ is implicitly a function of $u_r(k)$, the above equation defines the fixed point $u_r(k)$, $u_r(k) = f_2(u_r(k), x(k))$, where $f_2(\cdot)$ denotes the right hand side of Eq. (12). Again, fixed-point iteration is used to find $u_r(k)$ from which u(k) is found. Moreover, when Q is set to the identity matrix and $R_\gamma$ is set to zero, Design 2 reduces to Design 1 because $\left[ C_r(k)^T C_r(k) \right]^{-1} C_r(k)^T = C_r(k)^{+}$.
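One point worth making explicit: when $C_r(k)$ has more columns than rows (n < mr), the product $C_r(k)^T C_r(k)$ is singular, so the reduction of Design 2 to Design 1 is best read as the limit of Eq. (12) as $R_\gamma$ tends to zero. The sketch below checks this numerically for a random stand-in matrix; the sizes, the random `C`, and the helper name `design2_gain` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 2, 1, 6                       # illustrative sizes: C_r is n-by-(m*r), i.e. wide
C = rng.normal(size=(n, m * r))         # random stand-in for C_r(k), not a real system
Q = np.eye(n)                           # Q set to identity, as in the text

def design2_gain(C, Q, R_gamma):
    """Matrix that maps A^r x(k) to -u_r(k) in Eq. (12)."""
    return np.linalg.solve(C.T @ Q @ C + R_gamma, C.T @ Q)

eps = 1e-8                              # small input weight, R_gamma = eps * I
gap = np.max(np.abs(design2_gain(C, Q, eps * np.eye(m * r)) - np.linalg.pinv(C)))
print("max deviation from the pseudo-inverse:", gap)
```

With a small but nonzero input weight the Design 2 gain is numerically indistinguishable from the pseudo-inverse used in Design 1.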

2.3 Design 3: Finite-Duration Receding Cost

Design 3 uses the full receding-horizon cost function defined in Eq. (3). Substituting in the expression for $x_r(k+1)$, V(k) can be written succinctly as

$$V(k) = x_r(k+1)^T Q_\gamma x_r(k+1) + u_r(k)^T R_\gamma u_r(k) = \left[ P_1 x(k) + P_2(k) u_r(k) \right]^T Q_\gamma \left[ P_1 x(k) + P_2(k) u_r(k) \right] + u_r(k)^T R_\gamma u_r(k) \tag{13}$$

At any time step k, the open-loop control input sequence $u_r(k)$ is found by taking the partial derivative of V(k) with respect to $u_r(k)$, and setting the result to zero to produce

$$u_r(k) = -\left[ P_2(k)^T Q_\gamma P_2(k) + R_\gamma \right]^{-1} P_2(k)^T Q_\gamma P_1 x(k) \tag{14}$$

Again, because the variations of x(k + 1), ..., x(k + r − 1) embedded in $P_2(k)$ due to variations about the fixed point $u_r(k)$ are ignored, the above solution is sub-optimal. For a given bilinear system A, B, N, the user-specified weighting matrices Q, R, the prediction horizon r, and the discount factor γ, the right hand side of Eq. (14) is a known function of $u_r(k)$ and the current state x(k). Let this function be denoted by $f_3(\cdot)$, so that $u_r(k) = f_3(u_r(k), x(k))$. The sub-optimal $u_r(k)$ is a fixed point of $f_3$, and can be found by fixed-point iteration, $u_r^{(j+1)}(k) = f_3(u_r^{(j)}(k), x(k))$. The sub-optimal model predictive control action u(k) to use at time step k is the first entry of the fixed-point $u_r(k)$.
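The fixed-point iteration on Eq. (14) can be sketched for a scalar system as follows. The default parameter values mirror the scalar example used later in the paper (A = −1.1, b = 1, N = 1, r = 40, Q = 1, R = 100, γ = 1); the function name `design3_control`, the stopping tolerance, the iteration cap, and the optional relaxation factor `relax` are assumptions of this sketch. Convergence in the bilinear case is not guaranteed by the code, so only the linear special case N = 0 is exercised here, where the iteration settles after one update.

```python
import numpy as np

def design3_control(x, A=-1.1, b=1.0, N=1.0, r=40, Q=1.0, R=100.0,
                    gamma=1.0, max_iter=200, tol=1e-10, relax=1.0):
    """First entry of u_r(k) from fixed-point iteration on Eq. (14), scalar case."""
    w = gamma ** np.arange(r)                 # discount weights gamma^i
    Qg, Rg = np.diag(Q * w), np.diag(R * w)   # Q_gamma and R_gamma
    P1 = A ** np.arange(1, r + 1)             # [A, A^2, ..., A^r]
    u = np.zeros(r)                           # initial guess for u_r(k)
    converged = False
    for _ in range(max_iter):
        # Predict x(k), ..., x(k+r-1) with the current guess of u_r(k).
        xs = np.empty(r)
        xs[0] = x
        for i in range(r - 1):
            xs[i + 1] = A * xs[i] + (b + N * xs[i]) * u[i]
        Bk = b + N * xs                       # B(k), ..., B(k+r-1)
        P2 = np.zeros((r, r))
        for row in range(r):
            for col in range(row + 1):
                P2[row, col] = A ** (row - col) * Bk[col]
        if not np.all(np.isfinite(P2)):
            break                             # iteration diverged; give up
        u_new = -np.linalg.solve(P2.T @ Qg @ P2 + Rg, P2.T @ Qg @ (P1 * x))
        step = np.max(np.abs(u_new - u))
        u = (1.0 - relax) * u + relax * u_new
        if step < tol:
            converged = True
            break
    return u[0], converged

# Linear special case (N = 0): P2 no longer depends on u_r(k).
u_lin, ok_lin = design3_control(1.0, N=0.0)
print("u(k) at x(k) = 1 with N = 0:", u_lin, "converged:", ok_lin)
```

Calling `design3_control(x)` with the default N = 1 applies the same iteration to the bilinear model; a value of `relax` below 1 is one possible way to damp the iteration if it oscillates.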

2.4 Finding a State Feedback Control Law

In the linear case, $N_i = 0$, the MPC control input u(k) is a linear function of x(k), $u(k) = G_{MPC}\, x(k)$. In the bilinear case, the control actions are found by solving for $u_r(k)$ from Eqs. (10), (12), or (14) for a given x(k). In each case, a state-feedback controller can be represented generically as

$$u(k) = g_i(x(k)), \quad i = 1, 2, 3, \ldots, r \tag{15}$$

Because of the complexity of $f_i(\cdot)$, it is not desirable to solve for $g_i(\cdot)$ analytically. Instead it can be found by the tools of “system identification” as follows. For each x(k), $u_r(k)$ is solved by fixed-point iteration, and its first entry, the sub-optimal control input u(k), is extracted. We thus have one pair {x(k), u(k)} containing the control input u(k) to use with the state x(k). By varying x(k), several such state-action pairs are generated. From this state-action data, the underlying controller function $u(k) = g_i(x(k))$ can be identified by any nonlinear system identification technique. In the trivial case where the state is one dimensional, simply plotting u(k) as a function of x(k) reveals the nonlinear controller. When the state is two dimensional, plotting u(k) as a function of the two state elements of x(k) reveals the nonlinear controller surface. For higher dimensional problems, the controller function can no longer be visualized, but it can still be modeled, for example, by an artificial neural network, or more recently by HDMR (High Dimensional Model Representation) as done in [19].
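For a scalar state, the simplest way to turn the state-action pairs into a usable controller is a lookup table with linear interpolation. The sketch below illustrates only this representation step; the function `stand_in_mpc` is a made-up smooth curve standing in for "solve Eq. (14) by fixed-point iteration at x", and the grid spacing is likewise an assumption.

```python
import numpy as np

def stand_in_mpc(x):
    # Hypothetical placeholder for the MPC solve at state x (NOT the paper's controller).
    return 0.2261 * x / (1.0 + 0.2 * x**2)

# State-action pairs {x(k), u(k)} collected on a grid of states.
x_grid = np.linspace(-4.0, 4.0, 81)
u_grid = stand_in_mpc(x_grid)

def g_table(x):
    """Non-parametric controller: interpolate between the stored pairs."""
    return np.interp(x, x_grid, u_grid)   # note: clamps outside the grid

# Accuracy of the table between grid points.
x_test = np.linspace(-3.95, 3.95, 200)
table_error = np.max(np.abs(g_table(x_test) - stand_in_mpc(x_test)))
print("max interpolation error:", table_error)
```

Because `np.interp` holds the end values outside the grid, the state domain covered by the table should contain every state at which the controller will be queried.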


3 The Q-Function in Q-Learning

So far, the sub-optimal state feedback controllers are found by MPC. We now proceed to make them optimal (actually minimizing a cost function) by using an RL method called Q-Learning. In Q-Learning, the Q-function is a key concept which can be explained as follows. Consider a controller u(k) = g(x(k)). At time step k, suppose that the system is at some arbitrary state x(k), and some arbitrary action u(k) is taken that brings the state from x(k) to x(k + 1). From time step k + 1 on, the controller g(x(k)) is used to generate the control actions, i.e., u(k + i) = g(x(k + i)), i = 1, 2, ..., r − 1. The value of the Q-function at (x(k), u(k)) is the cumulative r-step cost associated with the states x(k + 1), x(k + 2), ..., x(k + r) and the actions u(k), u(k + 1), ..., u(k + r − 1) defined in Eq. (3). The Q-function is “almost”, but not exactly, the r-step cost-to-go function associated with a controller g(x(k)) because the controller g(x(k)) is not used to compute u(k) at x(k). It is only used to compute u(k + 1), ..., u(k + r − 1). At time step k, the control action u(k) is a free variable.

From the above consideration, associated with a controller g(x(k)), the r-step Q-function is a function of x(k) and u(k) only, and is denoted by $Q_r(x(k), u(k))$. If g(x(k)) is an optimal controller, then the Q-function associated with it is said to be an optimal Q-function. If the system state x(k) is known, then such an optimal Q-function becomes a function of the control action u(k) only. Because the optimal r-step Q-function gives the cumulative r-step cost that is already minimized from time step k + 1 on, the only free variable that remains is u(k). It is therefore clear that the optimal input action associated with the state x(k) is the one that minimizes the optimal Q-function. In other words, the control action that minimizes the optimal Q-function is the optimal control action.

4 Q-Function Computed from System Model and Controller

Given a system x(k + 1) = Ax(k) + B(x(k))u(k) and a controller u(k) = g(x(k)), the corresponding r-step Q-function $Q_r(x(k), u(k))$ can be computed as follows:

$$Q_r(x(k), u(k)) = \sum_{i=0}^{r-1} \gamma^i \left[ x(k+1+i)^T Q x(k+1+i) + u(k+i)^T R u(k+i) \right] = U(k) + \gamma U(k+1) + \gamma^2 U(k+2) + \cdots + \gamma^{r-1} U(k+r-1) \tag{16}$$

where x(k) and u(k) are treated as free variables. U(k) is a function of x(k) and u(k) because

$$U(k) = x(k+1)^T Q x(k+1) + u(k)^T R u(k) = \left[ A x(k) + B(x(k)) u(k) \right]^T Q \left[ A x(k) + B(x(k)) u(k) \right] + u(k)^T R u(k) \tag{17}$$


Thus, U(k) can be easily computed for any values of x(k) and u(k). Next, from x(k) and u(k), x(k + 1) can be computed, x(k + 1) = Ax(k) + B(x(k))u(k). From k + 1 on, the controller must be used, i.e., u(k + 1) = g(x(k + 1)) is used to produce x(k + 2),

$$x(k+2) = A x(k+1) + B(x(k+1)) u(k+1) \tag{18}$$

so that U(k + 1) can now be computed. In the same manner, U(k + 2) to U(k + r − 1) can be computed. The value of $Q_r(x(k), u(k))$ for any arbitrary pair of x(k) and u(k) can then be determined. By specifying a domain of interest for x(k) and u(k), the r-step Q-function $Q_r(x(k), u(k))$ can be numerically mapped out in this domain.
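The roll-out described above can be sketched for the scalar bilinear case as follows. The system values are those of the scalar example in Sect. 8; the controller `g_lin` (a linear gain) is only a stand-in for an MPC-based controller, and the grid limits and function names are assumptions of this sketch.

```python
import numpy as np

A, b, N = -1.1, 1.0, 1.0        # scalar bilinear system of the illustrative example
Qw, Rw = 1.0, 100.0             # state and input weights Q and R

def utility(x, u):
    """U(k) of Eq. (17) together with the successor state x(k+1)."""
    x_next = A * x + (b + N * x) * u
    return Qw * x_next**2 + Rw * u**2, x_next

def q_function(x, u, g, r=40, gamma=1.0):
    """r-step Q-function of Eq. (16): u(k) is free, g is used from step k+1 on."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    total = np.zeros(np.broadcast(x, u).shape)
    for i in range(r):
        U, x = utility(x, u)     # cost of the current step, then advance the state
        total = total + gamma**i * U
        u = g(x)                 # controller takes over after the first action
    return total

def g_lin(x):
    # Stand-in controller (assumption): linear gain only, not the MPC controller.
    return 0.2261 * x

# Map the Q-function over a small domain of (x, u) pairs.
X, Uu = np.meshgrid(np.linspace(-1.0, 1.0, 21), np.linspace(-0.5, 0.5, 21), indexing="ij")
Q_map = q_function(X, Uu, g_lin)
print("Q_r(1.0, 0.5) with r = 1:", float(q_function(1.0, 0.5, g_lin, r=1)))
```

For r = 1 the value reduces to U(k) alone, which gives a quick hand check: with x(k) = 1 and u(k) = 0.5 the successor state is −0.1 and U(k) = 0.01 + 25.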

5 A Representation of the Q-Function

Instead of numerically generating the Q-function “point-wise” for any pair of x(k) and u(k), one can parameterize the r-step Q-function, and find the coefficients that model it. We now describe a parameterization that is natural for bilinear systems, which involve the products between the state and the input variables. Recall that in general the Q-function is a function of x(k) and u(k). At any state x(k), the Q-function becomes a function of u(k) only, and the optimal action to take is the one that minimizes this Q-function. As a motivation for a suitable model structure for the Q-function, let us first consider the scalar-state scalar-input case. Assuming that $Q_r(x(k), u(k))$ is a smooth function in x(k) and u(k), the constant-x(k) contour is some s-th order polynomial function of u(k),

$$Q_r(x(k), u(k)) = c_s u(k)^s + c_{s-1} u(k)^{s-1} + \cdots + c_1 u(k) + c_0 \tag{19}$$

where the coefficients $c_i$ are some q-th order polynomial functions of x(k),

$$c_s = p_{(s,q)} x(k)^q + p_{(s,q-1)} x(k)^{q-1} + \cdots + p_{(s,1)} x(k) + p_{(s,0)} \tag{20}$$

$$c_{s-1} = p_{(s-1,q)} x(k)^q + p_{(s-1,q-1)} x(k)^{q-1} + \cdots + p_{(s-1,1)} x(k) + p_{(s-1,0)} \tag{21}$$

$$\vdots$$

$$c_1 = p_{(1,q)} x(k)^q + p_{(1,q-1)} x(k)^{q-1} + \cdots + p_{(1,1)} x(k) + p_{(1,0)} \tag{22}$$

$$c_0 = p_{(0,q)} x(k)^q + p_{(0,q-1)} x(k)^{q-1} + \cdots + p_{(0,1)} x(k) + p_{(0,0)} \tag{23}$$

Define

$$\bar{x}(k) = \begin{bmatrix} x(k)^q \\ x(k)^{q-1} \\ \vdots \\ x(k) \\ 1 \end{bmatrix} \tag{24}$$

$$\bar{u}(k) = \begin{bmatrix} u(k)^s \\ u(k)^{s-1} \\ \vdots \\ u(k) \\ 1 \end{bmatrix} \tag{25}$$

and S(x(k), u(k)) from $\left[\, S(x(k), u(k)) \ \ 1 \,\right] = \left( \bar{x}(k) \otimes \bar{u}(k) \right)^T$. It can be shown that

$$Q_r(x(k), u(k)) = S(x(k), u(k)) P \tag{26}$$

Given a sufficient number of (x(k), u(k)) pairs, denoted by $(x(k), u(k))_i$, i = 1, 2, 3, ..., that are used to construct the Q-function, we can write down all available equations for P as

$$\mathbf{Q} = \mathbf{S} P + E \tag{27}$$

where

$$\mathbf{Q} = \begin{bmatrix} Q_r(x(k), u(k))_1 \\ Q_r(x(k), u(k))_2 \\ Q_r(x(k), u(k))_3 \\ \vdots \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} S(x(k), u(k))_1 \\ S(x(k), u(k))_2 \\ S(x(k), u(k))_3 \\ \vdots \end{bmatrix} \tag{28}$$

and E is the fitting error. By minimizing the norm of E, the coefficients of the Q-function defined in P can be found by least-squares,

$$P = \mathbf{S}^{+} \mathbf{Q} \tag{29}$$

where the + sign denotes the pseudo-inverse. Recall that the Q-value can be computed for any pair of (x(k), u(k)) for a given system and a given controller. In computing the Q-value at (x(k), u(k)), both x(k) and u(k) can be arbitrary, but from u(k + 1) on the controller must be used. The purpose of finding P in this manner is to determine the sufficient orders of s and q to model the Q-function that corresponds to the controller u(k) = g(x(k)) obtained through MPC. In so doing a suitable form for the Q-function can be found that avoids both under-modeling and over-modeling. Under-modeling the Q-function prevents it from achieving optimality. Over-modeling it creates unnecessary data requirements and convergence issues. Once an adequate representation for the Q-function is obtained, the next step will be updating it via the well-known recurrence equation in Q-Learning.

To generalize to higher dimensions of state and control input variables, several options are available. The most straightforward option is extending the polynomial basis functions to higher dimensions. For example, for a system with two states $x_1(k)$ and $x_2(k)$, and two inputs $u_1(k)$ and $u_2(k)$, a k-th order polynomial expansion of $Q_r(x(k), u(k))$ takes the form:

$$Q_r(x(k), u(k)) = \sum_{i=0}^{q_1} \sum_{j=0}^{q_2} \sum_{k=0}^{s_1} \sum_{\ell=0}^{s_2} p_{(i,j,k,\ell)}\, x_1(k)^i x_2(k)^j u_1(k)^k u_2(k)^\ell \tag{30}$$

where $q_1 + q_2 + s_1 + s_2 = k$, and $p_{(0,0,0,0)} = 0$ because $Q_r(0, 0) = 0$. The products of the elements of the state and the input variables serve as the basis functions in this representation, $\phi_{(i,j,k,\ell)} = x_1(k)^i x_2(k)^j u_1(k)^k u_2(k)^\ell$. Alternatively, one may use multi-resolution basis functions to capture the gross feature of the Q-function first by lower-resolution


basis functions, followed by finer features by higher-resolution basis functions. Both of these options are linear in the parameter estimation step, and later in the updating of the Q-function using the recurrence equation. One can also use the generic representation of an artificial neural network (ANN). Although an ANN has universal approximation capability, the relationship between the network parameters and the available data is nonlinear which, generally speaking, causes the parameter estimation step to be more challenging.
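For the scalar-state, scalar-input case, the Kronecker-product basis of Eq. (26) and the least-squares fit of Eq. (29) can be sketched as below. The target function `q_synthetic` is a made-up polynomial used only so that the fit can be checked exactly; in practice the Q-values would come from rolling out the model and controller as in Sect. 4. The function names and sample ranges are assumptions.

```python
import numpy as np

def basis_row(x, u, q, s):
    """Row S(x, u): Kronecker product of the power vectors with the constant term dropped."""
    xbar = x ** np.arange(q, -1, -1)      # [x^q, ..., x, 1]
    ubar = u ** np.arange(s, -1, -1)      # [u^s, ..., u, 1]
    return np.kron(xbar, ubar)[:-1]       # last entry is 1*1, removed since Q(0, 0) = 0

def fit_P(xs, us, q_values, q, s):
    """Least-squares estimate of P in Q = S P + E (pseudo-inverse solution)."""
    S = np.array([basis_row(x, u, q, s) for x, u in zip(xs, us)])
    P, *_ = np.linalg.lstsq(S, q_values, rcond=None)
    return P, S

def q_synthetic(x, u):
    # Made-up Q-function (assumption) that lies inside the q = 2, s = 2 basis.
    return 3.0 * x**2 + 2.0 * x * u + 5.0 * u**2 + 0.5 * x**2 * u

rng = np.random.default_rng(0)
xs = rng.uniform(-4.0, 4.0, 200)
us = rng.uniform(-1.0, 1.0, 200)
P, S = fit_P(xs, us, q_synthetic(xs, us), q=2, s=2)
fit_error = np.max(np.abs(S @ P - q_synthetic(xs, us)))
print("max fitting error:", fit_error)
```

With q = s = 2 the columns are ordered as x²u², x²u, x², xu², xu, x, u², u, so for example `P[6]` is the coefficient of u². Repeating the fit for increasing q and s, and watching the fitting error, is one way to select orders that neither under-model nor over-model the Q-function.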

6 Updating the Q-Function via a Recurrence Equation

Having found the Q-function from the MPC controller, we can now update it using the recurrence equation of Q-Learning. We established in [20] that the recurrence relationship for a finite-duration $Q_r(x(k), u(k))$ satisfies

$$Q_r(x(k), u(k)) = \gamma Q_r(x(k+1), u(k+1)) + U(k) - \gamma^r U(k+r) \tag{31}$$

where the input u(k + 1) in the above equation must be an optimal input, i.e.,

$$u(k+1) = \arg\min_{u(k+1)} Q_r(x(k+1), u(k+1)) \tag{32}$$

The Q-function at x(k + 1) with the above optimal u(k + 1) substituted in is the minimum among all possible choices for u(k + 1). Therefore, an exact recurrence relationship for the Q-function for a finite prediction horizon r can be written as

$$Q_r(x(k), u(k)) = \gamma \left[ \min_{u(k+1)} Q_r(x(k+1), u(k+1)) \right] + U(k) - \gamma^r U(k+r) \tag{33}$$

The presence of U(k + r) in the above relation is problematic because it involves future undetermined input actions. However, if 0 < γ < 1 and r is sufficiently large then the last term can be neglected. Otherwise, if γ = 1 and U(k + r) converges to zero, as in a vibration suppression problem where both x(k) and u(k) converge to zero in the steady state as r tends to infinity, then the last term can also be neglected. In practice, choosing 0 < γ < 1 enhances stability of the learning process. We then obtain the well-known Q-Learning recurrence relationship as

$$Q_r(x(k), u(k)) = \gamma \left[ \min_{u(k+1)} Q_r(x(k+1), u(k+1)) \right] + U(k) \tag{34}$$

In our present problem, the Q-function is parameterized by P. An updated P, denoted by P′, is found from

$$S(x(k), u(k)) P' = \gamma S(x(k+1), u(k+1)) P' + U(k) \tag{35}$$


where x(k), u(k) can be arbitrary, but u(k + 1) is determined from u(k + 1) = g(x(k + 1)). Packaging all the available equations together, P′ can be solved by least-squares (with regularization if necessary) from

$$\left( \begin{bmatrix} S(x(k), u(k))_1 \\ S(x(k), u(k))_2 \\ \vdots \end{bmatrix} - \gamma \begin{bmatrix} S(x(k+1), u(k+1))_1 \\ S(x(k+1), u(k+1))_2 \\ \vdots \end{bmatrix} \right) P' = \begin{bmatrix} U(k)_1 \\ U(k)_2 \\ \vdots \end{bmatrix} \tag{36}$$

where $S(x(k), u(k))_1$ and $U(k)_1$ denote S and U computed with the pair $(x(k), u(k))_1$, where x(k + 1) is produced from x(k) and u(k), and u(k + 1) is computed from the controller u(k + 1) = g(x(k + 1)), etc. Once P′ is estimated, an updated Q-function is found from

$$Q'(x(k), u(k)) = S(x(k), u(k)) P' \tag{37}$$

and the updated controller is found from

$$u'(k) = g'(x(k)) = \arg\min_{u(k)} Q'(x(k), u(k)) \tag{38}$$

Note that the subscript r is dropped in Eq. (37) because the recurrence equation is in fact valid for r approaching infinity for 0 < γ < 1. The updated controller u′(k) = g′(x(k)) is used to generate an updated Q-function denoted by Q″(x(k), u(k)) via P″. From Q″ = SP″, an updated u″(k) = g″(x(k)) is found. The process continues until convergence. There are fundamental advantages of using the recurrence equation to update the Q-function.

First, there is no need to compute $Q_r(x(k), u(k))$ to find P′ from Eq. (37). Instead, P′ is found from Eq. (35), which only requires x(k), u(k), x(k + 1), and u(k + 1), where x(k) and u(k) can be arbitrary, x(k + 1) is found from x(k), u(k), and u(k + 1) is found from u(k + 1) = g(x(k + 1)). U(k) is computed from x(k + 1) and u(k) by definition. In the language of RL, this utility is the feedback that the system receives from interacting with the environment when the action u(k) is taken. Thus the Q-function can be updated as the system interacts with the environment and receives feedback from it. This is a very powerful feature of the RL framework.

Second, the role of the discount factor γ is revealed from the finite-time version of the recurrence equation given in Eq. (31). The last term can be neglected if the utility U(k + r) approaches zero, or if 0 < γ < 1 and r is sufficiently large. Since U(k + r) might not approach zero if the current controller is not a stabilizing controller, the last term in Eq. (31) can still be neglected by using a positive discount factor that is strictly less than 1 together with a sufficiently large r. This observation explains the widespread use of a positive discount factor less than 1 and its stabilizing effect in practically all RL algorithms.

Third, the recurrence equation given by Eq. (34) is in fact valid for r approaching infinity. The implication of this result is that the Q-function derived from Eq. (36) is indeed the Q-function associated with the infinite-time optimal control problem, and the final controller it produces minimizes the infinite-time cost


function. The condition for a “sufficiently large” value of r is satisfied implicitly because r can be taken to be infinity. In this sense, one might say that Q-Learning “outsmarts” traditional MPC which is computed with a finite choice of r .
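The least-squares update of Eq. (36) can be sketched for a scalar system as follows. To make the result checkable, the sketch uses the linear special case N = 0 with a linear stand-in controller, for which the Q-function is exactly quadratic and therefore lies inside the q = s = 2 basis; the fitted value is then compared with a long roll-out of the same controller. The sample ranges, the gain 0.2261 used for `g`, and all function names are assumptions of this sketch.

```python
import numpy as np

A, b, N = -1.1, 1.0, 0.0          # N = 0: linear special case, used here as a check
Qw, Rw, gamma = 1.0, 100.0, 1.0
q, s = 2, 2                       # polynomial orders of the Q-function model

def basis(x, u):
    """Row S(x, u) of the Kronecker basis with the constant term dropped."""
    return np.kron(x ** np.arange(q, -1, -1), u ** np.arange(s, -1, -1))[:-1]

def step(x, u):
    """Successor state and utility U(k) for the pair (x(k), u(k))."""
    x_next = A * x + (b + N * x) * u
    return x_next, Qw * x_next**2 + Rw * u**2

def evaluate_policy(g, samples):
    """Solve Eq. (36) by least squares for the controller g."""
    rows, rhs = [], []
    for x, u in samples:
        x_next, U = step(x, u)
        rows.append(basis(x, u) - gamma * basis(x_next, g(x_next)))
        rhs.append(U)
    P, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return P

def rollout_q(x, u, g, steps=400):
    """Reference Q-value from a long simulation (u free at step k, g afterwards)."""
    total = 0.0
    for i in range(steps):
        x, U = step(x, u)
        total += gamma**i * U
        u = g(x)
    return total

g = lambda x: 0.2261 * x          # stand-in stabilizing controller (assumption)
rng = np.random.default_rng(1)
samples = np.column_stack([rng.uniform(-4.0, 4.0, 300), rng.uniform(-1.0, 1.0, 300)])
P_new = evaluate_policy(g, samples)

q_fit = float(basis(1.0, 0.5) @ P_new)
q_ref = rollout_q(1.0, 0.5, g)
print("fitted Q:", q_fit, " roll-out Q:", q_ref)
```

Only the one-step quantities x(k), u(k), x(k + 1), u(k + 1), and U(k) enter the fit, yet the fitted value agrees with the long-horizon roll-out, which illustrates the first and third advantages listed above.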

7 Model Predictive Q-Learning Algorithm (MPQ-L)

The diagram below summarizes the MPQ-L algorithm:

g(x(k)) → S → P′ → Q′ → g′(x(k)) → P″ → Q″ → g″(x(k)) → · · ·

Step 1: Starting with a nominal model of the system, find a sub-optimal controller by MPC using Design 1, 2, or 3 with a sufficiently large value for r and 0 < γ ≤ 1. Denote this MPC controller as u(k) = g(x(k)). The input u(k) associated with x(k) is generated directly via fixed-point iteration, where x(k) is randomly or systematically selected within a certain domain of the state space. We do not actually need an explicit representation for g(x(k)). Instead, g(x(k)) can be represented non-parametrically as a collection of x(k) and u(k) values.

Step 2: Compute $Q_r(x(k), u(k))$ from the system model and the controller u(k) = g(x(k)) for different combinations of (x(k), u(k)).

Step 3: Using values of (x(k), u(k)) and $Q_r(x(k), u(k))$ obtained in Step 2, find a suitable structure for the Q-function determined by two parameters s and q. S(x(k), u(k)) contains the necessary basis functions from which $Q_r(x(k), u(k))$ associated with g(x(k)) can be adequately represented.

Step 4: Update the Q-function by first finding P′ from the recurrence equation given in Eq. (36), where u(k + 1) = g(x(k + 1)) is found by interpolation from the collection of x(k) and u(k) values that define g(x(k)). An updated Q-function, denoted by Q′(x(k), u(k)), can then be computed from Q′(x(k), u(k)) = S(x(k), u(k))P′.

Step 5: Find an updated controller g′(x(k)) from Q′(x(k), u(k)) according to Eq. (38). The input u′(k) associated with x(k) is the value of u(k) that minimizes Q′(x(k), u(k)). It is not necessary to find an explicit representation for g′(x(k)). Instead, g′(x(k)) can be defined non-parametrically as a collection of x(k) and u′(k) values.

Step 6: With g′(x(k)), which can be defined as a collection of x(k) and u′(k) values, compute P″ from the recurrence equation using the same structure of the Q-function defined in Step 3, where u′(k + 1) = g′(x(k + 1)) is found by interpolation. From P″, a new Q-function can be found from Q″ = SP″.

Step 7: Find an updated controller g″(x(k)) from Q″(x(k), u(k)) according to $u''(k) = g''(x(k)) = \arg\min_{u(k)} Q''(x(k), u(k))$.

Step 8: Repeat Steps 6 and 7 until convergence. Notice that the nominal model and an initial sub-optimal model predictive controller are used in Steps 1, 2, and


3 to seed the algorithm. From Step 4 on, improvement to the Q-function (and the controller) is achieved via the recurrence equation through the system interaction with the environment.
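The iteration of Steps 4 to 8 can be sketched compactly for a scalar system. This is a simplified illustration under stated assumptions, not the authors' implementation: the initial controller is a hand-picked linear gain rather than an MPC design, the Q-function orders are fixed at q = s = 2, the input is restricted to |u| ≤ `U_MAX`, and the demonstration uses the linear special case N = 0 so that the converged controller can be compared with the known LQR gain of about 0.2261.

```python
import numpy as np

A, b, N = -1.1, 1.0, 0.0      # N = 0 is the linear check case; the paper's example uses N = 1
Qw, Rw, gamma = 1.0, 100.0, 1.0
q, s = 2, 2                   # polynomial orders of the Q-function model (assumption)
U_MAX = 1.0                   # assumed bound on |u| for the minimization

def basis(x, u):
    return np.kron(x ** np.arange(q, -1, -1), u ** np.arange(s, -1, -1))[:-1]

def fit_P(g, samples):
    """Steps 4 and 6: least-squares solution of Eq. (36) for the controller g."""
    rows, rhs = [], []
    for x, u in samples:
        x1 = A * x + (b + N * x) * u
        rows.append(basis(x, u) - gamma * basis(x1, g(x1)))
        rhs.append(Qw * x1**2 + Rw * u**2)
    return np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]

def argmin_u(P, x):
    """Steps 5 and 7: minimize the fitted Q(x, u) over u in [-U_MAX, U_MAX]."""
    Pm = np.append(P, 0.0).reshape(q + 1, s + 1)   # restore the zero constant term
    c = (x ** np.arange(q, -1, -1)) @ Pm           # coefficients in u, highest power first
    roots = np.roots(np.polyder(c))                # stationary points of the polynomial in u
    cand = [-U_MAX, U_MAX] + [z.real for z in roots
                              if abs(z.imag) < 1e-9 and abs(z.real) <= U_MAX]
    return min(cand, key=lambda u: np.polyval(c, u))

def mpq_l(g0, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.column_stack([rng.uniform(-3.0, 3.0, 400), rng.uniform(-0.5, 0.5, 400)])
    x_grid = np.linspace(-4.0, 4.0, 161)           # states at which the controller is tabulated
    g = g0
    for _ in range(n_iter):                        # Step 8: repeat evaluation and improvement
        P = fit_P(g, samples)
        u_grid = np.array([argmin_u(P, x) for x in x_grid])
        g = lambda x, ug=u_grid: np.interp(x, x_grid, ug)   # non-parametric controller
    return g, P

g_final, P_final = mpq_l(lambda x: 0.3 * x)        # sub-optimal starting gain (assumption)
gain = float(g_final(1.0))
print("controller value at x = 1 after iteration:", gain)
```

Setting N = 1.0 and raising q and s switches the sketch to the bilinear example, but the sample domain, the table range, and the input bound would then need to be chosen so that x(k + 1) stays inside the tabulated region, and convergence is not guaranteed by this code.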

8 An Illustrative Example

This paper provides a conceptual framework that combines MPC with Q-Learning for a bilinear system. Although the general approach is applicable to a multiple-state, multiple-input control problem, our current focus is on how various components of MPC and Q-Learning are related to each other. To this end, both the concept and the algorithm can be better illustrated if the Q-function can be visualized. Because the Q-function is a function of x(k) and u(k), visualization is only possible in the scalar-state scalar-input case. For example, in the linear case, it is known that the optimal Q-function is a quadratic function of x(k) and u(k). Thus, the Q-function resembles a “bowl”, where the horizontal plane is defined by the scalar state and the scalar input dimensions, and the height represents the Q-value. In the bilinear case, we are interested in how the shape of this “bowl” changes to produce a nonlinear controller, and how the linear case generalizes to the nonlinear case. Indeed, such a scalar example has not been provided in detail elsewhere even though much valuable insight can be gained from it. In higher dimensions visualization becomes impossible except for certain nonlinear representations. We note here that higher-dimensional examples can be found in [19, 21]. In [19], HDMR (High Dimensional Model Representation) is a nonlinear mapping technique that allows for some level of visualization into the minimum of the Q-function (the state-dependent value function or cost-to-go function) and the controller.

In the present illustration, consider an open-loop unstable scalar bilinear system where A = −1.1, B = 1, N = 1. The negative value of A can potentially cause the state to change its sign from time step to time step. The controller u(k) is some nonlinear function in x(k), and the Q-function is some surface in x(k) and u(k), both of which can be visualized in the scalar case.

8.1 MPC-Based Design

An MPC-based controller is designed with r = 40, Q = 1, R = 100, and γ = 1. For each value of x(k), $u_r(k)$ is computed by fixed-point iteration according to Eq. (14), then u(k) is extracted from the first element of $u_r(k)$. Plotting pairs of x(k) and u(k) reveals the nonlinear control law $u(k) = g_3(x(k))$ shown in Fig. 1. The collection of points defined by x(k) and u(k) represents the MPC nonlinear controller. Superimposed in Fig. 1 is a straight line which represents the LQR controller obtained with N set to zero. The optimal LQR gain for the linear model with the chosen values of Q and R is found to be 0.2261, which matches the slope of $u(k) = g_3(x(k))$ at the origin. Notice that for x(k) less than a certain value in the negative direction, the linear

[Figure 1: “MPC-based controller and LQR controller (blue line)”; control u(k) plotted against state x(k) for x(k) between −4 and 4.]

Fig. 1 The initial MPC-based controller (dotted curve) and the LQR controller (blue line) are shown. The initial controller is based on Design 3. The LQR controller is based on the linear portion of the bilinear system. The LQR controller matches the MPC-based controller near the origin of the x(k)-u(k) plane where the bilinear term can be neglected. Notice that the control input generated by the nonlinear MPC controller is opposite in sign to what the linear LQR controller would generate for x(k) less than a certain value

controller actually drives the system in the opposite direction when compared to the nonlinear controller. Figure 2 shows a typical open-loop state history, closed-loop state history, and closed-loop control history of the MPC controller.
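The quoted LQR gain can be reproduced with a few lines of arithmetic. The sketch below iterates the scalar Riccati recursion for the linear part of the model (N = 0) with the cost of Eq. (2), written in terms of the combined weight S = Q + p, where p x² is the optimal cost-to-go; the variable names are assumptions. The sign convention is u(k) = G x(k).

```python
# Scalar Riccati iteration for x(k+1) = a x(k) + b u(k) with cost sum of Q x(k+1)^2 + R u(k)^2.
a, b_lin, Qw, Rw = -1.1, 1.0, 1.0, 100.0

S = Qw                                   # S = Q + p, where p x^2 is the cost-to-go
for _ in range(500):
    S = Qw + a * a * S * Rw / (Rw + b_lin * b_lin * S)

G = -a * b_lin * S / (Rw + b_lin * b_lin * S)   # optimal feedback, u(k) = G x(k)
print("S =", S, " LQR gain G =", G)
```

For these values the fixed point satisfies S² − 22S − 100 = 0, so S = 11 + √221 ≈ 25.866 and G ≈ 0.22606, which rounds to the 0.2261 quoted in the text.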

8.2 Q-Learning

Next, the sub-optimal MPC controller is updated by Q-Learning. Figure 3 shows the initial MPC controller (blue) and its updated version (red) by Q-Learning. The right panel of Fig. 3 shows the r-step cumulative cost computed with the original controller (blue) and with the updated controller (red) for values of x(k) in the range between −4 and 4. It is clear that Q-Learning improves the controller because the r-step cost associated with the updated controller is less than the r-step cost associated with the original controller in the range of x(k) being evaluated. Numerical verification of the optimality of the updated controller is shown in the next section. The improvement is achieved through the Q-functions. Figure 4 illustrates the learning process. Using the MPC controller of Fig. 1, a Q-function is computed and shown on the left. The Q-function is a surface in two variables x(k) and u(k). From this Q-function an updated controller is found as follows. Consider the constant-x(k) contours on the Q-function surface. The red dots (*) mark a path on this surface where the Q-values are the smallest along these constant-x(k) contours. These

[Figure 2: three panels plotted against the number of time steps (0 to 50): open-loop state, closed-loop state, and closed-loop control.]

Fig. 2 Open-loop state history, closed-loop state history, and closed-loop control history of the initial MPC-based controller

[Figure 3: left panel “Initial (blue) and updated (red) controllers”, control u(k) versus state x(k); right panel “Original (blue) and updated (red) r-step cumulative cost”, r-step cumulative cost versus initial state x(0).]

Fig. 3 The initial MPC-based controller (blue) is updated by Q-Learning. The updated controller is shown in red (left figure). The r-step cumulative cost of the initial MPC-based controller (blue) and the updated controller by Q-Learning (red) are shown on the right. The updated controller lowers the cumulative cost throughout the range of the state being considered, −4 ≤ x(k) ≤ 4

[Figure 4: two surface plots of the Q value over state x(k) and control u(k): “Initial Q-function” (left) and “Updated Q-function” (right).]

Fig. 4 The initial Q-function (left figure) is computed from the bilinear model and the initial MPC controller. The red dots mark the minimum Q-values along the constant x(k)-contours. The blue dots, which define an updated controller, are the projection of the red dots onto the x(k)-u(k) plane. The updated controller is used to generate an updated Q-function (shown on the right figure) from which another updated controller is found (the blue dots on the right figure). The iteration between the Q-function and the controller continues until convergence

minimum values represent the state-dependent value function associated with a particular controller (or the cost-to-go function in optimal control terminology). These constant-x(k) contours are referred to as the “ribbons” in [19, 21]. The projections of the red dots onto the x(k)-u(k) plane are the blue dots, which define the updated controller produced from this Q-function. This controller has the property that at any state x(k) the controller produces an action u(k) that minimizes the r-step ahead receding-horizon cost function defined by the Q-function. Using the updated controller, a new Q-function is built as shown on the right of Fig. 4, from which another controller is found. Thus, given a controller, a Q-function can be built from it and the system model. By minimizing this Q-function, an updated controller can be produced. The optimal controller has the property that it is a fixed point of this controller and Q-function iteration, i.e., the controller that is used to build the Q-function is the same controller produced from it. In other words, the blue dots and the system model update the Q-function. The minimum of the Q-function along x(k) defines the red dots, whose projection on the x(k)-u(k) plane reproduces the blue dots at convergence. After a few iterations, convergence is observed for the current example.

8.3 Verification of Optimality

To verify numerically that the updated controller is indeed optimal, we compare the costs associated with perturbations about the original MPC-based controller to the costs associated with perturbations about the updated controller by Q-Learning at convergence. This comparison is shown in Fig. 5. The perturbations are introduced by adding very small random “functions” (sequences of normally distributed random

[Figure 5: left panel “Perturbations about original controller”, right panel “Perturbations about updated controller”; each shows the r-step cumulative cost versus test number (0 to 1000).]

Fig. 5 Cumulative costs (40-time step) corresponding to very small random perturbations about the original controller (left figure) and the updated controller at convergence (right figure) are shown. The solid blue line (left figure) marks the cost associated with the initial controller and the solid red line (right figure) marks the cost associated with the updated controller

numbers between −2 × 10−3 and 2 × 10−3 ) to each controller function while maintaining the property that it produces u(k) = 0 at x(k) = 0. The closed-loop system is then simulated with the perturbed controllers starting from the same initial condition. The initial condition is selected to be x(0) = −3.5, where a larger discrepancy between the original and updated controllers is observed (Fig. 3). A total of 1000 random perturbations each to the original and updated controllers are generated. The cumulative 40-step cost starting from x(0) = −3.5 is computed for each perturbed controller. It can be seen from Fig. 5 (left) that the original MPC controller is not optimal as expected because the resultant costs are found to be higher or lower than the cost associated with the original controller (solid blue line). On the other hand, the costs associated with perturbations about the updated controllers are higher than the cost associated with the updated controller which is marked by the solid red line of Fig. 5 (right). Because the system is nonlinear, only local optimality can be asserted. In this simple example, there is no need to use basis functions to model the Qfunction. However, in higher-dimensional problems, it is necessary to parameterize the Q-function. Figure 6 shows another view of the initial Q-function derived from the MPC controller, and its reconstruction is shown on the right with q = 8, s = 8. Through this reconstruction one finds a suitable representation for the Q-function (suitable values of q and s in this case) before Q-Learning is engaged. Ideally, the Qfunction should not be over-parameterized or under-parameterized. Once a suitable parameterization is found, the learning process updates the parameters P  in Eq. (36) that represents the Q-function instead of updating the entire Q-function. 
Fixed-point iteration is now between the controller and the parameters of the Q-function instead of iterating between the controller and the entire Q-function.
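As a rough illustration of parameterizing a Q-function with polynomial basis functions, the sketch below fits a two-variable polynomial to a sampled surface by least squares. It rests on assumptions not taken from this chapter: q and s are read as the maximum polynomial degrees in x(k) and u(k), the sampled surface `q_true` is a hypothetical stand-in for the MPC-derived Q-function, and the state is rescaled to [−1, 1] only to keep the monomial basis well conditioned. Equation (36) itself is not reproduced here.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

Q_DEG, S_DEG = 8, 8  # assumed meaning: maximum polynomial degrees in x and u

x = np.linspace(-4.0, 4.0, 41)
u = np.linspace(-1.0, 1.0, 21)
xg, ug = np.meshgrid(x, u, indexing="ij")

# Hypothetical Q surface standing in for the one derived from the MPC controller.
q_true = xg**2 + 5.0 * (ug + 0.2 * xg - 0.01 * xg**3) ** 2

# Scale the state to [-1, 1] so the monomial basis stays reasonably conditioned.
xs = xg / 4.0
basis = npoly.polyvander2d(xs.ravel(), ug.ravel(), [Q_DEG, S_DEG])  # one column per x^i * u^j
coef, *_ = np.linalg.lstsq(basis, q_true.ravel(), rcond=None)

# Reconstruct the surface from the fitted parameters and measure the mismatch.
q_fit = npoly.polyval2d(xs, ug, coef.reshape(Q_DEG + 1, S_DEG + 1))
max_error = float(np.max(np.abs(q_fit - q_true)))
print(basis.shape, max_error)
```

Repeating the fit with smaller or larger degrees and watching `max_error` is one simple way to judge whether a candidate parameterization is too poor or needlessly rich before learning starts.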

[Figure 6: two surface plots, “Original Q-function” (left) and “Reconstructed Q-function” (right), each showing the Q value over state x(k) and control u(k)]

Fig. 6 Another view of the initial Q-function computed from the MPC-based controller (left figure). This initial Q-function can be used to find a suitable parameterization so that its parameters can be updated directly via the recurrence equation. A reconstructed Q-function parameterized by polynomial basis functions with s = 8, q = 8 is found to match the initial Q-function closely (right figure)

9 Discussion

The proposed framework finds an optimal feedback controller that minimizes an LQR-style infinite-time quadratic cost function for a bilinear system. Due to the bilinear term, an analytical solution to this optimal control problem by MPC is difficult because the expression for the system state r steps into the future contains polynomial nonlinear product terms, and the number of these terms and their orders increase rapidly as r increases. Solving these high-order polynomials for the control actions quickly becomes intractable even for a small value of r. In this paper we combine MPC with Q-Learning to overcome the analytical difficulty. The relationship between optimal control and reinforcement learning can be understood through MPC. On the one hand, the ingenuity of the recurrence equation in Q-Learning is revealed in that Q-Learning actually finds the infinite-time optimal control solution that MPC cannot approximate due to excessive computational requirements. On the other hand, MPC addresses the hit-or-miss nature of the traditional Q-Learning algorithm applied to an unknown system without a suitable initial controller and without a suitable parameterization of the Q-function. MPC mitigates both difficulties by providing Q-Learning with an excellent initial sub-optimal controller and an initial Q-function from which a suitable parameterization can be determined. In theory, Q-Learning is a model-free method that is capable of finding the optimal control solution while the system interacts with the environment and receives feedback from it. This paper represents a small step in this direction because a general nonlinear system can be approximated as a high-dimensional bilinear system via a mathematical technique called Carleman linearization [9]. In reality, modeling and optimizing high-dimensional functions are not trivial tasks, and much work is needed to turn reinforcement learning into a practical and reliable nonlinear controller design method [19].
Finally, fixed-point iteration, a well-known numerical method, should be recognized as a key computational tool in many machine learning solutions. In this paper, fixed-point iteration is used both in the finding of an initial sub-optimal model predictive controller and in the subsequent optimization of the Q-function to bring the MPC solution to optimality.

Acknowledgements The authors would like to thank the editors and the anonymous reviewers for the helpful comments.

References

1. Al-Tamimi, A., Vrabie, D., Abu-Khalaf, M., Lewis, F.L.: Model-free approximate dynamic programming schemes for linear systems. In: International Joint Conference on Neural Networks. IJCNN 2007, pp. 371–378. IEEE (2007)
2. Bradtke, S.J.: Reinforcement learning applied to linear quadratic regulation. In: Advances in Neural Information Processing Systems, pp. 295–295 (1993)
3. Bradtke, S.J., Ydstie, B.E., Barto, A.G.: Adaptive linear quadratic control using policy iteration. In: American Control Conference, 1994, vol. 3, pp. 3475–3479. IEEE (1994)
4. Camacho, E.F., Bordons, C.: Nonlinear model predictive control: an introductory review. In: Assessment and Future Directions of Nonlinear Model Predictive Control, pp. 1–16. Springer (2007)
5. Ernst, D., Glavic, M., Capitanescu, F., Wehenkel, L.: Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 39(2), 517–529 (2009)
6. Gaweda, A.E., Muezzinoglu, M.K., Jacobs, A.A., Aronoff, G.R., Brier, M.E.: Model predictive control with reinforcement learning for drug delivery in renal anemia management. In: Engineering in Medicine and Biology Society, 2006. EMBS’06. 28th Annual International Conference of the IEEE, pp. 5177–5180. IEEE (2006)
7. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S.: Uncertainty-aware reinforcement learning for collision avoidance. arXiv:1702.01182 (2017)
8. Kiumarsi, B., Lewis, F.L., Modares, H., Karimpour, A., Naghibi-Sistani, M.B.: Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 50(4), 1167–1175 (2014)
9. Kowalski, K., Steeb, W.H.: Nonlinear dynamical systems and Carleman linearization. World Scientific, Singapore (1991)
10. Kretchmar, R.M., Young, P.M., Anderson, C.W., Hittle, D.C., Anderson, M.L., Delnero, C.C.: Robust reinforcement learning control with static and dynamic stability. Int. J. Robust Nonlinear Control 11(15), 1469–1500 (2001)
11. Kulkarni, N.V., Phan, M.Q.: Neural-network-based design of optimal controllers for nonlinear systems. J. Guid. Control Dyn. 27(5), 745–751 (2004)
12. Kulkarni, N.V., Phan, M.Q.: Performance optimization of the magnetohydrodynamic generator at the scramjet inlet. J. Propuls. Power 21(5), 822–830 (2005)
13. Kulkarni, N.V., Phan, M.Q.: Neural optimal magnetohydrodynamic control of hypersonic flows. J. Guid. Control Dyn. 30(5), 1519–1523 (2007)
14. Kulkarni, N.V., Phan, M.Q.: Reinforcement-learning-based magneto-hydrodynamic control of hypersonic flows. In: IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007, pp. 9–191. IEEE (2007)
15. Lewis, F., Liu, D., Lendaris, G., Werbos, P., Balakrishnan, S., Ding, J.: Special issue on adaptive dynamic programming and reinforcement learning in feedback control. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 38(4), 896 (2008)
16. Lewis, F.L., Vrabie, D., Vamvoudakis, K.G.: Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. 32(6), 76–105 (2012)
17. Negenborn, R.R., De Schutter, B., Wiering, M.A., Hellendoorn, H.: Learning-based model predictive control for Markov decision processes. IFAC Proceed. Vol. 38(1), 354–359 (2005)
18. Palanisamy, M., Modares, H., Lewis, F.L., Aurangzeb, M.: Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems. IEEE Trans. Cybern. 45(2), 165–176 (2015)
19. Phan, M.Q.: Value iteration and Q-learning for optimal control by high dimensional model representation (HDMR). In: AAS/AIAA Astrodynamics Specialist Conference (2019)
20. Phan, M.Q., Azad, S.M.B.: Model predictive control and model predictive Q-learning for structural vibration control. In: AAS/AIAA Astrodynamics Specialist Conference (2017)
21. Phan, M.Q., Azad, S.M.B.: Input-decoupled Q-learning for optimal control. J. Astron. Sci. (2019). https://doi.org/10.1007/s40295-019-00157-4
22. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge (1998)
23. Sutton, R.S., Barto, A.G., Williams, R.J.: Reinforcement learning is direct adaptive optimal control. IEEE Control Syst. 12(2), 19–22 (1992)
24. Ten Hagen, S., Kröse, B.: Linear quadratic regulation using reinforcement learning (1998)
25. Vamvoudakis, K.G.: Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach. Syst. Control Lett. 100, 14–20 (2017)
26. Vrabie, D., Vamvoudakis, K.G., Lewis, F.L.: Optimal adaptive control and differential games by reinforcement learning principles, vol. 2. IET (2013)
27. Wang, F.Y., Zhang, H., Liu, D.: Adaptive dynamic programming: an introduction. IEEE Comput. Intell. Mag. 4(2) (2009)
28. Werbos, P.J.: A menu of designs for reinforcement learning over time. Neural Netw. Control, 67–95 (1990)
29. Zhang, T., Kahn, G., Levine, S., Abbeel, P.: Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 528–535. IEEE (2016)

SCOUT: Scheduling Core Utilization to Optimize the Performance of Scientific Computing Applications on CPU/Coprocessor-Based Cluster

Minh Thanh Chung, Kien Trung Pham, Manh-Thin Nguyen, and Nam Thoai

Abstract Today’s scientific computing applications require many different kinds of tasks and computational resources. The success of scientific computing hinges on the development of High Performance Computing (HPC) systems that decrease execution time. This support has been further enhanced by the advent of accelerators such as the Graphics Processing Unit (GPU) and the Intel Xeon Phi (MIC) coprocessor. However, problems related to underutilization of the MIC coprocessor can lead to thread and memory over-subscription. In addition, as logs of the runtime behavior of scientific applications show, job scheduling usually carries constraints on job completion time in the form of a deadline or due date assignment. These problems can be addressed, and performance improved, by a suitable method such as scheduling or assigning priorities to submitted jobs. In this paper, we propose a scheduling module named SCOUT that exploits factors related to application performance to improve the scheduler on a CPU/Coprocessor-based cluster. SCOUT focuses on the performance of applications and on reducing their execution time on the Xeon Phi accelerator. Furthermore, our scheduling module decides the order of job execution to increase throughput and minimize delay. For a set of popular scientific applications, the experimental results show that the performance and throughput of SCOUT are better than those of the other compared policies. We implement the entire module as a seamless plug-in to an HPC workload manager named PBS Professional and show its efficiency in practice.

M. T. Chung · K. T. Pham (B) · M.-T. Nguyen · N. Thoai (B)
High Performance Computing Lab, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_6


1 Introduction

In recent years, HPC systems have been able to solve complex problems rapidly across a diverse range of scientific applications. They play a vital role in solving real-life problems such as simulating earthquakes or folding proteins. Solving these problems can depend on the amount of available computing resources. Besides increasing the size of a cluster, practitioners now integrate accelerators into compute nodes to enlarge the computational capability. Intel Xeon Phi (MIC) [8] belongs to the manycore architecture family [24] and has a theoretical computing capacity equal to that of two Intel Xeon processors. An HPC system has many compute nodes, with multiple MIC cards per node. The main idea behind the MIC is to exploit resources with high parallelism. One core on Intel Xeon Phi can support 4 threads running concurrently. However, this feature can also be a problem for some kinds of applications. Furthermore, reasonable integration and allocation of computing resources by a centralized manager is important and necessary. The main problem for a MIC without a centralized resource manager is thread and memory over-subscription.

From the user’s point of view, there is a list of jobs to submit to an HPC system. From the workload trace or log file, we can extract information about, and characteristics of, job submission. For example, by analyzing the runtime behavior, we can identify the characteristics of the due date assignment problem in job scheduling. In detail, a user usually expects to get back the results within a certain time after submitting a job, which is called the due date value [17]. The main approach to this problem works through the priority of each job: if a job has a longer due date value, we can give a higher priority to other jobs. There are many solutions for problems associated with a due date or deadline constraint; however, they are difficult to apply to a real cluster. Furthermore, studies of CPU/Coprocessor-based clusters have mainly focused on the performance of applications.

In this paper, the SCOUT module combines the due date and walltime of a job to produce a priority-based algorithm. In addition, the performance of applications is improved by a thread mapping mechanism, which depends on the features and types of HPC applications, to reduce the execution time. Our experiments compare SCOUT with classical and heuristic schedulers, including First-Come First-Serve (FCFS), Shortest Processing Time (SPT), Largest Processing Time (LPT), and Earliest Due Date (EDD). These policies use the processing time, the requested number of processors, or the waiting time to decide the execution priority of each job.

This paper is organized as follows: Sect. 2 reviews the background concepts used in the paper. Section 3 shows related works along with their proposed solutions. Section 4 addresses the problem and motivation for our work. Sections 5 and 6 highlight our solution as well as the implementation, and the results. Finally, we conclude with a summary of our findings and future work.


2 Background

2.1 Heterogeneous System: CPU/Coprocessor Cluster

A heterogeneous system [20] is a system that uses more than one kind of processor. In general, the extra computing resources or accelerators can be a GPU, an application-specific integrated circuit (ASIC), or an Intel Xeon Phi coprocessor. This architecture aims to share the computing load with the CPU. For instance, a GPU can render images more quickly than a CPU thanks to its parallel architecture. Heterogeneous systems are becoming more popular and play an important role in scientific computing. In particular, Intel Xeon Phi, which was introduced in 2013, has a manycore architecture (up to 61 physical cores) with 4 threads running concurrently per core. Intel Xeon Phi consists of components such as coprocessor cores, a vector processing unit (VPU), an L2 cache, a ring interconnect, and a PCIe interface. In particular, the VPU can operate on 8 double-precision or 16 single-precision numbers at the same time, based on a 512-bit single instruction, multiple data (SIMD) instruction set. In this work, our system is equipped with the Intel Xeon Phi Coprocessor 7120P (MIC). Figure 1 shows the interconnection between the CPU and the MIC; the coprocessor also has its own operating system. There are two programming models for running applications on the MIC. The first is the native mode, in which a program is compiled and run separately on the MIC. The second is the offload mode, in which the main part of the code runs on the host processor while the offload regions are transferred to and run on the MIC.

Fig. 1 The interaction between 2 Operating Systems on Intel Xeon Phi accelerator and CPU


Fig. 2 The NUMA (Non-Uniform Memory Access) architecture with 4 memory nodes/machine

2.2 Resource Over-Subscription

Resource over-subscription occurs when the requested resources exceed the available resources. This is easy to observe when a system does not have centralized management. For example, if the execution of jobs is not ordered, two or more applications of a user can use the same area of computing resources. When this happens, the execution time increases rapidly. Several other works [3, 13, 22, 26] try to solve this problem to improve performance. In this work, we try to avoid over-subscription by scheduling and arranging jobs on compute nodes.
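A small sketch can make the notion of over-subscription concrete as a ratio of requested to available resources. The `Job` class, the function names, and the choice to count only coprocessor threads are illustrative assumptions; the figure of 244 hardware threads follows from the 61 cores and 4 threads per core quoted in Sect. 2.1.

```python
from dataclasses import dataclass

CORES_PER_MIC = 61        # physical cores on a Xeon Phi 7120P
THREADS_PER_CORE = 4
HW_THREADS_PER_MIC = CORES_PER_MIC * THREADS_PER_CORE  # 244


@dataclass
class Job:
    name: str
    threads: int   # threads the job will spawn on the coprocessor


def oversubscription_degree(jobs, hw_threads=HW_THREADS_PER_MIC):
    """Requested threads divided by available hardware threads."""
    return sum(j.threads for j in jobs) / hw_threads


def is_oversubscribed(jobs, hw_threads=HW_THREADS_PER_MIC):
    return oversubscription_degree(jobs, hw_threads) > 1.0


# Two jobs that each claim every hardware thread of one card.
jobs = [Job("A", 244), Job("B", 244)]
print(oversubscription_degree(jobs), is_oversubscribed(jobs))
```

A scheduler that admits a job only while this degree stays at or below 1 avoids the conflict; memory over-subscription could be checked the same way with requested and available memory.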

2.3 Thread Affinity

Regarding memory technologies, manufacturers have adopted complicated memory hierarchies to hide memory access latency from processors. A design with multiple memory controllers is known as Non-Uniform Memory Access (NUMA) [2]. Thread affinity therefore becomes a concern, because data movement and memory access from processors have a very high impact on performance. Figure 2 shows the NUMA memory hierarchy in a machine with 4 CPU sockets and 4 NUMA nodes. There are two kinds of memory access in NUMA, local access and remote access, which have different bandwidths and therefore different latencies. This difference in memory latency degrades the performance of some HPC applications, which is what motivates studies of thread affinity. In particular, because a manycore architecture such as Xeon Phi supports 4 threads per core, we need to consider the assignment of threads to cores.

Thread affinity is defined as a problem related to thread mapping on multicore or manycore architectures: the assignment of threads to execution cores according to a policy [5]. A good policy can yield good performance for some kinds of applications. Thread mapping becomes more important with the advent of multicore and manycore processors such as GPUs and the Intel Xeon Phi coprocessor. Affinity-based thread mapping has two main purposes: improving locality and balancing communication [10].
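To illustrate what a mapping policy decides, the sketch below contrasts two simple ways of assigning threads to cores. The function names are hypothetical, and the two rules only mirror the intent of the "compact" and "scatter" affinity types offered by the Intel OpenMP runtime; they are not that runtime's exact placement logic.

```python
def compact_mapping(n_threads, n_cores, threads_per_core):
    """Fill every hardware thread of a core before moving to the next core."""
    return [(t // threads_per_core) % n_cores for t in range(n_threads)]


def scatter_mapping(n_threads, n_cores):
    """Spread consecutive threads round-robin across cores."""
    return [t % n_cores for t in range(n_threads)]


def cores_used(mapping):
    return len(set(mapping))


# Eight threads on a coprocessor with 61 cores and 4 hardware threads per core.
compact = compact_mapping(8, n_cores=61, threads_per_core=4)
scatter = scatter_mapping(8, n_cores=61)
print(compact, cores_used(compact))
print(scatter, cores_used(scatter))
```

The compact rule keeps neighbouring threads on the same core, which favours locality and shared caches, while the scatter rule gives each thread its own core for as long as cores remain, which favours compute-bound work.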

3 Related Works

Thread mapping and scheduling policies for the due date problem are usually addressed in separate studies. For affinity-based thread mapping, solutions generally consist of two parts: the analysis of memory access behavior, and then the mapping policy. There are two levels at which the solutions are implemented: user level and system level. User-level thread mapping mechanisms target single parallel applications: Source Code Changes [7, 14], Compiler Analysis [6], Offline or Online Profiling [9], and Runtime Options [11]. System-level mechanisms are usually implemented at the operating system level.

The second problem addressed in this paper is due date assignment for each job when it is submitted to the server. Based on the behavior of users running applications, we survey the expected time to finish each job when users submit it to the server. Concretely, when users submit a job to an HPC system, they usually expect to get back the result within a certain time, which is called the due date time. The value of the due date time is mostly estimated from the user's definition. In practice, due date assignment is a common problem in sales planning [1]. In previous studies of this problem, researchers proposed many methods that are proven in theory or tested on a single machine [12]. A classical method named Earliest Due Date (EDD) [15] can minimize the maximum tardiness on a single machine. Furthermore, there are many studies on minimizing the maximum tardiness on a single machine with different objective functions [23]. Many previous studies also propose solutions for a single machine with due date assignment; for example, [18] solves the problem with uncertain processing times and considers general precedence constraints among the jobs.

Today, modern HPC systems need to keep up with the demand for efficient resource utilization. We can consider the problem of due date assignment or other time constraints on the user side. Previous studies have achieved good results on the trade-off between cost and execution time [27]. Some related works use a heuristic genetic approach [16], with a slight improvement in the total execution time. Scheduling policies with a due date value have also been widely studied on distributed systems, for example in [25]. In our work, we propose SCOUT as a scheduling module that exploits two factors to improve performance. From the user's point of view, it improves the performance of applications and thus their execution time. From the server's point of view, it arranges the order of job execution to solve the due date assignment problem. This combination helps to increase throughput and to reduce the waiting time for users.


4 Problem Definition and Motivation

Today, the development of multicore and manycore systems helps scientific applications decrease their execution time. In particular, the advent of accelerators plays an important role in increasing computational power. In detail, this paper considers Intel Xeon Phi (Knights Corner), with its benefits of massive parallelism and vectorization. However, we need an efficient way to manage computing resources on these accelerators, because performance can drop when resources are over-subscribed. For example, when two or more applications use the same cores or memory on Xeon Phi, performance degrades. Figure 3 shows the relation between the degree of over-subscription and the execution time. 2x over-subscription means that, when job execution is not ordered, job A, which occupies 244 logical cores with its threads, conflicts with job B, which occupies the same 244 logical cores. As a result, performance drops by a factor of about 1.3 relative to the baseline.

Fig. 3 An experiment with many levels of resource over-subscription on 2 Intel Xeon Phi cards

A scientific computing problem (e.g., environmental issue analysis) can be divided into small parts that are then run concurrently on as many threads as possible. Intel Xeon Phi serves this demand well with its high parallelism. Each core on the MIC has 4-way hyperthreading, so each thread can perform different operations in parallel. Alongside over-subscription, another problem is the mapping of threads to cores, which is called thread affinity. Figure 4 shows that the performance of scientific applications is affected by thread affinity. In the worst case, threads are mapped ineffectively to cores, which hurts the performance of applications such as DFFT and Stream OFFLOAD. In the ideal case, when threads are mapped reasonably, the computing speed is far higher than in the normal case. The thread affinity problem can be handled at the operating system level with the default algorithm, but the best thread mapping depends on the application type. For example, a core-intensive (computing-intensive) application is recommended to run its threads on separate cores. In contrast, memory-intensive (data-intensive) applications depend on memory access, so such an application should be run on cores within the same CPU node.

Fig. 4 The influence of thread mapping on scientific applications on CPU & Xeon Phi: (a) an example of core-intensive apps, DFFT on CPU; (b) an example of data-intensive apps, Stream OFFLOAD on Intel Xeon Phi

The third problem in HPC systems comes from the runtime behavior of users. They submit or run a job on an HPC system and then expect to get back the result within a certain time. With the classic algorithms used for job submission, we sometimes have to accept the effects of this problem.

5 SCOUT Model and Implementation

A scheduling module is usually implemented for a specific objective metric. SCOUT exploits two factors to reach its objective: the performance of applications, and job scheduling. The application’s performance, which is affected by thread affinity, can be improved to reduce the execution time. The scheduling algorithm then aims to bring the waiting time closer to the user’s expectation. In detail, SCOUT includes three modules, as shown in Fig. 5. Job classification and thread mapping are two modules that cooperate to improve performance. For an application running in parallel, thread mapping affects performance. However, the effect depends on the application type, that is, on whether the application is computing-intensive or data-intensive. As mentioned in Sect. 4, the performance of data-intensive applications is better if threads are mapped to cores in the same CPU socket, because memory access is faster. Therefore, in these two modules of SCOUT, job classification decides whether a job is computing-intensive or data-intensive, and the thread mapping module then uses this information to map threads to cores.

Fig. 5 SCOUT model as a seamless plug-in inside PBS Pro (HPC Workload Management Tool)

In this part, we use information extracted from the log files of PBS Pro (a workload management tool) on SuperNode-XP (a cluster at Ho Chi Minh City University of Technology) to generate the classifier. As shown in Table 1, the related information of a job, such as its name, runtime, requested number of CPUs and MICs, and estimated due date time, can be extracted from the PBS log file.

Table 1 Related information of a job from the workload trace
Job name | runtime | #CPUs | #MICs | duedate | label
fft | 381 | 24 | 2 | 1116 | 1
nbody-100000 | 506 | 16 | 1 | 1411 | 0
mkl-25000 | 1453 | 4 | 0 | 2983 | 1
qe_mic_test | 1287 | 48 | 1 | 2391 | 1
...

On our cluster, before users are given support to run applications, their jobs are kept and their performance is evaluated under the available thread mapping options. Combining this with the information from the log files, we generate a data set in which each job has a label of 0 or 1 (1 is core-intensive and 0 is data-intensive). After that, machine learning is used to create a logistic regression-based [21] classifier for classifying jobs. Based on this classifier, the Thread Mapping module decides on a policy for mapping threads to cores. Using the libraries and compilers of Intel Parallel Studio [4], the mapping module automatically maps threads to cores through a local environment variable set for each job; for example, the policy can be the scatter mode. Figure 5 shows that SCOUT catches the event when a job is submitted and then classifies the job as a core-intensive or data-intensive application.

The proposed scheduling policy is mainly based on a heuristic derived from the workload trace. For job submission with an expected time (due date or deadline), the key to this problem is the execution priority of jobs. For instance, when the due date value of a job is far from the current time, a higher priority can be assigned to other jobs. We define a ratio between the time remaining until the job's due date and its processing time.
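The classification and mapping steps of this section can be sketched as follows. This is only an illustration under stated assumptions: the four rows of Table 1 serve as a toy training set, the features are standardized and fitted by plain batch gradient descent rather than by whatever tooling SCOUT actually uses, and the choice of `scatter` for core-intensive jobs and `compact` for data-intensive jobs through the `KMP_AFFINITY` variable of the Intel OpenMP runtime is an assumed policy that follows the reasoning of Sect. 4. The function names are hypothetical.

```python
import numpy as np

# Rows of Table 1: runtime, #CPUs, #MICs, duedate.
# Label 1 = core-intensive, label 0 = data-intensive.
features = np.array([
    [381, 24, 2, 1116],
    [506, 16, 1, 1411],
    [1453, 4, 0, 2983],
    [1287, 48, 1, 2391],
], dtype=float)
labels = np.array([1, 0, 1, 1], dtype=float)

mean, std = features.mean(axis=0), features.std(axis=0)


def standardize(x):
    return (x - mean) / std


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(x, y, lr=0.5, steps=5000):
    """Plain batch gradient descent on the mean logistic loss."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(x @ w + b) - y
        w -= lr * (x.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


w, b = train(standardize(features), labels)


def classify(job_row):
    """Return 1 (core-intensive) or 0 (data-intensive) for one job record."""
    z = standardize(np.asarray(job_row, dtype=float)) @ w + b
    return int(sigmoid(z) >= 0.5)


def affinity_env(label):
    # Assumed policy: spread core-intensive jobs, pack data-intensive ones.
    return {"KMP_AFFINITY": "scatter" if label == 1 else "compact"}


print([classify(row) for row in features], affinity_env(classify(features[1])))
```

With only four records the classifier merely memorizes the table; a usable model would be trained on the full workload trace and checked on jobs it has not seen.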


Algorithm 1 Dynamic Critical Ratio Algorithm on Cluster.
procedure dynamicCRC    ▷ as a Hook module inside PBS Pro
    job_i ← pbs.event().job    ▷ PBS catches the event
    name, walltime, ncpu, mem ← job_i.info()    ▷ Get job's information
    duedate ← estimatedFunc(walltime)
    time_current = datetime.today()
    TotalWorkRemainingTime = 0

    while state_job_k = RUNNING do    ▷ checking all of jobs that are RUNNING
        t = job_k.remainingtime
        if resource_free(server) ≥ resource_required(job_i) then
            break
        else if resource_required(job_k) ≥ resource_required(job_i) && t.isMin() == true then
            TotalWorkRemainingTime = t
            break
        else
            resource_available += resource_required(job_k)
            t = job_k.remainingtime
            remainingtime += t
            if resource_available ≥ resource_required(job_i) then
                TotalWorkRemainingTime = remainingtime

    TotalWorkRemainingTime += walltime
    CR_value = (duedate − time_current) / TotalWorkRemainingTime
    job_i.Priority = CR_value
    updatePriority(jobs_queued)    ▷ update the priority for all jobs in the waiting queue
    return CR_value

$$CR_i = \frac{\mathit{Duedate}_i - \mathit{CurrentTime}}{\mathit{TotalWorkRemainingTime}_i} \tag{1}$$

As shown in Eq. 1, CR_i is the Critical Ratio value of job_i. Duedate_i is the due date value by which job_i is expected to finish executing on the server. CurrentTime is the time at which we calculate CR_i. The last term, TotalWorkRemainingTime_i, is the total remaining time that job_i needs to wait for execution (in Algorithm 1, the walltime of job_i is then added to this value). To get this value, we need to check all of the jobs in the running queue and predict the finish time of each job. The ratio value is assigned to each job and converted into a priority value. However, each value is updated again after every new job submission event on the server. If job_i has a lower CR value than the other jobs, job_i needs to be executed as soon as possible. The proposed scheduler is named Dynamic Critical Ratio Cluster (DCRC) and is shown in Algorithm 1. SCOUT implements this algorithm, with the idea of updating CR values on the cluster, as a plug-in inside PBS Pro.
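The critical ratio of Eq. 1 can be sketched in a few lines. The sketch is a simplification rather than a transcription of Algorithm 1: it assumes the waiting time of a new job is the predicted finish time of the last running job whose release frees enough cores, it does not use the PBS Pro hook API, and the names `RunningJob`, `wait_time`, and `critical_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    remaining_time: float  # predicted seconds until the job finishes
    cores: int             # cores released when it finishes


def wait_time(required, free, running):
    """Earliest time at which `required` cores are free (simplified estimate)."""
    if free >= required:
        return 0.0
    available = free
    for job in sorted(running, key=lambda j: j.remaining_time):
        available += job.cores
        if available >= required:
            return job.remaining_time
    raise ValueError("request exceeds the capacity of the cluster")


def critical_ratio(due_date, now, walltime, required, free, running):
    """Eq. 1: time left until the due date over waiting time plus walltime."""
    total_work_remaining = wait_time(required, free, running) + walltime
    return (due_date - now) / total_work_remaining


running = [RunningJob(300, 24), RunningJob(100, 16)]
queue = {
    "wide": critical_ratio(1000, 0, walltime=400, required=24, free=8, running=running),
    "narrow": critical_ratio(300, 0, walltime=200, required=4, free=8, running=running),
}
# A lower ratio means less slack, so that job should start first.
order = sorted(queue, key=queue.get)
print(queue, order)
```

Recomputing the ratios for every queued job whenever a new submission arrives gives the dynamic behaviour described above.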


6 Experimental Results

6.1 Testing Environment and Benchmarks

In this section, we describe the testing environment on the CPU/Coprocessor-based cluster SuperNode-XP at Ho Chi Minh City University of Technology. The system includes 24 compute nodes. Each node is a NUMA dual-socket system accelerated by 2 Intel Xeon Phi coprocessors. Four compute nodes are used to deploy the testbeds. The operating system is Red Hat Enterprise 7.3, and PBS Professional 14.01 is used for scheduling job submission. The SCOUT module is integrated into PBS Professional as a seamless hook module. The jobs are scientific applications on HPC systems, listed in Table 2, such as Flood Simulation, Material Modeling, or Molecular Analysis. The arrival times of job submissions are randomly distributed according to the Lublin model [19]. Because we cannot manually create a realistic set of job submissions, we rely on the Lublin model, which is one of the popular distributions for job submission. Lublin is a detailed model for rigid jobs that is used in many related studies [19]. Figure 6 shows the distribution of job submission for the experiment; the x-axis and y-axis together show the number of submitted jobs over the queued time.

Table 2 List of scientific computing applications

Apps                 Information
OpenTelemac          An integrated suite of solvers for use in the field of free-surface flow
Nbody-Simulation     A basic N-body simulation on Xeon Phi
Vectorization-Data   Vectorization-data program on Xeon Phi
LU-decomposition     LU factorization program on Xeon Phi
STREAM-Offload       A streaming data program running on CPU & Xeon Phi
Jacobi               A linear algebra program on Xeon Phi
Matrix_x_vector      A linear algebra program on Xeon Phi
QE                   Quantum ESPRESSO, an integrated suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale
VAST                 Virtual Airway Skill Trainer


Fig. 6 The distribution of job submission is generated by Lublin model on SuperNode-XP cluster

6.2 Performance

Applications shown in Table 2 are scientific computing jobs that are used by users on our system, SuperNode-XP. We want to evaluate the real performance when running them in practice. The inputs of these jobs have been adjusted for a reasonable execution time, because HPC applications run for a long time and we cannot wait for months to obtain the results. There are 3 groups: small (S), medium (M), and large (L). Group S has an execution time from 10 s to 5 min, Group M from 5 to 15 min, and Group L from 15 to 30 min. Furthermore, we distribute the arrival times of all jobs based on the Lublin model [19]. Afterward, our results are compared with other schedulers, namely FCFS, SPT, LPT, and EDD, in which jobs are scheduled first by arrival order, smaller processing time, larger processing time, and smaller due date, respectively. SCOUT shows better results than these policies on the criteria in focus.
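The four baseline policies differ only in the key used to order the waiting jobs, which the following minimal sketch makes explicit. It orders a static queue and ignores arrival dynamics; the `Job` fields, the `POLICIES` table, and the sample jobs are hypothetical and do not come from the experiments.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: float     # submission time, in seconds
    processing: float  # estimated processing time, in seconds
    due: float         # due date, in seconds


# Each baseline policy is a different sort key over the waiting jobs.
POLICIES = {
    "FCFS": lambda j: j.arrival,     # first come, first served
    "SPT": lambda j: j.processing,   # smaller processing time first
    "LPT": lambda j: -j.processing,  # larger processing time first
    "EDD": lambda j: j.due,          # smaller due date first
}


def schedule(jobs, policy):
    """Return job names in the order the given policy would start them."""
    return [j.name for j in sorted(jobs, key=POLICIES[policy])]


# One hypothetical job per group: small (S), medium (M), large (L).
jobs = [
    Job("S1", arrival=0, processing=60, due=900),
    Job("M1", arrival=5, processing=600, due=700),
    Job("L1", arrival=10, processing=1500, due=2000),
]
orders = {policy: schedule(jobs, policy) for policy in POLICIES}
```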

6.2.1 Throughput

For SCOUT, we use the Critical Ratio value to order the submitted jobs. Concretely, a job with a longer expected time (due date) for returning its result has a lower priority than the others. This lets many jobs in the queue be prioritized so that they do not have to wait behind other jobs before their due date. Figure 7a compares the throughput of SCOUT with that of the other schedulers. In detail, we compare not only the full SCOUT module but also, separately, the DCRC scheduler (a scheduler that uses only the DCRC algorithm). This highlights the benefit of combining the 3 modules in SCOUT, although the DCRC strategy alone is still better than the other policies for the due date problem. The result shows that if we only use the DCRC strategy for the scheduler,


Fig. 7 The performance of the SCOUT module in terms of (a) Throughput and (b) Total Execution Time in comparison with other schedulers

the throughput can be better by 1.76 jobs/hour. This value is small when measured per hour; however, many types of HPC jobs run for many hours, so the gain is better judged by the total throughput per day. In particular, SCOUT combines 3 modules in order to reduce the execution time of jobs and thereby improve the performance of the scheduler. As we can see in Fig. 7a, the result is better by more than 4 jobs/hour. The better throughput is associated with a better total execution time, as Fig. 7b shows.

6.2.2 Lateness and Tardiness

Figure 8 shows the average lateness and tardiness [17] in the comparison. In this paper, we are concerned with due date constraints, where the due date is the time by which a job is expected to complete. In contrast, a deadline constraint is the time

Fig. 8 Lateness and tardiness in the comparison between SCOUT module and other schedulers: (a) the average lateness, (b) the number of tardiness


Fig. 9 The performance of HPC application when running with SCOUT and other schedulers

by which each job must be completed. The lateness value is equal to the completion time minus the due date. A larger lateness means that users have to wait longer to get back the result after submitting their jobs, so smaller is better. Figure 8a shows the average lateness of SCOUT and the other schedulers. Each boxplot in Fig. 8a highlights 3 main values: maximum, mean, and minimum. As we can see, SCOUT has a smaller average and maximum lateness. Figure 8b shows the number of tardy jobs for SCOUT compared with the other schedulers. SCOUT reduces the number of jobs whose lateness is larger than 0, which means that its number of tardy jobs is smaller than that of the others.
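The two metrics can be written down directly. The sketch below uses the definition of lateness given above and the standard definition of tardiness as lateness clipped at zero; the completion times and due dates are hypothetical values chosen only for illustration.

```python
def lateness(completion_time, due_date):
    """Lateness = completion time - due date (negative when a job finishes early)."""
    return completion_time - due_date


def tardiness(completion_time, due_date):
    """Tardiness clips lateness at zero, so early jobs contribute nothing."""
    return max(0.0, lateness(completion_time, due_date))


# Hypothetical (completion time, due date) pairs in seconds.
jobs = [(500.0, 600.0), (900.0, 800.0), (1500.0, 1200.0)]

values = [lateness(c, d) for c, d in jobs]
average_lateness = sum(values) / len(values)
number_of_tardy_jobs = sum(1 for v in values if v > 0)
```

With these numbers, one job finishes early and two finish late, so two jobs count as tardy.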

6.2.3 Performance of HPC Application

Finally, we submit a set of jobs based on HPC applications from users. These jobs run on both the CPU and the Intel Xeon Phi in offload mode. The average performance is shown in Fig. 9. This result highlights that a job performs better under SCOUT than under the default scheduler of PBS. These jobs first go through the thread mapping phase to improve their performance, and then they are assigned a priority for execution.

7 Conclusions and Future Work

In this paper, we propose a scheduling module that is integrated inside the HPC workload manager tool PBS Professional. The module is named SCOUT, and its role is to solve problems related to the performance and execution priority of applications on our system. SCOUT is implemented as a seamless plug-in for PBS Professional. The module is evaluated with a real testbed of job submissions on our CPU/coprocessor-based cluster, namely SuperNode-XP. Applications in the testbed are taken from users at Ho Chi Minh City University of Technology. SCOUT includes three phases with two main goals. First, it decides a good policy for thread mapping on the CPU and the Intel Xeon Phi coprocessor. Then, it assigns the priority as well as the execution order to each job based on the proposed algorithm. The experimental results show that SCOUT increases the throughput of the system and improves the performance of HPC applications. For future work, we will analyze log files to understand runtime behavior in more depth. Then, we will apply machine learning techniques to build a good estimation model for scheduling jobs.

Acknowledgements We thank the anonymous reviewers for their insightful comments. This research was conducted within the “Studying Tools to Support Applications Running on Powerful Clusters & Big Data Analytics (HPDA phase I 2018–2020)” funded by Ho Chi Minh City Department of Science and Technology (under grant number 46/2018/HD-QKHCN).

References

1. Assarzadegan, P., Rasti-Barzoki, M.: Minimizing sum of the due date assignment costs, maximum tardiness and distribution costs in a supply chain scheduling problem. Appl. Soft Comput. 47, 343–356 (2016)
2. Awasthi, M., Nellans, D., Sudan, K., Balasubramonian, R., Davis, A.: Handling the problems and opportunities posed by multiple on-chip memory controllers. In: Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on, pp. 319–330. IEEE (2010)
3. Benton, J., Do, M.B., Kambhampati, S.: Over-subscription planning with numeric goals. In: IJCAI, pp. 1207–1213. Citeseer (2005)
4. Blair-Chappell, S., Stokes, A.: Parallel Programming With Intel Parallel Studio XE. Wiley, Hoboken (2012)
5. Boillat, J.E., Kropf, P.G.: A fast distributed mapping algorithm. In: CONPAR 90 - VAPP IV, pp. 405–416. Springer, Berlin (1990)
6. Broquedis, F., Aumage, O., Goglin, B., Thibault, S., Wacrenier, P.A., Namyst, R.: Structuring the execution of openmp applications for multicore architectures. In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–10. IEEE (2010)
7. Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S., Namyst, R.: hwloc: A generic framework for managing hardware affinities in hpc applications. In: Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pp. 180–186. IEEE (2010)
8. Chrysos, G.: Intel® xeon phi coprocessor-the architecture, vol. 176. Intel Whitepaper (2014)
9. Cruz, E.H., Diener, M., Pilla, L.L., Navaux, P.O.: An efficient algorithm for communication-based task mapping. In: Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on, pp. 207–214. IEEE (2015)
10. Diener, M., Cruz, E.H., Alves, M.A., Navaux, P.O., Koren, I.: Affinity-based thread and data mapping in shared memory systems. ACM Comput. Surv. (CSUR) 49(4), 64 (2017)
11. Goglin, B., Furmento, N.: Enabling high-performance memory migration for multithreaded applications on linux. In: Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pp. 1–9. IEEE (2009)
12. Gordon, V., Proth, J.M., Chu, C.: A survey of the state-of-the-art of common due date assignment and scheduling research. Eur. J. Oper. Res. 139(1), 1–25 (2002)
13. Iancu, C., Hofmeyr, S., Blagojević, F., Zheng, Y.: Oversubscription on multicore processors. In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–11. IEEE (2010)
14. Ito, S., Goto, K., Ono, K.: Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Comput. Fluids 80, 88–93 (2013)
15. Jackson, J.R.: Scheduling a production line to minimize maximum tardiness. Technical Report, California Univ Los Angeles Numerical Analysis Research (1955)
16. Kaur, K., Chhabra, A., Singh, G.: Heuristics based genetic algorithm for scheduling static tasks in homogeneous parallel system. Int. J. Comput. Sci. Secur. (IJCSS) 4(2), 183–198 (2010)
17. Leung, J.Y.: Handbook of scheduling: algorithms, models, and performance analysis. CRC Press, Boca Raton (2004)
18. Li, J., Sun, K., Xu, D., Li, H.: Single machine due date assignment scheduling problem with customer service level in fuzzy environment. Appl. Soft Comput. 10(3), 849–858 (2010)
19. Lublin, U., Feitelson, D.G.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63(11), 1105–1122 (2003)
20. Power, J., Basu, A., Gu, J., Puthoor, S., Beckmann, B.M., Hill, M.D., Reinhardt, S.K., Wood, D.A.: Heterogeneous system coherence for integrated cpu-gpu systems. In: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 457–467. ACM, New York (2013)
21. Raschka, S.: Python machine learning. Packt Publishing Ltd, Birmingham (2015)
22. Smith, D.E.: Choosing objectives in over-subscription planning. In: ICAPS, vol. 4, p. 393 (2004)
23. Tavakkoli-Moghaddam, R., Moslehi, G., Vasei, M., Azaron, A.: Optimal scheduling for a single machine to minimize the sum of maximum earliness and tardiness considering idle insert. Appl. Math. Comput. 167(2), 1430–1450 (2005)
24. Tesla, N.: A unified graphics and computing architecture. IEEE Computer Society pp. 0272–1732 (2008)
25. Toosi, A.N., Sinnott, R.O., Buyya, R.: Resource provisioning for data-intensive applications with deadline constraints on hybrid clouds using aneka. Futur. Gener. Comput. Syst. 79, 765–775 (2018)
26. Van Den Briel, M., Sanchez, R., Do, M.B., Kambhampati, S.: Effective approaches for partial satisfaction (over-subscription) planning. In: AAAI, pp. 562–569 (2004)
27. Vasile, M.A., Pop, F., Tutueanu, R.I., Cristea, V., Kołodziej, J.: Resource-aware hybrid scheduling algorithm in heterogeneous distributed computing. Futur. Gener. Comput. Syst. 51, 61–71 (2015)

Chainer-XP: A Flexible Framework for ANNs Run on the Intel® Xeon Phi™ Coprocessor

Thanh-Dang Diep, Minh-Tri Nguyen, Nhu-Y Nguyen-Huynh, Minh Thanh Chung, Manh-Thin Nguyen, Nguyen Quang-Hung, and Nam Thoai

Abstract Chainer is a well-known deep learning framework facilitating the quick and efficient establishment of Artificial Neural Networks. Chainer can be deployed efficiently on systems consisting of Central Processing Units and Graphics Processing Units. In addition, it is possible to run Chainer on systems containing Intel Xeon Phi coprocessors. Nonetheless, Chainer can only be deployed on Intel Xeon Phi Knights Landing, not Knights Corner. There are many existing systems, such as Tianhe-2 (MilkyWay-2), Thunder, Cascade, SuperMUC, and so on, that include Knights Corner only. For that reason, Chainer cannot fully exploit the computing power of such systems, which creates a demand for supporting Chainer on them. The task is made more challenging by the fact that deep learning applications are written in Python, while the Xeon Phi coprocessor can only run code written in C/C++ or Fortran. Fortunately, there is an offloading module called pyMIC which helps port Python applications to the Intel Xeon Phi Knights Corner coprocessor. In this paper, we present Chainer-XP as a deep learning framework that helps applications run on systems containing the Intel Xeon Phi Knights Corner coprocessor. Chainer-XP extends Chainer by integrating pyMIC into it. The experimental findings show that Chainer-XP can move the core computation (matrix multiplication) to the Intel Xeon Phi Knights Corner coprocessor with acceptable performance in comparison with Chainer.

T.-D. Diep (✉) · M.-T. Nguyen · N.-Y. Nguyen-Huynh · M. T. Chung · M.-T. Nguyen · N. Quang-Hung · N. Thoai
High Performance Computing Laboratory, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, VNUHCM, Ho Chi Minh City, Vietnam

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_7

1 Introduction

Artificial Neural Networks (ANNs), which are inspired by biological neural networks, have attracted a lot of community attention because they learn to do tasks by considering examples rather than through task-specific programming. Such networks are applied to a wide range of research fields such as computer vision [24, 26], speech recognition [16, 19], and natural language processing [15]. Hence, several frameworks have been created to help scientists and programmers set up ANNs straightforwardly, such as Caffe [1, 29], Theano [13, 27], TensorFlow [11, 12], and Chainer [2, 28]. Among the aforementioned frameworks, Chainer is the only one based on the “Define-by-run” approach, whereas the others use “Define-and-run” [28]. Thus, Chainer enables its users to debug deep learning applications more straightforwardly. Chainer, which is a well-known framework for ANNs, enables users to write complex architectures simply and intuitively. However, Chainer in particular, as well as the others in general, is used effectively only on systems comprising Central Processing Units (CPUs) along with Graphics Processing Units (GPUs). On the other hand, there are many contemporary systems, such as Tianhe-2 (MilkyWay-2), Thunder, Cascade, SuperMUC, and so forth [9], consisting of Intel Xeon processors and Intel Xeon Phi coprocessors without GPUs, and almost all existing frameworks for ANNs cannot be run on such infrastructures, with the exception of Knights Landing [4]. To fully exploit the computing power of such systems, there is a considerable demand for deep learning frameworks that can run on them. Nevertheless, developing such a framework involves some enormous challenges. One typical example is that users write Chainer applications in Python, which Intel Xeon Phi Knights Corner does not support.

Fortunately, there is a convenient module called pyMIC [22], which lets programmers handle offloads to the coprocessors from Python code. Furthermore, pyMIC also provides some useful functions to handle data transfers in a flexible yet performant way. pyMIC is a Python module for offloading compute kernels from a Python program to the Intel Xeon Phi Knights Corner (IXPKC) coprocessor. Currently, it is leveraged in several scientific computing applications, such as the Python-based open-source electronic structure simulation software GPAW [22] and the open-source high-order accurate computational fluid dynamics solver for unstructured grids PyFR [23]. In this paper, we integrate pyMIC into Chainer with the aim of creating a novel framework named Chainer-XP which can run effectively on systems including CPUs connected with IXPKC. We evaluate Chainer-XP by using MNIST [25], the popular handwritten digit database, on the system comprising two


Intel Xeon processors and one IXPKC. In addition, we compare the performance of Chainer-XP with Chainer in order to show the effectiveness of Chainer-XP. To sum up, our work makes the following contributions:

• We propose a novel deep learning framework called Chainer-XP which can run on the existing systems encompassing IXPKC via the offloading mechanism, by blending pyMIC with Chainer. Thanks to the flexibility of pyMIC [22] and the “Define-by-run” approach of Chainer [28], Chainer-XP can be extended and debugged more easily in comparison with other existing efforts.
• Basically, Chainer-XP only offloads the core computation (matrix multiplication) to IXPKC. However, the experimental results show that if the operation is repeatedly executed on IXPKC, its execution time (including computation time and communication time) gradually becomes shorter than that of the same operation on CPU, which suits deep learning applications.
• We also evaluate the performance of Chainer-XP compared with Chainer to show its acceptable effectiveness.

The remainder of the paper is organized as follows. The next section gives a general description of deep learning and introduces several existing deep learning frameworks. Section 3 briefly reviews the architecture of the pyMIC module, and Sect. 4 depicts the implementation of Chainer-XP in detail. Section 5 evaluates the effectiveness of Chainer-XP in comparison with Chainer, while Sect. 6 surveys the state of the art and related work. Finally, Sect. 7 gives some conclusions, together with further enhancements of the study.

2 Deep Learning Overview

Deep Learning (DL) is present in our lives in ways we may not even consider, for example, Facebook’s image recognition, Google search and translate, Apple’s Siri, or even the face detection applications on our smart phones. If Machine Learning (ML) is a subfield of Artificial Intelligence (AI), then DL could be seen as a subfield of ML. The evolution of the subject has gone from AI to ML and then to DL. The last decade has witnessed the rapid growth of DL, which has become one of the most powerful ML techniques behind the most exciting capabilities in diverse areas, such as computer vision, image processing, system identification, decision making, pattern recognition, and so on. In particular, DL is powered by ANNs, so the expression “deep learning” is always used when talking about ANNs, which are inspired by biological neural networks. In fact, an ANN is a mathematical or computational model that tries to simulate the structural and functional aspects of the human brain. It consists of interconnected groups of artificial neurons, also known as layers. Basically, ANNs process information based on the interaction of multiple connected processing elements. Therefore, the more layers there are and the more elements each layer has, the deeper the ANN becomes and the more computation the programmers


have to handle. Moreover, the complexity of ANN architectures is increasing day by day in order to solve more complex problems that may not be solved by any classical algorithm. This is demonstrated by the advent of the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and their variants. In the face of these difficulties, DL frameworks have become an indispensable part of the DL evolution, and they are powering the AI revolution. Without them, it would be almost impossible for data scientists to implement their sophisticated algorithms. To put it simply, DL frameworks help users easily build deep learning algorithms of considerable complexity. Because DL is the key to executing tasks of a higher level of sophistication, building and deploying such algorithms successfully is a big challenge for data scientists and data engineers across the globe. At present, we have a myriad of DL frameworks that allow us to develop ANNs and that offer a better level of abstraction along with simpler programming. Each framework is built in a different manner for different purposes. Some popular frameworks that are often referred to are TensorFlow, Caffe, Theano, Chainer, and so on. All of these frameworks are supported by Intel [4], but only Chainer uses the “Define-by-run” approach [28], which is suitable for debugging. Hence, we choose Chainer as the basis for our development in this paper.

3 pyMIC

Python has recently become one of the most commonly used programming languages in the computing industry. It has also gained a lot of attention from the High-Performance Computing community due to the advent of two widely used add-on packages, NumPy [6, 31] and SciPy [7, 20]. In addition, the Intel Xeon Phi coprocessor has emerged as a cheap yet efficient accelerator in recent years. However, applications run on the coprocessor must be written in C/C++ or Fortran, not Python. Therefore, support for running Python applications on the coprocessor is highly desirable. pyMIC [22] acts as an offloading module which helps move Python code to the coprocessor and run it. In particular, pyMIC provides two interfaces for programmers: one for C code and the other for Python code. In other words, pyMIC provides a bridge between the Python interface and the C interface that simplifies programming for application developers who want to run Python applications on the Intel Xeon Phi coprocessor. Figure 1 shows the architecture of pyMIC as a 4-layer module. The highest layer encompasses Python functions which can be used directly by applications written in Python. Underneath this layer, a Python extension (_pyMICimpl) written in C/C++ operates as a connector between the highest layer and the lowest layer (Intel LEO runtime). The Intel LEO runtime, which is based on the Intel Language Extensions for Offloading, provides Intel Composer XE and C/C++ pragmas which can be run directly on IXPKC. In order to call C/C++ functions from the pyMIC layer, _pyMICimpl is implemented by taking advantage of the Cython mechanism [3]. Besides,


Fig. 1 Architecture of the pyMIC module

the pyMIC module includes the data structure offload_array, which is compatible with NumPy and provides standard methods implementing all the array operations. Thanks to this data structure, the pyMIC module can easily reuse the familiar NumPy arrays [6, 31].
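The role of such a data structure can be illustrated with a small stand-in written with NumPy only. The class below is hypothetical: it is not pyMIC's implementation or API, and its "device" buffer is simply a second NumPy array on the host. It only shows the idea of a NumPy-compatible wrapper that keeps a host copy and a device copy and synchronizes them explicitly.

```python
import numpy as np


class MockOffloadArray:
    """Hypothetical stand-in for a device-resident array (not pyMIC's class)."""

    def __init__(self, host_array):
        self.host = np.asarray(host_array)  # data visible to Python code
        self.device = self.host.copy()      # simulated coprocessor buffer

    def update_device(self):
        """Copy host -> device (stands in for a transfer to the coprocessor)."""
        self.device[...] = self.host

    def update_host(self):
        """Copy device -> host (stands in for a transfer back) and return it."""
        self.host[...] = self.device
        return self.host

    def __add__(self, other):
        # The arithmetic touches only the simulated device buffers.
        return MockOffloadArray(self.device + other.device)


a = MockOffloadArray(np.ones(4))
b = MockOffloadArray(np.full(4, 2.0))
c = (a + b).update_host()
```

Keeping the two copies explicit makes every transfer visible in the code, which matters because transfers between host and coprocessor are part of the measured execution time.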

4 Chainer-XP

In this section, we describe Chainer in detail as well as the changes in Chainer-XP that make it possible to operate on IXPKC. Compared with the existing frameworks for DL, Chainer is one of the most flexible and provides a straightforward way to build complex DL architectures. Python was used to implement the whole of Chainer, thanks to its popularity, its simplicity, and the contribution of many scientific computing libraries such as NumPy [6, 31] and Cython [3]. Chainer provides different APIs based on the “Define-by-Run” approach (also known as dynamic computational graphs) as well as object-oriented high-level APIs to build and train neural networks. It also supports CUDA/cuDNN through a self-developed library called cupy [28], which is partially compatible with NumPy, for high-performance training and inference. Chainer also provides several popular optimization methods such as stochastic gradient descent (SGD) [14], Adam [21], and AdaGrad [17], as well as many basic numerical operations for ANN computation, such as convolutions, losses, and activation functions, which are implemented as sub-classes of Function.

Basically, Chainer uses an approach called “Define-by-Run” to perform computation in ANNs. In other words, Chainer does not fix the computation graph of the model before training. Instead, the graph structure is used to implicitly memorize the whole computation when the forward propagation is executed. By adopting the “Define-by-Run” scheme, the network written in Chainer is defined on-the-fly via the actual computation. In “Define-and-Run” frameworks, the entire computational graph always remains in memory regardless of whether certain layers are no longer being used in the backpropagation. In contrast, “Define-by-Run” allows each forward step to be stored when data pass by and freed from memory after being used for the backpropagation, which makes Chainer use memory more efficiently.

In addition, Chainer implements some general classes such as Variable and Link and describes the input, the output, and even each layer of an ANN as sub-classes of them, each containing an array that stores values, weights, or gradients. Thanks to that mechanism, users are able to see


the intermediate output or what is happening inside their model, and they can easily control the computation flows. This helps developers determine what is wrong with the model or even improve results and performance. Thus, it can be said that Chainer is well suited to debugging. In order to write a DL application, programmers must build a computational graph by dynamically “chaining” various kinds of Links and Functions to define a network. With Chainer, the network is executed by running the chained graph, hence they call it Chainer. There are some low-level core components of Chainer:

– Variable: In Chainer, Variable (Fig. 2) is the most basic class, and it has a structure to keep track of computation. Every variable holds a data attribute, which is an array of type either numpy.ndarray or cupy.ndarray used to store its value. Similarly, its grad attribute is also an array, storing the gradient value created when backward() is called. As mentioned above, Variable uses an attribute named creator to memorize all Functions it has passed by, and then uses them in the backward propagation. Moreover, the backward() method is also implemented inside Variable. When it is called, all Functions stored in creator are popped out and then perform their own backward on this Variable’s data sequentially to produce the gradient and store it in grad. Besides, Variable also has several methods that perform basic tasks for manipulating data as well as grad. In Chainer-XP, we provide two new attributes for Variable, named offl_data and offl_grad, of type offload_array. They hold pointers to two memory regions on IXPKC, which store the array objects corresponding to data and grad on this device. A method called to_mic() is implemented so that programmers can explicitly update both offl_data and offl_grad with the current values of data and grad in host memory. In the opposite direction, Chainer already implements the to_cpu method, which is used to synchronize data and grad from the GPU’s memory to the host’s (CPU) memory. Hence, we do not need to create a new method; we modify the existing one so that, after any computation finishes, it updates data and grad in host memory based on their values calculated on IXPKC or GPU, according to the environment parameter. However, the communication cost between the host and the device may be reduced if we perform more consecutive computations without exchanging data before retrieving the results. In addition, we use the environment parameter as a variable that selects where to execute the computation. Thus, if the computations are independent of the data, we can perform parts of the computation on both IXPKC and GPU simultaneously by setting the environment parameter dynamically.

– Link: Link (Fig. 3) is one of Chainer’s primitive structures used for model definitions. It can be said that Link is a building block of neural network models that supports various features, such as handling parameters, defining network fragments, serialization, and so on. Basically, Link is used to define the connections between layers inside ANNs. At the moment, Chainer supports some popular transformation Links, such as Linear and Convolution. A Link has its own parameters stored in an


attribute named _params. However, other parameters and variables can be registered to a Link via the add_param method and then referred to by their name. This lets the Link class manage its parameters and persistent values and distinguish them from other attributes by name. Parameter is a subclass of Variable which holds the weight matrix of the model. It can be initialized from another Variable or even from array data of type numpy.ndarray or cupy.ndarray. By modifying the Parameter’s constructor function, we support initializing a new Parameter from an offload_array. As with the Variable class, we modify some existing methods and provide several synchronization functions to help programmers manipulate all parameters either on the host or on the device in the most effective way.

– Function: Function is also a primitive structure of Chainer. It describes all basic components that a function needs in order to do backpropagation. Hence, all function implementations defined in the module chainer.functions inherit from this class. The main feature of Function is keeping track of function applications as a backward graph. When a Function is applied to Variable objects, the forward method of this Function is called and executed on the data field of the input variables. After the forward method finishes, the function is memorized in the creator field of these variables. Thanks to that mechanism, the variable knows exactly which function it passed by. Thus, in the backpropagation phase, the backward method of all functions recorded in creator is called in the correct order. If Variable and Link are responsible for managing and storing data in an ANN, then Function is responsible for computing on these data. At present, Chainer provides almost all of the common functions used in deep learning, such as linear and convolution, and activation functions like relu, sigmoid, softmax, and so on.

To make IXPKC fully support DL, all functions mentioned above should be able to perform computation on IXPKC. We find that dot() is one of the most common functions in Chainer. According to the experimental results in Table 1, dot() is also the most time-consuming function when we measure the runtime of each function inside an ANN written in Chainer for a handwritten digit recognition application [25]. As we can see, the ratio between the time spent executing the dot() function and the entire application runtime increases as the batch size (the number of samples propagated through the network) grows. Based on these observations, we decide to support the dot() function, one of the core functions of the Function class, on IXPKC. The results and assessment of our implementation are described in the next section.
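The interplay of Variable, Function, and creator can be sketched with NumPy alone. The classes below are simplified, hypothetical stand-ins and not Chainer's code: each output variable records only the single function that produced it, and the backward walk is sufficient for simple chains (a full implementation would order functions topologically). The `Dot` and `Sum` classes and the sample values are chosen only for illustration.

```python
import numpy as np


class Variable:
    """Simplified stand-in for a define-by-run variable (not Chainer's class)."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.grad = None
        self.creator = None  # function that produced this variable, if any

    def backward(self):
        """Walk the recorded functions from the output back to the inputs."""
        if self.grad is None:
            self.grad = np.ones_like(self.data)
        pending = [self.creator] if self.creator is not None else []
        while pending:
            func = pending.pop()
            grads = func.backward(func.output.grad)
            for x, g in zip(func.inputs, grads):
                x.grad = g if x.grad is None else x.grad + g
                if x.creator is not None:
                    pending.append(x.creator)


class Function:
    """Calling a function records it as the creator of its output."""

    def __call__(self, *inputs):
        self.inputs = inputs
        self.output = Variable(self.forward(*[x.data for x in inputs]))
        self.output.creator = self
        return self.output


class Dot(Function):
    def forward(self, a, b):
        return a.dot(b)

    def backward(self, gy):
        a, b = (x.data for x in self.inputs)
        return gy.dot(b.T), a.T.dot(gy)


class Sum(Function):
    def forward(self, a):
        return a.sum()

    def backward(self, gy):
        return (gy * np.ones_like(self.inputs[0].data),)


# The graph is built while the forward computation runs.
x = Variable([[1.0, 2.0]])
w = Variable([[3.0], [4.0]])
loss = Sum()(Dot()(x, w))
loss.backward()
```

Because the graph exists only as links between objects created during the forward pass, intermediate values such as the output of `Dot` can be inspected with ordinary Python tools, which is the debugging advantage described above.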

Fig. 2 Variable structure [diagram labels: Host Memory, Xeon Phi Memory; Variable fields data, offl_data, grad, offl_grad, creator; functions F1, F2, F3; arrows mean "point to"]

Fig. 3 Link structure [diagram labels: Link, Parameters P1, P2; Host Memory, Xeon Phi Memory; to_cpu(), to_mic()]

Table 1 Dot runtime

Batch size   Dot() function (%)   Other functions (%)
1000         82                   18
2000         83                   17
3000         85                   15
4000         88                   12
5000         91                   9


5 Evaluation

In this section, we evaluate different dot() implementations executed on the CPU and on IXPKC by recording the execution time when dot() is called once and when it is called multiple times. We then present an evaluation study of two ANNs: the first, called ANN-CPU, uses a dot() function run on the CPU, and the second, called ANN-XP, uses a dot() function run on IXPKC. Both ANNs are implemented as handwritten digit recognition applications, one of the most classical applications in ML, running on the MNIST dataset [25]. Before discussing these experiments, we describe the environment and hardware specifications on which the evaluations are performed.

5.1 Environmental Setup

All of the experiments presented in this paper are performed on a compute node with one IXPKC and two CPUs; detailed specifications are given in Table 2. According to prior work [18], the theoretical peak performance of the two Intel Xeon E5 processors and that of the Intel Xeon Phi 7120P coprocessor are about the same, approximately 1 TFLOPS. The coprocessor achieves such high performance thanks to its multithreading mechanism, so all threads must be used effectively in order to exploit its full computing power. However, threads can be migrated from one core to another, depending on the scheduling decisions of the operating system. This degrades performance because a migrated thread must fetch its data into the cache of the new core, as mentioned in the Colfax manual [30]. Fortunately, thread migration on the coprocessor can be inhibited by setting the environment variable KMP_AFFINITY to Scatter, Compact, or one of several other modes [8]. Scatter and Compact are two of the most popular ones.

Table 2 Hardware specifications

Codename               CPU                     Coprocessor
Model                  Intel Xeon E5-2680V3    Intel Xeon Phi 7120P
Microarchitecture      Sandy Bridge EP         Intel Many Integrated Core
Clock frequency        2.50/3.30 GHz           1.24/1.33 GHz
Memory size            128 GB                  16 GB
Cache                  30.0 MB SmartCache      30.5 MB L2
Max memory bandwidth   68 GB/s                 352 GB/s
Core/threads           12/24                   61/244

According to the previous study [30], applications can be divided into two categories: compute-bound applications and bandwidth-bound applications. The first category requires as many threads as possible running simultaneously to do the heavy computation, for example matrix multiplication (the core operation in running an ANN). For this kind of application, the placement of threads can greatly affect performance because threads with adjacent numbers are likely to access the same data or chunks of data that are close together. Compact mode is therefore the best choice, since it places threads with adjacent numbers on the same core so that they share the cache, which increases both spatial and temporal locality. Bandwidth-bound applications, by contrast, have been shown to work well in Scatter mode, which places threads with adjacent numbers on different cores. The reason is that applications bounded by memory bandwidth tend to use very few threads per core on the coprocessor in order to reduce thread contention on the memory controllers.

Inefficient data access can also reduce the performance of the entire process. By default, data are always allocated on 4 KB pages, which may cause frequent page faults when a large amount of data is accessed continuously. By contrast, if a threshold value is set in MIC_USE_2MB_BUFFERS, memory is allocated on large 2 MB pages whenever the data offloaded to the coprocessor exceed that value. More data can then be accessed with fewer pages, which not only improves performance but also lowers the allocation cost.

In summary, to achieve the best performance, our system is configured as follows: KMP_AFFINITY=compact, MIC_USE_2MB_BUFFERS=16K. All the experiments are run in a Python 2.7.5 environment. Moreover, Intel compiler version 17.0.4 20170411 and Intel OpenMP version 5.0.20170308 are used to compile our code.
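As a minimal sketch, the two settings above could be applied from Python before the offload runtime is loaded. The helper name configure_offload_env is hypothetical, and whether a particular runtime reads these variables at import time or at first offload is an assumption; setting them before importing any offload module is the conservative choice.

```python
import os

# Values reported in the text; the variable names are the ones given there.
OFFLOAD_ENV = {
    "KMP_AFFINITY": "compact",      # keep adjacent threads on the same core
    "MIC_USE_2MB_BUFFERS": "16K",   # use 2 MB pages above this buffer size
}


def configure_offload_env(settings=OFFLOAD_ENV, environ=os.environ):
    """Copy the settings into the environment and return what was applied.

    Hypothetical helper: it only sets environment variables and does not
    talk to any coprocessor runtime itself.
    """
    for name, value in settings.items():
        environ[name] = value
    return {name: environ[name] for name in settings}


applied = configure_offload_env()
```

Equivalently, the same two variables can be exported in the shell that launches the Python interpreter.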

5.2 Experimental Results

As mentioned above, matrix multiplication is the core operation in ANN computation, so in our first experiment we measure the computation time of the dot() function to compare performance. Square matrices of sizes from 1,000 to 10,000 are multiplied by two implementations of dot(): the one run on the Intel Xeon Phi coprocessor is called dot-XP, and the one run on the CPU is called dot-CPU. Because exchanging data takes additional time, the execution time of dot-XP consists of computation time and communication time, while dot-CPU has computation time only. Figure 4 shows that the computation time of dot-XP is always lower than that of dot-CPU. The reason is that most of the computations in dot() are performed on independent data blocks, so dot() parallelizes well. However, not all operations in the matrix multiplication can be parallelized; some synchronization is required during dot() execution. Accordingly, the experimental results show that dot-XP is only approximately 2–3 times faster. More specifically, dot-XP is 2.25 times faster than dot-CPU when multiplying two 1,000 × 1,000 matrices. This ratio is not fixed; it increases slightly as the size of the matrices increases. When the matrix size reaches 10,000 × 10,000, dot-XP is 3.11 times faster than dot-CPU, which shows that IXPKC has great potential for computation with large matrices.

Fig. 4 Dot-XP and Dot-CPU execution [bar chart; x-axis: Size of Square Matrix (1,000 to 10,000); y-axis: Time (s); legend: dot-CPU computation time, dot-XP computation time, dot-XP communication time]

However, performing computation on IXPKC requires data to be transferred from the host to the device and back. Based on the obtained results, communication accounts for about 75% of the execution time, and this ratio stays roughly unchanged as the matrix size grows. Clearly, if we perform only one computation per data transfer session, the synchronization slows down the process significantly. As a result, the total execution time of dot-XP is 1.88 times longer than that of dot-CPU for a matrix size of 1,000 and 1.38 times longer for a matrix size of 10,000.

In the second experiment, instead of increasing only the matrix size, we increase the number of computations per data transfer session and compute the average runtime. The results are shown in Fig. 5. As expected, when more computations are performed per data exchange session, IXPKC becomes more efficient relative to the CPU, because sending the execution commands adds almost no communication. For example, executing 2 matrix multiplications on IXPKC is only about 1.21 times slower than on the CPU, whereas with 5 computations the runtime on IXPKC, including communication, is 1.34 times faster than on the CPU. Therefore, we can improve performance by transferring long sequences of consecutive computations to IXPKC and performing other computations on the CPU while waiting for the results.

Fig. 5 Multi-dot execution [bar chart; x-axis: Time (s); bars: 1, 2 and 5 times on CPU and on XP; legend: ANN-XP computation time, ANN-XP communication time]

In the next experiment, to evaluate the performance of the matrix multiplication function in a deep learning application, we measure the execution time of two modified ANNs that recognize handwritten digits. One of them, called ANN-CPU, uses the dot() of NumPy run on the CPU; the other, called ANN-XP, uses the dot() of the modified pyMIC run on IXPKC. Both ANNs contain three layers, and the number of units in the hidden layer ranges from 1,000 to 10,000. During the evaluation, we again separate computation time from communication time when measuring dot() execution time in each ANN, as in the previous experiment.

Fig. 6 ANN-CPU and ANN-XP [bar chart; x-axis: Unit (1,000 to 10,000); y-axis: Time (s); legend: ANN-CPU computation time, ANN-XP computation time, ANN-XP communication time]

As shown in Fig. 6, the total runtime of ANN-XP is always higher than that of ANN-CPU, but the computation time of ANN-XP shows that the computing power of IXPKC is exploited effectively. For instance, with 1,000 units, a batch size of 60,000 and 1-channel images of size 28 × 28, the largest matrix product that dot() must perform is [784 × 60,000] × [60,000 × 1,000], and the computation time of ANN-XP is 29.84% shorter than that of ANN-CPU. This figure increases markedly as the number of units grows, reaching 61.92% when the ANNs have 10,000 units. This can be explained by the fact that dot() accounts for most of the computation time, about 90%. In other words, the largest computations are mostly performed on IXPKC, and the rest of the ANN computation does not require many computing resources on the CPU. Based on our experience in running ANN-XP in Docker [5] containers with different numbers of CPU cores, ANN-XP's computation time was not significantly affected by the CPU computing power. Therefore, the CPU's capabilities can be shared with other scientific computations at the same time.

However, in ANNs the matrix multiplication operations are not executed back to back. Each dot() execution triggers data synchronization, which leads to severe performance degradation. Data communication between the CPU and the Xeon Phi device takes approximately 80% and 73% of the total runtime for ANN-XP with 1,000 and 10,000 units per layer, respectively. Thus, in order to reduce communication time and boost the performance of DL applications, all computations on the data must be executed continuously on IXPKC with minimal synchronization. In other words, not only dot() but also the other computations must be performed on IXPKC.
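The trade-off between transfer cost and compute speed-up can be sketched with a simple cost model. This is an illustrative assumption, not the authors' measurement procedure: it supposes that the transfer cost is paid once per session and does not depend on the number of multiplications k, and it normalizes one CPU multiplication to 1.0. The inputs are chosen to match the figures reported for 1,000 × 1,000 matrices (2.25 times faster computation, 1.88 times slower in total for a single call), so the model is not expected to reproduce the ratios measured in Fig. 5 exactly.

```python
import math


def offload_time(k, cpu_time, speedup, transfer_time):
    """Modelled time for k multiplications in one transfer session."""
    return transfer_time + k * cpu_time / speedup


def break_even(cpu_time, speedup, transfer_time):
    """Smallest k for which the offloaded session is strictly faster
    than running the same k multiplications on the CPU."""
    saved_per_call = cpu_time - cpu_time / speedup
    if saved_per_call <= 0:
        return None  # offloading never pays off without a speed-up
    return math.floor(transfer_time / saved_per_call) + 1


# Illustrative inputs (assumptions derived from the reported ratios).
CPU_TIME = 1.0                        # one multiplication on the CPU
SPEEDUP = 2.25                        # compute speed-up on the coprocessor
TRANSFER = 1.88 - CPU_TIME / SPEEDUP  # makes a single call 1.88x slower

k_star = break_even(CPU_TIME, SPEEDUP, TRANSFER)
```

Under these assumed inputs the break-even point is three multiplications per session, which is in line with the observation above that two multiplications are still slower on IXPKC while five are faster.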

6 Related Work

Several deep learning frameworks can be run on systems with Intel Xeon Phi coprocessors, such as Neon, TensorFlow, Theano, Chainer, etc. However, all of them can only be run on Intel Xeon Phi Knights Landing coprocessors, not Knights Corner ones [4]. Chainer-XP therefore differs from them in the underlying hardware it supports, because it can be run on existing systems containing IXPKC. To the best of our knowledge, few deep learning frameworks are capable of running on IXPKC. Xeon-CafPhi [10] is the only one that facilitates offloading deep learning applications to IXPKC. Like Chainer-XP, Xeon-CafPhi offloads part of the computation of deep learning applications to IXPKC. However, Xeon-CafPhi is built on Caffe, whose approach is "Define-and-run" [29], which is quite different from the "Define-by-run" approach of Chainer-XP [28]. With the "Define-by-run" approach, deep learning applications are easier to debug. Chainer-XP is therefore the first deep learning framework with the "Define-by-run" approach that can be run on IXPKC. Moreover, Chainer-XP is based on the offloading module pyMIC [22], which makes further development of Chainer-XP rather flexible, whereas comparable extensions of Xeon-CafPhi take much more time.


7 Conclusions and Future Work

In this paper, we propose and develop Chainer-XP, an extension of the deep learning framework Chainer that runs on IXPKC. Chainer-XP integrates pyMIC and Chainer with the aim of offloading the core computation of deep learning applications, matrix multiplication, to IXPKC. The experimental results show that Chainer-XP is able to exploit the computing power of the underlying hardware infrastructure containing IXPKC with acceptable performance.

Nonetheless, Chainer-XP still has several limitations. First, Chainer-XP ports only the matrix multiplication code from the CPU to IXPKC by means of the offloading mechanism; it does not offload the other computations. Communication between the host and the device therefore still happens frequently, which leads to a high communication overhead. To tackle this issue, we intend to offload all computations from the CPU to IXPKC. Second, Chainer-XP lets users set up only fully connected neural networks and does not support more complex networks such as CNNs, RNNs, and so on. We therefore intend to extend Chainer-XP to support such networks. Finally, Chainer-XP currently runs on a single IXPKC. We therefore intend to enhance Chainer-XP so that it can be run on systems with many cooperating coprocessors.

Acknowledgements

This research was conducted within the "Studying Tools to Support Applications Running on Powerful Clusters & Big Data Analytics (HPDA phase I 2018-2020)" funded by Ho Chi Minh City Department of Science and Technology (under grant number 46/2018/HDQKHCN). The authors would also like to thank the anonymous referees for their valuable comments and helpful suggestions.

References

1. Caffe. http://caffe.berkeleyvision.org. Accessed 29 May 2018
2. Chainer. https://chainer.org. Accessed 29 May 2018
3. Cython. http://cython.org. Accessed 29 May 2018
4. Deep learning frameworks. http://www.numpy.org. Accessed 29 May 2018
5. Docker. https://www.docker.com. Accessed 29 May 2018
6. Numpy. http://www.numpy.org. Accessed 29 May 2018
7. Scipy. https://www.scipy.org. Accessed 29 May 2018
8. Thread affinity interface. https://software.intel.com/en-us/node/522691. Accessed 21 Dec 2017
9. Top 500. https://www.top500.org. Accessed 29 May 2018
10. Xeon-cafphi. http://rohithj.github.io/Xeon-CafPhi. Accessed 29 May 2018
11. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems (2016). arXiv preprint arXiv:1603.04467
12. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: a system for large-scale machine learning. OSDI 16, 265–283 (2016)
13. Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., Belopolsky, A., et al.: Theano: a python framework for fast computation of mathematical expressions 472, 473 (2016). arXiv preprint arXiv:1605.02688
14. Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin (2012)
15. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
16. Ding, W., Wang, R., Mao, F., Taylor, G.: Theano-based large-scale visual recognition with multiple gpus (2014). arXiv preprint arXiv:1412.2302
17. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
18. Halyo, V., LeGresley, P., Lujan, P., Karpusenko, V., Vladimirov, A.: First evaluation of the CPU, GPGPU and MIC architectures for real time particle tracking based on hough transform at the LHC. J. Instrum. 9(04), P04005 (2014)
19. Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al.: Deep speech: Scaling up end-to-end speech recognition (2014). arXiv preprint arXiv:1412.5567
20. Jones, E., Oliphant, T., Peterson, P.: SciPy: open source scientific tools for Python (2014)
21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
22. Klemm, M., Enkovaara, J.: pymic: a python offload module for the intel xeon phi coprocessor. In: Proceedings of PyHPC (2014)
23. Klemm, M., Witherden, F., Vincent, P.: Using the pymic offload module in pyfr (2016). arXiv preprint arXiv:1607.00844
24. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
25. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
26. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
27. Team, T.T.D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al.: Theano: a python framework for fast computation of mathematical expressions (2016). arXiv preprint arXiv:1605.02688
28. Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in the 29th Annual Conference on Neural Information Processing Systems (NIPS), vol. 5 (2015)
29. Vision, B., Center, L.: Caffe: a deep learning framework (2015)
30. Vladimirov, A., Asai, R., Karpusenko, V.: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors: Handbook on the Development and Optimization of Parallel Applications for Intel Xeon Processors and Intel Xeon Phi Coprocessors. Colfax International (2015)
31. Walt, S.V.D., Colbert, S.C., Varoquaux, G.: The numpy array: a structure for efficient numerical computation. Comput. Sci. Eng. 13(2), 22–30 (2011)

Inverse Problems in Designing New Structural Materials

Daniel Otero Baguer, Iwona Piotrowska-Kurczewski, and Peter Maass

Abstract The development of new structural materials with desirable properties has become one of the most challenging tasks for engineers. High performance alloys are required for the continued development of cars, aircraft and more complex structures. The vast search space for the parameters needed to design these materials makes all established approaches very expensive and time-consuming. A high-throughput screening method has been recently introduced in which many small samples are produced and exposed to different tests. Properties of the materials are predicted by a so-called predictor function that uses the information extracted from these tests. This approach offers not only a quick and cheap exploration of the search space but also the generation of a large data-set containing parameters and predicted material properties. From such a data-set we define a forward operator (Neural Network) mapping from parameters to the material properties and focus mainly on solving the corresponding Inverse Problem: given desired properties a material should have, find the material and production parameters to construct it. We use Tikhonov regularization to reduce the ill-posedness of the problem and a proximal gradient method to find the solution. The main contribution of our work is to exploit an idea from other recent works that incorporate tolerances into the Tikhonov functional.

D. O. Baguer · I. Piotrowska-Kurczewski · P. Maass, Center for Industrial Mathematics (ZeTeM), Bremen, Germany
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_8

1 Introduction

The development of complex structures such as aircraft, cars and buildings depends crucially on the use of materials with specific properties. Mechanical properties such as the ability to withstand surface indentation (strength), the stress at which a material starts to yield plastically (yield strength) or the maximum tensile stress the material can withstand before failure (tensile strength) are just some examples. There are established methods for the development of these materials, but they are based on tedious, expensive and time-consuming processes.

One of the most important steps in the production process of a material is the choice of the chemical composition. In addition, several post-treatments can change the material properties. For example, heating the alloy up to a specific temperature and then cooling it down transforms the micro-structure (Fig. 1) and therefore changes the actual performance of the material. Thus, constructing a material with the desired properties requires tuning both the chemical composition and the parameters used for the post-treatments.

Fig. 1 Micro-structure of a steel after heating it up to 900° (a), 1000° (b), 1100° (c) and quenching [5]

In our setting the sought parameters consist of both the parameters that describe the chemical components and those that are used for the production and post-treatments. The aim is to determine the parameter values that yield a material satisfying a certain performance profile.

Definition 1 Given $\tilde{y} \in \mathbb{R}^m$ and $\varepsilon \in \mathbb{R}^m$, the properties $y \in \mathbb{R}^m$ satisfy the performance profile $(\tilde{y}, \varepsilon)$ if

$$ y \in B(\tilde{y}, \varepsilon) = \{ z \in \mathbb{R}^m \mid \forall\, 1 \le i \le m : \tilde{y}_i - \varepsilon_i \le z_i \le \tilde{y}_i + \varepsilon_i \}. \tag{1} $$
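Definition 1 amounts to a componentwise interval test, which can be sketched as follows. The function name satisfies_profile and the two candidate property vectors are hypothetical and chosen only for illustration; the profile values are the example profile quoted in this section.

```python
def satisfies_profile(y, y_tilde, eps):
    """Return True if y lies in B(y_tilde, eps) as in Definition 1."""
    if not (len(y) == len(y_tilde) == len(eps)):
        raise ValueError("y, y_tilde and eps must have the same length")
    return all(c - e <= v <= c + e for v, c, e in zip(y, y_tilde, eps))


# Example profile from the text: density, Young's modulus, yield strength,
# tensile strength, elongation, hardness.
Y_TILDE = (1500, 70, 280, 275, 8, 5055)
EPS = (1500, 5, 10, 75, 1, 4945)

# Hypothetical candidate materials (illustrative numbers only).
candidate_ok = (2700, 68, 285, 320, 7.5, 450)
candidate_bad = (2700, 68, 295, 320, 7.5, 450)  # yield strength above 290

ok = satisfies_profile(candidate_ok, Y_TILDE, EPS)
bad = satisfies_profile(candidate_bad, Y_TILDE, EPS)
```

A single property outside its interval is enough for the whole profile to be violated, which is the case for the second candidate.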

The number of properties we are considering is $m \approx 6$. An example of a performance profile is shown in Table 1, where $\tilde{y} = (1500, 70, 280, 275, 8, 5055)$ and $\varepsilon = (1500, 5, 10, 75, 1, 4945)$. The interval for the hardness was established based on the maximum possible value for this property, which is 10 000 HV (diamond). The importance of specifying such intervals instead of fixed values is illustrated in Sect. 3.1.1.

Table 1 Example of a performance profile for a material (value ranges computed as $\tilde{y}_i \pm \varepsilon_i$ from the vectors given in the text)

Property                 Value
Density (kg/m³)          0–3000
Young's modulus (GPa)    65–75
Yield strength (MPa)     270–290
Tensile strength (MPa)   200–350
Elongation (%)           7–9
Hardness (HV)            110–10 000

We will assume that there are around 40 parameters that we need to determine. Varying these parameters produces a very large number of possible combinations. For example, if each of them could take 10 different values, which is rather simplistic, there would be at least $10^{40}$ possible settings. The search space is therefore so large that an exhaustive search is completely impractical. In order to overcome this problem, a high-throughput screening technique, based on quickly testing small-volume samples, has recently been introduced [9]. These samples have the form of small balls with a diameter of less than 1 mm (Fig. 2). The screening data include different types of measurements such as laser-based micro-hardness [7], nano-indentation and dilatometry, among other tests. The values extracted from those measurements, which are called descriptors, are not directly related to the properties of the material. They do not measure the real strength, for example, and they cannot simply be scaled as in other similar applications. However, a predictor function [8] has been developed that, given the descriptors of a small sample, predicts the most probable properties of the material.

Fig. 2 Micro balls with a diameter of less than 1 mm used to quickly extract descriptors that support the prediction of the material properties

2 Predictor Function and Forward Operator

The predictor function $\psi_{\mu}^{M} : \mathbb{R}^d \to \mathbb{R}^m$ is a mapping from micro to macro scale that assigns the material properties that are most likely given the descriptors extracted from a micro sample. It is developed using a data-driven approach. The data are obtained from experiments in which micro as well as macro samples are built using correlated parameters. From the micro samples the descriptors $z \in \mathbb{R}^d$ are extracted, and from the macro samples the properties $y \in \mathbb{R}^m$ ($m \approx 6$) are measured. Such a predictor function supports a quick exploration of the search space, for example by means of a heuristic search [8], since only experiments on micro dimensions are needed to evaluate (predict) the performance of the materials. An overview of the whole methodology is given in [8, 9].

In this work we present a more analytical approach. We also develop a data-driven model $\varphi : \mathbb{R}^n \to \mathbb{R}^m$, but it predicts the properties of the materials directly from the parameters. Our approach is then to model the problem as an inverse problem with forward operator $\varphi$. In order to train such a model $\varphi$, we need many data points containing parameters $x^{(i)} \in \mathbb{R}^n$ and properties $y^{(i)} \in \mathbb{R}^m$ of different materials, which would be rather expensive to obtain from macro experiments. However, we can generate such a data-set by performing a large number of micro experiments and predicting the properties of the corresponding materials using the predictor function. We develop $\varphi$ using a Neural Network. This is based on the fact that Neural Networks can approximate almost any kind of function arbitrarily well [6, 14]. More details are given in Sect. 5.

3 Inverse Problem

An inverse problem is concerned with the reconstruction of an unknown quantity from indirect measurements, which frequently contain some kind of noise [18]. If we assume for a moment that $\varepsilon = 0$, i.e. we want a material with exact properties $y^\delta$, then we need to reconstruct $x^\dagger \in \mathbb{R}^n$ from $y^\delta \in \mathbb{R}^m$, where

$$ y^\delta = y^\dagger + \eta \tag{2} $$

with $y^\dagger = \varphi(x^\dagger)$ and $\|\eta\| < \delta$. Usually $\eta$ represents the noise and is unknown, although sometimes we can make assumptions on its probability distribution and its magnitude $\delta \in \mathbb{R}$. In this case the data $y^\delta$ contain the central values of the performance profile and are actually not the result of any measurement but a guess, i.e. the desired properties. The noise $\eta$ corresponds to the difference between $y^\delta$ and the closest $y^\dagger$ such that there exists $x^\dagger \in \mathbb{R}^n$ with $\varphi(x^\dagger) = y^\dagger$.

Our forward operator $\varphi$ is non-linear, and we assume that the inverse problem is ill-posed because any of the conditions of well-posedness [18] might not be satisfied. It could happen that there are no parameters whose most likely corresponding properties satisfy the performance profile, or there could be many of them, or they could be highly unstable with respect to changes in the performance profile. Therefore, special techniques called regularization methods should be used [18]. To reduce the ill-posedness we use the Tikhonov regularization approach [10, 18, 23]. We set $J_\alpha : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ with

$$ J_\alpha(x, y^\delta) = S(\varphi(x), y^\delta) + \alpha R(x) \tag{3} $$

and define the operator $T_\alpha : \mathbb{R}^m \to \mathbb{R}^n$ as

$$ T_\alpha(y^\delta) = \arg\min_{x \in \mathbb{R}^n} J_\alpha(x, y^\delta), \tag{4} $$

which is our approximation of the inverse, i.e. the solution to our problem is given by $T_\alpha(y^\delta)$. In (3) the discrepancy term $S : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ measures how close $\varphi(x)$ is to $y^\delta$ in some sense (Sect. 3.1), and the regularization term $R : \mathbb{R}^n \to \mathbb{R}$ contributes to reducing the instability. This term exploits important prior information about the true solution. For example, not every $x \in \mathbb{R}^n$ contains feasible values for the parameters, since there are conditions, such as ranges, that must be respected (Sect. 3.2). The scalar value $\alpha \in \mathbb{R}_+$ determines the trade-off between the discrepancy and the regularization.

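3.1 Choosing the Discrepancy Term

The discrepancy term $S$ should be such that $S(\varphi(\tilde{x}), y^\delta) \approx 0$ if $\varphi(\tilde{x}) \in B(y^\delta, \varepsilon)$ holds for $\tilde{x} \in \mathbb{R}^n$. Additionally, when this does not hold, it should measure how far $\varphi(\tilde{x})$ is from the desired values (Fig. 3). For this purpose we use the ε-insensitive function

$$ |z|_\varepsilon = \begin{cases} 0 & : |z| \le \varepsilon \\ |z| - \varepsilon & : |z| > \varepsilon \end{cases} \tag{5} $$

for $z \in \mathbb{R}$ and $\varepsilon \ge 0$. For all $y \in \mathbb{R}^m$ and $\varepsilon \in \mathbb{R}^m$ we define a semi-norm by

$$ \|y\|_\varepsilon^2 = \sum_{j=1}^{m} |y_j|_{\varepsilon_j}^2 \tag{6} $$

and we use it for the discrepancy term, so that from now on we will take

$$ S_\varepsilon(\varphi(x), y) = \frac{1}{2} \|\varphi(x) - y\|_\varepsilon^2. \tag{7} $$

Fig. 3 Plot of $|\cdot|_\varepsilon^2$ for ε = 0.2 (solid red) and ε = 0.5 (broken blue)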
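Equations (5) to (7) translate directly into a few lines of code. The sketch below uses hypothetical function names and small illustrative vectors; it only evaluates the discrepancy for given outputs and does not involve a forward operator.

```python
def eps_insensitive(z, eps):
    """Scalar epsilon-insensitive function |z|_eps of Eq. (5)."""
    return max(abs(z) - eps, 0.0)


def discrepancy(phi_x, y, eps):
    """S_eps(phi(x), y) = 0.5 * ||phi(x) - y||_eps^2 as in Eqs. (6)-(7).

    phi_x, y and eps are sequences of equal length m.
    """
    return 0.5 * sum(
        eps_insensitive(p - t, e) ** 2 for p, t, e in zip(phi_x, y, eps)
    )


# Illustrative values: the first two components lie inside their tolerance
# bands, the third one overshoots its band by 0.5.
outputs = (1.0, 2.1, 4.0)
targets = (1.0, 2.0, 3.0)
tolerances = (0.0, 0.2, 0.5)

value = discrepancy(outputs, targets, tolerances)
```

Only the component that leaves its tolerance band contributes, so the discrepancy vanishes exactly when the outputs satisfy the performance profile.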


Fig. 4 Simple example. The broken blue line indicates the performance obtained without considering the tolerances and the continuous red line indicates the performance obtained when considering them. The result from the first approach is outside the profile indicated by the grey area

A similar idea is used in recent works such as [16], where regularization properties for integral forward operators are shown. Other work [11] uses the ε-insensitive function for solving ill-posed problems in more general settings.

3.1.1 A Simple Example

To highlight the importance of using such a discrepancy term and not just the distances to the interval midpoints, we analyze a rather simple scenario. Assume that the forward operator $\varphi : \mathbb{R}^n \to \mathbb{R}^3$ is defined as

$$ \varphi(x_1, x_2, \ldots, x_n) = (0.4,\; x_1,\; 1 - x_1) = (y_1, y_2, y_3) \tag{8} $$

and the performance profile is given by $y^\delta = (0.4, 0.5, 0.3)$, $\varepsilon = (0, 0, 0.2)$. If we find $x \in \mathbb{R}^n$ that minimizes the usual distance $\|\varphi(x) - y^\delta\|^2$, we obtain $\hat{x}$ with $\hat{x}_1 = 0.6$ and $\varphi(\hat{x}) = (0.4, 0.6, 0.4) \notin B(y^\delta, \varepsilon)$, which does not fit the performance profile. However, if we minimize (7) we obtain $\hat{x}_1 = 0.5$ and $\varphi(\hat{x}) = (0.4, 0.5, 0.5) \in B(y^\delta, \varepsilon)$, which does fit the given performance profile. This example is illustrated in Fig. 4.
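The two minimizers in this example can be checked numerically. The sketch below replaces the optimization by a brute-force search over a grid of values for $x_1$ in [0, 1]; the grid, the tolerance in the membership test and all function names are assumptions made for this illustration.

```python
Y_DELTA = (0.4, 0.5, 0.3)
EPS = (0.0, 0.0, 0.2)


def phi(x1):
    """Forward operator of Eq. (8); only x1 influences the output."""
    return (0.4, x1, 1.0 - x1)


def squared_distance(x1):
    """Usual squared Euclidean distance to the interval midpoints."""
    return sum((p - t) ** 2 for p, t in zip(phi(x1), Y_DELTA))


def eps_discrepancy(x1):
    """Discrepancy (7) built from the epsilon-insensitive function."""
    return 0.5 * sum(
        max(abs(p - t) - e, 0.0) ** 2 for p, t, e in zip(phi(x1), Y_DELTA, EPS)
    )


def in_profile(y, tol=1e-12):
    """Membership in B(y_delta, eps), with a small floating-point tolerance."""
    return all(t - e - tol <= v <= t + e + tol
               for v, t, e in zip(y, Y_DELTA, EPS))


GRID = [i / 1000 for i in range(1001)]       # candidate values for x1
x_plain = min(GRID, key=squared_distance)    # ignores the tolerances
x_eps = min(GRID, key=eps_discrepancy)       # takes the tolerances into account
```

On this grid the plain distance is minimized at $x_1 = 0.6$, whose output violates the profile, while the ε-insensitive discrepancy is minimized at $x_1 = 0.5$, whose output satisfies it.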

3.2 Reducing Ill-Posedness

In the regularization term we try to exploit prior information about the true solution. This is a common strategy in Inverse Problems; some examples are [2, 12]. From the parameter definitions we obtain linear boundary conditions and ranges that let us define a convex bounded set $C \subset \mathbb{R}^n$. For example, if there are two heating processes, a boundary constraint might be that the temperature of the second one has to be higher than the temperature of the first one. We include this information $x \in C$ in the regularization term by means of the indicator function

$$ I_C(x) = \begin{cases} 0 & : x \in C \\ \infty & : x \notin C. \end{cases} \tag{9} $$

Although this helps us to reject infeasible solutions, it does not help us reduce instability and does not ensure that the solution is unique. Moreover, a set of parameters $x$ may lie in $C$ while the production process still fails when using these parameters. Therefore, we choose some point $x^* \in C$ and add $\|x - x^*\|_2^2$ to the regularization term. This is a standard approach in Tikhonov regularization [23]. We propose several strategies for selecting $x^*$, which have slightly different interpretations:

1. The point that maximizes the minimum distance to any of the boundaries. It can be easily obtained by solving a linear optimization problem [3]. The motivation behind this choice is that solutions close to the boundaries tend to be problematic in the production process.
2. Given a set of $N$ true solutions $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, i.e. parameters that have already been tried without production problems, let $x^*$ equal the centroid of these points,

\[
x^* = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}, \tag{10}
\]

which is the point that minimizes the sum of squared Euclidean distances between itself and each other point.

The second choice is the one we use in our experiments. Finally, we define

\[
R(x) = I_C(x) + \frac{1}{2} \|x - x^*\|_2^2 \tag{11}
\]

as the regularization term and

\[
J_{\alpha,\varepsilon}(x, y^\delta) = S_\varepsilon(\varphi(x), y^\delta) + I_C(x) + \frac{\alpha}{2} \|x - x^*\|_2^2 \tag{12}
\]

as the functional to minimize, i.e. the approximation to the inverse is given by

\[
T_{\alpha,\varepsilon}(y^\delta) = \arg\min_{x \in \mathbb{R}^n} J_{\alpha,\varepsilon}(x, y^\delta). \tag{13}
\]
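The centroid (10) and the regularization term (11) can be sketched in a few lines. This is only an illustration under the assumption that $C$ is a box given by component-wise bounds (as in the example of Sect. 5); the function names and the sample data are hypothetical and not taken from the authors' implementation.

```python
import math

def centroid(points):
    """Component-wise mean of known-good parameter vectors, as in (10)."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def indicator_box(x, lower, upper):
    """Indicator function (9) for an assumed box C: 0 inside, infinity outside."""
    inside = all(lo <= xj <= hi for xj, lo, hi in zip(x, lower, upper))
    return 0.0 if inside else math.inf

def regularizer(x, x_star, lower, upper):
    """R(x) = I_C(x) + 0.5 * ||x - x*||_2^2, as in (11)."""
    sq_dist = sum((xj - sj) ** 2 for xj, sj in zip(x, x_star))
    return indicator_box(x, lower, upper) + 0.5 * sq_dist

# Hypothetical data: three previously successful parameter vectors in R^2.
known_good = [[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]]
lower, upper = [0.0, 0.0], [1.0, 1.0]

x_star = centroid(known_good)
r_inside = regularizer([0.4, 0.8], x_star, lower, upper)   # finite penalty
r_outside = regularizer([1.2, 0.6], x_star, lower, upper)  # infinite: outside C
```

Points outside the box receive an infinite penalty, while feasible points are penalized by their squared distance to the centroid.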

3.3 Optimization

Evaluating $T_{\alpha,\varepsilon}$ requires the solution of an optimization problem. Since $\varphi$ and $\|\cdot\|_\varepsilon^2$ are both differentiable, the discrepancy term $S_\varepsilon(\varphi(x), y)$ is also differentiable.

We now show how to obtain the gradient $\nabla_x S_\varepsilon(\varphi(x), y)$ by means of the back-propagation algorithm [13] with a simple modification. The standard back-propagation algorithm yields the gradient of the mean squared error of the network for one or more observations $y^{(i)}$. The algorithm works for the gradients with respect to the parameters as well as with respect to the input. In this case we are only interested in the gradient with respect to the input $x$. With the standard back-propagation we obtain

\[
\nabla_x \frac{1}{2} \|\varphi(x) - y^\delta\|_2^2 = [\nabla_x \varphi(x)]^T (\varphi(x) - y^\delta), \tag{14}
\]

but in our case we need

\[
\nabla_x \frac{1}{2} \|\varphi(x) - y^\delta\|_\varepsilon^2 = [\nabla_x \varphi(x)]^T |\varphi(x) - y^\delta|_\varepsilon, \tag{15}
\]

where $|z|_\varepsilon$, for $z \in \mathbb{R}^m$, is the component-wise evaluation of the ε-insensitive function.

Definition 2 (ε-insensitive back-propagation) If after the forward-pass of the standard back-propagation algorithm, i.e. after evaluating $\varphi(x)$, we modify the target output $y^\delta \in \mathbb{R}^m$ in the following way

\[
\hat{y}^\delta_j = \begin{cases}
\varphi(x)_j & : |\varphi(x)_j - y^\delta_j| \le \varepsilon_j, \\
y^\delta_j + \varepsilon_j & : \varphi(x)_j - y^\delta_j > \varepsilon_j, \\
y^\delta_j - \varepsilon_j & : \varphi(x)_j - y^\delta_j < -\varepsilon_j,
\end{cases}
\qquad \forall\, 1 \le j \le m, \tag{16}
\]

and run the standard back-propagation algorithm on $\hat{y}^\delta \in \mathbb{R}^m$, we obtain the desired gradient (15). We call this algorithm ε-insensitive back-propagation.

We have seen that $S_\varepsilon(\varphi(x), y)$ is differentiable and have given a practical way to compute its gradient, and that $R(x)$ is non-differentiable but convex. The direct evaluation of $J_{\alpha,\varepsilon}$ could lead to significant numerical problems. However, it can be split into these two terms in order to evaluate them and exploit their properties separately. We do that by means of the Proximal Gradient Method [21], also known as forward-backward splitting [4]. The method consists of the iteration

\[
x^{k+1} = \operatorname{prox}_{\lambda_k \alpha R}\big( x^k - \lambda_k \nabla_x S_\varepsilon(\varphi(x^k), y^\delta) \big), \tag{17}
\]

where

\[
\operatorname{prox}_{\lambda g}(v) = \arg\min_{x \in \mathbb{R}^n} \Big\{ g(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \Big\}
\]

for $\lambda > 0$ and $g : \mathbb{R}^n \to \mathbb{R}$ a convex function. In this case $g(x) = \alpha R(x)$.
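The target modification (16) from Definition 2 can be sketched as follows. The helper names and the sample numbers are illustrative assumptions; the sketch only shows how the modified target makes the residual $\varphi(x) - \hat{y}^\delta$ vanish inside the tolerance band and measure the distance to the nearest band edge outside of it.

```python
def eps_insensitive_target(phi_x, y_delta, eps):
    """Modified target (16), evaluated component-wise."""
    target = []
    for p, y, e in zip(phi_x, y_delta, eps):
        if abs(p - y) <= e:
            target.append(p)       # inside the tolerance: zero residual
        elif p - y > e:
            target.append(y + e)   # above the band: upper edge becomes the target
        else:
            target.append(y - e)   # below the band: lower edge becomes the target
    return target

def residual(phi_x, y_delta, eps):
    """Residual phi(x) - y_hat that standard back-propagation would propagate."""
    y_hat = eps_insensitive_target(phi_x, y_delta, eps)
    return [p - t for p, t in zip(phi_x, y_hat)]

# Hypothetical network output, noisy target and tolerances (m = 3).
phi_x = [1.0, 2.0, 3.0]
y_delta = [1.05, 1.5, 3.5]
eps = [0.1, 0.2, 0.3]

res = residual(phi_x, y_delta, eps)  # approximately [0.0, 0.3, -0.2]
```

Feeding the modified target into an unmodified back-propagation routine is what allows an existing implementation to be reused without changes to its backward pass.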


Fig. 5 Evaluation of the discrepancy term during the proximal gradient iterations from (17) for 5 different randomly generated starting points. The broken red lines correspond to the standard version (PG) and the continuous blue lines to the accelerated one (APG). The forward operator is a Neural Network with 14 inputs and 3 outputs

If $\nabla_x S_\varepsilon$ is Lipschitz continuous with constant $L$, it can be shown that this iteration converges with rate $O(1/k)$ when a fixed step size $\lambda_k \in (0, 2/L]$ is used. Moreover, accelerated versions exist which converge with rate $O(1/k^2)$ under the same conditions [17, 19, 20]. This is achieved by adding an extrapolation step before doing the standard iteration. The version we use is

\[
z^{k+1} = x^k + w_k (x^k - x^{k-1}), \tag{18}
\]
\[
x^{k+1} = \operatorname{prox}_{\lambda_k \alpha R}\big( z^{k+1} - \lambda_k \nabla_x S_\varepsilon(\varphi(z^{k+1}), y^\delta) \big), \tag{19}
\]

with $w_k = \frac{k}{k+3}$. A comparison of the convergence of the accelerated and non-accelerated versions for several starting points is shown in Fig. 5.

Since $\varphi$ is the composition of simple functions [13], an upper bound $\hat{L}$ for $L$ could be computed and we could set $\lambda_k = 2/\hat{L}$. However, this upper bound might be too large, resulting in slow convergence. Instead, we compute the step sizes $\lambda_k$ by a line search in each iteration, which is a standard approach when $L$ is unknown [21].

Now we show how to evaluate the proximal operator. In our case

\[
\operatorname{prox}_{\lambda \alpha R}(v) = \arg\min_{x \in \mathbb{R}^n} \Big\{ I_C(x) + \frac{\alpha}{2} \|x - x^*\|_2^2 + \frac{1}{2\lambda} \|x - v\|_2^2 \Big\}. \tag{20}
\]

Computing (20) directly seems to require solving another optimization problem. However, it can be shown that it is equivalent to another proximal operator with a changed argument,

\[
\operatorname{prox}_{\lambda \alpha R}(v) = \operatorname{prox}_{\tilde{\lambda} I_C}\Big( \frac{\tilde{\lambda}}{\lambda} v + \alpha \tilde{\lambda} x^* \Big), \tag{21}
\]

with $\tilde{\lambda} = \frac{\lambda}{1 + \lambda\alpha}$.

This property can be found in [21] and follows from observing that

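\[
\begin{aligned}
\frac{\alpha}{2} \|x - x^*\|_2^2 + \frac{1}{2\lambda} \|x - v\|_2^2
={}& \frac{1}{2\tilde{\lambda}} \Big\| x - \frac{\tilde{\lambda}}{\lambda} v - \alpha \tilde{\lambda} x^* \Big\|_2^2 \\
& - \frac{1}{2\tilde{\lambda}} \Big\| \frac{\tilde{\lambda}}{\lambda} v + \alpha \tilde{\lambda} x^* \Big\|_2^2 + \frac{\alpha}{2} \|x^*\|_2^2 + \frac{1}{2\lambda} \|v\|_2^2.
\end{aligned} \tag{22}
\]

The last three terms in (22) do not depend on $x$, therefore

\begin{align}
\operatorname{prox}_{\lambda \alpha R}(v)
&= \arg\min_{x \in \mathbb{R}^n} \Big\{ I_C(x) + \frac{\alpha}{2} \|x - x^*\|_2^2 + \frac{1}{2\lambda} \|x - v\|_2^2 \Big\} \tag{23} \\
&= \arg\min_{x \in \mathbb{R}^n} \Big\{ I_C(x) + \frac{1}{2\tilde{\lambda}} \Big\| x - \frac{\tilde{\lambda}}{\lambda} v - \alpha \tilde{\lambda} x^* \Big\|_2^2 \Big\} \tag{24} \\
&= \operatorname{prox}_{\tilde{\lambda} I_C}\Big( \frac{\tilde{\lambda}}{\lambda} v + \alpha \tilde{\lambda} x^* \Big). \tag{25}
\end{align}

The proximal operator of $I_C$ is just the projection operator $\Pi_C$, which can be easily evaluated. In the example we show in Sect. 5, $C$ is determined by the components' bounds, which means that the projection operator only projects the components to the corresponding intervals. Finally, we obtain

\[
\operatorname{prox}_{\lambda \alpha R}(v) = \Pi_C\Big( \frac{\tilde{\lambda}}{\lambda} v + \alpha \tilde{\lambda} x^* \Big). \tag{26}
\]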
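The closed form (26) and the accelerated iteration (18)–(19) can be tried out on a toy problem. The sketch below makes several assumptions that are not part of the text: $C$ is a box, the smooth term is the stand-in $S(x) = \frac{1}{2}\|x - t\|_2^2$ instead of the network-based discrepancy, and a fixed step size replaces the line search. All names are hypothetical.

```python
def project_box(x, lower, upper):
    """Projection onto an assumed box C: clip each component to its interval."""
    return [min(max(xj, lo), hi) for xj, lo, hi in zip(x, lower, upper)]

def prox_alpha_r(v, lam, alpha, x_star, lower, upper):
    """Closed form (26): project (lam_t / lam) * v + alpha * lam_t * x* onto C."""
    lam_t = lam / (1.0 + lam * alpha)
    shifted = [(lam_t / lam) * vj + alpha * lam_t * sj
               for vj, sj in zip(v, x_star)]
    return project_box(shifted, lower, upper)

def accelerated_pg(grad, x0, lam, alpha, x_star, lower, upper, iters=200):
    """Iterations (18)-(19) with w_k = k / (k + 3) and a fixed step size lam."""
    x_prev, x = list(x0), list(x0)
    for k in range(iters):
        w = k / (k + 3.0)
        z = [xj + w * (xj - pj) for xj, pj in zip(x, x_prev)]   # extrapolation
        g = grad(z)
        v = [zj - lam * gj for zj, gj in zip(z, g)]             # gradient step
        x_prev, x = x, prox_alpha_r(v, lam, alpha, x_star, lower, upper)
    return x

# Toy smooth term S(x) = 0.5 * ||x - t||^2 with gradient x - t (Lipschitz constant 1).
t = [2.0, 0.5]
grad_s = lambda x: [xj - tj for xj, tj in zip(x, t)]

x_hat = accelerated_pg(grad_s, x0=[0.0, 0.0], lam=0.5, alpha=1.0,
                       x_star=[0.5, 0.5], lower=[0.0, 0.0], upper=[1.0, 1.0])
```

For this separable toy problem the minimizer can be computed by hand: the unconstrained minimizer is $(t + \alpha x^*)/(1 + \alpha) = (1.25, 0.5)$, and clipping to the box gives $(1, 0.5)$, which the iteration should approach.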

4 Fully Learned Inverse

Our approach consists in first learning a forward operator and then solving the corresponding inverse problem. Another approach would be to directly learn the inverse operator. However, the inverse might not exist or might be highly unstable because of the ill-posedness of the problem. Even in scenarios where the forward operator is linear [1] or if some kind of inverse does exist (Fig. 6), the quality of the obtained results is quite low. A data-driven model $f \approx \varphi^{-1}$ would be trained by minimizing some kind of empirical error based on training points $(y^{(i)}, x^{(i)})$ such as

\[
E = \sum_{i=1}^{N} \| f(y^{(i)}) - x^{(i)} \|^2. \tag{27}
\]

If $\varphi$ is non-injective, there could be different $x^{(i)}$ that correspond to similar $y^{(i)}$, and minimizing (27) will make the learned inverse model output the mean of those $x^{(i)}$, which is not desirable in most cases. In Fig. 6, the forward operator is $\varphi(x) = x^2$, which maps both $x$ and $-x$ to the same $y = x^2$. For each $x$ the learned inverse will try to be close to both points $(-x, x^2)$ and $(x, x^2)$, and will arrive near $(0, x^2)$. In order to avoid this, one could require training points that fulfill $x \ge 0$. However, this kind of prior knowledge on how to properly sample $x$ is usually not available because $\varphi$ is much more complex and has a higher level of non-injectivity.

Fig. 6 The learned inverse (solid red line), which is plotted with reversed axes, tries to approximate the training points, but due to the non-injectivity of $\varphi$ the result is always a compromise between negative and positive roots

Table 2 Parameters of the first entry in the training set. The first 13 entries correspond to the chemical composition and are given in %

C     Mn    Si    Cr   Ni     Mo   V     N    Nb    Co    W    Al    Ti    T (°C)
0.01  0.02  0.03  5.0  12.16  3.0  0.01  0.0  0.01  0.01  0.0  0.31  0.21  816.0

Table 3 Properties of the first entry in the training set

Yield strength (YS) (MPa)   Tensile strength (TS) (MPa)   Elongation (E) (%)
1309.8                      1345.6                        15.0
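The averaging effect described for $\varphi(x) = x^2$ can be reproduced with a few lines. The sketch assumes an idealized inverse model that is flexible enough to pick any output value for each observed $y$; under the squared error (27) its best choice for a given $y$ is then the mean of all training parameters $x^{(i)}$ that share this $y$. The sample points are hypothetical.

```python
def best_constant_prediction(targets):
    """The minimizer of sum_i (c - x_i)^2 over c is the mean of the targets."""
    return sum(targets) / len(targets)

# Training parameters sampled symmetrically around zero.
xs = [-2.0, -1.0, 1.0, 2.0]

# Group the parameters by their observation y = phi(x) = x^2.
groups = {}
for x in xs:
    groups.setdefault(x * x, []).append(x)

# Idealized learned inverse: one optimal prediction per observed y.
learned_inverse = {y: best_constant_prediction(v) for y, v in groups.items()}
# Both y = 1.0 and y = 4.0 are mapped to 0.0, which is not a root of either.
```

Restricting the samples to $x \ge 0$ would leave a single parameter per observation, and the same model would recover the positive root.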

5 Numerical Results

In order to test our approach we used a modified version of a materials data-set,¹ where each material contains 14 parameters and 3 properties (in terms of our descriptors-based approach this means that the predictor function was already used to predict the properties of the materials). There are 13 possible chemical elements and the last parameter is the temperature to which the alloy is heated before quenching. We only consider one heating process in this study. Tables 2 and 3 show an entry of this data-set as an example. The data-set contains 413 materials in total.

We implemented the forward operator with a neural network consisting of 14 input neurons, 3 fully-connected hidden layers with 10 neurons each, 3 output neurons and 403 parameters in total. We trained it 20 times starting from different initial random states and created an ensemble model that outputs the mean of those 20 networks. The training was done using Adam [15] with training batches of size 64 (Fig. 7). Before training, the data was scaled to lie in the interval [0, 1]. The ensemble model achieves a relative mean squared error of 0.0067 over the test set (data that has not been seen). In Fig. 8 we show some prediction examples.

Finally, we tested our approach to solve the inverse problem using the tolerances ε and compared it with the approach that does not consider the tolerances, i.e. ε = 0. Additionally, we compared it with a heuristic approach (Nelder-Mead implementation in Python²). We designed 500 performance profiles based on points from the test data-set and 500 randomly generated ones. The results given in Table 4 show that the ε-based approach yields better results in both cases. Some of these profiles and the results are shown in Fig. 9.

¹ https://citrination.com/datasets/153092
² https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html

Fig. 7 Training and test errors of one of the 20 neural networks during the training process. It achieves 0.0075 relative mean squared error over the test set

Fig. 8 First 6 examples from the test-set (panels a–f). All property values are scaled to lie in the interval [0, 1]. The broken blue line shows the original performance of a material from the test-set and the red continuous line shows the predicted performance using the learned forward operator
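The stated total of 403 parameters can be checked from the layer sizes. The short sketch below assumes plain fully-connected layers with one bias per neuron, which is the usual convention but is not spelled out in the text; the function name is hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully-connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 14 inputs, three hidden layers with 10 neurons each, 3 outputs.
sizes = [14, 10, 10, 10, 3]
total = count_dense_parameters(sizes)  # 150 + 110 + 110 + 33
```

With 413 materials in the data-set, the number of trainable parameters is of the same order as the number of samples, which is consistent with the use of an ensemble of 20 independently initialized networks.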

Table 4 Accuracy of the heuristic approach using the Nelder-Mead method for solving the inverse problem (IP + Heuristic), our approach solving the inverse problem using the accelerated PG method without tolerances (IP + APG) and our approach using the accelerated PG method and the tolerances (IP + APG + ε). The percentages indicate the share of instances that were correctly solved, i.e. a solution within the desired intervals was found

                                  IP + Heuristic   IP + APG   IP + APG + ε
Profiles based on the test data   73.00 %          84.80 %    89.20 %
Randomly generated profiles       36.00 %          46.00 %    52.40 %

Fig. 9 The pictures (panels a and b) show the obtained properties (black dots) for 20 of the profiles from the second row of Table 4. The results were obtained using the ε-based approach

Fig. 10 Example with two parameters where the selection of $x^*$ and the term $\|x - x^*\|_2^2$ does not lead to a good regularization. The points $(x^{(i)}, y^{(i)})$ come from successful material productions

6 Further Work

The prior based on the parameter definitions $x \in C$ could be too generic. Moreover, the regularization term $\|x - x^*\|_2^2$ might not be a good choice, as can be observed in the artificial example from Fig. 10, where parameters close to $x^*$ would always fail in production processes. A different approach would be to implicitly learn a prior distribution of the parameters and a projection operator from a large data-set. Similar ideas have been successfully implemented in [12, 22]. We propose this as possible future work.


7 Summary and Conclusions

An application of an ill-posed inverse problem where tolerances play an interesting role was shown in this work. The regularization was achieved by a combination of exploiting prior information about the true solutions and incorporating tolerances into the Tikhonov functional. It was also shown how the proximal gradient method can take advantage of splitting the objective function in order to make use of properties such as differentiability and convexity separately. An efficient and simple way of computing the gradients based on the standard back-propagation algorithm was introduced. Moreover, the evaluation of the proximal operator was reduced to the evaluation of a projection operator with a different argument.

In the numerical results we compared the ε-approach with the method that does not consider the tolerances and also with a heuristic search. Our approach solved a larger share of the instances in both test settings (Table 4), and it shows potential as an alternative to the established methods.

The main idea of the high-throughput project is to be able to quickly search in the parameter space, i.e. to make many experiments. However, we also use these experiments to learn something from them, i.e. to learn how they influence the properties and to find an analytic solution by solving an inverse problem. The results show that this approach could positively impact the productivity of the described high-throughput system.

Acknowledgements Financial support of the subproject P02 “Heuristic, Statistical and Analytical Experimental Design” by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 276397488, SFB 1232, is gratefully acknowledged. Support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 281474342/GRK2224/1, is also gratefully acknowledged. The authors would also like to acknowledge the anonymous reviewers for their helpful comments and suggestions.

References

1. Adler, J., Öktem, O.: Solving ill-posed inverse problems using iterative deep neural networks. Inverse Probl. 33(12), 124007 (2017)
2. Bathke, C., Kluth, T., Brandt, C., Maaß, P.: Improved image reconstruction in magnetic particle imaging using structural a priori information. Int. J. Magn. Part. Imaging 3(1) (2017)
3. Bertsimas, D., Tsitsiklis, J.: Introduction to Linear Optimization. 01 (1998)
4. Combettes, P.L., Pesquet, J.: Proximal Splitting Methods in Signal Processing, pp. 185–212. Springer, New York (2011)
5. Cota, A., Fernando Lucas, O., Barbosa, A., Cassio Antonio Mendes, L., Araujo, F.G.: Microstructure and mechanical properties of a microalloyed steel after thermal treatments. Mat. Res. 6, 06 (2003)
6. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
7. Czotscher, T.: Material characterisation with new indentation technique based on laser-induced shockwaves. Lasers Manuf. Mater. Process. 5, 10 (2018)
8. Drechsler, R., Eggersglüß, S., Ellendt, N., Huhn, S., Mädler, L.: Exploring superior structural materials using multi-objective optimization and formal techniques, pp. 13–17 (2016)
9. Ellendt, N., Mädler, L.: High-throughput exploration of evolutionary structural materials. HTM J. Heat Treat. Mater. 73(1), 3–12 (2018)
10. Engl, H.W., Hanke, M., Neubauer, A.: Regularization of inverse problems. Mathematics and its Applications, vol. 375. Kluwer Academic Publishers Group, Dordrecht (1996)
11. Gralla, P., Piotrowska-Kurczewski, I., Maaß, P.: Tikhonov functionals incorporating tolerances. In: Proceedings in Applied Mathematics and Mechanics (2017)
12. Hauptmann, A., Lucka, F., Betcke, M., Huynh, N., Adler, J., Cox, B., Beard, P., Ourselin, S., Arridge, S.: Model-based learning for accelerated, limited-view 3-d photoacoustic tomography. IEEE Trans. Med. Imaging 37(6), 1382–1393 (2018)
13. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. (1998)
14. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)
15. Kingma, D., Ba, J.: Adam: A method for stochastic optimization (2014). arXiv:1412.6980
16. Krebs, J.: Support vector regression for the solution of linear integral equations. Inverse Probl. 27, 065007 (2011)
17. Lorenz, D.A., Pock, T.: An inertial forward-backward algorithm for monotone inclusions. J. Math. Imaging Vis. 51(2), 311–325 (2015)
18. Louis, A.K.: Inverse und schlecht gestellte Probleme. Vieweg+Teubner Verlag, Wiesbaden (1989)
19. Muoi, P., Hao, D., Maaß, P., Pidcock, M.: Descent gradient methods for nonsmooth minimization problems in ill-posed problems. J. Comput. Appl. Math. 298, 105–122 (2016)
20. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2), vol. 27, pp. 372–376 (1983)
21. Parikh, N., Boyd, S.P.: Proximal algorithms. Found. Trends Optim. 1, 123–231 (2014)
22. Rick Chang, J.H., Li, C., Poczos, B., Vijaya Kumar, B.V.K., Sankaranarayanan, A.C.: One network to solve them all - solving linear inverse problems using deep projection models. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
23. Rieder, A.: Keine Probleme mit inversen Problemen: eine Einführung in ihre stabile Lösung. Vieweg, Wiesbaden (2003)

Coupled Electromagnetic Field and Electric Circuit Simulation: A Waveform Relaxation Benchmark

Christian Strohm and Caren Tischendorf

Abstract We consider coupled dynamical systems, arising from lumped circuit modeling coupled to distributed modeling of electromagnetic devices. The corresponding subsystems form an ordinary differential equation system, reflecting the spatially discretized electromagnetic field equations, see e.g. [33], and a system of differential-algebraic equations (DAEs), describing the circuit obtained by modified nodal analysis, see e.g. [28]. The different nature of the subsystems motivates us to solve the coupled systems by co-simulation in form of waveform relaxation methods, see e.g. [29]. It allows the use of sophisticated solvers for each subsystem. Furthermore, one can easily exploit the different structural properties of the subsystems. This becomes even more important due to the fact that the systems’ dimension may easily reach millions of unknowns. However, convergence of waveform relaxation methods is not always guaranteed as soon as DAEs are involved, see e.g. [5]. As one prototype of waveform relaxation methods we analyze the convergence behavior of the Gauss–Seidel approach. We present a criterion that guarantees convergence, supported by a numerical benchmark, and discuss the influence of three different coupling formulations onto the convergence behavior.

C. Strohm (B) · C. Tischendorf, Department of Mathematics, Humboldt-Universität zu Berlin, 10099 Berlin, Germany. e-mail: [email protected]; C. Tischendorf e-mail: [email protected]
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_9

1 Introduction

Waveform relaxation methods attracted interest for the simulation of very large scale integrated (VLSI) circuits about 35 years ago. At that time, a main general convergence result was proven for a waveform iteration scheme solving ordinary differential equations [29, 30]. A theoretical framework and an analysis of convergence rates is presented in [34]. On bounded time intervals superlinear convergence is shown when a Lipschitz condition is satisfied. A comprehensive overview of waveform relaxation methods for ODEs is given in [11]. The waveform relaxation approach has also successfully been used for the solution of distributed models, see e.g. [19]. Usually the convergence is rather slow. Better convergence rates can be obtained by properly adapted transmission conditions for information exchange of the subsystems, see e.g. [20].

The first comprehensive study of the convergence of waveform relaxation methods for DAE subsystems is given in [5]. It analyzes the coupling of index-1 DAEs and shows that convergence can only be guaranteed if a certain contractivity condition is fulfilled and the communication window size is not too large. The specific numerical problems there are caused by algebraic loop couplings. The situation is better for couplings with no algebraic loops as often given in industrial applications, see [4]. Global error bounds for Jacobi type waveform relaxation combined with higher-order approximations of subsystem inputs are shown for algebraically coupled ODE subsystems where the coupling forms a tree structure [4]. In [16] we find an analysis of waveform relaxation methods for coupled circuits arising by circuit partitioning. It transfers the convergence results for index-1 DAEs to circuits and gives an interpretation by suitable subcircuit structures. Additionally, the propagation of discretization errors within the waveform relaxation is analyzed there. Here, we consider only the continuous case, i.e. the convergence behavior for exact solutions of the iterated subsystems.

A first successful approach of waveform relaxation approximation for coupled lumped and distributed electronic devices has been reported in [7]. There, a distributed model for studying the thermal distribution within a circuit has been coupled to lumped device models and tested for a circuit described by an index-1 DAE. In [1], convergence criteria are given for waveform relaxation of Gauss–Seidel type for coupled systems of lumped circuit DAEs of index 1 and distributed semiconductor (pn-diode) model equations.
Similarly to the investigations here, convergence and stability of waveform relaxation methods of Gauss–Seidel type for a field/circuit coupled problem are proved in [39, 40]. The difference consists mainly in the field modeling. Whereas the circuit coupling with a magneto-quasistatic field model is studied there, we analyze the coupling with the full-Maxwell equations here. It results in a DAE/DAE coupling after spatial discretization there and an ODE/DAE coupling here. There the circuit is assumed to have DAE index not larger than 1. Here, we allow the circuit to have DAE index 2. Notice that also other device models have been studied for coupled circuit/device simulations by waveform relaxation methods, see e.g. drift-diffusion model couplings in [6]. As in the ODE case, better convergence rates can be achieved by adapted transmission conditions, see e.g. [21]. The fundamental basis for this work here are the convergence results of waveform relaxation for ODEs coupled to index-2 circuit DAEs presented in [36], see Theorems 2.3 and 2.4. Here, we additionally allow that circuits contain so-called C V -loops of voltage sources and capacitances (with at least one voltage source), that have been excluded there for simpler derivations. For the coupling of lumped circuit models and electromagnetic field models for electronic devices, we follow the approach presented in [37]. In Sect. 2 we describe the modeling of lumped circuits using the modified nodal analysis. It includes a


decoupling of the circuit equations extending Theorem 3.6 presented in [36] to circuits with C V -loops. Next, in Sect. 3 we briefly introduce electromagnetic (EM) devices and their full-Maxwell modeling as well as their spatial discretization by finite integration. Afterwards, we turn to the coupled modeling in Sect. 4, which mainly builds upon the approach presented in [8]. We present and discuss different formulations for the coupling of the subsystems. With these preliminaries established, we show in Sect. 5 convergence of a Gauss–Seidel type waveform relaxation method for one of the coupling formulations. Finally, numerical experiments for a low pass filter benchmark are given.

2 Electric Circuits

In the simulation of electrical circuits within the framework of industrial applications, the so-called modified nodal analysis is a frequently employed approach, cf. [12]. Since electric circuits have a graph structure, the section's outline is as follows: We start with some matrix characterizations of circuit structures using graph theory. Then we give a brief introduction to the mathematical modeling of electric circuits by differential-algebraic equations (DAEs) and close with a decoupling DAE analysis that globally extracts the inherent ODE system from the algebraic constraints by use of the matrices reflecting particular circuit structures.

2.1 Matrix Representation of Circuit Structures

An electric circuit, allowing multi-terminal elements, can be topologically interpreted as a hypergraph $\mathcal{H} = (\mathcal{N}, \mathcal{E})$, consisting of a node set $\mathcal{N}$ and a hyperedge set $\mathcal{E}$. Notice that hypergraphs are generalizations of graphs allowing so-called hyperedges that connect more than two nodes, see [23]. From the nodes of each hyperedge we choose one to be the reference node and use the branches between the non-reference nodes and the reference node of the hyperedge for a description of the electrical behavior of multi-terminal elements. Let the branches be oriented such that they point to the reference node. Collecting all branches in $\mathcal{B}$ we end up with a graph $\mathcal{G} = (\mathcal{N}, \mathcal{B})$ representing the electrical circuit. For our analysis, we assume a proper electrical circuit, which means:

• $\mathcal{G}$ is connected.
• Each branch connects two different nodes.
• Each node connects at least two branches.

Next, we choose one node, typically the ground node, to be the electric circuit's reference node. Notice that we allow the graph to have more than one branch connecting two nodes, as is common for circuits. In graph theory such graphs are often called multigraphs.


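Let $n \in \mathbb{N}$ be the number of circuit nodes of $\mathcal{N}$, $m \in \mathbb{N}$ the number of branches of $\mathcal{B}$ and $A_{\mathrm{full}} \in \mathbb{R}^{n \times m}$ be the full incidence matrix of $\mathcal{G}$ defined by

\[
(A_{\mathrm{full}})_{i,j} := \begin{cases}
1 & \text{if node } i \text{ is a non-reference terminal of branch } j, \\
-1 & \text{if node } i \text{ is the reference terminal of branch } j, \\
0 & \text{else.}
\end{cases}
\]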

Since $\mathcal{G}$ is connected, the rows of $A_{\mathrm{full}}$ are linearly dependent, meaning that one row can be eliminated, in this case the row of the electric circuit's reference node. Thereby we obtain the reduced incidence matrix $A \in \{-1, 0, 1\}^{(n-1) \times m}$, which we just call incidence matrix from here on. Notice that the reduced incidence matrix always has full row rank [15]. Furthermore, loops of the graph correspond to linearly dependent columns of $A$ and cutsets of the graph correspond to linearly dependent rows of $A$. Trees of the graph form a non-singular submatrix of $A$. A detailed explanation of these characterizations is given in [15].

For the decoupling of the circuit equations in Sect. 2.3, we introduce special cutsets and loops as well as their characterization by the incidence matrix $A$. Assume $\mathcal{B}_X \subset \mathcal{B}$ and $\mathcal{B}_Y \subset \mathcal{B}$ to be disjoint branch subsets for some branch types $X$ and $Y$, respectively, and let type $Z$ refer to the complementary branch subset $\mathcal{B}_Z = \mathcal{B} \setminus (\mathcal{B}_X \,\dot\cup\, \mathcal{B}_Y)$. With $A_X$ and $A_Y$ (and, analogously, $A_Z$) we denote the incidence matrices of the subgraphs $(\mathcal{N}, \mathcal{B}_X)$ and $(\mathcal{N}, \mathcal{B}_Y)$, respectively. We call a loop an $XY$-loop if it consists of branches of type $X$ and $Y$ only. Correspondingly, $XY$-cutsets are cutsets consisting of branches of type $X$ and $Y$ only. By a $+$ sign after a type indicator we indicate that the considered set contains at least one branch of that specific type. For instance, an $XY^+$-loop is a loop consisting of branches of type $X$ and $Y$ but with at least one $Y$-type branch, and a $Y^+Z$-cutset is a cutset consisting of branches of type $Y$ and $Z$ but with at least one $Y$-type branch.

Theorem 1 Let $\mathcal{B}_X$ and $\mathcal{B}_Y$ be disjoint branch subsets of a proper graph $\mathcal{G} = (\mathcal{N}, \mathcal{B})$ and $\mathcal{B}_Z := \mathcal{B} \setminus (\mathcal{B}_X \,\dot\cup\, \mathcal{B}_Y)$. Then, $\mathcal{G}$ has an $XY^+$-cutset if and only if there exists an $x$ such that $A_Z^\top x = 0$ and $A_Y^\top x \neq 0$.

Proof (⇒) Let $\mathcal{B}_c \subset \mathcal{B}_X \cup \mathcal{B}_Y$ form an $XY^+$-cutset of $\mathcal{G}$. The cutset divides $\mathcal{N}$ into two nonempty subsets $\mathcal{N}_1$ and $\mathcal{N}_2$ where all branches of $\mathcal{G}$ that connect nodes from $\mathcal{N}_1$ with nodes from $\mathcal{N}_2$ form the cutset $\mathcal{B}_c$. Without loss of generality, the reference node belongs to $\mathcal{N}_2$ and is counted last as number $n$. We define $x \in \mathbb{R}^{n-1}$ as follows:

\[
x_i := \begin{cases} 1 & \text{if node } i \text{ belongs to } \mathcal{N}_1, \\ 0 & \text{if node } i \text{ belongs to } \mathcal{N}_2. \end{cases}
\]

Additionally, we introduce $x_n := 0$ for the reference node. Consider any $b_z \in \mathcal{B}_Z$. It connects two nodes $i_1$ and $i_2$. Since $\mathcal{B}_c \subset \mathcal{B}_X \cup \mathcal{B}_Y$ we get $b_z \notin \mathcal{B}_c$. Consequently, the nodes $i_1$ and $i_2$ either belong both to $\mathcal{N}_1$ or both to $\mathcal{N}_2$, which yields $x_{i_1} = x_{i_2}$. By definition of the incidence matrix we obtain $A_Z^\top x = 0$. Consider now any $Y$-type branch $b_y$ from $\mathcal{B}_c$. It connects one node $i_1 \in \mathcal{N}_1$ with one node $i_2 \in \mathcal{N}_2$, yielding $x_{i_1} = 1$ and $x_{i_2} = 0$. For the corresponding column $a_y$ of $A_Y$ we obtain $a_y^\top x = a_{i_1,y} = \pm 1$. It results in $A_Y^\top x \neq 0$.

(⇐) Let $x \in \mathbb{R}^{n-1}$ be such that $A_Z^\top x = 0$ and $A_Y^\top x \neq 0$. Since $A_Y^\top x \neq 0$, we know that there exists a column $a_y$ of $A_Y$ such that $a_y^\top x \neq 0$. We denote the branch corresponding to $a_y$ by $b_y \in \mathcal{B}$. The branch $b_y$ connects two nodes $i_1$ and $i_2$. Due to $a_y^\top x \neq 0$, we obtain that $x_{i_1} \neq 0$ or $x_{i_2} \neq 0$. We can assume that $x_{i_1} \neq 0$. Now we form a subset $\mathcal{N}_1$ of the nodes of $\mathcal{G}$ as follows: node $j \in \mathcal{N}$ belongs to $\mathcal{N}_1$ if and only if $x_j = x_{i_1}$. We define $\mathcal{N}_2 := \mathcal{N} \setminus \mathcal{N}_1$. Consequently, the node $i_2$ and the mass node belong to $\mathcal{N}_2$. Let $\mathcal{B}_c$ be the set of all branches of $\mathcal{G}$ that connect nodes from $\mathcal{N}_1$ with nodes from $\mathcal{N}_2$. By construction, $\mathcal{B}_c$ forms a cutset of $\mathcal{G}$ with $b_y \in \mathcal{B}_c$. Consider any $b_z \in \mathcal{B}_Z$. It connects two nodes $j_1$ and $j_2$. Let $a_z$ be the corresponding column of $A_Z$. Since $A_Z^\top x = 0$, we get $a_z^\top x = 0$. If one of the nodes $j_1$ and $j_2$ is the mass node, say $j_2$, then $a_z^\top x = \pm x_{j_1}$, which yields $x_{j_1} = 0$. Consequently, $j_1$ and $j_2$ belong to $\mathcal{N}_2$. If both nodes $j_1$ and $j_2$ are different from the mass node, then $a_z^\top x = \pm(x_{j_1} - x_{j_2})$, which implies $x_{j_1} = x_{j_2}$. It means that the nodes $j_1$ and $j_2$ belong either both to $\mathcal{N}_1$ or both to $\mathcal{N}_2$. Therefore, $b_z \notin \mathcal{B}_c$ and we obtain that $\mathcal{B}_c$ is an $XY^+$-cutset. □

Corollary 1 Let $\mathcal{B}_X$ and $\mathcal{B}_Y$ be disjoint branch subsets of a proper graph $\mathcal{G} = (\mathcal{N}, \mathcal{B})$ and $\mathcal{B}_Z := \mathcal{B} \setminus (\mathcal{B}_X \,\dot\cup\, \mathcal{B}_Y)$. Furthermore, let the columns of $Q_Z$ form a basis of the kernel of $A_Z^\top$, that means $\operatorname{im} Q_Z = \ker A_Z^\top$. Then, $\mathcal{G}$ has an $XY^+$-cutset if and only if $Q_Z^\top A_Y \neq 0$.

Proof $Q_Z^\top A_Y \neq 0$ is equivalent to $A_Y^\top Q_Z \neq 0$. This, furthermore, is equivalent to the existence of an $x \in \operatorname{im} Q_Z$ with $A_Y^\top x \neq 0$. As $A_Z^\top x = 0$ if and only if $x \in \operatorname{im} Q_Z$, the assertion follows from Theorem 1. □

Lemma 1 Let the columns of $Q_X$ and $Q_Y$ form bases of the kernels of $A_X^\top$ and $A_Y^\top Q_X$, respectively. Further, let the columns of $Q_{XY}$ be a basis of the kernel of the concatenated matrix $(A_X \; A_Y)^\top$. Then, $\ker Q_{XY}^\top = \ker Q_Y^\top Q_X^\top$.

Proof The assertion follows from

\[
z \in \ker Q_Y^\top Q_X^\top
\;\Leftrightarrow\; Q_Y^\top Q_X^\top z = 0
\;\Leftrightarrow\; Q_X^\top z \in \operatorname{im} Q_X^\top A_Y
\]

and

\[
z \in \ker Q_{XY}^\top
\;\Leftrightarrow\; z \in \operatorname{im} (A_X \; A_Y)
\;\Leftrightarrow\; \exists w \, \exists v : z = A_Y w + A_X v
\;\Leftrightarrow\; \exists w : z - A_Y w \in \operatorname{im} A_X
\;\Leftrightarrow\; \exists w : Q_X^\top (z - A_Y w) = 0. \qquad \square
\]
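The criterion of Theorem 1 can be checked by hand on a very small circuit graph. The sketch below is an illustration with hypothetical names: it builds the reduced incidence matrix for three nodes (node 2 is the ground) and three branches, one of each type $X$, $Y$ and $Z$, and tests a candidate vector $x$. Branches are given as (start, end) pairs, where the end node plays the role of the branch's reference terminal.

```python
def reduced_incidence(num_nodes, branches, ground):
    """Reduced incidence matrix: +1 at the start node, -1 at the end node,
    with the row of the ground node removed."""
    rows = [n for n in range(num_nodes) if n != ground]
    A = [[0] * len(branches) for _ in rows]
    for j, (start, end) in enumerate(branches):
        for r, node in enumerate(rows):
            if node == start:
                A[r][j] = 1
            elif node == end:
                A[r][j] = -1
    return A

def transpose_times(A, cols, x):
    """Compute A[:, cols]^T x for a selection of branch columns."""
    return [sum(A[r][j] * x[r] for r in range(len(A))) for j in cols]

# Branch 0 (type X): node 0 -> node 1, branch 1 (type Y): node 0 -> ground,
# branch 2 (type Z): node 1 -> ground.
branches = [(0, 1), (0, 2), (1, 2)]
A = reduced_incidence(3, branches, ground=2)

# Candidate vector: 1 on N1 = {node 0}, 0 on N2 = {node 1, ground}.
x = [1, 0]
az_x = transpose_times(A, [2], x)  # A_Z^T x
ay_x = transpose_times(A, [1], x)  # A_Y^T x
```

Here $A_Z^\top x = 0$ and $A_Y^\top x \neq 0$, in agreement with the fact that the branches of type $X$ and $Y$ together separate node 0 from the rest of the graph and therefore form an $XY^+$-cutset.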


2.2 Lumped Modeling of Circuits

We consider electric circuits with lumped models, which means that spatial dependency is neglected. The time dependent quantities of interest are the voltages across and the currents through each branch and, furthermore, the potentials at each node. Kirchhoff's laws form the basis of circuit equations for lumped circuits [15]:

• Kirchhoff's current law (KCL): For any node and at any time the algebraic sum of all branch currents entering or leaving the node is zero.
• Kirchhoff's voltage law (KVL): For any loop and at any time the algebraic sum of all branch voltages around the loop is zero.

Let $\mathcal{I} := [t_0, T] \subset \mathbb{R}$ be some time interval. With $i, v : \mathcal{I} \to \mathbb{R}^m$ and $e : \mathcal{I} \to \mathbb{R}^{n-1}$ being the vector-functions of all branch currents, branch voltages and node potentials (except the mass node), respectively, Kirchhoff's laws for current and voltage yield

\[
A i = 0, \qquad v = A^\top e. \tag{1}
\]
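Both relations in (1) can be verified numerically for a small example. The sketch assumes a hypothetical circuit with two non-ground nodes and three branches (node 0 to node 1, node 0 to ground, node 1 to ground); the incidence matrix, potentials and currents are illustrative values only.

```python
def mat_vec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def mat_t_vec(A, e):
    """Transposed product A^T e."""
    return [sum(A[r][j] * e[r] for r in range(len(A))) for j in range(len(A[0]))]

# Reduced incidence matrix: rows are nodes 0 and 1, columns are the three branches.
A = [[1, 1, 0],
     [-1, 0, 1]]

# KVL in the form v = A^T e: branch voltages follow from the node potentials.
e = [5.0, 3.0]
v = mat_t_vec(A, e)                 # [e0 - e1, e0, e1]

# Around the loop 0 -> 1 -> ground -> 0 the branch voltages sum to zero.
loop_sum = v[0] + v[2] - v[1]

# KCL in the form A i = 0: a current circulating around the same loop.
i = [1.0, -1.0, 1.0]
kcl_residual = mat_vec(A, i)
```

Any branch voltage vector of the form $A^\top e$ satisfies KVL automatically, which is why the node potentials are convenient unknowns.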

Constitutive element equations define relations between the branch’s currents and voltages and thus complete the Kirchhoff’s laws. Most of them can be categorized into the two classes of current and voltage controlling elements. For an arbitrary circuit element consider the dissection i = (ielem , icompl ) and v = (velem , vcompl ) according to the element branches and the complementary ones. A circuit element is called current controlling if it has a constitutive equation which explicitly determines its branch currents, ielem = f elem (

d delem (icompl , v, t), icompl , v, t) dt

(2)

or voltage controlling if the voltages are explicitly determined, velem = f elem (

d delem (i, vcompl , t), i, vcompl , t), dt

(3)

for some functions f elem and delem . These constitutive equations do not cover all types of elements, but more than just the basic ones. The basic two-terminal elements are given by

• current source: • resistor:

iI = i(t) iR = g(vR )

Coupled Electromagnetic Field and Electric Circuit Simulation …

• inductor:

iL =

• voltage source:

171

d φ(vL ) dt

vV = v(t)

• capacitor:

vC =

d q(iC ) dt

with scalar source functions $v : I \to \mathbb{R}$ and $i : I \to \mathbb{R}$ and characteristic functions $g, \phi, q : \mathbb{R} \times I \to \mathbb{R}$, since each of them acts on one branch only. For later inclusion of spatially discretized distributed models, we introduce another current controlling element, the

• mock element:

$$i_M = f_M\Big(\tfrac{d}{dt} d_M(v_M),\ \tfrac{d}{dt} d_M(u_M),\ v_M,\ u_M\Big) \tag{4}$$

where $u_M$ fulfills the ODE

$$\tfrac{d}{dt} d_M(u_M) + b_M(u_M) = c_M\Big(\tfrac{d}{dt} d_M(v_M),\ v_M\Big).$$

Notice that $i_M$ and $v_M$ are often not scalar but vector-valued functions of time. Furthermore, the mock element defined in this way also includes most equivalent circuit model descriptions of transistors, since they are usually of the form

$$i_M = f_M\Big(\tfrac{d}{dt} d_M(v_M),\ v_M\Big).$$

Since most circuits can be described by the elements mentioned above, we restrict our circuit analysis to circuits with such elements.

Assumption 1 Let the electrical circuit consist of capacitors, resistors, inductors, voltage sources, current sources and mock elements according to the previously defined models, inducing $m_C$, $m_R$, $m_L$, $m_V$, $m_I$ and $m_M \in \mathbb{N}$ branches, respectively.

Modified nodal analysis. According to the above classification (2) and (3), sort the quantities and the reduced incidence matrix such that $i = (i_{\mathrm{curr}}, i_{\mathrm{vol}})$, $v = (v_{\mathrm{curr}}, v_{\mathrm{vol}})$ and $A = [A_{\mathrm{curr}}\ A_{\mathrm{vol}}]$. Further, we collect all the current and voltage controlling constitutive equations into $f_{\mathrm{curr}}$, $d_{\mathrm{curr}}$ and $f_{\mathrm{vol}}$, $d_{\mathrm{vol}}$, respectively. Inserting $f_{\mathrm{curr}}$ into KCL and replacing all the branch voltages in terms of potentials, we obtain from (1):

$$A_{\mathrm{curr}}\, f_{\mathrm{curr}}\Big(\tfrac{d}{dt} d_{\mathrm{curr}}(i, A^\top e, t),\ i, A^\top e, t\Big) + A_{\mathrm{vol}}\, i_{\mathrm{vol}} = 0, \tag{5}$$

$$f_{\mathrm{vol}}\Big(\tfrac{d}{dt} d_{\mathrm{vol}}(i, A^\top e, t),\ i, A^\top e, t\Big) - A_{\mathrm{vol}}^\top e = 0. \tag{6}$$


The system (5)–(6) is called the modified nodal analysis (MNA) and represents a system of DAEs. By Assumption 1 we can sort the currents and the columns of the incidence matrix according to the type of the elements by introducing

$$i = (i_C, i_R, i_L, i_V, i_I, i_M) : I \to \mathbb{R}^{m_C + m_R + m_L + m_V + m_I + m_M},$$

$$A = [A_C\ A_R\ A_L\ A_V\ A_I\ A_M] \in \{-1, 0, 1\}^{(n-1) \times (m_C + m_R + m_L + m_V + m_I + m_M)}.$$

With $q_C : \mathbb{R}^{m_C} \times I \to \mathbb{R}^{m_C}$, $g_R : \mathbb{R}^{m_R} \times I \to \mathbb{R}^{m_R}$, $\phi_L : \mathbb{R}^{m_L} \times I \to \mathbb{R}^{m_L}$, $i_s : I \to \mathbb{R}^{m_I}$ and $v_s : I \to \mathbb{R}^{m_V}$ we describe the element type-wise characteristic functions, resulting from concatenation. Introducing $x = (e, i_L, i_V)$ and $w = (w_C, w_L)$, the system (5)–(6) can be written as

$$f_{\mathrm{MNA}}\Big(\tfrac{d}{dt} d_{\mathrm{MNA}}(x, t),\ x,\ t\Big) = c_{\mathrm{MNA}}(i_M) \tag{7}$$

where

$$f_{\mathrm{MNA}}(w, x, t) = \begin{pmatrix} A_C w_C + A_R\, g_R(A_R^\top e) + A_L i_L + A_V i_V + A_I i_s(t) \\ w_L - A_L^\top e \\ v_s(t) - A_V^\top e \end{pmatrix},$$

$$d_{\mathrm{MNA}}(x, t) = \begin{pmatrix} q_C(A_C^\top e) \\ \phi_L(i_L) \end{pmatrix}, \qquad c_{\mathrm{MNA}}(i_M) = \begin{pmatrix} -A_M i_M \\ 0 \\ 0 \end{pmatrix} \tag{8}$$

and $i_M$ given by (4) with $v_M = A_M^\top e$.
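To make the structure of (7)–(8) concrete, the following minimal sketch evaluates the MNA residual for a hypothetical linear RC circuit (voltage source at node 1, resistor between nodes 1 and 2, capacitor from node 2 to ground) at its known analytic solution. The element values, the sign convention of the incidence blocks and all function names are assumptions made for illustration; inductors, current sources and mock elements are absent here.

```python
import math

R, C = 2.0, 0.5            # hypothetical element values
G = 1.0 / R

# Reduced incidence blocks (rows: nodes 1 and 2; ground is the mass node).
A_C = [[0.0], [1.0]]       # capacitor: node 2 -> ground
A_R = [[1.0], [-1.0]]      # resistor: node 1 -> node 2
A_V = [[1.0], [0.0]]       # voltage source: node 1 -> ground


def v_s(t):
    return 1.0             # constant source voltage


def g_R(v):
    return G * v           # linear resistor characteristic


def col_dot(A, e):
    """A^T e for an incidence block with a single column."""
    return sum(A[k][0] * e[k] for k in range(len(e)))


def f_mna(w_C, x, t):
    """Residual of (8) for x = (e1, e2, i_V), where w_C stands for d/dt q_C(A_C^T e)."""
    e, i_V = x[:2], x[2]
    i_R = g_R(col_dot(A_R, e))
    kcl = [A_C[k][0] * w_C + A_R[k][0] * i_R + A_V[k][0] * i_V for k in range(2)]
    return kcl + [v_s(t) - col_dot(A_V, e)]


def exact(t):
    """Analytic solution for e2(0) = 0, together with the matching w_C."""
    decay = math.exp(-t / (R * C))
    e1, e2 = 1.0, 1.0 - decay
    i_V = -G * (e1 - e2)
    w_C = C * decay / (R * C)          # C * d/dt e2
    return w_C, [e1, e2, i_V]


residuals = [f_mna(*exact(t), t) for t in (0.0, 0.3, 1.0)]
max_residual = max(abs(r) for res in residuals for r in res)
```

The third row of the residual is purely algebraic, which is what makes the system a DAE rather than an ODE: the source current `i_V` is fixed by a constraint, not by a differential equation.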

2.3 Analysis of the Lumped Circuit DAE

This section provides a decoupling of the lumped circuit DAE into the inherent ODE and the algebraic constraints. Our decoupling approach is similar to the projector based decoupling used in [18] but uses the dissection index concept with truncated projections introduced in [25]. The dissection index concept combines advantages from the projector based index concept [28] with the strangeness index concept [27]. In particular, it allows a decoupling of circuit equations with state independent projections. The key point of the dissection concept is given by so-called kernel splitting pairs. A pair of matrices $\{P, Q\}$ is called a kernel splitting pair of a given matrix $M$ if $\operatorname{im} Q = \ker M$ and the concatenation $[Q\ P]$ is non-singular.

Lemma 2 Let $\{P, Q\}$ be a kernel splitting pair of a matrix $M$ with full row rank. Then, $MP$ is non-singular.

Proof It follows from the definition and the full row rank of $M\,[Q\ P] = [0\ \ MP]$. □
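Lemma 2 can be illustrated on a small integer example. The sketch below assumes a hypothetical $2 \times 3$ matrix $M$ with full row rank and one admissible choice of $\{P, Q\}$; the matrices and helper functions are illustrative only.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def hstack(A, B):
    return [ra + rb for ra, rb in zip(A, B)]


def det(M):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


# Hypothetical matrix with full row rank 2, hence a one-dimensional kernel.
M = [[1, 0, 1],
     [0, 1, 1]]
Q = [[1], [1], [-1]]              # column spans ker M
P = [[1, 1], [0, 1], [0, 0]]      # completes Q to a non-singular [Q P]

MQ = matmul(M, Q)                 # zero, since im Q = ker M
det_QP = det(hstack(Q, P))        # non-zero: {P, Q} is a kernel splitting pair
det_MP = det(matmul(M, P))        # non-zero, as Lemma 2 states
```

Any other completion `P` with non-singular `[Q P]` would work equally well; the lemma only uses that the columns of `P` are complementary to the kernel.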


Lemma 3 Let $M$ and $N$ be two matrices with the same number of rows. Furthermore, assume that $N$ is positive definite. If $\{P, Q\}$ is a kernel splitting pair of $M$, then $\{V, Q\}$ with

$$V^\top := [0\ \ I_k]\,[Q\ P]^{-1} N \quad\text{and}\quad \operatorname{rank} I_k = k := \operatorname{rank} P$$

is also a kernel splitting pair of $M$.

Proof By definition of a kernel splitting pair, we only have to show that $[Q\ V]$ is non-singular. Assume that there exist $x$ and $y$ such that

$$Qx + Vy = 0. \tag{9}$$

For $S := [0\ \ I_k]\,[Q\ P]^{-1} = V^\top N^{-1}$, we obtain

$$S N^\top S^\top y = S (S N)^\top y = S V y = -S Q x = -[0\ \ I_k]\,[Q\ P]^{-1} Q x = -[0\ \ I_k]\,[Q\ P]^{-1} [Q\ P] \begin{pmatrix} x \\ 0 \end{pmatrix} = -[0\ \ I_k] \begin{pmatrix} x \\ 0 \end{pmatrix} = 0.$$

Since $N^\top$ is positive definite, we can conclude $S^\top y = 0$. This implies $y = 0$ since $S$ has full row rank. Regarding (9) and the full column rank of $Q$, we also get $x = 0$. □

Next, we formulate reasonable assumptions reflecting physical properties of electric circuits and their elements.

Assumption 2 (Passivity) All resistances, inductances and capacitances in the electric circuit show a passive behavior, i.e. $q_C$, $g_R$ and $\phi_L$ are strongly monotone.

Assumption 2 reflects the property that these elements consume energy and do not produce energy. It is strict in the sense that active elements like controlled sources are excluded. However, the following Theorem 2 remains true when controlled current sources are included for which the terminals are connected by a capacitive path. This follows by the same arguments as the ones used in [18]. We omit a detailed presentation here to keep the decoupling procedure below readable. Consequently, the following results also apply to circuits with transistors and diodes as long as their capacitive contribution is not neglected in the modeling.

Assumption 3 (Consistency) The electric circuit contains neither V-loops nor I-cutsets.

Assumption 3 avoids short circuits and is needed for the existence of (unique) solutions, see [18]. Additionally, we assume sufficient smoothness of the element functions in order to guarantee existence and uniqueness of solutions.

Assumption 4 (Smoothness) The characteristic functions $q_C$, $g_R$ and $\phi_L$ are globally Lipschitz continuous. The source functions $i_s$ and $v_s$ are twice continuously differentiable.


As required by the next theorem, we need an additional assumption concerning mock elements.

Assumption 5 (No IM-cutsets) The electric circuit contains no IM-cutsets.

For the sake of simplicity and to point out which terms are considered as sources, we define

$$s_c(i_M) := A_M i_M, \qquad s_i(t) := A_I i_s(t), \qquad s_v(t) := v_s(t).$$

Theorem 2 If Assumptions 1, 2, 3, 4 and 5 are satisfied, then there exist globally Lipschitz continuous functions $f_0$, $f_1$, $f_2$ as well as a (constant) matrix $M_3$, a non-singular Lipschitz continuous matrix function $M_1(y, z_3)$ and a (constant) non-singular transformation matrix $T = [T_0\ T_1\ T_2\ T_3]$ such that the DAE (7) can be globally decoupled into an equivalent system of the form

$$\tfrac{d}{dt} y = f_0(y, z_1, z_2, z_3, s_i, s_c(i_M)), \tag{10}$$

$$z_1 = M_1(y, z_3)\, \tfrac{d}{dt} z_3 + f_1(y, z_2, z_3, s_i, s_c(i_M)), \tag{11}$$

$$z_2 = f_2(y, z_3, s_v, s_i, s_c(i_M)), \tag{12}$$

$$z_3 = M_3 \begin{pmatrix} s_i + s_c(i_M) \\ s_v \end{pmatrix}. \tag{13}$$

For a given $C^1$ function $i_{EM}$ on $I$, the function $x \in C^1(I)$ is a solution of (7) with $x(t_0) = x_0$ if and only if $\bar x$ defined by $x = T \bar x = T_0 y + T_1 z_1 + T_2 z_2 + T_3 z_3$ is a $C^1$ solution of the decoupled system (10)–(13) on $I$ with $y(t_0) = y_0$ satisfying $x_0 = T_0 y_0 + T_1 z_1(t_0) + T_2 z_2(t_0) + T_3 z_3(t_0)$.

Proof The proof combines the idea of the dissection concept [25] for DAEs with the projector based decoupling [18, 28] for circuit DAEs. Due to Assumption 1, the DAE (7) reads

$$A_C \tfrac{d}{dt} q_C(A_C^\top e) + A_R\, g_R(A_R^\top e) + A_L i_L + A_V i_V + s_i + s_c(i_M) = 0, \tag{14}$$

$$\tfrac{d}{dt} \phi_L(i_L) - A_L^\top e = 0, \tag{15}$$

$$-A_V^\top e + s_v = 0. \tag{16}$$

Let $\{P, Q\}$ be kernel splitting pairs of the following matrices $M$:

$\{P, Q\}$ : $M$
$\{P_C, Q_C\}$ : $A_C^\top$
$\{P_V, Q_V\}$ : $A_V^\top Q_C$
$\{P_R, Q_R\}$ : $A_R^\top Q_C Q_V$
$\{\bar P_V, \bar Q_V\}$ : $P_V^\top Q_C^\top A_V$
$\{\bar P_L, \bar Q_L\}$ : $Q_R^\top Q_V^\top Q_C^\top A_L$
$\{P_e, Q_e\}$ : $\bar Q_V^\top A_V^\top P_C$

We use them to split $e$, $i_L$ and $i_V$ as follows:

$$e = Q_C \big[ Q_V (Q_R z_{1l} + P_R z_{2r}) + P_V z_{2v} \big] + P_C \big[ Q_e y_e + P_e z_{3e} \big], \tag{17}$$

$$i_L = \bar Q_L \bar y_l + \bar P_L \bar z_{3l}, \tag{18}$$

$$i_V = \bar Q_V \bar z_{1v} + \bar P_V \bar z_{2v}, \tag{19}$$

and collect the new variables as $y := (y_e, \bar y_l)$, $z_1 := (z_{1l}, \bar z_{1v})$, $z_2 := (z_{2v}, z_{2r}, \bar z_{2v})$, $z_3 := (z_{3e}, \bar z_{3l})$. Exploiting the splitting pairs' properties we introduce

$$\hat g_R(z_{2r}, z_{2v}, y_e, z_{3e}) := g_R\big(A_R^\top (Q_C (Q_V P_R z_{2r} + P_V z_{2v}) + P_C (Q_e y_e + P_e z_{3e}))\big) = g_R(A_R^\top e),$$

$$\hat C(y_e, z_{3e}) := P_C^\top A_C\, q_C'\big(A_C^\top P_C (Q_e y_e + P_e z_{3e})\big)\, A_C^\top P_C = P_C^\top A_C\, q_C'(A_C^\top e)\, A_C^\top P_C,$$

$$\hat L(\bar y_l, \bar z_{3l}) := \phi_L'(\bar Q_L \bar y_l + \bar P_L \bar z_{3l}) = \phi_L'(i_L).$$

The function $\hat g_R$ is globally Lipschitz continuous since $g_R$ is globally Lipschitz continuous. Furthermore, $\hat C(y_e, z_{3e})$ and $\hat L(\bar y_l, \bar z_{3l})$ are positive definite since $A_C^\top P_C$ has full column rank and $q_C$ as well as $\phi_L$ are strongly monotone. We proceed with some additional splitting pairs to split the equations. We choose kernel splitting pairs $\{V_C(y_e, z_{3e}), Q_e\}$ and $\{\bar V_L(\bar y_l, \bar z_{3l}), \bar Q_L\}$ of the matrices $\bar Q_V^\top A_V^\top P_C$ and $\bar A_L^\top := Q_R^\top Q_V^\top Q_C^\top A_L$, respectively, such that

$$V_C^\top(y_e, z_{3e})\, \hat C(y_e, z_{3e})\, [Q_e\ P_e] = [0\ \ I],$$

$$\bar V_L^\top(\bar y_l, \bar z_{3l})\, \hat L(\bar y_l, \bar z_{3l})\, [\bar Q_L\ \bar P_L] = [0\ \ I].$$

The existence of such pairs is guaranteed by Lemma 3 (use $N := \hat C^{-1}(y_e, z_{3e})$, $Q := Q_e$, $P := P_e$ for the first pair and $N := \hat L^{-1}(\bar y_l, \bar z_{3l})$, $Q := \bar Q_L$, $P := \bar P_L$ for the second pair). Furthermore, the matrix functions $V_C(y_e, z_{3e})$ and $\bar V_L(\bar y_l, \bar z_{3l})$ are globally Lipschitz continuous since $q_C$ and $\phi_L$ are strongly monotone, which implies that $\hat C^{-1}(y_e, z_{3e})$ and $\hat L^{-1}(\bar y_l, \bar z_{3l})$ are globally Lipschitz continuous [26]. We derive equations of the form (10)–(13) in four steps, starting with (13) and finishing with (10):

1. Multiplying (14) by $Q_R^\top Q_V^\top Q_C^\top$ and (16) by $\bar Q_V^\top$ from the left yields:

$$\bar z_{3l} = \bar M_{3l}\,(s_i + s_c(i_M)), \quad\text{with } \bar M_{3l} := -(Q_R^\top Q_V^\top Q_C^\top A_L \bar P_L)^{-1} Q_R^\top Q_V^\top Q_C^\top, \tag{20}$$

$$z_{3e} = (M_{3e}^\top M_{3e})^{-1} M_{3e}^\top \bar Q_V^\top s_v, \quad\text{with } M_{3e} := \bar Q_V^\top A_V^\top P_C P_e. \tag{21}$$

Note that the inverse of $Q_R^\top Q_V^\top Q_C^\top A_L \bar P_L$ exists according to Lemma 2 (use that $M := Q_R^\top Q_V^\top Q_C^\top A_L$ has full row rank due to the absence of IM-cutsets, see Assumption 5). Further, $M_{3e}$ has full column rank by the definition of $P_e$. Introducing

$$M_3 := \begin{pmatrix} 0 & (M_{3e}^\top M_{3e})^{-1} M_{3e}^\top \bar Q_V^\top \\ \bar M_{3l} & 0 \end{pmatrix}$$

yields (13).

2. Multiplying (16) by $\bar P_V^\top$ and (14) by $P_R^\top Q_V^\top Q_C^\top$ and $P_V^\top Q_C^\top$ from the left yields:

$$z_{2v} = f_{2v}(y_e, z_{3e}, s_v) := (\bar P_V^\top A_V^\top Q_C P_V)^{-1} \bar P_V^\top \big({-A_V^\top P_C (Q_e y_e + P_e z_{3e})} + s_v\big),$$

$$z_{2r} = f_{2r}(y, z_3, s_v, s_i, s_c(i_M)), \quad\text{with } f_{2r} \text{ satisfying}\quad h_R\big(f_{2r}(y, z_3, s_v, s_i, s_c(i_M)),\ y, z_3, s_v, s_i, s_c(i_M)\big) = 0,$$

$$\bar z_{2v} = \bar f_{2v}(y, z_3, s_v, s_i, s_c(i_M)) := -(P_V^\top Q_C^\top A_V \bar P_V)^{-1} P_V^\top Q_C^\top \big[ A_R\, \hat g_R\big(f_{2r}(y, z_3, s_v, s_i, s_c(i_M)),\ f_{2v}(y_e, z_{3e}, s_v),\ y_e, z_{3e}\big) + A_L (\bar Q_L \bar y_l + \bar P_L \bar z_{3l}) + A_V \bar Q_V \bar z_{1v} + s_i + s_c(i_M) \big],$$

where

$$h_R(z_{2r}, y, z_3, s_v, s_i, s_c) := P_R^\top Q_V^\top Q_C^\top \big[ A_R\, \hat g_R(z_{2r}, f_{2v}(y_e, z_{3e}, s_v), y_e, z_{3e}) + A_L (\bar Q_L \bar y_l + \bar P_L \bar z_{3l}) + s_i + s_c \big].$$

The functions $f_{2v}$ and $\bar f_{2v}$ are well-defined since $P_V^\top Q_C^\top A_V \bar P_V$ is non-singular (use again Lemma 2 and that $A_V^\top Q_C P_V$ has full column rank by the splitting pair construction). The global Lipschitz continuity of $\hat g_R$ implies $f_{2v}$ and $h_R$ to be globally Lipschitz continuous. Furthermore, $h_R$ is strongly monotone w.r.t. $z_{2r}$ since $P_R^\top Q_V^\top Q_C^\top A_R$ is non-singular (see Lemma 2) and $g_R$ is strongly monotone. This ensures the existence of the globally unique function $f_{2r}$ that is also globally Lipschitz continuous [26]. Consequently, $\bar f_{2v}$ is globally Lipschitz continuous as well. Introducing

$$f_2(y, z_3, s_v, s_i, s_c) := \begin{pmatrix} f_{2v}(y_e, z_{3e}, s_v) \\ f_{2r}(y, z_3, s_v, s_i, s_c) \\ \bar f_{2v}(y, z_3, s_v, s_i, s_c) \end{pmatrix}$$

we obtain (12).

3. Multiplying (15) by $\bar V_L^\top(\bar y_l, \bar z_{3l})$ and (14) by $V_C^\top(y_e, z_{3e}) P_C^\top$ from the left yields:

$$z_{1l} = M_{1l}(\bar y_l, \bar z_{3l})\, \tfrac{d}{dt} \bar z_{3l} + f_{1l}(y, z_2, z_3), \tag{22}$$

$$\bar z_{1v} = -\bar M_{1v}(y_e, z_{3e})\, \tfrac{d}{dt} z_{3e} + \bar f_{1v}(y, z_2, z_3, s_c(i_M), s_i(t)), \tag{23}$$

with

$$M_{1l}(\bar y_l, \bar z_{3l}) := \big(\bar V_L^\top(\bar y_l, \bar z_{3l})\, \bar A_L\big)^{-1}, \qquad \bar M_{1v}(y_e, z_{3e}) := \big(V_C^\top(y_e, z_{3e})\, P_C^\top A_V \bar Q_V\big)^{-1},$$

$$f_{1l}(y, z_2, z_3) := -M_{1l}(\bar y_l, \bar z_{3l})\, \bar V_L^\top(\bar y_l, \bar z_{3l}) \big[ A_L^\top (Q_C (Q_V P_R z_{2r} + P_V z_{2v}) + P_C (Q_e y_e + P_e z_{3e})) \big],$$

$$\bar f_{1v}(y, z_2, z_3, s_i, s_c(i_M)) := -\bar M_{1v}(y_e, z_{3e})\, V_C^\top(y_e, z_{3e})\, P_C^\top \big[ A_R\, \hat g_R(z_{2r}, z_{2v}, y_e, z_{3e}) + A_L (\bar Q_L \bar y_l + \bar P_L \bar z_{3l}) + A_V \bar P_V \bar z_{2v} + (s_i + s_c(i_M)) \big].$$

Note that because of the absence of IM-cutsets, $\bar A_L^\top$ has full row rank such that the non-singularity of $\bar V_L^\top(\bar y_l, \bar z_{3l})\, \bar A_L$ follows from Lemma 2. Furthermore, $V_C^\top(y_e, z_{3e})\, P_C^\top A_V \bar Q_V$ is non-singular due to the absence of V-loops. This becomes clear as follows: The absence of V-loops implies that $A_V$ has full column rank. By definition of $\bar Q_V$ we see that $A_V \bar Q_V$ has full column rank as well and that $P_V^\top Q_C^\top A_V \bar Q_V = 0$. By the latter equation and the fact that the construction of $Q_V$ yields $Q_V^\top Q_C^\top A_V \bar Q_V = 0$, we deduce that $Q_C^\top A_V \bar Q_V = 0$. Exploiting this and the fact that $[Q_C\ P_C]^\top A_V \bar Q_V$ always has the same column rank as $A_V \bar Q_V$, it follows that $P_C^\top A_V \bar Q_V$ has full column rank. Again, we use Lemma 2 for $M := \bar Q_V^\top A_V^\top P_C$ and $P := V_C(y_e, z_{3e})$ and are done. We observe that $M_{1l}(\bar y_l, \bar z_{3l})$ and $\bar M_{1v}(y_e, z_{3e})$ are globally Lipschitz continuous since $\bar V_L(\bar y_l, \bar z_{3l})$ and $V_C(y_e, z_{3e})$ are so. This also implies the global Lipschitz continuity of $f_{1l}$ and $\bar f_{1v}$. Introducing

$$M_1(y, z_3) := \begin{pmatrix} 0 & M_{1l}(\bar y_l, \bar z_{3l}) \\ -\bar M_{1v}(y_e, z_{3e}) & 0 \end{pmatrix}, \qquad f_1(y, z_2, z_3, s_i, s_c(i_M)) := \begin{pmatrix} f_{1l}(y, z_2, z_3) \\ \bar f_{1v}(y, z_2, z_3, s_i, s_c(i_M)) \end{pmatrix}$$

we obtain (11).

4. Multiplying (15) by $\bar Q_L^\top$ and (14) by $Q_e^\top P_C^\top$ from the left and using (22)–(23) yields:

$$\tfrac{d}{dt} \bar y_l = \bar f_{0l}(y, z_1, z_2, z_3) := \big(\bar Q_L^\top \hat L(\bar y_l, \bar z_{3l}) \bar Q_L\big)^{-1} \bar Q_L^\top \big[I - \hat L(\bar y_l, \bar z_{3l}) \bar P_L \bar V_L^\top(\bar y_l, \bar z_{3l})\big] \big[ A_L^\top (Q_C (Q_V P_R z_{2r} + P_V z_{2v}) + P_C (Q_e y_e + P_e z_{3e})) + \bar A_L z_{1l} \big],$$

$$\tfrac{d}{dt} y_e = f_{0e}(y, z_1, z_2, z_3, s_i, s_c(i_M)) := -\big(Q_e^\top \hat C(y_e, z_{3e}) Q_e\big)^{-1} Q_e^\top \big[I - \hat C(y_e, z_{3e}) P_e V_C^\top(y_e, z_{3e})\big] P_C^\top \big[ A_L (\bar Q_L \bar y_l + \bar P_L \bar z_{3l}) + A_V (\bar Q_V \bar z_{1v} + \bar P_V \bar z_{2v}) + A_R\, \hat g_R(z_{2r}, z_{2v}, y_e, z_{3e}) + (s_i + s_c(i_M)) \big].$$

The functions $\bar f_{0l}$ and $f_{0e}$ are well-defined since the matrices $\bar Q_L^\top \hat L(\bar y_l, \bar z_{3l}) \bar Q_L$ and $Q_e^\top \hat C(y_e, z_{3e}) Q_e$ are positive definite. By the same arguments as before, the functions $\bar f_{0l}$ and $f_{0e}$ are also globally Lipschitz continuous. Introducing

$$f_0(y, z_1, z_2, z_3, s_i, s_c) := \begin{pmatrix} f_{0e}(y, z_1, z_2, z_3, s_i, s_c) \\ \bar f_{0l}(y, z_1, z_2, z_3) \end{pmatrix}$$

yields (10).

Regarding (17)–(19), the transformation matrix $T = [T_0\ T_1\ T_2\ T_3]$ is given by

$$T_0 = \begin{pmatrix} P_C Q_e & 0 \\ 0 & \bar Q_L \\ 0 & 0 \end{pmatrix}, \quad T_1 = \begin{pmatrix} Q_C Q_V Q_R & 0 \\ 0 & 0 \\ 0 & \bar Q_V \end{pmatrix}, \quad T_2 = \begin{pmatrix} Q_C P_V & Q_C Q_V P_R & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \bar P_V \end{pmatrix}, \quad T_3 = \begin{pmatrix} P_C P_e & 0 \\ 0 & \bar P_L \\ 0 & 0 \end{pmatrix}$$

and for the initial condition it holds that $x(t_0) = x_0 = T_0 y_0 + T_1 z_1(t_0) + T_2 z_2(t_0) + T_3 z_3(t_0)$. Finally, we can conclude that the system (14)–(16) is equivalent to the system (10)–(13) as $T$ is non-singular by construction. □

For the later analysis concerning waveform relaxation, we investigate circuits with mock elements satisfying one further condition.

Assumption 6 Mock elements do not form a cutset together with inductors and current sources. In other words, there is no LIM+-cutset.

Note that Assumption 6 is a weak topological restriction on the location of mock elements in the circuit. For example, a capacitive/resistive path between the terminals of a mock element is sufficient.

Corollary 2 If Assumption 6 holds, the decoupled system (10)–(13) simplifies to the form

$$\tfrac{d}{dt} y = f_0(y, z_1, z_2, z_3, s_i, s_c(i_M)), \tag{24}$$

$$z_1 = M_1(y, z_3)\, \tfrac{d}{dt} z_3 + f_1(y, z_2, z_3, s_i, s_c(i_M)), \tag{25}$$

$$z_2 = f_2(y, z_3, s_v, s_i, s_c(i_M)), \tag{26}$$

$$z_3 = M_3 \begin{pmatrix} s_i \\ s_v \end{pmatrix}. \tag{27}$$

Proof For the proof we look back to the one of Theorem 2. There, we multiplied (14) by $Q_R^\top Q_V^\top Q_C^\top$ and (16) by $\bar Q_V^\top$ from the left, yielding in particular

$$Q_R^\top Q_V^\top Q_C^\top (A_L \bar P_L \bar z_{3l} + s_i + s_c(i_M)) = 0. \tag{28}$$

Due to Assumption 6 we obtain from Corollary 1 that $Q_{CVR}^\top A_M = 0$, with $Q_{CVR}$ being a basis of the kernel of $[A_C\ A_V\ A_R]^\top$. Applying Lemma 1 twice, we deduce that $Q_{CVR}^\top A_M = 0$ implies $Q_R^\top Q_V^\top Q_C^\top A_M = 0$. Hence, Eq. (28) is equivalent to

$$0 = Q_R^\top Q_V^\top Q_C^\top (A_L \bar P_L \bar z_{3l} + s_i).$$

Therefore, it is

$$\bar z_{3l} = \bar M_{3l}\, s_i, \quad\text{with the same } \bar M_{3l} = -(Q_R^\top Q_V^\top Q_C^\top A_L \bar P_L)^{-1} Q_R^\top Q_V^\top Q_C^\top,$$

$$z_{3e} = M_{3e}\, s_v, \quad\text{with the same } M_{3e} = (\bar Q_V^\top A_V^\top P_C P_e)^{-1} \bar Q_V^\top,$$

yielding equation (27). □

3 Electromagnetic Devices

As systems on chip become smaller while their operating frequency increases, industry is interested in refined models and suitable simulation techniques that improve these devices while saving the expense of laboratory testing. Such models shall cover all the physical phenomena which can no longer be ignored, e.g. crosstalk and the skin effect. Hence, we make use of the full set of Maxwell's equations, since they describe all large-scale electromagnetic phenomena, cf. [41].

3.1 Modeling

As we intend to incorporate electromagnetic (EM) devices, such as integrated circuits or other on-chip components, into an electric circuit, we first have to provide a suitable EM model. Maxwell's equations, as published by James Clerk Maxwell in 1865 [32], are a set of equations unifying electric and magnetic field theory into classical electromagnetism. Given in their modern macroscopic differential vector-valued formulation using the SI-unit convention, Maxwell's equations (MEs) consist of four first order partial differential equations:

Gauss's law (GL): $\nabla \cdot D = \rho$, (29)
Gauss's law for magnetism (GLM): $\nabla \cdot B = 0$, (30)
Faraday-Lenz's law (FL): $\nabla \times E = -\partial_t B$, (31)
Maxwell-Ampère's law (MA): $\nabla \times H = J + \partial_t D$. (32)

MEs describe the behavior of the electric and magnetic flux densities $D$ and $B$ and of the electric and magnetic fields $E$ and $H$, depending on the distribution of charge and current density given by $\rho$ and $J$, respectively.


Constitutive equations. Similar to the constitutive element equations of electrical circuits, we need additional relations between the quantities in order to make the system (29)–(32) determinate, cf. [24]. The idea is to deduce $D$, $H$ and $J$ from $E$ and $B$, by making use of empirical observations when material comes into play. We consider

$$D = \varepsilon E, \qquad H = \nu B, \qquad J = \sigma E, \tag{33}$$

with permittivity $\varepsilon$, reluctivity (inverse permeability) $\nu = \mu^{-1}$ and conductivity $\sigma$.

Boundary conditions. In order to keep MEs (29)–(32) determinate we need to introduce boundary conditions. A typical choice is to assume $\Omega$, the domain of interest, being surrounded by a perfectly electric conducting (PEC) medium, see e.g. [2, 3, 10, 17]:

Assumption 7 (PEC Boundary) The tangential component of the electric field at the boundary vanishes, i.e. with $n$ being the outer unit normal

$$E \times n = 0 \quad\text{on } \partial\Omega,$$

on a given star-shaped domain $\Omega$.

A − ϕ formulation. Instead of solving MEs in their classical form, one typically makes use of alternative formulations. Here, we focus on the A − ϕ formulation, see e.g. [9, 10, 35, 43], which formulates MEs in terms of so-called potentials. This approach expresses the electric field $E$ and the magnetic flux density $B$ by

$$E = -\nabla\varphi - \partial_t A, \qquad B = \nabla \times A, \tag{34}$$

for some $A$ and $\varphi$ which are called magnetic vector potential and electric scalar potential, respectively. What is left to solve are GL (29) and MA (32) which, after incorporating the constitutive equations (33), yield the so-called A − ϕ formulation:

$$-\nabla \cdot [\varepsilon (\nabla\varphi + \partial_t A)] = \rho, \tag{35}$$

$$\nabla \times (\nu \nabla \times A) + \partial_t [\varepsilon (\nabla\varphi + \partial_t A)] + \sigma (\nabla\varphi + \partial_t A) = 0. \tag{36}$$
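The reason why only GL and MA remain can be made explicit. Assuming the potentials are smooth enough to interchange $\partial_t$ and $\nabla \times$, the standard identities $\nabla \cdot (\nabla \times A) = 0$ and $\nabla \times \nabla\varphi = 0$ show that the ansatz (34) satisfies GLM (30) and FL (31) identically:

```latex
\begin{align*}
\nabla \cdot B &= \nabla \cdot (\nabla \times A) = 0, \\
\nabla \times E &= -\nabla \times \nabla \varphi - \partial_t (\nabla \times A) = -\partial_t B.
\end{align*}
```

Inserting (33) and (34) into (29) and (32) then gives (35) and (36) term by term.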

Gauging. Since the reduction of equations introduces ambiguity, i.e. system (35)–(36) is not uniquely solvable, we follow the grad-div regularization approach from [14] and introduce the grad-type Lorenz gauge condition

$$\varepsilon \nabla \partial_t \varphi + \zeta \nabla [\xi \nabla \cdot (\zeta A)] = 0, \tag{37}$$

for some material tensors $\zeta$ and $\xi$, see [8]. Notice that, in the case of different materials, (37) has to be considered for each (homogeneous) material domain separately.

Boundary conditions. A possible choice for the A − ϕ formulation's boundary conditions can be deduced from Assumption 7 as follows, see e.g. [22]:


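Assumption 8 (Boundary Conditions for A − ϕ formulation) We assume a PEC medium at artificial boundaries:

$$A \times n = 0 \quad\text{on } \partial\Omega, \tag{38}$$

$$\nabla\varphi \times n = 0 \quad\text{on } \partial\Omega. \tag{39}$$

As proposed in [8] we end up with the following set of partial differential equations in order to solve an EM problem:

$$\varepsilon \nabla \partial_t \varphi + \zeta \nabla [\xi \nabla \cdot (\zeta A)] = 0, \tag{40}$$

$$\nabla \times (\nu \nabla \times A) + \partial_t [\varepsilon (\nabla\varphi + \Pi)] + \sigma (\nabla\varphi + \partial_t A) = 0, \tag{41}$$

$$\partial_t A - \Pi = 0, \tag{42}$$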

with boundary conditions (38)–(39), where $\Pi$ is a quasi-canonical momentum introduced in order to avoid second order derivatives.

Total current density. Let $J_{\mathrm{tot}} : \Omega \times I \to \mathbb{R}^3$ be the total current density flowing through a point in space at a certain time. The current through an arbitrary surface $\Gamma$, with unit normal $n$, is then obtained by $\int_\Gamma J_{\mathrm{tot}} \cdot n \, ds$.

Assumption 9 The total current density is given by the sum of displacement and conduction current density: $J_{\mathrm{tot}} := \partial_t D + J$.

We assume Assumption 9 to hold since it is implemented in the software package devEM, a commercial EM device modeling package, see [31]. Hence, the current through $\Gamma$ expressed in potentials reads

$$\int_\Gamma \big({-\partial_t [\varepsilon (\nabla\varphi + \Pi)]} - \sigma (\nabla\varphi + \partial_t A)\big) \cdot n \, ds = \int_\Gamma \nabla \times (\nu \nabla \times A) \cdot n \, ds,$$

according to MA (32). From the latter variant we observe that the cumulated sum through all terminals, i.e. the whole boundary $\partial\Omega$, equals zero:

$$\int_{\partial\Omega} \nabla \times (\nu \nabla \times A) \cdot n \, ds = \int_\Omega \nabla \cdot \nabla \times (\nu \nabla \times A) \, dr = 0. \tag{43}$$

From (43) it becomes clear that the model assumption, Assumption 9, is compatible with Kirchhoff's current law.

3.1.1 Spatial Discretization

The EM system (38)–(42) needs to be spatially discretized in order to apply time integration schemes. For this purpose we make use of the finite integration technique (FIT), originally introduced in 1977 by Thomas Weiland [45], which allows us to discretize the integral formulation of the EM system. In the following we do not go into the details of FIT, but refer to the aforementioned reference for further reading.


From the FIT we obtain two staggered meshes defined on $\Omega$ which are dual to each other. Let $n_P$, $n_L$, $n_F$ and $n_V \in \mathbb{N}$ be the numbers of geometrical objects, i.e. mesh points, links, faces and volumes, respectively, of the primal mesh. By the meshes' duality, these are also the numbers of volumes, faces, links and points, respectively, of the dual mesh. We assume PEC boundary conditions. Therefore, the discrete vector potential tangential to the boundary vanishes, see e.g. [40]. Further, the scalar potential is constant at the boundary, which is why we express the discrete scalar potential at the $n_P^{\mathrm{bound}}$ boundary points by a function $\phi_\Gamma : I \to \mathbb{R}^{n_P^{\mathrm{bound}}}$ that is constant for each contact. The spatially discretized quantities are associated with the $n_P^{\mathrm{int}} \in \mathbb{N}$ mesh points and $n_L^{\mathrm{int}} \in \mathbb{N}$ links which do not belong to the boundary $\partial\Omega$. These are the discrete scalar potential $\phi : I \to \mathbb{R}^{n_P^{\mathrm{int}}}$ on the internal mesh points and the discrete vector potential $a : I \to \mathbb{R}^{n_L^{\mathrm{int}}}$ on the internal mesh links as integral quantities, i.e. $a_i = \int_{\mathrm{link}_i} A \cdot dl$. On the meshes we introduce discrete operators for the gradient, divergence and curl as matrices, where the dual ones are denoted with a tilde:

primal gradient matrix: $G \in \{-1, 0, 1\}^{n_L^{\mathrm{int}} \times n_P^{\mathrm{int}}}$,
dual divergence matrix: $\tilde S \in \{-1, 0, 1\}^{n_P^{\mathrm{int}} \times n_L^{\mathrm{int}}}$,
primal curl matrix: $C \in \{-1, 0, 1\}^{n_F \times n_L^{\mathrm{int}}}$,
dual curl matrix: $\tilde C \in \{-1, 0, 1\}^{n_L^{\mathrm{int}} \times n_F}$.

By construction, the discrete operators satisfy $G = -\tilde S^\top$ and $C = \tilde C^\top$. The constitutive material parameters turn into matrices $M_\varepsilon \in \mathbb{R}^{n_L^{\mathrm{int}} \times n_L^{\mathrm{int}}}$, $M_\sigma \in \mathbb{R}^{n_L^{\mathrm{int}} \times n_L^{\mathrm{int}}}$ and $M_\nu \in \mathbb{R}^{n_F \times n_F}$ which connect the primal and dual meshes' fields, see e.g. [44]. From the gauging we obtain further material matrices $M_\zeta \in \mathbb{R}^{n_L^{\mathrm{int}} \times n_L^{\mathrm{int}}}$ and $M_\xi \in \mathbb{R}^{n_P^{\mathrm{int}} \times n_P^{\mathrm{int}}}$. Finally, $G_\Gamma \in \{-1, 0, 1\}^{n_L^{\mathrm{int}} \times n_P^{\mathrm{bound}}}$ is an extension of $G$ by the boundary nodes, required to incorporate the excitation $\phi_\Gamma$. Having chosen the domain's shape and boundary conditions carefully and using physically reasonable material, we may assume the properties collected in the following remark.

Remark 1 The permittivity and reluctivity matrices $M_\varepsilon$ and $M_\nu$ are positive definite, the conductivity matrix $M_\sigma$ is positive semi-definite and the discrete gradient operator $G$ has full column rank, see e.g. [8, 13].

The discrete counterpart of the A − ϕ formulation with Lorenz gauge (40)–(42) incorporating the boundary conditions (38)–(39) reads

$$\tilde S M_\varepsilon G \tfrac{d}{dt} \phi + \tilde S M_\zeta G M_\xi \tilde S M_\zeta a = 0, \tag{44}$$

$$\tilde C M_\nu C a + \tfrac{d}{dt} \big[ M_\varepsilon (G\phi + G_\Gamma \phi_\Gamma + \pi) \big] + M_\sigma \big( G\phi + G_\Gamma \phi_\Gamma + \tfrac{d}{dt} a \big) = 0, \tag{45}$$

$$\tfrac{d}{dt} a - \pi = 0, \tag{46}$$

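again with the discrete quasi-canonical momentum $\pi : I \to \mathbb{R}^{n_L^{\mathrm{int}}}$. System (44)–(46) is referred to as Maxwell's grid equations (MGEs) in this article, cf. [8]. Due to Remark 1 and by the duality property $G = -\tilde S^\top$, the discrete Laplace operator $\tilde S M_\varepsilon G$ is non-singular, implying that the MGEs form an ordinary differential equation system

$$\tfrac{d}{dt} u + b_{EM}(u) = c, \tag{47}$$

for $u = (\phi, a, \pi) : I \to \mathbb{R}^{n_P^{\mathrm{int}} + n_L^{\mathrm{int}} + n_L^{\mathrm{int}}}$ with

$$b_{EM}(u) := M_{EM}^{-1} B_{EM} u, \tag{48}$$

where

$$M_{EM} := \begin{pmatrix} \tilde S M_\varepsilon G & 0 & 0 \\ M_\varepsilon G & M_\sigma & M_\varepsilon \\ 0 & I & 0 \end{pmatrix}, \qquad M_{EM}^{-1} = \begin{pmatrix} (\tilde S M_\varepsilon G)^{-1} & 0 & 0 \\ 0 & 0 & I \\ -G (\tilde S M_\varepsilon G)^{-1} & M_\varepsilon^{-1} & -M_\varepsilon^{-1} M_\sigma \end{pmatrix}$$

and

$$B_{EM} := \begin{pmatrix} 0 & \tilde S M_\zeta G M_\xi \tilde S M_\zeta & 0 \\ M_\sigma G & \tilde C M_\nu C & 0 \\ 0 & 0 & -I \end{pmatrix}, \qquad c := -M_{EM}^{-1} \begin{pmatrix} 0 \\ \tfrac{d}{dt} M_\varepsilon G_\Gamma \phi_\Gamma + M_\sigma G_\Gamma \phi_\Gamma \\ 0 \end{pmatrix}.$$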
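The non-singularity of the discrete Laplace operator can be illustrated on a toy mesh. The sketch below assumes a hypothetical one-dimensional chain with three interior points and four links, a diagonal permittivity matrix with made-up positive entries, and the convention that a link's gradient entry is +1 at its end point and -1 at its start point; none of this data comes from the text. By the duality $\tilde S = -G^\top$ the operator equals $-G^\top M_\varepsilon G$, so it is negative definite whenever $G$ has full column rank, which the leading principal minors of its negative confirm via Sylvester's criterion.

```python
def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def det(M):
    """Determinant by Laplace expansion along the first row (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


# Chain: boundary point, 3 interior points, boundary point, hence 4 links.
# Row k is link k, column j is interior point j (boundary columns removed).
G = [[ 1,  0,  0],
     [-1,  1,  0],
     [ 0, -1,  1],
     [ 0,  0, -1]]
S_dual = [[-g for g in row] for row in transpose(G)]    # duality S~ = -G^T

eps = [1, 2, 3, 4]                                      # hypothetical permittivity per link
M_eps = [[eps[i] if i == j else 0 for j in range(4)] for i in range(4)]

laplace = matmul(matmul(S_dual, M_eps), G)              # S~ M_eps G

# Sylvester's criterion applied to -laplace: all leading principal minors positive.
negated = [[-x for x in row] for row in laplace]
minors = [det([row[:k] for row in negated[:k]]) for k in (1, 2, 3)]
```

Dropping the boundary columns from `G` is what gives it full column rank here; with the boundary points included, constant potentials would lie in its kernel.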

4 Coupled Electric Circuits and Electromagnetic Devices

In this section we provide a coupled model for electric circuits incorporating EM devices, cf. Fig. 1, that follows the approach in [8, 38]. To this end, we describe the EM device as a subcircuit of circuit elements introduced in Sect. 2.2 including mock elements.


[Circuit diagram; labels: $q_C$, $\phi_L$, $g_R$, $v_s$, node potentials $e_1$ to $e_4$]

Fig. 1 Example of an electric circuit incorporating an EM device

4.1 Modeling

Let the EM model's boundary $\partial\Omega$ consist of $m_{EM} + 1$ disjoint nonempty parts $\Gamma_j \subset \partial\Omega$. Each part is assumed to be connected to exactly one node of the electric circuit. By convention, $\Gamma_0$ is attached to the element's reference terminal. The boundary parts can be mapped to the circuit nodes by the device's incidence matrix $A_{EM} \in \{-1, 0, 1\}^{(n-1) \times m_{EM}}$ with $n$ being the total number of circuit nodes.

Coupling branch voltages. First, we select the primal mesh points belonging to the $j$-th terminal, for $0 \le j \le m_{EM}$, by making use of the matrix

$$\Lambda \in \{0, 1\}^{n_P^{\mathrm{bound}} \times m_{EM}}, \qquad \lambda_{ij} = \begin{cases} 1 & \text{if mesh point } i \text{ belongs to } \Gamma_j, \\ 0 & \text{else.} \end{cases}$$

Next, we denote the EM device's branch voltages by $v_{EM} : I \to \mathbb{R}^{m_{EM}}$. Since they can be considered as potentials with respect to the reference terminal's potential, we map these quantities directly to the adjacent scalar potential, which yields

$$\phi_\Gamma = \Lambda v_{EM}. \tag{49}$$

Since the MNA approach substitutes branch voltages by node potentials $e$, we replace $v_{EM}$ in (49) by $A_{EM}^\top e$ and obtain

$$\phi_\Gamma = \Lambda A_{EM}^\top e. \tag{50}$$
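The role of $\Lambda$ in (49) can be sketched in a few lines. The example assumes a hypothetical assignment of five boundary mesh points to three terminals and reads the columns of $\Lambda$ as the non-reference terminals $j = 1, \dots, m_{EM}$, so that points on the reference terminal receive potential zero; both the data and this reading are assumptions made for illustration.

```python
# Hypothetical data: terminal index of each boundary mesh point (0 = reference terminal).
terminal_of_point = [0, 1, 1, 2, 0]
m_EM = 2                                  # number of non-reference terminals


def selection_matrix(terminal_of_point, m_EM):
    """Lambda[i][j-1] = 1 if boundary point i belongs to terminal j, for j = 1..m_EM."""
    return [[1 if term == j else 0 for j in range(1, m_EM + 1)]
            for term in terminal_of_point]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


Lam = selection_matrix(terminal_of_point, m_EM)

# Branch voltages of the device w.r.t. the reference terminal, mapped to the boundary
# potentials as in (49); every mesh point of one contact gets the same value.
v_EM = [3.0, -1.0]
phi_boundary = matvec(Lam, v_EM)
```

Replacing `v_EM` by $A_{EM}^\top e$ then gives (50), which ties the boundary excitation of the field model directly to the circuit's node potentials.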

Coupling branch currents. Recalling the model for the total current density, we note that the discrete variables for both $J$ and $D$ are located on dual mesh facets. Therefore we introduce a mapping and selection to the boundary's primal mesh points by $-G_\Gamma^\top$, cf. [8]. Since the currents through the element's boundary comply with KCL, we can map them to the reference to non-reference terminal relations, i.e. with $\Lambda^\top$, yielding the EM device's branch currents $i_{EM} : I \to \mathbb{R}^{m_{EM}}$ with

$$i_{EM} = \Lambda^\top G_\Gamma^\top \Big[ \tfrac{d}{dt} \big[ M_\varepsilon (G\phi + G_\Gamma \phi_\Gamma + \pi) \big] + M_\sigma \big( G\phi + G_\Gamma \phi_\Gamma + \tfrac{d}{dt} a \big) \Big] \tag{51}$$

or analogously, according to MA (45) in MGEs,

$$i_{EM} = -\Lambda^\top G_\Gamma^\top \tilde C M_\nu C a. \tag{52}$$

Coupling functions. In the following, we derive all the functions that are necessary in order to incorporate the EM device as a current controlling mock element in the sense of Sect. 2.2. We provide a different variant for each of the coupling equations (56), (59) and (60). We start by defining some common functions

$$d_{EM\text{-}MNA}(x) := M_\varepsilon G_\Gamma \Lambda A_{EM}^\top e \tag{53}$$

and

$$c_{EM}(w_{EM}, x) := -M_{EM}^{-1} \begin{pmatrix} 0 \\ w_{EM} + M_\sigma G_\Gamma \Lambda A_{EM}^\top e \\ 0 \end{pmatrix}. \tag{54}$$

Notice that $d_{EM\text{-}MNA}$ is linear and $c_{EM}$ is linear in both arguments. After incorporating the branch voltage coupling equation (50) into (47) we obtain the ODE subsystem

$$\tfrac{d}{dt} u + b_{EM}(u) = c_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ x\big) \tag{55}$$

for the EM device in terms of circuit variables. Next, we incorporate (50) into (51) and obtain

$$i_{EM} = f_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ \tfrac{d}{dt} u,\ x,\ u\big) \tag{56}$$

with

$$f_{EM}(w_{EM}, \dot u, x, u) := \Lambda^\top G_\Gamma^\top \big[ M_\varepsilon (G \dot\phi + \dot\pi) + w_{EM} + M_\sigma (G\phi + G_\Gamma \Lambda A_{EM}^\top e + \dot a) \big]. \tag{57}$$

Note that shifting the differentiation operator in front of each component of $u$ does not require any further smoothness. Finally, by splitting up the summands in $f_{EM}$, we obtain new functions

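$$q_{EM}(A_{EM}^\top e) := \Lambda^\top G_\Gamma^\top M_\varepsilon G_\Gamma \Lambda A_{EM}^\top e,$$

$$g_{EM}(A_{EM}^\top e) := \Lambda^\top G_\Gamma^\top M_\sigma G_\Gamma \Lambda A_{EM}^\top e,$$

$$s_{EM}(\dot u, u) := \Lambda^\top G_\Gamma^\top \big[ M_\varepsilon (G \dot\phi + \dot\pi) + M_\sigma (G\phi + \dot a) \big] \tag{58}$$

such that (56) is equivalent to

$$i_{EM} = \tfrac{d}{dt} q_{EM}(A_{EM}^\top e) + g_{EM}(A_{EM}^\top e) + s_{EM}\big(\tfrac{d}{dt} u, u\big). \tag{59}$$

Remark 2 $\Lambda^\top G_\Gamma^\top M_\varepsilon G_\Gamma \Lambda$ and $\Lambda^\top G_\Gamma^\top M_\sigma G_\Gamma \Lambda$ are positive definite and positive semidefinite, respectively.

[Circuit diagram; labels: $q_C$, $\phi_L$, $g_R$, $v_s$, node potentials $e_1$ to $e_4$, mock element with $f_{EM}$ or $\hat f_{EM}$, EM device]

Fig. 2 Electric circuit incorporating an EM device entirely as a mock element, here represented by controlled sources along each branch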

The constitutive element equation (59) allows us to interpret the EM device as a composition of a parallel capacitor, a resistor and a controlled current source alongside each branch, cf. Fig. 3. Note that a similar composition with a capacitive part extraction has been presented in [1] for the coupling of circuit and spatially discretized semiconductor device models. Using the alternative version of the total current model (52), which agrees with the one in [8], we obtain

$$i_{EM} = \hat f_{EM}(u) := -\Lambda^\top G_\Gamma^\top \tilde C M_\nu C a. \tag{60}$$

Remark 3 (EM device as mock element) According to (56), this EM model fits the mock element in Sect. 2.2 for either

$$i_M = f_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ \tfrac{d}{dt} u,\ x,\ u\big) \quad\text{or}\quad i_M = \hat f_{EM}(u),$$

see Fig. 2, as well as for $i_M = s_{EM}(\tfrac{d}{dt} u, u)$, see Fig. 3, where the latter variant requires a shift of $q_{EM}$ and $g_{EM}$ into the circuit equations. Further, we use $A_M := A_{EM}$.

[Circuit diagram; labels: $q_C$, $\phi_L$, $g_R$, $v_s$, node potentials $e_1$ to $e_4$, $q_{EM}$, $g_{EM}$, mock element $s_{EM}$, EM device]

Fig. 3 Electric circuit incorporating an EM device whose constitutive element equations are interpreted as a composition of capacitors, resistors (possibly degenerated and hence positive semidefinite) and a mock element, here represented by controlled sources along each branch

4.2 Systems

On the basis of the above preliminaries, we now provide a few categories of system formulations for the coupled problem that arise either due to implementation purposes, performance reasons or analysis results.

devEM version. As the devEM solver's interface takes control of the current coupling, it used to be convenient to write (56) as an extra equation and incorporate it into KCL with the dummy variable $i_{EM}$, cf. [31, 37]. Hence, the coupled system that is solved arises from (7), (55), and (56) and is of the form

$$\tfrac{d}{dt} u + b_{EM}(u) = c_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ x\big), \tag{61}$$

$$f_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ \tfrac{d}{dt} u,\ x,\ u\big) - i_{EM} = 0, \tag{62}$$

$$f_{MNA}\big(\tfrac{d}{dt} d_{MNA}(x),\ x,\ t\big) = c_{MNA}(i_{EM}). \tag{63}$$

The functions $b_{EM}$, $c_{EM}$, $d_{EM\text{-}MNA}$, $f_{EM}$, $f_{MNA}$, $d_{MNA}$ and $c_{MNA}$ are given by (48), (54), (53), (57) and (8).

Incorporated version. Following the idea of the MNA approach to eliminate as many current quantities as possible, we incorporate the current coupling equation (62) into KCL of (7). Here, we make use of the previous observation (59) and group the resistance, capacitance and controlled current source like constitutive equations by introducing

$$\bar A_C := [A_C\ A_{EM}], \qquad \bar A_R := [A_R\ A_{EM}], \qquad \bar q_C := \begin{pmatrix} q_C \\ q_{EM} \end{pmatrix}, \qquad \bar g_R := \begin{pmatrix} g_R \\ g_{EM} \end{pmatrix}.$$

Together with the EM system (55) we obtain

$$\tfrac{d}{dt} u + b_{EM}(u) = c_{EM}\big(\tfrac{d}{dt} d_{EM\text{-}MNA}(x),\ x\big), \tag{64}$$

$$f_{MNA2}\big(\tfrac{d}{dt} d_{MNA2}(x),\ x,\ t\big) = c_{MNA2}\big(\tfrac{d}{dt} u, u\big) \tag{65}$$

for

$$f_{MNA2}(w, x, t) := \begin{pmatrix} \bar A_C w_C + \bar A_R\, \bar g_R(\bar A_R^\top e) + A_L i_L + A_V i_V + A_I i_s(t) \\ w_L - A_L^\top e \\ v_s(t) - A_V^\top e \end{pmatrix},$$

$$d_{MNA2}(x) := \begin{pmatrix} \bar q_C(\bar A_C^\top e) \\ \phi_L(i_L) \end{pmatrix}, \qquad c_{MNA2}(\dot u, u) := \begin{pmatrix} -A_{EM}\, s_{EM}(\dot u, u) \\ 0 \\ 0 \end{pmatrix}.$$

Remark 4 The system (64)–(65) fulfills Assumption 6 with a mock element described by $i_M = s_{EM}(\frac{d}{dt}u, u)$, see Eq. (58). Therefore, Corollary 2 applies to (64)–(65) without any preconditions on the EM device’s position.

Alternative version. Using (60) instead of (59) as the current coupling equation, we obtain

$$\frac{d}{dt}u + b_{EM}(u) = c_{EM}\Big(\frac{d}{dt}d_{EM-MNA}(x),\, x\Big), \tag{66}$$

$$f_{MNA}\Big(\frac{d}{dt}d_{MNA}(x),\, x,\, t\Big) = c_{MNA3}(u) \tag{67}$$

with

$$c_{MNA3}(u) := \begin{pmatrix} -A_{EM}\, \hat f_{EM}(u) \\ 0 \\ 0 \end{pmatrix}.$$

Remark 5 The three presented systems of DAEs are analytically equivalent in the following sense. Every $(x, u, i_{EM})$ is a solution of (61)–(62) if and only if $(u, x)$ is a solution of (64)–(65) or (66)–(67), respectively, and

$$i_{EM} = f_{EM}\Big(\frac{d}{dt}d_{EM-MNA}(x),\, \frac{d}{dt}u,\, x,\, u\Big).$$

Proof Whereas the equivalence of (61)–(62) and (64)–(65) is obvious, the equivalence of (61)–(62) and (66)–(67) follows from the discrete MA Eq. (45). □


5 Waveform Relaxation Method

There exist two main types of waveform relaxation methods, see [11] for a detailed description. The first one is of Jacobi type, where the subsystems can be solved in parallel. The second one is of Gauss–Seidel type, where the subsystems are solved one after another. The latter usually has better convergence properties. We discuss here only a Gauss–Seidel type waveform relaxation as one promising prototype. We expect similar results (with stronger convergence criteria) for Jacobi type waveform relaxation methods.

5.1 Gauss–Seidel Method

As described in [11, 29], Gauss–Seidel waveform relaxation for two coupled ODEs

$$\dot u = f(u, x, t), \qquad \dot x = g(u, x, t)$$

is, for iteration counter $k$, given by the iteration

$$\dot u^{[k]} = f(u^{[k]}, x^{[k-1]}, t), \qquad \dot x^{[k]} = g(u^{[k]}, x^{[k]}, t)$$

starting from an initial function $x^{[0]}$. For our coupled systems here, we need $x^{[0]}$ and $\frac{d}{dt}d_{EM-MNA}(x^{[0]})$ as initial functions on the time interval $I$. Transferring the Gauss–Seidel iteration idea to the systems (61)–(62), (64)–(65) and (66)–(67), respectively, we obtain the evaluation procedures

• for the devEM version:

$$\frac{d}{dt}u^{[k]} + b_{EM}(u^{[k]}) = c_{EM}\Big(\frac{d}{dt}d_{EM-MNA}(x^{[k-1]}),\, x^{[k-1]}\Big) \tag{68}$$

coupled with

$$f_{EM}\Big(\frac{d}{dt}d_{EM}(x^{[k]}),\, \frac{d}{dt}u^{[k]},\, x^{[k]},\, u^{[k]}\Big) - i_{EM}^{[k]} = 0, \tag{69}$$

$$f_{MNA}\Big(\frac{d}{dt}d_{MNA}(x^{[k]}),\, x^{[k]},\, t\Big) = c_{MNA}(i_{EM}^{[k]}), \tag{70}$$

• for the incorporated version:

$$\frac{d}{dt}u^{[k]} + b_{EM}(u^{[k]}) = c_{EM}\Big(\frac{d}{dt}d_{EM-MNA}(x^{[k-1]}),\, x^{[k-1]}\Big) \tag{71}$$

coupled with

$$f_{MNA2}\Big(\frac{d}{dt}d_{MNA2}(x^{[k]}),\, x^{[k]},\, t\Big) = c_{MNA2}\Big(\frac{d}{dt}u^{[k]},\, u^{[k]}\Big), \tag{72}$$

• for the alternative version:

$$\frac{d}{dt}u^{[k]} + b_{EM}(u^{[k]}) = c_{EM}\Big(\frac{d}{dt}d_{EM-MNA}(x^{[k-1]}),\, x^{[k-1]}\Big) \tag{73}$$

coupled with

$$f_{MNA}\Big(\frac{d}{dt}d_{MNA}(x^{[k]}),\, x^{[k]},\, t\Big) = c_{MNA3}(u^{[k]}). \tag{74}$$

Here $k \in \mathbb{N}$ is the iteration parameter and all equations have to be completed by the initial values, i.e. $u^{[k]}(t_0) = u_0$ and $x^{[k]}(t_0) = x_0$.

Remark 6 Contrary to Remark 5, the systems are no longer entirely equivalent. Whereas $(u^{[k]}, x^{[k]}, i_{EM}^{[k]})$ is a solution of (68)–(70) if and only if $(u^{[k]}, x^{[k]})$ is a solution of (71)–(72) and $i_{EM}^{[k]} = f_{EM}(\frac{d}{dt}d_{EM}(x^{[k]}), \frac{d}{dt}u^{[k]}, x^{[k]}, u^{[k]})$, the solutions $(u^{[k]}, x^{[k]})$ may differ from those of (73)–(74). Furthermore, one could change the order of the subsystems. We only considered the case where the ODE subsystem is solved first. We expect more difficulties when solving the DAE subsystem first, since the algebraic constraints do not stay the same during the solution process in this case.
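The Gauss–Seidel iteration for two coupled ODEs can be illustrated on a toy problem. The following is a minimal sketch, not the coupled field/circuit solver: it assumes a scalar linear system $\dot u = au + bx$, $\dot x = cu + dx$ with made-up coefficients, discretizes both subsystems with implicit Euler (BDF of first order) over the whole window, and compares each iterate against the monolithic implicit Euler solution. All function names and parameter values are hypothetical.

```python
import numpy as np


def monolithic_implicit_euler(a, b, c, d, u0, x0, h, n_steps):
    """Implicit Euler for u' = a*u + b*x, x' = c*u + d*x solved as one system."""
    system = np.array([[a, b], [c, d]], dtype=float)
    step = np.linalg.inv(np.eye(2) - h * system)  # z_{n+1} = (I - h*A)^{-1} z_n
    sol = np.empty((n_steps + 1, 2))
    sol[0] = (u0, x0)
    for n in range(n_steps):
        sol[n + 1] = step @ sol[n]
    return sol[:, 0], sol[:, 1]


def gauss_seidel_wr(a, b, c, d, u0, x0, h, n_steps, n_iter):
    """Gauss-Seidel waveform relaxation over the whole window (no windowing)."""
    x_old = np.full(n_steps + 1, x0, dtype=float)  # initial waveform x^[0]
    iterates = []
    for _ in range(n_iter):
        # First subsystem: u^[k] driven by the previous waveform x^[k-1].
        u_new = np.empty(n_steps + 1)
        u_new[0] = u0
        for n in range(n_steps):
            u_new[n + 1] = (u_new[n] + h * b * x_old[n + 1]) / (1.0 - h * a)
        # Second subsystem: x^[k] driven by the fresh waveform u^[k].
        x_new = np.empty(n_steps + 1)
        x_new[0] = x0
        for n in range(n_steps):
            x_new[n + 1] = (x_new[n] + h * c * u_new[n + 1]) / (1.0 - h * d)
        iterates.append((u_new, x_new))
        x_old = x_new
    return iterates


def max_abs_error(iterate, reference):
    """Largest pointwise deviation of (u, x) from the reference waveforms."""
    return max(np.max(np.abs(iterate[0] - reference[0])),
               np.max(np.abs(iterate[1] - reference[1])))


# Hypothetical toy coefficients; they are not taken from the benchmark.
params = dict(a=-1.0, b=1.0, c=-1.0, d=-1.0, u0=1.0, x0=0.0, h=0.01, n_steps=100)
reference = monolithic_implicit_euler(**params)
iterates = gauss_seidel_wr(n_iter=25, **params)
errors = [max_abs_error(it, reference) for it in iterates]
print(errors[0], errors[-1])
```

On the discrete level the fixed point of this iteration satisfies the same implicit Euler equations as the monolithic scheme, which is why the monolithic solution serves as the reference here.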

5.2 Convergence Analysis

In this section we prove the convergence of the Gauss–Seidel method for the alternative version of the coupled electric circuit and EM device system (73)–(74). In analogy to Remark 3, we can interpret the EM device in (73)–(74) as a mock element by setting $i_M^{[k]} = \hat f_{EM}(u^{[k]})$ and $A_M = A_{EM}$.

Theorem 3 (Convergence) Let the Assumptions 1–6 be satisfied. Then, the Gauss–Seidel method’s sequence $(u^{[k]}, x^{[k]})$ of (73)–(74) converges in $(C^1(I), \|\cdot\|_{C^1(I)})$ to the solution $(u, x)$ of (66)–(67) if the time interval $I$ is sufficiently small and $\rho(\bar M) < 1$ for all $u$ and $x$, where

$$\bar M := C_{EM}\, D_{EM-MNA}\, T_2\, J_2, \tag{75}$$

with

$$C_{EM} := \partial_{w_{EM}} c_{EM}(w_{EM}, x), \qquad D_{EM-MNA} := \partial_x d_{EM-MNA}(x), \qquad J_2 := \partial_{s_{cf}} f_2\; \partial_u s_{cf}(u).$$

Proof For $s_{cf} := A_{EM}\hat f_{EM}$ we get $s_{cf}(u^{[k]}) = A_{EM}\hat f_{EM}(u^{[k]}) = A_{EM}\, i_M^{[k]}$ and can decouple (74) by making use of Corollary 2, yielding

$$\frac{d}{dt}y^{[k]} = f_0\big(y^{[k]}, z_1^{[k]}, z_2^{[k]}, z_3^{[k]}, s_i, s_{cf}(u^{[k]})\big), \tag{76}$$

$$z_1^{[k]} = M_1(y^{[k]}, z_3^{[k]})\, \frac{d}{dt}z_3^{[k]} + f_1\big(y^{[k]}, z_2^{[k]}, z_3^{[k]}, s_i, s_{cf}(u^{[k]})\big), \tag{77}$$

$$z_2^{[k]} = f_2\big(y^{[k]}, z_3^{[k]}, s_v, s_i, s_{cf}(u^{[k]})\big), \tag{78}$$

$$z_3^{[k]} = M_3 \begin{pmatrix} s_i \\ s_v \end{pmatrix}. \tag{79}$$

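Due to $x = T_0 y + T_1 z_1 + T_2 z_2 + T_3 z_3$, we obtain from (73)

$$\begin{aligned}
\frac{d}{dt}u^{[k]} &= -b_{EM}(u^{[k]}) + c_{EM}\Big(\frac{d}{dt}d_{EM-MNA}\big(T_0 y^{[k-1]} + T_1 z_1^{[k-1]} + T_2 z_2^{[k-1]} + T_3 z_3^{[k-1]}\big), \\
&\qquad\qquad T_0 y^{[k-1]} + T_1 z_1^{[k-1]} + T_2 z_2^{[k-1]} + T_3 z_3^{[k-1]}\Big) \\
&= -b_{EM}(u^{[k]}) + c_{EM}\Big(D_{EM-MNA}\big(T_0 y^{[k-1]} + T_1 z_1^{[k-1]} + T_2 z_2^{[k-1]} + T_3 z_3^{[k-1]}\big) \\
&\qquad\qquad \cdot \Big(T_0 \frac{d}{dt}y^{[k-1]} + T_1 \frac{d}{dt}z_1^{[k-1]} + T_2 \frac{d}{dt}z_2^{[k-1]} + T_3 \frac{d}{dt}z_3^{[k-1]}\Big), \\
&\qquad\qquad T_0 y^{[k-1]} + T_1 z_1^{[k-1]} + T_2 z_2^{[k-1]} + T_3 z_3^{[k-1]}\Big).
\end{aligned} \tag{80}$$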

This expression requires the time derivatives of $z_1^{[k-1]}$, $z_2^{[k-1]}$ and $z_3^{[k-1]}$. For the sake of simplicity we introduce the notations $m_3(s_v, s_i) := M_3 \begin{pmatrix} s_i \\ s_v \end{pmatrix}$ and

$$\begin{aligned}
f_0^{[k]} &:= f_0\big(y^{[k]}, z_1^{[k]}, z_2^{[k]}, z_3^{[k]}, s_i, s_{cf}(u^{[k]})\big) \\
&= f_0\Big(y^{[k]},\; M_1\big(y^{[k]}, m_3(s_v, s_i)\big)\, m_3(\dot s_v, \dot s_i) \\
&\qquad\quad + f_1\big(y^{[k]}, f_2(y^{[k]}, m_3(s_v, s_i), s_v, s_i, s_{cf}(u^{[k]})), m_3(s_v, s_i), s_i, s_{cf}(u^{[k]})\big), \\
&\qquad\quad f_2\big(y^{[k]}, m_3(s_v, s_i), s_v, s_i, s_{cf}(u^{[k]})\big),\; m_3(s_v, s_i),\; s_i,\; s_{cf}(u^{[k]})\Big), \\
f_1^{[k]} &:= f_1\big(y^{[k]}, z_2^{[k]}, z_3^{[k]}, s_i, s_{cf}(u^{[k]})\big) \\
&= f_1\big(y^{[k]}, f_2(y^{[k]}, m_3(s_v, s_i), s_v, s_i, s_{cf}(u^{[k]})), m_3(s_v, s_i), s_i, s_{cf}(u^{[k]})\big), \\
f_2^{[k]} &:= f_2\big(y^{[k]}, z_3^{[k]}, s_v, s_i, s_{cf}(u^{[k]})\big) = f_2\big(y^{[k]}, m_3(s_v, s_i), s_v, s_i, s_{cf}(u^{[k]})\big), \\
M_1^{[k]} &:= M_1(y^{[k]}, z_3^{[k]}) = M_1\big(y^{[k]}, m_3(s_v, s_i)\big).
\end{aligned}$$

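Then we collect the occurrences of variables in

function        occurrences
$f_0^{[k]}$     $u^{[k]}, y^{[k]}$
$f_1^{[k]}$     $u^{[k]}, y^{[k]}$
$f_2^{[k]}$     $u^{[k]}, y^{[k]}$
$M_1^{[k]}$     $y^{[k]}$
$z_1^{[k]}$     $u^{[k]}, y^{[k]}$
$z_2^{[k]}$     $u^{[k]}, y^{[k]}$
$z_3^{[k]}$     −

With these shorthands, we find the following expressions for the derivatives $\dot z_1$, $\dot z_2$ and $\dot z_3$:

$$\begin{aligned}
\dot z_3^{[k]} &= m_3(\dot s_v, \dot s_i), \\
\dot z_2^{[k]} &= \partial_y f_2^{[k]}\, \dot y^{[k]} + \partial_{z_3} f_2^{[k]}\, m_3(\dot s_v, \dot s_i) + \partial_{s_v} f_2^{[k]}\, \dot s_v + \partial_{s_i} f_2^{[k]}\, \dot s_i + J_2^{[k]}\, \dot u^{[k]}, \\
\dot z_1^{[k]} &= M_1^{[k]}\, m_3(\ddot s_v, \ddot s_i) + \Big(\partial_y M_1^{[k]}\, \dot y^{[k]} + \partial_{z_3} M_1^{[k]}\, m_3(\dot s_v, \dot s_i)\Big)\, m_3(\dot s_v, \dot s_i) + \partial_y f_1^{[k]}\, \dot y^{[k]} \\
&\quad + \partial_{z_2} f_1^{[k]} \Big(\partial_y f_2^{[k]}\, \dot y^{[k]} + \partial_{z_3} f_2^{[k]}\, m_3(\dot s_v, \dot s_i) + \partial_{s_v} f_2^{[k]}\, \dot s_v + \partial_{s_i} f_2^{[k]}\, \dot s_i + J_2^{[k]}\, \dot u^{[k]}\Big) \\
&\quad + \partial_{z_3} f_1^{[k]}\, m_3(\dot s_v, \dot s_i) + \partial_{s_i} f_1^{[k]}\, \dot s_i + J_1^{[k]}\, \dot u^{[k]},
\end{aligned}$$

with

$$J_1^{[k]} := \partial_{s_{cf}} f_1^{[k]}\; \partial_u s_{cf}(u^{[k]}), \qquad J_2^{[k]} := \partial_{s_{cf}} f_2^{[k]}\; \partial_u s_{cf}(u^{[k]}).$$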

We observe the following further occurrences of variables

function              occurrences
$\dot z_1^{[k]}$      $u^{[k]}, y^{[k]}, \dot u^{[k]}, \dot y^{[k]}$
$\dot z_2^{[k]}$      $u^{[k]}, y^{[k]}, \dot u^{[k]}, \dot y^{[k]}$
$\dot z_3^{[k]}$      −

Inserting the above shorthands and the known functions (and their derivatives), and exploiting the occurrence tables, we observe that the right-hand side of the ODE subsystem (80) can be defined as a function $\theta_1$ such that

$$\frac{d}{dt}u^{[k]} = \theta_1\Big(u^{[k]},\, \frac{d}{dt}u^{[k-1]},\, \frac{d}{dt}y^{[k-1]},\, u^{[k-1]},\, y^{[k-1]},\, t\Big). \tag{81}$$

Analogously, we can define a $\theta_2$ such that for (76) it holds that

$$\frac{d}{dt}y^{[k]} = \theta_2(u^{[k]}, y^{[k]}, t). \tag{82}$$

The latter two equations, i.e. (81)–(82), form the overall inherent ODE of the alternative coupled system with applied Gauss–Seidel method (73)–(74). Analogously, the inherent ODE for the system without Gauss–Seidel method (66)–(67) reads

$$\frac{d}{dt}u = \theta_1\Big(u,\, \frac{d}{dt}u,\, \frac{d}{dt}y,\, u,\, y,\, t\Big), \tag{83}$$

$$\frac{d}{dt}y = \theta_2(u, y, t). \tag{84}$$

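To (83)–(84) we now apply Theorem 2.3 of [36], stating that we have to consider the convergence matrix

$$M := \begin{pmatrix} \dfrac{\partial\theta_1}{\partial \frac{d}{dt}u} & \dfrac{\partial\theta_1}{\partial \frac{d}{dt}y} \\[2ex] \dfrac{\partial\theta_2}{\partial \frac{d}{dt}u} & \dfrac{\partial\theta_2}{\partial \frac{d}{dt}y} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial\theta_1}{\partial \frac{d}{dt}u} & \dfrac{\partial\theta_1}{\partial \frac{d}{dt}y} \\[2ex] 0 & 0 \end{pmatrix}.$$

Tracing back the occurrences of $\frac{d}{dt}u^{[k-1]}$, by making use of the occurrence tables with altered indices, we obtain

$$\frac{\partial\theta_1}{\partial \frac{d}{dt}u} = C_{EM}\, D_{EM-MNA} \Big(T_1 \big(\partial_{z_2} f_1\, \partial_{s_{cf}} f_2 + \partial_{s_{cf}} f_1\big) + T_2\, \partial_{s_{cf}} f_2\Big)\, \partial_u s_{cf},$$

$$\frac{\partial\theta_1}{\partial \frac{d}{dt}y} = C_{EM}\, D_{EM-MNA} \Big(T_0 + T_1 \big(\partial_y M_1\, m_3(\dot s_v, \dot s_i) + \partial_y f_1 + \partial_{z_2} f_1\, \partial_y f_2\big) + T_2\, \partial_y f_2\Big).$$

Further, as we met Assumption 6, it is $A_{EM}^\top Q_C Q_V Q_R = 0$ and, hence,

$$D_{EM-MNA}\, T_1 = M_\varepsilon\, G\, \Lambda \begin{pmatrix} A_{EM}^\top & 0 & 0 \end{pmatrix} \begin{bmatrix} Q_C Q_V Q_R & 0 \\ 0 & 0 \\ 0 & \bar Q_V \end{bmatrix} = 0,$$

which follows from Corollary 1 and from applying Lemma 1 twice. Therefore, the derivative expressions reduce to

$$\frac{\partial\theta_1}{\partial \frac{d}{dt}u} = C_{EM}\, D_{EM-MNA}\, T_2\, J_2 \qquad \text{and} \qquad \frac{\partial\theta_1}{\partial \frac{d}{dt}y} = C_{EM}\, D_{EM-MNA} \big(T_0 + T_2\, \partial_y f_2\big).$$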

Due to the lower zero blocks, the spectral radius of $M$ equals the spectral radius of the upper left quadrant, i.e.

$$\rho(M) = \rho\Big(\frac{\partial\theta_1}{\partial \frac{d}{dt}u}\Big) = \rho\big(C_{EM}\, D_{EM-MNA}\, T_2\, J_2\big).$$

Now, $C^1$ convergence of $(u^{[k]}, y^{[k]}) \to (u, y)$ is obtained if $I$ is sufficiently small and $\rho(\bar M) < 1$ for all $u$ and $x$. Analogously to the second step of the proof of Theorem 2.4 in [36], we exploit the (Lipschitz) continuities of $f_1$, $f_2$, $M_1$ and $M_3$ and obtain for the limits of (77)–(79)

$$\begin{aligned}
\lim_{k\to\infty} z_1^{[k]} &= \lim_{k\to\infty} M_1(y^{[k]}, z_3^{[k]})\, \frac{d}{dt}z_3^{[k]} + f_1\big(y^{[k]}, z_2^{[k]}, z_3^{[k]}, s_i, s_{cf}(u^{[k]})\big) \\
&= M_1(y, z_3)\, \frac{d}{dt}z_3 + f_1\big(y, z_2, z_3, s_i, s_{cf}(u)\big) = z_1, \\
\lim_{k\to\infty} z_2^{[k]} &= \lim_{k\to\infty} f_2\big(y^{[k]}, z_3^{[k]}, s_v, s_i, s_{cf}(u^{[k]})\big) = f_2\big(y, z_3, s_v, s_i, s_{cf}(u)\big) = z_2, \\
\lim_{k\to\infty} z_3^{[k]} &= \lim_{k\to\infty} m_3(s_v, s_i) = z_3.
\end{aligned}$$

As no derivatives of either $u$ or $y$ were involved,

$$\lim_{k\to\infty} x^{[k]} = \lim_{k\to\infty} T_0 y^{[k]} + T_1 z_1^{[k]} + T_2 z_2^{[k]} + T_3 z_3^{[k]} = T_0 y + T_1 z_1 + T_2 z_2 + T_3 z_3 = x$$

converges in $C^1$ as well. □
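The step from $\rho(M)$ to the spectral radius of the upper left block relies only on the block structure of $M$. A small numerical check of that fact is sketched below; the blocks are random placeholders of arbitrary size and are not the matrices $C_{EM} D_{EM-MNA} T_2 J_2$ and $C_{EM} D_{EM-MNA}(T_0 + T_2 \partial_y f_2)$ of the benchmark.

```python
import numpy as np


def spectral_radius(mat):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(mat))))


rng = np.random.default_rng(0)
n_u, n_y = 3, 2  # hypothetical sizes of the u- and y-blocks

upper_left = rng.standard_normal((n_u, n_u))   # placeholder for d(theta_1)/d(u')
upper_right = rng.standard_normal((n_u, n_y))  # placeholder for d(theta_1)/d(y')

# Convergence matrix with zero lower blocks, as in the proof.
M = np.block([[upper_left, upper_right],
              [np.zeros((n_y, n_u)), np.zeros((n_y, n_y))]])

rho_full = spectral_radius(M)
rho_block = spectral_radius(upper_left)
print(rho_full, rho_block)
```

Because $M$ is block upper triangular, its eigenvalues are those of the upper left block together with zeros, so the upper right block has no influence on the criterion.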




Fig. 4 Low-pass filter as an electrical circuit incorporating an EM device

Fig. 5 Solving the low-pass filter in Fig. 4 using (68)–(70). The plot shows the error of the circuit and a few field variables in terms of Gauss–Seidel iteration count

5.3 Benchmark

In this section we provide numerical results for the waveform relaxation method of Gauss–Seidel type introduced in Sect. 5. The waveform relaxation is realized in a self-developed flow network DAE framework [42] that is implemented in Python and interfaces devEM. Our convergence analysis applies to system (73)–(74), but the function $c_{MNA3}$ is not accessible from devEM. Therefore, we present the simulation results only for systems (68)–(70) and (71)–(72).

Fig. 6 Solving the low-pass filter in Fig. 4 using (71)–(72). The plot shows the error of the circuit and a few field variables in terms of Gauss–Seidel iteration count

Benchmark parameters. As a coupled electric circuit and EM device system, we chose the low-pass filter in Fig. 4 operating at very high frequencies. The circuit parameters are $v_{s,1}(t) = 2.5 \sin(2\pi \cdot 10^{9}\, t)$ and $v_{s,2}(t) = 2.5 \sin(2\pi \cdot 10^{10}\, t)$ for the voltage sources and $q_C = 1/(2\pi \cdot 0.5 \cdot 10^{7} \cdot 0.03003)$ for the capacitor. The EM device is a 3 × 3 × 9 micron aluminum bar ($\sigma = 0.333 \cdot 10^{8}$) in oxide with contacts at the opposing small facets and the complementary boundary, attached to $e_2$, $e_3$ and the circuit’s ground, respectively.

Solver settings. During every solving process we use BDF of first order with constant time steps of $10^{-11}$ seconds on a total time interval $I = [0, 1.2 \cdot 10^{-9}]$, in seconds. The initial values $u_0$ and $x_0$ are set to zero, as are the initial guesses $x^{[0]}$ and $\frac{d}{dt}x^{[0]}$ at each discrete time point. Time windowing is not used here, i.e. time integration is realized over the whole time interval.

Systems settings. The low-pass filter is solved using the original formulation (61)–(62) and the incorporated split formulation (64)–(65). Formulations of the alternative type (66)–(67) are not accessible to us at the moment. For both systems we apply the Gauss–Seidel method, starting with the ODE subsystem, and compare each iteration’s solution to the resulting monolithic one. As an error measure we take, for each component, the maximal relative error over the whole time interval $I$.

Fig. 7 Solving the low-pass filter in Fig. 4 using (68)–(70). The plot shows the relative error in time of the iterative solutions at the circuit node $e_3$

Convergence results. For the system formulation (68)–(70) we obtain convergence to machine precision after approximately 8–9 iterations, as shown in Fig. 5. As expected, similar results are obtained for the system (71)–(72) and presented in Fig. 6. Additionally, the error over the time interval $I$ of the two exemplary components $e_3$ and $i_{EM1}$ can be tracked in Figs. 7 and 8, respectively. When using (71)–(72), the error plot for $e_3$ does not change, except for minimal changes around machine precision, whereas no error plot exists for $i_{EM1}$, since $i_{EM1}$ has been substituted.
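The solver settings and the error measure can be sketched in a few lines. The time grid below follows the stated step size and interval; the scaling of the relative error is an assumption, since the text does not say whether the error is scaled pointwise or by the largest magnitude of the monolithic component (the latter is used here, which avoids division by the zero initial values). The "monolithic" and "iterate" waveforms are stand-ins built from the two source voltages, not solver output.

```python
import numpy as np


def max_relative_error(iterate, reference):
    """Per-component error: max over time of |iterate - reference|, scaled by
    the largest magnitude of the reference component (an assumed scaling)."""
    scale = np.max(np.abs(reference), axis=0)
    return np.max(np.abs(iterate - reference), axis=0) / scale


h = 1e-11        # constant BDF1 step from the solver settings
t_end = 1.2e-9   # end of the time interval I
n_steps = int(round(t_end / h))
t = np.linspace(0.0, t_end, n_steps + 1)

# Stand-in "monolithic" waveforms: the two source voltages of the benchmark.
reference = np.column_stack([2.5 * np.sin(2 * np.pi * 1e9 * t),
                             2.5 * np.sin(2 * np.pi * 1e10 * t)])
# Stand-in "iterate": the reference with an artificial 0.1 % perturbation.
iterate = reference * (1.0 + 1e-3)

errors = max_relative_error(iterate, reference)
print(n_steps, errors)
```

Evaluating this measure after every Gauss–Seidel sweep gives one error value per component and iteration, which is the kind of data plotted against the iteration count.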

6 Conclusion and Outlook

Waveform relaxation of Gauss–Seidel type has been studied for three variants of coupling lumped circuit equations to full Maxwell equations describing EM devices. Convergence is proved for couplings of the form (73)–(74). A general convergence criterion is given in Theorem 3. It exploits the results of [36] and extends them in two respects. First, circuits with CV-loops are allowed without restriction. Second, the coupling function $c_{EM}$ may depend on the time derivative of circuit variables.

Fig. 8 Solving the low-pass filter in Fig. 4 using (68)–(70). The plot shows the relative error of the node potential e3 in comparison to the monolithic solution in each Gauss–Seidel iteration

The coupled electric circuit and EM device problem is simulated by making use of system formulation (61)–(62), as the solver PyCEM, used in [37], interfaces the industrial EM solver devEM. Tests have shown that incorporating the EM device’s branch current into KCL of the circuit equations, as done in (71)–(72), leads to similar convergence results. One has to be aware that the system (73)–(74) and the system (61)–(62) are no longer equivalent to each other when applying iteration schemes. It remains to find general convergence criteria also for coupling formulations of the form (71)–(72). On the other hand, the convergence results for the form (73)–(74) motivate adapting the (commercial) tool’s interfaces in order to cover more than just system formulations (61)–(62).

Since the waveform relaxation results show sufficiently small errors after a few iterations, we recommend using it if efficient solvers are available for the subsystems, the weak topological Assumption 5 is satisfied and a monolithic formulation of the coupled system is difficult to provide.

Acknowledgements We would like to thank the reviewers for their valuable comments and comprehensive efforts towards improving our manuscript. Further, we acknowledge financial support under BMWi (German Federal Ministry for Economic Affairs and Energy) grant 0324019E.

References

1. Alì, G., Bartel, A., Brunk, M., Schöps, S.: A convergent iteration scheme for semiconductor/circuit coupled problems. In: Michielsen, B., Poirier, J. (eds.) Scientific Computing in Electrical Engineering SCEE 2010. Mathematics in Industry, vol. 16, pp. 104–111. Springer, Berlin (2012)
2. Alonso Rodríguez, A., Raffetto, M.: Unique solvability for electromagnetic boundary value problems in the presence of partly lossy inhomogeneous anisotropic media and mixed boundary conditions. Math. Models Methods Appl. Sci. 13(04), 597–611 (2003)
3. Alonso Rodríguez, A., Valli, A.: Eddy Current Approximation of Maxwell Equations. Springer Milan, Berlin (2010). https://doi.org/10.1007/978-88-470-1506-7
4. Arnold, M., Clauss, C., Schierz, T.: Error analysis and error estimates for co-simulation in FMI for model exchange and co-simulation v2.0. In: Progress in Differential-Algebraic Equations, pp. 107–125. Springer, Berlin (2014)
5. Arnold, M., Günther, M.: Preconditioned dynamic iteration for coupled differential-algebraic systems. BIT Numer. Math. 41(1), 1–25 (2001). https://doi.org/10.1023/A:1021909032551
6. Bartel, A., Brunk, M., Günther, M., Schöps, S.: Dynamic iteration for coupled problems of electric circuits and distributed devices. SIAM J. Sci. Comput. 35(2), B315–B335 (2013). https://doi.org/10.1137/120867111
7. Bartel, A., Günther, M.: Multirate co-simulation of first order thermal models in electric circuit design. In: Schilders, W.H.A., ter Maten, E.J.W., Houben, S.H.M.J. (eds.) Scientific Computing in Electrical Engineering, pp. 104–111. Springer, Berlin (2004)
8. Baumanns, S.: Coupled Electromagnetic Field/Circuit Simulation. Modeling and Numerical Analysis. Logos Verlag, Berlin (2012)
9. Biro, O., Preis, K.: On the use of the magnetic vector potential in the finite-element analysis of three-dimensional eddy currents. IEEE Trans. Magn. 25(4), 3145–3159 (1989)
10. Bossavit, A.: Computational Electromagnetism: Variational Formulations, Complementarity, Edge Elements. Academic, Cambridge (1998)
11. Burrage, K.: Parallel and Sequential Methods for Ordinary Differential Equations. Clarendon Press, New York (1995)
12. Chua, L., Lin, P.: Computer-Aided Analysis of Electronic Circuits: Algorithms and Computational Techniques (Prentice-Hall Series in Electrical Computer Engineering). Prentice-Hall, Englewood Cliffs (1975)
13. Clemens, M.: Large systems of equations in a discrete electromagnetism: formulations and numerical algorithms. IEE Proc.-Sci., Meas. Technol. 152(2), 50–72 (2005). https://doi.org/10.1049/ip-smt:20050849
14. Clemens, M., Weiland, T.: Regularization of eddy-current formulations using discrete grad-div operators. IEEE Trans. Magn. 38(2), 569–572 (2002)
15. Desoer, C.A., Kuh, E.S.: Basic Circuit Theory. McGraw-Hill, New York (1969)
16. Ebert, F.: On partitioned simulation of electrical circuits using dynamic iteration methods. Ph.D. thesis, TU Berlin (2008)
17. Eller, M., Reitzinger, S., Schöps, S., Zaglmayr, S.: A symmetric low-frequency stable broadband Maxwell formulation for industrial applications. SIAM J. Sci. Comput. 39(4), B703–B731 (2017)
18. Estévez Schwarz, D., Tischendorf, C.: Structural analysis of electric circuits and consequences for MNA. Int. J. Circuit Theory Appl. 28(2), 131–162 (2000). https://doi.org/10.1002/(SICI)1097-007X(200003/04)28:23.0.CO;2-W
19. Gander, M., Stuart, A.: Space-time continuous analysis of waveform relaxation for the heat equation. SIAM J. Sci. Comput. 19(6), 2014–2031 (1998). https://doi.org/10.1137/S1064827596305337
20. Gander, M.J., Ruehli, A.E.: Optimized waveform relaxation methods for rc type circuits. IEEE Trans. Circuits Syst. I: Regul. Pap. 51(4), 755–768 (2004). https://doi.org/10.1109/TCSI.2004.826193
21. Garcia, I.C., Schöps, S., Maciejewski, M., Bortot, L., Prioli, M., Auchmann, B., Verweij, A.: Optimized field/circuit coupling for the simulation of quenches in superconducting magnets. IEEE J. Multiscale Multiphys. Comput. Tech. 2, 97–104 (2017). https://doi.org/10.1109/JMMCT.2017.2710128
22. Haber, E., Ascher, U.M.: Fast finite volume simulation of 3d electromagnetic problems with highly discontinuous coefficients. SIAM J. Sci. Comput. 22(6), 1943–1961 (2001)
23. Voloshin, V.I.: Introduction to Graph and Hypergraph Theory. NOVA Science Publishers (2009)
24. Jackson, J.D.: Classical Electrodynamics, 3rd edn. Wiley, New Jersey (1999)
25. Jansen, L.: A dissection concept for DAEs: Structural decoupling, unique solvability, convergence theory and half-explicit methods. Ph.D. thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät I (2015). https://doi.org/10.18452/17166
26. Jansen, L., Matthes, M., Tischendorf, C.: Global unique solvability for memristive circuit DAEs of index 1. Int. J. Circuit Theory Appl. 43(1), 73–93 (2015)
27. Kunkel, P., Mehrmann, V.: Differential-Algebraic Equations. European Mathematical Publishing House (2006). https://doi.org/10.4171/017
28. Lamour, R., März, R., Tischendorf, C.: Differential-Algebraic Equations: A Projector Based Analysis. Springer, Heidelberg (2013)
29. Lelarasmee, E.: The Waveform Relaxation Method for Time Domain Analysis of Large Scale Integrated Circuits: Theory and Applications. Electronics Research Laboratory, College of Engineering, University of California (1982)
30. Lelarasmee, E., Ruehli, A.E., Sangiovanni-Vincentelli, A.L.: The waveform relaxation method for time-domain analysis of large scale integrated circuits. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 1(3), 131–145 (1982). https://doi.org/10.1109/TCAD.1982.1270004
31. MAGWEL N.V.: Device-Electro-Magnetic Modeler (DevEM™) (2016). http://www.magwel.com
32. Maxwell, J.C.: A dynamical theory of the electromagnetic field. Philos. Trans. R. Soc. 155, 459–512 (1865). https://doi.org/10.1017/cbo9780511698095.028
33. Merkel, M., Niyonzima, I., Schöps, S.: Paraexp using leapfrog as integrator for high-frequency electromagnetic simulations. Radio Sci. (2017)
34. Miekkala, U., Nevanlinna, O.: Convergence of dynamic iteration methods for initial value problems. SIAM J. Sci. Stat. Comput. 8(4), 459–482 (1987). https://doi.org/10.1137/0908046
35. Nolting, W.: Grundkurs Theoretische Physik 3. Springer, Berlin (2011). https://doi.org/10.1007/978-3-642-13449-4
36. Pade, J., Tischendorf, C.: Waveform relaxation: a convergence criterion for differential-algebraic equations. Numer. Algorithms (2018). https://doi.org/10.1007/s11075-018-06455
37. Schoenmaker, W., Meuris, P., Strohm, C., Tischendorf, C.: Holistic coupled field and circuit simulation. In: Proceedings of the 2016 Conference on Design, Automation and Test in Europe, pp. 307–312. EDA Consortium (2016)
38. Schöps, S.: Multiscale modeling and multirate time-integration of field/circuit coupled problems. Ph.D. thesis, Universität Wuppertal, Fakultät für Mathematik und Naturwissenschaften Mathematik und Informatik Dissertationen (2011)
39. Schöps, S., Bartel, A., De Gersem, H., Günther, M.: DAE-index and convergence analysis of lumped electric circuits refined by 3-d magnetoquasistatic conductor models. In: Roos, J., Costa, L.R. (eds.) Scientific Computing in Electrical Engineering SCEE 2008, pp. 341–348. Springer, Berlin (2010)
40. Schöps, S., De Gersem, H., Bartel, A.: A cosimulation framework for multirate time integration of field/circuit coupled problems. IEEE Trans. Magn. 46(8), 3233–3236 (2010). https://doi.org/10.1109/TMAG.2010.2045156
41. Stratton, J.A.: Electromagnetic Theory. Wiley-Blackwell, New Jersey (2007). https://doi.org/10.1002/9781119134640
42. Streubel, T., Strohm, C., Trunschke, P., Tischendorf, C.: Generic construction and efficient evaluation of flow network DAEs and their derivatives in the context of gas networks. In: Operations Research Proceedings 2017, pp. 627–632. Springer, Berlin (2018)
43. Strohm, C., Tischendorf, C.: Interface model integrating full-wave Maxwell simulation models into modified nodal equations for circuit simulation. IFAC-PapersOnLine 48(1), 940–941 (2015)
44. Thoma, P.: Zur numerischen Lösung der Maxwellschen Gleichungen im Zeitbereich. Dissertation D17 TH Darmstadt (1997)
45. Weiland, T.: A discretization method for the solution of Maxwell’s equations for six-component fields. AEÜ - Int. J. Electron. Commun. 31, 120–166 (1977)

SCIP-Jack: An Exact High Performance Solver for Steiner Tree Problems in Graphs and Related Problems

Daniel Rehfeldt, Yuji Shinano, and Thorsten Koch

Abstract The Steiner tree problem in graphs is one of the classic combinatorial optimization problems. Furthermore, many related problems, such as the rectilinear Steiner tree problem or the maximum-weight connected subgraph problem, have been described in the literature, with a wide range of practical applications. To embrace this wealth of problem classes, the solver SCIP-Jack has been developed as an exact framework for classic Steiner tree and 11 related problems. Moreover, the solver comes with both shared- and distributed-memory extensions by means of the UG framework. Besides its versatility, SCIP-Jack is highly competitive for most of the 12 problem classes it can solve, as for instance demonstrated by its top ranking in the recent PACE 2018 Challenge. This article describes the current state of SCIP-Jack and provides up-to-date computational results, including several instances that can now be solved for the first time to optimality.

D. Rehfeldt (B) · T. Koch
TU Berlin, Str. des 17. Juni 135, Berlin 10623, Germany
Y. Shinano · T. Koch
Zuse Institute Berlin, Takustr. 7, Berlin 14195, Germany
© Springer Nature Switzerland AG 2021 H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_10

1 Introduction

The Steiner tree problem in graphs (SPG) is one of the classic NP-hard optimization problems [1]. Given an undirected connected graph $G = (V, E)$, costs $c : E \to \mathbb{Q}_+$ and a set $T \subseteq V$ of terminals, the problem is to find a tree $S \subseteq G$ of minimum cost that includes $T$ (Fig. 1).

(a) An SPG instance (b) A feasible solution (Steiner tree)

Fig. 1 Illustration of a Steiner tree problem in a graph (left) and a possible solution (right). Terminals are drawn as squares, Steiner nodes as circles

The SPG is commonly said to have a variety of practical applications. However, applications that involve solving the pure SPG are not often encountered in practice (although they surely exist, for instance in the design of fiber-optic networks [2]). The lack of real-world applications for the pure SPG is highlighted by the fact that, of the thousands of instances collected by the authors in the SteinLib [3], fewer than a fifth can claim practical origins, and even of those most are more suitably formulated as a rectilinear Steiner minimum tree problem. However, there are many practical applications that include problems closely related to the SPG. Therefore, the solver SCIP-Jack has been developed, which is able to solve not only the SPG, but also 11 related problems.

The 2014 DIMACS Challenge, dedicated to Steiner tree problems, marked a revival of research on the SPG and related problems: both at and in the wake of the Challenge several new Steiner problem solvers were introduced and many articles were published. One of these new solvers is SCIP-Jack, which was by far the most versatile solver participating in the DIMACS Challenge, being able to solve the SPG and 10 related problems (in the current version one more problem class can be handled). Moreover, SCIP-Jack was able to win two parallel and two sequential categories of the DIMACS Challenge. Other solvers that successfully participated in the DIMACS Challenge are described in [4, 5]; indeed, the exact and heuristic solvers described in [4] were able to win the majority of categories at the Challenge. SCIP-Jack is described in detail in the article [6], but already in an updated version that vastly outperforms its predecessor participating in the DIMACS Challenge. The current version of SCIP-Jack described in this article again considerably improves on the state reported in [6].

In the following, recent developments for both the SPG and the related problems are reported; however, two problem classes, the maximum-weight connected subgraph problem (in Sect. 3) and the hop-constrained directed Steiner tree problem (in Sect. 4), will be described in most detail, to provide some representative insight into the solving approach of SCIP-Jack for different problem classes. This article also reports on several instances that have been solved for the first time to optimality. Moreover, it provides results on new SPG instances from the recent PACE 2018 Challenge [7], where SCIP-Jack successfully competed. SCIP-Jack is freely available for academic use as part of the SCIP Optimization Suite [8].


1.1 Notation and Preliminaries

To denote the vertices and edges of a specific graph $G$ we will write $V(G)$ and $E(G)$, respectively. In contrast, for a subset of vertices $W \subseteq V$ we define $E[W] := \{\{v_i, v_j\} \in E \mid v_i, v_j \in W\}$. Further, the notation $n := |V|$ and $m := |E|$ will be used. For all problem classes that contain terminals $T$, define $s := |T|$. For prize-collecting Steiner tree and maximum-weight connected subgraph problems (see Sects. 5.1 and 3) we set $T := \{v \in V \mid p(v) > 0\}$. For the sake of simplicity we usually write $V := \{v_1, \dots, v_n\}$ as well as $T := \{t_1, \dots, t_s\}$. Define $\mathbb{Q}_+ := \{x \in \mathbb{Q} \mid x \geq 0\}$. For any function $x : M \to \mathbb{Q}$ with $M$ finite, and any $M' \subseteq M$, define $x(M') := \sum_{i \in M'} x(i)$. Additionally, for $W \subseteq V$ define $\delta(W) := \{\{u, v\} \in E \mid u \in W, v \in V \setminus W\}$ and for a subgraph $G' \subseteq G$ and $W' \subseteq V(G')$ define $\delta_{G'}(W') := \{\{u, v\} \in E(G') \mid u \in W', v \in V(G') \setminus W'\}$. A corresponding notation is used for directed graphs $(V, A)$: for $W \subseteq V$ define $\delta^+(W) := \{(u, v) \in A \mid u \in W, v \in V \setminus W\}$ and $\delta^-(W) := \delta^+(V \setminus W)$.

All non-parallel computational experiments described in the following were performed on a cluster of Intel Xeon X5672 CPUs with 3.20 GHz and 48 GB RAM. A development version of SCIP 6.0 [8] and SCIP-Jack 1.3 was used, and CPLEX 12.7.1¹ was employed as the underlying LP solver. Moreover, the overall run time for each instance was limited to two hours. If an instance could not be solved to optimality within the time limit, the gap is reported, which is defined as $\frac{|pb - db|}{\max\{|pb|, |db|\}}$ for final primal and dual bound $pb$ and $db$, respectively. The average gap is obtained as the arithmetic mean, while averages of the number of nodes and the solving time are computed by taking the shifted geometric mean [9] with a shift of 10 and 1, respectively.
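The two summary statistics used in the experiments can be written down directly. The sketch below uses the gap as defined in the text and the usual definition of the shifted geometric mean, $\big(\prod_i (v_i + s)\big)^{1/n} - s$; the function names and the sample values are illustrative only and are not results from the article.

```python
import math


def gap(pb, db):
    """Relative gap |pb - db| / max(|pb|, |db|) between primal and dual bound."""
    denom = max(abs(pb), abs(db))
    return 0.0 if denom == 0 else abs(pb - db) / denom


def shifted_geometric_mean(values, shift):
    """Shifted geometric mean: (prod(v + shift))**(1/n) - shift."""
    log_sum = sum(math.log(v + shift) for v in values)
    return math.exp(log_sum / len(values)) - shift


# Made-up sample data for illustration.
example_gap = gap(110.0, 100.0)                           # 10 / 110
mean_time = shifted_geometric_mean([0.0, 3.0], shift=1)   # sqrt(1 * 4) - 1
mean_nodes = shifted_geometric_mean([1, 50, 2000], shift=10)
print(example_gap, mean_time, mean_nodes)
```

Compared with the arithmetic mean, the shift dampens the influence of very small values while the geometric mean limits the influence of a few very large ones.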

1.2 SCIP-Jack: A High Level View

SCIP-Jack includes a wide range of generic and problem-specific algorithmic components, most of them falling into one of the following three categories.

First, reduction techniques are extremely important (both in presolving and domain propagation). Apart from some instances either specifically constructed or insightfully handpicked to defy reduction techniques, such as the PUC [10] and I640 [11] test sets, preprocessing is usually able to significantly reduce instances. Results presented in the PhD theses of Polzin [12] and Daneshmand [13] report an average reduction in the number of edges of 78%, with many instances being solved completely by presolving. The reduction rates of SCIP-Jack are even stronger on some problem classes; e.g., on average more than 90% of edges can be deleted for the maximum-weight connected subgraph problem (described in Sect. 3).

Second, heuristics are essential to find good or even optimal solutions and help find strong upper and lower bounds quickly. Having a strong primal bound available is a prerequisite for the reduced cost based domain propagation routines in SCIP-Jack. Furthermore, heuristics can be especially important for hard instances, for which the dual bound often stays substantially below the optimum for a long time. Most heuristics implemented in SCIP-Jack can be used for several problem classes, but there are also problem-specific ones, e.g., for the maximum-weight connected subgraph problem [14].

Finally, the core of SCIP-Jack is constituted by graph transformations and a branch-and-cut procedure used to compute lower bounds and prove optimality. SCIP-Jack transforms all problem classes to the Steiner arborescence problem (sometimes with additional constraints), which is defined as follows. Given a directed graph $D = (V, A)$, costs $c : A \to \mathbb{Q}_+$, a set $T \subseteq V$ of terminals, and a root $r \in T$, a directed tree $S = (V(S), A(S)) \subseteq D$ is required that first, for all $t \in T$, contains exactly one directed path from $r$ to $t$ and second, minimizes

$$\sum_{a \in A(S)} c(a).$$

¹ http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

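Thereupon, one can use the following formulation:

Formulation 1 (Flow Balance Directed Cut Formulation)

\begin{align}
\min\ & c^T y \tag{1}\\
& y(\delta^+(W)) \geq 1 && \text{for all } W \subset V,\ r \in W,\ (V \setminus W) \cap T \neq \emptyset \tag{2}\\
& y(\delta^-(v)) \leq y(\delta^+(v)) && \text{for all } v \in V \setminus T \tag{3}\\
& y(\delta^-(v)) \geq y(a) && \text{for all } a \in \delta^+(v),\ v \in V \setminus T \tag{4}\\
& y(a) \in \{0, 1\} && \text{for all } a \in A \tag{5}
\end{align}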

Only constraints (2) and (5) are necessary for the validity of the IP formulation, but (3) can improve the LP-relaxation [15] and (4) (while not changing the optimal value of the LP-relaxation [12]) can often speed up the solving process when a branch-and-cut approach is used. Both theoretically [15] and practically [12] the LP-relaxation of Formulation 1 has been shown to be superior to other (in particular undirected) MIP formulations. After presolving, SCIP-Jack runs a dual-ascent heuristic [16] to select a set of constraints from (2) to be included in the initial LP (and to find a feasible solution [6]). Subsequently, the LP is solved and a separator routine [17] based on a maximum-flow algorithm is used to find violated constraints. The violated constraints are added to the LP and the procedure is reiterated as long as the dual bound can be sufficiently improved. Otherwise branching is initiated. During branch-and-cut, domain propagation and several (constructive and local) primal heuristics are applied

SCIP-Jack: An Exact High Performance Solver for Steiner Tree …


Table 1 SCIP-Jack can solve the SPG and 11 related problems

Abbreviation   Problem name
SPG            Steiner tree problem in graphs
SAP            Steiner arborescence problem
RSMT           Rectilinear Steiner minimum tree problem
OARSMT         Obstacle-avoiding rectilinear Steiner minimum tree problem
NWSTP          Node-weighted Steiner tree problem
PCSTP          Prize-collecting Steiner tree problem
RPCSTP         Rooted prize-collecting Steiner tree problem
MWCSP          Maximum-weight connected subgraph problem
RMWCSP         Rooted maximum-weight connected subgraph problem
DCSTP          Degree-constrained Steiner tree problem
GSTP           Group Steiner tree problem
HCDSTP         Hop-constrained directed Steiner tree problem

to speed up the solution process. All problem classes that can be solved by SCIP-Jack are listed in Table 1.
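The separation of constraints (2) described above can be illustrated with a small sketch. For a fractional LP solution y, a minimum cut between the root and each terminal is computed with y as arc capacities; if the cut value is below 1, the root side W of the cut yields a violated constraint. The sketch below uses a plain Edmonds-Karp maximum-flow routine, not the specialized algorithm of [17], and the function names `min_cut_root_side` and `separate_cuts` as well as the tolerances are assumptions made for illustration only.

```python
from collections import deque

def min_cut_root_side(n, arcs, y, r, t, eps=1e-9):
    """Edmonds-Karp max-flow from r to t with capacities y.
    Returns (flow value, set W of vertices on the root side of a minimum cut)."""
    cap = [dict() for _ in range(n)]          # residual capacities
    for (u, v), val in zip(arcs, y):
        cap[u][v] = cap[u].get(v, 0.0) + val
        cap[v].setdefault(u, 0.0)             # reverse residual arc
    flow = 0.0
    while True:
        parent = {r: None}                    # BFS tree in the residual graph
        queue = deque([r])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > eps and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:                   # no augmenting path left
            return flow, set(parent)
        bottleneck, v = float("inf"), t
        while parent[v] is not None:          # find bottleneck capacity
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:          # augment along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

def separate_cuts(n, arcs, y, root, terminals, tol=1e-6):
    """Return vertex sets W whose directed cut constraint (2) is violated by y."""
    cuts = []
    for t in terminals:
        if t == root:
            continue
        value, W = min_cut_root_side(n, arcs, y, root, t)
        if value < 1.0 - tol:
            cuts.append(W)
    return cuts
```

For the four-vertex example with arcs (0,1), (1,3), (0,2), (2,3), root 0, terminal 3 and y = (0.5, 0.5, 0.25, 0.25), the maximum flow is 0.75, so the cut with W = {0} is reported as violated.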

2 Steiner Tree Problems in Graphs

The Steiner tree problem in graphs is arguably the most famous of all problems discussed in this article. The literature contains a vast number of research articles on theoretical aspects of the SPG, such as complexity, polyhedral studies, or approximation algorithms. A large number of articles have also focused on practical solving. By far the most successful solver for the SPG has been developed in the PhD theses of Polzin [12] and Daneshmand [13] (with a recent computational update in [18]), who notably extended previous results, for instance from the PhD thesis of Duin [11]. Their solver is based on a wide range of reduction techniques, but also includes a number of other methods such as cutting and pricing algorithms. At the time of publishing, the solver was several orders of magnitude faster than any competitor, and could also solve a number of instances to optimality for the first time. It is also faster than SCIP-Jack on most instances (usually by a factor of 5 or more), except for very hard ones (such as the PUC instances described later), for which SCIP-Jack yields better results. However, unlike SCIP-Jack, the solver is not publicly available.


2.1 The PACE Challenge

The Parameterized Algorithms and Computational Experiments Challenge (PACE) was initiated to investigate the practical performance of multivariate, fine-grained, parameterized, or fixed-parameter tractable algorithms. It is sponsored by the University of Amsterdam, Eindhoven University of Technology, Leiden University, and the Center for Mathematics and Computer Science (CWI). The latest challenge, PACE 2018 [7], was concerned with fixed-parameter tractable (FPT) algorithms for the SPG: for SPG instances with a fixed number of terminals or with a fixed treewidth, polynomial algorithms are known. The challenge included three tracks, each with 100 instances and a time limit of 30 min per instance. Overall, there were more than 100 submissions to PACE 2018. Although SCIP-Jack does not include any FPT algorithms, the authors decided to submit it to all three tracks of PACE 2018. Since no commercial solvers were allowed, SoPlex 4.0 was used as LP solver. In Track A (exact solution of instances with few terminals) SCIP-Jack reached 3rd place,² in Track B (exact solution of instances with small treewidth) SCIP-Jack reached 1st place, and in Track C (heuristic, instances with different FPT characteristics) SCIP-Jack reached 2nd place.³ While the actual instances used for the challenge have not been made available yet, for each track a training test set of 100 instances was published (with the results on these test sets being almost identical to the ones of the actual challenge). For these instances of the exact tracks A and B we report results of running SCIP-Jack with SoPlex 4.0 (the configuration used in the actual challenge) and with CPLEX 12.7.1 as LP solver in Table 2. Note that the computational environment used for this article is different from the one at the PACE Challenge; the run times suggest that the environment used here is around 10% faster. The available memory was limited to 6 GB and a time limit of 30 min was set, as in the PACE Challenge. The average time is reported as the arithmetic mean, since that was the secondary criterion at the PACE Challenge in case of a tie in the number of solved instances. While SCIP-Jack/SoPlex can solve more than 90% of all instances in both tracks within the time limit, it is substantially outperformed by SCIP-Jack/CPLEX, which solves 199 of the 200 instances in half an hour. Notably, SCIP-Jack/CPLEX solves all 100 instances of Track A to optimality, while the winning solver in this track solves 95.

2 Winning team Track A: Yoichi Iwata, Takuto Shigemura (NII, Japan).
3 Winning team Track C: Emmanuel Romero Ruiz, Emmanuel Antonio Cuevas, Irwin Enrique Villalobos López, Carlos Segura González (CIMAT, Mexico).

Table 2 Computational results for PACE 2018 instances

                       SCIP-Jack/CPLEX         SCIP-Jack/SoPlex        Best other
Track   # instances    Solved   ∅ time (s)     Solved   ∅ time (s)     Solved
A       100            100      65             94       101            95
B       100            99       108            92       120            77

2.2 Recent Developments and Further Results

As compared to [6], a number of algorithmic components have been added or improved. Perhaps the most important one is an initial implementation of extended reduction techniques [19]. These techniques try to prove that a subgraph G′ (usually a single edge or vertex) is not part of at least one optimal solution by considering a (sufficient) set of supergraphs of G′ and showing that all of them are not contained in at least one optimal solution. The realization of these techniques is quite intricate, but even the initial, and very restricted, implementation in SCIP-Jack makes it possible to delete about 5% more edges in general, which still falls far short of the results reported in [12].

Another important improvement is the use of reduction techniques in domain propagation. Other authors [12] already observed that complex reductions beyond the simple deletion of edges and vertices cannot be easily translated into a linear program during branch-and-cut. However, we observed that one can map the graph obtained from applying all reduction techniques back to a (strict) subgraph of the original problem. In this way it is possible to translate the reduction techniques into variable fixings and apply them during branch-and-cut without having to restart the computation.

Other, more technical, improvements include a more cache-efficient implementation of dual-ascent. Moreover, the maximum-flow algorithm needed for separating violated inequalities from Formulation 1 has been reimplemented and is on average more than an order of magnitude faster than the previous one. Furthermore, the data structures needed to map a solution to the reduced problem to a solution to the original problem have been reimplemented.

A summary of the computational performance of SCIP-Jack on the five SPG test sets is presented in Table 3. Each line in the table shows aggregated results for the test set specified in the first column. The second column, labeled #, lists the number of instances in the test set; the third column states how many of them were solved to optimality within the time limit. The average number of branch-and-bound nodes and the average running time in seconds of these instances are presented in the next two columns, named optimal. The last two columns, labeled timeout, show the average number of branch-and-bound nodes and the average gap for the remaining instances, i.e., all instances that hit the time limit. The first two test sets can be solved very quickly (with maximum run time of 25 s) and all but one instance can be solved at the root node (usually already during presolving). In contrast, for the I640 instances that could not be solved in two hours more than 700 branch-and-bound nodes are explored on average. Similarly, for the last and hardest test set PUC a considerable number of branch-and-bound nodes is required, even more for the instances that can be solved within the time limit.


Table 3 Computational results for SPG instances

                                  Optimal                    Timeout
Test set          #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
X                 3     3         1.0        0.2             –          –
E                 20    20        1.6        0.4             –          –
Vienna-i-simple   85    85        6.0        59.4            –          –
I640              100   82        14.3       5.1             748.1      0.7
PUC               50    13        1004.4     125.4           417.1      2.8

Notably, this is the first time that SCIP-Jack has been able to solve all instances from the vienna-i-simple test set to optimality within two hours. Also, the results on the (well-known and notoriously hard) PUC test set are better than the ones from Polzin and Daneshmand [18], both with respect to the number of solved instances and the average gap.

3 Maximum-Weight Connected Subgraph Problems

Given an undirected graph G = (V, E) and vertex weights p : V → Q, the maximum-weight connected subgraph problem (MWCSP) asks for a connected subgraph S = (V(S), E(S)) ⊆ G such that $\sum_{v \in V(S)} p(v)$ is maximized. Computational biology [20, 21] seems to be the predominant application field for the MWCSP, but one can also find the problem in different areas such as wildlife conservation [22] and computer vision [23]. Although already published in [24], the transformation from MWCSP to SAP is detailed in the following in order to exemplify the transformation procedures to SAP used for all problem classes in SCIP-Jack. The underlying idea is to treat the MWCSP vertices of positive weight as terminals in the SAP. However, since all terminals need to be part of any feasible SAP solution, several vertices (including a root) and arcs are added to the SAP that make it possible to model the exclusion of a positive vertex from a feasible solution.

Transformation 1 [MWCSP to SAP]
Input: An MWCSP $P = (V, E, p)$
Output: An SAP $P' = (V', A', T', c', r')$
1. Set $V' := V$, $A' := \{(v, w) \in V' \times V' \mid \{v, w\} \in E\}$.
2. Set $c' : A' \to \mathbb{Q}_+$ such that for $a = (v, w) \in A'$:
   $c'(a) = \begin{cases} -p(w), & \text{if } p(w) < 0 \\ 0, & \text{otherwise} \end{cases}$
3. Add two vertices $r'$ and $v_0$ to $V'$.
4. Denote the set of all $v \in V$ with $p(v) > 0$ by $T = \{t_1, \dots, t_s\}$ and define $M := \sum_{t \in T} p(t)$.
5. For each $i \in \{1, \dots, s\}$:
   a. Add an arc $(r', t_i)$ of weight $M$ to $A'$.
   b. Add a new node $t_i'$ to $V'$.
   c. Add arcs $(t_i, v_0)$ and $(t_i, t_i')$ to $A'$, both being of weight 0.
   d. Add an arc $(v_0, t_i')$ of weight $p(t_i)$ to $A'$.
6. Define the set of terminals $T' := \{t_1', \dots, t_s'\} \cup \{r'\}$.
7. Return $(V', A', T', c', r')$.

Fig. 2 Illustration of an MWCSP instance (left, panel a) and the equivalent SAP obtained by Transformation 1 (right, panel b). Terminals are drawn as squares and Steiner vertices as circles (with those of positive weight enlarged)

Proposition 1 [MWCSP to SAP] Let $P = (V, E, p)$ be an MWCSP and $P' = (V', A', T', c', r')$ an SAP obtained from $P$ by Transformation 1, with solution sets $\mathcal{S}$ and $\mathcal{S}'$ respectively. Thereupon, a function $H_{MW} : \mathcal{S}' \to \mathcal{S}$ exists such that for each solution $S' = (V'(S'), A'(S')) \in \mathcal{S}'$ that is optimal, $S := (V(S), E(S)) := H_{MW}(S')$ is an optimal solution to the original MWCSP $P$ with:

\begin{align}
V(S) &= \{v \in V : v \in V'(S')\}, \tag{6}\\
E(S) &= \{\{v, w\} \in E : (v, w) \in A'(S') \text{ or } (w, v) \in A'(S')\}. \tag{7}
\end{align}

Furthermore, if $S'$ is optimal, it holds that:

\begin{equation}
p(V(S)) = \sum_{v \in V : p(v) > 0} p(v) - c'(A'(S')) + M. \tag{8}
\end{equation}
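The steps of Transformation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mwcsp_to_sap`, the labels `"r'"` and `"v0"` for the two added vertices, and the tuples `("copy", t)` standing for the new nodes $t_i'$ are hypothetical and must not collide with the original vertex identifiers; the data layout is not the one used in SCIP-Jack.

```python
def mwcsp_to_sap(vertices, edges, p):
    """Sketch of Transformation 1.
    vertices: list of vertex ids, edges: list of pairs, p: dict of vertex weights.
    Returns (V', A', T', c', r') with c' given as a dict arc -> cost."""
    root, v0 = "r'", "v0"                     # step 3 (hypothetical labels)
    V = list(vertices) + [root, v0]
    cost = {}
    for v, w in edges:                        # steps 1 and 2: anti-parallel arcs,
        cost[(v, w)] = max(-p[w], 0)          # cost -p(head) if p(head) < 0, else 0
        cost[(w, v)] = max(-p[v], 0)
    positive = [v for v in vertices if p[v] > 0]   # step 4
    M = sum(p[t] for t in positive)
    terminals = [root]
    for t in positive:                        # step 5
        t_new = ("copy", t)                   # the new node t_i'
        V.append(t_new)
        cost[(root, t)] = M                   # 5a
        cost[(t, v0)] = 0                     # 5c
        cost[(t, t_new)] = 0                  # 5c
        cost[(v0, t_new)] = p[t]              # 5d
        terminals.append(t_new)               # step 6
    return V, list(cost), terminals, cost, root   # step 7
```

As a small check of (8): for the path 1, 2, 3 with weights 2, -1, 3, the optimal MWCSP solution is the whole path with weight 4. Here M = 5, an optimal SAP solution has cost 6 (the arc of weight M plus the cost 1 of entering vertex 2), and indeed 5 - 6 + 5 = 4.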


Transformation 1 is exemplified for a simple MWCSP instance in Fig. 2. Note that this transformation is not used in the branch-and-cut solving procedure, but rather for applying dual-ascent (both during presolving and propagation). An interesting observation utilized in SCIP-Jack is that one may use Transformation 1 to show that a vertex v ∈ T is part of at least one optimal solution. Consider the SAP instance P′ = (V′, A′, T′, c′, r′) obtained by applying Transformation 1 to an MWCSP P. Moreover, let $L_{DA}$ be the lower bound obtained by dual-ascent and U an upper bound for P′. If the reduced cost of an arc $(v_0, t_i')$ is higher than $U - L_{DA}$, this arc cannot be part of an optimal solution, so the terminal $t_i'$ has to be reached via the arc $(t_i, t_i')$; it can thus be deduced that the vertex $t_i$ is part of at least one optimal solution to P. If at least one (positive) vertex can be shown to be part of an optimal solution, the MWCSP P can be solved as a rooted maximum-weight connected subgraph problem. This problem incorporates the additional condition that a non-empty set of vertices R ⊆ V needs to be part of all feasible solutions [25]. The authors showed in [14] that using the rooted MWCSP formulation is both theoretically and practically stronger than using the unrooted MWCSP formulation. Moreover, as described in [26], one can use Transformation 1 and dual-ascent to eliminate edges and vertices of an MWCSP.

Besides the dual-ascent based reduction techniques mentioned above, SCIP-Jack includes a number of sophisticated reduction routines described in [14] and [26]. Several straightforward (but still powerful) reduction routines are applied as well; for instance, one observes that adjacent vertices $v_i, v_j \in V$ with $p(v_i) \geq 0$, $p(v_j) \geq 0$ can always be contracted, since if a connected subgraph contains $v_i$ or $v_j$, there is another connected subgraph of the same or higher weight that contains both $v_i$ and $v_j$. Notably, SCIP-Jack can reduce many large-scale MWCSP instances with hundreds of thousands of edges to a single-vertex MWCSP (thereby already solving them to optimality).

Finally, SCIP-Jack contains several primal heuristics for the MWCSP [14]. These heuristics strongly interact with reduction techniques: several reduction techniques, such as dual-ascent based ones, require a strong primal bound; on the other hand, primal heuristics that involve solving MWCSP subproblems, such as recombination heuristics, make heavy use of reduction techniques. Similarly, strong primal bounds provided by the heuristics implemented in SCIP-Jack are indispensable for proving that an MWCSP is in fact a rooted MWCSP.

Computational results on seven test sets are provided in Table 4. The SHINY test set has been introduced in [21] and contains instances from computational biology. The ACTMOD instances also stem from computational biology and were introduced at the 11th DIMACS Challenge [27]. The remaining test sets were all introduced at the 11th DIMACS Challenge and originally formulated as prize-collecting Steiner tree problems. However, their instances have uniform edge weights and can therefore be transformed to MWCSP. Results on additional test sets can be found in [14]. On all test sets described here, as well as those in [14] (which comprise all publicly available MWCSP instances the authors are aware of), SCIP-Jack yields stronger results than other approaches from the literature. In fact, on each instance that takes longer than 0.1 s to be solved SCIP-Jack is faster than any other solver. For instance, while none of the three solvers used in [21] could solve all SHINY instances to optimality, SCIP-Jack solves each instance in less than 0.1 s. The strongest MWCSP solver from the literature besides SCIP-Jack has recently been introduced in [28]. While this solver is much faster than previous approaches, SCIP-Jack consistently outperforms it, often by more than an order of magnitude. Also, SCIP-Jack solves several instances to optimality that are intractable for the solver from [28].

Table 4 Computational results for MWCSP instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
SHINY      39    39        1.0        0.0             –          –
ACTMOD     8     8         1.0        0.1             –          –
Handsd     10    10        1.0        0.4             –          –
Handsi     10    10        1.0        0.6             –          –
Handbd     14    13        1.0        8.1             1.0        0.1
Handbi     14    13        1.0        8.4             1.0        0.1
PUCNU      18    12        23.7       42.2            78.8       1.5
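The simple contraction test for adjacent vertices of non-negative weight mentioned in this section can be sketched as follows. The sketch merges vertices with a union-find structure and sums their weights; the function name `contract_nonnegative` and the dictionary-based graph representation are assumptions made for illustration and do not reflect the data structures of SCIP-Jack.

```python
def contract_nonnegative(vertices, edges, p):
    """Contract every edge whose two end vertices both have non-negative weight.
    Returns (weight per merged vertex, set of remaining edges as frozensets)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Representative of the merged vertex containing v (with path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, w in edges:
        if p[v] >= 0 and p[w] >= 0:
            parent[find(v)] = find(w)         # merge the two vertices

    weight = {}
    for v in vertices:                        # weight of a merged vertex is the sum
        rep = find(v)
        weight[rep] = weight.get(rep, 0) + p[v]

    new_edges = {frozenset((find(v), find(w)))
                 for v, w in edges if find(v) != find(w)}
    return weight, new_edges
```

Since a merged vertex again has non-negative weight, processing all edges once contracts every connected group of non-negative vertices into a single vertex.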

4 Hop-Constrained Steiner Tree Problems

The hop-constrained directed Steiner tree problem (HCDSTP) can be considered as an SAP with two additional conditions [29]. Let (V, A, T, c, r) be an SAP instance and let H ∈ N (called hop limit). A feasible solution S to the SAP is feasible for the HCDSTP if additionally:

1. $|A(S)| \leq H$,
2. $\delta_S^+(t) = \emptyset$ for all $t \in T \setminus \{r\}$.

The HCDSTP asks for a feasible solution S of minimum cost c(A(S)). Algorithms for solving the HCDSTP can for instance be found in [6, 29, 30]. The flow-balance cut formulation (Formulation 1) used by SCIP-Jack is easily extended to cover this variation by adding one extra linear inequality bounding the sum of all binary arc variables and removing all outgoing arcs from each terminal other than the root. Still, the hop limit has significant implications for the preprocessing and the heuristics: many of the reduction techniques remove edges from the graph or include them if a less costly alternative can be found, regardless of whether this alternative includes a greater number of edges and possibly violates the hop constraint. Nevertheless, empirically strong preprocessing techniques can be designed. Some examples will be given in the following. While most of these techniques have been implemented in SCIP-Jack since version 1.0 [31], they have not been described in the literature yet.


For two vertices $v_i, v_j \in V$ let $d(v_i, v_j)$ be the length of a shortest directed path from $v_i$ to $v_j$. For ease of presentation assume that $r = t_1$. For each $i = 2, \dots, s$ define $c^i_{\min} := \min_{a \in \delta^-(t_i)} c(a)$.

Proposition 2 Let $v_i \in V \setminus T$. If there is an optimal solution $S = (V(S), A(S))$ such that $v_i \in V(S)$, then

\begin{equation}
d(r, v_i) + \min_{t \in T \setminus \{r\}} d(v_i, t) + \sum_{q=2}^{s} c^q_{\min} - \max_{q=2,\dots,s} c^q_{\min} \tag{9}
\end{equation}

is a lower bound on the weight of $S$.

Proof Since $S$ is an arborescence rooted in $r$, it contains a path $P_1$ from $r$ to $v_i$, which is of cost at least $d(r, v_i)$. Additionally, at least one path $P_2$ from $v_i$ to a terminal $t_j$ other than the root is contained in $S$, since the latter has been assumed to be optimal. $P_2$ is of cost at least $\min_{t \in T \setminus \{r\}} d(v_i, t)$, and due to $S$ being cycle free, $P_1$ and $P_2$ are arc-disjoint. Since for all $t_k \in T \setminus \{r, t_j\}$ it holds that $\delta_S^+(t_k) = \emptyset$, none of their incoming arcs can be contained in $P_1$ or $P_2$. However, as all of these terminals are contained in $S$, for each terminal in $T \setminus \{r, t_j\}$ one of its incoming arcs has to be part of $S$ as well. Hence:

\begin{align*}
c(A(S)) &\geq c(A(P_1)) + c(A(P_2)) + \sum_{q=2,\, q \neq j}^{s} c^q_{\min} \\
&\geq d(r, v_i) + d(v_i, t_j) + \sum_{q=2,\, q \neq j}^{s} c^q_{\min} \\
&\geq d(r, v_i) + \min_{t \in T \setminus \{r\}} d(v_i, t) + \sum_{q=2}^{s} c^q_{\min} - \max_{q=2,\dots,s} c^q_{\min}.
\end{align*}

Consequently, the proposition is proven.

Corollary 1 Let $(v_i, v_j) \in A$ with $v_i \notin T$. If there is an optimal solution $S = (V(S), A(S))$ such that $(v_i, v_j) \in A(S)$, then

\begin{equation}
d(r, v_i) + c((v_i, v_j)) + \min_{t \in T \setminus \{r\}} d(v_j, t) + \sum_{q=2}^{s} c^q_{\min} - \max_{q=2,\dots,s} c^q_{\min} \tag{10}
\end{equation}

is a lower bound on the weight of $S$.

If for a vertex $v_i$ the best known primal bound is smaller than the bound (9) induced by $v_i$, the vertex can be deleted. Similarly, an arc can be deleted if the best known primal bound is smaller than (10). While the idiosyncrasy of the HCDSTP in the shape of the hop constraint impedes numerous reduction techniques, it concomitantly gives rise to tests of a novel type.
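The vertex test based on bound (9) can be sketched with two shortest-path computations: one from the root, and one on the reversed graph from all non-root terminals, which yields $\min_t d(v, t)$ for every vertex at once. The function names `dijkstra` and `bound_test_vertices`, the vertex numbering 0, ..., n-1, and the assumption that every non-root terminal has at least one incoming arc are choices of this sketch, not details taken from SCIP-Jack.

```python
import heapq

def dijkstra(n, adj, sources):
    """Distance to the nearest source; adj[u] is a list of (neighbor, cost) pairs."""
    dist = [float("inf")] * n
    for s in sources:
        dist[s] = 0
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # outdated heap entry
        for v, c in adj[u]:
            if d + c < dist[v]:
                dist[v] = d + c
                heapq.heappush(heap, (d + c, v))
    return dist

def bound_test_vertices(n, arcs, cost, root, terminals, primal_bound):
    """Return the non-terminal vertices whose bound (9) exceeds the primal bound."""
    fwd = [[] for _ in range(n)]
    bwd = [[] for _ in range(n)]
    for (u, v), c in zip(arcs, cost):
        fwd[u].append((v, c))
        bwd[v].append((u, c))                 # reversed arc
    others = [t for t in terminals if t != root]
    d_root = dijkstra(n, fwd, [root])         # d(r, v)
    d_term = dijkstra(n, bwd, others)         # min over t of d(v, t)
    c_min = [min(c for _, c in bwd[t]) for t in others]   # cheapest incoming arc
    slack = sum(c_min) - max(c_min)
    return [v for v in range(n)
            if v not in terminals and d_root[v] + d_term[v] + slack > primal_bound]
```

The arc test of Corollary 1 would follow the same pattern, with $d(r, v_i) + c((v_i, v_j)) + \min_t d(v_j, t)$ in place of the first two terms.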


In the remainder of this section, denote by $d_1$ the directed distance function with respect to unit arc costs. Consequently, $d_1(v_i, v_j)$ is equal to the minimum number of arcs necessary to reach vertex $v_j$ from vertex $v_i$.

Proposition 3 Let $v_i \in V \setminus T$. If

\begin{equation}
d_1(r, v_i) + \min_{t \in T \setminus \{r\}} d_1(v_i, t) + |T| - 2 > H \tag{11}
\end{equation}

is satisfied, $v_i$ cannot be contained in any optimal solution.

Proof Let $S = (V(S), A(S))$ be an optimal solution and suppose $v_i \in V(S)$. Analogously to the proof of Proposition 2 it can be verified that $S$ contains arc-disjoint paths $P_1$ from $r$ to $v_i$ and $P_2$ from $v_i$ to a $t_j \in T \setminus \{r\}$. By the definition of $d_1$, at least $d_1(r, v_i)$ arcs are contained in $P_1$ and at least $d_1(v_i, t_j)$ arcs are contained in $P_2$. Furthermore, both $P_1$ and $P_2$ contain exactly one terminal, since terminals other than the root have no outgoing arcs. Noting that for each terminal in $T \setminus \{r, t_j\}$ at least one incoming arc is included in $S$, one obtains with (11) that

\begin{align*}
|A(S)| &\geq d_1(r, v_i) + d_1(v_i, t_j) + |T| - 2 \\
&\geq d_1(r, v_i) + \min_{t \in T \setminus \{r\}} d_1(v_i, t) + |T| - 2 \\
&> H.
\end{align*}

Thus $S$ is not a feasible solution.
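The hop-based test of Proposition 3 only needs two breadth-first searches with unit arc lengths. The following sketch is a minimal illustration; the function names `bfs_hops` and `hop_test` and the vertex numbering 0, ..., n-1 are assumptions of this sketch.

```python
from collections import deque

def bfs_hops(n, adj, sources):
    """Minimum number of arcs from the nearest source; adj[u] lists the successors of u."""
    dist = [float("inf")] * n
    for s in sources:
        dist[s] = 0
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def hop_test(n, arcs, root, terminals, hop_limit):
    """Return the non-terminal vertices that satisfy condition (11) and can be deleted."""
    fwd = [[] for _ in range(n)]
    bwd = [[] for _ in range(n)]
    for u, v in arcs:
        fwd[u].append(v)
        bwd[v].append(u)                      # reversed arc
    others = [t for t in terminals if t != root]
    from_root = bfs_hops(n, fwd, [root])      # d_1(r, v)
    to_term = bfs_hops(n, bwd, others)        # min over t of d_1(v, t)
    return [v for v in range(n) if v not in terminals
            and from_root[v] + to_term[v] + len(terminals) - 2 > hop_limit]
```

Vertices that cannot be reached from the root, or from which no terminal can be reached, have infinite distance and are reported as deletable as well.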



Note that one can obtain more refined, and potentially more powerful, reduction techniques by using the terminal decomposition concept that has recently been introduced by the authors of this article, see [14]. Since these techniques have not been implemented in SCIP-Jack yet, no further details will be provided here. Importantly, all of the above reduction techniques require a primal bound. To this end, SCIP-Jack contains a simple constructive heuristic described in [6]. Implementing more refined primal heuristics, such as the ones described in [29], would likely improve the performance. Unlike previous versions, the current SCIP-Jack also applies dual-ascent based reductions, which can be used despite the additional constraint (as the corresponding dual variable can simply be set to 0). Computational results on a number of benchmark instances (with more than 600 000 arcs) from the 11th DIMACS Challenge are provided in Table 5. Notably, all instances can be solved to optimality, while the previous version of SCIP-Jack described in [6] left six instances unsolved. Moreover, experiments on the larger test set gr16 (with more than 8 000 000 arcs) were performed, but SCIP-Jack ran out of memory for most instances. However, the instances wo11-gr16-cr100-tr100-se10 and wo11-gr16-cr200-tr100-se3 could be solved to optimality for the first time, with optimal values of 121 234 and 54 163, which also notably improves on the previously best known bounds.


Table 5 Computational results for HCDSTP instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
gr12       19    19        1.0        1.3             –          –
gr14       21    21        6.6        256.0           –          –

5 Further Related Problems

In this section the remaining problem classes that can be solved by SCIP-Jack are briefly described and the latest computational results are reported. For more details on the individual problems and their handling by SCIP-Jack see [6].

5.1 Prize-Collecting Steiner Tree Problems

Given an undirected graph G = (V, E), edge weights c : E → Q+, and node weights p : V → Q+, the prize-collecting Steiner tree problem (PCSTP) is to find a tree S = (V(S), E(S)) ⊆ G such that

\begin{equation}
C(S) := \sum_{e \in E(S)} c(e) + \sum_{v \in V \setminus V(S)} p(v) \tag{12}
\end{equation}

is minimized. The rooted prize-collecting Steiner tree problem additionally requires a predefined vertex r ∈ V to be part of each feasible solution. The prize-collecting Steiner tree problem has been widely discussed in the literature [32–35], accompanied by several exact solving approaches [35–38]. Besides SCIP-Jack, computationally strong solvers are described in [4, 28]. Practical applications of the PCSTP range from the design of fiber optic networks [35] to computational biology [39]. The solving approach of SCIP-Jack is similar to that for the MWCSP, but involves several additional techniques. More details can be found in [40]. SCIP-Jack is the fastest solver for the PCSTP, outperforming the next best solver [28] on all common test sets from the literature. Furthermore, SCIP-Jack could solve several PCSTP instances from the 11th DIMACS Challenge to optimality for the first time [40]. See Table 6 for some results.

Table 6 Computational results for PCSTP instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
CRR        80    80        1.0        0.1             –          –
E          0     0         –          –               –          –
Cologne1   14    14        1.0        0.1             –          –
Cologne2   15    15        1.0        0.1             –          –

5.2 Node-Weighted Steiner Tree Problems

Given an undirected graph G = (V, E), node costs p : V → Q+, edge costs c : E → Q+ and a set T ⊆ V of terminals, the objective of the node-weighted Steiner tree problem (NWSTP) is to find a tree S = (V(S), E(S)) with T ⊆ V(S) that minimizes

$\sum_{e \in E(S)} c(e) + \sum_{v \in V(S)} p(v)$.

In SCIP-Jack the NWSTP is transformed to an SAP by substituting each edge by two anti-parallel arcs. Then, observing that in a tree there cannot be more than one arc going into the same vertex, the weight of each vertex is added to the weight of each of its incoming arcs. The current version of SCIP-Jack does not improve on the results reported in [6], so no computational experiments are reported here.
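The transformation just described can be sketched in a few lines. The function name `nwstp_to_sap` and the dictionary-based representation are assumptions of this sketch. It also relies on an observation that is not spelled out in the text: the chosen root has no incoming arc in an arborescence, so its node cost would have to be added to the objective as a constant.

```python
def nwstp_to_sap(edges, edge_cost, node_cost):
    """Replace each edge {v, w} by the arcs (v, w) and (w, v) and add
    the node cost of the head vertex to the cost of each arc."""
    arc_cost = {}
    for (v, w), c in zip(edges, edge_cost):
        arc_cost[(v, w)] = c + node_cost[w]
        arc_cost[(w, v)] = c + node_cost[v]
    return arc_cost
```

For the path 0, 1, 2 with edge costs 1 and 2 and node costs 5, 3, 4, the arborescence rooted at 0 has arc cost 4 + 6 = 10; adding the root's node cost 5 gives 15, which equals the original objective 3 + 12.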

5.3 Group Steiner Tree Problems

The group Steiner tree problem (GSTP) is another generalization of the Steiner tree problem that originates from VLSI design [41]. For the GSTP the concept of terminals as a set of vertices to be interconnected is extended to a set of vertex groups: given an undirected graph G = (V, E), edge costs c : E → Q+ and a series of vertex subsets T1, ..., Ts ⊆ V, s ∈ N, a minimum cost tree spanning at least one vertex of each subset is required. By interpreting each terminal t as a subset {t}, every SPG can be considered as a GSTP, with the latter likewise being NP-hard. On the other hand, it is possible to transform each GSTP instance (V, E, T1, ..., Ts, c) to an SPG [42]. In the case of SCIP-Jack, to solve a GSTP, the transformation from [42] is applied and the resulting problem is treated as a customary SPG that is solved without any alteration. Improvements could likely be obtained by employing GSTP-specific heuristics or reduction techniques [43]. Computational results for two sets of GSTP instances [6] are listed in Table 7.


Table 7 Computational results for GSTP instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
GSTP1      8     8         7.3        27.6            –          –
GSTP2      10    5         1.0        1063.0          43.0       1.7

5.4 Rectilinear Steiner Tree Problems

Given a number of n ∈ N points in the plane, the rectilinear Steiner minimum tree problem (RSMTP) asks for a shortest tree consisting only of vertical and horizontal line segments and containing all n points. The RSMTP is NP-hard, as proved in [44], and has been the subject of various publications, see [45–47]. In addition to this two-dimensional variant, a generalization of the problem to the d-dimensional case, with d ≥ 3, will be considered. Hanan [48] reduced the RSMTP to the Hanan grid obtained by constructing vertical and horizontal lines through each given point of the RSMTP. It is proved in [48] that there is at least one optimal solution to an RSMTP that is a subgraph of the grid. Hence, the RSMTP can be reduced to an SPG. Subsequently, this construction and its multi-dimensional generalization [49] are exploited in order to adapt the RSMTP to be solved by SCIP-Jack. Given a d-dimensional, d ∈ N \ {1}, RSMTP represented by a set of n ∈ N points in Q^d, the first step involves building a d-dimensional Hanan grid. By using the resulting Hanan grid an SPG P = (V, E, T, c) can be constructed, which is handled by SCIP-Jack in the same way as a usual SPG. This simple Hanan grid based approach is normally not competitive with highly specialized solvers such as GeoSteiner [45] in the case d = 2. However, a motivation for the implementation in SCIP-Jack is to address the obvious lack of solvers, specialized or general, that can provide solutions to RSMTP instances in dimensions d ≥ 3.

A variant of the RSMTP is the obstacle-avoiding rectilinear Steiner minimum tree problem (OARSMTP). This problem requires that the minimum-length rectilinear tree does not pass through the interior of any of a set of specified axis-aligned rectangles, denoted as obstacles. SCIP-Jack is easily extended to solve the OARSMTP with a simple modification to the Hanan grid approach applied to the RSMTP. This modification involves removing all vertices that are located in the interior of an obstacle together with their incident edges as well as all edges crossing an obstacle.

The computational experiments for the RSMTP have been performed on instances from a cancer research application [50] that exhibit up to eight dimensions. The results are shown in Table 8. All except for the largest instance, with 4 762 800 nodes and 64 777 860 edges, can be solved to optimality. However, while previous versions of SCIP-Jack ran out of memory after presolving, the stronger reduction techniques (which reduce the number of nodes and edges to 1 077 575 and 12 277 938, respectively) make it possible to significantly reduce the optimality gap (from 100% to 5.8%) and also to improve the primal bound from 185 to 181.
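The construction of the d-dimensional Hanan grid can be illustrated as follows. The sketch builds the grid vertices as the Cartesian product of the coordinate values occurring in each dimension and connects grid points that are neighbors along one axis; the function name `hanan_grid` and the representation of points as tuples are assumptions of this sketch, and the obstacle handling of the OARSMTP is not included.

```python
from itertools import product

def hanan_grid(points):
    """Build the Hanan grid SPG for a list of points (tuples of equal length d).
    Returns (vertices, edges as dict (v, w) -> length, terminals)."""
    d = len(points[0])
    # Sorted distinct coordinate values per dimension.
    coords = [sorted({pt[k] for pt in points}) for k in range(d)]
    vertices = list(product(*coords))
    edges = {}
    for v in vertices:
        for k in range(d):
            idx = coords[k].index(v[k])
            if idx + 1 < len(coords[k]):
                # Neighbor with the next larger coordinate in dimension k.
                w = v[:k] + (coords[k][idx + 1],) + v[k + 1:]
                edges[(v, w)] = coords[k][idx + 1] - v[k]
    terminals = sorted(set(points))
    return vertices, edges, terminals
```

Note that the grid has up to n^d vertices for n points, which is consistent with the very large instance sizes reported for the higher-dimensional instances.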

Table 8 Computational results for RSMT instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
Cancer     14    13        1.0        1.6             1.0        5.8

Table 9 Computational results for DCSTP instances

                           Optimal                    Timeout
Test set   #     Solved    ∅ nodes    ∅ time (s)      ∅ nodes    ∅ gap (%)
TreeFam    20    13        345.0      43.2            1066.4     24.4

5.5 Degree-Constrained Steiner Tree Problems

The degree-constrained Steiner tree problem (DCSTP) is an SPG with additional degree constraints on the vertices, described by a function b : V → N. The objective is to find a minimum cost Steiner tree S = (V(S), E(S)) such that $|\delta_S(v)| \leq b(v)$ is satisfied for all v ∈ V(S). A comprehensive discussion of the DCSTP, including its applications in biology, can be found in [51]. The implementation in SCIP-Jack to solve the DCSTP involves the extension of Formulation 1 by an additional (linear) degree constraint for each vertex. While most SPG reduction techniques of SCIP-Jack cannot be applied to the DCSTP, it is still possible to use the dual-ascent method (ignoring the additional constraints), as it provides a feasible lower bound and valid reduced costs. Results on the (real-world) DCSTP instances from the 11th DIMACS Challenge can be found in Table 9. Compared to the specialized solver [51], several more instances can be solved and the gap among the remaining instances is substantially smaller. While the solver described in [4] can solve one more instance to optimality within two hours, the average is much smaller for SCIP-Jack (by more than 50%). Notably, the instance TF105897-t3 was solved to optimality for the first time in the experiments for this article; the optimal objective value is 97 832.

6 Ug[SCIP-Jack,*]: Shared and Distributed Parallelization

UG is a generic framework to parallelize any existing state-of-the-art branch-and-bound (B&B) based solver, subsequently referred to as base solver. From the beginning of the development, it was intended to parallelize state-of-the-art MIP solvers, which use LP based B&B. A particular instantiated parallel solver is referred to as ug[a specific solver name, a specific parallelization library name]. Here, the specific parallelization library is used to realize the message-passing based communications.


D. Rehfeldt et al.

Fig. 3 Design structure of UG

The design structure of UG is presented in Fig. 3. The workloads of the base solvers are coordinated by a single supervisor. ug[SCIP,MPI] (= ParaSCIP) is one of the most successful massively parallel MIP solvers and has solved previously unsolvable instances [52] using up to 80,000 cores. Currently, SCIP does not have a parallel B&B tree search feature. Therefore, ug[SCIP,Pthreads] (= FiberSCIP) is considered the multi-threaded version of SCIP. In the SCIP Optimization Suite 6.0 [8], Pthreads and C++11 threads can be used to run it on shared memory computing environments, and an MPI library can be used to run it on distributed memory computing environments. Owing to the well-defined plugin structure of SCIP, SCIP can be considered a general B&B framework that can be extended to a B&B based solver for a specific problem. In this way, ug[SCIP-Jack,*] is the parallel version of SCIP-Jack, which can run on shared and distributed memory computing environments depending on which parallelization library is used.

While the parallelization of any SCIP based solver only requires a small amount of glue code (typically less than 200 lines of code), a number of changes and extensions had to be implemented in both UG and SCIP-Jack to improve the parallel performance of ug[SCIP-Jack,*]. Initially, SCIP-Jack only branched on edges in the graph, that is, on variables in SCIP. However, newer versions of SCIP-Jack branch on vertices, that is, on constraints in SCIP. Therefore, in UG-0.8.6, a new feature to transfer subproblems with branching on constraints was added. Additionally, together with the constraints, string values, in which some branching information can be encoded, can now also be transferred. Moreover, a new callback routine of UG was added, which allows the user to initialize a subproblem before starting to solve it. In this way, ug[SCIP-Jack,*] can apply additional preprocessing to the graph based on the encoded information of the branching constraints. These additional algorithmic efforts considerably reduce the run time of the shared memory version (usually by more than 100% when using 8 threads). However, they make it difficult to run ug[SCIP-Jack,MPI] in large-scale computing environments, because the improved preprocessing of ug[SCIP-Jack,*], which is translated to bound changes in the underlying IP model, leads to a huge amount of information that has to be communicated between the base solvers and the supervisor process. A new communication protocol is being designed at the moment, but has not been finished yet. As a consequence, not more than a few hundred MPI processes can be used efficiently.

Nevertheless, experiments were performed for the two unsolved PUC instances hc10p and hc11p. For both instances the best known upper bounds have recently been improved in [4] and again in [5] (by running the instance for several days on shared memory machines). ug[SCIP-Jack,MPI] was able to improve both bounds again: from 119 492 to 119 332 for hc11p and from 59 797 to 59 733 for hc10p.

Finally, Table 10 provides shared memory experiments, both to exemplify the performance of ug[SCIP-Jack,Pthreads] and to point out the difficulties that come with the parallelization. The experiments were performed on five instances from the PUC test set, using 1, 8, 16, 32, and 64 threads (plus one thread for the supervisor). The first five rows provide the corresponding run time in seconds. The next row, labeled root time, provides the time that was spent at the root node (for which only one base solver is used). The next row lists the maximum number of base solvers that could be used during the solving process, and the last row states the first time at which the maximum number of base solvers was used. The least pronounced scaling effect can be observed for the instance cc3-4p; one explanation is that never more than 13 solvers are used. The best scaling can be observed for the last three instances, for which all 64 solvers can be utilized at least once. One should also consider that there is usually a ramp-down phase towards the end of the computation, when the number of open nodes in the B&B tree is decreasing. Moreover, the powerful reduction techniques of SCIP-Jack, which are a pivotal solving ingredient, often keep the number of open nodes small, as many subproblems can be quickly discarded. This in turn often impedes the full utilization of all available threads.
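The scaling behaviour discussed above can be quantified directly from the run times in Table 10. The sketch below is an illustration only: the helper names `speedup` and `efficiency` are hypothetical, and speedup is taken as the single-thread run time divided by the run time with the given number of threads.

```python
# Run times in seconds from Table 10, keyed by instance and thread count.
run_times = {
    "cc3-4p": {1: 105, 8: 56, 16: 58, 32: 55, 64: 53},
    "cc3-5u": {1: 3979, 8: 2614, 16: 1801, 32: 2249, 64: 1899},
    "cc5-3p": {1: 8234, 8: 2349, 16: 2500, 32: 1793, 64: 1433},
    "hc7p": {1: 3970, 8: 1683, 16: 1060, 32: 671, 64: 479},
    "hc7u": {1: 5106, 8: 1738, 16: 1106, 32: 761, 64: 498},
}

def speedup(instance, threads):
    """Single-thread run time divided by the run time with `threads` threads."""
    times = run_times[instance]
    return times[1] / times[threads]

def efficiency(instance, threads):
    """Speedup per thread; 1.0 would be perfect linear scaling."""
    return speedup(instance, threads) / threads

for name in run_times:
    print(f"{name}: speedup {speedup(name, 64):.2f}, "
          f"efficiency {efficiency(name, 64):.1%}")
```

With these definitions the instances hc7p and hc7u show the largest speedups at 64 threads, while cc3-4p stays close to 2, in line with the observation that it never uses more than 13 base solvers.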

Table 10 Parallel computational results on SPG instances (all times in seconds)

Threads              cc3-4p   cc3-5u   cc5-3p    hc7p    hc7u
1                       105    3,979    8,234   3,970   5,106
8                        56    2,614    2,349   1,683   1,738
16                       58    1,801    2,500   1,060   1,106
32                       55    2,249    1,793     671     761
64                       53    1,899    1,433     479     498
Root time                 6       10      109       8      16
Max # used solvers       13       52       64      64      64
Ramp-up time             11      152      466      41      42

7 Outlook: There and Back Again

Although SCIP-Jack has seen a series of major enhancements in recent years, its development is far from finished. As to Steiner tree problem variants, the prize-collecting Steiner tree problem (see Sect. 5.1) will receive special attention. More generally, the implementation of extended reduction techniques (see Sect. 2.2) will be continued (not only for SPG, but also for several related problems). Furthermore, the dual-ascent algorithm (see Sect. 1.2) will be improved, both implementation-wise and algorithmically. As to the parallelization of SCIP-Jack, besides improving ug[SCIP-Jack,*], the authors plan to add several internal parallelization extensions: for presolving, domain propagation, cutting-plane generation, and primal and dual heuristics. Above all, however, future development of SCIP-Jack will see a return to the roots: the classic SPG. The aim is to improve performance not only by enhancing existing implementations, but also by implementing a number of new techniques, a few of which have been published, for instance, in [53].

Acknowledgements

The authors would like to thank the anonymous reviewers for their suggestions and corrections. This work was supported by the BMWi project Realisierung von Beschleunigungsstrategien der anwendungsorientierten Mathematik und Informatik für optimierende Energiesystemmodelle (BEAM-ME, fund number 03ET4023DE). The work for this article has been conducted within the Research Campus MODAL funded by the German Federal Ministry of Education and Research (fund number 05M14ZAM).

References

1. Karp, R.: Reducibility among combinatorial problems. In: Miller, R., Thatcher, J. (eds.) Complexity of Computer Computations, pp. 85–103. Plenum Press, Berlin (1972)
2. Leitner, M., Ljubic, I., Luipersbeck, M., Prossegger, M., Resch, M.: New Real-world Instances for the Steiner Tree Problem in Graphs. Technical report, ISOR, Uni Wien (2014)
3. Koch, T., Martin, A., Voß, S.: SteinLib: An updated library on Steiner tree problems in graphs. In: Du, D.Z., Cheng, X. (eds.) Steiner Trees in Industries, pp. 285–325. Kluwer, Dordrecht (2001)
4. Fischetti, M., Leitner, M., Ljubić, I., Luipersbeck, M., Monaci, M., Resch, M., Salvagnin, D., Sinnl, M.: Thinning out Steiner trees: A node-based model for uniform edge costs. Math. Program. Comput. 9(2), 203–229 (2017)
5. Pajor, T., Uchoa, E., Werneck, R.F.: A robust and scalable algorithm for the Steiner problem in graphs. Math. Program. Comput. 10(1), 69–118 (2018)

SCIP-Jack: An Exact High Performance Solver for Steiner Tree …


6. Gamrath, G., Koch, T., Maher, S., Rehfeldt, D., Shinano, Y.: SCIP-Jack–a solver for STP and variants with parallelization extensions. Math. Program. Comput. 9(2), 231–296 (2017)
7. PACE 2018. https://pacechallenge.wordpress.com/pace-2018 (2018)
8. Gleixner, A., Bastubbe, M., Eifler, L., Gally, T., Gamrath, G., Gottwald, R.L., Hendel, G., Hojny, C., Koch, T., Lübbecke, M.E., Maher, S.J., Miltenberger, M., Müller, B., Pfetsch, M.E., Puchert, C., Rehfeldt, D., Schlösser, F., Schubert, C., Serrano, F., Shinano, Y., Viernickel, J.M., Walter, M., Wegscheider, F., Witt, J.T., Witzig, J.: The SCIP Optimization Suite 6.0. Technical Report 18–26, ZIB, Takustr. 7, 14195, Berlin (2018)
9. Achterberg, T.: Constraint Integer Programming. Ph.D. thesis, Technische Universität Berlin (2007)
10. Rosseti, I., de Aragão, M., Ribeiro, C., Uchoa, E., Werneck, R.: New benchmark instances for the Steiner problem in graphs. In: Extended Abstracts of the 4th Metaheuristics International Conference (MIC’2001), Porto, pp. 557–561 (2001)
11. Duin, C.: Steiner Problems in Graphs. Ph.D. thesis, University of Amsterdam (1993)
12. Polzin, T., Daneshmand, S.V.: Improved algorithms for the Steiner problem in networks. Discrete Appl. Math. 112(1–3), 263–300 (2001)
13. Daneshmand, S.V.: Algorithmic approaches to the Steiner problem in networks (2004)
14. Rehfeldt, D., Koch, T.: Combining NP-hard reduction techniques and strong heuristics in an exact algorithm for the maximum-weight connected subgraph problem. SIAM J. Optim. 29(1), 369–398 (2019)
15. Polzin, T., Vahdati Daneshmand, S.: A comparison of Steiner tree relaxations. Discrete Appl. Math. 112(1–3), 241–261 (2001)
16. Wong, R.: A dual ascent approach for Steiner tree problems on a directed graph. Math. Program. 28, 271–287 (1984)
17. Koch, T., Martin, A.: Solving Steiner tree problems in graphs to optimality. Networks 32, 207–232 (1998)
18. Polzin, T., Vahdati-Daneshmand, S.: The Steiner Tree Challenge: An updated Study. http://dimacs11.zib.de/downloads.html (2014)
19. Polzin, T., Daneshmand, S.V.: Extending Reduction Techniques for the Steiner Tree Problem, pp. 795–807. Springer, Berlin, Heidelberg (2002)
20. Dittrich, M.T., Klau, G.W., Rosenwald, A., Dandekar, T., Müller, T.: Identifying functional modules in protein-protein interaction networks: an integrated exact approach. In: ISMB, pp. 223–231 (2008)
21. Loboda, A.A., Artyomov, M.N., Sergushichev, A.A.: Solving Generalized Maximum-Weight Connected Subgraph Problem for Network Enrichment Analysis, pp. 210–221. Springer International Publishing, Cham (2016)
22. Dilkina, B., Gomes, C.P.: Solving connected subgraph problems in wildlife conservation. In: Proceedings of the 7th International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, CPAIOR’10, pp. 102–116. Springer, Berlin, Heidelberg (2010)
23. Chen, C., Grauman, K.: Efficient activity detection with max-subgraph search. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16–21, 2012, pp. 1274–1281 (2012)
24. Rehfeldt, D., Koch, T.: Transformations for the prize-collecting Steiner tree problem and the maximum-weight connected subgraph problem to SAP. J. Comput. Math. 36(3), 459–468 (2018)
25. Álvarez-Miranda, E., Ljubić, I., Mutzel, P.: The Rooted Maximum Node-Weight Connected Subgraph Problem, pp. 300–315. Springer, Berlin, Heidelberg (2013)
26. Rehfeldt, D., Koch, T., Maher, S.: Reduction techniques for the prize-collecting Steiner tree problem and the maximum-weight connected subgraph problem. Networks 73, 206–233 (2019)
27. 11th DIMACS Challenge. http://dimacs11.zib.de/ (2018)
28. Leitner, M., Ljubić, I., Luipersbeck, M., Sinnl, M.: A dual ascent-based branch-and-bound framework for the prize-collecting Steiner tree and related problems. INFORMS J. Comput. 30(2), 402–420 (2018)


29. Burdakov, O., Doherty, P., Kvarnström, J.: Local search for hop-constrained directed Steiner tree problem with application to UAV-based multi-target surveillance. In: Examining Robustness and Vulnerability of Networked Systems, pp. 26–50 (2014)
30. Pugliese, L.D.P., Gaudioso, M., Guerriero, F., Miglionico, G.: A Lagrangean-based decomposition approach for the link constrained Steiner tree problem. Optim. Meth. Softw. 33(3), 650–670 (2018)
31. Gamrath, G., Fischer, T., Gally, T., Gleixner, A.M., Hendel, G., Koch, T., Maher, S.J., Miltenberger, M., Müller, B., Pfetsch, M.E., Puchert, C., Rehfeldt, D., Schenker, S., Schwarz, R., Serrano, F., Shinano, Y., Vigerske, S., Weninger, D., Winkler, M., Witt, J.T., Witzig, J.: The SCIP Optimization Suite 3.2. Technical Report 15–60, ZIB, Takustr. 7, 14195, Berlin (2016)
32. Canuto, S.A., Resende, M.G.C., Ribeiro, C.C.: Local search with perturbations for the prize-collecting Steiner tree problem in graphs. Networks (2001)
33. Bienstock, D., Goemans, M.X., Simchi-Levi, D., Williamson, D.P.: A note on the prize collecting traveling salesman problem. Math. Program. 59, 413–420 (1993)
34. Johnson, D.S., Minkoff, M., Phillips, S.: The prize collecting Steiner tree problem: theory and practice. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’00, Philadelphia, PA, USA, Society for Industrial and Applied Mathematics, pp. 760–769 (2000)
35. Ljubić, I.: Exact and Memetic Algorithms for Two Network Design Problems. Ph.D. thesis, Vienna University of Technology (2004)
36. El-Kebir, M., Klau, G.W.: Solving the Maximum-Weight Connected Subgraph Problem to Optimality. Comput. Res. Repos. abs/1409.5308 (2014)
37. Ljubic, I., Weiskircher, R., Pferschy, U., Klau, G.W., Mutzel, P., Fischetti, M.: An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem. Math. Program. 105(2–3), 427–449 (2006)
38. Lucena, A., Resende, M.G.C.: Strong lower bounds for the prize collecting Steiner problem in graphs. Discrete Appl. Math. 141(1–3), 277–294 (2004)
39. Ideker, T., Ozier, O., Schwikowski, B., Siegel, A.F.: Discovering regulatory and signalling circuits in molecular interaction networks. In: ISMB, pp. 233–240 (2002)
40. Rehfeldt, D., Koch, T.: Reduction-based exact solution of prize-collecting Steiner tree problems. Technical Report 18–55, ZIB, Takustr. 7, 14195, Berlin (2018)
41. Hwang, F., Richards, D., Winter, P.: The Steiner Tree Problem. Annals of Discrete Mathematics. Elsevier Science, Amsterdam (1992)
42. Voß, S.: A survey on some generalizations of Steiner’s problem. In: 1st Balkan Conference on Operational Research Proceedings, 1, pp. 41–51 (1988)
43. Ferreira, C.E., de Oliveira Filho, F.M.: New reduction techniques for the group Steiner tree problem. SIAM J. Optim. 17(4), 1176–1188 (2006)
44. Garey, M., Johnson, D.: The rectilinear Steiner tree problem is NP-complete. SIAM J. Appl. Math. 32, 826–834 (1977)
45. Warme, D., Winter, P., Zachariasen, M.: Exact algorithms for plane Steiner tree problems: A computational study. In: Du, D.Z., Smith, J., Rubinstein, J. (eds.) Advances in Steiner Trees. Kluwer, New York, pp. 81–116 (2000)
46. Zachariasen, M., Rohe, A.: Rectilinear group Steiner trees and applications in VLSI design. Technical Report 00906, Institute for Discrete Mathematics (2000)
47. Emanet, N.: The Rectilinear Steiner Tree Problem. Lambert Academic Publishing (2010)
48. Hanan, M.: On Steiner’s problem with rectilinear distance. SIAM J. Appl. Math. 14(2), 255–265 (1966)
49. Snyder, T.L.: On the exact location of Steiner points in general dimension. SIAM J. Appl. Math. 21(1), 163–180 (1992)
50. Chowdhury, S.A., Shackney, S., Heselmeyer-Haddad, K., Ried, T., Schäffer, A.A., Schwartz, R.: Phylogenetic analysis of multiprobe fluorescence in situ hybridization data from tumor cell populations. Bioinformatics 29(13), 189–198 (2013)
51. Liers, F., Martin, A., Pape, S.: Binary Steiner trees. Discrete Optim. 21(C), 85–117 (2016)


52. Shinano, Y., Achterberg, T., Berthold, T., Heinz, S., Koch, T., Winkler, M.: Solving Open MIP Instances with ParaSCIP on Supercomputers Using up to 80,000 Cores. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 770–779 (2016)
53. Rehfeldt, D., Koch, T.: Generalized preprocessing techniques for Steiner tree and maximum-weight connected subgraph problems. Technical Report 17–57, ZIB, Takustr. 7, 14195, Berlin (2017)

Physical Parameter Identification with Sensors or Actuators Spanning Multiple DOF’s

Dong-Huei Tseng, Minh Q. Phan, and Richard W. Longman

Abstract This paper extends a family of techniques that recovers the physical parameters in a finite-element model of a structure with sensors or actuators spanning multiple degrees of freedom (DOF’s). Existing methods require each measurement to be associated with only one degree of freedom, each actuated degree of freedom can only be influenced by one actuator, and each actuator can only directly affect one degree of freedom. The extension presented here generalizes these methods to the case where each measurement can involve more than one degree of freedom, each actuated degree of freedom can be influenced by more than one actuator, or each actuator can directly affect more than one degree of freedom. A generalization of the concept of actuator and sensor collocation is developed in order to carry out the non-trivial extension.

1 Introduction

The physical parameters in a finite-element model of a structure in second-order form are contained in the mass, stiffness, and damping matrices. Identifying these parameters from input-output data can be the system identification objective itself [4], or it can be used in the refinement of an existing finite-element model [21]. Recent applications include health monitoring of smart structures [1, 8, 12, 16–18, 29], and soft tissue imaging for medical diagnostics [38]. The mathematical problem is challenging because the relationship between input-output data and the system physical parameters is highly nonlinear.

D.-H. Tseng (B), Superior Information Technology Inc., New Taipei City, Taiwan
M. Q. Phan, Dartmouth College, Hanover, NH 03755, USA
R. W. Longman, Columbia University, New York, NY 10027, USA

© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_11


Recent methods, e.g., [3, 7, 9–11, 24, 30–32], work with the equations of motion in first-order state-space form. The state-space formulation takes advantage of system identification techniques that identify a state-space model of a structure from input-output measurements. A large number of such system identification methods are available. These methods can be broadly divided into three groups: interaction matrix based methods [2, 5, 13, 15, 20, 22, 23, 25, 27, 39, 40] including the Observer/Kalman filter identification (OKID) algorithm [14], subspace methods [6, 28, 37] including the N4SID algorithm [36], and superspace methods [19, 26]. Superspace methods bypass the state recovery step required in the subspace methods altogether.

In terms of the state-space based physical parameter identification problem, the approach presented in [24, 32] is perhaps the simplest because it converts the original nonlinear inverse problem into two successive linear problems: first converting an identified state-space model to physical coordinates, then finding the mass, stiffness, and damping matrices from it. The solutions in [24, 32] require a full set of sensors, one per degree of freedom, to find the transformation matrix that converts an identified state-space model to physical coordinates. We recently developed a method that does not require a full set of sensors [33]. However, this method uses Kronecker products, which suffer from high dimensionality when dealing with systems that have a large number of degrees of freedom (DOF). In [34, 35], we presented a numerically efficient solution that avoids this difficulty. The solution only requires an eigen-decomposition of a 2n-by-2n matrix, where n is the number of degrees of freedom. Furthermore, the method does not require a full set of sensors. It only requires one independent sensor or actuator per degree of freedom, and at least one collocated pair of sensor and actuator. Once a state-space model is successfully converted to physical coordinates [34], the mass, stiffness, and damping matrices can be recovered from it [35].

Up to this point, all available techniques, including ours, assume that in a finite-element model of a structure, each measurement (displacement, velocity, or acceleration) is associated with a single degree of freedom, each actuated degree of freedom is directly influenced by a single actuator, and each actuator can only directly affect one degree of freedom. In applications where there is a very large number of sensors, measurements from clusters of sensors that are in proximity to each other might be combined. Other scenarios include the use of spatially distributed sensors such as strain gages or piezo-electric patches that combine information from more than one degree of freedom. In terms of actuation, large piezo-electric patches can also be used as actuators, and if the patch size is larger than the spacing between the finite-element nodes, then one actuator might directly influence more than one degree of freedom. On the other hand, if piezo-electric struts are used as actuators and the physical locations where they are connected are treated as finite-element nodes, then more than one actuator directly affects one finite-element degree of freedom. It turns out that this modification to the mathematical problem definition requires a non-trivial extension to the available solutions. This paper provides such an extension to our family of methods. Part of the extension is a generalization of the concept of actuator-sensor collocation to the situation where sensors or actuators span multiple finite-element degrees of freedom, as discussed above.

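2 System Models and Problem Statement

The second-order equation of motion of an n-degree-of-freedom structure with r actuators has the following general form

$$\mathcal{M}\ddot{q}(t) + \mathcal{C}\dot{q}(t) + \mathcal{K}q(t) = \mathcal{B}u(t) \tag{1}$$

where the n elements of $q(t)$ are the generalized coordinates, and $\mathcal{M}$, $\mathcal{C}$, and $\mathcal{K}$ are the n-by-n mass, damping, and stiffness matrices. The n-by-r matrix $\mathcal{B}$ specifies how the actuators affect the n degrees of freedom. Calligraphic letters denote these physical (second-order) matrices, to distinguish them from the state-space matrices $A$, $B$, $C$, $D$. Let $y(t)$ denote the output measurements consisting of displacements, velocities, or accelerations. In state-space form, the system assumes the following form, said to be in physical coordinates,

$$\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t) \tag{2}$$

where

$$x(t) = \begin{bmatrix} q(t) \\ \dot{q}(t) \end{bmatrix}, \quad A = \begin{bmatrix} 0_{n\times n} & I_{n\times n} \\ -\mathcal{M}^{-1}\mathcal{K} & -\mathcal{M}^{-1}\mathcal{C} \end{bmatrix}, \quad B = \begin{bmatrix} 0_{n\times r} \\ \mathcal{M}^{-1}\mathcal{B} \end{bmatrix} \tag{3}$$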
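As an illustration of Eq. (3), the following sketch assembles the state-space matrices A and B from given second-order matrices. The function name `physical_state_space` and the 2-DOF numerical values are hypothetical and chosen only for demonstration; the single actuator deliberately acts on both degrees of freedom.

```python
import numpy as np

def physical_state_space(M, C, K, B_phys):
    """Assemble A and B of Eq. (3) from the second-order matrices."""
    n, r = B_phys.shape
    # Solve M X = K etc. instead of forming the inverse of M explicitly.
    X = np.linalg.solve(M, K)        # M^{-1} K
    Y = np.linalg.solve(M, C)        # M^{-1} C
    Z = np.linalg.solve(M, B_phys)   # M^{-1} B
    A = np.block([[np.zeros((n, n)), np.eye(n)],
                  [-X, -Y]])
    B = np.vstack([np.zeros((n, r)), Z])
    return A, B

# Hypothetical 2-DOF example with one actuator spanning both DOFs.
M = np.diag([1.0, 2.0])
K = np.array([[3.0, -1.0], [-1.0, 1.0]])
C = 0.1 * K
B_phys = np.array([[1.0], [-1.0]])

A, B = physical_state_space(M, C, K, B_phys)
print(A)
print(B)
```

The lower blocks of A and B are exactly the combinations that the identification problem later has to recover from data.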

The state-space matrices $C$ and $D$ depend on the type of sensors used for measurement. The state-space representation takes advantage of modern state-space system identification techniques. Identified state-space models are in discrete time, but as long as the sampling frequency is sufficiently high, an identified discrete-time state-space model can be converted to continuous time. An identified state-space model (in discrete or continuous time), however, is in arbitrary and unknown coordinates. Let $A_r$, $B_r$, $C_r$, $D_r$ denote an identified state-space model that has been converted to continuous time. The mathematical problem can be stated as follows: Given (a) a continuous-time state-space model $A_r$, $B_r$, $C_r$ of a structure in some unknown coordinates, (b) the actuator locations and how the actuators affect the n degrees of freedom (i.e., $\mathcal{B}$ is known), and (c) the number and types of measurements, which are displacements, velocities, or accelerations, and their scaling or conversion factors $C_p$, $C_v$, and $C_a$, find a transformation matrix $T$ to convert $A_r$, $B_r$, $C_r$ to physical coordinates such that

$$A = T A_r T^{-1}, \quad B = T B_r, \quad C = C_r T^{-1} \tag{4}$$

where $C = \begin{bmatrix} C_p & 0_{m\times n} \end{bmatrix}$ for displacement measurements, $C = \begin{bmatrix} 0_{m\times n} & C_v \end{bmatrix}$ for velocity measurements, or $C = C_a \begin{bmatrix} -\mathcal{M}^{-1}\mathcal{K} & -\mathcal{M}^{-1}\mathcal{C} \end{bmatrix}$ for acceleration measurements. The direct transmission term $D_r$ is not affected by the transformation, $D_r = D$. Once the model $(A_r, B_r, C_r)$ is converted to physical coordinates, finding $\mathcal{M}$, $\mathcal{C}$, $\mathcal{K}$ from $A$, $B$, $C$ is a linear problem, obtained by solving

$$\mathcal{M}X = \mathcal{K}, \quad \mathcal{M}Y = \mathcal{C}, \quad \mathcal{M}Z = \mathcal{B} \tag{5}$$

The combinations $X = \mathcal{M}^{-1}\mathcal{K}$, $Y = \mathcal{M}^{-1}\mathcal{C}$, and $Z = \mathcal{M}^{-1}\mathcal{B}$ are extracted from $A$, $B$, $C$ in physical coordinates. In order to recover $\mathcal{M}$, $\mathcal{C}$, and $\mathcal{K}$ uniquely, additional symmetry conditions must be placed on $\mathcal{M}$, $\mathcal{C}$, and/or $\mathcal{K}$. References [24, 32, 35] offer methods to solve for $\mathcal{M}$, $\mathcal{C}$, $\mathcal{K}$ from $X$, $Y$, $Z$ and $\mathcal{B}$.

3 Preliminary Relationship for the Partitions of T

Let the 2n-by-2n transformation matrix $T$ be partitioned into upper and lower partitions of equal dimensions,

$$T = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} \tag{6}$$

From Eq. (4), we obtain $AT = TA_r$,

$$\begin{bmatrix} 0 & I \\ -X & -Y \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} A_r \tag{7}$$

which produces

$$T_2 = T_1 A_r, \qquad -X T_1 - Y T_2 = T_2 A_r \tag{8}$$

The first expression in Eq. (8) relates $T_1$ to $T_2$ via $A_r$. Because $A_r$ is known, to find $T$, we only need to find either $T_1$ or $T_2$, since $T_1$ can be computed from $T_2$ and $A_r$, and vice versa. In this paper the derivation is based on finding $T_1$. The second expression in Eq. (8) is nonlinear because it involves the products $X T_1$ and $Y T_2$ where $X$, $Y$, $T_1$, $T_2$ are unknown. We could, in theory, formulate a mathematical problem to find $X$, $Y$, $T_1$, $T_2$ simultaneously. However, this would be a nonlinear problem, which we try to avoid. Since $B = T B_r$,

$$\begin{bmatrix} 0 \\ Z \end{bmatrix} = \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} B_r \tag{9}$$

we obtain

$$T_1 B_r = 0, \qquad T_2 B_r = Z \tag{10}$$
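The relations in Eqs. (8) and (10) can be checked numerically. The sketch below assumes a hypothetical 2-DOF system and a randomly generated invertible $T$ that plays the role of the unknown coordinate change; it builds $A_r = T^{-1} A T$ and $B_r = T^{-1} B$ as in Eq. (4) and then verifies the partition relations. All numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2

# Hypothetical physical system (symmetric M, C, K), one actuator.
M = np.diag([1.0, 2.0])
K = np.array([[3.0, -1.0], [-1.0, 1.0]])
C = 0.1 * K
B_phys = np.array([[1.0], [-1.0]])

X, Y, Z = (np.linalg.solve(M, W) for W in (K, C, B_phys))
A = np.block([[np.zeros((n, n)), np.eye(n)], [-X, -Y]])
B = np.vstack([np.zeros((n, 1)), Z])

# Random coordinate change; the shift by 4*I keeps T well conditioned.
T = 4.0 * np.eye(2 * n) + rng.standard_normal((2 * n, 2 * n))
T_inv = np.linalg.inv(T)
A_r = T_inv @ A @ T   # model in "unknown" coordinates, Eq. (4)
B_r = T_inv @ B

T1, T2 = T[:n], T[n:]
print(np.allclose(T2, T1 @ A_r))                      # first line of Eq. (8)
print(np.allclose(-X @ T1 - Y @ T2, T2 @ A_r))        # second line of Eq. (8)
print(np.allclose(T1 @ B_r, 0.0, atol=1e-10))         # first line of Eq. (10)
print(np.allclose(T2 @ B_r, Z))                       # second line of Eq. (10)
```

Only the relations $T_2 = T_1 A_r$ and $T_1 B_r = 0$ involve known quantities alone, which is why they are the ones used to constrain $T_1$.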

The first expression in (10) is homogeneous and linear in $T_1$ because $B_r$ is known. The second expression in (10) involves $Z$, which is unknown. Up to this point only one linear and known homogeneous relationship for $T_1$ has been found, namely $T_1 B_r = 0$. Additional linear (and non-homogeneous) equations for $T_1$ can be determined from the measurement equations, depending on what types of measurements are available (displacements, velocities, and/or accelerations).

In all previous works, each measurement (displacement, velocity, or acceleration) corresponds to one individual degree of freedom. Mathematically, each row of $C_p$, $C_v$, or $C_a$ consists of only one non-zero element. In this extension, we allow each measurement to involve information from more than one degree of freedom. Thus each row of $C_p$, $C_v$, or $C_a$ can contain more than one non-zero element. Likewise, previously, each actuated degree of freedom could only be directly affected by one actuator, i.e., each row of $\mathcal{B}$ contains only one non-zero element. We now allow each actuated degree of freedom to be affected by more than one actuator, hence each row of $\mathcal{B}$ can contain more than one non-zero element. Furthermore, it was previously assumed that each actuator can only influence one degree of freedom, that is, each column of $\mathcal{B}$ contains only one non-zero element. Now we allow the possibility that each actuator can affect more than one degree of freedom, i.e., each column of $\mathcal{B}$ can contain more than one non-zero element. The net effect of all these cases is that $C_p$, $C_v$, or $C_a$, and $\mathcal{B}$ can now be fully populated.

To find the transformation matrix $T$ that converts a system to physical coordinates, the mathematical problem can be categorized into two cases: one with a full set of sensors, and one without a full set of sensors. Having a full set of sensors is equivalent to the $C$ matrix having full rank, which is n for an n-degree-of-freedom system. The case with a full set of sensors is solved in [24, 32], and the case without a full set of sensors is solved in [33] using the Kronecker product method. The Kronecker product solution, however, suffers from high dimensionality for systems with many degrees of freedom. In Ref. [34], we formulated a solution that avoids the use of Kronecker products. This solution requires one independent sensor or actuator per degree of freedom, and at least one collocated pair of sensor and actuator. This paper generalizes Ref. [34] to the situation where sensors or actuators can span multiple finite-element degrees of freedom, i.e., $C_p$, $C_v$, $C_a$ and $\mathcal{B}$ can be fully populated. To solve this problem, we generalize the notion of sensor and actuator collocation in [34] to mean that the intersection of two particular vector spaces must not be null. The requirement of one independent sensor or actuator per degree of freedom is also generalized to a rank condition on a certain matrix. The details of this development are explained in the following sections.


4 Additional Linear Equations to Find T1 for a System Without a Full Set of Sensors

When a full set of sensors is available, there are sufficient linearly independent equations in $CT = C_r$ from Eq. (4) to find $T_1$ [24, 32]. When a full set of sensors is not available, the number of linearly independent equations required to find $T_1$ falls short, and it appears that the problem will necessarily become nonlinear. However, additional equations for $T_1$ are found in [33]. These additional equations involve the input influence matrix $\mathcal{B}$, and they remain linear in $T_1$,

$$(\mathcal{B}^T T_1) A_r^k B_r = B_r^T (A_r^T)^k (T_1^T \mathcal{B}) \tag{11}$$

These additional equations in $T_1$ are still homogeneous. To turn them into non-homogeneous equations, we need to relate $C_p$, $C_v$, or $C_a$ to $\mathcal{B}$ through the concept of actuator and sensor collocation. When sensors or actuators do not span multiple finite-element degrees of freedom, the coupling can be achieved by at least one collocated pair of sensor and actuator. When sensors or actuators do span multiple finite-element degrees of freedom, the required coupling is explained in the following section for each type of measurement.
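Equation (11) states that the r-by-r matrix $(\mathcal{B}^T T_1) A_r^k B_r$ equals its own transpose. The sketch below checks this numerically under the assumption that the mass, damping, and stiffness matrices are symmetric; the 3-DOF system, the two actuators (one of which spans two degrees of freedom), and the random $T$ are hypothetical values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 3, 2

# Hypothetical symmetric second-order model (assumption of this sketch).
M = np.diag([1.0, 2.0, 3.0])
K = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
C = 0.05 * K + 0.01 * M
# Second actuator acts on DOFs 2 and 3 simultaneously.
B_phys = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

A = np.block([[np.zeros((n, n)), np.eye(n)],
              [-np.linalg.solve(M, K), -np.linalg.solve(M, C)]])
B = np.vstack([np.zeros((n, r)), np.linalg.solve(M, B_phys)])

# Model in arbitrary coordinates, Eq. (4).
T = 4.0 * np.eye(2 * n) + rng.standard_normal((2 * n, 2 * n))
T_inv = np.linalg.inv(T)
A_r, B_r = T_inv @ A @ T, T_inv @ B
T1 = T[:n]

def lhs(k):
    """Left-hand side of Eq. (11) for a given power k."""
    return B_phys.T @ T1 @ np.linalg.matrix_power(A_r, k) @ B_r

symmetric = [np.allclose(lhs(k), lhs(k).T, atol=1e-9) for k in range(5)]
print(symmetric)
```

Because the right-hand side of Eq. (11) is the transpose of the left-hand side, the symmetry test above is the same statement as the equation itself.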

5 Finding T1 for Different Types of Measurements

In this section we describe how $T_1$ can be found for different types of measurements, which can be displacements, velocities, or accelerations. The case of acceleration measurements is the most common in flexible structure applications.

5.1 Case 1: Displacement Measurements

The output influence matrix for displacement measurements is $C = \begin{bmatrix} C_p & 0_{m\times n} \end{bmatrix}$ where $C_p$ is known. For an m-output n-DOF system, the dimensions of $C_p$ are m-by-n. The matrix $T$ relates $C$ to $C_r$,

$$\begin{bmatrix} C_p & 0 \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = C_r \tag{12}$$

The above equation produces a linear non-homogeneous equation in $T_1$, where $T_1$ has dimension n-by-2n,

$$C_p T_1 = C_r \tag{13}$$

We will use Eqs. (13) and (11) to solve for $T_1$, but at this point Eq. (11) is still homogeneous. When each sensor measures only one degree of freedom, as required in [33, 34], this equation becomes non-homogeneous if there is at least one collocated sensor-actuator pair. The analogous requirement when sensors or actuators span more than one degree of freedom is that the coupling between these two equations is not null. In this case, the coupling can be realized through the intersection subspace $J_p$ between $C_p$ and $\mathcal{B}^T$ by letting $J_m = J_p$, $C_m = C_p$, $F_m = F_p$ and $G_m = G_p$ in Appendix 2,

$$J_p = F_p^T C_p = G_p^T \mathcal{B}^T \tag{14}$$

The number of sensors plus the number of actuators must be more than the number of degrees of freedom for the intersection subspace not to be null. Post-multiplying Eq. (14) with $T_1$ and taking the transpose of the resulting equation, one obtains

$$(C_p T_1)^T F_p = T_1^T \mathcal{B} G_p = C_r^T F_p \tag{15}$$

using Eq. (13). Post-multiplying Eq. (11) with $G_p$ produces

$$(\mathcal{B}^T T_1) A_r^k (B_r G_p) = B_r^T (A_r^T)^k (C_r^T F_p) \tag{16}$$

Equation (16) is now in the form $Q A^k B_1 = R (A^T)^k B_2$. As in [34], $Q = \mathcal{B}^T T_1$ can be solved from Eq. (32) in Appendix 1 by setting $A = A_r$, $B_1 = B_r G_p$, $R = B_r^T$, $B_2 = C_r^T F_p$. Then $T_1$ can be solved from $C_p T_1 = C_r$ and $Q = \mathcal{B}^T T_1$. This two-step strategy avoids the use of Kronecker products. Appendix 1 describes a method that further avoids the numerical instability associated with raising $A_r$ to high powers for a high-DOF system. Writing $C_p T_1 = C_r$ and $Q = \mathcal{B}^T T_1$ together,

$$\begin{bmatrix} C_p \\ \mathcal{B}^T \end{bmatrix} T_1 = \begin{bmatrix} C_r \\ Q \end{bmatrix} \tag{17}$$

To ensure a unique solution for $T_1$, we further require that the matrix $\begin{bmatrix} C_p \\ \mathcal{B}^T \end{bmatrix}$ have full rank, which is n. Then $T_1$ can be uniquely solved from

$$T_1 = \begin{bmatrix} C_p \\ \mathcal{B}^T \end{bmatrix}^{\dagger} \begin{bmatrix} C_r \\ Q \end{bmatrix} \tag{18}$$

where † denotes the pseudo-inverse operation.

232
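The final step of Case 1, Eqs. (17) and (18), can be sketched numerically as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation: the function name `solve_T1` and the random test data are assumptions made here, and $Q$ is taken as already known.

```python
import numpy as np

def solve_T1(C_meas, B, rhs_top, Q):
    """Solve [C_meas; B^T] T1 = [rhs_top; Q] for T1, as in Eq. (18)."""
    n = B.shape[0]
    stacked = np.vstack([C_meas, B.T])      # (m + r) x n coefficient matrix
    if np.linalg.matrix_rank(stacked) < n:  # uniqueness condition stated in the text
        raise ValueError("[C; B^T] must have full column rank n")
    rhs = np.vstack([rhs_top, Q])           # (m + r) x 2n right-hand side
    return np.linalg.pinv(stacked) @ rhs

# Synthetic check (all numbers below are made up for illustration).
rng = np.random.default_rng(0)
n, m, r = 5, 3, 3
C_p = rng.standard_normal((m, n))
B = rng.standard_normal((n, r))
T1_true = rng.standard_normal((n, 2 * n))
C_r = C_p @ T1_true                         # Eq. (13)
Q = B.T @ T1_true                           # definition of Q
T1 = solve_T1(C_p, B, C_r, Q)
print(np.allclose(T1, T1_true))
```

The same helper would apply to Cases 2 and 3 by passing $C_v$ with $C_r A_r^{-1}$, or $C_a$ with $C_r A_r^{-2}$, as the first and third arguments.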

D.-H. Tseng et al.

5.2 Case 2: Velocity Measurements

The case for velocity measurements can be derived analogously. The output influence matrix in the case of velocity measurements is $C = \begin{bmatrix} 0_{m \times n} & C_v \end{bmatrix}$, where $C_v$ is known. From

$$\begin{bmatrix} 0 & C_v \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \end{bmatrix} = C_r \tag{19}$$

we obtain $C_v T_2 = C_r$. Combining with $T_2 = T_1 A_r$, the following linear equation in $T_1$ is obtained,

$$C_v T_1 = C_r A_r^{-1} \tag{20}$$

Similar to Sect. 5.1, the coupling can be realized through the intersection subspace $J_v$ between $C_v$ and $B^T$ by letting $J_m = J_v$, $C_m = C_v$, $F_m = F_v$ and $G_m = G_v$ in Appendix 2,

$$J_v = F_v^T C_v = G_v^T B^T \tag{21}$$

Again, the number of sensors plus the number of actuators must be more than the number of degrees of freedom for the intersection subspace not to be null. Post-multiplying Eq. (21) with $T_1$ and taking the transpose of the resulting equation, one obtains

$$(C_v T_1)^T F_v = T_1^T B G_v \tag{22}$$

and by using Eq. (20),

$$(C_r A_r^{-1})^T F_v = T_1^T B G_v \tag{23}$$

Post-multiplying Eq. (11) with $G_v$, we find the non-homogeneous equation for $T_1$,

$$(B^T T_1) A_r^k B_r G_v = B_r^T (A_r^T)^k (C_r A_r^{-1})^T F_v \tag{24}$$

Now, Eq. (24) is in the form of Eq. (32) with $Q = B^T T_1$, $A = A_r$, $B_1 = B_r G_v$, $R = B_r^T$, $B_2 = (C_r A_r^{-1})^T F_v$, from which $Q$ can be solved (see Appendix 1). If $\begin{bmatrix} C_v \\ B^T \end{bmatrix}$ is full rank, i.e., of rank $n$, then $T_1$ can be solved from

$$T_1 = \begin{bmatrix} C_v \\ B^T \end{bmatrix}^{\dagger} \begin{bmatrix} C_r A_r^{-1} \\ Q \end{bmatrix} \tag{25}$$

where $\dagger$ denotes the pseudo-inverse operation.


5.3 Case 3: Acceleration Measurements

In the case of acceleration measurements, we have

$$C = C_a \begin{bmatrix} -M^{-1}K & -M^{-1}C \end{bmatrix}, \qquad D = C_a M^{-1} B \tag{26}$$

where the $C$ inside the brackets is the damping matrix. Unlike the previous cases where the output influence matrix $C$ is known, $C$ now contains the unknown products $M^{-1}K$ and $M^{-1}C$. This potential complication can be overcome as follows. Since $X = M^{-1}K$ and $Y = M^{-1}C$, $C$ in Eq. (26) can be expressed as $C = C_a \begin{bmatrix} -X & -Y \end{bmatrix}$. Because $C\,T = C_r$, it follows that $C_a \begin{bmatrix} -X & -Y \end{bmatrix} T = C_r$. But from the bottom partition of Eq. (8), $\begin{bmatrix} -X & -Y \end{bmatrix} T = T_2 A_r$. Therefore, $C_a T_2 A_r = C_r$. Substituting in $T_2 = T_1 A_r$ we obtain $C_a T_1 A_r^2 = C_r$, or equivalently,

$$C_a T_1 = C_r A_r^{-2} \tag{27}$$

We can now proceed analogously to the previous two cases. The coupling can be realized through the intersection subspace $J_a$ between $C_a$ and $B^T$ by letting $J_m = J_a$, $C_m = C_a$, $F_m = F_a$ and $G_m = G_a$ in Appendix 2,

$$J_a = F_a^T C_a = G_a^T B^T \tag{28}$$

As before, we require that the number of sensors plus the number of actuators be more than the number of degrees of freedom for the intersection subspace not to be null. Post-multiplying Eq. (28) with $T_1$, using Eq. (27), and taking the transpose of the resulting equation, one obtains

$$(C_r A_r^{-2})^T F_a = T_1^T B G_a \tag{29}$$

Post-multiplying Eq. (11) with $G_a$, we find the non-homogeneous equation for $T_1$,

$$(B^T T_1) A_r^k B_r G_a = B_r^T (A_r^T)^k (C_r A_r^{-2})^T F_a \tag{30}$$

Now, Eq. (30) is in the form of Eq. (32) with $Q = B^T T_1$, $A = A_r$, $B_1 = B_r G_a$, $R = B_r^T$, $B_2 = (C_r A_r^{-2})^T F_a$, from which $Q$ can be solved (see Appendix 1). Thus, if $\begin{bmatrix} C_a \\ B^T \end{bmatrix}$ is full rank, i.e., of rank $n$, then $T_1$ can be solved from

$$T_1 = \begin{bmatrix} C_a \\ B^T \end{bmatrix}^{\dagger} \begin{bmatrix} C_r A_r^{-2} \\ Q \end{bmatrix} \tag{31}$$
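The relations $C_v T_1 = C_r A_r^{-1}$ (Eq. (20)) and $C_a T_1 = C_r A_r^{-2}$ (Eq. (27)) can be checked on a small synthetic model. The sketch below is illustrative only: the 3-DOF matrices, the transformation $T$, the helper name `times_inverse_power`, and the state matrix form $A = \begin{bmatrix} 0 & I \\ -X & -Y \end{bmatrix}$ (inferred from the partitions quoted above) are assumptions made here, not values from the paper.

```python
import numpy as np

def times_inverse_power(C_r, A_r, power):
    """Return C_r @ inv(A_r)**power using linear solves (no explicit inverse)."""
    X = np.asarray(C_r, dtype=float)
    for _ in range(power):
        # X_new @ A_r = X   <=>   A_r.T @ X_new.T = X.T
        X = np.linalg.solve(A_r.T, X.T).T
    return X

# Synthetic 3-DOF model with M = I (illustrative values, not the paper's example).
rng = np.random.default_rng(1)
n, m = 3, 2
K = np.array([[100.0, -50.0, 0.0], [-50.0, 100.0, -50.0], [0.0, -50.0, 100.0]])
Cd = K / 50.0                                    # damping matrix
A = np.block([[np.zeros((n, n)), np.eye(n)], [-K, -Cd]])

# Well-conditioned transformation: orthogonal factor with mild column scaling.
Qo, _ = np.linalg.qr(rng.standard_normal((2 * n, 2 * n)))
T = Qo * np.linspace(1.0, 2.0, 2 * n)
T1, T2 = T[:n], T[n:]
A_r = np.linalg.solve(T, A @ T)                  # A_r = T^{-1} A T

C_v = rng.standard_normal((m, n))
C_a = rng.standard_normal((m, n))
Cr_vel = np.hstack([np.zeros((m, n)), C_v]) @ T  # Eq. (19)
Cr_acc = C_a @ np.hstack([-K, -Cd]) @ T          # Eq. (26) with C T = C_r

vel_ok = np.allclose(C_v @ T1, times_inverse_power(Cr_vel, A_r, 1))  # Eq. (20)
acc_ok = np.allclose(C_a @ T1, times_inverse_power(Cr_acc, A_r, 2))  # Eq. (27)
print(vel_ok, acc_ok)
```

Using linear solves rather than forming $A_r^{-1}$ explicitly is a choice made in this sketch; the text itself only states the algebraic relation.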

6 Step-by-Step Algorithm

In this section, all the relevant equations are summarized in the form of an algorithm starting from $(A_r, B_r, C_r)$. Such a model is obtained by converting an identified discrete-time state-space model to continuous time. Any state-space system identification algorithm, such as OKID [13, 14, 22, 23, 25–27], can be used to identify such a model from discrete-time input-output data. The discrete-to-continuous time conversion can be performed, for example, by the standard Matlab function d2c.

Step 1: Determine the output influence matrix $C$ with $C_p$, $C_v$, and $C_a$ depending on the number and types of measurements associated with the model $(A_r, B_r, C_r)$. Also determine the input influence matrix $B$ that specifies the actuator locations and how they affect the degrees of freedom in the physical model. The number of sensors plus the number of actuators must be more than the number of degrees of freedom.

Step 2: Find the intersection subspace between $C_p$, $C_v$, or $C_a$ and $B^T$ by the technique in Appendix 2.

Step 3: Set $A = A_r$, $R = B_r^T$, and $B_1$, $B_2$ according to the type of measurements:
Case 1 (displacement measurements): Set $B_1 = B_r G_p$ and $B_2 = C_r^T F_p$, where $F_p$ and $G_p$ are derived from Eq. (14).
Case 2 (velocity measurements): Set $B_1 = B_r G_v$ and $B_2 = (C_r A_r^{-1})^T F_v$, where $F_v$ and $G_v$ are derived from Eq. (21).
Case 3 (acceleration measurements): Set $B_1 = B_r G_a$ and $B_2 = (C_r A_r^{-2})^T F_a$, where $F_a$ and $G_a$ are derived from Eq. (28).

Step 4: For all types of measurements, solve for $Q = B^T T_1$ by the technique in Appendix 1 from $A = A_r$, $R = B_r^T$, $B_1$, and $B_2$.

Step 5: After $Q$ is found in Step 4, $T_1$ can be found if the matrix that needs to be inverted to solve for it is full rank. The unique solution for $T_1$ is given in Eq. (18) for Case 1 (displacement measurements), Eq. (25) for Case 2 (velocity measurements), or Eq. (31) for Case 3 (acceleration measurements).

Step 6: After $T_1$ is found, compute $T_2$ from $T_2 = T_1 A_r$.

The transformation matrix $T$, which is made up of $T_1$ and $T_2$ and transforms $(A_r, B_r, C_r)$ to physical coordinates, is now found. Once the model is transformed to physical coordinates, extract the partitions $X$, $Y$, $Z$ from $(A, B, C)$. Then use the method explained in [35] to recover $M$, $C$, $K$.
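Step 6 and the subsequent extraction of the partitions can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function name `to_physical_coordinates` is hypothetical, the physical state matrix is assumed to have the form $A = \begin{bmatrix} 0 & I \\ -X & -Y \end{bmatrix}$ with $A T = T A_r$, and the lower block of $T B_r$ is returned without claiming it is the paper's $Z$, whose definition lies outside this section.

```python
import numpy as np

def to_physical_coordinates(A_r, B_r, T1):
    """Build T from T1 (Step 6), then read off X, Y and the lower block of T B_r."""
    n = T1.shape[0]
    T = np.vstack([T1, T1 @ A_r])                 # Step 6: T2 = T1 A_r
    A = np.linalg.solve(T.T, (T @ A_r).T).T       # A = T A_r T^{-1}
    X = -A[n:, :n]                                # X = M^{-1} K
    Y = -A[n:, n:]                                # Y = M^{-1} C (damping)
    B_lower = (T @ B_r)[n:]                       # equals M^{-1} B under the assumed form
    return T, X, Y, B_lower

# Synthetic check with M = I (illustrative values only).
rng = np.random.default_rng(3)
n, r = 3, 2
K = np.array([[100.0, -50.0, 0.0], [-50.0, 100.0, -50.0], [0.0, -50.0, 100.0]])
Cd = K / 50.0
B = rng.standard_normal((n, r))
A_true = np.block([[np.zeros((n, n)), np.eye(n)], [-K, -Cd]])
Qo, _ = np.linalg.qr(rng.standard_normal((2 * n, 2 * n)))
T_true = Qo * np.linspace(1.0, 2.0, 2 * n)        # well-conditioned transformation
A_r = np.linalg.solve(T_true, A_true @ T_true)
B_r = np.linalg.solve(T_true, np.vstack([np.zeros((n, r)), B]))

T, X, Y, B_lower = to_physical_coordinates(A_r, B_r, T_true[:n])
print(np.allclose(X, K), np.allclose(Y, Cd))
```

Recovering $M$, $C$, $K$ individually from $X$, $Y$ and the input partition is the subject of [35] and is not attempted in this sketch.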

7 Numerical Illustration

Consider a 5-degree-of-freedom, 3-input, 3-output system with the following mass, damping, and stiffness matrices $M$, $C$, $K$, and input matrix $B$,

$$M = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad C = \begin{bmatrix} 2 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 2 \end{bmatrix}$$

$$K = \begin{bmatrix} 100 & -50 & 0 & 0 & 0 \\ -50 & 100 & -50 & 0 & 0 \\ 0 & -50 & 100 & -50 & 0 \\ 0 & 0 & -50 & 100 & -50 \\ 0 & 0 & 0 & -50 & 100 \end{bmatrix}, \qquad B = \begin{bmatrix} 1.7491 & -0.5273 & 1.7411 \\ 0.1326 & 0.9323 & 0.4868 \\ 0.3252 & 1.1647 & 1.0488 \\ -0.7938 & -2.0457 & 1.4886 \\ 0.3149 & -0.6444 & 1.2705 \end{bmatrix}$$
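For readers who want to experiment with this example, the physical state-space model can be assembled as in the sketch below. The block form of the state matrix and of the state-space input matrix is an assumption inferred from the partitions $X = M^{-1}K$, $Y = M^{-1}C$ and $D = C_a M^{-1}B$ used in Sect. 5; the variable names are hypothetical. The spectral radius is printed because it indicates how quickly powers of $A$ up to $2n-1$ grow, which is the issue addressed in Appendix 1.

```python
import numpy as np

n = 5
L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiag(-1, 2, -1)
M = np.eye(n)
Cd = L                # damping matrix C of the example
K = 50.0 * L          # stiffness matrix K of the example
B = np.array([
    [ 1.7491, -0.5273, 1.7411],
    [ 0.1326,  0.9323, 0.4868],
    [ 0.3252,  1.1647, 1.0488],
    [-0.7938, -2.0457, 1.4886],
    [ 0.3149, -0.6444, 1.2705],
])

X = np.linalg.solve(M, K)                                 # M^{-1} K
Y = np.linalg.solve(M, Cd)                                # M^{-1} C
A = np.block([[np.zeros((n, n)), np.eye(n)], [-X, -Y]])   # assumed state matrix
B_state = np.vstack([np.zeros((n, 3)), np.linalg.solve(M, B)])

eigvals = np.linalg.eigvals(A)
rho = np.max(np.abs(eigvals))      # spectral radius of A
growth = rho ** (2 * n - 1)        # lower bound on the 2-norm of A^(2n-1)
print(rho, growth)
```

Since the 2-norm of $A^k$ is never smaller than the $k$-th power of the spectral radius, `growth` gives a conservative indication of the magnitudes that enter the controllability-like matrices of Eq. (33).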

For interested readers who would like to reproduce these numerical examples, the input matrix $B$ is generated by the Matlab random number generator randn(5, 3) with the 'state' number set to 2, followed by roundn(B, −4) to keep only four digits after the decimal point. To illustrate the developed algorithm, we convert the truth state-space model $A$, $B$, $C$ to $A_r$, $B_r$, $C_r$ by an arbitrary transformation matrix $T$ (which is unknown to the algorithm), and then apply the algorithm to recover $T$,

$$T = \begin{bmatrix}
0.8644 & -1.9654 & 0.5750 & 0.6434 & -0.4663 & -0.1430 & 1.1795 & -0.4852 & -0.2781 & 1.3168 \\
0.0942 & -0.7443 & -0.8661 & 0.6819 & -0.3659 & -2.1619 & -0.6859 & 1.1970 & -0.3280 & 1.4422 \\
-0.8519 & -0.5523 & -2.1165 & 0.0147 & 1.1183 & -0.6442 & 1.6768 & 1.3948 & -0.0125 & 1.4669 \\
0.8735 & -0.8197 & -0.9645 & -1.3015 & -0.4656 & 1.4396 & -0.2553 & 0.1654 & 0.9032 & -1.1071 \\
-0.4380 & 1.1091 & 0.2127 & -1.2846 & -1.5608 & -0.8469 & -0.6475 & -0.5100 & -1.1125 & -0.4609 \\
-0.4297 & -0.6149 & 0.4779 & 0.8122 & -0.2831 & 0.0573 & -0.1822 & 1.3777 & -0.8392 & -0.0203 \\
-1.1027 & -0.2546 & 0.1007 & 0.8385 & -1.3229 & 0.6434 & 0.8518 & 1.2985 & 0.0355 & -0.0460 \\
0.3962 & -0.2698 & 0.2974 & 1.4203 & -0.1962 & -0.6704 & -0.3066 & -0.1301 & -1.2465 & -0.5445 \\
-0.9649 & -1.6720 & 0.5701 & -0.9898 & 0.4190 & -0.0031 & -0.4405 & 0.7402 & 0.8845 & 0.9170 \\
0.1684 & -1.8760 & -1.6245 & -1.1832 & 0.7423 & 0.3529 & -0.6115 & 1.3320 & 2.5383 & -0.0194
\end{bmatrix}$$

The above transformation matrix $T$ is generated by the Matlab random number generator randn(10, 10) with the 'state' number set to 1, followed by roundn(T, −4). With $A_r$, $B_r$, $C_r$ as a starting point, the developed algorithm is applied to recover $T$ for all three types of measurements. The same $A_r$ and $B_r$ are used in each case:

$$A_r = \begin{bmatrix}
31.5772 & 0.4719 & -250.9964 & -12.9107 & 174.2995 & 115.4442 & -159.6858 & 221.3902 & 238.6230 & -134.5963 \\
-27.8172 & 8.0152 & -106.2675 & -12.3162 & 89.8971 & 47.2969 & 30.0849 & 70.8482 & 59.0970 & -14.5968 \\
30.5158 & -12.9847 & -256.4920 & 10.5704 & 208.3751 & 128.3163 & -132.2642 & 227.5725 & 250.7146 & -119.4672 \\
12.0534 & -24.4802 & 163.6390 & -17.3122 & -119.5060 & -10.7080 & 98.0775 & -157.2208 & -122.1427 & 45.0349 \\
6.3372 & 38.2879 & -55.9722 & 11.0123 & 7.6464 & -52.0802 & -139.1587 & 80.1590 & 45.3885 & -37.4839 \\
-199.2495 & 179.6510 & 182.9144 & -18.9981 & -174.7511 & -300.6778 & 168.7951 & -167.1952 & -361.5975 & 198.8359 \\
148.3987 & -136.9942 & -228.8128 & 0.1791 & 210.6520 & 294.7894 & -130.4766 & 189.3386 & 350.4712 & -191.6219 \\
3.8285 & 28.7391 & -303.2824 & -9.9223 & 230.3930 & 125.5443 & -156.5079 & 262.7219 & 265.4900 & -143.3747 \\
94.4077 & -155.8685 & -79.7930 & 20.6102 & 135.5363 & 225.9230 & 84.1813 & 39.2946 & 170.9865 & -46.5709 \\
-212.7583 & 162.2665 & 114.9585 & -3.0748 & -104.4942 & -277.1191 & 205.4575 & -115.3557 & -322.0591 & 214.0116
\end{bmatrix}$$

$$B_r = \begin{bmatrix}
2.6126 & -0.3076 & 0.2061 \\
0.7034 & -0.0445 & -0.7305 \\
2.6784 & -0.7420 & 1.3573 \\
-1.4802 & 0.8697 & -0.6955 \\
0.7478 & -0.5344 & 0.6277 \\
-2.4772 & -0.2115 & -1.5861 \\
2.7856 & 0.2333 & 1.1843 \\
3.2725 & -0.6764 & 1.3261 \\
0.5568 & 0.2565 & 0.1026 \\
-2.2875 & -0.5816 & -1.9790
\end{bmatrix}$$

Recall that once the upper partition of T which is T1 , is found, the lower partition of T which is T2 can also be found from T2 = T1 Ar . Thus it is sufficient in these examples to demonstrate that T1 can be recovered exactly. Once T is known the physical parameters of the system given in terms of the mass, stiffness, and damping matrices can be found after transforming the model Ar , Br , Cr to physical coordinates.

7.1 Case 1: Displacement Measurements

The measurement matrix $C_p$ is generated by the Matlab random number generator randn(3, 5) with the 'state' number set to 3, followed by roundn(C_p, −4),

$$C_p = \begin{bmatrix}
0.9280 & -0.7230 & 0.2673 & 1.3991 & -0.6441 \\
0.1733 & -0.5744 & 1.3345 & -0.5725 & 0.8676 \\
-0.6916 & -0.3077 & -1.3311 & -0.3467 & 1.6390
\end{bmatrix}$$

The corresponding $C_r$ is

$$C_r = \begin{bmatrix}
2.0106 & -3.2946 & -0.8924 & -0.8855 & 0.4846 & 3.8178 & 2.0986 & -0.3830 & 1.9560 & -0.6807 \\
-1.9213 & 0.7814 & -1.4906 & -0.6300 & 0.5341 & -1.2016 & 2.4205 & 0.5526 & -1.3588 & 1.5913 \\
-0.5136 & 4.4255 & 3.3691 & -2.3286 & -3.4502 & -0.2656 & -3.8094 & -2.7826 & -1.8266 & -3.6786
\end{bmatrix}$$

From $C_p$ and $B$, we can find their intersection from Eq. (14). The results are

$$F_p = \begin{bmatrix} 0.5211 \\ -0.1318 \\ 0.3861 \end{bmatrix}, \qquad G_p = \begin{bmatrix} 0.0693 \\ -0.4161 \\ -0.0843 \end{bmatrix}$$

and

$$F_p^T C_p = G_p^T B^T = \begin{bmatrix} 0.1937 & -0.4198 & -0.5505 & 0.6706 & 0.1828 \end{bmatrix}$$

The matrix $Q = B^T T_1$ obtained by the eigen-decomposition method is found to be

$$Q = \begin{bmatrix}
0.4161 & -2.7160 & 1.0352 & 1.8492 & -0.6224 & -2.1557 & 2.5162 & -0.5282 & -1.6013 & 3.7052 \\
-2.8649 & 0.6613 & -1.7397 & 3.8039 & 3.1655 & -5.0897 & 1.6311 & 2.9866 & -1.3045 & 4.9205 \\
1.4012 & -4.1746 & -2.8058 & -2.1019 & -2.4932 & -0.9100 & 2.2757 & 0.7990 & -0.7259 & 2.2996
\end{bmatrix}$$

$T_1$ can then be found by Eq. (18) as

$$T_1 = \begin{bmatrix}
0.8644 & -1.9654 & 0.5750 & 0.6434 & -0.4663 & -0.1430 & 1.1795 & -0.4852 & -0.2781 & 1.3168 \\
0.0942 & -0.7443 & -0.8661 & 0.6819 & -0.3659 & -2.1619 & -0.6859 & 1.1970 & -0.3280 & 1.4422 \\
-0.8519 & -0.5523 & -2.1165 & 0.0147 & 1.1183 & -0.6442 & 1.6768 & 1.3948 & -0.0125 & 1.4669 \\
0.8735 & -0.8197 & -0.9645 & -1.3015 & -0.4656 & 1.4396 & -0.2553 & 0.1654 & 0.9032 & -1.1071 \\
-0.4380 & 1.1091 & 0.2127 & -1.2846 & -1.5608 & -0.8469 & -0.6475 & -0.5100 & -1.1125 & -0.4609
\end{bmatrix}$$

which matches the upper partition of T exactly. Thus we have validated the algorithm in the case of displacement measurements for this example.
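As a rough cross-check of the printed four-digit values for this case, the two relations $C_p T_1 = C_r$ (Eq. (13)) and $Q = B^T T_1$ can be evaluated directly. The sketch below only re-uses numbers printed in this section; the tolerance reflects four-digit rounding and is an assumption of this check, not a statement from the paper.

```python
import numpy as np

T1 = np.array([
    [ 0.8644, -1.9654,  0.5750,  0.6434, -0.4663, -0.1430,  1.1795, -0.4852, -0.2781,  1.3168],
    [ 0.0942, -0.7443, -0.8661,  0.6819, -0.3659, -2.1619, -0.6859,  1.1970, -0.3280,  1.4422],
    [-0.8519, -0.5523, -2.1165,  0.0147,  1.1183, -0.6442,  1.6768,  1.3948, -0.0125,  1.4669],
    [ 0.8735, -0.8197, -0.9645, -1.3015, -0.4656,  1.4396, -0.2553,  0.1654,  0.9032, -1.1071],
    [-0.4380,  1.1091,  0.2127, -1.2846, -1.5608, -0.8469, -0.6475, -0.5100, -1.1125, -0.4609],
])
C_p = np.array([
    [ 0.9280, -0.7230,  0.2673,  1.3991, -0.6441],
    [ 0.1733, -0.5744,  1.3345, -0.5725,  0.8676],
    [-0.6916, -0.3077, -1.3311, -0.3467,  1.6390],
])
B = np.array([
    [ 1.7491, -0.5273, 1.7411],
    [ 0.1326,  0.9323, 0.4868],
    [ 0.3252,  1.1647, 1.0488],
    [-0.7938, -2.0457, 1.4886],
    [ 0.3149, -0.6444, 1.2705],
])
C_r = np.array([
    [ 2.0106, -3.2946, -0.8924, -0.8855,  0.4846,  3.8178,  2.0986, -0.3830,  1.9560, -0.6807],
    [-1.9213,  0.7814, -1.4906, -0.6300,  0.5341, -1.2016,  2.4205,  0.5526, -1.3588,  1.5913],
    [-0.5136,  4.4255,  3.3691, -2.3286, -3.4502, -0.2656, -3.8094, -2.7826, -1.8266, -3.6786],
])
Q = np.array([
    [ 0.4161, -2.7160,  1.0352,  1.8492, -0.6224, -2.1557,  2.5162, -0.5282, -1.6013,  3.7052],
    [-2.8649,  0.6613, -1.7397,  3.8039,  3.1655, -5.0897,  1.6311,  2.9866, -1.3045,  4.9205],
    [ 1.4012, -4.1746, -2.8058, -2.1019, -2.4932, -0.9100,  2.2757,  0.7990, -0.7259,  2.2996],
])

err_Cr = np.max(np.abs(C_p @ T1 - C_r))   # residual of Eq. (13)
err_Q = np.max(np.abs(B.T @ T1 - Q))      # residual of Q = B^T T1
print(err_Cr, err_Q)
```

Both residuals should be at the level of the rounding applied to the printed matrices.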

7.2 Case 2: Velocity Measurements

The measurement matrix $C_v$ is generated by the Matlab random number generator randn(3, 5) with the 'state' number set to 4, followed by roundn(C_v, −4),

$$C_v = \begin{bmatrix}
1.8106 & 0.0754 & 0.9811 & 0.8875 & -0.5972 \\
0.2150 & 0.0440 & 0.3866 & -1.4401 & -1.0254 \\
0.6941 & 0.4003 & 0.9561 & -0.1747 & -1.5929
\end{bmatrix}$$

The corresponding $C_r$ is

$$C_r = \begin{bmatrix}
-1.4294 & -1.7608 & 2.6408 & 2.7554 & -0.8763 & -0.7190 & -0.5922 & 2.3262 & -3.4706 & 0.2510 \\
1.2291 & 4.0838 & 1.0669 & 3.3993 & -1.5595 & -0.5759 & 1.1412 & -2.1288 & -4.5373 & -1.5176 \\
-0.4605 & 2.4937 & 3.1444 & 4.3150 & -2.1693 & -0.9052 & 0.9724 & -0.8994 & -5.9578 & -0.6824
\end{bmatrix}$$

From $C_v$ and $B$, we can find their intersection from Eq. (21). The results are

$$F_v = \begin{bmatrix} -0.3243 \\ 0.3455 \\ 0.1605 \end{bmatrix}, \qquad G_v = \begin{bmatrix} 0.0792 \\ 0.1810 \\ -0.2554 \end{bmatrix}$$

Therefore,

$$F_v^T C_v = G_v^T B^T = \begin{bmatrix} -0.4015 & 0.0550 & -0.0312 & -0.8134 & -0.4162 \end{bmatrix}$$

The matrix $Q = B^T T_1$ obtained by the eigen-decomposition method is found to be

$$Q = \begin{bmatrix}
0.4161 & -2.7160 & 1.0352 & 1.8492 & -0.6224 & -2.1557 & 2.5162 & -0.5282 & -1.6013 & 3.7052 \\
-2.8649 & 0.6613 & -1.7397 & 3.8039 & 3.1655 & -5.0897 & 1.6311 & 2.9866 & -1.3045 & 4.9205 \\
1.4012 & -4.1746 & -2.8058 & -2.1019 & -2.4932 & -0.9100 & 2.2757 & 0.7990 & -0.7259 & 2.2996
\end{bmatrix}$$

$T_1$ can then be found from Eq. (25) as

$$T_1 = \begin{bmatrix}
0.8644 & -1.9654 & 0.5750 & 0.6434 & -0.4663 & -0.1430 & 1.1795 & -0.4852 & -0.2781 & 1.3168 \\
0.0942 & -0.7443 & -0.8661 & 0.6819 & -0.3659 & -2.1619 & -0.6859 & 1.1970 & -0.3280 & 1.4422 \\
-0.8519 & -0.5523 & -2.1165 & 0.0147 & 1.1183 & -0.6442 & 1.6768 & 1.3948 & -0.0125 & 1.4669 \\
0.8735 & -0.8197 & -0.9645 & -1.3015 & -0.4656 & 1.4396 & -0.2553 & 0.1654 & 0.9032 & -1.1071 \\
-0.4380 & 1.1091 & 0.2127 & -1.2846 & -1.5608 & -0.8469 & -0.6475 & -0.5100 & -1.1125 & -0.4609
\end{bmatrix}$$

which matches the upper partition of T exactly.

7.3 Case 3: Acceleration Measurements

The measurement matrix $C_a$ is generated by the Matlab random number generator randn(3, 5) with the 'state' number set to 5, followed by roundn(C_a, −4),

$$C_a = \begin{bmatrix}
0.9876 & 0.1726 & 1.1085 & -0.5620 & -0.6155 \\
0.2590 & -1.5290 & -1.0425 & -1.0893 & -1.1514 \\
-0.3801 & 0.0001 & 1.0402 & 0.2976 & -0.7726
\end{bmatrix}$$

The corresponding matrix $C_r$ is

$$C_r = \begin{bmatrix}
93.6930 & 152.1913 & 75.6846 & -154.4944 & -225.7149 & -1.3583 & -425.6783 & -34.7176 & 41.7421 & -285.6733 \\
-83.9071 & 197.7345 & -90.1565 & -67.5759 & -119.4814 & -264.4034 & -280.7909 & 119.3336 & -71.1668 & -36.8036 \\
56.1029 & 62.5117 & 214.0518 & -54.6139 & -265.1723 & -118.6467 & -182.7454 & -148.6276 & -131.5850 & -56.8529
\end{bmatrix}$$

From $C_a$ and $B$, we can find their intersection from Eq. (28). The results are

$$F_a = \begin{bmatrix} 0.4688 \\ 0.1722 \\ -0.5633 \end{bmatrix}, \qquad G_a = \begin{bmatrix} 0.6404 \\ -0.1444 \\ -0.2726 \end{bmatrix}$$

and

$$F_a^T C_a = G_a^T B^T = \begin{bmatrix} 0.7217 & -0.1824 & -0.2458 & -0.6187 & -0.0516 \end{bmatrix}$$

The matrix $Q = B^T T_1$ is obtained by the eigen-decomposition method as

$$Q = \begin{bmatrix}
0.4161 & -2.7160 & 1.0352 & 1.8492 & -0.6224 & -2.1557 & 2.5162 & -0.5282 & -1.6013 & 3.7052 \\
-2.8649 & 0.6613 & -1.7397 & 3.8039 & 3.1655 & -5.0897 & 1.6311 & 2.9866 & -1.3045 & 4.9205 \\
1.4012 & -4.1746 & -2.8058 & -2.1019 & -2.4932 & -0.9100 & 2.2757 & 0.7990 & -0.7259 & 2.2996
\end{bmatrix}$$

$T_1$ can then be found by Eq. (31) as

$$T_1 = \begin{bmatrix}
0.8644 & -1.9654 & 0.5750 & 0.6434 & -0.4663 & -0.1430 & 1.1795 & -0.4852 & -0.2781 & 1.3168 \\
0.0942 & -0.7443 & -0.8661 & 0.6819 & -0.3659 & -2.1619 & -0.6859 & 1.1970 & -0.3280 & 1.4422 \\
-0.8519 & -0.5523 & -2.1165 & 0.0147 & 1.1183 & -0.6442 & 1.6768 & 1.3948 & -0.0125 & 1.4669 \\
0.8735 & -0.8197 & -0.9645 & -1.3015 & -0.4656 & 1.4396 & -0.2553 & 0.1654 & 0.9032 & -1.1071 \\
-0.4380 & 1.1091 & 0.2127 & -1.2846 & -1.5608 & -0.8469 & -0.6475 & -0.5100 & -1.1125 & -0.4609
\end{bmatrix}$$

which matches the upper partition of $T$ exactly.


8 Conclusions

In summary, the highly nonlinear inverse problem of identifying the physical parameters of a finite-element model of a structure from input-output measurements is of both theoretical and practical interest. Applications include detecting damage in bridges or buildings, or tumors in soft tissues. Recent methods take advantage of modern system identification techniques that can identify state-space models of dynamical systems accurately from noisy measurements. We pioneered an approach that solves the originally nonlinear problem by two successive linear problems. The first linear problem transforms an identified state-space model in some unknown and arbitrary coordinates to physical coordinates, and the second linear problem finds the physical parameters of the structure from the model obtained in the first linear problem. Our original solution required a full set of sensors, one per degree of freedom, but a recent breakthrough extended the two-step approach to the case where a full set of sensors is not required. Furthermore, our most recent eigen-decomposition based method offers an orders-of-magnitude reduction in computational requirements, which enables these methods to handle systems with very high numbers of degrees of freedom. In this paper, we extend our family of solutions one step further to the case where each measurement can involve more than one finite-element degree of freedom, each actuated degree of freedom can involve more than one actuator, and each actuator can directly affect more than one degree of freedom. Mathematically, this extension turns out to be non-trivial. To address this problem, we generalize the concept of actuator and sensor collocation to the above situation, and develop a solution to the problem where the measurements can be displacements, velocities, or accelerations. Numerical examples are provided to illustrate this extension to our family of solutions.

Appendix 1: A Numerically Stable Solution for Q

Additional details of the results summarized in this Appendix can be found in [34]. From $Q A^k B_1 = R (A^T)^k B_2$, the linear problem that solves for $Q$ has the form

$$Q C_1 = R C_2 \tag{32}$$

where $C_1$ and $C_2$ have the structure of "wide" controllability-like matrices,

$$C_1 = \begin{bmatrix} B_1 & A B_1 & A^2 B_1 & \cdots & A^{2n-1} B_1 \end{bmatrix}, \qquad C_2 = \begin{bmatrix} B_2 & (A^T) B_2 & (A^T)^2 B_2 & \cdots & (A^T)^{2n-1} B_2 \end{bmatrix} \tag{33}$$

The matrices $C_1$ and $C_2$ are specified in terms of $A$, $B_1$, and $B_2$, and each has a total of $2n$ rows. In this form, solving for $Q$ only requires the inversion of a $2n$-by-$2n$ matrix,

$$Q = R C_2 C_1^T (C_1 C_1^T)^{-1} \tag{34}$$

As long as $C_1$ is full rank, $Q$ can be solved for uniquely as shown in Eq. (34). This solution requires the inverse of $C_1 C_1^T$, which is a $2n$-by-$2n$ matrix. It is an improvement over the original Kronecker product based method of [33], which requires the inversion of a matrix whose dimensions are approximately of the order $n^2$-by-$n^2$. One major issue remains, however. The solution in Eq. (34) is not numerically stable because $C_1$ and $C_2$ involve powers of $A$ and $A^T$ up to $2n - 1$. It is common for the eigenvalues of $A$ or $A^T$ to have magnitudes larger than 1. Therefore, when raised to high powers, these matrices can grow explosively to very large magnitudes. In this Appendix we summarize a specialized solution for $Q$ that eliminates this issue. Let the eigen-decomposition of $A$ be

$$A = U \Lambda U^{-1} \tag{35}$$

$$A^T = (U^{-1})^T \Lambda U^T \tag{36}$$

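where $\Lambda$ is a diagonal matrix of the eigenvalues of $A$,

$$\Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_{2n} \end{bmatrix} \tag{37}$$

The matrices $A^k = U \Lambda^k U^{-1}$, $(A^T)^k = (U^{-1})^T \Lambda^k U^T$, $k = 0, 1, \ldots, 2n - 1$, will grow to very large magnitudes for high-DOF systems if any eigenvalue of $A$ has magnitude larger than 1. We would like to eliminate this source of numerical instability in the solution for $Q$. Let $z_i$ denote the columns of $B_1$, $i = 1, 2, \ldots, l$. We define the column vectors made up of $\alpha_{ji}$, $j = 1, 2, \ldots, 2n$, as

$$U^{-1} B_1 = U^{-1} \begin{bmatrix} z_1 & z_2 & \cdots & z_l \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} \alpha_{11} \\ \alpha_{21} \\ \vdots \\ \alpha_{2n,1} \end{bmatrix} & \begin{bmatrix} \alpha_{12} \\ \alpha_{22} \\ \vdots \\ \alpha_{2n,2} \end{bmatrix} & \cdots & \begin{bmatrix} \alpha_{1l} \\ \alpha_{2l} \\ \vdots \\ \alpha_{2n,l} \end{bmatrix} \end{bmatrix} \tag{38}$$

Likewise, let $b_i$ denote the columns of $B_2$, $i = 1, 2, \ldots, l$, and define the column vectors made up of $\beta_{ji}$, $j = 1, 2, \ldots, 2n$, as

$$U^T B_2 = U^T \begin{bmatrix} b_1 & b_2 & \cdots & b_l \end{bmatrix} = \begin{bmatrix} \begin{bmatrix} \beta_{11} \\ \beta_{21} \\ \vdots \\ \beta_{2n,1} \end{bmatrix} & \begin{bmatrix} \beta_{12} \\ \beta_{22} \\ \vdots \\ \beta_{2n,2} \end{bmatrix} & \cdots & \begin{bmatrix} \beta_{1l} \\ \beta_{2l} \\ \vdots \\ \beta_{2n,l} \end{bmatrix} \end{bmatrix} \tag{39}$$

Thus, for each column $z_i$ and $b_i$,

$$U^{-1} z_i = \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix}, \qquad U^T b_i = \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} \tag{40}$$

Therefore,

$$A^k z_i = U \Lambda^k U^{-1} z_i = U \Lambda^k \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix} \tag{41}$$

$$(A^T)^k b_i = (U^{-1})^T \Lambda^k U^T b_i = (U^{-1})^T \Lambda^k \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} \tag{42}$$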

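Let us examine Eq. (32) associated with the $i$th columns of $B_1$ and $B_2$,

$$Q \begin{bmatrix} z_i & A z_i & A^2 z_i & \cdots & A^{2n-1} z_i \end{bmatrix} = R \begin{bmatrix} b_i & A^T b_i & (A^T)^2 b_i & \cdots & (A^T)^{2n-1} b_i \end{bmatrix} \tag{43}$$

The left hand side of Eq. (43) becomes

$$Q U \begin{bmatrix} \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix} & \Lambda \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix} & \cdots & \Lambda^{2n-1} \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix} \end{bmatrix} \tag{44}$$

The right hand side of Eq. (43) becomes

$$R (U^{-1})^T \begin{bmatrix} \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} & \Lambda \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} & \cdots & \Lambda^{2n-1} \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} \end{bmatrix} \tag{45}$$

Furthermore, since $\Lambda^k$ is a diagonal matrix, we can reverse the order of the multiplications,

$$\Lambda^k \begin{bmatrix} \alpha_{1i} \\ \alpha_{2i} \\ \vdots \\ \alpha_{2n,i} \end{bmatrix} = \begin{bmatrix} \alpha_{1i} & & & \\ & \alpha_{2i} & & \\ & & \ddots & \\ & & & \alpha_{2n,i} \end{bmatrix} \begin{bmatrix} \lambda_1^k \\ \lambda_2^k \\ \vdots \\ \lambda_{2n}^k \end{bmatrix} \tag{46}$$

$$\Lambda^k \begin{bmatrix} \beta_{1i} \\ \beta_{2i} \\ \vdots \\ \beta_{2n,i} \end{bmatrix} = \begin{bmatrix} \beta_{1i} & & & \\ & \beta_{2i} & & \\ & & \ddots & \\ & & & \beta_{2n,i} \end{bmatrix} \begin{bmatrix} \lambda_1^k \\ \lambda_2^k \\ \vdots \\ \lambda_{2n}^k \end{bmatrix} \tag{47}$$

Therefore, Eq. (43) can be expressed as

$$Q U \begin{bmatrix} \alpha_{1i} & & & \\ & \alpha_{2i} & & \\ & & \ddots & \\ & & & \alpha_{2n,i} \end{bmatrix} \begin{bmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{2n-1} \\ 1 & \lambda_2 & \cdots & \lambda_2^{2n-1} \\ \vdots & \vdots & & \vdots \\ 1 & \lambda_{2n} & \cdots & \lambda_{2n}^{2n-1} \end{bmatrix} = R (U^{-1})^T \begin{bmatrix} \beta_{1i} & & & \\ & \beta_{2i} & & \\ & & \ddots & \\ & & & \beta_{2n,i} \end{bmatrix} \begin{bmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{2n-1} \\ 1 & \lambda_2 & \cdots & \lambda_2^{2n-1} \\ \vdots & \vdots & & \vdots \\ 1 & \lambda_{2n} & \cdots & \lambda_{2n}^{2n-1} \end{bmatrix} \tag{48}$$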

where the eigenvalues of $A$, which are the sources of numerical instability, are isolated in the Vandermonde matrix that appears on both sides of Eq. (48). If $A$ has no repeated eigenvalues, the Vandermonde matrix is full rank, and can be removed from both sides of Eq. (48). Thus the source that causes $A$ and $A^T$ to grow as they are raised to high powers is removed. For compactness of notation, define the diagonal matrices $[\alpha]_i$ and $[\beta]_i$ of the $\alpha_{ji}$ and $\beta_{ji}$ coefficients as

$$[\alpha]_i = \begin{bmatrix} \alpha_{1i} & & & \\ & \alpha_{2i} & & \\ & & \ddots & \\ & & & \alpha_{2n,i} \end{bmatrix}, \qquad [\beta]_i = \begin{bmatrix} \beta_{1i} & & & \\ & \beta_{2i} & & \\ & & \ddots & \\ & & & \beta_{2n,i} \end{bmatrix} \tag{49}$$

We have a relationship for $Q$ associated with the $i$th columns of $B_1$ and $B_2$, $i = 1, 2, \ldots, l$,

$$Q U [\alpha]_i = R (U^{-1})^T [\beta]_i \tag{50}$$

Writing Eq. (50) for all columns of $B_1$ and $B_2$ and packaging them together produces

$$Q U \begin{bmatrix} [\alpha]_1 & [\alpha]_2 & \cdots & [\alpha]_l \end{bmatrix} = R (U^{-1})^T \begin{bmatrix} [\beta]_1 & [\beta]_2 & \cdots & [\beta]_l \end{bmatrix} \tag{51}$$

Post-multiplying both sides of Eq. (51) by the transpose of $\begin{bmatrix} [\alpha]_1 & [\alpha]_2 & \cdots & [\alpha]_l \end{bmatrix}$ produces

$$Q U \left( \sum_{i=1}^{l} [\alpha]_i^2 \right) = R (U^{-1})^T \left( \sum_{i=1}^{l} [\beta]_i [\alpha]_i \right) \tag{52}$$

from which $Q$ can be solved,

$$Q = R (U^{-1})^T \left( \sum_{i=1}^{l} [\beta]_i [\alpha]_i \right) \left( \sum_{i=1}^{l} [\alpha]_i^2 \right)^{-1} U^{-1} \tag{53}$$

This solution requires the eigen-decomposition of the matrix $A$, which is a $2n$-by-$2n$ matrix for an $n$-DOF system, and it eliminates the source of numerical instability associated with the solution given in Eq. (34).
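Equation (53) can be sketched in a few lines of NumPy. The code below is an illustrative reading of the formula, not the authors' implementation: the function name `solve_Q_eig`, the synthetic 4-DOF model, and the well-conditioned random transformation are assumptions made here. The example uses displacement measurements with $m + r = n + 1$, so that the intersection subspace is one-dimensional, and it assumes the state matrix form $A = \begin{bmatrix} 0 & I \\ -K & -C \end{bmatrix}$ with $M = I$.

```python
import numpy as np

def solve_Q_eig(A, B1, R, B2):
    """Eq. (53): Q = R U^{-T} (sum_i [beta]_i [alpha]_i) (sum_i [alpha]_i^2)^{-1} U^{-1}."""
    _, U = np.linalg.eig(A)              # A = U Lambda U^{-1}; no powers of A are formed
    U_inv = np.linalg.inv(U)
    alpha = U_inv @ B1                   # columns hold the alpha_{ji} of Eq. (38)
    beta = U.T @ B2                      # columns hold the beta_{ji} of Eq. (39); plain transpose
    num = np.sum(beta * alpha, axis=1)   # diagonal of sum_i [beta]_i [alpha]_i
    den = np.sum(alpha * alpha, axis=1)  # diagonal of sum_i [alpha]_i^2 (no conjugation)
    Q = ((R @ U_inv.T) * (num / den)) @ U_inv
    return Q.real                        # imaginary part is round-off for real data

# Synthetic model (illustrative values only).
rng = np.random.default_rng(2)
n, m, r = 4, 2, 3
Lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
K, Cd = 50.0 * Lap, Lap                  # symmetric stiffness and damping, M = I
A = np.block([[np.zeros((n, n)), np.eye(n)], [-K, -Cd]])
B = rng.standard_normal((n, r))
C_p = rng.standard_normal((m, n))
Qo, _ = np.linalg.qr(rng.standard_normal((2 * n, 2 * n)))
T = Qo * np.linspace(1.0, 2.0, 2 * n)    # well-conditioned "unknown" transformation
T1 = T[:n]
A_r = np.linalg.solve(T, A @ T)
B_r = np.linalg.solve(T, np.vstack([np.zeros((n, r)), B]))
C_r = C_p @ T1                           # Eq. (13)

# Intersection subspace (Appendix 2): null vector of [C_p^T  B].
_, _, Vh = np.linalg.svd(np.hstack([C_p.T, B]))
H = Vh[n:].T
F_p, G_p = H[:m], -H[m:]

Q = solve_Q_eig(A_r, B_r @ G_p, B_r.T, C_r.T @ F_p)
print(np.allclose(Q, B.T @ T1, atol=1e-6))
```

Only the eigenvectors enter the computation, which mirrors the observation in the text that the eigenvalue powers cancel with the Vandermonde matrix. The sketch assumes distinct eigenvalues and non-vanishing entries of $\sum_i [\alpha]_i^2$.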

Appendix 2: Intersection Subspace

Let both $C_m$ and $B^T$ be wide matrices. To find the intersection of the row spaces of $C_m$ and $B^T$, let $J_m$ denote the common row vector(s) that lie in both the row space of $C_m$ and the row space of $B^T$, i.e.,

$$J_m = F_m^T C_m = G_m^T B^T \tag{54}$$

where the columns of $F_m$ and $G_m$ are the coefficients that express each row vector of $J_m$ in terms of the rows of $C_m$ and $B^T$, respectively. The transpose of Eq. (54) can be expressed in a compact form as $R H = 0$, where

$$R = \begin{bmatrix} C_m^T & B \end{bmatrix}, \qquad H = \begin{bmatrix} F_m \\ -G_m \end{bmatrix} \tag{55}$$

Thus each column vector of $H$ lies in the null space of $R$. If the intersection subspace between $C_m$ and $B^T$ is not null, then $H$ is not zero. In this case, when the singular value decomposition $R = U S V^T$ is performed, we have $S = \begin{bmatrix} S_1 & 0 \end{bmatrix}$. Let $V = \begin{bmatrix} V_1 & V_2 \end{bmatrix}$, then

$$R = U \begin{bmatrix} S_1 & 0 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix} \tag{56}$$

Because $R V_2 = 0$, one choice for $H$ can be $H = V_2$. The matrix $F_m$ can therefore be extracted from the upper $m$ rows of $H$, and $G_m$ from the lower $r$ rows of $-H$. For $V_2$ to exist, $R$ must be a wide matrix (more columns than rows), which implies that the number of sensors plus the number of actuators must be more than the number of degrees of freedom. Furthermore, $R$ must be full rank so that the transformation matrix $T$ that converts an identified state-space model to physical coordinates can be uniquely solved for.
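The construction of Eqs. (54) to (56) can be sketched with a full singular value decomposition. The function name `intersection_subspace` and the rank tolerance are assumptions of this sketch. It is applied to the $C_p$ and $B$ printed in Sect. 7; because a null vector is only determined up to scale and sign, the resulting row vector is normalized and sign-fixed before being compared with the printed $F_p^T C_p$, which appears to have unit norm.

```python
import numpy as np

def intersection_subspace(C_m, B, tol=1e-10):
    """Return F, G with F^T C_m = G^T B^T, from the null space of R = [C_m^T  B]."""
    m = C_m.shape[0]
    R = np.hstack([C_m.T, B])                 # n x (m + r), Eq. (55)
    _, s, Vh = np.linalg.svd(R)               # full SVD: Vh is (m + r) x (m + r)
    rank = int(np.sum(s > tol * s[0]))
    H = Vh[rank:].T                           # columns span the null space (V_2 in Eq. (56))
    return H[:m], -H[m:]                      # F_m from the top rows, G_m from -H below

C_p = np.array([
    [ 0.9280, -0.7230,  0.2673,  1.3991, -0.6441],
    [ 0.1733, -0.5744,  1.3345, -0.5725,  0.8676],
    [-0.6916, -0.3077, -1.3311, -0.3467,  1.6390],
])
B = np.array([
    [ 1.7491, -0.5273, 1.7411],
    [ 0.1326,  0.9323, 0.4868],
    [ 0.3252,  1.1647, 1.0488],
    [-0.7938, -2.0457, 1.4886],
    [ 0.3149, -0.6444, 1.2705],
])

F, G = intersection_subspace(C_p, B)
J = (F.T @ C_p).ravel()
J = J / np.linalg.norm(J)                     # remove the arbitrary scale
J = J if J[3] > 0 else -J                     # remove the arbitrary sign
print(np.round(J, 4))
```

Any nonzero scaling of the pair $(F_m, G_m)$ satisfies Eq. (54) equally well, so the normalization used for comparison here is a convenience rather than part of the method.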

References

1. Agbabian, M.S., Masri, S.F., Miller, R.K., Caughey, T.K.: System identification approach to detection of structural changes. ASCE J. Eng. Mech. 117(2), 370–390 (1991)
2. Aitken, J.M., Clarke, T.: Observer/Kalman filter identification with wavelets. IEEE Trans. Signal Process. 60(7), 3476–3485 (2012)
3. Alvin, K.F., Park, K.C.: Second-order structural identification procedure via state-space based system identification. AIAA J. 32(2), 397–406 (1994)
4. Beck, J.L., Jennings, P.C.: Structural identification using linear models and earthquake records. Earthq. Eng. Struct. Dynam. 8, 145–160 (1980)
5. Chang, M., Pakzad, S.N.: Observer Kalman filter identification for output-only systems using interactive structural modal identification toolsuite (SMIT). J. Bridge Eng. 19(5), 04014002-1–04014002-11 (2014)
6. Chiuso, A.: The role of vector autoregressive modeling in predictor based subspace identification. Automatica 43(6), 1034–1048 (2007)
7. De Angelis, M., Lus, H., Betti, R., Longman, R.W.: Extracting physical parameters of mechanical models from identified state space representations. ASME J. Appl. Mech. 69, 617–625 (2002)
8. Figueiredo, E., Park, G., Figueiras, J., Farrar, C., Worden, K.: Structural health monitoring algorithm comparisons using standard data sets. Los Alamos National Laboratory Internal Report LA-14393, Los Alamos, NM (2009)
9. Friswell, M.I., Garvey, S.D., Penny, J.E.T.: Extracting second-order systems from state-space representations. AIAA J. 37(1), 132–135 (1999)
10. Guillet, J., Mourllion, B., Birouche, A., Basset, M.: Extracting second-order structures from single-input state-space models: application to model order reduction. Int. J. Appl. Math Comput. Sci. 21(3), 509–519 (2011)
11. Houlston, P.R.: Extracting second-order system matrices from state space system. Proc. Inst. Mech. Eng., Part C: J. Mech. Eng. Sci. 220(8), 1147–1149 (2006)
12. Koh, B.-H.: Damage identification in smart structures through sensitivity enhancing control. Ph.D. Thesis, Thayer School of Engineering, Dartmouth College, Hanover, NH, March (2003)
13. Juang, J.-N., Horta, L.G., Phan, M.Q.: System/Observer/Controller Identification Toolbox. NASA TM-107566 (1992)
14. Juang, J.-N., Phan, M.Q., Horta, L.G., Longman, R.W.: Identification of observer/Kalman filter Markov parameters: theory and experiments. J. Guid. Control Dyn. 16(2), 320–329 (1993)
15. Juang, J.-N.: Applied System Identification. Prentice Hall, Englewood Cliffs, NJ (1994)
16. Koh, B.H., Dharap, P., Nagarajaiah, S., Phan, M.Q.: Real-time structural damage monitoring by input error function. AIAA J. 43(8), 1808–1814 (2005)
17. Koh, B.H., Li, Z., Dharap, P., Nagarajaiah, S., Phan, M.Q.: Actuator failure detection through interaction matrix formulation. J. Guid. Control Dyn. 28(5), 895–901 (2005)
18. Koh, B.H., Nagarajaiah, S., Phan, M.Q.: Reconstructing structural changes in a dynamic system from experimentally identified state-space models. J. Mech. Sci. Technol. 22, 103–112 (2008)
19. Lin, P., Phan, M.Q., Ketcham, S.A.: State-space model and Kalman filter gain identification by a superspace method. In: Bock, H.G., Phu, H.X., Rannacher, R., Schloeder, J. (eds.) Modeling, Simulation, and Optimization of Complex Processes, pp. 121–132. Springer (2014)
20. Majji, M., Juang, J.-N., Junkins, J.L.: Observer/Kalman-filter time-varying system identification. J. Guid. Control Dyn. 33(3), 887–900 (2010)
21. Mottershead, J.E., Friswell, M.I.: Model updating in structural dynamics: a survey. J. Sound Vib. 165(2), 347–375 (1993)
22. Phan, M.Q., Horta, L.G., Juang, J.-N., Longman, R.W.: Linear system identification via an asymptotically stable observer. J. Optim. Theory Appl. 79(1), 59–86 (1993)
23. Phan, M.Q., Horta, L.G., Juang, J.-N., Longman, R.W.: Improvement of observer/Kalman filter identification (OKID) by residual whitening. J. Vibr. Acoust. 117, 232–238 (1995)
24. Phan, M.Q., Longman, R.W.: Extracting mass, stiffness, and damping matrices from identified state-space models. In: AIAA-2004-5415, Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Providence, RI (2004)
25. Phan, M.Q.: Interaction matrices in system identification and control. In: Proceedings of the 15th Yale Workshop on Adaptive and Learning Systems, New Haven, CT (2011)
26. Phan, M.Q., Vicario, F., Longman, R.W., Betti, R.: Observer/Kalman filter identification by a Kalman filter of a Kalman filter (OKID²). In: AAS 17-201, AAS/AIAA Spaceflight Mechanics Meeting, San Antonio, TX (2017)
27. Phan, M.Q., Vicario, F., Longman, R.W., Betti, R.: State-space model and Kalman filter gain identification by a Kalman filter of a Kalman filter. ASME J. Dyn. Sys., Meas. Control 140(3) (2018)
28. Qin, S.J.: An overview of subspace identification. Comput. Chem. Eng. 30(10–12), 1502–1513 (2006)
29. Smyth, A.W., Pei, J.-S.: Integration of measured response signals for nonlinear structural health monitoring. In: Proceedings of the Third US-Japan Workshop on Nonlinear System Identification and Health Monitoring, Los Angeles, CA (2000)
30. Tseng, D.-H., Longman, R.W., Juang, J.-N.: Identification of the structure of the damping matrix in second order mechanical systems. Adv. Astronaut. Sci. 87, 167–190 (1994)
31. Tseng, D.-H., Longman, R.W., Juang, J.-N.: Identification of gyroscopic and nongyroscopic second-order mechanical systems including repeated problems. Adv. Astronaut. Sci. 87, 145–165 (1994)
32. Tseng, D.H., Phan, M.Q., Longman, R.W.: Mass, stiffness, and damping matrices from an identified state-space model by Sylvester equations. Adv. Astronaut. Sci. 156, 1831–1851 (2016)
33. Tseng, D.H., Phan, M.Q., Longman, R.W.: Converting a state-space model to physical coordinates without a full set of sensors: a Kronecker-based method. In: AAS 17-205, AAS/AIAA Space Flight Mechanics Meeting, San Antonio, TX (2017)
34. Tseng, D.H., Phan, M.Q., Longman, R.W.: Converting to physical coordinates with or without a full set of sensors by eigen-decomposition of identified state-space models. In: AAS 17-611, AAS/AIAA Astrodynamics Specialist Conference, 3239–3258 (2017)
35. Tseng, D.H., Phan, M.Q., Longman, R.W.: Mass, stiffness, and damping matrices from state-space models in physical coordinates by eigen-decomposition of a special matrix. In: AAS 17-612, AAS/AIAA Astrodynamics Specialist Conference, 3259–3278 (2017)
36. Van Overschee, P., De Moor, B.: N4SID: subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica 30(1), 75–93 (1994)
37. Van Overschee, P., De Moor, B.: Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, Dordrecht, The Netherlands (1996)
38. Van Houten, E.E.W., Miga, M.I., Weaver, J.B., Kennedy, F.E., K.D.: Three dimensional subzone-based reconstruction algorithm for MR elastography. Magn. Res. Med. 45, 827–837 (2001)
39. Vicario, F., Phan, M.Q., Raimondo, B., Longman, R.W.: Output-only observer/Kalman filter identification (O³KID). J. Struct. Control Health Monit. (2014). https://doi.org/10.1002/stc.1719
40. Vicario, F., Phan, M.Q., Betti, R., Longman, R.W.: OKID via output residuals: a converter from stochastic to deterministic system identification. J. Guid. Control Dyn. (2016). https://doi.org/10.2514/1.G001786

Monotonization of a Family of Implicit Schemes for the Burgers Equation

Alexander Kurganov and Petr N. Vabishchevich

Abstract We study numerical methods for convection-dominated fluid dynamics problems. In particular, we consider initial-boundary value problems for the Burgers equation with small diffusion coefficients. Our goal is to investigate several strategies, which can be used to monotonize numerical methods and to ensure nonoscillatory and positivity-preserving properties of the computed solutions. We focus on fully implicit finite-element methods constructed using the backward Euler time discretization combined with high-order spatial approximations. We experimentally study the following three monotonization approaches: mesh refinement, increasing the time-step size and utilizing higher-order finite-element approximations. Feasibility of these three strategies is demonstrated on a number of numerical examples for both one- and two-dimensional Burgers equations.

A. Kurganov (B)
Department of Mathematics, Southern University of Science and Technology, Shenzhen 518055, China
e-mail: [email protected]

P. N. Vabishchevich
Nuclear Safety Institute of the Russian Academy of Sciences, 115191 Moscow, Russia
e-mail: [email protected]
North-Eastern Federal University, 677000 Yakutsk, Russia

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_12

1 Introduction

We study numerical methods for convection-dominated nonlinear convection-diffusion equations; see, e.g., [2, 23, 27, 32]. In practice, it is often important to design numerical methods that produce monotone, non-oscillatory and positivity-preserving computed solutions. This has been done for both linear and nonlinear convective terms; see, e.g., [12, 14, 17] and references therein.

A classical approach to enforcing monotonicity of computed solutions of hyperbolic and convection-dominated problems is to introduce artificial viscosity. The concept of artificial viscosity was first introduced in 1950 in [31]. For modern artificial viscosity methods we refer the reader to, e.g., [8, 9, 15]. Non-oscillatory and positivity-preserving properties can also be ensured by using upwind approximations and nonlinear limiters; see, e.g., [7, 10, 13, 19] and references therein.

In this paper, we investigate several strategies that can be used to suppress spurious oscillations in the computed solutions of initial-boundary value problems (IBVPs) for the Burgers equation without adding artificial viscosity or utilizing any special approximations of the convection term. We implement standard finite-element spatial approximations [6, 18], which are widely used in computational fluid dynamics [5, 33].

One of the key points in designing a non-oscillatory numerical method is choosing an appropriate time discretization. When numerically solving IBVPs for linear convection-diffusion equations, two-level schemes (θ-method, schemes with weights) are traditionally widely used; see, e.g., [3, 20]. For these methods, the monotonicity of the computed solution is studied using the discrete maximum principle; see, e.g., [25, 26]. We would like to point out that even in the simplest purely diffusive case, the only unconditionally monotone scheme is the fully implicit backward Euler one. Other two-level schemes are only conditionally monotone, with a very strict monotonicity restriction on the time-step size τ, namely, τ = O(h^2), where h is the spatial mesh size. We therefore use only the fully implicit backward Euler method in the current study. We implement the implicit scheme with the help of the classical iterative Newton's method [24], which is used to obtain the solution at the new time level. Alternative approaches for the realization of implicit schemes are discussed in [11].

The main goal of this paper is to investigate the following three monotonization strategies. The first one is reducing the spatial mesh size.
This approach would lead to a monotone positivity-preserving numerical solution provided h is taken to be proportional to the diffusion coefficient. Such a restriction, however, may become impractical for multidimensional problems with small diffusion. The second strategy is increasing the time-step size. By doing this we take advantage of the fact that the backward Euler temporal discretization introduces a numerical diffusion proportional to τ. The numerical effect of this type of monotonization has recently been demonstrated in [30], where a simple implicit finite-element method for the isentropic gas dynamics equations was proposed. The third strategy is increasing the order (p) of the spatial finite-element approximation. As demonstrated in [4], this helps to suppress the oscillations in computed solutions of stationary convection-diffusion equations.

In this paper, we experimentally study the three aforementioned monotonization strategies in the context of nonlinear convection-diffusion equations in the convection-dominated regime. To this end, we vary the parameters h, τ and p and verify that reducing the spatial mesh size, increasing the time-step size or increasing the order of spatial approximation helps to obtain non-oscillatory, positivity-preserving numerical solutions.

The paper is organized as follows. In Sect. 2, we formulate the studied IBVP for the one-dimensional (1-D) Burgers equation and the family of implicit finite-element methods for its numerical solution. In Sect. 3, we conduct the aforementioned experimental monotonization study for the 1-D IBVP. In Sect. 4, we extend our study to the IBVP for the two-dimensional (2-D) Burgers equation. Some concluding remarks on efficiency and accuracy of the resulting method are given in Sect. 5.

Monotonization of a Family of Implicit Schemes …

2 One-Dimensional Problem

We consider the 1-D Burgers equation,

$$u_t + \left(\frac{u^2}{2}\right)_x = \nu u_{xx}, \quad x \in \Omega = (-1, 1), \quad t \in (0, T], \tag{1}$$

where ν > 0 is a constant diffusion coefficient, subject to the following initial and boundary conditions:

$$u(x, 0) = u_0(x), \quad x \in \Omega, \tag{2}$$

$$u_x(-1, t) = u_x(1, t) = 0, \quad t \in (0, T]. \tag{3}$$

We design the numerical method for the IBVP (1)–(3) based on finite-element spatial approximations; see, e.g., [29]. For the sake of simplicity, we use a uniform grid with the nodes x_j = −1 + jh, j = 0, . . . , M, Mh = |Ω| = 2. We define the finite-element space V^h ⊂ H^1(Ω) and approximate the solution on each finite element using polynomials of degree p. The approximate solution ψ ∈ V^h is then obtained from the following two integral equations:

$$\int_\Omega \psi_t v \, dx + \int_\Omega \left(\frac{\psi^2}{2}\right)_x v \, dx + \nu \int_\Omega \psi_x v_x \, dx = 0, \quad t \in (0, T], \tag{4}$$

$$\int_\Omega \psi(x, 0) v(x) \, dx = \int_\Omega u_0(x) v(x) \, dx, \tag{5}$$

which should be satisfied for all v ∈ V^h. Let τ be a uniform time-step size and denote ψ^n(x) := ψ(x, t^n), where t^n := nτ, n = 0, . . . , N, Nτ = T. In order to numerically solve the problem (4), (5), we use the fully implicit scheme, which results in the following integral equation:

$$\int_\Omega \frac{\psi^{n+1} - \psi^n}{\tau} v \, dx + \int_\Omega \left(\frac{(\psi^{n+1})^2}{2}\right)_x v \, dx + \nu \int_\Omega \psi_x^{n+1} v_x \, dx = 0, \quad n = 0, \ldots, N - 1, \tag{6}$$

subject to the corresponding initial condition:

$$\int_\Omega \psi^0 v \, dx = \int_\Omega u_0 v \, dx. \tag{7}$$


In order to solve the nonlinear variational problem (6), (7), the iterative Newton method is applied [24]. The computational implementation is based on the FEniCS platform for solving partial differential equations [1, 21]. A standard convergence/stopping criterion is used for the iterative process: the iterations are terminated when both the absolute and relative (with respect to the initial value) residuals become smaller than the prescribed tolerances. In the numerical experiments reported below, we take the tolerances 10^{-9} for the absolute residual and 10^{-8} for the relative one.
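To make the scheme (6) concrete, the following is a minimal sketch of one possible implementation for p = 1, written with plain NumPy rather than FEniCS. Everything in it is an assumption made for illustration: the function names are hypothetical, the matrices are dense, the initial data are interpolated at the nodes instead of being projected as in (7), the nodes are taken as x_j = −1 + jh, and only the absolute residual tolerance is checked.

```python
import numpy as np


def initial_condition(x):
    """Cosine bump on [-1, 0) and zero on [0, 1] (the 1-D test data used in Sect. 3)."""
    return np.where(x < 0.0, 1.0 + np.cos(2.0 * np.pi * (x + 0.5)), 0.0)


def assemble_linear_parts(M, h):
    """Dense P1 mass and stiffness matrices; the Neumann conditions (3) are natural."""
    mass = np.zeros((M + 1, M + 1))
    stiff = np.zeros((M + 1, M + 1))
    m_loc = h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    k_loc = 1.0 / h * np.array([[1.0, -1.0], [-1.0, 1.0]])
    for j in range(M):
        mass[j:j + 2, j:j + 2] += m_loc
        stiff[j:j + 2, j:j + 2] += k_loc
    return mass, stiff


def convection(U):
    """Integral of (psi^2/2)_x v for every hat function v, together with its Jacobian."""
    a, b = U[:-1], U[1:]                      # left and right nodal values per element
    C = np.zeros(U.size)
    C[:-1] += (b - a) * (2.0 * a + b) / 6.0   # tested with the left hat function
    C[1:] += (b - a) * (a + 2.0 * b) / 6.0    # tested with the right hat function
    J = np.zeros((U.size, U.size))
    left = np.arange(U.size - 1)
    right = left + 1
    J[left, left] += (b - 4.0 * a) / 6.0
    J[left, right] += (a + 2.0 * b) / 6.0
    J[right, left] += -(2.0 * a + b) / 6.0
    J[right, right] += (4.0 * b - a) / 6.0
    return C, J


def solve_burgers(nu, M=100, T=1.0, gamma=1.0, atol=1e-9, max_newton=25):
    """Backward Euler steps of size tau = gamma * h / u_max (u_max = 2), each solved by Newton."""
    h = 2.0 / M
    x = np.linspace(-1.0, 1.0, M + 1)
    tau = gamma * h / 2.0
    mass, stiff = assemble_linear_parts(M, h)
    U = initial_condition(x)                  # nodal interpolation (assumption, not the projection (7))
    for _step in range(int(round(T / tau))):
        U_old = U.copy()
        for _it in range(max_newton):
            C, JC = convection(U)
            F = mass @ (U - U_old) / tau + C + nu * (stiff @ U)
            if np.linalg.norm(F) < atol:
                break
            U = U - np.linalg.solve(mass / tau + JC + nu * stiff, F)
        else:
            raise RuntimeError("Newton iteration did not converge")
    return x, U, mass


x, U, mass = solve_burgers(nu=1e-2, M=100)
print("min, max, total mass:", U.min(), U.max(), np.sum(mass @ U))
```

Testing (6) with v = 1 shows that the discrete total mass can change only through the boundary values, so the printed mass should stay close to its initial value as long as the solution is nearly zero at x = ±1. Passing a smaller `nu` or a larger `gamma` gives a quick way to explore the behaviour discussed in the following sections, although this sketch is not guaranteed to reproduce the FEniCS results quantitatively.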

3 Numerical Monotonization

We consider a particular example of the 1-D Burgers equation (1) with different values of ν subject to the following initial condition:

$$u_0(x) = \begin{cases} 1 + \cos(2\pi(x + 0.5)), & -1 \le x < 0, \\ 0, & 0 \le x \le 1. \end{cases} \tag{8}$$

We note that the exact solution of the IBVP (1), (8), (3) is bounded and

$$0 \le u(x, t) \le u_{\max}, \qquad u_{\max} := \max_{-1 \le x \le 1} u_0(x) = 2.$$

We will compare the computed solutions below with the corresponding reference solutions, which are obtained on a very fine mesh with M = 50000 using the piecewise-linear finite elements (p = 1) and the time-step size τ selected based on the CFL condition with the CFL number equal to 1, namely:

$$\tau = \tau_0 := \frac{h}{u_{\max}} = \frac{1}{M}. \tag{9}$$

The reference solutions at the final time T = 1 are plotted in Fig. 1 (left) for ν = 10^{-2}, 10^{-3}, 10^{-4} and 10^{-5}. A zoom on the shock area is shown in Fig. 1 (right). We now turn to the study of the three aforementioned monotonization strategies.

3.1 Spatial Mesh Refinement

We compute the solutions of the IBVP (1), (8), (3) with ν = 10^{-2}, 10^{-3}, 10^{-4} and 10^{-5} and present the obtained results in Fig. 2 for three different meshes with M = 100, 400 and 1600 and the time-step size selected based on the CFL condition (9). As one can clearly see, when ν = 10^{-2}, all three computed solutions are monotone and non-negative. This is, however, not true for ν = 10^{-3}, for which mesh refinement is required since the solution computed on the coarse grid with M = 100 is oscillatory. For the smaller values ν = 10^{-4} and 10^{-5}, all three computed solutions are oscillatory and one needs to further refine the mesh to suppress these oscillations. One can, however, observe that the area over which the oscillations spread shrinks as M increases. We also note that in the case of ν = 10^{-4}, the amplitude of the oscillations decreases when the mesh is refined.

Fig. 1 Reference solutions for different values of ν (left) and zoom at the shock area (right)

Fig. 2 Computed (with M = 100, 400 and 1600) and reference solutions for ν = 10^{-2} (top left), ν = 10^{-3} (top right), ν = 10^{-4} (bottom left) and ν = 10^{-5} (bottom right)
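The terms "oscillatory" and "non-negative" are judged from the plots in the text. A simple way to quantify them for a vector of nodal values is sketched below. The function name and the chosen indicators (undershoot below 0, overshoot above u_max = 2, total variation, number of interior local extrema) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def oscillation_report(u, u_min=0.0, u_max=2.0, tol=1e-12):
    """Simple indicators of spurious oscillations in a 1-D nodal vector u."""
    u = np.asarray(u, dtype=float)
    increments = np.diff(u)
    # Ignore numerically flat increments before counting sign changes.
    significant = increments[np.abs(increments) > tol]
    return {
        "undershoot": max(0.0, u_min - float(u.min())),   # violation of u >= u_min
        "overshoot": max(0.0, float(u.max()) - u_max),    # violation of u <= u_max
        "total_variation": float(np.sum(np.abs(increments))),
        "local_extrema": int(np.sum(significant[:-1] * significant[1:] < 0)),
    }


# A smooth single hump versus the same hump with a sawtooth perturbation.
x = np.linspace(0.0, 1.0, 101)
smooth = np.sin(np.pi * x)
wiggly = smooth + 0.1 * (-1.0) ** np.arange(x.size)
print(oscillation_report(smooth))
print(oscillation_report(wiggly))
```

For a single-hump profile bounded by 0 and u_max, one local extremum and zero undershoot are expected. A computed solution with many extrema or a positive undershoot would correspond to what the text calls oscillatory or not positivity-preserving.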


3.2 Enlarging Time-Step Size

Once again, we compute the solutions of the IBVP (1), (8), (3) with ν = 10^{-2}, 10^{-3}, 10^{-4} and 10^{-5}. We now fix M = 400 and set τ = γτ_0 with three different values γ = 1, 2 and 4. The obtained results are presented in Fig. 3, where one can clearly see the smoothing effect of time-step enlargement. In particular, we emphasize that even for the smaller diffusion coefficients ν = 10^{-4} and 10^{-5}, the solutions computed with the largest time-step size, that is, with γ = 4, are oscillation-free and positive.
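The smoothing effect can be motivated by a standard modified-equation argument. The sketch below is not part of the paper: it assumes the linear model problem u_t + a u_x = 0 with constant speed a, discretized in time only by backward Euler, and keeps only the leading error term.

```latex
\begin{align*}
\frac{u^{n+1}-u^{n}}{\tau} + a\,u_x^{n+1} &= 0, \\
u^{n} &= u^{n+1} - \tau\,u_t^{n+1} + \tfrac{\tau^2}{2}\,u_{tt}^{n+1} + O(\tau^3), \\
\Longrightarrow\quad u_t + a\,u_x &= \tfrac{\tau}{2}\,u_{tt} + O(\tau^2)
 = \tfrac{a^2\tau}{2}\,u_{xx} + O(\tau^2),
\end{align*}
```

where the last step uses u_tt = a^2 u_xx, which holds to leading order. Under these assumptions the scheme behaves like a convection-diffusion equation with numerical diffusion coefficient a^2 τ/2. With |a| ≤ u_max = 2 and τ = γ/M, this coefficient is at most 2γ/M, that is, 0.02 for M = 400 and γ = 4, which is much larger than ν = 10^{-4}. This is only a heuristic for the nonlinear finite-element scheme studied here.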

3.3 Increasing the Order of Spatial Approximation

Finally, we compute the solutions of the IBVP (1), (8), (3) with ν = 10^{-2}, 10^{-3}, 10^{-4} and 10^{-5} using the Lagrangian finite elements of orders p = 1, 2 and 4 on the fixed mesh with M = 200 and τ determined by the CFL condition (9). The obtained results are plotted in Fig. 4. As one can observe, in the case of the large diffusion coefficient ν = 10^{-2}, the computed solutions are almost identical; when ν = 10^{-3}, the solution obtained with p = 1 is oscillatory, while the solutions computed with p = 2 and 4, which are very close to each other, are non-oscillatory and positive; when ν = 10^{-4} and 10^{-5}, the oscillations are still suppressed by the use of higher-order finite elements and the solution computed with p = 4 is clearly sharper than the one obtained with p = 2.

4 Two-Dimensional Problem

In this section, we consider the 2-D Burgers equation,

$$u_t + \left(\frac{u^2}{2}\right)_x + \left(\frac{u^2}{2}\right)_y = \nu(u_{xx} + u_{yy}), \quad (x, y) \in \Omega = (-1, 1) \times (-1, 1), \quad t \in (0, T], \tag{10}$$

where ν > 0 is, as before, a constant diffusion coefficient. The initial and boundary conditions are

$$u(x, y, 0) = u_0(x, y), \quad (x, y) \in \Omega, \tag{11}$$

$$u_x(-1, y, t) = u_x(1, y, t) = u_y(x, -1, t) = u_y(x, 1, t) = 0, \quad t \in (0, T]. \tag{12}$$

We consider a particular example of the 2-D Burgers equation (10) with ν = 10^{-4} subject to the following initial condition:

$$u_0(x, y) = 2e^{-10\left((x+0.5)^2 + (y+0.5)^2\right)}. \tag{13}$$


Fig. 3 Computed (with M = 400 and τ = γτ_0 with γ = 1, 2 and 4) and reference solutions for ν = 10^{-2} (top left), ν = 10^{-3} (top right), ν = 10^{-4} (bottom left) and ν = 10^{-5} (bottom right)

Fig. 4 Computed (with M = 200, τ = τ_0 and p = 1, 2 and 4) and reference solutions for ν = 10^{-2} (top left), ν = 10^{-3} (top right), ν = 10^{-4} (bottom left) and ν = 10^{-5} (bottom right)


Fig. 5 2-D solutions computed using p = 1 (top row), 2 (middle row) and 4 (bottom row). The solutions are plotted in the entire Ω (left column) and zoomed at the shock front area (right column)

We note that, as in the 1-D example studied in Sect. 3, the exact solution of the IBVP (10)–(13) is bounded and 0 ≤ u(x, y, t) ≤ u_max := max_{(x,y)∈Ω} u_0(x, y) = 2. We numerically solve the IBVP (10)–(13) using a family of implicit finite-element methods with p = 1, 2 and 4 on a uniform M × M Cartesian mesh with M = 200 and the time-step size taken according to the CFL condition with the CFL number equal to 1, namely: τ = (2/M)/u_max = 1/M. The solutions computed at the final time T = 1 are shown in Fig. 5. As one can see, when p = 1 is used, the obtained solution is quite oscillatory. The oscillations are, however, substantially decreased by using p = 2 and almost eliminated when p = 4.
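The price of raising p on a fixed mesh can be gauged by counting unknowns. The helper below is a hypothetical illustration that assumes continuous Lagrange elements on a uniform mesh with M cells per direction, for which the nodes form a lattice with Mp + 1 points per direction. The paper does not report these counts.

```python
def lagrange_dofs(M, p, dim):
    """Number of continuous Lagrange degrees of freedom on a uniform mesh
    with M cells per direction and polynomial degree p (assumed lattice of nodes)."""
    return (M * p + 1) ** dim


for p in (1, 2, 4):
    print(f"p = {p}: 1-D dofs = {lagrange_dofs(200, p, 1)}, 2-D dofs = {lagrange_dofs(200, p, 2)}")
```

Under this assumption, p = 4 on the M = 200 mesh carries as many unknowns as p = 1 on a mesh with M = 800, which gives one rough basis for comparing the mesh-refinement and order-elevation strategies at similar problem size.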


5 Concluding Remarks

In this paper, we have experimentally studied three monotonization strategies in the context of nonlinear convection-diffusion equations in the convection-dominated regime. We have varied the spatial mesh size (h), the time-step size (τ) and the formal spatial order of the method (p) and verified that reducing h, increasing τ or increasing p helps to obtain non-oscillatory, positivity-preserving numerical solutions. When h is reduced or p is increased, the CPU time increases, which may affect the efficiency of the method. On the other hand, when τ is increased, the accuracy of the method is affected. We therefore suggest combining the studied strategies in order to obtain an accurate, non-oscillatory and efficient resulting method. The details of the optimal strategy depend on the problem at hand and also on the computational resources available.

Acknowledgements The work of A. Kurganov was supported in part by NSFC grant 11771201. The work of P. N. Vabishchevich was supported in part by Russian Federation Government megagrant 14.Y26.31.0013. The authors would like to thank anonymous reviewers for their valuable suggestions and corrections.

References

1. Alnæs, M., Blechta, J., Hake, J., Johansson, A., Kehlet, B., Logg, A., Richardson, C., Ring, J., Rognes, M.E., Wells, G.N.: The FEniCS project version 1.5. Arch. Numer. Softw. 3(100), 9–23 (2015)
2. Anderson, J.D.: Computational Fluid Dynamics: The Basics with Applications. Mechanical Engineering Series. McGraw-Hill, New York (1995)
3. Ascher, U.M.: Numerical Methods for Evolutionary Differential Equations. Computational Science & Engineering, vol. 5. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2008)
4. Cai, Q., Kollmannsberger, S., Sala-Lardies, E., Huerta, A., Rank, E.: On the natural stabilization of convection dominated problems using high order Bubnov-Galerkin finite elements. Comput. Math. Appl. 66(12), 2545–2558 (2014)
5. Donea, J., Huerta, A.: Finite Element Methods for Flow Problems. Wiley, Hoboken (2003)
6. Ern, A., Guermond, J.L.: Theory and Practice of Finite Elements. Applied Mathematical Sciences, vol. 159. Springer, New York (2004)
7. Godlewski, E., Raviart, P.A.: Numerical Approximation of Hyperbolic Systems of Conservation Laws. Applied Mathematical Sciences, vol. 118. Springer, New York (1996)
8. Guermond, J.L., Pasquetti, R., Popov, B.: Entropy viscosity method for nonlinear conservation laws. J. Comput. Phys. 230(11), 4248–4267 (2011)
9. Hartmann, R., Houston, P.: Adaptive discontinuous Galerkin finite element methods for nonlinear hyperbolic conservation laws. SIAM J. Sci. Comput. 24(3), 979–1004 (2002)
10. Hesthaven, J.S.: Numerical Methods for Conservation Laws: From Analysis to Algorithms. Computational Science & Engineering, vol. 18. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2018)
11. van der Houwen, P.J., Sommeijer, B.P., Kok, J.: The iterative solution of fully implicit discretizations of three-dimensional transport models. Appl. Numer. Math. 25(2–3), 243–256 (1997). Special issue on time integration (Amsterdam, 1996)
12. Hundsdorfer, W., Verwer, J.: Numerical Solution of Time-dependent Advection-Diffusion-Reaction Equations. Springer Series in Computational Mathematics, vol. 33. Springer, Berlin (2003)
13. Kröner, D.: Numerical Schemes for Conservation Laws. Wiley-Teubner Series Advances in Numerical Mathematics. Wiley, Chichester (1997)
14. Kulikovskii, A.G., Pogorelov, N.V., Semenov, A.Y.: Mathematical Aspects of Numerical Solution of Hyperbolic Systems. Chapman & Hall/CRC Monographs and Surveys in Pure and Applied Mathematics, vol. 118. Chapman & Hall/CRC, Boca Raton (2001)
15. Kurganov, A., Liu, Y.: New adaptive artificial viscosity method for hyperbolic systems of conservation laws. J. Comput. Phys. 231, 8114–8132 (2012)
16. Kurokawa, M., Uchiyama, Y., Nagai, S.: Tribological properties and gear performance of polyoxymethylene composites. J. Tribol. 122, 809–814 (2000)
17. Kuzmin, D.: A Guide to Numerical Methods for Transport Equations. University Erlangen-Nuremberg (2010)
18. Larson, M.G., Bengzon, F.: The Finite Element Method: Theory, Implementation, and Applications. Texts in Computational Science and Engineering, vol. 10. Springer, Heidelberg (2013)
19. LeVeque, R.J.: Finite Volume Methods for Hyperbolic Problems. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge (2002)
20. LeVeque, R.J.: Finite Difference Methods for Ordinary and Partial Differential Equations: Steady-State and Time-Dependent Problems. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2007)
21. Logg, A., Mardal, K.A., Wells, G.: Automated Solution of Differential Equations by the Finite Element Method: The FEniCS Book, vol. 84. Springer Science & Business Media, Berlin (2012)
22. Mao, K., Greenwood, D., Ramakrishnan, R., Goodship, V., Shrouti, C., Chetwynd, D., Langlois, P.: The wear resistance improvement of fibre reinforced polymer composite gears. Wear 426–427, 1033–1039 (2019)
23. Morton, K.W.: Numerical Solution of Convection-Diffusion Problems. Applied Mathematics and Mathematical Computation, vol. 12. Chapman & Hall, London (1996)
24. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Classics in Applied Mathematics, vol. 30. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2000). Reprint of the 1970 original
25. Samarskii, A.A.: The Theory of Difference Schemes. Monographs and Textbooks in Pure and Applied Mathematics, vol. 240. Marcel Dekker Inc, New York (2001)
26. Samarskii, A.A., Gulin, A.V.: Ustoĭchivost' Raznostnykh Skhem (Russian) [Stability of Difference Schemes], 2nd edn. Editorial URSS, Moscow (2004)
27. Samarskii, A.A., Vabishchevich, P.N.: Numerical Methods for Solving Convection-Diffusion. URSS, Moscow (1999)
28. Senthilvelan, S., Gnanamoorthy, R.: Effect of gear tooth fillet radius on the performance of injection molded nylon 6/6 gears. Mater. Des. 27, 632–639 (2006)
29. Thomée, V.: Galerkin Finite Element Methods for Parabolic Problems. Springer Series in Computational Mathematics, vol. 25, 2nd edn. Springer, Berlin (2006)
30. Vabishchevich, P.N.: Decoupling schemes for predicting compressible fluid flows. Comput. Fluids 171, 94–102 (2018)
31. Von Neumann, J., Richtmyer, R.D.: A method for the numerical calculation of hydrodynamic shocks. J. Appl. Phys. 21, 232–237 (1950)
32. Wesseling, P.: Principles of Computational Fluid Dynamics. Springer Series in Computational Mathematics, vol. 29. Springer, Berlin (2001)
33. Zienkiewicz, O.C., Taylor, R.L., Nithiarasu, P.: The Finite Element Method for Fluid Dynamics, 7th edn. Elsevier/Butterworth Heinemann, Amsterdam (2014)

The Insensitivity of the Iterative Learning Control Inverse Problem to Initial Run When Stabilized by a New Stable Inverse

Xiaoqiang Ji and Richard W. Longman

Abstract In routine feedback control, the input command is the desired output. The actual output is a convolution integral of the forcing function, which essentially is never equal to the command. If it were equal, control system designers would be solving an inverse problem. For systems that execute the same task repeatedly, Iterative Learning Control (ILC) addresses the inverse problem by adjusting the command the next time the task is executed, based on the error observed during the current run, aiming to converge to the command that produces the desired output. For a majority of real-world systems, this requires solving an ill-conditioned inverse problem whose exact solution is an unstable control action. A simple robot example performing a one-second maneuver requires inverting a Toeplitz matrix of Markov parameters that is guaranteed to be full rank, but has a condition number of 10^{52}. The authors and co-workers have developed a stable inverse theory to address this difficulty for discrete-time systems. By not asking for zero error for the first or first few time steps (the number determined by the number of discrete-time zeros outside the unit circle), the inverse problem has stable solutions for the control action. Incorporating this understanding into ILC, the stable control action obtained in the limit as the iterations tend to infinity is a function of the tracking error produced by the command in the initial run. By picking an initial input that goes to zero approaching the final time step, the influence becomes particularly small. And by simply commanding zero in the first run, the resulting converged control minimizes the Euclidean norm of the underdetermined control history. Three main classes of ILC laws are examined, and it is shown that all ILC laws converge to the identical control history. The converged result is not a function of the ILC law. All of these conclusions apply to ILC that aims to track a given finite-time trajectory, and also apply to ILC that in addition aims to cancel the effect of a disturbance that repeats each run.

X. Ji · R. W. Longman (B)
Department of Mechanical Engineering, Columbia University, MC 4703, 500 West 120th Street, New York, NY 10027, USA
e-mail: [email protected]
X. Ji
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_13

1 Introduction

Feedback control systems aim to execute whatever command is given to them. Thus, given the differential equation for the output as a function of the input command, the control law is aiming to solve an inverse problem. There is always considerable error, which can be characterized by the control system bandwidth. Plotting the response to sinusoidal inputs, when the amplitude of the output sinusoid has decayed to 70% of the value at low frequencies, the output has started to decay, and has started ignoring such frequency components in the command. Iterative Learning Control (ILC) aims to solve this inverse problem in situations where the control system can be run repeatedly, reducing the error in following the desired output with each repetition. The ILC law sets up a contraction mapping, and iterates with the world (not with a computer model of the world), adjusting the command based on the error observed in the previous run.

ILC stores data from the previous run, so that it is a digital control method solving a discrete-time inverse problem. The real world for digital control systems is governed by ordinary differential equations, but the digital controller creates the forcing function applied to this equation, updating it each sample time. Each update is continuously applied to the differential equation until a new update arrives from the controller; this is called a zero-order hold. If one looks at the solution to the differential equation at the sample times, one can make a linear difference equation whose solution is identical to that of the differential equation at those times. Reference [1] proves that the process of converting to a difference equation model introduces the forcing function at additional sample times, enough to make the most recent output time step in the equation be one step ahead of the most recent forcing function input time step. When the discretization introduces three or more additional terms, and the sample rate is reasonable, the characteristic polynomial of the forcing function side of the equation will contain a root or roots that are larger than one in magnitude. This makes the discrete-time inverse problem unstable for a majority of digital control systems in the world.

The implication is that, if one wants to have perfect tracking of a desired discrete-time trajectory at all time steps, the control action needed is unstable, and grows exponentially with time steps. The inverse problem error must be zero at the sample times, but between sample times the solution of the differential equation (after some initial time steps) is growing in magnitude exponentially, and alternating in sign each time step. Of course, this exponential error growth when perfectly following the discrete-time desired trajectory does not address the initial intended problem of finding the input to accurately follow the desired continuous-time output.


Mathematically, the discrete-time inverse problem is asking to invert an ill-conditioned matrix. We note that the lower triangular Toeplitz matrix of Markov parameters is guaranteed to be full rank analytically, indicating that the inverse exists. Numerical methods that aim to address this kind of problem include Tikhonov regularization. This is unappealing, since it will not produce zero error at any of the sample times. We comment that the stable inverse solution presented here could be of use in numerical methods as an alternative to Tikhonov regularization. Another numerical approach is to set the small singular value(s) producing the ill-conditioning to zero, then use a pseudo-inverse. This is unappealing for the same reason. In this paper we wish to do better.

There is a theory established for stable inverses [2], but it makes use of pre- and post-actuation, extending the time interval of the problem. The authors and co-workers developed a set of stable inverses, termed Longman JiLLL stable inverses, where JiLLL refers to people who have contributed to the results: Xiaoqiang Ji, Te Li, Yao Li, and Peter LeVoci in Refs. [3–8]. The notation for each stable inverse is Longman JiLLL FI, NI, FS, NS: FI for the solution of the initial delete problem factored, NI for not factored, FS for the solution of the skip step problem factored, and NS for not factored. We consider JiLLL NI here.

In this paper we illustrate with a simple 3rd-order system that models the input-output relationship of each axis of a Robotics Research Corporation robot [9]. Then, given a p-time-step desired output, JiLLL NI gives a p-time-step input that produces zero error at all time steps except the first step. The first step is determined by a minimum Euclidean norm input action. The unstable behavior produced by the ill-conditioning is eliminated. When we apply the approach to the ILC problem, the first step instead is the result of the command given in the initial run.
We comment that for a one-second desired trajectory for the robot link, sampled at a 100 Hz sample rate, the condition number of the matrix to be inverted is of the order of magnitude of 10^{52}. We also comment that Matlab is not able to compute this condition number, and other techniques are needed to estimate it. We reiterate that the matrix is analytically guaranteed to be full rank, so the inverse is guaranteed to exist.

We wish to take advantage of the JiLLL NI result when using ILC, but it is not reasonable to perform the pseudo-inverse step with ILC. Making use of this step would simply find an inverse of the model, and ILC seeks to iterate with the world and solve the inverse problem iteratively in the real world, not in a model of the world. ILC can be very effective. In experiments performed on the robot, the tracking error was decreased by a factor of 1000 in 12 iterations with the world; this is far below the error level of our model of the world and approaches the reproducibility level of the hardware. The ILC laws make use of a model, but must be sufficiently robust to model error that they still produce convergence.

To start an ILC iteration, one first applies some input to the system and observes the output. Since we are concerned with control systems, the most logical input is the desired output. To match ILC to the JiLLL NI inverse, the ILC updates will not consider the output error at the first time step or first few time steps, the number being the number of non-minimum phase zeros in the transfer function. Then the output at these time steps becomes a function of the command given during this first run. This paper studies this in detail. It is concluded that three main classes of ILC laws converge to the same control action, given the same initial run. The influence of the initial run introduces a small component on the unstable solution, but it is so small that it does not exhibit any instability in the p time steps of the given trajectory. It is also shown that if one decides to command zero during the first run, then the converged solution is the same as the JiLLL NI solution, giving the minimum norm control action for the first time step.

2 The True Inverse Solution Producing Zero Tracking Error at Every Time Step

Consider single-input, single-output systems described by n-th order difference equations,

$$y[k+n] + a_1 y[k+(n-1)] + \cdots + a_n y[k] = b_1 u[k+(n-1)] + \cdots + b_n u[k]. \tag{1}$$

This model results from converting the original continuous-time model fed by a zero-order hold to discrete time. With the time forward shift operator z{f(k)} = f(k + 1), equivalently one gets

$$\left(z^n + a_1 z^{n-1} + \cdots + a_n\right)\{y(k)\} = \left(b_1 z^{n-1} + b_2 z^{n-2} + \cdots + b_n\right)\{u(k)\}. \tag{2}$$

In the context of linear discrete time-invariant systems, the z{·} operator and the z-transform variable can often be used interchangeably. We call the roots of the polynomial in the bracket on the right-hand side the zeros, and correspondingly the roots of the left-hand-side polynomial the poles. To find the input needed to produce a desired output, one needs to solve an inverse problem described in [8]. Substituting the desired trajectory into the left-hand side creates a forcing function, and solving for the input needed requires finding a particular solution associated with this forcing function, plus the solution of the homogeneous equation produced by the right-hand side. To illustrate concepts in this paper we consider a 3rd-order continuous-time system with pole excess of three,

$$G(s) = \left(\frac{a}{s+a}\right)\left(\frac{\omega_0^2}{s^2 + 2\zeta\omega_0 s + \omega_0^2}\right), \tag{3}$$

where a = 8.8, ζ = 0.5, and ω_0 = 37. The input comes through a zero-order hold, sampling at 100 Hz for this study. The discrete-time equivalent difference equation relating input to output has the form

$$y(k+3) + a_1 y(k+2) + a_2 y(k+1) + a_3 y(k) = b_1 u(k+2) + b_2 u(k+1) + b_3 u(k). \tag{4}$$

The solution to the resulting discrete-time inverse problem takes the form

$$u(k) = u_p(k) + C_1 (z_1)^k + C_2 (z_2)^k, \tag{5}$$

where u_p(k) is a particular solution, C_1 and C_2 are arbitrary constants determined by initial conditions, and z_1, z_2 are the zeros introduced. The system Eq. (1) has a state space realization

$$x(k+1) = A x(k) + B u(k), \qquad y(k) = C x(k) + d(k), \tag{6}$$

with

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots \\ \vdots & 0 & \ddots & \ddots \\ 0 & \vdots & \cdots & 1 \\ -a_n & -a_{n-1} & \cdots & -a_1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}, \quad C = \begin{bmatrix} b_n & b_{n-1} & \cdots & b_1 \end{bmatrix}, \tag{7}$$

where A ∈ R^{n×n}, B ∈ R^{n×1} and C ∈ R^{1×n}. In many applications of ILC, there is a disturbance that repeats each run. The disturbance can appear anywhere in the feedback control loop, but it can always be represented by an equivalent influence on the output. The d(k) introduced here represents such a repeating disturbance. The input, actual output, desired output and disturbance histories over a finite number p of time steps can be written by defining the history vectors u = [u(0) u(1) · · · u(p − 1)]^T, y = [y(1) y(2) · · · y(p)]^T, y* = [y*(1) y*(2) · · · y*(p)]^T, d = [d(1) d(2) · · · d(p)]^T. The solution to Eq. (6) can then be written as

$$y = P u + \bar{A} x_0 + d, \tag{8}$$

where x_0 is the initial condition, Ā is an observability matrix, and P is a Toeplitz matrix,

$$P = \begin{bmatrix} CB & 0 & \cdots & 0 \\ CAB & CB & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ CA^{p-1}B & CA^{p-2}B & \cdots & CB \end{bmatrix}; \qquad \bar{A} = \begin{bmatrix} CA \\ CA^2 \\ \vdots \\ CA^p \end{bmatrix}. \tag{9}$$

Then the solution to the inverse problem is

$$u^* = P^{-1}\left(y^* - \bar{A} x_0 - d\right). \tag{10}$$

The inverse P^{-1} is guaranteed to exist since P is lower triangular and all of its eigenvalues are equal to CB > 0. Note that Eqs. (4) and (8) pose the same inverse problem in different forms.
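The growth of the exact inverse solution (10) can be observed numerically. The sketch below is an illustration under several assumptions that are not in the paper: the continuous-time system (3) is put in a companion-form realization, the zero-order-hold discretization is computed through an eigendecomposition with NumPy only, the desired trajectory `y_star` is a hypothetical smooth curve, and x_0 = 0 and d = 0. Equation (10) is evaluated by forward substitution rather than by forming P^{-1}.

```python
import numpy as np


def zoh_discretize(Ac, Bc, T):
    """Zero-order-hold discretization (assumes Ac is diagonalizable and invertible)."""
    lam, V = np.linalg.eig(Ac)
    Ad = (V @ np.diag(np.exp(lam * T)) @ np.linalg.inv(V)).real
    Bd = np.linalg.solve(Ac, (Ad - np.eye(Ac.shape[0])) @ Bc)
    return Ad, Bd


def markov_toeplitz(Ad, Bd, Cd, p):
    """Lower triangular Toeplitz matrix P of Eq. (9) built from CB, CAB, ..., CA^(p-1)B."""
    h = np.empty(p)
    v = Bd.copy()
    for k in range(p):
        h[k] = Cd @ v
        v = Ad @ v
    P = np.zeros((p, p))
    for i in range(p):
        P[i, :i + 1] = h[i::-1]
    return P


# Parameters of Eq. (3), sampled at 100 Hz for a one-second trajectory.
a, zeta, w0, T, p = 8.8, 0.5, 37.0, 0.01, 100
c0, c1, c2 = a * w0**2, w0**2 + 2.0 * zeta * w0 * a, a + 2.0 * zeta * w0
Ac = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-c0, -c1, -c2]])  # companion form of G(s)
Bc = np.array([0.0, 0.0, 1.0])
Cc = np.array([c0, 0.0, 0.0])
Ad, Bd = zoh_discretize(Ac, Bc, T)
P = markov_toeplitz(Ad, Bd, Cc, p)

t = T * np.arange(1, p + 1)
y_star = 1.0 - np.cos(2.0 * np.pi * t)      # hypothetical desired output history

# Eq. (10) with x0 = 0 and d = 0, solved row by row (forward substitution).
u_star = np.zeros(p)
for i in range(p):
    u_star[i] = (y_star[i] - P[i, :i] @ u_star[:i]) / P[i, i]

print("CB =", P[0, 0])
print("last inputs:", u_star[-3:])
print("ratio of the last two inputs:", u_star[-1] / u_star[-2])
```

The computed inputs should grow by a roughly constant negative factor from one step to the next, which is the alternating, exponentially growing behavior described in the text. Because the problem is so ill-conditioned, the printed magnitudes are dominated by the unstable mode and should be read qualitatively only.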


3 Iterative Learning Control Laws

ILC adjusts the command to a feedback control system repeatedly performing a desired task under a repeating disturbance. The command is adjusted after each run, based on the error observed in the previous run, and the aim is to achieve zero error $e_j(k) = y^*(k) - y_j(k)$ in tracking the repeated desired trajectory as the repetition number j tends to infinity. Many ILC approaches have been developed, and Refs. [10–13] give a good perspective on how the ILC field has developed. Each repetition starts from the same initial condition. A general linear learning control law is given by

$$u_{j+1} = u_j + L e_j \qquad (11)$$

where L is a matrix of learning gains. By taking the difference of Eq. (8) for two successive runs, one can write the error propagation equation

$$e_{j+1} = (I - PL)\, e_j \qquad (12)$$

where I is the identity matrix. We consider three main classes of ILC laws. We refer to the first law, investigated in [14], as the P Transpose Law (or the Contraction Mapping Law),

$$L = \phi P^T \qquad (13)$$

and it is a contraction mapping in the sense of the Euclidean norm of the tracking error from iteration to iteration. The second law, investigated in [15], is the partial isometry law formed from the singular value decomposition $P = USV^T$ according to

$$L = \phi V U^T \qquad (14)$$

Here U and V are unitary matrices whose columns (and rows) are mutually orthogonal unit vectors in p-dimensional space. The third class picks the learning gain matrix so as to minimize a quadratic cost each iteration that controls the learning transients [16, 17]. We comment that by picking the quadratic cost weights appropriately, all three laws can be represented by the quadratic cost law as in [18]. The quadratic cost function that uses a single scalar weight r results in the following learning law L

$$J_{j+1} = e_{j+1}^T e_{j+1} + r\, \delta_{j+1}u^T\, \delta_{j+1}u; \qquad L = \left(P^T P + rI\right)^{-1} P^T \qquad (15)$$

where the difference operator $\delta_j \xi = \xi_j - \xi_{j-1}$ holds for any quantity $\xi_j$. Using any of these laws in Eq. (12), one concludes that the error history converges to zero tracking error as j tends to infinity, for all possible initial runs, if and only if all eigenvalues of $(I - PL)$ are less than one in magnitude. If all singular values of $(I - PL)$ are less than one, the error converges monotonically with iterations in the

The Insensitivity of the Iterative Learning Control …


sense of the Euclidean norm. For sufficiently small gain φ or large r , each of these laws is guaranteed to converge monotonically to zero tracking error at all time steps, which means it converges to the unstable solution in Eq. (10).
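The three learning laws of Eqs. (13) to (15) can be compared on a small example. The sketch below is only illustrative: the plant is a hypothetical minimum-phase system with Markov parameters 1 and 0.5 (a zero at z = −0.5), the gains φ = 1/σ_max² and φ = 1/σ_max and the weight r = 1 are choices made here so that all singular values of (I − PL) are below one, and `toeplitz_lower` and `run_ilc` are helper names introduced for this example.

```python
import numpy as np

def toeplitz_lower(h, p):
    """p x p lower-triangular Toeplitz matrix built from Markov parameters h."""
    P = np.zeros((p, p))
    for k, hk in enumerate(h):
        P += hk * np.eye(p, k=-k)
    return P

def run_ilc(P, L, y_star, d, u0, n_iter):
    """Iterate u_{j+1} = u_j + L e_j, Eq. (11), against the 'world' y = P u + d."""
    u = u0.copy()
    norms = []
    for _ in range(n_iter):
        e = y_star - (P @ u + d)        # tracking error observed in run j
        norms.append(np.linalg.norm(e))
        u = u + L @ e
    return u, np.array(norms)

p = 15
P = toeplitz_lower([1.0, 0.5], p)       # hypothetical plant (assumption)
U, s, Vt = np.linalg.svd(P)
r = 1.0
laws = {
    "P transpose": P.T / s[0] ** 2,                                    # Eq. (13)
    "partial isometry": (Vt.T @ U.T) / s[0],                           # Eq. (14)
    "quadratic cost": np.linalg.solve(P.T @ P + r * np.eye(p), P.T),   # Eq. (15)
}
y_star = np.sin(0.3 * np.arange(1, p + 1))
d = 0.1 * np.ones(p)                    # repeating disturbance

results = {name: run_ilc(P, L, y_star, d, np.zeros(p), 200) for name, L in laws.items()}
for name, (_, norms) in results.items():
    print(f"{name:17s} first error {norms[0]:.3e}   last error {norms[-1]:.3e}")
```

With these choices the error norm of each law decays monotonically, as predicted by the singular value condition on (I − PL). The behavior changes once P contains a zero outside the unit circle, which is the subject of the following sections.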

4 Addressing Instability and Ill-Conditioning of the Inverse Problem

4.1 Instability of the Inverse Problem

Zeros of the right hand side polynomial in Eq. (2) that are the images of zeros in continuous time we call intrinsic zeros, and if the continuous time zero is in the right half s-plane, it maps outside the unit circle in the z-plane. Generically, the discretization process will introduce the number of extra zeros needed to produce the power n − 1 in the right hand polynomial. These zeros introduced by the discretization process we term sampling zeros. Reference [1] shows that, when two or more zeros have thus been introduced, at least one zero is outside the unit circle for reasonable sample rates. Equivalently, if the original differential equation has three or more poles in excess of zeros, there will be at least one zero outside the unit circle in discrete time. This perhaps applies to a majority of control systems in the world. When there are zeros outside the unit circle, an unstable control action results from the corresponding solutions, which consist of an arbitrary constant from initial conditions times the zero location to the power of the time step. This corresponds to sampled values of an exponentially growing function. Reference [1] proves that a sampling zero is introduced whose location asymptotically approaches $z_2 = -3.7321$ in Eq. (5) as the sample time interval tends to zero, and the other zero $z_1$ is inside the unit circle at the reciprocal location. After some time steps the $C_2(-3.7321)^k$ term will totally dominate the u(k) history; therefore the unique inverse solution for the necessary control action will almost always grow exponentially with time, and will alternate in sign every time step. This is violently unstable.
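The limiting zero locations quoted above can be checked numerically. The sketch assumes the standard result from the sampled-systems literature that, for a pole excess of three, the limiting sampling zeros are the roots of z² + 4z + 1; this polynomial is brought in here for illustration and is not written out in the text above.

```python
import numpy as np

# Limiting sampling-zero polynomial for pole excess 3 (assumed): z^2 + 4 z + 1
zeros = np.sort(np.real(np.roots([1.0, 4.0, 1.0])))
z2, z1 = zeros                  # z2 lies outside the unit circle, z1 inside
print("z2 =", z2, " z1 =", z1, " z1*z2 =", z1 * z2)

# Magnitude of the homogeneous term C2 * z2**k for a unit coefficient C2 = 1
k = np.arange(0, 101, 25)
growth = np.abs(z2) ** k
for step, g in zip(k, growth):
    print(f"k = {step:3d}   |z2|^k = {g:.3e}")
```

The product of the two roots is one, which is the reciprocal relation stated in the text, and the growth over 100 steps shows why even a very small coefficient C2 eventually dominates the control history.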

4.2 Addressing the Instability in the Inverse Problem

The approach of Ref. [8] is to remove the exponentially growing term from the solution by employing whatever initial conditions make the coefficient $C_2 = 0$, or at least make it negligibly small. It does so by allowing enough beginning values of u(k) to be free to correspond to the number of initial conditions of the difference equation, and then finding the solution that minimizes the Euclidean norm of the control input history. The resulting solution is




$$u_0 = P_d^{\dagger}\left(y_d^* - \bar{A}_d x(0) - d_d\right) \qquad (16)$$

where the subscript d denotes deleting the first row(s) of entries in a matrix or a vector. Because there is only one zero outside the unit circle for a third order system, we only need to delete one row to get enough freedom to adjust C2 . One could alternatively delete two rows to give full freedom to adjust all initial conditions, but this is not necessary. The Pd† denotes the Moore–Penrose pseudoinverse of Pd , which minimizes the Euclidean norm of the resulting input history and eliminates the instability in any finite time trajectory. The reformulated inverse problem creates a stable inverse producing zero error at every time step except the first (or first few when there is more than one zero outside the unit circle).
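The effect of deleting the first row can be seen on a toy example. The sketch below does not use the third-order system of this paper. It assumes a hypothetical plant y(k+1) = u(k) + a u(k−1) with a single zero at z = −a, where a = 3.3104 is borrowed from the zero magnitude quoted in Sect. 4.3, together with zero initial state and zero disturbance.

```python
import numpy as np

a = 3.3104                      # hypothetical zero at z = -a, outside the unit circle
p = 20
# Toy plant y(k+1) = u(k) + a*u(k-1): P has 1 on the diagonal and a just below it
P = np.eye(p) + a * np.eye(p, k=-1)
y_star = np.ones(p)             # desired output (x(0) = 0 and d = 0 assumed)

u_inverse = np.linalg.solve(P, y_star)      # Eq. (10): zero error at every step
P_d, y_d = P[1:, :], y_star[1:]             # delete the first row and first entry
u_stable = np.linalg.pinv(P_d) @ y_d        # minimum-norm solution as in Eq. (16)

print("max |u|, true inverse  :", np.max(np.abs(u_inverse)))
print("max |u|, stable inverse:", np.max(np.abs(u_stable)))
print("first-step error       :", y_star[0] - P[0, :] @ u_stable)
print("largest addressed error:", np.max(np.abs(P_d @ u_stable - y_d)))
```

In this example the true inverse grows by roughly a factor a per step with alternating sign, while the pseudoinverse solution stays of order one and gives up only the error at the first time step.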

4.3 Viewing the Ill-Conditioning and Instability in Terms of the Singular Value Decomposition of Matrix P

The SVD of matrix $P = USV^T$ has various special properties treated in [6, 7]. In the limit as p gets large, the singular values in S are the magnitude frequency response at the discrete frequencies one can compute from p time steps. The U and V contain output and input singular vectors associated with these frequencies that contain phase response information. One observes that these vectors are close to being sinusoids, and one can identify the frequency by taking the DFT of the vector. However, when there are zeros outside the unit circle, Refs. [6, 7] show that extra singular values are introduced unrelated to frequency response. The values of these anomalous singular values are functions of matrix size, i.e. for large enough p there is a linear decay of magnitude on a log scale, associated with the zero location, as a function of matrix size p. Figure 1 gives the smallest and next to the smallest singular values as functions of matrix size p, and the figure is truncated at p = 50. The slope associated with the zero outside the unit circle, 1/3.3104, is also presented, showing that the smallest singular value follows this slope. It fails to follow once Matlab is no longer able to compute the singular value accurately. Linear extrapolation shows the smallest singular value should be $10^{-54}$ for p = 100, i.e. for a one-second trajectory. The associated condition number of the P matrix is $1.8 \times 10^{52}$ in spite of the fact that the matrix is analytically guaranteed to be full rank. It is this property of the smallest singular value as a function of p that produces ill-conditioning. The associated pair of input and output singular vectors shown in Figs. 2 and 3 alternate in sign from time step to time step, with magnitudes having opposite slopes with time. The input singular vector magnitude each time step grows linearly on a log scale, and the output singular vector decays linearly. It is this property that produces the instability of the inverse problem. The plots give the log of the absolute values of the singular vector components each time step. Again the decaying slope is associated with the smallest singular value, decaying by 1/3.3104 every time step, and the growing slope on the input side is growing by 3.3104 every time step. Again, Matlab is not able to compute


Fig. 1 The smallest and next to smallest singular values of P, and relation of slope to zero location, as a function of matrix size in time steps (log-scale vertical axis; horizontal axis: Time Step Index; curves: smallest s.v., second smallest s.v., $1/(3.3104)^p$)

Fig. 2 Magnitude of components of the associated input singular vector (log-scale vertical axis: Magnitude; horizontal axis: Time Step Index; curves: input singular vector, $(3.3104)^k$)

the components of these singular vectors accurately for time steps when the values are too small, and the straight-line slope degenerates into noise. Reference [3] proves that, for a non-minimum phase system with zeros outside the unit circle (real, complex or both), one can produce a stable inverse solution from a factored form [4]. It is a simple matter to extend the result to prove that Eq. (16) works. The proof is based on developing a universal bound on the smallest singular value for this matrix, independent of p. Thus, the smallest singular value of P, which decreases exponentially as the size of the matrix increases, has been eliminated.
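The singular value behavior described in this section can be reproduced qualitatively on a toy example. The sketch again assumes the hypothetical plant y(k+1) = u(k) + a u(k−1) with a zero at z = −a, a = 3.3104, rather than the third-order system of this paper; matrix sizes are kept small so that double precision can still resolve the smallest singular value.

```python
import numpy as np

a = 3.3104    # hypothetical zero magnitude (assumption)

def plant_matrix(p):
    # Toy plant y(k+1) = u(k) + a*u(k-1), zero at z = -a
    return np.eye(p) + a * np.eye(p, k=-1)

sizes = np.arange(10, 18)
sigma_min = np.array([np.linalg.svd(plant_matrix(p), compute_uv=False)[-1] for p in sizes])
sigma_min_d = np.array([np.linalg.svd(plant_matrix(p)[1:, :], compute_uv=False)[-1] for p in sizes])
ratios = sigma_min[1:] / sigma_min[:-1]      # decay per added time step

for p, s, sd in zip(sizes, sigma_min, sigma_min_d):
    print(f"p = {p:2d}   sigma_min(P) = {s:.3e}   sigma_min(P_d) = {sd:.4f}")
print("decay ratio per step:", ratios, " expected about", 1.0 / a)

# Singular vector pair belonging to the smallest singular value of the largest P
U, s, Vt = np.linalg.svd(plant_matrix(sizes[-1]))
v_in, u_out = Vt[-1, :], U[:, -1]
print("input vector magnitudes :", np.abs(v_in[[0, -1]]))
print("output vector magnitudes:", np.abs(u_out[[0, -1]]))
```

For this toy plant the smallest singular value of P shrinks by a factor close to 1/a for every added time step, the input singular vector grows and the output singular vector decays, while the smallest singular value of P_d stays bounded away from zero.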


Fig. 3 Magnitude of components of the associated output singular vector (log-scale vertical axis: Magnitude; horizontal axis: Time Step Index; curves: output singular vector, $1/(3.3104)^k$)

5 Making ILC Converge to a Stable Inverse

5.1 Behavior of ILC When Solving the Ill-Conditioned Problem

The mathematics of ILC says it converges to the P inverse solution. Properties of this solution are that it has zero error at every time step, that the control action alternates in sign every time step and grows in magnitude exponentially, and that it produces errors in the solution of the differential equation that grow exponentially in magnitude between successive time steps. Such a solution defeats the purpose of asking for zero error. Sometimes these bad properties are not observed when ILC is applied to the real world. Often, the error decreases reasonably fast, but it seems to have finished converging at a disappointing error level far from the zero error promised by the mathematical analysis. Reference [7] explains this phenomenon, saying that the iterations have not yet converged, and that with enough iterations one will see the instability appearing, and the error at sample times decaying further. This means that one may have improved the error level, but the iteration process is poised to become unstable. In practice, this phase of the convergence process may not be observed because the iterations are terminated before actually reaching convergence, or because the finite word length in the analog to digital and digital to analog converters does not allow accumulation of the learning signal.
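The apparent stalling of the iterations can be illustrated with the unmodified P Transpose Law. The sketch assumes the hypothetical toy plant y(k+1) = u(k) + a u(k−1) with a zero at z = −a and the gain φ = 1/σ_max², both chosen here for illustration. The error component along the output singular vector of the smallest singular value contracts by the factor 1 − φσ_min² per run, which is indistinguishable from one.

```python
import numpy as np

a, p = 3.3104, 20
P = np.eye(p) + a * np.eye(p, k=-1)     # toy plant with a zero at z = -a (assumption)
U, s, Vt = np.linalg.svd(P)
phi = 1.0 / s[0] ** 2                   # gain choice made for this sketch
y_star = np.ones(p)

u = np.zeros(p)
history = []
for j in range(300):
    e = y_star - P @ u                  # error of run j (no disturbance assumed)
    history.append(np.linalg.norm(e))
    u = u + phi * P.T @ e               # Eq. (11) with L = phi * P^T

e0_slow = abs(U[:, -1] @ y_star)        # initial error on the slowest direction
runs_to_halve = np.log(2.0) / (phi * s[-1] ** 2)
print("error norm after 10, 100, 300 runs:", history[9], history[99], history[-1])
print("initial error on slowest direction:", e0_slow)
print("runs needed to halve that component: about", runs_to_halve)
```

In this example the error norm drops quickly and then sits at the level of the slowest component, which would only decay after an astronomically large number of runs. This matches the explanation attributed to Ref. [7] that the iterations look converged but are not.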

5.2 Modifying ILC Laws to Aim for Stable Inverse

Reference [7] applied the idea presented in [3, 8] to ILC. Given a third order discrete time system, one deletes the first row to form $P_d$, picking $L = \phi P_d^T$ for the modified P Transpose Law, $L = \phi V_d U_d^T$ for the partial isometry law, where $V_d$ and $U_d$ are the singular vector matrices of $P_d$, and $L = \left(P_d^T P_d + rI\right)^{-1} P_d^T$ for the quadratic cost law [15–18]. Reference [18] shows that all these laws can be unified in one general formulation. These ILC laws update the control action for all p time steps, but ask for zero error at only the addressed time steps remaining after deleting the initial row (or rows, whose number is equal to the number of non-minimum phase zeros). Reference [8] uses the extra freedom by picking a minimum norm solution, but the ILC approach simply applies whatever command one wishes for the first iteration, and then starts using any of the above control laws. The questions addressed here are: How well will this work? What final error level is produced? Does it make a difference which law we use? How significantly is the final error level affected by our choice of the initial run? Extensive numerical experience shows that no matter how one chooses the control action in the initial iteration, one can always achieve zero tracking error on the time steps remaining after deleting the chosen initial steps. After using the modified ILC laws, the final level of the control action, and also the unaddressed error at the first time step, are insensitive to the choice of the control action in the initial run. This may appear counter-intuitive, and we seek to explain this phenomenon.

6 Analytical and Numerical Results

We established that not asking for zero error at the first (or first few) time steps allows one to create solutions to the inverse problem that are stable. There are of course an infinite number of solutions to the underspecified set of equations, one of which is the original inverse solution, which we do not want. The Moore–Penrose pseudoinverse was used in the stable inverse chosen above, and seen to eliminate the unstable inverse issue. In Sect. 6.1 we obtain an expression for the set of all possible inverse solutions, all expressed in terms of a parameter γ that indicates the difference between any given solution and the Moore–Penrose solution. ILC cannot make use of the Moore–Penrose pseudoinverse as a learning law during its iterations. ILC aims to get to zero tracking error in the real world, not in our imperfect model of the world, by iterative adjustments of the system input using world response data. ILC starts with an initial run, applying whatever input one chooses; for example, one would logically use the desired output as the command to the control system. In Sect. 6.2 we show that the converged solution for the ILC laws is dependent on the choice of the input for the initial run, and that all three ILC laws converge to the same solution when using the same initial run, i.e. the same value of γ. Section 6.3 shows that the dependence of the final converged control action u(k) on the choice of the input for the initial run is very small. Hence, ILC easily converges to a very well behaved solution after deleting the requisite number of initial rows. It is also shown that if the input in the initial run is zero, or if it is made zero near the end of the initial run, then ILC essentially converges to the Moore–Penrose pseudoinverse solution of the underspecified equations. One might expect to need to make a very precise choice of the initial conditions to ensure the value of $C_2$ is sufficiently small to avoid unstable behavior. From this thinking the results here are counter-intuitive. It becomes evident that the choice of initial run needed to produce anything close to the unstable behavior of the system inverse is something that one would never think of using. Section 6.4 shows that the influence of the initial run on the converged error on the unaddressed time step is very small.

6.1 The γ Parameter Set of All Possible Solutions to the Underspecified Equations

First, consider some properties of a generalized inverse of a rectangular matrix $P_d$ modeling an underdetermined system after the deletion of the first row(s) in P, i.e. for the 3rd order system, there is one more control action than the number of errors being addressed. Partition the SVD of matrix $P_d$, using the system $y_d^* = P_d u$ for simplicity and considering one zero outside the unit circle

$$y_d^* = U_d \begin{bmatrix} S_d & 0 \end{bmatrix} \begin{bmatrix} V_{d1}^T \\ V_{d2}^T \end{bmatrix} u \qquad (17)$$

Denote $V_d^T u = \hat{u}$, then $\hat{u}_1 = V_{d1}^T u = S_d^{-1} U_d^T y_d^*$, $\hat{u}_2 = V_{d2}^T u = \gamma$, and γ could be any scalar as it lies in the null space. Then converting to the original u space, one gets

$$u = \begin{bmatrix} V_{d1} & V_{d2} \end{bmatrix} \begin{bmatrix} S_d^{-1} U_d^T y_d^* \\ \gamma \end{bmatrix} = P_d^{\dagger} y_d^* + V_{d2}\gamma \qquad (18)$$

The first term represents the Moore–Penrose pseudoinverse result minimizing the Euclidean norm of the control action. The second term gives all possible solutions by choice of all possible values of γ. The pseudoinverse solution given by JiLLL NI is a particularly attractive solution since it produces the smallest possible Euclidean norm of the control that accomplishes zero error at the time steps addressed. Of course there also exists a γ producing zero error at all time steps, i.e. producing the true inverse solution of Eq. (10), which contains the exponentially growing unstable control action.
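The parametrization of Eq. (18) can be checked numerically. The sketch assumes the hypothetical toy plant y(k+1) = u(k) + a u(k−1) with one zero at z = −a, so that $P_d$ has a one-dimensional null space; `solution` is a helper name introduced here.

```python
import numpy as np

a, p = 3.3104, 20
P = np.eye(p) + a * np.eye(p, k=-1)     # toy plant, zero at z = -a (assumption)
P_d = P[1:, :]
y_star = np.ones(p)
y_d = y_star[1:]

# Full SVD of P_d: the last row of Vt_d spans its one-dimensional null space
U_d, s_d, Vt_d = np.linalg.svd(P_d)
V_d2 = Vt_d[-1, :]
u_mp = np.linalg.pinv(P_d) @ y_d        # Moore-Penrose solution, gamma = 0

def solution(gamma):
    """Eq. (18): a solution of P_d u = y_d for any scalar gamma."""
    return u_mp + V_d2 * gamma

residuals = {g: np.linalg.norm(P_d @ solution(g) - y_d) for g in (0.0, 1.0, -5.0)}
norms = {g: np.linalg.norm(solution(g)) for g in (0.0, 1.0, -5.0)}
print("residuals at addressed steps:", residuals)
print("control norms               :", norms)

# gamma that reproduces the true (unstable) inverse of Eq. (10)
u_inverse = np.linalg.solve(P, y_star)
gamma_inverse = V_d2 @ u_inverse
print("gamma of the true inverse   :", gamma_inverse)
```

Every γ gives zero error at the addressed steps, γ = 0 gives the smallest control norm, and in this example the true inverse corresponds to a very large |γ|.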


6.2 The γ Values for Each ILC Law as a Function of the Initial ILC Run

Since we use modified ILC laws to converge to one of these solutions that gives zero error at addressed steps, it is of interest to know what γ value is produced upon convergence as a function of $u_0$ for each choice of ILC law.

P Transpose Law: Plug $L = \phi V_d S_d U_d^T$ into Eq. (11), then the control action updates according to $u_{j+1} = u_j + \phi P_d^T e_{j,d} = (I_p - \phi P_d^T P_d)u_j + \phi P_d^T (y_d^* - d_d)$, where the subscript d denotes deleting the first row or entry in a matrix or a vector. Apply the SVD of matrix $P_d$ and partition $V_d^T u_{j+1}$ into the part that learns and the part not being updated in the iteration process, and one calculates the decoupled solution for the control action in the new space

$$\begin{bmatrix} V_{d1}^T \\ V_{d2}^T \end{bmatrix} u_{j+1} = \begin{bmatrix} I_{p-1} - \phi S_d^2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} V_{d1}^T \\ V_{d2}^T \end{bmatrix} u_j + \phi \begin{bmatrix} S_d U_d^T \left(y_d^* - d_d\right) \\ 0 \end{bmatrix} \qquad (19)$$

Denote $V_{d1}^T u_{j+1} = \tilde{u}_{d,j+1}$, $M = I_{p-1} - \phi S_d^2$ and $W_d = \phi S_d U_d^T (y_d^* - d_d)$, then the learned control action in the new space is governed by

$$\tilde{u}_{d,j+1} = M \tilde{u}_{d,j} + W_d \qquad (20)$$

whose solution is

$$\tilde{u}_{d,j} = M^j \tilde{u}_{d,0} + \left(I_{p-1} - M\right)^{-1}\left(I - M^j\right) W_d \qquad (21)$$

Since one is free to pick the learning gain φ in matrix M, we consider $\phi = 1/\sigma_{max}$ a reasonable choice, where $\sigma_{max}$ denotes the maximum singular value of the P matrix. As the iteration number j goes to infinity, the learned part of the control action expressed in the converted space converges to

$$\tilde{u}_{d,\infty} = S_d^{-1} U_d^T \left(y_d^* - d_d\right) \qquad (22)$$

Meanwhile, solving for the unlearned part and converting it back to the original space produces

$$u_\infty = P_d^{\dagger}\left(y_d^* - d_d\right) + V_{d2} V_{d2}^T u_0 \qquad (23)$$

Comparing Eq. (23) to Eq. (18), we conclude that $\gamma = V_{d2}^T u_0$ for this learning law.

Partial Isometry Law: The control updates according to $u_{j+1} = u_j + \phi V_d U_d^T e_{j,d}$. Perform the same operation as above, i.e. partitioning the converted control $\tilde{u}_j = V_d^T u_j$ into the learned and the unlearned parts as $\tilde{u}_{d,j} = V_{d1}^T u_j$ and $\tilde{u}_{f,j} = V_{d2}^T u_j$. Then one obtains Eq. (19) with minor changes from $S_d^2$, $S_d$ to $S_d$, 0 respectively. After performing the same calculation for the learned and the unlearned parts and converting back to the original space, one concludes that the value of γ is identical for this ILC law as for the P Transpose Law, and the converged control history is therefore also identical.

Quadratic Cost Law: The control updates as $u_{j+1} = u_j + \phi\left(P_d^T P_d + rI\right)^{-1} P_d^T e_{j,d}$. Given the identity $(X + Y)^{-1} = X^{-1} - X^{-1} Y \left(I + X^{-1} Y\right)^{-1} X^{-1}$ and the use of the SVD $P_d = U_d \begin{bmatrix} S_d & 0 \end{bmatrix} V_d^T$, the inverse term in the middle can be expressed as $\left(P_d^T P_d + rI\right)^{-1} = V_d\, \mathrm{diag}\left((s_1^2 + r)^{-1}, \ldots, (s_{p-1}^2 + r)^{-1}, r^{-1}\right) V_d^T$, where $s_i$ denotes the i-th singular value in matrix $S_d$. Then plug into the control updating equation and again convert to the $\tilde{u}_j$ space, and partition into the learned $\tilde{u}_{d,j}$ and unlearned $\tilde{u}_{f,j}$ by premultiplying $u_j$ with $V_{d1}^T$ and $V_{d2}^T$. One gets a modified version of Eq. (19) with $\phi S_d^2$ changed to $\mathrm{diag}\left(s_1^2 (s_1^2 + r)^{-1}, \ldots, s_{p-1}^2 (s_{p-1}^2 + r)^{-1}, 0\right)$, and $\phi S_d$ changed to $\mathrm{diag}\left(s_1 (s_1^2 + r)^{-1}, \ldots, s_{p-1} (s_{p-1}^2 + r)^{-1}, 0\right)$. Again, one gets the same control action as Eq. (23) and the same value of γ. It is interesting to note that when the iteration number goes to infinity, the control actions produced by all three ILC laws are identical, given by the JiLLL NI pseudoinverse solution plus the same term as a function of the initial iteration's choice of command.
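The common limit of the three modified laws can be verified numerically. The sketch below makes several assumptions: the toy plant y(k+1) = u(k) + a u(k−1) with a zero at z = −a, the gains φ = 1/σ_max² and φ = 1/σ_max, the weight r = 1 with φ = 1 in the quadratic cost law, and the conformable factor $V_{d1}$ in place of $V_d$ in the partial isometry law. The prediction it is compared against is $u_\infty = P_d^{\dagger}(y_d^* - d_d) + V_{d2}V_{d2}^T u_0$.

```python
import numpy as np

a, p = 3.3104, 20
P = np.eye(p) + a * np.eye(p, k=-1)      # toy plant, zero at z = -a (assumption)
P_d = P[1:, :]
U_d, s_d, Vt_d = np.linalg.svd(P_d)      # full SVD: Vt_d is p x p
V_d1, V_d2 = Vt_d[:-1, :].T, Vt_d[-1, :]

r = 1.0
laws = {
    "P transpose": P_d.T / s_d[0] ** 2,
    "partial isometry": (V_d1 @ U_d.T) / s_d[0],
    "quadratic cost": np.linalg.solve(P_d.T @ P_d + r * np.eye(p), P_d.T),
}

rng = np.random.default_rng(1)
y_star = np.sin(0.2 * np.arange(1, p + 1))
d = 0.05 * rng.standard_normal(p)        # repeating disturbance
u0 = rng.standard_normal(p)              # arbitrary initial run

# Predicted limit: pseudoinverse solution plus the null-space part of u0
predicted = np.linalg.pinv(P_d) @ (y_star[1:] - d[1:]) + V_d2 * (V_d2 @ u0)

converged, deviation = {}, {}
for name, L in laws.items():
    u = u0.copy()
    for _ in range(500):
        e_d = (y_star - (P @ u + d))[1:]     # only the addressed time steps
        u = u + L @ e_d
    converged[name] = u
    deviation[name] = np.max(np.abs(u - predicted))
    print(f"{name:17s} max deviation from prediction: {deviation[name]:.2e}")
```

All three laws end at the same control history for the same initial run, and that history differs from the Moore–Penrose solution only through the projection of $u_0$ onto $V_{d2}$.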

6.3 The Influence of the Initial Run on the Converged Final Control History

For the 3rd order discrete-time system, $V_{d2}$ is a unit vector with magnitudes of the components growing linearly on a log scale as shown in Fig. 2. Consider a one-second trajectory and a 100 Hz sample rate; then the first component has an approximate magnitude of $10^{-50}$ based on linear extrapolation, since Matlab is not able to compute this number. The magnitude then grows exponentially up to $10^{-1}$ for the last component. Therefore the latter part of $u_0$ contributes more to the value of $\gamma = V_{d2}^T u_0$. For a reasonable choice of the control action (the initial command) $u_0$, pre-multiplying it by a unit vector with such a property, one sees no reason to think that the resulting value of γ could be large. Also note that one multiplies γ by $V_{d2}$ to produce the influence on the converged control action. Hence, the influence of $u_0$ on the control is given by multiplying this initial control history by the outer product $V_{d2}V_{d2}^T$. Figure 4 gives a carpet plot of the matrix entries versus row and column number, on a linear scale. Obviously, only the last five entries in the rows and columns have much influence on the converged control action. Figure 5 further illustrates this, where the magnitudes of the matrix entries are given on a log scale. The planar surface in the back corner of the plot, ending at the 100 by 100 entry, represents correct matrix entries. As the row entries are decreased the carpet plot leaves the planar surface when the computed entries get below approximately $10^{-16}$ to $10^{-17}$, and the same happens for the column entries. This corresponds to the fact that Matlab is unable to compute these entries accurately. When both the rows and the columns are too far from the back corner of the plot, Matlab has trouble in two ways, and the computed matrix entries are all stuck in the range of $10^{-33}$.

Fig. 4 Log of magnitude of matrix entries of $V_{d2}V_{d2}^T$ (axes: Row Index, Column Index)

Fig. 5 Influence of initial input components of $u_0$ on converged control action: matrix entries of $V_{d2}V_{d2}^T$ (axes: Row Index, Column Index)

Consider some of the implications:

(1) If one wants γ to a precision of 4 significant digits, the last 5 components in $V_{d2}$ are enough. For purposes of illustration, consider that $u_0$ has all components equal to unity. Figure 6 shows the accumulation of γ, adding the terms of the inner product $\gamma = V_{d2}^T u_0$ together, progressing from step one to the final step. When entry p is reached one has the actual value of γ. Note that only the last few time steps matter. Because the entries in $V_{d2}$ alternate in sign, the figure also gives the corresponding result if one makes all entries in $u_0$ have unit magnitude and alternate in sign. But this does not produce a significantly different result.

(2) Therefore, one could use any desired input during the initial run, e.g. aiming to get close to zero error at all addressed time steps from the start, but then make the command decay to near zero for the last 5 entries. Then the final converged control action would be very close to the Moore–Penrose pseudoinverse.


Fig. 6 Illustration of how the value of γ accumulates as time steps progress (horizontal axis: Time Step Index; curves: Const. ones input, Alter series input)

(3) Alternatively, if possible, one could simply use the initial command as identically zero, in which case any of the ILC laws will converge to the Moore–Penrose pseudoinverse, corresponding to the minimum Euclidean norm control action to produce zero error at the addressed time steps.
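The accumulation of γ described in these implications can be sketched on a toy example. It assumes the hypothetical plant y(k+1) = u(k) + a u(k−1) with a zero at z = −a and p = 20. Because this toy null vector has a different profile than the one of the third-order system, the number of trailing steps needed for a given precision differs, and the last 8 steps are used here.

```python
import numpy as np

a, p = 3.3104, 20
P_d = (np.eye(p) + a * np.eye(p, k=-1))[1:, :]   # toy plant, zero at z = -a (assumption)
V_d2 = np.linalg.svd(P_d)[2][-1, :]              # unit null vector of P_d

initial_runs = {"ones": np.ones(p), "alternating": (-1.0) ** np.arange(p)}
gamma, tail = {}, {}
for name, u0 in initial_runs.items():
    terms = V_d2 * u0                 # individual terms of gamma = V_d2^T u0
    running = np.cumsum(terms)        # accumulation from step one to step p
    gamma[name] = running[-1]
    tail[name] = terms[-8:].sum()     # contribution of the last 8 steps only
    print(f"{name:12s} gamma = {gamma[name]:+.6f}   last 8 steps = {tail[name]:+.6f}"
          f"   after half the steps = {running[p // 2 - 1]:+.2e}")

# Making the end of the initial run zero removes almost all of gamma
u0_zero_tail = np.ones(p)
u0_zero_tail[-8:] = 0.0
gamma_zero_tail = V_d2 @ u0_zero_tail
print("gamma with the last 8 commands set to zero:", gamma_zero_tail)
```

In this example nearly all of γ comes from the final few entries of the initial command, and setting those entries to zero brings the converged control essentially to the Moore–Penrose solution.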

6.4 The Influence of the Initial Run on the Converged Error of the Unaddressed First Time Step

The methods used here have avoided the difficulty of inverting an ill-conditioned matrix and avoided the use of an unstable control action. But this was accomplished at the expense of not being able to get zero error in the first time step. It is of interest to ask what happens to the error for this time step(s). The final level of error is given by $e_\infty = y^* - Pu_\infty - d$, which produces zero at all time steps but the first, which gives

$$e_\infty(1) = y^*(1) - P_f u_\infty - d(1) = y^*(1) - P_f P_d^{\dagger}\left(y_d^* - d_d\right) - d(1) - P_f V_{d2} V_{d2}^T u_0 \qquad (24)$$

where $P_f = \begin{bmatrix} CB & 0 & \cdots & 0 \end{bmatrix}$ is the first row in the P matrix. The first three terms on the right hand side are pre-determined by system dynamics, the desired command trajectory, and the repeated disturbance during the iteration process. The only free choice is $u_0$. Again it is premultiplied by the matrix whose entries are studied in Fig. 5. But this time it is further premultiplied by the $P_f$ matrix of all zeros except for CB. Recall that the discrete-time B is roughly equal to the continuous time B times the sample time interval, in this case 0.01 s. So CB should not be a large number. It is very difficult to move this initial value of the error, but it also seems true that the error is not likely to be a large number.


Previously we discussed the unstable inverse from the point of view of the initial conditions determining the coefficient $C_2$ of the unstable solution $(-3.104)^k$ in Eq. (5). From this point of view one might think that one would have to be very careful with the initial condition in order to make the unstable term be near zero throughout the p time steps. Let us investigate if there is any need to be careful.

(1) First, observe how hard it is to influence the error at the first time step, and still maintain zero error at later steps. The additional initial control action $\Delta u_0$ necessary to make a change $\Delta e_\infty(1)$ satisfies

$$\Delta e_\infty(1) = -P_f V_{d2} V_{d2}^T \Delta u_0 = -(CB) \cdot V_{d2}(1) \cdot \Delta\gamma \qquad (25)$$

Numerically, Eq. (25) says that in order to make one unit change $\Delta e_\infty(1)$ in the first time step error, $\Delta\gamma$ must have a magnitude of approximately $10^{57}$ with a negative sign in front.

(2) To produce a given desired change in $\Delta e_\infty(1)$ there needs to be a large change in the control action given by

$$u_\infty = P_d^{\dagger}\left(y_d^* - d_d\right) + (\gamma + \Delta\gamma)\, V_{d2} \qquad (26)$$

(3) It is conceivable that the Moore–Penrose pseudoinverse contains a nonzero value of $C_2$ when minimizing the Euclidean norm of the control action, but the value must be of the order of $1/(z_2)^k$ so that no large control is accumulated from this term in p time steps. One might prefer to have $C_2$ identically zero, if one could find a way to do this.

(4) The instability of the control input producing zero error at the addressed time steps is then included in the $V_{d2}\gamma$ term in Eq. (18). Since $V_{d2}$ is a unit vector, in order for the unstable history in the $V_{d2}$ vector to have substantial influence on the control history, one needs a substantial value for γ, i.e. one needs a substantial component of $u_0$ on $V_{d2}$.

(5) For a given magnitude of $u_0$ the maximizing choice is to pick the initial input to be equal to $V_{d2}$, i.e. pick an unstable control input. One is not likely to do this, but in addition this only produces a value of $V_{d2}^T V_{d2} = 1$. Therefore, one also needs to make the control $u_0$ a large number multiplying $V_{d2}$, in order to generate an initial input that has substantial instability observed within the given p time steps of the problem.
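The sensitivity statements in items (1) and (2) can be illustrated numerically. The sketch assumes the hypothetical toy plant y(k+1) = u(k) + a u(k−1), for which CB = 1 and p = 20, so the numbers are much less extreme than the $10^{57}$ quoted for the third-order system with p = 100. It uses the converged control $u_\infty = P_d^{\dagger}(y_d^* - d_d) + V_{d2}V_{d2}^T u_0$, and `first_step_error` is a helper name introduced here.

```python
import numpy as np

a, p = 3.3104, 20
P = np.eye(p) + a * np.eye(p, k=-1)      # toy plant, zero at z = -a, CB = 1 (assumption)
P_d, P_f = P[1:, :], P[0, :]
V_d2 = np.linalg.svd(P_d)[2][-1, :]      # unit null vector of P_d
y_star, d = np.ones(p), np.zeros(p)
u_mp = np.linalg.pinv(P_d) @ (y_star[1:] - d[1:])

def first_step_error(u0):
    """Eq. (24), evaluated with the converged control for initial run u0."""
    u_inf = u_mp + V_d2 * (V_d2 @ u0)
    return y_star[0] - P_f @ u_inf - d[0]

e_zero = first_step_error(np.zeros(p))
e_ones = first_step_error(np.ones(p))
CB = P[0, 0]
predicted_change = -CB * V_d2[0] * (V_d2 @ np.ones(p))   # Eq. (25) with delta u0 = ones
delta_gamma_per_unit = -1.0 / (CB * V_d2[0])             # delta gamma for a unit change

print("first-step error, zero initial run:", e_zero)
print("first-step error, ones initial run:", e_ones)
print("change predicted by Eq. (25)      :", predicted_change)
print("|delta gamma| for a unit change   :", abs(delta_gamma_per_unit))
```

Even in this small example, changing the initial run from zero to all ones barely moves the first-step error, and a unit change in that error would require a |Δγ| many orders of magnitude larger than any reasonable command.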

7 Conclusions

The JiLLL NI solution to an inverse discrete-time control problem produces zero error at all time steps except the first few (one step for the example 3rd order problem with pole excess 3, 2 time steps if the problem were 5th order, 3 if it were 7th order). The inverse problem is inverting the associated difference equation, which has a zero(s) outside the unit circle that becomes an unstable pole(s) outside the unit circle. The resulting difference equation that must be satisfied has a solution of the associated homogeneous equation that is an arbitrary constant(s), determined by initial conditions, times the unstable solution(s). In order to have a stable inverse, it would seem necessary to set the initial conditions very precisely so that there would be a zero coefficient multiplying the unstable solution. We show that this is not necessary, and in fact it is very hard to set initial conditions that exhibit the instability. JiLLL NI gives a p time step control history producing the desired output at all time steps but the first (or first few). Then the control action at the first time step is determined by the Moore–Penrose pseudoinverse. ILC wants to similarly converge to zero error, but cannot use this pseudoinverse solution because it seeks zero error in the unknown world model, instead of our model of the world. The set of all possible pseudoinverses is established. It is determined that all 3 ILC laws converge to the same pseudoinverse solution, when given the same initial input used in the first ILC iteration. It is determined what choice of the initial run is required to produce the actual unstable inverse for all time steps, and also what kind of initial run is needed to result in any significant unstable control action. It is clear that one would never pick such an initial input history. If one wants to reduce the very small influence of a reasonable initial run on the ILC converged zero error control history, one can make the initial control action decay to zero for the last few time steps. If one wants to go further, and have ILC converge to the Moore–Penrose minimum Euclidean norm solution, then one can give a zero command to the control system for the entire first run. In this case, the ILC converges to the minimum Euclidean norm control solution of the Longman Stable Inverse JiLLL NI. Since the ILC does not consider the error in the first time step, the error at this time step is studied. One might worry that something wild must be done at this time step in order to produce the stable inverse for the remaining steps. It appears that this is not the case. Furthermore, requesting any significant change in the error at this first time step while keeping the remaining errors zero requires introducing unstable control time histories. The final conclusion is that all three ILC laws will converge to well behaved and very useful solutions to the inverse control problem using any reasonable initial runs, and that the ill-conditioning and the instability of the inverse model are eliminated.

Acknowledgements We wish to acknowledge the suggestions of two anonymous reviewers that have helped us to improve the original versions of this paper.

References

1. Astrom, K., Hagander, P., Sternby, J.: Zeros of sampled systems. In: Proceedings of the 19th IEEE Conference on Decision and Control, pp. 1077–1081 (1980)
2. Devasia, S., Chen, D., Paden, B.: Nonlinear inversion-based output tracking. IEEE Trans. Autom. Control 41(7), 930–942 (1996)
3. Ji, X., Li, T., Longman, R.W.: Proof of two stable inverses of discrete time systems. Adv. Astronaut. Sci. 162, Part I, 123–136 (2018)
4. Ji, X., Longman, R.W.: New results for stable inverses of discrete time systems. In: Narendra, K.S. (ed.) Proceedings of the 19th Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, CN (2019)
5. Li, T.: Eliminating the internal instability in iterative learning control for non-minimum phase systems. Doctoral Dissertation, Department of Mechanical Engineering, Columbia University, New York, NY 10027 (2017)
6. Longman, R.W., Li, T.: On a new approach to producing a stable inverse of discrete time systems. In: Narendra, K.S. (ed.) Proceedings of the 18th Yale Workshop on Adaptive and Learning Systems, Yale University, New Haven, CN (2017)
7. Li, Y., Longman, R.W.: Characterizing and addressing the instability of the control action in iterative learning control. Adv. Astronaut. Sci. 136, 1967–1985 (2010)
8. LeVoci, P.A., Longman, R.W.: Intersample error in discrete time learning and repetitive control. In: Proceedings of the AIAA/AAS Astrodynamics Conference, Providence, RI (2017)
9. Longman, R.W.: Iterative learning control and repetitive control for engineering practice. Int. J. Control, Special Issue on Iterative Learn. Control 72(10), 930–954 (2000)
10. Bien, Z., Xu, J.: Iterative Learning Control: Analysis, Design, Integration and Applications, pp. 107–146. Kluwer Academic Publishers, Boston (1998)
11. Moore, K., Xu, J.: Guest editors: special issue on iterative learning control. Int. J. Control 73(10) (2000)
12. Bristow, D.A., Tharayil, M., Alleyne, A.G.: A survey of iterative learning control. IEEE Control Syst. Mag. 26(3), 96–114 (2006)
13. Ahn, H.S., Chen, Y., Moore, K.L.: Iterative learning control: brief survey and categorization. IEEE Trans. Syst. Man Cybern. Part C 37(6), 1099–1121 (2007)
14. Jang, H.S., Longman, R.W.: A new learning control law with monotonic decay of the tracking error norm. In: Proceedings of the Thirty-Second Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, pp. 314–323 (1994)
15. Jang, H.S., Longman, R.W.: Design of digital learning controllers using a partial isometry. Adv. Astronaut. Sci. 93, 137–152 (1996)
16. Phan, M.Q., Frueh, J.A.: System identification and learning control. In: Bien, Z., Xu, J. (eds.) Iterative Learning Control: Analysis, Design, Integration, and Applications, pp. 285–306. Kluwer Academic Publishing, Norwell (1998)
17. Amann, N., Owens, D.H., Rogers, E.: Robustness of norm-optimal iterative learning control. In: Proceedings of International Conference on Control, Exeter, UK, vol. 2, pp. 1119–1124 (1995)
18. Bao, J., Longman, R.W.: Unification and robustification of iterative learning control laws. Adv. Astronaut. Sci. 136, 727–745 (2010)

Strategy Optimization in Sports via Markov Decision Problems

Susanne Hoffmeister and Jörg Rambau

Abstract In this paper, we investigate a sport strategic question related to the Olympics final in beach volleyball: Should the German team play most aggressively to win many direct points or should it rather play safer to avoid unforced errors? This paper introduces the foundations of our new two-scale approach for the optimization of sport strategic decisions. We present possible answers to the benchmark question above based on a direct approach and the presented two-scale method. A comparison shows the benefits of the new paradigm.

S. Hoffmeister, Invision AG, Leipzig, Germany
J. Rambau, University of Bayreuth, Bayreuth, Germany
© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_14

1 Introduction

Picture yourself in a beach volleyball match. The score is 14:13 for you in the third set. One more point and you win the Olympic gold medal. Your team’s repertoire contains a very risky and a slightly safer style. The risky style gives you more chances to directly win the point, at the expense of a larger probability of a direct failure. For the safer style, it is more likely that the rally will continue. How should you play? This will be the benchmark question for the concepts in this paper. Does the best style depend on the score? Does it depend on who serves first?

We approach this type of principal strategic decision problem in sports games with mathematical models based on Markov Decision Problems (MDPs) [18]. Whereas Markov chains have been used extensively, mainly to identify which skills are pivotal in sports games [13], MDPs, and even more so Markov games, are much less prominent. An MDP consists of several (temporal) stages. In all stages there are states, actions, rewards, and transition probabilities. In each stage, the decision maker can specify an action depending on the state. Then the system will move to another state in the
next stage according to the transition probabilities, which depend on state and action. Depending on the current state, the current action, and the resulting state, a reward is granted. The goal is to find a policy, such that the expected (discounted) total reward over all stages (finitely or infinitely many) is maximized. Sometimes the decision maker wants to maximize the probability that the system reaches some desired state (in our case a winning state of the game). This can be modelled by making the desirable state an absorbing state (i.e., a state in which the system remains forever) and granting a reward of one for entering this state (rewards are zero everywhere else). Thus, this way we can model how to maximize the probability to win a sports game. The problem with the applicability of MDP models in sports is the following. The number of different situations in sports games is, strictly speaking, inaccessibly large, infinite, or even uncountable. Since it is not explicitly defined what the “true” state space of a sports game is, one needs to seek for an approximation. We deliberately chose to look for a representation of the really substantially different situations by a finite number of states and actions. The finite-state-action approach comes with the question of how detailed such a finite MDP should be. For a model with continuous state space components, the analogous question arises for the number of state-space dimensions. If one uses too many states and actions, one may end up with a too complicated MDP: good policies may be too detailed to be easily adopted by the players during the gameplay, and (near) optimal policies may even be very difficult to determine. Moreover, the more complicated a model is, the more modeling parameters usually need to be estimated, which may become a problem, if the solution is sensitive to parameter deviations. If one uses only few states and actions, then the estimation of transition probabilities becomes difficult. 
Just think of the simplest possible MDP with only the three states “start”, “win”, and “loss”. Then the whole problem is reduced to the estimation of the transition probabilities. Thus, in order for a coarse MDP to be realistic, a lot of information has to be contained in the transition probabilities. Therefore, these probabilities inevitably depend on a combination of various aspects (e.g., the combination of players on the field). If for some such combination we have no or not enough historical data, then a straightforward estimate is not available.

One idea in statistics is to model, using a formula with (few) parameters, how the data is generated from other (observable) data. Then one can estimate the (few) parameters and use the evaluated formula in unknown terrain. We use the underlying principle of this idea. However, we will not employ a parametrized formula; instead, we use two Markov Decision Problems (MDPs) to answer the benchmark question. Our strategic MDP models the influence of principal strategic options. Optimal policies for it answer our question. Because these policies are principal strategic decisions, players can follow the corresponding recommendations. However, the strategic MDP’s transition probabilities are difficult to estimate. Thus, a gameplay MDP models the gameplay in detail. Its transition probabilities are easier to estimate. By simulating the gameplay MDP, we estimate the transition probabilities of the strategic MDP. We carry out this program for our example benchmark question. It
will turn out that in our coarsest MDP model the optimal decision between risky play and safe play in a service or a field attack situation depends neither on the score nor on the right to serve. We will see that the optimal decision can be found by evaluating a certain expression in the transition probabilities. If this expression is positive, then playing risky throughout is optimal; otherwise playing safe throughout is better.

To arrive at our conclusions, we have developed an all-new machinery based on MDPs. Our method differs from existing MDP approaches in that we combine two related MDPs to answer a single strategic question. A detailed MDP has transition probabilities that can be estimated; a coarse MDP provides conclusions that can be used in practice. By relating the two, we can provide useful conclusions on the basis of data that can be estimated. We implemented our approach for the example of beach volleyball. For the first time,

• we develop a strategic MDP that models whether safe or risky play yields a higher winning probability in a particular match (“s-MDP”);
• we analytically characterize an optimal policy for it in terms of the (very few) s-MDP transition probabilities (“optimize”);
• we develop a gameplay MDP that models a rally in detail depending on the players’ individual skills (“g-MDP”);
• we analyse historical competition videos from the London 2012 Olympics with an all-new video analysis tool in order to estimate the single-player-dependent transition probabilities for the g-MDP (“calibrate”);
• we simulate the g-MDP for each policy of the s-MDP, which provides estimates of the transition probabilities for the s-MDP (“simulate”);
• we derive whether safe or risky is better against a given opponent by evaluating the optimality criterion for the transition probabilities (“conclude”);
• finally, we visualize the sensitivity of the strategic MDP’s recommendation depending on skills and opponent strength in a strategy-skill score card (“present”).

For general sports games, the meta-procedure “s-MDP” → “optimize” → “g-MDP” → “calibrate” → “simulate” → “conclude” constitutes the first two-scale approach to answer principal strategic questions. The resulting skill-strategy score cards can support the choice of a strategy in an upcoming match against a particular opponent. This meta-procedure defines a research program for each principal strategic question in an individual sports game.

This paper is organized as follows. In Sect. 2, we briefly review the related work in both MDP and sports strategy research. Section 3 describes our new approach of combining two MDPs of different scale. The two MDPs are defined in Sects. 4 and 6 for our example beach volleyball. In between, Sect. 5 presents a first round of computational results based on the s-MDP alone, using direct counts from the 2012 Olympics; it shows that a direct estimation of s-transition probabilities yields volatile results with systematic shortcomings that motivate our two-scale procedure. Section 7 specifies the parameters that define a g-MDP strategy and the implementation of the two special strategies risky and safe. Our data collection from the 2012 Olympics is based on a new video analysis tool that appears in Sect. 8. We present computational results in Sect. 9, followed by a comparison in Sect. 10 and a sensitivity analysis based on skill-strategy score cards in Sect. 11. Our suggestion how to extend the method to game theory can be found in Sect. 12. We conclude the paper in Sect. 13.
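
The modelling device described in this introduction (an absorbing winning state and a reward of one for entering it, so that the expected total reward equals the winning probability) can be illustrated with a minimal sketch. The states, the dummy action `wait`, and all probabilities below are hypothetical illustrations, not the s-MDP or the data of this paper; plain value iteration is used as one standard solution method.

```python
# Toy rally MDP. All states, actions and numbers are hypothetical illustrations.
# P[(state, action)] maps each successor state to its transition probability.
P = {
    ("ours", "risky"): {"win": 0.45, "loss": 0.25, "theirs": 0.30},
    ("ours", "safe"): {"win": 0.30, "loss": 0.10, "theirs": 0.60},
    # The opponent is not a decision maker here, so there is one dummy action.
    ("theirs", "wait"): {"win": 0.15, "loss": 0.35, "ours": 0.50},
}
ACTIONS = {"ours": ["risky", "safe"], "theirs": ["wait"]}


def q_value(state, action, value):
    """Expected total reward: 1 for entering 'win', 0 otherwise."""
    total = 0.0
    for nxt, prob in P[(state, action)].items():
        reward = 1.0 if nxt == "win" else 0.0
        # Absorbing states earn nothing after entry, so their value is 0.
        total += prob * (reward + value.get(nxt, 0.0))
    return total


def value_iteration(tol=1e-12, max_iter=10_000):
    """Return the value function and a greedy policy of the toy MDP."""
    value = {state: 0.0 for state in ACTIONS}
    for _ in range(max_iter):
        new = {s: max(q_value(s, a, value) for a in ACTIONS[s]) for s in ACTIONS}
        delta = max(abs(new[s] - value[s]) for s in ACTIONS)
        value = new
        if delta < tol:
            break
    policy = {s: max(ACTIONS[s], key=lambda a: q_value(s, a, value)) for s in ACTIONS}
    return value, policy


if __name__ == "__main__":
    value, policy = value_iteration()
    print(policy["ours"], round(value["ours"], 4))
```

With these made-up numbers the value of the state `ours` is the probability of winning the rally, namely 0.495/0.85 ≈ 0.582, and the greedy policy selects `risky`; other numbers can reverse the recommendation.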

2 Sport Strategy Optimization and MDPs

If the focus of the analysis of a sports game is on interactions and strategic aspects of that game, then a dynamic model may be more appropriate than a statistical approach. Markov chains (MCs), Markov decision problems (MDPs) and Markov games (MGs) are possible frameworks for a dynamic model. They all incorporate the Markov property: the next state may depend only on the current state and not on the complete realization history of states.

An MC is a discrete-time stochastic process that can be characterized by a set of states and a probability distribution P which specifies transition probabilities between states. With an MC, it is possible to examine how a system will evolve under P given a certain initial distribution [17]. There are several contributions that use MCs to investigate different aspects in various types of sports. We only present a sample of recent applications of MCs that consider beach volleyball or volleyball, as this is the type of sport we will later use as an example.

Miskin et al. [13] investigate skill importance in women’s volleyball. The authors model play sequences as discrete absorbing MCs by using a Bayesian approach to estimate the transition probabilities from the data gathered. The data was collected during the 2006 competitive season of a single women’s Division I volleyball team. The 36 states consolidated in this analysis are moves that consist of a skill and a rating combination, e.g., a set is rated according to its distance from the net. The importance score of a skill is a metric that incorporates its impact on the desired outcome and its uncertainty. It is computed from the posterior distribution associated with the skill.

Ferrante and Fonseca [5] use an MC approach for volleyball to compute an explicit formula of the serving team’s winning probability in a set. Besides this, the mean duration of a set is computed in terms of the expected number of rallies. The authors make the assumption that the probability of winning a single rally is independent of the other rallies and constant during the game. In this way they are able to apply an MC. The states in their model correspond to different scores that may occur in a set together with an indicator which team serves next. The winning probability is computed in terms of two parameters which represent the winning probability of a rally depending on the serving team. MC properties and combinatorial arguments are used to derive the explicit formula for the winning probability. The authors applied their formula to data from the Italian Volleyball League. The calculated winning probabilities and set durations were close to real data estimates. A similar MC approach as in [13] has been used by Heiner et al. [7] for women’s soccer.

An MDP is more complex than an MC. It is a Markov decision process supplemented by an optimality criterion. The decision process incorporates a decision maker that chooses at each time step (a decision epoch) an action from a specified set of actions. This way, the transition probabilities do not only depend on the current state (like in an MC), they also involve the chosen action of the decision maker. Depending on the current state, the chosen action, and the realized next state, a reward is generated. A policy is a decision rule that prescribes an action choice for
each state. Once a policy has been fixed, the system generates an MC. Policies can be compared with respect to the optimality criterion. There are tools to determine an optimal policy for the decision maker in many settings if the problem scale is not too large. We follow the notation of Puterman’s textbook [18] throughout this paper.

In sports-related applications, MDPs are often used in connection with general, tactical considerations that are not team or match specific. Some examples are: Clarke and Norman [4] as well as Nadimpali and Hasenbein [15] investigate a Markov Decision Problem (MDP) for tennis games to determine when a player should challenge a line call. The latter is the more detailed model. We describe it briefly in the following: A decision point occurs when an opportunity to challenge the umpire arises. The states include the outcome of the point, the score, the number of challenges remaining, the probability that the call is incorrect, and the result of a successful challenge. There are two possible actions in each state: challenge and do not challenge. Further parameters of the model are the relative strength of the players and the fallibility of the officials. These parameters are used to generate the transition probabilities for the model. They use the standard linear programming approach for multi-chain, average cost MDPs to obtain optimal policies under a variety of parameter settings. Hirotsu and Wright model football as a four-state Markov process and use dynamic programming to determine the optimal timing of a substitution [8], the best strategy for changing the configuration of a team [9], or to determine under which circumstances a team may benefit from a professional foul [26]. Chan and Singal [2] use an MDP to compute an optimization-based handicap system for tennis. The weaker player gets ‘free points’ at the start of the match, such that the match-win probability of both players is equalized.
Clarke and Norman [3] formulate an MDP for cricket to determine whether the batsman should take an offered run when maximizing the probability that the better batsman is on strike at the start of the next over. The model is solved analytically by dynamic programming. Norman [16] builds a more aggregated MDP for tennis games to tackle the question when to serve fast or when to serve slow at each stage of a game. The model is solved analytically using a monotonicity property of the optimal cost function and dynamic programming. Most of the MDPs on team or match dependent sport-strategic decisions are retrospective: Terroba et al. [21] develop an MDP-based framework for tennis matches. The information needed to build the model is semi-automatically gathered from broadcast sports videos. Machine learning algorithms are executed to identify optimal policies. They also present a novel modification to the Monte Carlo tree search algorithm and apply their model to popular tennis matches of the past. They present how the player who has lost in reality could have won the match with identical skills, just by using a different policy. To the best of our knowledge, the only MDPs that take player skills into account and could be applied to future matches exist for baseball. Wright and Hirotsu [25] formulate a Markov model for baseball to calculate an optimal pinch hitting strategy under the ‘Designated Hitter Rule’. Their method can be applied to a specific match by using the probability of each player to achieve a single, double, triple, home run, walk, or out.

An MG is a stochastic game in an MDP-like environment. Instead of one decision maker, there exists a whole set of players. At each decision epoch, each player chooses an action from his action set. So, the transition probabilities and rewards incorporate the decisions of all players. An MG gets even more complex through the different optimality criteria of the players. Therefore, policies that are simultaneous best responses, i.e., Nash equilibria, are the focus of interest. In some cases, most notably when an MG with finite strategy sets has pure Nash equilibria, they can be found algorithmically. This requires the repeated computation of best responses to fixed strategies. Our method in this paper can also be seen as a building block for this problem. In Sect. 12 we show first results in this direction. More on MGs can be found in Webb’s textbook [24].

To the best of our knowledge, there exist only a few applications of a Markov Game (MG) to optimize the policies a priori for a particular sports game. Kira et al. [11] formulate an MG for baseball and compute Markov perfect equilibria for both teams. The transition probabilities of the MG are assumed to depend only on the probability parameters for the hitting skills of the players. They use a dynamic-programming algorithm for solving the Bellman equations that characterize the value function of the game for both teams. However, MG models have been applied in the context of sports strategies in a more general set-up: Walker et al. [23] use Binary Markov Games to model a sports game like tennis and derive that under certain monotonicity properties optimal policies to win the match are a repeated application of an optimal policy to win a rally (our results on the optimality of myopic policies in Sect. 4 are related, but in the MDP set-up). Turocy [22] uses MG models fed with massive historical data in order to clarify whether there really has been a “last-up” advantage in baseball on average in the past.
Routley and Schulte [19] employ MG models to rank ice hockey players according to their skills. In an upgraded MG model also location information is included [20]. Anbarc et al. [1] have tried to decide the fairness of tie break mechanisms on the basis of MG models. Why did we choose to utilize MDPs instead of MGs in this paper, although our policy might be influencing the policy of our opponent? The answer is two-fold: first, in order to investigate MG models, the problem of characterizing best-responses to given policies is important. By investigating MDPs, we cover this step. Second, for strategical decision support, MGs would guide us to the best policy against a strategically perfect opponent. In most cases, this is not what we want; we consider it rather more successful to adapt to the special strengths and weaknesses of a particular opponent. In future models, we plan to incorporate dimensions like variability into the MDP setting. This will at least cure some of the short-comings of MDP models in this regard. If the sets of strategy choices are finite and small (like for the benchmark problem in this paper), our approach can be applied to solve finite constant-sum games modeling the behaviours of both opponents (see Sect. 12).
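
The Markov-chain view of a set that is attributed above to Ferrante and Fonseca [5] (states are scores plus the serving team, with two constant rally-winning parameters) can be sketched numerically. The following is not their explicit formula; it is a small recursion under the same independence assumption. The function name, the default `target=21`, the two-point-lead rule, and the rule that the rally winner serves next are assumptions of this sketch.

```python
from functools import lru_cache


def set_win_probability(p_a, p_b, target=21, a_serves_first=True):
    """Probability that team A wins one set under rally scoring.

    p_a: probability that A wins a rally when A serves (assumed constant).
    p_b: probability that B wins a rally when B serves (assumed constant).
    The rally winner serves next; a set needs `target` points and a
    two-point lead. `target` is an assumed parameter of this sketch.
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    d = 1.0 - q_a * q_b
    det = d * d - p_a * q_a * p_b * q_b
    if det <= 0.0:
        raise ValueError("the tie-game never ends for these parameters")
    # Tie-game (both teams have at least target - 1 points): A's winning
    # probability at a tied score, obtained by solving a 2x2 linear system.
    tie = {"A": p_a * p_a / det, "B": q_b * p_a * (d + p_a * p_b) / det}
    lead_a = p_a + q_a * tie["B"]  # A is one point ahead and serves
    lead_b = q_b * tie["A"]        # B is one point ahead and serves

    @lru_cache(maxsize=None)
    def win(a, b, server):
        if a >= target and a - b >= 2:
            return 1.0
        if b >= target and b - a >= 2:
            return 0.0
        if min(a, b) >= target - 1:
            if a == b:
                return tie[server]
            return lead_a if a > b else lead_b
        p_point_a = p_a if server == "A" else q_b  # A wins the next rally
        return (p_point_a * win(a + 1, b, "A")
                + (1.0 - p_point_a) * win(a, b + 1, "B"))

    return win(0, 0, "A" if a_serves_first else "B")


if __name__ == "__main__":
    print(set_win_probability(0.40, 0.35))
```

The closed-form tie-game values avoid an infinite recursion at scores such as 20:20; everything below that score is a finite recursion over the score states.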

3 The New Two-Scale MDP Approach

We seek elementary strategic guidelines for a sports game. The dilemma in MDP-modelling of sports strategy optimization is the following: A compact MDP directly developed for the strategic question (“Should I play safe or risky?”) may allow us to analytically or numerically solve for an optimal policy based on the input data. However, this input data, in particular the transition probabilities (e.g., for an attack directly winning a point in a beach volleyball rally), often depends on the combination of all players involved. Consequently, these probabilities are hard to estimate, since historical data for all combinations of players is needed (e.g., for the direct point probability, the setting and hitting skills of the attacking team’s players and reception skills of the receiving team’s players are all relevant at the same time). For a more detailed model where transitions only depend on individual actions (e.g., the hard smash aimed at a certain spot in the field happened as intended), transition probabilities can be estimated more easily, since only individual success probabilities for a single player are needed. Such detailed MDP models with billions of states could be very complicated to solve for optimal policies, but that is not the main problem. A detailed MDP will inevitably produce recommendations on the level of exact individual actions in all kinds of special situations (e.g., whenever your opponent has taken certain positions and the ball is in another position and flies in a certain direction and your team mate’s position is in another certain position …, then your next hit should be a set to a certain position). Since strategy recommendations have to be implemented by humans eventually, such outcomes would be impractical.

However, the idea to use coarser, higher-level MDP models can lead to a very difficult input-data problem: The details that we may want to leave out in the model are not irrelevant. They appear in aggregated form in the transition probabilities, which often depend on the opponent’s behaviour. The consequence is that these probabilities can hardly be estimated whenever up-to-date observations for our team and the actual opponent are not available. In contrast, in a detailed model the transition probabilities may refer to very simple state transitions. This could be the probability that a certain hit is performed successfully, almost successfully, or failed completely. Such probabilities can be observed in special training sessions or in videos of historic events, independently of the skills of an opponent.

It seems that each modelling granularity has something to offer. Therefore, we use both, i.e., we employ two MDPs instead of one for the optimization of a sports strategy and relate them to each other. One is coarse and one is detailed. We obtain our new two-scale method. The coarse MDP is called the strategic MDP, s-MDP for short. It represents the principal influence of the strategic decision in question on the winning probability. It uses as its basis a plausible segmentation of the gameplay into strategic pieces. This could be a phase of ball possession or the like. This s-MDP will have moderate size and a simple structure so that finding an optimal policy is within reach analytically or numerically. Moreover, players can implement the resulting recommendation in practice. However, its transition probabilities need not be easily observable. The
detailed MDP is called the gameplay MDP, g-MDP for short. It is an ordinary MDP but with a very fine granularity. Since the g-MDP is only used in a simulation, it does not matter if it has billions of states. It represents the dynamics of the gameplay in greater detail in such a way that its gameplay-decisions (g-decisions) and gameplay-state transitions (g-transitions) can be related to strategy decisions (s-decisions) and strategy-state transitions (s-transitions) in a meaningful way. Neither the size nor the structure of the g-MDP is restricted. What we care about is that all transition probabilities in the g-MDP can be observed up to an acceptable accuracy, possibly by some additional effort like special training sessions or video analysis.

Whether or not we are scoring a direct point depends on us and the opponent. So, observations must be classified according to pairs of opponent teams, and there are many. This leads to a very sparse data basis to estimate probabilities. One way out is to develop a model of how the probabilities come about. Our idea means: instead of standard parametric statistics, we use a g-MDP to generate the transition probabilities. A suitable g-MDP has single moves and hits as actions. The g-states store the players’ and the ball’s positions plus some technical information like how many times in a row the ball has been touched by the same team, who touched the ball before, and whether the ball was hit hard before. In order to gain an advantage over using the s-MDP alone, we allow state transitions whose g-transition probabilities only depend on the skills of the player hitting the ball. For example, the action “smash targeted at a certain position in the opponent’s field” leads to follow-up states only depending on to what extent the smash was carried out successfully. Whether or not the opponent can return the smash in a controlled fashion solely depends on the returning player’s skills. The resulting g-MDP will be complicated.
It will possibly be hard to find optimal g-policies. And: a g-policy will be complicated to implement during gameplay. But: by Monte Carlo simulation (g-simulation) of the g-MDP we can in certain cases estimate the resulting s-transition probabilities.

More specifically, we have to relate the g-MDP to the s-MDP as follows: For each s-policy we have to specify what g-actions fit to this policy in the g-MDP. Call a feasible sequence of g-decision rules over the epochs of a phase of ball possession an attack plan. Any set of such attack plans is called an attack type. A probability distribution over an attack type is called an attack style. Now, we assign to each s-policy an attack style.¹ We call this assignment the s-g-implementation. For example, if at a certain score our team is in possession of the ball and wants to play the s-policy risky, then we can assign to it a probability distribution over the set of all attack plans ending with the most risky (i.e., close to the border of the field or a hard hit like a smash) attack-hit available in the respective situation.² The set-up of such an s-g-implementation requires the classification of all g-decision functions in each g-state by s-decisions. Thus, a vast amount of case-by-case analysis is necessary using expert knowledge of the particular game. Usually, more than one g-decision function is possible to represent a single s-decision; in that case, we choose one uniformly at random.

¹ In other words, an attack style is a mixed partial g-policy consisting only of decision rules that belong to some attack type.

² Note that it would not be sufficient to assign a probability distribution over a set of actions to an s-policy, since any hit in a sequence of actions could fail with some probability, resulting in a state without feasible actions. In contrast to this, an attack plan, which uniquely determines a sequence of decision rules, returns for all possible resulting failure states an action to cope with it. For example, in a failed attempt to set the ball properly, the most risky smash available might be much less aggressive than the safest smash available after an excellent set; this possibility could not be covered by classifying actions.

A different viewpoint is that the s-g-implementation is the formal definition of what a coach actually means by playing safe or risky. Using this connection, one can count in simulation how often one realization of the risky attack style results in a direct point, a direct failure, or a continuation of the rally, which correspond to outcomes of s-transitions in our example. Essentially, the g-MDP is utilized in the same way as a family of Markov chains, parametrized by classes of g-decisions, each induced by the possible s-decision it implements. An MDP model is the better viewpoint here, since a formal connection between the s-MDP and the g-MDP needs the concept of g-decisions.

Since the transition probabilities of the s-MDP usually depend on the combination of all players involved, it is difficult to estimate them by historical observations. At worst, a certain combination of players may have never played in a match before. Here, our two-scale approach comes into play. The related g-MDP models each player’s individual actions (in basketball, e.g., this might mean dribble, pass, shoot, from where, to where, …). The individual player probabilities, called skills, describe the outcomes of these player’s actions and constitute the g-transitions. The advantage of the g-MDP is that the g-transitions only depend on a single player’s skills and such skills are easier to estimate. They do not depend on the combination of players and can therefore be estimated from arbitrary matches of that single player or even from training experiments. Usually, several g-transitions in the g-MDP (e.g., again for basketball: pass around, no-look pass into the attack area, shot, score) constitute one s-transition (i.e., we score) in the s-MDP. The s-transition probabilities of the s-MDP can then be estimated by counting the g-transitions in g-MDP-simulations on the basis of estimated skills. Given the simulated s-transition probabilities, one can solve the s-MDP and find an optimal s-policy, which represents a principal strategic recommendation.

Note that in order to showcase the concept in this paper in a more concise way, we have chosen to restrict our s-MDP for beach volleyball to only a few possible policies. In principle, which out of two (or few) policies is the best could be estimated by simulating the corresponding Markov chains in the g-MDP for the durations of complete games (rather than single phases of ball possession) at the cost of longer computation times. Even in our simple case such a brute-force numerical approach would miss some important information: The analysis of the s-MDP provides us with structural results (Theorems 1 and 2 and the winning probabilities of the tie-game in Sect. 4) and with useful sensitivity information (see Sect. 11). This information follows from how exactly the simulated probabilities influence the quality of the policies. This information would be substantially more difficult to obtain
by simulation alone. Furthermore, in our approach the policies in the s-MDP can be chosen to be much more involved as long as the s-MDP can be solved fast analytically or numerically. The evaluation of the g-MDP can even be performed on-demand inside a possible interactive solution method for the [email protected] Beyond this, our setting allows for the following extension: we can make the s-g-implementation the subject of optimization for each s-strategy. Implementing this extension, however, would be beyond the scope of this paper. In the gameplay situation, the players have to decide about the explicit g-actions they perform next, depending on the optimal s-policy and the states they encounter. They will do this exactly by mimicking the s-g-implementation. Whenever in reality all attack plans in an attack type are carried out similarly often as in the g-simulation, the actual s-transition probabilities will be similar to the simulated s-transition probabilities. The exact form of the s-g-implementation is defined through expert knowledge. It can be implicitly based on intuitive understanding of the players. This has the advantage that no brain power is needed for it during gameplay. Alternatively, the team may want to establish an explicit encoding what an s-policy (e.g., play risky) is supposed to mean in terms of an attack style (play “any attack combination with the hardest-possible smash closest-possible to the boundary of the field” or the like). For beach volleyball, we have implemented this idea for one strategic question (see Sects. 4–9). Although in this paper, the only worked example is from beach volleyball (other examples like tennis can be worked out in a similar way), the two-scale MDP paradigm can be used for other sports games as well. For the particular sports game, one first has to develop an s-MDP, which models the strategic question. Second, one needs a sufficiently related g-MDP with observable transition probabilities. 
The g-MDP serves as a device to estimate the transition probabilities in the s-MDP. For this purpose, s-transitions are counted in simulations of the g-MDP.

Consider, e.g., basketball. One interesting strategic question is whether to provoke a very fast tempo with high risk against a then less consolidated defence or to play calmly with low risk against a completely settled defence configuration. Or soccer: should one preferably play high long passes behind the defending lines, or should one play short low passes? A possible s-MDP would then consist of states corresponding to the principal situations (score, ball possession, phase of attack, shot opportunity) in which our team can choose to play “fast” or “slow” (basketball) or “high” or “low” (soccer) in order to influence the transition probabilities. Given these probabilities, we could solve the s-MDP and make a recommendation: “fast” or “slow”, or “high” or “low”, respectively.

In the following sections, we will present an s-MDP/g-MDP pair for beach volleyball that finds rules for when to play risky or safe, depending on the skills of the individual players and the opponent’s skills. Even for beach volleyball, we note that our special choices of an s-MDP and a g-MDP are by no means unique. Our choices were guided by the wish to base the answer to the team-strategic question on the skills of the individual players for common hitting techniques. All rules concerning beach volleyball can be found in official documents by the Fédération Internationale

Strategy Optimization in Sports via Markov Decision Problems


de Volleyball [6]. We will briefly sketch the most important scoring rules when they become relevant in Sects. 4 and 6. The two-scale MDP paradigm can be transferred to other sports games, but the concrete implementation in this paper, i.e., the s-MDP, the g-MDP, and the data estimation are tied to the beach volleyball example with the benchmark question. A concrete implementation of our new paradigm for interesting strategic questions in basketball and other sports games is research in progress.

4 A Strategic MDP for Beach Volleyball

From now on we show how our two-scale model can be applied to a particular sports game, namely beach volleyball, and a particular strategic question, namely safe versus risky play. At this point, safe and risky play are just names for two strategies in the s-MDP. Eventually, it is only the s-g-implementation that will give them a concrete meaning in terms of classes of detailed g-decision rules. In this section, we specify an s-MDP for our benchmark question. Recall that we want to find out in which situations (score, possession, serve or field attack) risky play will lead to a higher set-winning probability than safe play. Our strategy is to construct the s-MDP as simply as possible.

The benchmark question requires us to model the actions of one of the two teams. Moreover, we distinguish between service play and field attack play, since it might be optimal to serve safe and to attack risky or vice versa. Let Team P be the team whose strategy we want to optimize, and let Team Q be P’s opponent. The control set in all states $s$ where Team P possesses the ball is given by $A_s := \{\text{risky}, \text{safe}\}$. The control set in the states where Team Q possesses the ball contains a unique dummy control.

Aiming at the benchmark question, a state has to contain the current score, which team starts the next attack plan, and an indicator whether the state is a serving state or not. Thus, the simplest possible state space with respect to the benchmark question is

$$S^{\text{reg}} := \{(x, y, k, \ell) \mid x, y \in \mathbb{N},\ k \in \{P, Q\},\ \ell \in \{0, 1\}\}.$$

Here, $x$ and $y$ denote the scores of Team P and Q, respectively. Moreover, $k$ specifies which team possesses the ball, and $\ell$ encodes whether this is a serving state ($\ell = 1$) or a field attack state ($\ell = 0$). Let us restrict to matches consisting of a single set to 21 points in the following. The set of winning states $S^{\text{win}}$ contains all states where Team P has won the set, e.g., states where P has 21 points and Q no more than 19.
Similarly, the set of losing states $S^{\text{lose}}$ contains all states where Team P has lost the set, e.g., where Q has 21 points and P no more than 19. At states with a score of 20 : 20, the so-called “tie game” starts, where a team wins if it has a lead of 2 points. As our s-MDP should have a finite number of states, we use a different state representation for the tie game. Instead of remembering the number of points for Team P and Team Q separately, we only record the point difference of the two teams in a state. So, the states of the tie game are


$$S^{\text{tie}} = \{(z, k, \ell) \mid z \in \{-2, -1, 0, 1, 2\},\ k \in \{P, Q\},\ \ell \in \{0, 1\}\},$$

which are only finitely many states. This kind of state representation is not possible in the regular set, since the absolute number of 21 points must be reached to win a set. In the tie game, only a relative criterion must be fulfilled. Note that with this simpler representation it is not possible to make the s-transition probabilities dependent on the duration of the tie game. Incorporating dependence on the duration requires a more complicated solution procedure, generally using a countably infinite number of states. In this paper, we stick to stationary probabilities, which is not an uncommon assumption in professional sports [12]. Using the relative notation for the tie-game states, we have finitely many states that describe a beach volleyball set. Let $S^{\text{win}}$ and $S^{\text{lose}}$ contain all winning and losing states for Team P, respectively; they are modelled as absorbing states that are, once entered, never left.

A decision epoch starts when Team P gains control over the ball and starts its attack. The decision epoch ends when Team P makes a fault or a point, or when the attack is successful but Team Q has gained control over the ball and starts its own attack plan. The actions of Team Q are modelled as part of the transition probabilities in the s-MDP. These decision epochs in general allow for infinitely many stages.

Let $p^+(P)_a$ [$p^-(P)_a$] be the probability that Team P playing action $a$ directly wins [loses] the rally. The corresponding probabilities for Team Q are denoted by $p^+(Q)$ and $p^-(Q)$, respectively. As abbreviations, we denote the probabilities that none of this happens by $p^0(P)_a := 1 - p^+(P)_a - p^-(P)_a$ and $p^0(Q) := 1 - p^+(Q) - p^-(Q)$, respectively. Since a serving attack has transition probabilities clearly different from a field attack, we distinguish between them. This is denoted by a superscript field or serve on the transition probabilities.
In the following, we consider the strategic options risky and safe either for the service or for the field attack. Thus, the evolution of the system is governed by twelve probabilities

$$p^+(P)_a^{\text{att}},\quad p^-(P)_a^{\text{att}},\quad p^+(Q)^{\text{att}},\quad p^-(Q)^{\text{att}}, \qquad a \in \{\text{risky}, \text{safe}\},\ \text{att} \in \{\text{serve}, \text{field}\}.$$

These probabilities induce all transition probabilities by incrementing points and changing the right to serve in the obvious way. Entering a winning state yields a reward of one; all other transitions have reward zero. Table 1 summarizes our s-MDP. Note how the whole construction of our s-MDP was guided only by the benchmark question, not by finding the most compact representation or by being able to solve the model. In Fig. 1, we illustrate the resulting transition diagram in the case that P serves first in the set for a simplified beach volleyball set requiring only two (instead of 21) points for a win. At the states (1, 1, P, 1) and (1, 1, Q, 1), the tie game starts.

Our s-MDP was constructed with a symmetric view on teams P and Q: the only difference is that Team P can choose a strategy whenever in possession of the ball, whereas Team Q’s strategy is fixed. In practice, we aim at optimizing the strategy for Team P using a strategy for Team Q that has been estimated from earlier games in the same tournament or the like. In Sects. 11 and 12 we will discuss sensitivity issues and extend this best-response approach to a finite constant-sum game setting, which allows us to prepare for more than one opponent strategy.


Table 1 Strategic MDP (s-MDP)

It will turn out that the problem to find an optimal policy can be partitioned into two special cases of the s-MDP because the optimal policies for them are myopic (see Appendix 1) and, thus, do not interfere with each other. One case is the serving s-MDP, where the attack types differ only in the serving technique, i.e., As = {risky, safe} for all serving states s = (x, y, P, 1) of P. The other case is the field attack s-MDP, where the attack types differ in the attack plans used during a field attack, i.e., As = {risky, safe} for all non-serving states s = (x, y, P, 0) of P.

[Figure: transition diagram of the s-MDP for a simplified set to two points with P serving first. The nodes are the states (0, 0, P, 1), (1, 0, P, 1), (1, 0, P, 0), (1, 0, Q, 0), (0, 0, P, 0), (0, 0, Q, 0), (0, 1, Q, 1), (0, 1, Q, 0), (0, 1, P, 0), the absorbing states (2, 0, P, 1) and (0, 2, Q, 1), and the tie-game entry states (1, 1, P, 1) and (1, 1, Q, 1). The edges are labelled with the serve and field probabilities $p^+$, $p^-$, $p^0$ of P and Q.]

Fig. 1 General s-MDP

One may observe that there are many states in the two subproblem s-MDPs in which no action of P is required. In an MDP these states are unnecessary. By concatenating all paths between states in which P has to make a decision, we can transform the s-MDPs into versions in which only decision states of P, absorbing states and, for better readability, some additional states occur. Moreover, since there exists a stationary optimal policy and the choice of an action in $A$ for P depends in both subproblems only on the score, P plays the same action in all decision states with identical scores. Therefore, all transitions that neither change the score nor involve actions of P can be merged. This requires the evaluation of some geometric series in a straightforward fashion. For simpler notation of the result, we define the following probability terms for score changes in the serving s-MDP (based on an arbitrary fixed field attack strategy for Team P) and in the field attack s-MDP (the events are denoted in parentheses):

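$$
\begin{aligned}
\alpha_Q^{\text{serve}} &= \frac{p^-(Q)^{\text{field}} + p^0(Q)^{\text{field}}\, p^+(P)^{\text{field}}}{1 - p^0(Q)^{\text{field}}\, p^0(P)^{\text{field}}}, && \text{(reception Q, point for P)} \\
\beta_Q^{\text{serve}} &= \frac{p^+(Q)^{\text{field}} + p^0(Q)^{\text{field}}\, p^-(P)^{\text{field}}}{1 - p^0(Q)^{\text{field}}\, p^0(P)^{\text{field}}}, && \text{(reception Q, point for Q)} \\
\alpha_P^{\text{serve}} &= \frac{p^+(P)^{\text{field}} + p^0(P)^{\text{field}}\, p^-(Q)^{\text{field}}}{1 - p^0(P)^{\text{field}}\, p^0(Q)^{\text{field}}}, && \text{(reception P, point for P)} \\
\beta_P^{\text{serve}} &= \frac{p^-(P)^{\text{field}} + p^0(P)^{\text{field}}\, p^+(Q)^{\text{field}}}{1 - p^0(P)^{\text{field}}\, p^0(Q)^{\text{field}}}, && \text{(reception P, point for Q)} \\
\alpha_a^{\text{field}} &= \frac{p^+(P)_a^{\text{field}} + p^0(P)_a^{\text{field}}\, p^-(Q)^{\text{field}}}{1 - p^0(P)_a^{\text{field}}\, p^0(Q)^{\text{field}}}, && \text{(ball possession P, point for P)} \\
\beta_a^{\text{field}} &= \frac{p^-(P)_a^{\text{field}} + p^0(P)_a^{\text{field}}\, p^+(Q)^{\text{field}}}{1 - p^0(P)_a^{\text{field}}\, p^0(Q)^{\text{field}}}, && \text{(ball possession P, point for Q)} \\
\gamma_a^{\text{serve}} &= p^+(P)_a^{\text{serve}} + p^0(P)_a^{\text{serve}}\, \alpha_Q^{\text{serve}}. && \text{(service P, point for P)}
\end{aligned}
$$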
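The score-change probabilities defined above can be evaluated with a few lines of code. The following is a minimal sketch, not the authors' implementation: the container `Direct` and the function names are hypothetical, and the numbers used to exercise them are made up for illustration. It only encodes the geometric-series aggregation for a team in possession during a field attack, and the serving term $\gamma_a^{\text{serve}}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Direct:
    """Direct outcome probabilities of one attack (hypothetical container)."""
    win: float   # p^+: the attacking team wins the rally directly
    lose: float  # p^-: the attacking team loses the rally directly

    @property
    def cont(self):
        # p^0 := 1 - p^+ - p^-: the rally continues with the other team
        return 1.0 - self.win - self.lose


def possession_split(own, other):
    """Return (alpha, beta) for the team holding the ball in a field attack.

    alpha is the probability that the team in possession eventually wins the
    point, beta the probability that it eventually loses it. The denominator
    sums the geometric series over exchanges in which both teams keep the
    ball in play.
    """
    denom = 1.0 - own.cont * other.cont
    alpha = (own.win + own.cont * other.lose) / denom
    beta = (own.lose + own.cont * other.win) / denom
    return alpha, beta


def gamma_serve(p_serve, p_field, q_field):
    """gamma_a^serve = p^+(P)_a^serve + p^0(P)_a^serve * alpha_Q^serve."""
    # With Q in possession, the second component is "point for P",
    # which is alpha_Q^serve in the notation of the text.
    _, alpha_q_serve = possession_split(q_field, p_field)
    return p_serve.win + p_serve.cont * alpha_q_serve
```

For example, `possession_split(Direct(0.5, 0.2), Direct(0.4, 0.1))` returns two values that sum to one, because every rally ends with a point for one of the teams.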

[Figure: aggregated transition diagram of the serving s-MDP for a set to two points, with the serving states (0, 0, P, 1), (1, 0, P, 1), (0, 1, Q, 1), the absorbing states (2, 0, P, 1) and (0, 2, Q, 1), and the tie-game entry states (1, 1, P, 1) and (1, 1, Q, 1). The edges are labelled with the serve probabilities of P and Q combined with $\alpha_Q^{\text{serve}}$, $\beta_Q^{\text{serve}}$, $\alpha_P^{\text{serve}}$ and $\beta_P^{\text{serve}}$.]

Fig. 2 Serving s-MDP

[Figure: aggregated transition diagram of the field attack s-MDP for a set to two points, with the states (0, 0, P, 1), (0, 0, P, 0), (1, 0, P, 1), (1, 0, P, 0), (0, 1, Q, 1), (0, 1, P, 0), the absorbing state (0, 2, Q, 1), and the tie-game entry states (1, 1, P, 1) and (1, 1, Q, 1). The edges are labelled with the serve probabilities of P and Q combined with $\alpha_a^{\text{field}}$ and $\beta_a^{\text{field}}$.]

Fig. 3 Field Attack s-MDP

Note that $\alpha_a^{\text{field}} + \beta_a^{\text{field}} = 1$. Figures 2 and 3 illustrate the outcome of this transformation for the two-point serving and field attack s-MDPs. The same transformation is also possible for the tie game, since the structure of the transition probabilities in the tie game is identical to the regular game. The s-MDP can be solved analytically for an optimal policy. The total expected reward is monotonically increasing in the number of points of P and monotonically decreasing in the number of points of Q (Appendix 1 provides a formal proof for this plausible proposition). Due to this property, a myopic policy that maximizes the probability to win the next point is optimal. This result was found independently of a result by Walker et al. [23], who proved that, given a monotonicity property, a myopic policy is optimal for binary Markov games. Since the transition probabilities are identical in every stage of the game, the optimal myopic policy stays the same throughout the game. Our theoretical main result is the following:

Theorem 1 (Optimal Policy: Field Attack s-MDP) There exists a stationary optimal policy that chooses in each state the action $a^*$ with $\alpha_{a^*}^{\text{field}} \ge \alpha_a^{\text{field}}$ for all $a \in A$.


Theorem 2 (Optimal Policy: Serving s-MDP) There exists a stationary optimal policy that chooses in each state the action $a^*$ with $\gamma_{a^*}^{\text{serve}} \ge \gamma_a^{\text{serve}}$ for all $a \in A$.

Note that it is important for this result that a rally cannot result in a draw (like in the famous MDP textbook example of a multiple-game chess match, where it is optimal to play safe when in the lead). Also, games in which time marks the end (like soccer) need a different analysis. Since we have only two actions in $A$, risky play ($a^* = \text{risky}$) is optimal throughout whenever $\alpha_{\text{risky}}^{\text{field}} - \alpha_{\text{safe}}^{\text{field}}$ is nonnegative in the field attack s-MDP or when $\gamma_{\text{risky}}^{\text{serve}} - \gamma_{\text{safe}}^{\text{serve}}$ is nonnegative in the serving s-MDP, respectively. The optimal policy is unique if these expressions are strictly positive. Observe that $\alpha_Q^{\text{serve}}$ depends on the played field attack strategy. To determine the best combination of a field attack and a serving strategy, one first has to compute the best field attack strategy according to Theorem 1, calculate $\alpha_Q^{\text{serve}}$ for this field attack strategy, and then apply Theorem 2 to determine the optimal serving strategy based on the optimal field attack strategy. As an optimal decision rule in the s-MDP depends only on the situation type, and furthermore only selects between two actions, we define a decision in the s-MDP as a mixture between risky and safe that only depends on the situation type:

Definition 1 (s-MDP decision rule) An s-MDP decision rule is a mixture between the two actions risky and safe, where $\text{mix}^{\text{sit}}$ is the fraction by which the action risky is chosen in situation sit.

Even if Theorems 1 and 2 are sufficient to characterize an optimal strategy in the whole s-MDP, we want to give an analytic formula for the winning probability of the tie game. As before, stationary data guarantees the existence of an optimal stationary policy, and we can aggregate the transitions and states such that we get a simplified representation that consists only of serving states.
The summarized transition probabilities are computed from the original s-MDP transition probabilities by using the geometric series. The cumulated probabilities of gaining or losing a point if serving variant $a$ and field attack variant $b$ are played are:

$$
\begin{aligned}
\alpha_P^{a,b} &:= p^+(P)_a^{\text{serve}} + p^0(P)_a^{\text{serve}} \cdot p^-(Q)^{\text{field}} + p^0(P)_a^{\text{serve}} \cdot p^0(Q)^{\text{field}} \cdot \alpha_{b,P}^{\text{field}}, \\
\beta_P^{a,b} &:= p^-(P)_a^{\text{serve}} + p^0(P)_a^{\text{serve}} \cdot p^+(Q)^{\text{field}} + p^0(P)_a^{\text{serve}} \cdot p^0(Q)^{\text{field}} \cdot \beta_{b,P}^{\text{field}}.
\end{aligned}
$$

So, we compute the terms for every combination of a serving variant $a$ with a field attack variant $b$. In a serving state of Team Q, Team P only has to choose a field attack variant $b$. The cumulated probabilities of gaining or losing a point if field attack variant $b$ is played are:

$$
\begin{aligned}
\alpha_Q^{b} &:= p^-(Q)^{\text{serve}} + p^0(Q)^{\text{serve}} \cdot \alpha_{b,P}^{\text{field}}, \\
\beta_Q^{b} &:= p^+(Q)^{\text{serve}} + p^0(Q)^{\text{serve}} \cdot \beta_{b,P}^{\text{field}}.
\end{aligned}
$$

Figure 4 visualizes the aggregated tie-game.

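[Figure: aggregated transition diagram of the tie game over the states (0, P), (1, P), (0, Q), (−1, Q) and the absorbing states (2, P) and (−2, Q). From the serving states of P the edges carry $\alpha_P^{a,b}$ and $\beta_P^{a,b}$; from the serving states of Q they carry $\alpha_Q^{b}$ and $\beta_Q^{b}$.]

Fig. 4 Tie Game s-MDP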

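For the winning probabilities $v_{0,P}^{a,b}$ and $v_{0,Q}^{a,b}$ for Team P in the tie states $(0, P)$ and $(0, Q)$, we can set up a system of equations

$$
\begin{aligned}
v_{0,P}^{a,b} &= \alpha_P^{a,b} \cdot v_{1,P}^{a,b} + \beta_P^{a,b} \cdot v_{-1,Q}^{a,b}, \\
v_{1,P}^{a,b} &= \alpha_P^{a,b} \cdot 1 + \beta_P^{a,b} \cdot v_{0,Q}^{a,b}, \\
v_{0,Q}^{a,b} &= \alpha_Q^{b} \cdot v_{1,P}^{a,b} + \beta_Q^{b} \cdot v_{-1,Q}^{a,b}, \\
v_{-1,Q}^{a,b} &= \alpha_Q^{b} \cdot v_{0,P}^{a,b} + \beta_Q^{b} \cdot 0.
\end{aligned}
$$

Solving this system of equations yields the following formula for the winning probability of P depending on the service strategy $a$ and the field attack strategy $b$:

$$
v_P^{a,b} = \frac{(\alpha_P^{a,b})^2}{(1 - \alpha_Q^{b} \beta_P^{a,b})^2 - \alpha_P^{a,b} \alpha_Q^{b} \beta_P^{a,b} \beta_Q^{b}},
\qquad
v_Q^{a,b} = \frac{\alpha_P^{a,b} \alpha_Q^{b} \left(\alpha_P^{a,b} \beta_Q^{b} - \alpha_Q^{b} \beta_P^{a,b} + 1\right)}{(1 - \alpha_Q^{b} \beta_P^{a,b})^2 - \alpha_P^{a,b} \alpha_Q^{b} \beta_P^{a,b} \beta_Q^{b}}.
$$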
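As a plausibility check of the tie-game formula, one can compare the closed form with a direct numerical solution of the four equations. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the iteration scheme is a plain fixed-point sweep chosen for simplicity (it is not taken from the paper), and it assumes $\beta = 1 - \alpha$ for both serving situations, i.e., that every rally ends with a point for one team.

```python
def tie_game_closed_form(alpha_p, alpha_q):
    """Winning probabilities of P in the tie states (0, P) and (0, Q).

    alpha_p: probability that P wins the point when P serves,
    alpha_q: probability that P wins the point when Q serves.
    """
    beta_p, beta_q = 1.0 - alpha_p, 1.0 - alpha_q
    denom = (1.0 - alpha_q * beta_p) ** 2 - alpha_p * alpha_q * beta_p * beta_q
    v_p = alpha_p ** 2 / denom
    v_q = alpha_p * alpha_q * (alpha_p * beta_q - alpha_q * beta_p + 1.0) / denom
    return v_p, v_q


def tie_game_by_iteration(alpha_p, alpha_q, sweeps=2000):
    """Solve the same four equations by repeated substitution."""
    beta_p, beta_q = 1.0 - alpha_p, 1.0 - alpha_q
    v0p = v1p = v0q = vm1q = 0.0
    for _ in range(sweeps):
        v1p = alpha_p * 1.0 + beta_p * v0q    # state (1, P)
        vm1q = alpha_q * v0p + beta_q * 0.0   # state (-1, Q)
        v0p = alpha_p * v1p + beta_p * vm1q   # state (0, P)
        v0q = alpha_q * v1p + beta_q * vm1q   # state (0, Q)
    return v0p, v0q
```

For probabilities strictly between 0 and 1, both functions should agree up to numerical tolerance, which gives a quick way to catch sign or index errors when the formula is transcribed.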

We could now answer the benchmark question if we knew the twelve governing probabilities for Teams P and Q. However, whether or not P scores directly depends also on the skills of Q, and vice versa. Thus, a direct estimation of these probabilities would require to make historical observations of Team P for each opponent team separately. In the following section, we ignore the dependence of these probabilities on the opponent and try to estimate them from all matches in the same tournament.

5 Computational Results I

After having found an analytic solution of the s-MDP, we can try to answer the benchmark question raised in the introduction of this paper. First, the analytical solution of the s-MDP showed us that the optimal strategy is myopic. So, independent of the current score, it is always optimal to choose the strategy that maximizes the


probability to win the next point. We now turn our attention to the direct estimation of the s-transition probabilities from historical observations. All our computational results in this paper are based on a video analysis of the Olympics 2012 in London. We took the point of view of strategic consultants for the German team Brink–Reckermann (Team P) for the final against the Brazilian team Alison–Emanuel (Team Q). In order to estimate the direct point and fault probabilities from the matches prior to the final, each service and field attack of each team was classified according to

• whether it was played risky or safe or cannot be classified,
• whether it led to an immediate point, an immediate fault, or a continued rally.

The absolute counts were used for a maximum-likelihood estimation of all the s-transition probabilities. Their dependence on the opponent team was ignored. For the purpose of comparison, we also estimated these probabilities a posteriori from the final match alone. Note that, although we take this a-posteriori measurement as a yardstick, the observations in the final match contain only very few realizations of a distribution over possible outcomes. Any comparison of a-priori estimates with these a-posteriori observations is subject to two types of errors: the a-priori estimates have a certain prediction error that comes from both systematic errors (like ignoring the dependence on the opponent) and sampling errors (like small observation counts), and the a-posteriori estimate has a particularly large sampling error (only the points of a single match are used). Therefore, a discrepancy between the predictions of our model and the actual outcome in the one final should not be solely attributed to modeling or estimation errors. It might as well be the case that the actual outcome of the final did not coincide with the expected outcome of the final.
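For a categorical outcome, the maximum-likelihood estimate is simply the relative frequency of each category among the classified observations. The following sketch illustrates this; the outcome labels, function names and the sample observation list are hypothetical and only serve as an example, while the counts 32 and 102 passed to `mix` are the risky and safe serve counts reported for the prefinal matches.

```python
from collections import Counter

# Hypothetical labels for the three classified outcomes of an attack.
OUTCOMES = ("point", "fault", "continued")


def estimate_direct_probabilities(observations):
    """Relative frequencies of the outcomes, i.e. estimates of (p^+, p^-, p^0)."""
    counts = Counter(observations)
    n = sum(counts[o] for o in OUTCOMES)
    if n == 0:
        raise ValueError("no classified observations")
    return {o: counts[o] / n for o in OUTCOMES}


def mix(n_risky, n_safe):
    """Fraction of risky plays among all classified plays of one situation."""
    return n_risky / (n_risky + n_safe)


# Made-up sample: 2 direct points, 1 fault, 7 continued rallies.
sample = ["point"] * 2 + ["fault"] + ["continued"] * 7
estimates = estimate_direct_probabilities(sample)
prefinal_mix_serve = mix(32, 102)
```

With very few observations, as for the risky serves in the final, such frequency estimates carry a large sampling error, which is the concern raised in the text.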
Tables 2 and 3 show the resulting maximum-likelihood estimates of the s-transition probabilities based on the match data. Table 2 evaluates the rallies of the prefinal matches, while Table 3 considers only rallies of the final match. Since the s-MDP strategies risky and safe are characterized by more than one property (hitting technique and target field), there exist observed attacks that belong neither to the risky nor to the safe strategy. Therefore, the number of observations stated in the #-column is quite small in some cases, e.g., for risky serves in the final. The prefinal-mix and the final-mix are the actual mixtures of risky and safe services and field attacks played in the prefinal matches and in the final match, respectively. According to Definition 1, we have

$$
\text{prefinal-mix}^{\text{serve}} = \frac{32}{134}, \qquad
\text{prefinal-mix}^{\text{field}} = \frac{58}{91}, \qquad
\text{final-mix}^{\text{serve}} = \frac{1}{39}, \qquad
\text{final-mix}^{\text{field}} = \frac{12}{23}.
$$

Table 2 Direct estimation of s-MDP probabilities (prefinal setting)

Team P
Strategy a     # (serve)  p+(P)_a^serve (%)  p-(P)_a^serve (%)  # (field)  p+(P)_a^field (%)  p-(P)_a^field (%)  γ_a^serve (%)  α_a^field (%)  Win. Prob. (%)
risky-risky    32         16                 22                 58         66                 17                 37             34             77
risky-safe     32         16                 22                 33         48                 0                  36             33             60
safe-risky     102        4                  9                  58         66                 17                 34             34             69
safe-safe      102        4                  9                  33         48                 0                  32             33             50
prefinal-mix   134        7                  12                 91         59                 11                 34             34             65

Team Q
Strategy       # (serve)  p+(Q)^serve (%)    p-(Q)^serve (%)    # (field)  p+(Q)^field (%)    p-(Q)^field (%)
prefinal-mix   165        2                  10                 105        58                 15

Table 3 Direct estimation of s-MDP probabilities (postfinal setting)

Team P
Strategy a     # (serve)  p+(P)_a^serve (%)  p-(P)_a^serve (%)  # (field)  p+(P)_a^field (%)  p-(P)_a^field (%)  γ_a^serve (%)  α_a^field (%)  Win. Prob. (%)
risky-risky    1          0                  0                  12         42                 25                 29             29             14
risky-safe     1          0                  0                  11         64                 9                  34             34             73
safe-risky     38         3                  0                  12         42                 25                 30             29             18
safe-safe      38         3                  0                  11         64                 9                  36             34             77
final-mix      39         3                  0                  23         52                 17                 33             31             44

Team Q
Strategy       # (serve)  p+(Q)^serve (%)    p-(Q)^serve (%)    # (field)  p+(Q)^field (%)    p-(Q)^field (%)
final-mix      28         4                  14                 39         59                 15

Using the optimality criteria of Theorems 1 and 2, we can compute that a risky service and a risky field attack (short: risky-risky) is the optimal policy among the considered policies in the prefinal setting against the Brazilian team playing their prefinal strategy; compare columns 8 and 9 in Table 2. By dynamic programming, we computed the winning probabilities in the last column and found that the prefinal recommended risky-risky strategy would have led to the highest winning probability of 77% by a seemingly large margin. We wish to evaluate how our prefinal recommendation would have fared in the final match. However, an evaluation of the directly estimated s-transition probabilities from the final match cannot be given large weight due to the small number of observations. If the compared strategies were more specialized (e.g., requiring certain positions of players on the court), the number of observations would have been even smaller. Also, if a team likes to evaluate a strategy they have not played before in the tournament, a direct estimation of the s-transition probabilities is not possible for that new strategy. In order to see how well the estimations for the s-transition probabilities and the corresponding s-MDP-based winning probabilities match what actually happened in the final, one can compare the values from Table 3. The resulting differences in winning probabilities are substantial, and it is not clear whether these are due to the prefinal estimations and their systematic shortcomings, to the postfinal estimations


due to their small number of observations, or to an actual deviation of the teams’ skills exposed in the final compared to the matches before.

Since the direct estimates of the s-transition probabilities from the matches are independent of the opponent, the estimations in the prefinal setting are always identical, no matter which team Germany faces in the final match. Since the s-transition probabilities describe events that depend on both teams (e.g., the probability of an ace depends on the serving skills of the serving team as well as the reception skills of the opponent team), the estimates of the s-transition probabilities should vary for different opponents. The more the opponent in the final match differs from the prefinal opponents, the more probable it is that the optimal strategy against the opponent in the final is different. It is not possible to directly estimate the s-transition probabilities dependent on the opponent team for two reasons: First, like in our example of the beach volleyball tournament of the Olympic Games 2012, the team may not have faced the opponent of the final match prior to the final. Second, when estimating from at most one match, the resulting number of observations would be as small as in Table 3. A clustering of opponent teams may be tried to cure the first problem. In this application example, however, Germany did not even play any team of similar strength as Brazil prior to the final.

For those cases in which the results based solely on the s-MDP are not satisfying (as is the case for our benchmark question), we suggest our two-scale approach. The use of an adequate g-MDP can help to overcome the issues resulting from the direct estimation of the s-transition probabilities. Moreover, and more importantly, with the new method it will become possible to define and analyse many more strategic options on the basis of the same individual player-skill estimates. This will be demonstrated in Sect. 12.
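The dynamic-programming evaluation of the set-winning probability mentioned for the tables can be sketched as a backward recursion over the aggregated serving states. The code below is an illustration under explicit assumptions, not the authors' program: it takes the two aggregated probabilities that P wins the next point when P serves (`alpha_p`) and when Q serves (`alpha_q`), assumes that the team winning a point serves next, and switches to the tie-game closed form at a score of 20 : 20. All function and parameter names are hypothetical, and no attempt is made here to reproduce the tabulated values.

```python
from functools import lru_cache


def tie_game_values(alpha_p, alpha_q):
    """Closed-form winning probabilities of P in the tie states (0, P), (0, Q)."""
    beta_p, beta_q = 1.0 - alpha_p, 1.0 - alpha_q
    denom = (1.0 - alpha_q * beta_p) ** 2 - alpha_p * alpha_q * beta_p * beta_q
    v_p = alpha_p ** 2 / denom
    v_q = alpha_p * alpha_q * (alpha_p * beta_q - alpha_q * beta_p + 1.0) / denom
    return v_p, v_q


def set_win_probability(alpha_p, alpha_q, target=21, p_serves_first=True):
    """Probability that P wins a set to `target` points with a two-point lead."""
    v_tie_p, v_tie_q = tie_game_values(alpha_p, alpha_q)

    @lru_cache(maxsize=None)
    def value(x, y, p_serves):
        # The tie game starts one point before the target, e.g. at 20 : 20.
        if x == target - 1 and y == target - 1:
            return v_tie_p if p_serves else v_tie_q
        if x == target:
            return 1.0  # absorbing winning state
        if y == target:
            return 0.0  # absorbing losing state
        win_point = alpha_p if p_serves else alpha_q
        # The winner of the point serves next.
        return (win_point * value(x + 1, y, True)
                + (1.0 - win_point) * value(x, y + 1, False))

    return value(0, 0, p_serves_first)
```

Because the optimal policy is stationary and myopic, one pair of aggregated probabilities suffices for the whole set; a score-dependent policy would instead need the probabilities as functions of the state.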

6 A Gameplay MDP for Beach Volleyball

This section sketches the infinite-horizon, stationary g-MDP for a beach volleyball rally. The formal details can be found in Appendix 1. Its purpose is to provide an approximation architecture for estimating opponent-dependent s-transition probabilities from simulation runs based on the individual players’ skills, which govern the g-transition probabilities.

In this section, it becomes relevant how a point is won in beach volleyball. We summarize briefly the scoring rules as far as they are relevant to the model. A point is won whenever after a hit the ball touches the ground inside the opponent’s field. As soon as the opponent touches the ball, this is counted as the hit of the opponent. A point is lost if the ball is hit outside the court, grounds on the own court, or a hit is not executed properly. Moreover, a point is lost when a team touches the ball four times in a row or a player hits the ball twice in succession. A block is a passive contact close to the net in order to rebound an attack-hit of the opponent. The first hit after a block may be executed by both players. However, a blocking contact counts for the rules concerning the number of contacts of a team. Furthermore, a point is lost when


a player touches the net. Each rally starts with a service from behind the ground line that must not be blocked.

In order to maintain the symmetry of the gameplay, Team Q’s action sets are modelled analogously to the action sets of Team P. However, Team Q plays a fixed probability distribution over a defined set of attack plans. The transition probabilities encompass both the randomized choices of Q’s actions and the realizations of random variables in the system dynamics. We split up the team actions of a team into two individual player actions. Any player’s action is defined as a combination of a hit and a move. The decomposition of a complex team action into actions of a single player makes the definition of the action sets more manageable. Furthermore, this decomposition allows us to build the g-MDP solely on individual player probabilities, which is the big advantage of the g-MDP.

The g-MDP is an infinite-horizon, discrete-time MDP with stationary data. Stationary data can be reasonably assumed for a professional beach volleyball rally. In [12], Koch found that the temporal position within a rally affected neither the type nor the quality of the attack-hit. Our decision epochs are the points in time at which one of the players is about to hit the ball. If a player has decided about his next action, which may be, e.g., an attack-hit, he has to stick to it. This is because an interruption of the current action to revise the decision would lead to a delay. In a professional match this delay will usually outweigh any potential improvement of the new decision. Each time a player contacts the ball, the state of the system is observed and a decision about the next hit or movements must be made by Team P. However, there is one exception to that rule: if the ball is touched in a blocking action, then no state will be observed until the ball has rebounded and hits the ground or is touched by the next player.
Additionally, a new state is observed when the ball hits the ground or one team has made a fault. In that case, the rally is completed, and one of the teams has scored a point. We number the points in time in a rally where the state of the system is observed by natural numbers. The crucial property of our g-MDP is that the g-transition probabilities only depend on the skills of the individual players. To this end, any possible action (which can be a move or a hit) can result in a success, a failure, or a deviation. In our model, positions are classified by a grid on the field. Moves are represented by changes of a player’s position in the grid with an upper bound on the range. In our model, moves are always successful. If an attack hit or a service is successful, then the ball lands in the intended grid field on the opponent’s court side with the intended hardness. If a defensive hit is successful, then the ball is received and passed on to the intended grid field so that the team mate can continue the counter-attack. A fault means that the hit is in the net or was carried out with an execution fault. A deviation means that the ball passes the net, but not as intended, e.g., it lands in a neighboring grid field (which may also be out). The skill level for a special hit is defined as a probability distribution over these three possible outcomes. Depending on the outcomes of an attack hit, the opponent faces various situations, depending on which his hits lead to success, failure, or deviation. Similarly, depending on the outcomes of a reception, setting and smashing have to be performed in different


situations, which influence their probabilities for success, failure, and deviation. An analogous mechanism works for blocking, though it is slightly more complicated (see Appendix 1 for details). This way, the skills of the two teams and of the two players in a team are decoupled completely in the g-MDP. Therefore, the number of necessary probability estimates is linear in the number of players, as opposed to quadratic in the number of teams if the team formations are fixed, or even of degree four in the number of players in general.

Depending on the g-state, there are various actions possible. Note that this setup also accounts for how the quality of receptions and sets influences the possible follow-up actions: a deviated reception makes subsequent setting impossible, and an additional reception with a move has to be performed; a deviated setting makes a smash impossible, so that a more difficult shot becomes necessary, etc. By a plausibility analysis we excluded the non-professional choices from all rule-compliant actions. The remaining choices are made based on a g-strategy, which is linked to the s-strategies. This will be the topic of the next section.
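The notion of a skill level as a probability distribution over the three outcomes of a hit can be made concrete with a small sampling sketch. Everything here is an illustrative assumption: the outcome labels, the dictionary layout of a skill, and the example probabilities are hypothetical and are not taken from the paper's data or from its formal g-MDP definition.

```python
import random

# Hypothetical labels for the three possible outcomes of a hit.
OUTCOMES = ("success", "fault", "deviation")


def sample_hit_outcome(skill, rng):
    """Draw one outcome of a hit from a player's skill distribution."""
    weights = [skill[o] for o in OUTCOMES]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("skill probabilities must sum to one")
    return rng.choices(OUTCOMES, weights=weights, k=1)[0]


def empirical_skill(skill, n, seed=0):
    """Relative outcome frequencies over n simulated hits (seeded for repeatability)."""
    rng = random.Random(seed)
    counts = {o: 0 for o in OUTCOMES}
    for _ in range(n):
        counts[sample_hit_outcome(skill, rng)] += 1
    return {o: counts[o] / n for o in OUTCOMES}


# Made-up skill of one player for one hitting technique in one situation.
example_skill = {"success": 0.7, "fault": 0.1, "deviation": 0.2}
frequencies = empirical_skill(example_skill, 5000, seed=1)
```

In a full simulation, one such distribution would be needed per player, hit type and situation, which is why the number of estimates grows linearly with the number of players.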

7 Gameplay MDP Strategy

Each team in the g-MDP plays a team-specific g-MDP-strategy. A g-MDP-strategy is a variation of some basic strategy used as a default, together with a modification of the blocking, serving and attack-hit strategies according to parameters that characterize a team strategy, e.g., risky or safe. Besides these configurable decisions, all strategies in the g-MDP use the same default decision rules of the basic strategy. In the context of MDPs, a g-MDP-strategy is a stationary policy that consists of a decision rule that prescribes, depending on the state, which action should be chosen. However, since we are in a sports environment, we will speak of strategies instead of stationary policies.

The basic strategy is implemented to guarantee a reasonable match flow. It excludes unrealistic and, moreover, obviously non-optimal combinations of player actions or sequences of team actions, and chooses uniformly at random one of the plausible options in each situation. Some parts of the basic strategy are parametrized such that extreme strategies, like risky and safe, can be derived from it. We chose in this example to configure the blocking, serving, and attack-hit parts of a strategy. In general, other or more complex parametrizations of the basic strategy are possible. With an implemented basic strategy, it is not necessary to implement an individual decision rule for each possible state in the g-MDP; all straightforward actions are inherited from the basic strategy.

The decision rules of the basic strategy split up into one decision rule for each state category. The ten different state categories are serving, reception, setting, attacking and defending states from the perspective of both teams. Each category is determined by values of the state variable counter and side(pos(ball)), where side(pos(ball)) states on which court side the ball is. We refrain from a complete definition in print

Strategy Optimization in Sports via Markov Decision Problems


of the straightforward decision rules of the basic strategy that are also used in all other strategies under consideration; it is just a very long list of very plausible rules. Instead, we restrict ourselves to those decision rules in which the strategies of interest differ. We represent all strategies as randomized policies over identical sets of plausible deterministic policies representing extremal ways to play. The investigated strategies only differ in the selection probabilities. This way we obtain a parametrized set of randomized strategies.

More specifically, all strategy-specific decision rules are encoded by a vector π whose components determine the probability for choosing true in a Boolean decision. It represents the probabilities with which the strategy chooses one out of two extremal ways to play in various dimensions. In the basic strategy, e.g., all components of π are set to 0.5, which means that in each dimension both decision possibilities are equally probable. For example, the blocking strategy is specified by $\pi_b$, which states with which probability player 1 of a team is the designated blocking player in the next rally. It follows that with probability $(1 - \pi_b)$ player 2 is the blocking player. The parameter $\pi_s$ determines the serving strategy of a team. With probability $\pi_s$, a serve on player 1 of the opponent team is made, i.e., the target field of the serve belongs to the opposing court half that is covered by player 1. Further, the technique and target field decisions of the serve and attack-hit are included in $\pi_h$. The two parts $\pi_h^{\mathrm{serve}}$ and $\pi_h^{\mathrm{field}}$ of $\pi_h$ comprise the strategies corresponding to service and field attack, respectively. Each component further splits up into a technique and a target field decision that can be different for both players ρ, i.e., $\pi_h^{\mathrm{sit}} = (\pi_{h,\mathrm{tech}}^{\mathrm{sit}}(\rho), \pi_{h,\mathrm{target}}^{\mathrm{sit}}(\rho))^T$ with sit ∈ {serve, field}. The subscript term indicates whether the decision is related to the technique (tech) or the target field (target).
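Now, we can summarize all parameters that are necessary for defining a g-MDP strategy of team P:

Definition 2 (g-MDP strategy) A strategy of the g-MDP is a parametrization of the basic strategy and characterized by the parameters

$$\pi = \begin{pmatrix} \pi_h \\ \pi_b \\ \pi_s \end{pmatrix}, \qquad \pi_h = \begin{pmatrix} \pi_h^{\mathrm{serve}} \\ \pi_h^{\mathrm{field}} \end{pmatrix}, \qquad \pi_h^{\mathrm{sit}} = \begin{pmatrix} \pi_{h,\mathrm{tech}}^{\mathrm{sit}}(\rho) \\ \pi_{h,\mathrm{target}}^{\mathrm{sit}}(\rho) \end{pmatrix}, \qquad \mathrm{sit} \in \{\mathrm{serve}, \mathrm{field}\},\ \rho \in \{P_1, P_2\}.$$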

To make the parameters easier to remember, we always define the values of the components of $\pi_h$ as the probability for the more risky option. In our example, we have two serving techniques available in the g-MDP, namely the float serve $S_F$ and the jump serve $S_J$. The float serve is considered a safe hit and the jump serve a risky hit. (All classifications of this type have been determined by personal communication with high-level amateur beach volleyball players.) So $\pi_{h,\mathrm{tech}}^{\mathrm{serve}}(\rho)$ is defined as the probability that ρ chooses an $S_J$. For the attack-hit, we have three techniques available: the smash $F_{SM}$, a planned shot $F_P$ and an emergency shot $F_E$. The emergency shot is normally only played if none of the other attack-hits is possible, and in such a case it is chosen with certainty by each strategy. The smash is considered a risky field hit and the planned shot a safe hit. So $\pi_{h,\mathrm{tech}}^{\mathrm{field}}(\rho)$ is defined as the probability


S. Hoffmeister and J. Rambau

Table 4 Overview: risky hitting strategy π_h^risky versus safe hitting strategy π_h^safe

Situation    Risky option              π_h^risky   Parameter                  π_h^safe
Serve        Serving technique S_J     1           π_{h,tech}^{serve}(ρ)      0
Serve        Border field              1           π_{h,target}^{serve}(ρ)    0
Attack-hit   Attack technique F_SM     1           π_{h,tech}^{field}(ρ)      0
Attack-hit   Border field              1           π_{h,target}^{field}(ρ)    0

that ρ chooses an $F_{SM}$. Furthermore, we define all fields that are near the boundary lines of the court as border fields. For example, on the court side of team Q the border fields are ∂F := {Q11 − Q31, Q14 − Q34}. These are more risky target fields than non-border fields. So $\pi_{h,\mathrm{target}}^{\mathrm{serve}}(\rho)$ and $\pi_{h,\mathrm{target}}^{\mathrm{field}}(\rho)$ are the probabilities with which a border field is chosen as a target field. Should there be several possible risky or safe options, the risky or safe strategy, respectively, chooses one of them uniformly at random. This randomization is a means to prevent complete predictability, although this advantage cannot be measured in the reward function of the current MDP-setup. It can be seen as injecting some general key learnings from game theory into the system.

After having introduced the general concept of a g-MDP strategy, we want to specify two hitting strategies that implement the s-MDP strategies risky [safe] as g-MDP strategies. For answering our benchmark question, we will compare them later in the computational results section, see Sect. 9. We call the two special strategies the risky hitting strategy $\pi_h^{\mathrm{risky}}$ and the safe hitting strategy $\pi_h^{\mathrm{safe}}$. They are the most extreme hitting strategies. The strategy $\pi_h^{\mathrm{risky}}$ always takes a risky technique and always chooses a border field as target field. The strategy $\pi_h^{\mathrm{safe}}$ always chooses a safe hit with a non-border field as target field. Table 4 summarizes the techniques and target fields chosen by the two extreme strategies. The assignment risky → $\pi_h^{\mathrm{risky}}$ and safe → $\pi_h^{\mathrm{safe}}$ is the s-g-implementation of our benchmark question. In the following, we will refer to $\pi_h^{\mathrm{risky}}$ [$\pi_h^{\mathrm{safe}}$] when we write about the risky [safe] strategy in the g-MDP-setting.
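The parametrization above can be illustrated by a minimal sketch of how a hit could be sampled from the two probabilities of one player in one situation. The class `HitParams`, the function `choose_hit`, and the list of non-border fields are hypothetical names and values introduced only for this illustration; they are not taken from the authors' simulation code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class HitParams:
    """Probabilities of the risky option for one player and one situation."""
    tech: float    # probability of the risky technique (S_J or F_SM)
    target: float  # probability of aiming at a border field


RISKY = HitParams(tech=1.0, target=1.0)  # extreme strategy pi_h^risky (Table 4)
SAFE = HitParams(tech=0.0, target=0.0)   # extreme strategy pi_h^safe (Table 4)
BASIC = HitParams(tech=0.5, target=0.5)  # basic strategy: both options equally likely

# Border fields on the side of team Q as named in the text; the list of
# non-border fields is an assumption made for this sketch.
BORDER = ["Q11", "Q21", "Q31", "Q14", "Q24", "Q34"]
INNER = ["Q12", "Q13", "Q22", "Q23", "Q32", "Q33"]


def choose_hit(params, risky_techs, safe_techs, rng):
    """Sample (technique, target field) for one hit.

    Within the chosen class (risky or safe) the option is drawn
    uniformly at random, as described in the text.
    """
    techs = risky_techs if rng.random() < params.tech else safe_techs
    fields = BORDER if rng.random() < params.target else INNER
    return rng.choice(techs), rng.choice(fields)


if __name__ == "__main__":
    rng = random.Random(2012)
    print(choose_hit(BASIC, ["S_J"], ["S_F"], rng))
```

With `RISKY` the sketch always returns the risky technique and a border field, and with `SAFE` always the safe technique and a non-border field, which mirrors the two rows of extremes in Table 4.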

8 Gameplay MDP Validation

For calibrating the g-MDP for the German team Brink–Reckermann against the Brazilian team Alison–Emanuel in the final match of the 2012 Olympic Games, we need estimates of the skills of all players as input parameters. To estimate the skills, we evaluated all matches they played in the tournament except the final match. Details of the data collection process and the complete presentation of all data tables


Table 5 Input data from all matches except final: Julius Brink, serves and attack-hits

                              Target fields Q11–Q14          Target fields Q21–Q24          Target fields Q31–Q34
Hit               Position    #    succ         fault        #    succ         fault        #    succ         fault
Serve S_F         P01–P04     34   0.88 (0.88)  0.00 (0.00)  43   0.88 (0.88)  0.12 (0.12)
Serve S_J         P01–P04     34   0.94 (0.94)  0.00 (0.00)  16   0.75 (0.75)  0.19 (0.19)
Attack-hit F_SM   Out         0    0.86 (–)     0.02 (–)     0    0.86 (–)     0.02 (–)
                  P11–P14     0    0.86 (–)     0.02 (–)     0    0.86 (–)     0.02 (–)
                  P21–P24     55   0.85 (0.85)  0.04 (0.04)  17   0.94 (0.94)  0.00 (0.00)
                  P31–P34     7    0.77 (0.71)  0.01 (0.00)  2    0.89 (1.00)  0.02 (0.00)
Attack-hit F_E    Out         0    0.76 (–)     0.06 (–)     0    0.76 (–)     0.06 (–)
                  P11–P14     0    0.76 (–)     0.06 (–)     1    0.79 (1.00)  0.05 (0.00)
                  P21–P24     7    0.73 (0.71)  0.11 (0.14)  7    0.82 (0.86)  0.02 (0.00)
                  P31–P34     1    0.70 (0.00)  0.05 (0.00)  1    0.79 (1.00)  0.05 (0.00)
Attack-hit F_P    Out         0    0.95 (–)     0.05 (–)     0    0.95 (–)     0.05 (–)     0    0.95 (–)     0.05 (–)
                  P11–P14     0    0.95 (–)     0.05 (–)     0    0.95 (–)     0.05 (–)     0    0.95 (–)     0.05 (–)
                  P21–P24     8    0.99 (1.00)  0.01 (0.00)  30   0.97 (0.97)  0.03 (0.03)  0    0.95 (–)     0.05 (–)
                  P31–P34     2    0.96 (1.00)  0.04 (0.00)  3    0.88 (0.67)  0.12 (0.33)  0    0.95 (–)     0.05 (–)

can be found in [10]. For illustration purposes, the skill estimates for Julius Brink based on the pre-final matches are included in Tables 5 and 6. In Table 5, the maximum-likelihood estimates of the individual player probabilities of Julius Brink for all types of serves and attack-hits are presented. We aggregated different player positions and target fields together to get a larger number of observations. The number of observations for a certain combination of player position and target field is stated in the #-column. The probabilities shown in brackets are the maximum-likelihood estimates for the specified hit whereas the other probabilities are the maximum a-posteriori probability estimations which include a prior assumption [14]. For categories with more than eleven observations both probabilities are


Table 6 Input data from all matches except final: Julius Brink, defence, reception, set, block

                          Attack strength: normal        Attack strength: hard
Action      Performance   #     succ         fault       #    succ         fault
Defence     d             20    0.85 (0.85)  0.05 (0.05) 14   0.71 (0.71)  0.21 (0.21)
            dm            29    0.93 (0.93)  0.00 (0.00) 13   0.46 (0.46)  0.38 (0.38)
Reception   r             34    1.00 (1.00)  0.00 (0.00) 9    0.90 (0.89)  0.10 (0.11)
            rm            42    0.95 (0.95)  0.02 (0.02) 3    0.97 (1.00)  0.02 (0.00)
Set         s             117   0.99 (0.99)  0.00 (0.00)

Action      Performance   #     direct point   over net but no point   fault   misses ball
Block       b             5     0.20           0.20                    0.20    0.40

equal. More details on the a-posteriori skill estimation can be found in [10]. The column succ states for each combination the probability that the hit lands in the target field, and the column fault contains the probability of a technical error. The remaining probability is the probability that the hit was successful but the ball deviated into a neighbouring field of the target field. Table 6 specifies the estimated probabilities of Julius Brink for defence, receptions, settings, and blocks. The estimated probabilities fit the intentions we had when we defined the hits, e.g., receptions have higher success rates than defence actions, and hard balls are harder to defend or receive than normal balls. For the blocking skills, the first three columns after the number of observations describe the possible results of a block that catches the ball, while the last column is the probability that the block misses the ball. Since Jonas Reckermann is the designated blocking player in the German team, Julius Brink has done nearly no blocks in all these matches [10]. We estimated the individual probabilities of the other players in the same way. The respective tables are presented in [10].

Before going on with strategic recommendations in the next section, we want to check how well the g-MDP model fits a real beach volleyball match. We use the final match of the Olympic Games as a benchmark for our model and the estimated input skills. It is the only match of the tournament where we have estimates of the skills of both teams. Table 7 shows the strategy estimates for both teams in terms of the strategy definition of Sect. 7. It contains the strategy estimated from observations of the final match and the estimate from the prefinal matches. The estimated strategy of the final is used for validating the g-MDP model definition. In the prefinal strategy, we used 50% as the estimate for the team's serving strategy. Since the teams faced

Table 7 Estimated final and prefinal strategies of Brink–Reckermann and Alison–Emanuel of the Olympic 2012 games

Final π^final
                           GER                           BRA
Parameter                  Brink (%)   Reckermann (%)    Alison (%)   Emanuel (%)
π_{h,tech}^{serve}         24.14       19.23             68.00        36.67
π_{h,target}^{serve}       17.24       3.85              16.00        26.67
π_{h,tech}^{field}         69.23       73.91             92.86        81.48
π_{h,target}^{field}       33.33       26.09             42.86        53.70
π_b                        1.56                          96.08
π_s                        27.27                         34.55

Prefinal π^prefinal
                           GER                           BRA
Parameter                  Brink (%)   Reckermann (%)    Alison (%)   Emanuel (%)
π_{h,tech}^{serve}         39.37       50.41             37.24        35.34
π_{h,target}^{serve}       18.90       36.36             17.93        17.29
π_{h,tech}^{field}         65.32       71.67             82.00        86.19
π_{h,target}^{field}       38.71       45.00             36.00        30.94
π_b                        2.44                          95.85
π_s                        50.00                         50.00


Table 8 Validation of simulated s-MDP transition probabilities based on different skill estimates

Serving situation
Estimation method                      p_+^serve(P)_π^final (%)   p_-^serve(P)_π^final (%)   p_+^serve(Q)_π^final (%)   p_-^serve(Q)_π^final (%)
Realized probabilities final           2                          4                          4                          14
Simulating the g-MDP with
  Skills of all matches except final   2                          15                         3                          15
  Skills of all matches                2                          12                         4                          14
  Skills of final only                 1                          2                          6                          10

Field attack situation
Estimation method                      p_+^field(P)_π^final (%)   p_-^field(P)_π^final (%)   p_+^field(Q)_π^final (%)   p_-^field(Q)_π^final (%)
Realized probabilities final           49                         17                         55                         16
Simulating the g-MDP with
  Skills of all matches except final   32                         15                         26                         19
  Skills of all matches                36                         15                         36                         19
  Skills of final only                 46                         12                         50                         15

different opponents in the prefinal matches, we could not derive any meaningful value from the observations. Recall that the classification of a hit as safe or risky depends on the state of the system, e.g., on whether or not the setting resulted in a deviation. We collected the realized s-MDP transition probabilities from the Olympic final by counting the number of serves and field attacks as well as the direct points and faults. These values are presented in the first line of Table 8.

For validating our approach, we simulated 1000 batches of 100 beach volleyball rallies each, where both teams played the estimated strategy of the final match. We implemented a special-purpose simulation code in Java. The g-MDP simulation has been slightly tweaked from the ideal descriptions of the g-MDP in order to match our interpretation of the video data in the data collection as closely as possible. By counting the number of serves and field attacks as well as their outcome (direct point, fault or a subsequent attack), we calculated the s-MDP transition probabilities from the g-MDP simulation by a maximum-likelihood estimation. That is, we counted the number of times the action was performed and the number of times it


resulted in a success, a failure, or a deviation. The quotients were taken as estimates for the probabilities. Since the case of a deviation of target field for an attack hit is difficult to judge upon (because we cannot tell the intention from the video), we only classified outs as a deviation. Whenever there were fewer than eleven (a number determined in many experiments) observations for an action, we used actions from an extended category to add additional observations. The results for different skill estimations are shown in the last three lines of Table 8. The deviations of the predictions in the prefinal row from the direct observations in the final in the very first row are partly small and encouraging and partly quite large. However, as discussed before, one should not consider the first row as containing the “true” probabilities because the values from the final are an estimation for the probabilities, too, and one with a large sampling error, given the small number of observations. Still, it can be seen that the pattern of which probability is large and which probability is small looks quite similar. We have to keep in mind, though, that any interpretation of an outcome of our analysis must be accompanied by a thorough sensitivity analysis. We will show a possibility to implement a concept for this in Sect. 11.
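The counting estimate and the eleven-observation rule can be sketched as follows. The padding rule in `padded_estimate` is only an assumption of this sketch: categories with fewer than eleven observations are filled up to eleven with pseudo-observations drawn at the rate of the extended category. This rule happens to reproduce several entries of Table 5 (for instance 0.77 from 5 successes in 7 smashes with an extended-category rate of 0.86), but the actual prior is specified in [10] and [14] and may differ. All function names are hypothetical.

```python
MIN_OBS = 11  # threshold quoted in the text


def ml_estimate(count, n):
    """Maximum-likelihood estimate: relative frequency of an outcome."""
    if n == 0:
        return None  # no observations, no estimate
    return count / n


def padded_estimate(count, n, extended_rate, min_obs=MIN_OBS):
    """Relative frequency, padded with the extended category if n < min_obs.

    ASSUMPTION of this sketch: the missing (min_obs - n) observations are
    replaced by pseudo-observations occurring at rate `extended_rate`.
    """
    if n >= min_obs:
        return count / n
    pseudo = min_obs - n
    return (count + pseudo * extended_rate) / min_obs


if __name__ == "__main__":
    # Smash from P31-P34 onto Q11-Q14 in Table 5: 5 of 7 successful.
    print(round(ml_estimate(5, 7), 2), round(padded_estimate(5, 7, 0.86), 2))
```

For categories with at least eleven observations both functions return the same value, in line with the remark that the two kinds of estimates coincide there.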

9 Computational Results II

After having found an appropriate g-MDP model, we can estimate s-MDP transition probabilities from simulating the g-MDP and answer the benchmark question raised in the introduction of this paper. Using the g-implementation of risky and safe as described in Table 4 and the a-priori skill estimates of all players, we can estimate the s-MDP transition probabilities, see Table 9. Assuming Brazil plays a strategy similar to their prefinal strategy π^prefinal, their s-MDP transition probabilities can be estimated from the g-MDP simulation as well, see the last line in Table 9.

Table 9 Estimation of s-MDP probabilities from g-MDP simulation: prefinal setting

Strategy a     p_+^serve(P)_a (%)   p_-^serve(P)_a (%)   p_+^field(P)_a (%)   p_-^field(P)_a (%)   γ_a^serve (%)   α_a^field (%)   Winning Prob (%)
risky-risky    5                    20                   40                   16                   46              55              80
risky-safe     5                    20                   16                   13                   39              46              31
safe-risky     1                    13                   40                   16                   48              55              84
safe-safe      1                    13                   16                   13                   40              46              34
π^prefinal     2                    16                   32                   14                   45              53              73

Opponent       p_+^serve(Q) (%)     p_-^serve(Q) (%)     p_+^field(Q) (%)     p_-^field(Q) (%)
π^prefinal     2                    12                   24                   18


Table 10 Estimation of s-MDP probabilities from g-MDP simulation: postfinal setting

Strategy a     p_+^serve(P)_a (%)   p_-^serve(P)_a (%)   p_+^field(P)_a (%)   p_-^field(P)_a (%)   γ_a^serve (%)   α_a^field (%)   Winning Prob (%)
risky-risky    2                    8                    52                   14                   36              37              49
risky-safe     2                    8                    40                   12                   33              34              25
safe-risky     1                    2                    52                   14                   37              37              52
safe-safe      1                    2                    40                   12                   34              34              28
π^final        1                    2                    47                   12                   36              36              41

Opponent       p_+^serve(Q) (%)     p_-^serve(Q) (%)     p_+^field(Q) (%)     p_-^field(Q) (%)
π^final        6                    10                   51                   14

Using the optimality criteria of Theorems 1 and 2, we can compute that a risky service and a risky field attack (short: risky-risky) is the optimal policy among the considered policies in the prefinal setting against the Brazilian team playing their prefinal strategy. This result coincides with the prefinal recommendation based on the directly estimated transition probabilities presented in Sect. 5.

The reader may have noticed that there are differences in the performances and the played strategy when comparing the prefinal matches with the final match: the skill estimates based on the prefinal matches differ from the estimates based on the final match; compare, e.g., the tables for the skills of Reckermann in [10]. Furthermore, the strategy of Brazil in the final match deviates slightly from their prefinal strategy, see Table 7. Because of these differences, we wish to evaluate how our prefinal recommendation would have fared in the final match. Table 10 shows the s-MDP transition probabilities estimated from the g-MDP simulation provided with the postfinal setting. Applying again the optimality criteria of Theorems 1 and 2, we derive that in the postfinal setting a safe service strategy and a risky field-attack strategy would be the best response to the Brazilian final strategy. However, the a-priori recommendation proves to be quite good. By dynamic programming, we computed the winning probabilities and found that the a-priori recommended risky-risky strategy would have led in the postfinal setting to a winning probability of 49%, which is better than the strategy actually played by Germany in the final (41% winning probability) and only slightly worse than the optimal safe-risky strategy (52% winning probability).

It should be said that we experienced deviations between our predictions and the outcomes of the final in this case as well, as in the pure s-MDP case in Sect. 5. The deviations are attributed to changes in the individual skills. The skill estimates from the prefinal matches differ from the skill estimates based on the final match only. However, the number of observations in one match is not sufficient to estimate the individual skills: the whole point of basing strategic recommendations on skill estimates rather than direct point probabilities is that the data of all matches prior to the final can be used independently of the opponent.
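To make the link between the four transition probabilities per team and rally-level winning probabilities more tangible, the following sketch solves a simplified alternating-attack rally model. This model is an assumption of the sketch and not necessarily the s-MDP of this paper: a serve or field attack ends the rally with a direct point or a fault, and otherwise the opponent gets a field attack. The function name and dictionary keys are hypothetical. The score-level dynamic programming that yields match winning probabilities is not shown.

```python
def rally_win_probs(p, q):
    """Rally-winning probabilities of team P in an alternating-attack model.

    p, q: dicts with keys 'serve_plus', 'serve_minus', 'field_plus',
    'field_minus' holding the direct-point and fault probabilities of
    teams P and Q. Requires that at least one team can end a field attack.
    """
    cont_p = 1.0 - p["field_plus"] - p["field_minus"]  # P's attack is returned
    cont_q = 1.0 - q["field_plus"] - q["field_minus"]  # Q's attack is returned
    # a = P wins given P attacks, b = P wins given Q attacks:
    #   a = field_plus(P) + cont_p * b,   b = field_minus(Q) + cont_q * a
    a = (p["field_plus"] + cont_p * q["field_minus"]) / (1.0 - cont_p * cont_q)
    b = q["field_minus"] + cont_q * a
    serve_cont = 1.0 - p["serve_plus"] - p["serve_minus"]
    win_on_own_serve = p["serve_plus"] + serve_cont * b
    return win_on_own_serve, a, b


if __name__ == "__main__":
    # risky-risky row and opponent row of Table 9, converted to fractions
    team_p = {"serve_plus": 0.05, "serve_minus": 0.20,
              "field_plus": 0.40, "field_minus": 0.16}
    team_q = {"serve_plus": 0.02, "serve_minus": 0.12,
              "field_plus": 0.24, "field_minus": 0.18}
    print(rally_win_probs(team_p, team_q))
```

Under this assumed model, the risky-risky row of Table 9 gives roughly 0.46 for a rally started by P's serve and roughly 0.55 for a rally continued by Q's field attack, which is close to the tabulated values of the γ and α columns; whether these columns are defined exactly this way cannot be confirmed from this section.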


In order to keep this paper focused on the as-simple-as-possible benchmark question, we stuck to a comparison of two possible strategies only. Given the level of detail in the g-MDP, we could easily compare more strategy combinations involving the blocking player or the positioning of the players in the field on the basis of the same skill estimates. Note, moreover, that the skill estimates used in this paper are based on very few data. In practice, we suggest evaluating the skills of one's own team also in training sessions and other matches prior to the tournament.

10 Comparison of Methods

We have carried out the whole concept for the German beach volleyball team Brink–Reckermann at the Olympic games 2012. We tried to give a recommendation for the German team prior to the final match against Brazil. Thereby we answered the strategic question "does risky or safe play lead to a higher winning probability?" twice: first by using direct estimates of the s-MDP transition probabilities, and a second time by using estimates from the g-MDP simulation. The prefinal recommendations of both methods coincided. Since both estimation methods for the s-transition probabilities have independent weaknesses, in this case the prefinal recommendation can be trusted even more. It may seem that the two-scale approach did not gain anything new. However, that can only be said for the recommendation alone. In Sects. 11 and 12 we will see how the two-scale approach yields sensitivity and game-theoretic insights about strategies that have never been played before, which is even more important in the presence of only loosely validated data estimation. This would be impossible with the direct estimation of s-transition probabilities from historic data only.

In Tables 11 and 12, we present the estimation results of the s-MDP transition probabilities for the final strategy to compare both estimation methods. Table 11 presents the estimates for the serving situation and Table 12 those for the field attack situation. Both methods, the direct estimation of the s-MDP transition probabilities, see Sect. 5, and the estimation from the g-MDP simulation, see Sect. 9, are compared to the realized transition probabilities in the final match. In the g-MDP simulation, we used the values of Table 7 as an estimate for the final strategy π^final. The results of the g-MDP simulation, i.e., lines 2 and 4 in each table, are a compilation of the results of Table 8, where we validated the g-MDP simulation.
The results of the direct estimation based on the final match are also a compilation of the results presented in Table 3. However, the direct estimation for the final strategy based on the prefinal matches contains new values. We used the prefinal estimates for risky and safe of Table 2 and mixed them by the final-mix. For example, for the serving probabilities of team P, we get

$$p_+^{\mathrm{serve}}(P)_{\text{final-mix}} = \text{final-mix}_{\mathrm{serve}} \cdot p_+^{\mathrm{serve}}(P)_{\mathrm{risky}} + (1 - \text{final-mix}_{\mathrm{serve}}) \cdot p_+^{\mathrm{serve}}(P)_{\mathrm{safe}},$$

$$p_-^{\mathrm{serve}}(P)_{\text{final-mix}} = \text{final-mix}_{\mathrm{serve}} \cdot p_-^{\mathrm{serve}}(P)_{\mathrm{risky}} + (1 - \text{final-mix}_{\mathrm{serve}}) \cdot p_-^{\mathrm{serve}}(P)_{\mathrm{safe}}$$


Table 11 Comparison between two-scale and direct approach: the realized s-MDP transition probabilities of the final are the benchmark. s-MDP transition probabilities for the serving situation

Estimation method    Data               Dec. rule    p_+^serve(P) (%)   p_-^serve(P) (%)   p_+^serve(Q) (%)   p_-^serve(Q) (%)   Avg. L1-error
Observations final                                   2                  4                  4                  14
g-MDP simulation     Prefinal skills    π^final      2                  15                 3                  15                 3.27%
s-MDP direct         Prefinal matches   final-mix    4                  9                  2                  12                 2.86%
g-MDP simulation     Final skills       π^final      1                  2                  6                  10                 2.23%
s-MDP direct         Final match        final-mix    3                  0                  4                  14                 1.10%

Table 12 Comparison between two-scale and direct approach: the realized s-MDP transition probabilities of the final are the benchmark. s-MDP transition probabilities for the field attack situation

Estimation method    Data               Dec. rule    p_+^field(P) (%)   p_-^field(P) (%)   p_+^field(Q) (%)   p_-^field(Q) (%)   Avg. L1-error
Observations final                                   49                 17                 55                 16
g-MDP simulation     Prefinal skills    π^final      32                 15                 26                 19                 12.87%
s-MDP direct         Prefinal matches   final-mix    57                 9                  60                 17                 5.71%
g-MDP simulation     Final skills       π^final      46                 12                 50                 15                 3.40%
s-MDP direct         Final match        final-mix    52                 17                 59                 15                 2.27%

by the direct estimation method based on prefinal matches. Since the proportion with which team Q used a risky or a safe strategy in the final and the estimates for the transition probabilities of those strategies have not been presented yet, we put them in Table 15 in the Appendix. Although the average absolute difference between the realized transition probabilities of the final match and the direct estimates of the field attack transition probabilities is in all cases smaller than the deviation of the estimates from the g-MDP simulation, the approximation errors are of a similar order of magnitude, given that the "benchmark" count in the final represents only the counts of one sample match, which is a random experiment, too. Only the direct point probability for Brazil after a field attack has been underestimated by a large margin (26% versus 55%) by the g-MDP: this is subject to further investigation. Note that the direct estimation of the s-transition probabilities relevant for the final could only be done for those strategies that have been played before. If one


wants to optimize over strategy sets containing strategies that have not been played before (in the tournament against similar opponents), then the g-MDP simulation is the only method that yields estimates at all.
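The two computations used in this comparison, the final-mix of risky and safe estimates and the average absolute (L1) deviation from the observed probabilities, can be sketched in a few lines. The function names are hypothetical, and the mixing weight in the usage example is invented for illustration since Table 2 and the final-mix values are not repeated here.

```python
def final_mix(mix, p_risky, p_safe):
    """Convex combination of the risky and safe estimates of one probability."""
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be a proportion in [0, 1]")
    return mix * p_risky + (1.0 - mix) * p_safe


def avg_l1_error(estimates, observed):
    """Mean absolute difference between estimated and observed probabilities."""
    if len(estimates) != len(observed):
        raise ValueError("both sequences must have the same length")
    return sum(abs(e - o) for e, o in zip(estimates, observed)) / len(observed)


if __name__ == "__main__":
    observed_serve = [2, 4, 4, 14]        # first row of Table 11, in percent
    gmdp_prefinal_serve = [2, 15, 3, 15]  # second row of Table 11, in percent
    print(avg_l1_error(gmdp_prefinal_serve, observed_serve))
    print(final_mix(0.25, 5, 1))          # hypothetical mixing weight
```

Applied to the rounded percentages printed in Tables 11 and 12, the sketch gives values close to, but not identical with, the tabulated average errors, presumably because the tables were computed from unrounded probabilities.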

11 Sensitivity and Skill Strategy Score Cards

In the following, we discuss what we call Skill-Strategy Score Cards. Skill-Strategy Score Cards are a visualization of the sensitivity of strategy recommendations with respect to probability estimates. Given the substantial uncertainty in the probability estimates, this is paramount to the correct assessment of overly detailed computational results in practice. The cards indicate, for various individual skill levels and for various opponent types, the differences in the winning probabilities of two strategies. The skill probabilities $p_{\mathrm{succ},\rho}(\mathrm{pos}(\rho), h)$ and $p_{\mathrm{fault},\rho}(\mathrm{pos}(\rho), h)$ for one hit h are varied in each small plot from zero to one. The probability $p_{\mathrm{dev},\rho}(\mathrm{pos}(\rho), h)$ is implicitly determined through the two varied probabilities. The colour of each square-shaped data point in the plot reflects the difference between the winning probabilities of the safe hitting strategy $\pi_h^{\mathrm{safe}}$ and the risky hitting strategy $\pi_h^{\mathrm{risky}}$, which were both introduced in Sect. 7. A green colour means that it is better to play safe, a red colour suggests risky play, and a yellow colour indicates that there is no difference in the winning probabilities of both strategies. As a reminder, the smash is played in a field attack in risky, whereas the planned shot is played in safe. Lines in each small plot indicate the real skill level of the varied hit. For example, in Fig. 5 the smashing skills are varied, and the three lines indicate the average skill level of the smash as presented in the input data.

One can use a Skill-Strategy Score Card as follows: Imagine you want to choose between safe and risky field attack play against a particular opponent. You can estimate all necessary s-transition probabilities, either as described in this paper via the direct estimation or via the simulation of the g-MDP (or by some other method). You are interested in the sensitivity of the recommendation w.r.t. the opponent's strength and your players' skills on complementary hits representing the safe and risky field attack styles. For example, the skill level for a planned shot influences the winning probability of the safe attack style, and the skill level of the smash (the complementary hit) influences the winning probability for the risky attack style; you want to know how the strategic recommendation changes as you vary the skill levels of these hits. Then you do the following:

1. Produce a Skill-Strategy Score Card for smash with variable skills for smash and the estimated skills for shot. This yields a chart like the one in Fig. 5. Do the same with variable shot skills and the estimated smash skills. This yields a chart like the one in Fig. 6.
2. Focus on the little chart at the respective $(p_+^{\mathrm{field}}(Q), p_-^{\mathrm{field}}(Q))$-coordinate, where $p_+^{\mathrm{field}}(Q)$ denotes the estimated opponent's direct-point probability and $p_-^{\mathrm{field}}(Q)$ the estimated opponent's error probability.

Fig. 5 Skill-Strategy Score Card: difference between the winning probabilities of safe and risky play for different opponents and varying smash skills p_succ,ρ(pos(ρ), F_SM) and p_fault,ρ(pos(ρ), F_SM). [Figure: grid of small plots. Outer axes: direct-fault probability of opponent (horizontal) and direct-point probability of opponent (vertical), each from 0.0 to 1. Inner axes of each small plot: fault probability of smash (horizontal) and success probability of smash (vertical). Colour key: winProb(safe) − winProb(risky); lines mark the real skill of Team P.]

3. Focus on the square at the $(p_{\mathrm{succ},\rho}(\mathrm{pos}(\rho), F_{SM}), p_{\mathrm{fault},\rho}(\mathrm{pos}(\rho), F_{SM}))$-coordinate in the little chart, where $p_{\mathrm{succ},\rho}(\mathrm{pos}(\rho), F_{SM})$ denotes your players' estimated success probability for smash and $p_{\mathrm{fault},\rho}(\mathrm{pos}(\rho), F_{SM})$ the corresponding error probability.
4. The result for the estimated probabilities is: the greener the square, the more safe outperforms risky.
5. Since all probabilities are only estimates, the graphics show some sensitivities: the neighbouring little squares show how the superiority of safe over risky, or vice versa, changes as your players' smash skills (or planned shot skills, respectively) vary: the squares above are for larger success probabilities, the squares to the right are for larger error probabilities, and the squares to the top-right are for larger deviation probabilities (no error but with deviation into a neighbouring field).
6. The neighbouring little charts show how the superiority of safe over risky, or vice versa, changes as your opponent's strength varies: the little charts above are for a larger direct point probability of your opponent in the field, the little charts to the right for a larger error probability of your opponent in the field. This way, our team can base a decision on a larger area of plausible probabilities.

Fig. 6 Skill-Strategy Score Card: difference between the winning probabilities of safe and risky play for different opponents and varying shot skills p_succ,ρ(pos(ρ), F_P) and p_fault,ρ(pos(ρ), F_P). [Figure: grid of small plots. Outer axes: direct-fault probability of opponent (horizontal) and direct-point probability of opponent (vertical), each from 0.0 to 1. Inner axes of each small plot: fault probability of shot (horizontal) and success probability of shot (vertical). Colour key: winProb(safe) − winProb(risky); lines mark the real skill of Team P.]

7. Moreover, it is possible to assess the critical probability values where the superiority of safe over risky flips fast (narrow yellow areas between green and red).

Let us draw some conclusions from the example cards in Figs. 5 and 6. If the opponent is strong (many direct points), then we have to look at the top little chart, where the difference between the two strategies is very small (yellow all over): against such a strong opponent, the choice of a strategy does not matter. We see in Fig. 5 that for the field strategies a weak opponent (an opponent with many direct errors) leads to the yellow-green little chart to the right, which is yellow only if the skills for a successful-and-on-target smash are large enough. That is, the risky field strategy against such an opponent is quite robustly never better than safe, and both strategies are equally good if our smash skills are good enough. The little chart at the origin, however (opponent with few direct points and few errors), shows a sharp dependence of the superiority of safe over risky on the smash skills of our players. This is plausible because against such an opponent we get many more chances for a smash during a rally, and its quality will influence the winning probability of a risky attack style substantially.
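The construction of one little chart of a score card can be sketched as a grid scan over success and fault probabilities. The function `diff` must be supplied by the user and is expected to return winProb(safe) − winProb(risky) for the given skill triple; the toy function in the usage example is purely hypothetical and unrelated to the paper's model. The grid resolution and the tolerance for "no difference" are likewise assumptions of this sketch.

```python
def colour(d, tol=0.01):
    """Map a winning-probability difference to the colour coding of the cards."""
    if d > tol:
        return "green"   # safe is better
    if d < -tol:
        return "red"     # risky is better
    return "yellow"      # no notable difference


def little_chart(diff, steps=4):
    """Scan all grid points with p_succ + p_fault <= 1.

    Returns a dict {(p_succ, p_fault): colour}. The deviation probability
    is implied by the other two, as in the text.
    """
    cells = {}
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p_succ = i / steps
            p_fault = j / steps
            p_dev = (steps - i - j) / steps
            cells[(p_succ, p_fault)] = colour(diff(p_succ, p_fault, p_dev))
    return cells


if __name__ == "__main__":
    # Hypothetical stand-in for winProb(safe) - winProb(risky).
    toy_diff = lambda p_succ, p_fault, p_dev: 0.5 - p_succ
    for cell, col in sorted(little_chart(toy_diff).items()):
        print(cell, col)
```

A full card would repeat this scan for every combination of the opponent's direct-point and direct-fault probabilities and arrange the resulting little charts on the outer grid.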


12 Extension: Two Person Constant Sum Game

Using the two-scale approach and the skill data from the previous section, we were able to generate the two-person constant-sum game presented in Figs. 7 and 8. We evaluated the s-MDP with transition probabilities resulting from a simulation of the g-MDP based on the estimated individual skills. Figure 7 uses skill estimates from the prefinal matches and Fig. 8 skill estimates from the final match. Each table presents the winning probability of Germany for 32 × 32 strategy combinations. These tables are an extension of the benchmark question, which compares only two strategies against a static opponent, to game theory, where the opponent team can also vary between strategies. Note that generating such a table from directly estimated s-transition probabilities would require a massive amount of data to obtain enough estimates for each strategy combination. Considering how rarely certain teams meet within a season, such a project would seem to have no chance of success.

Fig. 7 Winning probabilities for Germany for different strategy combinations of both teams in the prefinal setting (panel title: based on input parameters (skills) estimated from prefinal matches; axes: Strategy Brazil, Strategy Germany; label: π final)

Strategy Optimization in Sports via Markov Decision Problems


Fig. 8 Winning probabilities for Germany for different strategy combinations of both teams in the postfinal setting (panel title: based on input parameters (skills) estimated from final match; axes: Strategy Brazil, Strategy Germany; label: π final)

The strategies compared in the tables are all extreme strategies that are generated by the g-MDP strategy parameters

$$\pi_b,\quad \pi^{\text{serve}}_{h,*}(\rho_1),\quad \pi^{\text{field}}_{h,*}(\rho_1),\quad \pi^{\text{serve}}_{h,*}(\rho_2),\quad \pi^{\text{field}}_{h,*}(\rho_2).$$

Observe that the used technique and the target field for one situation are combined into one value. In the prefinal setting, the serving strategy πs is fixed to 0.5; in the postfinal setting, it is fixed to the observed value of πs in π final. By an extreme strategy, we mean that each parameter takes only the value 0 or 1. So, we consider 5 parameters, each of which can be 0 or 1, resulting in 32 different strategies. To present the strategies clearly in the table, we use a pattern to indicate the strategy used. A parameter that takes the value 0 is represented by a white field, a 1 by a grey field. Furthermore, we use the ordering of the parameters as presented above. So, for example, the first line in both tables corresponds to Germany playing the strategy 0, 0, 0, 0, 0, which means that player 2 is always blocking and both players play a safe serve and a safe field attack. Pure green means Germany wins; yellow depicts a 50–50 chance; and red means Germany loses. The intermediate colours indicate the intermediate values. We comment only on obvious patterns.

Prior to the match we would have recommended the following for Germany:
• Since the lower half of the table is greener, player 1 (Brink) should be the blocking player for Germany. This need not necessarily come from blocking skills but may also come from the order in which an attack is performed: the blocking player is most often also the setting player. This result surprises us, since player 2 (Reckermann) is actually the usual blocking player for Germany, and most probably for good reason. We still have to investigate whether this is an artefact or a reasonable option.
• Every other row is greener; thus, player 2 should play risky field attacks. This is also the case for player 1, but much less so (only a slight visible change in every other group of four rows).
• The most evenly green rows are the ones with pattern 1, ∗, 1, ∗, 1. All strategies matching this pattern achieve a high winning probability against all possible Brazilian strategies.

After the final, we see that things have changed quite substantially:
• The most prominent impression is that there is far more red: the winning probabilities for the skills estimated from the final only were much reduced for Germany. One possible reason for this is that the performance of the German players in the important actions was not as good as before, or that the performance of the Brazilian team improved over the prefinal estimates.
• In spite of this big change, the strategic a posteriori recommendation is still to have player 1 be the blocking player who plays risky in the field. It is much less important in hindsight, however, what player 2 does in the field.
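The enumeration of the 32 extreme strategies and the pattern notation 1, ∗, 1, ∗, 1 can be illustrated with a short sketch. The parameter names and the helper functions below are hypothetical and only mirror the ordering given in the text; the reading of the last bit as the field attack of player 2 follows from the "every other row" remark.

```python
from itertools import product

# Parameter order as listed in the text: blocker choice, then serve and field
# attack style of player 1, then serve and field attack style of player 2.
# The names are illustrative and not taken from the paper.
PARAMETERS = ("block", "serve_p1", "field_p1", "serve_p2", "field_p2")


def extreme_strategies():
    """Return all 0/1 assignments of the five parameters in table row order."""
    return list(product((0, 1), repeat=len(PARAMETERS)))


def matches(strategy, pattern):
    """Check a strategy against a pattern such as '1*1*1' ('*' is a wildcard)."""
    return all(p == "*" or int(p) == s for s, p in zip(strategy, pattern))


strategies = extreme_strategies()
robust = [s for s in strategies if matches(s, "1*1*1")]

print(len(strategies))   # 32
print(strategies[0])     # (0, 0, 0, 0, 0)
print(robust)
```

With this ordering the last parameter alternates from row to row and the first parameter separates the upper from the lower half of the table, which is the row structure the recommendations above refer to.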

13 Conclusion

We presented a new concept to answer principal, match-dependent strategic questions in sports games. The question itself is modelled by a strategic MDP (s-MDP) containing only information relevant to the question. If the direct estimation of the s-MDP from the available data is not satisfactory, an adequate gameplay MDP (g-MDP) can help to derive valuable s-MDP transition probabilities. The important property of the g-MDP is that its transition probabilities depend only on the separate skills of the players. With the probabilities derived from the g-MDP, the analytic solution of the s-MDP can be evaluated to answer the strategic question. We have extensively analysed the Olympic final 2012 by this new method. Some results are encouraging, and some surprising outcomes have yet to be investigated further. Since all estimations of probabilities are quite fragile, a sensitivity analysis of the results is a must. We have presented skill-strategy score cards as a means to graphically present sensitivity information showing the hot spots in parameter space where decisions switch fast. We think that other strategic questions with a structure similar to the benchmark question in this paper can be treated by following our concept of multi-scale modelling with MDPs. Future research will deal with situations in which skills depend on the current score, or with the role of variability for success. Moreover, one could try to optimize the detailed meanings of only roughly described strategies like risky and safe by finding the best possible s-g-implementation.

Acknowledgements We thank our student assistants Ronan Richter and Fabian Buck for their support in the extensive video analysis. Moreover, we thank the anonymous referees for valuable suggestions that greatly improved the presentation of this paper.

Appendix 1

In this section, we give the details of the proofs for Theorems 2 and 1. One important observation is the following plausible lemma that holds for both special cases of s-MDPs. It says that it is no disadvantage for us when we have more points or the opponent has fewer points.

Lemma 1 The optimal expected reward-to-go $v^*(x, y, k, \ell)$ satisfies $v^*(x, y, k, \ell) \le v^*(x + 1, y, k, \ell)$ and $v^*(x, y, k, \ell) \ge v^*(x, y + 1, k, \ell)$.

Proof We prove this by comparing all possible realizations of the game separately. First of all, the outcome of future rallies does not depend on the score. Each winning scenario starting from state $(x, y, k, \ell)$ corresponds to a winning scenario with identical transitions starting from state $(x + 1, y, k, \ell)$ with one stage less that has at least the same probability. Thus, the total winning probability starting from $(x + 1, y, k, \ell)$ is no smaller than the one starting in $(x, y, k, \ell)$. Moreover, each losing scenario starting from state $(x, y, k, \ell)$ corresponds to a losing scenario with identical transitions starting from state $(x, y + 1, k, \ell)$ with one stage less that has at least the same probability. Thus, the total losing probability starting from $(x, y + 1, k, \ell)$ is no smaller than the one starting in $(x, y, k, \ell)$. The claim expresses exactly this in terms of the optimal reward-to-go in the respective states.

In the previous lemma we compared the winning probabilities in states with identical service components. We now explain why the winning probability increases when we win the next point.

Lemma 2 The optimal expected reward-to-go satisfies $v^*(x + 1, y, P, 1) \ge v^*(x, y + 1, Q, 1)$.


Proof Team P, in order to win starting at state $(x, y + 1, Q, 1)$, first has to reach a score of $x + 1$ at some point in time. Thus, the main observation, denoted by (∗), is that all winning scenarios starting from state $(x, y + 1, Q, 1)$ pass through exactly one of the states $(x + 1, y + z, P, 1)$, $z = 1, \dots, 21 - y$. Let $W$ be the event that P wins, let $E$ be the event that state $(x, y + 1, Q, 1)$ is passed, and for $z = 1, \dots, 21 - y$ let $E_z$ be the event that state $(x + 1, y + z, P, 1)$ is passed. Then we compute:

$$
\begin{aligned}
v^*(x, y + 1, Q, 1) &= \operatorname{Prob}(W \mid E) \\
&= \sum_{z=1}^{21-y} \operatorname{Prob}(E_z \mid E)\,\operatorname{Prob}(W \mid E_z) && \text{(Markov property and (∗))} \\
&= \sum_{z=1}^{21-y} \operatorname{Prob}(E_z \mid E)\, v^*(x + 1, y + z, P, 1) \\
&\le \sum_{z=1}^{21-y} \operatorname{Prob}(E_z \mid E)\, v^*(x + 1, y, P, 1) && \text{(Lemma 1 and induction)} \\
&\le v^*(x + 1, y, P, 1) && \text{(by (∗))}.
\end{aligned}
$$

Thus, an optimal policy is myopic:

Corollary 1 The policy that always maximizes the probability to win the next point is optimal for the s-MDP for beach volleyball.

Appendix 2

This appended section defines the details of the infinite-horizon, stationary g-MDP for a beach volleyball rally that was sketched in Sect. 6. Let P and Q be the teams participating in the game. P1 and P2 are the players of P; Q1 and Q2 are the players of Q. Team P is the team for which we want to choose an optimal playing strategy, whereas team Q is the uncontrolled opposing team. That means, as in the s-MDP, team P is the decision-making team, and the behaviour of team Q is part of the system disturbance in the transition probabilities. We have decision epochs T = {1, 2, 3, . . .}, and t ∈ T is the total number of ball contacts minus the blocking contacts in the rally so far. A state in the g-MDP is a tuple that contains the players' positions, the ball's position, a counter of the number of contacts, the information which player last contacted the ball, a Boolean variable that indicates the hardness of the last hit, and the designated blocking player of the defending team for the next attack. A general formulation for a state is

(pos(P1), pos(P2), pos(Q1), pos(Q2), pos(ball), counter, lastContact, hard, blocker).

Fig. 9 Court grid (fields P00 to P35 on the side of team P and Q00 to Q35 on the side of team Q; direction labels F, B, L, R; dimension labels 0.5 m, 1 m, 3 m, 3.5 m, 4 m)

The function pos(·) returns the position of a player or the ball. A position on the court is defined on the basis of the grid presented in Fig. 9. The components counter and lastContact are needed to implement the three-hits rule and the double-contact rule, respectively. The state variable counter can take values from the set {−1, 0, 1, 2, 3}. The case "−1" marks a service state. This way it is possible to forbid a blocking action on services. The counter stays −1 if the ball crosses the net after a serve. This helps to distinguish between a reception and a defence action. Consequently, if the counter is 0, the ball crossed the net via an attack-hit performed in a field attack. The information which player last contacted the ball is needed to implement the double-contact fault in the model. The state variable lastContact takes values in {P1, P2, Q1, Q2, ∅}. If the ball has just crossed the net or the state is a serving state, a ∅-symbol shows that both players are allowed to execute the next hit. The Boolean state variable hard indicates the power of the last hit. If hard = 1, then the ball has a high speed when reaching the field; otherwise the ball has normal speed. Finally, the state variable blocker takes values in {P1, P2, Q1, Q2} and indicates the designated blocking player of the currently defending team. It is necessary to store it in the state since the decision who blocks is made once at the beginning of the opponent's attack plan and is followed for more than one time step. Besides these generic states, the g-MDP contains the absorbing states point and fault, where point and fault are denoted from the perspective of team P. The resulting g-MDP has around one billion different states. As an example, (P02, P33, Q12, Q13, P02, −1, ∅, 0, −) is a typical serving state for team P. Of course, some of the states occur more often in practice than others. Depending on the current state, there are different actions available to each player.
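The state tuple described above can be sketched as a small immutable record. This is only an illustration under assumptions: the class name, the field names and the use of `None` for the symbols ∅ and − are hypothetical, and positions are stored as plain grid labels without checking them against the court grid.

```python
from dataclasses import dataclass, replace
from typing import Optional

PLAYERS = ("P1", "P2", "Q1", "Q2")
COUNTER_VALUES = (-1, 0, 1, 2, 3)


@dataclass(frozen=True)
class RallyState:
    """One generic g-MDP state; labels such as 'P02' refer to grid fields."""
    pos_p1: str
    pos_p2: str
    pos_q1: str
    pos_q2: str
    pos_ball: str
    counter: int
    last_contact: Optional[str]  # None stands for the empty-set symbol
    hard: bool
    blocker: Optional[str]       # None stands for "-" in the serving example

    def __post_init__(self):
        if self.counter not in COUNTER_VALUES:
            raise ValueError("counter must lie in {-1, 0, 1, 2, 3}")
        for name in (self.last_contact, self.blocker):
            if name is not None and name not in PLAYERS:
                raise ValueError(f"unknown player {name!r}")

    def may_hit(self, player: str) -> bool:
        """Double-contact rule: the player who hit last may not hit again."""
        return self.last_contact != player


# The serving state quoted in the text.
serving = RallyState("P02", "P33", "Q12", "Q13", "P02", -1, None, False, None)
# A made-up follow-up state in which P1 touched the ball last.
after_contact = replace(serving, counter=1, last_contact="P1")
print(serving.may_hit("P1"), after_contact.may_hit("P1"), after_contact.may_hit("P2"))
```

A frozen dataclass is hashable, so such states could serve directly as dictionary keys for transition probabilities.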
The individual player actions of a player ρ consist of a hit h and a move μ. We distinguish between a one-field and a two-field movement. Also, the direction (f := forward, fr := forward-right, …) of the movement matters. A blocking action belongs to the group of movements since ball possession is not required to perform a block. A blocking action can only be performed if the player is in a field at the net. All possible moves for team P are listed in Table 13. The moves of the players that belong to team Q are defined analogously.

Table 13 Move specification for ρ belonging to team P

| Symbol | Specification | Description | Requirements |
|---|---|---|---|
| ∅ | – | Stay | None |
| m | f, fr, r, rb, b, bl, l, lf | Move one field | None |
| M | f, r, b, l | Move two fields | None |
| b | – | Block | pos(ρ) ∈ {P31, . . . , P34}, counter ≠ −1 |

Depending on the position of a player and on the position of the ball relative to the player, each player has a set of available hits. Sometimes, this set can consist solely of the hit no hit. A hit $h^{\text{tech}}_{\text{field}}$ is defined by a hitting technique tech and a target field field. Depending on the hit's degree of complexity, there are different requirements that must be met for the hit to be allowed in the model. The function neighbour(field) returns the set of all neighbouring fields of field according to the grid presented in Fig. 9, together with the field itself. All hitting techniques with their possible target fields and requirements are listed in Table 14. The hitting techniques for a player of team Q are defined analogously. There are rules in the model that restrict the possible combinations of a hit with a move to a player action, as well as rules that restrict the possible combinations of two player actions to a team action. These restrictions are motivated by practical considerations. There are three rules on combining a hit with a movement to a player action. The first one is: If a player makes a real hit, i.e., a hit that is not no hit, then for timing reasons only a one-field movement is allowed. The second one is: If a player makes a hit that is performed with a jump, e.g., a jump serve, only a one-field movement in forward direction (i.e., towards the net) can follow. The third one is: If the hit requires a movement before executing the hit, no additional movement afterwards is allowed. This is, e.g., the case for a reception that takes place in a neighbouring field of the hitting player. We incorporate one restriction on the combination of player actions: If two player actions are combined to a team action, only one player may make a real hit.
Team actions that violate these rules, or whose player actions violate them, are not available in the model, for either team. Further conceivable restrictions could easily be implemented in the model as long as they depend only on the current state. Transition probabilities determine the evolution of the system if a certain action in a certain state is chosen. Assume we know for each player ρ and each hit $h^{\text{tech}}_{\text{target}}$ the probability

$$p_{\text{succ},\rho}\bigl(\text{pos}(\rho), h^{\text{tech}}_{\text{target}}\bigr) := P\bigl(\text{pos}^{t+1}(\text{ball}) = \text{target} \mid \text{pos}^{t}(\rho), h^{\text{tech}}_{\text{target}}\bigr),$$


Table 14 Hit specification for player ρ of team P and ball ball; requires always ρ ≠ lastContact except for the action no hit (if pos(ball) ≠ pos(ρ), then no movement afterwards is allowed)

| Group | tech | target | Description | Requirement: counter | Requirement: position |
|---|---|---|---|---|---|
| | ∅ | | No hit | * | None |
| Serve | S_F | Q11–Q24 | Float serve | = −1 | pos(ρ) = pos(ball) ∈ P01–P04 |
| Serve | S_J | Q11–Q24 | Jump serve (hard) | = −1 | pos(ρ) = pos(ball) ∈ P01–P04 |
| Reception | r | P11–P34 | Receive | = −1 | pos(ball) = pos(ρ) |
| Reception | r_m | P11–P34 | Receive with move | = −1 | pos(ρ) ∈ neighbour(pos(ball)) |
| Setting | s | neighbour(pos(ρ)) \ (Q, ·) | Set | > 0 | pos(ρ) = pos(ball) |
| Attack-hit | F_SM | Q11–Q24 | Smash (hard) | > 1 | pos(ρ) = pos(ball) or pos(ρ) + m_f = pos(ball) |
| Attack-hit | F_E | Q11–Q24 | Emergency shot | > 1 | pos(ρ) ∈ neighbour(pos(ball)) |
| Attack-hit | F_P | Q11–Q34 | Planned shot | > 0 | pos(ρ) = pos(ball) |
| Defence | d | P11–P34 | Defence | ≠ −1 | pos(ball) = pos(ρ) |
| Defence | d_m | P11–P34 | Defence with move | ≠ −1 | pos(ρ) ∈ neighbour(pos(ball)) |

i.e., the probability that the specified target field target is met from ρ's position at time t. In the notation used above, the terms pos(ρ) and $h^{\text{tech}}_{\text{target}}$ show the dependence on the position of the hitting player and the hit he uses. The probability is time-independent. The t on the right-hand side of the last equation is only used to indicate that $\text{pos}^{t}(\rho)$ is the position of ρ at time t while $\text{pos}^{t+1}(\text{ball})$ is the position of the ball in the subsequent state. Similarly, assume we know the probability of an execution fault

$$p_{\text{fault},\rho}\bigl(\text{pos}(\rho), h^{\text{tech}}_{\text{target}}\bigr) := P\bigl(s^{t+1} = \text{fault} \mid \text{pos}^{t}(\rho), h^{\text{tech}}_{\text{target}}\bigr)$$

for player ρ using hit $h^{\text{tech}}_{\text{target}}$ from position pos(ρ). An execution fault includes hits where the ball is not correctly hit such that the referee terminates the rally. For serves and attack-hits an execution fault also includes that the ball is hit into the net. Furthermore, assume that we know the blocking skills of each player. The parameter $p_{\text{block},\rho}$ denotes the probability that player ρ touches the ball when performing the block b against an adequate attack-hit from the opponent's side of the court. The probability $p_{\text{block},\rho}$ is independent of the skills of the attacking player. There are three possible outcomes of that block. The block can be so strong that it is impossible for the opponent team to get the returned ball, and the blocking team wins the rally. This probability is denoted by $p_{\text{block},\rho,\text{point}}$. Also, the block can result in a fault with probability $p_{\text{block},\rho,\text{fault}}$. That happens if the ball is blocked into the net and cannot be regained or if the blocking player touches the net, which is an execution fault. None of the above happens with probability $p_{\text{block},\rho,\text{ok}} := p_{\text{block},\rho} - p_{\text{block},\rho,\text{point}} - p_{\text{block},\rho,\text{fault}}$. This is called an "ok"-block, and the ball lands in one random field on the opponent's or the own court side. We define $p_{\text{no block},\rho} := 1 - p_{\text{block},\rho}$ as the probability that the blocking player fails to get his hands on the ball. In this case, the landing field of the ball is not affected by the block. In total, the blocking probabilities satisfy

$$p_{\text{no block},\rho} + \underbrace{p_{\text{block},\rho,\text{point}} + p_{\text{block},\rho,\text{fault}} + p_{\text{block},\rho,\text{ok}}}_{p_{\text{block},\rho}} = 1.$$
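The decomposition of the blocking probabilities can be checked with a few lines of code. This is a minimal sketch: the function name and the numerical skill values are made up for illustration and are not estimates from the paper.

```python
def block_outcome_distribution(p_block, p_point, p_fault):
    """Return the probabilities of the four block outcomes of one player.

    p_block: probability that the blocker touches the ball
    p_point: probability of a direct point by the block
    p_fault: probability of a blocking fault
    """
    p_ok = p_block - p_point - p_fault
    if not 0.0 <= p_block <= 1.0 or min(p_point, p_fault, p_ok) < 0.0:
        raise ValueError("inconsistent blocking skills")
    return {
        "no_block": 1.0 - p_block,
        "point": p_point,
        "fault": p_fault,
        "ok": p_ok,
    }


# Illustrative skill values only.
dist = block_outcome_distribution(p_block=0.4, p_point=0.1, p_fault=0.05)
print(dist)
print(sum(dist.values()))
```

The "ok" probability is derived rather than supplied, so the four outcomes sum to one by construction, up to floating-point rounding.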

From all these input probabilities, we generate all transition probabilities in the g-MDP. We explain how the next state evolves from the current state and the played team actions: The next position of a player depends only on the current position and the movement the player makes. An allowed movement will always be successful. The crucial component is the next position of the ball. Here, the individual skills of the hitting player enter the model. Assume first that no player of the opposing team is blocking. Then with probability $p_{\text{succ},\rho}(\text{pos}(\rho), h^{\text{tech}}_{\text{target}})$ the ball's next position will be the desired target field, and with probability $p_{\text{fault},\rho}(\text{pos}(\rho), h^{\text{tech}}_{\text{target}})$ the hitting player makes an execution fault. The remaining probability

$$1 - p_{\text{succ},\rho}\bigl(\text{pos}(\rho), h^{\text{tech}}_{\text{target}}\bigr) - p_{\text{fault},\rho}\bigl(\text{pos}(\rho), h^{\text{tech}}_{\text{target}}\bigr) =: p_{\text{dev},\rho}\bigl(\text{pos}(\rho), h^{\text{tech}}_{\text{target}}\bigr)$$

will be the probability that the ball lands in a neighbouring field of the target field. We assume each neighbouring field is equally probable. If the hit is an attacking hit to the opponent's court side, then the ball may be blocked. The blocking action must be made from an adequate position (see footnote 3) such that the block can have an impact. If all preconditions are fulfilled, we first evaluate whether the hit is successful. A hit is successful if no execution fault occurs, the ball crosses the net, and approaches the target field or one of its neighbours with the respective probabilities. Given a successful attack, we evaluate in the next step the result of the block. If the blocking player does not touch the ball, then the next position of the ball will not be affected by the block. Otherwise, the outcome of the block is evaluated according to the blocking skill of that player and may be a point, a fault, or a different position of the ball. This need not automatically mean a point for the attacking team, since the defending team may perform a successful defence action in the next time step. Finally, in case of an execution fault or if the ball is not hit by any player, the next state will be point or fault, respectively, from the perspective of team P.

3 The attacking player must be in front of the blocking player. An attack-hit from the last row of the court (P11–P14 or Q11–Q14) cannot be blocked.
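The unblocked case described above can be sketched as a small function that spreads the deviation probability evenly over the neighbouring fields. The function name, the skill values and the list of neighbours are assumptions for illustration; in particular, the neighbour list is supplied by the caller and is not derived from the court grid of Fig. 9.

```python
def ball_distribution(target, neighbours, p_succ, p_fault):
    """Distribution of the outcome of an unblocked hit.

    target:     label of the intended target field
    neighbours: labels of the neighbouring fields, excluding the target
    p_succ:     probability of meeting the target field
    p_fault:    probability of an execution fault
    """
    p_dev = 1.0 - p_succ - p_fault
    if min(p_succ, p_fault) < 0.0 or p_dev < 0.0:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    if not neighbours:
        raise ValueError("at least one neighbouring field is required")
    dist = {target: p_succ, "fault": p_fault}
    for field in neighbours:
        # Each neighbouring field is assumed to be equally probable.
        dist[field] = p_dev / len(neighbours)
    return dist


# Illustrative values; the neighbour list is hypothetical.
dist = ball_distribution("Q22", ["Q12", "Q21", "Q23", "Q32"], p_succ=0.6, p_fault=0.1)
print(dist)
```

A blocked attack would add a second stage on top of this distribution, in which the block outcome is drawn from the blocking skills of the designated blocker.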

Appendix 3

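Table 15 Direct estimation of s-MDP probabilities for team Q

Based on prefinal matches

| Strategy | # (serve) | $p^{+}_{\text{serve}}(Q)$ (%) | $p^{-}_{\text{serve}}(Q)$ (%) | # (field) | $p^{+}_{\text{field}}(Q)$ (%) | $p^{-}_{\text{field}}(Q)$ (%) |
|---|---|---|---|---|---|---|
| risky-risky | 19 | 5 | 32 | 78 | 65 | 21 |
| risky-safe | 19 | 5 | 32 | 27 | 37 | 0 |
| safe-risky | 146 | 1 | 7 | 78 | 65 | 21 |
| safe-safe | 146 | 1 | 7 | 27 | 37 | 0 |

Based on final match

| Strategy | # (serve) | $p^{+}_{\text{serve}}(Q)$ (%) | $p^{-}_{\text{serve}}(Q)$ (%) | # (field) | $p^{+}_{\text{field}}(Q)$ (%) | $p^{-}_{\text{field}}(Q)$ (%) |
|---|---|---|---|---|---|---|
| risky-risky | 6 | 17 | 33 | 32 | 69 | 19 |
| risky-safe | 6 | 17 | 33 | 7 | 14 | 0 |
| safe-risky | 22 | 0 | 9 | 32 | 69 | 19 |
| safe-safe | 22 | 0 | 9 | 7 | 14 | 0 |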

References

1. Anbarcı, N., Sun, C., Ünver, M.: Designing fair tiebreak mechanisms: the case of penalty shootouts. Technical report, Boston College Department of Economics (2015)
2. Chan, T.C.Y., Singal, R.: A Markov decision process-based handicap system for tennis. J. Quant. Anal. Sports 12(4), 179–189 (2016). https://doi.org/10.1515/jqas-2016-0057
3. Clarke, S.R., Norman, J.M.: Dynamic programming in cricket: protecting the weaker batsman. Asia Pacific J. Oper. Res. 15(1) (1998)
4. Clarke, S.R., Norman, J.M.: Optimal challenges in tennis. J. Oper. Res. Soc. 63(12), 1765–1772 (2012)
5. Ferrante, M., Fonseca, G.: On the winning probabilities and mean durations of volleyball. J. Quant. Anal. Sports 10(2), 91–98 (2014)
6. FIVB: Official Beach Volleyball Rules 2009–2012. Technical report, Fédération Internationale de Volleyball (2008)
7. Heiner, M., Fellingham, G.W., Thomas, C.: Skill importance in women's soccer. J. Quant. Anal. Sports 287–302 (2014). https://doi.org/10.1515/jqas-2013-0119
8. Hirotsu, N., Wright, M.: Using a Markov process model of an association football match to determine the optimal timing of substitution and tactical decisions. J. Oper. Res. Soc. 53(1), 88–96 (2002). https://doi.org/10.1057/palgrave/jors/2601254
9. Hirotsu, N., Wright, M.: Determining the best strategy for changing the configuration of a football team. J. Oper. Res. Soc. 54(8), 878–887 (2003). https://doi.org/10.1057/palgrave.jors.2601591
10. Hoffmeister, S., Rambau, J.: Skill estimates - olympic beach volleyball tournament 2012 (2019). https://epub.uni-bayreuth.de/4150/
11. Kira, A., Inakawa, K., Fujita, T., Ohori, K.: A dynamic programming algorithm for optimizing baseball strategies. J. Oper. Res. Soc. Jpn. 62 (2019)
12. Koch, C., Tilp, M.: Analysis of beach volleyball action sequences of female top athletes. J. Human Sport Exer. 4(3), 272–283 (2009)
13. Miskin, M.A., Fellingham, G.W., Florence, L.W.: Skill importance in women's volleyball. J. Quant. Anal. Sports 6(2) (2010)
14. Mitchell, T.M.: Estimating Probabilities. In: Machine Learning, Chap. 2, pp. 1–11. McGraw-Hill Science/Engineering/Math (2017)
15. Nadimpalli, V.K., Hasenbein, J.J.: When to challenge a call in tennis: a Markov decision process approach. J. Quant. Anal. Sports 9(3), 229–238 (2013)
16. Norman, J.M.: Dynamic programming in tennis - when to use a fast serve. J. Oper. Res. Soc. 36(1), 75–77 (1985)
17. Norris, J.: Markov Chains. Cambridge University Press, Cambridge (1997)
18. Puterman, M.L.: Markov Decision Processes - Discrete Stochastic Dynamic Programming. Wiley, New York (2005)
19. Routley, K., Schulte, O.: A Markov Game Model for Valuing Player Actions in Ice Hockey. Uncertainty in Artificial Intelligence (UAI), pp. 782–791 (2015)
20. Schulte, O., Khademi, M., Gholami, S., Zhao, Z., Javan, M., Desaulniers, P.: A Markov game model for valuing actions, locations, and team performance in ice hockey. Data Mining Knowl. Discov. (2017). https://doi.org/10.1007/s10618-017-0496-z
21. Terroba, A., Kosters, W., Varona, J., Manresa-Yee, C.S.: Finding optimal strategies in tennis from video sequences. Int. J. Pattern Recogn. Artif. Intell. 27(06), 31 (2013). https://doi.org/10.1142/S0218001413550100
22. Turocy, T.L.: In search of the "last-ups" advantage in baseball: a game-theoretic approach. J. Quant. Anal. Sports 4(2) (2008). https://doi.org/10.2202/1559-0410.1104
23. Walker, M., Wooders, J., Amir, R.: Equilibrium play in matches: binary Markov games. Games Econ. Behav. 71(2), 487–502 (2011). https://doi.org/10.1016/j.geb.2010.04.011
24. Webb, J.N.: Game Theory - Decisions Interaction and Evolution. Springer, London (2007)
25. Wright, M., Hirotsu, N.: A Markov chain approach to optimal pinch hitting strategies in a designated hitter rule baseball game. J. Oper. Res. Soc. Jpn. 46(3), 353–371 (2003)
26. Wright, M., Hirotsu, N.: The professional foul in football: Tactics and deterrents. J. Oper. Res. Soc. 54(3), 213–221 (2003). https://doi.org/10.1057/palgrave.jors.2601506

An Application of RASPEN to Discontinuous Galerkin Discretisation for Richards' Equation in Porous Media Flow

Peter Bastian and Chaiyod Kamthorncharoen

Abstract Nonlinear algebraic systems of equations resulting from Discontinuous Galerkin (DG) discretisation of partial differential equations are typically solved by Newton's method. In this study, we propose a nonlinear preconditioner for Newton's method for solving such systems, which is a modification of RASPEN (Restricted Additive Schwarz Preconditioned Exact Newton). We employ inexact inner solves and different Partition of Unity (PU) operators. The basic idea of RASPEN is to use a fixed-point iteration to produce a new (non-)linear system which has the same solution as the original system and to solve it using Newton's method. The restricted additive Schwarz method is used as a nonlinear preconditioner and enables parallel computation by division into subdomain problems. We apply this method to the p-Laplace equation and to Richards' equation in porous media flow.

1 Introduction

Newton's method is a powerful tool for solving nonlinear algebraic systems arising after discretisation of nonlinear partial differential equations. It is only locally convergent and requires a globalization strategy such as line search or trust region [1, 18]. In each Newton iteration, the linearized system has to be solved which, if the system is large, might be computationally expensive. One can apply a domain decomposition method in order to solve the linearized system in parallel, which results in Newton-Krylov-Schwarz methods [6, 7], where the domain decomposition method is applied as a preconditioner for the linear problems.

P. Bastian · C. Kamthorncharoen (B)
Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_15


P. Bastian and C. Kamthorncharoen

Recently, domain decomposition methods have been applied in the context of "nonlinear preconditioning". The term was introduced by Cai and Keyes with the Additive Schwarz Preconditioned Inexact Newton (ASPIN) method [5]. Instead of solving the original system, a new nonlinear system of equations is created from additive Schwarz, which is then solved by Newton's method. The new system is called the "preconditioned nonlinear system". This works in the overlapping domain decomposition framework. For the nonoverlapping case, Bordeu, Boucard, and Gosselet proposed a nonlinear Neumann-Neumann (BDD) method [4], and Klawonn, Lanser, and Rheinbach introduced nonlinear dual-primal Finite Element Tearing and Interconnecting (FETI-DP) and Balancing Domain Decomposition by Constraints (BDDC) methods [13]. Restricted Additive Schwarz Preconditioned Exact Newton (RASPEN), proposed by Dolean et al. [10], uses the same technique as ASPIN with different restriction (extension) operators. RASPEN employs restriction (extension) operators which satisfy the Partition of Unity (PU) property. Here, we try another kind of restriction (extension) operators which still preserves PU. Another difference is that ASPIN uses an approximate Jacobian while RASPEN uses the exact one since some components can be reused. ASPIN and RASPEN have successfully been applied to various problems [5, 8–10, 12] but mainly for Finite Difference (FD) and Finite Element (FE) discretisations. This work shows that this nonlinear preconditioner also performs well for Discontinuous Galerkin (DG) discretisations, which lead to more complicated nonlinear systems. We apply RASPEN to test problems in order to investigate the robustness of this method. The generic implementation of RASPEN is realised in the DUNE framework [2, 3] and can easily be applied to any grid-based discretisation scheme (FD, FE, DG). The p-Laplace equation is the first model problem we look into.
Another model problem is Richards’ equation, a nonlinear partial differential equation introduced in 1931 by Richards [17] describing a flow in unsaturated soil. Nowadays, it is still a challenging problem.

2 RASPEN

In this section, we explain in more detail the nonlinear preconditioner we use in this work. We recall here the RASPEN methodology from [10]. Consider the nonlinear algebraic problem

$$F(z) = 0 \qquad (1)$$

where $F : \mathbb{R}^I \to \mathbb{R}^I$ is a nonlinear function and $I = \{1, \dots, n\}$ is an index set. It is understood that $z \in \mathbb{R}^I$ is a vector with components $(z)_i \in \mathbb{R}$ for all $i \in I$. We assume that the solution $z$ of (1) is unique.

An Application of RASPEN to Discontinuous Galerkin Discretisation …


2.1 Subdomain Solves

Let $I_k \subseteq I$, $1 \le k \le p$, $n_k = |I_k|$, be a decomposition of the index set $I$ into possibly overlapping subdomains, i.e.

$$\bigcup_{k=1}^{p} I_k = I.$$

The restriction (or picking-out) operators $R_k : \mathbb{R}^I \to \mathbb{R}^{I_k}$ for $1 \le k \le p$ are defined by

$$(R_k x)_\alpha = (x)_\alpha \qquad \forall \alpha \in I_k.$$

Below we make use of a second restriction $\hat{R}_k : \mathbb{R}^I \to \mathbb{R}^{I_k}$ enabling a partition of unity (PU), i.e. it satisfies

$$\sum_{k=1}^{p} R_k^T \hat{R}_k = I \qquad \text{on } \mathbb{R}^I.$$

One may define the PU-restriction by $\hat{R}_k = D_k R_k$ where the $D_k$ are $n_k \times n_k$ diagonal matrices with $0 \le (D_k)_{\alpha,\alpha} \le 1$ and $\sum_{k : \alpha \in I_k} (D_k)_{\alpha,\alpha} = 1$. We set $(D_k)_{\alpha,\alpha} = \frac{1}{\chi_\alpha}$ where $\chi_\alpha = |\{k \mid \alpha \in I_k\}|$, that is, the number of subdomains to which $(x)_\alpha$ belongs. Note also that

$$\sum_{k=1}^{p} R_k^T \hat{R}_k = I = I^T = \sum_{k=1}^{p} \hat{R}_k^T R_k.$$

With this notation in place we define the solution operator $G_k : \mathbb{R}^I \to \mathbb{R}^{I_k}$ for $1 \le k \le p$: Given $x \in \mathbb{R}^I$, for any $k \in \{1, \dots, p\}$ it holds that

$$R_k F\bigl((I - R_k^T R_k)x + R_k^T G_k(x)\bigr) = 0. \qquad (2)$$

Note that $x$ is replaced by $G_k(x)$ at all indices in $I_k$.
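The partition of unity property can be verified numerically on a toy decomposition. The sketch below uses NumPy with dense matrices purely for illustration; the index sets, the helper name `restriction` and the problem size are assumptions and not part of the paper or of the DUNE implementation.

```python
import numpy as np


def restriction(indices, n):
    """Boolean picking-out matrix R_k of shape (n_k, n)."""
    R = np.zeros((len(indices), n))
    R[np.arange(len(indices)), indices] = 1.0
    return R


n = 6
# Two overlapping index sets I_k (hypothetical, zero-based).
subdomains = [[0, 1, 2, 3], [2, 3, 4, 5]]

# chi[alpha] = number of subdomains that contain index alpha.
chi = np.zeros(n)
for idx in subdomains:
    chi[idx] += 1.0

R = [restriction(idx, n) for idx in subdomains]
D = [np.diag(1.0 / chi[idx]) for idx in subdomains]
R_hat = [Dk @ Rk for Dk, Rk in zip(D, R)]

# Both orderings of the product should give the identity on R^I.
pu_left = sum(Rk.T @ Rhk for Rk, Rhk in zip(R, R_hat))
pu_right = sum(Rhk.T @ Rk for Rk, Rhk in zip(R, R_hat))
print(np.allclose(pu_left, np.eye(n)), np.allclose(pu_right, np.eye(n)))
```

In a real code the operators would not be stored as matrices; restriction is an index gather and the weights 1/χ are applied entrywise.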

2.2 Preconditioned Nonlinear System

Using the subdomain solves given in (2) the additive nonlinear domain decomposition method is then defined as

$$x^{l+1} = \sum_{k=1}^{p} \hat{R}_k^T G_k(x^l). \qquad (3)$$

Note that here the PU-based extension is used to glue together the subdomain solutions. It is assumed that this iteration is convergent (one may also employ a damped version) to the unique fixed point $x^*$ satisfying

$$x^* = \sum_{k=1}^{p} \hat{R}_k^T G_k(x^*). \qquad (4)$$

The fixed point coincides with the solution of (1), as shown in the following lemma.

Lemma 1 The fixed point $x^*$ of (4) coincides with the solution $z$ of (1).

Proof Since $z$ is the unique solution of $F(z) = 0$ we have for any $k \in \{1, \dots, p\}$:

$$0 = F(z) = F\bigl((I - R_k^T R_k)z + R_k^T R_k z\bigr),$$

i.e. $G_k(z) = R_k z$. From this follows

$$\sum_{k=1}^{p} \hat{R}_k^T G_k(z) = \sum_{k=1}^{p} \hat{R}_k^T R_k z = \Bigl(\sum_{k=1}^{p} \hat{R}_k^T R_k\Bigr) z = z,$$

i.e. $z$ is a fixed point of (3). Since the fixed point is assumed to be unique we have $z = x^*$.

This means the new nonlinear system

$$\mathcal{G}(x) := x - \sum_{k=1}^{p} \hat{R}_k^T G_k(x) = 0 \qquad (5)$$

has the solution $z$ and is called the preconditioned nonlinear system.
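The statement of Lemma 1 can be illustrated on a small model problem. The following sketch is not the DUNE implementation: the test function F (a diagonally dominant tridiagonal matrix plus a cubic term), the two overlapping index sets and the undamped local Newton solver are all assumptions chosen so that the fixed-point iteration (3) converges.

```python
import numpy as np

n = 8
A = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)


def F(z):
    """Toy nonlinear residual F(z) = A z + z^3 - b."""
    return A @ z + z**3 - b


def jac_F(z):
    return A + np.diag(3.0 * z**2)


def newton(residual, jacobian, y0, tol=1e-13, max_iter=50):
    """Plain Newton iteration without globalization."""
    y = y0.copy()
    for _ in range(max_iter):
        r = residual(y)
        if np.linalg.norm(r) < tol:
            break
        y = y - np.linalg.solve(jacobian(y), r)
    return y


subdomains = [np.arange(0, 5), np.arange(3, 8)]  # overlap at indices 3 and 4
chi = np.zeros(n)
for idx in subdomains:
    chi[idx] += 1.0


def local_solve(x, idx):
    """G_k(x): solve R_k F((I - R_k^T R_k) x + R_k^T y) = 0 for y."""
    def embed(y):
        full = x.copy()
        full[idx] = y
        return full
    return newton(lambda y: F(embed(y))[idx],
                  lambda y: jac_F(embed(y))[np.ix_(idx, idx)],
                  x[idx])


def fixed_point_step(x):
    """One step of iteration (3), gluing local solutions with weights 1/chi."""
    x_new = np.zeros(n)
    for idx in subdomains:
        x_new[idx] += local_solve(x, idx) / chi[idx]
    return x_new


x = np.zeros(n)
for it in range(100):
    x_next = fixed_point_step(x)
    converged = np.linalg.norm(x_next - x) < 1e-12
    x = x_next
    if converged:
        break

z = newton(F, jac_F, np.zeros(n))  # reference solution of F(z) = 0
print(np.linalg.norm(x - z), np.linalg.norm(F(x)))
```

The fixed point of the glued subdomain solves agrees with the solution of the global Newton solve up to the iteration tolerance, which is what the lemma asserts.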

2.3 Newton Iteration

The preconditioned nonlinear system (5) is now solved using Newton's method with line search:

$$x^{l+1} = x^l - \lambda^l \big(\nabla \mathcal{G}(x^l)\big)^{-1} \mathcal{G}(x^l). \tag{6}$$

This iteration requires the solution of the linear system (tangent system)

$$\nabla \mathcal{G}(x^l)\, v^l = \mathcal{G}(x^l)$$

which, for simplicity, is considered to be solved by Richardson's iteration

$$v^{l,\kappa+1} = v^{l,\kappa} + \omega \big(\mathcal{G}(x^l) - \nabla \mathcal{G}(x^l)\, v^{l,\kappa}\big). \tag{7}$$

Remark
• Richardson's iteration is used here for simplicity to illustrate the "matrix-free" computation of the matrix-vector product $\nabla \mathcal{G}(x^l)\, v^{l,\kappa}$.
• Below we will use GMRES without preconditioner since we are interested only in the robustness of the nonlinear iteration.
• One can consider "two-level" methods using a coarse grid correction, as in [10, 15], in order to improve convergence rates for the global linear iteration (7).
• Line search is not employed in the numerical experiments.

Since $\mathcal{G}$ involves the subdomain solution operators, the Jacobian $\nabla \mathcal{G}(x^l)$ is too costly to be set up and we opt for a matrix-free implementation of the iteration (7). In the following we consider in detail how the product $\nabla \mathcal{G}(x^l)\, v^{l,\kappa}$ is formed. From the definition of $\mathcal{G}$ we obtain

$$\nabla \mathcal{G}(x^l) = \nabla \Big( x^l - \sum_{k=1}^{p} \hat{R}_k^T G_k(x^l) \Big) = I - \sum_{k=1}^{p} \hat{R}_k^T \nabla G_k(x^l), \tag{8}$$

i.e. the Jacobians of the subdomain solution operators are required. These are obtained by applying the chain rule to (2). For any $\alpha \in I_k$, $\beta \in I$ observe

$$0 = \frac{\partial}{\partial x_\beta} \Big[ R_k F\big((I - R_k^T R_k)x + R_k^T G_k(x)\big) \Big]_\alpha = \sum_{\gamma \in I} \big[ R_k \nabla F(x_k) \big]_{\alpha,\gamma} \big[ I - R_k^T R_k + R_k^T \nabla G_k(x) \big]_{\gamma,\beta}$$

where we have set for abbreviation $x_k = x_k(x) = (I - R_k^T R_k)x + R_k^T G_k(x)$. From this we obtain the $n_k \times n$ matrix identity

$$\begin{aligned}
& R_k \nabla F(x_k) \big( I - R_k^T R_k + R_k^T \nabla G_k(x) \big) = 0 \\
\Leftrightarrow\ & R_k \nabla F(x_k) - R_k \nabla F(x_k) R_k^T R_k + R_k \nabla F(x_k) R_k^T \nabla G_k(x) = 0 \\
\Leftrightarrow\ & R_k \nabla F(x_k) R_k^T \nabla G_k(x) = R_k \nabla F(x_k) R_k^T R_k - R_k \nabla F(x_k) \\
\Leftrightarrow\ & \nabla G_k(x) = R_k - \big( R_k \nabla F(x_k) R_k^T \big)^{-1} R_k \nabla F(x_k).
\end{aligned} \tag{9}$$

So the application of $\nabla G_k(x)$ to a vector can be computed by solving a local linear system with the matrix $R_k \nabla F(x_k) R_k^T$, which is the same local system as in the local subdomain solves.


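Inserting this result into (8) yields

$$\begin{aligned}
\nabla \mathcal{G}(x^l) &= I - \sum_{k=1}^{p} \hat{R}_k^T \Big[ R_k - \big( R_k \nabla F(x_k^l) R_k^T \big)^{-1} R_k \nabla F(x_k^l) \Big] \\
&= I - \sum_{k=1}^{p} \hat{R}_k^T R_k + \sum_{k=1}^{p} \hat{R}_k^T \big( R_k \nabla F(x_k^l) R_k^T \big)^{-1} R_k \nabla F(x_k^l) \\
&= \sum_{k=1}^{p} \hat{R}_k^T \big( R_k \nabla F(x_k^l) R_k^T \big)^{-1} R_k \nabla F(x_k^l),
\end{aligned} \tag{10}$$

where we have set $x_k^l = (I - R_k^T R_k)x^l + R_k^T G_k(x^l)$.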

2.4 Algorithm

We summarize the procedure outlined above, using Richardson's iteration for the global linear solves, in the following algorithm for solving the preconditioned nonlinear system (5):

1: for $l = 0, \dots,$ convergence do (outer Newton iteration)
2:   solve $R_k F(x_k^l) = 0$ for $G_k(x^l)$ (in parallel) (local nonlinear solves)
3:   compute $b = \mathcal{G}(x^l) = x^l - \sum_{k=1}^{p} \hat{R}_k^T G_k(x^l)$
4:   for $\kappa = 0, \dots,$ convergence do (global linear solves)
5:     solve $R_k \nabla F(x_k^l) R_k^T\, y_k^{l,\kappa} = R_k \nabla F(x_k^l)\, v^{l,\kappa}$ (in parallel)
6:     compute $v^{l,\kappa+1} = v^{l,\kappa} + \omega \big( b - \sum_{k=1}^{p} \hat{R}_k^T y_k^{l,\kappa} \big)$
7:   end for
8:   update $x^{l+1} = x^l - \lambda^l v^l$
9: end for

Note that Newton's method is also used for solving the local nonlinear problems in line 2 in each subdomain. We apply a direct solver for the linearized system in the subdomains. For solving the global linear system, line 5 requires a linear subdomain solve, for which the local Jacobian $R_k \nabla F(x_k^l) R_k^T$ can be reused from the local nonlinear solve in line 2.
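The algorithm above can be illustrated on a tiny dense example. The following is a rough sketch under several assumptions that are not part of the paper: the model problem $F(x) = Ax + x^3 - b$ (a 1D finite difference Laplacian with a cubic term), the two overlapping index sets, and all function names are hypothetical; the tangent system is solved with a dense direct solve of the assembled matrix (10) rather than with Richardson's iteration or GMRES; and no line search is used ($\lambda^l = 1$).

```python
import numpy as np

# Hypothetical model problem (not from the paper): F(x) = A x + x^3 - b.
n = 12
h = 1.0 / (n + 1)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
b = 10.0 * np.ones(n)

def F(x):
    return A @ x + x**3 - b

def dF(x):
    return A + np.diag(3.0 * x**2)

# Two overlapping index sets, restrictions R_k and PU-restrictions R_hat_k.
index_sets = [np.arange(0, 7), np.arange(5, 12)]
R = [np.eye(n)[I] for I in index_sets]
chi = sum(Rk.T @ Rk for Rk in R).diagonal()
R_hat = [Rk / chi[I][:, None] for I, Rk in zip(index_sets, R)]

def local_solve(Rk, x, tol=1e-12, maxit=50):
    """Newton iteration for u = G_k(x), i.e. R_k F((I - R_k^T R_k) x + R_k^T u) = 0."""
    u = Rk @ x
    for _ in range(maxit):
        xk = x - Rk.T @ (Rk @ x) + Rk.T @ u
        r = Rk @ F(xk)
        if np.linalg.norm(r) < tol:
            break
        u = u - np.linalg.solve(Rk @ dF(xk) @ Rk.T, r)
    return u

def raspen(x, tol=1e-10, maxit=20):
    """Outer Newton iteration on the preconditioned system (5)."""
    for it in range(maxit):
        G = [local_solve(Rk, x) for Rk in R]                 # line 2
        g = x - sum(Rh.T @ Gk for Rh, Gk in zip(R_hat, G))   # line 3, Eq. (5)
        if np.linalg.norm(g) < tol:
            return x, it
        J = np.zeros((n, n))                                 # Eq. (10), assembled densely
        for Rk, Rh, Gk in zip(R, R_hat, G):
            xk = x - Rk.T @ (Rk @ x) + Rk.T @ Gk
            Jk = dF(xk)
            J += Rh.T @ np.linalg.solve(Rk @ Jk @ Rk.T, Rk @ Jk)
        x = x - np.linalg.solve(J, g)                        # line 8 with lambda = 1
    return x, maxit

x_sol, n_outer = raspen(np.zeros(n))
```

Assembling the matrix in (10) is only affordable for such a toy problem; in a matrix-free setting one would apply the same local solves to a vector instead, as in lines 5 and 6 of the algorithm.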

3 Numerical Experiments

In this section, we show the application of RASPEN to two test problems. All numerical experiments in this study were carried out in the Distributed and Unified Numerics Environment (DUNE) [2, 3], a free software framework for solving partial differential equations with grid-based methods.


3.1 P-Laplace Equation

First, we apply our approach to the p-Laplace equation which appeared in [13]. The p-Laplace operator is defined for $p \ge 2$ by

$$\Delta_p u := \nabla \cdot \big( |\nabla u|^{p-2} \nabla u \big), \tag{11}$$

where $p$ could have different values in different parts of the domain. Note that for $p = 2$ the operator reduces to the linear Laplacian. Let $\Omega_{i,\eta}$ be defined inside each subdomain $\Omega_i$ by $\Omega_{i,\eta} := \{x \in \Omega_i ;\ \mathrm{dist}(x, \partial\Omega_i) < \eta\}$ and let the inner part of subdomain $\Omega_i$ surrounded by $\Omega_{i,\eta}$ be denoted by $\Omega_{i,I} := \Omega_i \setminus \Omega_{i,\eta}$. We also define $\Omega_I := \bigcup_{i=1}^{N} \Omega_{i,I}$ and $\Omega_\eta := \bigcup_{i=1}^{N} \Omega_{i,\eta}$. Then, with homogeneous Dirichlet boundary condition, we want to find the solution of the partial differential equation

$$-\Delta_p u = 1 \ \text{in } \Omega, \qquad u = 0 \ \text{on } \partial\Omega, \tag{12}$$

where we always set $p = 2$ in $\Omega_\eta$ and $p > 2$ in $\Omega_I$ as reported below (Fig. 1).

The computational domain in this experiment is the unit square $\Omega := [0, 1] \times [0, 1]$, which is divided into 2 × 2 or 4 × 4 subdomains. Equation (12) is discretised by continuous Finite Element (FE) and Discontinuous Galerkin (DG) schemes with rectangular $Q_1$ finite elements. The algebraic nonlinear equation obtained after discretisation is solved by RASPEN. We choose the initial condition $u^{(0)}(x, y) = x(1 - x)y(1 - y)$, which satisfies the Dirichlet boundary condition, and set $\eta = \frac{1}{8} \times$ length of subdomain. We run both schemes on 4 and 16 processors in parallel with overlap $\delta = h$ and increase the number of elements from 64 × 64 to 512 × 512. Tables 1 and 2 compare the number of Newton iterations required by standard Newton and by RASPEN.

Fig. 1 Left: subdomain $\Omega_i$, with the hull $\Omega_{i,\eta}$ surrounding the inner part $\Omega_{i,I}$. Right: example of a computational domain for 4 subdomains

4 subd. 64×64 cells 4 subd. 128×128 cells 4 subd. 256×256 cells 4 subd.512×512 cells 16 subd. 64×64 cells 16 subd. 128×128 cells 16 subd. 256×256 cells 16 subd. 512×512 cells

4 subd. 64×64 cells 4 subd. 128×128 cells 4 subd. 256×256 cells 4 subd. 512×512 cells 16 subd. 64×64 cells 16 subd. 128×128 cells 16 subd. 256×256 cells 16 subd. 512×512 cells

4 4 5 6 4 4 4 6 p = 4.0 Newton 17 20 – – 18 – – –

p = 2.4 Newton

RASPEN 10 11 13 13 11 15 16 17

5 5 5 5 5 5 5 5

RASPEN 6 7 7 8 6 6 7 8 p = 4.4 Newton 25 – – – – – – –

p = 2.8 Newton

RASPEN 12 12 18 23 16 17 12 19

6 6 7 7 6 6 7 7

RASPEN 9 10 11 10 8 10 10 9 p = 4.8 Newton 32 – – – – – – –

p = 3.2 Newton

RASPEN 13 14 – – 20 12 19 40

8 8 9 9 7 7 8 8

RASPEN 13 15 16 16 12 14 15 16 p = 5.2 Newton – – – – – – – –

p = 3.6 Newton

RASPEN 14 40 – – – 12 – –

9 11 12 12 8 9 13 14

RASPEN

Table 1 Numbers of Newton iterations of standard Newton (Newton) and RASPEN for the p-Laplace equation with p = 2.4, 2.8, …, 5.2 in $\Omega_I$ and p = 2 in $\Omega_\eta$, discretised by the Finite Element (FE) scheme. '–' denotes divergence of the method


4 subd. 64×64 cells 4 subd. 128×128 cells 4 subd. 256×256 cells 4 subd. 512×512 cells 16 subd. 64×64 cells 16 subd. 128×128 cells 16 subd. 256×256 cells 16 subd. 512×512 cells

4 subd.64×64 cells 4 subd. 128×128 cells 4 subd. 256×256 cells 4 subd. 512×512 cells 16 subd. 64×64 cells 16 subd. 128×128 cells 16 subd. 256×256 cells 16 subd. 512×512 cells

3 3 4 5 3 3 4 4 p = 2.5 Newton 18 25 – 9 14 13 – –

p = 2.1 Newton

RASPEN 4 4 5 – 4 4 4 –

3 3 4 4 3 3 4 4

RASPEN 5 7 8 11 5 7 8 11 p = 2.6 Newton 14 – – – 17 – – –

p = 2.2 Newton

RASPEN 4 4 5 – 4 4 5 5

4 4 4 4 3 4 4 5

RASPEN 11 9 11 8 12 9 11 p = 2.7 Newton 16 – – – 5 – – –

p = 2.3 Newton

RASPEN 5 5 – – – – – –

4 4 4 5 4 4 4 5

RASPEN 13 17 18 13 p = 2.8 Newton – – – – – – – –

p = 2.4 Newton

RASPEN – 5 – – – 5 – –

4 4 4 5 4 4 4 5

RASPEN

Table 2 Numbers of Newton iterations of standard Newton (Newton) and RASPEN for the p-Laplace equation with p = 2.1, 2.2, …, 2.8 in $\Omega_I$ and p = 2 in $\Omega_\eta$, discretised by the Discontinuous Galerkin (DG) scheme. '–' denotes divergence of the method


As Tables 1 and 2 show, RASPEN is able to solve the nonlinear algebraic systems obtained from both the Finite Element (FE) and the Discontinuous Galerkin (DG) discretisation schemes. Moreover, it seems to be more robust in the sense that it is able to solve more cases with higher values of p. RASPEN can solve the p-Laplace problem up to p = 5.2 for FE and p = 2.8 for DG, whereas Newton's method can solve it only up to p = 4.8 for FE and p = 2.7 for DG. In addition, when both approaches converge, RASPEN takes fewer iterations than standard Newton, and the difference grows as p increases.

3.2 Richards' Equation

Next, we turn to another model problem. In this case, we look into an instationary problem which might be more challenging. Richards' equation describes the fluid flow in an unsaturated porous medium. It has nonlinearities in both the temporal and the spatial derivatives. In this study, we present Richards' equation in head-based formulation, written as

$$\partial_t \theta(\phi) - \nabla \cdot \big( K \kappa(\phi) (\nabla \phi - g) \big) = 0 \tag{13}$$

where $\theta$ is the water content, $\phi$ is the hydraulic head, $K$ is the conductivity for saturated conditions, $\kappa$ is the relative conductivity and $g$ is the gravitational force. Then, we define the saturation function $\Theta$ as a function of water content by

$$\Theta = \frac{\theta(\phi) - \theta_r}{\theta_s - \theta_r},$$

where $\theta_r$ is the residual water content after complete drainage, and $\theta_s$ is the water content under fully saturated conditions. In this study, the parameterisation model of saturation $\Theta$ is chosen as the van Genuchten model [11] expressed by

$$\Theta(\phi) = \big( 1 + [\alpha |\phi|]^n \big)^{\frac{1-n}{n}}$$

where $\alpha$ and $n$ are the parameters of the van Genuchten model. It follows that

$$\theta(\phi) = \theta_r + \big( 1 + [\alpha |\phi|]^n \big)^{\frac{1-n}{n}} \cdot [\theta_s - \theta_r].$$

Moreover, we use the combination of the Mualem [16] and van Genuchten [11] parametrisation models to determine the relative conductivity $\kappa$,

$$\kappa(\phi) = \big( 1 + [\alpha |\phi|]^n \big)^{\tau \cdot \frac{1-n}{n}} \cdot \Big( 1 - [\alpha |\phi|]^{n-1} \big( 1 + [\alpha |\phi|]^n \big)^{\frac{1-n}{n}} \Big)^2,$$

where $\tau$ is the parameter of the Mualem model. Additionally, the conductivity field $K$ is randomly generated by dune-randomfield [14], a DUNE submodule providing Gaussian random fields based on circulant embedding.

We consider the square $\Omega := [0, 2] \times [0, 2]$ as the computational domain and divide it into 16 subdomains. The equation is discretised using Discontinuous Galerkin (DG) with rectangular $Q_1$ finite elements on 256 × 256 elements and implicit Euler time stepping. We set the initial state for the hydraulic head to $\phi^{(0)}(x, y) = -y$, with Dirichlet boundary values of −2.0 at the top and 0.0 at the bottom, and homogeneous Neumann boundary conditions at the left and right boundaries. The time step is chosen to be 1 s. The parameters $\theta_r$, $\theta_s$, $\alpha$, $n$, and $\tau$ are 0.05, 0.35, 3, 2, and 0.5, respectively. With these settings, $\phi$ is hydrostatic in the initial condition. Therefore, we slightly increase the Dirichlet boundary value at the bottom from 0.0 to 0.2 during the first 20 s. The flow then moves upward through the medium.

In general, the problem cannot be solved for every random conductivity field. Therefore, we show here results from 10 solvable fields with overlap $\delta = h, 2h, \dots, 5h$, run on 16 processors in parallel. Table 3 shows the total number of nonlinear iterations for Newton and RASPEN over 100 time steps. RASPEN clearly takes fewer outer Newton iterations than standard Newton. Increasing the overlap $\delta$ does not lead to any significant reduction in the number of outer Newton iterations.

Table 3 Numbers of Newton iterations of standard Newton (Newton) and RASPEN on Richards' equation for overlap size $\delta = h, 2h, \dots, 5h$ from 10 different conductivity fields

| Overlap size | Method | Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | Field 9 | Field 10 |
| δ = h | Newton | 366 | 329 | 342 | 343 | 324 | 336 | 314 | 336 | 337 | 328 |
| δ = h | RASPEN | 339 | 310 | 338 | 333 | 302 | 324 | 276 | 317 | 305 | 301 |
| δ = 2h | Newton | 367 | 330 | 352 | 349 | 324 | 336 | 323 | 339 | 342 | 324 |
| δ = 2h | RASPEN | 340 | 312 | 337 | 335 | 311 | 318 | 301 | 322 | 311 | 308 |
| δ = 3h | Newton | 365 | 333 | 352 | 346 | 321 | 340 | 322 | 339 | 341 | 326 |
| δ = 3h | RASPEN | 341 | 301 | 336 | 329 | 309 | 314 | 301 | 320 | 313 | 310 |
| δ = 4h | Newton | 364 | 333 | 347 | 350 | 322 | 340 | 322 | 341 | 342 | 328 |
| δ = 4h | RASPEN | 328 | 300 | 325 | 322 | 302 | 308 | 300 | 313 | 307 | 303 |
| δ = 5h | Newton | 366 | 335 | 354 | 351 | 321 | 342 | 323 | 340 | 342 | 328 |
| δ = 5h | RASPEN | 327 | 302 | 324 | 315 | 303 | 308 | 301 | 311 | 306 | 302 |
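The van Genuchten and Mualem relations of this section can be evaluated in a few lines. The sketch below is only illustrative: the function and constant names are hypothetical, the parameter values are those quoted in the text, and the formulas are evaluated exactly as written above (with $|\phi|$), without any special treatment of the saturated range.

```python
# Parameters quoted in the text: theta_r, theta_s, alpha, n, tau.
THETA_R, THETA_S = 0.05, 0.35
ALPHA, N_VG, TAU = 3.0, 2.0, 0.5

def saturation(phi):
    """van Genuchten saturation Theta(phi) = (1 + (alpha |phi|)^n)^((1 - n) / n)."""
    return (1.0 + (ALPHA * abs(phi)) ** N_VG) ** ((1.0 - N_VG) / N_VG)

def water_content(phi):
    """theta(phi) = theta_r + Theta(phi) * (theta_s - theta_r)."""
    return THETA_R + saturation(phi) * (THETA_S - THETA_R)

def relative_conductivity(phi):
    """Mualem / van Genuchten relative conductivity kappa(phi)."""
    s = 1.0 + (ALPHA * abs(phi)) ** N_VG
    m = (1.0 - N_VG) / N_VG
    return s ** (TAU * m) * (1.0 - (ALPHA * abs(phi)) ** (N_VG - 1.0) * s ** m) ** 2
```

At $\phi = 0$ the medium is fully saturated, so the saturation and the relative conductivity both equal one and the water content equals $\theta_s$; both quantities decrease as the head becomes more negative.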


4 Conclusion

In this work, we have presented the nonlinear preconditioner known as Restricted Additive Schwarz Preconditioned Exact Newton (RASPEN) using a different restriction (extension) operator. This approach has been employed to solve the nonlinear systems obtained from Discontinuous Galerkin (DG) discretisation. RASPEN has been generically implemented in DUNE and has been tested on several problems. The results for the p-Laplace problem show that RASPEN is more robust for both FE and DG schemes and, as a result, can solve the problem for higher values of p. Moreover, it reduces the number of outer Newton iterations compared to standard Newton. We have further shown that it is also capable of solving the instationary nonlinear problem arising from Richards' equation. The results for vertical flow in a porous medium illustrate that RASPEN also reduces the number of outer Newton iterations there. Moreover, we have varied the overlap size δ, to which the convergence behavior did not seem sensitive. While we have shown the robustness of RASPEN and have seen that it takes fewer iterations, the computational time aspect remains to be investigated in the future by adding a coarse grid as proposed in [10, 15].

Acknowledgements We would like to thank the anonymous reviewers for the helpful comments and suggestions, which greatly improved the quality of this manuscript.

References

1. Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16, 1–3 (1966)
2. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klöfkorn, R., Kornhuber, R., Ohlberger, M., Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE. Computing 82(2), 121–138 (2008)
3. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klöfkorn, R., Ohlberger, M., Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework. Computing 82(2), 103–119 (2008)
4. Bordeu, F., Boucard, P.A., Gosselet, P.: Balancing domain decomposition with nonlinear relocalization: parallel implementation for laminates. In: Topping, B.H.V., Iványi, P. (eds.) First International Conference on Parallel, Distributed and Grid Computing for Engineering. CivilComp Press, Stirlingshire, UK (2009)
5. Cai, X.C., Keyes, D.: Nonlinearly preconditioned inexact Newton algorithms. SIAM J. Sci. Comput. 24(1), 183–200 (2002)
6. Cai, X.C., Gropp, W.D., Keyes, D.E., Tidriri, M.D.: Newton-Krylov-Schwarz Methods in CFD, pp. 17–30. Vieweg+Teubner Verlag, Wiesbaden (1994)
7. Cai, X.C., Gropp, W., Keyes, D., Melvin, R., Young, D.: Parallel Newton-Krylov-Schwarz algorithms for the transonic full potential equation. SIAM J. Sci. Comput. 19(1), 246–265 (1998)
8. Cai, X.C., Keyes, D.E., Young, D.P.: A nonlinear additive Schwarz preconditioned inexact Newton method for shocked duct flow. In: Proceedings of the 13th International Conference on Domain Decomposition Methods, pp. 343–350 (2001)
9. Cai, X.C., Keyes, D.E., Marcinkowski, L.: Non-linear additive Schwarz preconditioners and application in computational fluid dynamics. Int. J. Numer. Meth. Fluids 40(12), 1463–1470 (2002)
10. Dolean, V., Gander, M., Kheriji, W., Kwok, F., Masson, R.: Nonlinear preconditioning: how to use a nonlinear Schwarz method to precondition Newton's method. SIAM J. Sci. Comput. 38(6), A3357–A3380 (2016)
11. van Genuchten, M.T.: A closed-form equation for predicting the hydraulic conductivity of unsaturated soils. Soil Sci. Soc. Am. 44(5), 892–898 (1980)
12. Hwang, F.N., Cai, X.C.: A parallel nonlinear additive Schwarz preconditioned inexact Newton algorithm for incompressible Navier-Stokes equations. J. Comput. Phys. 204(2), 666–691 (2005)
13. Klawonn, A., Lanser, M., Rheinbach, O.: Nonlinear FETI-DP and BDDC methods. SIAM J. Sci. Comput. 36(2), A737–A765 (2014)
14. Klein, O.: Dune-randomfield (2019). https://gitlab.dune-project.org/oklein/dune-randomfield
15. Marcinkowski, L., Cai, X.C.: Parallel performance of some two-level ASPIN algorithms. In: Barth, T.J., Griebel, M., Keyes, D.E., Nieminen, R.M., Roose, D., Schlick, T., Kornhuber, R., Hoppe, R., Périaux, J., Pironneau, O., Widlund, O., Xu, J. (eds.) Domain Decomposition Methods in Science and Engineering, pp. 639–646. Springer, Berlin (2005)
16. Mualem, Y.: A new model for predicting the hydraulic conductivity of unsaturated porous media. Water Resour. Res. 12(3), 513–522 (1976)
17. Richards, L.A.: Capillary conduction of liquids through porous mediums. Physics 1(5), 318–333 (1931)
18. Sorensen, D.: Newton's method with a model trust region modification. SIAM J. Numer. Anal. 19(2), 409–426 (1982)

On the Development of Batch Stable Inverse Indirect Adaptive Control of Systems with Unstable Discrete-Time Inverse

Bowen Wang and Richard W. Longman

Abstract Indirect discrete time adaptive control creates a control law that promises to converge to zero tracking error for any commanded output. A major limitation is that the theory only guarantees convergence if the system inverse is asymptotically stable. Because it is necessarily discrete time, the continuous time governing equation must be converted to the equivalent difference equation, and this conversion normally introduces zeros. For a majority of desired applications, there are zeros introduced outside the unit circle, producing instability of the inverse. The basic indirect adaptive control relies on the one-step ahead control law followed by the projection algorithm (or a similar algorithm) to update the model based on the data currently available. This paper presents generalized versions of these building blocks that address the unstable inverse issue. The one-step ahead control is replaced by a batch stable inverse control law updating p steps with each batch. A new batch inverse theory for discrete time systems is used. The value of p is chosen to produce stability of the inverse control action produced in the sequence of batch updates. The needed computations to form the projection algorithm in this new formulation are presented. This paper presents an approach to obtaining a stable discrete time adaptive control theory for discrete time systems with unstable inverse; it remains to develop proofs of convergence.

B. Wang
Mechanical Engineering Department, Columbia University, 500 West 120th St., New York, NY 10027, USA
e-mail: [email protected]

R. W. Longman (B)
Mechanical Engineering Department, Columbia University, MC4703, 500 West 120th St., New York, NY 10027, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2021
H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_16


1 Introduction

In classical control systems, the command to the system is the desired output. The actual output is a solution of the governing differential equation, containing the command in the forcing function. Except under very unusual circumstances, the particular solution output is not equal to the command. In addition, one must expect error in the model used in the design. Indirect adaptive control seeks to address both of these sources of error [1, 2]. It uses the current model and asks for zero error in the next step. It also makes use of the most recent input/output data at each step to update the model. This approach promises to produce zero tracking error after convergence, following whatever command is given to the control system. Unfortunately, convergence requires that the inverse of the system transfer function be asymptotically stable. This condition is not satisfied in a majority of systems one would like to control [3]. The adaptive updates need a discrete time model. When a differential equation is fed by a zero order hold, one can replace the differential equation by a difference equation without any approximation, i.e. it produces the same output as the differential equation at each time step. The process of converting to a difference equation introduces zeros into the transfer function, and for a reasonable sample rate and at least 3 more poles than zeros in the original differential equation, there will be zeros introduced that are outside the unit circle, making the digital system inverse unstable. Digital adaptive control has two basic building blocks: the one-step ahead control algorithm, and the real time projection algorithm for model updates (or other forms of model updates such as least squares).
The purpose of this paper is to present modified versions that allow us to eliminate the requirement of a stable inverse, so that digital adaptive control can be applied to systems whose inverse contains zeros outside the unit circle, which is the case for the majority of systems that one might want to control. Existing methods that address this problem include the use of an adaptive version of Linear Model Predictive Control (LMPC), which avoids the unstable inverse issue by applying a penalty on control effort, with the result that one does not achieve zero tracking error. A different approach uses Generalized Sampled-data Hold Functions (GSHF) [4–7]. One choice for the GSHFs is to introduce several zero order holds between each of the original time steps [6]. We make use of one-step ahead control generalized to a batch stable inverse control in [8] (where it is termed p-step ahead control) that takes advantage of new stable inverse theory for such systems [9–11], and make appropriate adjustments to the projection algorithm. The term batch stable inverse indicates that one is given a finite time desired output history, and asks for the corresponding input history to produce zero error for the specified output time steps. The previously existing batch stable inverse theory [12] for such unstable discrete time systems requires pre- and post-actuation from minus and plus infinity. Truncation of these compromises the zero tracking error objective in applications, and makes it difficult to apply in a batch stable inverse update mode, which is needed to replace the one step ahead control.


For simplicity, we consider deterministic single-input single-output systems, without disturbances, and consider that the adaptive control is updating the zero order hold input to a continuous time system, e.g. a feedback control system that is stable and with minimum phase. But the discrete time equivalent model has a zero or zeros outside the unit circle. This implies a time delay through the system of one time step. And we consider model updates made using the projection algorithm.

2 Zeros of Discretized Systems

For illustration purposes, we consider a third order differential equation with no zeros in continuous time, fed by a zero order hold, that produces a zero outside the unit circle in discrete time. This is sufficient to fully exhibit the phenomena that we seek to address. The continuous time system and its Laplace transfer function are

$$\frac{d^3 y(t)}{dt^3} + \alpha_1 \frac{d^2 y(t)}{dt^2} + \alpha_2 \frac{dy(t)}{dt} + \alpha_3 y(t) = \beta_1 u(t)$$

$$Y(s) = \frac{\beta_1}{s^3 + \alpha_1 s^2 + \alpha_2 s + \alpha_3} U(s) = G(s) U(s) \tag{1}$$

When the input comes through a zero-order hold, this can be replaced by a difference equation with identical solution at the sample times, and associated z-transfer function

$$y(k+1) + a_1 y(k) + a_2 y(k-1) + a_3 y(k-2) = b_1 u(k) + b_2 u(k-1) + b_3 u(k-2) \tag{2}$$

$$Y(z) = \frac{b_1 z^2 + b_2 z + b_3}{z^3 + a_1 z^2 + a_2 z + a_3} U(z) = G(z) U(z) \tag{3}$$

by using the conversion

$$G(z) = (1 - z^{-1})\, Z[G(s)/s] \tag{4}$$

where $Z$ indicates the z-transform of the sampled unit step response specified in the square bracket. One way to view the inverse problem is to write Eq. (2) for each time step and then substitute the desired output $y^*(k)$ into the left hand side. This produces a known forcing function $f(k)$ for the resulting difference equation, and the input history needed to produce this desired output history is the solution of the nonhomogeneous difference equation $b_1 u(k) + b_2 u(k-1) + b_3 u(k-2) = f(k)$ [13]. The associated characteristic polynomial $b_1 z^2 + b_2 z + b_3 = 0$ has two roots $z_1, z_2$, producing the general solution of the equation in the form


$$u(k) = c_1 (z_1)^k + c_2 (z_2)^k + u_p(k) \tag{5}$$

where $u_p(k)$ is a particular solution and $c_1$ and $c_2$ are determined by initial conditions. The polynomial is likely to include roots outside the unit circle. In this case the input needed to produce the desired output is unstable and grows without bound with the time step $k$, making the solution for zero tracking error useless. One might seek to adjust the initial conditions such that the coefficient $c_1$ or $c_2$ of the root outside the unit circle is zero in order to obtain a stable inverse control action. This is not practical, since one will not get the coefficient to be identically zero. The zeros introduced during the discretization process are a function of the pole excess in the differential equation, i.e. the number of poles minus the number of zeros [3]. The number of extra zeros introduced is the number needed to produce one more pole than zeros in the discrete time transfer function, so that there is a one time step delay from a change in the zero-order hold input until one first observes a resulting change in the sampled output. Thus, for the third order differential equation above with no zeros, there will be two zeros introduced. The asymptotic locations, as the sample time tends to zero, of the zeros introduced outside the unit circle are [3]: (1) Third order pole excess, −3.732. (2) Fourth order, −9.899. (3) Fifth order, −23.204 and −2.323. (4) Sixth order, −51.218 and −4.542. Clearly, the solution in Eq. (5) can be very unstable, e.g. involving these numbers to the power of the time step. It is this instability of the inverse problem which we seek to eliminate, so that indirect adaptive control can have stable control inputs that produce zero tracking error at addressed time steps.
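The asymptotic zero locations quoted above can be reproduced numerically. The sketch below makes an assumption not stated in the text: it uses a pure chain of $r$ integrators, $G(s) = 1/s^r$, as the limiting case of a system with pole excess $r$, because its zero order hold equivalent can be written down exactly from the finite power series of a nilpotent matrix. The function name and the state space realization are illustrative only.

```python
import numpy as np
from math import factorial

def zoh_zeros_of_integrator_chain(r, T=1.0):
    """Zeros of the zero order hold equivalent of G(s) = 1/s^r (illustrative)."""
    A = np.eye(r, k=1)                         # nilpotent shift matrix: x_i' = x_{i+1}
    # exp(A T) is a finite sum because A^r = 0.
    Ad = sum(np.linalg.matrix_power(A, j) * T**j / factorial(j) for j in range(r))
    # Bd = integral of exp(A s) B ds over [0, T] with B = e_r.
    Bd = np.array([T**(r - i) / factorial(r - i) for i in range(r)])
    C = np.zeros(r)
    C[0] = 1.0                                 # output y = x_1
    # Numerator of C (zI - Ad)^{-1} Bd: det(zI - Ad + Bd C) - det(zI - Ad).
    num = np.poly(Ad - np.outer(Bd, C)) - np.poly(Ad)
    return np.sort(np.roots(num[1:]))          # drop the vanishing leading coefficient

zeros_excess_3 = zoh_zeros_of_integrator_chain(3)
zeros_excess_4 = zoh_zeros_of_integrator_chain(4)
```

For pole excess three, the most negative zero returned is close to −3.732, and its reciprocal, about −0.268, is the second zero, which lies inside the unit circle.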

3 Overview of the Objective—Using a New Batch Stable Inverse to Address the Discrete Indirect Adaptive Control Inverse Problem The digital indirect adaptive control of Refs. [1, 2] is a real time algorithm that recursively updates a model and uses it to compute the control input to give zero error in the next time step (for one step delay systems). Reference [12] developed a method to obtain a stable inverse to systems having zeros outside the unit circle, but done as a batch computation of the inverse for a given number of time steps of desired output. In order to get a true stable inverse producing the desired output each time step, the control action at that time needs to be computed using two summations (in the discrete time case), one starting from minus infinity and going to that time, the other starting from plus infinity and computing backwards in time to that time. References [9–11, 13] culminated in new batch stable inverses that do not involve the pre and post actuation from minus and plus infinity. There are two particularly useful versions. (1) The first asks for zero error at all time steps in the time interval of interest, except it leaves the first step or steps unspecified, without addressing the error in this or these steps. The number of such steps is equal to the number of zeros


outside the unit circle in the z-transfer function of the system to be inverted. Hence, for Eq. (1) one sacrifices zero error at the first time step. In a system with 5th order pole excess, one sacrifices the first two time steps. The control action and the resulting error at this or these unaddressed time steps is computed by a pseudo inverse of the underspecified equations resulting from not addressing these steps. References refer to this approach as “initial delete”. (2) The second version applied to Eq. (1) with pole excess of 3 and one zero outside the unit circle in discrete time, says to double the sample rate, but ask for zero error at the original sample times only. Using the new underspecified equations, compute the control history from a pseudo inverse producing the control that minimizes the Euclidian norm of the control history. We refer to this batch stable inverse as the skip step approach. The number of time steps with zero order hold updates between each of the addressed (original) time steps is equal to the number of zeros outside the unit circle. For Eq. (1) there are two zero order holds between the time steps for which zero error is requested. We comment that this could be referred to as a generalized hold. The purpose of this paper is to perform a preliminary examination of how one might use the new batch stable inverse results to address as closely as possible the discrete time indirect adaptive control objective of repeated model updates followed by model inverses in real time. The concept involves repeated batch updates of the control, and by using the skip step stable inverse, all time steps are treated similarly. For Eq. (1) all time steps where zero error is obtained, there are two zero order holds preceding it to be used to obtain zero error at that step. Using the initial delete method would make the first time step of each batch update be different from all the rest. 
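The skip step idea can be sketched with a small lower triangular Toeplitz system. Everything specific in the following block is an assumption made for illustration: the coefficients are a hypothetical fast rate model of the form of Eq. (2) with a zero at −3.732, the horizon is ten fast steps, and the desired output is a unit step. The sketch requests zero error at every second output step only and takes the minimum Euclidean norm input from the pseudo inverse; it is not the authors' implementation.

```python
import numpy as np

# Hypothetical fast-rate model in the form of Eq. (2).
a = np.array([-1.8, 1.07, -0.21])   # poles at 0.5, 0.6, 0.7
b = np.array([1.0, 4.0, 1.0])       # zeros at -3.732 and -0.268

N = 10                               # number of fast-rate time steps in the batch

# Unit pulse response h[i] = y(i + 1) for u = (1, 0, 0, ...), zero initial state.
y = np.zeros(N + 1)
u_pulse = np.zeros(N)
u_pulse[0] = 1.0
for k in range(N):
    acc = 0.0
    for j in range(3):
        if k - j >= 0:
            acc += b[j] * u_pulse[k - j] - a[j] * y[k - j]
    y[k + 1] = acc
h = y[1:]

# Lower triangular Toeplitz map from inputs u(0..N-1) to outputs y(1..N).
P = np.array([[h[i - j] if i >= j else 0.0 for j in range(N)] for i in range(N)])

# Full inverse: zero error at every fast-rate step (unstable input history).
u_full = np.linalg.solve(P, np.ones(N))

# Skip step: keep only every second output equation, minimum norm solution.
P_skip = P[1::2]
y_des = np.ones(P_skip.shape[0])
u_skip = np.linalg.pinv(P_skip) @ y_des
```

By construction the skip step input reproduces the desired output at the addressed steps, and its Euclidean norm cannot exceed that of the full inverse, since the full inverse also satisfies the retained equations.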
Consider in a little more detail what is to be done as time progresses. The user is required to say what command is to be executed for some number p  time steps into the future. Then the batch stable inverse skip step solution computes the control to use for each zero order hold in the batch, i.e. control actions at 2 p  zero order holds (3rd order pole excess). With a correct model, this gives zero error at all p  addressed time steps in the batch. Then given the desired output for the next p  time steps the process is repeated, resulting in p  more steps with zero tracking error. Zero error is obtained at every original time step for all batches. In the past we have referred to this as p-step ahead control, modifying the name one step ahead control [8]. However, this can easily be misinterpreted to suggest that one achieves zero error only after p time steps of zero order hold. This is not what is being done, there is zero error at every original time step. To prevent this possible confusion, and to be more explicit, this approach is now referred to as batch stable inverse control. Note that this can be considered a new form of Model Predictive Control (MPC), one that offers the ability to have zero tracking error. The usual linear MPC penalizes the control action to avoid the unstable inverse issue, it is not trying to obtain zero tracking error. The batch stable inverse control used here allows MPC to in theory produce zero tracking error immediately and for all future time steps. To make it practical, when the control action needed to get to zero error in the next step is too extreme, one can introduce a control penalty to apply making it disappear as the inverse control solution becomes achievable [8].


Consider how the approach outlined here relates to the use of Generalized Sampled-data Hold Functions in Refs. [4–7]. The generalized hold can be chosen as in [6] to be a set of time steps with zero order holds inserted between addressed time steps. One can pick the zero locations using the multiple zero order holds, and by placing them inside the unit circle the transfer function has a stable inverse. This produces a real time transfer function that can be used in discrete time indirect adaptive control. An issue is that the behavior at the time steps introduced for the generalized hold may have bad properties. In the approach created here, the generalized hold having two holds between each addressed step for Eq. (1) is picked to stabilize the batch update, not the inverse of the discrete time transfer function as above. This does not require as many holds to be introduced. Each batch has a stable inverse, but when the batch update is applied repeatedly, the initial condition for the stable inverse of the next batch is determined by measurements at the end of the previous batch. Therefore, although there is zero error at every time step, it is possible that the control action increases going from one batch to the next. It is the repeated batch process that must be stable in order to make a stable inverse for all time. This stability is obtained by picking a sufficient number of time steps in a batch. It is established that there always exists a number big enough to produce stability. Numerical examples show that as few as 3 addressed time steps can be sufficient. Thus, by asking the user to specify the desired output three steps into the future every three steps, zero tracking error is achieved for all addressed steps, and a stable control action is guaranteed for all future time.
The multiple zero order hold version of a generalized hold that is used here stabilizes the batch inverse, and it is the choice of the number p of time steps per batch that determines stability of the batch update process. The generalized hold is being used differently than in [6], where it must stabilize the full real time process. In both cases there is an issue of the error involved at the intermediate time steps associated with the new zero order holds. There are fewer intermediate time steps needed in the approach here, and the control action is being chosen to minimize the Euclidean norm of the control instead of being used for more complex and probably more demanding zero placement. Numerical simulations suggest that while the errors at addressed time steps are at a numerical zero level, e.g. $10^{-14}$ or $10^{-15}$, the error at the in-between time steps can be several orders of magnitude smaller than the command size. Eventually, a comparison needs to be made, but these considerations suggest that the simple generalized holds in the current method should have significantly fewer issues with error at unaddressed time steps.

On the Development of Batch Stable Inverse Indirect Adaptive Control of Systems …

4 One Step Ahead Control, Batch Inverse, and Stable Batch Inverse

4.1 One Step Ahead Control

The one-step ahead control is obtained from Eq. (2). After receiving measurement $y(k)$, pick the input $u(k)$ to make the output $y(k+1)$ equal the desired output $y^*(k+1)$ using

$$u(k) = \left[ y^*(k+1) + a_1 y(k) + a_2 y(k-1) + a_3 y(k-2) - b_2 u(k-1) - b_3 u(k-2) \right] / b_1 \tag{6}$$

At step k all terms on the right hand side are known.
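As a rough illustration of Eq. (6), the sketch below applies one step ahead control to a third order ARX model. The coefficient values, the helper names and the zero initial conditions are assumptions made for this example only (the numerator is given one zero at -3.3, outside the unit circle); they are not taken from the paper.

```python
import numpy as np

# Illustrative ARX coefficients (assumed, not from the paper):
# poles at 0.5, 0.6, 0.7 and zeros at -3.3 and -0.25.
A_COEF = (-1.8, 1.07, -0.21)        # a1, a2, a3
B_COEF = (0.01, 0.0355, 0.00825)    # b1, b2, b3

def one_step_ahead_input(y_des_next, y_hist, u_hist, a=A_COEF, b=B_COEF):
    """Eq. (6). y_hist = [y(k), y(k-1), y(k-2)], u_hist = [u(k-1), u(k-2)]."""
    return (y_des_next + a[0]*y_hist[0] + a[1]*y_hist[1] + a[2]*y_hist[2]
            - b[1]*u_hist[0] - b[2]*u_hist[1]) / b[0]

def simulate_one_step_ahead(y_des, a=A_COEF, b=B_COEF):
    """Track y_des[k] = y*(k+1) from zero initial conditions; return (y, u)."""
    n = len(y_des)
    y = np.zeros(n + 3)   # y[i] stores y(i - 2)
    u = np.zeros(n + 2)   # u[i] stores u(i - 2)
    for k in range(n):
        i = k + 2
        u[i] = one_step_ahead_input(y_des[k], [y[i], y[i-1], y[i-2]],
                                    [u[i-1], u[i-2]], a, b)
        # Plant update: the ARX difference equation solved for y(k+1).
        y[i+1] = (-a[0]*y[i] - a[1]*y[i-1] - a[2]*y[i-2]
                  + b[0]*u[i] + b[1]*u[i-1] + b[2]*u[i-2])
    return y[3:], u[2:]

y_des = np.ones(10)
y, u = simulate_one_step_ahead(y_des)
print(np.max(np.abs(y - y_des)))   # tracking error at the sample times
print(np.abs(u))                   # control magnitude at each step
```

With these assumed coefficients the tracking error stays at round-off level while the control magnitude grows from step to step, which is the unstable inverse behavior that the batch stable inverse is intended to avoid.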

4.2 Batch Inverse Control

To develop the batch inverse, convert Eq. (2) to state variable form. Create the altered difference equation

$$\bar{y}(k+1) + a_1 \bar{y}(k) + a_2 \bar{y}(k-1) + a_3 \bar{y}(k-2) = u(k) \tag{7}$$

and then superposition says the original output is given in terms of the defined state variables as

$$y(k) = b_1 \bar{y}(k) + b_2 \bar{y}(k-1) + b_3 \bar{y}(k-2) = C x(k) \tag{8}$$

$$x(k) = [\bar{y}(k-2) \;\; \bar{y}(k-1) \;\; \bar{y}(k)]^T, \qquad C = [b_3 \;\; b_2 \;\; b_1]$$

and the state variable difference equation is

$$x(k+1) = A x(k) + B u(k), \qquad y(k) = C x(k), \qquad A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ -a_3 & -a_2 & -a_1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \tag{9}$$

Of course, by taking z-transforms, one can retrieve the z-transform of the original scalar difference equations with the locations of the associated zeros using $Y(z) = [C(zI - A)^{-1}B]\,U(z)$.
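The state variable form can be checked numerically. The following sketch builds A, B and C from Eqs. (8) and (9) and compares $C(zI - A)^{-1}B$ with the ratio of the ARX polynomials at a test point. The coefficient values and function names are illustrative assumptions.

```python
import numpy as np

def arx_to_state_space(a, b):
    """Matrices A, B of Eq. (9) and the output row C of Eq. (8)."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [-a3, -a2, -a1]])
    B = np.array([0.0, 0.0, 1.0])
    C = np.array([b3, b2, b1])
    return A, B, C

def transfer_function(A, B, C, z):
    """Evaluate C (zI - A)^{-1} B at a complex point z."""
    return C @ np.linalg.solve(z * np.eye(A.shape[0]) - A, B)

a = (-1.8, 1.07, -0.21)       # assumed a1, a2, a3
b = (0.01, 0.0355, 0.00825)   # assumed b1, b2, b3
A, B, C = arx_to_state_space(a, b)

z = 2.0 + 0.5j
ratio = np.polyval([b[0], b[1], b[2]], z) / np.polyval([1.0, a[0], a[1], a[2]], z)
print(abs(transfer_function(A, B, C, z) - ratio))   # difference of the two forms
print(np.roots(b))                                   # zeros of the z-transfer function
```

The zeros are the roots of $b_1 z^2 + b_2 z + b_3$, so a root of magnitude larger than one signals an unstable inverse.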


Define vectors for the history of the inputs, history of the outputs, and history of the desired output

$$u = [u(0) \;\; u(1) \;\; \cdots \;\; u(p-1)]^T, \qquad y = [y(1) \;\; y(2) \;\; \cdots \;\; y(p)]^T, \qquad y^* = [y^*(1) \;\; y^*(2) \;\; \cdots \;\; y^*(p)]^T \tag{10}$$

Propagating the solution to the state space equation for p time steps produces

$$y = Pu + \bar{A}x(0), \qquad P = \begin{bmatrix} CB & 0 & \cdots & 0 \\ CAB & CB & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ CA^{p-1}B & CA^{p-2}B & \cdots & CB \end{bmatrix}, \qquad \bar{A} = \begin{bmatrix} CA \\ CA^2 \\ \vdots \\ CA^p \end{bmatrix} \tag{11}$$

The matrix P is a finite time version of the z-transfer function. The solution $u^*$ for the control history producing the desired output at all p time steps is then

$$Pu = y^* - \bar{A}x(0) \tag{12}$$

$$u^* = P^{-1}(y^* - \bar{A}x(0)) \tag{13}$$
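A minimal sketch of Eqs. (11) to (13) follows, assuming a third order ARX model with made-up coefficients that place one zero at -3.3. It builds P and $\bar{A}$, solves for the batch inverse control, and prints the smallest singular value of P for a few batch sizes so that its decay with p can be inspected. Function names are hypothetical.

```python
import numpy as np

def batch_matrices(a, b, p):
    """Build P and A-bar of Eq. (11) for a third order ARX model."""
    A = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-a[2], -a[1], -a[0]]])
    B = np.array([0.0, 0.0, 1.0])
    C = np.array([b[2], b[1], b[0]])
    powers = [np.linalg.matrix_power(A, i) for i in range(p + 1)]
    markov = [C @ powers[i] @ B for i in range(p)]        # C A^i B
    P = np.zeros((p, p))
    for row in range(p):
        for col in range(row + 1):
            P[row, col] = markov[row - col]               # lower triangular Toeplitz
    A_bar = np.vstack([C @ powers[i] for i in range(1, p + 1)])
    return P, A_bar

def batch_inverse(a, b, y_des, x0):
    """Eq. (13): control history u(0..p-1) that gives y(1..p) = y_des."""
    P, A_bar = batch_matrices(a, b, len(y_des))
    return np.linalg.solve(P, y_des - A_bar @ x0)

a = (-1.8, 1.07, -0.21)       # assumed a1, a2, a3
b = (0.01, 0.0355, 0.00825)   # assumed b1, b2, b3 (one zero at -3.3)
for p in (4, 8, 12):
    P, _ = batch_matrices(a, b, p)
    print(p, np.linalg.svd(P, compute_uv=False)[-1])      # smallest singular value
```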

4.3 Stable Batch Inverse Control

From Eqs. (2) or (3) it is obvious that the difference equation has an unstable inverse if $b_1 z^2 + b_2 z + b_3 = 0$ has a root outside the unit circle. Then the solution of the homogeneous equation given in Eq. (5) contains an unstable term. This characteristic must also be present in matrix P, in which case $P^{-1}$ must exhibit the unstable behavior. Reference [10] explains how to recognize an unstable inverse from matrix P by examining its singular value decomposition $P = USV^T$. For p large enough, the singular values converge to the magnitude frequency response of Eq. (3) at the frequencies that can be seen in p time steps. However, a zero outside the unit circle introduces another singular value with its own right and left singular vectors, having behavior unrelated to frequency response. The signature in the P matrix corresponding to an unstable inverse system is:

1. A linear decay on a log scale of a singular value as a function of matrix size p.
2. A corresponding pair of input and output singular vectors which have opposite slopes, with the input singular vector component magnitudes growing linearly


with time step on a log scale, and the output singular vector component magnitudes decaying with the same slope.

The smallest singular value shrinks with matrix size by a factor of magnitude $1/|z_1|$ per added time step, where $z_1$ is the zero outside the unit circle. Item 2 can be understood by noting that to correct an error at the beginning of the p-step process given by the output singular vector associated with the singular value in Item 1, the control action needed is given in terms of the input singular vector, and this input grows exponentially in magnitude with time steps. The slope of the magnitudes of the time step entries in the growing singular vector corresponds to the magnitude of $(z_1)^k$, where k is the time step. The slope for the decaying singular vector corresponds to $(1/z_1)^k$. If there is more than one zero outside, it may take some effort, but one can observe the singular values and singular vectors with these properties for each zero outside.

To apply the skip step batch stable inverse, increase the sample rate based on the number of zeros outside the unit circle. In the case of a pole excess of 3 for Eq. (2), double the sample rate. Write Eq. (12) for all time steps at the new sample rate. Define $P_a$ as matrix P with all rows removed that are associated with the new sample times. The remaining rows apply to the sample times that were present at the original sample rate, termed here the addressed time steps. This equation is now underspecified: there are twice as many inputs as specified time steps for which zero tracking error is requested, so there is an infinite number of solutions. Pick the solution that minimizes the Euclidean norm of the control input vector u over all time steps, addressed and unaddressed. Denote this solution as $u^*$. Reference [9] proves that $P_a$ does not contain a singular value that decreases with the number of time steps in the desired trajectory, i.e. the number of addressed time steps. The skip step stable solution is then

$$u^* = P_a^\dagger (y_a^* - \bar{A}_a x(0)) \tag{14}$$

This batch stable inverse solution is referred to in some papers as Longman JiLLL NS. The N indicates that the matrix P is not factored as a product of two matrices, one containing the zero(s) outside and the other containing only zeros and poles inside. The S indicates use of the skip step approach. And the JiLLL is for Xiaoqiang Ji, Te Li, Yao Li, and Peter LeVoci, who contributed to the development [9–11, 13].
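The skip step construction of Eq. (14) can be sketched as follows. The ARX coefficients are assumed values that are treated as the model at the doubled sample rate, every second fast time step is taken as addressed, and the function names are hypothetical. The loop prints the smallest singular value of the full P next to that of $P_a$ so the two trends with batch size can be compared.

```python
import numpy as np

def batch_matrices(a, b, p):
    """P and A-bar of Eq. (11) for a third order ARX model (fast sample rate)."""
    A = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-a[2], -a[1], -a[0]]])
    B = np.array([0.0, 0.0, 1.0])
    C = np.array([b[2], b[1], b[0]])
    powers = [np.linalg.matrix_power(A, i) for i in range(p + 1)]
    P = np.zeros((p, p))
    for row in range(p):
        for col in range(row + 1):
            P[row, col] = C @ powers[row - col] @ B
    A_bar = np.vstack([C @ powers[i] for i in range(1, p + 1)])
    return P, A_bar

def skip_step_inverse(a, b, y_des_addressed, x0):
    """Eq. (14) with one extra zero order hold per addressed time step."""
    p = 2 * len(y_des_addressed)          # fast time steps in the batch
    P, A_bar = batch_matrices(a, b, p)
    addressed = np.arange(1, p, 2)        # rows for y(2), y(4), ..., y(p)
    P_a, A_bar_a = P[addressed], A_bar[addressed]
    # Minimum Euclidean norm solution of the underspecified equations.
    u = np.linalg.pinv(P_a) @ (y_des_addressed - A_bar_a @ x0)
    return u, P, A_bar, addressed

a = (-1.8, 1.07, -0.21)       # assumed fast-rate a1, a2, a3
b = (0.01, 0.0355, 0.00825)   # assumed fast-rate b1, b2, b3 (one zero at -3.3)
for p in (8, 16, 24):
    P, _ = batch_matrices(a, b, p)
    s_full = np.linalg.svd(P, compute_uv=False)[-1]
    s_skip = np.linalg.svd(P[1::2], compute_uv=False)[-1]
    print(p, s_full, s_skip)
```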

5 Recursive Batch Stable Inverse Control

Following [8], this section develops the recursive batch stable inverse control. With one step ahead control the user can pick any desired output for the next time step, as one does in routine feedback control. Here the user picks the desired output for the next few steps, p steps at the original sample rate, meaning that application to Eq. (2) produces zero error at these time steps. One would like to have this number as small as possible. In a numerical example given later, the minimum number for


stability of the recursive batch update process is 3 addressed time steps. Perhaps this is a small price to pay for a stable inverse. This section develops how to determine the size of p needed to establish stability.

5.1 Creating a Model Predictive Control Model (MPC) from an ARX Model

Convert the ARX nth order difference equation model to the standard model for MPC. For simplicity we illustrate the process for the third order system of Eq. (2). Define future and past vectors according to

$$u_F(k) = \begin{bmatrix} u(k) \\ u(k+1) \\ u(k+2) \end{bmatrix}, \quad y_F(k) = \begin{bmatrix} y(k+1) \\ y(k+2) \\ y(k+3) \end{bmatrix}, \quad u_P(k) = \begin{bmatrix} u(k-3) \\ u(k-2) \\ u(k-1) \end{bmatrix}, \quad y_P(k) = \begin{bmatrix} y(k-2) \\ y(k-1) \\ y(k) \end{bmatrix} \tag{15}$$

Writing a sequence of equations from Eq. (2) for different time steps gives

$$\begin{bmatrix} 1 & 0 & 0 \\ a_1 & 1 & 0 \\ a_2 & a_1 & 1 \end{bmatrix} y_F(k) = \begin{bmatrix} 0 & b_3 & b_2 \\ 0 & 0 & b_3 \\ 0 & 0 & 0 \end{bmatrix} u_P(k) - \begin{bmatrix} a_3 & a_2 & a_1 \\ 0 & a_3 & a_2 \\ 0 & 0 & a_3 \end{bmatrix} y_P(k) + \begin{bmatrix} b_1 & 0 & 0 \\ b_2 & b_1 & 0 \\ b_3 & b_2 & b_1 \end{bmatrix} u_F(k) \tag{16}$$

Multiplying both sides by the inverse of the coefficient matrix on the left results in the following form

$$y_F(k) = P_1 u_P(k) - P_2 y_P(k) + P u_F(k) \tag{17}$$

Note that the matrix P is in fact the same P matrix as in the previous section. To make use of this for a p-step ahead process, we pick the future vectors to contain p future inputs and outputs, and also pick the past vectors to contain the same number of entries. This is convenient mathematically. There will be many zero entries in the past coefficient matrices because one only needs the number of past entries that correspond to the initial conditions of the original ARX difference equation. At each update, inputs for the next p time steps are chosen, so vector updates are made at sample times $k = pj$, where j counts the number of updates. Then the above MPC model can be rewritten for recursive use as

$$y(p(j+1)) = P_1 u(pj) - P_2 y(pj) + P u(p(j+1)) \tag{18}$$
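The step from Eq. (16) to Eq. (17) can be illustrated for the third order case. The sketch below forms $P_1$, $P_2$ and $P$ by solving with the left coefficient matrix and compares the predicted future outputs with a direct simulation of the ARX model. The coefficients, the random test input and the helper names are assumptions for the example.

```python
import numpy as np

def mpc_matrices(a, b):
    """P1, P2, P of Eq. (17) from the coefficient matrices of Eq. (16)."""
    a1, a2, a3 = a
    b1, b2, b3 = b
    lhs = np.array([[1, 0, 0], [a1, 1, 0], [a2, a1, 1]], dtype=float)
    past_u = np.array([[0, b3, b2], [0, 0, b3], [0, 0, 0]], dtype=float)
    past_y = np.array([[a3, a2, a1], [0, a3, a2], [0, 0, a3]], dtype=float)
    future_u = np.array([[b1, 0, 0], [b2, b1, 0], [b3, b2, b1]], dtype=float)
    return (np.linalg.solve(lhs, past_u), np.linalg.solve(lhs, past_y),
            np.linalg.solve(lhs, future_u))

def simulate_arx(a, b, u):
    """Return [y(0), ..., y(len(u))] from zero initial conditions."""
    y = np.zeros(len(u) + 3)                  # y[i] stores y(i - 2)
    up = np.concatenate([np.zeros(2), u])     # up[i] stores u(i - 2)
    for k in range(len(u)):
        i = k + 2
        y[i+1] = (-a[0]*y[i] - a[1]*y[i-1] - a[2]*y[i-2]
                  + b[0]*up[i] + b[1]*up[i-1] + b[2]*up[i-2])
    return y[2:]

a = (-1.8, 1.07, -0.21)       # assumed a1, a2, a3
b = (0.01, 0.0355, 0.00825)   # assumed b1, b2, b3
P1, P2, P = mpc_matrices(a, b)
u = np.random.default_rng(0).standard_normal(6)
y = simulate_arx(a, b, u)
k = 3
y_future = P1 @ u[k-3:k] - P2 @ y[k-2:k+1] + P @ u[k:k+3]
print(np.max(np.abs(y_future - y[k+1:k+4])))   # mismatch between Eq. (17) and simulation
```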


5.2 Separating the Addressed Time Steps from the Unaddressed

Let us agree to collect all time steps of the output that are to be addressed into vector $y_a$ and all time steps associated with the introduced sample times into the vector $y_u$ of unaddressed time steps. The term “addressed” means that we are seeking zero tracking error for these time steps. Then we can rewrite Eq. (18) in terms of partitioned matrices as

$$\begin{bmatrix} y_a(p(j+1)) \\ y_u(p(j+1)) \end{bmatrix} = \begin{bmatrix} P_{1a} \\ P_{1u} \end{bmatrix} u(pj) - \begin{bmatrix} P_{2aa} & P_{2au} \\ P_{2ua} & P_{2uu} \end{bmatrix} \begin{bmatrix} y_a(pj) \\ y_u(pj) \end{bmatrix} + \begin{bmatrix} P_a \\ P_u \end{bmatrix} u(p(j+1))$$

The separate equations for addressed and unaddressed are then

$$\begin{aligned} y_a(p(j+1)) &= P_{1a} u(pj) - P_{2aa} y_a(pj) - P_{2au} y_u(pj) + P_a u(p(j+1)) \\ y_u(p(j+1)) &= P_{1u} u(pj) - P_{2ua} y_a(pj) - P_{2uu} y_u(pj) + P_u u(p(j+1)) \end{aligned} \tag{19}$$

5.3 The Recursive Update for Update j + 1

To get the p time step stable inverse update, eliminate the unaddressed time steps in Eq. (18), and set $y_a(p(j+1))$ equal to the desired addressed output in the next period $y_a^*(p(j+1))$. Then solve the underspecified equations using the Moore-Penrose pseudoinverse

$$u(p(j+1)) = P_a^\dagger \left[ y_a^*(p(j+1)) - P_{1a} u(pj) + P_{2aa} y_a(pj) + P_{2au} y_u(pj) \right] \tag{20}$$

This is the control action needed during update period j + 1 to produce the desired output at the addressed steps $y_a^*(p(j+1))$.

5.4 The Dynamics from Batch to Batch

Equations (19) and (20) form a set of three coupled vector equations relating the quantities in period j to those in period j + 1. Substituting (20) into (19) and packaging the result gives

$$\begin{bmatrix} y_a(p(j+1)) \\ y_u(p(j+1)) \\ u(p(j+1)) \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ P_u P_a^\dagger P_{2aa} - P_{2ua} & P_u P_a^\dagger P_{2au} - P_{2uu} & P_{1u} - P_u P_a^\dagger P_{1a} \\ P_a^\dagger P_{2aa} & P_a^\dagger P_{2au} & -P_a^\dagger P_{1a} \end{bmatrix} \begin{bmatrix} y_a(pj) \\ y_u(pj) \\ u(pj) \end{bmatrix} + \begin{bmatrix} I \\ P_u P_a^\dagger \\ P_a^\dagger \end{bmatrix} y_a^*(p(j+1)) \tag{21}$$

For the case of one zero outside the unit circle, the first partition vector on the left for the addressed steps is of dimension p/2, as is the second partition vector for unaddressed time steps, while the third partition vector for the control has dimension p, for a total dimension of 2p, where p counts time steps at the fast sample rate. In the recursive batch stable inverse control, the time steps in the current period are needed as initial conditions for the next period, and they keep being changed every update, coupling the equations from period to period. Although the error at the addressed time steps is always zero, the output at the unaddressed time steps can exhibit transients from batch to batch. In order for the recursive batch update to be stable, it is necessary that the square system matrix in Eq. (21) have all eigenvalues less than one in magnitude. How much less than one determines how fast the behavior at the unaddressed time steps converges with the batch updates. One can use this stability condition to determine if, based on one’s current model, the transients of the recursive updates will go to zero as updates j progress. One can also use this to determine what the value of p needs to be to produce stability. Examine the first row partition of the equation, which has 3 zero entries, to see that the converged solution has $y_a(p(j+1)) = y_a^*(p(j+1))$, producing zero error at every addressed time step in the batch update, for all updates, after the true system model has been identified. (As with other forms of adaptive control, this situation may be reached under some conditions even when the model identified is not equal to the true model.) We comment that we know that there exists a value of p that creates a stable dynamic process: if one samples so slowly that the system reaches steady state response to the zero order hold input by the end of the time step, then all initial condition terms become negligible.
Matrices $P_1$, $P_2$ that contain the initial conditions for the next batch can be set to zero, and the second and third row partitions become zero as well as the first row partition, producing all zero eigenvalues.

One can study the effect of model error. The control computation in Eq. (20) is made using the current model. Denote the coefficients in this case by introducing a hat on top of symbols

$$u(p(j+1)) = \hat{P}_a^\dagger \left[ y_a^*(p(j+1)) - \hat{P}_{1a} u(pj) + \hat{P}_{2aa} y_a(pj) + \hat{P}_{2au} y_u(pj) \right] \tag{22}$$

Symbols without this hat describe the dynamics of the real world or truth model. Then Eq. (21) becomes


$$\begin{bmatrix} y_a(p(j+1)) \\ y_u(p(j+1)) \\ u(p(j+1)) \end{bmatrix} = \begin{bmatrix} P_a \hat{P}_a^\dagger \hat{P}_{2aa} - P_{2aa} & P_a \hat{P}_a^\dagger \hat{P}_{2au} - P_{2au} & P_{1a} - P_a \hat{P}_a^\dagger \hat{P}_{1a} \\ P_u \hat{P}_a^\dagger \hat{P}_{2aa} - P_{2ua} & P_u \hat{P}_a^\dagger \hat{P}_{2au} - P_{2uu} & P_{1u} - P_u \hat{P}_a^\dagger \hat{P}_{1a} \\ \hat{P}_a^\dagger \hat{P}_{2aa} & \hat{P}_a^\dagger \hat{P}_{2au} & -\hat{P}_a^\dagger \hat{P}_{1a} \end{bmatrix} \begin{bmatrix} y_a(pj) \\ y_u(pj) \\ u(pj) \end{bmatrix} + \begin{bmatrix} P_a \hat{P}_a^\dagger \\ P_u \hat{P}_a^\dagger \\ \hat{P}_a^\dagger \end{bmatrix} y_a^*(p(j+1)) \tag{23}$$

Asking for zero error immediately using the batch stable inverse can be too aggressive. Reference [8] develops methods to intelligently smooth the convergence process when the current trajectory is far from the desired trajectory so that achieving zero error in the next time step is too much to ask. One may want to adopt some similar approach for the adaptive recursive batch stable inverse updates.
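Returning to the stability condition on Eq. (21), the following sketch assembles the batch-to-batch system matrix for a general even batch length p (counted in fast time steps) and evaluates its spectral radius. It assumes that every second fast time step is addressed and that the model equals the truth, and it uses made-up ARX coefficients, so the printed values are not meant to reproduce any result from the paper. All function names are hypothetical.

```python
import numpy as np

def mpc_matrices(a, b, p):
    """General-p analogue of Eqs. (16)-(18); requires p >= len(a)."""
    n = len(a)
    lhs, fut_u = np.eye(p), np.zeros((p, p))
    past_u, past_y = np.zeros((p, p)), np.zeros((p, p))
    for i in range(p):                       # row i is the equation for y(k+1+i)
        for m in range(1, n + 1):
            if i - m >= 0:
                lhs[i, i - m] = a[m - 1]             # a_m multiplies a future output
            else:
                past_y[i, p + i - m] = a[m - 1]      # a_m multiplies a past output
            if i - m + 1 >= 0:
                fut_u[i, i - m + 1] = b[m - 1]       # b_m multiplies a future input
            else:
                past_u[i, p + i - m + 1] = b[m - 1]  # b_m multiplies a past input
    solve = np.linalg.solve
    return solve(lhs, past_u), solve(lhs, past_y), solve(lhs, fut_u)

def batch_system_matrix(a, b, p):
    """System matrix of Eq. (21), state ordered as [y_a, y_u, u]."""
    P1, P2, P = mpc_matrices(a, b, p)
    ad = np.arange(1, p, 2)                  # addressed entries
    un = np.arange(0, p, 2)                  # unaddressed entries
    G = np.linalg.pinv(P[ad])                # Moore-Penrose pseudoinverse of P_a

    def output_rows(rows):
        return np.hstack([P[rows] @ G @ P2[np.ix_(ad, ad)] - P2[np.ix_(rows, ad)],
                          P[rows] @ G @ P2[np.ix_(ad, un)] - P2[np.ix_(rows, un)],
                          P1[rows] - P[rows] @ G @ P1[ad]])

    control_rows = np.hstack([G @ P2[np.ix_(ad, ad)], G @ P2[np.ix_(ad, un)],
                              -G @ P1[ad]])
    # The addressed block row is computed rather than set to zero; it vanishes
    # up to round-off because P_a times its pseudoinverse is the identity.
    return np.vstack([output_rows(ad), output_rows(un), control_rows])

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

a = (-1.8, 1.07, -0.21)       # assumed fast-rate a1, a2, a3
b = (0.01, 0.0355, 0.00825)   # assumed fast-rate b1, b2, b3
for p in (4, 6, 8, 10):
    print(p, spectral_radius(batch_system_matrix(a, b, p)))
```

Scanning p in this way and picking the smallest value with spectral radius below one mirrors the selection procedure described above, under the stated assumptions.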

6 Model Updates

We choose to make model updates using the projection algorithm [2]. There are various other choices that could be used, but this is perhaps the simplest. There are two choices in making the model updates from data obtained in each batch. The first uses the ARX model of Eq. (2) and updates the coefficients of this model every fast time step as data arrives. The identified coefficients $a_i$, $b_i$ can then be used to compute the coefficient matrices in the MPC model in Eq. (17). The second choice updates the MPC model coefficients $P$, $P_1$, $P_2$ in Eq. (18) directly using data from the most recent batch. This approach is straightforward, but is inefficient because it identifies more coefficients, many of which may be known to be zero, while others are known to be related. But the approach may be helpful in theoretical analysis.

6.1 ARX Model Updates

Equation (2) when put into standard form is

$$y(k+1) = \phi^T(k)\theta, \qquad \phi^T(k) = [-y(k), -y(k-1), -y(k-2), u(k), u(k-1), u(k-2)], \qquad \theta^T = [a_1, a_2, a_3, b_1, b_2, b_3] \tag{24}$$


where k counts the fast time steps, and data is recorded at the fast sample rate. Given current estimates at step k of the difference equation coefficients, the predicted output is given as

$$\hat{y}(k+1) = \phi^T(k)\hat{\theta}(k), \qquad \hat{\theta}^T(k) = [\hat{a}_1(k), \hat{a}_2(k), \hat{a}_3(k), \hat{b}_1(k), \hat{b}_2(k), \hat{b}_3(k)] \tag{25}$$

The projection algorithm then updates the model coefficients according to

$$\hat{\theta}(k+1) = \hat{\theta}(k) + a\,\frac{\phi(k)}{c + \phi^T(k)\phi(k)}\,[y(k+1) - \hat{y}(k+1)] \tag{26}$$

The a is a gain between 0 and 2, and c is inserted if desired to prevent the possibility of dividing by zero. The standard proof for the projection algorithm establishes that this model update procedure will have the identified coefficients converge, although not necessarily to the true value of the coefficients when some part of the system dynamics is not excited and observed in the data. The proof of convergence in this sense applies to both kinds of model updates discussed here.
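A minimal sketch of the projection update of Eq. (26), applied to the regressor of Eq. (24), might look as follows. The plant coefficients, the random excitation and the function names are assumptions for illustration, and no claim is made about how quickly the estimates approach the true coefficients.

```python
import numpy as np

def projection_update(theta_hat, phi, y_next, gain=1.0, c=1e-6):
    """One step of Eq. (26); gain plays the role of a (between 0 and 2)."""
    prediction_error = y_next - phi @ theta_hat          # y(k+1) - yhat(k+1), Eq. (25)
    return theta_hat + gain * phi * prediction_error / (c + phi @ phi)

def identify_arx(y, u, gain=1.0, c=1e-6):
    """Run Eq. (26) over recorded data; len(y) must equal len(u) + 1."""
    theta_hat = np.zeros(6)
    for k in range(2, len(u)):
        phi = np.array([-y[k], -y[k-1], -y[k-2], u[k], u[k-1], u[k-2]])
        theta_hat = projection_update(theta_hat, phi, y[k+1], gain, c)
    return theta_hat

a_true, b_true = (-1.8, 1.07, -0.21), (0.01, 0.0355, 0.00825)   # assumed plant
theta_true = np.array(a_true + b_true)
rng = np.random.default_rng(0)
u = rng.standard_normal(400)
y = np.zeros(len(u) + 1)
for k in range(2, len(u)):
    phi = np.array([-y[k], -y[k-1], -y[k-2], u[k], u[k-1], u[k-2]])
    y[k+1] = phi @ theta_true                                   # Eq. (24)
print(identify_arx(y, u))
print(theta_true)
```

With exact data the distance between the estimate and the true coefficient vector cannot increase from one update to the next for a gain between 0 and 2, which is the property the standard convergence proof relies on.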

6.2 MPC Model Updates

Introduce subscript i for the ith row of Eq. (18), and rearrange to obtain the standard form to use in the projection algorithm

$$y_i(p(j+1)) = P_{1i} u(pj) - P_{2i} y(pj) + P_i u(p(j+1)) = u(pj)^T P_{1i}^T - y(pj)^T P_{2i}^T + u(p(j+1))^T P_i^T \tag{27}$$

$$y_i(p(j+1)) = \begin{bmatrix} u(p(j+1))^T & u(pj)^T & -y(pj)^T \end{bmatrix} \begin{bmatrix} P_i^T \\ P_{1i}^T \\ P_{2i}^T \end{bmatrix} = \phi^T(pj)\theta_i \tag{28}$$

The output predicted using the current MPC model in batch j is

$$\hat{y}_i(p(j+1)) = \phi^T(pj)\hat{\theta}_i(pj) \tag{29}$$

The projection algorithm update of the MPC model coefficients is then

$$\hat{\theta}_i(p(j+1)) = \hat{\theta}_i(pj) + a\,\frac{\phi(pj)}{c + \phi^T(pj)\phi(pj)}\,[y_i(p(j+1)) - \hat{y}_i(p(j+1))] \tag{30}$$

7 Proposed Discrete Time Indirect Batch Inverse Adaptive Control

The two steps of standard discrete time indirect adaptive control are:

(1) The control computed in real time for every time step uses the one step ahead control based on the current system model.
(2) The coefficients in the model are updated in real time each time step based on the most recent data obtained.

In the proposed adaptive batch inverse approach, these two steps are replaced to eliminate the stable inverse requirement:

(i) Item (1) is modified to use the skip step batch stable inverse, by introducing a number of zero order holds between each time step that aims to have zero tracking error. The number is equal to the number of zeros outside the unit circle in the discrete time transfer function. This produces a stable batch inverse for the number of time steps in a batch.
(ii) The control action is computed using a recursive batch update of the control action, to produce zero error for the number of time steps in the batch. One picks the number of time steps p in each batch so that the recursive batch updates are stable. The error at addressed time steps is automatically zero for the correct model, but transients at unaddressed time steps are ensured stable by choice of p. This number cannot be one, but can be a small number such as 3 addressed time steps.
(iii) As in (2) above, with the input-output data from each batch one uses the projection algorithm to update the model to use in the next batch inverse computation. Using the ARX model these updates can be made every fast time step, and then used to compute the control input for the next batch.

8 Comments and Discussion

8.1 Bound on the Order of the System

Routine discrete time indirect adaptive control needs to know the order of the system, or specify an order that is larger than the true system order. The same condition applies to the adaptive control proposed here, when the projection algorithm is used on the ARX model. When it is used on the MPC model one only needs to have the dimension of the past input and past output large enough for the initial conditions needed for the next batch. Of course, one can overestimate this if needed. It is otherwise independent of the true system order.


8.2 Bound on the Number of Zeros Outside the Unit Circle

The skip step method of producing a stable batch inverse asks to include one extra zero order hold between addressed time steps for every zero outside the unit circle. The proof of the skip step stable inverse in [9] says that one still gets zero error at the addressed time steps if this number is overestimated.

8.3 On the Choice of Batch Size p to Produce a Stable Recursive Batch Update Process

One must make the choice of batch length large enough that the recursive batch update process is convergent, i.e. all eigenvalues of the system matrix in Eq. (21) must be less than unity in magnitude. It was previously commented that such a value of p is guaranteed to exist. Reference [8] studies this issue. It considers a third order continuous time system with a first order factor in the transfer function having a bandwidth of 1.4 Hz, and a second order factor with undamped natural frequency 37 rad/s and damping ratio 0.5. The DC gain is unity and sampling is at 100 Hz. The one zero outside the unit circle in the resulting z-transfer function is at −3.3104. Table 1, from [8], gives the spectral radius versus p, where p is the fast sample rate count. A p of 6 is sufficient for convergence, and this corresponds to specifying the desired output for the next 3 time steps at 100 Hz for each batch update. It is not one step ahead control, but 3 time steps is not a very long time over which to predict what command one would like to apply. In this application there are more issues to consider concerning the choice of p. Given our current model of the world, one can determine the p needed by evaluating the eigenvalues of the system matrix in Eq. (21), assuming it represents the world. It could happen that the real world needs a larger p. Some algorithm is needed to recognize that the p being used is too small, and then adjust the adaptive control process to start using a bigger p. There are also issues of convergence if model error is present: convergence might then require that the coefficient matrix in Eq. (23) have all eigenvalues less than unity in magnitude, and this matrix is unknown.

Table 1 Spectral radius (SR) for different values of p

p     4        6        8        10       12       14       16       18
SR    1.3487   0.6939   0.3251   0.1491   0.0680   0.0310   0.0141   0.0064
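The zero location quoted for the example system can be checked with a short computation. The sketch below assumes the plant is $\frac{a}{s+a}\cdot\frac{\omega^2}{s^2 + 2\zeta\omega s + \omega^2}$ with $a = 2\pi(1.4)$ rad/s, $\omega = 37$ rad/s and $\zeta = 0.5$, and that the discrete model is its zero order hold equivalent at 100 Hz. The scaled state realization, the simple Taylor-based matrix exponential and the function names are choices made for this example. Under these assumptions the zero outside the unit circle should come out close to the −3.3104 given in the text.

```python
import numpy as np

def expm_scaled_taylor(M, squarings=6, terms=16):
    """Matrix exponential by a scaled Taylor series followed by repeated squaring."""
    X = M / 2.0**squarings
    result, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ X / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def zoh_zeros(bandwidth_hz=1.4, w=37.0, zeta=0.5, T=0.01):
    """Zeros of the zero order hold equivalent of the assumed third order plant."""
    a = 2.0 * np.pi * bandwidth_hz
    d = np.polymul([1.0, a], [1.0, 2.0*zeta*w, w**2])   # s^3 + d[1] s^2 + d[2] s + d[3]
    # States y, y'/w, y''/w^2 keep the matrix entries of comparable size.
    Ac = np.array([[0.0, w, 0.0], [0.0, 0.0, w], [-d[3]/w**2, -d[2]/w, -d[1]]])
    Bc = np.array([0.0, 0.0, d[3]/w**2])
    Cc = np.array([1.0, 0.0, 0.0])
    M = np.zeros((4, 4))
    M[:3, :3], M[:3, 3] = Ac, Bc
    E = expm_scaled_taylor(M * T)        # exp of the augmented matrix gives [Ad Bd; 0 1]
    Ad, Bd = E[:3, :3], E[:3, 3]
    # det(zI - Ad + Bd Cc) = det(zI - Ad) * (1 + Cc (zI - Ad)^{-1} Bd),
    # so the difference of characteristic polynomials is the numerator.
    numerator = np.real(np.poly(Ad - np.outer(Bd, Cc)) - np.poly(Ad))
    return np.roots(numerator[1:])

print(zoh_zeros())
```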


8.4 Behavior at Unaddressed Time Steps

There are three issues to consider concerning the error at the unaddressed time steps. The first is transient behavior. No transients can be observed from batch update to batch update at the addressed time steps, because the error there is a numerical zero. But errors at the unaddressed time steps are subject to transients. It is the pseudoinverse in the control law Eq. (22) that picks the control action for the introduced steps. It would be of interest to examine how this behaves during convergence of the model updates. Numerical experience suggests that this transient behavior disappears quickly.

The second issue is the nonuniformity of the errors at unaddressed time steps. The skip step approach is used here to make all time steps have the same number of zero order holds inserted, in an effort to make the error at unaddressed time steps look uniform across a batch. Nevertheless, the error at the unaddressed time step(s) preceding the first addressed time step is different from that between other time steps, and it is very hard to influence this error.

The third issue is the size of the error at the introduced time steps. Again from [8], Figs. 1 and 2 show the error at the addressed time steps and at the unaddressed time steps, respectively. The desired trajectory is a one Hertz sine wave of unit amplitude, sampled at 100 Hz, so the plots show two periods, and p = 6. The logarithm of the unaddressed errors in the plot uses the absolute value of the error to avoid logs of negative numbers. The scale on Fig. 1 indicates that all errors at addressed time steps are at the numerical zero level, as they must be for the stable batch inverse control. Figure 2 plots the error at the unaddressed time steps. There is some transient behavior at the beginning, and the maximum error compared to the continuous sine function is around $10^{-2}$. The root mean square over all 200 time steps is 0.0026. This is not a numerical zero, but it is a few orders of magnitude smaller than the unit amplitude of the desired trajectory. This suggests that the error at unaddressed time steps need not be serious in applications.

Fig. 1 Tracking error at addressed time steps (vertical axis: addressed error, scale $\times 10^{-14}$, range −1.5 to 1.5; horizontal axis: time step, 0 to 200)

Fig. 2 Tracking error at unaddressed time steps (vertical axis: unaddressed error, log scale, $10^{-5}$ to $10^{-1}$; horizontal axis: time step, 0 to 200)

Acknowledgements The authors would like to acknowledge the helpful suggestions of the reviewers, and in particular thank Graham C. Goodwin for suggesting comparison with the theory of generalized holds.

References

1. Goodwin, G.C., Ramadge, P.J., Caines, P.E.: Discrete-time multivariable adaptive control. IEEE Trans. Autom. Control 25(3), 449–456 (1980)
2. Goodwin, G.C., Sin, K.W.: Adaptive Filtering Prediction and Control. Prentice Hall, NJ (1984)
3. Åström, K.J., Hagander, P., Strenby, J.: Zeros of sampled systems. In: Proceedings of the 19th IEEE Conference on Decision and Control, pp. 1077–1081 (1980)
4. Kabamba, P.T.: Control of linear systems using generalized sampled-data hold functions. IEEE Trans. Autom. Control 32(9), 772–783 (1987)
5. Freudenburg, J., Middleton, R., Braslavsky, J.: Robustness of zero shifting via generalized sample-data hold functions. IEEE Trans. Autom. Control 42(12), 1681–1692 (1997)
6. Ibeas, A., de la Sen, M., Balaguer, P., Vilanova, R., Peedet, C.: Digital inverse model control using generalized holds with extension to the adaptive case. Int. J. Control Autom. Syst. 8(4), 707–719 (2010)
7. Feuer, A., Goodwin, G.: Generalized sample hold functions: frequency domain analysis of robustness, sensitivity, and intersample difficulties. IEEE Trans. Autom. Control 39(5), 1042–1047 (1994)
8. Wang, B., Longman, R.W.: Generalized one step ahead control made practical by new stable inverses. In: Proceedings of the AIAA/AAS Space Flight Mechanics Meeting, Kissimmee, FL (2018)
9. Ji, X., Li, T., Longman, R.W.: Proof of two new stable inverses of discrete time systems. Adv. Astronaut. Sci. 162, 123–136 (2018)
10. Longman, R.W., Li, T.: On a new approach to producing a stable inverse of discrete time systems. In: Proceedings of the 18th Yale Workshop on Adaptive and Learning Systems, Center for Systems Science, New Haven, Connecticut, June 21–23, pp. 68–73 (2017)
11. Li, Y., Longman, R.W.: Using underspecification to eliminate the usual instability of digital system inverse models. Adv. Astronaut. Sci. 135(1), 127–146 (2010)
12. Devasia, S., Chen, D., Paden, B.: Nonlinear inversion-based output tracking. IEEE Trans. Autom. Control 41(7), 930–942 (1996)
13. LeVoci, P.A., Longman, R.W.: Intersample error in discrete time learning and repetitive control. In: Proceedings of the 2004 AIAA/AAS Astrodynamics Specialist Conference, Providence, RI (2004)

An Improved Conjugate Gradients Method for Quasi-linear Bayesian Inverse Problems, Tested on an Example from Hydrogeology

Ole Klein

Abstract We present a framework for high-performance quasi-linear Bayesian inverse modelling and its application in hydrogeology; extensions to other domains of application are straightforward due to generic programming and modular design choices. The central component of the framework is a collection of specialized preconditioned methods for nonlinear least squares: the classical three-term recurrence relation of Conjugate Gradients and related methods is replaced by a specific choice of six-term recurrence relation, which is used to reformulate the resulting optimization problem and eliminate several costly matrix-vector products. We demonstrate that this reformulation leads to improved performance, robustness, and accuracy for a synthetic example application from hydrogeology. The proposed prior-preconditioned caching CG scheme is the only one among the considered CG methods that scales perfectly in the number of estimated parameters. In the highly relevant case of sparse measurements, the proposed method is up to two orders of magnitude faster than the classical CG scheme, and at least six times faster than a prior-preconditioned, non-caching version. It is therefore particularly suited for the large-scale inversion of sparse observations.

O. Klein, Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Im Neuenheimer Feld 205, 69120 Heidelberg, Germany. e-mail: [email protected]

© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_17

1 Introduction

In hydrogeology, inverse modelling is used as a tool to infer hydraulic parameters of sites from noisy observations of state variables and dependent quantities. The underlying parameterization of hydrological models has a direct impact on the accuracy of simulation results. However, these parameters are typically not directly observable, and therefore inverse methods are a useful tool: they estimate parameters based on measurements of observable quantities.

A wide range of inverse methods has been developed, from hand-tuned approaches to simple deterministic models to full statistical models, and from scalar estimates and simple zonations to spatially highly resolved parameterizations (e.g., [1, 7, 14, 20], see [16, 17] for further examples and discussion). Most, if not all, such methods require a large number of simulations to provide an estimate, either through sampling of the parameter space (trial-and-error) or through guided optimization of some performance measure (cost function/objective function).

Due to the need for properly calibrated models and the inherent high cost of inverse modelling, efficient inversion techniques are an important research topic. Many particularly efficient approaches rely on spectral methods like the fast Fourier transform (FFT), and therefore on structured grids, see, e.g., Fritz et al. [11]. Several other approaches also focus on compact and efficient representations of the underlying parameter spaces, replacing or augmenting these spectral methods. Li and Cirpka [19] utilize Karhunen–Loève (KL) expansion to make their approach applicable to unstructured grids. Nowak and Litvinenko [24] combine low-rank covariance approximations with FFT-techniques in an efficient Kriging approach. Lee and Kitanidis [18] introduce a Principal Component Geostatistical Approach (PCGA) based on principal component analysis (PCA) as a compression technique for parameterization. Others focus on parallelization as a means to speed up inversion techniques, e.g., Schwede et al. [26] propose a two-level parallelization of the inverse modelling process and discuss balancing strategies.

Comparatively little has been written about efficient specialized optimization techniques in the context of geostatistical inverse modelling. One work that has a focus similar to ours is that of Nowak and Cirpka [23], who present a modified Levenberg–Marquardt algorithm for geostatistical inversion.
As in our approach, they also eliminate multiplications with the inverse of the prior covariance matrix, but in contrast to our method they need to assemble the full linearized model (sensitivity matrix) in each iteration of the optimization scheme. The same holds true for the classical work of Kitanidis [14], which presents a modified Gauss–Newton scheme that solves the cokriging equations in a transformed space, thereby also eliminating costly matrix-vector products. The following sections introduce a set of improvements to standard optimization techniques as they are employed in geostatistical inverse modelling. For the sake of simplicity we restrict ourselves to the nonlinear Conjugate Gradients (CG) method, but it is straightforward to apply the central steps to a wider class of methods, e.g., the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method, or the Jacobian-free Gauss–Newton scheme, see Sect. 3.1. We demonstrate that these improvements not only reduce the required number of iterations for a given test problem, but additionally make each individual iteration cheaper to compute, i.e., the discussed form of preconditioning has negative cost. The numerical examples are based on structured grids to allow for the use of circulant-embedding techniques for covariance matrices. The central results of this paper, however, do not require this restriction, and should be applicable whenever cheap algorithms for covariance matrix multiplications are available. The rest of this paper is structured as follows:


• Section 2 provides a short introduction to Bayesian inverse modelling in the context of hydrogeology, and discusses the main ideas behind the efficient optimization techniques used in the considered framework.
• Section 3 discusses implementation details and design decisions, as well as things to consider in parallel applications.
• In Sect. 4 a synthetic test problem is considered and performance metrics and accuracy of the inversion results are discussed.
• Finally, Sect. 5 provides a short summary and concludes the paper.

2 Inverse Modelling in Hydrogeology

Given a vector of parameters $p \in \mathbb{R}^{n_p}$, $n_p \in \mathbb{N}$, that is used to parameterize a given physical model, one may obtain synthetic observations of state variables and dependent quantities through numerical simulation. These observations may then be grouped into a vector of measurements $z = z(p) \in \mathbb{R}^{n_z}$, $n_z \in \mathbb{N}$, representing the response of the physical process given its parameterization and boundary/initial conditions. Typically, the mapping

$$G \colon \mathbb{R}^{n_p} \to \mathbb{R}^{n_z}, \quad p \mapsto z(p) \tag{1}$$

is defined through a set of partial differential equations (PDEs). Such a forward simulation is usually conceptually straightforward, if computationally expensive. The inverse problem, i.e., determining a set of parameters $p$ that produces a certain outcome $z$, is often significantly more difficult. The mapping $p \mapsto z(p)$ need not be injective, i.e., several parameterizations may lead to the same observations [16], and in the presence of noise there may even be no parameterization at all that can reproduce $z$. Furthermore, small changes in $z$ may require arbitrarily large changes in $p$ to compensate. In other words, the construction of an inverse mapping $z \mapsto p(z)$ is an ill-posed problem in the classical sense of Hadamard [12], and therefore needs to be regularized.

In the context of hydrogeology, the parameter vector $p$ is (the discretization of) some hydraulic property, e.g., the hydraulic log-conductivity $Y$ of an aquifer. The measurement vector $z$, then, contains some dependent quantity, e.g., the hydraulic head $h$ at certain positions in space, which may be computed using the groundwater flow equation:

$$-\nabla \cdot (K \nabla h) = q \tag{2}$$

Here $K(x) = \exp(Y(x))$ is the hydraulic conductivity at a certain position $x \in \mathbb{R}^d$, $d \in \{2, 3\}$ (with $Y(x)$ being the component of $Y$ that is associated with $x$), $h$ is the system state (hydraulic head), with $z$ being its restriction to certain points, and $q$ is a source term. A forward model $G$, Eq. (1), is then obtained through the specification of boundary conditions for (2), which are fixed by the setup under consideration.
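To make the mapping $G$ of Eq. (1) concrete, the following is a minimal sketch of a forward model for Eq. (2), under strong simplifying assumptions that are not part of the paper: a one-dimensional domain (the text considers $d \in \{2, 3\}$), a cell-centred finite-volume discretization, Dirichlet values at both ends, and a hypothetical function name `forward_model`. It maps log-conductivities $Y$ to heads $h$ at selected cells.

```python
import math


def forward_model(Y, q, h_left, h_right, length, obs_cells):
    """Map cell-wise log-conductivities Y to heads h at the observed cells."""
    n = len(Y)
    dx = length / n
    K = [math.exp(y) for y in Y]  # K = exp(Y), as in the text

    # Face transmissibilities: half-cell distance at the Dirichlet
    # boundaries, harmonic mean of neighbouring cells in the interior.
    T = [2.0 * K[0] / dx]
    T += [2.0 * K[i] * K[i + 1] / ((K[i] + K[i + 1]) * dx) for i in range(n - 1)]
    T.append(2.0 * K[-1] / dx)

    # Tridiagonal system:
    # -T[i] h[i-1] + (T[i] + T[i+1]) h[i] - T[i+1] h[i+1] = q[i] dx
    diag = [T[i] + T[i + 1] for i in range(n)]
    rhs = [q[i] * dx for i in range(n)]
    rhs[0] += T[0] * h_left
    rhs[-1] += T[n] * h_right

    # Thomas algorithm: forward elimination, then back substitution.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -T[1] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        lower = -T[i]
        upper = -T[i + 1] if i < n - 1 else 0.0
        denom = diag[i] - lower * c_prime[i - 1]
        c_prime[i] = upper / denom
        d_prime[i] = (rhs[i] - lower * d_prime[i - 1]) / denom
    h = [0.0] * n
    h[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        h[i] = d_prime[i] - c_prime[i] * h[i + 1]

    # Restriction of the state h to the measurement locations.
    return [h[i] for i in obs_cells]
```

For a homogeneous field ($Y = 0$) without sources, this sketch reproduces the linear head profile between the two boundary values, which is a convenient sanity check before using it inside an inversion loop.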


2.1 Bayesian Inversion

The Bayesian approach to inverse modelling treats both the observation $z$ and the parameterization $p$ as random variables, thereby incorporating measurement uncertainty in a natural fashion. The measurement vector $z$ for a given parameterization $p$ is assumed to be contaminated by noise $\varepsilon$ with known distribution, typically Gaussian,

$$z \mid p = (z^* + \varepsilon) \mid p, \quad \varepsilon \sim \mathcal{N}(0, Q_{zz}) \implies z \mid p \sim \mathcal{N}(z^*, Q_{zz}), \tag{3}$$

where $z^* = G(p)$ is the mean of the observation vector (i.e., the "true" state) and $Q_{zz}$ is its auto-covariance matrix. Likewise, treating the parameter vector $p$ as a random variable automatically includes prior knowledge about the system (parameterizations that are "likely" and "unlikely" independent of the measurements, e.g., due to physical limitations or underlying structure): $p$ is assumed to follow a known prior distribution, often also Gaussian:

$$p \sim \mathcal{N}(p^*, Q_{pp}), \tag{4}$$

with $p^*$ its prior mean and $Q_{pp}$ its covariance matrix. Bayes' Theorem states that the joint probability density of a pair $(p, z)$ of parameters and measurements is

$$\pi(p, z) = \pi(p) \times \pi(z \mid p) = \pi(z) \times \pi(p \mid z), \tag{5}$$

where $\pi(x \mid y)$ is the conditional probability density of $x$ given $y$. This can be rewritten as

$$\pi(p \mid z) = \pi(p) \times \pi(z \mid p) \times \pi(z)^{-1} \propto \pi(p) \times \pi(z \mid p), \tag{6}$$

i.e., the posterior probability $\pi(p \mid z)$ ($p$ as an explanation of observing $z$) can be computed, up to an unknown normalization constant $\pi(z)$ (the marginal probability of observing $z$), as the product of prior probability $\pi(p)$ and measurement likelihood $\pi(z \mid p)$. As a consequence, the ill-posed problem of determining a parameter vector $p$ given some observations $z$ has been replaced by the (well-posed) problem of determining a probability density $\pi(p \mid z)$ given densities $\pi(p)$ and $\pi(z \mid p)$. The missing constant $\pi(z)$ is of no importance in practice, since it can be computed through renormalization if needed. In the case where both $p$ and $z \mid p$ are assumed to be normally distributed, we have, up to constant normalization factors,

$$\pi(p) \times \pi(z \mid p) = \exp\left(-\tfrac{1}{2} \|p - p^*\|^2_{Q_{pp}^{-1}}\right) \times \exp\left(-\tfrac{1}{2} \|z - G(p)\|^2_{Q_{zz}^{-1}}\right), \tag{7}$$

where $\|\cdot\|_A$ is the norm induced by the positive definite matrix $A$, i.e.,

$$\|x\|_A := (x^T A x)^{1/2}. \tag{8}$$

2.2 Maximum a Posteriori Estimation

In theory, the posterior probability $\pi(p \mid z)$ can serve as an answer to the inverse problem. It assigns a probability density to each possible set of parameters $p$, thereby distinguishing regions of likely parameter configurations from those that are less likely. However, the resulting density function is too high-dimensional and complex to be of direct use. This is the reason why methods like Markov–Chain Monte Carlo (MCMC) typically derive low-dimensional quantities of interest (QoI) from samples of the posterior distribution. MCMC methods tend to be very accurate, but also exceedingly expensive. Modern extensions like multilevel MCMC can mitigate these costs to some extent, but the methods remain costly.

A less computationally expensive, but also less accurate, approach is maximum a posteriori (MAP) estimation, a form of deterministic Bayesian inversion. MAP estimation tries to extract the most probable set of parameters, and can be seen as both a modification of maximum likelihood (ML) and a form of regularization of the original inverse problem. In the case of Eq. (7), MAP estimation consists in finding the minimizer $p_{\mathrm{MAP}}$ of the objective function

$$L(p) := \tfrac{1}{2} \|p - p^*\|^2_{Q_{pp}^{-1}} + \tfrac{1}{2} \|z - G(p)\|^2_{Q_{zz}^{-1}}, \tag{9}$$

which is, up to an additive constant, the negative of the logarithm of $\pi(p \mid z)$. This minimizer $p_{\mathrm{MAP}}$ is then interpreted as an estimate of the mean of $\pi(p \mid z)$, which would be exact if $G$ were a linear mapping. See Fig. 1 for an example of MAP estimation. Likewise, the approximate inverse Hessian of $L$ at $p_{\mathrm{MAP}}$,

$$Q_{pp}^{\mathrm{MAP}} := \left(Q_{pp}^{-1} + H^T Q_{zz}^{-1} H\right)^{-1}, \tag{10}$$

where $H$ is the linearization of the model $G$ at $p_{\mathrm{MAP}}$, can serve as an estimate of the posterior covariance matrix, thereby providing linearized uncertainty quantification. In essence, the posterior distribution is approximated by a Gaussian distribution with mean $p_{\mathrm{MAP}}$ and covariance matrix $Q_{pp}^{\mathrm{MAP}}$. This approach requires significantly fewer computational resources than methods that retain the full posterior distribution, but there are several restrictions and caveats:

• First and foremost, there is no guarantee that a unique minimizer $p_{\mathrm{MAP}}$ even exists, since the posterior distribution may be multimodal for a general forward model $G$.


Fig. 1 Example of maximum a posteriori estimation. Left: synthetic reference parameter field containing 2048 × 2048 values of log-conductivity Y. Noisy measurements of hydraulic head h defined through Y are extracted at 16 × 16 equidistant locations, as described in Sect. 4. Right: Inversion result pMAP based on the synthetic measurements and the preconditioned caching nonlinear CG method described in Sect. 2.3. This result is an estimate of the mean of all parameterizations that would explain the observed head values

• Assuming a unique minimizer of the objective function $L$, and sufficient linearity of $L$ around this minimizer, there is no guarantee that the resulting estimates $p_{\mathrm{MAP}}$ and $Q_{pp}^{\mathrm{MAP}}$ provide a satisfactory description of the posterior distribution.
• Lastly, the approach is restricted to prior distributions and noise distributions that allow minimization of (9), which are therefore typically assumed to be Gaussian in practice.

These points make it necessary to verify the appropriateness of applying MAP estimation to a given problem. We refer to [16] for further discussion in this regard.

2.3 Specialized Optimization Methods for MAP Estimates

Finding the minimizer $p_{\mathrm{MAP}}$ of (9), assuming it is unique, amounts to solving an unconstrained optimization problem

$$p_{\mathrm{MAP}} := \arg\min_p L(p), \tag{11}$$

which is a nonlinear least squares problem due to the structure of $L$. Assuming sufficient regularity of the map $G$, derivative-based optimization methods use the gradient of the objective function $L$, Eq. (9), given by

$$\nabla_p L = Q_{pp}^{-1} (p - p^*) + H^T Q_{zz}^{-1} (G(p) - z), \tag{12}$$

to solve (11). Here $H$ is again the linearization of the model $G$, now around the point where the gradient is evaluated. This gradient is usually evaluated using adjoint model equations, due to the low resulting computational cost and high accuracy in discretized settings [25].
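The relation between the objective function (9) and its gradient (12) can be checked numerically on a small example. The sketch below is an illustration only: the two-parameter model `G`, its Jacobian `H`, and all numerical values are hypothetical, and the Jacobian is formed explicitly rather than through adjoint equations as in the paper.

```python
import math

# Hypothetical toy data (not from the paper).
Q_PP_INV = [[4.0 / 3.0, -2.0 / 3.0],
            [-2.0 / 3.0, 4.0 / 3.0]]  # inverse of [[1, 0.5], [0.5, 1]]
Q_ZZ_INV = [[100.0]]  # one measurement, noise standard deviation 0.1
P_STAR = [0.0, 0.0]   # prior mean p*
Z = [1.5]             # observed value z


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def G(p):
    """Hypothetical nonlinear forward model with a single observation."""
    return [math.exp(p[0]) + p[1]]


def H(p):
    """Linearization (Jacobian) of G at p."""
    return [[math.exp(p[0]), 1.0]]


def objective(p):
    """L(p) as in Eq. (9)."""
    dp = [a - b for a, b in zip(p, P_STAR)]
    dz = [a - b for a, b in zip(Z, G(p))]
    return 0.5 * dot(dp, matvec(Q_PP_INV, dp)) + 0.5 * dot(dz, matvec(Q_ZZ_INV, dz))


def gradient(p):
    """Gradient of L as in Eq. (12)."""
    dp = [a - b for a, b in zip(p, P_STAR)]
    weighted_misfit = matvec(Q_ZZ_INV, [g - z for g, z in zip(G(p), Z)])
    jac = H(p)
    adjoint_term = [sum(jac[k][j] * weighted_misfit[k] for k in range(len(jac)))
                    for j in range(len(p))]
    return [a + b for a, b in zip(matvec(Q_PP_INV, dp), adjoint_term)]


def finite_difference_gradient(p, eps=1e-6):
    """Central differences of L, used only to check Eq. (12)."""
    grad = []
    for j in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[j] += eps
        minus[j] -= eps
        grad.append((objective(plus) - objective(minus)) / (2.0 * eps))
    return grad
```

Comparing `gradient(p)` with `finite_difference_gradient(p)` at a few points is a cheap way to catch sign errors in the misfit term $G(p) - z$.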

2.3.1 Preconditioned Nonlinear Conjugate Gradients

Based on (12), one can in principle use a gradient-based optimization method to obtain $p_{\mathrm{MAP}}$. However, the resulting optimization problem is often ill-conditioned for even moderately dimensioned parameter vectors $p$, and becomes strongly so when going to high-dimensional spaces [16]. It is often advisable, especially when $n_p \gg n_z$, to switch to a preconditioned formulation. Using the prior covariance matrix $Q_{pp}$ as preconditioner, we arrive at

$$Q_{pp} \nabla_p L = (p - p^*) + Q_{pp} H^T Q_{zz}^{-1} (G(p) - z) \tag{13}$$

as a replacement for the gradient as local search direction. The simplest efficient numerical method making use of Eq. (12) or (13), respectively, is the nonlinear Conjugate Gradients (CG) method, see Algorithm 1. Each iteration of this method obtains a step direction $d_i$, which is a linear combination of the current gradient and the previous direction based on a scalar factor $\beta_i$, and then performs a line search in that direction to produce a step length $\alpha_i$ and new parameter vector $p_i$. The unpreconditioned CG method is obtained by setting $M := I$ (identity matrix), while the prior-preconditioned version uses $M := Q_{pp}$. It is well-known that preconditioning is effectively a change of variables

$$p \mapsto Q_{pp}^{-1/2} p \tag{14}$$

applied to the objective function (9), but written in the original variables $p$. The prior covariance matrix of the transformed space is then $I$, i.e., the parameters are uncorrelated white noise. This form of preconditioner has been applied successfully in a range of approaches and applications in the context of inverse modelling, see, e.g., [5] for an application in uncertainty quantification, and [8] for an example in the context of Markov–Chain Monte Carlo (MCMC). In the context of Bayesian inversion, this form of preconditioning can also be seen as an approximation of the classical Gauss–Newton scheme, which has the step direction

$$\delta p = -\left(Q_{pp}^{-1} + H^T Q_{zz}^{-1} H\right)^{-1} \nabla_p L. \tag{15}$$

A comparison with Eq. (13) shows that prior preconditioning amounts to dropping the term $H^T Q_{zz}^{-1} H$ from the Gauss–Newton step direction. This modification is a low-rank perturbation of the approximate inverse Hessian in the case of large-scale inversion of sparse observations.

Algorithm 1: Preconditioned nonlinear conjugate gradients

Input: initial value $p_0$, $r_0 = 0$, $t_0 = 0$, $d_0 = 0$, preconditioner matrix $M$, stopping criterion
Output: estimate of MAP point $p_{\mathrm{MAP}}$

$i := 0$ [set index]
repeat
    $i := i + 1$ [shift index]
    $r_i := -\nabla_p L|_{p_{i-1}}$ [compute residual]
    $t_i := M r_i$ [compute preconditioned residual]
    $\beta_i := \mathrm{orthogonalize}(r_{i-1}, t_{i-1}, r_i, t_i, d_{i-1})$ [compute conjugation factor]
    $d_i := t_i + \beta_i d_{i-1}$ [set direction]
    $\alpha_i := \mathrm{linesearch}(p_{i-1}, d_i)$ [compute step width]
    $p_i := p_{i-1} + \alpha_i d_i$ [define $i$th iteration]
until converged
$p_{\mathrm{MAP}} := p_i$ [accept final iteration]
return $p_{\mathrm{MAP}}$

There are several different choices for the factor $\beta$, and most of the common versions are supported by the implementation of the framework developed in this paper, namely $\beta^{\mathrm{HS}}$, $\beta^{\mathrm{FR}}$, $\beta^{\mathrm{PRP}}$, $\beta^{\mathrm{DY}}$, and $\beta^{\mathrm{N}}$ in the notation of [13]. For the numerical experiments of Sect. 4 the variant of Fletcher and Reeves is used, given by

$$\beta_i^{\mathrm{FR}} := \frac{r_i^T t_i}{r_{i-1}^T t_{i-1}}, \tag{16}$$

where $r_i$ is the $i$th residual (negative of gradient), and $t_i$ is the $i$th preconditioned residual ($t_i = M r_i$). The step length $\alpha$ is computed using simple quadratic interpolation.
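Algorithm 1 with the Fletcher–Reeves factor (16) can be sketched in a few lines. The code below is a hedged illustration, not the DUNE implementation: the function name `nonlinear_cg`, the stopping test on $r_i^T t_i$, the trial step of length one in the quadratic interpolation, and the three-parameter linear toy problem are all assumptions made for this example. The first iteration uses $\beta = 0$, since the denominator of (16) vanishes for $r_0 = t_0 = 0$.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def nonlinear_cg(objective, gradient, apply_M, p0, tol=1e-10, max_iter=100):
    """Algorithm 1 with the Fletcher-Reeves factor of Eq. (16)."""
    p = list(p0)
    d = [0.0] * len(p)
    rt_old = 0.0
    for iteration in range(max_iter):
        r = [-g for g in gradient(p)]   # residual r_i
        t = apply_M(r)                  # preconditioned residual t_i
        rt = dot(r, t)
        if rt < tol * tol:              # assumed stopping criterion
            return p, iteration
        beta = rt / rt_old if rt_old > 0.0 else 0.0
        d = [ti + beta * di for ti, di in zip(t, d)]
        # Quadratic interpolation through L(p), the slope at p, and L(p + d).
        phi0 = objective(p)
        slope = -dot(r, d)
        phi1 = objective([pi + di for pi, di in zip(p, d)])
        curvature = phi1 - phi0 - slope
        alpha = -slope / (2.0 * curvature) if curvature > 0.0 else 1.0
        p = [pi + alpha * di for pi, di in zip(p, d)]
        rt_old = rt
    return p, max_iter


# Toy problem (all values hypothetical): linear model G(p) = H p, prior mean 0.
Q_PP = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
Q_PP_INV = [[0.75, -0.5, 0.25], [-0.5, 1.0, -0.5], [0.25, -0.5, 0.75]]
H = [1.0, 0.0, 1.0]   # single observation
SIGMA2 = 0.01         # noise variance
Z = 1.0               # observed value


def objective(p):
    return 0.5 * dot(p, matvec(Q_PP_INV, p)) + 0.5 * (dot(H, p) - Z) ** 2 / SIGMA2


def gradient(p):
    misfit = (dot(H, p) - Z) / SIGMA2
    return [qi + hi * misfit for qi, hi in zip(matvec(Q_PP_INV, p), H)]


# M := I gives the unpreconditioned method, M := Q_pp the prior-preconditioned one.
p_plain, it_plain = nonlinear_cg(objective, gradient, lambda r: list(r), [0.0] * 3)
p_prior, it_prior = nonlinear_cg(objective, gradient, lambda r: matvec(Q_PP, r), [0.0] * 3)
```

Because the toy model is linear, the quadratic interpolation is exact here; for a nonlinear $G$ a safeguarded line search would be needed. Note that this non-caching form still multiplies with $Q_{pp}^{-1}$ in every objective and gradient evaluation.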

2.3.2 Recurrence Relations and Caching

One of the attractive properties of the classical CG method, obtained when setting $M := I$ and therefore $t_i = r_i$ in Algorithm 1, is the fact that it can incorporate curvature information in its step directions with minimal memory overhead. In contrast to the overly simplistic Steepest Descent (Gradient Descent) method, which does not consider curvature at all, and quasi-Newton methods, which have to store several vectors to approximate the Hessian, the CG method is based on a well-known three-term recurrence relation, Fig. 2. The residual $r_i$ is evaluated using $p_{i-1}$ together with Eq. (12), the calculation of $\beta_i$ requires only $r_i$, the direction $d_i$ can be obtained from $d_{i-1}$ and $r_i$, and $p_i$ is calculated using $p_{i-1}$ and $d_i$. Each of the three vectors $p_i$, $r_i$, and $d_i$ is only needed until its successor has been computed, and therefore only three vectors have to be kept in memory at any given time. Nominally, the factor $\beta_i$ requires information from two different iterations, but this can be handled, at least in the case of $\beta^{\mathrm{FR}}$, by storing its numerator and denominator separately. A similar recurrence relation also holds for the preconditioned CG method, see Fig. 3, but here it is a four-term relation due to storage of both the residual $r_i$ and the preconditioned residual $t_i$.

Fig. 2 Three-term recurrence relation of the standard nonlinear CG method. The three terms $r_i$, $d_i$ and $p_i$ can be computed from their predecessors (index $i - 1$) and fully define their successors (index $i + 1$). Each vector is only needed until its successor is computed, so only three vectors have to be kept in memory at any given point in time

Fig. 3 Four-term recurrence relation of the preconditioned nonlinear CG method. The role of the residual $r_i$ is taken over by the preconditioned residual $t_i$. Dashed arrow: implicit dependency of $d_i$ on $r_i$ due to factor $\beta$

The CG method has to store three vectors ($p_i$, $r_i$, and $d_i$), while the preconditioned method has to keep an additional vector $t_i$ in memory, so four vectors in total. We may additionally store the shadow iterate $v_i := Q_{pp}^{-1} p_i$, which represents the progression of $p_i$ if one were to use the original residual $r_i$ instead of $t_i$ to determine the step direction. Here, the name shadow iterate is chosen to reflect the fact that $v_i$ closely follows $p_i$, but in a transformed space. This matrix–vector product $v_i$ is required several times, in evaluations of both the objective function $L$ (linesearch) and its gradient $\nabla_p L$. Taking this idea one step further and also storing the shadow direction $w_i := Q_{pp}^{-1} d_i$ has drastic consequences for the preconditioned CG method: due to the specific preconditioner choice $M = Q_{pp}$ we have

$$r_i = Q_{pp}^{-1} t_i, \quad v_i = Q_{pp}^{-1} p_i, \quad w_i = Q_{pp}^{-1} d_i, \tag{17}$$

i.e., the tuple $(r_i, v_i, w_i)$ is the tuple $(t_i, p_i, d_i)$ multiplied with $Q_{pp}^{-1}$. Each of these tuples can in theory be reconstructed from the other, and storing both doubles the memory requirements of the method. But at the same time, one may use these six vectors to perform an iteration of the method without any multiplication with $Q_{pp}^{-1}$. Crucially, the first term of the objective function is linear least squares, and therefore a caching version that stores $v_i$ and $w_i$ may avoid several matrix–vector products, since for any step length $\alpha \in \mathbb{R}$

$$\|p_{i-1} + \alpha d_i - p^*\|^2_{Q_{pp}^{-1}} = (p_{i-1} + \alpha d_i - p^*)^T (v_{i-1} + \alpha w_i - v^*) \tag{18}$$

holds with the shadow mean $v^* := Q_{pp}^{-1} p^*$. This term is usually zero (when $p^* = 0$) or very cheap to compute (e.g., when the only parts of $p^*$ with non-zero mean are uncorrelated trend components), and can be precomputed and stored when it is not.
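Identity (18) follows directly from the norm definition (8) and the shadow vectors of Eq. (17). Spelled out, with the abbreviation $x := p_{i-1} + \alpha d_i - p^*$ introduced here only for readability:

$$\begin{aligned}
\|x\|^2_{Q_{pp}^{-1}} &= x^T Q_{pp}^{-1} x \\
&= x^T \left(Q_{pp}^{-1} p_{i-1} + \alpha\, Q_{pp}^{-1} d_i - Q_{pp}^{-1} p^*\right) \\
&= x^T \left(v_{i-1} + \alpha w_i - v^*\right).
\end{aligned}$$

The last line involves only vector additions and one scalar product on cached quantities, so a line search can evaluate the prior term for several trial values of $\alpha$ without applying $Q_{pp}^{-1}$ again.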

2.3.3 Resulting Method

Due to the above considerations we can therefore proceed as follows:

1. Compute the gradient of $L$, and thereby the residual $r_i$, using the previous iterate $p_{i-1}$ (for the nonlinear term), its shadow $v_{i-1}$ and $v^*$ (both for the linear term).
2. Multiply $r_i$ with $Q_{pp}$ to obtain the preconditioned residual $t_i$.
3. Calculate $\beta_i$ as usual, then use

$$d_i = t_i + \beta_i d_{i-1} \implies w_i = r_i + \beta_i w_{i-1} \tag{19}$$

   to obtain both the direction $d_i$ and its shadow $w_i$.
4. Perform a linesearch and compute $\alpha_i$ as usual, using $p_{i-1}$, $v_{i-1}$, $d_i$, and $w_i$, and the linearity of the first term of (9), as per Eq. (18), to evaluate $L$ without multiplications with $Q_{pp}^{-1}$.
5. Calculate the new iterate $p_i$ and its shadow $v_i$ using

$$p_i = p_{i-1} + \alpha_i d_i \implies v_i = v_{i-1} + \alpha_i w_i. \tag{20}$$

6. Repeat.

None of the six vectors is needed once its successor has been computed. As a result we arrive at a six-term recurrence relation, Fig. 4, that consists of two separate chains moving in lockstep. At the cost of doubling the memory requirements, multiplications with $Q_{pp}^{-1}$ are no longer needed. In many applications, this increase in memory requirements can be neglected, e.g., because it is dwarfed by the memory needed for circulant-embedding techniques, or because the full forward system state has to be kept in memory for the adjoint equations in transient settings.

This leaves us with four different versions of the nonlinear CG method, summarized in Table 1: the classical nonlinear CG method (unpreconditioned and non-caching), the method described above, and two intermediate versions that employ one of the modifications but not the other. Note that multiplications with $Q_{pp}^{-1}$, as needed by three of the versions, are implemented as an inner (linear) CG method, and therefore one multiplication with $Q_{pp}^{-1}$ is equivalent to several multiplications with $Q_{pp}$ in terms of cost. The concrete number of matrix multiplications may vary

Fig. 4 Six-term recurrence relation of the prior-preconditioned caching CG method. Storing the additional vectors $v_i$ and $w_i$ leads to two mostly parallel chains, which interact during the computation of the residuals $r_i$ and $t_i$, of $\beta$, and of $\alpha$. Dashed arrows: implicit dependencies due to factor $\beta$. Not shown: implicit dependencies of $p_i$ on $v_{i-1}$ and $v_i$ on $p_{i-1}$ due to step width $\alpha$

Table 1 Requirements of considered CG variants

| Nonlinear CG variant | Memory requirements | Matrix multiplications |
|---|---|---|
| Unpreconditioned, non-caching | $3 \times n_p$ | 9–13 × $Q_{pp}^{-1}$ |
| Unpreconditioned, caching | $4 \times n_p$ | 3–5 × $Q_{pp}^{-1}$ |
| Preconditioned, non-caching | $4 \times n_p$ | 1 × $Q_{pp}$, 9–13 × $Q_{pp}^{-1}$ |
| Preconditioned, caching | $6 \times n_p$ | 1 × $Q_{pp}$ |

within the stated bounds due to variability in the number of linesearch steps that are taken (typically one or two). Note that one iteration of the preconditioned caching method is cheaper than one iteration of the classical CG method, and therefore the preconditioner has negative cost. Furthermore, preconditioned step directions tend to be smoother, which may additionally reduce the costs of numerically evaluating the model G and its linearization H.
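The six-vector recurrence of steps 1 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy problem (linear model, $p^* = 0$ and hence $v^* = 0$), the function names, the stopping test, and the unit trial step of the quadratic interpolation are hypothetical. The point to observe is that $Q_{pp}^{-1}$ never appears; the only covariance operation per iteration is one product with $Q_{pp}$.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def axpy(a, x, y):
    """Return a * x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]


# Toy problem (all values hypothetical): linear model G(p) = H p, prior mean 0.
Q_PP = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
H = [1.0, 0.0, 1.0]   # single observation
SIGMA2 = 0.01         # noise variance
Z = 1.0               # observed value


def objective(p, v):
    # Prior term via Eq. (18) with p* = 0, v* = 0: p^T Q_pp^{-1} p = p^T v.
    return 0.5 * dot(p, v) + 0.5 * (dot(H, p) - Z) ** 2 / SIGMA2


def residual(p, v):
    # Negative gradient, Eq. (12), with Q_pp^{-1} (p - p*) replaced by v - v*.
    misfit = (dot(H, p) - Z) / SIGMA2
    return [-(vi + hi * misfit) for vi, hi in zip(v, H)]


def caching_preconditioned_cg(tol=1e-10, max_iter=50):
    n = len(Q_PP)
    p, v = [0.0] * n, [0.0] * n   # iterate and shadow iterate
    d, w = [0.0] * n, [0.0] * n   # direction and shadow direction
    rt_old = 0.0
    for iteration in range(max_iter):
        r = residual(p, v)                            # step 1
        t = matvec(Q_PP, r)                           # step 2
        rt = dot(r, t)
        if rt < tol * tol:                            # assumed stopping test
            return p, v, iteration
        beta = rt / rt_old if rt_old > 0.0 else 0.0   # step 3, Eq. (16)
        d = axpy(beta, d, t)                          # Eq. (19)
        w = axpy(beta, w, r)
        phi0 = objective(p, v)                        # step 4
        slope = -dot(r, d)
        phi1 = objective(axpy(1.0, d, p), axpy(1.0, w, v))
        curvature = phi1 - phi0 - slope
        alpha = -slope / (2.0 * curvature) if curvature > 0.0 else 1.0
        p = axpy(alpha, d, p)                         # step 5, Eq. (20)
        v = axpy(alpha, w, v)
        rt_old = rt
    return p, v, max_iter


p_map, v_map, iterations = caching_preconditioned_cg()
```

At the end of the run, `v_map` should agree with $Q_{pp}^{-1}$ applied to `p_map` up to rounding, which is a useful consistency check when porting the recurrence to a real forward model.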

3 Implementation Details

The methods described in Sect. 2 are implemented as submodules of the Distributed and Unified Numerics Environment (DUNE) [2–4], a modular toolbox for high-performance computing with special focus on the solution of partial differential equations. DUNE is open-source C++ software, with strong emphasis on generic programming (C++ templates) and modern C++ programming techniques (introduced in C++11 and above) to ensure high efficiency and portability of the implemented methods.


3.1 Program Structure

The program consists of four DUNE modules, based on DUNE 2.6. One module each handles

• the generation of stationary Gaussian random fields,
• the automated solution of system equations and their adjoints,
• and the resulting nonlinear optimization problem.

A fourth module is used to

1. tie these three modules together,
2. augment the framework by concrete user-defined system equations (the groundwater flow equation in the given context),
3. and supply prior distribution, measurements with noise, boundary conditions and other user-supplied parameters,

thereby producing an executable for a given problem definition.

Splitting the code base into different submodules makes code reuse easier. One may reuse the field generation code for similar but unrelated projects, as was done in, e.g., [6, 21], both using some version of this field generator module. The same holds for the nonlinear optimization code, which may be reused in other projects in the DUNE ecosystem. Finally, the framework for system equations decouples the inverse modelling code from the concrete forward and adjoint model, thereby keeping the former as general as possible and facilitating its application to different models.

The submodule for field generation, named dune-randomfield, has been released as open source [15]. The package makes heavy use of the FFTW library [10], which is used to perform the discrete Fourier transforms needed for the circulant embedding technique [9]. Because FFTW can only parallelize along one dimension, the module provides functionality for parallel redistribution of random fields, see Sect. 3.2. Input and output of generated fields is supported in VTK and HDF5 file formats.
The submodule for the solution of model PDEs utilizes tags for the automated construction of forward and adjoint simulation runs: the user implements template specializations of a set of classes, namely

• a LocalOperator class, describing the discontinuous Galerkin (dG) or finite volume discretization of the PDE on a single triangulation element,
• classes for boundary conditions and initial conditions (if applicable),
• a class containing the underlying parameterization of the PDE (model parameters), which enables coupling with other PDEs and the parameters that are to be estimated.

Based on a user-defined tag, used as template parameter in all of these classes, the module automatically assembles a list of solver objects for the model equations, and additionally a (reversed) list for the respective adjoint equations. Several of these classes provide default implementations, which can be accessed by simply using the default implementation as a base class of the template specialization (effectively making the tagged specialization an alias for the default implementation). The same approach can be used to provide user-defined classes that can be shared across several models, thereby enabling code reuse and reducing implementation time.

The nonlinear optimization submodule provides a number of well-known methods for nonlinear least squares and general unconstrained nonlinear optimization, among them the Gauss–Newton method (matrix-free using internal CG iterations), the nonlinear Conjugate Gradients method, and the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method in two-loop recursion formulation [22], in each case including the modifications mentioned in Sect. 2.3, i.e., elimination of $Q_{pp}^{-1}$ based on preconditioning in combination with caching mechanisms. All of these methods rely only on function evaluations, gradient evaluations, and scalar products. This means the module is automatically fully parallel as long as these operations work in parallel, which is the case for DUNE built-ins, compare Sect. 3.2.

3.2 Parallelization

One of the major advantages of DUNE is its transparent handling of parallelism. Most user-defined functions and classes operate on a single finite element or interface, etc., and DUNE automatically provides the required information if it resides on a different compute node. This means DUNE modules can be written in a rather parallelization-agnostic manner, since the DUNE solver classes take care of domain decomposition and communication matters, based on the Message Passing Interface (MPI). The same holds for built-in linear algebra, which, e.g., automatically computes parallel scalar products and vector norms.

Nevertheless, the explicit parallel execution of code has to be considered for one specific task. The generation of random fields is based on the MPI-enabled FFTW library [10], which can only parallelize along one dimension. The solution of PDEs usually requires a more flexible approach to domain decomposition, and this means the generated random fields have to be redistributed, since they are used to parameterize the PDEs. Let $p$ be the number of nodes (in this section $p$ denotes a node count, not the parameter vector), and $p = p_x \times p_y$ (in 2D) or $p = p_x \times p_y \times p_z$ (in 3D) the structured parallel distribution of nodes on a Cartesian mesh. To simplify matters, assume that the number of cells per dimension can always be distributed evenly among the desired number of processors. Then, with $x$ being the dimension FFTW uses for parallelization and assuming a fixed ratio between $p_x$ and $p_y$ (and $p_z$ if applicable), the number of nodes that need to receive data from a given node during redistribution is $p_y = O(p^{1/2})$ in 2D and $p_y \times p_z = O(p^{2/3})$ in 3D. The FFTW code has to send data to $O(p)$ nodes during its transpose steps (due to the 1D distribution). The total amount of data sent is the same in both cases, and therefore the time required for redistribution should become negligible when compared with the rising communication overhead of the FFTW transform for large $p$.
In addition to this redistribution to a different grid layout, the values at processor boundaries have to be available on both processors for finite-element and finite-volume codes (so-called overlap). The total number of such neighboring processors is 2 × 2 = 4 in 2D and 2 × 3 = 6 in 3D, i.e., a small fixed number, and the amount of data that has to be sent in each case is significantly smaller than for the redistribution or transpose communications. Therefore, the impact of this additional communication step can be neglected.
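The communication counts discussed in this section can be tabulated for concrete node numbers. The helper below is a hypothetical illustration that assumes the same number of nodes in every dimension ($p_x = p_y = p_z$), which is one special case of the fixed ratio assumed in the text.

```python
def communication_partners(p, dim):
    """Partner counts per node for p nodes on a dim-dimensional Cartesian layout."""
    per_dim = round(p ** (1.0 / dim))
    if per_dim ** dim != p:
        raise ValueError("this sketch needs p to be a perfect square (2D) or cube (3D)")
    return {
        "redistribution": per_dim ** (dim - 1),  # p_y in 2D, p_y * p_z in 3D
        "fftw_transpose": p,                     # O(p), due to the 1D distribution
        "overlap": 2 * dim,                      # direct neighbours per node
    }


for nodes, dim in [(64, 2), (64, 3), (4096, 2), (4096, 3)]:
    print(nodes, dim, communication_partners(nodes, dim))
```

The gap between the redistribution and transpose counts widens as $p$ grows, which is the argument for treating the redistribution cost as secondary.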

4 Numerical Results

In this section the performance of the methods described in Sect. 2.3 is tested on synthetic test data. The test setting is stationary groundwater flow according to Eq. (2) on a 2D domain of size 100 m × 100 m. No-flow conditions are imposed on the left and right boundaries, and constant Dirichlet boundary values on the top and bottom boundaries, with a potential difference of 2 m. Synthetic measurements of the system state $h$ are extracted on an equidistant Cartesian grid, with (10 m, 10 m) its lower left and (90 m, 90 m) its upper right corner. This leaves a 10 m margin to the domain boundary, thereby reducing potential boundary effects.

We use a Gaussian prior distribution for the log-conductivity $Y$, with exponential covariance function and a correlation length of 10 m. A realization of the prior distribution is used as synthetic ground truth, and the resulting synthetic state observations are combined with an appropriate level of uncorrelated noise to simulate measurement error.

The individual test runs differ in the grid resolution of the parameterization and numerical solution, the number of synthetic observations used for inversion, the variance of the prior distribution, and the (synthetic) measurement error. Unless otherwise noted, the grid resolution is $n_p = 128 \times 128$, the measurement grid has dimensions $n_z = 16 \times 16$, the prior variance is 1, and the measurement error follows a normal distribution with a standard deviation of 0.01 m. All tests are run on a server with Intel® Xeon® Silver 4114 CPUs running at 2.2 GHz.

By their very nature, both the measurements and the inversion results are randomly generated, and therefore the required number of iterations, the time to solution, and the final value of $L$ have to be expected to be randomly distributed as well.
For this reason each run is repeated five times with different parameters and measurements, and sample mean and standard error are reported. The individual runs are stopped when the relative change in the objective function value is below $1 \times 10^{-6}$, the $Q_{pp}$-norm of the gradient is below $1 \times 10^{-4}$, or 1000 iterations have been performed.
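The three stopping criteria can be combined in a single check. The sketch below is an assumption-laden illustration: the function name, the definition of the relative change (relative to the previous objective value), and the returned labels are hypothetical. It does rely on one identity that follows from Eq. (8): with $r = -\nabla_p L$ and $t = Q_{pp} r$, the squared $Q_{pp}$-norm of the gradient equals $r^T t$, a quantity the preconditioned method already computes for $\beta^{\mathrm{FR}}$.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def stopping_reason(L_prev, L_curr, r, t, iteration,
                    rel_tol=1e-6, grad_tol=1e-4, max_iter=1000):
    """Return the name of the first satisfied criterion, or None to continue."""
    # Relative change of the objective value (assumed: relative to L_prev).
    if abs(L_prev - L_curr) < rel_tol * abs(L_prev):
        return "value reduction"
    # Squared Q_pp-norm of the gradient: g^T Q_pp g = r^T t.
    if dot(r, t) < grad_tol ** 2:
        return "gradient"
    if iteration >= max_iter:
        return "iteration limit"
    return None
```

Returning the reason rather than a plain boolean makes it easy to record which criterion ended each run.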


4.1 Convergence Speed of Optimization

We begin our investigation with measurements of the number of iterations and time to solution for the four different CG variants in different situations, deferring discussion of the quality of the resulting estimates to Sect. 4.2.

4.1.1 Grid Sizes/Resolution

Table 2 shows the required number of iterations and the time to solution for different grid sizes, for a fixed number of observations of n z = 16 × 16, and the upper left plot of Fig. 5 shows the time to solution divided by the number of parameters, i.e., the required time per degree of freedom. The preconditioned caching CG version is clearly the fastest in all cases. The unpreconditioned caching version comes in second, being approximately half as fast. The non-caching versions are significantly slower, taking approximately four times longer than the fastest version. In half the cases the preconditioned non-caching version is slower than the unpreconditioned one. The number of parameters grows by a factor of 4 from line to line. With the exception of the preconditioned non-caching variant, the time to solution grows by almost the same factor, indicating (optimal) linear complexity in this case. At around n p = 256 × 256, all variants except the preconditioned caching one become drastically less efficient, see Fig. 5. The latter is therefore not only the fastest of the

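Table 2 Number of iterations, time to solution, and time per iteration for different grid sizes. Values in parentheses are based on a single run due to excessive time requirements, and therefore have an unknown error

Unpreconditioned conjugate gradients

| $n_p$ | Non-caching: iterations | Non-caching: $t_{\mathrm{solve}}$ (s) | Non-caching: $t_{\mathrm{iter}}$ (s) | Caching: iterations | Caching: $t_{\mathrm{solve}}$ (s) | Caching: $t_{\mathrm{iter}}$ (s) |
|---|---|---|---|---|---|---|
| $16^2$ | 312 ± 83 | 22.2 ± 5.2 | 0.071 | 312 ± 83 | 13.2 ± 2.9 | 0.043 |
| $32^2$ | 243 ± 33 | 78 ± 11 | 0.32 | 243 ± 33 | 43.6 ± 6.2 | 0.18 |
| $64^2$ | 181 ± 7 | 321.2 ± 8.7 | 1.8 | 181 ± 7 | 175.4 ± 5.8 | 0.97 |
| $128^2$ | 164 ± 14 | 1200 ± 100 | 7.3 | 164 ± 14 | 636 ± 55 | 3.9 |
| $256^2$ | 397 ± 99 | 19800 ± 4300 | 50 | 407 ± 98 | 10000 ± 2100 | 25 |
| $512^2$ | (368) | (161 000) | (440) | (368) | (73 400) | (200) |

Preconditioned conjugate gradients

| $n_p$ | Non-caching: iterations | Non-caching: $t_{\mathrm{solve}}$ (s) | Non-caching: $t_{\mathrm{iter}}$ (s) | Caching: iterations | Caching: $t_{\mathrm{solve}}$ (s) | Caching: $t_{\mathrm{iter}}$ (s) |
|---|---|---|---|---|---|---|
| $16^2$ | 271 ± 24 | 17.6 ± 1.7 | 0.065 | 274 ± 22 | 4.86 ± 0.40 | 0.018 |
| $32^2$ | 323 ± 20 | 90.0 ± 6.3 | 0.28 | 318 ± 19 | 18.9 ± 1.3 | 0.059 |
| $64^2$ | 292 ± 30 | 414 ± 50 | 1.4 | 287 ± 27 | 75.0 ± 7.8 | 0.26 |
| $128^2$ | 306 ± 32 | 1770 ± 180 | 5.8 | 307 ± 31 | 294 ± 28 | 0.96 |
| $256^2$ | 325 ± 19 | 12190 ± 650 | 38 | 327 ± 20 | 1233 ± 81 | 3.8 |
| $512^2$ | (319) | (123 000) | (390) | 314 ± 22 | 4950 ± 350 | 16 |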

Fig. 5 Upper left: Time per parameter value as a function of $n_p$ for the unpreconditioned non-caching and caching, and the preconditioned non-caching and caching versions of nonlinear CG. Number of measurements is fixed at 16 × 16. Disconnected marks are based on single runs and have unknown error. Upper right: Time to solution as a function of measurement density $n_z/n_p$ (same four variants). Number of parameters is fixed at 128 × 128. Lower left: Time to solution as a function of prescribed measurement error (same four variants). Lower right: Time to solution as a function of prior variance (same four variants). The preconditioned caching variant is clearly the fastest in almost all cases

considered methods, but also the only one that retains optimal complexity over the full range of considered values for n_p.
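The complexity claim can be checked directly from the timings in Table 2. The following is a minimal sketch, not part of the original study: it divides the reported mean t_solve by n_p for two of the four variants and compares the spread. The dictionary and function names are illustrative assumptions; the numbers are the means from Table 2, with error bars ignored and the single-run value for 512 × 512 used as is.

```python
# Mean time to solution t_solve (s) from Table 2, keyed by the grid size
# per dimension n (so that n_p = n * n). Error bars are ignored here.
t_solve_precond_caching = {
    16: 4.86, 32: 18.9, 64: 75.0, 128: 294.0, 256: 1233.0, 512: 4950.0,
}
t_solve_unprecond_noncaching = {
    16: 22.2, 32: 78.0, 64: 321.2, 128: 1200.0, 256: 19800.0, 512: 161000.0,
}


def time_per_parameter(t_solve_by_n):
    """Return {n_p: t_solve / n_p} for timings keyed by grid size per dimension."""
    return {n * n: t / (n * n) for n, t in t_solve_by_n.items()}


def spread(cost_by_np):
    """Ratio between the largest and smallest cost per parameter."""
    return max(cost_by_np.values()) / min(cost_by_np.values())


cost_pc = time_per_parameter(t_solve_precond_caching)
cost_un = time_per_parameter(t_solve_unprecond_noncaching)

for n_p, cost in cost_pc.items():
    print(f"n_p = {n_p:7d}   t_solve / n_p = {cost:.4f} s")

print(f"spread, preconditioned caching:       {spread(cost_pc):.2f}")
print(f"spread, unpreconditioned non-caching: {spread(cost_un):.2f}")
```

A spread close to one means that the cost per parameter is essentially independent of n_p, which is what optimal (linear) complexity amounts to in this setting.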

4.1.2 Number of Measurements

Table 3 shows the number of iterations and required time for different numbers of observations, with a fixed parameterization of size n_p = 128 × 128, and the upper right plot of Fig. 5 shows the time to solution as a function of the measurement density n_z/n_p. Again, the preconditioned caching CG version is the fastest, except for the last case with an excessively large number of measurements. For smaller numbers of measurements, as is typical for real-world field experiments, the preconditioned versions are over an order of magnitude faster than the unpreconditioned ones. At around n_z = 16 × 16 the unpreconditioned versions start to become competitive, which provides an explanation for the observations concerning the two non-caching versions in Table 2. The main reason for this development is the rising number of iterations for the preconditioned versions. The preconditioned and unpreconditioned versions mostly terminate for different reasons (gradient condition versus value reduction condition), so it may be worthwhile to consider other stopping criteria instead.

Table 3 Number of iterations, time to solution t_solve, and time per iteration t_iter for different numbers of measurements n_z.

Unpreconditioned conjugate gradients

| n_z | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 × 2 | 200 ± 14 | 1364 ± 92 | 6.8 | 200 ± 14 | 716 ± 48 | 3.6 |
| 4 × 4 | 530 ± 190 | 3300 ± 1100 | 6.2 | 530 ± 190 | 1760 ± 580 | 3.3 |
| 8 × 8 | 360 ± 160 | 2200 ± 810 | 6.1 | 360 ± 160 | 1200 ± 430 | 3.3 |
| 16 × 16 | 163 ± 13 | 1170 ± 87 | 7.2 | 163 ± 13 | 636 ± 48 | 3.9 |
| 32 × 32 | 208 ± 19 | 1500 ± 130 | 7.2 | 208 ± 19 | 823 ± 73 | 4.0 |
| 64 × 64 | 282 ± 19 | 2200 ± 130 | 7.8 | 282 ± 19 | 1265 ± 79 | 4.5 |
| 128 × 128 | 340 ± 22 | 3600 ± 260 | 11 | 340 ± 22 | 2300 ± 180 | 6.8 |

Preconditioned conjugate gradients

| n_z | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 × 2 | 18.4 ± 2.0 | 98 ± 10 | 5.3 | 18.4 ± 1.8 | 15.9 ± 1.4 | 0.86 |
| 4 × 4 | 48.2 ± 3.6 | 254 ± 15 | 5.2 | 48.4 ± 3.0 | 42.1 ± 2.0 | 0.87 |
| 8 × 8 | 184 ± 15 | 1080 ± 85 | 5.9 | 184 ± 15 | 176 ± 14 | 0.96 |
| 16 × 16 | 273 ± 23 | 1550 ± 140 | 5.5 | 274 ± 21 | 264 ± 22 | 0.96 |
| 32 × 32 | 494 ± 29 | 3100 ± 200 | 6.3 | 494 ± 29 | 570 ± 40 | 1.2 |
| 64 × 64 | 630 ± 88 | 4320 ± 630 | 6.9 | 600 ± 110 | 1030 ± 120 | 1.7 |
| 128 × 128 | 866 ± 90 | 7670 ± 780 | 8.9 | 858 ± 76 | 3100 ± 260 | 3.6 |
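The trend described in this subsection can be made explicit by forming the ratio of the two caching timings for each number of measurements. The sketch below is illustrative only: the variable and function names are assumptions, and the values are the mean t_solve entries of Table 3 without their error bars.

```python
# Mean t_solve (s) of the two caching variants from Table 3, keyed by the
# number of measurements per dimension m (so that n_z = m * m).
t_unprecond_caching = {
    2: 716.0, 4: 1760.0, 8: 1200.0, 16: 636.0, 32: 823.0, 64: 1265.0, 128: 2300.0,
}
t_precond_caching = {
    2: 15.9, 4: 42.1, 8: 176.0, 16: 264.0, 32: 570.0, 64: 1030.0, 128: 3100.0,
}


def speedup(t_reference, t_new):
    """Factor by which t_new beats t_reference; values above 1 mean t_new is faster."""
    return {m: t_reference[m] / t_new[m] for m in t_reference}


ratios = speedup(t_unprecond_caching, t_precond_caching)
for m, r in ratios.items():
    verdict = "preconditioned faster" if r > 1.0 else "unpreconditioned faster"
    print(f"n_z = {m:3d} x {m:<3d}  speedup = {r:6.2f}  ({verdict})")
```

With these numbers the speedup exceeds a factor of 40 for the two smallest measurement sets, shrinks as n_z grows, and drops below one only for n_z = 128 × 128.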

4.1.3 Measurement Error

Table 4 shows the number of iterations and time to solution for different measurement noise levels, and the lower left plot of Fig. 5 shows the time to solution as a function of the measurement error. The reported measurement errors are the standard deviation of the measurement noise. The level of noise strongly influences the difficulty of the inverse problem, as expected. Choosing a standard deviation close to one, i.e., in effect assuming a measurement error that is on the scale of the natural variability of the observed quantity, leads to a very simple problem. But at the same time very little is learned about the parameterization, of course. Lower values for the measurement error lead to significantly more iterations and, at least in the case of the unpreconditioned versions, to higher variability in the time measurements. The low numbers of iterations for very low measurement errors occur due to premature termination of the optimization; see the discussion in the next section. All of the methods fail for the two smallest values of the measurement error, but this is not readily apparent when only looking at iteration numbers and timings.

Table 4 Number of iterations, time to solution t_solve, and time per iteration t_iter for different measurement noise levels.

Unpreconditioned conjugate gradients

| Meas. err. | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1e-4 | 161 ± 56 | 940 ± 300 | 5.8 | 161 ± 56 | 490 ± 160 | 3.0 |
| 3.16e-4 | 340 ± 170 | 2300 ± 1100 | 6.8 | 340 ± 170 | 1200 ± 560 | 3.5 |
| 1e-3 | 700 ± 140 | 4800 ± 970 | 6.9 | 700 ± 140 | 2640 ± 520 | 3.8 |
| 3.16e-3 | 480 ± 53 | 3470 ± 390 | 7.2 | 480 ± 53 | 1870 ± 210 | 3.9 |
| 1e-2 | 184.4 ± 6.4 | 1335 ± 45 | 7.2 | 184.4 ± 6.4 | 721 ± 23 | 3.9 |
| 3.16e-2 | 129.2 ± 4.0 | 930 ± 38 | 7.2 | 129.2 ± 4.0 | 488 ± 19 | 3.8 |
| 1e-1 | 90 ± 14 | 630 ± 100 | 7.0 | 90 ± 14 | 337 ± 54 | 3.7 |
| 3.16e-1 | 32.0 ± 8.7 | 213 ± 59 | 6.7 | 32.0 ± 8.7 | 114 ± 30 | 3.6 |
| 1 | 24 ± 13 | 161 ± 85 | 6.7 | 24 ± 13 | 85 ± 44 | 3.5 |

Preconditioned conjugate gradients

| Meas. err. | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1e-4 | 560 ± 83 | 3200 ± 460 | 5.7 | 541 ± 79 | 515 ± 74 | 0.95 |
| 3.16e-4 | 437 ± 76 | 2600 ± 480 | 5.9 | 420 ± 74 | 415 ± 75 | 0.99 |
| 1e-3 | 660 ± 100 | 3760 ± 550 | 5.7 | 620 ± 110 | 600 ± 100 | 0.97 |
| 3.16e-3 | 316 ± 36 | 1830 ± 230 | 5.8 | 308 ± 37 | 297 ± 38 | 0.96 |
| 1e-2 | 317 ± 29 | 1950 ± 210 | 6.2 | 319 ± 29 | 327 ± 31 | 1.0 |
| 3.16e-2 | 144.2 ± 4.9 | 863 ± 45 | 6.0 | 145.2 ± 5.0 | 144.0 ± 6.9 | 0.99 |
| 1e-1 | 46.2 ± 0.8 | 274.4 ± 4.5 | 5.9 | 46.2 ± 0.8 | 46.26 ± 0.78 | 1.0 |
| 3.16e-1 | 16.4 ± 0.7 | 89.0 ± 3.1 | 5.4 | 16.4 ± 0.7 | 15.81 ± 0.51 | 0.96 |
| 1 | 6.0 ± 0.3 | 35.5 ± 1.8 | 5.9 | 6.0 ± 0.3 | 6.68 ± 0.34 | 1.1 |

4.1.4 Prior Variance

Finally, Table 5 shows the number of iterations and time to solution for different values of the prior variance for the log-conductivity Y, and the lower right plot of Fig. 5 shows the time to solution as a function of prior variance. The objective function L, Eq. (9), is linear in both the parameter variance and the measurement variance. In principle, scaling up one of them is therefore equivalent to scaling down the other, i.e., shifting emphasis between the prior part and the measurement part of the objective function. However, there is a subtle difference: increasing the prior variance leads to higher heterogeneity in the initial parameterization that is used to obtain synthetic measurements. In other words, the forward problem distinguishes between scaling of the parameter uncertainty and the measurement uncertainty, even if the inverse problem doesn’t. This is a direct consequence of the increasing effect of nonlinearity when sampling over a larger range of parameter values. For this reason it is sensible to look at prior variance variations in their own right. The results are similar to those in Table 4, but there is a difference: decreasing the measurement error leads to hard, but well-defined, inverse problems, while increasing the prior variance beyond a certain point leads to failure of the numerical solver and therefore failure of the synthetic forward problem. The preconditioned versions are again more stable for more difficult problems, and the preconditioned caching version is again clearly the fastest.

Table 5 Number of iterations, time to solution t_solve, and time per iteration t_iter for different values of prior variance.

Unpreconditioned conjugate gradients

| Variance | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1e2 | 640 ± 190 | 3800 ± 1100 | 5.9 | 640 ± 190 | 2040 ± 580 | 3.2 |
| 1e1 | 416 ± 74 | 2820 ± 470 | 6.8 | 416 ± 74 | 1520 ± 250 | 3.7 |
| 1 | 185 ± 16 | 1300 ± 110 | 7.0 | 185 ± 16 | 697 ± 60 | 3.8 |
| 1e-1 | 163.0 ± 5.1 | 1153 ± 35 | 7.1 | 163.0 ± 5.1 | 610 ± 18 | 3.7 |
| 1e-2 | 198 ± 14 | 1373 ± 85 | 6.9 | 198 ± 14 | 719 ± 43 | 3.6 |
| 1e-3 | 252 ± 11 | 1594 ± 71 | 6.3 | 252 ± 11 | 830 ± 35 | 3.3 |
| 1e-4 | 176.4 ± 5.0 | 889 ± 22 | 5.0 | 176.4 ± 5.0 | 453 ± 13 | 2.6 |

Preconditioned conjugate gradients

| Variance | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1e2 | 420 ± 160 | 2120 ± 780 | 5.0 | 420 ± 160 | 370 ± 140 | 0.88 |
| 1e1 | 710 ± 120 | 3910 ± 680 | 5.5 | 720 ± 120 | 640 ± 100 | 0.89 |
| 1 | 375 ± 29 | 2180 ± 180 | 5.8 | 360 ± 28 | 345 ± 29 | 0.96 |
| 1e-1 | 106.2 ± 2.8 | 618 ± 23 | 5.8 | 103.8 ± 3.6 | 102.1 ± 4.1 | 0.98 |
| 1e-2 | 32.2 ± 1.1 | 184.9 ± 8.1 | 5.7 | 32.4 ± 1.2 | 32.1 ± 1.6 | 0.99 |
| 1e-3 | 11.6 ± 0.6 | 66.0 ± 2.8 | 5.7 | 11.6 ± 0.6 | 11.76 ± 0.45 | 1.0 |
| 1e-4 | 4.6 ± 0.2 | 28.6 ± 1.5 | 6.2 | 4.6 ± 0.2 | 5.60 ± 0.27 | 1.2 |

4.1.5 Summary

Having considered several different scenarios to compare the different methods, we may come to the conclusion that the prior-preconditioned caching variant of the nonlinear CG method is the preferred version in almost all of the considered situations, being significantly faster than the other versions at the cost of a moderate increase in memory requirements. The non-caching preconditioned CG and the caching unpreconditioned version are typically faster than the naive unpreconditioned non-caching one, but still significantly slower than the fastest method. It is especially noteworthy that preconditioning without the caching mechanism may even lead to worse performance than in the unpreconditioned case.


4.2 Accuracy of Inversion Results

We complement the discussion of efficiency in the previous section by looking at the accuracy of the results obtained through the various methods. The two unpreconditioned versions generally produce the same results, and the two preconditioned versions are very similar in their output; therefore only the two caching versions are considered in the following discussion. There are methods that can treat the parameter estimate in a Bayesian setting and rigorously check whether the estimate and its uncertainty fit the underlying assumptions [16]. However, in the current setting we are more interested in the quality of the inversion result as a deterministic parameter estimate, and therefore consider the minimum of the objective function, the root mean square error (RMSE) of the parameter estimate, and the mean absolute error (MAE) of the parameter estimate as indicators.

For uncorrelated measurement residuals, the minimum of the objective function is expected to follow a χ²-distribution with n_z degrees of freedom. Therefore it has a mean value of n_z and a standard deviation of n_z^{1/2}, which implies a standard error of (1/5 × n_z)^{1/2}, since our sample size is five. In geostatistical inverse problems like the one considered here, the final measurement residuals are generally correlated, even when the original measurements aren’t, and therefore the minimum of L will at most follow such a distribution with a reduced number n_{z,eff} < n_z of degrees of freedom. However, this number n_{z,eff} is generally unknown, and therefore it is customary to replace it with the original n_z in discussions of the expected value of L(p_MAP).

The root mean square error (RMSE) and the mean absolute error (MAE) are defined as

$$\mathrm{RMSE}(p^{\mathrm{MAP}}) = n_p^{-1/2} \times \left[ \sum_{i=1}^{n_p} \left( p_i^{\mathrm{MAP}} - p_i^{\mathrm{true}} \right)^2 \right]^{1/2} \tag{21}$$

and

$$\mathrm{MAE}(p^{\mathrm{MAP}}) = n_p^{-1} \times \sum_{i=1}^{n_p} \left| p_i^{\mathrm{MAP}} - p_i^{\mathrm{true}} \right|, \tag{22}$$

where p_i^MAP are the components of the MAP estimate p_MAP, and p_i^true are the components of the synthetic input field that should be estimated.
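The two error indicators of Eqs. (21) and (22) translate directly into code. The following minimal sketch uses plain Python lists; the function names and the four-component toy fields are assumptions made for illustration and are not taken from the paper's DUNE-based implementation.

```python
import math


def rmse(p_map, p_true):
    """Root mean square error, Eq. (21): sqrt of the mean squared deviation."""
    n_p = len(p_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p_map, p_true)) / n_p)


def mae(p_map, p_true):
    """Mean absolute error, Eq. (22): mean of the absolute deviations."""
    n_p = len(p_true)
    return sum(abs(a - b) for a, b in zip(p_map, p_true)) / n_p


# Hypothetical toy fields with n_p = 4 parameters (not data from the paper).
p_true = [0.0, 1.0, -1.0, 2.0]
p_map = [0.5, 1.0, -2.0, 2.5]

print(f"RMSE = {rmse(p_map, p_true):.4f}")
print(f"MAE  = {mae(p_map, p_true):.4f}")
```

For any pair of fields the MAE never exceeds the RMSE, and the gap widens when a few components deviate strongly, so reporting both gives a rough indication of whether the error is spread evenly or concentrated in a few parameters.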

4.2.1 Grid Sizes/Resolution

The left column of Fig. 6 shows the final objective function value, the RMSE of the estimate, and the MAE of the estimate for different grid sizes. The unpreconditioned CG method seems to be particularly accurate on coarse meshes, and the results become slightly worse on finer grids. The indicators don’t show any clear dependence on the grid size. The minimum value of the objective function is in the expected range for all runs, with a slight underestimation especially on coarser grids for all methods, and a stronger overestimation on finer grids for the unpreconditioned ones.

4.2.2 Number of Measurements

The right column of Fig. 6 shows the final objective function value, the RMSE of the estimate, and the MAE of the estimate for different numbers of measurements. In contrast to the left column, the indicators are very sensitive to the number of measurements n_z, due to the fact that a larger n_z implies more knowledge about the system state is contained in the observations. The RMSE and MAE of the preconditioned CG method tend to be lower than those of the unpreconditioned version, indicating better inversion results in addition to faster convergence (compare Sect. 4.1). The minimum value of L is always in the expected range, and typically lower for the preconditioned version.

4.2.3 Measurement Error

The left column of Fig. 7 shows the final objective function value, the RMSE of the estimate, and the MAE of the estimate for different measurement noise levels. The minimum value of the objective function is in the expected range for all but the smallest three levels, with the smallest two failing completely. For the remaining levels the value of the preconditioned version is typically the lower of the two. The same holds for the RMSE and MAE values. For both methods one can see a trend to lower values for smaller error levels. Low measurement noise means that more information is retained in the observations, which translates into a better parameter estimate. However, these values remain low when the preconditioned method fails, indicating that the estimate is nevertheless a good representation of the initial parameterization, while the values for the unpreconditioned variant are significantly higher.

4.2.4 Prior Variance

The right column of Fig. 7 shows the final objective function value, the RMSE of the estimate, and the MAE of the estimate for different values of prior variance. Again, the RMSE and MAE values for the preconditioned versions are typically lower than for the unpreconditioned versions. The values for very low and for very high variance are higher than those for intermediate values, which may suggest strong emphasis on the prior for the former and a difficult inverse problem for the latter. The final function values are generally in the expected range for the preconditioned variants. In contrast, the values of the unpreconditioned versions rise sharply when the prior variance is increased beyond a certain point, showing that these variants have more difficulty handling cases with limited prior knowledge.

Fig. 6 Left column (panel "Grid Sizes / Resolution"): The minimum encountered objective function value L_MAP, RMSE, and MAE as a function of n_p, for the unpreconditioned caching and the preconditioned caching versions of nonlinear CG. Right column (panel "Number of Measurements"): The same information as a function of measurement density n_z/n_p. The preconditioned versions generally achieve values of the objective function in the expected range, while the unpreconditioned versions begin to struggle for high resolutions and low measurement densities, respectively. The values of the RMSE and MAE indicators seem to be mostly independent of the chosen grid resolution, while there is a strong dependency on the measurement density, with lower numbers of measurements providing less information for the reconstruction of the synthetic ground truth. According to all three indicators, the estimates produced by the preconditioned versions are as good as or better than those of the unpreconditioned versions.

Fig. 7 Left column (panel "Measurement Error"): The minimum encountered objective function value L_MAP, RMSE, and MAE as a function of prescribed measurement error, for the unpreconditioned caching and the preconditioned caching versions of nonlinear CG. Right column (panel "Prior Variance"): The same information as a function of prior variance. Low measurement errors lead to very hard problems, and both the unpreconditioned and preconditioned versions fail for measurement errors below 0.001. This may in part be caused by discretization errors making it impossible to reach the prescribed accuracy. The unpreconditioned versions also fail for large prior variances, but the preconditioned versions reach the expected range of objective function values even for very large values of prior variance. The values of the RMSE and MAE indicators again rise due to lack of information for the cases of high measurement error and low prior variance, but they also rise for cases with high prior variance, reflecting the increasing influence of nonlinearities.

4.2.5 Summary

In all considered cases the preconditioned versions tended to have lower RMSE and MAE values, and the final objective function values were lower, more often in accordance with the theoretical values, and more consistent across different runs. Together with the timings discussed in the previous section we may come to the conclusion that preconditioning and caching improve convergence speed in almost all cases, sometimes drastically so, and without negative effects on the quality of the produced inversion results.

4.3 Results for 3D Test Case

This short section serves as an investigation into the applicability of the proposed method for full three-dimensional problems. The 3D test case is identical to the original 2D one except for the underlying domain and the placement of the synthetic observations. Instead of a square of size 100 m × 100 m, we consider a cube of size 100 m × 100 m × 100 m, with no-flow conditions on all but two sides, and as before a potential difference of 2 m between the top and bottom boundaries. The measurement grid is extended to a third dimension in a natural fashion, again with a 10 m margin against the domain boundary. Table 6 shows the required number of iterations, time to solution, and minimum obtained value of the objective function for the four different CG versions, and Fig. 8 shows the time per parameter value and the minimum objective function value as functions of n_p, as in Figs. 5 and 6. The results are qualitatively the same as in the 2D test case. The time per parameter value is not quite constant for the preconditioned caching version, since the numerical solver for the model equation has not yet reached asymptotics on these relatively coarse grids. The cost of all other versions rises sharply in the number of parameters n_p. The objective function values fail to reach the expected range for very low resolutions due to discretization errors, are in the expected range for intermediate and fine grids, and begin to rise for finer resolutions. Therefore, further investigations into the robustness and general behavior on highly resolved three-dimensional grids are warranted.

4.4 Parallel Efficiency

Returning to the original 2D test case, in this last section the parallelizability of the four CG variants is investigated. The forward model and the optimization schemes are parallelized using domain decomposition based on MPI, as described in Sect. 3.2. Here we only consider decompositions of the domain into squares that result in powers of two as the number of processors per dimension, namely 4 = 2 × 2 and 16 = 4 × 4 processors.

Table 6 Number of iterations, time to solution t_solve, and final value of the objective function L_MAP for different grid sizes n_p, for a 3D version of the test problem with 4 × 4 × 4 equidistant measurements. Values in parentheses are based on a single run due to excessive time requirements, and therefore have an unknown error.

Unpreconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Caching iterations | Caching t_solve (s) | L_MAP |
| --- | --- | --- | --- | --- | --- |
| 23 × 23 × 23 | 10.4 ± 2.2 | 34.1 ± 7.2 | 10.4 ± 2.2 | 25.0 ± 5.2 | 41.3 ± 2.7 |
| 32 × 32 × 32 | 11.8 ± 0.6 | 92.1 ± 3.6 | 11.8 ± 0.6 | 67.2 ± 2.8 | 30.1 ± 1.9 |
| 45 × 45 × 45 | 8.8 ± 2.1 | 235 ± 56 | 8.8 ± 2.1 | 160 ± 38 | 30.2 ± 2.0 |
| 64 × 64 × 64 | 15.6 ± 3.0 | 3010 ± 630 | 15.6 ± 3.0 | 1610 ± 320 | 33.2 ± 4.1 |
| 91 × 91 × 91 | 23.6 ± 3.3 | 9900 ± 1400 | 23.6 ± 3.3 | 5800 ± 850 | 36.1 ± 4.5 |
| 128 × 128 × 128 | (40) | (125 000) | (33) | (47500) | (46.9/43.6) |

Preconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Caching iterations | Caching t_solve (s) | L_MAP |
| --- | --- | --- | --- | --- | --- |
| 23 × 23 × 23 | 6.0 ± 0.3 | 17.57 ± 0.93 | 6.0 ± 0.3 | 9.59 ± 0.48 | 39.7 ± 3.0 |
| 32 × 32 × 32 | 5.2 ± 0.2 | 37.1 ± 1.6 | 5.2 ± 0.2 | 19.53 ± 0.75 | 29.7 ± 1.8 |
| 45 × 45 × 45 | 5 | 124.4 ± 2.3 | 5 | 56.0 ± 1.2 | 28.4 ± 2.1 |
| 64 × 64 × 64 | 5.6 ± 0.2 | 993 ± 62 | 5.6 ± 0.2 | 192.1 ± 8.9 | 30.7 ± 3.4 |
| 91 × 91 × 91 | 6.2 ± 0.2 | 2454 ± 84 | 6.2 ± 0.2 | 643 ± 25 | 33.0 ± 4.2 |
| 128 × 128 × 128 | (7) | (18700) | 7 | 1843 ± 18 | 35.5 ± 2.2 |

Fig. 8 Left: Time per parameter value t_solve/n_p as a function of n_p for the 3D test case, for the unpreconditioned non-caching and caching, and the preconditioned non-caching and caching versions of nonlinear CG. Number of measurements is fixed at 4 × 4 × 4. Disconnected marks are based on single runs and have unknown error. Right: Minimum observed value of the objective function L_MAP as a function of n_p (same key). The values coincide for the two preconditioned and for the two unpreconditioned versions, respectively, in all cases but one.

Table 7 Number of iterations, time to solution t_solve, and time per iteration t_iter for different grid sizes n_p, using domain decomposition on four cores.

Unpreconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 × 32 | 254 ± 25 | 187 ± 18 | 0.74 | 254 ± 25 | 88.4 ± 8.6 | 0.35 |
| 64 × 64 | 200 ± 12 | 310 ± 17 | 1.6 | 200 ± 12 | 167.9 ± 9.4 | 0.84 |
| 128 × 128 | 165 ± 11 | 645 ± 46 | 3.9 | 165 ± 6 | 338 ± 13 | 2.0 |
| 256 × 256 | 295 ± 4 | 5120 ± 110 | 17 | 248 ± 29 | 2130 ± 250 | 8.6 |

Preconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 × 32 | 324 ± 15 | 210 ± 16 | 0.65 | 321 ± 18 | 17.2 ± 1.2 | 0.054 |
| 64 × 64 | 380 ± 42 | 498 ± 60 | 1.3 | 382 ± 41 | 92 ± 11 | 0.24 |
| 128 × 128 | 290 ± 49 | 900 ± 150 | 3.1 | 293 ± 48 | 138 ± 21 | 0.47 |
| 256 × 256 | 236 ± 54 | 3380 ± 780 | 14 | 289 ± 21 | 431 ± 25 | 1.5 |
| 512 × 512 | - | - | - | 368 ± 39 | 1870 ± 190 | 5.1 |

Fig. 9 Time per parameter value t_solve/n_p as a function of n_p in a parallel setting, for the unpreconditioned non-caching and caching, and the preconditioned non-caching and caching versions of nonlinear CG. Number of measurements is again 16 × 16. Left: parallel run on 4 processors. Right: parallel run on 16 processors.

Table 7 shows the same setup as Table 2, i.e., timings of the solvers as a function of grid size, but this time running in parallel on four processor cores. The results are generally of the same quality as in the sequential case, therefore their discussion as in Sect. 4.2 will be omitted for brevity. Table 7 and Fig. 9 show the same general behavior as in the sequential case, with one difference: some of the test cases are too coarse, and the incurred communication overhead results in speedups below one (i.e., “speed-downs”) when compared to the sequential setting. On finer meshes, the preconditioned caching variant achieves a speedup of 2.75(11), or, in other words, a parallel efficiency of 68.8%. Table 8 shows the same setup running in parallel on 16 cores. The results are qualitatively the same as for four cores. Due to the increase in compute cores and the resulting smaller domain size per core, a slightly higher grid resolution is required to arrive at acceptable speedups. On finer meshes, the preconditioned caching variant is sped up by a factor of 10, i.e., we achieve 62.5% parallel efficiency.
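The efficiency figures quoted above follow from the usual definition of parallel efficiency as speedup divided by the number of cores. The short sketch below reproduces them; the function names are illustrative assumptions, and the timings in the second part are the mean t_solve values of the preconditioned caching variant at n_p = 512 × 512 from Tables 2 and 8, without error bars.

```python
def speedup(t_sequential, t_parallel):
    """Speedup of a parallel run relative to the sequential run."""
    return t_sequential / t_parallel


def parallel_efficiency(speedup_factor, n_cores):
    """Parallel efficiency: achieved speedup as a fraction of the ideal n_cores."""
    return speedup_factor / n_cores


# Speedups as quoted in the text for the preconditioned caching variant.
eff_4 = parallel_efficiency(2.75, 4)
eff_16 = parallel_efficiency(10.0, 16)
print(f"4 cores:  efficiency = {eff_4:.1%}")
print(f"16 cores: efficiency = {eff_16:.1%}")

# Cross-check for 16 cores at n_p = 512 x 512: 4950 s sequential (Table 2)
# versus 491 s on 16 cores (Table 8).
s_16 = speedup(4950.0, 491.0)
print(f"speedup from tables: {s_16:.1f}")
```

A speedup below one, as observed on the coarsest grids, corresponds to an efficiency below 1/n_cores, i.e., the parallel run is slower than the sequential one.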

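Table 8 Number of iterations, time to solution t_solve, and time per iteration t_iter for different grid sizes n_p, using domain decomposition on 16 cores. Entries marked with (∗): based on three runs instead of five due to excessive time requirements.

Unpreconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 × 32 | 228 ± 17 | 225 ± 16 | 0.99 | 228 ± 17 | 102.9 ± 7.5 | 0.45 |
| 64 × 64 | 201 ± 6 | 357 ± 11 | 1.8 | 201 ± 6 | 180.7 ± 5.8 | 0.90 |
| 128 × 128 | 160 ± 8 | 533 ± 30 | 3.3 | 163 ± 9 | 291 ± 15 | 1.8 |
| 256 × 256 | 245 ± 43 | 2010 ± 370 | 8.2 | 238 ± 47 | 970 ± 190 | 4.1 |
| 512 × 512 | 502 ± 71 (∗) | 15400 ± 2200 (∗) | 31 | 594 ± 218 (∗) | 7800 ± 2400 (∗) | 13 |

Preconditioned conjugate gradients

| n_p | Non-caching iterations | Non-caching t_solve (s) | Non-caching t_iter (s) | Caching iterations | Caching t_solve (s) | Caching t_iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 × 32 | 351 ± 27 | 307 ± 14 | 0.87 | 355 ± 25 | 16.82 ± 0.79 | 0.047 |
| 64 × 64 | 277 ± 34 | 430 ± 58 | 1.6 | 280 ± 34 | 59.2 ± 8.4 | 0.21 |
| 128 × 128 | 289 ± 14 | 812 ± 36 | 2.8 | 292 ± 15 | 141.5 ± 6.5 | 0.48 |
| 256 × 256 | 285 ± 28 | 2020 ± 220 | 7.1 | 288 ± 23 | 231 ± 22 | 0.80 |
| 512 × 512 | 267 ± 44 (∗) | 7000 ± 1100 (∗) | 26 | 249 ± 25 | 491 ± 49 | 2.0 |
| 1024 × 1024 | - | - | - | 257 ± 17 | 1700 ± 120 | 6.6 |
| 2048 × 2048 | - | - | - | 313 ± 26 | 8800 ± 620 | 28 |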

5 Summary and Conclusions

The paper presents an implementation of a framework for quasi-linear Bayesian inverse modelling. This framework consists of several submodules of the Distributed and Unified Numerics Environment (DUNE). The modular structure and generic implementation make extensions of the framework to other areas of application particularly easy. A specialized version of the nonlinear Conjugate Gradients (CG) method is introduced, which combines prior preconditioning with caching mechanisms to drastically increase the convergence speed of the numerical optimization scheme. The central idea is a six-term recurrence relation, replacing the conventional three-term recurrence relation of CG methods, that eliminates several costly operations that are part of the standard formulation. This nonlinear CG method is a generalization of a method by the same author, previously discussed in a more restricted form tailored to a concrete application. Extensions to, e.g., the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) method or Jacobian-free Gauss–Newton are straightforward. Several synthetic test cases are considered to investigate the efficiency and applicability of the new method. The results are favorable, showing that the combination of prior preconditioning and reformulation of the optimization problem leads to faster convergence in almost all cases. Not only is the required number of iterations decreased, but the computational cost of each individual iteration is reduced, i.e., the combination of prior preconditioning and caching results in a preconditioner with negative cost. Additionally, the new method is more robust, and provides more accurate results over a wide range of test cases. The gain in performance is especially large for fine computational meshes and low measurement densities. Here, reductions up to two orders of magnitude are observed. Parallel test runs demonstrate that the method is sufficiently parallelizable and retains its properties in a parallel setting.

Acknowledgements

The first steps towards what would turn into the discussed generic framework were undertaken as part of the Project TOMOME supported by the Federal Ministry of Education and Research of Germany (BMBF), Grant No. 03G0742B. The main development effort for the current implementation took place as part of the research alliance “Data-Integrated Simulation Science” between the universities of Stuttgart and Heidelberg, funded by the MWK Baden-Württemberg, AZ: 7533.-30-20/5/1. The financial support of the funding agencies is gratefully acknowledged. The author would also like to thank Wolfgang Nowak and an anonymous reviewer, whose insightful comments helped improve the final version of this manuscript.

References

1. Alcolea, A., Carrera, J., Medina, A.: Pilot points method incorporating prior information for solving the groundwater flow inverse problem. Adv. Water Resour. 29(11), 1678–1689 (2006)
2. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klöfkorn, R., Kornhuber, R., Ohlberger, M., Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part II: Implementation and tests in DUNE. Computing 82(2–3), 121–138 (2008)
3. Bastian, P., Blatt, M., Dedner, A., Engwer, C., Klöfkorn, R., Ohlberger, M., Sander, O.: A generic grid interface for parallel and adaptive scientific computing. Part I: Abstract framework. Computing 82(2–3), 103–119 (2008)
4. Bastian, P., Blatt, M., Engwer, C., Dedner, A., Klöfkorn, R., Kuttanikkad, S., Ohlberger, M., Sander, O.: The distributed and unified numerics environment (DUNE). In: Proceedings of the 19th Symposium on Simulation Technique in Hannover (2006)
5. Bui-Thanh, T., Burstedde, C., Ghattas, O., Martin, J., Stadler, G., Wilcox, L.C.: Extreme-scale UQ for Bayesian inverse problems governed by PDEs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, p. 3. IEEE Computer Society Press (2012)
6. Chávez, G., Turkiyyah, G., Zampini, S., Keyes, D.: Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients. J. Comput. Appl. Math. 344, 760–781 (2018)
7. Cockett, R., Heagy, L.J., Haber, E.: Efficient 3D inversions using the Richards equation. Comput. Geosci. (2018)
8. Cotter, S.L., Roberts, G.O., Stuart, A.M., White, D., et al.: MCMC methods for functions: modifying old algorithms to make them faster. Stat. Sci. 28(3), 424–446 (2013)
9. Dietrich, C.R., Newsam, G.N.: Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput. 18(4), 1088–1107 (1997)
10. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)
11. Fritz, J., Neuweiler, I., Nowak, W.: Application of FFT-based algorithms for large-scale universal kriging problems. Math. Geosci. 41(5), 509–533 (2009)
12. Hadamard, J.: Sur les problèmes aux dérivées partielles et leur signification physique. Princeton Univ. Bull. 13(49–52), 28 (1902)
13. Hager, W.W., Zhang, H.: A survey of nonlinear conjugate gradient methods. Pacific J. Optim. 2(1), 35–58 (2006)
14. Kitanidis, P.K.: Quasi-linear geostatistical theory for inversing. Water Resour. Res. 31(10), 2411–2419 (1995)
15. Klein, O.: dune-randomfield. https://gitlab.dune-project.org/oklein/dune-randomfield (2016)
16. Klein, O.: Preconditioned and randomized methods for efficient Bayesian inversion of large data sets and their application to flow and transport in porous media. PhD thesis, Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg (2016)
17. Klein, O., Cirpka, O.A., Bastian, P., Ippisch, O.: Efficient geostatistical inversion of transient groundwater flow using preconditioned nonlinear conjugate gradients. Adv. Water Resour. 102, 161–177 (2017)
18. Lee, J., Kitanidis, P.K.: Large-scale hydraulic tomography and joint inversion of head and tracer data using the principal component geostatistical approach (PCGA). Water Resour. Res. 50(7), 5410–5427 (2014)
19. Li, W., Cirpka, O.A.: Efficient geostatistical inverse methods for structured and unstructured grids. Water Resour. Res. 42(6) (2006)
20. Li, W., Nowak, W., Cirpka, O.A.: Geostatistical inverse modeling of transient pumping tests using temporal moments of drawdown. Water Resour. Res. 41(8) (2005)
21. Mohring, J., Milk, R., Ngo, A., Klein, O., Iliev, O., Ohlberger, M., Bastian, P.: Uncertainty quantification for porous media flow using multilevel Monte Carlo. In: International Conference on Large-Scale Scientific Computing, pp. 145–152. Springer (2015)
22. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35(151), 773–782 (1980)
23. Nowak, W., Cirpka, O.A.: A modified Levenberg-Marquardt algorithm for quasi-linear geostatistical inversing. Adv. Water Resour. 27(7), 737–750 (2004)
24. Nowak, W., Litvinenko, A.: Kriging and spatial design accelerated by orders of magnitude: combining low-rank covariance approximations with FFT-techniques. Math. Geosci. 45(4), 411–435 (2013)
25. Plessix, R.-E.: A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophys. J. Int. 167(2), 495–503 (2006)
26. Schwede, R.L., Ngo, A., Bastian, P., Ippisch, O., Li, W., Cirpka, O.A.: Efficient parallelization of geostatistical inversion using the quasi-linear approach. Comput. Geosci. 44, 78–85 (2012)

Idiosyncrasies of the Frequency Response of Discrete-Time Equivalents of Continuous-Time System

Pitcha Prasitmeeboon and Richard W. Longman

Abstract This paper gives a detailed study of the properties, or idiosyncrasies, of the phase frequency response of discretized systems obtained from a continuous time system fed by a zero order hold. One might expect the frequency response of a continuous time system and that of the equivalent discrete time system to be very similar, since both produce the same output at the sample times. Instead, the discretization process can introduce many perhaps surprising phenomena. This work is motivated by Repetitive Control (RC), which seeks a simple finite impulse response (FIR) filter that mimics the inverse of the frequency response of the discretized system. Some very effective and simple FIR results have been reported in the literature, but it is shown here that such designs can face many possible idiosyncrasies, which could preclude creating FIR designs that cancel phase all the way to Nyquist frequency. Phase response properties are presented for odd and for even pole excess, for Nyquist frequencies at which the continuous time phase has nearly converged, for frequencies below this down to a singularity, at the singularities, and below the singularity or singularities.

P. Prasitmeeboon
Faculty of Engineering, King Mongkut’s Institute of Technology Ladkrabang (KMITL), Ladkrabang, Bangkok 10520, Thailand

R. W. Longman (corresponding author)
Department of Mechanical Engineering, Columbia University, MC4703, 500 West 120th St., New York, NY 10027, USA

© Springer Nature Switzerland AG 2021. H. G. Bock et al. (eds.), Modeling, Simulation and Optimization of Complex Processes HPSC 2018, https://doi.org/10.1007/978-3-030-55240-4_18

1 Introduction

Frequency response methods are basic and effective methods used to design classical feedback control systems. Digital control replaces the controller of such a system with a digital computer or microprocessor that must sample the error signal at sample times, compute an update of the corrective control action, then apply it to the world through a zero order hold. This action is held until an update is supplied at the next time step. This piecewise constant signal is applied to the plant governed by an ordinary differential equation (ODE) whose output is the controlled variable. To analyze such systems, one can replace the ODE, which is fed by a zero order hold with the output also sampled, by an equivalent difference equation that has precisely the same solution at the sample times as the ODE itself. Once the ODE is replaced by a difference equation, one can design a digital feedback control system using frequency response methods for difference equations (Ref. [1]).

This paper looks in detail at what the phase frequency response is like for systems that have been discretized in this way. The discretization process introduces some surprising idiosyncrasies into the frequency response. It is the purpose of this paper to identify all of the characteristics of the discrete time system phase response.

This investigation is motivated by the field of repetitive control (RC) [2–6]. RC aims to converge to zero tracking error in control systems that are executing a periodic command, or that operate in the presence of a disturbance of known period, or both. The objective is to have zero error for the fundamental frequency of the given period and for all harmonics up to Nyquist frequency, the highest frequency one can see unambiguously using the given sample rate. Unlike usual feedback designs, where the frequency response behavior well above some bandwidth is not particularly important, in RC the frequency response behavior is important all the way from zero to Nyquist. This motivates our investigation of the kinds of behavior this frequency response can exhibit. We list a set of behaviors that we describe as idiosyncrasies.
Frequency response plots of a control system from command to response tell the error in executing the command for each frequency in the command. It is the purpose of the RC system to eliminate these errors. The simplest form of RC looks at the error one period back and adjusts the command to the feedback system, asking to change the output by this amount. As this adjustment of the command goes through the feedback control system, it is modified by the phase lag and the amplitude change through the system. If the phase change through the system is −180°, then the output has a changed sign and will add to the error instead of reducing it. It is therefore important to introduce a compensator that adjusts the phase of the command, introducing a phase change (usually lead) that will cancel the phase change (usually lag) through the system. For small RC gain, the accuracy tolerance of this cancellation can approach ±90°. The amplitude change is not critical because the periodic process will accumulate whatever amplitude change is needed, but we can similarly correct for the amplitude change by making the compensator equal the inverse of the steady state frequency response. References [5, 6] present an effective way to design such a compensator in the form of a non-causal finite impulse response (FIR) filter

$$F(z) = a_1 z^{m-1} + a_2 z^{m-2} + \dots + a_{n-1} z^{-(n-m-1)} + a_n z^{-(n-m)} \tag{1}$$


that is applied to the error signal of the previous period. In practice, the compensator looks back one period and multiplies the errors observed by the above gains. Positive powers of $z$ are for errors at time steps forward from one period back, and negative powers of $z$ are for errors further back than one period. The gains are chosen to try to make the frequency response of the product $G(z)F(z)$ equal one for all frequencies from zero to Nyquist, i.e. $G(e^{i\omega T})F(e^{i\omega T}) = 1$ for all radian frequencies $\omega$. Here $T$ is the sample time interval and $G(z)$ is the z-transfer function of the discretized system from command to output. The gains are picked to minimize the cost function that sums over a suitably chosen set of frequencies $\omega_j$ from zero to Nyquist

$$J = \sum_{j=1}^{N} \left[1 - G(e^{i\omega_j T})F(e^{i\omega_j T})\right]\left[1 - G(e^{i\omega_j T})F(e^{i\omega_j T})\right]^{*} \tag{2}$$

The asterisk indicates conjugation. Reference [5] applies this to a third order differential equation whose Laplace transfer function is

$$G_a(s) = \left(\frac{a}{s+a}\right)\left(\frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2}\right) \tag{3}$$

with $a = 8.8$, $\zeta = 0.5$, $\omega_n = 37$. It is converted to a z-transfer function $G(z)$ using sampling at 200 Hz. This models the command to response behavior of each link of a commercial robot. The result is shown in Fig. 1, which shows that the design cancels the phase to within an error of less than 0.02 degrees. Hence, this design can be very effective. This paper investigates the characteristics of the phase response of discretized differential equations fed by zero order holds, and discovers idiosyncrasies that could make it difficult to design effective compensators $F(z)$ for some systems.

Fig. 1 Plot of $G(e^{i\omega_j T})F(e^{i\omega_j T})$ (left) and plots of its magnitude and its phase (right) from zero to Nyquist ($n = 12$, $m = 7$)
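The filter of Eq. (1) and the cost of Eq. (2) can be evaluated directly, as in the following minimal sketch. It only evaluates $J$ for given gains and does not reproduce the design procedure of Refs. [5, 6]. The function names and the toy plant $G(z) = z^{-1}$ (a pure one-step delay) are assumptions introduced here for illustration, not taken from the paper.

```python
import cmath
import math

def fir_response(gains, m, w, T):
    """F(e^{iwT}) for F(z) = a_1 z^(m-1) + ... + a_n z^(-(n-m)), as in Eq. (1)."""
    z = cmath.exp(1j * w * T)
    return sum(a * z ** (m - k) for k, a in enumerate(gains, start=1))

def cost(gains, m, plant, freqs, T):
    """Cost J of Eq. (2): sum over the frequencies of |1 - G F|^2."""
    total = 0.0
    for w in freqs:
        z = cmath.exp(1j * w * T)
        e = 1.0 - plant(z) * fir_response(gains, m, w, T)
        total += (e * e.conjugate()).real
    return total

def delay_plant(z):
    """Hypothetical toy plant (not from the paper): G(z) = 1/z."""
    return 1.0 / z

T = 1.0 / 200.0                                   # 200 Hz sampling
nyquist = math.pi / T
freqs = [nyquist * j / 50 for j in range(1, 51)]  # 50 frequencies up to Nyquist

J_exact = cost([1.0], 2, delay_plant, freqs, T)   # n = 1, m = 2 gives F(z) = z
J_none = cost([1.0], 1, delay_plant, freqs, T)    # n = 1, m = 1 gives F(z) = 1
```

For this toy plant the non-causal choice $F(z) = z$ inverts the delay exactly, so `J_exact` vanishes up to rounding, while leaving the delay uncompensated gives a cost of 102 on this frequency grid.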


2 Frequency Response of Differential Equations and Difference Equations

We review frequency response for differential equations, how to convert them to equivalent difference equations, and the frequency response of difference equations. In the sequel, we make observations on the relationships between the two frequency responses for equivalent systems. The observations are labeled Idiosyncrasies because some of the observed relationships might be unexpected.

2.1 Frequency Response of Differential Equations

For illustration purposes, consider a simple differential equation and its associated Laplace transfer function

$$\frac{d^2 y(t)}{dt^2} + 3\frac{dy(t)}{dt} + 2y(t) = 2u(t), \qquad Y(s) = \left[\frac{2}{s^2 + 3s + 2}\right]U(s) = G_a(s)U(s) \tag{4}$$

Frequency response means the steady state response of the differential equation when $u(t) = \cos(\omega t)$. Call this response $y_1(t)$. To find this solution, let $y_2(t)$ be the response when $u(t) = \sin(\omega t)$, write each equation, multiply the second by the imaginary number $i$, and add them together. The right hand side of the result can be written as $2(\cos(\omega t) + i\sin(\omega t)) = 2e^{i\omega t}$. Then a particular solution can be obtained in the form $y(t) = A(\omega)e^{i\omega t}$. Inserting it into the equation gives $A(\omega) = G_a(i\omega)$, which can be written as a magnitude times a phase factor using $G_a(i\omega) = M_a(\omega)e^{i\phi_a(\omega)}$. The desired $y_1(t)$ is then the real part of this result, which is $y_1(t) = M_a(\omega)\cos(\omega t + \phi_a(\omega))$. Here $M_a(\omega)$ is the amplitude frequency response, or change in amplitude of an input $\cos(\omega t)$ when it goes through system (4), and $\phi_a(\omega)$ is the change in phase.

Observe some properties of the phase change. The transfer function in Eq. (4) can be written as the product $G_a(s) = \left(\frac{1}{s+1}\right)\left(\frac{2}{s+2}\right)$, and the phase angle of $G_a(s)$ is the sum of the phase angles for each term in this product. Looking at the second term, the phase angle is the angle the complex number $\frac{2}{i\omega + 2}$ makes with the positive real axis. This angle starts at zero when $\omega$ is zero, and as $\omega$ tends to infinity the 2 in the denominator becomes insignificant and the factor looks like $\frac{2}{i\omega} = -i\left(\frac{2}{\omega}\right)$, whose phase angle is −90°. If there were a factor $s + 3$ in the numerator, it would produce an angle of +90° as $\omega$ tends to infinity. Define the pole excess of a Laplace transfer function $G_a(s)$ as the number of poles (roots of the denominator) minus the number of zeros (roots of any numerator polynomial). Then the phase response of a differential equation as $\omega \to \infty$ is the pole excess times −90°.
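The pole excess rule can be checked numerically by summing the angles of the individual factors, as in this minimal sketch. The function name is hypothetical, and the sketch assumes real, negative poles and zeros and a positive gain, so that each factor angle lies between 0° and 90° and no phase unwrapping is needed.

```python
import cmath
import math

def phase_deg(poles, zeros, w):
    """Phase in degrees of prod(s - zero) / prod(s - pole) at s = i*w, w >= 0.

    Assumes real, negative poles and zeros: each factor (i*w - r) then has
    an angle in [0, 90] degrees, so the sum is already unwrapped.
    """
    s = 1j * w
    num = sum(cmath.phase(s - z) for z in zeros)
    den = sum(cmath.phase(s - p) for p in poles)
    return math.degrees(num - den)

# G_a(s) = 2 / ((s + 1)(s + 2)) from Eq. (4): pole excess 2
high = phase_deg([-1.0, -2.0], [], 1.0e6)

# Same poles with a factor (s + 3) in the numerator: pole excess 1
high_with_zero = phase_deg([-1.0, -2.0], [-3.0], 1.0e6)
```

At $\omega = 10^6$ rad/s the two values are within a small fraction of a degree of −180° and −90° respectively, i.e. the pole excess times −90°.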


2.2 Converting a Differential Equation to a Difference Equation

If a differential equation as in Eq. (4) is fed through a zero order hold, one can create a difference equation whose solution is identical to that of the differential equation at the sample times. This conversion from $G_a(s)$ to a z-transfer function representing a difference equation is accomplished using

$$G(z) = (1 - z^{-1})\,Z\{G_a(s)/s\} \tag{5}$$

The expression in the curly brackets represents the unit step response starting from zero initial conditions. The $Z$ says to sample this unit step response at the time steps $t = kT$ and take the z-transform of the resulting sequence. One intuitive way to derive this result is to start by decomposing the zero order hold input function of time $u(t)$ into a sum of scaled unit step functions. At the start of a time step, put in a step function of the height of the input for that time step. At the end of the step, subtract that step out. Then add the results for all time steps up to the present step. This produces a convolution sum over all input step functions to the present time, minus a convolution sum over the removals of the steps one step later. The z-transform of a convolution sum is the product of the transform of the input with the system transform. The 1 term in Eq. (5) puts in the steps, and the $-z^{-1}$ term takes them out.
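As a worked check of Eq. (5), the sketch below applies it by hand to $G_a(s) = 2/((s+1)(s+2))$ from Eq. (4). The partial fraction expansion $G_a(s)/s = 1/s - 2/(s+1) + 1/(s+2)$ gives the unit step response $1 - 2e^{-t} + e^{-2t}$, and combining the sampled terms gives the coefficients coded below. These closed forms are derived here for illustration and do not appear in the paper; the function names and the choice $T = 0.05$ are likewise assumptions.

```python
import math

def zoh_coefficients(T):
    """(a1, a2, b1, b2) of the zero order hold equivalent of 2/((s+1)(s+2)).

    With p1 = exp(-T) and p2 = exp(-2T), Eq. (5) gives
    G(z) = (b1 z + b2) / (z^2 + a1 z + a2).
    """
    p1, p2 = math.exp(-T), math.exp(-2.0 * T)
    a1, a2 = -(p1 + p2), p1 * p2
    b1 = 1.0 - 2.0 * p1 + p2
    b2 = p1 * p2 - 2.0 * p2 + p1
    return a1, a2, b1, b2

def simulate_step(T, steps):
    """Unit step response of y[k+2] + a1 y[k+1] + a2 y[k] = b1 u[k+1] + b2 u[k]."""
    a1, a2, b1, b2 = zoh_coefficients(T)
    y = [0.0, b1]  # zero initial conditions: y[0] = 0, y[1] = b1 * u[0]
    for k in range(steps - 2):
        y.append(-a1 * y[k + 1] - a2 * y[k] + b1 + b2)
    return y

def exact_step(t):
    """Continuous time unit step response of 2/((s+1)(s+2))."""
    return 1.0 - 2.0 * math.exp(-t) + math.exp(-2.0 * t)

T = 0.05
y = simulate_step(T, 100)
max_err = max(abs(yk - exact_step(k * T)) for k, yk in enumerate(y))
```

The difference equation reproduces the continuous time step response at the sample times to rounding accuracy, which is the sense in which the two systems are equivalent.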

2.3 Frequency Response of Difference Equations

Now develop the frequency response of difference equations parallel to Sect. 2.1 above. Consider a simple second order difference equation and its z-transfer function

$$y((k+2)T) + a_1 y((k+1)T) + a_2 y(kT) = b_1 u((k+1)T) + b_2 u(kT) \tag{6}$$

$$Y(z) = \left[\frac{b_1 z + b_2}{z^2 + a_1 z + a_2}\right]U(z) = G(z)U(z) \tag{7}$$

The extra term on the right is included because the discretization process of the previous section normally introduces zeros into the z-transfer function, as discussed below. Write Eq. (6) for both inputs $\cos(\omega kT)$ and $\sin(\omega kT)$, defining $y_1(kT)$ and $y_2(kT)$ as the steady state responses, multiply the second equation by $i$, and sum the equations. Then $y(kT) = y_1(kT) + i\,y_2(kT)$ has a particular solution of the form $y(kT) = A(\omega)e^{i\omega kT}$ where

$$A(\omega) = \left[\frac{b_1 e^{i\omega T} + b_2}{(e^{i\omega T})^2 + a_1 e^{i\omega T} + a_2}\right] = G(e^{i\omega T}) \tag{8}$$


Writing $G(e^{i\omega T}) = M(\omega)\exp(i\phi(\omega))$ and finding the real part $y_1(kT)$ produces the desired frequency response. What is of particular interest in this paper is the phase response $\phi(\omega)$, given by the angle the complex number $G(e^{i\omega T})$ makes with the positive real axis. From this result, we list a few Idiosyncrasies that distinguish the discrete time frequency response from the continuous time frequency response.

Idiosyncrasy 1 The continuous time frequency response considers frequencies from zero to infinity. The discrete time frequency response considers frequencies from zero to Nyquist frequency, $\omega = \pi/T$; call it $\omega_N$. All frequencies above this value produce frequency responses observed at the sample times that are identical to those of a corresponding frequency below Nyquist.

Idiosyncrasy 2 To find the discrete time frequency response one substitutes $z = \exp(i\omega T)$, which goes around the upper half of the unit circle in the $z$ plane, starting at +1 and continuing to Nyquist, which is $z = -1$. Substituting −1 makes the transfer function a real number, hence it is either positive or negative. This implies that when the frequency reaches Nyquist, the phase angle must be a multiple of −180°. Note that this can never match the asymptotic phase of a continuous time system with odd pole excess, which from above is an odd multiple of −90°.

Idiosyncrasy 3 Consider a $G(z)$ of the form

$$G(z) = \frac{K(z - z_1)(z - z_2)}{(z - p_1)(z - p_2)(z - p_3)} \tag{9}$$

To find the phase at a frequency $\omega$, substitute the point $z = \exp(i\omega T)$ on the unit circle into this expression and find the angle made with the positive real axis for each factor above. The sum of these angles for the numerator, minus the sum for the denominator, gives the phase response at this frequency. From Fig. 2 one can observe that any zero $z_1$ outside the unit circle on the negative real axis will contribute zero angle to the phase at Nyquist, a zero inside the unit circle will contribute +180° at Nyquist, and a pole inside the unit circle contributes −180°. Hence, the phase angle at Nyquist frequency is the discrete transfer function pole excess inside the unit circle times −180°.

Fig. 2 Illustration of phase contributions of factors in G(z)
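The counting rule of Idiosyncrasy 3 can be compared with a direct sum of factor angles at $z = -1$, as sketched below. The function names and the example pole and zero values are hypothetical. The sketch assumes real poles and zeros and a positive gain $K$; with complex conjugate pairs the principal angles returned by `cmath.phase` would agree with the rule only modulo 360°.

```python
import cmath
import math

def nyquist_phase_by_angles(zeros, poles):
    """Sum the factor angles of G(z) at z = -1 (zeros add, poles subtract), in degrees."""
    z = complex(-1.0, 0.0)
    num = sum(cmath.phase(z - z0) for z0 in zeros)
    den = sum(cmath.phase(z - p) for p in poles)
    return math.degrees(num - den)

def nyquist_phase_by_rule(zeros, poles):
    """Idiosyncrasy 3: -180 degrees times (poles inside minus zeros inside the unit circle)."""
    def inside(r):
        return abs(r) < 1.0
    excess = sum(map(inside, poles)) - sum(map(inside, zeros))
    return -180.0 * excess

# Hypothetical example: three stable poles, one zero outside and one inside the unit circle
zeros = [-3.7321, -0.2679]
poles = [0.9, 0.8, 0.7]
by_angles = nyquist_phase_by_angles(zeros, poles)
by_rule = nyquist_phase_by_rule(zeros, poles)
```

Both routes give −360° here: the outside zero contributes nothing, the inside zero contributes +180°, and each of the three poles contributes −180°.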

3 Influence of a Zero Order Hold on Phase

Consider the block diagrams in Fig. 3. They allow one to investigate the effects of a zero order hold on the phase of a continuous time system. Let the command input be $y_d(t) = t$ in both block diagrams. In the upper diagram, the input that the differential equation sees is $u_C(t)$, shown in Fig. 4. When this function goes through a zero order hold before being applied to the differential equation, the input is $u_D(t)$, as shown in Fig. 4. At the start of a time step, $u_D(t)$ is the same as $u_C(t)$, but approaching the end of the time step, the $u_D(t)$ being applied to the differential equation has not been changed. This suggests that, on average, the zero order hold input is delayed by half a time step. At Nyquist frequency there are two samples per period, so a half time step delay corresponds to a phase lag of −90°. At half Nyquist there are four samples per period, making the phase lag introduced −45°. This thinking suggests that a zero order hold will introduce a linear phase lag that is zero at zero frequency and −90° at Nyquist. This is a phase delay in the input to the differential equation. One might expect that this will result in the same phase delay being introduced into the output of the differential equation. Now compute the output in each case and find the actual delay.

Fig. 3 Two block diagrams; the top is without a zero order hold, the bottom introduces a zero order hold

Fig. 4 The input to the differential equation with and without the zero order hold

The solution of

$$\frac{dy(t)}{dt} + 2y(t) = 2y_d = 2t \tag{10}$$

is the general solution of the homogeneous equation with a constant determined by initial conditions, plus a particular solution $y_p(t)$, which can be found by substituting $y_p(t) = At + B$ and adjusting $A$ and $B$ to satisfy the equation. The results for all time $t$, and for the sample times $t = kT$, are

$$y(t) = C_1 e^{-2t} + \left(t - \tfrac{1}{2}\right), \qquad y(kT) = C_1 (e^{-2T})^k + \left(kT - \tfrac{1}{2}\right) \tag{11}$$

Note that going through the differential equation created a one-half second time delay. Now consider the block diagram with the zero order hold included. Obtain the z-transfer function using Eq. (5), use partial fraction expansion of the term in the curly brackets, and convert each term to its z-transform equivalent. Then convert the transform into the associated difference equation

$$y((k+1)T) - e^{-2T} y(kT) = (1 - e^{-2T})\,y_d(kT) \tag{12}$$

Again look for a particular solution of the form $y_p(kT) = AkT + B$, find $A$ and $B$, and add it to the general solution of the homogeneous equation, giving

$$y(kT) = C_2 (e^{-2T})^k + \left(kT - \frac{T}{1 - e^{-2T}}\right) \tag{13}$$

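Observe that the constant term of the particular solution in the discrete time case can be written as

$$-\frac{T}{1 - e^{-2T}} = \frac{-T}{1 - \left[1 - 2T + \tfrac{1}{2}(2T)^2 - \dots\right]} \approx -\frac{1}{2}\left[\frac{1}{1 - T + \tfrac{2}{3}T^2 - \dots}\right] \approx -\frac{1}{2}\left(1 + T + \dots\right) \tag{14}$$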

Therefore, the delay in the output produced by the zero order hold is not exactly one-half time step, but this value is a very good approximation for reasonable sample time intervals $T$.

Idiosyncrasy 4 When the input to a differential equation comes through a zero order hold, the delay is approximately one half time step, but the delay is a function of the system dynamics.
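The size of the extra delay can be checked numerically, as in the following minimal sketch. It compares the closed form obtained from Eqs. (11) and (13) with a direct iteration of Eq. (12) driven by the sampled ramp. The function names, the sample interval $T = 0.01$ and the number of steps are choices made here for illustration.

```python
import math

def zoh_extra_delay(T):
    """Extra output delay caused by the hold: T/(1 - exp(-2T)) - 1/2, from Eqs. (11) and (13)."""
    return T / (1.0 - math.exp(-2.0 * T)) - 0.5

def simulated_lag(T, steps):
    """Iterate Eq. (12) with the ramp y_d(kT) = kT from y(0) = 0; return kT - y(kT) at k = steps."""
    p = math.exp(-2.0 * T)
    y = 0.0
    for k in range(steps):
        y = p * y + (1.0 - p) * (k * T)
    return steps * T - y

T = 0.01
extra = zoh_extra_delay(T)        # close to T/2 = 0.005
lag = simulated_lag(T, 5000)      # 50 s of simulated time, so the transient has decayed
```

For this system the extra delay exceeds $T/2$ by roughly $T^2/6$, which shows both that the half time step is a good approximation and that the exact value depends on the system dynamics.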


Fig. 5 Phase response of the continuous time system (top), of the continuous time system with a half time step delay at each frequency (lower solid line), and of the digital system (circles)

Figure 5 illustrates some of the Idiosyncrasies identified so far, considering $G(s) = 1/(s+a)^3$ with $a = 8.8$ and a sample rate of 500 Hz, i.e. a Nyquist frequency of 250 Hz. The top dashed curve gives the phase response of the original differential equation $G(s)$, which at Nyquist frequency is approaching the asymptotic value of −270° that it reaches as frequency tends to infinity. The solid curve below it is the same phase response with the one-half time step delay included. The circles are the frequency response of the same system fed by a zero order hold, which we know must end at a multiple of −180° at Nyquist frequency, which it does at the right end of the plot. In this case, the continuous time system with the added half time step delay automatically supplies the needed extra phase at Nyquist. The continuous time response with one-half step delay predicts well the frequency response of the discrete time system at all frequencies.
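The comparison in Fig. 5 can be approximated without computing $G(z)$ explicitly. The sketch below assumes the standard aliasing (Poisson summation) form of the sampled frequency response, $G(e^{i\omega T}) = \frac{1 - e^{-i\omega T}}{T}\sum_k \frac{G(s_k)}{s_k}$ with $s_k = i(\omega + 2\pi k/T)$, which is not used in the paper and is swapped in here as an alternative to Eq. (5). The truncation length and all function names are assumptions.

```python
import cmath
import math

A = 8.8  # pole location of the example in Fig. 5

def plant(s):
    """G(s) = 1/(s + a)^3."""
    return 1.0 / (s + A) ** 3

def zoh_response(w, T, terms=200):
    """G(e^{iwT}) from the truncated aliasing sum; requires 0 < w <= pi/T."""
    ws = 2.0 * math.pi / T
    total = 0j
    for k in range(-terms, terms + 1):
        s = 1j * (w + k * ws)
        total += plant(s) / s
    return (1.0 - cmath.exp(-1j * w * T)) / T * total

def predicted_phase_deg(w, T):
    """Continuous time phase plus the half time step delay, in degrees."""
    return math.degrees(-3.0 * math.atan2(w, A) - 0.5 * w * T)

def wrap_deg(x):
    """Wrap an angle difference into [-180, 180)."""
    return (x + 180.0) % 360.0 - 180.0

T = 1.0 / 500.0                      # 500 Hz sample rate
w_nyq = math.pi / T
fractions = (0.1, 0.5, 0.9, 1.0)     # fractions of the Nyquist frequency
diffs = []
for frac in fractions:
    w = frac * w_nyq
    actual = math.degrees(cmath.phase(zoh_response(w, T)))
    diffs.append(wrap_deg(actual - predicted_phase_deg(w, T)))
```

Under these assumptions, each entry of `diffs` is the gap in degrees between the discrete time phase and the continuous-plus-half-step prediction. The `k = 0` term of the sum is exactly the continuous response with a half step delay, so the gap comes only from the aliased terms.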

4 The Zeros Introduced in Discretization

4.1 The Time Delay Going Through the Discrete Time System

The input to the differential equation coming from a zero order hold is a series of steps as in Fig. 4. At the start of a time step, the input suddenly goes to a new value, but the solution to the differential equation cannot change instantaneously. Instead, it will start changing then, and a change in the output will first be visible at the next sample time. (The exception is systems with the same number of zeros as poles in continuous time, which are usually non-physical.) The implication is that $G(z)$ in general must have one more pole than zero.

Idiosyncrasy 5 If an ordinary differential equation with a pole excess of $m = p - z > 0$ ($p$ and $z$ are the numbers of poles and zeros) is fed by a zero order hold, the discretization process will introduce $m - 1$ zeros.

4.2 The Asymptotic Zero Locations

Reference [7] derives the locations of these introduced zeros as the sample time interval $T$ tends to zero. Table 1 gives the locations for the zeros introduced that are on or outside the unit circle. For every zero located outside the unit circle there is a companion zero introduced inside the unit circle at the reciprocal of the outside location (not listed in the table), making a total of $p - z - 1$ zeros introduced. If this number is even, then half of the zeros are outside the unit circle and half are inside. If this number is odd, then one zero will be located asymptotically on the unit circle at −1 (when $T$ has not reached the asymptotic limit, expect this zero to be inside). Half of the remaining zeros are outside, half are inside.

Table 1 Asymptotic zero locations for zeros introduced outside and on the unit circle

Pole excess   Zero location
2             −1.0000
3             −3.7321
4             −9.8990, −1.0000
5             −23.2039, −2.3225
6             −51.2184, −4.5419, −1.0000
7             −109.3052, −8.1596, −1.8682
8             −228.5110, −13.9566, −3.1377, −1.0000
9             −471.4075, −23.1360, −4.9566, −1.6447
10            −963.8545, −37.5415, −7.5306, −2.5155, −1.0000
11            −1958.6431, −59.9893, −11.1409, −3.6740, −1.5123

Comments on Repetitive Control (RC): The aim is to solve the inverse problem, i.e. to find the input that will produce the desired output at all time steps. This suggests that the ideal RC law is the inverse of $G(z)$, and then the zeros listed in Table 1 become poles of a difference equation to solve for the needed input. Associated with these poles are solutions of the associated homogeneous equation, which has these roots in its characteristic equation. Each such root creates a solution of an arbitrary constant times the root to the $k$th power, where $k$ is the time step. In the case of a pole excess of 3 this is a solution $C_1(-3.7321)^k$, which grows fast. In the case of a 4th order pole excess the solution is approximately $C_2(-10)^k$. For a pole excess of 10, there will be a solution that is approximately $C_3(-1000)^k$. Thus, this inverse control action could be asking for a control with a component that grows in magnitude by a factor of 1000 every time step. This is a very large number at the end of one second using a 100 Hz sample rate, and much worse at a 1000 Hz sample rate. It is this property that motivated [5] to create RC compensators that compute an approximate inverse of the steady state frequency response, instead of using the inverse of the transfer function. Then convergence to zero error is obtained as time steps tend to infinity.
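Entries of Table 1 can be spot-checked with the sketch below. It assumes, as a known result from the sampled-system zero literature that is not stated in this text, that the limiting zeros for pole excess $r$ are the roots of the degree $r - 1$ polynomial whose coefficients are the Eulerian numbers (for example $z^2 + 4z + 1$ for $r = 3$). The function names and the bracketing half width are hypothetical choices.

```python
def eulerian_row(n):
    """Eulerian numbers A(n, 0..n-1) via A(n,k) = (k+1) A(n-1,k) + (n-k) A(n-1,k-1)."""
    row = [1]
    for m in range(2, n + 1):
        prev = row + [0]
        row = [(k + 1) * prev[k] + (m - k) * (prev[k - 1] if k > 0 else 0)
               for k in range(m)]
    return row

def poly_eval(coeffs, x):
    """Horner evaluation with the highest power first."""
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value

def brackets_root(coeffs, x, half_width=1e-4):
    """True if the polynomial changes sign (or vanishes) within x +/- half_width."""
    lo = poly_eval(coeffs, x - half_width)
    hi = poly_eval(coeffs, x + half_width)
    return lo * hi <= 0.0

# Assumed correspondence: pole excess r <-> eulerian_row(r)
checks = {
    3: [-3.7321],
    4: [-9.8990, -1.0],
    5: [-23.2039, -2.3225],
}
all_bracketed = all(brackets_root(eulerian_row(r), x)
                    for r, values in checks.items() for x in values)
```

Because the coefficient lists are palindromic, the roots come in reciprocal pairs, which matches the statement that every zero outside the unit circle has a companion inside at the reciprocal location.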

4.3 Zero Locations as the Sample Time Interval Becomes Long

Now consider what happens to the zeros as the sample time interval gets long. Write the difference equation as

$$y((k+n)T) + a_1 y((k+n-1)T) + \dots + a_n y(kT) = b_1 u((k+n-1)T) + \dots + b_n u(kT) \tag{15}$$

Once the time interval $T$ of a zero order hold becomes longer than the settling time of the system, only the most up to date terms on each side of the equation matter. All “initial condition” effects have decayed to a negligible value. This says that for such a large $T$, the z-transfer function of the above equation becomes

$$Y(z) = \left[\frac{b_1 z^{n-1}}{z^n}\right]U(z) \tag{16}$$

which says that as $T$ becomes long, all zeros will converge to the origin.

Idiosyncrasy 6 As $T$ is increased from near zero, where the introduced zeros sit at their asymptotic locations, all zeros outside the unit circle will move toward the origin along the negative real axis. Each zero outside will eventually cross into the unit circle. Examining Fig. 2, one sees that the angle $\phi_{z1}$ for $z$ approaching −1 will change instantaneously from zero to +180° just after the zero has entered the unit circle. This creates a jump discontinuity in the phase plot.
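The jump described in Idiosyncrasy 6 follows from the geometry of a single zero factor. The minimal sketch below evaluates the angle of $z - z_1$ as $z$ sweeps the upper unit circle, for a zero just outside and just inside −1. The zero locations −1.01 and −0.99, the grid size, and the function name are hypothetical choices for illustration.

```python
import cmath
import math

def zero_factor_angles(z1, points=2001):
    """Angle in degrees of the factor (z - z1) as z = e^{i*theta} sweeps theta from 0 to pi."""
    angles = []
    for j in range(points):
        theta = math.pi * j / (points - 1)
        z = complex(math.cos(theta), math.sin(theta))
        angles.append(math.degrees(cmath.phase(z - z1)))
    return angles

outside = zero_factor_angles(-1.01)  # zero just outside the unit circle
inside = zero_factor_angles(-0.99)   # zero just inside the unit circle

peak_outside = max(outside)                       # reached where the line from z1 is tangent
tangent_angle = math.degrees(math.asin(1.0 / 1.01))
```

For the outside zero the contribution rises to the tangent angle, $\arcsin(1/|z_1|)$, and then falls back to zero at Nyquist. For the inside zero it ends at +180°. A small move of the zero across −1 therefore changes the phase at Nyquist by 180°.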

5 Phase Behavior at Fast Sample Rates

This section examines the phase plots when the sample rate makes the Nyquist frequency occur in a frequency region where the phase of the continuous time system is approaching its asymptotic value reached as frequency goes to infinity. Bode plot information indicates that at a frequency approximately 10 times the highest break frequency of any factor in $G(s)$, the phase is already within about 6° of its final value at infinite frequency. Consider first odd pole excess, then even pole excess.

Idiosyncrasy 7 (Odd Pole Excess)
– If $G(s)$ has an odd pole excess, then its frequency response will asymptotically approach an odd multiple of −90°.
– We have established that the discretized version will try to follow this frequency response curve, after including a one-half time step delay. At Nyquist, this introduces an additional −90° phase lag.
– When Nyquist frequency is in the region where $G(s)$ is close to its asymptotic phase, the discrete time system can follow $G(s)$ with the half step delay included nearly all the way to Nyquist frequency.
– Figure 5 illustrates this.

Idiosyncrasy 8 (Even Pole Excess)
– If $G(s)$ has an even pole excess, then its frequency response will asymptotically approach an even multiple of −90°, ending at a multiple of −180°. Consider a choice of Nyquist frequency for which this plot is already close to the final value.
– When the one half time step delay is added to this, an extra −90° is included, resulting in an odd multiple of −90°.
– Therefore, the actual discrete time frequency response curve must deviate from this $G(s)$ plus half time step delay curve near the end, aiming to fix this 90° discrepancy.
– This deviation is produced by the introduced zero near −1 in systems with even pole excess.
– Consider Fig. 2, but move the zero near −1 to be inside the unit circle, and consider it to be very near −1. When the $z$ on the unit circle corresponds to a frequency perpendicularly above this zero, the contribution of the zero factor to the phase, $\phi_{z1}$, will be +90°. Then as the $z$ on the unit circle progresses from this frequency toward Nyquist frequency at −1, the angle will quickly change to +180°, introducing a quick +90° change in phase approaching Nyquist. This makes a fast motion of the frequency response away from the $G(s)$ plus half time step delay curve in order to end on the first even multiple of 180° encountered.

An interesting question to consider is how it is possible that the half time step delay that seems to be inherent in the discrete time signal can be cancelled as the frequency approaches Nyquist.

The next sections illustrate these Idiosyncrasies (Figs. 6 and 7). In addition, these sections examine what happens when the Nyquist frequency is reduced: (1) first, when the sampling frequency is reduced from the high values where the phase of $G(s)$ is nearly saturated down to the first singularity; (2) then, what happens to the phase while the sample frequency decreases through singularities; (3) and finally, when the sampling frequency is reduced below such a singularity. It is felt that examining the behavior of systems with 3rd, 4th, and 5th order pole excess exhibits all the idiosyncrasies that would appear in higher order pole excess. The systems considered have Laplace transfer function

$$G(s) = \left(\frac{1}{s+a}\right)^{i} \tag{17}$$

where $a = 8.8$. The value of $i$ is 3, 4, and 5 for third order, fourth order, and fifth order pole excess.
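The arithmetic behind Idiosyncrasies 7 and 8 can be made concrete for the systems of Eq. (17). The sketch below evaluates the continuous time phase at the Nyquist frequency and adds the −90° of the half time step delay. The function name and the 1000 Hz sample rate are choices made here for illustration.

```python
import math

A = 8.8  # pole location in Eq. (17)

def predicted_nyquist_phase_deg(order, sample_rate_hz):
    """Phase of (1/(s+a))^order at the Nyquist frequency, plus -90 degrees for the half step delay."""
    w_nyq = math.pi * sample_rate_hz
    continuous = -order * math.degrees(math.atan2(w_nyq, A))
    return continuous - 90.0

fast = {order: predicted_nyquist_phase_deg(order, 1000.0) for order in (3, 4, 5)}
```

At 1000 Hz the three values are within one degree of −360°, −450° and −540°. For the odd pole excesses 3 and 5 this is already a multiple of −180°, so the discrete time response can follow the curve to Nyquist. For pole excess 4 it is an odd multiple of −90°, so the discrete time response must deviate by about 90° near Nyquist.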

6 Phase Behavior as Nyquist Frequency is Reduced from High Frequency to First Singularity

As the Nyquist frequency is reduced from a high frequency where the zeros are near those in Table 1, the zeros outside the unit circle move toward the origin. Figures 6 and 7 show the phase behavior for Nyquist frequencies over a full frequency range. Consider the range from high frequencies, as discussed in the previous section, down to the first singularity. Figure 6 applies to an odd pole excess as given by Eq. (17) with $i = 3$, and Fig. 7 gives the behavior for an even pole excess of $i = 4$.

There are many plots in these figures, which need to be explained. Consider Fig. 6. First, the ODE curve is the phase response of the original differential equation in Eq. (17). When the one half step delay is added to this plot, with sample rate 1000 Hz, the result is the curve labeled 1000 Hz. Indistinguishable from this curve is the frequency response of the discretized system for this sample rate. By lowering the phase by 90° at high frequencies, the continuous plus half step delay curve is aiming for a multiple of −180°, so both curves can lie on top of each other. When the sample rate is reduced, the continuous time frequency response plus half step delay is at a less negative phase angle when it reaches Nyquist, and these phase angles are no longer near a multiple of −180°. The 50 Hz sample rate plot has two curves, the bottom one being the discrete time response ending at −360°. It has to deviate from the upper curve, the continuous with half step delay curve, so that it can end at this next multiple of −180° below. As the sample rate decreases further, this downward curve gets more and more extreme, until it is a vertical discontinuity approaching the singular frequency from above. We label this behavior from high sample rates down to the first singularity as Pattern 1.

Fig. 6 Frequency response for various sample rates for third order pole excess in Eq. (17)

Idiosyncrasy 9 Expect Pattern 1 to apply to all odd pole excesses, for sample rates ranging from high values, at which the continuous time phase response at Nyquist is near its final value, down to the first singularity.

Figure 7 exhibits a different pattern for this frequency range, which will be labeled Pattern 2. The ODE curve is as before, but because it is a 4th order system the high frequency phase angle approaches −360°. When the half time step delay is added, the curve ends with an extra −90° lag. The discrete system cannot follow this curve to the end, as the 3rd order pole excess did, because it is an odd multiple of −90°. The bottom right corner of the figure actually has two curves. The one ending at the corner is the continuous time with half step delay, but the other curve very quickly goes up to −360°, the same as the continuous time system, indicating that very close to Nyquist there is no extra phase lag introduced by the discretization. As the sample rate is decreased and the value of the continuous plus half step delay curve becomes less negative at the new Nyquist frequencies, the swing upward gets less pronounced, and eventually there is no need to swing upward, but rather it must swing downward. Again, this becomes extreme, with a vertical downward motion when reaching the first singularity.

Fig. 7 Frequency response for various sample rates for fourth order pole excess in Eq. (17)
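The end points described for Figs. 6 and 7 can be estimated numerically under two assumptions that are not from the paper. First, the sign of $G(-1)$ is taken from the standard aliasing sum at Nyquist, $G(-1) = \frac{2}{T}\sum_k G(s_k)/s_k$ with $s_k = i(2k+1)\pi/T$. Second, the end phase is taken as the multiple of 180° with the matching parity that lies closest to the continuous-plus-half-step value. This second step is only a heuristic reading of the two patterns and should not be trusted near a singularity. All function names are hypothetical.

```python
import math

A = 8.8  # pole location in Eq. (17)

def g_at_nyquist(order, sample_rate_hz, pairs=200):
    """Real value G(z = -1) of the zero order hold equivalent of (1/(s+a))^order (aliasing sum)."""
    T = 1.0 / sample_rate_hz
    total = 0j
    for k in range(-pairs, pairs):            # terms k and -k-1 are complex conjugates
        s = 1j * (2 * k + 1) * math.pi / T
        total += (1.0 / (s + A) ** order) / s
    return (2.0 / T * total).real

def predicted_deg(order, sample_rate_hz):
    """Continuous time phase at Nyquist plus the -90 degrees of the half step delay."""
    return -order * math.degrees(math.atan2(math.pi * sample_rate_hz, A)) - 90.0

def end_phase_deg(order, sample_rate_hz):
    """Heuristic end phase: parity from the sign of G(-1), magnitude from the prediction."""
    pred = predicted_deg(order, sample_rate_hz)
    offset = 0.0 if g_at_nyquist(order, sample_rate_hz) > 0.0 else 180.0
    return offset + 360.0 * round((pred - offset) / 360.0)

gap_50hz = predicted_deg(3, 50.0) - end_phase_deg(3, 50.0)        # deviation needed in Pattern 1
swing_1000hz = end_phase_deg(4, 1000.0) - predicted_deg(4, 1000.0)  # upward swing in Pattern 2
```

Under these assumptions, the third order system at 50 Hz must drop a few degrees to reach −360°, while the fourth order system at 1000 Hz must swing up by nearly 90° to reach −360°, consistent with the descriptions of Patterns 1 and 2.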

7 Phase Behavior Crossing a Singularity

Examine in more detail the behavior of the phase as the sample rate decreases, making a zero move from outside the unit circle to inside, as presented in Fig. 8. Curves are shown for sample rates of 17.1 Hz and 17.0 Hz. At 17.1 Hz the frequency response suddenly moves almost vertically downward by 90°. When the sample rate decreases to 17.0 Hz the frequency response suddenly moves almost vertically upward by 90°. So there is a sudden change in the final phase angle going from 17.1 Hz to 17.0 Hz, a discontinuity of +180°.

Fig. 8 Frequency response as the sample rate is decreased across a frequency when a zero moves from outside the unit circle to inside the unit circle

This behavior is explained in Fig. 9. Consider the factored form of a z-transfer function as in Eq. (9). Suppose the zero $z_1$ of the factor $z - z_1$ is outside the unit circle, as in the left plot of Fig. 9. The phase response is obtained by adding the phase angles of all such zero factors together and subtracting the phase angles of all pole factors in Eq. (9). The $z$ goes around the upper half of the unit circle from zero frequency to Nyquist at −1. The phase influence $\phi_{z1}$ of $z - z_1$ when $z$ is at zero frequency, i.e. +1, is zero. As $z$ goes around the unit circle this phase will reach a maximum when $z$ is such that the line from $z_1$ to $z$ is tangent to the unit circle, as shown in the left plot. As $z$ continues around the circle to −1 the angle quickly decreases to zero again. As $z_1$ is made closer to −1 from the outside, the angle for a tangent line approaches +90°, and the distance from this frequency to Nyquist at −1 gets shorter. In the limit this creates an arbitrarily fast phase decrease of 90°.

Now consider that $z_1$ is some distance inside the unit circle, as shown in the right plot of Fig. 9. As $z$ moves from +1 around the unit circle to the location shown, which is vertically above $z_1$, the phase contribution of $z - z_1$ builds up to +90°. Then for the short interval of frequencies above the $z$ shown in the plot, the phase changes quickly to +180°. As $z_1$ is moved arbitrarily close to −1 from inside, this fast change to +180° becomes instantaneous in the limit.

Fig. 9 Explaining the phase behavior as a zero moves from outside to inside the unit circle

Idiosyncrasy 10 As the sample rate crosses a frequency at which a zero enters the unit circle, the phase angle first develops an instantaneous −90° change, then develops an instantaneous +180° change. The net result is that just before the zero enters the unit circle there is a sudden −90° change, and just after it enters this is converted to a sudden +90° phase change.

Figure 10 considers a 5th order pole excess using Eq. (17), and plots all of the zeros introduced by discretization as a function of sample rate. There are two zeros inside the unit circle; one is indistinguishable from the top of the graph, and the second one is a small distance below. There is a line drawn at −1 corresponding to