Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution: Discounted and Average Criteria (SpringerBriefs in Probability and Mathematical Statistics) [1 ed.] 3030357198, 9783030357191

This SpringerBrief deals with a class of discrete-time zero-sum Markov games with Borel state and action spaces, and possibly unbounded payoffs, whose state process evolves according to a stochastic difference equation with an unknown disturbance distribution, under discounted and average optimality criteria.


English Pages 134 [129] Year 2020


Table of contents :
Preface
Contents
Summary of Notation and Terminology
Symbols and Abbreviations
Spaces of Functions
1 Zero-Sum Markov Games
1.1 Game Models
1.1.1 Difference-Equation Games: Estimation and Control
1.2 Strategies
1.3 Markov Game State Process
1.4 Optimality Criteria
2 Discounted Optimality Criterion
2.1 Minimax-Maximin Optimality Conditions
2.2 Discounted Optimal Strategies
2.2.1 Asymptotic Optimality
2.3 Estimation and Control
2.3.1 Density Estimation
2.3.2 Asymptotically Optimal Strategies
2.3.3 Proof of Theorem 2.11
3 Average Payoff Criterion
3.1 Continuity and Ergodicity Conditions
3.2 The Vanishing Discount Factor Approach (VDFA)
3.2.1 Difference Equation Average Game Models
3.3 Estimation and Control Under the VDFA
3.3.1 Average Optimal Pair of Strategies
3.3.2 Proof of Theorem 3.5
4 Empirical Approximation-Estimation Algorithms in Markov Games
4.1 Assumptions and Preliminary Results
4.2 The Discounted Empirical Game
4.2.1 Empirical Estimation Process
4.2.2 Discounted Optimal Strategies
4.3 Empirical Approximation Under Average Criterion
4.4 Average Optimal Strategies
4.5 Empirical Recursive Methods
5 Difference-Equation Games: Examples
5.1 Continuity Conditions
5.2 Autoregressive Game Models
5.3 Linear-Quadratic Games
5.4 A Game Model for Reservoir Operations
5.5 A Storage Game Model
5.6 Equicontinuity and Equi-Lipschitz Conditions
5.7 Numerical Implementations
5.7.1 Linear-Quadratic Games
5.7.2 Finite Games: Empirical Approximation
A Elements from Analysis
A.1 Semicontinuous Functions
A.2 Spaces of Functions
A.3 Multifunctions and Selectors
B Probability Measures and Weak Convergence
B.1 The Empirical Distribution
C Stochastic Kernels
C.1 Difference-Equation Processes
D Review on Density Estimation
D.1 Error Criteria
D.2 Density Estimators
D.2.1 The Kernel Estimate
D.2.2 Projection Estimates
D.2.3 The Parametric Case
References
Index


SPRINGER BRIEFS IN PROBABILITY AND MATHEMATICAL STATISTICS

J. Adolfo Minjárez-Sosa

Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution Discounted and Average Criteria


SpringerBriefs in Probability and Mathematical Statistics

Editor-in-Chief: Mark Podolskij, University of Aarhus, Aarhus C, Denmark

Series Editors: Nina Gantert, Technische Universität München, Munich, Germany; Richard Nickl, University of Cambridge, Cambridge, UK; Sandrine Péché, Université Paris Diderot, Paris, France; Gesine Reinert, University of Oxford, Oxford, UK; Mathieu Rosenbaum, Université Pierre et Marie Curie, Paris, France; Wei Biao Wu, University of Chicago, Chicago, IL, USA

SpringerBriefs present concise summaries of cutting-edge research and practical applications across a wide spectrum of fields. Featuring compact volumes of 50 to 125 pages, the series covers a range of content from professional to academic. Briefs are characterized by fast, global electronic dissemination, standard publishing contracts, standardized manuscript preparation and formatting guidelines, and expedited production schedules. Typical topics might include:
- A timely report of state-of-the-art techniques
- A bridge between new research results, as published in journal articles, and a contextual literature review
- A snapshot of a hot or emerging topic
- Lecture or seminar notes making a specialist topic accessible for non-specialist readers
SpringerBriefs in Probability and Mathematical Statistics showcase topics of current relevance in the field of probability and mathematical statistics. Manuscripts presenting new results in a classical field, a new field, or an emerging topic, or bridges between new results and already published works, are encouraged. This series is intended for mathematicians and other scientists with interest in probability and mathematical statistics. All volumes published in this series undergo a thorough refereeing process. The SBPMS series is published under the auspices of the Bernoulli Society for Mathematical Statistics and Probability.

More information about this series at http://www.springer.com/series/14353

J. Adolfo Minjárez-Sosa

Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution Discounted and Average Criteria

J. Adolfo Minjárez-Sosa, Department of Mathematics, University of Sonora, Hermosillo, Sonora, Mexico

ISSN 2365-4333 ISSN 2365-4341 (electronic) SpringerBriefs in Probability and Mathematical Statistics ISBN 978-3-030-35719-1 ISBN 978-3-030-35720-7 (eBook) https://doi.org/10.1007/978-3-030-35720-7 Mathematics Subject Classification: 91A15, 90C40, 62G05 © The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To my two women: Francisca and Camila

Preface

Discrete-time zero-sum Markov games constitute a class of stochastic games introduced by Shapley in [65] whose evolution over time can be described as follows. At each stage, players 1 and 2 observe the current state x of the game and independently choose actions a and b, respectively. Then, player 1 receives a payoff r(x, a, b) from player 2 and the game moves to a new state y in accordance with a transition probability or a transition function F as in (1) below. The payoffs are accumulated throughout the evolution of the game over a finite or infinite horizon under a specific optimality criterion. Even though there are now many studies in this field under multiple variants, it is mostly assumed that all components of the game are completely known by the players. However, the environment in which the game evolves may make this assumption unrealistic or too strong. Hence, the availability of approximation and estimation algorithms that provide players with some insight on the evolution of the game is important, so that they can select their actions more accurately.

An important feature of this book is that it deals with a class of Markov games with Borel state and action spaces, and possibly unbounded payoffs, under discounted and average criteria, whose state process {x_t} evolves according to a stochastic difference equation of the form

\[ x_{t+1} = F(x_t, a_t, b_t, \xi_t), \quad t = 0, 1, \ldots \tag{1} \]

Here, the pair (a_t, b_t) represents the actions chosen by players 1 and 2, respectively, at time t, and {ξ_t} is the disturbance process, an observable sequence of independent and identically distributed random variables whose distribution θ is unknown to both players. In this scenario, our concern is a game played over an infinite horizon evolving as follows. At stage t, once the players have observed the state x_t, and before choosing the actions a_t and b_t, players 1 and 2 implement a statistical estimation process to obtain estimates θ_t^1 and θ_t^2 of θ, respectively. Then,

independently, the players adapt their decisions to these estimators to select actions a = a_t(θ_t^1) and b = b_t(θ_t^2). Next, the game jumps to a new state according to the transition probability determined by Eq. (1) and the unknown distribution θ, and the process is repeated over and over again. This book is the first part of a project whose objective is to make a systematic analysis of recent developments in this kind of games. Specifically, in this first part we provide the theoretical foundations of the procedures combining statistical estimation and control techniques for the construction of strategies of the players. We generically call this combination "estimation and control" procedures. The second part of the project will deal with other classes of game models, as well as with approximation and computational aspects. The statistical estimation process will be studied from two approaches. In the first one, we assume that the distribution θ has a density ρ on ℜ^k. In this case, there is a vast literature (see, e.g., [9–11, 27] and references therein) that provides different density estimation methods that can be adapted to the conditions imposed by the problem being analyzed. Among these we can mention kernel density estimation, L^q estimation for q ≥ 1, and projection estimation, through which it is possible to obtain several important properties such as the rate of convergence. The second approach is provided by the empirical distribution θ_t defined by the random disturbance process {ξ_t}. This method is very general in the sense that both the random variables ξ_t and the distribution θ can be arbitrary. The price to be paid for this generality is that its applicability is restricted, because it is necessary to impose stronger conditions on the game model than in the previous case. Nevertheless, the use of the empirical distribution has the additional advantage that it provides an approximation method for the value of the game and the optimal strategies of the players, in cases where the distribution θ is difficult to handle, by replacing θ with the simpler distribution θ_t. In general terms, our approach to obtain estimation and control procedures for both discounted and average criteria consists of combining a statistical estimation method suitable for θ with game theory techniques. Our starting point is to first prove the existence of a value of the game as well as measurable minimizers/maximizers in the Shapley equation. To this end, some conditions are imposed on the game model which fall within the weighted-norm approach proposed by Wessels in [76], later fully studied in [23, 24, 31] for Markov decision processes (MDPs) and recently for zero-sum stochastic games in [32, 40, 41, 44, 48]. The estimation method is then adapted to these conditions to obtain appropriate convergence properties. Clearly, the good behavior of the strategies obtained through the estimation and control procedures depends on the accuracy of the estimation method, and even more on the optimality criterion with which their performance is measured. For instance, it is well known that the discounted criterion depends strongly on the decisions selected in the early stages of the game, precisely where the estimation process yields deficient information about the unknown distribution θ. Thus, neither player 1 nor player 2 can generally ensure the existence of discounted optimal strategies, and optimality under a discounted criterion is therefore studied in an asymptotic sense.

The notion of asymptotic optimality used in this book for Markov games was motivated by Schäl [67], who introduced this concept to study adaptive MDPs. In contrast, in view of the asymptotic analysis inherent in the study of the average criterion, the strategies obtained by means of estimation and control procedures turn out to be average optimal, provided suitable ergodicity conditions hold. According to the historical development of the theories of stochastic control and Markov games, the problem of estimation and control for MDPs, also known as the adaptive Markov control problem, has received considerable attention in recent years (see, e.g., [2, 7, 22, 25, 26, 28, 29, 33–35, 52–55, 67] and references therein). In fact, even though approximation algorithms for stochastic games and games with partial information have been studied from several points of view (see, e.g., [8, 17, 20, 43, 46, 59, 60, 63] and references therein), in the field of statistical estimation and control procedures for Markov games the literature remains scarce; we can cite, for instance, [50, 56–58, 69, 70]. In particular, [56] deals with semi-Markov zero-sum games with unknown sojourn time distribution. The works [69, 70] study repeated games assuming that the transition law depends on an unknown parameter which is estimated by the maximum likelihood method, whereas [50, 56–58] deal with the theory developed in the context of this book.

The book is organized as follows. In Chap. 1 the class of Markov game models we deal with is introduced, together with the main elements necessary to define the game problem. Chapters 2 and 3 are devoted to analyzing the discounted and the average criteria, respectively, where estimation and control procedures are presented under the assumption that the distribution θ has a density on ℜ^k. Empirical estimation-approximation methods are given in Chap. 4. In this case, by using the empirical distribution to estimate θ, both discounted and average criteria are analyzed. Finally, several examples of the class of Markov games studied throughout the book are given in Chap. 5. In this part we focus mainly on illustrating our assumptions on the game model, as well as on the numerical implementation of the estimation and control algorithms in specific examples.

Acknowledgments. The work of the author has been partially supported by Consejo Nacional de Ciencia y Tecnología (CONACYT-México) grant CB/2015-254306. Special thanks to Onésimo Hernández-Lerma, who read a draft version; his valuable comments and suggestions helped to improve the book. I also want to thank my colleagues Fernando Luque-Vásquez, Oscar Vega-Amaya, and Carmen Higuera-Chan, with whom I form the "controllers team" of the University of Sonora; undoubtedly, many of the ideas discussed in our long talks are included in this book. Finally, I deeply thank Donna Chernyk, Associate Editor at Springer, for her help.

Hermosillo, Mexico
August 2019

J. Adolfo Minjárez-Sosa


Summary of Notation and Terminology

Symbols and Abbreviations

N        Set of positive integers
N_0      Set of nonnegative integers
ℜ        Set of real numbers
ℜ_+      Set of nonnegative real numbers
1_D(·)   Indicator function of the set D
:=       Equality by definition
a.e.     Almost everywhere
a.s.     Almost surely
i.i.d.   Independent and identically distributed
r.v.     Random variable
p.m.     Probability measure
l.s.c.   Lower semicontinuous
u.s.c.   Upper semicontinuous

Spaces of Functions

• The space L^q = L^q(ℜ^k), for 1 ≤ q < ∞, consists of all real-valued measurable functions on ℜ^k with finite L^q-norm

  \[ \|\rho\|_{L^q} := \left( \int_{\Re^k} |\rho|^q \, d\mu \right)^{1/q}, \]

  where μ is the Lebesgue measure.
• A Borel space is a Borel subset of a complete separable metric space.

For a Borel space X, we use the following notation:

B(X)               Borel σ-algebra in X; "measurable," for either sets or functions, means "Borel measurable."
B(X)               Space of real-valued bounded measurable functions on X with the supremum norm ‖v‖_B := sup_{x∈X} |v(x)|.
C(X) ⊂ B(X)        Subspace of bounded continuous functions.
L(X)               Space of lower semicontinuous functions that are bounded from below.
B_W(X)             For a function W : X → [1, ∞), space of measurable functions with finite weighted norm (W-norm) ‖v‖_W := sup_{x∈X} |v(x)|/W(x).
C_W(X) ⊂ B_W(X)    Subspace of W-bounded continuous functions.
L_W(X) ⊂ B_W(X)    Subspace of W-bounded lower semicontinuous functions.
P(X)               Space of probability measures on X endowed with the weak topology (see Appendix B).
P(X|Y)             Family of stochastic kernels on X given Y, where X and Y are Borel spaces.

Chapter 1

Zero-Sum Markov Games

In this chapter we present the class of zero-sum Markov games we are interested in. We first introduce the game model which is a collection of objects describing the evolution in time of the games. In addition, in order to define the corresponding game problem, the concept of strategies for players is given along with the optimality criteria that will be analyzed in the next chapters. We pay special attention to a class of Markov games whose evolution over time is modeled by means of a stochastic difference equation. It is precisely in this kind of games that we will study estimation and control schemes, assuming that the random disturbance process involved in their dynamics has an unknown distribution.

1.1 Game Models

A zero-sum Markov game model is defined by the collection

\[ \mathcal{GM} := (X, A, B, K_A, K_B, Q, r) \tag{1.1} \]

formed by:

(a) A Borel space X called the state space.
(b) Borel spaces A and B representing the action sets for players 1 and 2, respectively.
(c) The constraint sets K_A and K_B, which are assumed to be Borel subsets of X × A and X × B, respectively. Moreover, for each x ∈ X, the x-sections

    A(x) := {a ∈ A : (x, a) ∈ K_A}  and  B(x) := {b ∈ B : (x, b) ∈ K_B}

    represent the admissible actions or control sets for players 1 and 2, respectively, and the set

    K := {(x, a, b) : x ∈ X, a ∈ A(x), b ∈ B(x)}

    of admissible state-action triplets is a Borel subset of X × A × B.
(d) A stochastic kernel Q(·|·) on X given K, called the transition law. That is, if x ∈ X is the state of the game at time t, and players 1 and 2 select actions a ∈ A(x) and b ∈ B(x), respectively, then Q(·|x, a, b) is the distribution of the next state of the game:

    \[ Q(D \mid x, a, b) = \Pr[x_{t+1} \in D \mid x_t = x,\ a_t = a,\ b_t = b], \quad D \in \mathcal{B}(X). \tag{1.2} \]

(e) A measurable function r : K → ℜ that represents the one-stage payoff function.

The game is played as follows. At each time t ∈ N0 , the players observe the state of the game xt = x ∈ X. Next, players 1 and 2 select, independently, actions at = a ∈ A(x) and bt = b ∈ B(x) respectively. Then, player 1 receives a payoff r(x, a, b) from player 2, and the game jumps to a new state xt+1 = y ∈ X according to the transition law Q(·|x, a, b). Once the game is in the new state, the process is repeated. Therefore the goal of player 1 (player 2) is to maximize (minimize) his/her rewards (cost).
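To make this round-by-round protocol concrete, the following minimal sketch simulates one stage of play for a finite-state, finite-action game; the payoff array r, the transition kernel Q, and the mixed actions phi1, phi2 are hypothetical placeholders introduced only for illustration, not data taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite game: 2 states, 2 actions per player.
r = np.array([[[1.0, -1.0], [0.5, 0.0]],        # r[x, a, b]
              [[0.0, 2.0], [-0.5, 1.0]]])
Q = np.array([[[[0.9, 0.1], [0.4, 0.6]],        # Q[x, a, b, y]
               [[0.3, 0.7], [0.8, 0.2]]],
              [[[0.5, 0.5], [0.2, 0.8]],
               [[0.6, 0.4], [0.1, 0.9]]]])

def play_one_stage(x, phi1, phi2):
    """One stage: independent randomized actions, payoff to player 1, transition to a new state."""
    a = rng.choice(2, p=phi1[x])          # player 1 samples a ~ phi1(.|x)
    b = rng.choice(2, p=phi2[x])          # player 2 samples b ~ phi2(.|x), independently
    payoff = r[x, a, b]                   # player 2 pays r(x, a, b) to player 1
    x_next = rng.choice(2, p=Q[x, a, b])  # next state ~ Q(.|x, a, b)
    return a, b, payoff, x_next

phi1 = np.array([[0.5, 0.5], [0.7, 0.3]])  # stationary mixed strategies (hypothetical)
phi2 = np.array([[0.2, 0.8], [0.5, 0.5]])
print(play_one_stage(0, phi1, phi2))
```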

1.1.1 Difference-Equation Games: Estimation and Control

There are many situations where the evolution of the game is modeled by a stochastic difference equation of the form

\[ x_{t+1} = F(x_t, a_t, b_t, \xi_t), \quad t \in \mathbb{N}_0, \tag{1.3} \]

where F : K × S → X is a given measurable function and {ξ_t} is a sequence of observable i.i.d. random variables defined on a probability space (Ω, F, P), taking values in a Borel space S, with common distribution θ ∈ P(S), and independent of the initial state x_0. In this case the transition law Q in (1.2) is determined by the function F and the distribution θ as (see Appendix C.1)

\[ Q(D \mid x, a, b) = \theta(\{s \in S : F(x, a, b, s) \in D\}) = \int_S 1_D[F(x, a, b, s)]\, \theta(ds), \quad D \in \mathcal{B}(X), \tag{1.4} \]

for (x, a, b) ∈ K. If θ has a density ρ on S = ℜ^k with respect to Lebesgue measure, then Q takes the form

\[ Q(D \mid x, a, b) = \int_{\Re^k} 1_D[F(x, a, b, s)]\, \rho(s)\, ds, \quad D \in \mathcal{B}(X),\ (x, a, b) \in K. \]

Typical examples of dynamics as in (1.3) appear, for instance, in autoregressive and linear game models. In the first one, the game's state process evolves according to an equation of the form x_{t+1} = G(a_t, b_t) x_t + ξ_t for t ∈ N_0, with initial state x_0, state space X = [0, ∞), action sets A(x) ⊂ A = ℜ and B(x) ⊂ B = ℜ for x ∈ X, and G : A × B → (0, λ] a given function with λ < 1. In the linear models the dynamics of the game are x_{t+1} = x_t + a_t + b_t + ξ_t for t ∈ N_0, with x_0 given, where X = A = B = ℜ. Both models will be analyzed in Chap. 5, together with additional examples illustrating the theory developed throughout the book. The estimation and control procedures proposed in the next chapters are established for this kind of games by assuming that the random disturbance process {ξ_t} is observable with a common and unknown distribution θ ∈ P(S). In this context, unlike the standard evolution of a Markov game, at each stage t ∈ N_0, on the knowledge of the state x_t = x and possibly the history of the game, before choosing the actions a_t and b_t, players 1 and 2 obtain estimates θ_t^1 and θ_t^2 of the unknown distribution θ, respectively, and independently adapt their strategies to select actions a = a_t(θ_t^1) ∈ A(x) and b = b_t(θ_t^2) ∈ B(x). Next, the game evolves as in the standard case, i.e., player 1 receives a payoff r(x, a, b) from player 2, and the system visits a new state x_{t+1} ∈ X according to the transition law in (1.4). Assuming observability of the process {ξ_t} allows the players to implement statistical estimation methods for θ. Among them, the empirical distribution constitutes the most general method in the sense that both the disturbance space S and the distribution θ ∈ P(S) can be arbitrary (see Appendix B.1). In the particular case where θ has a density ρ on S = ℜ^k, the spectrum of statistical methods to estimate ρ is wider, so that there are more options to choose the most appropriate one according to the conditions of the problem being analyzed (see Appendix D). Both approaches will be analyzed in the following chapters; it is worth emphasizing that they differ in the type of conditions needed for their implementation, and therefore in the arguments used in the corresponding proofs.
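As an illustration of this estimation-and-control loop, the sketch below simulates the autoregressive dynamics x_{t+1} = G(a_t, b_t) x_t + ξ_t and lets each player keep the empirical sample of the observed disturbances as its estimate of θ. The particular choice G(a, b) = λ/(1 + a² + b²), the exponential disturbances, and the placeholder decision rules are illustrative assumptions, not specifications from the book.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.8                                    # ensures G(a, b) takes values in (0, lam], lam < 1

def G(a, b):
    return lam / (1.0 + a**2 + b**2)

def F(x, a, b, xi):                          # difference-equation dynamics (1.3)
    return G(a, b) * x + xi

# Placeholder decision rules: each player maps (state, current disturbance sample) to an action.
def policy_player1(x, xi_samples):
    return 0.5 * (np.mean(xi_samples) if len(xi_samples) else 1.0)

def policy_player2(x, xi_samples):
    return 1.0 / (1.0 + x)

x = 1.0
observed = []                                # observed disturbances xi_0, ..., xi_{t-1}
for t in range(10):
    a = policy_player1(x, observed)          # actions adapted to the current estimate of theta
    b = policy_player2(x, observed)
    xi = rng.exponential(scale=1.0)          # xi_t ~ theta (unknown to both players)
    x = F(x, a, b, xi)
    observed.append(xi)                      # xi_t is observable, so the estimate can be updated
    print(f"t={t}  a={a:.3f}  b={b:.3f}  x_next={x:.3f}")
```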

1.2 Strategies

The players select their actions by means of rules called strategies, defined below. Let H_0 := X and H_t := K × H_{t−1} for t ∈ N. Then a generic element of H_t is

h_t := (x_0, a_0, b_0, . . . , x_{t−1}, a_{t−1}, b_{t−1}, x_t),

which represents the history of the game up to time t. On the other hand, for each x ∈ X, we define 𝔸(x) := P(A(x)) and 𝔹(x) := P(B(x)), as well as the sets of stochastic kernels

Φ^1 := {ϕ^1 ∈ P(A|X) : ϕ^1(·|x) ∈ 𝔸(x) ∀x ∈ X},
Φ^2 := {ϕ^2 ∈ P(B|X) : ϕ^2(·|x) ∈ 𝔹(x) ∀x ∈ X}.

Definition 1.1. (a) A strategy for player 1 is a sequence π^1 = {π_t^1} of stochastic kernels π_t^1 ∈ P(A|H_t) such that

π_t^1(A(x_t)|h_t) = 1  ∀h_t ∈ H_t, t ∈ N_0.

We denote by Π^1 the family of all strategies for player 1.
(b) A strategy π^1 = {π_t^1} ∈ Π^1 is called a Markov strategy if π_t^1 is in Φ^1 for all t ∈ N_0, and it is called stationary if

π_t^1(·|h_t) = ϕ^1(·|x_t)  ∀h_t ∈ H_t, t ∈ N_0,

for some stochastic kernel ϕ^1 in Φ^1, so that π^1 is of the form π^1 = {ϕ^1, ϕ^1, . . .} := {ϕ^1}. We denote by Π_s^1 the class of stationary strategies for player 1.

Let F^1 be the set of all measurable functions f^1 : X → A such that f^1(x) ∈ A(x) for all x ∈ X, and let F^2 be the set of all measurable functions f^2 : X → B such that f^2(x) ∈ B(x) for all x ∈ X.

Definition 1.2. A strategy π^1 = {π_t^1} for player 1 is said to be a
(a) pure (or deterministic) strategy if there exists a sequence {g_t^1} of measurable functions g_t^1 : H_t → A such that, for all h_t ∈ H_t and t ∈ N_0, g_t^1(h_t) ∈ A(x_t) and

π_t^1(A′|h_t) = 1_{A′}[g_t^1(h_t)] for all A′ ∈ B(A),

which means that π_t^1(·|h_t) is concentrated at g_t^1(h_t);
(b) pure Markov strategy if there is a sequence {f_t^1} of functions f_t^1 ∈ F^1 such that π_t^1(·|h_t) is concentrated at f_t^1(x_t) ∈ A(x_t) for all h_t ∈ H_t and t ∈ N_0;
(c) pure stationary strategy if there is a function f^1 ∈ F^1 such that π_t^1(·|h_t) is concentrated at f^1(x_t) ∈ A(x_t) for all h_t ∈ H_t and t ∈ N_0.

We denote by Π_D^1 the family of all pure strategies for player 1. The sets Π^2, Π_s^2, and Π_D^2 of all strategies, all stationary strategies, and all pure strategies for player 2 are defined similarly. Observe that

Π_D^i ⊂ Π^i, for i = 1, 2.

Wherever appropriate, we shall use the following notation related to the probability measures in the sets 𝔸(x) and 𝔹(x). For probability measures ϕ^1(·|x) ∈ 𝔸(x) and ϕ^2(·|x) ∈ 𝔹(x), and x ∈ X, we write ϕ^i(x) = ϕ^i(·|x), i = 1, 2. In addition, for a measurable function u : K → ℜ,

\[ u(x, \varphi^1, \varphi^2) := \int_{B(x)} \int_{A(x)} u(x, a, b)\, \varphi^1(da|x)\, \varphi^2(db|x) = u(x, \varphi^1(x), \varphi^2(x)). \tag{1.5} \]

For instance, for x ∈ X we have

\[ r(x, \varphi^1, \varphi^2) := \int_{B(x)} \int_{A(x)} r(x, a, b)\, \varphi^1(da|x)\, \varphi^2(db|x). \]

For the case of games evolving according to a difference equation as in (1.3), we consider histories of the form

\[ h_t := (x_0, a_0, b_0, \xi_0, . . . , x_{t−1}, a_{t−1}, b_{t−1}, \xi_{t−1}, x_t) \in K \times S \times H_{t−1}, \tag{1.6} \]

for t ∈ N. In addition, for x ∈ X and s ∈ S, we write

\[ v(F(x, \varphi^1, \varphi^2, s)) := \int_{B(x)} \int_{A(x)} v(F(x, a, b, s))\, \varphi^1(da|x)\, \varphi^2(db|x), \]

for a measurable function v : X → ℜ.

1.3 Markov Game State Process

Let (Ω, F) be the measurable space consisting of the sample space Ω = K^∞ and its product σ-algebra F. Following standard arguments (see, e.g., [13]), from the Theorem of C. Ionescu-Tulcea (Proposition C.2 in Appendix C) we have that for each pair of strategies (π^1, π^2) ∈ Π^1 × Π^2 and initial state x_0 = x ∈ X, there exists a unique probability measure P_x^{π^1,π^2} and a stochastic process {(x_t, a_t, b_t)}, where x_t and (a_t, b_t) represent the state and the actions of the players, respectively, at stage t ∈ N_0, satisfying, for D ∈ B(X), A′ ∈ B(A), and B′ ∈ B(B),

\[ P_x^{\pi^1,\pi^2}[x_0 \in D] = \delta_x(D); \tag{1.7} \]
\[ P_x^{\pi^1,\pi^2}[a_t \in A' \mid h_t] = \pi_t^1(A' \mid h_t); \tag{1.8} \]
\[ P_x^{\pi^1,\pi^2}[b_t \in B' \mid h_t] = \pi_t^2(B' \mid h_t); \tag{1.9} \]
\[ P_x^{\pi^1,\pi^2}[a_t \in A',\ b_t \in B' \mid h_t] = \pi_t^1(A' \mid h_t)\, \pi_t^2(B' \mid h_t); \tag{1.10} \]
\[ P_x^{\pi^1,\pi^2}[x_{t+1} \in D \mid h_t, a_t, b_t] = Q(D \mid x_t, a_t, b_t); \tag{1.11} \]

where δ_x(·) is the Dirac measure concentrated at x. We denote by E_x^{π^1,π^2} the expectation operator with respect to P_x^{π^1,π^2}.

In the scenario of difference-equation games (1.3), the measurable space consists of the sample space Ω = (K × S)^∞ with the corresponding product σ-algebra F. Then, considering histories of the form (1.6), in addition to the properties (1.7)–(1.11), we have that for each (π^1, π^2) ∈ Π^1 × Π^2 and x ∈ X, the probability measure P_x^{π^1,π^2} and the stochastic process {(x_t, a_t, b_t, ξ_t)} satisfy

\[ P_x^{\pi^1,\pi^2}[\xi_t \in S' \mid h_t, a_t, b_t] = \theta(S'), \quad S' \in \mathcal{B}(S). \]

The stochastic process {x_t} defined on (Ω, F, P_x^{π^1,π^2}) is called the game's state process.

1.4 Optimality Criteria

For each pair of strategies (π^1, π^2) ∈ Π^1 × Π^2 and initial state x_0 = x ∈ X, we define the total expected α-discounted payoff as

\[ V_\alpha(x, \pi^1, \pi^2) := E_x^{\pi^1,\pi^2}\left[ \sum_{t=0}^{\infty} \alpha^t r(x_t, a_t, b_t) \right], \tag{1.12} \]

where α ∈ (0, 1) represents the discount factor. We also define the long-run expected average payoff as

\[ J(x, \pi^1, \pi^2) := \liminf_{n \to \infty} \frac{1}{n}\, E_x^{\pi^1,\pi^2}\left[ \sum_{t=0}^{n-1} r(x_t, a_t, b_t) \right]. \tag{1.13} \]

The lower and the upper values of the discounted game are given as

\[ L_\alpha(x) := \sup_{\pi^1 \in \Pi^1} \inf_{\pi^2 \in \Pi^2} V_\alpha(x, \pi^1, \pi^2), \quad x \in X, \tag{1.14} \]

and

\[ U_\alpha(x) := \inf_{\pi^2 \in \Pi^2} \sup_{\pi^1 \in \Pi^1} V_\alpha(x, \pi^1, \pi^2), \quad x \in X, \tag{1.15} \]

respectively. Observe that, in general, U_α(·) ≥ L_α(·), but if it holds that U_α(·) = L_α(·), the common function is called the α-value of the game and is denoted by V_α(·). Now, if the discounted game has a value V_α(·), a strategy π_*^1 ∈ Π^1 is said to be α-optimal for player 1 if

\[ V_\alpha(x) = \inf_{\pi^2 \in \Pi^2} V_\alpha(x, \pi_*^1, \pi^2), \quad x \in X. \]

Similarly, a strategy π_*^2 ∈ Π^2 is said to be α-optimal for player 2 if

\[ V_\alpha(x) = \sup_{\pi^1 \in \Pi^1} V_\alpha(x, \pi^1, \pi_*^2), \quad x \in X. \]

In this case, (π_*^1, π_*^2) is an α-optimal pair of strategies or saddle point. Note that (π_*^1, π_*^2) is an optimal pair if and only if

\[ V_\alpha(x, \pi^1, \pi_*^2) \le V_\alpha(x, \pi_*^1, \pi_*^2) \le V_\alpha(x, \pi_*^1, \pi^2) \tag{1.16} \]

for all x ∈ X, π^1 ∈ Π^1, π^2 ∈ Π^2.

The lower value L(·) and the upper value U(·) for the average payoff criterion are defined similarly, and the average value of the game is denoted by J(·). Then, if the average game has a value J(·), a strategy π_*^1 ∈ Π^1 is said to be average optimal for player 1 if

\[ J(x) = \inf_{\pi^2 \in \Pi^2} J(x, \pi_*^1, \pi^2), \quad x \in X; \tag{1.17} \]

and a strategy π_*^2 ∈ Π^2 is said to be average optimal for player 2 if

\[ J(x) = \sup_{\pi^1 \in \Pi^1} J(x, \pi^1, \pi_*^2), \quad x \in X. \tag{1.18} \]

The pair (π_*^1, π_*^2) is called an average optimal pair of strategies if (1.17) and (1.18) hold; equivalently,

\[ J(x, \pi^1, \pi_*^2) \le J(x, \pi_*^1, \pi_*^2) \le J(x, \pi_*^1, \pi^2) \tag{1.19} \]

for all x ∈ X, π^1 ∈ Π^1, π^2 ∈ Π^2.

ε-Optimal Strategies. If the α-discounted game has a value V_α(·), a strategy π_*^1 ∈ Π^1 is said to be ε-optimal for player 1, for ε ≥ 0, if

\[ V_\alpha(\cdot) - \varepsilon \le \inf_{\pi^2 \in \Pi^2} V_\alpha(\cdot, \pi_*^1, \pi^2). \]

In addition, a strategy π_*^2 ∈ Π^2 is said to be ε-optimal for player 2 if

\[ V_\alpha(\cdot) + \varepsilon \ge \sup_{\pi^1 \in \Pi^1} V_\alpha(\cdot, \pi^1, \pi_*^2). \]

Therefore, a strategy π_*^i ∈ Π^i is optimal for player i, i = 1, 2, if it is 0-optimal. In a similar way we define ε-optimal strategies for the average payoff criterion.

Our objective in the following chapters is to study the existence of optimal pairs of strategies for difference equation games (1.3) when the disturbance distribution θ is unknown for both players. Specifically, we will establish estimation and control procedures that lead to the construction of optimal strategies.
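Before turning to those procedures, it may help to see how the discounted payoff (1.12) can be evaluated by simulation for a fixed pair of stationary strategies. The sketch below uses the same hypothetical finite game data as the earlier sketch (r[x,a,b], Q[x,a,b,y], phi1, phi2 are illustrative assumptions) and truncates the infinite series at a finite horizon.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.9

# Hypothetical 2-state, 2-action game.
r = np.array([[[1.0, -1.0], [0.5, 0.0]],
              [[0.0, 2.0], [-0.5, 1.0]]])
Q = np.array([[[[0.9, 0.1], [0.4, 0.6]],
               [[0.3, 0.7], [0.8, 0.2]]],
              [[[0.5, 0.5], [0.2, 0.8]],
               [[0.6, 0.4], [0.1, 0.9]]]])
phi1 = np.array([[0.5, 0.5], [0.7, 0.3]])   # fixed stationary strategies
phi2 = np.array([[0.2, 0.8], [0.5, 0.5]])

def discounted_payoff(x0, horizon=150, runs=500):
    """Monte Carlo estimate of V_alpha(x0, phi1, phi2); the horizon truncates the series (1.12)."""
    total = 0.0
    for _ in range(runs):
        x, acc = x0, 0.0
        for t in range(horizon):
            a = rng.choice(2, p=phi1[x])
            b = rng.choice(2, p=phi2[x])
            acc += alpha**t * r[x, a, b]
            x = rng.choice(2, p=Q[x, a, b])
        total += acc
    return total / runs

print(discounted_payoff(0))
```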

Chapter 2

Discounted Optimality Criterion

We consider the game model GM := (X, A, B, K_A, K_B, Q, r) introduced in (1.1). The problems we are concerned with in this chapter are those related to the discounted case, which are summarized as follows.

1. Establish conditions to prove the existence of a value of the game and a pair of optimal strategies. That is, prove the existence of a function V_α and a pair of strategies (π_*^1, π_*^2) ∈ Π^1 × Π^2 such that, for all x ∈ X and (π^1, π^2) ∈ Π^1 × Π^2,

   U_α(x) = L_α(x) = V_α(x), and

   V_α(x, π^1, π_*^2) ≤ V_α(x, π_*^1, π_*^2) = V_α(x) ≤ V_α(x, π_*^1, π^2),

   where

   \[ V_\alpha(x, \pi^1, \pi^2) := E_x^{\pi^1,\pi^2}\left[ \sum_{t=0}^{\infty} \alpha^t r(x_t, a_t, b_t) \right] \tag{2.1} \]

   is the total expected α-discounted payoff, and U_α and L_α are the upper and the lower values of the game (see (1.12), (1.14), (1.15), and (1.16)). To obtain these results we need some concepts and techniques on multifunctions and measurable selectors that are summarized in Appendix A.

2. Once the previous problem is solved, we will focus on introducing estimation and control procedures in difference-equation game models, described in Sect. 1.1.1, assuming that {ξ_t} is a sequence of i.i.d. random variables with unknown density ρ.

2.1 Minimax-Maximin Optimality Conditions

Throughout the chapter we will be dealing with the following Shapley operator, defined on a family of measurable functions v on X:

\[ T_\alpha v(x) := \inf_{\varphi^2 \in \mathbb{B}(x)} \sup_{\varphi^1 \in \mathbb{A}(x)} \left\{ r(x, \varphi^1, \varphi^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi^1, \varphi^2) \right\}, \tag{2.2} \]

for x ∈ X. The main issue in (2.2) is to ensure the interchange of inf and sup as well as the existence of measurable selectors of the multifunctions A(x) and B(x) (see Definition A.4 in Appendix A). To this end, it is necessary to impose suitable continuity and compactness conditions on the game model, such as the following.

Assumption 2.1
(a) The multifunctions x → A(x) and x → B(x) are compact-valued and continuous.
(b) The payoff function r(x, a, b) is continuous in (x, a, b) ∈ K.
(c) The transition law Q is weakly continuous, that is, the mapping

    (x, a, b) → ∫_X v(y) Q(dy|x, a, b)

    is continuous on K for every function v ∈ C(X).

Assumption 2.2
(a) For each x ∈ X, the sets A(x) and B(x) are compact.
(b) For each (x, a, b) ∈ K, r(x, ·, b) is upper semicontinuous (u.s.c.) on A(x), and r(x, a, ·) is lower semicontinuous (l.s.c.) on B(x).
(c) For each (x, a, b) ∈ K and v ∈ B(X), the functions

    a → ∫_X v(y) Q(dy|x, a, b)  and  b → ∫_X v(y) Q(dy|x, a, b)

    are continuous on A(x) and B(x), respectively.

For a measurable function v : X → ℜ, (x, a, b) ∈ K, ϕ^1(x) ∈ 𝔸(x), and ϕ^2(x) ∈ 𝔹(x), we define (see (1.5))

\[ H(x, a, b) := r(x, a, b) + \alpha \int_X v(y)\, Q(dy \mid x, a, b) \]

and

\[ \bar{H}(x, \varphi^1, \varphi^2) := \int_{B(x)} \int_{A(x)} H(x, a, b)\, \varphi^1(da|x)\, \varphi^2(db|x). \]

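When A(x) and B(x) are finite, the inf-sup in (2.2) reduces, for each fixed x and v, to solving a matrix game with payoff matrix H(x, a, b); the sketch below computes its value and optimal mixed actions by linear programming. The 3×3 matrix is a hypothetical example, and the formulation (shifting the matrix to be positive and normalizing) is one standard way to do this, not a construction taken from the book.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Value and optimal mixed strategies of the zero-sum game with payoff M[a, b] to player 1."""
    shift = max(0.0, -M.min()) + 1.0           # make every entry positive
    P = M + shift
    n_a, n_b = P.shape
    # Row player: maximize v s.t. P^T p >= v, sum(p) = 1  <=>  minimize sum(u) s.t. P^T u >= 1.
    res = linprog(c=np.ones(n_a), A_ub=-P.T, b_ub=-np.ones(n_b),
                  bounds=[(0, None)] * n_a, method="highs")
    value = 1.0 / res.x.sum()
    p = res.x * value                          # optimal mixed action for player 1
    # Column player: minimize v s.t. P q <= v, sum(q) = 1  <=>  maximize sum(w) s.t. P w <= 1.
    res2 = linprog(c=-np.ones(n_b), A_ub=P, b_ub=np.ones(n_a),
                   bounds=[(0, None)] * n_b, method="highs")
    q = res2.x / res2.x.sum()                  # optimal mixed action for player 2
    return value - shift, p, q

H = np.array([[3.0, -1.0, 0.0],                # hypothetical H(x, a, b) for a fixed state x
              [1.0,  2.0, -2.0],
              [0.0,  1.0,  1.0]])
print(solve_matrix_game(H))
```

In this finite setting, p and q play the role of the optimal randomized actions ϕ_*^1(x) and ϕ_*^2(x) appearing in the relations discussed next.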

From the "extended Fatou Lemma" (see [31, Lemma 8.3.7]) and Proposition B.2 in Appendix B, it is easy to prove that if H is l.s.c. (u.s.c.) on K, then H̄ is also l.s.c. (u.s.c.) on X × 𝔸(x) × 𝔹(x). Thus, from Proposition B.3, provided that either Assumption 2.1 or 2.2 holds, Berge's Theorem, Fan's minimax theorem, and suitable selection theorems, given in Appendix A (see also [61]), yield the existence of (ϕ_*^1, ϕ_*^2) ∈ Φ^1 × Φ^2 such that, for all x ∈ X, ϕ_*^1(x) ∈ 𝔸(x) and ϕ_*^2(x) ∈ 𝔹(x) satisfy

\[
\begin{aligned}
T_\alpha v(x) &= \sup_{\varphi^1 \in \mathbb{A}(x)} \inf_{\varphi^2 \in \mathbb{B}(x)} \left\{ r(x, \varphi^1, \varphi^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi^1, \varphi^2) \right\} \\
&= r(x, \varphi_*^1, \varphi_*^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi_*^1, \varphi_*^2) \\
&= \max_{\varphi^1 \in \mathbb{A}(x)} \left\{ r(x, \varphi^1, \varphi_*^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi^1, \varphi_*^2) \right\} \\
&= \min_{\varphi^2 \in \mathbb{B}(x)} \left\{ r(x, \varphi_*^1, \varphi^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi_*^1, \varphi^2) \right\}.
\end{aligned} \tag{2.3}
\]

It is worth observing that if the continuity, convexity, and concavity conditions required in Berge's Theorem and Fan's minimax theorem are satisfied by the components of the game model, then either Assumption 2.1 or 2.2, together with a selection theorem (see Appendix A), yields the existence of (f_*^1, f_*^2) ∈ F^1 × F^2 such that, for all x ∈ X, f_*^1(x) ∈ A(x) and f_*^2(x) ∈ B(x) satisfy

\[
\begin{aligned}
T_\alpha v(x) &= \sup_{a \in A(x)} \inf_{b \in B(x)} \left\{ r(x, a, b) + \alpha \int_X v(y)\, Q(dy \mid x, a, b) \right\} \\
&= r(x, f_*^1, f_*^2) + \alpha \int_X v(y)\, Q(dy \mid x, f_*^1, f_*^2) \\
&= \max_{a \in A(x)} \left\{ r(x, a, f_*^2) + \alpha \int_X v(y)\, Q(dy \mid x, a, f_*^2) \right\} \\
&= \min_{b \in B(x)} \left\{ r(x, f_*^1, b) + \alpha \int_X v(y)\, Q(dy \mid x, f_*^1, b) \right\}.
\end{aligned} \tag{2.4}
\]

This situation occurs, for instance, in the linear-quadratic games, as we will see in Chap. 5. Another set of minimax (maximin) optimality conditions implying relations similar to (2.3) is the following weaker assumption.

Assumption 2.3
(a) The mapping x → A(x) is l.s.c. and A(x) is complete for every x ∈ X;
(b) The mapping x → B(x) is u.s.c. and B(x) is compact for every x ∈ X;
(c) The function r(·, ·, ·) ≥ 0 is l.s.c. on K;
(d) The mapping

\[ (x, a, b) \to \int_X v(y)\, Q(dy \mid x, a, b) \tag{2.5} \]

    is continuous on K for every v ∈ C(X).

Note that from Proposition C.1(b), Assumption 2.3(d) (or Assumption 2.1(c)) implies that the mapping in (2.5) is lower semicontinuous whenever v(·) is in L(X).

Remark 2.1. Under Assumption 2.3, from Küenle [47] we have that the operator T_α maps L(X) into itself. In addition, for all x ∈ X and v ∈ L(X),

\[ T_\alpha v(x) := \sup_{\varphi^1 \in \mathbb{A}(x)} \inf_{\varphi^2 \in \mathbb{B}(x)} \left\{ r(x, \varphi^1, \varphi^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi^1, \varphi^2) \right\}. \tag{2.6} \]

Moreover, for each ε > 0, there exist ϕ_ε^1 ∈ Φ^1 and ϕ_*^2 ∈ Φ^2 such that, for all x ∈ X,

\[ T_\alpha v(x) = \sup_{\varphi^1 \in \mathbb{A}(x)} \left\{ r(x, \varphi^1, \varphi_*^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi^1, \varphi_*^2) \right\}, \tag{2.7} \]

and

\[ T_\alpha v(x) - \varepsilon \le \inf_{\varphi^2 \in \mathbb{B}(x)} \left\{ r(x, \varphi_\varepsilon^1, \varphi^2) + \alpha \int_X v(y)\, Q(dy \mid x, \varphi_\varepsilon^1, \varphi^2) \right\} \quad \forall x \in X. \tag{2.8} \]


2.2 Discounted Optimal Strategies

In this section we address the problem of the existence of the value function and optimal strategies for the discounted game. The theory is developed under the setting of the weaker Assumption 2.3 together with the following growth condition, which allows us to handle the unbounded payoff case.

Assumption 2.4 There exist a continuous function W ≥ 1 on X and positive constants β < 1, M < ∞, and d < ∞ such that the following inequalities hold for all (x, a, b) ∈ K:
(a) 0 ≤ r(x, a, b) ≤ M W(x);
(b) ∫_X W(y) Q(dy|x, a, b) ≤ β W(x) + d;
(c) the function (x, a, b) → ∫_X W(y) Q(dy|x, a, b) is continuous on K.

Remark 2.2. If the payoff function r is bounded, say by the constant M, then Assumption 2.4 holds by taking W ≡ 1 and d = 1.

From Van Nunen and Wessels [72], we have that Assumption 2.4 together with Assumption 2.3 implies that the Shapley operator T_α has a unique fixed point in a certain space of measurable functions. To state this fact precisely we first introduce some definitions and notation. For each measurable function u : X → ℜ, define the W-weighted norm, W-norm for short, as

\[ \|u\|_W := \sup_{x \in X} \frac{|u(x)|}{W(x)}, \]

and denote by B_W the normed linear space of all measurable functions with finite W-norm. Let L_W be the subspace of functions in L(X) that belong to B_W. It is easy to verify that L_W is a Banach space.

Remark 2.3 (Contraction Property of T_α [72]). Suppose that Assumptions 2.3 and 2.4 hold. For each discount factor α ∈ (0, 1), we fix an arbitrary number γ_α ∈ (α, 1) and define the function W̄(x) := W(x) + e, x ∈ X, where e := d(γ_α/α − 1)^{-1}. Consider the space B_{W̄} of measurable functions v : X → ℜ with finite W̄-norm, that is,

\[ \|v\|_{\bar{W}} := \sup_{x \in X} \frac{|v(x)|}{\bar{W}(x)} < \infty. \]

Observe that B_W = B_{W̄} and the norms ‖·‖_W and ‖·‖_{W̄} are equivalent, since

\[ \|v\|_{\bar{W}} \le \|v\|_W \le l_\alpha \|v\|_{\bar{W}} \quad \text{for } v \in B_W, \tag{2.9} \]

where

\[ l_\alpha := 1 + e = 1 + \frac{\alpha d}{\gamma_\alpha - \alpha}. \tag{2.10} \]

Then, from [72, Lemma 1], the function W̄ satisfies the inequality

\[ \alpha \int_S \bar{W}[F(x, a, b, s)]\, \mu(ds) \le \gamma_\alpha \bar{W}(x) \quad \forall (x, a, b) \in K. \tag{2.11} \]

Following straightforward calculations, it is easy to see that, for each α ∈ (0, 1), inequality (2.11) implies that the operator T_α is a contraction from L_W into itself with respect to the W̄-norm with modulus γ_α. Hence:

(a) For all v, u ∈ B_W,

\[ \|T_\alpha v - T_\alpha u\|_{\bar{W}} \le \gamma_\alpha \|v - u\|_{\bar{W}}. \tag{2.12} \]

(b) Thus, by the Banach Fixed Point Theorem, T_α has a unique fixed point V_α ∈ L_W, i.e.,

\[ T_\alpha V_\alpha = V_\alpha, \tag{2.13} \]

and, as n → ∞,

\[ \|T_\alpha^n u - V_\alpha\|_{\bar{W}} \to 0 \quad \forall u \in L_W, \]

where T_α^n u = T_α(T_α^{n-1} u) for n ≥ 1.

(c) Furthermore, since ‖·‖_W and ‖·‖_{W̄} are equivalent norms on L_W,

\[ \|T_\alpha^n u - V_\alpha\|_W \to 0 \quad \text{as } n \to \infty,\ \forall u \in L_W. \]

The next theorem shows the existence of optimal strategies for the discounted stochastic game. Its proof follows from (2.6)–(2.8), Remark 2.3, and standard dynamic programming arguments.

Theorem 2.5. Suppose that Assumptions 2.3 and 2.4 hold. Then:
(a) The function V_α in Remark 2.3 is the value of the discounted game.
(b) There exists ϕ_*^2 ∈ Φ^2 such that

\[ V_\alpha(x) = \sup_{\varphi^1 \in \mathbb{A}(x)} \left\{ r(x, \varphi^1, \varphi_*^2) + \alpha \int_X V_\alpha(y)\, Q(dy \mid x, \varphi^1, \varphi_*^2) \right\} \quad \forall x \in X; \]

moreover, for each ε > 0 there exists ϕ_ε^1 ∈ Φ^1 such that

\[ V_\alpha(x) - \varepsilon \le \inf_{\varphi^2 \in \mathbb{B}(x)} \left\{ r(x, \varphi_\varepsilon^1, \varphi^2) + \alpha \int_X V_\alpha(y)\, Q(dy \mid x, \varphi_\varepsilon^1, \varphi^2) \right\} \quad \forall x \in X. \]

(c) Thus, π_ε^1 = {ϕ_ε^1} ∈ Π_s^1 is an ε-optimal strategy for player 1 and π_*^2 = {ϕ_*^2} ∈ Π_s^2 is an optimal strategy for player 2, that is,

\[ V_\alpha(\cdot) - \varepsilon \le \inf_{\pi^2 \in \Pi^2} V_\alpha(\cdot, \pi_\varepsilon^1, \pi^2) \quad \text{and} \quad V_\alpha(\cdot) = \sup_{\pi^1 \in \Pi^1} V_\alpha(\cdot, \pi^1, \pi_*^2). \]

Theorem 2.5 states the existence of a l.s.c. value function V_α. It is clear that if continuity of V_α is required, we need to impose more restrictive conditions. For instance, as is proved in [42, Theorem 4.2], this is achieved under Assumptions 2.1 and 2.4. Specifically, we have the following result.

Theorem 2.6. Suppose that Assumptions 2.1 and 2.4 hold. Then, for each α ∈ (0, 1):
(a) The discounted payoff game has a value V_α ∈ C_W and

\[ \|V_\alpha\|_W \le \frac{M}{1 - \alpha}. \tag{2.14} \]


(b) The value Vα satisfies Tα Vα = Vα , and there exists (ϕ∗1 , ϕ∗2 ) ∈ Φ 1 × Φ 2 , such that ϕ∗1 (x) ∈ A(x) and ϕ∗2 ∈ B(x) satisfy, for all x ∈ X, 

Vα (x) = r(x, ϕ∗1 , ϕ∗2 ) + α Vα (y)Q(dy|x, ϕ∗1 , ϕ∗2 ) X 

 1 2 1 2 = max r(x, ϕ , ϕ∗ ) + α Vα (y)Q(dy|x, ϕ , ϕ∗ ) ϕ 1 ∈A(x)

= min

ϕ 2 ∈B(x)

(2.15)

X



 r(x, ϕ∗1 , ϕ 2 ) + α Vα (y)Q(dy|x, ϕ∗1 , ϕ 2 ) .

(2.16)

X

    In addition, π∗1 = ϕ∗1 ∈ Πs1 and π∗2 = ϕ∗2 ∈ Πs2 form an optimal pair of strategies. From Remark 2.3, the condition in Assumption 2.4(b) can be replaced by the following assumption (see (2.11)).   1 such that Assumption 2.7 There exists a constant γ˜ ∈ 1, α  S

W (y)Q(dy|x, a, b, s) ≤ γ˜W (x), ∀(x, a, b) ∈ K.

Moreover, under Assumptions 2.4(a), (c) and 2.7, it is easy to prove that the operator Tα is a contraction with respect to the W -norm modulus α γ˜ ∈ (0, 1). Hence, provided that Assumption 2.3 (or Assumption 2.1) also holds, Theorem 2.5 (or Theorem 2.6) remains valid.
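To see the fixed-point iteration of Remark 2.3 at work in the simplest possible setting, the sketch below iterates the Shapley operator for a small finite game with bounded payoffs (so W ≡ 1 and the W-norm is the sup norm), solving the matrix game at each state by linear programming. All game data and the LP formulation are illustrative assumptions, not constructions from the book.

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
r = np.array([[[1.0, -1.0], [0.5, 0.0]],        # hypothetical payoff r[x, a, b]
              [[0.0, 2.0], [-0.5, 1.0]]])
Q = np.array([[[[0.9, 0.1], [0.4, 0.6]],        # hypothetical transition law Q[x, a, b, y]
               [[0.3, 0.7], [0.8, 0.2]]],
              [[[0.5, 0.5], [0.2, 0.8]],
               [[0.6, 0.4], [0.1, 0.9]]]])

def matrix_game_value(M):
    """Value of the zero-sum matrix game M[a, b] (payoff to the maximizing player)."""
    shift = max(0.0, -M.min()) + 1.0
    P = M + shift
    res = linprog(c=np.ones(P.shape[0]), A_ub=-P.T, b_ub=-np.ones(P.shape[1]),
                  bounds=[(0, None)] * P.shape[0], method="highs")
    return 1.0 / res.x.sum() - shift

def shapley_operator(v):
    """T_alpha v: at each state solve the game with payoff H(x, a, b) = r + alpha * E[v(next state)]."""
    return np.array([matrix_game_value(r[x] + alpha * Q[x] @ v) for x in range(r.shape[0])])

v = np.zeros(r.shape[0])
for n in range(500):
    v_new = shapley_operator(v)
    if np.max(np.abs(v_new - v)) < 1e-8:        # sup-norm contraction with modulus alpha
        break
    v = v_new
print(n + 1, v_new)                             # iterations used and the approximate value V_alpha
```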

2.2.1 Asymptotic Optimality

There are some situations in which it is not possible to obtain optimal strategies for the players under the usual criterion. When this is the case, it is necessary to search for schemes that yield "nearly" optimal strategies, which requires analyzing the concept of optimality in a weaker sense. This is the situation treated in the next section for difference-equation game models, where a statistical estimation method for the unknown density of the random disturbance process is combined with control procedures in order to construct strategies. Specifically, we adapt to stochastic games the notion of asymptotic optimality introduced by Schäl in [67] (see also [28]) to study adaptive Markov control processes under the discounted criterion. To introduce this weaker optimality criterion, we define the discrepancy function D : K → ℜ as D(x, a, b) := r(x, a, b) + α

 X

Vα (y)Q(dy|x, a, b) −Vα (x)


for all (x, a, b) ∈ K. Observe that the relation (2.13) is equivalent to sup

inf D(x, ϕ 1 , ϕ 2 ) =

2 ϕ 1 ∈A(x) ϕ ∈B(x)

inf

sup D(x, ϕ 1 , ϕ 2 ) = 0.

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

Moreover, from Theorem 2.5(b), the pair (ϕε1 , ϕ∗2 ) satisfies, for all x ∈ X, D(x, ϕε1 , ϕ 2 ) ≥ −ε

∀ϕ 2 ∈ B(x),

D(x, ϕ 1 , ϕ∗2 ) ≤ 0

∀ϕ 1 ∈ A(x).

(2.17)

These facts motivate the following definition. Definition 2.1. A strategy π∗1 ∈ Π 1 is said to be asymptotically discounted optimal (AD-optimal) for player 1 if π 1 ,π 2

lim inf Ex ∗ t→∞

D(xt , at , bt ) ≥ 0 ∀x ∈ X, π 2 ∈ Π 2 .

Similarly, π∗2 ∈ Π 2 is said to be AD-optimal for player 2 if π 1 ,π∗2

lim sup Ex t→∞

D(xt , at , bt ) ≤ 0 ∀x ∈ X, π 1 ∈ Π 1 .

A pair of AD-optimal strategies is called an AD-optimal pair. In this case, if (π∗1 , π∗2 ) is an AD-optimal pair, then, for each x ∈ X, π 1 ,π∗2

lim Ex ∗

t→∞

D(xt , at , bt ) = 0 ∀x ∈ X.

2.3 Estimation and Control We consider a discrete-time two person zero-sum Markov game whose state process {xt } ⊂ X evolves according to the equation (see Sect. 1.1.1) xt+1 = F(xt , at , bt , ξt ), t = 0, 1, . . . ,

(2.18)

where F : K × ℜk → X is a given measurable function and (at , bt ) ∈ A(xt ) × B(xt ). The perturbation process {ξt } is formed by observable i.i.d. ℜk -valued random variables, for some fixed nonnegative integer k, independent of the initial state x0 , with common probability density ρ (·) which is unknown to both players. Moreover, we suppose that the realizations ξ0 , ξ1 , . . . of the disturbance process and the states x0 , x1 , . . . are completely observable.


Unlike the standard game model, observe that the solution to the game given in Theorem 2.5 (or Theorem 2.6) is not accessible to the players. Since ρ is unknown, they need to combine statistical density estimation procedures and control processes to gain some insights on the evolution of the game. Hence, in this case our concern is in a game played over an infinite horizon evolving as follows: at each time t ∈ N0 , the players observe the state of the game xt = x ∈ X; next, on the record of a sample ξ¯t := (ξ0 , ξ1 , . . . , ξt−1 ) and possibly taking into account the history of the game, players 1 and 2 get estimates ρt1 = ρt1 (ξ¯t ) and ρt2 = ρt2 (ξ¯t ) of the unknown density ρ , respectively, and adapt independently their strategies to choose actions a = at (ρt1 ) ∈ A(x) and b = bt (ρt2 ) ∈ B(x), respectively. As a consequence, player 2 pays the amount r(x, a, b) to player 1 and the system visits a new state xt+1 = x ∈ X according to the evolution law (see (1.4)) 

Q(D|x, a, b) :=

ℜk

1D [F(x, a, b, s)]ρ (s)ds D ∈ B(X).

As in the standard case, the goal of player 1 (player 2, resp.) is to maximize (minimize, resp.) the total expected α -discounted payoff (1.12) (see (2.1)). Because the discounted criterion depends strongly on the decision selected at the early stages (precisely when the information about the density ρ is deficient) this estimation and control procedure yields in the best case suboptimal strategies in the long term. Therefore the optimality in this case will be studied under the notion of asymptotic optimality given in Definition 2.1. Observe that the Shapley operator (2.2) takes the form T(α ,ρ ) v(x) := Tα v(x) =

inf

 1 2 sup r(x, ϕ , ϕ ) + α

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

ℜk

 v(F(x, ϕ , ϕ , s))ρ (s)ds . 1

2

(2.19) Now, to show the existence of minimizers/maximizers we need to impose the following conditions. Assumption 2.8 (a) For each s ∈ ℜk , F(·, ·, ·, s) is continuous. (b) Moreover, there exist constants λ0 ∈ (0, 1), d0 ≥ 0, p > 1, and M > 0 such that for all (x, a, b) ∈ K it holds that 0 ≤ r(x, a, b) ≤ MW (x),  ℜk

W p [F(x, a, b, s)]ρ (ds) ≤ λ0W p (x) + d0 .

(2.20) (2.21)


Remark 2.4. (a) From Proposition C.3, Assumption 2.8(a) implies that the mapping (x, a, b) →

 ℜk

v(F(x, a, b, s))μ (ds)

(2.22)

is continuous for each function v ∈ C(X) and each probability measure μ (·) on ℜk , which yields Assumptions 2.1(c) and 2.3(d) hold true. In fact, it is easy to see that Assumption 2.4(c) holds as well. Thus, from Proposition C.1 (b), if v(·) belongs to L(X), then the mapping (2.22) is in L(K). On the other hand, by Jensen’s inequality, 1/p 1/p relation (2.21) implies Assumption 2.4(b) with β := λ0 and d := d0 , that is,  ℜk

W (F(x, a, b, s))ρ (s)ds ≤ β W (x) + d,

(2.23)

Then, we have that Assumption 2.8 implies Assumption 2.4. Hence, under Assumptions 2.3(a)–(c) and 2.8, the conclusions of Theorem 2.5 are still valid. Similarly, Theorem 2.6 holds under Assumptions 2.1(a), (b) and 2.8. (b) Furthermore, provided that Assumption 2.8(a) and inequality (2.23) hold, together with either Assumption 2.3(a)–(c) or Assumption 2.1(a)–(b), the facts established in Remark 2.4(a) are valid for any density σ on ℜk such that  ℜk

W (F(x, a, b, s))σ (s)ds ≤ β W (x) + d.

(2.24)

Therefore, from Theorem 2.5 (resp. Theorem 2.6) there exists a value Vασ ∈ LW (resp. Vασ ∈ CW ) such that T(α ,σ )Vασ (x) = Vασ (x), x ∈ X. (c) Additionally, note that Assumption 2.8(b) yields sup Exπ

1 ,π 2

t∈N0

[W p (xt )] < ∞ and sup Exπ

1 ,π 2

[W (xt )] < ∞

(2.25)

t∈N0

for each pair (π 1 , π 2 ) ∈ Π 1 × Π 2 and x ∈ X. Indeed, observe that by iterating inequality (2.24) we get Exπ Hence,

1 ,π 2

W (xt ) ≤ β t W (x) +

sup Exπ

1 ,π 2

1 − βt d ∀x ∈ X,π 1 ∈ Π 1 , π 2 ∈ Π 2 ,t ∈ N. 1−β

W (xt ) < ∞ ∀x ∈ X, π 1 ∈ Π 1 , π 2 ∈ Π 2 .

(2.26)

t∈N0

Similarly, inequality (2.21) yields sup Exπ

t∈N0

1 ,π 2

W p (xt ) < ∞ ∀x ∈ X, π 1 ∈ Π 1 , π 2 ∈ Π 2 .

(2.27)


2.3.1 Density Estimation Denote by L1 the space of functions defined on ℜk which are integrable with respect to the Lebesgue measure, and by || · ||L1 the corresponding norm. We show the existence of estimators of ρ ∈ L1 with suitable properties for the construction of asymptotically optimal strategies for both players. To state this precisely we impose the following assumptions on the density. Assumption 2.9 There exists a measurable function ρ on ℜk , such that ρ (·) ≤ ρ(·) a.e. with respect to the Lebesgue measure. Assumption 2.10 The function

ψ (s) :=

1 W [F(x, a, b, s)], s ∈ ℜk , (x,a,b)∈K W (x) sup

(2.28)

is finite and satisfies the integrability condition

Ψ¯ :=

 ℜk

ψ 2 (s)ρ(s)ds < ∞.

(2.29)

Note that the function ψ in (2.28) is upper semianalytic, thus universally measurable. Hence, the integral in (2.29) is well defined (see [5, Ch. 7, Section 7.6]). Let ξ0 , ξ1 , . . . , ξt , . . . be independent realizations of random vectors with density ρ (·), and let ρti (s) := ρti (s; ξ0 , ξ1 , . . . , ξt−1 ), for s ∈ ℜk and t ∈ N, be density estimators for players i = 1, 2, such that E||ρti − ρ ||L1 = E

 ℜk

  i ρt (s) − ρ (s) ds → 0, as t → ∞.

(2.30)

A wide class of estimators satisfying this condition is given, for instance, by the kernel estimates

\[ \rho_t^i(s) = \frac{1}{t\, d_t^k} \sum_{i=1}^{t} K\!\left( \frac{s - \xi_i}{d_t} \right), \]

where the kernel K is a nonnegative measurable function with ∫_{ℜ^k} K(s) ds = 1, and d_t → 0 and t d_t^k → ∞ as t goes to infinity (see Appendix D.2.1).
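A minimal numerical sketch of such a kernel estimate follows; the Gaussian kernel, the bandwidth sequence d_t = t^{-1/5}, and the exponential example density are illustrative assumptions, not prescriptions from the book.

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel_density_estimate(xi, grid, d_t):
    """rho_t(s) = (1 / (t * d_t)) * sum_i K((s - xi_i) / d_t) with a Gaussian kernel, k = 1."""
    t = len(xi)
    u = (grid[:, None] - xi[None, :]) / d_t
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (t * d_t)

t = 500
xi = rng.exponential(scale=1.0, size=t)          # observed disturbances xi_0, ..., xi_{t-1}
grid = np.linspace(0.0, 6.0, 121)
rho_t = kernel_density_estimate(xi, grid, d_t=t ** (-1.0 / 5.0))
true_rho = np.exp(-grid)                          # density of the (here known) exponential example
l1_error = np.sum(np.abs(rho_t - true_rho)) * (grid[1] - grid[0])
print(l1_error)                                   # approximate L1 error on the grid
```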

Now, we define the following class of densities. Definition 2.2. Let D be the set of densities σ (·) on ℜk satisfying the following conditions: D.1. σ (·) ≤ ρ(·) a.e. with respect to the Lebesgue measure; D.2.



ℜk W [F(x, a, b, s)]σ (s)ds ≤

β W (x) + d ∀(x, a, b) ∈ K,


where the function ρ and the constants β and d are as in Assumption 2.9 and Remark 2.4(a).

Remark 2.5. In the scenario of Assumption 2.7, Condition D.2 is replaced by D.2 .

 ℜk

W [F(x, a, b, s)]σ (s)ds ≤ γ˜W (x) ∀(x, a, b) ∈ K,

  1 where γ˜ ∈ 1, . α

Proposition 2.1. Under Assumption 2.10, the set D is a closed and convex subset of L1 . Proof. First observe that the convexity of D follows directly from the Definition 2.2. We then proceed to prove that D is closed. To this end, let us fix a sequence {σn } ⊂ D such that L

σn →1 σ ∈ L1 as n → ∞.

(2.31)

Let l be the Lebesgue measure on ℜk and suppose that there is a set G ⊂ ℜk with l(G) > 0 and such that σ (s) > ρ(s), s ∈ G. Then, for some ε > 0 and G ⊂ G with l(G ) > 0 σ (s) > ρ(s) + ε , ∀s ∈ G . (2.32) On the other hand, since σn ∈ D, for n ∈ N0 , there exists Gn ⊂ [0, ∞) with l(Gn ) = 0, such that σn (s) ≤ ρ(s) for s ∈ ℜk \Gn , n ∈ N0 . (2.33) Combining (2.32) and (2.33) we get |σn (s) − σ (s)| ≥ ε , ∀s ∈ G ∩ (ℜk \Gn ), n ∈ N0 . Using the fact that l(G ∩ (ℜk \Gn ) > 0 we obtain that σn does not converge to σ in measure, which is a contradiction to the convergence in L1 . Therefore σ (s) ≤ ρ(s) a.e. Now we will prove that σ satisfies the inequality in Condition D.2. Because  ℜk

W [F(x, a, b, s)]σn (s)ds ≤ β W (x) + d ∀(x, a, b) ∈ K, n ∈ N0 ,

it is enough to show that for all (x, a, b) ∈ K,  ℜk

W [F(x, a, b, s)]σn (s)ds →

 ℜk

W [F(x, a, b, s)]σ (s)ds,

(2.34)


as n → ∞. From Assumption 2.10, we have that W [F(x, a, b, s)] ≤ W (x)ψ (s) for all (x, a, b) ∈ K, s ∈ ℜk . Then      W [F(x, a, b, s)] [σn (s) − σ (s)] ds In : =  ℜk

≤ W (x) ≤ W (x)



ℜ

k

ψ (s) |σn (s) − σ (s)| ds 1

ℜk

1

ψ (s) |σn (s) − σ (s)| 2 |σn (s) − σ (s)| 2 ds.

By applying H¨older’s inequality and taking into account that σn (·) ≤ ρ(·) and σ (·) ≤ ρ(·) a.e., we get 0 ≤ In ≤ W (x) ≤ W (x)

 ℜk



1/2  ψ (s) |σn (s) − σ (s)| 2

ℜk

ℜk

1/2  ψ 2 (s) (2ρ(s))

≤ 21/2Ψ¯ 1/2W (x)

 ℜk

ℜk

1/2 |σn (s) − σ (s)| ds 1/2

|σn (s) − σ (s)| ds 1/2

|σn (s) − σ (s)| ds

,

(2.35)

where the last inequality comes from (2.29). Hence, letting n → ∞ in (2.35), from (2.31) we get In → 0, which yields (2.34). Finally, we prove that σ is a density. To this end, we first observe that σ (·) ≥ 0 a.e. Now, similar as (2.35),       ≤ 1 − |σn (s) − σ (s)| ds σ (s)ds   ℜk



 ℜk

2ρ(s)

ℜk

1/2 

1/2 ℜk

|σn (s) − σ (s)| ds

→ 0 as n → ∞.



Since ψ (·) ≥ 1, this fact and (2.29) imply ℜk σ (s)ds = 1, and therefore σ is a density on ℜk . This proves that D is closed.  From Proposition 2.1 and Theorem D.5 in Appendix D we can use a projection density estimation method to obtain our estimator. That is, for each t ∈ N, there exists ρti (·) ∈ D such that ||ρti − ρti ||L1 = inf ||σ − ρti ||L1 for i = 1, 2. σ ∈D

(2.36)

Notice that E||ρti − ρ ||L1 → 0 since (see (2.30)) ||ρti − ρ ||L1 ≤ ||ρti − ρti ||L1 + ||ρti − ρ ||L1 ≤ 2||ρti − ρ ||L1 ∀t ∈ N.

(2.37)

22

2 Discounted Optimality Criterion

The densities ρti (·) defined for i = 1, 2, as

ρti := argmin ||σ − ρti ||L1

(2.38)

σ ∈D

will be used as estimators to obtain asymptotically optimal strategies for each player. First, however, we will express their convergence property using a more suitable norm defined as follows. For a measurable function σ : ℜk → ℜ define 1 x∈X a∈A(x),b∈B(x) W (x)

||σ || := sup



sup

ℜk

W [F(x, a, b, s)]|σ (s)|ds.

(2.39)

Note that Condition D.2 guarantees that ||σ || < ∞ for any density σ in D. Lemma 2.1. If Assumptions 2.9 and 2.10 hold, then E||ρti − ρ || → 0 as t → ∞ for i = 1, 2. Proof. Proceeding similarly as in (2.35), from Holder’s inequality, (2.28) and (2.29), ||ρti − ρ || ≤



 ℜk

  ψ (s) ρti (s) − ρ (s) ds



  ψ (s) ρti (s) − ρ (s) ds

1/2 

2

ℜk

≤C



 ℜk

  i ρt (s) − ρ (s) ds

1/2

ℜk

  i ρt (s) − ρ (s) ds

= C ||ρti − ρ ||L1 , 1/2

1/2

(2.40)

where C := 21/2Ψ¯ 1/2 . The result follows from (2.40) since E||ρti − ρ ||L1 → 0 as 1/2 t → ∞, which, from Remark D.1 (Appendix D), implies E||ρti − ρ ||L1 → 0 as t → ∞. 

2.3.2 Asymptotically Optimal Strategies To obtain asymptotically optimal strategies we will use a nonstationary value iteration approach when the players use the density estimators {ρti } as the true density. Hence, for i = 1, 2, we introduce the operators

T(α ,ρti ) v(x) :=

inf

sup

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

r(x, ϕ , ϕ ) + α 1



2

ℜk



v F(x, ϕ 1 , ϕ 2 , s) ρti (s)ds



2.3 Estimation and Control

23

for x ∈ X and t ∈ N. Note that these operators are contractions from (LW , || · ||W¯ ) into itself with modulus γα —see Condition D.2 and Remark 2.3(a)—provided that Assumptions 2.3(a)–(c) and 2.8 hold. Thus, in this case, from Remark 2.1, the “minimax” value iteration functions for player i (=1, 2) i t ∈ N, U0i ≡ 0, Uti := T(α ,ρti )Ut−1

belong to the space LW . Moreover, there exists a sequence {ϕ¯t2 } ⊂ Φ 2 for player 2 such that 



2 2 1 ¯2 2 1 ¯2 Ut−1 F(x, ϕ , ϕt , s) ρt (s)ds (2.41) Ut (x) = sup r(x, ϕ , ϕt ) + α ℜk

ϕ 1 ∈A(x)

for all x ∈ X,t ∈ N. Similarly, for any sequence of positive numbers {εt } converging to zero there exists a sequence {ϕ¯t1 } ⊂ Φ 1 such that

Ut1 (x) − εt ≤

inf

ϕ 2 ∈B(x)

 r(x, ϕ¯t1 , ϕ 2 ) + α

ℜk



1 F(x, ϕ¯t1 , ϕ 2 , s) ρt1 (s)ds Ut−1

 (2.42)

for all x ∈ X,t ∈ N. Furthermore, a straightforward calculation shows that, for some positive constants C1 and C2 ,  i  Ut (x) ≤ CiW (x) ∀t ∈ N0 , x ∈ X, i = 1, 2.

(2.43)

Remark 2.6. In the particular case when ρt1 = ρt2 = ρt , the sequences {ϕ¯t1 } ⊂ Φ 1 and {ϕ¯t2 } ⊂ Φ 2 are defined through the value iteration functions Ut := T(α ,ρt )Ut−1 t ∈ N, U0 ≡ 0. That is, for x ∈ X and t ∈ N, Ut (x) =

 sup r(x, ϕ 1 , ϕ 2 ) + α

inf

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

ℜk

= sup

ϕ 1 ∈A(x)



r(x, ϕ

1

, ϕ¯t2 ) + α

 ℜk

 1 2 ¯ inf r(x, ϕt , ϕ ) + α

ϕ 2 ∈B(x)

ℜk



Ut−1 F(x, ϕ 1 , ϕ 2 , s) ρt (s)ds



Ut−1 F(x, ϕ 1 , ϕ¯t2 , s) ρt (s)ds







1 2 ¯ Ut−1 F(x, ϕt , ϕ , s) ρt (s)ds + εt . (2.44)

In the scenario of Assumptions 2.1 (or 2.2) and 2.4, we can replace (2.44) by the following equation

24

2 Discounted Optimality Criterion

Ut (x) =

inf

sup

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

 r(x, ϕ 1 , ϕ 2 ) + α

ℜk

= sup

ϕ 1 ∈A(x)

=

r(x, ϕ

1

, ϕ¯t2 ) + α

 ℜk

 inf r(x, ϕ¯t1 , ϕ 2 ) + α

ϕ 2 ∈B(x)

ℜk



Ut−1 F(x, ϕ 1 , ϕ 2 , s) ρt (s)ds



Ut−1 F(x, ϕ 1 , ϕ¯t2 , s) ρt (s)ds







1 2 Ut−1 F(x, ϕ¯t , ϕ , s) ρt (s)ds .

(2.45)

Finally we state the main result as follows. Theorem 2.11. Suppose that Assumptions 2.3(a)–(c), 2.8, 2.9, and 2.10 hold. Then the strategies π¯∗1 = {ϕ¯t1 } ∈ Π 1 and π¯∗2 = {ϕ¯t2 } ∈ Π 2 are AD-optimal for player 1 and player 2, respectively. Thus, in particular, π¯ 1 ,π¯∗2

lim Ex ∗

t→∞

D(xt , at , bt ) = 0 ∀x ∈ X.

2.3.3 Proof of Theorem 2.11 The proof of Theorem 2.11 is a consequence of Lemmas 2.2–2.5 below. Throughout this subsection we suppose that all the assumptions in Theorem 2.11 hold. Lemma 2.2. For each i=1, 2:    ρ    ρ ρ T(α ,ρti )Vα − T(α ,ρ )Vα  ≤ α Vα W ρti − ρ  ∀t ∈ N.

(2.46)

Thus, for each (π 1 , π 2 ) ∈ Π 1 × Π 2 and x ∈ X,   1 2 ρ ρ lim Exπ ,π T(α ,ρti )Vα − T(α ,ρ )Vα  = 0.

(2.47)

W

t→∞

W

  Proof. Because ρti − ρ  does not depend on (π 1 , π 2 ) ∈ Π 1 × Π 2 and x ∈ X, note that (2.47) follows from (2.46) and Lemma 2.1. Thus we proceed to prove (2.46). To do this, for all x ∈ X, t ∈ N, and i = 1, 2, we have     ρ ρ T(α ,ρti )Vα (x) − T(α ,ρ )Vα (x)

≤ sup

   sup α

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

ℜk

ρ Vα F(x, ϕ 1 , ϕ 2 , s) ρti (s)ds −α

 ℜk

 

ρ Vα F(x, ϕ 1 , ϕ 2 , s) ρ (s)ds

2.3 Estimation and Control

sup α

≤ sup

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

 ρ ≤ α Vα W sup

25

 ℜk

  ρ

  Vα F(x, ϕ 1 , ϕ 2 , s)  ρti (s) − ρ (s) ds 

sup

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

ℜk



 W F(x, ϕ 1 , ϕ 2 , s) ρti (s) − ρ (s) ds

 ρ   ≤ α Vα W ρti − ρ  W (x). 

Hence, (2.46) holds.

Lemma 2.3. The following holds: 1 2 ρ lim E π ,π Uti −Vα W t→∞ x

=0

for each i = 1, 2, (π 1 , π 2 ) ∈ Π 1 × Π 2 , and x ∈ X. Proof. First note that for all t ∈ N and i = 1, 2,    i  ρ i Ut −Vαρ  ¯ =  T U − T V i  ¯ α ( α , ρ ) (α ,ρt ) t−1 W

W

      ρ ρ ρ i ≤ T(α ,ρti )Ut−1 − T(α ,ρti )Vα  + T(α ,ρti )Vα − T(α ,ρ )Vα  . ¯ ¯ W

W

Then, since T(α ,ρti ) is a contraction operator with modulus γα with respect to the W¯ -norm, it follows from Lemma 2.2 that  i   i  ρ   ρ Ut −Vαρ  ¯ ≤ γα Ut−1 (2.48) −Vα W¯ + α Vα W ρti − ρ  . W  i  Moreover, since E ρt − ρ  → 0, there exists a positive constant M such that Exπ

1 ,π 2

 i    Ut −Vαρ  ¯ ≤ γα Exπ 1 ,π 2 U i −Vαρ  ¯ + M . t−1 W W

Thus, by iterations of this inequality we obtain Exπ

1 ,π 2

t−1  i    Ut −Vαρ  ¯ ≤ (γα )t Vαρ  ¯ + M ∑ (γα )k W W k=0

 ρ M + Vα W¯ ≤ , 1 − γα which in turn implies that l := lim sup Exπ t→∞

1 ,π 2

 i  Ut −Vαρ  ¯ < ∞. W

26

2 Discounted Optimality Criterion

Now taking expectation on (2.48) and then limsup as t goes to infinity, we have that 0 ≤ l ≤ γα l, which yields that l = 0. This proves the desired result.  In the following we shall use the “approximate discrepancy functions” for player i (=1,2) defined as Dti (x, ϕ 1 , ϕ 2 ) := r(x, ϕ 1 , ϕ 2 ) + α

 ℜk

i Ut−1 (F(x, ϕ 1 , ϕ 2 , s))ρti (s)ds −Uti (x)

for x ∈ X, ϕ 1 ∈ A(x), ϕ 2 ∈ B(x). Lemma 2.4. For all x ∈ X and t ∈ N, it holds that   sup sup D(x, ϕ 1 , ϕ 2 ) − Dti (x, ϕ 1 , ϕ 2 ) ≤ W (x)ηti , ϕ 1 ∈A(x) ϕ 2 ∈B(x)

where   i  ρ   ρ ρ ηti := Uti −Vα W + (β + d) Ut−1 −Vα W + α Vα W ρti − ρ 

(2.49)

for all t ∈ N. 1 2 Proof. Let x ∈ X,  ϕ ∈ A(x), ϕ ∈ B(x), and t ∈ N be fixed but arbitrary, and write Rti (x, ϕ 1 , ϕ 2 ) := D(x, ϕ 1 , ϕ 2 ) − Dti (x, ϕ 1 , ϕ 2 ). Then, observe that

  ρ Rti (x, ϕ 1 , ϕ 2 ) ≤ Uti (x) −Vα (x)      ρ i 1 2 i 1 2  +α  Ut−1 (F(x, ϕ , ϕ , s)ρt (s) − Vα ((F(x, ϕ , ϕ , s))ρ (s) ℜk ℜk    ρ ≤ Uti (x) −Vα (x) + α

ℜk



 ℜk

  ρ Vα ((F(x, ϕ 1 , ϕ 2 , s)) ρ (s) − ρti (s) ds

  i U (F(x, ϕ 1 , ϕ 2 , s)) −Vαρ (F(x, ϕ 1 , ϕ 2 , s)) ρti (s)ds t−1

   ρ   ρ ≤ Uti (x) −Vα (x) + α Vα W ρti − ρ  W (x)   i ρ + Ut−1 −Vα W

ℜk

W (F(x, ϕ 1 , ϕ 2 , s))ρti (s)ds.

2.3 Estimation and Control

27

Now, since ρti (·) is in D, from Condition D.2 we have  ℜk

W [F(x, a, b, s)]ρti (s)ds ≤ β W (x) + d ≤ [β + d]W (x).

This implies    ρ   ρ Rti (x, ϕ 1 , ϕ 2 ) ≤ Uti (x) −Vα (x) + α Vα W ρti − ρ W W (x)  i ρ + Ut−1 −Vα W (β + d)W (x). Hence, sup

  sup D(x, ϕ 1 , ϕ 2 ) − Dti (x, ϕ 1 , ϕ 2 ) ≤ W (x)ηti

ϕ 1 ∈A(x) ϕ 2 ∈B(x)

for all x ∈ X,t ∈ N.



Lemma 2.5. Let the strategies π∗1 = {ϕ¯t1 } and π∗2 = {ϕ¯t2 } be as in (2.42) and (2.41). Then −εt −W (xt )ηt1 ≤ D(x, ϕ¯t1 , ϕ 2 ) ∀ϕ 2 ∈ B(x),t ∈ N, D(x, ϕ 1 , ϕ¯t2 ) ≤ W (xt )ηt2 ∀ϕ 1 ∈ A(x),t ∈ N, with ηti as in (2.49). Proof. The inequalities follow directly from Lemma 2.4 noting that inf Dt1 (x, ϕ¯t1 , ϕ 2 ) ≥ −εt ,

ϕ 2 ∈B(x)

and sup Dt2 (x, ϕ 1 , ϕ¯t2 ) = 0

ϕ 1 ∈A(x)

hold for all x ∈ X,t ∈ N.



Remark 2.7. Observe that from (2.49), and Lemmas 2.1 and 2.3, 1 2 lim E π ,π ηti t→∞ x

Thus, for i = 1, 2,

= 0 for i=1,2, (π 1 , π 2 ) ∈ Π 1 × Π 2 , and x ∈ X.

ηti → 0 in probability Pxπ

1 ,π 2

.

28

2 Discounted Optimality Criterion

Moreover, since ||σ || < ∞ for σ in D, from (2.43) we have ki := sup ηti < ∞ for i = 1, 2. t

Finally we are ready to prove our main theorem in this chapter. Proof (Proof of Theorem 2.11). From Lemma 2.5 it is enough to prove, for i=1, 2, that 1 2 Exπ ,π W (xt )ηti → 0 ∀x ∈ X, π 1 ∈ Π 1 , π 2 ∈ Π 2 . (2.50) To prove this fact, we begin by proving that the process {W (xt )ηti } converges to

zero in probability with respect to Pxπ ,π for all x ∈ X, π 1 ∈ Π 1 , π 2 ∈ Π 2 . To do this, let l1 and l2 be arbitrary positive constants; then observe that for all x ∈ X and t ∈ N,

  1 2 1 2 1 2 l1 + Pxπ ,π [W (xt ) > l2 ] . Pxπ ,π W (xt )ηti > l1 ≤ Pxπ ,π ηti > l2 1

2

Thus, Chebyshev’s inequality and (2.25) yield

  1 2 1 2 l1 1 1 2 Pxπ ,π W (xt )ηti > l1 ≤ Pxπ ,π ηti > + Exπ ,π W (xt ) l2 l2 1 2 ≤ Pxπ ,π

ηti

 M l1 > + l2 l2

for some constant M < ∞. Hence, from Remark 2.7, lim sup Pxπ

1 ,π 2

t→∞

  M W (xt )ηti > l1 ≤ . l2

Since l2 is arbitrary, we conclude that lim Pxπ

t→∞

1 ,π 2

  W (xt )ηti > l1 = 0,

which proves the claim. On the other hand, from (2.25) and Remark 2.7, we see that the inequality sup Exπ t∈N

1 ,π 2

 p 1 2 W (xt )ηti ≤ kip sup Exπ ,π W p (xt ) < ∞ t∈N

holds for all x ∈ X, π 1 ∈ Π 1 , π 2 ∈ Π 2 . Thus, from [3, Lemma 7.6.9, p. 301], the 1 2 latter inequality implies that the process {W (xt )ηti } is Pxπ ,π -uniformly integrable.

2.3 Estimation and Control

29

Finally, using the uniform integrability of the process {W (xt )ηti } and that it converges to zero, we conclude that Exπ as t → ∞.

1 ,π 2

W (xt )ηti → 0



Remark 2.8 (Bounded Payoffs). The idea behind the use of projection estimates is that we need an estimator that satisfies Condition D.2. As is noted in Remark 2.2, if the payoff function r is bounded we can take W ≡ 1, so Assumption 2.10 holds letting ρ = ρ¯ , and we have ·L1 = · (see (2.39)). Thus, any L1 -consistent density estimates ρti , i = 1, 2, can be used for the construction of strategies under estimation and control procedures. Moreover, from (A.1) in Appendix A, the norms ·B and ·W coincide, which yields that the convergences stated in Lemmas 2.2 and 2.3 would be under the norm ·B .

Chapter 3

Average Payoff Criterion

We are now interested in analyzing the average payoff game, that is, the game in (1.3) with the long-run expected average payoff in (1.13): 1 1 2 n−1 J(x, π 1 , π 2 ) := lim inf Exπ ,π ∑ r(xt , at , bt ) n→∞ n t=0

(3.1)

for (π 1 , π 2 ) ∈ Π 1 × Π 2 and x ∈ X. Our first concern is to establish conditions ensuring the existence of a value of the game, i.e., a function J(·) such that U(x) = L(x) = J(x), where U and L are the upper and the lower values of the average game (see (1.17)–(1.19)). Having the value of the game, we then give conditions ensuring the existence of a saddle point (π∗1 , π∗2 ) ∈ Π 1 × Π 2 , that is, J(x, π 1 , π∗2 ) ≤ J(x, π∗1 , π∗2 ) = J(x) ≤ J(x, π∗1 , π 2 ) ∀x ∈ X, (π 1 , π 2 ) ∈ Π 1 × Π 2 . Finally, as in the discounted game, we introduce estimation and control procedures in difference equation game models (see Sect. 1.1.1), assuming that {ξt } is a sequence of i.i.d. r.v. with unknown density ρ . Unlike the discounted criterion (1.12), the average performance index depends on the tail of the game’s state process instead of the early stages of the game. Hence, an asymptotic analysis is necessary to study the average payoff game, so suitable ergodicity conditions should be imposed on the game model. Due precisely to this asymptotic analysis, estimation and control procedures do provide average optimal strategies. As in the field of Markov Decision Processes (MDPs), the average criterion in games can be studied by means of the so-called vanishing discount factor approach (VDFA), that is, as a limit of the discounted criterion (see [42], for instance). In this book, we apply a variant of the VDFA to obtain average optimal pairs of strategies under estimation and control procedures. Specifically, for a suitable discount factor © The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 J. A. Minj´arez-Sosa, Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution, SpringerBriefs in Probability and Mathematical Statistics, https://doi.org/10.1007/978-3-030-35720-7 3

31

32

3 Average Payoff Criterion

sequence αt ↑ 1 and by replacing the unknown density ρ by the estimates ρt1 and ρt2 , obtained in each approximation stage, we can obtain minimizers/maximizers in the corresponding αt -discounted Shapley equations. Then, with the ergodicity conditions, we analyze the limit as t → ∞.

3.1 Continuity and Ergodicity Conditions The analysis of the average game will be made under two sets of assumptions. The first one, Assumption 3.1, is a combination of some conditions in Assumptions 2.1 and 2.4, which we rewrite for easy reference. It contains continuity and compactness conditions yielding the existence of measurable selectors. In addition, Assumption 3.2 is an ergodicity condition needed to analyze the asymptotic behavior for the average criterion. Assumption 3.1 (a) The multifunctions x → A(x) and x → B(x) are compact-valued and continuous. (b) The payoff function r(x, a, b) is continuous in (x, a, b) ∈ K. (c) The mapping (x, a, b) →

 X

v(y)Q (dy|x, a, b)

is continuous on K, for every function v ∈ C(X). (d) There exist a continuous function W ≥ 1 on X and a constant M > 0 such that 0 ≤ r(x, a, b) ≤ MW (x) for all (x, a, b) ∈ K. (e) The mapping (x, a, b) →



X W (y)Q (dy|x, a, b)

is continuous on K.

Assumption 3.2 There exist a measurable function λ : K → [0, 1], a probability measure m∗ on X, and a constant β ∈ (0, 1) such that: (a)



X W (y)Q (dy|x, a, b)

≤ β W (x) + λ (x, a, b)d for all (x, a, b) ∈ K, where 

d := X

W (x)m∗ (dx) < ∞.

(b) Q(D|x, a, b) ≥ λ (x, a, b)m∗ (D) for all D ∈ B(X), (x, a, b) ∈ K;  (c) Λ¯ (x)m∗ (dx) > 0, where Λ¯ (x) := inf inf λ (x, a, b) is assumed to be a meaX

a∈A(x) b∈B(x)

surable function. Assumption 3.2 implies that the game’s state processes {xt } defined by the class of stationary strategies are W -geometrically ergodic Markov chains with uniform convergence rate. That is, there exist constants R > 0 and γ ∈ (0, 1) such that

3.2 The Vanishing Discount Factor Approach (VDFA)

33

    ϕ 1 ,ϕ 2  Ex u(xt ) − u(y)d μϕ 1 ,ϕ 2 (y) ≤ RW (x)γ t  X

(3.2)

for all t ∈ N, x ∈ X, u ∈ BW , and strategies (ϕ 1 , ϕ 2 ) ∈ Φ 1 × Φ 2 , where μϕ 1 ,ϕ 2 stands for the unique invariant probability measure corresponding to the Markov chain {xt } induced by the pair (ϕ 1 , ϕ 2 ). It is important to note that the constants R and γ depend neither on the strategies ϕ 1 and ϕ 2 nor on the kernel Q (see, e.g., [23, 24]). These conditions, or some variants of them, have been used in several works ([23, 24, 31, 39, 42, 44, 51, 73]). The reader is referred to these works for detailed discussions about their meanings and consequences. Remark 3.1. Observe that Assumption 3.2(a) implies the inequality in Assumption 2.4(b). Then, under Assumptions 3.1 and 3.2(a), the conclusions of Theorem 2.6 hold. That is, for each α ∈ (0, 1): (a) The discounted game has a value Vα ∈ CW and Vα W ≤ (b) The value Vα satisfies and there exists

(ϕ∗1 , ϕ∗2 )



M . 1−α

(3.3)

Tα Vα = Vα ,

Φ 1 × Φ 2,

Vα (x) = r(x, ϕ∗1 , ϕ∗2 ) + α

 X

such that

ϕ∗1 (x)

(3.4) ∈ A(x) and

ϕ∗2

∈ B(x) satisfy

Vα (y)Q(dy|x, ϕ∗1 , ϕ∗2 )



 1 2 1 2 = max r(x, ϕ , ϕ∗ ) + α Vα (y)Q(dy|x, ϕ , ϕ∗ ) ϕ 1 ∈A(x)

X

= min

ϕ 2 ∈B(x)

r(x, ϕ∗1 , ϕ 2 ) + α

 X

 Vα (y)Q(dy|x, ϕ∗1 , ϕ 2 ) , ∀x ∈ X. (3.5)

    In addition, π∗1 = ϕ∗1 ∈ Πs1 and π∗2 = ϕ∗2 ∈ Πs2 form an α -optimal pair of strategies.

3.2 The Vanishing Discount Factor Approach (VDFA)    

Suppose that Assumptions 3.1 and 3.2 hold. Let Vα and ϕ∗1 , ϕ∗2 ∈ Πs1 × Πs2 be the value of the discounted game and a pair of discounted optimal strategies (see Remark 3.1(b)). Let z ∈ X be an arbitrary fixed state and define hα (·) := Vα (·) −Vα (z) and jα := (1 − α )Vα (z)

34

3 Average Payoff Criterion

for any α ∈ (0, 1). Then Eq. (3.4) is equivalent to jα + hα (x) = Tα hα (x), x ∈ X. Moreover, from (3.5),

(3.6)



jα + hα (x) = r(x, ϕ∗1 , ϕ∗2 ) + α hα (y)Q(dy|x, ϕ∗1 , ϕ∗2 ) X 

 1 2 1 2 = max r(x, ϕ , ϕ∗ ) + α hα (y)(x)Q(dy|x, ϕ , ϕ∗ ) ϕ 1 ∈A(x)

= min

ϕ 2 ∈B(x)

r(x, ϕ∗1 , ϕ 2 ) + α



X

X

 hα (y)Q(dy|x, ϕ∗1 , ϕ 2 ) , ∀x ∈ X.

The following theorem, borrowed from [42, Theorem 4.3], states the existence of a value for the average game defined as the limit of jα as α ↑ 1. Theorem 3.3. Suppose Assumptions 3.1 and 3.2 hold. Then the average payoff game has a constant value J(·) = j∗ , that is, j∗ = inf

sup J(x, π 1 , π 2 ) = sup

π 2 ∈Π 2 π 1 ∈Π 1

inf J(x, π 1 , π 2 ) ∀x ∈ X.

2 2 π 1 ∈Π 1 π ∈Π

Moreover, there exists an average optimal pair of strategies (π∗1 , π∗2 ) ∈ Πs1 × Πs2 and j∗ = lim (1 − α )Vα (z), α →1−

(3.7)

where z ∈ X is an arbitrary fixed state. The proof essentially is based on the analysis of the limit, as t → ∞, of the αt Shapley equation (3.8) jαt + hαt (x) = Tαt hαt (x), x ∈ X, for a fixed and arbitrary sequence {αt } of discount factors converging to 1. In fact, from (3.7) we have (3.9) j∗ = lim jαt . t→∞

3.2.1 Difference Equation Average Game Models We again consider a game evolving according to a difference equation as in (1.3) (see Sect. 1.1.1), that is, xt+1 = F(xt , at , bt , ξt ), t = 0, 1, . . . , where F : K × ℜk → X is a given measurable function, (at , bt ) ∈ A(xt ) × B(xt ), and {ξt } is a sequence of i.i.d. ℜk -valued r.v. with density ρ (·). We assume that for each s ∈ ℜk the function F(·, ·, ·, s) is continuous and there exists a measurable function ρ : ℜk → ℜ+ such that ρ (·) ≤ ρ(·) with respect to the

3.2 The Vanishing Discount Factor Approach (VDFA)

35

Lebesgue measure (see Assumptions 2.8 and 2.9). We again consider the class D of densities σ (·) satisfying the following conditions (see Definition 2.2): D.1 D.2

σ (·) ≤ ρ(·) a.e. with respect to the Lebesgue measure. For all (x, a, b) ∈ K,  ℜk

W [F(x, a, b, s)]σ (s)ds ≤ β W (x) + d.

(3.10)

Note that Assumption 3.2(a) implies that the density ρ (·) belongs to D, after observing that the transition kernel Q takes the form (see (1.4)) Q(D|x, a, b) =



1D [F(x, a, b, s)]ρ (s)ds

ℜk

for D ∈ B(X) and (x, a, b) ∈ K. Remark 3.2 (cf. Remark 2.4(b)). Under the continuity condition of the function F and Assumptions 3.1(a), (b), (d), we have that for any density σ satisfying (3.10) and α ∈ (0, 1), there exists a function Vασ ∈ CW which is the value of the corresponding discounted game. We denote, for a fixed z ∈ X, α ∈ (0, 1), and σ ∈ D hασ (·) := Vασ (·) −Vασ (z) and jασ := (1 − α )Vασ (z). Observe that the α -discounted Shapley equation (3.6) takes the form 

 jασ + hασ (x) = max min r(x, ϕ 1 , ϕ 2 ) + α hασ (F(x, ϕ 1 , ϕ 2 , s))ρ (s)ds . ϕ 1 ∈A(x) ϕ 2 ∈B(x)

ℜk

(3.11) Moreover, under Assumption 3.1, for any density σ ∈ D and α ∈ (0, 1) the following holds (see Remarks 3.1 and 3.2): there exists (ϕ∗1 , ϕ∗2 ) ∈ Φ 1 × Φ 2 , such that for all x ∈ X, ϕ∗1 (x) ∈ A(x) and ϕ∗2 (x) ∈ B(x) is satisfied 

 σ 1 2 σ 1 2 Vα (x) = max min r(x, ϕ , ϕ ) + α Vα (F(x, ϕ , ϕ , s))σ (s)ds ϕ 1 ∈A(x) ϕ 2 ∈B(x)

= r(x, ϕ∗1 , ϕ∗2 ) + α = max

ϕ 1 ∈A(x)

= min

ϕ 2 ∈B(x)

X

 X

Vασ (F(x, ϕ∗1 , ϕ∗2 , s))σ (s)ds



 r(x, ϕ 1 , ϕ∗2 ) + α Vασ (F(x, ϕ 1 , ϕ∗2 , s))σ (s)ds X

r(x, ϕ∗1 , ϕ 2 ) + α

or, equivalently (see (3.6)),

 X

 Vασ (F(x, ϕ∗1 , ϕ 2 , s))σ (s)ds ;

(3.12)

36

3 Average Payoff Criterion

jασ + hασ (x) = r(x, ϕ∗1 , ϕ∗2 ) + α

 X

hασ (F(x, ϕ∗1 , ϕ∗2 , s))σ (s)ds



 1 2 σ 1 2 = max r(x, ϕ , ϕ∗ ) + α hα (F(x, ϕ , ϕ∗ , s))σ (s)ds ϕ 1 ∈A(x)

X

ϕ 2 ∈B(x)

X



 = min r(x, ϕ∗1 , ϕ 2 ) + α hασ (F(x, ϕ∗1 , ϕ 2 , s))σ (s)ds . (3.13)

3.3 Estimation and Control Under the VDFA We now introduce estimation and control procedures for the game models described in the previous section. To this end, the disturbance process {ξt } is assumed to be completely observable. Our approach consists in a combination of the VDFA and suitable density estimation methods. More precisely, to obtain an average optimal pair of strategies, we analyze the limit of the αˆ t -discounted Shapley equations (3.11) for a convenient and fixed discount factor sequence αˆ t ↑ 1, and replace the unknown density ρ by the estimates ρt1 and ρt2 obtained in each approximation stage by the players. We shall show the existence of an average optimal pair of strategies under the following strengthened version of Assumption 3.2 (see Assumptions 2.8(b) and 2.10). Assumption 3.4 (a) There exist constants λ0 ∈ (0, 1), d0 ≥ 0, and p > 1 such that  ℜk

W p [F(x, a, b, s)]ρ (s)ds ≤ λ0W p (x) + d0 (x, a, b) ∈ K.

(3.14)

(b) The function

ψ (s) :=

1 W [F(x, a, b, s)], s ∈ ℜk (x,a,b)∈K W (x) sup

is finite and satisfies the condition  ℜk

ψ 2 (s) (ρ(s)) ds < ∞.

The key point to define the average optimal pairs of strategies is to get density estimators ρti , i ∈ {1, 2}, for players 1 and 2 respectively, belonging to the class D, with a property of convergence stronger than that given in Lemma 2.1. As in discounted case, these estimators are obtained by means of a projection estimate method with a slight modification (see Appendix D.2.2). Essentially, the requirement that ρti ∈ D is to be able to use the estimated versions of the Shapley equations (3.12) and (3.13).

3.3 Estimation and Control Under the VDFA

37

Let ξ0 , ξ1 , . . . , ξt , . . . be independent realizations of random variables in ℜk with density ρ (·), and ρti (·) := ρti (·; ξ0 , ξ1 , . . . , ξt−1 ), for t ∈ N, be density estimators for player i ∈ {1, 2} satisfying the property E||ρti − ρ ||L1 = E

 ℜk

  i ρt (s) − ρ (s) ds = O(t −δ ) as t → ∞,

(3.15)

for some δ > 0. For instance, under suitable conditions on the density ρ , the kernel estimate satisfies (3.15) with δ = 2/5 (see Theorem D.4 in Appendix D). Another example is given in Appendix D.2.3 for a particular density belonging to a parametric family of densities. Now, for p > 1 given in Assumption 3.4, let q > 1 be such that 1/p + 1/q = 1. Then, from Remark D.2 (Appendix D) we have

E||ρti − ρ ||qL1 = E

 ℜk

  i ρt (s) − ρ (s) ds

q

= O(t −δ ) as t → ∞.

(3.16)

Under Assumption 3.4(b), the class of densities D is a closed and convex subset of Lq (see Proposition 2.1). Moreover (by (2.36)) we have that the projection of each density ρti (·),t ∈ N, on D is well defined, that is, there exists a unique density ρti (·) ∈ D for each i ∈ {1, 2} such that ||ρti − ρti ||L1 = inf ||σ − ρti ||L1 . σ ∈D

Hence, the density estimators ρti , i ∈ {1, 2}, are defined by

ρti := argmin ||σ − ρti ||L1 . σ ∈D

(3.17)

The next result establishes the convergence of the densities {ρti (·)} to ρ (·) for i ∈ {1, 2}; the proof is obtained by applying similar arguments to those of Lemma 2.1. Lemma 3.1. Suppose Assumption 3.4 holds. Then E||ρti − ρ ||q = O(t −δ ) as t → ∞, where δ > 0 is as in (3.16) and ||σ || :=

1 W (x) (x,a,b)∈K sup

 ℜk

W [F(x, a, b, s)]|σ (s)|ds,

for any measurable function σ : ℜk → ℜ. Note that Condition D.2 guarantees ||σ || < ∞ for any density σ in D.

(3.18)

38

3 Average Payoff Criterion

3.3.1 Average Optimal Pair of Strategies For a nondecreasing sequence of discount factors {αt }, we denote by κ (n), n ∈ N, the times this sequence changes its values among the first n terms; that is,

κ (n) = |{αt : t = 1, . . . , n}| − 1, where |C| denotes the cardinality of the set C. We choose a sequence {αˆ t } with the following properties: F.1 F.2

(1 − αˆ t )−1 = O(t ν ) as t → ∞, with ν ∈ (0, δ /3q). κ (n) = 0. lim n→∞ n

Observe that Condition F.2 implies that, for each ε > 0, κ (n) could only be greater than ε n for finitely many values of n. Thus, {αˆ t } remains constant for long time periods. An example of a sequence {αˆ t } satisfying Conditions F.1 and F.2 is the following. Example 3.1. Let δ and q be given real numbers satisfying (3.16), and {αt } be the sequence defined as 1 αt := 1 − ν , t for some ν ∈ (0, δ /3q). We define the sequence {αˆ t } by

αˆ t = αk

if

k(k + 1) (k − 1)k ≤t < , k = 2, 3, . . . 2 2

Then, since k ≤ t, straightforward calculations shows that (1 − αˆ t )−1 = (1 − αk )−1 = kν = O(t ν ). Moreover, for n ∈ N,

κ (n) = (k − 2) if Therefore, κ (n)
0 such that, for i = 1, 2,   E ρti − ρ L = O(t −δ ) as t → ∞, 1

we define the sequence {αˆ t } of discount factors satisfying the Conditions F.1 and F.2 (see, e.g., Example 3.1). In fact, it is possible to prove, in specific examples, that Theorem 3.5 remains to be valid for sequences of discount factors that converge enough slowly to one. Hence, Conditions F.1 and F.2 can be weakened. Furthermore, the ergodicity property given in (3.2), which is a consequence of Assumption 3.2, becomes in the well-known geometric ergodicity in the total variation norm (see, e.g., [19, 28]).

Chapter 4

Empirical Approximation-Estimation Algorithms in Markov Games

This chapter proposes an empirical approximation-estimation algorithm in difference equation game models (see Sect. 1.1.1) whose evolution is given by xt+1 = F(xt , at , bt , ξt ), t ∈ N0 ,

(4.1)

where {ξt } is a sequence of observable i.i.d. random variables defined on a probability space (Ω , F , P) , taking values in an arbitrary Borel space S, with common unknown distribution θ ∈ P(S). Note that unlike the previous chapters, we are now considering an arbitrary distribution θ , which not necessarily has a density. Our objective in this scenario is to use an empirical procedure to estimate θ that in turn defines an algorithm to approximate the value of the game as well as an optimal pair of strategies. This is done for both, discounted and average criteria, by applying the following approach. As was seen previously, the study of Markov games with discounted payoffs is analyzed by means of the Shapley equation T(α ,θ )Vαθ = Vαθ , where Vαθ is the value of the game and T(α ,θ ) is a minimax (maximin) operator. Then, in the setknown, ting of θ completely   under  suitable condition, a stationary optimal pair of strategies (π∗1 , π∗2 ) = ϕ∗1 , ϕ∗2 ∈ Πs1 × Πs2 can be computed. Now, assuming ξ¯n = (ξ0 , ξ1 , . . . , ξn−1 ) , the corresponding empirical unknown θ , given a sample

¯ measure θn (·) = θn ·; ξn defines a random operator T(α ,θn ) and an empirical value Vαθn satisfying T(α ,θn )Vαθn = Vαθn . So, for each n ∈ N, it is possible to get an opti    

mal pair ϕn1 , ϕn2 ∈ Πs1 × Πs2 for the θn -empirical game. Then, under suitθn θ able conditions, the  prove

 convergence Vα → Vα and the existence of a limit 1 2 we 1 2 ϕn , which defines a stationary optimal pair of strategies point ϕ∞ , ϕ  ∞ of ϕn ,

(π∞1 , π∞2 ) = ϕ∞1 , ϕ∞2 for the original game. It is worth observing that by the

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 J. A. Minj´arez-Sosa, Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution, SpringerBriefs in Probability and Mathematical Statistics, https://doi.org/10.1007/978-3-030-35720-7 4

47

48

4 Empirical Approximation-Estimation Algorithms in Markov Games

randomness of the operator T(α ,θn ) as well as of the functions Vαθn , the pair ϕ∞1 , ϕ∞2 is a random variable. Hence, as part of our approach, we have to prove that its expectation determines an optimal (nonrandom) pair of strategies. Once the discounted case is analyzed, the average payoff criterion is studied by means of a combination of the VDFA and the empirical estimation procedure. Our approach, in addition to providing a more general method for estimating value functions and for the construction of optimal strategies, can be seen as an approximation method in cases where θ can be known but difficult to handle, i.e., θ is replaced by a simpler distribution, namely, the empirical distribution θn . Throughout the chapter, a.s. means “almost surely” with respect to the underlying probability measure P.

4.1 Assumptions and Preliminary Results The theory on the empirical procedures for discounted and average payoff games will be developed in the setting imposed by the Assumption 3.1 (see Assumptions 2.1, 2.4, and 2.8(b), and Remark 2.4(a)), together with the ergodicity condition in Assumption 3.2. For ease reference, we rewrite them in terms of the distribution θ and the difference equation game model. Assumption 4.1 (a) The multifunctions x −→ A(x) and x −→ B(x) are compactvalued and continuous. (b) The payoff function r is continuous on K, and there exist a continuous function W : X → [1, ∞) and a constant M > 0 such that 0 ≤ r(x, a, b) ≤ MW (x) for all (x, a, b) ∈ K. Moreover, the function (x, a, b) −→

 S

W [F(x, a, b, s)] θ (ds)

is continuous on K. (c) For each s ∈ S, the function F(x, a, b, s) is continuous in (x, a, b) ∈ K. Assumption 4.2 There exist a measurable function λ : K → [0, 1], a probability measure m∗ on X, and a constant β ∈ (0, 1) such that: (a)



S W [F(x, a, b, s)]θ (ds) ≤

β W (x) + λ (x, a, b)d for all (x, a, b) ∈ K, where 

d := X

W (x)m∗ (dx) < ∞;

4.1 Assumptions and Preliminary Results

49

(b) Q(D|x, a, b) ≥ λ (x, a, b)m∗ (D) ∀D ∈ B(X), (x, a, b) ∈ K; (c)



∗ ¯ ¯ X Λ (x)m (dx) > 0, where Λ (x) :=

inf

inf λ (x, a, b) is assumed to be a mea-

a∈A(x) b∈B(x)

surable function.

Assumption 4.3 For the constants β and d in Assumption 4.2, the function W satisfies W [F(x, a, b, s)] ≤ β W (x) + d ∀(x, a, b, s) ∈ K×S.

Remark 4.1. (a) In the particular case of a bounded payoff function r, Assumption 4.3 holds by taking W ≡ 1 and d = 1 (see Remark 2.2). (b) Observe that Assumption 4.1(c) implies that the mapping (x, a, b) −→

 S

v [F(x, a, b, s)] μ (ds)

is continuous on K for every bounded and continuous function v on X and μ ∈ P(S) (see Remark 2.4(a) or Proposition C.3 in Appendix C). (c) We consider the following class of probability measures M (S) := μ ∈ P(S) :

 S

! W [F(x, a, b, s)]μ (ds) ≤ β W (x) + d, (x, a, b) ∈ K .

Observe that Assumption 4.2(a) implies that θ ∈ M (S), that is  S

W [F(x, a, b, s)]θ (ds) ≤ β W (x) + d

for all (x, a, b) ∈ K. On the other hand, under Assumption 4.3, any probability measure μ ∈ P(S) belongs to M (S), that is, M (S) = P(S). For each μ ∈ P(S) and α ∈ (0, 1), we define, for v ∈ BW and x ∈ X, the operator (see (2.2) and (2.19)) 

 sup r(x, ϕ 1 , ϕ 2 ) + α v(F(x, ϕ 1 , ϕ 2 , s))μ (ds) . (4.2) T(α ,μ ) v(x) := inf ϕ 2 ∈B(x) ϕ 1 ∈A(x)

S

If Assumption 4.1 holds, standard results in previous chapters (see Theorem 2.6 and Remark 2.4) ensure that, for each μ ∈ M (S), the operator T(α ,μ ) maps CW into itself, and furthermore the interchange of inf and sup in (4.2) holds: 

 1 2 1 2 T(α ,μ ) v(x) = sup inf r(x, ϕ , ϕ ) + α v(F(x, ϕ , ϕ , s))μ (ds) . (4.3) 2 ϕ 1 ∈A(x) ϕ ∈B(x)

S

50

4 Empirical Approximation-Estimation Algorithms in Markov Games

Moreover, for each μ ∈ P(S) and α ∈ (0, 1), the operator T(α ,μ ) has the contraction property given in Remark 2.3. Indeed, let W¯ (x) := W (x) + e for x ∈ X, where, for each discount factor α ∈ (0, 1) and arbitrary γα ∈ (α , 1), e := d (γα /α − 1)−1

(4.4)

Then (see (2.11)),

α

 S

W¯ [F(x, a, b, s)]μ (ds) ≤ γα W¯ (x) ∀(x, a, b) ∈ K,

(4.5)

and for all v, u ∈ BW ,   T(α ,μ ) v − T(α ,μ ) u ≤ γα v − u ¯ . W W¯

(4.6)

Note that the norms ·W and ·W¯ are equivalent since, as in (2.9), vW¯ ≤ vW ≤ lα vW¯ for v ∈ BW , where lα := 1 + e = 1 +

αd . γα − α

(4.7)

(4.8)

Combining these facts, from Theorem 2.6 we have the following result. Theorem 4.4. Suppose that Assumption 4.1 holds and θ ∈ M (S). Then, for each α ∈ (0, 1): (a) The discounted payoff game has a value Vαθ ∈ CW and    θ Vα  ≤ W

M . 1−α

(b) The value Vαθ satisfies T(α ,θ )Vαθ = Vαθ , and there exists (ϕ∗1 , ϕ∗2 ) ∈ Φ 1 × Φ 2 such that, for all x ∈ X, ϕ∗1 (x) ∈ A(x) and ϕ∗2 (x) ∈ B(x) satisfy 

Vαθ (x) = r(x, ϕ∗1 , ϕ∗2 ) + α Vαθ [F(x, ϕ∗1 , ϕ∗2 , s)]θ (ds) S 

 1 = max r(x, ϕ , ϕ∗2 ) + α Vαθ [F(x, ϕ 1 , ϕ∗2 , s)]θ (ds) ϕ 1 ∈A(x)

S

ϕ 2 ∈B(x)

S



 1 2 θ 1 2 = min r(x, ϕ∗ , ϕ ) + α Vα [F(x, ϕ∗ , ϕ , s)]θ (ds) .

(4.9) (4.10)

    In addition, π∗1 = ϕ∗1 ∈ Πs1 and π∗2 = ϕ∗2 ∈ Πs2 form an optimal pair of strategies. In addition, Theorem 3.3 yields the following result related to the average payoff game.

4.2 The Discounted Empirical Game

51

Theorem 4.5. Under Assumptions 4.1 and 4.2, the average payoff game has a value J(·) = j∗ , that is, j∗ = inf

sup J(x, π 1 , π 2 ) = sup

π 2 ∈Π 2 π 1 ∈Π 1

inf J(x, π 1 , π 2 ) ∀x ∈ X.

2 2 π 1 ∈Π 1 π ∈Π

Further, both players have optimal strategies.

Remark 4.2. The average criterion is analyzed according to the VDFA introduced in Sect. 3.2; for ease reference we formulate it again in terms of the distribution θ . Let z ∈ X be a fixed state, and define, for α ∈ (0, 1) and x ∈ X, jαθ := (1 − α )Vαθ (z),

hαθ (x) := Vαθ (x) −Vαθ (z).

(4.11)

Observe that, from Theorem 4.4(b), for all x ∈ X we have jαθ + hαθ (x) = T(α ,θ ) hαθ (x) = r(x, ϕ∗1 , ϕ∗2 ) + α = max

ϕ 1 ∈A(x)

 S

hαθ [F(x, ϕ∗1 , ϕ∗2 , s)]θ (ds)



 r(x, ϕ 1 , ϕ∗2 ) + α hαθ [F(x, ϕ 1 , ϕ∗2 , s)]θ (ds) S



 1 2 θ 1 2 = min r(x, ϕ∗ , ϕ ) + α hα [F(x, ϕ∗ , ϕ , s)]θ (ds) . ϕ 2 ∈B(x)

(4.12)

S

Then, from [42, Theorem 4.3], under Assumptions 4.1 and 4.2 (see (3.9)), we have lim jθ t→∞ αt

= j∗ ,

(4.13)

for any sequence {αt } of discount factors such that αt  1. Moreover (see (3.25)),     sup hαθ  < ∞. (4.14) α ∈(0,1)

W

4.2 The Discounted Empirical Game Let θt ∈ P(S), for t ∈ N0 , be the empirical distribution of the disturbance process {ξt } (see Appendix B.1). That is, for a given probability measure ν ∈ P(S),

θ0 := ν ,

θt (D) = θt (D)(ω ) :=

1 t−1 ∑ 1D (ξi (ω )) ∀t ∈ N, D ∈ B(S), ω ∈ Ω . t i=0

52

4 Empirical Approximation-Estimation Algorithms in Markov Games

Note that for each D ∈ B(S), θt (D)(·) is a random variable, and for each ω ∈ Ω , θt (·)(ω ) is the uniform distribution on the set {ξ0 (ω ), . . . , ξt−1 (ω )} ⊂ S. We consider the approximating zero-sum Markov game model of the form: G M tα := (X, A, B, KA , KB , Qt , r),

(4.15)

where, for all D ∈ B(X) and (x, a, b) ∈ K, Qt (D|x, a, b) =



1D [F(x, a, b, s)]θt (ds)

S

=

1 t ∑ 1D [F(x, a, b, ξi )]. t i=1

The empirical approximation scheme consists in solving the approximate game G M t , for each t ∈ N. That is, the discounted game is analyzed when both players use the empirical distribution θt instead of the original distribution θ . This procedure leads to an optimal pair of strategies (πt1 , πt2 ) ∈ Πs1 × Πs2 for the game G M tα , for each t ∈ N, provided, of course, that the corresponding value of the game Vαθt exists. Under this setting, our hope is that the optimal pair (πt1 , πt2 ) will have  a good  performance in the game G M , whenever the sequence of empirical values Vαθt

gives a good approximation to the value Vαθ . We now introduce these ideas in precise terms as follows. For each t ∈ N0 , let



1 2 Vαθt (x, π 1 , π 2 ) := Etx,π ,π



∑ α r(xi , ai , bi ) i

,

(4.16)

i=0

be the α -discounted expected payoff function in which all random variables ξ0t , ξ1t , . . . have the same distribution θt . Observe that, under Assumption 4.3, θt is in M (S) for every t ∈ N, that is,  S

W [F(x, a, b, s)]θt (ds)(ω ) ≤ β W (x) + d ∀(x, a, b) ∈ K.

Then Theorem 4.4 yields the following result. Theorem 4.6. Suppose that Assumptions 4.1 and 4.3 hold. Then for each t ∈ N and ω ∈ Ω, θ (ω ) (a) the game G M tα has a value Vαθt = Vα t ∈ CW such that    θt  Vα  ≤ W

M and T(α ,θt )Vαθt = Vαθt ; 1−α

4.2 The Discounted Empirical Game

53

(b) there exists (ϕt1 , ϕt2 ) = (ϕt1 (ω ), ϕt2 (ω )) ∈ Φ 1 × Φ 2 such that, for all x ∈ X, ϕt1 (x, ω ) := ϕt1 (·|x, ω ) ∈ A(x) and ϕt2 (x, ω ) := ϕt2 (·|x, ω ) ∈ B(x) satisfy Vαθt (x) = r(x, ϕt1 , ϕt2 ) + α

 S

Vαθt [F(x, ϕt1 , ϕt2 , s)]θt (ds)



 1 2 θt 1 2 = max r(x, ϕ , ϕt ) + α Vα [F(x, ϕ , ϕt , s)]θt (ds) ϕ 1 ∈A(x)

= min

ϕ 2 ∈B(x)

(4.17)

S



 r(x, ϕt1 , ϕ 2 ) + α Vαθt [F(x, ϕt1 , ϕ 2 , s)]θt (ds) .

(4.18)

S

Remark 4.3. Observe that, for each t ∈ N0 , the value function Vαθt is a random i for i = 1, 2, define a random optimal pair of strategies function, and  ω ), 

 ϕ1t (x, 1 2 (πt , πt ) := ϕt , ϕt2 ∈ Πs1 × Πs2 for the game G M tα .

4.2.1 Empirical Estimation Process The key points to obtain the approximation of the empirical values Vαθt to the value Vαθ are the convergence properties of the empirical distribution. At first glance, from Proposition B.5, in Appendix B, we have that θt converges weakly to θ a.s. Thus, for each (x, a, b) ∈ K and continuous and bounded function u on X, (see Definition B.1)  S

u(F(x, a, b, s))θt (ds) →

 S

u(F(x, a, b, s))θ (ds) a.s., as t → ∞.

However, in the scenario of possibly unbounded payoff and arbitrary disturbance space S, this class of convergence is not sufficient for our objectives. In fact, we need uniform convergence on the set K. In order to state our estimation process, we impose the following assumption. Assumption 4.7 The family of functions VW :=

Vαθ (F(x, a, b, ·)) : (x, a, b) ∈ K W (x)

! (4.19)

is equicontinuous on S.

Proposition 4.1. Under Assumption 4.7,

Δt → 0 a.s., as t → ∞,

(4.20)

54

4 Empirical Approximation-Estimation Algorithms in Markov Games

where

Δt :=

 θ   θ  Vα (F(x, a, b, s))  Vα (F(x, a, b, s))  sup  θt (ds) − θ (ds) . W (x) W (x) S S

(x,a,b)∈K

Proof. Observe that from Theorem 4.4(a), the family of functions VW is uniformly bounded. Then, under Assumption 4.7, the relation (4.20) follows from Proposition B.6 in Appendix B. 

4.2.2 Discounted Optimal Strategies Let us fix (x, ω ) ∈ X × Ω , and consider the multifunction given by (x, ω ) −→ A(x). Since A(x) is a compact subset of A, A(x) is a compact subset of P(A)  (with   the weak topology). Consider the optimal pair of strategies (πt1 , πt2 ) = ϕt1 , ϕt2 ∈ Πs1 × Πs2 for the game model G M tα (see Remark 4.3). Then, from Proposition A.4 in Appendix A, there exists ϕ∞1 ∈ Φ 1 such that ϕ∞1 (x, ω ) = ϕ∞1 (·|x, ω ) ∈ A(x) is an accumulation point of {ϕt1 (x, ω )}. Similarly, there exists ϕ∞2 ∈ Φ 2 such that ϕ∞2 (x, ω ) = ϕ∞2 (·|x, ω ) ∈ B(x) is an accumulation point of {ϕt2 (x, ω )}.     Define the strategies π∞1 = ϕ∞1 ∈ Πs1 and π∞2 = ϕ∞2 ∈ Πs2 . Hence, we can state our results related to the discounted empirical approximation as follows. Theorem 4.8. Under Assumptions 4.1, 4.3, and 4.7, P − a.s.     (a) Vαθt −Vαθ  → 0 as t → ∞; W

(b) the random pair of strategies (π∞1 , π∞2 ) ∈ Πs1 × Πs2 is optimal for the game G M . Furthermore, 1 2 1 2 (c) there exist an optimal (nonrandom)  i  pair of strategies (πˆ∞ , πˆ∞ ) ∈ Πs × Πs for i the game G M defined as πˆ∞ = ϕˆ ∞ where

ϕˆ ∞i (·|x) =

 Ω

ϕti (·|x, ω )P(d ω ), i = 1, 2.

Proof. (a) Since θt ∈ M (S), for t ∈ N0 , from (4.6) we have that the operator T(α ,θt ) is a contraction. Hence, from Theorems 4.4 and 4.6, for each t ∈ N0 ,        θ      Vα −Vαθt  ¯ ≤ T(α ,θ )Vαθ − T(α ,θt )Vαθ  ¯ + T(α ,θt )Vαθ − T(α ,θt )Vαθt  ¯ W

W

W

        ≤ T(α ,θ )Vαθ − T(α ,θt )Vαθ  + γα Vαθ −Vαθt  ¯ ¯ W

Thus

   θt  Vα −Vαθ  ¯ ≤ W

W

 1    T(α ,θ )Vαθ − T(α ,θt )Vαθ  ¯ . 1 − γα W

a.s.

(4.21)

4.2 The Discounted Empirical Game

55

On the other hand, using the fact that W¯ (·) > W (·), for each x ∈ X and t ∈ N0 ,     T(α ,θ )Vαθ − T(α ,θt )Vαθ  ¯ W

≤ sup

sup

x∈X ϕ 1 ∈A(x),ϕ 2 ∈B(x)

 θ   θ  Vα [F(x, ϕ 1 , ϕ 2 , s)]  Vα [F(x, ϕ 1 , ϕ 2 , s)]  θ (ds) − θt (ds)  S W (x) W (x) S

= Δt .

(4.22)

Combining (4.21) and (4.22) we get     θt Vα −Vαθ  ≤ W¯

and from (4.7)

   θt  Vα −Vαθ  ≤ W

1 Δt , 1 − γα lα Δt . 1 − γα

(4.23)

Thus, (4.20) yields part (a). (b) Since for each (x, ω ) ∈ X × Ω , ϕ∞1 (x, ω ) = ϕ∞1 (·|x, ω ) ∈ A(x) is an accumulation point of {ϕt1 (x, ω )}, there exists a subsequence {ϕt1k (x, ω )} of {ϕt1 (x, ω )} such that ϕ∞1 (x, ω ) = limk→∞ ϕt1k (x, ω ). Under similar arguments, there exists a subsequence {ϕt2k (x, ω )} of {ϕt2 (·|x, ω )} such that ϕ∞2 (x, ω ) = limk→∞ ϕt2k (x, ω ). Observe that we can use the same subsequence {tk } for both cases. In the remainder of the proof, to ease notation, we let tk = k. We shall now proceed to prove the optimality of the pair (π∞1 , π∞2 ) ∈ Πs1 × Πs2 . Firstly, observe that, for each x ∈ X, as t → ∞,       Vαθt (F(x, a, b, s)) θt (ds) − Vαθ (F(x, a, b, s)) θ (ds) → 0 a.s. sup  

(a,b)∈A(x)×B(x)

S

S

(4.24) Indeed,

      Vαθt (F(x, a, b, s)) θt (ds) − Vαθ (F(x, a, b, s)) θ (ds)  S  S ≤

  S

  θt  Vα (F(x, a, b, s)) −Vαθ (F(x, a, b, s)) θt (ds)

     θ θ  +  Vα (F(x, a, b, s))θt (ds) − Vα (F(x, a, b, s))θ (ds) S S     ≤ Vαθt −Vαθ  (β W (x) + d) + Δt W (x). W

56

4 Empirical Approximation-Estimation Algorithms in Markov Games

Thus, (4.24) follows from part (a) and (4.20). Now, from (4.17), 

 θk 1 2 1 2 max r(x, ϕ , ϕk ) + α Vα [F(x, ϕ , ϕk , s)]θk (ds) .

θ Vα k (x) =

ϕ 1 ∈A(x)

(4.25)

S

In addition, for any fixed ϕ¯ 1 ∈ A(x) 

 θ lim inf max r(x, ϕ 1 , ϕk2 ) + α Vα k [F(x, ϕ 1 , ϕk2 , s)]θk (ds) ϕ 1 ∈A(x)

k

S



 θk 1 2 1 2 ¯ ¯ ≥ lim inf r(x, ϕ , ϕk ) + α Vα [F(x, ϕ , ϕk , s)]θk (ds) . k

(4.26)

S

On the other hand,  S





θ Vα k [F(x, ϕ¯ 1 , ϕk2 , s)]θk (ds) =

 S

S

Vαθ [F(x, ϕ¯ 1 , ϕk2 , s)]θ (ds) +

θ Vα k [F(x, ϕ¯ 1 , ϕk2 , s)]θk (ds)

 S

Vαθ [F(x, ϕ¯ 1 , ϕk2 , s)]θ (ds).

Then, from (4.24), Fatou’s Lemma, and using the continuity of the functions Vαθ and F, 

lim inf k

S

θ

Vα k [F(x, ϕ¯ 1 , ϕk2 , s)]θk (ds) = lim inf k



 S

 S

Vαθ [F(x, ϕ¯ 1 , ϕk2 , s)]θ (ds)

Vαθ [F(x, ϕ¯ 1 , ϕ∞2 , s)]θ (ds) a.s. (4.27)

Therefore, taking liminf as k → ∞ in (4.25), the relations (4.26) and (4.27) together with part (a) yield Vαθ (x) ≥ r(x, ϕ¯ 1 , ϕ∞2 ) + α

 S

Vαθ [F(x, ϕ¯ 1 , ϕ∞2 , s)]θ (ds).

Since ϕ¯ 1 ∈ A(x) is arbitrary, we have 

 θ 1 2 θ 1 2 Vα (x) ≥ max r(x, ϕ , ϕ∞ ) + α Vα [F(x, ϕ , ϕ∞ , s)]θ (ds) . ϕ 1 ∈A(x)

S

This implies Vαθ (x) =



 1 2 θ 1 2 max r(x, ϕ , ϕ∞ ) + α Vα [F(x, ϕ , ϕ∞ , s)]θ (ds) ,

ϕ 1 ∈A(x)

S

(4.28)

4.2 The Discounted Empirical Game

57

because (see (4.9)) Vαθ (x)

= min

r(x, ϕ , ϕ ) + α 1

max

ϕ 2 ∈B(x) ϕ 1 ∈A(x)

≤ max

ϕ 1 ∈A(x)

r(x, ϕ 1 , ϕ 2 ) + α



2

 S

S

Vαθ [F(x, ϕ 1 , ϕ 2 , s)]θ (ds)



 Vαθ [F(x, ϕ 1 , ϕ 2 , s)]θ (ds) , ∀ϕ 2 ∈ B(x).

Similarly, from (4.18), θ

Vα k (x) = min

ϕ 2 ∈B(x)



 θ r(x, ϕk1 , ϕ 2 ) + α Vα k [F(x, ϕk1 , ϕ 2 , s)]θk (ds) , S

and for an arbitrary and fixed ϕ¯ 2 ∈ B(x) 

 θk 1 2 1 2 lim sup min r(x, ϕk , ϕ ) + α Vα [F(x, ϕk , ϕ , s)]θk (ds) k

ϕ 2 ∈B(x)

S



 θ ≤ lim sup r(x, ϕk1 , ϕ¯ 2 ) + α Vα k [F(x, ϕk1 , ϕ¯ 2 , s)]θk (ds) . S

k

Thus, applying Fatou’s Lemma with limsup, we obtain Vαθ (x) ≤ r(x, ϕ∞1 , ϕ¯ 2 ) + α

 S

Vαθ [F(x, ϕ∞1 , ϕ¯ 2 , s)]θ (ds),

which, in turn, implies Vαθ (x) = min

ϕ 2 ∈B(x)

r(x, ϕ∞1 , ϕ 2 ) + α

 S

 Vαθ [F(x, ϕ∞1 , ϕ 2 , s)]θ (ds) .

(4.29)

Finally, combining (4.28) and (4.29), and applying standard procedures in game theory, we prove that (π∞1 , π∞2 ) ∈ Πs1 × Πs2 is a random optimal pair of strategies for the game G M . (c) We define H(x, a, b) := r(x, a, b) + α

 S

Vαθ [F(x, a, b, s)]θ (ds), (x, a, b) ∈ K.

Observe that, from (4.29) and (1.5), Vαθ (x) = min H(x, ϕ∞1 (ω ), ϕ 2 ) ϕ 2 ∈B(x)

= min



ϕ 2 ∈B(x) A(x)

H(x, a, ϕ 2 )ϕ∞1 (da|x, ω ) a.s., x ∈ X.

(4.30)

58

4 Empirical Approximation-Estimation Algorithms in Markov Games

Hence, Vαθ (x) =





min

Ω ϕ 2 ∈B(x) A(x)

≤ min



ϕ 2 ∈B(x) A(x)

= min



ϕ 2 ∈B(x) A(x)

H(x, a, ϕ 2 )ϕ∞1 (da|x, ω )P(d ω )

H(x, a, ϕ 2 )

 Ω

ϕ∞1 (da|x, ω )P(d ω )

H(x, a, ϕ 2 )ϕˆ ∞1 (da|x)

= min H(x, ϕˆ ∞1 , ϕ 2 ), x ∈ X. ϕ 2 ∈B(x)

Therefore, from (4.30) and Theorem 4.4, for all x ∈ X, 

 θ 1 2 θ 1 2 Vα (x) = min r(x, ϕˆ ∞ , ϕ ) + α Vα [F(x, ϕˆ ∞ , ϕ , s)]θ (ds) . ϕ 2 ∈B(x)

(4.31)

S

Similarly, we can prove that, for each x ∈ X, 

 θ 1 ˆ2 θ 1 ˆ2 Vα (x) = max r(x, ϕ , ϕ∞ ) + α Vα [F(x, ϕ , ϕ∞ , s)]θ (ds) , ϕ 1 ∈A(x)

S

which, combined with (4.31), yields the optimality of the pair (πˆ∞1 , πˆ∞2 ) ∈ Πs1 × Πs2 for the game G M . 

4.3 Empirical Approximation Under Average Criterion The empirical approximation scheme for the average criterion is obtained by combining the VDFA and a suitable convergence property of the empirical process, similar to the procedure followed in Sect. 3.3. Therefore, we will take advantage of the results introduced in previous sections for the discounted criterion. However, due to the additional difficulties in the asymptotic analysis of the average payoff, the following stronger condition is needed. Assumption 4.9 (a) The disturbance space S is the k-dimensional Euclidean space ℜk . (b) Let m > max{2, k} be an arbitrary real number and m¯ := km/[(m − k)(m − 2)]. Then E |ξ0 |m¯ < ∞. (c) The family of functions (see (4.11) and (4.19)) ! hαθ (F(x, a, b, .)) ¯ VW := : (x, a, b) ∈K, α ∈ (0, 1) , W (x)

4.3 Empirical Approximation Under Average Criterion

59

or equivalently VˆW :=

! Vαθ (F(x, a, b, .)) : (x, a, b) ∈K, α ∈ (0, 1) , W (x)

is equi-Lipschitzian on ℜk . That is, there exists a constant Lh > 0 such that, for every s, s ∈ ℜk and (x, a, b) ∈ K,  θ   hα (F(x, a, b, s)) hαθ (F(x, a, b, s ))      ≤ Lh s − s  , −   W (x) W (x) where |·| is the corresponding Euclidean distance in ℜk .

Remark 4.4 (Equicontinuity and Equi-Lipschitz Conditions). Clearly, in the case S = ℜk , the equi-Lipschitz Assumption 4.9(c) implies the equicontinuity Assumption 4.7. Proposition 4.2. Under Assumptions 4.1, 4.2, and 4.9, there exists a constant M¯ such that   ¯ −1/m , (4.32) E Δ¯t ≤ Mt where

       hα (F(x, a, b, s)) h (F(x, a, b, s)) α  ¯ Δt := sup θt (ds) − θ (ds) .  W (x) W (x) (x,a,b)∈K,α ∈(0,1)   ℜk ℜk (4.33)

Proof. From (4.14), the family of functions V¯W is uniformly bounded. Thus, by applying Proposition B.7 in Appendix B we prove (4.32).  In order to introduce the empirical VDFA, we follow similar ideas as in Sect. 3.3.1. Let ν ∈ (0, 1/2m) be an arbitrary real number where m is the constant introduced in Assumption 4.9(b). We fix an arbitrary nondecreasing sequence of discount factors {α¯ t } such that ¯ F.1 ¯ F.2

(1 − α¯ t )−1 = O(t ν ) as t → ∞; κ (n) lim = 0, n→∞ n

where κ (n) is the number of changes of value of {α¯ t } among the first n terms (see Conditions F.1 and F.2 and the succeeding example in Sect. 3.3.1). For a fixed t ∈ N0 , let Vα¯θtt (·, ·, ·) be the α¯ t -discounted payoff function under the empirical distribution θt (see (4.16)), and we denote by Vα¯θtt (·) the corresponding value of the game G M tα¯ t (see (4.15), Theorems 4.4 and 4.6). The functions

60

4 Empirical Approximation-Estimation Algorithms in Markov Games

hθα¯tt (·) and jαθ¯tt are defined accordingly (see (4.11)). Hence, from Theorem 4.4(b) (see (4.12)), there exists a random pair (ϕ¯t1 , ϕ¯t2 ) ∈ Φ 1 × Φ 2 such that, for every x ∈ X, jαθ¯tt + hθα¯tt (x) = T(α¯ t ,θt ) hθα¯tt (x) = r(x, ϕ¯t1 , ϕ¯t2 ) + α¯ t

= max

ϕ 1 ∈A(x)

 S

hθα¯tt [F(x, ϕ¯t1 , ϕ¯t2 , s)]θt (ds)



 r(x, ϕ 1 , ϕ¯t2 ) + α¯ t hθα¯tt [F(x, ϕ 1 , ϕ¯t2 , s)]θt (ds) S

= min

ϕ 2 ∈B(x)

r(x, ϕ¯t1 , ϕ 2 ) + α¯ t

 S

 hθα¯tt [F(x, ϕ¯t1 , ϕ 2 , s)]θt (ds) .

(4.34)

For each t ∈ N0 (see (4.4)–(4.8)) we define

γt ≡ γα¯ t :=  et := d

1 + α¯ t ∈ (α¯ t , 1), 2

−1   2α¯ t γt −1 =d , α¯ t 1 − α¯ t

and lt ≡ lα¯ t := 1 + et = 1 +

2d α¯ t . 1 − α¯ t

It is easy to see that lt ≤ 2(1 + d)(1 − α¯ t )−2 , 1 − γt ¯ yields which, from Condition F.1, lt = O(t 2ν ) as t → ∞. 1 − γt

(4.35)

Moreover, applying similar arguments as in the proof of Theorem 4.8 (see (4.23)) and from definition of the function hαθ (see (4.11)), we can obtain     θt Vα¯ t −Vα¯θt  ≤ W

lt ¯ Δt . 1 − γt

Hence, for all (π 1 , π 2 ) ∈ Π 1 × Π 2 and x ∈ X, from (4.32) and (4.35),   1 2  Exπ ,π Vα¯θtt −Vα¯θt  = O(t 2ν )O(t −1/m ), as t → ∞. W

(4.36)

4.4 Average Optimal Strategies

61

Then, because 2ν < 1/m, we get lim Exπ

1 ,π 2

t→∞

   θt  Vα¯ t −Vα¯θt  = 0. W

Again, from definition of the functions hαθ (x) and jαθ (see (4.11)), we have   1 2  lim Exπ ,π hθα¯tt − hθα¯ t  = 0 t→∞

and

lim Exπ

W

1 ,π 2

t→∞

   θt   jα¯ t − jαθ¯ t  = 0.

(4.37)

(4.38)

(4.39)

On the other hand, following similar ideas as in the proofs of Lemma 3.2(b) and relation (2.50), and with the necessary changes, we obtain   1 2  lim Exπ ,π hθα¯tt − hθα¯ t  W (xt ) = 0 (4.40) t→∞

and

W

1 2 lim E π ,π Δ¯t W (xt ) = t→∞ x

0.

(4.41)

4.4 Average Optimal Strategies 1 2 1 2 Let (π∗1 , π∗2 ) ∈ Π 1 × Π 2 be the  of  strategies  ipair determined by (ϕ¯t , ϕ¯t ) ∈ Φ ×iΦ i i (see  i (4.34)). That is, π∗ = ϕ¯t = ϕ¯t (·|x, ω ) for i = 1, 2. In addition, let πˆ∗ = ϕˆt , i = 1, 2, be the strategies defined by

ϕˆti (·|x) =

 Ω

ϕ¯ti (·|x, ω )P(d ω ).

Then, our main result is stated as follows. Theorem 4.10. Under Assumptions 4.1, 4.2, and 4.9, the pair (π∗1 , π∗2 ) ∈ Π 1 × Π 2 is a random pair of average optimal strategies for the game G M , that is, j∗ = inf J(x, π∗1 , π 2 ) = sup J(x, π 1 , π∗2 ) ∀x ∈ X. π 2 ∈Π 2

(4.42)

π 1 ∈Π 1

  Furthermore, the strategies πˆ∗i = ϕˆti , for i = 1, 2, form an average optimal pair of nonrandom strategies.     Proof. We first prove the optimality of π∗2 = ϕ¯t2 (·|xt , ω ) = ϕ¯t2 , for which we will show j∗ = sup J(x, π 1 , π∗2 ) ∀x ∈ X. π 1 ∈Π 1

62

4 Empirical Approximation-Estimation Algorithms in Markov Games

  Let π 1 = πt1 ∈ Π 1 be an arbitrary strategy for player 1. Then Lt := r(xt , πt1 , ϕ¯t2 ) + α¯ t



hθα¯ t [F(xt , πt1 , ϕ¯t2 , s)]θ (ds) − jαθ¯ t − hθα¯ t (xt )

ℜk

π 1 ,π∗2

= r(xt , πt1 , ϕ¯t2 ) + α¯ t Ex which implies π 1 ,π 2 n−1 Ex ∗

n−1 



jαθ¯ t

r(xt , at , bt ) −



  hθα¯ t (xt+1 ) | ht − jαθ¯ t − hθα¯ t (xt ),



n−1 

π 1 ,π 2 n−1 Ex ∗

=



t=0

hθα¯ t (xt ) − α¯ t hθα¯ t (xt+1 )

t=0



π 1 ,π 2 +n−1 Ex ∗

n−1

∑ Lt



.

t=0

Hence, from (1.13) and (4.13) " J(x, π

1

, π∗2 ) − j∗

= lim inf n→∞



π 1 ,π 2 n−1 Ex ∗



hθα¯ t (xt ) − α¯ t hθα¯ t (xt+1 )



t=0



#

n−1

π 1 ,π 2 +n−1 Ex ∗

n−1 

∑ Lt

.

(4.43)

t=0

Therefore, the remainder of the proof consists in showing that " #   1 ,π 2 n−1 1 ,π 2 n−1 π π lim inf n−1 Ex ∗ ∑ hθα¯ t (xt ) − α¯ t hθα¯ t (xt+1 ) + n−1 Ex ∗ ∑ Lt ≤ 0. n→∞

t=0

t=0

(4.44) ¯ implies that {α¯ t } remains constant for long time First observe that Condition F.2 periods. Then, by applying similar arguments as in the proof of (3.40) and (3.41) in Sect. 3.3.1, we get   1 2 n−1 −1 π ,π∗ θ θ (4.45) lim n Ex ∑ hα¯ t (xt ) − α¯ t hα¯ t (xt+1 ) = 0. n→∞

t=0

Now, we will proceed to prove that lim

n→∞

π 1 ,π 2 n−1 Ex ∗

n−1

∑ Lt

≤ 0.

t=0

To this end, it is enough to prove that π 1 ,π∗2

lim sup Ex t→∞

[Lt ] ≤ 0.

(4.46)

4.4 Average Optimal Strategies

63

Observe that, from (4.33), for each t ∈ N0 ,      θ θ  hα¯ t (F(x, a, b, s))θt (ds) − hα¯ t (F(x, a, b, s))θ (ds) ≤ Δ¯t W (x).  ℜk

ℜk

Hence, adding and subtracting the terms

α¯ t

 ℜk

hθα¯ t [F(xt , πt1 , ϕ¯t2 , s)]θt (ds) and α¯ t

 ℜk

hθα¯tt [F(xt , πt1 , ϕ¯t2 , s)]θt (ds)

we get Lt ≤ Δ¯t W (xt ) + Lt0 + Lt1 ,

(4.47)

where

     θt θ 1 2 1 2  Lt :=  hα¯ t [F(xt , πt , ϕ¯t , s)]θt (ds) − hα¯ t [F(xt , πt , ϕ¯t , s)]θt (ds) , k k ℜ ℜ 0

Lt1 := r(xt , πt1 , ϕ¯t2 ) + α¯ t

 ℜk

hθα¯tt [F(xt , πt1 , ϕ¯t2 , s)]θt (ds) − jαθ¯ t − hθα¯ t (xt ).

    Note that Lt0 ≤ hθα¯tt − hθα¯ t  and, therefore, from (4.38), W

π 1 ,π∗2

lim Ex

t→∞

Lt0 = 0.

(4.48)

For Lt1 , adding and subtracting jαθ¯tt and hθα¯tt (xt ), from the definition of ϕ¯t2 (see (4.34)), we obtain 

 Lt1 ≤ max r(xt , ϕ 1 , ϕ¯t2 ) + α¯ t hθα¯tt [F(xt , ϕ 1 , ϕ¯t2 , s)]θt (ds) − jαθ¯tt − hθα¯tt (xt ) ϕ 1 ∈A(x)

S

+ jαθ¯tt − jαθ¯ t + hθα¯tt (xt ) − hθα¯ t (xt )         ≤  jαθ¯tt − jαθ¯ t  + hθα¯tt − hθα¯ t  W (xt ). W

Thus, (4.39) and (4.40) imply π 1 ,π∗2

lim sup Ex t→∞

Lt1 ≤ 0.

(4.49)

Combining (4.41), (4.47), (4.48), and (4.49), we get (4.46), which, together with (4.45), yields (4.44). Thus, from (4.43) J(x, π 1 , π∗2 ) ≤ j∗

∀x ∈ X.

64

4 Empirical Approximation-Estimation Algorithms in Markov Games

Finally, since π 1 ∈ Π 1 is arbitrary, from Theorem 4.5, j∗ = sup J(x, π 1 , π∗2 ) ∀x ∈ X. π 1 ∈Π 1

The optimality of π∗1 is proved similarly. Finally, the average optimality of the pair (πˆ∗1 , πˆ∗2 ) ∈ Π 1 × Π 2 is proved following similar arguments as in part (c) of Theorem 4.8. 

4.5 Empirical Recursive Methods According to Theorems 4.6 and 4.8, the empirical procedure to approximate the value Vαθ and obtain an optimal pair of strategies is as follows. At each stage t ∈ N, the players determine the distribution θt and select actions that would be optimal if θt were the true distribution. Hence, letting t enough large we obtain optimality in the original game, as is stated in Theorem 4.8. In order to perform in this way, the players must solve the equation T(α ,θt )Vαθt = Vαθt for Vαθt , and then determine the selectors ϕt1 and ϕt2 satisfying (4.17) and (4.18). However, solving such an equation could be difficult, which represents an obstacle to the implementation of the procedure. In order to avoid this inconvenience, we now present a value iteration algorithm which combined with the empirical process defines a recursive method to approximate the value of the game. We define the sequence of functions {Vt } ⊂ CW as V0 = 0, and for t ∈ N and x ∈ X, Vt (x) = T(α ,θt )Vt−1 (x) = max

min

ϕ 1 ∈A(x) ϕ 2 ∈B(x)

= max

min

ϕ 1 ∈A(x) ϕ 2 ∈B(x)

r(x, ϕ 1 , ϕ 2 ) + α

α r(x, ϕ , ϕ ) + t 1

2



 S

Vt−1 [F(x, ϕ 1 , ϕ 2 , s)]θt (ds)

t−1

∑ Vt−1 [F(x, ϕ

1

, ϕ , ξi )] . 2

(4.50)

i=0

Moreover, from Theorem 4.4 (see Theorem 4.6(b)), there exists (ϕ˜t1 , ϕ˜t2 ) ∈ Φ 1 × Φ 2 such that, for all x ∈ X, Vt (x) = r(x, ϕ˜t1 , ϕ˜t2 ) +

α t−1 ∑ Vt−1 [F(x, ϕ˜t1 , ϕ˜t2 , ξi )] t i=0

t−1 α = max r(x, ϕ 1 , ϕ˜t2 ) + ∑ Vt−1 [F(x, ϕ 1 , ϕ˜t2 , ξi )] t i=0 ϕ 1 ∈A(x) α t−1 1 2 1 2 = min r(x, ϕ˜t , ϕ ) + ∑ Vt−1 [F(x, ϕ˜t , ϕ , ξi )] . t i=0 ϕ 2 ∈B(x)

4.5 Empirical Recursive Methods

65

Similarly as Sect. 4.2.2, there exist ϕ˜ ∞1 ∈ Φ 1 and ϕ˜ ∞2 ∈ Φ 2 such that ϕ˜ ∞1 (x, ω ) ∈ A(x) and ϕ˜ ∞2 (x, ω ) ∈ B(x) are accumulation points of {ϕ˜t1 (x, ω )} and {ϕ˜t2 (x, ω )}, respectively.     We define the strategies π˜∗i = ϕ˜ti ∈ Π i and π˜∞i = ϕ˜ ∞i ∈ Πsi , for i = 1, 2.



The hope is that the pairs π˜∞1 , π˜∞2 and π˜∗1 , π˜∗2 have a good performance in the original game model G M , provided that Vt , for t enough large, gives a good approximation of the value function Vαθ . These properties are established precisely in the following result. Theorem 4.11. Under Assumptions 4.1, 4.3, and 4.7   (a) Vt −Vαθ W → 0 P − a.s., as t → ∞;

(b) the pair of strategies π˜∗1 , π˜∗2 ∈ Π 1 × Π 2 is asymptotically optimal (see Definition 2.1);

(c) the pair of strategies π˜∞1 , π˜∞2 ∈ Πs1 × Πs2 is optimal for the game G M . Proof. Since for each α ∈ (0, 1) and μ ∈ P(S), the operator T(α ,μ ) is contraction (see (4.6)), from Theorem 4.4 and (4.50) we have             θ Vα −Vt+1  ¯ ≤ T(α ,θ )Vαθ − T(α ,θt )Vαθ  ¯ + T(α ,θt )Vαθ − T(α ,θt )Vt  ¯ W

W

W

        ≤ T(α ,θ )Vαθ − T(α ,θt )Vαθ  + γα Vαθ −Vt  W¯



    ≤ Δt + γα Vαθ −Vt  , W¯

(4.51)

  where the last inequality follows from (4.22). Letting l := lim supt→∞ Vαθ −Vt W¯ < ∞ and taking lim sup as t goes to infinity in (4.51), from (4.20) we obtain 0 ≤ l ≤ γα l. Thus, as γα ∈ (α , 1), we have that l = 0, which, together with (4.7), yields part (a). 1 2

˜ ˜ 1The2 asymptotic optimality of the pair π∗ , π∗ as well as the optimality of π˜∞ , π˜∞ are proved by following the ideas in the proofs of Theorem 2.11, adapted to the empirical distribution, and Theorem 4.8, respectively.  The Average Case. Once we have the method to approximate the discounted value function Vαθ , we can obtain a procedure to approximate the value of the average game and optimal strategies by applying the VDFA introduced in Sect. 3.2. Observe that for each α ∈ (0, 1), x ∈ X, and θ ∈ P(S), the equations Vαθ (x) = T(α ,θ )Vαθ (x)

66

4 Empirical Approximation-Estimation Algorithms in Markov Games

and

$$
j_\alpha^\theta + h_\alpha^\theta(x) = T_{(\alpha,\theta)}h_\alpha^\theta(x)
$$

are equivalent, where for a fixed state z ∈ X (see (4.11))

$$
j_\alpha^\theta := (1-\alpha)V_\alpha^\theta(z), \qquad h_\alpha^\theta(x) := V_\alpha^\theta(x)-V_\alpha^\theta(z).
$$

Then, the strategies $\pi_*^i=\{\bar\varphi_t^i\}$ in Theorem 4.10 are defined by the selectors $\bar\varphi_t^i$, for i = 1, 2, such that

$$
V_{\bar\alpha_t}^{\theta_t}(x) = T_{(\bar\alpha_t,\theta_t)}V_{\bar\alpha_t}^{\theta_t}(x)
= r(x,\bar\varphi_t^1,\bar\varphi_t^2)+\bar\alpha_t\int_S V_{\bar\alpha_t}^{\theta_t}\big[F(x,\bar\varphi_t^1,\bar\varphi_t^2,s)\big]\,\theta_t(ds)
$$
$$
= \max_{\varphi^1\in A(x)}\left\{ r(x,\varphi^1,\bar\varphi_t^2)+\bar\alpha_t\int_S V_{\bar\alpha_t}^{\theta_t}\big[F(x,\varphi^1,\bar\varphi_t^2,s)\big]\,\theta_t(ds)\right\}
$$
$$
= \min_{\varphi^2\in B(x)}\left\{ r(x,\bar\varphi_t^1,\varphi^2)+\bar\alpha_t\int_S V_{\bar\alpha_t}^{\theta_t}\big[F(x,\bar\varphi_t^1,\varphi^2,s)\big]\,\theta_t(ds)\right\},
$$

where $\{\bar\alpha_t\}$ is a sequence of discount factors satisfying Conditions F.1 and F.2, and $V_{\bar\alpha_t}^{\theta_t}$ is the value of the game $GM_t^{\bar\alpha_t}$. According to these facts and Theorem 4.10, in order to approximate the value of the game and optimal strategies for the average criterion, a first step is to approximate the function $V_{\bar\alpha_t}^{\theta_t}$. To this end, we can apply, for each t ∈ N0, the empirical value iteration procedure stated in Theorem 4.11. Let $\{V_n^{(t)}\}\subset C_W$ be the sequence of functions defined as

$$
V_0^{(t)} = 0; \qquad V_n^{(t)}(x) = T_{(\bar\alpha_t,\theta_n)}V_{n-1}^{(t)}(x), \quad n\in\mathbb{N},\ x\in X. \qquad (4.52)
$$

It is clear from Theorem 4.11(a) that, for each t ∈ N0,

$$
\big\|V_n^{(t)} - V_{\bar\alpha_t}^{\theta}\big\|_W \to 0, \quad \text{as } n\to\infty. \qquad (4.53)
$$

In addition, observe that

$$
\big\|V_n^{(t)} - V_{\bar\alpha_t}^{\theta_t}\big\|_W
\le \big\|V_n^{(t)} - V_{\bar\alpha_t}^{\theta}\big\|_W + \big\|V_{\bar\alpha_t}^{\theta} - V_{\bar\alpha_t}^{\theta_t}\big\|_W.
$$

Hence, combining this fact with (4.53) and (4.37), we get, for all $(\pi^1,\pi^2)\in\Pi^1\times\Pi^2$,

$$
\lim_{t\to\infty}\lim_{n\to\infty} E_x^{\pi^1,\pi^2}\big\|V_n^{(t)} - V_{\bar\alpha_t}^{\theta_t}\big\|_W = 0. \qquad (4.54)
$$


Similarly, observe that

$$
E_x^{\pi^1,\pi^2}\big|j_{\bar\alpha_t}^{\theta_t} - j_*\big|
\le E_x^{\pi^1,\pi^2}\big|j_{\bar\alpha_t}^{\theta_t} - j_{\bar\alpha_t}^{\theta}\big| + \big|j_{\bar\alpha_t}^{\theta} - j_*\big|.
$$

Thus, from (4.39) and (4.13), we have

$$
\lim_{t\to\infty} E_x^{\pi^1,\pi^2}\big|j_{\bar\alpha_t}^{\theta_t} - j_*\big| = 0,
$$

that is,

$$
\lim_{t\to\infty} E_x^{\pi^1,\pi^2}\big|(1-\bar\alpha_t)V_{\bar\alpha_t}^{\theta_t}(z) - j_*\big| = 0. \qquad (4.55)
$$

Therefore, from (4.54),

$$
\lim_{t\to\infty}\lim_{n\to\infty} E_x^{\pi^1,\pi^2}\big|(1-\bar\alpha_t)V_n^{(t)}(z) - j_*\big| = 0. \qquad (4.56)
$$

The use of the empirical distribution yields a very general method for approximating the value function and constructing optimal strategies, since both the random variables ξt and the distribution θ can be arbitrary. This generality comes at the cost of imposing equicontinuity and equi-Lipschitz conditions for the discounted and the average criteria, respectively. In fact, even in the case of a bounded payoff function r, such conditions are necessary because we need uniform convergence on (x, a, b) ∈ K (see Propositions 4.1 and 4.2). It is worth remarking that to obtain estimation and control procedures under the empirical distribution it is not necessary to implement a projection scheme, as in the case of the density estimation approach studied in previous chapters (see (2.38) and (3.17)).
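To make the recursive scheme concrete, the following Python sketch implements the empirical value iteration (4.50) for a game with finite state, action, and disturbance spaces, so that the operator T(α,θt) reduces to solving a matrix game at each state. It is only an illustrative sketch: the function names, the use of scipy's linear-programming routine for the inner max-min, and the assumption that the dynamics map back into the finite state set are choices made here, not constructions taken from the text.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game max_p min_q p^T M q (player 1 maximizes),
    computed by the standard linear program over player 1's mixed strategies."""
    m, n = M.shape
    # variables: (p_1, ..., p_m, v); maximize v  <=>  minimize -v
    c = np.zeros(m + 1); c[-1] = -1.0
    # constraints: v - sum_i p_i M[i, j] <= 0 for every column j
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def empirical_value_iteration(states, actions_a, actions_b, payoff, dynamics,
                              samples, alpha):
    """Recursion V_t = T_(alpha, theta_t) V_{t-1} of (4.50): theta_t is the empirical
    distribution of the observed disturbances samples[0], ..., samples[t-1].
    dynamics(x, a, b, s) is assumed to return an element of `states`."""
    V = {x: 0.0 for x in states}            # V_0 = 0
    for t in range(1, len(samples) + 1):
        xi = samples[:t]                    # disturbances observed so far
        V_new = {}
        for x in states:
            M = np.array([[payoff(x, a, b)
                           + alpha * np.mean([V[dynamics(x, a, b, s)] for s in xi])
                           for b in actions_b] for a in actions_a])
            V_new[x] = matrix_game_value(M)
        V = V_new
    return V
```

For the difference-equation games of Chap. 5, `dynamics(x, a, b, s)` would play the role of the function F in (5.1), restricted here to a finite state set.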

Chapter 5

Difference-Equation Games: Examples

In this chapter we introduce several examples to show the relevance of the estimation and control procedures analyzed throughout the book. In the first part we focus on illustrating the assumptions on the game models, which can be classified into five classes, namely:

• continuity conditions contained in Assumptions 2.1, 2.2, and 2.3;
• W-growth conditions in Assumptions 2.4 and 2.7, and their variants (see Assumptions 2.8, 3.1, 3.4, 4.1, and 4.3);
• conditions on the density stated in Assumptions 2.9 and 2.10;
• ergodicity conditions given in Assumption 3.2 (see also Assumption 4.2) to analyze the average criterion; and finally
• equicontinuity and equi-Lipschitz conditions in Assumptions 4.7 and 4.9 used to formulate the empirical estimation-approximation schemes.

All these assumptions will be illustrated in zero-sum games evolving according to a stochastic difference equation of the form

$$
x_{t+1} = F(x_t,a_t,b_t,\xi_t) \quad\text{for } t\in\mathbb{N}_0, \qquad (5.1)
$$

where F : K × S → X is a given measurable function, assuming compactness of the admissible action sets A(x) and B(x), for x ∈ X. In addition, {ξt} is a sequence of observable i.i.d. random variables defined on a probability space (Ω, F, P), taking values in a Borel space S, with common distribution θ ∈ P(S), and independent of the initial state x0.


We conclude by presenting numerical implementations of the estimation and control algorithms in two examples. The first is a class of linear-quadratic games where a suitable density estimation process is applied and the optimal strategies are found in the set of pure strategies (see Definition 1.2). Next, the empirical approximation is shown in an example with finite state space.

5.1 Continuity Conditions

In this section we present some insights on the fulfillment of the continuity of the integral

$$
(x,a,b)\mapsto \int_X v(y)\,Q(dy|x,a,b) = \int_S v\big[F(x,a,b,s)\big]\,\theta(ds) \qquad (5.2)
$$

for (x, a, b) ∈ K, for the cases v ∈ B(X) and v ∈ C(X), involved in Assumptions 2.1, 2.2, and 2.3. Essentially, the continuity in (5.2) depends on the properties of the function F in (5.1) and of the distribution θ ∈ P(S) (see Propositions C.3 and C.4 in Appendix C).

Example 5.1. Suppose that for every s ∈ S, the function F(x, a, b, s) is continuous in (x, a, b) ∈ K, and {ξt} is a sequence of i.i.d. random variables independent of x0 ∈ X, with distribution θ ∈ P(S). Then, for every v ∈ C(X), the function v[F(·, ·, ·, s)] is continuous, which implies, by the Dominated Convergence Theorem, that

$$
(x,a,b)\mapsto \int_S v\big[F(x,a,b,s)\big]\,\theta(ds), \quad (x,a,b)\in K, \qquad (5.3)
$$

is continuous.



Example 5.2 (Additive Noise Systems). Consider stochastic games evolving on X = ℜ according to an additive noise difference equation of the form xt+1 = F(xt, at, bt, ξt) = G(xt, at, bt) + ξt for t ∈ N0, with A = B = ℜ, where G is a continuous function and {ξt} is a sequence of i.i.d. random variables with a continuous density ρ on S = ℜ with respect to Lebesgue measure. In this setting, for every v ∈ B(X), applying Scheffé's Theorem (see Theorem D.1), the function

$$
(x,a,b)\mapsto \int_X v(y)\,Q(dy|x,a,b) = \int_{\Re} v\big[G(x,a,b)+s\big]\rho(s)\,ds
= \int_{\Re} v(y)\,\rho\big(y-G(x,a,b)\big)\,dy, \quad (x,a,b)\in K, \qquad (5.4)
$$

is continuous.


In the following sections we present examples of zero-sum games whose dynamics fall within one of the cases (5.3) or (5.4) or both.

5.2 Autoregressive Game Models

Let {xt} be the state process of a game whose dynamics are defined by the equation xt+1 = G(at, bt)xt + ξt for t ∈ N0, with initial state x0, state space X = [0, ∞), compact action sets A(x) = A ⊂ ℜ and B(x) = B ⊂ ℜ for x ∈ X, and G : A × B → (0, λ] a given continuous function with λ < 1. Moreover, {ξt} is a sequence of i.i.d. random variables taking values in S = [0, ∞) with a continuous density ρ and finite expectation E[ξ0] = ξ̄.

In the spirit of a zero-sum game model, we can consider the following situation. Let x∗ ∈ X be a fixed state. Suppose that the objective of player 1 is to select actions tending to move the game process {xt} away from x∗, while player 2 wants to move the process {xt} as close as possible to x∗. In these circumstances, we consider a payoff function of the form

$$
r(x,a,b) = \sqrt{|x-x^*|}, \quad (x,a,b)\in K.
$$

In addition, defining $W(x) := \sqrt{x+x^*+1}$, x ∈ X, we get

$$
\int_X W^2(y)\,Q(dy|x,a,b) = \int_0^\infty W^2\big[G(a,b)x+s\big]\rho(s)\,ds
= \int_0^\infty \big[G(a,b)x+s+x^*+1\big]\rho(s)\,ds
$$
$$
\le \lambda x + x^* + 1 + \bar\xi \le \lambda(x+x^*+1) + x^* + \bar\xi + 1 = \lambda_0 W^2(x) + d_0,
$$

where λ0 := λ and d0 := x∗ + ξ̄ + 1. Moreover, letting $\beta := \lambda_0^{1/2}$ and $d := d_0^{1/2}$ we get (see (2.23) in Remark 2.4(a))

$$
\int_0^\infty W\big[G(a,b)x+s\big]\rho(s)\,ds \le \beta W(x) + d.
$$

On the other hand, applying arguments similar to (5.4) we can prove that the function

$$
(x,a,b)\mapsto \int_0^\infty W\big[G(a,b)x+s\big]\rho(s)\,ds, \quad (x,a,b)\in K,
$$

is continuous. Hence, Assumptions 2.4 and 2.8 with p = 2 are satisfied (see Assumptions 3.1(d), (e), 3.4, and 4.1). To verify Assumptions 2.9 and 2.10, observe that

$$
\psi(s) := \sup_{x\in X}\sup_{a\in A(x)}\sup_{b\in B(x)} \frac{W\big[F(x,a,b,s)\big]}{W(x)}
= \sup_{x\in X}\sup_{a\in A(x)}\sup_{b\in B(x)} \left(\frac{G(a,b)x+s+x^*+1}{x+x^*+1}\right)^{1/2}
< (1+s)^{1/2}, \quad s\in[0,\infty).
$$

Thus, taking ρ̃(·) ≡ ρ(·) we obtain

$$
\bar\Psi := \int_0^\infty \psi^2(s)\,\tilde\rho(s)\,ds \le \int_0^\infty (1+s)\,\rho(s)\,ds = 1+\bar\xi < \infty.
$$

In the particular case in which {ξt} is assumed to be a sequence of i.i.d. random variables taking values in S = [0, s∗], for some s∗ > 0, Assumption 4.3 is satisfied. Indeed, for all (x, a, b, s) ∈ K × S,

$$
W^2\big[G(a,b)x+s\big] = G(a,b)x+s+x^*+1 \le \lambda x + s^* + x^* + 1 \le \lambda(x+x^*+1) + s^* + x^* + 1.
$$

Hence

$$
W\big[G(a,b)x+s\big] \le \lambda^{1/2}W(x) + (s^*+x^*+1)^{1/2}.
$$

Therefore, defining β := λ^{1/2} and d := (s∗ + x∗ + 1)^{1/2}, Assumption 4.3 holds. In conclusion, from Theorem 2.11, there exists a pair (π∗1, π∗2) ∈ Π1 × Π2 of asymptotically discounted optimal strategies.
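As a numerical sanity check of the drift inequality just derived, the sketch below simulates the disturbance and compares a Monte Carlo estimate of ∫₀^∞ W[G(a,b)x+s]ρ(s)ds with the bound βW(x)+d. The particular choices of G, the exponential density ρ, and x∗ = 1 are illustrative assumptions, not part of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the text): G maps into (0, lam], exponential noise.
lam, x_star = 0.8, 1.0
G = lambda a, b: lam / (1.0 + a**2 + b**2)
W = lambda x: np.sqrt(x + x_star + 1.0)

xi = rng.exponential(scale=1.0, size=100_000)   # density rho on [0, inf), mean xi_bar = 1
xi_bar = xi.mean()
beta, d = np.sqrt(lam), np.sqrt(x_star + xi_bar + 1.0)

for x, a, b in [(0.0, 0.5, -0.5), (5.0, 1.0, 2.0), (20.0, 0.0, 0.0)]:
    lhs = W(G(a, b) * x + xi).mean()            # Monte Carlo estimate of the integral
    print(f"x={x:5.1f}:  E[W(Gx + xi)] ~ {lhs:.3f}  <=  beta*W(x)+d = {beta*W(x)+d:.3f}")
```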

5.3 Linear-Quadratic Games

The linear-quadratic games, known as LQ-games, are dynamic games evolving according to a linear equation with a quadratic payoff (see [14]). We consider the following particular case. Let {xt} be the state process satisfying the equation xt+1 = xt + at + bt + ξt for t ∈ N0, with x0 given, where X = A = B = ℜ. The random disturbance process {ξt} is a sequence of i.i.d. random variables normally distributed with mean zero and unknown variance σ² ∈ (0, 3), that is,

$$
\rho(s) := \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{s^2}{2\sigma^2}\right)\ \ \forall s\in\Re,
\qquad E(\xi_t)=0,\quad \sigma^2 = E(\xi_t^2),\ \forall t\in\mathbb{N}_0. \qquad (5.5)
$$

The sets of admissible actions for players 1 and 2 are $A(x)=B(x)=[-|x|/2,\,|x|/2]$, and the one-stage payoff r is the quadratic function r(x, a, b) = x² + a² − b². Observe that for $a,b\in[-|x|/2,\,|x|/2]$ we have

$$
r(x,a,b) \le x^2+a^2+b^2 \le x^2+\tfrac{1}{2}x^2 \le \tfrac{3}{2}(x^2+1) = M(x^2+1), \quad\text{with } M=\tfrac{3}{2}.
$$

Defining W(x) := x² + 1, for all (x, a, b) ∈ K we have

$$
\int_{\Re} W(y)\,Q(dy|x,a,b) = \int_{\Re}\big[(x+a+b+s)^2+1\big]\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{s^2}{2\sigma^2}\right)ds
$$
$$
= \int_{\Re}(x+a+b+s)^2\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{s^2}{2\sigma^2}\right)ds + 1
= \int_{\Re}y^2\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-(x+a+b))^2}{2\sigma^2}\right)dy + 1
$$
$$
= (x+a+b)^2 + \sigma^2 + 1 \le 4x^2+4 \le 4W(x). \qquad (5.6)
$$

Hence, Assumption 2.7 holds with γ̃ = 4 and any discount factor α < 1/4. On the other hand, observe that the function ψ in Assumption 2.10 satisfies

$$
\psi(s) = \sup_{x\in X}\sup_{a\in A(x)}\sup_{b\in B(x)} \frac{(x+a+b+s)^2+1}{x^2+1}
= \sup_{x\in X}\frac{(2x+s)^2+1}{x^2+1}
\le s^2 + \sup_{x\in X}\frac{4xs}{x^2+1} + 5 < s^2+2s+5,
$$

after observing that $\sup_{x\in X}\dfrac{4x}{x^2+1}=2$.

Furthermore, assuming that ρ̃ is of the form ρ̃(s) ≡ k1 exp(−k2|s|), s ∈ ℜ, we can choose the positive constants k1 and k2 such that

$$
\rho(s) \le \tilde\rho(s)\ \ \forall s\in\Re, \quad\text{and}\quad
\bar\Psi := \int_{\Re}\psi(s)\,\tilde\rho(s)\,ds \le \int_{\Re}(s^2+2s+5)\,k_1\exp(-k_2|s|)\,ds < \infty. \qquad (5.7)
$$

That is, Assumptions 2.9 and 2.10 hold. Considering that Assumption 2.4(b) can be replaced by Assumption 2.7 (see (2.11)), Theorem 2.11 yields the existence of an asymptotically discounted optimal pair of strategies.

Similarly to (5.6), if we consider the function $W(x) := (x^2+1)^{1/2}$, for all (x, a, b) ∈ K, we can prove that Assumption 2.8 is satisfied with p = 2.
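The bound (5.6) can also be checked numerically. The following sketch compares a Monte Carlo estimate of ∫W(y)Q(dy|x,a,b) with the closed form (x+a+b)² + σ² + 1 and with 4W(x); taking σ² = 2 is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 2.0                        # illustrative true variance in (0, 3)
xi = rng.normal(0.0, np.sqrt(sigma2), size=200_000)
W = lambda x: x**2 + 1.0

for x in [-3.0, 0.0, 4.0]:
    a, b = -abs(x) / 2.0, abs(x) / 2.0           # admissible actions in [-|x|/2, |x|/2]
    mc = np.mean(W(x + a + b + xi))              # Monte Carlo estimate of int W dQ
    exact = (x + a + b) ** 2 + sigma2 + 1.0      # closed form from (5.6)
    print(f"x={x:4.1f}: MC={mc:7.3f}  exact={exact:7.3f}  bound 4W(x)={4*W(x):7.3f}")
```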

5.4 A Game Model for Reservoir Operations In this section we analyze a model of a single reservoir with infinite capacity described as follows. There are two purposes in the operation of the reservoir system: the social related to the water provision to meet the demand for domestic use, and the economic where water is used in hydropower generation, land irrigation, etc. (see, e.g., [77, 78]). Under certain circumstances (e.g., political, economic, and/or social), satisfying these requirements leads to conflict situations in such a way that these can be assumed as opposing objectives, in the sense that water used for one purpose is considered as water loss for the other. Let ξ1,t be a random variable representing the inflows into the reservoir which happen at nonnegative random times Tt ,t ∈ N. At each time Tt , the decision maker observes the reservoir level xt = x ∈ X := [0, ∞) and selects actions at = a ∈ A := [0, a∗ ] and bt = b ∈ B := [b∗ , b∗ ], representing the consumption rates to satisfy both the water demand for social purposes and for economic purposes, respectively, where a∗ , b∗ , and b∗ are fixed positive constants. Observe that the reservoir level at time Tt+1 depends on the inflow ξ1,t+1 and the volume of water consumed Wt during period [Tt , Tt+1 ). For instance, if the volume of water xt = x is not depleted before time Tt+1 , then Wt = (a + b)ξ2,t+1 ,

(5.8)

5.4 A Game Model for Reservoir Operations

75

where ξ2,t+1 := Tt+1 − Tt ,t ∈ N0 , and T0 := 0. Hence, the reservoir level at time Tt+1 is (5.9) xt+1 = x −Wt + ξ1,t+1 . On the other hand, if the reservoir level xt = x is depleted before time Tt+1 , then xt+1 = ξ1,t+1 .

(5.10)

Combining (5.8)–(5.10), the reservoir level process {xt } evolves according to the stochastic difference equation xt+1 = max(xt − (at + bt )ξ2,t+1 , 0) + ξ1,t+1 , t ∈ N0 , where x0 = x ∈ X is the initial water volume. Hence, the dynamic is given by a function F : K × ℜ2+ → X defined as F(x, a, b, s) := max(x − (a + b)s2 , 0) + s1 , for all (x, a, b) ∈ K, s := (s1 , s2 ) ∈ ℜ2+ := ℜ+ × ℜ+ , which is continuous. In order to verify our assumptions, we assume that the random processes {ξ1,t } and {ξ2,t } satisfy the following conditions: C1

The processes {ξ1,t} and {ξ2,t} are independent sequences of i.i.d. random variables. We denote by ρi(·) the density of ξi,t for i ∈ {1, 2}.

C2

The joint density ρ∗(·,·) = ρ1(·)ρ2(·) is continuous and bounded by the function

$$
\tilde\rho(s_1,s_2) := \begin{cases} \tilde L\exp\big(l(s_1+s_2)\big) & \text{if } (s_1,s_2)\in S_0,\\[2pt] 0 & \text{otherwise},\end{cases}
$$

where L̃ and l are positive constants, and S0 is a compact subset of ℜ²₊.

C3

The processes {ξ1,t } and {ξ2,t } have finite expectation. Moreover E ξ1,t < b∗ E ξ2,t .

Let ΘY be the moment generating function of the random variable Y := ξ1,t − b∗ξ2,t, that is,

ΘY (t) := E exp(t(ξ1,t − b∗ E ξ2,t )) =

 ℜ2+

exp(t(s1 − b∗ s2 ))ρ ∗ (s1 , s2 )ds1 ds2 .

From Condition C3, observe that the derivative Θ′Y(0) = E[Y] < 0. Now, since ΘY(0) = 1, there exists t0 > 0 such that

β := ΘY (t0 /2) < 1 and λ0 := ΘY (t0 ) < 1.

(5.11)


We proceed to verify Assumptions 3.1 and 3.2 (see Assumptions 4.1 and 4.2). To this end, we define W (x) := L exp(t0 x/2), x ∈ X, for some L > 0;

λ (x, a, b) := P[ξ2,t > x/(a + b)], (x, a, b) ∈ K; m∗ (D) := P[ξ1,t ∈ D], D ∈ B(X). Let r : K → ℜ be a continuous one-stage payoff function such that 0 ≤ r(x, a, b) ≤ W (x), ∀(x, a, b) ∈ K. In addition, from the continuity of the functions W and F, applying similar arguments as (5.3) we can prove that the function (x, a, b) →

 ℜ2+

W [F(x, a, b, s)] ρ ∗ (s1 , s2 )ds1 ds2 , (x, a, b) ∈ K,

(5.12)

is continuous. Hence, Assumption 3.1 is satisfied. For Assumption 3.2, let S̄ := {(s1, s2) ∈ ℜ²₊ : x − (a + b)s2 ≤ 0} and S̄c := ℜ²₊ \ S̄.

W [F(x, a, b, s)] ρ ∗ (s1 , s2 )ds1 ds2 =

S¯c



W (x + s1 − (a + b)s2 )ρ ∗ (s1 , s2 )ds1 ds2

S¯c





L exp

t

2

S¯c

≤ W (x)



exp S¯c

t

0

2

0

 (s1 − (a + b)s2 ) ρ ∗ (s1 , s2 )ds1 ds2

 (s1 − b∗ s2 ) ρ ∗ (s1 , s2 )ds1 ds2 ≤ β W (x),

(5.13)

and  S¯



W [F(x, a, b, s)] ρ (s1 , s2 )ds1 ds2 =



W (s1 )ρ ∗ (s1 , s2 )ds1 ds2



= λ (x, a, b)E [W (ξ1,t )] = λ (x, a, b)

 X

W (y)m∗ (dy).

(5.14)

5.4 A Game Model for Reservoir Operations

77

Combining (5.13) and (5.14) we get  X

W (y)Q(dy|x, a, b) =



W [F(x, a, b, s)] ρ ∗ (s1 , s2 )ds1 ds2

ℜ2+

≤ β W (x) + λ (x, a, b)d, where d =



∗ X W (y)m (dy)

(5.15)

< ∞.

On the other hand, note that Q(D|x, a, b) =



1D (F(x, a, b, s1 , s2 ))ρ ∗ (s1 , s2 )ds1 ds2

ℜ2+

=



1D (s1 )ρ ∗ (s1 , s2 )ds1 ds2



+



1D (x + s1 − (a + b)s2 )ρ ∗ (s1 , s2 )ds1 ds2

S¯c





1D (s1 )ρ ∗ (s1 , s2 )ds1 ds2



= λ (x, a, b)m∗ (D), D ∈ B(X). Moreover, Pr[b∗ ξ2,t > ξ1,t ] = =

 ∞ 0

 ∞ 0

P[b∗ ξ2,t > y]ρ1 (y)dy

P[b∗ ξ2,t > x]m∗ (dx),

which is positive due to Condition C3. Since Q(D|x, a, b) ≥ P[ξ2,t > x/(a + b)]m∗ (D) ≥ P[ξ2,t > x/b∗ ]m∗ (D), we see that the ergodicity Assumption 3.2 holds. Observe that the inequality (3.14) in Assumption 3.4 follows by using  similar  arguments as in (5.15) with p = 2, λ0 := ΘY (t0 ) (see (5.11)) and d0 := E W 2 (ξ1,t ) , which is finite because of the continuity of W and Condition C2.

78

5 Difference-Equation Games: Examples

Finally, observe that

Ψ (s) ≤ max{exp(t0 s1 /2), exp(t0 /2(s1 − b∗ s2 ))}, s = (s1 , s2 ). Hence, from Condition C2, 

Ψ 2 (s1 , s2 )ρ(s1 , s2 )ds1 ds2 =

ℜ2+



Ψ 2 (s1 , s2 ) (ρ(s1 , s2 )) ds1 ds2 < ∞,

S0

which implies Assumption 3.4. Therefore we can conclude, from Theorem 3.5, that there exists an average optimal pair of strategies.

5.5 A Storage Game Model We now introduce a zero-sum game model to analyze a class of storage systems with controlled input/output, whose evolution in time is as follows. At time Tt when an amount of a certain product R > 0 accumulates for admission in the system, player 1 selects an action a ∈ [a∗ , 1] =: A, with a∗ ∈ (0, 1), representing the portion of R to be admitted. That is, aR is the product amount which player 1 has decided to admit to the system. There is a continuous consumption of the admitted product, controlled by player 2 when selecting b ∈ [b∗ , b∗ ] =: B (0 < b∗ < b∗ ) which represents the consumption rate per unit time. Thus, if xt ∈ X := [0, ∞) is the stock level, at and bt are the decisions of players 1 and 2, respectively, at the time of the t-th decision epoch Tt , then the process {xt } evolves according to the equation xt+1 = (xt + at R − bt ξt+1 )+ , where ξt+1 := Tt+1 − Tt , t ∈ N0 . We suppose that {ξt } is a sequence of i.i.d. random variables with unknown density ρ such that E [ξ ] >

R . b∗

(5.16)

We proceed as in Example 5.4 to verify Assumptions 3.1 and 3.2. Indeed, let Θ be the moment generating function of the random variable R − b∗ ξ , that is,

Θ (t) = E[exp(t(R − b∗ ξ ))]. Observe that (5.16) implies that Θ (0) < 1. In addition, because Θ (0) = 1, there exists t0 > 0 such that

β := Θ (t0 ) = E[exp(t0 (R − b∗ ξ ))] < 1.

(5.17)

5.5 A Storage Game Model

79

We suppose that the one-stage payoff function r is an arbitrary continuous function on K such that 0 ≤ r(x, a, b) ≤ Met0 x . Hence, defining W (x) := exp(t0 x), it is easy to prove that (see (5.12)) (x, a, b) →

 ∞ 0

  W (x + aR − bs)+ ρ (s)ds, (x, a, b) ∈ K,

is continuous, which yields Assumption 3.1. Let

λ (x, a, b) := P[x + aR − bξ ≤ 0] for (x, a, b) ∈ K, and m∗ (·) := δ0 (·), where δ0 is the Dirac measure concentrated at x = 0. Then, for D ∈ B(X), Q(D|x, a, b) =

 ℜ

1D (F(x, a, b, s))ρ (s)ds =

 ℜ

1D ((x + aR − bs)+ )ρ (s)ds

  = P (x + aR − bξ )+ ∈ D ≥ P [x + aR − bξ ≤ 0] δ0 (D) = λ (x, a, b)m∗ (D).

(5.18)

In addition, for each (x, a, b) ∈ K,  X

W (y)Q(dy | x, a, b) =

 ∞ 0

exp t0 (x + aR − bs)+ ρ (s)ds

= P[x + aR − bξ ≤ 0] + exp (t0 x) ≤ P[x + aR − bξ ≤ 0] + exp (t0 x)

 ∞ 0 ∞ 0

exp (t0 (aR − bs)) ρ (s)ds exp (t0 (R − b∗ s)) ρ (s)ds

= β W (x) + λ (x, a, b).

(5.19)

On the other hand, observe that

Λ¯ (x) := inf inf ψ (x, a, b) = inf inf P[x + aR − bξ ≤ 0] a∈A b∈B

a∈A b∈B

≥ P[x + R − b∗ ξ ≤ 0]. Thus,  X

Λ¯ (x)m∗ (dx) ≥

 X

P[x + R − b∗ ξ ≤ 0]m∗ (dx)

= P[R − b∗ ξ ≤ 0] > 0,

(5.20)

80

5 Difference-Equation Games: Examples

where the last inequality holds because (see (5.16)) E [R − b∗ ξ ] < 0. Therefore (5.18)–(5.20) imply Assumption 3.2 (see Assumption 4.2). In addition, observe that by the continuity of Θ , we can choose p > 1 such that (see (5.17)) λ0 := Θ (pt0 ) = E[exp(pt0 (R − b∗ ξ ))] < 1. Thus, a straightforward calculation shows that (see (5.19))  X

W p (y)Q(dy | x, a, b) ≤ P[x + aR − bξ ≤ 0] + exp (t0 px)

 ∞ 0

exp (t0 p(R − b∗ s)) ρ (s)ds,

which in turn implies  X

W p (y)Q(dy | x, a, b) ≤ λ0W p (x) + d0 ,

(5.21)

with d0 = 1. Hence, Assumption 3.4 (a) is satisfied. Now, for s ∈ [0, ∞), let K1 := {(x, a, b) ∈ K : x + aR − bs ≤ 0} K2 := {(x, a, b) ∈ K : x + aR − bs > 0} . Then 1 W [F(x, a, b, s)] W (x) x∈X a∈A b∈B

ψ (s) := sup sup sup

exp (t0 (x + aR − bs)+ ) exp(t0 x) x∈X a∈A b∈B " # exp (t0 (x + aR − bs)+ ) exp (t0 (x + aR − bs)+ ) = max sup , sup exp(t0 x) exp(t0 x) K1 K2 = sup sup sup

"

#

= max sup exp (−t0 x) , sup exp (t0 (aR − bs)) K1

K2

= max {1, exp (t0 (aR − b∗ s))} < ∞, s ∈ [0, ∞).

(5.22)

5.6 Equicontinuity and Equi-Lipschitz Conditions

81

In this sense, in order to verify Assumption 3.4(b), we assume that the density ρ is a function satisfying ρ (s) ≤ ρ(s) := L˜ exp(−qs), where L˜ and q are suitable constants such that (2.29) holds. Hence, Assumption 3.4 is satisfied (see Assumption 2.10). In conclusion, from Theorem 3.5, there exist average optimal strategies for players.
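The constant t0 in (5.17) can be located numerically once a disturbance density is fixed. The sketch below estimates the moment generating function Θ by Monte Carlo and scans a grid for values with Θ(t0) < 1; the data R, b∗, and the exponential density are illustrative assumptions, not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative storage-model data (not from the text): R, b_*, and exponentially
# distributed inter-decision times with mean E[xi] = 2.0 > R / b_*.
R, b_star = 1.0, 1.5
xi = rng.exponential(scale=2.0, size=500_000)

def Theta(t):
    """Monte Carlo estimate of the moment generating function of R - b_* xi."""
    return np.mean(np.exp(t * (R - b_star * xi)))

# Theta(0) = 1 and Theta'(0) = E[R - b_* xi] < 0, so Theta dips below 1 for small t > 0;
# scan a grid to exhibit a t0 with beta = Theta(t0) < 1, as required in (5.17).
for t0 in [0.1, 0.25, 0.5, 1.0]:
    print(f"t0={t0:4.2f}:  Theta(t0) ~ {Theta(t0):.4f}")
```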

5.6 Equicontinuity and Equi-Lipschitz Conditions The convergence properties of the empirical approximation-estimation algorithms introduced in Chap. 4 are strongly based on the equicontinuity and equi-Lipschitz conditions stated in Assumptions 4.7 and 4.9 for the discounted and average criteria, respectively. Such conditions are satisfied in several contexts. For instance, if the disturbance space S is countable, i.e., if the disturbance process {ξt } is formed by discrete random variables, the equicontinuity is trivially satisfied with respect to the discrete topology. Now, considering that equi-Lipschitz implies equicontinuity, another set of less obvious conditions can be obtained applying the ideas from [16] or [23, 24] for MDPs, and by imposing Lipschitz-like conditions on the payoff function r and on the transition kernel Q. Although we can obtain equicontinuity without going through the equi-Lipschitz property, in this section we introduce sufficient conditions for Assumption 4.9, which in turn implies Assumption 4.7 in the case S = ℜk . We will do this under Assumptions 4.1 and 4.2, and the following conditions. Assumption 5.1 (a) For all x ∈ X, A(x) = A and B(x) = B are compact sets. (b) The one-stage payoff function r and the function F are Lipschitz functions in the following sense: For all x, x ∈ X and a metric dX on X,   sup r(x, a, b) − r(x , a, b) ≤ Lr dX (x, x ) (a,b)∈A×B

and sup (a,b)∈A×B



dX F(x, a, b, s), F(x , a, b, s) ≤ LF1 dX (x, x ) ∀s ∈ ℜk ,

(5.23)

for some constants Lr > 0 and LF1 > 0. (c) The family of functions {F(x, a, b, ·); (x, a, b) ∈ K} is equi-Lipschitz, i.e., for all s, s ∈ ℜk and (x, a, b) ∈ K  

(5.24) dX F(x, a, b, s), F(x, a, b, s ) ≤ LF2 s − s  , where |·| is the Euclidean distance in ℜk .


Let {vt } ⊂ BW (X) be the value iteration functions defined as v0 = 0; and for t ≥ 1 and x ∈ X, vt (x) = min max r(x, a, b) + α

 ℜk

b∈B a∈A

= max min r(x, a, b) + α



a∈A b∈B

It is easy to prove that

ℜk

! vt−1 [F(x, a, b, s)] θ (ds) ! vt−1 [F(x, a, b, s)] θ (ds) .

    vt −Vαθ  → 0, as t → ∞, W

(5.25)

where Vαθ is the value of the game. Indeed, from (4.5), for each x ∈ X,          vt−1 [F(x, a, b, s)] −Vαθ [F(x, a, b, s)] θ (ds) vt (x) −Vαθ (x) ≤ sup α (a,b)∈A×B

ℜk

    ≤ γα vt−1 −Vαθ  W¯ (x), W¯

which yields

        (5.26) vt −Vαθ  ¯ ≤ γα vt−1 −Vαθ  ¯ . W W   Let l := lim supt→∞ vt −Vαθ W¯ . Since vt ,Vαθ ∈ BW¯ (X), we have l < ∞. Thus, taking lim sup in (5.26) and using the fact that γα < 1, we obtain l = 0, and therefore, from (4.7), (5.25) holds. Lemma 5.1. Under Assumption 5.1(a)–(b), the following hold. (a) For each t ∈ N0 , the function vt is Lipschitz with constant t−1

Lvαt := Lr ∑ (α LF1 )l .

(5.27)

l=0

(b) In addition, if α LF1 < 1, then the value of the game Vαθ is Lipschitz with constant LVα :=

Lr . 1 − α LF1

Proof. (a) We proceed by induction. Observe that the part (a) holds for v0 = 0. Now, we assume that vt is Lipschitz with constant Lvαt defined in (5.27). Then, for each x, x ∈ X, from Assumption 5.1 we obtain

5.6 Equicontinuity and Equi-Lipschitz Conditions

  vt+1 (x) − vt+1 (x ) ≤

sup (a,b)∈A×B



≤ Lr dX (x, x ) + α



(a,b)∈A×B

  r(x, a, b) − r(x , a , b )

   vt [F(x, a, b, s)] − vt F(x , a, b, s)  θ (ds)

ℜk

sup

83

Lvαt

 ℜk



dX F(x, a, b, s), F(x , a, b, s) θ (ds)

≤ Lr dX (x, x ) + α Lvαt LF1 dX (x, x )



t−1

= Lr + α LF1 Lr ∑ (α LF1 ) dX (x, x ) l

l=0





t−1

t

≤ Lr 1 + ∑ (α LF1 )l+1 dX (x, x ) = Lr ∑ (α LF1 )l dX (x, x ) l=0

l=0

= Lvαt+1 dX (x, x ).

(5.28)

This fact proves part (a). (b) Observe that if α LF1 < 1, then lim Lα t→∞ vt

=

Lr = LVα . 1 − α LF1

(5.29)

Now, for x, x ∈ X, by adding and subtracting the terms vt (x) and vt (x ), we have          θ     Vα (x) −Vαθ (x ) ≤ Vαθ (x) − vt (x) + vt (x) − vt (x ) + vt (x ) −Vαθ (x )         + Vαθ (x) − vt (x) + Lvαt dX (x, x ) + vt (x ) −Vαθ (x ) .

(5.30)

Letting t → ∞ in (5.30), from (5.25) and (5.29) we get    θ  Vα (x) −Vαθ (x ) ≤ LVα dX (x, x ), that is, Vαθ is Lipschitz with constant LVα .  The results in Lemma 5.1 together with Assumption 5.1(c) will be used to verify the equicontinuity and equi-Lipschitz conditions in Assumptions 4.7 and 4.9(c).


Observe that, because W (·) ≥ 1, for all (x, a, b) ∈ K and s, s ∈ ℜk ,  θ    Vα (F(x, a, b, s)) Vαθ (F(x, a, b, s ))   θ   ≤ Vα (F(x, a, b, s)) −Vαθ (F(x, a, b, s )) −   W (x) W (x) ≤ LVα dX (F(x, a, b, s), F(x, a, b, s ))   ≤ LVα LF2 s − s  .

(5.31)

Hence, the family of function VW (see (4.19)) is equi-Lipschitz, and therefore it is equicontinuous. On the other hand, it is worth noting that if LF1 < 1, then, from (5.27), (5.28), and (5.29), the function Vαθ is Lipschitz with constant LVα :=

Lr , 1 − LF1

which is independent on α ∈ (0, 1). Thus, from (5.31), we can prove that the family of functions VˆW in Assumption 4.9(c) is equi-Lipschitz. Example 5.3. We consider the storage game model introduced in Sect. 5.5. In this case the dynamic of the zero-sum game is defined by the function F(x, a, b, s) = (x + aR − bs)+ , where x ∈ X := [0, ∞), a ∈ A := [a∗ , 1], b ∈ B := [b∗ , b∗ ], s ∈ S := [0, ∞), and R is a positive constant. We will prove that F satisfies relations (5.23) and (5.24) in Assumption 5.1. For arbitrary and fixed (a, b, s) ∈ A × B × S, we define the sets K1 := {x ∈ X : x + aR − bs ≤ 0} and K2 := {x ∈ X : x + aR − bs > 0} , and let dX be the Euclidean distance in ℜ. If x, x ∈ K1 , then     F(x, a, b) − F(x , a, b) = 0 ≤ x − x  ,

(5.32)

and if x, x ∈ K2 , we have       F(x, a, b) − F(x , a, b) = x + aR − bs − x − aR + bs = x − x  .

(5.33)

5.7 Numerical Implementations

85

Now, if x ∈ K1 and x ∈ K2 , then F(x, a, b, s) = 0 and −(x + aR − bs) ≥ 0. Thus     F(x, a, b) − F(x , a, b) = x + aR − bs   ≤ x + aR − bs − (x + aR − bs)   = x − x  . Therefore, a combination of (5.32)–(5.34) yields (5.23).

(5.34) 

Similar arguments show that (5.24) holds with LF2 := b∗ . In conclusion, in the cases when Assumptions 4.1 and 4.2 (or Assumptions 3.1 and 3.2) hold together with Assumption 5.1, we can apply the empirical approximation schemes introduced in Chap. 4 for the discounted and the average criteria.
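The Lipschitz properties (5.23) and (5.24) established in Example 5.3 can also be checked empirically. The sketch below evaluates the corresponding difference quotients of F(x, a, b, s) = (x + aR − bs)⁺ at random points; the numerical constants are illustrative assumptions, and the Lipschitz constant in s is the largest admissible consumption rate.

```python
import numpy as np

rng = np.random.default_rng(3)
R, a_star, b_low, b_up = 1.0, 0.2, 0.5, 2.0      # illustrative constants

F = lambda x, a, b, s: max(x + a * R - b * s, 0.0)

worst_x, worst_s = 0.0, 0.0
for _ in range(100_000):
    x, xp = rng.uniform(0, 10, size=2)
    s, sp = rng.uniform(0, 10, size=2)
    a = rng.uniform(a_star, 1.0)
    b = rng.uniform(b_low, b_up)
    # ratios should stay below 1 and b_up, respectively, as in (5.23) and (5.24)
    worst_x = max(worst_x, abs(F(x, a, b, s) - F(xp, a, b, s)) / abs(x - xp))
    worst_s = max(worst_s, abs(F(x, a, b, s) - F(x, a, b, sp)) / abs(s - sp))

print(f"max |F(x,.) - F(x',.)| / |x - x'| = {worst_x:.4f}   (bound: 1)")
print(f"max |F(.,s) - F(.,s')| / |s - s'| = {worst_s:.4f}   (bound: {b_up})")
```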

5.7 Numerical Implementations 5.7.1 Linear-Quadratic Games Consider the LQ-game introduced in Sect. 5.3. We begin our analysis by presenting the solution of the game through explicit formulae of the value function and the optimal strategies. Such expressions will be used to compare the exact solution with the numerical approximations obtained in the implementation of the estimation and control algorithm. Solution to LQ-Game. We apply the value iteration approach (2.45) in Remark 2.6 considering ρt = ρ ∀t ∈ N0 . As is well known, and we will verify it below (see, e.g., [14]), from the quadratic form of the payoff and because the dynamic is additive noise with mean zero, the optimal pair of strategies can be found in the set of pure strategies (see Definition 1.2). In this case, the value iteration functions {Ut } take the form U0 = 0, and for x ∈ ℜ and t ∈ N, 

 Ut (x) = min max x2 + a2 − b2 + α Ut−1 (x + a + b + s) ρ (s)ds . (5.35) ℜ

b∈B(x) a∈A(x)

Observe that

  U1 (x) = min max x2 + a2 − b2 , x ∈ ℜ. b∈B(x) a∈A(x)

Hence, it is easy to see that f11 (x) = 0, f12 (x) = 0, and U1 (x) = x2 . To obtain U2 , from (5.35) and (5.5) we have, for x ∈ ℜ,


 U2 (x) = min max x2 + α x2 + (α + 1)a2 + (α − 1)b2 b∈B(x) a∈A(x)

+2α ax + 2α bx + 2α ab] + σ 2 α .

(5.36)

To find the saddle point, we compute the partial derivatives of the function within brackets with respect to a, and then with respect to b. Set both derivatives equal to zero and solve for a and b to obtain a = −α x, b = α x. Thus, f21 (x) = −α x and f22 (x) = α x, x ∈ ℜ. Substituting these values in (5.36) we get U2 (x) = (1 + α )x2 + σ 2 α .

(5.37)

Similarly, letting t = 3 in (5.35) and using (5.37), we obtain f31 (x) = −(1 + α )x,

f32 (x) = (1 + α )x,

and U3 (x) = (1 + α (1 + α ))x2 + σ 2 (α 2 + α (1 + α )). In general, proceeding by induction, we obtain the expressions ft1 (x) = −βt−1 x, and Ut (x) =

βt 2 x + σ 2 Lt , α

where β0 = 0,

βt = α (1 + βt−1 ) = and Lt =

ft2 (x) = βt−1 x,

α − α t+1 , t ∈ N, 1−α

(5.38)

(5.39)

(5.40)

t−1

∑ α t−i−1 βi .

i=1

A straightforward calculation shows that Lt =

α − αt 2

(1 − α )



(t − 1)α t . 1−α

(5.41)

In addition, for α < 1/4 (see (5.6)) we have that βt ≤ 1/2 ∀t ∈ N0 . Then, in this case, ft1 (x) ∈ A(x), ft2 (x) ∈ B(x),

5.7 Numerical Implementations

87

% where A(x) = B(x) = [− |x| /2, |x| 2]. Furthermore, by observing that lim βt =

t→∞

α α and lim Lt = , t→∞ 1−α (1 − α )2

we have lim Ut (x) =

t→∞

α 1 2 x + σ 2 , x ∈ ℜ. 1−α (1 − α )2

Hence, Lemma 2.3 yields (recall Uti = Ut for i = 1, 2) that the value of the game is ρ

Vα (x) =

α 1 2 x + σ 2 , x ∈ ℜ. 1−α (1 − α )2

(5.42)

In fact, we have the convergence in the W -norm with W (x) = x2 + 1, that is  ρ lim Ut −Vα W = 0. t→∞

Finally, we define the selectors f∗1 (x) : = lim ft1 (x) = lim −βt−1 x = − t→∞

t→∞

f∗2 (x) : = lim ft2 (x) = lim βt−1 x = t→∞

t→∞

α x ∈ A(x); 1−α

α x ∈ B(x). 1−α

It is easy to prove that the pair ( f∗1 , f∗2 ) satisfies the relations (2.15) and  (2.16) with ρ 1 = f 1 and π 2 = = V . Therefore, from Theorem 2.6 (see Theorem 2.5(b)), π V ∗ ∗ ∗ α 2  α f∗ form a stationary optimal pair of strategies for the LQ-game. Estimation and Control in LQ-Games. As in Sect. 5.3, we assume that {ξt } is a sequence of i.i.d. random variables normally distributed with mean zero and unknown variance σ 2 ∈ (0, 3), with density ρ given in (5.5). Now, let ξ0 , ξ1 , . . . , ξt−1 be an i.i.d. sample from the density ρ and σˆ t2 be an estimator of the variance defined as σˆ 12 = 0,

2 1 t σˆ t2 = ξi−1 − ξ¯ , t > 1, (5.43) ∑ t − 1 i=1   where ξ¯ is the sample mean. Clearly σˆ t2 is an unbiased estimator, that is E σˆ t2 = σ 2 , and an estimator ρˆt of the density ρ is   1 s2 ρˆt (s) := √ exp − 2 , s ∈ ℜ, (5.44) 2σˆ t σˆ t 2π $ where σˆ t = σˆ t2 . Following similar arguments as (5.6) and (5.7), we have that ρˆt ∈ D (see Definition 2.2 and Remark 2.5). Therefore, from (2.38) we can take ρt (s) = ρˆt (s). Thus, it is easy to see that E ρt − ρ L1 → 0 as t → ∞,


and moreover (see Lemma 2.1) E ρt − ρ  → 0 as t → ∞. Under this scenario, the value iteration functions (2.45), given in Remark 2.6 (see (5.35)), take the form U0 = 0, and for t ∈ N, 

 2 2 2 Ut (x) = min max x + a − b + α Ut−1 (x + a + b + s) ρt (s)ds . ℜ

b∈B(x) a∈A(x)

Summarizing, the following algorithm finds a pair of AD-optimal strategies and an approximation to the value of the LQ-game:

Algorithm 5.2 (Estimation and Control) 1. Set t = 0 and U0 = 0. Choose arbitrary ( f¯01 , f¯02 ) ∈ F1 × F2 , α < 1/4. 2. Set t = 1. Find ( f¯11 , f¯12 ) ∈ F1 × F2 such that, for each x ∈ ℜ,   U1 (x) = min max x2 + a2 − b2 b∈B(x) a∈A(x)

  2  = max x2 + a2 − f¯12 (x) a∈A(x)

   2 = min x2 + f¯11 (x) − b2 . b∈B(x)

3. For t > 1 and observations ξ1 , ξ2 , . . . , ξt from the normal density ρ , compute σˆ t2 and ρˆt (s) (see (5.43) and (5.44)). 4. For each x ∈ ℜ and t > 1, let 

 2 2 2 Ut (x) = min max x + a − b + α Ut−1 (x + a + b + s) ρt (s)ds . ℜ

b∈B(x) a∈A(x)

Find ( f¯t1 , f¯t2 ) ∈ F1 × F2 such that 

  2

Ut (x) = max x2 + a2 − f¯t2 (x) + α Ut−1 x + a + f¯t2 (x) + s ρt (s)ds a∈A(x)





  1 2

2 2 1 ¯ ¯ = min x + ft (x) − b + α Ut−1 x + ft (x) + b + s ρt (s)ds . b∈B(x)



    From Theorem 2.11 the strategies π¯∗1 = f¯t1 and π¯∗2 = f¯t2 are asymptotically discounted optimal, and, from Lemma 2.3, Ut approximates the value of the LQρ game Vα .

5.7 Numerical Implementations

89

Table 5.1 Sequence of iterates for the estimation and control algorithm in LQ-games

  t      Ut(−3)      Ut(0)         Ut(4)
  0      0.000000    0.000000      0.000000
  1      11.16397    0.003969657   19.84397
  2      12.30697    0.628568251   21.39017
  3      12.48668    0.683862933   21.66665
  4      12.40721    0.574538999   21.61041
  5      12.31120    0.471361817   21.51997
  10     12.48441    0.642306046   21.69493
  15     12.72211    0.880008767   21.93264
  20     12.87389    0.838517263   22.08441
  50     13.02697    0.837747231   21.86044
  60     12.95579    0.834927040   21.98144
  100    12.77091    0.8321739     21.88293
  250    12.67145    0.8311490     21.88478
  ↓      12.67313    0.8310249     21.88366
         Vαρ(−3)     Vαρ(0)        Vαρ(4)

To solve steps 2 and 4 in Algorithm 5.2 we can proceed similarly as (5.38)– (5.41). In this case it is easy to see that f¯t1 (x) = −βt−1 x ∈ A(x), whenever α < 1/4, and Ut (x) = where σˆ 2

Lt t =

f¯t2 (x) = βt−1 x ∈ B(x),

βt 2 σˆ 2 x + σ 2 Lt t , α

(5.45)

t−1

2 . ∑ α t−i−1 βi σˆ i+1

(5.46)

i=1

Finally, from Lemma 2.3, π¯ 1 ,π¯∗2  

lim Ex ∗

t→∞

ρ Ut −Vα W = 0.

Numerical Results. We fix α = 0.24. We are interested in approximating the value ρ function Vα given in (5.42) for the initial states x = 0, x = −3, and x = 4. Algorithm 5.2 was performed using the expression (5.45) with samples (ξ0 , ξ1 , . . . , ξt−1 ) simulated from the normal density (5.5), assuming that the true variance is σ 2 = 2. The results are given in Table 5.1. ρ

The approximation to the value function Vα on the entire state space is shown graphically in Fig. 5.1.


ρ

Fig. 5.1 Approximation to the value function Vα in LQ-games

5.7.2 Finite Games: Empirical Approximation

We now apply the empirical approximation-estimation schemes, introduced in Chap. 4, to a class of finite games. Specifically, based on the results stated in Sect. 4.5, we implement an algorithm defined by the empirical distribution to approximate the value of the game for both the discounted and the average criteria. Consider a game evolving according to the equation

$$
x_{t+1} = \min\{x_t+a_t+b_t,\,3\} + \xi_t, \quad t\in\mathbb{N}_0,
$$

where X = {3, 4, 5, 6}, A = A(x) = B = B(x) = {1, 2} for all x ∈ X, and {ξt} is a sequence of discrete i.i.d. random variables with unknown distribution θ, taking values in S = {0, 1, 2, 3}. The payoff function is given by

$$
r(3,a,b) = \begin{cases} 2, & a=1,\ b=1,\\ -3, & a=1,\ b=2,\\ -1, & a=2,\ b=1,\\ 3, & a=2,\ b=2;\end{cases}
\qquad
r(4,a,b) = \begin{cases} 2, & a=1,\ b=1,\\ -1, & a=1,\ b=2,\\ -1, & a=2,\ b=1,\\ 1, & a=2,\ b=2;\end{cases}
$$
$$
r(5,a,b) = \begin{cases} 4, & a=1,\ b=1,\\ -2, & a=1,\ b=2,\\ -2, & a=2,\ b=1,\\ 2, & a=2,\ b=2;\end{cases}
\qquad
r(6,a,b) = \begin{cases} 3, & a=1,\ b=1,\\ -2, & a=1,\ b=2,\\ -2, & a=2,\ b=1,\\ 1, & a=2,\ b=2.\end{cases}
\qquad (5.47)
$$

5.7 Numerical Implementations

91

Since the payoff is a bounded function, the W -growth conditions, given in Assumptions 4.1 and 4.3, hold (see Remark 4.1). Furthermore, because the state, control, and disturbance spaces are countable, the continuity and equicontinuity conditions trivially are satisfied. Thus, the results stated in Sect. 4.5 are applicable. However, it is worth noting that it is not possible to ensure, in advance, that the optimal pair belongs to the set of pure strategies. Hence, the analysis will be done on the set of all strategies. In this sense, we consider the sets A = P(A) and B = P(B) defined as   A := ϕ 1 = (γ , 1 − γ ) : γ ∈ [0, 1] and

  B := ϕ 2 = (β , 1 − β ) : β ∈ [0, 1] ,

where γ and 1 − γ represent the probabilities of selecting actions 1 and 2 by player 1, respectively. Similarly, β and 1 − β are the probabilities of selecting actions 1 and 2 by player 2. Let r¯ : X × [0, 1] × [0, 1] → ℜ be the function defined as r¯(x, γ , β ) = r(x, 1, 1)γβ + r(x, 1, 2)γ (1 − β ) +r(x, 2, 1)(1 − γ )β + r(x, 2, 2)(1 − γ )(1 − β ).

(5.48)

According to (1.5), we have, for each ϕ 1 = (γ , 1 − γ ) ∈ A and ϕ 2 = (β , 1 − β ) ∈ B, r(x, ϕ 1 , ϕ 2 ) = r¯(x, γ , β ), γ , β ∈ [0, 1].

(5.49)

Discounted Criterion. We apply the value iteration procedure defined in (4.50). Considering (5.49) the functions {Vt } take the form V0 = 0, and for t ∈ N and x ∈ X, 

 Vt (x) = max min r¯(x, γ , β ) + α V¯t−1 [F(x, γ , β , s)]θt (ds) γ ∈[0,1] β ∈[0,1]

= max min

γ ∈[0,1] β ∈[0,1]

S

α t−1 ¯ r¯(x, γ , β ) + ∑ Vt−1 [F(x, γ , β , s)] , t i=0



(5.50)

where V¯t [F(x, γ , β , s)] = V¯t [F(x, 1, 1, s)]γβ + V¯t [F(x, 1, 2, s)]γ (1 − β ) +V¯t [F(x, 2, 1, s)](1 − γ )β + V¯t [F(x, 2, 2, s)](1 − γ )(1 − β ). For t = 1 and x = 3, from (5.47) we have V1 (3) = max min [¯r(3, γ , β )] γ ∈[0,1] β ∈[0,1]

= max min {9γβ − 6γ − 4β + 3} . γ ∈[0,1] β ∈[0,1]


Computing derivatives of the function within brackets with respect to γ , and then with respect to β , and equaling them to zero, we obtain a system of equations whose solution yields     4 5 6 3 , , ϕ˜ 11 (3) = , ϕ˜ 12 (3) = . 9 9 9 9 1 1 Thus, V1 (3) = . We proceed similarly for x = 4, 5, 6, to obtain V1 (4) = , V1 (5) = 3 5 2 1 , V1 (6) = − , and 5 8     2 3 2 3 , , ϕ˜ 11 (4) = , ϕ˜ 12 (4) = , 5 5 5 5     2 3 2 3 1 2 ˜ ˜ , , ϕ1 (5) = , ϕ1 (5) = , 5 5 5 5     3 5 3 5 1 2 , , ϕ˜ 1 (6) = , ϕ˜ 1 (6) = . 8 8 8 8 For t = 2 and x = 3, from (5.50) we have α 1 ¯ V2 (3) = max min r¯(x, γ , β ) + ∑ V1 [F(x, γ , β , s)] 2 i=0 γ ∈[0,1] β ∈[0,1]   α = max min 9γβ − 6γ − 4β + 3 + [V1 (3 + ξ0 ) +V1 (3 + ξ1 )] . 2 γ ∈[0,1] β ∈[0,1] In addition ϕ˜ 21 (3) = prove that

4

5 9, 9



and ϕ˜ 22 (3) =

V2 (x) = V1 (x) +

6

3 9, 9

α 2



. In fact, for x = 3, 4, 5, 6 we can

1

∑ V1 (3 + ξi ).

i=0

In general, since the value of a matrix game shifted by a constant is the value plus that constant, a straightforward calculation shows that, for x ∈ X and t ∈ N,

$$
V_t(x) = V_1(x) + \frac{\alpha}{t}\sum_{i=0}^{t-1} V_{t-1}(3+\xi_i). \qquad (5.51)
$$

Numerical Results. We implement the empirical approximation scheme by taking samples (ξ0 , ξ1 , . . . , ξt−1 ) from the Binomial distribution θ with parameters (3, 1/2), that is P [ξt = k] = θ (k) =

   3 3 1 , k = 0, 1, 2, 3. k 2

It is worth noting that if we take θt = θ as the true distribution in (5.50), from (5.51) it is easy to see that

5.7 Numerical Implementations

93

Table 5.2 Sequence of iterates for empirical value iteration t 0 1 2 3 4 5 10 15 20 50 60 100 250

Vt (3) 0.000000 0.3333333 0.5166667 0.6138889 0.6402778 0.6076389 0.5534278 0.5638499 0.5569752 0.5597872 0.5649057 0.5844444 0.5885507 ↓ 0.5843750 Vαθ (3)

Vt (4) 0.000000 0.2000000 0.3833333 0.4805556 0.5069444 0.4743056 0.4200945 0.4305166 0.4236419 0.4264539 0.4315723 0.4511110 0.4552173 ↓ 0.4510417 Vαθ (4)

Vt (5) Vt (6) 0.000000 0.000000 0.4000000 -0.12500000 0.5833333 0.05833333 0.6805556 0.15555556 0.7069444 0.18194444 0.6743056 0.14930556 0.6200945 0.09509445 0.6305166 0.10551658 0.6236419 0.09864192 0.6264539 0.10145386 0.6315723 0.10657232 0.6511110 0.12611104 0.6552173 0.12621734 ↓ ↓ 0.6510417 0.1260416667 Vαθ (5) Vαθ (6) 6

Vt (x) = Vt−1 (x) + α ∑ Vt−1 (i)θ (i − 3) i=3

= V1 (x) + 0.2510417



α − αt 1−α

 .

Hence, letting t → ∞ we obtain the value of the game Vαθ (x) = V1 (x) +

0.2510417 α , x = 3, 4, 5, 6. 1−α

(5.52)

This expression will be used to compare the numerical approximation with the exact solution. The numerical results shown in Table 5.2 correspond to the value α = 1/2. Specifically, we show the approximation to the exact values Vαθ (3) = 0.584375, Vαθ (4) = 0.4510417, Vαθ (5) = 0.6510417, and Vαθ (6) = 0.1260417, obtained from (5.52). In Fig. 5.2 we show the approximation to the value function Vαθ for different samples (ξ0 , ξ1 , . . . , ξt−1 ) from the Binomial distribution. Average Criterion. We apply the empirical VDFA introduced in Sect. 4.5 to approximate the value j∗ of the average game. Observe that from (5.52) and (4.13), for any sequence of discount factors αt → 1 we have, j∗ = lim jαθ t = lim (1 − αt )Vαθt (z) = 0.2510417, ∀z = 3, 4, 5, 6. t→∞

t→∞

As is stated in Sect. 4.5, in order to obtain an approximation to the value j∗ = 0.2510417, we fix a suitable sequence {α¯ t } converging, slowly enough, to one.


Fig. 5.2 Simulations of the empirical value iteration functions for different samples from the Binomial distribution Table 5.3 Empirical approximation to the average value function t 0 1 2 3 4 5 10 15 20 50 60 100 250 500 1000 1500

α¯ t 0.2928932 0.4226497 0.5000000 0.5527864 0.5917517 0.6220355 0.7113249 0.7574644 0.7867993 0.8613250 0.8729999 0.9009852 0.9370059 0.9553678 0.9684088 0.9741973 ↓ 1.0000000

(t)

jn (3) 0.000000 1.0000000 0.2933915 0.2891668 0.2860465 0.2836202 0.2764585 0.2727517 0.2703918 0.2644850 0.2635119 0.2611855 0.2581822 0.2566325 0.2555126 0.2550064 ↓ 0.2510417 j∗

(t)

jn (4) 0.000000 1.0000000 0.2267248 0.2295384 0.2316134 0.2332249 0.2365061 0.2400184 0.2416999 0.2458146 0.2464404 0.2479183 0.2497663 0.2506756 0.2512983 0.2515661 ↓ 0.2510417 j∗

(t)

jn (5) 0.000000 1.0000000 0.3267248 0.3189811 0.3132631 0.3088178 0.2957035 0.2889207 0.2846052 0.2738202 0.2720477 0.2678191 0.2623902 0.2596110 0.2576197 0.2567266 ↓ 0.2510417 j∗

(t)

jn (6) 0.000000 1.00000000 0.06422481 0.08419393 0.09893273 0.13929619 0.14843031 0.16417203 0.17444542 0.20030550 0.20521931 0.21557962 0.22925250 0.23615570 0.24102606 0.24318020 0.2510417 j∗

Then, for each α¯ t , we apply the approximation scheme given by (4.52), where the (t) functions Vn can be obtained from (5.51) with α¯ t instead of α . Hence, from (4.56), (t) (t) jn (z) := (1 − α¯ t )Vn (z) is an estimator of j∗ . Table 5.3 shows the approximation to the constant j∗ = 0.2510417 taking the 1 . From a practical point of view, the algorithm was persequence α¯ t = 1 − √ t +1 formed letting t = n. It is worth noting that, really, for any z ∈ {3, 4, 5, 6}, we obtain the approximation to the value of the average game j∗ .

Appendix A

Elements from Analysis

A.1 Semicontinuous Functions Unless stated otherwise, throughout the following we suppose that X is a topological ¯ i.e., ℜ ¯ = ℜ ∪ {−∞, ∞} . space, and we denote the extended real numbers set as ℜ, ¯ is said to be Definition A.1. A function v : X → ℜ (a) lower semicontinuous (l.s.c.) if the set {x ∈ X : v(x) ≤ r} is closed in X for every r ∈ ℜ; (b) upper semicontinuous (u.s.c.) if the set {x ∈ X : v(x) ≥ r} is closed in X for every r ∈ ℜ. Clearly, a function v is l.s.c. if and only if −v is u.s.c., and v is continuous if and only if v is l.s.c. and u.s.c. The next results summarize the main properties of semicontinuous functions. For their proofs, see, e.g., [3, 5]. ¯ Then: Proposition A.1. Let X be a metric space and v : X → ℜ. (a) v is l.s.c. if and only if for each sequence {xn } in X such that xn → x ∈ X, as n → ∞, we have lim infn→∞ v(xn ) ≥ v(x). (b) v is l.s.c. and bounded from below if and only if there exists a sequence {vn } of bounded and continuous functions such that vn  v.

Remark A.1. Similarly, v is u.s.c. if and only if for each sequence {xn } in X such that xn → x ∈ X, as n → ∞, we have lim supn→∞ v(xn ) ≤ v(x). Moreover, v is u.s.c. and bounded from above if and only if there exists a sequence {vn } of bounded and continuous functions such that vn  v. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 J. A. Minj´arez-Sosa, Zero-Sum Discrete-Time Markov Games with Unknown Disturbance Distribution, SpringerBriefs in Probability and Mathematical Statistics, https://doi.org/10.1007/978-3-030-35720-7

95

96

A Elements from Analysis

Proposition A.2. Let X be a compact metric space. ¯ is l.s.c., then v attains its infimum, i.e., there exists x∗ ∈ X such (a) If v : X → ℜ that v(x∗ ) = infx∈X v(x). ¯ is u.s.c., then v attains its supremum, i.e., there exists x∗ ∈ X (b) If v : X → ℜ ∗ such that v(x ) = supx∈X v(x). A topological space X is a Hausdorff space if for each x, y ∈ X, with x = y, there / For exist neighborhoods Ux and Vy of x and y, respectively, such that Ux ∩ Vy = 0. instance, all metric spaces are Hausdorff. Theorem A.1 (Fan Minimax Theorem). Let X and Y be two compact Hausdorff space and h : X ×Y → ℜ be a real valued function. Suppose that (a) h(x, y) is u.s.c. in x ∈ X for each y ∈ Y, and l.s.c. in y ∈ Y for each x ∈ X; (b) h is concave in X and convex in Y. Then the following equality holds: max min h(x, y) = min max h(x, y). x∈X y∈Y

y∈Y x∈X

Proof. See [15].

A.2 Spaces of Functions For a function ρ : ℜk → ℜ and 1 ≤ q < ∞, we define the Lq -norm ρ Lq :=

 ℜk

1/q |ρ |q d μ

,

where μ is the Lebesgue measure on ℜk . The space Lq = Lq (ℜk ) consists of all real-valued measurable functions on ℜk with finite Lq -norm:   Lq = ρ : ρ Lq < ∞ . A Borel space is a Borel subset of a complete separable metric space. A Borel space is always endowed with the Borel σ -algebra B(X), that is, the smallest σ -algebra of subsets of X that contains all the open sets in X. In this sense, “measurable,” for either sets or functions, means “Borel measurable.” For a Borel space X, we define the following spaces: • B(X), Banach space of real-valued bounded measurable functions on X with the supremum norm vB := sup |v(x)| . x∈X

A.3 Multifunctions and Selectors

97

• C(X) ⊂ B(X), subspace of bounded continuous functions. • L(X), space of l.s.c. functions that are bounded from below. Let W : X → [1, ∞) be a measurable function. We define: • BW (X), normed linear space of measurable functions with finite weighted norm (W -norm) v |v(x)|   vW :=   = sup . (A.1) W B x∈X W (x) We will refer to W as a weight function. Observe that vW ≤ vB < ∞ for all v ∈ B(X); that is, a bounded function is a W -bounded function. Moreover, BW (X) is a Banach space that contains B(X) (see [31]). We define the spaces of functions: CW (X) = C(X) ∩ BW (X) ⊂ BW (X) and LW (X) = L(X) ∩ BW (X) ⊂ BW (X). That is, CW (X) is the subspace of W -bounded continuous functions whereas LW (X) is the subspace of W -bounded and l.s.c. functions.

A.3 Multifunctions and Selectors Throughout the following, X and Y are Borel spaces. Definition A.2. A multifunction Ψ from X to Y is a function such that Ψ (x) is a nonempty subset of Y for all x ∈ X. A multifunction is also called a correspondence or set-valued mapping. We use the notation Ψ : X  Y for a multifunction Ψ from X to Y . A multifunction Ψ : X  Y is said to be compact-valued (respectively, closedvalued) if for each x ∈ X, Ψ (x) is a compact (resp. closed) subset of Y. Its graph is the subset Gr(Ψ ) ⊂ X ×Y defined as Gr(Ψ ) := {(x, y) ∈ X ×Y : y ∈ Ψ (x)} . If Y0 is a nonempty subset of Y, we define Ψ −1 [Y0 ] := {x ∈ X : Ψ (x) ∩Y0 = 0} / .

98

A Elements from Analysis

Definition A.3. Let Ψ : X  Y be a multifunction. It is said that: (a) Ψ is Borel-measurable if Ψ −1 [G] is a Borel subset of X for each open subset G ⊂ Y; (b) Ψ is u.s.c. if Ψ −1 [F] is closed in X for each closed subset F ⊂ Y ; (c) Ψ is l.s.c. if Ψ −1 [G] is open in X for each open subset G ⊂ Y ; (d) Ψ is continuous if it is upper and lower semicontinuous. Proposition A.3. Let Ψ : X  Y be a compact-valued multifunction. The following statements are equivalent: (a) Ψ is Borel-measurable. (b) Ψ −1 [F] is a Borel subset of X for each closed subset F ⊂ Y. (c) Gr(Ψ ) is a Borel subset of X ×Y. (d) Ψ is a measurable function from X to the space of nonempty compact subset of Y topologized by the Hausdorff metric. Proof. See, e.g., [37, 66]. Definition A.4. Let Ψ : X  Y be a Borel-measurable multifunction. A measurable function f : X → Y such that f (x) ∈ Ψ (x), x ∈ X, is called a measurable selector for Ψ. A measurable selector for Ψ is also called decision function for Ψ , and results stating its existence are called measurable selection theorems. Theorem A.2. Let Ψ : X  Y be a Borel-measurable compact-valued multifunction and v : Gr(Ψ ) → ℜ be a measurable function. (a) If v(x, ·) is u.s.c. on Ψ (x) for each x ∈ X, then there exists a measurable selector f ∗ of Ψ such that v(x, f ∗ (x)) = max v(x, y) =: v∗ (x) ∀x ∈ X, y∈Ψ (x)

and v∗ is measurable. (b) If v(x, ·) is l.s.c. on Ψ (x) for each x ∈ X, then there exists a measurable selector f∗ of Ψ such that v(x, f∗ (x)) = min v(x, y) =: v∗ (x) ∀x ∈ X, y∈Ψ (x)

and v∗ is measurable. Proof. See, e.g., [37, 66]. Proposition A.4. Let Ψ : X  Y be a Borel-measurable compact-valued multifunction. If { fn } is a sequence of measurable selectors for Ψ , then there exists a measurable selector f ∗ for Ψ such that f ∗ (x) ∈ Ψ (x) is an accumulation point of the sequence { fn (x)} for each x ∈ X.

A.3 Multifunctions and Selectors

99

Proof. See, e.g., [66]. Theorem A.3 (Berge Maximum Theorem). Let Ψ : X  Y be a continuous compact-valued multifunction between topological spaces X and Y, and v : Gr(Ψ ) → ℜ be a continuous function. Define the multifunction Ψ ∗ : X  Y by Ψ ∗ (x) = {y ∈ Ψ (x) : v(x, y) = v∗ (x)} , where v∗ (x) = maxy∈Ψ (x) v(x, y). Then: (a) v∗ is continuous. (b) Ψ ∗ is a compact-valued multifunction. (c) If either v has a continuous extension to X × Y or Y is Hausdorff, Ψ ∗ is an u.s.c. multifunction. Proof. See [4, pp. 115-116] (see also [1]).

Appendix B

Probability Measures and Weak Convergence

Throughout the following, X is a Borel space with Borel σ -algebra B(X). Definition B.1. Let μ , μt , t ∈ N, probability measures on X. It is said that μt conw verges weakly to μ (denoted as μt → μ ) if 

vd μt →



vd μ

for all v ∈ C(X). We denote by P(X) the space of all probability measures on X. We assume that P(X) is endowed with the topology of weak convergence given in Definition B.1. In this case, as X is a Borel space, P(X) is a Borel space (see, e.g., [38, p. 91], [5, Cor. 7.25.1]). In addition, if X is compact, then so is P(X) (see, e.g., [62, Th. 6.4]). Furthermore, let !  P0 (X) := μ ∈ P(X) : |s| d μ (s) < ∞ . X

From [49], if X is a σ -compact set, then so is P0 (X). Proposition B.1. For μ , μt ∈ P(X), t ∈ N, the following statements are equivalent: w (a) μt → μ . (b) lim inf μt (D) ≥ μ (D) for every open set D ⊂ X and μt (X) → μ (X). t→∞

(c) lim sup μt (D) ≤ μ (D) for every closed set D ⊂ X and μt (X) → μ (X). t→∞

Proof. See, e.g., [3].


101

102

B Probability Measures and Weak Convergence w

Remark B.1. From Proposition A.1, it is easy to prove that if μt → μ and v ∈ L(X), then   lim inf vd μt ≥ vd μ . t→∞

w

Similarly, from Remark A.1, if μt → μ and v is u.s.c. and bounded from above, then 

lim sup t→∞

vd μt ≤



vd μ . 

For a function u : X → ℜ, we define the mapping u˜ : P(X) → ℜ as u( ˜ μ ) = ud μ . Proposition B.2. (a) If u ∈ L(X), then u˜ ∈ L(P(X)); (b) if u is u.s.c. and bounded from above, then u˜ is u.s.c. and bounded from above on P(X). Proof. See, e.g., [21, Lemma 3.3]. Proposition B.3. For a multifunction Ψ : X  Y , let Ψ¯ : X  P(X) be the multifunction defined as Ψ¯ (x) := P(Ψ (x)). If Ψ is continuous, then so is Ψ¯ . Proof. See, e.g., [36, Theorem 3]. Let V be a family of real-valued measurable function defined on X. Definition B.2. Let μ ∈ P(X). It is said that V is a μ -uniformity class if      sup  vd μt − vd μ  → 0 as t → ∞, v∈V

w

for any sequence {μt } ⊂ P(X) such that μt → μ . The following result states the μ -uniformity for equicontinuous families of functions. A family V is equicontinuous at each x ∈ X, if for each x ∈ X and ε > 0, there exists δ > 0 such that   d(x, x ) < δ , x ∈ X, implies v(x) − v(x ) < ε for all v ∈ V , where d is the metric on X. Proposition B.4. If V is uniformly bounded and equicontinuous at each x ∈ X, then V is a μ -uniformity class for every probability measure μ ∈ P(X). Proof. See, e.g., [64, Theorem 3.1] (see also [6]).

B.1 The Empirical Distribution

103

B.1 The Empirical Distribution Let ξ be a random variable defined on a probability space (Ω , F , P) taking values in a Borel space S. We denote by θ ∈ P(S) the distribution of ξ , that is, for D ∈ B(S),

θ (D) := P [ξ ∈ D] . Throughout the remainder, (Ω , F , P) is a fixed probability space and “a.s.” means “almost surely with respect to P.” Let S be a Borel space and ξ¯t = (ξ1 , ξ2 , . . . , ξt ) be an independent and identically distributed (i.i.d.) sample from the distribution θ ∈ P(S). The empirical distribution θt of the sample ξ¯t is defined as

θt (D) :=

1 t ∑ 1D (ξi ), t ∈ N, D ∈ B(S). t i=1

Observe that θt is a (random) probability measure that puts mass 1/t at each observation, and for a function v : S → ℜ  S

vd θt =

1 t ∑ v(ξi ), t ∈ N. t i=1

Furthermore, since, for each D ∈ B(S), {1D (ξi )} is a sequence of i.i.d. random variables with mean E [1D (ξi )] = θ (D), from the Strong Law of Large Numbers (SLLN) θt (D) → θ (D) a.s., as t → ∞. Remark B.2. In the particular case when S = ℜ, the study of the empirical distribution focuses on the empirical distribution function. Specifically, if ξ1 , ξ2 , . . . , ξt are i.i.d. random variables with distribution function F, we define the empirical distribution function as Ft (s) = θt ((−∞, s]), s ∈ ℜ. Thus, from the SLLN we have that Ft (s) → F(s) a.s., s ∈ ℜ. In fact, we have uniform convergence, that is sup |Ft (s) − F(s)| → 0 a.s. s∈ℜ

This result is known as Glivenko-Cantelli Theorem (see, e.g., [18]). A well-known property of the empirical distribution is the weak convergence given in the following result (see, e.g., [18]). w
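The Glivenko-Cantelli convergence can be illustrated numerically by computing sup_s |Ft(s) − F(s)| for samples of increasing size; in the sketch below, the choice of a standard normal distribution and the use of scipy's normal distribution function are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

# Empirical distribution function of an i.i.d. sample and the statistic sup_s |F_t(s) - F(s)|.
for t in [10, 100, 1000, 10000]:
    xi = np.sort(rng.normal(size=t))
    F_t = np.arange(1, t + 1) / t               # empirical distribution function at the order statistics
    gap = np.max(np.maximum(np.abs(F_t - norm.cdf(xi)),
                            np.abs(F_t - 1.0 / t - norm.cdf(xi))))
    print(f"t={t:6d}:  sup_s |F_t(s) - F(s)| ~ {gap:.4f}")
```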

Proposition B.5. θt → θ a.s., that is, 

vd θt →



vd θ a.s., as t → ∞,

104

B Probability Measures and Weak Convergence

for every v ∈ C(S). In addition, if v ∈ L(S), then (see Remark B.1) 

lim inf t→∞

ud θt ≥



ud θ a.s.

Similarly if v is only u.s.c. and bounded from above 

lim sup t→∞

vd θt ≤



vd θ .

A combination of the Propositions B.4 and B.5 yields the following result (see [64, Theorem 3.1]). Proposition B.6. Let V be a family of real-valued function defined on S. If V is uniformly bounded and equicontinuous at each s ∈ S, then       (B.1) ηt := sup  vd θt − vd θ  → 0 a.s., as t → ∞. v∈V

A family of function satisfying (B.1) is called a Glivenko-Cantelli class. This class of functions is widely studied in the field of statistical learning theory (see, e.g., [71]). We now present an important result that provides the rate of convergence of the expectation E ηt for a particular family of functions. Proposition B.7. Suppose that S = ℜk and 

|s|m¯ d θ < ∞ for m¯ =

km and some m > max {2, k} . (m − k)(m − 2)

If V is uniformly bounded and equi-Lipschitzian on ℜk , then there is a constant M¯ such that ¯ −1/m . E ηt ≤ Mt

Proof. See [12, Theorem 3.2 and Proposition 3.4]. The family V equi-Lipschitzian on ℜk means that there exists a constant L > 0 such that, for every s, s ∈ ℜk and v ∈ V     v(s) − v(s ) ≤ L s − s  , where |·| is the Euclidean distance in ℜk .

Appendix C

Stochastic Kernels

Throughout the following, X and Y are Borel spaces. Definition C.1. A stochastic kernel γ (dx|y) on X given Y is a function such that γ (·|y) ∈ P(X) for every y ∈ Y and γ (D|·) ∈ B(Y ) for each fixed D ∈ B(X). We denote by P(X|Y ) the family of stochastic kernels on X given Y. Definition C.2. Let γ ∈ P(X|Y ). It said that (a) γ is strongly continuous if the function y→



v(x)γ (dx|y)

(C.1)

is bounded and continuous for each function v ∈ B(X). (b) γ is weakly continuous if the function in (C.1) is bounded and continuous for each v ∈ C(X). The following result establishes some of the main properties of the stochastic kernels. Proposition C.1. Let γ ∈ P(X|Y ). (a) The following statements are equivalent: (a.1) γ is strongly continuous. (a.2) The function in (C.1) is l.s.c. for each v ∈ B(X). (a.3) γ (D|·) is continuous on Y for every D ∈ B(X). (b) The following statements are equivalent: (b.1) γ is weakly continuous. (b.2) The function in (C.1) is l.s.c. for each v ∈ L(X). Proof. See, e.g., [30, Appendix C].


105

106

C Stochastic Kernels

Proposition C.2 (Theorem of C. Ionescu Tulcea). Let {Xt }t∈N0 be a sequence of ∞ Xt . In addition, let ν ∈ Borel spaces, Yt := X0 × X1 × . . . × Xt , t ∈ N0 , and Y := ∏t=0 P(X0 ) be an arbitrary probability measure, and for t ∈ N0 , γt ∈ P (Xt+1 |Yt ) . Then there exists a unique probability measure Pν on Y such that, for every measurable rectangle D0 × D1 × . . . × Dt in Yt Pν (D0 × . . . × Dt ) =



ν (dx0 )

D0



γ0 (dx1 |y0 )

D1



γ1 (dx2 |y1 ) . . .

D2



γt−1 (dxt |yt−1 ) ,

Dt

where yt = (x0 , x1 , . . . , xt ) ∈ Yt . Proof. See, e.g., [5, pp. 140-141].

C.1 Difference-Equation Processes

Let $\{x_t\}$ be a stochastic process on X defined by
$$x_{t+1} = F(x_t, \xi_t),\quad t\in\mathbb N_0,\qquad x_0\in X \text{ given}, \tag{C.2}$$
where $\{\xi_t\}$ is a sequence of i.i.d. random variables taking values in a Borel space S, with a common distribution $\theta\in P(S)$, and independent of the initial state $x_0$. In addition, $F: X\times S\to X$ is a given measurable function. Equation (C.2), together with the distribution $\theta$, defines a stochastic kernel $\gamma\in P(X|X)$ by
$$\gamma(D|x) := \Pr[x_{t+1}\in D \mid x_t = x] = \Pr[F(x,\xi_t)\in D \mid x_t = x] = \theta\{s\in S : F(x,s)\in D\} = \int_S \mathbf 1_D[F(x,s)]\,\theta(ds),\quad D\in\mathcal B(X),\ x\in X. \tag{C.3}$$
Hence, for any measurable function $v$ on X,
$$E[v(x_{t+1}) \mid x_t = x] = \int_X v(y)\,\gamma(dy|x) = \int_S v[F(x,s)]\,\theta(ds),\quad x\in X,$$
whenever the integrals exist. In particular, if $S = \Re^k$ and $\theta$ has a density $\rho$ with respect to the Lebesgue measure, i.e., $\theta(D) = \int_D\rho(s)\,ds$ for all $D\in\mathcal B(\Re^k)$, the stochastic kernel $\gamma$ in (C.3) becomes
$$\gamma(D|x) = \int_{\Re^k} \mathbf 1_D[F(x,s)]\,\rho(s)\,ds,\quad D\in\mathcal B(X),\ x\in X, \tag{C.4}$$
and
$$\int_X v(y)\,\gamma(dy|x) = \int_{\Re^k} v[F(x,s)]\,\rho(s)\,ds,\quad x\in X.$$

Proposition C.3. If the function $F(x,s)$ in (C.2) is continuous in $x\in X$ for each $s\in S$, then the stochastic kernel $\gamma$ in (C.3) is weakly continuous.

Proof. See Example C.7 in [30].

Proposition C.4. Assume that $X = S = \Re^k$, $F(x,s) = G(x) + s$ for some measurable function $G$, and $\theta$ has a density $\rho$ with respect to the Lebesgue measure. Then the stochastic kernel $\gamma$ in (C.3) (see (C.4)) is strongly continuous.

Proof. See Example C.8 in [30].
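To make the construction concrete, here is a small Python sketch (illustrative only; the dynamics $F$, the function $G$, and the disturbance density are assumptions of the example, not of the text) that simulates the process (C.2) and approximates the conditional expectation $\int_X v(y)\,\gamma(dy|x) = \int_S v[F(x,s)]\,\theta(ds)$ by Monte Carlo.

```python
import numpy as np

# A minimal sketch of the difference-equation process (C.2), assuming the
# additive-noise dynamics F(x, s) = 0.5*x + s of Proposition C.4 (G(x) = 0.5*x)
# and a standard normal disturbance density rho; both choices are illustrative.
rng = np.random.default_rng(1)
F = lambda x, s: 0.5 * x + s

# Simulate one trajectory x_0, x_1, ..., x_T
T, x = 50, 0.0
path = [x]
for _ in range(T):
    x = F(x, rng.standard_normal())
    path.append(x)

# Monte Carlo approximation of E[v(x_{t+1}) | x_t = x] = int v(F(x, s)) rho(s) ds
v = lambda y: y**2
x0, n = 2.0, 10**6
approx = v(F(x0, rng.standard_normal(n))).mean()
exact = (0.5 * x0)**2 + 1.0              # E[(0.5*x0 + Z)^2] = (0.5*x0)^2 + Var(Z)
print(approx, exact)
```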

Appendix D

Review on Density Estimation

We consider random variables $\xi$ defined on a probability space $(\Omega,\mathcal F, P)$ taking values in $S=\Re^k$ with density function $\rho$, that is, for all $D\in\mathcal B(\Re^k)$,
$$\theta(D) = \int_D \rho(s)\,ds,$$
where $\theta\in P(\Re^k)$ is the distribution of $\xi$, i.e., $\theta(D) := P[\xi\in D]$; we denote by $E$ the corresponding expectation operator. Let $\bar\xi_t := (\xi_1,\xi_2,\ldots,\xi_t)$ be an i.i.d. sample from a density $\rho$. A density estimate is a sequence $\{\hat\rho_t\}$ of Borel measurable functions $\hat\rho_t:\Re^k\times(\Re^k)^t\to\Re$ such that, for each $t\in\mathbb N$ and sample $\bar\xi_t$, $\hat\rho_t(s;\bar\xi_t)=\hat\rho_t(s)$ is a density on $\Re^k$.

D.1 Error Criteria

The objective of density estimation methods is to obtain an estimate $\hat\rho_t$ under the least restrictive conditions possible on $\rho$. Hence, the density estimation problem can be considered as a functional approximation problem to which techniques from function approximation theory can be applied. From this point of view, there are several error criteria to measure the discrepancy between the estimate $\hat\rho_t$ and the density $\rho$; for instance, the $L^q$-criteria, $q\ge 1$, defined by the $L^q$-norm
$$\left(\int |\hat\rho_t(s)-\rho(s)|^q\,ds\right)^{1/q},$$


the most common cases being $q=1$ and $q=2$. Specifically, we define:

• the integrated absolute error and the mean integrated absolute error
$$J_t := \int |\hat\rho_t(s)-\rho(s)|\,ds = \|\hat\rho_t-\rho\|_{L^1}$$
and $E(J_t) = E\|\hat\rho_t-\rho\|_{L^1}$, respectively;

• the integrated squared error and the mean integrated squared error
$$J_t^{(2)} := \int (\hat\rho_t(s)-\rho(s))^2\,ds = \|\hat\rho_t-\rho\|^2_{L^2}$$
and $E\big(J_t^{(2)}\big) = E\|\hat\rho_t-\rho\|^2_{L^2}$, respectively.

Of these two criteria, the squared error is the most studied because it is more tractable and allows the error to be analyzed in terms of the variance–bias decomposition
$$E(\hat\rho_t(s)-\rho(s))^2 = \mathrm{Var}(\hat\rho_t(s)) + [\mathrm{bias}(\hat\rho_t(s))]^2,\quad\text{where } \mathrm{bias}(\hat\rho_t(s)) = E(\hat\rho_t(s)) - \rho(s).$$
On the other hand, as pointed out in [10] (see also [9]), the $L^1$-criterion is a better error criterion than $L^2$ for the specific case of density estimation. For instance, $L^1$ is the natural space for all densities, and the integrated absolute error $J_t$ can be recognized visually as the area between the graphs of the density $\rho$ and the estimator $\hat\rho_t$. Furthermore, we have the following result, known as Scheffé's Theorem, which is useful for interpreting the $L^1$-criterion.

Theorem D.1 (Scheffé's Theorem [9, 10]). Let $\rho$ and $\varphi$ be densities. Then
$$\int |\rho(s)-\varphi(s)|\,ds = 2\sup_{D\in\mathcal B(\Re^k)}\left|\int_D \rho(s)\,ds - \int_D \varphi(s)\,ds\right|.$$

For two probability measures $\theta,\mu\in P(\Re^k)$, we define the total variation as
$$\|\theta-\mu\|_{TV} := \sup_{D\in\mathcal B(\Re^k)} |\theta(D)-\mu(D)|.$$
If $\theta$ and $\mu$ have densities $\rho$ and $\varphi$, respectively, then Scheffé's Theorem yields
$$\|\theta-\mu\|_{TV} = \frac12\int|\rho-\varphi| = \frac12\,\|\rho-\varphi\|_{L^1}. \tag{D.1}$$
Hence, if $\hat\rho_t$ is an estimate of the density $\rho$, from (D.1), with $\hat\rho_t$ instead of $\varphi$, we have

$$\text{if }\ \int|\rho(s)-\hat\rho_t(s)|\,ds < \varepsilon\ \text{ then }\ |\theta(D)-\theta_t(D)| < \frac{\varepsilon}{2}\ \text{ a.s.} \tag{D.2}$$
for all $D\in\mathcal B(\Re^k)$, where $\theta_t(D) = \int_D \hat\rho_t(s)\,ds$. Another important property of the $L^1$-criterion is that $J_t$ is invariant under monotone transformations. Indeed, let $T:\Re^k\to\Re^k$ be a one-to-one transformation, and let $\xi$ and $\chi$ be random variables with densities $\rho$ and $\varphi$, respectively. Consider the random variables $\xi^* = T(\xi)$ and $\chi^* = T(\chi)$ with densities $\rho^*$ and $\varphi^*$, respectively. Then $\|\rho-\varphi\|_{L^1} = \|\rho^*-\varphi^*\|_{L^1}$.
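The identity (D.1) can also be checked numerically. In the Python sketch below (illustrative only; the two normal densities are assumptions of the example), the supremum over Borel sets is evaluated at $D=\{\rho>\varphi\}$, where it is attained by Scheffé's Theorem.

```python
import numpy as np

# A minimal numerical check of Scheffe's Theorem / (D.1), assuming two standard
# illustrative densities on the real line: rho = N(0,1) and phi = N(1,1).
s = np.linspace(-8.0, 9.0, 200_001)
ds = s[1] - s[0]
rho = np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)
phi = np.exp(-0.5 * (s - 1.0)**2) / np.sqrt(2 * np.pi)

l1 = np.sum(np.abs(rho - phi)) * ds          # ||rho - phi||_{L1}
# The supremum over Borel sets is attained at D = {rho > phi}, so
tv = np.sum((rho - phi)[rho > phi]) * ds     # sup_D |theta(D) - mu(D)|
print(l1, 2 * tv)                            # the two numbers agree (Scheffe)
```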

D.2 Density Estimators

There are several examples of density estimators, for instance, histograms, kernel estimators, and projection estimators. Among them, the best understood is the histogram, which, in its simplest version, is defined as follows. For convenience, we assume that $\xi_1,\xi_2,\ldots,\xi_t$ is an i.i.d. sample from a density $\rho$ taking values in a k-rectangle (box) $S\subset\Re^k$. Divide S into disjoint bins or sub-rectangles $D_1, D_2,\ldots,D_m$ of volume $h^k$, that is, $m\approx 1/h^k$. The histogram estimate is defined as
$$\hat\rho_t(s) = \frac1t\sum_{i=1}^t \frac{\mathbf 1_{[\xi_i\in D_j]}}{h^k},\quad s\in D_j,\ j = 1,\ldots,m.$$
Although the histogram is easy to understand and to compute, its applicability is limited in situations where a deeper mathematical analysis is required. For instance, its discontinuity is a limitation when it is necessary to know the derivative of the estimators. However, important density estimators emerge as extensions of the histogram, such as the kernel estimate, on which we will focus.
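Before turning to the kernel estimate, the following Python sketch (an illustration only; k = 1, the bin width, and the target density are arbitrary choices, not prescribed by the text) implements the histogram estimate just described.

```python
import numpy as np

# A minimal sketch of the histogram estimate for k = 1 on S = [0, 1); the bin
# width h and the target density (Beta(2, 5)) are illustrative choices only.
rng = np.random.default_rng(2)
t, h = 5000, 0.05
m = int(round(1.0 / h))                  # number of bins, m ~ 1/h
xi = rng.beta(2.0, 5.0, size=t)          # i.i.d. sample on [0, 1)

counts, edges = np.histogram(xi, bins=m, range=(0.0, 1.0))
rho_hat = counts / (t * h)               # value of the estimate on each bin D_j

def hist_estimate(s):
    """Evaluate the histogram estimate at a point s in [0, 1)."""
    j = min(int(s / h), m - 1)
    return rho_hat[j]

print(hist_estimate(0.2), np.sum(rho_hat) * h)   # the estimate integrates to 1
```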

D.2.1 The Kernel Estimate

A Borel measurable function $K:\Re^k\to\Re$ is a kernel if $K\ge 0$ and $\int_{\Re^k}K(s)\,ds = 1$.

Definition D.1. Let $\xi_1,\xi_2,\ldots,\xi_t$ be an i.i.d. sample from a density $\rho$. Given a kernel $K$, the kernel estimate is defined by
$$\hat\rho_t(s) = \frac{1}{t\,d_t^k}\sum_{i=1}^t K\!\left(\frac{s-\xi_i}{d_t}\right),\quad s\in\Re^k,$$
where $\{d_t\}$ is a sequence of positive numbers.

A well-known property of the kernel estimate is its consistency, that is, $E(J_t)\to 0$ as $t\to\infty$, which is stated in the following result (see, e.g., [11]).


Theorem D.2. Let $\hat\rho_t$ be a kernel estimate with arbitrary kernel $K$, and let $\{d_t\}$ be a sequence such that $d_t\to 0$ and $t d_t^k\to\infty$ as $t\to\infty$. Then $E(J_t)\to 0$ as $t\to\infty$.

In fact, as is proved in [10], $\hat\rho_t$ is strongly consistent, i.e., $J_t\to 0$ in probability, and furthermore all types of convergence of $J_t$ to 0 are equivalent. Specifically, we have the following result.

Theorem D.3. Let $\hat\rho_t$ be a kernel estimate with arbitrary kernel $K$. Then the following statements are equivalent:
(a) $J_t\to 0$ in probability as $t\to\infty$.
(b) $J_t\to 0$ a.s. as $t\to\infty$.
(c) $J_t\to 0$ exponentially as $t\to\infty$; that is, for all $\varepsilon>0$ there exist $r, t_0>0$ such that $P[J_t\ge\varepsilon]\le e^{-rt}$, $t\ge t_0$.
(d) $d_t\to 0$ and $t d_t^k\to\infty$ as $t\to\infty$.

Remark D.1. Observe that $J_t\le 2$ a.s. Hence, it is easy to see that if $E(J_t)\to 0$ as $t\to\infty$, then, for any $q>0$, $E(J_t^q)\to 0$.

By analyzing the consistency of the kernel estimate $\hat\rho_t$ with the mean integrated absolute error $E(J_t)$, we can obtain its rate of convergence. In this case, it is necessary to impose conditions on the density $\rho$. For simplicity, we present the result for $k=1$, which is proved, for instance, in [11] (see also [10]).

Theorem D.4. Let $\hat\rho_t$ be a kernel estimate where $K$ is a kernel such that

$\int s^2 K(s)\,ds<\infty$ and $K(s)=K(-s)$, $s\in\Re$. In addition, $\{d_t\}$ is a sequence such that $d_t\to 0$ and $t d_t\to\infty$ as $t\to\infty$. Let $\rho$ be a density satisfying the following conditions:
(i) $\int|\rho''|<\infty$;
(ii) $\int\sqrt{\rho^*}<\infty$, where $\rho^*(s) = \sup_{s-1\le y\le s+1}\rho(y)$.

Then $E(J_t) = O(t^{-2/5})$ as $t\to\infty$.

Remark D.2. In the same sense as Remark D.1, if $E(J_t) = O(t^{-\gamma})$ as $t\to\infty$ for some $\gamma>0$, then, for any $q>1$,
$$E(J_t^q) = O(t^{-\gamma})\quad\text{as } t\to\infty. \tag{D.3}$$
Indeed, if $q>1$,
$$E(J_t^q) = E\big(J_t\,J_t^{\,q-1}\big) \le 2^{q-1}E(J_t),$$
which, because $E(J_t)=O(t^{-\gamma})$ as $t\to\infty$, implies (D.3).


Depending on their features, some densities are expected to be more difficult to estimate than others, so the efficiency of the same kernel density estimator may differ. In addition, the accuracy and good behavior of the kernel density estimator depend on other factors, such as the choice of the kernel $K$ and of the bandwidth. A thorough analysis of these issues, as well as examples of estimators in several contexts, can be found, for instance, in [9–11, 27, 74, 75].
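As an illustration of Definition D.1 and Theorem D.2 (not part of the original text), the next Python sketch computes a kernel estimate with the Gaussian kernel and bandwidth $d_t = t^{-1/5}$, and evaluates the integrated absolute error $J_t$ on a grid; the target density and all numerical choices are assumptions of the example.

```python
import numpy as np

# A minimal sketch: Gaussian kernel, bandwidth d_t = t**(-1/5), and a standard
# normal target density; all of these are illustrative choices.
rng = np.random.default_rng(3)
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)     # kernel, integrates to 1
rho = lambda s: np.exp(-0.5 * s**2) / np.sqrt(2.0 * np.pi)   # true density

grid = np.linspace(-6.0, 6.0, 1201)
dg = grid[1] - grid[0]

for t in [100, 1_000, 10_000]:
    xi = rng.standard_normal(t)
    d_t = t ** (-0.2)                                        # d_t -> 0 and t*d_t -> infinity
    rho_hat = np.array([K((g - xi) / d_t).mean() / d_t for g in grid])
    J_t = np.sum(np.abs(rho_hat - rho(grid))) * dg           # integrated absolute error
    print(t, J_t)                                            # J_t decreases as t grows
```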

D.2.2 Projection Estimates

In certain situations, besides the usual statistical properties, the estimators need to satisfy additional properties. For example, it is commonly required that the estimate $\hat\rho_t$ and the density $\rho$ have the same functional properties. This case can be addressed in the following manner. Let $\mathcal D\subset L^1(\Re^k)$ be a class of densities containing $\rho$, and let $\hat\rho_t$ be an arbitrary density estimate. We define the projection estimate as the projection $\rho_t$ of $\hat\rho_t$ onto the class $\mathcal D$, that is,
$$\|\rho_t - \hat\rho_t\|_{L^1} = \inf_{\varphi\in\mathcal D}\|\varphi - \hat\rho_t\|_{L^1}.$$

The following result provides conditions ensuring the existence of the projection estimate $\rho_t$.

Theorem D.5. Let $\mathcal D\subset L^1(\Re^k)$ be a closed and convex class of densities and let $\hat\rho_t$ be an arbitrary estimate. Then the projection estimate $\rho_t\in\mathcal D$ exists and is defined as
$$\rho_t = \operatorname*{argmin}_{\varphi\in\mathcal D}\|\varphi - \hat\rho_t\|_{L^1}. \tag{D.4}$$

The proof of this result for $L^q$-spaces can be found in [45]; for the case of densities see [11].

Remark D.3. Observe that from (D.4), and since $\rho\in\mathcal D$, we have
$$\|\rho_t-\rho\|_{L^1} \le \|\rho_t-\hat\rho_t\|_{L^1} + \|\hat\rho_t-\rho\|_{L^1} \le \|\hat\rho_t-\rho\|_{L^1} + \|\hat\rho_t-\rho\|_{L^1} = 2\,\|\hat\rho_t-\rho\|_{L^1}.$$
Hence, the consistency of the projection estimate $\rho_t$, as well as its rate of convergence, is inherited from that of $\hat\rho_t$.
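As a small numerical sketch of (D.4) (illustrative only), one may take $\mathcal D$ to be the closed, convex class of mixtures of a few fixed densities and minimize a discretized $L^1$ distance over the mixture weights; the pilot estimate, the grid, and the class below are assumptions of the example, not prescribed by the text.

```python
import numpy as np

# A minimal sketch of the projection estimate (D.4), assuming (for illustration)
# that the class D is the closed convex set of mixtures of three fixed normal
# densities; the projection is found by a brute-force search over mixture weights.
def normal_pdf(s, m):
    return np.exp(-0.5 * (s - m) ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(-8.0, 8.0, 2001)
dg = grid[1] - grid[0]

rng = np.random.default_rng(4)
xi = rng.standard_normal(500)                     # sample behind the pilot estimate
d_t = len(xi) ** (-0.2)
rho_hat = normal_pdf((grid[:, None] - xi[None, :]) / d_t, 0.0).mean(axis=1) / d_t  # kernel pilot

components = np.stack([normal_pdf(grid, m) for m in (-1.0, 0.0, 1.0)])

best_w, best_err = None, np.inf
steps = np.linspace(0.0, 1.0, 101)
for w1 in steps:                                   # search the weight simplex
    for w2 in steps[steps <= 1.0 - w1 + 1e-12]:
        w = np.array([w1, w2, 1.0 - w1 - w2])
        err = np.sum(np.abs(w @ components - rho_hat)) * dg   # discretized L1 distance
        if err < best_err:
            best_w, best_err = w, err

print(best_w, best_err)                            # weights of the projection estimate
```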


D.2.3 The Parametric Case

Consider the case in which the unknown density $\rho$ belongs to a parametric family of densities $\{\rho_\lambda:\lambda\in\Lambda\}$, where $\Lambda$ is the parameter set. Thus, by estimating the unknown parameter we obtain a density estimate whose properties will depend on the parametric estimation method used. For instance, under some regularity conditions, the maximum likelihood method yields consistent estimates (see, e.g., [68]). Below we illustrate a particular case.

Let $\varphi:\Re\to\Re_+$ be a measurable function such that $\int\varphi<\infty$. We assume that the unknown density $\rho_\lambda$ is of the form
$$\rho_\lambda(s) = \frac{\varphi(s)}{\int_\lambda^\infty\varphi(s)\,ds}\,\mathbf 1_{[\lambda,\infty)}(s),\quad \lambda\in\Lambda,$$
where $\Lambda$ is a subset of $\Re_+$. It can be proven that $\lambda_t := \min\{\xi_1,\xi_2,\ldots,\xi_t\}$ is the maximum likelihood estimate of $\lambda$ and
$$\lambda_t\to\lambda\quad\text{a.s., as } t\to\infty. \tag{D.5}$$
Let $F_\lambda$ be the distribution function corresponding to the density $\rho_\lambda$, and define
$$\rho_t(s) = \rho_{\lambda_t}(s) = \frac{\varphi(s)}{\int_{\lambda_t}^\infty\varphi(s)\,ds}\,\mathbf 1_{[\lambda_t,\infty)}(s).$$
Observe that
$$|\rho_\lambda(s)-\rho_t(s)| = \begin{cases}\rho_\lambda(s), & \text{if } \lambda < s\le\lambda_t;\\ \rho_t(s)-\rho_\lambda(s), & \text{if } s>\lambda_t.\end{cases}$$
Hence
$$J_t = \int_\lambda^\infty |\rho_\lambda(s)-\rho_t(s)|\,ds = 2\int_\lambda^{\lambda_t}\rho_\lambda(s)\,ds = 2F_\lambda(\lambda_t).$$
Thus, from (D.5), $J_t\to 0$ a.s. as $t\to\infty$, that is, $\rho_t$ is (strongly) consistent. Furthermore,
$$E(J_t) = 2E[F_\lambda(\lambda_t)] = 2\int_\lambda^\infty F_\lambda(s)\,\rho_{\lambda_t}(s)\,ds = 2\int_\lambda^\infty t\,F_\lambda(s)\,(1-F_\lambda(s))^{t-1}\rho_\lambda(s)\,ds = \frac{2}{t+1}. \tag{D.6}$$
In (D.6) we have used the fact that the distribution function of $\lambda_t$ is $F_{\lambda_t}(s) = 1-(1-F_\lambda(s))^t$, which yields the density of $\lambda_t$, namely $\rho_{\lambda_t}(s) = t(1-F_\lambda(s))^{t-1}\rho_\lambda(s)$. Therefore, $E(J_t) = O(t^{-\gamma})$ for $\gamma\in(0,1]$.
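A quick Monte Carlo check of (D.6) is sketched below (not from the text; $\varphi(s)=e^{-s}$ is chosen only for the illustration, so that $\rho_\lambda$ is a shifted exponential density).

```python
import numpy as np

# A minimal Monte Carlo check of E(J_t) = 2/(t+1), assuming the illustrative
# choice phi(s) = exp(-s), so that rho_lambda is a shifted exponential density
# on [lambda, infinity) and F_lambda(s) = 1 - exp(-(s - lambda)) for s >= lambda.
rng = np.random.default_rng(5)
lam = 1.5                                       # true (unknown) parameter
F = lambda s: 1.0 - np.exp(-(s - lam))          # distribution function F_lambda

for t in [5, 20, 100]:
    reps = 20_000
    xi = lam + rng.exponential(size=(reps, t))  # i.i.d. samples from rho_lambda
    lam_t = xi.min(axis=1)                      # MLE: the sample minimum
    J_t = 2.0 * F(lam_t)                        # J_t = 2 F_lambda(lambda_t)
    print(t, J_t.mean(), 2.0 / (t + 1))         # empirical mean vs. 2/(t+1)
```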

References


1. Aliprantis, C.D., Border, K.C.: Infinite Dimensional Analysis. Springer, Berlin (1999)
2. Altman, E., Shwartz, A.: Adaptive control of constrained Markov chains: criteria and policies. Ann. Oper. Res. 28, 101–134 (1991)
3. Ash, R.: Real Analysis and Probability. Academic, New York (1972)
4. Berge, C.: Topological Spaces. Macmillan, New York (1963)
5. Bertsekas, D.P., Shreve, S.E.: Stochastic Optimal Control: The Discrete Time Case. Academic, New York (1978)
6. Billingsley, P., Topsøe, F.: Uniformity in weak convergence. Z. Wahrsch. Verw. Geb. 7, 1–16 (1967)
7. Cavazos-Cadena, R.: Nonparametric adaptive control of discounted stochastic systems with compact state space. J. Optim. Theory Appl. 65, 191–207 (1990)
8. Chang, H.S.: Perfect information two-person zero-sum Markov games with imprecise transition probabilities. Math. Methods Oper. Res. 64, 235–351 (2006)
9. Devroye, L.: A Course in Density Estimation. Birkhäuser, Boston (1987)
10. Devroye, L., Györfi, L.: Nonparametric Density Estimation: The L1 View. Wiley, New York (1985)
11. Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, New York (2001)
12. Dudley, R.M.: The speed of mean Glivenko–Cantelli convergence. Ann. Math. Stat. 40, 40–50 (1969)
13. Dynkin, E.B., Yushkevich, A.A.: Controlled Markov Processes. Springer, New York (1979)
14. Engwerda, J.: LQ Dynamic Optimization and Differential Games. Wiley, Hoboken (2005)
15. Fan, K.: Minimax theorems. Proc. Nat. Acad. Sci. U.S.A. 39, 42–47 (1953)
16. Fernández-Gaucherand, E.: A note on the Ross–Taylor Theorem. Appl. Math. Comput. 64, 207–212 (1994)
17. Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer, New York (1997)
18. Gaenssler, P., Stute, W.: Empirical processes: a survey for i.i.d. random variables. Ann. Probab. 7, 193–243 (1979)
19. Ghosh, M.K., Bagchi, A.: Stochastic games with average payoff criterion. Appl. Math. Optim. 38, 283–301 (1998)
20. Ghosh, M.K., McDonald, D., Sinha, S.: Zero-sum stochastic games with partial information. J. Optim. Theory Appl. 121, 99–118 (2004)
21. González-Trejo, T.J., Hernández-Lerma, O., Hoyos-Reyes, L.F.: Minimax control of discrete-time stochastic systems. SIAM J. Control Optim. 41, 1626–1659 (2003)
22. Gordienko, E.I.: Adaptive strategies for certain classes of controlled Markov processes. Theory Probab. Appl. 29, 504–518 (1985)


23. Gordienko, E.I., Hernández-Lerma, O.: Average cost Markov control processes with weighted norms: existence of canonical policies. Appl. Math. 23, 199–218 (1995)
24. Gordienko, E.I., Hernández-Lerma, O.: Average cost Markov control processes with weighted norms: value iteration. Appl. Math. 23, 219–237 (1995)
25. Gordienko, E.I., Minjárez-Sosa, J.A.: Adaptive control for discrete-time Markov processes with unbounded costs: discounted criterion. Kybernetika 34, 217–234 (1998)
26. Gordienko, E.I., Minjárez-Sosa, J.A.: Adaptive control for discrete-time Markov processes with unbounded costs: average criterion. ZOR – Math. Methods Oper. Res. 48, 37–55 (1998)
27. Hasminskii, R., Ibragimov, I.: On density estimation in the view of Kolmogorov's ideas in approximation theory. Ann. Stat. 18, 999–1010 (1990)
28. Hernández-Lerma, O.: Adaptive Markov Control Processes. Springer, New York (1989)
29. Hernández-Lerma, O., Cavazos-Cadena, R.: Density estimation and adaptive control of Markov processes: average and discounted criteria. Acta Appl. Math. 20, 285–307 (1990)
30. Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York (1996)
31. Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer, New York (1999)
32. Hernández-Lerma, O., Lasserre, J.B.: Zero-sum stochastic games in Borel spaces: average payoff criteria. SIAM J. Control Optim. 39, 1520–1539 (2001)
33. Hilgert, N., Minjárez-Sosa, J.A.: Adaptive policies for time-varying stochastic systems under discounted criterion. Math. Methods Oper. Res. 54, 491–505 (2001)
34. Hilgert, N., Minjárez-Sosa, J.A.: Limiting average cost adaptive control problem for time-varying stochastic systems. Bol. Soc. Mat. Mexicana 9, 197–212 (2003)
35. Hilgert, N., Minjárez-Sosa, J.A.: Adaptive control of stochastic systems with unknown disturbance distribution: discounted criteria. Math. Methods Oper. Res. 63, 443–460 (2006)
36. Himmelberg, C.J., Van Vleck, F.S.: Multifunctions with values in a space of probability measures. J. Math. Anal. Appl. 50, 108–112 (1975)
37. Himmelberg, C.J., Parthasarathy, T., Van Vleck, F.S.: Optimal plans for dynamic programming problems. Math. Oper. Res. 1, 390–394 (1976)
38. Hinderer, K.: Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter. Lecture Notes in Operations Research and Mathematical Systems, vol. 33. Springer, Berlin (1970)
39. Jaśkiewicz, A.: A fixed point approach to solve the average cost optimality equation for semi-Markov decision processes with Feller transition probabilities. Commun. Stat. Theory Methods 36, 2559–2575 (2007)
40. Jaśkiewicz, A.: Zero-sum ergodic semi-Markov games with weakly continuous transition probabilities. J. Optim. Theory Appl. 141, 321–347 (2009)


41. Jaśkiewicz, A.: On a continuous solution to the Bellman–Poisson equation in stochastic games. J. Optim. Theory Appl. 145, 451–458 (2010)
42. Jaśkiewicz, A., Nowak, A.: Zero-sum ergodic stochastic games with Feller transition probabilities. SIAM J. Control Optim. 45, 773–789 (2006)
43. Jaśkiewicz, A., Nowak, A.: Approximation of noncooperative semi-Markov games. J. Optim. Theory Appl. 131, 115–134 (2006)
44. Jaśkiewicz, A., Nowak, A.: On the optimality equation for average cost Markov control processes with Feller transition probabilities. J. Math. Anal. Appl. 316, 495–509 (2006)
45. Köthe, G.: Topological Vector Spaces I. Springer, New York (1969)
46. Krausz, A., Rieder, U.: Markov games with incomplete information. Math. Methods Oper. Res. 46, 263–279 (1997)
47. Küenle, H.U.: On Markov games with average reward criterion and weakly continuous transition probabilities. SIAM J. Control Optim. 46, 2156–2168 (2007)
48. Luque-Vásquez, F.: Zero-sum semi-Markov games in Borel spaces: discounted and average payoff. Bol. Soc. Mat. Mexicana 8, 227–241 (2002)
49. Luque-Vázquez, F., Minjárez-Sosa, J.A.: A note on the σ-compactness of sets of probability measures on metric spaces. Statist. Probab. Lett. 84, 212–214 (2014)
50. Luque-Vázquez, F., Minjárez-Sosa, J.A.: Empirical approximation in Markov games under unbounded payoff: discounted and average criteria. Kybernetika 53, 694–716 (2017)
51. Meyn, S.P., Tweedie, R.L.: Markov Chains and Stochastic Stability. Springer, London (1993)
52. Minjárez-Sosa, J.A.: Nonparametric adaptive control for discrete-time Markov processes with unbounded costs under average criterion. Appl. Math. 26, 267–280 (1999)
53. Minjárez-Sosa, J.A.: Average optimality for adaptive Markov control processes with unbounded costs and unknown disturbance distribution. In: Zhenting, H., Filar, J.A., Chen, A. (eds.) Markov Processes and Controlled Markov Chains, Chap. 7. Kluwer, Dordrecht (2002)
54. Minjárez-Sosa, J.A.: Approximation and estimation in Markov control processes under a discounted criterion. Kybernetika 40, 681–690 (2004)
55. Minjárez-Sosa, J.A.: Empirical estimation in average Markov control processes. Appl. Math. Lett. 21, 459–464 (2008)
56. Minjárez-Sosa, J.A., Luque-Vásquez, F.: Two-person zero-sum semi-Markov games with unknown holding times distribution on one side: discounted payoff criterion. Appl. Math. Optim. 57, 289–305 (2008)
57. Minjárez-Sosa, J.A., Vega-Amaya, O.: Asymptotically optimal strategies for adaptive zero-sum discounted Markov games. SIAM J. Control Optim. 48, 1405–1421 (2009)
58. Minjárez-Sosa, J.A., Vega-Amaya, O.: Optimal strategies for adaptive zero-sum average Markov games. J. Math. Anal. Appl. 402, 44–56 (2013)


59. Najim, K., Poznyak, A.S., Gómez, E.: Adaptive policy for two finite Markov chains zero-sum stochastic game with unknown transition matrices and average payoffs. Automatica 37, 1007–1018 (2001)
60. Neyman, A., Sorin, S.: Stochastic Games and Applications. Kluwer, Dordrecht (2003)
61. Nowak, A.: Measurable selection theorems for minimax stochastic optimization problems. SIAM J. Control Optim. 23, 466–476 (1985)
62. Parthasarathy, K.R.: Probability Measures on Metric Spaces. Academic, New York (1967)
63. Prieto-Rumeau, T., Lorenzo, J.M.: Approximation of zero-sum continuous-time Markov games under the discounted payoff criterion. TOP 23, 799–836 (2015)
64. Ranga Rao, R.: Relations between weak and uniform convergence of measures with applications. Ann. Math. Stat. 33, 659–680 (1962)
65. Shapley, L.S.: Stochastic games. Proc. Nat. Acad. Sci. U.S.A. 39, 1095–1100 (1953)
66. Schäl, M.: Conditions for optimality and for the limit of n-stage optimal policies to be optimal. Z. Wahrsch. Verw. Geb. 32, 179–196 (1975)
67. Schäl, M.: Estimation and control in discounted stochastic dynamic programming. Stochastics 20, 51–71 (1987)
68. Serfling, R.J.: Approximation Theorems of Mathematical Statistics. Wiley, Hoboken (2002)
69. Shimkin, N., Shwartz, A.: Asymptotically efficient adaptive strategies in repeated games. Part I: certainty equivalence strategies. Math. Oper. Res. 20, 743–767 (1995)
70. Shimkin, N., Shwartz, A.: Asymptotically efficient adaptive strategies in repeated games. Part II: asymptotic optimality. Math. Oper. Res. 21, 487–512 (1996)
71. Van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer, New York (1996)
72. Van Nunen, J.A.E.E., Wessels, J.: A note on dynamic programming with unbounded rewards. Manag. Sci. 24, 576–580 (1978)
73. Vega-Amaya, O.: Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation. SIAM J. Control Optim. 42, 1876–1894 (2003)
74. Wand, M.P., Devroye, L.: How easy is a given density to estimate? Comput. Stat. Data Anal. 16, 311–323 (1993)
75. Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman and Hall, London (1995)
76. Wessels, J.: Markov programming by successive approximations with respect to weighted supremum norms. J. Math. Anal. Appl. 58, 326–335 (1977)
77. Yakowitz, S.: Dynamic programming applications in water resources. Water Resour. Res. 18, 673–696 (1982)
78. Yeh, W.W-G.: Reservoir management and operations models: a state-of-the-art review. Water Resour. Res. 21, 1797–1818 (1985)

Index

ε-optimal strategies, 7, 12, 14

A
additive noise systems, 70
asymptotic optimality, 15, 16, 22, 72
autoregressive game models, 3, 71
average criterion, 6, 31, 50
average optimal strategies, 7, 31, 34, 38, 61

B
Berge Maximum Theorem, 11, 99
Borel space, 96

C
continuity conditions, 10, 32, 48, 70
contraction property, 13, 50

D
density estimation, 19, 37, 109
difference-equation games, 2, 6, 16, 34, 48, 69
discounted criterion, 6, 9, 50
discounted optimal strategies, 7, 9, 54
discrepancy function, 15, 26

E
empirical distribution, 51, 53, 103
empirical game, 51, 58
equi-Lipschitz condition, 58, 81
equicontinuity condition, 53, 81
ergodicity conditions, 32, 48
estimation and control, 2, 23, 36

F
Fan minimax theorem, 96

G
game model, 1
game state process, 5
Glivenko-Cantelli class, 53, 104
Glivenko-Cantelli Theorem, 103
growth conditions, 12, 15, 17, 32, 48

K
kernel estimates, 111

L
linear-quadratic games, 2, 72

M
Markov strategies, 4
measurable selection theorems, 11, 98
measurable selector, 98
minimax conditions, 10
minimax theorem, 11, 96
multifunction, 97

O
optimal strategies, 7

P
parametric estimation, 114
projection estimates, 22, 37, 113
pure strategies, 4

R
reservoir models, 74


S
saddle point, 7, 31
Scheffé Theorem, 110
Shapley equation, 10, 17
stochastic kernel, 105
storage game models, 78
strategies, 4

U
uniformity class, 53, 102

V
vanishing discount factor approach, 33, 59

W
weak convergence, 53, 101
weighted norm, 13, 97