
Data Assimilation

Fundamentals of Algorithms
Editor-in-Chief: Nicholas J. Higham, University of Manchester

The SIAM series on Fundamentals of Algorithms is a collection of short user-oriented books on state-of-the-art numerical methods. Written by experts, the books provide readers with sufficient knowledge to choose an appropriate method for an application and to understand the method's strengths and limitations. The books cover a range of topics drawn from numerical analysis and scientific computing. The intended audiences are researchers and practitioners using the methods and upper-level undergraduates in mathematics, engineering, and computational science. Books in this series not only provide the mathematical background for a method or class of methods used in solving a specific problem but also explain how the method can be developed into an algorithm and translated into software. The books describe the range of applicability of a method and give guidance on troubleshooting solvers and interpreting results. The theory is presented at a level accessible to the practitioner. MATLAB® software is the preferred language for codes presented since it can be used across a wide variety of platforms and is an excellent environment for prototyping, testing, and problem solving. The series is intended to provide guides to numerical algorithms that are readily accessible, contain practical advice not easily found elsewhere, and include understandable codes that implement the algorithms.

Editorial Board

Paul Constantine Colorado School of Mines

Ilse Ipsen North Carolina State University

Sven Leyffer Argonne National Laboratory

Timothy A. Davis Texas A&M University

C. T. Kelley North Carolina State University

Catherine Powell University of Manchester

David F. Gleich Purdue University

Hans Petter Langtangen Simula Research Laboratory

Eldad Haber University of British Columbia

Randall J. LeVeque University of Washington

Series Volumes
Asch, M., Bocquet, M., and Nodet, M., Data Assimilation: Methods, Algorithms, and Applications
Birgin, E. G., and Martínez, J. M., Practical Augmented Lagrangian Methods for Constrained Optimization
Bini, D. A., Iannazzo, B., and Meini, B., Numerical Solution of Algebraic Riccati Equations
Escalante, R. and Raydan, M., Alternating Projection Methods
Hansen, P. C., Discrete Inverse Problems: Insight and Algorithms
Modersitzki, J., FAIR: Flexible Algorithms for Image Registration
Chan, R. H.-F. and Jin, X.-Q., An Introduction to Iterative Toeplitz Solvers
Eldén, L., Matrix Methods in Data Mining and Pattern Recognition
Hansen, P. C., Nagy, J. G., and O'Leary, D. P., Deblurring Images: Matrices, Spectra, and Filtering
Davis, T. A., Direct Methods for Sparse Linear Systems
Kelley, C. T., Solving Nonlinear Equations with Newton's Method

Mark Asch

Université de Picardie Jules Verne Amiens, France

Marc Bocquet

École des Ponts ParisTech/CEREA Marne-la-Vallée, France

Maëlle Nodet

Université Grenoble Alpes Grenoble, France

Data Assimilation Methods, Algorithms, and Applications

Society for Industrial and Applied Mathematics Philadelphia

Copyright © 2016 by the Society for Industrial and Applied Mathematics.
10 9 8 7 6 5 4 3 2 1
All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 Market Street, 6th Floor, Philadelphia, PA 19104-2688 USA.
No warranties, express or implied, are made by the publisher, authors, and their employers that the programs contained in this volume are free of error. They should not be relied on as the sole basis to solve a problem whose incorrect solution could result in injury to person or property. If the programs are employed in such a manner, it is at the user's own risk and the publisher, authors, and their employers disclaim all liability for such misuse.
Trademarked names may be used in this book without the inclusion of a trademark symbol. These names are used in an editorial context only; no infringement of trademark is intended.
MATLAB is a registered trademark of The MathWorks, Inc. For MATLAB product information, please contact The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA 01760-2098 USA, 508-647-7000, Fax: 508-647-7001, [email protected], www.mathworks.com.

Publisher: David Marshall
Acquisitions Editor: Elizabeth Greenspan
Developmental Editor: Gina Rinelli Harris
Managing Editor: Kelly Thomas
Production Editor: Lisa Briggeman
Copy Editor: Julia Cochrane
Production Manager: Donna Witzleben
Production Coordinator: Cally Shrader
Compositor: Cheryl Hufnagle
Graphic Designer: Lois Sellers

Library of Congress Cataloging-in-Publication Data
Names: Asch, Mark. | Bocquet, Marc. | Nodet, Maëlle.
Title: Data assimilation : methods, algorithms, and applications / Mark Asch, Université de Picardie Jules Verne, Amiens, France, Marc Bocquet, École des Ponts ParisTech/CEREA, Marne-la-Vallée, France, Maëlle Nodet, Université Grenoble Alpes, Grenoble, France.
Description: Philadelphia : Society for Industrial and Applied Mathematics, [2016] | Series: Fundamentals of algorithms ; 11 | Includes bibliographical references and index.
Identifiers: LCCN 2016035022 (print) | LCCN 2016039694 (ebook) | ISBN 9781611974539 (print) | ISBN 9781611974546 (ebook)
Subjects: LCSH: Inverse problems (Differential equations) | Numerical analysis. | Algorithms.
Classification: LCC QA378.5 .A83 2016 (print) | LCC QA378.5 (ebook) | DDC 515/.357--dc23
LC record available at https://lccn.loc.gov/2016035022

SIAM is a registered trademark.

Contents

List of figures  ix
List of algorithms  xi
Notation  xiii
Preface  xv

I  Basic methods and algorithms for data assimilation  1

1  Introduction to data assimilation and inverse problems  3
  1.1  Introduction  3
  1.2  Uncertainty quantification and related concepts  4
  1.3  Basic concepts for inverse problems: Well- and ill-posedness  6
  1.4  Examples of direct and inverse problems  7
  1.5  DA methods  11
  1.6  Some practical aspects of DA and inverse problems  22
  1.7  To go further: Additional comments and references  23

2  Optimal control and variational data assimilation  25
  2.1  Introduction  25
  2.2  The calculus of variations  26
  2.3  Adjoint methods  33
  2.4  Variational DA  50
  2.5  Numerical examples  67

3  Statistical estimation and sequential data assimilation  71
  3.1  Introduction  71
  3.2  Statistical estimation theory  75
  3.3  Examples of Bayesian estimation  83
  3.4  Sequential DA and Kalman filters  90
  3.5  Implementation of the KF  96
  3.6  Nonlinearities and extensions of the KF  99
  3.7  Particle filters for geophysical applications  100
  3.8  Examples  103

II  Advanced methods and algorithms for data assimilation  119

4  Nudging methods  121
  4.1  Nudging  122
  4.2  Back-and-forth nudging  126

5  Reduced methods  133
  5.1  Overview of reduction methods  133
  5.2  Model reduction  139
  5.3  Filtering algorithm reduction  142
  5.4  Reduced methods for variational assimilation  146

6  The ensemble Kalman filter  153
  6.1  The reduced-rank square root filter  154
  6.2  The EnKF: Principle and classification  156
  6.3  The stochastic EnKF  157
  6.4  The deterministic EnKF  162
  6.5  Localization and inflation  167
  6.6  Numerical illustrations with the Lorenz-95 model  172
  6.7  Other important flavors of the EnKF  174
  6.8  The ensemble Kalman smoother  189
  6.9  A widespread and popular DA method  193

7  Ensemble variational methods  195
  7.1  The hybrid methods  197
  7.2  EDA  202
  7.3  4DEnVar  203
  7.4  The IEnKS  207

III  Applications and case studies  217

8  Applications in environmental sciences  219
  8.1  Physical oceanography  219
  8.2  Glaciology  221
  8.3  Fluid–biology coupling; marine biology  226
  8.4  Land surface modeling and agroecology  229
  8.5  Natural hazards  231

9  Applications in atmospheric sciences  237
  9.1  Numerical weather prediction  237
  9.2  Atmospheric constituents  240

10  Applications in geosciences  245
  10.1  Seismology and exploration geophysics  245
  10.2  Geomagnetism  248
  10.3  Geodynamics  248

11  Applications in medicine, biology, chemistry, and physical sciences  251
  11.1  Medicine  251
  11.2  Systems biology  253
  11.3  Fluid dynamics  254
  11.4  Imaging and acoustics  257
  11.5  Mechanics  259
  11.6  Chemistry and chemical processes  261

12  Applications in human and social sciences  263
  12.1  Economics and finance  263
  12.2  Traffic control  264
  12.3  Urban planning  265

Bibliography  267
Index  303

List of figures

1  The big picture for DA methods and algorithms  xvii
1.1  Ingredients of an inverse problem  4
1.2  The deductive spiral of system science  5
1.3  UQ for a random quantity  5
1.4  Duffing's equation with small initial perturbations  10
1.5  DA methods  12
1.6  Sequential assimilation  16
1.7  Sequential assimilation scheme for the KF  17
2.1  A variety of local extrema  26
2.2  Counterexamples for local extrema in R^2  27
2.3  Curve η(x) and admissible functions y + εη(x)  29
2.4  3D- and 4D-Var  60
2.5  Simulation of the chaotic Lorenz-63 system of three equations  67
2.6  Assimilation of the Lorenz-63 equations by standard 4D-Var  69
2.7  Assimilation of the Lorenz-63 equations by incremental 4D-Var  69
3.1  Scalar Gaussian distribution example of Bayes' law  85
3.2  Scalar Gaussian distribution example of Bayes' law  86
3.3  A Gaussian product example for forecasting temperature  87
3.4  Bayesian estimation of noisy pendulum parameter  88
3.5  Sequential assimilation trajectory  91
3.6  Sequential assimilation scheme for the KF  92
3.7  KF loop  96
3.8  Analysis of the particle filter  101
3.9  Particle filter applied to Lorenz model  102
3.10  Estimating a constant by a KF: R = 0.01  109
3.11  Estimating a constant by a KF: R = 1  110
3.12  Estimating a constant by a KF: R = 0.0001  111
3.13  Estimating a constant by a KF: convergence  112
3.14  Position estimation for constant-velocity dynamics  115
3.15  Position estimation errors for constant-velocity dynamics  116
3.16  Velocity estimation results for constant-velocity dynamics  116
3.17  Extrapolation of position for constant-velocity dynamics  117
3.18  Convergence of the KF  117
4.1  Schematic representation of the nudging method  121
4.2  Illustration of various nudging methods  127
4.3  Schematic representation of the BFN method  128
5.1  Example of dimension reduction  134
5.2  Incremental 4D-Var with reduced models  149
5.3  Hybridization of the reduced 4D-Var and SEEK filter/smoother algorithms  151
6.1  Synthetic DA experiments with the anharmonic oscillator  162
6.2  Schematic representation of the local update for EnKF  168
6.3  Plot of the Gaspari–Cohn fifth-order piecewise rational function  169
6.4  Covariance localization  170
6.5  Trajectory of a state of the Lorenz-95 model  173
6.6  Average analysis RMSE for a deterministic EnKF (ETKF)—localization and inflation  175
6.7  Average analysis RMSE of a deterministic EnKF (ETKF)—nonlinear observation  181
6.8  Average analysis RMSE for a deterministic EnKF (ETKF)—optimal inflation  188
6.9  Average analysis RMSE for a deterministic EnKF (ETKF)—ensemble size  188
6.10  Schematic of the EnKS  191
6.11  Analysis RMSE of the EnKS  193
7.1  Synthetic DA experiments with the Lorenz-95 model  196
7.2  Cycling of the SDA IEnKS  210
7.3  Synthetic DA experiments with the Lorenz-95 model with IEnKS  212
7.4  Chaining of the MDA IEnKS cycles  214
7.5  Synthetic DA experiments with the Lorenz-95 model—comparison of localization strategies  216
8.1  Illustration of DA in operational oceanography  222
8.2  Illustration of DA for sea level rise and glaciology  225
8.3  Illustration of DA in fish population ecology  228
8.4  Illustration of DA in agronomy and crop modeling  232
8.5  Illustration of DA for wildfire modeling and forecasting  235
9.1  Anomaly correlation coefficient of the 500 hPa height forecasts for the extratropical northern hemisphere and southern hemisphere  239
9.2  Typical error growth following the empirical model (9.1)  239
9.3  Cesium-137 radioactive plume at ground level (activity concentrations in becquerel per cubic meter) emitted from the FDNPP in March 2011  242
9.4  Cesium-137 source term as inferred by inverse modeling  243
9.5  Deposited cesium-137 (in kilobecquerel per square meter) measured (a) and hindcast (b) near the FDNPP  243
11.1  Assimilation of medical data for the cardiovascular system  252
11.2  Design cycle for aerodynamic shape optimization  256
11.3  Physical setup for a geoacoustics inverse problem  258
11.4  Kirchhoff imaging algorithm results for source localization  260
11.5  A simple mechanical system  261

List of algorithms

1.1  Iterative 3D-Var (in its simplest form)  20
1.2  4D-Var in its basic form  21
2.1  Iterative 3D-Var algorithm  58
2.2  4D-Var  61
4.1  BFN algorithm  128
5.1  SEEK filter equations  144
5.2  Incremental 4D-Var  148
6.1  Algorithm of the EKF  154
6.2  Algorithm for RRSQRT  156
6.3  Algorithm for the (stochastic) EnKF  160
6.4  Pseudocode for a complete cycle of the ETKF  166
6.5  Pseudocode for a complete cycle of the MLEF, as a variant in ensemble subspace  180
6.6  Pseudocode for a complete cycle of the EnKS in ensemble subspace  192
7.1  A cycle of the lag-L/shift-S/SDA/bundle/Gauss-Newton IEnKS  211
7.2  A cycle of the lag-L/shift-S/MDA/bundle/Gauss-Newton IEnKS  214

Notation

R^n  state space
R^p  observation space
R^m  ensemble space, i = 1, ..., m
t_k  time, k = 1, ..., K
I  identity matrix: I_n, I_m, I_p
x  vector
x^t  true state vector
x^a  analysis vector
x^b  background vector
x^f  forecast vector
y^o  observation vector
ε^a  analysis error
ε^b  background error
ε^f  forecast error
ε^o  observation error
ε^q  model error
M_k  linear model operator: x_{k+1} = M_{k+1} x_k, with M_{k+1} = M_{k+1:k} the model from time step k to time step k+1; ℳ  nonlinear model operator
X^a  analysis perturbation matrix
X^f  forecast perturbation matrix
P^f  forecast error covariance matrix
P^a  analysis error covariance matrix
K  Kalman gain matrix
B  background error covariance matrix
H  linearized observation operator; ℋ  nonlinear observation operator
Q  model error covariance matrix
R  observation error covariance matrix
d  innovation vector
(j)  iteration index of a variational assimilation (in parentheses)
w  coefficients in ensemble space (ensemble transform)

Preface

This book places data assimilation (DA) into the broader context of inverse problems and the theory, methods, and algorithms that are used for their solution. It strives to provide a framework and new insight into the inverse problem nature of DA—the book emphasizes "why" and not just "how." We cover both statistical and variational approaches to DA (see Figure 1) and give an important place to the latest hybrid methods that combine the two. Since the methods and diagnostics are emphasized, readers will readily be able to apply them to their own, precise field of study. This will be greatly facilitated by numerous examples and diverse applications. The applications are taken from the following fields: geophysics and geophysical flows, environmental acoustics, medical imaging, mechanical and biomedical engineering, urban planning, economics, and finance.

In fact, this book is about building bridges—bridges between inverse problems and DA, bridges between variational and statistical approaches, bridges between statistics and inverse problems. These bridges will enable you to cross valleys and moats, thus avoiding the dangers that are most likely lurking down there. These bridges will allow you to retrieve different approaches and a better understanding of the vast, and sometimes insular, domains of DA and inverse problems, stochastic and deterministic approaches, and direct and inverse problems. We claim that by assembling these, by reconciling these, we will be better armed to confront and tackle the grand societal challenges of today, broadly defined as "global change" issues—such as climate change, disaster prediction and mitigation, and nondestructive and noninvasive testing and imaging.

The aim of the book is thus to provide a comprehensive guide for advanced undergraduate and early graduate students and for practicing researchers and engineers engaged in (partial) differential equation–based DA, inverse problems, optimization, and optimal control—we will emphasize the close relationships among all of these. The reader will be presented with a statistical approach and a variational approach and will find pointers to all the numerical methods needed for either. Of course, the applications will furnish many case studies.

The book favours a continuous (infinite-dimensional) approach to the underlying inverse problems, and we do not make the distinction between continuous and discrete problems—every continuous problem, after discretization, yields a discrete (finite-dimensional) problem. Moreover, continuous problems admit a far richer and more extensive mathematical theory, and though DA (via the Kalman filter (KF)) is in fine a discrete approach, the variational analysis will be performed on the continuous model. Discrete (finite-dimensional) inverse problems are very well presented in a number of excellent books, such as those of Lewis et al. [2006], Vogel [2002], and Hansen [2010], the last of which has a strong emphasis on regularization methods.


Some advanced calculus and tools from linear algebra, real analysis, and numerical analysis are required in the presentation. We introduce and use Hadamard's well-posedness theory to explain and understand both why things work and why they go wrong. Throughout the book, we observe a maximum of mathematical rigor but with a minimum of formalism. This rigor is extremely important in practice, since it enables us to eliminate possible sources of error in the algorithmic and numerical implementations.

In summary, this is really a PDE-based book on inverse and DA modeling—readers interested in the specific application to meteorology or oceanography should additionally consult other sources, such as Lewis et al. [2006] and Evensen [2009]. Those who require a more mathematical approach to inverse problems are referred to Kirsch [1996] and Kaipio and Somersalo [2005], and for DA to the recent monographs of Law et al. [2015] and Reich and Cotter [2015].

Proposed pathways through the book are as follows (this depends on the level of the reader):

• The "debutant" reader is encouraged to study the first chapter in depth, since it will provide a basic understanding and the means to choose the most appropriate approach (variational or statistical).
• The experienced reader can jump directly to Chapter 2 or Chapter 3 according to the chosen or best-adapted approach.
• All readers are encouraged to initially skim through the examples and applications sections of Part III to be sure of the best match to their type of problem (by seeing what kind of problem is the closest to their own)—these can then be returned to later, after having mastered the basic methods and algorithms of Part I or eventually the advanced ones of Part II.
• For the most recent approaches, the reader or practitioner is referred to Part II and in particular to Chapters 4 and 7.

The authors would like to acknowledge their colleagues and students who accompanied, motivated, and inspired this book. MB thanks Alberto Carrassi, Jean-Matthieu Haussaire, Anthony Fillion, Victor Winiarek, Alban Farchi, and Sammy Metref. MN thanks Elise Arnaud, Arthur Vidard, Eric Blayo, and Claire Lauvernet. MA thanks in particular the CIMPA (Centre International de Mathématiques Pures et Appliquées, Nice, France) and the Universidad Simon Bolivar in Caracas, Venezuela (where the idea for this book was born), for their hospitality. We thank the CIRM (Centre International de Rencontres Mathématiques, Marseille, France) for allowing us to spend two intensive weeks finalizing (in optimal conditions) the manuscript.


[Figure 1. The big picture for DA methods and algorithms: the variational branch (3D-Var, 4D-Var, optimal control, nudging) and the statistical branch (optimal interpolation, the Kalman filter and its nonlinear extensions, ensemble methods), linked by hybrid approaches.]

Part I

Basic methods and algorithms for data assimilation

After an introduction that sets up the general theoretical framework of the book and provides several simple (but important) examples, the two approaches for the solution of DA problems are presented: classical (variational) assimilation and statistical (sequential) assimilation. We begin with the variational approach. Here we take an optimal control viewpoint based on classical variational calculus and show its impressive power and generality. A sequence of carefully detailed inverse problems, ranging from an ODE-based to a nonlinear PDE-based case, are explained. This prepares the ground for the two major variational DA algorithms, 3D-Var and 4D-Var, that are currently used in most large-scale forecasting systems. For statistical DA, we employ a Bayesian approach, starting with optimal statistical estimation and showing how the standard KF is derived. This lays the foundation for various extensions, notably the ensemble Kalman filter (EnKF).

Chapter 1

Introduction to data assimilation and inverse problems

1.1 Introduction

What exactly is DA? The simplest view is that it is an approach/method for combining observations with model output with the objective of improving the latter. But do we really need DA? Why not just use the observations and average them or extrapolate them (as is done with regression techniques [McPherson, 2001], or just long-term averaging)? The answer is that we want to predict the state of a system, or its future, in the best possible way! For that we need to rely on models. But when models are not corrected periodically by reality, they can be of little value. Thus, we need to fit the model state in an optimal way to the observations, before an analysis or prediction is made.

This fitting of a model to observations is a special case (but highly typical) of an inverse problem. According to J. B. Keller [1966], two problems are inverse to each other if "the formulation of each involves all or part of the solution of the other." One of the two is named the direct problem, whereas the other is the inverse problem. The direct problem is usually the one that we can solve satisfactorily/easily. There is a back-and-forth transmission of information between the two. This is depicted in Figure 1.1, which represents a typical case: we replace the unknown (or partially known) medium by a model that depends on some unknown model parameters, m. The inverse problem involves reversing the arrows—by comparing the simulations and the observations (at the array) to find the model parameters. In fact, the direct problem involves going from cause to effect, whereas the inverse problem attempts to go from the effects to the cause.

The comparison between model output and observations is performed by some form of optimization—recall that we seek an optimal match between simulations of the model and measurements taken of the system that we are trying to elucidate. This optimization takes two forms: classical and statistical. Let us explain. Classical optimization involves minimization of a positive, usually quadratic cost function that expresses the quantity that we seek to optimize. In most of the cases that we will deal with, this will be a function of the error between model and measurements—in this case we will speak of least-squares error minimization. The second form, statistical optimization, involves minimization of the variability or uncertainty of the model error and is based on statistical estimation theory.
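To make the classical form concrete, here is a minimal least-squares fit in MATLAB. This sketch and its numbers are our own illustration, not taken from the book; the linear model and the noise level are arbitrary assumptions.

% Least-squares error minimization in its simplest, linear form:
% fit the model m(t) = a + b*t to noisy measurements y by minimizing
% the quadratic cost sum_i (m(t_i) - y_i)^2.
t = (0:0.1:1)';
y = 2 + 3*t + 0.1*randn(size(t));   % synthetic "measurements"
A = [ones(size(t)), t];             % design matrix
p = A\y;                            % backslash solves the least-squares problem
fprintf('estimated a = %.3f, b = %.3f\n', p(1), p(2));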


[Figure 1.1. Ingredients of an inverse problem: the physical reality with a source s(x, t), an unknown medium, and a receiver array (top), and the direct mathematical model ℳ(u; m) = s with simulated observations u(x_r, t), r = 1, ..., N_r (bottom). The inverse problem uses the difference between the model-predicted observations, u (calculated at the receiver array points, x_r), and the real observations measured on the array to find the unknown model parameters, m, or the source, s (or both).]

The main sources of inverse problems are science (social sciences included!) and engineering—in fact any process that we can model and measure satisfactorily. Often these problems concern the determination of the properties of some inaccessible region from observations on the boundary of the region, or at discrete instants over a given time interval. In other words, our information is incomplete. This incompleteness is the source of the major difficulties (and challenges...) that we will encounter in the solution of DA and inverse problems.

1.2 Uncertainty quantification and related concepts

Definition 1.1. Uncertainty quantification (UQ) is the science of quantitative characterization and reduction of uncertainties in both computational and real-world applications. It tries to determine how likely certain outcomes are if some aspects of the system are not exactly known.

The system-science paradigm, as expounded in Jordan [2015], exhibits the important place occupied by DA and inverse methods within the "deductive spiral"—see Figure 1.2. These methods furnish an essential link between the real world and the model of the system. They are intimately related to the concepts of validation and verification. Verification asks the question, "are we solving the equations correctly?"—this is an exercise in mathematics. Validation asks, "are we solving the correct equations?"—this is an exercise in physics. In geophysics, for example, the concept of validation is replaced by evaluation, since complete validation is not possible.

UQ is a basic component of model validation. In fact it is vital for characterizing our confidence in results coming out of modeling and simulation, and it provides a mathematically rigorous certification that is often needed in decision-making. It gives a precise notion of what constitutes a validated model by replacing the subjective concept of confidence with mathematically rigorous methods and measures.

There are two major categories of uncertainties: epistemic and aleatory. The first is considered to be reducible in that we can control it by improving our knowledge of the system.

[Figure 1.2. The deductive spiral of system science (adapted from Jordan [2015]), connecting the model space and the real-world space. The bottom half represents the direct problem (from model to reality); the top half represents the inverse problem (from reality to model). Starting from the center, with (i) model formulation and (ii) model simulation, one works one's way around iteratively over (iii) forecasting and (iv) DA, while passing through the two phases of UQ (validation/evaluation and verification).]

[Figure 1.3. UQ for a random quantity y passed through a model: uncertainty propagation (left to right); uncertainty definition (right to left).]

The second is assumed to be irreducible and has to do with the inherent noise in, or stochastic nature of, any natural system. Any computation performed under uncertainty will forcibly result in predictive simulations (see the introduction to Chapter 3 for more details on this point). Uncertainty, in models of physical systems, is almost always represented as a probability density function (PDF) through samples, parameters, or kernels.


The central objective of UQ is then to represent, propagate, and estimate this density—see Figure 1.3. As a process, UQ can be decomposed into the following steps:

1. Define the system of interest, its response, and the desired performance measures.
2. Write a mathematical formulation of the system—governing equations, geometry, parameter values.
3. Formulate a discretized representation and the numerical methods and algorithms for its solution.
4. Perform the simulations and the analysis.
5. Loop back to step 1.

The numerical simulations themselves can be decomposed into three steps:

1. DA, whose objective is to compute the PDFs of the input quantities of interest. This is the major concern of this book and the methods described herein.
2. Uncertainty propagation, whose objective is to compute the PDFs of the output quantities of interest. This is usually the most complex and computationally intensive step and is generally based on Monte Carlo and stochastic Galerkin (finite element) methods—see Le Maitre and Knio [2010]. A minimal sketch of this step is given just after this list.
3. Certification, whose objective is to estimate the likelihood of specific outcomes and compare them with risk or operating margins.

For a complete, recent mathematical overview of UQ, the reader is referred to Owhadi et al. [2013]. There are a number of research groups dedicated to the subject—please consult the websites of UQ groups at Stanford University, MIT, and ETH Zurich (for example).
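The following MATLAB fragment illustrates the uncertainty propagation step in its simplest, Monte Carlo form. This sketch is ours, not the book's; the scalar model g and the Gaussian input distribution are arbitrary assumptions chosen only for illustration.

% Monte Carlo propagation of an input PDF through a model (illustrative sketch).
g     = @(theta) sin(theta) + 0.5*theta.^2;   % placeholder model
N     = 1e5;                                  % number of samples
theta = 1.0 + 0.2*randn(N,1);                 % input PDF: N(1, 0.2^2)
y     = g(theta);                             % propagate each sample
fprintf('output mean %.4f, std %.4f\n', mean(y), std(y));
histogram(y, 100);                            % empirical output PDF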

1.3 Basic concepts for inverse problems: Well- and ill-posedness

There is a fundamental, mathematical distinction between the direct and the inverse problem: direct problems are (invariably) well-posed, whereas inverse problems are (notoriously) ill-posed. Hadamard [1923] defined the concept of a well-posed problem as opposed to an ill-posed one.

Definition 1.2. A mathematical model for a physical problem is well-posed if it possesses the following three properties:
WP1 Existence of a solution.
WP2 Uniqueness of the solution.
WP3 Continuous dependence of the solution on the data.

            

1.4. Examples of direct and inverse problems

7

Note that existence and uniqueness together are also known as “identifiability,” and the continuous dependence is related to the “stability” of the inverse problem. A more rigorous mathematical formulation is the following (see Kirsch [1996]). Definition 1.3. Let X and Y be two normed spaces, and let K : X → Y be a linear or nonlinear map between the two. The problem of finding x given y such that Kx = y is well-posed if the following three properties hold: WP1 Existence—for every y ∈ Y there is (at least) one solution x ∈ X such that K x = y. WP2 Uniqueness—for every y ∈ Y there is at most one x ∈ X such that K x = y.

WP3 Stability—the solution, x, depends continuously on the data, y, in that for every sequence {xn } ⊂ X with K xn → K x as n → ∞, we have that xn → x as n → ∞. This concept of ill-posedness will be the “red thread” running through the entire book. It will help us to understand and distinguish between direct and inverse models. It will provide us with basic comprehension of the methods and algorithms that will be used to solve inverse problems. Finally, it will assist us in the analysis of what went wrong in our attempt to solve the inverse problems.

1.4 Examples of direct and inverse problems Take a parameter-dependent dynamical system, dz = g (t , z; θ), dt

z(t0 ) = z0 ,

with g known, z0 an initial state, θ ∈ Θ (a space, or set, of possible parameter values), and the state z(t ) ∈ n . We can now define the two classes of problems.

Direct Given parameters θ and initial state z0 , find z(t ) for t ≥ t0 . Inverse Given observations z(t ) for t ≥ t0 , find θ ∈ Θ.

Since the observations are incompletely known (over space-time), they must be modeled by an observation equation, f (t , θ) =  z(t , θ), where  is the observation operator (which could, in ideal situations, be the identity). Usually we have a finite number, p, of discrete (space-time) observations  p y˜j , j =1

where

y˜j ≈ f (t j , θ)

and the approximately equal sign denotes the possibility of measurement errors. We now present a series of simple examples that clearly illustrate the three properties of well-posedness.

8

Chapter 1. Introduction to data assimilation and inverse problems

Example 1.4. (Simplest case—inspired by an oral presentation of H.T. Banks). Suppose that we have one observation, y˜, for f (θ) and we want to find the pre-image θ∗ = f −1 (˜ y) for a given y˜:

θ∗

f



f −1

Θ



This problem can be severely ill-posed! Consider the following function:

y3 y2 y1

θ˜1 θ˜2

f (θ) = 1 − θ2

θ2 θ1

θ

Nonexistence There is no θ3 such that f (θ3 ) = y3 . Nonuniqueness y j = f (θ j ) = f (θj ) for j = 1, 2.      Lack of continuity |y1 − y2 | small   f −1 (y1 ) − f −1 (y2 ) = θ1 − θ˜2  small.

Note that all three well-posedness properties, WP1, WP2, and WP3, are violated by this very basic case. Why is this so important? Couldn’t we just apply a good leastsquares algorithm (for example) to find the best possible solution? Let’s try this. We define a mismatch-type cost function, J (θ) = |y1 − f (θ)|2 ,

1.4. Examples of direct and inverse problems

9

for a given y1 and try to minimize this square error by applying a standard iterative scheme, such as direct search or gradient-based minimization [Quarteroni et al., 2007], to obtain a solution. For example, if we apply Newton’s method, we obtain the following iteration:  −1 θ k+1 = θ k − J (θ k ) J (θ k ), where

 J (θ) = 2 (y1 − f (θ)) − f (θ) .

Let us graphically perform a few iterations on the above function:

y1

θ (−)

(+)

• J (θ0 ) = 2(y1 − f (θ0 ))(− f (θ0 )) < 0 (−)



θ1 > θ0 , etc.,



θ˜1 < θ˜0 , etc.,

(−)

• J (θ˜0 ) = 2(y1 − f (θ˜0 ))(− f (θ˜0 )) > 0

where in the last two formulas for J we have indicated the sign (+/−) above each of the terms. We observe that in this simple case we have a highly unstable, oscillating behavior: at each step we move from positive to negative increments due to the changing sign of the gradient, and convergence is not possible. So what went wrong here? This behavior is not the fault of descent algorithms. It is a manifestation of the inherent ill-posedness of the problem. How to fix this problem has been the subject of much research over the past 50 years! Many remedies (fortunately) exist, such as explicit and implicit constrained optimizations, regularization, and penalization—these will be referred to, when necessary, in what follows. To further appreciate the complexity, let us briefly consider one of the remedies: Tykhonov regularization (TR)—see, for example, Engl et al. [1996] and Vogel [2002]. The idea here is to replace the ill-posed problem for J (θ) = |y1 − f (θ)|2 by a “nearby” problem for Jβ (θ) = |y1 − f (θ)|2 + β |θ − θ0 |2 , where β is a suitably chosen regularization/penalization parameter. When it is done correctly, TR provides convexity and compactness,6 thus ensuring the existence of a 6 Convexity means that the function resembles a quadratic function, f (x) = x 2 , with positive second derivative; compactness means that any infinite sequence of functions must get arbitrarily close to some function of the space.

Chapter 1. Introduction to data assimilation and inverse problems

4

4

3

3

2

2

1

1 x(t)

x(t)

10

0

0

-1

-1

-2

-2

-3 -4

-3 0

5

10

15

20

25 t

30

35

(a) Initial error of 0.03%

40

45

50

-4

0

5

10

15

20

25 t

30

35

40

45

50

(b) Initial error of 0.06%

Figure 1.4. Duffing’s equation with small initial perturbations. Unperturbed (solid line) and perturbed (dashed line) trajectories.

unique solution. However, even when done correctly, it modifies the problem, and new solutions may be far from the original ones. In addition, it is not trivial to regularize correctly or even to know if we have succeeded in finding a solution. Example 1.5. The highly nonlinear Duffing’s equation [Guckenheimer and Holmes, 1983], x¨ + 0.05˙ x + x 3 = 7.5 cos t , exhibits great sensitivity to the initial conditions (WP3). We will observe that two very closely spaced initial states can lead to a large discrepancy in the trajectories. • Let x(0) = 3 and x˙(0) = 4 be the true initial state. • Introduce an error of 0.03% in the initial state—here we have an accurate forecast until t = 35 (see Figure 1.4(a)). • Introduce an error of 0.06% in the initial state—here we only have an accurate forecast until t = 20 (see Figure 1.4(b)). The initial perturbations are scarcely visible (and could result from measurement error), but the terminal states can differ considerably. Example 1.6. Seismic travel-time tomography provides an excellent example of nonuniqueness (WP2):

A signal seismic ray (or any other ray used in medical or other imaging) passes through a two-parameter block model.

1.5. DA methods

11

• The unknowns are the two block slownesses (inverse of seismic velocity), (Δs1 , Δs2 ) . • The data consist of the observed travel time of the ray, Δt1 . • The model is the linearized travel time equation Δt1 = l1 Δs1 + l2 Δs2 , where l j is the length of the ray in the j th block. Clearly we have one equation for two unknowns, and hence there is no unique solution. In fact, for a given value of Δt1 , each time we fix Δs1 we obtain a different Δs2 (and vice versa). We hope the reader is convinced, based on these disarmingly simple examples, that inverse problems present a large number of potential pathologies. We can now proceed to examine the therapeutic tools that are at our disposal for attempting to “heal the patient.”


8 ibid.

12

Chapter 1. Introduction to data assimilation and inverse problems

EDA 4DEnVar 4D-Var

3D-Var

Hybrid Nudging

IEnKS

Variational

Data Assimilation Reduced

Incremental Hybrid Reduced

Statistical

Optimal Interpolation

Ensemble Extended

KF Extensions

Kalman Filter

Figure 1.5. DA methods: variational (Chapters 2, 4, 5); statistical (Chapters 3, 5, 6), hybrid (Chapter 7).

will provide all the necessary tools (theoretical, algorithmic, numerical) and the indications for their implementation. It is worthwhile to point out that recently (since 2010) a number of the major weather-forecasting services across the world (Canada, France, United Kingdom, etc.) have started basing their operational forecasting systems on a new, hybrid approach, 4DEnVar (presented in Chapter 7), which combines 4D-Var (variational DA) with an ensemble, statistical approach. To complete this introductory chapter, we will now briefly introduce and compare the two approaches of variational and statistical. Each one is subsequently treated, in far greater detail, in its own chapter—see Chapters 2 and 3, respectively.

1.5.1 Notation for DA and inverse problems We begin by introducing the standard notation for DA problems as formalized by Ide et al. [1997]. We first consider a discrete model for the evolution of a physical (atmospheric, oceanic, mechanical, biological, etc.) system from time tk to time tk+1 , described by a dynamic state equation   xf (tk+1 ) = Mk+1 xf (tk ) ,

(1.1)

1.5. DA methods

13

where x is the model’s state vector of dimension n (see below for the definition of the superscripts) and M is the corresponding dynamics operator that can be time dependent. This operator usually results from a finite difference [Strikwerda, 2004] or finite element [Hughes, 1987] discretization of a (partial) differential equation (PDE). We associate an error covariance matrix P with the state x since the true state will differ from the simulated state (1.1) by random or systematic errors. Observations, or measurements, at time tk are defined by yo = Hk [x t (tk )] + εok ,

(1.2)

where H is an observation operator that can be time dependent and εo is a white noise process with zero mean and associated covariance matrix R that describes instrument errors and representation errors due to the discretization. The observation vector, yok = yo (tk ), has dimension pk , which is usually much smaller than the state dimension, pk n. Subscripts are used to denote the discrete time index, the corresponding spatial indices, or the vector with respect to which an error covariance matrix is defined; superscripts refer to the nature of the vectors/matrices in the DA process: • “a” for analysis, • “b” for background (or initial/first guess), • “f” for forecast, • “o” for observation, and • “t” for the (unknown) true state. Analysis is the process of approximating the true state of a physical system at a given time. Analysis is based on • observational data, • a model of the physical system, and • background information on initial and boundary conditions. An analysis that combines time-distributed observations and a dynamic model is called data assimilation. Now let us introduce the continuous system. In fact, continuous time simplifies both the notation and the theoretical analysis of the problem. For a finite-dimensional system of ODEs, the equations (1.1)–(1.2) become x˙f =  (xf , t ), and

yo (t ) =  (xt , t ) + ε,

where (˙) = d/dt and  and  are nonlinear operators in continuous time for the model and the observation, respectively. This implies that x, y, and ε are also continuous-in-time functions. For PDEs, where there is in addition a dependence on space, attention must be paid to the function spaces, especially when performing variational analysis. Details will be provided in the next chapter. With a PDE model, the field (state) variable is commonly denoted by u(x, t ), where x represents the space

14

Chapter 1. Introduction to data assimilation and inverse problems

variables (no longer the state variable as above!), and the model dynamics is now a nonlinear partial differential operator,  =  [∂xα , u(x, t ), x, t ] ,

with ∂xα denoting the partial derivatives with respect to the space variables of order up to |α| ≤ m, where m is usually equal to two and in general varies between one and four.

1.5.2 Statistical DA Practical inverse problems and DA problems involve measured data. These data are inexact and are mixed with random noise. Only statistical models can provide rigorous, effective means for dealing with this measurement error. Let us begin with the following simple example. 1.5.2.1 A simple example

We want to estimate a scalar quantity, say the temperature or the ozone concentration at a fixed point in space. Suppose we have • a model forecast, x b ( background, or a priori value), and • a measured value, x o ( observation). The simplest possible approach is to try a linear combination of the two, x a = x b + w(x o − x b ),

where x a denotes the analysis that we seek and 0 ≤ w ≤ 1 is a weight factor. We subtract the (always unknown) true state x t from both sides, x a − x t = x b − x t + w(x o − x t − x b + x t ), and, defining the three errors (analysis, background, observation) as ea = xa − xt, we obtain

eb = x b − x t ,

eo = xo − xt,

e a = e b + w(e o − e b ) = w e o + (1 − w)e b .

If we have many realizations, we can take an ensemble average,9 denoted by 〈·〉:      〈e a 〉 = e b + w 〈e o 〉 − e b . Now if these errors are centered (have zero mean, or the estimates of the true state are unbiased), then 〈e a 〉 = 0 also. So we are logically led to look at the variance and demand that it be as small as possible. The variance is defined, using the above notation, as   σ 2 = (e − 〈e〉)2 .

9 Please refer to a good textbook on probability and statistics for all the relevant definitions—e.g., DeGroot and Schervisch [2012].

1.5. DA methods

15

So by taking variances of the error equation, and using the zero-mean property, we obtain    2   σa2 = σb2 + w 2 e o − e b + 2w e b e o − e b . This reduces to

 σa2 = σb2 + w 2 σo2 + σb2 − 2wσb2

if e o and e b are uncorrelated. Now, to compute a minimum, take the derivative of this last equation with respect to w and equate to zero,  0 = 2w σo2 + σb2 − 2σb2 , where we have ignored all cross terms since the errors have been assumed to be independent. Finally, solving this last equation, we can write the optimal weight, w∗ =

σb2 2 σo + σb2

=

1 , 1 + σo2 /σb2

which, we notice, depends on the ratio of the observation and the background errors. Clearly 0 ≤ w∗ ≤ 1 and • if the observation is perfect, σo2 = 0 and thus w∗ = 1, the maximum weight;

• if the background is perfect, σb2 = 0 and w∗ = 0, so the observation will not be taken into account. We can now rewrite the analysis error variance as σa2 = w∗2 σo2 + (1 − w∗ )2 σb2 =

σb2 σo2

σo2 + σb2

= (1 − w∗ )σb2 1 = −2 , σo + σb−2 where we suppose that σb2 , σo2 > 0. In other words, 1 1 1 = + . σa2 σo2 σb2 Finally, the analysis equation becomes xa = xb +

1 (x o − x b ), 1+α

where α = σo2 /σb2 . This is called the BLUE—best linear unbiased estimator—because it gives an unbiased, optimal weighting for a linear combination of two independent measurements. We can isolate three special cases: • If the observation is very accurate, σo2 σb2 , α 1, and thus x a ≈ x o .

16

Chapter 1. Introduction to data assimilation and inverse problems

model trajectory observation error observation

t0

t1

...

tn

Figure 1.6. Sequential assimilation. The x-axis denotes time; the y-axis is the assimilated variable.

• If the background is accurate, α  1 and x a ≈ x b . • And, finally, if the observation and background variances are approximately equal, then α ≈ 1 and x a is just the arithmetic average of x b and x o . We can conclude that this simple, linear model does indeed capture the full range of possible solutions in a statistically rigorous manner, thus providing us with an “enriched” solution when compared with a nonprobabilistic, scalar response such as the arithmetic average of observation and background, which would correspond to only the last of the above three special cases. 1.5.2.2 The more general case: Introducing the Kalman filter

The above analysis of the temperature was based on a spatially dependent model. However, in general, the underlying process that we want to model will be time dependent. Within the significant toolbox of mathematical tools that can be used for statistical estimation from noisy sensor measurements, one of the most well-known and often-used tools is the Kalman filter (KF). The KF is named after Rudolph E. Kalman, who in 1960 published his famous paper describing a recursive solution to the time-dependent discrete-data linear filtering problem [Kalman, 1960]. We consider a dynamical system that evolves in time, and we seek to estimate a series of true states, xtk (a sequence of random vectors), where discrete time is indexed by the letter k. These times are those when the observations or measurements are taken—see Figure 1.6. The assimilation starts with an unconstrained model trajectory from t0 , t1 , . . . , tk−1 , tk , . . . , tn and aims to provide an optimal fit to the available observations/measurements given their uncertainties (error bars), depicted in the figure. This situation is modeled by a stochastic system. We seek to estimate the state, x ∈ n , of a discrete-time dynamic process that is governed by the linear stochastic difference equation xk+1 = Mk+1 [xk ] + wk ,

1.5. DA methods

17

yk+2 xak+2

yk+1 xak+1

xfk+2

xak+3

xfk+1

yk+3

xak k

k +1

xfk+3

k +2

k +3

Figure 1.7. Sequential assimilation scheme for the KF. The x-axis denotes time; the y-axis is the assimilated variable. We assume scalar variables.

with a measurement/observation y ∈  m defined by yk = Hk [xk ] + vk . The random vectors wk and vk represent the process/modeling and measurement/ observation errors, respectively. They are assumed10 to be independent, white, and with Gaussian/normal probability distributions, wk ∼  (0, Qk ), vk ∼  (0, Rk ),

(1.3) (1.4)

where Q and R are the covariance matrices (assumed known) and can in general be time dependent. We can now set up a sequential DA scheme. The typical assimilation scheme is made up of two major steps: a prediction/forecast step and a correction/analysis step. At time tk we have the result of a previous forecast, xfk (the analogue of the background state xbk ), and the result of an ensemble of observations in yk . Based on these two vectors, we perform an analysis that produces xak . We then use the evolution model, which is usually (partial) differential equation–based, to obtain a prediction of the state at time tk+1 . The result of the forecast is denoted xfk+1 and becomes the background (or initial guess) for the next time step. This process is summarized in Figure 1.7. We can now define forecast (a priori) and analysis (a posteriori) estimate errors in the same way as above for the scalar case, with their respective error covariance matrices, which generalize the variances used before, since we are now dealing with vector quantities. The goal of the KF is to compute an optimal a posteriori estimate, xak , that is a linear combination of an a priori estimate, xfk , and a weighted difference   between the actual measurement, yk , and the measurement prediction, Hk xfk . This 10 These

assumptions are necessary in the KF framework. For real problems, they must often be relaxed.

18

Chapter 1. Introduction to data assimilation and inverse problems

is none other than the BLUE that we saw in the example above. The filter must be of the form   xak = xfk + Kk yk − Hk xfk , (1.5)

 where K is the Kalman gain. The difference yk − Hk xfk is called the innovation and reflects the discrepancy between the actual and the predicted measurements at time tk . Note that for generality, the matrices are shown with a time dependence. Often this is not the case, and the subscripts k can then be dropped. The Kalman gain matrix, K, is chosen to minimize the a posteriori error covariance equation. This is straightforward to compute: substitute (1.5) into the definition of the analysis error, then substitute in the error covariance equation, take the derivative of the trace of the result with respect to K, set the result equal to zero, and, finally, solve for the optimal gain K. The resulting optimal gain matrix is −1  Kk = Pfk HT HPfk HT + R , where Pfk is the forecast error covariance matrix. Full details of this computation, as well as numerous examples, are provided in Chapter 3, where we will also generalize the approach to more realistic cases.

1.5.3 Variational DA Unlike sequential/statistical assimilation (which emanates from estimation theory), variational assimilation is based on optimal control theory [Kwakernaak and Sivan, 1972; Friedland, 1986; Gelb, 1974; Tröltzsch, 2010], itself derived from the calculus of variations. The analyzed state is not defined as the one that maximizes a certain PDF, but as the one that minimizes a cost function. The minimization requires numerical optimization techniques. These techniques can rely on the gradient of the cost function, and this gradient will be obtained here with the aid of adjoint methods. 1.5.3.1 Adjoint methods: An introduction

All descent-based optimization methods require the computation of the gradient, ∇J , of a cost function, J . If the dependence of J on the control variables is complex or indirect, this computation can be very difficult. Numerically, we can always manage by computing finite increments, but this would have to be done in all possible perturbation directions. We thus need to find a less expensive way to compute the gradient. This will be provided by the calculus of variations and the adjoint approach. A basic example: Let us consider a classical inverse problem known as a parameter identification problem, based on the ODE (of convection-diffusion type) 

− b u (x) + c u (x) = f (x), u(0) = 0, u(1) = 0,

0 < x < 1,

(1.6)

where depicts the derivative with respect to x, f is a given function, and b and c are unknown (constant) parameters that we seek to identify using observations of u(x) on the interval [0, 1]. The mismatch (or least-squares error) cost function is then J (b , c) =

1 2



1 0

(u(x) − u o (x))2 dx,

1.5. DA methods

19

where u o is the observational data. The gradient of J can be calculated by introducing the tangent linear model (TLM). Perturbing the cost function by a small perturbation11 in the direction α, with respect to the two parameters, b and c, gives  1 1 J (b + αδ b , c + αδc) − J (b , c) = ( u˜ − u o )2 − (u − u o )2 dx, 2 0 where u˜ = u b +αδ b ,c+αδc is the perturbed solution and u = u b ,c is the unperturbed one. Now we divide by α and pass to the limit α → 0 to obtain the directional derivative (with respect to the parameters, in the direction of the perturbations), 1 Jˆ[b , c] (δ b , δc) = (u − u o ) uˆ dx, (1.7) 0

where we have defined

u˜ − u . α Then, passing to the limit in equation (1.6), we can define the TLM  − b uˆ + c uˆ = (δ b )u − (δc)u , uˆ(0) = 0, uˆ (1) = 0. uˆ = lim

α→0

(1.8)

We would like to reformulate the directional derivative (1.7) to obtain a calculable expression for the gradient. For this we introduce the adjoint variable, p, satisfying the adjoint model  − b p − c p = (u − u o ), (1.9) p(0) = 0, p(1) = 0. Multiplying the TLM by this new variable, p, and integrating by parts enables us to finally write an explicit expression (see Chapter 2 for the complete derivation) for the gradient based on (1.7),  1 T 1 ∇J (b , c) = p u dx , − p u dx , 0

0

or, separating the two components,  ∇ b J (b , c) =

1

p u dx,

0

∇c J (b , c) = −



1

p u dx.

0

Thus, for the additional cost of solving the adjoint model (1.9), we can compute the gradient of the cost function with respect to either one, or both, of the unknown parameters. It is now a relatively easy task to find (numerically) the optimal values of b and c that minimize J by using a suitable descent algorithm. This important example is fully developed in Section 2.3.2, where all the steps are explicitly justified. Note that this method generalizes to (linear and nonlinear) time-dependent PDEs and to inverse problems where we seek to identify the initial conditions. This latter problem is exactly the 4D-Var problem of DA. All of this will be amply described in Chapter 2. 11 The

exact properties of the perturbation will be fully explained in Chapter 2.

20

Chapter 1. Introduction to data assimilation and inverse problems

Algorithm 1.1 Iterative 3D-Var (in its simplest form). k = 0, x = x0 while ∇J  > ε or j ≤ jmax compute J compute ∇J gradient descent and update of x j +1 j = j +1 end 1.5.3.2 3D-Var

We have seen above that the BLUE requires the computation of an optimal gain matrix. We will show (in Chapters 2 and 3) that the optimal gain takes the form K = BHT (HBHT + R)−1 to obtain an analyzed state, xa = xb + K(y − H(xb )), that minimizes what is known as the 3D-Var cost function, J (x) =

T   1 1 x − xb B−1 x − xb + (Hx − y)T R−1 (Hx − y) , 2 2

(1.10)

where R and B (also denoted Pf ) are the observation and background error covariance matrices, respectively. But the matrices involved in this calculation are often neither storable in memory nor manipulable because of their very large dimensions. The basic idea of variational methods is to overcome these difficulties by attempting to directly minimize the cost function, J . This minimization can be achieved, for inverse problems in general (and for DA in particular), by a combination of (1) an adjoint approach for the computation of the gradient of the cost function with (2) a descent algorithm in the direction of the gradient. For DA problems where there is no time dependence, the adjoint is not necessary and the approach is named 3D-Var, whereas for time-dependent problems we use the 4D-Var approach. We recall that R and B are the observation and background error covariance matrices, respectively. When the observation operator H is linear, the gradient of J in (1.10) is given by   ∇J = B−1 x − xb − HT R−1 (y − Hx) . In the iterative 3D-Var Algorithm 1.1 we use as a stopping criterion the fact that ∇J is small or that the maximum number of iterations, jmax , is reached. 1.5.3.3 A simple example of 3D-Var

We seek two temperatures, x1 and x2 , in London and Paris. The climatologist gives us an initial guess (based on climate records) of x b = (10 5)T , with background error covariance matrix   1 0.25 B= . 0.25 1

1.5. DA methods

21

Algorithm 1.2 4D-Var in its basic form j = 0, x = x0 while ∇J  > ε or j ≤ jmax (1) compute J with the direct model M and H (2) compute ∇J with adjoint model MT and HT (reverse mode) gradient descent and update of x j +1 j = j +1 end We observe yo = 4 in Paris, which implies that H = (0 1), with an observation error variance R = (0.25) . We can now write the cost function (1.10) as −1   1 0.25 x1 − 10 + R−1 (x2 − 4)2 x2 − 5 0.25 1     16  1 −0.25 x1 − 10 = x1 − 10 x2 − 5 + 4(x2 − 4)2 x −0.25 1 − 5 15 2 16  (x − 10)2 + (x2 − 5)2 − 0.5(x1 − 10)(x2 − 5) + 4(x2 − 4)2 = 15 1 16  2 x − 17.5x1 + 100 + x22 − 5x2 − 0.5x1 x2 + 4(x22 − 8x + 16), = 15 1

J (x) =



x1 − 10

x2 − 5



and its gradient can be easily seen to be     1 16 2x1 − 0.5x2 − 17.5 32x1 − 8x2 − 280 = ∇J (x) = . 15 −8x + 152x − 560 2x − 5 − 0.5x + (2x − 8) 15 15 1 2 2 1 2 4 The minimum is obtained for ∇J (x) = 0, which yields x1 = 9.8,

x2 = 4.2.

This is an optimal estimate of the two temperatures, given the background and observation errors. 1.5.3.4 4D-Var

In 4D-Var,12 the cost function is still expressed in terms of the initial state, x0 , but it will include the model because the observation yoi at time i is compared to Hi (xi ), where xi is the state at time i initialized by x0 and the adjoint is not simply the transpose of a matrix but also the “transpose” of the model/operator dynamics. To compute this will require the use of a more general adjoint theory, which is introduced just after the following example and fully explained in Chapter 2. In step (1) of Algorithm 1.2, we use the equations dk = yok − Hk Mk Mk−1 . . . M2 M1 x and

12 The

T    1 dTi R−1 di . x − xb B−1 x − xb + i 2 i =0 j

J (x) =

4 refers to the additional time dimension.

22

Chapter 1. Introduction to data assimilation and inverse problems

In step (2), we use   ∇J (x) = B−1 x − xb −     T T −1 T T −1 T T −1 , HT0 R−1 0 d0 + M1 H1 R1 d1 + M2 H2 R2 d2 + · · · + M j H j R j d j where we have assumed that H and M are linear.

1.6 Some practical aspects of DA and inverse problems In this brief section we point out some important practical considerations. It should now be clear that there are four basic ingredients in any inverse or DA problem: 1. observation or measured data; 2. a forward or direct model of the real-world context; 3. a backward or adjoint model in the variational case and a probabilistic framework in the statistical case; and 4. an optimization cycle. But where does one start? The traditional approach, often employed in mathematical and numerical modeling, is to begin with some simplified, or at least well-known, situation. Once the above four items have been successfully implemented and tested on this instance, we then proceed to take into account more and more reality in the form of real data, more realistic models, more robust optimization procedures, etc. In other words, we introduce uncertainty, but into a system where we at least control some of the aspects.

1.6.1 Twin experiments Twin experiments, or synthetic runs, are a basic and indispensable tool for all inverse problems. To evaluate the performance of a DA system we invariably begin with the following methodology: 1. Fix all parameters and unknowns and define a reference trajectory, obtained from a run of the direct model—call this the “truth.” 2. Derive a set of (synthetic) measurements, or background data, from this “true” run. 3. Optionally, perturb these observations to generate a more realistic observed state. 4. Run the DA or inverse problem algorithm, starting from an initial guess (different from the “true” initial state used above), using the synthetic observations. 5. Evaluate the performance, modify the model/algorithm/observations, and cycle back to step 1. Twin experiments thus provide a well-structured methodological framework. Within this framework we can perform different “stress tests” of our system. We can modify the observation network, increase or decrease (even switch off) the uncertainty, test the robustness of the optimization method, and even modify the model. In fact, these experiments can be performed on the full physical model or on some simpler (or reduced-order) model.

1.7. To go further: Additional comments and references

23

1.6.2 Toy models and other simplifications Toy models are, by definition, simplified models that we can play with, yes, but these are of course “serious games.” In certain complex physical contexts, of which meteorology is a famous example, we have well-established toy models, often of increasing complexity. These can be substituted for the real model, whose computational complexity is often too large, and provide a cheaper test-bed. Some well-known examples of toy models are • Lorenz models—see Lorenz [1963]—which are used as an avatar for weather simulations; • various harmonic oscillators that are used to simulate dynamic systems; and • famous examples such as the Ising model in physics, the Lotka–Volterra model in life sciences, and the Schelling model in social sciences; See Marzuoli [2008] for a more general discussion.

1.7 To go further: Additional comments and references • Examples of inverse problems: 11 examples can be found in Keller [1966] and 16 in Kirsch [1996]. • As the reader may have observed, the formulation and solution of DA and inverse problems require a wide range of tools and competencies in functional analysis, probability and statistics, variational calculus, numerical optimization, numerical approximation of (partial) differential equations, and stochastic simulation. This monograph will not provide all of this, so the reader must resort to other sources for the necessary background “tools.” A few bibliographic recommendations are – DeGroot and Schervisch [2012] for probability and statistics; – Courant and Hilbert [1989a] for variational calculus; – Nocedal and Wright [2006] for numerical optimization; – Kreyszig [1978] and Reed and Simon [1980] for functional analysis; – Strikwerda [2004] for finite difference methods; – Hughes [1987] and Zienkiewicz and Taylor [2000] for finite element methods; – Press et al. [2007] for stochastic simulation; – Quarteroni et al. [2007] for basic numerical analysis (integration, solution of ODEs, etc.); and – Golub and van Loan [2013] for numerical linear algebra.

Chapter 2

Optimal control and variational data assimilation

2.1 Introduction Unlike sequential assimilation (which emanates from statistical estimation theory and will be the subject of the next chapter), variational assimilation is based on optimal control theory.13 The analyzed state is not defined as the one that maximizes a certain probability density function (PDF), but as the one that minimizes a cost function. The minimization requires numerical optimization techniques. These techniques can rely on the gradient of the cost function, and this gradient will be obtained here with the aid of adjoint methods. The theory of adjoint operators, coming out of functional analysis, is presented in Kreyszig [1978] and Reed and Simon [1980]. A special case is that of matrix systems, which are simply the finite dimensional operator case. The necessary ingredients of optimization theory are described in Nocedal and Wright [2006]. In this chapter, we will show that the adjoint approach is an extremely versatile tool for solving a very wide range of inverse problems—DA problems included. This will be illustrated via a sequence of explicitly derived examples, from simple cases to quite complex nonlinear cases. We will show that once the theoretical adjoint technique is understood and mastered, almost any model equation can be treated and almost any inverse problem can be solved (at least theoretically). We will not neglect the practical implementation aspects that are vital for any real-world, concrete application. These will be treated in quite some detail since they are often the crux of the matter—that is, the crucial steps for succeeding in solving DA and inverse problems. The chapter begins with a presentation of the calculus of variations. This theory, together with the concept of ill-posedness, is the veritable basis of inverse problems and DA, and its mastery is vital for formulating, understanding, and solving real-world problems. We then consider adjoint methods, starting from a general setting and moving on through a series of parameter identification problems—all of these in a differential equation (infinite-dimensional) setting. Thereafter, we study finite-dimensional cases, which lead naturally to the comparison of continuous and discrete adjoints. It is here that we will introduce automatic differentiation, which generalizes the calculation of the adjoint to intractable, complex cases. After all this preparation, we will be ready to study the two major variational DA approaches: 3D-Var and 4D-Var. Once 13 This

is basically true; however, 4D-Var applied to a chaotic model is in fact a sequential algorithm.

25

26

Chapter 2. Optimal control and variational data assimilation

(a) y = (x − x∗ )2 : one minimum

(b) y = −2 cos (x − x∗ ): many minima

(c) y = x 3 : saddle point

(d) y = 0.015(x − x∗ )2 − 2 cos (x − x∗ ): one global minimum, many local minima

Figure 2.1. A variety of local extrema, denoted by x∗ .

completed, we present a few numerical examples. We end the chapter with a brief description of a selection of advanced topics: preconditioning, reduced-order methods, and error covariance modeling. These will be expanded upon in Chapter 5.

2.2 The calculus of variations The calculus of variations is, to quote Courant and Hilbert [1989a], one of the “very central fields of analysis.” It is also the central tool of variational optimization and DA, since it generalizes the theory of maximization and minimization. If we understand the workings of this theory well, we will be able to understand variational DA and inverse problems in a much deeper way and thus avoid falling into phenomenological traps. By this we mean that when, for a particular problem, things go wrong, we will have the theoretical and methodological distance/knowledge that is vital for finding a way around, over, or through the barrier (be it in the formulation or in the solution of the problem). Dear reader, please bear with us for a while, as we review together this very important theoretical tool. To start out from a solid and well-understood setting, let us consider the basic theory of optimization14 (maximization or minimization) of a continuous function

14 An

f (x, y, . . .): d →  excellent reference for a more complete treatment is the book of Nocedal and Wright [2006].

2.2. The calculus of variations

27

3.54 2.5 23 1.5

1.5 2

6 1

1

3.54 2.53 2 1.5

3

4

2

.5

2

-10

-5

3

3.54

1

2.5

2

3 3. 4 5

1.5

012345678 109

-15

2.5

1 4 3.5 3 2.52 1.5

9 8765432110 0

-2.5 -20

1

2

0.5

0

1.5 0.5

-2

2

4 3.52.5

1.5

8 76 5 4 3 2 1 0

1

7 8

0.5

10

10 87 6

6 9

3 1.52

5

7 8 10

0.5

4

1

4 5

6

9

1.5

4

9

4 32 3.5 1

9

2

78 456 0123

-1.5

5

7 8

1

10

1.5 0.5

8

2

10 8 67

2

7

6

7 8

3

6

9

9

1.5

1

-0.5

-1

4 3 2.5 3.5 1

6 7 8 10 5 2

9

2

8

5

0

1

0.5

9 10 876 4

3

1.5

2

1

1

10 1

-2 0

5

(a) z = F1 (x, y).

10

15

20

-2

-1.5

-1

-0.5

0

0.5

1

1.5

2

(b) z = F2 (x, y).

Figure 2.2. Counterexamples for local extrema in 2 .

in a closed region Ω. We seek a point x∗ = (x∗ , y∗ , . . .) ∈ d in Ω for which f has an extremum (maximum or minimum) in the vicinity of x∗ (what is known as a local extremum—see Figure 2.1). A classical theorem of Weierstrass guarantees the existence of such an object. Theorem 2.1. Every continuous function in a bounded domain attains a maximal and a minimal value inside the domain or on its boundary. If f is differentiable in Ω and if x∗ is an interior point, then the first derivatives of f , with respect to each of its variables, vanish at x∗ —we say that the gradient of f is equal to zero. However, this necessary condition is by no means sufficient because of the possible existence of saddle points. For example, see Figure 2.1c, where f (x) = x 3 at x∗ = 0. Moreover, as soon as we pass from  to even 2 , we lose the simple Rolle’s and intermediate-value theorem [Wikipedia, 2015c] results, and very counterintuitive things can happen. This is exhibited by the following two examples (see Figure 2.2): • F1 (x, y) = x 2 (1+ y)3 +7y 2 has a single critical point—the gradient of F1 vanishes at (0, 0)—which is a local minimum, but not a global one. This cannot happen in one dimension because of Rolle’s theorem. Note also that (0, 0) is not a saddle point. • F2 (x, y) = (x 2 y − x − 1)2 + (x 2 − 1)2 has exactly two critical points, at (1, 2) and (−1, 0), both of which are local minima—again, an impossibility in one dimension where we would have at least one additional critical point between the two. A sufficient condition requires that the second derivative of the function exist and be positive. In this case the point x∗ is indeed a local minimizer. The only case where things are simple is when we have smooth, convex functions [Wikipedia, 2015d]—in this case any local minimizer is a global minimizer, and, in fact, any stationary point is a global minimizer. However, in real problems, these conditions are (almost) never satisfied even though we will in some sense “convexify” our assimilation and inverse problems—see below. Now, if the variables are subject to n constraints of the form g j (x, y, . . .) = 0 for j = 1, . . . , n, then by introducing Lagrange multipliers we obtain the necessary conditions

28

Chapter 2. Optimal control and variational data assimilation

for an extremum. For this, we define an augmented function, F=f +

n 

λj gj ,

j =1

and write down the necessary conditions ∂F = 0, ∂x ∂F = g1 = 0, ∂ λ1

∂F = 0, ∂y

...

(d equations),

∂F = gn = 0 (n equations), ∂ λn

...,

which gives a system of equations (m equations in m unknowns, where m = d + n) that are then solved for x∗ ∈ d and λ j , j = 1, . . . , n. We will now generalize the above to finding the extrema of functionals.15 The domain of definition becomes a space of admissible functions16 in which we will seek the extremal member. The calculus of variations deals with the following problem: find the maximum or minimum of a functional, over the given domain of admissible functions, for which the functional attains the extremum with respect to all argument functions in a small neighborhood of the extremal argument function. But we now need to generalize our definition of the vicinity (or neighborhood) of a function. Moreover, the problem may not have a solution because of the difficulty of choosing the set of admissible functions to be compact, which means that any infinite sequence of functions must get arbitrarily close to some function of the space—an accumulation point. However, if we restrict ourselves to necessary conditions (the vanishing of the first “derivative”), then the existence issue can be left open.

2.2.1 Necessary conditions for functional minimization What follows dates from Euler and Lagrange in the 18th century. Their idea was to solve minimization problems by means of a general variational approach that reduces them to the solution of differential equations. Their approach is still completely valid today and provides us with a solid theoretical basis for the variational solution of inverse and DA problems. We begin, as in Courant and Hilbert [1989a], with the “simplest problem” of the calculus of variations, where we seek a real-valued function y(x), defined on the interval [a, b ], that minimizes the integral cost function  J [y] =

b a

F (x, y, y ) dx.

(2.1)

A classical example is to find the curve, y(x), known asthe geodesic, that minimizes the

length between x = a and x = b , in which case F =

1 + (y )2 . We will assume here

15 These are functions of functions, rather than of scalar variables. 16 An example of such a space is the space of continuous functions with continuous first derivatives, usually denoted C 1 (Ω), where Ω is the domain of definition of each member function.

2.2. The calculus of variations

29

y + εη(x)

x=b

x =a

η(x)

x=b

x =a

Figure 2.3. Curve η(x) and admissible functions y + εη(x).

that the functions F and y possess all the necessary smoothness (i.e., they are differentiable up to any order required). Suppose that y∗ is the extremal function that gives the minimum value to J . This means that in a sufficiently small neighborhood (recall our difficulties with multiple extremal points—the same thing occurs for functions) of the function y∗ (x) the integral (2.1) is smallest when y = y∗ (x). To quantify this, we will define the variation, δy, of the function y. Let η(x) be an arbitrary, smooth function defined on [a, b ] and vanishing at the endpoints, i.e., η(a) = η(b ) = 0. We construct the new function y = y + δy, with δy = εη(x), where ε is an arbitrary (small) parameter—see Figure 2.3. This implies that all the functions y will lie in an arbitrarily small neighborhood of y∗ , and thus the cost function J [˜ y ] , taken as a function of ε, must have its minimum at ε = 0, and its “derivative” with respect to ε must vanish there. If we now take the integral  J (ε) =

b a

F (x, y + εη, y + εη ) dx

and differentiate it with respect to ε, we obtain at ε = 0 J (0) =

 b a

 Fy η + Fy η dx = 0,

where the subscripts denote partial differentiation. Integrating the second term by parts and using the boundary values of η, we get  a

b

  d Fy dx = 0, η Fy − dx

30

Chapter 2. Optimal control and variational data assimilation

and since this must hold for all functions η, we can invoke the following fundamental lemma. x

Lemma 2.2. (Fundamental lemma of the calculus of variations). If x 1 η(x)φ(x)d x = 0 0, with φ(x) continuous, holds for all functions η(x) vanishing on the boundary and continuous with two continuous derivatives, then φ(x) = 0 identically. Using this lemma, we conclude that the necessary condition, known as the Euler– Lagrange equation, is d Fy − F = 0. (2.2) dx y We can expand the second term of this expression, obtaining (we have flipped the signs of the terms) (2.3) y Fy y + y Fy y + Fy x − Fy = 0. If we want to solve this equation for the highest-order derivative, we must require that Fy y = 0, which is known as the Legendre condition and plays the role of the second derivative in the optimization of scalar functions, by providing a sufficient condition for the existence of a maximum or minimum. We will invoke this later when we discuss the study of sensitivities with respect to individual parameters in multiparameter identification problems. Example 2.3. Let us apply the Euler–Lagrange necessary condition to the geodesic problem with b  b J [y] = F (x, y, y ) dx = 1 + (y )2 dx. a

a

Note that F has no explicit dependence on the variables y and x. The partial derivatives of F are Fy = 0,

Fy = 

y 1 + (y )2

,

Fy y = 0,

  2 −3/2 Fy y = 1 + y , Fy x = 0.

Substituting in (2.3), we get y = 0, which implies that y = c x + d, and, unsurprisingly, we indeed find that a straight line is the shortest distance between two points in the Cartesian x-y plane. This result can be extended to finding the geodesic on a given surface by simply substituting parametric equations for x and y.

2.2.2 Generalizations The Euler–Lagrange equation (2.2) can be readily extended to functions of several variables and to higher derivatives. This will then lead to the generalization of the Euler– Lagrange equations that we will need in what follows.

2.2. The calculus of variations

31

In fact, we can consider a more general family of admissible functions, y(x, ε), with   ∂ η(x) = y(x, ε) . ∂ε ε=0 Recall that we defined the variation of y ( y = y + δy) as δy = εη. This leads to an analogous definition of the (first) variation of J ,  x1   δJ = εJ (0) = ε Fy η + Fy η dx 

x0

  x=x1  d Fy η dx + εFy η x=x0 dx x  x1 0   x=x1 = [F ]y δy dx + Fy δy , x1





Fy −

x=x0

x0

where

  d . [F ]y = Fy − Fy dx is the variational derivative of F with respect to y. We conclude that the necessary condition for an extremum is that the first variation of J be equal to zero for all admissible y + δy. The curves for which δJ vanishes are called stationary functions. Numerous examples can be found in Courant and Hilbert [1989a]. Let us now see how this fundamental result generalizes to other cases. If F depends on higher derivatives of y (say, up to order n), then  x1 J [y] = F (x, y, y , . . . , y (n) ) dx, x0

and the Euler–Lagrange equation becomes Fy −

dn d d2 F + F − · · · + (−1)n F (n) = 0. dx y d x2 y d xn y

If F consists of several scalar functions (y1 , y2 , . . . , yn ), then  x1 J [y1 , y2 , . . . , yn ] = F (x, y1 , . . . , yn , y1 , . . . , yn ) dx, x0

and the Euler–Lagrange equations are Fy i −

d F = 0, d x yi

i = 1, . . . , n.

If F depends on a single function of n variables and if Ω is a surface, then  J [y] = F (x1 , . . . , xn , y, y x1 , . . . , y xn ) dx1 · · · dxn , Ω

and the Euler–Lagrange equations are now PDEs: Fy −

n  ∂ F = 0, ∂ xi y xi i =1

i = 1, . . . , n.

Finally, there is the case of several functions of several variables, which is just a combination of the above.

32

Chapter 2. Optimal control and variational data assimilation

Example 2.4. We consider the case of finding an extremal function, u, of two variables, x and y, from the cost function  J [u] = F (x, y, u, u x , uy ) dx dy (2.4) Ω

over the domain Ω. The necessary condition is   d J [u + εη] = 0, δJ = ε dε ε=0

(2.5)

where η(x, y) is a “nice” function satisfying zero boundary conditions on ∂ Ω, the boundary of Ω. Substituting (2.4) in (2.5), we obtain    δJ = ε F u η + F ux η x + F uy ηy dx dy = 0, Ω

which we integrate by parts (by applying the Gauss divergence theorem), getting    ∂ ∂ F ux − F uy dx dy = 0 η Fu − δJ = ε ∂x ∂y Ω (we have used the vanishing of η on the boundary), which yields the Euler–Lagrange equation ∂ ∂ [F ] u = F u − F − F = 0. ∂ x u x ∂ y uy This can be expanded to F ux ux u x x + 2F ux uy u xy + F uy uy uyy + F ux u u x + F uy u uy + F ux x + F uy y − F u = 0. We can now apply this result to the case where F=

 1 2 u + uy2 . 2 x

Clearly, F ux ux = F uy uy = 1, with all other terms equal to zero, and the Euler–Lagrange equation is precisely Laplace’s equation, Δu = u x x + uyy = 0, which can be solved subject to the boundary conditions that must be imposed on u.

2.2.3 Concluding remarks The calculus of variations, via the Euler–Lagrange (partial) differential equations, provides a very general framework for minimizing functionals, taking into account both the functional to be minimized and the (partial) differential equations that describe the underlying physical problem. The calculus of variations also covers the minimization of more general integral equations, often encountered in imaging problems, but

2.3. Adjoint methods

33

these will not be dealt with here. Entire books are dedicated to this subject—see, for example, Aster et al. [2012] and Colton and Kress [1998]. In what follows, for DA problems we will study a special case of calculus of variations and generalize it. Let us explain: the special case lies in the fact that our cost function will be a “mismatch” function, expressing the squared difference between measured values and simulated (predicted) values, integrated over a spatial (or spacetime) domain. To this we will sometimes add “regularization” terms to ensure wellposedness. The generalization takes the form of the constraints that we add to the optimization problem: in the case of DA these constraints are (partial) differential equations that must be satisfied by the extremal function that we seek to compute.

2.3 Adjoint methods Having, we hope by now, acquired an understanding of the calculus of variations, we will proceed to study the adjoint approach for solving (functional) optimization problems. We will emphasize the generality and the inherent power of this approach. Note that this approach is also used frequently in optimal control and optimal design problems—these are just special cases of what we will study here. We note that the Euler–Lagrange system of equations, amply seen in the previous section, will here be composed of the direct and adjoint equations for the system under consideration. A very instructive toy example of an ocean circulation problem can be found in Bennett [2004], where the Euler–Lagrange equations are carefully derived and their solution proposed using a special decomposition based on “representer functions.” In this section, we will start from a general setting for the adjoint method, and then we will back up and proceed progressively through a string of special cases, from a “simple” ODE-based inverse problem of parameter identification to a full-blown, nonlinear PDE-based problem. Even the impatient reader, who may be tempted to skip the general setting and go directly to the special case examples, is encouraged to study the general presentation, because a number of fundamental and very important points are dealt with here. After presenting the continuous case, the discrete (finitedimensional) setting will be explained. This leads naturally to the important subject of automatic differentiation, which is often used today for automatically generating the adjoints of large production codes, though it can be very efficient for smaller codes too.

2.3.1 A general setting We will now apply the calculus of variations to the solution of variational inverse problems. Let u be the state of a dynamical system whose behavior depends on model parameters m and is described by a differential operator equation L(u, m) = f, where f represents external forces. Define a cost function, J (m), as an energy functional or, more commonly, as a misfit functional that quantifies the L2 -distance17 between the observation and the model prediction u(x, t ; m). For example, T  2 J (m) = u(x, t ; m) − uobs (x, t ) δ(x − x r ) dx dt , 0

17 The

Ω

space L2 is a Hilbert space of (measurable) functions that are square-integrable (in the Lebesgue sense). Readers unfamiliar with this should consult a text such as Kreyszig [1978] for this definition as well as all other (functional) analysis terms used in what follows.

34

Chapter 2. Optimal control and variational data assimilation

where x ∈ Ω ⊂ n ; n = 2, 3; 0 ≤ t ≤ T ; δ is the Dirac delta function; and x r are the observer positions. Our objective is to choose the model parameters, m, as a function of the observed output, uobs , such that the cost function, J (m), is minimized. We define the variation of u with respect to m in the direction δm (known as the Gâteaux differential, which is the directional derivative, but defined on more general spaces of functions) as . δu = ∇ m u δm,

where ∇ m (·) is the gradient operator with respect to the model parameters (known, in the general case, as the Fréchet derivative). Then the corresponding directional derivative of J can be written as δJ = ∇ m J δm

= ∇ u J δu = 〈∇ u J1 δu〉 ,

(2.6)

where in the second line we have used the chain rule together with the definition of δu, and in the third line 〈·〉 denotes the space-time integral. Here we have passed the “derivative” under the integral sign, and J1 is the integrand. There remains a major difficulty: the variation δu is impossible or unfeasible to compute numerically (for all directions δm). To overcome this, we would like to eliminate δu from (2.6) by introducing an adjoint state (which can also be seen as a Lagrange multiplier). To achieve this, we differentiate the state equation with respect to the model m and apply the necessary condition for optimality (disappearance of the variation) to obtain δL = ∇ m L δm + ∇ u L δu = 0.

Now we multiply this equation by an arbitrary test function u† (Lagrange multiplier) and integrate over space-time to obtain  †    u · ∇ m L δm + u† · ∇ u L δu = 0. Add this null expression to (2.6) and integrate by parts, regrouping terms in δu:     ∇ m J δm = 〈∇ u J1 δu〉 + u† · ∇ m L δm + u† · ∇ u L δu      = δu · ∇ u J1† + ∇ u L† u† + u† · ∇ m L δm ,

where we have defined the adjoint operators ∇ u J1† and ∇ u L† via the appropriate inner products as   〈∇ u J1 δu〉 = δu · ∇ u J1†

and



   u† · ∇ u L δu = δu · ∇ u L† u† .

Finally, to eliminate δu, the adjoint state, u† , should satisfy ∇ u L† u† = −∇ u J1† ,

which is known as the adjoint equation. Once the adjoint solution, u† , is found, the derivative/variation of the objective functional becomes   ∇ m J δm = u† · ∇ m L δm . (2.7)

This key result enables us to compute the desired gradient, ∇ m J , without the explicit knowledge of δu. A number of important remarks are necessary here:

2.3. Adjoint methods

35

1. We obtain explicit formulas for the gradient with respect to each/any model parameter. Note that this has been done in a completely general setting, without any restrictions on the operator, L, or on the model parameters, m. 2. The computational cost is one solution of the adjoint equation, which is usually of the same order as (if not identical to) the direct equation,18 but with a reversal of time. 3. The variation (Gâteaux derivative) of L with respect to the model parameters, m, is, in general, straightforward to compute. 4. We have not considered boundary (or initial) conditions in the above general approach. In real cases, these are potential sources of difficulties for the use of the adjoint approach—see Section 2.3.9, where the discrete adjoint can provide a way to overcome this hurdle. 5. For complete mathematical rigor, the above development should be performed in an appropriate Hilbert space setting that guarantees the existence of all the inner products and adjoint operators—the interested reader could consult the excellent short course notes of Estep [2004] and references therein, or the monograph of Tröltzsch [2010]. 6. In many real problems, the optimization of the misfit functional leads to multiple local minima and often to very “flat” cost functions—these are hard problems for gradient-based optimization methods. These difficulties can be (partially) overcome by a panoply of tools: (a) Regularization terms can alleviate the nonuniqueness problem—see Engl et al. [1996] and Vogel [2002]. (b) Rescaling the parameters and/or variables in the equations can help with the “flatness”—this technique is often employed in numerical optimization— see Nocedal and Wright [2006]. (c) Hybrid algorithms, which combine stochastic and deterministic optimization (e.g., simulated annealing), can be used to avoid local minima—see Press et al. [2007]. 7. When measurement and modeling errors can be modeled by Gaussian distributions and a background (prior) solution exists, the objective function may be generalized by including suitable covariance matrices. This is the approach employed systematically in DA—see below for full details. We will now present a series of examples where we apply the adjoint approach to increasingly complex cases. We will use two alternative methods for the derivation of the adjoint equation: a Lagrange multiplier approach and the tangent linear model (TLM) approach. After seeing the two in action, the reader can adopt the one that suits her/him best. Note that the Lagrangian approach supposes that we perturb the soughtfor parameters (as seen above in Section 2.2) and is thus not applicable to inverting for constant-valued parameters, in which case we must resort to the TLM approach. 18 Note that for nonlinear equations this may not be the case, and one may require four or five times the computational effort.

36

Chapter 2. Optimal control and variational data assimilation

2.3.2 Parameter identification example A basic example: Let us consider in more detail the parameter identification problem (already encountered in Chapter 1) based on the convection-diffusion equation (1.6),  − b u (x) + c u (x) = f (x), 0 < x < 1, (2.8) u(0) = 0, u(1) = 0, where f is a given function in L2 (0, 1) and b and c are the unknown (constant) parameters that we seek to identify using observations of u(x) on [0, 1]. The least-squares error cost function is  2 1 1 J (b , c) = u(x) − u obs (x) dx. 2 0 Let us, once again, calculate its gradient by introducing the TLM. Perturbing the cost function by a small perturbation in the direction α with respect to the two parameters gives  2  2 1 1 J (b + αδ b , c + αδc) − J (b , c) = u˜ − u obs − u − u obs dx, 2 0 where u˜ = u b +αδ b ,c+αδc is the perturbed solution and u = u b ,c is the unperturbed one. Expanding and rearranging, we obtain   1 1 J (b + αδ b , c + αδc) − J (b , c) = u˜ + u − 2u obs ( u˜ − u) dx. 2 0 Now we divide by α on both sides of the equation and pass to the limit α → 0 to obtain the directional derivative (which is the derivative with respect to the parameters, in the direction of the perturbations),  1  Jˆ[b , c] (δ b , δc) = u − u obs uˆ dx, (2.9) 0

where we have defined uˆ = lim

α→0

u˜ − u , α

J (b + αδ b , c + αδc) − J (b , c) Jˆ[b , c] (δ b , δc) = lim , α→0 α

and we have moved the limit under the integral sign. Let us now use this definition to find the equation satisfied by uˆ. We have  − (b + αδ b ) u˜ + (c + αδc) u˜ = f , u˜ (0) = 0, u˜ (1) = 0, and the given model (2.8),



− b u + c u = f , u(0) = 0, u(1) = 0.

Then, subtracting these two equations and passing to the limit (using the definition of uˆ ), we obtain  − b uˆ − (δ b )u + c uˆ + (δc)u = 0, uˆ (0) = 0, uˆ(1) = 0.

2.3. Adjoint methods

37

We can now define the TLM  − b uˆ + c uˆ = (δ b )u − (δc)u , uˆ(0) = 0, uˆ (1) = 0.

(2.10)

We want to be able to reformulate the directional derivative (2.9) to obtain a calculable expression for the gradient. So we multiply the TLM (2.10) by a variable p and integrate twice by parts, transferring derivatives from uˆ onto p: 1 1 1  −b uˆ p dx + c uˆ p dx = (δ b )u dx − (δc)u p dx, 0

0

0

which gives (term by term) 1 1  1 uˆ p dx = uˆ p 0 − uˆ p dx 0

0





= uˆ p

1 − uˆ p 0



+

1

uˆ p dx

0

= uˆ (1) p(1) − uˆ (0) p(0) + and



1 0

uˆ p dx = [ uˆ p]10 − 

=−

1



1



1

uˆ p dx

0

uˆ p dx

0

uˆ p dx.

0

Putting these results together, we have   1  1  1  uˆ p + c − uˆ p = (δ b )u − (δc)u p −b uˆ (1) p(1) − uˆ (0) p(0) + 0

0

0

or, grouping terms, 1 1   −b p − c p uˆ = b uˆ (1) p(1) − b uˆ (0) p(0) + (δ b )u − (δc)u p. (2.11) 0

0

Now, to get rid of all the terms in uˆ in this expression, we impose that p must satisfy the adjoint model ! − b p − c p = (u − u obs ), (2.12) p(0) = 0, p(1) = 0. Integrating (2.12) and using the expression (2.11), we obtain   1   1 1 1  −b p − c p uˆ = (δ b ) (u − u obs ) uˆ = p u + (δc) − p u . 0

0

0

0

We recognize, in the last two terms, the L2 inner product, which enables us, based on the key result (2.7), to finally write an explicit expression for the gradient, based on (2.9),  1 T 1 ∇J (b , c) = p u dx, − p u dx 0

0

38

Chapter 2. Optimal control and variational data assimilation

or, separating the two components,  ∇ b J (b , c) =

1

p u dx,

(2.13)

0

∇c J (b , c) = −



1

p u dx.

(2.14)

0

Thus, in this example, to compute the gradient of the least-squares error cost function, we must • solve the direct equation (2.8) for u and derive u and u from the solution, using some form of numerical differentiation (if we solved with finite differences), or differentiating the shape functions (if we solved with finite elements); • solve the adjoint equation (2.12) for p (using the same solver19 that we used for u); • compute the two terms of the gradient, (2.13) and (2.14), using a suitable numerical integration scheme [Quarteroni et al., 2007]. Thus, for the additional cost of one solution of the adjoint model (2.12) plus a numerical integration, we can compute the gradient of the cost function with respect to either one, or both, of the unknown parameters. It is now a relatively easy task to find (numerically) the optimal values of b and c that minimize J by a suitable descent algorithm, for example, a quasi-Newton method [Nocedal and Wright, 2006; Quarteroni et al., 2007].

2.3.3 A simple ODE example: Lagrangian method We now consider a variant of the convection-diffusion example, where the diffusion coefficient is spatially varying. This model is closer to many physical situations, where the medium is not homogeneous and we have zones with differing diffusive properties. The system is !

 − a(x)u (x) − u (x) = q(x),

0 < x < 1,

u(0) = 0, u(1) = 0,

(2.15)

with the cost function J [a] =

1 2

 1 0

u(x) − u obs (x)

2

d x,

where u obs (x) denotes the observations on [0, 1] . We now introduce an alternative approach for deriving the gradient, based on the Lagrangian (or variational formulation). Let the cost function be  1  2   1 1 J ∗ [a, p] = p − au − u − q d x, u(x) − u obs (x) d x + 2 0 0 19 This

is not true when we use a discrete adjoint approach—see Section 2.3.9.

2.3. Adjoint methods

39

noting that the second integral is zero when u is a solution of (2.15) and that the adjoint variable, p, can be considered here to be a Lagrange multiplier function. We begin by taking the variation of J ∗ with respect to its variables, a and p: δJ ∗ =

 1 0

=0

1  1

      p −δa u − a δ u . u − u obs δ u d x+ δ p − au − u − q d x+ 0

0

Now the strategy is to “kill terms” by imposing suitable, well-chosen conditions on p. This is achieved by integrating by parts and then defining the adjoint equation and boundary conditions on p as follows: δJ ∗ =

 1 

0

1  δa u p d x (u − u obs ) + p − (a p ) δ u d x +

− p(δ u + u δa + aδ u ) + p aδ u 1 = δa u p d x,

1

0

0

0

where we have used the zero boundary conditions on δ u and assumed that the following adjoint system must be satisfied by p: !

 − a p + p = −(u − u obs ),

0 < x < 1,

p(0) = 0, p(1) = 0.

(2.16)

And, as before, based on the key result (2.7), we are left with an explicit expression for the gradient, ∇a(x) J ∗ = u p . Thus, with one solution of the direct system (2.15) plus one solution of the adjoint system (2.16), we recover the gradient of the cost function with respect to the soughtfor diffusion coefficient, a(x).

2.3.4 Initial condition control For DA problems in meteorology and oceanography, the objective is to reconstruct the initial conditions of the model. This is also the case in certain source identification problems for environmental pollution. We redo the above gradient calculations in this context. Let us consider the following system of (possibly nonlinear) ODEs: ⎧ ⎨ dX = M(X) in Ω × [0, T ] , dt ⎩ X(t = 0) = U, with the cost function J (U) =

1 2



T 0

(2.17)

HX − Yo 2 dt ,

where we have used the classical vector-matrix notation for systems of ODEs and · denotes the L2 -norm over the space variable. To compute the directional derivative,

40

Chapter 2. Optimal control and variational data assimilation

˜ we perturb the initial condition U by a quantity α in the direction u and denote by X the corresponding trajectory, satisfying ⎧ ˜ ⎨ dX = M(X) ˜ in Ω × [0, T ] , dt ⎩˜ X(t = 0) = U + αu.

(2.18)

We then have J (U + αu) − J (U) =

1 2

=

1 2

1 = 2 =

1 2



T



0 T



0 T



0 T



% %2 % ˜ % %HX − Yo % − HX − Yo 2 dt 

0

 ˜ − Y, HX ˜ − HX + HX − Y − (HX − Y, HX − Y) HX

 ˜ − Y, H(X ˜ − X) − (HX ˜ − Y − (HX − Y), HX − Y) HX  ˜ − Y, H(X ˜ − X) + (H(X ˜ − X), HX − Y). HX

Now, we set ˜ ˆ = lim X − X , X α→0 α and we compute the directional derivative, J (U + αu) − J (U) Jˆ[U] (u) = lim α→0 α T  1 ˆ + (HX, ˆ HX − Y) = HX − Y, HX 2 0 T ˆ HX − Y) = (HX, 0



=

0

T

ˆ HT (HX − Y)). (X,

(2.19)

˜ and X, we obtain By subtracting the equations (2.18) and (2.17) satisfied by X ⎧ ) 2 * ' ( ˜ ⎪ ⎨ d(X − X) = M(X) ˜ − X) + · · · , ˜ − MX = ∂ M (X ˜ − X) + 1 (X ˜ − X)T ∂ M (X dt ∂X 2 ∂ X2 ⎪ ⎩ ˜ (X − X)(t = 0) = αu. Now we divide by α and pass to the limit α → 0 to obtain ⎧ ( ˆ ' ⎪ ⎨ dX = ∂ M X, ˆ dt ∂X ⎪ ⎩ˆ X(t = 0) = u.

These equations are the TLM.

(2.20)

2.3. Adjoint methods

41

We will now proceed to compute the adjoint model. As in the ODE example of Sections 1.5.3.1 and 2.3.2, we multiply the TLM (2.20) by P and integrate by parts on [0, T ] . We find 

T 0

+

, T    ˆ dX ˆ P) T ˆ dP + (X, ,P = − X, 0 dt dt 0 T      dP ˆ ), P(T ) − X(0), ˆ ˆ + X(T P(0) =− X, dt 0 T    ˆ ), P(T ) − (u, P(0)) ˆ dP + X(T =− X, dt 0

and



T

0

'

(  T+ ' (T , ∂M ˆ ˆ ∂M P . X, P = X, ∂X ∂X 0

Thus, substituting in equation (2.20), we get 

T 0

+

, T+ ' (T ,   ˆ '∂ M( dX ˆ ), P(T ) −(u, P(0)) . ˆ − dP − ∂ M P + X(T − Xˆ , P = 0 = X, dt ∂X dt ∂X 0

Identifying with the directional derivative (2.19), we obtain the equations of the adjoint model ⎧ ' (T ⎨ dP + ∂ M P = HT (HX − Y), (2.21) dt ∂X ⎩ P(t = T ) = 0, which is a backward model, integrated from t = T back down to t = 0. We can now find the expression for the gradient. Using the adjoint model (2.21) in (2.19), we find  Jˆ[U] (u) = =

T

0 T 0

ˆ HT (HX − Y)) (X, + ' (T , ˆ dP + ∂ M P X, dt ∂X

= (−u, P(0)) . But, by definition, Jˆ[U] (u) = (∇JU , u), and thus ∇JU = −P(0). Once again, with a single (backward) integration of the adjoint model, we obtain a particularly simple expression for the gradient of the cost function with respect to the control parameter.

42

Chapter 2. Optimal control and variational data assimilation

2.3.5 Putting it all together: The case of a linear PDE The natural extension of the ODEs seen above is the initial boundary value problem known as the diffusion equation: ∂u − ∇ · (ν∇u) = 0, x ∈ (0, L), t > 0, ∂t u(x, 0) = u0 (x), u(0, t ) = 0, u(L, t ) = η(t ). This equation has multiple origins emanating from different physical situations. The most common application is particle diffusion, where u is a concentration and ν is a diffusion coefficient. Then there is heat diffusion, for which u is temperature and ν is thermal conductivity. The equation is also found in finance, being closely related to the Black–Scholes model. Another important application is population dynamics. These diverse application fields, and hence the diffusion equation, give rise to a number of inverse and DA problems. A variety of different controls can be applied to this system: • internal control: ν(x)—this is the parameter identification problem, also known as tomography; • initial control: ξ (x) = u0 (x)—this is a source detection inverse or DA problem; • boundary control: η(t ) = u(L, t )—this is the “classical” boundary control problem, also a parameter identification inverse problem. As above, we can define the cost function, TL 1 J [ν, ξ , η] = (u − u o )2 dx dt , LT 0 0 which is now a space-time multiple integral, and its related Lagrangian,  T L  T L 1 1 J∗ = (u − u o )2 dx dt + p [u t − (ν u x ) x ] dx dt . LT 0 0 LT 0 0 Now take the variation of J ∗ , δJ ∗ =

1 LT

 T 0

1 + LT

0

L

 T 0

2(u − u o )δ u dx dt +

0

1 LT

 T 0

L 0

=0



δ p [u t − (ν u x ) x ] dx dt

L

p [δ u t − (δν u x + νδ u x ) x ] dx dt ,

and perform integration by parts to obtain  T L L T 1 1 1 δJ ∗ = δν u x p x dx dt − p δ u| t =0 dx + p δη| x=L dt , (2.22) LT 0 0 LT 0 LT 0 where we have defined the adjoint equation as ∂p + ∇ · (ν∇u) = 2(u − u o ), ∂t p(0, t ) = 0, p(L, t ) = 0, p(x, T ) = 0.

x ∈ (0, L),

t > 0,

2.3. Adjoint methods

43

As before, this equation is of the same type as the original diffusion equation but must be solved backward in time. Finally, from (2.22), we can pick off each of the three desired terms of the gradient:  1 T u p dt , T 0 x x ∗ ∇ u|t =0 J = − p| t =0 , ∇ν(x) J ∗ =

∇ η|x=L J ∗ = p| x=L .

Once again, at the expense of a single (backward) solution of the adjoint equation, we obtain explicit expressions for the gradient of the cost function with respect to each of the three control variables. This is quite remarkable and completely avoids “brute force” or exhaustive minimization, though, as mentioned earlier, we only have the guarantee of finding a local minimum. However, if we have a good starting guess, which is usually obtained from historical or other “physical” knowledge of the system, we are sure to arrive at a good (or, at least, better) minimum.

2.3.6 An adjoint “zoo” As we have seen above, every (partial) differential operator has its very own adjoint form. We can thus derive, and categorize, adjoint equations for a whole variety of partial differential operators. Some common examples can be found in Table 2.1. Table 2.1. Adjoint forms for some common ordinary and partial differential operators.

Operator du dx

−γ

d2u d x2

∇ · (k∇u)

Adjoint dp −dx

d2 p

− γ d x2

∇ · (k∇ p) ∂p

∂2p

∂p

∂p

∂u ∂t

∂ 2u − c ∂ x2

− ∂ t − c ∂ x2

∂u ∂t

+c∂x

∂u

−∂t −c∂x

The principle is simple: all second-order (or even) derivatives remain unchanged, whereas all first-order (or uneven) derivatives undergo a change of sign.

2.3.7 Application: Burgers’ equation (a nonlinear PDE) We will now consider a more realistic application based on Burgers’ equation [Lax, 1973] with control of the initial condition and the boundary conditions. Burgers’ equation is a very good approximation to the Navier–Stokes equation in certain contexts where viscous effects dominate convective effects. The Navier–Stokes equation itself is the model equation used for all aerodynamic simulations and for many flow problems. In addition, it is the cornerstone of numerical weather prediction (NWP) codes.

44

Chapter 2. Optimal control and variational data assimilation

The viscous Burgers’ equation in the interval x ∈ [0, L] is defined as ∂u ∂ 2u ∂u +u −ν =f, ∂t ∂x ∂ x2 u(0, t ) = ψ1 (t ), u(L, t ) = ψ2 (t ), u(x, 0) = u0 (x). The control vector will be taken as a combination of the initial state and the two boundary conditions, (u0 , ψ1 , ψ2 ), and the cost function is given by the usual mismatch, J (u0 , ψ1 , ψ2 ) =

1 2



T

 L

0

0

u − u obs

2

dx dt .

We know that the derivative of J in the direction20 (h u , h1 , h2 ) is given (as above in (2.9)) by  T  L  Jˆ[u0 , ψ1 , ψ2 ] (h u , h1 , h2 ) = u − u obs uˆ dx dt , 0

0

where uˆ is defined, as usual, by u˜ − u α u(u0 + αh u , ψ1 + αh1 , ψ2 + αh2 ) − u(u0 ,ψ1 ,ψ2 ) = lim , α→0 α

uˆ = lim

α→0

which is the solution of the TLM ∂ 2 uˆ ∂ uˆ ∂ (u uˆ) + −ν = 0, ∂t ∂x ∂ x2 uˆ(0, t ) = h1 (t ), uˆ(L, t ) = h2 (t ), uˆ(x, 0) = h u (x). We can now compute the equation of the adjoint model. As before, we multiply the TLM by p and integrate by parts on [0, T ] . For clarity, we do this term by term: 

T 0



 TL ∂ uˆ ∂ uˆ , p dt = p dx dt ∂t 0 0 ∂ t L  L = [ uˆ p]T0 dx − 

=

0

0

0

T

∂p uˆ dx dt ∂t

L 0

( uˆ(T ) p(x, T ) − h u p(x, 0)) dx −

 L 0

T 0

∂p uˆ dx dt , ∂t

20 Instead of the δ notation, we have used another common form—the letter h—to denote the perturbation direction.

2.3. Adjoint methods





T

0

45

 TL ∂ (u uˆ) ∂ (u uˆ ) , p dx = p dx dt ∂x ∂x 0 0 T TL ∂p = dx dt [u uˆ p]0L dt − u uˆ ∂x 0 0 0 T  = (ψ2 h2 p(L, t ) − ψ1 h1 p(0, t )) dx − 0



T

T



L

u uˆ 0

0

∂p dx dt , ∂x



0

 TL 2 ∂ uˆ ∂ 2 uˆ , p dt = p dx dt 2 ∂ x2 0 0 ∂ x *L TL T) ∂ uˆ ∂ uˆ ∂ p p dx dt dt − = ∂x 0 0 0 0 ∂ x ∂ x *L TL T) ∂p ∂ 2p ∂ uˆ p − uˆ uˆ dt + dx dt = ∂ x ∂ x ∂ x2 0 0 0 0  T ∂p ∂p ∂ uˆ ∂ uˆ p(L, t ) (L, t ) − h2 (L, t ) − p(0, t ) (0, t ) + h1 (0, t ) dt = ∂x ∂x ∂x ∂x 0 TL ∂ 2p + uˆ dx dt . ∂ x2 0 0

The natural initial21 and boundary conditions for p are thus p(x, T ) = 0,

p(0, t ) = p(L, t ) = 0,

which give  ∂ 2 uˆ ∂ uˆ ∂ (u uˆ) p dx dt + −ν ∂t ∂x ∂ x2 0 0 TL  2  ∂p ∂ p ∂p = dx dt −u −ν uˆ − ∂t ∂x ∂ x2 0 0 L T ∂p ∂p + (L, t ) − ν h1 (0, t ) dt . −h u p(x, 0) dx + ν h2 ∂x ∂x 0 0 

T

 L

0=

In other words,  TL  L ∂p ∂ 2p ∂p dx dt = − −u −ν uˆ − h u p(x, 0) dx 2 ∂ t ∂ x ∂ x 0 0 0 T ∂p ∂p + (L, t ) − ν h1 (0, t ) dt . ν h2 ∂x ∂x 0 We thus define the adjoint model as ∂p ∂ 2p ∂p +u −ν = u − u obs , ∂t ∂x ∂ x2 p(0, t ) = 0, p(L, t ) = 0, p(x, T ) = 0. 21 This

is in fact a terminal condition, as we have encountered above.

46

Chapter 2. Optimal control and variational data assimilation

Now we can rewrite the gradient of J in the form  Jˆ[u0 , ψ1 , ψ2 ] (h u , h1 , h2 ) = − +

L

h u p(x, t = 0) dx

0



T 0

ν h2

∂p ∂p (x = L, t ) − ν h1 (x = 0, t ) dt , ∂x ∂x

which immediately yields ∇ u0 J = − p(x, t = 0),

∂p (x = 0, t ), ∂x ∂p ∇ψ2 J = ν (x = L, t ). ∂x

∇ψ1 J = −ν

These explicit gradients enable us to solve inverse problems for either (1) the initial condition, which is a data assimilation problem, or (2) the boundary conditions, which is an optimal boundary control problem, or (3) both. Another extension would be a parameter identification problem for ν. This would make an excellent project or advanced exercise.

2.3.8 Adjoint of finite-dimensional (matrix) operators Suppose now that we have a solution vector, x, of a discretized PDE, or of any other set of n equations. Assume that x depends as usual on a parameter vector, m, made up of p components—these are sometimes called control variables, design parameters, or decision parameters. If we want to optimize these values for a given cost function, J (x, m), we need to compute, as for the continuous case, the gradient, dJ /dm. As we have seen above, this should be possible with an adjoint method at a cost that is independent of p and comparable to the cost of a single solution for x. In the finite-dimensional case, this implies the inversion of a linear system, usually  (n 3 ) operations. This efficiency, especially for large values of p, is what makes the solution of the inverse problem tractable—if it were not for this, many problems would be simply impossible to solve within reasonable resource limits. We will first consider systems of linear algebraic equations, and then we can readily generalize to nonlinear systems of algebraic equations and to initial-value problems for linear systems of ODEs. 2.3.8.1 Linear systems

Let x be the solution of the (n × n) linear system Ax = b,

(2.23)

and suppose that x depends on the parameters m through A(m) and b(m). Define a cost function, J = J(x, m), that depends on m through x. To evaluate the gradient of J with respect to m directly, we need to compute by the chain rule
\[
\frac{dJ}{dm} = \frac{\partial J}{\partial m} + \frac{\partial J}{\partial x} \frac{\partial x}{\partial m} = J_m + J_x x_m,
\]

(2.24)


where J_m is a (1 × p) row vector, J_x is a (1 × n) row vector, and x_m is an (n × p) matrix. For a given function J the derivatives with respect to x and m are assumed to be easily computable. However, it is clearly much more difficult to differentiate x with respect to m. Let us try and do this directly. We can differentiate, term by term, equation (2.23) with respect to the parameter m_i and solve for x_{m_i} from (applying the chain rule)
\[
x_{m_i} = A^{-1} \left( b_{m_i} - A_{m_i} x \right).
\]
This must be done p times, rapidly becoming unfeasible for large n and p. Recall that p can be of the order of 10^6 in practical DA problems. The adjoint method, which reduces this to a single solve, relies on the trick of adding zero in an astute way. We can do this, as was done above in the continuous case, by introducing a Lagrange multiplier. Since the residual vector r(x, m) = Ax − b vanishes for the true solution x, we can replace the function J by the augmented function
\[
\hat J = J - \lambda^T r,
\]

(2.25)

where we are free to choose λ at our convenience, and we will use this liberty to make the difficult-to-compute term in (2.24), x_m, disappear. So let us take the expression for the gradient (2.24) and evaluate it at r = 0,
\[
\left. \frac{dJ}{dm} \right|_{r=0} = \left. \frac{d\hat J}{dm} \right|_{r=0}
= J_m - \lambda^T r_m + \left( J_x - \lambda^T r_x \right) x_m.
\]

(2.26)

Then, to “kill” the troublesome x_m term, we must require that \(J_x - \lambda^T r_x\) vanish, which implies \(r_x^T \lambda = J_x^T\). But r_x = A, and hence λ must satisfy the adjoint equation
\[
A^T \lambda = J_x^T,
\]

(2.27)

which is a single (n × n) linear system. Equation (2.27) has the same complexity as the original system (2.23), since the adjoint matrix A^T has the same condition number, sparsity, and preconditioner as A; i.e., if we have a numerical scheme (and hence a computer code) for solving the direct system, we will use precisely the same one for the adjoint. With λ now known, we can compute the gradient of J from (2.26) as follows:
\[
\left. \frac{dJ}{dm} \right|_{r=0} = J_m - \lambda^T r_m + 0 = J_m - \lambda^T \left( A_m x - b_m \right).
\]
Once again, we assume that when A(m) and b(m) are explicitly known, this permits an easy calculation of the derivatives with respect to m. If this is not the case, we must resort to automatic differentiation to compute these derivatives. The automatic differentiation approach will be presented below, after we have discussed nonlinear and initial-value problems.
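To make this recipe concrete, here is a minimal MATLAB sketch for a toy problem in which a single scalar parameter m enters through A(m) = A0 + m A1 and b(m) = b0 + m b1, and J(x) = ½ xᵀx so that J_x = xᵀ and J_m = 0; all data and names are illustrative assumptions, not the book's code.

    % Adjoint computation of dJ/dm for A(m)x = b(m), with J = 0.5*x'*x.
    n  = 5;
    A0 = eye(n); A1 = diag(1:n);          % assumed parameter dependence
    b0 = ones(n,1); b1 = zeros(n,1);
    m  = 0.1;
    A  = A0 + m*A1;  b = b0 + m*b1;

    x      = A \ b;                       % one direct solve
    lambda = A' \ x;                      % one adjoint solve: A'*lambda = J_x'
    dJdm   = -lambda' * (A1*x - b1);      % J_m - lambda'*(A_m*x - b_m), J_m = 0

    % Sanity check by a divided difference (noisy, but adequate here):
    dm = 1e-6;
    xp = (A0 + (m+dm)*A1) \ (b0 + (m+dm)*b1);
    disp([dJdm, (0.5*(xp'*xp) - 0.5*(x'*x))/dm])

Whatever the number of parameters p, the cost remains one direct and one adjoint solve; only the final inner products are repeated p times.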


2.3.8.2 Nonlinear systems

In general, the state vector x will satisfy a nonlinear functional equation of the general form f(x, m) = 0. In this case the workflow is similar to the linear system. We start by solving for x with an iterative Newton-type algorithm, for example. Now define the augmented Ĵ as in (2.25), with residual r = f; take the gradient as in (2.26); require that \(r_x^T \lambda = J_x^T\); and finally compute the gradient
\[
\left. \frac{dJ}{dm} \right|_{r=0} = J_m - \lambda^T r_m. \qquad (2.28)
\]
There is, of course, a slight modification needed: the adjoint matrix is no longer simply A^T as in (2.27) but rather the transposed Jacobian \(r_x^T = f_x^T\), obtained by analytical (or automatic) differentiation of f with respect to x.

2.3.8.3 Initial-value problems

We have, of course, seen this case in quite some detail above. Here we will reformulate it in matrix–vector form. We consider an initial-value problem for a linear, time-independent, homogeneous system of ODEs, ẋ = Bx, with x(0) = b. We know that the solution is given by \(x(t) = e^{Bt} b\), but this can be rewritten as a linear system, Ax = b, where \(A = e^{-Bt}\). Now we can simply use our results from above. Suppose we want to minimize J(x, m) based on the solution, x, at time, t. As before, we can compute the adjoint vector, λ, using (2.27),
\[
e^{-B^T t} \lambda = J_x^T,
\]

but this is equivalent to the adjoint ODE, \(\dot\lambda = B^T \lambda\), with \(\lambda(0) = J_x^T\). This is exactly what we would expect: solving for the adjoint state vector, λ, is a problem of the same complexity and type as that of finding the state vector, x. Clearly we are not obliged to use matrix exponentials for the solution, but we can choose among Runge–Kutta formulas, forward Euler, Crank–Nicolson, etc. [Quarteroni et al., 2007]. What about the important issue of stability? The matrices B and B^T have the same eigenvalues, and thus the stability of one implies the stability of the other. Finally, using (2.28), we obtain the gradient of the cost function in the time-dependent case,
\[
\frac{dJ}{dm} = J_m - \lambda^T \left( A_m x - b_m \right)
= J_m + \int_0^t \lambda^T(t - t') \, B_m \, x(t') \, dt' + \lambda^T b_m,
\]


where we have differentiated the expression for A. We observe that this computation of the gradient via the adjoint requires that we save in memory x(t') for all times 0 ≤ t' ≤ t to be able to compute the gradient. This is a well-known issue in adjoint approaches for time-dependent problems and can be dealt with in three ways (that are problem or, more precisely, dimension dependent; a small sketch of the first strategy follows the list):

1. Store everything in memory, if feasible.
2. If not, use some kind of checkpointing [Griewank and Walther, 2000], which means that we divide the time interval into a number of subintervals and store consecutively subinterval by subinterval.
3. Re-solve “simultaneously” forward and adjoint, and at the same time compute the integral; i.e., at each time step of the adjoint solution process, recompute the direct solution up to this time.
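As an illustration of the first strategy, here is a minimal MATLAB sketch of a discrete forward/adjoint pair for the explicit scheme x_{k+1} = (I + Δt B(m)) x_k, with the assumed one-parameter dependence B(m) = B0 + m B1 and cost J = ½‖x_N‖²; all data are illustrative, not from the book.

    % Forward sweep (all states stored), then backward adjoint sweep.
    n = 3; N = 50; dt = 0.01; m = 0.5;
    B0 = [-1 0.2 0; 0 -0.5 0.1; 0 0 -2];  % assumed matrices
    B1 = eye(n);
    M  = eye(n) + dt*(B0 + m*B1);
    X  = zeros(n, N+1); X(:,1) = [1; 0; -1];
    for k = 1:N, X(:,k+1) = M*X(:,k); end % strategy 1: store everything

    lam  = X(:,N+1);                      % terminal adjoint = J_x' at final time
    dJdm = 0;
    for k = N:-1:1                        % march the adjoint backward in time
        dJdm = dJdm + lam' * (dt*B1) * X(:,k);
        lam  = M' * lam;                  % propagate adjoint with the transpose
    end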

2.3.9 Continuous and discrete adjoints

In the previous section, we saw how to deal with finite-dimensional systems. This leads us naturally to the study of discrete adjoints, which can be computed by automatic differentiation, as opposed to analytical methods, where we took variations and used integration by parts. In the following discussion, the aim is not to show exactly how to write an adjoint code generator but to provide an understanding of the principles. Armed with this knowledge, the reader will be able to critically analyze (if needed) the eventual reasons for failure of the approach when applied to a real problem. An excellent reference is Hascoet [2012]—see also Griewank [2000]. To fix ideas, let us consider a second-order PDE of the general form (without loss of generality) \(F(t, u, u_t, u_x, u_{xx}, \theta) = 0\) and an objective function, J(u, θ), that depends on the unknown u and the parameters θ. As usual, we are interested in calculating the gradients of the cost function with respect to the parameters to find an optimal set of parameter values—usually one that attains the least-squares difference between simulated model predictions and real observations/measurements. There are, in fact, two possible approaches for computing an adjoint state and the resulting gradients or sensitivities:

• discretization of the (analytical) adjoint, which we denote by AtD = Adjoint then Discretize (we have amply seen this above);
• adjoint of the discretization (the code), which we denote as DtA = Discretize then Adjoint.

The first is the continuous case, where we differentiate the PDE with respect to the parameters and then discretize the adjoint PDE to compute the approximate gradients. In the second, called the discrete approach, we first approximate the PDE by a discrete (linear or nonlinear) system and then differentiate the resulting discrete system with respect to the parameters. This is done by automatic differentiation of the code that solves the PDE, using tools such as TAPENADE, YAO, OpenAD, ADIFOR, ADMat, etc. Note that numerical computation of gradients can be


achieved by two other means: divided/finite differences or symbolic differentiation.^{22} The first is notoriously unstable, and the latter cannot deal with complex functionals. For these reasons, the adjoint method is largely preferable. In “simpler” problems, AtD is preferable,^{23} but this assumes that we are able to calculate analytically the adjoint equation by integration by parts and that we can find compatible boundary conditions for the adjoint variable—see, for example, Bocquet [2012a]. This was largely developed above. In more realistic, complex cases, we must often resort to DtA, but then we may be confronted with serious difficulties each time the code is modified, since this implies the need to regenerate the adjoint. DtA is, however, well suited for a nonexpert who does not need to have a profound understanding of the simulation codes to compute gradients. The DtA approach works for any cost functional, and no explicit boundary conditions are needed. However, DtA may turn out to be inconsistent with the adjoint PDE if a nonlinear, high-resolution scheme (such as upwinding) is used—a comparison of the two approaches can be found in Li and Petzold [2004], where the important question of consistency is studied and a simple example of the 1D heat equation is also presented.

22 By packages such as Maple, Mathematica, SAGE, etc.
23 Though there are differences of opinion among practitioners who prefer the discrete adjoint for these cases as well. Thus, the final choice depends on one's personal experience and competence.

2.4 Variational DA

2.4.1 Introduction

2.4.1.1 History

Variational DA was formally introduced by the meteorological community for solving the problem of numerical weather prediction (NWP). In 1922, Lewis Fry Richardson published the first attempt at forecasting the weather numerically. But large errors were observed that were caused by inaccuracies in the fields used as the initial conditions in his analysis [Lynch, 2008], thus indicating the need for a DA scheme. Originally, subjective analysis was used to correct the simulation results. In this approach, NWP forecasts were adjusted manually by meteorologists using their operational expertise and experience. Then objective analysis (e.g., Cressman's successive correction algorithm), which fitted data to grids, was introduced for automated DA. These objective methods used simple interpolation approaches (e.g., a quadratic polynomial interpolation scheme based on least-squares regression) and thus were 3D DA methods. Later, 4D DA methods, called nudging, were developed. These are based on the simple idea of Newtonian relaxation and introduce into the right-hand side of the model dynamical equations a term that is proportional to the difference of the calculated meteorological variable and the observed value. This term has a negative sign and thus keeps the calculated state vector closer to the observations. Nudging can be interpreted as a variant of the Kalman filter (KF) with the gain matrix prescribed, rather than obtained from covariances. Various nudging algorithms are described in Chapter 4. A major development was achieved by L. Gandin [1963], who introduced the statistical interpolation (or optimal interpolation) method, which developed earlier ideas of Kolmogorov. This is a 3D DA method and is a type of regression analysis that utilizes information about the spatial distributions of covariance functions of the errors


of the first guess field (previous forecast) and true field. The optimal interpolation algorithm is the reduced version of the KF algorithm in which the covariance matrices are not calculated from the dynamical equations but are predetermined. This is treated in Chapter 3. Attempts to introduce KF algorithms as a 4D DA tool for NWP models came later. However, this was (and remains) a difficult task due to the very high dimensions of the computational grid and the underlying matrices. To overcome this difficulty, approximate or suboptimal KFs were developed. These include the ensemble Kalman filter (EnKF) and the reduced-rank Kalman filters (such as RRSQRT)—see Chapters 3, 5, and 6. Another significant advance in the development of the 4D DA methods was the use of optimal control theory, also known as the variational approach. In seminal work based on earlier results of G. Marchuk, Le Dimet and Talagrand [1986] were the first to apply the theory of Lions [1988] (see also Tröltzsch [2010]) to environmental modeling. The significant advantage of the variational approach is that the meteorological fields satisfy the dynamical equations of the NWP model, and at the same time they minimize the functional characterizing the difference between simulations and observations. Thus, a problem of constrained minimization is solved, as has been amply shown above in this chapter. As has been shown by Lorenc [2003], Talagrand [2012], and others, all the abovementioned 4D DA methods are in some limit equivalent. Under certain assumptions they minimize the same cost function. However, in practical applications these assumptions are never fulfilled and the different methods perform differently. This raises the still disputed question: Which approach, Kalman filtering or variational assimilation, is better? Further fundamental questions arise in the application of advanced DA techniques. A major issue is that of the convergence of the numerical method to the global minimum of the functional to be minimized—please refer to the important discussions in the first two sections of this chapter. The 4D DA method that is currently most successful is hybrid incremental 4D-Var (see below and Chapters 5 and 7), where an ensemble is used to augment the climatological background error covariances at the start of the DA time window, but the background error covariances are evolved during the time window by a simplified version of the NWP forecast model. This DA method is used operationally at major forecast centers, though there is currently a tendency to move toward the more efficient ensemble variational (EnVar) methods that will be described in Chapter 7.

2.4.1.2 Formulation

In variational DA we describe the state of the system by a state variable, \(x(t) \in \mathcal{X}\), a function of space and time that represents the physical variables of interest, such as current velocity (in oceanography), temperature, sea-surface height, salinity, biological species concentration, or chemical concentration. The evolution of the state is described by a system of (in general nonlinear) differential equations in a region Ω,
\[
\begin{cases}
\dfrac{dx}{dt} = \mathcal{M}(x) & \text{in } \Omega \times [0, T], \\
x(t=0) = x_0,
\end{cases}
\]

(2.29)

where the initial condition is unknown (or inaccurately known). Suppose that we are in possession of observations \(y(t) \in \mathcal{O}\) and an observation operator \(\mathcal{H}\) that describes


the available observations. Then, to characterize the difference between the observations and the state, we define the objective (or cost) function
\[
J(x_0) = \frac{1}{2} \int_0^T \left\| y(t) - \mathcal{H}(x(x_0, t)) \right\|_{\mathcal{O}}^2 \, dt
+ \frac{1}{2} \left\| x_0 - x^b \right\|_{\mathcal{X}}^2,
\]

(2.30)

where x^b is the background (or first guess) and the second term plays the role of a regularization (in the sense of Tikhonov—see Vogel [2002] and Hansen [2010]). The two norms, in the finite-dimensional case, will be represented by the error covariance matrices R and B, respectively—see Chapter 1 and Section 2.4.3 below. Note that, for mathematical rigor, we have indicated the relevant functional spaces on which the norms are defined. In the continuous context, the DA problem is formulated as follows: find the analyzed state, \(x_0^a\), that minimizes J and satisfies
\[
x_0^a = \operatorname{argmin} J(x_0).
\]
As seen above, the necessary condition for the existence of a (local) minimum is \(\nabla J(x_0^a) = 0\).

2.4.2 Adjoint methods in DA

To solve the above minimization problem for variational DA, we will use the adjoint approach. In summary, the adjoint method for DA is an iterative scheme that involves searching for the minimum of a scalar cost function with respect to a multidimensional initial state. The search algorithm is called a descent method and requires the derivative of the cost function with respect to arbitrary perturbations of the initial state. This derivative, or gradient, is obtained by running an adjoint model backward in time. Once the derivative is obtained, a direction that leads to lower cost has been identified, but the step size has not. Therefore, further calculations are needed to determine how far along this direction one needs to go to find a lower cost. Once this initial state is found, the next iteration is started. The algorithm proceeds until the minimum of the cost function is found. It should be noted that the adjoint method is used in 4D-Var to find the initial conditions that minimize a cost function. However, one could equally well have chosen to find the boundary conditions, or model parameters, as was done in the numerous examples presented in Section 2.3. We point out that a truly unified derivation of variational DA should start from a probabilistic/statistical model. Then, as was mentioned above (see Section 1.5), we can obtain the 3D-Var model as a special case. We will return to this in Chapter 3. Here, as opposed to the presentation in Chapter 1, we will give a unified treatment of 3D- and 4D-Var that leads naturally to variants of the approach.

2.4.3 3D-Var and 4D-Var: A unified framework

The 3D-Var and 4D-Var approaches were introduced in Chapter 1. Here we will recall the essential points of the formulation, present them in a unified fashion (after Talagrand [2012]), and expand on some concrete aspects and variants. Unlike sequential/statistical assimilation (which emanates from estimation theory), we saw that variational assimilation is based on optimal control theory, itself


derived from the calculus of variations. The analyzed state was defined as the one that minimizes a cost function. The minimization requires numerical optimization techniques. These techniques can rely on the gradient of the cost function, and this gradient will be obtained with the aid of adjoint methods, which we have amply discussed above. Note that variational DA is a particular usage of the adjoint approach. Usually, 3D-Var and 4D-Var are introduced in a finite-dimensional or discrete context—this approach will be used in this section. For the infinite-dimensional or continuous case, we must use the calculus of variations and PDEs, as was done in the previous sections of this chapter. We start out with the following cost function:
\[
J(x) = \frac12 \left( x - x^b \right)^T B^{-1} \left( x - x^b \right) + \frac12 \left( Hx - y \right)^T R^{-1} \left( Hx - y \right),
\]

(2.31)

where, as was defined in the notation of Section 1.5.1, x, x^b, and y are the state, the background state, and the measured state, respectively; H is the observation matrix (a linearization of the observation operator \(\mathcal{H}\)); and R and B are the observation and background error covariance matrices, respectively. This quadratic function attempts to strike a balance between some a priori knowledge about a background (or historical) state and the actual measured, or observed, state. It also assumes that we know and can invert the matrices R and B—this, as will be pointed out below, is far from obvious. Furthermore, it represents the sum of the (weighted) background deviations and the (weighted) observation deviations.

2.4.3.1 The stationary case: 3D-Var

We note that when the background, \(x^b = x^t + \varepsilon^b\), is available at some time t_k, together with observations of the form \(y = Hx^t + \varepsilon^o\) that have been acquired at the same time (or over a short enough interval of time when the dynamics can be considered stationary), then the minimization of (2.31) will produce an estimate of the system state at time t_k. In this case, the analysis is called three-dimensional variational analysis and is abbreviated as 3D-Var. We have seen above, in Section 1.5.2, that the best linear unbiased estimator (BLUE) requires the computation of an optimal gain matrix. We will show (in Chapter 3) that the optimal gain takes the form
\[
K = BH^T \left( HBH^T + R \right)^{-1},
\]
where B and R are the covariance matrices, to obtain an analyzed state,
\[
x^a = x^b + K\left( y - H(x^b) \right).
\]
But this is precisely the state that minimizes the 3D-Var cost function. This is quite easily verified by taking the gradient, term by term, of the cost function (2.31) and equating to zero,
\[
\nabla J(x^a) = B^{-1} \left( x^a - x^b \right) - H^T R^{-1} \left( y - Hx^a \right) = 0,
\]
where x^a = argmin J(x).

(2.32)


Solving the equation, we find
\[
\begin{aligned}
B^{-1}\left( x^a - x^b \right) &= H^T R^{-1} \left( y - Hx^a \right), \\
\left( B^{-1} + H^T R^{-1} H \right) x^a &= H^T R^{-1} y + B^{-1} x^b, \\
x^a &= \left( B^{-1} + H^T R^{-1} H \right)^{-1} \left( H^T R^{-1} y + B^{-1} x^b \right) \\
&= \left( B^{-1} + H^T R^{-1} H \right)^{-1} \left[ \left( B^{-1} + H^T R^{-1} H \right) x^b - H^T R^{-1} H x^b + H^T R^{-1} y \right] \\
&= x^b + \left( B^{-1} + H^T R^{-1} H \right)^{-1} H^T R^{-1} \left( y - Hx^b \right) \\
&= x^b + K \left( y - Hx^b \right),
\end{aligned}
\]

(2.33)

 where we have simply added and subtracted the term HT R−1 H xb in the third-to-last line, and in the last line we have brought out what are known as the innovation term, d = y − Hxb , and the gain matrix,

\[
K = \left( B^{-1} + H^T R^{-1} H \right)^{-1} H^T R^{-1}.
\]

This matrix can be rewritten as

\[
K = BH^T \left( R + HBH^T \right)^{-1}
\]

(2.34)

using the well-known Sherman–Morrison–Woodbury formula of linear algebra [Golub and van Loan, 2013], which completely avoids the direct computation of the inverse of the matrix B. The linear combination in (2.33) of a background term plus a multiple of the innovation is a classical result of linear-quadratic control theory [Friedland, 1986; Gelb, 1974; Kwakernaak and Sivan, 1972] and shows how nicely DA fits in with and corresponds to (optimal) control theory. The form of the gain matrix (2.34) can be explained quite simply. The term HBH^T is the background covariance transformed to the observation space. The denominator term, R + HBH^T, expresses the sum of observation and background covariances. The numerator term, BH^T, takes the ratio of B and R + HBH^T back to the model space. This recalls (and is completely analogous to) the variance ratio,
\[
\frac{\sigma_b^2}{\sigma_b^2 + \sigma_o^2},
\]
that appears in the optimal BLUE (see Chapter 1 and Chapter 3) solution. This is the case for a single observation, y, of a quantity, x,
\[
x^a = x^b + \frac{\sigma_b^2}{\sigma_b^2 + \sigma_o^2} \left( x^o - x^b \right)
= x^b + \frac{1}{1 + \alpha} \left( x^o - x^b \right),
\]
where
\[
\alpha = \frac{\sigma_o^2}{\sigma_b^2}.
\]


In other words, the best way to estimate the state is to take a weighted average of the background (or prior) and the observations of the state. And the best weight is the ratio of the mean squared errors (variances). This statistical viewpoint is thus perfectly reproduced in the 3D-Var framework.
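Numerically, the direct (3D-Var/BLUE) analysis is a few lines of MATLAB; the dimensions, covariances, and data below are illustrative assumptions only.

    % Direct BLUE/3D-Var analysis via the gain matrix (2.34).
    n = 4; p = 2;
    B  = 0.5*eye(n);  R = 0.1*eye(p);     % assumed covariances
    H  = [1 0 0 0; 0 0 1 0];              % observe components 1 and 3
    xb = [1; 2; 3; 4];  y = [1.2; 2.7];
    K  = (B*H') / (R + H*B*H');           % gain; "/" solves the linear system
    xa = xb + K*(y - H*xb);               % analysis = background + gain*innovation

Note that (2.34) requires factoring only the small p × p matrix R + HBH^T, never B^{-1}, which is what makes this form attractive when observations are few.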

2.4.3.2 The nonstationary case: 4D-Var

A more realistic, but complicated, situation arises when one wants to assimilate observations that are acquired over a time interval during which the system dynamics (flow, for example) cannot be neglected. Suppose that the measurements are available at a succession of instants, t_k, k = 0, 1, …, K, and are of the form
\[
y_k = H_k x_k + \varepsilon_k^o,
\]

(2.35)

where H_k is a linear observation operator and \(\varepsilon_k^o\) is the observation error with covariance matrix R_k, and suppose that these observation errors are uncorrelated in time. Now we add the dynamics described by the state equation,
\[
x_{k+1} = M_{k+1} x_k,
\]

(2.36)

where we have neglected any model error.^{24} We suppose also that at time index k = 0 we know the background state, \(x_0^b\), and its error covariance matrix, \(P_0^b\), and we suppose that the errors are uncorrelated with the observations in (2.35). Then a given initial condition, x_0, defines a unique model solution, x_{k+1}, according to (2.36). We can now generalize the objective function (2.31), which becomes

\[
J(x_0) = \frac12 \left( x_0 - x_0^b \right)^T \left( P_0^b \right)^{-1} \left( x_0 - x_0^b \right)
+ \frac12 \sum_{k=0}^{K} \left( H_k x_k - y_k \right)^T R_k^{-1} \left( H_k x_k - y_k \right). \qquad (2.37)
\]

The minimization of J(x_0) will provide the initial condition of the model that fits the data most closely. This analysis is called strong constraint four-dimensional variational assimilation, abbreviated as strong constraint 4D-Var. The term strong constraint implies that the model dynamics given by the state equation (2.36) must be exactly satisfied by the sequence of estimated state vectors. In the presence of model uncertainty, the state equation becomes
\[
x_{k+1}^t = M_{k+1} x_k^t + \eta_{k+1},
\]

(2.38)

where the model noise has covariance matrix Q_k, which we suppose to be uncorrelated in time and uncorrelated with the background and observation errors. The objective function for the BLUE for the sequence of states {x_k, k = 0, 1, …, K} is of the form
\[
\begin{aligned}
J(x_0, x_1, \ldots, x_K) ={}& \frac12 \left( x_0 - x_0^b \right)^T \left( P_0^b \right)^{-1} \left( x_0 - x_0^b \right) \\
&+ \frac12 \sum_{k=0}^{K} \left( H_k x_k - y_k \right)^T R_k^{-1} \left( H_k x_k - y_k \right) \\
&+ \frac12 \sum_{k=0}^{K-1} \left( x_{k+1} - M_{k+1} x_k \right)^T Q_{k+1}^{-1} \left( x_{k+1} - M_{k+1} x_k \right).
\end{aligned}
\]

24 This will be taken into account in Section 2.4.7.5.

(2.39)


This objective function has become a function of the complete sequence of states {x_k, k = 0, 1, …, K}, and its minimization is known as weak constraint four-dimensional variational assimilation, abbreviated as weak constraint 4D-Var. Equations (2.37) and (2.39), with an appropriate reformulation of the state and observation spaces, are special cases of the BLUE objective function—see Talagrand [2012]. All the above forms of variational assimilation, as defined by (2.31), (2.37), and (2.39), have been used for real-world DA, in particular in meteorology and oceanography. However, these methods are directly applicable to a vast array of other domains, among which we can cite geophysics and environmental sciences, seismology, atmospheric chemistry, and terrestrial magnetism. Examples of all these can be found in the applications chapters of Part III. We remark that in real-world practice, variational assimilation is performed on nonlinear models. If the extent of nonlinearity is sufficiently small (in some sense), then variational assimilation, even if it does not solve the correct estimation problem, will still produce useful results. Some remarks concerning implementation: our problem now reduces to quantifying the covariance matrices and then, of course, computing the analyzed state. The quantification of the covariance matrices must result from extensive data studies (or the use of a KF approach—see Chapter 3). The computation of the analyzed state will be described in the next subsection—this will not be done directly, but rather by an adjoint approach for minimizing the cost functions. There is of course the inverse of B or P^b to compute, but we remark that only matrix–vector products of \(B^{-1}\) and \(\left(P^b\right)^{-1}\) appear, and we can thus define operators (or routines) that compute these efficiently without the need for large storage capacities.

2.4.3.3 The adjoint approach

We explain the adjoint approach in the case of strong constraint 4D-Var, taking into account a completely general nonlinear setting for the model and for the observation operators. Let M_k and H_k be the nonlinear model and observation operators, respectively. We reformulate (2.36) and (2.37) in terms of the nonlinear operators as
\[
J(x_0) = \frac12 \left( x_0 - x_0^b \right)^T \left( P_0^b \right)^{-1} \left( x_0 - x_0^b \right)
+ \frac12 \sum_{k=0}^{K} \left( H_k(x_k) - y_k \right)^T R_k^{-1} \left( H_k(x_k) - y_k \right),
\]

(2.40)

with the dynamics
\[
x_{k+1} = M_{k+1}(x_k), \qquad k = 0, 1, \ldots, K-1.
\]

(2.41)

The minimization problem requires that we now compute the gradient of J with respect to x_0. The gradient is determined from the property that for a given perturbation δx_0 of x_0, the corresponding first-order variation of J is
\[
\delta J = \left( \nabla_{x_0} J \right)^T \delta x_0. \qquad (2.42)
\]
The perturbation is propagated by the tangent linear equation,
\[
\delta x_{k+1} = \mathbf{M}_{k+1} \, \delta x_k, \qquad k = 0, 1, \ldots, K-1,
\]

(2.43)


obtained by differentiation of the state equation (2.41), where \(\mathbf{M}_{k+1}\) is the Jacobian matrix (of first-order partial derivatives) of x_{k+1} with respect to x_k. The first-order variation of the cost function is obtained similarly by differentiation of (2.40),
\[
\delta J = \left( x_0 - x_0^b \right)^T \left( P_0^b \right)^{-1} \delta x_0
+ \sum_{k=0}^{K} \left( H_k(x_k) - y_k \right)^T R_k^{-1} \mathbf{H}_k \, \delta x_k, \qquad (2.44)
\]

where \(\mathbf{H}_k\) is the Jacobian of H_k and δx_k is defined by (2.43). This variation is a compound function of δx_0 that depends on all the δx_k's. But if we can obtain a direct dependence on δx_0 in the form of (2.42), eliminating the explicit dependence on δx_k, then we will (as in the previous sections of this chapter) arrive at an explicit expression for the gradient, ∇_{x_0} J, of our cost function, J. This will be done, as we have done before, by introducing an adjoint state and requiring that it satisfy certain conditions—namely, the adjoint equation. Let us now proceed with this program. We begin by defining, for k = 0, 1, …, K, the adjoint state vectors p_k that belong to the dual of the state space. Now we take the null products (according to the tangent state equation (2.43)), \(p_k^T \left( \delta x_k - \mathbf{M}_k \, \delta x_{k-1} \right)\), and subtract them from the right-hand side of the cost function variation (2.44),
\[
\delta J = \left( x_0 - x_0^b \right)^T \left( P_0^b \right)^{-1} \delta x_0
+ \sum_{k=0}^{K} \left( H_k(x_k) - y_k \right)^T R_k^{-1} \mathbf{H}_k \, \delta x_k
- \sum_{k=1}^{K} p_k^T \left( \delta x_k - \mathbf{M}_k \, \delta x_{k-1} \right).
\]

Rearranging the matrix products, using the symmetry of R_k, and regrouping terms in δx, we obtain
\[
\begin{aligned}
\delta J ={}& \left[ \left( P_0^b \right)^{-1} \left( x_0 - x_0^b \right) + \mathbf{H}_0^T R_0^{-1} \left( H_0(x_0) - y_0 \right) + \mathbf{M}_1^T p_1 \right]^T \delta x_0 \\
&+ \sum_{k=1}^{K-1} \left[ \mathbf{H}_k^T R_k^{-1} \left( H_k(x_k) - y_k \right) - p_k + \mathbf{M}_{k+1}^T p_{k+1} \right]^T \delta x_k \\
&+ \left[ \mathbf{H}_K^T R_K^{-1} \left( H_K(x_K) - y_K \right) - p_K \right]^T \delta x_K.
\end{aligned}
\]
Notice that this expression is valid for any choice of the adjoint states, p_k, and, in order to “kill” all δx_k terms, except δx_0, we must simply impose that
\[
p_K = \mathbf{H}_K^T R_K^{-1} \left( H_K(x_K) - y_K \right), \qquad (2.45)
\]
\[
p_k = \mathbf{H}_k^T R_k^{-1} \left( H_k(x_k) - y_k \right) + \mathbf{M}_{k+1}^T p_{k+1}, \qquad k = K-1, \ldots, 1, \qquad (2.46)
\]
\[
p_0 = \left( P_0^b \right)^{-1} \left( x_0 - x_0^b \right) + \mathbf{H}_0^T R_0^{-1} \left( H_0(x_0) - y_0 \right) + \mathbf{M}_1^T p_1. \qquad (2.47)
\]

We recognize the backward adjoint equation for p_k, and the only term remaining in the variation of J is then
\[
\delta J = p_0^T \, \delta x_0,
\]
so that p_0 is the sought-for gradient, ∇_{x_0} J, of the objective function with respect to the initial condition, x_0, according to (2.42). The system of equations (2.45)–(2.47) is


the adjoint of the tangent linear equation (2.43). The term adjoint here corresponds to the transposes of the matrices \(\mathbf{H}_k^T\) and \(\mathbf{M}_k^T\) that, as we have seen before, are the finite-dimensional analogues of an adjoint operator. We can now propose the “usual” algorithm for solving the optimization problem by the adjoint approach:

1. For a given initial condition, x_0, integrate forward the (nonlinear) state equation (2.41) and store the solutions, x_k (or use some sort of checkpointing).
2. From the final condition, (2.45), integrate backward in time the adjoint equations (2.46).
3. Compute directly the required gradient (2.47).
4. Use this gradient in an iterative optimization algorithm to find a (local) minimum.

The above description for the solution of the 4D-Var DA problem clearly covers the case of 3D-Var, where we seek to minimize (2.31). In this case, we need only the transpose Jacobian H^T of the observation operator.

Algorithm 2.1 Iterative 3D-Var algorithm.
    k = 0, x = x_0
    while ‖∇J‖ > ε and k ≤ k_max
        compute J with (2.31)
        compute ∇J with (2.32)
        gradient descent and update of x_{k+1}
        k = k + 1
    end
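For small problems, Algorithm 2.1 can be written directly in MATLAB with a fixed-step steepest descent (in practice a quasi-Newton routine would replace the inner update); all data, the tolerance, and the step size below are illustrative assumptions.

    % Iterative 3D-Var (Algorithm 2.1) by steepest descent on (2.31).
    n = 4; p = 2;
    B  = 0.5*eye(n);  R = 0.1*eye(p);
    H  = [1 0 0 0; 0 0 1 0];
    xb = [1; 2; 3; 4];  y = [1.2; 2.7];
    x = xb; tol = 1e-8; kmax = 500; step = 0.05;
    for k = 1:kmax
        gradJ = B\(x - xb) - H'*(R\(y - H*x));   % gradient (2.32)
        if norm(gradJ) <= tol, break; end
        x = x - step*gradJ;                      % descent update
    end
    xa = x;   % should agree with xb + K*(y - H*xb) from the direct formula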

2.4.4 The 3D-Var algorithm

The matrices involved in the calculation of equation (2.33) are often neither storable in memory nor manipulable because of their very large dimensions, which can be as much as 10^6 or more. Thus, the direct calculation of the gain matrix, K, is unfeasible. The 3D-Var variational method overcomes these difficulties by attempting to iteratively minimize the cost function, J. This minimization can be achieved, for inverse problems in general, by a combination of an adjoint approach for the computation of the gradient with a descent algorithm in the direction of the gradient. For DA problems where there is no time dependence, the adjoint operation requires only a matrix adjoint (and not the solution of an adjoint equation^{25}), and the approach is called 3D-Var, whereas for time-dependent problems we will use the 4D-Var approach, which is presented in the next subsection. The iterative 3D-Var Algorithm 2.1 is a classical case of an optimization algorithm [Nocedal and Wright, 2006] that uses as a stopping criterion the fact that ‖∇J‖ is small or that the maximum number of iterations, k_max, is reached. For the gradient descent, there is a wide choice of algorithmic approaches, but quasi-Newton methods [Nocedal and Wright, 2006; Quarteroni et al., 2007] are generally used and recommended.

25 This may not be valid for complicated observation operators.


2.4.4.1 On the roles of R and B

The relative magnitudes of the errors due to measurement and background provide us with important information as to how much “weight” to give to the different information sources when solving the assimilation problem. For example, if background errors are larger than observation errors, then the analyzed state solution to the DA problem should be closer to the observations than to the background and vice versa. The background error covariance matrix, B, plays an important role in DA. This is illustrated by the following example.

Example 2.5. Effect of a single observation. Suppose that we have a single observation at a point corresponding to the j-th element of the state vector. The observation operator is then
\[
H = \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{pmatrix}.
\]
The gradient of J is
\[
\nabla J = B^{-1} \left( x - x^b \right) + H^T R^{-1} \left( Hx - y^o \right).
\]
Since it must be equal to zero at the minimum x^a,
\[
x^a - x^b = BH^T R^{-1} \left( y^o - Hx^a \right).
\]
But R = σ²; \(Hx^a = x_j^a\); and BH^T is the j-th column of B, whose elements are denoted by B_{i,j} with i = 1, …, n. So we see that
\[
x^a - x^b = \frac{y^o - x_j^a}{\sigma^2}
\begin{pmatrix} B_{1,j} \\ B_{2,j} \\ \vdots \\ B_{n,j} \end{pmatrix}.
\]

The increment is proportional to a column of B. The choice of B is thus crucial and will determine how this observation provides information about what happens around the j th variable. In the 4D-Var case, the increment at time t will be proportional to a single column of MBMT , which describes the error covariances of the background at the time, t , of the observation.
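This effect is easy to visualize with a short MATLAB sketch; the Gaussian-shaped B below is an assumed correlation model chosen purely for illustration.

    % Single observation of variable j: the increment is a scaled column of B.
    n = 20; j = 10; sigma2 = 0.04;        % assumed observation error variance
    [i1, i2] = meshgrid(1:n, 1:n);
    B = exp(-(i1 - i2).^2 / 18);          % Gaussian correlations, length scale 3
    H = zeros(1, n); H(j) = 1;
    xb = zeros(n, 1); yo = 1;             % zero background, observe 1 at point j
    K = (B*H') / (H*B*H' + sigma2);       % here H*B*H' + sigma2 is a scalar
    increment = K*(yo - H*xb);            % equals B(:,j)/(B(j,j) + sigma2)
    plot(1:n, increment)                  % bell shape centered on point j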

2.4.5 The 4D-Var algorithm

In this section, we reformulate the 4D-Var approach in a form that is better adapted to algorithmic implementation. As we have just seen, the 4D-Var method generalizes 3D-Var to the case where the observations are obtained at different times—this is depicted in Figure 2.4. As was already stated in Chapter 1, the difference between three-dimensional (3D-Var) and four-dimensional (4D-Var) DA is the use of a numerical forecast model in the latter. In 4D-Var, the cost function is still expressed in terms of the initial state, x_0, but it includes the model because the observation \(y_k^o\) at time k is compared to H_k(x_k), where x_k is the state at time k initialized by x_0, and the adjoint is not simply the transpose of a matrix, but the “transpose” of the model/operator dynamics.


[Figure: schematic of the assimilation window over time, showing the previous and updated forecasts, the background x^b, the analysis x^a, the observations, and the corresponding cost terms J^b and J^o; in the 3D-Var case all observations are compared with the same state.]

Figure 2.4. 3D- and 4D-Var.

2.4.5.1 Cost function and gradient

The cost function (2.37) is still expressed in terms of the initial state, x (we have dropped the zero subscript, for simplicity), but it now includes the model because the observation \(y_k^o\) at time k is compared to H_k(x_k), where x_k is the state at time k initialized by x. The cost function is the sum of the background and the observation errors,
\[
J(x) = J^b(x) + J^o(x),
\]
where the background term is the same as above:
\[
J^b(x) = \frac12 \left( x - x^b \right)^T B^{-1} \left( x - x^b \right).
\]
The background x^b, as with x, is taken as a vector at the initial time, k = 0. The observation term is more complicated. We define
\[
J^o(x) = \frac12 \sum_{k=0}^{K} \left( y_k^o - H_k(x_k) \right)^T R_k^{-1} \left( y_k^o - H_k(x_k) \right),
\]
where the state at time k is obtained by an iterated composition of the model matrix,
\[
x_k = M_{0 \to k}(x) = M_{k-1,k} M_{k-2,k-1} \cdots M_{1,2} M_{0,1} \, x = M_k M_{k-1} \cdots M_2 M_1 \, x.
\]
This gives the final form of the observation term,
\[
J^o(x) = \frac12 \sum_{k=0}^{K} \left( y_k^o - H_k M_k M_{k-1} \cdots M_2 M_1 x \right)^T R_k^{-1} \left( y_k^o - H_k M_k M_{k-1} \cdots M_2 M_1 x \right).
\]


Algorithm 2.2 4D-Var.
    n = 0, x = x_0
    while ‖∇J‖ > ε and n ≤ n_max
        (1) compute J with the direct model M and H
        (2) compute ∇J with adjoint model M^T and H^T (reverse mode)
        gradient descent and update of x_{n+1}
        n = n + 1
    end

Now we can compute the gradient directly (whereas in the previous subsection we computed the variation, δJ):
\[
\nabla J(x) = B^{-1} \left( x - x^b \right)
- \sum_{k=0}^{K} M_1^T M_2^T \cdots M_{k-1}^T M_k^T H_k^T R_k^{-1} \left( y_k^o - H_k M_k M_{k-1} \cdots M_2 M_1 x \right).
\]

If we denote the innovation vector as \(d_k = y_k^o - H_k M_k M_{k-1} \cdots M_2 M_1 x\), then we have
\[
\begin{aligned}
-\nabla J^o(x) &= \sum_{k=0}^{K} M_1^T M_2^T \cdots M_{k-1}^T M_k^T H_k^T R_k^{-1} d_k \\
&= H_0^T R_0^{-1} d_0 + M_1^T H_1^T R_1^{-1} d_1 + M_1^T M_2^T H_2^T R_2^{-1} d_2 + \cdots + M_1^T \cdots M_{K-1}^T M_K^T H_K^T R_K^{-1} d_K \\
&= H_0^T R_0^{-1} d_0 + M_1^T \left( H_1^T R_1^{-1} d_1 + M_2^T \left( H_2^T R_2^{-1} d_2 + \cdots + M_K^T H_K^T R_K^{-1} d_K \right) \cdots \right).
\end{aligned}
\]

This factorization enables us to compute J^o followed by ∇J^o with one integration of the direct model followed by one integration of the adjoint model.

2.4.5.2 Algorithm

For Algorithm 2.2, in step (1) we use the equations
\[
d_k = y_k^o - H_k M_k M_{k-1} \cdots M_2 M_1 x
\]
and
\[
J(x) = \frac12 \left( x - x^b \right)^T B^{-1} \left( x - x^b \right) + \frac12 \sum_{k=0}^{K} d_k^T R_k^{-1} d_k.
\]
In step (2), we use
\[
\nabla J(x) = B^{-1} \left( x - x^b \right)
- \left[ H_0^T R_0^{-1} d_0 + M_1^T \left( H_1^T R_1^{-1} d_1 + M_2^T \left( H_2^T R_2^{-1} d_2 + \cdots + M_K^T H_K^T R_K^{-1} d_K \right) \cdots \right) \right].
\]
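Steps (1) and (2) are short in code for a linear, time-invariant model with identical M and H at every step; the following MATLAB sketch uses illustrative data and implements the nested factorization above in its backward loop.

    % One evaluation of J and grad J for Algorithm 2.2 (linear toy model).
    n = 3; K = 10;
    M = expm(0.05*[-1 1 0; 0 -1 1; 0 0 -1]);  % assumed model matrix
    H = eye(n); B = eye(n); R = 0.01*eye(n);
    xb = [1; 0; 0]; x = xb;
    Y  = repmat([1.1; -0.1; 0.2], 1, K+1);    % assumed observations y_0..y_K

    % Step (1): forward sweep; store states, form innovations, evaluate J.
    X = zeros(n, K+1); X(:,1) = x;
    for k = 1:K, X(:,k+1) = M*X(:,k); end
    D = Y - H*X;
    J = 0.5*(x - xb)'*(B\(x - xb));
    for k = 0:K, J = J + 0.5*D(:,k+1)'*(R\D(:,k+1)); end

    % Step (2): backward adjoint sweep (the nested factorization).
    adj = H'*(R\D(:,K+1));                    % innermost term H'*R^{-1}*d_K
    for k = K-1:-1:0
        adj = H'*(R\D(:,k+1)) + M'*adj;
    end
    gradJ = B\(x - xb) - adj;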

2.4.5.3 A very simple scalar example

We consider an example with a single observation at time step 3 and a known background at time step 0. In this case, the 4D-Var cost function (2.37) for determining the


initial state becomes scalar,
\[
J(x_0) = \frac12 \frac{\left( x_0 - x_0^b \right)^2}{\sigma_B^2} + \frac12 \sum_{k=1}^{K} \frac{\left( x_k - x_k^o \right)^2}{\sigma_R^2},
\]
where \(\sigma_B^2\) and \(\sigma_R^2\) are the (known) background and observation error variances, respectively. With a single observation at time step 3, the cost function is
\[
J(x_0) = \frac12 \frac{\left( x_0 - x_0^b \right)^2}{\sigma_B^2} + \frac12 \frac{\left( x_3 - x_3^o \right)^2}{\sigma_R^2}.
\]

The minimum is reached when the gradient of J vanishes, \(J'(x_0) = 0\), which can be computed as
\[
\frac{x_0 - x_0^b}{\sigma_B^2} + \frac{x_3 - x_3^o}{\sigma_R^2} \frac{dx_3}{dx_2} \frac{dx_2}{dx_1} \frac{dx_1}{dx_0} = 0.
\]

(2.48)

We now require a dynamic relation between the x_k's to compute the derivatives. To this end, let us take the most simple linear forecast model,
\[
\frac{dx}{dt} = -\alpha x,
\]
with α a known positive constant. This is a typical model for describing decay, for example, of a chemical compound whose behavior over time is then given by \(x(t) = x(0)e^{-\alpha t}\). To obtain a discrete representation of the dynamics, we can use an upstream finite difference scheme [Strikwerda, 2004],
\[
x(t_{k+1}) - x(t_k) = (t_{k+1} - t_k) \left( -\alpha x(t_{k+1}) \right), \qquad (2.49)
\]
which can be rewritten in the explicit form
\[
x(t + \Delta t) = \left( \frac{1}{1 + \alpha \Delta t} \right) x(t),
\]
where we have assumed a fixed time step, Δt = t_{k+1} − t_k, for all k. We thus have the scalar relation
\[
x_{k+1} = M(x_k) = \gamma x_k, \qquad (2.50)
\]
where the constant is
\[
\gamma = \frac{1}{1 + \alpha \Delta t}.
\]

The necessary condition (2.48) then becomes
\[
\frac{x_0 - x_0^b}{\sigma_B^2} + \frac{x_3 - x_3^o}{\sigma_R^2} \, \gamma^3 = 0.
\]


This can be solved for x_0 and then for x_3 to obtain the analyzed state
\[
\begin{aligned}
x_0 &= x_0^b + \frac{\gamma^3 \sigma_B^2}{\sigma_R^2} \left( x_3^o - x_3 \right) \\
&= x_0^b + \frac{\gamma^3 \sigma_B^2}{\sigma_R^2} \frac{\sigma_R^2}{\sigma_R^2 + \gamma^6 \sigma_B^2} \left( x_3^o - \gamma^3 x_0^b \right) \\
&= x_0^b + \frac{\gamma^3 \sigma_B^2}{\sigma_R^2 + \gamma^6 \sigma_B^2} \left( x_3^o - \gamma^3 x_0^b \right),
\end{aligned}
\]
where we have added and subtracted x_0^b to obtain the last line and used the system dynamics (2.50). Finally, by again using the dynamics, we find the 4D-Var solution
\[
x_3 = \gamma^3 x_0^b + \frac{\gamma^6 \sigma_B^2}{\sigma_R^2 + \gamma^6 \sigma_B^2} \left( x_3^o - \gamma^3 x_0^b \right).
\]

(2.51)

Let us examine some asymptotic cases. If the parameter α tends to zero, then the dynamic gain, γ, tends to one and the model becomes stationary, with x_{k+1} = x_k. The solution then tends to the 3D-Var case, with
\[
x_3 = x_0 = x_0^b + \frac{\sigma_B^2}{\sigma_R^2 + \sigma_B^2} \left( x_3^o - x_0^b \right).
\]

(2.52)

If the model is stationary, we can thus use all observations whenever they become available, exactly as in the 3D case. The other asymptotic case occurs when the step size tends to infinity and the dynamic gain goes to zero. The dynamic model becomes x_{k+1} = 0, with the initial condition x_0 = x_0^b, and there is thus no connection between states at different time steps. Finally, if the observation is perfect, then \(\sigma_R^2 = 0\) and \(x_3 = x_3^o\). But there is no link to x_0, and there is once again no dynamical connection between states at two different instants.
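The closed-form solution (2.51) is easily checked numerically; in the MATLAB sketch below all values (α, Δt, variances, background, observation) are assumptions chosen for illustration, and fminbnd is used as a brute-force minimizer.

    % Verify the scalar 4D-Var solution against direct minimization of J.
    alpha = 1; dt = 0.1; gam = 1/(1 + alpha*dt);
    sB2 = 1; sR2 = 0.25;               % background and observation variances
    x0b = 2; x3o = 1;                  % assumed background and observation
    J = @(x0) 0.5*(x0 - x0b)^2/sB2 + 0.5*(gam^3*x0 - x3o)^2/sR2;
    x0_num  = fminbnd(J, -10, 10);     % numerical minimizer
    x0_form = x0b + gam^3*sB2/(sR2 + gam^6*sB2)*(x3o - gam^3*x0b);
    x3_form = gam^3*x0_form;           % analysis at observation time, eq. (2.51)
    disp([x0_num, x0_form, x3_form])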

2.4.6 Practical variants of 3D-Var and 4D-Var

We have described above the simplest classical 3D-Var and 4D-Var algorithms. To overcome the numerous problems encountered in their implementation, there are several extensions and variants of these methods. We will describe two of the most important here. Further details can be found in Chapter 5.


2.4.6.1 Incremental 3D-Var and 4D-Var

We saw above that the adjoint of the complete model (2.40) is required for computing the gradient of the cost function. In NWP, the full nonlinear model is extremely complex [Kalnay, 2003]. To alleviate this, Courtier et al. [1994] proposed an incremental approach to variational assimilation, several variants of which now exist. Basically, the idea is to simplify the dynamical model (2.41) to obtain a formulation that is cheaper for the adjoint computation. To do this, we modify the tangent model (2.43), which becomes
\[
\delta x_{k+1} = L_{k+1} \, \delta x_k, \qquad k = 0, 1, \ldots, K-1, \qquad (2.53)
\]

where L_k is an appropriately chosen simplified version of the Jacobian operator \(\mathbf{M}_k\). To preserve consistency, the basic model (2.41) must be appropriately modified so that the TLM corresponding to a known (e.g., from the background) reference solution, \(x_k^{(0)}\), is given by (2.53). This is easily done by letting the initial condition
\[
x_0 = x_0^{(0)} + \delta x_0
\]
evolve according to (2.53) into
\[
x_k = x_k^{(0)} + \delta x_k.
\]
The resulting dynamics are then linear. Several possibilities exist for simplifying the objective function (2.40). One can linearize the observation operator H_k, as was done for the model \(\mathbf{M}_k\). We use the substitution
\[
H_k(x_k) \longrightarrow H_k\!\left( x_k^{(0)} \right) + N_k \, \delta x_k,
\]
where N_k is some simplified linear approximation, which could be the Jacobian of H_k at x_k. The objective function (2.40) then becomes

\[
J_1(\delta x_0) = \frac12 \left( \delta x_0 + x_0^{(0)} - x_0^b \right)^T \left( P_0^b \right)^{-1} \left( \delta x_0 + x_0^{(0)} - x_0^b \right)
+ \frac12 \sum_{k=0}^{K} \left( N_k \delta x_k - d_k \right)^T R_k^{-1} \left( N_k \delta x_k - d_k \right), \qquad (2.54)
\]

where the δx_k satisfy (2.53) and the innovation at time k is \(d_k = y_k - H_k\!\left( x_k^{(0)} \right)\). This objective function, J_1, is quadratic in the initial perturbation δx_0, and the minimizer, δx_{0,m}, defines an updated initial state
\[
x_0^{(1)} = x_0^{(0)} + \delta x_{0,m},
\]
from which a new solution, \(x_k^{(1)}\), can be computed using the dynamics (2.41). Then we

loop and repeat the whole process for \(x_k^{(1)}\). This defines a system of two-level nested loops (outer and inner) for minimizing the original cost function (2.40). The savings are thanks to the flexible choice that is possible for the simplified linearized operators L_k and N_k. These can be chosen to ensure a reasonable trade-off between ease of implementation and physical fidelity. One can even modify the operator L_k in (2.53) during the minimization by gradually introducing more complex dynamics in the successive outer loops—this is the multi-incremental approach that is described in Section 5.4.1. Convergence issues are of course a major concern—see, for example, Tremolet [2007a].


These incremental methods together with the adjoint approach are what make variational assimilation computationally tractable. In fact, they have been used until now in most operational NWP systems that employ variational DA.

2.4.6.2 FGAT 3D-Var

This method, “first guess at appropriate time” (abbreviated FGAT), is best viewed as a special case of 4D-Var. It is in fact an extreme case of the incremental approach (2.53)–(2.54), in which the simplified linear operator L_k is set equal to the identity. The process is 4D in the sense that the observations, distributed over the assimilation window, are compared with the computed values in the time integration of the assimilating model. But it is 3D because the minimization of the cost function (2.54) does not use the correct dynamics, i.e.,
\[
\delta x_{k+1} = \delta x_k, \qquad k = 0, 1, \ldots, K-1.
\]

The FGAT 3D-Var approach, using a unique minimization loop (there is no nesting any more), has been shown to improve the accuracy of the assimilated variables. The reason for this is simple: FGAT uses a more precise innovation vector than standard 3D-Var, where all observations are compared with the same first-guess field.

2.4.7 Extensions and complements

2.4.7.1 Parameter estimation

If we want to optimize a set of parameters, α = (α_1, α_2, …, α_p), we need only include the control variables as terms in the cost function,
\[
J(x, \alpha) = J_1^b(x) + J_2^b(\alpha) + J^o(x, \alpha).
\]
The observation term includes a dependence on α, and it is often necessary to add a regularization term for α, such as
\[
J_2^b(\alpha) = \left\| \alpha - \alpha^b \right\|^2,
\quad \text{or} \quad
J_2^b(\alpha) = \left( \alpha - \alpha^b \right)^T B_\alpha^{-1} \left( \alpha - \alpha^b \right),
\quad \text{or} \quad
J_2^b(\alpha) = \left\| \nabla \alpha - \beta \right\|^2.
\]

2.4.7.2 Nonlinearities

When the nonlinearities in the model and/or the observation operator are weak, we can extend the 3D- and 4D-Var algorithms to take their effects into account. One can then define the incremental 4D-Var algorithm—see above.

2.4.7.3 Preconditioning

We recall that the condition number of a matrix A is the product ‖A‖ ‖A^{-1}‖. In general, variational DA problems are badly conditioned. The rate of convergence of the minimization algorithms depends on the conditioning of the Hessian of the cost function: the closer it is to one, the better the convergence. For 4D-Var, the Hessian is equal to \(B^{-1} + H^T R^{-1} H\), and its condition number is usually very high.


Preconditioning [Golub and van Loan, 2013] is a technique for improving the condition number and thus accelerating the convergence of the optimization. We make a change of variable δx = x − x^b such that
\[
w = L^{-1} \delta x, \qquad B = LL^T,
\]
where L is a given simple matrix. This is commonly used in meteorology and oceanography. The modified cost function is
\[
\tilde J(w) = \frac12 w^T w + \frac12 \left( HLw - d \right)^T R^{-1} \left( HLw - d \right),
\]
and its Hessian is equal to
\[
\tilde J'' = I + L^T H^T R^{-1} HL.
\]

It is in general much better conditioned, and the resulting improvement in convergence can be spectacular.
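The gain in conditioning can be checked directly in MATLAB; the smooth covariance below is an assumed toy model, with L taken as a Cholesky factor of B.

    % Condition numbers of the raw and preconditioned Hessians.
    n = 50;
    [i1, i2] = meshgrid(1:n, 1:n);
    B = exp(-(i1 - i2).^2/18) + 1e-8*eye(n);  % assumed B (jitter for Cholesky)
    H = eye(n); R = eye(n);                   % observe every variable
    L = chol(B, 'lower');                     % B = L*L'
    Hraw  = inv(B) + H'*(R\H);                % Hessian of J
    Hprec = eye(n) + L'*H'*(R\H)*L;           % Hessian of J-tilde
    disp([cond(Hraw), cond(Hprec)])           % the second is dramatically smaller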

2.4.7.4 Covariance matrix modeling

The modeling of the covariance matrices of the background error B and the observation error R is an important operational research subject. Reduced-cost models are particularly needed when the matrices are of high dimensions—in weather forecasting or turbulent flow control problems, for example, this can run into tens of millions. One may also be interested in having better-quality approximations of these matrices. In background error covariance modeling [Fisher, 2003], compromises have to be made to produce a computationally viable model. Since we do not have access to the true background state, we must either separate out the information about the statistics of background error from the innovation statistics or derive statistics for a surrogate quantity. Both approaches require assumptions to be made, for example about the statistical properties of the observation error. The “separation” approach can be addressed by running an ensemble of randomly perturbed predictions, drawn from relevant distributions. This method of generating surrogate fields of background error is strongly related to the EnKF, which is fully described in Chapter 6—see also Evensen [2009]. Other approaches for modeling the B matrix by reduced bases, factorization, and spectral methods are fully described in Chapter 5.

2.4.7.5 Model error

In standard variational assimilation, we invert for the initial condition only. The underlying hypothesis that the model is perfectly known is not a realistic one. In fact, to take into account eventual model error, we should add an appropriate error term to the state equation and insert a cost term into the objective function. We thus arrive at a parameter identification inverse problem, similar to those already studied above in Section 2.3. In the presence of model uncertainty, the state equation and objective functions become (see also the above equations (2.38) and (2.39))
\[
\begin{cases}
\dfrac{dx}{dt} = \mathcal{M}(x) + \eta(t) & \text{in } \Omega \times [0, T], \\
x(t=0) = x_0,
\end{cases}
\]


[Figure: three-dimensional plot of the Lorenz attractor; axes X, Y, Z.]

Figure 2.5. Simulation of the chaotic Lorenz-63 system of three equations.

where η(t) is a suitably uncorrelated white noise. The new cost functional is
\[
J(x_0, \eta) = \frac12 \left\| x_0 - x^b \right\|_{\mathcal{X}}^2
+ \frac12 \int_0^T \left\| y(t) - \mathcal{H}(x(x_0, t)) \right\|_{\mathcal{O}}^2 \, dt
+ \frac12 \int_0^T \left\| \eta(t) \right\|^2 \, dt,
\]

where the model noise has covariance matrix Q, which we suppose to be uncorrelated in time and uncorrelated with the background and observation errors. However, in cases with high dimensionality, this approach is not feasible, especially for practical problems. Numerous solutions have been proposed to overcome this problem—see Griffith and Nichols [2000], Tremolet [2007b], Vidard et al. [2004], and Tremolet [2007c].

2.5 Numerical examples

2.5.1 DA for the Lorenz equation

We study the nonlinear Lorenz system of equations [Lorenz, 1963],
\[
\frac{dx}{dt} = -\sigma(x - y), \qquad
\frac{dy}{dt} = \rho x - y - xz, \qquad
\frac{dz}{dt} = xy - \beta z,
\]
which exhibits chaotic behavior when we fix the parameter values σ = 10, ρ = 28, and β = 8/3 (see Figure 2.5). This equation is a simplified model for atmospheric convection and is an excellent example of the lack of predictability. It is ill-posed in the


sense of Hadamard. In fact, the solution switches between two stable orbits around the points
\[
\left( \sqrt{\beta(\rho - 1)}, \ \sqrt{\beta(\rho - 1)}, \ \rho - 1 \right)
\quad \text{and} \quad
\left( -\sqrt{\beta(\rho - 1)}, \ -\sqrt{\beta(\rho - 1)}, \ \rho - 1 \right).
\]
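A trajectory like the one in Figure 2.5 can be reproduced in a few lines of MATLAB (the integration settings here are our own illustrative choices, not those of the Lawless codes mentioned below):

    % Integrate the Lorenz-63 system and plot the attractor.
    sigma = 10; rho = 28; beta = 8/3;
    f = @(t, u) [ -sigma*(u(1) - u(2));
                  rho*u(1) - u(2) - u(1)*u(3);
                  u(1)*u(2) - beta*u(3) ];
    [~, U] = ode45(f, [0 40], [1; 1; 1]);
    plot3(U(:,1), U(:,2), U(:,3)); grid on
    xlabel('X'); ylabel('Y'); zlabel('Z')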

We now perform 4D-Var DA on this equation with only the observation term,
\[
J^o(x) = \frac12 \sum_{k=0}^{n} \left( y_k^o - H_k(x_k) \right)^T R_k^{-1} \left( y_k^o - H_k(x_k) \right).
\]

This relatively simple model enables us to study a number of important effects and to answer the following practical questions:

• What is the influence of observation noise?
• What is the influence of the initial guess?
• What is the influence of the length of the assimilation window and the number of observations?

In addition, we can compare the performance of the standard 4D-Var with that of an incremental 4D-Var algorithm. All computations are based on the codes provided by A. Lawless of the DARC (Data Assimilation Research Centre) at Reading University [Lawless, 2002]. Readers are encouraged to obtain the software and experiment with it. The assimilation results shown in Figures 2.6 and 2.7 were obtained from twin experiments with the following conditions:

• True initial condition is (1.0, 1.0, 1.0).
• Initial guess is (1.2, 1.2, 1.2).
• Time step is 0.05 seconds.
• Assimilation window is 2 seconds.
• Forecast window is 3 seconds.
• Observations are every 2 time steps.
• Number of outer loops (for incremental 4D-Var) is 4.

We remark that the incremental algorithm produces a more accurate forecast, over a longer period, in this case—see Figure 2.7.

2.5.2 Additional DA examples

Numerous examples of variational DA can be found in the advanced Chapters 4, 5, and 7, as well as in the applications sections—see Part III. Another rich source is the training material of the ECMWF [Bouttier and Courtier, 1997].


[Figure: two panels, “Solution for x” and “Solution for z,” over time steps 0–100, each showing truth, observations, first guess, and analysis.]

Figure 2.6. Assimilation of the Lorenz-63 equations by standard 4D-Var, based on a twin experiment. The assimilation window is from step 0 to step 40 (2 seconds). The forecast window is from step 41 to 100 (3 seconds).

[Figure: two panels, “Solution for x” and “Solution for z,” over time steps 0–100, each showing truth, observations, first guess, and analysis.]

Figure 2.7. Assimilation of the Lorenz-63 equations by incremental 4D-Var, based on a twin experiment. The assimilation window is from step 0 to step 40 (2 seconds). The forecast window is from step 41 to 100 (3 seconds).

Chapter 3

Statistical estimation and sequential data assimilation

The road to wisdom?—Well, it's plain
and simple to express:
Err
and err
and err again
but less
and less
and less.
—Piet Hein (1905–1996, Danish mathematician and inventor)

3.1 Introduction

In this chapter, we present the statistical approach to DA. This approach will be addressed from a Bayesian point of view. But before delving into the mathematical and algorithmic details, we will discuss some ideas about the history of weather forecasting and of the distinction between prediction and forecasting. For a broad, nontechnical treatment of prediction in a sociopolitical-economic context, the curious reader is referred to Silver [2012], where numerous empirical aspects of forecasting are also broached.

3.1.1 A long history of prediction

From Babylonian times, people have attempted to predict future events, for example in astronomy. Throughout the Renaissance and the Industrial Revolution there were vast debates on predictability. In 1814, Pierre-Simon Laplace postulated that a perfect knowledge of the actual state of a system coupled with the equations that describe its evolution (natural laws) should provide perfect predictions! This touches on the far-reaching controversy between determinism and randomness/uncertainty . . . and if we go all the way down to the level of quantum mechanics, then due to Heisenberg's principle there cannot be a perfect prediction. However, weather (and many other physical phenomena) go no further than the molecular (not the atomic) level and as a result molecular chemistry and Newtonian physics are sufficient for weather forecasting. In fact, the (deterministic) PDEs that describe the large-scale circulation of air masses and oceans are


remarkably precise and can reproduce an impressive range of meteorological conditions. This is equally true in a large number of other application domains, as described in Part III. Weather forecasting is a success story: human and machine combining their efforts to understand and to anticipate a complex natural system. This is true for many other systems thanks to the broad applicability of DA and inverse problem methods and algorithms.

3.1.2 Stochastic versus deterministic

The simplest statistical approach to forecasting (rather like linear regression, but with a flat line) is to calculate the probability of an event (e.g., rain tomorrow) based on past knowledge and records—i.e., long-term averages. But these purely statistical predictions are of little value—they do not take into account the possibility and potential that we have of modeling the physics—this is where the progress (over the last 30 years) in numerical analysis and high-performance computing can come to the rescue. However, this is not a trivial pursuit, as we often notice when surprised by a rain shower, flood, stock market crash, or earthquake. So what goes wrong and impairs the accuracy/reliability of forecasts?

• The first thing that can go wrong is the resolution (spatial and temporal) of our numerical models . . . but this is an easier problem: just add more computing power, energy, and money!
• Second, and more important, is chaos (see Section 2.5.1), which applies to dynamic, nonlinear systems and is closely associated with the well-posedness issues of Chapter 1—note that this has nothing to do with randomness, but rather is related to the lack of predictability. In fact, in weather modeling, for example, after approximately one week only, chaos theory swamps the dynamic memory of the atmosphere (as “predicted” by the physics), and we are better off relying on climatological forecasts that are based on historical averaged data.
• Finally, there is our imprecise knowledge of the initial (and boundary) conditions for our physical model and hence our simulations—this loops back to the previous point and feeds the chaotic nature of the system. Our measurements are both incomplete and (slightly) inaccurate due to the physical limitations of the instruments themselves.

All of the above needs to be accounted for, as well as possible, in our numerical models and computational analysis. This can best be done with a probabilistic^{26} approach.

26 Equivalently, a statistical or stochastic approach can be used.

3.1.3 Prediction versus forecast

The terms prediction and forecast are used interchangeably in most disciplines but deserve a more rigorous definition/distinction. Following the philosophy of Silver [2012], a prediction will be considered as a deterministic statement, whereas a forecast will be a probabilistic one. Here are two examples:

• “A major earthquake will strike Tokyo on May 28th” is a prediction, whereas “there is a 60% chance of a major earthquake striking Northern California over the next 25 years” is a forecast.

3.1. Introduction

73

• Extrapolation is another example of prediction and is in fact a very basic method that can be useful in some specific contexts but is generally too simplistic and can lead to very bad predictions and decisions. We notice the UQ in the forecast statement. One way to implement UQ is through Bayesian reasoning—let us explain this now.

3.1.4 DA is fundamentally Bayesian
Thomas Bayes, an English clergyman and statistician (1701–1761), believed in a rational world of Newtonian mechanics but insisted that by gathering evidence we can get closer and closer to the truth. In other words, rationality is probabilistic. Laplace, as we saw above, claimed that with perfect knowledge of the present and of the laws governing its evolution, we can attain perfect knowledge of the future. In fact it was Laplace who formulated what is known as Bayes' theorem. He considered probability to be "a waypoint between ignorance and knowledge." This is not bad . . . it corresponds exactly to our endeavor and what we are trying to accomplish throughout this book: use models and simulations to reproduce and then predict (or, more precisely, forecast or better understand) the actual state and future evolution of a complex system. For Laplace it was clear: we need a more thorough understanding of probability to make scientific progress! Bayes' theorem is a very simple algebraic formula based on conditional probability (the probability of one event, A, occurring, knowing or given that another event, B, has occurred—see Section 3.2 below for the mathematical definitions):
\[ p_{A|B} = \frac{p_{B|A}\, p_A}{p_B}. \]

It basically provides us with a reevaluated probability (the posterior, $p_{A|B}$) based on the prior knowledge, $p_{B|A}\, p_A$, of the system, normalized by the total knowledge that we have, $p_B$. To better understand and appreciate this result, let us consider a couple of simple examples that illustrate the importance of Bayesian reasoning.
Example 3.1. The famous example of breast cancer diagnosis from mammograms shows the importance and strength of priors. Based on epidemiological studies, the probability that a woman between the ages of 40 and 50 will be afflicted by a cancer of the breast is low, of the order of $p_A = 0.014$ or 1.4%. The question we want to answer is: If a woman in this age range has a positive mammogram (event B), what is the probability that she indeed has a cancer (event A)? Further studies have shown that the false-positive rate of mammograms is $p = 0.1$ or 10% and that the correct diagnosis (true positive) has a rate of $p_{B|A} = 0.75$. So a positive mammogram, taken by itself, would seem to be serious news. However, if we do a Bayesian analysis that factors in the prior information, we get a different picture. Let us do this now. The posterior probability can be computed from Bayes' formula,
\[ p_{A|B} = \frac{p_{B|A}\, p_A}{p_B} = \frac{0.75 \times 0.014}{0.75 \times 0.014 + 0.1 \times (1 - 0.014)} = 0.096, \]

and we conclude that the probability is only 10% in this case, which is far less worrisome than the raw 75% true-positive rate. So the false positives have dominated the result, thanks to the fact that we have taken into account the prior information of

low cancer incidence in this age range. For this reason, there is a tendency in the medical profession today to recommend that women (without antecedents, which would increase the value of $p_A$) begin regular mammograms only at age 50, since from this age onward the prior probability is higher.
Example 3.2. Another good example comes from global warming, now called climate change, and we will see why it is so important to quantify uncertainty in the interest of scientific advancement and trust. The study of global warming started around the year 2001. At this time, it was commonly accepted, and scientifically justified, that CO2 emissions caused and would continue to cause a rise in global temperatures. Thus, we could attribute a high prior probability, $p_A = 0.95$, to the hypothesis of global warming (event A). However, over the subsequent decade from 2001 to 2011, we have observed (event B) that global temperatures have not risen as expected—in fact they appeared to have decreased very slightly. (A recent paper, published in Science, has rectified this by taking into account the evolution of instrumentation since the start of the study. Indeed, it now appears that there has been a steady increase! Apparently, the "hiatus" was the result of a double observational artefact; see T. R. Karl et al., Science Express, 4 June 2015.) So, according to Bayesian reasoning, we should reconsider our estimation of the probability of global warming—the question is, to what extent? If we had a good estimate of the uncertainty in short-term patterns of temperature variations, then the downward revision of the prediction would not be drastic. By analyzing the historical data again, we find that there is a 15% chance that there is no net warming over a decade even if the global warming hypothesis holds—this is due to the inherent variability in the climate. On the other hand, if temperature variations were purely random, and hence unpredictable, then the chance of having a decade in which there is actually a cooling would be 50%. So let us compute the revised estimate for global warming with Bayes' formula. We find
\[ p_{A|B} = \frac{p_{B|A}\, p_A}{p_B} = \frac{0.15 \times 0.95}{0.15 \times 0.95 + 0.5 \times (1 - 0.95)} = 0.851, \]

so we should revise our probability, in light of the last decade's evidence, from 95% to 85%. This is a truly honest approximation that takes into account the observations and revises the uncertainty. Of course, when we receive a new batch of measurements, we can recompute and obtain an update. This is precisely what DA seeks to achieve. The major difference resides in our possession (in the DA context) of a sophisticated model for actually computing the conditional probability $p_{B|A}$—the probability of the data, or observations, given the parameters.
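Both examples are instances of the same one-line computation. Here is a minimal MATLAB sketch of this update (the function name bayes_update is ours, introduced for illustration):

% Bayes' rule for a binary hypothesis A given evidence B:
% p(A|B) = p(B|A) p(A) / [p(B|A) p(A) + p(B|not A) (1 - p(A))].
bayes_update = @(pA, pB_A, pB_notA) ...
    (pB_A .* pA) ./ (pB_A .* pA + pB_notA .* (1 - pA));

% Example 3.1: mammogram, prior 1.4%, true-positive 0.75, false-positive 0.1:
bayes_update(0.014, 0.75, 0.1)   % approximately 0.096

% Example 3.2: warming, prior 0.95, p(cooling | warming) = 0.15,
% p(cooling | pure randomness) = 0.5:
bayes_update(0.95, 0.15, 0.5)    % approximately 0.851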

3.1.5 First steps toward a formal framework
Now let us begin to formalize. It can be claimed that a major part of scientific discovery and research deals with questions of this nature: what can be said about the value of an unknown, or inaccurately known, variable θ that represents the parameters of the system, if we have some measured data $\mathcal{D}$ and a model $\mathcal{M}$ of the underlying mechanism that generated the data? But this is precisely the Bayesian context (see Barber [2012], where Bayesian reasoning is extensively developed in the context of machine learning), where we seek a quantification of the uncertainty in our knowledge of the parameters that, according to Bayes' theorem, takes the form
\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\sum_{\theta} p(\mathcal{D} \mid \theta)\, p(\theta)}. \]
Here, the physical model is represented by the conditional probability (also known as the likelihood) $p(\mathcal{D} \mid \theta)$, and the prior knowledge of the system by the term $p(\theta)$. The denominator is considered as a normalizing factor and represents the total probability of $\mathcal{D}$. From these we can then calculate the resulting posterior probability, $p(\theta \mid \mathcal{D})$. The most probable estimator, called the maximum a posteriori (MAP) estimator, is the value that maximizes the posterior probability,
\[ \theta^* = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}). \]

Note that for a flat, or uninformative, prior $p(\theta)$, the MAP is just the maximum likelihood estimator, i.e., the value of θ that maximizes the likelihood $p(\mathcal{D} \mid \theta)$ of the model that generated the data, since in this case neither $p(\theta)$ nor the denominator plays a role in the optimization.

3.1.6 Concluding remarks (as an opening . . .)
There are links between the above and the theories of state space, optimal control, and optimal filtering. We will study KFs, whose original theory was developed in this state space context, below—see Friedland [1986] and Kalman [1960]. The following was the theme of a recent Royal Meteorological Society meeting (Imperial College, London, April 2013): "Should weather and climate prediction models be deterministic or stochastic?"—this is a very important question that is relevant for other physical systems. In this chapter, we will argue that uncertainty is an inherent characteristic of (weather and most other) predictions and thus that no forecast can claim to be complete without an accompanying estimation of its uncertainty—what we call uncertainty quantification (UQ).

3.2 Statistical estimation theory
In statistical modeling, the concepts of sample space, probability, and random variable play key roles. Readers who are already familiar with these concepts can skip this section. Those who require more background on probability and statistics should definitely consult a comprehensive treatment, such as DeGroot and Schervish [2012] or the excellent texts of Feller [1968], Jaynes [2003], McPherson [2001], and Ross [1997].
A sample space, $\Omega$, is the set of all possible outcomes of a random, unpredictable experiment. Each outcome is a point (or an element) in the sample space. Probability provides a means for quantifying how likely it is for an outcome to take place. Random variables assign numerical values to outcomes in the sample space. Once this has been done, we can systematically work with notions such as average value, or mean, and variability. It is customary in mathematical statistics to use capital letters to denote random variables (r.v.'s) and corresponding lowercase letters to denote values taken by the r.v. in its range. If $X : \Omega \to \mathbb{R}$ is an r.v., then for any $x \in \mathbb{R}$, by $\{X \leq x\}$ we mean $\{s \in \Omega \mid X(s) \leq x\}$.


Definition 3.3. A probability space $(\Omega, \mathcal{B}, \mathcal{P})$ consists of a set $\Omega$ called the sample space, a collection $\mathcal{B}$ of (Borel) subsets of $\Omega$, and a probability function $\mathcal{P} : \mathcal{B} \to \mathbb{R}^+$ for which
• $\mathcal{P}(\emptyset) = 0$,
• $\mathcal{P}(\Omega) = 1$, and
• $\mathcal{P}\left( \bigcup_i S_i \right) = \sum_i \mathcal{P}(S_i)$ for any disjoint, countable collection of sets $S_i \in \mathcal{B}$.

A random variable X is a measurable function $X : \Omega \to \mathbb{R}$. Associated with the r.v. X is its distribution function,
\[ F_X(x) = \mathcal{P}\{X \leq x\}, \quad x \in \mathbb{R}. \]
The distribution function is nondecreasing and right continuous and satisfies
\[ \lim_{x \to -\infty} F_X(x) = 0, \qquad \lim_{x \to +\infty} F_X(x) = 1. \]

Definition 3.4. A random variable X is called discrete if there exist countable sets $\{x_i\} \subset \mathbb{R}$ and $\{p_i\} \subset \mathbb{R}^+$ for which $p_i = \mathcal{P}\{X = x_i\} > 0$ for each i, and
\[ \sum_i p_i = 1. \]
In this case, the PDF for X is the real-valued function with discrete support
\[ p_X(x) = \begin{cases} p_i & \text{if } x = x_i, \ i = 1, 2, \dots, \\ 0 & \text{otherwise.} \end{cases} \]
The $x_i$'s are the points of discontinuity of the distribution function,
\[ F_X(x) = \sum_{\{i \mid x_i \leq x\}} p_X(x_i). \]

Definition 3.5. A random variable X is called continuous if its distribution function, $F_X$, is absolutely continuous. In this case,
\[ F_X(x) = \int_{-\infty}^{x} p_X(u)\, du, \]
and there exists a derivative of $F_X$,
\[ p_X(x) = \frac{dF_X}{dx}, \]
that is called the probability density function (PDF) for X.
Definition 3.6. The mean, or expected value, of an r.v. X is given by the integral
\[ E(X) = \int_{-\infty}^{\infty} x\, dF_X(x). \]


This is also known as the first moment of the random variable. If X is a continuous r.v., then $dF_X(x) = p_X(x)\, dx$, and, in the discrete case,
\[ dF_X(x) = \sum_i p_X(x_i)\, \delta(x - x_i). \]
In the latter case,
\[ E(X) = \sum_i x_i\, p_X(x_i). \]

The expectation operator, E, is a linear operator.
Definition 3.7. The variance of an r.v. X is given by
\[ \sigma^2 = E\left( (X - \mu)^2 \right) = E(X^2) - (E(X))^2, \]
where $\mu = E(X)$.
Definition 3.8. The mode is the value of x for which the PDF $p_X(x)$ attains its maximal value.
Definition 3.9. Two r.v.'s, X and Y, are jointly distributed if they are both defined on the same probability space, $(\Omega, \mathcal{B}, \mathcal{P})$.
Definition 3.10. A random vector, $X = (X_1, X_2, \dots, X_n)$, is a mapping from $\Omega$ into $\mathbb{R}^n$ for which all the components $X_i$ are jointly distributed. The joint distribution function of X is given by
\[ F_X(x) = \mathcal{P}\{X_1 \leq x_1, \dots, X_n \leq x_n\}, \quad x = (x_1, \dots, x_n) \in \mathbb{R}^n. \]

The components $X_i$ are independent if the joint distribution function is the product of the distribution functions of the components,
\[ F_X(x) = \prod_{i=1}^{n} F_{X_i}(x_i). \]

Definition 3.11. A random vector X is continuous with joint PDF $p_X$ if
\[ F_X(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} p_X(u)\, du_1 \dots du_n. \]

Definition 3.12. The mean, or expected value, of a random vector $X = (X_1, X_2, \dots, X_n)$ is the n-vector E(X) with components
\[ [E(X)]_i = E(X_i), \quad i = 1, \dots, n. \]
The covariance of X is the n × n matrix cov(X) with components
\[ [\mathrm{cov}(X)]_{ij} = E\left( (X_i - \mu_i)(X_j - \mu_j) \right) = \sigma_{ij}^2, \quad 1 \leq i, j \leq n, \]
where $\mu_i = E(X_i)$.


3.2.1 Gaussian distributions
A continuous random vector X has a Gaussian distribution if its joint PDF has the form
\[ p_X(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\left( -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right), \]
where $x, \mu \in \mathbb{R}^n$ and Σ is an n × n symmetric positive definite matrix. The mean is given by $E(X) = \mu$, and the covariance matrix is $\mathrm{cov}(X) = \Sigma$. These two parameters completely characterize the distribution, and we indicate this situation by $X \sim \mathcal{N}(\mu, \Sigma)$. Note that in the scalar case we have the familiar bell curve,
\[ p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \]
For two jointly distributed random vectors X and Y with joint PDF $p_{(X,Y)}(x, y)$, $(x, y) \in \mathbb{R}^n \times \mathbb{R}^n$, the marginal PDF of X is obtained by summing over the values of Y,
\[ p_X(x) = \sum_{\mathcal{P}\{Y=y\}>0} p_{(X,Y)}(x, y), \quad x \in \mathbb{R}^n. \tag{3.1} \]
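As an aside, the multivariate Gaussian density above is straightforward to evaluate numerically. A minimal MATLAB sketch (the function name gauss_pdf is ours; we factor Σ with a Cholesky decomposition rather than forming det(Σ) and Σ⁻¹ explicitly):

function p = gauss_pdf(x, mu, Sigma)
% Evaluate the multivariate Gaussian PDF p_X(x; mu, Sigma) at the point x.
    n = numel(mu);
    L = chol(Sigma, 'lower');   % Sigma = L*L'; errors out if Sigma is not SPD
    z = L \ (x - mu);           % whitened residual: z'*z = (x-mu)'*inv(Sigma)*(x-mu)
    p = exp(-0.5 * (z' * z)) / ((2*pi)^(n/2) * prod(diag(L)));
end
% Example: gauss_pdf([0; 0], [0; 0], eye(2)) returns 1/(2*pi).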

The conditional PDF for Y given X = x is then defined as
\[ p_{(Y|X)}(y \mid x) = \frac{p_{(X,Y)}(x, y)}{p_X(x)}, \tag{3.2} \]

where the denominator is nonzero. So, the conditional probability p(A|B) is the revised probability of an event A after learning that the event B has occurred.


Remark 3.17. If X and Y are independent random vectors, then the conditional density function of Y given X = x does not depend on x and satisfies
\[ p_{(Y|X)}(y \mid x) = \frac{p_X(x)\, p_Y(y)}{p_X(x)} = p_Y(y). \tag{3.3} \]

Definition 3.18. Let $\varphi : \mathbb{R}^n \to \mathbb{R}^k$ be a measurable mapping. The conditional expectation of $\varphi(Y)$ given X = x is
\[ E(\varphi(Y) \mid X = x) = \sum_{\mathcal{P}\{Y=y\}>0} \varphi(y)\, p_{(Y|X)}(y \mid x), \quad x \in \mathbb{R}^n. \tag{3.4} \]

Remark 3.19. For continuous random vectors X and Y, we can define the analogous concepts by replacing the summations in (3.1)–(3.4) with appropriate integrals:
\[ p_X(x) = \int_{-\infty}^{\infty} p_{(X,Y)}(x, y)\, dF_Y(y), \]
\[ E(\varphi(Y) \mid X = x) = \int_{-\infty}^{\infty} \varphi(y)\, p_{(Y|X)}(y \mid x)\, dF_Y(y). \]

We are now ready to state Bayes' law, which relates the conditional random vector $X|_{Y=y}$ to the inverse conditional random vector $Y|_{X=x}$.
Theorem 3.20 (Bayes' law). Let X and Y be jointly distributed random vectors. Then
\[ p_{(X|Y)}(x \mid y) = \frac{p_{(Y|X)}(y \mid x)\, p_X(x)}{p_Y(y)}. \tag{3.5} \]

Proof. By the definition of conditional probability (3.2),
\[ p_{(X|Y)}(x \mid y) = \frac{p_{(X,Y)}(x, y)}{p_Y(y)}, \]

and the numerator is exactly equal to that of (3.5), once again by definition.

Definition 3.21. In the context of Bayes' law (3.5), suppose that X represents the variable of interest and that Y represents an observable (measured) quantity that depends on X. Then,
• $p_X(x)$ is called the a priori PDF, or the prior;
• $p_{(X|Y)}(x \mid y)$ is called the a posteriori PDF, or the posterior;
• $p_{(Y|X)}(y \mid x)$, considered as a function of x, is the likelihood function;
• the denominator, called the evidence, $p_Y(y)$, can be considered as a normalization factor; and
• the posterior distribution is thus proportional to the product of the likelihood and the prior distribution or, in applied terms,
\[ p(\text{parameter} \mid \text{data}) \propto p(\text{data} \mid \text{parameter})\, p(\text{parameter}). \]


Remark 3.22. A few fundamental remarks are in order here. First, Bayes’ law plays a central role in probabilistic reasoning since it provides us with a method for inverting probabilities, going from p(y | x) to p(x | y). Second, conditional probability matches perfectly our intuitive notion of uncertainty. Finally, the laws of probability combined with Bayes’ law constitute a complete reasoning system for which traditional deductive reasoning is a special case [Jaynes, 2003].

3.2.5 Linear least-squares estimation: BLUE, minimum variance linear estimation
In this section we define the two estimators that form the basis of statistical DA. We show that these are optimal, which explains their widespread use. Let $X = (X_1, X_2, \dots, X_n)$ and $Z = (Z_1, Z_2, \dots, Z_m)$ be two jointly distributed, real-valued random vectors with finite expected squared components:
\[ E(X_i^2) < \infty, \quad i = 1, \dots, n, \qquad E(Z_j^2) < \infty, \quad j = 1, \dots, m. \]

This is an existence condition and is necessary to have a rigorous, functional space setting for what follows.
Definition 3.23. The cross-correlation matrix for X and Z is the n × m matrix $\Gamma_{XZ} = E(XZ^T)$ with entries
\[ [\Gamma_{XZ}]_{ij} = E(X_i Z_j), \quad i = 1, \dots, n, \ j = 1, \dots, m. \]
The autocorrelation matrix for X is $\Gamma_{XX} = E(XX^T)$, with entries
\[ [\Gamma_{XX}]_{ij} = E(X_i X_j), \quad 1 \leq i, j \leq n. \]

Remark 3.24. Note that $\Gamma_{ZX} = \Gamma_{XZ}^T$ and that $\Gamma_{XX}$ is symmetric and positive semidefinite, i.e., $\forall x$, $x^T \Gamma_{XX} x \geq 0$. Also, if $E(X) = 0$, then the autocorrelation reduces to the covariance, $\Gamma_{XX} = \mathrm{cov}(X)$.

We can relate the trace of the autocorrelation matrix to the second moment of the random vector X.
Proposition 3.25. If a random vector X has finite expected squared components, then
\[ E\left( \|X\|^2 \right) = \mathrm{trace}(\Gamma_{XX}). \]
We are now ready to formally define the BLUE. We consider a linear model, $z = Kx + N$, where K is an m × n matrix, $x \in \mathbb{R}^n$ is deterministic, and N is a random (noise) m-vector with $E(N) = 0$ and $C_N = \mathrm{cov}(N)$ a known, nonsingular, m × m covariance matrix.


Definition 3.26. The best linear unbiased estimator (BLUE) for x from the linear model z is the vector $\hat{x}_{\text{BLUE}}$ that minimizes the quadratic cost function
\[ J(\hat{x}) = E\left( \|\hat{x} - x\|^2 \right) \]
subject to the constraints of linearity,
\[ \hat{x} = Bz, \quad B \in \mathbb{R}^{n \times m}, \]
and unbiasedness, $E(\hat{x}) = x$.
In the case of a full-rank matrix K, the Gauss–Markov theorem [Sayed, 2003; Vogel, 2002] gives us an explicit form for the BLUE.
Theorem 3.27 (Gauss–Markov). If K has full rank, then the BLUE is given by $\hat{x}_{\text{BLUE}} = \hat{B}z$, where
\[ \hat{B} = \left( K^T C_N^{-1} K \right)^{-1} K^T C_N^{-1}. \]
Remark 3.28. If the noise covariance matrix $C_N = \sigma^2 I$ (white, uncorrelated noise), and K has full rank, then
\[ \hat{x}_{\text{BLUE}} = \left( K^T K \right)^{-1} K^T z = K^{\dagger} z, \]
where $K^{\dagger}$ is called the pseudoinverse of K. This corresponds, in the deterministic case, to the least-squares problem
\[ \min_x \|Kx - z\|. \]
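The following MATLAB sketch checks Theorem 3.27 and Remark 3.28 on a small synthetic problem (all sizes and values here are ours, chosen for illustration):

m = 50; n = 3;
K  = randn(m, n);                      % full-rank model matrix (almost surely)
x  = [1; -2; 0.5];                     % true deterministic state
CN = 0.1 * eye(m);                     % known noise covariance C_N
z  = K * x + sqrt(0.1) * randn(m, 1);  % noisy data z = Kx + N
% BLUE: x_hat = (K' C_N^{-1} K)^{-1} K' C_N^{-1} z
x_blue = (K' * (CN \ K)) \ (K' * (CN \ z));
% Since C_N = sigma^2 I here, this coincides with the least-squares solution
% x_hat = pinv(K) * z, computed stably by MATLAB's backslash:
x_ls = K \ z;                          % equal to x_blue up to round-off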

Due to the dependence of the BLUE on the inverse of the noise covariance matrix, it is unsuitable for the solution of noisy, ill-conditioned linear systems. To remedy this situation, we assume that x is a realization of a random vector X, and we formulate a linear least-squares analogue of Bayesian estimation.
Definition 3.29. Suppose that x and z are jointly distributed, random vectors with finite expected squares. The minimum variance linear estimator (MVLE) of x from z is given by $\hat{x}_{\text{MVLE}} = \hat{B}z$, where
\[ \hat{B} = \arg\min_{B \in \mathbb{R}^{n \times m}} E\left( \|Bz - x\|^2 \right). \]
Proposition 3.30. If $\Gamma_{ZZ}$ is nonsingular, then the MVLE of x from z is given by
\[ \hat{x}_{\text{MVLE}} = \Gamma_{XZ}\, \Gamma_{ZZ}^{-1}\, z. \]


3.3 Examples of Bayesian estimation
In this section we provide some calculated examples of Bayesian estimation.

3.3.1 Scalar Gaussian distribution example
In this simple, but important, example we will derive in detail the parameters of the posterior distribution when the data and the prior are normally distributed. This will provide us with a richer understanding of DA. We suppose that we are interested in forecasting the value of a scalar state variable, x, which could be a temperature, a wind velocity component, an ozone concentration, etc. We are in possession of a Gaussian prior distribution for x, $x \sim \mathcal{N}(\mu_X, \sigma_X^2)$, with expectation $\mu_X$ and variance $\sigma_X^2$, which could come from a forecast model, for example. We are also in possession of n independent, noisy observations, $y = (y_1, y_2, \dots, y_n)$, each with conditional distribution $y_i \mid x \sim \mathcal{N}(x, \sigma^2)$, conditioned on the true value of the parameter/process x. Thus, the conditional distribution of the data/observations is a product of Gaussian laws,
\[ p(y \mid x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x)^2}{2\sigma^2} \right) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x)^2 \right). \]
Multiplying this likelihood by the prior, the posterior satisfies
\[ p(x \mid y) \propto \exp\left( -\frac{1}{2} \left[ \sum_{i=1}^{n} \frac{(y_i - x)^2}{\sigma^2} + \frac{(x - \mu_X)^2}{\sigma_X^2} \right] \right) \propto \exp\left( -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_X^2} \right) x^2 - 2 \left( \sum_{i=1}^{n} \frac{y_i}{\sigma^2} + \frac{\mu_X}{\sigma_X^2} \right) x \right] \right). \]
Notice that this is the product of two Gaussians, which, by completing the square, can be show to be Gaussian itself. This produces the posterior distribution,   2 x | y   μ x|y , σ x|y ,

(3.6)

84

Chapter 3. Statistical estimation and sequential data assimilation

where

 μ x|y =

−1 +

1 n + σ 2 σX2

and

 2 = σ x|y

n  yi μ + X σ 2 σX2 i =1

1 n + σ 2 σX2

−1

,

.

Let us now study more closely these two parameters of the posterior law. We first remark that the inverse of the posterior variance, called the posterior precision, is equal to the sum of the prior precision, 1/σX2 , and the data precision, n/σ 2 . Second, the posterior mean, or conditional expectation, can also be written as a sum of two terms:   σ 2 σX2 μ n E(x | y) = y¯ + X2 σ 2 + nσX2 σ 2 σX = wy y¯ + wμX μX , where the sample mean,

n 1 y, n i =1 i

y¯ = and the two weights, wy =

nσX2 σ 2 + nσX2

,

wμX =

σ2 , σ 2 + nσX2

add up to wy + wμX = 1. We observe immediately that the posterior mean is the weighted sum/average of the data mean (¯ y ) and the prior mean (μX ). Now let us examine the weights themselves. If there is a large uncertainty in the prior, then σX2 → ∞ and hence wy → 1, wμX → 0 and the likelihood dominates the prior, leading to what is known as the sampling distribution for the posterior: p(x | y) →  (¯ y , σ 2 /n). If we have a large number of observations, then n → ∞ and the posterior now tends to the sample mean, whereas if we have few observations, then n → 0 and the posterior p(x | y) →  (μX , σX2 ) tends to the prior. In the case of equal uncertainties between data and prior, σ 2 = σX2 , and the prior mean has the weight of a single additional observation. Finally, if the uncertainties are small, either the prior is infinitely more precise than the data (σX2 → 0) or the data are perfectly precise (σ 2 → 0). We end this example by rewriting the posterior mean and variance in a special form. Let us start with the mean: E(x | y) = μX +

nσX2 2 σ + nσX2

(¯ y − μX )

= μX + G (¯ y − μX ) .

(3.7)

3.3. Examples of Bayesian estimation

85

Figure 3.1. Scalar Gaussian distribution example. Prior $\mathcal{N}(20, 3)$ (dotted), instrument $\mathcal{N}(x, 1)$ (dashed), and posterior $\mathcal{N}(20.86, 0.43)$ (solid) distributions.

We conclude that the prior mean $\mu_X$ is adjusted toward the sample mean $\bar{y}$ by a gain (or amplification factor) of $G = 1/(1 + \sigma^2/n\sigma_X^2)$, multiplied by the innovation $\bar{y} - \mu_X$, and we observe that the variance ratio, between data and prior, plays an essential role. In the same way, the posterior variance can be reformulated as
\[ \sigma_{x|y}^2 = (1 - G)\, \sigma_X^2, \tag{3.8} \]
and the posterior variance is thus updated from the prior variance according to the same gain G. These last two equations, (3.7) and (3.8), are fundamental for a good understanding of DA, since they clearly express the interplay between prior and data and the effect that each has on the posterior.
Let us illustrate this with two initial numerical examples. Suppose we have a prior distribution $x \sim \mathcal{N}(\mu_X, \sigma_X^2)$ with mean 20 and variance 3. Suppose that our data model has the conditional law $y_i \mid x \sim \mathcal{N}(x, \sigma^2)$ with variance 1. Here the data are relatively precise compared to the prior. Say we have acquired two observations, $y = (19, 23)$. Now we can use (3.7) and (3.8) to compute the posterior distribution:
\[ E(x \mid y) = 20 + \frac{6}{1 + 6}(21 - 20) = 20.86, \qquad \sigma_{x|y}^2 = \left( 1 - \frac{6}{7} \right) 3 = 0.43, \]
thus yielding the posterior distribution $x \mid y \sim \mathcal{N}(20.86, 0.43)$, which represents the update of the prior according to the observations and takes into account all the uncertainties available—see Figure 3.1. In other words, we have obtained a complete forecast at a given point in time.

Figure 3.2. Scalar Gaussian distribution example. Prior $\mathcal{N}(20, 3)$ (dotted), instrument $\mathcal{N}(x, 10)$ (dashed), and posterior $\mathcal{N}(20.375, 1.875)$ (solid) distributions.

Now consider the same prior, $x \sim \mathcal{N}(20, 3)$, but with a relatively uncertain/imprecise observation model, $y_i \mid x \sim \mathcal{N}(x, 10)$, and the same two measurements, $y = (19, 23)$. Redoing the above calculations, we now find
\[ E(x \mid y) = 20 + \frac{6}{16}(21 - 20) = 20.375, \qquad \sigma_{x|y}^2 = \left( 1 - \frac{6}{16} \right) 3 = 1.875, \]
thus yielding the new posterior distribution, $x \mid y \sim \mathcal{N}(20.375, 1.875)$, which has virtually the same mean but a much larger variance—see Figure 3.2, where the scales on both axes have changed.
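Both numerical examples can be reproduced in a few lines of MATLAB using the gain form (3.7)–(3.8):

mu_X = 20; var_X = 3;                  % prior N(20, 3)
y = [19 23]; n = numel(y);             % the two observations
for var_obs = [1 10]                   % precise, then imprecise, instrument
    G = n * var_X / (var_obs + n * var_X);    % gain
    mu_post  = mu_X + G * (mean(y) - mu_X);   % posterior mean, (3.7)
    var_post = (1 - G) * var_X;               % posterior variance, (3.8)
    fprintf('sigma^2 = %2d: posterior N(%.3f, %.3f)\n', var_obs, mu_post, var_post);
end
% Prints N(20.857, 0.429) and N(20.375, 1.875), as computed above.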

3.3.2 Estimating a temperature
Suppose that the outside temperature measurement gives 2°C and the instrument has an error distribution that is Gaussian with mean $\mu = 2$ and variance $\sigma^2 = 0.64$—see the dashed curve in Figure 3.3. This is the model/data distribution. We also suppose that we have a prior distribution that estimates the temperature, with mean $\mu = 0$ and variance $\sigma^2 = 1.21$. The prior comes from either other observations, a previous model forecast, or physical and climatological constraints—see the dotted curve in Figure 3.3. By combining these two, using Bayes' formula, we readily compute the posterior distribution of the temperature given the observations, which has mean $\mu = 1.31$ and variance $\sigma^2 = 0.42$. This is the update or the analysis—see the solid curve in Figure 3.3.


Figure 3.3. A Gaussian product example for forecasting temperature: prior (dotted), instrument (dashed), and posterior (solid) distributions.

The code for this calculation can be found in [DART toolbox, 2013].

3.3.3 Estimating the parameters of a pendulum
We present an example of a simple mechanical system and seek an estimation of its parameters from noisy measurements. Consider a model for the angular displacement, $x_t$, of an ideal pendulum (no friction, no drag),
\[ x_t = \sin(\theta t) + \varepsilon_t, \]
where $\varepsilon_t$ is a Gaussian noise with zero mean and variance $\sigma^2$, the pendulum parameter is denoted by θ, and t is time. From these noisy measurements (suppose that the instrument is not very accurate) of $x_t$ we want to estimate θ, which represents the physical properties of the pendulum—in fact $\theta = \sqrt{g/L}$, where g is the gravitational constant and L is the pendulum's length. Using this physical model, can we estimate (or infer) the unknown physical parameters of the pendulum? If the measurements are independent, then the likelihood of a set of T observations $x_1, \dots, x_T$ is given by the product
\[ p(x_1, \dots, x_T \mid \theta) = \prod_{t=1}^{T} p(x_t \mid \theta). \]

88

Chapter 3. Statistical estimation and sequential data assimilation

5

0.3

1 0.9

4 0.25

0.8

3 0.7

0.2 2

0.6

1

0.15

0.5 0.4

0 0.1

0.3

-1 0.2

0.05 -2

-3

0.1

0

20

40

60

80

100

0 -0.1

0

0.1

0.2

0.3

0.4

0.5

0 -0.1

0

0.1

0.2

0.3

0.4

0.5

Figure 3.4. Bayesian estimation of noisy pendulum parameter, θ = 0.2. Observations of 100 noisy positions (left). Prior distribution of parameter values (center). Posterior distribution for θ (right).

where we have omitted the denominator. We are given the following table of priors:

[θ_min, θ_max]    p(θ_min < θ < θ_max)
[0, 0.05]         0.275
[0.05, 0.15]      0.15
[0.15, 0.25]      0.275
[0.25, 0.35]      0.025
[0.35, 0.45]      0.05
[0.45, 0.55]      0.225

After performing numerical simulations, we observe (see Figure 3.4) that the posterior for θ develops a prominent peak for a large number (T = 100) of measurements, centered around the real value θ = 0.2 (which was used to generate the time series $x_t$).
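A possible MATLAB sketch of this experiment follows (the grid, the noise level σ = 0.5, and all variable names are our choices; the book's own experiment may differ in its details):

theta_true = 0.2; sigma = 0.5; T = 100;
t = (1:T)';
x = sin(theta_true * t) + sigma * randn(T, 1);   % T noisy observations
theta = linspace(0.001, 0.549, 500)';            % grid of candidate parameters
% Piecewise-constant prior density built from the table of priors above:
edges = [0 0.05 0.15 0.25 0.35 0.45 0.55];
mass  = [0.275 0.15 0.275 0.025 0.05 0.225];
prior = zeros(size(theta));
for j = 1:6
    bin = theta >= edges(j) & theta < edges(j+1);
    prior(bin) = mass(j) / (edges(j+1) - edges(j));
end
% Log-posterior up to a constant: log prior + log likelihood
logpost = log(prior);
for j = 1:numel(theta)
    logpost(j) = logpost(j) - sum((x - sin(theta(j) * t)).^2) / (2 * sigma^2);
end
post = exp(logpost - max(logpost));
post = post / trapz(theta, post);                % normalize on the grid
plot(theta, post)                                % peaks near theta = 0.2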

3.3.4 Vector/multivariate Gaussian distribution example
As a final case that will lead us naturally to the following section, let us consider the vector/multivariate extension of the example in Section 3.3.1. We will now study a vector process, x, with n components and a prior distribution $x \sim \mathcal{N}(\mu, B)$, where the mean vector μ and the covariance matrix B are assumed to be known (as usual from historical data, model forecasts, etc.). The observation now takes the form of a data vector, y, of dimension p and has the conditional distribution/model $y \mid x \sim \mathcal{N}(Hx, R)$, where the (p × n) observation matrix H maps the process to the measurements and the error covariance matrix R is known. These are exactly the same matrices that we have already encountered in the variational approach—see Chapters 1 and 2. The difference is that now our modeling is placed in a richer, Bayesian framework. As before, we would like to calculate the posterior conditional distribution of $x \mid y$, given by
\[ p(x \mid y) \propto p(y \mid x)\, p(x). \]


Just as with the scalar/univariate case, the product of two Gaussians is Gaussian, and the posterior law is the multidimensional analogue of (3.6) and can be shown to take the form
\[ x \mid y \sim \mathcal{N}\left( \mu_{x|y}, \Sigma_{x|y} \right), \]
where
\[ \mu_{x|y} = \left( H^T R^{-1} H + B^{-1} \right)^{-1} \left( H^T R^{-1} y + B^{-1} \mu \right) \]
and
\[ \Sigma_{x|y} = \left( H^T R^{-1} H + B^{-1} \right)^{-1}. \]

As above, we will now rewrite the posterior mean and variance in a special form. The posterior conditional mean becomes
\[ E(x \mid y) = \left( H^T R^{-1} H + B^{-1} \right)^{-1} H^T R^{-1} y + \left( H^T R^{-1} H + B^{-1} \right)^{-1} B^{-1} \mu = \mu + K(y - H\mu), \tag{3.9} \]
where the gain matrix is now
\[ K = BH^T \left( R + HBH^T \right)^{-1}. \]
In the same manner, the posterior conditional covariance matrix can be reformulated as
\[ \Sigma(x \mid y) = \left( H^T R^{-1} H + B^{-1} \right)^{-1} = (I - KH)B, \tag{3.10} \]

with the same gain matrix K as for the posterior mean. As before, these last two equations, (3.9) and (3.10), are fundamental for a good understanding of DA, since they clearly express the interplay between prior and data and the effect that each has on the posterior. They are, in fact, the veritable foundation of DA.
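In MATLAB, the update (3.9)–(3.10) is a handful of lines. A sketch on a small toy problem (the sizes and matrices below are ours, chosen for illustration):

n = 4; p = 2;
mu = zeros(n, 1);                      % prior mean
B  = eye(n);                           % prior/background covariance
H  = [1 0 0 0; 0 0 1 0];               % observe state components 1 and 3
R  = 0.25 * eye(p);                    % observation error covariance
y  = [1.2; -0.7];                      % data vector
K  = B * H' / (R + H * B * H');        % gain, K = B H' (R + H B H')^{-1}
xa = mu + K * (y - H * mu);            % posterior mean, (3.9)
Pa = (eye(n) - K * H) * B;             % posterior covariance, (3.10)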

3.3.5 Connections with variational and sequential approaches
As was already indicated in the first two chapters of this book, the link between variational approaches and optimal BLUE is well established. The BLUE approach is also known as kriging in spatial geostatistics, or optimal interpolation (OI) in oceanography and atmospheric science. In the special, but quite widespread, case of a multivariate Gaussian model (for data and priors), the posterior mode (which is equivalent to the mean in this case) can equally be obtained by minimizing the quadratic objective function (2.31),
\[ J(x) = \frac{1}{2} \left( x - x^b \right)^T B^{-1} \left( x - x^b \right) + \frac{1}{2} (Hx - y)^T R^{-1} (Hx - y). \]
This is the fundamental link between variational and statistical approaches. Though strictly equivalent to the Bayes formulation, the variational approach has, until now, been privileged for operational, high-dimensional DA problems—though this is changing with the arrival of new hardware and software capabilities for treating "big data and extreme computing" challenges [Reed and Dongarra, 2015]. Since many physical systems are dynamic and evolve in time, we could improve our estimations considerably if, as new measurements became available, we could simply update the previous optimal estimate of the state process without having to redo all computations. The perfect framework for this sequential updating is the KF, which we will now present in detail.
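This equivalence is easy to verify numerically. Reusing the toy variables of the previous sketch, the minimizer of J(x) solves (B⁻¹ + HᵀR⁻¹H)x = B⁻¹xᵇ + HᵀR⁻¹y and coincides with the analysis (3.9):

xb = mu;                               % background state
J = @(x) 0.5*(x - xb)'*(B\(x - xb)) + 0.5*(H*x - y)'*(R\(H*x - y));
x_var = (inv(B) + H'*(R\H)) \ (B\xb + H'*(R\y));   % minimizer of J
norm(x_var - xa)                       % ~1e-16: same state as mu + K(y - H mu)
J(x_var) < J(x_var + 0.1*randn(n, 1))  % true: x_var is indeed the minimizer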


3.4 Sequential DA and Kalman filters
We have seen that DA, from the statistical/Bayesian point of view, strives to have as complete a knowledge as possible of the a posteriori probability law, that is, the conditional law of the state given the observations. But it is virtually impossible to determine the complete distribution, so we seek instead an estimate of its statistical parameters, such as its mean and/or its variance. Numerous proven statistical methods can lead to best or optimal estimates [Anderson and Moore, 1979; Garthwaite et al., 2002; Hogg et al., 2013; Ross, 2014]; for example, the minimum variance (MV) estimator is the conditional mean of the state given the observations, and the maximum a posteriori (MAP) estimator produces the mode of the conditional distribution. As seen above, assuming Gaussian distributions for the measurements and the process, we can determine the complete a posteriori law, and, in this case, it is clear that the MV and MAP estimators coincide. In fact, the MV estimator produces the optimal interpolation (OI) or kriging equations, whereas the MAP estimator leads to 3D-Var. In conclusion, for the case of a linear observation operator together with Gaussian error statistics, 3D-Var and OI are strictly equivalent.
So far we have been looking at the spatial estimation problem, where all observations are distributed in space but at a single instant in time. For stationary stochastic processes, the mean and covariance are constant in time [Parzen, 1999; Ross, 1997], so such a DA scheme could be used at different times based on the invariant statistics. This is not so rare: in practice, for global NWP, the errors have been considered stationary over a one-month time scale (this assumption is no longer employed at Météo-France or ECMWF, for example). However, for general environmental applications, the governing equations vary with time and we must take into account nonstationary processes.
Within the significant box of mathematical tools that can be used for statistical estimation from noisy sensor measurements over time, one of the most well known and often used is the Kalman filter (KF). The KF is named after Rudolph E. Kalman, who in 1960 published his famous paper describing a recursive solution to the discrete data linear filtering problem [Kalman, 1960]. There exists a vast literature on the KF, and a very "friendly" introduction to the general idea of the KF can be found in Chapter 1 of Maybeck [1979]. As just stated above, it would be ideal and very efficient if, as new data or measurements became available, we could easily update the previous optimal estimates without having to recompute everything. The KF provides exactly this solution. To this end, we will now consider a dynamical system that evolves in time, and we will seek to estimate a series of true states, $x^t_k$ (a sequence of random vectors), where discrete time is indexed by the letter k. These times are those when the observations or measurements are taken, as shown in Figure 3.5. The assimilation starts with an unconstrained model trajectory from $t_0, t_1, \dots, t_{k-1}, t_k, \dots, t_n$ and aims to provide an optimal fit to the available observations/measurements given their uncertainties (error bars). For example, in current synoptic scale weather forecasts, $t_k - t_{k-1} = 6$ hours; the time step is less for the convective scale.

Figure 3.5. Sequential assimilation: a computed model trajectory, observations, and their error bars.

3.4.1 Bayesian modeling
Let us recall the principles of Bayesian modeling from Section 3.2 on statistical estimation and rewrite them in the terminology of the DA problem. We have a vector, x, of (unknown) unobserved quantities of interest (temperature, pressure, wind, etc.) and


a vector, y, of (known) observed data (at various locations, and at various times). The full joint probability model can always be factored into two components,
\[ p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y), \]
and thus
\[ p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}, \]
provided that $p(y) \neq 0$. The KF can be rigorously derived from this Bayesian perspective following the presentation above in Section 3.3.4.

3.4.2 Stochastic model of the system
We seek to estimate the state $x \in \mathbb{R}^n$ of a discrete-time dynamic process that is governed by the linear stochastic difference equation
\[ x_{k+1} = M_{k+1} x_k + w_k, \tag{3.11} \]
with a measurement/observation $y \in \mathbb{R}^m$:
\[ y_k = H_k x_k + v_k. \tag{3.12} \]

Note that $M_{k+1}$ and $H_k$ are considered linear here. The random vectors $w_k$ and $v_k$ represent the process/modeling and measurement/observation errors, respectively. They are assumed to be independent, white-noise processes with Gaussian/normal probability distributions
\[ w_k \sim \mathcal{N}(0, Q_k), \qquad v_k \sim \mathcal{N}(0, R_k), \]


where Q and R are the covariance matrices (assumed known) of the modeling and observation errors, respectively. All these assumptions about unbiased and uncorrelated errors (in time and between each other) are not limiting, since extensions of the standard KF can be developed should any of these not be valid—see below and Chapters 5, 6, and 7. We note that for a broader mathematical view on the above system, we could formulate everything in terms of stochastic differential equations (SDEs). Then the theory of Itô can provide a detailed solution of the problem of optimal filtering as well as existence and uniqueness results—see Oksendal [2003], where one can find such a precise mathematical formulation.

3.4.3 Sequential assimilation scheme
The typical assimilation scheme is made up of two major steps: a prediction/forecast step and a correction/analysis step. At time $t_k$ we have the result of a previous forecast, $x^f_k$ (the analogue of the background state $x^b_k$), and the result of an ensemble of observations in $y_k$. Based on these two vectors, we perform an analysis that produces $x^a_k$. We then use the evolution model to obtain a prediction of the state at time $t_{k+1}$. The result of the forecast is denoted $x^f_{k+1}$ and becomes the background, or initial guess, for the next time step. This process is summarized in Figure 3.6.

Figure 3.6. Sequential assimilation scheme for the KF. The x-axis denotes time; the y-axis denotes the values of the state and observation vectors.

The KF problem can be summarized as follows: given a prior/background estimate, $x^f_k$, of the system state at time $t_k$, what is the best update/analysis, $x^a_k$, based on the currently available measurements, $y_k$? We can now define forecast (a priori) and analysis (a posteriori) estimate errors as
\[ e^f_k = x^f_k - x^t_k, \qquad e^a_k = x^a_k - x^t_k, \]


where $x^t_k$ is the (unknown) true state. Their respective error covariance matrices are
\[ P^f_k = \mathrm{cov}(e^f_k) = E\left[ e^f_k (e^f_k)^T \right], \qquad P^a_k = \mathrm{cov}(e^a_k) = E\left[ e^a_k (e^a_k)^T \right]. \tag{3.13} \]

The goal of the KF is to compute an optimal a posteriori estimate, $x^a_k$, that is a linear combination of an a priori estimate, $x^f_k$, and a weighted difference between the actual measurement, $y_k$, and the measurement prediction, $H_k x^f_k$. This is none other than the BLUE that we have seen above. The filter is thus of the linear, recursive form
\[ x^a_k = x^f_k + K_k \left( y_k - H_k x^f_k \right). \tag{3.14} \]
The difference $d_k = y_k - H_k x^f_k$ is called the innovation and reflects the discrepancy between the actual and the predicted measurements at time $t_k$. Note that for generality, the matrices are shown with a time dependence. When this is not the case, the subscripts k can be dropped. The Kalman gain matrix, K, is chosen to minimize the a posteriori error covariance equation (3.13). To compute this optimal gain requires a careful derivation. Begin by substituting the observation equation (3.12) into the linear filter equation (3.14):
\[ x^a_k = x^f_k + K_k \left( H_k x^t_k + v_k - H_k x^f_k \right) = x^f_k + K_k \left( H_k (x^t_k - x^f_k) + v_k \right). \]
Now place this last expression into the definition of $e^a_k$:
\[ e^a_k = x^a_k - x^t_k = x^f_k + K_k \left( H_k (x^t_k - x^f_k) + v_k \right) - x^t_k = K_k \left( -H_k (x^f_k - x^t_k) + v_k \right) + \left( x^f_k - x^t_k \right). \]
Then substitute in the error covariance equation (3.13):
\[ P^a_k = E\left[ e^a_k (e^a_k)^T \right] = E\left\{ \left[ K_k \left( v_k - H_k (x^f_k - x^t_k) \right) + \left( x^f_k - x^t_k \right) \right] \left[ K_k \left( v_k - H_k (x^f_k - x^t_k) \right) + \left( x^f_k - x^t_k \right) \right]^T \right\}. \]
Now perform the indicated expectations over the r.v.'s, noting that $x^f_k - x^t_k = e^f_k$ is the a priori estimation error, that this error is uncorrelated with the observation error $v_k$, that by definition $P^f_k = E\left[ e^f_k (e^f_k)^T \right]$, and that $R_k = E\left[ v_k v_k^T \right]$. We thus get
\[ P^a_k = E\left\{ \left[ K_k (v_k - H_k e^f_k) + e^f_k \right] \left[ K_k (v_k - H_k e^f_k) + e^f_k \right]^T \right\} = E\left\{ \left[ (I - K_k H_k) e^f_k + K_k v_k \right] \left[ (I - K_k H_k) e^f_k + K_k v_k \right]^T \right\} \]
\[ = (I - K_k H_k)\, P^f_k\, (I - K_k H_k)^T + K_k R_k K_k^T. \tag{3.15} \]

Note that this is a completely general formula for the updated covariance matrix and that it is valid for any gain $K_k$, not necessarily optimal.


Now we still need to compute the optimal gain that minimizes the matrix entries along the principal diagonal of $P^a_k$, since these terms are the ones that represent the estimation error variances for the entries of the state vector itself. We will use the classical approach of variational calculus, by taking the derivative of the trace of the result with respect to K and then setting the resulting derivative expression equal to zero. But for this, we require two results from matrix differential calculus [Petersen and Pedersen, 2012]. These are
\[ \frac{d}{dA} \mathrm{Tr}(AB) = B^T, \qquad \frac{d}{dA} \mathrm{Tr}(ACA^T) = 2AC, \]
where Tr denotes the matrix trace operator and we assume that AB is square and that C is a symmetric matrix. The derivative of a scalar quantity with respect to a matrix is defined as the matrix of derivatives of the scalar with respect to each element of the matrix. Before differentiating, we expand (3.15) to obtain
\[ P^a_k = P^f_k - K_k H_k P^f_k - P^f_k H_k^T K_k^T + K_k \left( H_k P^f_k H_k^T + R_k \right) K_k^T. \]
There are two linear terms and one quadratic term in $K_k$. To minimize the trace of $P^a_k$, we can now apply the above matrix differentiation formulas (supposing that the individual squared errors are also minimized when their sum is minimized) to obtain
\[ \frac{d\, \mathrm{Tr}(P^a_k)}{dK_k} = -2 \left( H_k P^f_k \right)^T + 2 K_k \left( H_k P^f_k H_k^T + R_k \right). \]
Setting this last result equal to zero, we can finally solve for the optimal gain. The resulting K that minimizes equation (3.13) is given by
\[ K_k = P^f_k H_k^T \left( H_k P^f_k H_k^T + R_k \right)^{-1}, \tag{3.16} \]
where we remark that $H_k P^f_k H_k^T + R_k = E\left[ d_k d_k^T \right]$ is the covariance of the innovation. Looking at this expression for $K_k$, we see that when the measurement error covariance, $R_k$, approaches zero, the gain, $K_k$, weights the innovation more heavily, since
\[ \lim_{R_k \to 0} K_k = H_k^{-1}. \]
On the other hand, as the a priori error estimate covariance, $P^f_k$, approaches zero, the gain, $K_k$, weights the innovation less heavily, and
\[ \lim_{P^f_k \to 0} K_k = 0. \]

Another way of thinking about the weighting of K is that as the measurement error covariance, R, approaches zero, the actual measurement, $y_k$, is "trusted" more and more, while the predicted measurement, $H_k x^f_k$, is trusted less and less. On the other hand, as the a priori error estimate covariance, $P^f_k$, approaches zero, the actual measurement, $y_k$, is trusted less and less, while the predicted measurement, $H_k x^f_k$, is trusted more and more—see the computational example below.


The covariance matrix associated with the optimal gain can now be computed from (3.15). We already have
\[ P^a_k = (I - K_k H_k)\, P^f_k\, (I - K_k H_k)^T + K_k R_k K_k^T = P^f_k - K_k H_k P^f_k - P^f_k H_k^T K_k^T + K_k \left( H_k P^f_k H_k^T + R_k \right) K_k^T, \]
and, substituting the optimal gain (3.16), we can derive three more alternative expressions:
\[ P^a_k = P^f_k - P^f_k H_k^T \left( H_k P^f_k H_k^T + R_k \right)^{-1} H_k P^f_k, \]
\[ P^a_k = P^f_k - K_k \left( H_k P^f_k H_k^T + R_k \right) K_k^T, \]
and
\[ P^a_k = (I - K_k H_k)\, P^f_k. \tag{3.17} \]

Each of these four expressions for $P^a_k$ would give the same results with perfectly precise arithmetic, but in real-world applications some may perform better numerically. In what follows, we will use the simplest form (3.17), but this is by no means restrictive, and any one of the others could be substituted.
3.4.3.1 Predictor/forecast step

We start from a previous analyzed state, $x^a_k$, or from the initial state if k = 0, characterized by the Gaussian PDF $p(x^a_k \mid y^o_{1:k})$ of mean $x^a_k$ and covariance matrix $P^a_k$. We use here the classical notation $y_{i:j} = (y_i, y_{i+1}, \dots, y_j)$ for $i \leq j$ that denotes conditioning on all the observations in the interval. An estimate of $x^t_{k+1}$ is given by the dynamical model, which defines the forecast as
\[ x^f_{k+1} = M_{k+1} x^a_k, \tag{3.18} \]
\[ P^f_{k+1} = M_{k+1} P^a_k M_{k+1}^T + Q_{k+1}, \tag{3.19} \]
where the expression for $P^f_{k+1}$ is obtained from the dynamics equation and the definition of the model noise covariance, Q.
3.4.3.2 Corrector/analysis step

At time $t_{k+1}$, the PDF $p(x^f_{k+1} \mid y^o_{1:k})$ is known, thanks to the mean, $x^f_{k+1}$, and covariance matrix, $P^f_{k+1}$, just calculated, as well as the assumption of a Gaussian distribution. The analysis step then consists of correcting this PDF using the observation available at time $t_{k+1}$ to compute $p(x^a_{k+1} \mid y^o_{1:k+1})$. This comes from the BLUE in the dynamical context and gives
\[ K_{k+1} = P^f_{k+1} H^T \left( H P^f_{k+1} H^T + R_{k+1} \right)^{-1}, \tag{3.20} \]
\[ x^a_{k+1} = x^f_{k+1} + K_{k+1} \left( y_{k+1} - H x^f_{k+1} \right), \tag{3.21} \]
\[ P^a_{k+1} = \left( I - K_{k+1} H \right) P^f_{k+1}. \tag{3.22} \]

The predictor–corrector loop is illustrated in Figure 3.7 and can be immediately transposed into an operational algorithm.


Time Update ("Predict")
(1) Project the state ahead: $x^f_{k+1} = M x^a_k$.
(2) Project the error covariance ahead: $P^f_{k+1} = M P^a_k M^T + Q$.
Measurement Update ("Correct")
(1) Compute the Kalman gain: $K_{k+1} = P^f_{k+1} H^T (H P^f_{k+1} H^T + R)^{-1}$.
(2) Update the estimate with the measurement: $x^a_{k+1} = x^f_{k+1} + K_{k+1}(y_{k+1} - H x^f_{k+1})$.
(3) Update the error covariance: $P^a_{k+1} = (I - K_{k+1} H) P^f_{k+1}$.
Initialization: initial estimates for $x^a_k$ and $P^a_k$.

Figure 3.7. Kalman filter loop.
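The loop of Figure 3.7 translates directly into MATLAB. The following sketch runs it on a scalar random-walk state observed directly (the model and all numerical values are ours, chosen only for illustration):

M = 1; H = 1; Q = 0.01; R = 0.25;      % model, observation operator, error stats
nsteps = 50;
[xt, yo, xa] = deal(zeros(1, nsteps)); % truth, observations, analyses
xtrue = 0; xak = 0; Pak = 1;           % initialization of x_0^a and P_0^a
for k = 1:nsteps
    xtrue = M * xtrue + sqrt(Q) * randn;     % synthetic truth, cf. (3.11)
    yo(k) = H * xtrue + sqrt(R) * randn;     % synthetic observation, cf. (3.12)
    xf = M * xak;                            % time update (3.18)
    Pf = M * Pak * M' + Q;                   % (3.19)
    K  = Pf * H' / (H * Pf * H' + R);        % measurement update (3.20)
    xak = xf + K * (yo(k) - H * xf);         % (3.21)
    Pak = (1 - K * H) * Pf;                  % (3.22); I = 1 in the scalar case
    xt(k) = xtrue; xa(k) = xak;
end
plot(1:nsteps, xt, 1:nsteps, yo, 'o', 1:nsteps, xa)  % truth, data, analysis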

3.4.4 Note on the relation between Bayes and BLUE
If we know that the a priori and the observation data are both Gaussian, Bayes' rule can be applied to compute the a posteriori PDF. The a posteriori PDF is then Gaussian, and its parameters are given by the BLUE equations. Hence, with Gaussian PDFs and a linear observation operator, there is no need to use Bayes' rule. The BLUE equations can be used instead to compute the parameters of the resulting PDF. Since the BLUE provides the same result as Bayes' rule, it is the best estimator of all. In addition (see the previous chapter), one can recognize the 3D-Var cost function. By minimizing this cost function, 3D-Var finds the MAP estimate of the Gaussian PDF, which is equivalent to the MV estimate found by the BLUE.

3.5 Implementation of the Kalman Filter
We now describe some important implementation issues and discuss ways to overcome the difficulties that they give rise to.

3.5.1 Stability and convergence
Stability is a concern for any dynamic system. The KF will be uniformly asymptotically stable if the system model itself is controllable and observable. The reader is referred to Friedland [1986], Gelb [1974], and Maybeck [1979] for detailed explanations of these concepts. If the model is linear and time invariant (i.e., the system matrices do not vary with time), the autocovariances will converge toward steady-state values. Consequently, the KF gain will converge toward a steady-state KF gain value that can be precalculated by solving an algebraic Riccati equation. It is quite common to use only the steady-state gain in applications. For a nonlinear system, the gain may vary with the operating point (if the system matrix of the linearized model varies with the operating point).


In practical applications, the gain may be recalculated as the operating point changes. In practice, there are a vast number of different sources of nonconvergence. In Grewal and Andrews [2001], the reader can find a very well explained presentation of all of these (and many more). In particular, as we will point out, there are various remedies for
• convergence, divergence, and failure to converge;
• testing for unpredictable behavior;
• effects due to incorrect modeling;
• reduced-order and suboptimal filtering (see Chapter 5);
• reduction of round-off errors and computational expenses;
• analysis and repair of covariance matrices (see next subsection).

3.5.2 Filter divergence and covariance matrices
If the a priori statistical information is not well specified, the filter might underestimate the variances of the state errors, $e^a_k$. Too much confidence is put in the state estimation and too little confidence is put in the information contained in the observations. The effect of the analysis is minimized, and the gain becomes too small. In the most extreme case, observations are simply rejected. This is known as filter divergence, where the filter seems to behave well, with low predicted analysis error variance, but where the analysis is in fact drifting away from the reality. Very often filter divergence is easy to diagnose:
• state error variances are small,
• the time sequence of innovations is biased, and
• the Kalman gains tend to zero as time increases.
It is thus important to monitor the innovation sequence and check that it is "white," i.e., unbiased and normally distributed. If this is not the case, then some of your assumptions are not valid. There are a few rules to follow to avoid divergence:
• Do not underestimate model errors; rather, overestimate them.
• If possible, it is better to use an adaptive scheme to tune model errors by estimating them on the fly using the innovations.
• Give more weight to recent data, thus reducing the filter's memory of older data and forcing the data into the KF.
• Place some empirical, relevant lower bound on the Kalman gains.


3.5.3 Problem size and optimal interpolation
The straightforward application of the KF implies the "propagation" of an n × n covariance matrix at each time step. This can result in a very large problem in terms of computations and storage. If the state has a spatial dimension of $10^7$ (which is not uncommon in large-scale geophysical and other simulations), then the covariance matrices will be of order $10^{14}$, which will exceed the resources of most available computer installations. To overcome this, we must resort to various suboptimal schemes (an example of which is detailed below) or switch to ensemble approaches (see Chapters 6 and 7).
If the computational cost of propagating $P^a_{k+1}$ is an issue, we can use a frozen covariance matrix, $P^a_k = P^b$, $k = 1, \dots, n$. This defines the OI class of methods. Under this simplifying hypothesis, the two-step assimilation cycle defined above becomes the following:
1. Forecast:
\[ x^f_{k+1} = M_{k+1} x^a_k, \qquad P^f_{k+1} = P^b. \]
2. Analysis:
\[ K_{k+1} = P^b H^T \left( H P^b H^T + R_{k+1} \right)^{-1}, \qquad x^a_{k+1} = x^f_{k+1} + K_{k+1} \left( y_{k+1} - H x^f_{k+1} \right), \qquad P^a_{k+1} = P^b. \]

There are at least two ways to compute the static covariance matrix $P^b$. The first is an analytical formulation, $P^b = D^{1/2} C D^{1/2}$, where D is a diagonal matrix of variances and C is a correlation matrix that can be defined, for example, as
\[ C_{ij} = \left( 1 + ah + \frac{1}{3} a^2 h^2 \right) e^{-ah}, \]
where a is a tuneable parameter, h is the grid size, and the exponential function provides a local spatial dependence effect that often corresponds well to the physics. The second approach uses an ensemble of $N_e$ snapshots of the state vector taken from a model free run, from which we compute the first and second statistical moments as follows:
\[ \bar{x}^b = \frac{1}{N_e} \sum_{l=1}^{N_e} x_l, \qquad P^b = \frac{1}{N_e - 1} \sum_{l=1}^{N_e} \left( x_l - \bar{x}^b \right) \left( x_l - \bar{x}^b \right)^T. \]

The static approach is more suited to successive assimilation cycles that are separated by a long enough time delay so that the corresponding dynamical states are sufficiently decorrelated. Other methods are detailed in the sections on reduced methods—see Chapter 5.
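Both constructions are easy to prototype in MATLAB (here we read h as the distance between grid points i and j, and all numerical values are our own illustrative choices):

ns = 100; a = 0.3; h = 1;
% (i) Analytic form P_b = D^{1/2} C D^{1/2}:
d = h * abs((1:ns)' - (1:ns));                  % |i - j| distances on the grid
C = (1 + a*d + (a^2 * d.^2) / 3) .* exp(-a*d);  % correlation matrix C_ij
D = 0.5 * eye(ns);                              % diagonal matrix of variances
Pb_analytic = sqrt(D) * C * sqrt(D);
% (ii) Ensemble estimate from Ne snapshots stored as the columns of X:
Ne = 200;
X = randn(ns, Ne);                              % stand-in for a model free run
A = X - mean(X, 2);                             % anomalies x_l - xbar^b
Pb_ensemble = (A * A') / (Ne - 1);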


3.5.4 Evolution of the state error covariance matrix
In principle, equation (3.19) generates a symmetric matrix. In practice, this may not be the case, and numerical truncation errors may lead to an asymmetric covariance matrix and a subsequent collapse of the filter. A remedy is to add an extra step to enforce symmetry, such as
\[ P^f_{k+1} = \frac{1}{2} \left( P^f_{k+1} + (P^f_{k+1})^T \right), \]
or a square root decomposition—see Chapter 5.

3.6 Nonlinearities and extensions of the KF
In real-life problems, we are most often confronted with a nonlinear process and/or a nonlinear measurement operator. Our dynamic system now takes the more general form
\[ x_{k+1} = \mathcal{M}_{k+1}(x_k) + w_k, \qquad y_k = \mathcal{H}_k(x_k) + v_k, \]
where $\mathcal{M}_{k+1}$ now represents a nonlinear function of the state at time step k and $\mathcal{H}_k$ represents the nonlinear observation operator. To deal with these nonlinearities, one approach is to linearize about the current mean and covariance, which is called the extended Kalman filter (EKF). This approach and its variants are presented in Chapter 6.
As previously mentioned, the KF is only optimal in the case of Gaussian statistics and linear operators, in which case the first two moments (the mean and the covariances) suffice to describe the PDF entering the estimation problem. Practitioners report that the linearized extension to nonlinear problems, the EKF, only works for moderate deviations from linearity and Gaussianity. The ensemble Kalman filter (EnKF) [Evensen, 2009] is a method that has been designed to deal with nonlinearities and non-Gaussian statistics, whereby the PDF is described by an ensemble of $N_e$ time-dependent states $x_{k,e}$. This method is presented in detail in Chapter 6. The appeal of this approach is its conceptual simplicity, the fact that it does not require any TLM or adjoint model (see Chapter 2), and the fact that it is extremely well suited to parallel programming paradigms, such as MPI [Gropp et al., 2014].
What happens if both the models are nonlinear and the PDFs are non-Gaussian? The KF and its extensions are no longer optimal and, more important, can easily fail the estimation process. Another approach must be used. A promising candidate is the particle filter, which is described below. The particle filter (see [Doucet and Johansen, 2011] and references therein) works sequentially in the spirit of the KF, but unlike the latter, it handles an ensemble of states (the particles) whose distribution approximates the PDF of the true state. Bayes' rule (3.5) and the marginalization formula (3.1) are explicitly used in the estimation process. The linear and Gaussian hypotheses can then be ruled out, in theory. In practice, though, the particle filter cannot yet be applied to very high dimensional systems (this is often referred to as "the curse of dimensionality").
Finally, there is a new class of hybrid methods, called ensemble variational methods, that attempt to combine variational and ensemble approaches—see Chapter 7 for a detailed presentation. The aim is to seek compromises to exploit the best aspects of (4D) variational and ensemble DA algorithms.


For further details of all these extensions, the reader should consult the advanced methods section (Part II) and the above references.

3.7 Particle filters for geophysical applications
Can we actually design a filtering numerical algorithm that converges to the Bayesian solution? Such a numerical approach would typically belong to the class of sequential Monte Carlo methods. That is to say, a PDF is represented by a discrete sample of the targeted PDF. Rather than trying to compute the exact solution of the Bayesian filtering equations, the transformations of such filtering (Bayes' rule for the analysis; model propagation for the forecast) are applied to the members of the sample. The statistical properties of the sample, such as the moments, are meant to be those of the targeted PDF. Obviously this sampling strategy can only be exact in the asymptotic limit, that is, in the limit where the number of members (or particles) goes to infinity. This is the focus of a large body of applied mathematics that led to the design of many very successful Monte Carlo type methods [see, for instance, Doucet et al., 2001]. However, they have mostly been applied to very low dimensional systems (only a few dimensions). Their efficiency for high-dimensional models has been studied more recently, in particular thanks to a strong interest in these approaches in the geosciences. In the following, we give a brief, biased overview of the subject as seen by the geosciences DA community.

3.7.1 Sequential Monte Carlo
The most popular and simple algorithm of Monte Carlo type that solves the Bayesian filtering equations is called the bootstrap particle filter [Gordon et al., 1993]. Its description follows.
3.7.1.1 Sampling
Let us consider a sample of particles $\{x^1, x^2, \dots, x^m\}$. The related PDF at time $t_k$ is $p_k(x)$, where
\[ p_k(x) \simeq \sum_{i=1}^{m} \omega_k^i\, \delta(x - x_k^i), \]
δ is the Dirac mass, and the sum is meant to be an approximation of the exact density that the sample emulates. A positive scalar, $\omega_k^i$, weights the importance of particle i within the ensemble. At this stage, we assume that the weights, $\omega_k^i$, are uniform and $\omega_k^i = 1/m$.
3.7.1.2 Forecast
At the forecast step, the particles are propagated by the model without approximation,
\[ p_{k+1}(x) \simeq \sum_{i=1}^{m} \omega_k^i\, \delta(x - x_{k+1}^i), \quad \text{with } x_{k+1}^i = \mathcal{M}_{k+1}(x_k^i). \]
A stochastic noise can be optionally added to the dynamics of each particle (see below).
3.7.1.3 Analysis

The analysis step of the particle filter is extremely simple and elegant. The rigorous implementation of Bayes’ rule ascribes to each particle a statistical weight that corresponds to the likelihood of the particle given the data. The weight of each particle is updated according to (see Figure 3.8) a,i f,i ωk+1 ∝ ωk+1 p(yk+1 |xik+1 ) .

(3.23)
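In code, the analysis step is indeed just a reweighting. The following MATLAB sketch (our own illustration, in which X, w, y, m, and the likelihood handle lik are assumed to be supplied by the user) updates and normalizes the weights of an ensemble of particles:

% Bootstrap particle filter analysis: reweight each particle by the
% likelihood of the observation y given that particle, then normalize.
% X is the n-by-m matrix of particles, w the m-vector of prior weights,
% and lik(y, x) a user-supplied likelihood function handle.
for i = 1:m
    w(i) = w(i) * lik(y, X(:,i));
end
w = w / sum(w);    % normalized posterior weights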


Figure 3.8. Analysis of the particle filter. The initial ensemble of particles is sampled from a normal prior, with equal weights (bottom). Given an observation with Gaussian noise and the relative state likelihood (bottom), the particle filter analysis ascribes a weight to each particle, which is proportional to the likelihood of the particle given the observation (top). The major axis of the ellipses, representing the particles, is proportional to the particle weight.

It is remarkable that the analysis is carried out with only a few multiplications. It does not involve inverting any system or matrix, as opposed, for instance, to the KF.

3.7.1.4 Resampling

Unfortunately, these normalized statistical weights have a potentially large amplitude of fluctuation. Even worse, as sequential filtering progresses, one particle (one trajectory of the model) will stand out from the others: its weight will largely dominate the others ($\omega^i \simeq 1$), while the other weights will vanish. The particle filter then becomes very inefficient as an estimation tool, since it has lost its variability. This phenomenon is called degeneracy of the particle filter [Kong et al., 1994]. An example of such degeneracy is given in Figure 3.9, where the statistical properties of the largest weight are studied on a meteorological toy model with 40 and 80 variables. In a degenerate case, the maximum weight will often reach 1 or a value close to 1, whereas in a balanced case, values very close to 1 are less frequent. One way to mitigate this phenomenon is to resample the particles by redrawing a sample with uniform weights from the degenerate distribution. After resampling, all particles have the same weight, $\omega_k^i = 1/m$. The particle filter is very efficient for highly nonlinear models of low dimensionality. Unfortunately, it is not suited for DA systems with models of high dimension, as soon as the dimension exceeds, say, about 10. Avoiding degeneracy requires a great number of particles.


Figure 3.9. On the left: statistical distribution of the maximal weight of the bootstrap particle filter in a balanced case. The physical system is a Lorenz-95 model with 40 variables [Lorenz and Emanuel, 1998]. On the right: the same particle filter is applied to a Lorenz-95 low-order model, but with 80 variables. The maximal weight clearly degenerates, with a peak close to 1.

This number typically increases exponentially with the system state space dimension, because the support of the prior PDF overlaps exponentially less with the support of the likelihood as the dimension of the state space increases. This is known as the curse of dimensionality. For the forecast step, it can also be crucial to introduce stochastic perturbations of the states. Indeed, the ensemble becomes impoverished by the many resamplings it has to undergo; to enrich the sample, it is necessary to stochastically perturb the states of the system.
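To make the resampling step concrete, here is a minimal MATLAB sketch of multinomial resampling, in which particle indices are redrawn with probabilities given by the weights; the variable names are our own and not fixed by the text:

% Multinomial resampling: redraw m particle indices according to the
% normalized weights w (vector summing to one), then reset the weights.
m = numel(w);
edges = [0; cumsum(w(:))];           % CDF of the weights
[~, idx] = histc(rand(m,1), edges);  % bin indices = resampled particles
X = X(:, idx);                       % X is the n-by-m matrix of particles
w = ones(m,1)/m;                     % uniform weights after resampling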

3.7.2 Application in the geosciences

The applicability of particle filters to high-dimensional models has been investigated in the geosciences [van Leeuwen, 2009; Bocquet et al., 2010]. The impact of the curse of dimensionality has been quantitatively studied in Snyder et al. [2008]. It has been shown, on a heuristic basis, that the number of particles m required to efficiently track the system must scale with the variance of the log-likelihood,

$\ln(m) \propto \mathrm{Var}\left[\ln p(y|x)\right],$  (3.24)

which usually scales like the size of the system for typical geophysical problems. It is known [see, for instance, MacKay, 2003] that using an importance proposal to guide the particles toward regions of high probability will not change this trend, albeit with a smaller proportionality factor in (3.24). Snyder et al. [2015] confirmed this and gave bounds for the optimal proposal, i.e., for particle filters that use an importance proposal leading to a minimal variance of the weights. They again conclude that the effective ensemble size depends exponentially on the problem dimension. When smoothing is combined with a particle filter over a DA window (the filter then becomes a particle smoother), alternative and more efficient particle filters can be designed, such as the implicit particle filter [Morzfeld et al., 2012]. Particle filters can nevertheless be useful for high-dimensional models if the significant degree of nonlinearity is confined to a small subspace of the state space. For instance, in Lagrangian DA, the errors on the location of moving observation platforms have significantly non-Gaussian statistics. In this case, these degrees of freedom can be addressed with a particle filter, while the rest is controlled by an EnKF, which is practical for high-dimensional models [Slivinski et al., 2015]. If we drop the assumption that a particle filter should have the proper Bayesian asymptotic limit, it becomes possible to design nonlinear filters for DA with high-dimensional models, such as the equal-weight particle filter (see [Ades and van Leeuwen, 2015] and references therein).


Finally, if the system cannot be split, a solution for implementing a particle filter in high dimension could come from localization, just as with the EnKF (Chapter 6). This has proven more difficult, because locally updated particles cannot easily be glued together into global particles. However, an ensemble transform representation built for the EnKF [Bishop et al., 2001] is better suited to ensure a smooth gluing of the local updates [Reich, 2013]. An astute merging of the particles has been shown to yield a local particle filter that can outperform the EnKF in specific regimes with a moderate number of particles [Poterjoy, 2016].

3.8 Examples

In this section we present a number of examples of special cases of the KF, both analytical and numerical. Though they may seem overly simple, the intention is that you, the user, gain the best possible feeling and intuition regarding the actual operation of the filter. This understanding is essential for more complex cases, such as those presented in the advanced methods and applications chapters.

Example 3.31. Case without observations. Here, the observation matrix $H_k = 0$ and thus $K_k = 0$ as well. Hence the KF equations (3.18)–(3.22) reduce to

$x^f_{k+1} = M_{k+1} x^a_k, \quad P^f_{k+1} = M_{k+1} P^a_k M^T_{k+1} + Q_{k+1},$

and

$K_{k+1} = 0, \quad x^a_{k+1} = x^f_{k+1}, \quad P^a_{k+1} = P^f_{k+1}.$

Thus, we can completely eliminate the analysis stage of the algorithm to obtain

$x^f_{k+1} = M_{k+1} x^f_k, \quad P^f_{k+1} = M_{k+1} P^f_k M^T_{k+1} + Q_{k+1},$

initialized by $x^f_0 = x_0$, $P^f_0 = P_0$. The model then runs without any input of data, and if the dynamics are neutral or unstable, the forecast error will grow without limit. For example, in a typical NWP assimilation cycle, where observations are obtained every 6 hours, the model runs for 6 hours without data. During this period, the forecast error grows and is damped only when the data arrive, thus giving rise to the characteristic "sawtooth" pattern of error variance evolution.


Example 3.32. Perfect observations at all grid points. In the case of perfect observations, the observation error covariance matrix $R_k = 0$ and the observation operator H is the identity. Hence the KF equations (3.18)–(3.22) reduce to

$x^f_{k+1} = M_{k+1} x^a_k, \quad P^f_{k+1} = M_{k+1} P^a_k M^T_{k+1} + Q_{k+1},$

and

$K_{k+1} = P^f_{k+1} H^T \left( H P^f_{k+1} H^T \right)^{-1} = I, \quad x^a_{k+1} = x^f_{k+1} + \left( y_{k+1} - x^f_{k+1} \right), \quad P^a_{k+1} = \left( I - K_{k+1} H \right) P^f_{k+1} = 0.$

This is obviously another case of ideal observations, and we can once again completely eliminate the analysis stage to obtain

$x^f_{k+1} = M_{k+1} x^f_k, \quad P^f_{k+1} = Q_{k+1},$

with initial conditions $x^f_0 = y_0$, $P^f_0 = 0$. Since R is in fact the sum of measurement and representation errors, R = 0 implies that the only scales that are observed are those resolved by the model. The forecast is thus an integration of the observed state, and the forecast error reduces to the model error.

Example 3.33. Scalar case. As in Section 2.4.5, let us consider the same scalar example, but this time apply the KF to it. We take the simplest linear forecast model,

$\frac{dx}{dt} = -\alpha x,$

with α a known positive constant. We assume the same discrete dynamics considered in (2.49), where γ denotes the one-step model multiplier, so that $M(x) = \gamma x$, and a single observation at time step 3. The stochastic system (3.11)–(3.12) is

$x^t_{k+1} = M(x^t_k) + w_k, \quad y_k = x^t_k + v_k,$

where $w_k \sim \mathcal{N}(0, \sigma_Q^2)$, $v_k \sim \mathcal{N}(0, \sigma_R^2)$, and $x^t_0 - x^b_0 \sim \mathcal{N}(0, \sigma_B^2)$. The KF steps are as follows.

Forecast:

$x^f_{k+1} = M(x^a_k) = \gamma x^a_k, \quad P^f_{k+1} = \gamma^2 P^a_k + \sigma_Q^2.$

Analysis (with H = 1 when an observation is available and H = 0 otherwise):

$K_{k+1} = P^f_{k+1} H \left( H^2 P^f_{k+1} + \sigma_R^2 \right)^{-1}, \quad x^a_{k+1} = x^f_{k+1} + K_{k+1}\left( x^{obs}_{k+1} - H x^f_{k+1} \right), \quad P^a_{k+1} = \left( 1 - K_{k+1} H \right) P^f_{k+1} = \left[ \frac{1}{P^f_{k+1}} + \frac{1}{\sigma_R^2} \right]^{-1},$

the last expression holding when H = 1.

Initialization: $x^a_0 = x^b_0$, $P^a_0 = \sigma_B^2$.

We start with the initial state, at time step k = 0, with the initial conditions above. The forecast is

$x^f_1 = M(x^a_0) = \gamma x^b_0, \quad P^f_1 = \gamma^2 \sigma_B^2 + \sigma_Q^2.$

Since there is no observation available, H = 0, and the analysis gives

$K_1 = 0, \quad x^a_1 = x^f_1 = \gamma x^b_0, \quad P^a_1 = P^f_1 = \gamma^2 \sigma_B^2 + \sigma_Q^2.$

At the next time step, k = 1, the forecast gives

$x^f_2 = M(x^a_1) = \gamma^2 x^b_0, \quad P^f_2 = \gamma^2 P^a_1 + \sigma_Q^2 = \gamma^4 \sigma_B^2 + (\gamma^2 + 1)\sigma_Q^2.$

Once again there is no observation available, H = 0, and the analysis yields

$K_2 = 0, \quad x^a_2 = x^f_2 = \gamma^2 x^b_0, \quad P^a_2 = P^f_2 = \gamma^4 \sigma_B^2 + (\gamma^2 + 1)\sigma_Q^2.$

Moving on to k = 2, we have the new forecast

$x^f_3 = M(x^a_2) = \gamma^3 x^b_0, \quad P^f_3 = \gamma^2 P^a_2 + \sigma_Q^2 = \gamma^6 \sigma_B^2 + (\gamma^4 + \gamma^2 + 1)\sigma_Q^2.$

Now there is an observation, $x^o_3$, available, so H = 1, and the analysis is

$K_3 = P^f_3 \left( P^f_3 + \sigma_R^2 \right)^{-1}, \quad x^a_3 = x^f_3 + K_3 \left( x^o_3 - x^f_3 \right), \quad P^a_3 = (1 - K_3) P^f_3.$


Substituting and simplifying, we find

$x^a_3 = \gamma^3 x^b_0 + \frac{\gamma^6 \sigma_B^2 + (\gamma^4 + \gamma^2 + 1)\sigma_Q^2}{\sigma_R^2 + \gamma^6 \sigma_B^2 + (\gamma^4 + \gamma^2 + 1)\sigma_Q^2}\left( x^o_3 - \gamma^3 x^b_0 \right).$  (3.25)

Case 1: Assume we have a perfect model. Then $\sigma_Q^2 = 0$ and the KF state (3.25) becomes

$x^a_3 = \gamma^3 x^b_0 + \frac{\gamma^6 \sigma_B^2}{\sigma_R^2 + \gamma^6 \sigma_B^2}\left( x^o_3 - \gamma^3 x^b_0 \right),$

which is precisely the 4D-Var expression (2.51) obtained before.

Case 2: When the parameter α tends to zero, γ tends to one, the model is stationary, and the KF state (3.25) becomes

$x^a_3 = x^b_0 + \frac{\sigma_B^2 + 3\sigma_Q^2}{\sigma_R^2 + \sigma_B^2 + 3\sigma_Q^2}\left( x^o_3 - x^b_0 \right),$

which, when $\sigma_Q^2 = 0$, reduces to the 3D-Var solution

$x^a_3 = x^b_0 + \frac{\sigma_B^2}{\sigma_R^2 + \sigma_B^2}\left( x^o_3 - x^b_0 \right)$

that was obtained before in (2.52).

Case 3: When α tends to infinity, γ goes to zero, and we are in the case where there is no longer any memory, with

$x^a_3 = \frac{\sigma_Q^2}{\sigma_R^2 + \sigma_Q^2}\, x^o_3.$

Then, if the model is perfect, $\sigma_Q^2 = 0$ and $x^a_3 = 0$. If the observation is perfect, $\sigma_R^2 = 0$ and $x^a_3 = x^o_3$. This example shows the complete chain, from the KF solution through 4D-Var and finally to the 3D-Var solution. We hope that this clarifies the relationship between the three and demonstrates why the KF provides the most general solution possible.

Example 3.34. Brownian motion. Here we compute a numerical application of the scalar case seen above in Example 3.33. We have the following state and measurement equations:

$x_{k+1} = x_k + w_k, \quad y_{k+1} = x_{k+1} + v_{k+1},$

where the dynamic transition matrix $M_k = 1$ and the observation operator H = 1. Let us suppose constant error variances of $Q_k = 1$ and $R_k = 0.25$ for the process and measurement errors, respectively. Here the KF equations (3.18)–(3.22) reduce to

$x^f_{k+1} = x^a_k, \quad P^f_{k+1} = P^a_k + 1,$


and

$K_{k+1} = P^f_{k+1}\left( P^f_{k+1} + 0.25 \right)^{-1}, \quad x^a_{k+1} = x^f_{k+1} + K_{k+1}\left( y_{k+1} - x^f_{k+1} \right), \quad P^a_{k+1} = \left( 1 - K_{k+1} \right) P^f_{k+1}.$

By substituting for $P^f_{k+1}$ from the forecast equation, we can rewrite the Kalman gain in terms of $P^a_k$ as

$K_{k+1} = \frac{P^a_k + 1}{P^a_k + 1.25},$

and we obtain the update for the error variance,

$P^a_{k+1} = \frac{P^a_k + 1}{4 P^a_k + 5}.$

Plugging into the analysis equation, we now have the complete update:

$x^a_{k+1} = x^a_k + \frac{P^a_k + 1}{P^a_k + 1.25}\left( y_{k+1} - x^a_k \right), \quad P^a_{k+1} = \frac{P^a_k + 1}{4 P^a_k + 5}.$

Let us now manually perform a couple of iterations. Taking as initial conditions $x^a_0 = 0$, $P^a_0 = 0$, we readily compute, for k = 0,

$K_1 = \frac{1}{1.25} = 0.8, \quad x^a_1 = 0 + K_1(y_1 - 0) = 0.8\, y_1, \quad P^a_1 = \frac{1}{5} = 0.2.$

Then, for k = 1,

$K_2 = \frac{0.2 + 1}{0.2 + 1.25} \approx 0.8276, \quad x^a_2 = 0.8\, y_1 + K_2(y_2 - 0.8\, y_1) \approx 0.138\, y_1 + 0.828\, y_2, \quad P^a_2 = \frac{0.2 + 1}{0.8 + 5} = \frac{6}{29} \approx 0.207.$

One more step, for k = 2, gives

$K_3 = \frac{6/29 + 1}{6/29 + 1.25} \approx 0.8284, \quad x^a_3 = 0.138\, y_1 + 0.828\, y_2 + K_3(y_3 - 0.138\, y_1 - 0.828\, y_2) \approx 0.024\, y_1 + 0.142\, y_2 + 0.828\, y_3, \quad P^a_3 = \frac{6/29 + 1}{24/29 + 5} \approx 0.207.$


Let us see what happens in the limit $k \to \infty$. We observe that $P_{k+1} \approx P_k$; thus

$P^a_\infty = \frac{P^a_\infty + 1}{4 P^a_\infty + 5},$

which is a quadratic equation for $P^a_\infty$, whose solutions are

$P^a_\infty = \frac{1}{2}\left( -1 \pm \sqrt{2} \right).$

The positive definite solution is

$P^a_\infty = \frac{1}{2}\left( -1 + \sqrt{2} \right) \approx 0.2071,$

and hence

$K_\infty = \frac{2 + 2\sqrt{2}}{3 + 2\sqrt{2}} \approx 0.8284.$

We observe in this case that the KF tends toward a steady-state filter after only two steps. The reasons for this rapid convergence are that the dynamics are neutral and that the observation error covariance is relatively small compared to the process error, $R \ll Q$, which means that the observations are relatively precise compared to the model error. In addition, the state (being scalar) is completely observed whenever an observation is available. In conclusion, the combination of dense, precise observations with steady, linear dynamics will always lead to a stable filter.
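This fixed point is easy to verify numerically. The following minimal MATLAB sketch (our own illustration, not part of the original example) iterates the variance update and compares it with the closed-form limit:

% Iterate the analysis-variance recursion of the Brownian-motion example
% and compare with the closed-form steady state (sqrt(2)-1)/2.
P = 0;                        % initial analysis variance P_0^a
for k = 1:10
    P = (P + 1)/(4*P + 5);    % variance update derived above
    fprintf('k = %2d, P = %.6f\n', k, P);
end
fprintf('steady state: %.6f\n', (sqrt(2) - 1)/2);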

Example 3.35. Estimation of a random constant. In this simple numerical example, let us attempt to estimate a scalar random constant, for example, a voltage. Let us assume that we have the ability to take measurements of the constant, but that the measurements are corrupted by a 0.1 volt root mean square (rms) white measurement noise (e.g., our analog-to-digital converter is not very accurate). In this example, our process is governed by the state equation

$x_k = M x_{k-1} + w_k = x_{k-1} + w_k$

and the measurement equation

$y_k = H x_k + v_k = x_k + v_k.$

The state, being constant, does not change from step to step, so M = I. Our noisy measurement is of the state directly, so H = 1. We are in fact in the same Brownian motion context as the previous example. The time update (forecast) equations are

$x^f_{k+1} = x^a_k, \quad P^f_{k+1} = P^a_k + Q,$

and the measurement update (analysis) equations are

$K_{k+1} = P^f_{k+1}\left( P^f_{k+1} + R \right)^{-1}, \quad x^a_{k+1} = x^f_{k+1} + K_{k+1}\left( y_{k+1} - x^f_{k+1} \right), \quad P^a_{k+1} = \left( 1 - K_{k+1} \right) P^f_{k+1}.$



Figure 3.10. Estimating a constant—simulation with R = 0.01. True value (solid), measurements (dots), KF estimation (dashed). The x-axis denotes time; the y-axis denotes the state variable.

Initialization. Presuming a very small process variance, we let Q = 1e−5. We could certainly let Q = 0, but assuming a small but nonzero value gives us more flexibility in "tuning" the filter, as we will demonstrate below. Let us assume that from experience we know that the true value of the random constant has a standard Gaussian probability distribution, so we will "seed" our filter with the guess that the constant is 0. In other words, before starting, we let $x_0 = 0$. Similarly, we need to choose an initial value for $P^a_k$; call it $P_0$. If we were absolutely certain that our initial state estimate was correct, we would let $P_0 = 0$. However, given the uncertainty in our initial estimate $x_0$, choosing $P_0 = 0$ would cause the filter to initially and always believe that $x^a_k = 0$. As it turns out, the alternative choice is not critical: we could choose almost any $P_0 \neq 0$ and the filter would eventually converge. We will start our filter with $P_0 = 1$.

Simulations. To begin with, we randomly chose a scalar constant x = −0.37727. We then simulated 100 distinct measurements that had an error normally distributed around zero with a standard deviation of 0.1 (remember we presumed that the measurements are corrupted by a 0.1 volt rms white measurement noise). In the first simulation we fixed the measurement variance at R = (0.1)² = 0.01. Because this is the "true" measurement error variance, we would expect the "best" performance in terms of balancing responsiveness and estimate variance. This will become more evident in the second and third simulations. Figure 3.10 depicts the results of this first simulation. The true value of the random constant, x = −0.37727, is given by the solid line, the noisy measurements by the dots, and the filter estimate by the remaining dashed curve. In Figures 3.11 and 3.12 we can see what happens when the measurement error variance R is increased or decreased by a factor of 100.



Figure 3.11. Estimating a constant—simulation with R = 1. True value (solid), measurements (dots), KF estimation (dashed). The x-axis denotes time; the y-axis denotes the state variable.

In Figure 3.11, the filter was told that the measurement variance was 100 times as great (i.e., R = 1), so it was "slower" to believe the measurements. In Figure 3.12, the filter was told that the measurement variance was 1/100th the size (i.e., R = 0.0001), so it was very "quick" to believe the noisy measurements. While the estimation of a constant is relatively straightforward, this example clearly demonstrates the workings of the KF. In Figure 3.11 in particular the Kalman "filtering" is evident, as the estimate appears considerably smoother than the noisy measurements. We observe the speed of convergence of the variance in Figure 3.13. Here is a MATLAB script that performs the simulations described above:

% Kalman filter demo: estimation of a random constant.
n_iter = 100;          % number of measurements
x_true = -0.37727;     % true value of the random constant (volts)
Q = 1e-5;              % process error variance
R = 0.01;              % measurement error variance used by the filter;
                       % R = 1 and R = 0.0001 give Figures 3.11 and 3.12

% simulate measurements corrupted by 0.1 volt rms white noise
y = x_true + 0.1*randn(n_iter,1);

xhat = zeros(n_iter,1);   % a posteriori state estimates
P    = zeros(n_iter,1);   % a posteriori error variances
xhat(1) = 0;              % initial guess x_0 = 0
P(1)    = 1;              % initial error variance P_0 = 1

for k = 2:n_iter
    % time update (forecast): M = 1
    xf = xhat(k-1);
    Pf = P(k-1) + Q;
    % measurement update (analysis): H = 1
    K = Pf/(Pf + R);
    xhat(k) = xf + K*(y(k) - xf);
    P(k)    = (1 - K)*Pf;
end

plot(1:n_iter, x_true*ones(n_iter,1), 'k-', ...
     1:n_iter, y, 'k.', 1:n_iter, xhat, 'k--')



Figure 3.12. Estimating a constant—simulation with R = 0.0001. True value (solid), measurements (dots), KF estimation (dashed). The x-axis denotes time; the y-axis denotes the state variable.


Consider a scalar model whose state is multiplied by α at each time step and observed at each analysis: if $\alpha^2 > 1$, the model is unstable, and if $\alpha^2 < 1$ it is stable. Let us denote by $b_k$ the forecast/prior error variance, r the static observation error variance, and $a_k$ the analysis error variance. Sequential DA implies the following recursions for the variances:

$a_k^{-1} = b_k^{-1} + r^{-1} \quad \text{and} \quad b_{k+1} = \alpha^2 a_k,$

whose asymptotic solution ($a_k \to a_\infty$) is

$a_\infty = 0 \ \text{if} \ \alpha^2 < 1 \quad \text{and} \quad a_\infty = \left( 1 - 1/\alpha^2 \right) r \ \text{if} \ \alpha^2 \geq 1.$

Very roughly, this tells us that only the growing modes need to be controlled, i.e., that DA should be targeted at preventing errors from increasing indefinitely in the space generated by the growing modes. This paradigm is called assimilation in the unstable subspace, or AUS [see Palatella et al., 2013, and references therein]. It is tempting to identify the unstable subspace with the time-dependent space generated by the Lyapunov vectors with nonnegative exponents, which, strictly speaking, is the unstable and neutral subspace. Applied to the KF and possibly the EnKF, it is intuitively understood that the error covariance matrix tends to collapse onto this unstable and neutral subspace [Trevisan and Palatella, 2011]. This can be made rigorous in the case of a linear model with Gaussian statistics. The generalization of the paradigm to nonlinear dynamical systems is more speculative, but Ng et al. [2011] and Palatella and Trevisan [2015] put forward some enlightening arguments about it. A connection between the AUS paradigm and a justification of multiplicative inflation was established in Bocquet et al. [2015].
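As a quick check of the asymptotic result above, the following MATLAB sketch (our own illustration) iterates the variance recursion for a stable and an unstable α:

% Iterate a_k^{-1} = b_k^{-1} + r^{-1}, b_{k+1} = alpha^2 * a_k and
% compare the limit with a_inf = (1 - 1/alpha^2)*r for alpha^2 >= 1.
r = 1;
for alpha = [0.9, 1.5]
    b = 10;                    % arbitrary initial prior variance
    for k = 1:200
        a = 1/(1/b + 1/r);     % analysis step
        b = alpha^2 * a;       % forecast step
    end
    fprintf('alpha = %.1f: a = %.4f (predicted %.4f)\n', ...
            alpha, a, max(0, 1 - 1/alpha^2)*r);
end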

6.7.2 Variational analysis

We recall one of the key elementary results of DA seen in Chapters 1 and 3, namely that when the observation model is linear, the BLUE analysis is equivalent to a 3D variational problem (3D-Var). That is to say, evaluating the matrix formula

$x^a = x^f + P^f H^T\left( R + H P^f H^T \right)^{-1}\left( y - H x^f \right)$

is equivalent to solving the minimization problem $x^a = \operatorname{argmin}_x \mathcal{J}(x)$ with

$\mathcal{J}(x) = \frac{1}{2}\left\| y - Hx \right\|_R^2 + \frac{1}{2}\left\| x - x^f \right\|_{P^f}^2,$

where $\|x\|_A^2 = x^T A^{-1} x$ for any symmetric positive definite matrix A.
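This equivalence is easy to check numerically. In the following MATLAB sketch (an illustration of ours, with arbitrary random data), the BLUE formula is compared with the minimizer obtained from the normal equations of the quadratic cost function:

% Compare the BLUE update with the 3D-Var minimizer for a random
% linear-Gaussian problem (gradient of J set to zero).
n = 5; p = 3;
H  = randn(p, n);
Pf = eye(n);  R = 0.5*eye(p);     % background and observation covariances
xf = randn(n, 1);  y = randn(p, 1);

xa_blue = xf + Pf*H'*((R + H*Pf*H')\(y - H*xf));
xa_var  = xf + (inv(Pf) + H'*(R\H))\(H'*(R\(y - H*xf)));

disp(norm(xa_blue - xa_var))      % difference at machine precision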


We assume momentarily that $P^f$ is full rank, so that it is invertible and positive definite. This equivalence can be fruitful with high-dimensional systems, where tools of numerical optimization can be used in place of linear algebra. This equivalence is also of theoretical interest because it enables an elegant generalization of the BLUE update to the case where the observation operator is nonlinear. Simply put, the cost function is now replaced with

$\mathcal{J}_{NL}(x) = \frac{1}{2}\left\| y - \mathcal{H}(x) \right\|_R^2 + \frac{1}{2}\left\| x - x^f \right\|_{P^f}^2,$

where $\mathcal{H}$ is nonlinear.

6.7.2.1 The maximum likelihood ensemble filter

This equivalence was put to use in the maximum likelihood ensemble filter (MLEF) introduced by Zupanski [2005]. To describe the MLEF in a framework we have already detailed, let us formulate it in terms of the ETKF (following Bocquet and Sakov [2013]). Hence, we write the analysis of the MLEF in ensemble subspace, closely following Section 6.4. However, we must first write the corresponding cost function. Recall that the state vector is parameterized in terms of the vector of coefficients w in $\mathbb{R}^m$, $x = x^f + Xw$. The reduced cost function is denoted by $\widetilde{\mathcal{J}}_{NL}(w) = \mathcal{J}_{NL}(x^f + Xw)$. Its first term can easily be written in the reduced ensemble subspace:

$\widetilde{\mathcal{J}}^{\,o}_{NL}(w) = \frac{1}{2}\left\| y - \mathcal{H}(x^f + Xw) \right\|_R^2.$

To proceed with the background term, $\widetilde{\mathcal{J}}^{\,b}(w)$, of the cost function, we first have to explain what the inverse of $P^f = X_f X_f^T$, of incomplete rank whenever $m \le n$, should be. Because for most EnKFs the analysis is entirely set in ensemble subspace, as we repeatedly pointed out, the inverse of $P^f = X_f X_f^T$ must be the Moore–Penrose inverse [Golub and van Loan, 2013] of $P^f$, denoted by $P_f^\dagger$. Indeed, it is defined in the range of $P^f$. It is even more direct to introduce the SVD of the perturbation matrix $X_f = U\Sigma V^T$, where $\Sigma > 0$ is the diagonal matrix of the $m' \le m$ positive singular values, V is of size $m \times m'$ such that $V^T V = I_{m'}$, and U is of size $n \times m'$ such that $U^T U = I_{m'}$. Note that $P_f^\dagger = U\Sigma^{-2}U^T$. Then we have

$\widetilde{\mathcal{J}}^{\,b}(w) = \frac{1}{2}\left\| X_f w \right\|_{P^f}^2 = \frac{1}{2} w^T X_f^T P_f^\dagger X_f w = \frac{1}{2} w^T V\Sigma U^T \left( U\Sigma^2 U^T \right)^\dagger U\Sigma V^T w = \frac{1}{2} w^T V\Sigma U^T U\Sigma^{-2}U^T U\Sigma V^T w = \frac{1}{2} w^T V V^T w.$

As was mentioned earlier, there is a freedom in w that makes the solution of the minimization problem degenerate. Clearly $\widetilde{\mathcal{J}}^{\,o}_{NL}(w)$ is unchanged if w is shifted by $\lambda\mathbf{1}$. So is $\widetilde{\mathcal{J}}^{\,b}(w)$, because $\mathbf{1}$ is in the null space of V since $X_f\mathbf{1} = 0$. One way to lift the degeneracy of the variational problem is to add a gauge-fixing term that constrains the solution in the null space of $X_f$, or V. In practice, one can add

$\widetilde{\mathcal{J}}^{\,g}(w) = \frac{1}{2} w^T \left( I_m - V V^T \right) w$


to the cost function $\widetilde{\mathcal{J}}_{NL}$ to obtain the regularized cost function

$\widetilde{\mathcal{J}}(w) = \frac{1}{2}\left\| y - \mathcal{H}(x^f + Xw) \right\|_R^2 + \frac{1}{2}\left\| w \right\|^2,$

where $\|w\|^2 = w^T w$. The cost function still has the same minimum, but it is now achieved at a nondegenerate w such that $(I_m - VV^T)w = 0$. This is the cost function of the ETKF [Hunt et al., 2007], but with a nonlinear observation operator. The update step of this EnKF can now be seen as a nonlinear variational problem, which can be solved using a variety of iterative methods, such as a Gauss–Newton method, a quasi-Newton method, or a Levenberg–Marquardt method [Nocedal and Wright, 2006]. For instance, with the Gauss–Newton method, we would define the iterate, the gradient, and an approximation of the Hessian as

$x_{(j)} = x^f + X_f w_{(j)}, \quad \nabla\widetilde{\mathcal{J}}_{(j)} = -Y_{(j)}^T R^{-1}\left( y - \mathcal{H}(x_{(j)}) \right) + w_{(j)}, \quad \widetilde{H}_{(j)} = I_m + Y_{(j)}^T R^{-1} Y_{(j)},$

respectively. The Gauss–Newton iterations are indexed by j and given by

$w_{(j+1)} = w_{(j)} - \widetilde{H}_{(j)}^{-1}\, \nabla\widetilde{\mathcal{J}}_{(j)}.$

The scheme is iterated until a satisfying convergence is reached, for instance when the norm of $w_{(j+1)} - w_{(j)}$ crosses below a given threshold. The vector $Y_{(j)}$ is defined as the image of the initial ensemble perturbations $X_f$ through the tangent linear model of the observation operator computed at $x_{(j)}$, i.e., $Y_{(j)} = H|_{x_{(j)}} X_f$. Following Sakov et al. [2012], there are at least two ways to compute these sensitivities. One explicitly mimics the tangent linear model by a downscaling of the perturbations by $\epsilon$, with $0 < \epsilon \ll 1$, before application of the full nonlinear operator $\mathcal{H}$, followed by an upscaling by $\epsilon^{-1}$. The operation reads

$Y_{(j)} \approx \frac{1}{\epsilon}\, \mathcal{H}\!\left( x_{(j)}\mathbf{1}^T + \epsilon X_f \right)\left( I_m - \frac{\mathbf{1}\mathbf{1}^T}{m} \right).$

Note that $\epsilon$ accounts for a normalization factor of $\sqrt{m-1}$. The second way avoids resizing the perturbations, because this implies applying the observation operator to the ensemble an extra time. Instead of downscaling the perturbations, we can (i) generate transformed perturbations by applying the right-multiplication operator

$T = \left( I_m + Y_{(j)}^T R^{-1} Y_{(j)} \right)^{-\frac{1}{2}},$

(ii) build a new ensemble from these transformed perturbations around $x_{(j)}$, (iii) apply $\mathcal{H}$ to this ensemble, and finally (iv) rotate back the new perturbations around $x_{(j+1)}$ by applying $T^{-1}$. Through the T-transformation, the second scheme also ensures a resizing of the perturbations to where $\mathcal{H}$ is in a close-to-linear regime. However, as opposed to the first scheme, the last propagation of the perturbations can be used to directly estimate the final approximation of the Hessian and the final updated set of perturbations, which can be numerically efficient.
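A compact MATLAB sketch of this Gauss–Newton analysis in ensemble subspace, using the bundle (finite-difference) sensitivities, could look as follows. The function wrapper, names, and arguments are our own illustrative choices, not a fixed interface; Hobs is assumed to act columnwise on a matrix of states.

function [xa, w] = mlef_analysis(E, y, Hobs, Rinv, epsi, jmax, tol)
% Gauss-Newton analysis in ensemble subspace with bundle sensitivities.
% E: n-by-m prior ensemble; Hobs: nonlinear observation operator handle;
% Rinv: inverse observation error covariance; epsi: rescaling factor.
[n, m] = size(E);
xb = mean(E, 2);
X  = (E - xb*ones(1,m)) / sqrt(m-1);    % normalized perturbations
w  = zeros(m, 1);
for j = 1:jmax
    x = xb + X*w;
    % bundle scheme: finite-difference sensitivities around x
    Z = Hobs(x*ones(1,m) + epsi*X);
    Y = (Z - mean(Z,2)*ones(1,m)) / epsi;
    grad = w - Y'*Rinv*(y - Hobs(x));
    Hess = eye(m) + Y'*Rinv*Y;
    dw = Hess \ grad;
    w = w - dw;
    if norm(dw) <= tol, break, end
end
xa = xb + X*w;
end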


Algorithm 6.5 Pseudocode for a complete cycle of the MLEF, as a variant in ensemble subspace following Zupanski [2005], Carrassi et al. [2009], Sakov et al. [2012], and Bocquet and Sakov [2013]

Require: Observation operator $\mathcal{H}$ at current time; algorithm parameters e, $j_{max}$, $\epsilon$; E, the prior ensemble; y, the observation at current time; U, an orthogonal matrix in $\mathbb{R}^{m \times m}$ satisfying $U\mathbf{1} = \mathbf{1}$; $\mathcal{M}$, the model resolvent from current time to the next analysis time.
1: $\bar{x} = E\mathbf{1}/m$
2: $X = (E - \bar{x}\mathbf{1}^T)/\sqrt{m-1}$
3: $T = I_m$
4: $j = 0$, $w = 0$
5: repeat
6:   $x = \bar{x} + Xw$
7:   Bundle: $E = x\mathbf{1}^T + \epsilon X$
8:   Transform: $E = x\mathbf{1}^T + \sqrt{m-1}\, XT$
9:   $Z = \mathcal{H}(E)$
10:  $\bar{y} = Z\mathbf{1}/m$
11:  Bundle: $Y = (Z - \bar{y}\mathbf{1}^T)/\epsilon$
12:  Transform: $Y = (Z - \bar{y}\mathbf{1}^T)\, T^{-1}/\sqrt{m-1}$
13:  $\nabla\widetilde{\mathcal{J}} = w - Y^T R^{-1}(y - \bar{y})$
14:  $\widetilde{H} = I_m + Y^T R^{-1} Y$
15:  Solve $\widetilde{H}\,\Delta w = \nabla\widetilde{\mathcal{J}}$
16:  $w := w - \Delta w$
17:  Transform: $T = \widetilde{H}^{-\frac{1}{2}}$
18:  $j := j + 1$
19: until $\|\Delta w\| \le e$ or $j \ge j_{max}$
20: Bundle: $T = \widetilde{H}^{-\frac{1}{2}}$
21: $E = x\mathbf{1}^T + \sqrt{m-1}\, XTU$
22: $E = \mathcal{M}(E)$

Note that each iteration amounts to solving an inner loop problem with the quadratic cost function

$\widetilde{\mathcal{J}}_{(j)}(w) = \frac{1}{2}\left\| y - \mathcal{H}(x_{(j)}) - Y_{(j)}(w - w_{(j)}) \right\|_R^2 + \frac{1}{2}\left\| w - w_{(j)} \right\|^2.$

The update of the perturbations follows that of the ETKF, i.e., equation (6.17). A full cycle of the algorithm is given in Algorithm 6.5 as pseudocode. Either the bundle (finite differences with the $\epsilon$-rescaling) scheme or the transform (using T) scheme is needed to compute the sensitivities; both are indicated in the algorithm. Inflation, and possibly localization, should be added to the scheme to make it functional. In summary, the MLEF is an EnKF scheme that can use nonlinear observation operators in a consistent way through a variational analysis.

6.7.2.2 Numerical illustration

The (bundle) MLEF as implemented by Algorithm 6.5 is tested against the EnKF (in its ETKF implementation) using a setup similar to that used earlier with the Lorenz-95 model. To exhibit a difference in performance, the observation operator has been chosen to be nonlinear.


Each of the 40 variables is observed with the nonlinear observation operator

$\mathcal{H}(x) = \frac{x}{2}\left[ 1 + \left( \frac{|x|}{10} \right)^{\gamma - 1} \right],$  (6.27)

where |x| is the componentwise absolute value of x. The second, nonlinear term in the brackets is meant to be of the order of magnitude of the first, linear term, to avoid numerical overflow. Obviously, γ tunes the nonlinearity of the observation operator, with γ = 1 corresponding to the linear case $\mathcal{H}(x) = x$. The prior observation error is chosen to be $R \equiv I_p$. The ensemble size is m = 20, which, in this context, makes localization unnecessary. For both the ETKF and the MLEF, the need for inflation is addressed either by using the Bayesian hierarchical scheme for the EnKF, known as the finite-size EnKF or EnKF-N, which we shall describe later, or by optimally tuning a uniform inflation (which comes with a significant numerical cost). The MLEF is expected to offer strong performance in the first cycles of the DA scheme, when the spread of the ensemble is large enough, over a span where the tangent linear observation model is not a good approximation. To measure this performance, the length of the DA run is set to $10^2$ cycles, and these runs are repeated $10^3$ times, over which a mean analysis RMSE is computed. The spread of the initial ensemble is chosen to be 3, i.e., roughly the climatological variability of a single Lorenz-95 variable. The overall performances of the schemes are computed as a function of γ, i.e., the nonlinearity strength of the observation operator, and reported in Figure 6.7. Since the model is fully observed, an efficient DA scheme should have an RMSE smaller than the prior observation error, i.e., 1, which is indeed the case for all RMSEs computed in these experiments.
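In MATLAB, the operator (6.27) can be written in one line (a sketch of ours), acting componentwise on a state vector or on a whole ensemble:

% Nonlinear observation operator (6.27), applied componentwise;
% gamma = 1 recovers the linear case Hobs(x) = x.
Hobs = @(x, gamma) (x/2) .* (1 + (abs(x)/10).^(gamma - 1));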


Figure 6.7. Average analysis RMSE of a deterministic EnKF (ETKF) and of the MLEF with the Lorenz-95 model and the nonlinear observation operator of equation (6.27), as a function of the nonlinearity strength γ of the observation operator. For each RMSE, $10^3$ DA experiments are run over $10^2$ cycles. The final RMSE is the mean over those $10^3$ experiments. The EnKF-N and MLEF-N are hierarchical filter counterparts of the EnKF and MLEF that will be discussed in Section 6.7.3.


As γ departs from γ = 1, the performance of both filters consistently degrades. For γ > 9, the EnKF diverges from the truth. The MLEF, however, better handles the nonlinearity, especially beyond γ = 4, and the gap between the EnKF and the MLEF increases with γ. We also note that the EnKF-N and MLEF-N offer better performance than an optimally but uniformly tuned EnKF. This is explained by the fact that the finite-size scheme is adaptive and adjusts the inflation as the ensemble spread decreases. For much longer runs, the RMSE gap between the EnKF and the MLEF decreases significantly. Indeed, in the permanent regime of these experiments, the spread of the ensemble is significantly smaller than 1 and the system is closer to linearity, a regime where the EnKF and the MLEF perform equally well. The gap between the finite-size and the optimally tuned ensemble filters also decreases, since the adaptation feature is not necessarily important in a permanent regime. The MLEF will be generalized later by including not only the observation operator but also the forecast model in the analysis over a 4D DA window, leading to the IEnKF and IEnKS.

6.7.2.3 The α control variable trick

The analysis using covariance localization with a localization matrix ρ can be equivalently formulated in terms of ensemble coefficients similar to the w of the ETKF, but made space dependent. Instead of a vector w of size m that parameterizes the state vector x,

$x = \bar{x} + \sum_{i=1}^{m} w_i\, \frac{x_i - \bar{x}}{\sqrt{m-1}},$

we choose a much larger control vector α of size mn that parameterizes the state vector x as

$x = \bar{x} + \sum_{i=1}^{m} \alpha_i \circ \frac{x_i - \bar{x}}{\sqrt{m-1}},$

where $\alpha_i$ is a subvector of size n. This change of control variable is known as the α control variable [Lorenc, 2003; Buehner, 2005; Wang et al., 2007]. It is better formulated in a variational setting. Consider the cost function

$\mathcal{J}(x, \alpha) = \frac{1}{2}\left\| y - \mathcal{H}\!\left( \bar{x} + \sum_{i=1}^{m} \alpha_i \circ \frac{x_i - \bar{x}}{\sqrt{m-1}} \right) \right\|_R^2 + \frac{1}{2}\sum_{i=1}^{m}\left\| \alpha_i \right\|_\rho^2,$  (6.28)

which is formally equivalent to the Lagrangian

$\mathcal{J}(x, \alpha, \beta) = \frac{1}{2}\left\| y - \mathcal{H}(x) \right\|_R^2 + \frac{1}{2}\sum_{i=1}^{m}\left\| \alpha_i \right\|_\rho^2 + \beta^T\left[ x - \bar{x} - \sum_{i=1}^{m} \alpha_i \circ \frac{x_i - \bar{x}}{\sqrt{m-1}} \right],$

where β is a vector of Lagrange multipliers in $\mathbb{R}^n$. Yet, as opposed to $\mathcal{J}(x, \alpha)$, the Lagrangian $\mathcal{J}(x, \alpha, \beta)$ is quadratic in α and can be analytically minimized over α. Denoting, as before, $X_i = \frac{x_i - \bar{x}}{\sqrt{m-1}}$, the saddle point condition on $\alpha_i$ is

$\alpha_i = \rho\,(\beta \circ X_i).$


Its substitution in $\mathcal{J}(x, \alpha, \beta)$ implies computing

$\sum_{i=1}^{m}(\beta \circ X_i)^T \rho\,(\beta \circ X_i) = \sum_{i=1}^{m}\sum_{k,l=1}^{n} \beta_k [X_i]_k\, \rho_{k,l}\, \beta_l [X_i]_l = \sum_{k,l=1}^{n} \beta_k \left[ \rho \circ XX^T \right]_{k,l} \beta_l = \beta^T (\rho \circ P)\,\beta,$

where P is the sample covariance matrix $P = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})(x_i - \bar{x})^T$. This yields the new Lagrangian

$\mathcal{J}(x, \beta) = \frac{1}{2}\left\| y - \mathcal{H}(x) \right\|_R^2 + \frac{1}{2}\beta^T (\rho \circ P)\,\beta + \beta^T (x - \bar{x}).$

This can be further optimized over β to yield the cost function

$\mathcal{J}(x) = \frac{1}{2}\left\| y - \mathcal{H}(x) \right\|_R^2 + \frac{1}{2}\left\| x - \bar{x} \right\|_{\rho \circ P}^2.$  (6.29)

6.7.3 Hierarchical EnKFs Key assumptions of the EnKF are that the mean and the error covariance matrix are exactly given by the sampled moments of the ensemble. As we have seen, this is bound to fail without fixes. Indeed, the uncertainty is underestimated and may often lead to the divergence of the filter. As seen in Section 6.5, inflation and localization are fixes that address this issue in a satisfying manner. The use of a Bayesian statistical hierarchy has been more recently explored as a distinct route toward solving this issue. If x is a state vector that depends on parameters θ and is observed by y, Bayes’ rule (see section 3.2) tells us that the probability of the variables and parameters conditioned on the observations is p(x, θ|y) ∝ p(y|θ, x) p(x, θ), where the proportionality factor only depends on y. But if θ is further assumed to be an uncertain parameter vector obeying a prior distribution, p(θ), and if the likelihood of y only depends on θ through x, one can further decompose this posterior distribution into p(x, θ|y) ∝ p(y|x) p(x|θ) p(θ). We have created a hierarchy of random variables: x at the first level and θ at a second level. In this context, p(x|θ) is still called a prior, while p(θ) is termed a hyperprior to emphasize that it operates at a second level of the Bayesian hierarchy [Gelman et al., 2014]. Applied to the EnKF, a Bayesian hierarchy could be enforced by seeing the moments of the true error distribution as multivariate random variables, rather than coinciding with the sampled moments of the ensemble. One of the simplest ways to enforce this idea is to marginalize p(x, θ|y), and hence p(x|θ), over all potential θ. In the following, θ will be the ensemble sampled moments.

184

Chapter 6. The ensemble Kalman filter

Specifically, Bocquet [2011] recognized that the ensemble mean x and ensemble error covariance matrix P used in the EnKF may be different from the unknown firstand second-order moments of the true error distribution, x b and B, where B is a positive definite matrix. The mismatch is due to the finite size of the ensemble, which leads to sampling errors. It is claimed in Bocquet et al. [2015] that these errors are mainly induced by the nonlinear ensemble propagation in the forecast step. Let us account for the uncertainty in x b and in B. As in Section 6.4.1, we denote by E = [x1 , x2 , . . . , x m ] the ensemble of size m formatted as an n × m matrix; x= # E1/m the ensemble mean, where 1 = (1, . . . , 1)T ; and X = (E − x1T )/ m − 1 the normalized perturbation matrix. Hence, P = XXT is the empirical covariance matrix of the ensemble. Marginalizing over all potential xb and B, the prior of x reads  p(x|E) = dx b dB p(x|E, x b , B) p(x b , B|E). The Hn symbol dB corresponds to the Lebesgue measure on all independent entries i ≤ j d[B]i j , but the integration is restricted to the cone of positive definite matrices. Since p(x|E, x b , B) is conditioned on the knowledge of the true prior statistics, it does not depend on E, so that  p(x|E) = dx b dB p(x|x b , B) p(x b , B|E). Bayes’ rule can be applied to p(x b , B|E), yielding  1 dx b dB p(x|x b , B) p(E|x b , B) p(x b , B). p(x|E) = p(E)

(6.30)

Assuming independence of the samples, the likelihood of the ensemble E can be written m ; p(E|x b , B) = p(xi |x b , B). i =1

The last factor in (6.30), p(x b , B), is the hyperprior. The distribution represents our beliefs about the forecast filter statistics, x b and B, prior to actually running any filter. We recall that this distribution is termed hyperprior because it represents a prior for the background information in the first stage of a Bayesian hierarchy. Assuming one subscribes to this view of the EnKF, it shows that more information is actually required in the EnKF, in addition to the observations and the prior ensemble, which are potentially insufficient for an inference. A simple choice was made in Bocquet [2011] for the hyperprior: the Jeffreys’ prior is an analytically tractable and uninformative hyperprior of the form pJ (x b , B) ∝ |B|−

n+1 2

,

(6.31)

where |B| is the determinant of the background error covariance matrix B of dimension n × n. A more sophisticated hyperprior meant to hold static information, the normal-inverse-Wishart distribution, was proposed in Bocquet et al. [2015]. With a given hyperprior, the marginalization over x b and B, (6.30), can in principle be carried out to obtain p(x|E). We choose to call it a predictive prior to comply with the traditional view that sees it as prior before assimilating the observations. Note,

6.7. Other important flavors of the EnKF

185

however, that statisticians would rather call it a predictive posterior distribution as the outcome of a first-stage inference of a Bayesian hierarchy, where E is the data. Using Jeffreys’ hyperprior, Bocquet [2011] showed that the integral can be obtained analytically and that the predictive prior is a multivariate t -distribution,  − m  (x − x) (x − x)T  2   p(x|E) ∝  +  m P ,   m −1

(6.32)

where |·| denotes the determinant and  m = 1 + 1/m. The determinant is computed in the ensemble subspace ξ = x + Vec (X1 , X2 , . . . , X m ), i.e., the affine space spanned by the perturbations of the ensemble so that it is not singular. Moreover, we impose p(x|E) = 0 if x is not in ξ . This distribution has fat tails, thus accounting for the uncertainty in B. The factor  m is a result of the uncertainty in x b ; if x b were known to coincide with the ensemble mean x, then  m would be 1 instead. For a Gaussian process,  m P is an unbiased estimator of the squared error of the ensemble mean x [Sacher and Bartello, 2008], where  m stems from the uncertain x b , which does not coincide with x. In the derivation of Bocquet [2011], the  m P correction comes from integrating out on x b . Therefore,  m can be seen as an inflation factor on the prior covariance matrix that should actually apply to any type of EnKF. This non-Gaussian prior distribution can be seen as a mixture of Gaussian distributions weighted according to the hyperprior. It can be shown that (6.32) can be rearranged as ! >− m2 (x − x)T ( m P)† (x − x) p(x|E) ∝ 1 + (6.33) m −1

for x ∈ ξ and p(x|E) = 0 if x ∈ / ξ ; P† is the Moore–Penrose inverse of P. In comparison, the traditional EnKF implicitly assumes that the hyperprior is δ(B−P)δ(x b −x), where δ is a Dirac multidimensional distribution. In other words, the background statistics generated from the ensemble coincide with the true background statistics. As a result, one obtains in this case the Gaussian prior I J 1 p(x|E) ∝ exp − (x − x)T P† (x − x) (6.34) 2

for $x \in \xi$ and p(x|E) = 0 if $x \notin \xi$; $P^\dagger$ is the Moore–Penrose inverse of P. In comparison, the traditional EnKF implicitly assumes that the hyperprior is $\delta(B - P)\,\delta(x^b - \bar{x})$, where δ is a multidimensional Dirac distribution. In other words, the background statistics generated from the ensemble coincide with the true background statistics. As a result, one obtains in this case the Gaussian prior

$p(x|E) \propto \exp\left[ -\frac{1}{2}(x - \bar{x})^T P^\dagger (x - \bar{x}) \right]$  (6.34)

  w2 1 m +1 y −  (x + Xw)2R + ln  m + , 2 2 m −1

(6.35)

which should be used in place of the related cost function of the ETKF, which reads [Hunt et al., 2007] ( (w) =

1 1 y −  (x + Xw)2R + w2 . 2 2

(6.36)

The EnKF that results from the effective cost function (6.35) has been called the finitesize EnKF because it sees the ensemble in the asymptotic limit, but as a finite set. It

186

Chapter 6. The ensemble Kalman filter

is denoted EnKF-N, where the N indicates an explicit dependence on the size of the ensemble. It was further shown in Bocquet and Sakov [2012] that it is enlightening to separate the angular degrees of freedom of w, i.e., w/ |w| , from its radial one |w| in the cost function. This amounts to defining a Lagrangian of the form ' (w, ρ, ζ ) =

1 ζ y −  (x + Xw)2R + 2 2



 w2 m+1 −ρ + ln ( m + ρ) , m −1 2

where ζ is a Lagrange parameter used to enforce the decoupling. When the observation operator is linear or linearized, this Lagrangian turns out to be equivalent to a dual cost function of the ζ parameter, which is  −1  ζ m +1 m +1 m +1 1 m −1 YYT ln − , (ζ ) = δ T R + δ+ m + 2 ζ 2 2 ζ 2

(6.37)

where δ = y −  (x) is the innovation vector. The dual cost function is defined over the interval ]0, (m + 1)/ m ]. Although it is not necessarily convex, its global minimum can easily be found numerically because it is a one-dimensional optimization problem. To perform an EnKF-N analysis using this dual cost function, one would first minimize (6.37) to obtain the optimal ζa . The analysis is then  wa = YT R−1 Y +

ζa m −1

Im

−1

 YT R−1 δ = YT

ζa m −1

R + YYT

−1

δ.

(6.38)

Based on the effective cost function (6.35), an updated set of perturbations can be obtained: 1

Xa = X [Ha ]− 2 U

with

Ha = YT R−1 Y +

ζa 2 I − m −1 m m +1



ζa m −1

2 wa wTa . (6.39)

 ζ 2 2 a The last term of the Hessian, − m+1 m−1 wa wTa , which is related to the covariances of the angular and radial degrees of freedom of w, can very often be neglected. If so, the update equations are equivalent to those of the ETKF but with an inflation of the prior covariance matrix by a factor (m −1)/ζa . Hence the EnKF-N implicitly determines an adaptive optimal inflation. If an SVD of Y is available, the minimization of (ζ ) is immediate, using for instance a dichotomous search. Such a decomposition is often already available because it was meant to be used to compute (6.38) and (6.39). Practically, it was found in several low-order models and in perfect model conditions that the EnKF-N does not require any inflation and that its performance is close to that of an equivalent ETKF, where a uniform inflation would have been optimally tuned to obtain the best performance of the filter. In a preliminary study by Bocquet et al. [2015], the use of a more informative hyperprior, such as the normalinverse-Wishart distribution, was proposed to avoid the need for localization while still avoiding the need for inflation.

6.7. Other important flavors of the EnKF

187

6.7.3.1 Numerical illustrations

The three-variable Lorenz model [Lorenz, 1963] (Lorenz-63 hereafter) is the emblem of chaotic systems. It is defined by the ODEs

dx = σ(y − x), dt dy = ρx − y − x z, dt dz = xy − βz, dt where σ = 10, ρ = 28, and β = 8/3. This model is chaotic, with (0.91, 0, −14.57) as its Lyapunov exponents. The model doubling time is 0.78 time units. Its attractor has the famous butterfly shape with two distinct wings, or lobes, which were illustrated in Section 2.5.1. It was used by Edward Lorenz to explain the finite horizon of predictability in meteorology. To demonstrate the relevance of the EnKF-N with this model, a numerical twin experiment similar to that used with the Lorenz-95 model is designed. The system is assumed fully observed so that Hk ≡ I3 , with the observation error covariance matrix Rk ≡ 4I3 . The time interval between observational updates is varied from Δt = 0.10 to Δt = 0.50. The larger Δt is, the stronger the impact of model nonlinearity on the state estimation, and the stronger the need for an inflation correction. The performance of the DA schemes is measured by the analysis RMSE (6.23) averaged over a very long run. We test a standard ETKF where the uniform inflation is optimally tuned to minimize the RMSE (about 20 values are tested). We compare it to the EnKF-N, which does not require inflation. In both cases, the ensemble size is set to m = 3. The skills of both filters are shown in Figure 6.8. It is remarkable that the EnKFN achieves an even better performance than the optimally tuned ETKF, without any tuning. As Δt increases, the ETKF requires a significantly stronger inflation. This is mostly needed at the transition between the two lobes of the Lorenz-63 attractor. Within the lobes, the DA system is effectively much more linear and requires little inflation. By contrast, the EnKF-N, which is adaptive, applies a strong inflation only when needed, i.e., at the transition between lobes. An analogous experiment can be carried out, but with the Lorenz-95 model. The setup of Section 6.6 is used. The performance of both filters is shown in the left panel of Figure 6.9 when the ensemble size is varied and when Δt = 0.05. The EnKF-N achieves the same performance but without any tuning, and hence with numerical efficiency. A similar experiment is performed but varying Δt , and hence system nonlinearity, and setting m = 20. The results are reported in the right panel of Figure 6.9. Again, it can be seen that the same performance can be reached without any tuning with the EnKF-N. The adaptation of the EnKF-N is also illustrated by Figure 6.7, where the MLEF and ETKF were compared in the first stages of a DA experiment with the Lorenz-95 model and a nonlinear observation operator. In this context, the finite-size MLEF and ETKF were shown to outperform the standard MLEF and ETKF with optimally tuned uniform inflation.

188

Chapter 6. The ensemble Kalman filter

EnKF with optimally tuned inflation EnKF-N

1.75

Analysis RMSE

1.50

1.25

1.00

0.75

0.50 0.10

0.20

0.30

0.40

0.50

Time interval between update Δt

Figure 6.8. Average analysis RMSE for a deterministic EnKF (ETKF) with optimally tuned inflation and for the EnKF-N. The model is Lorenz-63. EnKF with optimally tuned inflation EnKF-N

EnKF with optimally tuned inflation EnKF-N

1.50

3 2

1.00

Analysis RMSE

Average analysis root mean square error

5 4

1

0.5 0.4

0.80 0.60 0.50 0.40 0.30

0.3

0.25

0.2

0.20

5

6

7

8

9 10

15

20

25

30

35 40 45 50

0.15 0.05

0.10

Ensemble size

0.15

0.20

0.25

0.30

0.35

Time interval between update Δt

0.40

0.45

0.50

Figure 6.9. Average analysis RMSE for a deterministic EnKF (ETKF) with optimally tuned inflation and for the EnKF-N. Left panel: the ensemble size is varied from m = 5 to m = 50 and Δt = 0.05. Right panel: Δt is varied from 0.05 to 0.50 and the ensemble size is set to m = 20. The model is Lorenz-95.

6.7.3.2 Passing on hierarchical statistical information

The EnKF-N was built with the idea to remain algorithmically as close as possible to the EnKF. To avoid relying on any additional input, the hierarchy of information that was established on x b and B was closed by marginalizing over all potential priors leading to effective cost functions. However, it is actually possible to propagate the information that the filter carries about x b and B from one cycle to the next. This idea was formalized in Myrseth and Omre [2010]. It relies on the natural conjugacy of the normal-inverse-Wishart distribution with the multivariate normal distribution. The ensemble can be seen as an observation set for the estimation of the moments of the true distribution, x b and B, which obeys a multivariate normal distribution. If x b and B are supposed to follow a normal-inverse-Wishart distribution, then the posterior distribution will also follow a normal-inverse-Wishart distribution with parameters that are easily updated using the data, i.e., the ensemble members, thanks to the natural conjugacy. This defines a level-2 update scheme for the mean and the error covariances. The mean and error

6.8. The ensemble Kalman smoother

189

covariances used on the level-1 update scheme, i.e., the EnKF, are subsequently drawn from this updated distribution. This scheme is quite different from a traditional EnKF and truly accounts for the uncertainty in the moments. Another and rather similar attempt was documented in Tsyrulnikov and Rakitkoa [2016].

6.8 The ensemble Kalman smoother Filtering consists of assimilating observations as they become available, making the best estimate at the present time. If yK:1 = yK , yK−1 , . . . , y1 is the collection of observations from t1 to tK , filtering aims at estimating the PDF p(xK |yK:1 ) at tK . Only past and present observations are accounted for, which would necessarily be the case for real-time nowcasting and forecasting. Smoothing, on the other hand, aims at estimating the state of the system (or a trajectory of it), using past, present, and possibly future observations. Indeed, assuming again tK is the present time, one could also be interested in the distribution p(xk |yK:1 ), where 1 ≤ k ≤ K. More generally, one would be interested in estimating p(xK:1 |yK:1 ), where xK:1 = xK , xK−1 , . . . , x1 is the collection of state vectors from t1 to tK , i.e., a trajectory. Note that p(xk |yK:1 ) is a marginal distribution of p(xK:1 |yK:1 ) obtained by integrating out x l , with l = 1, . . . , k − 1, k + 1, . . . , K. This is especially useful for hindcasting and reanalysis problems that aim at retrospectively obtaining the best estimate for the model state using all available information. The KF can be extended to address the smoothing problem, leading to a variety of Kalman smoother algorithms [Anderson and Moore, 1979]. It has been introduced in the geosciences and studied in this context by Cohn et al. [1994]. The generalization of the Kalman smoother to the family of ensemble Kalman filters followed the development of the EnKF. Even for linear systems, where we theoretically expect equivalent schemes, there are several implementations of the smoother, even for a given flavor of EnKF [Cosme et al., 2012]. Its implementation may also depend on whether one wishes to estimate a state vector or a trajectory, or on how long the backward analysis can go. In the following, we shall focus on the fixed-lag ensemble Kalman smoother (EnKS) that was introduced and used in Evensen and van Leeuwen [2000], Zhu et al. [2003], Khare et al. [2008], Cosme et al. [2010], and Nerger et al. [2014]. If Δt is the time interval between updates, fixed-lag means that the analysis goes backward by LΔt in time. The variable L measures the length of the time window in units of Δt . We will implement it following the ensemble transform framework as used in Bocquet and Sakov [2013, 2014]. The EnKS is built on the EnKF, which will be its backbone. There are two steps. The first step consists of running the EnKF from tk−1 to tk . The second step consists of updating the state vectors at tk−1 , tk−2 back to tk−L . Let us make L the lag of the EnKS and define tL as the present time. Hence, we wish to retrospectively estimate xL:0 = xL , xL−1 , . . . , x0 . An EnKF has been run from the starting time to tL . The collection of all posterior ensembles EL , EL−1 , . . . , E0 is assumed to be stored, which is a significant technical constraint. For each Ek there is a corresponding mean state xk and a normalized perturbation matrix Xk . This need for memory (at least L × m × n scalars) is the main requirement of the EnKS. The EnKF provides an approximation for the PDF p(xL |yL: ), where yL: represents the collection of observation vectors from the beginning of the DA experiment (earlier than t0 ) to tL . For the EnKF, this PDF can be approximated as a Gaussian distribution. Now, let us describe the backward pass, starting from the latest dates. Let us first derive a retrospective update for x one time step backward. 
Hence, we wish to compute a

190

Chapter 6. The ensemble Kalman filter

Gaussian approximation for p(xL−1 |yL: ). From Bayes’ rule, we obtain p(xL−1 |yL: ) ∝ p(yL |xL−1 , yL−1: ) p(xL−1 |yL−1: ),

which relates the smoothing PDF to the current observation likelihood and to the filtering distribution at tL−1 . From the EnKF’s standpoint, p(xL−1 |yL−1: ) is approximately Gaussian in the affine subspace centered at xaL−1 and spanned by the columns of XaL−1 . The affine subspace can be parameterized by x = xaL−1 + XaL−1 w. Using the variational expression of the ETKF or the MLEF as defined in ensemble subspace, 1 p(xL−1 |yL−1: ) is proportional to exp(− 2 w2 ) when written in terms of the coordinates, w, of the ensemble subspace (see Section 6.4.2). Then p(yL |xL−1 , yL−1: ) is the likelihood and reads, in ensemble subspace, I %2 J 1% p(yL |xL−1 , yL−1: ) ∝ exp − %yL − L ◦ L:L−1 (xaL−1 + XaL−1 w)%R . L 2 Hence the complete cost function for the retrospective analysis on tL−1 is ' (w) =

%2 1 1% 2 a a %y −  ◦  % L L:L−1 (xL−1 + XL−1 w) RL + w . 2 L 2

This type of potentially nonquadratic cost function will be at the heart of the IEnKF/ IEnKS (see Chapter 7). Here, the update is expanded around xaL−1 using the TLM to make the cost function quadratic: %2 1% 1 2 a a %y −  ◦  % L L:L−1 (xL−1 + XL−1 w) RL + w 2 L 2 %2 1% 1 " %yL − L ◦ L:L−1 (xaL−1 ) − HL ML:L−1 XaL−1 w%R + w2 L 2 2 %2 1% 1 % % = %yL − L (xfL ) − HL XfL w% + w2 RL 2 2 % % 1% 1 %2 = %δL − YfL w% + w2 , RL 2 2

' (w) =

where δL = yL − L (xfL ) and, as before, Yfk = Hk Xfk . From this cost function, the derivation of an ETKF-like analysis is immediate. We obtain < = XaL−1 Ω and xa,1 = xaL−1 + XaL−1 w , (6.40) Xa,1 L−1 L−1 with

  f −1 Ω = I m + (YfL )T R−1 L YL

and

w = Ω YfL R−1 L δL .

(6.41)

The superscript 1 indicates that the estimate of xL−1 now accounts for observation one time step ahead. We can proceed backward and consider an updated estimation for xL−2 . The EnKF had yielded xaL−2 , two time steps earlier. The backward pass of the EnKS run, one time step earlier, must have updated xL−2 using the same formula that we derived for xL−1 a few lines above, yielding the estimate xa,1 and the updated ensemble Xa,1 . With L−2 L−2 a,1 xL−2 , we have accounted for yL−1 , but not yL yet. Hence, using Bayesian estimation as a guide, we need to estimate p(xL−2 |yL: ) ∝ p(yL |xL−2 , xL−1: ) p(xL−2 |yL−1: ).

6.8. The ensemble Kalman smoother

191

As above, we can derive from these distributions an approximate quadratic cost function for the retrospective analysis on xL−2 . We formally obtain the cost function ' (w) =

%2 1% 1 % % + Xa,1 w)% + w2 , %y − L ◦ L:L−2 (xa,1 L−2 L−2 RL 2 L 2

where xL−2 is parameterized as xL−2 = xa,1 + Xa,1 w. The outcome is formally the L−2 L−2 same, %2 1% 1 % % ' (w) = %δL − YfL w% + w2 , RL 2 2 with δL = yL − L (xfL ) and YfL = HL XfL , but where w is a vector of coefficients that applies to a different ensemble of perturbations (Xa,1 instead of XaL−1 ). From this cost L−2 function, an ETKF-like analysis is immediate. We obtain = xa,1 + Xa,1 w xa,2 L−2 L−2 L−2  with

and

 −1 YfL Ω = I m + (YfL )T R−1 L

Xa,2 = Xa,1 L−2 L−2