Mathematics for Modeling and Scientific Computing [1 ed.] 1848219881, 978-1-84821-988-5

This book provides the mathematical basis for investigating numerically equations from physics, life sciences or engineering.


English. 472 [475] pages. 2016.



Mathematics for Modeling and Scientific Computing

Series Editor Jacques Blum

Mathematics for Modeling and Scientific Computing

Thierry Goudon

First published 2016 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address: ISTE Ltd 27-37 St George’s Road London SW19 4EU UK

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2016 The rights of Thierry Goudon to be identified as the author of this work have been asserted by him in accordance with the Copyright, Designs and Patents Act 1988. Library of Congress Control Number: 2016949851 British Library Cataloguing-in-Publication Data A CIP record for this book is available from the British Library ISBN 978-1-84821-988-5

Contents

Preface ix

Chapter 1. Ordinary Differential Equations 1
1.1. Introduction to the theory of ordinary differential equations 1
1.1.1. Existence–uniqueness of first-order ordinary differential equations 1
1.1.2. The concept of maximal solution 11
1.1.3. Linear systems with constant coefficients 16
1.1.4. Higher-order differential equations 20
1.1.5. Inverse function theorem and implicit function theorem 21
1.2. Numerical simulation of ordinary differential equations, Euler schemes, notions of convergence, consistence and stability 27
1.2.1. Introduction 27
1.2.2. Fundamental notions for the analysis of numerical ODE methods 29
1.2.3. Analysis of explicit and implicit Euler schemes 33
1.2.4. Higher-order schemes 50
1.2.5. Leslie's equation (Perron–Frobenius theorem, power method) 51
1.2.6. Modeling red blood cell agglomeration 78
1.2.7. SEI model 87
1.2.8. A chemotaxis problem 93
1.3. Hamiltonian problems 102
1.3.1. The pendulum problem 106
1.3.2. Symplectic matrices; symplectic schemes 112
1.3.3. Kepler problem 125
1.3.4. Numerical results 129

Chapter 2. Numerical Simulation of Stationary Partial Differential Equations: Elliptic Problems 141
2.1. Introduction 141
2.1.1. The 1D model problem; elements of modeling and analysis 144
2.1.2. A radiative transfer problem 155
2.1.3. Analysis elements for multidimensional problems 163
2.2. Finite difference approximations to elliptic equations 166
2.2.1. Finite difference discretization principles 166
2.2.2. Analysis of the discrete problem 173
2.3. Finite volume approximation of elliptic equations 180
2.3.1. Discretization principles for finite volumes 180
2.3.2. Discontinuous coefficients 187
2.3.3. Multidimensional problems 189
2.4. Finite element approximations of elliptic equations 191
2.4.1. P1 approximation in one dimension 191
2.4.2. P2 approximations in one dimension 197
2.4.3. Finite element methods, extension to higher dimensions 200
2.5. Numerical comparison of FD, FV and FE methods 204
2.6. Spectral methods 205
2.7. Poisson–Boltzmann equation; minimization of a convex function, gradient descent algorithm 217
2.8. Neumann conditions: the optimization perspective 224
2.9. Charge distribution on a cord 228
2.10. Stokes problem 235

Chapter 3. Numerical Simulations of Partial Differential Equations: Time-dependent Problems 267
3.1. Diffusion equations 267
3.1.1. L2 stability (von Neumann analysis) and L∞ stability: convergence 269
3.1.2. Implicit schemes 276
3.1.3. Finite element discretization 281
3.1.4. Numerical illustrations 283
3.2. From transport equations towards conservation laws 291
3.2.1. Introduction 291
3.2.2. Transport equation: method of characteristics 295
3.2.3. Upwinding principles: upwind scheme 299
3.2.4. Linear transport at constant speed; analysis of FD and FV schemes 301
3.2.5. Two-dimensional simulations 326
3.2.6. The dynamics of prion proliferation 329
3.3. Wave equation 345
3.4. Nonlinear problems: conservation laws 354
3.4.1. Scalar conservation laws 354
3.4.2. Systems of conservation laws 387
3.4.3. Kinetic schemes 393

Appendices 407
Appendix 1 409
Appendix 2 417
Appendix 3 427
Appendix 4 433
Appendix 5 443
Bibliography 447
Index 455

Preface

Early he rose, far into the night he would wait,
To count, to cast up, and to calculate,
Casting up, counting, calculating still,
For new mistakes for ever met his view.

Jean de La Fontaine (The Money-Hoarder and Monkey, Book XII, Fable 3).

This book was inspired by a collection of courses of varied natures and at different levels, all of which focused on different aspects of scientific computing. Therefore, it owes much to the motivation of students from the universities of Nice and Lille, and the École Normale Supérieure. The writing style adopted in this book is based on my experience as a longtime member of the jury for the agrégation evaluations, particularly in the modeling examination. In fact, a substantial part of the examples on the implementation of numerical techniques was drawn directly from the texts made public by the evaluation's jury (see http://agreg.org), and a part of this course was the foundation for a series of lectures given to students preparing for the Moroccan agrégation evaluations. However, some themes explored in this book go well beyond the scope of the evaluations. They include, for example, the rather advanced development of Hamiltonian problems, the fine-grained distinction between the finite-difference method and the finite-volume method, and the discussion of nonlinear hyperbolic problems. The latter topic partially follows a course given at the IFCAM (Indo-French Centre for Applied Mathematics) in Bangalore. A relatively sophisticated set of tools is developed on this topic: this heightened level can be explained by the fact that the questions it explores are of great practical importance. It provides a relevant introduction for those who might want to learn more, and it prepares them for reading more advanced and specialized works.


Numerical analysis and scientific computing courses are often considered a little scary. This fear is due to the fact that the subject can be difficult on several counts:
– Problems of interest are very strongly motivated by their potential applications (for example, in physics, biology, engineering, finance). Therefore, it is impossible to restrict the discussion strictly to the field of mathematics, and the intuitions motivating the math are strongly guided by the specificities of its applications. As a result, the subject requires a certain degree of scientific familiarity that goes beyond mere technical dexterity.
– Numerical analysis draws on a very broad technical background. This is a subject that cannot be addressed with a small, previously delimited set of tools. Rather, we must draw from different areas of mathematics1, sometimes in rather unexpected ways, for example by using linear algebra to analyze the behavior of numerical approximations to differential equations. However, this somewhat roundabout way of finding answers is what makes the subject so exciting.
– Finally, it is often difficult to produce categorical statements and conclusions. For example, although it can be shown that several numerical schemes produce an approximate solution that "converges" towards the solution of the problem of interest (when the numerical parameters are sufficiently small), in practice, some methods are more suitable than others, according to qualitative criteria that are not always easy to formalize. Similarly, the choice of method may depend on the criteria that are considered most important for the target application context. Many questions do not have definite, clear-cut answers. The answer to the question "how should we approach this problem?" is often "it depends": numerically simulating a physical phenomenon through calculations performed by a computer is a real, delicate and nuanced art.
This art must rest on a strong technical mastery of mathematical tools and a deep understanding of the underlying physical phenomena under study. The aim of this book is to fully address these challenges and, by design, to "mix everything up". Therefore, the book includes many classical results from analysis and algebra, details of algorithms for solving certain equations, examples from science and technology, and numerical illustrations. Some "theoretical" tools will be introduced by studying an application example, even if it means repurposing them for an entirely different field. Nevertheless, the book does follow a certain structure, organized into three main chapters, focused on numerical solutions to (ordinary and partial) differential equations. The first chapter addresses the solution of ordinary differential equations, with a very broad overview of its essential theoretical basis 1 The following quote is quite telling: [...] in France, there was even some snobbishness surrounding pure mathematics: when a gifted student was identified, he was told: "Do your PhD in pure mathematics". On the other hand, average students were advised to focus on applied mathematics, under the rationale that that was "all they were capable of doing"! But the opposite is true: it is impossible to do applied mathematics without first knowing how to do pure mathematics properly. J.A. Dieudonné, [SCH 90, p. 104].


(for example, Cauchy–Lipschitz, qualitative analysis, linear problems). This chapter details the analysis of the classical schemes (explicit and implicit Euler methods) and distinguishes various concepts of stability, which are more or less relevant depending on the context. This set of concepts is illustrated by a series of examples, motivated mostly by the description of biological systems. A large section, with fairly substantial technical content, is devoted to the particular case of Hamiltonian systems. The second chapter deals with numerical solutions to elliptic boundary value problems, once again with a detailed exploration of basic functional analysis tools. Although the purpose is mostly restricted to the one-dimensional framework and to the model problem

λu(x) − ∂/∂x( k(x) ∂u/∂x(x) ) = f(x) on ]0, 1[,

with homogeneous Dirichlet conditions, different discretization families are distinguished: finite differences, finite volumes, finite elements and spectral methods. Techniques related to optimization are also presented through the simulation of complex problems such as the Poisson–Boltzmann equation, charge distribution and the Stokes problem. The last chapter deals with evolution partial differential equations, again addressing only the one-dimensional case. Questions of stability and consistency are addressed in this chapter, first for the heat equation and then for hyperbolic problems. The transport and wave equations can be considered "classics". In contrast, the discussion of nonlinear equations, scalar equations or systems, with the simulation of the Euler equations of gas dynamics as a final target, leads to more advanced topics. The book does not contain exercise problems. However, readers are invited to carry out the simulations illustrated in the book on their own.
This work of numerical experimentation will allow readers to develop an intuition for the mathematical phenomena and notions presented here by playing with numerical and modeling parameters, which will lead to a complete understanding of the subject. My colleagues and collaborators have had an important influence on building my personal mathematical outlook; they helped me discover points of view that I was not familiar with, and they have made me appreciate notions that I admit had remained beyond my understanding during my initial schooling and even during the early stages of my career. This is the place to thank them for their patience with me and for everything they have taught me. In particular, I am deeply indebted to Frédéric Poupaud and Michel Rascle, as well as Stella Krell and Magali Ribot in Nice, Caterina Calgaro and Emmanuel Creusé in Lille, Virginie Bonnaillie-Noël, Frédéric Coquel, Benoît Desjardins, Frédéric Lagoutière, and Pauline Lafitte in Paris. I also thank my colleagues on the agrégation jury, especially Florence Bachman, Guillaume Dujardin, Denis Favennec, Hervé Le Dret, Pascal Noble and Gregory Vial. A large number of developments were directly inspired by our passionate conversations. Finally, Claire Scheid, Franck Boyer and Sebastian Minjeaud were kind and patient enough to proofread some of the passages of the manuscript; their advice and suggestions have led to many improvements.

Thierry GOUDON
August 2016

1 Ordinary Differential Equations

1.1. Introduction to the theory of ordinary differential equations

1.1.1. Existence–uniqueness of first-order ordinary differential equations

The most important result from the theory of ordinary differential equations ensures the existence and uniqueness of solutions to equations of the form

y′(t) = f(t, y(t)),   y(tInit) = yInit,   [1.1]

where

yInit ∈ Ω ⊂ R^D,   f : I × Ω −→ R^D.

Here I is an open interval of R containing tInit, and Ω is an open set of R^D containing yInit. The variable t ∈ I is called the time variable, and the variable y is referred to as the state variable. When the function f depends only on the state variable, equation [1.1] is said to be autonomous. We say that a function t → y(t) is a solution of [1.1] if
– y is defined on an interval J that contains tInit and is included in I;
– y(tInit) = yInit and for all t ∈ J, y(t) ∈ Ω;
– y is differentiable on J and for all t ∈ J, y′(t) = f(t, y(t)).

THEOREM 1.1 (Picard–Lindelöf1).– Assume that f is a continuous function on I × Ω and that for every (t⋆, y⋆) ∈ I × Ω, there exist ρ⋆ > 0 and L⋆ > 0, such that

1 Also known as the Cauchy–Lipschitz theorem.


B(y⋆, ρ⋆) ⊂ Ω, [t⋆ − ρ⋆, t⋆ + ρ⋆] ⊂ I, and if y, z ∈ B(y⋆, ρ⋆) and |t − t⋆| ≤ ρ⋆, then we have

|f(t, y) − f(t, z)| ≤ L⋆ |y − z|.

Then for every (t⋆, y⋆) ∈ I × Ω, there exist r⋆ > 0 and h⋆ > 0, such that if |yInit − y⋆| ≤ r⋆ and |tInit − t⋆| ≤ r⋆, the problem [1.1] has a solution y : ]tInit − h⋆, tInit + h⋆[ −→ Ω, which is a class C^1 function. This solution is unique in the sense that if z is a function of class C^1 defined on ]tInit − h⋆, tInit + h⋆[ satisfying [1.1], then y(t) = z(t) for all t ∈ ]tInit − h⋆, tInit + h⋆[. Finally, if f is a function of class C^k on I × Ω, then y is a function of class C^{k+1}.

This statement calls for a number of comments, which we present in detail here.

1) Theorem 1.1 assumes that the function f satisfies a regularity property with respect to the state variable that is stronger than mere continuity: f must be Lipschitz continuous in the state variable, at least locally.2 In particular, note that if f is a function of class C^1 on I × Ω, then it satisfies the assumptions of theorem 1.1. This regularity hypothesis cannot be completely ruled out. However, we will see later that it can be relaxed slightly.

2) Theorem 1.1 only defines the solution in a neighborhood of the initial time tInit. Once the question of existence–uniqueness is settled, it can be worthwhile to investigate the solution's lifespan: is the solution only defined on a bounded interval, or does it exist for all times? We will see that the answer depends on estimates that can be established for the solution of [1.1].

The "classic proof" of the Picard–Lindelöf theorem is based on a fixed point argument that requires the following statement.

THEOREM 1.2 (Banach theorem).– Let E be a vector space with norm ‖·‖, which is assumed to be complete. Let T : E → E be a strict contraction mapping, that is to say, such that there exists 0 < k < 1 which satisfies the following inequality for all x, y ∈ E:

‖T(x) − T(y)‖ ≤ k ‖x − y‖.
Then, T has a unique fixed point in E. 2 The terminology “Lipschitz continuous with respect to the second variable”, found in many books, is likely to lead to unfortunate confusion and misinterpretations, for example when addressing an autonomous system in dimension 2.


PROOF.– Let us begin by establishing uniqueness, assuming existence: if x and y satisfy T(x) = x and T(y) = y, then ‖x − y‖ = ‖T(x) − T(y)‖ ≤ k ‖x − y‖, which implies that x = y because 0 < k < 1. In order to show existence, let us examine the sequence defined iteratively by x_{n+1} = T(x_n), starting at any x_0 ∈ E. We have

‖x_{n+p} − x_n‖ ≤ ‖x_{n+p} − x_{n+p−1}‖ + ... + ‖x_{n+1} − x_n‖
              = ‖T(x_{n+p−1}) − T(x_{n+p−2})‖ + ... + ‖T(x_n) − T(x_{n−1})‖
              ≤ k ( ‖x_{n+p−1} − x_{n+p−2}‖ + ... + ‖x_n − x_{n−1}‖ )
              ≤ ‖x_1 − x_0‖ Σ_{j=n}^{n+p−1} k^j.

Since 0 < k < 1, the series Σ_{j=0}^{∞} k^j converges. It follows that the sequence (x_n)_{n∈N} is a Cauchy sequence in the complete space E. So, it has a limit x and, by continuity of the mapping T, we obtain T(x) = x. □

PROOF OF THEOREM 1.1.– The proof of theorem 1.1 is based on a functional analysis argument: the subtle trick is to work in a vector space whose "points" are functions. In this case, it is important to distinguish between:
– the function t → y(t), which is a point in the functional space (here C^0([tInit, T[; R^D), for example);
– and its value y(t) for a fixed t, which is a point in the state space R^D.

We will justify theorem 1.1 in the case where f is globally Lipschitz with respect to the state variable: we assume that f is defined on R × R^D and that there exists an L > 0, such that for all x, y ∈ R^D and any t ∈ R, we have

|f(t, x) − f(t, y)| ≤ L |x − y|.   [1.2]

This technical limitation is important, but it enables us to focus only on the key elements of the proof. A proof of the general case can be found in [BEN 10] or [ARN 88, Chapter 4], and later we present a somewhat different approach, which starts from the perspective of numerical approximations. The starting point of the proof is to integrate [1.1] to obtain

y(t) = yInit + ∫_{tInit}^{t} f(s, y(s)) ds,   [1.3]


such that the solution y of [1.1] is interpreted as a fixed point of the mapping

T : C^0(R; R^D) −→ C^0(R; R^D),
    ( t → z(t) ) −→ ( t → yInit + ∫_{tInit}^{t} f(s, z(s)) ds ).
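The fixed point of this mapping can be approached by the successive approximations y_{n+1} = T(y_n) (the Picard method discussed next). A numerical sketch of these iterates, under illustrative assumptions not taken from the book (the scalar example f(t, y) = y with yInit = 1, whose exact solution is e^t, and the trapezoidal rule for the integral):

```python
import math

def picard_iterates(f, y_init, t_grid, n_iter):
    """Approximate Picard iteration y_{n+1}(t) = y_init + int_{t0}^{t} f(s, y_n(s)) ds,
    with the integral evaluated by the trapezoidal rule on t_grid."""
    y = [y_init for _ in t_grid]  # y_0: the constant function t -> y_init
    for _ in range(n_iter):
        integrand = [f(t, v) for t, v in zip(t_grid, y)]
        y_next, acc = [y_init], 0.0
        for i in range(1, len(t_grid)):
            acc += 0.5 * (t_grid[i] - t_grid[i - 1]) * (integrand[i] + integrand[i - 1])
            y_next.append(y_init + acc)
        y = y_next
    return y

# f(t, y) = y with y(0) = 1: the iterates converge towards t -> exp(t).
ts = [0.01 * i for i in range(101)]
y = picard_iterates(lambda t, v: v, 1.0, ts, n_iter=25)
print(y[-1], math.exp(1.0))  # both close to e = 2.71828...
```

In exact arithmetic the n-th iterate for this example is the degree-n Taylor polynomial of e^t, which illustrates the factorial-speed convergence of the Picard sequence.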

We will see that this point of view, which transforms a differential equation into an integral equation (the formulation [1.3]), is also useful for finding numerical approximations to the solutions of [1.1]. It is now necessary to construct a functional space and a norm in order for T to be a contraction. Thus, the sequence defined by y_0 given in C^0(R; R^D) and y_{n+1} = T(y_n) will converge to a fixed point of T, which will be the solution to [1.1] (Picard method). We only focus on times t ≥ tInit. We introduce the auxiliary function

A(t) = 1 + |yInit| + ∫_{tInit}^{t} |f(s, 0)| ds > 0

and we set

‖y‖ = sup_{t ≥ tInit} [ e^{−Mt} |y(t)| / A(t) ]

with M > 0 that remains to be defined. We denote by E the subspace of functions z ∈ C^0([tInit, ∞[; R^D) such that ‖z‖ < ∞. With the norm ‖·‖, this space is complete. If y = T(z), with z ∈ E, we can write

y(t) = yInit + ∫_{tInit}^{t} f(s, 0) ds + ∫_{tInit}^{t} ( f(s, z(s)) − f(s, 0) ) ds

and deduce that y ∈ E. Indeed, this function t → y(t) is continuous and satisfies

e^{−Mt} |y(t)| / A(t) ≤ 1 + [ e^{−Mt} / A(t) ] ∫_{tInit}^{t} L e^{Ms} A(s) × [ e^{−Ms} |z(s)| / A(s) ] ds
                      ≤ 1 + [ e^{−Mt} / A(t) ] ∫_{tInit}^{t} L e^{Ms} A(s) ds ‖z‖,

for every t ≥ tInit. Finally, using integration by parts, we calculate

[ e^{−Mt} / A(t) ] ∫_{tInit}^{t} L e^{Ms} A(s) ds = (L/M) [ e^{−Mt} / A(t) ] ∫_{tInit}^{t} [ d(e^{Ms})/ds ] A(s) ds
    = (L/M) [ e^{−Mt} / A(t) ] × ( e^{Mt} A(t) − e^{M tInit} (1 + |yInit|) − ∫_{tInit}^{t} e^{Ms} |f(s, 0)| ds ) ≤ L/M.


Thus, we have ‖y‖ < ∞. Similarly, we have

‖T(y) − T(z)‖ ≤ sup_{t ≥ tInit} [ e^{−Mt} / A(t) ] ∫_{tInit}^{t} L e^{Ms} A(s) × [ e^{−Ms} / A(s) ] |y(s) − z(s)| ds ≤ (L/M) ‖y − z‖.

By choosing M > L, the mapping T is a contraction on the complete space (E, ‖·‖). The Banach theorem ensures the existence and uniqueness of a fixed point, and the relation y = T(y) proves that t → y(t) is a continuous, and even C^1, function because f is continuous. We can easily adapt the proof in order to extend the resulting solution to times t ≤ tInit. Interestingly, by assuming f is globally Lipschitz continuous with respect to the state variable, see [1.2], it has been possible to show directly that the solution is defined for all times. This fact is important and deserves a statement of its own. □

THEOREM 1.3 (Picard–Lindelöf theorem, assuming global Lipschitz continuity).– Let f be a continuous function defined on R × R^D which satisfies [1.2]. Then, for every yInit ∈ R^D, the equation [1.1] has a unique solution y of class C^1 defined on R.

Let us now return to the comments on theorem 1.1 regarding the regularity of the function f. First, in problem [1.1], if we interpret the equation as [1.3], the assumption that f is continuous in the time variable can be weakened; it suffices to assume integrability. For example, the proof of theorem 1.3 can be slightly modified in order to justify the existence and uniqueness of a fixed point of the mapping T in a space of continuous functions on [tInit, ∞[, by assuming that there exists a function t → L(t), locally integrable on [tInit, ∞[, such that for every t ≥ tInit and all x, y ∈ R^D, we have |f(t, x) − f(t, y)| ≤ L(t) |x − y|. This hypothesis generalizes [1.2] (which amounts to the case L(t) = L). We then obtain a function y as a solution to [1.3], which is continuous but not necessarily C^1, and equation [1.1] is only satisfied in a generalized sense.

Next, assume that the function f defined on R × R^D is only continuous and bounded on R × R^D (there exists C > 0, such that for every t, x, we have |f(t, x)| ≤ C).
We introduce a regularizing sequence by defining, for n ∈ N \ {0},

ρ_n(x) = n^D ρ(n x),

where ρ ∈ C_c^∞(R^D) is a smooth function with compact support in R^D, such that

0 ≤ ρ(x) ≤ 1,   ∫_{R^D} ρ(x) dx = 1.


We define

f_n(t, x) = ∫_{R^D} f(t, x − y) ρ_n(y) dy = ( f(t, ·) ⋆ ρ_n )(x).

(For more details about these convolution regularization techniques, the reader may refer to [GOU 11, Section 4.4].) Thus, for every fixed n, f_n is continuous and globally Lipschitz in the state variable, because by writing

ρ_n(z_2) − ρ_n(z_1) = ∫_0^1 [ d/dθ ρ_n( z_1 + θ(z_2 − z_1) ) ] dθ = n^D ∫_0^1 n (∇ρ)( n(z_1 + θ(z_2 − z_1)) ) · (z_2 − z_1) dθ,

we obtain

|f_n(t, x) − f_n(t, y)| ≤ ∫_{R^D} |f(t, z)| ∫_0^1 | n (∇ρ)( n(x − z + θ(y − x)) ) · (x − y) | dθ n^D dz
                       ≤ C n |x − y| × ∫_0^1 ∫_{R^D} |∇ρ(z′)| dz′ dθ = C n ‖∇ρ‖_{L^1} |x − y|,

where we use Fubini's theorem (for functions with positive values) and then the change of variables z′ = n( x − z − θ(x − y) ), dz′ = n^D dz. By theorem 1.1, for every n ∈ N \ {0}, there exists a function t → y_n(t) of class C^1, which is defined on R and is a solution to

y_n′(t) = f_n(t, y_n(t)),   y_n(tInit) = yInit.
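This regularization is easy to visualize numerically. A sketch in dimension D = 1, under illustrative assumptions that are not from the book (f(x) = |x| with the time variable dropped, and the classical bump ρ(x) proportional to exp(−1/(1 − x²))):

```python
import math

def rho(x):
    """C-infinity bump supported in [-1, 1] (normalized by Z below)."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

# Midpoint-rule grid on [-1, 1] and normalization constant so that "int rho = 1".
h = 1e-4
grid = [-1.0 + (i + 0.5) * h for i in range(int(2.0 / h))]
Z = sum(rho(z) for z in grid) * h

def f(x):
    return abs(x)  # continuous, but not differentiable at x = 0

def f_n(x, n):
    """(f * rho_n)(x) with rho_n(y) = n rho(n y), i.e. int f(x - z/n) rho(z) dz / Z."""
    return sum(f(x - z / n) * rho(z) for z in grid) * h / Z

# f_n is smooth; the kink of |x| at 0 is rounded off at scale 1/n:
for n in (1, 4, 16):
    print(n, f_n(0.0, n))  # f_n(0, n) = c / n for a constant c > 0
```

Here f_n(0, n) = (1/n) ∫ |z| ρ(z) dz, so the printed values decay like 1/n: the smoothed functions converge to f while their higher derivatives blow up, consistent with the n-dependent Lipschitz constant in the estimate above.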

However, f_n is uniformly bounded with respect to n; in particular, we have |f_n(t, x)| ≤ C. From this, we deduce that for every 0 < T < ∞, the set { t ∈ [−T, T] → y_n(t), n ∈ N \ {0} } is equibounded and equicontinuous in C^0([−T, T]). The Arzelà–Ascoli theorem (see [GOU 11], theorem 7.49 and example 7.50) allows us to extract a subsequence ( y_{n_k} )_{k∈N} that converges uniformly on [−T, T] to some limit y, with lim_{k→∞} n_k = +∞. We can therefore see that

|f_{n_k}(s, y_{n_k}(s)) − f(s, y(s))| ≤ ∫_{R^D} | f(s, y_{n_k}(s) − z/n_k) − f(s, y(s)) | ρ(z) dz.

Since y_{n_k}(s) tends to y(s) and f is continuous, the integrand tends to 0 when k → ∞, and, moreover, it is dominated by 2C ρ(z) ∈ L^1(R^D). Lebesgue's


theorem implies that lim_{k→∞} f_{n_k}(s, y_{n_k}(s)) = f(s, y(s)). Letting k tend to +∞ in the integral relation

y_{n_k}(t) = yInit + ∫_{tInit}^{t} f_{n_k}(s, y_{n_k}(s)) ds,

for every t ∈ [−T, T], with tInit < T < ∞, by the Lebesgue theorem, we obtain

y(t) = yInit + ∫_{tInit}^{t} f(s, y(s)) ds

and finally we can conclude that y is a solution of [1.1]. The continuity of f is therefore sufficient to show the existence of a solution to the equation [1.1]. This is the Cauchy–Peano theorem. However, the solution is not in general unique. A simple counterexample, in dimension D = 1, is given by the equation

y′(t) = √(y(t)),  y(0) = 0.

It is clear that t → y(t) = 0 is a solution on R. But it is not the only one, because t → z(t) = t²/4 is also a solution. The assumptions of theorem 1.1 are not satisfied, because y → √y is continuous but not Lipschitz around y = 0 (the derivative is 1/(2√y), which tends to +∞ when y → 0).

Thus, we note that on its own, the continuity of f in the state variable is not enough to ensure the truth of an existence–uniqueness statement as strong as theorem 1.1, as shown by the example f(t, y) = √y. However, we can slightly relax the regularity assumption with respect to the state variable stated in theorem 1.1 and still justify the existence–uniqueness of solutions to the differential problem. Sometimes, more sophisticated statements like these are necessary. We will study one such example in detail. The following statement is crucial in the analysis of the equations of incompressible fluid mechanics [CHE 95].

DEFINITION 1.1.– Let μ : [0, ∞[ → [0, ∞[ be a continuous, strictly increasing function such that μ(0) = 0. We denote by Cμ(Ω) the set of functions u that are continuous on Ω and for which there exists a C > 0 such that |u(x) − u(y)| ≤ C μ(|x − y|) for all x, y ∈ Ω. Cμ(Ω) is a Banach space for the norm

‖u‖μ = ‖u‖_{L∞} + sup_{x≠y} |u(x) − u(y)| / μ(|x − y|).
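The failure of uniqueness for y′ = √y is easy to verify by direct substitution; the check below is a minimal sketch (not from the text) confirming numerically that both candidate solutions satisfy the equation.

```python
import numpy as np

# Both y(t) = 0 and z(t) = t**2/4 satisfy y'(t) = sqrt(y(t)) with y(0) = 0,
# so the Cauchy problem has (at least) two distinct solutions.
t = np.linspace(0.0, 2.0, 201)

y_zero = np.zeros_like(t)      # the trivial solution y(t) = 0
z = t**2 / 4.0                 # z(t) = t^2/4, with z'(t) = t/2

# ODE residuals: derivative of the candidate minus sqrt of the candidate.
residual_zero = np.max(np.abs(0.0 - np.sqrt(y_zero)))
residual_z = np.max(np.abs(t / 2.0 - np.sqrt(z)))

print(residual_zero, residual_z)  # both vanish up to rounding
```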


Therefore, it is convenient to consider the right-hand side of equation [1.1] as "a function in the state variable, with the time variable as a parameter": for (almost) every t ∈ I, we assume that y → f(t, y) is a function in Cμ(Ω; R^D). We say that f ∈ L¹(I; Cμ(Ω)) when there exists an L ∈ L¹(I) with strictly positive values such that, for all x, y ∈ Ω,

|f(t, x)| ≤ L(t),  |f(t, x) − f(t, y)| ≤ L(t) μ(|x − y|).  [1.4]

THEOREM 1.4 (Osgood theorem).– Let μ : [0, ∞[ → [0, ∞[ be a continuous, strictly increasing function such that μ(0) = 0. Assume further that

∫₀^∞ ds/μ(s) = +∞.  [1.5]

Let I be an interval of R that contains tInit and f ∈ L¹(I; Cμ(Ω)). Then, for every yInit ∈ Ω, there is an interval I′ ⊂ I containing tInit such that the equation [1.1] has a unique solution t → y(t) defined on I′. Equation [1.1] is understood in the sense of [1.3] being satisfied. (We say that y is a mild solution of [1.1].)

PROOF.– In fact, this statement raises two questions of a somewhat different nature. We can easily adapt the proof of the Cauchy–Peano theorem to demonstrate the existence of a continuous solution to [1.3] under the simple assumption that f ∈ L¹(I; C(Ω)). However, this does not justify uniqueness, which, as we have seen, can fail when f is merely continuous. Extending uniqueness to functions satisfying [1.4] is a result due to [OSG 98]. Another problem is to justify the convergence of the sequence of Picard iterations: given y₀ ∈ C⁰([tInit, ∞[), the sequence defined by

yk+1(t) = yInit + ∫_{tInit}^{t} f(s, yk(s)) ds  [1.6]

converges to one (and therefore the) solution of [1.3]. This issue has been studied for its part in [WIN 46].

Let us first prove uniqueness. Suppose that there are two solutions, y1 and y2, of [1.3], and define z = |y2 − y1|. Then, for t ≥ tInit, we obtain

0 ≤ z(t) ≤ ∫_{tInit}^{t} L(s) μ(z(s)) ds,  z(tInit) = 0.

We then focus on functions α : [tInit, ∞[ → [0, ∞] such that

0 ≤ α(t) ≤ α₀ + ∫_{tInit}^{t} L(s) μ(α(s)) ds.


Let us first assume that α₀ > 0. We define

Ω : t ∈ [tInit, ∞[ −→ ∫_{tInit}^{t} L(s) μ(α(s)) ds.

This function is continuous, non-negative, and satisfies Ω(tInit) = 0. In addition, since Ω is defined as the integral of a positive function, it is non-decreasing and therefore differentiable almost everywhere, with

0 ≤ Ω′(t) = L(t) μ(α(t)) ≤ L(t) μ(α₀ + Ω(t)),

since L(t) ≥ 0 and μ is non-decreasing. Because α₀ + Ω(t) ≥ α₀ > 0 and μ is strictly increasing, we have μ(α₀ + Ω(t)) > μ(0) = 0, making the following relation true:

∫_{tInit}^{t} Ω′(s) / μ(α₀ + Ω(s)) ds ≤ ∫_{tInit}^{t} L(s) ds.

Recalling that Ω(tInit) = 0 and with the change of variables σ = α₀ + Ω(s), dσ = Ω′(s) ds, this becomes

∫_{α₀}^{α₀+Ω(t)} dσ/μ(σ) = ∫_{tInit}^{t} Ω′(s) / μ(α₀ + Ω(s)) ds ≤ ∫_{tInit}^{t} L(s) ds.

In particular, the function t → z(t) = |y2(t) − y1(t)| satisfies this relation for any α₀ > 0. Suppose that z is not identically zero. In this case, we can find t′ > tInit and η > 0 such that z(t) > 0 on an interval of the form [t′ − η, t′] ⊂ [tInit, t′]. Accordingly, the function Ω associated with α = z is strictly positive at t′. We obtain

∫_{α₀}^{Ω(t′)} dσ/μ(σ) ≤ ∫_{α₀}^{α₀+Ω(t′)} dσ/μ(σ) ≤ ∫_{tInit}^{t′} L(s) ds

for every 0 < α₀ < Ω(t′), recalling that L is integrable. Letting α₀ tend to 0, we are led to a contradiction with [1.5].

We then examine the behavior of the sequence (yk)_{k∈N} defined by [1.6]. We first check that, when T is small enough, this sequence is well defined and remains bounded in C([tInit, T]): we set M > 0 and show that for every k ∈ N and t ∈ [tInit, T],

|yk(t) − yInit| ≤ M.


Indeed, we have

|yk+1(t) − yInit| = | ∫_{tInit}^{t} ( f(s, yk(s)) − f(s, yInit) + f(s, yInit) ) ds | ≤ ∫_{tInit}^{t} L(s) ( μ(|yk(s) − yInit|) + 1 ) ds.

If |yk(t) − yInit| ≤ M, for a given M > 0, since μ is increasing, it follows that

|yk+1(t) − yInit| ≤ (μ(M) + 1) ∫_{tInit}^{T} L(s) ds

is satisfied for all tInit ≤ t ≤ T. Let us now introduce T > 0 such that

(μ(M) + 1) ∫_{tInit}^{T} L(s) ds < M,

which is permissible because lim_{T→tInit} ∫_{tInit}^{T} L(s) ds = 0.

Given this preliminary result, we note that for all integers k, p and t ∈ [tInit, T],

|yk+p(t) − yk(t)| ≤ ∫_{tInit}^{t} L(s) μ(|yk+p−1(s) − yk−1(s)|) ds.

Since μ is monotone, the function zk(t) = sup_p |yk+p(t) − yk(t)| satisfies

0 ≤ zk(t) ≤ ∫_{tInit}^{t} L(s) μ(zk−1(s)) ds.

Applying the Fatou lemma [GOU 11, lemma 3.27] to the sequence of positive functions μ(2M) − μ(zk(t)), we can deduce that Z(t) = lim sup_{k→∞} zk(t) satisfies

0 ≤ Z(t) ≤ ∫_{tInit}^{t} L(s) μ(Z(s)) ds.

The same reasoning that made it possible to demonstrate the uniqueness of the solution to [1.3] applies here and allows us to conclude that Z(t) = 0 on [tInit, T]. Therefore, (yk)_{k∈N} is a Cauchy sequence in the complete space C([tInit, T]). This sequence has a limit y for uniform convergence on [tInit, T]. Letting k → ∞ in [1.6], we effectively show that y satisfies [1.3]. □
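The Picard iteration [1.6] can also be exercised numerically. The discretization below (uniform grid, trapezoidal quadrature, the names `picard` and `n_iter`) is an illustrative assumption, not the book's construction; it is tested on y′ = y, y(0) = 1, whose solution is exp(t).

```python
import numpy as np

def picard(f, t, y_init, n_iter=20):
    # y_0: constant initial guess equal to the initial datum
    y = np.full_like(t, y_init)
    for _ in range(n_iter):
        g = f(t, y)  # integrand s -> f(s, y_k(s))
        # cumulative trapezoidal rule approximating int_{t_init}^t g(s) ds
        integral = np.concatenate(
            ([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * np.diff(t))))
        y = y_init + integral  # y_{k+1}(t) = y_init + int f(s, y_k(s)) ds
    return y

t = np.linspace(0.0, 1.0, 401)
y = picard(lambda t, y: y, t, 1.0)   # y' = y, y(0) = 1
err = np.max(np.abs(y - np.exp(t)))
print(err)  # small: only the quadrature error remains after convergence
```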


1.1.2. The concept of maximal solution

It is important to recall that theorem 1.1 only ensures the existence of a solution to [1.1] defined in a neighborhood of the initial time tInit. This is an inherently local result. Additional information is required to prove that the solution is globally defined. The case of a function f that is globally Lipschitz continuous with respect to the state variable is one of those special situations in which the solution of [1.1] is defined for all times. However, in dimension D = 1, the example of

y′(t) = y²(t),  y(tInit) = yInit

shows that the solution can explode in finite time, even if the right-hand side f is a function of class C¹ on R × R^D. In this case, we, in fact, have

y(t) = 1 / ( 1/yInit − (t − tInit) ),

which is therefore not defined beyond³ T⋆ = tInit + 1/yInit.

Defining a solution of [1.1] therefore requires determining both an interval I′ containing tInit and a function y defined on I′ (with values in Ω). Given two solutions (Ĩ, ỹ) and (I′, y) of [1.1], we say that (Ĩ, ỹ) extends (I′, y) if

I′ ⊂ Ĩ  and  ỹ|_{I′} = y.

We say that (Ĩ, ỹ) is a maximal solution if it can only be extended by itself.

PROPOSITION 1.1.– Let f satisfy the assumptions of theorem 1.1. For every tInit ∈ I and yInit ∈ Ω, there is a unique maximal solution (Ĩ, ỹ) of [1.1], and Ĩ ⊂ I is an open interval of R.

PROOF.– The set of solutions (I′, y) of [1.1] contains at least ({tInit}, yInit). Let us consider two solutions (Ĩ, ỹ) and (I′, y). In particular, Ĩ and I′ are two intervals containing tInit. Let J = Ĩ ∩ I′. Then, J ⊂ I ⊂ R is also an interval that at least contains tInit, and we have ỹ(tInit) = y(tInit). Let U = {t ∈ J, ỹ(t) = y(t)}. By theorem 1.1, U is an open set of R containing tInit (there is a solution of the differential equation with initial value (t, y(t) = ỹ(t)), which is uniquely defined on an open interval ]t − η, t + η[, with η depending on (t, y(t)); since y and ỹ are themselves solutions of this problem, we have y(s) = ỹ(s) for every s ∈ ]t − η, t + η[).

³ For yInit > 0 (respectively yInit < 0), the formula is only meaningful for t < T⋆ (respectively t > T⋆).


However, U is also closed in J by the continuity of the functions ỹ and y: if (tn)_{n∈N} is a sequence of elements of U which tends to t ∈ J, then ỹ(t) = y(t) and t ∈ U (we can also note that U is the inverse image of the closed set {0} by the continuous function y − ỹ). It follows that U = J, by the characterization of the connected sets of R.

We define V as the union of all the intervals on which we can define a solution of [1.1]. For t ∈ V, there exists a unique solution s → y(s) of [1.1] defined in a neighborhood of s = t; we thus define ỹ(t) = y(t). By construction, (V, ỹ) is the maximal solution of [1.1]. We introduce the set

A = { T > 0 such that there exists a solution yT of [1.1] which satisfies yT(tInit) = yInit and is defined on [tInit, tInit + T] ⊂ I }.

As a result of theorem 1.1, A is non-empty. We define T⋆ = sup A = sup V. Let T, T′ ∈ A, T < T′. By uniqueness of the solution of the differential equation associated with the point (tInit, yInit), it follows that ([tInit, tInit + T′], yT′) extends ([tInit, tInit + T], yT); in particular, A is an interval. Let tInit < t < tInit + T⋆. Then, there exists a T ∈ A such that t < T + tInit < T⋆ + tInit and yT is the unique solution of [1.1] defined on [tInit, tInit + T]. Therefore, we conclude that there exists a unique solution of [1.1] associated with (tInit, yInit) and defined on the interval [tInit, tInit + T⋆[, which is an open set of [tInit, +∞[ ∩ I. Finally, if T⋆ is a finite element of A, then (T⋆, ỹ(T⋆)) ∈ I × Ω and theorem 1.1 extends the solution beyond T⋆. We conclude that Ĩ is an open interval. □

NOTE.– An important special case that allows us to estimate solutions is when there exists a "stationary point". Indeed, if f(t, x₀) = 0 for every t ∈ I, then the constant function t → x₀ is a solution of [1.1], with yInit = x₀ being the initial data (for any initial time tInit). Uniqueness implies that no other solution of [1.1] can pass through x₀.

Let (I′, y) be the maximal solution of [1.1]. We can write I′ = ]T∗, T⋆[. If T⋆ is finite, we call the right end the associated set

ω⋆ = { b ∈ Ω such that there exists a sequence (tn)_{n∈N} satisfying tn ∈ I′, lim_{n→∞} tn = T⋆ and lim_{n→∞} y(tn) = b }.

We then compare I′ with the definition set I of the data of the problem [1.1].


THEOREM 1.5.– We have {T⋆} × ω⋆ ⊂ cl(I × Ω) \ (I × Ω) = ∂(I × Ω), where cl denotes the closure, with analogous definitions and results for the "left end".

PROOF.– Suppose that T⋆ < ∞ and ω⋆ ≠ ∅, that is, b ∈ ω⋆. By definition, we always have (T⋆, b) ∈ cl(I × Ω). More specifically, if we denote the maximal solution of [1.1] as (]T∗, T⋆[, y), there exists a sequence (tn)_{n∈N} such that

(tn, y(tn)) ∈ I × Ω,  lim_{n→∞} tn = T⋆,  lim_{n→∞} y(tn) = b.

Suppose, by contradiction, that T⋆ ∈ I and b ∈ Ω. We can therefore find τ, r > 0 such that
– C = [T⋆ − τ, T⋆ + τ] × B(b, r) ⊂ I × Ω;
– for every (t, y) ∈ C, |f(t, y)| ≤ M and r > 2Mτ;
– f is locally Lipschitz continuous in the state variable on C.

We define C′ = [T⋆ − τ/3, T⋆ + τ/3] × B(b, r/3). Then, for (s, z) ∈ C′, the set C″ = [s − 2τ/3, s + 2τ/3] × B(z, 2r/3) satisfies, by construction, C′ ⊂ C″ ⊂ C. The local existence theorem ensures the existence of a function t → ψ(t), defined in a neighborhood J of s, which is a solution of ψ′(t) = f(t, ψ(t)) with ψ(s) = z. Moreover, the neighborhood can be chosen in such a way that (t, ψ(t)) ∈ C″ for every t ∈ J.

Furthermore, there exists an integer N such that |tN − T⋆| ≤ τ/3 and |y(tN) − b| ≤ r/3: (tN, y(tN)) ∈ C′ ⊂ C″. We can find a function t → ψ(t) that satisfies ψ′(t) = f(t, ψ(t)) and ψ(tN) = y(tN). By writing

ψ(t) = y(tN) + ∫_{tN}^{t} f(s, ψ(s)) ds,

we can see that if |t − tN| ≤ 2τ/3,

|ψ(t) − b| ≤ |y(tN) − b| + | ∫_{tN}^{t} |f(s, ψ(s))| ds | ≤ |y(tN) − b| + M|t − tN| ≤ r/3 + 2Mτ/3 < 2r/3.

That is, ψ(t) ∈ B(b, 2r/3) and (t, ψ(t)) ∈ C″. It follows that ψ is defined on the interval [tN − 2τ/3, tN + 2τ/3].


We have two solutions of the differential equation x′(t) = f(t, x(t)) which pass through y(tN) at the instant tN: the maximal solution y and the solution ψ just constructed. By uniqueness, they must be one and the same. However, T⋆ = (T⋆ − tN) + tN ≤ tN + τ/3 < tN + 2τ/3, so ψ extends the maximal solution beyond T⋆, which is a contradiction. □

The statement of theorem 1.5 can be a bit abstruse, but it is possible to deduce some more practical formulations.

COROLLARY 1.1.– If T⋆ < ∞, then
– either y′(t) = f(t, y(t)) is unbounded in a neighborhood of T⋆,
– or else t → y(t) has a limit b ∈ R^D when t → T⋆, and (T⋆, b) ∉ I × Ω.

PROOF.– Suppose that T⋆ is finite and that t → y′(t) = f(t, y(t)) is bounded by M > 0. It follows that |y(t) − y(s)| ≤ M|t − s|. We can therefore infer that for every sequence (tn)_{n∈N} that converges to T⋆, (y(tn))_{n∈N} is a Cauchy sequence and thus has a limit. Moreover, this limit does not depend on the sequence considered. In other words, there exists a b ∈ R^D such that lim_{t→T⋆} y(t) = b. Theorem 1.5 ensures that (T⋆, b) ∉ I × Ω. This result is subtle: it may well happen that y extends by continuity to T⋆ and, by the same token, that f(t, y) extends by continuity at (T⋆, b); the obstruction is only that (T⋆, b) lies outside I × Ω. □

COROLLARY 1.2 (blow-up criterion).– If f is defined on R × R^D and one of the ends T∗ or T⋆, denoted as T̄, is finite, then lim_{t→T̄} |y(t)| = +∞.

PROOF.– Let (tn)_{n∈N} be a sequence that converges to T⋆ < ∞ such that (y(tn))_{n∈N} remains bounded. We can thus extract a subsequence such that lim_{k→∞} y(tn_k) = b ∈ R^D. By theorem 1.5, (T⋆, b) is not in the domain of f. However, since the domain is the entire space R × R^D, we have a contradiction. □

It is useful to have simple and representative examples in mind for the variety of situations that can occur (in one dimension):

– The function f(t, y) = y² is defined on all of R; it is locally but not globally Lipschitz continuous.
The solutions of [1.1] are not defined for all times: there exists a finite lifespan T⋆(yInit) and lim_{t→T⋆(yInit)} y(t) = +∞.

– With f(t, y) = −1/y, we have I × Ω = R × (R \ {0}). For tInit = 0 and yInit > 0, we get y(t) = √(yInit² − 2t): the lifespan in the future T⋆(yInit) is finite and lim_{t→T⋆(yInit)} y(t) = 0, at the edge of f's domain. Moreover, we have lim_{t→T⋆(yInit)} y′(t) = −∞.

– The function f(t, y) = √y is locally Lipschitz continuous on R × Ω, with Ω = ]0, ∞[, and continuous at (t, 0) for every t ∈ R. For tInit = 0 and yInit > 0, we obtain y(t) = (√yInit + t/2)². The lifespan in the future T⋆(yInit) is infinite, but the lifespan in the past T∗(yInit) is finite, with lim_{t→T∗(yInit)} y(t) = 0, at the edge of f's domain. Here we have lim_{t→T∗(yInit)} y′(t) = 0.

– The function f(t, y) = sin(1/y) is locally Lipschitz continuous on R × Ω, with Ω = R \ {0}, and cannot be extended by continuity at 0. Since this function is bounded, if the lifespan T⋆ of the maximal solution of [1.1] were finite, then y(t) would have to tend to 0 when t → T⋆. However, every initial value yInit > 0 can be bounded from above and below by equilibrium points of the form 1/(nπ). The maximal solution is therefore globally defined: T⋆(yInit) = +∞.

By virtue of corollary 1.2, in order to establish that the maximal solution of [1.1] is defined for all times, it suffices to provide estimates that rule out the possibility that the solution blows up. A very useful tool for obtaining this kind of estimate is the following.

LEMMA 1.1 (Grönwall's Lemma).– Let α₀ ∈ R and a, b and α be continuous, real-valued functions on [0, T], 0 < T < ∞, with a(t) ≥ 0 for all t ∈ [0, T]. Suppose that

α(t) ≤ α₀ + ∫₀^t a(s) α(s) ds + ∫₀^t b(s) ds

for 0 ≤ t ≤ T. Then, we have

α(t) ≤ α₀ exp( ∫₀^t a(s) ds ) + ∫₀^t b(s) exp( ∫_s^t a(σ) dσ ) ds.  [1.7]

PROOF.– We denote the majorant in [1.7] as β(t), which is a differentiable function satisfying

β′(t) = a(t) β(t) + b(t),  β(0) = α₀.

We set δ(t) = β(t) − α(t), and we wish to show that it is positive-valued. Now, we have

δ(t) = β(0) + ∫₀^t β′(s) ds − α(t) = α₀ + ∫₀^t ( a(s) β(s) + b(s) ) ds − α(t) ≥ ∫₀^t a(s) δ(s) ds.

We denote the right-hand term as Δ(t), which is a differentiable function and satisfies

Δ′(t) = a(t) δ(t) ≥ a(t) Δ(t)


since a(t) ≥ 0, with Δ(0) = 0. It follows that

d/dt [ exp(−∫₀^t a(s) ds) Δ(t) ] = exp(−∫₀^t a(s) ds) ( Δ′(t) − a(t) Δ(t) ) ≥ 0.

This implies that

exp(−∫₀^t a(s) ds) Δ(t) ≥ Δ(0) = 0

and that Δ(t) ≥ 0. Finally, we conclude that δ(t) ≥ Δ(t) ≥ 0. □

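A quick numerical sanity check of Grönwall's lemma, with an assumed choice of data (a ≡ 1, b ≡ 0, α₀ = 1, α(t) = exp(t/2), all illustrative and not from the text): one verifies that α satisfies the integral inequality and that [1.7] then bounds it by exp(t).

```python
import numpy as np

t = np.linspace(0.0, 2.0, 2001)
alpha = np.exp(t / 2.0)  # candidate function alpha(t)

# Hypothesis: alpha(t) <= alpha_0 + int_0^t alpha(s) ds (trapezoidal quadrature)
integral = np.concatenate(
    ([0.0], np.cumsum(0.5 * (alpha[1:] + alpha[:-1]) * np.diff(t))))
hypothesis_ok = bool(np.all(alpha <= 1.0 + integral + 1e-9))

# Conclusion of [1.7] with b = 0: alpha(t) <= alpha_0 * exp(int_0^t a) = exp(t)
conclusion_ok = bool(np.all(alpha <= np.exp(t)))

print(hypothesis_ok, conclusion_ok)
```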

COROLLARY 1.3.– Let f be a continuous, locally Lipschitz function defined on R × R^D. Suppose that there exists an M > 0 such that for all t, y we have |f(t, y)| ≤ M(1 + |y|). Then, the maximal solution of [1.1] is defined on all of R.

PROOF.– The maximal solution satisfies the following estimate, for t > tInit:

|y(t)| ≤ |yInit| + ∫_{tInit}^{t} |f(s, y(s))| ds ≤ |yInit| + M ∫_{tInit}^{t} (1 + |y(s)|) ds.

Grönwall's lemma allows us to infer that

|y(t)| ≤ e^{M(t−tInit)} ( |yInit| + M(t − tInit) ).

This estimate rules out the possibility that the solution explodes in finite time. □
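The linear-growth condition in corollary 1.3 is essential: for the quadratic right-hand side f(t, y) = y², seen in section 1.1.2, no such bound holds and the solution blows up at T⋆ = tInit + 1/yInit. A minimal numerical sketch (forward Euler with an illustrative step size, not a scheme from the text):

```python
import numpy as np

y_init = 1.0          # exact blow-up time T* = t_init + 1/y_init = 1.0
dt = 1e-4
t, y = 0.0, y_init
while t < 0.99:       # stop just before T*
    y += dt * y * y   # Euler step for y' = y^2
    t += dt

exact = 1.0 / (1.0 / y_init - t)  # explicit solution 1/(1/y_init - t)
print(y, exact)  # both large: the solution leaves every bounded set near T*
```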

1.1.3. Linear systems with constant coefficients

For linear systems with constant coefficients, the solution can be expressed with the matrix exponential: the solution of

y′(t) = Ay(t) + b(t),  y(0) = yInit  [1.8]

is given by

y(t) = e^{At} yInit + ∫₀^t e^{A(t−s)} b(s) ds,

where

e^{At} = Σ_{k=0}^{∞} (At)^k / k!.

This formula allows us to determine the qualitative properties of the solution. In some cases, especially for small dimensions, it allows us to find explicit expressions. The process relies on the Jordan–Chevalley decomposition (see, for example, [SER 01, Prop. 2.7.2]).


THEOREM 1.6 (Jordan–Chevalley decomposition).– Every matrix A ∈ M_N(C) can be written in the form A = D + N, where D is diagonalizable over C, N is nilpotent and DN = ND.

We write the eigenvalues λ₁, ..., λ_N of D repeated according to their multiplicity, with p + 1 as the nilpotency index of N. Since D and N commute, we can write

e^{At} = e^{(D+N)t} = e^{Dt} e^{Nt} = P diag(e^{λ₁t}, ..., e^{λ_N t}) P^{−1} ( I + tN + t²N²/2! + ... + t^p N^p/p! ).

This method can be used for "small" systems: once the eigenvalues, eigenvectors and generalized eigenvectors of the matrix A are known, we have an explicit formulation of e^{tA}. More generally, the spectral properties of the matrix are reflected in the qualitative behavior of the solutions of [1.8]. In order to show this relationship, we recall the following notions and notation:

– λ ∈ C is an eigenvalue of A whenever Eλ = Ker(A − λI) ≠ {0}; this vector subspace is called the eigenspace associated with the eigenvalue λ.

– We also define the characteristic space associated with an eigenvalue λ: Êλ = Ker((A − λI)^k), where k is the largest integer such that Ker((A − λI)^{k−1}) is strictly contained in Ker((A − λI)^k).

– The dimension of Eλ is the geometric multiplicity of the eigenvalue λ, and the dimension of Êλ is its algebraic multiplicity. When the two values are the same (equivalently, the index k equals 1), λ is said to be a semisimple eigenvalue; if dim(Êλ) = dim(Eλ) = 1, then λ is said to be a simple eigenvalue. The algebraic multiplicity of λ is, in fact, given by the order of λ as a root of the characteristic polynomial μ → det(A − μI) of the matrix A.

– It is possible to decompose C^N = ⊕_{λ∈σ(A)} Êλ, where σ(A) is the set of eigenvalues (the spectrum) of A. In particular, if all the eigenvalues are semisimple, then the matrix A is diagonalizable.

NOTE.– It would be well to avoid hasty conclusions when determining the Jordan–Chevalley decomposition of a matrix. For example, it could be tempting to decompose

A = ( 2 3 ; 0 4 ) = ( 2 0 ; 0 4 ) + ( 0 3 ; 0 0 ) =: D + N.

It is indeed a "diagonal(-izable) + nilpotent" decomposition, but those matrices do not commute:

DN = ( 0 6 ; 0 0 ),  ND = ( 0 12 ; 0 0 ).

In fact, the matrix A has two distinct eigenvalues, 2 and 4, and therefore it is diagonalizable: it is similar to the diagonal matrix ( 2 0 ; 0 4 ) with the change-of-basis matrix P = ( 1 1 ; 0 2/3 ).
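As a sanity check on the note above, one can verify numerically that e^{tA} computed from the diagonalization A = P diag(2, 4) P^{−1}, with the stated change-of-basis matrix P, agrees with a truncated power series for the exponential; the script below is a minimal sketch.

```python
import numpy as np

A = np.array([[2.0, 3.0], [0.0, 4.0]])
P = np.array([[1.0, 1.0], [0.0, 2.0 / 3.0]])  # columns: eigenvectors for 2 and 4
t = 0.5

# exp(tA) via the diagonalization A = P diag(2, 4) P^{-1}
expA_diag = P @ np.diag([np.exp(2 * t), np.exp(4 * t)]) @ np.linalg.inv(P)

# exp(tA) via the truncated series sum_{k<30} (tA)^k / k!
expA_series = np.zeros((2, 2))
term = np.eye(2)
for k in range(30):
    expA_series += term
    term = term @ (t * A) / (k + 1)

diff = np.max(np.abs(expA_diag - expA_series))
print(diff)  # agreement up to truncation and rounding
```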

THEOREM 1.7.– The mapping t → e^{tA} is bounded on [0, ∞[ if and only if all the eigenvalues of A have a negative or zero real part and the purely imaginary eigenvalues are semisimple. Furthermore, e^{tA} tends to 0 when t → ∞ if and only if all the eigenvalues of A have a strictly negative real part.

PROOF.– We denote the distinct eigenvalues of A as λ₁, ..., λ_p and introduce the projection operators P_j onto Ê_{λ_j} parallel to ⊕_{k≠j} Ê_{λ_k}. Thus, for each j ∈ {1, ..., p}, we can find an integer k_j such that

(A − λ_j I)^{k_j} P_j = 0,  [1.9]

and the identity matrix can be written as the sum of the projectors:

I = Σ_{j=1}^{p} P_j.

We can then write

e^{tA} = e^{tA} Σ_{j=1}^{p} P_j = Σ_{j=1}^{p} e^{λ_j t} e^{t(A − λ_j I)} P_j = Σ_{j=1}^{p} e^{λ_j t} ( Σ_{ℓ=0}^{∞} (t^ℓ/ℓ!) (A − λ_j I)^ℓ ) P_j,

where, by [1.9], each inner series truncates at ℓ = k_j − 1.

If Re(λj ) < 0 for every j ∈ {1, ..., p}, this quantity therefore tends to 0 when t → ∞, by [1.9]. It remains to study the case where A has eigenvalues with a real


component equal to zero. If those eigenvalues are semisimple, then the index k_j associated with them is k_j = 1, and we can infer that e^{tA} is bounded for every t ≥ 0.

Conversely, suppose that there exists an eigenvalue λ such that Re(λ) > 0. Let v ≠ 0 be an associated eigenvector. We have Av = λv, so e^{tA} v = e^{λt} v is not bounded on [0, ∞[. Next, suppose that there exists a λ ∈ σ(A) ∩ iR that is not semisimple. We can find two vectors v, w ≠ 0 such that

Av = λv,  Aw = λw + v.

It follows that A^ℓ w = λ^ℓ w + ℓ λ^{ℓ−1} v, and therefore

e^{tA} w = ( Σ_{ℓ=0}^{∞} (tλ)^ℓ/ℓ! ) w + ( Σ_{ℓ=1}^{∞} ℓ t^ℓ λ^{ℓ−1}/ℓ! ) v = e^{λt} (w + tv).

We can infer that, for large enough t, |e^{tA} w| ≥ t|v| − |w|, which tends to +∞ when t → ∞, and therefore e^{tA} is not bounded on [0, ∞[. Finally, if there exists λ ∈ σ(A) ∩ iR, whether or not it is semisimple, e^{tA} does not tend to 0 when t → ∞, because for every eigenvector v ≠ 0 associated with λ we have e^{tA} v = e^{tλ} v, which has constant norm. □

This rationale is also important in understanding the behavior of certain nonlinear equations. Indeed, we are interested in problems of the type

(d/dt) y(t) = f(y(t)),  [1.10]

where the function f : R^D → R^D is regular enough to apply theorem 1.1. Suppose that there exists an equilibrium point ȳ, that is, a solution of f(ȳ) = 0. Of course, the constant function t → y(t) = ȳ is a solution of [1.10] with the initial condition y(tInit) = ȳ. We can now study the stability of this particular solution: if we choose an initial value for [1.10] that is close to ȳ, does the corresponding solution remain close to ȳ? Does it tend to ȳ when t → ∞? The answer lies in linearizing equation [1.10]: we write y(t) = ȳ + ỹ(t), where the perturbation ỹ is assumed to be "small". We thus have

(d/dt) ỹ(t) = f(ȳ + ỹ(t)) = 0 + ∇f(ȳ) ỹ(t) + R(t),

where ∇f(ȳ) is the Jacobian matrix of f at ȳ, whose coefficients are ∂_{y_j} f_i(ȳ), and the remainder term R(t) satisfies |R(t)|/|ỹ(t)| → 0 when |ỹ(t)| → 0. Therefore, locally, within a neighborhood of the equilibrium point ȳ, the dynamics is described by the behavior of solutions of the linear system

(d/dt) z(t) = Az(t),  A = ∇f(ȳ).

If all the eigenvalues of A have a strictly negative real part, we have lim_{t→+∞} |z(t)| = 0 and the equilibrium point ȳ is said to be locally stable. If there are eigenvalues with a strictly positive real part, there are unstable trajectories that depart from the equilibrium point.

1.1.4. Higher-order differential equations

Differential equations of order n > 1 can be converted into first-order equations of the form [1.1]. Indeed, if we have

u^{(n)}(t) = Φ(t, u(t), u′(t), ..., u^{(n−1)}(t)),

and are given (u, u′, ..., u^{(n−1)})(t = tInit), then we set y(t) = (u(t), u′(t), ..., u^{(n−1)}(t)) = (y₀, ..., y_{n−1}) ∈ (R^D)^n, which satisfies the equation

(d/dt) y(t) = (u′(t), u″(t), ..., u^{(n)}(t)) = (y₁(t), y₂(t), ..., y_{n−1}(t), Φ(t, y(t))) = f(t, y(t)),

on which we can apply the Picard–Lindelöf theorem, with the function f defined by

f : (t, y) = (t, y₀, ..., y_{n−1}) ∈ R × (R^D)^n −→ (y₁, y₂, ..., y_{n−1}, Φ(t, y)).

In block form, f(t, y) is the sum of the linear map given by the n × n block matrix with identity blocks I on its superdiagonal and zero blocks elsewhere, applied to y, and of the vector (0, ..., 0, Φ(t, y)).
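The reduction to first order can be exercised on a concrete equation; the pendulum u″ = −sin(u) is an illustrative choice of Φ (not taken from the text), integrated with forward Euler just to show the reformulation in action.

```python
import numpy as np

def first_order_rhs(t, y):
    # y = (y0, y1) = (u, u'); the system is y' = (y1, Phi(t, y0, y1))
    u, up = y
    return np.array([up, -np.sin(u)])  # (u', u'') with Phi = -sin(u)

# Forward Euler sweep: u(0) = 0.1, u'(0) = 0, step dt = 1e-3 up to t = 1.
y = np.array([0.1, 0.0])
t, dt = 0.0, 1e-3
for _ in range(1000):
    y = y + dt * first_order_rhs(t, y)
    t += dt

# For small amplitude, u(t) is close to 0.1*cos(t); compare at t = 1.
err = abs(y[0] - 0.1 * np.cos(1.0))
print(y[0], err)
```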

1.1.5. Inverse function theorem and implicit function theorem

Let us finish this review of differential calculus with some analysis tools that will prove to be useful in studying the behavior of the numerical schemes that allow us to find approximations for the solutions of [1.1].

THEOREM 1.8 (Inverse Function Theorem).– Let Ω ⊂ R^D be an open set and f : Ω → R^D a function of class C¹. Let x̄ ∈ Ω be such that the Jacobian matrix ∇f(x̄) ∈ M_D(R) is invertible. There exist an open neighborhood V ⊂ R^D of x̄ and an open neighborhood W ⊂ R^D of f(x̄) such that
– f|_V is a bijection from V onto W;
– the inverse mapping x → f^{−1}(x) is continuous on W;
– f^{−1} is, in fact, a function of class C¹ on W and ∇(f^{−1})(f(x)) = (∇f(x))^{−1}.

PROOF.– Recall that, for f : x = (x₁, ..., x_D) ∈ R^D → f(x) = (f₁(x), ..., f_D(x)), the Jacobian matrix of f evaluated at the point x, denoted ∇f(x) ∈ M_D(R), is the matrix with entries (∇f(x))_{ij} = ∂_{x_j} f_i(x), for i, j ∈ {1, ..., D}. This matrix identifies the derivative of f at the point x as a linear mapping on R^D, because

f(x + h) = f(x) + ∇f(x) h + |h| ε(h),  where lim_{|h|→0} ε(h) = 0.


Even if it means replacing f by the mapping x → (∇f(x̄))^{−1} (f(x̄ + x) − f(x̄)), we may assume that x̄ = 0, f(0) = 0 and ∇f(0) = I. Since f is a function of class C¹ on the open set Ω, there exists an r > 0 such that the ball B(0, r) is included in Ω and, for every x ∈ B(0, r), ‖∇f(x) − ∇f(0)‖ = ‖∇f(x) − I‖ ≤ 1/2, where ‖A‖ = sup{ |Ax|, x ∈ R^D, |x| = 1 } denotes the norm of a matrix A. For x ∈ B(0, r), we can therefore write ∇f(x) = I − u(x), where ‖u(x)‖ ≤ 1/2, so that ∇f(x) is invertible, with (∇f(x))^{−1} = Σ_{n=0}^{∞} u(x)^n, a normally convergent series. We also note that for every x ∈ B(0, r),

‖(∇f(x))^{−1}‖ ≤ Σ_{n=0}^{∞} ‖u(x)‖^n ≤ Σ_{n=0}^{∞} 1/2^n = 2.

The classic proof of the inverse function theorem relies on Banach's fixed-point theorem: we will demonstrate the locally bijective nature of f through a fixed-point argument. We choose a point y in B(0, r/2) and introduce the mapping Φ : x → y + x − f(x). This mapping is C¹ on B(0, r) and satisfies ‖∇Φ(x)‖ = ‖I − ∇f(x)‖ ≤ 1/2 for every x ∈ B(0, r). On B(0, r), it follows that, on the one hand,

|Φ(x₁) − Φ(x₂)| ≤ (1/2) |x₁ − x₂|,

and on the other hand,

|Φ(x)| = |y + x − f(x)| ≤ |y| + |x − f(x)| = |y| + |Φ(0) − Φ(x)| ≤ |y| + |x|/2 ≤ r/2 + r/2 = r.

Therefore, Φ is a contraction of B(0, r) into B(0, r). It has a single fixed point, denoted x★. We have Φ(x★) = x★ = y + x★ − f(x★), so f(x★) = y. We denote by V the intersection with B(0, r) of the reciprocal image of B(0, r/2) through f, that is to say, the set {x ∈ Ω, f(x) ∈ B(0, r/2)} ∩ B(0, r). Since f(0) = 0 and f is continuous, V is an open set containing 0, and the mapping f is a bijection from V onto W = B(0, r/2).

We define Ψ : x → x − f(x). For all x₁, x₂ ∈ B(0, r), we have

|x₁ − x₂| = |Ψ(x₁) + f(x₁) − Ψ(x₂) − f(x₂)| ≤ |Ψ(x₁) − Ψ(x₂)| + |f(x₁) − f(x₂)| ≤ |x₁ − x₂|/2 + |f(x₁) − f(x₂)|,

because Ψ corresponds to the function Φ previously studied, with y = 0. It follows that |x₁ − x₂| ≤ 2|f(x₁) − f(x₂)| and therefore |f^{−1}(x₁) − f^{−1}(x₂)| ≤


2|x₁ − x₂|, which proves that f^{−1} is continuous. We now identify the derivative of f^{−1}. Let x ∈ V and y = f(x) ∈ W. Let z ∈ R^D be such that y + z ∈ W. We define h = f^{−1}(y + z) − f^{−1}(y) = f^{−1}(f(x) + z) − x, which means that f(x + h) = f(x) + z. Note that |h| ≤ 2|z|. We have

|f^{−1}(y + z) − f^{−1}(y) − (∇f(x))^{−1} z| = |h − (∇f(x))^{−1} (f(x + h) − f(x))| = |−(∇f(x))^{−1} (f(x + h) − f(x) − ∇f(x) h)| ≤ 2 |f(x + h) − f(x) − ∇f(x) h| ≤ 2 |h| |ε(h)|,

where lim_{|h|→0} ε(h) = 0. We rewrite the remainder term as a function of the increment z by writing η(z) = ε(f^{−1}(y + z) − f^{−1}(y)) = ε(h). By the continuity of f^{−1}, we still have lim_{|z|→0} η(z) = 0. The inequality above thus becomes

|f^{−1}(y + z) − f^{−1}(y) − (∇f(x))^{−1} z| ≤ 4 |z| |η(z)|,

which proves that ∇f^{−1}(y) = (∇f(x))^{−1} = (∇f(f^{−1}(y)))^{−1}. (In dimension D = 1, the familiar formula (f^{−1})′(y) = 1/f′(f^{−1}(y)) is easily recognized.) □

NOTE.– It is possible to relax the regularity assumption by assuming that f is merely differentiable, with ∇f(x) invertible for all x ∈ Ω, irrespective of the continuity of its derivative. In the C¹ case, Banach's fixed-point theorem directly guaranteed the existence and uniqueness of the solution z of the equation f(z) = y, for a fixed y; the continuity of x → ∇f(x) stands as the key argument in constructing a suitable contraction. In fact, the surjective nature of f is not problematic. Indeed, suppose that f(0) = 0 and ∇f(0) = I. Then, by the definition of differentiability, f(x) − x = f(x) − f(0) − ∇f(0) x = |x| ε(x), with lim_{|x|→0} ε(x) = 0. So, there exists an r > 0 such that for every |x| ≤ r we have |ε(x)| ≤ 1/2 and therefore |f(x) − x| ≤ r/2. It follows that for every |y| ≤ r/2, the function Φ : x → y + x − f(x) satisfies |Φ(x)| ≤ |y| + |x − f(x)| ≤ r for every x ∈ B(0, r). Therefore, Φ is a continuous function that maps the ball B(0, r) into itself. By Brouwer's theorem (see Appendix 4, theorem A4.1), Φ has a fixed point x★ in B(0, r), which is a solution of f(x★) = y.

Demonstrating that f is also injective is the more subtle point. In dimension D = 1, the conclusion follows from a simple application of Rolle's theorem. We introduce the differentiable function ψ : x → ψ(x) = |f(x) − y|². If ψ(x₁) = ψ(x₂) = 0, then there exists a u ∈ ]x₁, x₂[ such that ψ is maximal at u and ψ′(u) = 0 = 2(f(u) − y) f′(u), which contradicts the assumption that f′ does not vanish. This approach can be generalized to every dimension [SAI 02].
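The contraction Φ : x → y + x − f(x) from the proof can be used directly as a numerical inversion scheme. The scalar f below is an illustrative assumption (not from the text), chosen so that |1 − f′(x)| ≤ 1/4 and Φ is indeed a contraction.

```python
import numpy as np

def f(x):
    # f'(x) = 1 + 0.25*cos(x), so |1 - f'(x)| <= 1/4 < 1/2
    return x + 0.25 * np.sin(x)

def invert(y, n_iter=50):
    # Iterate Phi(x) = y + x - f(x); the fixed point satisfies f(x) = y.
    x = y  # initial guess
    for _ in range(n_iter):
        x = y + x - f(x)
    return x

x_star = invert(0.7)
residual = abs(f(x_star) - 0.7)
print(x_star, residual)  # residual ~ 0: x_star approximates f^{-1}(0.7)
```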


COROLLARY 1.4 (Implicit Function Theorem).– Let Ω be an open set of R^P × R^Q. Let f : (x, y) ∈ Ω → f(x, y) ∈ R^Q be a C¹ function. Let J(x, y) denote the matrix of partial derivatives of f with respect to the variable y ∈ R^Q evaluated at (x, y) ∈ Ω (with components ∂_{y_j} f_i(x, y), for i, j ∈ {1, ..., Q}). If J(x̄, ȳ) is invertible, then there exist
– an open set U ⊂ R^P containing x̄, an open set V ⊂ R^Q containing f(x̄, ȳ), an open set W ⊂ Ω ⊂ R^P × R^Q containing (x̄, ȳ),
– and a mapping ϕ : U × V → R^Q of class C¹, such that for every x ∈ U, z ∈ V, ϕ(x, z) is the only solution y of the equation f(x, y) = z with (x, y) ∈ W. In particular, for all x ∈ U, z ∈ V, we have f(x, ϕ(x, z)) = z.

PROOF.– We apply the inverse function theorem to the mapping defined by

Φ : U × V −→ R^P × R^Q,  (x, y) −→ (x, f(x, y)).

This mapping is indeed of class C¹ and its derivative, which is a continuous linear mapping of R^P × R^Q onto itself, is associated with the block matrix

∇Φ(x, y) = ( I 0 ; J̃(x, y) J(x, y) ),

where I is the identity matrix in R^P and J̃(x, y) designates the matrix of f's partial derivatives with respect to the variable x ∈ R^P evaluated at (x, y) ∈ Ω, which has Q rows and P columns. This matrix is invertible at (x̄, ȳ), and its inverse can be expressed in the following form:

(h, k) −→ ( h, J(x̄, ȳ)^{−1} (k − J̃(x̄, ȳ) h) ).

The inverse function theorem guarantees the existence of an open set W containing (x̄, ȳ) as well as an open set W̃ containing (x̄, f(x̄, ȳ)), such that Φ is a class C¹ diffeomorphism from W onto W̃. Since the first component of Φ(x, y) is simply x, the inverse mapping is of the form Φ^{−1}(x, z) = (x, ϕ(x, z)). □

COROLLARY 1.5 (Constrained optimization, Lagrange Multipliers).– Let Ω be an open set of R^D. Let f and g₁, ..., g_Q be C¹ functions defined on Ω with values in R. We define the set

C = { x ∈ Ω, g₁(x) = ... = g_Q(x) = 0 }.


Suppose that f has a local extremum on C at x̄ ∈ C and that the family (∇g₁(x̄), ..., ∇g_Q(x̄)) is linearly independent in R^D (in particular, Q ≤ D). Then, there exist real numbers λ₁, ..., λ_Q, called Lagrange multipliers, such that

∇f(x̄) − Σ_{k=1}^{Q} λ_k ∇g_k(x̄) = 0.

PROOF.– The assumption about the g_k's can be reformulated by saying that the matrix with D columns and Q rows
\[
\begin{pmatrix}
\partial_{x_1} g_1(\bar{x}) & \partial_{x_2} g_1(\bar{x}) & \dots & \partial_{x_D} g_1(\bar{x}) \\
\partial_{x_1} g_2(\bar{x}) & \partial_{x_2} g_2(\bar{x}) & \dots & \partial_{x_D} g_2(\bar{x}) \\
\vdots & \vdots & & \vdots \\
\partial_{x_1} g_Q(\bar{x}) & \partial_{x_2} g_Q(\bar{x}) & \dots & \partial_{x_D} g_Q(\bar{x})
\end{pmatrix}
\]
is of rank Q. We can therefore extract an invertible Q × Q submatrix. We let P = D − Q ≥ 0. Even though it may involve renumbering the variables, we may suppose that
\[
\Gamma = \begin{pmatrix}
\partial_{x_{P+1}} g_1(\bar{x}) & \partial_{x_{P+2}} g_1(\bar{x}) & \dots & \partial_{x_{P+Q}} g_1(\bar{x}) \\
\vdots & \vdots & & \vdots \\
\partial_{x_{P+1}} g_Q(\bar{x}) & \partial_{x_{P+2}} g_Q(\bar{x}) & \dots & \partial_{x_{P+Q}} g_Q(\bar{x})
\end{pmatrix}
\]
is invertible, and we decompose x ∈ R^D in the form (x̂, y) ∈ R^P × R^Q (writing, in particular, x̄ = (x̄̂, ȳ)). By theorem 1.4 applied to the function G : (x̂, y) ∈ R^P × R^Q → (g_1(x̂, y), ..., g_Q(x̂, y)) ∈ R^Q, there exists an open set U of R^P containing x̄̂ and a mapping ϕ : x̂ ∈ U → ϕ(x̂) = (ϕ_1(x̂), ..., ϕ_Q(x̂)) ∈ R^Q such that g_1(x̂, y) = ... = g_Q(x̂, y) = 0, with x̂ ∈ U, if and only if y = ϕ(x̂). Moreover, we have ȳ = ϕ(x̄̂). Theorem 1.4 applies because the matrix Γ corresponds to the matrix of partial derivatives of (x̂, y) → (g_1(x̂, y), ..., g_Q(x̂, y)) with respect to

26

Mathematics for Modeling and Scientific Computing

the variables y ∈ R^Q. We define the following function: H : x̂ ∈ U → f(x̂, ϕ(x̂)) ∈ R. Since (x̂, ϕ(x̂)) ∈ C, by assumption, the function H, which is C^1 on the open set U, has an extremum at x̄̂. It follows that
\[
\nabla_{\hat{x}} H(\bar{\hat{x}}) = 0 = \nabla_{\hat{x}} f(\bar{\hat{x}}, \bar{y}) + \Phi(\bar{\hat{x}})^{\mathsf{T}}\, \nabla_y f(\bar{\hat{x}}, \bar{y}),
\]
where Φ is the matrix-valued function^4 defined by
\[
\hat{x} \longmapsto \Phi(\hat{x}) = \begin{pmatrix}
\partial_{\hat{x}_1}\varphi_1(\hat{x}) & \partial_{\hat{x}_2}\varphi_1(\hat{x}) & \dots & \partial_{\hat{x}_P}\varphi_1(\hat{x}) \\
\vdots & \vdots & & \vdots \\
\partial_{\hat{x}_1}\varphi_Q(\hat{x}) & \partial_{\hat{x}_2}\varphi_Q(\hat{x}) & \dots & \partial_{\hat{x}_P}\varphi_Q(\hat{x})
\end{pmatrix}.
\]
Now, for all k ∈ {1, ..., Q} and x̂ ∈ U, we have g_k(x̂, ϕ(x̂)) = 0, so the derivatives with respect to x̂ satisfy
\[
\nabla_{\hat{x}}\big( g_k(\hat{x}, \varphi(\hat{x})) \big) = 0 = (\nabla_{\hat{x}} g_k)(\hat{x}, \varphi(\hat{x})) + \Phi(\hat{x})^{\mathsf{T}} (\nabla_y g_k)(\hat{x}, \varphi(\hat{x})).
\]
We can deduce that the matrix with Q + 1 rows and D columns,
\[
\begin{pmatrix}
\partial_{\hat{x}_1} f(\bar{x}) & \dots & \partial_{\hat{x}_P} f(\bar{x}) & \partial_{y_1} f(\bar{x}) & \dots & \partial_{y_Q} f(\bar{x}) \\
\partial_{\hat{x}_1} g_1(\bar{x}) & \dots & \partial_{\hat{x}_P} g_1(\bar{x}) & \partial_{y_1} g_1(\bar{x}) & \dots & \partial_{y_Q} g_1(\bar{x}) \\
\vdots & & \vdots & \vdots & & \vdots \\
\partial_{\hat{x}_1} g_Q(\bar{x}) & \dots & \partial_{\hat{x}_P} g_Q(\bar{x}) & \partial_{y_1} g_Q(\bar{x}) & \dots & \partial_{y_Q} g_Q(\bar{x})
\end{pmatrix}
\]
has rank at most Q, because these relations show that the first P columns are linear combinations of the last Q. We conclude^5 that the Q + 1 rows of this matrix constitute a linearly dependent family in R^D. Therefore, there exists (μ_0, μ_1, ..., μ_Q) ∈ R^{Q+1} \ {0}, such that μ_0 ∇f(x̄) + Σ_{k=1}^{Q} μ_k ∇g_k(x̄) = 0. Since (∇g_1(x̄), ..., ∇g_Q(x̄)) is linearly independent in R^D, we necessarily have μ_0 ≠ 0. Finally, we let λ_k = −μ_k/μ_0 to obtain the desired result. □

4 It is important to properly understand this formula and to be able to pass from the matrix formulation to its formulation in coordinates, ∂_{x̂_j} H(x̄̂) = 0 = (∂_{x̂_j} f)(x̄̂, ϕ(x̄̂)) + Σ_{ℓ=1}^{Q} (∂_{y_ℓ} f)(x̄̂, ϕ(x̄̂)) ∂_{x̂_j} ϕ_ℓ(x̄̂), and vice versa.
5 For example, by emphasizing the fact that a matrix and its transpose have the same rank.


1.2. Numerical simulation of ordinary differential equations, Euler schemes, notions of convergence, consistence and stability

1.2.1. Introduction

Before beginning, it is worthwhile to recall what is at stake in the numerical simulation of [1.1]. Even though theorem 1.1 guarantees the existence and uniqueness of solutions, and we have some qualitative information about the solution, as well as some estimates, in general no explicit formula relating the time t and the solution y of [1.1] is available. The main problem lies in finding a sequence of operations that can be efficiently performed by a computer and that produces an approximation of the solution. This approximation procedure is of an essentially discrete nature, and numerical analysis precisely involves the study of the relation between the discrete and continuous descriptions of the solution. More specifically, supposing for the sake of simplicity that t_Init = 0, we seek to approximate the solution t → y(t) of [1.1] on a fixed time interval [0, T]. We introduce a subdivision of the interval [0, T]:
\[
0 = t_0 < t_1 < \dots < t_N = T
\]
for a given integer N ≠ 0. For n ∈ {1, ..., N}, we write Δt_n = t_n − t_{n−1}, and
\[
\Delta t = \max\big\{ \Delta t_n,\ n \in \{1, ..., N\} \big\} > 0.
\]
Next, we limit the discussion to the case of a uniform subdivision, where
\[
\Delta t = t_n - t_{n-1} > 0 \quad \text{for all } n \in \{1, ..., N\}.
\]

This seemingly obvious remark is fundamental: the final time T, the step Δt and the number N of intervals that form the subdivision are related by N × Δt = T. In what follows, we write N = N_Δt to clearly denote this dependence. Given the initial value y_0 = y_Init ∈ R^D in the state space, a numerical scheme defines states y_1, ..., y_{N_Δt} that are explicitly calculable and that are supposed to be approximations of the solution at the discrete times: y(t_1), ..., y(t_{N_Δt}). It is therefore necessary to distinguish y(t_n), the evaluation at time t = t_n of the function t → y(t) solving [1.1], from y_n, the nth iterate of a sequence constructed to approximate the solution.


The construction of the approximation is based on the integrated form of the equation
\[
y(t_{n+1}) = y(t_n) + \int_{t_n}^{t_{n+1}} f\big(s, y(s)\big)\, ds,
\]

which is satisfied for n ∈ {0, ..., N − 1} and is, in fact, used to prove theorem 1.1. The idea is to use formulas from numerical integration, some aspects of which are reviewed in Appendix 2. The simplest approach consists of the "left rectangle rule". We thus obtain the recurrence formula
\[
y_{n+1} = y_n + \Delta t\, f(t_n, y_n). \tag{1.11}
\]

This is known as the explicit Euler method. If the initial value y_0 = y(t_0) = y_Init is known, this formula effectively defines the elements y_1, ..., y_{N_Δt}. It now becomes necessary to determine its relation to the solution of [1.1] and to know whether it indeed provides an approximation of y(t_1), ..., y(T). It is also possible to consider a construction based on the right rectangle rule, which gives
\[
y_{n+1} = y_n + \Delta t\, f(t_{n+1}, y_{n+1}). \tag{1.12}
\]

This is known as the implicit Euler method. However, this method has one difficulty: the formula does not allow us to directly calculate y_{n+1} simply by knowing the previous iterate y_n. In order to determine y_{n+1}, it is necessary to solve an equation. More specifically, y_{n+1} is a zero of the mapping
\[
z \longmapsto \Phi(z) = z - y_n - \Delta t\, f(t_{n+1}, z).
\]
Therefore, in order to obtain a numerical value for y_{n+1}, it is necessary to operate a root-finding method for the mapping z → Φ(z) (and, in fact, in that case we only use an approximation of y_{n+1}, the solution of [1.12]). Nevertheless, we will see that, despite its undeniably greater complexity, the method [1.12] also has advantages with respect to method [1.11] in some cases. One variant of these methods is the predictor–corrector method, which is defined by the following relations: given y_n, we write
\[
y^\star = y_n + \frac{\Delta t}{2} f(t_n, y_n), \qquad
y_{n+1} = y_n + \Delta t\, f(t_n + \Delta t/2, y^\star) = y_n + \Delta t\, f\Big(t_n + \Delta t/2,\ y_n + \frac{\Delta t}{2} f(t_n, y_n)\Big).
\]


This corresponds to an approximation of \int_{t_n}^{t_n+\Delta t} f(s, y(s))\, ds using a "midpoint rule". However, since we do not know the value of y(t_n + Δt/2), we replace it with its approximation y^⋆ obtained by using the explicit Euler method.

Before going into the analysis of these methods, we highlight two important points:
– As mentioned above, numerical approximations of solutions to differential equations, and the development of mathematical methods adapted to these questions, are justified by the fact that explicit expressions of these solutions in terms of the time variable are not generally known. Nevertheless, it is always useful to evaluate the performance of numerical methods by testing them on problems whose explicit solutions are known. This experimental confirmation ensures the robustness of a given numerical analysis.
– In the following sections, we will attempt to estimate the errors y_n − y(t_n). These estimates involve the solution and possibly some of its derivatives. These calculations assume that the solution is "fairly regular", possibly even beyond the simple C^1 regularity provided by theorem 1.1.

1.2.2. Fundamental notions for the analysis of numerical ODE methods

In order to analyze the behavior of these numerical methods and justify that they indeed define an appropriate approximation of the solution to [1.1], we will now introduce a certain number of definitions that allow us to formalize what we may refer to as the "properties" of a numerical approximation. Recall that the working interval [0, T] is subdivided into intervals with
\[
\Delta t = \max\big\{ t_n - t_{n-1},\ n \in \{1, ..., N_{\Delta t} = T/\Delta t\} \big\}.
\]
Given initial data y_0 = y_Init, a numerical scheme defines the points y_1, ..., y_{N_Δt} with a relation of the form
\[
y_{n+1} = y_n + (t_{n+1} - t_n)\, \Phi(t_n, \Delta t, y_n) = y_n + \Delta t\, \Phi(t_n, \Delta t, y_n) \tag{1.13}
\]
for n ∈ {0, ..., N_Δt − 1}. We assume throughout the discussion that the function Φ depends on its arguments in a regular manner. With this notation, for the explicit Euler scheme [1.11], we simply have Φ(t, h, y) = f(t, y). For the implicit Euler scheme [1.12], we have Φ(t, h, y) = f(t + h, φ(t, h, y)), where φ(t, h, y) is the solution z of the equation z − y − h f(t + h, z) = 0, which, for the moment, we assume exists and is unique. It will be easy to adapt the discussion to schemes with variable time steps.

The principal notion is as follows: a scheme is said to be convergent whenever the elements of the sequence y_0, y_1, ..., y_N "resemble" the evaluations of the solution y(0), y(t_1), ..., y(t_N) = y(T). More formally, we can state the following.
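The one-step form [1.13] is straightforward to implement. The sketch below (our own Python illustration, not from the book; the function names are ours) realizes the generic update y_{n+1} = y_n + Δt Φ(t_n, Δt, y_n) for the explicit Euler, implicit Euler and predictor–corrector schemes; the implicit step is solved by the fixed-point iteration discussed above, which contracts when h L < 1.

```python
import numpy as np

def integrate(phi, f, y0, T, N):
    """Generic one-step scheme y_{n+1} = y_n + dt * phi(t_n, dt, y_n), cf. [1.13]."""
    dt = T / N
    y = np.atleast_1d(np.asarray(y0, dtype=float))
    out = [y]
    for n in range(N):
        y = y + dt * phi(n * dt, dt, y, f)
        out.append(y)
    return np.array(out)

def phi_explicit(t, h, y, f):
    # explicit Euler [1.11]: Phi(t, h, y) = f(t, y)
    return f(t, y)

def phi_implicit(t, h, y, f, iters=50):
    # implicit Euler [1.12]: solve z = y + h f(t+h, z) by fixed-point iteration
    z = y.copy()
    for _ in range(iters):
        z = y + h * f(t + h, z)
    return f(t + h, z)

def phi_midpoint(t, h, y, f):
    # predictor-corrector: predict y* by half an explicit step, evaluate at midpoint
    ystar = y + 0.5 * h * f(t, y)
    return f(t + 0.5 * h, ystar)

# test problem y' = -y, y(0) = 1, exact solution e^{-t}
f = lambda t, y: -y
approx = {phi.__name__: integrate(phi, f, 1.0, 1.0, 1000)[-1, 0]
          for phi in (phi_explicit, phi_implicit, phi_midpoint)}
```

Running the three schemes on y' = −y with Δt = 10^{-3} already shows the midpoint variant being markedly more accurate than both Euler schemes, consistent with its higher order.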


DEFINITION 1.2.– We say that the scheme [1.13] is convergent if, given the solution t → y(t) of [1.1] associated with the initial value y_Init = y_0, we have
\[
\lim_{\Delta t \to 0} \Big( \sup_{n \in \{0, ..., N_{\Delta t}\}} |y_n - y(t_n)| \Big) = 0.
\]

In order to show that a scheme is convergent, it is necessary to define new notions that characterize the "properties" of a numerical scheme. The notion of consistence evaluates the coherence of a scheme with the equation: if we replace the approximation y_n with the solution evaluated at the point t_n in [1.13], an error emerges, but it must tend to 0 when Δt → 0, and sometimes it can be calculated precisely as a function of the step Δt. The notion of stability controls the accumulation of errors at each step in time, since, more specifically, the different y(t_n)'s do not quite satisfy [1.13].

DEFINITION 1.3.– We say that the scheme [1.13] is consistent if, given the solution t → y(t) of [1.1] associated with an initial value y_Init, and
\[
\epsilon_n = \frac{y(t_{n+1}) - y(t_n)}{\Delta t} - \Phi\big(t_n, \Delta t, y(t_n)\big),
\]
we have
\[
\lim_{\Delta t \to 0} \Big( \Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n| \Big) = 0.
\]
We say the scheme [1.13] is (consistent) of order p if there exists C > 0, independent of Δt, such that we have
\[
\Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n| \leq C\, \Delta t^p.
\]

NOTE.– If we introduce the piecewise constant function
\[
\epsilon_{\Delta t} : t \in [0, T] \longmapsto \sum_{n=0}^{N_{\Delta t}-1} \epsilon_n\, \mathbf{1}_{n\Delta t \leq t < (n+1)\Delta t}(t),
\]
the consistency condition expresses that ε_Δt tends to 0 in L^1([0, T]) as Δt → 0, since Δt Σ_n |ε_n| = ∫_0^T |ε_Δt(t)| dt. In practice, we often establish a pointwise estimate: there exist C > 0 and Δt_⋆ > 0, such that for all 0 < Δt ≤ Δt_⋆ and every n ∈ N, we have |ε_n| ≤ C Δt^p. An estimate of this kind implies that the scheme is of order p: the global consistency error, on the time interval [0, T] with T = N_Δt Δt, is controlled by
\[
\Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n| \leq C\, N_{\Delta t}\, \Delta t\, (\Delta t)^p = C\, T\, (\Delta t)^p.
\]
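The global consistency error Δt Σ|ε_n| can be measured directly on a problem whose solution is known. The short check below (our own illustration in Python) evaluates it for the explicit Euler scheme on y' = y, y(0) = 1, whose solution is e^t; halving Δt should roughly halve the error, consistent with order p = 1.

```python
import math

def consistency_error(dt, T=1.0):
    """Global consistency error dt * sum_n |eps_n| for explicit Euler on y' = y.

    Here Phi(t, dt, y) = f(t, y) = y and the exact solution is y(t) = e^t,
    so eps_n = (y(t_{n+1}) - y(t_n))/dt - y(t_n)."""
    N = round(T / dt)
    total = 0.0
    for n in range(N):
        tn = n * dt
        eps = (math.exp(tn + dt) - math.exp(tn)) / dt - math.exp(tn)
        total += abs(eps)
    return dt * total

e1 = consistency_error(0.01)
e2 = consistency_error(0.005)
```

The observed ratio e1/e2 is close to 2, i.e. the error behaves like C Δt with C depending on y'' and T, as the estimate above predicts.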

Importantly, in these estimates, the constant C, in general, depends on the L^∞ norms of the derivatives of the solution y on [0, T]; therefore, it is also a function of the final time T. Obtaining estimates of this kind also requires that the solution be regular beyond the mere class C^1 provided by the Picard–Lindelöf theorem. In fact, there exists a practical criterion for determining whether a method is consistent.

PROPOSITION 1.2.– If Φ(t, 0, y) = f(t, y) for all t and y, then [1.13] is consistent.

PROOF.– Recall that the step Δt > 0 and the number of intervals required to reach the final time T are related by T = Δt N_Δt. First, note that
\[
\Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n|
= \Delta t \sum_{n=0}^{N_{\Delta t}-1} \Big| \frac{y(t_n + \Delta t) - y(t_n)}{\Delta t} - \Phi(t_n, \Delta t, y(t_n)) \Big|
= \Delta t \sum_{n=0}^{N_{\Delta t}-1} \Big| \int_0^1 \Big( f\big(t_n + \tau\Delta t,\, y(t_n + \tau\Delta t)\big) - \Phi(t_n, \Delta t, y(t_n)) \Big)\, d\tau \Big|.
\]
For t ∈ [0, T], the solution t → y(t) remains within a compact domain K ⊂ R^D. The functions t → y(t), (t, y) → f(t, y) and (t, h, y) → Φ(t, h, y) are continuous, and therefore uniformly continuous on the compact domains [0, T], [0, T] × K and [0, T] × [0, 1] × K, respectively. Let α > 0. Then, there exists δ(α) > 0, such that for all 0 < h < δ(α), t ∈ [0, T] and τ ∈ [0, 1], we have
\[
\big| f(t + h\tau, y(t + h\tau)) - \Phi(t, h, y(t)) \big|
\leq \big| f(t + h\tau, y(t + h\tau)) - f(t, y(t)) \big| + \big| \Phi(t, 0, y(t)) - \Phi(t, h, y(t)) \big| \leq \alpha.
\]


It follows that
\[
\Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n| \leq \Delta t \sum_{n=0}^{N_{\Delta t}-1} \alpha = \alpha\, \Delta t\, N_{\Delta t} \leq T\alpha
\]
for all 0 < Δt < δ(α), because N_Δt = T/Δt. □

DEFINITION 1.4.– The scheme [1.13] is said to be stable if, when considering y_1, ..., y_{N_Δt} defined by [1.13] and the elements z_1, ..., z_{N_Δt} defined by z_{n+1} = z_n + Δt Φ(t_n, Δt, z_n) + η_n, with given z_0 and η_0, ..., η_{N_Δt−1}, there exists M > 0, independent of Δt, such that
\[
|y_n - z_n| \leq M \Big( |y_{Init} - z_0| + \sum_{j=0}^{n-1} |\eta_j| \Big).
\]

The combination of consistency and stability properties allows us to establish that a scheme is convergent.

THEOREM 1.9.– A scheme of the form [1.13] that is stable and consistent is convergent.

PROOF.– We use the stability property with z_n = y(t_n), η_n = Δt ε_n, so that
\[
\max_{n \in \{0, ..., N_{\Delta t}\}} |y_n - y(t_n)| \leq M \Big( |y_0 - y(0)| + \Delta t \sum_{n=0}^{N_{\Delta t}-1} |\epsilon_n| \Big)
\]
is satisfied for some M > 0. Of course, we begin the scheme with the initial value prescribed for the ODE: y_0 = y(0). The consistency of the scheme then ensures that
\[
\max_{n \in \{0, ..., N_{\Delta t}\}} |y_n - y(t_n)| \xrightarrow[\Delta t \to 0]{} 0.
\]
If the scheme is of order p, then
\[
\max_{n \in \{0, ..., N_{\Delta t}\}} |y_n - y(t_n)| \leq M\, C\, T\, \Delta t^p.
\]
(Here, it is important to recall that the estimate involves the final time T, even by means of the constant C; in particular, note that for a given accuracy, the greater the final time T, the smaller the step Δt must be.) □
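The conclusion of this theorem can be observed numerically: on a problem with a known solution, the maximal error max_n |y_n − y(t_n)| behaves like Δt^p. The sketch below (our own illustration) estimates the observed order for the explicit Euler scheme on y' = −y by comparing two step sizes.

```python
import math

def euler_max_error(N, T=1.0):
    """max_n |y_n - y(t_n)| for explicit Euler on y' = -y, y(0) = 1."""
    dt, y, err = T / N, 1.0, 0.0
    for n in range(N):
        y = y + dt * (-y)                    # explicit Euler step
        err = max(err, abs(y - math.exp(-(n + 1) * dt)))
    return err

# observed order: err(dt) ~ C dt^p  =>  p ~ log2(err(dt) / err(dt/2))
p = math.log2(euler_max_error(100) / euler_max_error(200))
```

Doubling the number of steps halves the maximal error, so the estimated exponent p is close to 1, in agreement with lemma 1.2 below.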


1.2.3. Analysis of explicit and implicit Euler schemes

Consistency analysis involves successive derivatives of the solution t → y(t) of [1.1] and uses Taylor expansions taken up to a convenient order. Recall that for a right-hand side f that is Lipschitz continuous in the state variable, the Picard–Lindelöf theorem only ensures that the maximal solution has C^1 regularity. Therefore, consistency analysis should be accompanied by additional regularity assumptions.

THEOREM 1.10 (Taylor's Formula With Integral Remainder).– Let t → y(t) be a C^n function, with n ≥ 1. Then, for every k ∈ {1, ..., n}, we have
\[
y(t + h) = \sum_{\ell=0}^{k-1} y^{(\ell)}(t)\, \frac{h^\ell}{\ell !} + \int_0^1 y^{(k)}(t + sh)\, h^k\, \frac{(1-s)^{k-1}}{(k-1)!}\, ds.
\]

PROOF.– This theorem can be proved very easily using induction and integration by parts. Indeed, for k = 1, it is true that
\[
y(t + h) - y(t) = \int_0^1 \frac{d}{ds}\big( y(t + sh) \big)\, ds = \int_0^1 y'(t + sh)\, h\, ds.
\]
Suppose that the relation holds for k ∈ {1, ..., n − 1}. We integrate the remainder term by parts:
\[
\int_0^1 y^{(k)}(t + sh)\, h^k\, \frac{(1-s)^{k-1}}{(k-1)!}\, ds
= \Big[ -\,y^{(k)}(t + sh)\, h^k\, \frac{(1-s)^{k}}{k!} \Big]_{s=0}^{s=1}
+ \int_0^1 y^{(k+1)}(t + sh)\, h^{k+1}\, \frac{(1-s)^k}{k!}\, ds
= y^{(k)}(t)\, \frac{h^k}{k!} + \int_0^1 y^{(k+1)}(t + sh)\, h^{k+1}\, \frac{(1-s)^k}{k!}\, ds.
\]
This enables us to extend the formula to k + 1. □

LEMMA 1.2.– Suppose that the solution y of [1.1] is of class C^2 on [0, T]. The explicit and implicit Euler schemes are consistent of order 1.

PROOF.– For the explicit Euler scheme, using Taylor's formula, we obtain
\[
\big| y(t_n + \Delta t) - y(t_n) - \Delta t\, f(t_n, y(t_n)) \big|
= \Big| y'(t_n)\Delta t + \int_0^1 y''(t_n + s\Delta t)\, \Delta t^2\, (1 - s)\, ds - \Delta t\, f(t_n, y(t_n)) \Big|
\leq \frac{\|y''\|_\infty}{2}\, \Delta t^2.
\]


An identical manipulation applies for the implicit scheme. □

NOTE.– Consider the case where the differential system is autonomous, with f(y) = −∇Φ(y) for some function Φ : R^D → R. We assume that
– Φ is C^1;
– Φ is α-convex: for every x, y ∈ R^D, we have (∇Φ(x) − ∇Φ(y)) · (x − y) ≥ α|x − y|^2, with α > 0;
– Φ is coercive: lim_{|x|→∞} Φ(x) = +∞.

Under these assumptions, the function Φ has a unique minimum on R^D, attained at a point denoted ȳ. Indeed, the minimum point is unique as a consequence of strict convexity (which itself follows from the fact that Φ is α-convex): if ȳ ≠ y' both satisfy Φ(ȳ) = Φ(y') = m = min{Φ(y), y ∈ R^D}, then, for every 0 < λ < 1, we have m ≤ Φ(λȳ + (1 − λ)y') < λΦ(ȳ) + (1 − λ)Φ(y'), which is absurd because the right-hand side is equal to λm + (1 − λ)m = m. To justify the existence of a minimum, consider a minimizing sequence (y_n)_{n∈N}: lim_{n→∞} Φ(y_n) = inf{Φ(y), y ∈ R^D}. Because Φ is coercive, this sequence is bounded. Extracting a subsequence if necessary, we may assume that lim_{n→∞} y_n = ȳ ∈ R^D. For every 0 < λ ≤ 1 and all x, y ∈ R^D, we have
\[
\frac{\Phi(x + \lambda(y - x)) - \Phi(x)}{\lambda} = \frac{\Phi((1 - \lambda)x + \lambda y) - \Phi(x)}{\lambda} \leq \Phi(y) - \Phi(x).
\]
Using this inequality with x = ȳ, y = y_n, and letting λ tend to 0, we obtain Φ(ȳ) + ∇Φ(ȳ) · (y_n − ȳ) ≤ Φ(y_n), since lim_{λ→0} (Φ(x + λ(y − x)) − Φ(x))/λ = ∇Φ(x) · (y − x). When n → ∞, it follows that Φ(ȳ) ≤ m, and therefore the limit ȳ minimizes Φ. Since the minimum of Φ is attained at ȳ, we have ∇Φ(ȳ) = 0.

The asymptotic behavior of the differential system dy/dt = −∇Φ(y) is given by
\[
\lim_{t \to \infty} y(t) = \bar{y}.
\]
In order to demonstrate this remarkable property, we calculate
\[
\frac{d}{dt} \frac{|y(t) - \bar{y}|^2}{2} = (y(t) - \bar{y}) \cdot \big( -\nabla\Phi(y(t)) \big)
= -\big( \nabla\Phi(y(t)) - \nabla\Phi(\bar{y}) \big) \cdot (y(t) - \bar{y}) \leq -\alpha\, |y(t) - \bar{y}|^2,
\]


which implies the convergence of y(t) towards the equilibrium point ȳ at an exponential rate as time increases. The explicit Euler scheme for this problem corresponds exactly to the gradient descent algorithm with a fixed step Δt, which allows us to minimize the functional Φ (see section 2.7).

LEMMA 1.3.– Suppose that the function f is L-Lipschitz continuous in the state variable. The explicit Euler scheme is stable, and the stability constant is e^{LT}.

PROOF.– We can write
\[
L = \sup_{0 \leq t \leq T,\ y \neq z} \frac{|f(t, y) - f(t, z)|}{|y - z|}.
\]

By reproducing the notation used in definition 1.4, we have
\[
|z_{n+1} - y_{n+1}| = \big| z_n - y_n + \Delta t\, \big( f(t_n, z_n) - f(t_n, y_n) \big) + \eta_n \big|
\leq (1 + L\Delta t)\, |z_n - y_n| + |\eta_n|.
\]
The basic inequality 1 + x ≤ e^x, which is satisfied by every x ≥ 0, gives
\[
|z_{n+1} - y_{n+1}| \leq e^{L\Delta t}|z_n - y_n| + |\eta_n|
\leq e^{L\Delta t}\big( e^{L\Delta t}|z_{n-1} - y_{n-1}| + |\eta_{n-1}| \big) + |\eta_n|
\leq \dots \leq e^{(n+1)L\Delta t}|z_0 - y_0| + \sum_{k=0}^{n} e^{L(n-k)\Delta t}|\eta_k|
\leq e^{LT}\Big( |z_0 - y_0| + \sum_{k=0}^{n} |\eta_k| \Big). \qquad \square
\]

THEOREM 1.11.– The explicit Euler scheme is stable and consistent of order 1. Therefore, it is convergent, and the error between the approximate solution and the exact solution tends to 0 with Δt.

This statement ensures that the explicit Euler scheme actually does the work expected of it: it provides an approximation of the exact solution, which improves as the time step becomes smaller. Nevertheless, if we study the approximations in more detail, we realize that there are practical limitations that can be quite significant, to the point of making it unrealistic to simulate certain problems with this method. Indeed, the stability constant is e^{LT}. When L ≫ 1, in order to obtain a given degree of precision at a given final time T, it becomes necessary to choose a very small time step and therefore to perform a great many operations, which entails a prohibitive computation time. (For example, suppose that L = 10^{43} and that we want a solution on the interval [0, 1] with a precision of 10^{-2}. Since the error is bounded by e^{L} Δt, it is necessary to take Δt = 10^{-2} e^{-10^{43}}!)
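The practical effect of a large Lipschitz constant can be observed on the linear problem y'(t) = −Ly(t), L > 0 (a sketch of our own, not taken from the book): the explicit iterates y_{n+1} = (1 − LΔt)y_n remain bounded only if LΔt ≤ 2, whereas the implicit iterates y_{n+1} = y_n/(1 + LΔt) remain bounded for every Δt > 0.

```python
def euler_final(L, dt, T=1.0, implicit=False):
    """Final explicit or implicit Euler iterate for y' = -L*y, y(0) = 1."""
    y, n_steps = 1.0, round(T / dt)
    for _ in range(n_steps):
        # explicit: multiply by (1 - L*dt); implicit: divide by (1 + L*dt)
        y = y / (1.0 + L * dt) if implicit else (1.0 - L * dt) * y
    return y

L = 100.0
blow_up = euler_final(L, dt=0.05)                  # L*dt = 5 > 2: amplification
ok_expl = euler_final(L, dt=0.005)                 # L*dt = 0.5 < 2: decays
ok_impl = euler_final(L, dt=0.05, implicit=True)   # bounded for any dt > 0
```

With LΔt = 5, the explicit iterates are multiplied by −4 at every step and explode, even though the exact solution e^{−Lt} decays; the implicit scheme with the same step produces a small positive value.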


It is important to be wary of an abuse of language that assimilates these difficulties to a lack of stability: the explicit Euler method is stable. However, the stability estimate is too poor for the method to provide pertinent results under realistic computational conditions. This discussion shows just how subtle the analysis of numerical schemes can be. We will see later that, in some cases, the implicit Euler scheme [1.12] can be more effective in terms of this stability criterion.

We will now see that the "numerical scheme" approach allows us to construct an alternative proof of the Picard–Lindelöf theorem and some of its generalizations. Suppose that

f is a continuous function on an open set U ⊂ R × R^D. [1.14]

Let (t_Init, y_Init) ∈ U. We take T_0 > 0 and r_0 > 0 small enough that the cylinder C_0 = [t_Init, t_Init + T_0] × B(y_Init, r_0) ⊂ U. Since the function f is continuous and C_0 is a compact set, we can set
\[
M_0 = \sup\big\{ |f(t, y)|,\ (t, y) \in C_0 \big\} < \infty.
\]
We let^6 T = min(T_0, r_0/M_0). For a fixed N ∈ N \ {0}, we introduce a subdivision of [t_Init, t_Init + T], with step h = T/N > 0, defined by
\[
t_0 = t_{Init} < t_1 < \dots < t_N = t_{Init} + T, \qquad t_{n+1} = t_n + h, \qquad h = \frac{T}{N}.
\]

Consider the vectors y_0, y_1, ..., y_N defined by y_0 = y_Init and the recurrence defined by the explicit Euler scheme y_{n+1} = y_n + h f(t_n, y_n). We associate these vectors with the sequence of functions defined on [t_Init, t_Init + T] by
\[
z^{(N)}(t) = \sum_{n=0}^{N-1} \big( y_n + (t - t_n)\, f(t_n, y_n) \big)\, \mathbf{1}_{t_n \leq t \leq t_{n+1}}
= \sum_{n=0}^{N-1} \Big( y_n + \frac{t - t_n}{h}\, (y_{n+1} - y_n) \Big)\, \mathbf{1}_{t_n \leq t \leq t_{n+1}}.
\]

6 Note that if t has the homogeneity of a time and y that of a length, then f and M_0 have the homogeneity of a speed, whereas r_0 is homogeneous to a length; therefore, r_0/M_0 is homogeneous to a time.


LEMMA 1.4.– The functions z^(N) are continuous on [t_Init, t_Init + T] and satisfy z^(N)(t) ∈ B(y_Init, r_0) for every t ∈ [t_Init, t_Init + T] and N ∈ N \ {0}.

PROOF.– The functions z^(N) are piecewise P1, with continuous connections. In particular, these functions are continuous and piecewise C^1. An immediate induction shows that for every n ∈ {0, ..., N}, we have |y_n − y_0| ≤ n h M_0. In light of the definition of h, T, r_0 and M_0, this shows, in passing, that (t_n, y_n) ∈ C_0. Therefore, if t_n ≤ t ≤ t_{n+1}, n ∈ {0, ..., N − 1}, then
\[
|z^{(N)}(t) - y_{Init}| \leq |y_n - y_{Init}| + (t - t_n)\,|f(t_n, y_n)| \leq n h M_0 + (t - t_n) M_0 = (t - t_{Init})\, M_0 \leq T M_0 \leq r_0.
\]
It follows that (t, z^{(N)}(t)) ∈ C_0 ⊂ U. □

Moreover, the functions z^(N) are piecewise C^1 and, everywhere other than at the points t_0, t_1, ..., t_N, we have
\[
\frac{d}{dt} z^{(N)}(t) = \sum_{n=0}^{N-1} f(t_n, y_n)\, \mathbf{1}_{t_n < t < t_{n+1}}(t) =: f^{(N)}(t),
\qquad \text{so that} \qquad
z^{(N)}(t) = y_{Init} + \int_{t_{Init}}^{t} f^{(N)}(s)\, ds.
\]
In particular, |z^{(N)}(t) − z^{(N)}(s)| ≤ M_0 |t − s| for all t, s: the sequence (z^{(N)})_{N ∈ N\{0}} is bounded and equicontinuous in C^0([t_Init, t_Init + T]), so that the Arzelà–Ascoli theorem provides a subsequence (z^{(N_ℓ)})_{ℓ∈N} that converges uniformly on [t_Init, t_Init + T] towards a continuous function z. Furthermore, the function f is uniformly continuous on the compact set C_0: for every ε > 0, there exists η > 0, such that for all (s_1, y_1) ∈ C_0 and (s_2, y_2) ∈ C_0, if |s_1 − s_2| ≤ η and |y_1 − y_2| ≤ η, then |f(s_1, y_1) − f(s_2, y_2)| ≤ ε. Thus, with ε > 0 fixed, whenever h max(1, M_0) ≤ η, we obtain
\[
\Big| \int_{t_{Init}}^{t} \big( f^{(N)}(s) - f(s, z^{(N)}(s)) \big)\, ds \Big|
\leq \sum_{n=0}^{N-1} \int_{t_n}^{t_{n+1}} \big| f(t_n, z^{(N)}(t_n)) - f(s, z^{(N)}(s)) \big|\, ds,
\]
which is bounded from above by N h ε = T ε. We apply this result by considering the extracted sequence (z^{(N_ℓ)})_{ℓ∈N}, which converges uniformly towards z on [t_Init, t_Init + T]: by making ℓ tend to +∞, and by using the continuity of f, we obtain
\[
z(t) = y_{Init} + \int_{t_{Init}}^{t} f(s, z(s))\, ds. \tag{1.15}
\]

This proves the Cauchy–Peano existence theorem. We now strengthen the regularity assumptions on f with respect to the state variable by assuming that the function is locally Lipschitz continuous.

DEFINITION 1.5.– We say that f is locally Lipschitz continuous in the state variable on the open set U ⊂ R × R^D if for every compact set K ⊂ U, there exists L_K > 0, such that for all s, y_1, y_2 satisfying (s, y_1) ∈ K and (s, y_2) ∈ K, the following relation holds: |f(s, y_1) − f(s, y_2)| ≤ L_K |y_1 − y_2|.

Under this assumption, and retaining the preceding paragraph's notation, we obtain
\[
z^{(N)}(t) - z^{(M)}(t) = \int_{t_{Init}}^{t} \big( f^{(N)}(s) - f^{(M)}(s) \big)\, ds
\]

and therefore
\[
|z^{(N)}(t) - z^{(M)}(t)|
\leq \int_{t_{Init}}^{t} \big| f^{(N)}(s) - f(s, z^{(N)}(s)) \big|\, ds
+ \int_{t_{Init}}^{t} \big| f(s, z^{(N)}(s)) - f(s, z^{(M)}(s)) \big|\, ds
+ \int_{t_{Init}}^{t} \big| f(s, z^{(M)}(s)) - f^{(M)}(s) \big|\, ds.
\]


By reasoning as above, thanks to the uniform continuity of f on C_0, for every ε > 0, we have
\[
\int_{t_{Init}}^{t} \big| f^{(N)}(s) - f(s, z^{(N)}(s)) \big|\, ds \leq T\varepsilon,
\qquad
\int_{t_{Init}}^{t} \big| f^{(M)}(s) - f(s, z^{(M)}(s)) \big|\, ds \leq T\varepsilon,
\]
given that N and M are large enough. We denote the Lipschitz constant of f on C_0 as L_0, such that
\[
\int_{t_{Init}}^{t} \big| f(s, z^{(N)}(s)) - f(s, z^{(M)}(s)) \big|\, ds
\leq L_0 \int_{t_{Init}}^{t} \big| z^{(N)}(s) - z^{(M)}(s) \big|\, ds.
\]
It follows that
\[
|z^{(N)}(t) - z^{(M)}(t)| \leq 2T\varepsilon + L_0 \int_{t_{Init}}^{t} \big| z^{(N)}(s) - z^{(M)}(s) \big|\, ds
\]

for large enough N and M. Grönwall's inequality implies that
\[
|z^{(N)}(t) - z^{(M)}(t)| \leq 2T\varepsilon\, e^{L_0 T}.
\]
Therefore, (z^{(N)})_{N∈N} is a Cauchy sequence in C^0([t_Init, t_Init + T]) for the uniform norm, and it converges uniformly towards a continuous function z on [t_Init, t_Init + T]. Using the continuity of f, we also show that z satisfies the integral formulation [1.15]. Finally, note that the locally Lipschitz nature of f also allows us to justify uniqueness, by again applying Grönwall's inequality. Let z_1, z_2 be continuous functions satisfying
\[
z_j(t) = y_{Init,j} + \int_{t_{Init}}^{t} f(s, z_j(s))\, ds
\]
on [t_Init, t_Init + T]. Let K ⊂ U denote a compact domain, such that (t, z_j(t)) ∈ K for every t ∈ [t_Init, t_Init + T] and j ∈ {1, 2}. We thus have
\[
|z_1(t) - z_2(t)| \leq |y_{Init,1} - y_{Init,2}| + L_K \int_{t_{Init}}^{t} |z_1(s) - z_2(s)|\, ds,
\]
which implies that |z_1(t) − z_2(t)| ≤ |y_{Init,1} − y_{Init,2}| e^{L_K (t − t_Init)}. We will now focus on a specific class of differential equations, by making additional assumptions on the right-hand side of equation [1.1].


DEFINITION 1.6.– We say the problem [1.1] is dissipative if the function f satisfies
\[
\big( f(t, y_1) - f(t, y_2) \big) \cdot (y_1 - y_2) \leq 0
\]
for all t, y_1, y_2, such that (t, y_1) ∈ U, (t, y_2) ∈ U.

Whenever the function f is differentiable, we have a practical criterion that uses the Jacobian matrix (with respect to the state variable) of f.

LEMMA 1.5.– Assume that f is differentiable. Then, the problem is dissipative if for every (t, y) ∈ U and every vector ξ ∈ R^D, we have ∇_y f(t, y)ξ · ξ ≤ 0.

As mentioned before, the construction of the implicit Euler scheme is based on the possibility of solving the equation z = y_n + Δt f(t_n + Δt, z), whose unknown is z and where y_n ∈ R^D, t_n and Δt > 0 are fixed. If this equation does indeed have a solution, then it defines the next iterate of the scheme, which serves as an approximation of the solution of [1.1] at the instant t_{n+1}. One approach involves introducing the function Φ : y → y_n + Δt f(t_n + Δt, y) and exhibiting conditions that justify the existence of a fixed point. By using Banach's theorem, we show that
– if the function f is globally Lipschitz continuous in the state variable, i.e. there exists L > 0, such that for all t ∈ [t_Init, t_Init + T] and y_1, y_2 ∈ U, we have |f(t, y_1) − f(t, y_2)| ≤ L|y_1 − y_2|,
– and if Δt L < 1,
then Φ is a contraction, and therefore admits a unique fixed point, which defines y_{n+1}. This reasoning, however, imposes a restriction on the time step. The dissipativity assumption allows us to remove this restriction: if the problem [1.1] is dissipative, then, for every Δt > 0, the implicit Euler scheme [1.12] is well defined; the scheme is of order 1, stable and therefore convergent.

PROOF.– Let n ∈ {0, ..., N − 1}. For fixed y_n and Δt > 0, to determine the next iterated term y_{n+1} using the implicit Euler scheme, it becomes necessary to find the root of the function y → y − y_n − h f(t_n + h, y).


We introduce the mapping Fn : (h, y) ∈ R × RD → y − yn − hf (tn + h, y) ∈ RD . ¯ y¯) be a root of Fn : Fn (h, ¯ y¯) = 0. Note that Fn (0, yn ) = yn − yn = 0. Let (h, ¯ The Jacobian matrix of Fn at point (h, y¯) is expressed as ¯ y¯) = I − h∇ ¯ y f (tn + h, ¯ y¯). ∇y Fn (h, Due to the dissipative nature of the problem, for every ξ ∈ RD , we have ¯ y¯)ξ · ξ ≥ |ξ|2 ∇y Fn (h, ¯ y¯) is an invertible matrix (because the linear mapping associated with and ∇y Fn (h, ¯ y¯), we can it is one-to-one). By the implicit function theorem 1.4, starting from (h, D ¯ define a curve h → yh ∈ R in a neighborhood of h, such that Fn (h, yh ) = 0,

yh¯ = y¯.

We introduce the set  ¯ > 0, such that for all 0 < h < h, ¯ there exists E = h  yh ∈ RD , a solution of Fn (h, yh ) = 0 . ¯ there exists at We just saw that this set is non-empty. Note first that with a fixed h, ¯ most one vector y¯ that cancels y → Fn (h, y). Indeed, if y¯1 and y¯2 satisfy y¯1 − yn − ¯ (t + h, ¯ y¯1 ) = y¯2 − yn − hf(t ¯ ¯ y¯2 ), then hf + h,     ¯ f (t + h, ¯ y¯1 ) − f (t + h, ¯ y¯2 ) · y¯1 − y¯2 |¯ y1 − y¯2 |2 = h ˆ

1

¯ =h

  ¯ y¯2 + s(¯ ∇y f t + h, y1 − y¯2 ) (¯ y1 − y¯2 ) · (¯ y1 − y¯2 ) ds ≤ 0

0

which indeed  implies that y¯1 = y¯2 . Suppose that h = sup E < ∞. Consider a sequence hk k∈N of elements from E , such that limk→∞ hk = h , and a sequence   y¯k k∈N of roots of Fn , which is associated with it:  y¯k = yn + hk f (t + hk , y¯k ) = yn + hk f (t + hk , y¯k )  − f (t + hk , yn ) + hk f (t + hk , yn ).

42

Mathematics for Modeling and Scientific Computing

From this relation, we infer the following estimate: |¯ yk − yn |2 ≤ hk |f (t + hk , yn )| |¯ yk − yn |     +hk f (t + hk , y¯k ) − f (t + hk , yn ) · y¯k − yn    ≤0

which implies |¯ yk | ≤ |¯ yk − yn | + |yn | ≤ |yn | + hk |f (t + hk , yn )|.   Therefore, the sequence y¯k k∈N is bounded. We now find a subsequence, such that lim→∞ hk = h and lim→∞ y¯k = y¯ ∈ RD . By the continuity of f , we obtain y¯ − yn − h f (tn + h , y¯ ) = Fn (h , y¯ ) = 0. Furthermore, if 0 < h < h , then, for a large enough k, we have 0 < h < hk < h , and therefore there exists a yh ∈ RD , such that F (h, yh ) = 0. It follows that h ∈ E . This contradicts the definition of h because the implicit function theorem would make it possible to construct roots (h, yh ) of Fn with h > h . We conclude that h = +∞ and the implicit Euler scheme is well defined, with no restriction on the step h. To study stability, we consider the sequences defined by y0 , z0 and the relations yn+1 = yn + hf (t + h, yn+1 ),

zn+1 = zn + hf (t + h, zn+1 ) + ηn

  where ηn n∈N is a given sequence in RD . We have   |yn+1 − zn+1 |2 = |yn+1 − zn+1 | |yn − zn | + |ηn |     +h f (t + h, yn+1 ) − f (t + h, zn+1 ) · yn+1 − zn+1    ≤0   ≤ |yn+1 − zn+1 | |yn − zn | + |ηn | . It follows that |yn+1 − zn+1 | ≤ |yn − zn | + |ηn | ≤ |y0 − z0 | +

n  k=0

|ηk |.

Ordinary Differential Equations

43

The stability constant is 1: here, a crucial difference emerges with respect to the stability analysis for the explicit Euler scheme [1.11] whose stability constant depends on the Lipschitz constant of f and the final time. Finally, in order to describe consistency, we can write n = y(tn+1 ) − y(tn ) − hf (tn+1 , y(tn+1 )) ˆ tn+1 ˆ tn+1 ˆ = y  (s) ds − ds × y  (tn+1 ) = − tn

tn

tn+1

tn

ˆ

tn+1

y  (σ) dσ

ds.

s

This makes it possible to bound it from above by, for example, |n | ≤ y  L∞ (0,T ) h2 . We can therefore conclude our discussion of the implicit Euler scheme’s convergence.  Explicit and implicit Euler schemes are both first-order consistent, stable and therefore convergent. Nevertheless, for certain differential equations, there might be a good reason to prefer the implicit method, which provides more favorable estimations. These phenomena are very subtle and their analysis is delicate. To this end, we will attempt to quantify the propensity that a numerical solution might have to distance itself from the real solution. Let T > 0 be a simulation time that is strictly less than the lifetime of the solution t → y(t), which we assume to be defined from the initial time tInit = 0. We can write   M = max |y(t)|, 0 ≤ t ≤ T . There is no apparent reason for the approximate solution yn to remain in the ball B(0, M ). Also, for a given C > 0, we seek to know under what conditions the approximate solution remains in B(0, M + C). In order to perform this study, we assume that B(0, M + C) ⊂ U and denote the Lipschitz constant of f on [0, T ] × U as L (this is to say that |f (t, x) − f (t, y)| ≤ L|x − y| for all 0 ≤ t ≤ T and x, y ∈ U ). Recall that yn = en + y(tn ),

en = yn − y(tn )

such that |yn | ≤ M + |en | and the question comes down to estimating the local error en . For the implicit Euler scheme, we have yn+1 = yn + hf (tn+1 , yn+1 ), y(tn+1 ) = y(tn ) + hf (tn+1 , y(tn+1 )) + n


such that
$$e_{n+1} = e_n + h\big(f(t_{n+1}, y_{n+1}) - f(t_{n+1}, y(t_{n+1}))\big) - \epsilon_n.$$
Using the fact that the problem is dissipative, it follows that
$$|e_{n+1}|^2 = (e_n - \epsilon_n)\cdot e_{n+1} + \underbrace{h\big(f(t_{n+1}, y_{n+1}) - f(t_{n+1}, y(t_{n+1}))\big)\cdot\big(y_{n+1} - y(t_{n+1})\big)}_{\le 0} \le \big(|e_n| + |\epsilon_n|\big)\,|e_{n+1}|,$$
and therefore $|e_{n+1}| \le |e_n| + |\epsilon_n|$. However, we have seen that
$$|\epsilon_n| = \Big|y(t_{n+1}) - y(t_n) - h f(t_{n+1}, y(t_{n+1}))\Big| = \Big|\int_{t_n}^{t_{n+1}} y'(s)\,ds - \int_{t_n}^{t_{n+1}} ds \times y'(t_{n+1})\Big| \le \int_{t_n}^{t_{n+1}} \big|y'(s) - y'(t_{n+1})\big|\,ds = \int_{t_n}^{t_{n+1}} \Big|\int_s^{t_{n+1}} y''(\sigma)\,d\sigma\Big|\,ds \le h \int_{t_n}^{t_{n+1}} |y''(\sigma)|\,d\sigma.$$
We can therefore establish by direct induction that
$$|e_n| \le |e_0| + h \int_0^{t_n} |y''(\sigma)|\,d\sigma.$$
(Again, this relation shows that the implicit Euler scheme converges.) We thus obtain the estimate $|y_n| \le M + h\,\|y''\|_{L^1(0,T)}$. It follows that the numerical solution remains in $B(0, M + C)$, provided that $h\,\|y''\|_{L^1(0,T)} \le C$.
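This favorable behavior of the implicit scheme on dissipative problems can be observed numerically. The sketch below is our own illustration, not from the text: the model problem $y' = -ay$ with $a = 50$ and the step size are arbitrary choices.

```python
# Dissipative test problem y' = -a*y, a > 0, integrated on [0, 1]
a, T, n = 50.0, 1.0, 20
h = T / n  # h = 0.05, so the explicit amplification factor |1 - a*h| = 1.5 > 1

ye, yi = 1.0, 1.0
for _ in range(n):
    ye = (1.0 - a * h) * ye   # explicit Euler: multiplies by (1 - a*h)
    yi = yi / (1.0 + a * h)   # implicit Euler: divides by (1 + a*h)

# The exact solution decays to e^{-50}. The implicit iterates stay bounded
# (and decay), while the explicit iterates blow up since |1 - a*h| > 1.
```

The implicit iterates remain in any ball containing the exact trajectory for every $h > 0$, whereas the explicit iterates leave it unless $h$ is taken much smaller.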

With the explicit Euler scheme, we have
$$y_{n+1} = y_n + h f(t_n, y_n), \qquad y(t_{n+1}) = y(t_n) + h f(t_n, y(t_n)) + \epsilon_n,$$
so
$$|e_{n+1}| \le |e_n| + h\big|f(t_n, y_n) - f(t_n, y(t_n))\big| + |\epsilon_n|.$$
We assume that $y_n \in B(0, M + C)$. Using the fact that $f$ is $L$-Lipschitz in the state variable on $[0,T] \times B(0, M+C)$, and the estimate $|\epsilon_n| \le \frac{\|y''\|_\infty}{2} h^2$ on the local consistency error, we obtain
$$|e_{n+1}| \le (1 + Lh)\,|e_n| + \frac{\|y''\|_\infty}{2}\, h^2,$$
which leads to
$$|e_{n+1}| \le e^{LT}\Big(|e_0| + h\,\frac{\|y''\|_\infty}{2L}\Big).$$
Therefore, the desired estimate ($y_{n+1} \in B(0, M + C)$) is satisfied under the restrictive condition
$$h\, e^{LT}\, \frac{\|y''\|_\infty}{2L} \le C.$$

With $T$ and $C$ fixed, the greater $L$ is, the smaller the time step $h$ must be, and the condition is more restrictive than the one obtained for the implicit scheme by using the dissipative nature of the equation. This discussion reveals that, in fact, several notions of stability for a numerical scheme coexist, more or less pertinent according to the problem being considered, and related to a scheme's capacity to

a) not amplify, in an uncontrolled manner, the perturbations introduced by the numerical approximation (stability with respect to errors)7;

b) produce an approximate solution that remains bounded within a domain as close as possible to that of the solution of the continuous problem;

c) provide a solution that preserves the known properties (positivity8, conservation of energy, etc.) of the continuous solution;

7 The relevant errors are those produced by the scheme itself: the continuous problem's solution does not satisfy the recurrence formula that defines the scheme, and we can ask how this consistency error is propagated by the scheme. This consistency error is the leading phenomenon; in particular, it largely dominates the so-called "machine round-off" error that results from the way real numbers are represented on a computer!

8 On this point, we can compare the implicit and explicit Euler schemes for solving $y'(t) = -a y(t)$, with $a > 0$. The continuous solution is known: if $y(0) = y_{\mathrm{Init}}$, then $y(t) = e^{-at} y_{\mathrm{Init}}$, which is non-negative if $y_{\mathrm{Init}} \ge 0$. The explicit Euler scheme can be written as $y_{n+1} = (1 - a\Delta t)y_n$; it preserves non-negativity only under the constraint $\Delta t < 1/a$. The implicit Euler scheme can be written as $y_{n+1} = (1 + a\Delta t)^{-1} y_n$, and guarantees that all iterations are non-negative regardless of $\Delta t > 0$. For the implicit scheme, it is when $a < 0$ that constraints on the time step arise.


d) dissipate certain quantities.

The last two aspects will be discussed in more detail when we study Hamiltonian problems. To conclude the presentation of Euler schemes, we now return to the practical implementation of the implicit method. We have seen that, when $y_n$ is known, each new iterate is constructed as the solution of the nonlinear problem $z = y_n + \Delta t\, f(t_n + \Delta t, z)$. We introduce the following function, parameterized by $t_n$, $y_n$ and $\Delta t$:
$$\Psi : z \in \mathbb{R}^D \longmapsto z - y_n - \Delta t\, f(t_n + \Delta t, z).$$
We therefore seek a process that allows us to compute (an approximation of) a root of the function $\Psi$. The construction relies on an iterative process – Newton's method – whose main idea is as follows. If $z_\star$ satisfies $\Psi(z_\star) = 0$, Taylor's formula leads to
$$\Psi(z_\star) = 0 = \Psi(z) + \nabla\Psi(z)(z_\star - z) + |z_\star - z|\,\varepsilon(z_\star - z), \qquad \text{with } \lim_{|h|\to 0}\varepsilon(h) = 0.$$

The idea is to retain only the leading part of the right-hand side, which is a linear function of $z_\star$, in order to define a sequence, and we will then show that this sequence indeed approaches $z_\star$. More specifically, we take a vector $z^{(0)}$ and then iterate as follows:

a) construct the matrix
$$M^{(k)} = \begin{pmatrix} \partial_{z_1}\Psi_1(z^{(k)}) & \partial_{z_2}\Psi_1(z^{(k)}) & \dots & \partial_{z_D}\Psi_1(z^{(k)}) \\ \partial_{z_1}\Psi_2(z^{(k)}) & \partial_{z_2}\Psi_2(z^{(k)}) & \dots & \partial_{z_D}\Psi_2(z^{(k)}) \\ \vdots & \vdots & & \vdots \\ \partial_{z_1}\Psi_D(z^{(k)}) & \partial_{z_2}\Psi_D(z^{(k)}) & \dots & \partial_{z_D}\Psi_D(z^{(k)}) \end{pmatrix}$$
(it is the Jacobian matrix of $\Psi$ evaluated at $z^{(k)}$);

b) solve the linear system $M^{(k)} \zeta^{(k)} = \Psi(z^{(k)})$;

c) set $z^{(k+1)} = z^{(k)} - \zeta^{(k)}$.
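The iteration a)–c) can be sketched in a few lines. The function `newton` and the test system below are our own minimal illustration, not taken from the book:

```python
import numpy as np

def newton(Psi, jac, z0, tol=1e-12, max_iter=50):
    """Newton's method for Psi(z) = 0 in R^D.

    At each step, solve M^(k) zeta^(k) = Psi(z^(k)), where M^(k) is the
    Jacobian of Psi at z^(k), then set z^(k+1) = z^(k) - zeta^(k).
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(jac(z), Psi(z))  # solve a system, never invert
        z = z - step
        if np.linalg.norm(step) <= tol * (1.0 + np.linalg.norm(z)):
            return z
    raise RuntimeError("Newton iteration did not converge")

# Illustrative 2D system: intersection of the circle x^2 + y^2 = 2
# with the parabola y = x^2; one root is (1, 1).
Psi = lambda z: np.array([z[0]**2 + z[1]**2 - 2.0, z[1] - z[0]**2])
jac = lambda z: np.array([[2*z[0], 2*z[1]], [-2*z[0], 1.0]])
root = newton(Psi, jac, np.array([0.8, 0.8]))  # converges to (1, 1)
```

Note that step b) is implemented with a linear solve rather than a matrix inversion, for the reason given in the text.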


In other words, since $M^{(k)} = \nabla\Psi(z^{(k)})$, $z^{(k+1)}$ satisfies
$$\Psi(z^{(k)}) + \nabla\Psi(z^{(k)}) \cdot (z^{(k+1)} - z^{(k)}) = 0$$
(or $z^{(k+1)} = z^{(k)} - \big(\nabla\Psi(z^{(k)})\big)^{-1}\Psi(z^{(k)})$; in practice, however, it is preferable to solve one linear system rather than compute the inverse of a matrix). The efficiency of this algorithm is confirmed by the following statement.

THEOREM 1.12 (Convergence of Newton's Method).– Let $\Psi : \mathbb{R}^D \to \mathbb{R}^D$ be a $C^2$ function and $z_\star \in \mathbb{R}^D$ such that $\Psi(z_\star) = 0$. Suppose that the matrix $\nabla\Psi(z_\star) \in M_D(\mathbb{R})$ is invertible. Then there exists $\varepsilon > 0$ such that, if $|z^{(0)} - z_\star| \le \varepsilon$, then for every $k \in \mathbb{N}$ we have $|z^{(k)} - z_\star| \le \varepsilon$ and the sequence $(z^{(k)})_{k\in\mathbb{N}}$ converges towards $z_\star$.

We will see that the convergence of $z^{(k)}$ towards $z_\star$ is fast: we will effectively establish that there exists $\kappa > 0$ such that
$$|z^{(k+1)} - z_\star| \le \kappa\, |z^{(k)} - z_\star|^2. \qquad [1.16]$$
We say the convergence is quadratic. Indeed, setting $r^{(k)} = \kappa |z^{(k)} - z_\star|$, it satisfies $r^{(k+1)} \le (r^{(k)})^2$. An immediate induction leads to the estimate $r^{(k)} \le \big(r^{(0)}\big)^{2^k}$, which proves that $\lim_{k\to\infty} |z^{(k)} - z_\star| = \lim_{k\to\infty} r^{(k)}/\kappa = 0$, provided that $r^{(0)} = \kappa|z^{(0)} - z_\star| \le \kappa\varepsilon < 1$.

However, the success of the method depends on the initial iterate being "close enough" to the desired point $z_\star$. In practice, we know neither $z_\star$ nor $\kappa$ (and therefore $\varepsilon$). It may happen that the algorithm does not converge and does not allow a correct determination of $z_\star$. In that case, we must find another way to approach $z_\star$, even if only roughly, but well enough to return to Newton's method. Finally, as we will see in the proof, the assumption that $\nabla\Psi(z_\star)$ is invertible is crucial.

Figure 1.1 shows a convergence defect of the method. The function $\Psi$ studied there equals $x^2/2$ in a neighborhood of the origin: it vanishes at that point, which is also a minimum of the function. Outside a "small" neighborhood of $0$, the function $\Psi$ is linear, with a rather small slope. Moreover, the function is constructed so as to repeat this behavior regularly. As can be seen in the figure, when Newton's method starts from a point that is not close enough to the origin, it oscillates between two values.

PROOF OF THEOREM 1.12.– The first stage consists of ensuring that the matrix $\nabla\Psi(z)$ remains invertible when $z$ lies in a sufficiently small neighborhood of $z_\star$. We can write
$$\nabla\Psi(z) = \nabla\Psi(z_\star) - \big(\nabla\Psi(z_\star) - \nabla\Psi(z)\big) = \nabla\Psi(z_\star)\big(I - M(z)\big), \qquad \text{with } M(z) = \big(\nabla\Psi(z_\star)\big)^{-1}\big(\nabla\Psi(z_\star) - \nabla\Psi(z)\big).$$


[Figure 1.1 appears here: graph of the piecewise function $\Psi$ on $[-2, 2]$, with values between 0 and 0.25.]

Figure 1.1. Illustration of non-convergence for Newton's method. The algorithm starts from a point marked as x and oscillates between two values indicated by a •

However, recall that if $H \in M_D(\mathbb{R})$ satisfies $\|H\| < 1$, then $(I - H)$ is invertible: the series $\sum_{k=0}^\infty H^k$ is normally convergent and $(I - H)^{-1} = \sum_{k=0}^\infty H^k$. Therefore, $\nabla\Psi(z)$ is invertible, as a product of invertible matrices, whenever the matrix $M(z)$ has norm strictly smaller than 1. For $z = z_\star$, we have $M(z_\star) = 0$, and $z \mapsto \nabla\Psi(z)$ is continuous. It follows that there exists $\varepsilon_1 > 0$ such that, for every $|z - z_\star| < \varepsilon_1$, we have $\|M(z)\| < 1$ and therefore $\nabla\Psi(z)$ is invertible. Moreover, for $|z - z_\star| \le \varepsilon_1$, we have
$$\big\|\big(\nabla\Psi(z)\big)^{-1}\big\| = \big\|\big(I - M(z)\big)^{-1}\big(\nabla\Psi(z_\star)\big)^{-1}\big\| \le \big\|\big(\nabla\Psi(z_\star)\big)^{-1}\big\| \sum_{k=0}^\infty \|M(z)\|^k = \frac{\big\|\big(\nabla\Psi(z_\star)\big)^{-1}\big\|}{1 - \|M(z)\|}.$$
Hence, there exists $\kappa_1 > 0$ such that, for all $|z - z_\star| \le \varepsilon_1$, we have
$$\big\|\big(\nabla\Psi(z)\big)^{-1}\big\| \le \kappa_1.$$


Suppose that $|z^{(k)} - z_\star| \le \varepsilon < \varepsilon_1$, for a certain $\varepsilon > 0$. We define $z^{(k+1)}$ as the solution of the linear problem
$$\nabla\Psi(z^{(k)})(z^{(k+1)} - z^{(k)}) + \Psi(z^{(k)}) = 0.$$
Since $\Psi(z_\star) = 0$, we can rewrite this relation in the form
$$\nabla\Psi(z^{(k)})(z^{(k+1)} - z_\star) - \nabla\Psi(z^{(k)})(z^{(k)} - z_\star) + \Psi(z^{(k)}) - \Psi(z_\star) = 0.$$
We therefore have
$$\begin{aligned}\nabla\Psi(z^{(k)})(z^{(k+1)} - z_\star) &= \nabla\Psi(z^{(k)})(z^{(k)} - z_\star) + \int_0^1 \frac{d}{dt}\Psi\big(z^{(k)} + t(z_\star - z^{(k)})\big)\,dt \\ &= \int_0^1 \Big(\nabla\Psi(z^{(k)}) - \nabla\Psi\big(z^{(k)} + t(z_\star - z^{(k)})\big)\Big)(z^{(k)} - z_\star)\,dt \\ &= -\int_0^1 \int_0^1 t\, D^2\Psi\big(z^{(k)} + \theta t(z_\star - z^{(k)})\big)\big(z^{(k)} - z_\star,\, z^{(k)} - z_\star\big)\,d\theta\,dt.\end{aligned}$$
However, $|z^{(k)} + \theta t(z_\star - z^{(k)}) - z_\star| = (1 - \theta t)\,|z^{(k)} - z_\star| \le \varepsilon \le \varepsilon_1$ for every $0 \le \theta, t \le 1$, and since $\Psi$ is of class $C^2$, there exists $\kappa_2 > 0$ such that, for all $|u - z_\star| \le \varepsilon_1$ and every $\xi \in \mathbb{R}^D$, we have $|D^2\Psi(u)(\xi, \xi)| \le \kappa_2 |\xi|^2$. It follows that
$$\Big|\int_0^1 \int_0^1 t\, D^2\Psi\big(z^{(k)} + \theta t(z_\star - z^{(k)})\big)\big(z^{(k)} - z_\star,\, z^{(k)} - z_\star\big)\,d\theta\,dt\Big| \le \kappa_2\, |z^{(k)} - z_\star|^2.$$
Setting $\kappa = \kappa_1 \kappa_2$, we deduce that $|z^{(k+1)} - z_\star| \le \kappa\, |z^{(k)} - z_\star|^2$. However, we assumed that $|z^{(k)} - z_\star| \le \varepsilon < \varepsilon_1$, implying that $|z^{(k+1)} - z_\star| \le \kappa \varepsilon^2$. In order to repeat the argument, we impose $0 < \varepsilon < \min(1/\kappa, \varepsilon_1)$, so that all elements of the sequence thus defined satisfy $|z^{(k)} - z_\star| \le \varepsilon$. This discussion establishes [1.16], with $\kappa\varepsilon < 1$ by construction, and theorem 1.12 is therefore proved. $\square$

Returning to the implicit Euler method, we define $y_{n+1}$ by applying Newton's algorithm to the function $\Psi(z) = z - y_n - \Delta t\, f(t_n + \Delta t, z)$, with $z^{(0)} = y_n$. In order to make $\lim_{k\to\infty} z^{(k)}$ numerically accessible, we set a stopping criterion and stop the iterative process when the relative error $|z^{(k+1)} - z^{(k)}| / |z^{(k)}|$ falls below a previously fixed tolerance.
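One implicit Euler step computed this way can be sketched as follows. This is our own minimal implementation of the procedure just described; the test equation $y' = -y^3$ and the step size are illustrative choices:

```python
import numpy as np

def implicit_euler_step(f, jac_f, t, y, dt, tol=1e-12, max_iter=50):
    """One implicit Euler step: solve z = y + dt*f(t+dt, z) by Newton's method.

    Here Psi(z) = z - y - dt*f(t+dt, z), so grad Psi(z) = I - dt*jac_f(t+dt, z),
    and the initial guess is z^(0) = y, as in the text.
    """
    z = np.array(y, dtype=float)
    I = np.eye(z.size)
    for _ in range(max_iter):
        Psi = z - y - dt * f(t + dt, z)
        step = np.linalg.solve(I - dt * jac_f(t + dt, z), Psi)
        z_new = z - step
        # stop when the relative update falls below the tolerance
        if np.linalg.norm(z_new - z) <= tol * max(np.linalg.norm(z), 1.0):
            return z_new
        z = z_new
    return z

# One step from y(0) = 1 for y' = -y^3: the result z solves z + dt*z^3 = 1
f = lambda t, y: -y**3
jac = lambda t, y: np.diag(-3.0 * y**2)
y1 = implicit_euler_step(f, jac, 0.0, np.array([1.0]), 0.1)
```

The returned value satisfies the implicit relation $z = y_n + \Delta t\, f(t_{n+1}, z)$ up to the Newton tolerance.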


1.2.4. Higher-order schemes

The idea for obtaining higher-order methods is to take inspiration from higher-order numerical integration rules. The trapezoidal rule (see Appendix 2) leads to the scheme
$$y_{n+1} = y_n + \frac{\Delta t}{2}\big(f(t_n, y_n) + f(t_{n+1}, y_{n+1})\big),$$
which is of an implicit nature; it would therefore be necessary to analyze the conditions that effectively make it possible to define $y_{n+1}$. This is the Crank–Nicolson method. Assuming that $f$ is regular enough and denoting by $h \mapsto \varepsilon(h)$ a function that tends to 0 as $h \to 0$, the local consistency error can be expressed as
$$\begin{aligned}\epsilon_n &= y(t_{n+1}) - y(t_n) - \frac{\Delta t}{2}\big(f(t_n, y(t_n)) + f(t_{n+1}, y(t_{n+1}))\big) \\ &= \underbrace{y'(t_n)}_{=f(t_n, y(t_n))}\Delta t + y''(t_n)\frac{\Delta t^2}{2} - \frac{\Delta t}{2}\Big(2 f(t_n, y(t_n)) + \partial_t f(t_n, y(t_n))\,\Delta t + \nabla_y f(t_n, y(t_n))\underbrace{\big(y(t_{n+1}) - y(t_n)\big)}_{=y'(t_n)\Delta t + \Delta t\,\varepsilon(\Delta t)}\Big) + \Delta t^2\,\varepsilon(\Delta t).\end{aligned}$$
However, $y'(t) = f(t, y(t))$ and, by differentiation, we have
$$y''(t) = \frac{d}{dt} f(t, y(t)) = \partial_t f(t, y(t)) + \nabla_y f(t, y(t)) \cdot y'(t) = \partial_t f(t, y(t)) + \nabla_y f(t, y(t)) \cdot f(t, y(t)).$$
It follows that the local consistency error is $\epsilon_n = \Delta t^2\,\varepsilon(\Delta t)$: the Crank–Nicolson scheme is second-order consistent. In order to obtain higher orders, it is necessary to approximate the integral

$$\int_{t_n}^{t_{n+1}} f(s, y(s))\,ds$$
by a linear combination $\sum_{j=1}^{J} \beta_j f(t_n^j, y_n^j)$, where the intermediate values $y_n^j$ are themselves obtained as combinations of the previous stages,
$$y_n^j = y_n + \sum_{k=1}^{j-1} \alpha_{j,k}\, f(t_n^k, y_n^k),$$
with
$$t_n^j = t_n + \gamma_j \Delta t, \qquad 0 < \gamma_j < 1, \qquad \sum_{k=1}^{j-1} \alpha_{j,k} = \gamma_j.$$
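A minimal member of this family is Heun's method, a two-stage, second-order explicit Runge–Kutta scheme. The order check below is our own sketch; the test problem $y' = y$ and the step sizes are arbitrary choices:

```python
import math

def heun_step(f, t, y, dt):
    """One step of Heun's method: two stages, weights beta = (1/2, 1/2),
    with the second stage taken at t + dt using the explicit Euler predictor."""
    k1 = f(t, y)
    k2 = f(t + dt, y + dt * k1)
    return y + 0.5 * dt * (k1 + k2)

# Observed order of accuracy on y' = y, y(0) = 1, integrated up to T = 1:
f = lambda t, y: y
errors = []
for n in (50, 100, 200):
    dt, y = 1.0 / n, 1.0
    for i in range(n):
        y = heun_step(f, i * dt, y, dt)
    errors.append(abs(y - math.exp(1.0)))

# Halving dt divides the error by roughly 4, consistent with second order
ratios = [errors[i] / errors[i + 1] for i in range(2)]
```

Each ratio is close to $2^2 = 4$, which is the expected behavior of a second-order scheme.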

We thus obtain the family of Runge–Kutta schemes. The values of the coefficients are tabulated. These methods remain explicit, but the step towards $t_{n+1}$ is achieved in several stages.

1.2.5. Leslie's equation (Perron–Frobenius theorem, power method)

We consider here a population dynamics problem. We divide the population into $n$ age classes. At instant $t \ge 0$, $x_j(t)$ is the number of individuals in the $j$th class. Its dynamics depends

– on the fertility rates of the age classes: $f_j > 0$ is the average rate of newborns – that is to say, individuals within class 1 – produced by individuals in class $j$;

– on the transition rates $0 \le t_{j+1,j} < 1$, which give the proportion of individuals in age category $j$ that reach age category $j+1$.

The population's behavior is completely described by the Leslie matrix
$$L = \begin{pmatrix} f_1 & f_2 & \dots & \dots & f_n \\ t_{2,1} & 0 & & & 0 \\ 0 & t_{3,2} & \ddots & & \vdots \\ \vdots & & \ddots & \ddots & \\ 0 & \dots & 0 & t_{n,n-1} & 0 \end{pmatrix}.$$

We will analyze how the asymptotic behavior of the population for large times is related to the remarkable spectral properties of the matrix L. For more details about Leslie’s model, we refer the reader to [CUS 98] and the original articles [LES 45, LES 48].


1.2.5.1. Discrete time model

We begin by considering discrete times. The population's evolution between two instants, $k$ and $k+1$, is defined by the linear relation
$$X^{(k+1)} = L X^{(k)},$$
where $X^{(k)} = (x_1^{(k)}, ..., x_n^{(k)})$ is the vector describing the population distribution among age classes at instant $k \in \mathbb{N}$. Even if it is not quite accurate from a biological point of view, we can begin by studying the scalar case involving only a single class. In this case, the sequence $x^{(k)}$ is a simple geometric sequence, $x^{(k)} = (t + f)\,x^{(k-1)} = ... = (t + f)^k x^{(0)}$, and the asymptotic behavior depends on the growth factor:
$$\lim_{k\to\infty} x^{(k)} = \begin{cases} x^{(0)} & \text{if } t + f = 1, \\ 0 & \text{if } t + f < 1, \\ +\infty & \text{if } t + f > 1. \end{cases}$$
In other words, the dynamics is governed by the number $R = \frac{f}{1-t}$, with $f > 0$ and $0 \le t < 1$, which we can expand as a series
$$R = f\sum_{k=0}^{\infty} t^k = f + ft + ft^2 + ft^3 + ...$$

where $t$ is the probability of passing from age $k$ to age $k+1$ – in particular from age 1 to age 2 – $t^2$ is the probability of surviving until age 2, $t^3$ the probability of surviving up to age 3, etc. We multiply these numbers by $f$, the expected number of babies per individual of a given age. Therefore, $R$ is interpreted as the expected number of babies per individual over an entire lifetime.

In the vector case, the population's asymptotic behavior, on a long time scale, can be deduced from the spectral properties of the matrix $L$. More specifically, we will use the Perron–Frobenius theorem [FRO 12, PER 07], which, as we will see, has several versions with somewhat different conclusions.

THEOREM 1.13 (Perron–Frobenius Theorem).– Let $A \in M_n(\mathbb{R})$ be a matrix with strictly positive coefficients $a_{k,j}$. Then,

i) the spectral radius $\rho := \sup\big\{|\lambda|,\ \lambda \in \mathbb{C} \text{ an eigenvalue of } A\big\}$ is a simple eigenvalue of $A$;

ii) there exists an eigenvector $v$ of $A$ associated with the eigenvalue $\rho$ whose components are strictly positive;


iii) every other eigenvalue $\lambda$ of $A$ satisfies $|\lambda| < \rho$;

iv) there exists an eigenvector $\phi$ of $A^\top$ associated with the eigenvalue $\rho$ and whose components are strictly positive.

COROLLARY 1.6.– Let $A \in M_n(\mathbb{R})$ be a matrix with non-negative coefficients $a_{k,j}$. Its spectral radius $\rho(A)$ is an eigenvalue, and there exists an eigenvector for that eigenvalue whose coordinates are non-negative.

The statement of theorem 1.13 (which we will prove further below) does not apply to the Leslie matrix $L$, since that matrix's coefficients are not all positive. However, we will be able to use the following variant.

COROLLARY 1.7.– Let $B \in M_n(\mathbb{R})$. We assume there exists $q \in \mathbb{N}$ such that the coefficients of $B^q$ are strictly positive9. Then, the conclusions of the Perron–Frobenius theorem apply to the matrix $B$.

This corollary applies to the Leslie matrix since the fertility rates $f_j$ and the transition rates $t_{j+1,j}$ are all strictly positive. This assumption excludes, in particular, populations whose youngest age groups $j \in \{1, ..., j_0\}$ are infertile, as well as those whose oldest age groups $j \in \{j_1, ..., n\}$ have become sterile.

To justify that the Leslie matrix $L$ satisfies the assumptions of corollary 1.7, we give an induction argument. Let $k \in \{1, ..., n-1\}$. We assume that $L^k$ is such that $(L^k)_{m,i} > 0$ for every $m \in \{1, ..., k\}$, $i \in \{1, ..., n\}$, and $(L^k)_{k+1,1} > 0$, $(L^k)_{k+2,2} > 0$, ..., $(L^k)_{k+(n-k),n-k} > 0$. This property is satisfied for $k = 1$. We use the relation
$$(L^{k+1})_{i,j} = \sum_{\ell=1}^{n} (L^k)_{i,\ell}\, L_{\ell,j},$$
knowing that these matrices' coefficients are non-negative. On the one hand, we have
$$(L^{k+1})_{m,j} \ge (L^k)_{m,1}\, f_j > 0$$
for all $m \in \{1, ..., k\}$ and $j \in \{1, ..., n\}$ (the rows with strictly positive coefficients remain strictly positive), and
$$(L^{k+1})_{k+1,j} \ge (L^k)_{k+1,1}\, f_j > 0$$

9 A matrix of this kind is called primitive.


for every $j \in \{1, ..., n\}$ (row $k+1$ is made up of strictly positive coefficients). On the other hand, we obtain
$$(L^{k+1})_{k+1+j,\,j} \ge (L^k)_{k+(j+1),\,j+1}\, t_{j+1,j} > 0$$
for every $j \in \{1, ..., n-(k+1)\}$. The induction hypothesis thus passes to rank $k+1$. It follows that the coefficients of $L^n$ are all strictly positive, and therefore $L$ is primitive.

As in the scalar case, we can always write $X^{(k)} = L X^{(k-1)} = L^k X^{(0)}$. We let $\rho_L > 0$ denote the spectral radius of $L$ and use the Jordan–Chevalley decomposition, which allows us to write
$$L = P^{-1}(D + N)P, \qquad N^{k_0} = 0 \text{ for some } k_0 \in \mathbb{N}, \qquad DN = ND,$$
$$D = \begin{pmatrix} \rho_L & 0 & \dots & 0 \\ 0 & \mu_1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & \mu_{n-1} \end{pmatrix}, \qquad |\mu_j| < \rho_L.$$

It follows that, for $k \ge k_0$,
$$L^k = P^{-1}(D+N)^k P = P^{-1}\Big(\sum_{\ell=0}^{k} C_k^\ell\, D^{k-\ell} N^\ell\Big) P = P^{-1}\Big(\sum_{\ell=0}^{k_0} C_k^\ell\, D^{k-\ell} N^\ell\Big) P.$$
In this sum, we have

– on the one hand,
$$D^{k-\ell} = \rho_L^k \begin{pmatrix} \rho_L^{-\ell} & 0 & \dots & 0 \\ 0 & \kappa_1^{k-\ell}\rho_L^{-\ell} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & \kappa_{n-1}^{k-\ell}\rho_L^{-\ell} \end{pmatrix},$$
with $\kappa_j = \mu_j/\rho_L$ of modulus strictly less than 1, so that $\lim_{k\to\infty} \kappa_j^{k-\ell}\rho_L^{-\ell} = 0$;

– on the other hand, we can bound the binomial coefficients from above:
$$C_k^\ell = \frac{k!}{\ell!\,(k-\ell)!} \le \frac{k!}{1\times(k-k_0)!} \le k(k-1)\cdots(k-k_0+1) \le k^{k_0}.$$
We can therefore reorganize the sum and write
$$X^{(k)} = \rho_L^k \Big( Y^{(k)} + \sum_{j=1}^{n-1} \kappa_j^k\, Z^{(j,k)} \Big),$$
where $Y^{(k)}$ and $Z^{(j,k)}$ are vectors whose growth in $k$ is polynomial, at most in $k^{k_0}$. Moreover, the system has a conservation property: by the Perron–Frobenius theorem, we can find two vectors, $\phi$ and $e$, with strictly positive coordinates, such that
$$L e = \rho_L\, e, \qquad L^\top \phi = \rho_L\, \phi, \qquad |e| = 1, \qquad \phi \cdot e = 1,$$
and we thus obtain
$$X^{(k)} \cdot \phi = L^k X^{(0)} \cdot \phi = X^{(0)} \cdot (L^\top)^k \phi = \rho_L^k\, X^{(0)} \cdot \phi.$$
It follows that
$$\frac{X^{(k)} \cdot \phi}{\rho_L^k} = Y^{(k)} \cdot \phi + r^{(k)} = X^{(0)} \cdot \phi, \qquad \text{with } \lim_{k\to\infty} r^{(k)} = 0.$$
Since the coordinates of $\phi$ are strictly positive, it follows that the coordinates $\big(X_j^{(k)}/\rho_L^k\big)_{k\in\mathbb{N}}$ form a bounded sequence. In light of the estimates on $Y^{(k)}$ and $Z^{(j,k)}$, this implies that $Y^{(k)} = \bar{Y}$ is constant, with $\bar{Y} \cdot \phi = X^{(0)} \cdot \phi$. Finally, we obtain
$$L\bar{Y} = \lim_{k\to\infty} \frac{L X^{(k)}}{\rho_L^k} = \lim_{k\to\infty} \frac{X^{(k+1)}}{\rho_L^k} = \rho_L \times \lim_{k\to\infty} \frac{X^{(k+1)}}{\rho_L^{k+1}} = \rho_L\, \bar{Y}.$$
This means that $\bar{Y}$ is an eigenvector of $L$ associated with the eigenvalue $\rho_L$, so $\bar{Y} = \alpha e$ for some $\alpha \in \mathbb{C}$. We identify this constant through $\bar{Y} \cdot \phi = \alpha = X^{(0)} \cdot \phi$.


We therefore conclude that
$$X^{(k)} \sim_{k\to\infty} \rho_L^k\, \big(X^{(0)} \cdot \phi\big)\, e.$$
Therefore, the total population $\sum_{j=1}^n X_j^{(k)}$
$$\begin{cases} \text{tends to } \big(X^{(0)} \cdot \phi\big)\sum_{j=1}^n e_j > 0 & \text{if } \rho_L = 1, \\ \text{tends to } +\infty & \text{if } \rho_L > 1, \\ \text{tends to } 0 & \text{if } \rho_L < 1. \end{cases}$$
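The leading eigenpair governing this trichotomy is precisely what the power method (mentioned in the section title) computes. A minimal sketch follows; the rates $f_j$ and $t_{j+1,j}$ below are our own illustrative values:

```python
import numpy as np

def power_method(A, n_iter=500):
    """Approximate the dominant eigenvalue and a positive eigenvector of a
    primitive non-negative matrix by iterating x -> A x / |A x|."""
    x = np.ones(A.shape[0])
    for _ in range(n_iter):
        y = A @ x
        x = y / np.linalg.norm(y)
    return np.linalg.norm(A @ x), x

# A small Leslie matrix with illustrative fertility and transition rates
f1, f2, f3 = 0.8, 1.2, 0.9
t21, t32 = 0.7, 0.5
L = np.array([[f1,  f2,  f3 ],
              [t21, 0.0, 0.0],
              [0.0, t32, 0.0]])
rho_L, e = power_method(L)  # here rho_L = 1.5 > 1: the population explodes

# The net reproduction number R = f1 + f2*t21 + f3*t21*t32 computed later
# in this section exceeds 1 exactly when rho_L > 1:
R = f1 + f2 * t21 + f3 * t21 * t32
```

For these rates, $\rho_L = 1.5$ and $R = 1.955$, consistently on the "explosion" side of the threshold.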

In order to continue the discussion, we consider the following statement:

LEMMA 1.6.– Let $A$ be a primitive matrix with non-negative coefficients.

i) If $z \neq 0$ has non-negative coordinates and satisfies $Az \ge \rho(A)z$, then the coordinates of $z$ are strictly positive and $Az = \rho(A)z$.

ii) Let $B \neq A$ be a matrix whose coefficients satisfy $B_{ij} \ge A_{ij} \ge 0$. Then, $\rho(B) > \rho(A)$.

PROOF.– According to the Perron–Frobenius theorem applied to $A^\top$, we can find a vector $\phi$ with strictly positive coordinates such that $A^\top \phi = \rho(A)\phi$. We set $y = Az - \rho(A)z$, whose coordinates are non-negative. If $y \neq 0$, we have $y \cdot \phi > 0$, which contradicts the fact that
$$y \cdot \phi = (A - \rho(A)I)z \cdot \phi = z \cdot (A^\top - \rho(A)I)\phi$$
is zero because $\phi$ is an eigenvector of $A^\top$ associated with $\rho(A)$. We thus have $Az = \rho(A)z$. This implies that $A^k z = \rho(A)^k z$; since $A$ is primitive, for $k$ large enough the vector $A^k z$ has strictly positive coordinates, and therefore so does $z$. Item i) is thus demonstrated.

Next, the inequalities $B_{ij} \ge A_{ij} \ge 0$ imply $[B^k]_{ij} \ge [A^k]_{ij} \ge 0$. Because $A$ is primitive, $B$ is also primitive. By the Perron–Frobenius theorem, $\rho(A)$ is an eigenvalue of $A$, with an eigenvector $x$ whose coordinates are strictly positive. We assume $|x| = 1$. For each coordinate, it follows that
$$0 \le \rho(A)^k x_j \le (A^k x)_j \le (B^k x)_j,$$
which implies
$$\rho(A)^k \le |A^k x| \le |B^k x| \le \|B^k\|,$$


so $\rho(A) \le \|B^k\|^{1/k}$ and, finally, by lemma A1.2,
$$\rho(A) \le \lim_{k\to\infty} \|B^k\|^{1/k} = \rho(B).$$
Suppose that $\rho(A) = \rho(B)$. In this case, we have $0 \le \rho(A)x_j = \rho(B)x_j \le (Ax)_j \le (Bx)_j$ and, since $B$ is primitive, item i) implies that $x$ is also an eigenvector of $B$, for the eigenvalue $\rho(B)$, and its coordinates are strictly positive. We therefore have $(B - A)x = 0$ and, for all $i, j \in \{1, ..., N\}$,
$$\sum_{k=1}^{N} (B_{ik} - A_{ik})x_k = 0 \ge (B_{ij} - A_{ij})x_j,$$
with $B_{ij} \ge A_{ij} \ge 0$ and $x_j > 0$, which is only possible if $A_{ij} = B_{ij}$, contradicting $B \neq A$. (This property reinforces lemma A1.2 in the case of primitive matrices.) $\square$

We now seek to interpret the analysis of the asymptotic behavior by analogy with the scalar case, by exhibiting a simple criterion that distinguishes between population extinction and explosion. To this end, we decompose $L = F + T$,
$$F = \begin{pmatrix} f_1 & f_2 & \dots & f_n \\ 0 & 0 & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}, \qquad T = \begin{pmatrix} 0 & 0 & \dots & & 0 \\ t_{2,1} & 0 & & & \vdots \\ 0 & t_{3,2} & \ddots & & \\ \vdots & & \ddots & \ddots & \\ 0 & \dots & 0 & t_{n,n-1} & 0 \end{pmatrix},$$
and we assume (for the moment) that $R = \rho\big(F(I - T)^{-1}\big) > 0$. We will see further below that this quantity is easy to calculate. Note that
$$(I - T)^{-1} = \sum_{k=0}^{\infty} T^k$$
has non-negative coefficients, since those of $T$ are also non-negative. It is valid to express things in this way because the coefficients of the matrix $T$ lie in $[0, 1[$ and, in


particular, $\|T\|_\infty < 1$. Likewise, $A = F(I - T)^{-1} \neq 0$ has non-negative coefficients, and its spectral radius $R$ is strictly positive. Therefore, by corollary 1.6, $A$ has an eigenvector $z \neq 0$ with non-negative coordinates, associated with the eigenvalue $R$, its spectral radius. Thus, we obtain
$$F(I - T)^{-1} z = R z = R\,(I - T)\underbrace{(I - T)^{-1} z}_{=\phi = \sum_{k=0}^\infty T^k z},$$
where the coordinates of $\phi \neq 0$ are non-negative. We can rewrite this equation as
$$\Big(T + \frac{F}{R}\Big)\phi = \phi.$$
However, just as $T + F$ is primitive, the matrix $T + F/R$ is primitive, and we have found an eigenvector $\phi$ with non-negative coordinates for that matrix, associated with the eigenvalue 1. It follows from the Perron–Frobenius theorem that $\rho(T + F/R) = 1$. In the same way, we show that $\phi$ is an eigenvector of $(RT + F)$ for the eigenvalue $R$, and thus $\rho(RT + F) = R$.

The number $R$ describes the asymptotic behavior of the discrete dynamical system. We distinguish three cases:

– if $R = 1$, then $\rho_L = 1 = R$. In this case, the total population converges towards a positive finite value;

– if $R > 1$, we have $F_{ij}/R < F_{ij}$ and $RT_{ij} > T_{ij}$. It follows that $T_{ij} + F_{ij}/R \le T_{ij} + F_{ij} \le RT_{ij} + F_{ij}$, and lemma 1.6 ensures that $\rho(T + F/R) = 1 < \rho(T + F) = \rho_L < \rho(RT + F) = R$. In this case, the total population tends towards $+\infty$;

– if $R < 1$, we obtain similarly $\rho(T + F/R) = 1 > \rho(T + F) = \rho_L > \rho(RT + F) = R$. In this case, the total population tends to 0.

It remains to compute the number $R$ for the Leslie matrix. We determine the inverse of $I - T$: if $y = (I - T)x$, the relations
$$x_1 = y_1, \quad x_2 - t_{2,1}x_1 = y_2, \quad \dots, \quad x_m - t_{m,m-1}x_{m-1} = y_m$$
lead to
$$x_1 = y_1, \quad x_2 = y_2 + t_{2,1}y_1, \quad \dots, \quad x_m = y_m + t_{m,m-1}\big(y_{m-1} + t_{m-1,m-2}(y_{m-2}\dots(y_2 + t_{2,1}y_1)\dots)\big),$$

so
$$(I - T)^{-1} = \begin{pmatrix} 1 & & & 0 \\ t_{2,1} & 1 & & \\ \vdots & & \ddots & \\ t_{2,1}t_{3,2}\cdots t_{m,m-1} & \dots & t_{m,m-1} & 1 \end{pmatrix}.$$
The matrix product therefore takes a simpler form:
$$F(I - T)^{-1} = \begin{pmatrix} R & \star & \dots & \star \\ 0 & 0 & \dots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix},$$
from which we deduce the spectral radius
$$R = f_1 + f_2\, t_{2,1} + \dots + f_m\, t_{2,1} t_{3,2} \cdots t_{m,m-1}.$$
The total population's behavior – extinction or explosion – is uniquely determined by this quantity.

1.2.5.2. Model in continuous time

We now pass from the description in discrete time to a description in continuous time, in terms of ordinary differential equations: $X^{(k+1)} - X^{(k)} = (L - I)X^{(k)}$ defines the population's variation per unit of time $t_{k+1} - t_k$. Assuming that this unit of time is small with respect to the observation scale, we are led to consider $\frac{X^{(k+1)} - X^{(k)}}{\delta t} = (L - I)X^{(k)}$, with $0 < \delta t \ll 1$, which can be interpreted as a discrete approximation (which is just the Euler scheme) of the differential system
$$\frac{d}{dt}X = (L - I)X, \qquad X\big|_{t=0} = X_{\mathrm{Init}}. \qquad [1.17]$$


It is a simple homogeneous linear equation with constant coefficients. The Picard–Lindelöf theorem (linear version) applies, and the problem has a unique solution defined on $\mathbb{R}$: for all $t \in \mathbb{R}$,
$$X(t) = \exp\big(t(L - I)\big)X_{\mathrm{Init}}.$$
Because of their physical interpretation, the components $x_j(t)$ are non-negative numbers, and the differential system must preserve this property. We set $Y(t) = e^t X(t)$, which satisfies $\frac{d}{dt}Y = LY$ and $Y(t) = e^{Lt}Y(0) = e^{Lt}X_{\mathrm{Init}}$. However, the matrix $L$ has non-negative coefficients, so for every $t \ge 0$, $e^{Lt}$ also has non-negative coefficients. Therefore, if the components of $X_{\mathrm{Init}}$ are non-negative, so are those of $Y(t)$, and hence those of $X(t)$. This remark, necessary for the model to be coherent, will be crucial for analyzing the problem (for more details and further developments, see [RAP 12]).

Note that the spectrum of the differential system's matrix $\widetilde{L} = L - I$ can be obtained from the spectrum of $L$ by a simple shift, with the same eigenvectors:
$$\sigma(\widetilde{L}) = \big\{\lambda - 1,\ \lambda \in \sigma(L)\big\}.$$
Let $\rho_L$ be the spectral radius of $L$. Then, by corollary 1.7, $\rho = \rho_L - 1$ is an eigenvalue of $\widetilde{L}$ with an eigenvector $V$ whose components are strictly positive. Furthermore, $\widetilde{L}^\top$ has an eigenvector $\phi$ associated with $\rho$, whose components are strictly positive. We can assume that
$$|V| = 1, \qquad V \cdot \phi = 1.$$
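The positivity preservation and the shifted spectrum can be checked numerically with the matrix exponential. The sketch below assumes `scipy` is available, and the $3\times 3$ Leslie rates are our own illustrative values:

```python
import numpy as np
from scipy.linalg import expm

# Illustrative Leslie matrix (arbitrary sample rates)
L = np.array([[0.8, 1.2, 0.9],
              [0.7, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
Lt = L - np.eye(3)

# The spectrum of L - I is the spectrum of L shifted by -1
ev_L = np.sort_complex(np.linalg.eigvals(L))
ev_Lt = np.sort_complex(np.linalg.eigvals(Lt))
assert np.allclose(ev_Lt, ev_L - 1.0)

# X(t) = exp(t(L - I)) X_init keeps non-negative data non-negative
X_init = np.array([1.0, 0.0, 0.0])
for t in (0.5, 1.0, 5.0):
    X_t = expm(t * Lt) @ X_init
    assert np.all(X_t >= -1e-12)
```

The non-negativity follows the factorization used in the text: $e^{t(L-I)} = e^{-t} e^{tL}$, with $e^{tL}$ entrywise non-negative.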

Interestingly, the system [1.17] leaves the quantity $e^{-\rho t} X(t) \cdot \phi$ unchanged.

LEMMA 1.7.– For all $t \ge 0$, we have $e^{-\rho t} X(t) \cdot \phi = X_{\mathrm{Init}} \cdot \phi$.

PROOF.– We simply calculate
$$\frac{d}{dt}\big(X(t) \cdot \phi\big) = \frac{d}{dt}X(t) \cdot \phi = \widetilde{L}X(t) \cdot \phi = X(t) \cdot \widetilde{L}^\top\phi = \rho\, X(t) \cdot \phi.$$
Because $X(t) \cdot \phi\big|_{t=0} = X_{\mathrm{Init}} \cdot \phi$, it follows that for all $t \in \mathbb{R}$, $X(t) \cdot \phi = e^{\rho t}\, X_{\mathrm{Init}} \cdot \phi$. $\square$

We now return to the Jordan–Chevalley decomposition of the matrix $\widetilde{L}$: there exist $D$ and $N$ such that $\widetilde{L} = D + N$ and

– $DN = ND$,


– there exists an invertible $P \in M_n(\mathbb{C})$ such that
$$P D P^{-1} = \begin{pmatrix} \rho & 0 & \dots & 0 \\ 0 & \lambda_1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & \lambda_{n-1} \end{pmatrix},$$
where the $\lambda_j$ are the eigenvalues of $\widetilde{L}$ distinct from $\rho$;

– there exists $r \in \mathbb{N}$ that satisfies $N^r = 0$.

We can also use the Jordan form of the matrix $\widetilde{L}$, which reads
$$\begin{pmatrix} \rho & 0 & \dots & 0 \\ 0 & J_1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & J_\ell \end{pmatrix},$$
where the upper left-hand block has size $1 \times 1$ and the blocks $J_k$ are of the form
$$J_k = \begin{pmatrix} \lambda & 1 & & 0 \\ & \lambda & \ddots & \\ & & \ddots & 1 \\ 0 & & & \lambda \end{pmatrix},$$
with $\lambda$ an eigenvalue of $\widetilde{L}$.


We thus obtain
$$X(t) = e^{t\widetilde{L}} X_{\mathrm{Init}} = P^{-1}\begin{pmatrix} e^{\rho t} & 0 & \dots & 0 \\ 0 & e^{\lambda_1 t} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \dots & 0 & e^{\lambda_{n-1} t} \end{pmatrix}\Big(\sum_{k=0}^{r-1} \frac{t^k N^k}{k!}\Big) P\, X_{\mathrm{Init}}.$$
Thus, each component of the vector $X(t)$ has an expression of the form
$$x_j(t) = P_{\rho,j}(t)\,e^{\rho t} + \sum_{k=1}^{n-1} P_{\lambda_k,j}(t)\,e^{\lambda_k t},$$
where the functions $t \mapsto P_{\lambda,j}(t)$ are polynomials. It follows that
$$e^{-\rho t} x_j(t) = P_{\rho,j}(t) + \sum_{k=1}^{n-1} P_{\lambda_k,j}(t)\,e^{(\lambda_k - \rho)t}.$$
However, each eigenvalue $\lambda_k \neq \rho$ of $\widetilde{L}$ can be written as $\lambda_k = \mu_k - 1$, where $\mu_k$ is an eigenvalue of $L$ that satisfies $\mathrm{Re}(\mu_k) \le |\mathrm{Re}(\mu_k)| \le |\mu_k| < \rho_L$. It follows that $\mathrm{Re}(\lambda_k - \rho) = \mathrm{Re}(\mu_k - \rho_L) < 0$, and therefore
$$\lim_{t\to\infty} \sum_{k=1}^{n-1} P_{\lambda_k,j}(t)\,e^{(\lambda_k - \rho)t} = 0.$$
Moreover, $\sum_{j=1}^n e^{-\rho t} x_j(t)\,\phi_j = X_{\mathrm{Init}} \cdot \phi$ is a sum of non-negative terms, so for each $j \in \{1, ..., n\}$, the function $t \mapsto e^{-\rho t} x_j(t)$ is bounded. We infer that the polynomial functions $t \mapsto P_{\rho,j}(t)$ are bounded; they are therefore constant, denoted $p_j \in \mathbb{C}$, and we have
$$\lim_{t\to\infty} e^{-\rho t} x_j(t) = p_j.$$
However, note that
$$\frac{d}{dt}\big(e^{-\rho t} x_j(t)\big) = -\rho\, e^{-\rho t} x_j(t) + \big(\widetilde{L}\, e^{-\rho t} x(t)\big)_j \xrightarrow[t\to\infty]{} -\rho\, p_j + \big(\widetilde{L}p\big)_j = \big((\widetilde{L} - \rho I)p\big)_j,$$


by denoting $p$ the vector of coordinates $p_1, ..., p_n$. This quantity can also be calculated in the following form:
$$\frac{d}{dt}\big(e^{-\rho t} x_j(t)\big) = \frac{d}{dt}\Big(p_j + \sum_{k=1}^{n-1} P_{\lambda_k,j}(t)\,e^{(\lambda_k - \rho)t}\Big) = \sum_{k=1}^{n-1} e^{(\lambda_k - \rho)t}\Big((\lambda_k - \rho)P_{\lambda_k,j}(t) + P'_{\lambda_k,j}(t)\Big) \xrightarrow[t\to\infty]{} 0,$$
again using the fact that $\mathrm{Re}(\lambda_k) < \rho$. The vector $p = (p_1, ..., p_n)$ therefore satisfies $(\widetilde{L} - \rho I)p = 0$, which implies that $p = \alpha V \in \mathrm{Ker}(\widetilde{L} - \rho I)$ for some $\alpha \in \mathbb{C}$. Finally, we identify the constant $\alpha$ by using the conservation law:
$$e^{-\rho t} X(t) \cdot \phi = X_{\mathrm{Init}} \cdot \phi \xrightarrow[t\to\infty]{} p \cdot \phi = \alpha\, V \cdot \phi = \alpha = X_{\mathrm{Init}} \cdot \phi$$
(which is a strictly positive real number). In summary, we have demonstrated the following result.

PROPOSITION 1.4.– Let $t \mapsto X(t)$ be the solution of [1.17]. Then, as $t \to \infty$, $X(t)$ behaves like $\big(X_{\mathrm{Init}} \cdot \phi\big)\, e^{\rho t}\, V$.

In terms of population, this means that there is extinction whenever $\rho < 0$, that is to say, $\rho_L < 1$, and exponential growth whenever $\rho > 0$, that is to say, $\rho_L > 1$. This kind of behavior was most famously studied by Malthus [MAL 98], and identifying a leading eigenvalue of a transition matrix plays a very important role in several biological applications. Another illustration of this type of method, with important practical applications, is the analysis of criticality in neutron transport [DAU 84, Chapter 1, Section 5, Paragraph 3] and its applications to nuclear safety problems.

NOTE.– It is useful to describe in detail a somewhat different approach to systems of differential equations
$$\frac{d}{dt}y = Ay,$$
where $A$ is a matrix whose coefficients are all strictly positive (so that theorem 1.13 applies directly). In this case, we can exhibit dissipation properties of the differential system that express the convergence to the asymptotic state. The Perron–Frobenius theorem effectively allows us to find $V = (v_1, ..., v_n)$ and $\phi = (\phi_1, ..., \phi_n)$, the positive-coordinate eigenvectors of $A$ and $A^\top$, respectively,


which are associated with the spectral radius and normalized by $|V| = 1$, $V \cdot \phi = 1$. We calculate
$$\frac12 \frac{d}{dt}\sum_{j=1}^n \Big|e^{-\rho t}\frac{y_j(t)}{v_j}\Big|^2 v_j\phi_j = \sum_{j=1}^n e^{-2\rho t}\,\frac{y_j(t)}{v_j}\Big(\frac{-\rho\, y_j(t)}{v_j} + \frac{(Ay(t))_j}{v_j}\Big)v_j\phi_j = \sum_{j=1}^n e^{-2\rho t}\Big(-\rho\, v_j\phi_j \Big(\frac{y_j(t)}{v_j}\Big)^2 + \frac{y_j(t)}{v_j}\sum_{k=1}^n a_{j,k}\, y_k(t)\,\phi_j\Big).$$
However, we have
$$\rho\, v_j\phi_j = \frac12 \sum_{k=1}^n a_{j,k}\, v_k \phi_j + \frac12 \sum_{k=1}^n a_{k,j}\, \phi_k v_j,$$
which allows us to write
$$\frac12 \frac{d}{dt}\sum_{j=1}^n \Big|e^{-\rho t}\frac{y_j(t)}{v_j}\Big|^2 v_j\phi_j = -\frac12 \sum_{j=1}^n \sum_{k=1}^n e^{-2\rho t}\, a_{j,k}\, v_k \phi_j \Big(\Big(\frac{y_j(t)}{v_j}\Big)^2 + \Big(\frac{y_k(t)}{v_k}\Big)^2 - 2\,\frac{y_j(t)}{v_j}\frac{y_k(t)}{v_k}\Big) = -\frac12 \sum_{j=1}^n \sum_{k=1}^n a_{j,k}\, v_k \phi_j \Big(e^{-\rho t}\frac{y_k(t)}{v_k} - e^{-\rho t}\frac{y_j(t)}{v_j}\Big)^2.$$

We denote the right-hand side by $-D(t)$. This calculation shows that the non-negative function
$$H : t \longmapsto \sum_{j=1}^n \Big|e^{-\rho t}\frac{y_j(t)}{v_j}\Big|^2 v_j\phi_j$$
is decreasing on $\mathbb{R}^+$; it thus has a limit $H_\infty \ge 0$ as $t \to \infty$, and moreover $t \mapsto D(t)$ belongs to $L^1(\mathbb{R}^+)$. Let $(t^{(\nu)})_{\nu\in\mathbb{N}}$ be an increasing sequence tending to $+\infty$. We can apply the Arzelà–Ascoli theorem to the sequence of functions $z_j^{(\nu)}(t) = e^{-\rho(t+t^{(\nu)})}\,\frac{y_j(t+t^{(\nu)})}{v_j}$, because these functions are uniformly bounded (and so, by the differential equation, are their derivatives, which gives equicontinuity). Even if it means extracting a subsequence, denoted $(t^{(\nu_m)})_{m\in\mathbb{N}}$, we may assume that $z_j^{(\nu_m)}$ converges uniformly on $[0, T]$, $0 < T < \infty$, towards a limit $\ell_j^\infty(t)$. However, we have

ˆ

t+t(νm )

H(t + t(νm ) ) +

D(s) ds = H(t + t(νm ) ) + t(νm )

t

D(s + t(νm ) ) ds = H(t(νm ) ). 0

Ordinary Differential Equations

65

So, by letting m → ∞, we get ˆ

t

D(s + t(νm ) ) ds = 0.

lim

m→∞

0

Since D(t) ≥ 0, by Fatou’s lemma, it follows that D(s + t(νm ) ) −−−−→ 0 = m→∞

n 

 2   ∞ aj,k vk φj ∞ (s) −  (s)  . j k

k,j=1

We assume that the coefficients aj,k are strictly positive and the Perron–Frobenius theorem ensures that the coordinates vk and φj are also strictly positive. It follows that ∞ ∞ j (t) =  (t) does not depend on j. The conservation relation allows us to identify the function ∞ : n 

(νm )

e−ρ(t+t

)

yj (t + t(νm ) )φj =

j=1

n 

(νm ) )

e−ρ(t+t

j=1

yj (t + t(νm ) ) vj φj vj

= yInit · φ −−−−→ ∞ (t)v · φ m→∞

gives ∞ (t) =

yInit · φ = yInit · φ, v·φ

which does not depend on j or t. Since the limit is defined uniquely, we have identified the asymptotic behavior of the system when t → ∞: y(t) behaves like eρt ∞ v when t → ∞. This kind of approach can be related to several different contexts: Lyapunov functional for dynamic systems or “entropy functional” for systems with partial derivatives. For extensions and applications motivated by biology and population dynamics, the reader may refer to P. Michel’s dissertation [MIC 05]. 1.2.5.3. Proof of the Perron–Frobenius theorem Before advancing to the proof of the Perron–Frobenius theorem, we note that corollary 1.7 can be deduced from the following general result. L EMMA 1.8.– Let A ∈ Mn (C). For every polynomial P , we have   σ(P (A)) = P (λ), λ ∈ σ(A) .

PROOF.– For every $\lambda \in \mathbb{C}$, the mapping $z \in \mathbb{C} \mapsto \lambda - P(z)$ is a polynomial, which can be written as

$$\lambda - P(z) = a\prod_{j=1}^m (z - z_j),$$

where $m$ is the degree of the polynomial and the roots $z_j$ depend on $\lambda$. Suppose that $\lambda \ne P(\mu)$ for every $\mu \in \sigma(A)$. Then, for every $j \in \{1, ..., m\}$, the matrix $A - z_jI$ is invertible (otherwise, there would exist $j_0$ such that $z_{j_0} \in \sigma(A)$ and $\lambda - P(z_{j_0}) = 0$, which is a contradiction), and $\lambda I - P(A)$ is written as a product of invertible operators, so it is invertible. In other words, $\lambda \notin \sigma(P(A))$.

For the reciprocal result, we provide a general analysis that also covers operators in infinite dimensions, even though the matrix case admits a faster argument. Let $\lambda \notin \sigma(P(A))$: $\lambda I - P(A)$ is invertible.

– Let $x \in \mathrm{Ker}(A - z_jI)$:

$$(\lambda I - P(A))x = a\prod_{k\ne j}(A - z_kI)\,(A - z_jI)x = a\prod_{k\ne j}(A - z_kI)\,0 = 0.$$

Since $\lambda I - P(A)$ is invertible, it follows that $x = 0$, so for every $j$, the operator $A - z_jI$ is injective.

– Let $y$ be a given vector. There exists $x$ such that $(\lambda I - P(A))x = y = (A - z_jI)\tilde x$, with $\tilde x = a\prod_{k\ne j}(A - z_kI)x$, which proves that $A - z_jI$ is surjective.

– Finally, we can write

$$I = (\lambda I - P(A))^{-1}(\lambda I - P(A)) = (\lambda I - P(A))^{-1}\,a\prod_{k\ne j}(A - z_kI)\,(A - z_jI),$$

in such a way that $(A - z_jI)^{-1} = (\lambda I - P(A))^{-1}\,a\prod_{k\ne j}(A - z_kI)$ is bounded, as the product of bounded operators.

These three points show that if $\lambda \notin \sigma(P(A))$, then for all $j \in \{1, ..., m\}$, $z_j \notin \sigma(A)$. Suppose that $\lambda \notin \sigma(P(A))$ could be written as $\lambda = P(\mu)$, with $\mu \in \sigma(A)$. We would obtain $\lambda - P(\mu) = 0 = a\prod_{j=1}^m(\mu - z_j)$, so there would be an index $k$ such that $\mu = z_k$, contradicting the fact that the $z_j$ are not in the spectrum of $A$. We conclude that $\lambda \ne P(\mu)$ for all $\mu \in \sigma(A)$. □

We will, however, remain attentive to the fact that we cannot identify the eigenvectors in general, because we can have eigenvalues $\lambda_1 \ne \lambda_2$ of $A$ with $\lambda_1^p = \lambda_2^p \in \sigma(A^p)$. Nevertheless, if the Perron–Frobenius theorem applies to $A^p$ in


such a way that the dominant eigenvalue $\rho$ of $A^p$ is simple, then there exists an eigenvalue $\lambda$ of $A$ such that $\lambda^p = \rho$ and the associated eigenspace has dimension 1.

PROOF OF THEOREM 1.13.– We begin by noting that if all the coefficients of the matrix $A$ are strictly positive, then for every $x \ne 0$ whose components are non-negative, the components of $Ax$ are strictly positive. (The reciprocal statement is also true: if $A$ has this property, then by applying $A$ to the vectors of the canonical basis, we see that the coefficients $A_{k,j}$ of $A$ are strictly positive.) We introduce the sets

$$C = \big\{x = (x_j)_{1\le j\le n} \in \mathbb{R}^n \text{ such that } x_j \ge 0 \text{ for all } j \in \{1, ..., n\}\big\},$$

$$E = \big\{t \ge 0 \text{ such that there exists } x \in C\setminus\{0\} \text{ satisfying } Ax - tx \in C\big\}.$$

We will employ the following observations:

– The set $E$ contains at least 0, by assumption on $A$. Then, denoting by $e$ the vector for which all components are equal to 1, $Ae$ has strictly positive coordinates, and for a sufficiently small $\varepsilon > 0$ we have $Ae - \varepsilon e \in C$, so $\varepsilon \in E$. For example,

$$\varepsilon = \frac12 \min_{k\in\{1,...,n\}}\sum_{j=1}^n a_{k,j} > 0$$

works. Therefore, $E$ is not reduced to $\{0\}$.

– Let $t \in E$ and $x \in C\setminus\{0\}$ such that $Ax - tx \in C$, and let $t' \in [0, t]$. Then $Ax - t'x = Ax - tx + (t - t')x \in C$, because $t - t' \ge 0$ and the coordinates of $x$ are non-negative. This ensures that $t' \in E$; $E$ is therefore an interval.

– Let $x \in C\setminus\{0\}$. Since the $a_{k,j}$ and $x_j$ are non-negative, we get

$$0 \le Ax\cdot e = \sum_{k=1}^n\sum_{j=1}^n a_{k,j}x_j \le \Big(\max_{m\in\{1,...,n\}}\sum_{k=1}^n a_{k,m}\Big)\sum_{j=1}^n x_j = \Big(\max_{m\in\{1,...,n\}}\sum_{k=1}^n a_{k,m}\Big)\,x\cdot e,$$

where $x\cdot e > 0$. Let $t \in E$ and $x \in C\setminus\{0\}$ such that $Ax - tx \in C$. Then $(Ax - tx)\cdot e \ge 0$, that is to say, $t\,x\cdot e \le Ax\cdot e$. It follows that

$$0 \le t \le \max_{j\in\{1,...,n\}}\sum_{k=1}^n a_{k,j},$$

and $E$ is therefore bounded.

– Finally, let $(t_p)_{p\in\mathbb{N}}$ be a sequence of elements of $E$ that converges to $t \in \mathbb{R}$. For every $p \in \mathbb{N}$, there exists $x_p \in C\setminus\{0\}$ such that $Ax_p - t_px_p \in C$. Because the $x_p$ are non-zero, we can consider the sequence $(y_p)_{p\in\mathbb{N}} = (x_p/|x_p|)_{p\in\mathbb{N}}$, which is bounded in a finite-dimensional space: by the Bolzano–Weierstrass theorem, we can extract a subsequence $(y_{p_k})_{k\in\mathbb{N}}$ that converges to $y$, with $|y| = 1$. Since the coordinates of the $y_p$ are non-negative, those of $y$ are also non-negative, and $y \in C\setminus\{0\}$. Because for every $k \in \mathbb{N}$ we have $Ay_{p_k} - t_{p_k}y_{p_k} \in C$, by passing to the limit, we deduce that $Ay - ty \in C$. Therefore, $t \in E$, and $E$ is closed.

Since the set $E$ is a closed and bounded interval that is not reduced to $\{0\}$, $0 < \rho = \max E < \infty$ is well defined and belongs to $E$. In particular, there exists $x \ne 0$ with non-negative coordinates such that $Ax \ge \rho x$. If $y = Ax - \rho x \ne 0$, then $y \in C\setminus\{0\}$ and therefore $Ay$ has strictly positive coordinates. For a small enough $\varepsilon > 0$, we have $Ay - \varepsilon Ax = A(Ax) - (\rho + \varepsilon)Ax \in C$. For example,

$$\varepsilon = \frac12\,\frac{\min_{j\in\{1,...,n\}}(Ay)_j}{\max_{j\in\{1,...,n\}}(Ax)_j}$$

works. Because $Ax \in C\setminus\{0\}$, this contradicts the definition of $\rho$, and we conclude that necessarily $Ax = \rho x$. However, $x \ne 0$ has non-negative coordinates, so $Ax$ has strictly positive coordinates; since $\rho > 0$, we deduce that $x_j > 0$ for every $j \in \{1, ..., n\}$. The vector $x$ obtained is indeed an eigenvector of $A$ associated with the eigenvalue $\rho$, and its components are all strictly positive.

In order to show that $\mathrm{Ker}(A - \rho I)$ has dimension 1, we will use the following result.

LEMMA 1.9.– Let $(w_1, ..., w_n) \in \mathbb{C}^n$ be such that $|w_1 + ... + w_n| = |w_1| + ... + |w_n|$. Then there exists $\theta \in [0, 2\pi[$ such that for every $j \in \{1, ..., n\}$, we have $w_j = e^{i\theta}|w_j|$.

PROOF.– We begin by noting that the assumption on the vector $w$ implies that

$$|w_1 + ... + w_n|^2 = (w_1 + ... + w_n)\overline{(w_1 + ... + w_n)} = \sum_{k=1}^n |w_k|^2 + 2\sum_{1\le j<\ell\le n}\mathrm{Re}(w_j\bar w_\ell)$$

coincides with

$$(|w_1| + ... + |w_n|)^2 = \sum_{k=1}^n |w_k|^2 + 2\sum_{1\le j<\ell\le n}|w_j|\,|w_\ell|.$$

Since $\mathrm{Re}(w_j\bar w_\ell) \le |w_j|\,|w_\ell|$ for every pair $(j, \ell)$, each of these inequalities must be an equality, so every product $w_j\bar w_\ell$ is a non-negative real number: the non-zero $w_j$ share a common argument $\theta$, and $w_j = e^{i\theta}|w_j|$ for every $j$. □

Let $\lambda$ be an eigenvalue of $A$ with $|\lambda| = \rho$ and let $v$ be an associated eigenvector. From $|\lambda|\,|v_j| = \big|\sum_{k} a_{j,k}v_k\big| \le \sum_k a_{j,k}|v_k|$, the vector of moduli $(|v_1|, ..., |v_n|)$ satisfies $A|v| - \rho|v| \in C$, so, by the first part of the proof, $A|v| = \rho|v|$ and the $|v_j|$ are strictly positive. The inequality above is then an equality, and lemma 1.9, applied to $w_k = a_{j,k}v_k$, shows that $v = e^{i\theta}|v|$ for some $\theta$; hence $Av = \lambda v$ forces $A|v| = \lambda|v|$, so that $\lambda = \rho > 0$. Moreover, if $x$ and $|v|$ are two eigenvectors with strictly positive coordinates, then for $t = \min_j x_j/|v_j| > 0$, the vector $x - t|v|$ belongs to $\mathrm{Ker}(A - \rho I) \cap C$ and has a vanishing coordinate; if it were non-zero, $A(x - t|v|) = \rho(x - t|v|)$ would have strictly positive coordinates, a contradiction. It follows that $\lambda = \rho$, and $v \in \mathrm{Ker}(A - \rho I) = \mathrm{Span}\{x\}$, which finishes the proof of theorem 1.13. □
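Lemma 1.8 can be sanity-checked on a triangular matrix, whose spectrum can be read off the diagonal; the 3 × 3 matrix and the polynomial $P(z) = z^2 + 1$ below are illustrative choices, not taken from the text.

```python
# Sanity check of lemma 1.8: sigma(P(A)) = {P(lambda), lambda in sigma(A)}.
# For a triangular matrix the spectrum is the set of diagonal entries, and
# P(A) is again triangular, so both sides of the identity can be read off.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def poly_of_matrix(A):
    """P(A) = A^2 + I for the illustrative polynomial P(z) = z^2 + 1."""
    n = len(A)
    A2 = matmul(A, A)
    return [[A2[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]

# Lower-triangular example: sigma(A) = {2, 3, 5}.
A = [[2, 0, 0],
     [1, 3, 0],
     [0, 1, 5]]
PA = poly_of_matrix(A)
spectrum_PA = sorted(PA[i][i] for i in range(3))   # diagonal of a triangular matrix
expected = sorted(l * l + 1 for l in (2, 3, 5))    # {P(lambda), lambda in sigma(A)}
```

Both lists coincide: `spectrum_PA == expected == [5, 10, 26]`.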



PROOF OF COROLLARY 1.6.– For $n \in \mathbb{N}\setminus\{0\}$, we let

$$A^{(n)} = A + \frac1n E,$$

where $E$ is the matrix for which all coefficients are equal to 1. Theorem 1.13 applies to $A^{(n)}$ because its coefficients are strictly positive. Therefore, there exists $\rho^{(n)} = \rho(A^{(n)}) > 0$ and an eigenvector $x^{(n)}$ with strictly positive coordinates such that

$$A^{(n)}x^{(n)} = \rho^{(n)}x^{(n)}, \qquad |x^{(n)}| = 1. \tag{1.19}$$

Using the monotony property of the spectral radius stated in lemma A1.2, we have $\rho(A) \le \rho(A^{(m)}) = \rho^{(m)} \le \rho(A^{(n)}) = \rho^{(n)}$ for all $m \ge n$. The sequence $(\rho^{(n)})_{n\in\mathbb{N}\setminus\{0\}}$ is non-increasing and has a limit as $n \to \infty$, denoted $\bar\rho$. We note that $\bar\rho \ge \rho(A)$. Moreover, since the sequence $(x^{(n)})_{n\in\mathbb{N}\setminus\{0\}}$ is bounded, the Bolzano–Weierstrass theorem allows us to extract a convergent subsequence; we denote its limit $\lim_{\ell\to\infty} x^{(n_\ell)} = \bar x$. The coordinates of $\bar x$ are non-negative. Finally, by passing to the limit $\ell \to \infty$ in equation [1.19], we obtain

$$A\bar x = \bar\rho\,\bar x, \qquad |\bar x| = 1.$$

Therefore, $\bar x \ne 0$ is an eigenvector of $A$ associated with the eigenvalue $\bar\rho \ge \rho(A)$, which implies that $\bar\rho = \rho(A)$. □

In fact, in those statements, the key notion is that of irreducibility, which is related to the connectedness of the graph of the matrix $A$: the matrix $A \in M_n$ is associated with a graph composed of $n$ vertices, where an edge connects vertex $i$ to vertex $j$ when $A_{i,j} \ne 0$. We say that $A$ is irreducible if for every ordered pair $(i, j)$, there exists a path (that is to say, a sequence of such edges) connecting $i$ to $j$. Another way to state this property is to say that for every $i, j$, there exists an integer $k(i, j)$ such that $[A^{k(i,j)}]_{i,j} > 0$. The smallest integer $k(i, j)$ satisfying that criterion is called the length of the path connecting $i$ to $j$. For an irreducible matrix $A$ with non-negative coefficients, by developing

$$(I + A)^k = \sum_{m=0}^k C_k^m A^m, \qquad C_k^m = \frac{k!}{m!(k-m)!},$$

we realize that, for a large enough $k$, all the coefficients of $(I + A)^k$ are strictly positive: $I + A$ is primitive (and, clearly, the reciprocal statement is also true). Therefore, corollary 1.7 applies to $I + A$, whose spectral radius is denoted as $\rho$. In particular, $\rho - 1$ is an eigenvalue of $A$, and the associated eigensubspace is spanned by an eigenvector $x$ with strictly positive coordinates. We recall from the proof of the Perron–Frobenius theorem that $\rho$ can be characterized by

$$\rho^k = \max\big\{t \ge 0, \text{ such that there exists } x \in C\setminus\{0\} \text{ satisfying } (I + A)^k x - tx \in C\big\}. \tag{1.20}$$

We note also that $\rho > 1$. Indeed, if $x$ designates the eigenvector of $I + A$ with strictly positive coordinates and normalized by $|x|_\infty = 1$, assuming that $x_{i_0} = 1$, we obtain, for a large enough $k$,

$$\rho^k x_{i_0} = \rho^k = \big((I + A)^k x\big)_{i_0} = x_{i_0} + \sum_{m=1}^k\sum_{\ell=1}^n C_k^m\,[A^m]_{i_0,\ell}\, x_\ell > x_{i_0} = 1.$$

It follows that the spectral radius of $A$ is strictly positive. Let $\lambda \in \mathbb{C}$ be an eigenvalue of $A$ and $z$ an associated eigenvector. Since the coefficients of $A$ are non-negative, we can write

$$|\lambda z_j| = |\lambda|\,|z_j| = \Big|\sum_{\ell=1}^n A_{j,\ell}\,z_\ell\Big| \le \sum_{\ell=1}^n A_{j,\ell}\,|z_\ell|,$$

which implies

$$(1 + |\lambda|)\,|z_j| \le \sum_{\ell=1}^n (\delta_{j,\ell} + A_{j,\ell})\,|z_\ell|.$$

Therefore, the vector $y$ with (non-negative) coordinates $|z_1|, ..., |z_n|$ satisfies $(1 + |\lambda|)y \le (I + A)y$, as well as $(1 + |\lambda|)^k y \le (I + A)^k y$. It follows from the characterization [1.20] of $\rho$ that $1 + |\lambda| \le \rho$, so that $\rho - 1 > 0$ is the spectral radius of $A$. We have therefore demonstrated a version of theorem 1.13 for irreducible matrices with non-negative coefficients.

COROLLARY 1.8.– Let $A \in M_n(\mathbb{R})$ be an irreducible matrix with non-negative coefficients. Then, the spectral radius $\rho(A) > 0$ is a simple eigenvalue of $A$, and there exists an eigenvector with strictly positive coordinates for that eigenvalue.

It is important to distinguish the assumptions and conclusions of the different versions of theorem 1.13 and corollaries 1.6, 1.7 and 1.8. For a matrix with non-negative coefficients, as in the case of corollary 1.6, we have no information on the dimension of the eigensubspace associated with the spectral radius. There can exist non-collinear eigenvectors with non-negative coordinates. Furthermore, there can be several eigenvalues whose modulus is equal to the spectral radius. The identity matrix provides a simple example of this situation. For an irreducible matrix, as in corollary 1.8, the eigenspace associated with the spectral radius does indeed have dimension 1, and there is only one normalized eigenvector with strictly positive coordinates. However, there can be other eigenvalues whose modulus is equal to the spectral radius, as is shown by the example of the matrix

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

This matrix is irreducible, but not primitive (its square is the identity matrix and its eigenvalues are $+1$ and $-1$). For more details on this theory and its interpretations in terms of graphs, the reader may refer to, for example, [HOR 90].
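The graph-theoretic characterization above lends itself to a mechanical check: with non-negative coefficients, $A \in M_n$ is irreducible exactly when $(I + A)^{n-1}$ has strictly positive entries, while primitivity asks for a single power $A^k$ to be strictly positive. A minimal sketch, in Python, on the 2 × 2 permutation matrix discussed above (the bound of 10 on the powers tested is an arbitrary cut-off for this example):

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(A, k):
    n = len(A)
    R = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        R = matmul(R, A)
    return R

def all_positive(M):
    return all(entry > 0 for row in M for entry in row)

# The permutation matrix from the text: irreducible, but its powers
# alternate between A and I, so no single power is strictly positive.
A = [[0, 1],
     [1, 0]]
n = len(A)
IplusA = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]

irreducible = all_positive(matpow(IplusA, n - 1))              # (I+A)^{n-1} > 0
primitive = any(all_positive(matpow(A, k)) for k in range(1, 10))
```

Here `irreducible` is `True` while `primitive` is `False`, matching the discussion of this example.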

1.2.5.4. Computation of the leading eigenvalue

Sections 1.2.5.1 and 1.2.5.2 show that the asymptotic behavior of the Leslie system, which ultimately corresponds to the situations observed in practice, is governed by the dominant eigenvalue of the matrix $L$. The question therefore comes down to finding the largest eigenvalue of a matrix satisfying the assumptions of the Perron–Frobenius theorem: the power method, which we will now describe, provides a practical way to calculate $\rho$ and $V$. Suppose that:

a) $A \in M_N$ is a diagonalizable matrix;

b) there exists a unique eigenvalue $\lambda_1$ such that $\rho = |\lambda_1|$ is the spectral radius of $A$: if we denote the eigenvalues of $A$ as $\lambda_1, ..., \lambda_N$, we have $|\lambda_1| > |\lambda_{p+1}| \ge ... \ge |\lambda_N|$, where $p = \dim(\mathrm{Ker}(A - \lambda_1 I))$.

In this case, we say that $\lambda_1$ is the dominant eigenvalue of $A$. We (randomly) choose two vectors $x^{(0)} \ne 0$ and $\phi \ne 0$. We construct the sequence defined by

$$x^{(n+1)} = \frac{Ax^{(n)}}{\phi\cdot Ax^{(n)}},$$

where $x^{(0)}$ is taken as the first iterate, while $\phi$ defines a normalization factor. The role of this factor can be motivated as follows: we seek to construct a sequence that converges to an eigenvector. If we start specifically from $y \in \mathrm{Ker}(A - \lambda_1 I)$, then $A^k y = \lambda_1^k y$ can oscillate, "explode" or vanish according to the value of $\lambda_1$. The normalization factor prevents this kind of behavior. Thanks to assumption a), we can develop $x^{(0)}$ on a basis $\{\psi_1, ..., \psi_N\}$ composed of eigenvectors, with the first $p$ elements corresponding to the eigenvalue $\lambda_1$:

$$x^{(0)} = \sum_{k=1}^N \alpha_k\psi_k.$$

We note that

$$A^n x^{(0)} = \lambda_1^n(u + r^{(n)}), \qquad u = \sum_{k=1}^p \alpha_k\psi_k \in \mathrm{Ker}(A - \lambda_1 I), \qquad r^{(n)} = \sum_{k=p+1}^N \alpha_k\Big(\frac{\lambda_k}{\lambda_1}\Big)^n\psi_k. \tag{1.21}$$

As a result of assumption b), we obtain $\lim_{n\to\infty} r^{(n)} = 0$. We will now show that

$$x^{(n)} = \frac{A^n x^{(0)}}{\phi\cdot A^n x^{(0)}}. \tag{1.22}$$

We argue by induction. The property is clearly satisfied for $n = 1$. We suppose that it is satisfied for an integer $n$. Then

$$x^{(n+1)} = \frac{A\,\dfrac{A^n x^{(0)}}{\phi\cdot A^n x^{(0)}}}{\phi\cdot A\,\dfrac{A^n x^{(0)}}{\phi\cdot A^n x^{(0)}}} = \frac{A^{n+1}x^{(0)}}{\phi\cdot A^{n+1}x^{(0)}},$$

and the property is also satisfied for $n + 1$. We combine equations [1.21] and [1.22] to obtain

$$x^{(n)} = \frac{\lambda_1^n(u + r^{(n)})}{\phi\cdot\lambda_1^n(u + r^{(n)})} = \frac{u + r^{(n)}}{\phi\cdot(u + r^{(n)})} \xrightarrow[n\to\infty]{} \frac{u}{\phi\cdot u} \in \mathrm{Ker}(A - \lambda_1 I).$$

Finally, we conclude that

$$Ax^{(n)} = \frac{A(u + r^{(n)})}{\phi\cdot(u + r^{(n)})} = \frac{\lambda_1 u + Ar^{(n)}}{\phi\cdot(u + r^{(n)})} \xrightarrow[n\to\infty]{} \lambda_1\,\frac{u}{\phi\cdot u}.$$

The method thus allows us to find the eigenvalue $\lambda_1 = \lim_{n\to\infty} Ax^{(n)}\cdot\phi$ and the associated eigenvector $v = \frac{u}{\phi\cdot u}$, normalized by the condition $v\cdot\phi = 1$. The method works because $u \ne 0$. However, in practice, we do not know $\mathrm{Ker}(A - \lambda_1 I)$ and cannot exclude the possibility that the vector $x^{(0)}$ is chosen orthogonal to $\mathrm{Ker}(A - \lambda_1 I)$. In practice, we run the algorithm with several different choices of initial data in order to ensure that the "correct" eigenvalue is found. In fact, by choosing the initial vector at random, following a uniform law, the probability of choosing a vector orthogonal to the eigenspace is zero. Finally, we note that the rate of convergence of the method is governed by the ratio $|\lambda_2|/|\lambda_1|$: the larger the "spectral gap", the faster the convergence. The above detailed proof assumes that $A$ is diagonalizable: this is assumption a). It is an important restriction that can be relaxed, and the proof for the general case

follows a similar argument. We use the Jordan decomposition $A = PJP^{-1}$, with the Jordan matrix $J \in M_N$ composed of blocks of the form

$$J(\lambda, q) = \begin{pmatrix} \lambda & 1 & & 0\\ & \lambda & \ddots & \\ & & \ddots & 1\\ 0 & & & \lambda \end{pmatrix} \in M_q,$$

where $\lambda$ describes the set of eigenvalues of $A$. The matrix $J$ contains precisely $r$ such blocks; the number of blocks with the same eigenvalue $\lambda$ corresponds to the geometric multiplicity $\dim(\mathrm{Ker}(A - \lambda I))$, while the sum of the sizes $q$ of those blocks gives the algebraic multiplicity of $\lambda$, that is to say, its multiplicity as a root of the characteristic polynomial (if $\lambda$ is semisimple, the corresponding Jordan blocks have no nilpotent part). We therefore develop the first iterate on the basis defined by the Jordan decomposition (whose coordinates in the canonical basis can be read on the columns of the change-of-basis matrix $P$):

$$x^{(0)} = \sum_{k=1}^N \alpha_k\psi_k.$$

We continue to assume that the first $p$ vectors of this basis are associated with the dominant eigenvalue $\lambda_1$. We note that $P^{-1}\psi_k = e_k$, the column vector made of 0s, except for the $k$th row, which is 1. We thus write

$$A^n x^{(0)} = PJ^nP^{-1}x^{(0)} = \lambda_1^n(u^{(n)} + r^{(n)}), \qquad u^{(n)} = \sum_{k=1}^p \alpha_k P\Big(\frac{J}{\lambda_1}\Big)^n e_k, \qquad r^{(n)} = \sum_{k=p+1}^N \alpha_k P\Big(\frac{J}{\lambda_1}\Big)^n e_k.$$

For a large enough $n$, $J(\lambda, q)^n$ is given by

$$J(\lambda, q)^n = \begin{pmatrix}
\lambda^n & C_n^1\lambda^{n-1} & C_n^2\lambda^{n-2} & \cdots & C_n^{q-1}\lambda^{n-q+1}\\
0 & \lambda^n & C_n^1\lambda^{n-1} & \ddots & \vdots\\
\vdots & \ddots & \ddots & \ddots & C_n^2\lambda^{n-2}\\
 & & \ddots & \lambda^n & C_n^1\lambda^{n-1}\\
0 & \cdots & & 0 & \lambda^n
\end{pmatrix} \tag{1.23}$$

with $C_n^k = \frac{n!}{k!(n-k)!}$. We can verify this formula by writing $J(\lambda, q) = \lambda I + N$, with $N^q = 0$, and by using the relation $(\lambda I + N)^n = \sum_{k=0}^{q-1} C_n^k\lambda^{n-k}N^k$. Since $r^{(n)}$ only involves blocks associated with eigenvalues such that $|\lambda|/|\lambda_1| < 1$, we have $\lim_{n\to\infty} r^{(n)} = 0$ as a result of the following estimate, valid for all $k \in \{1, ..., q\}$, $n \gg q$, $\lambda \ne 0$:

$$C_n^k\,\Big|\frac{\lambda^{n-k}}{\lambda_1^n}\Big| \le \frac{n(n-1)\cdots(n-k+1)}{k!}\Big(\frac{|\lambda|}{|\lambda_1|}\Big)^n\frac{1}{|\lambda|^k} \le \frac{n^k}{k!\,|\lambda|^k}\Big(\frac{|\lambda|}{|\lambda_1|}\Big)^n \le \frac{1}{k!\,|\lambda|^k}\exp\Big(n\ln\frac{|\lambda|}{|\lambda_1|} + k\ln(n)\Big) \xrightarrow[n\to\infty]{} 0.$$

When $\dim(\mathrm{Ker}(A - \lambda_1 I)) = p$, there are $p$ Jordan blocks (of size 1 × 1!) reduced to $\lambda_1$, the $\psi_k = Pe_k$ are eigenvectors of $A$ associated with $\lambda_1$, and we can directly reproduce the arguments of the diagonalizable case. It remains to discuss the case when the eigenvalue $\lambda_1$ has non-trivial Jordan blocks, of sizes $1 \le q_\ell \le p$ with $\sum_{\ell=1}^L q_\ell = p$. We have (with the convention $q_0 = 0$)

$$u^{(n)} = \sum_{\ell=1}^L\ \sum_{k=q_{\ell-1}+1}^{q_\ell} \alpha_k P\Big(\frac{J(\lambda_1, q_\ell)}{\lambda_1}\Big)^n e_k.$$

We thus need to understand the asymptotic behavior of the matrices $J(\lambda_1, q)^n$, which are of size $q \times q$. To this end, we work with the canonical basis $(\kappa_1, ..., \kappa_q)$ of $\mathbb{R}^q$ and note that

$$\Big(\frac{J(\lambda_1, q)}{\lambda_1}\Big)^n\kappa_k = \sum_{m=1}^k C_n^{k-m}\lambda_1^{m-k}\kappa_m$$

(this expression can be read on the $k$th column of the matrix in equation [1.23]). The leading term in these sums corresponds to $k = q$, $m = 1$ (the element located at the top right of the matrix)¹⁰. The behavior of $u^{(n)}$ is therefore directly related to the size of the largest Jordan block. Assuming that the largest block corresponds to the elements $\{\psi_1, ..., \psi_{q_1}\}$ of the Jordan basis, when $n \to \infty$, $u^{(n)}$ behaves like $C_n^{q_1-1}\lambda_1^{1-q_1}\alpha_{q_1}Pe_1$, where we recall that $APe_1 = PJe_1 = \lambda_1 Pe_1$: $\psi_1 = Pe_1$ is an eigenvector associated with $\lambda_1$. Finally, we arrive at

$$x^{(n)} \xrightarrow[n\to\infty]{} \frac{Pe_1}{\phi\cdot Pe_1} \qquad\text{and}\qquad Ax^{(n)} \xrightarrow[n\to\infty]{} \frac{APe_1}{\phi\cdot Pe_1} = \lambda_1\,\frac{Pe_1}{\phi\cdot Pe_1}.$$
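The iteration analyzed above can be sketched in a few lines of Python; the symmetric 2 × 2 matrix below is an illustrative choice (dominant eigenvalue 3, eigenvector collinear to (1, 1)), not an example from the text.

```python
def power_method(A, x0, phi, iterations=200):
    """Power method with the normalization x <- Ax / (phi . Ax) used in the text."""
    n = len(A)
    x = x0[:]
    lam = 0.0
    for _ in range(iterations):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(phi[i] * Ax[i] for i in range(n))  # converges to lambda_1
        x = [Ax[i] / lam for i in range(n)]
    return lam, x

# Illustrative symmetric matrix: eigenvalues 3 and 1, dominant eigenvector (1, 1).
A = [[2.0, 1.0],
     [1.0, 2.0]]
lam, v = power_method(A, x0=[1.0, 0.0], phi=[1.0, 0.0])
```

At convergence, `lam` approaches the dominant eigenvalue 3 and `v` the eigenvector (1, 1), normalized so that $v\cdot\phi = 1$, as in the discussion above.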

We will find several commentaries, variants and extensions of this method in Chapter 10 of [LAS 04] and section 10.4 of [SER 01]. In particular, the case where several eigenvalues have a modulus equal to the spectral radius – a case excluded by the Perron–Frobenius theorem – leads to difficulties. We conclude this analysis with the following statement.

THEOREM 1.14 (Convergence of the Power Method).– We assume there exists a unique $\lambda \in \sigma(A)$ such that $|\lambda| = \rho(A)$. We decompose $\mathbb{R}^N = E \oplus F$, where $E$ and $F$ are subspaces stable under $A$, such that $\sigma(A|_E) = \{\lambda\}$ and $\lambda \notin \sigma(A|_F)$. If $x^{(0)}$ is not orthogonal to $E$, then the sequence defined by the power method is such that $\lim_{n\to\infty} Ax^{(n)}\cdot\phi = \lambda$ and $\lim_{n\to\infty} x^{(n)} = v$, with $Av = \lambda v$ and $v\cdot\phi = 1$.

1.2.5.5. Numerical simulations

We will perform several simulations to show the phenomena described above. We consider a system of size 10 × 10 with initial values XInit = (22, 18, 15, 15, 10, 11, 9, 8, 6, 3); $f_1 = 0.05$, $f_2 = 0.1$, $f_3 = 0.8$, $f_4 = 0.9$, $f_5 = 0.6$, $f_6 = 0.4$, $f_7 = 0.2$, $f_8 = 0.08$, $f_9 = 0.06$ and $f_{10} = 0.002$ as the fertility rates; and $t_{2,1} = 0.85$, $t_{3,2} = 0.94$, $t_{4,3} = 0.9$, $t_{5,4} = 0.83$, $t_{6,5} = 0.76$, $t_{7,6} = 0.7$, $t_{8,7} = 0.6$, $t_{9,8} = 0.1$ and $t_{10,9} = 0.02$ as the transition rates. The power method gives $\rho_L = 1.1896445$, with associated eigenvectors for $L$ and $L^T$, respectively,

$$V = \begin{pmatrix} 1.\\ 0.7144990\\ 0.5645625\\ 0.4271077\\ 0.2979879\\ 0.1903681\\ 0.1120146\\ 0.0564951\\ 0.0047489\\ 0.0000798 \end{pmatrix}, \qquad
\phi = \begin{pmatrix} 0.2596440\\ 0.3633933\\ 0.4322813\\ 0.3406064\\ 0.2066518\\ 0.1184944\\ 0.0530123\\ 0.0185617\\ 0.0131025\\ 0.0004365 \end{pmatrix}.$$

¹⁰ Indeed, for $k' > k$ and $n$ "large" compared to $k$ and $k'$, we have

$$\frac{C_n^{k'}}{C_n^k} = \frac{n(n-1)\cdots(n-k)\cdots(n-k'+1)}{n(n-1)\cdots(n-k+1)}\cdot\frac{k(k-1)\cdots 1}{k'(k'-1)\cdots k\cdots 1} = \frac{(n-k)\cdots(n-k'+1)}{k'(k'-1)\cdots(k+1)} > 1.$$

Simulating this linear problem [1.17] is not particularly difficult. Figure 1.2 shows the graphs of the approximations of $\ln(x_j)$ as a function of time, compared with the asymptotic profiles (simulations carried out with the explicit Euler scheme), up to the final time $T = 15$. In qualitative terms, we confirm the behavior predicted by the theory. However, a direct evaluation of the difference with the asymptotic profile can leave something to be desired: see Figure 1.3 for a simulation performed with the Crank–Nicolson scheme and time step $\Delta t = 1/1000$. In particular, the quantity $e^{-\rho t}X(t)\cdot\phi$ is not preserved by the schemes employed, and the errors produce significant discrepancies on large time scales.

1.2.6. Modeling red blood cell agglomeration

In blood, red blood cells (or erythrocytes) are subject to attractive interaction forces that drive them to form aggregates. We here provide a simplified model of this kind of phenomenon. Red blood cells move along the segment [0, 1] (which represents a blood vessel!). We consider $N + 2$ blood cells, whose center points are located at $x_i(t)$, $i \in \{0, ..., N+1\}$, as functions of time. The outermost cells remain fixed at the ends of the interval under study: $x_0(t) = 0$, $x_{N+1}(t) = 1$. In this description, we represent the blood cells as point particles, although we take account of their size to describe the interaction forces. The particles are subject to forces exerted by their neighboring particles. At large distances, those forces are attractive and tend to 0 as the distance between the particles tends to $+\infty$. When the distance between particles is less than a certain threshold $a > 0$, the interaction force becomes repulsive. These effects are described by a function $\varphi : ]0, \infty[ \to \mathbb{R}$ such that

$$\varphi(d) > 0 \text{ if } d > a, \qquad \varphi(d) < 0 \text{ if } d < a, \qquad \lim_{d\to\infty}\varphi(d) = 0, \qquad \lim_{d\to 0}\varphi(d) = -\infty.$$

Figure 1.2. Logarithm of the numerical solution (explicit Euler) versus logarithm of the asymptotic profile as a function of time (components 1 to 5 on the left, 6 to 10 on the right)

For example, we can take

$$\varphi(d) = \frac{\gamma}{d}\ln\frac{d}{a}, \tag{1.24}$$

with a given $\gamma > 0$. Therefore, the particle located at $x_{i+1}$ exerts the force $\varphi(x_{i+1} - x_i)$ on particle $i$. Finally, the particles undergo a friction force, exerted by the carrying fluid; this force is opposite to the movement and proportional to the velocity of the particle considered. Applying the fundamental principle of dynamics, the movement of the particles is described by the following system of differential equations:

$$\frac{d^2}{dt^2}x_i = \varphi(x_{i+1} - x_i) - \varphi(x_i - x_{i-1}) - \lambda\frac{d}{dt}x_i \tag{1.25}$$

for $i \in \{1, ..., N\}$, with $\lambda > 0$.
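The sign conditions required of the interaction force can be checked directly on the model [1.24]; the values of $\gamma$ and $a$ below are illustrative (the text fixes its own values later, for the simulations).

```python
import math

gamma, a = 0.5, 0.004   # illustrative parameter values for this check

def phi(d):
    """Model [1.24]: phi(d) = (gamma / d) ln(d / a)."""
    return (gamma / d) * math.log(d / a)

# The four qualitative properties demanded of phi:
attractive = phi(10.0 * a) > 0          # beyond the threshold a: attraction
repulsive = phi(0.5 * a) < 0            # below the threshold a: repulsion
vanishing = abs(phi(1e8 * a)) < 1e-3    # phi(d) -> 0 as d -> infinity
diverging = phi(1e-9 * a) < -1e6        # phi(d) -> -infinity as d -> 0
```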

Figure 1.3. Logarithm of the difference between the numerical approximation (using Crank–Nicolson) and the profile as a function of time

We write $\xi_i = \frac{d}{dt}x_i$ and $t \mapsto U(t) = \big(x_1(t), ..., x_N(t), \xi_1(t), ..., \xi_N(t)\big)$. We rewrite the problem in the form of an (autonomous) first-order differential equation satisfied by $U$:

$$\frac{d}{dt}U(t) = F(U(t)). \tag{1.26}$$

To this end, we introduce the open set $\Omega = \{(y, \theta) \in \mathbb{R}^{2N},\ 0 < y_1 < y_2 < ... < y_N < 1\}$ and the function $F : Y = (y_1, ..., y_N, \theta_1, ..., \theta_N) \in \Omega \mapsto F(Y) \in \mathbb{R}^{2N}$ defined by

$$F_i(Y) = \theta_i, \qquad F_{N+i}(Y) = \varphi(y_{i+1} - y_i) - \varphi(y_i - y_{i-1}) - \lambda\theta_i$$

for $i \in \{1, ..., N\}$, using the convention $y_0 = 0$, $y_{N+1} = 1$. $F$ is a function of class $C^1$ on the open set $\Omega$. For each initial value that sets the positions and velocities of the particles in $\Omega$, the Picard–Lindelöf theorem guarantees the existence and uniqueness of a class $C^1$ solution of [1.25], defined on an interval $[0, T[$, $0 < T \le \infty$.
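The function $F$ can be written down directly, and one checks that uniformly spaced particles at rest form an equilibrium of [1.25], since the two interaction forces on each particle cancel exactly; $N$ and the friction coefficient below are illustrative values for this sketch.

```python
import math

gamma, a = 0.5, 0.004
N = 4
lam = 10.0   # friction coefficient (illustrative value)

def phi(d):
    return (gamma / d) * math.log(d / a)

def F(Y):
    """Right-hand side of [1.26] for Y = (y_1..y_N, theta_1..theta_N)."""
    y = [0.0] + list(Y[:N]) + [1.0]   # convention y_0 = 0, y_{N+1} = 1
    theta = Y[N:]
    out = list(theta)
    for i in range(1, N + 1):
        out.append(phi(y[i + 1] - y[i]) - phi(y[i] - y[i - 1]) - lam * theta[i - 1])
    return out

# Uniformly spaced particles at rest: F vanishes (an equilibrium of [1.25]).
Y_eq = [i / (N + 1) for i in range(1, N + 1)] + [0.0] * N
residual = max(abs(c) for c in F(Y_eq))

# Moving one particle off the uniform configuration produces a non-zero force.
perturbed = Y_eq[:]
perturbed[0] += 0.01
residual_pert = max(abs(c) for c in F(perturbed))
```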


We will see that the system [1.25] has remarkable "energetic" properties. Indeed, let $\Psi$ be an antiderivative of $\varphi$. We can write

$$E(t) = \frac12\sum_{i=1}^N\Big|\frac{d}{dt}x_i\Big|^2 + \sum_{i=0}^N \Psi(x_{i+1} - x_i).$$

This quantity sums the kinetic energy of the particles and the potential energy due to their interaction forces. We can compute

$$\frac{d}{dt}E = \sum_{i=1}^N \frac{d}{dt}x_i\,\frac{d^2}{dt^2}x_i + \sum_{i=0}^N \varphi(x_{i+1} - x_i)\Big(\frac{d}{dt}x_{i+1} - \frac{d}{dt}x_i\Big)
= \sum_{i=1}^N \frac{d}{dt}x_i\,\big(\varphi(x_{i+1} - x_i) - \varphi(x_i - x_{i-1})\big) - \lambda\sum_{i=1}^N\Big|\frac{d}{dt}x_i\Big|^2 + \sum_{i=0}^N \varphi(x_{i+1} - x_i)\Big(\frac{d}{dt}x_{i+1} - \frac{d}{dt}x_i\Big).$$

However, since $\frac{d}{dt}x_0 = \frac{d}{dt}x_{N+1} = 0$, we have

$$\sum_{i=1}^N \frac{d}{dt}x_i\times\varphi(x_i - x_{i-1}) = \sum_{i=0}^N \frac{d}{dt}x_{i+1}\times\varphi(x_{i+1} - x_i).$$

We conclude that

$$\frac{d}{dt}E = -\lambda\sum_{i=1}^N\Big|\frac{d}{dt}x_i\Big|^2 \le 0$$

(the right-hand term can be interpreted as the energy dissipated by friction) and therefore $E(t) \le E(0)$. Owing to the properties of the function $\varphi$, this estimate allows us to determine that the solutions are globally defined. We note that

$$\frac1z\ln(z) = \frac{d}{dz}\Big(\frac12\ln^2(z)\Big),$$

which allows us to find $\Psi(d) = \frac{\gamma}{2}\ln^2\frac{d}{a}$ for the model [1.24]. It follows in that case that for all $i \in \{1, ..., N\}$ and $t \ge 0$, we have

$$\ln^2\frac{x_{i+1}(t) - x_i(t)}{a} \le \frac{2E(0)}{\gamma}.$$

Likewise, $t \mapsto \big|\frac{d}{dt}x_i(t)\big|^2$ is bounded. The solutions of [1.25] remain in $\Omega$ for all times and are therefore defined globally.
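The antiderivative just computed can be checked by finite differences; $\gamma$ and $a$ below are the values used in the simulations later in this section, and the evaluation point $d$ is arbitrary.

```python
import math

gamma, a = 0.5, 0.004

def phi(d):
    """Interaction force [1.24]: attractive beyond a, repulsive below."""
    return (gamma / d) * math.log(d / a)

def psi(d):
    """Antiderivative of phi: Psi(d) = (gamma / 2) ln(d / a)^2."""
    return 0.5 * gamma * math.log(d / a) ** 2

# A central finite difference of Psi should reproduce phi.
d, eps = 0.02, 1e-7
approx = (psi(d + eps) - psi(d - eps)) / (2 * eps)
```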

We will perform numerical simulations of this problem, which will allow us to describe the aggregate formation phenomenon. We fix a time step $h > 0$ and a final time $T$ (in such a way that the final time is reached in $K = T/h$ iterations): $t_0 = 0 < t_1 = h < t_2 = 2h < ... < t_K = T$. We let $y^k = (x_1^k, ..., x_N^k)$ denote the numerical unknown at time $t_k$. We denote by $\Phi(y)$ the vector of $\mathbb{R}^N$ whose $i$th component is $\varphi(x_{i+1} - x_i) - \varphi(x_i - x_{i-1})$, and we continue to use the convention $x_0 = 0$, $x_{N+1} = 1$. Given $y^0$, $y^1$, we use the following scheme:

$$y^{k+1} = 2y^k - y^{k-1} + h^2\Phi(y^k) - \lambda h(y^k - y^{k-1}). \tag{1.27}$$

Note that the second derivative in [1.25] is consistently approximated at order 2, and the friction term at order 1. We may ask how to interpret the scheme when we write the problem in the form of a first-order system as in [1.26]. The scheme

$$y^{k+1} = y^k + h\xi^{k+1}, \qquad \xi^{k+1} = \xi^k + h\Phi(y^k) - \lambda h\xi^k$$

for [1.26] indeed gives [1.27] for the system [1.25]. Although some terms are processed implicitly, the scheme does not require solving a system at each step (this would still be the case if we wrote $\lambda h\xi^{k+1}$ when updating $\xi$). We will return to the motivation for this algorithm, which differs from the classic Euler method, in section 1.3, which deals with the Hamiltonian structure of the interaction terms. A typical simulation is shown in Figure 1.4. Here, we have taken $N = 49$ particles, initially dispersed randomly along $]0, 1[$, with velocities equal to zero:

$$y_i = \frac{i}{N+1} + 0.1\,\rho_i,$$

with $\rho_i$ representing a random variable uniformly distributed on $[-1, +1]$. The parameters are fixed as follows:

$$h = 0.002, \qquad K = 100, \qquad \gamma = 0.5, \qquad \lambda = 10, \qquad a = 0.004.$$

We observe the formation of aggregates gathering several particles (note that the trajectories do not cross one another). Figure 1.5 shows the evolution of the discrete energy. It is not monotonically decreasing, because the numerical scheme does not preserve that property of the continuous model; however, we do observe a tendency for the energy to decrease in time. We can also note that the standard Euler method does not provide an adequate result in these conditions (the numerical solution does not remain in the interval [0, 1] and takes outlying values, even with much smaller time steps).
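The equivalence between [1.27] and its first-order form can be verified by running both updates side by side. In the sketch below, the random initial positions are replaced by a small deterministic perturbation and the run is shortened to K = 25 steps; both choices are assumptions of this sketch, made so that the example is reproducible and stays away from near-contact dynamics.

```python
import math

h, gamma, lam, a = 0.002, 0.5, 10.0, 0.004
N, K = 49, 25

def phi(d):
    return (gamma / d) * math.log(d / a)

def big_phi(y):
    """Component i: phi(x_{i+1} - x_i) - phi(x_i - x_{i-1}), with x_0 = 0, x_{N+1} = 1."""
    x = [0.0] + y + [1.0]
    return [phi(x[i + 1] - x[i]) - phi(x[i] - x[i - 1]) for i in range(1, N + 1)]

# Small deterministic perturbation of the uniform spacing, zero initial velocities.
y0 = [(i + 1) / (N + 1) + 0.0005 * math.sin(7.0 * (i + 1)) for i in range(N)]

# Second-order form [1.27], started with y^1 = y^0 + h^2 Phi(y^0) (i.e. xi^0 = 0).
F0 = big_phi(y0)
y_prev = y0[:]
y = [y0[i] + h * h * F0[i] for i in range(N)]
for _ in range(K - 1):
    F = big_phi(y)
    y_next = [2 * y[i] - y_prev[i] + h * h * F[i] - lam * h * (y[i] - y_prev[i])
              for i in range(N)]
    y_prev, y = y, y_next

# First-order form: xi^{k+1} = xi^k + h Phi(y^k) - lam h xi^k, y^{k+1} = y^k + h xi^{k+1}.
z = y0[:]
xi = [0.0] * N
for _ in range(K):
    F = big_phi(z)
    xi = [xi[i] + h * F[i] - lam * h * xi[i] for i in range(N)]
    z = [z[i] + h * xi[i] for i in range(N)]

gap = max(abs(y[i] - z[i]) for i in range(N))
```

The two trajectories agree to rounding error, and the particles remain ordered inside ]0, 1[ over this short run.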

Figure 1.4. Red blood cell agglomeration: example of trajectories for 49 particles. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

We now consider equilibrium solutions of the system [1.25]. For those solutions, the velocity is equal to zero, $\xi_i = 0$, and the forces are at equilibrium: $\varphi(x_{i+1} - x_i) = \varphi(x_i - x_{i-1})$ for all $i \in \{1, ..., N\}$. The discussion is based on an analysis of the graph of the function $\varphi$ (see Figure 1.6). We write $d_i = x_{i+1} - x_i$, which satisfies

$$\sum_{i=0}^N d_i = 1. \tag{1.28}$$

At equilibrium, we have

$$\varphi(d_i) = \bar\varphi, \text{ constant.} \tag{1.29}$$

We distinguish between two different cases:

– If $a \ge \frac{1}{N+1}$ (dense population), then in order to satisfy [1.28], at least one of the distances $d_i$ is less than or equal to $a$. Since $d \mapsto \varphi(d)$ is strictly increasing and negative-valued on $]0, a]$, by [1.29] all distances are equal: $d_i = \bar d \le a$ and, in fact, [1.28] guarantees that $\bar d = \frac{1}{N+1}$.

– If $a < \frac{1}{N+1}$ (dispersed population), then in order to satisfy [1.28], at least one of the distances $d_i$ is greater than $a$. Since $d \mapsto \varphi(d)$ has negative values on $]0, a]$, [1.29] implies that all the $d_i$ are greater than $a$. We note that $d_i = \frac{1}{N+1}$ for all $i \in \{0, ..., N\}$ is still a solution; however, it is not the only one in the situation considered here, because $\varphi$ is not injective on $]a, +\infty[$. We can find $d' > d'' > a$ such that $\varphi(d') = \varphi(d'')$, and any configuration with $N'$ distances equal to $d'$ and $N''$ distances equal to $d''$, with $N' + N'' = N + 1$ and $N'd' + N''d'' = 1$, is a solution.

Figure 1.5. Red blood cell agglomeration: evolution of energy

It is interesting to focus on the case of a single free particle ($N = 1$, the other two particles being attached to the boundaries $x = 0$ and $x = 1$). In this case, the dynamics is described by the simple differential system (in $\mathbb{R}^2$)

$$\frac{d}{dt}\begin{pmatrix} x\\ \xi \end{pmatrix} = \begin{pmatrix} \xi\\ \varphi(1 - x) - \varphi(x) - \lambda\xi \end{pmatrix}.$$

When $a > 1/2$, $\bar x = 1/2$ is the single equilibrium position. There are three equilibrium configurations when $a < 1/2$: $\bar x = 1/2$, $\bar x$ near 0 or $\bar x$ near 1; the last solutions correspond to the solutions of the nonlinear equation $\varphi(1 - x) - \varphi(x) = 0$. If $(\bar x, 0)$ denotes an equilibrium solution, the system linearized in the neighborhood of this configuration is written as

$$\frac{d}{dt}\begin{pmatrix} x\\ \xi \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1\\ \Phi'(\bar x) & -\lambda \end{pmatrix}}_{:=A}\begin{pmatrix} x\\ \xi \end{pmatrix},$$


with $\Phi(x) = \varphi(1 - x) - \varphi(x)$. In particular, with [1.24], we obtain

$$\Phi'(\bar x) = -\varphi'(1 - \bar x) - \varphi'(\bar x) = \frac{\gamma}{(1 - \bar x)^2}\Big(\ln\frac{1 - \bar x}{a} - 1\Big) + \frac{\gamma}{\bar x^2}\Big(\ln\frac{\bar x}{a} - 1\Big).$$

Figure 1.6. Graph of the function ϕ and the associated potential

When a > 1/2, x̄ = 1/2 is the only equilibrium solution, and this equilibrium is stable because Tr(A) = −λ is strictly negative and det(A) = −Φ′(1/2) = 2ϕ′(1/2) = −8γ(ln(1/(2a)) − 1) > 0. Therefore, either the eigenvalues of A are complex conjugates with strictly negative real parts, or they are both real and negative. This shows that the equilibrium (1/2, 0) is linearly stable when a > 1/2. When a < 1/2, we proceed by numerical experimentation. We let the system evolve over rather long periods of time. For small values of a, we observe that x(t) converges either towards a position near 0 or towards a position near 1, even if the initial value is chosen within a neighborhood of 1/2. We therefore numerically confirm the following points:
– the system linearized at (1/2, 0) is linearly unstable (complex conjugate eigenvalues with positive real parts, or at least one positive real eigenvalue);
– for the system linearized at (x̄, 0), the off-centered equilibrium position is linearly stable;

86

Mathematics for Modeling and Scientific Computing

– Newton’s algorithm for solving Φ(x) = 0, starting from a first iterate “near” x̄, converges to x̄.
For Figure 1.7, we have taken λ = 10, γ = 3 and a = 0.04. The initial position is x0 = 0.51. The eigenvalues of the system linearized at (1/2, 0) are −12.8497 and 2.8497, which confirms the linearized instability. The position at the final time is x(T) = 0.95327, which coincides with the equilibrium position provided by Newton’s method. The eigenvalues of the system linearized at that point are −5 ± 33.5899 i, which confirms linearized stability. In practice, we can observe a diversity of situations when we modify the value of the parameter a and the initial values.


Figure 1.7. Single particle model: convergence towards a state of equilibrium
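The computations above can be reproduced numerically. The sketch below is only indicative: it assumes that the force law of [1.24] (stated earlier in the chapter, not restated here) takes the form ϕ(d) = (γ/d) ln(d/a), a choice that reproduces the eigenvalues and the equilibrium position quoted in the text.

```python
import math, cmath

lam, gam, a = 10.0, 3.0, 0.04           # damping, force amplitude, threshold (values from the text)

def phi(d):                             # assumed force law, consistent with [1.24]
    return gam * math.log(d / a) / d

def dphi(d):                            # phi'(d)
    return gam / d**2 * (1.0 - math.log(d / a))

def Phi(x):                             # force balance for the single free particle
    return phi(1.0 - x) - phi(x)

def dPhi(x):                            # Phi'(x) = -phi'(1 - x) - phi'(x)
    return -dphi(1.0 - x) - dphi(x)

def eigenvalues(xbar):                  # eigenvalues of A = [[0, 1], [Phi'(xbar), -lam]]
    disc = cmath.sqrt(lam**2 + 4.0 * dPhi(xbar))
    return (-lam - disc) / 2.0, (-lam + disc) / 2.0

z1, z2 = eigenvalues(0.5)               # instability of the centered equilibrium
print(z1.real, z2.real)                 # approximately -12.8497 and 2.8497

x = 0.95                                # Newton's algorithm, started "near" the equilibrium
for _ in range(50):
    x -= Phi(x) / dPhi(x)
print(x)                                # approximately 0.95327
print(eigenvalues(x))                   # approximately -5 -/+ 33.59i
```

The agreement with the four values reported above (−12.8497, 2.8497, 0.95327 and −5 ± 33.59 i) supports the assumed form of ϕ, but it remains a reconstruction, not the book's stated formula.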

Finally, we perform simulations for a dense situation, with a large number of particles: here, with a = 0.004, we choose N = 500. We define the initial value as a small perturbation of the equilibrium solution, for example

xᵢ(0) = i/(N + 1) · (1 + (1/1000) exp(−(i/(N + 1) − 1/2)²/0.003)).

We consider γ = 1/2 and λ = 1 (low damping). Figure 1.8 shows the initial perturbation and its evolution in time. We observe the effect of damping, which levels out the maxima, as well as a wave propagation phenomenon: the initial bump separates into two parts of equal size that propagate in opposite directions, one towards the left and the other towards the right, producing a behavior close to that of a solution to the (damped) wave equation (see section 3.3).



Figure 1.8. Blood cell agglomeration: evolution of a disturbance in the equilibrium state

1.2.7. SEI model

We are now interested in an epidemiology problem where we seek to describe transmission phenomena for a communicable disease. The population is separated into three categories:
– those at risk, the individuals who can be contaminated;
– those exposed, the individuals who have already been infected but are not yet contagious;
– those infected, the individuals who are infected and can transmit the disease to those at risk.
The concentrations of individuals in these categories at time t ≥ 0 are denoted by S(t), E(t) and I(t), respectively. The infection dynamics is governed by the following rules:
– There is a constant influx of newborns, who enter the category of those at risk. This contribution is described by a constant λ > 0, which therefore appears as a production term in the at-risk population.
– Each category is subject to a natural mortality rate, which depends on the category. We denote the mortality rates by γ, μ, ν for the categories S, E, I, respectively.


– Individuals exposed to the disease pass on to the infectious category at a certain rate α.
– Infectious individuals heal, returning to the at-risk category, at a rate β.
– Encounters between at-risk and infectious individuals make those at risk enter the category of those exposed. We let τ denote the rate of those encounters.
Taking these principles into account, the population's evolution is therefore described by the differential system

dS/dt = λ − γS − τIS + βI,
dE/dt = τIS − (α + μ)E,
dI/dt = αE − (β + ν)I.

[1.30]

It is an autonomous system; the function

f : (S, E, I) ∈ R³ → (λ − γS − τIS + βI, τIS − (α + μ)E, αE − (β + ν)I)

is C¹ and the Picard–Lindelöf theorem ensures the existence and uniqueness of a solution associated with any initial value (SInit, EInit, IInit) ∈ R³. In light of the physical interpretation of the unknown functions, those values are non-negative and the solutions must remain non-negative. We will also show that the solutions are well defined for all times t ≥ 0. It is interesting to note that if the population is initially healthy, IInit = EInit = 0, SInit > 0, then it remains healthy for all times and the corresponding solution of [1.30] is simply

E(t) = I(t) = 0,   S(t) = (1 − e^(−γt)) λ/γ + e^(−γt) SInit.

This solution is well defined for all times t ≥ 0 and converges to the constant state (S∞, E∞, I∞) = (λ/γ, 0, 0) when t → ∞. More generally, consider an initial value SInit > 0, IInit ≥ 0, EInit ≥ 0. Writing f : x = (x1, x2, x3) → (f1(x), f2(x), f3(x)), we observe that if xj ≥ 0 for all j and xk = 0, then fk(x) ≥ 0. This remark suggests that the solution of [1.30] remains non-negative: S(t) ≥ 0, E(t) ≥ 0, I(t) ≥ 0. In order to establish this property, it is necessary to consider several cases. Let us first suppose that there exists a t0 > 0 such that S(t0) = 0, with S(t) > 0 on [0, t0[

Ordinary Differential Equations

89

and E(t), I(t) ≥ 0 on [0, t0]. Then, at t0, we have dS/dt(t0) ≥ λ > 0. By continuity, t → S(t) is strictly increasing for t close enough to t0, which contradicts the fact that S(t) reaches 0 as t tends towards t0. We conclude that S remains strictly positive. Next, suppose that there exists a t0 > 0 such that E(t0) = 0, with S(t) > 0 on [0, t0] and E(t), I(t) ≥ 0 on [0, t0]. If I(t0) > 0, then dE/dt(t0) = τI(t0)S(t0) > 0, and we can reproduce the previous reasoning to arrive at a contradiction. If I(t0) = 0, we explicitly know the solution, which satisfies S(t) > 0, with E(t) = I(t) = 0 for all times t ≥ t0. The same discussion applies if we suppose there exists a t0 > 0 such that I(t0) = 0, with S(t) > 0 on [0, t0] and E(t), I(t) ≥ 0 on [0, t0]. Finally, we have

d/dt (S + E + I) = λ − (γS + μE + νI) ≤ λ,

where γS + μE + νI gathers the mortality terms.

This implies that 0 ≤ S(t) + E(t) + I(t) ≤ SInit + EInit + IInit + λt, which excludes the possibility of the solution exploding in finite time. Therefore, the solutions of [1.30] are defined for all times t ≥ 0. It is clear that (S∞, E∞, I∞) = (λ/γ, 0, 0) is a stationary solution of equation [1.30], which corresponds to a population in a completely healthy state. More generally, we seek to find (S̄, Ē, Ī), a stationary solution of [1.30]. In other words, we have

αĒ = (β + ν)Ī,   τĪS̄ = (α + μ)Ē = ((β + ν)(α + μ)/α) Ī,
λ = γS̄ + τĪS̄ − βĪ = γS̄ + ((β + ν)(α + μ)/α − β) Ī.

It follows that Ī is a root of the simple quadratic equation

((β + ν)(α + μ) − αβ) Ī² + (γ/τ)(β + ν)(α + μ) Ī = λα Ī,

whose non-zero root is

Ī = (λα − (γ/τ)(β + ν)(α + μ)) / (ν(α + μ) + βμ).

This expression provides an acceptable solution (Ī > 0) when

ρ = αλτ / (γ(α + μ)(β + ν)) > 1.

The stationary solution (S̄, Ē, Ī) thus obtained corresponds to an epidemiological situation.
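These formulas are easily checked numerically. The sketch below uses the parameter values of the epidemic experiment reported later in this section (λ = 13, γ = 1, τ = 67, β = 1, μ = 5, ν = 1, α = 1) and verifies that ρ ≈ 72.5833 and that the computed state is indeed stationary.

```python
lam, gam, tau, beta, mu, nu, alpha = 13.0, 1.0, 67.0, 1.0, 5.0, 1.0, 1.0

def f(S, E, I):
    # right-hand side of system [1.30]
    return (lam - gam * S - tau * I * S + beta * I,
            tau * I * S - (alpha + mu) * E,
            alpha * E - (beta + nu) * I)

rho = alpha * lam * tau / (gam * (alpha + mu) * (beta + nu))
print(rho)                       # 72.5833...: epidemic regime since rho > 1

# endemic stationary state
I_bar = (lam * alpha - (gam / tau) * (beta + nu) * (alpha + mu)) / (nu * (alpha + mu) + beta * mu)
S_bar = (beta + nu) * (alpha + mu) / (alpha * tau)
E_bar = (beta + nu) * I_bar / alpha
print(S_bar, E_bar, I_bar)
print(f(S_bar, E_bar, I_bar))    # (0, 0, 0) up to rounding
```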


We compute

∇_{S,E,I} f(S, E, I) = ( −γ − τI   0   β − τS ; τI   −(α + μ)   τS ; 0   α   −(β + ν) ).

We write M = ∇_{S,E,I} f(S∞, E∞, I∞) = ∇_{S,E,I} f(λ/γ, 0, 0). The eigenvalues of M are z0 = −γ < 0 and z± = ½(−(α + μ + β + ν) ± √Δ), with Δ = ((α + μ) − (β + ν))² + 4ατλ/γ > 0. The three eigenvalues are strictly negative when ρ < 1. In this case, the solutions of the linear system dX/dt = MX satisfy X(t) = e^(Mt) XInit and converge to 0 exponentially fast. This means that, for the linearized problem, the healthy state (λ/γ, 0, 0) is stable when ρ < 1, which describes the situation in which there is no epidemic. The previous result only refers to the system linearized around the equilibrium state (λ/γ, 0, 0). We will now demonstrate the stability of the nonlinear problem [1.30], assuming ρ < 1. Let A > 0 be a parameter whose value will be fixed later. We calculate

d/dt ( |S − λ/γ|²/2 + A(E + ((α + μ)/α) I) ) = −γ|S − λ/γ|² − P_A(S) I,

P_A(S) = τS² − (τλ/γ + β + Aτ) S + (βλ/γ + A (α + μ)(β + ν)/α).

The function S → P_A(S) is a second-order polynomial that satisfies P_A(0) > 0 and lim_{S→∞} P_A(S) = +∞. Therefore, P_A(S) remains strictly positive for all S > 0 provided the polynomial has no real roots. This condition requires that

q(A) = (τλ/γ + β + Aτ)² − 4τ(βλ/γ + A (α + μ)(β + ν)/α) < 0.

However, A → q(A) is a second-order polynomial that satisfies q(0) = (τλ/γ − β)² > 0 and lim_{A→∞} q(A) = +∞. Its discriminant is

Δ_q = 4 (τ²/α²) ((α + μ)(β + ν) − ατλ/γ)(ν(α + μ) + βμ),

which remains strictly positive when ρ < 1. Thus, the polynomial A → q(A) has a positive root A+. We therefore fix 0 < A < A+, in such a way that we can deduce


from the analysis that there exists πA > 0 such that, for all S ≥ 0, we have P_A(S) ≥ πA > 0, and arrive at the estimate

|S(t) − λ/γ|²/2 + A E(t) + A ((α + μ)/α) I(t) + γ ∫₀ᵗ |S(σ) − λ/γ|² dσ + πA ∫₀ᵗ I(σ) dσ ≤ C0,

where C0 = |SInit − λ/γ|²/2 + A EInit + A ((α + μ)/α) IInit.

In particular, this proves that the functions t → E(t), t → I(t) and t → |S(t) − λ/γ|², and therefore t → S(t), are uniformly bounded on [0, +∞[. Returning to equation [1.30], it follows that the derivatives t → dE/dt(t), t → dI/dt(t) and t → dS/dt(t) are also uniformly bounded on [0, +∞[. Moreover, the estimate obtained also shows that t → |S(t) − λ/γ|² and t → I(t) belong to L¹([0, ∞[). In our final discussion of the asymptotic behavior of [1.30], we will use the following general lemma.

LEMMA 1.10.– Let U : t ≥ 0 → U(t) ∈ R be a uniformly continuous function, integrable on [0, ∞[. Then limt→∞ U(t) = 0.

PROOF.– If we only assume that U is a continuous integrable function, we can find counterexamples to this statement, with functions that do not have a limit at +∞:

V(t) = Σ_{n=2}^∞ ( (n²t + (1 − n³)) 1_{[n−1/n², n]}(t) + (−n²t + (1 + n³)) 1_{[n, n+1/n²]}(t) )

provides one such counterexample (it is useful to trace the graph of this function and verify that it is continuous and integrable on R). On the other hand, uniform continuity allows us to eliminate this kind of oscillatory behavior. We argue by contradiction, assuming that there exist ε > 0 and a strictly increasing sequence (tn)n∈N of positive real numbers such that limn→∞ tn = ∞ and |U(tn)| ≥ ε for all n ∈ N. Since U is uniformly continuous, there exists η > 0 such that if |t − s| ≤ η, then |U(t) − U(s)| ≤ ε/2. It follows that

∫_{tn−η}^{tn+η} |U(s)| ds ≥ ∫_{tn−η}^{tn+η} ( |U(tn)| − |U(tn) − U(s)| ) ds ≥ (ε/2) × 2η = εη > 0,

which would contradict the Cauchy criterion satisfied by the integrable function U. □
Since the derivatives of S and of I are uniformly bounded, we can apply the lemma to the functions t → |S(t) − λ/γ|² and t → I(t). We therefore prove that

limt→∞ S(t) = λ/γ,   limt→∞ I(t) = 0.


Finally, we return to the dissipation estimate obtained previously: t → ½|S(t) − λ/γ|² + A E(t) + A ((α + μ)/α) I(t) is a decreasing function on [0, ∞[ and therefore has a limit when t → ∞, denoted by ℓ ≥ 0. It follows that limt→∞ E(t) = ℓ/A. However, by substituting this information into [1.30], we conclude that ℓ = 0. We have thus demonstrated that, when ρ < 1, the solutions of [1.30] converge when t → ∞ to the healthy state (λ/γ, 0, 0).
In terms of numerical simulations, we will see that, in certain cases, it may be interesting to use the implicit Euler scheme rather than the explicit version. Since the function f is nonlinear, implementing the implicit scheme requires a procedure to find the roots of a certain function. Indeed, by letting X denote the vector with components (S, E, I), the scheme [1.12] is written as X^(n+1) = X^n + Δt f(X^(n+1)). In other words, given X^n = (S^n, E^n, I^n) and Δt, X^(n+1) satisfies the nonlinear equation Ψ(X^(n+1)) = 0, where Ψ : Y ∈ R³ → Ψ(Y) = Y − X^n − Δt f(Y), that is,

Ψ(Y) = ( Y1 − S^n − Δt(λ − γY1 − τY1Y3 + βY3) ; Y2 − E^n − Δt(τY1Y3 − (α + μ)Y2) ; Y3 − I^n − Δt(αY2 − (β + ν)Y3) ).

In fact, we provide an approximation of the roots of the function Ψ using Newton's method. We construct the sequence defined by

Y^(0) = X^n = (S^n, E^n, I^n),   Y^(k+1) = Y^(k) − (∇_Y Ψ(Y^(k)))⁻¹ Ψ(Y^(k)).

The iterate X^(n+1) corresponds to the limit of Y^(k) when k → ∞. More specifically, we introduce the matrix

M^(k) = ∇_Y Ψ(Y^(k)) = I − Δt ( −γ − τY3^(k)   0   β − τY1^(k) ; τY3^(k)   −(α + μ)   τY1^(k) ; 0   α   −(β + ν) ).
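One implicit Euler step solved by this Newton iteration can be sketched as follows (the parameter and initial values are those of the first experiment reported below; NumPy is used for the 3 × 3 linear solves):

```python
import numpy as np

lam, gam, tau, beta, mu, nu, alpha = 3.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0
dt = 0.05

def f(X):
    S, E, I = X
    return np.array([lam - gam * S - tau * I * S + beta * I,
                     tau * I * S - (alpha + mu) * E,
                     alpha * E - (beta + nu) * I])

def jac_f(X):
    S, E, I = X
    return np.array([[-gam - tau * I, 0.0,            beta - tau * S],
                     [tau * I,        -(alpha + mu),  tau * S],
                     [0.0,            alpha,          -(beta + nu)]])

def implicit_euler_step(Xn, tol=1e-12, kmax=50):
    Y = Xn.copy()                                    # Y^(0) = X^n
    for _ in range(kmax):
        Psi = Y - Xn - dt * f(Y)
        M = np.eye(3) - dt * jac_f(Y)                # M^(k) = grad Psi(Y^(k))
        Z = np.linalg.solve(M, Psi)                  # a) solve M^(k) Z^(k) = Psi(Y^(k))
        Y = Y - Z                                    # b) update Y^(k+1) = Y^(k) - Z^(k)
        if np.linalg.norm(Z) <= tol * np.linalg.norm(Y):  # relative stopping criterion
            break
    return Y

Xn = np.array([0.3, 0.2, 0.4])                       # (S, E, I) initial values from the text
Xn1 = implicit_euler_step(Xn)
print(Xn1)
```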


In practice, we do not compute the inverse of the matrix M^(k), and we stop the process after a finite number of steps. Having Y^(k), and therefore this matrix, we proceed in two steps to determine Y^(k+1):
a) solving the linear system M^(k) Z^(k) = Ψ(Y^(k));
b) updating with the formula Y^(k+1) = Y^(k) − Z^(k).
We then employ a stopping criterion, for example by stopping the algorithm when the relative error ‖Y^(k+1) − Y^(k)‖/‖Y^(k)‖ falls below a previously set tolerance. We complete this test with a bound on the number of iterations, which allows us to detect a possible failure of the method to converge.
In order to confirm the accuracy of the numerical method, we compare the numerical solution with the known exact solution when EInit = 0 = IInit. The chosen parameters are as follows: λ = 3, γ = 1, τ = 1, β = 1, μ = 5, ν = 1, α = 1 and SInit = 0.3. We perform a simulation until the final time T = 3 with a time step Δt = 0.05. Both the explicit and implicit methods provide similar results, which are shown in Figure 1.9. We observe that the exposed and infectious populations indeed remain completely absent throughout the simulation and that the simulation's quality degrades somewhat for larger simulation times.
We now study a case where the exposed and infectious populations are present. We keep the same parameters as in the previous simulation, except the initial values SInit = 0.3, EInit = 0.2 and IInit = 0.4. In this case, we have ρ = 0.25 < 1. Figure 1.10 shows the extinction of the exposed and infectious populations, as well as convergence in the long term towards the healthy state, with limt→∞ S(t) = λ/γ (in order to observe this last result, it is necessary to continue the simulation on longer time spans than those shown in Figure 1.10, whose final time is only T = 2).
Finally, we study a case where λ = 13, γ = 1, τ = 67, β = 1, μ = 5, ν = 1 and α = 1, which leads to ρ = 72.5833. The initial values are the same, that is, SInit = 0.3, EInit = 0.2 and IInit = 0.4. We can then expect to reach a stationary state with non-trivial infectious and exposed populations. Figure 1.11 shows the result of simulations using the explicit and implicit Euler schemes with time step Δt = 0.05 until the final time T = 1.1. We observe that the explicit Euler scheme produces incoherent results after t ≈ 0.9. In particular, one of the quantities becomes negative.
These incoherences worsen in time and the simulation can no longer be continued under these numerical conditions. We therefore use the implicit Euler scheme to explore the solution's behavior on longer spans of time. From Figure 1.12, at the final time T = 3, we can clearly see that a non-trivial stationary configuration takes hold.

1.2.8. A chemotaxis problem

We set out to study a system of differential equations that describes bacterial behavior (for example, E. coli). In order to simplify, we assume that the bacteria only move in a straight line (one-dimensional description) and divide the domain of



Figure 1.9. SEI model: comparison with an exact solution


Figure 1.10. SEI model: convergence to the healthy state (S represented by a bold line, E represented by a dotted line, I represented by a double dotted line)



Figure 1.11. SEI model: comparison of the explicit and implicit Euler schemes in an epidemic scenario (S represented by a bold line, E represented by a dotted line, I represented by a double dotted line)

study into equally sized locations, identified by the index j ∈ {1, ..., J}. The dynamics is governed by the following principles:
– at each unit of time τ > 0, a proportion 0 < p < 1 of the bacteria present at location j migrates towards one of the neighboring locations (that is, to either j − 1 or j + 1);
– the bacteria react to an attractive chemical potential. A certain substance is emitted by the bacteria themselves, which move according to the signal they perceive.

We let Xj(t) denote the concentration of bacteria present at location j. The chemical attractive effect is described as follows. Each bacterium acts over a long range, but with an amplitude that diminishes farther away from the source. More specifically, given a decreasing function E : ]0, ∞[ → ]0, ∞[ with limz→0 E(z) = ∞, we write

Uj = − Σ_{k≠j} (j − k) E(|j − k|) Xk.



Figure 1.12. SEI model: simulation in an epidemic scenario (S represented by a bold line, E represented by a dotted line, I represented by a double dotted line)

Exchanges between points j and j + 1 depend on the average value

U_{j+1/2} = (Uj + U_{j+1})/2.

If U_{j+1/2} > 0, the bacteria migrate towards location j + 1; if instead U_{j+1/2} < 0, location j is reinforced by an inflow of bacteria coming from location j + 1. We therefore write

X_{j+1/2} = Xj if U_{j+1/2} ≥ 0,   X_{j+1/2} = X_{j+1} if U_{j+1/2} < 0.

We thus obtain the following differential system:

dXj/dt = (p/2τ)(X_{j+1} − 2Xj + X_{j−1}) − (U_{j+1/2} X_{j+1/2} − U_{j−1/2} X_{j−1/2}),


where we extend X by 0 when necessary: we let X0 = 0 = X_{J+1}. In condensed form, with X = (X1, ..., XJ), we can write

dX/dt = AX − U(X),   A = (p/2τ) tridiag(1, −2, 1),

the J × J tridiagonal matrix with −2 on the diagonal and 1 on the sub- and superdiagonals,

where the components of the function U are defined by X = (X1, ..., XJ) → U(X) = (U_{3/2}X_{3/2} − U_{1/2}X_{1/2}, ..., U_{J+1/2}X_{J+1/2} − U_{J−1/2}X_{J−1/2}). It can be convenient to rewrite this as

U_{j+1/2} X_{j+1/2} = ½(U_{j+1/2} + |U_{j+1/2}|) Xj + ½(U_{j+1/2} − |U_{j+1/2}|) X_{j+1},

which shows that the function X → U(X) is locally Lipschitz continuous, thereby proving that the Cauchy problem is well posed, at least locally in time. This formulation also makes it fairly simple to program the Euler schemes that approximate the problem. In light of their physical interpretation, the unknowns Xj must remain non-negative. To verify this, we fix T > 0, smaller than the lifespan of the maximal solution, and we let M denote an upper bound of the |U_{j+1/2}| on [0, T]. We consider the evolution of the negative part of the solution: denoting [Xj]− = min(0, Xj) ≤ 0, we have

½ d/dt Σ_{j=1}^J [Xj]²− = (p/2τ) Σ_{j=1}^J (X_{j+1} − Xj − (Xj − X_{j−1}))[Xj]− − Σ_{j=1}^J U(X)j [Xj]−
= −(p/2τ) Σ_j (X_{j+1} − Xj)([X_{j+1}]− − [Xj]−)
− Σ_j ½(U_{j+1/2} + |U_{j+1/2}|) Xj [Xj]− − Σ_j ½(U_{j+1/2} − |U_{j+1/2}|) X_{j+1} [Xj]−
+ Σ_j ½(U_{j−1/2} + |U_{j−1/2}|) X_{j−1} [Xj]− + Σ_j ½(U_{j−1/2} − |U_{j−1/2}|) Xj [Xj]−.

However, on the one hand, we have Xj[Xj]− = [Xj]²− and, on the other hand, since ½(U_{j+1/2} − |U_{j+1/2}|) = [U_{j+1/2}]− ≤ 0, each term of the form −½(U_{j±1/2} − |U_{j±1/2}|) X_{j±1} [Xj]− can be bounded from above by M [X_{j±1}]− [Xj]−. Since (X_{j+1} − Xj)([X_{j+1}]− − [Xj]−) ≥ 0, it follows that

½ d/dt Σ_{j=1}^J [Xj]²− ≤ M Σ_j [X_{j+1}]− [Xj]− ≤ 2M Σ_{j=1}^J [Xj]²−,

using the Cauchy–Schwarz inequality. Thus, Grönwall's lemma leads to Σ_{j=1}^J [Xj(t)]²− ≤ e^(4Mt) Σ_{j=1}^J [X_{Init,j}]²−. Therefore, if the components of the initial value are non-negative, we have [X_{Init,j}]− = 0 and the upper bound vanishes, so we infer that Xj(t) ≥ 0 for all 0 ≤ t ≤ T.
Let us now study the behavior of this differential system numerically. Given Δt > 0, the explicit Euler scheme is written as

Y^(n+1) = Y^n + Δt A Y^n − Δt U(Y^n),

where Y^n = (Y1^n, ..., YJ^n) is meant to approximate X(nΔt) = (X1(nΔt), ..., XJ(nΔt)), and we still assume that Y0^n = 0 = Y_{J+1}^n. Even though the role of the nonlinear term U(X) might not be evident at first glance, we can still note that the linear term is clearly dissipative, because Aξ · ξ = −(p/2τ) Σ_j |ξ_j − ξ_{j+1}|² ≤ 0. It is therefore worthwhile to treat that term implicitly, especially because doing so does not require finding the root of a complicated function, but only solving a linear system. This approach can also be motivated in terms of preserving the non-negativity of the solutions. Indeed, let us consider the case where U(X) = 0; the explicit Euler scheme then reads

Yj^(n+1) = (1 − Δt p/τ) Yj^n + (p/2τ) Δt Y_{j−1}^n + (p/2τ) Δt Y_{j+1}^n.

Therefore, non-negative values are preserved at each time step when Yj^(n+1) is a convex combination of Yj^n, Y_{j−1}^n and Y_{j+1}^n, which requires the restriction Δt < τ/p on the time step. When the ratio τ/p is "small" (for example, when the unit of time


τ is small compared with the observation scale), this condition is very restrictive and can have a considerable impact on the computation time necessary to evaluate the solution up to a given final time T > 0. The implicit Euler scheme for the simplified problem where U(X) = 0 is

(I − (p/2τ) Δt A) Y^(n+1) = Y^n.

The fact that the components of Y^(n+1) remain non-negative if those of Y^n are non-negative is related to a remarkable property of the matrix I − (p/2τ)ΔtA: it is an M-matrix for all Δt > 0. We will study the details of this question when the matrix A is involved in the discretization of certain partial differential equations (see Chapter 2 and section 3.1). For the problem at hand, we therefore recommend a semi-implicit strategy with the scheme

(I − (p/2τ) Δt A) Y^(n+1) = Y^n − Δt U(Y^n).

Clearly, "stability" difficulties can still arise from the nonlinear term U(Y) ≠ 0, whose analysis is a much more delicate matter. However, this scheme has the advantage of remaining simple while overcoming the most immediate difficulty associated with the explicit Euler scheme. This very simple model exhibits rich behavior, depending on the parameter p/τ, the amplitude of the coupling term U and the initial value. For the simulations, we take E(z) = C/|z|³, C > 0, and a random initial value in [0, 1]. We begin with a trial at p = 0.95, τ = 1, C = 0.1. Here we have no restriction on Δt. We take Δt = 0.1 and the final time is T = 10. We do not find any notable differences between the results of the explicit Euler scheme and those of the semi-implicit scheme. Figure 1.13 shows a relatively smooth profile, where the maximal amplitude remains below 1. The situation is radically different when we take C = 1 and keep the other parameters the same: Figure 1.14 shows the appearance of "peaks" with very large amplitudes. We now perform the same simulations with τ = 0.03. For Δt = 0.1, the stability condition that guarantees non-negative results is violated, and we can effectively see that the explicit Euler scheme does not work. The semi-implicit scheme allows us to perform the simulation in those conditions. Figure 1.15 shows the result obtained with C = 0.1: we see a regular profile, which tends to be smoothed out over time. Figures 1.16 and 1.17 correspond to the same data but with C = 8 and C = 10. They show peaks with a very localized density. If we increase C further, if we increase the amplitude of the initial value, or if we perform simulations up to larger final times, we encounter stability difficulties due to these singular structures and to the nonlinear term.
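A minimal sketch of both schemes (with an illustrative problem size J = 50) supports the positivity discussion: with the interaction switched off (C = 0) and Δt > τ/p, the explicit step starting from a concentrated bump produces negative values, while the semi-implicit step does not.

```python
import numpy as np

J, p, tau, dt = 50, 0.95, 0.03, 0.1      # here dt > tau/p: the explicit positivity condition fails
A = (p / (2 * tau)) * (np.diag(-2.0 * np.ones(J))
                       + np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1))

def nonlinear_term(X, C):
    # U_j = -sum_{k != j} (j - k) E(|j - k|) X_k with E(z) = C / |z|^3
    idx = np.arange(J)
    D = idx[:, None] - idx[None, :]
    E = np.where(D == 0, 0.0, C / np.where(D == 0, 1, np.abs(D)) ** 3)
    U = -(D * E) @ X
    Uh = 0.5 * (U[:-1] + U[1:])          # U_{j+1/2} at interior interfaces
    # upwind interface value, via the half-sum identity of the text
    F = 0.5 * (Uh + np.abs(Uh)) * X[:-1] + 0.5 * (Uh - np.abs(Uh)) * X[1:]
    UX = np.zeros(J)
    UX[:-1] += F                         # + U_{j+1/2} X_{j+1/2}
    UX[1:] -= F                         # - U_{j-1/2} X_{j-1/2}
    return UX

def explicit_step(X, C):
    return X + dt * (A @ X) - dt * nonlinear_term(X, C)

def semi_implicit_step(X, C):
    return np.linalg.solve(np.eye(J) - dt * A, X - dt * nonlinear_term(X, C))

X0 = np.zeros(J)
X0[J // 2] = 1.0                         # concentrated bump
print(explicit_step(X0, 0.0).min())      # negative: positivity is lost
print(semi_implicit_step(X0, 0.0).min()) # essentially non-negative (up to rounding)
```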



Figure 1.13. Chemotaxis: initial value (dotted) and solution at the final time T = 10 for p = 0.95, τ = 1, C = 0.1


Figure 1.14. Chemotaxis: initial value (dotted) and solution at the final time T = 10 for p = 0.95, τ = 1, C = 1

The mathematical study of these fascinating phenomena and their biological implications are the object of intense research activity: the one presented here is a discrete version of a model proposed by Keller–Segel [KEL 70]. Original results can


be found in [JÄG 92, RAS 95] as well as in the review articles [HOR 03, HOR 04]. Very similar problems are also at play in astrophysics, with models of galaxy formation under the effect of gravitational forces [CHA 43].

Figure 1.15. Chemotaxis: solutions at different times for p = 0.95, τ = 0.03, C = 0.1

Figure 1.16. Chemotaxis: solutions at different times T ≤ 10 for p = 0.95, τ = 0.03, C = 8


Figure 1.17. Chemotaxis: solutions at different times T ≤ 10 for p = 0.95, τ = 0.03, C = 10

1.3. Hamiltonian problems

A very large class of differential systems are of the following form:

d/dt (q, p) = (∇p H(q, p), −∇q H(q, p)),

[1.31]

for a certain function (q, p) ∈ RD × RD → H(q, p) ∈ R, the Hamiltonian of the system. We let ∇p H(q, p) = (∂p1 H(q, p), ..., ∂pD H(q, p)) and ∇q H(q, p) = (∂q1 H(q, p), ..., ∂qD H(q, p)). We will assume that H is a function of class C 2 on the domain U ⊂ RD × RD .

[1.32]

This assumption allows us to apply the Picard–Lindelöf theorem, which ensures the local existence and uniqueness of a solution for each initial value (tInit, qInit, pInit). Then, the key statement is that the Hamiltonian is preserved: for every solution of [1.31], we have

d/dt H(q(t), p(t)) = ∇q H(q(t), p(t)) · dq/dt(t) + ∇p H(q(t), p(t)) · dp/dt(t) = ∇q H · ∇p H − ∇p H · ∇q H = 0.

The properties of the function H often allow us to derive estimates on the solution that guarantee its global existence from the conservation H(q(t), p(t)) = H(qInit, pInit).


This kind of problem arises naturally in mechanics. We present a classical example as follows. Let us consider a particle subject to a force field deriving from a potential Φ. Letting m denote the mass of the particle, q(t) its position at time t and p(t) its velocity, the fundamental principle of dynamics leads to

dq/dt(t) = p(t),   m dp/dt(t) = −∇q Φ(q(t)).

The corresponding Hamiltonian is

H(q, p) = m|p|²/2 + Φ(q),

the sum of the kinetic energy m|p|²/2 and the potential energy Φ(q). We say that the potential Φ is confining when there exists C > 0 such that, for every q, we have Φ(q) ≥ −C. In this case, the system is globally well defined. Indeed, m|p(t)|²/2 ≤ H(q(t), p(t)) + C = H(qInit, pInit) + C implies that the velocity remains bounded in time. Thus, we have q(t) = qInit + ∫₀ᵗ p(s) ds, which allows us to establish that |q(t)| ≤ M(1 + t) for a certain constant M > 0. These estimates exclude the possibility of explosion in finite time. In what follows, it will be convenient to express [1.31] in the form

d/dt (q, p) = J ∇_{q,p} H(q, p),   J = ( 0   I ; −I   0 ),

with ∇_{q,p} H(q, p) = (∂q1 H(q, p), ..., ∂qD H(q, p), ∂p1 H(q, p), ..., ∂pD H(q, p)). The properties of the matrix J play a crucial role in the analysis of Hamiltonian systems. A direct calculation establishes the following properties.

LEMMA 1.11.– The matrix J satisfies
i) det(J) = 1;
ii) Jᵀ = −J, J² = −I, J⁻¹ = Jᵀ = −J.

Beyond energy conservation, systems of the form [1.31] satisfy another remarkable property: the conservation of volumes in the phase space. Indeed, we consider a domain of the phase space D0 ⊂ R^D × R^D with bounded volume

V0 = ∫_{D0} dq0 dp0 < ∞.

For t > 0, we focus on its image by the flow of [1.31]:

Dt = { (q(t), p(t)), where t → (q(t), p(t)) is a solution of [1.31] with (q(0), p(0)) = (q0, p0) ∈ D0 }.


Thus, the volume of that domain is given by (Chapter 4 in [GOU 11]):

Vt = ∫_{Dt} dq dp = ∫_{D0} |det(J(t, q0, p0))| dq0 dp0,

where J(t, q0, p0) denotes the Jacobian matrix of the change of variables (q0, p0) → (q, p), that is, the 2D × 2D matrix with entries

J(t, q0, p0)_{nm} = ∂_{x0,m} Xn(t),   X = (q1, ..., qD, p1, ..., pD),   x0 = (q0,1, ..., q0,D, p0,1, ..., p0,D).

PROPOSITION 1.5.– The solutions of [1.31] satisfy
i) for all t ≥ 0, det(J(t)) = 1, so Vt = V0 (conservation of volume);
ii) for all t ≥ 0, J(t)ᵀ J J(t) = J (the matrix J(t) is symplectic).

PROOF.– We introduce the vector U(q, p) = J∇_{q,p}H(q, p) = (∇p H(q, p), −∇q H(q, p)) and let X = (q, p), x0 = (q0, p0). We observe that

d/dt Jnm(t) = d/dt ∂_{x0,m} Xn(t) = ∂_{x0,m} Un(X(t)) = Σ_{k=1}^{2D} ∂_{Xk} Un(X(t)) ∂_{x0,m} Xk(t) = Σ_{k=1}^{2D} (∇X U(X(t)))nk Jkm(t) = (∇X U J(t))nm,

where ∇X U is the 2D × 2D matrix with entries (∇X U)nm = ∂_{Xm} Un.

Ordinary Differential Equations

In the case where H(q, p) = |p|²/2 + Φ(q), we have

∇X U = ( 0   I ; −D²q Φ   0 ),

It follows that divX (U ) = ∂q1 U1 + ... + ∂qD UD + ∂p1 UD+1 + ... + ∂pD U2D = tr(∇X U ) D    = ∇q · ∇p H + ∇p · (−∇q H) = ∂qi ∂pi H − ∂pi ∂qi H = 0. i=1

However, for an invertible matrix A, the derivative of the determinant mapping is defined by det (A)(H) = tr(A−1 H)det(A). In order to determine this relation, on the one hand, we write det(A + H) = det(A(I + A−1 H)) = det(A)det(I + A−1 H) and, on the other hand, we show that ˜

R(H)

˜ = 1+tr(H)+R( ˜ ˜ where the remainder satisfies lim ˜ det(I+H) H), = 0. ˜

H →0 H

The last property can be obtained by mathematical induction on dimension N . It is clear that it is satisfied in dimension N = 1. Let us assume that it is satisfied for an integer N and consider a matrix H of MN+1 (C). We decompose IN +1 + H in the form ⎡

⎤ 1 + h t 1 t2 ... tN ⎥ ⎢ s1 ⎢ ⎥ ⎢ s2 ⎥ ⎢ ⎥ IN +1 + H = ⎢ . ⎥ ⎢ .. ˜ ⎥ I + H N ⎢ ⎥ ⎣ sN −1 ⎦ sN ˜ ∈ MN (C). We develop the determinant with respect to the first row to obtain with H ˜ + r(H), where the remainder r(H) indeed det(I + H) = (1 + h)det(IN + H) satisfies the desired estimate. However, according to the induction hypothesis, we ˜ = (1 + h)(1 + tr(H) ˜ + R( ˜ H)), ˜ which is indeed equal to have (1 + h)det(IN + H) ˜ = 1 + tr(H)+ remainder terms that satisfy the estimate as desired. 1 + (h + tr(H))

106

Mathematics for Modeling and Scientific Computing

Therefore, we have

d/dt det(J) = tr(J⁻¹ ∇X U J) det(J) = divX(U) det(J) = 0.

So t → det(J(t)) is constant. Since for t = 0 we have J(0) = I, we conclude that det(J(t)) = 1. We use the same observations to demonstrate ii). Indeed, we have

d/dt (Jᵀ J J) = (∇X U J)ᵀ J J + Jᵀ J (∇X U J) = Jᵀ ((∇X U)ᵀ J + J ∇X U) J = Jᵀ ((J D²H)ᵀ J + J J D²H) J = Jᵀ (D²H (−J) J + J J D²H) J = 0,

using the fact that D²H is symmetric and lemma 1.11. We therefore conclude that J(t)ᵀ J J(t) = J(0)ᵀ J J(0) = J. We also note that this property implies i), since it leads to det(J) = 1 = det(J(t)ᵀ J J(t)) = det(J) (det(J(t)))², and therefore (det(J(t)))² = 1. For t = 0, we have det(J(0)) = 1, and it follows, by continuity, that det(J(t)) = 1 for all t ≥ 0. □

The typical example of a Hamiltonian problem is given by the description of a pendulum's movement, which we now describe.

1.3.1. The pendulum problem

Consider a pendulum subject to the action of its own weight. The pendulum's position is tracked by the vector OM(t), with coordinates ℓ(sin(θ(t)), cos(θ(t))) in the canonical basis (ex, ey), ℓ being the cord length. Newton's second law of motion leads to the vectorial relation

⎞  dθ(t) 2 d2 θ(t) cos(θ(t)) − sin(θ(t)) ⎜ dt2 ⎟ d2 dt ⎟ m 2 OM (t) = mgey + T =  ⎜ ⎝ ⎠ 2 dt  dθ(t) 2 d θ(t) − sin(θ(t)) − cos(θ(t)) dt2 dt where g > 0 is the acceleration of gravity, m is the pendulum’s mass and T is the traction exerted by the cord, which is a force carried by the vector OM . We project this relation in the direction perpendicular to OM (t). This makes it possible to get rid of the traction force. The following formula is therefore obtained

Ordinary Differential Equations

107

by multiplying the first component by cos(θ(t)) and the second one by −sin(θ(t)) and then performing the sum:

$$\frac{d^2\theta(t)}{dt^2} = -\frac{g}{\ell}\,\sin(\theta(t)). \qquad [1.33]$$

Figure 1.18. Simple pendulum (the mass m at position $\overrightarrow{OM}(t) = \ell(\sin\theta(t), \cos\theta(t))$, subject to its weight $mg\,e_y$ and to the cord traction $\vec T$, with $m\,\frac{d^2}{dt^2}\overrightarrow{OM} = mg\,e_y + \vec T$)

We note that g/ℓ is homogeneous to the square of a frequency (that is, to 1/time²). We let ξ(t) = dθ(t)/dt and Θ(t) = (θ(t), ξ(t)), in such a way that this second-order equation can be rewritten as a first-order system

$$\frac{d}{dt}\Theta(t) = \begin{pmatrix}\xi(t)\\ -\dfrac{g}{\ell}\sin(\theta(t))\end{pmatrix} = f(\Theta(t)).$$

This system is autonomous: the function $f : (\theta,\xi)\in\mathbb{R}^2 \mapsto \big(\xi, -\frac{g}{\ell}\sin(\theta)\big)\in\mathbb{R}^2$ is C¹ and its derivatives are uniformly bounded. Therefore, the Cauchy problem has a unique solution defined for all times. It is a Hamiltonian system associated with the function

$$H(\theta,\xi) = \frac12|\xi|^2 + \frac{g}{\ell}\big(1-\cos(\theta)\big).$$

The quantity H(θ(t), ξ(t)) is conserved throughout time.


For small oscillations |θ(t)| ≪ 1, we can approximate sin(θ(t)) by θ(t). We can therefore expect that the solution of [1.33] remains near the solution θ_L of the linear problem

$$\frac{d^2\theta_L(t)}{dt^2} = -\frac{g}{\ell}\,\theta_L(t). \qquad [1.34]$$

This equation can be expressed in the form of a first-order linear system

$$\frac{d}{dt}\begin{pmatrix}\theta_L(t)\\ \xi_L(t)\end{pmatrix} = \begin{pmatrix}0 & 1\\ -g/\ell & 0\end{pmatrix}\begin{pmatrix}\theta_L(t)\\ \xi_L(t)\end{pmatrix},$$

which can be associated with the following Hamiltonian

$$H_L(\theta,\xi) = \frac12\Big(|\xi|^2 + \frac{g}{\ell}|\theta|^2\Big).$$

Of course, replacing the problem [1.33] with the problem [1.34] is an approximation, which relies on the assumption of small deviations with respect to the equilibrium position θ = 0. This approximation can be justified with the following statement.

PROPOSITION 1.6.– Let ε > 0 and let f₁, f₂ be two functions defined on R⁺ × R^D, uniformly L-Lipschitz in the state variable and such that ‖f₁ − f₂‖_∞ ≤ ε. Consider the solutions x_j of $\frac{d}{dt}x_j(t) = f_j(t, x_j(t))$, with x_j(0) = x_{j,Init}. Then, we have

$$|x_1(t) - x_2(t)| \le \big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}}\big|\,e^{Lt} + \frac{\varepsilon}{L}\big(e^{Lt}-1\big).$$

PROOF.– We roughly estimate

$$|x_1(t)-x_2(t)| = \Big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}} + \int_0^t \big(f_1(s,x_1(s)) - f_2(s,x_2(s))\big)\,ds\Big|$$
$$\le \big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}}\big| + \int_0^t \big|f_1(s,x_1(s)) - f_1(s,x_2(s))\big|\,ds + \int_0^t \big|f_1(s,x_2(s)) - f_2(s,x_2(s))\big|\,ds$$
$$\le \big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}}\big| + L\int_0^t \big|x_1(s) - x_2(s)\big|\,ds + \varepsilon t.$$

Using Grönwall's lemma, we obtain

$$|x_1(t)-x_2(t)| \le \big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}}\big|\,e^{Lt} + \varepsilon\int_0^t e^{L(t-s)}\,ds = \big|x_{1,\mathrm{Init}} - x_{2,\mathrm{Init}}\big|\,e^{Lt} + \frac{\varepsilon}{L}\big(e^{Lt}-1\big). \quad\square$$


This estimate determines how close the solutions of differential equations with close right-hand sides remain. Here, it relies on the fact that sin(θ) and θ are close when θ is small: given ε > 0, we can find a small enough 0 < θ₀ < π/2 such that for all |θ| ≤ θ₀ we have |sin(θ) − θ| ≤ ε. We assume that the pendulum is released with no initial speed. Thus, conservation of energy implies that

$$\frac12\Big|\frac{d\theta}{dt}\Big|^2 + \frac{g}{\ell}\big(1-\cos(\theta)\big) = \frac{g}{\ell}\big(1-\cos(\theta_0)\big),\qquad \frac12\Big|\frac{d\theta_L}{dt}\Big|^2 + \frac{g}{\ell}\,\frac{\theta_L^2}{2} = \frac{g}{\ell}\,\frac{\theta_0^2}{2}.$$

It follows that |θ(t)| ≤ θ₀ and |θ_L(t)| ≤ θ₀. With proposition 1.6, we infer an estimate of the form

$$|\theta(t) - \theta_L(t)| + \Big|\frac{d\theta}{dt}(t) - \frac{d\theta_L}{dt}(t)\Big| \le \frac{C\varepsilon}{L}\big(e^{Lt}-1\big).$$

The solutions of [1.33] and [1.34] therefore remain close for only short periods of time. □

The solutions of [1.34] are of the form

$$\theta_L(t) = A\cos(\omega t) + B\sin(\omega t),\qquad \omega = \sqrt{\frac{g}{\ell}},$$

where A and B are determined by the initial conditions: $A = \theta_L(0)$ and $B = \frac{1}{\omega}\frac{d\theta_L}{dt}(0)$. In particular, the solution of [1.34] satisfies

$$\frac{d}{dt}\Big(|\xi_L(t)|^2 + \frac{g}{\ell}|\theta_L(t)|^2\Big) = 0$$

and the solutions describe ellipses in the phase plane (θ, ξ). We therefore do not need a numerical scheme to know the solution of [1.34]. Nevertheless, it is interesting to study the behavior, on this simple example, of schemes for which we have established convergence, and to note that they can have different properties with respect to energy criteria. In order to analyze the behavior of numerical schemes, it is convenient to modify the time scale, by letting s = ωt.
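The closeness of [1.33] and [1.34] for small oscillations can be illustrated numerically. The sketch below (an assumption-laden illustration, not the book's code) integrates both equations with a classical fourth-order Runge-Kutta step, taking g/ℓ = 1, and compares θ and θ_L over [0, 10] for a small and a moderate initial angle, both released with no initial speed.

```python
import math

def rk4(f, y, dt):
    # one classical RK4 step for a 2-component system
    k1 = f(y)
    k2 = f([y[i] + 0.5 * dt * k1[i] for i in range(2)])
    k3 = f([y[i] + 0.5 * dt * k2[i] for i in range(2)])
    k4 = f([y[i] + dt * k3[i] for i in range(2)])
    return [y[i] + dt / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(2)]

def nonlinear(y):   # theta' = xi, xi' = -sin(theta), i.e. [1.33] with g/l = 1
    return [y[1], -math.sin(y[0])]

def linear(y):      # theta' = xi, xi' = -theta, i.e. [1.34] with g/l = 1
    return [y[1], -y[0]]

def max_gap(theta0, T=10.0, dt=1e-3):
    y, yl = [theta0, 0.0], [theta0, 0.0]
    gap = 0.0
    for _ in range(int(T / dt)):
        y, yl = rk4(nonlinear, y, dt), rk4(linear, yl, dt)
        gap = max(gap, abs(y[0] - yl[0]))
    return gap

small, large = max_gap(0.1), max_gap(1.0)
print(f"max |theta - theta_L| on [0,10]: theta0=0.1 -> {small:.2e}, theta0=1.0 -> {large:.2e}")
```

For θ₀ = 0.1 the two solutions stay very close, while for θ₀ = 1.0 the period shift of the nonlinear pendulum makes the gap grow to order one, as the discussion above predicts.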


We thus work with a dimensionless variable s. We write $\bar\theta_L(s) = \theta_L(s/\omega)$ and $\bar\xi_L(s) = \frac{1}{\omega}\xi_L(s/\omega)$. We obtain

$$\frac{d}{ds}\begin{pmatrix}\bar\theta_L\\ \bar\xi_L\end{pmatrix} = J\begin{pmatrix}\bar\theta_L\\ \bar\xi_L\end{pmatrix},\qquad J = \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix},$$

that is to say, [1.34] where we have replaced g/ℓ with 1. With these variables, the trajectories describe circles in the phase plane $(\bar\theta,\bar\xi)$, because conservation of energy is expressed as $\frac{d}{ds}|\bar\Theta_L(s)|^2 = 0$. Henceforth, we will discuss the behavior of numerical schemes with g/ℓ = 1 and abandon the notation $\bar\theta_L, \bar\xi_L$. The explicit Euler scheme [1.11] for [1.34] takes the form

$$\Theta^{n+1}_{EE} = (I + \Delta t\,J)\,\Theta^n_{EE},$$

so that, since J is antisymmetric ($J\Theta\cdot\Theta = 0$),

$$|\Theta^{n+1}_{EE}|^2 = |\Theta^n_{EE}|^2 + \Delta t^2|J\Theta^n_{EE}|^2 + 2\Delta t\,J\Theta^n_{EE}\cdot\Theta^n_{EE} = (1+\Delta t^2)\,|\Theta^n_{EE}|^2,$$

whereas with the implicit Euler scheme [1.12], we have

$$(I - \Delta t\,J)\,\Theta^{n+1}_{EI} = \Theta^n_{EI}$$

and

$$|(I-\Delta t\,J)\Theta^{n+1}_{EI}|^2 = |\Theta^{n+1}_{EI}|^2 + \Delta t^2|J\Theta^{n+1}_{EI}|^2 - 2\Delta t\,J\Theta^{n+1}_{EI}\cdot\Theta^{n+1}_{EI} = (1+\Delta t^2)\,|\Theta^{n+1}_{EI}|^2 = |\Theta^n_{EI}|^2.$$

Therefore, the explicit Euler scheme makes energy increase: it is energetically unstable. The implicit Euler scheme, on the contrary, dissipates energy, so it is energetically stable. In both cases, this is somewhat unsatisfying, since energy is conserved in the continuous problem. The Crank-Nicolson scheme

$$\Big(I - \frac{\Delta t}{2}J\Big)\Theta^{n+1}_{CN} = \Big(I + \frac{\Delta t}{2}J\Big)\Theta^n_{CN}$$

fulfills the criterion: it preserves energy, because $|\Theta^{n+1}_{CN}|^2 = |\Theta^n_{CN}|^2$, with the same numerical cost as the implicit Euler scheme.
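These three energy behaviors are immediate to observe in a few lines. The sketch below (step size, horizon and initial data are arbitrary choices made for illustration) iterates each scheme on the rotation system and reports $|\Theta^n|^2$ after a fixed number of steps.

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])
I = np.eye(2)
dt, nsteps = 0.05, 200

M_ee = I + dt * J                                        # explicit Euler
M_ei = np.linalg.inv(I - dt * J)                         # implicit Euler
M_cn = np.linalg.inv(I - dt / 2 * J) @ (I + dt / 2 * J)  # Crank-Nicolson

def final_energy(M):
    theta = np.array([1.0, 0.0])
    for _ in range(nsteps):
        theta = M @ theta
    return theta @ theta

e_ee, e_ei, e_cn = final_energy(M_ee), final_energy(M_ei), final_energy(M_cn)
print(f"|Theta|^2 after {nsteps} steps: EE={e_ee:.4f}, EI={e_ei:.4f}, CN={e_cn:.4f}")
```

Explicit Euler inflates the energy by the factor $(1+\Delta t^2)^n$, implicit Euler damps it by the same factor, and Crank-Nicolson keeps it constant to rounding error.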

This discussion allows us to effectively understand the behavior and the properties of the different schemes, even if, for the linear problem, its value is limited since, as mentioned earlier, the exact solution is well known. With these remarks in mind, we can return to the nonlinear problem. In order to construct well-performing schemes, we will seek to preserve, or approximate, the conservation of energy and the geometric properties of the continuous problem described in proposition 1.5. We write

$$q^{n+1} = q^n + \Delta t\,\nabla_p H(q^{n+1}, p^n),\qquad p^{n+1} = p^n - \Delta t\,\nabla_q H(q^{n+1}, p^n).$$

This scheme is consistent of order 1. It is of an implicit nature and, in general, requires appealing to a fixed-point method. However, in several practical situations, as in the pendulum problem, the Hamiltonian is separable: H(q, p) = h(p) + Φ(q). In this case, the scheme can be written in an explicit form, because $q^{n+1} = q^n + \Delta t\,\nabla_p h(p^n)$ updates $q^{n+1}$, and then $p^{n+1} = p^n - \Delta t\,\nabla_q\Phi(q^{n+1})$. For the linear pendulum, this gives

$$\theta^{n+1} = \theta^n + \Delta t\,\xi^n,\qquad \xi^{n+1} = \xi^n - \Delta t\,\theta^{n+1},$$

and for the nonlinear pendulum,

$$\theta^{n+1} = \theta^n + \Delta t\,\xi^n,\qquad \xi^{n+1} = \xi^n - \Delta t\,\sin(\theta^{n+1}).$$

(Recall that the temporal scale has been modified in order to replace g/ℓ with 1.) The scheme does not exactly preserve the Hamiltonian H. However, for the approximated Hamiltonian defined by

$$\widetilde H(q,p) = H(q,p) + \frac{\Delta t}{2}\,\nabla_p H(q,p)\cdot\nabla_q H(q,p),$$

we manage to show, with the help of a somewhat tedious Taylor expansion, that¹¹ $\widetilde H(q^{n+1}, p^{n+1}) = \widetilde H(q^n, p^n) + O(\Delta t^3)$. Thus, finally,

$$\big|\widetilde H(q^n, p^n) - \widetilde H(q^0, p^0)\big| \le C\,\Delta t^2.$$

In the case of the linear pendulum, this approximated Hamiltonian $\widetilde H$ is, in fact, exactly preserved. Indeed, the scheme can be written in the form

$$\begin{pmatrix}1 & 0\\ \Delta t & 1\end{pmatrix}\Theta^{n+1} = \begin{pmatrix}1 & \Delta t\\ 0 & 1\end{pmatrix}\Theta^n \qquad [1.35]$$

11 We let O(h) denote a remainder term that can be bounded above by Ch, for some constant C.


or, developing, $\xi^{n+1} = \xi^n - \Delta t(\theta^n + \Delta t\,\xi^n) = (1-\Delta t^2)\,\xi^n - \Delta t\,\theta^n$, that is,

$$\Theta^{n+1} = S\,\Theta^n,\qquad S = \begin{pmatrix}1 & \Delta t\\ -\Delta t & 1-\Delta t^2\end{pmatrix}. \qquad [1.36]$$
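Two facts about this scheme can be checked numerically: the matrix S above satisfies $S^\top J S = J$ (and so det S = 1), and for the linear pendulum the quantity $\theta^2 + \xi^2 + \Delta t\,\theta\xi$ is preserved exactly, while for the nonlinear pendulum the modified energy $\widetilde H = H + \frac{\Delta t}{2}\,\xi\sin\theta$ only drifts at order $\Delta t^2$. The sketch below is an illustration under arbitrary choices of step sizes, horizon and initial data, not a definitive test.

```python
import numpy as np

dt = 0.01
J = np.array([[0.0, 1.0], [-1.0, 0.0]])
S = np.array([[1.0, dt], [-dt, 1.0 - dt**2]])

# symplecticity: S^T J S = J
sympl_err = np.abs(S.T @ J @ S - J).max()

# linear pendulum: Htilde = (theta^2 + xi^2 + dt*theta*xi)/2 is exactly preserved
def htilde_lin(th, xi):
    return 0.5 * (th**2 + xi**2) + 0.5 * dt * th * xi

th, xi = 1.0, 0.0
h0, lin_drift = htilde_lin(th, xi), 0.0
for _ in range(1000):
    th, xi = S @ np.array([th, xi])
    lin_drift = max(lin_drift, abs(htilde_lin(th, xi) - h0))

# nonlinear pendulum: Htilde = xi^2/2 + (1 - cos theta) + (step/2)*xi*sin(theta)
def nl_drift(step):
    def htilde(th, xi):
        return 0.5 * xi**2 + (1.0 - np.cos(th)) + 0.5 * step * xi * np.sin(th)
    th, xi = 1.0, 0.0
    h0, drift = htilde(th, xi), 0.0
    for _ in range(int(10.0 / step)):
        th = th + step * xi               # symplectic Euler, position first
        xi = xi - step * np.sin(th)       # then momentum, at the new position
        drift = max(drift, abs(htilde(th, xi) - h0))
    return drift

d1, d2 = nl_drift(0.01), nl_drift(0.005)
print(f"sympl_err={sympl_err:.1e}, lin_drift={lin_drift:.1e}, "
      f"nonlinear Htilde drift: dt=0.01 -> {d1:.2e}, dt=0.005 -> {d2:.2e}")
```

Halving the step should divide the nonlinear drift by roughly four, consistent with the $C\Delta t^2$ bound stated above.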

An interesting fact is that the matrix S satisfies $S^\top J S = J$: we say that it is a symplectic matrix and that the scheme [1.36] is symplectic. Moreover, writing [1.35] as $A\Theta^{n+1} = B\Theta^n$ with

$$A = \begin{pmatrix}1 & 0\\ \Delta t & 1\end{pmatrix},\qquad B = \begin{pmatrix}1 & \Delta t\\ 0 & 1\end{pmatrix} = A^\top,$$

we compute

$$A\Theta^{n+1}\cdot\Theta^{n+1} = B\Theta^n\cdot\Theta^{n+1} = \Theta^n\cdot B^\top\Theta^{n+1} = \Theta^n\cdot A\Theta^{n+1} = \Theta^n\cdot B\Theta^n,$$

that is to say,

$$|\theta^{n+1}|^2 + |\xi^{n+1}|^2 + \Delta t\,\theta^{n+1}\xi^{n+1} = |\theta^n|^2 + |\xi^n|^2 + \Delta t\,\theta^n\xi^n = 2\widetilde H(\theta^n, \xi^n),$$

so that $\widetilde H(\theta^{n+1}, \xi^{n+1}) = \widetilde H(\theta^n, \xi^n)$. Therefore, the analysis of numerical schemes for Hamiltonian systems relies on an adequate understanding of the properties of symplectic matrices.

1.3.2. Symplectic matrices; symplectic schemes

We begin by presenting antisymmetric bilinear forms and matrices.

DEFINITION 1.7.– Let $f : \mathbb{C}^N\times\mathbb{C}^N\to\mathbb{C}$ be a bilinear form. We say that:
– f is alternating if for all x ∈ C^N, f(x, x) = 0;
– f is antisymmetric if for all x, y ∈ C^N, f(x, y) = −f(y, x);
– an antisymmetric form f is non-degenerate if for all x ∈ C^N \ {0}, the set {y ∈ C^N, f(x, y) = 0} is not all of C^N.

LEMMA 1.12.– An alternating form is antisymmetric.

PROOF.– We compute 0 = f(x + y, x + y) = f(x, x) + f(y, y) + f(x, y) + f(y, x) = f(x, y) + f(y, x), hence f(x, y) = −f(y, x). □

PROPOSITION 1.7.– Let A ∈ M_N(C) be an antisymmetric matrix: Aᵀ = −A. We can associate it with the alternating form

$$f(x, y) = \sum_{j,k=1}^{N} A_{j,k}\,x_k\,y_j.$$


Reciprocally, given an alternating form and a basis (a₁, ..., a_N), we associate with them an antisymmetric matrix A ∈ M_N(C) that represents the form in that basis, through the relation A_{j,k} = f(a_j, a_k). We say the matrix A is non-degenerate when the associated form f is non-degenerate, which is equivalent to saying that 0 is not an eigenvalue of A. If (b₁, ..., b_N) is another basis of C^N, and B is the matrix associated with f in that basis, then there exists an invertible matrix P such that B = PAPᵀ.

PROOF.– Let P be the change-of-basis matrix, such that $b_j = \sum_{k=1}^N P_{j,k}\,a_k$. A vector x ∈ C^N is expressed in the form

$$x = \sum_{k=1}^N \alpha_k a_k = \sum_{j=1}^N \beta_j b_j = \sum_{k,j=1}^N \beta_j P_{j,k}\,a_k.$$

We therefore have $\alpha_k = \sum_{j=1}^N P_{j,k}\beta_j$. So

$$f(x, x') = \sum_{j,k=1}^N A_{j,k}\,\alpha'_k\,\alpha_j = \sum_{j,k,\ell,m=1}^N A_{j,k}\,P_{\ell,k}\,\beta'_\ell\,P_{m,j}\,\beta_m = \sum_{\ell,m=1}^N B_{m,\ell}\,\beta'_\ell\,\beta_m.$$

We infer that

$$B_{m,\ell} = \sum_{j,k=1}^N P_{m,j}\,A_{j,k}\,P_{\ell,k},$$

that is to say, B = PAPᵀ. □

LEMMA 1.13.– Let A ∈ M_N(C) be an antisymmetric matrix.
i) If N = 2p + 1 is odd, then det(A) = 0;
ii) If N = 2p is even and the coefficients of A are real, then det(A) ≥ 0, and det(A) > 0 when A is non-degenerate;
iii) If N = 2p is even, then det(A) can be written as the square of a homogeneous polynomial of degree p = N/2 in the coefficients of A, called the Pfaffian of A.


PROOF.– If A is antisymmetric, we have det(A) = det(Aᵀ) = det(−A) = (−1)^N det(A). When N is odd, this relation implies that det(A) = 0. When N = 2p and A is real, its eigenvalues are purely imaginary and pairwise conjugate. Since $\lambda\bar\lambda = |\lambda|^2 \ge 0$, we obtain $\det(A) = \prod_{\lambda\in\sigma(A)}\lambda \ge 0$. If A is non-degenerate, then 0 is not in its spectrum and therefore det(A) > 0. Clearly, the subtle point here is to generalize the statement and to justify iii). This will involve showing some intermediate results. □

LEMMA 1.14.– Let A ∈ M_{2p}(C) be an antisymmetric matrix. If A is invertible, then there exists an invertible matrix P such that

$$A = P^\top\begin{pmatrix}J & & 0\\ & \ddots & \\ 0 & & J\end{pmatrix}P,\qquad J = \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}.$$

More generally, if A ∈ M_N(C) is an antisymmetric matrix of rank 2p, then there exists an invertible matrix P such that

$$A = P^\top\begin{pmatrix}J & & & \\ & \ddots & & \\ & & J & \\ & & & 0\end{pmatrix}P,$$

where the set of non-zero blocks forms a sub-matrix of size 2p × 2p.

PROOF.– If A is exactly zero, then the result is clearly satisfied. We therefore assume that A ≠ 0 and consider the associated alternating form on C^N × C^N. We will see that, by a change of basis, we can express f using the matrix J. The form f is not identically zero, so there exist v, w ∈ C^N such that f(v, w) ≠ 0. We can choose


these vectors in such a way that f(v, w) = −f(w, v) = 1. Let x = αv + βw, with α, β ∈ C. We obtain f(x, v) = f(αv + βw, v) = βf(w, v) = −β and f(x, w) = f(αv + βw, w) = αf(v, w) = α. In particular, we infer that {v, w} is linearly independent: if x = αv + βw = 0, then α = f(x, w) = f(0, w) = 0 and β = −f(x, v) = −f(0, v) = 0.

We write E₁ = Span{v, w} and Y₁ = {y ∈ C^N, f(y, x) = 0 for all x ∈ E₁}. The restriction $f\big|_{E_1\times E_1}$ is associated, relative to the basis (v, w), with the matrix

$$\begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}.$$

We will show that E₁ ⊕ Y₁ = C^N. Let y ∈ E₁ ∩ Y₁. On the one hand, we have f(y, v) = f(y, w) = 0, since y ∈ Y₁; on the other hand, since y ∈ E₁, y = f(y, w)v − f(y, v)w. We conclude that y = 0, so E₁ ∩ Y₁ = {0}. Let u ∈ C^N. We write x = f(u, w)v − f(u, v)w ∈ E₁ and let y = u − x. Note that f(y, w) = f(u, w) − f(x, w) = f(u, w) − f(u, w)f(v, w) = 0 and f(y, v) = f(u, v) − f(x, v) = f(u, v) + f(u, v)f(w, v) = 0, that is to say, y ∈ Y₁. This proves that any vector of C^N can be uniquely decomposed as the sum of an element from E₁ and an element from Y₁.

Let (y₁, ..., y_{N−2}) be a basis of Y₁. The alternating form f can be expressed in the basis (v, w, y₁, ..., y_{N−2}) using the matrix

$$\begin{pmatrix}0 & 1 & 0\cdots 0\\ -1 & 0 & 0\cdots 0\\ 0 & 0 & A_1\end{pmatrix}$$

where A₁ is an antisymmetric matrix of M_{N−2}(C). We distinguish between two situations. Either A₁ = 0 and the proof is finished, or A₁ ≠ 0 represents an alternating bilinear form $f_1 = f\big|_{Y_1\times Y_1}$ on a space of dimension N − 2, to which we can meaningfully reapply these arguments. If the rank of A is N = 2p, then A is invertible, as is A₁. The construction can be repeated p times and allows us to decompose C^N = E₁ ⊕ E₂ ⊕ ... ⊕ E_p, with


E_k = Span{v_k, w_k}, f(v_k, w_k) = −f(w_k, v_k) = 1. In the basis (v₁, w₁, ..., v_p, w_p), the matrix associated with f is indeed

$$\begin{pmatrix}0 & 1 & & & \\ -1 & 0 & & & \\ & & \ddots & & \\ & & & 0 & 1\\ & & & -1 & 0\end{pmatrix}.$$

If rank(A) = 2p < N, then the construction stops after p steps and gives way to a decomposition of the form $\mathbb{C}^N = E_1\oplus\ldots\oplus E_p\oplus\widetilde Y$, with E_k = Span{v_k, w_k}, f(v_k, w_k) = −f(w_k, v_k) = 1 and $f\big|_{\widetilde Y\times\widetilde Y} = 0$. We therefore construct a basis of C^N in which f is expressed with the matrix

$$\begin{pmatrix}J & & & \\ & \ddots & & \\ & & J & \\ & & & 0\end{pmatrix},\qquad J = \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix}.$$

It is important to recall that P is not necessarily an orthogonal matrix: in general, P⁻¹ ≠ Pᵀ. □

Let A ∈ M_{2p}(C) be an antisymmetric, invertible matrix. Let 𝕁 denote the matrix of M_{2p}(C) whose diagonal blocks are copies of the matrix J. We write A = Pᵀ𝕁P. It follows that det(A) = (det(P))² det(𝕁) = (det(P))². Therefore, the determinant of A can be written as the square of a function of the coefficients of A, this function being the determinant of the change-of-basis matrix P. We will see that this function depends polynomially on the coefficients of the matrix A. In order to demonstrate this property, we revisit the construction from a slightly different perspective. We let 𝒜 = {a_{i,j}, j > i} denote the set of N(N − 1)/2 coefficients that determine the matrix A (since it is antisymmetric). The determinant det(A) is a polynomial with integer coefficients in the elements of 𝒜:


det(A) ∈ Z[𝒜]. Therefore, we can interpret the matrix A as a matrix with coefficients in the field K of rational fractions with rational coefficients, whose indeterminates are the elements of 𝒜. The resulting matrix P is an element of GL(K); its coefficients are elements of K. The determinant det(P) is a rational fraction p/q, where p and q are polynomials in the elements of 𝒜 with rational coefficients. Finally, we will use the following statement.

LEMMA 1.15.– Let m be a polynomial over K that can be written as the square of a rational fraction r. Then, r is, in fact, a polynomial.

PROOF.– We begin by studying polynomials and rational fractions with one indeterminate. The proof takes its inspiration from the demonstration of Gauss's lemma in arithmetic. We write r(t) = p(t)/q(t), where p and q have no common roots, and q is unitary. Then, Bezout's theorem (see [ARN 87, Th. VII.2.4] or [PER 96, Th. 3.25 & Cor. 3.26]) ensures the existence of two polynomials u and v such that p(t)u(t) + q(t)v(t) = 1. Since p(t)² = m(t)q(t)², it follows that

$$p(t) = \big(u(t)p(t) + v(t)q(t)\big)\,p(t) = u(t)m(t)q(t)^2 + v(t)p(t)q(t) = q(t)\big(u(t)m(t)q(t) + v(t)p(t)\big),$$

which proves that q divides p. Since p and q have no common roots, it follows that q(t) = 1, so r = p is a polynomial. Given this result in one variable, we generalize it for M > 1 variables. Indeed, keeping the variables m_j, j ≠ i, fixed, we apply this reasoning to the function φ_i : μ ↦ r(m₁, ..., m_{i−1}, μ, m_{i+1}, ..., m_M): it is a polynomial in the variable μ. In particular, there exists an integer N_i such that $\frac{d^{N_i}}{d\mu^{N_i}}\varphi_i = 0 = \partial^{N_i}_{m_i} r$. This conclusion is valid for all i ∈ {1, ..., M}. We infer that r is a polynomial in the M variables m₁, ..., m_M. □

THEOREM 1.15.– Let N = 2p be an even integer. For any antisymmetric matrix A ∈ M_N(C), there exists a unique polynomial Pf with integer coefficients such that det(A) = (Pf(A))² and, in the case A = 𝕁 (the block-diagonal matrix with blocks J), Pf(𝕁) = 1. Moreover, for any matrix M ∈ M_N(C), we have Pf(MᵀAM) = Pf(A) det(M).

PROOF.– If A is antisymmetric, then for any matrix M ∈ M_N(C), the matrix MᵀAM is also antisymmetric, because (MᵀAM)ᵀ = MᵀAᵀM = −MᵀAM. We can therefore evaluate the Pfaffian of MᵀAM, to obtain det(MᵀAM) = (det(M))² det(A) = (det(M))² (Pf(A))² = (Pf(MᵀAM))². It follows that Pf(MᵀAM) = ±det(M)Pf(A); since both sides depend polynomially on the coefficients of M and the relation holds with the + sign for M = I, we conclude that Pf(MᵀAM) = +det(M)Pf(A). □

DEFINITION 1.8.– We say that a real matrix S ∈ M_{2d}(R) is symplectic if it satisfies

$$S^\top\mathbb{J}S = \mathbb{J}. \qquad [1.37]$$

PROPOSITION 1.8.– The determinant of a symplectic matrix is equal to 1.
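In dimension 4, the Pfaffian has the explicit form Pf(A) = a₁₂a₃₄ − a₁₃a₂₄ + a₁₄a₂₃ (a standard closed formula, stated here as an aside rather than taken from the text). The sketch below checks det(A) = Pf(A)², the normalization Pf(𝕁) = 1 and the transformation rule Pf(MᵀAM) = det(M)Pf(A) on random matrices.

```python
import numpy as np

def pf4(A):
    # Pfaffian of a 4x4 antisymmetric matrix (1-indexed formula a12*a34 - a13*a24 + a14*a23)
    return A[0, 1] * A[2, 3] - A[0, 2] * A[1, 3] + A[0, 3] * A[1, 2]

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B - B.T                         # antisymmetric
M = rng.standard_normal((4, 4))

# block-diagonal J (two copies of [[0, 1], [-1, 0]])
Jcal = np.zeros((4, 4))
Jcal[0, 1] = Jcal[2, 3] = 1.0
Jcal[1, 0] = Jcal[3, 2] = -1.0

err_det = abs(np.linalg.det(A) - pf4(A)**2)
err_transf = abs(pf4(M.T @ A @ M) - np.linalg.det(M) * pf4(A))
print(f"Pf(Jcal)={pf4(Jcal)}, |det(A)-Pf(A)^2|={err_det:.2e}, "
      f"|Pf(M^T A M)-det(M) Pf(A)|={err_transf:.2e}")
```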


This follows from the definition: a symplectic matrix S satisfies $\det(S^\top\mathbb{J}S) = \det(\mathbb{J}) = \det(\mathbb{J})\,(\det(S))^2$, so $(\det(S))^2 = 1$. This implies that S is invertible. It remains to show that the determinant cannot be negative. Several approaches are possible, which are all equally interesting. We can provide an argument based on the Pfaffian: since $S^\top\mathbb{J}S = \mathbb{J}$, we have $1 = \mathrm{Pf}(\mathbb{J}) = \mathrm{Pf}(S^\top\mathbb{J}S) = \det(S)\,\mathrm{Pf}(\mathbb{J}) = \det(S)$. A different approach involves analyzing the spectral properties of symplectic matrices.

LEMMA 1.16.– Let S be a symplectic matrix. Then: i) Sᵀ is also symplectic; ii) S and S⁻¹ = −𝕁Sᵀ𝕁 are similar.

PROOF.– We invert the relation [1.37]:

$$\mathbb{J}^{-1} = -\mathbb{J} = (S^\top\mathbb{J}S)^{-1} = S^{-1}\mathbb{J}^{-1}(S^{-1})^\top = -S^{-1}\mathbb{J}(S^{-1})^\top.$$

It follows that $S\mathbb{J}S^\top = S\big(S^{-1}\mathbb{J}(S^{-1})^\top\big)S^\top = \mathbb{J}$, which proves that Sᵀ is symplectic. However, we also have $S^{-1} = \mathbb{J}^{-1}\mathbb{J}S^{-1} = \mathbb{J}^{-1}(S^\top\mathbb{J}S)S^{-1} = \mathbb{J}^{-1}S^\top\mathbb{J}$. So S⁻¹ is similar to Sᵀ. The following general statement allows us to conclude ii). □

LEMMA 1.17.– Let M ∈ M_N(C). Then, M and Mᵀ are similar: there exists an invertible matrix P ∈ M_N(C) such that M = PMᵀP⁻¹. The result is also true for M ∈ M_N(R), with a real-coefficient change-of-basis matrix P.

PROOF.– The proof uses the Jordan form: for M ∈ M_N(C), there exists an invertible matrix Q ∈ M_N(C) such that M = QJ_MQ⁻¹, where J_M is block diagonal,

$$J_M = \begin{pmatrix}J_1 & & 0\\ & \ddots & \\ 0 & & J_r\end{pmatrix},$$


with each J_k of the form

$$J_k = \begin{pmatrix}\lambda & 1 & & 0\\ & \lambda & \ddots & \\ & & \ddots & 1\\ 0 & & & \lambda\end{pmatrix},$$

where λ is an eigenvalue of M. The transposition operation is equivalent to conjugating by the permutation matrix T with ones on the anti-diagonal (T_{i,j} = 1 if i + j = N + 1, and 0 otherwise), that is to say, $J_M^\top = TJ_MT$. Note that T⁻¹ = T. Therefore,

$$M^\top = (Q^{-1})^\top J_M^\top Q^\top = (Q^{-1})^\top TJ_MTQ^\top = (Q^{-1})^\top TQ^{-1}\,M\,QTQ^\top.$$

We write P = QTQᵀ, in such a way that Mᵀ = P⁻¹MP.

This result applies when the matrix M has real coefficients, but the resulting matrix P can have complex coefficients. We can still transform P in order to construct a real-valued change-of-basis matrix. Indeed, let A and B be two real matrices such that A = PBP⁻¹, with P ∈ M_N(C). We decompose P = P_R + iP_I, where P_R and P_I have real coefficients. We have AP = PB. Therefore, by separating real and imaginary parts, AP_R = P_RB and AP_I = P_IB. We seek to construct a new change-of-basis matrix P̃ as a linear combination of P_R and P_I. We introduce the function f : z ∈ C ↦ f(z) = det(P_R + zP_I). This function is a polynomial with real coefficients, which is not identically zero because f(i) = det(P) ≠ 0. There exists a real number t such that f(t) ≠ 0. We write P̃ = P_R + tP_I. This matrix with real coefficients is invertible and satisfies AP̃ = P̃B. □

NOTE.– Note that two matrices with the same characteristic polynomial and the same minimal polynomial are not necessarily similar. Indeed, for matrices whose dimension


N is greater than 3, it is possible to find matrices whose Jordan forms are not the same. For example,

$$A = \begin{pmatrix}\lambda & 1 & 0 & 0\\ 0 & \lambda & 0 & 0\\ 0 & 0 & \lambda & 1\\ 0 & 0 & 0 & \lambda\end{pmatrix} \qquad\text{and}\qquad B = \begin{pmatrix}\lambda & 0 & 0 & 0\\ 0 & \lambda & 0 & 0\\ 0 & 0 & \lambda & 1\\ 0 & 0 & 0 & \lambda\end{pmatrix}.$$

For the above two matrices, the characteristic polynomial is (X − λ)⁴ and the minimal polynomial is (X − λ)². However, dim Ker(A − λI) = 2 while dim Ker(B − λI) = 3, so they are not similar. By finding the criteria that allow us to decide whether two matrices are similar, notions of similarity invariants can be introduced (see section 6.3 in [SER 01]).

LEMMA 1.18.– Let S be a symplectic matrix and λ an eigenvalue of S. Then, $\bar\lambda$, $1/\lambda$ and $1/\bar\lambda$ are also eigenvalues of S.

PROOF.– We have seen that a symplectic matrix is invertible, so λ ≠ 0. The matrix S is real, so its complex eigenvalues come in conjugate pairs. The eigenvalues of S⁻¹ are of the form 1/λ, with λ an eigenvalue of S. Now, S and S⁻¹ are similar, so they have the same characteristic polynomial, and 1/λ, an eigenvalue of S⁻¹, is also an eigenvalue of S. □

COROLLARY 1.9.– Let P_S be the characteristic polynomial of a symplectic matrix S ∈ M_{2d}(R). Then, for all λ ∈ C \ {0}, we have det(S) P_S(λ) = λ^{2d} P_S(1/λ).

PROOF.– We use the fact that S and S⁻¹ have the same characteristic polynomial:

$$P_S(\lambda) = \det(S-\lambda I) = \det(S^{-1}-\lambda I) = \det(S^{-1}-\lambda S^{-1}S) = \det(S^{-1})\det(I-\lambda S)$$
$$= \frac{1}{\det(S)}\det\Big(\lambda\Big(\frac1\lambda I - S\Big)\Big) = \frac{\lambda^{2d}}{\det(S)}\det\Big(\frac1\lambda I - S\Big) = \frac{1}{\det(S)}\,\lambda^{2d}\,P_S(1/\lambda). \quad\square$$
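The spectral reciprocity of lemma 1.18 is easy to observe numerically. The sketch below (an illustration with arbitrary random data) builds a 4×4 symplectic matrix as a product of two "shear" factors of the form [[I, B], [0, I]] and [[I, 0], [C, I]] with B, C symmetric, each of which is readily checked to satisfy [1.37], and then verifies that the spectrum is stable under λ ↦ 1/λ.

```python
import numpy as np

rng = np.random.default_rng(2)
I2, Z2 = np.eye(2), np.zeros((2, 2))
Jcal = np.block([[Z2, I2], [-I2, Z2]])

B = rng.standard_normal((2, 2)); B = B + B.T   # symmetric
C = rng.standard_normal((2, 2)); C = C + C.T   # symmetric

# product of two symplectic shears is symplectic
S = np.block([[I2, B], [Z2, I2]]) @ np.block([[I2, Z2], [C, I2]])

sympl_err = np.abs(S.T @ Jcal @ S - Jcal).max()
eigs = np.linalg.eigvals(S)
# every eigenvalue's reciprocal is again (numerically) an eigenvalue
recip_err = max(min(abs(1.0 / lam - mu) for mu in eigs) for lam in eigs)
print(f"sympl_err={sympl_err:.1e}, det(S)={np.linalg.det(S):.6f}, recip_err={recip_err:.1e}")
```

The determinant comes out as +1, in agreement with proposition 1.8.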



For the analysis, we introduce the following vector subspaces: for λ an eigenvalue of S and k ∈ N \ {0}, we define N_k(λ) = Ker(S − λI)^k, which are just the generalized eigenspaces of S. In particular, N_k(λ) ⊂ N_{k+1}(λ), and there exists k_λ ∈ N \ {0}, the order of algebraic multiplicity of the eigenvalue λ, such that $N(\lambda) = \bigcup_k N_k(\lambda) = N_{k_\lambda}(\lambda)$. For any eigenvalue λ of S, 1/λ is an eigenvalue of S⁻¹, and we note $\widetilde N_k(\lambda) = \mathrm{Ker}\big(S^{-1} - \frac1\lambda I\big)^k$ as well as $\widetilde N(\lambda) = \bigcup_k \widetilde N_k(\lambda)$. Finally, we write

$$f(x, y) = \mathbb{J}x\cdot y = \sum_{k,\ell}\mathbb{J}_{k,\ell}\,x_\ell\,y_k.$$

LEMMA 1.19.– For all eigenvalues λ and μ of S such that μ ≠ 1/λ, we have f(N₁(λ), N₁(μ)) = 0.

PROOF.– Let x ∈ N₁(λ) and y ∈ N₁(μ): Sx = λx and Sy = μy. We therefore also have $S^{-1}x = \frac1\lambda x$. Since S is symplectic, the computation of the bilinear form f leads to

$$\mu f(x, y) = f(x, \mu y) = f(x, Sy) = S^\top\mathbb{J}x\cdot y = S^\top\mathbb{J}S\,\frac{x}{\lambda}\cdot y = \frac1\lambda\,\mathbb{J}x\cdot y = \frac1\lambda f(x, y).$$

In other words, $(\mu - \frac1\lambda)f(x, y) = 0$. Since μ ≠ 1/λ, it follows that f(x, y) = 0. □

LEMMA 1.20.– For each eigenvalue λ of S and k ∈ N \ {0}, we have $N_k(\lambda) = \widetilde N_k(\lambda)$.

PROOF.– If x is an eigenvector of S for the eigenvalue λ, then Sx = λx also leads to $S^{-1}x = \frac1\lambda x$. We infer that $N_1(\lambda)\subset\widetilde N_1(\lambda)$. However, S and S⁻¹ play symmetric roles, so the opposite inclusion holds as well. We conclude that $N_1(\lambda) = \widetilde N_1(\lambda)$.

Suppose that the result is obtained for an integer k. We consider x ∈ N_{k+1}(λ), in such a way that $(S-\lambda I)^{k+1}x = (S-\lambda I)^k(S-\lambda I)x = 0$, which shows that $(S-\lambda I)x\in N_k(\lambda)$. So, the equality $N_k(\lambda) = \widetilde N_k(\lambda)$ leads to $\big(S^{-1}-\frac1\lambda I\big)^k(S-\lambda I)x = 0 = (S-\lambda I)\big(S^{-1}-\frac1\lambda I\big)^k x$. Therefore, we have $\big(S^{-1}-\frac1\lambda I\big)^k x\in N_1(\lambda) = \widetilde N_1(\lambda)$, that is to say, $\big(S^{-1}-\frac1\lambda I\big)\big(S^{-1}-\frac1\lambda I\big)^k x = 0 = \big(S^{-1}-\frac1\lambda I\big)^{k+1}x$. We have shown that $N_{k+1}(\lambda)\subset\widetilde N_{k+1}(\lambda)$. We obtain the equality of those sets by noting that S and S⁻¹ play symmetric roles. □

COROLLARY 1.10.– For all eigenvalues λ and μ of S such that μ ≠ 1/λ, we have f(N(λ), N(μ)) = 0.

PROOF.– We will show this result by induction. Lemma 1.19 gives us the identity f(N_k(λ), N_k(μ)) = 0 for k = 1. Let us then suppose that f(N_k(λ), N_k(μ)) = 0; we will show the heredity of the property. Consider x ∈ N_{k+1}(λ) and y ∈ N_k(μ): $(S-\lambda I)^{k+1}x = 0 = (S-\mu I)^k y$. We have

$$f\big(x, (S-\mu I)^k y\big) = 0 = f\Big(x, \Big(\big(S-\tfrac1\lambda I\big) + \big(\tfrac1\lambda-\mu\big)I\Big)^k y\Big) = \sum_{j=0}^k C_k^j\Big(\frac1\lambda-\mu\Big)^{k-j} f\Big(x, \big(S-\tfrac1\lambda I\big)^j y\Big).$$

Now, by using the relation $S^\top\mathbb{J} = \mathbb{J}S^{-1}$, we show that $f\big(x, (S-\frac1\lambda I)^j y\big) = \mathbb{J}x\cdot(S-\frac1\lambda I)^j y = \big(S^\top-\frac1\lambda I\big)^j\mathbb{J}x\cdot y = \mathbb{J}\big(S^{-1}-\frac1\lambda I\big)^j x\cdot y = f\big((S^{-1}-\frac1\lambda I)^j x, y\big)$. Therefore, we obtain

$$f\big(x, (S-\mu I)^k y\big) = 0 = \sum_{j=0}^k C_k^j\Big(\frac1\lambda-\mu\Big)^{k-j} f\Big(\big(S^{-1}-\tfrac1\lambda I\big)^j x,\ y\Big).$$

However, using lemma 1.20, $x\in N_{k+1}(\lambda) = \widetilde N_{k+1}(\lambda)$ implies that

$$\Big(S^{-1}-\frac1\lambda I\Big)^{k+1-j}\Big(S^{-1}-\frac1\lambda I\Big)^j x = 0.$$

Therefore, for all j ∈ {1, ..., k}, we have $\big(S^{-1}-\frac1\lambda I\big)^j x\in\widetilde N_{k+1-j}(\lambda)\subset\widetilde N_k(\lambda) = N_k(\lambda)$. Using the recurrence hypothesis, all the terms with j ≥ 1 vanish and we obtain

$$f\big(x, (S-\mu I)^k y\big) = 0 = \Big(\frac1\lambda-\mu\Big)^k f(x, y).$$

Since μ ≠ 1/λ, we conclude that f(x, y) = 0 and, more generally, that f(N_{k+1}(λ), N_k(μ)) = 0. With this intermediate result obtained, we can repeat the same argument in order to finally show that f(N_{k+1}(λ), N_{k+1}(μ)) = 0. □

Since m = 1/λ, we conclude that f (x, y) = 0 and, more generally, that f (Nk+1 (λ), Nk (μ)) = 0. With this intermediate result obtained, we can repeat the same argument in order to finally show that f (Nk+1 (λ), Nk+1 (μ)) = 0.  C OROLLARY 1.11.– If ±1 is an eigenvalue of S, then N (±1) has an even dimension and the order of algebraic multiplicity of ±1 is even. P ROOF.– Jordan’s theorem makes it possible to decompose R2d into direct sums of generalized eigenspaces, where we distinguish the spaces N (λ), N (1/λ) for λ = ±1 and  N (±1). Let us suppose that ±1 is an eigenvalue of S. Then, the restriction g = f N(±1)×N(±1) defines an alternating bilinear form on N (±1). Suppose that g is


degenerate: there exists x ≠ 0 in N(±1) such that f(x, N(±1)) = 0. However, this implies that f is degenerate on R^{2d}, because every vector y ∈ R^{2d} can be written in the form y₁ + y_⊥, where y₁ ∈ N(±1) and y_⊥ belongs to the sum of the other generalized eigenspaces, in such a way that, using corollary 1.10, f(x, y) = f(x, y₁ + y_⊥) = f(x, y_⊥) = 0. This conclusion is absurd, so g is a non-degenerate alternating bilinear form on N(±1). Its associated matrix has a non-zero determinant, which implies that dim(N(±1)) is even. □

Finally, it remains to recall that det(S), for which we have just established |det(S)| = 1, is expressed as the product of the eigenvalues of S. If λ ∈ C \ {±1} is an eigenvalue, then $\bar\lambda$ is also one, and the product of those eigenvalues, |λ|², is strictly positive. If ±1 is an eigenvalue, then its order of multiplicity is even. We conclude that det(S) > 0 and therefore det(S) = +1.

NOTE.– The fact that S and S⁻¹ are similar is not enough to reach a conclusion about the parity of the multiplicity of the eigenvalues ±1, as shown by the example of the matrix diag(1, −1).

We now consider a Hamiltonian problem [1.31], which we seek to approximate using a numerical scheme. The example of the linear pendulum shows us that the explicit Euler scheme

$$q^{n+1} = q^n + \Delta t\,\nabla_p H(q^n, p^n),\qquad p^{n+1} = p^n - \Delta t\,\nabla_q H(q^n, p^n)$$

will certainly not have the right qualitative properties, even though it will be convergent. This is also true for the implicit Euler scheme

$$q^{n+1} = q^n + \Delta t\,\nabla_p H(q^{n+1}, p^{n+1}),\qquad p^{n+1} = p^n - \Delta t\,\nabla_q H(q^{n+1}, p^{n+1}),$$

for which we additionally face the difficulty of inverting a nonlinear system of equations. In order to construct better performing schemes, we will seek to reproduce the properties of the continuous problem at the discrete level (see proposition 1.5). We therefore write the numerical scheme in an abstract form

$$X = (p, q),\qquad X^{n+1} = \Psi_{\Delta t}(X^n).$$

We let $\nabla_X\Psi_{\Delta t}$ denote the Jacobian matrix of that mapping. The symplectic structure of the problem is preserved if $\nabla_X\Psi_{\Delta t}$ is a symplectic matrix, that is, if it satisfies $(\nabla_X\Psi_{\Delta t})^\top\,\mathbb{J}\,\nabla_X\Psi_{\Delta t} = \mathbb{J}$. In light of proposition 1.8, in this case, we have $\det(\nabla_X\Psi_{\Delta t}) = 1$. Such a scheme is called a symplectic scheme.


The symplectic Euler scheme, which we have already described for the pendulum problem, takes the general form

$$q^{n+1} = q^n + \Delta t\,\nabla_p H(q^{n+1}, p^n),\qquad p^{n+1} = p^n - \Delta t\,\nabla_q H(q^{n+1}, p^n).$$

We have seen that this scheme is, in fact, explicit in the case where H(q, p) = h(p) + Φ(q). In the general case, for fixed (q, p) and Δt > 0, we can apply the implicit function theorem 1.4 and define (q′, p′) = Ψ_Δt(q, p), the solution of

$$q' = q + \Delta t\,\nabla_p H(q', p),\qquad p' = p - \Delta t\,\nabla_q H(q', p).$$

We differentiate with respect to the data to obtain the relations

$$\nabla_q q' = I - \Delta t\,D^2_{pq}H\,\nabla_q q',\qquad \nabla_p q' = -\Delta t\big(D^2_{pp}H + D^2_{pq}H\,\nabla_p q'\big),$$
$$\nabla_q p' = \Delta t\,D^2_{qq}H\,\nabla_q q',\qquad \nabla_p p' = I + \Delta t\big(D^2_{pq}H + D^2_{qq}H\,\nabla_p q'\big),$$

which we reorganize in the form

$$\nabla_q q' = \big(I + \Delta t\,D^2_{pq}H\big)^{-1},\qquad \nabla_p q' = -\Delta t\,\big(I + \Delta t\,D^2_{pq}H\big)^{-1}D^2_{pp}H,$$
$$\nabla_q p' = \Delta t\,D^2_{qq}H\,\big(I + \Delta t\,D^2_{pq}H\big)^{-1},\qquad \nabla_p p' = I + \Delta t\,D^2_{pq}H - \Delta t^2\,D^2_{qq}H\,\big(I + \Delta t\,D^2_{pq}H\big)^{-1}D^2_{pp}H.$$

We then calculate

$$\begin{pmatrix}\nabla_p p' & \nabla_q p'\\ \nabla_p q' & \nabla_q q'\end{pmatrix}^{\top}\mathbb{J}\begin{pmatrix}\nabla_p p' & \nabla_q p'\\ \nabla_p q' & \nabla_q q'\end{pmatrix} = \begin{pmatrix}0 & A\\ -A & 0\end{pmatrix}$$

with A = ∇_p p′ ∇_q q′ − ∇_q p′ ∇_p q′. By developing this expression, we confirm that A = I, which proves that the scheme is symplectic.

NOTE.– The symplectic Euler scheme is consistent of order 1. There is a fairly simple variant that makes it possible to reach order 2. It is Verlet's scheme, which has the following form: given qⁿ, pⁿ, we write

$$q' = q^n + \frac{\Delta t}{2}\nabla_p H(q', p^n),$$
$$p^{n+1} = p^n - \frac{\Delta t}{2}\Big(\nabla_q H(q', p^n) + \nabla_q H(q', p^{n+1})\Big),$$
$$q^{n+1} = q' + \frac{\Delta t}{2}\nabla_p H(q', p^{n+1}).$$
Ordinary Differential Equations

125

2

In the case where H(q, p) = p2 + Φ(q), by rewriting the scheme using only the 2 variable q, we obtain q n+1 − 2q  + q n = − h2 ∇q Φ(q  ), which is indeed a natural d2 discretization of the second-order equation dt 2 q = −∇q Φ(q). 1.3.3. Kepler problem We describe the evolution of two bodies subject to gravity forces, one of which is far more massive than the other, like a planet and its sun. We construct a model in R3 where the sun is at the origin and t → q(t) ∈ R3 denotes the planet’s position in the reference frame, while t → p(t) ∈ R3 denotes its velocity. The gravity force exerted by the sun on the planet is in the planet–sun direction, its intensity is proportional to the product of their masses, and it derives from a potential that is inversely proportional to the distance between them. The proportionality constant G is the gravitational constant. We therefore obtain the differential system d q = p, dt

m

d q p = −mM G dt |q|3

where M and m are the masses of the heavy and light bodies, respectively. This is a Hamiltonian system associated with H(q, p) =

|p|2 GM − . 2 |q|

In particular, t → H(q(t), p(t)) is preserved in time and the flow is symplectic: the Jacobian of the flow satisfies J  JJ = J. For analyzing the trajectories, a brief review of elementary geometry is required. 1.3.3.1. Review of conic sections and vector product D EFINITION 1.9.– In a plane P, we consider a line D and a point F ∈ / D. Let e > 0. The conic section with directrix D, focus F and eccentricity e refers to the set of points M on the plane P, such that dist(M, F ) = e dist(M, D). For 0 ≤ e < 1, the resulting set is an ellipse (a circle whenever e = 0); in this case, it is a closed and bounded curve. For e > 1, the resulting set is a hyperbola; for e = 1, it is a parabola. Let u, v, w be three vectors in  R3 . Consider the set 3 E = {x ∈ R , x = r1 u + r2 v + r3 w, 0 ≤ rj ≤ 1 . We calculate the volume of E using a change of variables x = (x1 , x2 , x3 ) ∈ E → (r1 , r2 , r3 ) ∈ [0, 1]3 . In this

126

Mathematics for Modeling and Scientific Computing

case, the change of variables is a simple change of basis. We let P denote the matrix whose columns give the coordinates of u, v, w in the canonical basis, respectively. Then, the change of variables is given by

(x₁, x₂, x₃)ᵀ = P (r₁, r₂, r₃)ᵀ

and we have (see [GOU 11, Theorem 4.8])

vol(E) = ∫_E dx₁ dx₂ dx₃ = |det(P)| ∫_{[0,1]³} dr₁ dr₂ dr₃ = |det(P)|.

We write det(P) = det(u, v, w). For fixed u, v, the map w → det(u, v, w) is a linear form. Therefore, there exists a unique vector, denoted by u ∧ v and called the vector product of u and v, such that for all w ∈ R³ we have det(u, v, w) = (u ∧ v) · w. In particular, we have det(u, v, u ∧ v) = |u ∧ v|², which therefore corresponds to the volume of the domain of R³ generated by the vectors u, v and u ∧ v. Let A be the area of the plane domain determined by the vectors u and v, that is, the set {r₁u + r₂v, 0 ≤ r_j ≤ 1}. We therefore obtain det(u, v, u ∧ v) = |u ∧ v|² = A |u ∧ v|, and finally

|u ∧ v| = A = |u| |v| sin(θ),

where θ is the angle formed by the vectors u and v. Thus, for the Kepler problem, the area of the triangle defined by the vectors q(t) and q(t + Δt) is

(1/2) |q(t) ∧ q(t + Δt)| ≈ (1/2) |q(t) ∧ (q(t) + (d/dt)q(t) Δt)| = (1/2) |q(t) ∧ (d/dt)q(t)| Δt,

with an error of order O(Δt²). We evaluate the area swept by the vector q(t) when t runs through the segment [t₀, t₁] with a Riemann sum

Area = lim_{N→∞} [ Σ_{k=0}^{N−1} (1/2) |q(t₀ + k(t₁−t₀)/N) ∧ (d/dt)q(t₀ + k(t₁−t₀)/N)| (t₁−t₀)/N + O((t₁−t₀)/N) ].

Ordinary Differential Equations

127

We infer that this area is, in fact, given by the integral

(1/2) ∫_{t₀}^{t₁} |q(t) ∧ (d/dt)q(t)| dt.
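As a quick sanity check of this swept-area formula, consider the hypothetical example of uniform circular motion, q(t) = (cos t, sin t, 0): then (1/2)|q(t) ∧ (d/dt)q(t)| ≡ 1/2, so the area swept over [t₀, t₁] is (t₁ − t₀)/2. A minimal sketch (this example is not from the text) comparing the Riemann sum with this value:

```python
import math

def cross(u, v):
    # vector product in R^3
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def norm(u):
    return math.sqrt(sum(c*c for c in u))

def swept_area(q, dq, t0, t1, N):
    # Riemann sum for (1/2) * integral of |q(t) ^ q'(t)| dt over [t0, t1]
    h = (t1 - t0) / N
    return 0.5 * h * sum(norm(cross(q(t0 + k*h), dq(t0 + k*h)))
                         for k in range(N))

q  = lambda t: (math.cos(t), math.sin(t), 0.0)
dq = lambda t: (-math.sin(t), math.cos(t), 0.0)

area = swept_area(q, dq, 0.0, 2.0, 100_000)
print(area)  # close to (2 - 0)/2 = 1
```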

NOTE.– A simple, albeit tedious, calculation allows us to identify the coordinates of u ∧ v as a function of those of u and v:

u ∧ v = (u₂v₃ − u₃v₂, −u₁v₃ + u₃v₁, u₁v₂ − u₂v₁).

The analysis of Kepler’s equations relies on identifying invariant quantities. We introduce the kinetic moment (angular momentum) ℓ(t) = q(t) ∧ (d/dt)q(t). Next, with μ = GM, we define the Laplace–Runge–Lenz vector

A(t) = (1/μ) (d/dt)q(t) ∧ ℓ(t) − q(t)/|q(t)|.

We observe that these vectors remain constant in time.

LEMMA 1.21.– We have (d/dt)ℓ(t) = 0 = (d/dt)A(t). Moreover, we have |A|² = 1 + 2E |ℓ|²/μ², where E is the energy E = (1/2)|(d/dt)q(t)|² − μ/|q(t)| = (1/2)|(d/dt)q(0)|² − μ/|q(0)|.

PROOF.– Indeed, we have

(d/dt)ℓ(t) = (d/dt)q(t) ∧ (d/dt)q(t) + q(t) ∧ q̈(t) = −(μ/|q(t)|³) q(t) ∧ q(t) = 0.

This proves that the trajectory remains within a fixed plane and that the area swept by the vector q(t) between two instants t₀ and t₀ + s depends only on s. Next, we use the formula a ∧ (b ∧ c) = (a · c) b − (a · b) c to write

(d/dt)A(t) = (1/μ) q̈(t) ∧ ℓ − (d/dt)q(t)/|q(t)| + (q(t) · (d/dt)q(t)) q(t)/|q(t)|³
= −(1/|q(t)|³) q(t) ∧ (q(t) ∧ (d/dt)q(t)) − (d/dt)q(t)/|q(t)| + (q(t) · (d/dt)q(t)) q(t)/|q(t)|³ = 0.

The norm of that vector is thus evaluated as follows:

|A|² = (1/μ²) |(d/dt)q(t) ∧ ℓ|² + (q(t) · q(t))/|q(t)|² − (2/μ) (q(t)/|q(t)|) · ((d/dt)q(t) ∧ ℓ).


Now, since (d/dt)q(t) and ℓ = q(t) ∧ (d/dt)q(t) are orthogonal, we have |(d/dt)q(t) ∧ ℓ| = |(d/dt)q(t)| |ℓ|, while q(t) · ((d/dt)q(t) ∧ ℓ) = (q(t) ∧ (d/dt)q(t)) · ℓ = ℓ · ℓ = |ℓ|². It follows that

|A|² = (1/μ²) |(d/dt)q(t)|² |ℓ|² + 1 − 2|ℓ|²/(μ|q(t)|)
= 1 + (|ℓ|²/μ²) ( |(d/dt)q(t)|² − 2μ/|q(t)| ) = 1 + 2E |ℓ|²/μ²,

which is therefore a non-negative quantity (even though we can have E < 0). We will, however, distinguish the case where E > 0, and therefore |A| > 1, from the case where E < 0, and thus |A| < 1. □

We then calculate

q(t) · A = (1/μ) q(t) · ((d/dt)q(t) ∧ ℓ) − q(t) · q(t)/|q(t)| = (1/μ) (q(t) ∧ (d/dt)q(t)) · ℓ − |q(t)| = |ℓ|²/μ − |q(t)|,

which leads to the relation

|q(t)| = |A| ( |ℓ|²/(μ|A|) − q(t) · A/|A| ).    [1.38]
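Relation [1.38] and the constancy of ℓ and A can be checked along a numerically integrated trajectory. A sketch, using a classical fourth-order Runge–Kutta integrator (not discussed here) and the data of section 1.3.4, μ = GM = 1, q(0) = (0.65, 0, 0), (d/dt)q(0) = (0, 0.7, 0); the tolerance below reflects the accuracy of this assumed integrator, not an exact identity:

```python
import math

mu = 1.0  # mu = G*M, as in the data used later for the Kepler problem

def cross(u, v):
    return [u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0]]

def rhs(y):
    # y = (q, p) with p = dq/dt; the equation is q'' = -mu q/|q|^3
    q, p = y[:3], y[3:]
    r3 = (q[0]**2 + q[1]**2 + q[2]**2) ** 1.5
    return p + [-mu*c/r3 for c in q]

def rk4_step(y, h):
    k1 = rhs(y)
    k2 = rhs([a + 0.5*h*b for a, b in zip(y, k1)])
    k3 = rhs([a + 0.5*h*b for a, b in zip(y, k2)])
    k4 = rhs([a + h*b for a, b in zip(y, k3)])
    return [a + h/6.0*(b + 2*c + 2*d + e)
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]

def invariants(y):
    q, p = y[:3], y[3:]
    r = math.sqrt(sum(c*c for c in q))
    ell = cross(q, p)                           # kinetic moment
    pxl = cross(p, ell)
    A = [pxl[i]/mu - q[i]/r for i in range(3)]  # Laplace-Runge-Lenz vector
    return ell, A, r

y = [0.65, 0.0, 0.0, 0.0, 0.7, 0.0]             # q(0) and dq/dt(0)
ell0, A0, _ = invariants(y)
ell0_sq = sum(c*c for c in ell0)
max_dev = 0.0
for _ in range(8000):                           # step 2e-4, one orbital period is about 1.51
    y = rk4_step(y, 2e-4)
    ell, A, r = invariants(y)
    q_dot_A = sum(qc*ac for qc, ac in zip(y[:3], A))
    max_dev = max(max_dev,
                  abs(ell[2] - ell0[2]),
                  abs(A[0] - A0[0]), abs(A[1] - A0[1]),
                  abs(r + q_dot_A - ell0_sq/mu))  # relation [1.38], rearranged
print(max_dev)  # small: ell and A are conserved along the flow
```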

It remains to show that this equality is indeed the equation of a conic section. We introduce an orthonormal frame (O, i, j). We identify q(t) with the vector OQ(t), where Q(t) denotes the position of the planet at the instant t in that frame, so that |q(t)| = dist(O, Q(t)). Let F be the point with coordinates (|ℓ|²/(μ|A|), 0). We let D denote the line that passes through F whose direction vector is orthogonal to A. The distance from Q(t) to D is thus given by dist(Q(t), D) = min{|MQ(t)|, M ∈ D} = |OF| + |P(t)Q(t)|, where P(t) is the orthogonal projection of Q(t) on the line that passes through O in the direction orthogonal to A. Therefore,

dist(Q(t), D) = |ℓ|²/(μ|A|) − OQ(t) · A/|A|.

We can also describe Q(t) using the orthonormal basis whose first vector is A/|A|. We write x(t) = q(t) · A/|A| and let y(t) denote the second coordinate of Q(t) in that basis. The relation [1.38] can be written as

√(x(t)² + y(t)²) = |A| (K − x(t)),

with K = |ℓ|²/(μ|A|) = (μ/(2E|A|)) (|A|² − 1). Let z = K |A|²/(|A|² − 1). We write X(t) = x(t) − z and Y(t) = y(t). We obtain

(X(t) + z)² + Y(t)² = |A|² (K − X(t) − z)² = |A|² (X(t) + z/|A|²)²,

which leads to

X(t)² (1 − |A|²)²/(|A|² K²) + Y(t)² (1 − |A|²)/(|A|² K²) = 1.


Indeed, we recognize the Cartesian equation of a conic section (an ellipse for |A| < 1; a hyperbola for |A| > 1).

1.3.4. Numerical results

We compare the performance of different numerical schemes when solving Hamiltonian problems. For this purpose, we use the explicit and implicit Euler schemes, the symplectic Euler scheme, which is a well-adapted variant for Hamiltonian problems, and the second-order Runge–Kutta scheme (RK2). We begin by considering the linear pendulum [1.34], for which we can compare the solutions obtained from these schemes with the exact solution, which is known in this case. As observed, the mass m of the pendulum does not appear in the equation. Moreover, with a change of time scale, we can restrict ourselves to the case g/ℓ = 1. For these tests, the initial values are θ_L,Init = 0.1 and ξ_L,Init = 0. The value of the associated Hamiltonian is thus 0.005. The exact solution is shown in Figure 1.19.

Figure 1.19. Exact solution for the linear pendulum: above, θ_L (bold line) and ξ_L (dotted line) as a function of time; below, the trajectory in the phase plane
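The four schemes compared below can be sketched as follows on the linear pendulum system θ′_L = ξ_L, ξ′_L = −θ_L (a minimal implementation; the Heun form of RK2 and the ordering of the symplectic Euler updates are assumptions, since the text fixes neither here):

```python
def hamiltonian(th, xi):
    # H = (theta^2 + xi^2)/2 for the linear pendulum with g/l = 1
    return 0.5*(th*th + xi*xi)

def euler_explicit(th, xi, h):
    return th + h*xi, xi - h*th

def euler_implicit(th, xi, h):
    # solve the 2x2 linear system (I - hA) y^{n+1} = y^n exactly
    d = 1.0 + h*h
    return (th + h*xi)/d, (xi - h*th)/d

def euler_symplectic(th, xi, h):
    th_new = th + h*xi            # update theta with the old xi ...
    return th_new, xi - h*th_new  # ... then xi with the new theta

def rk2(th, xi, h):               # Heun's method
    k1 = (xi, -th)
    k2 = (xi + h*k1[1], -(th + h*k1[0]))
    return th + 0.5*h*(k1[0] + k2[0]), xi + 0.5*h*(k1[1] + k2[1])

def run(scheme, th, xi, h, nsteps):
    for _ in range(nsteps):
        th, xi = scheme(th, xi, h)
    return hamiltonian(th, xi)

h, T = 0.04, 30.0
n = round(T/h)
H0 = hamiltonian(0.1, 0.0)        # = 0.005
H_exp = run(euler_explicit, 0.1, 0.0, h, n)
H_imp = run(euler_implicit, 0.1, 0.0, h, n)
H_sym = run(euler_symplectic, 0.1, 0.0, h, n)
H_rk2 = run(rk2, 0.1, 0.0, h, n)
print(H0, H_exp, H_imp, H_sym, H_rk2)
```

Running this reproduces the qualitative behavior discussed below: the explicit Euler scheme amplifies the energy, the implicit one damps it, while the symplectic Euler and RK2 schemes stay close to H0.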


Figure 1.20 shows the comparison of the solutions provided by the four schemes up to the final time T = 30, using the same time step Δt = 0.04. The variables are amplified by the explicit Euler scheme and damped by the implicit Euler scheme; the RK2 and symplectic schemes reproduce the solution fairly accurately. These conclusions are confirmed by the phase portraits shown in Figure 1.21, as well as by the evolution of the discrete Hamiltonians in Figure 1.22, which is consistent with the above analysis: an increase in the Hamiltonian with the explicit Euler scheme, a decrease with the implicit Euler scheme, oscillations in the Hamiltonian and preservation of the approximated Hamiltonian with the symplectic scheme; the variations observed with the RK2 scheme seem completely acceptable. Figure 1.23 puts this comparison into perspective: we run the simulation up to the final time T = 60 with the time step Δt = 0.33. The amplification of the Hamiltonian by the RK2 scheme becomes perceptible on such a long time scale, and the deviations with respect to the exact trajectory become clearly visible, even though the method is consistent of order 2. In order to obtain acceptable results, it would be necessary to reduce the time step and therefore to increase the computational cost significantly. Even though it is only of order 1, the symplectic Euler method has a clear advantage for this kind of “long-term” simulation (note, however, that the trajectory in the phase plane is slightly deformed, with an error of order O(Δt)). Finally, Figure 1.25 shows the comparison of the trajectories’ evolution for different initial values, which allows us to evaluate the variations of volume in the phase space produced by the different methods. The final time is T = 10 and the time step is fixed at Δt = 0.4. The initial values are (θ_Init, ξ_Init) = (0.08, −0.02), (0.1, −0.02), (0.12, 0), (0.08, 0), (0.1, 0.02) and (0.12, 0.02), respectively.
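The O(Δt) deformation of the symplectic Euler trajectories mentioned above can be made precise: for the linear pendulum, the update θ^{n+1} = θ^n + Δt ξ^n, ξ^{n+1} = ξ^n − Δt θ^{n+1} (an assumed ordering) conserves exactly the modified quadratic form H̃ = (θ² + Δt θξ + ξ²)/2, whose elliptical level sets are O(Δt)-close to the circles of H. A sketch checking this exact conservation:

```python
def symplectic_euler(th, xi, h):
    th = th + h*xi
    xi = xi - h*th
    return th, xi

def modified_h(th, xi, h):
    # H~ = (theta^2 + h*theta*xi + xi^2)/2 is exactly conserved by this update
    return 0.5*(th*th + h*th*xi + xi*xi)

h = 0.4                 # even a rather large step preserves H~ exactly
th, xi = 0.1, -0.02
H0 = modified_h(th, xi, h)
drift = 0.0
for _ in range(1000):
    th, xi = symplectic_euler(th, xi, h)
    drift = max(drift, abs(modified_h(th, xi, h) - H0))
print(drift)  # at the level of rounding errors
```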
The results can be compared with the exact trajectories shown in Figure 1.24. The Euler schemes produce an expansion or a contraction of the volume. We observe that the positions given by the RK2 scheme with that (relatively large) time step correspond poorly to the exact solution. Finally, since we know the exact solution, the results of the convergence analysis can be verified experimentally. For the same physical data, we test the explicit, implicit and symplectic Euler schemes and the RK2 scheme up to the final time T = 5, with time steps of the form Δt = Δt₀ × 2^{−k}, k ∈ {0, ..., 8}. We then evaluate ln(max_n(|θ_n − θ(nΔt)| + |ξ_n − ξ(nΔt)|)). Figure 1.26 shows the evolution of those quantities as a function of ln(Δt). We can clearly see that the RK2 scheme leads to a greater slope, consistent with its higher convergence order. However, it is not necessarily a very good scheme for this problem: adequate results require a rather small time step, and its behavior for longer times is not satisfactory. Although it is only of order 1, the symplectic Euler scheme performs better under these criteria. We now turn to the study of the nonlinear problem [1.33], where we assume that g/ℓ = 1. In view of the results in the linear case, we can expect the implicit Euler scheme not to hold a great advantage over the explicit Euler scheme for this problem. We will therefore compare the explicit schemes, which are simpler to implement. The initial values are θ_Init = 1.2 and ξ_Init = 0. We perform a first simulation until the final time T = 30, with time step Δt = 0.25. Figure 1.27 shows


the evolution of the Hamiltonian with the explicit Euler, symplectic Euler and RK2 schemes. The explicit Euler scheme is immediately ruled out in these conditions, since the Hamiltonian takes values that are very far from the theoretical one. The variations for the RK2 scheme are acceptable over this time span (they indeed remain smaller than Δt); however, they are strictly increasing and suggest undesirable behavior over longer time spans. The variations for the symplectic scheme have a greater amplitude, but the discrete Hamiltonian oscillates around the theoretical value. The approximated Hamiltonian is not exactly preserved, unlike in the linear case, but the values it produces oscillate with a very small amplitude (which can be shown to remain within O(Δt²)). Figure 1.28 shows the evolution of the numerical solution through time and Figure 1.29 shows the trajectories in the phase plane (θ, ξ): it is clear that the solution produced by the explicit Euler scheme is not acceptable, whereas the solutions for the RK2 and symplectic Euler schemes are very close in those conditions. However, we see that perceptible gaps appear in the phase portrait obtained with the RK2 method. We then test the behavior of the symplectic Euler and RK2 schemes over longer time spans. The data remain the same, except for the final time and the time step, which are now set to T = 140 and Δt = 0.48. Figure 1.30 shows the evolution of the Hamiltonian and Figure 1.31 shows the evolution of the solutions as a function of time and the phase portraits. On this time scale, the results of the RK2 scheme become incoherent, whereas those of the symplectic scheme remain completely satisfactory, as suggested by the linear case. We finish these numerical experiments with an evaluation of the order of convergence of the different methods. In the nonlinear case, we do not have an explicit expression of the solution to the Cauchy problem, and therefore we do not have access to y_n − y(nΔt).
We therefore compare the numerical solutions obtained with different time steps. More specifically, given Δt₀ > 0, we produce simulations with steps of the form Δt_k = 2^{−k} Δt₀, where k varies. With a small abuse of notation, we write y_{Δt_k} and y_{Δt_{k+1}} for the discrete solutions corresponding to consecutive exponents, which we define on the same discretization grid (either by not taking all the points of the fine grid, or by interpolating from the coarse grid to the fine grid). We thus evaluate y_{Δt_{k+1}} − y_{Δt_k} and report the values obtained as a function of k. Figure 1.32 shows the error curves with the final time T = 50, a first coarse grid defined by Δt₀ = 1/10 and k ∈ {0, ..., 6}. We observe that the RK2 scheme is of order 2 and the explicit and symplectic Euler schemes are of order 1. However, the higher order is not enough to make RK2 a “good” method for this problem, unless we consider only short time spans and work with a relatively small time step. Finally, we study the Kepler system numerically. Again, we compare only the explicit Euler, symplectic Euler and RK2 schemes. The data are such that GM = 1, q_Init = (0.65, 0), p_Init = (0, 0.7). The simulation is performed until T = 15 with the step Δt = 0.01. The position q(t) is shown in Figure 1.33, while Figure 1.34 shows the evolution of the Hamiltonian in time. As can be expected, the explicit Euler scheme very quickly provides unacceptable results: the numerical trajectory immediately departs from the expected ellipse and the Hamiltonian from its initial value. The RK2 scheme


provides relevant results only for short time spans. The Hamiltonian computed with the symplectic scheme oscillates around the theoretical value, and the trajectory remains within a bounded domain, relatively close to the ellipse predicted by the theory. We note that the symplectic scheme also preserves the kinetic moment. Indeed, writing ℓ = q ∧ p, we have

ℓ^{n+1} = q^{n+1} ∧ p^{n+1} = (q^n + Δt p^n) ∧ ( p^n − Δt GM q^{n+1}/|q^{n+1}|³ )
= q^n ∧ p^n − (Δt GM/|q^{n+1}|³) (q^n + Δt p^n) ∧ q^{n+1} = ℓ^n,

since q^{n+1} = q^n + Δt p^n.

The explicit Euler and RK2 schemes do not preserve this quantity, which can be verified numerically.
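This discrete conservation of the kinetic moment, and its loss for the explicit Euler scheme, can be verified as follows (a planar sketch with GM = 1 and the data of this section; the ordering of the symplectic updates — position first, then velocity with the updated position — matches the computation above):

```python
def force(qx, qy):           # -GM q/|q|^3 with GM = 1
    r3 = (qx*qx + qy*qy) ** 1.5
    return -qx/r3, -qy/r3

def moment(qx, qy, px, py):  # third component of q ^ p
    return qx*py - qy*px

h, nsteps = 0.01, 1500
qx, qy, px, py = 0.65, 0.0, 0.0, 0.7   # symplectic Euler state
ex, ey, evx, evy = qx, qy, px, py      # explicit Euler state
ell0 = moment(qx, qy, px, py)

for _ in range(nsteps):
    # symplectic Euler: q first, then p with the updated position
    qx, qy = qx + h*px, qy + h*py
    fx, fy = force(qx, qy)
    px, py = px + h*fx, py + h*fy
    # explicit Euler: both updates use the old state
    fx, fy = force(ex, ey)
    ex, ey = ex + h*evx, ey + h*evy
    evx, evy = evx + h*fx, evy + h*fy

dev_sym = abs(moment(qx, qy, px, py) - ell0)
dev_exp = abs(moment(ex, ey, evx, evy) - ell0)
print(dev_sym, dev_exp)  # rounding-level versus a visible drift
```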

Figure 1.20. Comparison of the explicit Euler, implicit Euler, symplectic Euler and RK2 numerical solutions for the linear pendulum problem: θ_L (bold line) and ξ_L (dotted line) as a function of time

Figure 1.21. Comparison of the explicit Euler, implicit Euler, symplectic Euler and RK2 numerical solutions for the linear pendulum problem: trajectories in the phase plane

Figure 1.22. Comparison of the explicit Euler, implicit Euler, symplectic Euler and RK2 numerical solutions for the linear pendulum problem: evolution of the discrete Hamiltonian as a function of time


Figure 1.23. Long-term linear pendulum simulation: comparison of the RK2 and symplectic Euler schemes – Hamiltonian (approximated by the symplectic Euler scheme) and trajectories in the phase plane

Figure 1.24. Linear pendulum: trajectories in the phase plane – exact solution

Figure 1.25. Linear pendulum: trajectories in the phase plane, comparison of the explicit, implicit and symplectic Euler schemes and the RK2 scheme. The initial values are indicated by rectangles and the final solutions by circles

Figure 1.26. Linear pendulum: error evaluation (log scale) as a function of ln(Δt) – explicit Euler (bold line), implicit Euler (dotted), symplectic Euler (dotted-dashed) and RK2 (dashed)


Figure 1.27. Nonlinear pendulum: evolution of the Hamiltonian in time for the explicit Euler, symplectic Euler and RK2 schemes

Figure 1.28. Comparison of the explicit Euler, symplectic Euler and RK2 numerical solutions for the nonlinear pendulum problem: θ (bold line) and ξ (dotted line) as a function of time

Figure 1.29. Comparison of the explicit Euler, symplectic Euler and RK2 numerical solutions for the nonlinear pendulum problem: trajectories in the phase plane (θ, ξ)

Figure 1.30. Simulation of the nonlinear pendulum for large times: comparison of the symplectic Euler and RK2 schemes; evolution of the Hamiltonian


Figure 1.31. Simulation of the nonlinear pendulum for long periods of time: comparison of the symplectic Euler and RK2 schemes – solution as a function of time and trajectories in the phase plane

Figure 1.32. Nonlinear pendulum: error evaluation (log scale) as a function of ln(Δt) – explicit Euler (bold line), symplectic Euler (dotted-dashed line) and RK2 (dashed)

Figure 1.33. Kepler problem: representation of the positions computed by the RK2, explicit Euler and symplectic Euler methods

Figure 1.34. Kepler problem: evolution of the Hamiltonian in time for the explicit Euler, symplectic Euler and RK2 schemes


2 Numerical Simulation of Stationary Partial Differential Equations: Elliptic Problems

2.1. Introduction

In this chapter, we study stationary equations, that is, equations in which the unknown depends only on the space variable x ∈ Ω ⊂ R^D, unlike evolution problems, where the unknown also depends on the time variable t ≥ 0. In what follows, the domain Ω is either all of R^D or a regular bounded open set, in the sense that its boundary is given by a regular function, say of class C^∞, for the sake of simplicity. In this case, at each point x ∈ ∂Ω, we can define a vector ν(x) that is normal to ∂Ω at x and points to the outside of Ω. In general, we are interested in partial differential equations¹ of the form

P(x, ∂_x) u = f   for x ∈ Ω,

where ∂_x = (∂_{x₁}, ..., ∂_{x_D}) and, for any given x, ξ ∈ R^D → P(x, ξ) is a polynomial. The problem is completed by suitable conditions on the boundary ∂Ω. Under certain structural assumptions on P, analysis theorems provide the functional framework in which we can justify the existence and uniqueness of solutions to this kind of equation. In some cases, we also have qualitative information: estimates satisfied by the solutions, gain of regularity with respect to the data, etc. The numerical solutions should reproduce those properties as far as possible. More specifically, we will focus on the following model problem (which is a typical elliptic problem):

λu − ∇ · (k(x)∇u) = f   for x ∈ Ω,   [2.1]

1 In what follows, we write indifferently ∂_{x_i} or ∂/∂x_i for the differentiation operator in the ith direction.

Mathematics for Modeling and Scientific Computing, First Edition. Thierry Goudon. © ISTE Ltd 2016. Published by ISTE Ltd and John Wiley & Sons, Inc.


where λ ≥ 0, f is given, let us say in L²(Ω), and the function k : x ∈ Ω → k(x) ∈ M_D(R) is such that there exist κ, K > 0 satisfying, for all x ∈ Ω and all ξ ∈ R^D,

k(x)ξ · ξ ≥ κ|ξ|²,   |k(x)ξ| ≤ K|ξ|.

When λ = 0, equation [2.1] is known as the Poisson equation (non-homogeneous when the coefficient k varies as a function of the position x). We use the following differential operators:

– gradient: for a function u : x ∈ R^D → u(x) ∈ R, ∇u(x) is the vector of R^D whose components are the partial derivatives ∂u/∂x_j(x);

– divergence: for a function V : x ∈ R^D → V(x) = (V₁(x), ..., V_D(x)) ∈ R^D, ∇ · V(x) is the scalar Σ_{j=1}^D ∂V_j/∂x_j(x). (This is the trace of the Jacobian matrix of V evaluated at x.)

In components, equation [2.1] is written as λu − Σ_{i,j=1}^D ∂/∂x_i ( k_ij(x) ∂u/∂x_j ) = f. An interesting particular case assumes k(x) = kI, with a constant k > 0; the equation then becomes λu − kΔu = f, where Δ represents the (Laplace) differential operator

Δu(x) = ( ∂²u/∂x₁² + ... + ∂²u/∂x_D² )(x).

The problem is completed by prescribing conditions on the boundary ∂Ω. We consider the specific cases of:

– (homogeneous) Dirichlet conditions: u|_∂Ω = 0;

– (homogeneous) Neumann conditions: k∂_ν u|_∂Ω = k∇u · ν|_∂Ω = 0, where we recall that ν(x) denotes the unit outward normal vector to ∂Ω at the point x ∈ ∂Ω.

The direct use of general functional analysis theorems, together with the setup of an appropriate framework, ensures the existence and uniqueness of the solution of [2.1]. More precisely, the Lax–Milgram theorem (see [GOU 11, corollary 5.27]) guarantees that, for all λ ≥ 0 with Dirichlet conditions, or for all λ > 0 with Neumann conditions, [2.1] has a unique solution u ∈ L²(Ω) with ∇_x u ∈ L²(Ω). Additionally, if the data, coefficients and second members, are more regular, k ∈ C^m(Ω), and f is such that for each D-uplet α = (α₁, ..., α_D) ∈ N^D with |α| = α₁ + ... + α_D ≤ m we have ∂^α f = ∂_{x₁}^{α₁}...∂_{x_D}^{α_D} f ∈ L²(Ω), then the solution u satisfies ∂^β u ∈ L²(Ω) for each D-uplet of length |β| ≤ m + 2 (see [EVA 98, Chapter 6.3]). Finally, we can


demonstrate the following maximum principle: if 0 ≤ f(x) ≤ M for almost all x ∈ Ω, then 0 ≤ u(x) ≤ M/λ for almost all x ∈ Ω.

The numerical methods used to approximate the solutions to this kind of problem seek to construct a sequence (u_h)_{h>0}, where the parameter h > 0 measures the approximation “effort”. This somewhat vague notion of “effort” is understood as the computational burden: the smaller h is, the greater the number of unknowns that must be processed, and the greater the memory and time required to perform the computations. The elements of the sequence (u_h)_{h>0} are vectors whose number of components depends on h, and those components allow us to define a function meant to approximate the solution x → u(x) of the continuous problem:

– using a certain procedure, which can be computed effectively, we define a vector u_h ∈ R^{N_h}, with lim_{h→0} N_h = ∞;

– that vector’s components make it possible to define a function x ∈ Ω → ũ_h(x);

– we seek to show that, for a certain norm ‖·‖, we have lim_{h→0} ‖u − ũ_h‖ = 0, where u is the solution of [2.1].

Here we deal with functional spaces of infinite dimension. The choice of norm is therefore not trivial: convergence can hold for some norms but not for others, which involves, for example, greater or lesser regularity properties (convergence in the L² or L¹ norm, for example, is less “demanding” than convergence in the L^∞ norm). We can roughly distinguish three families of numerical methods for these problems, which are based on different ways of seeing and interpreting the equation.

– Finite difference (FD) methods are “cross-country” methods, which are, at first glance, the easiest to understand and implement. We introduce a subdivision of the domain Ω in the form of cubes: the parameter h > 0 measures how fine-grained this subdivision is. We denote the nodes of this discretization by x_i, so the numerical unknown u_i can be seen as an approximation of the value u(x_i) of the solution at the point x_i. The derivatives are approximated by differential quotients. The scheme takes the general form Σ_{j=1}^{N_h} ω_ij u_j = f_i, where f_i = f(x_i). (Note that this assumes sufficient regularity on f, for example that f is continuous.) The weights ω_ij are constructed in such a way that the consistency error tends to 0 when the grid becomes finer²: lim_{h→0} ( Σ_{j=1}^{N_h} ω_ij u(x_j) − f(x_i) ) = 0.

2 It is important to note that u_i ≠ u(x_i): the former does satisfy Σ_{j=1}^{N_h} ω_ij u_j − f_i = 0, but, in general, the evaluations of the unknown u at the discretization points are such that Σ_{j=1}^{N_h} ω_ij u(x_j) − f_i ≠ 0. This distinction will be discussed later.


those shapes). This approach works on equations in conservative form, where the differential term is expressed as ∇ · (...). Here, the numerical unknown u_i can be interpreted as an approximation of the average of the solution u on the ith control volume. The key notion for FV methods is that of numerical flux, which defines a relevant approximation of the values of (...) · ν at the control volume interfaces.

– Finite element (FE) methods use a variational structure, which is not suitable for all problems (but which works for equations of the form [2.1]). Typically, the solution of the partial differential equation can be reinterpreted as “finding u ∈ H, a certain Hilbert space, that satisfies the relation a(u, v) = ℓ(v) for all v ∈ H”, where a and ℓ are a bilinear form and a linear form, respectively, both continuous on H. The task involves the construction of a space H_h of finite dimension N_h, an “approximation” of the original space H. We then solve the problem a(ũ_h, v_h) = ℓ(v_h) for all v_h ∈ H_h, since it is a finite-dimensional problem that has a simple matrix expression. For example, we can construct the space H_h by restricting ourselves to a finite number of Fourier modes (spectral methods) or by considering spaces of piecewise polynomial functions. Here, the convergence evaluation of the method depends on the structure of the form a and on the construction of the approximation space H_h.

In fact, the differences between these methods tend to be smoothed out in modern presentations. The most advanced methods are often “hybrid”: their principles and interpretations can combine several points of view (discontinuous Galerkin (DG) methods [HES 08], residual distribution (RD) methods [ABG 10, ABG 09], etc.). In what follows, it will be necessary to keep in mind the regularity assumptions imposed (sometimes implicitly) on the solution of the original PDE in order to establish the order of convergence of certain numerical methods: in several cases, the consistency analysis assumes far more regularity on the solution x → u(x) than what is “naturally” available.

2.1.1. The 1D model problem; elements of modeling and analysis

We will focus on the one-dimensional setting, Ω = ]0, 1[, and [2.1] with Dirichlet conditions becomes

λu − (d/dx)( k (d/dx)u ) = f   for x ∈ ]0, 1[,   u(0) = 0 = u(1),   [2.2]

with k : ]0, 1[ → R satisfying 0 < κ ≤ k(x) ≤ K < ∞ for all x ∈ ]0, 1[. When λ = 0, the problem arises, for instance, from the description of heat fluxes within a rod, represented by the segment ]0, 1[. The ends of the rod are kept at a constant temperature, and the units are chosen so that this reference temperature is 0. The second member of the equation, f, describes the heat sources along the rod, and


x → k(x) > 0 is the thermal conductivity coefficient, which depends on the position along the rod if it is composed of a heterogeneous material. If we let q(x) denote the heat flux through a section at abscissa x ∈ ]0, 1[, then between two positions x and x + dx, the difference in heat fluxes is balanced by the sources:

q(x + dx) − q(x) = ∫_x^{x+dx} f(y) dy.

We divide this relation by dx > 0 and then let dx tend to 0 to obtain

lim_{dx→0} (q(x + dx) − q(x))/dx = (d/dx)q(x) = lim_{dx→0} (1/dx) ∫_x^{x+dx} f(y) dy = f(x)

by the definition of the derivative and the Lebesgue theorem³. Equation [2.2] can therefore be obtained using the Fourier law, which connects the heat flux to the temperature:

q(x) = −k(x) (d/dx)u(x).
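To make this model concrete, problem [2.2] can already be solved approximately. The following minimal sketch anticipates the finite difference methods of section 2.2 (assumptions: λ = 0, k ≡ 1 and f ≡ 1, for which the exact solution is u(x) = x(1 − x)/2); it assembles the classical 3-point scheme and solves the tridiagonal system by Gaussian elimination (Thomas algorithm):

```python
def solve_tridiagonal(sub, diag, sup, rhs):
    # Thomas algorithm: forward elimination, then back substitution
    n = len(diag)
    diag, rhs = diag[:], rhs[:]
    for i in range(1, n):
        w = sub[i-1] / diag[i-1]
        diag[i] -= w * sup[i-1]
        rhs[i] -= w * rhs[i-1]
    x = [0.0] * n
    x[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (rhs[i] - sup[i] * x[i+1]) / diag[i]
    return x

n = 99                       # number of interior grid points
h = 1.0 / (n + 1)
# -u'' = 1 discretized by the 3-point scheme (2u_i - u_{i-1} - u_{i+1})/h^2 = 1
u = solve_tridiagonal([-1.0/h**2] * (n - 1),
                      [2.0/h**2] * n,
                      [-1.0/h**2] * (n - 1),
                      [1.0] * n)
err = max(abs(u[i] - 0.5*(i+1)*h*(1 - (i+1)*h)) for i in range(n))
print(err)  # the 3-point scheme is exact here: u is a quadratic polynomial
```

For this particular data, the exact solution is a polynomial of degree 2, so the second-order difference quotient is exact and the nodal values are reproduced up to rounding errors.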

We note that the heat flux is opposite to the variations in temperature: if the temperature is increasing, the heat flux is negative, and vice versa. As a result, heat fluxes heat the cold areas and cool the hot ones. Before detailing the analysis of equation [2.2], let us make some simple remarks:

– If λ > 0, we can always reduce to the case λ = 1 by noting that v = λu satisfies v − ((k/λ) v′)′ = f, with v(0) = v(1) = 0.

– If we impose heterogeneous Dirichlet conditions u(0) = u₀ and u(1) = u₁, we can reduce to [2.2] by modifying the source term. Indeed, we write ϕ(x) = γx + η, where γ and η are chosen such that ϕ(0) = u₀ and ϕ(1) = u₁; in other words, ϕ(x) = (u₁ − u₀)x + u₀. Let v : x → v(x) = u(x) − ϕ(x). This function satisfies v(0) = 0 = v(1) and λv − (kv′)′ = λu − (ku′)′ − (λϕ − (kϕ′)′) = f − λϕ + (u₁ − u₀)k′.

– It is very important to distinguish between the Cauchy problem for the ordinary differential equation

λu − (d/dx)( k (d/dx)u ) = f,   given u(0) and (d/dx)u(0),

3 The result is quite direct if we assume that f is continuous; if we only assume that f is integrable, the convergence holds for almost all x ∈ ]0, 1[; see for example [RUD 87, theorem 7.7].


and the boundary value problem for the partial differential equation⁴

λu − (d/dx)( k (d/dx)u ) = f,   given u(0) and u(1).

– In the one-dimensional case, we can express the solution directly as a function of the data by integrating equation [2.2]. Consider, for example, the case λ = 0. We first obtain k(x) (d/dx)u(x) = A − ∫₀ˣ f(z) dz, and then the general form

u(x) = B + ∫₀ˣ (1/k(y)) ( A − ∫₀^y f(z) dz ) dy.   [2.3]

The constants A, B are determined by the boundary conditions. If we consider the homogeneous Dirichlet conditions u(0) = 0 = u(1), then

B = 0,   A = ( ∫₀¹ dy/k(y) )⁻¹ ∫₀¹ (1/k(y)) ( ∫₀^y f(z) dz ) dy.   [2.4]

In particular, note that even without assuming the continuity of k and f, the solution thus obtained with formula [2.3] is continuous on [0, 1] (assuming only f ∈ L¹(]0, 1[) and k measurable and bounded from below by a strictly positive constant). This is a specific property of the 1D case, which will be made explicit in our discussion of lemma 2.1. We can also verify the maximum principle: if f takes non-negative values, then u is also non-negative on [0, 1]. Suppose that u, which is continuous, is negative on [a, b] ⊂ [0, 1], with u(a) = u(b) = 0. Then, we have

0 ≥ ∫_a^b f(x) u(x) dx = ∫_a^b (d/dx)( ∫₀ˣ f(y) dy − A ) u(x) dx
= −∫_a^b ( ∫₀ˣ f(y) dy − A ) (d/dx)u(x) dx = ∫_a^b (1/k(x)) ( A − ∫₀ˣ f(y) dy )² dx

using integration by parts⁵. It follows that the functions x → f(x)u(x) and x → A − ∫₀ˣ f(y) dy = k(x)(d/dx)u(x) are identically zero on [a, b]. (In fact, the relations that we have just established can be rewritten as ∫_a^b k(x) ((d/dx)u(x))² dx = 0, and we will deal with this perspective further below.) This last property implies that u is constant on the segment [a, b]; since u(a) = u(b) = 0, it is zero on [a, b], a contradiction.

4 Here we have chosen this terminology, although the unknown depends only on the single variable x.

5 Integration by parts is justified if u is C¹, thanks to the fact that k is continuous. We will see that the formula remains meaningful for weak solutions lying in the Sobolev space H₀¹(]0, 1[), that is, such that u and (d/dx)u have an integrable square.

– Formulas [2.3]–[2.4] do not, in general, provide an explicit expression of u(x) as a function of x, since the computations of the integrals involved in those formulas


may not be accomplished with standard integral calculus techniques. One possible approach, in the one-dimensional case, could be to approximate those integrals. We will instead focus on methods based on the partial differential equation itself, which can be extended to the multidimensional case. Nevertheless, in some simple cases, we have an explicit formula for the solution: for instance, for k(x) = 1, f(x) = 1, we find u(x) = x(1 − x)/2.

– Let us finish this list of remarks by describing a method, for the one-dimensional case, which allows us both to ensure the existence of a solution to the problem

−(d/dx)( k(x) (d/dx)u ) = f(x)   [2.5]

with the boundary conditions

u(0) = 0,   u(1) = 0,   [2.6]

and also to calculate an approximation to this solution. This approach provides the opportunity to distinguish between the Cauchy problem and the boundary value problem. We begin by focusing on the Cauchy problem where [2.5] is completed by initial data at x = 0 (considered as a “time” variable) d u(0) = α. dx

u(0) = 0,

It is a second-order differential equation, which can be rewritten as a first-order system d dx

  u 0 = v 0

1 −k /k

  u 0 + v −f /k

assuming here that k is C 1 and f is continuous. The Cauchy–Lipschitz theorem ensures the existence and uniqueness of the solution, which we denote as x → uα (x), of problem [2.5]–[2.6]. As shown above, we have the formula  ˆ x ˆ y 1 uα (x) = k(0)α − f (z) dz dy. [2.7] 0 k(y) 0 In particular, we have ˆ

1

uα (1) = 0

1 k(y)

 ˆ k(0)α −



y

f (z) dz

dy.

0

The function α → uα (1) is affine. It follows by the intermediate value theorem that there exists a unique α0 ∈ R, such that uα0 (1) = 0.


Mathematics for Modeling and Scientific Computing

We thus have a practical procedure, the shooting method, to calculate the solution of [2.5]–[2.6]. For a given $\alpha$, we solve the Cauchy problem numerically (using a numerical method for differential equations, or by approximating the integrals in [2.7]), and we thus determine $u_\alpha(1)$. With two values of the parameter $\alpha_+ > \alpha_-$ such that $u_{\alpha_+}(1) > 0$ and $u_{\alpha_-}(1) < 0$, we use a dichotomy (bisection) approach to obtain an approximation of the value $\alpha_0$ that guarantees $u_{\alpha_0}(1) = 0$ (see Figures 2.1 and 2.2). Clearly, for this particular problem, we can directly obtain (an approximation of) the solution by calculating the integrals (or approximations of them) involved in [2.3] and [2.4]. A more relevant example is presented in section 2.1.2.

Figure 2.1. Shooting method: $k(x) = 3 + \cos(2.5x)$, $f(x) = 1 + \sin(3x + 0.5) + 0.3 \ln(1.7 + x)$. The first points correspond to $\alpha = 1$ and $\alpha = -5$. Expression [2.7] is evaluated by a simple rectangle method
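The procedure can be sketched in a few lines of Python (our illustration, not the book's code: the function names and the quadrature parameters are ours). It evaluates formula [2.7] by the rectangle method and locates $\alpha_0$ by bisection; for $k \equiv 1$, $f \equiv 1$, the result can be checked against the exact solution $u(x) = x(1-x)/2$.

```python
import numpy as np

def u_alpha(alpha, k, f, n=1000):
    """Evaluate formula [2.7] on a uniform grid by the rectangle rule:
    u_alpha(x) = int_0^x (1/k(y)) * (k(0)*alpha - int_0^y f(z) dz) dy."""
    h = 1.0 / n
    y = np.arange(n) * h                                   # left endpoints
    F = np.concatenate(([0.0], h * np.cumsum(f(y))))[:-1]  # int_0^{y_j} f
    integrand = (k(0.0) * alpha - F) / k(y)
    return np.concatenate(([0.0], h * np.cumsum(integrand)))  # u at x_i = i*h

def shoot(k, f, lo=-10.0, hi=10.0, iters=60):
    """Bisection on alpha: u_alpha(1) is affine and increasing in alpha."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if u_alpha(mid, k, f)[-1] > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Data of Figure 2.1 (our transcription of the caption)
k_fig = lambda y: 3 + np.cos(2.5 * np.asarray(y))
f_fig = lambda y: 1 + np.sin(3 * np.asarray(y) + 0.5) + 0.3 * np.log(1.7 + np.asarray(y))
alpha0 = shoot(k_fig, f_fig)
```

The bisection exploits the fact, established above, that $\alpha \mapsto u_\alpha(1)$ is affine with positive slope, so a single sign change is guaranteed.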

As mentioned above, the generalization to the multidimensional case involves functional analysis arguments. The key fact is that the equation has a variational formulation. This perspective will also be fruitful for designing numerical schemes. Before considering the general case, we review some of these notions in the simple one-dimensional case.

Figure 2.2. Comparison of solutions obtained by the shooting method and an approximation using finite differences for the data in Figure 2.1 (400 discretization points): the two curves coincide
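The finite difference approximation used in Figure 2.2 can be sketched as follows (our code, anticipating section 2.2; we take a uniform grid, $\lambda = 0$, and evaluate $k$ at the cell midpoints, which is one common choice among several):

```python
import numpy as np

def fd_solve(k, f, J=399):
    """Three-point finite difference scheme for -(k u')' = f on ]0,1[
    with u(0) = u(1) = 0, on a uniform grid with J interior points."""
    h = 1.0 / (J + 1)
    x = (np.arange(J) + 1) * h           # interior nodes x_1, ..., x_J
    km = k(x - h / 2)                    # k_{j-1/2}
    kp = k(x + h / 2)                    # k_{j+1/2}
    A = (np.diag(km + kp) - np.diag(kp[:-1], 1) - np.diag(km[1:], -1)) / h**2
    return x, np.linalg.solve(A, f(x))
```

For $k \equiv 1$, $f \equiv 1$, this scheme reproduces the exact solution $x(1-x)/2$ at the nodes up to rounding errors, since the consistency error involves the fourth derivative of $u$, which vanishes for a quadratic solution.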

The task consists of multiplying [2.2] by a "fairly regular" function $v$, such that $v(0) = v(1) = 0$, and integrating by parts; we obtain
$$\lambda \int_0^1 u(x) v(x)\, dx + \int_0^1 k(x)\, \frac{d}{dx} u(x)\, \frac{d}{dx} v(x)\, dx = \int_0^1 f(x) v(x)\, dx. \quad [2.8]$$
The boundary terms that result from the integration by parts cancel out because $v$ satisfies $v(0) = v(1) = 0$. The right-hand side of [2.8] defines, for a given $f$ in $L^2(]0,1[)$, a continuous linear form on $L^2(]0,1[)$:
$$\ell : v \in L^2(]0,1[) \longmapsto \int_0^1 f(x) v(x)\, dx.$$
The left-hand side of [2.8] defines a bilinear form $a : (u,v) \mapsto a(u,v)$, which satisfies
– the continuity relation $|a(u,v)| \leq \lambda \|u\|_{L^2} \|v\|_{L^2} + K \left\| \frac{d}{dx} u \right\|_{L^2} \left\| \frac{d}{dx} v \right\|_{L^2}$;
– the coercivity relation $a(u,u) = \lambda \int_0^1 |u|^2\, dx + \int_0^1 k(x) \left| \frac{d}{dx} u \right|^2 dx \geq \lambda \|u\|_{L^2}^2 + \kappa \left\| \frac{d}{dx} u \right\|_{L^2}^2$.


Note that $a(u,u) \geq 0$, and $a(u,u)$ vanishes if and only if $u(x) = 0$ for almost all $x \in \Omega$. The last remark applies even in the case $\lambda = 0$: indeed, from $a(u,u) = 0$, we infer that $\frac{d}{dx} u(x) = 0$ for almost all $x \in ]0,1[$, so $x \mapsto u(x)$ is a constant function. However, the Dirichlet condition $u(0) = 0 = u(1)$ implies that this constant is 0. The difficulty lies in the construction of an appropriate functional framework in which to use these observations. The expressions that we derived are meaningful if the functions $u$ and $v$, and their derivatives, are square integrable. Note that $\ell(v)$ and $a(u,v)$ are well defined even with discontinuous functions $f$ and $k$. These considerations first lead us to generalize the notion of derivative. If $u$ is a regular function, say of class $C^\infty(]0,1[)$, by integration by parts we clearly have $\int_0^1 \frac{d}{dx} u(x)\, \varphi(x)\, dx = -\int_0^1 u(x)\, \frac{d}{dx} \varphi(x)\, dx$ for all $\varphi \in C_c^\infty(]0,1[)$. We use the right-hand expression to define the derivative of a function $u \in L^2(]0,1[)$, considered as a linear form acting on the functions $\varphi \in C_c^\infty(]0,1[)$. The functions $u \in H^1(]0,1[)$ are the elements of $L^2(]0,1[)$ for which there exists a $C > 0$ satisfying, for all $\varphi \in C_c^\infty(]0,1[)$, the estimate $\left| \int_0^1 u(x)\, \frac{d}{dx} \varphi(x)\, dx \right| \leq C \|\varphi\|_{L^2}$. By Riesz's lemma, this estimate means that we can identify the derivative of $u$ with an element of $L^2(]0,1[)$. We let $H^1(]0,1[)$ denote the space of functions $u \in L^2(]0,1[)$ satisfying that property. More generally, we can thus define a family of spaces, the Sobolev spaces, by requiring integrability properties on the successive derivatives. We define $H_0^1(]0,1[)$ as the closure of $C_c^\infty(]0,1[)$ for the norm
$$\|u\|_{H^1} = \sqrt{ \|u\|_{L^2}^2 + \left\| \frac{d}{dx} u \right\|_{L^2}^2 }.$$

The space $H^1(]0,1[)$ is obtained as the closure of $C^\infty([0,1])$ for the same norm. The spaces $H^1(]0,1[)$ and $H_0^1(]0,1[)$, equipped with the norm $\|\cdot\|_{H^1}$, are Hilbert spaces. One subtle question deals with the meaning given to the traces of those functions at the boundary of the domain (since we recall that functions in $L^2$ are only defined almost everywhere). In dimension one, the difficulty is overcome by noting that the elements of the space $H^1$ are, in fact, continuous functions (or, more precisely, they admit a continuous representative).

LEMMA 2.1 (Sobolev Embedding Theorem (1D)).– Let $u \in H^1(]0,1[)$. Then, $u \in C^0([0,1])$. Moreover, there exists a $C > 0$ such that for all $u \in H^1(]0,1[)$, we have
$$\|u\|_{L^\infty} \leq C \|u\|_{H^1}. \quad [2.9]$$

LEMMA 2.2 (Poincaré lemma (1D)).– Let $u \in H_0^1(]0,1[)$. Then, we have the following estimates:
$$\|u\|_{L^\infty} \leq \left\| \frac{d}{dx} u \right\|_{L^2}, \qquad \|u\|_{L^2} \leq C \left\| \frac{d}{dx} u \right\|_{L^2},$$
where $C > 0$ does not depend on $u$.


PROOF.– Let $\varphi \in C^\infty([0,1])$. We can write
$$\varphi(x) - \varphi(y) = \int_y^x \frac{d}{dx} \varphi(\xi)\, d\xi.$$
The Cauchy–Schwarz inequality leads to
$$|\varphi(x) - \varphi(y)| \leq \left| \int_y^x \left| \frac{d}{dx} \varphi(\xi) \right| d\xi \right| \leq \sqrt{|x-y|} \left( \int_0^1 \left| \frac{d}{dx} \varphi(\xi) \right|^2 d\xi \right)^{1/2} \leq \sqrt{|x-y|}\, \|\varphi\|_{H^1}.$$

By the density of $C^\infty([0,1])$ in $H^1(]0,1[)$, we extend this relation to the functions of $H^1(]0,1[)$, which proves their continuity: $H^1(]0,1[) \subset C^0([0,1])$. The last estimate implies an even stronger consequence: the embedding of $H^1(]0,1[)$ in $C^0([0,1])$ is compact. Indeed, if $(u_n)_{n\in\mathbb{N}}$ is a bounded sequence in $H^1(]0,1[)$, then there exists $C > 0$ such that for all $n \in \mathbb{N}$ and $x, y \in [0,1]$, we have $|u_n(x) - u_n(y)| \leq C \sqrt{|x-y|}$. The sequence $(u_n)_{n\in\mathbb{N}}$ is therefore equicontinuous on $[0,1]$. The estimate [2.9] shows that it is also equibounded. Thus, the Arzelà–Ascoli theorem [GOU 11, theorem 7.49] implies that $(u_n)_{n\in\mathbb{N}}$ is relatively compact in $C^0([0,1])$.

To justify [2.9], we use a duality argument. Let $\varphi \in C_c^\infty(]0,1[)$. We introduce an auxiliary function $\theta \in C^\infty([0,1])$ such that $0 \leq \theta(x) \leq 1$ for all $x \in [0,1]$, $\theta(x) = 1$ for $x \in [3/4,1]$ and $\theta(x) = 0$ for $x \in [0,1/4]$. We then set
$$\psi(x) = \int_0^x \varphi(y)\, dy - \theta(x) \int_0^1 \varphi(y)\, dy.$$
We have $\psi \in C_c^\infty(]0,1[)$ and
$$|\psi(x)| \leq K \|\varphi\|_{L^1}$$
(for a certain constant $K > 0$ that depends on the auxiliary function $\theta$). For $u \in H^1(]0,1[)$, we get
$$\left| \int_0^1 u(x) \varphi(x)\, dx \right| = \left| \int_0^1 u(x)\, \frac{d}{dx}\psi(x)\, dx + \int_0^1 \varphi(y)\, dy \int_0^1 u(x)\, \frac{d}{dx}\theta(x)\, dx \right| = \left| -\int_0^1 \frac{d}{dx}u(x)\, \psi(x)\, dx + \int_0^1 \varphi(y)\, dy \int_0^1 u(x)\, \frac{d}{dx}\theta(x)\, dx \right| \leq (K + \|\theta\|_{H^1}) \|u\|_{H^1} \|\varphi\|_{L^1}.$$


Thus, $u$ defines a continuous linear form on $L^1(]0,1[)$ and is therefore, by Riesz's theorem [GOU 11, theorem 4.37], identified with a function in $L^\infty(]0,1[)$ whose norm satisfies [2.9].

To demonstrate lemma 2.2, consider $\varphi \in C_c^\infty(]0,1[)$. Since $\varphi(0) = 0$, the Cauchy–Schwarz inequality gives $|\varphi(x)| = \left| \int_0^x \frac{d}{dx}\varphi(\xi)\, d\xi \right| \leq \left\| \frac{d}{dx} \varphi \right\|_{L^2}$, and the $L^2$ estimate follows by integrating; we conclude by density. □

With this statement, we can return to the continuity and coercivity relations introduced with [2.8]: for all $\lambda \geq 0$, including the case $\lambda = 0$, thanks to lemma 2.2, there exist constants $m(\lambda), M(\lambda) > 0$ such that for all $u, v \in H_0^1(]0,1[)$, we have
$$|a(u,v)| \leq M(\lambda) \|u\|_{H^1} \|v\|_{H^1}, \qquad a(u,u) \geq m(\lambda) \|u\|_{H^1}^2.$$
The Lax–Milgram theorem now applies, and allows us to demonstrate

THEOREM 2.1.– For all $\lambda \geq 0$ and all $f \in L^2(]0,1[)$, problem [2.2] with homogeneous Dirichlet conditions has a unique solution $u \in H_0^1(]0,1[)$, in the sense that for all $v \in H_0^1(]0,1[)$, we have $a(u,v) = \ell(v)$. Note that
$$\|u\|_{H^1} \leq \frac{1}{m(\lambda)} \|f\|_{L^2}.$$

The estimate can be obtained simply by taking $v = u$ in the variational formulation, so that $m(\lambda) \|u\|_{H^1}^2 \leq a(u,u) = \ell(u) \leq \|f\|_{L^2} \|u\|_{L^2} \leq \|f\|_{L^2} \|u\|_{H^1}$.

The problem takes on a different nature when we work with the Neumann conditions $\frac{d}{dx} u(0) = 0 = \frac{d}{dx} u(1)$. Assuming that $u$ is a fairly regular solution of $\lambda u - \frac{d}{dx}\left( k \frac{d}{dx} u \right) = f$ with those boundary conditions, and given that the integration by parts is valid, we obtain $a(u,v) = \ell(v)$ for all $v \in C^\infty([0,1])$, with the same bilinear form $a$ and the linear form $\ell$ defined on $H^1(]0,1[)$. Here we do not require the test function $v$ to vanish at $x = 0$ and $x = 1$: the boundary terms of the integration by parts cancel out thanks to the Neumann condition. For $\lambda > 0$, we verify that the Lax–Milgram theorem still applies in the space $H^1(]0,1[)$, which allows us to extend theorem 2.1 to this case.

THEOREM 2.2.– For all $\lambda > 0$ and all $f \in L^2(]0,1[)$, problem [2.2] with homogeneous Neumann conditions has a unique solution $u \in H^1(]0,1[)$, in the sense that for all $v \in H^1(]0,1[)$, we have $a(u,v) = \ell(v)$.

For $\lambda = 0$, the coercivity of the form $a$ fails. In particular, note that $\left\| \frac{d}{dx} u \right\|_{L^2} = 0$ implies that $x \mapsto u(x)$ is a constant function, but the boundary conditions do not imply that this constant is 0. In fact, the problem $-(k u')' = f$, with $u'(0) = 0 = u'(1)$, has an infinite number of solutions: if $u$ is a solution, then for all $C \in \mathbb{R}$, $x \mapsto u(x) + C$ is also a solution. We also observe that the


equation makes sense only if the right-hand side $f$ satisfies the compatibility condition $\int_0^1 f(x)\, dx = 0$. Indeed, this condition is a consequence of the boundary conditions when the equation is integrated. We encounter an analogous difficulty in the periodic case: the problem $-(k u')' = f$ with $1$-periodic conditions does not have a unique solution in $L^2_\#(]0,1[)$, where the symbol $\#$ denotes the condition of $1$-periodicity, since the constant functions are solutions to the homogeneous equation with $f = 0$. Moreover, by formally integrating the equation, we see that the right-hand side must average to zero in order for the problem to make sense. We will therefore look for a solution that also averages to zero. The periodic problem can be understood using Fourier analysis, and we are going to show how those difficulties can be overcome. We introduce the Fourier coefficients, for $n \in \mathbb{Z}$,
$$\hat{u}_n = \int_0^1 u(x)\, e^{-2i\pi n x}\, dx,$$
and associate $u \in L^2_\#$ with the series $\sum_{n\in\mathbb{Z}} \hat{u}_n e^{2i\pi n x}$ (convergence taken to be in $L^2$; see [GOU 11, sections 5.3 & 5.4]). Moreover, its derivative is associated with the series $\sum_{n\in\mathbb{Z}\setminus\{0\}} 2i\pi n\, \hat{u}_n e^{2i\pi n x}$. In this context, we thus define the Sobolev space
$$H^1_\# = \left\{ u \in L^2_\#,\ (\hat{u}_n)_{n\in\mathbb{Z}} \in \ell^2 \text{ and } (n \hat{u}_n)_{n\in\mathbb{Z}} \in \ell^2 \right\},$$
which is indeed a Hilbert space for the norm defined by
$$\|u\|_{H^1}^2 = \sum_{n\in\mathbb{Z}} (1 + 4\pi^2 n^2)\, |\hat{u}_n|^2.$$

We write
$$H^1_{0,\#} = \left\{ u \in H^1_\#,\ \hat{u}_0 = 0 \right\},$$
which is a closed subspace of $H^1_\#$. We then have the following Poincaré inequality, analogous to lemma 2.2 in this context:
$$\text{if } \hat{u}_0 = \int_0^1 u(x)\, dx = 0, \text{ then } \|u\|_{L^2}^2 = \sum_{n\in\mathbb{Z}} |\hat{u}_n|^2 \leq \sum_{n\in\mathbb{Z}\setminus\{0\}} 4\pi^2 n^2\, |\hat{u}_n|^2 = \left\| \frac{d}{dx} u \right\|_{L^2}^2.$$
(For other variants of this kind of statement, the reader may also refer to the applications given in Appendix 3.) This observation allows us to apply the Lax–Milgram theorem in the periodic case.


THEOREM 2.3.– For all $\lambda > 0$ and all $f \in L^2_\#$, problem [2.2] with periodic conditions has a unique solution $u \in H^1_\#$, in the sense that for all $v \in H^1_\#$, we have $a(u,v) = \ell(v)$. If $\lambda = 0$, then for all $f \in L^2_\#$ such that $\hat{f}_0 = 0$, problem [2.2] with periodic conditions has a unique solution $u \in H^1_{0,\#}$, in the sense that for all $v \in H^1_{0,\#}$, we have $a(u,v) = \ell(v)$.
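Theorem 2.3 also suggests a numerical method. In the constant-coefficient case (our simplifying assumption here; the theorem itself allows variable $k$), the discrete Fourier transform diagonalizes the periodic problem $\lambda u - k u'' = f$, as the following sketch illustrates:

```python
import numpy as np

def solve_periodic(f_vals, lam=1.0, k=1.0):
    """Solve lam*u - k*u'' = f with 1-periodic conditions, for constant k,
    by dividing each Fourier mode by its symbol lam + 4*pi^2*n^2*k."""
    n = f_vals.size
    modes = np.fft.fftfreq(n, d=1.0 / n)      # integer frequencies n
    u_hat = np.fft.fft(f_vals) / (lam + k * (2 * np.pi * modes) ** 2)
    return np.fft.ifft(u_hat).real
```

For $f(x) = \cos(2\pi x)$, the exact solution $\cos(2\pi x)/(\lambda + 4\pi^2 k)$ is reproduced to machine precision. For $\lambda = 0$, one would additionally set the $n = 0$ mode of the solution to zero, in accordance with the zero-average condition discussed above.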

Returning to problem [2.2] with Neumann conditions, we must thus find a technical means to eliminate constant solutions, as was accomplished by introducing $H^1_{0,\#}$ in the periodic case. This involves the introduction of appropriate quotient spaces, and it is later reflected in the numerical treatment of the equation.

The variational formulation of the Dirichlet problem also allows us to justify the maximum principle.

PROPOSITION 2.1.– Let $\lambda \geq 0$. Let $f$ be a measurable function such that for almost all $x \in ]0,1[$, we have $0 \leq f(x) \leq M < \infty$. Then, the solution $u \in H_0^1(]0,1[)$ of [2.2] satisfies $0 \leq u(x) \leq M/\lambda$ for almost all $x \in ]0,1[$.

PROOF.– Let $\Psi : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function such that $\Psi(0) = 0$ and $\Psi' \in L^\infty(\mathbb{R})$. Assuming that
$$\Psi(u) \in H_0^1(]0,1[), \qquad \frac{d}{dx} \Psi(u) = \Psi'(u)\, \frac{d}{dx} u, \quad [2.10]$$
we may take $v = \Psi(u)$ in [2.8] to obtain
$$\lambda \int_0^1 \Psi(u)\, u\, dx + \int_0^1 k\, \Psi'(u) \left| \frac{d}{dx} u \right|^2 dx = \int_0^1 f\, \Psi(u)\, dx.$$
We choose the function $\Psi$ in such a way that $\Psi$ is strictly increasing on $]-\infty,0[$ and $\Psi(z) = 0$ for $z \geq 0$. Since $\Psi(u) \leq 0$, $u\Psi(u) \geq 0$ and $\Psi'(u) \geq 0$, if $f$ is non-negative, then for such a function we obtain
$$0 \leq \lambda \int_0^1 \Psi(u)\, u\, dx + \int_0^1 k\, \Psi'(u) \left| \frac{d}{dx} u \right|^2 dx = \int_0^1 f\, \Psi(u)\, dx \leq 0.$$
We distinguish between the following two situations:
– $\lambda > 0$: in this case, $u\Psi(u) = 0$ almost everywhere, which implies that $u(x) \geq 0$ for almost all $x \in ]0,1[$;
– $\lambda = 0$: in this case, we write $Z(s) = \int_0^s \sqrt{\Psi'(\sigma)}\, d\sigma$. Note that $Z(s) = 0$ for all $s \geq 0$ and $Z(s) < 0$ for all $s < 0$. It follows from these manipulations that $\Psi'(u(x)) \left| \frac{d}{dx} u(x) \right|^2 = \left| \frac{d}{dx} Z(u(x)) \right|^2$ vanishes for almost all $x \in ]0,1[$. Now, $Z(u) \in H_0^1(]0,1[)$ and the Poincaré inequality thus imply that $Z(u(x)) = 0$, so $u(x) \geq 0$, for almost all $x \in ]0,1[$.


Bounding $u$ from above is accomplished with the same argument, by considering in this case a function $\Psi_k$ that vanishes on $]-\infty,k]$ and is strictly increasing on $[k,+\infty[$. We write $M = \|f\|_\infty$ and note that $\Psi_{M/\lambda}(0) = 0$, so that $\Psi_{M/\lambda}(u) \in H_0^1(]0,1[)$. We thus obtain
$$\underbrace{\lambda \int_0^1 \Psi_{M/\lambda}(u)\, (u - M/\lambda)\, dx}_{\geq 0} + \underbrace{\int_0^1 k\, \Psi'_{M/\lambda}(u) \left| \frac{d}{dx} u \right|^2 dx}_{\geq 0} = \underbrace{\int_0^1 (f - M)\, \Psi_{M/\lambda}(u)\, dx}_{\leq 0}.$$
We infer that $x \mapsto \Psi_{M/\lambda}(u(x))(u(x) - M/\lambda)$ vanishes on all of $]0,1[$, so that $u(x) \leq M/\lambda$. When $\lambda = 0$, we can still show that $\|u\|_\infty \leq C \|f\|_\infty$ for a certain constant $C > 0$, as a result of lemma 2.1 and theorem 2.1.

It remains to show [2.10]. The problem lies in extending the chain rule formula to functions in $H^1$. Let us first consider $\phi \in C_c^\infty(]0,1[)$. Then, $\Psi(\phi) \in C_c^1(]0,1[)$ satisfies $\frac{d}{dx}\Psi(\phi) = \Psi'(\phi)\, \frac{d}{dx}\phi$. Let $u \in H_0^1(]0,1[)$ and let $(\phi_n)_{n\in\mathbb{N}}$ be a sequence of elements of $C_c^\infty(]0,1[)$ that converges to $u$ in $H_0^1(]0,1[)$. In particular, $(\phi_n)_{n\in\mathbb{N}}$ converges to $u$ in $L^2(]0,1[)$ and, up to extracting a subsequence, we may assume that $\lim_{n\to\infty} \phi_n(x) = u(x)$ for almost all $x \in ]0,1[$ and that the functions $\phi_n$ are dominated by a square-integrable function (see [GOU 11, lemma 3.31]). Since $|\Psi(\phi_n) - \Psi(u)| \leq \|\Psi'\|_\infty |\phi_n - u|$, it follows that $\Psi(\phi_n)$ converges to $\Psi(u)$ in $L^2(]0,1[)$. Furthermore, $\Psi'(\phi_n)$ converges almost everywhere to $\Psi'(u)$ and remains uniformly bounded in $L^\infty(]0,1[)$. Finally, $\frac{d}{dx}\phi_n$ converges to $\frac{d}{dx}u$ in $L^2(]0,1[)$. It follows that the product $\Psi'(\phi_n)\, \frac{d}{dx}\phi_n$ converges to $\Psi'(u)\, \frac{d}{dx}u$ in $L^2(]0,1[)$. Let $\eta \in C_c^\infty(]0,1[)$. We have

$$\int_0^1 \Psi(\phi_n)\, \frac{d}{dx}\eta\, dx = -\int_0^1 \Psi'(\phi_n)\, \frac{d}{dx}\phi_n\, \eta\, dx,$$
and, passing to the limit $n \to \infty$,
$$\int_0^1 \Psi(u)\, \frac{d}{dx}\eta\, dx = -\int_0^1 \Psi'(u)\, \frac{d}{dx}u\, \eta\, dx,$$
which means that $\frac{d}{dx}\Psi(u) = \Psi'(u)\, \frac{d}{dx}u \in L^2(]0,1[)$. □



2.1.2. A radiative transfer problem

In order to illustrate the notions introduced, we will consider the following nonlinear problem:
$$\begin{cases} \sigma T^4(x) - \dfrac{d^2}{dx^2} T(x) = \sigma T_{\mathrm{ref}}^4(x) & \text{for } 0 < x < 1, \\ T(0) = 0 = T(1). \end{cases} \quad [2.11]$$


This equation is also motivated by questions from thermal science. The source term $f = \sigma(T_{\mathrm{ref}}^4 - T^4)$ describes an energy transfer through radiation: the medium emits energy at the temperature $T_{\mathrm{ref}}$ and absorbs ambient energy related to the temperature $T$. The nonlinearity is related to the Stefan–Boltzmann emission law. Indeed, a body emits and absorbs energy in the form of electromagnetic radiation, with a frequency distribution that depends on the temperature. More specifically, this energy is emitted or absorbed in the form of a flux of photons whose distribution follows Planck's law
$$I(x,\nu) = \frac{2h}{c^2}\, \frac{\nu^3}{e^{h\nu/kT(x)} - 1}.$$
Here, $\nu > 0$ is the frequency of the photons emitted or absorbed, $h$ is Planck's constant, $k$ is Boltzmann's constant and $c$ is the speed of light. The total energy is obtained by summing over the frequencies:
$$E(x) = \int_0^\infty I(x,\nu)\, d\nu = \frac{2k^4}{c^2 h^3}\, T(x)^4 \int_0^\infty \frac{\mu^3}{e^\mu - 1}\, d\mu,$$
where we have used the change of variables $\mu = \frac{h\nu}{kT}$. In order to calculate the integral that appears in this expression, we use the series expansion
$$\frac{1}{e^\mu - 1} = e^{-\mu} \times \frac{1}{1 - e^{-\mu}} = e^{-\mu} \sum_{n=0}^\infty e^{-n\mu} = \sum_{n=1}^\infty e^{-n\mu}.$$
By Fubini's theorem, and since the expressions are non-negative, we can exchange the sum and the integral and write
$$E(x) = \frac{2k^4}{c^2 h^3}\, T(x)^4 \times \sum_{n=1}^\infty \int_0^\infty \mu^3 e^{-n\mu}\, d\mu.$$
With three successive integrations by parts, we reach
$$E(x) = \frac{2k^4}{c^2 h^3}\, T(x)^4 \times \sum_{n=1}^\infty \frac{6}{n^3} \int_0^\infty e^{-n\mu}\, d\mu = \frac{2k^4}{c^2 h^3}\, T(x)^4 \times \sum_{n=1}^\infty \frac{6}{n^4}.$$

We evaluate this last sum by recognizing the Fourier coefficients of the function $\psi : x \in [0,1] \mapsto x(1-x)$, extended to $\mathbb{R}$ by $1$-periodicity. Indeed, we find
$$\hat{\psi}(0) = \int_0^1 x(1-x)\, dx = 1/2 - 1/3 = 1/6,$$
and, for $k \neq 0$, integrating by parts twice (the second integration by parts leaves only boundary terms, since $\int_0^1 e^{-2i\pi kx}\, dx = 0$),
$$\hat{\psi}(k) = \int_0^1 x(1-x)\, e^{-2i\pi k x}\, dx = \int_0^1 (1-2x)\, \frac{e^{-2i\pi k x}}{2i\pi k}\, dx = -\frac{1}{2\pi^2 k^2}.$$
It follows that
$$\|\psi\|_{L^2}^2 = \int_0^1 x^2(1-x)^2\, dx = \frac{1}{30} = \sum_{k=-\infty}^{+\infty} |\hat{\psi}(k)|^2 = |\hat{\psi}(0)|^2 + 2\sum_{k=1}^{+\infty} |\hat{\psi}(k)|^2 = \frac{1}{36} + \frac{1}{2\pi^4} \sum_{k=1}^{+\infty} \frac{1}{k^4},$$
and therefore $\sum_{k=1}^{+\infty} \frac{1}{k^4} = \frac{\pi^4}{90}$. Finally, we find $E(x) = \sigma T(x)^4$, with
$$\sigma = \frac{2\pi^4 k^4}{15 c^2 h^3},$$
which is the Stefan–Boltzmann constant.
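As a quick numerical sanity check of this identity (ours, not part of the book's derivation), the partial sums of $\sum_{n\geq 1} 1/n^4$ indeed approach $\pi^4/90$; the remainder beyond $N$ is smaller than $1/(3N^3)$:

```python
import math

# Partial sum of sum_{n>=1} 1/n^4, accumulated exactly with math.fsum;
# with N = 10**5 the truncation error is below 4e-16.
partial = math.fsum(1.0 / n**4 for n in range(1, 10**5 + 1))
```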

2.1.2.1. Uniqueness given existence

Suppose there are two solutions in $H_0^1(]0,1[)$, which we denote by $T$ and $S$. In particular, these functions are continuous by lemma 2.1. Then, we have
$$\sigma (T^4 - S^4) - \frac{d^2}{dx^2}(T - S) = 0,$$
with $T(0) = S(0) = 0 = T(1) = S(1)$. We multiply by $T - S$ and integrate over $[0,1]$ to obtain
$$\sigma \int_0^1 (T^4 - S^4)(T - S)\, dx + \int_0^1 \left| \frac{d}{dx}(T - S) \right|^2 dx = 0.$$
Now, $(T^4 - S^4)(T - S)$ is a non-negative quantity. It follows that $T(x) = S(x)$ almost everywhere on $[0,1]$.

2.1.2.2. A priori estimates

To be consistent with the physical interpretation, we expect to have $T \geq 0$. In order to ensure this property, we rewrite the equation in the form
$$\sigma f(T) - \frac{d^2}{dx^2} T = \sigma T_{\mathrm{ref}}^4, \quad [2.12]$$


with $f(T) = |T|^3 T$. We proceed as in the proof of proposition 2.1. If $T \in H_0^1(]0,1[)$ is a solution of [2.12], then we have
$$\sigma \int_0^1 f(T)\, \Psi(T)\, dx + \int_0^1 \Psi'(T) \left| \frac{d}{dx} T \right|^2 dx = \int_0^1 \sigma T_{\mathrm{ref}}^4\, \Psi(T)\, dx$$
for $\Psi$ of class $C^1$ such that $\Psi(0) = 0$ and $\Psi'$ is bounded. First, we take $\Psi$ increasing such that $z\Psi(z) > 0$ for all $z < 0$ and $\Psi(z) = 0$ if $z \geq 0$; we have $f(T)\Psi(T) \geq 0$, while $\sigma T_{\mathrm{ref}}^4\, \Psi(T) \leq 0$ because $\sigma T_{\mathrm{ref}}^4 \geq 0$. It follows that $T(x) \geq 0$, so $T$ indeed satisfies [2.11]. Next, we write $M = \|T_{\mathrm{ref}}\|_\infty$ and consider $\Psi_M$ such that $\Psi_M(z) = 0$ for $z \leq M$ and $\Psi_M(z) > 0$ for $z > M$. Therefore,
$$\underbrace{\sigma \int_0^1 (f(T) - M^4)\, \Psi_M(T)\, dx}_{\geq 0} + \underbrace{\int_0^1 \Psi'_M(T) \left| \frac{d}{dx} T \right|^2 dx}_{\geq 0} = \sigma \int_0^1 (T_{\mathrm{ref}}^4 - M^4)\, \Psi_M(T)\, dx \leq 0,$$
which allows us to conclude that $T(x) \leq M$ on $]0,1[$.

2.1.2.3. Existence of solutions

We construct a solution of [2.11] as a fixed point of the mapping $\Phi : T \mapsto \overline{T}$, where $\overline{T}$ is the solution to the linear problem
$$\sigma T^3\, \overline{T} - \frac{d^2}{dx^2} \overline{T} = \sigma T_{\mathrm{ref}}^4 \quad \text{for } 0 < x < 1,$$
with Dirichlet conditions. The existence of a fixed point is the result of a general statement concerning compact mappings, which is detailed in Appendix 4. We begin by recalling the following definition.

DEFINITION 2.1.– Let $E$ and $F$ be two normed vector spaces. We say that a (linear or nonlinear) mapping $f : A \subset E \to F$ is compact if for every bounded set $B \subset A$, $\overline{f(B)}$ is a compact set of $F$. In other words, for every sequence $(x_n)_{n\in\mathbb{N}}$ that is bounded in $A$, we can extract a subsequence $(x_{n_k})_{k\in\mathbb{N}}$ such that $(f(x_{n_k}))_{k\in\mathbb{N}}$ converges in $F$.

THEOREM 2.4 (Schauder Theorem).– Let $B$ be a non-empty, convex, closed and bounded set in a normed vector space $E$. Moreover, let $f : B \to B$ be a continuous and compact mapping. Then, $f$ has a fixed point in $B$.


We seek a fixed point of the mapping $\Phi$ in the set
$$\mathscr{C} = \left\{ u \in C^0([0,1]),\ u(0) = 0 = u(1),\ 0 \leq u(x) \leq R \text{ for all } x \in [0,1] \right\},$$
where $R > 0$ is a constant that will be determined later. This is a non-empty, convex, closed and bounded set of $C^0([0,1])$. A direct application of the Lax–Milgram theorem (which reproduces the proofs of theorem 2.1 and proposition 2.1) ensures that for all $T \in \mathscr{C}$, we can indeed define a unique function $\Phi(T) = \overline{T} \in H_0^1(]0,1[)$ that is non-negative and, by lemma 2.1, also continuous on $[0,1]$. Moreover, we have
$$\sigma \int_0^1 |T|^3\, \overline{T}^2\, dx + \int_0^1 \left| \frac{d}{dx} \overline{T} \right|^2 dx = \sigma \int_0^1 T_{\mathrm{ref}}^4\, \overline{T}\, dx.$$
The first integral on the left-hand side is non-negative, whereas the right-hand side can be bounded from above, using the Cauchy–Schwarz inequality and then the Poincaré inequality of lemma 2.2, by $\sigma \|T_{\mathrm{ref}}^4\|_{L^2} \|\overline{T}\|_{L^2} \leq C\sigma \|T_{\mathrm{ref}}^4\|_{L^2} \left\| \frac{d}{dx} \overline{T} \right\|_{L^2}$. We infer that
$$\|\overline{T}\|_{L^\infty} \leq \left\| \frac{d}{dx} \overline{T} \right\|_{L^2} \leq C\sigma \|T_{\mathrm{ref}}^4\|_{L^2},$$
using lemma 2.2. Therefore, with $R \geq C\sigma \|T_{\mathrm{ref}}^4\|_{L^2}$, $\Phi$ is indeed a mapping of $\mathscr{C}$ into itself. In fact, this argument also shows that for all $T \in \mathscr{C}$, the $H^1$ norm of the solution $\Phi(T) = \overline{T}$ is bounded from above by $R$. As noted in the proof of lemma 2.1, it follows (from the Arzelà–Ascoli theorem) that $\Phi$ is a compact mapping of $\mathscr{C}$ into $\mathscr{C}$. It remains to show that $\Phi$ is continuous in order to conclude, by using Schauder's theorem, that there exists $T \in \mathscr{C}$ such that $\Phi(T) = T$, which is therefore the non-negative solution of [2.12] (and of [2.11]!).

More generally, we consider a sequence $(\lambda_n)_{n\in\mathbb{N}}$ of non-negative and continuous functions such that $\lambda_n$ converges uniformly to $\lambda$ on $[0,1]$. We associate with it the sequence $(u_n)_{n\in\mathbb{N}}$ of solutions of
$$\lambda_n u_n - \frac{d^2}{dx^2} u_n = f \quad \text{for } 0 < x < 1,$$
with Dirichlet conditions and where $f \in L^2(]0,1[)$. We let $u$ denote the solution to the problem associated with the limit $\lambda$. These functions are elements of $H_0^1(]0,1[)$, and we have
$$\int_0^1 \lambda_n |u_n - u|^2\, dx + \int_0^1 \left| \frac{d}{dx}(u_n - u) \right|^2 dx = \int_0^1 (\lambda - \lambda_n)\, u\, (u_n - u)\, dx \leq \|\lambda_n - \lambda\|_{L^\infty} \|u\|_{L^2} \|u_n - u\|_{L^2}.$$


By lemma 2.1 and lemma 2.2, it follows that
$$\|u_n - u\|_{L^\infty} \leq \left\| \frac{d}{dx}(u_n - u) \right\|_{L^2} \leq C \|u\|_{L^2} \|\lambda_n - \lambda\|_{L^\infty},$$
which proves that $u_n$ converges to $u$ in $H^1$ and in $C^0$.

2.1.2.4. Shooting method

Unlike in the example that we studied in the introduction, the solution of problem [2.11] does not have a simple expression. We will seek to approximate it by using the shooting method. In order to justify the effectiveness of the method, we will use some of the information that we know about the solution. Indeed, we know that the desired solution $T$ is bounded from above by a certain constant $M > 0$. Accordingly, we replace the nonlinearity $\sigma T^4$ with
$$f_M(T) = \begin{cases} \sigma T^4 & \text{if } |T| \leq M, \\ \sigma M^4 + \zeta_M(T) & \text{if } |T| \geq M, \end{cases}$$
where $\zeta_M$ is a function of class $C^\infty$ such that $0 \leq \zeta_M(z) \leq 1$, $\zeta_M(0) = 0$, $0 \leq \zeta_M'(z) \leq C_M$, and $f_M$ is $C^\infty$. This modification only holds a technical advantage: by adapting the above estimate analysis to the solution of $f_M(u) - \frac{d^2}{dx^2} u = \sigma T_{\mathrm{ref}}^4$ with Dirichlet conditions, we can show that $0 \leq u(x) \leq M$. Consider the differential system
$$\frac{d}{dt} \begin{pmatrix} T \\ S \end{pmatrix}(t) = \begin{pmatrix} S(t) \\ f_M(T(t)) - \sigma T_{\mathrm{ref}}^4(t) \end{pmatrix}.$$
Assuming that $t \mapsto T_{\mathrm{ref}}(t)$ is continuous, the mapping
$$F : [0,1] \times \mathbb{R} \times \mathbb{R} \longrightarrow \mathbb{R}^2, \qquad (t, X, Y) \longmapsto \begin{pmatrix} Y \\ f_M(X) - \sigma T_{\mathrm{ref}}^4(t) \end{pmatrix}$$
is indeed continuous and uniformly Lipschitz continuous in the variables $X, Y$. For all $\alpha \in \mathbb{R}$, the system therefore has a unique solution $(T_\alpha, S_\alpha) : [0,1] \to \mathbb{R}^2$, defined on all of $[0,1]$, which satisfies the Cauchy data $T_\alpha(0) = 0$, $S_\alpha(0) = \alpha$. The shooting method involves finding $\bar{\alpha}$ such that $T_{\bar{\alpha}}(1) = 0$. By the continuity of the solutions of a differential system with respect to the data (see, for example, [ARN 88, Chapter 4, section 31] or [BEN 10, lemma II.3]), the mapping $\alpha \mapsto T_\alpha(1)$ is continuous. More specifically, the Grönwall lemma shows that $|T_\alpha(1) - T_\beta(1)| \leq e^L |\alpha - \beta|$, where $L$ designates the Lipschitz constant of the function $F$. It remains to show that $T_\alpha(1)$ can take negative and positive values. On the one hand, we have
$$T_\alpha''(t) = f_M(T_\alpha(t)) - \sigma T_{\mathrm{ref}}^4(t) \geq -\sigma T_{\mathrm{ref}}^4(t) \geq -\sigma M^4,$$


so
$$S_\alpha(t) = T_\alpha'(t) \geq \alpha - \sigma M^4 t, \qquad \text{then} \qquad T_\alpha(t) \geq \alpha t - \sigma M^4\, \frac{t^2}{2}.$$
It follows that $T_\alpha(1) > 0$ when $\alpha > \frac{\sigma M^4}{2}$. On the other hand, with the same argument, we obtain
$$T_\alpha''(t) = f_M(T_\alpha(t)) - \sigma T_{\mathrm{ref}}^4(t) \leq f_M(T_\alpha(t)) \leq \sigma M^4 + 1,$$
and finally
$$T_\alpha(t) \leq \alpha t + (\sigma M^4 + 1)\, \frac{t^2}{2}.$$
In particular, $T_\alpha(1) < 0$ when $\alpha < -\frac{1 + \sigma M^4}{2}$. The intermediate value theorem guarantees the existence of $\bar{\alpha}$ such that $T_{\bar{\alpha}}(1) = 0$.


Figure 2.3. Solution of the radiative transfer problem using the shooting method: successive iterations

We perform simulations with the data $\sigma = 1.4$ and $T_{\mathrm{ref}}(x) = 1 + \sin(3x + 0.5) + 0.3 \ln(1.7 + x)$. The differential system is solved by the explicit Euler scheme. We find $\bar{\alpha} = 8.1173$. Figure 2.3 shows some iterations and Figure 2.4 shows the graph of the solution obtained.
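This computation can be reproduced along the following lines (a Python sketch of ours: the truncation level $M$, the crude clipping that stands in for the smooth matching $\zeta_M$, the Euler step and the bisection bracket are all our choices):

```python
import numpy as np

SIGMA = 1.4
TREF = lambda x: 1 + np.sin(3 * x + 0.5) + 0.3 * np.log(1.7 + x)

def T_end(alpha, M=3.0, n=4000):
    """Explicit Euler integration of T'' = f_M(T) - sigma*Tref(x)^4,
    with T(0) = 0, T'(0) = alpha; returns the terminal value T(1)."""
    h = 1.0 / n
    src = SIGMA * TREF(np.arange(n) * h) ** 4             # sigma*Tref(x_i)^4
    f_M = lambda T: SIGMA * np.sign(T) * min(abs(T), M) ** 4  # clipped nonlinearity
    T, S = 0.0, alpha
    for i in range(n):
        T, S = T + h * S, S + h * (f_M(T) - src[i])
    return T

# Bisection: T_alpha(1) < 0 for alpha = 0 and > 0 for alpha = 20
lo, hi = 0.0, 20.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if T_end(mid) > 0 else (mid, hi)
alpha_bar = 0.5 * (lo + hi)
```

Up to the discretization error of the Euler scheme, the bisection lands near the value $\bar{\alpha} = 8.1173$ reported above.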


Figure 2.4. Solution of the radiative transfer problem using the shooting method: solution obtained

2.1.2.5. Interpretation as a minimization problem

It can also be interesting to interpret equation [2.11] through the minimization problem
$$\min \left\{ J(u),\ u \in H_0^1(]0,1[) \right\}, \qquad \text{where } J(u) = \int_0^1 \left( \frac{1}{2} \left| \frac{du}{dx}(x) \right|^2 + \sigma\, \frac{|u|^5}{5} - u(x)\, T_{\mathrm{ref}}^4(x) \right) dx.$$
Using lemma 2.1, we can confirm that the functional $J$ is well defined on $H_0^1(]0,1[)$ and strictly convex. More specifically, we note that
$$J(u+h) - J(u) = \int_0^1 \left( \frac{du}{dx}\, \frac{dh}{dx}(x) + \sigma \left( |u|^3 u - T_{\mathrm{ref}}^4 \right) h(x) \right) dx + R(u,h),$$
$$R(u,h) = \int_0^1 \left( \frac{1}{2} \left| \frac{dh}{dx}(x) \right|^2 + \sigma\, |h(x)|^2 \int_0^1 (1-\theta)\, 4 |u + \theta h|^3(x)\, d\theta \right) dx,$$

where $\lim_{\|h\|_{H_0^1} \to 0} \frac{R(u,h)}{\|h\|_{H_0^1}} = 0$. Moreover,
$$J(u) \geq \int_0^1 \left( \frac{1}{2} \left| \frac{du}{dx}(x) \right|^2 - \sigma\, u(x)\, T_{\mathrm{ref}}^4(x) \right) dx \geq \|u\|_{H_0^1} \left( \frac{1}{2} \|u\|_{H_0^1} - C\sigma \|T_{\mathrm{ref}}^4\|_{L^2} \right) \quad [2.13]$$
(with $\|u\|_{H_0^1} = \left\| \frac{du}{dx} \right\|_{L^2}$ and $C$ the constant of lemma 2.2), which implies that $\lim_{\|u\|_{H_0^1(]0,1[)} \to \infty} J(u) = +\infty$. We infer from this computation that $J$ has a minimum. Indeed, if $(u_n)_{n\in\mathbb{N}}$ designates a minimizing sequence, then $(J(u_n))_{n\in\mathbb{N}}$ is bounded in $\mathbb{R}$. It follows that the sequence $(u_n)_{n\in\mathbb{N}}$ is bounded in $L^5(]0,1[)$ and in the separable Hilbert space $H_0^1(]0,1[)$. It therefore has a subsequence that converges weakly in those spaces (see [GOU 11, corollary 7.32]). We can therefore assume that $u_n \rightharpoonup T$ weakly in $L^5(]0,1[)$ and $\frac{du_n}{dx} \rightharpoonup \frac{dT}{dx}$ weakly in $L^2(]0,1[)$. We use the convexity inequalities
$$\int_0^1 \left| \frac{du_n}{dx} \right|^2 dx = \int_0^1 \left| \frac{dT}{dx} \right|^2 dx + 2\int_0^1 \frac{dT}{dx}\, \frac{d(u_n - T)}{dx}\, dx + \int_0^1 \left| \frac{d(u_n - T)}{dx} \right|^2 dx \geq \int_0^1 \left| \frac{dT}{dx} \right|^2 dx + 2\int_0^1 \frac{dT}{dx}\, \frac{d(u_n - T)}{dx}\, dx$$
and
$$\int_0^1 |u_n|^q\, dx \geq \int_0^1 |T|^q\, dx + q \int_0^1 |T|^{q-2}\, T\, (u_n - T)\, dx,$$
where $|T|^{q-2} T \in L^{q'}(]0,1[)$ since $(q-1)q' = q$ (here $q = 5$). Letting $n \to \infty$, we thus obtain $\lim_{n\to\infty} J(u_n) \geq J(T)$, so $T$ is a minimizer of $J$. Since $J$ is strictly convex, this minimizer is unique. Finally, the first-order expansion of $J$ above shows that the minimizer $T$ is characterized by
$$\int_0^1 \left( \frac{dT}{dx}\, \frac{dh}{dx}(x) + \sigma \left( |T|^3 T - T_{\mathrm{ref}}^4 \right) h(x) \right) dx = 0$$
for all $h \in H_0^1(]0,1[)$. In other words, $T$ is a solution of [2.12].

2.1.3. Analysis elements for multidimensional problems

The approach detailed in section 2.1.1 can be adapted to higher dimensions. The analysis involves the preliminary introduction of appropriate functional spaces. We will therefore let $H^1(\Omega)$ denote the set of square-integrable functions $u : \Omega \subset \mathbb{R}^D \to$


$\mathbb{R}$, such that there exists a constant $C > 0$ satisfying, for all $\varphi \in C_c^\infty(\Omega)$ and all $i \in \{1, \ldots, D\}$,
$$\left| \int_\Omega u(x)\, \partial_{x_i} \varphi(x)\, dx \right| \leq C \|\varphi\|_{L^2(\Omega)}.$$
This allows us to define the gradient of $u$ by duality, as the vector whose components $\partial_{x_i} u$ are elements of $L^2(\Omega)$. This space is equipped with the norm
$$\|u\|_{H^1} = \left( \|u\|_{L^2}^2 + \sum_{i=1}^D \int_\Omega |\partial_{x_i} u|^2\, dx \right)^{1/2},$$
which makes it a Hilbert space. We define $H_0^1(\Omega)$ as the closure of $C_c^\infty(\Omega)$ for this norm. This space can be interpreted as the set of functions of $H^1(\Omega)$ "that vanish on $\partial\Omega$". This statement is ambiguous unless the precise meaning of the trace on $\partial\Omega$ of an element of $H^1(\Omega)$ is specified: in dimension $D > 1$, this point represents a real difficulty because those elements are not necessarily continuous functions. The definition of the traces of elements of $H^1(\Omega)$ is a delicate and deep subject that goes beyond the scope of this book, and here we provide only an intuitive sketch. (The reader may refer to [EVA 98, Chapter 5] or [ZUI 02].) Equation [2.1] can thus be understood in the following variational form: find $u \in H$, such that for all $v \in H$, we have $a(u,v) = \ell(v)$, where:
– $H$ is $H_0^1(\Omega)$ in the case of Dirichlet conditions, or $H^1(\Omega)$ for Neumann conditions;
– $a$ is the bilinear form
$$a(u,v) = \lambda \int_\Omega u(x) v(x)\, dx + \int_\Omega k(x)\, \nabla_x u(x) \cdot \nabla_x v(x)\, dx.$$
Note that $a$ satisfies the continuity property $|a(u,v)| \leq (\lambda + K) \|u\|_{H^1} \|v\|_{H^1}$;
– $\ell$ is the linear form
$$\ell(v) = \int_\Omega f(x) v(x)\, dx,$$
which satisfies the continuity property $|\ell(v)| \leq \|f\|_{L^2} \|v\|_{L^2} \leq \|f\|_{L^2} \|v\|_{H^1}$.


To apply the Lax–Milgram theorem, a coercivity estimate must also be established. It can be obtained fairly directly whenever $\lambda > 0$, since $a(u,u) \geq \lambda \|u\|_{L^2}^2 + \kappa \|\nabla u\|_{L^2}^2 \geq \min(\lambda, \kappa) \|u\|_{H^1}^2$. When $\lambda = 0$, we only have $a(u,u) \geq \kappa \|\nabla u\|_{L^2}^2$. We must therefore show an analog of lemma 2.2, which only applies to the space $H_0^1(\Omega)$.

LEMMA 2.3 (Poincaré lemma (general case)).– We assume that $\Omega \subset \mathbb{R}^D$ is an open set, bounded in at least one direction. Then, there exists a $C > 0$ such that for all $u \in H_0^1(\Omega)$, we have
$$\|u\|_{L^2} \leq C \|\nabla u\|_{L^2}.$$

PROOF.– We demonstrate this relation for regular functions and conclude by density. Let $\varphi \in C_c^\infty(\Omega)$. We assume that $\Omega$ is bounded in the direction $\omega \in \mathbb{S}^{D-1}$: there exists $R > 0$ such that for all $y \in \Omega$, $|y \cdot \omega| \leq R$. We write
$$\varphi(x) = \int_{-\infty}^0 \frac{d}{ds} \varphi(x + s\omega)\, ds = \int_{-\infty}^0 \nabla\varphi(x + s\omega) \cdot \omega\, ds,$$
which leads to
$$|\varphi(x)| \leq \int_{-\infty}^{\infty} \left| \nabla\varphi(x + s\omega) \right| ds,$$
where the integration domain can, in fact, be restricted to $|s| \leq |x \cdot \omega| + |s + x \cdot \omega| \leq 2R$. Using the Cauchy–Schwarz inequality, it follows that
$$\int_\Omega |\varphi(x)|^2\, dx \leq \int_\Omega \left( \int_{-2R}^{2R} ds \right) \times \left( \int_{-2R}^{2R} |\nabla\varphi(x + s\omega)|^2\, ds \right) dx \leq 4R \int_{-2R}^{2R} \left( \int_\Omega |\nabla\varphi(x + s\omega)|^2\, dx \right) ds = 16 R^2\, \|\nabla\varphi\|_{L^2}^2. \qquad \square$$

Finally, we obtain the following general result.

THEOREM 2.5.– Let $\Omega \subset \mathbb{R}^D$ be an open set bounded in at least one direction. Let $\lambda \geq 0$ (respectively $\lambda > 0$) and $f \in L^2(\Omega)$. Then, problem [2.1] has a unique solution $u \in H_0^1(\Omega)$ (respectively $u \in H^1(\Omega)$), in the sense that for all $v \in H_0^1(\Omega)$ (respectively $v \in H^1(\Omega)$), we have $a(u,v) = \ell(v)$. Moreover, there exists $C > 0$, independent of the data $f$, such that $\|u\|_{H^1} \leq C \|f\|_{L^2}$. If $0 \leq f(x) \leq M$ for almost all $x \in \Omega$, then $0 \leq u(x) \leq M/\lambda$.
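In dimension two, the statement can be illustrated numerically with the classical five-point finite difference scheme on the unit square (a sketch of ours: $\lambda = 0$, $k \equiv 1$, and the right-hand side is manufactured so that the exact solution is known):

```python
import numpy as np

def poisson_2d(n=32):
    """Five-point finite differences for -Laplace(u) = f on ]0,1[^2 with
    u = 0 on the boundary; f = 2*pi^2*sin(pi*x)*sin(pi*y), whose exact
    solution is sin(pi*x)*sin(pi*y). Returns the maximum nodal error."""
    h = 1.0 / n
    x = np.arange(1, n) * h                          # interior nodes
    X, Y = np.meshgrid(x, x, indexing="ij")
    f = 2 * np.pi**2 * np.sin(np.pi * X) * np.sin(np.pi * Y)
    m = n - 1
    # 1D second-difference matrix; the 2D Laplacian is its Kronecker sum
    T = (2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    A = np.kron(T, np.eye(m)) + np.kron(np.eye(m), T)
    u = np.linalg.solve(A, f.ravel()).reshape(m, m)
    return float(np.max(np.abs(u - np.sin(np.pi * X) * np.sin(np.pi * Y))))
```

The observed error decreases like $h^2$, consistent with the consistency analysis developed for the one-dimensional scheme in section 2.2.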



166

Mathematics for Modeling and Scientific Computing

The estimates on the solution can be shown as in dimension one, by choosing an appropriate test function. The noticeable point is a regularization effect: for data $f$ that is only $L^2$, the solution and its derivatives are $L^2$ functions. We know that when $\Omega$ is bounded, the embedding $H^1(\Omega) \subset L^2(\Omega)$ is compact (see [GOU 11, corollary 7.58]). Then, the mapping $f \mapsto u$, which associates with data $f$ the solution $u \in H^1(\Omega)$ of [2.1], is a compact operator (see definition 2.1), which implies remarkable spectral properties (see [BRÉ 05, Chapter 6]).

PROPOSITION 2.2.– Let $\Omega$ be a regular bounded open set of $\mathbb{R}^D$. Then, there exists a sequence $(\lambda_k)_{k\in\mathbb{N}}$ of positive real numbers and a sequence $(\psi_k)_{k\in\mathbb{N}}$ of elements of $H^1_0(\Omega)$ such that
$$-\Delta\psi_k = \lambda_k\psi_k, \qquad \lim_{k\to\infty}\lambda_k = +\infty,$$
and the set $\{\psi_k,\ k \in \mathbb{N}\}$ forms an orthonormal basis for $H^1_0(\Omega)$.

NOTE 2.1.– Whenever $\Omega = ]0,1[$, solving $-\psi'' = \lambda_n\psi$, $\psi(0) = 0 = \psi(1)$, leads to $\lambda_n = \pi^2 n^2$, $n \in \mathbb{N}\setminus\{0\}$, and $\psi_n(x) = \sin(\pi n x)$. In the periodic case, we find instead $\lambda_n = 4\pi^2 n^2$, $\psi_n(x) = e^{2i\pi n x}$.

2.2. Finite difference approximations to elliptic equations

2.2.1. Finite difference discretization principles

We will focus only on the one-dimensional case and thus consider problem [2.2], where we assume that

h1) $\lambda \ge 0$;

h2) $x \mapsto f(x) \in C^0([0,1])$ is given;

h3) $x \mapsto k(x) \in C^0([0,1])$ is given and there exist $\kappa, K > 0$ such that for all $x \in [0,1]$, we have $0 < \kappa \le k(x) \le K$.

The finite difference method is constructed as follows:

a) Consider a subdivision of the segment $[0,1]$, whose discretization points are denoted $x_j$:
$$0 = x_0 < x_1 < \dots < x_j < x_{j+1} < \dots < x_{J+1} = 1.$$
We write $h_{j+1/2} = x_{j+1} - x_j$


and
$$h = \max\big\{h_{j+1/2},\ j \in \{0,\dots,J\}\big\},$$
which measures how "fine-grained" the discretization is. Note that $J$ and $h$ are related; in the case of a uniform grid, the space step is constant, $h_{j+1/2} = h$ for all $j \in \{0,\dots,J\}$, and we simply have $h = \frac{1}{J+1}$.

b) The numerical unknown is a vector $U = (u_1,\dots,u_J) \in \mathbb{R}^J$ whose components are interpreted as approximations of the unknown $u$ of the continuous problem [2.2] at the discretization points $x_1,\dots,x_J$.

c) The scheme is constructed by writing a linear system $\mathcal{A}U = F$, where $F = (f(x_1),\dots,f(x_J)) \in \mathbb{R}^J$ is given. (Note that $F$ is well defined since we assume that $f$ is continuous.) The coefficients $a_{i,j}$ of the matrix $\mathcal{A}$ are defined precisely so that, writing $\bar U = (u(x_1),\dots,u(x_J))$, where $u$ is the solution of the continuous problem [2.2], the consistency error
$$E_i = (\mathcal{A}\bar U)_i - F_i \tag{2.14}$$
tends to 0 when $h \to 0$.

There are several constructions that fulfill criterion c). In order to limit the complexity of the linear system that needs to be solved, we will construct a sparse matrix $\mathcal{A}$, that is to say, one with many coefficients equal to zero. It is therefore natural to approximate $\lambda u(x_i)$ by only employing the value of $u$ at the point $x_i$, and to approximate the derivatives by only using the neighboring discretization points $x_{i\pm1}$. We thus decompose $\mathcal{A} = \lambda I + A$. Denoting the coefficients of the matrix $A$ by $\alpha_{i,j}$, we assume that
$$\alpha_{i,j} = 0 \quad \text{if } |i-j| > 1.$$
It remains to establish that the consistency error [2.14] tends to 0 when $h \to 0$, that is, with $x \mapsto u(x)$ the solution to [2.2],
$$\lambda u(x_i) - \big(\alpha_{i,i-1}u(x_{i-1}) + \alpha_{i,i}u(x_i) + \alpha_{i,i+1}u(x_{i+1})\big) - f(x_i) \xrightarrow[h\to0]{} 0. \tag{2.15}$$

Let us begin by studying the simplest case, where $x \mapsto k(x) = k > 0$ is constant. Then, $\big(\alpha_{i,i-1}u(x_{i-1}) + \alpha_{i,i}u(x_i) + \alpha_{i,i+1}u(x_{i+1})\big)$ should approach $ku''(x_i)$. We obtain
$$\begin{aligned}
\alpha_{i,i-1}u(x_{i-1}) + \alpha_{i,i}u(x_i) + \alpha_{i,i+1}u(x_{i+1})
&= \big(\alpha_{i,i-1} + \alpha_{i,i} + \alpha_{i,i+1}\big)u(x_i)\\
&\quad + \big(\alpha_{i,i-1}(x_{i-1}-x_i) + \alpha_{i,i+1}(x_{i+1}-x_i)\big)u'(x_i)\\
&\quad + \Big(\alpha_{i,i-1}\frac{(x_i-x_{i-1})^2}{2} + \alpha_{i,i+1}\frac{(x_{i+1}-x_i)^2}{2}\Big)u''(x_i) + O(h^3).
\end{aligned}$$
This leads to the following linear system to identify the coefficients $\alpha_{i,j}$:
$$\alpha_{i,i-1} + \alpha_{i,i} + \alpha_{i,i+1} = 0,$$
$$\alpha_{i,i-1}(x_{i-1}-x_i) + \alpha_{i,i+1}(x_{i+1}-x_i) = 0,$$
$$\alpha_{i,i-1}\frac{(x_i-x_{i-1})^2}{2} + \alpha_{i,i+1}\frac{(x_{i+1}-x_i)^2}{2} = k.$$
With $h_{i+1/2} = x_{i+1} - x_i$, it follows that
$$\alpha_{i,i-1} = \alpha_{i,i+1}\frac{h_{i+1/2}}{h_{i-1/2}} \qquad\text{and}\qquad \alpha_{i,i+1}\big(h_{i+1/2}^2 + h_{i+1/2}h_{i-1/2}\big) = 2k.$$
Finally, we obtain
$$\alpha_{i,i-1} = \frac{2k}{h_{i-1/2}}\,\frac{1}{h_{i-1/2}+h_{i+1/2}}, \qquad
\alpha_{i,i+1} = \frac{2k}{h_{i+1/2}}\,\frac{1}{h_{i-1/2}+h_{i+1/2}}, \qquad
\alpha_{i,i} = -\frac{2k}{h_{i-1/2}+h_{i+1/2}}\Big(\frac{1}{h_{i-1/2}} + \frac{1}{h_{i+1/2}}\Big). \tag{2.16}$$

In the case where the grid is uniform with step $h$, the numerical unknowns are determined by the solution of the linear system
$$\left(\lambda I - \frac{k}{h^2}\begin{pmatrix}
-2 & 1 & & & 0\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
0 & & & 1 & -2
\end{pmatrix}\right)U = F. \tag{2.17}$$
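For illustration, system [2.17] is easy to assemble and solve with dense linear algebra. The sketch below (our addition, not from the book; it assumes NumPy) takes $\lambda = 0$, $k = 1$, $f = 1$, for which the exact solution is $u(x) = x(1-x)/2$; since the consistency error vanishes on second-degree polynomials, the three-point scheme reproduces it up to round-off.

```python
import numpy as np

# Assemble and solve the uniform-grid system [2.17] for lambda = 0, k = 1,
# f = 1 on ]0,1[ with homogeneous Dirichlet conditions.  The exact solution
# u(x) = x(1-x)/2 is a degree-2 polynomial, so the scheme is exact here.
lam, k = 0.0, 1.0
J = 50
h = 1.0 / (J + 1)
x = np.linspace(h, 1.0 - h, J)          # interior points x_1, ..., x_J

# A = -k/h^2 * tridiag(1, -2, 1), so that (lam*I + A) U = F
A = -(k / h**2) * (np.diag(-2.0 * np.ones(J))
                   + np.diag(np.ones(J - 1), 1)
                   + np.diag(np.ones(J - 1), -1))
F = np.ones(J)                          # f(x_i) = 1

U = np.linalg.solve(lam * np.eye(J) + A, F)
exact = x * (1.0 - x) / (2.0 * k)
err = np.max(np.abs(U - exact))
print(err)
assert err < 1e-10
```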


We follow the same steps for a variable (but regular) coefficient $x \mapsto k(x)$. To this end, we expand $(ku')' = k'u' + ku''$. The discretization coefficients $\alpha_{i,j}$ are thus defined by the linear system
$$\alpha_{i,i-1} + \alpha_{i,i} + \alpha_{i,i+1} = 0 \quad \text{(coefficient of } u(x_i)\text{)},$$
$$-\alpha_{i,i-1}h_{i-1/2} + \alpha_{i,i+1}h_{i+1/2} = k'(x_i) \quad \text{(coefficient of } u'(x_i)\text{)},$$
$$\alpha_{i,i-1}\frac{h_{i-1/2}^2}{2} + \alpha_{i,i+1}\frac{h_{i+1/2}^2}{2} = k(x_i) \quad \text{(coefficient of } u''(x_i)\text{)}.$$
It follows that
$$\alpha_{i,i-1} = \frac{\alpha_{i,i+1}h_{i+1/2} - k'(x_i)}{h_{i-1/2}}, \qquad
\alpha_{i,i+1}\Big(\frac{h_{i+1/2}^2}{2} + \frac{h_{i-1/2}h_{i+1/2}}{2}\Big) = k(x_i) + k'(x_i)\frac{h_{i-1/2}}{2},$$
that is to say
$$\alpha_{i,i+1} = \frac{1}{h_{i+1/2}+h_{i-1/2}}\Big(\frac{2k(x_i)}{h_{i+1/2}} + k'(x_i)\frac{h_{i-1/2}}{h_{i+1/2}}\Big), \qquad
\alpha_{i,i-1} = \frac{1}{h_{i+1/2}+h_{i-1/2}}\Big(\frac{2k(x_i)}{h_{i-1/2}} - k'(x_i)\frac{h_{i+1/2}}{h_{i-1/2}}\Big),$$
$$\alpha_{i,i} = -\alpha_{i,i-1} - \alpha_{i,i+1}
= -\frac{1}{h_{i+1/2}+h_{i-1/2}}\Big[2k(x_i)\Big(\frac{1}{h_{i+1/2}} + \frac{1}{h_{i-1/2}}\Big) + k'(x_i)\Big(\frac{h_{i-1/2}}{h_{i+1/2}} - \frac{h_{i+1/2}}{h_{i-1/2}}\Big)\Big].$$
For a uniform grid with step $h$, this becomes
$$\alpha_{i,i} = -\frac{2k(x_i)}{h^2}, \qquad \alpha_{i,i+1} = \frac{k(x_i)}{h^2} + \frac{k'(x_i)}{2h}, \qquad \alpha_{i,i-1} = \frac{k(x_i)}{h^2} - \frac{k'(x_i)}{2h},$$
and the scheme is expressed as
$$\lambda u_i - \frac{k(x_i)}{h^2}\big(u_{i-1} - 2u_i + u_{i+1}\big) - \frac{k'(x_i)}{2h}\big(u_{i+1} - u_{i-1}\big) = f(x_i). \tag{2.18}$$
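A minimal sketch of scheme [2.18] (our addition, not from the book; the coefficient $k(x) = 1+x$ and the manufactured solution $u(x) = \sin(\pi x)$ are arbitrary choices) that also checks the expected second-order behavior when the grid is refined:

```python
import numpy as np

# Scheme [2.18] for a variable coefficient, tested on a manufactured solution:
# k(x) = 1 + x, u(x) = sin(pi x), lambda = 1, and f = lam*u - k'u' - k u''.
lam = 1.0
k = lambda x: 1.0 + x
kp = lambda x: np.ones_like(x)
u_ex = lambda x: np.sin(np.pi * x)
up = lambda x: np.pi * np.cos(np.pi * x)
upp = lambda x: -np.pi**2 * np.sin(np.pi * x)
f = lambda x: lam * u_ex(x) - kp(x) * up(x) - k(x) * upp(x)

def solve(J):
    h = 1.0 / (J + 1)
    x = np.linspace(h, 1.0 - h, J)
    main = lam + 2.0 * k(x) / h**2
    upper = -(k(x[:-1]) / h**2 + kp(x[:-1]) / (2.0 * h))  # coupling to u_{i+1}
    lower = -(k(x[1:]) / h**2 - kp(x[1:]) / (2.0 * h))    # coupling to u_{i-1}
    A = np.diag(main) + np.diag(upper, 1) + np.diag(lower, -1)
    return x, np.linalg.solve(A, f(x))

errs = []
for J in (40, 80):
    x, U = solve(J)
    errs.append(np.max(np.abs(U - u_ex(x))))
print(errs)
# halving h should divide the error by about 4 (second-order consistency)
assert errs[1] < errs[0] / 3.0
assert errs[0] < 1e-2
```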


Note that when the coefficient $k$ or the grid step is variable, the matrix associated with the discrete problem is not symmetric⁶. This is a shortcoming in two respects:

– on the one hand, the discrete equation comes from a continuous problem that has symmetry properties (the bilinear form $a$ of the Lax–Milgram theorem is symmetric for the case studied here); the discretization has therefore brought about a loss of structure;

– on the other hand, we have specific numerical methods for solving symmetric systems of linear equations more efficiently (conjugate gradient method, see section A1.3).

We propose a different discretization for variable coefficients on a uniform grid with step $h > 0$. The idea is to approximate the derivative of a given regular function $\varphi$ by a centered differential quotient, inspired by the relation
$$\lim_{h\to0}\frac{\varphi(x+h/2) - \varphi(x-h/2)}{h} = \varphi'(x).$$
More specifically, we have the estimate
$$\Big|\frac{\varphi(x+h/2) - \varphi(x-h/2)}{h} - \varphi'(x)\Big| \le \|\varphi'''\|_{L^\infty}\,h^2,$$
which is better than the first-order estimate obtained with the one-sided quotients $\frac{\varphi(x+h)-\varphi(x)}{h}$ or $\frac{\varphi(x)-\varphi(x-h)}{h}$. We use this idea to approximate $(ku')'$ by using the values of $ku'$ at the points $x \pm h/2$, and then approximate $u'(x \pm h/2)$ by the same principle with the values of $u$ at the points $x_{i-1}, x_i, x_{i+1}$. The scheme obtained along this line of thought takes the form of the following linear system:
$$\lambda u_i - \frac{1}{h}\Big(k(x_{i+1/2})\frac{u_{i+1}-u_i}{h} - k(x_{i-1/2})\frac{u_i-u_{i-1}}{h}\Big) = f(x_i)$$
for $i \in \{1,\dots,J\}$, with the convention $u_0 = 0 = u_{J+1}$. Assuming that $k$ is a regular function, this approximation is consistent with order $O(h^2)$. We will see that this rationale becomes similar to finite volume techniques, by giving a key role to the interfaces $x_{i\pm1/2}$. An advantage of this discretization is that it preserves the problem's symmetry. However, we will also see (in section 2.3.1) that on a non-uniform grid this scheme is no longer consistent, even though it is possible to show that it is convergent!

⁶ For a constant coefficient $k$, this remark should be qualified: the matrix $A$ is in fact symmetric for the scalar product $\langle u|v\rangle = \sum_{i=1}^J u_i v_i\,\frac{h_{i+1/2}+h_{i-1/2}}{2}$. We can verify that, indeed, $\langle Au|v\rangle = \langle u|Av\rangle$, whereas $A$ is not symmetric for the usual scalar product on $\mathbb{R}^J$.

NOTE 2.2 (Non-homogeneous conditions).– It is interesting to write an "extended" form of problem [2.17] by dealing with a discrete unknown $\bar U$ of size $J+2$,


which includes the boundary terms $\bar U_0$ and $\bar U_{J+1}$. In that case, we let $\bar I$ denote the identity matrix of size $(J+2)\times(J+2)$, and write
$$\left(\lambda\bar I - \frac{k}{h^2}\begin{pmatrix}
1 & 0 & & & 0\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
0 & & & 0 & 1
\end{pmatrix}\right)\bar U = \bar F.$$
The extended second member $\bar F$ is constructed by taking into account the boundary conditions: $\bar F_0 = 0$, $\bar F_{J+1} = 0$ and $\bar F_j = F_j = f(x_j)$ for $j \in \{1,\dots,J\}$. We can easily take into account non-homogeneous Dirichlet data by modifying the extreme entries of the second member $\bar F$. (Precisely, for the Dirichlet conditions $u(0) = \alpha$ and $u(1) = \beta$, we write $\bar F_0 = \big(\lambda - \frac{k}{h^2}\big)\alpha$, $\bar F_{J+1} = \big(\lambda - \frac{k}{h^2}\big)\beta$.) Note that the resulting matrix is no longer symmetric. One way of restoring symmetry is to introduce a penalty term
$$\left(\lambda\bar I - \frac{k}{h^2}\begin{pmatrix}
1/\varepsilon & 1 & & & 0\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
0 & & & 1 & 1/\varepsilon
\end{pmatrix}\right)\bar U = \bar F^\varepsilon,$$
where we also modify the extreme entries of the second member, $\bar F^\varepsilon_0 = \frac{1}{\varepsilon}\bar F_0$, $\bar F^\varepsilon_{J+1} = \frac{1}{\varepsilon}\bar F_{J+1}$, for a parameter $0 < \varepsilon \ll 1$.

NOTE 2.3 (Neumann boundary conditions).– When we use Neumann boundary conditions, the matrix $\mathcal{A}$ must be modified. For Neumann conditions and a constant coefficient $k$, with a constant grid step $h$, we obtain
$$A_{\mathrm{Neumann}} = -\frac{k}{h^2}\begin{pmatrix}
-1 & 1 & & & 0\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
0 & & & 1 & -1
\end{pmatrix}.$$
Indeed, the condition $u'(0) = 0 = u'(1)$ becomes $u_1 - u_0 = 0 = u_{J+1} - u_J$ at the discrete level. This formula takes its cue from the fact that $\frac{u(x+h)-u(x)}{h} - u'(x) = O(h)$: it provides a consistent first-order approximation of $u'$. We can improve the scheme by using an approximation, based on the equation satisfied by $u$, which is second-order consistent. Indeed, we have seen that the construction of the scheme relies, within the computational domain, on an approximation of the equation that is second-order consistent. A "naïve" treatment of the Neumann condition can therefore degrade the global quality of the approximation. We use the expansion
$$u(x+h) = u(x) + u'(x)\,h + \underbrace{u''(x)}_{=\,\frac{1}{k}(\lambda u(x) - f(x)) \text{ by } [2.2]}\frac{h^2}{2} + h^2\varepsilon(x,h), \qquad \lim_{h\to0}\varepsilon(x,h) = 0,$$
which, using $u'(0) = 0 = u'(1)$, defines $u_0$ and $u_{J+1}$ by the following relations:
$$u_1 = u_0 + \frac{h^2}{2k}\big(\lambda u_0 - f(0)\big), \qquad u_J = u_{J+1} + \frac{h^2}{2k}\big(\lambda u_{J+1} - f(1)\big).$$

NOTE 2.4 (Periodic conditions).– For periodic conditions, we must also modify the discretization matrix in order to take into account the behavior imposed at the boundaries. We obtain
$$A_{\mathrm{Perio}} = -\frac{k}{h^2}\begin{pmatrix}
-2 & 1 & & & 1\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
1 & & & 1 & -2
\end{pmatrix}$$
because, for the discrete problem, the periodicity condition requires that $u_0 = u_J$ and $u_{J+1} = u_1$.

2.2.2. Analysis of the discrete problem

We now analyze the discrete problem [2.17], which can be written in the form $\mathcal{A}U = F$, with $\mathcal{A} = \lambda I + A$ and

$$A = -\frac{k}{h^2}\begin{pmatrix}
-2 & 1 & & & 0\\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1\\
0 & & & 1 & -2
\end{pmatrix}. \tag{2.19}$$

This equation therefore corresponds to the finite difference discretization of the model problem [2.2], for a constant coefficient $k$ and a uniform grid of step $h$. Of course, the first question to ask is whether the linear system has a solution.

PROPOSITION 2.3.– When $\lambda > 0$, the matrix $\mathcal{A}$ is strictly diagonally dominant, so it is invertible.

PROOF.– When $\lambda > 0$, we find that, indeed, $\mathcal{A}_{i,i} > \sum_{j\ne i}|\mathcal{A}_{i,j}|$ for all $i \in \{1,\dots,J\}$. Suppose there exists a vector $x \ne 0$ such that $\mathcal{A}x = 0$. At the possible cost of renormalizing $x$, we may assume that $x_i = \max\{|x_j|,\ j \in \{1,\dots,J\}\} = 1$. We then have
$$\mathcal{A}_{i,i}x_i = -\sum_{j\ne i}\mathcal{A}_{i,j}x_j \qquad\text{with}\qquad \mathcal{A}_{i,i}x_i = \mathcal{A}_{i,i} > \sum_{j\ne i}|\mathcal{A}_{i,j}|,$$
which contradicts the fact that
$$\Big|-\sum_{j\ne i}\mathcal{A}_{i,j}x_j\Big| \le \sum_{j\ne i}|\mathcal{A}_{i,j}|\,|x_j| \le \sum_{j\ne i}|\mathcal{A}_{i,j}|.$$
The matrix $\mathcal{A}$ is therefore invertible. $\square$

For $\lambda = 0$, the matrix $\mathcal{A} = A$ does not have a strictly dominant diagonal: the diagonal coefficients are positive and the off-diagonal coefficients are negative or zero, but we can have $A_{i,i} + \sum_{j\ne i}A_{i,j} = 0$. However, the matrix $A$ is still invertible.


PROPOSITION 2.4.– When $\lambda = 0$, the matrix $\mathcal{A} = A$ is invertible.

PROOF.– The argument is inspired by the derivation of an a priori estimate for the continuous problem (see theorem 2.1). With the convention $u_0 = u_{J+1} = 0$, we compute
$$AU\cdot U = \frac{k}{h^2}\sum_{i=1}^J\big(2u_i^2 - u_{i+1}u_i - u_{i-1}u_i\big)
= \frac{k}{h^2}\sum_{i=1}^J(u_i - u_{i+1})u_i + \frac{k}{h^2}\sum_{i=1}^J(u_i - u_{i-1})u_i
= \frac{k}{h^2}\sum_{i=0}^J(u_i - u_{i+1})^2 \ge 0.$$
Moreover, if $AU\cdot U = 0$, then for all $i \in \{0,\dots,J\}$, we have $u_{i+1} = u_i$. Now, the boundary conditions (which we have used in the foregoing manipulations) require that $u_0 = 0 = u_{J+1}$, which finally implies that $U = 0$. This justifies the fact that $A$ is invertible. $\square$

In particular, this analysis shows that the matrix $A$ is symmetric positive definite (with a constant space step). This observation is important in practice, because it allows us to resort to better-performing methods for solving systems of linear equations (Cholesky or conjugate gradient algorithms, see [LAS 04, Chapter 4] and section A1.3).

NOTE 2.5 (Neumann boundary conditions and periodic conditions).– Matrices that correspond to Neumann or periodic conditions (see notes 2.3 and 2.4) are not invertible. The matrices $A_{\mathrm{Neumann}}$ and $A_{\mathrm{Perio}}$ have a non-trivial kernel, spanned by the vector $(1,\dots,1)$. This is consistent with the analysis of the continuous problem, for which the constants are solutions of the homogeneous equation. In practice, we seek solutions whose average is zero (or that have a fixed value at a given point). This leads us to replace one of the original matrix's rows with a different expression, which corresponds to the constraint and makes the matrix invertible again. In general, this manipulation destroys the symmetry, which makes it impossible to resort to the specific and efficient methods for solving symmetric linear systems. In section 2.8, we will study an alternative.

Studying the continuous problem has revealed that for positive data $f$, the solution $u$ is positive (maximum principle). It is important to make sure that the discrete problem preserves this property. Such a result is related to the structure of the matrix $\mathcal{A}$.
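These structural facts are easy to verify numerically. The sketch below (our addition, not from the book) checks the positive definiteness of the Dirichlet matrix of [2.19] via a Cholesky factorization, and the one-dimensional kernel of the Neumann matrix described in note 2.5.

```python
import numpy as np

# For a uniform grid and constant k: the Dirichlet matrix A of [2.19] is
# symmetric positive definite, while the Neumann variant has the constant
# vector in its kernel.
J, k = 30, 2.0
h = 1.0 / (J + 1)
T = (np.diag(-2.0 * np.ones(J)) + np.diag(np.ones(J - 1), 1)
     + np.diag(np.ones(J - 1), -1))
A = -(k / h**2) * T

# symmetric positive definite: Cholesky succeeds, eigenvalues are > 0
np.linalg.cholesky(A)
min_eig = np.min(np.linalg.eigvalsh(A))
assert min_eig > 0.0

# Neumann variant: corner diagonal entries -1 instead of -2
TN = T.copy()
TN[0, 0] = TN[-1, -1] = -1.0
AN = -(k / h**2) * TN
kernel_residual = np.max(np.abs(AN @ np.ones(J)))
print(min_eig, kernel_residual)
assert kernel_residual < 1e-8           # constants span the kernel
```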


DEFINITION 2.2.– We say that a matrix $M \in \mathcal{M}_J(\mathbb{R})$ is positive⁷ if its coefficients $m_{i,j}$ are positive or equal to zero.

DEFINITION 2.3.– We say that a matrix $A \in \mathcal{M}_J(\mathbb{R})$ is monotone if $A$ is invertible and the matrix $A^{-1}$ is positive. We also say that $A$ is an $M$-matrix.

THEOREM 2.6.– A matrix $A \in \mathcal{M}_J(\mathbb{R})$ satisfies

$(\star)$ if the components of $Ax$ are $\ge 0$, then the components of $x$ are also $\ge 0$,

if and only if $A$ is monotone.

PROOF.– Assume that condition $(\star)$ is satisfied. Let $x \in \mathbb{R}^J$ be such that $Ax = 0$. Then, the components of $Ax$ are both positive and negative, and $(\star)$ therefore implies that the components of $x$ and those of $(-x)$ are positive or zero. It follows that, in fact, $x = 0$; therefore, $A$ is invertible. Suppose next that $Ax = y$ has positive components. Then, $x = A^{-1}y$ has positive components by $(\star)$. In particular, by taking $y = e_i$, the elements of the canonical basis of $\mathbb{R}^J$, the vector $A^{-1}e_i$, which corresponds to the $i$th column of $A^{-1}$, has positive components. We conclude that $A^{-1}$ is positive. Conversely, if $A$ is monotone and $Ax = y$ has positive components, then $x = A^{-1}y$ also has positive components, because the coefficients of $A^{-1}$ are positive. $\square$

THEOREM 2.7.– For all $\lambda \ge 0$, the matrix $\mathcal{A}$ is monotone.

PROOF.– Suppose that $\mathcal{A}u = F$ has positive components and that there exists $i \in \{1,\dots,J\}$ such that $u_i = \min\{u_j,\ j \in \{0,\dots,J+1\}\}$, where we recall that $u_0 = u_{J+1} = 0$ (note that under this assumption, the minimum is attained at an interior point). Then, we have
$$\lambda u_i + \underbrace{\frac{k}{h^2}(u_i - u_{i-1})}_{\le 0} + \underbrace{\frac{k}{h^2}(u_i - u_{i+1})}_{\le 0} \ge 0.$$
If $\lambda > 0$, this directly implies that $u_i \ge 0$. If $\lambda = 0$, we can infer from this relation that $u_i = u_{i+1} = u_{i-1}$. Repeating the same reasoning, we conclude that $u$ is a constant vector with $u_i = u_0 = u_{J+1} = 0$, that is, $u = 0$. However, this would imply that $\mathcal{A}u = F = 0$. It follows that if $F$ has positive components with $F \ne 0$, then the vector $u \in \mathbb{R}^J$ solving $\mathcal{A}u = F$ has only strictly positive components. $\square$

LEMMA 2.4.– For a matrix $M \in \mathcal{M}_J(\mathbb{R})$, we let $\|M\|_\infty = \sup\{|Mx|_\infty,\ x \in \mathbb{R}^J,\ |x|_\infty = 1\}$, with $|\cdot|_\infty$ denoting the infinity norm on $\mathbb{R}^J$: $|x|_\infty = \sup\{|x_i|,\ i \in \{1,\dots,J\}\}$. We have $\|A^{-1}\|_\infty \le \frac{1}{8k}$.

⁷ It is important that this definition not be confused with the positivity of the quadratic form associated with a symmetric matrix.


PROOF.– We begin by noting that $\|A^{-1}\|_\infty = |A^{-1}e|_\infty$, where $e = (1,\dots,1)$. This is because $A$ is monotone. For $x, y \in \mathbb{R}^J$, we denote by $|x|$ the vector of components $|x_i|$, and write $x \ge y$ if for all $i \in \{1,\dots,J\}$ we have $x_i \ge y_i$. Then, for every vector $x$ such that $|x|_\infty = 1$, since $A$ is monotone, we have $A^{-1}e \ge A^{-1}|x| \ge |A^{-1}x|$. Now, $Az = e$ corresponds to the discretization of the equation
$$-kU''(x) = 1 \quad \text{for } x \in\,]0,1[, \qquad U(0) = 0 = U(1).$$
The solution of this equation is known explicitly: $U(x) = \frac{x(1-x)}{2k}$. Since it is a second-order polynomial, the consistency error is zero: we have exactly
$$-k\,\frac{U(x_{i+1}) - 2U(x_i) + U(x_{i-1})}{h^2} = 1.$$
In other words, the solution of $Az = e$ is given by $z_i = U(x_i) = \frac{ih(1-ih)}{2k}$. It follows that
$$\|A^{-1}\|_\infty = |A^{-1}e|_\infty = |z|_\infty \le \sup_{x\in[0,1]}\frac{x(1-x)}{2k} = \frac{1}{8k}. \qquad\square$$
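The bound of lemma 2.4 can be observed numerically; the sketch below (our addition, not from the book; the grid sizes are arbitrary choices) computes the induced infinity norm of $A^{-1}$ directly.

```python
import numpy as np

# For the Dirichlet matrix A = -k/h^2 tridiag(1,-2,1), the norm
# ||A^{-1}||_inf stays below 1/(8k) uniformly in the grid size.
k = 3.0
for J in (10, 50, 200):
    h = 1.0 / (J + 1)
    A = -(k / h**2) * (np.diag(-2.0 * np.ones(J))
                       + np.diag(np.ones(J - 1), 1)
                       + np.diag(np.ones(J - 1), -1))
    Ainv = np.linalg.inv(A)
    # induced infinity norm = maximal row sum of absolute values
    norm_inf = np.max(np.sum(np.abs(Ainv), axis=1))
    print(J, norm_inf, 1.0 / (8.0 * k))
    assert norm_inf <= 1.0 / (8.0 * k) + 1e-12
```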



We can interpret this estimate as a stability property. Combining the stability and consistency properties of the finite difference scheme, we can therefore obtain the convergence of discrete solutions towards the solution of the continuous problem.

THEOREM 2.8.– Let $x \mapsto u(x)$ be the solution of problem [2.2], with $\lambda = 0$ and constant $k > 0$. For all $J \in \mathbb{N}\setminus\{0\}$, we associate with it the vector $\bar u \in \mathbb{R}^J$ whose components are $u(x_1),\dots,u(x_J)$, where $x_i = ih$, $h = 1/(J+1)$. We let $u_h \in \mathbb{R}^J$ denote the solution of the linear system $Au_h = F$, where $F = (f(x_1),\dots,f(x_J))$. Then, we have
$$|u_h - \bar u|_\infty \le \frac{\|u^{(4)}\|_\infty}{96}\,h^2.$$

PROOF.– Let $R$ denote the vector of consistency errors
$$R_i = -k\,\frac{u(x_{i+1}) - 2u(x_i) + u(x_{i-1})}{h^2} - f(x_i) = (A\bar u - F)_i.$$
Then, we have $A(u_h - \bar u) = F - A\bar u = -R$. Therefore,
$$|u_h - \bar u|_\infty \le \|A^{-1}\|_\infty\,|R|_\infty \le \frac{1}{8k}\,|R|_\infty.$$
The consistency error can be estimated by using the Taylor expansion of $u$, which gives $|R|_\infty \le k\,h^2\,\|u^{(4)}\|_\infty/12$, whence the result. Note that this estimate requires higher-order derivatives of the solution $u$: it assumes that the solution is regular, beyond the mere $H^1$ regularity ensured by the functional arguments discussed above. $\square$

We provide an analogous result for the problem with $\lambda > 0$.

THEOREM 2.9.– Let $\lambda > 0$ and let $k > 0$ be constant. The matrix $\mathcal{A} = \lambda I + A$ satisfies $\|\mathcal{A}^{-1}\|_\infty \le \frac{1}{8k}$. Let $x \mapsto u(x)$ be the solution of problem [2.2]. For all $J \in \mathbb{N}\setminus\{0\}$, we associate with it the vector $\bar u \in \mathbb{R}^J$ whose components are $u(x_1),\dots,u(x_J)$, where $x_i = ih$, $h = 1/(J+1)$. Let $u_h \in \mathbb{R}^J$ denote the solution of the linear system $\mathcal{A}u_h = F$, where $F = (f(x_1),\dots,f(x_J))$. Then, we have
$$|u_h - \bar u|_\infty \le \frac{\|u^{(4)}\|_\infty}{96}\,h^2.$$

PROOF.– Surprisingly, the stability estimate for $\mathcal{A} = \lambda I + A$ is a corollary of the case $\lambda = 0$. The trick is to write
$$A^{-1} - \mathcal{A}^{-1} = A^{-1}\mathcal{A}\mathcal{A}^{-1} - A^{-1}A\mathcal{A}^{-1} = A^{-1}(\mathcal{A} - A)\mathcal{A}^{-1} = \lambda A^{-1}\mathcal{A}^{-1}.$$
Now, $A^{-1}$ and $\mathcal{A}^{-1}$ are both matrices with positive coefficients, as a result of theorem 2.7. It follows that all the coefficients of $A^{-1} - \mathcal{A}^{-1}$ are also positive and, therefore, with the notation of lemma 2.4, that $A^{-1}e \ge \mathcal{A}^{-1}e$. We conclude that
$$\|A^{-1}\|_\infty = |A^{-1}e|_\infty \ge |\mathcal{A}^{-1}e|_\infty = \|\mathcal{A}^{-1}\|_\infty.$$
Finally, it remains to analyze the consistency error and reproduce the arguments of the proof of theorem 2.8. $\square$

Let us finish with a few remarks on the spectral properties of the matrix $A$, in the case where $k$ is constant. Let us first note that the spectrum of the operator $-k\frac{d^2}{dx^2}$ with Dirichlet conditions on $[0,1]$ is $\{k\pi^2 n^2,\ n \in \mathbb{N}\setminus\{0\}\}$. Indeed, let $\psi$ be a solution of $-k\frac{d^2}{dx^2}\psi = \lambda\psi$. By integrating by parts, we note that $k\int_0^1|\psi'(x)|^2\,dx = \lambda\int_0^1|\psi(x)|^2\,dx \ge 0$, which implies that necessarily $\lambda \ge 0$. Now, the solutions of $-k\frac{d^2}{dx^2}\psi = \lambda\psi$, with $\lambda \ge 0$, are of the form $A\cos(\sqrt{\lambda/k}\,x) + B\sin(\sqrt{\lambda/k}\,x)$, with $A, B \in \mathbb{R}$. The Dirichlet conditions impose $A = 0$ and $\sqrt{\lambda/k} = \pi n$ for $n \in \mathbb{Z}$. We conclude that the eigenvalues are of the form $k\pi^2 n^2$, for $n \in \mathbb{N}\setminus\{0\}$, and the associated eigenspaces are spanned by the functions $\psi_n : x \mapsto \sin(\pi n x)$.


Let $J \in \mathbb{N}\setminus\{0\}$. For $n, j \in \mathbb{N}$, we write $\psi_{n,J}^j = \sin\big(\pi jn/(J+1)\big)$, which satisfies
$$\psi_{n,J}^0 = 0 = \psi_{n,J}^{J+1}, \qquad \psi_{n,J}^{j+\ell(J+1)} = (-1)^{\ell n}\,\psi_{n,J}^j \quad \text{for any } \ell \in \mathbb{N},$$
and, for $j \in \{1,\dots,J\}$, $\psi_{n,J}^j = \psi_n\big(j/(J+1)\big)$. We thus evaluate the matrix $A$ on the vector $\psi_{n,J} = (\psi_{n,J}^1,\dots,\psi_{n,J}^J)$. We have
$$k(J+1)^2\big(-\psi_{n,J}^{j+1} - \psi_{n,J}^{j-1} + 2\psi_{n,J}^j\big)
= k(J+1)^2\,\mathrm{Im}\Big(2e^{i\pi jn/(J+1)}\Big(1 - \cos\Big(\pi\frac{n}{J+1}\Big)\Big)\Big)
= 4k(J+1)^2\sin^2\Big(\frac{\pi n}{2(J+1)}\Big)\sin\Big(\frac{\pi jn}{J+1}\Big)
= 4k(J+1)^2\sin^2\Big(\frac{\pi n}{2(J+1)}\Big)\,\psi_{n,J}^j.$$
Therefore, the matrix $A \in \mathcal{M}_J$ has $J$ distinct eigenvalues
$$\lambda_1 = 4k(J+1)^2\sin^2\Big(\frac{\pi}{2(J+1)}\Big),\ \dots,\ \lambda_J = 4k(J+1)^2\sin^2\Big(\frac{J\pi}{2(J+1)}\Big),$$
associated with the eigenvectors $\psi_{n,J}$, $n \in \{1,\dots,J\}$. In particular, note that the conditioning of $A$ degrades as the discretization step tends to 0, because
$$\mathrm{cond}(A) = \lambda_J/\lambda_1 \underset{J\to\infty}{\sim} 4J^2/\pi^2.$$

In practice, poor conditioning is not necessarily all that harmful: we often use "regular" right-hand sides, in the sense that they are described by a small number $N$ of basis functions $\psi_n$ (that is to say, we have an estimate of the type $\|f - \sum_{n=0}^N c_f(n)\psi_n\|_{L^2} \le CN^{-r}$, with $c_f(n) = \int_0^1 f(x)\psi_n(x)\,dx$). The stability estimate is then far more favorable than the one involving the conditioning.

NOTE 2.6.– The construction of the approximation can be generalized easily to higher dimensions by arguing coordinate by coordinate. For example, in dimension 2, the equation
$$-\Delta u = -(\partial_x^2 + \partial_y^2)u = f, \qquad u\big|_{\partial\Omega} = 0,$$


can be approximated with the linear system
$$-\frac{1}{\Delta x^2}\big(u_{i+1,j} - 2u_{i,j} + u_{i-1,j}\big) - \frac{1}{\Delta y^2}\big(u_{i,j+1} - 2u_{i,j} + u_{i,j-1}\big) = f(i\Delta x, j\Delta y). \tag{2.20}$$
When the problem is set on the domain $[0,L]\times[0,M]$, with
$$(I+1)\Delta x = L \qquad\text{and}\qquad (J+1)\Delta y = M,$$
these equations are satisfied for $i \in \{1,\dots,I\}$, $j \in \{1,\dots,J\}$, with the Dirichlet conditions $u_{0,j} = u_{I+1,j} = 0 = u_{i,0} = u_{i,J+1}$. In practice, we construct a system of size $IJ\times IJ$ in the usual matrix form. The construction is as follows: given the quantities $v_{i,j}$, where $i \in \{1,\dots,I\}$, $j \in \{1,\dots,J\}$, we construct a vector $V \in \mathbb{R}^{IJ}$ by writing $V_{i+(j-1)I} = v_{i,j}$, which can be summarized in coordinates as
$$\vec V_j = \begin{pmatrix} v_{1,j}\\ v_{2,j}\\ \vdots\\ v_{I,j}\end{pmatrix} \in \mathbb{R}^I, \qquad
V = \begin{pmatrix} \vec V_1\\ \vec V_2\\ \vdots\\ \vec V_J\end{pmatrix} \in \mathbb{R}^{IJ}.$$
We write $K = i + (j-1)I$, which spans $\{1,\dots,IJ\}$ when $i$ and $j$ pass through $\{1,\dots,I\}$ and $\{1,\dots,J\}$, respectively. In this way, we define the second member $F$ using the values $f(i\Delta x, j\Delta y)$. Relation [2.20] now takes the generic form
$$-\frac{1}{\Delta x^2}U_{K-1} - \frac{1}{\Delta x^2}U_{K+1} + 2\Big(\frac{1}{\Delta x^2} + \frac{1}{\Delta y^2}\Big)U_K - \frac{1}{\Delta y^2}U_{K+I} - \frac{1}{\Delta y^2}U_{K-I} = F_K.$$
Finally, we write it as a matrix problem $AU = F$, where the matrix $A$ has a block tridiagonal structure (with $J$ blocks per row and per column):

$$A = \begin{pmatrix}
B & -\frac{\mathbb{I}}{\Delta y^2} & & & 0\\
-\frac{\mathbb{I}}{\Delta y^2} & B & -\frac{\mathbb{I}}{\Delta y^2} & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\frac{\mathbb{I}}{\Delta y^2} & B & -\frac{\mathbb{I}}{\Delta y^2}\\
0 & & & -\frac{\mathbb{I}}{\Delta y^2} & B
\end{pmatrix}$$


with $\mathbb{I}$ the identity matrix on $\mathbb{R}^I$, and $B$ the following matrix of size $I\times I$:
$$B = \begin{pmatrix}
b & -\frac{1}{\Delta x^2} & & & 0\\
-\frac{1}{\Delta x^2} & b & -\frac{1}{\Delta x^2} & & \\
 & \ddots & \ddots & \ddots & \\
 & & -\frac{1}{\Delta x^2} & b & -\frac{1}{\Delta x^2}\\
0 & & & -\frac{1}{\Delta x^2} & b
\end{pmatrix}, \qquad b = 2\Big(\frac{1}{\Delta x^2} + \frac{1}{\Delta y^2}\Big).$$

An example, for the right-hand side
$$f(x,y) = 20\exp\big(-100((x-0.25)^2 + (y-0.5)^2)\big) + 15\exp\big(-120((x-0.75)^2 + (y-0.4)^2)\big),$$
is shown in Figure 2.5.

2.3. Finite volume approximation of elliptic equations

2.3.1. Discretization principles for finite volumes

We consider the same model problem [2.2], but with a different approach: the idea is to integrate the equation over control volumes $C_i = [\bar x_{i-1/2}, \bar x_{i+1/2}[$, whose endpoints $\bar x_{i+1/2}$, $i \in \{0,\dots,I\}$, satisfy
$$0 = \bar x_{1/2} < \bar x_{3/2} < \dots < \bar x_{I-1/2} < \bar x_{I+1/2} = 1.$$
We write $\bar x_0 = 0$, $\bar x_{I+1} = 1$ and, for $i \in \{1,\dots,I\}$, we set
$$\bar x_i = \frac{\bar x_{i+1/2} + \bar x_{i-1/2}}{2},$$
which is the barycenter of the control volume $C_i$. Finally, for $i \in \{1,\dots,I\}$, we write
$$h_{i+1/2} = \bar x_{i+1} - \bar x_i, \qquad h_i = \bar x_{i+1/2} - \bar x_{i-1/2},$$
which are, respectively, the distance between the barycenters of the volumes $C_i$ and $C_{i+1}$, and the length of the volume $C_i$. In particular, we note that
$$h_{i+1/2} = \frac{h_i}{2} + \frac{h_{i+1}}{2}. \tag{2.21}$$


Figure 2.5. Solution to the 2D Poisson equation. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

This notation is illustrated in Figure 2.6. We have
$$\lambda\int_{C_i} u(y)\,dy - \Big(k(\bar x_{i+1/2})\frac{d}{dx}u(\bar x_{i+1/2}) - k(\bar x_{i-1/2})\frac{d}{dx}u(\bar x_{i-1/2})\Big) = \int_{C_i} f(y)\,dy.$$

Figure 2.6. Finite volume discretization: control volumes are the segments $C_i = [\bar x_{i-1/2}, \bar x_{i+1/2}]$

The numerical unknown $U = (u_1,\dots,u_I) \in \mathbb{R}^I$ is understood to give an approximation of the mean values
$$\frac{1}{h_i}\int_{C_i} u(y)\,dy.$$


This leads us to write the scheme in the form
$$\lambda h_i u_i + \big(\mathcal{F}_{i+1/2} - \mathcal{F}_{i-1/2}\big) = h_i\bar f_i, \tag{2.22}$$
where
$$\bar f_i = \frac{1}{h_i}\int_{C_i} f(y)\,dy.$$
We naturally associate with the vector $U = (u_1,\dots,u_I)$ the following function, constant over each control volume:
$$u_h(x) = \sum_{i=1}^I u_i\,\mathbf{1}_{C_i}(x).$$
The key notion of FV methods is the numerical flux: the point is to find a relevant definition of $\mathcal{F}_{i+1/2}$ as a function of the components $u_1,\dots,u_I$. This quantity is designed to be an approximation of the flux at the interface $\bar x_{i+1/2}$ between the control volumes $C_i$ and $C_{i+1}$. For the solution $x \mapsto u(x)$ of the continuous problem, the flux is
$$F_{i+1/2} = -k(\bar x_{i+1/2})\frac{d}{dx}u(\bar x_{i+1/2}).$$
We approximate this quantity by a differential quotient, which leads to the formula
$$\mathcal{F}_{i+1/2} = -k(\bar x_{i+1/2})\frac{u_{i+1}-u_i}{h_{i+1/2}}.$$
We considered a formula of the same kind for the finite difference approximation on a regular grid at the end of section 2.2.1. This definition makes sense for the interior interfaces, where $i \in \{1,\dots,I-1\}$; it must be modified for the interfaces on the boundary. At $x = 0$ and $x = 1$, the function $x \mapsto u(x)$ vanishes. We interpret this condition as fixing the values $u_0$ and $u_{I+1}$ at 0 in control volumes of size zero, concentrated at the boundary points $x = 0 = \bar x_{1/2} = \bar x_0$ and $x = 1 = \bar x_{I+1/2} = \bar x_{I+1}$. We thus obtain
$$\mathcal{F}_{1/2} = -2k(0)\frac{u_1}{h_1}, \qquad \mathcal{F}_{I+1/2} = +2k(1)\frac{u_I}{h_I}.$$
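Putting the interior and boundary fluxes together, the finite volume scheme can be assembled as in the following sketch (our addition, not from the book; the perturbed grid and the data $k = 1$, $f = 1$, $\lambda = 0$ are arbitrary choices), for which the exact solution of [2.2] is $u(x) = x(1-x)/2$:

```python
import numpy as np

# Finite volume scheme [2.22] on a non-uniform grid, with lambda = 0, k = 1
# and f = 1; the exact solution of [2.2] is then u(x) = x(1-x)/2.
rng = np.random.default_rng(0)
I = 100
# perturbed uniform interfaces 0 = xbar_{1/2} < ... < xbar_{I+1/2} = 1
inner = (np.arange(1, I) + 0.4 * rng.uniform(-1.0, 1.0, I - 1)) / I
interfaces = np.concatenate(([0.0], inner, [1.0]))
hi = np.diff(interfaces)                          # cell lengths h_i
xbar = 0.5 * (interfaces[1:] + interfaces[:-1])   # cell barycenters

k = 1.0
# distances used by the fluxes: h_{i+1/2} between barycenters in the
# interior, and half cells at the two boundary interfaces
hhalf = np.empty(I + 1)
hhalf[1:-1] = xbar[1:] - xbar[:-1]
hhalf[0] = hi[0] / 2.0
hhalf[-1] = hi[-1] / 2.0

# symmetric matrix Abar: row i collects F_{i+1/2} - F_{i-1/2}
main = k / hhalf[:-1] + k / hhalf[1:]
Abar = (np.diag(main) + np.diag(-k / hhalf[1:-1], 1)
        + np.diag(-k / hhalf[1:-1], -1))

fbar = np.ones(I)                                 # cell averages of f = 1
U = np.linalg.solve(Abar, hi * fbar)

err = np.max(np.abs(U - xbar * (1.0 - xbar) / (2.0 * k)))
print(err)
assert err < 1e-2
```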

Finally, the discrete problem comes down to solving the linear system
$$\big(\lambda\,\mathrm{diag}(h_1,\dots,h_I) + \bar A\big)u = \mathrm{diag}(h_1,\dots,h_I)\,\bar f,$$
with $\bar f = (\bar f_1,\dots,\bar f_I)$ and
$$\bar A_{i,j} = \bar A_{j,i}, \qquad \bar A_{i,j} = 0 \quad \text{if } |i-j| > 1,$$
$$\bar A_{i,i} = \frac{k(\bar x_{i+1/2})}{h_{i+1/2}} + \frac{k(\bar x_{i-1/2})}{h_{i-1/2}}, \quad i \in \{2,\dots,I-1\},$$
$$\bar A_{i,i+1} = -\frac{k(\bar x_{i+1/2})}{h_{i+1/2}}, \quad i \in \{1,\dots,I-1\},$$
$$\bar A_{I,I} = \frac{k(\bar x_{I-1/2})}{h_{I-1/2}} + 2\frac{k(1)}{h_I}, \qquad \bar A_{1,1} = \frac{k(\bar x_{3/2})}{h_{3/2}} + 2\frac{k(0)}{h_1}.$$
We should pay special attention to the location of the discretization coefficients $h_i$ in the resulting system [2.22]. In particular, it would not be correct to divide the equations [2.22] by $h_i$: the corresponding linear system would no longer be symmetric (for a general grid).

It is also useful to compare the expression of the finite volume scheme with that of the finite difference scheme, even for a constant $k$. Let $h > 0$ denote the grid step, assumed uniform (except for the boundary points, as we will see in more detail). Then, for the interior points $i \in \{2,\dots,I-1\}$, we have
$$\lambda h u_i^{VF} - \frac{k}{h}\big(u_{i+1}^{VF} - 2u_i^{VF} + u_{i-1}^{VF}\big) = h\bar f_i$$
for the finite volume scheme, and
$$\lambda u_i^{DF} - \frac{k}{h^2}\big(u_{i+1}^{DF} - 2u_i^{DF} + u_{i-1}^{DF}\big) = f(\bar x_i)$$
for the finite difference scheme. The expression is different at the boundaries, where it should be noted that $\bar x_{3/2} - \bar x_{1/2} = h$, $\bar x_1 - \bar x_0 = h/2$, with $\bar x_0 = \bar x_{1/2} = 0$. For the finite volume scheme, we have
$$\lambda h u_1^{VF} \underbrace{-\,\frac{k}{h}\big(u_2^{VF} - 3u_1^{VF}\big)}_{\mathcal{F}_{3/2}-\mathcal{F}_{1/2}} = h\bar f_1.$$


For the finite difference scheme (see [2.16], with $h_{3/2} = h$, $h_{1/2} = h/2$), we obtain
$$\lambda u_1^{DF} - \frac{4k}{3h^2}\big(u_2^{DF} - 3u_1^{DF}\big) = f(h/2).$$
An analogous situation presents itself at the last discretization cell:
$$\lambda h u_I^{VF} - \frac{k}{h}\big(u_{I-1}^{VF} - 3u_I^{VF}\big) = h\bar f_I \qquad\text{vs.}\qquad \lambda u_I^{DF} - \frac{4k}{3h^2}\big(u_{I-1}^{DF} - 3u_I^{DF}\big) = f(1 - h/2).$$
In particular, we observe that, since the finite difference grid is not uniform, the associated matrix is not symmetric (for the usual scalar product). Writing $\mathcal{F}_{i+1/2} = -k(\bar x_{i+1/2})\frac{u_{i+1}-u_i}{h_{i+1/2}}$, the finite volume scheme is thus defined by
$$\lambda h_i u_i + \mathcal{F}_{i+1/2} - \mathcal{F}_{i-1/2} = h_i\bar f_i, \tag{2.23}$$
where the left-hand side is precisely $\big((\lambda\,\mathrm{diag}(h_1,\dots,h_I) + \bar A)u\big)_i$. The matrix of that system is symmetric (by construction) and positive definite, because
$$\big(\lambda\,\mathrm{diag}(h_1,\dots,h_I) + \bar A\big)u\cdot u
= \lambda\sum_{i=1}^I h_i u_i^2 - \sum_{i=0}^I \mathcal{F}_{i+1/2}\,(u_{i+1} - u_i)
= \lambda\sum_{i=1}^I h_i u_i^2 + \sum_{i=0}^I k(\bar x_{i+1/2})\,\frac{(u_{i+1}-u_i)^2}{h_{i+1/2}} \ge 0,$$
with the conventions $u_0 = u_{I+1} = 0$, $h_{1/2} = h_1/2$ and $h_{I+1/2} = h_I/2$. This quantity equals zero if and only if $u_{i+1} = u_i$ for each index $i$, that is to say, taking into account the boundary conditions, $u = 0$.

It is therefore tempting to interpret $u_i$ as an approximation of $u(\bar x_i)$, as in the analysis carried out for finite difference schemes. However, the scheme obtained here is not, in general, consistent in the sense required for finite differences. Indeed, let us study the case where $k(x) = k > 0$ is constant. If we apply the formula that defines the scheme to the values $u(\bar x_i)$, we obtain
$$\begin{aligned}
\lambda u(\bar x_i) &- \frac{k}{h_i}\Big(\frac{u(\bar x_{i+1}) - u(\bar x_i)}{h_{i+1/2}} - \frac{u(\bar x_i) - u(\bar x_{i-1})}{h_{i-1/2}}\Big)\\
&= \lambda u(\bar x_i) - \frac{k}{h_i}\Big(u'(\bar x_i) + \frac{h_{i+1/2}}{2}u''(\bar x_i) - u'(\bar x_i) + \frac{h_{i-1/2}}{2}u''(\bar x_i)\Big) + O(h)\\
&= \lambda u(\bar x_i) - k\,\frac{h_{i+1/2} + h_{i-1/2}}{2h_i}\,u''(\bar x_i) + O(h).
\end{aligned}$$


Therefore, the consistency error does not necessarily tend to 0 when $h \to 0$. For example, we can define a grid where $h_{2p} = h$ and $h_{2p+1} = h/2$, in which case $h_{i+1/2} + h_{i-1/2} = 3h/2$ and the ratio $\frac{h_{i+1/2}+h_{i-1/2}}{2h_i}$ alternately takes the values $3/4$ and $3/2$. However, this sense of consistency is satisfied when the grid is uniform, so that $h_{i+1/2} = h_i = h$. Nevertheless, we can show that the finite volume scheme converges. By definition, the right-hand side of [2.23] is equal to
$$\int_{\bar x_{i-1/2}}^{\bar x_{i+1/2}} f(y)\,dy = \lambda h_i\bar u_i + F_{i+1/2} - F_{i-1/2},$$
where we denote
$$\bar u_i = \frac{1}{h_i}\int_{\bar x_{i-1/2}}^{\bar x_{i+1/2}} u(y)\,dy.$$

Note that, on the one hand,  1 ˆ x¯i+1/2    |¯ ui − u(¯ xi )| =  (u(y) − u(¯ xi )) dy  ≤ Ch, hi x¯i−1/2 and, on the other hand, using the expansion of u in xi+1/2 ,  u(¯ xi+1 ) − u(¯ xi )    xi+1/2 ) − u (¯  hi+1/2 [2.24]  u(¯ xi+1 ) − u(¯ xi+1/2 ) + u(¯ xi+1/2 ) − u(¯ xi )    = u (¯ xi+1/2 ) −  ≤ Ch hi+1/2 where C depends on u C 2 . In what follows, we will again denote such a quantity as C, even if the value may change from one line to another. The second inequality employs the fact that x ¯i+1 − x ¯i+1/2 = hi+1 /2, x ¯i+1/2 − x ¯i = hi /2 and hi+1/2 = x ¯i+1 − x ¯i = hi+1 /2 + hi /2. We introduce the notation ei = ui − u(¯ xi ), u(¯ xi+1 ) − u(¯ xi ) − Fi+1/2 hi+1/2  u(¯ xi+1 ) − u(¯ xi ) = k(¯ xi+1/2 ) − u (¯ xi+1/2 ) , hi+1/2

Ri+1/2 = k(¯ xi+1/2 )

186

Mathematics for Modeling and Scientific Computing

and we have

\[ 0 = \lambda h_i (u_i - \bar{u}_i) - (F_{i+1/2} - \bar{F}_{i+1/2}) + (F_{i-1/2} - \bar{F}_{i-1/2}) \]
\[ = \lambda h_i e_i - k(\bar{x}_{i+1/2})\,\frac{e_{i+1}-e_i}{h_{i+1/2}} + k(\bar{x}_{i-1/2})\,\frac{e_i-e_{i-1}}{h_{i-1/2}} + \lambda h_i \big(u(\bar{x}_i) - \bar{u}_i\big) - R_{i+1/2} + R_{i-1/2}. \]

We multiply this expression by \(e_i\) and sum to obtain

\[ \lambda \sum_i h_i |e_i|^2 + \sum_i k(\bar{x}_{i+1/2})\,\frac{|e_{i+1}-e_i|^2}{h_{i+1/2}}
= -\sum_i R_{i+1/2}(e_{i+1}-e_i) + \lambda \sum_i h_i \big(\bar{u}_i - u(\bar{x}_i)\big) e_i. \]

Using the Cauchy–Schwarz inequality, the right-hand side can be dominated by

\[ \Big( \sum_i h_{i+1/2} |R_{i+1/2}|^2 \Big)^{1/2} \Big( \sum_i \frac{|e_{i+1}-e_i|^2}{h_{i+1/2}} \Big)^{1/2}
+ \lambda \Big( \sum_i h_i |\bar{u}_i - u(\bar{x}_i)|^2 \Big)^{1/2} \Big( \sum_i h_i |e_i|^2 \Big)^{1/2} \]
\[ \le C h \left( \Big( \sum_i \frac{|e_{i+1}-e_i|^2}{h_{i+1/2}} \Big)^{1/2} + \Big( \lambda \sum_i h_i |e_i|^2 \Big)^{1/2} \right) \]

because \(\sum_i h_{i+1/2}\) and \(\sum_i h_i\) are bounded above by the size of the domain. Since \(\sqrt{a}+\sqrt{b} \le \sqrt{2}\,\sqrt{a+b}\), and because \(k(x) \ge \kappa > 0\), we conclude that

\[ \Big( \lambda \sum_i h_i |e_i|^2 + \sum_i \frac{|e_{i+1}-e_i|^2}{h_{i+1/2}} \Big)^{1/2} \le C h, \tag{2.25} \]

which can be interpreted as an estimate of a discrete \(H^1\) norm of the error. Note that \(e_i = \sum_{j=0}^{i-1}(e_{j+1}-e_j)\) because \(e_0 = 0\), which enables us to obtain the discrete Poincaré inequality

\[ \max_{i\in\{0,\dots,I\}} |e_i| \le C \Big( \sum_j \frac{|e_{j+1}-e_j|^2}{h_{j+1/2}} \Big)^{1/2}. \]

Finally, we infer that \(\max_{i\in\{0,\dots,I\}} |e_i| \le C h\). Note that using a general grid hampers the order of convergence: if we assume instead that the grid is uniform with step \(h\), then in [2.25] we obtain an estimate in \(O(h^2)\). This results from the fact that in [2.24] the second-order term of the expansion,

\[ \frac{u''(\bar{x}_{i+1/2})}{2 h_{i+1/2}} \big( (\bar{x}_{i+1}-\bar{x}_{i+1/2})^2 - (\bar{x}_i-\bar{x}_{i+1/2})^2 \big), \]

is zero for that grid. We thus obtain an estimate of the form \(\big|u'(\bar{x}_{i+1/2}) - \frac{u(\bar{x}_{i+1})-u(\bar{x}_i)}{h}\big| \le C h^2\). More generally, we should note that the key ingredient of the convergence proof lies in the estimate of the remainder \(R_{i+1/2}\): although the discrete equation is not consistent, the fluxes are consistent.

As in the FD scheme, we establish the maximum principle and the \(L^\infty\) stability. Assume that \(f \ge 0\), \(f \not\equiv 0\). Let \(i_0\) be an index such that \(u_{i_0} = \min\{u_0, \dots, u_{I+1}\}\), with the convention \(u_0 = 0 = u_{I+1}\). Suppose that \(i_0\) is an interior point: if \(i_0 \in \{1, \dots, I\}\), then we have

\[ 0 \le h_{i_0} f_{i_0} = \lambda h_{i_0} u_{i_0} - \underbrace{\frac{k_{i_0+1/2}}{h_{i_0+1/2}}(u_{i_0+1}-u_{i_0})}_{\ge 0} + \underbrace{\frac{k_{i_0-1/2}}{h_{i_0-1/2}}(u_{i_0}-u_{i_0-1})}_{\le 0}. \]

Whenever \(\lambda > 0\), this proves that \(u_{i_0} \ge 0\). If \(\lambda = 0\), we infer from this relation that \(u_{i_0} = u_{i_0\pm 1}\); repeating the argument from one neighbor to the next, \(u\) would be constant and \(f\) would be exactly zero, which is a contradiction. Hence the minimum is attained on the boundary, and we have \(u_j > 0\) for all \(j \in \{1, \dots, I\}\). As a result, the matrix \(\lambda\,\mathrm{diag}(h_1,\dots,h_I) + \bar{A}\) is monotone. The \(L^\infty\) stability relies on the fact that we know a specific solution whenever the right-hand side \(f\) is constant. If we know \(w\) such that \((\bar{A}w)_j = h_j\), then the solution \(u\) of \((\bar{A}u)_j = h_j f_j\) satisfies

\[ \bar{A}\big( \|f\|_\infty w - u \big)_j = h_j \big( \|f\|_\infty - f_j \big) \ge 0. \]

We infer that \(\|f\|_\infty w_j \ge u_j\) for all \(j\), so \(\|u\|_\infty \le \|f\|_\infty \|w\|_\infty\). Finally, for constant coefficients and whenever \(\lambda = 0\), we can verify that \(w_j = \frac{1}{2}\bar{x}_j(1-\bar{x}_j) + \frac{h_j^2}{8}\) satisfies \((\bar{A}w)_j = h_j\).
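These conclusions are easy to observe numerically. The sketch below (a minimal illustration assuming NumPy; the test problem f = 1 with exact solution u(x) = x(1-x)/2, the grid choices and the function names are ours, not from the text) solves the scheme [2.23] with k = 1 and λ = 0 on the alternating grid h, h/2, where the finite difference consistency fails, and checks that the error nevertheless decreases under refinement; it also verifies the particular solution w_j on a uniform grid.

```python
import numpy as np

def fv_matrix(widths):
    """Finite volume matrix Abar for -(u')' with k = 1 and Dirichlet conditions.

    Unknowns sit at the cell centers; fluxes use the distances between
    neighboring centers (half-cells at the two boundary points)."""
    edges = np.concatenate(([0.0], np.cumsum(widths)))
    centers = 0.5 * (edges[:-1] + edges[1:])
    pts = np.concatenate(([0.0], centers, [edges[-1]]))
    d = np.diff(pts)                       # the distances h_{i+1/2}
    I = len(widths)
    A = np.zeros((I, I))
    for i in range(I):
        A[i, i] = 1.0 / d[i] + 1.0 / d[i + 1]
        if i > 0:
            A[i, i - 1] = -1.0 / d[i]
        if i < I - 1:
            A[i, i + 1] = -1.0 / d[i + 1]
    return A, centers

def alternating_error(m):
    """Max-norm error for f = 1 (exact solution x(1-x)/2) on the h, h/2 grid."""
    h = 2.0 / (3.0 * m)                    # m pairs of cells (h, h/2) fill [0, 1]
    widths = np.tile([h, h / 2.0], m)
    A, centers = fv_matrix(widths)
    u = np.linalg.solve(A, widths * 1.0)   # right-hand side h_i * f_i with f = 1
    return np.max(np.abs(u - centers * (1.0 - centers) / 2.0))

err_coarse, err_fine = alternating_error(4), alternating_error(16)

# particular solution w_j = xbar_j(1 - xbar_j)/2 + h^2/8 on a uniform grid
I = 40
A_u, centers_u = fv_matrix(np.full(I, 1.0 / I))
w = centers_u * (1.0 - centers_u) / 2.0 + (1.0 / I) ** 2 / 8.0
residual_w = np.max(np.abs(A_u @ w - 1.0 / I))
```

Despite the O(1) truncation error of the discrete equation on this grid, the solution error shrinks with h, which is exactly the content of the flux-consistency argument above.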

2.3.2. Discontinuous coefficients

When the coefficient \(k\) is discontinuous at an interface \(x = \bar{x}_{i+1/2}\), there is an ambiguity in defining the value of the numerical coefficient \(k_{i+1/2}\). The idea is to reproduce the formulas of the continuous case. We write \(k^\pm = \lim_{h\to 0^\pm} k(\bar{x}_{i+1/2}+h)\). For \(f \in L^2\), we have seen that the flux \(ku'\) is continuous: when \(k\) has a jump, \(u'\) also has a jump, in such a way that the product \(ku'\) remains continuous (\(u\) is continuous but not \(C^1\) in that case; see the example described in Figure 2.16). We thus have

\[ \lim_{h\to 0^+} k(\bar{x}_{i+1/2}+h)\,\frac{u(\bar{x}_{i+1/2}+h)-u(\bar{x}_{i+1/2})}{h}
= \lim_{h\to 0^-} k(\bar{x}_{i+1/2}+h)\,\frac{u(\bar{x}_{i+1/2}+h)-u(\bar{x}_{i+1/2})}{h}. \]

We first seek \(u_{i+1/2}\) such that the discrete analogue is satisfied:

\[ k^+\,\frac{u_{i+1}-u_{i+1/2}}{\bar{x}_{i+1}-\bar{x}_{i+1/2}} = k^-\,\frac{u_{i+1/2}-u_i}{\bar{x}_{i+1/2}-\bar{x}_i}. \]


We let \(F_{i+1/2}\) denote this common value, which will be the value of the numerical flux at the interface \(\bar{x}_{i+1/2}\). In other words, we have

\[ u_{i+1/2} = \frac{u_{i+1}\,k^+/h_{i+1} + u_i\,k^-/h_i}{k^+/h_{i+1} + k^-/h_i}. \]

Then, \(k_{i+1/2}\) is defined so that \(F_{i+1/2} = k_{i+1/2}\,\frac{u_{i+1}-u_i}{h_{i+1/2}}\). We thus obtain the following expression for the numerical flux:

\[ F_{i+1/2} = \frac{2k^+k^-/(h_i h_{i+1})}{k^+/h_{i+1} + k^-/h_i}\,(u_{i+1}-u_i) = k_{i+1/2}\,\frac{u_{i+1}-u_i}{h_{i+1/2}}, \]

that is to say,

\[ k_{i+1/2} = \frac{2 h_{i+1/2}}{h_i/k^- + h_{i+1}/k^+}. \tag{2.26} \]

When \(h_i = h_{i+1} = h\) (so that \(h_{i+1/2} = h\) as well), this coefficient is just the harmonic mean \(\frac{2}{1/k^- + 1/k^+}\).
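A quick check of why the harmonic mean is the right choice: we build the two-material solution with constant exact flux F = k u' (pure conduction, f = 0) across one interface and compare the discrete fluxes. This is a sketch with hypothetical values k⁻ = 1, k⁺ = 10; all names are our own.

```python
km, kp = 1.0, 10.0          # k^- and k^+ on the two sides of the interface
h = 0.1                     # uniform cells: h_i = h_{i+1} = h_{i+1/2} = h
F_exact = 1.0               # pure conduction: k u' = F on both sides, f = 0

# integrate u' = F/k from the left cell center to the right cell center
# (half a cell in each material)
u_left = 0.0
u_right = u_left + F_exact * (h / 2.0) / km + F_exact * (h / 2.0) / kp

k_harmonic = 2.0 / (1.0 / km + 1.0 / kp)   # [2.26] with h_i = h_{i+1} = h
k_arithmetic = 0.5 * (km + kp)

flux_harmonic = k_harmonic * (u_right - u_left) / h
flux_arithmetic = k_arithmetic * (u_right - u_left) / h
```

The harmonic-mean flux reproduces the exact interface flux, while the arithmetic mean overestimates it badly when the contrast k⁺/k⁻ is large.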

We have seen that the key point in establishing convergence of the finite volume method lies in the consistency property of the fluxes, which amounts to showing that

\[ \lim_{h\to 0} R_{i+1/2} = \lim_{h\to 0} \Big( k_{i+1/2}\,\frac{u(\bar{x}_{i+1})-u(\bar{x}_i)}{h_{i+1/2}} - w(\bar{x}_{i+1/2}) \Big) = 0, \]

where \(w(x) = \big(k\frac{d}{dx}u\big)(x)\). Even though \(k\) is discontinuous, the regularity of the right-hand side \(f\) implies that \(w\) is a regular function (at least continuous for \(f \in L^2\), see lemma 2.1; we will specify the required degree of regularity in the proof). Since \(u' = w/k\), we rewrite the desired quantity in the form

\[ R_{i+1/2} = \frac{k_{i+1/2}}{h_{i+1/2}} \int_{\bar{x}_i}^{\bar{x}_{i+1}} \frac{w(y)}{k(y)}\,dy - w(\bar{x}_{i+1/2}). \]

Now, we have

\[ w(y) = w(\bar{x}_{i+1/2}) + \int_0^1 w'\big(\bar{x}_{i+1/2} + \theta(y-\bar{x}_{i+1/2})\big)\,(y-\bar{x}_{i+1/2})\,d\theta, \]

which leads to

\[ R_{i+1/2} = w(\bar{x}_{i+1/2}) \Big( \frac{k_{i+1/2}}{h_{i+1/2}} \int_{\bar{x}_i}^{\bar{x}_{i+1}} \frac{dy}{k(y)} - 1 \Big) + r_h \]


where the remainder \(r_h\) indeed tends to 0 when \(h \to 0\) because

\[ |r_h| = \Big| \frac{k_{i+1/2}}{h_{i+1/2}} \int_{\bar{x}_i}^{\bar{x}_{i+1}} \frac{1}{k(y)} \int_0^1 w'\big(\bar{x}_{i+1/2} + \theta(y-\bar{x}_{i+1/2})\big)(y-\bar{x}_{i+1/2})\,d\theta\,dy \Big|
\le \frac{K}{\kappa}\,\|w'\|_{L^\infty}\, h. \]

Therefore, for all interfaces \(x = \bar{x}_{i+1/2}\), we could set

\[ k_{i+1/2} = \frac{h_{i+1/2}}{\displaystyle \int_{\bar{x}_i}^{\bar{x}_{i+1}} \frac{dy}{k(y)}}. \]

If \(x \mapsto k(x)\) satisfies \(k(x) = k^-\) on \(]\bar{x}_i, \bar{x}_{i+1/2}[\) and \(k(x) = k^+\) on \(]\bar{x}_{i+1/2}, \bar{x}_{i+1}[\), we obtain \(k_{i+1/2} = \frac{h_{i+1/2}}{h_i/(2k^-) + h_{i+1}/(2k^+)}\). More generally, suppose that \(k\) is \(C^1\) on both sides, with \(\lim_{x\to \bar{x}_{i+1/2}^\pm} k(x) = k^\pm\). Then we also have

\[ \frac{1}{h_{i+1/2}} \int_{\bar{x}_i}^{\bar{x}_{i+1}} \frac{dy}{k(y)}
= \frac{1}{h_{i+1/2}} \Big( \int_{\bar{x}_i}^{\bar{x}_{i+1/2}} \frac{dy}{k(y)} + \int_{\bar{x}_{i+1/2}}^{\bar{x}_{i+1}} \frac{dy}{k(y)} \Big) \]
\[ = \frac{h_i/k^- + h_{i+1}/k^+}{2 h_{i+1/2}}
+ \frac{1}{h_{i+1/2}} \int_{\bar{x}_i}^{\bar{x}_{i+1/2}} \Big( \frac{1}{k(y)} - \frac{1}{k^-} \Big) dy
+ \frac{1}{h_{i+1/2}} \int_{\bar{x}_{i+1/2}}^{\bar{x}_{i+1}} \Big( \frac{1}{k(y)} - \frac{1}{k^+} \Big) dy \]
\[ = \frac{h_i/k^- + h_{i+1}/k^+}{2 h_{i+1/2}} + s_h, \qquad \text{with } |s_h| \le C h, \]

which justifies defining \(k_{i+1/2}\) by [2.26].

2.3.3. Multidimensional problems

The generalization of finite volume methods to multidimensional elliptic problems leads to difficulties related to the grid's geometry. To illustrate those difficulties, we will limit ourselves to the two-dimensional case and cover the domain under study, \(\Omega \subset \mathbb{R}^2\), with a collection of convex polygons, \(\Omega = \bigcup_{j=1}^J T_j\). These polygons represent the control volumes. For each polygon \(T_j\), we select a "representative point" \(G_j\). This point can be the barycenter of \(T_j\), but there may be more practical choices. We assume that the grid is conforming: if two control volumes satisfy \(T_i \cap T_j \neq \emptyset\), then the intersection is one of the sides of \(T_i\) and \(T_j\). We still obtain the scheme by mimicking the relations obtained by


integrating the equation on the control volumes. Let us consider the simple Laplace problem with Dirichlet conditions

\[ \lambda u - \Delta u = f \in L^2(\Omega), \qquad u|_{\partial\Omega} = 0. \]

The finite volume scheme is inspired by the relations

\[ \frac{\lambda}{|T_j|} \int_{T_j} u(y)\,dy - \frac{1}{|T_j|} \int_{\partial T_j} \nabla_x u(y) \cdot n_j(y)\,d\sigma(y) = \frac{1}{|T_j|} \int_{T_j} f(y)\,dy, \]

where \(n_j(y)\) denotes the outward unit normal vector at the point \(y \in \partial T_j\) and \(d\sigma\) is the Lebesgue measure on \(\partial T_j\). We can define a simple approximation of \(\nabla_x u(y)\cdot n_j\) by imposing a specific structure on the grid. Namely, we say the grid is admissible if, for all volumes \(T_j\), \(T_k\) with a common edge \(A_{jk}\), the segment \(G_jG_k\) is orthogonal to \(A_{jk}\). In this case,

\[ \nabla u(y^\star) \cdot n(y^\star) \quad \text{is approximated by} \quad \frac{u(G_k)-u(G_j)}{|G_kG_j|}, \tag{2.27} \]

with \(G_jG_k \cap A_{jk} = \{y^\star\}\). However, this structural assumption is a very strong restriction on the grid. It is satisfied for rectangular control volumes, or for a grid made up of equilateral triangles with \(G_j\) the barycenter of the volume \(T_j\). For more general grids, this condition might be violated: as a result, the fluxes defined by [2.27] are not consistent, and the scheme does not converge to the solution of the continuous problem [FAI 92]. Moreover, if the Laplace problem is replaced by

\[ -\nabla \cdot \big(k(x)\nabla u\big) = f \in L^2(\Omega), \qquad u|_{\partial\Omega} = 0, \]

with \(k(x)\) a matrix of \(M_N(\mathbb{R})\) such that \(0 < \kappa|\xi|^2 \le k(x)\xi\cdot\xi \le K|\xi|^2\) with \(\kappa, K > 0\), this approximation is not enough: it only provides an approximation of the gradient of \(u\) in the direction of the normal vector \(n(y)\), while we now need to know \(\nabla u\) in the transverse direction to evaluate \(\int_{\partial T_j} k\nabla u(y)\cdot n_j(y)\,d\sigma(y)\). These questions are strongly motivated by applications related to the simulation of flows in porous media. For those applications, methods based on finite difference or finite element approximations have reached certain limits, especially in very heterogeneous environments. The subject is an area of active research, especially since the developments of [COU 99, DOM 05], which have provided clues for managing those difficulties by constructing an effective approximation of \(\int_{\partial T} k\nabla u\cdot n(y)\,d\sigma(y)\) on control volume interfaces. The reference work [EYM 00] provides a more complete overview of these finite volume methods and their analysis.
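The admissibility condition is straightforward to test on a given grid. A small sketch (the function name and the example cells are ours; two unit squares sharing an edge satisfy the condition, and shifting one representative point off-axis breaks it):

```python
import numpy as np

def edge_is_admissible(Gj, Gk, a, b, tol=1e-12):
    """True when the segment Gj-Gk is orthogonal to the edge with endpoints a, b."""
    d = np.asarray(Gk, dtype=float) - np.asarray(Gj, dtype=float)
    e = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
    return abs(np.dot(d, e)) <= tol * np.linalg.norm(d) * np.linalg.norm(e)

# two unit squares sharing the edge x = 1: the centers work as representative points
ok = edge_is_admissible((0.5, 0.5), (1.5, 0.5), (1.0, 0.0), (1.0, 1.0))
# moving the second representative point off-axis violates the condition
bad = edge_is_admissible((0.5, 0.5), (1.5, 0.8), (1.0, 0.0), (1.0, 1.0))
```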

Figure 2.7. Example of an admissible grid: the segment G_kG_j cuts the edge common to T_k and T_j at a right angle

2.4. Finite element approximations of elliptic equations

2.4.1. P1 approximation in one dimension

The finite element (FE) method uses the variational formulation of problem [2.2]: we have seen that [2.2] leads to the relation

\[ \lambda \int_0^1 u(x)\varphi(x)\,dx + \int_0^1 k(x)\,\frac{d}{dx}u(x)\,\frac{d}{dx}\varphi(x)\,dx = \int_0^1 f(x)\varphi(x)\,dx, \tag{2.28} \]

which is satisfied for all \(\varphi \in C^1([0,1])\) such that \(\varphi(0) = 0 = \varphi(1)\). The idea is to use a restricted class of such test functions \(\varphi\) that describe a finite-dimensional vector space; we then seek an approximation of the solution \(u\) as an element of that space. We continue to use a partition of \([0,1]\),

\[ x_0 = 0 < x_1 < \dots < x_{I+1} = 1, \]

and we use the notation

\[ h_{i+1/2} = x_{i+1} - x_i, \qquad h = \max\big\{ h_{i+1/2},\ i \in \{0,\dots,I\} \big\}. \]

We now define a finite-dimensional space meant to approximate the "natural" functional space for [2.28]. This approximation space is given by

\[ V_h = \big\{ \varphi \in C^0([0,1]),\ \varphi(0) = 0 = \varphi(1),\ \varphi|_{I_j} \in \mathbb{P}_1 \text{ for all } j \in \{0,\dots,I\} \big\} \]

where \(I_j = [x_j, x_{j+1}]\) (note that \(\bigcup_{j=0}^I I_j = [0,1]\)) and \(\mathbb{P}_1\) denotes the set of polynomials of degree \(\le 1\).


Figure 2.8. Example of a non-admissible grid: the segment G_kG_j does not intersect the edge common to T_k and T_j at a right angle; the numerical unknowns attached to G_j and G_k are not enough to derive a discrete derivative in the direction of the normal vector n(y*)

LEMMA 2.5.– The set \(V_h\) is a vector space of dimension \(I\), spanned by the functions

\[ x \mapsto \varphi_j(x) = \begin{cases} \dfrac{x - x_{j-1}}{x_j - x_{j-1}} & \text{for } x_{j-1} \le x \le x_j, \\[6pt] \dfrac{x_{j+1} - x}{x_{j+1} - x_j} & \text{for } x_j \le x \le x_{j+1}, \\[6pt] 0 & \text{elsewhere,} \end{cases} \]

with \(j \in \{1, \dots, I\}\).

Figure 2.9. Example of a non-admissible grid: the segment G_kG_j does not intersect the edge common to T_k and T_j

Figure 2.10. Example of P1 basis functions. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

PROOF.– The mapping

\[ \Phi : V_h \longrightarrow \mathbb{R}^I, \qquad u \longmapsto \big( u(x_1), \dots, u(x_I) \big) \]

is linear and bijective. Indeed, on each interval \(I_j\), \(j\) describing \(\{0,\dots,I\}\), a function \(u \in V_h\) is of the form \(\alpha_j x + \beta_j\). If \(\Phi(u) = 0\), then it follows from \(u(0) = 0 = u(x_1)\) that \(\alpha_0 = \beta_0 = 0\) and, step by step, that all the coefficients \(\alpha_j\) and \(\beta_j\) vanish; this proves that \(\Phi\) is injective. Next, given \(y \in \mathbb{R}^I\), we construct the coefficients \(\alpha_j, \beta_j\) such that \(u \in V_h\) satisfies \(\Phi(u) = y\): \(\beta_0 = 0\), \(\alpha_0 = y_1/x_1\), \(\alpha_j = \frac{y_{j+1}-y_j}{x_{j+1}-x_j}\) and \(\beta_j = y_j - \alpha_j x_j\) (using the convention \(y_0 = 0 = y_{I+1}\) to take into account the Dirichlet condition), which shows that \(\Phi\) is surjective\(^8\). Now, the \(I\) functions \(\varphi_i \in V_h\) are linearly independent and therefore make up a basis of \(V_h\).

Note that [2.28] is still satisfied by the \(\varphi_j\)'s: we integrate over the segments \([x_i, x_{i+1}[\) and conclude using the continuity of \(\varphi_i\) and of \(x \mapsto k(x)\frac{d}{dx}u(x)\). We thus seek to define an approximation of the solution of [2.2] of the form

\[ u_h(x) = \sum_{j=1}^I u_j \varphi_j(x) \in V_h. \]

The numerical unknown \(U = (u_1, \dots, u_I)\) now corresponds to the coefficients of \(u_h\) in this expansion. We define them by requiring that

\[ \lambda \int_0^1 u_h(x)\varphi_j(x)\,dx + \int_0^1 k(x)\,\frac{d}{dx}u_h(x)\,\frac{d}{dx}\varphi_j(x)\,dx = \int_0^1 f(x)\varphi_j(x)\,dx \tag{2.29} \]

is satisfied for all \(j \in \{1, \dots, I\}\). This is equivalent to saying that the vector \(U\) is a solution of the linear system

\[ AU = F, \qquad F_i = \int_0^1 f(x)\varphi_i(x)\,dx, \qquad A_{i,j} = \lambda \int_0^1 \varphi_i(x)\varphi_j(x)\,dx + \int_0^1 k(x)\,\frac{d}{dx}\varphi_i(x)\,\frac{d}{dx}\varphi_j(x)\,dx. \]

8 In fact, this is equivalent to writing \(u(x) = \sum_{j=0}^I \mathbf{1}_{I_j}(x)\big( \frac{y_{j+1}-y_j}{x_{j+1}-x_j}(x-x_j) + y_j \big) = \sum_{j=1}^I y_j \varphi_j(x)\).


The matrix with coefficients \(\int_0^1 \varphi_i(x)\varphi_j(x)\,dx\) is called the mass matrix, and the matrix with coefficients \(\int_0^1 \frac{d}{dx}\varphi_i(x)\,\frac{d}{dx}\varphi_j(x)\,dx\) is called the rigidity (or stiffness) matrix. The matrix \(A\) is invertible, including for \(\lambda = 0\), since for all \(\xi \in \mathbb{R}^I \setminus \{0\}\), we have

\[ A\xi\cdot\xi = \lambda \int_0^1 \Big( \sum_{i=1}^I \varphi_i(x)\xi_i \Big)^2 dx + \int_0^1 k(x) \Big( \sum_{i=1}^I \frac{d}{dx}\varphi_i(x)\xi_i \Big)^2 dx
\ge \kappa \int_0^1 \Big( \sum_{i=1}^I \frac{d}{dx}\varphi_i(x)\xi_i \Big)^2 dx \ge 0, \tag{2.30} \]

and, in fact, this quantity is strictly positive for any \(\xi \neq 0\). The key to this property consists of noting that

\[ \frac{d}{dx}\varphi_j(x) = \begin{cases} \dfrac{1}{x_j - x_{j-1}} & \text{for } x_{j-1} \le x \le x_j, \\[6pt] \dfrac{-1}{x_{j+1} - x_j} & \text{for } x_j \le x \le x_{j+1}, \\[6pt] 0 & \text{elsewhere,} \end{cases} \]

and that the segments \([0, x_1]\) and \([x_I, 1]\) each intersect the support of only one of those basis functions. Let \(\xi \in \mathrm{Ker}(A)\). It follows from [2.30] that \(\sum_{i=1}^I \xi_i \frac{d}{dx}\varphi_i(x) = 0\) for almost all \(x \in [0,1]\). With the convention \(\xi_0 = 0 = \xi_{I+1}\), we can rewrite this relation as

\[ \sum_{i=0}^I (\xi_{i+1}-\xi_i)\,\frac{\mathbf{1}_{I_i}(x)}{x_{i+1}-x_i} = 0, \]

which implies \(\xi_1 = \xi_2 = \dots = \xi_I = 0\). Therefore, the functions \(\frac{d}{dx}\varphi_i\) are linearly independent. Moreover, note that [2.29] means that \(u_h\) is simply the projection of \(u\) on \(V_h\) for the scalar product

\[ \langle u | v \rangle = \lambda \int_0^1 u(x)v(x)\,dx + \int_0^1 k(x)\,\frac{d}{dx}u(x)\,\frac{d}{dx}v(x)\,dx, \]

because we have \(\langle u - u_h | \varphi_i \rangle = 0\) for all \(i \in \{1, \dots, I\}\).

NOTE 2.7.– For a uniform grid with step \(h > 0\), we have

\[ \varphi_i(x) = \frac{x - (i-1)h}{h}\,\mathbf{1}_{(i-1)h \le x \le ih} + \frac{(i+1)h - x}{h}\,\mathbf{1}_{ih \le x \le (i+1)h} \]


and with a constant coefficient \(k(x) = k > 0\), we find

\[ A = \frac{\lambda h}{6} \begin{pmatrix} 4 & 1 & & 0 \\ 1 & 4 & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & 4 \end{pmatrix} - \frac{k}{h} \begin{pmatrix} -2 & 1 & & 0 \\ 1 & -2 & \ddots & \\ & \ddots & \ddots & 1 \\ 0 & & 1 & -2 \end{pmatrix}. \]
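This matrix can be recovered by element-by-element assembly, the same procedure that generalizes to higher degrees and dimensions. A sketch assuming NumPy (uniform grid, constant k; the helper name is ours): each element \([x_e, x_{e+1}]\) contributes a 2×2 local mass and stiffness block to the two nodes it touches, and the Dirichlet conditions drop the boundary rows and columns.

```python
import numpy as np

def p1_matrix(I, lam, k):
    """Assemble A_ij = lam * int(phi_i phi_j) + k * int(phi_i' phi_j') on a
    uniform grid with I interior nodes and homogeneous Dirichlet conditions."""
    h = 1.0 / (I + 1)
    mass_loc = lam * h / 6.0 * np.array([[2.0, 1.0], [1.0, 2.0]])
    stiff_loc = k / h * np.array([[1.0, -1.0], [-1.0, 1.0]])
    A = np.zeros((I + 2, I + 2))
    for e in range(I + 1):                 # element [x_e, x_{e+1}]
        idx = np.array([e, e + 1])
        A[np.ix_(idx, idx)] += mass_loc + stiff_loc
    return A[1:-1, 1:-1]                   # drop the boundary nodes

I, lam, k = 6, 2.0, 3.0
h = 1.0 / (I + 1)
A = p1_matrix(I, lam, k)

# the closed-form tridiagonal matrices of note 2.7
T = np.diag(np.full(I, 4.0)) + np.diag(np.ones(I - 1), 1) + np.diag(np.ones(I - 1), -1)
S = np.diag(np.full(I, -2.0)) + np.diag(np.ones(I - 1), 1) + np.diag(np.ones(I - 1), -1)
expected = lam * h / 6.0 * T - k / h * S
```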

In order to justify the validity of this approach, we begin with an interpolation result.

PROPOSITION 2.5.– Let \(g \in C^2([0,1])\) satisfy \(g(0) = 0 = g(1)\). We write \(\hat{g}_h(x) = \sum_{i=1}^I g(x_i)\varphi_i(x)\). Then there exists \(C > 0\) such that

\[ \sup_{x\in[0,1]} |\hat{g}_h(x) - g(x)| \le C h^2, \qquad \sup_{x\in[0,1]} \Big| \frac{d}{dx}\hat{g}_h(x) - \frac{d}{dx}g(x) \Big| \le C h. \]

PROOF.– For all \(x_i \le x < x_{i+1}\), we have

\[ \frac{d}{dx}\hat{g}_h(x) = \frac{g(x_{i+1}) - g(x_i)}{h_{i+1/2}} = \frac{1}{h_{i+1/2}} \int_{x_i}^{x_{i+1}} \frac{d}{dz}g(z)\,dz. \]

It follows that

\[ \frac{d}{dx}\hat{g}_h(x) - \frac{d}{dx}g(x) = \int_0^1 \Big( \frac{d}{dx}g(x_i + \theta h_{i+1/2}) - \frac{d}{dx}g(x) \Big)\,d\theta, \]

whose absolute value is bounded above by \(2\sup_\xi\big|\frac{d^2}{dx^2}g(\xi)\big|\,h\). Similarly, we determine an upper bound for \(\hat{g}_h - g\) with \(C h^2\).

Using lemma 2.2, we obtain

\[ \|u_h - u\|_{L^\infty}^2 \le \frac{1}{\kappa} \Big( \lambda \int_0^1 (u_h - u)^2(x)\,dx + \int_0^1 k(x)(u_h' - u')^2(x)\,dx \Big) \]
\[ \le \frac{1}{\kappa} \inf_{\varphi \in V_h} \Big( \lambda \int_0^1 (\varphi - u)^2(x)\,dx + \int_0^1 k(x)(\varphi' - u')^2(x)\,dx \Big) \]

according to the interpretation of \(u_h\) by means of the orthogonal projection,

\[ \le \frac{1}{\kappa} \Big( \lambda \int_0^1 (\hat{u}_h - u)^2(x)\,dx + \int_0^1 k(x)(\hat{u}_h' - u')^2(x)\,dx \Big) \]

since \(\hat{u}_h \in V_h\),

\[ \le \frac{\lambda}{\kappa}\,\|\hat{u}_h - u\|_{L^\infty}^2 + \frac{K}{\kappa}\,\|\hat{u}_h' - u'\|_{L^\infty}^2 \le C\,\frac{\lambda + K}{\kappa}\, h^2 \]

owing to proposition 2.5.

THEOREM 2.10.– There exists a constant \(C > 0\) such that the error between the approximated solution \(u_h\) and the solution \(u\) of [2.2] satisfies

\[ \|u_h - u\|_{L^2} \le \|u_h - u\|_{L^\infty} \le C h. \]

We note that the resulting estimate seems less favorable than the one obtained from the finite difference discretization, even though the matrices corresponding to the two methods are very similar. However, it should be noted in this comparison that the norm required on the solution to prove the consistency estimate is also different: \(\sup_\xi\big|\frac{d^2}{dx^2}g(\xi)\big|\) instead of \(\sup_\xi\big|\frac{d^4}{dx^4}g(\xi)\big|\). Finally, the finite element method favors variational formulations, and by using the underlying functional spaces, it is possible to obtain finer estimates: the \(L^\infty\) framework is not well adapted to the derivation of error estimates for the finite element method, in contrast with the \(L^2\) framework.

2.4.2. P2 approximations in one dimension

In order to increase the order of the approximation, the idea is to use higher-order polynomials. For example, with second-order polynomials, the basis functions are obtained using appropriate transformations of the functions

\[ \psi_0(x) = 2(x - 1/2)(x - 1), \qquad \psi_{1/2}(x) = -4x(x - 1), \qquad \psi_1(x) = 2x(x - 1/2), \]

which are shown in Figure 2.12. From a conceptual point of view, modifying the basis functions in this way does not present new difficulties. Its implementation is, however, somewhat more delicate: we no longer have a simple expression for the finite element matrix, as discussed in note 2.7 for the P1 elements.


We use a more systematic approach to construct the matrix, which can be adapted to higher dimensions. We let \(\psi_i\) denote the basis functions. The function \(\psi_i\) is piecewise polynomial, in this case of degree 2. It is equal to 1 at the grid node \(i\) and to zero at the other nodes. To obtain the element \(A_{ij}\) of the finite element matrix, we must evaluate integrals of the form

\[ \sum_I \int_I k(x)\,\psi_i'(x)\,\psi_j'(x)\,dx, \]

where the sum is taken over all segments \(I\) that have \(i\) and \(j\) among their nodes. In order to construct the matrix, it is necessary to have a numbering of the elements and of the grid nodes, as well as a correspondence table identifying which elements a given node belongs to.

Figure 2.11. P2 discretization nodes: the grid points x_0 = 0, x_1, ..., x_{J+1} = 1 together with the midpoints x_{1/2}, x_{3/2}, ..., x_{J+1/2}

For P2 elements in dimension 1, we can proceed as follows. We consider a discretization of the segment \([0,1]\) using the points \(x_0 = 0 < x_1 < \dots < x_J < x_{J+1} = 1\). The basis functions also involve the midpoints \(x_{j+1/2} = (x_j + x_{j+1})/2\), \(j \in \{0,\dots,J\}\) (see Figure 2.11). The unknowns and the basis functions are thus attached to the nodes \((x_j, x_{j+1/2}, x_{j+1})\): we denote by \(U = (u_0, u_{1/2}, u_1, \dots, u_J, u_{J+1/2}, u_{J+1})\) the vector of numerical unknowns corresponding to the nodes \(Y = (x_0, x_{1/2}, x_1, \dots, x_J, x_{J+1/2}, x_{J+1})\). We thus have \((J+1)\) segments, each involving three nodes; the vectors \(U\) and \(Y\) have \((2J+3)\) coordinates (\((J+2)\) interfaces and \((J+1)\) midpoints). To construct the matrix, we sweep over the segments \(k \in \{1,\dots,J+1\}\) and update the contributions of the nodes indexed by \(2k+i-2\) and \(2k+j-2\), where \(i, j\) take their values in \(\{1, 2, 3\}\). For a problem with constant coefficient, we must evaluate the integrals

\[ \int_0^1 \psi_i'(x)\,\psi_j'(x)\,dx, \qquad i, j \in \{0, 1/2, 1\}. \]



Figure 2.12. P2 basis functions. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

We find

\[ M_{loc} = \frac{1}{3} \begin{pmatrix} 7 & -8 & 1 \\ -8 & 16 & -8 \\ 1 & -8 & 7 \end{pmatrix}. \]

The construction of the Laplace matrix is as follows:

– Start with \(A = 0\).
– For \(k = 1\) to \((J+1)\), for \(i = 1\) to 3 and \(j = 1\) to 3: set \(\mathcal{I} = 2k+i-2\), \(\mathcal{J} = 2k+j-2\), and update \(A(\mathcal{I},\mathcal{J}) = A(\mathcal{I},\mathcal{J}) + M_{loc}(i,j)\).
– Rescale\(^9\): \(A = \frac{1}{h}A\).
– Take the boundary conditions into account.

9 Here \(h > 0\) designates the space step, which we assume to be uniform (\(x_{j+1} - x_j = x_{j+1/2} - x_{j-1/2} = h\)). The basis functions are defined on the grid by the relation \(\psi_h(y) = \psi\big(\frac{y - x_j}{h}\big)\) with \(y \in [x_j, x_{j+1}]\). The factor \(1/h\) comes from the combination of the fact that \(\psi_h'(y) = \frac{1}{h}\psi'\big(\frac{y-x_j}{h}\big)\) and the fact that we go from integrals over \([x_j, x_{j+1}]\) to integrals over \([0,1]\) by the change of variables \(y = x_j + xh\).
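The entries of M_loc and the assembly loop above can be checked with a short sketch (NumPy; the helper names are ours). A 2-point Gauss rule on [0, 1] is exact for polynomials of degree ≤ 3, hence for the quadratic products ψ_i′ψ_j′.

```python
import numpy as np

# derivatives of psi_0 = 2(x-1/2)(x-1), psi_1/2 = -4x(x-1), psi_1 = 2x(x-1/2)
dpsi = (lambda x: 4.0 * x - 3.0,
        lambda x: 4.0 - 8.0 * x,
        lambda x: 4.0 * x - 1.0)

xg = 0.5 + np.array([-1.0, 1.0]) / (2.0 * np.sqrt(3.0))   # Gauss points on [0, 1]
wg = np.array([0.5, 0.5])
Mloc = np.array([[np.sum(wg * dpsi[i](xg) * dpsi[j](xg)) for j in range(3)]
                 for i in range(3)])

def p2_laplace(J, h):
    """Assemble the P2 Laplace matrix by sweeping the J+1 elements, then apply
    the homogeneous Dirichlet conditions by dropping the two boundary nodes."""
    n = 2 * J + 3
    A = np.zeros((n, n))
    for k in range(1, J + 2):
        for i in range(3):
            for j in range(3):
                A[2 * k + i - 2, 2 * k + j - 2] += Mloc[i, j]
    return A[1:-1, 1:-1] / h

A = p2_laplace(J=4, h=0.2)
```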


We use a similar rationale to construct the data \(F\), calculating the integrals \(\int f(x)\psi_i(x)\,dx\). In general, these integrals cannot be computed exactly. It is necessary to use approximate formulas, which must themselves be chosen so as to guarantee a high-quality approximation (it would be meaningless to use P2 elements if the integrals defining the matrix system were evaluated with a simple rectangle rule!). These questions are discussed in Appendix 2.

2.4.3. Finite element methods, extension to higher dimensions

In order to generalize the method to higher dimensions, we proceed as follows. In fact, we can develop an abstract framework that allows us to analyze finite element numerical schemes. We have two working functional spaces \(V\) and \(H\). They are real Hilbert spaces with the respective scalar products \(\langle\cdot|\cdot\rangle_V\) and \(\langle\cdot|\cdot\rangle_H\) and the associated norms, denoted by \(\|\cdot\|_V\) and \(|\cdot|_H\). We assume that \(V \subset H\), the embedding being continuous and dense. Moreover, using the Riesz theorem, we can identify \(H\) and its dual space \(H'\) (we can therefore no longer identify \(V\) and \(V'\); see [BRÉ 05]). Let \(f \in H\) and let \(a\) be a coercive and continuous bilinear form on \(V \times V\). By the Lax–Milgram theorem, there exists a unique element \(u \in V\) such that for all \(v \in V\), the relation \(a(u, v) = \langle f|v\rangle_H\) holds. The typical example consists of taking \(V = H_0^1(\Omega)\), \(H = L^2(\Omega)\) and \(V' = H^{-1}(\Omega)\); this functional framework effectively allows us to deal with the elliptic problems [2.1]. The construction of the approximation relies on the fact that we have a sequence of subspaces \(V_h \subset V\) with finite dimension \(N_h = \dim(V_h) < \infty\). Then, for any fixed \(h\), there exists a unique element \(u_h \in V_h\) such that for all \(v \in V_h\), the relation \(a(u_h, v) = \langle f|v\rangle_H\) holds. In practice, determining \(u_h\) involves solving a linear system. Indeed, let \(\{e_1, \dots, e_{N_h}\}\) be a basis of the subspace \(V_h\), which can be assumed orthonormal in \(V\) (so \(\langle e_i|e_j\rangle_V = \delta_{ij}\)). Thus, we decompose \(u_h\) in this basis, \(u_h = \sum_{j=1}^{N_h} u_j e_j\), and the coefficients \(u_1, \dots, u_{N_h}\) are characterized by

\[ a\Big( \sum_{j=1}^{N_h} u_j e_j,\ e_i \Big) = \sum_{j=1}^{N_h} u_j\, a(e_j, e_i) = \langle f | e_i \rangle_H \]

for all \(i \in \{1, \dots, N_h\}\) (we have taken \(v = e_i\) for each basis vector in the variational formulation). This problem can be expressed in the form \(A_h U = b\), where \(b_i = \langle f|e_i\rangle_H\), \(U = (u_1, \dots, u_{N_h})\), and the coefficients of the matrix \(A_h\) are defined by \((A_h)_{i,j} = a(e_j, e_i)\).

PROPOSITION 2.6.– The matrix \(A_h\) is invertible.


PROOF.– This is a consequence of the coerciveness of the bilinear form \(a\). Indeed, for all \(\xi \in \mathbb{R}^{N_h}\), we have

\[ A_h\xi\cdot\xi = \sum_{i,j=1}^{N_h} a(e_j, e_i)\,\xi_j\xi_i = a\Big( \sum_{j=1}^{N_h} \xi_j e_j,\ \sum_{i=1}^{N_h} \xi_i e_i \Big) \ge \kappa \Big\| \sum_{j=1}^{N_h} \xi_j e_j \Big\|_V^2 = \kappa|\xi|^2. \]

This inequality proves that \(\mathrm{Ker}(A_h) = \{0\}\).

NOTE 2.8.– In the case where the form \(a\) is symmetric, \(u\) is characterized as the solution of the following minimization problem [GOU 11, corollary 5.27]:

\[ \frac{1}{2}a(u,u) - \langle f|u\rangle_H = \inf\Big\{ \frac{1}{2}a(v,v) - \langle f|v\rangle_H,\ v \in V \Big\}. \]

Likewise, the approximate solution \(u_h\) is a solution of the same minimization problem set on \(V_h\), which defines \(U = (u_1, \dots, u_{N_h})\) as a minimizer over \(\mathbb{R}^{N_h}\) of \(x \mapsto \frac{1}{2}A_h x\cdot x - b\cdot x\). In this symmetric case, \(a\) defines an inner product on \(V\), and \(u_h\) can be interpreted as the orthogonal projection of \(u\) onto the subspace \(V_h\) for the inner product associated with \(a\). Thus, the matrix \(A_h\) is symmetric and positive definite.

By construction, we directly obtain a stability property.

LEMMA 2.6.– The sequence of approximated solutions \((u_h)_{h>0}\) is bounded in \(V\), independently of \(h\).

PROOF.– We use \(v = u_h \in V_h\) in the variational formulation of the approximated problem. The continuity and coerciveness of \(a\) thus lead to

\[ \kappa\|u_h\|_V^2 \le a(u_h, u_h) = \langle f|u_h\rangle_H \le \|f\|_H\,\|u_h\|_V, \]

so finally \(\|u_h\|_V \le \|f\|_H/\kappa\) (like the estimate satisfied by the solution \(u\) of the original problem; see theorem 2.5).

The consistency of the method relies on constructing a "good" approximation space \(V_h\). We then have to estimate \(\|u - u_h\|_V\): the error tends to 0 as \(h \to 0\) if the spaces \(V_h\) "resemble \(V\)" when \(h \to 0\) and \(\lim_{h\to 0}\dim(V_h) = \infty\).

DEFINITION 2.4.– We say that the approximation method converges if we have \(\lim_{h\to 0}\|u - u_h\|_V = 0\). The method is of order \(k\) if there exists \(C > 0\) such that

\[ \|u - u_h\|_V \le C h^k. \]

The key tool for analyzing these methods is

LEMMA 2.7 (Céa's lemma).– There exists a constant \(C > 0\) that depends only on the bilinear form \(a\) such that

\[ \|u - u_h\|_V \le C \inf\big\{ \|u - v_h\|_V,\ v_h \in V_h \big\} = C\,\|u - P_h u\|_V = C\,\mathrm{dist}(u, V_h), \]

where \(P_h u\) denotes the orthogonal projection of \(u\) on \(V_h\) (for the inner product \(\langle\cdot|\cdot\rangle_V\)).


PROOF.– Let \(v_h \in V_h\). Then \(u_h - v_h \in V_h \subset V\), so that

\[ a(u - u_h, u_h - v_h) = a(u, u_h - v_h) - a(u_h, u_h - v_h) = \langle f|u_h - v_h\rangle - \langle f|u_h - v_h\rangle = 0, \]

and therefore

\[ a(u - u_h, u - u_h) = a(u - u_h, u - v_h) + a(u - u_h, v_h - u_h) = a(u - u_h, u - v_h). \]

It follows that

\[ \kappa\|u - u_h\|_V^2 \le a(u - u_h, u - u_h) = a(u - u_h, u - v_h) \le K\,\|u - u_h\|_V\,\|u - v_h\|_V. \]

We conclude, with \(C = K/\kappa\), by taking the infimum over \(v_h \in V_h\).

This result indeed expresses a consistency property: the error \(\|u - u_h\|_V\) is bounded above by \(C\) times the distance from \(u\) to the point of \(V_h\) that is closest to \(u\). Furthermore, note that the property \(V_h \subset V\) is essential in this argument; we say that the approximation space is conforming. When \(a\) is symmetric, we can improve the estimate, because in this case

\[ a(u - v_h, u - v_h) = a(u - u_h, u - u_h) + a(u_h - v_h, u_h - v_h) + 2\underbrace{a(u - u_h, u_h - v_h)}_{=0 \text{ since } u_h - v_h \in V_h \subset V} \ge a(u - u_h, u - u_h). \]

This proves that \(a(u - u_h, u - u_h) = \inf\{a(u - v_h, u - v_h),\ v_h \in V_h\}\), and we are therefore led to the estimate \(\kappa\|u - u_h\|_V^2 \le K\|u - v_h\|_V^2\) for all \(v_h \in V_h\). In fact, in the symmetric case, as mentioned above, \(u_h\) is the orthogonal projection of \(u\) on \(V_h\) for the inner product associated with \(a\).

The convergence of the method is therefore ensured by the following statement.

THEOREM 2.11.– We assume that there exist

– a subspace \(\mathcal{V} \subset V\), which is dense in \(V\);
– a mapping \(r_h : \mathcal{V} \to V_h\) such that for all \(v \in \mathcal{V}\), we have \(\lim_{h\to 0}\|v - r_h(v)\|_V = 0\).

Then, the variational approximation method converges.


PROOF.– Let \(u \in V\) and \(\varepsilon > 0\). By density, there exists \(v_\varepsilon \in \mathcal{V}\) such that \(\|u - v_\varepsilon\|_V \le \varepsilon\). Moreover, \(\varepsilon\) being fixed, there exists \(h(\varepsilon) > 0\) such that if \(0 < h < h(\varepsilon)\), then \(\|v_\varepsilon - r_h(v_\varepsilon)\|_V \le \varepsilon\). It follows that

\[ \|u - u_h\|_V \le C \inf\big\{ \|u - v_h\|_V,\ v_h \in V_h \big\} \le C\,\|u - r_h(v_\varepsilon)\|_V
\le C\big( \|u - v_\varepsilon\|_V + \|v_\varepsilon - r_h(v_\varepsilon)\|_V \big) \le 2C\varepsilon \]

provided that \(0 < h < h(\varepsilon)\).

Finally, we introduce the adjoint form \(a^\star(u, v) = a(v, u)\), which also satisfies the assumptions of the Lax–Milgram theorem. As a result, for all \(\phi \in H\), there exists a unique element \(v_\phi \in V\) which, for all \(v \in V\), satisfies \(a^\star(v_\phi, v) = \langle \phi|v\rangle_H\).

LEMMA 2.8 (Aubin–Nitsche lemma).– The following relation is satisfied:

\[ |u - u_h|_H \le M \Big( \sup_{\phi \in H\setminus\{0\}} \frac{1}{|\phi|_H} \inf_{v_h \in V_h} \|v_\phi - v_h\|_V \Big)\,\|u - u_h\|_V. \]

PROOF.– We have

\[ |u - u_h|_H = \sup_{\phi \in H\setminus\{0\}} \frac{\langle u - u_h, \phi\rangle_H}{|\phi|_H}
= \sup_{\phi \in H\setminus\{0\}} \frac{a^\star(v_\phi, u - u_h)}{|\phi|_H} \quad (\text{since } u - u_h \in V) \]
\[ = \sup_{\phi \in H\setminus\{0\}} \frac{a(u - u_h, v_\phi)}{|\phi|_H}
= \sup_{\phi \in H\setminus\{0\}} \frac{a(u - u_h, v_\phi - v_h)}{|\phi|_H} \quad (\text{for all } v_h \in V_h \subset V,\ \text{since } a(u, v_h) = a(u_h, v_h)) \]
\[ \le M \sup_{\phi \in H\setminus\{0\}} \frac{\|u - u_h\|_V\,\|v_\phi - v_h\|_V}{|\phi|_H} \]

by continuity of \(a\). This relation applies for all \(v_h \in V_h\). In particular, it applies to \(P_h v_\phi\), the orthogonal projection of \(v_\phi\) onto \(V_h\), which satisfies

\[ \|v_\phi - P_h v_\phi\|_V = \inf\big\{ \|v_\phi - v_h\|_V,\ v_h \in V_h \big\}. \]


This statement allows us to improve the error estimate, at the price of considering a weaker norm (a "small" \(L^2\) norm instead of the "large" \(H^1\) norm). Indeed, if we manage to show that \(\|u - u_h\|_V \le C\|f\|_H\,h\), then an estimate of the same kind applies to the adjoint problem, for which we can also define a variational approximation: \(\|v_\phi - v_{\phi,h}\|_V \le C|\phi|_H\,h\). Now, we have \(\inf\{\|v_\phi - v_h\|_V,\ v_h \in V_h\} \le \|v_\phi - v_{\phi,h}\|_V\). It follows that

\[ |u - u_h|_H \le M \Big( \sup_{\phi \in H\setminus\{0\}} \frac{\|v_\phi - v_{\phi,h}\|_V}{|\phi|_H} \Big)\,\|u - u_h\|_V \le C\,\|f\|_H\, h^2. \]
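The gain promised by the Aubin–Nitsche lemma is visible in a computation. The sketch below (NumPy; P1 elements for -u'' = π²sin(πx) with exact solution sin(πx); all function names are ours) measures the L² and H¹ errors under one refinement: halving h should divide the first by roughly 4 (order h²) and the second by roughly 2 (order h).

```python
import numpy as np

def p1_solve(n):
    """P1 Galerkin solution of -u'' = pi^2 sin(pi x), u(0) = u(1) = 0,
    on n uniform elements (2-point Gauss rule for the load vector)."""
    h = 1.0 / n
    nodes = np.linspace(0.0, 1.0, n + 1)
    K = (2.0 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)) / h
    b = np.zeros(n - 1)
    for e in range(n):
        for t in (0.5 - 0.5 / np.sqrt(3.0), 0.5 + 0.5 / np.sqrt(3.0)):
            xq, wq = nodes[e] + t * h, 0.5 * h
            fq = np.pi ** 2 * np.sin(np.pi * xq)
            if e >= 1:
                b[e - 1] += wq * fq * (1.0 - t)   # left hat function of element e
            if e <= n - 2:
                b[e] += wq * fq * t               # right hat function of element e
    return nodes, np.concatenate(([0.0], np.linalg.solve(K, b), [0.0]))

def fe_errors(n):
    """Approximate L2 and H1-seminorm errors against the exact solution."""
    nodes, uh = p1_solve(n)
    xs = np.linspace(0.0, 1.0, 20001)
    dx = xs[1] - xs[0]
    diff = np.interp(xs, nodes, uh) - np.sin(np.pi * xs)
    l2 = np.sqrt(np.sum(diff ** 2) * dx)
    slopes = np.diff(uh) / np.diff(nodes)
    cell = np.minimum((xs[:-1] * n).astype(int), n - 1)
    dslope = slopes[cell] - np.pi * np.cos(np.pi * xs[:-1])
    h1 = np.sqrt(np.sum(dslope ** 2) * dx)
    return l2, h1

l2_c, h1_c = fe_errors(8)
l2_f, h1_f = fe_errors(16)
```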

With finite element methods, we have two ways of reducing the error: refining the grid or increasing the degree of the polynomials that define the approximation locally. The effectiveness of these two "cursors" can depend on the application considered; see, for example, the comments in [AIN 04].

2.5. Numerical comparison of FD, FV and FE methods

We present a series of numerical tests in which the performance of the methods is evaluated in cases where we know an explicit solution of the problem

\[ -\frac{d}{dx}\Big( k(x)\frac{d}{dx}u \Big) = f(x), \quad x \in ]0,1[, \qquad u(0) = 0 = u(1). \tag{2.31} \]

– Example 1: constant coefficient. \(k(x) = 1\), \(f(x) = e^x\). The exact solution is \(u_{exact}(x) = (e-1)x - e^x + 1\).

– Example 2: variable coefficient. \(k(x) = 1/\cosh(\sin(\pi x))\), \(f(x) = \pi^2\sin(\pi x)\). The exact solution is \(u_{exact}(x) = \sinh(\sin(\pi x))\).

– Example 3: discontinuous coefficient. \(k(x) = \mathbf{1}_{0\le x\le 1/4} + \mathbf{1}_{3/4\le x\le 1} + 2\cdot\mathbf{1}_{1/4\le x\le 3/4}\), \(f(x) = 1\). The exact solution is

\[ u_{exact}(x) = \frac{x(1-x)}{2}\big( \mathbf{1}_{0\le x\le 1/4} + \mathbf{1}_{3/4\le x\le 1} \big) + \Big( \frac{3}{64} + \frac{x(1-x)}{4} \Big)\mathbf{1}_{1/4\le x\le 3/4}. \]

We compare the methods on grids with constant steps and on grids constructed with randomly chosen steps (except for the boundary points, the \(x_i\) are drawn according to a uniform law on \([0,1]\) and then ordered). For all methods, in the case of a uniform grid with continuous coefficients (Examples 1 and 2), we observe convergence of order 2 (see Figures 2.13 and 2.14). This is also the case for a random grid (see Figures 2.15 and 2.18). With the P2 method, we improve the order of convergence, as shown in Figure 2.20. For discontinuous data, as in Example 3, the solution is no longer \(C^1\) and the estimates are no longer justified by the developments presented in the consistency analysis. In practice, we observe a decrease in the order of convergence (see Figures 2.16, 2.17 and 2.19). This decrease is compensated, especially on a uniform grid, by defining the fluxes using the harmonic mean of the coefficients on both sides of the interface (see Figures 2.16 and 2.17).
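The exact solutions above are easy to double-check before using them as references. A sketch (the difference step and tolerances are arbitrary choices of ours; for Example 1, the additive constant 1 is what makes the boundary conditions u(0) = 0 = u(1) hold):

```python
import numpy as np

uex1 = lambda x: (np.e - 1.0) * x - np.exp(x) + 1.0
f1 = lambda x: np.exp(x)
k2 = lambda x: 1.0 / np.cosh(np.sin(np.pi * x))
uex2 = lambda x: np.sinh(np.sin(np.pi * x))
f2 = lambda x: np.pi ** 2 * np.sin(np.pi * x)

def residual(k, u, f, x, h=1e-5):
    """-(k u')'(x) - f(x), evaluated by centered differences."""
    flux = lambda y: k(y) * (u(y + h) - u(y - h)) / (2.0 * h)
    return -(flux(x + h) - flux(x - h)) / (2.0 * h) - f(x)

xs = np.linspace(0.1, 0.9, 17)
res1 = np.max(np.abs(residual(lambda x: 1.0 + 0.0 * x, uex1, f1, xs)))
res2 = np.max(np.abs(residual(k2, uex2, f2, xs)))

# Example 3: continuity of u and of the flux k u' at the interface x = 1/4
u_out = lambda x: x * (1.0 - x) / 2.0                 # regions where k = 1
u_mid = lambda x: 3.0 / 64.0 + x * (1.0 - x) / 4.0    # region where k = 2
jump_u = u_out(0.25) - u_mid(0.25)
jump_flux = 1.0 * (1.0 - 2.0 * 0.25) / 2.0 - 2.0 * (1.0 - 2.0 * 0.25) / 4.0
```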


Figure 2.13. Example 1: numerical and exact solutions; error curve with L∞ norm for the FD method on a regular grid (log–log scale, line with slope 2). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

2.6. Spectral methods

We present another method for solving problem [2.2], which is still understood as a projection onto a finite-dimensional space. We limit ourselves in this presentation to the case where λ = 0. This method uses the existence of Hilbertian bases of H¹₀(]0, 1[), the functional space that we identified for the analysis of the problem. Let {e_n, n ∈ N} be a family of functions on [0, 1], such that e_n ∈ H¹(]0, 1[), e_n(0) = 0 = e_n(1),

\[ \int_0^1 e_n(x)\, e_m(x)\, dx + \int_0^1 \frac{d}{dx}e_n(x)\, \frac{d}{dx}e_m(x)\, dx = \delta_{nm}, \]

and for all f ∈ H¹₀(]0, 1[), we have

\[ \lim_{N \to \infty} \Big\| f - \sum_{n=0}^{N} \eta_n e_n \Big\|_{H^1_0} = 0, \qquad \eta_n = \int_0^1 \Big( f(x)\, e_n(x) + \frac{d}{dx}f(x)\, \frac{d}{dx}e_n(x) \Big)\, dx. \]


Figure 2.14. Example 2: FV, FD numerical and exact solutions; error curve for L∞ norm for FD (x) and FV (o) methods on a regular grid (log–log scale, line with slope 2). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.15. Example 2: exact, numerical FV and modified FV solutions; error curve in L∞ norm for FV (o) and modified FV (x) on a random grid (log–log scale, lines with slope 1 [green] and 2 [blue]). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.16. Example 3: exact, numerical FV and modified FV solutions; error curve in L∞ norm for FV (o) and modified FV (x) on a regular grid (log–log scale, lines with slope 1 [green] and 2 [blue]). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.17. Example 3: exact, numerical FV and modified FV solutions; error curve in L∞ norm for FV (o) and modified FV (x) on a regular grid (log–log scale, lines with slope 1 [green] and 2 [blue]). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.18. Example 2: error curve in L∞ norm for the FE method on a random grid (log–log scale, line with slope 2). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.19. Example 3: error curve in L∞ norm for the FE method on a regular grid (log–log scale, lines with slopes 1 [green] and 2 [blue]). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

We construct an approximation of the form u^(N)(x) = Σ_{j=0}^{N} u_j^(N) e_j(x), defining the coefficients u_j^(N) as solutions to the linear system

\[ \sum_{j=0}^{N} \int_0^1 k(x)\, u_j^{(N)}\, \frac{d}{dx}e_j(x)\, \frac{d}{dx}e_k(x)\, dx = \int_0^1 f(x)\, e_k(x)\, dx \quad \text{for all } k \in \{1, \dots, N\}. \]

Therefore, u^(N) is the projection of u onto V^(N) = Span{x ↦ e_n(x), n ∈ {1, ..., N}} for the scalar product defined by

\[ a(u, v) = \int_0^1 k(x)\, \frac{d}{dx}u(x)\, \frac{d}{dx}v(x)\, dx. \]

Note that (u, v) ↦ a(u, v) is well defined on the set C = {u ∈ C¹([0, 1]), u(0) = 0 = u(1)}; it is a symmetric bilinear form on that set. Moreover, the quadratic form u ↦ a(u, u) is positive definite on C. We therefore have a pre-Hilbertian space. Finally, V^(N) is a finite-dimensional subspace, so it is complete for the norm defined by a. We have a(u, e_k) = ∫₀¹ f(x) e_k(x) dx, so a(u^(N) − u, v) = 0 for all v ∈ V^(N), which, indeed, characterizes u^(N) as the


orthogonal projection of u onto V^(N) for a(·, ·). The value of this approach is shown in the following theorem.

THEOREM 2.12.– For all f ∈ L²(]0, 1[), when N → ∞, u^(N) converges to u, the solution of [2.2], in H¹₀(]0, 1[).

PROOF.– By definition, we have

\[ a(u^{(N)} - u, u^{(N)} - u) = \inf_{v \in V^{(N)}} a(v - u, v - u) \le a(\varphi - u, \varphi - u) \le K \|\partial_x(\varphi - u)\|_{L^2}^2, \]

which can be dominated by K‖φ − u‖²_{H¹₀}, for all φ ∈ V^(N). In particular, this is true for x ↦ φ(x) = Σ_{n=1}^{N} u_n e_n(x), the element of V^(N) whose N coefficients are given by the Fourier coefficients of u in the basis {e_n, n ∈ N}. In other words, φ is the orthogonal projection of u onto V^(N) for the inner product of H¹₀, and no longer for the product associated with the form a. We can infer that lim_{N→∞} a(u^(N) − u, u^(N) − u) = 0. The relations

\[ \kappa \Big\| \frac{d}{dx}u^{(N)} - \frac{d}{dx}u \Big\|_{L^2}^2 \le a(u^{(N)} - u, u^{(N)} - u) \]

and

\[ |u^{(N)}(x) - u(x)| = \Big| \int_0^x \Big( \frac{d}{dx}u^{(N)}(y) - \frac{d}{dx}u(y) \Big)\, dy \Big| \le \Big\| \frac{d}{dx}u^{(N)} - \frac{d}{dx}u \Big\|_{L^2} \]

allow us to conclude that u^(N) converges uniformly to u on [0, 1] and in H¹₀(]0, 1[). (For higher dimensions, we can only show convergence in H¹₀(Ω) using this argument.) Note that the convergence speed is controlled by the approximation of u by its Nth-order Fourier expansion. □

One particular case, which is adapted to problem [2.2], consists of taking e_n(x) = sin(nπx). To ensure that these functions form a basis of L²(]0, 1[), we extend the functions of L²(]0, 1[) to ]−1, 1[ by oddness, and then to all of R by 2-periodicity. We thus use the fact that {(1/√2) e^{iπnx}, n ∈ Z} is a Hilbertian basis of L²#(]−1, 1[) (see [GOU 11, section 5.4]). The Fourier expansion of an odd function f ∈ L²#(]−1, 1[) is, in fact, written as

\[ f(t) = \sum_{n=0}^{\infty} f_n \sin(n\pi t), \qquad f_n = 2 \int_0^1 f(x) \sin(n\pi x)\, dx. \]

We have seen elsewhere that these functions e_n correspond to eigenfunctions in H¹₀(]0, 1[) of the operator −d²/dx². These comments motivate the search for an


approximated solution to the problem [2.2] in the form of a trigonometric polynomial u^(N)(x) = Σ_{j=1}^{N} u_j^(N) sin(jπx). The numerical unknown U^(N) = (u_1^(N), ..., u_N^(N)) ∈ R^N is obtained as a solution to the linear system

\[ \begin{cases} A^{(N)} U^{(N)} = S^{(N)}, \\[4pt] A^{(N)}_{j\ell} = 2\pi^2 j\ell \displaystyle\int_0^1 k(x) \cos(\pi j x) \cos(\pi \ell x)\, dx, \quad j, \ell \in \{1, \dots, N\}, \\[4pt] S^{(N)}_j = 2 \displaystyle\int_0^1 \sin(\pi j x)\, f(x)\, dx, \quad j \in \{1, \dots, N\}. \end{cases} \qquad [2.32] \]
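The assembly of [2.32] can be sketched as follows. The trapezoid quadrature rule and the validation data are our own illustrative choices (the text discusses generic quadrature weights ω_λ later on).

```python
import numpy as np

def spectral_solve(k, f, N, Q=256):
    """Assemble the system [2.32] with a trapezoid quadrature rule on
    Q+1 points, then solve for the sine coefficients u_j, j = 1..N."""
    xq = np.linspace(0.0, 1.0, Q + 1)
    w = np.full(Q + 1, 1.0 / Q)
    w[0] = w[-1] = 0.5 / Q                       # trapezoid weights
    j = np.arange(1, N + 1)
    C = np.cos(np.pi * np.outer(j, xq))          # C[j-1, q] = cos(pi j x_q)
    S = np.sin(np.pi * np.outer(j, xq))
    A = 2.0 * np.pi**2 * np.outer(j, j) * ((C * (w * k(xq))) @ C.T)
    b = 2.0 * (S * w) @ f(xq)
    return np.linalg.solve(A, b)

# Validation with k = 1 and f = pi^2 sin(pi x): the exact solution of
# [2.31] is u(x) = sin(pi x), so u_1 = 1 and u_j = 0 for j > 1.
u = spectral_solve(lambda x: np.ones_like(x),
                   lambda x: np.pi**2 * np.sin(np.pi * x), N=8)
```

With a variable k, the matrix is no longer diagonal and the quadrature error enters the coefficients, as discussed below.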


Figure 2.20. Example 1: error curve in L∞ norm for the FE-P2 method on a uniform grid (log–log scale, line with slope 3). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

The matrix A^(N) is symmetric. We verify that it is positive definite: for any vector ξ ∈ R^N \ {0}, we have

\[ A^{(N)}\xi \cdot \xi = \sum_{j,\ell=1}^{N} 2\pi^2 j\ell \int_0^1 k(x) \cos(\pi j x) \cos(\pi \ell x)\, \xi_j \xi_\ell\, dx = 2\pi^2 \int_0^1 k(x) \Big( \sum_{j=1}^{N} j \cos(\pi j x)\, \xi_j \Big)^2 dx > 0, \]


because the functions x ↦ j cos(πjx) form a linearly independent family in L²(]0, 1[). The general analysis, presented above, shows that the convergence speed is controlled by

\[ 2 \int_0^1 \Big| \frac{d}{dx}u(x) - \sum_{n=1}^{N} n\pi\, u_n \cos(n\pi x) \Big|^2 dx. \]

In other words, the convergence speed is governed by how well u is approached by its Fourier cosine series. We know that the quality of this approximation depends on the regularity of u: according to the regularity of the data f and the coefficient k, if we can show that the solution u is regular (say, of class C^k for some k > 0), then the approximation will be better. This method thus seems extremely attractive. However, when compared with other approximation methods, in practice, its efficiency is affected by the following facts:

– The matrix A^(N) is, in general, “full”: it has few zeros¹⁰, unlike matrices resulting from discretization through finite differences, finite volumes or finite elements, which are very sparse and also have a very specific structure (tridiagonal matrices). The sparse structure can be used to radically improve the efficiency of algorithms for solving the underlying systems, by avoiding numerous unnecessary operations. Here, we cannot benefit from such improvements.

– Computing the coefficients of the matrix A^(N) and of the right-hand side requires evaluating integrals. In general, these cannot be computed exactly, and it is necessary to resort to integral approximation methods. Therefore, we do not actually solve A^(N) U^(N) = S^(N), but rather Ã^(N) Ũ^(N) = S̃^(N), where the coefficients of Ã^(N) and S̃^(N) are obtained using formulas of the type

\[ \tilde{A}^{(N)}_{j\ell} = \sum_{\lambda=1}^{\Lambda_h} 2\pi^2 j\ell\, \omega_\lambda\, k(\lambda h) \cos(\pi j \lambda h) \cos(\pi \ell \lambda h), \qquad \tilde{S}^{(N)}_j = \sum_{\lambda=1}^{\Lambda_h} 2\omega_\lambda \sin(\pi j \lambda h)\, f(\lambda h), \]

for a given parameter h > 0. However, the matrix A^(N) is, in general, ill-conditioned: errors due to the approximation made on the coefficients, even though they may remain relatively small, can thus produce a significant error, and Ũ^(N) becomes notably different from the desired solution U^(N).

We can test this method with two types of data:

10 Unless we can identify a specific basis formed by orthogonal eigenvectors for the operator −∂_x(k(x)∂_x ·) along with Dirichlet conditions; in that case, the matrix A^(N) would be diagonal.


– Example 1: regular data. k(x) = 2 + cos(5.3x), f(x) = 100 × (ln(13/4) − ln(3 + (x − 1/2)²)).
– Example 2: discontinuous data. k(x) = 1_{0≤x≤1/2} + 4 × 1_{1/2<x≤1}.

[…]

For r > 0, we note that

\[ \Big( \partial_r^2 + \frac{2}{r}\partial_r \Big) \frac{1}{r} = \frac{2}{r^3} - \frac{2}{r^3} = 0. \]

Finally, we have ∇(1/|x|) = −x/|x|³. It follows that

\[ \int \frac{1}{|x|}\, \Delta\varphi(x)\, dx = \lim_{\epsilon \to 0} \Big( -\frac{1}{\epsilon^2} \int_{S^2_\epsilon} \varphi(\omega)\, d\omega - \frac{1}{\epsilon} \int_{S^2_\epsilon} \nabla\varphi(\omega) \cdot \frac{\omega}{\epsilon}\, d\omega \Big) = -|S^2|\varphi(0) = -4\pi\varphi(0). \]

As a result, if we associate the charge potential Φ defined by

\[ \Phi(x) = \frac{1}{4\pi} \int \frac{\rho(y)}{|x - y|}\, dy \]

with a charge density ρ, then Φ satisfies −ΔΦ = ρ. In what follows, we will assume that the charge distribution is itself influenced by the potential: the Boltzmann distribution

\[ \rho(x) = e^{-\Phi(x)}, \]


describes a situation where the charges are in an electrostatic equilibrium state. As a result, Φ is a solution of the nonlinear problem

\[ -\Delta\Phi = e^{-\Phi}. \]

We complete the problem with Dirichlet boundary conditions on the domain: Φ|_∂Ω = 0. Energy considerations allow us to see this problem as related to minimizing the following energy functional:

\[ \mathscr{J}(\Phi) = \int_\Omega \Big( \frac{1}{2}|\nabla\Phi|^2 + e^{-\Phi} \Big)\, dx. \]

We will study this problem from a discrete point of view. We use a finite element discretization, whose principles are summarized as follows:

– we use the variational formulation of the problem:

\[ \int_\Omega \nabla\Phi(x) \cdot \nabla\psi(x)\, dx = \int_\Omega e^{-\Phi(x)} \psi(x)\, dx, \]

which is satisfied for all test functions ψ;

– given {ψ_1, ..., ψ_{N_h}}, a basis of the chosen finite element space (for example, the “hat functions” of the P1 approximation), Φ is approximated by

\[ \Phi_h(x) = \sum_{n=1}^{N_h} \Phi_n \psi_n(x). \]

We let A denote the matrix corresponding to the discretization of −Δ, that is to say,

\[ A_{i,j} = \int_\Omega \nabla_x \psi_j(x) \cdot \nabla_x \psi_i(x)\, dx. \]

We introduce the mapping

\[ \mathscr{E} : (\Phi_1, \dots, \Phi_{N_h}) \in \mathbb{R}^{N_h} \longmapsto \Big( \int_\Omega \exp\Big( -\sum_{n=1}^{N_h} \Phi_n \psi_n(x) \Big) \psi_k(x)\, dx,\ k \in \{1, \dots, N_h\} \Big) \in \mathbb{R}^{N_h}. \]

The Poisson–Boltzmann equation takes on the discrete form AΦ = E(Φ).


The energy criterion allows us to consider the functional

\[ J : \Phi \in \mathbb{R}^{N_h} \longmapsto J(\Phi) = \frac{1}{2} A\Phi \cdot \Phi + b(\Phi), \qquad b(\Phi) = \int_\Omega \exp\Big( -\sum_{n=1}^{N_h} \Phi_n \psi_n(x) \Big)\, dx. \]

PROPOSITION 2.7.– The functional J is C² and there exists an α > 0 such that J is α-convex on R^{N_h}. It has a unique minimizer Φ⋆ that satisfies AΦ⋆ = E(Φ⋆).

PROOF.– Let us first suppose that J is α-convex, that is to say, for all Φ, Ψ ∈ R^{N_h}, we have

\[ J(\Phi) \ge J(\Psi) + \nabla J(\Psi) \cdot (\Phi - \Psi) + \alpha \|\Phi - \Psi\|^2, \]

and let us begin by showing that J has a unique minimum. Let (Φ^(n))_{n∈N} be a minimizing sequence, that is to say, such that lim_{n→∞} J(Φ^(n)) = inf{J(Ψ), Ψ ∈ R^{N_h}}. In particular, we have

\[ J(\Phi^{(n)}) \ge J(\Phi^{(0)}) + \nabla J(\Phi^{(0)}) \cdot (\Phi^{(n)} - \Phi^{(0)}) + \alpha \|\Phi^{(n)} - \Phi^{(0)}\|^2 \ge J(\Phi^{(0)}) + \|\Phi^{(n)} - \Phi^{(0)}\| \big( \alpha \|\Phi^{(n)} - \Phi^{(0)}\| - \|\nabla J(\Phi^{(0)})\| \big). \]

Since the left-hand term is bounded, it follows from an argument by contradiction that the sequence (Φ^(n))_{n∈N} is bounded. By the Bolzano–Weierstrass theorem, we can extract a subsequence (Φ^(n_k))_{k∈N} that converges: lim_{k→∞} Φ^(n_k) = Φ⋆. Then, by continuity, we obtain lim_{n→∞} J(Φ^(n)) = J(Φ⋆) = inf{J(Ψ), Ψ ∈ R^{N_h}}. This justifies the existence of a minimizer. Note that we also have ∇J(Φ⋆) = 0. Indeed, for all t > 0 and all η ∈ R^{N_h}, we have J(Φ⋆ + tη) ≥ J(Φ⋆). It follows that (J(Φ⋆ + tη) − J(Φ⋆))/t ≥ 0. Making t tend to 0, we see that ∇J(Φ⋆) · η ≥ 0 for all η ∈ R^{N_h}. This inequality is thus satisfied for both η and −η for all η ∈ R^{N_h}, which therefore implies that ∇J(Φ⋆) = 0. Finally, if we suppose that Φ⋆ and Ψ are both minimizers, then we obtain

\[ J(\Psi) \ge J(\Phi^\star) + \nabla J(\Phi^\star) \cdot (\Psi - \Phi^\star) + \alpha \|\Psi - \Phi^\star\|^2 = J(\Phi^\star) + \alpha \|\Psi - \Phi^\star\|^2. \]


Now, J(Ψ) = J(Φ⋆), which implies Φ⋆ − Ψ = 0: the minimizer is thus unique. To show that J is α-convex, we first calculate its gradient

\[ \nabla J(\Phi) \cdot \eta = A\Phi \cdot \eta - \sum_{k=1}^{N_h} \int_\Omega \exp\Big( -\sum_{n=1}^{N_h} \Phi_n \psi_n(x) \Big) \psi_k(x)\, \eta_k\, dx, \]

and then its Hessian matrix

\[ D^2 J(\Phi)\eta^{(1)} \cdot \eta^{(2)} = A\eta^{(1)} \cdot \eta^{(2)} + \sum_{k,\ell=1}^{N_h} \int_\Omega \exp\Big( -\sum_{n=1}^{N_h} \Phi_n \psi_n(x) \Big) \psi_k(x)\, \eta^{(1)}_k\, \psi_\ell(x)\, \eta^{(2)}_\ell\, dx. \]

It follows that

\[ D^2 J(\Phi)\eta \cdot \eta = A\eta \cdot \eta + \int_\Omega \exp\Big( -\sum_{n=1}^{N_h} \Phi_n \psi_n(x) \Big) \Big( \sum_{k=1}^{N_h} \psi_k(x)\, \eta_k \Big)^2 dx \ge A\eta \cdot \eta. \]

The matrix A is positive definite; the functional J is therefore α-convex, with α equal to the smallest eigenvalue of A. Finally, we note that with the above formulas, the condition ∇J(Φ⋆) = 0 leads to

\[ \big( A\Phi^\star - \mathscr{E}(\Phi^\star) \big) \cdot \eta = 0. \qquad \square \]

These observations motivate the search for a solution using a “descent algorithm”: it is an iterative method where, at each step, we follow the direction of steepest descent, which is that of the gradient. The gradient algorithm proceeds as follows. Let Φ^(0) ∈ R^{N_h} and ρ > 0 be fixed. We construct the sequence

\[ \Phi^{(m+1)} = \Phi^{(m)} - \rho \nabla J(\Phi^{(m)}). \qquad [2.33] \]

THEOREM 2.13.– Let J be a C², α-convex functional such that ∇J is L-Lipschitz. Then, for all 0 < ρ < 2α/L, the sequence (Φ^(m))_{m∈N} defined by [2.33] converges to Φ⋆, the unique minimizer of J on R^{N_h}.

PROOF.– Recall that ∇J(Φ⋆) = 0 because Φ⋆ is a minimizer. We write δ^(m) = Φ^(m) − Φ⋆. We therefore have δ^(m+1) = δ^(m) − ρ(∇J(Φ^(m)) − ∇J(Φ⋆)). Now, J is α-convex, which means that for all a, b ∈ R^{N_h}, we have (∇J(a) − ∇J(b)) · (a − b) ≥ α|a − b|². It follows that

\[ |\delta^{(m+1)}|^2 = \big| \delta^{(m)} - \rho\big( \nabla J(\Phi^{(m)}) - \nabla J(\Phi^\star) \big) \big|^2 = |\delta^{(m)}|^2 + \rho^2 \big| \nabla J(\Phi^{(m)}) - \nabla J(\Phi^\star) \big|^2 - 2\rho \big( \nabla J(\Phi^{(m)}) - \nabla J(\Phi^\star) \big) \cdot \delta^{(m)} \le |\delta^{(m)}|^2 \big( 1 - 2\alpha\rho + L\rho^2 \big). \]

The condition 0 < ρ < 2α/L ensures that 0 < 1 − 2αρ + Lρ² < 1, from which we conclude that lim_{m→∞} |δ^(m)| = 0. □
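The contraction mechanism of this proof can be checked numerically on a simple quadratic functional; the diagonal matrix, the step size and the tolerances below are illustrative choices.

```python
import numpy as np

# Quadratic J(x) = 1/2 D x . x with D = diag(alpha, ..., L):
# J is alpha-convex and grad J(x) = D x is L-Lipschitz.
alpha, L = 0.5, 4.0
d = np.array([alpha, 1.0, 2.5, L])
rho = 0.9 * 2 * alpha / L              # step below the 2*alpha/L threshold

x = np.array([1.0, -2.0, 0.5, 3.0])    # arbitrary start; the minimizer is 0
c = np.sqrt(1.0 - 2 * alpha * rho + L * rho**2)   # contraction factor of the proof
for _ in range(200):
    x_new = x - rho * (d * x)          # gradient step [2.33]
    # each step contracts the error by at least the factor c
    assert np.linalg.norm(x_new) <= c * np.linalg.norm(x) + 1e-12
    x = x_new
print(np.linalg.norm(x))               # close to 0
```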


Figure 2.29. Numerical solutions to the Boltzmann–Poisson problem for different space steps
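A one-dimensional sketch of this computation, assuming a uniform P1 grid and a mass-lumped approximation of the integral defining E; the grid size, the descent step ρ and the iteration count are our own choices, not taken from the text.

```python
import numpy as np

# 1D sketch: -Phi'' = exp(-Phi) on ]0,1[, Phi(0) = Phi(1) = 0, P1 elements
# on a uniform grid; E(Phi) is approximated by mass lumping,
# E_k(Phi) ~ h * exp(-Phi_k).
N = 20                                   # number of interior nodes
h = 1.0 / (N + 1)
A = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h   # P1 stiffness

phi = np.zeros(N)                        # initial guess
rho = 0.4 * h                            # below the stability threshold ~ h/2
for _ in range(40000):
    phi -= rho * (A @ phi - h * np.exp(-phi))   # gradient step [2.33]

print(phi.max())   # maximal value around 0.11, consistent with Figure 2.29
```

Note how small the admissible step ρ is: it shrinks with h, which is exactly the practical difficulty discussed below.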

Figure 2.29 shows the numerical solutions obtained using this algorithm, with a P1 method, for different space steps. In practice, the algorithm is rather delicate: if the descent step ρ is too large, the method does not converge and produces outlying results. Moreover, as shown by the convergence analysis and the spectral properties of the Laplace matrix, the descent steps must be made smaller when the grid is finer. This difficulty is also related to the conditioning of the underlying linear system. As a result, it is necessary to perform a very large number of iterations and not be fooled by a solution that remains close to a state different from the desired solution. Therefore, the efficiency of the method is also strongly dependent on the choice of the initial guess.

2.8. Neumann conditions: the optimization perspective

For the problem

\[ -\frac{d}{dx}\Big( k(x) \frac{d}{dx}u(x) \Big) = f(x) \quad \text{for } x \in\, ]0,1[, \qquad \frac{d}{dx}u(0) = 0 = \frac{d}{dx}u(1), \qquad [2.34] \]

we previously made the following remarks:

– if a solution exists, say u₀, it is not unique, because for all c ∈ R, x ↦ u₀(x) + c is also a solution;
– integrating the equation, we realize that the data must satisfy ∫₀¹ f(x) dx = 0 in order for the equation to make sense;
– these difficulties spread to the discrete level (see note 2.3): the corresponding finite difference matrix for a constant coefficient k is written as

\[ A_{\text{Neumann}} = -\frac{k}{h^2} \begin{pmatrix} -1 & 1 & & & 0 \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ 0 & & & 1 & -1 \end{pmatrix} \in \mathscr{M}_N. \]

Note that Ker(A_Neumann) = Span{(1, 1, ..., 1)}. The following statement makes this last observation more precise.

LEMMA 2.9.– The matrix A_Neumann is coercive on (Span{(1, 1, ..., 1)})^⊥: there exists a constant α > 0 such that, for all x ∈ R^N, we have

\[ A_{\text{Neumann}}\, x \cdot x \ge \alpha |x - \bar{x} e|^2, \]

where e = (1, 1, ..., 1) and x̄ = (1/N) Σ_{i=1}^{N} x_i = (x · e)/(e · e).

PROOF.– We have seen that A_Neumann e = 0. Note that

\[ A_{\text{Neumann}}\, x \cdot x = -\frac{k}{h^2} \Big( (x_2 - x_1)x_1 + \sum_{i=2}^{N-1} (x_{i+1} - 2x_i + x_{i-1})\, x_i - (x_N - x_{N-1})\, x_N \Big) \]
\[ = \frac{k}{h^2} \Big( \sum_{i=1}^{N-1} (x_{i+1} - x_i)\, x_{i+1} - \sum_{i=1}^{N-1} (x_{i+1} - x_i)\, x_i \Big) = \frac{k}{h^2} \sum_{i=1}^{N-1} (x_{i+1} - x_i)^2 \ge 0. \]

Moreover, this quantity is zero whenever x is collinear with e. The function y ↦ A_Neumann y · y is continuous and strictly positive on the compact set {|y| = 1} ∩ Span(e)^⊥. So it is bounded below on that set by a strictly positive constant α. It follows that for all y ∈ R^N such that y · e = 0, we have A_Neumann y · y ≥ α|y|². Finally, we conclude with the following relations:

\[ A_{\text{Neumann}}\, x \cdot x = A_{\text{Neumann}}(x - \bar{x}e) \cdot x = A_{\text{Neumann}}(x - \bar{x}e) \cdot (x - \bar{x}e) \ge \alpha |x - \bar{x}e|^2, \]

because A_Neumann is symmetric, so A_Neumann(x − x̄e) · e = 0, and x − x̄e is orthogonal to e. □

As mentioned before, the same difficulty arises for periodic conditions. One way to restore uniqueness is to impose an additional constraint on the solution, for example, searching for the solution with zero mean (that is to say, in a periodic context, with the first Fourier mode equal to zero). We will study this question from a strictly discrete perspective (the extension to the continuous framework relies on the same arguments, but uses a more elaborate functional framework). We reinterpret the problem as a minimization problem with a constraint:

\[ (\mathscr{P}) \qquad \text{Minimize } y \in \mathbb{R}^N \mapsto J(y) = \frac{1}{2} Ay \cdot y - b \cdot y \quad \text{on the set } \mathscr{C} = \{ y \in \mathbb{R}^N,\ y \cdot e = 0 \}, \]

where e represents the vector (1, ..., 1).

THEOREM 2.14.– The problem (P) has at least one solution.


PROOF.– The set C is a convex, closed and non-empty subset of R^N. The functional J is continuous and convex on R^N, so also on C. We will show that it is coercive on C: if (y_n)_{n∈N} is a sequence of elements of C such that lim_{n→∞} |y_n| = ∞, then lim_{n→∞} J(y_n) = ∞. If this statement were false, we could find a subsequence, which we continue to denote (y_n)_{n∈N}, such that (J(y_n))_{n∈N} has a finite limit. We write ȳ_n = y_n/|y_n|. Since this sequence is bounded, at the possible cost of extracting a subsequence, we can assume it has a limit ȳ. In particular, we have ȳ ∈ C and

\[ \frac{J(y_n)}{|y_n|^2} = \frac{1}{2} A\bar{y}_n \cdot \bar{y}_n - \frac{b \cdot \bar{y}_n}{|y_n|} \xrightarrow[n \to \infty]{} 0 = \frac{1}{2} A\bar{y} \cdot \bar{y}. \]

This implies that Aȳ = 0, with ȳ of norm 1 and orthogonal to Span(e) = Ker(A), which is a contradiction.

Let (x_n)_{n∈N} be a minimizing sequence: x_n ∈ C and lim_{n→∞} J(x_n) = inf_{y∈C} J(y). Since J is coercive, this sequence is bounded and we can assume that it has a limit x. By continuity, we obtain x ∈ C and J(x) = inf_{y∈C} J(y). □

We now apply the constrained optimization theorem, which ensures the existence of λ ∈ R, the Lagrange multiplier associated with the constraint, such that ∇J(x) + λe = 0; in other words, the pair (x, λ) satisfies the system

\[ \begin{pmatrix} A & e \\ e^T & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}. \qquad [2.35] \]

We are faced with a linear system of size (N + 1) × (N + 1), which remains symmetric¹¹.

THEOREM 2.15.– The problem [2.35] has a unique solution.

PROOF.– It suffices to show that the associated matrix is injective, that is to say, that the solution of [2.35] with b = 0 is x = 0, λ = 0. Indeed, if (x, λ) satisfies this equation, then Ax · x + λe · x = Ax · x = 0. This implies that x = αe, for a certain α ∈ R. However, x · e = 0 becomes αe · e = 0, which implies α = 0 and therefore x = 0, and then Ax + λe = λe = 0 gives λ = 0.

We note that in [2.35] no compatibility conditions are required on the right-hand side b. In fact, we find it in the multiplier λ: if (x, λ) is a solution of [2.35], then we have Ax · e + λe · e = b · e. However, since A is symmetric, Ax · e = x · Ae = 0,

11 The matrix of the system [2.35] is not, however, positive definite. Indeed, for a symmetric, positive definite matrix M, we have in particular the relation M e_i · e_i = M_ii > 0 for the canonical basis vectors e_1, ..., e_{N+1}, a property that is not satisfied by the matrix of system [2.35].

which implies that λ = (b · e)/(e · e) and Ax = b − ((b · e)/(e · e)) e: the right-hand side of that equation is orthogonal to e. □

Figure 2.30 shows the results obtained using this approach for the problem

\[ -\frac{d^2}{dx^2} u = e^x - \int_0^1 e^y\, dy, \qquad \frac{d}{dx}u(0) = \frac{d}{dx}u(1) = 0. \]

The solution is

\[ u_{\text{exact}}(x) = (e - 1)\frac{x^2}{2} + x - e^x + \frac{5}{6}(e - 1) - \frac{1}{2}. \]

We find only order 1 accuracy. To reach order 2, it is necessary to modify the algorithm to some extent. On the one hand, we modify the treatment of the boundary terms (see note 2.3). On the other hand, this discussion relies on an approximation of the integral on [0, 1] by the rectangle method, which is only of order 1; it is necessary to replace the multiplication by e with an operation corresponding to the trapezoid method approximation (see Appendix 2). These manipulations make the approximation of order 2 (see Figure 2.31).
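The bordered system [2.35] can be sketched as follows for this test problem; the grid convention and the one-sided (order 1) boundary rows are our own illustrative choices.

```python
import numpy as np

def neumann_solve(N):
    """Order 1 sketch: solve -u'' = e^x - (e-1) with u'(0) = u'(1) = 0 on a
    grid of N nodes, enforcing the zero-mean constraint through the bordered
    system [2.35]; returns the max-norm error against the exact solution."""
    h = 1.0 / (N - 1)
    x = np.linspace(0.0, 1.0, N)
    A = (2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h**2
    A[0, 0] = A[-1, -1] = 1.0 / h**2     # one-sided Neumann rows (order 1)
    e = np.ones(N)
    M = np.block([[A, e[:, None]], [e[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([np.exp(x) - (np.e - 1.0), [0.0]])
    sol = np.linalg.solve(M, rhs)
    u = sol[:-1]                         # sol[-1] is the Lagrange multiplier
    u_ex = (np.e - 1.0) * x**2 / 2 + x - np.exp(x) + 5.0 * (np.e - 1.0) / 6 - 0.5
    return np.max(np.abs(u - (u_ex - u_ex.mean())))

print(neumann_solve(100), neumann_solve(200))   # error decreases roughly like h
```

The exact solution has zero mean, so we compare against its grid values shifted to zero discrete mean, matching the constraint imposed by the last row of the bordered system.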


Figure 2.30. Simulation of Neumann problem: order 1 method



2.9. Charge distribution on a cord

We consider a taut cord, attached at its ends. At rest, the cord is in a horizontal position, but it deforms if a charge is applied to it. This charge may cause the cord to rupture at the fixture points if the tension there becomes too large. We therefore ask how best to distribute the charge in order to minimize the effect of the tension. The model that will be discussed can be seen as a simplified version of several engineering and material resistance problems.


Figure 2.31. Simulation of Neumann problem: order 2 method. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

The cord is described as a one-dimensional medium: with each abscissa x ∈ [0, 1], we associate the displacement u(x) ∈ R with respect to the equilibrium position. Since the cord is attached at the ends, we have u(0) = 0 = u(1). The charge is represented by a density per unit length x ↦ f(x). The mechanical properties of the cord are described by a rigidity coefficient x ↦ k(x) > 0, which can be variable (and we will see the role that these variations can play in the optimal charge distribution). We consider an element of the cord between the points x and x + dx. The length of this element at rest is therefore simply dx. Once it has been charged, by the Pythagorean theorem (see Figure 2.32), this element has a length close to


\[ \sqrt{\big( u(x + dx) - u(x) \big)^2 + (x + dx - x)^2}. \]

The energy corresponding to the stretching of the cord is proportional to the increase in length:

\[ E = k(x) \Big( \sqrt{\big( u(x + dx) - u(x) \big)^2 + dx^2} - dx \Big). \]

We assume that the function x ↦ u(x) is regular enough to use the expansion u(x + dx) − u(x) = ∂ₓu(x) dx + dx ε(dx), with lim_{h→0} ε(h) = 0. Then, we have

\[ E = k(x) \Big( \sqrt{\big( \partial_x u(x) + \epsilon(dx) \big)^2 dx^2 + dx^2} - dx \Big), \]

and since we are considering infinitesimal elements 0 < dx ≪ 1, we can replace E with the approximate expression

\[ \widetilde{E} = k(x) \Big( \sqrt{\big( \partial_x u(x) \big)^2 + 1} - 1 \Big)\, dx. \]

Moreover, the potential energy is the opposite of the work of the external force over that displacement, which is given by −f(x)u(x) dx. Note that, since f is a force density per unit length, this quantity has the homogeneity of a “force × length” product, that is to say, mass × length²/time², which is, indeed, the dimension of an energy. The total energy is obtained by summing over all the infinitesimal elements, and we thus write

\[ E_{\text{Tot}}(u) = \int_0^1 \Big( k(x) \Big( \sqrt{\big( \partial_x u(x) \big)^2 + 1} - 1 \Big) - f(x) u(x) \Big)\, dx. \]

Next, we simplify the expression further by assuming that the displacement u and its derivative are small. We resort to the expansion √(1 + y) = 1 + y/2 + y η(y), where lim_{y→0} η(y) = 0. Finally, we focus in this context on the functional

\[ \widetilde{E}_{\text{Tot}}(u) = \frac{1}{2} \int_0^1 k(x) \big( \partial_x u(x) \big)^2\, dx - \int_0^1 f(x) u(x)\, dx. \]

The cord’s position at equilibrium under charge f is thus defined as the function x → u(x), which satisfies the boundary conditions u(0) = 0 and u(1) = 0 that minimize E#Tot .



Figure 2.32. Cord elongation

We assume that the data f is a continuous function, or an element of L²(]0, 1[). We note that Ẽ_Tot(u) is well defined for all u ∈ H¹₀(]0, 1[), a functional space that takes the Dirichlet conditions into account. More precisely, we can rewrite

\[ \widetilde{E}_{\text{Tot}}(u) = \frac{1}{2} a(u, u) - \ell(u), \]

where a is the symmetric bilinear form defined on H¹₀(]0, 1[) × H¹₀(]0, 1[) by

\[ a(u, v) = \int_0^1 k(x)\, \partial_x u(x)\, \partial_x v(x)\, dx, \]

and ℓ is the linear form defined on H¹₀(]0, 1[) by

\[ \ell(u) = \int_0^1 f(x)\, u(x)\, dx. \]

PROPOSITION 2.8.– The function u minimizes Ẽ_Tot on H¹₀(]0, 1[) if and only if, for all v ∈ H¹₀(]0, 1[), we have

\[ a(u, v) = \ell(v). \qquad [2.36] \]

PROOF.– First suppose that u minimizes Ẽ_Tot. Then, for all h ∈ H¹₀(]0, 1[) and all t > 0, we have Ẽ_Tot(u + th) ≥ Ẽ_Tot(u).


Now, by developing a(u + th, u + th) and using the fact that a(u, h) = a(h, u), we observe that

\[ \widetilde{E}_{\text{Tot}}(u + th) = \widetilde{E}_{\text{Tot}}(u) + t\big( a(u, h) - \ell(h) \big) + \frac{t^2}{2} a(h, h). \]

Since Ẽ_Tot(u + th) ≥ Ẽ_Tot(u), it follows that

\[ t\big( a(u, h) - \ell(h) \big) + \frac{t^2}{2} a(h, h) \ge 0. \]

We divide by t > 0 and let t tend to 0 to obtain a(u, h) − ℓ(h) ≥ 0. Since this relation is satisfied for all h ∈ H¹₀(]0, 1[), we can apply it with h = v and h = −v for any element v of H¹₀(]0, 1[): we conclude that u satisfies [2.36]. Conversely, if u satisfies [2.36], then we obtain

\[ \widetilde{E}_{\text{Tot}}(u + v) = \widetilde{E}_{\text{Tot}}(u) + \underbrace{a(u, v) - \ell(v)}_{=0} + \underbrace{\tfrac{1}{2} a(v, v)}_{\ge 0} \ge \widetilde{E}_{\text{Tot}}(u). \qquad \square \]

The bilinear form a is continuous on H¹₀(]0, 1[) × H¹₀(]0, 1[) because the Cauchy–Schwarz inequality makes it possible to dominate |a(u, v)| ≤ ‖k‖_∞ ‖∂ₓu‖_{L²} ‖∂ₓv‖_{L²}. Moreover, by writing κ = min_{x∈[0,1]} k(x) > 0, we obtain a(u, u) ≥ κ‖∂ₓu‖²_{L²}. By the Poincaré inequality, u ↦ ‖∂ₓu‖_{L²} defines a norm on H¹₀(]0, 1[), which is equivalent to the usual norm on H¹; the form a is therefore coercive. The Lax–Milgram theorem ensures the existence and uniqueness of a solution u ∈ H¹₀(]0, 1[) of [2.36], which is therefore the solution to the minimization problem. We conclude that [2.36] is just the variational formulation of the boundary problem

\[ -\frac{\partial}{\partial x}\Big( k(x) \frac{\partial}{\partial x}u(x) \Big) = f(x) \quad \text{for } 0 < x < 1, \qquad u(0) = u(1) = 0. \qquad [2.37] \]

Therefore, to every data f, there corresponds a minimizer u of Ẽ_Tot, the solution of [2.37]. With this solution, we associate a rupture energy defined from the normal derivatives at the ends x = 0 and x = 1. More precisely, we write

\[ R = \frac{1}{2} \Big| k(0) \frac{\partial}{\partial x}u(0) \Big|^2 + \frac{1}{2} \Big| k(1) \frac{\partial}{\partial x}u(1) \Big|^2. \]

This quantity depends on the applied force f . We will focus on forces that have a specific form characterized by an application point, and then we will look for the optimal position that minimizes rupture energy.


We define the applied force as follows: given a function p : R → [0, ∞[ with compact support in [−ℓ, +ℓ], where 0 < ℓ < 1/2, we fix a ∈ ]ℓ, 1 − ℓ[ and write f(x) = −p(x − a) ≤ 0. The sign means that the force is oriented “downwards”, and we therefore expect to find a negative solution u. The point a represents the “center” of the charge. Therefore, with each value of a, we can associate the corresponding solution of [2.37], denoted by x ↦ u^(a)(x), as well as the rupture energy

\[ R(a) = \frac{1}{2} \Big| k(0) \frac{\partial}{\partial x}u^{(a)}(0) \Big|^2 + \frac{1}{2} \Big| k(1) \frac{\partial}{\partial x}u^{(a)}(1) \Big|^2. \]

The question thus lies in finding the value of a that minimizes a ↦ R(a). Let us begin by studying the simple case of a homogeneous cord, for which the rigidity coefficient k(x) = k̄ > 0 is constant. In this particular case, we can explicitly calculate R as a function of a. Indeed, multiplying [2.37] by x and integrating, we obtain

\[ -\bar{k}\, \frac{\partial}{\partial x}u^{(a)}(1) = -\int_0^1 p(x - a)\, x\, dx = -\int_{-\infty}^{+\infty} p(x - a)\, x\, dx = -\int_{-\infty}^{+\infty} p(x - a)(x - a)\, dx - a \int_{-\infty}^{+\infty} p(x - a)\, dx = -\int_{-\infty}^{+\infty} p(y)\, y\, dy - a \int_{-\infty}^{+\infty} p(y)\, dy, \]

where we have used the fact that p has compact support to extend the integration domain. Denoting ⟨p⟩ = ∫ p(y) dy and ⟨yp⟩ = ∫ p(y) y dy, this relation reads k̄ ∂ₓu^(a)(1) = ⟨yp⟩ + a⟨p⟩, which therefore reveals k̄ ∂ₓu^(a)(1) as an affine function of a. Likewise, by multiplying [2.37] by (1 − x), we obtain the expression

\[ \bar{k}\, \frac{\partial}{\partial x}u^{(a)}(0) = \langle yp \rangle - (1 - a)\langle p \rangle. \]

Therefore,

\[ R(a) = \frac{1}{2} \Big( \big| \langle yp \rangle + a\langle p \rangle \big|^2 + \big| \langle yp \rangle - (1 - a)\langle p \rangle \big|^2 \Big) \]

is a quadratic function of a, for which it is easy to determine the minimum. In the general case, we no longer have a simple formula of this kind. We do not have an explicit formula either, and it is not even clear that classic optimization results can be applied.


We will nevertheless implement extrema determination algorithms, focusing on the discrete version of the problem. We introduce the numerical unknown U^(a) = (u₁^(a), ..., u_N^(a)) ∈ R^N, a vector whose coordinates are designed to be approximations of the values of u^(a) at the points ih, i ∈ {1, ..., N}, of a uniform grid with step h: x₀ = 0 < x₁ = h < ... < x_i = ih < ... < x_N = Nh < x_{N+1} = (N + 1)h = 1 (note the relation between N and h = 1/(N + 1)). This quantity is defined as a solution to the linear system

\[ -\frac{1}{h}\Big( k_{i+1/2}\, \frac{u^{(a)}_{i+1} - u^{(a)}_i}{h} - k_{i-1/2}\, \frac{u^{(a)}_i - u^{(a)}_{i-1}}{h} \Big) = -p(ih - a) \quad \text{for } i \in \{1, \dots, N\}, \]

using the convention u₀^(a) = 0 = u_{N+1}^(a) and k_{i+1/2} = k(h(i + 1/2)). In matrix form, we have M U^(a) = F^(a), with F_i^(a) = −p(ih − a), where M is the symmetric matrix whose coefficients are given by

\[ M_{ii} = \frac{k_{i+1/2} + k_{i-1/2}}{h^2}, \qquad M_{i\,i+1} = -\frac{k_{i+1/2}}{h^2}, \]

and M_{ij} = 0 when |i − j| > 1. Note that (with the convention U_{N+1} = 0)

\[ M U \cdot U = \sum_{i=1}^{N} \frac{k_{i+1/2}}{h^2} (U_{i+1} - U_i)^2, \]

which proves that M is indeed invertible. We associate the discrete unknown U^(a) with the rupture energy

\[ R_h(a) = \frac{1}{2} k_{1/2} \Big| \frac{u^{(a)}_1 - u^{(a)}_0}{h} \Big|^2 + \frac{1}{2} k_{N+1/2} \Big| \frac{u^{(a)}_{N+1} - u^{(a)}_N}{h} \Big|^2 = \frac{1}{2} k_{1/2} \Big| \frac{u^{(a)}_1}{h} \Big|^2 + \frac{1}{2} k_{N+1/2} \Big| \frac{u^{(a)}_N}{h} \Big|^2, \]

taking into account the boundary conditions. We will seek to numerically determine the optimal position using the gradient algorithm (even though, in this case too, it is far from clear that the conditions that guarantee convergence are satisfied). We therefore construct the sequence

\[ a_{n+1} = a_n - \rho\, R_h'(a_n), \]

234

Mathematics for Modeling and Scientific Computing

where $\rho>0$ is a parameter to be determined. In order to express the derivative of the rupture energy $R_h$, we introduce the vector $V^{(a)} = \frac{d}{da}U^{(a)}$, and the chain rule yields
$$R_h'(a) = k_{1/2}\,\frac{u_1^{(a)}v_1^{(a)}}{h^2} + k_{N+1/2}\,\frac{u_N^{(a)}v_N^{(a)}}{h^2}. \qquad [2.38]$$
In order to put the algorithm in place, it is necessary to compute the vector $V^{(a)}$. To this end, we differentiate the relation $MU^{(a)} = F^{(a)}$ with respect to $a$, so that $V^{(a)}$ is the solution of the linear system $MV^{(a)} = G^{(a)}$, where $G^{(a)}$ has coordinates $G_i^{(a)} = p'(ih-a)$ (now assuming that $p$ is $C^1$). We perform simulations in the case where the rigidity coefficient is given by
$$k(x) = 1 + 0.8\sin(10x)\cos(5x)$$
(see Figure 2.33 for the graph of this function). The applied force is obtained with the function
$$p(x) = 20\exp\big(-(20x)^2\big).$$

Figure 2.33. Graph of the rigidity coefficient k as a function of position

The force $p$ does not have compact support, but it takes significantly positive values only on a restricted domain (see the figures). We calculate the solution $u^{(a)}$ for several values


of the parameter $a$, chosen between 0.1 and 0.9, using a uniform grid of $2N$ points. Figure 2.34 shows the evolution of the rupture energy $R_h(a)$ as a function of the charge position $a$. We also show the derivative $\frac{d}{da}R_h(a)$, computed either with formula [2.38] or with the discrete difference quotient $\frac{R_h(a_{k+1})-R_h(a_k)}{a_{k+1}-a_k}$, for comparison. We observe that this derivative vanishes for $a$ approximately equal to 0.66. Figure 2.36 shows the optimal position determined by the gradient algorithm, with $\rho = 0.04$. We stop the algorithm when the relative error $\frac{|E(a_{k+1})-E(a_k)|}{E(a_k)}$ is less than $10^{-5}$. We thus find, in nine iterations starting from $a_0 = 1/2$, the optimal position $a_{\mathrm{opt}} = 0.66668$, with $R_h(a_{\mathrm{opt}}) = 1.5708$ and $\frac{d}{da}R_h(a_{\mathrm{opt}}) = -3.5772\times 10^{-5}$. Figure 2.35 shows the configurations obtained for the consecutive iterations.

Figure 2.34. Evolution of the rupture energy and its derivative as a function of the point of application of the force

2.10. Stokes problem

The Stokes equations describe the flow of an incompressible viscous fluid. We study a stationary regime and first consider a two-dimensional setting, where quantities depend only on the variables $x$ and $y$. We let $\nu > 0$ denote the viscosity of the fluid. External forces are described by a vector-valued function $(x,y)\mapsto (F(x,y),G(x,y))\in\mathbb R^2$. The unknowns are the velocity and pressure fields, $(x,y)\mapsto (U(x,y),V(x,y))\in\mathbb R^2$ and $(x,y)\mapsto P(x,y)\in\mathbb R$, respectively.


The conservation of momentum is expressed by the system of partial differential equations
$$\begin{aligned}
\partial_x P - \nu(\partial_x^2+\partial_y^2)U &= F,\\
\partial_y P - \nu(\partial_x^2+\partial_y^2)V &= G.
\end{aligned}\qquad [2.39]$$

Figure 2.35. Successive configurations obtained with the gradient algorithm (top panel: displacement u(x); bottom panel: applied force). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

These equations express the equilibrium between the external forces on the one hand, and pressure and viscous friction on the other hand. The fluid's incompressibility is reflected in the constraint
$$\partial_x U + \partial_y V = 0.\qquad [2.40]$$
This means that a fluid domain may deform along the course of the motion, but it retains a constant volume (see section 4.3 in [GOU 11]). Indeed, consider a velocity field $u:\mathbb R\times\mathbb R^N\to\mathbb R^N$, which may also depend on time. We associate it with the characteristic curves defined by the system of differential equations
$$\frac{d}{dt}X(t,x) = u(t,X(t,x)),\qquad X(0,x) = x.$$


Figure 2.36. Optimal configuration determined by the gradient algorithm (top panel: displacement u(x); bottom panel: applied force)

In other words, $X(t,x)\in\mathbb R^N$ is the position at time $t\ge 0$ of a particle transported by the velocity field $u$, starting at time $t=0$ from the position $x\in\mathbb R^N$. We fix a domain $D_0$ and for $t>0$ we let $D(t)$ denote its image at time $t$,
$$D(t) = \{X(t,x_0),\ x_0\in D_0\}$$
(see Figure 2.37). We consider the volume
$$V(t) = \int_{D(t)}dx.$$
With the change of variables $x = X(t,x_0)$, we have
$$\frac{d}{dt}V(t) = \frac{d}{dt}\int_{D_0}|\det(A(t,x_0))|\,dx_0 = \int_{D_0}\frac{d}{dt}|\det(A(t,x_0))|\,dx_0,$$
where $A(t,x_0)$ denotes the Jacobian matrix of that change of variables, that is, $A_{ij}(t,x_0) = \frac{\partial X_i}{\partial x_{0_j}}(t,x_0)$. For $t=0$, we have $A(0,x_0) = I$, so $\det(A(t,x_0))$ remains positive for all times. Now, we have
$$\det(A+H) = \det\big(A(I+A^{-1}H)\big) = \det(A)\det(I+A^{-1}H) = \det(A)\big(1+\operatorname{Tr}(A^{-1}H)+\|H\|\,\epsilon(H)\big),$$
where $\lim_{H\to 0}\epsilon(H) = 0$. We must therefore evaluate
$$\frac{d}{dt}\det A(t,x_0) = \det(A(t,x_0))\,\operatorname{Tr}\Big(A(t,x_0)^{-1}\,\frac{d}{dt}A(t,x_0)\Big).$$
Returning to the differential system, we see that
$$\frac{d}{dt}\frac{\partial}{\partial x_{0_j}}X_i(t,x_0) = \sum_{\ell=1}^{N}(\partial_\ell u_i)(t,X(t,x_0))\,\frac{\partial}{\partial x_{0_j}}X_\ell(t,x_0),$$
with $\frac{\partial}{\partial x_{0_j}}X_i(0,x_0) = \delta_{ij}$. In matrix form, this can be written as
$$\frac{d}{dt}A(t,x_0) = \nabla u(t,X(t,x_0))\,A(t,x_0),\qquad A(0,x_0) = I,$$
where $\nabla u$ is the Jacobian matrix of $u$. It follows that
$$\frac{d}{dt}\det A(t,x_0) = \det(A(t,x_0))\,\operatorname{Tr}\big(A(t,x_0)^{-1}\nabla u(t,X(t,x_0))A(t,x_0)\big) = \det(A(t,x_0))\,\operatorname{div}(u)(t,X(t,x_0)).$$

Figure 2.37. Evolution of the domain D0

We conclude that
$$\frac{d}{dt}V(t) = \int_{D_0}\det(A(t,x_0))\,\operatorname{div}(u)(t,X(t,x_0))\,dx_0 = \int_{D(t)}\operatorname{div}(u)(t,x)\,dx$$
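The volume-preservation property can be checked numerically. The following sketch (our own illustration, under the assumptions of a rigid-rotation field $u(x,y) = (-y,x)$, which is divergence free, a polygonal approximation of the unit disc as $D_0$, RK4 time stepping, and the shoelace formula for the enclosed area) transports the boundary of $D_0$ along the characteristics and monitors the area:

```python
import numpy as np

def u(z):
    """Divergence-free velocity field u(x, y) = (-y, x), applied row-wise."""
    return np.stack([-z[:, 1], z[:, 0]], axis=1)

def rk4_step(z, dt):
    """One RK4 step of dX/dt = u(X) for all boundary points at once."""
    k1 = u(z); k2 = u(z + 0.5 * dt * k1)
    k3 = u(z + 0.5 * dt * k2); k4 = u(z + dt * k3)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def shoelace(z):
    """Area enclosed by the closed polygon with vertices z."""
    x, y = z[:, 0], z[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

theta = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
z = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # boundary of D_0
area0 = shoelace(z)
for _ in range(200):                                   # integrate up to t = 2
    z = rk4_step(z, 0.01)
area_t = shoelace(z)
```

The domain $D(t)$ rotates, but its area stays constant up to the integrator's accuracy, in agreement with $\frac{d}{dt}V(t) = \int_{D(t)}\operatorname{div}(u)\,dx = 0$.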


using the change of variables $x = X(t,x_0)$, $dx = \det(A(t,x_0))\,dx_0$. Therefore, when the field $u$ is such that $\operatorname{div}(u) = 0$, the volume $V(t)$ remains constant, although the shape of the domain may indeed change throughout time.

We can determine explicitly some specific solutions of this problem. We assume that the fluid is subject to the effects of gravity, which acts in the direction of $x$ (the "vertical" direction): $F(x,y) = -g$, $G(x,y) = 0$, with $g$ constant. Moreover, the fluid flows between two plates located at $y = 0$ and $y = L$. We seek a solution in the form of functions independent of the variable $x$. We thus obtain
$$\begin{cases}
-\nu U''(y) = -g,\\
P'(y) - \nu V''(y) = 0,\\
V'(y) = 0.
\end{cases}$$
These equations are completed by the boundary conditions $U(0) = U(L) = 0 = V(0) = V(L)$, which express the fact that the fluid adheres to the walls of the domain in which it flows. It follows that $V(y) = 0$, $P(y) = \bar P\in\mathbb R$ is constant, and $U$ is the parabola given by
$$U(y) = \frac{g}{2\nu}\,y(y-L).$$
This situation is known as a Poiseuille flow. In practice, we can only seldom identify explicit solutions of this kind, and the use of numerical simulations is inevitable when trying to describe flows. We are therefore led to search for a relevant definition of approximate solutions. The analysis of problem [2.39]–[2.40] and of its approximations calls for fairly sophisticated functional tools. We will limit ourselves to studying a simplified version of the problem, inspired by [JOH 02], which retains the most important challenges of the problem (the reader may refer to Part 1 of [TAR 82] for an introduction to the analysis methods for the Stokes and Navier–Stokes equations, or to [BOY 06] for a more detailed and advanced treatment). Henceforth, we will assume that the flow occurs within the domain $[0,L]\times[0,2\pi]$ and that the boundary conditions are of the form
$$U(0,y) = U(L,y) = 0,\quad V(0,y) = V(L,y) = 0,\quad U(x,0) = U(x,2\pi) = 0,\quad \partial_y V(x,0) = \partial_y V(x,2\pi) = 0.$$
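The Poiseuille profile gives a convenient sanity check for a discretization. A minimal sketch (our own test, with arbitrary values of $g$, $\nu$, $L$ and of the grid size): solve $-\nu U'' = -g$, $U(0) = U(L) = 0$, with the standard second-order difference and compare with the exact parabola.

```python
import numpy as np

# Finite difference check of the Poiseuille profile: -nu U'' = -g on (0, L),
# U(0) = U(L) = 0, compared with the exact solution U(y) = g/(2 nu) y (y - L).
g, nu, L, n = 9.81, 0.1, 2.0, 400
h = L / (n + 1)
y = h * np.arange(1, n + 1)
A = nu / h**2 * (2.0 * np.eye(n)
                 - np.diag(np.ones(n - 1), 1)
                 - np.diag(np.ones(n - 1), -1))   # discretization of -nu d^2/dy^2
U = np.linalg.solve(A, -g * np.ones(n))
U_exact = g / (2.0 * nu) * y * (y - L)
err = np.max(np.abs(U - U_exact))
```

Since the second difference of a quadratic is exact, the error is at the level of rounding, not of the truncation order; this makes the parabola a useful first validation case for any code aimed at [2.39]–[2.40].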


Figure 2.38. Velocity profile for a Poiseuille flow between the plates y = 0 and y = L

We also assume that the force terms are of the form
$$F(x,y) = f(x)\sin(ky),\qquad G(x,y) = g(x)\cos(ky),\qquad k\in\mathbb N\setminus\{0\}.$$
We are looking for solutions that can be expressed in the same form:
$$U(x,y) = u(x)\sin(ky),\qquad V(x,y) = v(x)\cos(ky),\qquad P(x,y) = p(x)\sin(ky).$$
Thus, equations [2.39]–[2.40] give
$$\begin{cases}
\dfrac{d}{dx}p + \nu\Big(k^2-\dfrac{d^2}{dx^2}\Big)u = f,\\[6pt]
kp + \nu\Big(k^2-\dfrac{d^2}{dx^2}\Big)v = g,\\[6pt]
\dfrac{d}{dx}u - kv = 0,
\end{cases}\qquad [2.41]$$
with the boundary conditions $u(0) = u(L) = 0 = v(0) = v(L)$. Let $\varphi,\psi$ be regular functions, for example in $C_c^\infty(]0,L[)$. Using integration by parts, we obtain, with $(u,v,p)$ a solution of [2.41]:
$$\nu\int_0^L\Big(k^2(u\varphi+v\psi) + \frac{du}{dx}\frac{d\varphi}{dx} + \frac{dv}{dx}\frac{d\psi}{dx}\Big)\,dx + \int_0^L\Big(\frac{dp}{dx}\varphi + kp\psi\Big)\,dx = \int_0^L(f\varphi+g\psi)\,dx.$$

Note that if $x\mapsto(\varphi(x),\psi(x))$ satisfies the incompressibility constraint $\frac{d\varphi}{dx}-k\psi = 0$ and vanishes at $x = 0$ and $x = L$, then we obtain
$$\int_0^L\Big(\frac{dp}{dx}\varphi + kp\psi\Big)\,dx = \int_0^L p\,\Big(-\frac{d\varphi}{dx}+k\psi\Big)\,dx = 0.$$

These remarks motivate us to introduce the following functional framework. We write
$$H^1_{0,k} = \Big\{x\in[0,L]\mapsto(u(x),v(x)):\ u,v\in L^2(]0,L[),\ \frac{du}{dx},\frac{dv}{dx}\in L^2(]0,L[),\ u(0)=u(L)=0=v(0)=v(L),\ \frac{du}{dx}-kv=0\Big\}.$$
We equip this space with the usual $H^1$ norm. It is indeed a closed subspace of $H^1_0(]0,L[)\times H^1_0(]0,L[)$. On this space, we define
$$a(u,v;\varphi,\psi) = \nu\int_0^L\Big(k^2(u\varphi+v\psi) + \frac{du}{dx}\frac{d\varphi}{dx} + \frac{dv}{dx}\frac{d\psi}{dx}\Big)\,dx$$
and
$$\ell(\varphi,\psi) = \int_0^L(f\varphi+g\psi)\,dx.$$
It is clear that $a$ is a continuous and coercive bilinear form on $H^1_{0,k}$ and that $\ell$ is a continuous linear form on $H^1_{0,k}$, since $f,g\in L^2(]0,L[)$. From the Lax–Milgram theorem, we can infer:

THEOREM 2.16.– For all $f,g\in L^2(]0,L[)$, there exists a unique element $(u,v)\in H^1_{0,k}$ such that $a(u,v;\varphi,\psi) = \ell(\varphi,\psi)$ for all $(\varphi,\psi)\in H^1_{0,k}$.

In this formulation, the pressure $p$ disappears! In fact, it is hidden in the constraint defining the functional framework. Performing another integration by parts (which is, in fact, not rigorous because we cannot be sure that $\frac{d^2u}{dx^2}$ and $\frac{d^2v}{dx^2}$ are in $L^2$), the relation $a(u,v;\varphi,\psi) = \ell(\varphi,\psi)$ means that the vector
$$\begin{pmatrix}W\\Z\end{pmatrix} = \begin{pmatrix}\nu k^2u - \nu\dfrac{d^2u}{dx^2} - f\\[6pt] \nu k^2v - \nu\dfrac{d^2v}{dx^2} - g\end{pmatrix}\qquad [2.42]$$
satisfies
$$\int_0^L\begin{pmatrix}W\\Z\end{pmatrix}\cdot\begin{pmatrix}\varphi\\\psi\end{pmatrix}\,dx = 0\qquad [2.43]$$


for every vector-valued test function $(\varphi,\psi)$ that satisfies the constraint $\frac{d\varphi}{dx}-k\psi = 0$. The challenge is to identify the set of vectors $(W,Z)$ that satisfy this orthogonality property. If $W(x) = \frac{dp}{dx}(x)$ and $Z(x) = kp(x)$ for a certain scalar function $x\mapsto p(x)$, then, as seen above, [2.43] is satisfied for all $(\varphi,\psi)\in H^1_{0,k}$. We will provide arguments to support the claim that this is the only possible form, although the functional framework required for a full proof is quite elaborate.

LEMMA 2.10 (Hodge decomposition).– Let $U,V\in L^2(]0,L[)$. Then, there exist functions $p\in H^1(]0,L[)$ and $R,S\in L^2(]0,L[)$ such that $U = \frac{dp}{dx}+R$, $V = kp+S$, and for all $\varphi\in H^1(]0,L[)$ we have $\int_0^L\big(R\frac{d\varphi}{dx}+Sk\varphi\big)\,dx = 0$.

PROOF.– If all the functions are regular and the proposed decomposition holds, then we obtain the following equation defining $p$:
$$-\frac{d}{dx}U + kV = -\frac{d^2}{dx^2}p + k^2p - \frac{d}{dx}R + kS = -\frac{d^2}{dx^2}p + k^2p.$$
This suggests a definition of $p$ as the solution of the variational problem $\bar a(p,\varphi) = \bar\ell(\varphi)$, with
$$\bar a(p,\varphi) = \int_0^L\Big(k^2p\varphi + \frac{dp}{dx}\frac{d\varphi}{dx}\Big)\,dx,\qquad \bar\ell(\varphi) = \int_0^L\Big(U\frac{d\varphi}{dx}+kV\varphi\Big)\,dx,$$
for all $\varphi\in H^1(]0,L[)$. The Lax–Milgram theorem ensures the existence and uniqueness of a solution $p\in H^1(]0,L[)$ (if $k = 0$, uniqueness is guaranteed by further assuming that $\int_0^L p\,dx = 0$). Implicitly, we therefore impose a Neumann condition on $p$. We then write $R = U-\frac{dp}{dx}$ and $S = V-kp$, which are both elements of $L^2(]0,L[)$. By definition, we have
$$\int_0^L\Big(R\frac{d\varphi}{dx}+kS\varphi\Big)\,dx = \int_0^L\Big(U\frac{d\varphi}{dx}+kV\varphi\Big)\,dx - \int_0^L\Big(\frac{dp}{dx}\frac{d\varphi}{dx}+k^2p\varphi\Big)\,dx = 0.$$
This decomposition is, in fact, unique. Indeed, if there are two such decompositions $U = \frac{dp}{dx}+R = \frac{dq}{dx}+R'$ and $V = kp+S = kq+S'$, then $\pi = p-q$, $\rho = R-R'$ and $\sigma = S-S'$ satisfy $\frac{d\pi}{dx}+\rho = 0 = k\pi+\sigma$. Moreover, the relation $\int_0^L\big(\rho\frac{d\varphi}{dx}+\sigma k\varphi\big)\,dx = 0$ is satisfied for all $\varphi\in H^1(]0,L[)$. However, this can be rewritten as $\bar a(\pi,\varphi) = 0$. It follows that $\pi = 0$ and then $\rho = \sigma = 0$. □

NOTE 2.9.– Throughout this discussion, the crucial point, which allows us to extend the argument to higher dimensions, is the duality between the gradient and divergence

operators: for $u:\Omega\subset\mathbb R^N\to\mathbb R^N$ a vector-valued function such that $u|_{\partial\Omega} = 0$, and $p:\Omega\to\mathbb R$ a scalar function, we have (with the appropriate regularity assumptions)
$$\int_\Omega\nabla p\cdot u\,dX = -\int_\Omega p\,\operatorname{div}(u)\,dX.$$
Here, in dimension $N = 2$ ($X = (x,y)$), the gradient operator
$$\nabla : p\mapsto\begin{pmatrix}\partial_x p\\ \partial_y p\end{pmatrix}$$
is replaced by the operator
$$D : p\mapsto\begin{pmatrix}\partial_x p\\ kp\end{pmatrix}$$
and the duality relation becomes
$$\int_0^L Dp\cdot u\,dx = -\int_0^L p\,D^\star u\,dx,$$
with
$$D^\star : u = \begin{pmatrix}u_1\\u_2\end{pmatrix}\mapsto \partial_x u_1 - ku_2,$$
which plays the role of the operator $\operatorname{div}(u) = \partial_x u_1 + \partial_y u_2$. It is necessary to understand lemma 2.10 as a decomposition $u = Dp + \Phi$, with $D^\star\Phi = 0$, which is of the form $\operatorname{Ran}(D)+\operatorname{Ker}(D^\star)$. The multidimensional analogue is obtained by writing $u = \nabla p + \Phi$, with $\Phi$ a vector field satisfying $\operatorname{div}(\Phi) = 0$. It is important to keep these remarks in mind to properly understand the arguments presented here.

This statement is important and has several applications in multidimensional versions. Here, it is not entirely sufficient for identifying the vector $(W,Z)$ in [2.42]. Indeed, if we assume that this vector is an element of $L^2(]0,L[)$ (which is not the case in general), we can decompose $(W,Z) = \big(\frac{dp}{dx}+R,\,kp+S\big)$ and obtain $\int_0^L(W,Z)\cdot(\varphi,\psi)\,dx = \int_0^L(R,S)\cdot(\varphi,\psi)\,dx = 0$ for all $(\varphi,\psi)\in H^1_{0,k}$, in addition to $\int_0^L\big(R\frac{d\varphi}{dx}+Sk\varphi\big)\,dx = 0$ for all $\varphi\in H^1(]0,L[)$. However, this does not allow us to conclude that $R = S = 0$. We thus require more sophisticated functional tools.

Step 1. We begin by introducing $H^{-1}(]0,L[)$, the dual space of $H^1_0(]0,L[)$. It is the set of continuous linear forms on $H^1_0(]0,L[)$: a linear form $\ell$ on $H^1_0(]0,L[)$ is an element of $H^{-1}(]0,L[)$ when there exists $C>0$ such that for all $u\in H^1_0(]0,L[)$, we have
$$|\ell(u)|\le C\,\|u\|_{H^1},\qquad \text{where } \|u\|_{H^1}^2 = \int_0^L\Big(|u(x)|^2+\Big|\frac{du}{dx}(x)\Big|^2\Big)\,dx.$$
We can then write
$$\|\ell\|_{H^{-1}} = \sup_{u\in H^1_0\setminus\{0\}}\frac{|\ell(u)|}{\|u\|_{H^1}}.$$

A key example is given in the following lemma, whose proof is a direct consequence of the Cauchy–Schwarz inequality.

LEMMA 2.11.– Let $p\in L^2(]0,L[)$. We associate it with the linear form denoted $\frac{dp}{dx}$ and defined, for any $\varphi\in H^1_0(]0,L[)$, by
$$\frac{dp}{dx}(\varphi) = -\int_0^L p(x)\,\frac{d\varphi}{dx}(x)\,dx.$$
Then $\frac{dp}{dx}\in H^{-1}(]0,L[)$, with $\big\|\frac{dp}{dx}\big\|_{H^{-1}}\le\|p\|_{L^2}$.

Step 2. Using the same reasoning, we can define the derivative $\frac{d}{dx}$ of a linear form $\ell\in H^{-1}(]0,L[)$ by writing, for any $\varphi\in C_c^\infty(]0,L[)$,
$$\frac{d\ell}{dx}(\varphi) = -\ell\Big(\frac{d\varphi}{dx}\Big).$$
We thus consider the following subspace:
$$X = \Big\{\ell\in H^{-1}(]0,L[),\ \frac{d\ell}{dx}\in H^{-1}(]0,L[)\Big\}.$$
If $\ell\in X$, it follows that there exists $C>0$ such that for all $\varphi\in C_c^\infty(]0,L[), $ we have
$$\Big|\frac{d\ell}{dx}(\varphi)\Big|\le C\,\|\varphi\|_{H^1}.\qquad [2.44]$$
(Note that for the derivative of a general element of $H^{-1}$, the estimate would additionally involve the norm of the second derivative of $\varphi$ in $L^2$.)


LEMMA 2.12.– We have $X = L^2(]0,L[)$. Moreover, there exist two constants $c,c'>0$ such that the equivalence of norms
$$c\,\|p\|_{L^2}\ \le\ \|p\|_X := \|p\|_{H^{-1}} + \Big\|\frac{dp}{dx}\Big\|_{H^{-1}}\ \le\ c'\,\|p\|_{L^2}$$
is satisfied for all $p\in L^2(]0,L[)$.

PROOF.– Lemma 2.11 ensures the inclusion $L^2(]0,L[)\subset X$, since it is clear that $L^2(]0,L[)\subset H^{-1}(]0,L[)$. In particular, note that $\|p\|_X\le 2\|p\|_{L^2}$. The difficulty is in proving the reverse inclusion. Let $\psi\in C_c^\infty(]0,L[)$. We fix a $C^\infty$ function $\theta$ on $[0,L]$ such that $0\le\theta(x)\le 1$ for $x\in[0,L]$, $\theta(x) = 1$ for $x\in[3L/4,L]$, and $\theta(x) = 0$ for $x\in[0,L/4]$ (see Figure 2.39).

Figure 2.39. Function x → θ(x)

Finally, we set
$$\varphi(x) = \int_0^x\psi(y)\,dy - \theta(x)\int_0^L\psi(y)\,dy.$$


This function is still $C^\infty$ and its support is included in the open set $]0,L[$. Since $\frac{d\varphi}{dx} = \psi - \frac{d\theta}{dx}\int_0^L\psi(y)\,dy$, we have, for $\ell\in X$,
$$\ell(\psi) = \ell\Big(\frac{d\varphi}{dx}\Big) + \Big(\int_0^L\psi(y)\,dy\Big)\,\ell\Big(\frac{d\theta}{dx}\Big) = -\frac{d\ell}{dx}(\varphi) + \Big(\int_0^L\psi(y)\,dy\Big)\,\ell\Big(\frac{d\theta}{dx}\Big).$$
Since $\ell\in X$, we have
$$\Big|\frac{d\ell}{dx}(\varphi)\Big|\le\|\ell\|_X\,\|\varphi\|_{H^1}.$$
Furthermore, the Cauchy–Schwarz inequality makes it possible to dominate $\big|\int_0^L\psi(y)\,dy\big|\le\sqrt L\,\|\psi\|_{L^2}$. It follows that
$$\|\varphi\|_{H^1}\le\|\psi\|_{L^2}\,\sqrt{2\big(1+L^2+L\|\theta\|_{H^1}^2\big)}.$$
We let $m_1$ denote the constant, which depends on $L$ and $\theta$, involved in this estimate. Finally, $\frac{d\theta}{dx}\in C_c^\infty(]0,L[)$, so that $\ell\big(\frac{d\theta}{dx}\big)$ is well defined and dominated by $\|\ell\|_{H^{-1}}\big\|\frac{d\theta}{dx}\big\|_{H^1}\le\|\ell\|_X\big\|\frac{d\theta}{dx}\big\|_{H^1}$; we denote the right-hand side as $\|\ell\|_X\,m_2$. We thus obtain
$$|\ell(\psi)| = \Big|-\frac{d\ell}{dx}(\varphi) + \Big(\int_0^L\psi(y)\,dy\Big)\,\ell\Big(\frac{d\theta}{dx}\Big)\Big|\le\|\psi\|_{L^2}\,\|\ell\|_X\,(m_1+\sqrt L\,m_2).$$
Then, $\ell$ extends as a continuous linear form on $L^2(]0,L[)$, and by the Riesz theorem (theorem 4.37 in [GOU 11]) we can identify it with a function $p\in L^2(]0,L[)$:
$$\ell(\psi) = \int_0^L p(x)\psi(x)\,dx.\qquad\square$$

NOTE 2.10.– Lemma 2.12 can be adapted to higher dimensions, where it is known as "Nečas's lemma" [NEC 96]. In the whole space (respectively, in a periodic domain), it can be proved directly by using the Fourier transform (respectively, Fourier series). Recall that $\widehat{\nabla\varphi}(\xi) = i\xi\,\widehat\varphi(\xi)$ and therefore
$$\|\varphi\|_{H^1}^2 = \frac{1}{(2\pi)^N}\int(1+\xi^2)\,|\widehat\varphi(\xi)|^2\,d\xi.$$
This allows us to define the dual norm
$$\|\ell\|_{H^{-1}}^2 = \frac{1}{(2\pi)^N}\int\frac{|\widehat\ell(\xi)|^2}{1+\xi^2}\,d\xi.$$
We thus write
$$|\widehat p(\xi)|^2 = \frac{|\widehat p(\xi)|^2}{1+\xi^2} + \frac{\xi^2\,|\widehat p(\xi)|^2}{1+\xi^2} = \frac{|\widehat p(\xi)|^2}{1+\xi^2} + \frac{|\widehat{\nabla p}(\xi)|^2}{1+\xi^2}.$$
By integrating and using the Plancherel theorem (theorem 6.11 in [GOU 11]), we find
$$\underbrace{\int|\widehat p(\xi)|^2\,d\xi}_{(2\pi)^N\|p\|_{L^2}^2} = \underbrace{\int\frac{|\widehat p(\xi)|^2}{1+\xi^2}\,d\xi}_{(2\pi)^N\|p\|_{H^{-1}}^2} + \underbrace{\int\frac{|\widehat{\nabla p}(\xi)|^2}{1+\xi^2}\,d\xi}_{(2\pi)^N\|\nabla p\|_{H^{-1}}^2}.$$
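The Fourier identity underlying this note can be verified discretely. The following sketch (our own illustration, on the 1-periodic torus with the frequency variable $\xi = 2\pi j$ and a sample function $p$) checks that the pointwise splitting sums up to $\|p\|_{L^2}^2 = \|p\|_{H^{-1}}^2 + \|\tfrac{dp}{dx}\|_{H^{-1}}^2$:

```python
import numpy as np

# Discrete check of |p_hat|^2 = |p_hat|^2/(1+xi^2) + xi^2 |p_hat|^2/(1+xi^2),
# summed over frequencies (Parseval), on the 1-periodic torus.
n = 256
xgrid = np.arange(n) / n
pvals = np.cos(2 * np.pi * xgrid) + 0.3 * np.sin(6 * np.pi * xgrid)  # a sample p
phat = np.fft.fft(pvals) / n                     # Fourier coefficients p_hat(j)
xi = 2 * np.pi * np.fft.fftfreq(n, d=1.0 / n)    # frequencies 2*pi*j

l2 = np.sum(np.abs(phat) ** 2)                              # ||p||_{L2}^2
hm1_p = np.sum(np.abs(phat) ** 2 / (1 + xi ** 2))           # ||p||_{H-1}^2
hm1_dp = np.sum(xi ** 2 * np.abs(phat) ** 2 / (1 + xi ** 2))  # ||dp/dx||_{H-1}^2
```

Because the splitting is an algebraic identity frequency by frequency, the two sides agree to rounding error, independently of the choice of $p$.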

The proof of this statement on a domain $\Omega$ is somewhat technical (see, for example, section 1, Chapter IV in [BOY 06]).

Step 3. We now move on to vector-valued objects. We seek to identify
$$Y = \Big\{\Big(\frac{dp}{dx},\,kp\Big),\ p\in L^2(]0,L[)\Big\}\subset H^{-1}(]0,L[)\times H^{-1}(]0,L[)$$
with
$$\big(H^1_{0,k}\big)^\perp = \Big\{(U,V)\in H^{-1}\times H^{-1},\ U(\varphi)+V(\psi) = 0 \text{ for any }(\varphi,\psi)\in H^1_0\times H^1_0\text{ such that }\frac{d\varphi}{dx}-k\psi = 0\Big\}.$$
We know that
$$Y\subset\big(H^1_{0,k}\big)^\perp,$$
since for all $p\in L^2(]0,L[)$ and all $(\varphi,\psi)\in H^1_{0,k}$, we have
$$\frac{dp}{dx}(\varphi) + kp(\psi) = -\int_0^L p(x)\Big(\frac{d\varphi}{dx}-k\psi\Big)\,dx = 0.$$
Suppose it is proved that
$$Y\text{ is a closed subspace of }H^{-1}\times H^{-1}.\qquad [2.45]$$


Then, we can conclude from the inclusion¹²
$$Y^\perp\subset H^1_{0,k}.\qquad [2.46]$$
Indeed, it implies¹³ $\big(H^1_{0,k}\big)^\perp\subset(Y^\perp)^\perp = \bar Y = Y$, by [2.45]. Now, if $(u,v)\in H^1_0\times H^1_0$ satisfies, for all $p\in L^2(]0,L[)$,
$$\int_0^L p(x)\Big(-\frac{du}{dx}(x)+kv(x)\Big)\,dx = 0$$
(which is equivalent to $(u,v)\in Y^\perp$), then we have $-\frac{du}{dx}(x)+kv(x) = 0$ for a.e. $x\in]0,L[$, so $(u,v)\in H^1_{0,k}$. To finish the proof, it remains to show [2.45], which is a consequence of the following statement.

LEMMA 2.13.– For $k\ne 0$, the (linear and continuous) operator
$$A : p\in L^2(]0,L[)\longmapsto\Big(\frac{dp}{dx},\,kp\Big)\in H^{-1}(]0,L[)\times H^{-1}(]0,L[)$$
has a closed range.

PROOF.– Consider a sequence $(p_n)_{n\in\mathbb N}$ such that $(Ap_n)_{n\in\mathbb N}$ converges in $H^{-1}(]0,L[)\times H^{-1}(]0,L[)$: there exist $p,q\in H^{-1}(]0,L[)$ such that $\lim_{n\to\infty}p_n = p$ and $\lim_{n\to\infty}\frac{dp_n}{dx} = q$ in $H^{-1}(]0,L[)$. This implies that for all $\varphi\in C_c^\infty(]0,L[)$, we have
$$\int_0^L p_n\varphi\,dx\ \xrightarrow[n\to\infty]{}\ p(\varphi)$$
as well as
$$\frac{dp_n}{dx}(\varphi) = -\int_0^L p_n(x)\frac{d\varphi}{dx}(x)\,dx\ \xrightarrow[n\to\infty]{}\ q(\varphi) = -p\Big(\frac{d\varphi}{dx}\Big),$$
which indeed means that $q = \frac{dp}{dx}$. In fact, we can show that the operator $\frac{d}{dx}$, which is linear and continuous from $L^2(]0,L[)$ into $H^{-1}(]0,L[)$, has a closed range, and extend this property to all dimensions. This property is a consequence of a general statement, which is detailed in Appendix 3. □

12 Here, for a subspace $Z\subset H^{-1}\times H^{-1}$, we denote as $Z^\perp$ the set of $(u,v)\in H^1_0\times H^1_0$ such that for all $(\lambda,\mu)\in Z$, we have $\lambda(u)+\mu(v) = 0$; for more information on these orthogonality notions, the reader may refer to [BRÉ 05, section II.5].
13 See [BRÉ 05, proposition II.12].


NOTE 2.11 (Periodic framework).– We can carry out the same discussion assuming that the functions are $(L = 1)$-periodic. The result is then obtained in a more direct fashion, by arguing through Fourier series. We introduce the vector $e(j) = (2i\pi j,-k)$. The constraint defining $H^1_{0,k}$ is thus represented in terms of Fourier coefficients:
$$2i\pi j\,\widehat\varphi(j) - k\widehat\psi(j) = 0 = e(j)\cdot\begin{pmatrix}\widehat\varphi(j)\\ \widehat\psi(j)\end{pmatrix}$$
for all $j\in\mathbb Z$. We decompose any vector $(\widehat\varphi,\widehat\psi)$ in the form
$$\begin{pmatrix}\widehat\varphi\\ \widehat\psi\end{pmatrix} = \Big(I-\frac{e(j)\otimes e(j)}{|e(j)|^2}\Big)\begin{pmatrix}\widehat\varphi\\ \widehat\psi\end{pmatrix} + \frac{e(j)\otimes e(j)}{|e(j)|^2}\begin{pmatrix}\widehat\varphi\\ \widehat\psi\end{pmatrix},$$
where
$$e(j)\cdot\Big(I-\frac{e(j)\otimes e(j)}{|e(j)|^2}\Big)\begin{pmatrix}\widehat\varphi\\ \widehat\psi\end{pmatrix} = 0.$$
For any vector-valued function $x\mapsto(\varphi,\psi)(x)$, by [2.42]–[2.43], $(W,Z)$ satisfies
$$\sum_{j\in\mathbb Z}\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix}\cdot\Big(I-\frac{e(j)\otimes e(j)}{|e(j)|^2}\Big)\begin{pmatrix}\widehat\varphi(j)\\ \widehat\psi(j)\end{pmatrix} = 0$$
and therefore
$$\int_0^1\begin{pmatrix}W\\Z\end{pmatrix}\cdot\begin{pmatrix}\varphi\\\psi\end{pmatrix}\,dx = \sum_{j\in\mathbb Z}\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix}\cdot\begin{pmatrix}\widehat\varphi(j)\\ \widehat\psi(j)\end{pmatrix} = \sum_{j\in\mathbb Z}\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix}\cdot\frac{e(j)\otimes e(j)}{|e(j)|^2}\begin{pmatrix}\widehat\varphi(j)\\ \widehat\psi(j)\end{pmatrix} = \sum_{j\in\mathbb Z}\frac{e(j)\otimes e(j)}{|e(j)|^2}\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix}\cdot\begin{pmatrix}\widehat\varphi(j)\\ \widehat\psi(j)\end{pmatrix}.$$
Since this equality holds for all $(\varphi,\psi)$, we conclude that
$$\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix} = \frac{e(j)\otimes e(j)}{|e(j)|^2}\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix} = e(j)\,\widehat p(j),$$
where $p$ is the scalar function whose Fourier coefficients are given by
$$\widehat p(j) = \frac{e(j)}{|e(j)|^2}\cdot\begin{pmatrix}\widehat W(j)\\ \widehat Z(j)\end{pmatrix}.$$
In other words, $W = \frac{dp}{dx}$ and $Z = kp$.
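The projection formula of this note can be exercised with the discrete Fourier transform. The sketch below is our own illustration: we build a vector whose coefficients are of the form $(\widehat W(j),\widehat Z(j)) = e(j)\,\widehat p(j)$, with $e(j) = (2i\pi j,-k)$ and the Hermitian product used for the recovery, and check that $\widehat p(j) = \overline{e(j)}\cdot(\widehat W(j),\widehat Z(j))/|e(j)|^2$ returns the original coefficients (the sign convention on the second component of $e(j)$ follows the text).

```python
import numpy as np

k = 3.0
n = 128
xg = np.arange(n) / n
pex = np.sin(2 * np.pi * xg) + 0.5 * np.cos(4 * np.pi * xg)   # a sample scalar p
phat = np.fft.fft(pex) / n
freqs = np.fft.fftfreq(n, d=1.0 / n)   # integer frequencies j

e1 = 2.0 * np.pi * 1j * freqs          # first component of e(j)
e2 = -k                                # second component of e(j)
What = e1 * phat                       # W_hat(j) = e(j)_1 p_hat(j)  (W = dp/dx)
Zhat = e2 * phat                       # Z_hat(j) = e(j)_2 p_hat(j)

# recovery by projection on e(j); |e(j)|^2 = 4 pi^2 j^2 + k^2 never vanishes (k != 0)
prec = (np.conj(e1) * What + np.conj(e2) * Zhat) / (np.abs(e1) ** 2 + k ** 2)
```

Note that the denominator is bounded away from zero precisely because $k\ne 0$, which is the same mechanism that makes the operator of lemma 2.13 have closed range.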

NOTE 2.12.– Stokes problem [2.39]–[2.40], presented within a domain $\Omega$, must be completed by boundary conditions. They affect the velocity $U$ and do not involve the pressure. A simple model consists of assuming that the fluid adheres to the walls, which allows us to impose the constraint $U|_{\partial\Omega} = 0$. Note that the pressure $P$ in [2.39]–[2.40] is defined only up to a constant: if $x\mapsto(U(x),P(x))$ is a solution, then for all $C\in\mathbb R$, $x\mapsto(U(x),P(x)+C)$ is also a solution. For [2.41], we have chosen that constant: we assume the solution $P(x,y)$ satisfies $\iint P\,dy\,dx = 0$. From a mathematical point of view, the pressure is intimately associated with the incompressibility constraint, and we will see further below how to interpret it in terms of Lagrange multipliers. For [2.41], we can obtain an equation for $p$ by combining the information as follows:
$$\frac{d}{dx}\Big(\frac{d}{dx}p+\nu k^2u-\nu\frac{d^2}{dx^2}u\Big) - k\Big(kp+\nu k^2v-\nu\frac{d^2}{dx^2}v\Big) = \frac{d}{dx}f - kg$$
$$= \frac{d^2}{dx^2}p - k^2p - \nu\Big(\frac{d^2}{dx^2}-k^2\Big)\Big(\frac{d}{dx}u-kv\Big) = \frac{d^2}{dx^2}p - k^2p.\qquad [2.47]$$
Therefore, $p$ satisfies an elliptic equation, whose right-hand side is determined by the data $f$ and $g$. However, the difficulty is in the fact that we do not have boundary conditions for $p$.

A "naive" discretization of the system [2.41] with finite differences leads to the following discrete equations. We fix the constant step $\Delta x>0$ and use the notation $x_0 = 0 < x_1 = \Delta x < \dots < x_j = j\Delta x < \dots < x_J < L = (J+1)\Delta x = x_{J+1}$. Naturally, the numerical unknown should be constituted by the vectors $U^{(J)} = (u_1,\dots,u_J)$, $V^{(J)} = (v_1,\dots,v_J)$ and $P^{(J)} = (p_1,\dots,p_J)$. The first-order derivatives are approximated using the centered formula and the diffusion term using the usual formula. We obtain, for $j\in\{1,\dots,J\}$,
$$\begin{aligned}
\frac{1}{2\Delta x}(p_{j+1}-p_{j-1}) - \nu\Big(\frac{u_{j+1}-2u_j+u_{j-1}}{\Delta x^2}-k^2u_j\Big) &= f_j,\\
kp_j - \nu\Big(\frac{v_{j+1}-2v_j+v_{j-1}}{\Delta x^2}-k^2v_j\Big) &= g_j,\\
\frac{1}{2\Delta x}(u_{j+1}-u_{j-1}) - kv_j &= 0.
\end{aligned}\qquad [2.48]$$


The boundary conditions become $u_0 = u_{J+1} = 0 = v_0 = v_{J+1}$. This system has $3\times J$ equations and four boundary conditions, but has $3\times J+6$ unknowns: the components of $U^{(J)}$, $V^{(J)}$, $P^{(J)}$ and the quantities $u_0,v_0,u_{J+1},v_{J+1}$, as well as $p_0,p_{J+1}$, which are also involved in the formulas. The system is under-determined; there are more unknowns than equations. The difficulty arises because the discrete problem requires boundary conditions for the pressure, whereas the continuous problem makes no assumptions on that unknown. This difficulty reveals parasite modes, which are non-trivial solutions of the homogeneous discrete problem. Indeed, when $f = g = 0$, the solution of the continuous problem is $u = v = 0$, $p = 0$. Yet we can construct a non-zero solution of the discrete problem depending on the pressure boundary assumptions. To study this phenomenon, first note that, by subtracting the evaluations of the first equation at $j+1$ and $j-1$,
$$\begin{aligned}
f_{j+1}-f_{j-1} &= \frac{(p_{j+2}-p_j)-(p_j-p_{j-2})}{2\Delta x} - \nu\Big(\frac{(u_{j+2}-2u_{j+1}+u_j)-(u_j-2u_{j-1}+u_{j-2})}{\Delta x^2} - k^2(u_{j+1}-u_{j-1})\Big)\\
&= \frac{p_{j+2}-2p_j+p_{j-2}}{2\Delta x} - \nu\Big(\frac{(u_{j+2}-u_j)-2(u_{j+1}-u_{j-1})+(u_j-u_{j-2})}{\Delta x^2} - k^2(u_{j+1}-u_{j-1})\Big)\\
&= \frac{p_{j+2}-2p_j+p_{j-2}}{2\Delta x} - 2\nu k\Delta x\Big(\frac{v_{j+1}-2v_j+v_{j-1}}{\Delta x^2}-k^2v_j\Big)\\
&= \frac{p_{j+2}-2p_j+p_{j-2}}{2\Delta x} - 2k^2\Delta x\,p_j + 2k\Delta x\,g_j,
\end{aligned}$$
where we used the discrete incompressibility relation and then the second equation of [2.48]. In other words,
$$k^2p_j - \frac{p_{j+2}-2p_j+p_{j-2}}{4\Delta x^2} = kg_j - \frac{f_{j+1}-f_{j-1}}{2\Delta x}.\qquad [2.49]$$
This equation is satisfied for $j\in\{2,\dots,J-1\}$. We recognize a discrete version of the elliptic equation [2.47]. However, this discretization uncouples the grids formed by the "even" points ($j = 2n$) and the "odd" points ($j = 2n+1$). We thus construct


non-trivial solutions of the homogeneous problem, where $f = g = 0$. Suppose that $J = 2N$ or $J = 2N-1$. We define $Q = (q_1,\dots,q_{N-1})$ as a solution of
$$k^2q_n - \frac{1}{(2\Delta x)^2}(q_{n+1}-2q_n+q_{n-1}) = 0,\qquad q_0 = 1,\quad q_N = 0.$$
In matrix form, this becomes
$$\Big(k^2I - \frac{1}{(2\Delta x)^2}\,T\Big)Q = \frac{1}{(2\Delta x)^2}\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix},$$
where $T$ is the tridiagonal matrix with $-2$ on the diagonal and $1$ on the two adjacent diagonals; the right-hand side accounts for the boundary values $q_0 = 1$, $q_N = 0$. (We can likewise define a vector $\tilde Q$, with $q_0 = 0$ and $q_N = 1$.) We thus set, for $n\in\{1,\dots,N\}$, $p_{2n} = q_n$, $u_{2n} = 0$, $v_{2n+1} = 0$, and we define $u_{2n+1}$, $v_{2n}$ by
$$kp_{2n} + \nu\Big(k^2+\frac{2}{\Delta x^2}\Big)v_{2n} = 0,\qquad \frac{p_{2n+2}-p_{2n}}{2\Delta x} + \nu\Big(k^2+\frac{2}{\Delta x^2}\Big)u_{2n+1} = 0$$
(for $n\in\{1,\dots,N\}$ and $n\in\{0,\dots,N-1\}$, respectively). Finally, we write
$$p_{2n+1} = \frac{\nu}{k\Delta x^2}\,(v_{2n+2}+v_{2n}) = -\frac{1}{k^2\Delta x^2+2}\,(p_{2n+2}+p_{2n}).$$
We confirm that, on the one hand,
$$\frac{p_{2n+1}-p_{2n-1}}{2\Delta x} = -\frac{1}{(k^2\Delta x^2+2)\,2\Delta x}\,(p_{2n+2}-p_{2n-2}) = \nu\,\frac{u_{2n+1}-2u_{2n}+u_{2n-1}}{\Delta x^2} - \nu k^2u_{2n}$$
(recall that $u_{2n} = 0$), so the first equation of [2.48] holds with $f = 0$,


and, on the other hand,
$$\frac{1}{2\Delta x}(u_{2n+1}-u_{2n-1}) - kv_{2n} = \frac{1}{\nu\big(k^2+\frac{2}{\Delta x^2}\big)}\Big(k^2p_{2n} - \frac{1}{(2\Delta x)^2}(p_{2n+2}-2p_{2n}+p_{2n-2})\Big) = 0.$$
We thus obtain a non-trivial solution of [2.48] when $f = g = 0$.

Figure 2.40. Staggered grids: the horizontal velocity is evaluated at the points jΔx, and the vertical velocity and pressure are evaluated at the points (j + 1/2)Δx

In order to overcome this difficulty, we use staggered grids, following a method introduced in [HAR 65]. The horizontal velocity is approximated at the points $x_j = j\Delta x$, and the vertical velocity and pressure are approximated at the points $x_{j+1/2} = (j+1/2)\Delta x$. The unknowns are therefore $U^{(J)} = (u_1,\dots,u_J)\in\mathbb R^J$, $V^{(J)} = (v_{1/2},\dots,v_{J+1/2})\in\mathbb R^{J+1}$ and $P^{(J)} = (p_{1/2},\dots,p_{J+1/2})\in\mathbb R^{J+1}$. The discrete equations become
$$\begin{aligned}
\frac{1}{\Delta x}(p_{j+1/2}-p_{j-1/2}) - \nu\Big(\frac{u_{j+1}-2u_j+u_{j-1}}{\Delta x^2}-k^2u_j\Big) &= f_j &&\text{for }j\in\{1,\dots,J\},\\
kp_{j+1/2} - \nu\Big(\frac{v_{j+3/2}-2v_{j+1/2}+v_{j-1/2}}{\Delta x^2}-k^2v_{j+1/2}\Big) &= g_{j+1/2} &&\text{for }j\in\{0,\dots,J\},\\
\frac{1}{\Delta x}(u_{j+1}-u_j) - kv_{j+1/2} &= 0 &&\text{for }j\in\{0,\dots,J\}.
\end{aligned}$$
This system is composed of $J+(J+1)+(J+1) = 3J+2$ equations. We complete them with the boundary conditions for the velocity $u_0 = u_{J+1} = 0 = v_{-1/2} = v_{J+3/2}$.


These equations involve the $3J+2$ components of the vectors $U^{(J)}$, $V^{(J)}$, $P^{(J)}$ and the four terms imposed by the boundary conditions: there are as many equations as unknowns. In order to describe this system in matrix form, we introduce the matrix $B$ with $J$ rows and $(J+1)$ columns:
$$B = \frac{1}{\Delta x}\begin{pmatrix}-1 & 1 & & & 0\\ & \ddots & \ddots & & \\ & & \ddots & \ddots & \\ 0 & & & -1 & 1\end{pmatrix}.$$
Its transpose is thus the matrix with $(J+1)$ rows and $J$ columns:
$$B^\top = \frac{1}{\Delta x}\begin{pmatrix}-1 & & & 0\\ 1 & \ddots & & \\ & \ddots & \ddots & \\ & & \ddots & -1\\ 0 & & & 1\end{pmatrix}.$$
In particular, we have:
– for $j\in\{1,\dots,J\}$, the $j$th component of $BP^{(J)}$ is $(p_{j+1/2}-p_{j-1/2})/\Delta x$;
– for $j\in\{0,\dots,J\}$, the $(j+1)$th component of $B^\top U^{(J)}$ is $(u_j-u_{j+1})/\Delta x$
(with the convention $u_0 = 0 = u_{J+1}$). We will also use the $J\times J$ matrix
$$A = \frac{\nu}{\Delta x^2}\begin{pmatrix}2 & -1 & & 0\\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1\\ 0 & & -1 & 2\end{pmatrix}$$
and we denote as $\bar A$ the matrix with the same structure but with $(J+1)$ rows and columns. Similarly, we denote as $I$ the identity matrix on $\mathbb R^J$ and as $\bar I$ the identity matrix on $\mathbb R^{J+1}$. We therefore obtain the following block expression:
$$\begin{pmatrix}\nu k^2I+A & 0 & B\\ 0 & \nu k^2\bar I+\bar A & k\bar I\\ B^\top & k\bar I & 0\end{pmatrix}\begin{pmatrix}U^{(J)}\\ V^{(J)}\\ P^{(J)}\end{pmatrix} = \begin{pmatrix}F^{(J)}\\ G^{(J)}\\ 0\end{pmatrix}.$$
We can distinguish the velocity and pressure unknowns by writing
$$\mathscr U^{(J)} = \begin{pmatrix}U^{(J)}\\ V^{(J)}\end{pmatrix}\in\mathbb R^{2J+1},\qquad \mathbb A_k = \begin{pmatrix}\nu k^2I+A & 0\\ 0 & \nu k^2\bar I+\bar A\end{pmatrix},\qquad \mathbb B_k = \begin{pmatrix}B\\ k\bar I\end{pmatrix}.$$
Similarly, we concatenate the data $F^{(J)}$ and $G^{(J)}$ in a single vector $\mathscr F^{(J)}$. The problem becomes
$$\begin{pmatrix}\mathbb A_k & \mathbb B_k\\ \mathbb B_k^\top & 0\end{pmatrix}\begin{pmatrix}\mathscr U^{(J)}\\ P^{(J)}\end{pmatrix} = \begin{pmatrix}\mathscr F^{(J)}\\ 0\end{pmatrix}.\qquad [2.50]$$
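The saddle-point system [2.50] can be assembled and solved directly with dense linear algebra. The sketch below is our own test (arbitrary smooth data $f$, $g$ and parameter values): it builds the blocks $\mathbb A_k$, $\mathbb B_k$ exactly as above, solves the full system, and re-computes the pressure through the Schur complement $\mathbb B_k^\top\mathbb A_k^{-1}\mathbb B_k$ used in the proof of theorem 2.17.

```python
import numpy as np

nu, k, L, J = 1.0, 2.0, 1.0, 40
dx = L / (J + 1)
xj = dx * np.arange(1, J + 1)           # u-points
xh = dx * (np.arange(J + 1) + 0.5)      # v- and p-points
f = np.sin(np.pi * xj / L)
g = np.cos(np.pi * xh / L)

def lap(n):
    """nu/dx^2 * tridiag(-1, 2, -1), the matrix A (or Abar) of the text."""
    return nu / dx**2 * (2.0 * np.eye(n)
                         - np.diag(np.ones(n - 1), 1)
                         - np.diag(np.ones(n - 1), -1))

B = np.zeros((J, J + 1))                # J x (J+1) difference matrix
for r in range(J):
    B[r, r], B[r, r + 1] = -1.0 / dx, 1.0 / dx

Ak = np.block([[nu * k**2 * np.eye(J) + lap(J), np.zeros((J, J + 1))],
               [np.zeros((J + 1, J)), nu * k**2 * np.eye(J + 1) + lap(J + 1)]])
Bk = np.vstack([B, k * np.eye(J + 1)])  # (2J+1) x (J+1)

M = np.block([[Ak, Bk], [Bk.T, np.zeros((J + 1, J + 1))]])
rhs = np.concatenate([f, g, np.zeros(J + 1)])
sol = np.linalg.solve(M, rhs)
UV, P = sol[:2 * J + 1], sol[2 * J + 1:]

# pressure via the Schur complement Bk^T Ak^{-1} Bk P = Bk^T Ak^{-1} F
S = Bk.T @ np.linalg.solve(Ak, Bk)
P_schur = np.linalg.solve(S, Bk.T @ np.linalg.solve(Ak, np.concatenate([f, g])))
```

The discrete velocity satisfies the constraint $\mathbb B_k^\top\mathscr U^{(J)} = 0$ to rounding error, and the two pressure computations agree, illustrating the invertibility argument of the proof.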


THEOREM 2.17.– The linear system [2.50] has a unique solution.

PROOF.– Two ingredients are crucial in analyzing the system [2.50]. On the one hand, studying the elliptic model problem indicates that the matrix $\mathbb A_k$ is symmetric positive definite, and therefore invertible. On the other hand, note that the matrix $\mathbb B_k$ is injective. The equation provided by the uppermost blocks of [2.50] can be rewritten in the form
$$\mathscr U^{(J)} = \mathbb A_k^{-1}\big(\mathscr F^{(J)} - \mathbb B_kP^{(J)}\big).$$
Since $\mathbb B_k^\top\mathscr U^{(J)} = 0$, it follows that $P^{(J)}$ satisfies
$$\mathbb B_k^\top\mathbb A_k^{-1}\mathbb B_k\,P^{(J)} = \mathbb B_k^\top\mathbb A_k^{-1}\mathscr F^{(J)}.$$
Because $\mathbb B_k$ is injective, the matrix $\mathbb B_k^\top\mathbb A_k^{-1}\mathbb B_k$ is symmetric positive definite, so it is invertible, since for all $Q\in\mathbb R^{J+1}\setminus\{0\}$, we have
$$\mathbb B_k^\top\mathbb A_k^{-1}\mathbb B_kQ\cdot Q = \mathbb A_k^{-1}\mathbb B_kQ\cdot\mathbb B_kQ\ \ge\ \frac{1}{\lambda_M}\,|\mathbb B_kQ|^2 > 0,$$
where $\lambda_M>0$ denotes the largest eigenvalue of $\mathbb A_k$. □

The pressure can be interpreted as the Lagrange multiplier associated with the incompressibility constraint. Indeed, the constrained optimization theorem allows us to characterize the minimum of the functional
$$\mathscr J : V\in\mathbb R^{2J+1}\longmapsto\frac12\,\mathbb A_kV\cdot V - \mathscr F^{(J)}\cdot V$$
on the domain
$$\mathscr C = \big\{V\in\mathbb R^{2J+1},\ \mathbb B_k^\top V = 0\big\}.$$
Theorem 1.5 ensures that, with $\mathscr U^{(J)}$ the minimizer of $\mathscr J$ on $\mathscr C$, there exists a vector $(p_{1/2},\dots,p_{J+1/2}) = P^{(J)}\in\mathbb R^{J+1}$ such that $\nabla\mathscr J(\mathscr U^{(J)})+\mathbb B_kP^{(J)} = 0$. We thus find [2.50]. This observation can be reformulated in terms of a saddle point.


PROPOSITION 2.9.– The pair $(\mathscr U,P)$ is a solution of [2.50] if and only if $(\mathscr U,P)$ is a saddle point of the functional
$$\mathscr L(V,q) = \mathscr J(V) + q\cdot\mathbb B_k^\top V,$$
which means that for all $(V,q)$, we have $\mathscr L(\mathscr U,q)\le\mathscr L(\mathscr U,P)\le\mathscr L(V,P)$.

PROOF.– Let $(\mathscr U,P)$ be a solution of [2.50]. Then, we have $\mathscr L(\mathscr U,P) = \mathscr J(\mathscr U)$ because $\mathbb B_k^\top\mathscr U = 0$. Let $V = \mathscr U+h$, so that
$$\mathscr L(V,P) = \mathscr J(\mathscr U) + (\mathbb A_k\mathscr U-\mathscr F^{(J)})\cdot h + \frac12\,\mathbb A_kh\cdot h + P\cdot\mathbb B_k^\top h\ \ge\ \mathscr J(\mathscr U) + (\mathbb A_k\mathscr U-\mathscr F^{(J)}+\mathbb B_kP)\cdot h = \mathscr J(\mathscr U) = \mathscr L(\mathscr U,P).$$
Moreover, for all $q\in\mathbb R^{J+1}$, we have $\mathscr L(\mathscr U,P)-\mathscr L(\mathscr U,q) = (P-q)\cdot\mathbb B_k^\top\mathscr U = 0$. This proves that $(\mathscr U,P)$ is a saddle point of $\mathscr L$. Reciprocally, let $(\mathscr U,P)$ be a saddle point of $\mathscr L$. Then, for all $V = \mathscr U+th$, $t>0$, $h\in\mathbb R^{2J+1}$, we obtain
$$\mathscr L(V,P)-\mathscr L(\mathscr U,P) = t\,(\mathbb A_k\mathscr U+\mathbb B_kP-\mathscr F^{(J)})\cdot h + \frac{t^2}{2}\,\mathbb A_kh\cdot h\ \ge\ 0.$$
We divide this relation by $t$ and let $t$ tend to 0 to obtain $(\mathbb A_k\mathscr U+\mathbb B_kP-\mathscr F^{(J)})\cdot h\ge 0$. This relation is satisfied for all $h = \pm\eta\in\mathbb R^{2J+1}$, so we have $(\mathbb A_k\mathscr U+\mathbb B_kP-\mathscr F^{(J)})\cdot\eta = 0$ for all vectors $\eta$. This means that $\mathbb A_k\mathscr U+\mathbb B_kP-\mathscr F^{(J)} = 0$. Finally, we also have $\mathscr L(\mathscr U,P)-\mathscr L(\mathscr U,q) = (P-q)\cdot\mathbb B_k^\top\mathscr U\ge 0$ for all vectors $q = P\pm\zeta\in\mathbb R^{J+1}$, which implies that $\mathbb B_k^\top\mathscr U = 0$. We have thus shown that $(\mathscr U,P)$ is a solution of [2.50]. □

In other words, we have $\mathscr L(\mathscr U,P) = \max_q\min_V\mathscr L(V,q) = \min_V\max_q\mathscr L(V,q)$. The objective is thus to minimize $\mathscr L$ with respect to the state variable and maximize it with respect to the multiplier variable. One approach is to define approximations of $(\mathscr U,P)$ by adapting the gradient method. We thus obtain


the Arrow–Hurwicz algorithm: given U0 and p0, we construct the sequences defined by the relations

Un+1 = Un − ρ (Ak Un − F(J) + Bk⊤ pn),
pn+1 = pn + α Bk Un+1.

Given Un, pn, we find Un+1 by “descending” in the direction ∇_U L(Un, pn) with step ρ, and then we find pn+1 by “climbing up” along the direction ∇_p L(Un+1, pn) with step α. This algorithm depends on two parameters, α, ρ > 0, which must satisfy certain conditions to guarantee convergence.

THEOREM 2.18.– Let λm > 0 be the smallest eigenvalue of Ak. If the steps ρ, α > 0 satisfy the smallness condition [2.51], then the sequences (Un, pn) converge to the solution of [2.50].

Instead of performing a gradient step on the state variable, we can determine Un+1 by solving Ak Un+1 = F(J) − Bk⊤ pn exactly. However, the method then requires solving a linear system at each iteration. On the one hand, the size of these systems is somewhat reduced with respect to that of [2.50]; on the other hand, the structure of the matrix Ak can be more favorable than that of the complete system [2.50], enabling the use of more efficient resolution methods. These remarks motivate the introduction of the Uzawa iteration algorithm

Ak Un+1 = F(J) − Bk⊤ pn,   pn+1 = pn + α Bk Un+1.   [2.53]

THEOREM 2.19.– Let λm > 0 be the smallest eigenvalue of Ak. If 0 < α < 2λm/‖Bk‖², then the sequences (Un, pn) defined by [2.53] converge to the solution of [2.50].

Another strategy consists of penalizing the constraint: for ε > 0, we define Uε as the solution of

(Ak + (1/ε) Bk⊤ Bk) Uε = F(J).   [2.56]

The family (Uε)_{ε>0} is bounded: ‖Uε‖ ≤ ‖F(J)‖/λm. We can therefore extract a subsequence ε_ℓ → 0 such that lim_{ℓ→∞} U_{ε_ℓ} = U. Passing to the limit in [2.56], we obtain

ε_ℓ (Ak U_{ε_ℓ} − F(J)) + Bk⊤ Bk U_{ε_ℓ} = 0 → Bk⊤ Bk U = 0 as ℓ → ∞,

so U ∈ Ker(Bk⊤ Bk). Now, Bk⊤ Bk U · U = Bk U · Bk U = ‖Bk U‖², so U ∈ Ker(Bk): the limit satisfies the desired incompressibility constraint. Let V ∈ R^{2J+1} be such that Bk V = 0. We have

0 = (Ak Uε + (1/ε) Bk⊤ Bk Uε − F(J)) · V = 0 + (Ak Uε − F(J)) · V → (Ak U − F(J)) · V = 0 as ε → 0.

The limit U is therefore such that (Ak U − F(J)) ∈ Ker(Bk)⊥ = Ran(Bk⊤). Since we have shown that the system Ak U + Bk⊤ P = F(J), Bk U = 0 has a unique solution, it follows that the entire family (Uε)_{ε>0} converges. □

In practice, this kind of method, which can nevertheless be efficient, leads to two types of difficulties, both related to the fact that it depends on a parameter ε > 0. On the one hand, the role of the penalization parameter is not easy to analyze with respect to the discretization parameters. On the other hand, we can see that the matrix

Aε = Ak + (1/ε) Bk⊤ Bk

of the underlying system is ill conditioned. Indeed, the matrix Aε is symmetric positive definite and its condition number is given by the ratio of its extreme eigenvalues. The largest eigenvalue of Aε satisfies

Λε = max_{V≠0} (Aε V · V)/‖V‖² ≥ (1/ε) max_{V≠0} (Bk⊤ Bk V · V)/‖V‖² = μ/ε,

where μ is the largest eigenvalue of the symmetric matrix Bk⊤ Bk. However, we also note that for all V ≠ 0 such that Bk V = 0, we have

(Aε V · V)/‖V‖² = (Ak V · V)/‖V‖² ≤ max_{V≠0, Bk V=0} (Ak V · V)/‖V‖² = σ.

It follows that the smallest eigenvalue of Aε satisfies

λε = min_{V≠0} (Aε V · V)/‖V‖² ≤ min_{V≠0, Bk V=0} (Aε V · V)/‖V‖² ≤ σ.

It follows that

cond(Aε) = Λε/λε ≥ μ/(εσ).
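The growth of cond(Aε) like 1/ε is easy to observe numerically. The following sketch (Python with NumPy; Ak and Bk are random stand-ins for the actual Stokes matrices, so the constants are illustrative only) measures the condition number of the penalized matrix and the residual divergence as ε decreases:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small analogue of the penalized system [2.56]: Ak symmetric positive
# definite, Bk of full row rank. These are NOT the finite difference
# matrices of the text, only random stand-ins with the same structure.
n, m = 8, 3
G = rng.standard_normal((n, n))
Ak = G @ G.T + n * np.eye(n)          # SPD by construction
Bk = rng.standard_normal((m, n))      # full row rank (generically)
F = np.ones(n)

for eps in [1e-2, 1e-4, 1e-6]:
    Aeps = Ak + (1.0 / eps) * Bk.T @ Bk
    Ueps = np.linalg.solve(Aeps, F)
    print(f"eps={eps:.0e}  cond(Aeps)={np.linalg.cond(Aeps):.2e}  "
          f"|Bk Ueps|={np.linalg.norm(Bk @ Ueps):.2e}")
```

As ε shrinks, the constraint Bk Uε ≈ 0 is better satisfied, but the condition number grows like μ/(εσ), exactly the trade-off discussed above.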

In order to approximate the constraint accurately, we must take small values of ε; however, the conditioning of the linear system that determines the approximate solution Uε is then degraded. This difficulty is illustrated in Figure 2.48. We test these methods numerically in the case where 0 < x < 2 and k = 3. The applied force is given by

f(x) = 2 sin(x),  g(x) = cos(3x + 0.08).

Figure 2.41 is obtained by directly solving the linear system. Under the conditions of this simulation (200 discretization points), the condition number is of the order of 8 · 10⁴. Figures 2.42–2.44 show the velocity (U(x, y), V(x, y)) and pressure P(x, y) fields. Figure 2.45 shows the velocity field; we clearly see the formation of “vortices” that turn in opposite directions (for this figure, k = 3; the number of structures clearly depends on the parameter k). We obtain similar curves using the Arrow–Hurwicz algorithm with, for example, the parameters ρ = 3 · 10⁻⁵ and α = 1.1/ρ. Note that ρ must be chosen small enough to ensure convergence; if we attempt simulations with larger parameters, the calculation is unstable and can produce outlying values. This phenomenon is explained by the fact that condition [2.51] becomes more constraining when the grid is finer, as shown by the evolution of the norm of Ak in Figure 2.46. With these parameters, we obtain a solution such that the relative error ‖Ak Un+1 − F(J) + Bk⊤ Pn‖/‖F(J)‖ and the norm of the discrete divergence ‖Bk Un+1‖ are less than 10⁻⁵, within approximately 33,000 iterations. With the same stopping criteria, the Uzawa algorithm provides the result in 18 iterations for α = 1.1. Therefore, even if it is necessary to solve linear systems at each iteration, this algorithm is faster. Figure 2.47 shows the velocity fields obtained using the penalization method with ε = (10⁻²)^k, for k varying from 1 to 6. Figure 2.48 shows the degradation of the conditioning as ε → 0, while the incompressibility condition is better satisfied. This discussion clearly shows that the choice of a numerical method and the evaluation of its practical efficiency rely on subtle arguments.
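As an illustration of the preceding comparison, here is a minimal sketch of the Uzawa iteration [2.53] on a synthetic saddle-point system (NumPy assumed; the matrices are random stand-ins, and the step α is chosen here from the Schur complement rather than fixed at 1.1 as in the text):

```python
import numpy as np

# Uzawa iterations for  Ak U + Bk^T P = F,  Bk U = 0:
# exact solve in U at each step, gradient ascent on the multiplier P.
rng = np.random.default_rng(1)
n, m = 10, 4
G = rng.standard_normal((n, n))
Ak = G @ G.T + n * np.eye(n)                 # symmetric positive definite
Bk = rng.standard_normal((m, n))             # full row rank (generically)
F = rng.standard_normal(n)

S = Bk @ np.linalg.solve(Ak, Bk.T)           # Schur complement Bk Ak^{-1} Bk^T
ev = np.linalg.eigvalsh(S)
alpha = 2.0 / (ev.max() + ev.min())          # safe step in ]0, 2/lambda_max(S)[

P = np.zeros(m)
for it in range(5000):
    U = np.linalg.solve(Ak, F - Bk.T @ P)    # "descend": exact minimization in U
    P = P + alpha * (Bk @ U)                 # "climb" along the constraint
    if np.linalg.norm(Bk @ U) < 1e-12:       # discrete divergence as stopping test
        break
print(it, np.linalg.norm(Bk @ U))
```

Each iteration costs one solve with Ak, but the discrete divergence decreases geometrically, consistent with the small iteration counts reported above.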

[Panels: speed and pressure profiles as functions of x.]

Figure 2.41. Simulation of Stokes problem (200 discretization points). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Figure 2.42. Simulation of Stokes problem (200 discretization points): horizontal component of the velocity field. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.43. Simulation of Stokes problem (200 discretization points): vertical component of the velocity field. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.44. Simulation of Stokes problem (200 discretization points): pressure field. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.45. Simulation of Stokes problem: velocity field (k = 1)


Figure 2.46. Evolution of the norm of Ak as a function of the number of discretization points J . For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.47. Velocity fields using the penalization method for different values of ε. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 2.48. Conditioning of the penalized linear system and divergence of the velocity field as a function of ε

3

Numerical Simulations of Partial Differential Equations: Time-dependent Problems

There are no impenetrable citadels, only badly attacked fortresses.
Pierre Choderlos de Laclos (Les liaisons dangereuses)

3.1. Diffusion equations

We are interested in the numerical approximation of the heat equation in one spatial dimension,

∂t u = κ ∂²xx u + f.   [3.1]

When the problem is posed on all space (x ∈ R), it can be solved simply by the Fourier transform (see [GOU 11, section 6.4]); with uInit the initial value for [3.1], we obtain

u(t, x) = H(t)uInit(x) + ∫₀ᵗ H(t − s)f(s, x) ds

where t ↦ H(t) represents the family of operators defined by

H(t)v(x) = ∫_R (1/√(4πκt)) exp(−|x − y|²/(4κt)) v(y) dy.

In particular, we easily establish:

Mathematics for Modeling and Scientific Computing, First Edition. Thierry Goudon. © ISTE Ltd 2016. Published by ISTE Ltd and John Wiley & Sons, Inc.


PROPOSITION 3.1.– If uInit ≥ 0 and f ≥ 0, then u ≥ 0. If f = 0, t ↦ u(t, ·) is continuous over [0, ∞) with values in L²(R), and we even obtain u ∈ C∞([τ, T] × R) for all 0 < τ < T < ∞.

NOTE 3.1.– The analysis of diffusion–reaction problems, of the type ∂t u = ∂²xx u + F(u),

can prove to be surprising. An iterative scheme shows, with the help of Banach's fixed point theorem, the existence and uniqueness of a solution defined over a sufficiently short time interval. However, we can convince ourselves that in such problems there is a competition between the regularizing effect of the heat equation ∂t u = ∂²xx u and the “explosive” effect of the ODE u′(t) = F(u(t)). More information on this fascinating topic can be found in [GIG 85, WEI 85] and, more recently, [MER 98].

The numerical analysis of this equation combines both the difficulties of stationary problems (the choice of infinite-dimensional norms) and the stability issues related to temporal evolution. The simplest scheme that we can imagine relies on an explicit Euler-type discretization for the time derivative and a finite difference discretization, with a uniform mesh of step size Δx > 0, for the diffusion operator ∂²xx. We thus define a sequence u_j^n by

u_j^0 = uInit(jΔx),
u_j^{n+1} = u_j^n + κ Δt/Δx² (u_{j+1}^n − 2u_j^n + u_{j−1}^n) + Δt f_j^{n+1},   [3.2]

with f_j^{n+1} = f((n+1)Δt, jΔx).

LEMMA 3.1.– We define the consistency error of scheme [3.2] as

ε_j^{n+1} = [u((n+1)Δt, jΔx) − u(nΔt, jΔx)]/Δt − κ [u(nΔt, (j+1)Δx) − 2u(nΔt, jΔx) + u(nΔt, (j−1)Δx)]/Δx² − f((n+1)Δt, jΔx),

(t, x) ↦ u(t, x) being the solution of [3.1]. Then, there exists a constant C > 0 such that |ε_j^n| ≤ C(Δt + Δx²).


The constant C appearing in this statement depends upon the final time 0 < T < ∞ and the infinite norms of the derivatives of u with respect to x and t, to a relatively high order. We shall see that the stability analysis depends critically upon the norm used. The following example shows that, without sufficient precaution, nonsensical numerical approximations can be obtained. We take the initial value u_j^0 = 0 if j ∉ {j₀, ..., j₀ + k₀} and u_j^0 = 1 if j ∈ {j₀, ..., j₀ + k₀}, and additionally f = 0. The numerical parameters are such that κΔt = Δx². In other words, we have

u_j^{n+1} = u_{j+1}^n + u_{j−1}^n − u_j^n.

The numerical solution is described by the following table:

n = 0 :  0 0 0 0  1  1  1  1 1 ... 1 1  1  1  1 0 0 0 0
n = 1 :  0 0 0 1  0  1  1  1 1 ... 1 1  1  0  1 0 0 0
n = 2 :  0 0 1 −1 2  0  1  1 ... 1 1 0  2 −1  1 0 0
n = 3 :  0 1 −2 4 −3 3  0  1 1 ... 1 1 0 3 −3 4 −2 1 0
n = 4 :  1 −3 7 −9 10 −6 4 0 1 ... 1 0 4 −6 10 −9 7 −3 1

This result is obviously absurd, the analysis having shown us that the solution u(t, x) takes values within [0, 1]. We shall see that the numerical parameters Δt and Δx must satisfy certain conditions for scheme [3.2] to give a meaningful result.

3.1.1. L2 stability (von Neumann analysis) and L∞ stability: convergence
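Before turning to the stability analysis, the divergent table above can be reproduced with a few lines of Python (a sketch; the block of ones and the periodic wrap-around are illustrative choices):

```python
# kappa*dt = dx^2 turns the explicit scheme [3.2] (with f = 0) into
#   u^{n+1}_j = u^n_{j+1} + u^n_{j-1} - u^n_j.
def step(u):
    J = len(u)
    return [u[(j + 1) % J] + u[(j - 1) % J] - u[j] for j in range(J)]

u = [0] * 8 + [1] * 16 + [0] * 8      # the block of ones of the example
for n in range(4):
    u = step(u)
print(u[4:13])    # -> [1, -3, 7, -9, 10, -6, 4, 0, 1], as in the table for n = 4
```

The oscillations grow at every step even though the data lie in [0, 1], which is exactly the absurd behavior displayed in the table.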

In order to study von Neumann stability, we restrict ourselves to the periodic problem: we consider equation [3.1] posed on R with the condition u(t, x) = u(t, x + 2π). Evidently, f is also subject to this periodicity constraint. The segment [0, 2π] is discretized by the uniform subdivision 0 < Δx < ... < (J + 1)Δx = 2π. The matrix of finite differences corresponding to the operator d²/dx² is written in the form

A# = (1/Δx²) ·
[ −2   1   0  ···  0   1 ]
[  1  −2   1  ···  0   0 ]
[       ⋱   ⋱   ⋱        ]
[  0   0  ···  1  −2   1 ]
[  1   0  ···  0   1  −2 ]   [3.3]

where the corner entries account for the periodicity.


Scheme [3.2] takes the matrix form

(1/Δt)(U^{n+1} − U^n) = κ A# U^n + F^{n+1},

where F^{n+1} is the vector with coordinates f_1^{n+1}, ..., f_J^{n+1}. We associate the sequence u_j^n defined by [3.2] with the sequence of functions defined over [0, 2π[ by

x ↦ u^n(x) = Σ_j u_j^n 1_{jΔx ≤ x < (j+1)Δx}(x).

The finite difference matrix associated with the operator d²/dx² with homogeneous Dirichlet conditions is, in turn,

A = (1/Δx²) ·
[ −2   1              ]
[  1  −2   1          ]
[      ⋱   ⋱   ⋱      ]
[          1  −2   1  ]
[              1  −2  ].   [3.5]

One method of defining stability is to require that (I + κΔtA)^k remain uniformly bounded for k ∈ N. We have seen that the eigenvalues of (I + κΔtA) (with Δx = 1/(J + 1)) are

1 − 4κΔt (J + 1)² sin²(kπ/(2(J + 1)))

for k ∈ {1, ..., J}. The spectrum of (I + κΔtA) lies within ]−1, +1[ when κΔt (J + 1)² ≤ 1/2.
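This eigenvalue formula, and the loss of the spectral bound when the threshold is exceeded, can be checked numerically (NumPy sketch with illustrative values of J and Δt):

```python
import numpy as np

# Eigenvalues of I + kappa*dt*A for the Dirichlet Laplacian, dx = 1/(J+1).
J, kappa = 20, 1.0
dx = 1.0 / (J + 1)
A = (np.diag(-2.0 * np.ones(J)) + np.diag(np.ones(J - 1), 1)
     + np.diag(np.ones(J - 1), -1)) / dx**2

for c in [0.4, 0.6]:                       # c = kappa*dt*(J+1)^2, around 1/2
    dt = c * dx**2 / kappa
    lam = np.linalg.eigvalsh(np.eye(J) + kappa * dt * A)
    k = np.arange(1, J + 1)
    formula = 1.0 - 4.0 * c * np.sin(k * np.pi / (2 * (J + 1)))**2
    assert np.allclose(np.sort(lam), np.sort(formula))
    print(f"kappa*dt*(J+1)^2 = {c:.1f}, min eigenvalue = {lam.min():.3f}")
```

For c = 0.4 the spectrum stays inside ]−1, 1[; for c = 0.6 the most negative eigenvalue drops below −1 and powers of the iteration matrix blow up.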
The constant C > 0 depends upon ‖u‖_{C⁴([0,T]×[0,2π])}. Let us start by studying the von Neumann stability.

PROPOSITION 3.4.– Scheme [3.6] is unconditionally L2-stable.

PROOF.– This time we have

û^n(k) + Δt F̂^{n+1}(k) = Σ_{j=0}^J e^{−i(j+1/2)kΔx} (2 sin(kΔx/2)/k) [u_j^{n+1} − κ Δt/Δx² (u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1})]

= Σ_{j=0}^J e^{−i(j+1/2)kΔx} (2 sin(kΔx/2)/k) u_j^{n+1} [1 − κ Δt/Δx² (e^{−ikΔx} − 2 + e^{ikΔx})]

= Σ_{j=0}^J e^{−i(j+1/2)kΔx} (2 sin(kΔx/2)/k) u_j^{n+1} [1 + 4κ Δt/Δx² sin²(kΔx/2)].

The amplification factor is now expressed as

0 ≤ M(k) = 1 / (1 + 4κ Δt/Δx² sin²(kΔx/2)) ≤ 1

and we have

û^{n+1}(k) = M(k)(û^n(k) + Δt F̂^{n+1}(k)).

We immediately deduce from this the stability estimate. □

 sup

Δt→0, Δx→0 ⎩n∈{0,...,N }

⎫ J

 n  2 ⎬ uj − u(nΔt, jΔx) Δx = 0. ⎭ j=0

In order to establish the same result for the L∞ norm, we will exploit the fact that I − κΔtA is an M −matrix (see definition 2.3 and theorem 2.7). We will use this to deduce that scheme [3.6] preserves the maximum principle, is L∞ −stable and, finally, is convergent. P ROPOSITION 3.5.– For all Δt, Δx > 0, scheme [3.6] exhibits the following properties: i) if uInit ≥ 0 and f ≥ 0 then unj ≥ 0 for all n, j; ii) if f = 0 and 0 ≤ uInit ≤ M , then 0 ≤ unj ≤ M for all n, j; iii) the scheme is unconditionally L∞ −stable : supj |unj | ≤ supj |u0j | + nΔt f ∞ .

Numerical Simulations of Partial Differential Equations: Time-dependent Problems

279

P ROOF.– Point i) is a direct consequence of the fact that I − κΔtA is an M −matrix. We then write the scheme in the form  Δt  n+1 Δt n+1 Δt n+1 1 + 2κ u −κ u −κ u = unj + Δtf ((n + 1)Δt, jΔx) Δx2 j Δx2 j+1 Δx2 j−1 which leads to  Δt  Δt 1 + 2κ sup |un+1 | ≤ sup |unk | + Δt f ∞ + 2κ sup |un+1 |, j k 2 Δx2 Δx j k k 

from which we deduce iii) and hence ii).

C OROLLARY 3.2.– Scheme [3.6] is convergent in the sense that, with fixed 0 < T = N Δt < ∞, we have  lim

Δt→, Δx→0

sup n∈{0,...,N }, j

|unj − u(nΔt, jΔx)|

= 0.

We can try to improve the order of consistency in time, to obtain a scheme of order 2 in both variables. One idea could be to use a centered-time scheme, approximating un+1 −un−1

∂t u by j 2Δt j , but, besides the issue of defining the first iteration, the scheme thus obtained is unstable (see Figures 3.5). Another approach could be to reproduce the method adopted for ODEs and which leads to the Crank–Nicolson scheme. More generally, we construct, for 0 ≤ θ ≤ 1, the following scheme u0j = uInit (jΔx), Δt  n+1 θ(uj+1 − 2un+1 + un+1 j j−1 ) Δx2  + (1 − θ)(unj+1 − 2unj + unj−1 ) .

un+1 = unj + κ j

[3.7]

The numerical solution is updated by solving the linear system   (I − θκΔtA)U n+1 = I + (1 − θ)κΔtA U n . L EMMA 3.3.– The consistency error of scheme [3.7] is of order 1 in time and of order 2 in space for 0 ≤ θ ≤ 1 and θ = 1/2, and is of order 2 in both variables for θ = 1/2: u((n + 1)Δt, jΔx) − u(nΔt, jΔx) Δt u((n + 1)Δt, (j + 1)Δx) − 2u((n + 1)Δt, jΔx) + u((n + 1)Δt, (j − 1)Δx) −κθ Δx2 u(nΔt, (j + 1)Δx) − 2u(nΔt, jΔx) + u(nΔt, (j − 1)Δx) −κ(1 − θ) Δx2 εnj =

280

Mathematics for Modeling and Scientific Computing

satisfying |εnj | ≤ C(Δt + Δx2 )

if 0 ≤ θ ≤ 1 and θ = 1/2,

|εnj | ≤ C(Δt2 + Δx2 )

if θ = 1/2.

k In this statement, the constant C depends upon the L∞ norms of ∂t,x u up until a sufficiently large index k ≥ 4. The interest in this family of schemes, in particular for the case θ = 1/2, is bolstered by the following stability analysis.

P ROPOSITION 3.6.– Scheme [3.7] is unconditionally L2 −stable for all 1/2 ≤ θ ≤ 1; Δt 1 for 0 ≤ θ < 1/2, it is stable under the condition κ Δx 2 ≤ 2(1−2θ) . P ROOF.– For this scheme, we obtain the relation J

e−i(j+1/2)kΔx

j=0

=

J

 2 sin(kΔx/2) n+1  Δt 2 uj 1 + 4θκ sin (kΔx/2) k Δx2

e−i(j+1/2)kΔx

j=0

 2 sin(kΔx/2) n  Δt 2 uj 1 − 4(1 − θ)κ sin (kΔx/2) k Δx2

n+1 (k).  +ΔtF

The amplification factor then becomes Δt sin2 (kΔx/2) Δx2 Δt 1 + 4θκ sin2 (kΔx/2) Δx2

1 − 4(1 − θ)κ M (k) =

and we have n+1 (k)| ≤ |M (k)u n (k)| + Δt| |u F n+1 (k)|, 2 Δt since 1 + 4θκ Δx 2 sin (kΔx/2) ≥ 1. We always have M (k) ≤ 1, but the estimate M (k) ≥ −1 sets



Δt sin2 (kΔx/2) (1 − 2θ) ≤ 2. Δx2

This relationship is always satisfied when θ ≥ 1/2; otherwise, we need to set Δt 2 4κ Δx  2 ≤ 1−2θ . By repeating the arguments discussed above, we can deduce the convergence of the scheme in terms of the L2 norm.
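The behavior of the amplification factor can be visualized with a short computation (Python sketch; the value of κΔt/Δx² is illustrative):

```python
import numpy as np

# Amplification factor of the theta-scheme [3.7].
def M_factor(theta, c, xi):
    # c = kappa*dt/dx^2, xi = k*dx/2
    s = 4.0 * c * np.sin(xi)**2
    return (1.0 - (1.0 - theta) * s) / (1.0 + theta * s)

xi = np.linspace(0.0, np.pi / 2, 201)
c = 5.0                                    # far above the explicit limit 1/2
print(np.abs(M_factor(0.5, c, xi)).max())  # Crank-Nicolson: stays <= 1
print(np.abs(M_factor(0.0, c, xi)).max())  # explicit Euler: exceeds 1, unstable
```

Note that for θ = 1/2 and large c, M(k) approaches −1 for the highest frequencies: the scheme remains L2-stable, but these barely damped modes are the source of the parasitic oscillations observed later on discontinuous data.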


COROLLARY 3.3.– Let 0 ≤ θ ≤ 1; if 0 ≤ θ < 1/2, we also assume that κ Δt/Δx² ≤ 1/(2(1 − 2θ)). Then, scheme [3.7] is convergent in the sense that, with fixed 0 < T = NΔt < ∞, we have

lim_{Δt→0, Δx→0} sup_{n∈{0,...,N}} Σ_{j=0}^J |u_j^n − u(nΔt, jΔx)|² Δx = 0.

In the case where θ = 1/2, the error is of order 2: it is controlled by Δt² + Δx². However, this result applies to the L2 norm only: the scheme behaves well with respect to the L∞ norm only at the price of new constraints, again requiring a time step of order Δx², as demonstrated by the following stability analysis.

PROPOSITION 3.7.– Suppose that κ Δt/Δx² ≤ 1/(2(1 − θ)). Then, scheme [3.7] exhibits the following properties: i) if uInit ≥ 0 and f ≥ 0, then u_j^n ≥ 0 for all n, j; ii) if f = 0 and 0 ≤ uInit ≤ M, then 0 ≤ u_j^n ≤ M for all n, j; iii) the scheme is L∞-stable: sup_j |u_j^n| ≤ sup_j |u_j^0| + nΔt ‖f‖∞. In this case, it is convergent in the sense that, with fixed 0 < T = NΔt < ∞, we have

lim_{Δt→0, Δx→0} sup_{n∈{0,...,N}, j} |u_j^n − u(nΔt, jΔx)| = 0.

PROOF.– In matrix form, the scheme is written as

(I − θκΔtA) U^{n+1} = (I + (1 − θ)κΔtA) U^n + Δt F^{n+1/2}.   [3.8]

We have seen that (I − θκΔtA) is an M-matrix; hence, the components of U^{n+1} are positive, provided that the components of the right-hand side of [3.8] are also positive. If f ≥ 0 and the components of U^n are positive, this is satisfied when all entries of (I + (1 − θ)κΔtA) are positive, which requires 1 − 2(1 − θ)κ Δt/Δx² ≥ 0. □

3.1.3. Finite element discretization

We construct a numerical approximation

u_h(t, x) = Σ_{n=0}^{N} Σ_{j=1}^{M} u_j^n 1_{[nΔt,(n+1)Δt[}(t) φ_j(x)


where the functions φ1, ..., φJ form a basis of the approximation space Vh. (Here, the index h and the number of basis functions J are related and depend upon both the refinement of the discretization and the degree of the polynomials used to define the approximation over each element.) We define a linear system satisfied by u_j^n by making use of the variational structure, adapted to the time-dependent problem. For ϕ ∈ C_c^∞(]0, 1[), we have

d/dt ∫₀¹ u(t, x)ϕ(x) dx + ∫₀¹ a(x) (d/dx)u(t, x) (d/dx)ϕ(x) dx = ∫₀¹ f(t, x)ϕ(x) dx.

The discrete problem is obtained by using the explicit Euler scheme to describe the evolution of t ↦ ∫₀¹ u(t, x)ϕ_j(x) dx. We obtain the following linear system

(1/Δt)(M U^{n+1} − M U^n) = A U^n

with

A_{i,j} = ∫ (d/dx)φ_i(x) (d/dx)φ_j(x) dx  and  M_{i,j} = ∫ φ_i(x) φ_j(x) dx,

which we call the stiffness matrix and the mass matrix, respectively. Thus, even with an explicit scheme, U^{n+1} is not directly determined and we must solve a linear system to update the unknown values. For example, for the P1 elements on a uniform mesh with step size h > 0, we have M_{i,j} = 0 if |i − j| > 1, and

M_{i,i+1} = ∫_{ih}^{(i+1)h} [((i+1)h − x)/h] [(x − ih)/h] dx = h(−1/3 + 1/2) = h/6,

M_{i,i} = ∫_{(i−1)h}^{ih} |(x − (i−1)h)/h|² dx + ∫_{ih}^{(i+1)h} |((i+1)h − x)/h|² dx = 2h/3.


In order to avoid the resolution of a linear system like this, we sometimes use an approximate expression of the matrix M which allows direct determination of U^{n+1}. The trapeze formula, which is explained further in Appendix 2, approximates ∫₀¹ g(y) dy by h Σ_{j=0}^{J−1} (g(jh) + [g((j+1)h) − g(jh)]/2) = (h/2) Σ_{j=0}^{J−1} (g(jh) + g((j+1)h)). By using this approximation to evaluate the coefficients of the mass matrix, we define

M̃_{i,i} = (h/2) Σ_{j=0}^{J−1} (|φ_i(jh)|² + |φ_i((j+1)h)|²) = (h/2) · 2|φ_i(ih)|² = h,

M̃_{i,j} = (h/2) Σ_{k=0}^{J−1} (φ_i(kh)φ_j(kh) + φ_i((k+1)h)φ_j((k+1)h)) = 0  for i ≠ j.

We therefore settle with setting

(1/Δt)(M̃ U^{n+1} − M̃ U^n) = (h/Δt)(U^{n+1} − U^n) = A U^n.

Such techniques, consisting of simplifying the expression of the mass matrix, are referred to as mass condensation (mass-lumping) methods. Now, the resolution of the linear system is trivial since M̃ is a diagonal matrix (and even scalar in this example, where we actually retrieve the finite difference scheme). Beyond simplifying the linear system determining U^{n+1}, another advantage of this method is that it restores the maximum principle (see [THO 06, theorem 15.5]). However, it can also degrade the convergence order [CHA 12].

3.1.4. Numerical illustrations

We start by studying the case where the domain of calculation is the interval [0, 2π] with periodic conditions and the initial data is

uInit(x) = 3 + 0.8 sin(3x) + 1.2 sin(5x).   [3.9]
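As an aside on implementation, the P1 mass matrix of the previous section and its trapeze-lumped approximation can be assembled in a few lines (NumPy sketch; the mesh size is illustrative, and the boundary rows differ from the interior value h):

```python
import numpy as np

# P1 mass matrix on a uniform mesh of step h, and its lumped version obtained
# by summing each row (equivalent here to the trapeze-rule approximation).
J, h = 10, 0.1
M = h * (np.diag(2.0 / 3.0 * np.ones(J))
         + np.diag(np.ones(J - 1) / 6.0, 1)
         + np.diag(np.ones(J - 1) / 6.0, -1))
M_lumped = np.diag(M.sum(axis=1))
print(np.diag(M_lumped))   # h at interior nodes, 5h/6 at the two boundary rows
```

With the lumped (diagonal) matrix, the update of U^{n+1} costs only a division per node, which is why this trick is popular for explicit time stepping.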

In this case, Fourier analysis explicitly yields the solution

u(t, x) = 3 + 0.8 e^{−3²κt} sin(3x) + 1.2 e^{−5²κt} sin(5x)

where κ > 0 is the diffusion coefficient (see Figure 3.1 for the graph of this solution at final time T = 0.1). Here, we have fixed κ = 3. First, we compare the explicit and implicit Euler methods and the Crank–Nicolson method, with a finite difference discretization over a uniform mesh of step size Δx > 0, under the condition 2κ Δt/Δx² < 1, which then ensures stability in the L2 norm as well as the L∞ norm


for these three finite difference schemes. Specifically, here, we set 2κ Δt/Δx² = 0.9. We also compare these numerical solutions with a discrete Fourier transform method, ifft(exp(−κξ²t) fft(uInit)), where ξ spans a discrete set of frequencies (see Figure 3.2). The final time is T = 0.1 and we test uniform meshes with N = n × N₀ points, n ∈ {1, ..., 7} and N₀ = 24. We evaluate the evolution of the L2 and L∞ norms of the difference between the numerical solution and the exact solution as a function of the discretization (see Figure 3.3). We indeed observe convergence of order 2 as a function of the spatial discretization length.

Figure 3.1. Exact solution for the heat equation
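The discrete Fourier transform method mentioned above can be written directly (NumPy sketch with the data [3.9]; for this smooth initial value the spectral solution matches the exact one to machine precision):

```python
import numpy as np

# Spectral solution of the periodic heat equation, compared with the exact
# solution for the initial data [3.9] (kappa = 3, T = 0.1, as in the text).
kappa, T, N = 3.0, 0.1, 128
x = 2.0 * np.pi * np.arange(N) / N
u0 = 3.0 + 0.8 * np.sin(3 * x) + 1.2 * np.sin(5 * x)

k = np.fft.fftfreq(N, d=1.0 / N)             # integer frequencies
uT = np.real(np.fft.ifft(np.exp(-kappa * k**2 * T) * np.fft.fft(u0)))

exact = (3.0 + 0.8 * np.exp(-9 * kappa * T) * np.sin(3 * x)
             + 1.2 * np.exp(-25 * kappa * T) * np.sin(5 * x))
print(np.abs(uT - exact).max())              # of the order of machine precision
```

This is the reference against which the error curves of Figure 3.3 are computed.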

If we free ourselves of the CFL condition, the explicit Euler scheme becomes unstable (another illustration is given later on). Hence, we only continue the comparison of the implicit Euler and Crank–Nicolson methods, for the same meshes, but this time fixing the time step Δt = 2 · 10⁻³. This time we observe, as shown in Figure 3.4, that the error curves as a function of spatial step size for the implicit Euler method stop decreasing; it is the time error that starts to dominate. For the Crank–Nicolson method, over the meshes considered, we again observe a decrease of the error of order 2. Figure 3.5 shows the results obtained with the scheme

(u_j^{n+1} − u_j^{n−1})/(2Δt) = κ (u_{j+1}^n − 2u_j^n + u_{j−1}^n)/Δx².

[Panels: Euler (explicit and implicit), Crank–Nicolson, Fourier.]

Figure 3.2. Numerical solutions (finite difference) for the heat equation under CFL condition. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

[Panels: (a) L2 norm, (b) L∞ norm.]

Figure 3.3. Evolution of the error as a function of mesh under CFL: the explicit Euler (×), implicit Euler (◦), and Crank–Nicolson (blue) finite difference schemes and straight line of slope 2 (green), log–log scale. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

[Panels: (a) L2 norm, (b) L∞ norm.]

Figure 3.4. Evolution of the error as a function of mesh without obeying the CFL: the implicit Euler (◦) and Crank–Nicolson (×) finite difference schemes and straight line of slope 2, log–log scale. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip    

Figure 3.5. Simulation with an unstable scheme: (a) t = 0.075, (b) t = 0.078

The exact solution is shown in Figure 3.6. This scheme is reliably of order 2 in time and space, but it is unstable, as the figures show, even though the time step Δt has been chosen such that 2κ Δt/Δx² < 1.


Figure 3.6. Exact solution at time t = 0.0078

Finally, we test the behavior of these finite difference schemes for the initial value

uInit(x) = 10 × (1_{[−1/4,−1/8]}(x) + 1_{[1/8,1/4]}(x)) + (1/2) × 1_{[−1/8,1/8]}(x)   [3.10]

over the interval [−1, +1]. We are still working with periodic boundaries and κ = 3. The spatial discretization uses 256 equally spaced points (i.e. Δx = 0.0078). The time step is 5 · 10⁻⁴ and we stop the simulation at time T = 0.01 (20 time iterations). The result is presented in Figure 3.7. We clearly observe the instability of the explicit Euler scheme under these conditions. We also note parasitic oscillations for the Crank–Nicolson scheme, revealing the lack of stability in the L∞ norm for the numerical conditions adopted. However, these oscillations fade as time progresses. We meet the same problem with a finite element discretization strategy. Figures 3.8 and 3.9 show the solutions obtained at the final time T = 0.01 for the initial value [3.10] with the explicit, implicit and Crank–Nicolson P1 methods, or the explicit, explicit with mass condensation, implicit and Crank–Nicolson P2 methods. The boundary conditions are homogeneous Dirichlet conditions and we have taken for this simulation a time step Δt = 4Δx²/(2κ), which violates the stability condition obtained by finite difference methods of analysis (we recall that finite difference and finite element schemes in one dimension over a uniform mesh are equivalent). The explicit schemes are unstable, even with mass condensation; the Crank–Nicolson scheme produces oscillations (which are even more noticeable for shorter simulation times). It is interesting to test shorter time steps: with Δt = ηΔx²/(2κ), 0 < η < 1, the explicit P1 scheme is stable in these conditions (it is in fact the finite difference scheme), but the explicit P2 schemes can continue to be unstable.

[Panels: Euler (explicit and implicit), Crank–Nicolson, Fourier.]

Figure 3.7. Numerical solutions (finite difference) for the heat equation for discontinuous data, not obeying the stability condition. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Finally, we proceed to a convergence analysis by numerical experimentation, returning to the periodic framework with the initial condition [3.9], for which we know the solution explicitly. Evidently, with P1 methods we retrieve the results of Figures 3.3 and 3.4. The error curves for the implicit and Crank–Nicolson P2 schemes are reproduced in Figure 3.10. We clearly observe an improvement of one order of convergence with the Crank–Nicolson scheme. We impose here a time step Δt = 4Δx²/(2κ). With the implicit scheme, the error behavior is dominated by the consistency error in time and is therefore proportional to Δx², while with the Crank–Nicolson scheme it remains dominated by the spatial error, which behaves as Δx³. However, for short times, the Crank–Nicolson scheme can produce oscillations and losses of positivity of the solution, a fault which can become intolerable with regard to the physical interpretation of the unknown quantity and its eventual use in a coupled problem.


Figure 3.8. Numerical solutions (P1 finite element) for the heat equation with discontinuous data, not obeying the DF stability condition. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Let us finish with a few practical remarks about the implementation of these algorithms. To construct the stiffness and mass matrices, we follow the procedure explained in section 2.4.2, bearing in mind that the elemental mass matrix, whose coefficients are defined by the integrals

∫₀¹ ψ_p(x) ψ_q(x) dx,  p, q ∈ {0, 1/2, 1},

is expressed as

Mloc = (1/30) ·
[  4   2  −1 ]
[  2  16   2 ]
[ −1   2   4 ].
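The elemental P2 mass matrix can be recovered by exact integration of the shape functions (NumPy sketch; the polynomials are the standard P2 basis on the reference interval [0, 1]):

```python
import numpy as np

# Elemental P2 mass matrix on [0, 1] by exact polynomial integration of the
# shape functions psi_0, psi_{1/2}, psi_1 attached to the nodes 0, 1/2, 1.
psi = [np.poly1d([2.0, -3.0, 1.0]),    # psi_0   = (1 - x)(1 - 2x)
       np.poly1d([-4.0, 4.0, 0.0]),    # psi_1/2 = 4x(1 - x)
       np.poly1d([2.0, -1.0, 0.0])]    # psi_1   = x(2x - 1)

Mloc = np.array([[(pi * pj).integ()(1.0) for pj in psi] for pi in psi])
print(30.0 * Mloc)    # -> [[4, 2, -1], [2, 16, 2], [-1, 2, 4]]
```

Since the products are polynomials of degree 4, `np.poly1d` integration gives the coefficients exactly, confirming the matrix (1/30)[4 2 −1; 2 16 2; −1 2 4] quoted above.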


Figure 3.9. Numerical solutions (P2 finite element) for the heat equation with discontinuous data, not obeying the DF stability condition. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

The scaling to obtain the mass matrix associated with the mesh of uniform step size Δx is achieved by multiplying by Δx (while the stiffness matrix is divided by Δx). Although satisfying the Dirichlet conditions does not pose any difficulties, the construction of the matrices for the periodic problem is a bit more delicate. The (symmetric) structure of the P2 matrices reproduces these periodic conditions: the matrices are banded, with additional non-zero entries in the corners that couple the first and last degrees of freedom.

(There are either 3 or 5 non-zero coefficients per row and per column; for the stiffness matrix, the sum of the coefficients of each row is zero. In this expression, the first coordinate corresponds to a node at abscissa 0, and the last coordinate to a node at abscissa 2π − Δx/2.)
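These structural properties can be checked by assembling the periodic matrices from the elemental blocks; a sketch under our own conventions (endpoint and midpoint unknowns are interleaved, and the helper `assemble` is an illustrative choice, not taken from the text):

```python
import numpy as np

# Elemental P2 mass and stiffness matrices on the reference element [0, 1],
# for the nodes (0, 1/2, 1).
M_loc = np.array([[4, 2, -1], [2, 16, 2], [-1, 2, 4]]) / 30.0
K_loc = np.array([[7, -8, 1], [-8, 16, -8], [1, -8, 7]]) / 3.0

def assemble(J, dx):
    """Periodic P2 matrices on a uniform mesh of J cells (2J unknowns)."""
    n = 2 * J
    M, K = np.zeros((n, n)), np.zeros((n, n))
    for i in range(J):
        idx = [2 * i, 2 * i + 1, (2 * i + 2) % n]   # endpoint, midpoint, endpoint
        M[np.ix_(idx, idx)] += dx * M_loc           # mass scales with dx
        K[np.ix_(idx, idx)] += K_loc / dx           # stiffness scales with 1/dx
    return M, K

M, K = assemble(8, 2 * np.pi / 8)
nnz = (np.abs(M) > 1e-14).sum(axis=1)
print(sorted(set(int(v) for v in nnz)))      # [3, 5]: 3 or 5 non-zeros per row
print(float(np.abs(K.sum(axis=1)).max()))    # ~ 0: stiffness row sums vanish
```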


Figure 3.10. Error curves for the simulation of the heat equation by P2 finite element schemes (implicit ◦, Crank–Nicolson ×, straight line of gradient 2)

3.2. From transport equations towards conservation laws

3.2.1. Introduction

Finite volume methods are especially suited to conservation laws, because these equations correspond exactly to balances (of mass, energy, etc.), where the gains and losses in a domain arise from exchanges at the interfaces of the domain. Let us consider the example of passive particles immersed in a fluid: the particles (for example, a pollutant dispersed in a river) are described by their mass density u(t, x). Thus, for any domain Ω ⊂ R^N, the integral ∫_Ω u(t, x) dx represents the mass contained at time t in the domain Ω. The fluid is described by the velocity field (t, x) ↦ a(t, x) ∈ R^N, which we take here to be known. Any variation in mass is due to gains and losses across the boundary ∂Ω, and we obtain the following balance:

d/dt ∫_Ω u(t, x) dx = −∫_{∂Ω} u(t, x) a(t, x) · ν(x) dσ(x)

where ν(x) represents the unit vector pointing outward from Ω at a point x ∈ ∂Ω (such that a(t, x) · ν(x) > 0 corresponds to a flux flowing from the inside to the outside of Ω,

292

Mathematics for Modeling and Scientific Computing

hence a mass loss, which explains the negative sign in front of the right-hand side of the balance; see also note 3.3). We have used dσ to designate the Lebesgue measure over ∂Ω. By integrating by parts, we deduce that

∫_Ω (∂t u + div_x(au))(t, x) dx = 0,

using div_x to represent the differential operator defined over vector-valued functions U : x ∈ R^N ↦ U(x) = (U₁(x), ..., U_N(x)) ∈ R^N by

div_x(U)(x) = Σ_{j=1}^N ∂_{x_j} U_j(x).

This relation being satisfied for all domains Ω, we thus obtain ∂t u + div_x(au) = 0. Provided that the velocity a is a "reasonably regular" given function, the analysis of this equation poses little difficulty. As we shall see later, it can be solved by the method of characteristics, which shows that there exists a unique regular solution (let us say of class C¹, for C¹ initial data). Considerable difficulties arise when the velocity depends upon the unknown u itself. We will study a variety of numerical approximation strategies for this equation. In particular, finite volume approaches correspond very closely to the principles that led to the equation, as a consequence of the balance of exchanges across the interfaces.

NOTE 3.3.– In dimension one, the domain Ω is a segment [α, β], and the balance relation simply takes the form

d/dt ∫_α^β u(t, x) dx = −(a(t, β)u(t, β) − a(t, α)u(t, α)).

A velocity which is positive at β or negative at α leads to mass loss.

From now on, we are generally interested in a scalar quantity u : [0, T] × R → R, which satisfies the equation

∂t u + ∂x f(u) = 0.   [3.11]


We will discuss later the properties of the solutions according to the nature of the flux u ↦ f(u), the linear case corresponding to f(u) = a × u, for a given function a : [0, T] × R → R. Two classic nonlinear examples are:

– Burgers' equation, where f(u) = u². This relates to a very simplified model of the equations of gas dynamics (but one which preserves certain fundamental difficulties), introduced by [BAT 15, BUR 40].

– The Lighthill–Whitham–Richards equation for road traffic [LIG 55, RIC 56], where f(u) = u × V, with V = VM(1 − u/uM). Here, u(t, x) represents a density of vehicles at time t and position x on a highway: the integral ∫_a^b u(t, x) dx gives the number of vehicles present at time t over the section [a, b]. The function V describes the velocity of these vehicles: if the density u is low, the vehicles drive at a velocity close to the maximum speed VM > 0; if the density u approaches the maximum density uM > 0, the velocity tends towards zero¹.

In these examples, we must not confuse V(u), such that f(u) = u × V(u), with the derivative of the flux f′(u). The formalism can also be adapted to apply to systems where u becomes an unknown vector quantity u : [0, T] × R → R^D. We will focus in particular on the Euler equations of gas dynamics:

– Isentropic Euler equations: u = (ρ, J) ∈ R², with

f(u) = ⎛ J              ⎞
       ⎝ J²/ρ + κρ^γ    ⎠

where κ > 0 and γ > 1 (or isothermal Euler equations if γ = 1).

– Compressible Euler equations: u = (ρ, J, E) ∈ R³, with

f(u) = ⎛ J                   ⎞
       ⎜ J²/ρ + p(ρ, e)      ⎟
       ⎝ (E + p(ρ, e)) J/ρ   ⎠

where p is a function with positive values, satisfying certain properties imposed by the physics of the situation, and e = E/ρ − J²/(2ρ²) (the internal energy).

1. This is a very simple model; a richer description in the form of a system of conservation laws has been proposed more recently [AW 00] and has since become a very active area of research.


As mentioned above, the analysis of such nonlinear systems is extremely delicate and is still a very active area of research. We shall present slightly later some of the technical difficulties these problems present; more on the subject can be found in [BEN 07, DAF 10, GOD 91, SER 96a, SER 96b]. In order to treat such equations numerically, it is important to properly understand the principles of discretization on linear models, for which we can make use of explicit solutions that allow meaningful comparisons.

Finite volume discretization of these equations is based upon the following principles. We will use the following notations (refer back to Figure 2.6):

– x_j represents a mesh node,
– C_j = [x_{j−1/2}, x_{j+1/2}] represents the control volume centered upon the point x_j,
– h_j = x_{j+1/2} − x_{j−1/2} is the size of the control volume C_j,
– h = max{h_j, j ∈ Z}, h_min = min{h_j, j ∈ Z},
– Δt denotes the time step, which we will assume to be constant, and t^n = nΔt.

We will focus on families of such meshes, parameterized by the step size h. In order to study the convergence properties of the approximations, we will assume that the elements of these families are such that h_min = kh, for a given 0 < k ≤ 1 independent of h (such a mesh family is said to be quasi-uniform). The unknown value U^n_j is seen as an approximation of the average over the cell C_j, that is (1/h_j) ∫_{C_j} u(t^n, y) dy. By integrating equation [3.11] over the domain [t^n, t^{n+1}] × C_j, we obtain

(1/h_j) ∫_{C_j} u(t^{n+1}, y) dy − (1/h_j) ∫_{C_j} u(t^n, y) dy + (1/h_j) ∫_{t^n}^{t^{n+1}} (f(u(s, x_{j+1/2})) − f(u(s, x_{j−1/2}))) ds = 0.

We draw inspiration from this formula to update the unknown values:

(1/Δt)(U^{n+1}_j − U^n_j) + (1/h_j)(F^n_{j+1/2} − F^n_{j−1/2}) = 0

[3.12]

where the numerical flux F^n_{j+1/2} is thus interpreted as an approximation of the physical flux between times t^n and t^{n+1} over the interface x_{j+1/2}, namely

(1/Δt) ∫_{t^n}^{t^{n+1}} f(u(s, x_{j+1/2})) ds.
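The update [3.12] translates directly into code; a minimal sketch on a uniform periodic mesh, where the numerical flux function is a parameter (the helper `fv_step` and the upwind example flux are our own illustrative choices):

```python
import numpy as np

def fv_step(U, dt, h, num_flux):
    """One finite volume update [3.12] of the cell averages U^n on a uniform
    periodic mesh; num_flux(u_left, u_right) is the numerical flux F^n_{j+1/2}."""
    F = num_flux(U, np.roll(U, -1))          # one flux per interface x_{j+1/2}
    return U - dt / h * (F - np.roll(F, 1))

# Example with the upwind flux for the linear flux f(u) = c*u, c > 0.
c, h, dt = 1.0, 0.01, 0.005
U = np.maximum(0.0, np.sin(2 * np.pi * np.arange(100) * h))
V = fv_step(U, dt, h, lambda ul, ur: c * ul)
print(abs(V.sum() - U.sum()))   # ~ 0: the update is conservative by construction
```

Whatever the flux function, the differences F^n_{j+1/2} − F^n_{j−1/2} telescope on a periodic mesh, so the total mass Σ_j h_j U_j is conserved exactly.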


The design of approximation schemes for [3.11] thus depends upon an adequate definition of the numerical fluxes F^n_{j+1/2}. To determine the flux at x_{j+1/2}, the simplest constructions involve only the adjacent cells, centered at x_j and x_{j+1}; we thus obtain a three-point scheme, with a numerical flux of the form

F^n_{j+1/2} = F(U^n_{j+1}, U^n_j).

For example, the centered flux

F^n_{j+1/2} = (1/2)(f(U^n_{j+1}) + f(U^n_j))

at first seems applicable since, over a uniform mesh of step size h, it leads to an approximation of the spatial derivative by

(1/h)(F^n_{j+1/2} − F^n_{j−1/2}) = (f(U^n_{j+1}) − f(U^n_{j−1}))/(2h),

with a consistency error of order 2. However, we shall see that this scheme behaves badly and cannot be used.

3.2.2. Transport equation: method of characteristics

We start by addressing the linear case f(u) = a × u, i.e.:

∂t u + ∂x(au) = 0,   u(t = 0, x) = uInit(x),

[3.13]

with a given function a : R⁺ × R → R. We can solve this partial differential equation explicitly by using arguments from the theory of ordinary differential equations. We introduce the characteristic curves X(s; t, x), solutions of the differential equation

(d/ds) X(s; t, x) = a(s, X(s; t, x)),   X(t; t, x) = x.

[3.14]

We interpret X(s; t, x) as the position occupied at time s by a particle having left position x at time t. It is important to bear this interpretation in mind to build an intuition on the behavior of the solutions. In particular, this makes clear that X(t; s, X(s; r, x)) = X(t; r, x).
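This composition rule is easy to test numerically by integrating [3.14]; a sketch with a hand-rolled RK4 integrator and the illustrative velocity a(s, x) = sin x (both are our own choices):

```python
import math

def X(s, t, x, n=2000):
    """Approximate the characteristic X(s; t, x) of dX/ds = a(s, X), X(t) = x,
    by n steps of the classical RK4 method from time t to time s."""
    a = lambda s, x: math.sin(x)     # illustrative velocity field
    h = (s - t) / n
    for k in range(n):
        sk = t + k * h
        k1 = a(sk, x)
        k2 = a(sk + h / 2, x + h / 2 * k1)
        k3 = a(sk + h / 2, x + h / 2 * k2)
        k4 = a(sk + h, x + h * k3)
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

r, s, t, x = 0.0, 0.7, 1.3, 0.5
print(abs(X(t, s, X(s, r, x)) - X(t, r, x)))   # ~ 0: the composition rule holds
```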


If u is a solution of [3.13], by applying the chain rule we obtain

(d/ds)[u(s, X(s; t, x))] = (∂t u)(s, X(s; t, x)) + (d/ds)X(s; t, x) · (∂x u)(s, X(s; t, x))
= ((∂t + a ∂x) u)(s, X(s; t, x))
= −(u ∂x a)(s, X(s; t, x)).

It then remains to integrate this simple linear equation, satisfied by s ↦ u(s, X(s; t, x)), to obtain

u(t, x) = uInit(X(0; t, x)) J(0; t, x),

[3.15]

with

J(s; t, x) = exp(∫_t^s (∂x a)(σ, X(σ; t, x)) dσ).   [3.16]

In fact, J(s; t, x) is nothing other than the Jacobian of the change of variable y = X(s; t, x) (or, equivalently, x = X(t; s, y)). By differentiating [3.14], we obtain that Y(s; t, x) = ∂x X(s; t, x) satisfies

(d/ds) Y(s; t, x) = (∂x a)(s, X(s; t, x)) Y(s; t, x),   Y(t; t, x) = 1.

Again, we have a simple linear differential equation, whose solution is

Y(s; t, x) = exp(∫_t^s (∂x a)(σ, X(σ; t, x)) dσ).

We thus have J(s; t, x) = ∂x X(s; t, x) and dy = J(s; t, x) dx. In particular, we deduce from this that [3.13] is indeed a conservation equation, since the change of variable y = X(0; t, x), dy = J(0; t, x) dx, gives

∫_R u(t, x) dx = ∫_R uInit(X(0; t, x)) J(0; t, x) dx = ∫_R uInit(y) dy.
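For a concrete non-constant example, take a(t, x) = x, for which X(0; t, x) = x e^{−t} and J(0; t, x) = e^{−t}; a short check (the Gaussian data and the trapezoid quadrature are our own illustrative choices) confirms that the integral of the solution given by [3.15]–[3.16] does not depend on time:

```python
import numpy as np

u_init = lambda z: np.exp(-z**2)            # smooth, integrable initial data

def u(t, x):
    # formula [3.15]-[3.16] for a(t, x) = x: X(0;t,x) = x e^{-t}, J(0;t,x) = e^{-t}
    return u_init(x * np.exp(-t)) * np.exp(-t)

x = np.linspace(-30.0, 30.0, 6001)
def mass(values):
    return float(((values[:-1] + values[1:]) / 2 * np.diff(x)).sum())  # trapezoid rule

m0, m1 = mass(u(0.0, x)), mass(u(1.5, x))
print(m0, m1)   # both close to sqrt(pi) ~ 1.7725: the integral is conserved
```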

This treatment assumes that the characteristic curves X(s; t, x) are well defined, i.e. that we can apply the Cauchy–Lipschitz theorem to solve [3.14]. For example, we assume that:

h1) a is of class C¹,

h2) there exists C > 0 such that, for all t, x ∈ R, we have |a(t, x)| ≤ C(1 + |x|).


In the particular case where a(t, x) = c ∈ R is constant, we simply have X(s; t, x) = x + c(s − t) and J(s; t, x) = 1.

THEOREM 3.3.– Under the hypotheses h1)–h2), for any initial data uInit ∈ C¹, problem [3.13] has a unique solution u of class C¹, defined by [3.15]–[3.16]. If uInit ∈ L¹(R), then u is integrable and has the same integral as uInit. If ∂x a ∈ L∞ and 0 ≤ uInit(x) ≤ M for all x ∈ R, then, for all 0 ≤ t ≤ T < ∞ and x ∈ R, we have 0 ≤ u(t, x) ≤ M e^{T ‖∂x a‖∞}.

It is interesting to note that formula [3.15]–[3.16] retains its meaning when uInit is a function of L^p(R), not necessarily continuous. Hence, we can naturally extend the notion of solution to this context.

THEOREM 3.4.– Let uInit ∈ L¹ ∩ L∞(R). Then, (t, x) ↦ u(t, x) defined by [3.15]–[3.16] is the unique function satisfying

lim_{t→t0} ∫_R |u(t, x) − u(t0, x)| dx = 0

[3.17]

and u is a weak solution of [3.13], in the sense that for all functions φ ∈ C¹_c([0, ∞[ × R), we have

−∫_0^∞ ∫_R u(t, x)(∂t φ(t, x) + a(t, x) ∂x φ(t, x)) dx dt − ∫_R uInit(x) φ(0, x) dx = 0.

PROOF.– We start by establishing a continuity property of [3.13] with respect to the initial data. Setting u_i(t, x) = u^i_Init(X(0; t, x)) J(0; t, x) for i ∈ {1, 2}, we have

∫_R |u₁(t, x) − u₂(t, x)| dx = ∫_R |u¹_Init(X(0; t, x)) − u²_Init(X(0; t, x))| J(0; t, x) dx = ∫_R |u¹_Init(y) − u²_Init(y)| dy.

As a consequence, it will be sufficient to prove [3.17] for continuous data with compact support, uInit , to then extend it to integrable functions by a density argument (see [GOU 11, theorem 4.13]). The demonstration then makes use of the continuity properties of the characteristic curves. The general theory of differential equations ensures that the mapping (t, s, x) → X(t; s, x) is of class C 1 . We also note that when the variables t, s, x describe a compact set, the trajectories X(t; s, x) remain in a compact set of R.


The continuity in time with values in L¹ is demonstrated by a change of variables and by invoking the Lebesgue theorem. Indeed, we write the following inequalities:

∫_R |u(t, x) − u(t0, x)| dx = ∫_R |uInit(X(0; t, x)) J(0; t, x) − uInit(X(0; t0, x)) J(0; t0, x)| dx
≤ ∫_R |uInit(X(0; t, x)) − uInit(X(0; t0, x))| J(0; t, x) dx + ∫_R |uInit(X(0; t0, x))| |J(0; t, x)/J(0; t0, x) − 1| J(0; t0, x) dx   [3.18]
≤ ∫_R |uInit(y) − uInit(X(0; t0, X(t; 0, y)))| dy + ∫_R |uInit(y)| |J(0; t, X(t0; 0, y))/J(0; t0, X(t0; 0, y)) − 1| dy.

We then recall that, for fixed t0 and y, on the one hand, we have

lim_{t→t0} X(0; t0, X(t; 0, y)) = X(0; t0, X(t0; 0, y)) = X(0; 0, y) = y

and, on the other hand, we can establish that

lim_{t→t0} J(0; t, X(t0; 0, y)) = lim_{t→t0} exp(∫_t^0 (∂x a)(σ, X(σ; t, X(t0; 0, y))) dσ)
= exp(∫_{t0}^0 (∂x a)(σ, X(σ; t0, X(t0; 0, y))) dσ)
= J(0; t0, X(t0; 0, y))
= exp(∫_{t0}^0 (∂x a)(σ, X(σ; 0, y)) dσ)
= 1/J(t0; 0, y).

Thus, the integrands in [3.18] do indeed tend towards 0 as t → t0. Furthermore, with t0 ∈ R fixed, when t varies within, for example, [t0 − 1, t0 + 1] and y within a compact set K of R (as it happens, the support of uInit), the quantity J(0; t, X(t0; 0, y))/J(0; t0, X(t0; 0, y)) = J(0; t, X(t0; 0, y)) J(t0; 0, y) is bounded by a constant (which depends upon the compact set K). Finally,

{y ∈ R, X(0; t0, X(t; 0, y)) ∈ K, t0 − 1 ≤ t ≤ t0 + 1} = {y ∈ R, X(t; 0, y) ∈ X(t0; 0, K), t0 − 1 ≤ t ≤ t0 + 1} = {X(0; t, X(t0; 0, K)), t0 − 1 ≤ t ≤ t0 + 1}

is again a compact set of R. These remarks ensure the domination required to be able to conclude with the Lebesgue

Numerical Simulations of Partial Differential Equations: Time-dependent Problems

299

theorem, and we deduce from this that [3.17] is satisfied for continuous uInit with compact support, and hence finally for uInit ∈ L¹(R).

We then show that u is a weak solution by calculating

∫_0^∞ ∫_R u(t, x)(∂t φ(t, x) + a(t, x) ∂x φ(t, x)) dx dt
= ∫_0^∞ ∫_R uInit(X(0; t, x))(∂t φ(t, x) + a(t, x) ∂x φ(t, x)) J(0; t, x) dx dt
= ∫_0^∞ ∫_R uInit(y)(∂t φ(t, X(t; 0, y)) + a(t, X(t; 0, y)) ∂x φ(t, X(t; 0, y))) dy dt
= ∫_R uInit(y) ∫_0^∞ (d/dt)[φ(t, X(t; 0, y))] dt dy
= −∫_R uInit(y) φ(0, y) dy.

To demonstrate uniqueness, we must prove that if u satisfies the weak formulation with uInit = 0, then u is null almost everywhere. This conclusion is based upon a duality argument, sometimes referred to as the Hölmgren method. Let ψ ∈ C^∞_c((0, +∞) × R) be such that ψ(t, ·) = 0 for t ≥ T. We solve ∂t φ + a ∂x φ = ψ with the final data φ|_{t=T} = 0. More precisely, by integrating along the characteristics, we obtain

φ(t, x) = ∫_T^t ψ(σ, X(σ; t, x)) dσ ∈ C¹_c([0, +∞) × R).

We can then use this test function in the weak formulation, which leads to

∫_0^∞ ∫_R u(t, x) ψ(t, x) dx dt = 0.

With this satisfied for all ψ ∈ C^∞_c((0, +∞) × R), we conclude that u(t, x) = 0 almost everywhere. □

3.2.3. Upwinding principles: upwind scheme

As mentioned previously, the construction of a numerical scheme for [3.13] seeks to "copy" the mass balance over the control volumes. The difficulty therefore lies in defining an appropriate approximation of the flux over the interface x_{j+1/2}. The upwind flux looks for the information where it comes from, with regard to the


equation: from the left if a(t, x) ≥ 0, and from the right if a(t, x) ≤ 0. The scheme is thus adapted to the direction of information propagation by setting

F^n_{j+1/2} = a(t^n, x_{j+1/2}) U^n_{j+1}   if a(t^n, x_{j+1/2}) ≤ 0,
F^n_{j+1/2} = a(t^n, x_{j+1/2}) U^n_j   if a(t^n, x_{j+1/2}) ≥ 0.

[3.19]

LEMMA 3.4.– The upwind scheme [3.19] preserves positivity under the CFL condition ‖a‖∞ Δt/h_min ≤ 1/2: if u⁰_j ≥ 0 for all j, then u^n_j ≥ 0 for all n, j.

PROOF.– By linearity, it suffices to prove that u^{n+1}_j remains positive when u^n_{j−1}, u^n_j, u^n_{j+1} are positive. We set a^n_{j+1/2} = a(t^n, x_{j+1/2}). We recall the notation

[a]⁺ = max(a, 0) = (a + |a|)/2,   [a]⁻ = min(a, 0) = (a − |a|)/2,

such that a = [a]⁺ + [a]⁻ and |a| = [a]⁺ − [a]⁻. The scheme is expressed as

u^{n+1}_j = u^n_j − (Δt/h_j)([a^n_{j+1/2}]⁺ u^n_j + [a^n_{j+1/2}]⁻ u^n_{j+1} − [a^n_{j−1/2}]⁺ u^n_{j−1} − [a^n_{j−1/2}]⁻ u^n_j)
= u^n_j (1 − (Δt/h_j)([a^n_{j+1/2}]⁺ − [a^n_{j−1/2}]⁻)) − (Δt/h_j)[a^n_{j+1/2}]⁻ u^n_{j+1} + (Δt/h_j)[a^n_{j−1/2}]⁺ u^n_{j−1}.

The terms involving u^n_{j−1} and u^n_{j+1} contribute positively. Moreover, we have 0 ≤ [a^n_{j+1/2}]⁺ − [a^n_{j−1/2}]⁻ ≤ 2‖a‖∞. Thus, the CFL constraint ensures that

1 − (Δt/h_j)([a^n_{j+1/2}]⁺ − [a^n_{j−1/2}]⁻) ≥ 0,

which implies that u^{n+1}_j ≥ 0. □

NOTE 3.4.– The proof shows that, in the specific case where the velocity a has constant sign, the upwind scheme is L∞-stable under the CFL condition ‖a‖∞ Δt/h ≤ 1. In this case, we show in fact that if m ≤ u⁰_j ≤ M for all j, then m ≤ u^n_j ≤ M for all n, j, as for the continuous problem. When the sign of a varies, the condition that guarantees the preservation of positivity is slightly more restrictive.
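A minimal sketch of scheme [3.19] with a velocity of variable sign (the choices a(x) = sin x, the mesh and the discontinuous data are ours) confirms the preservation of positivity, and of total mass, under the CFL condition of lemma 3.4:

```python
import numpy as np

J, T = 200, 1.0
h = 2 * np.pi / J
x = (np.arange(J) + 0.5) * h            # cell centers
a = np.sin(np.arange(J + 1) * h)        # velocity at the interfaces x_{j+1/2}
dt = 0.5 * h / np.abs(a).max()          # CFL: ||a||_inf * dt / h = 1/2

u = np.where(np.abs(x - np.pi) < 0.5, 1.0, 0.0)   # discontinuous data
mass0 = u.sum() * h
t = 0.0
while t < T:
    left = np.r_[u[-1], u]              # cell value to the left of each interface
    right = np.r_[u, u[0]]              # cell value to the right (periodic)
    F = np.where(a >= 0, a * left, a * right)   # upwind flux [3.19]
    u = u - dt / h * (F[1:] - F[:-1])
    t += dt

print(u.min(), abs(u.sum() * h - mass0))   # positivity and mass are preserved
```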

3.2.4. Linear transport at constant speed: analysis of FD and FV schemes

In this section, we consider the particular case where the velocity a(t, x) = c ∈ R is constant. The solution of [3.13] is then simply u(t, x) = uInit(x − ct). Of course, with the solution being explicitly known, the numerical simulation of this problem is of little interest in itself. However, the detailed analysis of this simple case, and the direct comparison between numerical solutions and the exact solution, uncover certain guiding principles that will govern the design of numerical methods for more complex problems.

From a finite difference perspective, the unknown value ū^n_j is interpreted as an approximation of u(nΔt, x_j). It seems quite natural, to start with, to use an Euler scheme with a centered discretization of the spatial variable:

(ū^{n+1}_j − ū^n_j)/Δt = −c (ū^n_{j+1} − ū^n_{j−1})/(x_{j+1} − x_{j−1}).

[3.20]

In this context, the FD-upwind scheme takes the form

(ū^{n+1}_j − ū^n_j)/Δt = −c (ū^n_j − ū^n_{j−1})/h_{j−1/2}   if c > 0,
(ū^{n+1}_j − ū^n_j)/Δt = −c (ū^n_{j+1} − ū^n_j)/h_{j+1/2}   if c < 0.   [3.21]

We note the difference with the FV-upwind scheme, which uses the same formulae, but where h_j = x_{j+1/2} − x_{j−1/2} replaces h_{j+1/2} = x_{j+1} − x_j (see [3.12]). The two schemes match, however, in the case of a uniform mesh of constant size h. The first element of analysis of these finite difference schemes lies in the concept of consistency, which generalizes the definition introduced in lemma 3.1.

DEFINITION 3.3.– We say that a scheme

ū^{n+1}_j = Φ(Δt, ū^n_{j−k}, ..., ū^n_{j+k}, h_{j−k+1/2}, ..., h_{j+k+1/2})

is consistent (in the sense of finite differences) with an equation (E) if, for u : (t, x) ↦ u(t, x) a regular solution of (E), the sequence defined by

ε^n_j = (u(t^{n+1}, x_j) − Φ(Δt, u(t^n, x_{j−k}), ..., u(t^n, x_{j+k}), h_{j−k+1/2}, ..., h_{j+k+1/2}))/Δt


satisfies

lim_{Δt,h→0} sup_{0 ≤ t^n ≤ T} sup_j |ε^n_j| = 0.

NOTE 3.5.– We define the concept of consistency in the L^p norm in the same manner, by replacing sup_j |ε^n_j| with (Σ_j h_j |ε^n_j|^p)^{1/p}.

In what follows, we should be aware that the consistency analysis to be carried out makes use of the regularity of the exact solution u of the problem (which we do not always know). The centered scheme thus seems attractive since, over a uniform mesh, it is consistent of order 2 in space and of order 1 in time. Indeed, with h_{j+1/2} = x_{j+1} − x_j, we obtain

ε^n_j = (u(t^{n+1}, x_j) − u(t^n, x_j))/Δt + c (u(t^n, x_{j+1}) − u(t^n, x_{j−1}))/(x_{j+1} − x_{j−1})
= (1/Δt)(∂t u(t^n, x_j) Δt + ∂²tt u(t^n, x_j) Δt²/2)
+ (c/(x_{j+1} − x_{j−1}))(∂x u(t^n, x_j)(h_{j+1/2} + h_{j−1/2}) + ∂²xx u(t^n, x_j)(h²_{j+1/2} − h²_{j−1/2})/2 + ∂³xxx u(t^n, x_j)(h³_{j+1/2} + h³_{j−1/2})/6) + R^n_j   [3.22]
= (Δt/2) ∂²tt u(t^n, x_j) + c ∂²xx u(t^n, x_j)(h²_{j+1/2} − h²_{j−1/2})/(2(h_{j+1/2} + h_{j−1/2})) + c ∂³xxx u(t^n, x_j)(h³_{j+1/2} + h³_{j−1/2})/(6(h_{j+1/2} + h_{j−1/2})) + R^n_j

where the final term can be estimated by making use of the regularity of u: for example, we have

|R^n_j| ≤ ‖c ∂⁴xxxx u‖∞ h³ + ‖∂³ttt u‖∞ Δt².

When the mesh is uniform, h²_{j+1/2} − h²_{j−1/2} = 0 and, in this case, we deduce that, for the centered scheme [3.20],

sup_j |ε^n_j| ≤ C(h² + Δt).


We conduct a similar analysis for the upwind scheme, for example for c > 0:

ε^n_j = (u(t^{n+1}, x_j) − u(t^n, x_j))/Δt + c (u(t^n, x_j) − u(t^n, x_{j−1}))/(x_j − x_{j−1})
= (1/Δt)(∂t u(t^n, x_j) Δt + ∂²tt u(t^n, x_j) Δt²/2) + (c/h_{j−1/2})(∂x u(t^n, x_j) h_{j−1/2} − ∂²xx u(t^n, x_j) h²_{j−1/2}/2) + R^n_j   [3.23]
= (Δt/2) ∂²tt u(t^n, x_j) − (c h_{j−1/2}/2) ∂²xx u(t^n, x_j) + R^n_j.

We deduce from this the following consistency estimate of order 1: for the upwind scheme [3.21],

sup_j |ε^n_j| ≤ C(h + Δt).

In these estimates, the constant C depends upon the norm of u in C^k for a suitable k. However, the centered scheme is hindered by a stability fault, which is immediately noticeable in the simulations.

THEOREM 3.5.– The centered scheme [3.20] is neither L∞-stable nor L²-stable.

PROOF.– We only consider the situation where the mesh size is constant, h_j = h, and we study the case of periodic boundary conditions. We use the same reasoning as for the heat equation to define the L² stability: the index j describes {0, ..., J} and the scheme is completed by the periodicity relations. The sequence u^n_j defined by the scheme is then associated with the sequence of functions defined over [0, 2π[ by

x ↦ u^n(x) = Σ_{j=0}^J u^n_j 1_{jh ≤ x < (j+1)h}.

The amplification factor of the centered scheme is M(k) = 1 − ic(Δt/h) sin(kh), so that |M(k)|² = 1 + c²(Δt/h)² sin²(kh) > 1 whenever sin(kh) ≠ 0: whatever the ratio Δt/h, there exist modes k with |M(k)| > 1. Figure 3.12 illustrates how this stability fault materializes. □

THEOREM 3.6.– Over a uniform mesh of step size h > 0, the upwind scheme [3.21] is L∞-stable and L²-stable under the condition |c|Δt/h ≤ 1.

PROOF.– The amplification factor of scheme [3.21] is expressed as

for c < 0: M(k) = 1 + 2i|c|(Δt/h) e^{ikh/2} sin(kh/2) = 1 − 2|c|(Δt/h) sin²(kh/2) + 2i|c|(Δt/h) cos(kh/2) sin(kh/2),

for c > 0: M(k) = 1 − 2i|c|(Δt/h) e^{−ikh/2} sin(kh/2) = 1 − 2|c|(Δt/h) sin²(kh/2) − 2i|c|(Δt/h) cos(kh/2) sin(kh/2).

It follows that

|M(k)|² = 1 + 4c²(Δt/h)² sin⁴(kh/2) − 4|c|(Δt/h) sin²(kh/2) + 4c²(Δt/h)² sin²(kh/2)(1 − sin²(kh/2))
= 1 + 4|c|(Δt/h) sin²(kh/2)(|c|(Δt/h) − 1).

The CFL condition thus ensures |M(k)| ≤ 1. The L∞ stability follows from the fact that, when the CFL condition is satisfied, the iterate ū^{n+1}_j is written as a convex combination of ū^n_j, ū^n_{j−1}, ū^n_{j+1}. (If the meshes are quasi-uniform but not uniform, the condition for L∞-stability becomes |c|Δt/(kh) ≤ 1.) □

As a consequence of these properties, we can establish the convergence of the upwind scheme for linear transport at constant speed.

THEOREM 3.7.– Let uInit ∈ C²_#([0, 2π]). Let u be the associated solution of [3.13], with constant velocity a(t, x) = c. Finally, let 0 < T < ∞. Then, the sequence ū^n_j defined by [3.21], for a family of quasi-uniform meshes such that the condition |c|Δt/(kh) ≤ 1 holds, satisfies

max_{0 ≤ t^n ≤ T} sup_j |ū^n_j − u(t^n, x_j)| ≤ C T h

where C depends on ‖u‖_{C²}.

PROOF.– We set e^n_j = ū^n_j − u(t^n, x_j). We only treat the case c > 0, the demonstration being identical for negative velocities. We deduce from the recurrence relation

(1/Δt)(e^{n+1}_j − e^n_j) + (c/h_{j−1/2})(e^n_j − e^n_{j−1}) = −ε^{n+1}_j

that

|e^{n+1}_j| ≤ |e^n_j|(1 − cΔt/h_{j−1/2}) + (cΔt/h_{j−1/2})|e^n_{j−1}| + Δt|ε^{n+1}_j| ≤ sup_k |e^n_k| + CΔt h.

To obtain this inequality, we have used both the CFL condition and the consistency estimate. An argument by recurrence then leads to the stated assertion. □
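The amplification factors used in the proofs of theorems 3.5 and 3.6 can be evaluated directly; a sketch (λ = cΔt/h is the CFL number, and the value 0.9 is an arbitrary illustrative choice):

```python
import numpy as np

lam = 0.9                                  # CFL number c*dt/h, with c > 0
kh = np.linspace(0.0, 2 * np.pi, 721)      # kh ranges over one period

M_centered = 1 - 1j * lam * np.sin(kh)                           # scheme [3.20]
M_upwind = 1 - 2j * lam * np.exp(-1j * kh / 2) * np.sin(kh / 2)  # scheme [3.21]

print(np.abs(M_centered).max())   # > 1: the centered scheme amplifies some modes
print(np.abs(M_upwind).max())     # <= 1: upwind is stable under the CFL condition
```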


The consistency analysis [3.22] of the centered scheme reveals the origin of the instability and offers ideas for how to remedy it. Indeed, we can rewrite [3.22] in the following manner:

ε^n_j = (u(t^{n+1}, x_j) − u(t^n, x_j))/Δt + (c/2h)(u(t^n, x_j + h) − u(t^n, x_j − h))
= (∂t u + c ∂x u + (Δt/2) ∂²tt u)(t^n, x_j) + R̃^n_j(Δt, h)   [3.24]
= (∂t u + c ∂x u + c²(Δt/2) ∂²xx u)(t^n, x_j) + R̃^n_j(Δt, h)

since ∂t u = −c ∂x u and thus ∂²tt u = −c ∂x ∂t u = c² ∂²xx u. In this relationship, the remaining term is evaluated as |R̃^n_j(Δt, h)| ≤ C(Δt² + h²), where C depends upon the norm of u in C⁴. This demonstrates that scheme [3.20] defines a consistent approximation of order 2 in time and space of the PDE

∂t v + c ∂x v + c²(Δt/2) ∂²xx v = 0.

This is an "anti-diffusion" equation, which is inherently unstable². For the upwind scheme, the same analysis leads to

ε^n_j = (∂t u + c ∂x u + (Δt/2) ∂²tt u − (ch/2) ∂²xx u)(t^n, x_j) + R̃^n_j(Δt, h)
= (∂t u + c ∂x u + (c/2)(cΔt − h) ∂²xx u)(t^n, x_j) + R̃^n_j(Δt, h),

using ∂²tt u = c² ∂²xx u. Under the CFL condition, the coefficient δ = (c/2)(h − cΔt) is indeed positive and the scheme is consistent of order 2 in time and space with the diffusion equation

∂t v + c ∂x v − δ ∂²xx v = 0.

Following this type of argument, we can look to introduce an artificial diffusion term (in this context, one speaks of "artificial viscosity") so as to compensate for the anti-diffusive character of scheme [3.20]. We assume that the condition |c|Δt/h < 1 holds. We then choose ε > 0, depending on the discretization parameters, so that c²Δt/2 = (cΔt/h)² h²/(2Δt) ≤ h²/(2Δt) ≤ ε, and we reach the modified equation

∂t ṽ + c ∂x ṽ − (ε − c²Δt/2) ∂²xx ṽ = 0.

2. We can convince ourselves of this by examining the Fourier transform of the solution, which satisfies ∂t v̂(t, ξ) = (icξ + kξ²) v̂(t, ξ), with k = c²Δt/2. It follows that v̂(t, ξ) = exp(t(icξ + kξ²)) v̂(0, ξ). As k > 0, we deduce from this that ‖v̂(t, ·)‖_{L²} = ‖v(t, ·)‖_{L²} → ∞ as t → ∞, with exponentially fast growth.

With ε = h²/(2Δt), we obtain the following scheme:

(ū^{n+1}_j − ū^n_j)/Δt = −(c/2h)(ū^n_{j+1} − ū^n_{j−1}) + (1/2Δt)(ū^n_{j+1} − 2ū^n_j + ū^n_{j−1}).

This is the Lax–Friedrichs scheme. We can verify that the scheme is stable and consistent of order 1. In order to properly understand the influence of the "numerical diffusion" term, we can focus on the convection–diffusion equation at constant velocity c > 0:

∂t u_δ + c ∂x u_δ = δ ∂²xx u_δ,

[3.25]

where 0 < δ ≪ 1. In fact, we note that v(t, x) = u_δ(t/δ, x + ct/δ) simply satisfies ∂t v = ∂²xx v, an equation for which we know the expression of the solution for given initial data. Thus, if u_δ|_{t=0} = uInit, we obtain

u_δ(t, x) = ∫_R uInit(y) exp(−|x − ct − y|²/(4δt)) dy/√(4πδt).

Notably, even though the initial data are discontinuous, the solution u_δ is C^∞ for all times t > 0. We recover the regularizing effect of the heat equation. In numerical simulations, we shall indeed see the profile of the approximate solution regularize for schemes that induce such a diffusion effect (see Figures 3.13 and 3.14). Nevertheless, we observe that

∫_R |u_δ(t, x) − uInit(x − ct)| dx
= ∫_R |∫_R (uInit(y) − uInit(x − ct)) e^{−|x−ct−y|²/(4δt)} dy/√(4πδt)| dx
≤ ∫_R ∫_R |uInit(y) − uInit(x − ct)| e^{−|x−ct−y|²/(4δt)} dy/√(4πδt) dx
≤ ∫_R ∫_R |uInit(X − √(4δt) Y) − uInit(X)| e^{−Y²} dY/√π dX

with the change of variables X = x − ct and Y = (x − ct − y)/√(4δt). The continuity of translations in L¹ and the Lebesgue theorem allow us to conclude that, for fixed t, when δ tends towards 0, u_δ(t, ·) converges in L¹(R) towards u(t, ·), the solution of the transport equation at velocity c with initial data uInit.


We can go slightly further by choosing ε so as to cancel the diffusion term, and consequently the terms of order 1 in the consistency analysis. The resulting scheme is written as

(ū^{n+1}_j − ū^n_j)/Δt = −(c/2h)(ū^n_{j+1} − ū^n_{j−1}) + (Δt/2)(c/h)²(ū^n_{j+1} − 2ū^n_j + ū^n_{j−1}).

This is the Lax–Wendroff scheme [LAX 60]. Accordingly, it is consistent of order 2 in time and space. We compare these different schemes, given the following details:

– the simulation is conducted over the interval [0, 2], with periodic conditions;
– the velocity is constant, c = 1.37;
– the time and space steps are constant, and related by the condition cΔt = 0.9h.

We conduct a first simulation with the initial data (see Figure 3.11)

uInit(x) = 10 exp(−100|x − 1|²).

[3.26]

We recall that the exact solution is known and is expressed as uInit(x − ct).


Figure 3.11. Initial data [3.26]
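The comparison described below can be reproduced with a few lines; a sketch of the experiment (the resolution J = 150 and the helper `run` are our own illustrative choices; periodicity, c = 1.37 and cΔt = 0.9h follow the setup of the text):

```python
import numpy as np

c, L, T = 1.37, 2.0, 0.5
J = 150
h = L / J
dt = 0.9 * h / c                        # condition c*dt = 0.9*h of the text
x = np.arange(J) * h
u_init = lambda z: 10 * np.exp(-100 * np.abs(z - 1.0)**2)   # data [3.26]

def run(scheme):
    """March the chosen scheme on the periodic mesh up to (roughly) time T."""
    u, t, lam = u_init(x), 0.0, c * dt / h
    while t + dt <= T + 1e-12:
        um, up = np.roll(u, 1), np.roll(u, -1)
        if scheme == "upwind":
            u = u - lam * (u - um)
        elif scheme == "lxf":                 # Lax-Friedrichs
            u = 0.5 * (up + um) - 0.5 * lam * (up - um)
        else:                                 # Lax-Wendroff
            u = u - 0.5 * lam * (up - um) + 0.5 * lam**2 * (up - 2 * u + um)
        t += dt
    return u, t

for name in ("upwind", "lxf", "lxw"):
    un, tn = run(name)
    err = np.abs(un - u_init((x - c * tn) % L)).max()
    print(name, err)   # Lax-Wendroff is markedly more accurate at this resolution
```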

Figure 3.12 shows the solution obtained with scheme [3.20]. The solution respects neither the positivity nor the upper bound of the exact solution, with


oscillations appearing and growing over time, which illustrates the numerical instability phenomenon. Figure 3.14 shows the solution obtained with different schemes in comparison with the exact solution (upwind, Lax–Friedrichs, Lax–Wendroff and Després–Lagoutière schemes, the latter being a recent innovation described in [DES 01]). We observe that the maximum of the solution is poorly respected by the upwind and Lax–Friedrichs schemes, which also have a tendency to spread the solution (the numerical solution takes significantly positive values over a larger interval than the exact solution). The solution of the Després–Lagoutière scheme presents a more irregular profile, with plateaus forming. The solution produced by Lax–Wendroff is better. These differences diminish as the discretization is refined. Figure 3.15 allows verification of the stated orders of convergence, confirming the visually observed behavior.


Figure 3.12. Simulation of the transport equation with the centered scheme (h = 1.34 × 10⁻²: numerical solution for different times 0 ≤ t ≤ .5)

NOTE 3.6.– It is tempting, on the strength of experience gained with ODEs and diffusion problems, to look to relax the stability constraint |c|Δt/h < 1 and to construct a method that works under less restrictive conditions, in the form of an implicit scheme. Thus, in the case of constant velocity c > 0, we would write

(u^{n+1}_j − u^n_j)/Δt + c (u^{n+1}_j − u^{n+1}_{j−1})/h = 0.


Figure 3.13. Simulation of the transport equation with the upwind scheme (h = 1.34 × 10⁻²: numerical solution for different times 0 ≤ t ≤ .5). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 3.14. Simulation of the transport equation, numerical solutions at time T = .5 (h = 1.34 × 10⁻²): exact solution (Exact), upwind scheme (UpW), Lax–Friedrichs scheme (LxF), Lax–Wendroff scheme (LxW), Després–Lagoutière scheme (SLag)



Figure 3.15. Simulation of the transport equation, evolution of the error in L2 and L∞ norms for upwind scheme (UpW, in blue), Lax–Friedrichs scheme (LxF, in red) and Lax–Wendroff scheme (LxW, in green). The L2 norms are represented by ×, and the L∞ norms are represented by ◦; for comparison with reference to straight lines of slopes 1 and 2, respectively. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

This scheme does indeed preserve the positivity of the solutions, as well as the $L^\infty$ bounds. Indeed, assuming that $0 \le u_j^n \le M$ for all $j$ and that $u_{j_0}^{n+1} = \min_j u_j^{n+1}$ (respectively $u_{j_0}^{n+1} = \max_j u_j^{n+1}$), then we have
$$0 \le u_{j_0}^n = u_{j_0}^{n+1}\Big(1 + c\frac{\Delta t}{h}\Big) - c\frac{\Delta t}{h}\,u_{j_0-1}^{n+1} \le u_{j_0}^{n+1}\Big(1 + c\frac{\Delta t}{h}\Big) - c\frac{\Delta t}{h}\,u_{j_0}^{n+1} = u_{j_0}^{n+1}$$
(respectively $M \ge u_{j_0}^n \ge u_{j_0}^{n+1}$). This scheme has been analyzed under much more general conditions in [BOY 12], and the fact that it allows, a priori, for the use of larger time steps is an advantage that makes it useful for certain applications. However, its use is often regarded with caution for at least two reasons:

– Updating the unknown values requires solving a linear system
$$\Big(I + \frac{\Delta t}{h} C\Big) U^{n+1} = U^n$$
(which is invertible since the matrix is strictly diagonally dominant). While the matrix $I + \frac{\Delta t}{h} C$ has a very sparse structure, its inverse $(I + \frac{\Delta t}{h} C)^{-1}$ is a full matrix. Thus, the approximation $u_j^{n+1}$ at the grid point $x_j$ involves the solution at the preceding


discrete time over all points in the computational domain. This contradicts the finite speed of propagation of the continuous problem (the solution $u(t^{n+1}, x_j) = u(t^n, x_j - c\Delta t)$ depends only on $u(t^n, y)$ for the points $y \in [x_j - c\Delta t, x_j]$). More precisely, the matrix $C$ is expressed as
$$C = c\begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}.$$
In fact, it is convenient to express the matrix $I + \frac{\Delta t}{h} C$ of the linear system in the form $\big(1 + c\frac{\Delta t}{h}\big)(I - \lambda N)$, where $N$ is the nilpotent matrix
$$N = \begin{pmatrix} 0 & & & \\ 1 & 0 & & \\ & \ddots & \ddots & \\ & & 1 & 0 \end{pmatrix}$$
and
$$\lambda = \frac{c\Delta t/h}{1 + c\Delta t/h}.$$
The calculation of the inverse thus becomes very simple; we obtain
$$\Big(I + \frac{\Delta t}{h} C\Big)^{-1} = \frac{1}{1 + c\frac{\Delta t}{h}} \sum_{k=0}^{\infty} \lambda^k N^k = \frac{1}{1 + c\frac{\Delta t}{h}} \sum_{k=0}^{J-1} \lambda^k N^k = \frac{1}{1 + c\frac{\Delta t}{h}} \begin{pmatrix} 1 & & & & \\ \lambda & 1 & & & \\ \lambda^2 & \lambda & 1 & & \\ \vdots & \ddots & \ddots & \ddots & \\ \lambda^{J-1} & \cdots & \lambda^2 & \lambda & 1 \end{pmatrix}.$$
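The closed form of the inverse can be checked directly; a minimal sketch on a small $J \times J$ example (the sizes and parameters are illustrative):

```python
import numpy as np

# Check the closed form of (I + dt/h C)^{-1}:
# C = c * (1 on the diagonal, -1 on the subdiagonal), so
# I + dt/h C = (1 + c dt/h)(I - lam*N), with N the subdiagonal shift
# and lam = (c dt/h)/(1 + c dt/h).
J, c, dt, h = 6, 1.0, 0.02, 0.01
mu = c * dt / h
C = c * (np.eye(J) - np.eye(J, k=-1))
lam = mu / (1.0 + mu)
# lower-triangular Toeplitz matrix with entries lam^{i-j} on and below the diagonal
powers = np.array([[lam ** (i - j) if i >= j else 0.0 for j in range(J)]
                   for i in range(J)])
inv_closed = powers / (1.0 + mu)
inv_direct = np.linalg.inv(np.eye(J) + dt / h * C)
```

The inverse is indeed full and lower triangular: the value at $x_j$ depends on all grid points to the left, as stated below.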


The coefficients of this matrix are indeed positive, which provides another argument that the maximum principle is satisfied. We remark that the evaluation of the solution at a point $x_0$ depends on the values at all points situated to the left of $x_0$.

– As noted in [BRU 97], the explicit and implicit schemes provide an approximation which is consistent up to $O(\Delta t^2 + \Delta x^2)$ with the equation
$$\partial_t u_\delta + c\,\partial_x u_\delta = \delta\,\partial^2_{xx} u_\delta,$$
where $\delta$ is a function of the numerical parameters which depends on the scheme used:
$$\delta_{\mathrm{Exp}} = \frac{c}{2}(h - c\Delta t), \qquad \delta_{\mathrm{Imp}} = \frac{c}{2}(h + c\Delta t).$$


Figure 3.16. Simulation of the transport equation by the implicit upwind scheme and comparison with the exact solution at different times. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

In the explicit case, we recover the condition $\Delta t \le h/c$, without which this convection–diffusion equation is unstable (examine the behavior of the Fourier transform), but by increasing $\Delta t$ we reduce the diffusion. In the implicit case, the diffusion term does not cancel and grows with $\Delta t$. The implicit scheme thus inherently introduces a stronger numerical diffusion, which further distorts the result. Figure 3.16 compares the exact solution with the numerical solution given by the implicit upwind scheme: this result is to be compared with Figure 3.13, which shows the solution from the explicit upwind scheme under the same numerical conditions (h = 0.0134, Δt = 0.0083). We clearly see that the implicit solution is subject to stronger numerical diffusion. This analysis is completed by Figure 3.17, where we examine the influence of the time step on the performance of the explicit and implicit schemes, for

314

Mathematics for Modeling and Scientific Computing

a fixed mesh size. We recover the stated behavior: reducing the time step Δt increases the numerical diffusion for the explicit scheme and reduces it for the implicit scheme.
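The stronger smearing of the implicit scheme can be reproduced directly: advecting the same smooth bump with both schemes, the implicit peak ends up lower, as predicted by $\delta_{\mathrm{Imp}} > \delta_{\mathrm{Exp}}$. A minimal sketch, assuming periodic boundary conditions (the grid parameters are those quoted above; the bump and the fixed-point solver for the periodic implicit system are illustrative choices):

```python
import numpy as np

h, c, dt = 0.0134, 1.0, 0.0083   # parameters from the comparison in the text
x = np.arange(0.0, 2.0, h)
u_exp = np.exp(-50 * (x - 0.5) ** 2)
u_imp = u_exp.copy()
mu = c * dt / h                  # CFL number ~0.62 < 1
for _ in range(int(0.5 / dt)):
    # explicit upwind
    u_exp = u_exp - mu * (u_exp - np.roll(u_exp, 1))
    # implicit upwind: periodic bidiagonal system solved by fixed-point
    # iteration (contraction factor mu/(1+mu) < 1)
    rhs = u_imp.copy()
    for _ in range(200):
        u_imp = (rhs + mu * np.roll(u_imp, 1)) / (1.0 + mu)
```

Both schemes respect the maximum principle, but the implicit maximum decays faster.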


Figure 3.17. Behavior of the explicit and implicit upwind schemes for different time steps. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

We have seen that we can attribute a meaning to equation [3.13] when the initial data are discontinuous. It is interesting to study the behavior of numerical schemes in this situation because, as we shall see later, nonlinear problems inevitably lead to the appearance of such discontinuities. The data are now (see Figure 3.18)
$$u_{\mathrm{Init}} = \mathbf{1}_{0.8 \le x \le 1.2}, \qquad [3.27]$$

for the simulations shown in Figure 3.19. The upwind and Lax–Friedrichs schemes exhibit smooth and softened solutions, with spread support and reduced extrema: the effect of the numerical diffusion is recognizable. The Lax–Wendroff scheme presents strong parasitic oscillations at the points of discontinuity. The Després–Lagoutière [DES 01] scheme performs spectacularly in this particular case. From the error curve in Figure 3.20, it is apparent that the orders of convergence are altered by the presence of discontinuities. We also note that theorem 3.7 specifically does not apply to this situation, since the solution is not even continuous! However, we can justify the estimate of the observed error, by introducing slightly more elaborate analysis tools. The “natural” space for


the study of such solutions is the space BV of bounded functions whose derivative (in the sense of distributions) is a bounded measure. Specifically, we say that $u : I \subset \mathbb{R} \to \mathbb{R}$ is an element of $BV(I)$ if $u \in L^\infty(I)$ and there exists $C > 0$ such that for all functions $\psi \in C_c^1(I)$, we have
$$\Big| \int_I u\,\psi'(x)\,dx \Big| \le C\,\|\psi\|_{L^\infty}.$$
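The contrast described above between the upwind scheme (maximum principle, smeared fronts) and the Lax–Wendroff scheme (parasitic oscillations at the discontinuities) is easy to reproduce. A minimal sketch on the step datum [3.27], with periodic boundary conditions and an illustrative grid and CFL number:

```python
import numpy as np

N, c, cfl = 400, 1.0, 0.8
h = 2.0 / N
dt = cfl * h / c
x = np.arange(N) * h
u_up = np.where((x >= 0.8) & (x <= 1.2), 1.0, 0.0)   # datum [3.27]
u_lw = u_up.copy()
nu = c * dt / h
for _ in range(int(0.5 / dt)):
    # upwind: convex combination, stays within [0, 1]
    u_up = u_up - nu * (u_up - np.roll(u_up, 1))
    # Lax-Wendroff: second order but oscillatory near jumps
    u_lw = (u_lw - 0.5 * nu * (np.roll(u_lw, -1) - np.roll(u_lw, 1))
            + 0.5 * nu**2 * (np.roll(u_lw, -1) - 2 * u_lw + np.roll(u_lw, 1)))
```

The upwind solution remains in $[0, 1]$, while the Lax–Wendroff solution overshoots above 1 and undershoots below 0, exactly the parasitic oscillations visible in Figure 3.19.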


Figure 3.18. Initial condition [3.27]

We then denote by $\|u\|_{M^1}$ the infimum of such constants $C$, and $\|u\|_{BV} = \|u\|_{L^\infty} + \|u\|_{M^1}$. Theorem 3.7 extends to data of this type, but in the $L^1$ norm and with a fractional order of convergence.

THEOREM 3.8.– Let $u_{\mathrm{Init}} \in BV_\#([0, 2\pi])$. Let $u$ be the associated solution of [3.13], with a constant velocity $a(t, x) = c$. Finally, let $0 < T < \infty$. Then, the sequence $\bar u^n_j$ defined by [3.21] with a uniform mesh size $h$ and under the condition $|c|\Delta t/h \le 1$ satisfies
$$\max_{0 \le t^n \le T}\; h \sum_j |\bar u^n_j - u(t^n, x_j)| \le C\,\|u_{\mathrm{Init}}\|_{BV}\,\big(h + \sqrt{T h}\big),$$
where $C > 0$ depends neither on $h$, nor on $T$, nor on the data $u_{\mathrm{Init}}$.

PROOF.– We introduce a sequence of mollifiers $\zeta_\varepsilon(x) = \frac{1}{\varepsilon}\zeta(x/\varepsilon)$, where $\zeta \in C_c^\infty(\mathbb{R})$ is supported in $[-1, +1]$, positive and of integral equal to 1. We set $u_{\mathrm{Init},\varepsilon} = \zeta_\varepsilon \star u_{\mathrm{Init}}$.



Figure 3.19. Simulation of the transport equation, numerical solutions at time T = 0.5 (h = 1.34 × 10⁻²): exact solution (Exact), upwind scheme (UpW), Lax–Friedrichs scheme (LxF), Lax–Wendroff scheme (LxW) and Després–Lagoutière scheme (SLag)


Figure 3.20. Simulation of the transport equation: Order of convergence in the L2 norm with the upwind scheme and comparison with straight lines of slopes 1 and 1/2


We have $u_{\mathrm{Init},\varepsilon} \in C^2$ and we prove without difficulty that $\|u_{\mathrm{Init},\varepsilon}\|_{BV} \le \|u_{\mathrm{Init}}\|_{BV}$, while $\|\partial^2_{xx} u_{\mathrm{Init},\varepsilon}\|_{L^\infty} = \|\partial_x \zeta_\varepsilon \star \partial_x u_{\mathrm{Init}}\|_{L^\infty} \le \frac{1}{\varepsilon}\|\partial_x \zeta\|_{L^\infty}\|u_{\mathrm{Init}}\|_{BV}$. Furthermore, we have
$$|u_{\mathrm{Init},\varepsilon}(x) - u_{\mathrm{Init}}(x)| = \Big| \int \frac{1}{\varepsilon}\zeta\Big(\frac{x-y}{\varepsilon}\Big)\big(u_{\mathrm{Init}}(y) - u_{\mathrm{Init}}(x)\big)\,dy \Big| \le \varepsilon \int\!\!\int_0^1 \zeta(z)\,|\partial_x u_{\mathrm{Init}}(x + \theta\varepsilon z)|\,|z|\,d\theta\,dz,$$
which leads to $\|u_{\mathrm{Init},\varepsilon} - u_{\mathrm{Init}}\|_{L^1} \le \varepsilon\,\|z\zeta(z)\|_{L^1}\,\|u_{\mathrm{Init}}\|_{BV}$. Next, we exploit the following property of the upwind scheme: if the sequence $v^n_j$ satisfies
$$v^{n+1}_j = v^n_j - c\,\frac{\Delta t}{h}\big(v^n_j - v^n_{j-1}\big) + \eta^n_j,$$

then, under the condition $0 < c\Delta t/h \le 1$,
$$h\sum_j |v^{n+1}_j| \le \Big(1 - c\frac{\Delta t}{h}\Big) h\sum_j |v^n_j| + c\frac{\Delta t}{h}\, h\sum_j |v^n_{j-1}| + h\sum_j |\eta^n_j| \le h\sum_j |v^n_j| + h\sum_j |\eta^n_j| \le h\sum_j |v^0_j| + h\sum_{k=0}^{n}\sum_j |\eta^k_j|.$$

We denote by $(t, x) \mapsto u(t, x)$ and $(t, x) \mapsto u_\varepsilon(t, x)$ the solutions of [3.13], with $a(t, x) = c > 0$ and for initial data $u(0, x) = u_{\mathrm{Init}}(x)$ and $u_\varepsilon(0, x) = \zeta_\varepsilon \star u_{\mathrm{Init}}(x)$, respectively. Similarly, we use $u^n_j$ and $u^n_{\varepsilon,j}$ to denote the numerical solutions of the upwind scheme with data $u^0_j = u_{\mathrm{Init}}(x_j)$ and $u^0_{\varepsilon,j} = \zeta_\varepsilon \star u_{\mathrm{Init}}(x_j)$, respectively. The linearity of the equation and of the scheme allows us to write
$$h\sum_j |u^n_j - u(t^n, x_j)| \le h\sum_j \Big( |u^n_j - u^n_{\varepsilon,j}| + |u^n_{\varepsilon,j} - u_\varepsilon(t^n, x_j)| + |u_\varepsilon(t^n, x_j) - u(t^n, x_j)| \Big)$$
$$\le h\sum_j |u^0_j - u^0_{\varepsilon,j}| + h\sum_j |u^n_{\varepsilon,j} - u_\varepsilon(t^n, x_j)| + h\sum_j |u_\varepsilon(0, x_j) - u(0, x_j)|$$
$$\le 2h\sum_j |u_{\mathrm{Init}}(x_j) - \zeta_\varepsilon \star u_{\mathrm{Init}}(x_j)| + h\sum_j |u^n_{\varepsilon,j} - u_\varepsilon(t^n, x_j)|.$$


We set $U(x) = u_{\mathrm{Init}}(x) - \zeta_\varepsilon \star u_{\mathrm{Init}}(x)$. We can evaluate
$$h\sum_j |U(x_j)| \le \sum_j \Big| h\,U(x_j) - \int_{x_j}^{x_{j+1}} U(y)\,dy \Big| + \sum_j \int_{x_j}^{x_{j+1}} |U(y)|\,dy$$
$$\le \sum_j \int_{x_j}^{x_{j+1}} |U(x_j) - U(y)|\,dy + \varepsilon\,\|z\zeta(z)\|_{L^1}\,\|u_{\mathrm{Init}}\|_{BV}$$
$$\le \sum_j \int_{x_j}^{x_{j+1}} \Big| \int_0^1 U'\big(x_j + \theta(y - x_j)\big)(y - x_j)\,d\theta \Big|\,dy + \varepsilon\,\|z\zeta(z)\|_{L^1}\,\|u_{\mathrm{Init}}\|_{BV}$$
$$\le h\,\|U\|_{BV} + \varepsilon\,\|z\zeta(z)\|_{L^1}\,\|u_{\mathrm{Init}}\|_{BV} \le C(h + \varepsilon)\,\|u_{\mathrm{Init}}\|_{BV}.$$
Furthermore, the convergence analysis conducted for the case of regular data ensures that
$$h\sum_j |u^n_{\varepsilon,j} - u_\varepsilon(t^n, x_j)| \le C\,T\,h\,\|\zeta_\varepsilon \star u_{\mathrm{Init}}\|_{C^2} \le C\,\frac{Th}{\varepsilon}\,\|u_{\mathrm{Init}}\|_{BV}.$$
We thus obtain
$$h\sum_j |u^n_j - u(t^n, x_j)| \le C\,\|u_{\mathrm{Init}}\|_{BV}\,\Big( h + \varepsilon + \frac{Th}{\varepsilon} \Big).$$
By minimizing the right-hand side with respect to $\varepsilon$ (that is, taking $\varepsilon = \sqrt{Th}$), we conclude from this that
$$h\sum_j |u^n_j - u(t^n, x_j)| \le C\,\|u_{\mathrm{Init}}\|_{BV}\,\big(h + \sqrt{Th}\big). \qquad \square$$
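The fractional order of Theorem 3.8 can be observed empirically: for the indicator datum, the $L^1$ error of the upwind scheme decays roughly like $\sqrt{h}$. A minimal sketch (periodic domain $[0, 2)$, $c = 1$, CFL $= 0.5$; all choices are illustrative):

```python
import numpy as np

def upwind_l1_error(N, T=0.5, c=1.0, cfl=0.5):
    """L1 error of the upwind scheme on the step datum 1_{0.8<=x<=1.2}."""
    h = 2.0 / N
    dt = cfl * h / c
    x = np.arange(N) * h
    u = np.where((x >= 0.8) & (x <= 1.2), 1.0, 0.0)
    nu = c * dt / h
    nsteps = int(round(T / dt))
    for _ in range(nsteps):
        u = u - nu * (u - np.roll(u, 1))
    t = nsteps * dt
    xs = (x - c * t) % 2.0                        # exact solution: shifted datum
    exact = np.where((xs >= 0.8) & (xs <= 1.2), 1.0, 0.0)
    return h * np.abs(u - exact).sum()

errs = [upwind_l1_error(N) for N in (100, 200, 400, 800)]
rate = np.log2(errs[0] / errs[-1]) / 3.0          # average observed order
```

The observed order sits near 1/2 rather than the order 1 obtained for smooth data, consistent with the error curve of Figure 3.20.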

Let us return to the case of a non-uniform mesh and a finite volume discretization, i.e. in the present case, assuming that the velocity $a$ takes positive values,
$$u^{n+1}_j = u^n_j - \frac{\Delta t}{h_j}\big( a_{j+1/2} u^n_j - a_{j-1/2} u^n_{j-1} \big).$$
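This finite volume update is conservative by construction: in flux form, the total mass $\sum_j h_j u_j$ is preserved exactly on a periodic mesh. A minimal sketch with $a(t,x) = c$ and a randomly perturbed mesh (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 200, 1.0
h = (1.0 / N) * (1.0 + 0.4 * rng.uniform(-1, 1, N))  # non-uniform cell sizes
x = np.cumsum(h) - h / 2                              # cell centers
u = np.exp(-100 * (x - 0.3) ** 2)
dt = 0.5 * h.min() / c                                # CFL w.r.t. smallest cell
mass0 = (h * u).sum()
for _ in range(100):
    flux = c * u                                      # upwind interface flux
    u = u - dt / h * (flux - np.roll(flux, 1))        # u^{n+1}_j update
mass = (h * u).sum()
```

Mass is conserved to machine precision and, under the CFL condition with respect to the smallest cell, each update is a convex combination, so positivity is preserved — even though, as shown below, the scheme is not consistent in the finite difference sense.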


We remark that the upwind scheme is not consistent (in the FD sense). Indeed, using the Taylor expansion of $u$, we obtain
$$\frac{u(t^{n+1}, x_j) - u(t^n, x_j)}{\Delta t} + \frac{1}{h_j}\Big( a(x_{j+1/2})\,u(t^n, x_j) - a(x_{j-1/2})\,u(t^n, x_{j-1}) \Big)$$
$$= \frac{u(t^{n+1}, x_j) - u(t^n, x_j)}{\Delta t} + \frac{1}{h_j}\Big( a(x_{j+1/2})\,u(t^n, x_{j+1/2}) - a(x_{j-1/2})\,u(t^n, x_{j-1/2}) \Big)$$
$$\quad + \frac{a(x_{j+1/2})}{h_j}\Big( u(t^n, x_j) - u(t^n, x_{j+1/2}) \Big) - \frac{a(x_{j-1/2})}{h_j}\Big( u(t^n, x_{j-1}) - u(t^n, x_{j-1/2}) \Big)$$
$$= \big( \partial_t u + \partial_x(au) \big)(t^n, x_j) + a(x_{j+1/2})\,\partial_x u(t^n, x_{j+1/2})\,\frac{x_j - x_{j+1/2}}{h_j} - a(x_{j-1/2})\,\partial_x u(t^n, x_{j-1/2})\,\frac{x_{j-1} - x_{j-1/2}}{h_j} + R^n_j.$$
The error term $R^n_j$ tends to 0 as $\Delta t, h \to 0$. However, the leading term in the expansion of
$$a(x_{j+1/2})\,\partial_x u(t^n, x_{j+1/2})\,\frac{x_j - x_{j+1/2}}{h_j} - a(x_{j-1/2})\,\partial_x u(t^n, x_{j-1/2})\,\frac{x_{j-1} - x_{j-1/2}}{h_j}$$
is
$$-a\,\partial_x u(t^n, x_j)\,\frac{1}{h_j}\Big( \frac{h_j}{2} - \frac{h_{j-1}}{2} \Big) = -a\,\partial_x u(t^n, x_j)\,\frac{1}{2}\Big( 1 - \frac{h_{j-1}}{h_j} \Big).$$

For general meshes, this term does not cancel and does not tend to 0. Despite this lack of consistency in the finite difference sense over general grids, we shall nevertheless be able to establish the convergence of the FV scheme, in the $L^2$ norm, by adopting a very different outlook (here for the case $a(t, x) = c > 0$). We introduce the operator $A_h$, which associates with $u = (u_1, ..., u_{N_h}) \in \mathbb{R}^{N_h}$ the vector of $\mathbb{R}^{N_h}$ whose coordinates are given by
$$[A_h u]_j = -c\,\frac{1}{h_j}(u_j - u_{j-1}),$$
with the convention $u_0 = u_{N_h}$ (periodicity condition). For $\tau > 0$, we have
$$[(I + \tau A_h) u]_j = \Big(1 - c\frac{\tau}{h_j}\Big) u_j + c\frac{\tau}{h_j}\,u_{j-1},$$


which appears as a convex combination of $u_j$ and $u_{j-1}$ as soon as $0 < \tau \le k_h/c$, where $k_h = \min_j h_j$. In this case, we thus have
$$|||I + \tau A_h||| \le 1.$$
As a consequence, we also have
$$e^{t A_h} = e^{-\tau}\,e^{\tau(I + \frac{t}{\tau} A_h)} = e^{-\tau} \sum_{n=0}^{\infty} \frac{\tau^n}{n!}\Big(I + \frac{t}{\tau} A_h\Big)^n,$$
which satisfies, by taking $\tau \ge tc/k_h$,
$$|||e^{t A_h}||| \le e^{-\tau} \sum_{n=0}^{\infty} \frac{\tau^n}{n!} = 1.$$
We start by focusing on the semi-discrete problem, considering $t \mapsto \tilde u(t) \in \mathbb{R}^{N_h}$ the solution of
$$\frac{d}{dt}\tilde u(t) = A_h \tilde u(t), \qquad \tilde u_j(0) = \frac{1}{h_j}\int_{x_{j-1/2}}^{x_{j+1/2}} u_{\mathrm{Init}}(y)\,dy.$$
We thus have $\tilde u(t) = e^{t A_h}\tilde u(0)$ and we set

$$\tilde u_h(t, x) = \sum_{j=1}^{N_h} \tilde u_j(t)\,\mathbf{1}_{x_{j-1/2} \le x < x_{j+1/2}}(x).$$
… in the $L^\infty((0, T) \times \mathbb{R}^N)$ norm for Lipschitz data. The details of this analysis are found in [MER 07, MER 08], while a very original outlook, interpreting the numerical scheme in terms of Markov chains, is developed in [DEL 11]. Here, we have followed the arguments presented in [DES 04a, DES 04b].

3.2.5. Two-dimensional simulations

Once we have an effective scheme for dealing with transport in dimension one, it is relatively simple to obtain a multi-dimensional scheme by a directional decomposition approach. The idea is inspired by the solution of a differential system
$$\frac{d}{dt} X = (A + B) X$$
by a method where we successively solve a system involving only the matrix $A$, followed by a system involving only the matrix $B$. More specifically, knowing $X_n$, we start by solving over $[n\Delta t, (n+1)\Delta t]$
$$\frac{d}{dt} Y = A Y, \qquad Y(n\Delta t) = X_n,$$
then we solve, still over $[n\Delta t, (n+1)\Delta t]$,
$$\frac{d}{dt} Z = B Z, \qquad Z(n\Delta t) = Y((n+1)\Delta t),$$
and we set $X_{n+1} = Z((n+1)\Delta t)$. In other words, given that we know how to solve these two equations exactly, we have
$$Y((n+1)\Delta t) = e^{A\Delta t} X_n, \qquad X_{n+1} = e^{B\Delta t} e^{A\Delta t} X_n.$$
We must compare this expression with the exact solution
$$X((n+1)\Delta t) = e^{(A+B)\Delta t} X(n\Delta t).$$


Writing $e^n = X(n\Delta t) - X_n$, we thus obtain
$$e^{n+1} = e^{B\Delta t} e^{A\Delta t}\,e^n + \big[ e^{(A+B)\Delta t} - e^{B\Delta t} e^{A\Delta t} \big] X(t^n).$$
Now, we have
$$e^{(A+B)\Delta t} = I + (A+B)\Delta t + (A^2 + AB + BA + B^2)\,\frac{\Delta t^2}{2} + H_1(\Delta t)\,\Delta t^2$$
and
$$e^{B\Delta t} e^{A\Delta t} = I + A\Delta t + B\Delta t + A^2\,\frac{\Delta t^2}{2} + B^2\,\frac{\Delta t^2}{2} + BA\,\Delta t^2 + H_2(\Delta t)\,\Delta t^2,$$
where $\lim_{\Delta t \to 0} H_j(\Delta t) = 0$. It follows that
$$e^{(A+B)\Delta t} - e^{B\Delta t} e^{A\Delta t} = (AB - BA)\,\frac{\Delta t^2}{2} + H(\Delta t)\,\Delta t^2,$$

with $\lim_{\Delta t \to 0} H(\Delta t) = 0$. We fix $0 < T < \infty$ and set $\Delta t = T/N$, for a non-zero integer $N$. We can thus find a constant $C_T > 0$ such that, for all $n \in \{0, ..., N\}$, we have $|e^{n+1}| \le C_T(|e^n| + \Delta t^2)$. Then, by recurrence (with $e^0 = X(0) - X_0 = 0$), we obtain
$$|e^n| \le C_T\,n\,\Delta t^2 \le T\,C_T\,\Delta t.$$
The decomposition thus induces an error of order one in $\Delta t$. We can improve this decomposition error with the following strategy:
– solve over $[n\Delta t, (n+1/2)\Delta t]$
$$\frac{d}{dt} Y = AY, \qquad Y(n\Delta t) = X_n;$$
– solve over $[n\Delta t, (n+1)\Delta t]$
$$\frac{d}{dt} Z = BZ, \qquad Z(n\Delta t) = Y((n+1/2)\Delta t);$$
– solve over $[n\Delta t, (n+1/2)\Delta t]$
$$\frac{d}{dt} \tilde Y = A\tilde Y, \qquad \tilde Y(n\Delta t) = Z((n+1)\Delta t).$$
We then set $X_{n+1} = \tilde Y((n+1/2)\Delta t)$. In other words, we have
$$X_{n+1} = e^{A\Delta t/2}\,e^{B\Delta t}\,e^{A\Delta t/2}\,X_n.$$
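The two orders of accuracy can be measured directly on a pair of non-commuting matrices. A minimal sketch (the matrices and step counts are illustrative; the matrix exponential is computed by a plain truncated Taylor series, which is accurate here because the arguments have small norm):

```python
import numpy as np

def mat_exp(M, terms=30):
    """Truncated Taylor series for the matrix exponential (small-norm M)."""
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        E = E + term
    return E

A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B = np.array([[-0.5, 0.0], [0.3, -0.2]])   # AB != BA
X0 = np.array([1.0, 1.0])
T = 1.0
exact = mat_exp((A + B) * T) @ X0

def split_error(n, strang):
    dt = T / n
    if strang:
        M = mat_exp(A * dt / 2) @ mat_exp(B * dt) @ mat_exp(A * dt / 2)
    else:
        M = mat_exp(B * dt) @ mat_exp(A * dt)   # Lie splitting
    X = X0.copy()
    for _ in range(n):
        X = M @ X
    return np.linalg.norm(X - exact)

e_lie = [split_error(n, False) for n in (20, 40)]
e_strang = [split_error(n, True) for n in (20, 40)]
lie_rate = np.log2(e_lie[0] / e_lie[1])
strang_rate = np.log2(e_strang[0] / e_strang[1])
```

Halving the time step roughly halves the Lie error and divides the Strang error by four, in accordance with the orders derived above.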


This decomposition, called the Strang decomposition, induces an error of order two in $\Delta t$. We can draw inspiration from this reasoning to numerically solve the equation
$$\partial_t \rho + \partial_x(u\rho) + \partial_y(v\rho) = 0$$
over the domain $\Omega = [0, \ell] \times [0, L]$, with given $u, v : \Omega \to \mathbb{R}$, letting the operator $\partial_x(u\,\cdot)$ play the role of $A$ and the operator $\partial_y(v\,\cdot)$ that of $B$. Thus, we consider a mesh of $\Omega$ of step sizes $\Delta x, \Delta y$ and, given $\rho^n_{i,j}$, an approximation of $\rho(n\Delta t, i\Delta x, j\Delta y)$, we first define
$$\bar\rho_{i,j} = \rho^n_{i,j} - \frac{\Delta t}{\Delta x}\big( F_{i+1/2,j} - F_{i-1/2,j} \big),$$
$$F_{i+1/2,j} = \big[ u((i+1/2)\Delta x, j\Delta y) \big]_+\,\rho^n_{i,j} + \big[ u((i+1/2)\Delta x, j\Delta y) \big]_-\,\rho^n_{i+1,j},$$
with $[z]_+ = \max(0, z) \ge 0$ and $[z]_- = \min(0, z) \le 0$; then we set
$$\rho^{n+1}_{i,j} = \bar\rho_{i,j} - \frac{\Delta t}{\Delta y}\big( G_{i,j+1/2} - G_{i,j-1/2} \big),$$
$$G_{i,j+1/2} = \big[ v(i\Delta x, (j+1/2)\Delta y) \big]_+\,\bar\rho_{i,j} + \big[ v(i\Delta x, (j+1/2)\Delta y) \big]_-\,\bar\rho_{i,j+1}.$$

We use this scheme to conduct the transport equation simulation with $\ell = 1 = L$ and the velocity field
$$u(x, y) = -\sin(\pi x)\cos(\pi y), \qquad v(x, y) = \cos(\pi x)\sin(\pi y).$$

This vector field is represented in Figure 3.21; we see that it acts to rotate the density $\rho$ in the clockwise direction. The initial data
$$\rho_{\mathrm{Init}}(x, y) = \exp\big( -500\,((x - 0.25)^2 + (y - 0.5)^2) \big)$$
are represented in Figure 3.22. The evolution of the solution is presented in Figure 3.23 (Δx = 1/100, Δy = 1/120, CFL number = 0.4): we clearly see the rotation effect imposed by the velocity field. However, this result is not very satisfactory, since the numerical diffusion is significant: the support of the numerical solution broadens and it is clear from Figure 3.24 that, at the final time T = 1.5, the maximum of the solution is very distinctly underestimated. The final thing to be said about this approach is that it inherently imposes the use of Cartesian grids. However, in practice, for equations involving variable coefficients and complex geometries, it can be attractive to work with more general meshes. This is one of the reasons that motivates the development of finite volume methods.
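The dimensional-splitting scheme above can be sketched as follows. This is a minimal illustration, not the code used for the figures: the grid is coarser, and the face velocities are sampled at cell interfaces $i\Delta x$ (a slight variant of the staggering written in the text). Since the normal velocity vanishes on the boundary of $[0,1]^2$, the boundary fluxes are set to zero and the total mass is conserved exactly.

```python
import numpy as np

nx, ny = 50, 60
dx, dy = 1.0 / nx, 1.0 / ny
xc = (np.arange(nx) + 0.5) * dx                      # cell centers
yc = (np.arange(ny) + 0.5) * dy
rho = np.exp(-500 * ((xc[:, None] - 0.25) ** 2 + (yc[None, :] - 0.5) ** 2))
# face-sampled velocities u = -sin(pi x)cos(pi y), v = cos(pi x)sin(pi y)
uf = -np.sin(np.pi * np.arange(nx + 1) * dx)[:, None] * np.cos(np.pi * yc)[None, :]
vf = np.cos(np.pi * xc)[:, None] * np.sin(np.pi * np.arange(ny + 1) * dy)[None, :]
dt = 0.4 * min(dx, dy)                               # CFL number 0.4, |u|,|v| <= 1
mass0 = rho.sum() * dx * dy

def sweep_x(r):
    F = np.zeros((nx + 1, ny))                       # boundary fluxes stay zero
    F[1:-1] = np.maximum(uf[1:-1], 0) * r[:-1] + np.minimum(uf[1:-1], 0) * r[1:]
    return r - dt / dx * (F[1:] - F[:-1])

def sweep_y(r):
    G = np.zeros((nx, ny + 1))
    G[:, 1:-1] = (np.maximum(vf[:, 1:-1], 0) * r[:, :-1]
                  + np.minimum(vf[:, 1:-1], 0) * r[:, 1:])
    return r - dt / dy * (G[:, 1:] - G[:, :-1])

for _ in range(50):
    rho = sweep_y(sweep_x(rho))                      # x sweep, then y sweep
mass = rho.sum() * dx * dy
```

Under the CFL condition each sweep keeps the coefficients of the update nonnegative, so the density remains positive, while the numerical diffusion of the upwind fluxes progressively lowers the maximum, as seen in Figures 3.23 and 3.24.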

Numerical Simulations of Partial Differential Equations: Time-dependent Problems

329


Figure 3.21. 2D velocity field


Figure 3.22. Initial 2D condition. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

3.2.6. The dynamics of prion proliferation

We are interested here in the mathematical modeling of the development of spongiform encephalopathy, a disease whose human form is known by the name “Creutzfeldt–Jakob disease”, which is caused by the action of an infectious agent called a prion. The same arguments apply to descriptions of other pathologies, such as Alzheimer's disease, for example. The prion consists of a modified form of the


protein PrPC , designated PrPSc . Molecules of PrPC are present in the body and play a protective role in metabolism. The disease corresponds to the formation of aggregates of PrPC molecules, which are more stable than the free form and which resist treatment. The presence of aggregated PrPSc molecules promotes the formation of new aggregates; this results in a reduction of the number of free PrPC molecules and thus the disappearance of natural defenses which leads to irreversible neuron damage. The description of the mechanism of the disease is based on the following hypotheses. We adopt a “continuous” description, where the aggregated molecules are characterized by their size x ≥ 0, while the PrPC are seen as monomers of infinitely small size with respect to the aggregates. The PrPC and PrPSc molecules are subject to: – a natural metabolic degradation characterized by distinct rates for monomers and aggregates, γ and μ(x), respectively; – a polymerization mechanism, described by a rate τ (x), where PrPC monomers attach to PrPSc molecules to form a larger chain;


– a fragmentation mechanism where aggregates break up into smaller chains.

Figure 3.23. Solution of the transport equation at different times: (a) T = 0.25, (b) T = 0.5, (c) T = 0.75, (d) T = 1. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip



Figure 3.24. Solution of the transport equation at time T = 1.5. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

The dynamics of prion proliferation can thus be described through the evolution of the population $V(t)$ of PrP$^{C}$ at time $t \ge 0$, and the density $u(t, x)$ of polymers of size $x \ge 0$ at time $t \ge 0$. More precisely, $\int_a^b u(t, x)\,dx$ gives the number of polymers whose size lies between $a$ and $b$ at time $t \ge 0$. The fragmentation phenomenon is subject to a threshold effect: there exists a critical size $x_0 \ge 0$ beneath which the aggregates are not stable and are immediately dissolved into the monomer form. The fragmentation is described by two parameters: a fragmentation rate $\beta(x)$ and the probability density, written $x \mapsto \kappa(x, y)$, that a polymer of size $y$ fragments into a chain of size $x$ and a chain of size $y - x$. Evidently, this only has meaning for $y \ge x$. The fragmentation kernel hence satisfies a natural symmetry property $\kappa(y - x, y) = \kappa(x, y)$ and the normalization constraint
$$\int_0^\infty \kappa(x, y)\,dx = \begin{cases} 0 & \text{if } y \le x_0, \\ 1 & \text{if } y > x_0. \end{cases}$$
We thus arrive at the following system, which couples an ordinary differential equation with an equation of integro-differential nature:
$$\begin{cases} \dfrac{d}{dt} V(t) = \lambda - \gamma V(t) - V(t) \displaystyle\int_{x_0}^\infty \tau(x)\,u(t, x)\,dx + 2 \displaystyle\int_0^{x_0} x \Big( \int_{x_0}^\infty \beta(y)\kappa(x, y)\,u(t, y)\,dy \Big)\,dx, \\[2mm] \dfrac{\partial}{\partial t} u(t, x) + V(t)\,\dfrac{\partial}{\partial x}\big( \tau(x)\,u(t, x) \big) = -\big(\mu(x) + \beta(x)\big) u(t, x) + 2 \displaystyle\int_x^\infty \beta(y)\kappa(x, y)\,u(t, y)\,dy, \end{cases} \qquad [3.29]$$


for $t \ge 0$, $x \ge x_0$, where $\lambda > 0$ is a natural source of monomers. The factors of 2 take into account the symmetry of the fragmentations $y \to (y - x, x)$ and $y \to (x, y - x)$, which both contribute to the formation of polymers of size $x$. The problem is completed by the initial conditions $V(0) = V_0$, $u(0, x) = u_0(x)$ and the boundary condition $u(t, x_0) = 0$, which accounts for the fact that monomers of size less than or equal to $x_0$ fragment instantaneously and are not observable. A crucial consequence of the hypotheses regarding $\kappa$ is the equality $\int_0^y x\kappa(x, y)\,dx = y/2$. This results from the following calculation:
$$\int_0^y x\kappa(x, y)\,dx = \int_0^y x\kappa(y - x, y)\,dx = \int_0^y (y - z)\kappa(z, y)\,dz,$$
hence
$$\int_0^y x\kappa(x, y)\,dx = \frac{y}{2}\int_0^y \kappa(x, y)\,dx = \frac{y}{2}.$$

This relation expresses a mass conservation property: the mass (proportional to $y$) of the initial fragment is contained in the fragmentation products of size $x$ and $y - x$. We thus observe that
$$\frac{d}{dt}\Big( V(t) + \int_{x_0}^\infty x\,u(t, x)\,dx \Big) = \lambda - \gamma V(t) - \int_{x_0}^\infty \mu(x)\,x\,u(t, x)\,dx.$$
This calculation is based on integration by parts and exploits the observation:
$$2\int_{x_0}^\infty x \Big( \int_x^\infty \beta(y)\kappa(x, y)\,u(t, y)\,dy \Big)\,dx + 2\int_0^{x_0} x \Big( \int_{x_0}^\infty \beta(y)\kappa(x, y)\,u(t, y)\,dy \Big)\,dx$$
$$= 2\int_{x_0}^\infty \Big( \int_{x_0}^y x\kappa(x, y)\,dx \Big)\beta(y)\,u(t, y)\,dy + 2\int_{x_0}^\infty \Big( \int_0^{x_0} x\kappa(x, y)\,dx \Big)\beta(y)\,u(t, y)\,dy$$
$$= 2\int_{x_0}^\infty \Big( \int_0^y x\kappa(x, y)\,dx \Big)\beta(y)\,u(t, y)\,dy = 2\int_{x_0}^\infty \frac{y}{2}\,\beta(y)\,u(t, y)\,dy.$$

A simplified model

First, we can assume that:
– the rates of degradation $\mu$ and polymerization $\tau$ do not depend on the size variable $x$;
– the rate of fragmentation is proportional to the size of the aggregate: $\beta(x) = \bar\beta x$ for a positive constant $\bar\beta$;
– the probability of fragmentation is uniform over the length of the aggregate: $\kappa(x, y) = 0$ if $y \le x_0$ or $y \le x$, and $\kappa(x, y) = 1/y$ if $y > x_0$ and $y > x > 0$.


With these definitions, when it is not null, the product $\beta(y)\kappa(x, y)$ simply takes the value $\bar\beta$. These hypotheses allow us to obtain a closed ODE system for $V(t)$, the total number of polymers $U(t) = \int_{x_0}^\infty u(t, x)\,dx$ and $P(t) = \int_{x_0}^\infty x\,u(t, x)\,dx$, which gives the number of monomers in the agglomerated form (or, up to a proportionality factor, the mass of polymers). In fact, we obtain
$$\begin{cases} \dfrac{d}{dt} U = \bar\beta P - (\mu + 2\bar\beta x_0) U, \\[2mm] \dfrac{d}{dt} V = \lambda - \gamma V - \tau V U + \bar\beta x_0^2 U, \\[2mm] \dfrac{d}{dt} P = \tau V U - \mu P - \bar\beta x_0^2 U. \end{cases} \qquad [3.30]$$
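System [3.30] is easy to integrate numerically; a minimal RK4 sketch with illustrative coefficients (here $\sqrt{\bar\beta\lambda\tau/\gamma} = 0.1 < x_0\bar\beta + \mu = 1.1$, so, anticipating Theorem 3.10 i), the solution should converge to the healthy state $(0, \lambda/\gamma, 0)$):

```python
import numpy as np

# illustrative coefficients (lam = lambda, gam = gamma, beta = bar beta)
lam, gam, tau, beta, mu, x0 = 1.0, 1.0, 0.1, 0.1, 1.0, 1.0

def F(y):
    """Right-hand side of system [3.30] for y = (U, V, P)."""
    U, V, P = y
    return np.array([
        beta * P - (mu + 2 * beta * x0) * U,
        lam - gam * V - tau * V * U + beta * x0**2 * U,
        tau * V * U - mu * P - beta * x0**2 * U,
    ])

y = np.array([1.0, 1.0, 2.0])       # (U0, V0, P0) with P0 > x0*U0
dt = 0.01
for _ in range(5000):               # integrate up to t = 50 with RK4
    k1 = F(y)
    k2 = F(y + dt / 2 * k1)
    k3 = F(y + dt / 2 * k2)
    k4 = F(y + dt * k3)
    y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
U, V, P = y
```

With these coefficients the infected equilibrium is not admissible, and the trajectory relaxes to $U = P = 0$, $V = \lambda/\gamma$.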

(We might note that this system has a structure very similar to the one studied in section 1.2.7.) The analysis starts with

THEOREM 3.9.– For all initial values $(U_0, V_0, P_0) \in \mathcal{X} = \big\{ (a, b, c) \in \mathbb{R}^3,\ a \ge 0,\ b \ge 0,\ c \ge x_0 a \big\}$, problem [3.30] admits a unique solution $t \mapsto \big(U(t), V(t), P(t)\big) \in \mathcal{X}$, defined over $[0, +\infty[$.

Setting
$$F : (a, b, c) \in \mathcal{X} \longmapsto \begin{pmatrix} \bar\beta c - (\mu + 2\bar\beta x_0) a \\ \lambda - \gamma b - \tau a b + \bar\beta x_0^2 a \\ \tau a b - \mu c - \bar\beta x_0^2 a \end{pmatrix},$$
we write the differential system in the (autonomous) form
$$\frac{d}{dt} Y = F(Y).$$
Since the function $F$ is polynomial, for all initial data $Y_{\mathrm{Init}} \in \mathcal{X}$, the Cauchy–Lipschitz theorem ensures the existence and uniqueness of a solution $t \mapsto Y(t)$ defined over an interval $[0, T[$ and such that $Y(0) = Y_{\mathrm{Init}}$. We then observe that, if $U_{\mathrm{Init}} = P_{\mathrm{Init}} = 0$, then the corresponding solution is given by $U(t) = 0$, $P(t) = 0$ and
$$V(t) = e^{-\gamma t} V_{\mathrm{Init}} + \frac{\lambda}{\gamma}\big( 1 - e^{-\gamma t} \big). \qquad [3.31]$$

By uniqueness, and since the system is autonomous, this expression determines all the solutions for which $U$ and $P$ vanish simultaneously (at some time $t_0$). We now consider the solution associated with initial values $U_{\mathrm{Init}} > 0$, $V_{\mathrm{Init}} >$


$0$ and $P_{\mathrm{Init}} > x_0 U_{\mathrm{Init}}$. By continuity, the solution still satisfies these inequalities over a certain interval $[0, t_0[$. If $V(t_0) = 0$, then $\frac{d}{dt}V(t_0) \ge \lambda > 0$ implies that $t \mapsto V(t)$ is strictly increasing over a neighborhood $[t_0 - \eta, t_0[$, which leads to a contradiction. Similarly, if $U(t_0) = 0$, then $\frac{d}{dt}U(t_0) = \bar\beta P(t_0) > 0$ (since the initial data and the uniqueness of the solution to the Cauchy problem exclude the possibility that $U$ and $P$ vanish at the same time), which again leads to a contradiction. Finally, we calculate $\frac{d}{dt}(P - x_0 U) = \tau V U - (\mu + \bar\beta x_0)(P - x_0 U)$. Thus, if $(P - x_0 U)(t_0) = 0$, we obtain $\frac{d}{dt}(P - x_0 U)(t_0) = \tau V U(t_0) > 0$, and we arrive once more at a contradiction. We conclude from this that the solution satisfies $U(t) > 0$, $V(t) > 0$ and $P(t) > x_0 U(t)$ for all $t \in [0, T[$. We will exploit the positivity of the solutions to demonstrate that the solution is globally defined ($T = +\infty$). Indeed, we note that there exists $C > 0$, depending on the coefficients $\bar\beta$, $\mu$, such that
$$\frac{d}{dt}(U + V + P) \le \lambda + C(U + V + P).$$
(Note that taking the sum allows the quadratic terms $\tau U V$ of [3.30] to cancel.) Grönwall's lemma thus implies that $(U + V + P)(t) \le e^{Ct}\big( (U + V + P)(0) + \lambda t \big)$, an estimate which guarantees that the solutions, which we recall are positive, are defined for all times $t \ge 0$.

We shall focus on the stationary solutions of [3.30] and on their stability. We immediately note that $(0, \lambda/\gamma, 0)$ is a stationary solution of [3.30]. This solution describes a healthy state, because there are no agglomerated molecules. This state also corresponds to the asymptotic behavior of solution [3.31] as $t \to \infty$. However, we note that
$$\bar U = \frac{\bar\beta\lambda\tau - \gamma(x_0\bar\beta + \mu)^2}{\mu\tau(2x_0\bar\beta + \mu)}, \qquad \bar V = \frac{(x_0\bar\beta + \mu)^2}{\bar\beta\tau}, \qquad \bar P = \frac{\bar\beta\lambda\tau - \gamma(x_0\bar\beta + \mu)^2}{\bar\beta\tau\mu}$$
is also a stationary solution. This solution is physically admissible when $\sqrt{\bar\beta\lambda\tau/\gamma} > x_0\bar\beta + \mu$ and thus corresponds to an infected state. From a biological point of view, the stability of these states is a critical question. It would be natural to study the linear stability of these solutions, but in fact we can establish a more precise result [GRE 06, PRÜ 06].

THEOREM 3.10.– i) If $\sqrt{\bar\beta\lambda\tau/\gamma} < x_0\bar\beta + \mu$, then the healthy state $(0, \lambda/\gamma, 0)$ is globally stable and $(U, V, P) \to (0, \lambda/\gamma, 0)$ as $t \to \infty$.
ii) If $\sqrt{\bar\beta\lambda\tau/\gamma} > x_0\bar\beta + \mu$, then the infected state $(\bar U, \bar V, \bar P)$ is globally stable.

In order to demonstrate i), the starting point is the calculation, for $A > 0$,
$$\frac{d}{dt}\Big[ \frac{1}{2}\Big( V - \frac{\lambda}{\gamma} \Big)^2 + A P + A\frac{\mu}{\bar\beta} U \Big] = -\gamma\Big( V - \frac{\lambda}{\gamma} \Big)^2 - U\,\Pi_A(V),$$


where
$$\Pi_A(V) = \Big( V - \frac{\lambda}{\gamma} \Big)\big( \tau V - \bar\beta x_0^2 \big) - A\Big( \tau V - \bar\beta x_0^2 - \frac{\mu^2}{\bar\beta} - 2\mu x_0 \Big)$$
$$= \tau V^2 - \Big( \bar\beta x_0^2 + \tau\frac{\lambda}{\gamma} + A\tau \Big) V + A\Big( \bar\beta x_0^2 + \frac{\mu^2}{\bar\beta} + 2\mu x_0 \Big) + \frac{\lambda}{\gamma}\bar\beta x_0^2$$
is a polynomial of degree 2 in $V$. The condition $\sqrt{\bar\beta\lambda\tau/\gamma} < x_0\bar\beta + \mu$ allows us to show that there exist $A > 0$ and $\pi > 0$ such that, for all real $z$, $\Pi_A(z) \ge \pi > 0$. Specifically, we note that $\lim_{|V| \to \infty} \Pi_A(V) = +\infty$, thus we only need to find $A$ such that $V \mapsto \Pi_A(V)$ has no real roots. This leads to the requirement that the discriminant
$$d(A) = \Big( \bar\beta x_0^2 + \tau\frac{\lambda}{\gamma} + A\tau \Big)^2 - 4\tau A\Big( \bar\beta x_0^2 + \frac{\mu^2}{\bar\beta} + 2\mu x_0 \Big) - 4\tau\frac{\lambda}{\gamma}\bar\beta x_0^2$$
$$= \tau^2 A^2 + 2\tau\Big( \tau\frac{\lambda}{\gamma} - \bar\beta x_0^2 - 2\frac{\mu^2}{\bar\beta} - 4\mu x_0 \Big) A + \Big( \bar\beta x_0^2 + \tau\frac{\lambda}{\gamma} \Big)^2 - 4\tau\frac{\lambda}{\gamma}\bar\beta x_0^2$$
is strictly negative for certain values of $A$. However, $A \mapsto d(A)$ is again a polynomial of degree 2, which satisfies $\lim_{A \to \infty} d(A) = +\infty$; it is thus sufficient to ensure that this polynomial has two distinct real roots. We are therefore led to prove that
$$\Big( \tau\frac{\lambda}{\gamma} - \bar\beta x_0^2 - 2\frac{\mu^2}{\bar\beta} - 4\mu x_0 \Big)^2 - \Big( \bar\beta x_0^2 + \tau\frac{\lambda}{\gamma} \Big)^2 + 4\tau\frac{\lambda}{\gamma}\bar\beta x_0^2$$
$$= \Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)^2 - 2\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)\Big( \tau\frac{\lambda}{\gamma} - \bar\beta x_0^2 \Big)$$
$$= 2\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)\Big( \frac{\mu^2}{\bar\beta} + 2\mu x_0 + \bar\beta x_0^2 - \tau\frac{\lambda}{\gamma} \Big)$$
$$= \frac{2}{\bar\beta}\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)\Big( (x_0\bar\beta + \mu)^2 - \bar\beta\tau\frac{\lambda}{\gamma} \Big) > 0.$$
Thus, assuming $\sqrt{\bar\beta\lambda\tau/\gamma} < x_0\bar\beta + \mu$, we must take $A \in\, ]A_-, A_+[$, with
$$A_\pm = \frac{1}{\tau^2}\left[ \tau\Big( -\tau\frac{\lambda}{\gamma} + \bar\beta x_0^2 + 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big) \pm \tau\sqrt{ \frac{2}{\bar\beta}\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)\Big( (x_0\bar\beta + \mu)^2 - \bar\beta\tau\frac{\lambda}{\gamma} \Big) } \right].$$
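The discriminant computation can be sanity-checked numerically: for coefficients satisfying the healthy-state condition, any $A$ chosen in $]A_-, A_+[$ makes $\Pi_A$ strictly positive. A minimal sketch (coefficients are illustrative, with `lam` standing for $\lambda$, `gam` for $\gamma$ and `beta` for $\bar\beta$):

```python
import numpy as np

lam, gam, tau, beta, mu, x0 = 1.0, 1.0, 0.1, 0.2, 1.0, 1.0
# condition: sqrt(beta*lam*tau/gam) ~ 0.141 < x0*beta + mu = 1.2
b0 = beta * x0**2
D = 2 * mu**2 / beta + 4 * mu * x0
rad = (2 / beta) * D * ((x0 * beta + mu) ** 2 - beta * tau * lam / gam)
Am = (tau * (-tau * lam / gam + b0 + D) - tau * np.sqrt(rad)) / tau**2
Ap = (tau * (-tau * lam / gam + b0 + D) + tau * np.sqrt(rad)) / tau**2
A = 0.5 * (Am + Ap)                         # pick A in the middle of ]A-, A+[
V = np.linspace(-50, 50, 10001)
Pi = ((V - lam / gam) * (tau * V - b0)
      - A * (tau * V - b0 - mu**2 / beta - 2 * mu * x0))
```

With this choice, $\Pi_A$ has no real roots and stays bounded away from zero on the sampled range, as required for the Lyapunov argument below.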


Since
$$\Big( -\tau\frac{\lambda}{\gamma} + \bar\beta x_0^2 + 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)^2 = \Big( -\tau\frac{\lambda}{\gamma} + \bar\beta x_0^2 \Big)^2 + \Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)^2 + \frac{2}{\bar\beta}\Big( -\tau\bar\beta\frac{\lambda}{\gamma} + \bar\beta^2 x_0^2 \Big)\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)$$
$$= \Big( -\tau\frac{\lambda}{\gamma} + \bar\beta x_0^2 \Big)^2 + \Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)^2 + \frac{2}{\bar\beta}\Big( -\tau\bar\beta\frac{\lambda}{\gamma} + (\bar\beta x_0 + \mu)^2 \Big)\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big) - \frac{2}{\bar\beta}\big( \mu^2 + 2\mu\bar\beta x_0 \big)\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)$$
$$= \Big( -\tau\frac{\lambda}{\gamma} + \bar\beta x_0^2 \Big)^2 + \frac{2}{\bar\beta}\Big( -\tau\bar\beta\frac{\lambda}{\gamma} + (\bar\beta x_0 + \mu)^2 \Big)\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big)$$
$$\ge \frac{2}{\bar\beta}\Big( -\tau\bar\beta\frac{\lambda}{\gamma} + (\bar\beta x_0 + \mu)^2 \Big)\Big( 2\frac{\mu^2}{\bar\beta} + 4\mu x_0 \Big),$$
we observe that $A_- > 0$, which allows us to find $A > 0$ such that $V \mapsto \Pi_A(V)$ is bounded below by a strictly positive constant $\pi$. Finally, for such an appropriate choice of $A > 0$, we thus obtain
$$\frac{d}{dt}\Big[ \frac{1}{2}\Big( V - \frac{\lambda}{\gamma} \Big)^2 + A P + A\frac{\mu}{\bar\beta} U \Big] \le -\gamma\Big( V - \frac{\lambda}{\gamma} \Big)^2 - \pi U.$$

By integrating, and recalling that $A P(t) \ge 0$, we arrive at
$$\frac{1}{2}\Big( V(t) - \frac{\lambda}{\gamma} \Big)^2 + A\frac{\mu}{\bar\beta} U(t) \le \frac{1}{2}\Big( V(t) - \frac{\lambda}{\gamma} \Big)^2 + A P(t) + A\frac{\mu}{\bar\beta} U(t)$$
$$\le \frac{1}{2}\Big( V_{\mathrm{Init}} - \frac{\lambda}{\gamma} \Big)^2 + A P_{\mathrm{Init}} + A\frac{\mu}{\bar\beta} U_{\mathrm{Init}} - \underline\gamma \int_0^t \Big[ \Big( V(s) - \frac{\lambda}{\gamma} \Big)^2 + U(s) \Big]\,ds,$$
where we have written $\underline\gamma = \min(\gamma, \pi)$. Hence, we can apply the techniques of section 1.2.7 to conclude that $\lim_{t\to\infty} V(t) = \lambda/\gamma$ and $\lim_{t\to\infty} U(t) = 0$, then $\lim_{t\to\infty} P(t) = 0$. We could also study the linear stability of the equilibrium states. The proof of ii) is much more delicate (see [PRÜ 06]), but we can numerically check the validity of this statement.

Distribution of sizes; stationary states

The preceding statements only give information, in the infected case, about the asymptotic macroscopic quantities $\bar U$ and $\bar P$, but do not tell us anything with regard


to the size distribution of the equilibrium solutions. We notice that the stationary solutions $(V_\infty, u_\infty)$ of problem [3.29] satisfy
$$\begin{cases} \text{(a)} \quad V_\infty\Big( \gamma + \displaystyle\int_{x_0}^\infty \tau u_\infty(x)\,dx \Big) = \lambda + 2\displaystyle\int_0^{x_0} x \Big( \int_{x_0}^\infty \beta(y)\kappa(x, y)\,u_\infty(y)\,dy \Big)\,dx, \\[2mm] \text{(b)} \quad V_\infty\,\dfrac{\partial}{\partial x}\big( \tau u_\infty(x) \big) = -\big(\mu + \beta(x)\big) u_\infty(x) + 2\displaystyle\int_x^\infty \beta(y)\kappa(x, y)\,u_\infty(y)\,dy, \end{cases} \qquad [3.32]$$
with the boundary condition $u_\infty(x_0) = 0$. Figure 3.25 shows the solution in the case where $\mu$ and $\tau$ are constant, $\beta(x) = \bar\beta x$ and $x_0 = 0$.

1 y

when y > x0 and 0 < x < y,

– x0 = 0, this final hypothesis signifying that the unstable aggregates are of negligible size (in the observations we do not see chains containing less than a dozen monomers), and we are interested in the effect of possible variations of τ with respect to the size variable. Under these hypotheses, system [3.32] becomes ˆ ⎧  ⎪ ⎨ (a) V∞ γ + 0



 τ u∞ (x) dx = λ,

  ⎪ ¯ ¯ ⎩ (b) V∞ ∂ τ u∞ (x) = −(μ + βx)u ∞ (x) + 2β ∂x

ˆ

[3.33]



u∞ (y) dy, x

with u∞ (0) = 0. We thus make the following crucial observations: – The solutions of [3.33b] are not unique: if, given V∞ , u∞ is a solution, then for all z ∈ R, zu∞ also satisfies [3.33b].


– Equation [3.33b] is sufficient to determine the value of $V_\infty$, knowing $u_\infty$. Specifically, we multiply [3.33b] by $x$, then integrate; using integration by parts together with
$$2\int_0^\infty x \Big( \int_x^\infty u_\infty(y)\,dy \Big)\,dx = \int_0^\infty x^2\,u_\infty(x)\,dx,$$
we finally obtain
$$V_\infty = \frac{\mu \displaystyle\int_0^\infty x\,u_\infty(x)\,dx}{\displaystyle\int_0^\infty \tau u_\infty(x)\,dx}. \qquad [3.34]$$

(Note that this value is invariant under the exchange of u∞ for z u∞.) The quantity determined by [3.34] does not depend on λ, γ; but, knowing a solution u∞ of [3.33b], we choose the solution z∞ u∞ of [3.33b] by imposing
\[
z_\infty = \frac{\lambda/V_\infty - \gamma}{\displaystyle\int_0^\infty \tau\,u_\infty(x)\,dx}.
\]

Thus, if u∞ is a solution of [3.33b] normalized by ∫₀^∞ u∞ dx = 1, then z∞ is nothing other than the total number of polymers in the infected state. Such a solution does not exist for all coefficient values. Notably, we must impose the condition V∞ < λ/γ for a meaningful equilibrium to exist. In order to return to a more familiar framework, we differentiate [3.33b] and we obtain
\[
V_\infty\,\frac{\partial^2}{\partial x^2}\big(\tau\,u_\infty(x)\big) + \frac{\partial}{\partial x}\big((\mu + \bar\beta x)\,u_\infty(x)\big) + 2\bar\beta\,u_\infty(x) = 0, \qquad [3.35]
\]

with the boundary conditions
\[
u_\infty(0) = 0, \qquad V_\infty\,\frac{\partial}{\partial x}(\tau u_\infty)(0) = 2\bar\beta \int_0^\infty u_\infty(y)\,dy. \qquad [3.36]
\]

In the case where τ is constant, we can find an explicit formula for the solution u∞, which solves
\[
V_\infty\,\tau\,\partial^2_{xx} u_\infty(x) + (\mu + \bar\beta x)\,\partial_x u_\infty(x) + 3\bar\beta\,u_\infty = 0.
\]


We introduce the change of variable
\[
X = x\sqrt{\frac{\bar\beta}{V_\infty \tau}} + \frac{\mu}{\sqrt{V_\infty \bar\beta\,\tau}} - 1,
\]
that is to say
\[
x = X\sqrt{\frac{\tau V_\infty}{\bar\beta}} - \frac{\mu}{\bar\beta} + \sqrt{\frac{\tau V_\infty}{\bar\beta}}.
\]
We thus set U(X) = u∞(x), which leads to
\[
\partial_x u_\infty(x) = \sqrt{\frac{\bar\beta}{\tau V_\infty}}\;\partial_X U(X).
\]
Thus, U satisfies
\[
\partial^2_{XX} U + (X + 1)\,\partial_X U + 3U = 0.
\]

We introduce the function Φ(X) = X + X²/2 and, since ∂²XXΦ = 1, we rewrite the equation in the form
\[
-\partial_X\big(\partial_X U + \partial_X\Phi\,U\big) = -\partial_X\Big(e^{-\Phi}\,\partial_X\big(e^{\Phi} U\big)\Big) = -\big(\partial^2_{XX}U + \partial_X\Phi\,\partial_X U + \partial^2_{XX}\Phi\,U\big) = 2U,
\]
where U appears as an eigenfunction, with the eigenvalue 2, of the Fokker–Planck operator u ↦ −∂X(e^{−Φ}∂X(e^{Φ}u)). We prove, since ∂²XXΦ − (∂XΦ)² = −2Φ, that U = Φe^{−Φ} is a solution of this equation with
\[
U(0) = 0, \qquad \partial_X U(0) = 1.
\]

In this particular case, we thus have
\[
u_\infty(x) = U\Big(x\sqrt{\frac{\bar\beta}{V_\infty \tau}} + \frac{\mu}{\sqrt{V_\infty \bar\beta\,\tau}} - 1\Big).
\]
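This eigenfunction identity is easy to check numerically. The sketch below (our illustration, not taken from the book) evaluates the analytic derivatives of U = Φe^{−Φ} and verifies that the residual of ∂²XXU + (X + 1)∂XU + 3U vanishes, together with the conditions U(0) = 0, ∂XU(0) = 1.

```python
import numpy as np

# Check that U(X) = Phi*exp(-Phi), Phi = X + X^2/2, satisfies
# U'' + (X + 1) U' + 3 U = 0 with U(0) = 0 and U'(0) = 1.
X = np.linspace(0.0, 6.0, 601)
Phi = X + 0.5 * X**2
dPhi = 1.0 + X                      # Phi'; note Phi'' = 1
E = np.exp(-Phi)
U = Phi * E
dU = dPhi * (1.0 - Phi) * E         # U' = Phi'(1 - Phi) e^{-Phi}
d2U = ((1.0 - Phi) - dPhi**2 * (2.0 - Phi)) * E
residual = d2U + (X + 1.0) * dU + 3.0 * U
```

Since ∂²XXΦ − (∂XΦ)² = −2Φ, the residual reduces to e^{−Φ}(1 + 2Φ − (∂XΦ)²) = 0 exactly; numerically it is at rounding level.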

We obtain V∞ by exploiting [3.34]. In fact, by directly integrating [3.33b] and taking account of the boundary conditions, we arrive at
\[
\mu\int_0^\infty u_\infty\,dx = \bar\beta\int_0^\infty x\,u_\infty\,dx.
\]


Returning to [3.34], we deduce from this that
\[
V_\infty = \frac{\mu^2}{\bar\beta\,\tau}.
\]
We have thus completely determined the stationary solution, which is obtained as a dilatation of the function U = Φe^{−Φ}. In the case where τ varies as a function of the size x, we will numerically evaluate the solutions of [3.32] and demonstrate the formation of bimodal profiles.

The numerical scheme works in the following manner. As we observed, the solutions of [3.33b], or of [3.35]–[3.36], are not unique; we shall thus parameterize the solutions by their integral. We first seek an approximation of the normalized solution, i.e. satisfying ∫₀^∞ u∞(x) dx = 1. In particular, the derivative at x = 0 in [3.36] is precisely determined by this constraint, which prescribes ∫u∞ dx. Then, the idea is to reinterpret x as a variable “of time” and to see the equation, with given V,
\[
V\,\frac{\partial^2}{\partial x^2}\big(\tau w(x)\big) + \frac{\partial}{\partial x}\big((\mu + \bar\beta x)\,w(x)\big) + 2\bar\beta\,w(x) = 0, \qquad w(0) = 0, \quad V\,\frac{\partial}{\partial x}(\tau w)(0) = 2\bar\beta, \qquad [3.37]
\]
as a Cauchy problem of order 2. Finally, to obtain V, the numerical scheme corresponds to iterations of the mapping
\[
\Psi : V \longmapsto \Psi(V) = \frac{\mu\displaystyle\int_0^\infty x\,w\,dx}{\displaystyle\int_0^\infty \tau\,w\,dx},
\]
which acts on formula [3.34], where w is the solution of the stationary problem [3.37] with coefficient V. The algorithm used corresponds to a discrete version of this process. Let h be a fixed discretization step size. For fixed Vⁿ > 0, we define (wᵢⁿ⁺¹)ᵢ∈N as a solution of
\[
\frac{1}{h^2}\big(\tau_{i+1}w_{i+1}^{n+1} - 2\tau_i w_i^{n+1} + \tau_{i-1}w_{i-1}^{n+1}\big) + \frac{1}{2h}\,\frac{1}{V^n}\big((\mu + \bar\beta x_{i+1})w_{i+1}^{n+1} - (\mu + \bar\beta x_{i-1})w_{i-1}^{n+1}\big) + \frac{2\bar\beta}{V^n}\,w_i^{n+1} = 0
\]
for i ∈ N, with
\[
w_0^{n+1} = 0, \qquad \frac{\tau_1 w_1^{n+1} - \tau_0 w_0^{n+1}}{h} = \frac{2\bar\beta}{V^n}.
\]


This discrete relation is consistent at order 2 with equation [3.37]. In practice, we must stop the calculation at a certain maximum value of the index i. As the quantity we are looking for is integrable, we expect (although it is evidently not true for all integrable functions) that it tends towards 0 at infinity; we therefore work with i ∈ {0, 1, ..., Imax}, with Imax chosen “sufficiently large”, and we numerically check that the last iterations do indeed take small values. Thus, having determined wⁿ⁺¹, we set
\[
V^{n+1} = \frac{h\,\mu \sum_i x_i\,w_i^{n+1}}{h \sum_i \tau_i\,w_i^{n+1}},
\]
which is a discrete version of [3.34] (in fact, we should rather use the trapezoidal rule to approximate the integrals with second-order accuracy). Even if we do not have a proof of convergence, we hope that the scheme converges to the fixed point of Ψ, which is indeed the solution we are looking for; this algorithm is designed to give, in a few iterations, an approximation of the normalized stationary solution (V∞, u∞). In order to validate the algorithm, we test it against the exact solution in the simple case where x₀ = 0, β̄ = 1, μ = 1 and τ = 1. We start from V⁰ = 1.8 and stop the algorithm when the relative error between two iterations is less than a fixed threshold (here |Vⁿ⁺¹ − Vⁿ|/|Vⁿ| < 10⁻⁶). The maximum size is Xmax = 10, and we discretize [0, 10] with 5,000 points. The approximated solution is obtained in nine iterations. Figure 3.25 compares this numerical solution with the exact solution: the curves are indistinguishable. Figure 3.26 shows the successive profiles wⁿ and the evolution of Vⁿ with the iterations. The algorithm does indeed find the theoretical value of V∞.
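The fixed-point algorithm can be transcribed as follows (a minimal sketch of our own, not the book's code; the function and parameter names are ours). In the validation case β̄ = μ = τ = 1, the iteration should return a value close to the theoretical V∞ = μ²/(β̄τ) = 1.

```python
import numpy as np

def stationary_profile(beta=1.0, mu=1.0, tau_fun=lambda x: np.ones_like(x),
                       V0=1.8, Xmax=10.0, N=2000, tol=1e-6, max_iter=50):
    """Iterate V^{n+1} = Psi(V^n): for fixed V^n, the three-term relation is
    marched forward in i like a Cauchy problem (x plays the role of time),
    then V is updated by the discrete version of [3.34]."""
    h = Xmax / N
    x = np.linspace(0.0, Xmax, N + 1)
    tau = tau_fun(x)
    V = V0
    for _ in range(max_iter):
        w = np.zeros(N + 1)
        # boundary conditions: w_0 = 0 and (tau_1 w_1 - tau_0 w_0)/h = 2*beta/V
        w[1] = 2.0 * beta * h / (tau[1] * V)
        for i in range(1, N):
            a = tau[i + 1] / h**2 + (mu + beta * x[i + 1]) / (2.0 * h * V)
            b = -2.0 * tau[i] / h**2 + 2.0 * beta / V
            c = tau[i - 1] / h**2 - (mu + beta * x[i - 1]) / (2.0 * h * V)
            w[i + 1] = -(b * w[i] + c * w[i - 1]) / a
        V_new = mu * np.sum(x * w) / np.sum(tau * w)
        if abs(V_new - V) < tol * abs(V):
            return V_new, x, w
        V = V_new
    return V, x, w
```

The update formula is invariant under rescaling of w, so the marching may produce an unnormalized profile; normalizing it by its integral afterwards recovers u∞.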

Figure 3.25. Stationary solution for constant τ


Figure 3.26. Stationary solution for constant τ: profiles for wⁿ (on the left) and the evolution of Vⁿ (on the right) as a function of the number of iterations

Figure 3.27 displays the solution obtained with the function
\[
\tau(x) = \tau_0 + A\exp\Big(-\frac{(x - m)^2}{\sigma^2}\Big).
\]
The values are as follows: β̄ = 0.03, μ = 0.05, τ₀ = 0.1, A = 100, m = 2 and σ = 1/2. The algorithm stops after 27 iterations, for a relative error on the iterations of V beneath the threshold 10⁻⁴ (see Figure 3.28 for the evolution of V and the profiles). We find V∞ = 0.0527. The profile, which we can demonstrate is normalized, does indeed display two “peaks”, in better agreement with the experimental observations.

Figure 3.27. Stationary solution for variable τ

Figure 3.28. Stationary solution for variable τ: profiles for wⁿ (on the left) and the evolution of Vⁿ (on the right) as a function of the number of iterations

The question of the stability of the stationary solutions, in healthy or infected states, is delicate; it is an open problem which is the subject of active research [CAL 09]. We can nevertheless numerically test the evolution of solutions of the simplified system
\[
\begin{cases}
\dfrac{d}{dt}V(t) = \lambda - \gamma V(t) - V(t)\displaystyle\int_{x_0}^\infty \tau(x)\,u(t,x)\,dx + \bar\beta x_0^2 \displaystyle\int_{x_0}^\infty u(t,y)\,dy, \\[2mm]
\dfrac{\partial}{\partial t}u(t,x) + V(t)\,\dfrac{\partial}{\partial x}\big(\tau(x)\,u(t,x)\big) = -(\mu(x) + \beta(x))\,u(t,x) + 2\bar\beta\displaystyle\int_x^\infty u(t,y)\,dy.
\end{cases}
\qquad [3.38]
\]
The rate of growth being positive, the upwinding principles lead to the following discretized version
\[
\frac{V^{n+1} - V^n}{\Delta t} = \lambda - \gamma V^n - V^n\,h\sum_{i \ge i_0} \tau_i U_i^n + \bar\beta x_0^2\,h\sum_{i \ge i_0} U_i^n,
\]
\[
\frac{U_i^{n+1} - U_i^n}{\Delta t} + V^n\,\frac{\tau_i U_i^n - \tau_{i-1} U_{i-1}^n}{\Delta x} = -(\mu + \bar\beta x_i)\,U_i^n + 2\bar\beta\,h\sum_{j \ge i} U_j^n.
\]

We carry out simulations with the initial conditions
\[
U_i^1 = \frac{x_i^2}{2(1 + x_i^4)},
\]
and setting U₁ⁿ = 0. The coefficients are β̄ = 0.03, μ = 0.02, λ = 0.06 and γ = 1. We test the case where τ is variable, with τ₀ = 0.2, A = 10, m = 2 and σ = 1/2. We work over an interval of maximum size Xmax = 10, until final time Tmax = 2. The step size in space has a value of Δx = 1/200 and the step size in time is calculated with a CFL number of 0.8. Figure 3.29 shows the profiles of the size distributions at times T = 30, 60, 90, 120 and 150, as well as a comparison with the stationary profile of the same mass (here around 2.24). We find V∞ = 0.0268 both from the algorithm that looks for a stationary solution, and as the value of V at final time for the evolution problem. We obtain the same value by evaluating [3.34] at final time. Figure 3.30 shows the comparative evolutions of V(t) and of
\[
W(t) = \frac{\mu\displaystyle\int_0^\infty x\,u(t,x)\,dx}{\displaystyle\int_0^\infty \tau\,u(t,x)\,dx},
\]
which is the quantity given in [3.34]. These two functions do indeed converge towards the same value, which coincides with the V∞ obtained by looking for stationary solutions. We note that the profile takes more time to develop than V and W need to approach their asymptotic value.
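A minimal transcription of this upwind discretization (our own sketch, with x₀ = 0, parameters close to those above and simplified boundary handling; not the code used for the figures):

```python
import numpy as np

def evolve(V0=1.0, beta=0.03, mu=0.02, lam=0.06, gamma=1.0,
           Xmax=10.0, N=2000, steps=3000, cfl=0.8):
    """Explicit upwind scheme for system [3.38] with x0 = 0: transport at
    speed V(t)*tau(x) is upwinded; the gain 2*beta*int_x^inf u dy is a tail sum."""
    dx = Xmax / N
    x = np.arange(N) * dx
    tau = 0.2 + 10.0 * np.exp(-(x - 2.0)**2 / 0.25)   # variable rate tau(x)
    U = x**2 / (2.0 * (1.0 + x**4))
    U[0] = 0.0
    V = V0
    for _ in range(steps):
        dt = cfl * dx / (V * tau.max())               # CFL-limited time step
        tail = np.cumsum(U[::-1])[::-1]               # sum_{j >= i} U_j
        flux = tau * U
        transport = -V * (flux - np.roll(flux, 1)) / dx
        transport[0] = 0.0
        U = U + dt * (transport - (mu + beta * x) * U + 2.0 * beta * dx * tail)
        U[0] = 0.0
        V = V + dt * (lam - gamma * V - V * dx * np.sum(tau * U))
    return x, U, V
```

Under the CFL condition the update preserves the positivity of U, and V relaxes towards a small positive equilibrium value.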

Figure 3.29. Evolution of the size distribution profile as a function of time for variable τ and comparison with the stationary state of the same mass (solution at time T = 150). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Figure 3.30. Evolution of V(t) (solid line) and W(t) (dashes). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


3.3. Wave equation

In this section, we are interested in solving the wave equation
\[
(\partial^2_{tt} - c^2\partial^2_{xx})u = 0.
\]

We will see later how this equation can be interpreted as a linear system of conservation laws. To start, we recall the well-posed nature of this equation.

THEOREM 3.11.– Let c > 0, g ∈ C¹(R) and h ∈ C⁰(R). The Cauchy problem
\[
(\partial^2_{tt} - c^2\partial^2_{xx})u = 0, \qquad u(0, x) = g(x), \qquad \partial_t u(0, x) = h(x), \qquad [3.39]
\]
has a unique solution u ∈ C¹(R × R), determined by d'Alembert's formula
\[
u(t, x) = \frac{1}{2}\big(g(x - ct) + g(x + ct)\big) + \frac{1}{2c}\int_{x - ct}^{x + ct} h(y)\,dy. \qquad [3.40]
\]

PROOF.– With g of class C¹ and continuous h, the function u given by [3.40] is of class C¹ over R × R and its derivatives satisfy
\[
\partial_t u(t,x) = \frac{c}{2}\big(g'(x + ct) - g'(x - ct)\big) + \frac{1}{2}\big(h(x + ct) + h(x - ct)\big),
\]
\[
\partial_x u(t,x) = \frac{1}{2}\big(g'(x + ct) + g'(x - ct)\big) + \frac{1}{2c}\big(h(x + ct) - h(x - ct)\big).
\]
In particular, we have u(0, x) = g(x) and ∂t u(0, x) = h(x). It thus follows that
\[
(\partial_t + c\partial_x)u(t,x) = c\,g'(x + ct) + h(x + ct), \qquad (\partial_t - c\partial_x)u(t,x) = -c\,g'(x - ct) + h(x - ct).
\]
If g and h are more regular, for example C² and C¹, respectively, we can differentiate once more and demonstrate that (∂t − c∂x)(∂t + c∂x)u = 0 = (∂t + c∂x)(∂t − c∂x)u = (∂²tt − c²∂²xx)u. In order to work in the C¹ framework, we must interpret the derivatives in a weak sense. Let ϕ ∈ C_c^∞(R × R). We calculate
\[
\big\langle (\partial^2_{tt} - c^2\partial^2_{xx})u, \varphi \big\rangle = -\big\langle (\partial_t + c\partial_x)u, (\partial_t - c\partial_x)\varphi \big\rangle = -\iint_{\mathbb R\times\mathbb R} \big(c\,g'(x + ct) + h(x + ct)\big)\,(\partial_t - c\partial_x)\varphi(t,x)\,dx\,dt.
\]


We use the change of variables X = x + ct, T = x − ct, whose Jacobian matrix is
\[
\begin{pmatrix} \partial_x X & \partial_t X \\ \partial_x T & \partial_t T \end{pmatrix} = \begin{pmatrix} 1 & c \\ 1 & -c \end{pmatrix},
\]

such that dX dT = 2c dx dt. We also introduce ψ(T, X) = ϕ(t, x), which satisfies
\[
\partial_t\varphi(t,x) = \partial_t X\,\partial_X\psi(T,X) + \partial_t T\,\partial_T\psi(T,X) = c\,\partial_X\psi(T,X) - c\,\partial_T\psi(T,X),
\]
\[
\partial_x\varphi(t,x) = \partial_x X\,\partial_X\psi(T,X) + \partial_x T\,\partial_T\psi(T,X) = \partial_X\psi(T,X) + \partial_T\psi(T,X),
\]
and (∂t − c∂x)ϕ(t, x) = −2c ∂Tψ(T, X). We thus arrive at
\[
\big\langle (\partial^2_{tt} - c^2\partial^2_{xx})u, \varphi \big\rangle = \iint_{\mathbb R\times\mathbb R} \big(c\,g'(X) + h(X)\big)\,\partial_T\psi(T,X)\,dX\,dT = \int_{\mathbb R} \big(c\,g'(X) + h(X)\big)\Big(\int_{\mathbb R} \partial_T\psi(T,X)\,dT\Big)\,dX = 0,
\]

using the fact that ψ also has compact support. We have shown that u is a solution of [3.39]. Finally, we remark that if supp(g), supp(h) ⊂ [−R, R], then supp(u(t, ·)) ⊂ [−R − c|t|, R + c|t|].

[3.41]

This fundamental property indicates that the wave equation propagates information at a finite speed (like the transport equation and unlike the heat equation). Now assume that we have two solutions of [3.39], in the weak sense studied previously. Then, their difference v(t, x) ∈ C¹(R × R) satisfies [3.39] with g = 0 = h. We shall show that v is identically zero. Fix T ∈ R and ζ ∈ C_c^∞(R). We set
\[
\psi(t, x) = \frac{1}{2c}\int_{x - c(t - T)}^{x + c(t - T)} \zeta(y)\,dy.
\]

In other words, ψ ∈ C^∞(R × R) is a solution of (∂²tt − c²∂²xx)ψ = 0, ψ(T, x) = 0 and ∂tψ(T, x) = ζ(x). We set
\[
I(t) = \int_{\mathbb R} \partial_t v(t,x)\,\psi(t,x)\,dx - \int_{\mathbb R} v(t,x)\,\partial_t\psi(t,x)\,dx.
\]


This quantity is well defined since, assuming supp(ζ) ⊂ [−R, +R], for all fixed t, x ↦ ψ(t, x) is supported in [−R − c|t − T|, R + c|t − T|]. We even have I ∈ C⁰(R) with I(0) = 0 and I(T) = −∫ v(T, x)ζ(x) dx. We shall finish by showing that I is identically null. Let θ ∈ C_c^∞(R). By using integrations by parts, we evaluate
\[
\int_{\mathbb R} I(t)\theta'(t)\,dt = \iint \partial_t v\,\psi\,\theta'\,dx\,dt - \iint v\,\partial_t\psi\,\theta'\,dx\,dt = \iint \partial_t v\,\psi\,\theta'\,dx\,dt + \iint \partial_t v\,\partial_t\psi\,\theta\,dx\,dt + \iint v\,\partial^2_{tt}\psi\,\theta\,dx\,dt,
\]
where the second equality comes from an integration by parts in t. Since ∂²ttψ = c²∂²xxψ and ∂x(ψθ) = θ∂xψ, an integration by parts in x then gives
\[
\int_{\mathbb R} I(t)\theta'(t)\,dt = \iint \partial_t v\,\partial_t(\psi\theta)\,dx\,dt - c^2\iint \partial_x v\,\partial_x(\psi\theta)\,dx\,dt = \iint (\partial_t - c\partial_x)v\,(\partial_t + c\partial_x)(\psi\theta)\,dx\,dt = 0,
\]
where the cross terms c∬(∂t v ∂x(ψθ) − ∂x v ∂t(ψθ)) dx dt vanish, being integrals of exact derivatives of compactly supported quantities, and the last equality expresses that v is a weak solution of the wave equation.

We apply this reasoning with
\[
\theta(t) = \int_0^t \Big( h(\tau) - \kappa(\tau)\int_0^T h(s)\,ds \Big)\,d\tau,
\]
where h and κ are continuous functions, supported in [0, T], additionally with ∫₀^T κ(s) ds = 1. In particular, θ is C¹ over [0, T] (without loss of generality, we assume here that T > 0) and satisfies θ(0) = 0 as well as θ(T) = 0 (thanks to the normalization hypothesis on κ), which justifies the integrations by parts carried out above. As θ'(t) = h(t) − κ(t)∫₀^T h(s) ds, we thus obtain
\[
\int_{\mathbb R} I(t)\theta'(t)\,dt = \int_0^T I(t)\theta'(t)\,dt = 0 = \int_0^T \Big( I(t) - \int_0^T I(s)\kappa(s)\,ds \Big) h(t)\,dt.
\]

With this being satisfied for all test functions h, we deduce that I(t) is constant over [0, T ] with I(0) = 0, thus I(t) = 0 over [0, T ], and finally v(t, x) = 0. 
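D'Alembert's formula also lends itself to a direct numerical check. The sketch below (ours, using an elementary trapezoidal quadrature) takes g(x) = sin x, h(x) = cos x and c = 1, for which [3.40] gives u(t, x) = sin x cos t + cos x sin t = sin(x + t).

```python
import numpy as np

def dalembert(g, h, c, t, x, nq=2000):
    """Evaluate formula [3.40] with a trapezoidal rule for the h-integral."""
    y = np.linspace(x - c * t, x + c * t, nq)
    hy = h(y)
    integral = np.sum((hy[1:] + hy[:-1]) * np.diff(y)) / 2.0
    return 0.5 * (g(x - c * t) + g(x + c * t)) + integral / (2.0 * c)

u = dalembert(np.sin, np.cos, c=1.0, t=0.7, x=1.3)   # exact value: sin(2.0)
```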


The property of finite speed of propagation, see [3.41], also plays a fundamental role in the stability analysis of numerical schemes for the wave equation. Formula [3.40] shows that the evaluation of the solution u at time t and position x depends only on the values taken by the data in the interval, called the light cone, C(t, x) = [x − c|t|, x + c|t|]. A natural finite difference scheme for the wave equation is expressed as
\[
\frac{1}{\Delta t^2}\big(u_j^{n+1} - 2u_j^n + u_j^{n-1}\big) = \frac{c^2}{\Delta x^2}\big(u_{j+1}^n - 2u_j^n + u_{j-1}^n\big). \qquad [3.42]
\]
To initiate the scheme, we use u_j^0 = g(jΔx) and (u_j^0 − u_j^{−1})/Δt = h(jΔx). Thus, the approximated solution at time (n + 1)Δt and position jΔx depends on the approximated solution at times nΔt and (n − 1)Δt and neighboring positions (j ± 1)Δx and jΔx (see Figure 3.31). By recurrence, we thus establish that u_j^n depends only on the initial states u_k^0 and u_k^{−1} for indices k such that |j − k| ≤ n. In other words, u_j^n depends only on the values taken by the data in the interval C_j^n = [(j − n)Δx, (j + n)Δx] (see Figure 3.32). Thus, to obtain results consistent with the continuous problem, we must impose C(nΔt, jΔx) ⊂ C_j^n: if not, we could affect the exact solution at (nΔt, jΔx) with a modification of the initial data at the boundaries of C(nΔt, jΔx) without the approximated solution being changed. This constraint leads to the condition cΔt ≤ Δx. Geometrically, this means that the wave starting from jΔx does not reach, in a time Δt, the boundaries of the discretization interval [(j − 1)Δx, (j + 1)Δx] (see Figure 3.33).
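The CFL constraint can be observed directly on scheme [3.42]. The sketch below (our illustration; the Gaussian datum anticipates the test case used for the figures) runs the scheme with ratios cΔt/Δx on either side of 1 and returns the final amplitude: below 1 the solution stays bounded, above 1 the round-off content of the high-frequency modes is amplified at each step.

```python
import numpy as np

def leapfrog_max(ratio, c=3.0, L=10.0, N=200, T=1.5):
    """Run scheme [3.42] with homogeneous Dirichlet conditions and
    c*dt/dx = ratio; return max |u| at the final time."""
    dx = L / N
    dt = ratio * dx / c
    x = np.linspace(0.0, L, N + 1)
    u_old = 10.0 * np.exp(-10.0 * np.abs(x - 5.0)**2)   # u^{-1} (h = 0)
    u = u_old.copy()
    for _ in range(int(T / dt)):
        u_new = np.zeros_like(u)
        u_new[1:-1] = (2.0 * u[1:-1] - u_old[1:-1]
                       + ratio**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2]))
        u_old, u = u, u_new
    return np.max(np.abs(u))
```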

Figure 3.31. Determination of the numerical solution at (n + 1)Δt and jΔx


Figure 3.32. Light cone of the numerical solution

Figure 3.33. CFL condition for the wave equation: the “numerical speed” Δx/Δt must be greater than the speed of propagation c, and the numerical cone contains the physical cone

These concepts, demonstrated by R. Courant, K. Friedrichs and H. Lewy in 1928 [COU 28], are illustrated in Figures 3.34–3.39, where we present the numerical results obtained with and without the condition cΔt ≤ Δx being satisfied. For these numerical tests, c = 3 and the initial data are
\[
g(x) = 10\,\exp(-10\,|x - 5|^2), \qquad h(x) = 0.
\]

10 exp(−10|x − 5|2 ), h(x) = 0.

We carry out the simulation over the domain [0, 10], supplementing the wave equation with homogeneous Dirichlet conditions u(t, 0) = u(t, 10) = 0. The final time here is T = 0.7. In particular, this time is short enough that the core initial information, concentrated at the position x = 5, interacts only negligibly with the boundary and the solution obtained corresponds to the solution of the problem posed over the whole of R. When the stability condition is not satisfied, we see parasitic oscillations appear, which grow with time and rapidly render the numerical solution incoherent.


Figure 3.34. Simulation of the wave equation under the stability constraint (cΔt/Δx = .9). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Figure 3.35. Simulation of the wave equation under the stability constraint (cΔt/Δx = .9). Comparison with the exact solution at final time T = .7. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 3.36. Simulation of the wave equation under the stability constraint (cΔt/Δx = .9). Representation of the theoretical light cone (solid line) and the numerical light cone (crosses). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Figure 3.37. Simulation of the wave equation beyond the stability constraint (cΔt/Δx = 1.1). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Figure 3.38. Simulation of the wave equation beyond the stability constraint (cΔt/Δx = 1.1). Comparison with the exact solution at final time T = .7. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

Figure 3.39. Simulation of the wave equation beyond the stability constraint (cΔt/Δx = 1.1). Representation of the theoretical light cone (solid line) and the numerical light cone (crosses). For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip


Another approach to the wave equation consists of writing it in the form of a system
\[
\partial_t u + c\,\partial_x v = 0, \qquad \partial_t v + c\,\partial_x u = 0.
\]

Indeed, by taking the derivative in time of the first equation and the derivative in space of the second equation, we recover that u is a solution of (∂²tt − c²∂²xx)u = 0. We set U(t, x) = (u(t, x), v(t, x)) and the wave equation thus appears as a linear system of conservation laws
\[
\partial_t U + W\,\partial_x U = 0, \qquad W = \begin{pmatrix} 0 & c \\ c & 0 \end{pmatrix}.
\]

We note that the Fourier transform (in space) of U(t, ·) satisfies
\[
\partial_t \widehat U(t, \xi) + A(\xi)\,\widehat U(t, \xi) = 0, \qquad A(\xi) = \begin{pmatrix} 0 & ic\xi \\ ic\xi & 0 \end{pmatrix}.
\]

The eigenvalues of A(ξ) are ±icξ and the eigenspaces are spanned by (1, ±1). As
\[
P^{-1}A(\xi)P = \begin{pmatrix} ic\xi & 0 \\ 0 & -ic\xi \end{pmatrix}, \qquad P = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\]
we introduce V(t, ξ) = P Û(t, ξ), which satisfies the diagonal system
\[
\partial_t V(t, \xi) + \begin{pmatrix} ic\xi & 0 \\ 0 & -ic\xi \end{pmatrix} V(t, \xi) = 0.
\]
We represent the components of the inverse Fourier transform of V(t, ξ) by w₊(t, x) and w₋(t, x); they are thus solutions of the transport equations with velocities ±c: ∂t w± ± c∂x w± = 0, and we recover U(t, x) by applying P⁻¹ to the vector (w₊(t, x), w₋(t, x)). We thus retrieve u = (w₊ + w₋)/2 and v = (w₊ − w₋)/2 as combinations of waves propagating at speeds ±c. In an equivalent manner, we relate the equation for (w₊, w₋) to that satisfied by U = (u, v) by diagonalizing the matrix W.


This approach motivates the design of a scheme based on the upwind discretization of the transport equations satisfied by w±, at positive/negative speeds. Specifically, we set
\[
w_{+,j}^{n+1} = w_{+,j}^n - \frac{c\Delta t}{\Delta x}\big(w_{+,j}^n - w_{+,j-1}^n\big), \qquad
w_{-,j}^{n+1} = w_{-,j}^n + \frac{c\Delta t}{\Delta x}\big(w_{-,j+1}^n - w_{-,j}^n\big).
\]

Thus, the corresponding scheme for u, v takes the following form
\[
u_j^{n+1} = \big(w_{+,j}^{n+1} + w_{-,j}^{n+1}\big)/2 = u_j^n - \frac{c\Delta t}{2\Delta x}\big(v_{j+1}^n - v_{j-1}^n\big) + \frac{c\Delta t}{2\Delta x}\big(u_{j+1}^n - 2u_j^n + u_{j-1}^n\big),
\]
\[
v_j^{n+1} = \big(w_{+,j}^{n+1} - w_{-,j}^{n+1}\big)/2 = v_j^n - \frac{c\Delta t}{2\Delta x}\big(u_{j+1}^n - u_{j-1}^n\big) + \frac{c\Delta t}{2\Delta x}\big(v_{j+1}^n - 2v_j^n + v_{j-1}^n\big).
\]
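A compact implementation of this diagonalized upwind scheme (our own sketch, with periodic boundaries and initial data u = g, v = 0, i.e. w± = g), compared with d'Alembert's formula:

```python
import numpy as np

def wave_upwind(g, c=1.0, L=10.0, N=400, T=1.0, cfl=0.9):
    """Upwind transport of w± at speeds ±c, then u = (w+ + w-)/2."""
    dx = L / N
    x = np.arange(N) * dx
    dt = cfl * dx / c
    wp = g(x).astype(float)    # w+ = u + v
    wm = g(x).astype(float)    # w- = u - v  (v = 0 initially)
    steps = int(T / dt)
    for _ in range(steps):
        wp = wp - c * dt / dx * (wp - np.roll(wp, 1))    # speed +c
        wm = wm + c * dt / dx * (np.roll(wm, -1) - wm)   # speed -c
    return x, steps * dt, 0.5 * (wp + wm)

g = lambda y: np.exp(-10.0 * (y - 5.0)**2)
x, t, u = wave_upwind(g)
exact = 0.5 * (g(x - t) + g(x + t))   # d'Alembert solution with h = 0
```

The profiles agree up to the O(Δx) numerical diffusion of the scheme.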

We thus recover a scheme of the Lax–Friedrichs type, with a centered finite difference to approximate the spatial derivative ∂x and the addition of a numerical diffusion of order Δx. A comparison with the finite difference scheme is given in Figure 3.40: the exact and FD solutions are practically indistinguishable, but the Lax–Friedrichs scheme displays a stronger numerical diffusion (underestimation of the extrema and spreading of the solution).

3.4. Nonlinear problems: conservation laws

3.4.1. Scalar conservation laws

Evidently, the numerical simulation of the linear transport equation with constant coefficients is of limited interest in itself, since we know the exact solution! The motivation for the discussion put forward in section 3.2.4 is, precisely because we have a simple explicit expression of the solution, on the one hand, to build up an intuition for constructing reliable methods, and, on the other hand, to lay down a theoretical basis and put in place tools for the analysis of numerical schemes. We can then extend this outlook to nonlinear situations, which are more relevant. We start by focusing on the case where the unknown u is a function with scalar values.

3.4.1.1. Elements of analysis

The remarkable fact arising from the analysis of solutions of conservation laws is the potential for singularities to appear in finite time: even if the initial data are infinitely smooth, the solution can become discontinuous in finite time. This


phenomenon is the exact opposite of the situation demonstrated for the diffusion equations. Indeed, we saw for the heat equation that, although the initial data can be discontinuous, the solution is instantaneously C ∞ . For the linear transport equation, the regularity is conserved: initial C k data produce a solution of class C k . For scalar conservation laws, in general, C ∞ data produce a discontinuous solution! Evidently, this phenomenon makes it difficult to give meaning to the equation when the solution is not even continuous: we must work within a framework of weak solutions. It is also important to highlight another fundamental difference with the heat equation. The heat equation propagates information at infinite speed. Indeed, the convolution relation shows that at all points x and at each point in time, the solution depends on the whole of the initial data, independently of its support. For the conservation laws, as for the transport equation and the wave equation, the information is propagated at a finite speed and the solution at a position x and a time t depends only on the values of the initial data within a limited domain.

Figure 3.40. Simulation of the wave equation: comparison of the Lax–Friedrichs-type scheme and the finite difference scheme. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

We can try to reproduce the argument used for the linear transport equation by defining the solution of
\[
\partial_t \rho + \partial_x f(\rho) = 0 \qquad [3.43]
\]


by integration along the characteristic curves. These are defined by the differential equation
\[
\frac{d}{ds}X(s; t, x) = f'\big(\rho(s, X(s; t, x))\big), \qquad X(t; t, x) = x. \qquad [3.44]
\]

Thus, X(s; t, x) is the position at time s of a particle occupying the position x at time t and subject to the velocity field (t, x) ↦ f'(ρ(t, x)). The chain rule again leads to

\[
\frac{d}{ds}\big[\rho(s, X(s; t, x))\big] = (\partial_t \rho)(s, X(s; t, x)) + \partial_s X(s; t, x)\,(\partial_x \rho)(s, X(s; t, x)) = \big(\partial_t \rho + f'(\rho)\,\partial_x \rho\big)(s, X(s; t, x)) = 0.
\]
We thus arrive at
\[
\rho(t, x) = \rho_{Init}\big(X(0; t, x)\big), \qquad [3.45]
\]
where ρInit is the initial data, for t = 0, of problem [3.43]. However, in contrast to the linear situation, expression [3.45] is now of an implicit nature, since the characteristics themselves depend on the solution ρ via formula [3.44]. We can nevertheless exploit the fact that the solution is conserved along the characteristic curves, ρ(t, X(t; 0, x)) = ρInit(x), to rewrite [3.44] in the form
\[
\frac{d}{dt}X(t; 0, x) = f'\big(\rho(t, X(t; 0, x))\big) = f'\big(\rho_{Init}(x)\big), \qquad X(0; 0, x) = x.
\]

It follows that X(t; 0, x) = x + t f'(ρInit(x)). For fixed t > 0, we use φt to signify the function x ↦ φt(x) = x + t f'(ρInit(x)). To express ρ(t, x) as a function of the initial data ρInit, we must determine the inverse of the function φt, which gives X(0; t, x), the position at time 0 of a particle leaving x at time t. We note that φ't(x) = 1 + t f''(ρInit(x)) ρ'Init(x) does not necessarily have a constant sign. Thus, φt has an inverse defined over R when f is convex and ρInit is monotonically increasing (since, in this case, we have φ't(x) > 0, so x ↦ φt(x) is strictly monotonically increasing), but in general there can exist times t > 0 for which φt is not invertible. Formulae [3.44] and [3.45] then no longer have meaning.
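For Burgers' equation (f'(ρ) = ρ, f''(ρ) = 1), the invertibility of φt fails as soon as φ't(x) = 1 + t ρ'Init(x) vanishes, i.e. at the critical time t* = −1/min ρ'Init when ρInit is somewhere decreasing. A small sketch (ours) estimating t* on a grid:

```python
import numpy as np

def burgers_breakdown_time(rho_init, xs):
    """phi_t(x) = x + t*rhoInit(x) stops being invertible when
    phi_t'(x) = 1 + t*rhoInit'(x) first vanishes."""
    drho = np.gradient(rho_init(xs), xs)
    m = drho.min()
    return np.inf if m >= 0.0 else -1.0 / m

xs = np.linspace(-5.0, 5.0, 100001)
tstar = burgers_breakdown_time(lambda s: np.exp(-s**2), xs)
# for rhoInit = e^{-x^2}, min rho' = -sqrt(2) e^{-1/2}, so t* = e^{1/2}/sqrt(2)
```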


Let us now consider another approach that reveals this phenomenon of loss of regularity. We set v(t, x) = ∂x ρ(t, x). By differentiating [3.43], we obtain
\[
\partial_t v + f'(\rho)\,\partial_x v = -f''(\rho)\,v^2.
\]
We restrict ourselves to the case of Burgers' equation, where f(ρ) = ρ²/2. Thus, f''(ρ) = 1 and, along the characteristics, v satisfies the Riccati equation: V(t) = v(t, X(t; 0, x)) satisfies
\[
\frac{d}{dt}V = -V^2.
\]
The solutions of this equation are of the form V(t) = 1/(t + 1/V(0)). In other words,
\[
v(t, x) = \Big( t + \frac{1}{\partial_x \rho_{Init}(X(0; t, x))} \Big)^{-1}. \qquad [3.46]
\]

Figure 3.41. Loss of regularity of solutions of Burgers' equation. For the color version of this figure, see www.iste.co.uk/goudon/mathmodel.zip

However, the solutions of the Riccati equation are not globally defined (the function x² not being uniformly Lipschitzian): if V(0) < 0, then there exists 0 < T < ∞ such that V(t) = 1/(t + 1/V(0)) → ∞ as t → T. Thus, when ρInit is decreasing, formula [3.46] does not have any meaning for all times. Figure 3.41 illustrates this phenomenon. It shows the solution of Burgers' equation at a positive time, starting from regular, but not monotonic, data (specifically with a Gaussian profile). We note that the solution exhibits a discontinuity at position x ≈ 1.43. In particular, we note that the solution ρ(t, x) remains bounded; by contrast, its derivative in space ∂x ρ(t, x) tends to become infinite at that position and time. We thus understand the fault of the analysis by characteristics: with the term on the right-hand side of [3.44] not being Lipschitzian, the hypotheses of the Cauchy–Lipschitz theorem fail and the characteristic curves cannot be defined for all (t, x). Even if we consider regular initial data, it is thus absolutely unrealistic to expect the solution of [3.43] to remain regular for all times. We must therefore work in a framework of weak solutions.

DEFINITION 3.4.– We say that a function ρ ∈ L¹_loc([0, ∞[ × R) is a weak solution of [3.43] associated with the initial data ρInit if f(ρ) ∈ L¹_loc([0, ∞[ × R) and, for all test functions ϕ ∈ C_c^1([0, ∞[ × R), we have





ˆ



R

0

 ρ∂t φ + f (ρ)∂x φ (t, x) dx dt −

ˆ R

ρInit (x)φ(0, x) dx = 0. [3.47]

We note that the integrals that appear in [3.47] have meaning even though ρ is potentially discontinuous. This formula comes from multiplying [3.43] by ϕ and proceeding by integration by parts. This concept thus allows us to consider discontinuous solutions of [3.43]. Actually, we can characterize the discontinuities of the solutions of [3.43]. Indeed, let us assume that (t, x) ↦ ρ(t, x) is a weak solution of [3.43], of class C¹ on both sides of a curve of discontinuity Γ = {(x, t) ∈ R × [0, ∞[, x = s(t)}. The function t ↦ s(t) characterizes the propagation of the singularity. We write Ω⁻ = {(x, t) ∈ R × [0, ∞[, x < s(t)} and Ω⁺ = {(x, t) ∈ R × [0, ∞[, x > s(t)}. We represent by ν± = (νx±, νt±) the vector normal to Γ, pointing outward of Ω±. More specifically, we have
\[
\nu^- = \frac{1}{\sqrt{1 + |\dot s(t)|^2}} \begin{pmatrix} 1 \\ -\dot s(t) \end{pmatrix}, \qquad \nu^+ = -\nu^-.
\]

Let ϕ be a test function whose support is strictly enclosed within ]0, ∞[ × R. As ρ satisfies [3.47], we have
\[
0 = -\iint \big(\rho\,\partial_t\varphi + f(\rho)\,\partial_x\varphi\big)\,dx\,dt = -\iint_{\Omega^-} \ldots\,dx\,dt - \iint_{\Omega^+} \ldots\,dx\,dt
\]
\[
= \iint_{\Omega^-} \big(\partial_t\rho + \partial_x f(\rho)\big)\varphi\,dx\,dt - \int_{\Gamma^-} \big(\rho\,\nu_t^- + f(\rho)\,\nu_x^-\big)\varphi\,d\gamma + \iint_{\Omega^+} \big(\partial_t\rho + \partial_x f(\rho)\big)\varphi\,dx\,dt - \int_{\Gamma^+} \big(\rho\,\nu_t^+ + f(\rho)\,\nu_x^+\big)\varphi\,d\gamma.
\]
We have used Γ± here to represent the boundaries of Ω±. As ρ is a regular solution of [3.43] in the domains Ω±, we have
\[
\iint_{\Omega^\pm} \big(\partial_t\rho + \partial_x f(\rho)\big)\varphi\,dx\,dt = 0.
\]
Taking account of the fact that Γ± = Γ with ν⁺ = −ν⁻, we arrive at
\[
0 = -\iint \big(\rho\,\partial_t\varphi + f(\rho)\,\partial_x\varphi\big)\,dx\,dt = \int_{\Gamma} \big[(\rho^+ - \rho^-)\,\nu_t^- + \big(f(\rho^+) - f(\rho^-)\big)\,\nu_x^-\big]\varphi\,d\gamma,
\]
where ρ± signifies the limit of ρ over Γ coming from Ω±. We deduce from this the following relation – called the Rankine–Hugoniot relation – between the jumps in ρ and in f(ρ):
\[
(\rho^+ - \rho^-)\,\dot s - \big(f(\rho^+) - f(\rho^-)\big) = [[\rho]]\,\dot s - [[f(\rho)]] = 0.
\]


Figure 3.42. Curve of discontinuity and Rankine–Hugoniot relations
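The Rankine–Hugoniot relation can be observed on a simulation. For Burgers' equation with a decreasing jump from u⁻ = 2 to u⁺ = 0, the shock speed is ṡ = (f(u⁺) − f(u⁻))/(u⁺ − u⁻) = 1. The sketch below (ours) uses a Lax–Friedrichs discretization, in the spirit of the scheme encountered above for the wave system, and locates the numerical discontinuity at time T = 1; starting from x = 2, it should sit near x = 3.

```python
import numpy as np

def lax_friedrichs_burgers(rho0, L=10.0, N=800, T=1.0, cfl=0.45):
    """Lax-Friedrichs scheme for d_t rho + d_x(rho^2/2) = 0 on a periodic grid."""
    dx = L / N
    x = (np.arange(N) + 0.5) * dx
    rho = rho0(x).astype(float)
    f = lambda r: 0.5 * r * r
    t = 0.0
    while t < T:
        dt = min(cfl * dx / max(np.abs(rho).max(), 1e-12), T - t)
        rp, rm = np.roll(rho, -1), np.roll(rho, 1)
        rho = 0.5 * (rp + rm) - dt / (2.0 * dx) * (f(rp) - f(rm))
        t += dt
    return x, rho

# Riemann data: rho = 2 for x < 2, 0 beyond; the shock starts at x = 2
x, rho = lax_friedrichs_burgers(lambda s: np.where(s < 2.0, 2.0, 0.0))
shock_pos = x[rho > 1.0].max()   # rightmost point above the intermediate value
```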

The identification of this phenomenon of loss of regularity dates back to the middle of the 19th century when Challis demonstrated the paradox that the solutions


of the equations of the dynamics of isentropic gases (which we will meet later) can take different values at one point at certain times [CHA 48]. Stokes then established the jump relations [STO 48], before Riemann [RIE 58/59], Rankine [RAN 70] and Hugoniot [HUG 87, HUG 89] pushed the analysis further. However, considering discontinuous solutions brings new difficulties: we realize that weak solutions of such nonlinear problems are not unique. For example, for Burgers' equation, f(ρ) = ρ²/2, with identically null initial data ρInit(x) = 0, we can prove that ρ1(t, x) = 0 and

ρ2(t, x) = −1_{−t/2 ≤ x < 0}(x) + 1_{0 ≤ x < t/2}(x)

are both weak solutions. The uniqueness is restored by selecting, among the weak solutions, those satisfying entropy inequalities. Let us examine what these inequalities, written for the Kruzhkov pairs η(u) = |u − k|, q(u) = sgn(u − k)(f(u) − f(k)), k ∈ R, impose on a discontinuity separating the values u− and u+ and propagating at the speed ṡ; across such a curve, the criterion requires ṡ(η(u+) − η(u−)) − (q(u+) − q(u−)) ≥ 0:

– with k ≥ max(u−, u+), we find

f(u−) − f(k) − f(u+) + f(k) = f(u−) − f(u+) ≤ ṡ(k − u+ − k + u−) = ṡ(u− − u+);

– with k = θu− + (1 − θ)u+, 0 ≤ θ ≤ 1, we arrive at

ṡ(|u+ − θu− − (1 − θ)u+| − |u− − θu− − (1 − θ)u+|) = −(1 − 2θ)ṡ|u+ − u−|
≥ (f(u+) − f(k)) sgn(u+ − θu− − (1 − θ)u+) − (f(u−) − f(k)) sgn(u− − θu− − (1 − θ)u+)
≥ sgn(u+ − u−)(f(u+) − 2f(k) + f(u−)).

By using the expression

ṡ = (f(u+) − f(u−))/(u+ − u−),

we can write

ṡ|u+ − u−| = sgn(u+ − u−)(f(u+) − f(u−)).

We are thus led to

sgn(u+ − u−)(θf(u−) + (1 − θ)f(u+) − f(θu− + (1 − θ)u+)) ≤ 0,

or again, for 0 < θ < 1,

sgn(u+ − u−) [ (f(u+ + θ(u− − u+)) − f(u+))/θ + f(u+) − f(u−) ] ≥ 0.

By taking the limit θ → 0, we arrive at

(u+ − u−) sgn(u+ − u−) × ( −f′(u+) + (f(u+) − f(u−))/(u+ − u−) ) = |u+ − u−| (ṡ − f′(u+)) ≥ 0,

that is, f′(u+) ≤ ṡ. We obtain the other of Lax's inequality criteria, namely ṡ ≤ f′(u−), by taking the limit θ → 1.
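Lax's inequalities f′(u+) ≤ ṡ ≤ f′(u−) are easy to test numerically; the sketch below (our own illustration, with hypothetical helper names) checks them for Burgers' equation, where the jump from 1 to −1 is admissible while the reversed jump is not:

```python
# Sketch: testing Lax's inequalities f'(u+) <= s' <= f'(u-) for Burgers' equation.
def f(u):
    return 0.5 * u ** 2

def fprime(u):
    return u

def satisfies_lax(u_minus, u_plus):
    """True if the jump (u-, u+), moving at Rankine-Hugoniot speed, is a Lax shock."""
    s = (f(u_plus) - f(u_minus)) / (u_plus - u_minus)
    return fprime(u_plus) <= s <= fprime(u_minus)

print(satisfies_lax(1.0, -1.0))  # True: admissible stationary shock
print(satisfies_lax(-1.0, 1.0))  # False: non-entropic "expansion shock"
```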



It is also important to understand, at least intuitively, the motivation behind the entropic criterion. The idea, notably developed in [LAD 57, LAX 54, LAX 57,


LAX 60, OLE 57], consists of seeing problem [3.43] as the limit when ε → 0 of the equations

∂t ρε + ∂x f(ρε) = ε ∂²xx ρε.

[3.51]

The introduction of the "viscous" term ε ∂²xx ρε can be motivated either by analogy with the derivation, at least formally, of the Euler equations from the Navier–Stokes equations when the viscosity becomes small (this is the concept which motivates the name "vanishing viscosity approximation"), or from numerical considerations: as we saw in the linear case, the "good" numerical schemes inherently introduce such a diffusion term, with a parameter ε dependent on the discretization parameters (see [3.25]). The approach for the analysis of [3.43] can thus be outlined as follows:

– Solve the nonlinear parabolic problem [3.51] with ε > 0. The equation remains nonlinear, but we expect that the diffusion term will allow us to work with more regular solutions (and, thus, if we think in terms of sequences of solutions, to work with better compactness properties), which will lead to the establishment of a fixed-point strategy⁴.

– Establish uniform estimates with respect to the parameter ε > 0.

– Deduce from these the compactness properties.

– Take the limit ε → 0. At this stage, the difficulty lies in dealing with the nonlinear term f(ρε), which necessitates strong convergence properties of ρε.

– Establish the entropic criterion and the uniqueness of entropic solutions.

These different points are based on extremely subtle analysis techniques which will not be explained in detail here; for more information, see [GOD 91, SER 96a]. However, we can explain from this point of view the origin of the entropic criterion. Indeed, the solutions ρε of [3.51] being regular, we have

∂t η(ρε) + ∂x q(ρε) = ε η′(ρε) ∂²xx ρε = ε ∂x(η′(ρε) ∂x ρε) − ε η″(ρε)|∂x ρε|².

4 In certain cases, we can in fact explicitly solve the nonlinear problem. Thus, for the viscous Burgers' equation ∂t ρε + ∂x(ρε²/2) = ε ∂²xx ρε, we introduce the Hopf–Cole transformation: vε(t, x) = exp(−(1/(2ε)) ∫_{−∞}^x ρε(t, y) dy), so that ρε(t, x) = −2ε ∂x vε(t, x)/vε(t, x) = −2ε ∂x ln(vε(t, x)), see [HOP 50]. We demonstrate that vε(t, x) satisfies the heat equation ∂t vε = ε ∂²xx vε; we thus have an explicit formula giving ρε(t, x) as a function of ρInit.
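The footnote's transformation can be exercised numerically: solve the heat equation for vε, then recover ρε = −2ε ∂x ln vε. The sketch below is our own (explicit heat scheme, copy boundary conditions, a step initial datum, all simplifying assumptions); it illustrates the bound |ρε| ≤ ‖ρInit‖∞ mentioned in the main text.

```python
import numpy as np

# Sketch of the Hopf-Cole transformation for viscous Burgers (our own
# discretization choices: explicit heat scheme, crude copy boundaries).
eps = 0.1
L, N = 10.0, 400
x = np.linspace(-L / 2, L / 2, N)
h = x[1] - x[0]

rho_init = np.where(x < 0.0, 1.0, 0.0)            # admissible 1 -> 0 step
# v0(x) = exp(-(1/(2 eps)) * int^x rho_init dy); the lower bound of the
# integral only multiplies v by a constant and does not affect rho.
v = np.exp(-np.cumsum(rho_init) * h / (2.0 * eps))

dt = 0.4 * h ** 2 / eps                           # explicit stability: eps*dt/h^2 <= 1/2
t, T = 0.0, 1.0
while t < T:                                      # v solves dt v = eps * dxx v
    v[1:-1] += eps * dt / h ** 2 * (v[2:] - 2.0 * v[1:-1] + v[:-2])
    v[0], v[-1] = v[1], v[-2]                     # copy (Neumann-type) boundaries
    t += dt

rho = -2.0 * eps * np.gradient(np.log(v), h)      # rho_eps = -2 eps dx ln(v_eps)
print(rho.min(), rho.max())                       # stays close to [0, 1]
```

Away from the boundaries, the recovered profile is the viscous shock traveling at the Rankine–Hugoniot speed 1/2, smeared over a width of order ε.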

Numerical Simulations of Partial Differential Equations: Time-dependent Problems

365

By integrating, we deduce from this

d/dt ∫ η(ρε) dx + ε ∫ η″(ρε)|∂x ρε|² dx = 0.

With η(ρ) = ρ²/2, we deduce from this that, if ρInit ∈ L²(R),

ρε is bounded in L∞(0, T; L²(R)) and √ε ∂x ρε is bounded in L²([0, T] × R).

More generally, with η(ρ) = ρᵖ/p or η(ρ) = [(ρ − ‖ρInit‖∞)+]², we show that ρε is bounded in L∞(0, T; Lᵖ(R)) if ρInit ∈ Lᵖ(R) for all 1 ≤ p ≤ ∞ (notably, |ρε(t, x)| ≤ ‖ρInit‖∞). Thus, we write

∂t ρε + ∂x f(ρε) = √ε ∂x(√ε ∂x ρε) −→ 0 as ε → 0,

and the only difficulty lies in taking the limit in f(ρε). Finally, if η is convex, we have

∂t η(ρε) + ∂x q(ρε) = √ε ∂x(√ε η′(ρε) ∂x ρε) − ε η″(ρε)|∂x ρε|²,

where the first term of the right-hand side tends to 0 as ε → 0 and the second is ≤ 0, a relation which motivates the entropic criterion: the term −ε η″(ρε)|∂x ρε|² is bounded; a priori its limit as ε → 0 is not null, but it has a sign. To summarize, we must keep in mind:

– The solutions of [3.43] generally become discontinuous in finite time, even if the initial data are regular.

– This phenomenon requires us to work in a framework of weak solutions.

– The weak solutions are not unique; the uniqueness is re-established by also imposing an entropic criterion.

In terms of numerical simulations, this discussion leads to the following difficulties:

– The schemes must be capable of reproducing these phenomena concerning the appearance of singularities,

– and of following their propagation as time develops.


– The solutions of [3.43] satisfy a priori natural conditions that the numerical solutions must conserve (preservation of the integral ∫ ρ(t, x) dx = ∫ ρInit(x) dx, L∞ estimates, etc.).

– The scheme must find the entropic solution and thus the numerical solutions must satisfy a discrete version of the entropic criterion.

Following all this, Figure 3.44 shows the numerical solutions obtained with four different schemes for the same Burgers' equation and starting with the same initial state. These solutions display discontinuities. However, the scheme labeled LxW exhibits strong oscillations and the solution produced by the UpW scheme does not satisfy Lax's criterion, in contrast to the LxF and EO schemes, which find an admissible solution.


Figure 3.44. Simulations of Burgers’ equation with four different numerical schemes
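The failure of the UpW panel can be mimicked in a few lines. The sketch below (our own minimal schemes and discretization, not necessarily those used for the figure) evolves a jump from −1 to +1: the left-biased flux F(u_{j+1}, u_j) = f(u_j), consistent in the FV sense but not monotone, leaves this non-entropic discontinuity frozen, while the Lax–Friedrichs flux opens it into the admissible rarefaction.

```python
import numpy as np

# Our illustration: a consistent but non-monotone flux can converge to a
# non-entropic solution; a monotone (Lax-Friedrichs) flux does not.
def f(u):
    return 0.5 * u ** 2

def step(u, flux, h, dt, c):
    F = flux(np.roll(u, -1), u, c)        # F_{j+1/2} = F(u_{j+1}, u_j)
    return u - dt / h * (F - np.roll(F, 1))

flux_upw = lambda ur, ul, c: f(ul)        # left-biased flux: consistent, not monotone
flux_lxf = lambda ur, ul, c: 0.5 * (f(ur) - c * ur) + 0.5 * (f(ul) + c * ul)

N, L = 200, 4.0
h = L / N
x = np.linspace(0.0, L, N, endpoint=False)
u0 = np.where((x > 1.0) & (x < 3.0), 1.0, -1.0)   # expansion jump at x = 1
dt = 0.5 * h                                       # CFL with c = sup|f'| = 1

u_upw, u_lxf = u0.copy(), u0.copy()
for _ in range(100):
    u_upw = step(u_upw, flux_upw, h, dt, 1.0)
    u_lxf = step(u_lxf, flux_lxf, h, dt, 1.0)

mid = (x > 0.9) & (x < 1.1)
print(np.abs(u_upw - u0).max())   # ~0: frozen on the non-entropic jump
print(u_lxf[mid])                 # intermediate values: rarefaction opened
```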

3.4.1.2. Lax–Friedrichs and Rusanov schemes for scalar conservation laws

The idea here is to draw inspiration from the study of the linear case to upwind the flux in an adequate manner. To this end, we set c = sup_{|u|≤M} |f′(u)| and we rewrite the equation in the form

∂t u + ∂x f(u) = ∂t u + (1/2) ∂x(f(u) − cu) + (1/2) ∂x(f(u) + cu) = 0.


We note that for f(u) − cu, the "velocity" f′(u) − c is, by construction, negative, while for f(u) + cu, the "velocity" f′(u) + c is, by construction, positive. This yields the upwinding strategy: to the right for the first term, and to the left for the second term. Thus, the numerical flux is defined by

F(u_{j+1}, u_j) = (1/2)(f(u_{j+1}) − c u_{j+1}) + (1/2)(f(u_j) + c u_j)   [3.52]

which thus leads to the scheme

u^{n+1}_j = u^n_j − (Δt/h_j)(F(u^n_{j+1}, u^n_j) − F(u^n_j, u^n_{j−1}))
          = u^n_j − (Δt/(2h_j))(f(u^n_{j+1}) − f(u^n_{j−1})) + c (Δt/(2h_j))(u^n_{j+1} − 2u^n_j + u^n_{j−1}).   [3.53]

This scheme can be interpreted as a consistent approximation in the finite difference sense, of order 1 in time and 2 in space, of the equation

∂t u + ∂x f(u) = (ch/2) ∂²xx u.
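Scheme [3.52]–[3.53] takes only a few lines of code; this sketch (our own, assuming a uniform mesh h and periodic boundary conditions as simplifications) applies it to Burgers' equation:

```python
import numpy as np

# Sketch of the Lax-Friedrichs scheme [3.52]-[3.53] for Burgers' equation
# on a uniform periodic mesh (our simplifying assumptions).
def f(u):
    return 0.5 * u ** 2

def lax_friedrichs_flux(u_right, u_left, c):
    # F(u_{j+1}, u_j) = (f(u_{j+1}) - c u_{j+1})/2 + (f(u_j) + c u_j)/2   [3.52]
    return 0.5 * (f(u_right) - c * u_right) + 0.5 * (f(u_left) + c * u_left)

def step(u, h, dt, c):
    # u_j^{n+1} = u_j^n - (dt/h)(F_{j+1/2} - F_{j-1/2})                  [3.53]
    F = lax_friedrichs_flux(np.roll(u, -1), u, c)   # F_{j+1/2}
    return u - dt / h * (F - np.roll(F, 1))         # F_{j-1/2} by shifting

N, L = 400, 2.0
h = L / N
x = np.linspace(0.0, L, N, endpoint=False)
u = np.where(x < 1.0, 1.0, -1.0)                    # admissible jump at x = 1
M = np.abs(u).max()
c = M                                               # sup of |f'(u)| = |u| on [-M, M]
dt = 0.9 * h / c                                    # CFL of proposition 3.8

for _ in range(100):
    u = step(u, h, dt, c)

print(np.abs(u).max())   # L-infinity stability: remains bounded by M = 1
```

The flux form makes the discrete conservation of the integral of u automatic on a periodic mesh, which the test below also checks.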

However, as we discussed in detail in the previous section for the linear transport equation, the concept of consistency in the finite difference sense is not really appropriate for the analysis of these schemes. It is therefore appropriate to introduce a new definition of consistency, adapted to the finite volume formalism.

DEFINITION 3.6.– A FV scheme [3.12] with a numerical flux Fj+1/2 = F(uj+1, uj) is said to be consistent in the FV sense with the equation ∂t u + ∂x f(u) = 0 if the numerical flux function F is such that F(u, u) = f(u).

Scheme [3.52]–[3.53] is evidently consistent in the sense of definition 3.6. Furthermore, we can identify a condition on the numerical parameters which guarantees the preservation of natural estimates on the solution of the problem.

PROPOSITION 3.8.– We assume that f is a convex function and that the numerical parameters are such that

sup_{|u|≤M} |f′(u)| Δt/h_k ≤ 1.

Then scheme [3.52]–[3.53] is L∞-stable: if |u⁰_j| ≤ M for all j, then |u^n_j| ≤ M for all n, j.


PROOF.– As f is convex, for all x, y, we have f(y) ≥ f(x) + f′(x)(y − x). We deduce from this that f(u^n_{j−1}) ≥ f(u^n_{j+1}) + f′(u^n_{j+1})(u^n_{j−1} − u^n_{j+1}) and f(u^n_{j+1}) ≥ f(u^n_{j−1}) + f′(u^n_{j−1})(u^n_{j+1} − u^n_{j−1}). It follows that

u^{n+1}_j = c (Δt/(2h_j))(u^n_{j+1} + u^n_{j−1}) + u^n_j (1 − c Δt/h_j) − (Δt/(2h_j))(f(u^n_{j+1}) − f(u^n_{j−1}))
≤ c (Δt/(2h_j))(u^n_{j+1} + u^n_{j−1}) + u^n_j (1 − c Δt/h_j) − (Δt/(2h_j)) f′(u^n_{j−1})(u^n_{j+1} − u^n_{j−1})
= (1 − c Δt/h_j) u^n_j + (Δt/(2h_j))(c − f′(u^n_{j−1})) u^n_{j+1} + (Δt/(2h_j))(c + f′(u^n_{j−1})) u^n_{j−1},

where, under the CFL condition, the three coefficients are nonnegative and sum to 1, so that

u^{n+1}_j ≤ sup_k u^n_k ≤ M.

Likewise, we obtain

u^{n+1}_j ≥ (1 − c Δt/h_j) u^n_j + (Δt/(2h_j))(c − f′(u^n_{j+1})) u^n_{j+1} + (Δt/(2h_j))(c + f′(u^n_{j+1})) u^n_{j−1} ≥ inf_k u^n_k ≥ −M.

The argument also demonstrates that the CFL constraint is still satisfied at the next step: f is convex, thus f′ is monotonically increasing, and since u^n_j ∈ [−M, M], we always have |f′(u^n_j)| Δt/h_j ≤ 1. □

The argument also demonstrates that the CFL constraint is still satisfied: f is convex, thus f  is monotonically increasing and we always have |f  (unj )| Δt hj ≤ 1.  More generally, it is interesting to introduce the following concept. D EFINITION 3.7.– We say that the numerical flux Fj+1/2 = F(uj+1 , uj ) for [3.12] is monotone if the function (u, v) → F(u, v) satisfies ∂u F(u, v) ≤ 0,

∂v F(u, v) ≥ 0.


This definition gives a simple, practical criterion to satisfy the maximum principle, that is to say the preservation of L∞ bounds by the numerical solution. The idea is then to identify a convex combination by writing

u^{n+1}_j = u^n_j − (Δt/h_j)(F(u^n_{j+1}, u^n_j) − F(u^n_j, u^n_{j−1}))
= u^n_j − (Δt/h_j)(F(u^n_{j+1}, u^n_j) − F(u^n_j, u^n_j)) − (Δt/h_j)(F(u^n_j, u^n_j) − F(u^n_j, u^n_{j−1}))
= u^n_j + λ^n_j (u^n_{j+1} − u^n_j) − μ^n_j (u^n_j − u^n_{j−1}),

where we have set

λ^n_j = −(Δt/h_j) (F(u^n_{j+1}, u^n_j) − F(u^n_j, u^n_j))/(u^n_{j+1} − u^n_j) = −(Δt/h_j) ∫₀¹ ∂1F(u^n_j + θ(u^n_{j+1} − u^n_j), u^n_j) dθ,

μ^n_j = (Δt/h_j) (F(u^n_j, u^n_j) − F(u^n_j, u^n_{j−1}))/(u^n_j − u^n_{j−1}) = (Δt/h_j) ∫₀¹ ∂2F(u^n_j, u^n_{j−1} + θ(u^n_j − u^n_{j−1})) dθ.

We thus obtain

u^{n+1}_j = (1 − λ^n_j − μ^n_j) u^n_j + λ^n_j u^n_{j+1} + μ^n_j u^n_{j−1}.

If the flux is monotone, the coefficients λ^n_j and μ^n_j are nonnegative. Let us assume then that the u^n_j are in [−M, M] and that the flux function F is Λ-Lipschitzian over [−M, M] × [−M, M]. Thus, we observe that λ^n_j ≤ (Δt/h_j)Λ and μ^n_j ≤ (Δt/h_j)Λ. It follows that u^{n+1}_j remains in the interval [−M, M] under the condition

2 (Δt/h_j) Λ ≤ 1.
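As an example for definition 3.7 (our own illustration), the Engquist–Osher flux for Burgers' f(u) = u²/2, written in the convention F(u, v) = F(u_{j+1}, u_j), is F(u, v) = f(min(u, 0)) + f(max(v, 0)); it is consistent in the sense of definition 3.6 and its monotonicity can be checked by finite differences:

```python
import numpy as np

# Sketch: Engquist-Osher flux for Burgers' f(u) = u^2/2, in the convention
# F(u, v) = F(u_{j+1}, u_j) of definitions 3.6 and 3.7 (our illustration).
def f(u):
    return 0.5 * u ** 2

def F_eo(u, v):
    # f^-(u) + f^+(v): the decreasing part of f upwinded to the right,
    # the increasing part upwinded to the left.
    return f(np.minimum(u, 0.0)) + f(np.maximum(v, 0.0))

# Consistency in the FV sense (definition 3.6): F(u, u) = f(u).
us = np.linspace(-2.0, 2.0, 41)
consistency_err = np.abs(F_eo(us, us) - f(us)).max()

# Monotonicity (definition 3.7): dF/du <= 0 and dF/dv >= 0, by finite differences.
delta = 1e-6
grid = np.linspace(-2.0, 2.0, 81)
U, V = np.meshgrid(grid, grid)
dFdu = (F_eo(U + delta, V) - F_eo(U - delta, V)) / (2 * delta)
dFdv = (F_eo(U, V + delta) - F_eo(U, V - delta)) / (2 * delta)
print(consistency_err, dFdu.max(), dFdv.min())
```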

LEMMA 3.6.– We assume that the flux F is monotone. We also assume that F is locally Lipschitzian. If we have |u^n_j| ≤ M for all j and the time step satisfies

2 (Δt/h_j) Λ ≤ 1,

where Λ represents the Lipschitz constant of F over [−M, M] × [−M, M], then for all j, we again have |u^{n+1}_j| ≤ M.

THEOREM 3.13 (Lax–Wendroff theorem).– We assume that

i) The scheme is consistent (in the FV sense),

ii) The scheme is L∞-stable so that for all n, j, |u^n_j| ≤ M.

We set u_Δ(t, x) = u^n_j 1_{nΔt ≤ t