Introduction to Scientific Computing and Data Analysis [2 ed.] 3031224299, 9783031224294

This textbook provides an introduction to numerical computing and its applications in science and engineering.

English Pages 562 [563] Year 2023

Table of contents :
Preface
Preface to Second Edition
Contents
1 Introduction to Scientific Computing
1.1 Unexpected Results
1.2 Floating-Point Number System
1.2.1 Normal Floats
1.2.2 Rounding
1.2.3 Non-Normal Floats
1.2.4 Significance
1.2.5 Flops
1.2.6 Functions
1.3 Arbitrary-Precision Arithmetic
1.4 Explaining, and Possibly Fixing, the Unexpected Results
1.5 Error and Accuracy
1.5.1 Over-computing?
1.6 Multicore Computing
2 Solving A Nonlinear Equation
2.1 The Problem to Solve
2.2 Bisection Method
2.2.1 Convergence Theory
2.3 Newton's Method
2.3.1 Picking x0
2.3.2 Order of Convergence
2.3.3 Failure
2.3.4 Examples
2.3.5 Convergence Theorem
2.4 Secant Method
2.4.1 Convergence Theorem
2.5 Other Ideas
2.5.1 Is Newton's Method Really Newton's Method?
3 Matrix Equations
3.1 An Example
3.2 Finding L and U
3.2.1 What Matrices Have a LU Factorization?
3.3 Implementing a LU Solver
3.3.1 Pivoting Strategies
3.3.2 LU and Gaussian Elimination
3.4 LU Method: Summary
3.4.1 Flops and Multicore Processors
3.5 Vector and Matrix Norms
3.5.1 Matrix Norm
3.6 Error and Residual
3.6.1 Correct Significant Digits
3.6.2 The Condition Number
3.6.3 A Heuristic for the Error
3.7 Positive Definite Matrices
3.7.1 Cholesky Factorization
3.8 Tri-Diagonal Matrices
3.9 Sparse Matrices
3.10 Nonlinear Systems
3.11 Some Additional Ideas
3.11.1 Yogi Berra and Perturbation Theory
3.11.2 Fixing an Ill-Conditioned Matrix
3.11.3 Insightful Observations about the Condition Number
3.11.4 Faster than LU?
3.11.5 Historical Comparisons
4 Eigenvalue Problems
4.1 Review of Eigenvalue Problems
4.2 Power Method
4.2.1 General Formulation
4.3 Extensions of the Power Method
4.3.1 Inverse Iteration
4.3.2 Rayleigh Quotient Iteration
4.4 Calculating Multiple Eigenvalues
4.4.1 Orthogonal Iteration
4.4.2 Regular and Modified Gram-Schmidt
4.4.3 QR Factorization
4.4.4 The QR Method
4.4.5 QR Method versus Orthogonal Iteration
4.4.6 Are the Computed Values Correct?
4.5 Applications
4.5.1 Natural Frequencies
4.5.2 Graphs and Adjacency Matrices
4.5.3 Markov Chain
4.6 Singular Value Decomposition
4.6.1 Derivation of the SVD
4.6.2 Interpretations of the SVD
4.6.3 Summary of the SVD
4.6.4 Consequences of a SVD
4.6.5 Computing a SVD
4.6.6 Low-Rank Approximations
4.6.7 Application: Image Compression
5 Interpolation
5.1 Information from Data
5.2 Global Polynomial Interpolation
5.2.1 Direct Approach
5.2.2 Lagrange Approach
5.2.3 Runge's Function
5.3 Piecewise Linear Interpolation
5.4 Piecewise Cubic Interpolation
5.4.1 Cubic B-splines
5.5 Function Approximation
5.5.1 Global Polynomial Interpolation
5.5.2 Piecewise Linear Interpolation
5.5.3 Cubic Splines
5.5.4 Chebyshev Interpolation
5.5.5 Chebyshev versus Cubic Splines
5.5.6 Other Ideas
5.6 Questions and Additional Comments
6 Numerical Integration
6.1 The Definition from Calculus
6.1.1 Midpoint Rule
6.2 Methods Based on Polynomial Interpolation
6.2.1 Trapezoidal Rule
6.2.2 Simpson's Rule
6.2.3 Cubic Splines
6.2.4 Other Interpolation Ideas
6.3 Romberg Integration
6.3.1 Computing Using Romberg
6.4 Methods for Discrete Data
6.5 Methods Based on Precision
6.5.1 1-Point Gaussian Rule
6.5.2 2-Point Gaussian Rule
6.5.3 m-Point Gaussian Rule
6.5.4 Error
6.5.5 Parting Comments
6.6 Adaptive Quadrature
6.7 Other Ideas
6.8 Epilogue
7 Initial Value Problems
7.1 Solution of an IVP
7.2 Numerical Differentiation
7.2.1 Higher Derivatives
7.2.2 Interpolation
7.3 IVP Methods Using Numerical Differentiation
7.3.1 The Five Steps
7.3.2 Additional Difference Methods
7.3.3 Error and Convergence
7.3.4 Computing the Solution
7.4 IVP Methods Using Numerical Integration
7.5 Runge-Kutta Methods
7.5.1 RK2
7.5.2 RK4
7.5.3 Stability
7.5.4 RK-n
7.6 Solving Systems of IVPs
7.6.1 Examples
7.6.2 Simple Approach
7.6.3 Component Approach and Symplectic Methods
7.7 Some Additional Questions and Ideas
7.7.1 RK4 Order Conditions
8 Optimization: Regression
8.1 Model Function
8.2 Error Function
8.2.1 Vertical Distance
8.3 Linear Least Squares
8.3.1 Two Parameters
8.3.2 General Case
8.3.3 QR Approach
8.3.4 Moore-Penrose Pseudo-Inverse
8.3.5 Overdetermined System
8.3.6 Regularization
8.3.7 Over-fitting the Data
8.4 QR Factorization
8.4.1 Finding Q and R
8.4.2 Accuracy
8.4.3 Parting Comments
8.5 Other Error Functions
8.6 Nonlinear Regression
8.7 Fitting IVPs to Data
8.7.1 Logistic Equation
8.7.2 FitzHugh-Nagumo Equations
8.7.3 Parting Comments
9 Optimization: Descent Methods
9.1 Introduction
9.2 Descent Methods
9.2.1 Descent Directions
9.3 Solving Linear Systems
9.3.1 Basic Descent Algorithm for Ax = b
9.3.2 Method of Steepest Descents for Ax = b
9.3.3 Conjugate Gradient Method for Ax = b
9.3.4 Large Sparse Matrices
9.3.5 Rate of Convergence
9.3.6 Preconditioning
9.3.7 Parting Comments
9.4 General Nonlinear Problem
9.4.1 Descent Direction
9.4.2 Line Search Problem
9.4.3 Examples
9.5 Levenberg-Marquardt Method
9.5.1 Parting Comments
9.6 Minimization Without Differentiation
9.7 Variational Problems
9.7.1 Example: Minimum Potential Energy
9.7.2 Example: Brachistochrone Problem
9.7.3 Parting Comments
10 Data Analysis
10.1 Introduction
10.2 Principal Component Analysis
10.2.1 Example: Word Length
10.2.2 Derivation of Principal Component Approximation
10.2.3 Summary of the PCA
10.2.4 Scaling Factors
10.2.5 Geometry and Data Approximation
10.2.6 Application: Crime Data
10.2.7 Practical Issues
10.2.8 Parting Comments
10.3 Independent Component Analysis
10.3.1 Derivation of the Method
10.3.2 Reduced Problem
10.3.3 Contrast Function
10.3.4 Summary of ICA
10.3.5 Application: Image Steganography
10.4 Data Based Modeling
10.4.1 Dynamic Modes
10.4.2 Example: COVID-19
10.4.3 Propagation Paths
10.4.4 Parting Comments
Appendix A Taylor's Theorem
A.1 Useful Examples for x Near Zero
A.2 Order Symbol and Truncation Error
Appendix B Vector and Matrix Summary
Appendix References
Index

Mark H. Holmes

Introduction to Scientific Computing and Data Analysis Second Edition

Editorial Board: T. J. Barth, M. Griebel, D. E. Keyes, R. M. Nieminen, D. Roose, T. Schlick

Texts in Computational Science and Engineering Volume 13

Series Editors
Timothy J. Barth, NASA Ames Research Center, National Aeronautics Space Division, Moffett Field, CA, USA
Michael Griebel, Institut für Numerische Simulation, Universität Bonn, Bonn, Germany
David E. Keyes, New York, NY, USA
Risto M. Nieminen, School of Science & Technology, Aalto University, Aalto, Finland
Dirk Roose, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
Tamar Schlick, Department of Chemistry, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA

This series contains graduate and undergraduate textbooks on topics described by the term “computational science and engineering”. This includes theoretical aspects of scientific computing such as mathematical modeling, optimization methods, discretization techniques, multiscale approaches, fast solution algorithms, parallelization, and visualization methods as well as the application of these approaches throughout the disciplines of biology, chemistry, physics, engineering, earth sciences, and economics.

Mark H. Holmes Department of Mathematical Sciences Rensselaer Polytechnic Institute Troy, NY, USA

ISSN 1611-0994
ISSN 2197-179X (electronic)
Texts in Computational Science and Engineering
ISBN 978-3-031-22429-4
ISBN 978-3-031-22430-0 (eBook)
https://doi.org/10.1007/978-3-031-22430-0

Mathematics Subject Classification: 65-XX, 65Dxx, 65Fxx, 65Hxx, 65Lxx, 68T09

1st edition: © Springer International Publishing Switzerland 2016
2nd edition: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The objective of this text is easy to state: it is to investigate ways to use a computer to solve various mathematical problems. One of the challenges for those learning this material is that it involves a nonlinear combination of mathematical analysis and nitty-gritty computer programming. Texts vary considerably in how they balance these two aspects of the subject. You can see this in the brief history of the subject given in Figure 1 (which is an example of what is called an ngram plot). According to this plot, the earlier books concentrated more on the analysis (theory). In the early 1970s this changed and there was more of an emphasis on methods (which generally means much less theory), and these continue to dominate the area today. However, the 1980s saw the advent of scientific computing books, which combine theory and programming, and you can see a subsequent decline in the other two types of books when this occurred. This text falls within this latter group.

Figure 1 Historical record according to Google, 1950 to 2010, for the expressions "Numerical Methods", "Scientific Computing", and "Numerical Analysis" (vertical axis: percentage; horizontal axis: year). The values are the number of instances that the expression appeared in a published book in the respective year, expressed as a percentage for that year, times 10^5 [Michel et al., 2011].

There are two important threads running through the text. One concerns understanding the mathematical problem that is being solved. As an example, when using Newton’s method to solve f(x) = 0, the usual statement is that it will work if you guess a starting value close to the solution. It is important to know how to determine good starting points and, perhaps even more importantly, to know whether the problem being solved even has a solution. Consequently, when deriving Newton’s method, and others like it, an effort is made to explain how to fairly easily answer these questions.

The second theme is the importance in scientific computing of having a solid grasp of the theory underlying the methods being used. A computer has the unfortunate ability to produce answers even if the methods used to find the solution are completely wrong. Consequently, it is essential to have an understanding of how the method works, and how the error in the computation depends on the method being used.

Needless to say, it is also important to be able to code these methods, and in the process be able to adapt them to the particular problem being solved. There is considerable room for interpretation on what this means. To explain, in terms of computing languages, the current favorites are MATLAB and Python. Using the commands they provide, a text such as this one becomes more of a user’s manual, reducing the entire book down to a few commands. For example, with MATLAB, this book (as well as most others in this area) can be replaced with the following commands:

    Chapter 1: eps
    Chapter 2: fzero(@f,x0)
    Chapter 3: A\b
    Chapter 4: eig(A)
    Chapter 5: polyfit(x,y,n)
    Chapter 6: integral(@f,a,b)
    Chapter 7: ode45(@f,tspan,y0)
    Chapter 8: fminsearch(@fun,x0)
    Chapter 9: svd(A)

Certainly this statement qualifies as hyperbole, and as an example, Chapters 4 and 5 should probably have two commands listed. The other extreme is to write all of the methods from scratch, something that was expected of students in the early days of computing. In the end, the level of coding depends on what the learning outcomes are for the course, and the background and computing prerequisites required for the course.
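The earlier remark about Newton’s method and starting values can be made concrete with a short sketch. The following Python fragment is not taken from the text (which presents its procedures in a language-independent format); the function names `newton`, `f`, and `fprime`, the tolerance, and the iteration cap are all illustrative assumptions. It applies the Newton iteration to f(x) = arctan(x), whose only root is x = 0, from two different starting values.

```python
import math

def newton(f, fprime, x0, tol=1e-12, max_iter=20):
    """Newton iteration for f(x) = 0 (illustrative sketch, not the book's code).

    Returns (last iterate, converged flag).
    """
    x = x0
    for _ in range(max_iter):
        slope = fprime(x)
        if slope == 0.0:           # cannot take a Newton step; give up
            return x, False
        step = f(x) / slope
        x -= step
        if abs(step) < tol:        # the update is negligible: accept x
            return x, True
    return x, False                # iteration cap reached without converging

def f(x):
    return math.atan(x)

def fprime(x):
    return 1.0 / (1.0 + x * x)

# Hypothetical starting values chosen to show the two behaviors.
root_near, ok_near = newton(f, fprime, 1.0)
root_far, ok_far = newton(f, fprime, 1.5)
print("x0 = 1.0:", root_near, "converged:", ok_near)
print("x0 = 1.5:", root_far, "converged:", ok_far)
```

In this sketch the run started at 1.0 settles on the root, while the run started at 1.5 produces iterates that grow in magnitude until the computed slope underflows to zero. The code returns a number in both cases, which is why the converged flag, and some understanding of when the method can be trusted, matters.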
Many of the topics included are typical of what is found in an upper division scientific computing course. There are also notable additions. This includes material related to data analysis, as well as variational methods and derivative-free minimization methods. Moreover, there are differences related to emphasis. An example here concerns the preeminent role matrix factorizations play in numerical linear algebra, and this is made evident in the development of the material. The coverage of any particular topic is not exhaustive, but intended to introduce the basic ideas. For this reason, numerous references are provided for those who might be interested in further study, and many of these are from the current research literature. To quantify this statement, a code was written that reads the tex.bbl file containing the references for this text, and then uses MATLAB to plot the number as a function of the year published. The result is Figure 2, and it shows that approximately half of the references were published in the last ten years. By the way, in terms of data generation and plotting, Figure 1 was produced by writing a code which reads the HTML source code for the ngram webpage and then uses MATLAB to produce the plot.

Figure 2 Number of references in this book, after 1950, as a function of the year they were published (median = 2006, mean = 2002).

The MATLAB codes used to produce almost every figure, and table with numerical output, in this text are available from the author’s website as well as from SpringerLink. In other words, the MATLAB codes for all of the methods considered, and the examples used, are available. These can be used as a learning tool. This also goes to the importance in computational-based research, and education, of providing open source to guarantee the correctness and reproducibility of the work. Some interesting comments on this can be found in Morin et al. [2012] and Peng [2011].

The prerequisites depend on which chapters are covered, but the typical two-year lower division mathematics program (consisting of calculus, matrix algebra, and differential equations) should be sufficient for the entire text. However, one topic plays an oversized role in this subject, and this is Taylor’s theorem. This also tends to be the topic that students had the most trouble with in calculus. For this reason, an appendix is included that reviews some of the more pertinent aspects of Taylor’s theorem. It should also be pointed out that there are numerous theorems in the text, as well as an outline of the proof for many of them. These should be read with care because they contain information that is useful when testing the code that implements the respective method (i.e., they provide one of the essential ways we will have to make sure the computed results are actually correct).

I would like to thank the reviewers of an early draft of the book, who made several very constructive suggestions to improve the text. Also, as usual, I would like to thank those who developed and have maintained TeXShop, a free and very good TeX previewer.

Troy, NY, USA
January 2016

Mark H. Holmes

Preface to Second Edition

One of the principal reasons for this edition is to include material necessary for an upper division course in computational linear algebra. So, least squares is covered more extensively, along with new material covering Householder QR, sparse matrix methods, preconditioning, and Markov chains. The computational implications of the Gershgorin, Perron-Frobenius, and Eckart-Young theorems have also been expanded. It is assumed in the development that a previous course in linear algebra has been taken, but as a refresher some of the more pertinent facts from such a course are given in an appendix.

Other significant changes include a reorganization and expansion of the exercises. Also, the material on cubic splines, particularly related to data analysis, has been expanded (e.g., Section 6.4), and the presentation of Gaussian quadrature has been modified. There is also a new section providing an introduction to data-based modeling and dynamic modes.

As with the first edition, the material is developed and presented, independent of the language used for computing. So, there are no computer codes in the text. Rather, the procedures are written in a generic format, such as Newton’s method on page 41. There are, however, a few exercises where example MATLAB commands are given indicating how the problem can be done (e.g., the commands on page 198 for inputting, and displaying, a grayscale image). All of the codes used in the computational examples in the text are available from the author’s GitHub repository (https://github.com/HolmesRPI/IntroSciComp2nd). Other files for the book, such as an errata page and the answers for selected exercises, are also there.

Troy, NY, USA
August 2022

Mark H. Holmes

ix

Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

v

Preface to Second Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

1

Introduction to Scientific Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Unexpected Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Floating-Point Number System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Normal Floats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.3 Non-Normal Floats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.4 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.5 Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.6 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Arbitrary-Precision Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Explaining, and Possibly Fixing, the Unexpected Results . . . . . . . 1.5 Error and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Over-computing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Multicore Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1 1 5 5 8 9 10 10 11 12 12 18 20 21

2

Solving A Nonlinear Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 The Problem to Solve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Convergence Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Picking x0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Order of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.5 Convergence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Convergence Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31 32 33 36 39 42 43 45 46 48 50 53

xi

xii

Contents

2.5 3

4

Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Is Newton’s Method Really Newton’s Method? . . . . . . .

54 54

Matrix Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Finding L and U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 What Matrices Have a LU Factorization? . . . . . . . . . . . . . 3.3 Implementing a LU Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Pivoting Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 LU and Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . 3.4 LU Method: Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Flops and Multicore Processors . . . . . . . . . . . . . . . . . . . . . 3.5 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.1 Matrix Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 Error and Residual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.1 Correct Significant Digits . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.2 The Condition Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6.3 A Heuristic for the Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7 Positive Definite Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.7.1 Cholesky Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.8 Tri-Diagonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9 Sparse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.10 Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11 Some Additional Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.1 Yogi Berra and Perturbation Theory . . . . 
. . . . . . . . . . . . . 3.11.2 Fixing an Ill-Conditioned Matrix . . . . . . . . . . . . . . . . . . . . 3.11.3 Insightful Observations about the Condition Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.4 Faster than LU? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.11.5 Historical Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . .

73 73 76 77 78 80 80 82 87 88 90 93 94 94 99 100 103 105 107 108 111 111 112

Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Review of Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Extensions of the Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Inverse Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.2 Rayleigh Quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Calculating Multiple Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.1 Orthogonal Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.2 Regular and Modified Gram-Schmidt . . . . . . . . . . . . . . . . 4.4.3 QR Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.4 The QR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.5 QR Method versus Orthogonal Iteration . . . . . . . . . . . . . . 4.4.6 Are the Computed Values Correct? . . . . . . . . . . . . . . . . . .

129 129 134 138 141 143 145 147 148 151 154 155 157 158

112 113 114

Contents

4.5

xiii

Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Natural Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.2 Graphs and Adjacency Matrices . . . . . . . . . . . . . . . . . . . . 4.5.3 Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.1 Derivation of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.2 Interpretations of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.3 Summary of the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.4 Consequences of a SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.5 Computing a SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6.6 Low-Rank Approximations . . . . . . . . . . . . . . . . . . . . . . . . 4.6.7 Application: Image Compression . . . . . . . . . . . . . . . . . . . .

161 161 163 165 169 170 173 175 177 179 179 180

5

Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Information from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Direct Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Lagrange Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.3 Runge’s Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Piecewise Cubic Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Cubic B-splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.1 Global Polynomial Interpolation . . . . . . . . . . . . . . . . . . . . 5.5.2 Piecewise Linear Interpolation . . . . . . . . . . . . . . . . . . . . . . 5.5.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.4 Chebyshev Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.5 Chebyshev versus Cubic Splines . . . . . . . . . . . . . . . . . . . . 5.5.6 Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Questions and Additional Comments . . . . . . . . . . . . . . . . . . . . . . . .

205 205 207 207 209 212 213 217 220 226 226 228 230 232 237 239 239

6

Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 The Definition from Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.1 Midpoint Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Methods Based on Polynomial Interpolation . . . . . . . . . . . . . . . . . 6.2.1 Trapezoidal Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 Simpson’s Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.3 Cubic Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.4 Other Interpolation Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Romberg Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3.1 Computing Using Romberg . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Methods for Discrete Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5 Methods Based on Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.1 1-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.2 2-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

255 256 258 261 261 265 268 271 272 273 274 277 278 279

4.6

xiv

Contents

6.5.3 m-Point Gaussian Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.4 Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.5 Parting Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adaptive Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Other Ideas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

281 281 282 283 287 287

7

Initial Value Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Solution of an IVP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.1 Higher Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 IVP Methods Using Numerical Differentiation . . . . . . . . . . . . . . . 7.3.1 The Five Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Additional Difference Methods . . . . . . . . . . . . . . . . . . . . . 7.3.3 Error and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.4 Computing the Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4 IVP Methods Using Numerical Integration . . . . . . . . . . . . . . . . . . . 7.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 RK2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.2 RK4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.3 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.4 RK-n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6 Solving Systems of IVPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.2 Simple Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.6.3 Component Approach and Symplectic Methods . . . . . . . 7.7 Some Additional Questions and Ideas . . . . . . . . . . . . . . . . . . . . . . . 7.7.1 RK4 Order Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

303 303 305 309 310 311 311 318 321 322 324 327 327 330 333 333 335 335 336 339 342 345

8 Optimization: Regression
8.1 Model Function
8.2 Error Function
8.2.1 Vertical Distance
8.3 Linear Least Squares
8.3.1 Two Parameters
8.3.2 General Case
8.3.3 QR Approach
8.3.4 Moore-Penrose Pseudo-Inverse
8.3.5 Overdetermined System
8.3.6 Regularization
8.3.7 Over-fitting the Data
8.4 QR Factorization
8.4.1 Finding Q and R
8.4.2 Accuracy
8.4.3 Parting Comments

6.6 6.7 6.8

8.5 Other Error Functions
8.6 Nonlinear Regression
8.7 Fitting IVPs to Data
8.7.1 Logistic Equation
8.7.2 FitzHugh-Nagumo Equations
8.7.3 Parting Comments

9 Optimization: Descent Methods
9.1 Introduction
9.2 Descent Methods
9.2.1 Descent Directions
9.3 Solving Linear Systems
9.3.1 Basic Descent Algorithm for Ax = b
9.3.2 Method of Steepest Descents for Ax = b
9.3.3 Conjugate Gradient Method for Ax = b
9.3.4 Large Sparse Matrices
9.3.5 Rate of Convergence
9.3.6 Preconditioning
9.3.7 Parting Comments
9.4 General Nonlinear Problem
9.4.1 Descent Direction
9.4.2 Line Search Problem
9.4.3 Examples
9.5 Levenberg-Marquardt Method
9.5.1 Parting Comments
9.6 Minimization Without Differentiation
9.7 Variational Problems
9.7.1 Example: Minimum Potential Energy
9.7.2 Example: Brachistochrone Problem
9.7.3 Parting Comments
10 Data Analysis
10.1 Introduction
10.2 Principal Component Analysis
10.2.1 Example: Word Length
10.2.2 Derivation of Principal Component Approximation
10.2.3 Summary of the PCA
10.2.4 Scaling Factors
10.2.5 Geometry and Data Approximation
10.2.6 Application: Crime Data
10.2.7 Practical Issues
10.2.8 Parting Comments
10.3 Independent Component Analysis
10.3.1 Derivation of the Method
10.3.2 Reduced Problem
10.3.3 Contrast Function
10.3.4 Summary of ICA
10.3.5 Application: Image Steganography
10.4 Data Based Modeling
10.4.1 Dynamic Modes
10.4.2 Example: COVID-19
10.4.3 Propagation Paths
10.4.4 Parting Comments

A Taylor's Theorem
A.1 Useful Examples for x Near Zero
A.2 Order Symbol and Truncation Error
B Vector and Matrix Summary
References
Index

Chapter 1

Introduction to Scientific Computing

This chapter provides a brief introduction to the floating-point number system used in most scientific and engineering applications. A few examples are given in the next section illustrating some of the challenges of using finite precision arithmetic, but it is worth quoting Donald Knuth to get things started. If you are unfamiliar with him, he was instrumental in the development of the analysis of algorithms, and is the creator of TeX. Anyway, here are the relevant quotes [Knuth, 1997]: “We don’t know how much of the computer’s answers to believe. Novice computer users solve this problem by implicitly trusting in the computer as an infallible authority; they tend to believe that all digits of a printed answer are significant. Disillusioned computer users have just the opposite approach; they are constantly afraid that their answers are almost meaningless”. “every well-rounded programmer ought to have a knowledge of what goes on during the elementary steps of floating-point arithmetic. This subject is not at all as trivial as most people think, and it involves a surprising amount of interesting information”. One of the objectives in what follows is to help keep you from becoming disillusioned by identifying where problems can occur, and to also provide an appreciation for the difficulty of floating-point computation.

1.1 Unexpected Results

What follows are examples where the computed results are not what is expected. The reason for the problem is the same for each example. Namely, the finite precision arithmetic used by the computer generates errors that are significant enough that they affect the final result. The calculations to follow use double-precision arithmetic, and this is explained in Section 1.2.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_1


Example 1.1 Consider adding a series of numbers from largest to smallest

$$ S(n) = 1 + \frac{1}{2} + \cdots + \frac{1}{n-1} + \frac{1}{n}, \tag{1.1} $$

and the same series added from smallest to largest

$$ s(n) = \frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{2} + 1. \tag{1.2} $$

If one calculates s(n) and S(n), and then calculates the difference S(n) − s(n), the values given in Table 1.1 are obtained. It is evident that for larger values of n, the two sums differ. In other words, using a computer, addition does not necessarily satisfy one of the fundamental properties of arithmetic, which is that addition is associative. So, given that both s(n) and S(n) are likely incorrect, a question relevant to this text is, is it possible to determine which one is closer to the exact result?

Example 1.2 Consider the function

$$ y = (x - 1)^8. \tag{1.3} $$

Expanding this you get the polynomial

$$ y = x^8 - 8x^7 + 28x^6 - 56x^5 + 70x^4 - 56x^3 + 28x^2 - 8x + 1. \tag{1.4} $$

The expressions in (1.4) and (1.3) are equal and, given a value of x, either should be able to be used to evaluate the function. However, when evaluating them with a computer they do not necessarily produce the same values and this is shown in Figure 1.1. In the upper graph, where $0.9 \le x \le 1.1$, they do appear to agree. However, looking at the plots over the somewhat smaller interval $0.98 \le x \le 1.02$, which is given in the lower graph, the two expressions definitely do not agree. The situation is even worse than the fact that the graphs differ. First, according to (1.3), y is never negative but according to the computer (1.4) violates this condition. Second, according to (1.3), y is symmetric about x = 1 but the computer claims (1.4) is not.

| n | S(n) − s(n) |
|---|---|
| 10 | 0 |
| 100 | −8.88e−16 |
| 1,000 | 2.66e−15 |
| 10,000 | −3.73e−14 |
| 100,000 | −7.28e−14 |
| 1,000,000 | −7.83e−13 |

Table 1.1 Difference in partial sums for the harmonic series considered in Example 1.1. Note that 8.9e−16 = $8.9 \times 10^{-16}$.

Figure 1.1 Plots of (1.4) and (1.3). Upper graph: the interval is $0.9 \le x \le 1.1$, and the two functions are so close that the curves are indistinguishable. Lower graph: The interval is $0.98 \le x \le 1.02$, and now they are not so close.

Example 1.3 As a third example, consider the function

$$ y = \frac{\sqrt{16 + k} - 4}{k}. \tag{1.5} $$

This is plotted in Figure 1.2. According to l'Hospital's rule

$$ \lim_{k \to 0} y = \frac{1}{8}. $$

The computer agrees with this result for k down to about $10^{-12}$ but for smaller values of k there is a problem. First, the function starts to oscillate and then, after k drops a little below $10^{-14}$, the computer decides that y = 0. It is also worth pointing out that ratios as in (1.5) arise in formulas for the numerical evaluation of derivatives, and this is the subject of Section 7.2. In particular, (1.5) comes from an approximation used to evaluate $f'(0)$, where $f(x) = \sqrt{16 + x}$. Finally, it needs to be mentioned that the curve shown in the figure contains a plotting error, and this is corrected in Figure 1.8.

Figure 1.2 Plot of (1.5).

Example 1.4 The final example concerns evaluating trigonometric functions, specifically sin(x). As you should recall, if n is an integer, then $\sin(\frac{\pi}{4} + 2\pi n) = \frac{1}{2}\sqrt{2}$. Using double precision one finds that if $n = 2^{52} + 2$ then $\sin(\frac{\pi}{4} + 2\pi n) \approx -1$, while if $n = 2^{52} + 3$, then $\sin(\frac{\pi}{4} + 2\pi n) \approx 0$. Clearly, neither is close to the correct value. To investigate this further, the relative error

$$ \left| \frac{\sin(\frac{\pi}{4} + 2\pi n) - \frac{1}{2}\sqrt{2}}{\frac{1}{2}\sqrt{2}} \right| \tag{1.6} $$

is plotted in Figure 1.3 as a function of n. It shows that overall the error increases with n, eventually getting close to one. It also means, for example, that if you want to evaluate sin(x) correct to, say, 12 digits then you will need to require $|x| \le 15{,}000$.

Figure 1.3 The relative error (1.6) when computing $y = \sin(\frac{\pi}{4} + 2\pi n)$.
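The behavior described in Examples 1.3 and 1.4 can be reproduced with a few lines of Python. What follows is a minimal sketch, assuming standard double-precision floats; the function names `ratio` and `sin_relative_error` are illustrative and do not come from the text. The digits printed for the larger values of n may differ between systems and math libraries.

```python
import math

def ratio(k):
    """Evaluate (1.5), y = (sqrt(16 + k) - 4) / k, in double precision."""
    return (math.sqrt(16.0 + k) - 4.0) / k

def sin_relative_error(n):
    """Relative error (1.6) of sin(pi/4 + 2*pi*n) against the value sqrt(2)/2."""
    exact = math.sqrt(2.0) / 2.0
    computed = math.sin(math.pi / 4.0 + 2.0 * math.pi * n)
    return abs(computed - exact) / exact

if __name__ == "__main__":
    # Example 1.3: the limit as k -> 0 is 1/8, but rounding takes over for small k.
    for k in (1e-2, 1e-6, 1e-10, 1e-13, 1e-15):
        print(f"k = {k:.0e}   y = {ratio(k):.16f}")

    # Example 1.4: the relative error grows as n increases.
    for n in (1, 10**3, 10**6, 10**9, 10**12, 2**52 + 2):
        print(f"n = {n:<20d} relative error = {sin_relative_error(n):.2e}")
```

For k = 1e-15 the sum 16 + k is rounded to 16, so the numerator is exactly zero, which is the y = 0 behavior described in Example 1.3.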


1.2 Floating-Point Number System

The problems illustrated in the above examples are minor compared to the difficulties that arose in the early days of computing. It was not unusual to get irreproducible results, in the sense that two different computers would calculate different answers to the same formula. To help prevent this, a set of standards was established that computer manufacturers were expected to comply with. The one of interest here concerns the floating-point system, and it is based on the IEEE-754 standard set in 1985. It consists of normal floats (described below), along with zero, ±Inf, and NaN.

1.2.1 Normal Floats

Normal (or normalized) floating-point numbers are real numbers that the computer has the exact value for. The form they are written in is determined by the binary nature of computer systems. Specifically, they have the form

$$ x_f = (\pm)\, m \times 2^E, \tag{1.7} $$

where

$$ m = 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}. \tag{1.8} $$

In this representation m, E, and the $b_i$'s have the following properties:

• m: This is the mantissa. The $b_i$'s are either zero or one, and for this reason $1 \le m < 2$. In fact, the largest value of the mantissa is $m = 2 - \varepsilon$, where $\varepsilon = 1/2^{N-1}$ (see Exercise 1.32). The number ε is known as machine epsilon and it plays a critical role in determining the spacing of floating-point numbers.
• E: This is the exponent and it is an integer that satisfies $E_m \le E \le E_M$. For double precision, $-1022 \le E \le 1023$. In general, according to the IEEE requirements, $E_m = -E_M + 1$ and $E_M = 2^{M-1} - 1$, where M is a positive integer.

As defined, a floating-point system requires specification of the two integers N and M, and some of the standard choices are listed in Table 1.2. The one of particular importance for scientific computing is double precision, for which N = 53 and M = 11. Other formats are used, and they are usually connected to specific applications. An example is bfloat16, which was developed by the Google Brain project for their machine learning algorithms. These numbers have the same storage requirements as half-precision floating-point numbers but have N = M = 8. This means the exponents for bfloat16 cover the same range as single-precision floating-point numbers but with fewer mantissa values.


| Precision | N | M | $E_m$ | $E_M$ | $x_m$ | $x_M$ | ε | Decimal Digits |
|---|---|---|---|---|---|---|---|---|
| Half | 11 | 5 | −14 | 15 | $6 \times 10^{-5}$ | $7 \times 10^{4}$ | $10^{-3}$ | 3 |
| Single | 24 | 8 | −126 | 127 | $10^{-38}$ | $3 \times 10^{38}$ | $10^{-7}$ | 7 |
| Double | 53 | 11 | −1022 | 1023 | $2 \times 10^{-308}$ | $2 \times 10^{308}$ | $2 \times 10^{-16}$ | 16 |
| Quad | 113 | 15 | −16382 | 16383 | $3 \times 10^{-4932}$ | $10^{4932}$ | $10^{-34}$ | 34 |

Table 1.2 Values for various binary normalized floating-point systems specified by IEEE-754 and its extensions. The values for the smallest positive $x_m$, largest positive $x_M$, and machine epsilon ε are given to one significant digit. Similarly, the number of decimal digits is approximate and is determined from the expression $N \log_{10} 2$.
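The entries in Table 1.2 follow directly from N and M using the formulas given above. As a rough check, the sketch below computes them and compares the double-precision values with what Python reports in `sys.float_info`. The function name `float_system` is an illustrative choice, and the sketch is limited to formats whose values fit in a double, so quad precision is left out.

```python
import math
import sys

def float_system(N, M):
    """Return (E_m, E_M, eps, x_m, x_M, digits) for a binary format with parameters N and M."""
    E_M = 2 ** (M - 1) - 1           # largest exponent
    E_m = -E_M + 1                   # smallest exponent for normal floats
    eps = 2.0 ** (1 - N)             # machine epsilon, 1 / 2^(N-1)
    x_m = 2.0 ** E_m                 # smallest positive normal float
    x_M = (2.0 - eps) * 2.0 ** E_M   # largest positive normal float
    digits = N * math.log10(2.0)     # approximate number of decimal digits
    return E_m, E_M, eps, x_m, x_M, digits

if __name__ == "__main__":
    for name, N, M in (("Half", 11, 5), ("Single", 24, 8), ("Double", 53, 11)):
        E_m, E_M, eps, x_m, x_M, digits = float_system(N, M)
        print(f"{name:6s} E_m={E_m:6d} E_M={E_M:5d} x_m={x_m:.0e} "
              f"x_M={x_M:.0e} eps={eps:.0e} digits={digits:.1f}")

    # Python floats are IEEE-754 doubles on essentially all current platforms.
    print(sys.float_info.epsilon, sys.float_info.min, sys.float_info.max)
```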

Example 1.5

1. Is x = 3 a floating-point number? Answer: Yes, because $3 = 2 + 1 = (1 + \frac{1}{2}) \times 2$. So, E = 1, $b_1 = 1$, and the other $b_i$'s are zero.
2. Is x = −10 a floating-point number? Answer: Yes, because $-10 = -8 - 2 = -(1 + \frac{1}{2^2}) \times 2^3$. In this case, E = 3, $b_2 = 1$, and the other $b_i$'s are zero.
3. Examples of numbers that are not floating-point numbers are $\frac{1}{10}$, $\sqrt{2}$, and π.
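The questions in Example 1.5 can also be checked numerically. The following sketch assumes double precision and uses two hypothetical helpers: `decompose`, which recovers the sign, mantissa m, and exponent E in (1.7), and `is_exact`, which tests whether a rational number is stored without rounding. Irrational numbers such as $\sqrt{2}$ and π cannot be tested this way, but they are never floats since every float is a rational number.

```python
import math
from fractions import Fraction

def decompose(x):
    """Return (sign, m, E) with x = sign * m * 2**E and 1 <= m < 2, for nonzero finite x."""
    frac, exp = math.frexp(abs(x))   # abs(x) = frac * 2**exp with 0.5 <= frac < 1
    sign = -1 if x < 0 else 1
    return sign, 2.0 * frac, exp - 1

def is_exact(numerator, denominator=1):
    """True if the rational numerator/denominator is stored exactly as a double."""
    stored = Fraction(numerator / denominator)   # exact value of the rounded float
    return stored == Fraction(numerator, denominator)

if __name__ == "__main__":
    print(decompose(3.0))      # m = 1 + 1/2,   E = 1
    print(decompose(-10.0))    # m = 1 + 1/2^2, E = 3
    print(is_exact(3), is_exact(-10), is_exact(1, 10))
```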

Example: Floats in $1 \le x < 2$

In this case, E = 0 and (1.7) takes the form

$$ x_f = 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}. \tag{1.9} $$

The endpoint x = 1 is obtained when all of the $b_i$'s are zero. The next largest float occurs when $b_{N-1} = 1$ and all of the other $b_i$'s are zero. In other words, the next largest float is $x_f = 1 + \varepsilon$, where $\varepsilon = 1/2^{N-1}$. As stated earlier, the number ε is known as machine epsilon. In the case of double precision,

$$ \varepsilon = 2^{-52} \approx 2.2 \times 10^{-16}. \tag{1.10} $$


This number should be memorized as it will appear frequently throughout this textbook. The next largest float occurs when $b_{N-2} = 1$ and all of the other $b_i$'s are zero, which gives us $x_f = 1 + 1/2^{N-2} = 1 + 2\varepsilon$. This pattern continues, and one ends up concluding that the floats in $1 \le x < 2$ are equally spaced, a distance ε apart. In fact, it is possible to show that the largest value of (1.9) is a distance ε from 2 (see Exercise 1.32(d)). Therefore, the floating-point numbers in $1 \le x \le 2$ are equally spaced a distance ε apart.

Example: Floats in $2^n \le x \le 2^{n+1}$

The floats in the interval $2^n \le x < 2^{n+1}$ have the form

$$ x_f = \left( 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}} \right) \times 2^n. $$

So, looking at (1.9) one concludes that the floats in this interval are those in $1 \le x < 2$, but just multiplied by $2^n$. This means the floats in the closed interval $2^n \le x \le 2^{n+1}$ are evenly spaced but now the spacing is $2^n \varepsilon$, as illustrated in Figure 1.4.

A conclusion coming from Figure 1.4 is that for large values of n the distance between the floats can be huge. For example, take n = 100. For double precision, between $2^{100} \approx 10^{30}$ and $2^{101} \approx 2 \times 10^{30}$ they are a distance of $\varepsilon \times 2^{100} \approx 2.8 \times 10^{14}$ apart. The fact that they are so far apart can cause problems, and a particular example of this is considered in Section 1.2.6.

Integers

The smallest positive integer that is not included in the floating-point system is $2^N + 1$. In other words, all nonzero integers satisfying $|x| \le 2^N$ are normalized floating-point numbers. An important question is whether the computer recognizes them as integers. This is needed in programming because integers are used as counters in a for loop as well as the indices of a vector or matrix.

Figure 1.4 The floating-point numbers in the interval $2^n \le x \le 2^{n+1}$ are equally spaced, being a distance $2^n \varepsilon$ apart. So, they are a distance ε/2 apart for $2^{-1} \le x \le 1$, ε apart for $1 \le x \le 2$, and 2ε apart for $2 \le x \le 2^2$.


Figure 1.5 The floating-point numbers just to the left and right of x = 1. The red dashed lines are located halfway between the floats, and any real number between them is rounded to the floating-point number in that subinterval.

Most computer systems have a way to treat integers as integers, where addition and subtraction are done exactly as long as the integers are not too big. For example, for languages such as C and FORTRAN you use a type declaration at the beginning of the program to identify a variable as an integer. Languages such as MATLAB, Python, and Julia use dynamic typing, which means they determine the variable type on the fly. This is more convenient, but it does add to the computing time.

One last comment concerns the biggest and smallest positive normal floats. In particular, for double precision, $x_M = (2 - \varepsilon) \times 2^{1023} \approx 2 \times 10^{308}$ is the largest positive normal float, and $x_m = 2^{-1022} \approx 2 \times 10^{-308}$ is the smallest positive normal float.
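The spacing of the floats, and the integer limit $2^N$, can be checked directly in double precision. The sketch below assumes Python 3.9 or later, where `math.ulp(x)` returns the gap between x and the next larger float; the choice of test values is illustrative.

```python
import math
import sys

eps = sys.float_info.epsilon    # machine epsilon, 2**-52 for double precision
N = sys.float_info.mant_dig     # N = 53 for double precision

# The floats in 2**n <= x < 2**(n+1) are a distance 2**n * eps apart.
for n in (-1, 0, 1, 100):
    x = 2.0 ** n
    print(f"n = {n:4d}   spacing = {math.ulp(x):.3e}   2^n * eps = {x * eps:.3e}")

# Integers with |x| <= 2**N are floats, but 2**N + 1 is not.
print(float(2 ** N) == 2 ** N)            # stored exactly
print(float(2 ** N + 1) == 2 ** N + 1)    # rounded to a neighboring float

# Largest and smallest positive normal floats.
print(sys.float_info.max, sys.float_info.min)
```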

1.2.2 Rounding

Assuming x is a real number satisfying $x_m \le |x| \le x_M$ then in the computer this is rounded to a normal float $x_f$, and the relative error satisfies

$$ \frac{|x - x_f|}{|x|} \le \frac{\varepsilon}{2}. $$

To do this it uses a “round-to-nearest” rule, which means $x_f$ is the closest float to x. To illustrate, the floats just to the left of x = 1 are a distance $\frac{1}{2}\varepsilon$ apart, and those just to the right are a distance ε apart. This is shown in Figure 1.5. So, any number in Region II, which corresponds to $1 - \frac{1}{4}\varepsilon < x < 1 + \frac{1}{2}\varepsilon$, is rounded to $x_f = 1$. Similarly, any number in Region III, which corresponds to $1 + \frac{1}{2}\varepsilon < x < 1 + \frac{3}{2}\varepsilon$, is rounded to $x_f = 1 + \varepsilon$. In the case of a tie, a “round-to-even” rule is used, where the nearest float with an even least significant digit is used.

Example 1.6 Evaluate $\left(1 + \frac{1}{4}\varepsilon\right) - 1$ using floating-point arithmetic.

Answer: Since $1 < 1 + \frac{1}{4}\varepsilon < 1 + \frac{1}{2}\varepsilon$, then $1 + \frac{1}{4}\varepsilon$ is rounded to 1 (see Figure 1.5). So, the answer is zero.
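The rounding regions around x = 1, including Example 1.6 and the tie-breaking rule, can be explored as follows. This is a small sketch assuming double precision with the default round-to-nearest mode; the helper name `round_near_one` is illustrative.

```python
import sys

eps = sys.float_info.epsilon    # machine epsilon, 2**-52

def round_near_one(delta):
    """Return fl(1 + delta) - 1, the offset from 1 of the float that 1 + delta is rounded to."""
    return (1.0 + delta) - 1.0

if __name__ == "__main__":
    cases = [
        ("1 + eps/4   (Region II, Example 1.6)", eps / 4),
        ("1 - eps/8   (Region II)", -eps / 8),
        ("1 + 3eps/4  (Region III)", 0.75 * eps),
        ("1 + eps/2   (tie between 1 and 1 + eps)", eps / 2),
        ("1 + 3eps/2  (tie between 1 + eps and 1 + 2eps)", 1.5 * eps),
    ]
    for label, delta in cases:
        print(f"{label:50s} offset / eps = {round_near_one(delta) / eps}")
```

In the two tie cases the result is the neighbor whose last mantissa bit is zero, which is 1 in the first case and 1 + 2ε in the second.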


1.2.3 Non-Normal Floats

The floating-point number system is required to provide a floating-point number to any algebraic expression of the form $x_f \pm y_f$, $x_f * y_f$, or $x_f / y_f$. So, in addition to the normal floats, the following are also included.

Zero

It is not possible to represent zero using (1.7), and so it must be included as a special case. Correspondingly, there is an interval $-x_0 < x < x_0$ where any number in this interval is rounded to $x_f = 0$. If subnormals are not used (these are explained below), then $x_0 = x_m/2$, and if they are used, then $x_0 = x_{tiny}/2$.

Inf and NaN

Positive numbers larger than $x_M$ are either rounded to $x_M$, if close enough, or assigned the value Inf. The latter is a situation known as positive overflow. A similar situation occurs for very negative numbers, something called negative overflow and it produces the value −Inf. For those situations when the calculated value is ill-defined, such as 0/0, the floating-point system assigns it a value of NaN (Not a Number). Because ±Inf and NaN are floating-point numbers you can use them in mathematical expressions. For example, the computer will be able to evaluate 4 ∗ Inf, sin(Inf), $e^{\text{Inf}}$, Inf ∗ NaN, etc. You can probably guess what they evaluate to, but if not then try them using MATLAB, Julia, or Python.

Subnormals

These are a relatively small set of numbers from $x_{tiny} = 2^{E_m - N + 1}$ up to $x_m$ that are included to help alleviate the problem of rounding a nonzero number to zero (producing what is called gradual underflow). For example, using double precision, $x_{tiny} = 2^{-1074} \approx 5 \times 10^{-324}$. Because they are often implemented in software, computations with subnormals can take much longer than with normal floats and this causes problems for timing in multicore processors. There is also a significant loss of precision with subnormals. For these, and other, reasons it is not uncommon in scientific computing to use a flush-to-zero option to prevent the system from using subnormals.
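For those who want to try these special values, here is a short sketch in Python, assuming double precision. One caveat: the standard `math` module is stricter than the IEEE rules, so `math.sin(inf)` raises a `ValueError` and `0.0 / 0.0` raises a `ZeroDivisionError` instead of returning NaN. For this reason NaN is generated below as Inf − Inf.

```python
import math
import sys

inf = float("inf")
nan = inf - inf                          # an ill-defined result produces NaN

x_M = sys.float_info.max                 # largest positive normal float
x_m = sys.float_info.min                 # smallest positive normal float, 2**-1022
x_tiny = x_m * sys.float_info.epsilon    # smallest subnormal, 2**(E_m - N + 1) = 2**-1074

print(x_M * 2.0, -x_M * 2.0)             # positive and negative overflow
print(4 * inf, math.exp(inf), inf * nan)
print(nan == nan, math.isnan(nan))       # NaN is not equal to anything, itself included

try:
    math.sin(inf)
except ValueError as exc:
    print("math.sin(inf) raised:", exc)

print(x_tiny, x_tiny / 2.0)              # gradual underflow ends at x_tiny
```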


1.2.4 Significance

Working with numbers that have a fixed number of digits has the potential to cause complications that you should be aware of. For example, even though floating-point addition and multiplication are commutative, they are not necessarily associative or distributive. Example 1.1 is an illustration of the non-associativity of addition.

Another complication involves significance. To perform addition, a computer must adjust the exponents before adding. For example, suppose the base is fixed to have 4 digits. So, letting → denote the exponent adjusting step,

$$ 1.234 \times 10^{4} + 1.234 \times 10^{2} \to 1.234 \times 10^{4} + 0.012 \times 10^{4} = 1.246 \times 10^{4}. $$

This also means that

$$ 1.234 \times 10^{4} + 1.234 \times 10^{-6} \to 1.234 \times 10^{4} + 0.000 \times 10^{4} = 1.234 \times 10^{4}. $$

In this example, $1.234 \times 10^{-6}$ is insignificant relative to $1.234 \times 10^{4}$ and contributes nothing to the sum. This is why, using double precision, a computer will claim that $(10^{20} + 1) - 10^{20} = 0$.

For subtraction the potential complication is the loss of significance, which can arise when the two numbers are very close. For example, assuming that the base is fixed to have 4 digits, then $(1.2344 - 1.2342)/0.002 \to (1.234 - 1.234)/0.002 = 0$, whereas the exact value is 0.1.

Awareness of the complications mentioned above is important, but it is outside the purview of this text to explore them in depth. For those interested in more detail related to floating-point arithmetic, they should consult Goldberg [1991], Overton [2001], or Muller et al. [2010].
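The four-digit examples above can be imitated with Python's `decimal` module by setting the working precision to four significant digits. This is only a sketch of the idea: `decimal` rounds to nearest, which happens to agree with the digits shown in the examples, and the function name `four_digit_examples` is illustrative.

```python
from decimal import Decimal, localcontext

def four_digit_examples():
    """Repeat the addition and subtraction examples using 4 significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = 4
        a = Decimal("1.234E4")
        sum_1 = a + Decimal("1.234E2")      # digits of the smaller term are dropped
        sum_2 = a + Decimal("1.234E-6")     # the smaller term is insignificant
        x = +Decimal("1.2344")              # unary plus rounds to 4 digits
        y = +Decimal("1.2342")
        quotient = (x - y) / Decimal("0.002")   # loss of significance
    return sum_1, sum_2, quotient

if __name__ == "__main__":
    print(four_digit_examples())

    # The same effects in double precision.
    print((1e20 + 1.0) - 1e20)
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # addition is not associative
```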

1.2.5 Flops

All numerical algorithms are judged by their accuracy and how long it takes to compute the answer. As one estimate of the time, an old favorite is to determine the flop count, where flop is an acronym for floating-point operation. To use this, it is necessary to have an appreciation of how long various operations take. In principle these are easy to determine. As an example, to determine the computing time for an addition one just writes a code where this is done N times, where N is a large integer, and then divides the total computing time by N. The outcomes of such tests are shown in Table 1.3, where the times are scaled by how long it takes to do an addition. Note that the actual times here are very short, with an addition taking approximately 2 nsec for MATLAB, FORTRAN, and C/C++, and 10 nsec using Python.


| Operation | MATLAB | Python | FORTRAN | C/C++ |
|---|---|---|---|---|
| Addition or Subtraction | 1 | 1 | 1 | 1 |
| Multiplication | 1 | 1 | 1 | 1 |
| Division | 2 | 1 | 2 | 2 |
| $\sqrt{x}$ | 6 | 3 | 3 | 12 |
| $\sin x$ | 7 | 3 | 5 | 8 |
| $\ln x$ | 14 | 8 | 7 | 15 |
| $e^x$ | 12 | 3 | 8 | 8 |
| $x^n$, for n = 5, 10, or 20 | 35 | 5 | 3 | 24 |

Table 1.3 Approximate relative computing times for various floating-point operations in MATLAB (R2022a), Python (v3.11), FORTRAN (gfortran v12.2), and C/C++ (clang v14.0). Note that each column is normalized by the time it takes that language to do an addition.

Because of this, even though x = 3/2 might take twice as long to compute as x = 0.5 ∗ 3, it’s really not necessary to worry about this (at least in the problems considered in this text).

One of the principal reasons why Python and MATLAB differ in execution times from FORTRAN, even though the same processor is used, involves branching. They, like Julia and R, use dynamic typing. This means they check on the type of numbers being used and then branch to the appropriate algorithm. This way they are able to carry out integer arithmetic exactly. In comparison, FORTRAN, like C and C++, is statically typed. This means the typing is done at the start of the program, and so there is no slowdown when the commands are issued. This is one of the reasons most large-scale scientific computing codes are written in a statically typed language.
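The timing procedure described in this section, repeating an operation many times and dividing by the number of repetitions, can be sketched with Python's standard `timeit` module. The helper names below are illustrative, and how the Python column of Table 1.3 was actually measured is not stated here. In plain Python much of the measured time is interpreter overhead, so the ratios obtained this way should not be expected to match the table.

```python
import timeit

def time_per_operation(stmt, setup="import math; x = 1.2345; y = 6.789",
                       repeats=5, number=1_000_000):
    """Estimate the time, in seconds, for one execution of stmt."""
    best = min(timeit.repeat(stmt, setup=setup, repeat=repeats, number=number))
    return best / number

def relative_times(number=1_000_000):
    """Times for several operations, scaled by the time for an addition."""
    operations = {
        "addition": "x + y",
        "multiplication": "x * y",
        "division": "x / y",
        "sqrt": "math.sqrt(x)",
        "sin": "math.sin(x)",
        "ln": "math.log(x)",
        "exp": "math.exp(x)",
    }
    t_add = time_per_operation(operations["addition"], number=number)
    return {name: time_per_operation(stmt, number=number) / t_add
            for name, stmt in operations.items()}

if __name__ == "__main__":
    for name, ratio in relative_times().items():
        print(f"{name:15s} {ratio:5.1f}")
```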

1.2.6 Functions

Any computer system designed for scientific computing has routines to evaluate well-known or often used functions. This includes elementary functions like $\sqrt{x}$; transcendental functions like $\sin(x)$, $e^x$, and $\ln(x)$; and special functions like $\text{erf}(x)$ and $J_\nu(x)$. To discuss how these fit into a floating-point system, these will be written in the generic form of y = f(x). The ideal goal is that, letting $y_f$ denote the computed value and assuming that $x_m \le |y| \le x_M$,

$$ \frac{|y - y_f|}{|y|} \le \frac{\varepsilon}{2}. $$

Whether or not this happens depends on the function and the value of x.


Oscillatory functions are some of the more difficult to evaluate accurately, and an illustration of this is given in Figure 1.3. What this shows is that if you want an error commensurate with the accuracy associated with double precision you should probably restrict the domain to $|x| < 10^{4}$. In this case the relative error is no more than about $10^{-12}$. However, if you are willing to have the value correct to no more than 4 significant digits then the domain expands to about $|x| < 10^{12}$. One of the reasons for the difficulty evaluating this function is that the distance between the floating-point numbers increases with x and eventually gets much larger than the period 2π. When this occurs the computed value is meaningless. Fortunately, the situation for monotonic functions, such as $\exp(x)$, $\sqrt{x}$, and $\ln(x)$, is much better and we can expect the usable domains to be much larger. Those who might want to investigate some of the challenges related to accurate function evaluation should consult Muller [2016].

1.3 Arbitrary-Precision Arithmetic

Some applications, such as cryptography, require exact manipulation of extremely large integers. Because of their length, these integers are not representable using double, or even quadruple, precision. This has given rise to the idea of arbitrary-precision arithmetic, where the limitation is determined by the available memory for the computer. The price paid for this is that the computations are slower, with the computing time increasing fairly quickly as the size of the integers is increased.

As an example of the type of problem arbitrary-precision arithmetic is used for, there is the Great Internet Mersenne Prime Search (GIMPS). A Mersenne prime has the form $2^n - 1$, and considerable computing resources have been invested into finding them. The largest one currently known, which took 39 days to compute, has n = 82,589,933, which results in a prime number with 24,862,048 digits [GIMPS, 2022]. Just printing this number, with 3,400 digits per page, would take more than twelve times the pages in this text.

There are multiple computational challenges finding large prime numbers. Those interested in the computational, and theoretical, underpinnings of computing primes should consult Crandall and Pomerance [2010].
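Python's built-in integers use arbitrary-precision arithmetic, which makes it easy to see the difference from floating-point numbers. The sketch below is only an illustration: it compares exact integer arithmetic with doubles, and it counts the decimal digits of $2^n - 1$ using the formula $\lfloor n \log_{10} 2 \rfloor + 1$, which is a standard fact and not something taken from the text. The function name `mersenne_digit_count` is hypothetical.

```python
import math

def mersenne_digit_count(n):
    """Number of decimal digits of 2**n - 1, computed without forming the number."""
    return math.floor(n * math.log10(2)) + 1

if __name__ == "__main__":
    # Exact integer arithmetic versus double precision.
    print((10**20 + 1) - 10**20)        # integers: the 1 is not lost
    print((1e20 + 1.0) - 1e20)          # floats: the 1 is insignificant

    # A smaller Mersenne prime, 2**127 - 1, held exactly as an integer.
    m127 = 2**127 - 1
    print(m127, len(str(m127)), mersenne_digit_count(127))

    # Digit count for the exponent quoted from GIMPS.
    print(mersenne_digit_count(82_589_933))
```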

1.4 Explaining, and Possibly Fixing, the Unexpected Results

The problem identified in Example 1.4 was discussed in Section 1.2.6. What follows is a discussion related to the other examples that were presented in Section 1.1.


Example 1.1 (cont'd) The differences in the two sums are not unexpected when using double-precision arithmetic. Also, the order of the error is consistent with the accuracy obtained for double precision. The question was asked about which sum might produce the more accurate result. One can argue that it is better to add from small to big. The reason is that if one starts with the larger terms, and the sum gets big enough, then the smaller terms are less able to have an effect on the answer. To check on this, it is necessary to know the exact value, or at least have an accurate approximate value. This can be found using what is known as the Euler-Maclaurin formula, from which one can show that for larger values of n,

$$ \sum_{k=1}^{n} \frac{1}{k} = \ln(n) + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + \frac{1}{120n^4} - \frac{1}{252n^6} + \cdots, $$

where $\gamma = 0.5772\cdots$ is Euler's constant. To investigate the accuracy of the two sums, the errors are shown in Figure 1.6. For smaller values of n the errors are fairly small, and there is no obvious preference between s and S. For larger values of n, however, there is a preference as s(n) generally serves as a more accurate approximation than S(n). It is also seen that there is a slow overall increase in the error for both, but this is not unusual when such a large number of floating-point calculations are involved (see Exercise 1.25).

Given the importance of summation in computing, it should not be surprising that numerous schemes have been devised to produce an accurate sum. A particularly interesting example is something called compensated summation. Given the problem of calculating $\sum_{i=1}^{n} x_i$, where $x_i > 0$, the compensated summation procedure is as follows:

Figure 1.6 The error in computing the partial sum of the harmonic series using (1.1) and (1.2).


let:   sum = 0 and err = 0
loop:  for i = 1, 2, 3, ..., n
           z = x_i + err
           q = sum
           sum = q + z
           err = z − (sum − q)
       end
(1.11)

The error in computing s(n) when using this procedure versus just adding the terms recursively is given in Table 1.4. The improvement in the accuracy is dramatic. Not only does it give a more accurate value, but the error does not increase with n unlike what happens when adding the terms recursively.

Compensated summation is based on estimating the error in a floating-point addition, and then compensating for this in the calculation. The sequence of steps involved illustrating how the method works is given in Table 1.5. To explain, suppose that the mantissa used by the computer has four digits and base 10 arithmetic is used. To add $a = 0.1234 \times 10^{3}$ and $b = 0.1234 \times 10^{1}$ it is first necessary to express them using the same exponent, and write $b = 0.001234 \times 10^{3}$ (this is the alignment step in Table 1.5). Given the four digit mantissa, b gets replaced with $b = 0.0012 \times 10^{3}$. Using the notation in Table 1.5, the digits $b_2 = 34$ are lost in this shuffle. With this, $s_f = 0.1246 \times 10^{3}$ and $s_f - a = 0.12 \times 10^{1}$. Accordingly, $(s_f - a) - b = -0.0034 \times 10^{1}$, which means we have recovered the piece missing from the sum. In Table 1.5, $\text{err} = b - (s_f - a)$, and this is added back in during the next iteration step.

There are variations on this procedure, and also limitations on its usefulness. Those interested in reading more about this should consult Demmel and Hida [2004] or Blanchard et al. [2020].

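| n | E(n) − c(n) | E(n) − s(n) |
|---|---|---|
| $10^{4}$ | 0 | $4 \times 10^{-15}$ |
| $10^{5}$ | 0 | $2 \times 10^{-14}$ |
| $10^{6}$ | $2 \times 10^{-15}$ | $5 \times 10^{-14}$ |
| $10^{7}$ | $4 \times 10^{-15}$ | $10^{-13}$ |
| $10^{8}$ | $4 \times 10^{-15}$ | $5 \times 10^{-13}$ |
| $10^{9}$ | 0 | $2 \times 10^{-12}$ |

Table 1.4 Comparison between compensated summation, as given in (1.11), and regular summation. Note E(n) is the exact result, c(n) is the value using compensated summation, and s(n) is given in (1.2).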
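The comparison in Table 1.4 can be approximated with the sketch below, which implements the two orderings (1.1) and (1.2), the compensated summation steps in (1.11), and the Euler-Maclaurin expression as a stand-in for the exact value. Some assumptions: the function names are illustrative, the numerical value of Euler's constant is entered to double precision, and the reference value is itself computed in floating point, so it is only reliable to within a few multiples of machine epsilon times the sum.

```python
import math

EULER_GAMMA = 0.5772156649015329    # Euler's constant, rounded to double precision

def harmonic_reference(n):
    """Euler-Maclaurin approximation of sum_{k=1}^n 1/k, intended for larger n."""
    return (math.log(n) + EULER_GAMMA + 1 / (2 * n) - 1 / (12 * n**2)
            + 1 / (120 * n**4) - 1 / (252 * n**6))

def S_forward(n):
    """Sum (1.1): add from the largest term to the smallest."""
    total = 0.0
    for k in range(1, n + 1):
        total += 1.0 / k
    return total

def s_backward(n):
    """Sum (1.2): add from the smallest term to the largest."""
    total = 0.0
    for k in range(n, 0, -1):
        total += 1.0 / k
    return total

def compensated_sum(values):
    """Compensated summation, following the steps in (1.11)."""
    total = 0.0
    err = 0.0
    for x in values:
        z = x + err               # add back the error from the previous step
        q = total
        total = q + z
        err = z - (total - q)     # the part of z that was lost in q + z
    return total

if __name__ == "__main__":
    for n in (10**4, 10**5, 10**6):
        E = harmonic_reference(n)
        c = compensated_sum(1.0 / k for k in range(n, 0, -1))
        print(f"n = {n:8d}  E - c = {E - c: .1e}  "
              f"E - s = {E - s_backward(n): .1e}  E - S = {E - S_forward(n): .1e}")
```

A quick way to see the compensation at work is `compensated_sum([1.0, 1e-16, 1e-16])`: adding the terms one at a time returns 1.0 because each small term is insignificant on its own, while the compensated version carries them along and returns the next float above 1.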

    Operation              Floating-Point Result    Comments
    alignment              [diagram]                mantissas for a and b are aligned for the addition
    $s_f = (a + b)_f$      [diagram]                due to the fixed number of digits, $b_2$ is lost
    $s_f - a$              [diagram]                a is removed from the sum
    $(s_f - a) - b$        [diagram]                in removing b, the part that remains is $-b_2$

Table 1.5 Steps explaining how the error in floating-point addition is estimated for compensated summation. Adapted from [Higham, 2002].

Example 1.2 (cont’d)

The first thing to notice is that the values of the function in the lower plot in Figure 1.2 are close to machine epsilon. The expanded version of the polynomial is required to take values near $x = 1$ and combine them to produce a value close to zero. The errors seen here are consistent with arithmetic using double precision, and the fact that the values are sometimes negative also is not surprising.

It is natural to ask, given the expanded version of the polynomial (1.4), whether it is possible to find an algorithm for it that is not so sensitive to round-off error. There are procedures for the efficient evaluation of a polynomial, and two examples are Horner’s method and Estrin’s method. To explain how these work, Horner’s method is based on the following observations:

$$
\begin{aligned}
a_2 x^2 + a_1 x + a_0 &= a_0 + (a_1 + a_2 x)x, \\
a_3 x^3 + a_2 x^2 + a_1 x + a_0 &= a_0 + (a_1 + (a_2 + a_3 x)x)x, \\
a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 &= a_0 + (a_1 + (a_2 + (a_3 + a_4 x)x)x)x.
\end{aligned}
$$

Higher order polynomials can be factored in a similar manner, and the resulting algorithm for evaluating the $n$th degree polynomial $p(x) = a_0 + a_1 x + \cdots + a_n x^n$ is

    let:    p = a_n
    loop:   for i = 1, 2, 3, ..., n
                p = a_{n−i} + p ∗ x
            end                                            (1.12)
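As a concrete illustration of (1.12), here is a short Python sketch. The function names, the convention that the coefficients are stored as the list [a0, a1, ..., an], and the cubic used to try it out are assumptions made for this example. The term-by-term routine is included only for comparison.

```python
def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x**n using the loop in (1.12).

    coeffs is the list [a0, a1, ..., an] (a storage convention assumed here).
    """
    p = coeffs[-1]                   # p = a_n
    for a in reversed(coeffs[:-1]):  # a_{n-1}, a_{n-2}, ..., a_0
        p = a + p * x
    return p


def direct(coeffs, x):
    """Term-by-term evaluation, included only for comparison."""
    return sum(a * x**i for i, a in enumerate(coeffs))


coeffs = [5.0, -12.0, 0.0, 1.0]  # 5 - 12x + x^3, an illustrative choice
print(horner(coeffs, 2.0), direct(coeffs, 2.0))
```

Each pass through the loop uses one multiplication and one addition, which is where the flop count for Horner's method comes from.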


This procedure is said to be optimal because it uses the minimum number of flops to compute $p_n(x)$. In particular, it requires $2n$ additions and multiplications in total ($n$ of each), while the direct method requires about $3n$. Because of the reduced computing cost, Horner’s method is often used in library programs for evaluating polynomials. For example, the library routines that are used by some computers to evaluate $\tan(x)$ and $\operatorname{atan}(x)$ involve polynomials of degree 15 and 22 [Harrison et al., 1999], and having this done as quickly as possible is an important consideration. However, because additions and multiplications take only about $10^{-9}$ sec, the speedup using Horner is not noticeable unless you are evaluating the polynomial at a huge number of points.

The advantage of using Horner is that it tends to be less sensitive to round-off error. To illustrate, using Horner to evaluate the polynomial in (1.4), the curve shown in Figure 1.7 is obtained. It clearly suffers the same oscillatory behavior the direct method has, which is shown in Figure 1.1. However, the amplitude of the oscillations is about half of what is obtained using the direct method.

Example 1.3 (cont’d)

The function considered was

$$
y = \frac{\sqrt{16 + k} - 4}{k}, \tag{1.13}
$$

and this is replotted in Figure 1.8. The plot differs from the earlier plot of this function in two respects. One is that the $k$ interval is smaller to help magnify the curve in the region where it drops to zero. The second difference concerns the dashed lines. The plotting command used to make Figure 1.8 connected the given function values as if $y$ is continuous (which it is). However, the floating-point evaluation of $y$ is discontinuous, and the dashed lines indicate where those discontinuous points are located. Why the jumps occur can be explained using Figure 1.5. The floating-point value jumps as you move from region III to region II, and jumps again when you move into region I. Normally these jumps are imperceptible in a typical plot, but in the region near where the drop to zero occurs they are being magnified to the extent that they are clearly evident in the plot.

Figure 1.7 Plot of (1.4) when evaluated using Horner’s method, and using (1.3).

Figure 1.8 Plot of (1.13).

It is fairly easy to avoid the indeterminate form at $k = 0$ by rewriting the function. For example, rationalizing the numerator you get that

$$
\frac{\sqrt{16 + k} - 4}{k} = \frac{\sqrt{16 + k} - 4}{k} \cdot \frac{\sqrt{16 + k} + 4}{\sqrt{16 + k} + 4} = \frac{1}{\sqrt{16 + k} + 4}.
$$

With this, letting $k \to 0$ is no longer a problem.

A follow-up question is, how does the computer come up with the value $y = 0$? The short answer is that for $k$ close to zero, the computer rounds $\sqrt{16 + k}$ to 4 and this means that the numerator is rounded to zero. As seen in Figure 1.8, the $k$ values being considered here are far from the smallest positive normalized float $x_m$, so the denominator is not also rounded to zero. Consequently, the computer ends up with $y = 0$.

To provide more detail about why, and where, $y = 0$, there are two factors contributing to this. One is that $16 + k$ gets rounded to 16, and the other is that $\sqrt{16 + k}$ gets rounded to 4. To investigate where these occur, note that for small values of $k$, $\sqrt{16 + k}$ is just a bit larger than 4. The floating-point numbers the computer has to work with in this region are $4,\ 4(1 + \varepsilon),\ 4(1 + 2\varepsilon), \cdots$. Given the rounding rule, the computer will claim $\sqrt{16 + k} = 4$ once $k$ is small enough that $\sqrt{16 + k} < 4(1 + \varepsilon/2)$. Squaring this and dropping the very small $\varepsilon^2$ term you get that $k < 16\varepsilon$. In a similar manner you find that $16 + k$ gets rounded to 16 for $k < 8\varepsilon$. Based on this, the conclusion is that if $k < 16\varepsilon$, then the computer will claim that $y = 0$. It should be mentioned that if you account for both rounding procedures together, you find that the drop to zero occurs at about $k \approx 24\varepsilon$.
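The difference between (1.13) and its rationalized form is easy to see numerically. The following Python sketch evaluates both; the function names and the sample values of k are assumptions made for illustration, with eps taken from sys.float_info.epsilon (machine epsilon for double precision).

```python
import math
import sys


def y_direct(k):
    """Evaluate (1.13) exactly as written."""
    return (math.sqrt(16.0 + k) - 4.0) / k


def y_rationalized(k):
    """Evaluate the rationalized form 1 / (sqrt(16 + k) + 4)."""
    return 1.0 / (math.sqrt(16.0 + k) + 4.0)


eps = sys.float_info.epsilon  # machine epsilon for double precision

# Sample values of k (illustrative): one moderate, two within a few multiples of eps.
for k in (1e-3, 24 * eps, 4 * eps):
    print(k, y_direct(k), y_rationalized(k))
```

For the smallest of these k values the sum 16 + k is rounded to 16, so the direct formula returns zero, while the rationalized form returns a value close to the limit 1/8.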


1.5 Error and Accuracy

One of the most important words used in this text is error (and it is used a lot). There are different types of error that we will often make use of. For example, if $x_c$ is a computed value and $x$ is the exact value, then
1. $|x - x_c|$ is the error,
2. $|x - x_c|/|x|$ is the relative error (assuming $x \neq 0$).

Both are used, but the relative error has the advantage that it is based on a normalized value. To explain, consider the requirement that the computed solution should satisfy $|x - x_c| < 0.1$ versus satisfying $|x - x_c|/|x| < 0.1$. If $x = 10^{-8}$ then using the error you would accept $x_c = 10^{-2}$, even though this is a factor of $10^{6}$ larger than the solution. With the relative error, you would not accept $x_c = 10^{-2}$ but would accept any value that satisfies $0.9 < x_c/x < 1.1$. The other useful aspect of the relative error is that it has a connection to the number of correct significant digits, and this is explained later.

The error and the relative error require you to know the exact solution. For this reason, they will play an important role in the derivation and testing of the numerical methods, but will have little, if any, role in the algorithm that is finally produced. One of the problems of not knowing the error is that it can be difficult to know when to stop a computation. As an example, consider the problem of calculating the value of

$$
s = 8 - \sum_{n=1}^{\infty} \frac{7}{8^{n}}.
$$

Given that this involves the geometric series, you can show that $s = 7$. The value of the series can be computed by first rewriting it as a recurrence relation as follows: $s_0 = 8$ and

$$
s_k = s_{k-1} - \frac{7}{8^{k}}, \quad \text{for } k = 1, 2, 3, 4, \cdots. \tag{1.14}
$$

The computed values for $s_k$ are given in Table 1.6. It is possible to introduce a measure for the improvement in the value of $s_k$ seen in this table by using one of the following:
1. $|s_k - s_{k-1}|$ is the iterative error,
2. $|s_k - s_{k-1}|/|s_k|$ is the relative iterative error (assuming $s_k \neq 0$).

The values for these quantities are given in Table 1.6. As with the relative error, the preference in this text will be to use the relative iterative error whenever possible given that it is a normalized value.

    k    $s_k$                $|s_k - s_{k-1}|$    $|s_k - s_{k-1}|/|s_k|$    $|s - s_k|/|s|$
    1    7.125000000000000
    2    7.015625000000000    1.1e−01              1.6e−02                    2.2e−03
    3    7.001953125000000    1.4e−02              2.0e−03                    2.8e−04
    4    7.000244140625000    1.7e−03              2.4e−04                    3.5e−05
    5    7.000030517578125    2.1e−04              3.1e−05                    4.4e−06
    6    7.000003814697266    2.7e−05              3.8e−06                    5.4e−07

Table 1.6 Values of $s_k$, which is given in (1.14), as they approach the exact value of $s = 7$. Also given are the iterative error, the relative iterative error, and the relative error.
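The recurrence (1.14), together with the relative iterative error, can be sketched in a few lines of Python. The function name, the tolerance argument tol, and the iteration cap max_iter are assumptions made for this illustration. The loop stops as soon as the relative iterative error drops below tol.

```python
def sum_by_recurrence(tol=1e-4, max_iter=60):
    """Iterate (1.14) until the relative iterative error is below tol.

    Returns the index k at which the loop stops and the value s_k.
    """
    s_prev = 8.0  # s_0
    for k in range(1, max_iter + 1):
        s = s_prev - 7.0 / 8.0**k                # s_k = s_{k-1} - 7/8^k
        rel_iter_err = abs(s - s_prev) / abs(s)  # relative iterative error
        if rel_iter_err < tol:
            return k, s
        s_prev = s
    raise RuntimeError("tolerance not reached within max_iter steps")


k, s = sum_by_recurrence(tol=1e-4)
print(k, s, abs(7.0 - s) / 7.0)  # stopping index, s_k, and the relative error
```

The relative error printed at the end uses the known value s = 7, which is available here only because this is a test problem.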

Many of the methods in this text involve determining a quantity $s$ by computing a sequence $s_1, s_2, s_3, \cdots$ where $s_k \to s$. As in the above example, the question is when to stop computing and accept the current value of $s_k$ as the answer. Typically, an error tolerance tol is specified at the beginning of the code and the code stops when $|(s_k - s_{k-1})/s_k| < \text{tol}$. For example, if $\text{tol} = 10^{-4}$, then in Table 1.6 the code would stop once $s_5$ is computed. As with many things in computing there is a certain level of uncertainty in this result. For example, there is always the possibility that the sequence converges so slowly that $s_k$ and $s_{k-1}$ are almost identical yet $s_k$ is nowhere near $s$. Or, when summing a series, there is the possibility that $s_k$ is so large that the remaining terms to be added, no matter how many there might be, are insignificant (an example of this is given in Exercise 1.27). Such complications are the basis for the first quote on page 1.

Correct Significant Digits

You would think that an elemental requirement in computing is that the computed solution $x_c$ has a certain number of correct significant digits. So, for example, if $x = 100$ and you wanted a value with at least 3 correct significant digits, you would accept $x_c = 100.2$ and $x_c = 99.98$, but not accept $x_c = 100.8$. As will be seen below, this idea is more complicated than you might expect.

We begin with the operation of rounding. Suppose $x$ is rounded to $p$ digits to obtain $\bar{x}$. As shown in Exercise 1.30,

$$
\left| \frac{x - \bar{x}}{x} \right| < 5 \times 10^{-p}. \tag{1.15}
$$

So, the exponent in the relative error is determined by the number of digits used in the rounding. This raises the question whether an inequality as in (1.15) can be used to determine the number of correct significant digits in the computed solution $x_c$. For example, if $|(x - x_c)/x| < 5 \times 10^{-p}$ then is $x_c$ correct to $p$ significant digits? This means that if you round $x$ and $x_c$ to $p$ digits and obtain $\bar{x}$ and $\bar{x}_c$, respectively, then $\bar{x}_c = \bar{x}$. Based simply on the $s_k$ values in Table 1.6, the answer is no, not necessarily. Except for the last entry, there are $p - 1$ correct digits.

It turns out that the connections between rounding and correct significant digits can be surprising. For example, suppose that $x = 0.9949$ and $x_c = 0.9951$. So, $x_c$ is correct to one digit, correct to 3 digits, but it is not correct to 2 digits or to 4 digits. In fact, the difficulty of determining correctly rounded numbers underlies what is known as “The Table-Maker’s Dilemma”.

The connection between the number of correct significant digits and the relative error is too useful to be deterred by the unpleasant fact that the exact connection is a bit convoluted. So, in this text, the following simple heuristic is used:

Expected Significant Digits. If

$$
\left| \frac{x - x_c}{x} \right| < 10^{-p}, \tag{1.16}
$$

then $x_c$ is expected to be correct to at least $p$ significant digits.

The upper bound $10^{-p}$ is used in (1.16) rather than $5 \times 10^{-p}$ because it provides a more accurate value for $p$. As a case in point, it provides a better approximation for the correct significant digits in Table 1.6 as well as in Example 2.4. Also, significant digits refer to the maximum number of correct significant digits. So, for example, if $x = 0.9949$ and $x_c = 0.9951$, then $p = 3$.
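One way to apply the heuristic (1.16) in code is to find the largest integer p for which the relative error is below 10^(-p). The Python sketch below does this. The function name and the handling of the special cases (an exact answer, or a relative error of 1 or more) are assumptions of this illustration rather than part of the heuristic.

```python
import math


def expected_digits(x, xc):
    """Largest integer p with |x - xc|/|x| < 10**(-p), as in (1.16).

    Returns math.inf when xc equals x, and 0 when the relative error
    is 1 or larger (both conventions are assumptions of this sketch).
    """
    rel_err = abs(x - xc) / abs(x)
    if rel_err == 0.0:
        return math.inf
    # ceil(...) - 1 keeps the inequality strict when rel_err is an exact power of 10
    return max(0, math.ceil(-math.log10(rel_err)) - 1)


print(expected_digits(0.9949, 0.9951))         # the example from the text
print(expected_digits(7.0, 7.000003814697266)) # last entry of Table 1.6
```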

1.5.1 Over-computing?

In using a numerical method, the question comes up as to how accurately to compute the answer. For example, numerical methods are used to solve problems in mechanics, and one often compares the computed values with data obtained experimentally. This raises the question: if the data is correct to only two or three digits, is it really necessary to obtain a numerical solution that is correct to 15 or 16 significant digits (the limit for double precision)? It is true that in many situations you do not need the accuracy provided by double precision, and that is why lower precision types are available (see Table 1.2). The latter are typically used when speed is of particular importance, such as in graphics cards.

The issue that arises in scientific computing is the prevalence of ill-conditioned problems. These are problems that tend to magnify small errors. For example, in Chapter 3 it will be seen that when solving the matrix equation $Ax = b$ it is easily possible that 15 or 16 digits are needed just to guarantee that the computed solution is correct to one or two digits. On the other hand, some methods that will be considered actually try to take advantage of not over-computing the solution. Several examples of this will arise in Chapter 9 when computing minimizers of nonlinear functions.

1.6 Multicore Computing

Scientific computing is a mathematical subject that is dependent on the current state of the hardware being used. As a case in point, most computing systems now have multicore processors. Having multiple cores can enable a problem to be split into smaller problems that are then done by the cores in parallel. For example, to compute $\sum_{i=1}^{1000} x_i$, if you have 10 cores you could have each core add up 100 of the $x_i$’s. In theory this could reduce your computing time by a factor of 10. The tricky part is writing the software that implements this idea efficiently.

Significant work has been invested in writing computer software that makes use of multiple cores. A particularly important example is BLAS (basic linear algebra subprograms). These are low-level routines for using multiple cores to evaluate $\alpha A x + \beta y$ and $\alpha A B + \beta C$. These expressions are the building blocks for many of the algorithms used in scientific computing. As an example, suppose you need to compute $Ax_1, Ax_2, \cdots, Ax_m$. This can be done sequentially or you can form the matrix $X = [x_1\ x_2\ \cdots\ x_m]$ and then compute $AX$. In MATLAB, since it uses BLAS, the matrix version is about 6 times faster than computing them sequentially (on a 10 core system).

Most programmers are not able to write the low-level routines needed to take advantage of multiple cores. However, MATLAB, as well as languages such as Python and Julia, has numerous built-in commands that do this automatically. As an example of a consequence of this, very few now write code to solve the matrix equation $Ax = b$. Instead, if you are using MATLAB, you simply use the command x = A\b. On a 10 core machine, MATLAB obtains the solution approximately 5 times faster than when using just one core.
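The idea of splitting a sum among workers can be sketched with Python's standard library. This is only an illustration of the decomposition: the function name and the choice of a thread pool are assumptions, and in the standard CPython interpreter threads do not execute Python arithmetic in parallel, so no speedup should be expected from this particular sketch. In practice one would rely on a process pool or on library routines built on multicore-aware code such as BLAS.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(xs, n_workers=10):
    """Split xs into contiguous pieces, sum each piece, then combine the results."""
    if not xs:
        return 0.0
    size = -(-len(xs) // n_workers)  # ceiling division: terms per worker
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))  # one partial sum per chunk
    return sum(partial_sums)


xs = [float(i) for i in range(1, 1001)]  # x_i = i, an illustrative choice
print(chunked_sum(xs, n_workers=10))
```

Note that the partial sums are combined in a different order than in a single left-to-right loop, so for general floating-point data the two results can differ by rounding error.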

Exercises

For an exercise requiring you to compute the solution, you should use double-precision arithmetic.

1.1. The following are true or false. If true, provide an explanation why it is true, and if it is false provide an example demonstrating this along with an explanation why it shows the statement is incorrect. In this problem $x$ is a real number and $x_f$ is its floating-point approximation.
(a) If $x = 0$, then $x_f = 0$.
(b) If $x > y$, then $x_f > y_f$.
(c) If $x_f$ is a floating-point number with $1 \le x_f \le 2$, then there is an integer $k$ so that $x_f = 1 + k\varepsilon$.
(d) If $x_f$ is a floating-point number, then $1/x_f$ is a floating-point number.
(e) If $x$ is a solution of $ax = b$, then $x_f$ is also a solution.
(f) Approximately 25% of the positive floating-point numbers are in the interval $0 < x < 1$.

Section 1.2

1.2. Show that the following are floating-point numbers: (a) 15, (b) −6, (c) 3/2, (d) −5/4.

1.3. By hand, evaluate the given formula using double-precision arithmetic. Note that the terms must be evaluated in the order indicated by the parentheses. Make sure to explain your steps.
(a) $(1 - \tfrac{1}{3}\varepsilon) - 1$,
(b) $(1 - 2^{-55}) - 1$,
(c) $(1 + 2^{-54})^{4}$,
(d) $\left(1 + \tfrac{1}{4}\varepsilon\right)\left(1 + \tfrac{1}{3}\varepsilon\right)$,
(e) $(1 - \varepsilon/8)/(1 + \varepsilon/3)$,
(f) $(1 + (2^{-53} + 2^{-54}))^{4} - 1$.

1.4. Let $x = 2^{m} + 2^{n}$ and $y = 2^{m}$. Taking $m = 20$, use the computer to find the largest integer value of $n$ where you get $x - y = 0$. Do the same for $m = 0$ and $m = -20$. Using the ratio $|(x - y)/x|$, explain why you get these particular values of $n$.

1.5. Find nonzero numbers for $x$ and $y$ so the computed value of $x/y$ is the stated result. Also, provide a short explanation why your example does this.
(a) Inf,
(b) NaN,
(c) 0,
(d) 1 (with $x \neq y$).

1.6. Compute the following, and provide a plausible explanation for the answer.
(a) $10 \cdot \text{NaN}$ and $0 \cdot \text{NaN}$,
(b) $\text{NaN}/\text{NaN}$,
(c) $-6 \cdot \text{Inf}$,
(d) $\text{Inf} - \text{Inf}$,
(e) $0 \cdot \text{Inf}$,
(f) $0^{\text{Inf}}$ and $(\text{Inf})^{0}$,
(g) $1^{\text{Inf}}$,
(h) $e^{-\text{Inf}}$ and $e^{\text{Inf}}$,
(i) $\sin(\text{Inf})$.

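1.7. Letting

$$
A = \begin{pmatrix} a & -1 \\ 0 & 3 \end{pmatrix}, \quad
x = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad \text{and} \quad
y = \begin{pmatrix} a \\ 0 \end{pmatrix},
$$

compute the following, and provide a plausible explanation for the answer.
(a) $x + y$ when $a = \text{Inf}$,
(b) $Ax$ when $a = \text{NaN}$,
(c) $A^{2}$ when $a = \text{Inf}$,
(d) $Ay$ when $a = \text{Inf}$.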

1.8. Let $x_f$ and $y_f$ be adjacent floating-point numbers. You can assume they are positive and normal floats.
(a) What is the minimum possible distance between $x_f$ and $y_f$?
(b) What is the maximum possible distance between $x_f$ and $y_f$?
(c) How many double-precision numbers lie between two consecutive single-precision numbers? You can assume the single-precision numbers are positive.

1.9. In double precision, (i) what is the distance from 32 to the next largest floating-point number and (ii) what is the distance from 32 to the next smallest floating-point number?

1.10. (a) In double precision, explain why the floating-point numbers in the interval $2^{52} \le x \le 2^{53}$ are exactly the integers in this interval.
(b) In double precision, explain why the only floating-point numbers satisfying $2^{53} \le x_f \le x_M$ are integers.

1.11. For a computer, the epoch is the time and date from which it measures system time. Typically for UNIX systems, the epoch is midnight on January 1, 1970. Assume in this problem that system time is measured in microseconds. So, for example, if the system time is 86400000000 then it is midnight on January 2, 1970 since there are 86400000000 microseconds in a day. Assuming double precision is used, and every year contains 365 days, on what date will a computer using UNIX be unable to accurately determine the time of day? Comment: The epoch and system time are used by UNIX programmers to have “time_t parties”. One of the more notable was held on 2/13/2009.

1.12. (a) Find the largest open interval about $x = 16$ so all real numbers from the interval are rounded to $x_f = 16$. That is, find the smallest value of $L$ and largest value of $R$ with $L < 16 < R$ so any number from the interval $(L, R)$ is rounded to the floating-point number $x_f = 16$. Assume double precision is used.
(b) Show that $x = 50$ is a floating-point number, and then redo part (a) for $x = 50$.


1.13. Let rand be a computer command that returns a uniformly distributed random number from the interval $(0, 1)$. It does this by picking a random number from a prescribed collection $S$ of floating-point numbers satisfying $0 < x_f < 1$. The requirements are that the numbers in $S$ are equally spaced, and the number selected is as likely to be in the interval $(0, 1/2)$ as in the interval $(1/2, 1)$.
(a) Suppose $S$ consists of all the floating-point numbers that satisfy $0 < x_f < 1$. Why would this not satisfy the stated requirements?
(b) What is the largest set of floating-point numbers that can be used for $S$? If $x_f$ is one of the floats in $S$, what is the probability that rand picks it?
(c) An often used method is to prescribe a positive integer $p$, then take $S = \{1/2^{p}, 2/2^{p}, 3/2^{p}, \cdots, (2^{p} - 1)/2^{p}\}$. Typically $p$ is chosen so that $2^{p} - 1$ is a Mersenne prime (e.g., $p = 13$, $p = 29$, or $p = 53$). Also, if $p \le 53$ then all the numbers in $S$ are floating-point numbers. What is $S$ when $p = 1$, $p = 2$, $p = 3$, and $p = 53$?
(d) Explain why the set $S$ in part (c) satisfies the stated requirements if $p \le 53$. What is the probability that rand picks any particular number from $S$?
Comment: Generating uniformly distributed random numbers is a critical component of any computer system, and how to do this is continually being improved [Goualard, 2020; Jacak et al., 2021].

1.14. Using compound interest, if an amount $a$ is invested at an annual interest rate $r$, and compounded $n$ times per year, then the amount $A$ at the end of one year is

$$
A = a\left(1 + \frac{r}{n}\right)^{n}.
$$

It’s not hard to show that the larger the value of $n$ the larger the value of $A$. Assume that $a = 100$ and the interest rate is 1% (so, $r = 0.01$). Also assume there are 365 days in a year. Compute $A$ for the following cases:
(a) Compounding every hour (so, $n = 365 * 24$).
(b) Compounding every second.
(c) Compounding every millisecond.
(d) Compounding every nanosecond.
(e) Compounding every picosecond.
(f) You should find that the values computed in (d) and (e) are incorrect. The question is why, that is, what causes the floating-point calculation to produce an incorrect value? Based on this, given a value of $r$ (with $0 < r < 1$), at what value of $n$ would you expect an incorrect result as in (d) and (e) to be computed?

1.15. Consider the ratio

$$
R = \frac{n(n - 2)(n - 4) \cdots 2}{(n - 1)(n - 3)(n - 5) \cdots 1},
$$

where $n$ is even. It is known that if $n = 100$ then $R \approx 12.5645$ and if $n = 400$ then $R \approx 25.0820$.
(a) The algorithm below will, in theory, compute $R$. Try it and show that it works if $n = 100$ but not if $n = 400$ (for the latter, the first line must be changed). Explain why this happens.

    n = 100, T = 1, B = 1
    for i = 2, 4, 6, ..., n
        T = T ∗ i
    end
    for i = 3, 5, 7, ..., n − 1
        B = B ∗ i
    end
    R = T/B

(b) How can $R$ be rewritten so it is possible to compute its value when $n = 400$? Show that it works by computing the result. Also, compute $R$ for $n = 4{,}000{,}000$.

1.16. Compute the following. Your answer must contain at least 12 significant digits. If you must modify the sum(s) in any way to obtain the answer, explain what you did and why.
(a) $\displaystyle \sum_{k=0}^{1000} \frac{\cosh(k)}{1 + \sinh(k)}$,
(b) $\displaystyle \sum_{k=0}^{1000} \frac{e^{k}}{1 + e^{k}}$,
(c) $\displaystyle \sum_{k=0}^{1000} \sqrt{3 + e^{k}} - \sum_{n=0}^{1000} \sqrt{1 + e^{n}}$,
(d) $\displaystyle \frac{\sum_{k=0}^{1000} e^{k}}{\sum_{n=0}^{1000} n e^{n}}$,
(e) $\displaystyle \sum_{k=0}^{1000} \ln(e^{k} + 1)$,
(f) $\displaystyle \ln\left( \sum_{k=0}^{1000} e^{k} \right)$,
(g) $\displaystyle \sum_{k=1}^{1000} k \left[ \sin\left( \pi\left(k + \frac{1}{k^{10}}\right) \right) - \sin\left( \pi\left(k - \frac{1}{k^{10}}\right) \right) \right]$.

1.17. Using the quadratic formula, compute the positive solution of the given equation. If you run into a complication in computing the solution, explain how you resolved the problem.
(a) $x^{2} + 10^{8} x - 14 = 0$,
(b) $10^{-16} x^{2} + x - 14 = 0$.

1.18. Let $z = \sqrt{(x^{2} + y^{2})/2}$. Assume that $x$ and $y$ are positive.
(a) Show that $\min\{x, y\} \le z \le \max\{x, y\}$.
(b) Compute $z$ when $x = 10^{200}$, $y = 10^{210}$. Your value must satisfy the inequalities in part (a). If you run into a complication in computing the value, explain how you resolved the problem.
(c) Redo part (b) using $x = 10^{-200}$, $y = 10^{-210}$.
(d) Write down an algorithm that can be used to accurately evaluate $z$ for any positive normalized floating-point numbers for $x$ and $y$.

1.19. In some circumstances, the order of multiplication makes a difference. As an illustration, given $n$-vectors $x$, $y$, and $z$ then $(x y^{T}) z = x (y^{T} z)$.
(a) Prove this for $n = 3$.
(b) For large $n$, which is computed faster: $(x y^{T}) z$ or $x (y^{T} z)$?

1.20. Homer Simpson, in the 1998 episode “The Wizard of Evergreen Terrace”, claimed he had a counterexample to Fermat’s Last Theorem, and it was that $3987^{12} + 4365^{12} = 4472^{12}$. This exercise considers whether it is possible to prove numerically that Homer is correct. Note that another (false) counterexample appeared in the 1995 episode “Treehouse of Horror VI”.
(a) Calculate $3987^{12} + 4365^{12} - 4472^{12}$. If Homer is right, what should the answer be?
(b) Calculate $\left(3987^{12} + 4365^{12}\right)^{1/12} - 4472$. If Homer is right, what should the answer be?
(c) Calculate $\left(3987^{12} + 4365^{12}\right)/4472^{12}$. If Homer is right, what should the answer be?
(d) Calculate $\left(\left(3987^{12} + 4365^{12}\right)^{1/12}\right)^{12} - 4472^{12}$. If Homer is right, what should the answer be?
(e) One argument that Homer could make is that (c) is the correct result and (a) and (b) can be ignored because if they are correct then you should not get a discrepancy between (a) and (d). Explain why double-precision arithmetic cannot be used to prove whether Homer is right or wrong.
Note: Homer’s blackboard containing the stated formula, along with a few other gems, can be found in Singh [2013]. It also explains why Homer appears to have an interest in mathematics and physics.

1.21. This problem considers the consequences of rounding using double precision. Assume the “round-to-nearest” rule is used, and if there is a tie then the smaller value is picked (this rule for ties is used to make the problem easier).
(a) For what real numbers $x$ will the computer claim the inequalities $1 < x < 2$ hold?
(b) For what real numbers $x$ will the computer claim $x = 4$?
(c) Suppose it is stated that there is a floating-point number $x_f$ that is the exact solution of $x^{2} - 2 = 0$. Why is this not possible? Also, suppose $\underline{x}_f$ and $\bar{x}_f$ are the floats to the left and right of $\sqrt{2}$, respectively. What does $\bar{x}_f - \underline{x}_f$ equal?

Figure 1.9 A plot of $f(x)$ for Exercise 1.23 (the x-axis values are in units of $10^{-7}$).

Section 1.4

1.22. A way to sum a series of positive numbers is to use a sort-then-add procedure. You do this as follows: (1) sort the $n$ terms in the series from smallest to largest and then add the two smallest entries (which then replaces those two entries), (2) sort the resulting $n - 1$ terms by size and add the two smallest entries (which then replaces those two entries), etc.
(a) Work out the steps in the procedure to sum $6 + 4 + 5 + 2 + 1$.
(b) In Table 1.4, $E(10^{5}) = 12.0901461298634$. Use the sort-then-add procedure to compute $s(10^{5})$, and compare the resulting error with what is given in Table 1.4. Also, how does the computing time needed for this procedure compare to just computing $s(10^{5})$ directly?

1.23. The graph of the function $f(x) = (\sqrt{1 + x^{2}} - 1)/x^{2}$ is shown in Figure 1.9 where the values of $f(x)$ were computed using double precision.
(a) Using l’Hospital’s rule, determine $\lim_{x \to 0} f(x)$.
(b) Rewrite $f(x)$ in such a way that the function is defined at $x = 0$. Evaluate this function at $x = 0$.
(c) Why does the computer state that $f(x) = 0$ for small values of $x$? Also, the graph shows that as $x$ decreases to zero the function drops to zero. Determine where this occurs (approximately).

1.24. The graph of the function $f(x) = (e^{x} - 1)/x$ is shown in Figure 1.10. In this problem, the quadratic Taylor polynomial approximation $e^{x} \approx 1 + x + x^{2}/2$, for $|x| \ll 1$, is useful.
(a) Using l’Hospital’s rule, determine $\lim_{x \to 0} f(x)$.
(b) Redo part (a) but use the quadratic Taylor polynomial approximation.
(c) Suppose that $x$ is positive. Why does the computer state that $f(x) = 0$ for small values of $x$? Approximately, where does the drop to zero occur?
(d) Redo part (c) when $x$ is negative.


1.25. This problem considers the following algorithm:

    x = √2, s = 0
    for i = 1, 2, 3, ..., N
        s = s + x
    end
    y = |x − s/N|

It is assumed $N$ is a prescribed positive integer.
(a) What is the exact value for $y$?
(b) If $N = 10^{3}$, the computed value is $y \approx {} \times 10^{-14}$, if $N = 10^{6}$, one gets that $y \approx 10^{-11}$, and if $N = 10^{9}$, one gets that $y \approx 10^{-8}$. Why is the error getting worse as $N$ increases? Is there any correlation between the value of $N$ and the value of $y$?
(c) Use compensated summation to compute this result and compare the values with those given in part (b).

1.26. The polynomial $p_n(x) = a_0 + a_1 x + \cdots + a_n x^{n}$ can be separated into the sum of two polynomials, one which contains even powers of $x$ and the other involving odd powers. This problem explores the computational benefits of this. To make things simple, you can assume $n$ is even, so $n = 2m$, where $m$ is a positive integer.
(a) Setting $z = x^{2}$, find $f(z)$ and $g(z)$ so that $p_n(x) = f(z) + x g(z)$.
(b) What is the minimum flop count to compute the expression in part (a)? Also, explain why it is about halfway between the flop count for the direct method and the count using Horner’s method.
(c) Evaluate (1.4) using the formula in part (a), and then plot the values for $0.98 \le x \le 1.02$ (use 1000 points in this interval). In comparison to the plot obtained using the direct method, does the reduced flop count reduce the error in the calculation?

Figure 1.10 A plot of $f(x)$ for Exercise 1.24 (the x-axis values are in units of $10^{-15}$).

Section 1.5

1.27. This problem concerns computing $\sum_{k=1}^{\infty} 1/k^{2}$. The exact value of the sum is $\pi^{2}/6$.
(a) Suppose that the sum is computed using the algorithm given below. Explain how the stopping condition works.

    s = 1, ss = 0, k = 1
    while s ≠ ss
        k = k + 1
        ss = s
        s = s + 1/k^2
    end

(b) Compute the sum using the procedure in part (a) and report its value to 12 digits. At what value of $k$ does the algorithm stop? How many digits of the computed sum are correct?
(c) Assuming the number of correct digits in your answer in part (b) is $m$, compute the sum so the value is correct to, at least, $m + 2$ digits. Make sure to explain how you do this. Also, your procedure cannot use the known value of the sum.

1.28. The computed values of a sequence are $s_1 = 10.3$, $s_2 = 10.03$, $s_3 = 10.003$, $s_4 = 10.0003$, $\cdots$.
(a) What is the apparent value of $\lim_{k \to \infty} s_k$?
(b) If the stopping condition is that the iterative error satisfies $|s_k - s_{k-1}| < \text{tol}$, where $\text{tol} = 10^{-8}$, at what value of $k$ does the computation stop? When it stops, how many correct significant digits does $s_k$ contain?
(c) If the stopping condition is that the relative iterative error satisfies $|s_k - s_{k-1}|/|s_k| < \text{tol}$, where $\text{tol} = 10^{-8}$, at what value of $k$ does the computation stop? When it stops, how many correct significant digits does $s_k$ contain?

1.29. The computed values of a sequence are $s_1 = 0.0027$, $s_2 = 0.00207$, $s_3 = 0.002007$, $s_4 = 0.0020007$, $\cdots$.
(a) What is the apparent value of $\lim_{k \to \infty} s_k$?
(b) If the stopping condition is that the iterative error satisfies $|s_k - s_{k-1}| < \text{tol}$, where $\text{tol} = 10^{-8}$, at what value of $k$ does the computation stop? When it stops, how many correct significant digits does $s_k$ contain?
(c) If the stopping condition is that the relative iterative error satisfies $|s_k - s_{k-1}|/|s_k| < \text{tol}$, where $\text{tol} = 10^{-8}$, at what value of $k$ does the computation stop? When it stops, how many correct significant digits does $s_k$ contain?

1.30. A positive number can be written as $x = d \times 10^{n}$, where $1 \le d < 10$. Assume that this is written in decimal form as $d = d_1.d_2 d_3 \cdots$.
(a) Suppose $x$ is rounded to $p$ digits to obtain $\bar{x}$. Letting $\bar{x} = \bar{d} \times 10^{n}$, explain why (ignoring ties)

$$
\bar{d} = \begin{cases}
d_1.d_2 d_3 \cdots d_p & \text{if } d_{p+1} < 5, \\
d_1.d_2 d_3 \cdots d_p + 10^{-p+1} & \text{if } d_{p+1} > 5.
\end{cases}
$$

(b) Show that

$$
\left| \frac{x - \bar{x}}{x} \right| < 5 \times 10^{-p}.
$$

Additional Questions

1.31. This problem considers ways to compute $x^{n}$, where $n$ is a positive integer.
(a) Compare the total number of flops between computing $x^{n} = x * x * \cdots * x$, and computing

$$
x^{n} = \begin{cases}
y * y * \cdots * y & \text{if } n \text{ is even,} \\
x * y * y * \cdots * y & \text{if } n \text{ is odd,}
\end{cases}
$$

where $y = x^{2}$. As examples of the last formula, $x^{6} = y * y * y$, while $x^{5} = x * y * y$.
(b) Suppose $n = 28$. Explain the connection between the floating-point representation $28 = (1 + 1/2 + 1/2^{2}) \times 2^{4}$ and the factorization

$$
x^{28} = \left(\left(\left(x^{2}\right)^{2}\right)^{2}\right)^{2} * \left(\left(x^{2}\right)^{2}\right)^{2} * \left(x^{2}\right)^{2}.
$$

What is the minimum number of flops required to determine $x^{28}$ using this formula? Note that this procedure is a version of the square-and-multiply algorithm.
(c) Suppose $n = 100$, so its floating-point representation is $\left(1 + \frac{1}{2} + \frac{1}{2^{4}}\right) \times 2^{6}$. Explain how to use the idea in part (b) to calculate $x^{100}$. How does the flop count compare with the two methods in part (a)?
(d) Another approach, assuming $x$ is positive, is to write $x^{n} = e^{n \ln x}$. Based on the values in Table 1.3, what is the approximate flop time for this? How does it compare with the flop times found in parts (b) and (c)?

1.32. This problem considers finding the largest value of the mantissa.
(a) What values of the $b_j$’s in (1.8) produce the largest value of $m$?
(b) Assuming $x \neq 1$, show that

$$
\frac{1}{1 - x} = 1 + x + x^{2} + \cdots + x^{n} + \frac{x^{n+1}}{1 - x}.
$$

(c) Use the result from part (b) to show that the value of $m$ from part (a) can be written as $m = 2 - \varepsilon$. From this show that the largest value of the mantissa is $m = 1 + K\varepsilon$, where $K = 2^{N-1} - 1$.
(d) Use part (c) to explain why the float just to the left of $x = 2$ is $2 - \varepsilon$. Also, explain why the float just to the right of $x = 2$ is $2(1 + \varepsilon)$.

Chapter 2

Solving A Nonlinear Equation

In this chapter, one of the more common mathematical problems is studied, which is to find the solution, or solutions, of an equation of the form $f(x) = 0$. To illustrate the situation, we begin with a few examples.

Example: Bungee Jumping
Apparently, for bungee jumpers, it is more entertaining if the cord is long enough that the jumper comes very close to hitting the ground. So, an important question is, given the person’s weight, how long will the cord stretch? According to the Mooney-Rivlin model for rubber, the weight $W$ of the person and the length $\ell$ the cord stretches are related through the equation
\[ \left(\alpha + \frac{\beta}{x}\right)\left(x^2 - \frac{1}{x}\right) = W, \]
where $x = \ell/\ell_0$, $\ell_0$ is the original length of the cord, and $\alpha$ and $\beta$ are positive constants. This equation must be solved for $x$ to determine the stretched length of the bungee cord.

Example: Atmospheric Warming
In determining how quickly the atmosphere will warm up, or cool down, it is necessary to know how fast airborne particles will drop to the surface of the Earth. Based on experimental data, one of the simpler formulas used to determine the terminal velocity $v$ of a falling particle involves solving the equation:   α √ + 1 = 1, v2 v + β+ v where $\alpha$ and $\beta$ are positive constants.



© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0 2


Example: Locating Solutions
To solve $f(x) = 0$, it helps if you have an idea where the solutions are located. This is typically done by either plotting $f(x)$ using a computer, or else by sketching the function by hand. As an illustration of the latter, suppose the equation to solve is $x^3 - 12x + 5 = 0$. In this case, $f(x) = x^3 - 12x + 5$. To determine the local maximum and minimum points note that $f'(x) = 3x^2 - 12$. So, setting $f'(x) = 0$ gives $x = \pm 2$. Since $f'(x) > 0$ for $-\infty < x < -2$, $f(x)$ is monotonically increasing in this interval from $f(-\infty) = -\infty$ up to $f(-2) > 0$. Similarly, it is decreasing for $-2 < x < 2$, from $f(-2)$ down to $f(2) < 0$, and then monotonically increasing for $2 < x < \infty$ from $f(2) < 0$ up to $f(\infty) = \infty$. This information is enough to produce the curve shown in Figure 2.1. As for the placement of the tick marks on the $x$-axis, it is easy to check that $f(-4) < 0$ while $f(-3) > 0$, which means that $f(x)$ crosses the $x$-axis somewhere in the interval $-4 < x < -3$. A similar check is made for the other two crossings. From the sketch, you then have a rough estimate of where the solutions are located.

Another way to locate the solutions is to rewrite $f(x) = 0$ as $F(x) = G(x)$, then sketch $F(x)$ and $G(x)$ on the same axes. The desired solutions are where the two functions intersect. For example, $x^3 - 12x + 5 = 0$ can be rewritten as $x^3 = 12x - 5$, and it is fairly easy to sketch $F(x) = x^3$ and $G(x) = 12x - 5$. Another example of this approach is given in Example 2.4.
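A quick numerical counterpart to the hand sketch is to evaluate $f$ at a few integers and look for sign changes. The following Python sketch does this for $f(x) = x^3 - 12x + 5$. The helper name `sign_change_intervals` and the scan range $-5 \le x \le 5$ are choices made here for illustration and do not come from the text.

```python
def f(x):
    # Cubic from the example above.
    return x**3 - 12*x + 5

def sign_change_intervals(func, lo, hi):
    """Return the unit intervals (n, n+1), lo <= n < hi, on which func changes sign."""
    return [(n, n + 1) for n in range(lo, hi) if func(n) * func(n + 1) < 0]

brackets = sign_change_intervals(f, -5, 5)
print(brackets)
```

For this cubic the scan reports the intervals $(-4, -3)$, $(0, 1)$, and $(3, 4)$, which agrees with the crossing found by hand. A product test of this kind misses a root that falls exactly on a grid point, as well as a pair of roots inside a single interval, so it complements a sketch rather than replacing it.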

2.1 The Problem to Solve

In this chapter, we will describe methods for finding a solution $\bar{x}$ of the equation $f(x) = 0$, where $f(x)$ is continuous. Two of the methods considered, Newton and secant, will include additional assumptions about $f(x)$. It should be pointed out that when computing a solution of $f(x) = 0$, the floating-point number system will consider any $x$ that satisfies $|f(x)| < x_0/2$ to be a solution, where $x_0$ is the smallest positive float. For example, $e^{-7x} = 0$ has no solution, but using double precision, you get that $e^{-7x} = 0$ for all $x \ge 107$. An easy way to help avoid accepting incorrect answers such as this is to use sketching, or plotting, as discussed in the previous example.

Figure 2.1 Sketch of $f(x) = x^3 - 12x + 5$.

2.2 Bisection Method

The idea is to construct a sequence of points $x_0, x_1, x_2, \cdots$ in such a way that $\lim_{k\to\infty} x_k = \bar{x}$, where $\bar{x}$ is a solution of $f(x) = 0$. How this is going to be done is illustrated in Figure 2.2. An explanation of the steps used in the construction of the sequence is as follows:

Step 0: To get things started it is necessary to find $a_0$ and $b_0$ so that $f(a_0)f(b_0) < 0$. The two points are labeled so that $a_0 < b_0$. With this, you then take $x_0$ to be the midpoint of the interval. So, $x_0 = (b_0 + a_0)/2$.

Comments: Finding $a_0$ and $b_0$ typically involves guessing and using properties of $f(x)$ if possible. The requirement that $f(a_0)f(b_0) < 0$ means that there is at least one solution of $f(x) = 0$ in the interval $a_0 < x < b_0$. This is guaranteed because of the intermediate value theorem. There might be more than one solution in the interval, which is fine, and an example of this situation is considered below.

Step 1: If $f(x_0) \ne 0$, then determine which half of the interval contains a solution. So, if $f(a_0)f(x_0) < 0$, then the new trapping interval is $a_1 = a_0$ and $b_1 = x_0$. Otherwise, take $a_1 = x_0$ and $b_1 = b_0$. With this, take $x_1 = (b_1 + a_1)/2$.

Step 2: If $f(x_1) \ne 0$, then you just redo the subdivision procedure used in Step 1. Based on the situation shown in Figure 2.2, $a_2 = x_1$, $b_2 = b_1$, and $x_2 = (b_2 + a_2)/2$.

Figure 2.2 Schematic of the steps used by the bisection method to locate a solution of $f(x) = 0$.

The next steps simply repeat this procedure. So, at Step $k$, the current interval is split in two and you keep the half that you know contains a solution. With this, you take $x_k$ to be the midpoint of whatever half you kept.

Example 2.1 The function used in this example is shown in Figure 2.3.

1. Carry out Steps 1 and 2 of the bisection method when $a_0 = -1.5$, $b_0 = 2.5$:
Answer:
Step 0: $x_0 = (b_0 + a_0)/2 = 1/2$.
Step 1: Since $f(x_0)f(b_0) < 0$, then $a_1 = x_0 = 1/2$, and $b_1 = b_0 = 2.5$. With this, $x_1 = (b_1 + a_1)/2 = 1.5$.
Step 2: Since $f(1.5) < 0$, then $f(x_1)f(b_1) < 0$. So, $a_2 = x_1 = 1.5$, and $b_2 = b_1 = 2.5$. With this, $x_2 = (b_2 + a_2)/2 = 2$.

2. What solution does the bisection method determine if $a_0 = -1.5$ and $b_0 = 2.5$?
Answer: Since $x_0 = (b_0 + a_0)/2 = 0.5$ and $f(0.5) < 0$, Step 1 will produce an interval only containing the solution near $x = 2$. So, this is the solution the bisection method will end up finding.

The resulting bisection algorithm is given in Table 2.1. It differs slightly from the individual steps described earlier in that the points are not indexed. Instead, the values of $a$, $b$, and $x$ are simply updated with the newest values as the procedure proceeds. The stopping condition involves the relative iterative error $|(x_k - x_{k-1})/x_k|$. This is easily changed to the iterative error $|x_k - x_{k-1}|$ if a complication arises due to a division by zero.

Figure 2.3 Plot of function for Example 2.1.

pick: a < b with f(a)f(b) < 0, and tol > 0
let:  x = (a + b)/2, and err = 10 ∗ tol
loop: while err > tol
          if f(a)f(x) < 0, then b = x
          else if f(b)f(x) < 0, then a = x
          else stop
          end
          X = x
          x = (a + b)/2
          err = |x − X|/|x|
      end

Table 2.1 Algorithm for solving $f(x) = 0$ using the bisection method. The procedure stops when $f(x_k) = 0$ or $|x_k - x_{k-1}|/|x_k| < tol$.
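The algorithm in Table 2.1 translates almost line for line into code. The following is a minimal Python sketch, not a library routine. The function name `bisection`, the returned step counter, and the up-front bracket check are additions made here for illustration. The test function is the cubic $f(x) = (x - 0.1)(x - 2.1)(x + 1.1)$ used for Example 2.1.

```python
def bisection(f, a, b, tol=1e-6):
    """Bisection as in Table 2.1; returns (x, k) with k the index of the last midpoint."""
    if f(a) * f(b) >= 0:
        raise ValueError("need f(a) f(b) < 0 to start")
    x = (a + b) / 2          # x_0
    err = 10 * tol           # any value larger than tol starts the loop
    k = 0
    while err > tol:
        if f(a) * f(x) < 0:
            b = x            # a solution lies in the left half
        elif f(b) * f(x) < 0:
            a = x            # a solution lies in the right half
        else:
            break            # f(x) == 0, so x is a solution
        x_old = x
        x = (a + b) / 2
        k += 1
        err = abs(x - x_old) / abs(x)   # fails if x == 0; use abs(x - x_old) then
    return x, k

def f(x):
    return (x - 0.1) * (x - 2.1) * (x + 1.1)

x, k = bisection(f, -1.5, 2.5, tol=1e-6)
print(k, x)
```

With $a_0 = -1.5$, $b_0 = 2.5$ and $tol = 10^{-6}$ the sketch stops at $k = 20$ with $x_{20} \approx 2.1000004$, the solution near $x = 2$.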

Example 2.1 (cont’d) The function shown in Figure 2.3 is $f(x) = (x - 0.1)(x - 2.1)(x + 1.1)$. Taking $a_0 = -1.5$ and $b_0 = 2.5$, and $tol = 10^{-6}$, the bisection algorithm applied to this function produces the values given in Table 2.2. As expected, it finds the solution near $x = 2$. What is of interest is how the error changes as the method finds the solution. To make this easier, the values of the error $e_k = |x_k - \bar{x}|$ and the relative iterative error $|(x_k - x_{k-1})/x_k|$ are plotted in Figure 2.4. What is seen in Table 2.2 is that the relative iterative error is cut in half with each step once the method gets started. This results in the straight line shown in Figure 2.4. The error $e_k$ shows a similar overall decrease, although it does not follow a straight line in doing so.

  k     x_k                   |x_k − x̄|    |x_k − x_{k−1}|/|x_k|
  1     1.500000000000000     6.0e−01      6.7e−01
  2     2.000000000000000     1.0e−01      2.5e−01
  3     2.250000000000000     1.5e−01      1.1e−01
  ...   ...                   ...          ...
  16    2.100006103515625     6.1e−06      1.5e−05
  17    2.099990844726562     9.2e−06      7.3e−06
  18    2.099998474121094     1.5e−06      3.6e−06
  19    2.100002288818359     2.3e−06      1.8e−06
  20    2.100000381469727     3.8e−07      9.1e−07

Table 2.2 Solving the equation for Example 2.1 using the bisection method. Note that 6.7e−01 = $6.7 \times 10^{-1}$.

Figure 2.4 The error and the relative iterative error for Example 2.1 (logarithmic error axis, plotted against the iteration step).

2.2.1 Convergence Theory

The bisection method has the rare property that you can determine how much computing is necessary to calculate the solution before doing the calculation. To explain, because $x_k$ is the midpoint of an interval $[a_k, b_k]$ that contains a solution, the error satisfies
\[ |\bar{x} - x_k| \le \frac{1}{2}(b_k - a_k). \tag{2.1} \]
Since the length of the interval is cut in half at each step, we have that $b_1 - a_1 = \frac{1}{2}(b_0 - a_0)$, $b_2 - a_2 = \frac{1}{2}(b_1 - a_1) = (b_0 - a_0)/2^2$, $b_3 - a_3 = \frac{1}{2}(b_2 - a_2) = (b_0 - a_0)/2^3$, etc. So, in general
\[ b_k - a_k = \frac{1}{2^k}(b_0 - a_0), \tag{2.2} \]
and this means, from (2.1), that $|\bar{x} - x_k| \le (b_0 - a_0)/2^{k+1}$. A second useful observation is that $x_{k-1}$ is the midpoint of the interval $[a_{k-1}, b_{k-1}]$, and $x_k$ is the midpoint of half of this interval. In other words, $|x_k - x_{k-1}| = \frac{1}{4}(b_{k-1} - a_{k-1})$. So, from (2.2), $|x_k - x_{k-1}| = (b_0 - a_0)/2^{k+1}$. From the above discussion, we have the following result.

Theorem 2.1 If $f \in C[a_0, b_0]$, with $f(a_0)f(b_0) < 0$, then the midpoints $x_0, x_1, x_2, \cdots$ computed using the bisection method converge to a solution $\bar{x}$ of $f(x) = 0$. Moreover, the error satisfies
\[ |x_k - \bar{x}| \le \frac{1}{2^{k+1}}(b_0 - a_0), \tag{2.3} \]
and the iterative error is
\[ |x_k - x_{k-1}| = \frac{1}{2^{k+1}}(b_0 - a_0). \tag{2.4} \]

Note that $f \in C[a_0, b_0]$ is mathematical shorthand for the statement that $f(x)$ is a continuous function for $a_0 \le x \le b_0$ (see Appendix A).

Example 2.2

1. If $a_0 = -1$ and $b_0 = 3$, then how many steps are required to guarantee that the error is no more than $10^{-8}$?
Answer: From (2.3), the error is guaranteed if $2^{-(k+1)}(b_0 - a_0) \le 10^{-8}$. Since $b_0 - a_0 = 4$, then we want $2^{-k+1} \le 10^{-8}$. Consequently, the error is achieved if $k \ge 1 + 8\ln(10)/\ln(2) \approx 27.6$, so it is enough to take $k = 28$.

2. If $a_0 = -2$ and $b_0 = 0$, what is the maximum number of steps the bisection method requires if the stopping condition is $|x_k - x_{k-1}| \le 10^{-6}$?
Answer: From (2.4), and the given values, the procedure stops when $2^{-k} \le 10^{-6}$. In other words, $k \ge 6\ln(10)/\ln(2) \approx 19.9$. Therefore, it will stop when $k = 20$.

3. For the function in Figure 2.3, if $a_0 = -1.5$, $b_0 = 2.5$, and $tol = 10^{-6}$, then approximately how many steps will the algorithm in Table 2.1 require to solve $f(x) = 0$?
Answer: The stopping condition is $|x_k - x_{k-1}|/|x_k| \le tol$. Once $x_k$ gets close to the solution $\bar{x}$, then $|x_k - x_{k-1}|/|x_k| \approx |x_k - x_{k-1}|/|\bar{x}|$. So, using (2.4), we want $2^{-(k+1)}(b_0 - a_0)/|\bar{x}| \le tol$. From Figure 2.3, $\bar{x} \approx 2$. Since $b_0 - a_0 = 4$, then the stopping condition is, approximately, $2^{-k} \le 10^{-6}$. This happens when $k \approx 20$, which agrees with the computed values given in Table 2.2.

Order of Convergence

At Step $k$, the error is $e_k \equiv |x_k - \bar{x}|$. As seen in Figure 2.4, even though on a step-by-step basis the decrease is uneven, overall, it is following a line that is expected of a method in which the error is cut in half at each step. Mathematically, this can be written as $e_k \approx \frac{1}{2} e_{k-1}$. This reduction in the error determines the order of convergence, and because this is key to what follows it requires a more formal introduction.

Definition of the Order of Convergence. Suppose that $\lim_{k\to\infty} x_k = \bar{x}$. Once $x_k$ gets close to $\bar{x}$, if
\[ e_k \approx C e_{k-1}^{\,r}, \tag{2.5} \]
where $C$ is a positive constant, then the order of convergence is $r$.

It is possible to prove that if the order of convergence is $r$, and $\bar{x} \ne 0$, then the relative error $|x_k - \bar{x}|/|\bar{x}|$ and the relative iterative error $|x_k - x_{k-1}|/|x_k|$ also satisfy (2.5), although typically with a different value for $C$.

In this chapter, we consider the bisection method (where $r = 1$ so it is first order), Newton’s method (where $r = 2$ so it is second order), and the secant method (where $r = (1 + \sqrt{5})/2 \approx 1.6$). As demonstrated in the next example, the larger the value of $r$ the faster the method converges, and the improvements are dramatic.

Example 2.3 Suppose (2.5) holds, with $C = 1/2$. If $e_1 = 10^{-1}$, then what is the approximate value of the error $e_4$ for the three methods mentioned above?
Answer:
1. Bisection: Since $r = 1$, then $e_2 \approx e_1/2$, $e_3 \approx e_2/2 \approx e_1/4$, and so $e_4 \approx e_3/2 \approx e_1/8 \approx 10^{-2}$.
2. Newton: Since $r = 2$, then $e_2 \approx e_1^2/2 = 10^{-2}/2$, $e_3 \approx e_2^2/2 \approx 10^{-4}/8$, and so $e_4 \approx e_3^2/2 \approx 10^{-8}/128 \approx 10^{-10}$.
3. Secant: Since $r = (1 + \sqrt{5})/2$, then $e_4 \approx 10^{-6}$.

In the above example, for the bisection method to get the same error as Newton’s method requires $k = 31$. Said another way, with Newton $e_5 \approx 10^{-21}$, while with bisection you only have $e_5 \approx 10^{-3}$. The obvious question at this point is why one would use the bisection method at all. The answer is that it is easier to guarantee that the bisection method will work than to guarantee that Newton’s method will work. This will be explained later once Newton’s method is derived.

Example 2.4 Find all solutions of the equation $e^{-x} = 2 + x$. The values should be computed so they are correct to 6 significant digits.

Figure 2.5 Sketch of the two functions appearing in Example 2.4.

Answer: First, the functions need to be sketched to determine how many solutions there are and their approximate locations. This is done in Figure 2.5, and it is apparent that there is only one solution and it is in the interval $-2 < x < 0$. Letting $f(x) = e^{-x} - (2 + x)$, then $f(-2)f(0) < 0$, so we can take $a_0 = -2$ and $b_0 = 0$. Also, $tol = 10^{-6}$. The computed values are given in Table 2.3. The fact that bisection is a first-order method is evident in the error values, as they decrease by about a factor of 2 at each step.
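Returning to the comparison in Example 2.3, the recursion $e_k \approx C e_{k-1}^{\,r}$ in (2.5) is easy to iterate numerically. The sketch below is an illustration only. The function name `error_after` and its default arguments are assumptions made here, and the recursion is treated as an exact equality, which it is not for a real method.

```python
import math

def error_after(e1, r, C=0.5, k=4):
    """Iterate e_j = C * e_{j-1}**r starting from e_1 and return e_k."""
    e = e1
    for _ in range(k - 1):
        e = C * e**r
    return e

methods = [("bisection", 1.0),
           ("Newton", 2.0),
           ("secant", (1 + math.sqrt(5)) / 2)]
for name, r in methods:
    print(f"{name:10s} r = {r:.3f}   e4 ~ {error_after(0.1, r):.1e}")
```

The three printed values are of size $10^{-2}$, $10^{-10}$, and $10^{-6}$, matching the estimates in Example 2.3.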

2.3 Newton’s Method

We will now consider what is known as Newton’s method. It, like the bisection method, is an iterative procedure that constructs a sequence of points $x_0, x_1, x_2, \cdots$ in such a way that $\lim_{k\to\infty} x_k = \bar{x}$, where $\bar{x}$ is a solution of $f(x) = 0$.

  k     x_k                    |x_k − x_{k−1}|/|x_k|
  1     −0.500000000000000     1.0e+00
  2     −0.250000000000000     1.0e+00
  3     −0.375000000000000     3.3e−01
  ...   ...                    ...
  18    −0.442852020263672     8.6e−06
  19    −0.442853927612305     4.3e−06
  20    −0.442854881286621     2.2e−06
  21    −0.442854404449463     1.1e−06
  22    −0.442854166030884     5.4e−07

Table 2.3 Solving $e^{-x} = 2 + x$ using the bisection method.

Figure 2.6 First step using Newton’s method. Given x0 , the tangent line (the red line) to the curve is used to determine x1 .

It is also sometimes called the tangent line method, and the reason for this is given in the next paragraph.

The geometric explanation for how Newton’s method works is shown in Figures 2.6 and 2.7. To start, one picks $x_0$ and then uses the tangent line to the curve at $x_0$ to locate $x_1$ (Figure 2.6). You then use the tangent line at $x_1$ to locate $x_2$ (Figure 2.7). The other terms $x_3, x_4, \cdots$ are determined in a similar manner.

A more formal derivation of Newton’s method involves using Taylor’s theorem. To explain, suppose that $x_{k-1}$ is known and we want to determine $x_k$. Taylor’s theorem is used to obtain a linear approximation of $f(x)$, for $x$ near $x_{k-1}$. This is given in Appendix A, in equation (A.6), and, from this, we have that
\[ f(x) \approx f(x_{k-1}) + f'(x_{k-1})(x - x_{k-1}). \tag{2.6} \]
We now replace the equation $f(x) = 0$ with the equation $f(x_{k-1}) + f'(x_{k-1})(x - x_{k-1}) = 0$. Solving this, we get
\[ x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}, \quad \text{for } k = 1, 2, 3, \cdots. \tag{2.7} \]

Figure 2.7 Second step using Newton’s method. The tangent line (the red line) to the curve at (x1 , f (x1 )) is used to determine x2 .

pick: x, and tol > 0
let:  err = 3 ∗ tol
loop: while err > tol
          z = f(x)/f′(x)
          x = x − z
          err = |z/x|
      end

Table 2.4 First version of algorithm for solving $f(x) = 0$ using Newton’s method. The procedure stops when $|x_k - x_{k-1}|/|x_k| < tol$.

This formula is Newton’s method, and to use it to solve $f(x) = 0$ requires a start value $x_0$. An algorithm for Newton’s method is given in Table 2.4. It differs slightly from the procedure in (2.7) in that the values of $x$ are not indexed. Instead, the value of $x$ is overwritten as the procedure proceeds. Note that the stopping condition is based on the relative iterative error $|(x_k - x_{k-1})/x_k|$. If a divide by zero occurs, aside from using a different $x_0$ it is also worth checking whether $x = 0$ is a solution. Finally, this version of the algorithm does not account for the various ways Newton’s method can fail, and how this can be done will be considered later (see Table 2.7).

Example 2.5 For the equation $2x^3 - 3x^2 + 3 = 0$, carry out Steps 1 and 2 of Newton’s method when $x_0 = 3/2$.
Answer: Since $f(x) = 2x^3 - 3x^2 + 3$ then $f'(x) = 6x^2 - 6x$. So,
1. Step 1:
\[ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = \frac{3}{2} - \frac{2}{3} = \frac{5}{6}. \]
2. Step 2:
\[ x_2 = x_1 - \frac{f(x_1)}{f'(x_1)} = \frac{5}{6} + \frac{112}{45} = \frac{299}{90}. \]

The other steps are carried out using a computer, and the values are given in Table 2.5. A particularly important observation to make here is that once Newton gets a sense of where the solution is, so $k \ge 7$, then the exponent in the error term doubles at each step. Correspondingly, the number of correct digits is effectively doubled at each step! This is due to the fact that Newton is (usually) a second-order method, and this is discussed in more detail later.

  k     x_k                    |x_k − x_{k−1}|/|x_k|    |x̄ − x_k|/|x̄|
  1     0.833333333333333      8.0e−01                  2.0e+00
  2     3.322222222222223      7.5e−01                  5.1e+00
  3     2.388442483865826      3.9e−01                  4.0e+00
  4     1.728225936956258      3.8e−01                  3.1e+00
  5     1.150397916227964      5.0e−01                  2.4e+00
  6     −0.848111933239367     2.4e+00                  5.2e−02
  7     −0.807921856614232     5.0e−02                  1.8e−03
  8     −0.806445887479259     1.8e−03                  2.4e−06
  9     −0.806443932362200     2.4e−06                  4.3e−12
  10    −0.806443932358772     4.3e−12                  2.2e−16

Table 2.5 Solving $2x^3 - 3x^2 + 3 = 0$ using Newton’s method with $x_0 = 3/2$.
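The first version of the algorithm (Table 2.4) is short enough to write out directly. The following Python sketch is one possible transcription, applied to the equation of Example 2.5. The function name `newton`, the returned step count, and the choice $tol = 10^{-10}$ are assumptions made here for illustration.

```python
def newton(f, df, x, tol=1e-10):
    """First version of Newton's method (Table 2.4); returns (x, number of steps)."""
    err = 3 * tol            # any value larger than tol starts the loop
    k = 0
    while err > tol:
        z = f(x) / df(x)     # fails if df(x) == 0
        x = x - z
        k += 1
        err = abs(z / x)     # relative iterative error; fails if x == 0
    return x, k

def f(x):
    return 2*x**3 - 3*x**2 + 3

def df(x):
    return 6*x**2 - 6*x

root, steps = newton(f, df, 1.5)
print(steps, root)
```

Starting from $x_0 = 3/2$, the iterates should wander past the local max and min before settling on the root near $x = -0.8064$, as in Table 2.5. Like Table 2.4, this version has no protection against runaway iterates or a zero derivative.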

2.3.1 Picking x0

As will be explained in Section 2.3.5, to guarantee that Newton’s method finds a solution $x = \bar{x}$ of $f(x) = 0$, it is required that $f'(\bar{x}) \ne 0$, and the starting point $x_0$ should be close to $\bar{x}$. This raises the question of how to determine $x_0$, other than to simply pick some random number. The usual way this is done is to plot $f(x)$ and then visually locate the solution and pick an $x_0$ close by. To illustrate, the function from Example 2.5 is plotted in Figure 2.8. In that example, $x_0 = 3/2$, but in looking at Figure 2.8, it is evident $x_0 = -1$ would be a better choice. Or, given that the root is about halfway between $x = -1.5$ and $x = -1$, then you might take $x_0 = -1.25$.

Figure 2.8 The function $y = 2x^3 - 3x^2 + 3$ considered in Example 2.5.

As a general rule for picking $x_0$, it should be located so there are no local max or min points in the function between $x_0$ and the solution. This does not mean Newton will fail if you don’t do this, it just reduces the possibility that it will fail. As a case in point, the value of $x_0$ used in Example 2.5 violates this rule. As you can see in Table 2.5, Newton has some difficulty getting past the local max and min points. For example, $x_2$ is farther from the solution than $x_1$. But, once it does get past them, starting with $x_6$, the method converges very quickly.

In lieu of plotting, one can just sketch the function, or functions, involved. To illustrate, the functions involved in Example 2.4 are sketched in Figure 2.5. It is apparent that the solution is in the interval $-2 < x < 0$, and so one might take $x_0 = -1$. This can be improved a bit by noticing the $y$-intercept for $e^{-x}$ is at $y = 1$. If you draw a horizontal line with $y = 1$, it intersects $y = 2 + x$ at $x = -1$. So, the solution must be in the interval $-1 < x < 0$, and so one might take $x_0 = -1/2$.

Finally, it is not unusual to use the bisection method to determine $x_0$. It is fairly easy, by picking random numbers if necessary, to find two points so that $f(a_0)f(b_0) < 0$. You then run the bisection method for a few steps to narrow down your choice for $x_0$. With this, you then switch to Newton’s method, and compute the solution.

Example 2.6 When using Newton’s method it can be unpredictable what solution it will find. This is illustrated in Figure 2.9. In the lower graph, the function is plotted, and it shows that there are four solutions of $f(x) = 0$. The upper graph gives the value of the root calculated using Newton’s method as a function of the starting location. For example, taking a starting value of $x = 5$, the method converges to the root between 5 and 6. In fact, there is a wide range of starting values near $x = 5$ for which Newton’s method produces the same result. The same is true for starting values near the other roots. Where the method appears to produce more unpredictable results is when the starting points are near the local max and min points of the function. As an example, for starting points near $x = 1.5$ it is possible to have Newton’s method converge to the root near $x = -2$ or the one near $x = 5$.

2.3.2 Order of Convergence As pointed out earlier, the error in Table 2.5 is squared in each step once xk gets close to the solution. According to (2.5), this means that r = 2, so Newton’s method is second order (at least for this example). The following is the second example considered earlier for the bisection method.

Figure 2.9 Lower: Plot of $f(x)$. Upper: Solution of $f(x) = 0$ obtained using Newton’s method as a function of the starting value.

Example 2.7 Find the solution of the equation $e^{-x} = 2 + x$. The value should be computed so it is correct to 10 significant digits.
Answer: As explained in Example 2.4, the solution is in the interval $-2 < x < 0$. Taking $f(x) = e^{-x} - (2 + x)$, then (2.7) becomes
\[ x_k = x_{k-1} + \frac{e^{-x_{k-1}} - (2 + x_{k-1})}{e^{-x_{k-1}} + 1}. \]
Because Newton is going to be compared to bisection, we will take $x_0 = b_0 = 0$. Also, given the 10 digits requirement, let $tol = 10^{-10}$. The computed values, along with the relative iterative error, are given in Table 2.6. Compared to the results for the bisection method given in Table 2.3, the values for Newton are impressive. It achieves the requested 10 correct digits at $k = 5$, and at that step, it produces an error that is approaching the limit of double precision. It does appear that the error squares at each step, which should happen for a second-order method. However, unlike the situation for the bisection method, being able to see an actual pattern in the computed values is difficult because the method converges so quickly.

  k    x_k                    |x_k − x_{k−1}|/|x_k|
  1    −0.500000000000000     1.0e+00
  2    −0.443851671995364     1.3e−01
  3    −0.442854703829747     2.3e−03
  4    −0.442854401002417     6.8e−07
  5    −0.442854401002389     6.3e−14

Table 2.6 Solving $e^{-x} = 2 + x$ using Newton’s method.

To restate the observation made in the above example, Newton’s method is second order. As defined in (2.5), this means that it is typical that once $x_k$ gets close to $\bar{x}$, then $e_k \approx C e_{k-1}^2$. A similar conclusion holds for the relative error and the relative iterative error. The consequence of this is that the number of correct digits can double at each step once Newton gets close to the solution. The exact number depends on the value of $C$ and $\bar{x}$. The proof that Newton is second order, and the conditions required for this to hold, are given in Section 2.3.5.

2.3.3 Failure

It is important to be aware that Newton’s method might not work. For example, it is evident in (2.7) that if we ever get $f'(x_{k-1}) = 0$ then the method fails. There are other potential problems, and one is illustrated in the next example.

Example 2.8 Suppose we use Newton’s method to solve $x/(1 + x^2) = 0$. The graph of the function is shown in Figure 2.10, as well as the tangent line when you take $x_0 = 2$. In this case, the next point $x_1$ is farther away from the solution. In fact, if you keep using Newton’s method you would find that $x_k \to \infty$. Also, note that Newton’s method can be used to solve this equation, it is just that you need to pick $x_0$ near the solution. For example, if $x_0 = 1/2$, then the method will work just fine.

Figure 2.10 First step using Newton’s method. The solid curve is $y = x/(1 + x^2)$, and the dashed line is the tangent line used by Newton’s method when $x_0 = 2$.

Other examples of how, or when, Newton’s method fails are given in Exercises 2.26 and 2.27.

The algorithm for Newton’s method should be revised so the calculation is stopped if the runaway situation shown in Figure 2.10 occurs. One way to do this is to put a bound on the value of $x_k$. For example, one picks a relatively large value $M$ and if it ever happens that $|x_k| > M$, then it is assumed that runaway is occurring and the calculation is stopped. Another possibility is to simply limit the number of iteration steps to a number $K$. Note that when Newton’s method does work it converges very quickly, so $K$ does not need to be very large (e.g., $K = 20$). A revised algorithm for Newton’s method incorporating these changes is given in Table 2.7, which includes a step that exits the program and reports a failure.

2.3.4 Examples The examples below illustrate how Newton’s method can be used to derive general formulas to solve particular problems.

pick: x, M > 0, K > 0, and tol > 0
let:  err = 3 ∗ tol
      k = 0
loop: while err > tol
          z = f(x)/f′(x)
          x = x − z
          k = k + 1
          err = |z/x|
          if |x| > M or k > K, then error exit
      end

Table 2.7 Revised algorithm for solving $f(x) = 0$ using Newton’s method. The procedure stops when $|x_{k+1} - x_k|/|x_k| < tol$, or when the method appears to fail.
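The revised algorithm in Table 2.7 can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation. The name `newton_safe`, the default values $M = 10^6$ and $K = 20$, and the use of an exception for the "error exit" are choices made here. The two test problems are those of Examples 2.7 and 2.8.

```python
import math

def newton_safe(f, df, x, tol=1e-10, M=1e6, K=20):
    """Newton's method with the runaway (M) and step-count (K) checks of Table 2.7."""
    err = 3 * tol
    k = 0
    while err > tol:
        z = f(x) / df(x)
        x = x - z
        k += 1
        err = abs(z / x)
        if abs(x) > M or k > K:
            raise RuntimeError(f"Newton appears to fail: k = {k}, x = {x}")
    return x, k

# Example 2.7: f(x) = exp(-x) - (2 + x), starting from x0 = 0.
root, steps = newton_safe(lambda x: math.exp(-x) - (2 + x),
                          lambda x: -math.exp(-x) - 1, 0.0)

# Example 2.8: f(x) = x / (1 + x^2) with x0 = 2; the iterates run away.
g = lambda x: x / (1 + x**2)
dg = lambda x: (1 - x**2) / (1 + x**2)**2
try:
    newton_safe(g, dg, 2.0)
    runaway_detected = False
except RuntimeError:
    runaway_detected = True

print(root, steps, runaway_detected)
```

One caveat: the solution in Example 2.8 is $\bar{x} = 0$, so the relative stopping test $|z/x|$ is not a good fit for that equation even when $x_0 = 1/2$. For that start the iterates do converge, but the test does not become small, and switching to the iterative error $|z|$ would be the natural change.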

Example: Division
Newton’s method can be used to derive algorithms for calculating mathematical expressions using more elementary operations. As an example, Newton’s method can be used to perform division using only subtraction and multiplication. To explain, given $\alpha > 0$, suppose we want to compute $1/\alpha$. In other words, we want to find the value of $x$ which satisfies $x = 1/\alpha$ (without actually doing division). One might try rewriting the equation as $\alpha x - 1 = 0$, or as $\alpha^2 x^2 - 1 = 0$, but neither works. For example, using $f(x) = \alpha^2 x^2 - 1$, then (2.7) becomes
\[ x_k = \frac{1}{2} x_{k-1} + \frac{1}{2\alpha^2 x_{k-1}}, \]
and this requires a division. A choice that does work is to rewrite the equation as $\alpha = 1/x$, so $f(x) = \alpha - 1/x$. In this case, (2.7) becomes
\[ x_k = x_{k-1}(2 - \alpha x_{k-1}), \]
which consists of two multiplications and one subtraction (per iteration). This procedure has been used by more than one computer system to perform floating-point division. It is also possible to use Newton to compute $\sqrt{\alpha}$ using addition, subtraction, multiplication, and division (Exercise 2.21), and to do something similar for $\alpha^{1/3}$ (Exercise 2.22).
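The division-free iteration $x_k = x_{k-1}(2 - \alpha x_{k-1})$ can be tried out in a few lines. In the Python sketch below, the function name `reciprocal`, the fixed number of steps, and the test values $\alpha = 7$, $x_0 = 0.1$ are assumptions made for illustration. The iteration gives $1 - \alpha x_k = (1 - \alpha x_{k-1})^2$, so the start value needs to satisfy $0 < x_0 < 2/\alpha$ for convergence.

```python
def reciprocal(alpha, x0, steps=6):
    """Approximate 1/alpha using only multiplication and subtraction."""
    x = x0
    for _ in range(steps):
        x = x * (2 - alpha * x)   # Newton step for f(x) = alpha - 1/x
    return x

approx = reciprocal(7.0, 0.1)
print(approx)
```

With $\alpha = 7$ and $x_0 = 0.1$ the quantity $1 - \alpha x_k$ starts at $0.3$ and is squared at every step, so six steps are enough to reach double precision. A start value outside $0 < x_0 < 2/\alpha$ makes the iterates diverge.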

Example: Implicit Function
The mathematical problem to be considered is to determine the value of a function $y(x)$ from an equation of the form $F(x, y) = 0$. Examples of this are $y^3 + x = \ln y$, and
\[ y + x + 2 = e^{\sin(x) - y}. \tag{2.8} \]
It is relatively easy to use Newton’s method to solve $F(x, y) = 0$ if it is understood that $x$ is fixed, or given, and we are solving the equation for $y$. The resulting iteration scheme is: after picking $y_0$, then
\[ y_k = y_{k-1} - \frac{F(x, y_{k-1})}{F_y(x, y_{k-1})}, \tag{2.9} \]
where $F_y$ is the partial derivative of $F$ with respect to $y$. As an example, for (2.8), we get that
\[ y_k = y_{k-1} - \frac{y_{k-1} + x + 2 - e^{\sin(x) - y_{k-1}}}{1 + e^{\sin(x) - y_{k-1}}}. \tag{2.10} \]
You can pick $y_0$ using any of the ideas discussed in Section 2.3.1. However, something new occurs when dealing with an implicit function. Namely, it is typical that you want to determine $y(x)$ for $a \le x \le b$. This is done by using grid points on the $x$-axis given as $x_1 < x_2 < x_3 < \cdots < x_n$, where $x_1 = a$ and $x_n = b$. The objective is to then use Newton to evaluate $y(x_j)$ at each point $x_j$. Now, if you have just computed $y(x_{j-1})$, then a good candidate for $y_0$ at $x_j$ is just $y_0 = y(x_{j-1})$. This means you only need to come up with a good guess for $y_0$ when $x = a$. This was done to construct the plot in Figure 2.11. In this case, (2.8) was solved for $0 \le x \le 8$ using 20 equally spaced points in the interval, and taking $y_0 = 0$ at $x = 0$. As you can see, using $y_0 = y(x_{j-1})$ when computing $y(x_j)$ is a good approximation for the value of $y(x_j)$.

Figure 2.11 The plot of the solution of (2.8) obtained using Newton’s method.
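The grid-by-grid procedure just described, with the previous value $y(x_{j-1})$ reused as the start value at $x_j$, might be sketched in Python as follows. The function names `solve_y` and `trace_curve`, the step limit `K`, and the form of the stopping test are assumptions introduced here. The iteration itself is (2.10), and the grid is the 20 equally spaced points on $0 \le x \le 8$ mentioned in the text.

```python
import math

def solve_y(x, y, tol=1e-10, K=20):
    """Newton iteration (2.10) for y at a fixed x, starting from the guess y."""
    for _ in range(K):
        g = math.exp(math.sin(x) - y)
        z = (y + x + 2 - g) / (1 + g)    # F / F_y
        y -= z
        if abs(z) <= tol * max(1.0, abs(y)):
            return y
    raise RuntimeError("Newton did not converge")

def trace_curve(a=0.0, b=8.0, n=20, y_start=0.0):
    """Evaluate y(x_j) on n equally spaced points, reusing y(x_{j-1}) as the guess."""
    xs = [a + j * (b - a) / (n - 1) for j in range(n)]
    ys = []
    y = y_start
    for x in xs:
        y = solve_y(x, y)
        ys.append(y)
    return xs, ys

xs, ys = trace_curve()
for x, y in zip(xs, ys):
    print(f"{x:6.3f}  {y: .6f}")
```

At $x = 0$ equation (2.8) reduces to $e^{-y} = 2 + y$, so the first computed value should agree with the solution found in Example 2.7.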

2.3.5 Convergence Theorem

To guarantee that Newton’s method works you need to pick a starting point near the solution and you also need to require that $f'(\bar{x}) \ne 0$. It is possible to state this more formally, and this is done next.

Theorem 2.2 Assume $f \in C^2(a, b)$, with $a < \bar{x} < b$ and $f'(x) \ne 0$ for $a < x < b$. In this case, for $x_0$ chosen close to $\bar{x}$, Newton’s method, as given in (2.7), will converge to $\bar{x}$. Moreover, if $f''(\bar{x}) \ne 0$, then for $x_k$ close to $\bar{x}$
\[ |x_k - \bar{x}| \approx C |x_{k-1} - \bar{x}|^2, \tag{2.11} \]
and
\[ |x_k - x_{k-1}| \approx C |x_{k-1} - x_{k-2}|^2, \tag{2.12} \]
where
\[ C = \left| \frac{f''(\bar{x})}{2 f'(\bar{x})} \right|. \tag{2.13} \]
In other words, the method is second order.

Outline of Proof: The requirement that $x_0$ is close to $\bar{x}$, and Taylor’s theorem, are the keys to proving this. Although the discussion to follow contains many of the steps of the proof, the objective is to explain how (2.11) is obtained. Setting $e_k = x_k - \bar{x}$, then from (2.7) we have that $e_{k+1} = e_k - f(x_k)/f'(x_k)$. Since $x_k = \bar{x} + e_k$, then, using Taylor’s theorem, we have that
\[ f(x_k) = f(\bar{x} + e_k) = f(\bar{x}) + e_k f'(\bar{x}) + \frac{1}{2} e_k^2 f''(\bar{x}) + \cdots, \]
and
\[ f'(x_k) = f'(\bar{x} + e_k) = f'(\bar{x}) + e_k f''(\bar{x}) + \cdots. \]
In these expansions, we are assuming that $e_k$ is small, and this is based on the stated hypothesis that $x_0$ is close to $\bar{x}$. Since $f(\bar{x}) = 0$, and setting $C = f''(\bar{x})/(2f'(\bar{x}))$, we have that
\[ \frac{f(x_k)}{f'(x_k)} = \frac{e_k f'(\bar{x}) + \frac{1}{2} e_k^2 f''(\bar{x}) + \cdots}{f'(\bar{x}) + e_k f''(\bar{x}) + \cdots} = e_k \, \frac{1 + C e_k + \cdots}{1 + 2 e_k C + \cdots}. \tag{2.14} \]
Note that, by assumption, $C \ne 0$ (what happens if it is zero is explained below). Recall that if $y$ is close to zero, then (see Section A.1)
\[ \frac{1}{1 + y} = 1 - y + y^2 - y^3 + \cdots. \]
This applies to the denominator in (2.14) with $y = 2 C e_k + \cdots$, and so
\[ \frac{f(x_k)}{f'(x_k)} = e_k (1 + C e_k + \cdots)(1 - 2 C e_k + \cdots) \approx e_k - e_k^2 C. \tag{2.15} \]
With this, $e_{k+1} = e_k - f(x_k)/f'(x_k)$ can be rewritten as $e_{k+1} \approx C e_k^2$. The formula for the iterative error can be derived in a similar manner.

To obtain (2.11) and (2.12), it was assumed $f''(\bar{x}) \ne 0$. In the case when $f''(\bar{x}) = 0$, Newton’s method will converge faster than second order. How this happens is explained in Exercise 2.28(a). Another special case arises when $f'(\bar{x}) = 0$, which means that $\bar{x}$ is a multiple root. This happens, for example, when solving $x^2 = 0$ or $\sin^4(x) = 0$. As explained in Exercise 2.28(b), in such cases Newton reverts to being a first-order method. It is not typical that $f'(\bar{x}) = 0$, or that $f''(\bar{x}) = 0$, and for most problems, the convergence is second order.

50

2 Solving A Nonlinear Equation

2.4 Secant Method

One of the complications with using Newton's method is that it requires computing the first derivative. The secant method avoids this. The penalty for avoiding using $f'(x)$ is that the order of convergence is reduced, although not by much. The idea comes directly from calculus, where a secant line is used to introduce the idea of a tangent line to a curve. Namely, given $x_0$, one picks a nearby point $x_1$ and then draws the line passing through $(x_0, f(x_0))$ and $(x_1, f(x_1))$. This is illustrated in Figure 2.12. The formula for the secant line in this case is

$$ y = f(x_0) + m_0(x - x_0), \tag{2.16} $$

where the slope is

$$ m_0 = \frac{f(x_1) - f(x_0)}{x_1 - x_0}. \tag{2.17} $$

Using a similar approach as in Newton's method, we now replace $f(x) = 0$ with $f(x_0) + m_0(x - x_0) = 0$, and solve to find

$$ x_2 = x_0 - \frac{f(x_0)}{m_0}, $$

or equivalently

$$ x_2 = x_0 - \frac{f(x_0)(x_1 - x_0)}{f(x_1) - f(x_0)}. $$

It is now possible to construct a new secant line using $x_1$ and $x_2$, and then determine the next approximation $x_3$. It is found that

$$ x_3 = x_1 - \frac{f(x_1)(x_2 - x_1)}{f(x_2) - f(x_1)}. $$

Generalizing this formula, we have that

$$ x_k = x_{k-2} - \frac{f(x_{k-2})(x_{k-1} - x_{k-2})}{f(x_{k-1}) - f(x_{k-2})}. $$

It is not hard to show that this can be rewritten as

$$ x_k = x_{k-1} - \frac{f(x_{k-1})(x_{k-1} - x_{k-2})}{f(x_{k-1}) - f(x_{k-2})}, \quad \text{for } k = 2, 3, 4, \cdots. \tag{2.18} $$

Figure 2.12 First step using the secant method. Given $x_0$ and $x_1$, the secant line (the red line) is used to determine $x_2$.

This formula is known as the secant method. To use it, it is necessary to provide two starting values, $x_0$ and $x_1$. An algorithm for the secant method is given in Table 2.8. It differs slightly from (2.18) in that the values of $x$ are not indexed. Also, as with Newton's method, the constants $M$ and $K$ are used to stop the procedure if the method appears to be failing.

Example 2.9 Find the solution of the equation $e^{-x} = 2 + x$. The value should be computed so it is correct to 10 significant digits.

Answer: Taking $f(x) = e^{-x} - (2 + x)$, then (2.18) becomes

$$ x_k = x_{k-1} - \frac{(e^{-x_{k-1}} - 2 - x_{k-1})(x_{k-1} - x_{k-2})}{e^{-x_{k-1}} - e^{-x_{k-2}} - x_{k-1} + x_{k-2}}. $$

pick:  x, X, M > 0, K > 0, and tol > 0
let:   err = 3 ∗ tol
       k = 0
loop:  while err > tol
           z = f(x)(x − X)/(f(x) − f(X))
           X = x
           x = x − z
           k = k + 1
           err = |z/x|
           if |x| > M or k > K, then error exit
       end

Table 2.8 Algorithm for solving $f(x) = 0$ using the secant method. The procedure stops when $|x_{k+1} - x_k|/|x_k| <$ tol, or when the method appears to fail.
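The algorithm in Table 2.8 translates almost line for line into code. The following Python version is a sketch under stated assumptions rather than the book's own implementation: the function name `secant`, the default values for `M` and `K`, and the choice to raise an exception for the "error exit" are all illustrative. Here `x_prev` plays the role of $X$ (the older point) and `x` is the most recent one.

```python
import math

def secant(f, x_prev, x, tol=1e-10, M=1e8, K=100):
    """Secant method following Table 2.8; returns (approximate root, steps taken)."""
    err = 3 * tol
    k = 0
    while err > tol:
        # Correction obtained from the secant line through the two latest points.
        z = f(x) * (x - x_prev) / (f(x) - f(x_prev))
        x_prev = x
        x = x - z
        k += 1
        err = abs(z / x)          # relative iterative error
        if abs(x) > M or k > K:   # safeguards from Table 2.8
            raise RuntimeError("secant method appears to be failing")
    return x, k

# Example 2.9: solve exp(-x) = 2 + x with x0 = -2 and x1 = 0.
def f(x):
    return math.exp(-x) - (2 + x)

root, steps = secant(f, -2.0, 0.0)
print(root, steps)
```

As in the table, the sketch does not guard against $f(x) = f(X)$ or a computed iterate of exactly zero; a production code would need to handle both.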


$k$   $x_k$                  $|x_k - x_{k-1}|/|x_k|$   $|\bar{x} - x_k|/|\bar{x}|$
2     −0.238405844044235     1.0e+00                   4.6e−01
3     −0.469644870324350     4.9e−01                   6.0e−02
4     −0.441196816846734     6.4e−02                   3.7e−03
5     −0.442840870720375     3.7e−03                   3.1e−05
6     −0.442854407830573     3.1e−05                   1.5e−08
7     −0.442854401002360     1.5e−08                   6.4e−14
8     −0.442854401002389     6.4e−14                   0.0e+00

Table 2.9 Solving $e^{-x} = 2 + x$ using the secant method.

As explained in Example 2.4, the solution is in the interval $-2 < x < 0$. So, for starting points, let $x_0 = -2$ and $x_1 = 0$. Setting tol $= 10^{-10}$, the computed values, along with the relative iterative error and relative error, are given in Table 2.9. Not surprisingly, when comparing this with Table 2.3, the secant method is seen to be much faster than bisection. In comparison to Newton's method (see Table 2.6), it is slower but not significantly slower.

The order of convergence can be determined by looking at the exponents (the powers of 10) in the error. They form a Fibonacci sequence (e.g., 1 + 2 = 3, 2 + 3 = 5, 3 + 5 = 8, etc.). Kepler showed back in 1611 that the ratio of successive terms of a Fibonacci sequence approaches the golden ratio $(1 + \sqrt{5})/2 \approx 1.618$. This is, consequently, the order of convergence of the secant method. The formal proof that it is the golden ratio is a bit more involved and it is outlined in Exercise 2.30.

A comparison of the bisection, Newton's, and the secant methods for this example is shown in Figure 2.13 using the relative iterative error. Newton clearly converges faster than the other two, although the secant method is not far behind Newton as compared to the bisection method.

Figure 2.13 The relative iterative error for Example 2.1 when using the bisection, Newton's, and the secant methods. (Vertical axis: error, on a log scale from $10^0$ down to $10^{-16}$; horizontal axis: $k = 1, 2, \ldots, 12$; curves: Bisect, Secant, Newton.)
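The observation about the order of convergence can also be checked directly from the numbers in Table 2.9. If $|e_{k+1}| \approx D|e_k|^r$, then $r \approx \log(e_{k+1}/e_k)/\log(e_k/e_{k-1})$. The sketch below applies this estimate to the relative errors copied from the last column of the table (for $k = 2, \ldots, 7$); the function name `estimate_orders` is illustrative, and because the table values have only two digits, the estimates are rough.

```python
import math

# Relative errors |x_bar - x_k|/|x_bar| for k = 2,...,7, copied from Table 2.9.
rel_err = [4.6e-01, 6.0e-02, 3.7e-03, 3.1e-05, 1.5e-08, 6.4e-14]

def estimate_orders(errors):
    """Estimate r from three consecutive errors, assuming e_{k+1} ~ D * e_k**r."""
    orders = []
    for e_prev, e_curr, e_next in zip(errors, errors[1:], errors[2:]):
        orders.append(math.log(e_next / e_curr) / math.log(e_curr / e_prev))
    return orders

orders = estimate_orders(rel_err)
golden_ratio = (1 + math.sqrt(5)) / 2

# The first estimate uses k = 2, 3, 4, so it is labeled with the middle index.
for k, r in zip(range(3, 7), orders):
    print(f"k = {k}: estimated order {r:.3f}")
print(f"golden ratio: {golden_ratio:.3f}")
```

The early estimates wander, which is expected since the relation only holds once $x_k$ is close to the solution; the later ones settle near the golden ratio.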


The observation made in the above example about the error and a Fibonacci sequence comes from the fact that once $x_k$ gets close to the solution, the error for the secant method satisfies $e_k \approx Ce_{k-1}e_{k-2}$ (see Exercise 2.30). A consequence of this is that the number of correct significant digits at step $k$ is the sum of the correct digits at the two previous steps. In comparison, for Newton, the number at step $k$ can be twice the number at the previous step. As is always the case when discussing the number of correct digits, the exact number depends on the value of $C$ and $\bar{x}$.

The question arises about recommendations for picking $x_0$ and $x_1$. The secant method was introduced by noting that, if these points are close together, then the secant line serves as an approximation of the tangent. However, having $x_0$ and $x_1$ close together is not required for the method to work. As with Newton, it is usually best to avoid having a local max or min point between the solution and either $x_0$ or $x_1$. It can also help if the solution is between $x_0$ and $x_1$, although this is not required.
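The statement above about correct digits adding for the secant method, and doubling for Newton, follows from taking logarithms of the two error relations. As a rough sketch, let $d_k = -\log_{10}|e_k|$ serve as an informal measure of the number of correct digits (this notation is introduced here only for illustration, and the constant $C$ is not necessarily the same in the two relations):

```latex
% Secant: |e_k| \approx |C| |e_{k-1}| |e_{k-2}|
d_k \approx d_{k-1} + d_{k-2} - \log_{10}|C|
% Newton: |e_k| \approx |C| |e_{k-1}|^2
d_k \approx 2 d_{k-1} - \log_{10}|C|
```

The term $\log_{10}|C|$ is the reason the digit count is only approximately a sum (or a doubling), and working with the relative error introduces a similar dependence on $\bar{x}$.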

2.4.1 Convergence Theorem

Similar to what occurs with Newton's method, to guarantee that the secant method works you need to pick starting points near the solution. The other assumptions are the same as for Newton.

Theorem 2.3 Assume $f \in C^2(a, b)$, with $a < \bar{x} < b$ and $f'(x) \neq 0$ for $a < x < b$. In this case, for $x_0$ and $x_1$ chosen close to $\bar{x}$, the secant method, as given in (2.18), will converge to $\bar{x}$. Moreover, if $f''(\bar{x}) \neq 0$, then for $x_k$ close to $\bar{x}$

$$ |x_{k+1} - \bar{x}| \approx D|x_k - \bar{x}|^r, \tag{2.19} $$

and

$$ |x_k - x_{k-1}| \approx D|x_{k-1} - x_{k-2}|^r, \tag{2.20} $$

where $r = (1 + \sqrt{5})/2$ and

$$ D = \left| \frac{f''(\bar{x})}{2f'(\bar{x})} \right|^{r-1}. \tag{2.21} $$

An outline of the proof, which explains where this particular value of r comes from, is provided in Exercise 2.30. Because the rate of convergence is better than linear, which corresponds to r = 1, but not as fast as quadratic, which corresponds to r = 2, the secant method is said to be superlinear.
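A quick numerical check of (2.19) and (2.21) is possible for Example 2.9. The sketch below is illustrative only: it generates the secant iterates $x_2, \ldots, x_8$ from (2.18), uses $x_8$ as a stand-in for $\bar{x}$ (an assumption, justified by the last row of Table 2.9), and compares $|x_{k+1} - \bar{x}|$ with $D|x_k - \bar{x}|^r$.

```python
import math

def f(x):
    return math.exp(-x) - (2 + x)

def df(x):
    return -math.exp(-x) - 1

def d2f(x):
    return math.exp(-x)

# Secant iterates x0, x1, ..., x8 computed from (2.18).
xs = [-2.0, 0.0]
for k in range(2, 9):
    x1, x0 = xs[-1], xs[-2]
    xs.append(x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0)))

x_bar = xs[-1]                      # stand-in for the exact solution
r = (1 + math.sqrt(5)) / 2
D = abs(d2f(x_bar) / (2 * df(x_bar))) ** (r - 1)

errs = [abs(x - x_bar) for x in xs]
# Ratio of the two sides of (2.19); values near 1 support the theorem.
ratios = {k: errs[k + 1] / (D * errs[k] ** r) for k in range(4, 7)}
for k, q in ratios.items():
    print(f"k = {k}: |e_(k+1)| / (D |e_k|^r) = {q:.3f}")
```

The ratios are only approximately one because (2.19) is an asymptotic statement, and the last ratio is also affected by rounding since $|x_7 - \bar{x}|$ is within a few hundred units of machine precision.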


2.5 Other Ideas

The problem of solving $f(x) = 0$ is so old, and so simple to state, that many methods have been derived for solving it. For those curious about other possibilities, Traub [1982] discusses over 40 different methods. What is somewhat surprising is that research papers are still being published in this area, and those interested in this should consult Wilkins and Gu [2013] or Cordero et al. [2015].

One method of recent vintage that is particularly interesting uses what is called Chebyshev interpolation. The idea is to replace $f(x)$ with an interpolation polynomial (see Section 5.5.4), and then find its roots by solving an eigenvalue problem. In contrast to the methods considered in this chapter, the Chebyshev interpolation approach has the capability of simultaneously determining all of the roots in a given interval. More about this can be found in Boyd [2014].

Those who write general-purpose codes tend to use a hybrid approach, which means they use a combination of methods. As an example, one could start with the bisection method to get close to the root, and then switch to the secant or Newton's method to speed things up. A well-known variation of this is Brent's method, which is what MATLAB uses for the fzero command.

There is also the question of how to solve problems with more than one equation. The easiest to extend is Newton's method, which uses the multi-dimensional version of Taylor's theorem, and this will be considered in Section 3.10. The bisection method is more difficult to extend because the idea of a trapping interval does not work in multi-dimensions. There are multi-dimensional analogues of the secant method, and one example is the Nelder-Mead algorithm described in Section 9.6.
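The hybrid idea mentioned in this section, starting with bisection and then switching to a faster method, can be sketched in a few lines. This is a simplified illustration and not Brent's method: the function name `hybrid_solve`, the switching width `coarse`, and the stopping rule are all assumptions made for the example.

```python
import math

def hybrid_solve(f, a, b, coarse=1e-2, tol=1e-12, max_steps=50):
    """Bisection until b - a <= coarse, then the secant method from a and b."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")

    # Phase 1: bisection shrinks the trapping interval.
    while b - a > coarse:
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0:
            return c
        if fa * fc < 0:
            b, fb = c, fc
        else:
            a, fa = c, fc

    # Phase 2: secant iteration (2.18) started from the interval endpoints.
    x_prev, f_prev, x, fx = a, fa, b, fb
    for _ in range(max_steps):
        if fx == 0.0:
            return x
        denom = fx - f_prev
        if denom == 0.0:
            raise RuntimeError("secant step undefined: equal function values")
        z = fx * (x - x_prev) / denom
        x_prev, f_prev = x, fx
        x = x - z
        fx = f(x)
        if abs(z) <= tol * abs(x):
            return x
    raise RuntimeError("no convergence within max_steps")

# Same test problem as Example 2.9.
root = hybrid_solve(lambda x: math.exp(-x) - (2 + x), -2.0, 0.0)
print(root)
```

Unlike Brent's method, this sketch never returns to bisection if the secant phase misbehaves, so it keeps the speed of the secant method but not the guarantee of the trapping interval.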

2.5.1 Is Newton's Method Really Newton's Method?

Given that the quintessential numerical procedure is Newton's method, it is worth knowing if the name is appropriate. To set the stage, in the late 1600s, one of the central research problems concerned computing the roots of polynomials. An example used by Newton is the equation $x^3 - 2x - 5 = 0$. Knowing the solution is near $x = 2$, the approach was to write $x = 2 + z$, and then consider $(2 + z)^3 - 2(2 + z) - 5 = 0$. Assuming $z$ is small, so the $z^3$ and $z^2$ terms can be dropped, the conclusion is $z \approx 1/10$. How to improve this approximation was the problem considered at the time. Newton's idea, which was published in 1685, was to write $z = 0.1 + u$, substitute this into the equation for $z$, and then repeat the earlier argument to find $u$. Another proposal, made by Joseph Raphson in about 1690, was to write $x = 2.1 + u$ and then substitute this into the original equation for $x$ [Raphson, 1690]. Mathematically, these methods are equivalent, and one gets the same value for $u$. However, because using the original equation is easier, it allowed Raphson


to make a critical observation, which is that his method for solving $x^3 - 2x - 5 = 0$ can be written as

$$ x = y - \frac{y^3 - 2y - 5}{3y^2 - 2}, $$

where y is the previously calculated value for x. This is exactly what Newton’s method, as given in (2.7), states should be done, where x0 = 2. However, the calculus had not yet been developed, and as far as Raphson was concerned this was simply an algebraic approximation that worked. For this reason, he never extended the method to non-polynomial equations, even after the calculus was known [Kollerstrom, 1992; Ypma, 1995]. This has caused some historians to look elsewhere for the person responsible for (2.7). One candidate is Thomas Simpson, who published a book entitled “Essays on Several Curious and Useful Subjects in Speculative and Mix’d Mathematicks, Illustrated by a Variety of Examples” in which (2.7) does indeed appear. The case for Simpson is made very strongly by Kollerstrom [1992]. However, Simpson’s book was published in 1740, which means it took over 40 years to make this connection. This is a long time, particularly for someone of Newton’s capability. One argument made in support of Newton is that he used (2.7) to solve Kepler’s equation x − e sin x = M before 1713 (when it appeared in the second edition of the Principia). Also, there are one or more letters, which were written in 1692, in which he discusses derivatives of equations [Wallis, 1699]. However, the Kepler solution can be obtained without using calculus [Ypma, 1995], and the letters are lost [Kollerstrom, 1992]. In the end, the real reason it is referred to as Newton’s method is attributed to Fourier [1831], who simply said it was Newton’s method (even if it might not be exclusively Newton’s method).

Exercises

Sections 2.2–2.4

2.1. The problem is to find a solution of $f(x) = 0$, where $f(x)$ is shown in Figure 2.14.
(a) Which solution will the bisection method converge to if $a_0 = -3$, $b_0 = 0$? What if $a_0 = -2$, $b_0 = 3$?
(b) Suppose the objective is to have the error satisfy $|x_k - \bar{x}| \le 10^{-4}$. Using the bisection method, what value of $k$ will guarantee this holds if $a_0 = -3$, $b_0 = 0$?
(c) If $a_0 = 1$, $b_0 = 2$, and tol $= 10^{-8}$, then approximately how many steps will the algorithm in Table 2.1 require to solve $f(x) = 0$?
(d) Suppose Newton's method is used with $x_0 = 2$. Approximately, where is $x_1$ in Figure 2.14?

Figure 2.14 Plot of function for Exercise 2.1. (Axes: $f(x)$ versus $x$, for $-3 \le x \le 3$.)

(e) Redo part (d) but use $x_0 = -2$.
(f) Suppose the secant method is used with $x_0 = -2$ and $x_1 = 1$. Approximately, where is $x_2$ in Figure 2.14? Approximately, where is $x_3$?
(g) Redo part (f) but interchange the labels, so $x_0 = 1$ and $x_1 = -2$.
(h) Suppose the secant method is used with $x_0 = 3$. Where should $x_1$ be, approximately, so that $x_2$ is the solution just to the left of $x = 0$?

2.2. Suppose that $f(x) = 0$ is rewritten as $F(x) = G(x)$, where $F(x)$ and $G(x)$ are plotted in Figure 2.15 (this is similar to the situation in Figure 2.5).
(a) How many solutions of $f(x) = 0$ are there, and what are their approximate locations in Figure 2.15?
(b) Explain why the bisection method will not work if $a_0 = -3$, $b_0 = 0$.
(c) What solution will the bisection method converge to if $a_0 = -3$, $b_0 = 1$? What if $a_0 = -2$, $b_0 = 3$?
(d) If $a_0 = 1$, $b_0 = 3$, and tol $= 10^{-8}$, then approximately how many steps will the algorithm in Table 2.1 require to solve $f(x) = 0$?
(e) Suppose the secant method is used with $x_0 = 0$ and $x_1 = 2$. Approximately, where is $x_2$ in Figure 2.15?

Figure 2.15 Plot of function for Exercise 2.2. (Axes: $y$ versus $x$, for $-3 \le x \le 3$.)

(f) Rewrite $x \ln x - 1 = 0$ as $F(x) = G(x)$, and then sketch these functions as in Figure 2.15. Use this to show there is one solution $\bar{x}$ and determine $a$ and $b$ so that $a < \bar{x} < b$.
(g) Redo part (f) for the equation $\sin(x) + 2 - x = 0$.

2.3.
(a) For given values of tol, $a_0$, and $b_0$, will the bisection method use fewer steps to solve $x - 2 = 0$ than for solving $e^{x-2} - 1 = 0$?
(b) If $a_0 = 0$ and $b_0 = 3$, then what is the maximum number of steps the bisection method will require to solve $x - 1 = 0$ and have at least six correct digits?
(c) Answer part (b) for the equation $(x - 1)^3 = 0$.

2.4. A criticism of the bisection method is that it does not use the values of $f(a_k)$ and $f(b_k)$, only their signs. A way to use these values is to find the equation of the line connecting $(a_k, f(a_k))$ and $(b_k, f(b_k))$ and then take $x_k$ to be the point where the line intersects the $x$-axis. Other than this, the algorithm described in Section 2.2 is unchanged. This is known as the method of false position.
(a) Show that

$$ x_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}. $$

(b) To solve $x^2 - 4 = 0$, take $a_0 = 0$ and $b_0 = 3$. Find $x_0$ and $x_1$ using the method of false position, and also find them using the bisection method. Which method produces the more accurate values? Which one requires more flops?
(c) To solve $e^{10x} - 1 = 0$, take $a_0 = -3$ and $b_0 = 1$. Compute $x_0$ and $x_1$ using the method of false position, and also find them using the bisection method. Given that the solution is $\bar{x} = 0$, which method produces the more accurate values?
Comment: False position can be modified to overcome the difficulty in part (c) (yielding, for example, the Illinois algorithm), but the simplicity and dependability of bisection is hard to beat.

2.5. The function $f(x)$ is given along with the starting point $x_0$ for Newton's method. You are to determine the tangent line approximation used by Newton's method, and the resulting value for $x_1$. With this, sketch $f(x)$ and the tangent line on the same axes.
(a) $f(x) = x(x - 3)$ with $x_0 = 1$.
(b) $f(x) = x(x - 3)$ with $x_0 = 2$.
(c) $f(x) = x(x - 1)(x - 3)$ with $x_0 = 2$.
(d) $f(x) = x(x - 1)(x - 3)$ with $x_0 = 1/2$.

k   Method 1           Method 2           Method 3           Method 4
1   1.10000000000000   1.02000000000000   1.05000000000000   1.03162277660168
2   1.01000000000000   1.00400000000000   1.02500000000000   1.00562341325190
3   1.00010000000000   1.00080000000000   1.01250000000000   1.00042169650343
4   1.00000001000000   1.00016000000000   1.00625000000000   1.00000865964323
5   1.00000000000000   1.00003200000000   1.00312500000000   1.00000002548297
6   1.00000000000000   1.00000640000000   1.00156250000000   1.00000000000407
7   1.00000000000000   1.00000128000000   1.00078125000000   1.00000000000000
8   1.00000000000000   1.00000025600000   1.00039062500000   1.00000000000000

Table 2.10 Values for Exercise 2.7.

2.6.
(a) Explain why, for a first-order method, in (2.5) it must be that $C < 1$.
(b) Derive an interval subdivision method, similar to bisection, that is a first-order method with $C = 1/3$.
(c) To solve $x^2 - 2 = 0$, the algorithm in Table 2.1 works if tol $= 10^{-8}$ but can fail if tol $= 10^{-20}$. Why?

2.7. Four different methods were used to solve $f(x) = 0$, and the computed values for $x_1, x_2, x_3, \cdots$ are shown in Table 2.10.
1. One of them is the bisection method. Which of the four is most likely the bisection method, and why?
2. One of them is Newton's method. Which of the four is most likely Newton's method, and why?
3. Suppose someone claims they computed the solution using the secant method, and they obtained the results given for Method 3. Why are they most likely mistaken (i.e., they are wrong)?

2.8. True or False: If Newton's method converges to a solution $\bar{x}$ for a particular choice of $x_0$, then it will converge to $\bar{x}$ for any starting point between $\bar{x}$ and $x_0$.

2.9. In Table 2.5, once $x_k$ gets close to the solution, the value of the relative iterative error lags a step behind the relative error. Why does this happen? Why does it happen in Table 2.9?

2.10.
(a) Suppose to solve $f(x) = 0$ one finds Newton's method takes 20 iterations and the secant method takes 30. When is it possible that the secant method takes less computing time? Make sure to explain your answer.
(b) Which of the curves in Figure 2.16 corresponds to Newton's method and which one to the bisection method? Make sure to justify your answers.

A   $x^2 - 1 = x^5$
B   $\sin x = 2x - \pi$
C   $x^3 + x = 1/(1 + x^2)$
D   $e^{-x} = x^3$
E   $\ln x = 2 - x^2$
F   $\cos x = 2 \ln x$

Table 2.11 Equations used in Exercise 2.11.

2.11. Each of the equations in Table 2.11 has one solution. Select an equation, and then do the following:
(a) Sketch the functions to determine the approximate location of the solution.
(b) For the bisection method, provide an initial interval that can be used to find the solution, and provide an explanation of why it works. With this calculate, by hand, $x_0$ and $x_1$.
(c) What is Newton's iteration formula (2.7) for this equation? Also, provide a starting point $x_0$ for the solution, providing an explanation of why it is a good choice. With this calculate, by hand, $x_1$.
(d) What is the secant iteration formula (2.18) for this equation? Also, provide starting points $x_0$ and $x_1$ for the solution, providing an explanation of why they are a good choice. With this calculate, by hand, $x_2$.
(e) Compute the solution of the equation. Your answer should be correct to at least six significant digits. Make sure to state which numerical method was used, why you made this choice, and what error condition you used to stop the calculation.

Figure 2.16 Graph for Exercise 2.10(b). (Vertical axis: error, on a log scale from $10^{-1}$ down to $10^{-16}$; horizontal axis: iteration step, 1 through 10.)

2.12. Find all solutions of $\cos x = 2 \sin x$. Make sure to state what method you used, the stopping condition, and how you selected your start point(s).

2.13. A problem arising in studying the vibrations of an elastic string is to find the positive solutions of $\tan x = 2x$.
(a) Sketch the two functions for $x \ge 0$ and explain why there are an infinite number of solutions.
(b) Plot $f(x) = \tan x - 2x$, for $0 \le x \le \frac{2}{5}\pi$. Using the plot, approximately, what is the solution in this interval?
(c) Compute the solution located in part (b). Your answer should be correct to at least six significant digits. Make sure to state which numerical method was used, why you made this choice, how you selected your start point(s), and what error condition you used to stop the calculation.
(d) To compute the next largest solution, explain how to use the solution from part (c) to pick the starting point(s).

Applications

2.14. This problem concerns the configuration shown in Figure 2.17. There are four straight sides, of fixed length, that are free to rotate at the vertices. The bottom side, of length $a$, does not move. What is of interest is how the angle $\varphi$ changes as $\theta$ changes. This is a situation that arises in kinematics, and it has been found that the two angles are related through the equation

$$ A\cos\theta - B\cos\varphi + C = \cos(\theta - \varphi), \tag{2.22} $$

where $A = a/b$, $B = a/d$, and $C = (a^2 + b^2 - c^2 + d^2)/(2bd)$. In textbooks on the kinematics of machines, this is known as the Freudenstein equation.
(a) If $\theta = 0$ then a triangle is produced. In this case, using the law of cosines, find $\varphi$ and show that this is the same result obtained from (2.22).
In the rest of the problem let $a = 3/2$, $b = \sqrt{3}$, $c = 1$, and $d = 1/2$.
(b) Taking $\theta = 0$, plot the left- and right-hand sides of (2.22) for $0 \le \varphi \le 2\pi$ and show that there are two solutions. Explain geometrically why there are two, and identify which one corresponds to the configuration shown in Figure 2.17.

Figure 2.17 Figure for Exercise 2.14.

Figure 2.18 Figure for Exercise 2.15.

(c) Assuming Newton's method is used to find $\varphi$, what is (2.7) when applied to (2.22)? Use this to calculate $\varphi$ for $\theta = \pi/6$ and $\theta = \pi/3$. Your values should be correct to six significant digits.
(d) As $\theta$ increases from $\theta = 0$ to $\theta = 2\pi$, the vertex connecting sides $b$ and $c$ traces out a portion of a circle. Explain why, and find the maximum and minimum values of $\varphi$ that determine this circular arc.

2.15. The crossing ladders problem is the following: Two ladders of length $a$ and $b$ are leaning across an alleyway between two buildings as shown in Figure 2.18. If they cross at a height $h$, what is the width $w$ of the alleyway? Just so it's clear, $a$, $b$, and $h$ are known, and the objective is to determine $w$.
(a) Using similar triangles show that $\frac{1}{h} = \frac{1}{A} + \frac{1}{B}$. Also, using the Pythagorean theorem, show that $B^2 = b^2 - a^2 + A^2$. Combining these two formulas, show that

$$ h^2 A^2 = (A - h)^2 (b^2 - a^2 + A^2). $$

This equation is going to be solved to determine $A$.
(b) By sketching the functions in the equation from part (a), show that there are two positive solutions for $A$ (assuming that $a < b$). Note that you might find it easier to first rewrite the equation before doing the sketch. Which solution is the one for the crossing ladders problem?
(c) Newton's method is going to be used to find $A$. What does (2.7) reduce to in this case? Based on part (b), what would be a good choice for the starting point in this case? Make sure to explain why.
(d) Taking $a = 20$, $b = 30$, and $h = 8$, use Newton's method to compute $A$ and from this determine $w$. Your answer should be correct to at least eight significant digits.

2.16. This problem considers finding $\alpha$ so the line $y = \alpha x$ is tangent to the curve $y = \cos(2\pi x)$. You can assume that $\alpha > 0$, and the tangency points occur for $x > 0$.


(a) By sketching $y = \alpha x$ and $y = \cos(2\pi x)$, explain why there are an infinite number of solutions for $\alpha$. Also, use this sketch to determine the approximate locations where tangency occurs.
(b) Calculate the largest value of $\alpha$. Make sure to explain what mathematical problem you solve to find $\alpha$, as well as which numerical method you used, why you made this choice, and what error condition was used to stop the calculation.

2.17. From the Michaelis-Menten model for an enzyme-catalyzed reaction, the amount of the substrate $S$ present at time $t$ satisfies the differential equation

$$ \frac{dS}{dt} = -\frac{vS}{K + S}, $$

where $v$ and $K$ are positive constants. Solving this, one obtains

$$ K \ln(S/S_0) + S = S_0 - vt, \tag{2.23} $$

where $S_0$ is the (nonzero) amount at the beginning. To find $S$, for any given value of $t$, it is necessary to solve this nonlinear equation, and how this might be done is considered in this exercise. How to find the numerical solution of the differential equation is considered in Exercise 7.29.
(a) As a function of $S$, sketch the two functions $K \ln(S/S_0)$ and $S_0 - vt - S$. Do this for $t = 0$ and for $t > 0$. Use this to explain why there is only one solution of (2.23).
(b) Use part (a) to explain why the solution satisfies $0 < S \le S_0$.
(c) Suppose (2.23) is to be solved using Newton's method. What does (2.7) reduce to in this case? Based on parts (a) and (b), what would be a good choice for the starting point in this case? Make sure to explain why.
(d) Based on parts (a) and (b), what would be a good choice for a starting interval when using the bisection method to solve (2.23)? Make sure to explain why.
(e) Suppose (2.23) is to be solved using the secant method. What does (2.18) reduce to in this case? Based on parts (a) and (b), what would be a good choice for the two starting points in this case? Make sure to explain why.
(f) It is found that for sucrose, $v = 0.76$ mM/min and $K = 16.7$ mM [Johnson and Goody, 2011]. Also, assume that $S_0 = 100$ mM. Use one of the above methods from (c) to (e) to find the value of $S$ at $t = 1$ min, at $t = 10$ min, and at $t = 100$ min. Your answers should be correct to at least four significant digits. Make sure to state which method was used, why you made this choice, and what error condition you used to stop the calculation.
(g) Using the parameter values given in part (f), and one of the numerical methods from (c)-(e), plot $S(t)$ for $0 \le t \le 1000$.


(h) Explain how it is possible to produce the plot in part (g), from (2.23), without having to use a numerical solver to find $S$. Useful facts here are $S'(t) < 0$, $S(0) = S_0$, and $S(\infty) = 0$.

2.18. According to the Colebrook equation, the friction factor $f$ for turbulent flow in a pipe is found by solving

$$ \frac{1}{\sqrt{f}} = -2 \log_{10}\left( \alpha + \frac{\beta}{\sqrt{f}} \right), $$

where $\alpha$ and $\beta$ are constants that satisfy $0 < \alpha < 1$ and $0 < \beta < 1$. By setting $x = 1/\sqrt{f}$, this equation can be rewritten as

$$ x = -2 \log_{10}(\alpha + \beta x), \tag{2.24} $$

where $0 < x < \infty$.
(a) Sketch the two functions in (2.24) for $0 < x < \infty$. Use this to explain why there is only one solution, and that it is in the interval $0 < x < 2 \log_{10}(1/\alpha)$.
In the rest of the problem assume that $\alpha = 10^{-2}$ and $\beta = 10^{-4}$, which are typical values for these constants.
(b) What does (2.7) reduce to for (2.24)? Based on part (a), what would be a good choice for $x_0$? Make sure to explain why.
(c) Based on part (a), what would be a good choice for $a_0$ and $b_0$ when using the bisection method to solve (2.24)? Make sure to explain why.
(d) Based on part (a), what would be a good choice for $x_0$ and $x_1$ when using the secant method to solve (2.24)? Make sure to explain why.
(e) Use one of the methods from (b) to (d) to solve (2.24), and from this determine the value of $f$. Make sure to state which method was used, why you made this choice, and what error condition you used to stop the calculation.

2.19. The terminal velocity $v$ of a sphere falling through the air satisfies

$$ v^2 = \frac{2mg}{\rho A c_D}, \tag{2.25} $$

where $A = \pi d^2/4$. In this expression, $\rho$ is the density of air, $m$ and $d$ are the mass and diameter of the sphere, and $g$ is the gravitational acceleration constant. The term $c_D$ is known as the drag coefficient, and from experimental data the following formula has been proposed [White, 2005]

$$ c_D = \frac{24}{R} + \frac{6}{1 + \sqrt{R}} + \frac{2}{5}, \tag{2.26} $$

where $R = \rho v d/\mu$ is known as the Reynolds number and $\mu$ is the (dynamic) viscosity of air. In this problem, (2.25) is rewritten in terms of $R$, and it is then solved for $R$. After this, the value of $v$ is computed. Also, note that the velocity is positive.

Figure 2.19 A baseball is going to get dropped. Whether you should try to catch it is considered in Exercise 2.19.

(a) Show that it is possible to rewrite (2.25) as $R^2 c_D = \alpha$, where $\alpha$ does not depend on $v$ or $R$. After substituting (2.26) into this equation, the problem is rewritten so $R$ is the variable to solve for. Write down the resulting equation. Assuming the value of $R$ has been computed, explain how the velocity is calculated.
(b) Write the equation in part (a) as $c_D R = \alpha/R$, where $c_D$ is given in (2.26). Sketch the left- and right-hand sides of this equation as a function of $R$. This shows that there is one solution and it is in the interval $0 < R < \alpha/24$.
(c) Assuming Newton's method is used to find $R$, what is (2.7) when applied to the equation in part (a)? Based on your results from part (b), what would be a good choice for a starting point? Make sure to explain why.


(d) Write down the iteration formula if the secant method is used to find $R$. Based on your results from part (b), what would be a good choice for the two starting points? Make sure to explain why.
(e) For air, $\mu = 1.8 \times 10^{-5}$, $\rho = 1.2$, $g = 9.8$, and for a baseball, $d = 0.075$ and $m = 0.14$ (using kg, m, s units). What is the terminal velocity of a baseball (assuming it is a perfect sphere)? Make sure to state which method was used, why you made this choice, and what error condition you used to stop the calculation. Also, how does this velocity compare to what is considered the velocity of a typical fastball in professional baseball?
(f) Is it possible to make a baseball out of something so its terminal velocity is approximately the speed of sound?

2.20. The ideal gas law states that $pv = nRT$, where $p$ is the pressure, $v$ is the volume, $n$ is the amount of the substance (in moles), $R$ is the gas constant, and $T$ is the temperature (in Kelvin). An improved version of this is the van der Waals equation of state, and it is given as

$$ \left( p + \frac{n^2 a}{v^2} \right)(v - nb) = nRT, \tag{2.27} $$

where $a$ and $b$ are positive constants. Also, $p$ and $v$ are positive.
(a) Explain why there is one solution for $v$. Note, one way to do this is to rewrite (2.27) as $p + n^2 a/v^2 = nRT/(v - nb)$, and then sketch the left- and right-hand sides as functions of $v$.
(b) Using the sketch from part (a), explain why the solution satisfies $nb < v < nb + nRT/p$.
(c) Assuming Newton's method is used to find $v$, what is (2.7) when applied to (2.27)? Based on parts (a) and (b), what would be a good choice for a starting point? Make sure to explain why.
(d) Write down the iteration formula if the secant method is used to find $v$. Based on parts (a) and (b), what would be a good choice for the two starting points? Make sure to explain why.
(e) Assume that $p = 1$ atm, $n = 1$ mol, and recall that $R = 0.08205746$ L atm/(K mol). Also, for oxygen, which is the gas considered here, $a = 1.382$ and $b = 0.0319$. Note that the values for $R$, $a$, and $b$ are the exact values given in the 2021 CRC Handbook of Chemistry and Physics. Using either (c) or (d), determine $v$ at room temperature (you can assume this is 25 °C). In your write-up, state why you picked the solver you used, and give your reason(s) for what value you selected for the error tolerance used to stop the solver. Also, explain why it isn't necessary to run the solver to the point that the error in the solution is on the order of machine $\varepsilon$.
(f) Using the values given in part (e), plot $v$ as a function of $T$, for 0 °C $\le T \le$ 50 °C. In your write-up explain how you used your solver to do this.


Newton's Method: Mathematical Applications

2.21. This problem examines how to use an iterative method to calculate the square root of a positive number $\alpha$. In other words, the algorithm calculates $\sqrt{\alpha}$. The procedure can only contain the four elementary arithmetic operations (addition, subtraction, multiplication, and division). Also, you do not need to actually calculate anything, you just need to describe how to do this with the respective method.
(a) How can this be done using the bisection method?
(b) How can this be done using Newton's method?

2.22. Show how to use Newton's method to evaluate $2^{1/3}$. Your procedure, or formula, should only contain additions, subtractions, multiplications, and/or divisions.

2.23. This exercise concerns an implicitly defined function $y(x)$, defined through an equation of the form $F(x, y) = 0$. For the equations in Table 2.12, select one and then do the following:
(a) Sketch the functions as a function of $y$, and use this to show that for each $x$, there is one solution $y$.
(b) Use the sketch in part (a) to find a bounded interval for $y$ that contains the solution (note the interval will possibly depend on $x$).
(c) What is Newton's iteration formula (2.7) for this equation? Also, provide a starting point $y_0$ for the solution, providing an explanation of why it is a good choice.
(d) What is the secant iteration formula (2.18) for this equation? Also, provide starting points $y_0$ and $y_1$ for the solution, providing an explanation of why they are a good choice.
(e) Compute $y(2)$. Your answer should be correct to at least four significant digits. Make sure to state which numerical method was used, why you made this choice, and what error condition you used to stop the calculation.
(f) Plot $y(x)$ for $2 \le x \le 10$. You should also provide an explanation of the algorithm you used to do this.

2.24. This exercise explores how to use Newton's method to evaluate an inverse function. To explain, given $y = g(x)$, then the inverse function satisfies $x = g^{-1}(y)$. The problem is, given $y$, what is the value of $x$?
(a) Assuming $y$ is given, and setting $f(x) = y - g(x)$, show that Newton's method (2.7) gives

$$ x_{k+1} = x_k + \frac{y - g(x_k)}{g'(x_k)}, \quad \text{for } k = 0, 1, 2, 3, \cdots $$

Table 2.12 Equations used in Exercise 2.23.

A    $y^3 = x - y$ for $x > 0$
B    $1/(x^2 + y^2) = y$ for $x > 0$
C    $\ln(x + y) = -y^3$ for $x > 1$
D    $y^3 e^{2y} = x$ for $x > 0$
E    $x(1 - y^2) = \sqrt{y}$ for $x > 0$

(b) Use the result from part (a) to evaluate $e^2$ and $e^{-3}$ using the $\ln(x)$ function. (c) Use the result from part (a) to evaluate $\arccos(1/2)$ and $\arccos(1/3)$. (d) The error function is defined as
\[ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-s^2}\, ds. \]
The inverse error function is denoted as $\operatorname{erf}^{-1}(x)$. Use the result from part (a) to evaluate $\operatorname{erf}^{-1}(1/2)$ and $\operatorname{erf}^{-1}(1/3)$. If you are using MATLAB, you can use the erf command to evaluate the error function. (e) The complete elliptic integral of the first kind is defined as
\[ K(x) = \int_0^1 \frac{ds}{\sqrt{(1 - s^2)(1 - x s^2)}}. \]
The inverse function is denoted as $K^{-1}(x)$. Use the result from part (a) to evaluate $K^{-1}(2)$ and $K^{-1}(4)$. It is useful to know that
\[ K'(x) = \frac{E(x) - (1 - x)K(x)}{2x(1 - x)}, \]
where $E(x)$ is the complete elliptic integral of the second kind. If you are using MATLAB, you can use the ellipke command to evaluate $K$ and $E$.

Newton’s Method: Extensions and Failures

2.25. In the derivation of Newton’s method, to determine the formula for $x_k$, the function $f(x)$ is approximated using a first-order Taylor approximation centered at $x_{k-1}$. This problem investigates what happens when you use a second-order Taylor approximation.


(a) Approximating $f(x)$ using a second-order Taylor approximation centered at $x_{k-1}$, then $f(x) = 0$ is replaced with what equation to solve? (b) Solve the equation in part (a) for $x_k$, and show that the solution has the form $x_k = x_{k-1} - A(1 \pm \sqrt{1 - B})$. (c) Given that $x_k$ is close to $x_{k-1}$, what choice should be made for the $\pm$ in part (b)? (d) The equation in part (a) contains a term of the form $(x_k - x_{k-1})^2$. Explain why this can be approximated with $-(x_k - x_{k-1}) f(x_{k-1})/f'(x_{k-1})$. If this is done, what is the resulting formula for $x_k$? Note that the formula you are deriving is known as Halley’s method, and it is an example of a third-order method.

2.26. This problem demonstrates a way Newton’s method can fail (it thinks there is a solution even though no solution exists). (a) Sketch $f(x) = 3\cos(x) + \frac{3}{2} + \pi\sqrt{3}$ for $0 \le x \le 2\pi$. How many solutions of $f(x) = 0$ are there in this interval? (b) Using Newton’s method, take $x_0 = 2\pi/3$ and calculate $x_1$ (by hand). After this, calculate $x_2$ (by hand). (c) Sketch the tangent lines for $x_0$ and $x_1$. Based on this, determine what the other $x_k$ values will be. (d) Explain why Newton’s method will not converge for this problem.

2.27. This exercise considers an example where Newton’s method will fail. The function is shown in Figure 2.20, and it is given as
\[ f(x) = \frac{1 - e^{-10x}}{1 + e^{-10x}}. \]
(a) Show that $f'(x) > 0$ for all $x$. (b) Describe what happens if one uses Newton’s method with $x_0 = 1$. Also, explain why essentially the same thing happens if you use $x_0 = -1$. (c) Experiment with Newton’s method, and find the largest (positive) value of $x_0$ that will result in Newton’s method converging to the correct solution. You only need to find $x_0$ to two significant digits. Also, give the corresponding value of $x_1$. Note that keeping track of what happens to $x_1$ will be helpful for part (d). (d) Explain why the value of $x_0$ you found in part (c) can be found by finding the positive solution of $2x f'(x) = f(x)$. What exactly is the relationship between $x_0$ and $x_1$ that gives rise to this equation?

2.28. This problem explores when Newton’s method converges either faster or slower than stated in Theorem 2.2. (a) Suppose $f''(\bar{x}) = 0$ but $f'''(\bar{x}) \ne 0$. Show that (2.15) is replaced with
\[ \frac{f(x_k)}{f'(x_k)} = e_k - \frac{1}{3} e_k^3 w + \cdots, \]
where $w = f'''(\bar{x})/f'(\bar{x})$. Explain why, in this case, Newton’s method is third order. (b) Suppose the assumption that $f'(x) \ne 0$ for $a \le x \le b$ is replaced with $f'(x) \ne 0$ for $a \le x \le b$ except $f'(\bar{x}) = 0$. Show that (2.15) is replaced with
\[ \frac{f(x_k)}{f'(x_k)} = \frac{1}{2} e_k + \cdots. \]
Explain why, in this case, Newton’s method is first order.

Figure 2.20 Plot of function for Exercise 2.27. [Plot not reproduced: x-axis from -2 to 2, y-axis from -1 to 1.]

Additional Questions

2.29. This exercise examines various choices that can be made for $x_k$ when using the bisection method. So, in this exercise, the bisection method subdivides the interval at each step and keeps the half that is guaranteed to contain a solution. The question is, what point $x_k$ should you pick from this half to be the approximation for the solution $\bar{x}$? (a) Whatever choice is made for $x_k$, the error is $|x_k - \bar{x}|$. Sketch the error as a function of $\bar{x}$, for $a_k \le \bar{x} \le b_k$. Explain why the minimum error occurs when $b_k - x_k = x_k - a_k$, and from this conclude that $x_k = (b_k + a_k)/2$. In other words, one should select the midpoint of the subinterval. (b) Suppose one instead uses the relative error $|(x_k - \bar{x})/\bar{x}|$. This requires a nonzero solution, so it is assumed here that $0 < a_k < b_k$. By sketching the relative error for $a_k \le \bar{x} \le b_k$, explain why the minimum error occurs when $x_k = 2a_k b_k/(b_k + a_k)$. (c) Does it make much difference which choice is made for $x_k$ from the interval $a_k \le x \le b_k$? In answering this, assume that the stopping condition for the loop in Table 2.1 is the same irrespective of the choice for $x_k$.

2.30. This problem provides an outline of the proof of Theorem 2.3, using the ideas developed for the outline of the proof for Theorem 2.2.


(a) Setting $e_k = x_k - \bar{x}$, show that $e_k = e_{k-1} - z$, where $z = f(x_{k-1})(x_{k-1} - x_{k-2})/(f(x_{k-1}) - f(x_{k-2}))$. (b) Writing $x_k = \bar{x} + e_k$, show that $z = e_{k-1}(1 + Ce_{k-1} + C_2 e_{k-1}^2 + \cdots)/(1 + C(e_{k-1} + e_{k-2}) + \cdots)$, where $C = f''(\bar{x})/(2f'(\bar{x}))$, and $C_2 = f'''(\bar{x})/(6f'(\bar{x}))$. Assume that $C \ne 0$. (c) Using the results from (a) and (b), show that
\[ e_k = C e_{k-1} e_{k-2}\, \frac{1 + 2C_2(e_{k-1} + e_{k-2}) + \cdots}{1 + C(e_{k-1} + e_{k-2}) + \cdots}. \]
From this, derive the approximation $e_k \approx C e_{k-1} e_{k-2}$. (d) To solve $|e_k| = |C e_{k-1} e_{k-2}|$, let $E_k = \ln(|e_k|)$. Show that the equation becomes $E_k = Z + E_{k-1} + E_{k-2}$, where $Z = \ln(|C|)$. Also, show that $E_k = -Z + A r_+^k + B r_-^k$ solves this equation, where $r_\pm$ are the two solutions of $r^2 - r - 1 = 0$, and $A$ and $B$ are arbitrary constants. (e) Using the solution from part (d), explain why, as $k$ increases, $|e_k| \approx (1/|C|)\exp(A r_+^k)$. (f) Find the values of $s$ and $D$ so that $|e_k|/|e_{k-1}|^s \to D$, and from this derive the result in (2.21).

2.31. This problem explores how to use the floating-point representation to obtain a good starting point for Newton’s method. To illustrate this idea, the equation to solve is $x^2 - a = 0$, where $a > 0$, which is used to evaluate $x = \sqrt{a}$. (a) Show that the formula for Newton’s method is
\[ x_{k+1} = \frac{1}{2}\left( x_k + \frac{a}{x_k} \right). \]
(b) The floating-point approximation of $a$ has the form $a_f = m \times 2^E$, where
\[ m = 1 + \frac{b_1}{2} + \frac{b_2}{2^2} + \cdots + \frac{b_{N-1}}{2^{N-1}}. \]
Also, recall that for $0 \le y < 1$,
\[ \sqrt{1 + y} = 1 + \frac{1}{2}y - \frac{1}{8}y^2 + \frac{1}{16}y^3 + \cdots. \]
Assuming $E$ is even, and $b_1$ or $b_2$ are nonzero, use the above information to show that
\[ \sqrt{a_f} \approx \left( 1 + \frac{1}{4}b_1 + \frac{1}{8}b_2 \right) \times 2^{E/2}. \]
In what follows it is assumed that this formula is used to determine $x_0$. Note that this expression only involves additions and multiplications (and not a square root). (c) What does the approximation in part (b) reduce to for $a = 28 = (1 + 1/2 + 1/2^2) \times 2^4$? How close does $x_0$ come to $\sqrt{28}$? (d) Modify the derivation in part (b) to find a starting value in the case of when $a = 50 = (1 + 1/2 + 1/2^4) \times 2^5$. Make sure to compare $x_0$ with the exact value. Also, you can assume the value of $\sqrt{2}$ is known. (e) Write a program that uses this idea to calculate $\sqrt{a}$, for any given $a > 1$. The stopping condition for Newton’s method should be $|x_{k+1} - x_k|/|x_{k+1}| \le 10^{-8}$. With this, compute $\sqrt{1.5}$, $\sqrt{33}$, $\sqrt{1001}$, and $\sqrt{0.1}$.

Chapter 3

Matrix Equations

This chapter concentrates on solving the matrix equation Ax = b, and the chapter to follow investigates various ways to compute the eigenvalues of a matrix A. Together, they are central components of what is called numerical linear algebra. What will be evident from reading this material is the prominent role matrix factorizations play in the subject. To explain what this involves, given a matrix A, one factors it as A = BC or A = BCD. The factors, B, C, and D, are matrices with nice, easy to compute with, properties. The time consuming computational step is finding the factorization. There are many useful factorizations, and a listing of some considered in this text can be found in the index. To help bring out the importance of matrix factorization, the first three sections of this chapter are written in reverse order, at least compared to most numerical textbooks. What is done here is to start with the final result, which is the description of the numerical method, then afterwards explain where or how one would come up with such an idea. There are several reasons for reversing the order, one being that the derivation of the method (Section 3.3.2) is useful mostly for showing where the idea comes from. Once you realize how the method works you also realize there is a more direct method to get the result (Section 3.2.1). Finally, there is the constructive component, where you actually solve problems with the method, and this is where we start.

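3.1 An Example

Consider the following system of equations
\[ \begin{aligned} -2x + 2y &= 4 \\ 8x - 5y &= -13. \end{aligned} \]
This can be written in matrix form as
\[ \begin{pmatrix} -2 & 2 \\ 8 & -5 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 4 \\ -13 \end{pmatrix}, \]
or equivalently as $\mathbf{A}\mathbf{x} = \mathbf{b}$, where
\[ \mathbf{A} = \begin{pmatrix} -2 & 2 \\ 8 & -5 \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}, \quad \text{and} \quad \mathbf{b} = \begin{pmatrix} 4 \\ -13 \end{pmatrix}. \]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_3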

The LU Factorization

The goal of this chapter is to examine how matrix equations like the one above can be solved using a computer. The centerpiece of this is something called the LU factorization method. As the name implies, this method works by factoring the matrix $\mathbf{A}$ as
\[ \mathbf{A} = \mathbf{L}\mathbf{U}, \tag{3.1} \]
where $\mathbf{L}$ is a lower triangular matrix and $\mathbf{U}$ is an upper triangular matrix. It is fairly easy to find $\mathbf{L}$ and $\mathbf{U}$, and how this is done is explained in the next section. First, however, it is worth demonstrating how this helps in solving the above matrix equation. For this example, a factorization of the matrix is
\[ \begin{pmatrix} -2 & 2 \\ 8 & -5 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -4 & 1 \end{pmatrix} \begin{pmatrix} -2 & 2 \\ 0 & 3 \end{pmatrix}. \tag{3.2} \]
Consequently,
\[ \mathbf{L} = \begin{pmatrix} 1 & 0 \\ -4 & 1 \end{pmatrix}, \quad \text{and} \quad \mathbf{U} = \begin{pmatrix} -2 & 2 \\ 0 & 3 \end{pmatrix}. \tag{3.3} \]
These are not the only $\mathbf{L}$ and $\mathbf{U}$ factors you can find for $\mathbf{A}$, and this will be discussed later.

Solving using LU

Solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ comes by noticing that the equation can now be written as
\[ \mathbf{L}(\mathbf{U}\mathbf{x}) = \mathbf{b}. \]
Looking at this for a moment you see that if we set $\mathbf{y} = \mathbf{U}\mathbf{x}$, then the equation can be solved by:

First: finding the vector $\mathbf{y}$ that is the solution of $\mathbf{L}\mathbf{y} = \mathbf{b}$,
Second: finding the vector $\mathbf{x}$ that is the solution of $\mathbf{U}\mathbf{x} = \mathbf{y}$.

The two equations listed above are much easier to solve than the original. To demonstrate, for our example:

I) Solving $\mathbf{L}\mathbf{y} = \mathbf{b}$: The equation is
\[ \begin{pmatrix} 1 & 0 \\ -4 & 1 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} 4 \\ -13 \end{pmatrix}, \]
where $\mathbf{y} = (u, v)^T$. In component form,
\[ \begin{aligned} u &= 4 \\ -4u + v &= -13. \end{aligned} \]
The lower triangular form of this matrix means we simply start at the top and solve for the various components of $\mathbf{y}$ (what is sometimes called forward substitution). In particular, $u = 4$ and $v = 4u - 13 = 3$.

II) Solving $\mathbf{U}\mathbf{x} = \mathbf{y}$: The equation to solve is
\[ \begin{pmatrix} -2 & 2 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 4 \\ 3 \end{pmatrix}, \]
or, in component form,
\[ \begin{aligned} -2x + 2y &= 4 \\ 3y &= 3. \end{aligned} \]
The upper triangular form of this matrix means we simply start at the bottom and solve for the various components of $\mathbf{x}$ (what is sometimes called back substitution), and one finds that $y = 1$ and $x = y - 2 = -1$.

It is hard to overstate the importance of the LU method in numerical computing. At the same time, for those who are seeing it for the first time, it looks strange. Also, those who take linear algebra are taught to solve matrix equations using Gaussian elimination, and the LU method looks to be something completely different. It is shown in Section 3.3.2 how they are connected, but you can use the LU method without knowing anything about Gaussian elimination.

A comment is in order about the transpose notation used earlier. In stating that $\mathbf{y} = (u, v)^T$ it is meant that

\[ \mathbf{y} = \begin{pmatrix} u \\ v \end{pmatrix}. \]
It is perhaps more consistent to write $\mathbf{y} = (u \;\; v)^T$, but the comma is used to help clarify the separation between the components of the vector.
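The two triangular solves used in this example can be sketched in a few lines of Python. This is only an illustration of forward and back substitution, not the implementation used in the text; the function names `forward_substitution` and `back_substitution` are hypothetical, and the matrices are the L and U from (3.3).

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L, starting at the top row."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for an upper triangular U, starting at the bottom row."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x


# L, U, and b for the 2 x 2 example of Section 3.1
L = [[1.0, 0.0], [-4.0, 1.0]]
U = [[-2.0, 2.0], [0.0, 3.0]]
b = [4.0, -13.0]

y = forward_substitution(L, b)   # the vector (u, v) in the text
x = back_substitution(U, y)      # the solution (x, y) of the original system
print(y, x)
```

For these inputs the intermediate vector is (4, 3) and the solution is (-1, 1), in agreement with steps I and II above.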

3.2 Finding L and U

Once you realize that a LU factorization might be possible, finding it involves a fairly straightforward calculation. Basically, you simply assume the matrix can be factored and then attempt to find $\mathbf{L}$ and $\mathbf{U}$. To illustrate, using the earlier example, we will assume that it is possible to write $\mathbf{A} = \mathbf{L}\mathbf{U}$, which means we assume that
\[ \begin{pmatrix} -2 & 2 \\ 8 & -5 \end{pmatrix} = \begin{pmatrix} \ell_{11} & 0 \\ \ell_{21} & \ell_{22} \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{pmatrix}. \]
Multiplying this out we have
\[ \begin{pmatrix} \ell_{11}u_{11} & \ell_{11}u_{12} \\ \ell_{21}u_{11} & \ell_{21}u_{12} + \ell_{22}u_{22} \end{pmatrix} = \begin{pmatrix} -2 & 2 \\ 8 & -5 \end{pmatrix}, \tag{3.4} \]
or in component form
\[ \begin{aligned} \ell_{11}u_{11} &= -2, & \ell_{11}u_{12} &= 2, \\ \ell_{21}u_{11} &= 8, & \ell_{21}u_{12} + \ell_{22}u_{22} &= -5. \end{aligned} \tag{3.5} \]
The first thing to notice is that we have 4 equations and 6 unknowns, and so the LU factorization is not unique. You should pick two values and it does not make much difference what choice is made, other than being nonzero. Two possibilities, that are used frequently enough that they are given names, are the following:

• Doolittle factorization: choose $\ell_{11} = \ell_{22} = 1$
• Crout factorization: choose $u_{11} = u_{22} = 1$.

Using a Doolittle factorization, then (3.5) takes the form
\[ \begin{aligned} u_{11} &= -2, & u_{12} &= 2, \\ \ell_{21}u_{11} &= 8, & \ell_{21}u_{12} + u_{22} &= -5. \end{aligned} \]
Consequently, $u_{11} = -2$, $u_{12} = 2$, $\ell_{21} = -4$, and $u_{22} = 3$. The resulting matrices are given in (3.3).


A few questions arise from the above calculation. First, since the LU factorization is not unique you might wonder if the solution is unique. In particular, you might expect that if you use a different factorization that you will get a different solution. The answer is no, the non-uniqueness of the factorization does not mean the solution is non-unique. As a simple test, you might try a Crout factorization in the above example to demonstrate that you still get the same solution. A second question is, can you factor any square matrix, and the answer to this is given next.
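The suggested test can be carried out numerically. The sketch below assumes the Crout factors obtained by setting $u_{11} = u_{22} = 1$ in (3.5), which gives $\ell_{11} = -2$, $u_{12} = -1$, $\ell_{21} = 8$, and $\ell_{22} = 3$; the helper name `solve_lu` is hypothetical.

```python
def solve_lu(L, U, b):
    """Solve L U x = b by forward substitution followed by back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


b = [4.0, -13.0]

# Doolittle factors from (3.3): unit diagonal on L
doolittle = ([[1.0, 0.0], [-4.0, 1.0]], [[-2.0, 2.0], [0.0, 3.0]])
# Crout factors, worked out by hand from (3.5): unit diagonal on U
crout = ([[-2.0, 0.0], [8.0, 3.0]], [[1.0, -1.0], [0.0, 1.0]])

x_doolittle = solve_lu(*doolittle, b)
x_crout = solve_lu(*crout, b)
print(x_doolittle, x_crout)
```

The intermediate vector y differs between the two factorizations, (4, 3) for Doolittle and (-2, 1) for Crout, but both produce the same solution x = -1, y = 1.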

3.2.1 What Matrices Have a LU Factorization?

To answer this we will try to factor a general 2 × 2 matrix using a Doolittle factorization. So, we consider the equation
\[ \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \ell_{21} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{pmatrix}, \]
which in component form is
\[ \begin{aligned} u_{11} &= a, & u_{12} &= b, \\ \ell_{21}u_{11} &= c, & \ell_{21}u_{12} + u_{22} &= d. \end{aligned} \]
So, $u_{11} = a$, $u_{12} = b$, and, if $a \ne 0$, then $\ell_{21} = c/a$ and $u_{22} = d - bc/a$. If $a = 0$ then the above equations can still be solved if $c = 0$. In this case, any value for $\ell_{21}$ works. For example, taking $\ell_{21} = 0$, then $u_{12} = b$ and $u_{22} = d$. The remaining case to consider is $a = 0$ and $c \ne 0$. To explain what to do in this case, the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$ is
\[ \begin{pmatrix} 0 & b \\ c & d \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}. \tag{3.6} \]
In component form this is
\[ \begin{aligned} by &= b_1 \\ cx + dy &= b_2. \end{aligned} \]
These can be rewritten as
\[ \begin{aligned} cx + dy &= b_2 \\ by &= b_1. \end{aligned} \]
Interchanging rows in this way is known as pivoting. The resulting matrix equation is
\[ \begin{pmatrix} c & d \\ 0 & b \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b_2 \\ b_1 \end{pmatrix}. \tag{3.7} \]
Because $c$ is nonzero, using the result in the previous paragraph, we know that the above coefficient matrix has a LU factorization. We have shown that any linear system involving two equations can be rearranged so the coefficient matrix has a LU factorization. This result is true for all square matrices and this is stated in the next result.

Theorem 3.1 Every n × n matrix can be rearranged using pivoting so it has a LU factorization.

The proof of this theorem can be found in Björck [2015].

3.3 Implementing a LU Solver

The method used to find a LU factorization for a general square matrix is the same as for the 2 × 2 case. Namely, you just assume it’s possible and then calculate what’s necessary. In the case of a Doolittle factorization, this means we assume that
\[ \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \ell_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \ell_{n1} & \ell_{n2} & \cdots & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{pmatrix}. \]
It is necessary to multiply the two matrices on the right and then equate the result with the matrix on the left. As an example, equating the first column on the left with the one on the right, we have that
\[ \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{n1} \end{pmatrix} = \begin{pmatrix} u_{11} \\ \ell_{21}u_{11} \\ \vdots \\ \ell_{n1}u_{11} \end{pmatrix}. \]
From this it follows that $u_{11} = a_{11}$, $\ell_{21} = a_{21}/a_{11}$, $\cdots$, $\ell_{n1} = a_{n1}/a_{11}$. This requires that $a_{11} \ne 0$. If it is zero, and at least one of the other $a_{i1}$’s is nonzero, then we would first need to use pivoting to get a nonzero entry in the (1, 1)-position. There is also the possibility that all the entries in the first column are zero, a situation equivalent to the 2 × 2 case when $a = c = 0$, and it can be handled in a similar manner. However, this situation will not arise if the matrix is invertible.

for j = 1, 2, · · · , n
    l(j, j) = 1
    for i = 1, 2, · · · , j
        u(i, j) = a(i, j)
        for k = 1, 2, · · · , i − 1
            u(i, j) = u(i, j) − l(i, k)u(k, j)
        end
    end
    for i = j + 1, j + 2, · · · , n
        l(i, j) = a(i, j)
        for k = 1, 2, · · · , j − 1
            l(i, j) = l(i, j) − l(i, k)u(k, j)
        end
        l(i, j) = l(i, j)/u(j, j)
    end
end

Table 3.1 Algorithm for finding a Doolittle factorization of A, assuming pivoting is not needed. It is understood that any loop with a larger starting value than ending value is skipped.
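As an illustration, the loops in Table 3.1 translate almost line for line into Python. The sketch below uses 0-based indices and lists of lists, and the function name `doolittle_lu` is hypothetical. Like the table, it assumes no pivoting is needed, so it divides by u(j, j) without checking that it is nonzero. The test matrix is the 3 × 3 matrix that is factored by hand in Example 3.1.

```python
def doolittle_lu(A):
    """Doolittle factorization A = L U, following Table 3.1 (no pivoting)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):                       # work through the columns
        L[j][j] = 1.0                        # unit diagonal on L
        for i in range(j + 1):               # entries of U in column j
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for i in range(j + 1, n):            # entries of L in column j
            s = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
            L[i][j] = s / U[j][j]            # assumes the pivot U[j][j] is nonzero
    return L, U


A = [[1.0, 1.0, 1.0],
     [2.0, 4.0, 1.0],
     [5.0, -1.0, -1.0]]
L, U = doolittle_lu(A)
for row in L:
    print(row)
for row in U:
    print(row)
```

The empty sums play the role of the skipped loops mentioned in the caption of Table 3.1.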

Once this step is complete, you then equate the second column, then the third, etc. The resulting algorithm for finding the nonzero entries in L and U, when no pivoting is needed, is given in Table 3.1. Whenever pivoting is used it is necessary to keep track of which rows are interchanged. In linear algebra textbooks this is often done using a permutation matrix P. Before starting the factorization, P = I. Whenever a row interchange is made, that same interchange is made on P. Once the factorization is complete, you then replace Ly = b with Ly = Pb. This is, however, not recommended when computing a LU factorization because of the additional storage, and computational effort, that it requires. Instead, before starting the factorization, you let p = (1, 2, 3, · · · , n)T . During the factorization, if the ith and jth rows are interchanged, then the ith and jth entries in p are also interchanged. At the end, you replace b with b∗ where b∗i = bpi . For example, suppose for a 3 × 3 matrix that the first and third rows are interchanged, then the second and third rows are interchanged. In this case you start with p = (1, 2, 3)T , which changes to p = (3, 2, 1)T , which then changes to p = (3, 1, 2)T . So, b = (b1 , b2 , b3 )T ends up changing to b∗ = (b3 , b1 , b2 )T .
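The bookkeeping with the vector p can be sketched as follows. The helper `swap_rows` is a hypothetical name, and the entries of b are kept as labels so the reordering is easy to see.

```python
def swap_rows(p, i, j):
    """Record the interchange of rows i and j (numbered from 1, as in the text)."""
    p[i - 1], p[j - 1] = p[j - 1], p[i - 1]


p = [1, 2, 3]
swap_rows(p, 1, 3)    # first and third rows interchanged
swap_rows(p, 2, 3)    # then second and third rows interchanged

b = ["b1", "b2", "b3"]
b_star = [b[pi - 1] for pi in p]   # b*_i = b_{p_i}
print(p, b_star)
```

Only n integers are stored, compared to the n × n entries needed for a permutation matrix.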


As a final comment, for those considering writing their own LU solver, it is recommended that you first read Section 1.6.

3.3.1 Pivoting Strategies

In the above derivation, to deal with the case of when $a_{11} = 0$, one looks for a row with a nonzero entry in the first column and then interchanges that row with the first row. It is also possible that this must be done when determining the other columns in the factorization. There are numerous variations on how to pivot, some that are worth considering and some which are a waste of time. As an example of the former, one could have a situation where $a_{11} \ne 0$ but $a_{11}$ is so close to zero that dividing by it can cause numerical difficulties. In this case, even though $a_{11}$ is nonzero, pivoting is necessary. Another variation concerns which row to pivot with. As described above, one looks for the first row with a nonzero entry and then performs the interchange. Instead, one can look at all of the rows in the first column and pick the one with the largest entry, in absolute value. This is known as partial pivoting. It is also possible to pivot using the columns of the matrix, although this requires reordering the entries in the solution vector (see Exercise 3.9). For those interested in exploring the various pivoting strategies that have been proposed, Golub and Van Loan [2013] should be consulted.
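A minimal sketch of a Doolittle factorization with partial pivoting is given below. It is one possible arrangement of the computation, not the algorithm of any particular library: the function name `lu_partial_pivoting` is hypothetical, the factors overwrite a working copy of the matrix, and the row interchanges are recorded in an index list p (0-based here).

```python
def lu_partial_pivoting(A):
    """Return L, U, p such that row p[i] of A equals row i of L U."""
    n = len(A)
    M = [row[:] for row in A]        # working copy; holds L below the diagonal, U on and above
    p = list(range(n))               # records the row interchanges
    for j in range(n):
        # pick the row, from row j down, with the largest entry in column j
        r = max(range(j, n), key=lambda i: abs(M[i][j]))
        if M[r][j] == 0.0:
            raise ValueError("column of zeros: the matrix is singular")
        if r != j:
            M[j], M[r] = M[r], M[j]
            p[j], p[r] = p[r], p[j]
        for i in range(j + 1, n):
            M[i][j] /= M[j][j]       # multiplier, stored where the zero would go
            for k in range(j + 1, n):
                M[i][k] -= M[i][j] * M[j][k]
    L = [[M[i][k] if k < i else (1.0 if k == i else 0.0) for k in range(n)]
         for i in range(n)]
    U = [[M[i][k] if k >= i else 0.0 for k in range(n)] for i in range(n)]
    return L, U, p


A = [[1.0, 1.0, 1.0],
     [2.0, 4.0, 1.0],
     [5.0, -1.0, -1.0]]
L, U, p = lu_partial_pivoting(A)
print(p)
```

Because the pivot is the largest entry in its column, every multiplier stored in L has absolute value at most one, which is the property that makes partial pivoting attractive. To solve Ax = b with these factors, b is first reordered using p, as described in Section 3.3.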

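3.3.2 LU and Gaussian Elimination

One of the conventional methods for solving a linear system is Gaussian elimination. To do this one forms the augmented matrix and then row reduces to find the answer. The general form of this procedure, for $\mathbf{A}\mathbf{x} = \mathbf{b}$, is
\[ [\,\mathbf{A} \mid \mathbf{b}\,] \to [\,\mathbf{U} \mid \mathbf{y}\,] \to [\,\mathbf{I} \mid \mathbf{z}\,], \]
where $\mathbf{U}$ is an upper triangular matrix and $\mathbf{I}$ is the identity matrix. From this, the solution is $\mathbf{x} = \mathbf{z}$. As an example, consider solving
\[ \begin{pmatrix} 1 & 1 & 1 \\ 2 & 4 & 1 \\ 5 & -1 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ 2 \end{pmatrix}. \tag{3.8} \]
The steps in the reduction are given in Table 3.2. The row operation used in each step is given in the second column. For example, $-2R_1 + R_2 \to R_2$ means the first row is multiplied by $-2$, added to the second row, and the result put into the second row. The third column expresses the result in matrix form, using elementary matrices. For example,
\[ \mathbf{E}_1 = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \]
and it is easy to verify that
\[ \mathbf{E}_1\mathbf{A} = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & -1 \\ 5 & -1 & -1 \end{pmatrix}. \]
The other two elementary matrices are
\[ \mathbf{E}_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -5 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \mathbf{E}_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{pmatrix}. \]

\[
\begin{array}{lll}
\text{Augmented Form} & \text{Row Operation} & \text{Matrix Form} \\[4pt]
\left[\begin{array}{rrr|r} 1 & 1 & 1 & 1 \\ 2 & 4 & 1 & 0 \\ 5 & -1 & -1 & 2 \end{array}\right] & & \mathbf{A}\mathbf{x} = \mathbf{b} \\[14pt]
\to \left[\begin{array}{rrr|r} 1 & 1 & 1 & 1 \\ 0 & 2 & -1 & -2 \\ 5 & -1 & -1 & 2 \end{array}\right] & -2R_1 + R_2 \to R_2 & \mathbf{E}_1\mathbf{A}\mathbf{x} = \mathbf{E}_1\mathbf{b} \\[14pt]
\to \left[\begin{array}{rrr|r} 1 & 1 & 1 & 1 \\ 0 & 2 & -1 & -2 \\ 0 & -6 & -6 & -3 \end{array}\right] & -5R_1 + R_3 \to R_3 & \mathbf{E}_2\mathbf{E}_1\mathbf{A}\mathbf{x} = \mathbf{E}_2\mathbf{E}_1\mathbf{b} \\[14pt]
\to \left[\begin{array}{rrr|r} 1 & 1 & 1 & 1 \\ 0 & 2 & -1 & -2 \\ 0 & 0 & -9 & -9 \end{array}\right] & 3R_2 + R_3 \to R_3 & \mathbf{E}_3\mathbf{E}_2\mathbf{E}_1\mathbf{A}\mathbf{x} = \mathbf{E}_3\mathbf{E}_2\mathbf{E}_1\mathbf{b}
\end{array}
\]

Table 3.2 Summary of the steps when using an augmented matrix to solve (3.8).

The conclusion we make from this calculation is that $\mathbf{E}_3\mathbf{E}_2\mathbf{E}_1\mathbf{A} = \mathbf{U}$, where $\mathbf{U}$ is the upper triangular matrix given in the augmented matrix in the last step in Table 3.2. From this we conclude that
\[ \mathbf{A} = \mathbf{E}_1^{-1}\mathbf{E}_2^{-1}\mathbf{E}_3^{-1}\mathbf{U}, \]
in other words, $\mathbf{A} = \mathbf{L}\mathbf{U}$, where
\[ \mathbf{L} = \mathbf{E}_1^{-1}\mathbf{E}_2^{-1}\mathbf{E}_3^{-1}. \tag{3.9} \]
To complete the derivation we need some useful results from matrix algebra:

• If $\mathbf{E}$ is invertible and lower triangular then $\mathbf{E}^{-1}$ is lower triangular.
• If $\mathbf{L}_1$ and $\mathbf{L}_2$ are lower triangular then $\mathbf{L}_1\mathbf{L}_2$ is lower triangular.

Therefore, the matrix $\mathbf{L}$ in (3.9) is lower triangular. The above discussion involved a particular 3 × 3 matrix but the conclusion is the same for the general n × n case. Namely, by keeping track of the steps involved with Gaussian elimination one effectively constructs a LU factorization of the original matrix, and uses this to solve the equation. This also provides the motivation for looking for the LU factorization in the first place. However, once you know that you can factor the matrix in this way, the augmented matrix approach is not needed to find the factorization.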
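The matrix identities in this derivation are easy to check numerically. The sketch below uses a hypothetical helper `matmul` and integer entries so the arithmetic is exact. It also uses the fact that the inverse of each elementary matrix here is obtained by changing the sign of its single off-diagonal multiplier.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


A = [[1, 1, 1], [2, 4, 1], [5, -1, -1]]
E1 = [[1, 0, 0], [-2, 1, 0], [0, 0, 1]]
E2 = [[1, 0, 0], [0, 1, 0], [-5, 0, 1]]
E3 = [[1, 0, 0], [0, 1, 0], [0, 3, 1]]

# E3 E2 E1 A = U, the upper triangular matrix in the last row of Table 3.2
U = matmul(E3, matmul(E2, matmul(E1, A)))

# inverses of the elementary matrices: flip the sign of the multiplier
E1_inv = [[1, 0, 0], [2, 1, 0], [0, 0, 1]]
E2_inv = [[1, 0, 0], [0, 1, 0], [5, 0, 1]]
E3_inv = [[1, 0, 0], [0, 1, 0], [0, -3, 1]]

# L as defined in (3.9)
L = matmul(E1_inv, matmul(E2_inv, E3_inv))
print(U)
print(L)
print(matmul(L, U) == A)
```

Note that the entries of L below the diagonal, 2, 5, and -3, are the negatives of the multipliers used in the row operations.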

3.4 LU Method: Summary

In what follows it is assumed that $\mathbf{A}$ is a n × n non-singular matrix, and $\mathbf{x}$ and $\mathbf{b}$ are n-vectors. Also, $\mathbf{U}$ is an upper triangular matrix and $\mathbf{L}$ is a lower triangular matrix (and both are n × n). To solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ using the LU method, the following steps are taken:

1. Calculate the factorization: $\mathbf{A} = \mathbf{L}\mathbf{U}$
2. Solve for $\mathbf{y}$: $\mathbf{L}\mathbf{y} = \mathbf{b}$
3. Solve for $\mathbf{x}$: $\mathbf{U}\mathbf{x} = \mathbf{y}$

It is sometimes necessary to interchange rows to find the factorization (a process known as pivoting). Also, the factorization is not unique. This gives rise to certain choices for the diagonals in the factorization, and the two most commonly used are:

• Doolittle factorization: choose $\ell_{11} = \ell_{22} = \cdots = \ell_{nn} = 1$. In this case, $\mathbf{L}$ is said to be a unit lower triangular matrix.
• Crout factorization: choose $u_{11} = u_{22} = \cdots = u_{nn} = 1$. In this case, $\mathbf{U}$ is said to be a unit upper triangular matrix.

When solving an equation, only one of these is used when finding a factorization (i.e., you should not use both at the same time).

Example 3.1 Solve

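\[ \begin{aligned} x + y + z &= 2 \\ 2x + 4y + z &= 7 \\ 5x - y - z &= -8 \end{aligned} \]
using a LU factorization.

Answer: We will use a Doolittle factorization, and so
\[ \begin{pmatrix} 1 & 1 & 1 \\ 2 & 4 & 1 \\ 5 & -1 & -1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ \ell_{21} & 1 & 0 \\ \ell_{31} & \ell_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} = \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ \ell_{21}u_{11} & \ell_{21}u_{12} + u_{22} & \ell_{21}u_{13} + u_{23} \\ \ell_{31}u_{11} & \ell_{31}u_{12} + \ell_{32}u_{22} & \ell_{31}u_{13} + \ell_{32}u_{23} + u_{33} \end{pmatrix}. \]
From the first row we get that $u_{11} = 1$, $u_{12} = 1$, and $u_{13} = 1$. From the second row, $\ell_{21} = 2/u_{11} = 2$, $u_{22} = 4 - \ell_{21}u_{12} = 2$, and $u_{23} = 1 - \ell_{21}u_{13} = -1$. From the last row, $\ell_{31} = 5$, $\ell_{32} = -3$, and $u_{33} = -9$. The next step is to solve $\mathbf{L}\mathbf{y} = \mathbf{b}$, which is
\[ \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 5 & -3 & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 2 \\ 7 \\ -8 \end{pmatrix}. \]
The solution is $y_1 = 2$, $y_2 = 3$, and $y_3 = -9$. Lastly, we solve $\mathbf{U}\mathbf{x} = \mathbf{y}$, which is
\[ \begin{pmatrix} 1 & 1 & 1 \\ 0 & 2 & -1 \\ 0 & 0 & -9 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ -9 \end{pmatrix}. \]
From this it follows that the solution of the problem is $z = 1$, $y = 2$, and $x = -1$.

Example 3.2 Use the LU method to solve
\[ \begin{pmatrix} 2 & 1 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 5 \end{pmatrix}. \tag{3.10} \]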

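Answer: The matrix is tri-diagonal, which means the only nonzero entries are on the diagonal and on the super- and sub-diagonals. It is a type of matrix that is very common in applications. Normally, finding the LU factorization of a 4 × 4 matrix by hand is a bit tedious, but for a tri-diagonal matrix it isn’t so bad. This is because $\mathbf{L}$ and $\mathbf{U}$ are also tri-diagonal. In particular, using a Doolittle factorization, we assume that
\[ \begin{pmatrix} 2 & 1 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \ell_{21} & 1 & 0 & 0 \\ 0 & \ell_{32} & 1 & 0 \\ 0 & 0 & \ell_{43} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & 0 & 0 \\ 0 & u_{22} & u_{23} & 0 \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{pmatrix} = \begin{pmatrix} u_{11} & u_{12} & 0 & 0 \\ \ell_{21}u_{11} & \ell_{21}u_{12} + u_{22} & u_{23} & 0 \\ 0 & \ell_{32}u_{22} & \ell_{32}u_{23} + u_{33} & u_{34} \\ 0 & 0 & \ell_{43}u_{33} & \ell_{43}u_{34} + u_{44} \end{pmatrix}. \]
Starting with the first row, we conclude that $u_{11} = 2$ and $u_{12} = 1$. Dropping to the second row one finds that $\ell_{21} = 1/u_{11} = 1/2$. Finishing the second row and then continuing one finds that
\[ \mathbf{U} = \begin{pmatrix} u_{11} & u_{12} & 0 & 0 \\ 0 & u_{22} & u_{23} & 0 \\ 0 & 0 & u_{33} & u_{34} \\ 0 & 0 & 0 & u_{44} \end{pmatrix} = \begin{pmatrix} 2 & 1 & 0 & 0 \\ 0 & \frac{3}{2} & 1 & 0 \\ 0 & 0 & \frac{4}{3} & 1 \\ 0 & 0 & 0 & \frac{5}{4} \end{pmatrix}, \]
and
\[ \mathbf{L} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \ell_{21} & 1 & 0 & 0 \\ 0 & \ell_{32} & 1 & 0 \\ 0 & 0 & \ell_{43} & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ \frac{1}{2} & 1 & 0 & 0 \\ 0 & \frac{2}{3} & 1 & 0 \\ 0 & 0 & \frac{3}{4} & 1 \end{pmatrix}. \]
The next step is to solve $\mathbf{L}\mathbf{y} = \mathbf{b}$, which is
\[ \begin{pmatrix} 1 & 0 & 0 & 0 \\ \frac{1}{2} & 1 & 0 & 0 \\ 0 & \frac{2}{3} & 1 & 0 \\ 0 & 0 & \frac{3}{4} & 1 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 5 \end{pmatrix}. \]
One finds that $y_1 = y_2 = y_3 = 0$ and $y_4 = 5$. It remains to solve $\mathbf{U}\mathbf{x} = \mathbf{y}$, which is
\[ \begin{pmatrix} 2 & 1 & 0 & 0 \\ 0 & \frac{3}{2} & 1 & 0 \\ 0 & 0 & \frac{4}{3} & 1 \\ 0 & 0 & 0 & \frac{5}{4} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 5 \end{pmatrix}. \]
The solution in this case is $x_4 = 4$, $x_3 = -3$, $x_2 = 2$, and $x_1 = -1$.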
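The hand calculation in Example 3.2 can be checked with exact rational arithmetic. The sketch below assumes the column-by-column Doolittle loops of Table 3.1 (the function name `doolittle_lu` is hypothetical) and uses Python's `fractions.Fraction` so the entries 3/2, 4/3, and 5/4 appear exactly rather than as rounded decimals.

```python
from fractions import Fraction


def doolittle_lu(A):
    """Doolittle factorization without pivoting, as in Table 3.1."""
    n = len(A)
    L = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = Fraction(1)
        for i in range(j + 1):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))) / U[j][j]
    return L, U


# matrix and right-hand side from (3.10)
A = [[Fraction(v) for v in row] for row in
     [[2, 1, 0, 0], [1, 2, 1, 0], [0, 1, 2, 1], [0, 0, 1, 2]]]
b = [Fraction(v) for v in (0, 0, 0, 5)]
L, U = doolittle_lu(A)

n = len(b)
y = [Fraction(0)] * n
for i in range(n):                        # forward substitution (L has unit diagonal)
    y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
x = [Fraction(0)] * n
for i in reversed(range(n)):              # back substitution
    x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]

print([str(U[i][i]) for i in range(n)])          # diagonal of U
print([str(L[i + 1][i]) for i in range(n - 1)])  # sub-diagonal of L
print([str(v) for v in x])
```

The general-purpose loops do a lot of work multiplying by zeros here. A version that only touches the three nonzero diagonals is the subject of Section 3.8.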




Example 3.3 We will now consider the n × n version of the above example, and the equation is
\[ \begin{pmatrix} 2 & 1 & & & \\ 1 & 2 & 1 & & \\ & 1 & 2 & \ddots & \\ & & \ddots & \ddots & 1 \\ & & & 1 & 2 \end{pmatrix} \mathbf{x} = \mathbf{b}, \tag{3.11} \]
where the vector $\mathbf{b}$ is selected so that $\mathbf{x} = (1, 1, \cdots, 1)^T$ is the exact solution. The matrix equation is going to be solved using the LU method, specifically a version that is specialized to tri-diagonal matrices as described in Section 3.8. The question considered here is how accurately we are able to compute the solution using double precision arithmetic. In other words, we are interested in how the exact solution $\mathbf{x}$ compares to the computed solution $\mathbf{x}_c$. To determine this we will compute the largest number, in absolute value, in $\mathbf{x} - \mathbf{x}_c$. This number will be denoted as $\|\mathbf{x} - \mathbf{x}_c\|_\infty$. The results of the calculation are given in Table 3.3. For the smaller values of n the solution is as accurate as can be expected using double precision. The error does increase for the larger values of n, but the answer is still reasonably accurate when n = 160,000. The reason for the increase in the error is explained in Section 3.6.

n          $\|\mathbf{x} - \mathbf{x}_c\|_\infty$
4          6.66e−16
8          1.33e−15
12         1.78e−15
16         1.33e−15
160        6.26e−14
1600       1.35e−12
16000      4.97e−11
160000     3.01e−09

Table 3.3 Difference between the exact and computed solution in Example 3.3. Note that 6.66e−16 = $6.66 \times 10^{-16}$.

In looking at Table 3.3, you might wonder just how long it takes a computer to solve a 160,000 × 160,000 matrix equation. It is about 10 msec but the method uses the tri-diagonal version of the LU method (see Section 3.8). So, only the nonzero entries are stored and used. If the full matrix is used then the n = 160,000 case cannot be run on the author’s computer because of the storage requirement. Storing a m × n matrix, using double precision, requires approximately $8mn \times 10^{-9}$ GB of RAM. So, a 160,000 × 160,000 matrix requires about 205 GB. If all three matrices ($\mathbf{A}$, $\mathbf{L}$, and $\mathbf{U}$) are stored, and you want to save room for other things (like the operating system), then n = 50,000 is about as large as you can go if you have 64 GB of RAM. If you have 4 GB then the upper limit is in the neighborhood of n = 12,000.

Example 3.4 The equation to be solved is
\[ \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & 1/2 & (1/2)^2 & \cdots & (1/2)^{n-1} \\ 1 & 1/3 & (1/3)^2 & \cdots & (1/3)^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1/n & (1/n)^2 & \cdots & (1/n)^{n-1} \end{pmatrix} \mathbf{x} = \mathbf{b}. \tag{3.12} \]

The vector $\mathbf{b}$ is selected so that $\mathbf{x} = (1, 1, \cdots, 1)^T$ is the exact solution. This matrix is known as a Vandermonde matrix, and we will see similar matrices in Chapters 5 and 8. As in Example 3.3, to compare the exact solution $\mathbf{x}$ to the computed solution $\mathbf{x}_c$ we will determine the largest number, in absolute value, in the vector $\mathbf{x} - \mathbf{x}_c$. Also, as before, this number will be denoted as $\|\mathbf{x} - \mathbf{x}_c\|_\infty$. The results of the calculation are given in Table 3.4. The results are dramatically different from those obtained in Example 3.3. Namely, although the accuracy is reasonable when n = 4, when n = 16 it is horrible and it is not even defined when n = 160.

n          $\|\mathbf{x} - \mathbf{x}_c\|_\infty$
4          4.44e−16
8          3.75e−09
12         8.79e−04
16         15.2
160        NaN

Table 3.4 Difference between the exact and computed solution in Example 3.4.

What the last two examples show is that for some matrices the LU method works just great, but for others it can fail. The reason is that solving matrix equations can be very sensitive to round-off error, whether one uses LU or some other method. To understand when, and why, this happens requires analyzing the error and how it depends on the properties of the matrix. This is done in Section 3.6.
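The experiment in Example 3.4 is simple to set up, and a rough Python sketch is given below. It is not the calculation used to produce Table 3.4: it assumes a plain Doolittle factorization without pivoting, so the computed errors will generally differ from the tabulated values. The function names `lu_solve` and `vandermonde_error` are hypothetical, and only small values of n are used.

```python
def lu_solve(A, b):
    """Solve A x = b using a Doolittle factorization (no pivoting) and two triangular solves."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = 1.0
        for i in range(j + 1):
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))) / U[j][j]
    y = [0.0] * n
    for i in range(n):                      # forward substitution, L y = b
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution, U x = y
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


def vandermonde_error(n):
    """Largest entry, in absolute value, of x - x_c for the matrix in (3.12)."""
    A = [[(1.0 / i) ** k for k in range(n)] for i in range(1, n + 1)]
    b = [sum(row) for row in A]             # row sums, so the exact solution is all ones
    xc = lu_solve(A, b)
    return max(abs(1.0 - v) for v in xc)


errors = {n: vandermonde_error(n) for n in (4, 8, 12)}
for n, err in errors.items():
    print(n, err)
```

Replacing the Vandermonde matrix with the tri-diagonal matrix in (3.11) gives a way to compare the two examples using the same solver.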

3.4.1 Flops and Multicore Processors The number of flops (i.e., the number of additions, subtractions, multiplications and divisions) needed to solve the equation is: 1. finding L and U: 16 n(n − 1)(4n + 1) = 23 n3 − 12 n2 + O(n) 2. solving Ly = b and Ux = y: 2n2 + O(n) Therefore, for large matrices, a solution obtained using the LU method takes on the order of 23 n3 flops. In comparison, solving the problem by first finding A−1 and then calculating A−1 b takes on the order of 2n3 flops. The 23 n3 flop count means that if the size of the matrix is doubled then the flops increase by a factor of about 8. To investigate this, the average computing times the LU command takes using a randomly generated n × n matrix are shown in Figure 3.1. The calculations were done first using 10 cores of the processor, and then again using only one core. In both cases the algorithm does better than predicted using the flop count. For example, when going from n = 4000 to n = 8000, the computing time with a single core increases by “only” a factor of 5, and for 10 cores by a factor of 6. On the other hand, using 10 cores does not improve the speed by a factor of 10 over using a single core. It is closer to a factor of 5 for larger values of n. It is also important to appreciate the computing times required here. For a 1,000 × 1,000 matrix, the factorization using 10 cores is computed in about 7 msec. In other words, it is so fast that the computing time is imperceptible. Even for a 16,000 × 16,000 matrix it only takes about 7 seconds. These very fast computing times are due to the implementation of the LU factorization algorithm that take advantage of the hardware available, including memory

Time (sec)

103 10 10

1

-1

10-3 10-5 102

10 Cores 1 Core

103

104

105

n-axis Figure 3.1 Computing time, in seconds, to find the LU factorization of an n × n matrix using a 10 core, and with a single core, processor.


hierarchy and multiple cores. This also means taking maximum advantage of the BLAS routines mentioned in Section 1.6. Although it is straightforward to write your own LU solver, it is not so simple to write a code that uses the BLAS routines or is optimized for a multicore processor. So, for the computational problems in this text, built-in commands will mostly be used. In MATLAB this means simply using the command x=A\b or its variations (e.g., lu and chol). This will be true for most of the matrix factorization methods considered in the text. There are exceptions to this statement, and one of significance is the Thomas algorithm for solving tridiagonal matrix problems described in Section 3.8.
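The flop counts quoted in this subsection can be checked with a few lines of code. The following is a minimal Python sketch, not code from the text; the function names `lu_factor_flops`, `lu_leading_term` and `doubling_ratio` are hypothetical and introduced only for illustration. It evaluates the exact count $\frac{1}{6}n(n-1)(4n+1)$, compares it with the leading term $\frac{2}{3}n^3$, and computes the growth factor when n is doubled.

```python
def lu_factor_flops(n):
    """Exact flop count for finding L and U: n(n-1)(4n+1)/6."""
    # n(n-1)(4n+1) is always divisible by 6, so integer division is exact.
    return n * (n - 1) * (4 * n + 1) // 6


def lu_leading_term(n):
    """Leading-order estimate (2/3) n^3 of the factorization flop count."""
    return 2.0 * n**3 / 3.0


def doubling_ratio(n):
    """Factor by which the factorization flops grow when n is doubled."""
    return lu_factor_flops(2 * n) / lu_factor_flops(n)


if __name__ == "__main__":
    # Columns: n, exact count, exact / leading term, growth factor n -> 2n
    for n in (10, 100, 1000, 4000):
        exact = lu_factor_flops(n)
        print(n, exact, exact / lu_leading_term(n), doubling_ratio(n))
```

For large n the ratio in the last column approaches 8, which is the flop-count prediction that the measured timings in Figure 3.1 are compared against.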

3.5 Vector and Matrix Norms

The scalar-valued function $\|x\|_\infty$ introduced in Examples 3.3 and 3.4 is an example of a vector norm. A norm is a generalization of length, and as such it is required to have the basic properties of length. To illustrate, if x has “length” 3, then 5x should have “length” 15 (i.e., if $\|x\| = 3$ then $\|5x\| = 15$). The other two requirements are that it satisfies the triangle inequality, and the only vector with “length” zero is the zero vector. These are formalized in the following definition.

Definition of a Vector Norm. Given $x \in \mathbb{R}^n$, its norm is designated as $\|x\|$. To qualify to be a norm, it must have the following three properties:
1. $\|\alpha x\| = |\alpha| \cdot \|x\|$
2. $\|x + y\| \le \|x\| + \|y\|$
3. If $\|x\| = 0$, then $x = 0$
These are required to hold for all $x, y \in \mathbb{R}^n$, and all numbers (scalars) α.

An example is the norm used in calculus, which is defined as
$$\|x\|_2 \equiv \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}. \tag{3.13}$$
This is known as the Euclidean norm, or the 2-norm, or the $\ell_2$-norm. Another norm we will often use is
$$\|x\|_\infty \equiv \max\{\, |x_1|, |x_2|, \cdots, |x_n| \,\}, \tag{3.14}$$
which is known as the ∞-norm, or the $\ell_\infty$-norm. A third norm that is used in scientific computing is
$$\|x\|_1 \equiv |x_1| + |x_2| + \cdots + |x_n|, \tag{3.15}$$


and this is known as the 1-norm, or the $\ell_1$-norm. Note that norms are dimensionally consistent with the vectors considered. For example, if x is a position vector, so the entries have the dimension of length, then the norm of x has the dimension of length.

Example 3.5
1. If $x = (2, -1)^T$, then $\|x\|_2 = \sqrt{2^2 + 1^2} = \sqrt{5}$, $\|x\|_\infty = \max\{2, 1\} = 2$, and $\|x\|_1 = |2| + |-1| = 3$.
2. If $x \in \mathbb{R}^n$ with $x = (1, 1, 1, \cdots, 1)^T$ then $\|x\|_2 = \sqrt{n}$, $\|x\|_\infty = 1$, and $\|x\|_1 = n$.
3. To show that $\|x\|_\infty$ is a vector norm, note that
$$\|\alpha x\|_\infty = \max\{\, |\alpha x_1|, |\alpha x_2|, \cdots, |\alpha x_n| \,\} = \max\{\, |\alpha|\,|x_1|, |\alpha|\,|x_2|, \cdots, |\alpha|\,|x_n| \,\} = |\alpha| \max\{\, |x_1|, |x_2|, \cdots, |x_n| \,\} = |\alpha|\, \|x\|_\infty.$$
Also, let j be the value where $\|x + y\|_\infty = |x_j + y_j|$. Because $|x_j + y_j| \le |x_j| + |y_j| \le \|x\|_\infty + \|y\|_\infty$, it follows that $\|x + y\|_\infty \le \|x\|_\infty + \|y\|_\infty$. Finally, it is clear that if $\|x\|_\infty = 0$ then $x = 0$.
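The three vector norms in (3.13)-(3.15) are straightforward to compute. The following is a minimal Python sketch using only the standard library; the function names `norm_1`, `norm_2` and `norm_inf` are hypothetical names chosen for this illustration. It reproduces the values in the first two parts of Example 3.5.

```python
import math


def norm_1(x):
    """1-norm: sum of the absolute values of the entries, as in (3.15)."""
    return sum(abs(v) for v in x)


def norm_2(x):
    """2-norm (Euclidean norm): square root of the sum of squares, as in (3.13)."""
    return math.sqrt(sum(v * v for v in x))


def norm_inf(x):
    """Infinity-norm: largest entry in absolute value, as in (3.14)."""
    return max(abs(v) for v in x)


if __name__ == "__main__":
    x = [2, -1]                      # vector from Example 3.5, part 1
    print(norm_1(x), norm_2(x), norm_inf(x))

    ones = [1] * 100                 # vector of ones from part 2, with n = 100
    print(norm_1(ones), norm_2(ones), norm_inf(ones))
```

For the vector of ones the three values are n, $\sqrt{n}$ and 1, which illustrates how different the norms can be when n is large.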

If you are wondering how the norms compare, the answer can be inferred from the previous examples. Another way, for $x \in \mathbb{R}^2$, can be obtained from Figure 3.2 using the usual relationship between the lengths of the sides of a right triangle. Namely, $\max\{\, |x_1|, |x_2| \,\} \le \sqrt{x_1^2 + x_2^2} \le |x_1| + |x_2|$. In other words, $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$. As demonstrated in the above example, for large vectors (big n), the three vector norms can produce significantly different values. A natural question to ask is, why consider different vector norms? One reason is that the norm should be a reflection of what you are trying to

Figure 3.2 Components of the three vector norms in $\mathbb{R}^2$.


accomplish. For example, many numerical methods produce a sequence of approximations $x_1, x_2, x_3, \cdots$ that converge to the desired solution. The usual way it is decided when to stop computing is through a requirement of the form $\|x_{k+1} - x_k\| \le tol$, where tol is a given error tolerance. One should pick a vector norm that best achieves this goal. For example, using the ∞-norm one is requiring that every element of $x_{k+1} - x_k$ is smaller than tol (in absolute value). This is a reasonable requirement, and does not take long to compute. The 2-norm also plays an important role, particularly in the least squares material in Chapters 8 and 10. It is also important when considering methods involving orthonormal vectors because of its connection with the dot product through the formula $\|x\|_2 = \sqrt{x \cdot x}$.
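The stopping requirement $\|x_{k+1} - x_k\|_\infty \le tol$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `iterate_until_converged` and the toy update rule `halve_and_shift` are hypothetical, and the update rule is chosen solely because its limit is known, not because it is a method from the text.

```python
def norm_inf(x):
    """Infinity-norm of a vector stored as a list."""
    return max(abs(v) for v in x)


def iterate_until_converged(step, x0, tol, max_iter=1000):
    """Apply x_{k+1} = step(x_k) until ||x_{k+1} - x_k||_inf <= tol.

    Returns the last iterate and the number of steps taken.
    """
    x_old = list(x0)
    for k in range(1, max_iter + 1):
        x_new = step(x_old)
        diff = [a - b for a, b in zip(x_new, x_old)]
        if norm_inf(diff) <= tol:      # every entry changed by at most tol
            return x_new, k
        x_old = x_new
    raise RuntimeError("no convergence within max_iter steps")


# Toy update (an assumption for illustration): x -> x/2 + c, whose limit is 2c.
c = [1.0, -2.0, 3.0]


def halve_and_shift(x):
    return [0.5 * xi + ci for xi, ci in zip(x, c)]


x_final, steps = iterate_until_converged(halve_and_shift, [0.0, 0.0, 0.0], 1e-8)
```

Because the test uses the ∞-norm, the iteration stops only when every component of the update is below the tolerance, which is the interpretation given in the paragraph above.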

3.5.1 Matrix Norm

The requirements for a matrix norm are the same as for a vector norm. So, for example, it must satisfy the triangle inequality $\|A + B\| \le \|A\| + \|B\|$. In this chapter, because we are concentrating on solving Ax = b, what matrix norm to use will be determined by the vector norm that has been selected. This is because a vector norm can be used to define a norm of a matrix. This is done by comparing the size of Ax to the size of x. The definition is
$$\|A\| \equiv \max_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{3.16}$$
Because matrices have the property that $A(\alpha x) = \alpha Ax$, and norms have the property that $\|\alpha x\| = |\alpha| \cdot \|x\|$, this definition can be rewritten as
$$\|A\| = \max_{\|x\| = 1} \|Ax\|. \tag{3.17}$$

The fact that these satisfy the required conditions to be a norm is considered in Exercise 3.22. Using the above formulas, the vector norm $\|x\|_\infty$ gives rise to the matrix norm $\|A\|_\infty$, $\|x\|_2$ gives rise to the matrix norm $\|A\|_2$, and $\|x\|_1$ gives rise to the matrix norm $\|A\|_1$. There is a geometric interpretation of $\|A\|_2$ in the case of when the matrix is invertible that is worth knowing about. To explain, if $x = (x_1, x_2)^T$, then the equation $\|x\|_2 = 1$ corresponds to the unit circle $x_1^2 + x_2^2 = 1$. This is shown on the left in Figure 3.3. If you evaluate y = Ax for x’s on this circle you get an ellipse as shown on the right in Figure 3.3. According to (3.17), $\|A\|_2$ equals the largest distance between the origin and the points on this ellipse. In other words, $\|A\|_2 = a$, where a is the semi-major axis of the ellipse. Moreover, if b is the semi-minor axis of the ellipse, then $\|A^{-1}\|_2 = 1/b$. There

Figure 3.3 Using the formula y = Ax, the invertible 2 × 2 matrix A maps the circle $x_1^2 + x_2^2 = 1$, on the left, onto an ellipse centered at the origin, on the right. With this, $\|A\|_2 = a$ and $\|A^{-1}\|_2 = 1/b$. [Two plots: the unit circle in the ($x_1$, $x_2$) plane and its image in the ($y_1$, $y_2$) plane, with both axes running from −2 to 2.]

are similar interpretations for $\|A\|_\infty$ and $\|A\|_1$, and these are considered in Exercise 3.17.

The definition in (3.16), and the version given in (3.17), are useful for the more theoretical aspects of the subject, but they are not particularly useful for calculating a matrix norm. Fortunately, it is possible to derive easier to use formulas for $\|A\|_\infty$ and $\|A\|_1$. In particular, one can show that the ∞-norm of a matrix reduces to
$$\|A\|_\infty = \max\Big\{ \sum_{j=1}^{n} |a_{1j}|,\ \sum_{j=1}^{n} |a_{2j}|,\ \cdots,\ \sum_{j=1}^{n} |a_{nj}| \Big\} = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|. \tag{3.18}$$

In other words, the ∞-norm is determined by the largest row sum (of absolute values). In contrast, the 1-norm of a matrix reduces to
$$\|A\|_1 = \max\Big\{ \sum_{i=1}^{n} |a_{i1}|,\ \sum_{i=1}^{n} |a_{i2}|,\ \cdots,\ \sum_{i=1}^{n} |a_{in}| \Big\} = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|, \tag{3.19}$$

which means that the 1-norm is determined by the largest column sum (of absolute values). One way to remember these two formulas is that ∞ is horizontal (rows), while 1 is vertical (columns).
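The row-sum and column-sum formulas (3.18) and (3.19) translate directly into code. The following is a minimal Python sketch in which a matrix is stored as a list of rows; the function names `matrix_norm_inf` and `matrix_norm_1` are hypothetical. The test matrix is the one used in Example 3.6.

```python
def matrix_norm_inf(A):
    """Infinity-norm of A: the largest row sum of absolute values, as in (3.18)."""
    return max(sum(abs(a) for a in row) for row in A)


def matrix_norm_1(A):
    """1-norm of A: the largest column sum of absolute values, as in (3.19)."""
    n_cols = len(A[0])
    return max(sum(abs(row[j]) for row in A) for j in range(n_cols))


# Matrix from Example 3.6
A = [[1, 2, -3],
     [4, -5, 6],
     [-7, 8, 9]]

norm_inf_A = matrix_norm_inf(A)   # row sums are 6, 15, 24
norm_1_A = matrix_norm_1(A)       # column sums are 12, 15, 18
```

Both functions need only one pass over the entries of the matrix, which is why these two norms are the ones typically used when a matrix norm has to be computed.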


Unfortunately, there is no tidy little formula for $\|A\|_2$ that is easy to calculate for large matrices. It is possible to connect $\|A\|_2$ with what are called the singular values of the matrix, which are related to the major and minor axis of an ellipse shown in Figure 3.3, and the relevant formula is given in Section 4.6.4.

Example 3.6
If
$$A = \begin{pmatrix} 1 & 2 & -3 \\ 4 & -5 & 6 \\ -7 & 8 & 9 \end{pmatrix},$$
then $\|A\|_\infty = \max\{1 + 2 + 3,\ 4 + 5 + 6,\ 7 + 8 + 9\} = \max\{6, 15, 24\} = 24$, and $\|A\|_1 = \max\{1 + 4 + 7,\ 2 + 5 + 8,\ 3 + 6 + 9\} = \max\{12, 15, 18\} = 18$.

A matrix norm that is derived from a vector norm is called an induced matrix norm, or subordinate norm or a natural norm. An example of a non-induced matrix norm is $\|A\|_S \equiv \sum_{i,j} |a_{ij}|$. Although this is not obtained using a vector norm it satisfies, as it must, the three requirements for a norm (see Exercise 3.22). A non-induced norm that often arises in data analysis is the Frobenius norm, and this is introduced in Chapter 10.

There are several important properties of matrix norms, and the ones we need are listed below. Some others can be found in Exercise 3.22.

Basic Properties of an Induced Matrix Norm: For the matrix norm given in (3.16), and in (3.17), the following hold:
1. If I is the identity matrix, then $\|I\| = 1$.
2. $\|\alpha A\| = |\alpha| \cdot \|A\|$, for any number α.
3. $\|Ax\| \le \|A\| \cdot \|x\|$
4. $\|AB\| \le \|A\| \cdot \|B\|$

Proof: The first follows directly from (3.16), and the second is a consequence of $\|\alpha x\| = |\alpha|\, \|x\|$. The third holds if $x = 0$, and when $x \ne 0$, the inequality follows from (3.16). As for the fourth, given that $(AB)x = A(Bx)$, then using Property 3 (twice), $\|(AB)x\| \le \|A\| \cdot \|Bx\| \le \|A\| \cdot \|B\| \cdot \|x\|$. Assuming $x \ne 0$, and rewriting the last inequality as $\|(AB)x\| / \|x\| \le \|A\| \cdot \|B\|$, then the result follows from (3.16).


Non-induced matrix norms do not necessarily satisfy the above properties. As an example, ||I||S = n, so it does not satisfy the first property listed.

3.6 Error and Residual

Letting x be the exact solution and $x_c$ the computed solution:
• The error vector e is defined as
$$e \equiv x - x_c \tag{3.20}$$
• The residual vector r is defined as
$$r \equiv b - Ax_c \tag{3.21}$$

Although having zero error is desired, the reality is that when using floating point numbers the best we can expect are relative errors on the order of machine ε. Another complication is that for most problems we don’t know x and are therefore not able to calculate e. The residual, however, is something we can calculate. Everything that follows is based on the goal of using the residual to determine, or estimate, the error in the calculated solution. The first step is to realize that these two vectors are related through the formula
$$r = Ae, \tag{3.22}$$
or equivalently
$$e = A^{-1} r. \tag{3.23}$$
We would like to be able to state that if r is small then so is e. However, as the above formula shows, it might happen that the multiplication by $A^{-1}$ takes a small r and produces a large e. How to relate these two vectors is given next.

Theorem 3.2 Assuming A is non-singular and b is nonzero, then
$$\frac{\|e\|}{\|x\|} \le \kappa(A) \frac{\|r\|}{\|b\|}, \tag{3.24}$$
where κ(A) is the condition number of A and is defined as
$$\kappa(A) \equiv \|A\| \cdot \|A^{-1}\|. \tag{3.25}$$

Proof: First note that since b is nonzero then x is nonzero. The conclusion is a consequence of two inequalities:
(i) Since Ax = b then $\|b\| = \|Ax\| \le \|A\| \cdot \|x\|$. From this we have that
$$\frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|}.$$


(ii) Since $e = A^{-1} r$, then $\|e\| \le \|A^{-1}\| \cdot \|r\|$.
The stated inequality follows by simply combining the conclusions from (i) and (ii).

The above result holds for any norm. But, it is understood that the same norm is used in all expressions. In words, (3.24) states that the relative error $\|e\| / \|x\|$ is never more than the relative residual $\|r\| / \|b\|$ multiplied by the condition number of the matrix. As it turns out, it is not unusual that the relative error comes close to equaling the maximum value allowed in (3.24). This means that
$$\frac{\|e\|}{\|x\|} \approx \kappa(A) \frac{\|r\|}{\|b\|}.$$
In such cases, even if the relative residual is small, a very large value of κ(A) means that the relative error is large. This is the motivation for referring to matrices with very large condition numbers as being ill-conditioned.

One last comment to make about (3.24) is that it holds no matter how you have determined $x_c$. You could have used the LU method, used the conjugate gradient method (described in Chapter 9), or you could have simply guessed the solution.

3.6.1 Correct Significant Digits

The connections between the expected number of correct significant digits and the relative error were discussed in Section 1.5. When vectors are involved, the situation is more complicated. To illustrate, suppose the exact value is $x = (100, 1, 0)^T$ and the computed value is $x_c = (100.1, 1.1, 0.1)^T$. Using the ∞-norm, the relative error is
$$\frac{\|x - x_c\|_\infty}{\|x\|_\infty} = 10^{-3}.$$
While it is true that the first entry in $x_c$ is correct to three digits, the second entry is correct to only one digit, and the concept of significant digits is not even applicable to the third entry. So, when dealing with a nonzero vector x, the connection of the relative error with the number of correct significant digits is guaranteed to only apply to the largest entry, in absolute value, of x. This is also assuming that the ∞-norm is used.
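The difference between the relative error of the vector and the relative errors of its individual entries can be seen numerically. The following is a minimal Python sketch using the two vectors from the discussion above; the variable names are hypothetical, and an entry whose exact value is zero is reported as `None` because a relative error is not defined for it.

```python
def norm_inf(x):
    """Infinity-norm of a vector stored as a list."""
    return max(abs(v) for v in x)


x = [100.0, 1.0, 0.0]       # exact value
xc = [100.1, 1.1, 0.1]      # computed value

# Relative error of the vector, measured in the infinity-norm
diff = [a - b for a, b in zip(x, xc)]
rel_err = norm_inf(diff) / norm_inf(x)

# Relative error of each entry separately (None where the exact entry is zero)
entry_rel_err = [abs(a - b) / abs(a) if a != 0 else None for a, b in zip(x, xc)]
```

The vector relative error is about $10^{-3}$, which matches the first (largest) entry, while the second entry has a relative error of about $10^{-1}$.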

3.6.2 The Condition Number

As illustrated in the above theorem, the condition number plays a central role in determining how accurately the matrix equation can be solved. We are using κ(A) to designate this number but another common notation is cond(A). Also, if a particular norm is being used, this is indicated using a subscript. So $\kappa_\infty(A)$ indicates that the ∞-norm is being used. Any formula without a subscript means it applies irrespective of the norm.

Example 3.7
If
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad \text{then} \quad A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}. \tag{3.26}$$
This assumes, of course, that $ad - bc \ne 0$. With this $\|A\|_\infty = \max\{\, |a| + |b|,\ |c| + |d| \,\}$, and
$$\|A^{-1}\|_\infty = \frac{1}{|ad - bc|} \max\{\, |d| + |b|,\ |c| + |a| \,\}. \tag{3.27}$$
So, for example, if
$$A = \begin{pmatrix} 1 & -2 \\ -3 & 4 \end{pmatrix}, \tag{3.28}$$
then $\|A\|_\infty = 7$ and $\|A^{-1}\|_\infty = 3$. Consequently, $\kappa_\infty(A) = 21$.
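For a 2 × 2 matrix the calculation in Example 3.7 can be sketched in Python as follows. This is an illustration only, with hypothetical function names (`inverse_2x2`, `matrix_norm_inf`, `cond_inf_2x2`); it forms the inverse explicitly using (3.26), which is reasonable for a 2 × 2 matrix but is not how condition numbers are computed for large matrices.

```python
def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix using formula (3.26)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (ad - bc = 0)")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def matrix_norm_inf(A):
    """Infinity-norm: the largest row sum of absolute values."""
    return max(sum(abs(v) for v in row) for row in A)


def cond_inf_2x2(A):
    """Condition number kappa_inf(A) = ||A||_inf * ||A^{-1}||_inf, as in (3.25)."""
    return matrix_norm_inf(A) * matrix_norm_inf(inverse_2x2(A))


A = [[1.0, -2.0],
     [-3.0, 4.0]]          # matrix (3.28)
kappa = cond_inf_2x2(A)
```

For the matrix in (3.28) the two norms are 7 and 3, so `kappa` agrees with the value 21 found in the example.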



Example 3.8
The equation below, on the left, is to be solved. However, the person solving the problem thinks it can be approximated with the simpler equation on the right, the solution of which is $x_c = (2, 0)^T$. This example explores how well $x_c$ approximates the solution of the original problem.
$$\begin{pmatrix} 15 & 14 \\ 14 & 13.04 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 30.03 \\ 28 \end{pmatrix} \qquad\qquad \begin{pmatrix} 15 & 14 \\ 14 & 13 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 30 \\ 28 \end{pmatrix}$$

Question 1: Is the error in the solution about the same as the error when approximating the equation?
Answer: No. The solution is $x = (1.022, 1.05)^T$, so the error is $e = x - x_c = (-0.978, 1.05)^T$. These values are significantly larger than the 0.03 and the 0.04 terms dropped when approximating the equation.


Question 2: What is the relative residual and the relative error?
Answer: The A and b are determined by the original equation. Since $\|b\|_\infty \approx 30$ and $\|r\|_\infty = \|b - Ax_c\|_\infty = 0.03$, then
$$\frac{\|r\|_\infty}{\|b\|_\infty} \approx 10^{-3}.$$
So, the relative residual is reflective of the error made in approximating the original problem. As for the relative error, one finds that
$$\frac{\|e\|_\infty}{\|x\|_\infty} = 1.$$
Therefore, the relative error is significantly larger than the relative residual.

Question 3: Determine $\kappa_\infty(A)$. Does its value help explain why the relative error is so much larger than the relative residual?
Answer: Since $\|A\|_\infty = 29$ and, using (3.27), $\|A^{-1}\|_\infty = 145/2$, then $\kappa_\infty(A) \approx 2100$. As mentioned in the comments following Theorem 3.2, it is not unusual that the relative error comes close to the upper bound given in (3.24). This happens in this example. Namely, even though the relative residual is rather small, it is multiplied by the condition number. So, although the relative error does not equal the theoretical maximum value of $\kappa_\infty \|r\|_\infty / \|b\|_\infty \approx 2.1$, it comes close.

The above example illustrates a complication that we will often run into. Namely, even though you have a reasonably accurate approximation of the original problem, such as what you get using double precision, the problem has a sensitivity to such approximations and ends up magnifying the error. The condition number is what we will use to measure this sensitivity and the above example helps explain why. The following example is another illustration of this.

Figure 3.4 The circle $b_1^2 + b_2^2 = r^2$ is shown on the upper left plot (in red), and the resulting ellipse that comes from the formula $x = A^{-1} b$ is shown in the other four plots for different values of d. [Plots: the circle in the ($b_1$, $b_2$) plane, and the ellipses in the ($x_1$, $x_2$) plane for d = 2 ($\kappa_\infty$ = 9), d = 1.5 ($\kappa_\infty$ = 12), d = 1.1 ($\kappa_\infty$ = 44), and d = 1.05 ($\kappa_\infty$ = 82).]

Example 3.9
A way to visualize the importance of the condition number is to consider the formula for the solution, which is $x = A^{-1} b$. Given that x = 0 if b = 0, then it should be that if b is close to zero then x should be as well. To investigate this, what is shown in Figure 3.4, in red, is the circle $b_1^2 + b_2^2 = r^2$. For points on this circle, the corresponding curve determined using the formula $x = A^{-1} b$ is an ellipse. Taking
$$A = \begin{pmatrix} 1 & -1 \\ -1 & d \end{pmatrix}, \tag{3.29}$$
the ellipse obtained for different values of d is shown in Figure 3.4. In the case of when d = 2, so $\kappa_\infty(A) = 9$, it is seen that the points on the ellipse are about as close, if not closer, to the origin than the points on the circle. The situation changes as the condition number increases. When this happens the ellipse becomes more elongated, resulting in points on the ellipse being increasingly farther from the origin. For very large condition numbers, such as $\kappa_\infty(A) \approx 10^8$, a plot of the corresponding ellipse would appear simply as a straight line. The significance of this is that when solving Ax = b, a small change in b, such as might occur when replacing it with its floating-point approximation, will result in the solution moving to some other location in the


elongated ellipse. The greater the elongation, the more likely it will be that the solution is far from what it would be for the original b. In other words, the difference between the computed and exact solutions likely increases as the condition number increases.

Example 3.10
Suppose A is a diagonal 3 × 3 matrix, which means it can be written as
$$A = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix}. \tag{3.30}$$
In this case, $\|A\|_\infty = \max\{\, |d_1|, |d_2|, |d_3| \,\}$. Also, assuming the diagonals are nonzero,
$$A^{-1} = \begin{pmatrix} 1/d_1 & 0 & 0 \\ 0 & 1/d_2 & 0 \\ 0 & 0 & 1/d_3 \end{pmatrix}.$$
From this it follows that
$$\|A^{-1}\|_\infty = \max\{\, |1/d_1|, |1/d_2|, |1/d_3| \,\} = \frac{1}{\min\{\, |d_1|, |d_2|, |d_3| \,\}}.$$
Therefore,
$$\kappa_\infty(A) = \frac{\max\{\, |d_1|, |d_2|, |d_3| \,\}}{\min\{\, |d_1|, |d_2|, |d_3| \,\}}.$$
This shows that for this matrix the condition number is not affected so much by how large or small the $d_i$’s are but rather how different they are. For example, if $d_1 = d_2 = d_3 = 10^{-10}$ or if $d_1 = d_2 = d_3 = 10^{10}$ then $\kappa_\infty(A) = 1$ and we have a well-conditioned matrix. However, if $d_1 = d_2 = 10^{-10}$ and $d_3 = 10^{10}$ then $\kappa_\infty(A) = 10^{20}$ and we have an ill-conditioned matrix.

The condition number has several useful properties, and some are listed below.

Basic Properties of the Condition Number: Assuming that A and B are n × n and invertible, and an induced matrix norm is used, then the following hold.
1. κ(I) = 1, where I is the identity matrix
2. κ(A) ≥ 1
3. For any nonzero number α, κ(αA) = κ(A)


4. $\kappa(A) = \kappa(A^{-1})$
5. $\kappa(BA) \le \kappa(B)\kappa(A)$

Proof: Note that Property 1 holds because $I^{-1} = I$ and $\|I\| = 1$. As for Property 2, since $AA^{-1} = I$, it follows that $\|I\| \le \|A\| \cdot \|A^{-1}\|$ and this means that $\|A\| \cdot \|A^{-1}\| \ge 1$. Property 3 holds because $\|\alpha A\| = |\alpha| \cdot \|A\|$ and $\|(\alpha A)^{-1}\| = |\alpha|^{-1} \cdot \|A^{-1}\|$, and Property 4 is an immediate consequence of the definition of the condition number. Finally, for Property 5, given that $\|BA\| \cdot \|(BA)^{-1}\| = \|BA\| \cdot \|A^{-1} B^{-1}\|$, it follows that $\|BA\| \cdot \|(BA)^{-1}\| \le \|B\| \cdot \|A\| \cdot \|A^{-1}\| \cdot \|B^{-1}\|$.
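Returning to Example 3.8, the quantities appearing in the bound (3.24) can be computed directly. The following is a minimal Python sketch with hypothetical helper names; it uses the 2 × 2 inverse formula (3.26) to obtain the exact solution, which is only practical because the system is so small.

```python
def norm_inf_vec(x):
    """Infinity-norm of a vector."""
    return max(abs(v) for v in x)


def norm_inf_mat(A):
    """Infinity-norm of a matrix: the largest row sum of absolute values."""
    return max(sum(abs(v) for v in row) for row in A)


def matvec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix using formula (3.26)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Data from Example 3.8: original system and the approximate solution xc
A = [[15.0, 14.0], [14.0, 13.04]]
b = [30.03, 28.0]
xc = [2.0, 0.0]

A_inv = inverse_2x2(A)
x = matvec(A_inv, b)                                   # exact solution
e = [xi - xci for xi, xci in zip(x, xc)]               # error, as in (3.20)
r = [bi - yi for bi, yi in zip(b, matvec(A, xc))]      # residual, as in (3.21)

rel_err = norm_inf_vec(e) / norm_inf_vec(x)
rel_res = norm_inf_vec(r) / norm_inf_vec(b)
kappa = norm_inf_mat(A) * norm_inf_mat(A_inv)          # condition number (3.25)
bound = kappa * rel_res                                # right-hand side of (3.24)
```

The computed relative error is about 1 and the bound is about 2.1, so the inequality (3.24) holds and, as stated in the example, the relative error comes reasonably close to the bound.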

3.6.3 A Heuristic for the Error

A rule of thumb has been developed that relates the condition number to the accuracy of the computed solution. It begins with the assumption that
$$\frac{\|x - x_c\|}{\|x\|} \approx \varepsilon \kappa(A). \tag{3.31}$$
This comes from a worst-case analysis applied to (3.24). Namely, even though you might solve the equation to the accuracy allowed using floating-point arithmetic, so the relative residual is on the order of machine ε, the relative error in the solution is as bad as permitted in (3.24). The second assumption is that if $\varepsilon \approx 10^{-p}$ and $\kappa_\infty(A) \approx 10^{q}$ then $x_c$ probably has entries that are correct to no more than about p − q digits. In particular, when using double precision, this difference is 16 − q. Note that when this difference is negative then the heuristic, and probably the computed solution, have no meaning.

Example 3.11
The table from Example 3.3 in Section 3.4, which was computed for a tridiagonal matrix, is repeated in Table 3.5 (left side) but two columns are added, one giving the condition number and the other giving the value of $\varepsilon \kappa_\infty(A)$. Note the ∞-norm is used here, in which case $\|x\|_\infty = 1$. Similarly, the table from Example 3.4, which was computed for the Vandermonde matrix, is repeated in Table 3.5 (right side). These results show that the heuristic is a bit pessimistic in the sense that the actual error is better than what is predicted by the heuristic. It is also clear that the Vandermonde matrix is ill-conditioned except for very small values of n. The tri-diagonal matrix, in contrast, is reasonably well-conditioned for all values of n.

                      Example 3.3                            Example 3.4
    n        ||x − xc||∞   κ∞(A)     εκ∞(A)        ||x − xc||∞   κ∞(A)     εκ∞(A)
    4        6.7e−16       1.2e+01   2.7e−15       4.4e−16       1.5e+03   3.4e−13
    8        1.3e−15       4.0e+01   8.9e−15       3.8e−09       4.5e+08   1.0e−07
    12       1.8e−15       8.4e+01   1.9e−14       8.8e−04       1.1e+15   2.4e−01
    16       1.3e−15       1.4e+02   3.2e−14       1.5e+01       1.2e+18   2.6e+02
    160      6.3e−14       1.3e+04   2.9e−12       NaN           Inf       Inf
    1600     1.4e−12       1.3e+06   2.8e−10
    16000    5.0e−11       1.3e+08   2.8e−08

Table 3.5 Error when computing the solution for Example 3.3, on left, and for Example 3.4, on right, from Section 3.4. Because ||x||∞ = 1, according to the heuristic, εκ∞ (A) is an estimate of the error.
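The heuristic in (3.31), and the associated p − q digit estimate, can be sketched in Python as follows. The function names are hypothetical, and the value used for machine ε, namely $2^{-52} \approx 2.2 \times 10^{-16}$, is an assumption made here for double precision; it appears to be consistent with the $\varepsilon \kappa_\infty(A)$ columns of Table 3.5.

```python
import math

EPS = 2.0 ** -52   # assumed value of machine epsilon for double precision


def heuristic_relative_error(kappa, eps=EPS):
    """Estimate of the relative error from (3.31): eps * kappa."""
    return eps * kappa


def estimated_correct_digits(kappa, eps=EPS):
    """Estimate p - q of the number of correct digits.

    Here eps = 10^(-p) and kappa = 10^q, so p - q = -log10(eps * kappa).
    A negative value means the heuristic gives no useful information.
    """
    return -math.log10(eps * kappa)


if __name__ == "__main__":
    # Condition numbers taken from the right side of Table 3.5
    for kappa in (1.5e3, 4.5e8, 1.1e15, 1.2e18):
        print(kappa, heuristic_relative_error(kappa), estimated_correct_digits(kappa))
```

With this value of ε the estimate is p ≈ 15.65 rather than the rounded value 16 used in the text, so the digit counts produced by the sketch differ slightly from 16 − q.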

3.7 Positive Definite Matrices

One of the more common numerical problems that arises in continuum mechanics or electrodynamics involves solving equations with matrices such as
$$\begin{pmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} 4 & -1 & 0 & -1 & 0 & 0 \\ -1 & 4 & -1 & 0 & -1 & 0 \\ 0 & -1 & 4 & -1 & 0 & -1 \\ -1 & 0 & -1 & 4 & -1 & 0 \\ 0 & -1 & 0 & -1 & 4 & -1 \\ 0 & 0 & -1 & 0 & -1 & 4 \end{pmatrix}.$$
These matrices have several properties that have a significant impact on the numerical methods that can be used. The two that are of interest here are that they are symmetric and positive definite. For those who are unfamiliar with the latter property, its definition is given next.

Definition of a Positive Definite Matrix. If A is a symmetric matrix, then A is positive definite if, and only if, either of the following hold:
1. $x^T A x > 0$, $\forall\, x \ne 0$, or
2. A has only positive eigenvalues.

To put this definition on solid ground, it is necessary to prove that any symmetric matrix that satisfies the first condition also satisfies the second condition (and vice versa). This is easy to do, and to illustrate, given an eigenvalue λ, and a corresponding eigenvector x, then x is nonzero and Ax = λx. With this, $x^T A x = \lambda\, x \cdot x$. Since $x \cdot x > 0$, and if it is true that $x^T A x > 0$, it then


follows that λ > 0. The proof of the other direction requires a result from linear algebra about the eigenvalues for a symmetric matrix (see Theorem 4.1) and is left as an exercise.

The fact is that the above definition is not particularly easy to use even though the idea being defined is very important. It turns out that there are some easy to use tests for determining whether or not a matrix is positive definite. We begin with the negative results, namely ways to determine if a matrix does not have this property.

Theorem 3.3 Assume A is a symmetric matrix.
1. A is not positive definite if any diagonal entry is negative or zero.
2. A is not positive definite if the largest number in A, in absolute value, appears off the diagonal.
3. A is not positive definite if det(A) ≤ 0.

Proof: In regard to the first statement, suppose $a_{ii} \le 0$. Taking $x = e_i$, where $e_i$ is the ith coordinate vector, then $x^T A x = a_{ii}$. The latter number is not positive and so A is not positive definite. The second statement is proved in a similar manner, but using $x = e_i - e_j$ and $x = e_i + e_j$. The third statement follows from the result from linear algebra which states that if $\lambda_1, \lambda_2, \cdots, \lambda_n$ are the eigenvalues of A then $\det(A) = \lambda_1 \lambda_2 \cdots \lambda_n$. Since the eigenvalues of a symmetric positive definite matrix are positive, then the product of the eigenvalues must be positive.

The first two conditions are easy to use even on very large matrices, while the usefulness of the third condition is limited to smaller or very simple matrices.

Example 3.12
Use Theorem 3.3, when possible, to determine if the symmetric matrix is positive definite.

1. $A = \begin{pmatrix} 4 & 1 \\ 1 & -4 \end{pmatrix}$
Answer: It is not positive definite because it has a negative diagonal entry. Also, det(A) < 0.

2. $A_1 = \begin{pmatrix} 1 & 4 \\ 4 & 2 \end{pmatrix}$ and $A_2 = \begin{pmatrix} 1 & 4 \\ 4 & 4 \end{pmatrix}$
Answer: Neither is positive definite because the largest number appears off the diagonal. It should be pointed out that even though a 4 does appear on the diagonal in $A_2$, the theorem states that it can not appear anywhere else if the matrix is positive definite.


3. $A_3 = \begin{pmatrix} 4 & 3 & 3 \\ 3 & 1 & 3 \\ 3 & 3 & 1 \end{pmatrix}$ and $A_4 = \begin{pmatrix} 1 & -2 & -5 \\ -2 & 1 & 5 \\ -5 & 5 & 8 \end{pmatrix}$
Answer: The theorem does not provide any insight because these matrices only have positive numbers on their diagonals, the largest number only appears on the diagonal, and $\det(A_3) = 4$ and $\det(A_4) = 26$. However, neither of them is positive definite. This is because the eigenvalues for $A_3$ are $-2$, $4 - 3\sqrt{2}$, and $4 + 3\sqrt{2}$, while the eigenvalues for $A_4$ are −1, −2, and 13 (i.e., each has at least one negative eigenvalue).

There is a simple test to prove a matrix is positive definite and it involves the size of the numbers on the diagonal compared to the other numbers in their respective rows. The property needed is defined next.

Definition of a Strictly Diagonally Dominant Matrix. A matrix A is strictly diagonally dominant if, and only if, for every i,
$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|. \tag{3.32}$$

It is diagonally dominant if the above condition holds with ≥ instead of >. A positive definite matrix does not have to be strictly diagonally dominant, or even just diagonally dominant (see Exercise 3.37). However, as illustrated by the next result, it can be used with other conditions to prove positive definiteness. Theorem 3.4 A symmetric matrix A is positive definite if the diagonals are all positive and if either one of the following hold: a) A is strictly diagonally dominant, or b) A is diagonally dominant, with strict inequality (3.32) holding for at least one row, and there are no zero entries in the super-diagonal. For the record, the diagonal entries are the aii ’s while the super-diagonal entries are the ai,i+1 ’s (i.e., the entries just to the right of the diagonals). The above theorem is a direct consequence of the Gershgorin Theorem given in Section 4.4.6. It is also worth pointing out that the condition “there are no zero entries in the super-diagonal” can be replaced with the more general requirement that “the matrix is irreducible.” The property of irreducibility is not needed in this chapter but, if you are interested, the definition is given in Section 4.5.3.1.


Example 3.13

1. $A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{pmatrix}$
This is a symmetric matrix with positive diagonal entries. To check on diagonal dominance: for row 1, $a_{11} = 2$ while $\sum_{j \ne 1} |a_{1j}| = 1$; for row 2, $a_{22} = 3$ while $\sum_{j \ne 2} |a_{2j}| = 2$; for row 3, $a_{33} = 2$ while $\sum_{j \ne 3} |a_{3j}| = 1$. So, the matrix is strictly diagonally dominant and from the above theorem we conclude that it is positive definite.

2. $A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}$
This is a symmetric matrix with positive diagonal entries. Because of the second row the matrix is not strictly diagonally dominant, but it is diagonally dominant with strict inequality holding for at least one row (row 1 and row 3). Since the super-diagonals are nonzero it follows from the above theorem that the matrix is positive definite.

As a final comment, a question that often comes up is where the idea of being positive definite comes from and why is it considered important. Many of the matrices that arise in applications come from approximations of differential equations. This includes Maxwell’s equations in electrodynamics, or Navier-Stokes equations in fluid dynamics, or the heat equation in thermodynamics. The steady state versions of these equations often have a property known as ellipticity, and this helps guarantee that the solution is unique or the potential energy in the system behaves in a physically realistic manner. Saying something has ellipticity is a fancy way of saying it is positive definite. The matrices that come from these applications are simply inheriting the property from the original problem. What we are doing here is deriving numerical methods that take advantage of this property.
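The quick tests in Theorems 3.3 and 3.4 are easy to automate. The following is a minimal Python sketch with hypothetical function names; it implements only the first two statements of Theorem 3.3 (the determinant test is omitted) and both parts of Theorem 3.4. Note that a return value of `False` from either function means only that the corresponding theorem draws no conclusion, not that the opposite conclusion holds.

```python
def is_symmetric(A):
    """True if the square matrix A (a list of rows) is symmetric."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(i))


def fails_theorem_3_3_screen(A):
    """True if statement 1 or 2 of Theorem 3.3 shows A is NOT positive definite."""
    n = len(A)
    # Statement 1: a diagonal entry is negative or zero.
    if any(A[i][i] <= 0 for i in range(n)):
        return True
    # Statement 2: the largest entry, in absolute value, appears off the diagonal.
    largest = max(abs(v) for row in A for v in row)
    return any(abs(A[i][j]) == largest
               for i in range(n) for j in range(n) if i != j)


def theorem_3_4_applies(A):
    """True if Theorem 3.4 guarantees that A is positive definite."""
    n = len(A)
    if not is_symmetric(A) or any(A[i][i] <= 0 for i in range(n)):
        return False
    off = [sum(abs(A[i][j]) for j in range(n) if j != i) for i in range(n)]
    strict = [abs(A[i][i]) > off[i] for i in range(n)]
    weak = [abs(A[i][i]) >= off[i] for i in range(n)]
    if all(strict):                                   # part (a)
        return True
    superdiag_nonzero = all(A[i][i + 1] != 0 for i in range(n - 1))
    return all(weak) and any(strict) and superdiag_nonzero   # part (b)


# Matrices from Example 3.13 (parts 1 and 2) and A3 from Example 3.12
B1 = [[2, -1, 0], [-1, 3, -1], [0, -1, 2]]
B2 = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
A3 = [[4, 3, 3], [3, 1, 3], [3, 3, 1]]
```

Applied to `A3`, both functions return `False`, which reflects the situation in Example 3.12 where the simple tests give no insight and the eigenvalues have to be examined.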

3.7.1 Cholesky Factorization

Suppose A is symmetric and A = LU. Since $A^T = A$ and $(LU)^T = U^T L^T$, then $A = U^T L^T$ where $U^T$ is lower triangular and $L^T$ is upper triangular. Because of this, you might think that it is possible to take $L = U^T$ or $U = L^T$ for a symmetric matrix. However, if you want to avoid using complex numbers in the factorization, you need to make the additional assumption that A is


also positive definite. The explanation for this, and how complex numbers can be used in the factorization, can be found in Exercises 3.54(f) and 3.55. The above discussion provides the foundation for the following definition.

Definition of a Cholesky Factorization. Given a symmetric and positive definite matrix A, its Cholesky factorization is
$$A = U^T U, \tag{3.33}$$
where U is an upper triangular matrix with positive diagonal entries.

The proof that every symmetric positive definite matrix has a unique Cholesky factorization is outlined in Exercise 3.54 in the case of when n = 2. Also, although this might be obvious, it is important enough to point out that U is nonsingular.

Example 3.14
Use the Cholesky factorization to solve
$$\begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -2 \\ 7 \end{pmatrix}.$$
Answer: Note that, from Theorem 3.4(a), the matrix is positive definite. So, the first step is to find the factorization. This is done by assuming that
$$\begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} u_{11} & 0 \\ u_{12} & u_{22} \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{pmatrix} = \begin{pmatrix} u_{11}^2 & u_{11} u_{12} \\ u_{12} u_{11} & u_{12}^2 + u_{22}^2 \end{pmatrix}.$$
Consequently, we need $u_{11}^2 = 4$, and so $u_{11} = 2$. The negative root is not considered because a Cholesky factorization requires the diagonals to be positive. With this, one then finds that $u_{12} = \frac{1}{2}$, and $u_{22} = \frac{1}{2}\sqrt{15}$. The next step is to solve $U^T y = b$, which is
$$\begin{pmatrix} 2 & 0 \\ \frac{1}{2} & \frac{1}{2}\sqrt{15} \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} -2 \\ 7 \end{pmatrix}.$$
The solution is $u = -1$ and $v = \sqrt{15}$. Lastly, we solve Ux = y, which is
$$\begin{pmatrix} 2 & \frac{1}{2} \\ 0 & \frac{1}{2}\sqrt{15} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} -1 \\ \sqrt{15} \end{pmatrix}.$$
From this it follows that the solution is y = 2 and x = −1.
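The steps in Example 3.14 can be sketched for a general n × n matrix. The following Python code is a minimal, unoptimized illustration with hypothetical function names (`cholesky_upper`, `cholesky_solve`); it is not the implementation used by MATLAB or any library. It computes U row by row from $A = U^T U$ and then solves $U^T y = b$ followed by $Ux = y$.

```python
import math


def cholesky_upper(A):
    """Return upper triangular U, with positive diagonal, such that A = U^T U.

    A is assumed to be symmetric; a ValueError is raised if a non-positive
    pivot shows that A is not positive definite.
    """
    n = len(A)
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s = A[i][i] - sum(U[k][i] ** 2 for k in range(i))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        U[i][i] = math.sqrt(s)
        for j in range(i + 1, n):
            t = A[i][j] - sum(U[k][i] * U[k][j] for k in range(i))
            U[i][j] = t / U[i][i]
    return U


def cholesky_solve(A, b):
    """Solve Ax = b using A = U^T U: first U^T y = b, then U x = y."""
    n = len(A)
    U = cholesky_upper(A)
    y = [0.0] * n
    for i in range(n):                       # forward substitution with U^T
        y[i] = (b[i] - sum(U[k][i] * y[k] for k in range(i))) / U[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution with U
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


# System from Example 3.14
A = [[4.0, 1.0], [1.0, 4.0]]
b = [-2.0, 7.0]
U = cholesky_upper(A)
solution = cholesky_solve(A, b)
```

For this system the entries of `U` agree with $u_{11} = 2$, $u_{12} = \frac{1}{2}$ and $u_{22} = \frac{1}{2}\sqrt{15}$, and `solution` is, up to rounding, the pair x = −1, y = 2 found in the example.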




Because Cholesky avoids having to calculate L, the flop count is about half of the LU count. In other words, when using a Cholesky factorization the flop count is approximately $\frac{1}{3}n^3$. From numerical experiments, using MATLAB, it is found that the computing time for a Cholesky factorization is about a factor of 3 faster than for LU (on a 10 core processor). The experiments in this case used both n = 1,000 and n = 10,000. In terms of computing time, a Cholesky factorization takes about 1.6 msec when n = 1,000 and about 0.6 s when n = 10,000.

In addition to a reduced flop count, a symmetric and positive definite matrix is always nonsingular (however, make sure to look at Exercise 3.39). Moreover, the factorization can be carried out without having to use pivoting. These are some of the nice properties referred to earlier. What is not avoided however is the possibility that the matrix is ill-conditioned. The diagonal matrix used in Example 3.10 is positive definite as long as the diagonals are positive. As demonstrated in that example, the matrix can be either ill-conditioned or well-conditioned depending on the relative values of the diagonals.

It is interesting how Theorem 3.4 is used by MATLAB. When given the command A\b, MATLAB makes a series of tests to decide how to solve the problem. If it finds that the matrix is symmetric, contains only real numbers, and has positive diagonals it will attempt a Cholesky factorization. It also knows that it is very possible that the matrix is not positive definite so there are built in contingencies for what to do if the Cholesky factorization fails. What MATLAB is doing is not unreasonable because symmetric and positive definite matrices are so common in applications that having positive diagonals increases the likelihood that the matrix is positive definite to the point that the Cholesky factorization is worth trying.

3.8 Tri-Diagonal Matrices

Another type of matrix that often occurs in applications is one that is tri-diagonal. A standard LU factorization can be used on a tri-diagonal matrix, and the procedure can be greatly simplified if one uses the fact that there are so many zeros in the matrix. To explain, a tri-diagonal matrix has the form
$$\mathbf{A} = \begin{pmatrix} a_1 & c_1 & & & \\ b_2 & a_2 & c_2 & & \\ & b_3 & a_3 & c_3 & \\ & & \ddots & \ddots & c_{n-1} \\ & & & b_n & a_n \end{pmatrix}. \tag{3.34}$$
The entries not shown are zero.

As illustrated in Section 3.4, the factorization of such a matrix involves tri-diagonal matrices. In particular, the Crout factorization has the form
$$\begin{pmatrix} a_1 & c_1 & & \\ b_2 & a_2 & c_2 & \\ & b_3 & a_3 & c_3 \\ & & \ddots & \ddots \end{pmatrix} = \begin{pmatrix} \ell_{11} & & & \\ \ell_{21} & \ell_{22} & & \\ & \ell_{32} & \ell_{33} & \\ & & \ddots & \ddots \end{pmatrix} \begin{pmatrix} 1 & u_{12} & & \\ & 1 & u_{23} & \\ & & 1 & u_{34} \\ & & & \ddots \end{pmatrix}.$$
Keeping track of what is, or is not, zero, the entire LU method reduces to the algorithm given in Table 3.6. This is known as the Thomas algorithm. In comparison to a full LU, it requires only $8n - 7$ flops! Moreover, it is only necessary to store the tri-diagonal portion of the matrix and so the entire method requires storage that amounts to approximately six n-vectors.

Note that tri-diagonal matrices can suffer the same problems more general matrices have. In particular, they can be singular and they can be ill-conditioned. This is evident in the algorithm given in Table 3.6 with the variable $w$. If $w$ is zero, then the algorithm will fail. This will happen if the matrix is singular or if the LU factorization requires pivoting (the algorithm in Table 3.6 assumes pivoting is not required). In certain cases it is possible to determine very easily when $w \neq 0$, and this is given next.

Theorem 3.5 The tri-diagonal matrix $\mathbf{A}$ in (3.34) is invertible, and the algorithm in Table 3.6 can be used to solve $\mathbf{A}\mathbf{x} = \mathbf{f}$, if either one of the following holds:
a) $\mathbf{A}$ is strictly diagonally dominant, or
b) $\mathbf{A}$ is diagonally dominant, $c_i \neq 0$ $\forall i$, and $|b_n| < |a_n|$.

set:    w = a_1,  x_1 = f_1 / w

loops:  for i = 2, 3, ..., n
            v_i = c_{i-1} / w
            w = a_i - b_i v_i
            x_i = (f_i - b_i x_{i-1}) / w
        end

        for j = n - 1, n - 2, ..., 1
            x_j = x_j - v_{j+1} x_{j+1}
        end

Table 3.6 Thomas algorithm for solving Ax = f when A is the tri-diagonal matrix given in (3.34).
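The steps in Table 3.6 translate almost line for line into code. The following Python sketch uses zero-based lists, so `a[i]`, `b[i]`, `c[i]` hold $a_{i+1}$, $b_{i+1}$, $c_{i+1}$; the entries `b[0]` and `c[n-1]` are placeholders that are never used. The function name `thomas_solve` and the storage layout are assumptions made for this illustration.

```python
def thomas_solve(a, b, c, f):
    """Solve the tri-diagonal system (3.34) using the algorithm in Table 3.6.

    a: diagonal entries (length n)
    b: sub-diagonal entries, b[0] unused (length n)
    c: super-diagonal entries, c[n-1] unused (length n)
    f: right-hand side (length n)
    No pivoting is used, so a zero value of w stops the computation.
    """
    n = len(a)
    x = [0.0] * n
    v = [0.0] * n
    w = a[0]
    if w == 0:
        raise ZeroDivisionError("w = 0: matrix is singular or pivoting is required")
    x[0] = f[0] / w
    # First loop: forward sweep, which overwrites w at each step.
    for i in range(1, n):
        v[i] = c[i - 1] / w
        w = a[i] - b[i] * v[i]
        if w == 0:
            raise ZeroDivisionError("w = 0: matrix is singular or pivoting is required")
        x[i] = (f[i] - b[i] * x[i - 1]) / w
    # Second loop: back substitution, overwriting x.
    for j in range(n - 2, -1, -1):
        x[j] = x[j] - v[j + 1] * x[j + 1]
    return x

# A 3 x 3 test: rows (2, -1, 0), (1, 2, -1), (0, 1, 2), with f chosen so x = (1, 1, 1).
x_thomas = thomas_solve([2.0, 2.0, 2.0], [0.0, 1.0, 1.0], [-1.0, -1.0, 0.0], [1.0, 2.0, 3.0])
print(x_thomas)
```

Only four lists of length n are passed in, and two more are created, which is consistent with the storage estimate of about six n-vectors.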


Outline of Proof: The only operation of concern in the algorithm is the division by $w$, and so the majority of the proof consists of showing this cannot be zero. In the case when (b) holds, note that when $i = 2$, $|v_2| = |c_1/a_1| \le 1$, where the inequality holds because the matrix is diagonally dominant. With this
$$|w| = |a_2 - b_2 v_2| \ge |a_2| - |b_2 v_2| \ge |a_2| - |b_2| > 0,$$
where the last inequality holds because the matrix is diagonally dominant and $c_2 \neq 0$. Continuing this argument, using induction, it is not hard to show that $|v_i| \le 1$ and $|w| \ge |a_i| - |b_i| > 0$. The last inequality holds, except for the last row of the matrix, because we are assuming $|a_i| \ge |b_i| + |c_i|$ and $c_i \neq 0$. The fact that it holds for $i = n$ is because we have explicitly assumed that $|b_n| < |a_n|$. Showing that $w$ is nonzero when $\mathbf{A}$ is strictly diagonally dominant follows a similar induction proof. What remains is to prove that the vector computed by the algorithm is the solution of the equation, and this is left as an exercise. ■

Example 3.15
1. The matrix
$$\mathbf{A} = \begin{pmatrix} 2 & -1 & 0 \\ 1 & 2 & -1 \\ 0 & 1 & 2 \end{pmatrix}$$
is diagonally dominant, $c_1 = c_2 = -1$, and $|b_3| < |a_3|$. Therefore, according to the above theorem, it is invertible and the algorithm in Table 3.6 can be used with it.
2. Let
$$\mathbf{A}_1 = \begin{pmatrix} 2 & -1 & 0 \\ 1 & 1 & -1 \\ 0 & 1 & 2 \end{pmatrix}, \quad \mathbf{A}_2 = \begin{pmatrix} 2 & -2 & 0 \\ 1 & 2 & 0 \\ 0 & 1 & 2 \end{pmatrix}, \quad \mathbf{A}_3 = \begin{pmatrix} 2 & -2 & 0 \\ 1 & 1 & -1 \\ 0 & 2 & 2 \end{pmatrix}.$$
These do not satisfy the conditions of the above theorem because: $\mathbf{A}_1$ is not diagonally dominant (because of the second row), while $\mathbf{A}_2$ and $\mathbf{A}_3$ are diagonally dominant but $\mathbf{A}_2$ violates the $c_i \neq 0$ $\forall i$ condition, and $\mathbf{A}_3$ violates $|b_3| < |a_3|$. ■
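The two conditions in Theorem 3.5 are easy to test before running the algorithm. The Python sketch below does this, assuming that diagonally dominant means $|a_i| \ge |b_i| + |c_i|$ in every row (with the missing entry in the first and last rows treated as zero), which is the inequality used in the proof outline. The function name and the zero-based storage, with `b[0]` and `c[n-1]` unused, are choices made for this illustration.

```python
def satisfies_theorem_3_5(a, b, c):
    """Return True if condition (a) or condition (b) of Theorem 3.5 holds.

    a: diagonal, b: sub-diagonal (b[0] unused), c: super-diagonal (c[n-1] unused).
    """
    n = len(a)

    def off_diagonal_sum(i):
        left = abs(b[i]) if i > 0 else 0.0
        right = abs(c[i]) if i < n - 1 else 0.0
        return left + right

    # Condition (a): strictly diagonally dominant.
    if all(abs(a[i]) > off_diagonal_sum(i) for i in range(n)):
        return True
    # Condition (b): diagonally dominant, every c_i nonzero, and |b_n| < |a_n|.
    dominant = all(abs(a[i]) >= off_diagonal_sum(i) for i in range(n))
    c_nonzero = all(c[i] != 0 for i in range(n - 1))
    return dominant and c_nonzero and abs(b[n - 1]) < abs(a[n - 1])

# The matrix from part 1 of Example 3.15, followed by A1 and A2 from part 2.
check_A = satisfies_theorem_3_5([2, 2, 2], [0, 1, 1], [-1, -1, 0])
check_A1 = satisfies_theorem_3_5([2, 1, 2], [0, 1, 1], [-1, -1, 0])
check_A2 = satisfies_theorem_3_5([2, 2, 2], [0, 1, 1], [-2, 0, 0])
print(check_A, check_A1, check_A2)
```

A result of False does not mean the algorithm fails, only that the theorem gives no guarantee.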

3.9 Sparse Matrices

Another type of matrix that often arises is one containing mostly zeros. These are said to be sparse. Although there is not a precise definition of what it means to be sparse, as a rule of thumb, a large n × n matrix is sparse if there are on the order of n nonzero entries. As an example, a large tri-diagonal matrix is sparse, and the reason is that there are no more than 3n nonzero entries in such a matrix. It is worth pointing out that matrices with few zero entries are said to be dense, which is another way of saying that the matrix is not sparse.

The question arises when solving a problem with a sparse matrix if it is possible to avoid having to store all those zero entries, and if it is possible to avoid calculations with them. It is fairly easy to do this, although it depends on the method being used to solve the problem, and this is explored in Section 9.3.4.
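As a small illustration of the idea of storing only the nonzero entries, the Python sketch below keeps a matrix as a dictionary keyed by (row, column) and multiplies it by a vector without touching any zeros. The dictionary layout and the name `sparse_matvec` are assumptions made for this example; they are not necessarily the storage scheme used in Section 9.3.4.

```python
def sparse_matvec(entries, x):
    """Compute y = A x when A is stored as {(i, j): value} for its nonzero entries."""
    y = [0.0] * len(x)
    for (i, j), value in entries.items():
        y[i] += value * x[j]
    return y

# Build a tri-diagonal matrix with a_i = 2, b_i = 1, c_i = -1 (zero-based indices).
n = 6
entries = {}
for i in range(n):
    entries[(i, i)] = 2.0
    if i > 0:
        entries[(i, i - 1)] = 1.0
    if i < n - 1:
        entries[(i, i + 1)] = -1.0

y_sparse = sparse_matvec(entries, [1.0] * n)
print(len(entries), y_sparse)
```

Here 3n - 2 numbers are stored instead of n², and the product costs about two flops per stored entry.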

3.10 Nonlinear Systems

We now consider the problem of how to solve a nonlinear system of equations numerically. An example of this type of problem is to find the value(s) of $x$ and $y$ that satisfy
$$x^2 + 4y^2 = 1, \tag{3.35}$$
$$4x^2 + y^2 = 1. \tag{3.36}$$
Each of these equations defines an ellipse, and they are plotted in Figure 3.5. It is evident there are four solutions. We will use Newton’s method to calculate these solutions, and it will make it easier if we first write the problem in the more general form of solving
$$f(x, y) = 0, \tag{3.37}$$
$$g(x, y) = 0. \tag{3.38}$$

Figure 3.5 The ellipse (3.35), the red curve, and the ellipse (3.36), the blue curve. (Horizontal axis: x-axis; vertical axis: y-axis; both range from -1 to 1.)


As usual with Newton, it’s assumed that an initial guess $(x_0, y_0)$ of the solution is provided. We then approximate the above functions for $(x, y)$ near $(x_0, y_0)$ using what is called the linear approximation in calculus, or Taylor’s theorem in advanced calculus. Whatever you call it, the approximations are
$$f(x, y) \approx f(x_0, y_0) + (x - x_0) f_x(x_0, y_0) + (y - y_0) f_y(x_0, y_0),$$
$$g(x, y) \approx g(x_0, y_0) + (x - x_0) g_x(x_0, y_0) + (y - y_0) g_y(x_0, y_0).$$
As was done for the one variable case, we now replace (3.37) and (3.38) with
$$f(x_0, y_0) + (x - x_0) f_x(x_0, y_0) + (y - y_0) f_y(x_0, y_0) = 0,$$
$$g(x_0, y_0) + (x - x_0) g_x(x_0, y_0) + (y - y_0) g_y(x_0, y_0) = 0.$$
Rearranging the terms in these equations, we obtain
$$x f_x(x_0, y_0) + y f_y(x_0, y_0) = -f(x_0, y_0) + x_0 f_x(x_0, y_0) + y_0 f_y(x_0, y_0),$$
$$x g_x(x_0, y_0) + y g_y(x_0, y_0) = -g(x_0, y_0) + x_0 g_x(x_0, y_0) + y_0 g_y(x_0, y_0).$$
These can be written in matrix form as $\mathbf{J}_0 \mathbf{x}_1 = -\mathbf{f}_0 + \mathbf{J}_0 \mathbf{x}_0$, or equivalently as
$$\mathbf{x}_1 = \mathbf{x}_0 - \mathbf{J}_0^{-1} \mathbf{f}_0, \tag{3.39}$$
where
$$\mathbf{J}_0 = \begin{pmatrix} f_x(x_0, y_0) & f_y(x_0, y_0) \\ g_x(x_0, y_0) & g_y(x_0, y_0) \end{pmatrix}, \quad \mathbf{x}_0 = \begin{pmatrix} x_0 \\ y_0 \end{pmatrix}, \quad \mathbf{x}_1 = \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}, \quad \mathbf{f}_0 = \begin{pmatrix} f(x_0, y_0) \\ g(x_0, y_0) \end{pmatrix}. \tag{3.40}$$
The formula in (3.39) is the multi-variable version of (2.7). The matrix in (3.40) is the Jacobian for the problem, evaluated at $\mathbf{x}_0$. Continuing this procedure, we obtain the general formula for Newton’s method, which is that
$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{J}_k^{-1} \mathbf{f}_k, \quad \text{for } k = 0, 1, 2, 3, \cdots \tag{3.41}$$
where
$$\mathbf{J}_k = \begin{pmatrix} f_x(x_k, y_k) & f_y(x_k, y_k) \\ g_x(x_k, y_k) & g_y(x_k, y_k) \end{pmatrix}, \tag{3.42}$$
$$\mathbf{x}_k = \begin{pmatrix} x_k \\ y_k \end{pmatrix}, \quad \mathbf{f}_k = \begin{pmatrix} f(x_k, y_k) \\ g(x_k, y_k) \end{pmatrix}. \tag{3.43}$$


The formula in (3.41) is the multi-variable version of (2.7). Also, the earlier requirement that $f'(x)$ is nonzero now becomes the condition that the Jacobian $\mathbf{J}$ is nonsingular.

For numerical reasons it is better to rewrite (3.41) to reduce the flop count. First note that it can be written as $\mathbf{J}_k \mathbf{x}_{k+1} = \mathbf{J}_k \mathbf{x}_k - \mathbf{f}_k$, and this can be written as $\mathbf{J}_k (\mathbf{x}_{k+1} - \mathbf{x}_k) = -\mathbf{f}_k$. Therefore, (3.41) can be broken into two steps, first one solves
$$\mathbf{J}_k \mathbf{z} = -\mathbf{f}_k, \tag{3.44}$$
and then
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{z}. \tag{3.45}$$
With this we have avoided having to determine $\mathbf{J}_k^{-1}$ and then calculating $\mathbf{J}_k^{-1} \mathbf{f}_k$.

As a final comment, the above expressions were derived for the two-variable problem in (3.37) and (3.38). However, they apply to the more general problem of n nonlinear equations in n unknowns. In this case, $\mathbf{J}$ is the n × n Jacobian matrix, while $\mathbf{x}_k$ and $\mathbf{z}$ are n-vectors.

Example 3.16 For the equations in (3.35) and (3.36), $f(x, y) = x^2 + 4y^2 - 1$ and $g(x, y) = 4x^2 + y^2 - 1$. The Jacobian is therefore
$$\mathbf{J} = \begin{pmatrix} 2x & 8y \\ 8x & 2y \end{pmatrix}.$$
Setting $\mathbf{z} = (u, v)^T$, then (3.44) becomes
$$\begin{pmatrix} 2x_k & 8y_k \\ 8x_k & 2y_k \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} x_k^2 + 4y_k^2 - 1 \\ 4x_k^2 + y_k^2 - 1 \end{pmatrix}. \tag{3.46}$$

k    x_k             y_k             ||x_k - x_{k-1}||∞ / ||x_k||∞
0    1.0000000000    1.0000000000
1    0.6000000000    0.6000000000    6.7e-01
2    0.4666666667    0.4666666667    2.9e-01
3    0.4476190476    0.4476190476    4.3e-02
4    0.4472137791    0.4472137791    9.1e-04
5    0.4472135955    0.4472135955    4.1e-07
6    0.4472135955    0.4472135955    8.4e-14

Table 3.7 Solving (3.35) and (3.36) using Newton’s method as given in (3.46) and (3.47). Also given is the relative iterative error.

After solving this, using (3.45),
$$\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} x_k + u \\ y_k + v \end{pmatrix}. \tag{3.47}$$

The values obtained using this procedure are shown in Table 3.7. As expected when using Newton’s method, the iterative error shows that the method is second-order. This is because once the method starts to get close to the solution, the iterative error at step k is approximately the square of the error at step k − 1. ■

Newton’s method, as given in (3.44) and (3.45), is relatively easy to derive. We also saw that when it works, the error shows the second-order convergence that is the hallmark of the method. The fact is, however, that solving a nonlinear system of equations using Newton’s method can be challenging. This is due, in part, to the fact that it is guaranteed to work only if the starting point $\mathbf{x}_0$ is close to the solution. However, it is possible to pair it with other methods that are not so sensitive to the location of the starting point, and this is discussed in Section 9.5.
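The two-step form (3.44) and (3.45) for the ellipse problem in Example 3.16 can be written out as a short Python sketch. Because the system is 2 × 2, the linear solve is done here with Cramer’s rule, which is a shortcut chosen for this illustration; for larger systems one would use an LU solver. The function name, the stopping tolerance, and the iteration limit are likewise assumptions.

```python
def newton_ellipses(x, y, tol=1e-12, max_iter=20):
    """Apply Newton's method, in the form (3.44)-(3.45), to equations (3.35) and (3.36).

    Returns a list of (x_k, y_k, relative iterative error) tuples.
    """
    history = [(x, y, None)]
    for _ in range(max_iter):
        f = x * x + 4.0 * y * y - 1.0
        g = 4.0 * x * x + y * y - 1.0
        # Jacobian entries from Example 3.16.
        j11, j12 = 2.0 * x, 8.0 * y
        j21, j22 = 8.0 * x, 2.0 * y
        det = j11 * j22 - j12 * j21
        if det == 0.0:
            raise ZeroDivisionError("Jacobian is singular")
        # Solve J z = -f for z = (u, v) using Cramer's rule.
        u = (-f * j22 + g * j12) / det
        v = (-g * j11 + f * j21) / det
        # Update step (3.47).
        x, y = x + u, y + v
        err = max(abs(u), abs(v)) / max(abs(x), abs(y))
        history.append((x, y, err))
        if err < tol:
            break
    return history

history = newton_ellipses(1.0, 1.0)
for k, (xk, yk, err) in enumerate(history):
    print(k, xk, yk, err)
```

Starting from (1, 1), the iterates approach the solution with $x = y = 1/\sqrt{5}$, which is the one in the first quadrant. The other three solutions can be sought by changing the signs of the starting values.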

3.11 Some Additional Ideas

3.11.1 Yogi Berra and Perturbation Theory

Yogi Berra, an insightful baseball personality, once said “In theory there is no difference between theory and practice. In practice there is.” This has particular applicability to computing. As an example, it was stated earlier that pivoting is not needed when finding a LU factorization of a symmetric positive definite matrix. Yet, it is possible to find a symmetric positive definite matrix that, when attempting to compute its factorization, the procedure fails (without pivoting). The reason is that it can happen that the floating point approximation of a matrix does not have the same properties as the original. As a case in point, it is possible to find a symmetric positive definite matrix whose floating point approximation, although symmetric, is not positive definite. Situations similar to this are considered in Exercise 3.39.

This helps explain the interest in what is called matrix perturbation theory. The idea here is that the original matrix $\mathbf{A}$ and its floating point approximation $\mathbf{A}_f$ are related through an equation of the form $\mathbf{A}_f = \mathbf{A} + \mathbf{P}$. The entries of $\mathbf{P}$ are proportional to machine ε, and are generally, although not always, much smaller than the entries in $\mathbf{A}$. In the vernacular of the subject, $\mathbf{P}$ is called a perturbation matrix. The question is, if $\mathbf{A}$ has a certain property, under what conditions, if any, will $\mathbf{A}_f$ have that same property? As an example, you can prove that if $\mathbf{A}$ is invertible, then $\mathbf{A}_f$ is invertible if the perturbation matrix is small enough that
$$\frac{\|\mathbf{P}\|}{\|\mathbf{A}\|} < \frac{1}{\kappa(\mathbf{A})}.$$
In other words, an invertible matrix will remain invertible if perturbed by a small enough amount that it satisfies the above inequality. The limitation of this statement is that, when computing, you do not know if the precision you are using is enough to guarantee that such an inequality is satisfied. Nevertheless, matrix perturbation theory plays a prominent role in the analysis of matrix algorithms, and more can be learned about this in Golub and Van Loan [2013] and Higham [2002].

3.11.2 Fixing an Ill-Conditioned Matrix

Ill-conditioned matrices are made easier to use the same way cooks deal with really tough meat: you hit it with a tenderizing hammer. The hammer in this case is a matrix called a preconditioner. Letting this be $\mathbf{P}$, then multiplying the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$ by $\mathbf{P}$ gives $\mathbf{P}\mathbf{A}\mathbf{x} = \mathbf{P}\mathbf{b}$. The goal is to pick $\mathbf{P}$ so $\mathbf{P}\mathbf{A}$ is not ill-conditioned. Ideally, if $\mathbf{P} = \mathbf{A}^{-1}$, then you end up with just $\mathbf{x} = \mathbf{P}\mathbf{b}$. So, the question is, can you come up with an easy to compute approximation of $\mathbf{A}^{-1}$? How this can be done, and some of the complications involved, are explored in Section 9.3.6.
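To make the idea concrete, the Python sketch below compares $\kappa_\infty(\mathbf{A})$ with $\kappa_\infty(\mathbf{P}\mathbf{A})$ for a badly scaled 2 × 2 matrix, using a diagonal $\mathbf{P}$ that rescales the first row. The matrix, the choice of a row-scaling preconditioner, and the helper names are all assumptions made for illustration; this is the simplest possible choice and is not necessarily the approach taken in Section 9.3.6.

```python
def norm_inf(M):
    """Matrix infinity-norm: the largest absolute row sum."""
    return max(sum(abs(entry) for entry in row) for row in M)

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def cond_inf(M):
    """Condition number of a 2 x 2 matrix in the infinity-norm."""
    return norm_inf(M) * norm_inf(inverse_2x2(M))

def matmul_2x2(P, A):
    return [[sum(P[i][k] * A[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1.0e6, 2.0e6], [1.0, 3.0]]   # first row is much larger than the second
P = [[1.0e-6, 0.0], [0.0, 1.0]]    # diagonal preconditioner that rescales row 1
PA = matmul_2x2(P, A)

kappa_A = cond_inf(A)
kappa_PA = cond_inf(PA)
print(kappa_A, kappa_PA)
```

For this example the condition number drops from roughly 6 × 10⁶ to about 20. Row scaling helps only when the ill-conditioning comes from poor scaling, which is one of the complications alluded to above.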

3.11.3 Insightful Observations about the Condition Number

As stated earlier, the condition number plays a central role in determining how accurately a matrix equation can be solved. The heuristic described in Section 3.6.3 helps to quantify its role in affecting the accuracy. However, there are other, more qualitative, ways to interpret its role. This means they are not particularly useful for evaluating it but they do provide insight into the impact of the condition number on the accuracy.

1. It is sometimes said that the condition number is a measure of the distortion associated with the matrix. This comment comes from the formula
$$\kappa(\mathbf{A}) = \frac{\max_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|}{\min_{\|\mathbf{x}\|=1} \|\mathbf{A}\mathbf{x}\|}.$$
This expression appeared when connecting the 2-norm of a matrix with the semi-major and semi-minor axes of the ellipse in Figure 3.3 (similar connections for the 1-norm and the ∞-norm are made in Exercise 3.17). What is seen in Figure 3.4 is that if the ellipse is close to a circle then the max and min values of $\|\mathbf{A}\mathbf{x}\|$ are not too far apart and the condition number is not very big. However, the larger the condition number, the greater the distortion, or elongation, of the ellipse.

2. Another often made comment is that the condition number is a measure of how close the matrix is to being singular (non-invertible). For example, MATLAB will issue the response “Warning: Matrix is close to singular or badly scaled” in alarming orange text when given a matrix with a large condition number. The reason for this is the formula
$$\frac{1}{\kappa(\mathbf{A})} = \min\left\{ \frac{\|\mathbf{A} - \mathbf{B}\|}{\|\mathbf{A}\|} : \mathbf{B} \text{ is singular} \right\}.$$
To decipher this, recall that the distance between two vectors can be written as $\|\mathbf{x} - \mathbf{y}\|$. In the same way the distance between two matrices can be written as $\|\mathbf{A} - \mathbf{B}\|$. Also recall how one calculates the distance between a point and an object such as a sphere. Namely, if $\mathbf{x}$ is the point then the distance to the object is the smallest distance between $\mathbf{x}$ and the points making up the object. In other words, if $S$ is the object then the distance between $\mathbf{x}$ and $S$ is $\min\{\|\mathbf{x} - \mathbf{y}\| : \mathbf{y} \in S\}$. Based on this observation, the right hand side of the above formula is the normalized distance between $\mathbf{A}$ and the set of singular matrices. What the formula shows is that the larger the condition number the closer the matrix is to the set of singular matrices.
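One consequence of the formula in the second observation is that any particular singular matrix $\mathbf{B}$ gives an upper bound on $1/\kappa(\mathbf{A})$. The Python sketch below checks this for a nearly singular 2 × 2 matrix using the ∞-norm. The matrix, the value of the parameter `alpha`, and the helper names are assumptions made for this illustration, and the singular matrix chosen is not claimed to be the closest one.

```python
def norm_inf(M):
    """Matrix infinity-norm: the largest absolute row sum."""
    return max(sum(abs(entry) for entry in row) for row in M)

def cond_inf_2x2(M):
    """Condition number of an invertible 2 x 2 matrix in the infinity-norm."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    inverse = [[d / det, -b / det], [-c / det, a / det]]
    return norm_inf(M) * norm_inf(inverse)

alpha = 1.0e-3
A = [[1.0, 1.0], [1.0, 1.0 + alpha]]   # invertible, but nearly singular
B = [[1.0, 1.0], [1.0, 1.0]]           # a nearby singular matrix

difference = [[A[i][j] - B[i][j] for j in range(2)] for i in range(2)]
normalized_distance = norm_inf(difference) / norm_inf(A)
inverse_kappa = 1.0 / cond_inf_2x2(A)
print(inverse_kappa, normalized_distance)
```

As `alpha` is decreased, both printed values shrink toward zero, which is the sense in which a large condition number signals closeness to the singular matrices.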

3.11.4 Faster than LU?

As explained in Section 3.4, using LU takes approximately $\frac{2}{3}n^3$ flops. This is better than calculating $\mathbf{A}^{-1}\mathbf{b}$, which takes approximately $2n^3$ flops. The important observation for this discussion is that both methods are $O(n^3)$. This raises the question as to whether there are sub-cubic methods, in other words, can you find a method that requires $O(n^\omega)$ flops, where $\omega < 3$? Considerable research has been invested in this question, and the usual approach to answering it is to change the question. It can be proved that if you can multiply two n × n matrices using $O(n^\omega)$ flops then you can solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ using $O(n^\omega)$ flops [Bunch and Hopcroft, 1974]. The usual method for multiplying two matrices requires $n^3$ multiplications and $n^3 - n^2$ additions, in other words, $O(n^3)$ flops. The cubic barrier for matrix multiplication was first broken by Strassen, who was able to produce an algorithm that uses $O(n^\omega)$ flops, where $\omega = \log_2 7 \approx 2.8$ [Strassen, 1969]. Others have worked on improving this, and the current best result is $\omega \approx 2.37$ [Gall and Urrutia, 2018, Alman and Williams, 2021].

You might be wondering if anyone actually uses these methods to solve matrix equations. The answer is that these are mostly theoretical results, and the methods are rarely used in practice. The exception is Strassen’s method, although it is not straightforward to implement [Bailey et al., 1991, Huss-Lederman et al., 1996]. Those who might be interested in this topic should consult Higham [2002].
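The source of the exponent $\log_2 7$ is that Strassen’s method multiplies two 2 × 2 matrices using seven multiplications instead of the usual eight. The Python sketch below shows only this 2 × 2 step with scalar entries; in the full method the entries are themselves blocks and the step is applied recursively. The function name is an assumption for this illustration, and no attempt is made at the implementation details discussed in the references.

```python
def strassen_2x2(A, B):
    """Multiply two 2 x 2 matrices using Strassen's seven products."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # The seven multiplications.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Combine them using additions and subtractions only.
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def standard_2x2(A, B):
    """The usual product, which uses eight multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product_strassen = strassen_2x2(A, B)
product_standard = standard_2x2(A, B)
print(product_strassen, product_standard)
```

The saving of one multiplication comes at the cost of 18 additions and subtractions rather than 4, which is part of the reason the method pays off only for large matrices.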

3.11.5 Historical Comparisons

Although this discussion is going to be similar to when parents tell their children how hard it was “back in the day,” it is worth knowing what early researchers in the area said about solving matrix problems. The particular individual is William Kahan, who was awarded the A.M. Turing Award for his contributions in floating-point computations and his dedication to “making the world safe for numerical computations.” Anyway, in Kahan [1966], when commenting about the difficulty of solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ he wrote that “On our computer (an IBM 7094-II at the University of Toronto) the solution of 100 linear equations in 100 unknowns can be calculated in about 7 seconds”. This is where you are supposed to point out that we are now able to do this in less than $10^{-4}$ sec, which is no surprise. His next comment is more interesting, and he states that “This calculation costs about a dollar.” Computers were treated like taxi-cabs, but instead of charging by the mile (or kilometer) they charged by the second. Fortunately, the introduction of UNIX ended this particular practice at most universities.

He goes on to say, “to solve 10000 linear equations would take more than two months”. Again, this is where it is necessary to comment that on current machines this takes about 1 sec. One might think the reason is the improvement in the speed of the processors, which is partly true. The more significant reason is memory. They were unable to store everything in active memory (RAM) and were forced into using what Kahan calls “bulk storage units, like magnetic tapes or disks”. This eventually became known as the “swap to disk” problem, and as you would expect, the time required to transfer data back and forth to tape means you will be measuring computing time not with a stop-watch but with a calendar. Also, when a code can take months (or even days) a factor of two is actually significant. This forced them into using every possible trick available, like writing 0.5 ∗ x instead of x/2.

The fact is that many of the concerns Kahan talks about are still with us, only the scale has changed. To explain, a significant effort is currently underway to build exascale computers. The reason for doing this is very similar to the one given by Kahan, which is the need to be able to solve physically realistic problems. In 1966 this meant planning on building a computer that can handle systems with $10^4$ unknowns, and now the number is $10^8$ or more depending on the density. It is also still true that data transfer is one of the main bottlenecks. A new wrinkle with exascale is finding the electrical power to run one, and also the need to deal with the heat generated while it is running. Some of the mathematical and engineering complications involved with this project are discussed in Mann [2020] and Dongarra et al. [2020].

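Exercises

3.1. The following are true or false. If true, provide an explanation why it is true, and if it is false provide an example demonstrating this along with an explanation why it shows the statement is incorrect.
(a) If $\mathbf{A}$ is the zero matrix and $\mathbf{A} = \mathbf{L}\mathbf{U}$, then either $\mathbf{L}$ or $\mathbf{U}$ is the zero matrix.
(b) If $\mathbf{A} = \mathbf{L}\mathbf{U}$ and $\mathbf{L} = \mathbf{U}^T$, then $\mathbf{A}$ must be symmetric.
(c) If $\mathbf{A}$ is invertible and $\mathbf{A} = \mathbf{L}\mathbf{U}$, then a LU factorization of the inverse matrix is $\mathbf{A}^{-1} = \mathbf{U}^{-1}\mathbf{L}^{-1}$.
(d) An ill-conditioned 2 × 2 matrix can always be made well-conditioned using pivoting.
(e) For $\mathbf{x} \in \mathbb{R}^3$, the only vector that satisfies $\|\mathbf{x}\|_1 = \|\mathbf{x}\|_2 = \|\mathbf{x}\|_\infty$ is $\mathbf{x} = \mathbf{0}$.
(f) Because $\|\mathbf{x}\|_\infty \le \|\mathbf{x}\|_1$ then it must be that $\|\mathbf{A}\|_\infty \le \|\mathbf{A}\|_1$.
(g) If $\mathbf{A}$ is symmetric then $\|\mathbf{A}\|_\infty = \|\mathbf{A}\|_1$.
(h) If $\mathbf{A}$ is strictly diagonally dominant, then $\mathbf{A}^T$ is strictly diagonally dominant.
(i) If a nonzero vector $\mathbf{x}$ can be found so $\mathbf{A}\mathbf{x} = \mathbf{0}$, where $\mathbf{A}$ is symmetric, then $\mathbf{A}$ is not positive definite.
(j) If $\mathbf{A}$ is positive definite, and symmetric, then $\mathbf{A}$ only has positive entries.
(k) Assuming $\mathbf{A}$ is 2 × 2, if $\mathbf{A}$ is symmetric and positive definite, then $\mathbf{A}^{-1}$ is symmetric and positive definite.
(l) If $\mathbf{A}$ is symmetric and positive definite, then $\mathbf{A} + \mathbf{I}$ is symmetric and positive definite.

Sections 3.1-3.4

3.2. Using a Doolittle factorization, solve the following equations.
(a) $\begin{aligned} x + y &= 1 \\ x + 4y &= 7 \end{aligned}$
(b) $\begin{aligned} x - 2y &= 0 \\ -x + 4y &= 1 \end{aligned}$
(c) $\begin{aligned} -2x + y &= 3 \\ 4x - 6y &= -14 \end{aligned}$
(d) $\begin{aligned} x - y + 2z &= 2 \\ 2x - y + z &= 3 \\ x - 3z &= 1 \end{aligned}$
(e) $\begin{aligned} 2x + z &= 4 \\ -4x + y &= -5 \\ 4x + y + z &= 5 \end{aligned}$
(f) $\begin{aligned} -x - y &= 2 \\ x + y - z &= -3 \\ 4x - y + z &= -2 \end{aligned}$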

3.3. Find the Doolittle factorization of the following matrices:
(a) $\begin{pmatrix} 1 & 1 & 1 \\ -1 & 0 & -2 \\ 1 & 1 & 3 \end{pmatrix}$
(b) $\begin{pmatrix} -2 & 0 & 1 \\ -4 & 3 & 1 \\ 2 & 6 & -4 \end{pmatrix}$
(c) $\begin{pmatrix} 0 & 0 & 1 \\ 0 & -2 & 7 \\ 0 & -6 & 2 \end{pmatrix}$

3.4. Find the 3 × 3 matrices that have a LU factorization so that $\mathbf{L} = \mathbf{U}$. What if $\mathbf{A}$ is n × n?

3.5. Find a LU factorization of $\mathbf{x}\mathbf{y}^T$, where $\mathbf{x}, \mathbf{y} \in \mathbb{R}^3$, that is defined even if $\mathbf{x}$ and $\mathbf{y}$ are zero. What is the factorization when $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$?

3.6. Assuming pivoting is not necessary, then a symmetric matrix $\mathbf{A}$ can be factored as $\mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{L}^T$, where $\mathbf{L}$ is a unit lower triangular matrix and $\mathbf{D}$ is a diagonal matrix.
(a) Find the $\mathbf{L}\mathbf{D}\mathbf{L}^T$ factorization of the matrix
$$\mathbf{A} = \begin{pmatrix} -4 & 1 \\ 1 & 1 \end{pmatrix}.$$
(b) In Section 3.1, it was shown how a LU factorization results in solving two matrix equations (for $\mathbf{y}$ and $\mathbf{x}$). Explain how a $\mathbf{L}\mathbf{D}\mathbf{L}^T$ results in solving three matrix equations. Use this to solve
$$\begin{aligned} -4x + y &= 2 \\ x + y &= 1. \end{aligned}$$
(c) Explain why the flop count for solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is a symmetric n × n matrix, using a $\mathbf{L}\mathbf{D}\mathbf{L}^T$ factorization is approximately half of the flop count when using a LU factorization.

3.7. This problem considers the question, why not use UL instead of LU? You can assume that $\mathbf{A}$ is a 2 × 2 matrix.
(a) For the matrix in Exercise 3.2(a), find the factorization $\mathbf{A} = \mathbf{U}\mathbf{L}$, where $\mathbf{L}$ is a unit lower triangular matrix and $\mathbf{U}$ is upper triangular.
(b) Assuming $\mathbf{A} = \mathbf{U}\mathbf{L}$, describe the resulting algorithm for solving $\mathbf{A}\mathbf{x} = \mathbf{b}$.
(c) Is there any connection between the Doolittle factorization of $\mathbf{A}^T$ and the one you found in part (a)?
(d) Aside from an alphabetic advantage, is there any reason to prefer LU over UL?

3.8. In this problem you are to pick one of the following equations:
(i) $\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 3 \\ -1 & 0 & -2 \end{pmatrix} \mathbf{x} = \begin{pmatrix} 2 \\ 3 \\ -5 \end{pmatrix}$
(ii) $\begin{pmatrix} -2 & 1 & 1 \\ 2 & -1 & -2 \\ -4 & 4 & 3 \end{pmatrix} \mathbf{x} = \begin{pmatrix} -1 \\ -1 \\ -2 \end{pmatrix}$

(a) Show that a Doolittle factorization fails on the matrix. Explain what pivoting is necessary so it doesn’t fail, and find the resulting Doolittle factorization. Also, what is the resulting permutation vector $\mathbf{p}$?
(b) As explained in Section 3.3, when using pivoting the problem reduces to solving $\mathbf{L}\mathbf{y} = \mathbf{b}^*$ and then solving $\mathbf{U}\mathbf{x} = \mathbf{y}$. Use this to solve the equation.

3.9. Instead of pivoting using row interchanges it is possible to use column interchanges. So, for example, in (3.6) the two columns of the matrix are interchanged, and $(x, y)^T$ is changed to $(y, x)^T$. Use column pivoting to find the Doolittle factorization of the matrices in Exercise 3.8.

Section 3.5

3.10. When solving problems that arise in applications the entries of the vectors almost always have physical dimensions. For example, in mechanics it is not unusual to have a vector such as $\mathbf{x} = (p, v_1, v_2, v_3)^T$ where $p$ is the pressure and the $v_j$’s are velocities. Explain why $\|\mathbf{x}\|_2$ is undefined. What about $\|\mathbf{x}\|_1$?

3.11. You are to find the stated quantity using (i) the 2-norm, (ii) the ∞-norm, and (iii) the 1-norm.
(a) What is the length of $\mathbf{x} = (1, -3, 5)^T$?
(b) Find the distance between the two points $P = (1, -1, 3)$ and $Q = (-2, 0, 1)$.
(c) What is the length of $\mathbf{x} = (1, 0)^T$? If the vector is rotated counterclockwise by 45° about the origin, you get the vector $\mathbf{y} = (\sqrt{2}/2, \sqrt{2}/2)^T$. What is the length of $\mathbf{y}$?
(d) Over the course of a day, does the length of a hand of a clock change?

3.12. Taking $\mathbf{x} = (2, 1)^T$ and $\mathbf{y} = (-1, 1)^T$, let $f(\alpha) = \|\mathbf{x} - \alpha\mathbf{y}\|$.
(a) Evaluate $f(3)$ for: (i) the ∞-norm, (ii) the 2-norm, and (iii) the 1-norm.
(b) What is the smallest value of $f(\alpha)$ when the 2-norm is used?
(c) What is the smallest value of $f(\alpha)$ when the ∞-norm is used?

3.13. Find $\|\mathbf{A}\|_1$ and $\|\mathbf{A}\|_\infty$ for the respective problem in Exercise 3.2.

3.14. Find $\|\mathbf{A}\|_1$ and $\|\mathbf{A}\|_\infty$ for the respective problem in Exercise 3.8.

3.15. In this problem $\mathbf{x} \in \mathbb{R}^2$, and let $\mathbf{x} = (x, y)^T$. When using double precision, what $\mathbf{x}$’s will satisfy: (a) $\|\mathbf{x}\|_\infty = 0$, (b) $\|\mathbf{x}\|_\infty = 1$, (c) $\|\mathbf{x}\|_\infty \le 10^{-12}$? In answering these questions, when considering rounding you can ignore ties.

3.16. In this problem $\mathbf{x} \in \mathbb{R}^2$, and let $\mathbf{x} = (x, y)^T$.
(a) The equation $\|\mathbf{x}\|_1 = 1$ determines a diamond. Sketch it.
(b) The equation $\|\mathbf{x}\|_\infty = 1$ determines a square. Sketch it.

3.17. This problem concerns how the plot on the right in Figure 3.3 changes when a different vector norm is used. How the plot on the left in Figure 3.3 changes is considered in Exercise 3.16. In this problem,
$$\mathbf{A} = \begin{pmatrix} 4 & 2 \\ 1 & 3 \end{pmatrix}.$$
A useful fact is that if the $\mathbf{x}$’s lie on a straight line, then the $\mathbf{A}\mathbf{x}$’s will lie on a straight line.
(a) For the $\mathbf{x}$’s on the curve $\|\mathbf{x}\|_\infty = 1$, the values of $\mathbf{A}\mathbf{x}$ lie on a parallelogram. Sketch it.
(b) On the parallelogram in part (a), what is the largest value of $\|\mathbf{A}\mathbf{x}\|_\infty$, and what is the smallest value of $\|\mathbf{A}\mathbf{x}\|_\infty$? How are these values connected with the vertices of the parallelogram? How are these values connected with the value of $\kappa_\infty(\mathbf{A})$?
(c) For the $\mathbf{x}$’s on the curve $\|\mathbf{x}\|_1 = 1$, the values of $\mathbf{A}\mathbf{x}$ lie on a parallelogram. Sketch it.
(d) On the parallelogram in part (c), what is the largest value of $\|\mathbf{A}\mathbf{x}\|_1$, and what is the smallest value of $\|\mathbf{A}\mathbf{x}\|_1$? How are these values connected with the vertices of the parallelogram? How are these values connected with the value of $\kappa_1(\mathbf{A})$?

3.18. Using the three defining properties of a vector norm:
(a) Show that if $\mathbf{x} = \mathbf{0}$, then $\|\mathbf{x}\| = 0$.
(b) Show that for any $\mathbf{x}$, $\|\mathbf{x}\| \ge 0$.
(c) Show that $\|\mathbf{x} - \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$.
(d) Show that $\|\mathbf{x}\|_1$ is a vector norm.

3.19. In this problem,
$$\mathbf{A} = \begin{pmatrix} 1 & -2 \\ -2 & 3 \end{pmatrix}.$$
(a) Find a vector $\mathbf{x}$ so that $\|\mathbf{A}\mathbf{x}\|_\infty = \|\mathbf{A}\|_\infty$.
(b) Find a vector $\mathbf{x}$ so that $\|\mathbf{A}\mathbf{x}\|_1 = \|\mathbf{A}\|_1$.
(c) One can show that $\|\mathbf{A}\|_2 = \sqrt{9 + 4\sqrt{5}}$. Find a vector $\mathbf{x}$ so that $\|\mathbf{A}\mathbf{x}\|_2 = \|\mathbf{A}\|_2$.

3.20. Let $\mathbf{A} = \mathbf{u}\mathbf{v}^T$, where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$.
(a) Using the law of cosines, show that $\|\mathbf{A}\|_2 = \|\mathbf{u}\|_2 \cdot \|\mathbf{v}\|_2$.
(b) Show that $\|\mathbf{A}\|_\infty \le \|\mathbf{u}\|_\infty \cdot \|\mathbf{v}\|_1$.

3.21. Show that the following do not satisfy at least one of the three conditions required of a norm. Even though they might violate more than one of the conditions it is only necessary to identify one of them.
(a) For $\mathbf{x} = (x, y)^T$, $\|\mathbf{x}\| = |x - y|$.
(b) For $\mathbf{x} = (x, y)^T$, $\|\mathbf{x}\| = |xy|$.
(c) $\|\mathbf{A}\| = |\det(\mathbf{A})|$.
(d) $\|\mathbf{A}\| = |\mathrm{tr}(\mathbf{A})|$.

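3.22. In this problem you are to show that the matrix norm defined in (3.16), or (3.17), satisfies the three conditions required of a norm. In particular, show that for any n × n matrices $\mathbf{A}$ and $\mathbf{B}$, and scalars $\alpha$, that:
(a) $\|\alpha\mathbf{A}\| = |\alpha| \cdot \|\mathbf{A}\|$.
(b) $\|\mathbf{A} + \mathbf{B}\| \le \|\mathbf{A}\| + \|\mathbf{B}\|$.
(c) If $\|\mathbf{A}\| = 0$ then $\mathbf{A}$ is the zero matrix.

Section 3.6

3.23. Find $\kappa_\infty(\mathbf{A})$ and $\kappa_1(\mathbf{A})$ for the respective problem in Exercise 3.2.

3.24. Consider the matrix
$$\mathbf{A} = \begin{pmatrix} 1 & -1 & -1 \\ 1 & -4 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$
It is useful, but not essential, to know that this is a unimodular matrix.
(a) Find $\mathbf{A}^{-1}$.
(b) Find $\kappa_\infty(\mathbf{A})$.

3.25. Consider the matrix
$$\mathbf{A} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}.$$
It is useful, but not essential, to know that this is a normal matrix.
(a) Find $\mathbf{A}^{-1}$.
(b) Find $\kappa_\infty(\mathbf{A})$.

3.26. In this problem $\mathbf{Q}$ is an orthogonal matrix. Some useful facts about orthogonal matrices are given in Appendix B.
(a) Show that $\kappa_2(\mathbf{Q}) = 1$.
(b) Show that $\kappa_1(\mathbf{Q}) = \|\mathbf{Q}\|_1 \|\mathbf{Q}\|_\infty$. Also, find a similar formula for $\kappa_\infty(\mathbf{Q})$.
(c) Find $\kappa_1$, $\kappa_2$, and $\kappa_\infty$ for the following matrices:
(a) $\begin{pmatrix} 0 & \frac{2}{3}\sqrt{2} & \frac{1}{3} \\ \frac{1}{2}\sqrt{2} & \frac{1}{6}\sqrt{2} & -\frac{2}{3} \\ -\frac{1}{2}\sqrt{2} & \frac{1}{6}\sqrt{2} & -\frac{2}{3} \end{pmatrix}$
(b) $\dfrac{1}{2}\begin{pmatrix} 1 & -1 & 1 & 1 \\ 1 & -1 & -1 & -1 \\ 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & -1 \end{pmatrix}$

3.27. The equation on the left is approximated with the simpler equation on the right, the solution of which is $\mathbf{x}_c$. Answer the three questions given in Example 3.8.
(a) $\begin{pmatrix} 15 & 14 \\ 16 & 14.96 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 14.99 \\ 16 \end{pmatrix}$ $\qquad$ $\begin{pmatrix} 15 & 14 \\ 16 & 15 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 15 \\ 16 \end{pmatrix}$
(b) $\begin{pmatrix} 14 & 13 \\ 13 & 12.05 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 14.01 \\ 13 \end{pmatrix}$ $\qquad$ $\begin{pmatrix} 14 & 13 \\ 13 & 12 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 14 \\ 13 \end{pmatrix}$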

3.28. In this problem $\alpha$ is a small positive number. Sketch the two lines in the x, y-plane, and describe how they change, including the point of intersection, as $\alpha$ approaches zero. Also, find $\kappa_\infty(\mathbf{A})$ and describe how it changes as $\alpha$ approaches zero.
(a) $\begin{aligned} x - y &= -1 \\ -x + (1 + \alpha)y &= 1 \end{aligned}$
(b) $\begin{aligned} 2x + 4y &= 1 \\ (1 - \alpha)x + 2y &= -1 \end{aligned}$

3.29. This problem concerns the equation $\mathbf{A}\mathbf{x} = \mathbf{b}$, where
$$\mathbf{A} = \begin{pmatrix} -1 & 1 \\ 0 & \alpha \end{pmatrix}$$
and $\alpha > 0$.
(a) For what values of $\alpha$ is this matrix ill-conditioned? Make sure to identify what norm you are using.
(b) Suppose the residual $\mathbf{r}$ is small (but nonzero). For what values of $\alpha$, if any, will the error $\mathbf{e} = \mathbf{x} - \mathbf{x}_c$ be large? Note $\mathbf{x}$ is the exact solution and $\mathbf{x}_c$ is the computed solution.
(c) Suppose the error $\mathbf{e}$ is small (but nonzero). For what values of $\alpha$, if any, will the residual be large?

3.30. This exercise generalizes Example 3.10. Assume that $\mathbf{A}$ is given in (3.30) with the $d_i$’s nonzero.
(a) Show that $\|\mathbf{A}\|_1 = \max\{ |d_1|, |d_2|, |d_3| \}$. What is $\kappa_1(\mathbf{A})$?
(b) Show that $\|\mathbf{A}\|_2 = \max\{ |d_1|, |d_2|, |d_3| \}$. What is $\kappa_2(\mathbf{A})$?

3.31. Assuming the conditions stated in Theorem 3.2 hold, show that
$$\frac{1}{\kappa(\mathbf{A})} \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|} \le \frac{\|\mathbf{e}\|}{\|\mathbf{x}\|}.$$

3.32. This exercise considers the following three versions of the same problem:
$$\begin{aligned} ax + by &= b_1 \\ cx + dy &= b_2 \end{aligned} \qquad \begin{aligned} by + ax &= b_1 \\ cx + dy &= b_2 \end{aligned} \qquad \begin{aligned} ax + by &= b_1 \\ dy + cx &= b_2 \end{aligned}$$
Here $a$, $b$, $c$, $d$, as well as $b_1$ and $b_2$, are assumed known. Note that each version differs only in the order the equations are written down. A property of this problem is said to be fragile if it holds for one of the versions but not for all of them.
(a) Is symmetry of the coefficient matrix a fragile property?
(b) Is uniqueness of the solution a fragile property?
(c) Is invertibility of the coefficient matrix a fragile property?
(d) Is the ∞-norm of the coefficient matrix a fragile property? Does this mean that being ill-conditioned is, or is not, a fragile property?

Section 3.7

3.33. Verify that $\mathbf{A}$ is symmetric and positive definite, and then solve the equations using a Cholesky factorization.
(a) $\begin{aligned} 2x + y &= 1 \\ x + 4y &= -3 \end{aligned}$
(b) $\begin{aligned} 3x - 2y &= 8 \\ -2x + 5y &= 2 \end{aligned}$
(c) $\begin{aligned} 3x + y + z &= 2 \\ x + 4y - z &= -3 \\ x - y + 4z &= 2 \end{aligned}$

3.34. For each of the following symmetric matrices, explain why it is positive definite and then find its Cholesky factorization.
(a) $\begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix}$
(b) $\begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & -1 \\ 0 & -1 & 5 \end{pmatrix}$
(c) $\begin{pmatrix} 4 & -2 & 2 \\ -2 & 10 & -4 \\ 2 & -4 & 18 \end{pmatrix}$

3.35. The 4 × 4 matrix in (3.10) is symmetric. Verify that it is positive definite, and then find its Cholesky factorization.

3.36. A matrix often used to test the effectiveness of algorithms used to calculate eigenvalues is the Rosser matrix, which is given as
$$\mathbf{R} = \begin{pmatrix} 611 & 196 & -192 & 407 & -8 & -52 & -49 & 29 \\ 196 & 899 & 113 & -192 & -71 & -43 & -8 & -44 \\ -192 & 113 & 899 & 196 & 61 & 49 & 8 & 52 \\ 407 & -192 & 196 & 611 & 8 & 44 & 59 & -23 \\ -8 & -71 & 61 & 8 & 411 & -599 & 208 & 208 \\ -52 & -43 & 49 & 44 & -599 & 411 & 208 & 208 \\ -49 & -8 & 8 & 59 & 208 & 208 & 99 & -911 \\ 29 & -44 & 52 & -23 & 208 & 208 & -911 & 99 \end{pmatrix}.$$
This matrix is symmetric. Is it positive definite?

3.37. Let $\mathbf{A} = \mathbf{I} + \sigma\mathbf{v}\mathbf{v}^T$, where $\mathbf{I}$ is the n × n identity matrix and $\mathbf{v}$ is an n-vector.
(a) Show that $\mathbf{A}$ is symmetric. Also, using the definition, show that it is positive definite if $\sigma \ge 0$.
(b) Taking $n = 2$, and assuming that $\sigma \ge 0$, find its Cholesky factorization.
(c) Taking $n = 4$, $\sigma = 1$, and $\mathbf{v} = (1, 1, \cdots, 1)^T$, show that $\mathbf{A}$ is not diagonally dominant. What about $n = 3$ and $n = 5$?

3.38. Assuming $\mathbf{A}$ is symmetric and positive definite, let $\|\mathbf{x}\|_A \equiv \sqrt{\mathbf{x}^T\mathbf{A}\mathbf{x}}$. This is often called the energy norm as it is associated with the potential energy for a discrete particle system.

122

3 Matrix Equations

(a) Given the Cholesky factorization, show that $\|x\|_A = \|Ux\|_2$.
(b) Show that $\|x\|_A$ is a vector norm.

3.39. In this exercise, A is a 2 × 2 matrix, and $A_f$ is its floating point approximation (using double precision).
(a) Give an example of an invertible matrix A where $A_f$ is the zero matrix (and, therefore, not invertible). Your example should have $\kappa_\infty(A) = 1$.
(b) Give an example of a symmetric and positive definite matrix A where $A_f$ is symmetric but not positive definite.
(c) Give an example of a strictly diagonally dominant matrix A where $A_f$ is not strictly diagonally dominant.
(d) Which of these properties for A are guaranteed to be true for $A_f$: (i) symmetric, (ii) upper or lower triangular, (iii) invertible, and (iv) well-conditioned?

Section 3.8

3.40. Using the estimate given in Section 3.4, for a computer with 4 GB of RAM, approximately how large a matrix can be solved using the Thomas algorithm? What if there is 64 GB?

3.41. This problem considers the derivation of the Thomas algorithm when n = 3. As given in Table 3.6, the vector x is overwritten as it is computed. So, the method solves Lx = f and then it solves Ux = x. This works because U is upper triangular, and it reduces the storage requirement needed for the method.
(a) Write out the Crout factorization for A when n = 3, and then write out the solution of Lx = f as well as the solution of Ux = x.
(b) In the first loop in Table 3.6, for a given i, what term in the LU factorization does w equal? What term does $v_i$ refer to?
(c) Explain why the first condition in Theorem 3.5 guarantees that there is not a divide-by-zero in the algorithm.

3.42. This problem considers the following n × n tri-diagonal matrix:
\[
A = \begin{pmatrix}
a & c & & & & \\
b & a & c & & & \\
 & b & a & c & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & b & a & c \\
 & & & & b & a
\end{pmatrix}.
\]
It is possible to show that the eigenvalues of this matrix are
\[
\lambda_i = a + 2\sqrt{bc}\,\cos\left(\frac{i\pi}{n+1}\right), \quad \text{for } i = 1, 2, \ldots, n.
\]
Also, assume that n ≥ 3.

  n     ||x − xM||∞ / ||x||∞     ||x − xI||∞ / ||x||∞     ||rM||∞ / ||b||∞     κ∞(A)     εκ∞(A)
  n1
  n2
  n3
  n4

Table 3.8 Table for Exercises 3.43-3.47. Note that xM is the solution using the LU method, xI is the solution using the matrix inverse, and rM = b − AxM.

(a) What is $\|A\|_\infty$, and what is $\|A\|_1$?
(b) In the case that A is symmetric (so, c = b), what inequality must be satisfied to guarantee that A is strictly diagonally dominant?
(c) Assuming the matrix is symmetric and a is positive with 2|b| ≤ a, explain why A is positive definite.

Computational Questions for Sections 3.1 - 3.8

In problems 3.43-3.47 you are to solve Ax = b using the LU method as well as using the formula $x = A^{-1}b$. The exact solution should be $x = (1, 1, \cdots, 1)^T$, and so you should take b = Ax. You are to fill out Table 3.8 and then answer the following questions. The entries in the table only need to include two significant digits.
(a) Do you see any substantial differences between the two solution methods when they are compared using the relative error?
(b) Does a small residual indicate an accurate solution? Your answer should include a comment on the value of the condition number. What about any dependence on the size n of the matrix?
(c) How well does the last column predict the relative error?
If you are using MATLAB, the LU solution $x_M$ can be calculated using the command xM=A\b;. The inverse matrix solution $x_I$ can be calculated using the command xI=inv(A)*b;. Note that the backslash command (\) is found in the help pages under “mldivide”.

3.43. Take A to be an n × n magic matrix (using MATLAB, A=magic(n)). Also, $n_1 = 3$, $n_2 = 6$, $n_3 = 9$, and $n_4 = 12$.


3.44. Letting P be the n × n Pascal matrix (using MATLAB, P=pascal(n)), let A = P except that $a_{n1} = 0$. Also, $n_1 = 4$, $n_2 = 8$, $n_3 = 12$, and $n_4 = 16$.

3.45. Letting H be the n × n Hilbert matrix (using MATLAB, H=hilb(n)), let A = H except that $a_{n1} = 0$. Also, $n_1 = 4$, $n_2 = 8$, $n_3 = 12$, and $n_4 = 16$.

3.46. Take A to be the n × n zero matrix except that $a_{13} = 1$, $a_{31} = -1$, and the diagonal entries are $a_{ii} = i^4$, for $i = 1, 2, \cdots, n$. Also, $n_1 = 5$, $n_2 = 10$, $n_3 = 100$, and $n_4 = 200$.

3.47. Take A to be the n × n zero matrix except that $a_{ii} = i$ and $a_{i1} = i^5$, for $i = 1, 2, \cdots, n$, and $a_{1j} = 1$, for $j = 1, 2, \cdots, n$. Also, $n_1 = 10$, $n_2 = 100$, $n_3 = 1000$, and $n_4 = 2000$.

3.48. You are to solve Ax = b, where A is the n × n matrix containing all 1’s, except for the diagonals which are $a_{ii} = n$, and $b = (1, 1, \cdots, 1)^T$. Solve the equation for n = 8, n = 80, n = 800, and n = 8,000 by first using a LU factorization, then using a Cholesky factorization. For each n, you should report the value of $\|e\|_\infty$ and the computing time required to find the factorization and then solve the equation. Also, comment on the relative accuracy of the two methods as well as any differences in the computing times. Finally, explain why A is positive definite irrespective of the value of n. If you are using MATLAB, the two solutions can be obtained using the commands [L,U]=lu(A); y=L\b; x=U\y;, and U=chol(A); x=U\(U'\b);.

3.49. You are to solve Ax = f, where A is the n × n tridiagonal matrix (3.34), where $a_i = 5$, $b_i = 4$ and $c_i = -1$, for $i = 1, 2, \cdots, n$. The exact solution should be $x = (1, 1, \cdots, 1)^T$, and so you should take f = Ax. Solve the equation for n = 1,000, n = 10,000, n = 100,000, and n = 1,000,000. Your answer should include the resulting values for $\|x - x_c\|_\infty / \|x\|_\infty$ and $\|r\|_\infty / \|f\|_\infty$, where $x_c$ is the computed solution. In addition, state what method you used to solve the equation and a reason why the method is expected to work. Finally, comment on how sensitive the accuracy of the computed solution is to n.

Section 3.10

3.50. Consider the following nonlinear system of equations: $x^2 + y^2 = 4$, $y = x^3$.
(a) Sketch the two curves and explain where (approximately) the solutions are located.
(b) What are J and f, as given in (3.44), for this problem?


(c) What would be good starting values for the solutions you identified in part (a)? Make sure to provide an explanation for your choices.
(d) Use Newton’s method to find the solution for x > 0. The values for x and y must be correct to 6 significant digits.

3.51. Consider the following nonlinear system of equations: $4x^2 + y^2 = 16$, $y e^x = 2$.
(a) Sketch the two curves and explain where (approximately) the solutions are located.
(b) What are J and f, as given in (3.44), for this problem?
(c) What would be good starting values for the solutions you identified in part (a)? Make sure to provide an explanation for your choices.
(d) Use Newton’s method to find the solution for x > 0. The values for x and y must be correct to 6 significant digits.

3.52. The function $F(x, y) = (x^2 - 2x + 5y^2)\exp(-x^2 - y^2)$ is shown in Figure 3.6. The objective of this exercise is to compute its largest value.
(a) What two equations must be solved to locate where the maximum occurs?
(b) What is the Jacobian for the system in part (a)?
(c) From Figure 3.6, what would be a good starting point for Newton’s method?
(d) Using your starting point, use Newton’s method to solve the system in part (a). The values for x and y must be correct to 6 significant digits. From this, determine the largest value of F(x, y).

Figure 3.6 Function used in Exercise 3.52.

Additional Questions

3.53. In coding theory it is necessary to be able to determine in how many locations, or positions, two character strings differ, and in optimization it is not unusual to want a solution with the fewest components. This is the motivation for letting $\|x\|_0$ denote the number of nonzero entries in x. This is not a norm (it satisfies conditions 2 and 3 but not condition 1 of the definition), but it is often referred to as the $\ell_0$-norm, or the Hamming norm. To avoid this unfortunate terminology some refer to it as the cardinality function and use the notation card(x).
(a) If $x = (1, -12, 3, -4)^T$ and $y = (0, -12, 0, -4)^T$, what are $\|x\|_0$, $\|y\|_0$, and $\|x - y\|_0$?
(b) Give an example where condition 1 for a vector norm is violated. However, show that it preserves order in the sense that if |α| ≤ |β| then $\|\alpha x\|_0 \le \|\beta x\|_0$.
(c) Show that $\|x\|_0 = \sum_{i=1}^{n} |x_i|^0$, where $0^0 \equiv 0$.
(d) For $x \in \mathbb{R}^2$, what x’s satisfy $\|x\|_0 = 1$?

3.54. This exercise looks at some of the theorems about symmetric and positive definite matrices in the case when
\[
A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}.
\]
As will be established in part (a), this matrix is positive definite if, and only if, a > 0 and $ac - b^2 > 0$. This result is then used to prove the various theorems we had about positive definite matrices.
(a) Show that the eigenvalues of the matrix are
\[
\lambda_\pm = \frac{1}{2}\left[ a + c \pm \sqrt{(a + c)^2 - 4(ac - b^2)} \right].
\]
Explain why the eigenvalues are positive if, and only if, a + c > 0 and $ac - b^2 > 0$. Also explain why these two conditions can be replaced with the requirements that a > 0 and $ac - b^2 > 0$.
(b) Use the result from part (a) to prove that A is not positive definite if any diagonal entry is negative or zero.
(c) Use the result from part (a) to prove that if A is positive definite, then the largest number in A, in absolute value, can only appear on the diagonal.
(d) Use the result from part (a) to prove that if A is positive definite, then det(A) > 0.
(e) Use the result from part (a) to prove that A is positive definite if the diagonals are all positive and it is strictly diagonally dominant.
(f) What are the three equations for the $u_{ij}$’s that come from the Cholesky factorization? Explain why the assumption that A is positive definite is exactly what is needed to guarantee that you can find a real-valued solution to these equations.


3.55. This problem considers whether a Cholesky type factorization can be used on a matrix which is not positive definite. The assumption is that given an invertible symmetric matrix A, then $A = C^T C$ where C is upper triangular with possibly complex-valued entries. In this problem this will be referred to as a generalized Cholesky factorization. What will be shown is that a generalized Cholesky factorization is possible as long as the leading principal minors of A are nonzero. Note that a method that avoids the use of complex-valued factors is considered in Exercise 3.6.
(a) The following matrix is invertible and symmetric but not positive definite. Find a matrix C satisfying the stated assumption.
\[
A = \begin{pmatrix} -4 & 1 \\ 1 & 1 \end{pmatrix}
\]
(b) Using the factorization found in part (a), solve
    −4x + y = 2
    x + y = 1.
Do you obtain the same answer you would get if you did not use the factorization?
(c) What conditions must be imposed on the entries of the following symmetric matrix so it is invertible and has a generalized Cholesky factorization?
\[
A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}.
\]
(d) Answer the question posed in part (c) for a symmetric 3 × 3 matrix.

Chapter 4

Eigenvalue Problems

One of the central problems considered in this chapter is: given an n × n matrix A, find the number(s) λ and nonzero vectors x that satisfy
\[
\mathbf{A}\mathbf{x} = \lambda \mathbf{x}. \tag{4.1}
\]

This is an eigenvalue problem, where λ is an eigenvalue and x is an eigenvector. There are a couple of observations worth making about this problem. First, x = 0 is always a solution of (4.1), and so what is of interest are the nonzero solutions. Second, if x is a solution then αx, for any number α, is also a solution.

4.1 Review of Eigenvalue Problems

In linear algebra the procedure used to solve the eigenvalue problem consists of two steps:

(1) Find the λ’s by solving
\[
\det(\mathbf{A} - \lambda \mathbf{I}) = 0, \tag{4.2}
\]
where I is the identity matrix. This is known as the characteristic equation, and the left-hand side of this equation is an nth degree polynomial in λ and is called the characteristic polynomial.

(2) For each eigenvalue λ, find its eigenvectors x by solving
\[
(\mathbf{A} - \lambda \mathbf{I})\mathbf{x} = \mathbf{0}. \tag{4.3}
\]

Note that Step 2 provides a way to determine an eigenvector once the eigenvalue is known. For some of the numerical methods used to solve (4.1), the eigenvectors are determined first. It is possible to determine the associated eigenvalue by multiplying both sides of (4.1) by an eigenvector x. This yields the formula
\[
\lambda = \frac{\mathbf{x} \bullet \mathbf{A}\mathbf{x}}{\mathbf{x} \bullet \mathbf{x}}, \tag{4.4}
\]
which is known as Rayleigh’s quotient. Also note that the “ • ” appearing in this expression designates the dot product. As a reminder, if $\mathbf{x} = (x_1, x_2, \cdots, x_n)^T$ and $\mathbf{y} = (y_1, y_2, \cdots, y_n)^T$ then the dot product is defined as $\mathbf{x} \bullet \mathbf{y} \equiv x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$. This can also be written as $\mathbf{x} \bullet \mathbf{y} = \mathbf{x}^T \mathbf{y}$ or as $\mathbf{x} \bullet \mathbf{y} = \mathbf{y}^T \mathbf{x}$.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_4

Example 4.1

a) Consider the eigenvalue problem
\[
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \lambda \begin{pmatrix} x \\ y \end{pmatrix}. \tag{4.5}
\]

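The characteristic equation (4.2) is $\lambda^2 - 4\lambda + 3 = 0$, and so the eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = 1$. For $\lambda_1$, (4.3) takes the form
\[
\begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]
From this it follows that y = x. Consequently, the eigenvectors associated with this eigenvalue have the form $\mathbf{x} = (x, x)^T = x\,\mathbf{x}_1$, where
\[
\mathbf{x}_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.
\]
For $\lambda_2$, one finds that the eigenvectors have the form $\mathbf{x} = x\,\mathbf{x}_2$, where
\[
\mathbf{x}_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.
\]

b) Consider the eigenvalue problem
\[
\begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \lambda \begin{pmatrix} x \\ y \end{pmatrix}. \tag{4.6}
\]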

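The characteristic equation is $(\lambda - 3)^2 = 0$, and so the only eigenvalue is $\lambda_1 = 3$. In this case, (4.3) takes the form
\[
\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
\]
All values of x and y satisfy this equation. Consequently, the eigenvectors are $\mathbf{x} = (x, y)^T = x\,\mathbf{x}_1 + y\,\mathbf{x}_2$, where
\[
\mathbf{x}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \text{and} \quad \mathbf{x}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.
\]

c) Consider the eigenvalue problem
\[
\begin{pmatrix} 1 & 2 \\ -\tfrac{1}{2} & 1 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \lambda \begin{pmatrix} x \\ y \end{pmatrix}.
\]
The characteristic equation is $\lambda^2 - 2\lambda + 2 = 0$, and so the eigenvalues are λ = 1 + i and λ = 1 − i.

d) The eigenvalue problem
\[
\begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
= \lambda \begin{pmatrix} x \\ y \end{pmatrix} \tag{4.7}
\]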

has one eigenvalue λ = 2, with eigenvector $\mathbf{x} = \alpha\,\mathbf{x}_1$, where $\mathbf{x}_1 = (1, 0)^T$. Unlike the previous examples, there isn’t a second linearly independent eigenvector. An n × n matrix that has fewer than n independent eigenvectors is said to be defective. So, the matrix of this example is defective, while the matrices for the three previous examples are not defective.

An important observation is that the first two matrices in the above examples are symmetric. They illustrate a result from linear algebra, which is stated next.

Theorem 4.1 If A is a symmetric n × n matrix, then the following hold:
(1) Its eigenvalues are real numbers.
(2) If $\mathbf{x}_i$ is an eigenvector with eigenvalue $\lambda_i$, $\mathbf{x}_j$ is an eigenvector with eigenvalue $\lambda_j$, and $\lambda_i \neq \lambda_j$, then $\mathbf{x}_i \bullet \mathbf{x}_j = 0$.
(3) It is possible to find a set of orthonormal basis vectors $\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_n$, where each $\mathbf{u}_i$ is an eigenvector for A.

In the last statement, for the vectors to be orthonormal it is required that $\mathbf{u}_i \bullet \mathbf{u}_j = 0$ if $i \neq j$ and $\mathbf{u}_i \bullet \mathbf{u}_i = 1$.

Each eigenvalue has an algebraic multiplicity and a geometric multiplicity. The algebraic multiplicity is the exponent of the eigenvalue’s factor in the characteristic polynomial. For example, suppose that the characteristic polynomial of a 6 × 6 matrix can be factored as $(\lambda - 1)^2 (\lambda + 1)^3 (\lambda - 4)$. In this case, the distinct eigenvalues are λ = 1, −1, 4, with respective algebraic multiplicities of 2, 3, 1. The geometric multiplicity equals the number of linearly independent eigenvectors for the eigenvalue. So, in Example 4.1(a) both eigenvalues have a geometric multiplicity of 1, while the eigenvalue in Example 4.1(b) has a geometric multiplicity of 2. It is possible to show that the geometric multiplicity is never more than the algebraic multiplicity. Moreover, if a matrix has an eigenvalue in which the algebraic and geometric multiplicities are not equal, then the matrix is defective.
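The two-step procedure and Rayleigh's quotient can be checked on the matrices of Example 4.1 with a short script. What follows is only a sketch in Python (the function names eig2x2 and rayleigh are made up for this illustration): for a 2 × 2 matrix the characteristic equation (4.2) is $\lambda^2 - (a + d)\lambda + (ad - bc) = 0$, which is solved with the quadratic formula, and (4.4) is then evaluated for the eigenvectors found in Example 4.1(a).

```python
import cmath

def eig2x2(a, b, c, d):
    """Roots of det(A - lambda*I) = 0 for A = [[a, b], [c, d]]."""
    trace = a + d
    det = a * d - b * c
    # cmath.sqrt is used so complex roots, as in Example 4.1(c), are handled
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0

def rayleigh(A, x):
    """Rayleigh quotient (x . Ax)/(x . x), as in (4.4)."""
    n = len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)

lam_a = eig2x2(2, 1, 1, 2)       # Example 4.1(a): eigenvalues 3 and 1
lam_c = eig2x2(1, 2, -0.5, 1)    # Example 4.1(c): eigenvalues 1 + i and 1 - i

A = [[2, 1], [1, 2]]
rq1 = rayleigh(A, [1, 1])        # eigenvector x1 of Example 4.1(a)
rq2 = rayleigh(A, [1, -1])       # eigenvector x2 of Example 4.1(a)
print(lam_a, lam_c, rq1, rq2)
```

The quadratic formula is used here only because n = 2. As discussed later in this section, the characteristic equation is not how eigenvalues are computed for larger matrices.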

Example: Chain of Oscillators

Several applications involving eigenvalues and related ideas are considered in this chapter, including Markov chains (Section 4.5.3), networks (Section 4.5.2), and image compression (Section 4.6.7). The particular example described here involves a chain of oscillators as illustrated in Figure 4.1. What is shown in this figure are masses that are connected by springs, and each mass is also attached to its own spring (the other end of these springs is assumed to be held fixed). Letting $y_i(t)$ be the position of the ith mass, relative to its equilibrium location, then the resulting equations of motion are
\[
m y_i'' + s y_i = s_c (y_{i+1} - 2y_i + y_{i-1}), \quad \text{for } i = 1, 2, 3, \cdots, n, \tag{4.8}
\]
where $y_0 = y_{n+1} = 0$. Although configuring masses and springs in this way looks like some complicated child’s toy, this arises in numerous applications and has various names. A recent example is its use in the study of a chain of atoms that are subject to nearest neighbor interactions [Iooss and James, 2005]. However, the problem goes back centuries, to at least Bernoulli [1728], and has been used in a wide spectrum of applications, including the interactions of stars [Voglis, 2003] and the modeling of swarming motion [Erdmann et al., 2005]. It is also the basis of what is known as the Fermi-Pasta-Ulam (FPU) chain, which is used to study solitary waves [Ford, 1992; Berman and Izrailev, 2005]. The question considered here is, what are the time periodic solutions of this chain, which are called the normal modes of the system. This is answered by assuming $y_i(t) = x_i \exp(I\omega t)$, where $I = \sqrt{-1}$. Substituting this into (4.8), and writing the problem in matrix form, we obtain the eigenvalue equation Ax = λx, where

Figure 4.1 Chain of masses and springs.

\[
\mathbf{A} = \begin{pmatrix}
a & -1 & & & \\
-1 & a & -1 & & \\
 & -1 & a & -1 & \\
 & & \ddots & \ddots & \ddots \\
 & & & -1 & a
\end{pmatrix}, \tag{4.9}
\]
$a = 2 + s/s_c$ and $\lambda = m\omega^2/s_c$. Using the formula from Exercise 3.42, the eigenvalues of this matrix are
\[
\lambda_i = a + 2\cos\left(\frac{i\pi}{n+1}\right), \quad \text{for } i = 1, 2, \ldots, n. \tag{4.10}
\]
The values obtained from this formula in the case when s = 1, $s_c = 1/10$, and n = 30 are shown in Figure 4.2. In the applications mentioned earlier, n is usually quite large, and as an example n = 3000 in James et al. [2013]. The eigenvalues in this case follow the curve seen in Figure 4.2, but they are much closer together. As will be explained later, this has a detrimental effect on how fast the eigenvalue solvers considered in this chapter are able to compute the eigenvalues.
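The formula (4.10) is easy to test numerically. The sketch below, written in Python, evaluates (4.10) and then checks that Ax = λx holds for one eigenvalue. The eigenvector used in the check, with entries $x_j = (-1)^j \sin(ij\pi/(n+1))$, is not given in the text; it is the standard eigenvector for this type of tridiagonal matrix and is included here as an assumption that the code itself verifies. The helper names are invented for this illustration.

```python
import math

def chain_eigenvalues(n, s, sc):
    """Eigenvalues of (4.9) from formula (4.10), with a = 2 + s/sc."""
    a = 2.0 + s / sc
    return [a + 2.0 * math.cos(i * math.pi / (n + 1)) for i in range(1, n + 1)]

def chain_matvec(a, x):
    """y = A x for the tridiagonal matrix (4.9), without storing A."""
    n = len(x)
    y = []
    for j in range(n):
        left = x[j - 1] if j > 0 else 0.0
        right = x[j + 1] if j < n - 1 else 0.0
        y.append(-left + a * x[j] - right)
    return y

def min_gap(lams):
    """Smallest distance between neighboring eigenvalues."""
    ordered = sorted(lams)
    return min(q - p for p, q in zip(ordered, ordered[1:]))

n, s, sc = 30, 1.0, 0.1
a = 2.0 + s / sc
lams = chain_eigenvalues(n, s, sc)

# Check A x = lambda_i x for i = 1, using the assumed eigenvector
i = 1
theta = i * math.pi / (n + 1)
x = [(-1) ** j * math.sin(j * theta) for j in range(1, n + 1)]
Ax = chain_matvec(a, x)
residual = max(abs(Ax[j] - lams[i - 1] * x[j]) for j in range(n))

gap_30 = min_gap(lams)
gap_3000 = min_gap(chain_eigenvalues(3000, s, sc))
print(residual, gap_30, gap_3000)
```

Comparing gap_30 with gap_3000 gives a rough measure of how much closer together the eigenvalues are when n = 3000.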

Computing Eigenvalues and Eigenvectors

The methods developed in this chapter to compute eigenvalues will not use (4.2). The reason is that the characteristic equation involves a determinant, and these require on the order of $n^3$ flops to compute. This is significant because the roots of (4.2) can be very sensitive to round-off error, and an illustration of this is Figure 1.1. Even if the determinant is avoided, an eigenvalue can be very sensitive to round-off errors. In particular, a small error in evaluating the matrix can lead to a disproportionately large change in the eigenvalue. An example of this is given in Exercise 4.60. Fortunately, this does not happen with symmetric matrices or, for that matter, normal matrices. It also does not happen for singular values, which are investigated in Section 4.6.

Figure 4.2 Eigenvalues of an oscillator chain in the case when n = 30. (Plot of the eigenvalue, on the vertical axis, against the index i, on the horizontal axis.)

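4.2 Power Method

We are going to derive a procedure for calculating an eigenvector for what is known as the dominant eigenvalue of a matrix. The key observation is that if you multiply Ax = λx by A, you get $\mathbf{A}^2\mathbf{x} = \lambda \mathbf{A}\mathbf{x} = \lambda^2 \mathbf{x}$. Multiplying by A again gives $\mathbf{A}^3\mathbf{x} = \lambda^3 \mathbf{x}$, and in general $\mathbf{A}^k\mathbf{x} = \lambda^k \mathbf{x}$. How fast $\lambda^k$ grows or decays depends on |λ|. The power method takes advantage of this and finds the eigenvalue with the largest magnitude. As an example, consider the eigenvalue problem in (4.5). The matrix is
\[
\mathbf{A} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \tag{4.11}
\]
with eigenvalues $\lambda_1 = 3$ and $\lambda_2 = 1$. Because $|\lambda_1| > |\lambda_2|$, the method being derived will find an eigenvector for $\lambda_1$. For the power method you guess a nonzero starting vector $\mathbf{y}_0$ and then determine $\mathbf{A}\mathbf{y}_0, \mathbf{A}^2\mathbf{y}_0, \mathbf{A}^3\mathbf{y}_0, \cdots$. However, before doing so recall that we found two linearly independent eigenvectors $\mathbf{x}_1$ and $\mathbf{x}_2$ for this matrix. They can be used as a basis, which means that we can write $\mathbf{y}_0 = \alpha_1 \mathbf{x}_1 + \alpha_2 \mathbf{x}_2$. In this example, we will assume that $\alpha_1 = \alpha_2 = 1$, so
\[
\mathbf{y}_0 = \mathbf{x}_1 + \mathbf{x}_2. \tag{4.12}
\]
With this, $\mathbf{y}_1 = \mathbf{A}\mathbf{y}_0 = \mathbf{A}\mathbf{x}_1 + \mathbf{A}\mathbf{x}_2 = \lambda_1 \mathbf{x}_1 + \lambda_2 \mathbf{x}_2$, and $\mathbf{y}_2 = \mathbf{A}\mathbf{y}_1 = \lambda_1^2 \mathbf{x}_1 + \lambda_2^2 \mathbf{x}_2$. At the kth step,
\[
\mathbf{y}_k = \mathbf{A}\mathbf{y}_{k-1} = \lambda_1^k \mathbf{x}_1 + \lambda_2^k \mathbf{x}_2 = \lambda_1^k \left( \mathbf{x}_1 + \omega^k \mathbf{x}_2 \right), \tag{4.13}
\]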

where $\omega = \lambda_2/\lambda_1 = 1/3$. After calculating $\mathbf{y}_k$, we can use the Rayleigh quotient (4.4) to obtain an approximation $v_k$ of the corresponding eigenvalue. The result is
\[
v_k = \frac{\mathbf{y}_k \bullet \mathbf{A}\mathbf{y}_k}{\mathbf{y}_k \bullet \mathbf{y}_k}
= \frac{(\mathbf{x}_1 + \omega^k \mathbf{x}_2) \bullet (\lambda_1 \mathbf{x}_1 + \omega^k \lambda_2 \mathbf{x}_2)}{(\mathbf{x}_1 + \omega^k \mathbf{x}_2) \bullet (\mathbf{x}_1 + \omega^k \mathbf{x}_2)}
= \lambda_1 \frac{1 + \omega^{2k+1}}{1 + \omega^{2k}}. \tag{4.14}
\]
Because $\omega^{2k} \to 0$ as $k \to \infty$, it follows that $v_k \to \lambda_1$ as $k \to \infty$.

There is a potential numerical difficulty with the formula for $\mathbf{y}_k$ in (4.13) because the term $\lambda_1^k$ becomes very large as k increases. This can be avoided by scaling the vectors. Specifically, at the kth step, one first computes $\mathbf{z}_k = \mathbf{A}\mathbf{y}_{k-1}$, and then lets $\mathbf{y}_k = \mathbf{z}_k/\|\mathbf{z}_k\|_2$, where $\|\mathbf{z}_k\|_2 = \sqrt{\mathbf{z}_k \bullet \mathbf{z}_k}$. What this does is effectively remove the $\lambda_1^k$ coefficient in (4.13). The resulting algorithm is given in Table 4.1. Note that the values for v, y, and z are not indexed, but are overwritten as the procedure proceeds.

  Pick:  random y, and tol > 0
  Let:   z = Ay
         v = (y • z)/(y • y)
         err = 10 ∗ tol
  Loop:  while err > tol
             y = z/||z||2
             z = Ay
             V = v
             v = y • z
             err = |v − V|/|v|
         end

Table 4.1 Power method for calculating the dominant eigenvalue of A. After exiting the loop, v is the computed value for the eigenvalue and y = z/||z||2 is the computed eigenvector.

Example 4.2

Suppose $\mathbf{y}_0 = (2, 0)^T$, and A is given in (4.11). Applying the power method, as given in Table 4.1,
\[
\mathbf{z}_0 = \mathbf{A}\mathbf{y}_0 = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ 0 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix},
\]
and
\[
v_0 = \frac{\mathbf{y}_0 \bullet \mathbf{z}_0}{\mathbf{y}_0 \bullet \mathbf{y}_0} = 2.
\]

Continuing,
\[
\mathbf{y}_1 = \frac{\mathbf{z}_0}{\sqrt{\mathbf{z}_0 \bullet \mathbf{z}_0}} = \frac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad
\mathbf{z}_1 = \mathbf{A}\mathbf{y}_1 = \frac{1}{\sqrt{5}} \begin{pmatrix} 5 \\ 4 \end{pmatrix}, \qquad
v_1 = \mathbf{y}_1 \bullet \mathbf{z}_1 = \frac{14}{5} = 2.8.
\]
The next steps are calculated using MATLAB, with the results given in Table 4.2. It is evident that $v_k$ is approaching the dominant eigenvalue $\lambda_1 = 3$. In this calculation, tol = $10^{-8}$, and the resulting eigenvector is $\mathbf{y} = \mathbf{z}/\|\mathbf{z}\|_2 = (0.707118756, 0.70709480)^T$. The exact eigenvector is $\mathbf{y} = (1/\sqrt{2}, 1/\sqrt{2})^T$, where $1/\sqrt{2} = 0.707106781\ldots$. So, the eigenvector has not been computed to the same accuracy as the eigenvalue.

As you might have noticed, a random $\mathbf{y}_0$ was not used in the above example, although this is part of the stated algorithm in Table 4.1. In working out by-hand examples in this chapter, or their MATLAB extensions as in Table 4.2, the starting values are selected to simplify the calculations.

  k    vk                |(vk − λ1)/(vk−1 − λ1)|    |(vk − vk−1)/(vk−1 − vk−2)|
  0    2.000000000000
  1    2.800000000000    2.00e−01
  2    2.975609756098    1.22e−01                   2.20e−01
  3    2.997260273973    1.12e−01                   1.23e−01
  4    2.999695214874    1.11e−01                   1.12e−01
  5    2.999966130398    1.11e−01                   1.11e−01
  6    2.999996236654    1.11e−01                   1.11e−01
  7    2.999999581850    1.11e−01                   1.11e−01
  8    2.999999953539    1.11e−01                   1.11e−01
  9    2.999999994838    1.11e−01                   1.11e−01

Table 4.2 The value for the dominant eigenvalue computed using the power method applied to (4.11). The exact value is λ1 = 3. Also, the error ratios given in the last two columns both approach |λ2/λ1|² = 1/9 ≈ 0.1111.
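For readers who want to reproduce the first rows of Table 4.2, the following is one possible Python transcription of the algorithm in Table 4.1. It is only a sketch: the function name power_method, the history list, and the max_iter safeguard are additions made for this illustration and are not part of Table 4.1. The starting vector is the $(2, 0)^T$ used in Example 4.2 rather than a random one.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def power_method(A, y, tol=1e-8, max_iter=1000):
    """Power method following Table 4.1 (max_iter is an added safeguard)."""
    z = matvec(A, y)
    v = dot(y, z) / dot(y, y)
    history = [v]                       # v_0, v_1, v_2, ...
    err = 10.0 * tol
    steps = 0
    while err > tol and steps < max_iter:
        norm_z = math.sqrt(dot(z, z))
        y = [zi / norm_z for zi in z]   # scale so that ||y||_2 = 1
        z = matvec(A, y)
        V = v
        v = dot(y, z)                   # Rayleigh quotient, since y . y = 1
        err = abs(v - V) / abs(v)
        history.append(v)
        steps += 1
    norm_z = math.sqrt(dot(z, z))
    y = [zi / norm_z for zi in z]       # computed eigenvector
    return v, y, history

A = [[2.0, 1.0], [1.0, 2.0]]            # the matrix in (4.11)
v, y, history = power_method(A, [2.0, 0.0])
print(history[:3], v, y)
```

The first entries of history should agree with the $v_0$ and $v_1$ computed by hand in Example 4.2. How many steps are taken before the loop exits depends on tol.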


Rate of Convergence

The formula in (4.14) has some useful information related to how fast the method converges. To explain, we have that
\[
v_k - \lambda_1 = \lambda_1 \frac{(\omega - 1)\,\omega^{2k}}{1 + \omega^{2k}}.
\]
This means that as k increases, $v_k - \lambda_1 \approx \lambda_1 (\omega - 1)\omega^{2k}$. Since this also means that $v_{k-1} - \lambda_1 \approx \lambda_1 (\omega - 1)\omega^{2(k-1)}$, it then follows that
\[
\frac{v_k - \lambda_1}{v_{k-1} - \lambda_1} \approx \left| \frac{\lambda_2}{\lambda_1} \right|^2. \tag{4.15}
\]
This is a particularly useful result because it states that once $v_k$ starts to get close to $\lambda_1$, the error $|v_k - \lambda_1|$ is approximately a factor of $|\lambda_2/\lambda_1|^2$ smaller than the previous error $|v_{k-1} - \lambda_1|$. Moreover, using the same approximation used to derive (4.15), it is possible to show that
\[
v_k - v_{k-1} \approx \left| \frac{\lambda_2}{\lambda_1} \right|^2 (v_{k-1} - v_{k-2}). \tag{4.16}
\]
So, the iterative error $|v_k - v_{k-1}|$ is approximately a factor of $|\lambda_2/\lambda_1|^2$ smaller than the previous iterative error $|v_{k-1} - v_{k-2}|$. In the above example, the fact that the error (4.15), and the iterative error (4.16), are reduced by approximately a factor of $|\lambda_2/\lambda_1|^2 = 1/9 \approx 0.1111$ with each step is evident in Table 4.2.

A conclusion coming from (4.15) and (4.16) is that if $|\lambda_1|$ and $|\lambda_2|$ are well separated, so $|\lambda_2/\lambda_1| \ll 1$, then the method converges fairly quickly. However, if $|\lambda_1|$ and $|\lambda_2|$ are close together then the convergence is relatively slow. This observation will apply to most of the methods considered in this chapter.

A couple of comments should be made before concluding. One is that the rate of convergence for the eigenvector is slower than for the eigenvalue. In (4.13), the $\mathbf{x}_2$ contribution decreases by a factor of $|\lambda_2/\lambda_1|$ with each iteration step. This is the reason for the observed inaccuracy of the eigenvector in Example 4.2. Second, the error reduction factor $|\lambda_2/\lambda_1|^2$ for the eigenvalue is what is obtained for a symmetric matrix. If the matrix is not symmetric then the reduction factor is, typically, just $|\lambda_2/\lambda_1|$. The reason for the difference is that in deriving (4.14), the fact that $\mathbf{x}_1 \bullet \mathbf{x}_2 = 0$ was used. For a non-symmetric matrix the orthogonality of the eigenvectors is not guaranteed.
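The ratios in (4.15) and (4.16) can also be examined without running the power method, by evaluating the closed form (4.14) directly. The short Python sketch below does this for the matrix (4.11), where $\lambda_1 = 3$ and $\lambda_2 = 1$. The function name v_exact is made up for this illustration.

```python
def v_exact(k, lam1=3.0, lam2=1.0):
    """Closed form (4.14) for v_k, with omega = lam2/lam1."""
    w = lam2 / lam1
    return lam1 * (1.0 + w ** (2 * k + 1)) / (1.0 + w ** (2 * k))

lam1 = 3.0
vs = [v_exact(k) for k in range(8)]

# Error ratio on the left-hand side of (4.15), for k = 1, 2, ...
err_ratio = [abs((vs[k] - lam1) / (vs[k - 1] - lam1)) for k in range(1, 8)]

# Iterative error ratio from (4.16), for k = 2, 3, ...
iter_ratio = [abs((vs[k] - vs[k - 1]) / (vs[k - 1] - vs[k - 2]))
              for k in range(2, 8)]

for k in range(2, 8):
    print(k, vs[k], err_ratio[k - 1], iter_ratio[k - 2])
```

Only the first eight values of k are used because the differences $v_k - \lambda_1$ become small enough that round-off starts to affect the ratios for larger k.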


4.2.1 General Formulation

The task now is to itemize the conditions needed so the power method will work on an n × n matrix. It is assumed that the matrix is not defective. So, it has independent eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$ with respective eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$. It is assumed throughout this chapter that the labeling is such that eigenvalues satisfy $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$. Note also that the eigenvalues appear in this list according to their geometric multiplicity. For example, if λ = 2 has a geometric multiplicity of 4 then it appears 4 times in the list.

Now, in (4.14) and (4.15), we used the inequality $|\lambda_2/\lambda_1| < 1$ to be able to conclude that the method converges. This means that $\lambda = \lambda_1$ is the dominant eigenvalue, and this is defined next.

Definition of Dominant Eigenvalue. $\lambda = \lambda_1$ is the dominant eigenvalue of A if given any other eigenvalue $\lambda_i$, with $\lambda_i \neq \lambda_1$, then $|\lambda_1| > |\lambda_i|$.

Note that because absolute values are used here, the dominant eigenvalue is not necessarily the largest eigenvalue.

Example 4.3

Assume that A is a 4 × 4 matrix, and that it has eigenvalues:
a) $\lambda_1 = 3$, $\lambda_2 = 3$, $\lambda_3 = -2$, $\lambda_4 = 1$. In this case λ = 3 is the dominant eigenvalue.
b) $\lambda_1 = 3$, $\lambda_2 = -3$, $\lambda_3 = -2$, $\lambda_4 = 1$. In this case λ = 3 is not the dominant eigenvalue since $\lambda_1 \neq \lambda_2$ but $|\lambda_1| = |\lambda_2|$.
c) $\lambda_1 = 10$, $\lambda_2 = 2 + i$, $\lambda_3 = 2 - i$, $\lambda_4 = 1$. In this case $\lambda = \lambda_1$ is the dominant eigenvalue since $|2 \pm i| = \sqrt{5} < 10$.
d) $\lambda_1 = 3$, $\lambda_2 = 3$, $\lambda_3 = 3$, $\lambda_4 = 3$. In this case λ = 3 is the dominant eigenvalue.

What happened in Example 4.3(b) needs special mention. Namely, it is not possible for a nonzero eigenvalue λ to be a dominant eigenvalue if −λ is also an eigenvalue. This situation will cause most of the eigenvalue solvers considered in this chapter to fail. However, it is easy to fix the problem using shifting, and this is explained later.

The second thing we needed for the power method to work was that $\mathbf{y}_0$ in (4.12) included a contribution from $\mathbf{x}_1$. In other words, the initial guess $\mathbf{y}_0$ must include a part of an eigenvector for the dominant eigenvalue. The theorem establishing the convergence of the power method, as given in Table 4.1, is next.


Theorem 4.2 Assume that A is not defective. If A has a dominant eigenvalue $\lambda = \lambda_1$, then the following almost always holds: the power method, as outlined in Table 4.1, converges to $\lambda_1$. Moreover, once $v_k$ gets close to $\lambda_1$,
\[
|v_k - \lambda_1| \approx R\,|v_{k-1} - \lambda_1|, \tag{4.17}
\]
and
\[
|v_k - v_{k-1}| \approx R\,|v_{k-1} - v_{k-2}|. \tag{4.18}
\]
The error reduction factor R satisfies $0 \le R \le |\lambda_j/\lambda_1|$, where $\lambda_j$ is the next largest eigenvalue in absolute value (or, magnitude). If A is symmetric, then
\[
R = \left| \frac{\lambda_j}{\lambda_1} \right|^2. \tag{4.19}
\]

Using the definition for the rate of convergence given in Section 2.2.1, because of (4.17), the power method is a first-order method.

It is necessary to explain what “almost always” means in the above theorem. First, it is possible to pick the initial guess for y so the method does not converge to the dominant eigenvalue. This happens if the guess does not include some part of an eigenvector for the dominant eigenvalue. Second, the formula (4.19) does not hold if the initial guess does not contain a contribution from an eigenvector for $\lambda_j$. However, by picking the initial y randomly, as stated in Table 4.1, the likelihood of having either of these situations occur is so low that it is statistically zero. For this reason, these possibilities are ignored in what follows.

Example 4.4

Assume that A is a 4 × 4 matrix, and that it has eigenvalues:
a) $\lambda_1 = 3$, $\lambda_2 = 3$, $\lambda_3 = -2$, $\lambda_4 = 1$. There is a dominant eigenvalue, so the power method will work. As for the error reduction factor, since $\lambda_j = \lambda_3 = -2$, then R ≤ 2/3. In the case when the matrix is symmetric, R = 4/9.
b) $\lambda_1 = 10$, $\lambda_2 = 2 + i$, $\lambda_3 = 2 - i$, $\lambda_4 = 1$. There is a dominant eigenvalue, so the power method will work. Since $\lambda_j = \lambda_2 = 2 + i$, then $R \le |(2 + i)/10| = \sqrt{5}/10$.
c) $\lambda_1 = 3$, $\lambda_2 = 3$, $\lambda_3 = 3$, $\lambda_4 = 3$. There is a dominant eigenvalue, so the power method will work. Moreover, R = 0. This is because λ = 3 is an eigenvalue with a geometric multiplicity of 4. Since the matrix is 4 × 4, then any nonzero vector from $\mathbb{R}^4$ is an eigenvector. In other words, any nonzero guess is an eigenvector, so the power method converges in one step.
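The reasoning used in Examples 4.3 and 4.4 can be written out as a small Python sketch. Given a list of eigenvalues, it applies the definition of a dominant eigenvalue and then evaluates the bound on R from Theorem 4.2. The function names and the rel_tol parameter, which decides when two floating-point values are treated as equal, are assumptions made for this illustration.

```python
def dominant_eigenvalue(eigs, rel_tol=1e-12):
    """Return the dominant eigenvalue of the list, or None if there is none."""
    biggest = max(abs(lam) for lam in eigs)
    # Eigenvalues whose magnitude ties with the largest magnitude
    top = [lam for lam in eigs if abs(lam) >= (1.0 - rel_tol) * biggest]
    # A dominant eigenvalue exists only if all of these are the same number
    if all(abs(lam - top[0]) <= rel_tol * biggest for lam in top):
        return top[0]
    return None

def reduction_factor_bound(eigs, symmetric=False, rel_tol=1e-12):
    """Upper bound |lam_j/lam_1| on R, or its square as in (4.19) if symmetric."""
    lam1 = dominant_eigenvalue(eigs, rel_tol)
    if lam1 is None:
        return None
    others = [abs(lam) for lam in eigs
              if abs(lam - lam1) > rel_tol * abs(lam1)]
    if not others:
        return 0.0                      # every eigenvalue equals lam1
    ratio = max(others) / abs(lam1)
    return ratio ** 2 if symmetric else ratio

print(dominant_eigenvalue([3, -3, -2, 1]))                     # Example 4.3(b)
print(reduction_factor_bound([3, 3, -2, 1]))                   # Example 4.4(a)
print(reduction_factor_bound([3, 3, -2, 1], symmetric=True))   # Example 4.4(a)
print(reduction_factor_bound([10, 2 + 1j, 2 - 1j, 1]))         # Example 4.4(b)
```

In practice the eigenvalues are not known in advance, so this is a way to check the examples rather than something used inside the power method.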


Example 4.5

Suppose A is the symmetric n × n matrix
\[
\mathbf{A} = \begin{pmatrix}
a & & & & 1 \\
 & a & & & \\
 & & \ddots & & \\
 & & & a & \\
1 & & & & a
\end{pmatrix}. \tag{4.20}
\]
Specifically, if $a_{ij}$ denotes the entries in A, then $a_{ij} = 0$ except that $a_{ii} = a$ and $a_{n1} = a_{1n} = 1$. For n ≥ 3, A has three distinct eigenvalues: a + 1, a, and a − 1 (see Exercise 4.4). Also, λ = a has a geometric multiplicity of n − 2. Taking a = 3, then according to our labeling assumption, $\lambda_1 = 4$, $\lambda_2 = 3$, $\lambda_3 = 3$, $\cdots$, $\lambda_{n-1} = 3$, and $\lambda_n = 2$. Applying the power method to this matrix, with n = 200, produces the results given in Table 4.3. For this matrix, the dominant eigenvalue is $\lambda_1 = 4$. The convergence is not as fast as in Table 4.2, and the reason is that $\lambda_2/\lambda_1 = 3/4$. According to (4.19), once the method starts to get close to the exact solution, the error should be reduced by a factor of $R = (3/4)^2 \approx 0.56$ at each step. This reduction holds for the error ratio coming from (4.17), given in the third column, and for the iterative error ratio coming from (4.18), given in the fourth column.

  k     vk                |(vk − λ1)/(vk−1 − λ1)|    |(vk − vk−1)/(vk−1 − vk−2)|
  0     3.000300728189
  1     3.000633149675    1.00e+00
  2     3.001168862204    9.99e−01                   1.61e+00
  3     3.002095543148    9.99e−01                   1.73e+00
  4     3.003727977465    9.98e−01                   1.76e+00
  ...   ...               ...                        ...
  22    3.991584072355    5.66e−01                   5.73e−01
  23    3.995248545996    5.65e−01                   5.68e−01
  ...   ...               ...                        ...
  34    3.999991483933    5.63e−01                   5.63e−01
  35    3.999995209694    5.63e−01                   5.63e−01

Table 4.3 Dominant eigenvalue λ1 = 4 of (4.20) as computed using the power method. Also the values of the ratios in the last two columns approach the theoretical value for the error reduction factor for a symmetric matrix, which is R = |λ2/λ1|² ≈ 0.56.

A few comments are in order before concluding this section. First, because eigenvectors are not unique, you can run the algorithm twice and get two different answers. For example, one time you get y and the next time you might get −y. Or, if the geometric multiplicity of the eigenvalue is greater than one, you can get a completely different eigenvector. This latter observation can be used to advantage as illustrated in Exercise 4.13.

The second comment concerns the assumption that A is not defective. This is needed because of the way the theorem is proved. This does not mean that the method will necessarily fail if applied to a defective matrix. However, as illustrated in Exercise 4.60, defective matrices are well-named.

Finally, you might be wondering about the formula for the error reduction factor if the matrix is not symmetric. The exact formula depends on the orthogonality of the eigenvectors. For example, if the non-symmetric matrix has an eigenvector for $\lambda_j$ that is not orthogonal to the eigenvectors for the dominant eigenvalue then $R = |\lambda_j/\lambda_1|$. Illustrations of what values you can get for R can be found in Exercise 4.17.
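The behavior seen in Example 4.5 can be reproduced with a few lines of Python. The sketch below applies the steps of Table 4.1 to (4.20) with a = 3 and n = 200, and it uses the structure of the matrix to compute Ay without storing A. Two choices here are assumptions made for the illustration and differ from what was used for Table 4.3: the starting vector is all ones instead of random (so the run is repeatable), and tol is $10^{-10}$. Consequently, the early values of $v_k$ will not match those in the table, but the error ratio should still settle near $(3/4)^2$.

```python
import math

def corner_matvec(a, y):
    """z = A y for (4.20): a on the diagonal, 1 in the (1,n) and (n,1) entries."""
    z = [a * yi for yi in y]
    z[0] += y[-1]
    z[-1] += y[0]
    return z

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def power_method_corner(a, n, tol=1e-10, max_iter=500):
    """Steps of Table 4.1 applied to (4.20); returns v_0, v_1, v_2, ..."""
    y = [1.0] * n                       # deterministic start (an assumption)
    z = corner_matvec(a, y)
    v = dot(y, z) / dot(y, y)
    values = [v]
    err = 10.0 * tol
    while err > tol and len(values) <= max_iter:
        norm_z = math.sqrt(dot(z, z))
        y = [zi / norm_z for zi in z]
        z = corner_matvec(a, y)
        V, v = v, dot(y, z)
        err = abs(v - V) / abs(v)
        values.append(v)
    return values

values = power_method_corner(3.0, 200)
lam1 = 4.0
ratios = [abs((values[k] - lam1) / (values[k - 1] - lam1))
          for k in range(1, len(values))]
print(len(values), values[-1], ratios[29])
```

As in Table 4.3, the first several ratios are close to one, and it takes a number of steps before the predicted reduction factor is seen.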

4.3 Extensions of the Power Method

It is relatively easy to modify the power method to find some of the other eigenvalues and eigenvectors of a matrix. This requires knowing the right formulas from linear algebra, and two of particular usefulness are the following.

Theorem 4.3 Suppose the eigenvalues of an n × n matrix A are $\lambda_1, \lambda_2, \cdots, \lambda_n$.
(1) Setting B = A + sI, where s is a constant, then the eigenvalues of B are $\lambda_1 + s, \lambda_2 + s, \cdots, \lambda_n + s$. Also, if $\mathbf{x}_i$ is an eigenvector for A that corresponds to $\lambda_i$, then $\mathbf{x}_i$ is an eigenvector for B that corresponds to $\lambda_i + s$.
(2) If A is invertible, then the eigenvalues of $\mathbf{A}^{-1}$ are $1/\lambda_1, 1/\lambda_2, \cdots, 1/\lambda_n$. Also, if $\mathbf{x}_i$ is an eigenvector for A that corresponds to $\lambda_i$, then $\mathbf{x}_i$ is an eigenvector for $\mathbf{A}^{-1}$ that corresponds to $1/\lambda_i$.

How to use these with the power method is illustrated in the next example.

Example 4.6

Suppose A is a symmetric matrix with eigenvalues −6, −2, 1/4, 1, 3. These are located in Figure 4.3.


(1) B = A + 3I: The respective locations of the shifted eigenvalues are shown in Figure 4.3. So, if the power method is applied to B it will compute the eigenvalue λ = 6 and eigenvector y. In this case, the eigenvalue for A is λ = λ − 3 = 3, with eigenvector y. Since A is symmetric then so is B. Consequently, the error reduction factor in this case is R = (4/6)2 . (2) A−1 : The respective locations of the eigenvalues are shown in Figure 4.3. So, if the power method is applied to A−1 it will compute the eigenvalue λ = 4 and eigenvector y. In this case, the eigenvalue for A is λ = 1/λ = 1/4, with eigenvector y. Since A is symmetric then so is A−1 . Consequently, the error reduction factor in this case is R = (1/4)2 .  In the above example, shifting allowed us to determine a non-dominant eigenvalue for A. It can also be used to prevent ± pairs for eigenvalues. For example, a 2 × 2 matrix A with eigenvalues ±2 does not have a dominant eigenvalue. However, B = A + 3I and B = A − 3I do. Shifting can also be used to change the error reduction factor, for better or worse. To illustrate, a symmetric 2 × 2 matrix A with eigenvalues 1, 4 has the error reduction factor R = 1/16 ≈ 0.06. In contrast B = A + 3I has the error reduction factor R = (4/7)2 ≈ 0.3, which is worse, while B = A − 12 I has the error reduction factor R = (1/7)2 ≈ 0.02, which is better. As demonstrated in the above example, applying the power method to A−1 ends up determining the eigenvalue for A that is closest to zero. This requires, of course, that A−1 has a dominant eigenvalue. When using the power method in Table 4.1 with A−1 , the command z = Ay is replaced with z = A−1 y. So, z can be found by computing A−1 before the loop starts. On the multi-core systems currently in use, it typically takes less than 20 milliseconds to compute the inverse of a 1000 × 1000 matrix. If you prefer not to wait this long, you can reduce the time in (about) half using the LU method. 
This requires writing z = A−1 y as Az = y, and then using the LU method to solve this equation. This requires a small, but straightforward, rewrite of the algorithm given in Table 4.1.
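The error reduction factors quoted in Example 4.6 and in the 2 × 2 illustration can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a symmetric matrix so that R is the squared ratio of the two largest eigenvalue magnitudes (Theorem 4.2); the helper names `reduction_factor`, `shifted`, and `inverted` are hypothetical and chosen only for this illustration.

```python
def reduction_factor(eigenvalues):
    """R = (second largest magnitude / largest magnitude)^2, symmetric case."""
    mags = sorted((abs(lam) for lam in eigenvalues), reverse=True)
    if mags[0] == mags[1]:
        raise ValueError("no dominant eigenvalue (for example, a +/- pair)")
    return (mags[1] / mags[0]) ** 2


def shifted(eigenvalues, s):
    """Eigenvalues of B = A + sI, from Theorem 4.3(1)."""
    return [lam + s for lam in eigenvalues]


def inverted(eigenvalues):
    """Eigenvalues of the inverse of A, from Theorem 4.3(2)."""
    return [1.0 / lam for lam in eigenvalues]


# Example 4.6: eigenvalues of the symmetric matrix A
eigs_A = [-6, -2, 0.25, 1, 3]
R_shift = reduction_factor(shifted(eigs_A, 3))    # power method on A + 3I
R_inv = reduction_factor(inverted(eigs_A))        # power method on the inverse

# The 2 x 2 illustration with eigenvalues 1 and 4
R_plain = reduction_factor([1, 4])
R_worse = reduction_factor(shifted([1, 4], 3))
R_better = reduction_factor(shifted([1, 4], -0.5))

print(R_shift, R_inv, R_plain, R_worse, R_better)
```

A smaller R means fewer iterations, so comparing these values is a quick way to judge a proposed shift before running the power method.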

[Figure 4.3: Locations of the five eigenvalues for A, B = A + 3I, and A^{-1} used in Example 4.6. Three panels, labeled A, B, and A^{-1}, each with a horizontal axis "Eigenvalue" running from −6 to 6.]


4.3.1 Inverse Iteration

The method described next is based on the following question: if we know that λi ≈ μ, can we use this to compute an accurate value for λi, along with an associated eigenvector? In answering this, it is assumed that λi is the eigenvalue of A that is closest to μ. This means that the eigenvalue of B = A − μI that is closest to zero is λi − μ. Applying the power method to B^{-1} yields its dominant eigenvalue λ̄, along with an associated eigenvector y. Since λ̄ = 1/(λi − μ), it follows that λi = μ + 1/λ̄, with associated eigenvector y. This procedure is known as the inverse iteration method.

To determine the error reduction factor, recall that λ = λi yields the smallest value of |λ − μ|. Assume that λ = λj yields the second smallest value. If the matrix is symmetric then, according to Theorem 4.2, R = |(λi − μ)/(λj − μ)|^2. Consequently, if the approximation is good enough that |λi − μ| ≪ |λj − μ|, then the method will converge very quickly.

One of the more useful applications of inverse iteration is to determine eigenvectors. As will be seen later, there are efficient methods for finding eigenvalues that are not so efficient for finding eigenvectors. With good approximations for the eigenvalues, inverse iteration provides a very good method for finding the eigenvectors.

Example 4.7 The eigenvalues of

$$ A = \begin{pmatrix} 4 & 1 & -2 \\ 1 & 3 & 1 \\ -2 & 1 & 4 \end{pmatrix} \tag{4.21} $$

are λ1 = 6, λ2 = 4, and λ3 = 1. Inverse iteration will be used to compute λ2 by taking μ = 3. The shifted matrix is then

$$ B = A - 3I = \begin{pmatrix} 1 & 1 & -2 \\ 1 & 0 & 1 \\ -2 & 1 & 1 \end{pmatrix}. $$

Taking y0 = (0, 2, 0)^T, and solving Bz0 = y0, one finds that

$$ z_0 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}. $$

With this v0 = (y0 • z0)/(y0 • y0) = 1/2, and this leads to the approximation λ2 ≈ μ + 1/v0 = 5. To improve on this, we calculate

$$ y_1 = \frac{z_0}{\|z_0\|_2} = \frac{1}{\sqrt{3}} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} $$

and, from solving Bz1 = y1,

$$ z_1 = \frac{\sqrt{3}}{6} \begin{pmatrix} 1 \\ 3 \\ 1 \end{pmatrix}. $$

With this v1 = y1 • z1 = 5/6, and this leads to the approximation λ2 ≈ μ + 1/v1 = 21/5 = 4.2. The remaining approximations are calculated using MATLAB, with tol = 10^{−10}. The results are given in Table 4.4. The last column gives the computed value for

$$ R_k = \frac{|v_k - v_{k-1}|}{|v_{k-1} - v_{k-2}|}, \tag{4.22} $$

which is the approximation for the error reduction factor. Since the eigenvalues for B are 3, 1, and −2, the error reduction factor for inverse iteration is R = (1/2)^2 = 0.25. As seen in Table 4.4, Rk → R.

  k     λ2                 |vk − vk−1|/|vk|     Rk
  0     5.000000000000
  1     4.200000000000     4.00e−01
  2     4.047619047619     1.27e−01             3.64e−01
  3     4.011764705882     3.42e−02             2.79e−01
  4     4.002932551320     8.73e−03             2.57e−01
  5     4.000732600733     2.19e−03             2.52e−01
  6     4.000183116645     5.49e−04             2.50e−01
  ...   ...                ...                  ...
 16     4.000000000175     5.24e−10             2.50e−01
 17     4.000000000044     1.31e−10             2.50e−01
 18     4.000000000011     3.27e−11             2.50e−01

Table 4.4 Result of using inverse iteration to calculate the eigenvalue λ2 = 4 of (4.21), along with the value of Rk given in (4.22).
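The steps of Example 4.7 can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: the function name `inverse_iteration` is hypothetical, the stopping test is the relative iterative error used in the table, and `np.linalg.solve` is called at every step for brevity, whereas a practical code would factor B once and reuse the factors.

```python
import numpy as np


def inverse_iteration(A, mu, y0, tol=1e-10, max_iter=100):
    """Approximate the eigenvalue of A closest to mu (hypothetical helper).

    Returns the list of approximations mu + 1/v_k and the last unit vector.
    """
    A = np.asarray(A, dtype=float)
    B = A - mu * np.eye(A.shape[0])      # shifted matrix B = A - mu*I
    y = np.asarray(y0, dtype=float)
    history = []
    v_old = None
    for _ in range(max_iter):
        z = np.linalg.solve(B, y)        # solve B z = y (power step on inverse of B)
        v = (y @ z) / (y @ y)            # Rayleigh quotient for the inverse of B
        history.append(mu + 1.0 / v)     # corresponding approximation for A
        y = z / np.linalg.norm(z)        # normalize for the next step
        if v_old is not None and abs(v - v_old) <= tol * abs(v):
            break
        v_old = v
    return history, y


A = np.array([[4.0, 1.0, -2.0],
              [1.0, 3.0, 1.0],
              [-2.0, 1.0, 4.0]])
history, y = inverse_iteration(A, mu=3.0, y0=[0.0, 2.0, 0.0])
print(history[:3], history[-1])
```

With the starting vector and shift of Example 4.7, the first entries of `history` should agree with the first rows of Table 4.4.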


Example 4.8 The distinct eigenvalues of the n × n matrix given in (4.20), when a = 3, are 2, 3, and 4. We will consider two approximations for the largest eigenvalue, μ = 4.8 and μ = 4.4. Taking n = 200, the resulting values computed using the inverse iteration method are given in Table 4.5. Comparing these to the values computed using the power method, which are given in Table 4.3, it is evident that shifting does indeed reduce the number of iteration steps. It is also evident that the better the shift the faster the method works.  A potential complication with inverse iteration arises when μ is so close to λi that B = A − μI is singular to machine precision. One possible fix is to just use a slightly different μ. Another approach involves using restarts, and this is explained in Peters and Wilkinson [1979].

  k     λ1 with μ = 4.8     |vk − λ1|     λ1 with μ = 4.4     |vk − λ1|
  0     3.011084359147      0.00e+00      3.025334642298      0.00e+00
  1     3.053890689453      9.46e−01      3.248666409890      7.51e−01
  2     3.223877488525      7.76e−01      3.802384154198      1.98e−01
  3     3.593555475912      4.06e−01      3.980294828895      1.97e−02
  4     3.880854761463      1.19e−01      3.998361856085      1.64e−03
  5     3.973977120660      2.60e−02      3.999866074899      1.34e−04
  6     3.994750049149      5.25e−03      3.999989066061      1.09e−05
  7     3.998958585552      1.04e−03      3.999999107426      8.93e−07
  8     3.999794116469      2.06e−04      3.999999927137      7.29e−08
  9     3.999959324930      4.07e−05      3.999999994052      5.95e−09
 10     3.999991965156      8.03e−06      3.999999999514      4.86e−10

Table 4.5 Dominant eigenvalue λ1 = 4 of (4.20) as computed using the inverse iteration method using two different shifts.

4.3.2 Rayleigh Quotient Iteration

A second reason inverse iteration is important is because of a method derived from it. This modification involves improving the shift as the iteration proceeds. To explain, let y0 be the starting vector. Using Rayleigh's quotient (4.4), an approximate eigenvalue is

$$ \mu_0 = \frac{y_0 \bullet A y_0}{y_0 \bullet y_0}. $$

This provides the first shift, and we take B0 = A − μ0 I, and from this one finds z0 by solving B0 z0 = y0. As usual, this is normalized to produce y1 = z0/||z0||_2. This enables us to find a somewhat better approximation for the eigenvalue, which is μ1 = y1 • Ay1. This brings us to the better shifted matrix B1 = A − μ1 I. This process of using inverse iteration, but improving the approximation for the shift, is known as Rayleigh quotient iteration, and it is summarized in Table 4.6.

Pick:   random z, and tol > 0
Let:    y = z/||z||_2
        v = y • Ay
        err = 10 ∗ tol
Loop:   while err > tol
            B = A − vI
            Bz = y
            y = z/||z||_2
            V = v
            v = y • Ay
            err = |v − V|/|v|
        end

Table 4.6 Rayleigh quotient iteration for calculating an eigenvalue of A. Note that Bz = y means that the equation is solved for z.

The usual criticism of the Rayleigh quotient iteration is that the B matrix keeps changing. So, at each iteration step you must find an inverse matrix, or else find a LU factorization. Although this is potentially a drawback, as will be demonstrated in the example below, Rayleigh quotient iteration converges so quickly that this is often not a particular issue.

Example 4.9 The eigenvalues of the n × n matrix given in (4.20) are a, a − 1, and a + 1. Taking a = 3, and n = 200, the resulting values computed using the Rayleigh quotient iteration are shown in Table 4.7. The contrast between this and what is obtained using the power method on the same matrix, which is given in Table 4.3, is striking. It is also much faster than the inverse iteration results given in Table 4.5. It is possible to prove that the order of convergence is three (see Section 2.2.1). This means that if the error at step k is 10^{−m}, then at the next step you should expect the error will be about 10^{−3m}. This is evident in Table 4.7, where the number of zeros in the error effectively triples at each step until you get to the resolution of double precision.

  k     vk                 |vk − λ|
  0     0.050740429652
  1     3.000515277178     5.15e−04
  2     3.000000000139     1.39e−10
  3     3.000000000000     2.66e−15
  4     3.000000000000     1.78e−15

Table 4.7 Eigenvalue computed using Rayleigh quotient iteration for matrix (4.20), with a = 3.

The proof that the Rayleigh quotient iteration is third order requires the assumption that the matrix is symmetric or normal [Parlett, 1998]. There has been an effort to extend the method to non-normal matrices [Parlett, 1974], but it is known that the method can fail if the matrix is not symmetric [Batterson and Smillie, 1989]. Also, nothing has been said about which eigenvalue the Rayleigh quotient iteration converges to. In the above example, it converged to the middle eigenvalue, versus the dominant one or the one closest to zero. The one that is determined depends on the eigenvalue representation in the starting vector, and which eigenvalue the approximate shifts are closest to as the iteration proceeds. Those interested in pursuing this question should consult Beattie and Fox [1989] and Pantazis and Szyld [1995].
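A minimal NumPy sketch of Rayleigh quotient iteration is given below. It follows the description above, with the shift recomputed as y • Ay at each step. Several choices are assumptions made for this illustration rather than part of the text: the function name, the iteration cap, the fixed starting vector, the use of the 3 × 3 matrix (4.21) instead of (4.20), and the guard that stops the loop if the shifted matrix is reported as exactly singular.

```python
import numpy as np


def rayleigh_quotient_iteration(A, z, tol=1e-12, max_iter=50):
    """Rayleigh quotient iteration for a symmetric matrix (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(z, dtype=float)
    y = y / np.linalg.norm(y)
    v = y @ A @ y                          # initial shift from Rayleigh's quotient
    history = [v]
    for _ in range(max_iter):
        B = A - v * np.eye(A.shape[0])     # the shifted matrix changes every step
        try:
            z = np.linalg.solve(B, y)      # solve B z = y
        except np.linalg.LinAlgError:
            break                          # B singular: v is already an eigenvalue
        y = z / np.linalg.norm(z)
        v_old, v = v, y @ A @ y            # improved shift
        history.append(v)
        if abs(v - v_old) <= tol * abs(v):
            break
    return v, y, history


A = np.array([[4.0, 1.0, -2.0],
              [1.0, 3.0, 1.0],
              [-2.0, 1.0, 4.0]])
v, y, history = rayleigh_quotient_iteration(A, [1.0, 1.0, 0.0])
print(history)
```

Printing `history` shows how few steps are needed; the number of correct digits should grow roughly threefold per step, in line with the third-order convergence discussed above.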

4.4 Calculating Multiple Eigenvalues

Having derived methods for finding particular eigenvalues, we now consider how to calculate several of them at the same time. In this discussion it is assumed A is an invertible symmetric n × n matrix, so Theorem 4.1 applies. It is also assumed that if λ is an eigenvalue, then −λ is not an eigenvalue.

Suppose the power method has been used to calculate x1, which is an eigenvector for the dominant eigenvalue. The question arises if it is possible to calculate x2 in a similar way. It is, and to explain how, consider the example used to introduce the power method in Section 4.2. We begin, as before, and guess a nonzero starting vector y0. Also, as in (4.12), it is possible to write y0 = α1 x1 + α2 x2, where, because the eigenvectors are orthogonal,

$$ \alpha_1 = \frac{x_1 \bullet y_0}{x_1 \bullet x_1} \quad \text{and} \quad \alpha_2 = \frac{x_2 \bullet y_0}{x_2 \bullet x_2}. $$

If we use the power method without modification, then we will end up finding x1. The modification you can make to prevent this is to remove the x1 contribution to y0, and use the revised guess w0 = y0 − α1 x1. Because of the inaccuracies of floating-point arithmetic, to prevent x1 from sneaking back into the solution, a similar subtraction is made in each step of the power method.
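The idea of subtracting the x1 contribution at every step can be sketched as follows. This is an illustration only: the function name `deflated_power_method` is hypothetical, the stopping test is a relative iterative error, and the 3 × 3 symmetric matrix (4.21) is used as the test case (an assumption; the text works with the 2 × 2 example from Section 4.2). For that matrix, x1 = (1, 0, −1)^T is an eigenvector for the dominant eigenvalue 6.

```python
import numpy as np


def deflated_power_method(A, x1, y0, tol=1e-12, max_iter=500):
    """Power method with the x1 component removed at each step (sketch)."""
    A = np.asarray(A, dtype=float)
    x1 = np.asarray(x1, dtype=float)

    def remove_x1(w):
        # subtract alpha1 * x1, where alpha1 = (x1 . w)/(x1 . x1)
        return w - (x1 @ w) / (x1 @ x1) * x1

    w = remove_x1(np.asarray(y0, dtype=float))
    w = w / np.linalg.norm(w)
    v = 0.0
    for _ in range(max_iter):
        z = remove_x1(A @ w)          # power step, then remove x1 again
        v_old, v = v, w @ z           # Rayleigh quotient (w has unit length)
        w = z / np.linalg.norm(z)
        if abs(v - v_old) <= tol * abs(v):
            break
    return v, w


A = np.array([[4.0, 1.0, -2.0],
              [1.0, 3.0, 1.0],
              [-2.0, 1.0, 4.0]])
x1 = np.array([1.0, 0.0, -1.0])       # eigenvector for the dominant eigenvalue 6
lam2, x2 = deflated_power_method(A, x1, y0=[1.0, 0.0, 0.0])
print(lam2, x2)
```

Because the x1 component is removed each time, the iteration should settle on the second eigenvalue, which is 4 for this matrix, rather than on the dominant one.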

4.4.1 Orthogonal Iteration

In principle, the method described above to find λ2 can be generalized to compute all of the eigenvalues, one at a time. However, it is possible to modify it slightly and calculate all, or just a few, of the eigenvalues simultaneously. The resulting method, which is called orthogonal iteration, is described below.

To use our earlier example, which is given in (4.11), to find λ1 and λ2 we need two start-off guesses: y0 and w0. We do not yet know the eigenvector x1, so we are not able to remove its contribution from w0. What we do know is that x1 and x2 are orthogonal, and we will require this for our start-off guesses. This means we need to derive two orthogonal vectors from y0 and w0. This can be done using the Gram-Schmidt procedure. This is described in more detail later, but for our example it gives us two orthogonal vectors e1 and e2, given as:

$$ e_1 = y_0, \qquad e_2 = w_0 - \frac{w_0 \bullet e_1}{e_1 \bullet e_1}\, e_1. \tag{4.23} $$

In the power method we normalized the iteration vectors, and we will do the same here. In other words, we do the following: q1 = e1/||e1||_2, q2 = e2/||e2||_2. Now, individually, the power method would calculate y1 = Aq1 and w1 = Aq2. It is much better, from a computational point of view, to first form the matrix Q0 = (q1 q2). This is just the matrix with column vectors q1 and q2. Then, the multiplication by A can be written as

$$ B_1 = A Q_0, \tag{4.24} $$

where B1 = (y1 w1) is the matrix with column vectors y1 and w1. Now Gram-Schmidt must be applied to y1 and w1 to find Q1. Once this is done, then B2 = AQ1. The other steps in the iteration are computed in a similar manner, and the procedure we have derived is known as orthogonal iteration. The algorithm is outlined in Table 4.8.

Pick:   random n × m B0
        Q0 = GS(B0)
Loop:   for k = 1, 2, 3, · · ·
            Bk = AQk−1
            Qk = GS(Bk)
        end

Table 4.8 Orthogonal iteration for finding m independent eigenvectors of an invertible symmetric matrix A. Note Qk = GS(Bk) means that Gram-Schmidt is applied to the columns of Bk to produce Qk.

As the method proceeds, the columns of Qk serve as approximations for linearly independent eigenvectors for A, and the corresponding eigenvalues can be determined using Rayleigh's quotient (4.4). Note that given the i and i + 1 column vectors of Qk, the associated eigenvalues satisfy |λi+1| ≤ |λi|. In the examples to follow the iteration is stopped once the relative iterative error for each of the m eigenvalues being computed satisfies a given error tolerance.

It's important to point out that it's not required that B0 have n column vectors. It can have m columns, where 1 ≤ m ≤ n. Therefore, orthogonal iteration provides a way to compute some, or all, of the eigenvectors of a symmetric matrix.

Example 4.10 To use orthogonal iteration to calculate the eigenvalues of

$$ A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, $$

take as a starting matrix

$$ B_0 = \begin{pmatrix} -1 & 2 \\ 3 & -1 \end{pmatrix}. $$

To apply Gram-Schmidt to the columns of this matrix, we use (4.23) and set

$$ e_1 = \begin{pmatrix} -1 \\ 3 \end{pmatrix} \quad \text{and} \quad e_2 = \begin{pmatrix} 2 \\ -1 \end{pmatrix} + \frac{1}{2} \begin{pmatrix} -1 \\ 3 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 3 \\ 1 \end{pmatrix}. $$

Normalizing these vectors we obtain the orthogonal matrix Q0 = (q1 q2), where

$$ q_1 = \frac{1}{\sqrt{10}} \begin{pmatrix} -1 \\ 3 \end{pmatrix} \quad \text{and} \quad q_2 = \frac{1}{\sqrt{10}} \begin{pmatrix} 3 \\ 1 \end{pmatrix}. $$

The corresponding approximations for the eigenvalues are

$$ \lambda_1 \approx q_1 \bullet A q_1 = \frac{7}{5} = 1.4, \quad \text{and} \quad \lambda_2 \approx q_2 \bullet A q_2 = \frac{13}{5} = 2.6. $$

The next step is to compute B1 = AQ0, and then repeat what was done above. These steps are calculated using MATLAB, with the results given in Table 4.9. As expected, once the answer gets close to the solution, the error for λ1 drops by a factor of about |λ2/λ1|^2 = 1/9 with each step.

  k     λ1             λ2
  0     1.40000000     2.60000000
  1     2.38461538     1.61538462
  2     2.90588235     1.09411765
  3     2.98908595     1.01091405
  4     2.99878142     1.00121858
  5     2.99986453     1.00013547
  6     2.99998495     1.00001505
  7     2.99999833     1.00000167
  8     2.99999981     1.00000019

Table 4.9 Calculation of the eigenvalues λ1 = 3 and λ2 = 1 in Example 4.10 using orthogonal iteration.

Example 4.11 The distinct eigenvalues of the n × n matrix in (4.20) are a + 1, a, and a − 1 (for n ≥ 3). Specifically, λ1 = a + 1, λ2 = λ3 = · · · = λn−1 = a, and λn = a − 1. In this example, B0 is taken to be n × n. So, taking a = 3, the first column of Qk converges to an eigenvector for λ1 = 4, the next n − 2 columns converge to orthonormal eigenvectors for the eigenvalue λ = 3, and the last column converges to an eigenvector for λn = 2. The resulting values for the eigenvalues computed using the Rayleigh quotient, in the case when n = 10, are given in Table 4.10.


  k     λ1             λ2             λ7             λ10
  1     3.31164154     2.90695156     2.81400409     3.39057203
  2     3.44905315     2.96204417     2.88453173     3.19895248
  3     3.59266711     2.98801812     2.94561856     2.97968970
  4     3.72148534     2.99848788     2.97968077     2.72727677
  5     3.82167988     3.00171537     2.99433240     2.46954966
  ...   ...            ...            ...            ...
 21     3.99997821     3.00000092     3.00000149     2.00000168
 22     3.99998774     3.00000052     3.00000084     2.00000075
 23     3.99999311     3.00000029     3.00000048     2.00000033
 24     3.99999612     3.00000016     3.00000027     2.00000015

Table 4.10 Calculation of the eigenvalues in Example 4.11, using orthogonal iteration. The exact values are λ1 = 4, λ2 = λ7 = 3, and λ10 = 2.

For orthogonal iteration to work, it is required that the eigenvalues being computed do not contain a ± pair. If it is suspected that this happens, you can simply use shifting to obtain a matrix that doesn’t have this problem, and then shift back afterwards. Also, as with the power method, using a random B0 means that the method almost always converges. This is because this ensures that the columns of B0 are (almost always) independent, and that all the eigenvectors for A contribute to each column of B0 .
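Orthogonal iteration, as outlined in Table 4.8, can be sketched in a few lines of NumPy. One substitution should be noted: the sketch calls `np.linalg.qr` in place of Gram-Schmidt to orthonormalize the columns. That routine may flip the sign of a column relative to Gram-Schmidt, which does not change the Rayleigh quotients. The function name and the fixed number of steps are assumptions made for this illustration.

```python
import numpy as np


def orthogonal_iteration(A, B0, steps):
    """Orthogonal iteration; returns Q and the Rayleigh quotients at each step."""
    A = np.asarray(A, dtype=float)
    Q, _ = np.linalg.qr(np.asarray(B0, dtype=float))   # Q0 = GS(B0), up to signs
    history = [np.diag(Q.T @ A @ Q).copy()]            # q_i . A q_i for each column
    for _ in range(steps):
        B = A @ Q                                      # Bk = A Q_{k-1}
        Q, _ = np.linalg.qr(B)                         # Qk = GS(Bk), up to signs
        history.append(np.diag(Q.T @ A @ Q).copy())
    return Q, history


# Data from Example 4.10
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
B0 = np.array([[-1.0, 2.0],
               [3.0, -1.0]])
Q, history = orthogonal_iteration(A, B0, steps=8)
for k, lam in enumerate(history):
    print(k, lam)
```

Run on the data of Example 4.10, the printed values should track the rows of Table 4.9.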

4.4.2 Regular and Modified Gram-Schmidt

Orthogonal iteration, as given in Table 4.8, uses the Gram-Schmidt process. To write down this procedure, let c1, c2, · · · , cm be the column vectors from B, and assume that they are independent. The first step is to construct m orthogonal vectors from the ci's, and this is done as follows:

$$
\begin{aligned}
e_1 &= c_1 \\
e_2 &= c_2 - \frac{c_2 \bullet e_1}{e_1 \bullet e_1}\, e_1 \\
e_3 &= c_3 - \frac{c_3 \bullet e_1}{e_1 \bullet e_1}\, e_1 - \frac{c_3 \bullet e_2}{e_2 \bullet e_2}\, e_2 \\
&\;\;\vdots \\
e_m &= c_m - \sum_{i=1}^{m-1} \frac{c_m \bullet e_i}{e_i \bullet e_i}\, e_i .
\end{aligned}
\tag{4.25}
$$

The vectors are now normalized by setting qi = ei/||ei||_2, and from this we have that GS(B) = Q, where Q = (q1 q2 · · · qm).

The formulas in (4.25) are referred to as regular Gram-Schmidt, and they are what is usually given in a linear algebra course. In terms of computing, however, the procedure can be sensitive to round-off error for larger values of n. To explain, due to round-off, e2 will not be exactly perpendicular to e1. Since e2 appears in all of the subsequent formulas, this error affects the orthogonality for the remaining ej's. The same can be said for the consequences of round-off error made with every ej. An alternative is to use what is known as modified Gram-Schmidt, which reduces, but does not eliminate, the sensitivity. To explain how this is done, regular Gram-Schmidt calculates ej using the following steps (for j > 1):

    Set ej = cj
    For k = 1, 2, · · · , j − 1
        r = (cj • ek)/(ek • ek)
        ej = ej − r ek
    End

For the modified Gram-Schmidt method, one instead does the following:

    Set ej = cj
    For k = 1, 2, · · · , j − 1
        r = (ej • ek)/(ek • ek)
        ej = ej − r ek
    End

Note that when using exact arithmetic, these two procedures produce the same result. However, in the regular version, orthogonality is obtained by removing the parts of cj coming from the earlier computed ek's. In the modified version, ej is made so it is orthogonal to the earlier computed ek's.

Example 4.12 It is of interest to see just what sort of improvement is obtained when using modified Gram-Schmidt, and some comparisons are given in Table 4.11. To explain how the error is computed, given an n × n matrix B, Gram-Schmidt is applied to its columns to compute Q. If exact arithmetic is used then Q is an orthogonal matrix (these are discussed in more depth in the next section). This means that Q^T Q = I. So, as a check on the accuracy one can check on how close the value of ||Q^T Q − I||_2 gets to zero. This measure of the error is what is used in Table 4.11.

Two matrices were tried, one obtained using random numbers while the second was the Vandermonde matrix. The conclusion drawn from this is that the modified Gram-Schmidt is more accurate than the regular version. Also, on the badly conditioned Vandermonde matrix, both fail but the modified Gram-Schmidt procedure is able to obtain a reasonably accurate result for somewhat larger values of the condition number.

The loss of orthogonality can cause orthogonal iteration to fail or, even worse, to work but produce very inaccurate eigenvalues. Consequently, the regular version of Gram-Schmidt should be avoided. Shifting can also improve the accuracy of Gram-Schmidt, either version, which can be used to check on the accuracy of the unshifted values. There has been research into how to improve on the modified Gram-Schmidt method, and an example of what has been done can be found in Giraud et al. [2005].

 Matrix     n        RGS Error     MGS Error     κ∞(B)
 Random     2        4.31e−16      4.31e−16      3.23e+00
            4        1.51e−14      6.78e−15      2.05e+02
            8        3.52e−15      9.26e−16      1.32e+02
            32       1.92e−13      1.36e−14      9.98e+02
            128      3.66e−12      1.17e−13      2.29e+04
            512      4.71e−11      2.30e−13      6.94e+04
            1024     5.05e−11      5.24e−13      1.75e+05
 Vander     2        6.57e−16      6.57e−16      8.00e+00
            4        2.92e−13      2.94e−14      1.55e+03
            8        8.31e−03      2.48e−09      4.52e+08
            12       3.98e+00      2.44e−03      1.06e+15
            16       7.26e+00      1.00e+00      1.19e+18

Table 4.11 The accuracy of Q = GS(B) for the regular Gram-Schmidt (RGS) and modified Gram-Schmidt (MGS) procedures on a randomly generated n × n matrix as well as on the n × n Vandermonde matrix. The error is determined by the value of ||Q^T Q − I||_2.
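The difference between the two versions can be seen on a very small problem. The sketch below implements both loops with a single flag, and applies them to a 4 × 3 test matrix with nearly dependent columns. That matrix, the value eps = 1e-8, and the function names are assumptions chosen for this illustration; they are not the matrices used for Table 4.11.

```python
import numpy as np


def gram_schmidt(B, modified=True):
    """Orthonormalize the columns of B by regular or modified Gram-Schmidt."""
    B = np.asarray(B, dtype=float)
    n, m = B.shape
    E = np.zeros((n, m))
    for j in range(m):
        e = B[:, j].copy()
        for k in range(j):
            ek = E[:, k]
            # regular: project the original column c_j; modified: project e_j
            source = e if modified else B[:, j]
            r = (source @ ek) / (ek @ ek)
            e = e - r * ek
        E[:, j] = e
    return E / np.linalg.norm(E, axis=0)     # normalize each column


def orthogonality_error(Q):
    """The error measure ||Q^T Q - I||_2 used in Table 4.11."""
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]), 2)


eps = 1e-8                                   # assumed value; eps**2 is below round-off
B = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])
rgs_err = orthogonality_error(gram_schmidt(B, modified=False))
mgs_err = orthogonality_error(gram_schmidt(B, modified=True))
print(rgs_err, mgs_err)
```

On this matrix the regular version should lose orthogonality almost completely, while the modified version keeps the error near the size of eps, which mirrors the trend in the Vandermonde rows of Table 4.11.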


4.4.3 QR Factorization

It is worth reconsidering the relationship between B and Q in orthogonal iteration (see Table 4.8). It's assumed in this discussion that B is an n × n matrix with independent column vectors. Although it was not stated earlier, it is possible to write

$$ B = QR, \tag{4.26} $$

where R is an upper triangular matrix. As an example, suppose

$$ B = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}. $$

Applying Gram-Schmidt to the columns of this matrix, one finds that

$$ Q = \frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 2 \\ 2 & -1 \end{pmatrix}. $$

To find R, we solve (4.26) to obtain

$$ R = Q^{-1} B = \frac{1}{\sqrt{5}} \begin{pmatrix} 5 & 4 \\ 0 & 3 \end{pmatrix}. \tag{4.27} $$

The verification that R is, in general, upper triangular is easy to show using the triangular form of the equations involved with Gram-Schmidt, as given in (4.25).

Nothing was said about R earlier because it was not needed in the derivation of the orthogonal iteration method. However, the formula in (4.26) is significant, and it is an example of what is called a QR factorization. It is reminiscent of the LU factorization used to solve a matrix equation. However, the matrix Q is not necessarily lower triangular, and instead it has the property of being an orthogonal matrix. To make this clear, we have the following definition.

Definition of an Orthogonal Matrix. An n × n matrix Q is orthogonal if, and only if, Q^{-1} = Q^T.

A summary of the more important properties of an orthogonal matrix is given in Appendix B (page 535). The one property of immediate use is the fact that a square matrix is an orthogonal matrix if, and only if, its columns are orthonormal vectors. Since this is what the Gram-Schmidt procedure produces, then by construction Q is orthogonal. So, once Q is determined, then R is computed using the formula R = Q^T B.


  n        LU (sec)     QR      SVD
  200      0.0003       2.0     15.1
  400      0.0011       2.0     18.0
  600      0.0021       2.1     18.2
  800      0.0037       2.3     20.5
  1000     0.0061       2.3     20.4
  2000     0.0280       3.2     35.2
  4000     0.1585       3.3     35.6

Table 4.12 Computing time for a LU, QR, and SVD factorization of a random n × n matrix. The times for QR and SVD are in terms of multiples of the time the LU takes for that value of n.

This follows, because if B = QR, then R = Q−1 B = QT B. It is possible to find those who recommend using a QR factorization to solve Ax = b rather than using a LU factorization. There are reasons both for and against this idea, and these are explored in Exercise 8.28. One of the reasons used to argue against doing this is that QR requires approximately four times the flops that LU takes. To investigate this, the computing times are given in Table 4.12 using MATLAB’s commands for these factorizations. Included in this comparison is another factorization known as the SVD, which is derived in Section 4.6. The fact that QR is computed faster than the expected factor of four is due to the implementation taking advantage of the hardware available, including memory hierarchy and multiple cores. This also means taking maximum advantage of the BLAS routines mentioned in Section 1.6. The result is that if you are factoring a 1000 × 1000 matrix, you must wait an additional 8 msec to obtain a QR compared to the time needed to obtain a LU. Finally, there are more computationally efficient methods for determining Q than Gram-Schmidt, and this is discussed in more detail in Section 8.4.
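The 2 × 2 factorization in (4.26) and (4.27) can be reproduced with a library routine. The sketch below uses `np.linalg.qr`, which is not Gram-Schmidt and may return R with negative diagonal entries. Flipping signs so that the diagonal of R is positive gives the same factors that Gram-Schmidt produces for a matrix with independent columns. The helper name `qr_positive` is hypothetical.

```python
import numpy as np


def qr_positive(B):
    """QR factorization normalized so that diag(R) > 0 (hypothetical helper)."""
    Q, R = np.linalg.qr(np.asarray(B, dtype=float))
    signs = np.sign(np.diag(R))          # +1 or -1 for independent columns
    Q = Q * signs                        # flip the matching columns of Q
    R = signs[:, None] * R               # flip the matching rows of R
    return Q, R


B = np.array([[1.0, 2.0],
              [2.0, 1.0]])
Q, R = qr_positive(B)

print(Q * np.sqrt(5.0))                  # compare with the matrix Q in the text
print(R * np.sqrt(5.0))                  # compare with (4.27)
print(np.allclose(Q.T @ B, R))           # R = Q^T B because Q is orthogonal
```

The last line checks the formula R = Q^T B directly, which holds for any orthogonal Q satisfying B = QR.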

4.4.4 The QR Method

When it is necessary to compute all of the eigenvalues of a symmetric matrix, the following procedure can be used:

    Set C0 = A
    For k = 0, 1, 2, · · ·
        Qk Rk = Ck          % find Qk and Rk
        Ck+1 = Rk Qk        % calculate Ck+1            (4.28)
    End

This is a strange looking procedure in that it states that after factoring Ck, you then multiply the factors in reverse order to calculate Ck+1. If it converges, the Ck's approach a diagonal matrix, and the diagonals are the eigenvalues of A. The eigenvalues will appear on the diagonal according to how many independent eigenvectors they have (i.e., their geometric multiplicity). To guarantee convergence, it is required that for any nonzero eigenvalue λ, −λ is not also an eigenvalue. This procedure is called the QR method. A proof of convergence, including the extension of the method to non-symmetric matrices, can be found in Golub and Van Loan [2013]. However, some insight into how the method works is given in Exercise 4.63.

Example 4.13 Consider the eigenvalue problem in (4.5), where

$$ A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}. $$

So, C0 = A and applying Gram-Schmidt, as given in (4.25), we get

$$ e_1 = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \quad \text{and} \quad e_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix} - \frac{4}{5} \begin{pmatrix} 2 \\ 1 \end{pmatrix} = \frac{3}{5} \begin{pmatrix} -1 \\ 2 \end{pmatrix}. $$

Normalizing these vectors we obtain

$$ Q_0 = \frac{1}{\sqrt{5}} \begin{pmatrix} 2 & -1 \\ 1 & 2 \end{pmatrix}. $$

From this we have that

$$ R_0 = Q_0^T C_0 = \frac{1}{\sqrt{5}} \begin{pmatrix} 5 & 4 \\ 0 & 3 \end{pmatrix}, $$

and

$$ C_1 = R_0 Q_0 = \frac{1}{5} \begin{pmatrix} 14 & 3 \\ 3 & 6 \end{pmatrix}. $$

The remaining Ck matrices are calculated using MATLAB, with the result that:

$$
\begin{aligned}
C_2 &= \begin{pmatrix} 2.9756 & 0.21951 \\ 0.21951 & 1.0244 \end{pmatrix}, &
C_3 &= \begin{pmatrix} 2.9973 & 7\mathrm{e}{-02} \\ 7\mathrm{e}{-02} & 1.0027 \end{pmatrix}, \\
C_4 &= \begin{pmatrix} 2.9997 & 2\mathrm{e}{-02} \\ 2\mathrm{e}{-02} & 1.0003 \end{pmatrix}, &
C_5 &= \begin{pmatrix} 3.0000 & 8\mathrm{e}{-03} \\ 8\mathrm{e}{-03} & 1.0000 \end{pmatrix}, \\
C_6 &= \begin{pmatrix} 3.0000 & 3\mathrm{e}{-03} \\ 3\mathrm{e}{-03} & 1.0000 \end{pmatrix}, &
C_7 &= \begin{pmatrix} 3.0000 & 9\mathrm{e}{-04} \\ 9\mathrm{e}{-04} & 1.0000 \end{pmatrix}, \\
C_8 &= \begin{pmatrix} 3.0000 & 3\mathrm{e}{-04} \\ 3\mathrm{e}{-04} & 1.0000 \end{pmatrix}, &
C_9 &= \begin{pmatrix} 3.0000 & 1\mathrm{e}{-04} \\ 1\mathrm{e}{-04} & 1.0000 \end{pmatrix}.
\end{aligned}
$$

It is seen that the off-diagonal entries are converging to zero, while the diagonal entries are approaching the eigenvalues λ1 = 3 and λ2 = 1. A more expansive list of the computed values for the two eigenvalues is given in Table 4.13. As with orthogonal iteration, once the answer gets close to the solution, the error in λ1 = 3 drops by a factor of about |λ2/λ1|^2 = 1/9 with each iteration step.
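The factor-and-remultiply loop in (4.28) is short enough to sketch directly. The version below uses `np.linalg.qr` for the factorization rather than Gram-Schmidt; this can change the sign of off-diagonal entries of Ck, but the diagonal entries are unaffected. The function name and the fixed step count are assumptions for this illustration.

```python
import numpy as np


def qr_method(A, steps):
    """Basic (unshifted) QR method; returns the list C1, C2, ..., C_steps."""
    C = np.asarray(A, dtype=float).copy()    # C0 = A
    history = []
    for _ in range(steps):
        Q, R = np.linalg.qr(C)               # factor Ck = Qk Rk
        C = R @ Q                            # multiply in reverse order
        history.append(C.copy())
    return history


A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
history = qr_method(A, steps=9)
for k, C in enumerate(history, start=1):
    print(k, np.diag(C))                     # approximations for the eigenvalues
```

For the matrix of Example 4.13 the printed diagonals should match C1 through C9 above, and the trace of each Ck stays equal to 4 because Ck+1 = Qk^T Ck Qk is a similarity transformation.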

  k     λ1                   λ2
  1     2.8                  1.2
  2     2.97560975609756     1.02439024390244
  3     2.9972602739726      1.0027397260274
  4     2.99969521487351     1.00030478512649
  5     2.99996613039797     1.00003386960203
  6     2.99999623665423     1.00000376334577
  7     2.99999958184977     1.00000041815023
  8     2.99999995353885     1.00000004646114
  9     2.99999999483765     1.00000000516235

Table 4.13 Calculation of the eigenvalues of (4.5) using the QR method. The exact values are λ1 = 3 and λ2 = 1.

4.4.5 QR Method versus Orthogonal Iteration

It is worth comparing the QR and orthogonal iteration methods. Both require finding Q, and the QR method also requires finding R. As is evident in comparing Tables 4.9 and 4.13, they have a similar rate of convergence. There are, however, some significant differences. First, for the QR method the Ck's converge to a matrix containing the eigenvalues, while the Bk's for orthogonal iteration contain the eigenvectors. For QR, once the eigenvalues are computed then the associated eigenvectors can be computed easily using inverse iteration (Section 4.3.1). For orthogonal iteration, once the eigenvectors are known, Rayleigh's quotient (4.4) can be used to compute the eigenvalues.

Other differences include the observation that orthogonal iteration can be used to compute a few or all of the eigenvalues, while the QR method computes all of them. Also, the QR method is required to start with C0 = A (or a variation on this expression), while orthogonal iteration can be started with any matrix with m independent columns, where 1 ≤ m ≤ n.

One property that the QR method and orthogonal iteration have in common is that they are computationally intensive. In both algorithms, to calculate all of the eigenvalues of the matrix, each step requires O(n^3) flops, and the improvement in the eigenvalue approximations at each step depends on ratios of the form |λi/λj|^2, where |λi| < |λj|. Consequently, if two of the eigenvalues are close together, it can take a large number of iterations to compute them accurately. This has given rise to the development of more efficient implementations for the QR method than the simple factor and remultiply version given earlier. The one most often used in practice uses what is called the implicitly shifted QR method, which is also known as a bulge-chasing method. A nice explanation of this can be found in Watkins [2008]. Another method that is used for larger symmetric matrices, and potentially faster than QR, involves a divide and conquer procedure. Implementing this idea efficiently is challenging and more can be learned about this in Demmel [1997] and Nakatsukasa and Higham [2013].

As a final comment, the QR method has been stated to be one of the “10 algorithms with the greatest influence on the development and practice of science and engineering in the 20th century” [Dongarra and Sullivan, 2000]. What is interesting is that John Francis, who has been co-credited for deriving and then naming it the QR method, was completely unaware of, and amazed by, how significant his work had become. He missed this because shortly after deriving the method he went to work designing and building industrial computer systems. As he recalled, “there had been no reaction, none whatsoever” when his papers appeared [Golub and Uhlig, 2009].

4.4.6 Are the Computed Values Correct?

For many real-world applications it is often not known whether the matrix has the required properties for the eigenvalue solver to work, and in such cases one usually simply tries it and sees what happens. What is considered here are simple tests that can be used to check on the correctness, or accuracy, of the computed values.

The first test will involve the trace of the matrix, which is defined next.

Definition of the Trace. If A is an n × n matrix with entries aij, then the trace tr(A) is defined as

$$ \operatorname{tr}(A) \equiv \sum_{i=1}^{n} a_{ii}. $$

In other words, the trace is the sum of the diagonal entries of the matrix. The second piece of information we need connects the trace with the eigenvalues of the matrix.

Theorem 4.4 If A is an n × n matrix, with eigenvalues λ1, λ2, · · · , λn (listed according to their algebraic multiplicities), then

$$ \operatorname{tr}(A^k) = \sum_{i=1}^{n} \lambda_i^k, \tag{4.29} $$

for any positive integer k.

To demonstrate how (4.29) is used, we consider a few examples using k = 1.

Example 4.14
a) The eigenvalues of

$$ A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} $$

are λ1 = 3 and λ2 = 1. Clearly, tr(A) = λ1 + λ2.

b) The matrix

$$ A = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix} $$

has the one eigenvalue λ1 = 3, but it has two linearly independent eigenvectors. Because of this, it has an algebraic multiplicity of two, and so the sum to be considered is λ1 + λ1 = 6, which clearly equals tr(A).

c) The (non-symmetric) matrix

$$ A = \begin{pmatrix} 1 & 2 \\ -\tfrac{1}{2} & 1 \end{pmatrix} $$

has eigenvalues λ1 = 1 + i and λ2 = 1 − i. It is easy to verify that tr(A) = λ1 + λ2.


The above theorem is useful when computing all of the eigenvalues of a matrix. Once these methods are finished, then the values can be checked by comparing their sum with the value for the trace of the matrix (which is very easy to compute). In Section 4.5.2, this test will be used to help determine which of two answers is correct.
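The trace test is easy to automate. The sketch below compares tr(A^k) with the sum of the k-th powers of a list of computed eigenvalues, as in (4.29). The function name `trace_check` is hypothetical, and the matrices are the ones from Example 4.14.

```python
import numpy as np


def trace_check(A, eigenvalues, k=1):
    """Return |tr(A^k) - sum(lambda_i^k)|, which should be near zero."""
    A = np.asarray(A, dtype=float)
    lhs = np.trace(np.linalg.matrix_power(A, k))
    rhs = np.sum(np.asarray(eigenvalues) ** k)
    return abs(lhs - rhs)


A_a = [[2.0, 1.0], [1.0, 2.0]]
A_b = [[3.0, 0.0], [0.0, 3.0]]
A_c = [[1.0, 2.0], [-0.5, 1.0]]

print(trace_check(A_a, [3, 1]))                  # part a)
print(trace_check(A_b, [3, 3]))                  # part b), multiplicity two
print(trace_check(A_c, [1 + 1j, 1 - 1j]))        # part c), complex pair
print(trace_check(A_c, [1 + 1j, 1 - 1j], k=2))   # the k = 2 version of (4.29)
print(trace_check(A_a, [3, 2]))                  # a wrong value is flagged
```

A nonzero result signals that at least one computed eigenvalue is wrong or missing. A zero result does not prove correctness, since errors can cancel in the sum, so the test is best used together with other checks.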

Gershgorin Intervals

Another useful method uses what are called Gershgorin circles. This result is going to be stated for symmetric matrices, so it is more appropriate to refer to them as Gershgorin intervals. The reason is that symmetric matrices only have real-valued eigenvalues. To state the result, we introduce the row sums for a matrix A:

$$ r_i = \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \tag{4.30} $$

and the associated Gershgorin intervals Ii = [aii − ri, aii + ri]. So, Ii is the closed interval, centered at the diagonal value of the ith row, with radius equal to the sum of the remaining row entries (in absolute value).

Gershgorin Theorem. Assuming A is a symmetric n × n matrix, then:
(1) Every eigenvalue of A is in at least one of its Gershgorin intervals.
(2) Suppose the intervals can be separated into two disjoint groups, one containing p intervals and the other containing n − p intervals. The union of the group with p intervals will then contain p eigenvalues, and the union of the second group will contain n − p eigenvalues. The eigenvalues in this case are counted according to their algebraic multiplicity.
(3) If an eigenvalue is an endpoint of one of the intervals, and if A is irreducible, then the eigenvalue is an endpoint of all the intervals.

The definition of irreducible is given in Section 4.5.3.1. Statement (3), which is known as Taussky's theorem, is included so all of the Gershgorin interval results are in one place. The proof of the above theorem can be found in Süli and Mayers [2003] and Varga [2004].

One particularly useful consequence of Statement (2) in the above theorem is that if there is an interval Ii that is disjoint from the other n − 1 intervals, then it must contain exactly one eigenvalue. Moreover, it has a geometric multiplicity of one.


Example 4.15

a) The Gershgorin intervals for the following matrix are

$$A = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 7 & 2 \\ 0 & 2 & 12 \end{pmatrix} \quad\Rightarrow\quad I_1 = [1, 3], \quad I_2 = [4, 10], \quad I_3 = [10, 14].$$

Since these intervals do not overlap, apart from $I_2$ and $I_3$ sharing the endpoint 10, there must be an eigenvalue in each one. Moreover, since the intervals contain only positive values, we can conclude that the matrix is positive definite.

b) The Gershgorin intervals for the following matrix are

$$A = \begin{pmatrix} -2 & 1 & -2 \\ 1 & 7 & 2 \\ -2 & 2 & 12 \end{pmatrix} \quad\Rightarrow\quad I_1 = [-5, 1], \quad I_2 = [4, 10], \quad I_3 = [8, 16].$$

Since $I_1$ is disjoint from the other two intervals, it contains an eigenvalue, while the other two eigenvalues are in $I_2 \cup I_3$. Also, λ = 2 is not an eigenvalue since it is not in any of the intervals.
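The intervals in Example 4.15 are easy to generate by machine. The following is a minimal sketch in plain Python, assuming the matrix is stored as a list of rows; the function names are illustrative and are not taken from the text.

```python
def gershgorin_intervals(A):
    """Return the intervals [a_ii - r_i, a_ii + r_i], with r_i as in (4.30)."""
    intervals = []
    for i, row in enumerate(A):
        # r_i is the sum of the absolute values of the off-diagonal entries in row i
        r = sum(abs(a) for j, a in enumerate(row) if j != i)
        intervals.append((row[i] - r, row[i] + r))
    return intervals


def in_some_interval(x, intervals):
    """True if x lies in at least one of the closed intervals."""
    return any(lo <= x <= hi for lo, hi in intervals)


A1 = [[2, 1, 0], [1, 7, 2], [0, 2, 12]]        # matrix in Example 4.15(a)
A2 = [[-2, 1, -2], [1, 7, 2], [-2, 2, 12]]     # matrix in Example 4.15(b)

print(gershgorin_intervals(A1))   # [(1, 3), (4, 10), (10, 14)]
print(gershgorin_intervals(A2))   # [(-5, 1), (4, 10), (8, 16)]
print(in_some_interval(2, gershgorin_intervals(A2)))   # False
```

The last line reproduces the observation that λ = 2 cannot be an eigenvalue of the matrix in part (b).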

4.5 Applications

Examples are presented below that illustrate how eigenvalue problems arise in applications. They are also used to extend the methods we have developed, as well as to introduce some new topics in numerical linear algebra.

4.5.1 Natural Frequencies

Mechanical systems such as an elastic string, or the chain of masses and springs shown in Figure 4.1, possess natural frequencies, and they play an important role in determining the motion of the system. Mathematically, natural frequencies are the eigenvalues for the system. As an example, the natural frequencies for a guitar string consist of the fundamental along with multiples of the fundamental (what are called the higher harmonics). Because most of the energy is often contained in the lower frequencies, in applications it is typical to be interested not in finding all of the natural frequencies (i.e., eigenvalues) but only the lower ones. So, the first task is to describe the method that will be used to do this.


    Pick:  random n × m B
           Q = GS(B)
    Loop:  for k = 1, 2, 3, · · ·
               AB = Q        % find B
               Q = GS(B)
           end

Table 4.14 Inverse orthogonal iteration used to calculate the m eigenvectors of A which have eigenvalues closest to zero. Note Q = GS(B) means that Gram-Schmidt is applied to the columns of B to find Q.

Suppose we want to find a few of the eigenvalues that are closest to zero. This can be done using orthogonal iteration with $A^{-1}$. The algorithm, which comes from using $A^{-1}$ in the procedure in Table 4.8, is given in Table 4.14. To explain how it works, if you are interested in finding the first m eigenvalues, you take B to be an n × m matrix. As seen in Table 4.14, after computing Q, you then must solve AB = Q. This can be done a couple of ways. If you know $A^{-1}$, then $B = A^{-1}Q$. Or, if you have a LU factorization A = LU, then $Y = L^{-1}Q$, and $B = U^{-1}Y$. The stopping condition used in the example below is that the relative iterative error for each of the computed eigenvalues is less than a given tolerance tol.
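The procedure in Table 4.14 can be sketched in plain Python as follows. Several choices here are assumptions made for illustration rather than details taken from the text: the matrix is taken to be symmetric, AB = Q is solved by Gaussian elimination with partial pivoting (instead of a stored LU factorization), the eigenvalue estimates are the Rayleigh quotients of the columns of Q, and the test matrix is a small tridiagonal matrix whose eigenvalues are 2 − √2, 2, and 2 + √2.

```python
import math
import random


def gram_schmidt(B):
    """Q = GS(B): orthonormalize the columns of B (stored as a list of rows)."""
    n, m = len(B), len(B[0])
    q = []                                   # orthonormal columns found so far
    for j in range(m):
        v = [B[i][j] for i in range(n)]
        for u in q:                          # remove the components along earlier columns
            c = sum(ui * vi for ui, vi in zip(u, v))
            v = [vi - c * ui for ui, vi in zip(u, v)]
        nrm = math.sqrt(sum(vi * vi for vi in v))
        q.append([vi / nrm for vi in v])
    return [[q[j][i] for j in range(m)] for i in range(n)]


def solve(A, Q):
    """Solve AB = Q for B using Gaussian elimination with partial pivoting."""
    n, m = len(A), len(Q[0])
    M = [list(A[i]) + list(Q[i]) for i in range(n)]      # augmented matrix [A | Q]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + m):
                M[r][k] -= f * M[c][k]
    B = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):                          # back substitution
        for j in range(m):
            s = M[i][n + j] - sum(M[i][k] * B[k][j] for k in range(i + 1, n))
            B[i][j] = s / M[i][i]
    return B


def inverse_orthogonal_iteration(A, m, tol=1e-10, max_iter=10000, seed=0):
    """Estimate the m eigenvalues of a symmetric matrix A that are closest to zero."""
    n = len(A)
    rng = random.Random(seed)
    Q = gram_schmidt([[rng.random() for _ in range(m)] for _ in range(n)])
    old = [math.inf] * m
    for k in range(1, max_iter + 1):
        Q = gram_schmidt(solve(A, Q))
        # Rayleigh quotients q_j^T A q_j (an assumed choice of eigenvalue estimate)
        AQ = [[sum(A[i][l] * Q[l][j] for l in range(n)) for j in range(m)] for i in range(n)]
        lam = [sum(Q[i][j] * AQ[i][j] for i in range(n)) for j in range(m)]
        # stop when the relative iterative error of every estimate is below tol
        if all(abs(a - b) < tol * abs(a) for a, b in zip(lam, old)):
            break
        old = lam
    return lam, k


A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, iterations = inverse_orthogonal_iteration(A, 2)
print(lam, iterations)   # lam is close to [2 - sqrt(2), 2]
```

For a large matrix one would factor A once and reuse the factors in every pass through the loop, as described above. The sketch re-solves from scratch only to stay short.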

Natural Frequencies for a Chain of Oscillators

For the chain of oscillators shown in Figure 4.1, the natural frequencies are determined from the eigenvalue problem where A is given in (4.9). We are going to use inverse orthogonal iteration to find the five lowest eigenvalues of A. It is assumed that n = 200, and that sc = s. Also, the stopping condition for the procedure is that for each eigenvalue $|v_k - v_{k-1}|/|v_k| < tol$, where $tol = 10^{-7}$. The results are given in Table 4.15. At first glance there is nothing remarkable about these values. However, the method requires k = 1,546 to reach the stopping condition. This rather large number is needed because the eigenvalues are relatively close together, which causes our iterative solvers to converge slowly. This is also the reason that the relative error is not as small as the relative iterative error. The convergence is so slow that the computed value hardly changes from one iteration to the next, and the algorithm is fooled into thinking it has found the exact value. For larger chains the situation gets worse. For example, when n = 2,000, the procedure takes 1,869 steps and ends up with relative errors on the order of $10^{-4}$. To obtain a more accurate result you can either decrease tol or else use shifting.

    i        λi                 Relative Iterative Error    Relative Error
    n        1.000245565612     3.14e−09                    1.28e−06
    n−1      1.000978310051     3.94e−09                    1.22e−06
    n−2      1.002227417808     9.99e−08                    2.91e−05
    n−3      1.003890019139     4.13e−08                    1.73e−05
    n−4      1.006089940281     6.48e−08                    1.41e−05

Table 4.15 Five smallest eigenvalues of a chain of oscillators computed using inverse orthogonal iteration. The relative iterative error is $|v_k - v_{k-1}|/|v_k|$, while the relative error is $|v_k - \lambda_i|/\lambda_i$ where $\lambda_i$ is given in (4.10).

Figure 4.4 Examples of two graphs with five vertices.

4.5.2 Graphs and Adjacency Matrices

A graph, or network, consists of vertices and lines (or edges) connecting them. Two examples are shown in Figure 4.4. The lines in this case indicate the connections between the respective vertices. For example, the vertices could identify computers, with the lines identifying how they are connected. Other possibilities are that the vertices are cities and the lines are airline routes (see Exercise 4.39), or that the vertices are atoms and the lines indicate their bonds (see Exercise 4.40). Eigenvalues provide useful information about the properties of the network, and this is done using what is called an adjacency matrix. This matrix for the graph on the left in Figure 4.4 is

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{pmatrix}. \tag{4.31}$$


To explain how this is determined, since there are n vertices, the adjacency matrix is n × n. The entries of A are zero except when vertex i is connected to vertex j, in which case $a_{ij} = a_{ji} = 1$. For example, for the graph on the left in Figure 4.4, the third and fifth vertices are connected, so $a_{35} = a_{53} = 1$. Also, since the first and fifth vertices are not connected, $a_{15} = a_{51} = 0$. We are considering what is known as an undirected graph, which means that if $a_{ij} = 1$ then $a_{ji} = 1$. In other words, the adjacency matrix is symmetric.

The eigenvalues of the adjacency matrix play an important role in determining the properties of a graph. As an example, suppose the eigenvalues for the adjacency matrix are $\lambda_1, \lambda_2, \cdots, \lambda_n$. If $P_m$ is the number of paths through the graph that take m steps and end up back at the vertex they started at, it is possible to show that $P_m = \sum_{i=1}^{n} \lambda_i^m$. For the graph on the left in Figure 4.4, the paths consisting of two steps are 1 → 2 → 1, 2 → 1 → 2, 5 → 4 → 5, etc. Using the terminology of graph theory, these are called closed paths of length two. There are a total of eight closed paths of length two, so $P_2 = \lambda_1^2 + \cdots + \lambda_5^2 = 8$.

Example 4.16

Assume there are n vertices, with vertex 1 connected to vertex 2. For the others, vertex i is connected to vertex i + 1, except that vertex n is connected to vertex 3. Such a network for n = 5 is shown on the left in Figure 4.4, and the resulting adjacency matrix is given in (4.31). To test our eigenvalue solvers we will take n = 10. The eigenvalues computed using orthogonal iteration, as outlined in Table 4.8, are given by the vector $\lambda_L$:

$$\lambda_L = \begin{pmatrix} -1.9996 \\ -1.2435 \\ -0.97893 \\ -0.46691 \\ -1.2326\mathrm{e}{-33} \\ 9.8608\mathrm{e}{-33} \\ 0.46691 \\ 0.9872 \\ 1.2352 \\ 1.9996 \end{pmatrix}, \qquad \lambda_R = \begin{pmatrix} -2 \\ -1.4142 \\ -1.4142 \\ -1 \\ -4.4409\mathrm{e}{-16} \\ 4.4409\mathrm{e}{-16} \\ 1 \\ 1.4142 \\ 1.4142 \\ 2 \end{pmatrix}. \tag{4.32}$$

The stopping condition here is that the iterative error for each eigenvalue is less than $10^{-12}$. Although all seems fine, there is a potentially serious problem with this result. The first and last eigenvalues appear to be negatives of each other, and this is the one situation in which orthogonal iteration is likely to fail. It is possible to check on whether the computation is correct by using shifting. To check, orthogonal iteration is going to be applied to B = A − 3I. If the $\lambda_L$ values are correct then the eigenvalues of B should not have ± pairs. Doing this, and then shifting back, one gets the values given by $\lambda_R$ in (4.32). This shows that applying orthogonal iteration directly to A is even worse than we originally thought because this solution shows that every nonzero eigenvalue appears as a ± pair.

For those that are not convinced that $\lambda_R$ is correct, other shifts can be tried. However, there is another test that can be used. According to Theorem 4.4, and (4.31), $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i = 0$. For both λ vectors in (4.32), $\sum_{i=1}^{n} \lambda_i \approx 4 \times 10^{-16}$. Although this is not zero, it differs from the exact result by an amount within the resolution of double precision. However, taking k = 2, then $\lambda_L$ gives $\sum_{i=1}^{n} \lambda_i^2 = 13.44$, while $\lambda_R$ gives $\sum_{i=1}^{n} \lambda_i^2 = 18$. Since A contains only integer values, it must be that

$$\mathrm{tr}(A^2) = \sum_{i=1}^{n} \lambda_i^2$$

is an integer. Consequently, $\lambda_R$ is the accepted answer.
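The trace test used in Example 4.16 can be reproduced with a few lines of plain Python. This is a sketch only: the helper names are made up for the illustration, the adjacency matrix is built from the connection rule stated in the example, and the two eigenvalue lists are simply the rounded values printed in (4.32).

```python
def example_adjacency(n):
    """Adjacency matrix for Example 4.16: 1-2, i-(i+1) for i = 3, ..., n-1, and n-3."""
    A = [[0] * n for _ in range(n)]
    edges = [(1, 2)] + [(i, i + 1) for i in range(3, n)] + [(n, 3)]
    for i, j in edges:
        A[i - 1][j - 1] = A[j - 1][i - 1] = 1
    return A


def trace_of_power(A, m):
    """Return tr(A^m), the number of closed paths of length m."""
    n = len(A)
    P = [[int(i == j) for j in range(n)] for i in range(n)]   # identity matrix
    for _ in range(m):
        P = [[sum(P[i][k] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    return sum(P[i][i] for i in range(n))


# Values copied from (4.32)
lam_L = [-1.9996, -1.2435, -0.97893, -0.46691, -1.2326e-33,
         9.8608e-33, 0.46691, 0.9872, 1.2352, 1.9996]
lam_R = [-2, -1.4142, -1.4142, -1, -4.4409e-16,
         4.4409e-16, 1, 1.4142, 1.4142, 2]

print(trace_of_power(example_adjacency(5), 2))    # 8, the value of P_2 for (4.31)
print(trace_of_power(example_adjacency(10), 2))   # 18
print(sum(x * x for x in lam_L))                  # about 13.44
print(sum(x * x for x in lam_R))                  # about 18
```

Because the trace of A² is computed with integer arithmetic, it is exact, which is what makes it a reliable referee between the two candidate answers.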



4.5.3 Markov Chain

Suppose a car rental company has 3 locations A, B, and C. Over a week's time the number of cars at each location changes as indicated in Table 4.16. This information is expressed using the weighted directed graph in Figure 4.5. This is done as it makes it easier to visualize the movement through the network (i.e., between the rental locations).

To clear up some of the terminology being used here and in the previous application, a graph, as illustrated in Figure 4.4, involves bidirectional connections. An example is cities (the vertices) connected by roads (edges) where traffic can go in either direction. A directed graph involves vertices with edges that have a direction associated with them. With or without the numbers, Figure 4.5 is a directed graph. When the numbers are included, you have what is called a weighted directed graph.

    A                  B                  C
    1/5 stay at A      1/2 move to A      1/3 move to A
    1/5 move to B      1/2 stay at B      1/3 move to B
    3/5 move to C                         1/3 stay at C

Table 4.16 Weekly movement of cars between three rental company locations.

Figure 4.5 Directed graph version of the values given in Table 4.16.

If $y_0 = (a, b, c)^T$ is the vector for the number of cars at each location at the start, then in the next week the distribution of cars is given as $y_1 = A y_0$, where

$$A = \begin{pmatrix} 1/5 & 1/2 & 1/3 \\ 1/5 & 1/2 & 1/3 \\ 3/5 & 0 & 1/3 \end{pmatrix}. \tag{4.33}$$

In a similar manner, after k weeks, the distribution vector is $y_k = A y_{k-1}$, or equivalently, $y_k = A^k y_0$. The question considered here is: what happens to the distribution of cars as k → ∞?

This problem is an example of a Markov chain. These are sequences $y_1, y_2, y_3, \cdots$ in which $y_{k+1}$ only depends on $y_k$. In our case the dependence is linear in the sense that $y_{k+1} = A y_k$. Also, note that we are going to have fractional cars in this discussion. The values will be rounded to integers only when explicitly evaluating $y_k$.

To determine what happens when k → ∞ we need to know an important fact about A, which is that λ = 1 is the dominant eigenvalue and it has an algebraic multiplicity of one. Why this is true is explained at the end of this example. But, in brief, A has non-negative entries and each column adds to one. Because of this it is said to be a probability matrix. It also is irreducible (this is explained below) with at least one positive diagonal entry. Together these guarantee that λ = 1 has the stated properties.

Now, back to the question. As was done in deriving the power method, assume that A has eigenvectors $x_1$, $x_2$, and $x_3$, with respective eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$. In this case, we can write $y_0 = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3$. This means that $y_k = A^k y_0$ can be written as $y_k = \alpha_1 \lambda_1^k x_1 + \alpha_2 \lambda_2^k x_2 + \alpha_3 \lambda_3^k x_3$. Letting $\lambda_1$ be the dominant eigenvalue then, since $\lambda_1 = 1$, $y_k = \alpha_1 x_1 + \alpha_2 \lambda_2^k x_2 + \alpha_3 \lambda_3^k x_3$, where $|\lambda_2| < 1$ and $|\lambda_3| < 1$. It follows that

$$\lim_{k \to \infty} y_k = \alpha_1 x_1.$$

So, to answer the question we must determine $\alpha_1$ and $x_1$. The vector $x_1$ is easily determined by applying the power method to A. As for $\alpha_1$, the total number of cars N never changes. This means that if $y_0 = (a, b, c)^T$, then $N = a + b + c = \|y_0\|_1$. In fact, for every k, it must be that $\|y_k\|_1 = N$. From this it follows that $\|\alpha_1 x_1\|_1 = N$. In other words, $\alpha_1 = \|y_0\|_1 / \|x_1\|_1$. Therefore, the limiting distribution of cars is given as

$$\lim_{k \to \infty} y_k = \frac{\|y_0\|_1}{\|x_1\|_1}\, x_1. \tag{4.34}$$

This is known as the steady-state or stationary distribution.

Example 4.17

Suppose there are 1000 cars and at the start they are all at location A. So, $y_0 = (1000, 0, 0)^T$. The dominant eigenvector $x_1$ is computed using the power method as given in Table 4.1, with $tol = 10^{-8}$. From (4.34), and rounding the values to the closest integer, it is found that the number of cars at location A, B, and C approaches 345, 345, and 310, respectively.

A question the car rental company might ask is, how many weeks does it take to reach this stationary distribution? To answer this, let $d = (345, 345, 310)^T$ and let $d_k$ be the vector obtained when you round the entries in $y_k$ to the closest integer. You then use the procedure in Table 4.1 but stop the iteration when $d_k = d$. It is found that it takes 4 weeks.

It is interesting to note that the model used here is similar to what is used in Google's PageRank algorithm for prioritizing the results of a search. In this case, A, B, C are web-pages and the arrows in Figure 4.5 indicate how the pages are linked. For example, C has a link to B, but not vice versa. Google uses a "random surfer" model where the surfer can follow one of the links on the page, or jump randomly to some other page in the web. By computing $x_1$ they determine where the largest number of surfers end up, and this is used in ranking the web-pages. Using our car rental example, A and B would be listed first in the search results, and C would be last. It should be pointed out that PageRank is only one of many factors Google uses to rank web-pages. In fact, a whole industry related to search engine optimization (SEO) has arisen to try to reverse engineer how the ranking is determined and use this to improve the ranking of individual pages.
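The numbers in Example 4.17 can be reproduced with the short plain-Python sketch below. It is not the procedure of Table 4.1 verbatim: the starting vector, the 1-norm normalization, and the form of the stopping test in `power_method` are assumptions made to keep the example self-contained.

```python
A = [[1/5, 1/2, 1/3],
     [1/5, 1/2, 1/3],
     [3/5, 0.0, 1/3]]          # the matrix in (4.33)


def step(A, y):
    """One week of movement: return A y."""
    return [sum(a * yj for a, yj in zip(row, y)) for row in A]


def power_method(A, tol=1e-8, max_iter=1000):
    """Approximate the dominant eigenvector, scaled so its 1-norm is one."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        z = step(A, x)
        s = sum(abs(v) for v in z)
        z = [v / s for v in z]
        if max(abs(a - b) for a, b in zip(z, x)) < tol:
            return z
        x = z
    return x


y0 = [1000.0, 0.0, 0.0]
x1 = power_method(A)
N = sum(abs(v) for v in y0)                              # ||y0||_1
limit = [N * v / sum(abs(w) for w in x1) for v in x1]    # formula (4.34)
d = [round(v) for v in limit]
print(d)        # [345, 345, 310]

# Count the weeks until the rounded distribution equals d
y, weeks = y0, 0
while [round(v) for v in y] != d:
    y = step(A, y)
    weeks += 1
print(weeks)    # 4
```

The rapid convergence is consistent with the eigenvalues of this particular matrix. Its trace is 31/30 and its determinant is zero (two rows are identical), so the eigenvalues other than λ = 1 are 1/30 and 0.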

4.5.3.1 Probability Matrices and their Dominant Eigenvalue

Because A has only non-negative entries and each column adds to one, it is said to be a probability matrix. It is also called a stochastic matrix, or a transition probability matrix. There are probability matrices that are based on row sums, and the results discussed below apply to them as well.

To guarantee that the power method can be used, we need some information about the eigenvalues of a probability matrix. We begin with the easier results.

Theorem 4.5 If A is a probability matrix, then λ = 1 is an eigenvalue, and all of its eigenvalues satisfy |λ| ≤ 1.

The proof comes from noting that $x = (1, 1, \cdots, 1)^T$ is an eigenvector, with eigenvalue λ = 1, for $A^T$. Since A and $A^T$ have the same eigenvalues, λ = 1 is an eigenvalue for A. Also, if Ax = λx, then $\|\lambda x\| = \|Ax\| \le \|A\| \cdot \|x\|$. Using the 1-norm, and since $\|A\|_1 = 1$ and $\|\lambda x\| = |\lambda| \cdot \|x\|$, it follows that |λ| ≤ 1.

Now comes the harder question. Because A is not symmetric, it can have complex-valued eigenvalues. What we want to know is, are there any eigenvalues besides λ = 1 that have |λ| = 1? To answer this, it is going to be assumed that the matrix is irreducible. This means that if you follow the arrows in the directed graph for the matrix, you can get to any vertex from any other vertex. In graph theory, such a graph is said to be strongly connected. As seen in Figure 4.5, this holds for our matrix. However, note that if the A to C arrow is removed then it would be reducible since you cannot get to C from A or B. If you just remove the C to A arrow it remains irreducible. The next result, which is part of the Perron-Frobenius theorem, answers the question about λ = 1 [Berman and Plemmons, 1994].

Perron-Frobenius Theorem. If A is a probability matrix that is irreducible, and if it has at least one positive diagonal entry, then λ = 1 is the dominant eigenvalue. Moreover, it has an algebraic multiplicity of one, and it has an eigenvector whose components are all positive.

The eigenvector mentioned in this theorem is $x_1$ in the car rental company example.
Also, note that because λ = 1 has algebraic multiplicity one, its geometric multiplicity is one. To determine if a matrix is irreducible, irrespective of whether it is a probability matrix, requires constructing its directed graph. How this is done is illustrated in the following example.

Example 4.18

To determine if

$$A = \begin{pmatrix} 0 & 1 & 2 \\ 0 & 0 & 3 \\ -1 & 4 & 7 \end{pmatrix} \tag{4.35}$$

is irreducible, you first determine its directed graph. Since n = 3, there are three vertices. From the first row, since $a_{12} \neq 0$, in the directed graph there is a path from the first to the second vertex. Similarly, since $a_{13} \neq 0$, there is a path from the first to the third vertex. Continuing, you obtain the directed graph in Figure 4.6. It is evident that it is possible to traverse the graph and go between any two vertices, so the matrix is irreducible. Note that if, say, the path from the second to third vertex is removed it is reducible.

Figure 4.6 Directed graph for (4.35)
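The test described in Example 4.18 amounts to checking that the directed graph is strongly connected, and this can be automated. The sketch below, in plain Python, uses the convention of the example (a nonzero $a_{ij}$ with i ≠ j gives an arrow from vertex i to vertex j) together with a simple search from every vertex. The function name is illustrative.

```python
def is_irreducible(A):
    """True if the directed graph of the square matrix A is strongly connected."""
    n = len(A)

    def reachable(start):
        # Collect every vertex that can be reached from 'start' by following arrows
        seen = {start}
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if j != i and A[i][j] != 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    return all(len(reachable(s)) == n for s in range(n))


A = [[0, 1, 2], [0, 0, 3], [-1, 4, 7]]           # the matrix in (4.35)
print(is_irreducible(A))                          # True

A_cut = [[0, 1, 2], [0, 0, 0], [-1, 4, 7]]       # path from vertex 2 to vertex 3 removed
print(is_irreducible(A_cut))                      # False
```

For the car rental matrix (4.33) the arrows run from column to row, which is the transposed convention. This does not affect the outcome, because a directed graph is strongly connected exactly when the graph with every arrow reversed is.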

4.6 Singular Value Decomposition

One of the more significant results in linear algebra is the spectral decomposition theorem. For those who might not remember this, it is given next.

Theorem 4.6 If A is a symmetric matrix, then it is possible to factor A as

$$A = Q D Q^T, \tag{4.36}$$

where D is a diagonal matrix and Q is an orthogonal matrix.

Orthogonal matrices were introduced in Section 4.4.3, and a summary of their basic properties is given in Appendix B. To explain where the matrices in this theorem come from, let $u_1, u_2, \cdots, u_n$ be orthonormal eigenvectors for A, with corresponding eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$ (see Theorem 4.1). In this case, D is the diagonal matrix

$$D = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix},$$

and the jth column of Q is $u_j$. Diagrammatically, (4.36) can be written as

$$A = \begin{pmatrix} \uparrow & \uparrow & & \uparrow \\ u_1 & u_2 & \cdots & u_n \\ \downarrow & \downarrow & & \downarrow \end{pmatrix} \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix} \begin{pmatrix} \leftarrow & u_1 & \rightarrow \\ \leftarrow & u_2 & \rightarrow \\ & \vdots & \\ \leftarrow & u_n & \rightarrow \end{pmatrix}.$$


Example 4.19

For the matrix A in (4.5) we found that $\lambda_1 = 3$, with $u_1 = (1, 1)^T/\sqrt{2}$, and $\lambda_2 = 1$, with $u_2 = (1, -1)^T/\sqrt{2}$. Accordingly, the factorization in the above theorem is

$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}.$$

Given that it is possible to carry out such a factorization, the next question is, why do this? One answer is that this shows that it is possible to change the basis so the matrix is diagonal, and diagonal matrices are very easy to work with. For example, it is much easier to solve Dx = b than it is to solve Ax = b, where A is a full n × n matrix. The limitation is that the above theorem requires the matrix to be symmetric. What we are going to consider is how it might be possible to derive a similar result for nonsymmetric matrices, and even for matrices that are not n × n. It needs to be pointed out that the resulting factorization will have some distinct differences from (4.36), even in the case when the matrix is symmetric. What these are will be explained once the factorization has been derived.
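To illustrate why a diagonalizing factorization is convenient, the following plain-Python sketch rebuilds the matrix of Example 4.19 from Q and D, and then solves Ax = b using $x = Q D^{-1} Q^T b$, so the only "solve" involved is division by the diagonal entries. The right-hand side b = (3, 0)ᵀ and the function name are choices made for this illustration.

```python
import math

s = 1 / math.sqrt(2)
Q = [[s, s], [s, -s]]      # columns are u1 and u2 from Example 4.19
d = [3.0, 1.0]             # diagonal entries of D

# Rebuild A = Q D Q^T entry by entry
A = [[sum(Q[i][k] * d[k] * Q[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]
print(A)                   # approximately [[2, 1], [1, 2]]


def solve_with_factorization(Q, d, b):
    """Solve A x = b, where A = Q D Q^T, via x = Q D^{-1} Q^T b."""
    n = len(d)
    c = [sum(Q[i][k] * b[i] for i in range(n)) for k in range(n)]      # c = Q^T b
    y = [c[k] / d[k] for k in range(n)]                                # solve D y = c
    return [sum(Q[i][k] * y[k] for k in range(n)) for i in range(n)]   # x = Q y


x = solve_with_factorization(Q, d, [3.0, 0.0])
print(x)                   # approximately [2, -1]
```

The three steps cost two matrix-vector products and n divisions, which is the sense in which working with D is easier than working with a full matrix.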

4.6.1 Derivation of the SVD

Given an n × m matrix A, with m ≤ n, a singular value decomposition, or factorization, has the form

$$A = U \Sigma V^T, \tag{4.37}$$

where U is an n × n orthogonal matrix, V is an m × m orthogonal matrix, and Σ is an n × m matrix given as

$$\Sigma = \begin{pmatrix} S \\ O \end{pmatrix}, \tag{4.38}$$

where S is a diagonal m × m matrix of the form

$$S = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_m \end{pmatrix}, \tag{4.39}$$

and O is a (n − m) × m matrix containing only zeros. The $\sigma_i$'s are called singular values and they satisfy $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m \ge 0$.


The assumption that m ≤ n is made because this applies to the data matrices considered in this textbook, and it makes writing out the formula for Σ a bit simpler. The steps outlined below can be easily extended to a matrix with n < m, or you can use the fact that a SVD for A can be obtained by taking the transpose of a SVD for $A^T$.

The key step for finding a factorization is to consider $A^TA$ and $AA^T$. These are called Gram matrices, and some of their properties needed in the derivation of the SVD are given in the next theorem.

Theorem 4.7 Assuming A is an n × m matrix, then:
(1) $A^TA$ and $AA^T$ are symmetric with non-negative eigenvalues.
(2) If λ is an eigenvalue of $AA^T$, with eigenvector u, then either λ is an eigenvalue for $A^TA$ with eigenvector $A^Tu$, or else λ = 0 and $A^Tu = 0$.

It should be pointed out that it is possible to prove that a factorization is possible in a short paragraph using the orthogonal subspace viewpoint discussed later in this section. The approach used here is more constructive, albeit longer, with the objective of being able to show how the three matrices in the factorization can be computed.

Finding V and S: Since $A^TA$ is a m × m symmetric matrix then, from (4.36),

$$A^TA = Q_L D_L Q_L^T, \tag{4.40}$$

where $D_L$ is a diagonal matrix formed from the eigenvalues for $A^TA$, and $Q_L$ is an orthogonal matrix formed using the orthonormal eigenvectors of $A^TA$. On the other hand, if the factorization (4.37) is possible then

$$A^TA = (U\Sigma V^T)^T U\Sigma V^T = (V\Sigma^T U^T) U\Sigma V^T = V S^2 V^T, \tag{4.41}$$

where $S^2 = \Sigma^T\Sigma$ is the m × m diagonal matrix given as

$$S^2 = \begin{pmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_m^2 \end{pmatrix}. \tag{4.42}$$

In comparing (4.40) and (4.41) we come to the conclusion that for the factorization in (4.37) we can take $V = Q_L$. It also means that $S^2 = D_L$. Therefore, the columns of V are orthonormal eigenvectors for $A^TA$, and the corresponding singular values are $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ is an eigenvalue for $A^TA$.


Finding U and Σ: In a similar vein, given that $AA^T$ is a n × n symmetric matrix, then

$$AA^T = Q_R D_R Q_R^T, \tag{4.43}$$

where $D_R$ is a diagonal matrix formed from the eigenvalues for $AA^T$, and $Q_R$ is an orthogonal matrix formed using the orthonormal eigenvectors of $AA^T$. According to (4.37), $AA^T = U(\Sigma\Sigma^T)U^T$, and for this to agree with (4.43) we take $U = Q_R$. Now, from (4.38), $\Sigma\Sigma^T$ is the n × n diagonal matrix

$$\Sigma\Sigma^T = \begin{pmatrix} \sigma_1^2 & & & & & \\ & \ddots & & & & \\ & & \sigma_m^2 & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{pmatrix}. \tag{4.44}$$

From (4.43), the diagonal entries of this matrix are the eigenvalues of $AA^T$. However, we have already determined the $\sigma_i$'s so there is a possible inconsistency in this result. Fortunately, as stated in Theorem 4.7, $AA^T$ and $A^TA$ share nonzero eigenvalues, and they have the same geometric multiplicities. Moreover, the only other eigenvalue of $AA^T$ is λ = 0, and it has a geometric multiplicity equal to the number of zeros appearing along the diagonal in (4.44).

Assembly: The singular values are labeled so that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m$. Also, if $u_j$ and $v_j$ are the jth columns of U and V, respectively, for j = 1, 2, · · · , m then from (4.37) it is required that $Av_j = \sigma_j u_j$. This equation is useful because it can be used to determine $u_j$ when $\sigma_j \neq 0$. If $\sigma_j = 0$, or for j = m + 1, m + 2, · · · , n, then Theorem 4.7 can be used to determine the $u_j$'s. Specifically, they are orthonormal solutions of $A^Tu = 0$.

Although the factorization resembles the one in (4.36), and it was certainly used in the above derivation, there are differences. The most obvious one is that the factorization in (4.46) works on matrices that are not square. However, consider the case of when A is symmetric (and therefore square). The diagonals of D are the eigenvalues of A, and they can be positive, negative, or zero. The singular values listed in Σ, on the other hand, are non-negative. One might guess that the singular values in this case are just the absolute values of the nonzero eigenvalues of A, and this is correct. However, this conclusion is limited to symmetric matrices, and as will be shown in Example 4.21, it does not need to hold for square but non-symmetric matrices.


4.6.2 Interpretations of the SVD

The above derivation contains the formulas that can be used to compute an SVD. It is also informative to consider the geometric underpinnings of the factorization.

Principal Axes Viewpoint

The x's that satisfy $\|x\|_2 = 1$ form the unit sphere in $\mathbb{R}^m$. If m = n, and A is invertible, then the corresponding values for y = Ax form an ellipsoid in $\mathbb{R}^n$. This is illustrated in Figure 4.7 for the matrix

$$A = \begin{pmatrix} 2 & 2 \\ 3 & -1 \end{pmatrix}. \tag{4.45}$$

What the SVD does is pick the orthogonal vectors $u_1$ and $u_2$ to align with the principal axes of the ellipsoid (or, for this example, the ellipse). Specifically, $u_1$ points along the major axis and $u_2$ points along the minor axis. This determines $v_1$ and $v_2$ as these are the vectors that map to $u_1$ and $u_2$, respectively. More precisely, because $v_1$ and $v_2$ are unit vectors, the requirement is that $Av_1$ points in the same direction as $u_1$, and similarly for $v_2$ and $u_2$. The values of $\sigma_1$ and $\sigma_2$ are the respective lengths of the semi-major and semi-minor axes of the ellipse. If this reminds you of Figure 3.3, it should, and this connection will be revisited with properties (C4) and (C5) in Section 4.6.4.

Figure 4.7 For y = Ax, the unit circle $\|x\|_2 = 1$ on the left maps to the ellipse on the right.

This viewpoint brings out the importance of the 2-norm in the determination of the SVD. Namely, that the SVD is determined by what A does to the unit sphere $\|x\|_2 = 1$. This raises the question as to whether it is possible to have a version of the SVD using one of the other norms. To explore this idea, one approach is to simply redraw Figure 4.7 for the new norm. For example, using the 1-norm, $\|x\|_1 = 1$ is a diamond shape curve and A turns it into a parallelogram. It is possible to align $u_1$ and $u_2$ with the long and short axes of the parallelogram. This works but what is lost using the 1-norm is the guaranteed orthogonality of $u_1$ and $u_2$. This means that U and V are not necessarily orthogonal matrices. Consequently, instead of computing $V^T$ it is now necessary to compute $V^{-1}$. For large matrices the resulting increase in the computing time is significant, which makes using the 1-norm less attractive.

Orthogonal Subspace Viewpoint

Given that A is n × m, then $A : \mathbb{R}^m \to \mathbb{R}^n$. In contrast, since $A^T$ is m × n, then $A^T : \mathbb{R}^n \to \mathbb{R}^m$. What the SVD does is find an orthonormal basis for $\mathbb{R}^m$, and one for $\mathbb{R}^n$, that are determined by the range (R) and null space (N) of A and $A^T$. In what follows it is assumed that dim R(A) = r (i.e., the range has dimension r). In this case, the null space N(A) has dim N(A) = m − r. We begin with some facts from linear algebra. In what follows, ⊕ is the direct sum.

(1) $\mathbb{R}^m = R(A^T) \oplus N(A)$, where $R(A^T) \perp N(A)$.
Explanation: In words, this says that $R(A^T)$ and N(A) are orthogonal subspaces, and given $v \in \mathbb{R}^m$ then there are vectors $x_1 \in R(A^T)$ and $x_2 \in N(A)$ so that $v = x_1 + x_2$. Verifying that this is possible starts with showing that the dimensions add up correctly. This happens because the number of independent columns in a matrix equals the number of independent rows. So, $\dim(R(A^T)) = r$. The next step is to show that $R(A^T)$ and N(A) are orthogonal. This means that if $x_1 \in R(A^T)$ and $x_2 \in N(A)$, then $x_1 \bullet x_2 = 0$. Verifying this is left as an exercise.

(2) $\mathbb{R}^n = R(A) \oplus N(A^T)$, where $R(A) \perp N(A^T)$.
Explanation: This is Fact (1) but written for $A^T$.

A conclusion coming from Fact (1) is that you can find an orthonormal basis $v_1, v_2, \cdots, v_m$ for $\mathbb{R}^m$, where $v_1, v_2, \cdots, v_r$ is a basis for $R(A^T)$, and $v_{r+1}, v_{r+2}, \cdots, v_m$ is a basis for N(A). It is understood here that if r = m, then N(A) contains only the zero vector, so there is no basis for this subspace in this case. In a similar manner, from Fact (2) you can find an orthonormal basis $u_1, u_2, \cdots, u_n$ for $\mathbb{R}^n$, where $u_1, u_2, \cdots, u_r$ is a basis for R(A), while $u_{r+1}, u_{r+2}, \cdots, u_n$ is a basis for $N(A^T)$.

Now, what the SVD does is provide a way to construct orthonormal basis vectors $v_1, v_2, \cdots, v_r$ for $R(A^T)$ in such a way that $Av_1, Av_2, \cdots, Av_r$ is an orthogonal basis for R(A). Turning this into an orthonormal basis results in the formula $u_j = Av_j/\sigma_j$, where $\sigma_j = \|Av_j\|_2$ is called a singular value for A. The rest of the basis for $\mathbb{R}^n$ comes from $N(A^T)$. In other words, they are orthonormal solutions of $A^Tu = 0$.

4.6.3 Summary of the SVD

Assuming A is a nonzero n × m matrix, with m ≤ n, then a singular value decomposition (SVD) of A is any factorization that has the form

$$A = U \Sigma V^T, \tag{4.46}$$

where U is an n × n orthogonal matrix, V is an m × m orthogonal matrix, and Σ is an n × m matrix that has the form

$$\Sigma = \begin{pmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_m \\ 0 & \cdots & & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & & 0 \end{pmatrix}. \tag{4.47}$$

The $\sigma_j$'s are the singular values for A, and they satisfy $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m \ge 0$. The matrices in the factorization can be determined as follows:

a) $V = [\, v_1 \ v_2 \ \cdots \ v_m \,]$ and Σ: The $v_j$'s are orthonormal eigenvectors of $A^TA$. If $\lambda_j$ is the corresponding eigenvalue for $v_j$, then $\sigma_j = \sqrt{\lambda_j}$. Moreover, the labeling is such that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_m \ge 0$.

b) $U = [\, u_1 \ u_2 \ \cdots \ u_n \,]$: Assuming that $\sigma_r > 0$ and $\sigma_{r+1} = 0$, then

$$u_j = \frac{1}{\sigma_j} A v_j, \quad \text{for } j = 1, 2, \cdots, r. \tag{4.48}$$

For j = r + 1, r + 2, · · · , n, the $u_j$'s are orthonormal solutions of

$$A^T u_j = 0. \tag{4.49}$$

Note that if r = m, then all of the $u_j$'s are determined using (4.48).

The geometric multiplicity of a singular value $\sigma_j$ equals the geometric multiplicity of the associated eigenvalue $\lambda_j$ for $A^TA$. Correspondingly, the geometric multiplicity of $\sigma_j$ determines how many times it appears on the diagonal in (4.47).

There are other ways to find an SVD. One of note is that if you have determined an SVD of A, you can use it to determine an SVD of αA, $A^T$, or $A^{-1}$ (assuming A is invertible). This is illustrated in Example 4.22. It should also be evident that an SVD is not unique. For example, the $u_j$'s satisfying (4.49) are not unique and the $v_j$'s are also not unique. However, the singular values are unique, and this means that all SVD's of a matrix have the same Σ (see Exercise 4.54).

Example 4.20 Suppose



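$$A = \begin{pmatrix} 2 & -1 \\ 1 & 1 \\ -1 & 2 \end{pmatrix}.$$

Since A is 3 × 2, there are two singular values $\sigma_1$ and $\sigma_2$, V is 2 × 2, and U is 3 × 3.

Finding $\sigma_1$, $\sigma_2$, and V: The eigenvalues of

$$A^TA = \begin{pmatrix} 6 & -3 \\ -3 & 6 \end{pmatrix}$$

are $\lambda_1 = 9$ and $\lambda_2 = 3$, with corresponding orthonormal eigenvectors $v_1 = (-\sqrt{2}/2, \sqrt{2}/2)^T$ and $v_2 = (\sqrt{2}/2, \sqrt{2}/2)^T$. Consequently,

$$V = \begin{pmatrix} -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \end{pmatrix}.$$

The associated singular values are $\sigma_1 = \sqrt{\lambda_1} = 3$, and $\sigma_2 = \sqrt{\lambda_2} = \sqrt{3}$.

Finding U: The columns of U associated with the two singular values can be determined using (4.48). So, $u_1 = Av_1/\sigma_1$ and $u_2 = Av_2/\sigma_2$. The third column $u_3$ must satisfy (4.49) and have unit length. Since the general solution of $A^Tu = 0$ is $u = a(1, -1, 1)^T$, we take $u_3 = (\sqrt{3}/3, -\sqrt{3}/3, \sqrt{3}/3)^T$. Consequently,

$$U = \begin{pmatrix} -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{6}\sqrt{6} & \tfrac{1}{3}\sqrt{3} \\ 0 & \tfrac{1}{3}\sqrt{6} & -\tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{2}\sqrt{2} & \tfrac{1}{6}\sqrt{6} & \tfrac{1}{3}\sqrt{3} \end{pmatrix}.$$

The resulting SVD:

$$A = U \begin{pmatrix} 3 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} V^T.$$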
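The steps of Example 4.20 can be followed numerically with the plain-Python sketch below. It is tailored to this 3 × 2 full-rank case and is not a general SVD routine: the eigenpairs of the 2 × 2 matrix $A^TA$ come from the closed-form formula for a symmetric 2 × 2 matrix, and $u_3$ is obtained as the cross product $u_1 \times u_2$, which is one convenient unit solution of (4.49) when n = 3. The helper name is illustrative.

```python
import math

A = [[2.0, -1.0], [1.0, 1.0], [-1.0, 2.0]]


def sym_eig_2x2(M):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    if b == 0:
        return ([a, d], [[1.0, 0.0], [0.0, 1.0]]) if a >= d else ([d, a], [[0.0, 1.0], [1.0, 0.0]])
    mean, rad = (a + d) / 2, math.hypot((a - d) / 2, b)
    vals = [mean + rad, mean - rad]
    vecs = []
    for lam in vals:
        v = [b, lam - a]                       # solves (M - lam I) v = 0 when b != 0
        nrm = math.hypot(v[0], v[1])
        vecs.append([v[0] / nrm, v[1] / nrm])
    return vals, vecs


# Step a): eigenpairs of A^T A give V and the singular values
AtA = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
lams, vs = sym_eig_2x2(AtA)
sigma = [math.sqrt(lam) for lam in lams]

# Step b): u_j = A v_j / sigma_j, as in (4.48); u3 is orthogonal to u1 and u2
u1, u2 = [[sum(A[i][k] * v[k] for k in range(2)) / s for i in range(3)]
          for v, s in zip(vs, sigma)]
u3 = [u1[1] * u2[2] - u1[2] * u2[1],
      u1[2] * u2[0] - u1[0] * u2[2],
      u1[0] * u2[1] - u1[1] * u2[0]]

U = [[u1[i], u2[i], u3[i]] for i in range(3)]
V = [[vs[0][i], vs[1][i]] for i in range(2)]
Sigma = [[sigma[0], 0.0], [0.0, sigma[1]], [0.0, 0.0]]

# Check that U Sigma V^T reproduces A
recon = [[sum(U[i][k] * Sigma[k][l] * V[j][l] for k in range(3) for l in range(2))
          for j in range(2)] for i in range(3)]
err = max(abs(recon[i][j] - A[i][j]) for i in range(3) for j in range(2))
print(sigma, err)    # sigma is close to [3, sqrt(3)], err is at rounding level
```

The computed $v_2$, $u_2$, and $u_3$ may differ in sign from the ones in the example. This is consistent with the earlier remark that an SVD is not unique, while the singular values are.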



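Example 4.21 Suppose

$$A = \begin{pmatrix} 2 & -1 \\ 1 & -2 \end{pmatrix}.$$

Using the same steps as in the last example, one finds that

$$A = U \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} V^T, \tag{4.50}$$

where

$$U = \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{pmatrix} \quad \text{and} \quad V = \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \end{pmatrix}.$$

The singular values for A in this case are $\sigma_1 = 3$ and $\sigma_2 = 1$. In comparison, the eigenvalues of A are $\lambda_1 = \sqrt{3}$ and $\lambda_2 = -\sqrt{3}$. Note that the singular values of A are not the absolute values of its eigenvalues.

Example 4.22 Given a SVD for A, find a SVD for $A^{-1}$. The matrix A in this case is from the last example.

Answer: First note that if $A = U\Sigma V^T$, then $A^{-1} = (V^T)^{-1}\Sigma^{-1}U^{-1}$. Since $(V^T)^{-1} = V$, and $U^{-1} = U^T$, then from (4.50) we have that

$$A^{-1} = \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \end{pmatrix} \begin{pmatrix} 1/3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{pmatrix}^T.$$

To obtain the SVD, the diagonals in the diagonal matrix must be interchanged (so, $\sigma_1 \ge \sigma_2$). This is easy to do because of (4.48). Making the needed column interchanges gives us the SVD

$$A^{-1} = \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1/3 \end{pmatrix} \begin{pmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \end{pmatrix}^T.$$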
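The factorizations in Examples 4.21 and 4.22 can be checked with a few lines of plain Python. The sketch below rebuilds A from (4.50), forms $A^{-1}$ from the column-swapped factors, and confirms that their product is the identity. The helper `usv` is a name introduced only for this illustration.

```python
import math

s = math.sqrt(2) / 2
U = [[s, s], [s, -s]]          # U from (4.50)
V = [[s, s], [-s, s]]          # V from (4.50)
sig = [3.0, 1.0]


def usv(U, sig, V):
    """Form U diag(sig) V^T for 2x2 factors."""
    return [[sum(U[i][k] * sig[k] * V[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = usv(U, sig, V)             # approximately [[2, -1], [1, -2]]

# SVD of the inverse: swap the columns so the singular values are in descending order
left = [[row[1], row[0]] for row in V]     # left factor for the inverse
right = [[row[1], row[0]] for row in U]    # right factor for the inverse
sig_inv = [1 / sig[1], 1 / sig[0]]         # [1, 1/3]
A_inv = usv(left, sig_inv, right)

prod = [[sum(A[i][k] * A_inv[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
print(prod)                    # approximately the identity matrix

# The trace of A is zero, so its eigenvalues are +sqrt(-det A) and -sqrt(-det A)
eig = math.sqrt(-(A[0][0] * A[1][1] - A[0][1] * A[1][0]))
print(sig, eig)                # singular values 3 and 1, eigenvalues +/- 1.732...
```

The last line makes the point of Example 4.21 concrete: the singular values 3 and 1 are not the absolute values of the eigenvalues ±√3, although the products agree, since both equal |det A| = 3.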

4.6.4 Consequences of a SVD

In what follows, A is an n × m matrix with m ≤ n. Also, rank(A) denotes the rank of the matrix A.

178

4 Eigenvalue Problems

C1)

rank(A) = r if, and only if, A has exactly r nonzero singular values.

Reason: Because U and V are orthogonal, A and Σ have the same rank. The result follows because the rank of a diagonal matrix equals the number of nonzero diagonals. C2) If A is symmetric, then its singular values are the absolute values of its eigenvalues. Reason: This result is a direct consequence of the fact that for a symmetric matrix, AAT = AT A = A2 . C3)

If rank(A) = r, then A has the outer product expansion
$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T, \tag{4.51}$$
where
$$u_j v_j^T = \begin{pmatrix} \uparrow & \uparrow & & \uparrow \\ v_{j1} u_j & v_{j2} u_j & \cdots & v_{jm} u_j \\ \downarrow & \downarrow & & \downarrow \end{pmatrix}.$$

Reason: This follows by simply multiplying out the SVD. Note that each term $u_j v_j^T$ is an n × m matrix that has rank 1. Also, it follows from this result that
$$Ax = \sum_{j=1}^{r} \sigma_j (x \bullet v_j) u_j. \tag{4.52}$$

C4)

||A||2 = σ1

Reason: Note that even though A is not necessarily square, its norm is still defined using (3.17), with the understanding that Ax is an n-vector and x is an m-vector. The proof of the formula is outlined in Exercise 4.53. C5)

If A is invertible, then
$$\kappa_2(A) = \frac{\sigma_1}{\sigma_m}.$$

Reason: This follows from the fact that the singular values for $A^{-1}$ are $1/\sigma_m, 1/\sigma_{m-1}, \cdots, 1/\sigma_1$, and (C4) above.

C6)

If A is invertible, then the solution of Ax = b is
$$x = \sum_{j=1}^{m} \frac{1}{\sigma_j} (b \bullet u_j) v_j.$$

Reason: Since $A^{-1} = V\Sigma^{-1}U^T$, then
$$A^{-1} = \sum_{j=1}^{m} \frac{1}{\sigma_j} v_j u_j^T.$$
The result follows since $x = A^{-1}b$.
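The consequences (C1) and (C3) through (C6) are easy to check numerically on a small test case. The sketch below assumes NumPy and uses a hypothetical random 4 × 4 matrix (which is invertible with probability one); the variable names and the rank tolerance are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # hypothetical test matrix, n = m = 4
b = rng.standard_normal(4)

U, s, Vt = np.linalg.svd(A)       # s holds sigma_1 >= ... >= sigma_m
V = Vt.T
m = A.shape[1]

# (C1) rank = number of nonzero singular values (up to a tolerance).
rank = int(np.sum(s > 1e-12 * s[0]))

# (C3) outer product expansion, equation (4.51).
A_sum = sum(s[j] * np.outer(U[:, j], V[:, j]) for j in range(rank))

# (C4) and (C5): 2-norm and condition number from the singular values.
norm2 = s[0]
kappa2 = s[0] / s[-1]

# (C6) solution of Ax = b as a sum over the singular values and vectors.
x = sum((b @ U[:, j]) / s[j] * V[:, j] for j in range(m))

print(rank, np.allclose(A_sum, A))
print(norm2, np.linalg.norm(A, 2))
print(kappa2, np.linalg.cond(A, 2))
print(np.allclose(A @ x, b))
```

The sum in (C6) also shows why a small $\sigma_m$ is troublesome: the corresponding term is divided by it, so errors in b along $u_m$ are amplified the most.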

4.6.5 Computing a SVD

It is more than apparent how much work is needed to find the SVD for a matrix. Using a direct approach, it is necessary to calculate $A^T A$, then find its eigenvalues and eigenvectors. This is certainly doable using one or more of the methods described earlier in this chapter. However, more efficient procedures have been developed, and the standard approach is to use what is known as the Golub-Reinsch algorithm, which involves transforming the problem to make it more amenable to a computational solution [Cline and Dhillon, 2007]. Depending on the type of matrix involved, MATLAB typically uses this when computing an SVD using the command svd(A). The number of flops for a square matrix for the Golub-Reinsch algorithm is, according to Golub and Van Loan [2013], approximately $21n^3$. Said another way, the flops for the SVD are about a factor of 31 more than for the LU. In looking at the times in Table 4.12, it appears the algorithm used by MATLAB is a bit better than this, at least for the smaller matrices. Even though it does take significantly longer than the LU calculation, the actual computing time is not intolerable. For example, a SVD factorization for a 4000 × 4000 matrix takes less than 6 seconds (see Table 8.9). It is also apparent that the time increases quickly as the size of the matrix increases, and this has generated interest in finding efficient algorithms for computing an approximate SVD for very large matrices. More information about this can be found in Halko et al. [2011], Dongarra et al. [2018], and Heavner et al. [2020].
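To make the direct approach concrete, here is a minimal sketch in Python (NumPy assumed). The function name `svd_direct` is hypothetical, and the sketch assumes A has independent columns so that no singular value is zero. It returns only the first m columns of U, obtained from $u_j = Av_j/\sigma_j$, and compares the singular values with those from the library routine. It is meant as an illustration of the steps, not as a replacement for a library SVD; in particular, forming $A^T A$ squares the condition number, which can cost accuracy for the small singular values.

```python
import numpy as np

def svd_direct(A):
    """Singular values and V from the eigen-decomposition of A^T A,
    then the first m columns of U from u_j = A v_j / sigma_j.
    Assumes A has full column rank (no zero singular values)."""
    lam, V = np.linalg.eigh(A.T @ A)          # symmetric eigenproblem
    order = np.argsort(lam)[::-1]             # sort so sigma_1 >= sigma_2 >= ...
    lam, V = lam[order], V[:, order]
    sigma = np.sqrt(np.clip(lam, 0.0, None))  # guard against tiny negative rounding
    U_m = (A @ V) / sigma                     # column j is A v_j / sigma_j
    return U_m, sigma, V

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))               # hypothetical 6 x 3 test matrix
U_m, sigma, V = svd_direct(A)
sigma_ref = np.linalg.svd(A, compute_uv=False)

print(sigma)
print(sigma_ref)
print(np.allclose(U_m @ np.diag(sigma) @ V.T, A))
```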

4.6.6 Low-Rank Approximations

The SVD provides a way to obtain useful approximations of a matrix. One application of this is image compression, and how this is done is explained later. The approximation comes from the outer product expansion (4.51)
$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \tag{4.53}$$
Each term $\sigma_j u_j v_j^T$ in this sum is an n × m matrix that has rank 1.


The approximations come from the question of whether it is necessary to include all r terms in (4.53). For large matrices, it is not uncommon that the first few singular values are much larger than those that come later. In such cases, the question arises if it is possible to use an approximation $A \approx A_k$, where k < r and
$$A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T. \tag{4.54}$$
Note that rank($A_k$) = k. The accuracy of the approximation $A \approx A_k$ is determined by the next largest singular value, and this is explained in the next result.

Eckart-Young Theorem. If rank(A) = r, then
$$\|A - A_k\|_2 = \sigma_{k+1}, \tag{4.55}$$
where $A_k$ is given in (4.54), and 1 ≤ k < r.

This is a consequence of (C3) and (C4). You can also show that $A_k$ is the most accurate rank k approximation of A possible in the sense that given any other n × m matrix B with rank k, then $\|A - A_k\|_2 \le \|A - B\|_2$. However, $A_k$ is not necessarily the only rank k approximation that yields the minimum error of $\sigma_{k+1}$.
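The formula (4.55) can be observed directly. The sketch below, assuming NumPy and a hypothetical random 8 × 5 test matrix, builds $A_k$ from the first k terms of the expansion and compares the 2-norm of the error with $\sigma_{k+1}$. The helper name `low_rank` is illustrative.

```python
import numpy as np

def low_rank(A, k):
    """Return A_k from (4.54): the first k terms of the outer product expansion."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]   # scales column j of U by sigma_j

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 5))             # hypothetical test matrix
s = np.linalg.svd(A, compute_uv=False)

errors = []
for k in range(1, 5):
    err = np.linalg.norm(A - low_rank(A, k), 2)
    errors.append(err)
    # With zero-based indexing, sigma_{k+1} is stored in s[k].
    print(k, err, s[k])
```

The two printed columns should agree to roundoff for each k.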

4.6.7 Application: Image Compression

One application for the SVD is image compression. To explain, a 1528 × 1225 grayscale image is shown in Figure 4.8(Top). The image file consists of a 1528 × 1225 array of integers. For an 8-bit grayscale image the integers have a value from 0 to 255. This number represents the intensity of gray for that pixel, with 0 corresponding to black and 255 corresponding to white. The matrix A is this matrix of integers.

The idea underlying image compression using the SVD is to consider the approximation $A \approx A_k$. The question is, just how small can we take k and still get an acceptable reproduction of the original image. To investigate this, the resulting images when using $A_{10}$, $A_{25}$, $A_{50}$, and $A_{100}$ are shown in Figure 4.8. The image shows significant improvement going from k = 10 to k = 25, and modest improvement going from k = 25 to k = 50. The changes going to k = 100 are less obvious.

The observations about how well the $A_k$'s reproduce the original image can be explained using the Eckart-Young theorem. From (4.55), the relative error in using $A_k$ to approximate a nonzero matrix A is

$$E(k) \equiv \frac{\|A - A_k\|_2}{\|A\|_2} = \frac{\sigma_{k+1}}{\sigma_1}. \tag{4.56}$$

Figure 4.8 Original grayscale image (top) and the resulting image when using the approximation (4.54): for k = 10, k = 25, k = 50, and k = 100. Panel labels: k = 10, Rel Error = 0.068; k = 25, Rel Error = 0.028; k = 50, Rel Error = 0.014; k = 100, Rel Error = 0.007.

Figure 4.9 Relative error (4.56) when using $A_k$, given in (4.54), to approximate A for the pansy picture. Horizontal axis: k, from 0 to 1400. Vertical axis: error, on a logarithmic scale from $10^{-5}$ to $10^{0}$.

The value of E(k) is given in Figure 4.8 for the respective value of k. A graph for E(k) is shown in Figure 4.9. This verifies our observation that the approximation improves quickly as k increases up to about 50, after which the improvement slows. For example, using the first 50 terms, the error is approximately $1.4 \times 10^{-2}$, and adding an additional 50 terms drops the error to just $7 \times 10^{-3}$.

What benefits, in terms of storage, are achieved by using this type of compression? Well, the full image requires storing 1,871,800 integers, while the jth term in (4.53) requires storage of an n-vector and an m-vector, which in this case means 2,753 floating-point numbers. So, $A_{100}$ requires the storage of 275,300 floating-point numbers. This is about a factor of 7 smaller than what is required for the full matrix (assuming the integers are being stored as floating-point numbers). This is certainly an improvement, but note that using the SVD in this way produces a “lossy” compression because information is lost in the approximation. To its credit, however, the SVD method has an adjustable parameter, k, that allows for various resolutions of the image.

The fact that a digitized image is stored as a matrix makes it easy to modify the image. This also provides a way to visualize basic matrix operations, and a few examples are considered in Exercise 4.56.
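The storage count above is simple arithmetic, and it can be tabulated for each of the values of k used in Figure 4.8. The short sketch below uses plain Python; the function name `svd_storage` is hypothetical, and it assumes, as in the text, that each term costs one n-vector plus one m-vector (with $\sigma_j$ absorbed into one of them).

```python
def svd_storage(n, m, k):
    """Numbers stored for A_k: each term needs an n-vector and an m-vector.
    Assumes sigma_j is absorbed into one of the two vectors."""
    return k * (n + m)

n, m = 1528, 1225          # image dimensions from the text
full = n * m               # entries in the full matrix

for k in (10, 25, 50, 100):
    stored = svd_storage(n, m, k)
    print(k, stored, round(full / stored, 1))
```

Note that the count grows linearly in k, so the compression only pays off while k(n + m) is less than nm, which here means k below roughly 680.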

Example: Color Images

Low-rank approximations can be applied to color images in much the same way. For example, when using an RGB format for the 2360 × 2360 color image shown in Figure 4.10(top), there are three color planes: one for the red, one for the green, and one for the blue components of the image. Each plane is a 2360 × 2360 array of integers, each with a value from 0 to 255 (as before, the number indicating the intensity of the respective color). These arrays will be indicated using R, G, and B. A low-rank approximation can be applied to each array, or you can work with them as a group. The latter will be used in this example. So, we form a single 7080 × 2360 matrix by stacking the three component matrices as follows:
$$A = \begin{pmatrix} R \\ G \\ B \end{pmatrix}. \tag{4.57}$$
After computing $A_k$, it is then necessary to extract $R_k$, $G_k$, and $B_k$ to construct the RGB image file. The resulting images when using $A_{10}$, $A_{25}$, $A_{50}$, and $A_{100}$ are shown in Figure 4.10. The relative error given in each case is determined using (4.56). The accuracy of the approximation is very similar to what was obtained for the grayscale image.

Figure 4.10 Original color image (top) and the resulting low-rank approximations: for k = 10, k = 25, k = 50, and k = 100. Panel labels: k = 10, Rel Error = 0.051; k = 25, Rel Error = 0.027; k = 50, Rel Error = 0.013; k = 100, Rel Error = 0.006.

A couple of comments are in order. First, it makes no difference how the color planes are placed in A. For example, you get exactly the same result if, instead of R/G/B, you use G/B/R. The reason for this is derived in Exercise 4.54. Second, rather than work with A, you can apply the low-rank approximation to R, G, and B individually. The advantage of using A is that the computation is faster, at least on a computer with 10 processors. The reason for this is explained in Section 1.6. ■
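The stacking in (4.57), and the claim that the order of the planes does not matter, can be tried on a small stand-in. The sketch below assumes NumPy and uses hypothetical random 12 × 12 planes in place of the 2360 × 2360 image; the helper `low_rank` and the final clipping step used to get back to 8-bit integers are illustrative assumptions, not something specified in the text.

```python
import numpy as np

def low_rank(A, k):
    """First k terms of the outer product expansion of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(3)
p = 12                                    # small stand-in for 2360
R, G, B = (rng.integers(0, 256, size=(p, p)).astype(float) for _ in range(3))

k = 4
A = np.vstack([R, G, B])                  # (4.57): a 3p x p matrix
Rk, Gk, Bk = np.split(low_rank(A, k), 3, axis=0)

# Stacking in a different order only permutes the rows of A, so the
# recovered planes should be the same.
A2 = np.vstack([G, B, R])
Gk2, Bk2, Rk2 = np.split(low_rank(A2, k), 3, axis=0)
print(np.allclose(Rk, Rk2), np.allclose(Gk, Gk2), np.allclose(Bk, Bk2))

# One possible way (an assumption) to return to 8-bit integers for an image file.
R_img = np.clip(np.rint(Rk), 0, 255).astype(np.uint8)
```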

Exercises 4.1 The following are either true or false. If true, provide an explanation why it is true, and if it is false provide a simple example demonstrating this along with an explanation why it shows the statement is incorrect. In this problem Af is the floating-point approximation of A. (a) If the eigenvalues of A are nonzero, then the eigenvalues of Af are nonzero. (b) If A has a dominant eigenvalue then Af has a dominant eigenvalue. (c) If A is symmetric then Af has real-valued eigenvalues. (d) If A is not defective then Af is not defective.


Section 4.1 4.2. Find the eigenvalues and their associated eigenvectors matrices.     2 1 −2 −7 (a) (b) (c) 4 −1 1 6 ⎛ ⎞ ⎛ ⎞ 1 0 0 4 6 0 (e) ⎝ 6 25 0 ⎠ (f) (d) ⎝ 0 2 1 ⎠ 0 0 16 0 1 5

for the following  ⎛

3 −2 2 −1



⎞ 0 6 0 ⎝0 2 1⎠ 0 1 3

4.3. For a chain of three oscillators, if s/sc = 2, then what are the three eigenvalues for A? Also, what are the associated eigenvectors?

4.4. By writing out the equation Ax = λx in component form, show that the eigenvalues of (4.20) are a, a + 1 and a − 1. What are the corresponding eigenvectors?

4.5. Suppose that $A = B^T B$, where B is a n × m matrix. (a) Show that A is symmetric. (b) Use the Rayleigh quotient to show that if λ is an eigenvalue of A, then λ ≥ 0. (c) Show that A + μI is positive definite if μ > 0. (d) Use the Rayleigh quotient to show that if λ = 0 is an eigenvalue of A, then there is a nonzero x so that Bx = 0. (e) Explain why A is positive definite if the columns of B are independent.

4.6. Assume that A is a n × n matrix, and that it is possible to find orthonormal vectors $u_1, u_2, \cdots, u_n$, where each $u_i$ is an eigenvector for A. Prove that A is symmetric. Hint: A is diagonalizable.

Section 4.2

4.7. The symmetric matrix
$$A = \begin{pmatrix} 2 & 2 \\ 2 & -1 \end{pmatrix}$$
has independent eigenvectors $x_1 = (2, 1)^T$ and $x_2 = (1, -2)^T$. (a) What are the corresponding eigenvalues? (b) Using the starting vector $y_0 = (1, 1)^T$, what eigenvalue will the power method converge to and what will be the resulting eigenvector? (c) Using the power method, by hand, determine $v_0$ and $v_1$.


(d) Compute the dominant eigenvalue, and its associated eigenvector, using the algorithm in Table 4.1, except use the starting vector given in part (b). Also, take tol = $10^{-6}$. Your answer should include the value of k that the iteration stopped at, the resulting value of $v_k$ and the associated eigenvector $y_{k+1} = z_k/\|z_k\|_2$. Finally, compare your computed answers with those for part (b).

4.8. Do Exercise 4.7, but use the symmetric matrix
$$A = \begin{pmatrix} -7 & -6 \\ -6 & 2 \end{pmatrix},$$
which has independent eigenvectors $x_1 = (1, -2)^T$ and $x_2 = (2, 1)^T$.

4.9. The symmetric matrix
$$A = \begin{pmatrix} 11 & 7 & -4 \\ 7 & 11 & 4 \\ -4 & 4 & -10 \end{pmatrix},$$
has independent eigenvectors $x_1 = (1, 1, 0)^T$, $x_2 = (-2, 2, 1)^T$, and $x_3 = (1, -1, 4)^T$. (a) What are the corresponding eigenvalues? (b) Using the starting vector $y_0 = (1, 1, 1)^T$, what eigenvalue will the power method converge to and what will be the resulting eigenvector? (c) Using the power method, by hand, determine $v_0$ and $v_1$. (d) Compute the dominant eigenvalue, and its associated eigenvector, using the algorithm in Table 4.1, except use the starting vector given in part (b). Also, take tol = $10^{-6}$. Your answer should include the value of k that the iteration stopped at, the resulting value of $v_k$ and the associated eigenvector $y_{k+1} = z_k/\|z_k\|_2$. Finally, compare your computed answers with those for part (b).

4.10. Do Exercise 4.9, but use the symmetric matrix
$$A = \begin{pmatrix} 47 & 32 & 8 \\ 32 & -1 & -16 \\ 8 & -16 & 59 \end{pmatrix},$$
which has independent eigenvectors $x_1 = (2, 1, 0)^T$, $x_2 = (1, -2, 10)^T$, and $x_3 = (2, -4, -1)^T$.

4.11. How many iteration steps will the power method, as given in Table 4.1, require before it stops? (a) If A is the identity matrix.


(b) If A is a 6 × 6 symmetric matrix with eigenvalues: $\lambda_1 = 3$, $\lambda_2 = 3$, $\lambda_3 = 3$, $\lambda_4 = 3$, $\lambda_5 = 0$, and $\lambda_6 = 0$. (c) If A is a 4 × 4 matrix with eigenvalues: $\lambda_1 = -2$, $\lambda_2 = 0$, $\lambda_3 = 0$, and $\lambda_4 = 0$.

4.12. Assume that A is a symmetric 6 × 6 matrix with eigenvalues: −4, 1, 4, 5, 8, 12. Suppose that when using the power method with A that the error at step k = 6 is 0.0009. What is (approximately) the error at step k = 7?

4.13. Suppose that the geometric multiplicity for the dominant eigenvalue is 2. How can the power method be used to find two linearly independent eigenvectors for the dominant eigenvalue? Make sure to explain why your method works. Also, the matrix is not necessarily symmetric.

4.14. Show that a symmetric and positive definite matrix always has a dominant eigenvalue.

4.15. Suppose that A is a symmetric n × n matrix with dominant eigenvalue $\lambda_1$, and corresponding eigenvector $x_1$. Also assume that $\|x_1\|_2 = 1$, and each eigenvalue has a geometric multiplicity of one. (a) Let $A_1 = A - \lambda_1 x_1 x_1^T$. Show that $A_1$ is a symmetric n × n matrix. Also, show that $A_1$ has the same eigenvalues, and eigenvectors, as A but $\lambda_1$ is replaced with the eigenvalue λ = 0. In other words, if $\lambda_i \ne \lambda_1$ is an eigenvalue for A, with eigenvector $x_i$, then $A_1 x_i = \lambda_i x_i$, while $A_1 x_1 = 0$. (b) Suppose that A is a symmetric 6 × 6 matrix with eigenvalues −3, −1, 2, 4, 6, 8. Explain how to use the result from part (a), and the power method (6 times), to compute all of the eigenvalues for A. In doing this also state the order in which the eigenvalues are determined.

4.16. This problem illustrates that there is little, if any, connection between the power method converging and the condition of the matrix. Assume the start vector is chosen randomly. (a) Give an example of an ill-conditioned matrix for which the power method converges very quickly. (b) Give an example of a well-conditioned matrix for which the power method converges very slowly, if at all.

4.17. Let



$$A = \begin{pmatrix} a & 0 & c - a \\ 0 & b & c - b \\ 0 & 0 & c \end{pmatrix},$$
where a, b, and c are different numbers. (a) Find the eigenvalues and associated eigenvectors. (b) If a = 8, b = 2, c = 4, and $y_0 = (0, 0, 1)^T$, what eigenvalue will the power method find? Using an approach similar to the one used for the matrix given in (4.11), explain why the error reduction factor is R = 1/2.


(c) If a = 8, b = 4, c = 2, and y0 = (0, 0, 1)T , what eigenvalue will the power method find? Using an approach similar to the one used for the matrix given in (4.11), explain why the error reduction factor is R = 1/4. Section 4.3 4.18. This problem involves the matrix A used in Example 4.6. Also, let B = A + sI. (a) For what value of s will B = A + sI not have a dominant eigenvalue? (b) For what values of s will −6 + s be the dominant eigenvalue of B? For these values of s, which s results in the smallest error reduction factor R? (c) For what values of s will 3 + s be the dominant eigenvalue of B? For these values of s, which s results in the smallest error reduction factor R? 4.19. Suppose that A is a symmetric 5 × 5 matrix with eigenvalues −2, −1, 0, 1, 2. (a) Explain why the power method will likely fail with this matrix. (b) Explain how shifting can be used so the power method can be used to calculate the λ = 2 eigenvalue of A. (c) The optimum shift is the one which produces the smallest error reduction factor R. In part (b), show that the optimum shift is s = 1/2. 4.20. In this problem assume the matrix is symmetric and positive definite. (a) The output for the power method is shown in Table 4.17. What is, approximately, the second largest eigenvalue λ2 for the matrix? (b) Explain how to use inverse iteration to compute an eigenvector for λ2 .

Table 4.17 Data for Exercise 4.20.

k     vk          |vk − vk−1|/|vk|
1     7.587539
2     7.970868    4.809e−02
3     7.993948    2.887e−03
4     7.998505    5.697e−04
5     7.999627    1.402e−04
6     7.999907    3.501e−05
7     7.999977    8.754e−06
8     7.999994    2.188e−06
9     7.999999    5.471e−07
10    8.000000    1.368e−07


4.21. Suppose A is a symmetric 10 × 10 matrix with 10 different eigenvalues (some positive and some negative). Explain how to use the power method (twice) to determine $\lambda_a$ and $\lambda_b$ so that $\lambda_a$ and $\lambda_b$ are eigenvalues for A and all other eigenvalues satisfy $\lambda_a \le \lambda \le \lambda_b$. You can assume that $\lambda_a \ne -\lambda_b$.

4.22. Suppose $A_1$ and $A_2$ are symmetric and positive definite 100 × 100 matrices with the following eigenvalues:

$A_1$: $\lambda_1 = 100$, $\lambda_2 = 99$, $\lambda_3 = 98$, $\cdots$, $\lambda_{100} = 1$
$A_2$: $\lambda_1 = 100$, $\lambda_2 = 10$, $\lambda_3 = 1$, $\cdots$, $\lambda_{100} = 10^{-97}$

(a) Will the power method most likely converge faster for $A_1$ or $A_2$? (b) Given any symmetric positive definite n × n matrix A, with eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_n > 0$, explain why, in theory, $\lambda_n$ can be computed by applying the power method to $B = A - \lambda_1 I$. This is assuming that $\lambda_1$ is known. (c) To compute $\lambda_{100}$ for $A_1$, one can use the method from part (b) or the inverse power method. Which will most likely converge faster? (d) Explain why the method in part (b) will likely fail when applied to $A_2$.

4.23. This problem concerns what is known as the Rosser matrix, which is given as
$$R = \begin{pmatrix}
611 & 196 & -192 & 407 & -8 & -52 & -49 & 29 \\
196 & 899 & 113 & -192 & -71 & -43 & -8 & -44 \\
-192 & 113 & 899 & 196 & 61 & 49 & 8 & 52 \\
407 & -192 & 196 & 611 & 8 & 44 & 59 & -23 \\
-8 & -71 & 61 & 8 & 411 & -599 & 208 & 208 \\
-52 & -43 & 49 & 44 & -599 & 411 & 208 & 208 \\
-49 & -8 & 8 & 59 & 208 & 208 & 99 & -911 \\
29 & -44 & 52 & -23 & 208 & 208 & -911 & 99
\end{pmatrix}.$$
The eigenvalues of R are: 0, 1000, 1020, $\pm 10\sqrt{10405}$, $510 \pm 100\sqrt{26}$ [Rosser et al., 1951]. Note that the eigenvalue λ = 1000 has two linearly independent eigenvectors. Also, R is symmetric but it's not positive definite. (a) Show that four of the eigenvalues, in magnitude, are very close (although unequal). (b) Explain why the power method will likely fail on this matrix. (c) One can try shifting, and let B = A + μI. If μ = 1, what eigenvalue of B will the power method converge to, and what is the associated eigenvalue of A? (d) Using the shift from part (c), according to Theorem 4.2, by what factor is the error reduced at each iteration step? How many iteration steps are needed to reduce the error by a factor of 10? (e) The power method is to be applied to the shifted matrix B = A + μI. What conditions need to be imposed on μ so it converges to $10\sqrt{10405} + \mu$?


4.24. This problem considers computing certain eigenvalues of an n × n symmetric matrix A. The matrix is nonsingular and has a dominant eigenvalue. In this problem, n = 200 and the entries of A are $a_{ij} = 1 - |i - j|$. (a) Use the power method, as given in Table 4.1, to determine the dominant eigenvalue. Also, take tol = $10^{-8}$. Your answer should include the value of k that the iteration stopped at. (b) The goal is to compute two eigenvalues $\lambda_L$ and $\lambda_R$ of A so that all eigenvalues of A satisfy $\lambda_L \le \lambda \le \lambda_R$. Which one, $\lambda_L$ or $\lambda_R$, was computed in part (a)? Use the power method, and shifting, to find the other one. Your answer should include the value of k that the iteration stopped at.

4.25. Do Exercise 4.24, but take n = 200 and $a_{ij} = 2 - \sqrt{i + j}$.

4.26. Do Exercise 4.24, but take n = 200 and $a_{ij} = (i + j)/(1 + i + j)$.

4.27. This problem concerns the n × n tridiagonal matrix shown below. It is symmetric and positive definite. Also, note that $a_{ii} = i$ except that $a_{nn} = \pi n$. In (a)-(c) take n = 100.
$$A = \begin{pmatrix}
1 & 1 & & & & \\
1 & 2 & 1 & & & \\
 & 1 & 3 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & n-1 & 1 \\
 & & & & 1 & \pi n
\end{pmatrix}.$$
(a) Use the power method to compute the largest eigenvalue. The value should be correct to six significant digits, and you should explain why it likely satisfies this condition. (b) Use the power method to compute the smallest eigenvalue. The value should be correct to six significant digits. Also, explain how you used the power method to answer this question. (c) Is λ = 100 an eigenvalue for A? You should use the power method to answer this question. Explain what you did and how (or why) it can be used to answer the question. Note that it is not possible to prove decisively whether λ = 100 is, or is not, an eigenvalue using the power method, but it can be used to provide a compelling answer to this question (which you must provide). (d) Compute the largest eigenvalue when n = 1,000,000. The value should be correct to seven significant digits.
Make sure to explain what modifications you made to the code you used in part (a) to answer this question. Also, how many iteration steps does it take to determine the eigenvalue?


Section 4.4

4.28. Consider the symmetric matrix
$$A = \begin{pmatrix} 2 & 2 \\ 2 & -1 \end{pmatrix}.$$
(a) Using orthogonal iteration with $B_0 = A$, what are the k = 0 approximations for the eigenvalues? (b) Find $B_1$ and $Q_1$ and the resulting approximations for the eigenvalues. (c) If $B_0$ is a random 2 × 2 matrix, the $Q_k$'s will converge to one of four possible matrices. What are they? (d) Compute the eigenvalues of A using the orthogonal iteration procedure given in Table 4.8. The relative iterative error for each eigenvalue should be less than $10^{-6}$. Your answer should include the value of k that the iteration stopped at, along with the resulting $Q_k$ and the resulting eigenvalues.

4.29. Do Exercise 4.28, but use the symmetric matrix
$$A = \begin{pmatrix} -7 & -6 \\ -6 & 2 \end{pmatrix}.$$

4.30. Suppose that a symmetric 6 × 6 matrix A has eigenvalues: −4, 1, 4, 5, 8, 12. If orthogonal iteration is used with an initial guess $B_0$, which is a random 6 × m matrix, what is the largest value of m for which the method will most likely work?

4.31. Find values of the constants so the matrix is orthogonal. ⎛ ⎞ ⎛ ⎞ ⎛ ⎞ 1 a b c a −b c a 2 0 3b d ⎠ (b) ⎝ 0 2b d ⎠ (c) ⎝ a (a) ⎝ 0 √0 c ⎠ a −b e a −2b e b 12 3 0

4.32. Find a QR factorization for the given matrix.
$$\text{(a)} \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix} \qquad \text{(b)} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \qquad \text{(c)} \begin{pmatrix} -1 & 1 & 1 \\ 1 & -1 & 1 \\ 1 & 1 & -1 \end{pmatrix}$$

4.33. Let
$$A = \begin{pmatrix} 1 & 1 & 1 \\ a & 0 & 0 \\ 0 & a & 0 \end{pmatrix},$$


where a is a floating-point number that satisfies $10^{-100} \le a \le 10^{-8}$. All calculations in this exercise are to be done by hand using floating-point arithmetic. This means that you should round each calculation as is done for double precision. (a) Apply regular Gram-Schmidt to the column vectors of A to produce the matrix Q. (b) Show that the column vectors of Q are not necessarily orthogonal. (c) Redo parts (a) and (b) but use modified Gram-Schmidt. (d) Calculate $Q^T Q$ using your answers from (a) and (c). Which comes closest to being the identity matrix?

4.34. In this problem the QR method for computing eigenvalues is considered. Select one of the following matrices and then answer the questions that follow.
$$A_1 = \begin{pmatrix} -1 & 1 \\ 1 & 2 \end{pmatrix} \qquad A_2 = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix} \qquad A_3 = \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix}$$
(a) Find $C_1$. (b) Find $C_2$. (c) What matrix do the $C_k$'s converge to? (d) Compute the eigenvalues of A using the QR method as given in (4.28). The relative iterative error for each eigenvalue should be less than $10^{-6}$. Your answer should include the value of k that the iteration stopped at, along with the resulting value for $C_{k+1}$.

4.35. The following questions are about the symmetric 6 × 6 matrix
$$A = \begin{pmatrix}
8 & -1 & 0 & 0 & 0 & 0 \\
-1 & 2 & 0 & 0 & 0 & 0 \\
0 & 0 & 4 & 1 & 0 & 0 \\
0 & 0 & 1 & 10 & 1 & 1 \\
0 & 0 & 0 & 1 & 12 & -3 \\
0 & 0 & 0 & 1 & -3 & 20
\end{pmatrix}$$
(a) Show that A is positive definite. (b) Is λ = 6 an eigenvalue for A? (c) Suppose someone computes the eigenvalues and claims they are: 20, 15, 10, 8, 5, 2. Why are they wrong? (d) Does the dominant eigenvalue satisfy $\lambda_1 \ge 26$?

4.36. Suppose that A is a symmetric random n × n matrix, with entries between zero and one. Explain why any eigenvalue of A must satisfy −n + 1 ≤ λ ≤ n.


Section 4.5

4.37. The equation for a string on an elastic foundation is
$$\frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2} - du,$$
where c and d are positive constants. The problem for finding the natural frequencies of the string can be reduced to solving Ax = λx, where
$$A = \begin{pmatrix}
a & -1 & & & \\
-1 & a & -1 & & \\
 & -1 & a & -1 & \\
 & & \ddots & \ddots & \ddots \\
 & & & -1 & a
\end{pmatrix}.$$
Note that A is an n × n symmetric tridiagonal matrix, and
$$a = 2 + \frac{d}{c^2 (n + 1)^2}.$$
(a) Explain why A is positive definite. (b) Use inverse orthogonal iteration as outlined in Table 4.14 to compute the five smallest eigenvalues of A. To do this take n = 250 and c = d = 1. The stopping condition should be that the relative iterative error for each eigenvalue is less than $10^{-6}$. You should also report how many iteration steps were required. (c) The larger n, the more accurately the eigenvalues of A approximate the natural frequencies of the string. How large must n be so that the five smallest eigenvalues of A agree to at least four significant digits with the five lowest natural frequencies? Make sure to explain the numerical procedure you used to arrive at your conclusion.

Figure 4.11 Three oscillators that are coupled by springs, as an example of the problem considered in Exercise 4.38.


4.38. This exercise involves finding the natural frequencies for the coupled oscillators in Figure 4.11. The equation of motion in this case is $y'' + Ky = 0$, where $y = (y_1(t), y_2(t), y_3(t))^T$ and
$$K = \begin{pmatrix} 1 + k_{12} + k_{13} & -k_{12} & -k_{13} \\ -k_{12} & 1 + k_{12} + k_{23} & -k_{23} \\ -k_{13} & -k_{23} & 1 + k_{13} + k_{23} \end{pmatrix}.$$
In the above matrix, $k_{ij}$ is the spring constant for the spring connecting the ith and jth mass, and it is positive (these are the three smaller springs shown in Figure 4.11). Assuming $y = xe^{I\omega t}$, where $I = \sqrt{-1}$, then the problem reduces to solving Kx = λx, where $\lambda = \omega^2$. (a) Show that K is positive definite. (b) Using orthogonal iteration, or the QR method, compute the eigenvalues of K in the case of when $k_{ij} = 1/10$. The stopping condition should be that the relative iterative error for each eigenvalue is less than $10^{-8}$. You should also report how many iteration steps were required. (c) What is the eigenvalue for K when the oscillators are uncoupled? Note that this means that the $k_{ij}$'s are zero. (d) Redo part (b) in the case of when $k_{ij} = 1/4{,}000$. (e) By hand, determine the eigenvalues of K (hint: λ = 1 is an eigenvalue). From this determine the number of correct digits for your answers in part (d). If this is not what is expected, based on the stopping condition, explain the reason(s) why.

4.39. The routes for a small airline are shown in Figure 4.12. In terms of a network, the five cities are the vertices, and the air routes are the connections, in a similar manner to those shown in Figure 4.4. (a) Number the cities from 1 to 5, and from this write down the corresponding adjacency matrix. What is the value of the trace of the matrix?

Figure 4.12 Six air routes, and five cities, used to construct the adjacency matrix in Exercise 4.39.


Figure 4.13 Diagrammatic representations of benzene. The representation on the right is the one used to construct the adjacency matrix in Exercise 4.40.

(b) Using orthogonal iteration, or the QR method, compute the eigenvalues of the matrix you found in part (a). The stopping condition should be that the iterative error for each eigenvalue is less than $10^{-8}$. From the trace theorem, $\sum_{i=1}^{5} \lambda_i = 0$ and $\sum_{i=1}^{5} \lambda_i^2$ is a positive integer. How close are your values to satisfying these two conditions? (c) Number the cities in a different order than you used in part (a) and then redo part (b). How does your answer differ from what you computed in part (b)?

4.40. Adjacency matrices arise in quantum chemistry in the form of what are called Hückel Hamiltonian matrices. To illustrate, the atomic configuration for benzene is shown in Figure 4.13. The atoms are located at the vertices, and they are connected by nearest neighbor bonds (which form the edges of the polygon). The Hückel Hamiltonian matrix for benzene is nothing more than the adjacency matrix for this graph. In this context, the eigenvalues are associated with the energy states of the system, and the eigenvectors provide information about the corresponding orbital structure for that particular energy. (a) Using the representation on the right in Figure 4.13, number the atoms (vertices) from 1 to 6 and write down the corresponding adjacency matrix for benzene. The double lines in this graph should be treated the same as the single line connections. What is the value of the trace of the matrix? (b) Using orthogonal iteration, or the QR method, compute the eigenvalues of the matrix you found in part (a). The stopping condition should be that the relative iterative error for each eigenvalue is less than $10^{-6}$. Using the computed values, and (4.29), compute the trace. If the result is different than your answer in part (a), explain why. (c) Answer the questions in parts (a) and (b) for naphthalene, which is shown in Figure 4.14. Note that there are 12 vertices in this case.

Figure 4.14 Diagrammatic representations of naphthalene. The representation on the right is the one used to construct the adjacency matrix in Exercise 4.40.

Figure 4.15 House with 9 rooms interconnected by doorways used in Exercise 4.41.

4.41. This problem considers a house with 9 rooms and interconnecting doorways as shown in Figure 4.15. Every hour those in each room will either stay in the room or move into one of the adjacent rooms. Specifically, if the room has m individuals and n doorways, then each hour m/(n + 1) will stay, and m/(n + 1) will move through each doorway into the corresponding adjacent room. (a) Express this using a directed graph with 9 vertices. (b) Based on the given formula, $a_{11} = \frac{1}{3}$, $a_{12} = \frac{1}{4}$, and $a_{25} = \frac{1}{5}$. Determine the other entries, and then write down the matrix A. Explain why it is a probability matrix. (c) Explain why A is irreducible. (d) Suppose there are 1200 people in the house (it's a big house), and at the start they are all in Room 2. Taking tol = $10^{-8}$, use the power method to determine the eventual distribution of people in the 9 rooms. You should round the values to the closest integer. Also, your answer should include the value of k that the iteration stopped at. Is there any connection between your answer and the number of doorways to each room? (e) How many hours does it take to reach the eventual distribution? (f) Suppose three of the doorways for Room 5 are closed, and the one remaining open connects with Room 6. What is the resulting matrix A? Redo parts (d) and (e) and comment on the differences in your answers.

4.42. The directed graph for the weekly movement of cars between 6 rental locations is shown in Figure 4.16. At each location the exit arrows are equally likely. So, at location 1, half stay and half move to location 2, while for location 5 all move to location 1. (a) Write down the probability matrix A for this problem. (b) The matrix A is irreducible. What arrows, if any, can be removed and the matrix is still irreducible?

Figure 4.16 Directed graph for the car rental company problem in Exercise 4.42.


(c) Suppose there are 1000 cars, and at the start they are all at location 5. Taking tol = $10^{-8}$, use the power method to determine the eventual distribution at the 6 locations. You should round the values to the closest integer. Also, your answer should include the value of k that the iteration stopped at. (d) How many weeks does it take to reach the eventual distribution?

4.43. An often used age-structured model of a population divides the number of females into age groups $g_1, g_2, \cdots, g_n$. Here $g_1$ is the number in the youngest group, $g_2$ is the number in the second youngest group, etc. It is assumed that, in any given year, $s_i g_i$ of those in $g_i$ survive from the previous year and move to age group $g_{i+1}$. Also, each year the females in $g_i$ produce $b_i g_i$ female babies. Setting $\mathbf{g} = (g_1, g_2, \cdots, g_n)^T$, with $\mathbf{g}_k$ being the value at time step $t_k$ and $\mathbf{g}_{k+1}$ being the value at time step $t_{k+1}$, then $\mathbf{g}_{k+1} = A\mathbf{g}_k$, where
$$A = \begin{pmatrix}
b_1 & b_2 & b_3 & \cdots & b_n \\
s_1 & 0 & 0 & \cdots & 0 \\
0 & s_2 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & s_{n-2} & 0 & 0 \\
0 & \cdots & 0 & s_{n-1} & 0
\end{pmatrix}.$$
This is known as a Leslie matrix. It is assumed that the $s_i$'s are positive, $b_1$ and $b_n$ are non-negative, and the other $b_i$'s are positive. In this case, the dominant eigenvalue $\lambda_1$ of A is positive. If $\lambda_1 > 1$, then the population increases, and it decreases if $\lambda_1 < 1$. Also, note that this matrix is not symmetric, but it has n linearly independent eigenvectors. (a) Deer can survive up to 20 years in the wild. Taking n = 20, and assuming their survivability decreases with age, let $s_i = \exp(-i/100)$. Also, assuming 3/4 of the females in each age group have one offspring each year, with equal probability of being male or female, then $b_i = (3/4)(1/2) = 3/8$. The exception is the youngest group, and for this assume that $b_1 = 0$. Using the power method, determine if the population increases or decreases. (b) If $\lambda_1 = 1$, then the population approaches a constant value as time increases.
For the deer in part (a), assuming b1 = 0 and b2 = b3 = · · · = bn = b, what does b have to be so this happens? Your value should be correct to six significant digits. Your answer must include a description of how you computed b, and your method must include using the power method.


4.44. Select one of the following matrices and then answer the questions that follow.
$$
\mathbf{A}_1 = \begin{pmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix}, \qquad
\mathbf{A}_2 = \begin{pmatrix} 3 & 0 & 3 & 0 \\ 0 & 6 & 1 & 5 \\ 3 & 1 & 5 & 0 \\ 0 & 5 & 0 & 5 \end{pmatrix}
$$
(a) Construct the directed graph for the matrix, and from this show that it is irreducible.
(b) Use your result from part (a), and Gershgorin intervals, to show that the matrix is positive definite.

4.45. This problem considers some of the properties of irreducible matrices. The matrices are square but are not necessarily symmetric or a probability matrix.
(a) Given the directed graph for A, what is the directed graph for $\mathbf{A}^T$? Use this to explain why $\mathbf{A}^T$ is irreducible if, and only if, A is irreducible.
(b) Show that a matrix that has no zero entries is always irreducible.
(c) Explain why an upper triangular matrix is always reducible.
(d) Explain why a matrix with a zero row, or a zero column, is reducible.

Section 4.6

4.46. Select one of the following matrices, and then answer the questions below.
$$
\mathbf{A}_1 = \begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix}, \quad
\mathbf{A}_2 = \begin{pmatrix} 4 & 4 \\ 4 & -2 \end{pmatrix}, \quad
\mathbf{A}_3 = \begin{pmatrix} 0 \\ -2 \end{pmatrix}, \quad
\mathbf{A}_4 = \begin{pmatrix} 0 & 0 \\ 3 & 0 \\ 0 & 1 \end{pmatrix}, \quad
\mathbf{A}_5 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad
\mathbf{A}_6 = \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}
$$
(a) Find a SVD for A.
(b) Find a SVD for $-2\mathbf{A}$.
(c) Find a SVD for $\mathbf{A}^T$.
(d) Find an outer product expansion for A.
(e) Find a rank 1 approximation for A. What is the relative error of this approximation?
(f) Find $\|\mathbf{A}\|_2$.

Table 4.18 Digital images for Exercise 4.50.

A)  2  1      B)  0  1      C)  1  0
    0  1          1  2          1  2

4.47. (a) Show that σ = 1 is the only singular value for an orthogonal matrix. Also, what is the geometric multiplicity of this singular value?
(b) What is Σ in the SVD of an orthogonal matrix?

4.48. Assume that the columns of A are independent.
(a) Explain how to use the power method to compute the first singular value of A.
(b) Explain how to use orthogonal iteration to compute the singular values of A. Why is the assumption that A has independent columns useful here?

4.49. Show that the singular value for a vector x is $\|\mathbf{x}\|_2$.

4.50. The pixel values for three images are given in Table 4.18 (they each contain 4 pixels). Select one of them and then answer the questions that follow.
(a) Find an SVD of the corresponding matrix.
(b) Find the outer product expansion of the matrix.
(c) Suppose the approximation $\mathbf{A} \approx \mathbf{A}_1$ is used. Find $\mathbf{A}_1$, and determine the relative error using this matrix to approximate A.

4.51. Assume A is n × m, with m ≤ n and rank(A) = m. Also, let $\mathbf{A}^+ \equiv (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$. Write $\mathbf{A}^+$ in terms of the three matrices coming from the SVD of A. Use this to find an outer product expansion of $\mathbf{A}^+$. The matrix $\mathbf{A}^+$ is known as the Moore-Penrose pseudo-inverse of A, and it is used in Section 8.3.4 to solve least squares problems.

4.52. Assume that A is n × m with m ≤ n. Show that for j = 1, 2, · · · , m:
(a) $\mathbf{A}\mathbf{v}_j = \sigma_j \mathbf{u}_j$
(b) $\mathbf{A}^T\mathbf{u}_j = \sigma_j \mathbf{v}_j$

4.53. This problem derives the formula $\|\mathbf{A}\|_2 = \sigma_1$.
(a) Since the $\mathbf{v}_j$'s form a basis for $\mathbb{R}^m$, then $\mathbf{x} = \sum a_j \mathbf{v}_j$. Use this to show that $\|\mathbf{A}\mathbf{x}\|_2^2 = \sum \sigma_j^2 a_j^2$.
(b) In (3.16), $\|\mathbf{x}\|_2 = 1$, which means that $\sum a_j^2 = 1$. Use this, and part (a), to show that $\|\mathbf{A}\mathbf{x}\|_2^2 \le \sigma_1^2$.
(c) What does $\|\mathbf{A}\mathbf{x}\|_2^2$ equal when $\mathbf{x} = \mathbf{v}_1$? It should be noted here that $\|\mathbf{v}_1\|_2 = 1$.


(d) Explain why (b) and (c) mean that $\|\mathbf{A}\|_2 = \sigma_1$.

4.54. In this problem $\mathbf{Q}_1$ and $\mathbf{Q}_2$ are orthogonal matrices.
(a) By comparing the characteristic equations for $\mathbf{B}^T\mathbf{B}$ and $\mathbf{A}^T\mathbf{A}$, show that $\mathbf{B} = \mathbf{Q}_1\mathbf{A}\mathbf{Q}_2$ and A have the same singular values. Hint: the determinant of a product is the product of the determinants.
(b) Explain why the result in part (a) means that $\mathbf{Q}_1\mathbf{A}$ and $\mathbf{A}\mathbf{Q}_2$ have the same singular values as A.
(c) Use the result from part (a), or (b), to show that if you interchange two rows, or two columns, of a matrix then you do not change its singular values.
(d) Suppose the matrix for a 3 × 2 grayscale image is
$$
\mathbf{A} = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix}.
$$
If the image is rotated clockwise by 90°, what is the resulting matrix for the image? Show that the resulting matrix has the same singular values as the matrix for the original image.

4.55. Suppose that A is 10 × 10, and it has two distinct singular values: σ = 8, which has a geometric multiplicity of 4, and σ = 4, which has a geometric multiplicity of 6.
(a) What is Σ?
(b) Plot E(k) for 1 ≤ k ≤ 9.
(c) Explain why there is only one well-defined low-rank approximation for A. What is its rank, and what is the relative error of the approximation?

Exercises 4.56-4.58 require a grayscale image. It should be clearly focused, have two or more distinct objects, and contain at least $10^6$ pixels. Assuming the file is named pic.jpg, then to read it into MATLAB and obtain the n × m matrix A, you can use the commands: x = imread('pic.jpg'); A = double(x);. If the image is in color, use the commands: x = imread('pic.jpg'); x = rgb2gray(x); A = double(x);. Also, to display an image determined by a matrix B, you can use the commands: colormap(gray); imagesc(B); axis image; axis off.

4.56. In addition to displaying the original image, display the image associated with the matrix B. Explain how the new image is obtained using one or more reflections and/or rotations of the original.
(a) $\mathbf{B} = \mathbf{A}^T$
(b) $\mathbf{B} = 255\,\mathbf{1} - \mathbf{A}$, where $\mathbf{1}$ is the matrix containing all ones.
(c) B = CA where C is the n × n exchange matrix (it is also called the reverse identity matrix).


(d) B = AC where C is the m × m exchange matrix.

4.57. Compute the SVD for the image matrix, and then do the following:
(a) Display the original image.
(b) For this image, what is the value of r in (4.53)?
(c) Plot E(k), as defined in (4.56), as a function of k.
(d) Display $\mathbf{A}_k$ where k is such that $E(k) \approx \frac{1}{5}$.
(e) Redo (d) when E(k) is approximately: $\frac{1}{10}$, $\frac{1}{20}$, and $\frac{1}{40}$.

4.58. This problem explains how the SVD can be used for digital watermarking. You need a second grayscale image, and it should be a single object (e.g., a hammer, tennis racket, airplane, etc). It also must be the same size as your first image.
(a) Display A, and compute its SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$.
(b) Letting B be the matrix for the second image, set $\mathbf{C} = \boldsymbol{\Sigma} + a\mathbf{B}$, where a = 1/50. Compute the SVD $\mathbf{C} = \mathbf{U}_2\boldsymbol{\Sigma}_2\mathbf{V}_2^T$.
(c) Letting $\tilde{\mathbf{A}} = \mathbf{U}\boldsymbol{\Sigma}_2\mathbf{V}^T$, display $\tilde{\mathbf{A}}$. This is the watermarked image.
(d) Now suppose you think $\tilde{\mathbf{A}}$ is one of your watermarked images. You know $\mathbf{U}_2$, $\boldsymbol{\Sigma}$, $\mathbf{V}_2$, and the value of a from the watermarking process. To extract the watermark, first compute the SVD $\tilde{\mathbf{A}} = \mathbf{U}_3\boldsymbol{\Sigma}_3\mathbf{V}_3^T$. Setting $\tilde{\mathbf{B}} = \mathbf{U}_2\boldsymbol{\Sigma}_3\mathbf{V}_2^T$, display $\frac{1}{a}\big(\tilde{\mathbf{B}} - \boldsymbol{\Sigma}\big)$. This should be the watermark, assuming, of course, that $\tilde{\mathbf{A}}$ is a watermarked image.
(e) How well does this watermarking method work when a = 1/3?

Additional Questions

4.59. This is a continuation of Exercise 3.32 and concerns what is, or is not, a fragile property of the problem. Said another way, it is asking if a particular property holds simply because of the way you wrote the problem down on a piece of paper.
(a) Is the property that "the coefficient matrix has only real-valued eigenvalues" a fragile property?
(b) Is the property that the coefficient matrix has a dominant eigenvalue a fragile property?
(c) Are the singular values of the coefficient matrix a fragile property? Suggestion: Exercise 4.54 is useful here.
(d) Is the property that the coefficient matrix is a probability matrix a fragile property?


4.60. This problem explores what can happen when the power method is used on a defective matrix. When using floating-point arithmetic, instead of (4.7), the computer is working with a matrix more like
$$
\begin{pmatrix} 2+\delta & 1+\delta \\ \delta & 2+\delta \end{pmatrix}.
$$
The assumption is that δ is positive and very small. Also, it is assumed that the same δ is used in each entry to make the calculations a bit simpler.
(a) What are the eigenvalues for the above matrix? Also explain why there are two independent eigenvectors (i.e., the matrix is not defective).
(b) Given that δ is very small, show that $R \approx 1 - \frac{2}{3}\sqrt{\delta}$.
(c) Taking $\delta \approx 10^{-16}$, suppose that when using the power method the error when k = 10 is $10^{-1}$. For what value of k, approximately, is the error $10^{-2}$? What conclusion can you draw from this about whether the power method works on this matrix?

4.61. An important factorization in mechanics is the polar decomposition. For an n × n matrix A with a positive determinant, the polar decomposition has the form A = QP, where Q is an orthogonal matrix and P is a symmetric positive definite matrix.
(a) Assuming the SVD of A has been computed, explain how this can be used to compute Q and P. Make sure to explain why the formulas you derive for Q and P guarantee that they have their required properties. Hint: $\mathbf{V}^T\mathbf{V} = \mathbf{I}$.
(b) For an orthogonal matrix, either det(Q) = 1 or det(Q) = −1. The latter corresponds to a reflection, and these are considered to be unphysical. For this reason, in mechanics one is interested in having det(Q) = 1, which corresponds to what is known as a proper orthogonal matrix, and physically these correspond to rotations. Explain how to modify, if necessary, your algorithm or formulas in part (a) so that Q is a proper orthogonal matrix.

4.62. Suppose that A is an n × 3 matrix. Also, let $\mathbf{x}_1^T, \mathbf{x}_2^T, \cdots, \mathbf{x}_n^T$ be the vectors forming the n rows of A. It is assumed that the $\mathbf{x}_i$'s are all on a line through the origin. This means that there is a direction vector v, with $\|\mathbf{v}\|_2 = 1$, and numbers $t_i$ so that $\mathbf{x}_i = t_i\mathbf{v}$. It is assumed that at least one of the $t_i$'s is nonzero. The ideas underlying this problem are developed in more detail when deriving the PCA in Chapter 10.
(a) For



$$
\mathbf{A} = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 1 & -2 \\ 2 & -2 & 4 \\ -3 & 3 & -6 \end{pmatrix},
$$
determine $\mathbf{x}_i$ for i = 1, 2, 3, 4. Also, show that each can be written as $\mathbf{x}_i = t_i\mathbf{v}$ where $\|\mathbf{v}\|_2 = 1$.
In the following questions, A is not necessarily the matrix in part (a).
(b) Explain why rank(A) = 1.
(c) Find a vector w so that $\mathbf{A} = \mathbf{w}\mathbf{v}^T$. With this, show that it is possible to write $\mathbf{A} = \sigma\mathbf{u}\mathbf{v}^T$, where $\|\mathbf{u}\|_2 = 1$ and σ > 0.
(d) What are the three singular values of A? What are $\sigma_1$ and $\mathbf{v}_1$?
(e) Show that $-\sigma_1 \le t_i \le \sigma_1$, ∀i. Also, explain why it is possible to have a $t_i$ so that either $\sigma_1 = t_i$, or $\sigma_1 = -t_i$. This shows that $\sigma_1$ determines the smallest interval in the $\mathbf{v}_1$ direction that contains the data points.

4.63. This exercise explores the connections between the QR method and the matrices in the factorization $\mathbf{Q}\mathbf{D}\mathbf{Q}^T$ given in Theorem 4.6. It's assumed that A is symmetric.
(a) In the QR method, show that $\mathbf{C}_1 = \mathbf{Q}_1^T\mathbf{A}\mathbf{Q}_1$, $\mathbf{C}_2 = \mathbf{Q}_2^T\mathbf{Q}_1^T\mathbf{A}\mathbf{Q}_1\mathbf{Q}_2$, and in general, $\mathbf{C}_k = \mathbf{P}_k^T\mathbf{A}\mathbf{P}_k$, where $\mathbf{P}_k = \mathbf{Q}_1\mathbf{Q}_2\cdots\mathbf{Q}_k$. Note that $\mathbf{P}_k$ is an orthogonal matrix (you do not need to show this).
(b) Assuming that $\mathbf{C}_k$ converges to a diagonal matrix, explain why the QR method is a procedure for computing the matrix D in Theorem 4.6.
(c) Explain how two lines of code can be added to the QR algorithm given in Section 4.4.4 so that, when finished, you have a matrix containing the eigenvectors.

4.64. This problem considers the following n × n symmetric tridiagonal matrix
$$
\mathbf{A} = \begin{pmatrix}
0 & c_1 & & & \\
c_1 & 0 & c_2 & & \\
 & c_2 & 0 & c_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & c_{n-2} & 0 & c_{n-1} \\
 & & & c_{n-1} & 0
\end{pmatrix},
$$
where $c_i = i/\sqrt{(2i-1)(2i+1)}$. The eigenvalues, and eigenvectors, of this matrix play a role in Gaussian quadrature (see Section 6.5.3).
(a) Taking n = 2, use orthogonal iteration to compute the eigenvalues of A. The relative iteration error for the nonzero eigenvalues should be less than $10^{-12}$. Also, calculate the eigenvalues by hand to make sure your computed answers are correct. Comment on any modifications of the orthogonal iteration procedure you make to compute the correct result.
(b) Let $\mathbf{q}_1 = (q_{11}, q_{12})^T$ and $\mathbf{q}_2 = (q_{21}, q_{22})^T$ be the orthonormal eigenvectors calculated in part (a). As explained in Section 6.5.2, when n = 2 one should have $2q_{11}^2 = 1$ and $2q_{21}^2 = 1$. How close does your result come to satisfying these conditions?


(c) Taking n = 10, use orthogonal iteration to compute the eigenvalues of A. The relative iteration error for the nonzero eigenvalues should be less than $10^{-12}$. Explain why you are confident that the values are correct.
(d) Assume that the orthonormal eigenvectors are $\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n$, where $\mathbf{q}_j$ is the eigenvector for eigenvalue $\lambda_j$. If $\mathbf{q}_j = (q_{j1}, q_{j2}, \cdots, q_{jn})^T$, calculate the quantity $w_j = 2q_{j1}^2$ for each eigenvalue. Although this looks a little odd, the $w_j$'s are used to determine the weights in the Gaussian quadrature formula.

4.65. This problem considers the following n × n tridiagonal matrix:
$$
\mathbf{A} = \begin{pmatrix}
a & c & & & \\
b & a & c & & \\
 & b & a & c & \\
 & & \ddots & \ddots & \ddots \\
 & & b & a & c \\
 & & & b & a
\end{pmatrix}.
$$
It is assumed that bc > 0 and n ≥ 3.
(a) Assuming c = b, the eigenvalues of A are $\lambda_i = a + 2|b|\cos(i\theta)$, for i = 1, 2, . . . , n, where θ = π/(n + 1). Show that $\mathbf{x}_i = (\sin(\phi), \sin(2\phi), \sin(3\phi), \cdots, \sin(n\phi))^T$ is an eigenvector for $\lambda_i$, where φ = iθ.
(b) The matrix in part (a) is symmetric. Explain why it is positive definite if a ≥ 2|b|.
(c) For the eigenvectors in part (a), show that $\mathbf{x}_i \cdot \mathbf{x}_j = 0$ if i ≠ j, and $\mathbf{x}_i \cdot \mathbf{x}_i = (n + 1)/2$.
(d) The eigenvalues of A are $\lambda_i = a + 2\sqrt{bc}\,\cos(i\theta)$, for i = 1, 2, . . . , n, where θ = π/(n + 1). Show that $\mathbf{x}_i = (\kappa\sin(\phi), \kappa^2\sin(2\phi), \kappa^3\sin(3\phi), \cdots, \kappa^n\sin(n\phi))^T$ is an eigenvector for $\lambda_i$, where φ = iθ and $\kappa = \sqrt{b/c}$.

Chapter 5

Interpolation

5.1 Information from Data

The topic of this chapter is interpolation, which involves passing a curve through a set of data points. To put this in context, extracting information from data is one of the central objectives in science and engineering, and exactly what or how this is done depends on the particular setting. Two examples are shown in Figure 5.1. Figure 5.1(Left) contains data obtained from measurements of redshift type supernovae. As is often the case with computerized testing systems, there are many data points and there is some scatter in the values. Because of this one would not be interested in finding a function that passes through all of these points but rather a function that behaves in a qualitatively similar manner to the data. In this case one uses a fitting method, like least squares, to make the connection more quantitative. Exactly how this might be done will be considered in Chapter 8. In comparison, the data in Figure 5.1(Right) have a well-defined shape and for this reason are more amenable to interpolation. This is also true for the

Figure 5.1 Left: data related to a supernovae redshift [Astier et al., 2006]. Right: data for nanopores in a supercapacitor [Kondrat et al., 2012].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0 5


Figure 5.2 Left: geometric representation of a hand using interpolation [Burkardt, 2015]. Right: values of (5.1) at selected points in interval −1 ≤ x ≤ 1.

data shown in Figure 5.2. The hand data is typical of what arises in CAD applications, while the data on the right relates to a more mathematical application. For the latter, the data points are obtained by evaluating the function
$$
f(x) = \frac{1}{3} + \sum_{n=1}^{\infty} \frac{4(-1)^n}{n^{4/3}\pi^2}\cos(n\pi x) \tag{5.1}
$$
at 10 points from the interval −1 ≤ x ≤ 1. What is seen is that even though the formula for f(x) is relatively complicated, the graph is rather simple. This raises the question of whether we might be able to replace the function with a much simpler expression that would serve as an accurate approximation of the original.

Figure 5.3 Data sets used in this chapter to test the various interpolation methods.

One of the better ways to test how well an interpolation method works is to try it out on different data sets. In what follows we will use the four sets shown in Figure 5.3. Each consists of 12 data points. It is strongly recommended that you spend a moment or two and draw in what you think would be an acceptable interpolation function for each data set. This will help later when judging how well the standard interpolation methods work.

This chapter starts out by developing the formulas for interpolating data points. This is basically the mathematical version of what you did in preschool (i.e., connect the dots). Of course, in scientific computing there are always complications related to evaluating the formulas, and this is included in the discussion. In the second part of the chapter the question of how to use these newfound formulas to accurately approximate a function is considered. This material plays a central role in the next chapter when developing ways to numerically integrate a function.

5.2 Global Polynomial Interpolation

If one wants to find a function to connect two data points, the easiest choice is to use a straight line, in other words a linear function. Similarly, given three data points one would likely use a quadratic. To generalize this idea, suppose there are n + 1 points and they are $(x_1, y_1), (x_2, y_2), \cdots, (x_{n+1}, y_{n+1})$, where $x_1 < x_2 < \cdots < x_{n+1}$. We are going to find a single nth degree polynomial $p_n(x)$ that passes through each and every point. This is not particularly difficult and there are several ways to find the interpolation polynomial. However, as is often the case in numerical computing, some methods are much more sensitive to round-off error than other methods.

In what follows the $(x_i, y_i)$'s are referred to as the interpolation points, or as the data points. The $x_i$'s are called the nodes. Occasionally it is assumed that the nodes are equally spaced, with spacing h. In this case, $x_i = x_1 + (i - 1)h$, for i = 1, 2, · · · , n + 1, where $h = (x_{n+1} - x_1)/n$. As an example, the 12 nodes in Figure 5.3 are equally spaced with $x_1 = 0$, $x_{12} = 1$, and h = 1/11.

5.2.1 Direct Approach

Taking the direct approach, the interpolation polynomial is written as
$$
p_n(x) = a_0 + a_1 x + \cdots + a_n x^n. \tag{5.2}
$$
The interpolation requirements are:
$$
\begin{aligned}
p_n(x_1) = y_1: &\quad a_0 + a_1 x_1 + \cdots + a_n x_1^n = y_1, \\
p_n(x_2) = y_2: &\quad a_0 + a_1 x_2 + \cdots + a_n x_2^n = y_2, \\
&\quad\vdots \\
p_n(x_n) = y_n: &\quad a_0 + a_1 x_n + \cdots + a_n x_n^n = y_n, \\
p_n(x_{n+1}) = y_{n+1}: &\quad a_0 + a_1 x_{n+1} + \cdots + a_n x_{n+1}^n = y_{n+1}.
\end{aligned}
$$
This can be rewritten in matrix form as $\mathbf{V}\mathbf{a} = \mathbf{y}$, where $\mathbf{a} = (a_0, a_1, \cdots, a_n)^T$, $\mathbf{y} = (y_1, y_2, \cdots, y_{n+1})^T$, and
$$
\mathbf{V} = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n+1} & x_{n+1}^2 & \cdots & x_{n+1}^n
\end{pmatrix}. \tag{5.3}
$$
This is a Vandermonde matrix. Just as when we tried to solve (3.12), for large values of n this matrix is likely ill-conditioned.

Example 5.1
Each of the test data sets in Figure 5.3 contains 12 points. Fitting $p_{11}(x)$ to each set produces the curves shown in Figure 5.4. The top two look to be reasonable fits to the data. This is not the case for the lower two plots because of the over- and undershoots. These are not due to V being ill-conditioned


Figure 5.4 Using the interpolation polynomial p11 (x), as given in (5.2), on each of the data sets in Figure 5.3.


because for all four data sets $\kappa_\infty(\mathbf{V}) \approx 10^5$. As we will see, they are common when you use higher degree polynomials for interpolation.

If you are interested in what happens when V is ill-conditioned, the green curves in Figure 8.5 are typical. Basically, the interpolation polynomial does not interpolate all of the data. Fortunately, it is possible to determine $p_n(x)$ without having to use V, and this is explained below.

As shown in Section 1.4, there are more efficient ways to evaluate a polynomial than just evaluating it directly. An example is Horner's method, which is given in (1.12). Another is the modified Lagrange formula (5.8), which is derived below.
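To make the direct approach concrete, the following is a minimal Python sketch, not taken from the text, that builds V as in (5.3), solves Va = y, and evaluates the result by nested multiplication. The three data points, the helper names (`vandermonde`, `solve`, `nested_eval`), and the use of Gaussian elimination with partial pivoting as the solver are all assumptions made for illustration.

```python
def vandermonde(xs):
    """Rows (1, x_i, x_i^2, ..., x_i^n) of the matrix V in (5.3)."""
    n = len(xs) - 1
    return [[x**k for k in range(n + 1)] for x in xs]


def solve(M, rhs):
    """Solve M a = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot candidate.
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    a = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * a[j] for j in range(i + 1, n))
        a[i] = (aug[i][n] - s) / aug[i][i]
    return a


def nested_eval(a, x):
    """Evaluate a_0 + a_1 x + ... + a_n x^n by nested multiplication."""
    p = 0.0
    for coef in reversed(a):
        p = p * x + coef
    return p


# Hypothetical data sampled from 1 - x + 3x^2 (not from the text).
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 11.0]
coeffs = solve(vandermonde(xs), ys)
print(coeffs)
print([nested_eval(coeffs, x) for x in xs])
```

For a handful of well-separated nodes this works without difficulty; the concern raised above is what happens to the conditioning of V as n grows.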

5.2.2 Lagrange Approach

The easiest way to explain how to avoid using the Vandermonde matrix is to examine what happens with linear and quadratic functions. After doing this, it is relatively easy to write down the general interpolation formula.

Linear Interpolation: Suppose the data points are $(x_1, y_1)$ and $(x_2, y_2)$, where $x_1 \ne x_2$. The polynomial in this case is linear and, using the point-slope formula, it is
$$
\begin{aligned}
p_1(x) &= y_1 + \frac{y_2 - y_1}{x_2 - x_1}(x - x_1) \\
&= y_1\frac{x - x_2}{x_1 - x_2} + y_2\frac{x - x_1}{x_2 - x_1} \\
&= y_1\ell_1(x) + y_2\ell_2(x),
\end{aligned}
$$
where
$$
\begin{aligned}
\ell_1(x) &= \frac{x - x_2}{x_1 - x_2} = \text{linear function with } \ell_1(x_2) = 0 \text{ and } \ell_1(x_1) = 1, \\
\ell_2(x) &= \frac{x - x_1}{x_2 - x_1} = \text{linear function with } \ell_2(x_1) = 0 \text{ and } \ell_2(x_2) = 1.
\end{aligned}
$$
The functions $\ell_1(x)$ and $\ell_2(x)$ are examples of linear Lagrange interpolation functions. For each, $\ell_i(x)$ is linear, equal to one when $x = x_i$ and equal to zero at the other node.

Quadratic Interpolation: It is relatively easy to generalize this idea and write down the quadratic Lagrange interpolation functions. Namely, if the data points are $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$ then
$$
\begin{aligned}
\ell_1(x) &= \text{quadratic function with } \ell_1(x_2) = 0,\ \ell_1(x_3) = 0, \text{ and } \ell_1(x_1) = 1 \\
&= \frac{(x - x_2)(x - x_3)}{(x_1 - x_2)(x_1 - x_3)}, \\
\ell_2(x) &= \text{quadratic function with } \ell_2(x_1) = 0,\ \ell_2(x_3) = 0, \text{ and } \ell_2(x_2) = 1 \\
&= \frac{(x - x_1)(x - x_3)}{(x_2 - x_1)(x_2 - x_3)}, \\
\ell_3(x) &= \frac{(x - x_1)(x - x_2)}{(x_3 - x_1)(x_3 - x_2)}.
\end{aligned}
$$
By design, each $\ell_i(x)$ is quadratic, it equals one when $x = x_i$ and it equals zero at the other nodes. Using these functions the corresponding quadratic interpolation polynomial is
$$
p_2(x) = y_1\ell_1(x) + y_2\ell_2(x) + y_3\ell_3(x). \tag{5.4}
$$

For this to be well-defined it is required that the $x_i$'s are distinct, which means that $x_1 \ne x_2$, $x_1 \ne x_3$, and $x_2 \ne x_3$. Also, although it looks different, (5.4) produces the same function as given in (5.2) in the case when n = 2.

General Formula: Generalizing the above results, given data points $(x_1, y_1), (x_2, y_2), \cdots, (x_{n+1}, y_{n+1})$, with the $x_i$'s distinct, the Lagrange form of the interpolation polynomial is
$$
p_n(x) = \sum_{i=1}^{n+1} y_i\,\ell_i(x), \tag{5.5}
$$
where the Lagrange interpolation functions are defined as
$$
\begin{aligned}
\ell_i(x) &= \frac{(x - x_1)(x - x_2)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_{n+1})}{(x_i - x_1)(x_i - x_2)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n+1})} && (5.6) \\
&= \prod_{\substack{j=1 \\ j \ne i}}^{n+1} \frac{x - x_j}{x_i - x_j}. && (5.7)
\end{aligned}
$$
So, each $\ell_i(x)$ is an nth degree polynomial, it is equal to one when $x = x_i$, and it equals zero when $x = x_j$ for $j \ne i$.

Example 5.2
To find the global polynomial that interpolates the data in Table 5.1, first note that $(x_1, y_1) = (0, 1)$, $(x_2, y_2) = (1/2, -1)$, $(x_3, y_3) = (1, 2)$, and $(x_4, y_4) = (3, -5)$. Consequently, n = 3, and the interpolation polynomial is

Table 5.1 Data for Example 5.2.

  x |  0    1/2    1     3
  y |  1    −1     2    −5

$$
\begin{aligned}
p_3(x) &= y_1\ell_1(x) + y_2\ell_2(x) + y_3\ell_3(x) + y_4\ell_4(x) \\
&= \ell_1(x) - \ell_2(x) + 2\ell_3(x) - 5\ell_4(x),
\end{aligned}
$$
where
$$
\ell_1(x) = \frac{(x - x_2)(x - x_3)(x - x_4)}{(x_1 - x_2)(x_1 - x_3)(x_1 - x_4)} = -\frac{1}{3}(2x - 1)(x - 1)(x - 3),
$$
$\ell_2(x) = 8x(x - 1)(x - 3)/5$, $\ell_3(x) = -x(2x - 1)(x - 3)/2$, and $\ell_4(x) = x(2x - 1)(x - 1)/30$.

The data sets in Figure 5.3 are not considered here as Lagrange interpolation produces curves that are visually indistinguishable from those in Figure 5.4. As stated earlier, the unwanted over- and undershoots seen in the lower two plots of that figure are due to the interpolation function and not to some ill-effects related to V.

Evaluating $p_n(x)$

There are two parts to interpolation: decide what formula you will use to determine $p_n(x)$, and then decide how you will evaluate that formula. The price paid for avoiding the Vandermonde matrix is the effort needed to evaluate the $\ell_i$'s, and this is often stated to be a drawback of the method. However, it is possible to rewrite the formula to reduce the computational overhead. This comes by noting that the numerator in the $\ell_i$'s involves almost the same computation. To take advantage of this, let
$$
\ell(x) = (x - x_1)(x - x_2)\cdots(x - x_{n+1}) = \prod_{j=1}^{n+1}(x - x_j).
$$

With this, $\ell_i(x) = w_i\,\ell(x)/(x - x_i)$, where
$$
w_i = \frac{1}{(x_i - x_1)(x_i - x_2)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n+1})}.
$$
Then
$$
p_n(x) = \ell(x)\sum_{i=1}^{n+1}\frac{w_i y_i}{x - x_i}. \tag{5.8}
$$
This is known as the modified Lagrange formula. There is a potential divide by zero using this formula, but this is easy to avoid by just not attempting to evaluate $p_n(x)$ at a node (the value of $p_n$ is already known at those points anyway).

Example 5.2 (cont'd)
For the data in Table 5.1, $w_1 = -2/3$, $w_2 = 8/5$, $w_3 = -1$ and $w_4 = 1/15$. Since $w_4 y_4 = -1/3$, the modified Lagrange formula (5.8) takes the form
$$
p_3(x) = -\ell(x)\left[\frac{2}{3x} + \frac{8}{5(x - 1/2)} + \frac{2}{x - 1} + \frac{1}{3(x - 3)}\right],
$$
where $\ell(x) = x(x - 1/2)(x - 1)(x - 3)$.



To compare the modified Lagrange (5.8) with the Lagrange (5.5) interpolation formulas, if there are n + 1 data points and m evaluation points, then using (5.5) requires about $3mn^2$ flops, while (5.8) only requires about $6mn + 2n^2$ flops. To translate this into computing time, if you use 20 data points and 2,000 evaluation points, the computing time using (5.5) is about 4 msec, while using (5.8) it is about 0.1 msec. If you use 200 data points and 20,000 evaluation points, the computing times are about 3 sec and 10 msec, respectively. So, the modified Lagrange formula is a better choice.

A possible complication with evaluating $\ell(x)$ or $w_i$ is underflow or overflow. This is similar to the situation considered in Exercise 1.15. A way to help avoid this is to scale the data, and this is explained in Exercise 5.38.
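As an illustration of the two evaluation strategies, here is a short Python sketch, offered under stated assumptions rather than as the text's implementation, that evaluates the interpolation polynomial for the Table 5.1 data using both the Lagrange form (5.5)-(5.7) and the modified Lagrange formula (5.8). The function names are hypothetical, and the node check in `modified_lagrange_eval` is one simple way to sidestep the divide by zero mentioned above.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate p_n(x) using the Lagrange form (5.5) with (5.7)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += yi * li
    return total


def lagrange_weights(xs):
    """w_i = 1 / prod_{j != i} (x_i - x_j); computed once per data set."""
    ws = []
    for i, xi in enumerate(xs):
        denom = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                denom *= xi - xj
        ws.append(1.0 / denom)
    return ws


def modified_lagrange_eval(xs, ys, ws, x):
    """Evaluate p_n(x) using (5.8); at a node, return the known value y_i."""
    ell = 1.0
    s = 0.0
    for xi, yi, wi in zip(xs, ys, ws):
        if x == xi:
            return yi
        ell *= x - xi
        s += wi * yi / (x - xi)
    return ell * s


# Data from Table 5.1.
xs = [0.0, 0.5, 1.0, 3.0]
ys = [1.0, -1.0, 2.0, -5.0]
ws = lagrange_weights(xs)
print(ws)  # compare with w_1, ..., w_4 in Example 5.2 (cont'd)
for x in [0.25, 2.0, 2.5]:
    print(x, lagrange_eval(xs, ys, x), modified_lagrange_eval(xs, ys, ws, x))
```

The weights depend only on the nodes, so they are computed once; each evaluation of (5.8) then costs a single pass over the data, which is the source of the flop savings quoted above.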

5.2.3 Runge's Function

It is time to explain one of the principal complications with using a global interpolation polynomial. For this we can use the top two plots in Figure 5.4. It is seen that with the 12 interpolation points we have obtained a fairly accurate approximation of the original functions. Always trying to improve things, one might think that by adding data points the approximation will be even better. For many functions this does indeed happen but there are functions where it does not (and you would think it should). The example many use to demonstrate this is
$$
f(x) = \frac{1}{1 + 25x^2}, \tag{5.9}
$$

Figure 5.5 Using the interpolating polynomial $p_n(x)$, with equally spaced nodes, for the Runge function (5.9). Panels show n = 4, n = 16, and n = 64, each plotted against the x-axis over −1 ≤ x ≤ 1.

and this is known as Runge's function. This is plotted in Figure 5.5, along with $p_4(x)$, $p_{16}(x)$, and $p_{64}(x)$. The interpolation polynomials are constructed using equally spaced nodes. It is seen that in the center of the interval the approximation improves but it gets worse towards the endpoints. Increasing the number of data points makes the situation worse in the sense that the magnitude of $|p_n(x)|$ near the endpoints increases. For example, when n = 40 the maximum of $|p_n(x)|$ is about $10^5$, while for n = 100 the maximum is about $10^{15}$. Moreover, the x-intervals where the extreme over- and undershoots occur do not decrease with n. It should also be pointed out that this behavior is not limited to equally spaced nodes. If you take the nodes randomly from the interval, the maximum in $|p_n(x)|$ is often even larger.

The conclusion from the above discussion is that using a global interpolation polynomial works well for small data sets but has potential complications as the number of data points increases. One solution for larger data sets is to break them into small groups, use interpolation on the subgroups, and then connect the information into a coherent whole. This is considered in the next section. If you are set on using a global polynomial then you need to consider where the $x_i$'s are placed in the interval, and this is considered in Section 5.5.4.
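The growth of the error near the endpoints can be checked numerically. The sketch below is an illustration only: it interpolates Runge's function (5.9) at n + 1 equally spaced nodes on −1 ≤ x ≤ 1 using the Lagrange form (5.5), and estimates the maximum of $|p_n(x) - f(x)|$ on a fine grid. The grid size and the function names are assumptions, and the values of n are kept modest so that the plain Lagrange form remains adequate in double precision.

```python
def runge(x):
    """Runge's function (5.9)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_eval(xs, ys, x):
    """Evaluate the interpolation polynomial using the Lagrange form (5.5)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        li = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += yi * li
    return total


def max_error(n, samples=2001):
    """Largest |p_n(x) - f(x)| over a uniform grid on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / n for i in range(n + 1)]  # n + 1 equally spaced nodes
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(lagrange_eval(xs, ys, x) - runge(x)) for x in grid)


for n in [4, 8, 16]:
    print(n, max_error(n))
```

Restricting the grid to the middle of the interval, say |x| ≤ 0.25, is a quick way to see the other half of the statement above, namely that the approximation improves in the center as n increases.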

5.3 Piecewise Linear Interpolation

We will consider using linear interpolation between adjacent data points. This is effectively what is done in a child's connect the dots puzzle. An example is shown in Figure 5.6 where the line between $(x_1, y_1)$ and $(x_{16}, y_{16})$ is predrawn in the puzzle.

Figure 5.6 A typical child's puzzle of connecting the dots.

To be able to write down the mathematical formula used for piecewise linear interpolation, assume that the data points are $(x_1, y_1), (x_2, y_2), \cdots, (x_{n+1}, y_{n+1})$, where $x_1 < x_2 < \cdots < x_{n+1}$. The linear function connecting adjacent data points is given as
$$
g_i(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i}(x - x_i), \quad \text{for } x_i \le x \le x_{i+1}. \tag{5.10}
$$
Assembling these into a complete description of the interpolation function we get
$$
g(x) = \begin{cases}
g_1(x) & \text{if } x_1 \le x \le x_2, \\
g_2(x) & \text{if } x_2 \le x \le x_3, \\
\quad\vdots & \quad\vdots \\
g_n(x) & \text{if } x_n \le x \le x_{n+1}.
\end{cases} \tag{5.11}
$$
An illustration of this is given in Figure 5.7.

Figure 5.7 Intervals and functions used for piecewise linear interpolation.

It is possible to write g(x) in a form that can be easier to use and looks a lot simpler than the expression in (5.11). To do this we introduce the piecewise linear function $G_i(x)$, with $G_i(x_i) = 1$ and $G_i(x_j) = 0$ if $j \ne i$. The formula for this function is
$$
G_i(x) = \begin{cases}
0 & \text{if } x \le x_{i-1}, \\[4pt]
\dfrac{x - x_{i-1}}{x_i - x_{i-1}} & \text{if } x_{i-1} \le x \le x_i, \\[8pt]
\dfrac{x - x_{i+1}}{x_i - x_{i+1}} & \text{if } x_i \le x \le x_{i+1}, \\[8pt]
0 & \text{if } x_{i+1} \le x,
\end{cases} \tag{5.12}
$$
and a sketch of this is given in Figure 5.8. In the case when the nodes are equally spaced, so $x_{i+1} - x_i = h$, this can be written in a more compact form as
$$
G_i(x) = G\!\left(\frac{x - x_i}{h}\right), \tag{5.13}
$$
where
$$
G(x) = \begin{cases}
1 - |x| & \text{if } |x| \le 1, \\
0 & \text{if } 1 \le |x|.
\end{cases} \tag{5.14}
$$
Note that the defining properties of $G_i(x)$ are very similar to the properties that were used to define the Lagrange interpolation function $\ell_i(x)$ in Section 5.2.2. Also, because of its shape, $G_i(x)$ has a variety of names, and they include a hat function, a chapeau function, and a tent function. With this, the piecewise linear interpolation function (5.11) can be written as
$$
g(x) = \sum_{i=1}^{n+1} y_i G_i(x). \tag{5.15}
$$

Just so it's clear, this expression produces the same interpolation function as the expanded version given in (5.11) for $x_1 \le x \le x_{n+1}$. However, unlike (5.11), (5.15) is defined for −∞ < x < ∞ with the result that g ∈ C(−∞, ∞). As a final comment, to define $G_1(x)$ it is necessary to have a point $x_0$, with $x_0 < x_1$, and for $G_{n+1}(x)$ we need a point $x_{n+2}$, with $x_{n+1} < x_{n+2}$. Exactly where these two nodes are located is not important because they have no effect on the interpolation function over the interval $x_1 \le x \le x_{n+1}$.

Example 5.3
Using a piecewise linear function to fit the data in Figure 5.3 produces the curves shown in Figure 5.9. These curves are not bad fits but they are also

Figure 5.8 The piecewise linear function Gi (x) defined in (5.12).

Figure 5.9 Using a piecewise linear function to fit the data in Figure 5.3.

not great. They are not bad because they do not contain the over and undershoots seen in Figure 5.9. However, they are not great because they are jagged. Note that in some applications, such as the one in Figure 5.6, jagged might be what is desired but in many applications this is something one wants to avoid.  Example 5.4 Find the piecewise linear function that interpolates the points: (0, 1), (1/2, −1), (1, 2). Note that in this case, x1 = 0, x2 = 1/2, and x3 = 1. Method 1: Using (5.11),  g(x) =

g1 (x) if 0 ≤ x ≤ 1/2, g2 (x) if 1/2 ≤ x ≤ 1,

where y2 − y1 (x − x1 ) x2 − x1 = 1 − 4x,

g1 (x) = y1 +

and g2 (x) = y2 +

y3 − y2 (x − x2 ) = −4 + 6x. x3 − x2


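Method 2: Using (5.15),
\[ g(x) = y_1 G_1(x) + y_2 G_2(x) + y_3 G_3(x) = G_1(x) - G_2(x) + 2G_3(x), \]
where
\[ G_1(x) = \begin{cases} 1 + 2x & \text{if } -1/2 \le x \le 0, \\ 1 - 2x & \text{if } 0 \le x \le 1/2, \\ 0 & \text{otherwise,} \end{cases} \qquad G_2(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1/2, \\ 2 - 2x & \text{if } 1/2 \le x \le 1, \\ 0 & \text{otherwise,} \end{cases} \]
and
\[ G_3(x) = \begin{cases} -1 + 2x & \text{if } 1/2 \le x \le 1, \\ 3 - 2x & \text{if } 1 \le x \le 3/2, \\ 0 & \text{otherwise.} \end{cases} \]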
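The two methods of Example 5.4 can be compared numerically. The following is a minimal Python sketch, assuming the hat-function form (5.15); the function names `hat` and `piecewise_linear` are illustrative and not from the text, and the two outside nodes are placed one unit beyond the data interval (an arbitrary choice, since their location does not affect the result on the data interval).

```python
def hat(x, left, center, right):
    """Hat function G_i: 1 at center, 0 at left and right, linear in between."""
    if left <= x <= center:
        return (x - left) / (center - left)
    if center < x <= right:
        return (right - x) / (right - center)
    return 0.0


def piecewise_linear(x, xs, ys):
    """Evaluate g(x) = sum_i y_i G_i(x), as in (5.15)."""
    # Extra nodes outside the data interval (their exact location is arbitrary).
    ext = [xs[0] - 1.0] + list(xs) + [xs[-1] + 1.0]
    return sum(y * hat(x, ext[i], ext[i + 1], ext[i + 2])
               for i, y in enumerate(ys))


xs = [0.0, 0.5, 1.0]
ys = [1.0, -1.0, 2.0]
g_quarter = piecewise_linear(0.25, xs, ys)         # Method 1 gives 1 - 4x
g_three_quarters = piecewise_linear(0.75, xs, ys)  # Method 1 gives -4 + 6x
```

Evaluating at a point in each subinterval should reproduce the values of $1 - 4x$ and $-4 + 6x$ obtained with Method 1.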

5.4 Piecewise Cubic Interpolation

The principal criticism of piecewise linear interpolation is that the approximation function has corners. One method that is often used to avoid this is to replace the linear functions with cubics. The consequence of this is that the interpolation function has the form (see Figure 5.10)
\[ s(x) = \begin{cases} s_1(x) & \text{if } x_1 \le x \le x_2, \\ s_2(x) & \text{if } x_2 \le x \le x_3, \\ \;\;\vdots & \;\;\vdots \\ s_n(x) & \text{if } x_n \le x \le x_{n+1}, \end{cases} \tag{5.16} \]
where
\[ s_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3, \quad \text{for } x_i \le x \le x_{i+1}. \tag{5.17} \]

Figure 5.10 Intervals and functions used for piecewise cubic interpolation.


Each $s_i(x)$ has 4 coefficients, so $s(x)$ has $4n$ coefficients. These are determined by requiring:
• $s(x)$ interpolates the data. This means that
\[ s_i(x_i) = y_i, \text{ and } s_i(x_{i+1}) = y_{i+1}, \quad \text{for } i = 1, 2, \cdots, n. \tag{5.18} \]
• $s'(x)$ is continuous. This means that the slopes at the interior data points must match, and so
\[ s_i'(x_{i+1}) = s_{i+1}'(x_{i+1}), \quad \text{for } i = 1, 2, \cdots, n - 1. \tag{5.19} \]
• $s''(x)$ is continuous. This means that the second derivatives also match at the interior data points, and so
\[ s_i''(x_{i+1}) = s_{i+1}''(x_{i+1}), \quad \text{for } i = 1, 2, \cdots, n - 1. \tag{5.20} \]

Conditions (5.18)-(5.20) are the defining requirements for $s(x)$ to be an interpolating cubic spline. However, these conditions produce $4n - 2$ equations, while $s(x)$ has $4n$ coefficients. In other words, we are short two conditions. What is usually done is to specify conditions at the left and right ends of the data interval. Some of the commonly made choices are as follows:
• Natural Spline: $s_1''(x_1) = 0$ and $s_n''(x_{n+1}) = 0$. This produces a spline with an interesting property related to curvature, and this will be explained later.
• Clamped Spline: $s_1'(x_1) = y_1'$ and $s_n'(x_{n+1}) = y_{n+1}'$. This requires knowing $y_1'$ and $y_{n+1}'$, something that is not always available. It is, generally, the most accurate cubic spline possible.
• Not-a-Knot Spline (for $n > 2$): $s_1'''(x_2) = s_2'''(x_2)$ and $s_{n-1}'''(x_n) = s_n'''(x_n)$. This means that $s'''(x)$ is continuous at $x_2$ and at $x_n$. This produces a spline that comes close to having the same accuracy as the clamped spline. It is the default choice in MATLAB.

There are also periodic end conditions, and these are explained in Exercise 5.25. Whichever choice is made, the resulting function s(x) provides a smooth interpolation of the given data points. Example 5.5 To find the natural cubic spline that interpolates the data in Table 5.2, we use (5.16) and write

\[ s(x) = \begin{cases} s_1(x) & \text{if } 0 \le x \le 1/2, \\ s_2(x) & \text{if } 1/2 \le x \le 1, \end{cases} \]
where
\[ s_1(x) = a_1 + b_1 x + c_1 x^2 + d_1 x^3, \]
and
\[ s_2(x) = a_2 + b_2(x - 1/2) + c_2(x - 1/2)^2 + d_2(x - 1/2)^3. \]

Table 5.2 Data for Example 5.5.

    x    0    1/2    1
    y    1    −1     2

From the interpolation conditions (5.18),
\[ \begin{aligned} s_1(0) = 1 &: \quad a_1 = 1, \\ s_1(1/2) = -1 &: \quad a_1 + \tfrac{1}{2} b_1 + \tfrac{1}{4} c_1 + \tfrac{1}{8} d_1 = -1, \\ s_2(1/2) = -1 &: \quad a_2 = -1, \\ s_2(1) = 2 &: \quad a_2 + \tfrac{1}{2} b_2 + \tfrac{1}{4} c_2 + \tfrac{1}{8} d_2 = 2. \end{aligned} \]
From the continuity of $s'(x)$ and $s''(x)$, conditions (5.19) and (5.20),
\[ \begin{aligned} s_1'(1/2) = s_2'(1/2) &: \quad b_1 + c_1 + \tfrac{3}{4} d_1 = b_2, \\ s_1''(1/2) = s_2''(1/2) &: \quad 2c_1 + 3d_1 = 2c_2. \end{aligned} \]
Finally, to qualify to be a natural cubic spline it is required that
\[ \begin{aligned} s_1''(0) = 0 &: \quad c_1 = 0, \\ s_2''(1) = 0 &: \quad 2c_2 + 3d_2 = 0. \end{aligned} \]
It is now a matter of solving the above equations, and after doing this one finds that
\[ s_1(x) = 1 - \frac{13}{2} x + 10x^3, \]
and
\[ s_2(x) = -1 + (x - 1/2) + 15(x - 1/2)^2 - 10(x - 1/2)^3. \]
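The solution of the eight equations in Example 5.5 can be checked by machine. Below is a minimal Python sketch that uses exact rational arithmetic and a small Gauss-Jordan elimination; the helper `solve` and the ordering of the unknowns $(a_1, b_1, c_1, d_1, a_2, b_2, c_2, d_2)$ are choices made here for illustration, not something prescribed by the text.

```python
from fractions import Fraction as F


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with row swaps (exact arithmetic)."""
    n = len(b)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]


# Unknowns: a1, b1, c1, d1, a2, b2, c2, d2
A = [
    [1, 0, 0, 0, 0, 0, 0, 0],                              # s1(0) = 1
    [1, F(1, 2), F(1, 4), F(1, 8), 0, 0, 0, 0],            # s1(1/2) = -1
    [0, 0, 0, 0, 1, 0, 0, 0],                              # s2(1/2) = -1
    [0, 0, 0, 0, 1, F(1, 2), F(1, 4), F(1, 8)],            # s2(1) = 2
    [0, 1, 1, F(3, 4), 0, -1, 0, 0],                       # s1'(1/2) = s2'(1/2)
    [0, 0, 2, 3, 0, 0, -2, 0],                             # s1''(1/2) = s2''(1/2)
    [0, 0, 1, 0, 0, 0, 0, 0],                              # s1''(0) = 0
    [0, 0, 0, 0, 0, 0, 2, 3],                              # s2''(1) = 0
]
b = [1, -1, -1, 2, 0, 0, 0, 0]
coeffs = solve(A, b)
```

The entries of `coeffs` should match the coefficients of $s_1(x)$ and $s_2(x)$ given above.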



In general, finding $s(x)$ requires solving $4n$ equations with $4n$ unknowns. It is possible to just solve the resulting matrix equation, but there are better ways to find the coefficients. One possibility is to reduce the equations down to a system with approximately $n$ unknowns. Another approach is to use what are called cubic B-splines. This also reduces the problem down to having to solve for approximately $n$ unknowns. The advantage of B-splines is that they are easier to code. They are also very useful for least squares data fitting (see Exercise 8.38), as well as when solving differential equations numerically [Holmes, 2007].

5.4.1 Cubic B-splines

The idea is to write the interpolation function in the form
\[ s(x) = \sum a_i B_i(x). \tag{5.21} \]

The functions $B_i(x)$ are called cubic B-splines, and a sketch of a typical B-spline is given in Figure 5.11. The above expression has a passing similarity to the expressions used for Lagrange interpolation and piecewise linear interpolation. However, one important difference is that the coefficient $a_i$ in the above sum is not necessarily equal to the data value $y_i$.

The derivation of (5.21) consists of two steps. The first is the construction of the cubic B-spline $B_i(x)$. This only has to be done once, which means that these functions do not need to be rederived for different data sets. The second step is to derive the equation that is needed to determine the $a_i$'s from the given data. A summary of the resulting formulas is given in the Summary below.

In what follows it is assumed that the $x_i$'s are equally spaced over an interval $a \le x \le b$. So, the nodes for the interpolation points are $x_i = a + (i - 1)h$, for $i = 1, 2, \cdots, n + 1$, where $h = (b - a)/n$. However, because of the way the $B_i(x)$'s are constructed, it is convenient to introduce nodes that are located outside the interpolation interval. Specifically, let $x_0 = a - h$, $x_{-1} = a - 2h$, $x_{n+2} = b + h$, and $x_{n+3} = b + 2h$. Something similar was done when defining $G_i(x)$ for piecewise linear interpolation. These new nodes are only

Figure 5.11 Sketch of the various components of the cubic B-spline $B_i(x)$.

used to make the formula for $B_i(x)$ more universal, and make no contribution to the interpolation formula for $a \le x \le b$.

Finding the $B_i$'s

Just like $s(x)$, each $B_i(x)$ is a piecewise cubic function that has a continuous second derivative. It is not required, or expected, to interpolate the data as in (5.18). The formula for $B_i(x)$ can be written in expanded form as
\[ B_i(x) = \begin{cases} 0 & \text{if } x \le x_{i-2}, \\ q_{i-2}(x) & \text{if } x_{i-2} \le x \le x_{i-1}, \\ q_{i-1}(x) & \text{if } x_{i-1} \le x \le x_i, \\ q_{i+1}(x) & \text{if } x_i \le x \le x_{i+1}, \\ q_{i+2}(x) & \text{if } x_{i+1} \le x \le x_{i+2}, \\ 0 & \text{if } x_{i+2} \le x, \end{cases} \tag{5.22} \]
where $q_j(x) = \bar{A}_j + \bar{B}_j(x - x_j) + \bar{C}_j(x - x_j)^2 + \bar{D}_j(x - x_j)^3$. Because the $x_i$'s are equally spaced, the function $B_i(x)$ is symmetric about $x = x_i$. Note that one consequence of the symmetry is that $B_i'(x_i) = 0$. The coefficients of the $q_j$'s in (5.22) are determined from the requirement that $B_i(x)$, $B_i'(x)$, and $B_i''(x)$ are continuous at each node. This is accomplished by requiring the following:

$x = x_{i-2}$: $q_{i-2}(x_{i-2}) = 0$, $q_{i-2}'(x_{i-2}) = 0$, and $q_{i-2}''(x_{i-2}) = 0$

$x = x_{i-1}$: $q_{i-2}(x_{i-1}) = q_{i-1}(x_{i-1})$, $q_{i-2}'(x_{i-1}) = q_{i-1}'(x_{i-1})$, and $q_{i-2}''(x_{i-1}) = q_{i-1}''(x_{i-1})$

$x = x_i$: $q_{i-1}(x_i) = q_{i+1}(x_i)$, $q_{i-1}'(x_i) = q_{i+1}'(x_i) = 0$, and $q_{i-1}''(x_i) = q_{i+1}''(x_i)$

with similar conditions at $x = x_{i+1}$ and $x = x_{i+2}$. From the conditions at $x = x_{i-2}$ one easily concludes that $q_{i-2}(x) = \bar{D}_{i-2}(x - x_{i-2})^3$. Likewise, $q_{i+2}(x) = \bar{D}_{i+2}(x - x_{i+2})^3$, where from the symmetry, $\bar{D}_{i+2} = -\bar{D}_{i-2}$. From the continuity requirements at $x_{i-1}$, and the requirement that $B_i'(x_i) = 0$, one finds that
\[ q_{i-1}(x) = \bar{D}_{i-2}\left[ h^3 + 3h^2(x - x_{i-1}) + 3h(x - x_{i-1})^2 - 3(x - x_{i-1})^3 \right]. \]
A similar equation can be derived for $q_{i+1}(x)$. This leaves one undetermined constant and the convention is to take $B_i(x_i) = 2/3$, which means that $\bar{D}_{i-2} = 1/(6h^3)$. With this, $B_i(x)$ is completely defined and the functions in (5.22) are

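\[ \begin{aligned} q_{i-2}(x) &= \frac{1}{6h^3}(x - x_{i-2})^3, \\ q_{i-1}(x) &= \frac{1}{6} + \frac{1}{2h}(x - x_{i-1}) + \frac{1}{2h^2}(x - x_{i-1})^2 - \frac{1}{2h^3}(x - x_{i-1})^3, \\ q_{i+1}(x) &= \frac{1}{6} - \frac{1}{2h}(x - x_{i+1}) + \frac{1}{2h^2}(x - x_{i+1})^2 + \frac{1}{2h^3}(x - x_{i+1})^3, \\ q_{i+2}(x) &= -\frac{1}{6h^3}(x - x_{i+2})^3. \end{aligned} \]
By factoring the above polynomials it is possible to show that
\[ B_i(x) = B\!\left( \frac{x - x_i}{h} \right), \tag{5.23} \]
where
\[ B(x) = \begin{cases} \frac{2}{3} - x^2\left(1 - \frac{1}{2}|x|\right) & \text{if } |x| \le 1, \\ \frac{1}{6}(2 - |x|)^3 & \text{if } 1 \le |x| \le 2, \\ 0 & \text{if } 2 \le |x|. \end{cases} \tag{5.24} \]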
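Formulas (5.23) and (5.24) translate almost directly into code. The following is a minimal Python sketch; the function names `B` and `B_i` are chosen here to mirror the notation and are otherwise arbitrary.

```python
def B(x):
    """Reference cubic B-spline (5.24), centered at 0 with nodes -2, -1, 0, 1, 2."""
    t = abs(x)
    if t <= 1:
        return 2/3 - t**2 * (1 - t/2)
    if t <= 2:
        return (2 - t)**3 / 6
    return 0.0


def B_i(x, x_i, h):
    """B_i(x) = B((x - x_i)/h), as in (5.23)."""
    return B((x - x_i) / h)


center_value = B_i(1.0, 1.0, 0.5)      # value at x_i
neighbor_value = B_i(1.5, 1.0, 0.5)    # value at x_{i+1}
```

With the convention $B_i(x_i) = 2/3$, the value at the center node is 2/3, the value at each neighboring node is 1/6, and the function vanishes two or more nodes away.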

Plots of $B_{i-1}(x)$, $B_i(x)$, and $B_{i+1}(x)$ are shown in Figure 5.12. For $B_i(x)$, the curve makes such a smooth transition at $x_{i-1}$ and $x_{i+1}$ that you would not know that the cubics change. The same is true at $x_{i-2}$ and $x_{i+2}$, where the function makes a smooth transition to zero.

Finding the $a_i$'s

The cubic spline interpolation function can now be written as
\[ s(x) = \sum_{i=0}^{n+2} a_i B_i(x). \tag{5.25} \]

Just so it is clear, this function satisfies the smoothness conditions in (5.19) and (5.20), but does not yet satisfy the interpolation conditions in (5.18)

Figure 5.12 Plot of the cubic B-splines $B_{i-1}(x)$, $B_i(x)$, and $B_{i+1}(x)$.

Table 5.3 Values of the B-spline $B_i(x)$, as defined in (5.22), at the nodes used in its construction.

             x_{i−1}    x_i       x_{i+1}    x_j for j ≠ i, i ± 1
    B_i      1/6        2/3       1/6        0
    B_i'     1/(2h)     0         −1/(2h)    0
    B_i''    1/h^2      −2/h^2    1/h^2      0

or the specific end conditions (clamped, natural, etc.). Also, the sum is over all possible $B_i$'s that are nonzero on the interval $x_1 \le x \le x_{n+1}$. This has required us to include the $i = 0$ and $i = n + 2$ terms even though there is no $x_{-1}$, $x_0$, $x_{n+2}$, or $x_{n+3}$ in the original data set. This is a minor complication, and it will be dealt with shortly. First note that at $x_i$ only $B_{i-1}$, $B_i$, and $B_{i+1}$ are nonzero (see Figure 5.12). In particular, using the values given in Table 5.3,
\[ s(x_i) = a_{i-1} B_{i-1}(x_i) + a_i B_i(x_i) + a_{i+1} B_{i+1}(x_i) = \frac{1}{6}(a_{i-1} + 4a_i + a_{i+1}). \]
Because of the interpolation requirement (5.18) we have that
\[ a_{i-1} + 4a_i + a_{i+1} = 6y_i, \quad \text{for } i = 1, 2, \cdots, n + 1. \tag{5.26} \]

To use this we need to know $a_0$ and $a_{n+2}$, and this is where the two additional conditions are used.
• Natural Spline: Since $s''(x_i) = a_{i-1} B_{i-1}''(x_i) + a_i B_i''(x_i) + a_{i+1} B_{i+1}''(x_i) = (a_{i-1} - 2a_i + a_{i+1})/h^2$, then $s''(x_1) = 0$ gives $a_0 = 2a_1 - a_2$, and at the other end one finds that $a_{n+2} = 2a_{n+1} - a_n$. In (5.26), when $i = 1$ one finds that $a_1 = y_1$ and at the other end one gets that $a_{n+1} = y_{n+1}$. The remaining $a_i$'s are found by solving (5.26).
• Clamped Spline: Since $s'(x_i) = a_{i-1} B_{i-1}'(x_i) + a_i B_i'(x_i) + a_{i+1} B_{i+1}'(x_i) = (-a_{i-1} + a_{i+1})/(2h)$, then $s'(x_1) = y_1'$ gives $a_0 = a_2 - 2h y_1'$, and at the other end one finds that $a_{n+2} = a_n + 2h y_{n+1}'$.

• Not-a-Knot Spline: This gives 2a1 − 3a2 + 2a3 = 0 and 2an−1 − 3an + 2an+1 = 0. In (5.26), when i = 2 one finds that a2 = 12y2 /11 and at the other end one gets that an = 12yn /11. The remaining ai ’s are found by solving (5.26).

Summary

The cubic spline interpolation function is
\[ s(x) = \sum_{i=0}^{n+2} a_i B_i(x), \tag{5.27} \]

where $B_i(x)$ is given in (5.23), and the nodes are $x_i = a + (i - 1)h$ where $h = (b - a)/n$. The sum includes every cubic B-spline that is nonzero for $a \le x \le b$. The result is that $s(x)$ is a piecewise cubic function that interpolates the original data points. Moreover, by the way it is constructed, $s \in C^2(-\infty, \infty)$. The $a_i$'s are determined by solving a matrix equation $Aa = f$, where $A$ is tridiagonal.

The resulting cubic spline algorithm is fairly simple. Given the data, it consists of using the Thomas algorithm to solve the matrix equation (see Section 3.8) for the $a_i$'s, then using (5.23) when evaluating the sum (5.27). To illustrate, for a natural spline, $a_1 = y_1$, $a_{n+1} = y_{n+1}$, and the matrix equation has
\[ A = \begin{pmatrix} 4 & 1 & & & & \\ 1 & 4 & 1 & & 0 & \\ & 1 & 4 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & 0 & & 1 & 4 & 1 \\ & & & & 1 & 4 \end{pmatrix}, \quad a = \begin{pmatrix} a_2 \\ a_3 \\ a_4 \\ \vdots \\ a_{n-1} \\ a_n \end{pmatrix}, \quad \text{and} \quad f = \begin{pmatrix} 6y_2 - y_1 \\ 6y_3 \\ 6y_4 \\ \vdots \\ 6y_{n-1} \\ 6y_n - y_{n+1} \end{pmatrix}. \tag{5.28} \]
Once $a$ is determined, then $a_0 = 2y_1 - a_2$ and $a_{n+2} = 2y_{n+1} - a_n$. The formulas for a clamped spline are derived in Exercise 5.41.

Example 5.6 Using a natural cubic spline to fit the data in Figure 5.3 produces the solid (blue) curves shown in Figure 5.13. For comparison, the curves obtained using a not-a-knot spline are also shown. The interpolation of the top two data sets is as good as what was obtained using the global polynomial, and there are only minor differences between the two spline functions. Moreover, both splines give a better representation than a global polynomial for the lower two data sets. For the lower right data set some small over and undershoots are present in the spline functions, but they are not as pronounced as those in Figure 5.4. It is also evident that there are differences between the two spline functions, although these occur primarily in the regions close to the endpoints.

Figure 5.13 Using a natural cubic spline function, the solid (blue) curves, and a not-a-knot spline, the dashed (black) curves, to fit the data in Figure 5.3.

Example 5.7 To find the natural cubic spline that interpolates the data in Table 5.2, we use (5.27) and write
\[ s(x) = a_0 B_0(x) + a_1 B_1(x) + a_2 B_2(x) + a_3 B_3(x) + a_4 B_4(x), \]
where $x_0 = -1/2$, $x_1 = 0$, $x_2 = 1/2$, $x_3 = 1$, and $x_4 = 3/2$. Since it's a natural spline, $a_1 = y_1 = 1$, $a_3 = y_3 = 2$, and (5.28) reduces to $a_1 + 4a_2 + a_3 = -6$ so that $a_2 = -\frac{9}{4}$. Also, $a_0 = 2y_1 - a_2 = \frac{17}{4}$ and $a_4 = 2y_3 - a_2 = \frac{25}{4}$. Therefore,
\[ s(x) = \frac{17}{4} B_0(x) + B_1(x) - \frac{9}{4} B_2(x) + 2B_3(x) + \frac{25}{4} B_4(x). \]
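The natural-spline procedure summarized above can be sketched in a few lines of Python. This is a minimal illustration under the stated assumptions (equally spaced nodes, at least three data points), not a production implementation; the function names `bspline`, `thomas`, `natural_spline_coeffs`, and `spline_eval` are chosen here for illustration.

```python
def bspline(t):
    """Reference cubic B-spline B(t) from (5.24)."""
    t = abs(t)
    if t <= 1:
        return 2/3 - t**2 * (1 - t/2)
    if t <= 2:
        return (2 - t)**3 / 6
    return 0.0


def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal system given its sub-, main, and super-diagonals."""
    n = len(rhs)
    d, r = list(diag), list(rhs)
    for k in range(1, n):                       # forward elimination
        m = sub[k] / d[k - 1]
        d[k] -= m * sup[k - 1]
        r[k] -= m * r[k - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for k in range(n - 2, -1, -1):              # back substitution
        x[k] = (r[k] - sup[k] * x[k + 1]) / d[k]
    return x


def natural_spline_coeffs(ys):
    """Return [a_0, ..., a_{n+2}] for equally spaced data y_1, ..., y_{n+1} (n >= 2)."""
    n = len(ys) - 1
    rhs = [6.0 * y for y in ys[1:-1]]           # rows i = 2, ..., n of (5.26)
    rhs[0] -= ys[0]                             # uses a_1 = y_1
    rhs[-1] -= ys[-1]                           # uses a_{n+1} = y_{n+1}
    m = n - 1
    inner = thomas([1.0] * m, [4.0] * m, [1.0] * m, rhs)
    a = [ys[0]] + inner + [ys[-1]]              # a_1, ..., a_{n+1}
    a_first = 2 * ys[0] - a[1]                  # a_0 = 2 y_1 - a_2
    a_last = 2 * ys[-1] - a[-2]                 # a_{n+2} = 2 y_{n+1} - a_n
    return [a_first] + a + [a_last]


def spline_eval(x, a_left, h, coeffs):
    """Evaluate s(x) = sum_i a_i B((x - x_i)/h) with x_i = a_left + (i - 1) h."""
    return sum(c * bspline((x - (a_left + (i - 1) * h)) / h)
               for i, c in enumerate(coeffs))


coeffs = natural_spline_coeffs([1.0, -1.0, 2.0])    # data of Table 5.2
value = spline_eval(0.25, 0.0, 0.5, coeffs)
```

For the data in Table 5.2 the coefficients should agree with those found in Example 5.7, and the value at $x = 1/4$ should agree with $s_1(1/4)$ from Example 5.5.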



The question arises as to why the natural cubic spline works as well as it does. There is a partial answer to this and it involves curvature. Recall that for a curve $y = f(x)$, the curvature at a point is defined as
\[ \kappa = \frac{|f''(x)|}{[1 + (f')^2]^{3/2}}. \]
Assuming the curve is not particularly steep, one can approximate the curvature as $\kappa \approx |f''(x)|$. This is brought up because of the next result due to Holladay [1957].

Theorem 5.1 If $q \in C^2[a, b]$ interpolates the same data points that the natural cubic spline $s(x)$ interpolates, then
\[ \int_a^b [s''(x)]^2 \, dx \le \int_a^b [q''(x)]^2 \, dx. \]

In these integrals, a = x1 and b = xn+1 . What this theorem states is that out of all smooth functions that interpolate the data, the natural cubic spline produces the interpolation function with the smallest total curvature (squared). So, q(x) could be the not-a-knot spline, the global interpolation polynomial pn (x), or even the original function f (x) used to produce the data. This helps explain why the spline interpolations in Figure 5.13 do not suffer the significant under- and over-shoots seen in Figure 5.4.
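As a small check of Theorem 5.1 on the data of Table 5.2, one can compare the natural spline of Example 5.5 with the global quadratic through the same three points. The Python sketch below is only an illustration: the quadratic $p(x) = 1 - 9x + 10x^2$ is worked out here (it is not given in the text), and the helper name is arbitrary. Both second derivatives are piecewise linear, so the integrals can be computed exactly.

```python
from fractions import Fraction as F


def integral_of_square_linear(c0, c1, lo, hi):
    """Exact integral of (c0 + c1*x)^2 over [lo, hi]."""
    def antideriv(x):
        return c0**2 * x + c0 * c1 * x**2 + c1**2 * x**3 / 3
    return antideriv(hi) - antideriv(lo)


half = F(1, 2)
# Natural spline of Example 5.5: s1'' = 60 x on [0, 1/2],
# and s2'' = 30 - 60 (x - 1/2) = 60 - 60 x on [1/2, 1].
spline_energy = (integral_of_square_linear(F(0), F(60), F(0), half)
                 + integral_of_square_linear(F(60), F(-60), half, F(1)))

# Global quadratic through the same points: p(x) = 1 - 9 x + 10 x^2, so p'' = 20.
quadratic_energy = integral_of_square_linear(F(20), F(0), F(0), F(1))
```

The spline integral is smaller than the one for the quadratic, as the theorem requires.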

5.5 Function Approximation In using the data sets to test the various interpolation methods we have been using a qualitative, or visual, determination of how well they do. We are now going to make the test more quantitative and this will restrict the applications. In particular, it is assumed that the data come from the evaluation of a given function f (x) and we are going to investigate how well the interpolation function approximates f (x) over an interval a ≤ x ≤ b. In what follows the xi ’s are from the interval a ≤ x ≤ b and satisfy a ≤ x1 < x2 < · · · < xn+1 ≤ b. The corresponding interpolation points are (x1 , y1 ), (x2 , y2 ), · · · , (xn+1 , yn+1 ), where yi = f (xi ).

5.5.1 Global Polynomial Interpolation

We begin with a global interpolation polynomial $p_n(x)$. As explained in Section 5.2.3, $p_n(x)$ can fail to provide a good approximation of a function if a large number of equally spaced nodes are used. However, it is effective for a small number of nodes, as long as they are not too far apart. In fact, such approximations are central to several of the methods considered later in the text. The critical result needed to determine the error when using the approximation $f(x) \approx p_n(x)$ is contained in the following theorem.

Theorem 5.2 If $f \in C^{n+1}[a, b]$, then
\[ f(x) = p_n(x) + \frac{f^{(n+1)}(\eta)}{(n+1)!} q_{n+1}(x), \quad \text{for } a \le x \le b, \tag{5.29} \]
where
\[ q_{n+1}(x) = (x - x_1)(x - x_2) \cdots (x - x_{n+1}), \tag{5.30} \]
and $\eta$ is a point in $(a, b)$.

Outline of Proof: To explain how this is proved, we will consider the case of when $n = 1$. In this case, $q_2(x) = (x - a)(x - b)$. The formula in (5.29) holds when $x = a$ or $x = b$, so assume that $a < x < b$. The key step is a trick, which consists of introducing the function
\[ F(z) = f(z) - p_1(z) - \frac{f(x) - p_1(x)}{q_2(x)} q_2(z). \]

Given the way it is defined, $F(a) = 0$, $F(x) = 0$, and $F(b) = 0$. According to Rolle's theorem, there must be a point $z_1$, where $a < z_1 < x$ and $F'(z_1) = 0$, and there must be a point $z_2$, where $x < z_2 < b$ and $F'(z_2) = 0$. Using Rolle's theorem again, there must be a point $\eta$, where $z_1 < \eta < z_2$ and $F''(\eta) = 0$. From the above formula for $F(z)$, one finds that $F''(\eta) = 0$ reduces to (5.29). The case of when $n > 1$ is similar, except one uses Rolle's theorem $n + 1$ times (instead of twice).

It should be mentioned that $\eta$ depends on $x$ (i.e., a different $x$ generally means a different $\eta$). Also, the exact location of $\eta$ is not known other than it is located in the stated interval. If $p_n(x)$ is used to approximate $f(x)$ then the resulting interpolation error is $|f(x) - p_n(x)|$. The above theorem gives us the following bound on the size of this error.

Theorem 5.3 If $f \in C^{n+1}[a, b]$, then the interpolation polynomial $p_n(x)$ satisfies
\[ |f(x) - p_n(x)| \le \frac{1}{(n+1)!} \|f^{(n+1)}\|_\infty \|q_{n+1}\|_\infty, \quad \text{for } a \le x \le b. \tag{5.31} \]
If the $x_i$'s are equally spaced, with $x_1 = a$, $x_{n+1} = b$ and $x_{i+1} - x_i = h$, then
\[ |f(x) - p_n(x)| \le \frac{1}{4(n+1)} \|f^{(n+1)}\|_\infty h^{n+1}, \quad \text{for } a \le x \le b. \tag{5.32} \]

In the above theorem $\|f^{(n+1)}\|_\infty = \max_{a \le x \le b} |f^{(n+1)}(x)|$, and $\|q_{n+1}\|_\infty = \max_{a \le x \le b} |q_{n+1}(x)|$. Note that the equally spaced result, given in (5.32), is obtained directly from (5.31) using the inequality derived in Exercise 5.42(a).

Example 5.8 Suppose that $p_n(x)$, with equally spaced nodes, is used to approximate $f(x) = e^{-2x}$ for $0 \le x \le 3$. Find a value of $n$ so the approximation is correct to 4 significant digits at every point in the interval.

Answer: We want
\[ \left| \frac{f(x) - p_n(x)}{f(x)} \right| < 10^{-4}, \quad \text{for } 0 \le x \le 3. \]
To use (5.32), note $f^{(n+1)}(x) = (-2)^{n+1} e^{-2x}$, so $\|f^{(n+1)}\|_\infty = 2^{n+1}$. Also, on this interval $|f(x)| \ge e^{-6}$. So, since $h = 3/n$, then from (5.32) we have that
\[ \left| \frac{f(x) - p_n(x)}{f(x)} \right| \le \frac{1}{e^{-6}} \, \frac{1}{4(n+1)} \left( \frac{6}{n} \right)^{n+1}. \]
To have $e^6 (6/n)^{n+1} / (4(n+1)) < 10^{-4}$ requires that $n \ge 14$ (this was determined by simply evaluating the left-hand side of the inequality for increasing values of $n$). So, $n = 14$ results in the required accuracy.
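The search over $n$ described in Example 5.8 takes only a few lines. The following Python sketch simply evaluates the left-hand side of the inequality for increasing $n$; the function name is illustrative.

```python
import math


def relative_error_bound(n):
    """Right-hand side e^6 (6/n)^(n+1) / (4(n+1)) from Example 5.8."""
    return math.exp(6) * (6 / n) ** (n + 1) / (4 * (n + 1))


n = 1
while relative_error_bound(n) >= 1e-4:
    n += 1
smallest_n = n
```

The loop stops at the first $n$ for which the bound drops below $10^{-4}$, which should agree with the value stated in the example.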

5.5.2 Piecewise Linear Interpolation

The next easiest method to analyze is piecewise linear interpolation. An example is shown in Figure 5.14, where $f(x) = \cos(2\pi x)$ is approximated using 6 points over the interval $0 \le x \le 1$. It is seen that the accuracy of the linear approximation depends on the subinterval. To determine how well $f(x)$ is approximated by $g_i(x)$, given in (5.10), over the subinterval $x_i \le x \le x_{i+1}$ we can use Theorem 5.3. In this case, $n = 1$ and $q_2(x) = (x - x_i)(x - x_{i+1})$. It is straightforward to show that over this subinterval the largest value of $|q_2(x)|$ is $h^2/4$, where $h = x_{i+1} - x_i$. Consequently, from (5.31),
\[ \begin{aligned} |f(x) - g_i(x)| &\le \frac{1}{2} \max_{x_i \le x \le x_{i+1}} |f''(x)| \max_{x_i \le x \le x_{i+1}} |q_2(x)| \\ &= \frac{1}{8} h^2 \max_{x_i \le x \le x_{i+1}} |f''(x)| \\ &\le \frac{1}{8} h^2 \|f''\|_\infty. \end{aligned} \]
This applies to each subinterval, which gives the next result.

Theorem 5.4 Suppose $f \in C^2[a, b]$ and the $x_i$'s are equally spaced, with $x_{i+1} - x_i = h$. In this case the piecewise linear interpolation function (5.10) satisfies
\[ |f(x) - g(x)| \le \frac{1}{8} h^2 \|f''\|_\infty, \quad \text{for } a \le x \le b, \]
where $\|f''\|_\infty = \max_{a \le x \le b} |f''(x)|$.

Figure 5.14 Piecewise linear interpolation of $f(x) = \cos(2\pi x)$, for $0 \le x \le 1$.

This means that the piecewise linear interpolation function converges to the original function and the error is second order (because of the $h^2$). Consequently, if the number of interpolation points is doubled, the error in the approximation should decrease by about a factor of 4.

Example 5.9
(1) Piecewise linear interpolation is used to approximate $f(x) = \cos(2\pi x)$ in Figure 5.14. According to Theorem 5.4, what is the error bound for this approximation?
Answer: The interval in this case is $0 \le x \le 1$. Since $f(x) = \cos(2\pi x)$, then $f''(x) = -4\pi^2 \cos(2\pi x)$ and from this it follows that $\|f''\|_\infty = 4\pi^2$. Given that $h = 1/5$, the error bound is $|f(x) - g(x)| \le \pi^2/50$.
(2) For $f(x) = \cos(2\pi x)$, where $0 \le x \le 1$, how many interpolation points are needed to guarantee an error of $10^{-4}$?
Answer: This is guaranteed if $\frac{1}{8} h^2 \|f''\|_\infty \le 10^{-4}$. Since $\|f''\|_\infty = 4\pi^2$, then we want $\pi^2 h^2 / 2 \le 10^{-4}$. This gives $h \le \sqrt{2} \times 10^{-2} / \pi$. Given that $x_1 = 0$ and $x_{n+1} = 1$, then $h = 1/n$. Combining our results, we want $n \ge 50\sqrt{2}\,\pi \approx 222.1$. Therefore, we need to take $n \ge 223$.
(3) According to Theorem 5.4, for what functions will there be zero error using piecewise linear interpolation, no matter what the interpolation interval?
Answer: This requires $\|f''\|_\infty = 0$, which means that $f''(x) = 0$ for $a \le x \le b$. Therefore, to have zero error it must be that $f(x) = \alpha + \beta x$, i.e., it must be a linear function in this interval.
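The numbers in Example 5.9, and the second-order behavior, can be checked numerically. The Python sketch below estimates the maximum error by sampling each subinterval (so it slightly underestimates the true maximum); the function names and the choice of 20 samples per subinterval are assumptions made for this illustration.

```python
import math


def f(x):
    return math.cos(2 * math.pi * x)


def max_error_piecewise_linear(n, samples_per_interval=20):
    """Largest sampled |f - g| for n equal subintervals of [0, 1]."""
    h = 1.0 / n
    worst = 0.0
    for i in range(n):
        x_left, x_right = i * h, (i + 1) * h
        y_left, y_right = f(x_left), f(x_right)
        for k in range(samples_per_interval + 1):
            t = k / samples_per_interval
            x = x_left + t * h
            g = (1 - t) * y_left + t * y_right   # linear interpolant on this subinterval
            worst = max(worst, abs(f(x) - g))
    return worst


def error_bound(n):
    """Theorem 5.4 bound h^2 ||f''||_inf / 8 with ||f''||_inf = 4 pi^2."""
    return (1.0 / n) ** 2 * 4 * math.pi ** 2 / 8


bound_n5 = error_bound(5)                            # pi^2 / 50
n_needed = math.ceil(50 * math.sqrt(2) * math.pi)    # n >= 50 sqrt(2) pi
ratio = max_error_piecewise_linear(50) / max_error_piecewise_linear(100)
```

Doubling $n$ from 50 to 100 should reduce the sampled error by a factor close to 4, consistent with the $h^2$ dependence in the theorem.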

5.5.3 Cubic Splines

This brings us to the question of how well the cubic splines do when interpolating a function. The answer depends on what type of spline function is used (natural, clamped, or not-a-knot). To illustrate, in Figure 5.15 the function $f(x) = \cos(2\pi x)$ is approximated using both a natural and a clamped cubic spline. The clamped spline provides a somewhat better approximation but this is not surprising because it has the advantage of using the derivative information at the endpoints. It is possible to determine the error for the clamped spline, but because this requires some effort to derive only the results will be stated (see Hall and Meyer [1976] for the proof).

Theorem 5.5 Suppose $f \in C^4[a, b]$ and the $x_i$'s are equally spaced, with $x_{i+1} - x_i = h$. In this case the clamped cubic spline interpolation function (5.16) satisfies
\[ |f(x) - s(x)| \le \frac{5}{384} h^4 \|f''''\|_\infty, \quad \text{for } a \le x \le b, \]
where $\|f''''\|_\infty = \max_{a \le x \le b} |f''''(x)|$. Moreover, for $a \le x \le b$,
\[ \begin{aligned} |f'(x) - s'(x)| &\le \frac{1}{24} h^3 \|f''''\|_\infty, \\ |f''(x) - s''(x)| &\le \frac{1}{3} h^2 \|f''''\|_\infty. \end{aligned} \]

This is a remarkable result because it states that the clamped cubic spline can be used to approximate f (x) and its first two derivatives. Moreover,

Figure 5.15 Natural and clamped cubic spline interpolation of $f(x) = \cos(2\pi x)$.

the error in approximating $f(x)$ is fourth-order (because of the $h^4$). As an example, if the number of interpolation points is increased by a factor of 10, the error bound decreases by about a factor of $10^4$. In comparison, according to Theorem 5.4, the error for piecewise linear interpolation decreases by a factor of $10^2$.

Example 5.10
(1) A clamped cubic spline is used to approximate $f(x) = \cos(2\pi x)$ in Figure 5.15. According to Theorem 5.5, what is the error bound for this approximation?
Answer: The interval is $0 \le x \le 1$. Since $f(x) = \cos(2\pi x)$, then $f''''(x) = (2\pi)^4 \cos(2\pi x)$ and this means that $\|f''''\|_\infty = (2\pi)^4$. Given that $h = 1/5$, then the error bound is $|f(x) - s(x)| \le \pi^4/3000$, where $\pi^4/3000 \approx 0.03$.
(2) For $f(x) = \cos(2\pi x)$, where $0 \le x \le 1$, how many interpolation points are needed to guarantee an error of no more than $10^{-4}$ when using a clamped spline?
Answer: This is guaranteed if $\frac{5}{384} h^4 \|f''''\|_\infty \le 10^{-4}$. Since $\|f''''\|_\infty = (2\pi)^4$, then we want
\[ \frac{5}{384} h^4 (2\pi)^4 \le 10^{-4}. \]
This can be rewritten as $h \le (384/5)^{1/4} / (20\pi)$. Given that $x_1 = 0$ and $x_{n+1} = 1$, then $h = 1/n$. Combining our results, we want $n \ge 20\pi (5/384)^{1/4} \approx 21.2$. Therefore, we need to take $n \ge 22$.
(3) According to Theorem 5.5, for what functions will there be zero error using a clamped spline, no matter what the interpolation interval?
Answer: This requires $\|f''''\|_\infty = 0$, which means that $f''''(x) = 0$ for $a \le x \le b$. Therefore, to have zero error it must be that $f(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3$.

It is possible to prove a similar theorem for the not-a-knot spline although the values of the bounding constants are harder to determine. For the natural spline, it is possible to show that the same level of accuracy is obtained except near the endpoints. This is seen in Figure 5.15, where the natural spline does almost as well as the clamped spline in approximating the function except over the subintervals next to the boundary points. Moreover, the regions near the endpoints where the natural spline is not fourth-order shrink to zero as $n$ increases [Kershaw, 1971].
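The arithmetic in Example 5.10 for the clamped spline can be reproduced directly from the bound in Theorem 5.5. The following Python sketch is only a check of those numbers; the function name is illustrative.

```python
import math


def clamped_spline_bound(n):
    """Theorem 5.5 bound (5/384) h^4 ||f''''||_inf for f = cos(2 pi x), h = 1/n."""
    h = 1.0 / n
    return 5.0 / 384.0 * h ** 4 * (2 * math.pi) ** 4


bound_n5 = clamped_spline_bound(5)     # equals pi^4 / 3000

n = 1
while clamped_spline_bound(n) > 1e-4:
    n += 1
n_needed = n

doubling_factor = clamped_spline_bound(10) / clamped_spline_bound(20)
```

Because the bound scales as $h^4$, doubling $n$ reduces it by a factor of 16.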

5.5.4 Chebyshev Interpolation

We saw in Section 5.2.3 that $p_n(x)$ can fail to provide an accurate approximation of $f(x)$ if the $x_i$'s are equally spaced. The question we examine now is, is it possible to pick the locations of the nodes so $p_n(x)$ is capable of producing an accurate interpolation function? The key in answering this is Theorem 5.2. What is of interest is the error term, which is
\[ \frac{f^{(n+1)}(\eta)}{(n+1)!} q_{n+1}(x), \]
where $q_{n+1}(x)$ is given in (5.30). As seen in Figure 5.5, the error can be huge. We want to prevent this from happening. To do this note that the assumption that $f \in C^{n+1}[a, b]$ means that there is a positive constant $M_{n+1}$ so that $|f^{(n+1)}(x)| \le M_{n+1}$. Consequently, the part of the error function that we need to concentrate on is $q_{n+1}(x)$, and what we are specifically interested in is the value of
\[ Q = \max_{a \le x \le b} |q_{n+1}(x)|. \tag{5.33} \]

This brings us to the following question: given $n$, how do we position the $x_i$'s in the interval $a \le x \le b$ to minimize $Q$? The answer is given in the following theorem.

Theorem 5.6 The $x_i$'s that produce the smallest value of $Q$ are
\[ x_i = \frac{1}{2}\left[ a + b + (b - a) z_i \right], \quad \text{for } i = 1, 2, \cdots, n + 1, \tag{5.34} \]
where
\[ z_i = \cos\left( \frac{2i - 1}{2(n+1)} \pi \right). \tag{5.35} \]
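Formulas (5.34) and (5.35) are easy to evaluate. The following minimal Python sketch returns the $n + 1$ nodes for a given interval; the function name `chebyshev_points` is chosen here for illustration.

```python
import math


def chebyshev_points(a, b, n):
    """Return x_1, ..., x_{n+1} from (5.34)-(5.35) on the interval [a, b]."""
    pts = []
    for i in range(1, n + 2):
        z = math.cos((2 * i - 1) * math.pi / (2 * (n + 1)))
        pts.append(0.5 * (a + b + (b - a) * z))
    return pts


pts = chebyshev_points(0.0, 1.0, 2)
```

For $n = 2$ on $0 \le x \le 1$ this gives the nodes $1/2 + \sqrt{3}/4$, $1/2$, and $1/2 - \sqrt{3}/4$, listed from right to left because $z_1$ is the largest of the $z_i$'s.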

The $x_i$'s in the above theorem are called the Chebyshev points because they correspond to the zeros of the $(n + 1)$-degree Chebyshev polynomial $T_{n+1}(x)$. The connections with the Chebyshev polynomials will be considered in more detail later. Also, note that in the above theorem $n$ is a non-negative integer.

Example 5.11
(1) Find the Chebyshev interpolation polynomial when $n = 2$.
Answer: From (5.35), $z_1 = \cos(\pi/6) = \sqrt{3}/2$, $z_2 = \cos(\pi/2) = 0$, and $z_3 = \cos(5\pi/6) = -\sqrt{3}/2$. This means that $x_1 = x_m + \sqrt{3}\,\ell/4$, $x_2 = x_m$, and $x_3 = x_m - \sqrt{3}\,\ell/4$, where $x_m = (a + b)/2$ and $\ell = b - a$ are the midpoint and width of the interval, respectively. The resulting interpolation polynomial is
\[ p_2(x) = f(x_1)\ell_1(x) + f(x_2)\ell_2(x) + f(x_3)\ell_3(x), \]
where $\ell_1 = 8(x - x_2)(x - x_3)/(3\ell^2)$, $\ell_2 = -16(x - x_1)(x - x_3)/(3\ell^2)$, and $\ell_3 = 8(x - x_1)(x - x_2)/(3\ell^2)$.
(2) Find the quadratic Chebyshev interpolation polynomial for $f(x) = \cos(2\pi x)$, where $0 \le x \le 1$.
Answer: Using the result from part (1), $x_m = 1/2$ and $\ell = 1$, so $x_1 = 1/2 + \sqrt{3}/4$, $x_2 = 1/2$, and $x_3 = 1/2 - \sqrt{3}/4$. This means that $f(x_1) = \cos((1 + \sqrt{3}/2)\pi) = -\cos(\sqrt{3}\pi/2)$, $f(x_2) = -1$, and $f(x_3) = \cos((1 - \sqrt{3}/2)\pi) = -\cos(\sqrt{3}\pi/2)$. Substituting these values into the formula for $p_2(x)$ in part (1), you get
\[ p_2(x) = \frac{16}{3}\left[ x^2 - x + \frac{1}{16} - \left( x - \frac{1}{2} \right)^2 \cos\left( \frac{\sqrt{3}}{2}\pi \right) \right]. \]

Figure 5.16 The Chebyshev points (the red dots) are determined by equally spaced points on the circumscribed semi-circle. In this example $n = 11$.

There is a simple geometric interpretation for how the $x_i$'s are positioned in the interval that comes from (5.34). Placing a semi-circle over the interval, as in Figure 5.16, consider the $n + 1$ points on the semi-circle that are a constant angle $\pi/(n+1)$ apart, with the first one (on the far right) at an angle $\pi/(2(n+1))$. Their $x$-coordinates are given in (5.34) and they are the corresponding Chebyshev points. A few observations can be made from this figure:
• The endpoints of the interval are not nodes (i.e., $x_1 \ne a$ and $x_{n+1} \ne b$).
• The Chebyshev points are symmetrically positioned with respect to the midpoint of the interval.
• The Chebyshev points get closer together as you approach an endpoint.

By using the Chebyshev points, the following interpolation error is obtained.

Theorem 5.7 With the $x_i$'s given in (5.34), then the interpolation polynomial $p_n(x)$ given in (5.8) satisfies
\[ |f(x) - p_n(x)| \le \sqrt{\frac{2}{(n+1)\pi}} \, R^{n+1} \|f^{(n+1)}\|_\infty, \quad \text{for } a \le x \le b, \tag{5.36} \]
where
\[ R = \frac{(b - a)e}{4(n+1)}. \tag{5.37} \]

The proof of this result is discussed in Exercise 5.42(b). According to the above theorem, if $n$ is large enough that $R < 1$, then $R^{n+1}$ approaches zero exponentially fast as $n \to \infty$. Whether this means that the error for Chebyshev interpolation approaches zero exponentially fast, however, depends on how the $f^{(n+1)}$ term depends on $n$. In contrast, when using piecewise linear or cubic splines the error approaches zero as a fixed power of $h = (b - a)/n$. Some comparisons of these interpolation methods are considered in what follows.

Example 5.12
(1) Using the Chebyshev points with $n = 16$ to fit Runge's function in (5.9), the interpolation function shown in Figure 5.17 is obtained. The improvement over using equally spaced points, which is shown in Figure 5.5, is dramatic. Also shown in Figure 5.17, by small horizontal bars, are the locations of the $x_i$'s. This shows the non-uniform spacing of the interpolation points, and they get closer together as you approach either of the endpoints of the interval.
(2) For $f(x) = \cos(2\pi x)$, where $0 \le x \le 1$, according to Theorem 5.7, how many interpolation points are needed to guarantee an error of $10^{-4}$ when using Chebyshev interpolation?
Answer: This is guaranteed if $\sqrt{2/((n+1)\pi)}\, R^{n+1} \|f^{(n+1)}\|_\infty \le 10^{-4}$. Since $\|f^{(n+1)}\|_\infty = (2\pi)^{n+1}$, then we want $\sqrt{2/((n+1)\pi)}\, (2\pi R)^{n+1} \le 10^{-4}$. From this one finds that $n \ge 9$. In comparison, earlier we found that piecewise linear requires $n \ge 223$, while a clamped cubic spline requires $n \ge 22$. It is also interesting to note that if the number of points is doubled to $n = 18$, then according to Theorem 5.7 the error bound is about $10^{-13}$, and if doubled again to $n = 36$, the error bound is an astonishing $10^{-36}$. In contrast, for a clamped cubic spline, doubling the number of points decreases the error bound by a factor of $2^{-4} \approx 6 \times 10^{-2}$. This is a clear demonstration of the benefits of an exponentially converging approximation, but as we will see in Section 5.5.5, exponential convergence is limited to certain types of functions.

Figure 5.17 The Runge function (5.9) and the Chebyshev interpolation polynomial with $n = 16$.

Chebyshev interpolation allows for the use of higher degree polynomials. It is recommended that the modified Lagrange interpolation formula (5.8), or the barycentric Lagrange formula given in Exercise 5.39, be used in such cases. The Vandermonde formula should be avoided because using Chebyshev interpolation points does not improve the condition number enough to make that method useable for larger values of $n$.

The derivation of the Chebyshev interpolation nodes involved minimizing the largest value of $|q_{n+1}(x)|$. As seen in (5.29), the error in the approximation also includes a contribution from $f^{(n+1)}(\eta)$. As a consequence of this, the Chebyshev interpolation nodes do not usually provide the "best" polynomial approximation of $f(x)$. The definition of best here is the minimization of the largest value of $|f(x) - p_n(x)|$. Computing the nodes of the "best" polynomial is more challenging, and is often done using the Remez algorithm. One complication with these nodes is that, unlike the Chebyshev nodes, they depend on the particular function being interpolated (e.g., see Exercise 5.36).
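The values quoted in Example 5.12(2) can be reproduced by evaluating the bound (5.36) for $f(x) = \cos(2\pi x)$ on $0 \le x \le 1$. The following Python sketch is a check of that arithmetic only; the function name and default arguments are illustrative.

```python
import math


def chebyshev_bound(n, a=0.0, b=1.0):
    """Bound (5.36) for f = cos(2 pi x), where ||f^(n+1)||_inf = (2 pi)^(n+1)."""
    R = (b - a) * math.e / (4 * (n + 1))
    return math.sqrt(2 / ((n + 1) * math.pi)) * (2 * math.pi * R) ** (n + 1)


n = 1
while chebyshev_bound(n) > 1e-4:
    n += 1
n_needed = n

bound_18 = chebyshev_bound(18)
bound_36 = chebyshev_bound(36)
```

The loop should stop at $n = 9$, and the bounds for $n = 18$ and $n = 36$ should be of order $10^{-13}$ and $10^{-36}$, in line with the values stated in the example.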

Chebyshev Polynomials

To explain how the $x_i$'s are determined, it is assumed that $a = -1$ and $b = 1$. We begin with a definition.

Definition of a Chebyshev Polynomial. The Chebyshev polynomial $T_n(x)$ is defined using the following recursion formula:
$$T_n(x) = 2x\,T_{n-1}(x) - T_{n-2}(x), \quad \text{for } n = 2, 3, 4, \cdots, \tag{5.38}$$
where $T_0(x) = 1$ and $T_1(x) = x$.

Using this definition, the first few Chebyshev polynomials are
$$T_2(x) = 2x^2 - 1,$$
$$T_3(x) = 4x^3 - 3x,$$
$$T_4(x) = 8x^4 - 8x^2 + 1,$$
$$T_5(x) = 16x^5 - 20x^3 + 5x,$$
$$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1.$$
As is seen in the above expressions, $T_n(x)$ is an $n$th degree polynomial and the leading coefficient is $2^{n-1}$. Two of the key results needed for finding the $x_i$'s are contained in the following theorem.

Theorem 5.8 If $P_n(x) = x^n + b_{n-1}x^{n-1} + \cdots + b_0$, where $n \ge 1$, then
$$\max_{-1 \le x \le 1} |P_n(x)| \ge 2^{1-n}.$$
If $P_n(x) = 2^{1-n} T_n(x)$, then
$$\max_{-1 \le x \le 1} |P_n(x)| = 2^{1-n}.$$

The usual proof of the first statement involves contradiction, and using the oscillatory properties of a polynomial. An illustration of how it is possible to prove it directly is given in Exercise 5.43.

Note that the $q_{n+1}(x)$ in (5.30) is an example of the function $P_{n+1}(x)$ appearing in the above theorem. The first result in the theorem states that no matter what we pick for the $x_i$'s, the $Q$ in (5.33) satisfies $Q \ge 2^{-n}$. What the second result states is that if we pick $q_{n+1}(x) = 2^{-n} T_{n+1}(x)$, then $Q = 2^{-n}$, i.e., it achieves the stated minimum value. Therefore, we should pick the $x_i$'s so that
$$2^{-n} T_{n+1}(x) = (x - x_1)(x - x_2) \cdots (x - x_{n+1}). \tag{5.39}$$
In other words, the $x_i$'s are the zeros of $T_{n+1}(x)$.

We need an easy way to find the zeros of $T_{n+1}(x)$. This is done using the formula
$$T_n(x) = \cos(n \cos^{-1} x), \quad \text{for } n = 0, 1, 2, 3, \cdots.$$
This is a strange looking equation because the right-hand side does not look to be a polynomial. Nevertheless, the proof is rather simple and basically involves using trig identities to show that the right-hand side satisfies (5.38). With this it is easy to find the zeros of $T_{n+1}(x)$, and they are the values of $x$ that satisfy
$$(n + 1) \cos^{-1} x = \frac{\pi}{2}(2i - 1), \quad \text{for } i = 1, 2, 3, \cdots, n + 1.$$
The values given in (5.34) are the resulting positions when the above result is transformed from $-1 \le x \le 1$ to $a \le x \le b$.
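The recursion (5.38), the cosine formula, and the resulting zeros can be checked against each other with a few lines of code. This is an illustrative sketch only: the function names are hypothetical, and the affine map used to move the zeros from −1 ≤ x ≤ 1 to a ≤ x ≤ b is the standard one, assumed here to correspond to (5.34), which is not reproduced in this passage.

```python
import math

def chebyshev_T(n, x):
    """Evaluate T_n(x) using T_n = 2x T_{n-1} - T_{n-2}, with T_0 = 1, T_1 = x."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(2, n + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def chebyshev_zeros(m):
    """Zeros of T_m on [-1, 1], from m * arccos(x) = (pi/2)(2i - 1), i = 1..m."""
    return [math.cos((2 * i - 1) * math.pi / (2 * m)) for i in range(1, m + 1)]

def map_to_interval(x, a, b):
    """Affine map from [-1, 1] to [a, b] (assumed to match (5.34))."""
    return 0.5 * (a + b) + 0.5 * (b - a) * x

# n = 4 gives n + 1 = 5 interpolation points, the zeros of T_5, mapped to [0, 1].
nodes = [map_to_interval(z, 0.0, 1.0) for z in chebyshev_zeros(5)]
print(nodes)
```

Evaluating `chebyshev_T(5, z)` at each returned zero gives values at rounding-error level, which is a quick consistency check between the recursion and the cosine formula.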


5.5.5 Chebyshev versus Cubic Splines

We saw that Chebyshev interpolation has the potential to have an error that approaches zero exponentially fast as $n$ increases. It also has the distinction that it produces the smallest $Q$, as explained in Theorem 5.6. The natural cubic splines, on the other hand, produce the smallest total curvature squared, as defined in Theorem 5.1. What this means is that we have two interpolation methods that can claim to be optimal. Given this, it is of interest to compare Chebyshev and cubic spline interpolation on some more challenging examples.

Example 5.13 We begin with Runge's function in (5.9), and assume 13 interpolation points are used. The resulting Chebyshev interpolation function, and the natural cubic spline, are shown in Figure 5.18. In comparing the two curves, it is clear that the spline provides a better approximation.

Figure 5.18 Runge's function (5.9), the natural cubic spline, and the Chebyshev interpolation polynomial using 13 interpolation points. The Runge and spline curves are almost indistinguishable.

To have a more quantitative comparison, suppose $g(x)$ is an interpolation function of a given function $f(x)$. The error in using $g(x)$ to approximate $f(x)$ can be determined using the area between the two curves, which means
$$E = \int_a^b |f(x) - g(x)|\,dx. \tag{5.40}$$
The value of this integral is given in Figure 5.19, when $g(x)$ is the natural cubic spline, and when it is the Chebyshev interpolation function. It is seen that the spline produces a more accurate approximation when using up to about 50 interpolation points. The reason the spline does better is that $\|f^{(n+1)}\|_\infty$ grows rapidly with $n$, which effectively eliminates the exponential convergence for Chebyshev interpolation. For example, when $n = 13$, $\|f^{(n+1)}\|_\infty \approx 5 \times 10^{20}$. It is not until $n$ is rather large that the exponential convergence kicks in and Chebyshev begins to produce a better approximation than the spline. For the record, using a clamped cubic spline results in an error curve that is mostly indistinguishable from the one for the natural cubic spline.

Figure 5.19 The error, as determined by the area integral in (5.40), when approximating Runge's function with a natural cubic spline, and with a Chebyshev interpolation function.

Example 5.14 Suppose the function is $f(x) = \tanh(100x - 30)$, and the interval is $0 \le x \le 1$. Using 25 interpolation points, the resulting interpolation functions are shown in Figure 5.20. In this case, both have some difficulty with the rapid rise in the function. However, the under- and over-shoots in the spline function die out faster than those for the Chebyshev function. In this example it is necessary to take $n > 600$ before Chebyshev interpolation produces a better approximation than the spline.

Figure 5.20 The function $f(x) = \tanh(100x - 30)$, along with Chebyshev and cubic spline interpolation functions using 25 interpolation points.

It is evident from the examples that Chebyshev interpolation is capable of producing a more accurate approximation than cubic splines. However, this is based on the stipulation that the contribution of the $f^{(n+1)}(\eta)$ term in the error is not too large, and it can be difficult to know if this holds. This limits its usefulness, but it does not dampen the enthusiasm that some have for the method. To get insight into why they think this way, Trefethen [2012] should be consulted.
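A short calculation helps show how the derivative growth in Example 5.13 defeats the error bound. The sketch below assumes that Runge's function (5.9) is f(x) = 1/(1 + 25x²) on −1 ≤ x ≤ 1, which is not restated in this section, and it uses the magnitude of the derivative at x = 0 as a stand-in for the norm (this agrees with the value of about 5 × 10²⁰ quoted above). The bound is the factorial form given in Exercise 5.42(b).

```python
import math

n = 13

# For f(x) = 1/(1 + 25 x^2) (assumed form of (5.9)), the Taylor series at x = 0
# gives |f^(2k)(0)| = (2k)! * 25^k. Here 2k = n + 1 = 14.
deriv_at_zero = math.factorial(n + 1) * 25 ** ((n + 1) // 2)

# Chebyshev bound from Exercise 5.42(b) with (b - a)/2 = 1.
half_width = 1.0
bound = deriv_at_zero / (2**n * math.factorial(n + 1)) * half_width ** (n + 1)

print(f"|f^(14)(0)| ~ {deriv_at_zero:.2e}")
print(f"error bound ~ {bound:.2e}")
```

With these assumptions the bound is several orders of magnitude larger than the function itself, so it gives no useful guarantee at this value of n. This does not say the actual error is that large, only that the bound cannot certify accuracy here.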

5.5.6 Other Ideas

There are a variety of ways of modifying the interpolation procedure. For example, given a function $f(x)$ one can construct a piecewise cubic that interpolates $f(x_1), f(x_2), \cdots, f(x_{n+1})$ as well as the derivative values $f'(x_1), f'(x_2), \cdots, f'(x_{n+1})$. This is known as Hermite interpolation and it produces an approximation with an error that is $O(h^4)$. This puts it in the same category as the clamped cubic spline discussed earlier.

There is also the idea of being monotone. The objective here is that if the data appear to describe a monotonically increasing (or decreasing) function over an interval then the interpolation function should behave the same way over that interval. As seen in the lower right data set in Figures 5.4 and 5.13, the global and spline functions fail at this while the piecewise linear function in Figure 5.9 works. However, for the latter there is the usual problem with corners. So, the goal is to find a method that does well at preserving monotonicity and is also smooth. This comes under the more general heading of finding a smooth shape preserving interpolation function. A review of such methods, such as the Akima algorithm and the Fritsch-Butland procedure, can be found in Huynh [1993]. It is also worth noting that shape preserving interpolation is of particular interest in the mathematical finance community, and those interested in this application should consult Hagan and West [2006].
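As an illustration of the Hermite idea mentioned above, the following sketch builds the piecewise cubic from function values and derivative values at the nodes, using the standard cubic Hermite basis on each subinterval. The function names and the test function are hypothetical and are not taken from the text.

```python
from bisect import bisect_right

def hermite_cubic(x, x0, x1, y0, y1, d0, d1):
    """Cubic on [x0, x1] matching values y0, y1 and slopes d0, d1 at the ends."""
    h = x1 - x0
    t = (x - x0) / h                    # local coordinate in [0, 1]
    h00 = (1 + 2 * t) * (1 - t) ** 2    # equals 1 at t = 0, carries y0
    h10 = t * (1 - t) ** 2              # unit slope at t = 0, carries d0
    h01 = t ** 2 * (3 - 2 * t)          # equals 1 at t = 1, carries y1
    h11 = t ** 2 * (t - 1)              # unit slope at t = 1, carries d1
    return y0 * h00 + h * d0 * h10 + y1 * h01 + h * d1 * h11

def piecewise_hermite(x, nodes, values, slopes):
    """Evaluate the piecewise cubic Hermite interpolant at x (nodes sorted)."""
    i = min(max(bisect_right(nodes, x) - 1, 0), len(nodes) - 2)
    return hermite_cubic(x, nodes[i], nodes[i + 1],
                         values[i], values[i + 1], slopes[i], slopes[i + 1])

# Hypothetical example: f(x) = x^3 - 2x + 1 sampled at unequally spaced nodes.
f = lambda x: x**3 - 2 * x + 1
df = lambda x: 3 * x**2 - 2
nodes = [0.0, 1.0, 2.5]
values = [f(x) for x in nodes]
slopes = [df(x) for x in nodes]
print(piecewise_hermite(1.7, nodes, values, slopes), f(1.7))
```

Because the example function is itself a cubic, the interpolant reproduces it up to rounding, which makes a convenient correctness check. Unlike a spline, this construction needs the derivative values at every node, and it only guarantees a continuous first derivative.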

5.6 Questions and Additional Comments

Below are some questions often asked about interpolation by those taking their first course in scientific computing.

(1) To fix the corner problem that arises with piecewise linear interpolation, we used piecewise cubic interpolation. What's wrong with piecewise quadratics?
Answer: They have a couple of drawbacks. To explain, quadratics can interpolate the data and have a continuous first derivative (see Exercise 5.44). In comparison, cubic splines have continuous second derivatives, and so they are smoother. Another issue with quadratics is that they can produce what can best be described as bumpy curves, and this is illustrated in Exercise 5.44(d). However, it is possible to adjust the interpolation procedure to improve the situation and those interested might want to consult Marsden [1974] or Cheng et al. [2001].

(2) The cubic B-spline $B_i(x)$ is nonzero for $x_{i-2} \le x \le x_{i+2}$. Why not use the smaller interval $x_{i-1} \le x \le x_{i+1}$?
Answer: It simply won't work (try it).

(3) Can the interpolation methods be used to solve the puzzle in Figure 5.6?
Answer: Yes, but if you want to draw something that looks like a flower then you will need to rewrite the problem because our methods are based on interpolating a function. The simplest solution is to use a parametric representation of the curve, and what's involved is considered in Exercise 5.25. This idea can be generalized, which leads naturally to something called Bezier curves and surfaces, and geometric modeling. Those interested might consider looking at Salomon [2006] and Mortenson [1997].

(4) If just one of the $y_i$'s is changed, what happens to the interpolation function?
Answer: It depends on what method you are using. For piecewise linear, the interpolation function is only affected over the interval $x_{i-1} < x < x_{i+1}$. If you are using cubic splines or Chebyshev, then the interpolation function over the entire interval $a < x < b$ is affected. To get an idea of what happens, if all of the $y_i$'s are zero then the interpolation function is zero everywhere, irrespective of whether you are using a natural cubic spline or Chebyshev interpolation. Now, suppose one data point is changed and it is now nonzero. This situation is shown in Figure 5.21 using a natural cubic spline and Chebyshev interpolation, where the nonzero point is the one close to 0.8 (the exact point differs between the two methods due to how they position the points). In both cases, the change in the interpolation function over the entire interval is less than the change at the given data point. In other words, if the data value is changed by a small amount, then the interpolation function over the interval is changed by no more than this value. However, it is also apparent that the changes in the Chebyshev function are more widespread than for the cubic spline.

Figure 5.21 Interpolation using Chebyshev and cubic spline interpolation functions with 15 data points, with all values zero except for the one near 0.8.

(5) Is it possible to find cubic spline versions of $\ell_i(x)$ and $G_i(x)$?
Answer: Yes. These were used in the early days of spline research, and they were usually designated as $L_i(x)$. They were defined as the cubic splines that satisfied $L_i(x_i) = 1$ and $L_i(x_j) = 0$ for $j \ne i$ (see, e.g., de Boor and Schoenberg [1976]). These were called the fundamental functions, and one is shown in Figure 5.22. With these, the spline interpolation function is simply
$$s(x) = \sum_{i=1}^{n+1} y_i L_i(x).$$
The drawback with using these is that they are not as computationally efficient as cubic B-splines.
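The sum-of-basis-functions form in question (5), and the locality property noted in question (4), can both be seen in the piecewise linear case, where the basis functions are the $G_i(x)$. The sketch below is illustrative only: the function names and data are hypothetical, and the code uses 0-based indices whereas the text numbers the nodes from 1.

```python
def hat(i, x, nodes):
    """Piecewise linear basis function: 1 at nodes[i], 0 at every other node."""
    xi = nodes[i]
    if i > 0 and nodes[i - 1] <= x <= xi:
        return (x - nodes[i - 1]) / (xi - nodes[i - 1])
    if i < len(nodes) - 1 and xi <= x <= nodes[i + 1]:
        return (nodes[i + 1] - x) / (nodes[i + 1] - xi)
    return 0.0

def interp(x, nodes, ys):
    """Piecewise linear interpolant written as a sum of y_i times hat functions."""
    return sum(y * hat(i, x, nodes) for i, y in enumerate(ys))

nodes = [0.0, 0.25, 0.5, 0.75, 1.0]
before = [0, 0, 0, 0, 0]   # all data values zero
after = [0, 0, 0, 1, 0]    # only the value at x = 0.75 is changed

for x in (0.3, 0.6, 0.75, 0.9):
    print(x, interp(x, nodes, before), interp(x, nodes, after))
```

The interpolant changes only for 0.5 < x < 1, the two subintervals touching the modified node. A cubic spline or Chebyshev interpolant built from the same data would change over the whole interval, as described in the answer to question (4).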

Figure 5.22 A fundamental function for a clamped cubic spline. Shown is $L_7(x)$.

Exercises

Sections 5.2–5.4

5.1. Sketch the Lagrange interpolation polynomials $\ell_1(x), \cdots, \ell_{n+1}(x)$ on the same set of axes.
(a) When $n = 1$, with $x_1 = 1$ and $x_2 = 2$.
(b) When $n = 2$, with $x_1 = 1$, $x_2 = 2$, and $x_3 = 3$.
(c) When $n = 2$, with $x_1 = -1$, $x_2 = 0$, and $x_3 = 4$.
(d) When $n = 3$, with $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, and $x_4 = 4$.

5.2. Let $p(x)$ be the fourth degree polynomial that's zero when $x = -1, 0, 2, 3$, and $p(4) = 1$. What is $p(1)$?

5.3. Let $p(x)$ be the third degree polynomial that equals one when $x = -1, 1, 2$, and $p(3) = 0$. What is $p(0)$?

5.4. In this problem the data are: $(x_1, y_1) = (0, 1)$, $(x_2, y_2) = (1, 2)$, and $(x_3, y_3) = (2, -1)$.
(a) Find and sketch the Lagrange form of the interpolation polynomial.
(b) Find and sketch the modified Lagrange form of the interpolation polynomial.
(c) Find and sketch the piecewise linear interpolation function.
(d) Following the approach used in Example 5.5, find and then sketch the natural cubic spline.

5.5. Do Exercise 5.4, but use the data: $(x_1, y_1) = (-1, -1)$, $(x_2, y_2) = (0, 4)$, and $(x_3, y_3) = (1, 1)$.

5.6. In this problem the data are: $(x_1, y_1) = (0, 0)$, $(x_2, y_2) = (1, 2)$, $(x_3, y_3) = (2, 2)$, and $(x_4, y_4) = (3, 6)$.
(a) Find and sketch the Lagrange form of the interpolation polynomial.
(b) Find and sketch the modified Lagrange form of the interpolation polynomial.
(c) Find and sketch the piecewise linear interpolation function.
(d) Following the approach used in Example 5.5, find and then sketch the not-a-knot spline.

5.7. Do Exercise 5.6, but use the data: $(x_1, y_1) = (-1, 6)$, $(x_2, y_2) = (0, 4)$, $(x_3, y_3) = (1, 0)$, and $(x_4, y_4) = (2, 0)$.

5.8. In this problem $x_1 = 0$, $x_2 = 2$, $x_3 = 4$, $x_4 = 5$, and
$$p_3(x) = -2\ell_1(x) + \ell_2(x) - 2\ell_3(x) + 3\ell_4(x), \quad \text{for } 0 \le x \le 5.$$
(a) What data points $(x_i, y_i)$ were used to produce this interpolation polynomial?
(b) What is $\ell_2(x)$?
(c) Determine $p_3(1)$.
(d) Find the modified Lagrange form of the interpolation polynomial.


5.9. Assume that the modified Lagrange interpolation formula is  p3 (x) = (x) −

 1 1 3 1 + + + , 24(x + 1) x − 1 3(x − 2) 8(x − 3)

where $\ell(x) = (x^2 - 1)(x - 2)(x - 3)$.
(a) What data points $(x_i, y_i)$ were used to produce this interpolation polynomial?
(b) Determine $p_3(0)$.
(c) Find the Lagrange form of the interpolation polynomial.

5.10. In this problem $x_1 = -1$, $x_2 = 1$, $x_3 = 2$, $x_4 = 5$, and
$$g(x) = G_1(x) + 4G_2(x) - G_3(x) + 7G_4(x), \quad \text{for } -1 \le x \le 5.$$
(a) What data points $(x_i, y_i)$ were used to produce this interpolation function?
(b) What is $G_2(x)$?
(c) Determine $g(0)$.
(d) Suppose that the point $(6, -2)$ is added to the data points in part (a). How does the formula for $g(x)$ change?

5.11. It is possible to simplify an interpolation formula that uses basis functions such as $\ell_i(x)$ and $G_i(x)$.
(a) Suppose there are four data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and $(x_4, y_4)$, so the $\ell_i$'s are cubics. Explain why $p_3(x) = A + B\ell_1(x) + C\ell_2(x) + D\ell_3(x)$ is a cubic polynomial. Also, determine $A$, $B$, $C$, and $D$.
(b) Generalize the formula in part (a) to the case when there are $n + 1$ data points, explain why it is an $n$th degree polynomial, and then find the corresponding coefficients.
(c) Show that there is a similar formula to the one obtained in part (b) for piecewise linear interpolation and the $G_i(x)$'s.

5.12. This problem considers the function
$$s(x) = \begin{cases} x^3 - 1 & \text{if } 0 \le x \le 1, \\ -x^3 + 6x^2 - 6x + 1 & \text{if } 1 \le x \le 2. \end{cases}$$
(a) Show that this is a cubic spline, and determine the data values used in its construction.
(b) Is this a natural cubic spline?
(c) Is this a not-a-knot spline?

5.13. Determine $s_1(x)$, or $s_2(x)$, so the following are natural cubic splines.
(a) $$s(x) = \begin{cases} 1 + 2x - x^3 & \text{if } 0 \le x \le 1, \\ s_2(x) & \text{if } 1 \le x \le 2. \end{cases}$$
(b) $$s(x) = \begin{cases} s_1(x) & \text{if } -1 \le x \le 1, \\ -9x^2 + x^3 & \text{if } 1 \le x \le 3. \end{cases}$$

5.14. In this problem $x_1 = -1$, $x_2 = 1$, $x_3 = 3$, $x_4 = 5$, and
$$s(x) = B_0(x) - B_1(x) + B_2(x) - B_3(x) + B_4(x) - B_5(x), \quad \text{for } -1 \le x \le 5.$$
(a) What is $x_0$ and what is $x_5$?
(b) What data points $(x_i, y_i)$ were used to produce this cubic spline?
(c) Is this a natural cubic spline?
(d) Determine $s'(1)$ and $s''(1)$.

5.15. In this problem $x_1 = 0$, $x_2 = 3$, $x_3 = 6$, $x_4 = 9$, and
$$s(x) = 2B_0(x) - B_1(x) + 2B_3(x) - B_4(x) + 2B_5(x), \quad \text{for } 0 \le x \le 9.$$
(a) What is $x_0$ and what is $x_5$?
(b) What data points $(x_i, y_i)$ were used to produce this cubic spline?
(c) Is this a natural cubic spline?
(d) Determine $s'(6)$ and $s''(6)$.

5.16. Assume that there are $n + 1$ data points, where $n \ge 3$, and $x$ is not a node.
(a) How many flops does it take to evaluate (5.5)?
(b) How many flops does it take to evaluate (5.8)?
(c) How many flops does it take to evaluate (5.27) in the case of a natural spline? Assume the Thomas algorithm is used to solve $Aa = f$.

5.17. Suppose $f(x) = \sin \pi x$ is going to be approximated using only data from the nodes: $x_1 = 0$, $x_2 = 1$, $x_3 = 2$, and $x_4 = 3$. Which method(s) covered in Sections 5.2–5.4, if any, will produce an interpolation function that is not identically zero?

5.18. The question considered is, if you use a cubic spline $s(x)$ to interpolate $f(x)$, and $f(x)$ is a cubic polynomial, is it true that $s(x) = f(x)$? You can assume that the spline $s(x)$ is unique (i.e., there is one, and only one, spline that satisfies the required conditions).
(a) Explain why the answer is yes if you use a clamped spline (so $y_i = f(x_i)$, $s'(x_1) = f'(x_1)$, and $s'(x_{n+1}) = f'(x_{n+1})$).
(b) Explain why the answer is no, not necessarily, if you use a natural spline. What two conditions must $f(x)$ satisfy so $s(x) = f(x)$?

(c) Explain why the answer is yes if you use a not-a-knot spline.

5.19. A way to handle a large jump in a function, and avoid large over- and under-shoots, is to use linear interpolation in the interval containing the jump and higher degree polynomials for the other intervals. This exercise explores some ways this can be done.
(a) Suppose the data points are: $(0, 0)$, $(1, 0)$, $(2, 10)$. Also, assume that the interpolation function $p(x)$ is
$$p(x) = \begin{cases} a_1 + b_1 x + c_1 x^2 + d_1 x^3 & \text{if } 0 \le x \le 1, \\ a_2 + b_2 x & \text{if } 1 \le x \le 2. \end{cases}$$
The requirements are that $p(x)$ interpolates the data, $p'(x)$ is continuous at $x = 1$, and $p''(0) = 0$. Determine $p(x)$ and then sketch the resulting interpolation function.
(b) In part (a), instead of using a natural spline, the $p''(0) = 0$ condition is replaced with the requirement that $p''(x)$ is continuous at $x = 1$. Determine $p(x)$ and then sketch the resulting interpolation function.
(c) Does part (a) or part (b) produce the smoother interpolation function? Which one results in smaller over- and under-shoots?

5.20. If $f(x)$ is an even function you would expect that an interpolation polynomial for this function, over a symmetric interval $-\ell \le x \le \ell$, would only involve even powers of $x$. This exercise explores how Lagrange interpolation can be adapted in such cases. Assume that $x_i^2 \ne x_j^2$ if $i \ne j$.
(a) Suppose the data points are $(x_1, y_1)$ and $(x_2, y_2)$, where $-\ell \le x_1 < x_2 \le \ell$. Assuming $p_2(x) = a_0 + a_2 x^2$, show that $p_2(x) = y_1 \ell_1(x) + y_2 \ell_2(x)$, where
$$\ell_1(x) = \frac{x^2 - x_2^2}{x_1^2 - x_2^2} \quad \text{and} \quad \ell_2(x) = \frac{x^2 - x_1^2}{x_2^2 - x_1^2}.$$
(b) Suppose $f(x) = \cos(\pi x)$ and $\ell = 1$. If $x_1 = -1$ and $x_2 = 1/3$, find $p_2(x)$, and then sketch it and $f(x)$ on the same axes, for $-1 \le x \le 1$.
(c) Is it possible to obtain the interpolation function in part (b) using regular Lagrange interpolation of $f(x)$?
(d) Suppose the data points are $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$, where $-\ell \le x_1 < x_2 < x_3 \le \ell$. The interpolation polynomial can be written as $p_4(x) = y_1 \ell_1(x) + y_2 \ell_2(x) + y_3 \ell_3(x)$. Based on the result from part (a), guess the form for the $\ell_i$'s and then verify that $p_4(x)$ satisfies the required interpolation conditions.

5.21. It is possible to interpolate both a function and its derivative at each node. This is called Hermite interpolation, and it is used in constructing Bézier curves and in solving higher-order differential equations.
(a) Suppose there are two nodes, $x_1$ and $x_2$. Also, at each node the value of the function $y_i$ and the value of its derivative $y_i'$ are given. Setting $L_1(x) = (a_1 + b_1 x)(x - x_2)^2$ and $L_2(x) = (a_2 + b_2 x)(x - x_1)^2$, show that $L_1(x_2) = L_1'(x_2) = 0$ and $L_2(x_1) = L_2'(x_1) = 0$. With this, letting $p(x) = (a_1 + b_1 x)L_1(x) + (a_2 + b_2 x)L_2(x)$, determine the 4 coefficients using the 4 interpolation requirements: $p(x_i) = y_i$ and $p'(x_i) = y_i'$.
(b) Suppose there are three nodes, $x_1$, $x_2$, and $x_3$. How does the formula for $L_1(x)$ change? What about $L_2(x)$? There is now an $L_3(x)$; what is the formula for it? Write down the formula for $p(x)$ and then determine the 6 coefficients using the 6 interpolation requirements.
(c) Suppose there are $n + 1$ nodes. What degree polynomial is needed for Hermite interpolation?

Computational Questions for Sections 5.2–5.4

5.22. The data below describe a cross-section of an airfoil where the points $(X, Y_u)$ define the upper curve and the points $(X, Y_\ell)$ describe the lower curve.
X = [0, 0.005, 0.0075, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Yu = [0, 0.0102, 0.0134, 0.017, 0.025, 0.0376, 0.0563, 0.0812, 0.0962, 0.1035, 0.1033, 0.095, 0.0802, 0.0597, 0.034, 0]
Yℓ = [0, −0.0052, −0.0064, −0.0063, −0.0064, −0.006, −0.0045, −0.0016, 0.001, 0.0036, 0.007, 0.0121, 0.017, 0.0199, 0.0178, 0]
(a) Draw the airfoil by fitting natural cubic splines separately to the upper and lower curves, and then plotting the results as a single figure. To make it look like an airfoil you will probably need to resize the plot window.
(b) Redo (a) but use global polynomial interpolation instead of splines.

5.23. The data considered here are the temperatures over a 24-hour period, in two-hour increments. So, $x_1 = 0$, $x_2 = 2$, $x_3 = 4$, $x_4 = 6$, $\cdots$, $x_{13} = 24$. The $y_i$'s are the corresponding temperatures. For example, the temperatures in Troy, NY on June 21 are given in Table 5.4 (in °F).
(a) Fit this data with: i) a global polynomial using Lagrange interpolation, and ii) a natural cubic spline using cubic B-splines. Plot these two curves and the data on the same axes. Make sure to include a legend in your plot.

x     0    2    4    6    8   10   12   14   16   18   20   22   24
y    59   56   53   54   60   67   72   74   75   74   70   65   61

Table 5.4 Sample temperature data over a 24 hour period for Exercise 5.23. Note x = 0 and x = 24 correspond to midnight.

x       0      1      2      3      4      5      6      7      8      9     10
y   106.0  123.2  132.2  151.3  179.3  203.2  226.5  248.7  281.4  308.7  331.4

Table 5.5 Sample population data for Exercises 5.24 and 5.38. Note x = 0 corresponds to 1920 and x = 10 corresponds to 2020.

(b) What does each of the two interpolation functions give as the temperature at 11AM?
(c) What do the two interpolation functions predict the temperature will be at 1AM the next day?
(d) What do the two interpolation functions predict the temperature will be at 9AM the next day? Explain why the spline predicts the value it does.

5.24. The data considered here are the population of a country, every 10 years. Let $x_1 = 0$, $x_2 = 1$, $x_3 = 2$, $\cdots$, $x_{11} = 10$ be the year measured in decades starting from 1920. So, $x_1$ corresponds to 1920, $x_2$ to 1930, $x_3$ to 1940, etc. The $y_i$'s are the corresponding population values, given in millions. For example, the population of the United States is given in Table 5.5, and this is from the Wikipedia page Demographics of the United States.
(a) Fit this data with: i) a global polynomial using Lagrange interpolation, and ii) a natural cubic spline using cubic B-splines. Plot these two curves and the data on the same axes. Make sure to include a legend in your plot.
(b) What does each of the two interpolation functions give as the population in 2015?
(c) What does each of the two interpolation functions predict the population will be in 2030?

5.25. A curve in the plane can be parameterized as $x = f(t)$, $y = g(t)$, where $0 \le t \le T$. Data values for $f$ and $g$ are given in Table 5.6. In this exercise the data are interpolated and the resulting curve displayed. The original curve is an ellipse, and the question is how close the interpolated curve comes to looking like an ellipse.

t        1       2       3       4       5       6       7       8
f    1.687  −0.028  −1.721  −2.118  −0.921   0.970   2.131   1.687
g    3.469   1.797  −1.227  −3.328  −2.922  −0.316   2.528   3.469

Table 5.6 Curve data used in Exercise 5.25. Note that T = 8.

(a) Let $f_p(t)$ and $g_p(t)$ be the piecewise linear interpolation functions for $f(t)$ and $g(t)$, respectively. Plot $(f_p(t), g_p(t))$, for $1 \le t \le T$, and the data points $(f_i, g_i)$, on the same axes.
(b) Let $f_\ell(t)$ and $g_\ell(t)$ be the Lagrange interpolation functions for $f(t)$ and $g(t)$, respectively. Plot $(f_\ell(t), g_\ell(t))$, for $1 \le t \le T$, and the data points $(f_i, g_i)$, on the same axes.
(c) Let $f_s(t)$ and $g_s(t)$ be the natural cubic spline interpolation functions for $f(t)$ and $g(t)$, respectively. Plot $(f_s(t), g_s(t))$, for $1 \le t \le T$, and the data points $(f_i, g_i)$, on the same axes.
(d) Of those you plotted, which interpolation method provides the best looking ellipse? Is there a mathematical reason for this?
Comment: Because the curve is closed, this is a situation where it would be better to use periodic end conditions on the spline. In this case you require continuity of $s'$ and $s''$ at the reattachment point. This means that $s_7'(8) = s_1'(0)$ and $s_7''(8) = s_1''(0)$.

Section 5.5

5.26. A function $f(x)$ is going to be approximated using an interpolation function for $0 \le x \le 3$. The second, $f''(x)$, and fourth, $f''''(x)$, derivatives of the function are plotted in Figure 5.23.
(a) How many data points for piecewise linear interpolation are needed to guarantee the error is less than $10^{-8}$?
(b) How many data points for a clamped cubic spline are needed to guarantee the error is less than $10^{-8}$?

Figure 5.23 Graph used in Exercise 5.26.

5.27. The function $y = 1 + \cos(x)$ is going to be approximated using an interpolation function for $-\pi/4 \le x \le \pi/4$. For the given interpolation function, how many data points are needed to guarantee that the value of the interpolation function is correct to four significant digits at every point in the interval?
(a) Using piecewise linear interpolation at equally spaced nodes.
(b) Using a clamped cubic spline at equally spaced nodes.
(c) Using $p_n(x)$ at equally spaced nodes.
(d) Using Chebyshev interpolation.

5.28. Redo Exercise 5.27(a),(b), but use the following function and interval.
(a) $y = \sqrt{1 + x}$ with $0 \le x \le 1$.
(b) $y = x^{1/3}$ with $8 \le x \le 16$.
(c) $y = \ln(x)$ with $2 \le x \le 4$.

5.29. If $f(\theta) = \sin\theta$, then $f(0) = 0$, $f(\pi/4) = \sqrt{2}/2$, and $f(\pi/2) = 1$. Use these three data points to answer the following questions.
(a) Using piecewise linear interpolation, what is the estimated value of $f(\pi/6)$? What is the interpolation error in this estimate?
(b) Using a global interpolation polynomial, what is the estimated value of $f(\pi/6)$? What is the interpolation error in this estimate?
(c) Using natural cubic spline interpolation, what is the estimated value of $f(\pi/6)$? What is the interpolation error in this estimate?
(d) Using the additional information that $f'(0) = 1$ and $f'(\pi/2) = 0$, use clamped cubic spline interpolation to find an estimated value of $f(\pi/6)$. What is the interpolation error in this estimate?
(e) Suppose Chebyshev interpolation is used. Determine the three Chebyshev points in the interval, and then use Chebyshev interpolation to find an approximate value for $f(\pi/6)$. What is the interpolation error in this estimate?

5.30. The Bessel function of order zero can be written as
$$J_0(x) = \frac{1}{\pi} \int_0^\pi \cos(x \sin s)\,ds.$$
(a) Show that $|J_0(x)| \le 1$, $|J_0'(x)| \le 1$, $|J_0''(x)| \le 1$, and in fact, for any positive integer $k$,
$$\left| \frac{d^k}{dx^k} J_0(x) \right| \le 1.$$
In what follows, determine how many interpolation points over the interval $0 \le x \le 10$ are needed so the error is no more than $10^{-6}$.
(b) Using piecewise linear interpolation.
(c) Using a clamped cubic spline.
(d) Using Chebyshev interpolation.
(e) The Bessel function of order $m$ can be defined as
$$J_m(x) = \frac{1}{\pi} \int_0^\pi \cos(x \sin s - ms)\,ds.$$
How do your answers in parts (b)–(d) change for this function?

5.31. List the Chebyshev points for: a) $n = 4$ and $0 \le x \le 1$, b) $n = 4$ and $-1 \le x \le 1$, and c) $n = 6$ and $0 \le x \le 1$.

5.32. Evaluate: a) $T_n(0)$, b) $T_n'(0)$, c) $T_n(1)$, d) $T_n(-1)$, e) $T_n(1/2)$, and f) $T_n(-1/2)$.

5.33. This problem considers some of the difficulties interpolating the function $f(x) = \sqrt{x}$.
(a) If the interpolation interval is $1 \le x \le 10$, how many data points are needed for piecewise linear interpolation to guarantee that the error is less than $10^{-6}$?
(b) Explain why Theorem 5.4 is not so useful if the interval is $0 \le x \le 1$.
(c) One way to deal with the singularity at $x = 0$ is to break the interval into two segments, one is $0 \le x \le \delta$ and the other is $\delta \le x \le 1$, where $\delta$ is a small positive number. On the interval $0 \le x \le \delta$ the function is going to be interpolated with a single line. What is the equation for this line, and how small does $\delta$ need to be to guarantee that the approximation error is $10^{-6}$?
(d) Assuming that $\delta$ is known, how many data points over the interval $\delta \le x \le 1$ are needed for piecewise linear interpolation to guarantee that the error is less than $10^{-6}$?

5.34. The function $f(x) = 1/(1 + x^2)$ is to be approximated using a piecewise linear function $g(x)$ over the interval $0 \le x < \infty$. The requirement is that $|f(x) - g(x)| \le 10^{-4}$ for $0 \le x < \infty$. Explain how to determine the spacing of the $x_i$'s used in the construction of $g(x)$, and how you handle the approximation over the intervals $x_n \le x \le x_{n+1}$ and $x_{n+1} \le x < \infty$, where $x_{n+1}$ is the largest node you use.

5.35. Suppose Chebyshev interpolation is applied to a function $f(x)$ using eight interpolation points. Also, suppose that no matter what interpolation interval is used, the error for the Chebyshev interpolation is zero. What conclusion can you make about the original function $f(x)$?

5.36. In the case when $n = 0$ there is one node $x_1$ and $p_0(x)$ is a constant. The resulting approximation is $f(x) \approx a_0$ where $a_0 = f(x_1)$.
(a) What is $x_1$ if Chebyshev interpolation is used?
(b) Suppose that $f(x) = x(x - 2)$, $a = 0$, and $b = 3$. If $x_1 = 2$, show that the smallest value of $|f(x) - a_0|$ over the interval is 3, and it is 3/2 when $x_1 = 1/2$. It's recommended you sketch $f(x) - a_0$, for $a \le x \le b$, when answering this question. Also, in answering this question, explain how the max and min values of $f(x)$ are involved in determining the answers.
(c) Based on part (b), explain why the best constant approximation is obtained when taking $x_1$ so that $f(x_1) = \big(f(x_m) + f(x_M)\big)/2$, where $x_m$ and $x_M$ are points where the function achieves its minimum and maximum values over the interval, respectively.

(d) Give an example when Chebyshev interpolation gives the best constant approximation. The worst possible error for a constant interpolation function is $|f(x_M) - f(x_m)|$. Is it possible for Chebyshev interpolation to result in the worst possible approximation?
(e) Suppose that elevators are located at $x_1$, $x_2$, and $x_3$. Where should you stand in order to minimize the expected distance you need to walk to catch the first elevator that arrives?

Additional Questions

5.37. Given three nodes $x_{i-1}$, $x_i$, and $x_{i+1}$, this problem considers the quadratic interpolation formula
$$p_2(x) = y_{i-1}\ell_{i-1}(x) + y_i \ell_i(x) + y_{i+1}\ell_{i+1}(x).$$
It is assumed here that $x_i - x_{i-1} = h$ and $x_{i+1} - x_i = h$.
(a) Show that the above interpolation formula can be rewritten as
$$p_2(x) = y_i + \frac{1}{2h}(y_{i+1} - y_{i-1})(x - x_i) + \frac{1}{2h^2}(y_{i+1} - 2y_i + y_{i-1})(x - x_i)^2.$$
(b) Calculate $p_2'(x)$ and $p_2''(x)$.
(c) Suppose $p_2(x)$ interpolates $f(x)$ at $x_{i-1}$, $x_i$, and $x_{i+1}$, so $y_{i-1} = f(x_{i-1})$, $y_i = f(x_i)$, and $y_{i+1} = f(x_{i+1})$. Setting $x = x_i + \alpha h$, expand $f(x)$ and $p_2(x)$ about $h = 0$ and show that
$$f(x) = p_2(x) + \frac{1}{6} z(z^2 - h^2) f'''(x_i) + \frac{1}{24} z^2(z^2 - h^2) f''''(x_i) + \cdots,$$
where $z = x - x_i$.
(d) How does the result in part (c) compare to the result in Theorem 5.2 in the case when $a = x_i - h$, $b = x_i + h$, and $n = 2$?

5.38. This problem explores how to scale the data to help improve the computability of the interpolation polynomial. Suppose that the original data are: $(X_1, Y_1)$, $(X_2, Y_2)$, $\cdots$, $(X_{n+1}, Y_{n+1})$. The idea is to scale these values as follows:
$$x_i = \frac{X_i - X_m}{X_c} \quad \text{and} \quad y_i = \frac{Y_i - Y_m}{Y_c}.$$
The interpolation polynomial $p_n(x)$ is then determined using the $(x_i, y_i)$ values. With this, given a value of $X$, one finds $x = (X - X_m)/X_c$, evaluates $p_n(x)$, and then $Y = Y_m + Y_c\, p_n(x)$.
(a) Explain why in Exercise 5.24 $X_m = 1920$ and $X_c = 10$. What are $Y_m$ and $Y_c$?

252

5 Interpolation

(b) It is not uncommon to let Xm be the mean of the Xi ’s, and Xc the maximum of |Xi − Xm |. Similarly for Ym and Yc . How does Table 5.5 change by making these choices? (c) Fit the population data using Lagrange interpolation for the following three cases: (i) unscaled data (so Xm = Ym = 0 and Xc = Yc = 1), (ii) the scaled data used in Exercise 5.24, and (iii) the data using the scaling in part (b). Plot the three curves, and the data, on the same axes in the (X, Y )-plane. Any choice better than the others? Any choice worse than the others? 5.39. This problem concerns an alternative form for Lagrange interpolation. (a) Suppose (5.8) is used to interpolate the constant function f (x) = 1. Use this to show that n+1  wi /(x − xi ) = 1. (x) i=1

(b) Use the result from part (a) to show that

$$p_n(x) = \frac{\sum_{i=1}^{n+1} w_i y_i/(x - x_i)}{\sum_{i=1}^{n+1} w_i/(x - x_i)}.$$

This is called the barycentric Lagrange formula. Unlike the modified Lagrange formula, the barycentric version can be unstable unless you are careful where the nodes are placed [Higham, 2004]. The cause of the instability is the identity in part (a), which does not necessarily hold when using floating-point arithmetic.

5.40. This problem develops some of the basic properties of a cubic B-spline.

(a) Show that $B_i'(x) = \frac{1}{h}Q\left(\frac{1}{h}(x - x_i)\right)$, where, setting $s = \mathrm{sgn}(x)$,

$$Q(x) = \begin{cases} 0 & \text{if } |x| \ge 2, \\ -\frac{s}{2}(2 - |x|)^2 & \text{if } 1 \le |x| \le 2, \\ -2x\left(1 - \frac{1}{2}|x|\right) + \frac{1}{2}sx^2 & \text{if } 0 \le |x| \le 1. \end{cases}$$

(b) Show that

$$\int_{x_{i-2}}^{x_{i-1}} B_i\,dx = \int_{x_{i+1}}^{x_{i+2}} B_i\,dx = \frac{h}{24}, \quad \text{and} \quad \int_{x_{i-1}}^{x_i} B_i\,dx = \int_{x_i}^{x_{i+1}} B_i\,dx = \frac{11h}{24}.$$

5.41. Show that for a clamped cubic spline the matrix equation $A\mathbf{a} = \mathbf{f}$ that has to be solved has $\mathbf{a} = (a_1, a_2, \cdots, a_{n+1})^T$, $\mathbf{f} = (6y_1 + 2hy_1', 6y_2, \cdots, 6y_n, 6y_{n+1} - 2hy_{n+1}')^T$, and $A$ is an $(n+1) \times (n+1)$ tridiagonal matrix that is the same form as for the natural spline except that $a_{12} = 2$ and $a_{n+1,n} = 2$.

5.42. This problem concerns some of the inequalities arising for function interpolation.


(a) For $q_{n+1}(x)$, given in (5.30), assume $x$ is not one of the $x_i$’s. So, there is an $x_i$ so that $x_i < x < x_{i+1}$. With this, it is possible to write

$$q_{n+1}(x) = (x - x_i)(x - x_{i+1})\prod_{j=1}^{i-1}(x - x_j)\prod_{j=i+2}^{n+1}(x - x_j).$$

Assuming the nodes are equally spaced, show that $|(x - x_i)(x - x_{i+1})| \le \frac{1}{4}h^2$. Also show that $\left|\prod_{j=1}^{i-1}(x - x_j)\right| \le i!\,h^{i-1}$ and $\left|\prod_{j=i+2}^{n+1}(x - x_j)\right| \le (n+1-i)!\,h^{n-i}$. From this, show that

$$\|q_{n+1}\|_\infty \le \frac{1}{4}\,n!\,h^{n+1}.$$

Make sure to comment about the case of when $x$ equals one of the $x_i$’s.

(b) In most textbooks the error bound given for Chebyshev interpolation is

$$|f(x) - p_n(x)| \le \frac{1}{2^n (n+1)!}\left(\frac{b-a}{2}\right)^{n+1}\|f^{(n+1)}\|_\infty, \quad \text{for } a \le x \le b.$$

Using the fact that $\sqrt{2\pi n}\,(n/e)^n \le n!$, derive (5.36).

5.43. This problem considers a direct proof of Theorem 5.8, at least for the case of when $n = 1$. This will help demonstrate how this result is independent of the coefficients of the polynomial.

(a) If $P_1(x) = x + b_0$, explain why

$$\max_{-1 \le x \le 1}|P_1(x)| = \max\{\,|1 + b_0|,\ |-1 + b_0|\,\}.$$

(b) On the same axes, sketch $|1 + b_0|$ and $|-1 + b_0|$ as a function of $b_0$. Use this to explain why $\max_{-1 \le x \le 1}|P_1(x)| = 1 + |b_0|$. From this derive the result stated in the theorem.

5.44. This problem derives the formulas for a piecewise quadratic interpolation function. This function is written as

$$w(x) = \begin{cases} w_1(x) & \text{if } x_1 \le x \le x_2 \\ w_2(x) & \text{if } x_2 \le x \le x_3 \\ \ \vdots & \ \vdots \\ w_n(x) & \text{if } x_n \le x \le x_{n+1}, \end{cases}$$

where

$$w_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2, \quad \text{for } x_i \le x \le x_{i+1}.$$


The interpolation requirements are $w_i(x_i) = y_i$ and $w_i(x_{i+1}) = y_{i+1}$. Also, $w'(x)$ is required to be continuous, which means that $w_i'(x_{i+1}) = w_{i+1}'(x_{i+1})$ for $i = 1, 2, \cdots, n-1$.

(a) Explain why the stated requirements are not enough, and it is necessary to impose one additional condition. In this problem, it is assumed that $b_1$ is specified. Explain why this is the same as specifying the value of the slope $w'(x_1)$.

(b) From the stated requirements, deduce that $a_i = y_i$ for $i = 1, 2, \cdots, n+1$. Also, $b_i = -b_{i-1} + 2(y_i - y_{i-1})/h$ for $i = 2, 3, \cdots, n$, and $c_i = (y_{i+1} - y_i - hb_i)/h^2$ for $i = 1, 2, \cdots, n$.

(c) Explain why the results from parts (a) and (b) mean that one can determine the coefficients for $w_1$, then determine the coefficients for $w_2$, then for $w_3$, etc.

(d) Use $w(x)$ to interpolate $f(x) = \cos(6\pi x)$, over the interval $0 \le x \le 1$. Plot $w(x)$ and $f(x)$ for the following cases: i) $n = 5$, ii) $n = 10$, iii) $n = 15$, and iv) $n = 30$. Comment on how well the interpolation method works for this particular function.

Chapter 6

Numerical Integration

The objective of this chapter is to derive and then test methods that can be used to evaluate the definite integral

$$\int_a^b f(x)\,dx. \tag{6.1}$$

The need for this is easy to explain given the complexity of integrals that often arise in applications. As an example, to find the deformation of an elastic body when compressed by a rigid punch it is necessary to evaluate [Gladwell, 1980]    2α sinh(x) − 2x − 1 sin(λx)dx. (6.2) 1 + α2 + x2 + 2α cosh(x) 0 Or, in the study of the emissions from a pulsar it is necessary to evaluate [Gwinn et al., 2012]      1 α3 + β3 x α1 + β1 x + γ1 x2 1−x exp K2 dx, 2 3 α2 + β2 x α4 + β4 x 0 α0 + β0 x + γ0 x + δ0 x where K2 is the modified Bessel function. There are other, equally important, situations that arise. One is when f (x) is only known from data. In this case you do not have a formula for f (x) as in the above integrals, and are stuck with whatever points are given in the data set. In such situations the interpolation functions developed in the previous chapter are very useful. A second important situation arises when solving a differential equation. In this case the task is to derive a procedure to solve the differential equation, and the methods developed in this chapter are key tools for doing this (as explained in Chapter 7). An illustration of this is given in Exercises 6.19 and 6.20. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0 6


6.1 The Definition from Calculus

The definition of a definite integral introduced in calculus serves as the basis for many of the simpler numerical integration methods. To review this definition, one first subdivides the interval as shown in Figure 6.1. In this case, $x_1 < x_2 < \cdots < x_{n+1}$, where $a = x_1$ and $b = x_{n+1}$. The reason for this is the additive property of areas, which for an integral can be written as

$$\int_a^b f(x)\,dx = \int_{x_1}^{x_2} f(x)\,dx + \int_{x_2}^{x_3} f(x)\,dx + \cdots + \int_{x_n}^{x_{n+1}} f(x)\,dx = \sum_{i=1}^{n}\int_{x_i}^{x_{i+1}} f(x)\,dx. \tag{6.3}$$

Out of each subinterval $[x_i, x_{i+1}]$ one picks a point $c_i$ and then approximates the area with the quantity $f(c_i)(x_{i+1} - x_i)$. This is illustrated in Figure 6.2. Assuming that $f \in C[a, b]$, the value of the integral is approached as the number of subdivisions increases. In other words,

$$\int_a^b f(x)\,dx \equiv \lim_{n \to \infty}\sum_{i=1}^{n} f(c_i)(x_{i+1} - x_i).$$

This is useful because given any particular subdivision we obtain the approximation

$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(c_i)(x_{i+1} - x_i).$$

Moreover, we are free to pick $c_i$ from the subinterval $[x_i, x_{i+1}]$ any way we want, so we are able to produce different numerical methods depending on how we make this choice. The examples usually considered in calculus involve picking one of the endpoints, namely either $c_i = x_i$ or $c_i = x_{i+1}$. Another is the midpoint $c_i = \frac{1}{2}(x_i + x_{i+1})$. Whatever choice is made, the resulting integration rule over each subinterval is

$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx f(c_i)(x_{i+1} - x_i). \tag{6.4}$$

Figure 6.1 Subdivision of the interval of integration in the case of when n = 4.

Figure 6.2 Rectangular region used as an approximation of the area over the subinterval.
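The freedom in choosing $c_i$ can be illustrated with a short Python sketch. It assumes equally spaced nodes, and the function name `riemann_sum`, the `rule` argument, and the test integrand $e^{3x}$ on $[0, 1]$ are illustrative choices rather than anything prescribed by the text.

```python
import math

def riemann_sum(f, a, b, n, rule="midpoint"):
    """Sum f(c_i) * h over n equal subintervals of [a, b], as in (6.4).

    rule picks c_i: "left" (c_i = x_i), "right" (c_i = x_{i+1}),
    or "midpoint" (c_i = x_i + h/2).
    """
    fraction = {"left": 0.0, "right": 1.0, "midpoint": 0.5}[rule]
    h = (b - a) / n
    total = 0.0
    for i in range(n):              # i = 0 corresponds to x_1 = a in the text
        c_i = a + (i + fraction) * h
        total += f(c_i) * h
    return total

f = lambda x: math.exp(3 * x)       # test integrand, chosen for illustration
exact = (math.exp(3) - 1) / 3       # exact value of the integral over [0, 1]

for rule in ("left", "right", "midpoint"):
    approx = riemann_sum(f, 0.0, 1.0, 100, rule)
    print(f"{rule:8s} {approx:.8f}  error = {abs(exact - approx):.2e}")
```

For an increasing integrand such as this one, the left and right endpoint sums bracket the exact value, while the midpoint sum lands much closer to it.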

In what follows it is assumed that the $x_i$’s are equally spaced. Consequently, $x_i = a + (i-1)h$, for $i = 1, 2, \cdots, n+1$, where $h = (b-a)/n$.

Derivation of the Error

In calculus the question is whether the method converges as $n \to \infty$. In computing, it is important to know just how fast it converges. This is the reason for asking: just how accurate is the approximation in (6.4)? Taylor’s theorem, as usual, can be used to answer this question. First, expanding about $x = c_i$ we get

$$f(x) = f(c_i) + (x - c_i)f'(c_i) + \frac{1}{2}(x - c_i)^2 f''(c_i) + \cdots. \tag{6.5}$$

With this,

$$\begin{aligned} \int_{x_i}^{x_{i+1}} f(x)\,dx &= \int_{x_i}^{x_{i+1}}\left[f(c_i) + (x - c_i)f'(c_i) + \frac{1}{2}(x - c_i)^2 f''(c_i) + \cdots\right]dx \\ &= f(c_i)h + hz_i f'(c_i) + \frac{1}{24}h(12z_i^2 + h^2)f''(c_i) + \cdots, \end{aligned}$$

where $h = x_{i+1} - x_i$ and $z_i = x_i + \frac{h}{2} - c_i$. Therefore, when using the approximation in (6.4), the error in the approximation is

$$E = hz_i f'(c_i) + \frac{1}{24}h(12z_i^2 + h^2)f''(c_i) + \cdots. \tag{6.6}$$

So, if you take $c_i = x_i$, then $z_i = h/2$ and $E = \frac{1}{2}h^2 f'(x_i) + \frac{1}{6}h^3 f''(x_i) + \cdots$. In other words, for small values of $h$, $E = O(h^2)$. Similarly, if $c_i = x_{i+1}$, then $z_i = -h/2$ and $E = O(h^2)$. To get a better error we need to pick $c_i$ so that $z_i = 0$, and this means we pick $c_i = x_i + \frac{h}{2}$. The error in this case is $E = \frac{1}{24}h^3 f''(c_i) + \cdots = O(h^3)$. In fact, the midpoint is the only choice for $c_i$ that guarantees an error that is $O(h^3)$ independently of the particular function $f(x)$.
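These orders can be checked numerically on a single subinterval: halving $h$ should cut an $O(h^2)$ error by about a factor of 4 and an $O(h^3)$ error by about a factor of 8. The Python sketch below is one way to do this; the helper names, the integrand $e^{3x}$, the starting point $x_i = 0$, and the width $h = 0.01$ are all assumptions made for illustration.

```python
import math

f = lambda x: math.exp(3 * x)        # illustrative integrand
F = lambda x: math.exp(3 * x) / 3    # an antiderivative, used for the exact integral

def one_interval_error(x_i, h, c_i):
    """Exact integral over [x_i, x_i + h] minus the approximation f(c_i) * h."""
    return (F(x_i + h) - F(x_i)) - f(c_i) * h

def halving_ratio(pick_c, x_i=0.0, h=0.01):
    """Ratio of the errors obtained with width h and with width h/2."""
    e_h = one_interval_error(x_i, h, pick_c(x_i, h))
    e_half = one_interval_error(x_i, h / 2, pick_c(x_i, h / 2))
    return e_h / e_half

left_ratio = halving_ratio(lambda x, h: x)               # c_i = x_i
midpoint_ratio = halving_ratio(lambda x, h: x + h / 2)   # c_i = x_i + h/2

print(left_ratio)       # close to 4, consistent with O(h^2)
print(midpoint_ratio)   # close to 8, consistent with O(h^3)
```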

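6.1.1 Midpoint Rule

It is worth examining the $c_i = x_i + \frac{h}{2}$ choice in more detail. We have just shown that

$$\int_{x_i}^{x_{i+1}} f(x)\,dx = hf\left(x_i + \frac{h}{2}\right) + O(h^3). \tag{6.7}$$

This gives rise to what is known as the midpoint rule. When this is put into the formula for the definite integral (6.3) we get

$$\begin{aligned} \int_a^b f(x)\,dx &= hf\left(x_1 + \frac{h}{2}\right) + O(h^3) + hf\left(x_2 + \frac{h}{2}\right) + O(h^3) + \cdots + hf\left(x_n + \frac{h}{2}\right) + O(h^3) \\ &= h\left[f\left(x_1 + \frac{h}{2}\right) + f\left(x_2 + \frac{h}{2}\right) + \cdots + f\left(x_n + \frac{h}{2}\right)\right] + n\,O(h^3) \\ &= I_M + O(h^2), \end{aligned} \tag{6.8}$$

where

$$I_M = h\left[f\left(x_1 + \frac{h}{2}\right) + f\left(x_2 + \frac{h}{2}\right) + \cdots + f\left(x_n + \frac{h}{2}\right)\right]. \tag{6.9}$$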

The expression in (6.9) is known as the composite midpoint rule. Note that the order of the error for the composite rule in (6.8) is one power of $h$ lower than for the integration rule (6.7). This happens because $n = (b-a)/h$. So, the term $n\,O(h^3)$ turns into $O(h^2)$.

The error formula in (6.6) was derived using a Taylor series, which enabled us to conclude in (6.8) that the error $E_M$ of the composite midpoint rule is $O(h^2)$. If you instead use Taylor’s theorem with remainder, it is possible to show that $E_M = \frac{b-a}{24}h^2 f''(\eta)$, where $\eta$ is a point somewhere in the interval $a < x < b$. With this, one can prove the following result.

Theorem 6.1. If $f \in C^2[a, b]$ then the composite midpoint rule (6.9) satisfies

$$\left|\int_a^b f(x)\,dx - I_M\right| \le \frac{b-a}{24}h^2\|f''\|_\infty, \tag{6.10}$$

where $\|f''\|_\infty = \max_{a \le x \le b}|f''(x)|$.
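A minimal Python sketch of the composite midpoint rule (6.9), together with the number of subintervals implied by the bound (6.10), is given below. The function names are hypothetical, the caller is assumed to supply the value of $\|f''\|_\infty$, and the integrand $e^{3x}$ on $[0, 1]$ is used only as a test case.

```python
import math

def composite_midpoint(f, a, b, n):
    """Composite midpoint rule (6.9) with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def midpoint_subintervals(a, b, max_abs_f2, tol):
    """Smallest n for which the bound (6.10) is at most tol.

    max_abs_f2 is the user-supplied value of max |f''| on [a, b].
    With h = (b - a)/n the bound reads (b - a)^3 * max_abs_f2 / (24 n^2).
    """
    return math.ceil(math.sqrt((b - a) ** 3 * max_abs_f2 / (24 * tol)))

f = lambda x: math.exp(3 * x)
exact = (math.exp(3) - 1) / 3

print(composite_midpoint(f, 0.0, 1.0, 3))      # three subintervals
n_needed = midpoint_subintervals(0.0, 1.0, 9 * math.exp(3), 1e-8)
print(n_needed)

n = 500
error = abs(exact - composite_midpoint(f, 0.0, 1.0, n))
bound = (1.0 / 24) * (1.0 / n) ** 2 * 9 * math.exp(3)
print(error, bound)                             # the error should not exceed the bound
```

Because (6.10) is an upper bound, the observed error is typically smaller than the bound, sometimes by a noticeable factor.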


Example 6.1

(1) Suppose the composite midpoint rule is used to approximate the value of

$$\int_0^1 e^{3x}\,dx. \tag{6.11}$$

Using three subintervals, the approximation shown in Figure 6.3 is obtained. In this case, $h = 1/3$, and the midpoints are $x_1 + \frac{h}{2} = 1/6$, $x_2 + \frac{h}{2} = 1/2$, and $x_3 + \frac{h}{2} = 5/6$. With this, (6.9) becomes

$$I_M = \frac{1}{3}\left(e^{1/2} + e^{3/2} + e^{5/2}\right) \approx 6.1043.$$

(2) According to Theorem 6.1, how many subintervals are necessary to guarantee an error of $10^{-8}$ if the composite midpoint rule is used to evaluate (6.11)?

Answer: From (6.10), we want $\frac{b-a}{24}h^2\|f''\|_\infty \le 10^{-8}$. To get this, note

$$\|f''\|_\infty = \max_{0 \le x \le 1} 9e^{3x} = 9e^3.$$

Also, $h = (b-a)/n = 1/n$. So, we want $9e^3/(24n^2) \le 10^{-8}$, which can be rewritten as

$$n \ge \sqrt{\tfrac{3}{8}e^3} \times 10^4 \approx 27{,}444.6.$$

Therefore, according to Theorem 6.1, if we use 27,445 subintervals then the error is no more than $10^{-8}$. To check on this, the computed values are given in Table 6.1, along with the absolute value of the error

$$E_M(n) = \int_a^b f(x)\,dx - I_M(n), \tag{6.12}$$

Figure 6.3 Composite midpoint rule used with $f(x) = e^{3x}$ for 3 subintervals. The area of the dashed (red) boxes is used as the approximate area for the area under the curve. [Axes: x-axis from 0 to 1, y-axis from 0 to 20.]

| k | n     | I_M               | \|E_M\| | Relative Iterative Error |
|---|-------|-------------------|---------|--------------------------|
| 1 | 500   | 6.361836098304112 | 9.5e−06 |                          |
| 2 | 1000  | 6.361843255371067 | 2.4e−06 | 1.1e−06                  |
| 3 | 2000  | 6.361845044639554 | 6.0e−07 | 2.8e−07                  |
| 4 | 4000  | 6.361845491956809 | 1.5e−07 | 7.0e−08                  |
| 5 | 8000  | 6.361845603786101 | 3.7e−08 | 1.8e−08                  |
| 6 | 16000 | 6.361845631743438 | 9.3e−09 | 4.4e−09                  |
| 7 | 32000 | 6.361845638732778 | 2.3e−09 | 1.1e−09                  |
| 8 | 64000 | 6.361845640480135 | 5.8e−10 | 2.7e−10                  |

Table 6.1 Values of (6.11) when computed using the composite midpoint rule $I_M(n)$, along with the error $|E_M(n)|$ and the relative iterative error.



as well as the relative iterative error $\left|\big(I_M(n_k) - I_M(n_{k-1})\big)/I_M(n_k)\right|$. What is seen is that the desired error of $10^{-8}$ is obtained using a smaller number of subintervals than 27,445. This is not surprising, since the inequality in Theorem 6.1 is based on a worst-case assumption. Also, as is evident in Figure 6.4, the error and the relative iterative error both decrease as $O(h^2)$. For example, if you increase $n$ by a factor of 10, then both errors decrease by a factor of 100.

There are two useful observations coming from the above example. First, the reason the midpoint rule works better than the other possible rectangular approximations is evident in Figure 6.3. Each box splits the curve in such a way that the (unsigned) area it misses on the right is about the same as the (unsigned) area it misses on the left.

Figure 6.4 The error $|E_M|$, and the relative iterative error (RIE), for the composite midpoint rule from Table 6.1. [Log-log plot; horizontal axis: n-axis from $10^3$ to $10^5$; vertical axis: Error from $10^{-10}$ to $10^{-5}$.]


The second observation is that numerical integration is an iterative procedure, similar to Newton’s method in Chapter 2 and the power method in Chapter 4. Using Table 6.1 as an example, you keep increasing $n$ until the relative iterative error drops below some given error tolerance tol. If tol $= 10^{-8}$, then the computation would stop at $n = 16000$.

Example 6.2

According to Theorem 6.1, what functions will the composite midpoint rule integrate exactly, no matter what the value of $h$?

Answer: This requires $\|f''\|_\infty = 0$, which means that $f''(x) = 0$ for $a \le x \le b$. Therefore, there will be zero error if $f(x) = \alpha + \beta x$.

As a final comment, in looking at Table 6.1 it is evident that a large number of subintervals is required to obtain an accurate value of the integral using the composite midpoint rule. This is a concern because the function being integrated, shown in Figure 6.3, is very simple. For more interesting functions, such as the one in Figure 6.13, the required number of subintervals is enormous. The point of this observation is that we need better ways to compute an integral, and this is one of the objectives of this chapter.
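The iterative procedure described above, doubling $n$ until the relative iterative error falls below tol, can be sketched in Python as follows. The helper `refine_until`, its arguments, and the cap on the number of doublings are assumptions made for this illustration; the starting value $n = 500$ matches the first row of Table 6.1.

```python
import math

def composite_midpoint(f, a, b, n):
    """Composite midpoint rule with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def refine_until(rule, f, a, b, n, tol, max_doublings=25):
    """Double n until the relative iterative error drops below tol.

    Returns the last computed value and the n that produced it.
    Assumes the integral is not zero, since the test divides by the value.
    """
    previous = rule(f, a, b, n)
    for _ in range(max_doublings):
        n *= 2
        current = rule(f, a, b, n)
        if abs((current - previous) / current) < tol:
            return current, n
        previous = current
    raise RuntimeError("tolerance not reached within max_doublings")

value, n_final = refine_until(composite_midpoint, lambda x: math.exp(3 * x),
                              0.0, 1.0, 500, 1e-8)
print(n_final, value)
```

Passing the rule in as an argument is a design choice that lets the same loop be reused with other composite rules.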

6.2 Methods Based on Polynomial Interpolation

With the composite midpoint rule we have an $O(h^2)$ method that requires $n$ function evaluations. This raises the question of whether we can do better and find methods that are, say, $O(h^4)$ and still only require about $n$ function evaluations. The answer is definitely yes, and it involves using a better approximation of $f(x)$. For the composite midpoint rule we used a piecewise constant approximation (as illustrated in Figure 6.3). In what follows, we will simply use better piecewise polynomial approximations.

6.2.1 Trapezoidal Rule

Instead of a constant we will use a linear approximation of $f(x)$ over $x_i \le x \le x_{i+1}$. In particular, we will use the linear function that interpolates the function at the endpoints (see Figure 6.5). From the point-slope formula, we have that

$$g_i(x) = f_i + \frac{1}{h}(f_{i+1} - f_i)(x - x_i),$$

where $f_i = f(x_i)$ and $f_{i+1} = f(x_{i+1})$. So, using the approximation $f(x) \approx g_i(x)$, we get that

$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \int_{x_i}^{x_{i+1}} g_i(x)\,dx = \frac{h}{2}(f_{i+1} + f_i). \tag{6.13}$$

Figure 6.5 To derive the trapezoidal rule the function is approximated using linear interpolation.

This is known as the trapezoidal rule for reasons that should be evident from Figure 6.5. As with the midpoint rule, we need to know how the error depends on $h$. This can be determined using Theorem 5.2, on page 226, or directly using Taylor’s theorem. The latter only requires what you learned in Calculus II, so it will be the approach considered here. Expanding about $x = x_i$, we get that

$$f(x) = f_i + (x - x_i)f_i' + \frac{1}{2}(x - x_i)^2 f_i'' + \frac{1}{6}(x - x_i)^3 f_i''' + \cdots,$$

where $f_i' = f'(x_i)$, $f_i'' = f''(x_i)$, and $f_i''' = f'''(x_i)$. From this we have that $f_{i+1} = f(x_i + h) = f_i + hf_i' + \frac{1}{2}h^2 f_i'' + \frac{1}{6}h^3 f_i''' + \cdots$. Substituting this into the formula for $g_i(x)$ and then rearranging things yields $f(x) = g_i(x) + p(x)$, where

$$p(x) = \frac{1}{2}(x - x_i)(x - x_{i+1})f_i'' + \frac{1}{6}(x - x_{i-1})(x - x_i)(x - x_{i+1})f_i''' + \cdots.$$

With this,

$$\begin{aligned} \int_{x_i}^{x_{i+1}} f(x)\,dx &= \int_{x_i}^{x_{i+1}} [g_i(x) + p(x)]\,dx \\ &= \frac{h}{2}(f_{i+1} + f_i) - \frac{1}{12}h^3 f''(x_i) - \frac{1}{24}h^4 f'''(x_i) + \cdots. \end{aligned} \tag{6.14}$$

Therefore, for small values of $h$, the error for the trapezoidal rule is $O(h^3)$.


For the definite integral (6.3), when adding the contributions of the subintervals together, you get that

$$\int_a^b f(x)\,dx = I_T + E_T,$$

where

$$I_T = h\left(\frac{1}{2}f_1 + f_2 + f_3 + \cdots + f_n + \frac{1}{2}f_{n+1}\right) \tag{6.15}$$

is the composite trapezoidal rule. The error $E_T$ is $O(h^2)$. The order is one power of $h$ lower than for the integration rule in (6.13) for the same reason as happened for the composite midpoint rule. By keeping track of the exact formula for $E_T$, the following result is obtained.

Theorem 6.2. If $f \in C^2[a, b]$ then the composite trapezoidal rule (6.15) satisfies

$$\left|\int_a^b f(x)\,dx - I_T\right| \le \frac{b-a}{12}h^2\|f''\|_\infty, \tag{6.16}$$

where $\|f''\|_\infty = \max_{a \le x \le b}|f''(x)|$.

Example 6.3

(1) Suppose the composite trapezoidal rule is used to approximate the value of

$$\int_0^1 e^{3x}\,dx. \tag{6.17}$$

Using three subintervals, the approximation shown in Figure 6.6 is obtained. In this case, $h = 1/3$, and (6.15) becomes

$$I_T = \frac{1}{3}\left(\frac{1}{2} + e + e^2 + \frac{1}{2}e^3\right) \approx 6.8834.$$

(2) According to Theorem 6.2, how many subintervals are necessary to guarantee an error of $10^{-8}$ if the composite trapezoidal rule is used to evaluate (6.17)?

Answer: From (6.16), we want $\frac{b-a}{12}h^2\|f''\|_\infty \le 10^{-8}$. It was shown earlier that $\|f''\|_\infty = 9e^3$. So, the requirement is $\frac{3}{4}h^2 e^3 \le 10^{-8}$. Since $h = 1/n$, then $n \ge \frac{1}{2}\sqrt{3e^3} \times 10^4 \approx 38{,}812.6$. Therefore, according to Theorem 6.2, we should use at least 38,813 subintervals so the error is no more than $10^{-8}$. To check on this, the computed values are given in Table 6.2, along with the computed values of the error

$$E_T(n) = \int_a^b f(x)\,dx - I_T(n). \tag{6.18}$$

What is seen is that the desired error of $10^{-8}$ is obtained using a somewhat smaller number of subintervals than the predicted value of 38,813. As pointed out for the midpoint rule, this is not surprising, since the inequality in Theorem 6.2 is based on a worst-case assumption.

(3) Using the composite trapezoidal rule, find a value of (6.17) that is correct to approximately 8 significant digits.

Figure 6.6 Composite trapezoidal rule used with $f(x) = e^{3x}$ for 3 subintervals. The area of the dashed (red) trapezoids is used as the approximate area for the area under the curve. [Axes: x-axis from 0 to 1, y-axis from 0 to 20.]

Answer: We want $\left|\big(I_T(n_k) - I_T(n_{k-1})\big)/I_T(n_k)\right| < 10^{-8}$. Using the values in Table 6.2, this condition is satisfied when $n = 16{,}000$.

| k | n     | I_T               | \|E_T\| | Relative Iterative Error |
|---|-------|-------------------|---------|--------------------------|
| 1 | 1000  | 6.361850412446070 | 4.8e−06 |                          |
| 2 | 2000  | 6.361846833908572 | 1.2e−06 | 5.6e−07                  |
| 3 | 4000  | 6.361845939274068 | 3.0e−07 | 1.4e−07                  |
| 4 | 8000  | 6.361845715615440 | 7.5e−08 | 3.5e−08                  |
| 5 | 16000 | 6.361845659700774 | 1.9e−08 | 8.8e−09                  |
| 6 | 32000 | 6.361845645722084 | 4.7e−09 | 2.2e−09                  |

Table 6.2 Values of (6.17) when computed using the composite trapezoidal rule $I_T(n)$, as well as the error $|E_T(n)|$ and the relative iterative error.

It is worth pointing out that the error formula is useful for checking that your code is working correctly. To explain, with the composite trapezoidal method the relative iterative error should decrease by about a factor of 4 each time the number of subintervals is doubled. This is exactly what occurs in Table 6.2. If this didn’t happen it would be necessary to diagnose the situation to explain why it didn’t.
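The factor-of-4 check described above is easy to automate. The following Python sketch implements the composite trapezoidal rule (6.15) and prints the ratios of successive relative iterative errors; the function name and the particular sequence of $n$ values are illustrative choices.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule (6.15) with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))   # f_2, ..., f_n in the text
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

f = lambda x: math.exp(3 * x)

values = [composite_trapezoid(f, 0.0, 1.0, n) for n in (1000, 2000, 4000, 8000)]

# Relative iterative errors between consecutive values of n.
rie = [abs((values[k] - values[k - 1]) / values[k]) for k in range(1, len(values))]

# For an O(h^2) method each ratio should be close to 4.
ratios = [rie[k - 1] / rie[k] for k in range(1, len(rie))]
print(ratios)
```

If the ratios drift well away from 4 for a smooth integrand, that is a hint that either the code has a bug or $n$ is not yet large enough for the leading error term to dominate.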

6.2.2 Simpson’s Rule

The next step is to try a quadratic approximation for $f(x)$, and for this we need three data points. One option is to use $x_i$, $x_{i+1}$, and some point within the subinterval. Another option is to pair up the subintervals and use an approximation over $x_1 \le x \le x_3$, another one over $x_3 \le x \le x_5$, etc. We will use the latter option although this will require $n$ to be even. From (5.4), the quadratic that interpolates $f(x)$ over the interval $x_{i-1} \le x \le x_{i+1}$ is

$$p_2(x) = f_{i-1}\ell_{i-1}(x) + f_i\ell_i(x) + f_{i+1}\ell_{i+1}(x). \tag{6.19}$$

An example of the resulting approximation is shown in Figure 6.7 in the case when $n = 4$. There are two quadratics, one used for $0 \le x \le \frac{1}{2}$ and another for $\frac{1}{2} \le x \le 1$. If you look closely, you will notice that over each subinterval the quadratic is above the function on one half, and below the function on the other half. This is also what happened for the midpoint rule, as seen in Figure 6.3, and it will result in the integration rule being more accurate than might be expected.

In deriving the resulting integration rule we are also interested in the error. Using Taylor’s theorem (see Exercise 5.37(c)) one can show that

$$f(x) = p_2(x) + \frac{1}{6}z(z^2 - h^2)f'''(x_i) + \frac{1}{24}z^2(z^2 - h^2)f''''(x_i) + \cdots,$$

where $z = x - x_i$. Integrating both sides of this expression, one finds that

$$\begin{aligned} \int_{x_{i-1}}^{x_{i+1}} f(x)\,dx &= \int_{x_{i-1}}^{x_{i+1}}\left[p_2(x) + \frac{z}{6}(z^2 - h^2)f_i''' + \frac{z^2}{24}(z^2 - h^2)f_i'''' + \cdots\right]dx \\ &= \int_{x_{i-1}}^{x_{i+1}} p_2(x)\,dx + \int_{-h}^{h}\left[\frac{z}{6}(z^2 - h^2)f_i''' + \frac{z^2}{24}(z^2 - h^2)f_i'''' + \cdots\right]dz \\ &= \frac{h}{3}(f_{i-1} + 4f_i + f_{i+1}) - \frac{1}{90}h^5 f_i'''' + \cdots. \end{aligned} \tag{6.20}$$

Figure 6.7 Example of the piecewise quadratic approximation $p_2(x)$ used in the derivation of Simpson’s rule. One quadratic is used for $0 \le x \le \frac{1}{2}$ and another for $\frac{1}{2} \le x \le 1$. [Curves shown: $p_2(x)$ and $f(x)$; x-axis from 0 to 1, y-axis from −1 to 1.]

This gives rise to what is known as Simpson’s rule, which is that

$$\int_{x_{i-1}}^{x_{i+1}} f(x)\,dx \approx \frac{h}{3}(f_{i-1} + 4f_i + f_{i+1}). \tag{6.21}$$

The error in this approximation is $O(h^5)$. It remains to recombine the subintervals, and the result is

$$\int_a^b f(x)\,dx = I_S + E_S,$$

where

$$I_S = \frac{h}{3}(f_1 + 4f_2 + 2f_3 + 4f_4 + 2f_5 + \cdots + 4f_n + f_{n+1}) \tag{6.22}$$

is known as the composite Simpson’s rule. It is possible to show that the error is $E_S = -\frac{1}{180}(b-a)h^4 f''''(\eta)$, where $\eta$ is a point somewhere in the interval $a \le x \le b$. The denominator in this expression is 180, and not 90 as in (6.20), because the subintervals are occurring in pairs. So, when you add the contributions of the subintervals together, as done in (6.8), instead of having $n$ contributions to the composite error, for the composite Simpson’s rule there are only $n/2$. From the above discussion, the following result is obtained.

Theorem 6.3. If $f \in C^4[a, b]$ then the composite Simpson’s rule (6.22) satisfies

$$\left|\int_a^b f(x)\,dx - I_S\right| \le \frac{b-a}{180}h^4\|f''''\|_\infty. \tag{6.23}$$

In the above expression, $\|f''''\|_\infty = \max_{a \le x \le b}|f''''(x)|$.

This is an impressive result, and it is clearly better than what was achieved with the midpoint or trapezoidal rules. In fact, it is the “gold standard” in the sense that when new integration rules are derived they are almost invariably compared to Simpson’s rule.
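A compact Python sketch of the composite Simpson’s rule (6.22) is shown below. It assumes $n$ is even, as the derivation requires, and the function name and the two test integrands are illustrative rather than taken from the text.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule (6.22) with n equal subintervals (n must be even)."""
    if n % 2 != 0:
        raise ValueError("n must be even for the composite Simpson's rule")
    h = (b - a) / n
    # In the text's 1-based numbering, f_2, f_4, ... get weight 4 and f_3, f_5, ... get weight 2.
    weight_4 = sum(f(a + i * h) for i in range(1, n, 2))
    weight_2 = sum(f(a + i * h) for i in range(2, n, 2))
    return (h / 3) * (f(a) + 4 * weight_4 + 2 * weight_2 + f(b))

exact = (math.exp(3) - 1) / 3
approx = composite_simpson(lambda x: math.exp(3 * x), 0.0, 1.0, 10)
print(approx, abs(exact - approx))

# A cubic should be integrated exactly (up to rounding), since its fourth derivative is zero.
print(composite_simpson(lambda x: x ** 3, 0.0, 2.0, 2))
```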


Example 6.4

According to Theorem 6.3, how many subintervals are necessary to guarantee an error of $10^{-8}$ if the composite Simpson’s rule is used to evaluate

$$\int_0^1 e^{3x}\,dx? \tag{6.24}$$

Answer: Since $\|f''''\|_\infty = 81e^3$, then from (6.23) we want $\frac{81}{180}h^4 e^3 \le 10^{-8}$. Since $h = (b-a)/n = 1/n$, then we require $n \ge \left(\frac{81}{180}e^3\right)^{1/4} \times 10^2 \approx 173.4$. So, according to Theorem 6.3, we should use at least 174 subintervals. To check on this, the computed values are given in Table 6.3, along with the value of the error

$$E_S(n) = \int_a^b f(x)\,dx - I_S(n). \tag{6.25}$$

What is seen is that the desired error of $10^{-8}$ is obtained using a somewhat smaller number of subintervals than the predicted value of 174. It is also informative to plot the error $E_S$ as a function of the number of subintervals, and to compare this with the midpoint error $E_M$, and the trapezoidal error $E_T$. This is done in Figure 6.8. As expected, $E_M$ and $E_T$ are very similar, decreasing in the predicted $O(h^2)$ manner. It is also evident how much better Simpson’s rule does compared to the other two methods.

| k | n   | I_S               | \|E_S\| | Relative Iterative Error |
|---|-----|-------------------|---------|--------------------------|
| 1 | 10  | 6.362128885519898 | 2.8e−04 |                          |
| 2 | 20  | 6.361863485939538 | 1.8e−05 | 4.2e−05                  |
| 3 | 40  | 6.361846758607321 | 1.1e−06 | 2.6e−06                  |
| 4 | 80  | 6.361845710944180 | 7.0e−08 | 1.6e−07                  |
| 5 | 160 | 6.361845645430706 | 4.4e−09 | 1.0e−08                  |
| 6 | 320 | 6.361845641335575 | 2.7e−10 | 6.4e−10                  |

Table 6.3 Values of (6.24) when computed using the composite Simpson’s rule $I_S$, as well as the error $E_S$ and the relative iterative error.

Figure 6.8 The errors $|E_M|$, $|E_T|$ and $|E_S|$ for the composite midpoint, trapezoidal and Simpson’s rules, respectively, as a function of the number $n$ of subintervals used to evaluate (6.24). [Log-log plot; horizontal axis: n-axis from $10^1$ to $10^3$; vertical axis: Error from $10^{-10}$ to $10^0$; curves: Trap, Midpt, Simp.]

Example 6.5

Using the composite Simpson’s rule, find a value of (6.24) that is correct to about 8 significant digits.

Answer: From the values in Table 6.3 the relative iterative error is less than $10^{-8}$ when $n = 320$. However, the value for $n = 160$ comes so close to satisfying the condition that it’s not unreasonable to expect that it has 8 correct digits. Also note that the relative iterative error decreases by about a factor of 16 each time $n$ is doubled, which is what is expected for composite Simpson since the error is $O(h^4)$.

Example 6.6

(1) According to Theorem 6.3, what functions will the composite Simpson’s rule integrate exactly, no matter what the value of $h$?

Answer: This requires $\|f''''\|_\infty = 0$, which means that $f''''(x) = 0$ for $a \le x \le b$. Therefore, the composite Simpson’s rule will integrate any polynomial of degree 3, or less, exactly.

(2) Suppose that the error when using the composite Simpson’s rule is $10^{-4}$ if $n = 100$. What is the expected error if $n = 200$?

Answer: Since the error is $O(h^4)$, then for $n = 200$ the error is expected to be approximately $\left(\frac{1}{2}\right)^4 \times 10^{-4} \approx 6 \times 10^{-6}$.

6.2.3 Cubic Splines

One of the best interpolation methods considered in Chapter 5 used cubic splines, and this makes it an intriguing method for numerical integration. One difference from the other interpolation-based methods we have considered is that the cubic spline interpolation formula applies to the whole interval. A consequence of this is that we will obtain the composite rule directly, without first deriving the formula over a subinterval and then adding them together to produce a composite rule.


As explained in Section 5.4.1, the cubic spline interpolation formula can be written as

$$s(x) = \sum_{i=0}^{n+2} a_i B_i(x),$$

where the $B_i$’s are cubic B-splines and are defined in (5.23). Using the integration formulas given in Exercise 5.40, one finds that

$$\begin{aligned} \int_a^b s(x)\,dx &= \sum_{i=0}^{n+2} a_i\int_a^b B_i(x)\,dx \\ &= h\left[\sum_{i=1}^{n+1} f_i + \frac{1}{24}(a_2 - a_0) - \frac{1}{2}f_1 - \frac{1}{24}(a_{n+2} - a_n) - \frac{1}{2}f_{n+1}\right]. \end{aligned} \tag{6.26}$$

To go any further we need to specify which type of spline is going to be used, and we will use a clamped spline. This means that $s(x)$ satisfies

$$s'(x_1) = f'(a) \quad \text{and} \quad s'(x_{n+1}) = f'(b).$$

Given that

$$s'(x_i) = a_{i-1}B_{i-1}'(x_i) + a_i B_i'(x_i) + a_{i+1}B_{i+1}'(x_i) = -\frac{1}{2h}a_{i-1} + \frac{1}{2h}a_{i+1},$$

then $a_0 = a_2 - 2hf'(a)$ and $a_{n+2} = a_n + 2hf'(b)$. Substituting these into (6.26) yields

$$\int_a^b s(x)\,dx = h\left[\frac{1}{2}(f_1 + f_{n+1}) + \sum_{i=2}^{n} f_i + \frac{h}{12}(f_1' - f_{n+1}')\right],$$

where $f_1' = f'(a)$ and $f_{n+1}' = f'(b)$. From Theorem 5.5 we know that $f(x) = s(x) + O(h^4)$, and we therefore have that

$$\int_a^b f(x)\,dx = I_H + O(h^4), \tag{6.27}$$

where

$$I_H = h\left(\frac{1}{2}f_1 + f_2 + f_3 + \cdots + f_n + \frac{1}{2}f_{n+1}\right) + \frac{1}{12}h^2\left(f_1' - f_{n+1}'\right). \tag{6.28}$$

This is known as the composite Hermite rule or the corrected trapezoidal rule. It gets the latter name because

$$I_H = I_T + \frac{1}{12}h^2\left(f_1' - f_{n+1}'\right). \tag{6.29}$$


This is interesting because it shows that the composite trapezoidal rule $I_T$, given in (6.15), can be adjusted using values at the endpoints so it has the same order of error as Simpson’s rule. As for calling it the Hermite rule, it gets this name because you also obtain (6.28) by integrating something called the cubic Hermite interpolation function. The assumptions underlying Hermite interpolation are explained in Exercise 5.21. The following theorem states the error for this method.

Theorem 6.4. If $f \in C^4[a, b]$ then the composite Hermite rule (6.28) satisfies

$$\left|\int_a^b f(x)\,dx - I_H\right| \le \frac{b-a}{720}h^4\|f''''\|_\infty, \tag{6.30}$$

where $\|f''''\|_\infty = \max_{a \le x \le b}|f''''(x)|$.

This result is not particularly surprising given the error using a clamped cubic spline, as given in Theorem 5.5.

Example 6.7

According to Theorem 6.4, how many subintervals are necessary to guarantee an error of $10^{-8}$ if the composite Hermite rule is used to evaluate

$$\int_0^1 e^{3x}\,dx? \tag{6.31}$$

Answer: Since $\|f''''\|_\infty = 81e^3$, then from (6.30) we want $\frac{9}{80}h^4 e^3 \le 10^{-8}$. This gives us $h \le (80e^{-3}/9)^{1/4} \times 10^{-2}$. Since $h = (b-a)/n = 1/n$, then we require $n \ge (9e^3/80)^{1/4} \times 10^2 \approx 122.6$. So, according to Theorem 6.4, we should use at least 123 subintervals so the error is no more than $10^{-8}$. To check on this, the computed values are given in Table 6.4, along with the absolute value of the error

$$E_H(n) = \int_a^b f(x)\,dx - I_H(n). \tag{6.32}$$

What is seen is that the desired error of $10^{-8}$ is obtained using a somewhat smaller number of subintervals than the predicted value of 123 (Table 6.4).

| n   | I_H               | \|E_H\| |
|-----|-------------------|---------|
| 10  | 6.361774223320726 | 7.1e−05 |
| 20  | 6.361841170284835 | 4.5e−06 |
| 40  | 6.361845361526698 | 2.8e−07 |
| 80  | 6.361845623589811 | 1.7e−08 |
| 160 | 6.361845639970481 | 1.1e−09 |

Table 6.4 Values of (6.31) when computed using the composite Hermite rule $I_H$, as well as the error $E_H$.

It is interesting that Hermite has a slightly better error than Simpson, although it requires a bit more information about the function. Also, because of (6.29), it shows that the trapezoidal rule can be more accurate than Simpson for functions which satisfy $f'(a) = f'(b)$. It needs to be stated however that most functions do not satisfy this condition, and for this reason Simpson is expected to produce a more accurate approximation than the trapezoidal method. The most noteworthy exception to this arises with periodic functions, where having $f'(a) = f'(b)$ is not uncommon.
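The corrected trapezoidal (composite Hermite) rule in (6.28) needs only two extra derivative values, which the Python sketch below takes as a second function argument. The name `composite_hermite` and the argument `df` (assumed to return $f'(x)$) are hypothetical, introduced only for this illustration.

```python
import math

def composite_hermite(f, df, a, b, n):
    """Composite Hermite (corrected trapezoidal) rule (6.28).

    df is assumed to evaluate the derivative f'; it is only used at a and b.
    """
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    trapezoid = h * (0.5 * f(a) + interior + 0.5 * f(b))
    return trapezoid + (h * h / 12) * (df(a) - df(b))   # endpoint correction from (6.29)

f = lambda x: math.exp(3 * x)
df = lambda x: 3 * math.exp(3 * x)
exact = (math.exp(3) - 1) / 3

for n in (10, 20, 40, 80, 160):
    approx = composite_hermite(f, df, 0.0, 1.0, n)
    print(n, approx, abs(exact - approx))
```

For a function with $f'(a) = f'(b)$ the correction term vanishes and the routine reduces to the composite trapezoidal rule.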

6.2.4 Other Interpolation Ideas

As pointed out earlier, (6.28) is known as the corrected trapezoidal rule. It turns out that there is also a corrected midpoint rule, and it states that

$$\int_a^b f(x)\,dx = I_M - \frac{1}{24}h^2\left(f_1' - f_{n+1}'\right) + E_{CM}.$$

The error term in this case satisfies $|E_{CM}| \le (b-a)Kh^4\|f''''\|_\infty$, where $K = 7/5760$. This makes it competitive with both Simpson’s rule as well as with the corrected trapezoidal rule. More information about this idea of correcting, or improving, an integration rule using information at the endpoints can be found in Fornberg [2021] and Talvila and Wiersma [2012].

You might think that the next step is to use fourth, or higher, degree polynomial approximations and see what sort of integration rule results. Although there are people with a lot of time on their hands who do this sort of thing, it is not worth the effort. Instead, it is better to step back and look at the integration formulas we have derived. For example, it is informative to compare the formulas for the composite trapezoidal and Simpson’s rules. If you add up the $f_i$’s with the coefficients in (6.15) you get an approximation for the integral, but if you add up the same $f_i$’s using the coefficients in (6.22) you get a better approximation. This raises the question as to whether there are coefficients that can be used that will produce an even better approximation. This idea is pursued in Section 6.5. What is done first, however, is address a more practical concern, which is how to efficiently compute an integral.


6.3 Romberg Integration

An interesting idea on how to improve the accuracy of numerical integration is based on an observation about the error. To illustrate, the error given in (6.14) has the form

$$E = \alpha h^3 + \beta h^4 + \cdots.$$

Stating that $E = O(h^3)$ means that the principal contribution to the error, for small values of $h$, involves a term of the form $\alpha h^3$. However, it also contains terms involving $h^4$, $h^5$, etc., but they are much smaller than the principal term that is $O(h^3)$. A similar comment holds for all of the error terms we have derived.

To explore the consequences of the above observation, it states in (6.8) that the error for the composite midpoint rule is $O(h^2)$. In reality, the equation in (6.8) has the form

$$\int_a^b f(x)\,dx = I_M(n) + \alpha h^2 + \beta h^3 + \gamma h^4 + \cdots. \tag{6.33}$$

The exact values of $\alpha, \beta, \gamma, \cdots$ are not important other than to note that they do not depend on $h$ or $n$. Also, $I_M(n)$ is being used here, rather than $I_M$, to indicate that the formula is for the case of when $n$ subintervals are being used.

Now the key observation. If the number of subintervals is doubled, so $n$ is changed to $2n$, then the width of the subintervals changes from $h$ to $h/2$. Consequently, using (6.33),

$$\int_a^b f(x)\,dx = I_M(2n) + \frac{1}{4}\alpha h^2 + \frac{1}{8}\beta h^3 + \frac{1}{16}\gamma h^4 + \cdots, \tag{6.34}$$

where $h$ is the value of the width when using $n$ subintervals. It is possible to combine (6.33) and (6.34) and eliminate the $O(h^2)$ term. Multiplying (6.34) by 4 and subtracting (6.33) we obtain

$$3\int_a^b f(x)\,dx = 4I_M(2n) - I_M(n) - \frac{1}{2}\beta h^3 + \cdots.$$

In other words,

$$\int_a^b f(x)\,dx = \frac{4}{3}I_M(2n) - \frac{1}{3}I_M(n) + O(h^3). \tag{6.35}$$

So, we have combined two O(h2 ) approximations of the integral to produce a better O(h3 ) approximation. This idea of increasing the number of subintervals and then combining the two results to get a better approximation is known as Romberg integration. The formula in (6.35) is due entirely to the form of the error in (6.33), and it can be applied to any composite rule with this type of error. For example, the composite trapezoidal rule has this form, and so the Romberg formula


Pick: n and tol > 0
A(1) = I_S(n)
R(1) = A(1)
Loop: for k = 2, 3, 4, ···
    n = 2n
    A(k) = I_S(n)
    R(k) = [16A(k) − A(k − 1)]/15
    if |1 − R(k − 1)/R(k)| < tol, then stop
end

Table 6.5 Algorithm for evaluating an integral using Romberg integration with the composite Simpson’s rule. After exiting the loop, R(k) is the computed value of the integral.

for it is

$$\int_a^b f(x)\,dx = \frac{4}{3}I_T(2n) - \frac{1}{3}I_T(n) + O(h^3). \tag{6.36}$$

It can also be easily extended to composite rules with other forms for the error. An example is Simpson’s rule, and the Romberg procedure applied to it produces

$$\int_a^b f(x)\,dx = \frac{16}{15}I_S(2n) - \frac{1}{15}I_S(n) + O(h^6). \tag{6.37}$$

The derivation of this result, as well as other Romberg formulas, can be found in Exercise 6.14.

6.3.1 Computing Using Romberg

When computing an integral, the question arises of how many subdivisions to use. For most of the methods considered in this chapter, theorems are given that provide bounds on the error, and these were used to predict how many subintervals are needed to guarantee a certain error. In real world situations, these often are not useful because calculating $\|f''\|_\infty$ or $\|f''''\|_\infty$ is either not possible or too difficult to be practical. For this reason, integration methods are used in an iterative manner. The iteration in this case is associated with using increasingly larger values of $n$. As a typical example, you keep doubling the number of subintervals, using one of the Romberg rules to compute the integral, and then stop when the relative iterative error is sufficiently small.


Such an algorithm using Simpson’s rule is given in Table 6.5. The formula for R(k) in this case comes from (6.37).

Example 6.8 To demonstrate the usefulness of Romberg integration, consider evaluating the integral

$$\int_0^1 e^{3x}\,dx. \tag{6.38}$$

Using the procedure given in Table 6.5, the relative error is plotted in Figure 6.9. For comparison, the relative error for composite Simpson’s is also shown. The $O(h^6)$ error is evident in the slope of the Simpson-Romberg line. For example, going from 4 to 16 subintervals, the error drops by a factor of about $4^6 \approx 4000$.

Let’s take a moment and review the progress in our ongoing task of computing the value of (6.38). To achieve an error of $10^{-8}$, for the composite trapezoidal rule you need about 21,850 subintervals while with Simpson-Romberg you need about 32. So, we have made significant progress in deriving a fairly efficient numerical integration procedure. However, as will be shown in Section 6.5, it is possible to derive a method that gets the number down to about 5 or 6.
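The loop in Table 6.5 can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's code: the function names, the starting value n = 2, and the safeguard `max_doublings` are assumptions made here, and the test integral is (6.38).

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule using n (even) equally spaced subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))

def simpson_romberg(f, a, b, n=2, tol=1e-10, max_doublings=20):
    """Follow Table 6.5: keep doubling n until the relative iterative error < tol."""
    a_prev = composite_simpson(f, a, b, n)      # A(1)
    r_prev = a_prev                             # R(1)
    for _ in range(max_doublings):              # hypothetical safeguard, not in Table 6.5
        n *= 2
        a_new = composite_simpson(f, a, b, n)   # A(k)
        r_new = (16 * a_new - a_prev) / 15      # R(k), the Romberg formula (6.37)
        if abs(1 - r_prev / r_new) < tol:
            return r_new, n
        a_prev, r_prev = a_new, r_new
    raise RuntimeError("tolerance not reached")

value, n_used = simpson_romberg(lambda x: math.exp(3 * x), 0.0, 1.0)
exact = (math.exp(3) - 1) / 3
print(n_used, abs(value - exact) / exact)
```

Note that the stopping test compares successive Romberg values, so the returned value is typically more accurate than tol suggests.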

6.4 Methods for Discrete Data

It is often the case that the function to be integrated is only known at a fixed number of points, which is what typically occurs with experimentally determined data. If the nodes are equally spaced, then the methods in the

[Plot: Error versus n, with curves labeled Simp and Simp-Romb.]
Figure 6.9 The relative error when evaluating (6.38) using Simpson-Romberg integration as well as using the composite Simpson’s rule.


previous sections can be adapted in a straightforward manner. The situation considered here concentrates on what can be done when they are not equally spaced. A guiding principle in this will be that any node within the interval of integration must be used in computing the integral. For example, suppose the nodes are $x_1 = 0$, $x_2 = 3$, and $x_3 = 4$. The integral over the interval $x_1 \le x \le x_3$ can be computed using the trapezoidal rule with one trapezoid (using $x_1$, $x_3$) or two trapezoids (using $x_1$, $x_2$ and $x_2$, $x_3$). According to this principle, the latter is the one to be used.

In what follows the nodes are $x_1, x_2, x_3, \cdots, x_{n+1}$, with $x_1 = a$, $x_{n+1} = b$, and the $i$th subinterval has length $h_i = x_{i+1} - x_i$.

One of the easier methods to use with non-equally spaced nodes is the trapezoidal method. Using (6.13), the generalization of the composite trapezoidal rule given in (6.15) is

$$I_T = \frac{1}{2}\Big[h_1 f_1 + (h_1 + h_2) f_2 + (h_2 + h_3) f_3 + \cdots + (h_{n-1} + h_n) f_n + h_n f_{n+1}\Big]. \tag{6.39}$$

Given the simplicity, and adaptability, of this formula, it is typically the default method used for discrete data. However, as shown below, there is a better approximation to the integral.

The formula in (6.39) comes from using piecewise linear interpolation on the data set, and then integrating the resulting interpolation function. Pursuing this idea a bit further, a global interpolation polynomial $p_n(x)$ is not recommended given its predilection for over- and under-shoots (see Figure 5.4). Using a cubic spline interpolation function, on the other hand, is a distinct possibility. This requires deriving the requisite formulas for irregularly spaced nodes. Assuming this has been done, then the task is to integrate the interpolation function $s(x)$, as given in (5.16). Although most any method can be used, note that each $s_i(x)$ is a cubic, and Simpson’s rule integrates cubics exactly. Therefore,

$$\int_a^b s(x)\,dx = \sum_{i=1}^{n} \int_{x_i}^{x_{i+1}} s_i(x)\,dx = \sum_{i=1}^{n} \frac{1}{6} h_i \Big[s_i(x_i) + 4 s_i(x_i + h_i/2) + s_i(x_{i+1})\Big].$$

Using the fact that $s_i(x_i) = f_i$, $s_i(x_{i+1}) = f_{i+1}$, and $s_i(x_i + h_i/2) = s(x_i + h_i/2)$, then the resulting approximation of the integral is

$$\int_a^b f(x)\,dx \approx \frac{1}{3} I_T + \frac{2}{3}\sum_{i=1}^{n} h_i\, s\Big(x_i + \frac{1}{2}h_i\Big), \tag{6.40}$$


where $I_T$ is given in (6.39). This formula holds irrespective of which end conditions have been used to determine the cubic spline interpolation function $s(x)$.

With experimentally obtained data, not only can the nodes be unequally spaced, but if the experiment is run again it is very possible that the nodes are different. The next example mimics this to investigate how much this might affect the accuracy of the integration rule.

Example 6.9 Suppose $f(x) = e^{3x}$ is sampled at 6 points from the interval $0 \le x \le 1$ in the following manner: $x_1 = 0$, $x_6 = 1$, and $x_2$, $x_3$, $x_4$ and $x_5$ are randomly selected from the interval $0 < x < 1$ (and labeled so that $x_i < x_{i+1}$). With this, the integral

$$\int_0^1 e^{3x}\,dx \tag{6.41}$$

is computed using (6.39), and also using (6.40) with the not-a-knot cubic spline. This experiment was run 10 times, and the resulting relative errors using the two integration formulas are given in Figure 6.10. In each case the spline formula provides a better approximation than the composite trapezoidal method.

The natural cubic spline also does better than the composite trapezoidal approximation in the above example, but not as well as the not-a-knot spline. In fact, based on numerical experiments involving a variety of functions and integration intervals, more often than not, the not-a-knot spline provides a better approximation than a natural spline when using discrete data.
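As a rough illustration of the trapezoidal part of Example 6.9, the following Python sketch evaluates (6.39) on randomly placed nodes. The function name, the fixed random seed, and the use of `random.random` to pick the interior nodes are assumptions made here for repeatability; the spline formula (6.40) is not included because it requires a cubic spline routine.

```python
import math
import random

def trapezoid_nonuniform(x, f):
    """Composite trapezoidal rule (6.39) for sorted nodes x and data values f."""
    if len(x) != len(f) or len(x) < 2:
        raise ValueError("need matching lists with at least two points")
    total = 0.0
    for i in range(len(x) - 1):
        h_i = x[i + 1] - x[i]                   # length of the ith subinterval
        total += 0.5 * h_i * (f[i] + f[i + 1])  # area of the ith trapezoid
    return total

random.seed(0)  # fixed seed so the sketch is repeatable
nodes = sorted([0.0, 1.0] + [random.random() for _ in range(4)])
values = [math.exp(3 * x) for x in nodes]
approx = trapezoid_nonuniform(nodes, values)
exact = (math.exp(3) - 1) / 3
print(abs(approx - exact) / exact)
```

Summing the trapezoid areas one subinterval at a time gives the same value as (6.39); the formula in the text simply collects the terms multiplying each $f_i$.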

[Plot: Error versus Experiment (1 through 10), with curves labeled Trap and Spline.]
Figure 6.10 The relative error in using the composite trapezoidal (6.39) and the not-a-knot cubic spline (6.40) approximations for (6.41).


6.5 Methods Based on Precision

The next family of methods we are going to consider is based on two observations that come from the composite rules we derived using interpolation. The first is the observation that most of the composite integration rules have the form

$$\int_a^b f(x)\,dx \approx w_1 f(z_1) + w_2 f(z_2) + \cdots + w_m f(z_m). \tag{6.42}$$

The $z_i$’s are called the nodes, and the $w_i$’s are the weights. So, the composite midpoint rule in (6.9) has nodes $z_i = x_i + \frac{h}{2}$, with corresponding weights $w_i = h$. Similarly, the composite trapezoidal rule given in (6.15) has nodes $z_i = x_i$, with weights $w_1 = w_{n+1} = h/2$ and all the others are $w_i = h$. It should also be noted that (6.42) does not include derivative terms that arose with the Hermite rule, and how these can be incorporated into the procedure is explored in Exercise 6.30.

The second observation concerns the error. To explain, in deriving the composite midpoint rule it was found that

$$\int_a^b f(x)\,dx = I_M + \frac{b - a}{24} h^2 f''(\eta),$$

where $\eta$ is located somewhere in the interval $(a, b)$. Because of this the composite midpoint rule has zero error (i.e., it is exact) if $f(x) = 1$ or $f(x) = x$, and the reason is that for both of these functions $f''(x) = 0$. However, it is not exact when $f(x) = x^2$ and because of this the midpoint rule is said to have precision one. Similarly, the error for the composite Simpson’s rule is

$$E_S = -\frac{b - a}{180} h^4 f''''(\eta).$$

Consequently, it has zero error if $f(x) = 1$, $f(x) = x$, $f(x) = x^2$, or $f(x) = x^3$, but it is not zero if $f(x) = x^4$. For this reason it has precision three. This brings us to the following definition.

Definition of Precision. The precision of an integration rule is the largest value of $m$ for which the rule is exact for the functions $f(x) = x^k$, for $k = 0, 1, 2, \cdots, m$.

This definition is used to derive integration rules, and to do this, it is necessary to know the exact value of integrals of the form

$$\int_a^b x^k\,dx. \tag{6.43}$$

The values, up to $k = 5$, are given in Table 6.6.


We are now going to derive integration rules that have maximum precision, and these fall under the general category of Gaussian rules.

6.5.1 1-Point Gaussian Rule

We start with a question: what integration rule gives the maximum precision using one point from the interval? The assumption is that the general form of the rule is

$$\int_a^b f(x)\,dx \approx w_1 f(z_1),$$

where $a \le z_1 \le b$. To find the values of $w_1$ and $z_1$ that maximize the precision, we want to find the largest value of $k$ so that

$$\int_a^b x^k\,dx = w_1 z_1^k, \tag{6.44}$$

and this generates the following list, in which $\ell = b - a$ and $x_m = (b + a)/2$.

1. $k = 0$: From (6.44) we require that $\int_a^b dx = w_1$, and from this it follows that $w_1 = \ell$.
2. $k = 1$: We require that $\int_a^b x\,dx = w_1 z_1$. Using Table 6.6, the condition is $\ell x_m = w_1 z_1$. Given that $w_1 = \ell$, we obtain $z_1 = x_m$.

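k    f(x)     $\int_a^b f(x)\,dx$
0    $1$      $\ell$
1    $x$      $\ell x_m$
2    $x^2$    $\ell\left(x_m^2 + \frac{1}{12}\ell^2\right)$
3    $x^3$    $\ell x_m\left(x_m^2 + \frac{1}{4}\ell^2\right)$
4    $x^4$    $\ell\left(x_m^4 + \frac{1}{2}\ell^2 x_m^2 + \frac{1}{80}\ell^4\right)$
5    $x^5$    $\ell x_m\left(x_m^4 + \frac{5}{6}\ell^2 x_m^2 + \frac{1}{16}\ell^4\right)$

Table 6.6 Values of (6.43) for various values of $k$, where $\ell = b - a$ and $x_m = (b + a)/2$.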

3. $k = 2$: If $\int_a^b x^2\,dx = w_1 z_1^2$, then it must be that

$$\ell\left(x_m^2 + \frac{1}{12}\ell^2\right) = \ell x_m^2. \tag{6.45}$$

It is not possible for this to hold for nonzero $\ell$, and so this rule does not integrate $x^2$ exactly.

Therefore, the 1-point Gaussian quadrature rule is the midpoint rule. This is worth knowing but it is disappointing because the new idea of maximizing the precision has not produced a new method. As demonstrated next, this changes when we use more points.
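The definition of precision can be checked numerically. The Python sketch below is an illustration only: the helper names, the cutoff `k_max`, and the tolerance `rtol` are assumptions introduced here. It tests a rule of the form (6.42) against the monomials $x^k$ and reports the largest $k$ for which the rule is exact to rounding error.

```python
def monomial_integral(k, a, b):
    """Exact value of the integral of x**k over [a, b], as in (6.43)."""
    return (b ** (k + 1) - a ** (k + 1)) / (k + 1)

def precision(nodes, weights, a, b, k_max=10, rtol=1e-12):
    """Largest m such that the rule is exact for x**k, k = 0, 1, ..., m."""
    m = -1
    for k in range(k_max + 1):
        approx = sum(w * z ** k for z, w in zip(nodes, weights))
        exact = monomial_integral(k, a, b)
        if abs(approx - exact) > rtol * max(1.0, abs(exact)):
            break   # first monomial the rule fails to integrate exactly
        m = k
    return m

a, b = 0.0, 2.0
ell, x_m = b - a, (a + b) / 2
print(precision([x_m], [ell], a, b))   # 1-point Gaussian (midpoint) rule
```

The same helper can be applied to other rules, for example a single Simpson panel on [0, 2] with nodes 0, 1, 2 and weights 1/3, 4/3, 1/3.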

6.5.2 2-Point Gaussian Rule

The general form of the rule in this case is

$$\int_a^b f(x)\,dx \approx w_1 f(z_1) + w_2 f(z_2).$$

Our goal is to find the values of the $w_i$’s and $z_i$’s that maximize the precision. So, we want to find the largest value of $k$ so that

$$\int_a^b x^k\,dx = w_1 z_1^k + w_2 z_2^k.$$

As was done for the 1-point rule, we make a list using Table 6.6, where $\ell = b - a$ and $x_m = (b + a)/2$.

(1) $k = 0$: $\ell = w_1 + w_2$
(2) $k = 1$: $\ell x_m = w_1 z_1 + w_2 z_2$
(3) $k = 2$: $\ell\left(x_m^2 + \frac{1}{12}\ell^2\right) = w_1 z_1^2 + w_2 z_2^2$
(4) $k = 3$: $\ell x_m\left(x_m^2 + \frac{1}{4}\ell^2\right) = w_1 z_1^3 + w_2 z_2^3$
(5) $k = 4$: $\ell\left(x_m^4 + \frac{1}{2}\ell^2 x_m^2 + \frac{1}{80}\ell^4\right) = w_1 z_1^4 + w_2 z_2^4$    (6.46)


We have four unknowns so we will consider the first four equations from the above list. The fact that three of them are nonlinear makes the problem of solving them a bit challenging. One approach to help simplify things is to anticipate where the points $z_1$ and $z_2$ are located relative to each other in the interval. For the midpoint rule, $z_1 = x_m$ ended up being symmetrically located in the interval (i.e., at the midpoint). Given that $z_1$ and $z_2$ appear in a symmetric manner in the integration rule, it is not unreasonable to expect them to also end up being placed symmetrically in the interval. In other words, they have the form $z_1 = x_m - q$ and $z_2 = x_m + q$, where we need to find the value of $q$. Assuming this, then from the $k = 0$ and $k = 1$ conditions one finds that $w_1 = w_2 = \frac{1}{2}\ell$. With this, and the $k = 2$ condition, one finds that $q^2 = \ell^2/12$. The remaining tasks are to show that this solution satisfies the $k = 3$ condition but not the one for $k = 4$. Both of these are left as exercises. The resulting integration rule is

$$\int_a^b f(x)\,dx \approx \frac{1}{2}\ell\,\big[f(z_1) + f(z_2)\big], \tag{6.47}$$

where

$$z_1 = \frac{b + a}{2} - \frac{\ell}{2\sqrt{3}}, \quad \text{and} \quad z_2 = \frac{b + a}{2} + \frac{\ell}{2\sqrt{3}}. \tag{6.48}$$

This is the 2-point Gaussian quadrature rule, and it has precision 3. The locations of the two Gaussian nodes are shown in Figure 6.11.

Figure 6.11 Location of the nodes $z_1$ and $z_2$ for 2-point Gaussian quadrature.

Example 6.10

$$\int_0^1 e^{3x}\,dx. \tag{6.49}$$

(1) For the 1-point Gaussian rule, $a = 0$, $b = 1$, $w_1 = 1$, and $z_1 = 1/2$. The resulting approximation is

$$\int_0^1 e^{3x}\,dx \approx e^{3/2}.$$

(2) For the 2-point Gaussian rule, $a = 0$, $b = 1$, $w_1 = w_2 = 1/2$, $z_1 = \frac{1}{2}(1 - 1/\sqrt{3})$, and $z_2 = \frac{1}{2}(1 + 1/\sqrt{3})$. The resulting approximation is

$$\int_0^1 e^{3x}\,dx \approx \frac{1}{2}\left(e^{3z_1} + e^{3z_2}\right).$$
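The two approximations in Example 6.10 can be evaluated with a short Python sketch. The function names are assumptions made for illustration; the formulas are the 1-point (midpoint) rule and the 2-point rule (6.47) with the nodes in (6.48).

```python
import math

def gauss_1pt(f, a, b):
    """1-point Gaussian (midpoint) rule on [a, b]."""
    return (b - a) * f((a + b) / 2)

def gauss_2pt(f, a, b):
    """2-point Gaussian rule (6.47), using the nodes given in (6.48)."""
    ell, x_m = b - a, (a + b) / 2
    q = ell / (2 * math.sqrt(3))   # distance of each node from the midpoint
    return 0.5 * ell * (f(x_m - q) + f(x_m + q))

f = lambda x: math.exp(3 * x)
exact = (math.exp(3) - 1) / 3
for rule in (gauss_1pt, gauss_2pt):
    approx = rule(f, 0.0, 1.0)
    print(rule.__name__, approx, abs(approx - exact) / exact)
```

Applying `gauss_2pt` to $x^3$ and $x^4$ on any interval is a quick way to see the precision 3 property: the first is reproduced to rounding error, the second is not.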

6.5.3 m-Point Gaussian Rule

It is possible to show that when using $m$ points, as given in (6.42), the locations of the nodes correspond to the roots of the $m$th order Legendre polynomial. Assuming that $m \ge 2$, and using the recursive properties of these polynomials, finding the nodes and weights can be reduced to an eigenvalue problem involving an $m \times m$ tridiagonal matrix. This problem is given in Exercise 4.64. Letting $\lambda_j$ denote the $j$th eigenvalue, then in (6.42) the corresponding node is given as $z_j = \frac{1}{2}[b + a + (b - a)\lambda_j]$, and the weight is $w_j = \frac{1}{2}(b - a)\bar{w}_j$, where $\bar{w}_j$ is defined in Exercise 4.64. This is the basis for what is known as the Golub-Welsch algorithm, and more about this can be found in Golub and Welsch [1969] and Davis and Rabinowitz [2007].

The values for the Gaussian quadrature nodes and weights are listed in most handbooks or compilations of mathematical formulas (e.g., the values for the 80-point rule are listed in Olver et al. [2010]). Those with a more extreme interest in this should also consult Love [1966], which considers up to 200 nodes, or Bogaert [2014], who discusses cases with millions of nodes. For the latter, the nodes near the endpoints of the interval are so close that they are not resolvable using double precision. The need for such a large number of nodes is, for the moment, mostly of theoretical interest.

6.5.4 Error

We now turn to the error when using Gaussian quadrature. We begin with the theorem which provides the needed formula.

Theorem 6.5 If $f \in C^{2m}[a, b]$, then

$$\int_a^b f(x)\,dx = w_1 f(z_1) + w_2 f(z_2) + \cdots + w_m f(z_m) + E_G,$$

where

$$|E_G| \le \frac{\alpha}{\sqrt{m}} R^{2m} \|f^{(2m)}\|_\infty, \tag{6.50}$$

$$R = \frac{(b - a)e}{8m}, \tag{6.51}$$

$\|f^{(2m)}\|_\infty = \max_{a \le x \le b} |f^{(2m)}(x)|$, and $\alpha = (b - a)\sqrt{\pi}/4$.

The proof of this theorem can be found in Isaacson and Keller [1994].


There are a couple of immediate observations concerning the above theorem. First, although a relatively minor criticism, it requires $f(x)$ to be much smoother than the other integration theorems considered so far. The second, and more significant, comment is that the error can approach zero exponentially. In particular, if $m$ is large enough that $R < 1$, then $R^{2m}$ approaches zero exponentially as $m$ increases. Whether this means that the error using Gaussian quadrature approaches zero exponentially depends on the contribution of the $f^{(2m)}$ term in the error formula. This is the same situation we encountered with Chebyshev interpolation in Section 5.5.4.

Example 6.11

$$\int_0^1 e^{3x}\,dx. \tag{6.52}$$

To demonstrate just how effective Gaussian quadrature can be, the relative error involved with computing (6.52) is given in Figure 6.12 as a function of the number of Gaussian points used. For comparison, the values obtained using the Simpson-Romberg procedure are given, using the corresponding number of interpolation points in the interval $0 \le x \le 1$. The improvement of the Gaussian value as $m$ increases is impressive, reaching the resolution of double precision using just 8 points.
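To get a feel for how quickly the bound in Theorem 6.5 decreases for the integral in (6.52), the right-hand side of (6.50) can be evaluated directly. In this Python sketch the function name is an assumption, and it uses the fact that for $f(x) = e^{3x}$ on $0 \le x \le 1$ one has $\|f^{(2m)}\|_\infty = 3^{2m} e^3$.

```python
import math

def gauss_error_bound(m, a, b, deriv_max):
    """Right-hand side of (6.50): (alpha / sqrt(m)) * R**(2m) * ||f^(2m)||."""
    alpha = (b - a) * math.sqrt(math.pi) / 4
    R = (b - a) * math.e / (8 * m)          # defined in (6.51)
    return alpha / math.sqrt(m) * R ** (2 * m) * deriv_max

# For f(x) = exp(3x) on [0, 1]: f^(2m)(x) = 3**(2m) * exp(3x), largest at x = 1.
bounds = {}
for m in range(2, 10):
    bounds[m] = gauss_error_bound(m, 0.0, 1.0, 3 ** (2 * m) * math.exp(3))
    print(m, bounds[m])
```

The printed values are bounds on the absolute error, so dividing by the exact value $(e^3 - 1)/3$ gives a bound on the relative error plotted in Figure 6.12.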

6.5.5 Parting Comments

Looking at Figure 6.12, the question arises as to why even consider other integration rules? To answer this, while there is no doubt that Gaussian quadrature is often used to evaluate definite integrals, it is not necessarily always the best choice. This is discussed in Section 6.8, but one example is when you are not

[Plot: Error versus Number of Points Used, with curves labeled Simp-Romb and Gaussian.]
Figure 6.12 Relative error in using Gaussian quadrature, versus using Simpson-Romberg, to evaluate (6.52).


able to control the placement of the integration points, which is often the case when the data is obtained experimentally. A second situation involves solving differential equations. Integration rules are often used in deriving methods for solving differential equations, and both the placement of the integration points as well as the required smoothness needed for Gaussian quadrature complicate deriving efficient methods.

You might have noticed that the nodes for the best interpolation polynomial (i.e., the Chebyshev points) are generally different than the nodes for the best quadrature rule (i.e., the Gaussian points). The reason is that the Chebyshev points minimize the value of $Q$ given in (5.33), while the Gaussian points are used to integrate lower order polynomials exactly. Nevertheless, Chebyshev interpolation is used to derive integration rules. It turns out that a somewhat better idea is to use the extrema points of the Chebyshev polynomial from the interval $a \le x \le b$ rather than the points where it is zero. The respective weights are determined, as usual, by maximizing the precision. This gives rise to what is known as Clenshaw-Curtis quadrature. Assuming there are $m$ nodes, this method has precision $m - 1$, while the Gaussian rule has precision $2m - 1$. Even so, there are situations where Clenshaw-Curtis might be preferable to Gaussian quadrature, and this is discussed in Trefethen [2008].

6.6 Adaptive Quadrature

Consider the problem of integrating the function $f(x) = \cos(x^3)^{200}$ over the interval $0 \le x \le 3$, which is shown in Figure 6.13. If Simpson’s rule is used, and an error of less than $10^{-6}$ is desired, then according to Theorem 6.3 it is necessary to use about 19,200 equally spaced subintervals. This large number is necessary to accurately calculate the area under the sharp peaks near $x = 3$. However, such small subintervals are not necessary to accurately compute the area over the interval $0 \le x \le 1$. The way to deal with this is simple, namely you just use smaller subintervals in the regions with peaks, and larger subintervals in the flatter regions. What you do not want to do is to manually decide how to break up the interval, but instead design a procedure that progressively refines the subdivisions in the peak regions until the required accuracy is attained. This is the essence of what is called adaptive quadrature.

To explain the basic idea underlying adaptive quadrature, consider evaluating the integral

$$\int_0^4 f(x)\,dx, \tag{6.53}$$

where $f(x)$ is the shifted Lorentzian function


[Plot: y-axis versus x-axis, for 0 ≤ x ≤ 3.]
Figure 6.13 Example of a function that shows significant variation over the interval of integration.

$$f(x) = 1 + \frac{\omega}{\pi}\,\frac{1}{(x - x_0)^2 + \omega^2},$$

where $x_0 = 3$ and $\omega = 1/(2\pi)$. The graph of $f(x)$ is shown in Figure 6.14.

The version of adaptive quadrature described here uses Simpson’s rule. In the derivation it is necessary to know how the error is affected as the number of subintervals increases. The needed formulas are derived in the next paragraph for the general case using Simpson’s rule, and after that the method will be applied to the above example.

The method improves the accuracy, when necessary, by doubling the number of subintervals in a region. This requires a way to estimate how much this reduces the error. To do this, suppose $x_L \le x \le x_R$ has two subintervals. Simpson’s rule gives

$$\int_{x_L}^{x_R} f(x)\,dx = I_S(2) + E_2, \tag{6.54}$$

where $I_S(2) = \frac{h}{3}(f_L + 4f_M + f_R)$, $E_2$ is the error, and $x_M = x_L + h$ is the midpoint of the interval. If you double the number of subintervals, each now having width $h/2$, then

[Plot: y-axis versus x-axis, for 0 ≤ x ≤ 4.]
Figure 6.14 Function used in the integral (6.53), and the initial subdivisions used for Simpson’s rule.




$$\int_{x_L}^{x_R} f(x)\,dx = I_S(4) + E_4. \tag{6.55}$$

According to (6.20), if you cut $h$ by a factor of 2 then the error should be reduced by about a factor of 16. So, $E_4 \approx E_2/16$, or, equivalently, $E_2 \approx 16E_4$. Since (6.54) and (6.55) are supposed to produce the same result, we equate the right-hand sides and conclude that

$$E_4 \approx \frac{1}{15}\big[I_S(4) - I_S(2)\big]. \tag{6.56}$$

With this we have a way to estimate the error after the subdivision. The remaining question is, how small do we want $|E_4|$? Assuming the requirement is that the error in the computed value for the entire integral (6.1) is no more than tol, then we will require

$$|E_4| \le \frac{x_R - x_L}{b - a}\,\mathrm{tol}. \tag{6.57}$$

The coefficient of tol in the above expression is there so that the errors from the subintervals add up to tol.

We return to the original problem of evaluating (6.53). The method will produce levels of subdivision, and it is worth labeling them as the refinement proceeds. It is also assumed that tol is a prescribed error tolerance.

Level 1: Taking two subintervals, so $h = (b - a)/2 = 2$ (which are indicated using solid vertical blue lines in Figure 6.14), the first approximation is

$$I_S(2) = \frac{2}{3}\big[f(0) + 4f(2) + f(4)\big] \approx 4.1684.$$

Doubling the number of subintervals, then $h = 1$ and the subintervals now include the dashed blue lines in Figure 6.14. In this case,

$$I_S(4) = \frac{1}{3}\big[f(0) + 4f(1) + 2f(2) + 4f(3) + f(4)\big] \approx 6.7347.$$

Using (6.56), this means that the error in approximating the integral using $I_S(4)$ is $E_4 \approx (I_S(4) - I_S(2))/15 \approx 0.1711$. If this satisfies (6.57), which means that $|E_4| \le \mathrm{tol}$, then you are finished and the computed value is $I_S(4)$. If not, then more subdivisions are needed and you proceed to Level 2.

Level 2: If the error condition is not satisfied, then the method splits the interval in half and writes

$$\int_0^4 f(x)\,dx = \int_0^2 f(x)\,dx + \int_2^4 f(x)\,dx, \tag{6.58}$$


and then applies the approximations used in Level 1 to each integral. The resulting subintervals are shown in Figure 6.15. For example, for $\int_0^2 f(x)\,dx$ it calculates $I_S(2) = \frac{1}{3}[f(0) + 4f(1) + f(2)] \approx 2.0351$, and $I_S(4) \approx 2.0336$. The error in this case is $E_4 \approx -10^{-4}$. If this satisfies (6.57), which means that $|E_4| \le \mathrm{tol}/2$, then computing the integral over the interval $0 \le x \le 2$ is done. If not, then this integral is passed to Level 3. A similar calculation is done for $\int_2^4 f(x)\,dx$.

Level 3, 4, ···: The method keeps subdividing the subintervals that are transferred from the previous level, until it finally obtains a value for each that satisfies the error condition.

In the above description, the value of $I_S(4)$ is used as the computed value of the integral over the subinterval. Given the way the number of subintervals is doubled in this procedure, it is possible to compute the final value using Romberg integration. So, instead of $I_S(4)$, a more accurate value for the subinterval is obtained by using $[16I_S(4) - I_S(2)]/15$.

Letting the procedure run, the final subdivision is shown in Figure 6.16. The error condition used to decide when to stop subdividing is that the error for the integral is no more than tol = $10^{-6}$. This resulted in the need for 8 levels, and a total of 89 evaluations of the function $f(x)$. In comparison, it takes approximately 960 equally spaced subdivisions to achieve the same error, and this is also the approximate number of function evaluations needed. As another example, for the function shown in Figure 6.13, 19,200 equally spaced subintervals are needed to achieve an error of $10^{-6}$, and this is also the approximate number of function evaluations required. In contrast, using the adaptive Simpson method, this error is achieved using 849 function evaluations (and 14 levels).

There are numerous variations on the idea of adaptive integration. For example, instead of using Simpson’s rule, the 2-point Gaussian rule is a possibility and it has an error comparable with Simpson. The drawback is that when subdividing the intervals, the nodes for the larger interval are not shared

[Plot: y-axis versus x-axis, for 0 ≤ x ≤ 4.]
Figure 6.15 Function used in the integral (6.53), and the subdivisions used in Level 2 of the adaptive procedure.


[Plot: y-axis versus x-axis, for 0 ≤ x ≤ 4.]
Figure 6.16 Function used in the integral (6.53), and the final subdivisions used for adaptive Simpson’s rule.

with the smaller subintervals. This means that a completely new set of function evaluations is required for each level, with a corresponding increase in computing time. There are some clever ways to avoid this difficulty, and one of the better known gives rise to what are called Gauss-Kronrod rules. Recent reviews, or discussions, of this can be found in Gander and Gautschi [2000] and Gonnet [2012].
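The adaptive Simpson procedure described in this section can be sketched recursively in Python. This is one possible arrangement, not the author's implementation: the function names, the recursion cap of 40 levels, and the way levels are counted are assumptions made here. It uses the error estimate (6.56), the stopping condition (6.57), and the Romberg value $[16I_S(4) - I_S(2)]/15$ on each accepted subinterval.

```python
import math

def lorentzian(x, x0=3.0, omega=1 / (2 * math.pi)):
    """Shifted Lorentzian function used in the integral (6.53)."""
    return 1 + (omega / math.pi) / ((x - x0) ** 2 + omega ** 2)

def adaptive_simpson(f, a, b, tol):
    """Adaptive Simpson: returns (integral estimate, deepest level used)."""
    def refine(xl, xr, fl, fm, fr, level):
        xm = 0.5 * (xl + xr)
        h = 0.5 * (xr - xl)                                  # width used by I_S(2)
        fq1, fq3 = f(0.5 * (xl + xm)), f(0.5 * (xm + xr))    # the two new nodes
        s2 = h / 3 * (fl + 4 * fm + fr)                      # I_S(2)
        s4 = h / 6 * (fl + 4 * fq1 + 2 * fm + 4 * fq3 + fr)  # I_S(4)
        e4 = (s4 - s2) / 15                                  # estimate (6.56)
        # Accept if (6.57) holds; the level cap is a hypothetical safeguard.
        if abs(e4) <= (xr - xl) / (b - a) * tol or level >= 40:
            return (16 * s4 - s2) / 15, level                # Romberg value
        left, lev_l = refine(xl, xm, fl, fq1, fm, level + 1)
        right, lev_r = refine(xm, xr, fm, fq3, fr, level + 1)
        return left + right, max(lev_l, lev_r)
    return refine(a, b, f(a), f(0.5 * (a + b)), f(b), 1)

value, levels = adaptive_simpson(lorentzian, 0.0, 4.0, 1e-6)
# The Lorentzian term has antiderivative arctan((x - x0)/omega)/pi.
w = 1 / (2 * math.pi)
exact = 4 + (math.atan(1 / w) + math.atan(3 / w)) / math.pi
print(value, levels, abs(value - exact))
```

Function values at the endpoints and midpoint are passed down to the next level, so each subdivision costs only two new evaluations of f. This reuse is the property that the 2-point Gaussian rule lacks.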

6.7 Other Ideas

In Section 1.4 the idea of compensated summation was introduced and its benefits illustrated in Table 1.4. Given the potentially large number of additions that can arise with numerical integration, the question arises as to whether compensated summation is worth using. To investigate this, we return to evaluating the integral $\int_0^1 e^{3x}\,dx$. The error in computing this integral using the composite midpoint and Simpson’s rules, with and without compensated summation, is shown in Figure 6.17. What it shows is that for Simpson’s rule compensated summation does help once the error gets close to machine epsilon. In fact, the missing data values when using compensated summation, such as the value when $n = 10^4$, mean that the error is zero. In comparison, because the midpoint rule requires a significantly larger number of subintervals, compensated summation starts to make a noticeable improvement at a larger error level.
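A minimal Python sketch of this comparison for the composite midpoint rule is given below. It assumes the compensated summation of Section 1.4 is the Kahan form, which cannot be confirmed from this section alone, and the function names and the choice $n = 10^5$ are illustrative.

```python
import math

def midpoint_plain(f, a, b, n):
    """Composite midpoint rule with ordinary summation."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        total += f(a + (i + 0.5) * h)
    return h * total

def midpoint_compensated(f, a, b, n):
    """Composite midpoint rule using Kahan compensated summation (assumed form)."""
    h = (b - a) / n
    total, comp = 0.0, 0.0         # running sum and running correction
    for i in range(n):
        y = f(a + (i + 0.5) * h) - comp
        t = total + y
        comp = (t - total) - y     # rounding error made in forming total + y
        total = t
    return h * total

f = lambda x: math.exp(3 * x)
exact = (math.exp(3) - 1) / 3
n = 10 ** 5
plain = midpoint_plain(f, 0.0, 1.0, n)
comp = midpoint_compensated(f, 0.0, 1.0, n)
print(abs(plain - exact) / exact, abs(comp - exact) / exact)
```

For moderate n the two results agree closely because the truncation error of the midpoint rule dominates; a difference only becomes visible once n is large enough for the accumulated rounding error to matter.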

6.8 Epilogue

This chapter is guilty of overkill in the sense that several methods were derived that basically do the same thing. This is done, partly, because there is no “best” method. It is certainly true that Gaussian quadrature has the


distinct advantage of possibly having exponential convergence. The reason for wanting exponential convergence is evident in Figure 6.12. Even so, as explained in Section 6.3.1, integration is an iterative process. Unlike most of the other methods discussed in this chapter, the Gaussian rule cannot make use, in an obvious way, of the function evaluations made in the earlier steps of the iteration.

Then there is the actual computing time needed to evaluate the integral. Even for a function as complicated as the one in (6.2), evaluating the integral using Simpson-Romberg requires about 0.3 msec (taking tol = $10^{-10}$). For the same relative iterative error, Gaussian quadrature requires about 0.1 msec. As another comparison, integrating Runge’s function (5.9) over $-1 \le x \le 1$ takes about 0.01 msec with Simpson-Romberg but 1.2 msec with Gaussian quadrature. In comparison, MATLAB’s integral command, which uses a vectorized version of a Gauss-Kronrod method, takes about 0.12 msec. The reason Gauss is not that much faster in the first example, and much slower in the second example, is that it requires solving an eigenvalue problem for each set of nodes, and this adds to the computing time. As explained in Johansson and Mezzarobba [2018] and Bogaert [2014], there are faster methods for computing the Gaussian nodes and weights. Even so, in the end, for the integrals considered here the difference in computing times between Gaussian and Simpson-Romberg is effectively imperceptible.

There are two special situations worth mentioning. The first is when you have a fixed data set, with possibly irregularly spaced nodes. In this situation, the Spline-Simpson rule (6.40) is the one to use. The second situation is when $f(x)$ is periodic. In such cases the composite midpoint and trapezoidal methods can be very effective, giving rise to exponential convergence. An explanation of why, and what this is, can be found in Weideman [2002], Waldvogel [2011], and Trefethen and Weideman [2014]. A related, but more difficult, situation arises when $f(x)$ is an oscillatory function and it oscillates

[Plot: Error versus n, with curves labeled Original and Comp Sum.]
Figure 6.17 Two upper right curves: Composite midpoint rule with and without compensated summation. Two lower left curves: Composite Simpson’s rule with and without compensated summation.


rapidly. These can arise, for example, when you use a Fourier transform to solve a differential equation. Some of the ideas on how to integrate such functions can be found in Evans and Webster [1999], and Iserles et al. [2006].

Exercises

Sections 6.1–6.2

6.1. This problem concerns calculating the integral

$$I = \int_0^2 \frac{1}{1 + x}\,dx.$$

(a) Using 4 subintervals, sketch the curve and the trapezoids that are used for the composite trapezoidal rule. Determine $I_T$ and calculate the relative error in this approximation.
(b) Using 4 subintervals, determine $I_S$ and calculate the relative error in this approximation.
(c) Using 4 subintervals, determine $I_H$ and calculate the relative error in this approximation.
(d) Using the composite trapezoidal rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?
(e) Using the composite Simpson’s rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?
(f) Using the composite Hermite rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?

6.2. This problem concerns calculating the integral

$$I = \int_{-1}^{1} e^{-2x}\,dx.$$

(a) Using 4 subintervals, sketch the curve and the trapezoids that are used for the composite trapezoidal rule. Determine $I_T$ and calculate the relative error in this approximation.
(b) Using 4 subintervals, determine $I_S$ and calculate the relative error in this approximation.
(c) Using 4 subintervals, determine $I_H$ and calculate the relative error in this approximation.
(d) Using the composite trapezoidal rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?
(e) Using the composite Simpson’s rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?


(f) Using the composite Hermite rule, how small does the step size $h$ have to be to guarantee that the error is less than $10^{-6}$?

6.3. The measured values of a force $F(x)$ as a function of the displacement $x$ are given in Table 6.7. The object of this exercise is to calculate the work

$$W(x) = \int_0^x F(s)\,ds.$$

(a) To compute $W(8)$ using the trapezoidal rule you can use one subinterval of length 8, two subintervals of length 4, or four subintervals of length 2. Which choice is expected to produce the most accurate value? Why?

In the following, if there are different choices you can select for the subintervals, you should pick the one that is expected to produce the most accurate result.

(b) Use the trapezoidal rule to find the value of $W(x)$ at $x = 2, 4, 6, 8$.
(c) What $x$’s can $W(x)$ be evaluated at using the midpoint (or composite midpoint) rule? What is the value at each of these points?
(d) What $x$’s can $W(x)$ be evaluated at using Simpson’s (or composite Simpson’s) rule? What is the value at each of these points?
(e) Which of the values for $W(8)$ from (b)-(d) is likely the most accurate? Why?

6.4. For a linearly elastic material, the stress $T(x)$ is given as

$$T = E\,\frac{du}{dx},$$

where $u(x)$ is the displacement of the material and $E$ is a positive constant known as the Young’s modulus. The question considered here is how to determine $u$ from measurements of $T$, as given in Table 6.8.

(a) Show that

$$u(x) = u(0) + \frac{1}{E}\int_0^x T(s)\,ds.$$

In the following you can assume that $u(0) = 0$ and $E = 4$. Also, if there are different choices you can select for the subintervals, you should pick the one that is expected to produce the most accurate result.

(b) Use the trapezoidal rule to find the value of $u(x)$ at $x = 1/4, 1/2, 3/4, 1$.

Table 6.7  Values for Exercise 6.3.

x    0    2    4    6     8
F    0    1    5    17    37

Table 6.8  Values for Exercise 6.4.

x    0    1/4    1/2    3/4    1
T    1    −1     2      3      4

Table 6.9  Values for Exercise 6.6.

n      Method 1       Method 2       Method 3
5      1.561856000    1.001856000    1.068377000
10     1.315116500    1.000116500    1.017398078
20     1.166257289    1.000007289    1.004368622
40     1.085312956    1.000000456    1.001093351
80     1.043203153    1.000000028    1.000273413
160    1.021738283    1.000000002    1.000068358

(c) Use the composite midpoint rule to calculate $u(1)$.
(d) Use the composite Simpson’s rule to calculate $u(1)$.
(e) Which of the values from (b)-(d) is likely the least accurate? Why?

6.5. The second, $f''(x)$, and fourth, $f''''(x)$, derivatives of a function $f(x)$ are plotted in Figure 6.18. This question concerns evaluating $\int_0^3 f(x)\,dx$.
(a) How large must $n$ be to guarantee the error using the composite trapezoidal rule is less than $10^{-8}$?
(b) How large must $n$ be to guarantee the error using the composite Simpson’s rule is less than $10^{-8}$?

[Figure 6.18: Graph used in Exercise 6.5. The curves $f''$ and $f''''$ are plotted against the x-axis for $0 \le x \le 3$.]

6.6. The values given in Table 6.9 are the result of using a composite integration rule. Which one is most likely the midpoint, trapezoidal, Simpson, and Hermite rules? Note some of the answers will repeat.

6.7. This problem considers how well the integration methods do when applied to a periodic function. The particular integral is
$$\int_0^1 \sin(2\pi x)\,dx.$$

(a) What is the exact value?
(b) Determine (by hand) the value of $I_T$ when $n = 3, 4, 5, 6$. For each $n$, sketch $\sin(2\pi x)$, and the corresponding piecewise linear approximation used in the trapezoidal method, and use this to explain why the method works as well as it does.
(c) Determine (by hand) the value of $I_M$ when $n = 3, 4, 5, 6$. For each $n$, sketch $\sin(2\pi x)$, and the corresponding piecewise constant approximation used in the midpoint method, and use this to explain why the method works as well as it does.

6.8. If you include the error term in Simpson’s rule, instead of (6.21) you have
$$\int_{x_{i-1}}^{x_{i+1}} f(x)\,dx = \frac{h}{3}\left(f_{i-1} + 4f_i + f_{i+1}\right) + K h^5 f''''(\eta),$$
where $\eta$ is a point somewhere in the interval $x_{i-1} < x < x_{i+1}$. By substituting $f(x) = x^4$ into the above formula, find $K$.

6.9. It is sometimes possible to use the formula for the error to determine a useful lower bound on the number of subintervals required to achieve a certain error.
(a) As stated, the error for the composite midpoint rule can be written as $E_M = \frac{b-a}{24} h^2 f''(\eta)$. In contrast to Example 6.1(2), use this to determine the minimum number of subintervals that need to be used.
(b) Explain why the idea used in part (a) is not very informative if you integrate $f(x) = x^6$ over the same interval.
(c) Answer part (a) for the composite Simpson’s rule.

6.10. This exercise explores some connections between Simpson’s rule and some of the other methods that were derived. In this exercise, $I(n)$ designates a composite rule that uses $n$ subintervals. Assume here that $n$ is even.
(a) Show that $I_S(n) = \frac{2}{3} I_T(n) + \frac{1}{3} I_M(n/2)$.
(b) Show that $I_S(n) = \frac{4}{3} I_T(n) - \frac{1}{3} I_T(n/2)$.

6.11. Suppose the nodes are not necessarily equally spaced. Specifically, $x_{i+1} = x_i + h_i$, for $i = 1, 2, \cdots, n$, where $x_1 = a$ and $x_{n+1} = b$. Also, the $h_i$’s are positive, and let $h_\infty = \max\{h_1, h_2, \cdots, h_n\}$.
(a) Show that $\sum_{i=1}^{n} h_i = b - a$.
(b) Show that the composite midpoint rule is
$$I_M = h_1 f\left(x_1 + \frac{1}{2}h_1\right) + h_2 f\left(x_2 + \frac{1}{2}h_2\right) + \cdots + h_n f\left(x_n + \frac{1}{2}h_n\right).$$
Note that Theorem 6.1 still holds if you replace $h$ with $h_\infty$.
(c) Show that the composite trapezoidal rule is
$$I_T = \frac{1}{2}\left[h_1 f_1 + (h_1 + h_2) f_2 + (h_2 + h_3) f_3 + \cdots + (h_{n-1} + h_n) f_n + h_n f_{n+1}\right].$$
Note that Theorem 6.2 still holds if you replace $h$ with $h_\infty$.

Section 6.3

6.12. This is a continuation of Exercise 6.3.
(a) Use Romberg integration with the composite trapezoidal rule to calculate $W(8)$.
(b) Use Romberg integration with the composite midpoint rule to calculate $W(8)$.
(c) Use Romberg integration with the composite Simpson’s rule to calculate $W(8)$.
(d) Which of the values for $W(8)$ from Exercise 6.3 and (a)-(c) is likely the most accurate? Why?

6.13. This is a continuation of Exercise 6.4.
(a) Use Romberg integration with the composite trapezoidal rule to calculate $u(1)$.
(b) Use Romberg integration with the composite midpoint rule to calculate $u(1)$.
(c) Use Romberg integration with the composite Simpson’s rule to calculate $u(1)$.
(d) Which of the values for $u(1)$ from Exercise 6.4 and (a)-(c) is likely the most accurate? Why?

6.14. In this exercise Romberg rules are derived. Assume that $I(n)$ is an integration rule that uses $n$ subintervals and $h$ is the corresponding width of each subinterval.
(a) It is known that the error term for the composite Simpson’s rule involves even powers of $h$. In particular,
$$\int_a^b f(x)\,dx = I_S(n) + \alpha h^4 + \beta h^6 + \gamma h^8 + \cdots.$$
Show that the integration rule
$$I_R = \frac{1}{15}\left[16 I_S(2n) - I_S(n)\right]$$
has an error that is $O(h^6)$.

(b) Suppose
$$\int_a^b f(x)\,dx = I(n) + \alpha h^2 + \beta h^3 + \gamma h^4 + \cdots.$$
Show that the integration rule
$$I_R = \frac{1}{21}\left[32 I(4n) - 12 I(2n) + I(n)\right]$$
has an error that is $O(h^4)$.
(c) Suppose
$$\int_a^b f(x)\,dx = I(n) + \alpha h^2 + \beta h^3 + \gamma h^4 + \cdots.$$
Show that the integration rule
$$I_R = \frac{1}{12}\left[27 I(3n) - 16 I(2n) + I(n)\right]$$
has an error that is $O(h^4)$.

Computational Questions for Sections 6.1–6.3

6.15. Parametric equations for an ellipse are: $x = 4\cos(t)$, $y = 3\sin(t)$, for $0 \le t \le 2\pi$. This exercise concerns computing the perimeter (or, circumference), using the formula for arc-length. The value must be correct to about 8 digits.
(a) Sketch the ellipse.
(b) Compute the perimeter using the composite trapezoidal rule. How many subintervals were used? How did you guarantee that the value is correct to about 8 digits?
(c) Do part (b) using the composite Simpson’s rule.
(d) Do part (b) using Romberg integration with the composite Simpson’s rule.
(e) Do part (b) using the Hermite rule.

6.16. The error function is defined as
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-s^2}\,ds.$$
(a) Using the composite trapezoidal rule compute $\operatorname{erf}(2)$. The value should be correct to six significant digits. How many subintervals were used? How did you guarantee that the value is correct to six digits?
(b) Do part (a) using the composite Simpson’s rule.
(c) Do part (a) using Romberg integration with the composite Simpson’s rule.


(d) Do part (a) using the Hermite rule.
(e) Plot $\operatorname{erf}(x)$ for $0 \le x \le 10$. Explain how you computed the function, and give the error tolerance used and the reason(s) why you used that value.

6.17. According to Planck’s law of blackbody radiation, the spectral energy density is
$$E_d(\lambda) = \frac{8\pi h c}{\lambda^5 \left(e^{\alpha/\lambda} - 1\right)},$$
where $\lambda$ is the wavelength and $\alpha = hc/(k_B T)$. Also, $T$ is absolute temperature, $k_B$ is the Boltzmann constant, $h$ is Planck’s constant, and $c$ is the speed of light. The energy emitted in the wavelength band $\lambda_1 \le \lambda \le \lambda_2$ is
$$E(\lambda_1, \lambda_2) = \int_{\lambda_1}^{\lambda_2} E_d(\lambda)\,d\lambda.$$
In this problem take $8\pi h c = 5 \times 10^{-18}$ J μm, and $T = 7000$ K, so that $\alpha = 2$ μm. Note that in this case the wavelength $\lambda$ is measured in μm.
(a) Plot $E_d(\lambda)$ for $0.01 \le \lambda \le 4$.
(b) Using the composite Simpson’s rule, calculate $E(0.01, 0.4)$. Your answer should be correct to at least six significant digits. Make sure to state the number of subintervals used and how you determined that the error condition is satisfied.
(c) Do part (b) using Romberg integration with the composite Simpson’s rule.
(d) Do part (b) using the composite Hermite rule.

6.18. The Mooney-Rivlin law, often used for elastic polymers, states that the stress $T(x)$ in a material is given as
$$T = \left(\alpha + \frac{\beta}{\lambda}\right)\left(\lambda^2 - \frac{1}{\lambda}\right),$$
where $\lambda$, known as the stretch, is given as
$$\lambda = 1 + \frac{du}{dx}.$$
The function $u(x)$ is the displacement of the material, and $\alpha$ and $\beta$ are positive constants. The objective of this exercise is, for a material which occupies the interval $0 \le x \le \ell$, to determine $u$ from measured (known) values of $T$. This will be done by first finding $\lambda$ from $T$, and then determining $u$ from $\lambda$. Also, note that the stretch is positive.
(a) Write down an algorithm, that uses either Newton’s method or the secant method, for finding the value of $\lambda$ for a given value of $T$.


(b) Show that
$$u(x) = u(0) - x + \int_0^x \lambda(s)\,ds.$$
(c) Suppose the interval $0 \le x \le \ell$ contains $n$ equally spaced points $x_i$, where $x_1 = 0$ and $x_n = \ell$. Also, suppose that the value of $\lambda$ is known at each of these points, and designate these values as $\lambda_i$, for $i = 1, 2, \cdots, n$. Write down an algorithm, that uses the trapezoidal rule, which can be used to determine the $u_i$ values from the $\lambda_i$ values.
(d) Suppose that the values for the $T_i$’s are given as $T_i = x_i^2[\alpha(1 + x_i^2) + \beta](3 + 3x_i^2 + x_i^4)/(1 + x_i^2)^2$, for $i = 1, 2, \cdots, n$. Using your algorithms from parts (a) and (c), compute the $u_i$’s in the case when $n = 20$ and $\ell = 1$, and then plot the values $(x_i, u_i)$. In this calculation assume that $u(0) = 0$, and take $\alpha = 20$ and $\beta = 10$, which are typical values for an elastomer.
(e) Note that the exact solution at $x = 1$ is $u = 1/3$. For your algorithm in part (d), what value for $n$ do you need to take so the error in your computed answer at $x = 1$ is no more than $10^{-6}$?

6.19. The position $y(t)$, velocity $v(t)$, and acceleration $a(t)$ are related through the equations: $a(t) = v'(t)$ and $v(t) = y'(t)$. In this problem it is assumed that $v(0) = 0$ and $y(0) = 0$. In this case,
$$v(t) = \int_0^t a(r)\,dr \quad \text{and} \quad y(t) = \int_0^t v(r)\,dr.$$
It is also assumed that $a(t)$ is known, and the objective of this exercise is to compute the velocity and position from this information.
(a) Given a subinterval $t_i \le t \le t_{i+1}$, then $a_i = a(t_i)$ and $a_{i+1} = a(t_{i+1})$ are known. Assuming $v_i$ and $y_i$ have already been computed, use the trapezoidal rule to obtain the following expressions
$$v_{i+1} = v_i + \frac{1}{2}h(a_i + a_{i+1}),$$
and
$$y_{i+1} = y_i + \frac{1}{2}h(v_i + v_{i+1}).$$
(b) Suppose the interval $0 \le t \le 3$ is subdivided into $n$ equally spaced subintervals. So, $t_i = (i - 1)h$, where $i = 1, 2, 3, \cdots, n + 1$ and $h = 3/n$. Assuming that $a(t) = \sin(t^4)$, plot $y$ as a function of $t$, for $n = 10, 20, 40$. The three curves should be on the same axes.
(c) An accurately computed value for the position at $t = 3$ is $y(3) = 0.72732289075\ldots$. What is the difference between this value and what you compute for $y(3)$ at $n = 10, 20, 40$? How large does $n$ need to be so that the difference between this value and what you compute for $y(3)$ is less than $10^{-8}$ in absolute value?
(d) Compute $y(3)$ so the answer is correct to six significant digits. Do not use the value for $y(3)$ given in part (c) to do this. Also, explain the method you used to answer this question.

6.20. This problem considers a way to compute velocity and position that differs from the one considered in Exercise 6.19. You will find that the value of $n$ in part (c) is a factor of about 14 smaller than the corresponding value from Exercise 6.19(c).
(a) Suppose the interval $0 \le t \le 3$ is subdivided into $n$ equally spaced subintervals. So, $t_i = (i - 1)h$, where $i = 1, 2, 3, \cdots, n + 1$ and $h = 3/n$. Explain how the Hermite rule can be used to obtain the following expressions
$$v_{i+1} = v_i + \frac{1}{2}h(a_i + a_{i+1}) + \frac{1}{12}h^2(a'_i - a'_{i+1})$$
and
$$y_{i+1} = y_i + \frac{1}{2}h(v_i + v_{i+1}) + \frac{1}{12}h^2(a_i - a_{i+1}).$$
(b) Assuming that $a(t) = \sin(t^4)$, plot, on the same axes, $y$ as a function of $t$, for $n = 10, 20, 40$.
(c) An accurately computed value for the position at $t = 3$ is $y(3) = 0.72732289075\ldots$. What is the difference between this value and what you compute for $y(3)$ at $n = 10, 20, 40$? How large does $n$ need to be so that the difference between this value and what you compute for $y(3)$ is less than $10^{-8}$ in absolute value?

6.21. The objective of this exercise is to compute what appears to be the simple task of finding the area enclosed by a curve. Specifically, the area of the region enclosed by the curve $x^4 + 2y^4 = 1$.
(a) Sketch the curve. Explain how symmetry can be used to reduce the problem down to computing the area in the first quadrant.
(b) One might think that the area can be computed by solving the equation for $y$, and then computing the integral over $0 \le x \le 1$. Explain why this might not work based on the conditions required, say, in Theorems 6.2 and 6.3. Also, explain that this issue also arises if you instead solve for $x$ and then compute the integral over $0 \le y \le 1/2^{1/4}$.
(c) Ignoring the observations in part (b), use the composite Simpson’s rule to compute the integral of $y = [(1 - x^4)/2]^{1/4}$, for $0 \le x \le 1$. Your answer should be correct to 8 digits. How many subintervals were used to obtain this value?
(d) Divide the region in the first quadrant into two regions, one with $0 \le x \le 1/2$ and the other with $1/2 \le x \le 1$. Using the composite Simpson’s rule, find the area of the first region by integrating with respect to $x$ and find the area of the second region by integrating with respect to $y$. Your answers should be correct to 8 digits. How many subintervals were used for each integral? What is the total area enclosed by the curve?
(e) Explain how the approach in part (d) avoids the issue considered in part (b).

6.22. In this exercise you are to evaluate $\int_0^1 y(t)\,dt$, where $y(t)$ is determined by solving the equation $y + t = e^y - e^{-y}$. Note that $y(0) = 0$.
(a) Evaluate the integral using the composite Simpson’s rule. Your answer should be correct to 8 digits. Also, what method did you use to determine $y$?
(b) Making the substitution $r = y(t)$, obtain an integral of a function $f(r)$ over the interval $0 \le r \le y_1$, where $y_1$ is the solution of the equation when $t = 1$. Evaluate the integral using the composite Simpson’s rule. Your answer should be correct to 8 digits.

6.23. Evaluate the following integrals using Romberg integration. Your answer should be correct to 6 digits. You should state which composite rule you used, the stopping condition used, and the final number of subintervals needed to obtain the value.
$$\text{(a)} \int_0^{\pi/2} \frac{\sin^2 x}{\sin x + \cos x}\,dx, \qquad \text{(b)} \int_0^{\pi/2} \frac{1}{(\cos^2 x + 3\sin^2 x)^3}\,dx, \qquad \text{(c)} \int_0^{\pi/2} \cos^{-1}\left(\frac{\cos x}{1 + 2\cos x}\right) dx$$

Computational Questions for Section 6.4

Some of the following exercises require a cubic spline interpolation for irregularly spaced nodes. For MATLAB, the not-a-knot spline is available using the built-in command spline, and the natural spline is available using the splineA command (the latter is from MATLAB’s File Exchange site, and findable using a Google search on “splineA matlab”).

6.24. Repeat Example 6.9, on page 276, using the integrals below.
$$\text{(a)} \int_0^1 \sin(\pi x)\,dx, \qquad \text{(b)} \int_0^1 \sqrt{1 + 8x}\,dx, \qquad \text{(c)} \int_0^1 \frac{1}{1+x}\,dx.$$


6.25. Do Exercise 6.24, but use a natural cubic spline rather than a not-a-knot. Comment on any differences there are with the corresponding results from 6.24.

6.26. An object is moving along the x-axis. Its velocity $v(t)$ is measured at various times as it is moving and the resulting data are given in Table 6.10. Assume that its position is $x(t)$ and $x(0) = 0$.
(a) Using the composite trapezoidal formula (6.39) determine the position of the object at $t = 6$.
(b) Using the spline-Simpson formula (6.40), using a not-a-knot spline, determine the position of the object at $t = 6$.
(c) Using the spline-Simpson formula (6.40), using a natural spline, determine the position of the object at $t = 6$.
(d) The actual position of the object is $x(t) = e^{-t}\sin(2t) + t/(1 + t)$. Which method produced the most accurate result?

Section 6.5

6.27. An integral $\int_0^1 f(x)\,dx$ was computed and the relative iterative error is given in Table 6.11 as a function of the number of nodes $n$ that were used. You are to determine which method corresponds to the: trapezoidal, Simpson, and Gaussian methods. Make sure to provide an explanation why.

6.28. True or false: The $m$-node Gaussian quadrature formula will integrate every polynomial with degree $n$, with $n \le 2m - 1$, exactly.

6.29. Suppose the integration rule has the form
$$\int_a^b f(x)\,dx \approx w_1 f(a) + w_2 f(z).$$
This is an example of what is called Radau quadrature, or Gauss-Radau quadrature. This means that one, and only one, of the nodes is an endpoint.
(a) Find the values of $w_1$, $w_2$ and $z$ that maximize the precision. [Hint: let $z = a + q$ and find $q$.]

Table 6.10  The velocity v measured at time t used in Exercise 6.26.

t    0    0.3      1         1.5       2        3.2      4.5      6
v    3    1.396    −0.391    −0.313    0.037    0.133    0.008    0.026

Table 6.11  Values of the relative iterative error for Exercise 6.27.

n     Method 1    Method 2    Method 3    Method 4
6     5.84e−01    8.26e−02    2.32e−01    7.08e−01
12    2.72e−01    1.96e−01    1.05e−03    1.27e−01
24    8.43e−02    1.63e−01    4.55e−15    1.44e−02
48    2.23e−02    9.67e−02    8.88e−16    1.06e−03
96    5.67e−03    5.17e−02    0.00e+00    6.94e−05

(b) The error is known to have the form
$$\int_a^b f(x)\,dx = w_1 f(a) + w_2 f(z) + K\ell^4 f'''(\eta),$$
where, as usual, $\eta$ is a point somewhere in the interval $[a, b]$. By substituting $f(x) = x^3$ into the above formula, find $K$.

6.30. Suppose the integration rule has the form

b

f (x)dx ≈ w1 f (z) + w2 f  (z)

a

(a) Find the values of w1 , w2 and z that maximize the precision. (b) The error is known to have the form 

b

f (x)dx = w1 f (z) + w2 f  (z) + K4 f  (η)

a

where, as usual, $\eta$ is a point somewhere in the interval $[a, b]$. By substituting $f(x) = x^3$ into the above formula, find $K$.

6.31. This problem considers what is known as Lobatto quadrature, or Gauss-Lobatto quadrature. It differs from Gaussian quadrature in that it assumes that the integration rule includes both endpoints, and possibly other points within the interval.
(a) The assumed form using two points is
$$\int_a^b f(x)\,dx \approx w_1 f(a) + w_2 f(b).$$
Find the values of $w_1$ and $w_2$ that maximize the precision.
(b) The assumed form using three points is
$$\int_a^b f(x)\,dx \approx w_1 f(a) + w_2 f(z) + w_3 f(b).$$
Find the values of $w_1$, $w_2$, $w_3$, and $z$ that maximize the precision. Hint: Write $b = a + \ell$ and $z = a + \frac{1}{2}\ell + q$.
(c) The assumed form using four points is
$$\int_a^b f(x)\,dx \approx w_1 f(a) + w_2 f(z_1) + w_3 f(z_2) + w_4 f(b).$$
Show that the values of $w_1$, $w_2$, $w_3$, $w_4$, $z_1$, and $z_2$ that maximize the precision are: $w_1 = w_4 = \ell/12$, $w_2 = w_3 = 5\ell/12$, and $q = \ell/(2\sqrt{5})$. In deriving this result you should assume that $z_1 = x_m - q$ and $z_2 = x_m + q$. Also, similar to Exercises 6.29(b) and 6.30(b), the error term has the form $K\ell^7 f^{(6)}(\eta)$, where $K = -1/1{,}512{,}000$ (you do not need to derive this).

6.32. Evaluate the following integrals using Gaussian quadrature. Your answer should be correct to 8 significant digits. You should state the stopping condition used, and the final number of nodes needed to obtain the value. This requires knowing the formulas for the nodes and weights. Using MATLAB, given a, b, and m, the commands:

    q=0.5./sqrt(1-(2*(1:m-1)).^(-2));
    A=diag(q,1)+diag(q,-1);
    [V,D]=eig(A);
    z=diag(D);
    z=a+0.5*(b-a)*(1+z);
    w=2*V(1,:).^2;

will produce m-vectors z and w that contain the respective nodes and weights.
$$\text{(a)} \int_0^{\pi/2} \frac{\sin^2 x}{\sin x + \cos x}\,dx, \qquad \text{(b)} \int_0^{\pi/2} \frac{1}{(\cos^2 x + 3\sin^2 x)^3}\,dx, \qquad \text{(c)} \int_0^{\pi/2} \cos^{-1}\left(\frac{\cos x}{1 + 2\cos x}\right) dx$$

Additional Questions

6.33. Occasionally you will see someone try to adjust their data in an attempt to use Simpson’s rule. To explain, the goal is to evaluate the integral
$$\int_{x_{i-1}}^{x_{i+1}} f(x)\,dx.$$
Assume that the value of $f(x)$ is known at $x_{i-1}$ and $x_{i+1}$ but not at $x_i$. The question is, can you use the data to find an approximation for $f(x_i)$ that will enable you to use Simpson’s rule, and in the process get a better result than you would get using the trapezoidal rule?
(a) What approximation of the integral is obtained using the trapezoidal rule?
(b) One possibility for approximating $f(x_i)$ is to use piecewise linear interpolation using the two data points $(x_{i-1}, f_{i-1})$ and $(x_{i+1}, f_{i+1})$. Doing this, and inserting the resulting approximation for $f_i$ into Simpson’s rule, what results? How does this differ from your answer in part (a)?
(c) Suppose one just assumes that there are constants $A$ and $B$ so that $f_i = A f_{i-1} + B f_{i+1}$. With this, Simpson’s rule reduces to an integration rule of the form
$$\int_{x_{i-1}}^{x_{i+1}} f(x)\,dx = w_1 f_{i-1} + w_2 f_{i+1}.$$
What do $w_1$ and $w_2$ have to be to maximize the precision? How does this differ from your answer in part (a)?

Chapter 7

Initial Value Problems

In this chapter, we derive numerical methods to solve the first-order differential equation
$$\frac{dy}{dt} = f(t, y), \quad \text{for } 0 < t, \tag{7.1}$$
where
$$y(0) = \alpha. \tag{7.2}$$
This is known as an initial value problem (IVP), and it consists of the differential equation (7.1) along with the initial condition in (7.2). Numerical methods for solving this problem are first derived for the case when there is one differential equation. Afterwards, the methods are extended to problems involving multiple equations.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_7

7.1 Solution of an IVP

Example: Radioactive Decay

According to the law of radioactive decay, the mass of a radioactive substance decays at a rate that is proportional to the amount present. To express this in mathematical terms, let $y(t)$ designate the amount present at time $t$. In this case, the decay law can be expressed as
$$\frac{dy}{dt} = -ry, \quad \text{for } 0 < t. \tag{7.3}$$
If we start out with an amount $\alpha$ of the substance then the corresponding initial condition is
$$y(0) = \alpha. \tag{7.4}$$
In the decay law (7.3), $r$ is a proportionality constant and it is assumed to be positive. Because $f(t, y) = -ry$ is a linear function of $y$ the IVP is said to be linear. One consequence of this is that it is possible to find the solution using an integrating factor or using the method of separation of variables. What one finds is
$$y(t) = \alpha e^{-rt}. \tag{7.5}$$
Consequently, the solution starts at $\alpha$ and decays exponentially to zero as time increases.

To put a slightly different spin on the solution in the above example, recall that $y = Y$ is a steady-state solution of (7.1) if it is constant and satisfies $f(t, Y) = 0$. Also, a steady-state $Y$ is stable if any solution that starts near $Y$ stays near it. If, in addition, initial conditions starting near $Y$ actually result in the solution converging to $Y$ as $t \to \infty$, then $y = Y$ is said to be asymptotically stable. With the solution in (7.5) we conclude that $y = 0$ is an asymptotically stable steady-state solution for (7.3). For those a bit rusty about solving differential equations, and determining the stability of a steady-state solution, you might want to review the material in Holmes [2020b].

Example: Logistic Equation

In the study of populations limited by the supply of food, one obtains the logistic equation, which is
$$\frac{dy}{dt} = ry(1 - y), \quad \text{for } 0 < t, \tag{7.6}$$
where
$$y(0) = \alpha. \tag{7.7}$$
It is assumed that $r$ and $\alpha$ are positive. For this problem, $f(t, y) = ry - ry^2$ is a nonlinear function of $y$ and therefore the IVP is nonlinear. It is possible to find the solution using separation of variables, and the result is
$$y(t) = \frac{\alpha}{\alpha + (1 - \alpha)e^{-rt}}. \tag{7.8}$$
The steady-state solutions for this equation are the constants that satisfy $ry(1 - y) = 0$, which means that $y = 1$ or $y = 0$. Because $r > 0$, the solution in (7.8) approaches $y = 1$ as $t$ increases. Consequently, $y = 1$ is an asymptotically stable steady-state solution, whereas $y = 0$ is unstable.

A numerical method for solving a differential equation should usually have the same steady-state solutions and they should have the same stability properties as for the differential equation. There are some qualifications to this requirement and this will be discussed later.

In this chapter it is assumed that the IVP is well-posed, which means that it has a unique solution. This is guaranteed, for example, if $f(t, y)$ and $f_y(t, y)$ are continuous for $0 \le t < \infty$ and $-\infty < y < \infty$. The formal statement of the requirements, and the conclusions, are contained in what is known as the Picard-Lindelöf theorem. This can be found in most upper division textbooks on differential equations.
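The two closed-form solutions of this section, (7.5) and (7.8), can be evaluated directly to see the steady-state behavior described above. The following is a minimal Python sketch, not part of the original text; the function names and the sample values of r and α are illustrative assumptions.

```python
import math


def decay_solution(t, alpha, r):
    """Exact solution (7.5) of dy/dt = -r*y with y(0) = alpha."""
    return alpha * math.exp(-r * t)


def logistic_solution(t, alpha, r):
    """Exact solution (7.8) of dy/dt = r*y*(1 - y) with y(0) = alpha."""
    return alpha / (alpha + (1.0 - alpha) * math.exp(-r * t))


if __name__ == "__main__":
    r = 1.5                          # assumed rate constant (must be positive)
    times = [0.0, 1.0, 5.0, 20.0]    # assumed sample times
    for alpha in (0.01, 0.5, 2.0):   # assumed positive initial values
        decay = [decay_solution(t, alpha, r) for t in times]
        logistic = [logistic_solution(t, alpha, r) for t in times]
        print(f"alpha = {alpha}")
        print("  decay:   ", " ".join(f"{v:.6f}" for v in decay))
        print("  logistic:", " ".join(f"{v:.6f}" for v in logistic))
```

For each positive starting value, the decay values should head toward the steady state y = 0 and the logistic values toward y = 1, including the case that starts very close to the unstable steady state y = 0.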

7.2 Numerical Differentiation

There are many ways to solve an IVP like the one in (7.1), (7.2) and we begin with the direct approach. This means we need to be able to compute the derivative. So, we start by deriving formulas for calculating derivatives. Just as with integration, the definition of a derivative from calculus is the starting point for our approximations. Although there are different ways to state the definition, the one used here is the following:
$$y'(t) = \lim_{k \to 0} \frac{y(t + k) - y(t)}{k}.$$
Consequently, for small values of $k$, we have the approximation
$$y'(t) \approx \frac{y(t + k) - y(t)}{k}.$$
As will be explained in more detail later, we will compute the solution of the IVP at equally spaced time points $t_0, t_1, t_2, \ldots, t_M$, where $t_j = jk$, for $j = 0, 1, 2, \ldots, M$. With this, the above approximation can be written as
$$y'(t_j) \approx \frac{y(t_j + k) - y(t_j)}{k}.$$
To be useful as a computing tool, it is essential to know the error for this approximation. For this we rely, as usual, on Taylor’s theorem (see Appendix A). Given that
$$y(t_j + k) = y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \frac{1}{6}k^3 y'''(t_j) + \cdots,$$
then
$$\frac{y(t_j + k) - y(t_j)}{k} = \frac{\left[y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \frac{1}{6}k^3 y'''(t_j) + \cdots\right] - y(t_j)}{k} = y'(t_j) + \frac{1}{2}k y''(t_j) + \frac{1}{6}k^2 y'''(t_j) + \cdots.$$


In other words,
$$y'(t_j) = \frac{y(t_{j+1}) - y(t_j)}{k} + \tau_j, \tag{7.9}$$
where
$$\tau_j = -\frac{1}{2}k y''(t_j) - \frac{1}{6}k^2 y'''(t_j) + \cdots \tag{7.10}$$
is the truncation error or the approximation error. Because $\tau_j = O(k)$, the resulting approximation is said to be first order. The expression in (7.9) is listed in Table 7.1 as a forward difference formula. It is forward because it uses information at a future time, $t_{j+1}$, to construct the approximation. Also the formula for $\tau_j$ in the table looks different than the one given in (7.10). They are equivalent in the sense that the one in the table comes from the remainder term in Taylor’s theorem, while (7.10) is the series expansion for $\tau_j$. More about the connections between the series and remainder versions of Taylor’s theorem can be found in Section A.2. However, in this chapter, except for Exercise 7.20, the only thing that is important is that $\tau_j = O(k)$.

A criticism of (7.9) is that it is only first order. We are interested in having higher order approximations, which means $\tau_j = O(k^2)$ or possibly even $\tau_j = O(k^4)$. To do this, note that (7.9) uses $t_j$ and $t_{j+1}$ to obtain the approximation. In what follows, we will investigate how it is possible to find approximations that use other time points. For example, we will look to see if it is possible to find an approximation that uses $t_{j-1}$ and $t_{j+1}$. One outcome of this is that we will find higher order approximations for the derivative, as well as approximations for the other derivatives of $y(t)$.

Example 7.1: Using $t_{j+2}$, $t_{j+1}$, and $t_j$

The assumption is that we can find $A$, $B$, and $C$ so that
$$y'(t_j) \approx A y(t_{j+2}) + B y(t_{j+1}) + C y(t_j). \tag{7.11}$$
We will use Taylor’s theorem to relate the values at $t_{j+2}$ and $t_{j+1}$ to the value at $t_j$. So,
$$y(t_{j+2}) = y(t_j + 2k) = y(t_j) + 2k y'(t_j) + 2k^2 y''(t_j) + \frac{4}{3}k^3 y'''(t_j) + \cdots.$$
Also, we know that
$$y(t_{j+1}) = y(t_j + k) = y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \frac{1}{6}k^3 y'''(t_j) + \cdots.$$
Combining these formulas,


$$A y(t_{j+2}) + B y(t_{j+1}) + C y(t_j) = A\left[y(t_j) + 2k y'(t_j) + 2k^2 y''(t_j) + \frac{4}{3}k^3 y'''(t_j) + \cdots\right] + B\left[y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \frac{1}{6}k^3 y'''(t_j) + \cdots\right] + C y(t_j).$$
This means, we want
$$y'(t_j) \approx (A + B + C) y(t_j) + (2A + B) k y'(t_j) + \frac{1}{2}(4A + B) k^2 y''(t_j) + \frac{1}{6}(8A + B) k^3 y'''(t_j) + \cdots. \tag{7.12}$$
This is supposed to hold for any (smooth) function $y(t)$. One approach to finding $A$, $B$, and $C$ is to substitute $y = 1$, then $y = t$, etc. into the above expression and determine the equations that the coefficients satisfy. An equivalent method is to equate the coefficients of the $y$, $y'$, $y''$, etc. terms on the two sides of the equation. Using the latter method, we have the following:
$$y: \quad A + B + C = 0,$$
$$y': \quad (2A + B)k = 1,$$
$$y'': \quad \frac{1}{2}(4A + B)k^2 = 0.$$
Solving these equations one finds that $A = -1/(2k)$, $B = 2/k$, and $C = -3/(2k)$. With this, (7.12) takes the form $A y(t_{j+2}) + B y(t_{j+1}) + C y(t_j) = y'(t_j) - \frac{1}{3}k^2 y'''(t_j) + \cdots$. Therefore,
$$y'(t_j) = \frac{-y(t_{j+2}) + 4y(t_{j+1}) - 3y(t_j)}{2k} + \tau_j, \tag{7.13}$$
where $\tau_j = \frac{1}{3}k^2 y'''(t_j) + \cdots$. This is listed in Table 7.1 as a one-sided difference formula, because of where the time points are located. It is also an example of a forward difference formula.

Example 7.2: Using $t_{j+1}$ and $t_{j-1}$

This means, we want
$$y'(t_j) \approx A y(t_{j+1}) + B y(t_{j-1}) \tag{7.14}$$
$$= A\left[y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \frac{1}{6}k^3 y'''(t_j) + \cdots\right] + B\left[y(t_j) - k y'(t_j) + \frac{1}{2}k^2 y''(t_j) - \frac{1}{6}k^3 y'''(t_j) + \cdots\right].$$
Collecting like terms gives


$$y'(t_j) \approx (A + B) y(t_j) + (A - B) k y'(t_j) + \frac{1}{2}(A + B) k^2 y''(t_j) + \frac{1}{6}(A - B) k^3 y'''(t_j) + \cdots.$$
Equating the left and right sides we conclude that
$$y: \quad A + B = 0,$$
$$y': \quad (A - B)k = 1.$$
Solving these equations one finds that $A = 1/(2k)$ and $B = -1/(2k)$. With this, $A y(t_{j+1}) + B y(t_{j-1}) = y'(t_j) + \frac{1}{6}k^2 y'''(t_j) + \cdots$. Therefore, (7.14) can be written as
$$y'(t_j) = \frac{y(t_{j+1}) - y(t_{j-1})}{2k} + \tau_j, \tag{7.15}$$
where
$$\tau_j = -\frac{1}{6}k^2 y'''(t_j) + \cdots. \tag{7.16}$$
This is listed in Table 7.1 as a centered difference formula. It is centered because it uses time points that are symmetrically placed around $t_j$. Also, it produces a second-order approximation because $\tau_j = O(k^2)$.

Example 7.3

Suppose $y(t) = \sqrt{t}$, and we use the above formulas to calculate $y'(1)$. Taking $t_j = 1$, $t_{j+1} = 1 + k$, and $t_{j-1} = 1 - k$ then the forward approximation coming from (7.9) is
$$y'(1) \approx \frac{y(1 + k) - y(1)}{k} = \frac{\sqrt{1 + k} - 1}{k}, \tag{7.17}$$
and the centered approximation coming from (7.15) is
$$y'(1) \approx \frac{y(1 + k) - y(1 - k)}{2k} = \frac{\sqrt{1 + k} - \sqrt{1 - k}}{2k}. \tag{7.18}$$

7.2

Numerical Differentiation

Error

10 10

0

-4

10-8 10

309

Forward Centered

-12

10

-18

10

-15

10

-12

10

-9

10

-6

10

-3

10

0

Stepsize (k) Figure 7.1 Error when using the forward approximation (7.17), and the centered approx√ imation (7.18), to compute the value of y  (1) when y(t) = t.

decreases by a factor of 106 . However, for both approximations, there is a value for k where the error starts to show an overall increase, albeit a noisy increase, for smaller values of k. This is due to round-off, and is very much the same problem considered in Example 1.3. There are ways to avoid this, and one possibility is to rewrite the formula. As an example, by rationalizing the numerator, one can rewrite (7.18) as y  (1) ≈ √

1 √ . 1+k+ 1−k

Unfortunately, it is not possible to do this for all functions. Another possibility is to use a complex-valued time step, something called a complex Taylor series expansion, and this will be explained in Section 7.7. However, in the end, with numerical differentiation, we are stuck with the situation shown in Figure 7.1. This gives rise to what is known as the optimal step size, which is the step size where the round-off error starts to become noticeable. It is possible, using the ideas discussed in Section 1.4, to derive formulas for the optimal step size, but this is not pursued here. 
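The behavior described in Example 7.3 can be reproduced with a few lines of code. The following is a minimal sketch, not the program used to produce Figure 7.1; the function names `forward_diff` and `centered_diff` and the particular step sizes are assumptions made for illustration.

```python
import math

def forward_diff(y, t, k):
    # Forward difference, as in (7.17); truncation error is O(k)
    return (y(t + k) - y(t)) / k

def centered_diff(y, t, k):
    # Centered difference, as in (7.18); truncation error is O(k^2)
    return (y(t + k) - y(t - k)) / (2 * k)

exact = 0.5  # y'(1) when y(t) = sqrt(t)

rows = []
for p in range(0, 16, 3):          # k = 1, 1e-3, 1e-6, ..., 1e-15
    k = 10.0 ** (-p)
    err_forward = abs(exact - forward_diff(math.sqrt, 1.0, k))
    err_centered = abs(exact - centered_diff(math.sqrt, 1.0, k))
    rows.append((k, err_forward, err_centered))

for k, ef, ec in rows:
    print(f"k = {k:.0e}   forward error = {ef:.2e}   centered error = {ec:.2e}")
```

For moderate step sizes the printed errors should track the leading truncation terms, roughly $k/8$ for the forward formula and $k^2/16$ for the centered one (these follow from $y''(1) = -1/4$ and $y'''(1) = 3/8$). For the smallest step sizes round-off takes over and the errors stop decreasing.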

7.2.1 Higher Derivatives

It is relatively easy to use the procedure to derive approximations for higher derivatives. For example, if you intend on using $t_{j+1}$, $t_j$, and $t_{j-1}$ to obtain an approximation for $y''(t_j)$, then

$$y''(t_j) \approx A\,y(t_{j+1}) + B\,y(t_j) + C\,y(t_{j-1})$$

$$= (A + B + C)\,y(t_j) + (A - C)\,k\,y'(t_j) + \tfrac{1}{2}(A + C)\,k^2 y''(t_j) + \tfrac{1}{6}(A - C)\,k^3 y'''(t_j) + \tfrac{1}{24}(A + C)\,k^4 y''''(t_j) + \cdots. \tag{7.19}$$

Equating the left and right sides we conclude that

$$y: \quad A + B + C = 0,$$

$$y': \quad A - C = 0,$$

$$y'': \quad \tfrac{1}{2}(A + C)k^2 = 1.$$

Solving these, one finds that $A = C = 1/k^2$ and $B = -2A = -2/k^2$. Consequently, in (7.19), $A - C = 0$ while $(A + C)k^4 = 2k^2$. The result is that

$$y''(t_j) = \frac{y(t_{j+1}) - 2y(t_j) + y(t_{j-1})}{k^2} + \tau_j, \tag{7.20}$$

where

$$\tau_j = -\frac{1}{12} k^2 y''''(t_j) + \cdots.$$

This is listed in Table 7.1 as a centered difference approximation.
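The matching procedure used in these examples (expand in Taylor series, collect terms, equate coefficients) amounts to solving a small linear system. The sketch below automates it with exact rational arithmetic. The function name `fd_weights` and the convention of measuring point locations in units of $k$ are assumptions introduced for illustration; the returned weights must be divided by $k^d$, where $d$ is the order of the derivative.

```python
from fractions import Fraction
from math import factorial

def fd_weights(offsets, order):
    """Weights w_i such that sum_i w_i * y(t_j + s_i*k) / k**order
    approximates the derivative of the given order at t_j.
    offsets: the integers s_i locating the points, in units of k."""
    n = len(offsets)
    # Row m matches the coefficient of y^(m)(t_j) in the Taylor expansion:
    #   sum_i w_i * s_i**m / m! = 1 if m == order, else 0
    A = [[Fraction(s) ** m / factorial(m) for s in offsets]
         + [Fraction(int(m == order))] for m in range(n)]
    # Gauss-Jordan elimination on the augmented matrix
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]

print(fd_weights([1, -1], 1))      # points t_{j+1}, t_{j-1}: first derivative
print(fd_weights([1, 0, -1], 2))   # points t_{j+1}, t_j, t_{j-1}: second derivative
print(fd_weights([0, 1, 2], 1))    # points t_j, t_{j+1}, t_{j+2}: one-sided first derivative
```

The first call corresponds to $A$ and $B$ in Example 7.2 (after dividing by $k$), and the second to $A$, $B$, $C$ in (7.19) (after dividing by $k^2$).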

7.2.2 Interpolation

Interpolation methods were derived in Chapter 5 to approximate a function $f(x)$. It is tempting to think that you can obtain an approximation for $f'(x)$ by simply differentiating the approximation for $f(x)$. In certain cases this is possible. For example, the centered difference approximation can be derived from the piecewise quadratic interpolation formula (see Exercise 5.37). Also, as stated in Theorem 5.5, you can do this with the clamped cubic spline. However, in general, the drawback with this approach is that the corresponding error formulas do not, necessarily, come from differentiating the interpolation error given in Theorem 5.2. Why this is the case, and how to determine the appropriate error formulas, is discussed in Süli and Mayers [2003].

| Type | Difference Approximation | Truncation Term |
|---|---|---|
| Forward | $y'(t_j) \approx \dfrac{y(t_{j+1}) - y(t_j)}{k}$ | $\tau_j = -\frac{1}{2} k\, y''(\eta_j)$ |
| Backward | $y'(t_j) \approx \dfrac{y(t_j) - y(t_{j-1})}{k}$ | $\tau_j = \frac{1}{2} k\, y''(\eta_j)$ |
| Centered | $y'(t_j) \approx \dfrac{y(t_{j+1}) - y(t_{j-1})}{2k}$ | $\tau_j = -\frac{1}{6} k^2 y'''(\eta_j)$ |
| One-sided | $y'(t_j) \approx \dfrac{-y(t_{j+2}) + 4y(t_{j+1}) - 3y(t_j)}{2k}$ | $\tau_j = \frac{1}{3} k^2 y'''(\eta_j)$ |
| One-sided | $y'(t_j) \approx \dfrac{3y(t_j) - 4y(t_{j-1}) + y(t_{j-2})}{2k}$ | $\tau_j = \frac{1}{3} k^2 y'''(\eta_j)$ |
| Centered | $y''(t_j) \approx \dfrac{y(t_{j+1}) - 2y(t_j) + y(t_{j-1})}{k^2}$ | $\tau_j = -\frac{1}{12} k^2 y''''(\eta_j)$ |

Table 7.1 Numerical differentiation formulas when using equally spaced points with $k = t_{j+1} - t_j$. The point $\eta_j$ is located between the left- and rightmost points used in the formula. Also, $\tau_j$ is written in remainder form, versus series form, and how these forms are connected is explained in Section A.2.

7.3 IVP Methods Using Numerical Differentiation

The task we now undertake is to approximate the differential equation, and its accompanying initial condition, with a problem we can solve using a computer. To explain how this is done, consider the problem of solving

$$\frac{dy}{dt} = f(t, y), \quad \text{for } 0 < t, \tag{7.21}$$

where

$$y(0) = \alpha. \tag{7.22}$$

The function f (t, y) is assumed to be given. For example, with radioactive decay f (t, y) = −ry and for the logistic problem f (t, y) = ry(1 − y). The question is, can we accurately compute the solution directly from the problem without first finding an analytical solution? As it turns out, most realistic mathematical models of physical and biological systems cannot be solved by hand, so having the ability to find accurate numerical solutions directly from the original equations is an invaluable tool.

7.3.1 The Five Steps

To explain how we will construct a numerical algorithm that can be used to solve (7.21) it should be noted that the variables in this problem, $t$ and $y$, are continuous. Our objective is to replace these with discrete variables so that the resulting problem is algebraic and therefore solvable using standard numerical methods. Great care must be taken in making this replacement, because the computed solution must accurately approximate the solution of the original IVP. The approach we take proceeds in a sequence of five steps.

Step 1: The Grid

We first introduce the time points at which we will compute the solution. These points are labeled sequentially as $t_0, t_1, t_2, \ldots, t_M$ and a schematic drawing indicating their location along the time axis is shown in Figure 7.2. We confine our attention to a uniform grid with step size $k$, so the formula for the time points is

$$t_j = jk, \quad \text{for } j = 0, 1, 2, \ldots, M. \tag{7.23}$$

Assuming that the time interval is $0 \le t \le T$, then we require $t_M = T$. Therefore, $k$ and $M$ are connected through the equation

$$k = \frac{T}{M}. \tag{7.24}$$

Figure 7.2 Grid system used to derive a finite difference approximation of the initial value problem. The points are equally spaced and $t_M = T$.

Step 2: Evaluation

Evaluate the differential equation at the time point $t = t_j$ to obtain

$$y'(t_j) = f(t_j, y(t_j)). \tag{7.25}$$

Step 3: Finite Difference Formula

Replace the derivative term in Step 2 with a finite difference formula using the values of $y$ at one or more of the grid points in a neighborhood of $t_j$. This is where things get a bit interesting, because numerous choices can be made, a few of which are listed in Table 7.1. Different choices result in different numerical procedures, and as it turns out, not all choices will work. To start, we take the first entry listed in the table, which means we use the following expression for the first derivative:

$$y'(t_j) = \frac{y(t_{j+1}) - y(t_j)}{k} + \tau_j, \tag{7.26}$$

where

$$\tau_j = -\frac{k}{2}\, y''(\eta_j) \tag{7.27}$$

and $\eta_j$ is a point between $t_j$ and $t_{j+1}$. Introducing this into (7.25) we obtain

$$\frac{y(t_{j+1}) - y(t_j)}{k} + \tau_j = f(t_j, y(t_j)), \tag{7.28}$$

or equivalently,

$$y(t_{j+1}) = y(t_j) + k f(t_j, y(t_j)) - k\tau_j. \tag{7.29}$$

An important point to make here concerns the term $\tau_j$. As it appears in (7.28), $\tau_j$ represents how well we have approximated the differential equation. For this reason it is the truncation error for the method, and from (7.27) it is seen that it is $O(k)$. It is essential that whatever approximations we use, the truncation error goes to zero as $k$ goes to zero. This means that, at least in theory, we can approximate the original problem as accurately as we wish by making the time step $k$ small enough. It is said in this case that the approximation is consistent. Unfortunately, as will be explained later, consistency is not enough to guarantee an accurate numerical solution.

Step 4: Finite Difference Approximation

Drop the term containing the truncation error. This is the step where we go from an exact problem to one that is, hopefully, an accurate approximation of the original. After dropping the $\tau_j$ term in (7.29) the resulting equation is

$$y_{j+1} = y_j + k f(t_j, y_j), \quad \text{for } j = 0, 1, 2, \ldots, M - 1. \tag{7.30}$$

From the initial condition (7.22) we have that the starting value is

$$y_0 = \alpha. \tag{7.31}$$

The finite difference equation (7.30) is known as Euler's method for solving (7.21). It is a recursive algorithm in which one starts with $j = 0$ and then uses (7.30) to determine the solution at $j = 1$, then $j = 2$, then $j = 3$, etc. Because (7.30) gives the unknown $y_{j+1}$ explicitly in terms of known quantities, it is an explicit method.

Example 7.4

Let's see how well Euler's method does with the logistic equation (7.6). Specifically, suppose the IVP is

$$\frac{dy}{dt} = 10y(1 - y), \quad \text{for } 0 < t, \tag{7.32}$$

where

$$y(0) = 0.01. \tag{7.33}$$

We will use the Euler method to calculate the solution for $0 \le t \le 1$. In this case, $T = 1$ and so, from (7.24),

$$k = \frac{1}{M}. \tag{7.34}$$

For this example, the finite difference equation in (7.30) takes the form

$$y_{j+1} = y_j + 10k\, y_j (1 - y_j), \quad \text{for } j = 0, 1, 2, \ldots, M - 1. \tag{7.35}$$

Taking $M = 6$, so $k = \frac{1}{6}$, the values obtained using the Euler method are given in Table 7.2. In this table, $\bar{y}_j$ is not the floating-point value of $y_j$. Rather, it is the floating-point value you get from recursively evaluating (7.35). So, for example, $\bar{y}_1$ is the floating-point value computed using $\bar{y}_0 + 10k\bar{y}_0(1 - \bar{y}_0)$, $\bar{y}_2$ is the floating-point value computed using $\bar{y}_1 + 10k\bar{y}_1(1 - \bar{y}_1)$, etc.

For a more graphical picture of the situation, the exact solution, given in (7.8), and the computed solutions are shown in Figure 7.3 using successively smaller values of the time step $k$ or, equivalently, larger values of $M$. It is seen that the numerical solution for $M = 4$ is not so good, but the situation improves considerably as more time points are used.

It appears in Figure 7.3 that if we keep increasing the number of time points, the numerical solution converges to the exact solution. To check on this, let $\mathbf{y} = (y(0), y(t_1), y(t_2), \cdots, y(t_M))^T$ and let $\mathbf{y}_c = (\bar{y}_0, \bar{y}_1, \bar{y}_2, \cdots, \bar{y}_M)^T$ be the vector of computed values. The resulting error $\|\mathbf{y} - \mathbf{y}_c\|_\infty$ is given in Table 7.3 as a function of the number of points $M$. It is seen that if $M$ is increased by a factor of 10, the error is reduced by the same factor. In other words, the error decreases as $O(k)$. As explained below, it is not a coincidence that this is the order of the truncation error for Euler's method.
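The recursion (7.35) is short enough to try directly. The following is a minimal sketch of Euler's method applied to this example, assuming the closed-form solution $y(t) = 1/(1 + 99e^{-10t})$, which is the form consistent with the exact entries of Table 7.2. The function names `euler`, `logistic_rhs`, `logistic_exact`, and `max_error` are hypothetical helpers, not code from the text.

```python
import math

def euler(f, y0, T, M):
    # Euler's method (7.30) on the uniform grid t_j = j*k, with k = T/M
    k = T / M
    y = [y0]
    for j in range(M):
        y.append(y[j] + k * f(j * k, y[j]))
    return y

def logistic_rhs(t, y):
    # Right-hand side of (7.32)
    return 10.0 * y * (1.0 - y)

def logistic_exact(t):
    # Assumed closed form of the solution of (7.32) with y(0) = 0.01
    return 1.0 / (1.0 + 99.0 * math.exp(-10.0 * t))

def max_error(M, T=1.0):
    # Infinity-norm error over all grid points
    y = euler(logistic_rhs, 0.01, T, M)
    return max(abs(logistic_exact(j * T / M) - y[j]) for j in range(M + 1))

print(euler(logistic_rhs, 0.01, 1.0, 6)[:4])   # first few computed values for M = 6
for M in (10, 100, 1000, 10000):
    print(M, max_error(M))
```

The first printed list can be compared against the computed-value column of Table 7.2, and the loop against the first rows of Table 7.3.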

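| $j$ | $t_j$ | $y(t_j)$ | $y_j$ | $\bar{y}_j$ | $y(t_j) - y_j$ | $y_j - \bar{y}_j$ |
|---|---|---|---|---|---|---|
| 0 | 0 | $\frac{1}{100}$ | $\frac{1}{100}$ | 0.01 | 0 | 0 |
| 1 | $\frac{1}{6}$ | $\left(1 + 99e^{-5/3}\right)^{-1}$ | $\frac{53}{2000}$ | 2.6500e−02 | 2.4e−02 | −3.8e−18 |
| 2 | $\frac{1}{3}$ | $\left(1 + 99e^{-10/3}\right)^{-1}$ | $\frac{55597}{800000}$ | 6.9496e−02 | 1.5e−01 | −1.4e−17 |
| 3 | $\frac{1}{2}$ | $\left(1 + 99e^{-5}\right)^{-1}$ | $\frac{68073133591}{384000000000}$ | 1.7727e−01 | 4.2e−01 | −2.8e−17 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

Table 7.2 The first few time steps in solving the logistic equation (7.32) using the Euler method. Note that $y(t_j)$ is the exact solution (7.8) of the logistic equation, $y_j$ is the exact value from (7.35), and $\bar{y}_j$ is the computed value from (7.35).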

Figure 7.3 Solution of the logistic equation (7.32) using the Euler method (7.35) for three values of $M$ ($M = 4$, $M = 16$, $M = 64$). Also shown is the exact solution. The symbols are the computed values, and the dashed lines are drawn by the plotting program simply to connect the values. (Horizontal axis: t-axis; vertical axis: Solution.)

One of the more significant criticisms of Euler's method is evident in Table 7.3. Namely, to obtain a modest error of $4 \times 10^{-5}$ requires using 100,000 points. Looking at the simple sigmoidal shape of the solution curve in Figure 7.3 you would think the number would be much smaller. For example, using piecewise linear interpolation this error is achieved using only about 180 points, and with a cubic spline you only need about 50 points. The reason Euler does so poorly is that it is a first-order method, which means the truncation error is $O(k)$. We will remedy this situation later in the chapter when higher order methods are derived.

| $M$ | $k$ | Error |
|---|---|---|
| 10 | 1.0e−01 | 3.3e−01 |
| 100 | 1.0e−02 | 4.0e−02 |
| 1,000 | 1.0e−03 | 4.0e−03 |
| 10,000 | 1.0e−04 | 4.0e−04 |
| 100,000 | 1.0e−05 | 4.0e−05 |
| 1,000,000 | 1.0e−06 | 4.0e−06 |
| 10,000,000 | 1.0e−07 | 4.0e−07 |

Table 7.3 The error $\|\mathbf{y} - \mathbf{y}_c\|_\infty$ when using the Euler method to solve (7.32).

More comment is needed about the observation made in the above example that the error reduction seen in Table 7.3 is determined by the truncation error. The role of the truncation error in determining the accuracy of the computed solution is evident in (7.29). This is because Euler's method is obtained by dropping the $O(k\tau_j)$ term in (7.29). So, the error in the approximation $y_1 \approx y(t_1)$ is $O(k\tau_1)$. Similarly, at the next step, the error in the approximation $y_2 \approx y(t_2)$ contains a contribution that is $O(k\tau_2)$. However, because Euler's formula states that $y_2 = y_1 + k f(t_1, y_1)$, the $O(k\tau_1)$ error for $y_1$ also contributes to the error for $y_2$. This accumulation of error continues with each time step. So, at $t = T$ the error contains a term of the form $O(k\tau_1) + O(k\tau_2) + \cdots + O(k\tau_M)$. Since $\tau_j = O(k)$, the error is on the order of $M\,O(k^2)$. Because $M = 1/k$, the result is an error that is $M\,O(k^2) = O(k)$. This is why the error reduction seen in Table 7.3 is determined by the truncation error. It needs to be pointed out that the formal proof of this statement is fairly involved, and includes a requirement that the method satisfies a stability condition. What it means to be stable is considered below, and its connection with what is called a convergent method is discussed in Section 7.3.3.

Step 5: Stability

To guarantee that a finite difference method works, two requirements must be satisfied. One is that the approximation is consistent. This was mentioned in Step 3 and will be discussed again later. The second requirement is that it is stable, which means, roughly, that the approximation produces a solution with properties similar to those of the exact solution. As an example, the simplest IVP is $y' = 0$, where $y(0) = 1$, and the exact solution is $y = 1$. The requirement for what is known as zero-stability is that the numerical method applied to $y' = 0$ produces a solution that is at least bounded. There is a stronger form of stability, which is more useful for many of the problems which arise in applications, and it is known as A-stability. This is determined by using the method to solve a test problem, which is

$$\frac{dy}{dt} = -ry, \tag{7.36}$$

where

$$y(0) = 1. \tag{7.37}$$

The exact solution is $y(t) = e^{-rt}$, and, assuming that $r > 0$, this function approaches zero as $t \to \infty$. It is required that the solution $y_j$ of the finite difference problem also approaches zero, and this is the basis for the following definition.

Definition of A-Stable. If a numerical method, when applied to (7.36), (7.37), produces a solution with the property that $y_j \to 0$ as $j \to \infty$, irrespective of the positive values of $r$ and $k$, then the method is said to be A-stable. If the zero limit occurs only when $k$ is small, then the method is conditionally A-stable. Otherwise, the method is unstable.

In the above definition, when stating that $k$ is small, it is meant that there is a $\bar{k}$ so the condition holds for $0 < k < \bar{k}$.


As will become clear as we derive methods to solve an IVP, the following result is very useful for determining stability.

Theorem 7.1 If $y_0 = 1$ then the solution of

$$y_{j+1} = \kappa y_j, \quad \text{for } j = 0, 1, 2, \cdots \tag{7.38}$$

is $y_j = \kappa^j$. Therefore, $y_j \to 0$ as $j \to \infty$ if, and only if, $\kappa$ satisfies $|\kappa| < 1$.

The constant $\kappa$ in this theorem is called the amplification factor for the equation. The proof of the above theorem involves simply using (7.38) to show that $y_1 = \kappa$, $y_2 = \kappa y_1 = \kappa^2$, $y_3 = \kappa y_2 = \kappa^3$, etc.

Example 7.5: Stability of Euler's Method

Euler's method (7.30), when applied to the equation in (7.36), reduces to $y_{j+1} = (1 - rk)y_j$. This can be written as $y_{j+1} = \kappa y_j$, where $\kappa = 1 - rk$ is the amplification factor for Euler's method. According to the above theorem, Euler's method is A-stable if $|1 - rk| < 1$, or equivalently, $-1 < 1 - rk < 1$. From this we conclude that the step size must satisfy the condition $k < 2/r$. Therefore, the Euler method is conditionally A-stable. To show what happens when you do, or don't, satisfy the stability condition, the resulting solutions for the logistic equation are shown in Figure 7.4. It is worth noting, after looking at the solution for $0 < t < 1$, that satisfying the stability condition does not necessarily mean that the error is particularly small.

It might be puzzling why the radioactive decay equation (7.36) is the arbiter for stability. The reason comes from the desire that if the differential equation has an asymptotically stable steady state $y = a$, then $y = a$ is also an asymptotically stable steady state for the approximating difference equation. Finding a condition that ensures this for the general problem is not particularly easy, but it is possible to derive a condition for differential equations of the form $y' = g(y)$. Writing $y(t) = a + Y(t)$, and using Taylor's theorem, one can show that the equation for $Y$ is approximately $Y' = -rY$, where $r = -g'(a)$. In other words, near $y = a$, and assuming that $r \ne 0$, the original differential equation can be approximated with the radioactive decay equation.

The requirement of A-stability ensures that the numerical method produces a solution that, like the actual solution, approaches the steady state. However, there are situations where A-stability is either not enough, or not relevant. This is why other types of stability have been defined, and these include zero-stability, L-stability, B-stability, and BN-stability. Those interested in this should consult Butcher [2008]. Because of this, it needs to be pointed out that when stating that a method is unstable in this textbook, it is referring to A-stability.
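The stability condition $k < 2/r$ from Example 7.5 can be seen directly on the test problem (7.36), (7.37). The sketch below iterates $y_{j+1} = (1 - rk)y_j$ for one step size on each side of the threshold. The function name `euler_decay` and the particular values of $r$ and $k$ are assumptions chosen for illustration.

```python
def euler_decay(r, k, steps):
    # Euler's method applied to y' = -r*y, y(0) = 1, i.e. y_{j+1} = kappa * y_j
    kappa = 1.0 - r * k            # amplification factor for Euler's method
    y = [1.0]
    for _ in range(steps):
        y.append(kappa * y[-1])
    return y

r = 10.0
stable = euler_decay(r, 0.05, 40)     # kr = 0.5 < 2, so |kappa| < 1
unstable = euler_decay(r, 0.3, 40)    # kr = 3   > 2, so |kappa| > 1

print("kr = 0.5:", stable[-1])
print("kr = 3  :", unstable[-1])
```

With $kr = 0.5$ the computed values decay toward zero, as the exact solution does. With $kr = 3$ they alternate in sign and grow in magnitude.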

Figure 7.4 Solution of the logistic equation (7.32) using the Euler method (7.35) when $kr = 3$ (unstable) and when $kr = 1$ (stable). Also shown, with the red curve, is the exact solution. (Two panels, one for each value of $kr$; horizontal axis: t-axis; vertical axis: y-axis.)

7.3.2 Additional Difference Methods

The steps used to derive the Euler method can be employed to obtain a host of other IVP solvers. The point in the derivation that separates one method from another is Step 3, where one makes a choice for the difference formula. A few possibilities are given in Table 7.1. It is interesting to see what sort of numerical methods can be derived using these expressions, and a couple of the possibilities are discussed below.

Backward Euler

If one uses the backward difference formula in Table 7.1, then in place of (7.26), we get

$$y'(t_j) = \frac{y(t_j) - y(t_{j-1})}{k} + \tau_j, \tag{7.39}$$

where

$$\tau_j = \frac{k}{2}\, y''(\eta_j). \tag{7.40}$$

Introducing this into (7.25), we obtain

$$y(t_j) - y(t_{j-1}) + k\tau_j = k f(t_j, y(t_j)). \tag{7.41}$$

Dropping the truncation error $\tau_j$, the resulting finite difference approximation is

$$y_j = y_{j-1} + k f(t_j, y_j), \quad \text{for } j = 1, 2, \ldots, M. \tag{7.42}$$

From the initial condition (7.22) we have that the starting value is

$$y_0 = \alpha. \tag{7.43}$$

The difference equation in (7.42) is the backward Euler method. It has the same order of truncation error as the Euler method. However, because (7.42) is an equation for $y_j$, and it contains the term $f(t_j, y_j)$, the method is implicit. This is both good and bad. It is good because it helps make the method A-stable (see below). However, it is bad because it can make finding $y_j$ computationally more expensive than when using an explicit method. This is explained in the example below. As for stability (Step 5), for the radioactive decay equation (7.36) one finds that (7.42) reduces to (7.38), with the amplification factor

$$\kappa = \frac{1}{1 + rk}.$$

Since $|\kappa| < 1$, then from Theorem 7.1, this method is A-stable.

Example 7.6: Explicit versus Implicit

A numerical comparison between backward Euler and some of the other methods we consider is made in Section 7.4 (see Figure 7.6). The objective of this example is to explain the difference between explicit and implicit methods, and this is done by solving

$$\frac{dy}{dt} = 6y(1 - y^3),$$

where $y(0) = 2$. Using Euler's method, which is explicit, the finite difference equation is

$$y_{j+1} = y_j + 6k\, y_j (1 - y_j^3), \quad \text{for } j = 0, 1, 2, \ldots,$$

where $y_0 = 2$. Assuming $k = \frac{1}{3}$, then it is a simple matter to evaluate the above formula to find that $y_1 = -26$. In comparison, using the backward Euler method (7.42), the finite difference equation is

$$y_{j+1} = y_j + 6k\, y_{j+1} (1 - y_{j+1}^3), \quad \text{for } j = 0, 1, 2, \ldots, \tag{7.44}$$

where $y_0 = 2$. Assuming $k = \frac{1}{3}$, then from the above formula we have that $y_1 = 2 + 2y_1(1 - y_1^3)$. It is necessary to solve this nonlinear equation to determine $y_1$. Consequently, a computer program that uses the backward Euler method will have to include a nonlinear equation solver, like Newton's method, to compute $y_{j+1}$ from (7.44). This is true for any implicit method if you are not able to find an analytical solution of the difference equation. As will be explored in some of the exercises, the stopping error used for the solver has the potential to affect the accuracy of the computed solution.

A situation where backward Euler is a better choice than Euler arises when the solution contains a rapid transition, with the solution appearing to almost have a jump discontinuity (something similar to the curve in Figure 5.20). Such problems are said to be "stiff". In these regions the stability condition for Euler is so severe that the step size $k$ must be very small, while backward Euler has no such restriction. For stiff problems in general the key is to use BDF (backward difference formula) methods, and backward Euler is an example of such a method. More information about stiff problems can be found in Hairer and Wanner [2002]. A key component of most stiff solvers is a way to avoid the direct use of Newton's method. One way to do this is to use what are called Newton-Krylov methods, and a review of the recent work on this can be found in Knoll and Keyes [2004] and Loffeld and Tokman [2013].

Leapfrog Method

It is natural to expect that a more accurate approximation of the derivative will improve the accuracy of the computed solution. In looking over Table 7.1, the centered difference formula would appear to be a good choice for such an improvement because the error is $O(k^2)$. Introducing this into (7.21) we obtain

$$y(t_{j+1}) - y(t_{j-1}) + 2k\tau_j = 2k f(t_j, y(t_j)), \tag{7.45}$$

where $\tau_j = O(k^2)$. Dropping the $2k\tau_j$ term, the resulting finite difference approximation is

$$y_{j+1} = y_{j-1} + 2k f(t_j, y_j), \quad \text{for } j = 1, 2, \ldots, M - 1. \tag{7.46}$$

This is known as the leapfrog, or explicit midpoint, method. Because this equation uses information from two previous time steps it is an example of a two-step method. In contrast, both Euler methods use information from a single time step back, so they are one-step methods. What this means is that the initial condition (7.22) is not enough information to get leapfrog started, because we also need $y_1$. This is a relatively minor inconvenience compared to the problem this method has with stability. To explain, applying (7.46) to the radioactive decay equation (7.36) yields $y_{j+1} = y_{j-1} - 2rk\,y_j$. This second-order difference equation can be solved by assuming a solution of the form $y_j = s^j$. From this, it is found that the general solution has the form $y_j = \alpha_0 s_+^j + \alpha_1 s_-^j$, where $s_\pm = -kr \pm \sqrt{1 + k^2 r^2}$ and $\alpha_0$, $\alpha_1$ are arbitrary constants. Because $|s_-| > 1$, it is impossible to find a step size $k$ to satisfy the stability condition. Therefore, the leapfrog method is unstable. It needs to be pointed out here that leapfrog does, just barely, satisfy the zero-stability condition mentioned earlier.
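To see the leapfrog instability on the test problem (7.36), the sketch below computes the two roots $s_\pm$ and runs the recursion $y_{j+1} = y_{j-1} - 2rk\,y_j$. It is only an illustration: the function names `leapfrog_roots` and `leapfrog_decay`, the use of the exact value $e^{-rk}$ for $y_1$, and the values of $r$ and $k$ are assumptions.

```python
import math

def leapfrog_roots(r, k):
    # Roots of s^2 + 2*r*k*s - 1 = 0, i.e. s = -k*r +/- sqrt(1 + k^2 r^2)
    disc = math.sqrt(1.0 + (k * r) ** 2)
    return -k * r + disc, -k * r - disc

def leapfrog_decay(r, k, steps):
    # Leapfrog (7.46) applied to y' = -r*y with y_0 = 1.
    # The second starting value y_1 is taken from the exact solution (an assumption).
    y = [1.0, math.exp(-r * k)]
    for j in range(1, steps):
        y.append(y[j - 1] - 2.0 * r * k * y[j])
    return y

r, k = 1.0, 0.1
s_plus, s_minus = leapfrog_roots(r, k)
print("s+ =", s_plus, "  s- =", s_minus)

y = leapfrog_decay(r, k, 400)
for j in (0, 100, 200, 300, 400):
    print(f"t = {j * k:5.1f}   leapfrog = {y[j]: .3e}   exact = {math.exp(-r * j * k):.3e}")
```

Even with a small step size and an exact starting value, the component associated with $s_-$ is not exactly zero, and since $|s_-| > 1$ it eventually dominates the computed solution.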

7.3.3 Error and Convergence

It is essential to examine the requirements needed to guarantee a numerical IVP solver will work. This, as usual, will require us to consider the error. As illustrated in Table 7.2, at each time point we have three different solutions, and they are

$y(t_j)$ ≡ exact solution of the IVP at $t = t_j$; (7.47)
$y_j$ ≡ exact solution of finite difference equation at $t = t_j$; (7.48)
$\bar{y}_j$ ≡ solution at $t = t_j$ computed using the difference equation. (7.49)

We are interested in the difference between the exact solution of the IVP and the values we actually end up computing using our algorithm. Therefore, we are interested in the error $e_j = |y(t_j) - \bar{y}_j|$. To help make it more apparent what is contributing to the error we rewrite it as follows:

$$e_j = |y(t_j) - y_j + y_j - \bar{y}_j|. \tag{7.50}$$

From this, the error can be considered as coming from the following two sources:

$y(t_j) - y_j$: This is the approximation error at $t = t_j$. This corresponds to the difference between the exact solution of the IVP and the exact solution of the problem we use as its approximation. As it occurs in Table 7.2, this should be the major contributor to the error until $k$ is small enough that this difference gets down to approximately that of the round-off.

$y_j - \bar{y}_j$: This is the computational error at $t = t_j$. This comes from round-off when using floating-point calculations to compute the solution recursively using the difference equation. So, if the method is implicit then this includes the iterative error when solving the difference equation. The last column of Table 7.2 gives the values of the computational error for the first few time points. Getting values of $10^{-16}$ or smaller, as it occurs in this calculation, is about as good as can be expected using double precision.

For an IVP solver to be useful it must be convergent. This means that, as you increase the number of time steps in the time interval $0 \le t \le T$, the $y_j$'s determined by the difference equation converge to the corresponding values of the $y(t_j)$'s, and this is true no matter what choice you make for $f(t, y)$ as long as it satisfies the smoothness conditions stated in Section 7.1. It should come as no surprise that to guarantee that this happens it is required that the approximation of the IVP is consistent. In other words, the truncation error goes to zero with $k$. However, as explained below, consistency is not enough.

The complication that arises when solving an IVP is the potential compounding of the error at each time step. Take, for example, Euler's method, which states that $y_{j+1} = y_j + k f(t_j, y_j)$. So, the error in $y_j$ contributes to the error for $y_{j+1}$, which in turn contributes to the error for $y_{j+2}$, etc. This means if $M = 5$ then this compounding of the error for $y_1$ will happen 4 times. If you take $M = 20$ then it will happen 19 times, but presumably, since the time steps are smaller, the contribution from each step is smaller. As it turns out, it is possible to find consistent approximations that will allow this compounding error to grow. Preventing this unstable situation is the reason for needing a stability condition. In fact, it is possible to prove that if a method is consistent, and A-stable or conditionally A-stable, then the method is convergent. For those who want to learn more about this, the connection between stability and convergence is contained in Dahlquist's Equivalence Theorem [Süli and Mayers, 2003; Iserles, 2009].

7.3.4 Computing the Solution

This brings us to how to compute an accurate solution of an IVP. This will be done using an approach similar to what was done for numerical integration. Namely, given $T$, the solution will be computed using $M_1$ time points, and computed again using $M_2$ points, where $M_2 > M_1$. The difference between the two solutions will be used to judge the accuracy. To avoid unnecessary complications, the previous time points are included in the new set. For example, if you double $M$ then the previous time points are included. The iterative error is then determined using the time points that are in both sets. An outline of the procedure is given in Table 7.4.

Example 7.7

The solution of the logistic equation from Example 7.4 was computed using Euler's method in conjunction with the algorithm given in Table 7.4 when $tol = 10^{-6}$. The resulting values of the iterative error are shown in Figure 7.5. As expected, the iterative error decreases as $O(k)$ and the values are indistinguishable from those given in Table 7.3.


Pick: M and tol > 0
err = 10 ∗ tol
Compute y
Loop: while err > tol
          Y = y
          M = 2 ∗ M
          Compute y
          err = ||y(0 : 2 : M) − Y||∞
      end

Table 7.4 Algorithm for computing the solution of an IVP for a given error tolerance. Note y(0 : 2 : M ) = (y0 , y2 , y4 , · · · , yM )T and “Compute y” is a placeholder for whatever IVP solver is being used to compute the solution for 0 ≤ t ≤ T .
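The procedure in Table 7.4 translates almost line for line into code. The following is a minimal sketch, with Euler's method standing in for the "Compute y" placeholder. The names `solve_to_tolerance` and `euler`, and the `max_doublings` safety cap, are assumptions that are not part of the table.

```python
def euler(f, y0, T, M):
    # Euler's method (7.30) on the grid t_j = j*k, with k = T/M
    k = T / M
    y = [y0]
    for j in range(M):
        y.append(y[j] + k * f(j * k, y[j]))
    return y

def solve_to_tolerance(solver, f, y0, T, M, tol, max_doublings=20):
    # Step doubling as outlined in Table 7.4.
    # max_doublings is an added safeguard (an assumption) against endless looping.
    err = 10 * tol
    y = solver(f, y0, T, M)
    doublings = 0
    while err > tol and doublings < max_doublings:
        Y = y
        M = 2 * M
        y = solver(f, y0, T, M)
        # Compare at the time points shared by both grids: y(0:2:M) versus Y
        err = max(abs(a - b) for a, b in zip(y[0::2], Y))
        doublings += 1
    return y, M, err

def logistic_rhs(t, y):
    # Right-hand side of the logistic equation (7.32)
    return 10.0 * y * (1.0 - y)

y, M, err = solve_to_tolerance(euler, logistic_rhs, 0.01, 1.0, 4, 1e-2)
print("M =", M, "  iterative error =", err)
```

Because Euler's method is first order, each doubling roughly halves the iterative error, so tightening `tol` by a factor of 10 requires three to four more doublings.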

If the number of correct significant digits at each time point is important, then the local relative iterative error could be used in Table 7.4. In this case, you take err to be the maximum value of $|(y_{2j} - Y_j)/y_{2j}|$, for $j = 0, 1, 2, \cdots, M/2$. Of course, this requires including a contingency for the possibility that $y_{2j} = 0$.

Most commercial codes for solving IVPs employ an adaptive method to satisfy the given error condition. In doing this, instead of solving the IVP multiple times as in Table 7.4, they only solve the IVP once. They do this by changing the step size $k$ based on what the solution is doing. As with the adaptive integration method described in Section 6.6, this approach requires a way to estimate the error. The result is that the size of the time steps can vary appreciably as the method progresses, looking similar to the nodes used in Figure 6.16, where a small step size is used for regions with rapid changes, and relatively larger steps in other regions.

Figure 7.5 Iterative error solving the logistic equation (7.32) using the Euler method (7.35) with the algorithm given in Table 7.4. (Log-log plot; horizontal axis: Number of Time Points (M); vertical axis: Iterative Error.)

| Rule | Integration Formula |
|---|---|
| Right Box | $\int_{t_j}^{t_{j+1}} f(x)\,dx = k f(t_{j+1}) + O(k^2)$ |
| Left Box | $\int_{t_j}^{t_{j+1}} f(x)\,dx = k f(t_j) + O(k^2)$ |
| Midpoint | $\int_{t_{j-1}}^{t_{j+1}} f(x)\,dx = 2k f(t_j) + \frac{k^3}{3} f''(\eta_j)$ |
| Trapezoidal | $\int_{t_j}^{t_{j+1}} f(x)\,dx = \frac{k}{2}\left[ f(t_j) + f(t_{j+1}) \right] - \frac{k^3}{12} f''(\eta_j)$ |
| Simpson | $\int_{t_{j-1}}^{t_{j+1}} f(x)\,dx = \frac{k}{3}\left[ f(t_{j+1}) + 4f(t_j) + f(t_{j-1}) \right] - \frac{k^5}{90} f''''(\eta_j)$ |

Table 7.5 Numerical integration formulas. The points $t_1, t_2, t_3, \ldots$ are equally spaced with step size $k = t_{j+1} - t_j$. The point $\eta_j$ is located within the interval of integration.

7.4 IVP Methods Using Numerical Integration

Another approach to deriving a finite difference approximation of an IVP is to integrate the differential equation and then use a numerical integration rule. This is a very useful idea that is best explained by working through an example. To get started, a time grid is introduced, and so Step 1 is the same as before. However, Step 2 and Step 3 differ from what we did earlier.

Step 2. Integrate the differential equation between two time points. We will take $t_j$ and $t_{j+1}$, and so from (7.21) we have

$$\int_{t_j}^{t_{j+1}} \frac{dy}{dt}\,dt = \int_{t_j}^{t_{j+1}} f(t, y(t))\,dt. \tag{7.51}$$

Using the Fundamental Theorem of Calculus we obtain

$$y(t_{j+1}) - y(t_j) = \int_{t_j}^{t_{j+1}} f(t, y(t))\,dt. \tag{7.52}$$

Step 3. Replace the integral in Step 2 with a finite difference approximation. There are numerous choices, and they produce different numerical procedures. A few possibilities are listed in Table 7.5. We will use the trapezoidal rule, and introducing this into (7.52) yields

$$y(t_{j+1}) - y(t_j) = \frac{k}{2}\left[ f\big(t_j, y(t_j)\big) + f\big(t_{j+1}, y(t_{j+1})\big) \right] + O(k^3). \tag{7.53}$$

Step 4. Drop the big-O term. After dropping the $O(k^3)$ term in (7.53) the resulting equation is

$$y_{j+1} = y_j + \frac{k}{2}\left( f_j + f_{j+1} \right), \quad \text{for } j = 0, 1, 2, \ldots, M - 1, \tag{7.54}$$

where $f_j = f(t_j, y_j)$ and $f_{j+1} = f(t_{j+1}, y_{j+1})$. From the initial condition (7.22),

$$y_0 = \alpha. \tag{7.55}$$

The finite difference equation (7.54) is known as the trapezoidal method for solving (7.21). Because of the $f_{j+1}$ term this method is implicit, and it is not hard to show that the amplification factor for the method is

$$\kappa = \frac{1 - rk/2}{1 + rk/2}.$$

Since $|\kappa| < 1$ for every step size $k > 0$, it follows from Theorem 7.1 that the method is A-stable. To determine the truncation error for the method note that in (7.53) the error at each time step is $O(k^3)$. In taking $M$ time steps to reach $t = T$ the resulting error is therefore $M \times O(k^3) = O(k^2)$, because $M = T/k$. In other words, the truncation error is $\tau_j = O(k^2)$.

One of the attractive features of the quadrature approach is that it involves multiple decision points that can be varied to produce different numerical methods. For example, the integration interval can be changed to, say, $t_{j-1} \le t \le t_{j+1}$ and then Simpson's rule is used on the resulting integral. In the vernacular of the subject, if the integration rule is based on a polynomial approximation then it is called an Adams method. More specifically, it is an Adams-Bashforth method if the rule does not use the value at $t_{j+1}$, and an Adams-Moulton method if it does. So, the midpoint rule results in an Adams-Bashforth method, while the trapezoidal and Simpson's rules produce Adams-Moulton methods.

Example 7.8 We have derived several methods for solving IVPs, including the Euler, backward Euler, leapfrog, and trapezoidal methods. It is worth taking them out for a test drive to see how they compare, and the logistic equation (7.6) is a good candidate for this. The equation that is solved is

$$\frac{dy}{dt} = 10y(1 - y), \quad \text{for } 0 < t, \tag{7.56}$$

where $y(0) = 0.01$. We take $T = 3$, so the time points are determined from the expression $t_j = jk$, for $j = 0, 1, 2, \ldots, M$ and $k = 3/M$. Because $f(t, y) = 10y(1 - y)$ our methods reduce to the finite difference equations listed below:

Euler: $y_{j+1} = y_j + 10k\,y_j(1 - y_j)$,
Backward Euler: $y_{j+1} = y_j + 10k\,y_{j+1}(1 - y_{j+1})$,
Leapfrog: $y_{j+1} = y_{j-1} + 20k\,y_j(1 - y_j)$,
Trapezoidal: $y_{j+1} = y_j + 5k\,[\,y_j(1 - y_j) + y_{j+1}(1 - y_{j+1})\,]$.

The initial condition is $y_0 = 0.01$, and for the leapfrog method it is assumed that $y_1 = y(k)$ (i.e., the exact value at $t = t_1$ is used). Just how well these four expressions do is shown in Figure 7.6 for the case $M = 30$. The first thing one notices is just how badly the leapfrog method does (it had to be given its own graph because it behaves so badly). Moreover, increasing the number of time steps by a factor of 5 does not improve the situation. This is not unexpected, because we know that the leapfrog method is not A-stable. The other three solution curves also behave as expected. In particular, the two Euler methods are not as accurate as the trapezoidal method and are approximately equal in how far each differs from the exact solution. To quantify how accurately each method solves the problem, in Figure 7.7 the error $\|\mathbf{y} - \mathbf{y}_c\|_\infty$ is plotted as a function of the number of time points used to reach $T$ (this is the same error used in Example 7.4). As predicted, all decrease according to their respective truncation errors.

Figure 7.6 Solution of the logistic equation (7.56) using different numerical schemes. The leapfrog method is shown in the lower plot (exact solution, leapfrog with M = 30, and leapfrog with M = 150), and the two Euler schemes and the trapezoidal method are in the upper graph (together with the exact solution). The horizontal axis in both plots is t.

Figure 7.7 The error as a function of the number M of time steps used to solve the logistic equation (7.56). The Euler and backward Euler curves are almost indistinguishable. (Curves shown: Euler, backward Euler, trapezoidal; horizontal axis: number of time points; vertical axis: error.)

Specifically, the trapezoidal method decreases as $O(k^2)$ and the two Euler methods decrease as $O(k)$.

A significant additional observation to make about Figure 7.7 is that we now have a method that can achieve an error of $4 \times 10^{-5}$ using only 160 points versus the 100,000 points required using Euler. What we have not done is find an explicit method that does this, and that is our next task.
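The comparison in Example 7.8 can be reproduced with a few lines of code. What follows is a minimal sketch in Python, not the code used to generate the figures. It assumes that the two implicit methods are advanced by solving the quadratic equation for $y_{j+1}$ in closed form (taking the positive root), which is possible here only because $f$ is quadratic in $y$. The function names are illustrative.

```python
import math

def exact(t):
    # Exact solution of y' = 10 y (1 - y) with y(0) = 0.01
    return 1.0 / (1.0 + 99.0 * math.exp(-10.0 * t))

def euler_step(y, k):
    return y + 10.0 * k * y * (1.0 - y)

def backward_euler_step(y, k):
    # y_new solves 10k*Y^2 + (1 - 10k)*Y - y = 0; the positive root is
    # written in a form that avoids subtractive cancellation.
    b = 1.0 - 10.0 * k
    return 2.0 * y / (b + math.sqrt(b * b + 40.0 * k * y))

def trapezoidal_step(y, k):
    # y_new solves 5k*Y^2 + (1 - 5k)*Y - c = 0, where c holds the known terms.
    c = y + 5.0 * k * y * (1.0 - y)
    b = 1.0 - 5.0 * k
    return 2.0 * c / (b + math.sqrt(b * b + 20.0 * k * c))

def max_error(step, M, T=3.0, y0=0.01):
    # Largest error over the time points t_1, ..., t_M
    k = T / M
    y, err = y0, 0.0
    for j in range(1, M + 1):
        y = step(y, k)
        err = max(err, abs(y - exact(j * k)))
    return err

if __name__ == "__main__":
    for name, step in [("Euler", euler_step),
                       ("Backward Euler", backward_euler_step),
                       ("Trapezoidal", trapezoidal_step)]:
        print(name, [max_error(step, M) for M in (100, 1000, 10000)])
```

Doubling $M$ should roughly halve the error for the two Euler methods and reduce it by about a factor of four for the trapezoidal method, in line with the truncation errors stated above.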

7.5 Runge-Kutta Methods

An extraordinarily successful family of numerical approximations for IVPs comes under the general classification of Runge-Kutta (RK) methods. The derivation is based on the following question: is it possible to determine an explicit method for finding $y_{j+1}$ that only uses the value of the solution at $t_j$ and has a specified truncation error? The key in getting this to work is making a good guess as to what such a formula might look like.

7.5.1 RK2

To demonstrate how RK methods are derived, note that the best one-step explicit method we have so far is Euler's method, which has a truncation error of $O(k)$. The RK question is, can we find an explicit one-step method that is $O(k^2)$? We have been able to derive an implicit scheme with this error, and this is the trapezoidal method (7.54). The reason it is implicit is the term $f(t_{j+1}, y_{j+1})$. Suppose we experiment a little and use Euler's method to replace $f(t_{j+1}, y_{j+1})$ with $f(t_{j+1}, y_j + k f_j)$. The result is

$$y_{j+1} = y_j + \frac{k}{2}\Big[ f(t_j, y_j) + f(t_j + k,\, y_j + k f_j) \Big]. \tag{7.57}$$

This is explicit and, as it turns out, the truncation error is $O(k^2)$. The RK method starts by proposing a formula that is similar to the one above, and then determining what truncation error is then possible. Based on (7.57), the Runge-Kutta assumption for a $O(k^2)$ method is that the difference equation has the form

$$y_{j+1} = y_j + k\Big[ a f(t_j, y_j) + b f(t_j + \alpha k,\, y_j + \beta k f_j) \Big]. \tag{7.58}$$

The constants $a$, $b$, $\alpha$, and $\beta$ are determined from the requirement that the method has at least a $O(k^2)$ truncation error.

Finding the constants is fairly straightforward. To explain, the truncation error is determined by how well the difference equation approximates the original differential equation. To determine this, the exact solution $y(t)$ is substituted into (7.58) and then everything is expanded based on the assumption that $k$ is small. For example, $y_{j+1}$ is replaced with $y(t_j + k)$, and then this is expanded using Taylor's theorem to obtain $y(t_j + k) = y(t_j) + k y'(t_j) + \frac{1}{2}k^2 y''(t_j) + \cdots$. Since $y'(t) = f(t, y(t))$, then $y'' = f_t + f_y y'(t) = f_t + f_y f$. Consequently,

$$y(t_j + k) = y + kf + \frac{1}{2}k^2 (f_t + f_y f) + O(k^3), \tag{7.59}$$

where $y = y(t_j)$, $f = f(t_j, y(t_j))$, etc. In a similar manner, using Taylor's theorem, one finds that

$$f(t_j + \alpha k,\, y + \beta k f) = f + \alpha k f_t + \beta k f f_y + O(k^2). \tag{7.60}$$

Substituting (7.59) and (7.60) into (7.58), and collecting terms we have that

$$y + kf + \frac{1}{2}k^2 (f_t + f_y f) + O(k^3) = y + k(a + b)f + k^2 b(\alpha f_t + \beta f f_y) + O(k^3). \tag{7.61}$$

Equating the coefficients in this equation, one concludes that

$$\begin{aligned} f:&\quad a + b = 1, \\ f_t:&\quad 2\alpha b = 1, \\ f_y f:&\quad 2\beta b = 1. \end{aligned} \tag{7.62}$$

These three equations are called the order conditions for RK2 methods, and interestingly, the solution is not unique. Some example solutions for the order conditions are listed below.

(1) $a = b = \frac{1}{2}$, $\alpha = \beta = 1$:

$$y_{j+1} = y_j + \frac{k}{2}\Big[ f(t_j, y_j) + f(t_{j+1},\, y_j + k f_j) \Big], \tag{7.63}$$

where $f_j = f(t_j, y_j)$. This is known as Heun's method. It is also the method, given in (7.57), that we derived by combining the trapezoidal and Euler methods.

(2) $a = 0$, $b = 1$, $\alpha = \beta = \frac{1}{2}$:

$$y_{j+1} = y_j + k f\Big(t_j + \frac{k}{2},\, y_j + \frac{k}{2} f_j\Big). \tag{7.64}$$

This is known as the midpoint method.

(3) $a = \frac{1}{4}$, $b = \frac{3}{4}$, $\alpha = \beta = \frac{2}{3}$:

$$y_{j+1} = y_j + \frac{1}{4}k\Big[ f(t_j, y_j) + 3f\Big(t_j + \frac{2}{3}k,\, y_j + \frac{2}{3}k f_j\Big) \Big]. \tag{7.65}$$

This choice comes from using an optimal integration rule involving Radau quadrature, and this is explained in Exercise 7.23(e).

Whichever solution of the order conditions is selected, the truncation error for the resulting method is $O(k^2)$. The reason is that the two sides of (7.61) agree up to terms that are $O(k^3)$ and, as explained in Section 7.4, this means that the truncation error for the method is $O(k^2)$. For those wondering about whether the undetermined constant remaining after satisfying the order conditions can be used to remove the $O(k^3)$ term in (7.61), this will not work. As seen in (7.61), one order condition is needed to remove the $O(k)$ term and two order conditions are needed to remove the $O(k^2)$ terms. Removing the $O(k^3)$ terms also requires multiple order conditions, and this is not possible using the one extra constant.

Example 7.9 As usual, we will test our new methods using the logistic equation. Specifically, we solve

$$\frac{dy}{dt} = 10y(1 - y), \quad \text{for } 0 < t, \tag{7.66}$$

where $y(0) = 0.01$. The error $\|\mathbf{y} - \mathbf{y}_c\|_\infty$ is shown in Figure 7.8 as a function of the number of time points $M$ (this is the same error used in Example 7.8). The curves are for Heun's method given in (7.63), the Radau-based method in (7.65), and the trapezoidal method. The three lines are parallel because all three methods are $O(k^2)$. Also, even though the trapezoidal method does slightly better than the two RK2 methods, it requires more work to compute since the method is implicit.
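Because every RK2 method has the form (7.58), a single routine parameterized by $a$, $b$, $\alpha$, $\beta$ covers Heun, midpoint, and the Radau-based method. The Python sketch below is one way this might be organized; the dictionary `RK2_CHOICES` and the helper names are illustrative assumptions, and the test problem is the logistic equation (7.66).

```python
import math

def rk2_step(f, t, y, k, a, b, alpha, beta):
    # One step of the general RK2 formula (7.58)
    fj = f(t, y)
    return y + k * (a * fj + b * f(t + alpha * k, y + beta * k * fj))

# (a, b, alpha, beta) for the three example solutions of the order conditions
RK2_CHOICES = {
    "heun": (0.5, 0.5, 1.0, 1.0),                        # (7.63)
    "midpoint": (0.0, 1.0, 0.5, 0.5),                    # (7.64)
    "radau": (0.25, 0.75, 2.0 / 3.0, 2.0 / 3.0),         # (7.65)
}

def satisfies_order_conditions(a, b, alpha, beta, tol=1e-12):
    # The three order conditions in (7.62)
    return (abs(a + b - 1.0) < tol
            and abs(2.0 * alpha * b - 1.0) < tol
            and abs(2.0 * beta * b - 1.0) < tol)

def logistic(t, y):
    return 10.0 * y * (1.0 - y)

def exact(t):
    # Exact solution of (7.66) with y(0) = 0.01
    return 1.0 / (1.0 + 99.0 * math.exp(-10.0 * t))

def max_error(name, M, T=3.0, y0=0.01):
    # Largest error over the time points when solving (7.66)
    a, b, alpha, beta = RK2_CHOICES[name]
    k = T / M
    y, err = y0, 0.0
    for j in range(M):
        y = rk2_step(logistic, j * k, y, k, a, b, alpha, beta)
        err = max(err, abs(y - exact((j + 1) * k)))
    return err

if __name__ == "__main__":
    for name in RK2_CHOICES:
        print(name, max_error(name, 1000), max_error(name, 2000))
```

For each choice, halving the step size should reduce the error by roughly a factor of four, which is the behavior expected of a $O(k^2)$ method.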

7.5.2 RK4

The one method from the Runge-Kutta family that deserves special attention is known as the RK4 method. This is used in so many computer codes that it has become the workhorse of IVP solvers. The derivation of RK4 uses a generalization of the assumption in (7.58) and it is that

$$y_{j+1} = y_j + c_1 k_1 + c_2 k_2 + c_3 k_3 + c_4 k_4, \tag{7.67}$$

where

$$k_1 = k f(t_j, y_j), \tag{7.68}$$

$$k_2 = k f(t_j + \alpha_2 k,\, y_j + \beta_{21} k_1), \tag{7.69}$$

$$k_3 = k f(t_j + \alpha_3 k,\, y_j + \beta_{31} k_1 + \beta_{32} k_2), \tag{7.70}$$

$$k_4 = k f(t_j + k,\, y_j + \beta_{41} k_1 + \beta_{42} k_2 + \beta_{43} k_3). \tag{7.71}$$

There are 12 constants in the above assumption, and after plugging the exact solution into (7.67) one finds 11 order conditions. For the mildly curious, these are listed in Section 7.7.1. Any choice of the constants that satisfies the order conditions will produce a method with a truncation error that is $O(k^4)$.

Based on the above discussion, one of the parameters in (7.67)-(7.71) is arbitrary. It is possible to just pick some random value, but there is a more reasoned approach. To explain, consider the differential equation $y' = f(t)$. In this case, (7.67) becomes

$$y_{j+1} = y_j + c_1 k f(t_j) + c_2 k f(t_j + \alpha_2 k) + c_3 k f(t_j + \alpha_3 k) + c_4 k f(t_j + k).$$

On the other hand, integrating the differential equation we get that $y(t_{j+1}) = y(t_j) + \int_{t_j}^{t_{j+1}} f(t)\,dt$. Using this and the above formula, we end up with an integration rule of the form

$$\int_{t_j}^{t_{j+1}} f(t)\,dt \approx c_1 k f(t_j) + c_2 k f(t_j + \alpha_2 k) + c_3 k f(t_j + \alpha_3 k) + c_4 k f(t_j + k). \tag{7.72}$$

This means that choosing the parameters is effectively choosing an integration rule. To guarantee a truncation error that is $O(k^4)$ it is required that the selected integration rule has an error that is $O(k^5)$. Simpson's rule (6.21) is an obvious choice, and this is the first example below. However, there is a more interesting choice, and it is described in the second example.

Figure 7.8 The error as a function of the number M of time steps used to solve the logistic equation (7.66). The RK2-R curve comes from (7.65), and Heun comes from (7.63). (Curves shown: Trap, Heun, RK2-R; horizontal axis: number of time points; vertical axis: error.)

Example 7.10: Classic RK4 To have (7.72) correspond to Simpson's rule, you take $c_1 = \frac{1}{6}$. The resulting RK4 method is

$$y_{j+1} = y_j + \frac{1}{6}\big(k_1 + 2k_2 + 2k_3 + k_4\big), \tag{7.73}$$

where

$$\begin{aligned} k_1 &= k f(t_j, y_j), & k_2 &= k f\Big(t_j + \frac{k}{2},\, y_j + \frac{1}{2}k_1\Big), \\ k_3 &= k f\Big(t_j + \frac{k}{2},\, y_j + \frac{1}{2}k_2\Big), & k_4 &= k f(t_{j+1},\, y_j + k_3). \end{aligned}$$

This is often referred to as the RK4 method, or as the classic RK4 method.

Example 7.11: Lobatto RK4 The question about selecting the $c_i$'s and $\alpha_i$'s in (7.72) to obtain the most accurate result is the same question asked in the derivation of the Gaussian integration rules. The difference with (7.72) is that the endpoints are being used to determine the integration rule. This variation is known as Lobatto quadrature, and this is explained in Exercise 6.31. Using the result of Exercise 6.31(c), the Lobatto rule that produces the maximum precision has $c_1 = 1/12$. This gives a RK4 method of the form

$$y_{j+1} = y_j + \frac{1}{12}\big(k_1 + 5k_2 + 5k_3 + k_4\big), \tag{7.74}$$

where, setting $r = \sqrt{5}$, $\alpha = \frac{1}{2}\big(1 - \frac{1}{r}\big)$, $\beta = -\frac{1}{4}\big(1 + \frac{3}{r}\big)$,

$$\begin{aligned} k_1 &= k f(t_j, y_j), \\ k_2 &= k f\big(t_j + \alpha k,\, y_j + \alpha k_1\big), \\ k_3 &= k f\big(t_{j+1} - \alpha k,\, y_j + \beta(k_1 - r k_2)\big), \\ k_4 &= k f\big(t_{j+1},\, y_j + k_1 + 5\beta(k_2 - k_1) + 5\alpha(k_3 - k_1)\big). \end{aligned}$$



Both of the RK4 methods described above have a truncation error that is $O(k^4)$. What is different is that classic RK4 comes from an integration rule that is $O(k^5)$, while the Lobatto RK4 method comes from one that is $O(k^7)$. For an autonomous differential equation it is expected that the two methods are equivalent in terms of their accuracy. Where the Lobatto-based method likely does better is when solving forced problems, and an example of this is given below.

Example 7.12
(1) The logistic equation problem (7.32), (7.33) was solved using the classic RK4 method, the Lobatto RK4 method, and the trapezoidal method, and the error curves are shown in Figure 7.9(top). The RK4 methods produce effectively the same result, both showing the expected $O(k^4)$ decrease. Moreover, they are clearly better than the trapezoidal method which decreases as $O(k^2)$.
(2) The errors using the same three methods, but solving

$$y' = y - 10t e^{t} \sin(5t^2), \quad \text{for } 0 < t \le 3,$$

where $y(0) = 0$, are shown in Figure 7.9(bottom). Again, both RK4 methods decrease as $O(k^4)$, but the Lobatto RK4 method provides a better result than the classic RK4 method. This is due to the time-dependent forcing, and the fact that Lobatto RK4 is based on a better integration rule than the classic RK4 method.

Figure 7.9 The error as a function of the number M of time steps using the classic RK4 method (RK4-C), the Lobatto RK4 method (RK4-L), and the trapezoidal methods. The specific problems solved are described in Example 7.12. (Horizontal axis: number of time points; vertical axis: error.)

With RK4 we now have an explicit method with a $O(k^4)$ truncation error. A consequence of this is that for the logistic equation problem it achieves an error of $4 \times 10^{-5}$ using 23 points versus 160 for the trapezoidal method and 100,000 points required using Euler. Given this, you might wonder why the other methods are even discussed, much less used by anyone. Well, there are several reasons for considering other methods, and one relates to stability (which is discussed shortly). Another reason is that RK4 does not do well in preserving certain properties of the solution, and an example is energy. In problems where energy conservation is important, other methods, called symplectic methods, are usually used. An example of such a method is introduced in Section 7.6.3.

It should be mentioned that within the RK family, Lobatto methods usually refer to a collection of implicit methods that are derived using Lobatto integration. As such, they differ from the RK4 method considered here.
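The two RK4 methods (7.73) and (7.74) differ only in their constants, which makes them easy to compare in code. The following Python sketch is an illustration written for this purpose, with assumed function names. For the forced problem in Example 7.12(2) it uses the exact solution $y = e^{t}(\cos(5t^2) - 1)$, which can be checked by substituting it into the differential equation.

```python
import math

def rk4_classic_step(f, t, y, k):
    # Classic RK4, equation (7.73)
    k1 = k * f(t, y)
    k2 = k * f(t + 0.5 * k, y + 0.5 * k1)
    k3 = k * f(t + 0.5 * k, y + 0.5 * k2)
    k4 = k * f(t + k, y + k3)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

R = math.sqrt(5.0)
ALPHA = 0.5 * (1.0 - 1.0 / R)
BETA = -0.25 * (1.0 + 3.0 / R)

def rk4_lobatto_step(f, t, y, k):
    # Lobatto RK4, equation (7.74)
    k1 = k * f(t, y)
    k2 = k * f(t + ALPHA * k, y + ALPHA * k1)
    k3 = k * f(t + k - ALPHA * k, y + BETA * (k1 - R * k2))
    k4 = k * f(t + k, y + k1 + 5.0 * BETA * (k2 - k1) + 5.0 * ALPHA * (k3 - k1))
    return y + (k1 + 5.0 * k2 + 5.0 * k3 + k4) / 12.0

def forced(t, y):
    # Right-hand side of the forced problem in Example 7.12(2)
    return y - 10.0 * t * math.exp(t) * math.sin(5.0 * t * t)

def forced_exact(t):
    # Solution of the forced problem with y(0) = 0
    return math.exp(t) * (math.cos(5.0 * t * t) - 1.0)

def max_error(step, f, exact, y0, M, T=3.0):
    # Largest error over the time points t_1, ..., t_M
    k = T / M
    y, err = y0, 0.0
    for j in range(M):
        y = step(f, j * k, y, k)
        err = max(err, abs(y - exact((j + 1) * k)))
    return err

if __name__ == "__main__":
    for M in (100, 200, 400, 800):
        ec = max_error(rk4_classic_step, forced, forced_exact, 0.0, M)
        el = max_error(rk4_lobatto_step, forced, forced_exact, 0.0, M)
        print(M, ec, el)
```

Running the script prints the classic and Lobatto errors side by side for the forced problem, so the comparison described in Example 7.12(2) can be examined directly.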

7.5.3 Stability

The derivation of the RK methods is centered on the truncation error. What is missing is the Step 5 question, which is whether the methods are A-stable. As it turns out, all explicit Runge-Kutta methods are conditionally A-stable. As an example, given the order conditions in (7.62), the RK2 method in (7.58) is conditionally A-stable and the requirement is $k < 2/r$ (see Exercise 7.24). This is the same inequality we obtained for Euler's method, and so RK2 and Euler have the same stability requirement. As for RK4, the requirement is $k < \lambda/r$ where $\lambda \approx 2.785$ (see Exercise 7.25). This means that the stability region for RK4 is slightly larger than it is for RK2 or Euler.

With Euler, the stability condition is not a particular issue since the method requires such a small time step to produce an accurate solution. This isn't the case with RK4, which is capable of producing an accurate solution with a fairly large time step. It is not unusual with RK4 to find that as you decrease $k$, once it satisfies the stability condition, the solution is very accurate. Essentially, when this happens, you pass from "meaningless" to "meaningful and accurate" when $k$ crosses the stability boundary.
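The stability limits quoted above can be checked numerically. The sketch below assumes the usual test equation $y' = -ry$ with $r > 0$, for which one step of RK2 or RK4 gives $y_{j+1} = \kappa(z)\,y_j$ with $z = rk$, where $\kappa$ is the degree-2 or degree-4 Taylor polynomial of $e^{-z}$. These polynomials are standard results rather than formulas stated in this section, and the bisection routine is an illustrative helper.

```python
def kappa_rk2(z):
    # Amplification factor of any RK2 method satisfying (7.62), z = r*k
    return 1.0 - z + z * z / 2.0

def kappa_rk4(z):
    # Amplification factor of an explicit four-stage RK4 method, z = r*k
    return 1.0 - z + z**2 / 2.0 - z**3 / 6.0 + z**4 / 24.0

def stability_limit(kappa, lo, hi, tol=1e-12):
    # Bisection for the value of z where |kappa(z)| = 1.
    # Assumes |kappa(lo)| < 1 (stable) and |kappa(hi)| > 1 (unstable).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if abs(kappa(mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print("RK2 limit:", stability_limit(kappa_rk2, 1.0, 4.0))
    print("RK4 limit:", stability_limit(kappa_rk4, 1.0, 4.0))
```

The computed limits correspond to the requirements $k < 2/r$ for RK2 and $k < \lambda/r$ with $\lambda \approx 2.785$ for RK4.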

7.5.4 RK-n

The ideas developed here can be generalized to produce higher order RK methods, although the complexity of the derivation can be enormous. For example, in celestial mechanics you occasionally see people use twelfth-order RK methods. Such a scheme is not easy to derive, because it results in 5972 order conditions, and, as occurred earlier, these form an underdetermined nonlinear system.

A short summary of the principal IVP solvers, derived so far in this chapter, is given in Table 7.6.

Methods for solving the differential equation $\frac{dy}{dt} = f(t, y)$

| Method | Difference Equation | $\tau_j$ | Properties |
|---|---|---|---|
| Euler | $y_{j+1} = y_j + k f_j$ | $O(k)$ | Explicit; Conditionally A-stable |
| Backward Euler | $y_{j+1} = y_j + k f_{j+1}$ | $O(k)$ | Implicit; A-stable |
| Trapezoidal | $y_{j+1} = y_j + \frac{k}{2}(f_j + f_{j+1})$ | $O(k^2)$ | Implicit; A-stable |
| Heun (RK2) | $y_{j+1} = y_j + \frac{1}{2}(k_1 + k_2)$, where $k_1 = k f_j$, $k_2 = k f(t_{j+1}, y_j + k_1)$ | $O(k^2)$ | Explicit; Conditionally A-stable |
| Classic Runge-Kutta (RK4) | $y_{j+1} = y_j + \frac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4)$, where $k_1 = k f_j$, $k_2 = k f(t_j + \frac{k}{2}, y_j + \frac{1}{2}k_1)$, $k_3 = k f(t_j + \frac{k}{2}, y_j + \frac{1}{2}k_2)$, $k_4 = k f(t_{j+1}, y_j + k_3)$ | $O(k^4)$ | Explicit; Conditionally A-stable |

Table 7.6 Finite difference methods for solving an IVP. The points $t_1, t_2, t_3, \ldots$ are equally spaced with step size $k = t_{j+1} - t_j$. Also, $f_j = f(t_j, y_j)$, $f_{j+1} = f(t_{j+1}, y_{j+1})$ and $\tau_j$ is the truncation error for the method.

7.6 Solving Systems of IVPs

Most applications involving IVPs have multiple differential equations. In such cases the mathematical problem has the form

$$\frac{d\mathbf{y}}{dt} = \mathbf{f}(t, \mathbf{y}), \quad \text{for } 0 < t, \tag{7.75}$$

where

$$\mathbf{y}(0) = \boldsymbol{\alpha}. \tag{7.76}$$

In the above, $\mathbf{y} = (y_1(t), y_2(t), \cdots, y_n(t))^T$, $\mathbf{f} = (f_1, f_2, \cdots, f_n)^T$, and $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \cdots, \alpha_n)^T$. In component form, the IVP can be written as $y_i' = f_i(t, y_1, y_2, \cdots, y_n)$, for $i = 1, 2, \cdots, n$, where $y_i(0) = \alpha_i$.

7.6.1 Examples

Law of Mass Action

An often occurring situation involves one or more species combining, or transforming, to form new or additional species. This is what happens when hydrogen and oxygen combine to form water. It also can be applied to a disease moving through a population. One example is the Kermack-McKendrick model for epidemics. This assumes the population can be separated into three groups. One is the population $S(t)$ of those susceptible to the disease, another is the population $I(t)$ that is ill (as well as infectious), and the third is the population $R(t)$ of individuals that have recovered. Using the law of mass action one can derive a model that accounts for the susceptible group getting sick, the subsequent increase in the ill population, and the eventual increase in the recovered population [Holmes, 2019]. The result is the following set of equations:

$$\begin{aligned} \frac{dS}{dt} &= -aSI, \\ \frac{dI}{dt} &= -bI + aSI, \\ \frac{dR}{dt} &= bI, \end{aligned}$$

where $S(0) = S_0$, $I(0) = I_0$, and $R(0) = R_0$. In the above equations $a$ and $b$ are rate constants. Given the three groups, and the letters used to designate them, this is an example of what is known as a SIR model in mathematical epidemiology.
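To make the SIR system concrete, here is a minimal Python sketch that advances it with the classic RK4 formulas applied componentwise. The rate constants, initial values, and step size used in the demonstration are arbitrary illustrative choices (they do not come from the text), and the function names are assumptions.

```python
def sir_rhs(t, y, a, b):
    # Right-hand side of the SIR equations; y = (S, I, R)
    S, I, R = y
    return (-a * S * I, -b * I + a * S * I, b * I)

def rk4_step(f, t, y, k):
    # Classic RK4 applied componentwise to a tuple-valued y
    def shifted(y, c, d):
        return tuple(yi + c * di for yi, di in zip(y, d))
    k1 = tuple(k * v for v in f(t, y))
    k2 = tuple(k * v for v in f(t + 0.5 * k, shifted(y, 0.5, k1)))
    k3 = tuple(k * v for v in f(t + 0.5 * k, shifted(y, 0.5, k2)))
    k4 = tuple(k * v for v in f(t + k, shifted(y, 1.0, k3)))
    return tuple(yi + (p + 2.0 * q + 2.0 * r + s) / 6.0
                 for yi, p, q, r, s in zip(y, k1, k2, k3, k4))

def solve_sir(a, b, y0, k, n_steps):
    # Returns the list of computed states (S_j, I_j, R_j), j = 0..n_steps
    f = lambda t, y: sir_rhs(t, y, a, b)
    states = [tuple(y0)]
    for j in range(n_steps):
        states.append(rk4_step(f, j * k, states[-1], k))
    return states

if __name__ == "__main__":
    # Illustrative values only: a = 2, b = 0.5, 1% initially infected
    states = solve_sir(a=2.0, b=0.5, y0=(0.99, 0.01, 0.0), k=0.1, n_steps=200)
    print("final (S, I, R):", states[-1])
    print("S + I + R at end:", sum(states[-1]))
```

Adding the three differential equations shows that $S + I + R$ is constant, so printing this sum gives a quick sanity check on the computed solution.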

Newton's Second Law

According to Newton's second law, $F = ma$. Letting $y(t)$ designate position, this law takes the form

$$m\frac{d^2 y}{dt^2} = F(t, y, y'), \quad \text{for } 0 < t. \tag{7.77}$$

Assuming that the initial position and velocity are specified, the initial conditions for this problem are

$$y(0) = \alpha \quad \text{and} \quad y'(0) = \beta. \tag{7.78}$$

It is possible to write the problem as a first-order system by introducing the velocity $v = y'$. Using the differential equation (7.77), we obtain the following:

$$\begin{aligned} y' &= v, \\ v' &= \frac{1}{m}F(t, y, v). \end{aligned}$$

Introducing the vector $\mathbf{y}(t) = (y, v)^T$, the IVP can be written as

$$\mathbf{y}'(t) = \mathbf{f}(t, \mathbf{y}), \quad \text{for } 0 < t, \tag{7.79}$$

where the initial conditions (7.78) take the form $\mathbf{y}(0) = (\alpha, \beta)^T$. The function $\mathbf{f}(t, \mathbf{y})$ appearing in (7.79) is

$$\mathbf{f}(t, \mathbf{y}) = \begin{pmatrix} v \\ \frac{1}{m}F(t, y, v) \end{pmatrix}.$$

What is significant is that the change of variables has transformed the second-order problem for $y(t)$ into a first-order problem for $\mathbf{y}(t)$. Like the original, (7.79) is nonlinear if $F$ depends nonlinearly on either $y$ or $v$.

7.6.2 Simple Approach

Writing down a numerical method to solve (7.75) is easy because all of the formulas we have derived for single equations apply to the vector case. For example, the approximation $y'(t_j) \approx [y(t_{j+1}) - y(t_j)]/k$ becomes

$$\mathbf{y}'(t_j) \approx \frac{\mathbf{y}(t_{j+1}) - \mathbf{y}(t_j)}{k},$$

and the error is still $O(k)$. Note that the error is now a vector, and stating that it is $O(k)$ means that each element of the error vector is $O(k)$.

What this means is that every method listed in Table 7.6 works with vector functions. So, for example, the trapezoidal method (7.54) becomes

$$\mathbf{y}_{j+1} = \mathbf{y}_j + \frac{k}{2}\big(\mathbf{f}_j + \mathbf{f}_{j+1}\big), \tag{7.80}$$

where $\mathbf{f}_j = \mathbf{f}(t_j, \mathbf{y}_j)$ and $\mathbf{f}_{j+1} = \mathbf{f}(t_{j+1}, \mathbf{y}_{j+1})$. Similarly, the RK4 method (7.73) takes the form

$$\mathbf{y}_{j+1} = \mathbf{y}_j + \frac{1}{6}\big(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\big), \tag{7.81}$$

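where $\mathbf{k}_1 = k\mathbf{f}(t_j, \mathbf{y}_j)$, $\mathbf{k}_2 = k\mathbf{f}(t_j + \frac{k}{2}, \mathbf{y}_j + \frac{1}{2}\mathbf{k}_1)$, $\mathbf{k}_3 = k\mathbf{f}(t_j + \frac{k}{2}, \mathbf{y}_j + \frac{1}{2}\mathbf{k}_2)$, and $\mathbf{k}_4 = k\mathbf{f}(t_{j+1}, \mathbf{y}_j + \mathbf{k}_3)$. Moreover, properties like being explicit or A-stable are still the same.

It needs to be mentioned that the observation made in the previous paragraph does not necessarily hold for higher order Runge-Kutta methods. Specifically, it turns out that a RK method that is $O(k^p)$ for the scalar equation $y' = f(t, y)$ is not necessarily $O(k^p)$ for the system $\mathbf{y}'(t) = \mathbf{f}(t, \mathbf{y})$ if $p \ge 5$. This means that to derive a higher order Runge-Kutta method for systems you are not able to simply use a scalar equation and then convert the variables to vectors when you are done. Those wanting to learn more about this should consult the texts by Butcher [2008] and Lambert [1991].

Example 7.13 For the linear system

$$\begin{aligned} u' &= u - 3v + 10, \\ v' &= -u + 2v + 5t, \end{aligned}$$

we have that $\mathbf{y} = (u, v)^T$ and

$$\mathbf{f}(t, \mathbf{y}) = \begin{pmatrix} u - 3v + 10 \\ -u + 2v + 5t \end{pmatrix}.$$

Also, let $u(0) = 2$ and $v(0) = 1$, so $\mathbf{y}_0 = (2, 1)^T$, and take $k = \frac{1}{2}$.

I) Using Euler's method,

$$\mathbf{y}_1 = \mathbf{y}_0 + k\mathbf{f}(0, \mathbf{y}_0) = \begin{pmatrix} 2 \\ 1 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} 9 \\ 0 \end{pmatrix} = \begin{pmatrix} 13/2 \\ 1 \end{pmatrix},$$

$$\mathbf{y}_2 = \mathbf{y}_1 + k\mathbf{f}(k, \mathbf{y}_1) = \begin{pmatrix} 13/2 \\ 1 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} 27/2 \\ -2 \end{pmatrix} = \begin{pmatrix} 53/4 \\ 0 \end{pmatrix}.$$

II) Using RK4,

$$\mathbf{k}_1 = k\mathbf{f}(0, \mathbf{y}_0) = \begin{pmatrix} \frac{9}{2} \\ 0 \end{pmatrix}, \qquad \mathbf{k}_2 = k\mathbf{f}\Big(\frac{1}{2}k,\, \mathbf{y}_0 + \frac{1}{2}\mathbf{k}_1\Big) = \begin{pmatrix} \frac{45}{8} \\ -\frac{1}{2} \end{pmatrix},$$

$$\mathbf{k}_3 = k\mathbf{f}\Big(\frac{1}{2}k,\, \mathbf{y}_0 + \frac{1}{2}\mathbf{k}_2\Big) = \begin{pmatrix} \frac{201}{32} \\ -\frac{33}{32} \end{pmatrix}, \qquad \mathbf{k}_4 = k\mathbf{f}(k,\, \mathbf{y}_0 + \mathbf{k}_3) = \begin{pmatrix} \frac{147}{16} \\ -\frac{187}{64} \end{pmatrix},$$

and so

$$\mathbf{y}_1 = \mathbf{y}_0 + \frac{1}{6}\big(\mathbf{k}_1 + 2\mathbf{k}_2 + 2\mathbf{k}_3 + \mathbf{k}_4\big) = \begin{pmatrix} \frac{33}{4} \\ \frac{1}{384} \end{pmatrix}.$$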
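The hand calculations in Example 7.13 can be checked with exact rational arithmetic. The Python sketch below uses the standard library `fractions` module so that the computed vectors can be compared with the fractions in the example; the helper `axpy` and the tuple representation of vectors are illustrative choices.

```python
from fractions import Fraction as Fr

def f(t, y):
    # Right-hand side of the linear system in Example 7.13; y = (u, v)
    u, v = y
    return (u - 3 * v + 10, -u + 2 * v + 5 * t)

def axpy(y, c, d):
    # Componentwise y + c*d
    return tuple(yi + c * di for yi, di in zip(y, d))

def euler_step(f, t, y, k):
    return axpy(y, k, f(t, y))

def rk4_stages(f, t, y, k):
    # The four vectors k1, k2, k3, k4 of the vector RK4 method (7.81)
    k1 = tuple(k * v for v in f(t, y))
    k2 = tuple(k * v for v in f(t + k / 2, axpy(y, Fr(1, 2), k1)))
    k3 = tuple(k * v for v in f(t + k / 2, axpy(y, Fr(1, 2), k2)))
    k4 = tuple(k * v for v in f(t + k, axpy(y, 1, k3)))
    return k1, k2, k3, k4

def rk4_step(f, t, y, k):
    k1, k2, k3, k4 = rk4_stages(f, t, y, k)
    return tuple(yi + (a + 2 * b + 2 * c + d) / 6
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

if __name__ == "__main__":
    y0, k = (Fr(2), Fr(1)), Fr(1, 2)
    y1 = euler_step(f, Fr(0), y0, k)
    y2 = euler_step(f, k, y1, k)
    print("Euler:", y1, y2)
    print("RK4 stages:", rk4_stages(f, Fr(0), y0, k))
    print("RK4 y1:", rk4_step(f, Fr(0), y0, k))
```

Because `Fraction` arithmetic is exact, any disagreement with the values in the example would point to an algebra slip rather than round-off.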



Example 7.14 For the nonlinear system

$$u' = v - \frac{1}{2}u, \tag{7.82}$$

$$v' = -\frac{1}{2}v + 2u(2 - u^2), \tag{7.83}$$

we have that $\mathbf{y} = (u, v)^T$ and

$$\mathbf{f}(\mathbf{y}) = \begin{pmatrix} v - \frac{1}{2}u \\ -\frac{1}{2}v + 2u(2 - u^2) \end{pmatrix}.$$

It is a simple matter to solve the problem numerically, and the resulting solution is shown in Figure 7.10. In this case the RK4 method, as given in (7.81), was used with $M = 400$ and $T = 10$. Shown is the computed solution for $\mathbf{y}(0) = (-1, 6)^T$, and the computed solution for $\mathbf{y}(0) = (3, 2)^T$. They end up spiraling into one of the two asymptotically stable steady states for this equation. As is always the case, RK4 is conditionally A-stable. For this equation the switch from unstable to stable occurs when changing from $M = 15$ to $M = 16$. A larger value of $M$ is used in Figure 7.10 so the solution curves appear smooth.

Figure 7.10 The solution of (7.82), (7.83) computed using the RK4 method (7.81). (Horizontal axis: u; vertical axis: v.)

The solutions in the last example eventually come to rest at one of the two steady states. The methods we have considered are able to produce an accurate solution in such cases without much difficulty. A more challenging problem is one in which the solution is periodic, and therefore does not come to rest. The next example illustrates some of the challenges that are involved in such cases.

Example 7.15 The equation for the angular deflection of a pendulum is

$$\ell\frac{d^2\theta}{dt^2} = -g\sin(\theta), \tag{7.84}$$

where the initial angle $\theta(0)$ and the initial angular velocity $\theta'(0)$ are assumed to be given. Also, $\ell$ is the length of the pendulum and $g$ is the gravitational acceleration constant. Introducing the angular velocity $v = \theta'$, the equation can be written as the first-order system

$$\theta' = v, \tag{7.85}$$

$$v' = -\alpha\sin(\theta), \tag{7.86}$$

where $\alpha = g/\ell$. In this example, $\mathbf{y} = (\theta, v)^T$ and $\mathbf{f} = (v, -\alpha\sin(\theta))^T$. Also, $\alpha = 1$, $\theta(0) = \pi/4$, and $\theta'(0) = 0$. The exact solution for $\theta(t)$ is a periodic function with amplitude $\pi/4$. The numerical solution obtained using RK4, shown in Figure 7.11, appears to be periodic and has an amplitude that is very close to $\pi/4$. However, over the longer time interval shown in Figure 7.12 the situation is very different. The solid blue region in this figure is coming from the solution curve. The oscillations are so rapid on this time scale that it appears the region is solid blue. It is evident that the amplitude of the RK4 solution is decreasing over time, giving rise to a significant error in the numerical solution. It is possible to improve the accuracy in the RK4 solution by taking a smaller time step, but there is another way to do this, and this is considered next.

7.6.3 Component Approach and Symplectic Methods

A consequence of a vector version of the RK4 method, as given in (7.81), is that every component of $\mathbf{y}$ is being approximated the same way. There are sometimes benefits to not doing this. To explain, consider the problem of solving $my'' = F(y)$. This can be written in component form as

$$y' = v, \tag{7.87}$$

$$v' = \frac{1}{m}F(y), \tag{7.88}$$

where we have introduced the velocity $v = y'$. If the vector version of the trapezoidal method is applied to this system, one obtains

$$\begin{aligned} y_{j+1} &= y_j + \frac{k}{2}(v_j + v_{j+1}), \\ v_{j+1} &= v_j + \frac{k}{2m}\big[F(y_j) + F(y_{j+1})\big]. \end{aligned} \tag{7.89}$$

Figure 7.11 Numerical solution of the pendulum equation (7.84) in Example 7.15, for 0 ≤ t ≤ 40, using the RK4 method (7.81). The dashed red line is the value for the amplitude for the exact solution. (Horizontal axis: t; vertical axis: θ.)

Figure 7.12 Numerical solution of the pendulum equation (7.84) in Example 7.15, for 0 ≤ t ≤ 4000, using the RK4 method (7.81). The step size k is the same as in Figure 7.11. The dashed red line is the amplitude for the exact solution. (Horizontal axis: t; vertical axis: θ.)

The reason the trapezoidal method is considered is that it is about the only method we have derived that is capable of producing a periodic solution to even the linear problem $y'' = -y$. However, as is always the case with the trapezoidal method, the resulting equations are implicit. The question therefore arises as to whether it might be possible to tweak the above difference equations so they are explicit, similar to what we did to discover the RK2 methods. With this in mind, note that one of the culprits for the implicitness is the $v_{j+1}$ term in (7.89). Can we find an approximation for this term that uses information at, say, $t_j$? One possibility is to use the Euler method on (7.88), which gives us $v_{j+1} = v_j + \frac{k}{m}F(y_j)$. Introducing this into (7.89) yields

$$y_{j+1} = y_j + k v_j + \frac{1}{2m}k^2 F(y_j), \tag{7.90}$$

$$v_{j+1} = v_j + \frac{k}{2m}\big[F(y_j) + F(y_{j+1})\big]. \tag{7.91}$$

Figure 7.13 Numerical solution of the pendulum equation (7.84) in Example 7.15, for 0 ≤ t ≤ 4000, using the velocity Verlet method (7.90), (7.91). The step size k is the same as in Figure 7.11. The dashed red line is the value for the amplitude for the exact solution. (Horizontal axis: t; vertical axis: θ.)

Assuming we first use (7.90) to calculate $y_{j+1}$ and then use (7.91) to find $v_{j+1}$, the procedure is explicit. It is known as the velocity Verlet method for solving (7.87) and (7.88), and we will try it using the pendulum example.

Example 7.16 The solution of the pendulum problem in (7.84) over the interval $0 \le t \le 40$, using the velocity Verlet method, is essentially identical to the RK4 solution shown in Figure 7.11. This is not true for $0 \le t \le 4{,}000$, which is shown in Figure 7.13. In particular, for the velocity Verlet solution there is no apparent decrease in amplitude and it stays at, or very near, $\pi/4$. In fact, there is no apparent decrease over the much longer time interval $0 \le t \le 4{,}000{,}000$. In comparison, for RK4 the amplitude is less than 1% of the exact value for $t > 30{,}000$.

The result in Figure 7.13 is surprising. It's possible to prove that the velocity Verlet method has a truncation error that is $O(k^2)$, yet it has produced a better solution than the RK4 method which is $O(k^4)$. Although it is always possible for a $O(k^2)$ method to produce a better answer than a $O(k^4)$ method on a particular problem, the reason here is more profound. What is happening is that velocity Verlet does a better job in approximating the energy in the system over a longer time interval than RK4. The reason for this is that velocity Verlet is an example of what is known as a symplectic method. These are methods that preserve certain geometric properties of the solution in the phase plane, which result in a much better approximation of the total energy. What they have difficulty with is providing an accurate value for the phase. To explain what this means, if you use velocity Verlet to compute the orbital motion of the Earth over 1000 years, taking $k = 1$ month, you find that the Earth stays on the right orbital path but it orbits the sun about 1009 times. In other words, it is going a bit faster than it should be. More about this, and symplectic methods in general, can be found in Holmes [2007], Stuart and Humphries [1998], or Hairer et al. [2003].
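The difference in long-time behavior between RK4 and velocity Verlet can be seen by tracking the energy of the pendulum. The Python sketch below is a simplified experiment: it takes $\alpha = 1$, measures the energy as $E = \frac{1}{2}v^2 + 1 - \cos\theta$, and uses a step size and number of steps that are illustrative assumptions (the step size used for the figures is not stated in the text).

```python
import math

def accel(theta):
    # F(y)/m for the pendulum (7.85)-(7.86) with alpha = g/l = 1
    return -math.sin(theta)

def verlet_step(theta, v, k):
    a0 = accel(theta)
    theta_new = theta + k * v + 0.5 * k * k * a0        # (7.90)
    v_new = v + 0.5 * k * (a0 + accel(theta_new))       # (7.91)
    return theta_new, v_new

def rk4_step(theta, v, k):
    # Classic RK4 (7.81) applied to the system theta' = v, v' = accel(theta)
    a1, b1 = v, accel(theta)
    a2, b2 = v + 0.5 * k * b1, accel(theta + 0.5 * k * a1)
    a3, b3 = v + 0.5 * k * b2, accel(theta + 0.5 * k * a2)
    a4, b4 = v + k * b3, accel(theta + k * a3)
    return (theta + k * (a1 + 2 * a2 + 2 * a3 + a4) / 6.0,
            v + k * (b1 + 2 * b2 + 2 * b3 + b4) / 6.0)

def energy(theta, v):
    # Kinetic plus potential energy, zero at the rest position
    return 0.5 * v * v + (1.0 - math.cos(theta))

def energy_history(step, k, n_steps, theta0=math.pi / 4, v0=0.0):
    theta, v = theta0, v0
    hist = [energy(theta, v)]
    for _ in range(n_steps):
        theta, v = step(theta, v, k)
        hist.append(energy(theta, v))
    return hist

if __name__ == "__main__":
    k, n = 0.4, 20000   # illustrative values, chosen to make the drift visible
    for name, step in [("velocity Verlet", verlet_step), ("RK4", rk4_step)]:
        h = energy_history(step, k, n)
        print(name, "E(0) =", h[0], " E(end) =", h[-1])
```

With these settings the velocity Verlet energy should stay within a few percent of its initial value for the entire run, while the RK4 energy drifts steadily downward, which is the loss of amplitude seen in Figure 7.12.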

7.7 Some Additional Questions and Ideas

(1) Round-off error can be a problem for the finite difference formulas in Table 7.1 (see Figure 7.1). What do you do if you want an accurate numerical derivative for very small step sizes?
Answer: Well, one option is to transform the formula, as discussed in Section 7.2. Another way is to use a complex Taylor series expansion (CTSE), what is sometimes called the complex-step derivative approximation method. The standard example used for this is the centered difference formula

$$y'(t_j) = \frac{y(t_{j+1}) - y(t_{j-1})}{2k} - \frac{1}{6}k^2 y'''(t_j) + \cdots. \tag{7.92}$$

Although this has an error that is $O(k^2)$, as shown in Figure 7.1, it can have a problem when $k$ is small. A way to avoid this is to use an imaginary step size. To explain, note that, using Taylor's theorem,

$$y(t + ik) = y(t) + iky'(t) - \frac{1}{2}k^2 y''(t) - \frac{1}{6}ik^3 y'''(t) + \cdots,$$

where, as usual, $i = \sqrt{-1}$. Taking the imaginary part of this equation, and rearranging, we have that

$$y'(t) = \frac{\mathrm{Im}[y(t + ik)]}{k} + \frac{1}{6}k^2 y'''(t) + \cdots. \tag{7.93}$$

This gives us a $O(k^2)$ approximation for the first derivative that does not have the round-off problems that arise with (7.92), because no subtraction of nearly equal numbers is involved. To compare, the example for Figure 7.1 is redone in Figure 7.14 using (7.18) and the corresponding approximation coming from (7.93). The latter is

$$y'(t) \approx \frac{1}{k}\,\mathrm{Im}\big[y(t + ik)\big]. \tag{7.94}$$

Figure 7.14 Error when using a numerical approximation of $y'(1)$, when $y(t) = \sqrt{t}$, using the centered difference (7.18), and the CTSE approximation (7.94). Note that the CTSE error is zero for $k < 10^{-7}$. (Horizontal axis: step size k; vertical axis: error.)

The approximation in (7.94) is actually better than indicated in Figure 7.14, because for values of k smaller than 10−7 , the error is zero (using double precision). This very interesting idea does have limitations, such as how the function is defined when using a complex argument, and more discussion of this can be found in Martins et al. [2003], Lai and Crassidis [2008], and Abreu et al. [2018]. (2) The problem with numerical differentiation seen in Figure 7.1 did not seem to be an issue when solving IVPs, why not? Answer: The formulas for the derivatives were rewritten when deriving the finite difference equations, and this avoids the difficulties in Figure 7.1. However, if after computing the solution y(t), you then use that solution to compute y  (t) using a numerical differentiation formula, then the difficulties reappear. (3) It’s nice that RK4 can find an accurate solution using just a few points scattered across the time interval, but I want a smooth curve. What do I do? Answer: As done in Figure 7.10, the easy solution is to just use more points. However, a better solution is to use interpolation. It’s possible in this case to use a clamped cubic spline because we know, or have a computed value for, y  at the endpoints. In particular, y  (0) = f (0, α) and y  (T ) = f (T, yM ). To illustrate how well this works, the solution of the


7 Initial Value Problems

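Table 7.7 Order conditions for a general RK4 method.

$$\begin{aligned}
c_1 + c_2 + c_3 + c_4 &= 1 \\
c_2\alpha_2 + c_3\alpha_3 + c_4 &= 1/2 \\
c_2\alpha_2^2 + c_3\alpha_3^2 + c_4 &= 1/3 \\
c_2\alpha_2^3 + c_3\alpha_3^3 + c_4 &= 1/4 \\
c_3\alpha_2\beta_{32} + c_4(\alpha_2\beta_{42} + \alpha_3\beta_{43}) &= 1/6 \\
c_3\alpha_2\alpha_3\beta_{32} + c_4(\alpha_2\beta_{42} + \alpha_3\beta_{43}) &= 1/8 \\
c_3\alpha_2^2\beta_{32} + c_4(\alpha_2^2\beta_{42} + \alpha_3^2\beta_{43}) &= 1/12 \\
c_4\alpha_2\beta_{32}\beta_{43} &= 1/24 \\
\beta_{21} &= \alpha_2 \\
\beta_{31} + \beta_{32} &= \alpha_3 \\
\beta_{41} + \beta_{42} + \beta_{43} &= 1
\end{aligned}$$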

logistic equation (7.32), with $y(0) = 0.1$, is shown in Figure 7.15. With this it is possible to obtain an accurate value for the solution anywhere in the time interval. An example of how this can be used in data fitting is explained in Section refsec:dmp.

(4) It is often stated that a physical system must satisfy causality, which, roughly speaking, means that the future does not affect the past. Doesn’t a forward difference approximation violate causality?

Answer: It does. In fact, all but two of the difference formulas in Table 7.1 violate causality. What is interesting is that the one method that is based on a causal approximation of the derivative, which is backward Euler, is unconditionally stable. The non-causal approximations, in comparison, produce only conditionally stable methods. It is unclear what connections there are between these two properties, but apparently there might be. For those who might be interested in computation and its connections with some of the basic laws of physics, the collection of articles in Zenil [2012] might be consulted.

[Figure 7.15: plot of Solution versus t-axis for $0 \le t \le 1$, with curves labeled RK4, Spline, and Exact.] Figure 7.15 Solution of the logistic equation (7.32) using RK4. Also shown is the exact solution and the clamped cubic spline fitted to the RK4 values (these two curves are almost indistinguishable from each other).
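The clamped-spline idea from question (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logistic equation is taken to be $y' = 10y(1-y)$ (the form and the rate 10 are assumed here, since (7.32) is not reproduced in this section), 20 equal time steps are used, and all function names are hypothetical. The end slopes are the values $f(0, \alpha)$ and $f(T, y_M)$ mentioned in the answer.

```python
import math

def f(t, y):
    # Assumed logistic right-hand side; the rate 10 is an illustrative value.
    return 10.0 * y * (1.0 - y)

def rk4(f, alpha, T, M):
    # Classic RK4 with M equal steps on [0, T], starting from y(0) = alpha.
    k = T / M
    t = [j * k for j in range(M + 1)]
    y = [alpha]
    for j in range(M):
        k1 = k * f(t[j], y[j])
        k2 = k * f(t[j] + k / 2, y[j] + k1 / 2)
        k3 = k * f(t[j] + k / 2, y[j] + k2 / 2)
        k4 = k * f(t[j] + k, y[j] + k3)
        y.append(y[j] + (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return t, y

def clamped_spline_slopes(y, h, s0, sM):
    # Node slopes of the clamped cubic spline on equally spaced nodes (needs M >= 2).
    # Interior slopes satisfy s[j-1] + 4 s[j] + s[j+1] = 3 (y[j+1] - y[j-1]) / h.
    M = len(y) - 1
    rhs = [3.0 * (y[j + 1] - y[j - 1]) / h for j in range(1, M)]
    rhs[0] -= s0
    rhs[-1] -= sM
    diag = [4.0] * (M - 1)
    for i in range(1, M - 1):          # forward elimination (Thomas algorithm)
        w = 1.0 / diag[i - 1]
        diag[i] -= w
        rhs[i] -= w * rhs[i - 1]
    s = [0.0] * (M - 1)
    s[-1] = rhs[-1] / diag[-1]
    for i in range(M - 3, -1, -1):     # back substitution
        s[i] = (rhs[i] - s[i + 1]) / diag[i]
    return [s0] + s + [sM]

def spline_eval(t, y, s, x):
    # Evaluate the spline at x using the cubic Hermite form on the interval containing x.
    h = t[1] - t[0]
    j = min(max(int((x - t[0]) / h), 0), len(t) - 2)
    u = (x - t[j]) / h
    return ((1 + 2 * u) * (1 - u) ** 2 * y[j] + u * (1 - u) ** 2 * h * s[j]
            + u * u * (3 - 2 * u) * y[j + 1] + u * u * (u - 1) * h * s[j + 1])

def exact(x):
    # Exact solution of the assumed problem y' = 10 y (1 - y), y(0) = 0.1.
    return 1.0 / (1.0 + 9.0 * math.exp(-10.0 * x))

T, M, alpha = 1.0, 20, 0.1
t, y = rk4(f, alpha, T, M)
s = clamped_spline_slopes(y, T / M, f(0.0, alpha), f(T, y[M]))
worst = max(abs(spline_eval(t, y, s, i / 400) - exact(i / 400)) for i in range(401))
print(f"max |spline - exact| on [0, 1]: {worst:.2e}")
```

The spline is stored through its node slopes, so evaluating it at any point in the interval only requires the two neighboring RK4 values and slopes.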


7.7.1 RK4 Order Conditions

Assuming that $0 \le \alpha_2 \le \alpha_3 \le 1$ in (7.68)-(7.71), the resulting RK4 order conditions are listed in Table 7.7. There are 11 equations and 12 unknowns, and the solutions are given in Ralston [1962]. Just so it’s clear, every solution results in an IVP solver with an error that is $O(k^4)$.
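As a quick sanity check, one can confirm with exact rational arithmetic that the coefficients of the classic RK4 method satisfy the eleven order conditions of Table 7.7. The sketch below assumes the usual notation, with weights $c_i$, nodes $\alpha_i$, and stage couplings $\beta_{ij}$, and the usual classic values $c = (1/6, 1/3, 1/3, 1/6)$, $\alpha_2 = \alpha_3 = 1/2$, $\beta_{21} = \beta_{32} = 1/2$, $\beta_{43} = 1$, with the remaining $\beta_{ij}$ zero. The variable names are illustrative.

```python
from fractions import Fraction as Fr

# Classic RK4 coefficients (assumed notation: c_i weights, a_i nodes, b_ij couplings).
c1, c2, c3, c4 = Fr(1, 6), Fr(1, 3), Fr(1, 3), Fr(1, 6)
a2, a3 = Fr(1, 2), Fr(1, 2)
b21, b31, b32 = Fr(1, 2), Fr(0), Fr(1, 2)
b41, b42, b43 = Fr(0), Fr(0), Fr(1)

# Each entry is (left-hand side, required value) for one order condition.
conditions = {
    "c1 + c2 + c3 + c4": (c1 + c2 + c3 + c4, Fr(1)),
    "c2 a2 + c3 a3 + c4": (c2 * a2 + c3 * a3 + c4, Fr(1, 2)),
    "c2 a2^2 + c3 a3^2 + c4": (c2 * a2**2 + c3 * a3**2 + c4, Fr(1, 3)),
    "c2 a2^3 + c3 a3^3 + c4": (c2 * a2**3 + c3 * a3**3 + c4, Fr(1, 4)),
    "c3 a2 b32 + c4 (a2 b42 + a3 b43)":
        (c3 * a2 * b32 + c4 * (a2 * b42 + a3 * b43), Fr(1, 6)),
    "c3 a2 a3 b32 + c4 (a2 b42 + a3 b43)":
        (c3 * a2 * a3 * b32 + c4 * (a2 * b42 + a3 * b43), Fr(1, 8)),
    "c3 a2^2 b32 + c4 (a2^2 b42 + a3^2 b43)":
        (c3 * a2**2 * b32 + c4 * (a2**2 * b42 + a3**2 * b43), Fr(1, 12)),
    "c4 a2 b32 b43": (c4 * a2 * b32 * b43, Fr(1, 24)),
    "b21 - a2": (b21 - a2, Fr(0)),
    "b31 + b32 - a3": (b31 + b32 - a3, Fr(0)),
    "b41 + b42 + b43": (b41 + b42 + b43, Fr(1)),
}

for name, (lhs, rhs) in conditions.items():
    print(f"{name:42s} = {lhs}   (required {rhs})")
all_ok = all(lhs == rhs for lhs, rhs in conditions.values())
print("all order conditions satisfied:", all_ok)
```

Using `Fraction` avoids any round-off question, so equality can be tested exactly.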

Exercises

Section 7.2

7.1. Derive an approximation of the derivative that uses the given points. Also, determine the order of the truncation error.
(a) For y  (tj ) that uses y(tj ), y(tj−2 ).
(b) For y  (tj ) that uses y(tj ), y(tj+1 ), and y(tj−2 ).
(c) For y  (tj ) that uses y(tj ), y(tj−1 ), and y(tj−3 ).
(d) For y  (tj ) that uses y(tj ), y(tj−1 ), and y(tj−2 ).

7.2. In the following, a claim is made about an approximation for the first derivative. Explain why the claim is wrong.
(a) A $O(k)$ approximation of $y'(t_j)$ is
$$y'(t_j) \approx \frac{2y(t_{j+1}) - y(t_j)}{k}.$$
(b) A $O(k)$ approximation of $y'(t_j)$ is
$$y'(t_j) \approx \frac{y(t_{j+1}) - 3y(t_j) + 2y(t_{j-1})}{k}.$$

7.3. You are to find a O(k^2) approximation of the following:
(a) An approximation of y  (tj ) that uses y  (tj+1 ) and y  (tj−1 ).
(b) An approximation of y  (tj ) that uses y(tj ), y(tj−1 ), and y  (tj−1 ).
(c) An approximation of y(tj ) that uses y(tj−1 ) and y(tj+1 ).
(d) An approximation of 2y  (tj ) − 3y(tj ) that uses y(tj−2 ), y(tj−1 ), and y(tj+1 ).
(e) An approximation of y  (tj ) + 5y(tj ) that uses y(tj−1 ), y(tj+1 ), and y(tj+2 ).

7.4. For a linearly elastic material, the stress $T(x)$ is given as

$$T = E\,\frac{du}{dx},$$

where $u(x)$ is the displacement of the material and $E$ is a positive constant known as Young’s modulus. You should assume that the material is steel, so $E = 200$. Using second-order approximations for $u'(x)$, and the data in Table 7.8, find $T$ at $x = 0, 1/3, 2/3, 1$.

Table 7.8 Values for Exercise 7.4.

$$\begin{array}{c|cccc} x & 0 & 1/3 & 2/3 & 1 \\ \hline u & 1 & 0 & -1 & 1 \end{array}$$

7.5. For a power law fluid, the shear stress $S(x)$ is given as

$$S = \alpha\left(\frac{dv}{dx}\right)^{\gamma},$$

where $v(x)$ is the shear velocity and $\alpha$ and $\gamma$ are positive constants. You can assume the fluid is ketchup at room temperature, so $\gamma = 1/4$ and $\alpha = 20$. Using second-order approximations for $v'(x)$, and the data in Table 7.9, find $S$ at $x = 0, 1/3, 2/3, 1$.

7.6. This exercise concerns unequal spacing of the time points. So, assume the spacing between $t_{j-1}$ and $t_j$ is $k_{j-1}$. Correspondingly, the spacing between $t_j$ and $t_{j+1}$ is $k_j$, and similarly for the other time points.
(a) Show that the approximation of $y'(t_j)$ that uses $y(t_{j-1})$, $y(t_j)$, and $y(t_{j+1})$ is
$$y'(t_j) \approx \frac{y(t_{j+1}) + (r^2 - 1)\,y(t_j) - r^2\,y(t_{j-1})}{(1 + r)\,k_j},$$
where $r = k_j/k_{j-1}$. Also, explain why this is a second-order approximation.
(b) Show that the approximation of $y'(t_j)$ that uses $y(t_j)$, $y(t_{j+1})$, and $y(t_{j+2})$ is
$$y'(t_j) \approx \frac{-y(t_{j+2}) + (1 + r)^2\,y(t_{j+1}) - r(2 + r)\,y(t_j)}{r(1 + r)\,k_j},$$
where $r = k_{j+1}/k_j$. Also, explain why this is a second-order approximation.
(c) Derive a second-order approximation of $y'(t_j)$ that uses $y(t_j)$, $y(t_{j-1})$, and $y(t_{j-2})$.
(d) Redo Exercise 7.4, but use the data in Table 7.10.

Table 7.9 Values for Exercise 7.5.

$$\begin{array}{c|cccc} x & 0 & 1/3 & 2/3 & 1 \\ \hline v & 0 & 2 & 4 & 8 \end{array}$$

Table 7.10 Values for Exercise 7.6.

$$\begin{array}{c|cccc} x & 0 & 1/4 & 3/4 & 1 \\ \hline u & 1 & 0 & -1 & 1 \end{array}$$

7.7. (a) Based on the number of undetermined coefficients in (7.11), explain why you would expect to equate the coefficients of $y$, $y'$, and $y''$. In fact, explain why the expected truncation error is $O(k^2)$.
(b) Based on the observation made in part (a), how many time points would you expect are needed to obtain a $O(k^5)$ approximation of $y(t_j)$?

Sections 7.3 and 7.4

7.8. The solution of
$$y' + \frac{1}{1+t}\,y = 0, \quad \text{for } t > 0,$$
where $y(0) = 1$, is $y(t) = 1/(1 + t)$.
(a) Verify that $y(t)$ satisfies the given IVP.
(b) What is $f(t, y)$ for this problem?
(c) Taking $k = 1/2$, determine $y(t_1)$ and $y(t_2)$.
(d) Using Euler’s method with $k = 1/2$, determine $y_1$ and $y_2$. Also, what is the error $|y(t_i) - y_i|$ in each case?
(e) Do part (d) but use backward Euler.

7.9. The solution of
$$y' - y = t^2 - 1, \quad \text{for } t > 0,$$
where $y(0) = 1$, is $y(t) = (t + 1)^2$.
(a) Verify that $y(t)$ satisfies the given IVP.
(b) What is $f(t, y)$ for this problem?
(c) Taking $k = 1/3$, determine $y(t_1)$ and $y(t_2)$.
(d) Using Euler’s method with $k = 1/3$, determine $y_1$ and $y_2$. Also, what is the error $|y(t_i) - y_i|$ in each case?
(e) Do part (d) but use backward Euler.

7.10. Select one of the following IVPs, and then answer the questions that follow.
(I) $y' = \dfrac{t}{1+y}$, $y(0) = 0$
(II) $y' = t + e^{-y}$, $y(0) = 0$
(III) $y' + y^3 = \dfrac{y}{1+t}$, $y(0) = -1$

(a) If the Euler method is used, what is the resulting finite difference problem? With this, and taking $k = 1$, determine $y_1$ and $y_2$.


(b) If the backward Euler method is used, what is the resulting finite difference problem? With this, and taking $k = 1$, determine the equations that must be solved to determine $y_1$ and $y_2$.
(c) If the trapezoidal method is used, what is the resulting finite difference problem? With this, and taking $k = 1$, determine the equations that must be solved to determine $y_1$ and $y_2$.

7.11. In this problem the second one-sided difference approximation listed in Table 7.1 is used to solve (7.21), (7.22).
(a) What is the resulting finite difference equation coming from the differential equation?
(b) Is the equation in part (a) explicit or implicit?
(c) Explain why the equation in part (a) cannot be used to determine $y_1$.
(d) Is the method A-stable? To answer this you should use the same approach as used for the leapfrog method.

7.12. (a) Derive an approximation of $y'(t_j)$ that uses $y(t_{j+1})$, $y(t_j)$, and $y(t_{j-2})$.
(b) Using the approximation from part (a), what is the resulting finite difference equation for solving $y' = f(t, y)$? What is the order of the truncation error?
(c) Is the equation in part (b) explicit or implicit?
(d) Is the method A-stable? To answer this you should use the same approach as used for the leapfrog method.

7.13. This is a continuation of Exercise 7.7. You do not need to derive the methods described, but you must provide a reason why your answer is expected to work.
(a) What time points could be used to obtain an explicit method with a truncation error of $O(k^5)$?
(b) What time points could be used to obtain an implicit method with a truncation error of $O(k^5)$?

7.14. To solve $y' = f(t, y)$ one can use one of the following finite difference equations. Determine if the method is A-stable, conditionally A-stable or unstable. If it is conditionally stable, make sure to state what the condition is.
(a) $y_{j+1} = y_j + \dfrac{k}{3}\big[f(t_j, y_j) + 2f(t_{j+1}, y_{j+1})\big]$,
(b) $y_{j+1} = y_j + \dfrac{k}{5}\big[4f(t_j, y_j) + f(t_{j+1}, y_{j+1})\big]$,
(c) $y_{j+1} = y_j + k\big[2f(t_j, y_j) - f(t_{j+1}, y_{j+1})\big]$.

7.15. The IVP for a mass-spring-dashpot system is
$$m\,\frac{d^2y}{dt^2} + c\,\frac{dy}{dt} + qy = F(t), \quad \text{for } t > 0,$$


where $y(0) = a$ and $y'(0) = b$. In this problem $m$, $c$, $q$, $a$, and $b$ are given numbers and $F(t)$ is a given forcing function.
(a) Using finite difference approximations for $y'$ and $y''$, derive a finite difference approximation for the differential equation that has truncation error that is $O(k^2)$.
(b) Convert the initial conditions so they apply to the finite difference approximation. Your approximation of $y'(0) = b$ must have a truncation error that is $O(k^2)$.
(c) Assuming $m = c = q = a = b = k = 1$ and $F(t) = t$, find $y_1$ and $y_2$.

7.16. Each line in Figure 7.16 represents the iterative error as a function of $M$ using the algorithm described in Table 7.4.
(a) Which line, if any, should correspond to the Euler method?
(b) Which line, if any, should correspond to the trapezoidal method?
(c) What would the truncation error for the method have to be to obtain the red dashed line?
(d) What would the truncation error for the method have to be to obtain the blue dashed line?

[Figure 7.16: log-log plot of Iterative Error versus Number of Time Points.] Figure 7.16 Graph used in Exercise 7.16.

7.17. To solve the differential equation $y' = f(t, y)$ one can use the finite difference equation
$$y_{j+1} = y_j + k\big[\theta f_j + (1 - \theta) f_{j+1}\big],$$
where $\theta$ is a number that satisfies $0 \le \theta \le 1$ (one first picks $\theta$ from this interval and then uses the above formula to calculate the solution).
(a) For what value(s) of $\theta$ is the method explicit, and for which value(s) of $\theta$ is it implicit?
(b) For what value(s) of $\theta$ is the method A-stable?


7.18. In deriving the trapezoidal method for solving the differential equation $y' = f(t, y)$ we integrated the equation over the interval $t_j \le t \le t_{j+1}$. In this problem you are to integrate the equation over the interval $t_{j-1} \le t \le t_{j+1}$.
(a) What numerical method is obtained if Simpson’s rule is used on the resulting integral? What is the truncation error for this rule?
(b) What numerical method is obtained if the midpoint rule is used on the resulting integral? What is the truncation error for this rule?

7.19. This is a continuation of Exercise 7.7.
(a) Suppose you want an explicit method with a truncation error that is $O(k^4)$. Give an example of the time points that you expect can be used in an approximation of $y'(t_j)$ that would accomplish this.
(b) Suppose you want an implicit method with a truncation error that is $O(k^4)$. Give an example of the time points that you expect can be used in an approximation of $y'(t_j)$ that would accomplish this.
(c) In the answer to part (b) you could include $t_{j+1}$ and then have the other time points satisfy $t \le t_j$, or include $t_{j+1}$ and $t_{j+2}$ and then have the other time points satisfy $t \le t_j$. Which is the better choice?

7.20. This problem shows how the order of the truncation error of a finite difference equation can be determined using the idea of precision introduced in Section 6.5. The key observation comes from Table 7.1. If the truncation error is $O(k^m)$, then $\tau_j = 0$ for $y = 1, t, t^2, \cdots, t^m$ but it is not zero for $y = t^{m+1}$. Assuming this can be reversed, if a finite difference equation is satisfied exactly when $y = t^n$ and $f = nt^{n-1}$, for $n = 0, 1, 2, \cdots, m$ but not for $n = m + 1$, then the order of the truncation error is $O(k^m)$.
(a) Show that the Euler method (7.30) is exact when: (i) $n = 0$ and (ii) $n = 1$ but it is not exact when (iii) $n = 2$. So, the truncation error is $O(k)$.
(b) Apply the steps in part (a) to the backward Euler method (7.42).
(c) Use the method to determine the order of the truncation error for the trapezoidal method (7.54).
(d) For what value(s) of $\theta$ does the method in Exercise 7.17 have a $O(k^2)$ truncation error?

Section 7.5

7.21. Select one of the IVPs in Exercise 7.10, and then answer the following questions:
(a) If Heun’s method is used, what is the resulting finite difference problem? With this, and taking $k = 1$, determine $y_1$ and $y_2$.
(b) If the classic RK4 method is used, what is the resulting finite difference problem? With this, and taking $k = 1$, determine $y_1$ and $y_2$.


7.22. This problem explores a RK1 method. Assume the method has the form $y_{j+1} = y_j + akf(t_j, y_j)$. Find the constant $a$ using the same procedure used to derive the RK2 method. What is the truncation error?

7.23. This problem explores some of the choices for a RK2 method.
(a) What does (7.58) reduce to for the differential equation $y' = f(t)$?
(b) Integrating the differential equation in part (a) you get
$$y(t_{j+1}) = y(t_j) + \int_{t_j}^{t_{j+1}} f(t)\,dt.$$
By combining this with the result from part (a), explain how you end up with the integration rule
$$\int_{t_j}^{t_{j+1}} f(t)\,dt \approx akf(t_j) + bkf(t_j + \alpha k).$$
The constants $a$, $b$, and $\alpha$ are assumed to satisfy the order conditions in (7.62).
(c) What RK2 method is obtained if the integration rule in part (b) is the trapezoidal rule?
(d) What RK2 method is obtained if the integration rule in part (b) is the midpoint rule?
(e) What RK2 method is obtained if the integration rule maximizes the precision? This is the same question asked in Exercise 6.29. Note that the error in this integration rule is $O(k^4)$, while for parts (c) and (d) it is $O(k^3)$.

7.24. This problem explores the stability condition for RK2.
(a) What does the RK2 method in (7.58) reduce to for the equation $y' = -ry$? Assume the constants satisfy the order conditions (7.62).
(b) Show that the amplification factor for RK2 can be written as $\kappa = 1 - z + \frac{1}{2}z^2$, where $z = rk$.
(c) Using the result from part (b), show that the RK2 method is conditionally A-stable and that the stability condition is $rk < 2$.

7.25. This problem explores the stability condition for RK4.
(a) What does the RK4 method in (7.73) reduce to for the equation $y' = -ry$? You should find that the amplification factor $\kappa$ is a fourth-order polynomial in $z = rk$.
(b) Plot $\kappa$ as a function of $z$, for $0 \le z \le 4$ using MATLAB or another similar program.
(c) Using the plot from part (b), explain why the method is conditionally A-stable. Also, using the plot, estimate the value of $\kappa_M$, where the RK4 method is A-stable for $rk < \kappa_M$ and is unstable if $rk > \kappa_M$.


7.26. Suppose you want to compute the solution of $y' = e^{-y} + t^5$ for $0 \le t \le 1$ using 100 time steps (so $k = 0.01$). The methods to be tried are (i) the Euler method, (ii) the backward Euler method, (iii) the trapezoidal method, (iv) Heun, and (v) the RK4 method.
(a) Which one would you expect to complete the calculation the fastest? Why?
(b) Which one would you expect to be the most accurate? Why?
(c) If stability is a concern which method would be best? Why?

Computational Questions for Sections 7.3–7.5

7.27. This problem concerns the IVP, $y' + p(t)y = \sin(2\pi t)$, where $y(0) = 0$ and $p(t) = \sin(t)/(1 + t)$. You are to answer the questions to follow by either using an IVP solver, or using the formula for the exact solution which is
$$y(t) = e^{-\mu(t)} \int_0^t e^{\mu(r)} \sin(2\pi r)\,dr,$$
where $\mu(t) = \int_0^t p(s)\,ds$. Whichever you use, you must provide an explanation why you made that choice. You must also provide an explanation of how you computed the solution, including the requirement(s) you used to guarantee the accuracy of the solution.
(a) Plot $y(t)$ for $0 \le t \le 4$.
(b) What is $y(4)$? The value must be correct to at least six significant digits.

7.28. This is a continuation of Exercises 7.10 and 7.21. The problems below refer to the IVP you selected for those earlier exercises.
(a) Using the trapezoidal method, and the algorithm in Table 7.4, compute the solution. You should take $T = 10$ and $tol = 10^{-8}$. You should plot the iterative error as shown in Figure 7.5, and report the value of $M$ that achieved the stated error condition. Also plot the solution for $0 \le t \le T$.
(b) Redo part (a) but use the RK4 method.
(c) Redo part (a) but use Euler’s method.
(d) Redo part (a) but use Heun’s method.
(e) Compare the methods you used. This includes ease of use, speed of calculation, accuracy of results, and apparent stability.

7.29. This problem concerns the Michaelis-Menten equation
$$\frac{dS}{dt} = -\frac{vS}{K + S},$$
where $v$ and $K$ are positive constants. The initial condition is $S(0) = S_0$. This problem was considered in Exercise 2.17, but in this exercise you are to solve this problem using an IVP solver. To do this, take $v = 2$ mM/min, $K = 20$ mM, and $S_0 = 100$ mM.


(a) Using Euler’s method, and the algorithm in Table 7.4, compute the solution. You should take $T = 200$ min and $tol = 10^{-6}$. You should plot the iterative error as shown in Figure 7.5, and report the value of $M$ that achieved the stated error condition.
(b) Redo (a) but use the trapezoidal method.
(c) Redo (b) but use the classic RK4 method. Also, plot the computed solution for $S(t)$.
(d) Compare the methods. This includes ease of use, speed of calculation, accuracy of results, and apparent stability.

7.30. This problem concerns the IVP involving the Bernoulli equation
$$y' + y^3 = \frac{y}{a + t}, \quad \text{for } t > 0,$$
where $y(0) = 1$ and $a = 0.01$. You should take $T = 3$ and $tol = 10^{-6}$.
(a) Using the trapezoidal method, and the algorithm in Table 7.4, compute the solution. You should plot the iterative error as shown in Figure 7.5, report the value of $M$ that achieved the stated error condition, and also plot the computed solution for $y(t)$.
(b) Redo (a) but use the classic RK4 method.
(c) Compare the two methods. This includes ease of use, speed of calculation, accuracy of results, and apparent stability.

7.31. This problem concerns the IVP involving the Bernoulli equation
$$y' + y^3 = \frac{y}{a + t}, \quad \text{for } t > 0,$$
where $y(0) = 1$ and $a = 0.01$. You are to solve this problem using the trapezoidal and RK4 methods.
(a) Verify that the exact solution is
$$y = \frac{a + t}{\sqrt{\beta + \frac{2}{3}(a + t)^3}}.$$
Also, determine the value of $\beta$ from the initial condition.
(b) On the same axes, plot the exact and the two numerical solutions for $0 \le t \le 3$ in the case when $M = 80$.
(c) Redo (b) for $M = 40$ and for $M = 160$ (there should be one graph for each $M$).
(d) Compute the error $||y - y_c||_\infty$ for $M = 40, 80, 160, 320, 640$. Report the values as in Table 7.3, but have 4 columns: $M$, $k$, and the errors for trapezoidal and RK4 methods.
(e) Compare the two methods based on your results from parts (b)–(d). This includes ease of use, speed of calculation, accuracy of results, and apparent stability.


7.32. In the Michaelis-Menten reaction a substrate $S$ is converted into a product $P$. It is known that the concentration of $S(t)$ satisfies
$$\frac{dS}{dt} = -\frac{vS}{K + S},$$
where $v$ and $K$ are positive constants. The initial condition is $S(0) = S_0$. The question considered here is, how long do you need to wait before most of the $S$’s have been converted into $P$’s? The requirement for “most” will be that the concentration is 1% of the original, and so we are looking for $t = \tau$ so that $S(\tau) = 0.01S_0$. In this problem, assume $v = 2$ mM/min, $K = 20$ mM, and $S_0 = 100$ mM.
(a) From the differential equation, explain why $S(t)$ is a strictly monotonically decreasing function of $t$.
(b) Taking $k = 1$, use the RK4 difference formula (7.73) to calculate $y_1, y_2, y_3, \cdots$ until $y_j < 0.01y_0$. Using $y_{j-1}$ and $y_j$, and linear interpolation, determine when $S(t) = 0.01S_0$ and denote this $t$ value as $\tau_1$.
(c) Redo part (b) but use $k = k/2$, and denote the resulting $t$ value as $\tau_2$.
(d) Keep repeating step (c) to obtain a sequence of values $\tau_1, \tau_2, \tau_3, \tau_4, \cdots$. Stop when the relative iterative error in the $\tau$ values is less than $10^{-4}$. What is the resulting approximation for $\tau$, and what step size was required to compute it?

Section 7.6

7.33. The IVP for a mass-spring-dashpot system is
$$m\,\frac{d^2y}{dt^2} + c\,\frac{dy}{dt} + qy = F(t), \quad \text{for } t > 0,$$
where $y(0) = a$ and $y'(0) = b$. In this problem $m$, $c$, $q$, $a$, $b$ are given numbers and $F(t)$ is a given forcing function.
(a) Setting $v = y'$, find a first-order system for $y$ and $v$.
(b) Write down a numerical method for solving the problem in part (a) that has a truncation error that is $O(k^2)$.
(c) Assuming $m = c = q = a = b = k = 1$ and $F(t) = t$, find $y_1$ and $y_2$.

7.34. This problem considers the linear IVP, $\mathbf{y}' = A\mathbf{y} + \mathbf{g}(t)$, for $\mathbf{y}(0) = \mathbf{b}$. In this problem, $A$ is a $n \times n$ constant matrix.
(a) What is Euler’s method for this problem?
(b) What is the trapezoidal method for this problem?
(c) What is Heun’s method for this problem?


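(d) Taking
$$A = \begin{pmatrix} -2 & 0 \\ 1 & 1 \end{pmatrix}, \quad \mathbf{g} = \begin{pmatrix} -1 \\ 1 - t \end{pmatrix}, \quad \text{and} \quad \mathbf{b} = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$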

use Euler’s method to determine $\mathbf{y}_1$ and $\mathbf{y}_2$.
(e) Using the values given in part (d), use the trapezoidal method to determine $\mathbf{y}_1$.
(f) Using the values given in part (d), use Heun’s method to determine $\mathbf{y}_1$.

7.35. The equation for the height $y(t)$ of a projectile is
$$\frac{d^2y}{dt^2} = -\frac{gR^2}{(R + y)^2}, \quad \text{for } 0 < t,$$
where $g = 9.8$ m/s$^2$ is the gravitational acceleration constant and $R = 6.4 \times 10^6$ m is the radius of the Earth. Also, $y(t)$ is the vertical distance from the surface of the Earth.
(a) Write this as a first-order system using the height $y$ and the velocity $v = y'$.
(b) If $y(0) = 0$ and $v(0) = 343$ m/s (i.e., the speed of sound), find the time that it takes until the object hits the ground. Do this as follows:
(i) Pick an IVP solver that has a truncation error of $O(k^2)$ or $O(k^4)$.
(ii) Taking $k = 1$, use the solver to determine $y_1, y_2, y_3, \cdots$ until you get $y_j < 0$.
(iii) Using $y_{j-1}$ and $y_j$ from step (ii), and linear interpolation, determine when $y(t) = 0$ and denote this $t$ value as $\tau_1$.
(iv) Redo steps (ii) and (iii) but now use $k = k/2$, and denote the resulting $t$ value as $\tau_2$.
(v) Keep repeating step (iv) to obtain a sequence of values $\tau_1, \tau_2, \tau_3, \tau_4, \cdots$. Stop when the relative iterative error in the $\tau$ values is less than $10^{-6}$. What is the resulting approximation for $\tau$, and what step size was required to compute it?
(c) If $y(0) = 0$, then what does $v(0)$ have to be so the object stays airborne for 10 minutes? In other words, if $T = 10$ min, then what does $v(0)$ have to be so that $y(T) = 0$? The value of $v(0)$ should be correct to at least eight significant digits. In addition, you should plot $y$ as a function of $t$, for $0 \le t \le T$. Make sure to state what method you use to solve the IVP, why you picked the method, and how you used the solver to answer the question.

7.36. Using polar coordinates in the orbital plane, the position of an object orbiting the sun is $(r(t), \theta(t))$. From Newton’s laws, to find the position it is necessary to solve
$$r'' = \frac{\alpha^2}{r^3} - \frac{\mu}{r^2},$$


Figure 7.17 Three oscillators that are coupled by springs, as an example of the problem considered in Exercise 7.37.

where $\alpha = \theta'(0)r(0)^2$ is constant, and $\mu = 39.4$ au$^3$/yr$^2$ is the gravitational parameter of the sun. The approximate values for Mars are $r(0) = 1.45$ au, $r'(0) = 0$, and $\alpha = 7.737$ au$^2$/yr. Note au is the astronomical unit and yr is an Earth year.
(a) Write the above differential equation as a first-order system for the radial position $r$ and radial velocity $v = r'$.
(b) A Martian year corresponds to (approximately) 1.88 yr. Using RK4, find and then plot $r(t)$ for $0 \le t \le 1.88$. The iterative error in the computed solution should be less than $10^{-8}$. What value of $k$ was used to produce the plot?
(c) Do part (b) but use the velocity Verlet method.
(d) The solution $r(t)$ is periodic with a period of approximately 1.88. Taking $k = 1/12$ yr, compute the solution, and then plot the solution curve, for the first 10000 Martian centuries. Does the computed solution appear to be periodic? What does the computed solution appear to approach as $t \to \infty$?
(e) Do part (d) but use the velocity Verlet method.
(f) The angular coordinate of the object is determined using the following formula:
$$\theta(t) = \theta(0) + \alpha \int_0^t \frac{ds}{r^2(s)}.$$
Assuming that $\theta(0) = \pi/2$, use your results from part (a) to find and then plot $\theta(t)$ for $0 \le t \le 1.88$.
(g) Compute the length of a Martian year that is correct to six significant digits.

7.37. This exercise involves finding the solution of the problem for the coupled oscillators shown in Figure 7.17. The equation of motion in this case is $\mathbf{y}'' + K\mathbf{y} = \mathbf{0}$, where

$$K = \begin{pmatrix} 1 + k_{12} + k_{13} & -k_{12} & -k_{13} \\ -k_{12} & 1 + k_{12} + k_{23} & -k_{23} \\ -k_{13} & -k_{23} & 1 + k_{13} + k_{23} \end{pmatrix}.$$

In the above matrix, $k_{mn}$ is the spring constant for the spring connecting the $m$th and $n$th oscillator, and it is positive (these are the three smaller springs shown in Figure 7.17). The initial conditions are $\mathbf{y}(0) = (-1, 0, 2)^T$ and $\mathbf{y}'(0) = (0, 0, 0)^T$.
(a) Assuming $\mathbf{y}(t) = (y_1(t), y_2(t), y_3(t))^T$, introduce the velocity $v_j(t) = y_j'(t)$. Letting $\mathbf{z}(t) = (y_1(t), v_1(t), y_2(t), v_2(t), y_3(t), v_3(t))^T$, show that the oscillator problem can be rewritten as $\mathbf{z}' = A\mathbf{z}$. What is $\mathbf{z}(0)$?
(b) If Euler’s method is used to solve the equation in part (a), what is the resulting finite difference equation? Your answer should be written in terms of $\mathbf{z}_j$ and $\mathbf{z}_{j+1}$.
(c) If the trapezoidal method is used to solve the equation in part (a), what is the resulting finite difference equation? Your answer should be written in terms of $\mathbf{z}_j$ and $\mathbf{z}_{j+1}$.
(d) If the RK4 method is used to solve the equation in part (a), what is the resulting finite difference equation? Your answer should be written in terms of $\mathbf{z}_j$ and $\mathbf{z}_{j+1}$.
(e) Using one of the methods from (b) to (d), on the same axis, plot $y_1$, $y_2$, $y_3$ for $0 \le t \le 10$. Assume that $k_{12} = k_{13} = k_{23} = 1/2$. Make sure to state what method you used to solve the problem and why you picked that method. Also, explain why you believe your solutions are accurate approximations of the exact solutions.

7.38. This exercise explores the differences between a symplectic method and a method which conserves energy exactly. Recall that the equation $my'' = F(y)$ can be written as the first-order system given in (7.87), (7.88). A Hamiltonian for this system is
$$H(y, v) = \frac{1}{2}mv^2 - \int_0^y F(s)\,ds,$$
and this corresponds to the total mechanical energy. The energy conserving method considered here is described in more detail in [Holmes, 2020a].
(a) It is assumed that the trapezoidal method given in (7.89) can be modified to produce a method that conserves energy. In particular, it is assumed that one can find the form
$$y_{j+1} = y_j + \frac{k}{2}(v_{j+1} + v_j),$$
$$v_{j+1} = v_j + \frac{k}{m}W(y_{j+1}, y_j).$$


Using the above formula for the Hamiltonian, show that
$$H(y_{j+1}, v_{j+1}) - H(y_j, v_j) = \frac{k}{2}(v_j + v_{j+1})\left[W(y_{j+1}, y_j) - \frac{1}{y_{j+1} - y_j}\int_{y_j}^{y_{j+1}} F(s)\,ds\right].$$
From this, conclude that the method is conservative if
$$W(y_{j+1}, y_j) = \frac{1}{y_{j+1} - y_j}\int_{y_j}^{y_{j+1}} F(s)\,ds.$$
It is possible to show that the resulting method is second order (you do not need to prove this).
(b) What finite difference equations do you obtain when the method in part (a) is applied to the pendulum system in (7.85), (7.86)? The resulting method is implicit. Show that finding $\theta_{j+1}$ and $v_{j+1}$ reduces to solving $f(\theta_{j+1}) = 0$, where
$$f(\theta) = \theta - \theta_j - kv_j - \frac{1}{2}k^2\,\frac{\cos\theta - \cos\theta_j}{\theta - \theta_j}.$$
It is assumed here that $\ell = g = 1$.
(c) Compute the solution using the method from part (b) for $0 \le t \le 4000$, using $k = 1/2$ and the initial conditions $\theta(0) = \pi/4$, $\theta'(0) = 0$. These are the same values used for Example 7.16. In your write-up, state what method you used to solve $f(\theta_{j+1}) = 0$, including the stopping condition and what you used for the initial guess(es) for $\theta_{j+1}$.
(d) With the computed solution from part (c), plot $\theta$ for $0 \le t \le 40$ and for $3960 \le t \le 4000$.
(e) Plot the energy $H(\theta, \theta')$ for $0 \le t \le 40$ and for $3960 \le t \le 4000$. In each plot also include the energy curves determined using velocity Verlet and RK4.

Chapter 8

Optimization: Regression

The problem central to this chapter, and the one that follows, is easy to state: given a function $F(\mathbf{v})$, find the point $\mathbf{v}_m \in \mathbb{R}^n$ where $F$ achieves its minimum value. This can be written as

$$F(\mathbf{v}_m) = \min_{\mathbf{v} \in \mathbb{R}^n} F(\mathbf{v}). \tag{8.1}$$

The point $\mathbf{v}_m$ is called the minimizer or optimal solution. The function $F(\mathbf{v})$ is referred to as the objective function, or the error function, or the cost function (depending on the application). Also, because the minimization is over all of $\mathbb{R}^n$, this is a problem in unconstrained optimization.

This chapter concerns what is known as regression, which involves finding a function that best fits a set of data points. Typical examples are shown in Figure 8.1. To carry out the regression analysis, you need to specify two functions: the model function and the error function. With this you then need to specify what method to use to calculate the minimizer.

Figure 8.1 Examples of regression fits from Schutt and Moseley [2020] (L), and Lusso, E. et al. [2019] (R).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0 8


8.1 Model Function

We begin with the model function, which means we specify the function that will be fitted to the data. Examples are:

• Linear Regression
$g(x) = v_1 + v_2 x$  (8.2)
$g(x) = v_1 + v_2 x + v_3 x^2 + v_4 x^3$
$g(x) = v_1 e^{x} + v_2 e^{-x}$
$g(x) = \dfrac{v_1 + v_2\sqrt{1 + x}}{1 + x}$

• Non-linear Regression
$g(x) = v_1 + v_2 e^{v_3 x}$ - asymptotic regression function  (8.3)
$g(x) = \dfrac{v_1 x}{v_2 + x}$ - Michaelis-Menten function  (8.4)
$g(x) = \dfrac{v_1}{1 + v_2 \exp(-v_3 x)}$ - logistic function  (8.5)

In the above functions, the $v_j$’s are the parameters determined by fitting the function to the given data, and $x$ is the independent variable in the problem. What determines whether the function is linear or nonlinear is how it depends on the $v_j$’s. Specifically, linear regression means that the model function $g(x)$ is a linear function of the $v_j$’s. If you are unsure what this means, a simple test for linear regression is that for every parameter $v_j$, the derivative $\partial g/\partial v_j$ does not depend on any of the $v_i$’s. If even one of the derivatives depends on one or more of the $v_i$’s then the function qualifies for non-linear regression.

The question often arises as to exactly which model function to use. Typically, its dependence on $x$ is evident from the data. For example, in Figure 8.1, the spindle length data looks to be approximately linear, and so one would be inclined to use (8.2). The function also might be known from theoretical considerations. Examples of linear relationships as in (8.2) include Hooke’s law, Ohm’s law and Fick’s law. The nonlinear functions can also come from mathematical models, and because of this they are often given names connected to the original problem they came from. For example, the one in (8.4) gets its name from a model for an enzyme catalyzed reaction in biochemistry known as the Michaelis-Menten mechanism [Holmes, 2019]. The one in (8.5) gets its name from the fact that it is the solution of the logistic equation given in (7.6).
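The derivative test described above can be tried numerically. The Python sketch below approximates $\partial g/\partial v_j$ with a forward difference quotient at two different parameter vectors and reports whether the two values agree. The function names, the sample parameter vectors, the sample $x$ values, and the tolerance are all illustrative assumptions, and a numerical check of this kind is only suggestive, not a proof.

```python
import math

def param_derivative(g, x, v, j, h=1e-3):
    # Forward difference quotient for dg/dv_j.
    # It is exact, up to round-off, when g is a linear function of v_j.
    vh = list(v)
    vh[j] += h
    return (g(x, vh) - g(x, v)) / h

def passes_linear_test(g, n_params, xs=(0.5, 1.0, 2.0), tol=1e-6):
    # Compare dg/dv_j at two different parameter vectors v and w.
    # For linear regression the derivative must not depend on the parameters.
    v = [1.0 + 0.5 * i for i in range(n_params)]
    w = [0.3 + 0.4 * i for i in range(n_params)]
    return all(
        abs(param_derivative(g, x, v, j) - param_derivative(g, x, w, j)) <= tol
        for x in xs for j in range(n_params)
    )

# Model functions taken from the lists above: (function, number of parameters).
models = {
    "(8.2) straight line": (lambda x, v: v[0] + v[1] * x, 2),
    "cubic polynomial": (lambda x, v: v[0] + v[1] * x + v[2] * x**2 + v[3] * x**3, 4),
    "exponentials": (lambda x, v: v[0] * math.exp(x) + v[1] * math.exp(-x), 2),
    "(8.3) asymptotic": (lambda x, v: v[0] + v[1] * math.exp(v[2] * x), 3),
    "(8.4) Michaelis-Menten": (lambda x, v: v[0] * x / (v[1] + x), 2),
    "(8.5) logistic": (lambda x, v: v[0] / (1 + v[1] * math.exp(-v[2] * x)), 3),
}

for name, (g, n) in models.items():
    kind = "linear" if passes_linear_test(g, n) else "non-linear"
    print(f"{name:24s} {kind} regression")
```

For a model that is linear in its parameters, the difference quotient does not depend on the step $h$ or on the parameter values, which is why a single comparison at two parameter vectors is enough to flag the non-linear cases here.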


8.2 Error Function

The next decision concerns the error, or objective, function, which is to be used to fit the data. This is not a simple question and the choice can have a significant effect on the final solution. To illustrate the situation, three data points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and a potential linear fit to these points, are shown in Figure 8.2. To measure the error, one can use the vertical distance $d_i$, as in the left figure, or the true distance $D_i$, as in the right figure. It is easy to determine the vertical distance, and it is $d_i = |g(x_i) - y_i|$. The formula for the true distance depends on the model function. In what follows we will concentrate on the linear function (8.2), in which case

$$ D_i = \frac{d_i}{\sqrt{1 + v_2^2}}. $$

Both of these are used. The $d_i$’s are used in traditional least squares data fitting, while the $D_i$’s are used in what is called orthogonal regression. It is necessary to be careful with the dimensional units when using $D_i$. For example, the usual distance between two points requires the calculation of $D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}$. If the $x$’s are measured, say, in terms of length, and the $y$’s are measured in terms of, say, time, then the formula for $D$ has no meaning. The simplest way to avoid this problem is to scale, or nondimensionalize, the variables before using orthogonal regression. An example of this can be found in Section 10.2.

Figure 8.2 Different measures of error for the model function $g(x) = v_1 + v_2 x$.

8.2.1 Vertical Distance

The usual reason for data fitting is to obtain a function $g(x)$ that is capable of giving a representative, or approximate, value of the data as a function of $x$. This is a situation where using the vertical distances is appropriate. To examine how this can be done, we start with the linear model $g(x) = v_1 + v_2 x$. Suppose there are three data points, as is illustrated in Figure 8.2. We want to find the values of $v_1$ and $v_2$ that will minimize $d_1$, $d_2$ and $d_3$. One way this can be done is to use the maximum error, which is

$$ E_\infty(v_1, v_2) = \max\{d_1, d_2, d_3\} = \max_{i=1,2,3} |g(x_i) - y_i| = \|\mathbf{g} - \mathbf{y}\|_\infty, \tag{8.6} $$

where $\mathbf{y} = (y_1, y_2, y_3)^T$, $\mathbf{g} = (g(x_1), g(x_2), g(x_3))^T$, and the ∞-norm is defined in Section 3.5. Other choices are the sum

$$ E_1(v_1, v_2) = d_1 + d_2 + d_3 = \sum_{i=1}^{3} |g(x_i) - y_i| = \|\mathbf{g} - \mathbf{y}\|_1, \tag{8.7} $$

and the sum of squares

$$ E_2(v_1, v_2) = d_1^2 + d_2^2 + d_3^2 = \sum_{i=1}^{3} \left[ g(x_i) - y_i \right]^2 = \|\mathbf{g} - \mathbf{y}\|_2^2. \tag{8.8} $$

An important observation here is that a vector norm provides a natural way to measure the error. Of the three error functions given above, $E_2$ is the one most often used, producing what is called least squares. This function has the advantage of being differentiable if $g(x_i)$ is a differentiable function of the $v_j$’s. This is illustrated in Figure 8.3 in the case when $g(x) = v_1 + v_2 x$. The same data set is used in each plot. It is evident that $E_\infty$ has edges and corners, while $E_2$ is smooth. Another observation is that the location of the minimizer for $E_\infty$ is different from the one for $E_2$. In other words, the best fit values for $v_1$ and $v_2$ depend on how you measure the error. This will be discussed in more detail in Section 8.5.
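As a small illustration of (8.6) to (8.8), the following sketch evaluates the three error functions for a trial line. The data points and the trial values of $v_1$ and $v_2$ are hypothetical, and the function name `error_measures` is chosen only for this example.

```python
import numpy as np

def error_measures(v1, v2, x, y):
    """Return (E_inf, E_1, E_2) for the linear model g(x) = v1 + v2*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = (v1 + v2 * x) - y                  # the vector g - y
    e_inf = np.linalg.norm(r, np.inf)      # maximum error, as in (8.6)
    e_1 = np.linalg.norm(r, 1)             # sum of the d_i's, as in (8.7)
    e_2 = np.linalg.norm(r, 2) ** 2        # sum of squares, as in (8.8)
    return e_inf, e_1, e_2

# Hypothetical data and trial parameters (not taken from the text)
x = [0.0, 1.0, 2.0]
y = [1.0, 2.0, 2.0]
e_inf, e_1, e_2 = error_measures(1.0, 0.5, x, y)
print(e_inf, e_1, e_2)
```

For this trial line only the middle point is missed, by 0.5, so the three measures differ only through how they combine that single nonzero distance.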

8.3 Linear Least Squares

The assumption is that the model function $g(x)$ depends linearly on the parameters to be fitted to the data. For example, it could be $g(x) = v_1 + v_2 x$, $g(x) = v_1 e^{x} + v_2 e^{-x}$, or $g(x) = v_1 + v_2 x^3$. Because $g(x) = v_1 + v_2 x$ is probably the most used model function involving least squares, we will begin with it. In what follows, the data are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. If the model function has $m$ parameters, then it is assumed that $m \le n$.

Figure 8.3 Plots of $E_\infty(v_1, v_2)$ and $E_2(v_1, v_2)$, both as a surface and as a contour plot in the $(v_1, v_2)$-plane, when $g(x) = v_1 + v_2 x$. The red * indicates the location of the minimizer.

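8.3.1 Two Parameters

We begin with the case when $g(x) = v_1 + v_2 x$, which means that the associated error function is

$$ E(v_1, v_2) = \sum_{i=1}^{n} \left[ g(x_i) - y_i \right]^2 = \sum_{i=1}^{n} (v_1 + v_2 x_i - y_i)^2. \tag{8.9} $$

To find the minimum, the first derivatives are

$$ \frac{\partial E}{\partial v_1} = 2\sum_{i=1}^{n} (v_1 + v_2 x_i - y_i) = 2\left( n v_1 + v_2 \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} y_i \right), $$

and

$$ \frac{\partial E}{\partial v_2} = 2\sum_{i=1}^{n} (v_1 + v_2 x_i - y_i)\, x_i = 2\left( v_1 \sum_{i=1}^{n} x_i + v_2 \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} x_i y_i \right). $$

Setting these to zero yields the system of equations

$$ \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}. \tag{8.10} $$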

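This system is called the normal equation for the $v_j$’s. Using the formula for the inverse matrix in (3.26), we have that

$$ \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{n \sum x_i^2 - \left( \sum x_i \right)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix}. \tag{8.11} $$

With this, we have solved the least squares problem when using a linear model to fit the data.

Example 8.1 To find the linear fit to the data in Table 8.1, note that $n = 4$, $\sum x_i = 2$, $\sum x_i^2 = 6$, $\sum y_i = 3$, and $\sum x_i y_i = -2$. In this case, (8.11) gives

$$ \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{20} \begin{pmatrix} 6 & -2 \\ -2 & 4 \end{pmatrix} \begin{pmatrix} 3 \\ -2 \end{pmatrix} = \frac{1}{10} \begin{pmatrix} 11 \\ -7 \end{pmatrix}. $$

The linear fit $g(x) = \frac{1}{10}(11 - 7x)$, as well as the data, are shown in Figure 8.4.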

Table 8.1 Data used in Example 8.1.

  $x_i$    −1     2     0     1
  $y_i$     1    −1     2     1

Figure 8.4 Linear fit obtained using the least squares formula in (8.11), plotted with $x$ on the horizontal axis and $y$ on the vertical axis.

A couple of mathematical comments are in order. First, the solution in (8.11) requires that the denominator be nonzero. It is not hard to prove that this holds except when all the $x_i$’s are the same (also recall that we are assuming $n \ge 2$). Geometrically this makes sense because if all of the data points fall on a vertical line, then it is impossible to write the line as $g(x) = v_1 + v_2 x$.

Another comment concerns whether the solution corresponds to the minimum value. It is possible to determine this using the second derivative test, but it is more informative to look at the situation geometrically. The error function $E(v_1, v_2)$ in (8.9) is a quadratic function of $v_1$ and $v_2$. The assumptions made on the $x_i$’s mean that $E(v_1, v_2)$ is an elliptic paraboloid that opens upwards (i.e., it’s an upward facing bowl). It looks similar to the surface for $E_2$ shown in Figure 8.3. In such cases, the critical point is the location of the minimum point for the function. Looking at the error function as a surface in this way can be used to derive some rather clever ways to find the minimizer, and this is the basis of most of the methods derived in the next chapter.
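The two-parameter formula (8.11) is short enough to code directly. The following is a minimal sketch, assuming NumPy is available; the function name `linear_fit` is hypothetical, and the data are those of Table 8.1.

```python
import numpy as np

def linear_fit(x, y):
    """Least squares fit of g(x) = v1 + v2*x using the closed form (8.11)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sx, sxx = x.sum(), (x * x).sum()
    sy, sxy = y.sum(), (x * y).sum()
    denom = n * sxx - sx ** 2        # zero only when all the x_i's are equal
    if denom == 0.0:
        raise ValueError("all x values are the same; the fit is not defined")
    v1 = (sxx * sy - sx * sxy) / denom
    v2 = (n * sxy - sx * sy) / denom
    return v1, v2

# Data from Table 8.1
v1, v2 = linear_fit([-1.0, 2.0, 0.0, 1.0], [1.0, -1.0, 2.0, 1.0])
print(v1, v2)   # expected to agree with 11/10 and -7/10 from Example 8.1
```

The explicit check on the denominator mirrors the first mathematical comment above: the formula breaks down only when every data point lies on the same vertical line.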

8.3.2 General Case

The model function is now taken to have the general form

$$ g(x) = v_1 \phi_1(x) + v_2 \phi_2(x) + \cdots + v_m \phi_m(x), \tag{8.12} $$

and the error function is

$$ E(v_1, v_2, \cdots, v_m) = \sum_{i=1}^{n} \left[ g(x_i) - y_i \right]^2. \tag{8.13} $$

Letting $\mathbf{v} = (v_1, v_2, \cdots, v_m)^T$ and $\mathbf{g} = (g(x_1), g(x_2), \cdots, g(x_n))^T$, then $\mathbf{g} = A\mathbf{v}$, where

$$ A = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_m(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_m(x_n) \end{pmatrix}. \tag{8.14} $$

In this case, the above error function (8.13) can be written as

$$ E(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2. \tag{8.15} $$

There are various ways to find the $\mathbf{v}$ that minimizes $E$, and we will begin with the most direct method. It is assumed in what follows that the columns of $A$ are independent, which means that the matrix has rank $m$. It is also assumed that $m \le n$.

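Normal Equation

Using the calculus approach, one simply calculates the first derivatives of $E$ with respect to the $v_j$’s, and then sets them to zero. Using the gradient, the equation to solve is $\nabla_v E(\mathbf{v}) = \mathbf{0}$, where

$$ \nabla_v \equiv \left( \frac{\partial}{\partial v_1}, \frac{\partial}{\partial v_2}, \cdots, \frac{\partial}{\partial v_m} \right)^T. $$

To determine $\nabla_v E(\mathbf{v})$, recall that $\|\mathbf{x}\|_2^2 = \mathbf{x} \bullet \mathbf{x}$ and the dot product can be written as $\mathbf{u} \bullet \mathbf{w} = \mathbf{u}^T \mathbf{w}$. So,

$$ \begin{aligned} \|A\mathbf{v} - \mathbf{y}\|_2^2 &= (A\mathbf{v} - \mathbf{y}) \bullet (A\mathbf{v} - \mathbf{y}) \\ &= (A\mathbf{v})^T (A\mathbf{v}) - 2 (A\mathbf{v})^T \mathbf{y} + \mathbf{y} \bullet \mathbf{y} \\ &= \mathbf{v}^T (A^T A) \mathbf{v} - 2 \mathbf{v}^T A^T \mathbf{y} + \mathbf{y} \bullet \mathbf{y}. \end{aligned} $$

Given an $m$-vector $\mathbf{b}$ and an $m \times m$ matrix $B$, it is straightforward to show that

$$ \nabla_v \left( \mathbf{v}^T \mathbf{b} \right) = \mathbf{b} \quad \text{and} \quad \nabla_v \left( \mathbf{v}^T B \mathbf{v} \right) = (B + B^T)\mathbf{v}. \tag{8.16} $$

Therefore, since $A^T A$ is symmetric, $\nabla_v E(\mathbf{v}) = \mathbf{0}$ reduces to solving

$$ (A^T A)\mathbf{v} = A^T \mathbf{y}. \tag{8.17} $$

This is the $m$-dimensional version of the matrix problem in (8.10). As before, it is called the normal equation for $\mathbf{v}$. Because the columns of $A$ are assumed to be independent, it is possible to prove that the symmetric $m \times m$ matrix $A^T A$ is positive definite. This means that the normal equation can be solved using a Cholesky factorization. However, as will be explained later, there are better ways to compute the solution of (8.17).

Example 8.2 To fit a parabola to the data set $(-1, 2)$, $(0, -1)$, $(1, 1)$, $(1, 3)$, the model function is $g(x) = v_1 + v_2 x + v_3 x^2$. So, $\phi_1 = 1$, $\phi_2 = x$, and $\phi_3 = x^2$. Since $x_1 = -1$, $x_2 = 0$, $x_3 = 1$, and $x_4 = 1$, then, from (8.14),

$$ A = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ 1 & x_4 & x_4^2 \end{pmatrix} = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. $$

Solving (8.17) it is found that $v_1 = -1$, $v_2 = 0$, and $v_3 = 3$.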
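The normal equation (8.17) for Example 8.2 can be solved numerically with a Cholesky factorization, as suggested above. The following is a minimal sketch using NumPy; for brevity the two triangular systems are solved with the general-purpose `np.linalg.solve` rather than a dedicated triangular solver, which is a simplification made for this illustration.

```python
import numpy as np

# Data from Example 8.2 and the matrix A from (8.14) with phi = 1, x, x^2
x = np.array([-1.0, 0.0, 1.0, 1.0])
y = np.array([2.0, -1.0, 1.0, 3.0])
A = np.column_stack([np.ones_like(x), x, x ** 2])

N = A.T @ A                     # symmetric positive definite when rank(A) = m
b = A.T @ y

L = np.linalg.cholesky(N)       # N = L L^T with L lower triangular
z = np.linalg.solve(L, b)       # forward solve:  L z = A^T y
v = np.linalg.solve(L.T, z)     # backward solve: L^T v = z

print(v)                        # expected to be close to (-1, 0, 3)
```

This is adequate for a small, well-conditioned problem such as this one; the discussion that follows explains why forming $A^T A$ can be a poor idea in general.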




Example 8.3 In the case when

$$ g(x) = v_1 + v_2 x + \cdots + v_m x^{m-1}, \tag{8.18} $$

then

$$ A = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{m-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{m-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{m-1} \end{pmatrix}. \tag{8.19} $$

This is fit to the data shown in Figure 8.5, taking $m = 4, 8, 12, 16$. Looking at the resulting curves, it is evident that the fit for $m = 8$ is better than for $m = 4$, and the method fails for $m = 12$ and $m = 16$. The reason for the failure is that $A^T A$ is ill-conditioned in these two cases, and to check on this the condition numbers are given in Table 8.2. This is not a surprise because $A$ strongly resembles the Vandermonde matrix derived in Chapter 5 when taking the direct approach for polynomial interpolation.

Figure 8.5 Fitting data with the polynomial (8.18) using the normal equation approach (green curve), and the QR approach (red curve). The panels are for $m = 4$ and $m = 8$ (top row), and $m = 12$ and $m = 16$ (bottom row), with $x$ on the horizontal axis and $y$ on the vertical axis. In the top row the two curves are indistinguishable.

Table 8.2 Condition number of the matrices that arise when fitting the polynomial (8.18) to the data in Figure 8.5. The matrix $R_m$ is obtained when using the QR approach, which is explained in Section 8.3.3.

  $m$     $\kappa_2(A^T A)$     $\kappa_2(R_m)$
  4       8.87e+04              3.38e+02
  8       1.36e+12              1.20e+06
  12      6.57e+17              1.14e+10
  16      3.40e+19              5.59e+14

Although the ill-conditioned nature of the matrix in the above example is due to its connection with the Vandermonde matrix, the normal equation has a propensity to be ill-conditioned for other model functions. As an indication of how bad this is, if $A$ is a square diagonal matrix then $\kappa_\infty(A^T A) = \kappa_\infty(A)^2$. So, if $\kappa_\infty(A) = 10^8$, then $\kappa_\infty(A^T A) = 10^{16}$. In this case, even though $A$ is not ill-conditioned, $A^T A$ certainly is. Fortunately, there are other ways to solve the least squares problem, and one using a QR factorization is discussed below. Another option is to use regularization, which involves modifying the error function, and this is explained in Section 8.3.6.
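The squaring of the condition number is easy to observe numerically. The sketch below first checks the diagonal case mentioned above, and then compares $A$ and $A^T A$ for the polynomial matrix (8.19). The diagonal entries and the equally spaced $x_i$’s are hypothetical choices made for this illustration, so the values will not match Table 8.2.

```python
import numpy as np

# Diagonal case: kappa_inf(A^T A) equals kappa_inf(A)^2
D = np.diag([1.0, 1e-2, 1e-4])
kappa_D = np.linalg.cond(D, np.inf)
kappa_DtD = np.linalg.cond(D.T @ D, np.inf)
print(kappa_D, kappa_DtD)

# Polynomial case (8.19) with m = 8 and 20 equally spaced points in [0, 1]
x = np.linspace(0.0, 1.0, 20)
A = np.vander(x, 8, increasing=True)     # columns 1, x, x^2, ..., x^7
kappa_A = np.linalg.cond(A)              # 2-norm condition number
kappa_AtA = np.linalg.cond(A.T @ A)
print(kappa_A, kappa_AtA)
```

In the 2-norm the same squaring relation holds for any $A$ with independent columns, which is why forming $A^T A$ costs roughly half of the available significant digits.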

8.3.3 QR Approach

It is possible to use the QR factorization to solve the least squares problem. To explain, as shown in Section 4.4.3, an invertible matrix $B$ can be factored as $B = QR$, where $Q$ is an orthogonal matrix, and $R$ is an upper triangular matrix. The procedure used to find this factorization can be used on $n \times m$ matrices if the matrix has $m$ linearly independent columns. For the matrix $A$ in the least squares problem, this means that it is possible to factor it as $A = QR$, where $Q$ is an $n \times n$ orthogonal matrix and $R$ is an $n \times m$ matrix of the form

$$ R = \begin{pmatrix} R_m \\ O \end{pmatrix}, \tag{8.20} $$

where $R_m$ is an upper triangular $m \times m$ matrix and $O$ is an $(n - m) \times m$ matrix containing only zeros. With this, the least squares error function becomes

$$ E(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2 = \|(QR)\mathbf{v} - \mathbf{y}\|_2^2 = \|Q\left( R\mathbf{v} - Q^T \mathbf{y} \right)\|_2^2. $$

Using the fact that orthogonal matrices have the property that $\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2$, it follows that $E(\mathbf{v}) = \|R\mathbf{v} - Q^T \mathbf{y}\|_2^2$. This can be rewritten as

$$ E(\mathbf{v}) = \|R_m \mathbf{v} - \mathbf{y}_m\|_2^2 + \|\mathbf{y}_{n-m}\|_2^2, \tag{8.21} $$

where $\mathbf{y}_m$ is the vector coming from the first $m$ rows of $Q^T \mathbf{y}$ and $\mathbf{y}_{n-m}$ is the vector coming from the remaining $n - m$ rows of $Q^T \mathbf{y}$. The latter does not depend on $\mathbf{v}$. So, to minimize $E(\mathbf{v})$ we need to minimize $\|R_m \mathbf{v} - \mathbf{y}_m\|_2^2$. This is easy as the minimum is just zero, and this happens when

$$ R_m \mathbf{v} = \mathbf{y}_m. \tag{8.22} $$

The fact that $R_m$ is upper triangular means that solving this equation is relatively simple. It also needs to be pointed out that $R_m$ is invertible as long as the columns of $A$ are independent. One way to determine $R_m$ is to apply the Gram-Schmidt procedure to the columns of $A$ to obtain an $n \times m$ matrix $G$. With this,

$$ \mathbf{y}_m = G^T \mathbf{y} \quad \text{and} \quad R_m = G^T A. \tag{8.23} $$

This is usually the easiest way when solving the problem by hand (i.e., when using exact arithmetic). When computing the solution, the Gram-Schmidt procedure (either version) can have trouble producing an orthogonal matrix when the condition number gets large (see Table 4.11 on page 153). Consequently, when floating-point arithmetic is involved, it is better to use the QR factorization method described in Section 8.4. If this is done, so $A = QR$, then the above formulas hold with $G$ being the first $m$ columns from $Q$.

Example 8.4 For the $A$ given in (8.26), applying the modified Gram-Schmidt procedure to its columns, one obtains

$$ G = \frac{1}{2} \begin{pmatrix} 1 & -3/\sqrt{5} \\ 1 & 3/\sqrt{5} \\ 1 & -1/\sqrt{5} \\ 1 & 1/\sqrt{5} \end{pmatrix}. $$

With this, and (8.23),

$$ R_m = G^T A = \begin{pmatrix} 2 & 1 \\ 0 & \sqrt{5} \end{pmatrix}, \quad \text{and} \quad \mathbf{y}_m = G^T \mathbf{y} = \begin{pmatrix} 3/2 \\ -7/(2\sqrt{5}) \end{pmatrix}. $$

Solving $R_m \mathbf{v} = \mathbf{y}_m$ yields the same solution found in Example 8.1.
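The QR approach in (8.22) and (8.23) can be sketched in a few lines of NumPy. Here `np.linalg.qr` with `mode="reduced"` is used to obtain the $n \times m$ matrix $G$ and the $m \times m$ matrix $R_m$ directly. It is based on Householder reflections rather than Gram-Schmidt, so the signs of the columns of $G$ and the rows of $R_m$ may differ from those in Example 8.4, but the solution $\mathbf{v}$ is the same. The helper `back_substitute` is a hypothetical name introduced for this illustration.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for an invertible upper triangular matrix U."""
    m = U.shape[0]
    x = np.zeros(m)
    for i in range(m - 1, -1, -1):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# Data from Table 8.1 and the matrix A for g(x) = v1 + v2*x
x = np.array([-1.0, 2.0, 0.0, 1.0])
y = np.array([1.0, -1.0, 2.0, 1.0])
A = np.column_stack([np.ones_like(x), x])

G, Rm = np.linalg.qr(A, mode="reduced")   # A = G Rm, with G^T G = I
ym = G.T @ y                              # first formula in (8.23)
v = back_substitute(Rm, ym)               # solve (8.22)

print(v)                                  # expected to be close to (1.1, -0.7)
```

Note that $A^T A$ is never formed, so the conditioning of the problem being solved is that of $R_m$, not its square.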




Example 8.3 (cont’d) The model function is $g(x) = v_1 + v_2 x + \cdots + v_m x^{m-1}$, and the least squares fits using the normal equation, and the QR approach, are shown in Figure 8.5. The two approaches produce basically the same result for $m = 4$ and $m = 8$, but not for the two higher degree polynomials. Given the condition numbers for $R_m$ in Table 8.2, the QR solution is expected to be fairly accurate for $m = 12$ and marginally accurate when $m = 16$. This is mentioned because the significant over- and undershoots appearing in the QR solution for $m = 16$ are not due to inaccuracies of the solution method but, rather, to the model function. It is similar to what happened when using a higher degree polynomial for interpolation (e.g., see Figure 5.4 on page 208). As will be explained in Section 8.3.7, over- and undershoots like this are often an indication that the data are being over-fitted.

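8.3.4 Moore-Penrose Pseudo-Inverse

The assumption that the columns of $A$ are independent means that $A^T A$ is invertible. So, (8.17) can be written as

$$ \mathbf{v} = A^+ \mathbf{y}, \tag{8.24} $$

where

$$ A^+ \equiv (A^T A)^{-1} A^T \tag{8.25} $$

is known as the Moore-Penrose pseudo-inverse. It has properties similar to those for an inverse matrix. For example, $(A^+)^+ = A$, $(AB)^+ = B^+ A^+$, and if $A$ is invertible then $A^+ = A^{-1}$.

Example 8.5 In Example 8.1, $\phi_1(x) = 1$, $\phi_2(x) = x$, and from (8.14)

$$ A = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) \\ \phi_1(x_2) & \phi_2(x_2) \\ \phi_1(x_3) & \phi_2(x_3) \\ \phi_1(x_4) & \phi_2(x_4) \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ 1 & 2 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}. $$

Since

$$ A^T A = \begin{pmatrix} 1 & 1 & 1 & 1 \\ -1 & 2 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 1 & 2 \\ 1 & 0 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 4 & 2 \\ 2 & 6 \end{pmatrix}, \tag{8.26} $$

and

$$ (A^T A)^{-1} = \frac{1}{10} \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix}, $$

it follows that

$$ A^+ = \frac{1}{10} \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ -1 & 2 & 0 & 1 \end{pmatrix} = \frac{1}{10} \begin{pmatrix} 4 & 1 & 3 & 2 \\ -3 & 3 & -1 & 1 \end{pmatrix}. $$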

It is easily verified that $A^+ \mathbf{y}$ yields the same solution as obtained in Example 8.1.

For by-hand calculations, (8.25), or the formula in Exercise 8.7(a) using Gram-Schmidt to determine $G$, are usually the easiest ways to determine $A^+$. However, for computational problems it is better to either use the formula in Exercise 8.7(a) using a Householder QR factorization (see Section 8.4) to determine $G$, or the SVD-based formula given in Exercise 8.7(b). If Householder is used, then $G$ is the first $m$ columns from $Q$.
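As a numerical cross-check of Example 8.5, the sketch below computes the pseudo-inverse two ways: from the defining formula (8.25), and with NumPy’s `np.linalg.pinv`, which is SVD-based. Forming $(A^T A)^{-1}$ explicitly is done here only to mirror the hand calculation; it is not being recommended as a computational method.

```python
import numpy as np

# Matrix and data from Examples 8.1 and 8.5
A = np.array([[1.0, -1.0],
              [1.0,  2.0],
              [1.0,  0.0],
              [1.0,  1.0]])
y = np.array([1.0, -1.0, 2.0, 1.0])

A_plus_formula = np.linalg.inv(A.T @ A) @ A.T   # definition (8.25)
A_plus_svd = np.linalg.pinv(A)                  # SVD-based pseudo-inverse

v = A_plus_svd @ y                              # least squares solution (8.24)
print(10 * A_plus_formula)
print(v)                                        # expected to be close to (1.1, -0.7)
```

The two matrices agree to rounding error for this small problem, and $A^+ A$ is the $2 \times 2$ identity, as expected when the columns of $A$ are independent.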

8.3.5 Overdetermined System

In data fitting the ideal solution is the one when the model function passes through all data points. So, if $g(x) = v_1 + v_2 x$, and the data points are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, then we would like

$$ \begin{aligned} v_1 + v_2 x_1 &= y_1 \\ v_1 + v_2 x_2 &= y_2 \\ &\;\;\vdots \\ v_1 + v_2 x_n &= y_n. \end{aligned} \tag{8.27} $$

Given that $n > 2$, we have more equations than unknowns. Consequently, it is expected that there is no solution, and this means that the above system is overdetermined. Undeterred by the possibility that there is no solution, we write (8.27) as $A\mathbf{v} = \mathbf{y}$, where

$$ A = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}. $$


Solving $A\mathbf{v} = \mathbf{y}$ is equivalent to finding $\mathbf{v}$ that satisfies $\|A\mathbf{v} - \mathbf{y}\|_2^2 = 0$. With data fitting, it is unlikely that there is a $\mathbf{v}$ that satisfies this equation. So, instead, the goal is to find $\mathbf{v}$ that gets $\|A\mathbf{v} - \mathbf{y}\|_2^2$ as close to zero as possible (i.e., it minimizes $\|A\mathbf{v} - \mathbf{y}\|_2^2$). This is the same function that appears in (8.15), and so the solution is

$$ \mathbf{v} = A^+ \mathbf{y}, \tag{8.28} $$

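where $A^+$ is the Moore-Penrose pseudo-inverse defined in (8.25). In the vernacular of the subject, (8.28) is said to be the solution of (8.27) in the least squares sense, or the least squares solution.

Example 8.6 Find the least-squares solution of

$$ \begin{aligned} x - y + z &= 2 \\ x &= -1 \\ x + y + z &= 1 \\ x + y + z &= 3 \end{aligned} $$

Answer: As shown in Exercise 8.7(a), $A^+ = R_m^{-1} G^T$. In this problem

$$ A = \begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{pmatrix} 2 \\ -1 \\ 1 \\ 3 \end{pmatrix}. $$

Applying Gram-Schmidt to the columns of $A$ we get that

$$ G = \begin{pmatrix} \frac{1}{2} & -\frac{5}{2\sqrt{11}} & \frac{\sqrt{22}}{11} \\ \frac{1}{2} & -\frac{1}{2\sqrt{11}} & -\frac{2\sqrt{22}}{11} \\ \frac{1}{2} & \frac{3}{2\sqrt{11}} & \frac{1}{\sqrt{22}} \\ \frac{1}{2} & \frac{3}{2\sqrt{11}} & \frac{1}{\sqrt{22}} \end{pmatrix}, \quad \text{and so} \quad R_m = G^T A = \begin{pmatrix} 2 & \frac{1}{2} & \frac{3}{2} \\ 0 & \frac{\sqrt{11}}{2} & \frac{1}{2\sqrt{11}} \\ 0 & 0 & \frac{2\sqrt{22}}{11} \end{pmatrix}. $$

After finding $R_m^{-1}$ you get that

$$ A^+ = R_m^{-1} G^T = \begin{pmatrix} 0 & 1 & 0 & 0 \\ -\frac{1}{2} & 0 & \frac{1}{4} & \frac{1}{4} \\ \frac{1}{2} & -1 & \frac{1}{4} & \frac{1}{4} \end{pmatrix}. $$

Evaluating $A^+ \mathbf{y}$ one finds that $x = -1$, $y = 0$, $z = 3$, which is equivalent to the solution found in Example 8.2.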
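The overdetermined system of Example 8.6 can also be handed directly to a library least squares routine. The sketch below uses NumPy’s `np.linalg.lstsq`, which is SVD-based and returns the minimizer of $\|A\mathbf{v} - \mathbf{y}\|_2^2$ together with the sum of squared residuals and the rank of $A$.

```python
import numpy as np

# Coefficient matrix and right-hand side from Example 8.6
A = np.array([[1.0, -1.0, 1.0],
              [1.0,  0.0, 0.0],
              [1.0,  1.0, 1.0],
              [1.0,  1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0, 3.0])

v, residual, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

print(v)          # expected to be close to (-1, 0, 3)
print(residual)   # minimum value of ||A v - y||_2^2
print(rank)       # 3, so the columns of A are independent
```

The residual is not zero here, which confirms that the system has no exact solution: the last two equations have the same left-hand side but different right-hand sides.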


8.3.6 Regularization

Both fitting methods in Figure 8.5 produce unwieldy polynomials for larger values of $m$. For example, when $m = 16$ the QR approach results in a polynomial of the form

$$ (4 \times 10^5)x^{15} - (2 \times 10^7)x^{14} + (3 \times 10^8)x^{13} - (3 \times 10^9)x^{12} + \cdots. $$

Such large coefficients, with alternating signs, mean that the resulting model function is very sensitive both to how accurately the $v_j$’s are computed and to any potential errors in the data. A way to reduce the likelihood of having this happen is to modify the error function, and include a penalty for large values of the $v_j$’s. The usual way this is done is to take

$$ E_R(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{v}\|_2^2, \tag{8.29} $$

where $\lambda > 0$ is a parameter that you select. So, the larger the value of $\|\mathbf{v}\|_2$, the more the penalty term contributes to the error function. This idea is called regularization. With (8.29), employing the same argument used to derive the normal equation, you find that instead of (8.17), you now get

$$ \left[ (A^T A) + \lambda I \right] \mathbf{v} = A^T \mathbf{y}. \tag{8.30} $$

This is referred to as the regularized normal equation. So, regularization has resulted in a minor change in the equation that must be solved. Moreover, $A^T A + \lambda I$ is not just invertible, it is positive definite (see Exercise 4.5) and typically has a better condition number than $A^T A$.

The big question is how the modified error function affects the fit to the data. One way to answer this is to note that $E_R(\mathbf{v})$ approaches the error function used in least squares as $\lambda \to 0$. So, it is reasonable to expect that the fit is similar to that obtained using least squares for small values of $\lambda$. Another way to answer the question is to try it on our earlier example where the normal equation approach failed.

Example 8.3 (cont’d) The model function is $g(x) = v_1 + v_2 x + \cdots + v_m x^{m-1}$, and the least squares fits using the normal equation, and the QR approach, are shown in Figure 8.5. We are interested in $m = 12$ and $m = 16$ as these are values that the normal equation has trouble with. Taking $\lambda = 10^{-4}$, the fits using the regularized normal equation are shown in Figure 8.6. The improvement is dramatic. The regularized fit provides what looks to be a reasonable fit to the data and does not contain the over- and undershoots that the QR solution has. Moreover, the condition number for $A^T A + \lambda I$, for both $m = 12$ and $m = 16$, is about $5 \times 10^4$. In other words, it is better conditioned than the matrix used in the QR approach (see Table 8.2). Finally, you should also note that the two black curves are very similar. The significance of this will be discussed in Section 8.3.7.

Figure 8.6 Fitting data with the polynomial (8.18) using the regularized normal equation approach (8.30), by the black curves, and the QR approach (8.22), by the red curves. The two panels are for $m = 12$ and $m = 16$, with $x$ on the horizontal axis and $y$ on the vertical axis. This is the same data set used in Figure 8.5.

The regularization used here is known as ridge regularization. Other norms can be used, and one possibility is to replace $\lambda \|\mathbf{v}\|_2^2$ with $\lambda \|\mathbf{v}\|_1$. This is called lasso regularization. It is also possible to generalize this idea to get what is called Tikhonov regularization, and this is explained in Exercise 8.14.

What is not obvious is what value to use for the regularization parameter $\lambda$. One approach is to simply pick a few values and see what produces an acceptable result (which is the method used in the above example). A more systematic approach involves something called cross-validation and this will be discussed a bit more in the next section.

The idea underlying regularization is an old one, and arises in other areas in optimization. For example, it is used in the augmented Lagrangian method. It is also related to the method of Lagrange multipliers, although how the value of $\lambda$ is determined is modified slightly.
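The regularized normal equation (8.30) takes one line to solve. The sketch below compares the ridge solution with the ordinary least squares solution for a polynomial fit. The data are synthetic: the generating function, the noise level, the random seed, and the choice $m = 8$ are all assumptions made for this illustration, so the numbers will differ from those quoted for Figure 8.6. The function name `ridge_fit` is also hypothetical.

```python
import numpy as np

def ridge_fit(A, y, lam):
    """Solve the regularized normal equation (A^T A + lam*I) v = A^T y."""
    m = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ y)

# Synthetic noisy data on [0, 1] (hypothetical, not the data in Figure 8.5)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
y = 5.0 + 4.0 * np.sin(2.0 * np.pi * x) + 0.5 * rng.standard_normal(20)

m = 8
lam = 1e-4
A = np.vander(x, m, increasing=True)              # matrix (8.19)

v_ls = np.linalg.lstsq(A, y, rcond=None)[0]       # ordinary least squares
v_ridge = ridge_fit(A, y, lam)                    # ridge regularization

cond_plain = np.linalg.cond(A.T @ A)
cond_ridge = np.linalg.cond(A.T @ A + lam * np.eye(m))

print(np.linalg.norm(v_ls), np.linalg.norm(v_ridge))
print(cond_plain, cond_ridge)
```

Two effects of the penalty term are visible in the output: the coefficient vector is smaller in norm, and the matrix being solved is much better conditioned than $A^T A$.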

8.3.7 Over-fitting the Data

We need to return to the question of picking the model function. In Example 8.3, the polynomial $g(x) = v_1 + v_2 x + \cdots + v_m x^{m-1}$ was considered. The data used to do this, which are shown in Figures 8.5 and 8.7(top), are known as the training data because they are used to determine the $v_j$’s. What is seen in Figure 8.7(bottom), by the blue curve, is that the larger you take $m$, the better the fit (using the QR approach). In other words, the fitting error decreases with $m$. If the smallest fitting error possible is your goal, then presumably you should keep increasing $m$ until it is achieved (assuming that this is even possible computationally). In regression problems, which typically involve noisy data, this usually leads to what is called over-fitting the data.

Figure 8.7 Top: Training data and the testing data for Example 8.3. Bottom: The error $E(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2$ on the training data set, and the error using the trained model function on the testing data set, as functions of $m$.

To explain why over-fitting the data is something to be avoided, some terminology is needed. Fitting a model function to a given data set is referred to as training. The usual statement is that the optimal parameters are obtained by training the model on the data. Now, suppose the experiment used to produce the training data is run again, producing a new data set. This is known as testing data, and an example is shown in Figure 8.7(top). The data are obtained by rerunning the same code that produced the data in Figure 8.5. So, they are obtained using the same generating function, but appear different because the 20 random $x_i$ points are slightly different, and the noise is slightly different. The expectation is that the trained model function should fit the testing data about as well as it does the training data. As shown in Figure 8.7(bottom), by the red curve, for smaller values of $m$ the fit to the testing data does improve with $m$. However, unlike with the training data, for larger values of $m$ the error gets worse. The reason is that some of the testing data are falling in those regions where the model function has significant over- or undershoots, and this results in large errors (see Figure 8.6).

We are in the proverbial Goldilocks situation, where we want the model function to be able to describe the qualitative behavior seen in the data, not being too simple or too complex. But, determining what is “just right” can be difficult. As it turns out, regularization can help with this decision. If you look at Figure 8.6 it is evident that the two regularized solutions (i.e., the two black curves) do not have over- and undershoots. Therefore, they should be less susceptible to over-fitting. The verification of this is given in Figure 8.8. To explain this plot, given $m$, the fitting parameters are obtained by minimizing $E_R(\mathbf{v})$ given in (8.29). What we are interested in is the resulting fitting error $E(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2$. The resulting $E(\mathbf{v})$ for the training data is the blue curve in Figure 8.8. The red curve is the resulting $E(\mathbf{v})$ for the testing data. Unlike what happens in Figure 8.7(bottom), the regularized solution shows a similar dependence on $m$ for the testing data as it does for the training data.

Figure 8.8 The resulting fitting error $E(\mathbf{v}) = \|A\mathbf{v} - \mathbf{y}\|_2^2$, as a function of $m$, using the regularized solution on the training data set (blue curve), and on the testing data set (red curve).

The question of what is the preferred model function still has not been answered. For the polynomial functions $g(x) = v_1 + v_2 x + \cdots + v_m x^{m-1}$, based on the training curve in Figure 8.8, you would take $m = 6$. The reason is that any larger value of $m$ (i.e., a more complex function) results in almost no improvement in the fitting error. This conclusion is also consistent with what happens with the testing curve. It should be pointed out that this conclusion is relative to having taken $\lambda = 10^{-4}$.
Using training and testing sets is a form of cross-validation. Because overfitting, and other choices that might bias the results, are ubiquitous in regression, numerous ways have been developed to address these problems. Besides cross-validation [James et al., 2021, Liu and Dobriban, 2020], there is also data augmentation and ensembling (such as boosting and bagging), etc. More about these can be found in Ghojogh and Crowley [2019].
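The training and testing comparison described above can be reproduced in outline with a few lines of code. The sketch below fits polynomials of increasing size to a synthetic training set and evaluates the fitting error on a second, independently generated, testing set. The generating function, noise level, random seed, and range of $m$ are all hypothetical choices, so the curves will not match Figure 8.7; the helper names are likewise introduced only for this example.

```python
import numpy as np

def make_data(rng, n=20, noise=0.5):
    """Hypothetical experiment: noisy samples of a smooth function on [0, 1]."""
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = 5.0 + 4.0 * np.sin(2.0 * np.pi * x) + noise * rng.standard_normal(n)
    return x, y

def fit_poly(x, y, m):
    """Least squares fit of v1 + v2*x + ... + vm*x^(m-1)."""
    A = np.vander(x, m, increasing=True)
    return np.linalg.lstsq(A, y, rcond=None)[0]

def fitting_error(v, x, y):
    """E(v) = ||A v - y||_2^2 for the data (x, y)."""
    A = np.vander(x, v.size, increasing=True)
    return float(np.sum((A @ v - y) ** 2))

rng = np.random.default_rng(1)
x_train, y_train = make_data(rng)     # used to determine the v_j's
x_test, y_test = make_data(rng)       # a rerun of the same experiment

ms = list(range(2, 9))
train_err, test_err = [], []
for m in ms:
    v = fit_poly(x_train, y_train, m)
    train_err.append(fitting_error(v, x_train, y_train))
    test_err.append(fitting_error(v, x_test, y_test))

for m, e_tr, e_te in zip(ms, train_err, test_err):
    print(m, e_tr, e_te)
```

The training error cannot increase with $m$, because each polynomial family contains the previous one. No such guarantee applies to the testing error, which is what makes it useful for detecting over-fitting.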

8.4 QR Factorization

The QR factorization came up when solving eigenvalue problems in Section 4.4.4, as well as when solving least squares problems in the previous section. What is considered now is a better way to find the factorization than using Gram-Schmidt. The result will be a procedure that is able to derive a QR factorization for any matrix. And, unlike the LU factorization, it does not require pivoting to work. What it does require is more computational work, approximately twice that of LU.


The objective is, given an $n \times m$ matrix $A$, find a factorization of the form

$$ A = QR, \tag{8.31} $$

where $Q$ is an $n \times n$ orthogonal matrix, and $R$ is an $n \times m$ upper triangular matrix. The exact form for $R$ depends on the number of columns and rows in $A$. For example, the following are possible:

$$ \underbrace{R = \begin{pmatrix} x & x & x \\ 0 & x & x \\ 0 & 0 & x \\ 0 & 0 & 0 \end{pmatrix}}_{m < n}, \qquad \underbrace{R = \begin{pmatrix} x & x & x & x \\ 0 & x & x & x \\ 0 & 0 & x & x \\ 0 & 0 & 0 & x \end{pmatrix}}_{m = n}, \qquad \underbrace{R = \begin{pmatrix} x & x & x & x & x \\ 0 & x & x & x & x \\ 0 & 0 & x & x & x \\ 0 & 0 & 0 & x & x \end{pmatrix}}_{m > n}. $$

In the above matrices, an $x$ simply indicates a location for a possible nonzero entry. It should be pointed out that $Q$ and $R$ are not necessarily unique. We are not looking for a way to find all of them; all we want is a method to find just one of them.

8.4.1 Finding Q and R

The approach for finding the factorization is to construct orthogonal matrices $Q_1, Q_2, Q_3, \cdots$ so that the sequence $R_0 = A$,

$$ R_1 = Q_1 R_0 = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1m} \\ 0 & x & x & \cdots & x \\ 0 & x & x & \cdots & x \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & x & x & \cdots & x \end{pmatrix}, \tag{8.32} $$

$$ R_2 = Q_2 R_1 = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1m} \\ 0 & r_{22} & r_{23} & \cdots & r_{2m} \\ 0 & 0 & x & \cdots & x \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & x & \cdots & x \end{pmatrix}, \tag{8.33} $$

etc., will determine $R$ after no more than $m$ steps. So, what $Q_1$ must do is produce the zeros in the first column as indicated in (8.32). What $Q_2$ must do is produce the zeros in the second column, as indicated in (8.33), but also preserve the first row and column in $R_1$. A key observation is that if we can find a way to determine $Q_1$, then we can use the same method to determine $Q_2$. The reason is that the $(n - 1) \times (m - 1)$ block of $x$’s in (8.32) that $Q_2$ must operate on is just a smaller version of the $n \times m$ matrix that $Q_1$ must work on.

Figure 8.9 To transform $\mathbf{c}_1$, the blue vector, to $c\,\mathbf{e}_1$, the green vector, you can use a rotation, on the left, or a reflection, on the right.

Finding Q1 and R1

The goal is to find an $n \times n$ orthogonal matrix $Q_1$ so that

$$ R_1 = Q_1 A = \begin{pmatrix} r_{11} & r_{12} & r_{13} & \cdots & r_{1m} \\ 0 & x & x & \cdots & x \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & x & x & \cdots & x \end{pmatrix}. \tag{8.34} $$

More specifically, if $\mathbf{c}_1$ is the first column of $A$, then we want to find $Q_1$ so that

$$ Q_1 \mathbf{c}_1 = \begin{pmatrix} r_{11} \\ 0 \\ \vdots \\ 0 \end{pmatrix}. $$

In what follows, this will be written as

$$ Q_1 \mathbf{c}_1 = c\, \mathbf{e}_1, \tag{8.35} $$

where $\mathbf{e}_1 = (1, 0, 0, \cdots, 0)^T$. The constant $c$ in this equation is known. This is because orthogonal matrices have the property that $\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2$. Consequently, from (8.35), $|c| = \|\mathbf{c}_1\|_2$. In other words, we can take $c = \|\mathbf{c}_1\|_2$, or $c = -\|\mathbf{c}_1\|_2$. Both work, and which one is used is not important at the moment. It should be mentioned that if $\mathbf{c}_1$ already has the form $c\,\mathbf{e}_1$, then there is nothing to do and we simply take $Q_1 = I$. In what follows it is assumed that this is not the case.

Finding $Q_1$ is not hard if you remember that orthogonal matrices are associated with rotations and reflections. To take advantage of this, the equation (8.35) is illustrated in Figure 8.9 for $n = 2$, where $\mathbf{c}_1$ is shown in blue and $c\,\mathbf{e}_1$ is shown in green. So, one possibility is to rotate $\mathbf{c}_1$ down to $c\,\mathbf{e}_1$, which is shown on the left in Figure 8.9. Another choice, shown on the right, is a reflection through the line that bisects the two vectors. Both methods are used. Rotations give rise to what is known as Givens method. It turns out that, from a computational point of view, it is better to use reflections. The particular method that will be derived is known as Householder’s method.

Assuming the bisecting line is $y = \alpha x$, then a straightforward calculation shows that the reflection shown in Figure 8.9 is obtained using

$$ Q_1 = \frac{1}{1 + \alpha^2} \begin{pmatrix} 1 - \alpha^2 & 2\alpha \\ 2\alpha & \alpha^2 - 1 \end{pmatrix}. \tag{8.36} $$

This can be written as $Q_1 = I - P_1$, where

$$ P_1 = \frac{2}{1 + \alpha^2} \begin{pmatrix} \alpha^2 & -\alpha \\ -\alpha & 1 \end{pmatrix}. $$

Introducing the vector $\mathbf{v} = (-\alpha, 1)^T$, then

$$ P_1 = \frac{2}{\mathbf{v} \bullet \mathbf{v}}\, \mathbf{v}\mathbf{v}^T, $$

where $\mathbf{v}\mathbf{v}^T$ is an outer product. What is significant here is that the vector $\mathbf{v}$ is orthogonal to the bisecting line (see Figure 8.9).

Householder Transformation

The orthogonal matrix $Q_1$ in (8.36) is an example of what is called a Householder transformation. The definition of a Householder transformation $H$ in $\mathbb{R}^n$ is that it has the form

$$ H = I - \frac{2}{\mathbf{v} \bullet \mathbf{v}}\, \mathbf{v}\mathbf{v}^T, \tag{8.37} $$

where $\mathbf{v}$ is nonzero. This is the orthogonal matrix that corresponds to a reflection through the hyperplane that contains the origin and has $\mathbf{v}$ as a normal vector. In $\mathbb{R}^2$ a hyperplane is a line, and in $\mathbb{R}^3$ it is a plane. We will satisfy (8.35) by taking $Q_1 = H$, where

$$ \mathbf{v} = \mathbf{c}_1 - c\, \mathbf{e}_1. \tag{8.38} $$

In this expression you can take $c = \|\mathbf{c}_1\|_2$ or $c = -\|\mathbf{c}_1\|_2$ (both work). To verify that $\mathbf{v}$ is normal to the bisecting hyperplane, note that because $\mathbf{c}_1$ and $c\,\mathbf{e}_1$ have the same length, then the vector $\mathbf{c}_1 + c\,\mathbf{e}_1$ is on the diagonal that bisects $\mathbf{c}_1$ and $c\,\mathbf{e}_1$. It is easy to verify that $\mathbf{v} \bullet (\mathbf{c}_1 + c\,\mathbf{e}_1) = 0$. Since $\mathbf{v}$ is in the plane determined by $\mathbf{c}_1$ and $c\,\mathbf{e}_1$, it follows that it is normal to the bisecting hyperplane. With this choice for $Q_1$, we have that

$$ R_1 = Q_1 A = \begin{pmatrix} c & x & x & \cdots & x \\ 0 & x & x & \cdots & x \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & x & x & \cdots & x \end{pmatrix} = \left( \begin{array}{c|cccc} r_{11} & r_{12} & r_{13} & \cdots & r_{1m} \\ \hline 0 & & & & \\ \vdots & & & A_1 & \\ 0 & & & & \end{array} \right). \tag{8.39} $$

The first row of $R_1$ is the first row in $R$, and $A_1$ is an $(n - 1) \times (m - 1)$ matrix.
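The construction in (8.37) and (8.38) can be checked numerically on a small vector. The sketch below builds the Householder matrix for a hypothetical $\mathbf{c}_1 = (3, 4)^T$ and confirms that it maps $\mathbf{c}_1$ to $c\,\mathbf{e}_1$ with $c = \|\mathbf{c}_1\|_2 = 5$. The function name `householder_matrix` is introduced only for this example, and it assumes $\mathbf{c}_1$ is not already equal to $c\,\mathbf{e}_1$ (so that $\mathbf{v}$ is nonzero).

```python
import numpy as np

def householder_matrix(c1, sign=1.0):
    """Return (H, c) with H c1 = c e1, where c = sign * ||c1||_2.

    Assumes c1 is not already equal to c*e1, so that v in (8.38) is nonzero.
    """
    c1 = np.asarray(c1, dtype=float)
    c = sign * np.linalg.norm(c1)
    e1 = np.zeros_like(c1)
    e1[0] = 1.0
    v = c1 - c * e1                                          # (8.38)
    H = np.eye(c1.size) - (2.0 / (v @ v)) * np.outer(v, v)   # (8.37)
    return H, c

c1 = np.array([3.0, 4.0])          # hypothetical first column
H, c = householder_matrix(c1)

print(H)                           # a symmetric orthogonal matrix
print(H @ c1, c)                   # H c1 should be close to (5, 0)
```

Passing `sign=-1.0` gives the other valid choice, $c = -\|\mathbf{c}_1\|_2$, which maps $\mathbf{c}_1$ to $(-5, 0)^T$ instead.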

Finding Q2 and R2 Our task now is to find an orthogonal matrix Q2 so that R2 = Q2 R1 has the form given in (8.33). To preserve the first row and column in R1 , the orthogonal matrix Q2 is taken to have the form ⎛ ⎞ 1 0 0 ··· 0 ⎜0 ⎟ ⎜ ⎟ Q2 = ⎜ . ⎟ ⎝ .. ⎠ H 2

0 where H2 is the (n − 1) × (n − 1) Householder matrix for the submatrix A1 given in (8.39). This means we want H2 c1 = c e1 , where c1 is now the first column of A1 . This is a smaller version of the problem we solved to find Q1 . As before, H2 is given in (8.37) with v = c1 − c e1 , and this yields ⎛ ⎞ ⎛ ⎞ r22 r23 r24 · · · r2m c x x ··· x ⎜0 x x ··· x ⎟ ⎜ 0 ⎟ ⎜ ⎟ ⎜ ⎟ =⎜ . (8.40) H2 A1 = ⎜ . . . ⎟ ⎟. . .. ⎠ ⎝ . ⎝ .. .. .. ⎠ . A2 0 x x ··· x 0 The first row of H2 A1 is inserted into R2 as indicated in (8.33). Finding Q and R The steps outlined above are continued until all the columns of R are determined. To write out the general step, suppose Aj−1 is the submatrix in Rj−1 (where A0 = A). Letting Hj be the Householder matrix for Aj−1 , then

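$$H_j A_{j-1} = \begin{pmatrix} c & x & x & \cdots & x \\ 0 & x & x & \cdots & x \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & x & x & \cdots & x \end{pmatrix} = \begin{pmatrix} r_{jj} & r_{j,j+1} & r_{j,j+2} & \cdots & r_{jm} \\ 0 & & & & \\ \vdots & & A_j & & \\ 0 & & & & \end{pmatrix}. \tag{8.41}$$

The associated orthogonal matrix $Q_j$ is

$$Q_j = \begin{pmatrix} I_{j-1} & 0 \\ 0 & H_j \end{pmatrix}, \tag{8.42}$$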

where $I_{j-1}$ is the $(j-1) \times (j-1)$ identity matrix (with $Q_1 = H_1$). Although it is probably obvious, it's worth mentioning that the vector $e_1 = (1, 0, \cdots, 0)^T$ used to determine $H_j$ has the same size as the first column in $A_{j-1}$.

Now, suppose all of the required columns have been zeroed out, and $Q_K$ is the last orthogonal matrix used to accomplish this. This means that $R = Q_K R_{K-1}$, and this can be written as $R = Q_K Q_{K-1} \cdots Q_1 A$. The above expression is useful for determining $Q$, since

$$A = (Q_K Q_{K-1} \cdots Q_1)^{-1} R = (Q_1^{-1} \cdots Q_{K-1}^{-1} Q_K^{-1}) R = QR,$$

where

$$Q = Q_1 \cdots Q_{K-1} Q_K. \tag{8.43}$$

The last expression follows because the $Q_j$'s are symmetric and orthogonal, so that $Q_j^{-1} = Q_j^T = Q_j$.

As mentioned earlier, you can use $c = \|c_1\|_2$ or $c = -\|c_1\|_2$ when finding the $Q_j$'s. In the following example the positive value is used. When using floating-point arithmetic, there is a possibility that $c_1$ is very close to being a multiple of $e_1$. If this happens then it's possible that $v$, as defined in (8.38), is rounded to zero. To prevent this from happening, in most computer codes, the sign of $c$ is chosen to be the opposite of the sign of $c_{11}$ (the first entry in $c_1$).

Example 8.7 Find a QR factorization of

$$A = \begin{pmatrix} 0 & 2 & -1 \\ 1 & 1 & 1 \\ -1 & 0 & 2 \end{pmatrix}. \tag{8.44}$$

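Finding $R_1$: Since $c_1 = (0, 1, -1)^T$ then $c = \|c_1\|_2 = \sqrt{2}$ and, from (8.38),

$$v = c_1 - c\,e_1 = \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix} - \sqrt{2} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} -\sqrt{2} \\ 1 \\ -1 \end{pmatrix}.$$

So,

$$v v^T = \begin{pmatrix} 2 & -\sqrt{2} & \sqrt{2} \\ -\sqrt{2} & 1 & -1 \\ \sqrt{2} & -1 & 1 \end{pmatrix},$$

and, from (8.37),

$$H_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} - \frac{1}{2} \begin{pmatrix} 2 & -\sqrt{2} & \sqrt{2} \\ -\sqrt{2} & 1 & -1 \\ \sqrt{2} & -1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & \frac{1}{2}\sqrt{2} & -\frac{1}{2}\sqrt{2} \\ \frac{1}{2}\sqrt{2} & \frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2}\sqrt{2} & \frac{1}{2} & \frac{1}{2} \end{pmatrix}.$$

With this we have that

$$R_1 = H_1 A = \begin{pmatrix} \sqrt{2} & \frac{1}{2}\sqrt{2} & -\frac{1}{2}\sqrt{2} \\ 0 & \frac{1}{2} + \sqrt{2} & \frac{3}{2} - \frac{1}{2}\sqrt{2} \\ 0 & \frac{1}{2} - \sqrt{2} & \frac{3}{2} + \frac{1}{2}\sqrt{2} \end{pmatrix}.$$

Finding $R_2$: The lower $2 \times 2$ block from $R_1$ is

$$A_1 = \begin{pmatrix} \frac{1}{2} + \sqrt{2} & \frac{3}{2} - \frac{1}{2}\sqrt{2} \\ \frac{1}{2} - \sqrt{2} & \frac{3}{2} + \frac{1}{2}\sqrt{2} \end{pmatrix}.$$

Now, $c_1 = \begin{pmatrix} \frac{1}{2} + \sqrt{2} \\ \frac{1}{2} - \sqrt{2} \end{pmatrix}$, which means that $c = \|c_1\|_2 = 3/\sqrt{2}$ and

$$v = \begin{pmatrix} \frac{1}{2} + \sqrt{2} \\ \frac{1}{2} - \sqrt{2} \end{pmatrix} - \frac{3}{2}\sqrt{2} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{1}{2} \begin{pmatrix} 1 - \sqrt{2} \\ 1 - 2\sqrt{2} \end{pmatrix}.$$

From this we have

$$v v^T = \frac{1}{4} \begin{pmatrix} 3 - 2\sqrt{2} & 5 - 3\sqrt{2} \\ 5 - 3\sqrt{2} & 9 - 4\sqrt{2} \end{pmatrix},$$

and

$$H_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \frac{2}{v \cdot v}\, v v^T = \frac{1}{6} \begin{pmatrix} 4 + \sqrt{2} & -4 + \sqrt{2} \\ -4 + \sqrt{2} & -4 - \sqrt{2} \end{pmatrix}.$$

Therefore,

$$Q_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & & \\ 0 & & H_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2/3 + \sqrt{2}/6 & -2/3 + \sqrt{2}/6 \\ 0 & -2/3 + \sqrt{2}/6 & -2/3 - \sqrt{2}/6 \end{pmatrix}.$$

Moreover,

$$R = Q_2 R_1 = \begin{pmatrix} \sqrt{2} & \frac{1}{2}\sqrt{2} & -\frac{1}{2}\sqrt{2} \\ 0 & \frac{3}{2}\sqrt{2} & -\frac{1}{6}\sqrt{2} \\ 0 & 0 & -\frac{7}{3} \end{pmatrix}.$$

Finding $Q$: Using (8.43),

$$Q = Q_1 Q_2 = \begin{pmatrix} 0 & \frac{2}{3}\sqrt{2} & \frac{1}{3} \\ \frac{1}{2}\sqrt{2} & \frac{1}{6}\sqrt{2} & -\frac{2}{3} \\ -\frac{1}{2}\sqrt{2} & \frac{1}{6}\sqrt{2} & -\frac{2}{3} \end{pmatrix}.$$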
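The Householder steps above can be sketched in a few lines of code. What follows is a minimal illustration in plain Python, not the book's implementation: the function name `householder_qr` and the list-of-lists matrix storage are assumptions made for this sketch. It uses the sign convention mentioned earlier (the sign of $c$ is opposite to the sign of $c_{11}$), so for the matrix in (8.44) some rows of $R$ and columns of $Q$ can differ in sign from the hand computation, while the product $QR$ is unchanged.

```python
import math

def householder_qr(A):
    """Return (Q, R) with A = Q R, using Householder reflections.

    A is a list of rows (n x m, n >= m). Illustrative sketch only.
    """
    n, m = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]      # working copy, becomes R
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for j in range(min(n - 1, m)):
        c1 = [R[i][j] for i in range(j, n)]         # first column of A_{j-1}
        norm = math.sqrt(sum(x * x for x in c1))
        if norm == 0.0:
            continue                                # nothing to zero out
        c = -norm if c1[0] > 0 else norm            # sign opposite to c11
        v = c1[:]
        v[0] -= c                                   # v = c1 - c e1, as in (8.38)
        vv = sum(x * x for x in v)
        # R <- Q_j R: apply H = I - 2 v v^T / (v.v) to rows j..n-1
        for k in range(m):
            s = 2.0 * sum(v[i] * R[j + i][k] for i in range(len(v))) / vv
            for i in range(len(v)):
                R[j + i][k] -= s * v[i]
        for i in range(1, len(v)):
            R[j + i][j] = 0.0                       # remove rounding residue
        # Q <- Q Q_j, which builds Q = Q_1 Q_2 ... Q_K as in (8.43)
        for r in range(n):
            s = 2.0 * sum(Q[r][j + i] * v[i] for i in range(len(v))) / vv
            for i in range(len(v)):
                Q[r][j + i] -= s * v[i]
    return Q, R

A = [[0, 2, -1], [1, 1, 1], [-1, 0, 2]]             # the matrix in (8.44)
Q, R = householder_qr(A)
for row in R:
    print(["%8.4f" % x for x in row])
```

The magnitudes of the diagonal entries of the computed $R$ can be compared with $\sqrt{2}$, $\frac{3}{2}\sqrt{2}$, and $\frac{7}{3}$ from Example 8.7.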



8.4.2 Accuracy

It is informative to experiment with the QR factorization, as there are a few surprises, both good and bad. Of interest is how accurately $Q$ and $R$ are computed. The complication is that a QR factorization is not unique. It is, however, possible to prove that if the diagonals of $R$ are required to be positive then the factorization is unique. So, in this example, a random $n \times n$ orthogonal matrix $Q$ is generated along with a random $n \times n$ upper triangular matrix $R$ that has positive diagonals. With this, the matrix is taken to be $A = QR$. The QR factorization of $A$ is then computed using the Householder method, giving $Q_c$ and $R_c$, where the diagonals of $R_c$ are positive. The resulting differences between the exact and computed values are given in Table 8.3. The fifth column in this table is used to check on the orthogonality of $Q_c$, as explained in Example 4.12 on page 152.

Two matrices are tested: one which is well conditioned for all values of $n$, and one for which the condition number becomes large as the matrix size increases. From the second column in Table 8.3 it is seen that the product $Q_c R_c$ is computed accurately irrespective of the condition number or the size of the matrix. However, from the third and fourth columns, it is evident that $Q_c$ and $R_c$ are computed accurately for the well-conditioned matrices but not for the more ill-conditioned matrices. What is surprising about this is that even as the accuracy of $Q_c$ and $R_c$ decreases, their product remains very accurate. Moreover, as indicated in column 5, in all cases $Q_c$ is an orthogonal matrix (to machine precision).

| $n$ | $\|QR - Q_cR_c\|_\infty / \|A\|_\infty$ | $\|Q - Q_c\|_\infty$ | $\|R - R_c\|_\infty / \|R\|_\infty$ | $\|Q_c^TQ_c - I\|_\infty$ | $\kappa_\infty(A)$ | Matrix |
|---|---|---|---|---|---|---|
| 100 | 8.1e−16 | 8.9e−16 | 4.2e−15 | 8.8e−15 | 8e+01 | Well |
| 500 | 1.1e−15 | 1.1e−15 | 1.2e−14 | 2.7e−14 | 4e+02 | Well |
| 1000 | 1.3e−15 | 1.3e−15 | 1.9e−14 | 4.5e−14 | 7e+02 | Well |
| 2000 | 1.5e−15 | 1.5e−15 | 2.9e−14 | 7.2e−14 | 1e+03 | Well |
| 4000 | 1.6e−15 | 1.6e−15 | 4.3e−14 | 1.1e−13 | 3e+03 | Well |
| 100 | 7.8e−16 | 1.4e−15 | 1.1e−15 | 9.7e−15 | 7e+02 | Ill |
| 500 | 7.5e−16 | 1.3e−14 | 3.7e−15 | 2.6e−14 | 2e+05 | Ill |
| 1000 | 8.6e−16 | 2.3e−13 | 1.2e−13 | 4.5e−14 | 3e+07 | Ill |
| 2000 | 8.4e−16 | 1.8e−10 | 7.4e−11 | 7.2e−14 | 2e+11 | Ill |
| 4000 | 1.3e−15 | 5.2e−04 | 1.2e−04 | 1.1e−13 | 4e+18 | Ill |

Table 8.3 Accuracy in computing the QR factorization for a matrix that is well-conditioned (Well), and one that is ill-conditioned (Ill).

8.4.3 Parting Comments

The QR factorization algorithm, unlike the one for an LU factorization, does not need anything like pivoting to find the factorization. This reflects the fact that you can prove that any matrix has a QR factorization, with no pivoting needed [Golub and Van Loan, 2013]. In considering the possible rotations and/or reflections that might be used to transform the first column of $A$ to a multiple of $e_1$, the Givens method was mentioned. While Householder uses reflections, Givens uses rotations to find the $Q_j$'s, and this ends up requiring more flops (about 50% more). Also, it is relatively easy to code Householder to take advantage of the optimized operations available in BLAS. What this means is that, for the current generation of multi-core processors, Householder is the usual way a QR factorization is computed.

8.5 Other Error Functions

In some problems using the vertical distance between the data point and model function is not possible, or not relevant to the questions being asked. To explore what happens when using other error functions, recall that the one we have been considering is

$$E_2 = \|g - y\|_2^2.$$

[Figure 8.10: Linear fit obtained using $E_D$ (solid red line) and using $E_2$ (dashed black line). Axes: x-axis from 0 to 1, y-axis from 0 to 5.]

An alternative is to use the true distance, as illustrated in Figure 8.2. Using least squares, with $g(x) = v_1 + v_2 x$, the error function is

$$E_D = \sum_{i=1}^{n} D_i^2 \tag{8.45}$$

$$\phantom{E_D} = \frac{1}{1 + v_2^2}\, \|g - y\|_2^2. \tag{8.46}$$
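The identity in (8.46) is easy to check numerically. The sketch below computes the sum of squared perpendicular distances directly, by finding the closest point on the line to each data point, and compares it with $E_2/(1 + v_2^2)$. The data values and the trial line are hypothetical and chosen only for illustration; the function names `e2` and `ed` are likewise assumptions of this sketch.

```python
def e2(v1, v2, xs, ys):
    """Sum of squared vertical distances to the line y = v1 + v2*x."""
    return sum((v1 + v2 * x - y) ** 2 for x, y in zip(xs, ys))

def ed(v1, v2, xs, ys):
    """Sum of squared true (perpendicular) distances to the same line."""
    total = 0.0
    for x, y in zip(xs, ys):
        # x-coordinate of the point on the line closest to (x, y)
        t = (x + v2 * (y - v1)) / (1.0 + v2 * v2)
        total += (x - t) ** 2 + (y - (v1 + v2 * t)) ** 2
    return total

xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical data, not from the text
ys = [1.1, 1.9, 2.4, 3.6, 4.1]
v1, v2 = 1.0, 3.0                  # an arbitrary trial line

# The two printed values should agree to rounding error.
print(ed(v1, v2, xs, ys), e2(v1, v2, xs, ys) / (1.0 + v2 ** 2))
```

Because the factor $1/(1 + v_2^2)$ depends on the slope, the minimizers of $E_D$ and $E_2$ are generally different even though the two functions are so closely related.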

In the case when $g(x) = v_1 + v_2 x$, finding the values of $v_1$ and $v_2$ that minimize $E_D$ reduces to solving a cubic equation. This can be solved using a nonlinear equation solver, such as Newton's method. Another option is to compute the minimizer using a descent method; these are derived in the next chapter. In the example below, the values are calculated using the Nelder-Mead algorithm, which is described in Section 9.6.

Example 8.8 Shown in Figure 8.10 are data, along with the linear fit using the error function $E_D$ and the linear fit using the error function $E_2$. The corresponding values of the parameters $v_1$ and $v_2$ are given in Table 8.4. The values using $E_1$ and $E_\infty$ are also given. The fact that the lines in the figure, or the values given in the table, are different is no surprise. If you use a different error function you should expect a different answer for the minimizer. Given the variation in the computed values, it is important to use an error function that is relevant to the application being considered.

| Error function | $v_1$ | $v_2$ |
|---|---|---|
| $E_2$ | 1.16 | 2.93 |
| $E_D$ | 0.60 | 4.12 |
| $E_1$ | 1.51 | 2.49 |
| $E_\infty$ | 1.04 | 3.19 |

Table 8.4 Fitted values for $g(x) = v_1 + v_2 x$ for different error functions using the data shown in Figure 8.10.

Example 8.9 As a second example, consider the data shown in Figure 8.11. The objective is to find the circle $x^2 + y^2 = r^2$ that best fits this data, with the parameter in this case being the radius $r$. The first step is to specify the error function. This is a situation where using the vertical distance runs into some serious complications. This is because for any data point to the right (so, $x_i > r$) or left (so, $x_i < -r$) of the circle, the vertical distance is not defined. Consequently, it is more appropriate to use the radial distance between the curve and the data. So, given a data point $(x_i, y_i)$, the corresponding radial coordinates are $(r_i, \theta_i)$, where $r_i^2 = x_i^2 + y_i^2$. The least squares error based on the radial distance is

$$E(r) = \sum_{i=1}^{n} (r - r_i)^2. \tag{8.47}$$

[Figure 8.11: Data to be fitted by a circle, and the resulting circle obtained using least squares. Axes: x-axis and y-axis, each from −4 to 4.]

Taking the derivative of this with respect to $r$, setting the result to zero, and solving yields the solution

$$r = \frac{1}{n} \sum_{i=1}^{n} r_i. \tag{8.48}$$

This answer is one that you might have guessed before undertaking the regression analysis because it states that the circle that best fits the data is the one whose radius is the average of the radii from the data. A comparison between the resulting circle and the data is shown in Figure 8.11.

In Section 8.1 there was a discussion of how a model function might be selected, or preferred, for a particular problem. The same questions arise for the error function, but in this case they are a bit harder to answer. It depends on the problem being solved. As an example, for the error function used in a facial recognition algorithm, you want it to identify the person irrespective of whether they are upside down, right-side up, or looking in a mirror. What you are asking is that the error function is invariant to rotations or reflections of the data. As it turns out, $E_2$ does not do this but $E_D$ does. The point here is that the invariance properties of the error function can play a role in deciding which to use, and more about this can be found in Holmes and Caiola [2018].
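The circle fit in (8.48) takes only a few lines to compute. The following is a small sketch in plain Python; the data points are hypothetical (they are not the data in Figure 8.11), and the names `fit_circle_radius` and `radial_error` are introduced here for illustration.

```python
import math

def fit_circle_radius(points):
    """Radius minimizing E(r) in (8.47): the mean of the radii r_i."""
    radii = [math.hypot(x, y) for x, y in points]
    return sum(radii) / len(radii)

def radial_error(r, points):
    """E(r) = sum (r - r_i)^2, with r_i the distance of a point from the origin."""
    return sum((r - math.hypot(x, y)) ** 2 for x, y in points)

# Hypothetical data scattered near a circle of radius 3.
pts = [(3.1, 0.0), (0.0, 2.8), (-3.0, 0.3), (0.2, -3.2), (2.0, 2.3)]
r = fit_circle_radius(pts)
print(r, radial_error(r, pts))
```

Evaluating `radial_error` at values slightly above and below `r` gives a quick check that the mean radius is indeed the minimizer.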

8.6 Nonlinear Regression

To illustrate some of the complications that arise with nonlinear regression, we will consider the data shown in Figure 8.12, and given in Table 8.5, which is from Smith et al. [2010]. Based on what they refer to as a model for simple hyperbolic binding, they assumed the data can be fit with the function

$$g(x) = \frac{v_1 x}{v_2 + x}. \tag{8.49}$$

This is known as the Michaelis-Menten model function, and it is used extensively in biochemistry. Using a least squares error function, then

$$E(v_1, v_2) = \sum_{i=1}^{n} \left( \frac{v_1 x_i}{v_2 + x_i} - y_i \right)^2. \tag{8.50}$$

As is typical with nonlinear regression, it is not possible to find a formula for a minimizer of $E(v_1, v_2)$ as we did for linear regression. However, there are various ways to compute a minimizer. One option is to use a descent method, and these are the subject of the next chapter (of particular relevance is the Levenberg-Marquardt method discussed in Section 9.5). A second option is to set the first partial derivatives of $E$ to zero, and then solve the resulting nonlinear equations using Newton's method. A third option is to transform the problem into one for linear regression, and then find a formula for the solution. This is the option we will consider here. If you think that this contradicts the first sentence of this paragraph, it does not, and this will be explained in the development.

Example 8.10 Some nonlinear regression problems can be transformed into linear ones, although this usually only works when the model function is not too complicated. To illustrate, the Michaelis-Menten function, given in (8.49),

$$y = \frac{v_1 x}{v_2 + x},$$

can be written as

$$\frac{1}{y} = \frac{v_2 + x}{v_1 x} = \frac{v_2}{v_1} \frac{1}{x} + \frac{1}{v_1}.$$

Setting $Y = 1/y$ and $X = 1/x$, then the above model function takes the form

$$G(X) = V_1 + V_2 X, \tag{8.51}$$

where $V_1 = 1/v_1$ and $V_2 = v_2/v_1$. If $(x_i, y_i)$ is one of the original data points, then the corresponding point for the transformed problem is $(X_i, Y_i) = (1/x_i, 1/y_i)$. Using least squares, then the error function is

$$E(V_1, V_2) = \sum_{i=1}^{n} (V_1 + V_2 X_i - Y_i)^2. \tag{8.52}$$

According to (8.11), the resulting solution is

$$\begin{pmatrix} V_1 \\ V_2 \end{pmatrix} = \frac{1}{n \sum X_i^2 - \left( \sum X_i \right)^2} \begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix} \begin{pmatrix} \sum Y_i \\ \sum X_i Y_i \end{pmatrix}. \tag{8.53}$$

[Figure 8.12: Data from Smith et al. [2010].]

| $x_i$ | 0.1 | 0.5 | 1 | 2 | 5 | 10 |
|---|---|---|---|---|---|---|
| $y_i$ | 0.5 | 1.8 | 3.4 | 6.1 | 8.7 | 11.1 |
| $X_i$ | 10 | 2 | 1 | 0.5 | 0.2 | 0.1 |
| $Y_i$ | 2 | 0.5556 | 0.2941 | 0.1639 | 0.1149 | 0.0901 |

Table 8.5 Data for Example 8.10.

The last step is to transform back, using the formulas $v_1 = 1/V_1$ and $v_2 = V_2/V_1$. With this, we have been able to find a solution without the need for Newton's method.

The linearizing transformation approach is widely used but it has a potential, rather serious, flaw. To illustrate what this is, if you use the data in Table 8.5, then the curve shown in Figure 8.13 is obtained. The poor fit occurs because, for this particular model function,

$$g(x_i) - y_i = -\frac{1}{Y_i\, G(X_i)} \big[ G(X_i) - Y_i \big].$$

So, having $G(X_i) - Y_i$ close to zero does not necessarily mean $g(x_i) - y_i$ is close to zero, and the reason is the multiplicative factor $1/(Y_i\, G(X_i))$. As an example, the last data point in Figure 8.13 is $(x_6, y_6) = (10, 11.1)$, which results in $Y_6\, G(X_6) \approx 0.01$ and $|g(x_6) - y_6| \approx 100\, |G(X_6) - Y_6|$. This magnification of the error is the cause for the poor fit in Figure 8.13. There is a simple fix, and it is to use the relative error $E_S$, which is given as

$$E_S(V_1, V_2) = \sum_{i=1}^{n} \left( \frac{V_1 + V_2 X_i - Y_i}{Y_i} \right)^2. \tag{8.54}$$

[Figure 8.13: Fit of data from Smith et al. [2010] when using the error function in (8.52). Axes: x-axis from 0 to 10, y-axis from 0 to 12.]

[Figure 8.14: Fit of data from Smith et al. [2010] when using the error function in (8.54). Axes: x-axis from 0 to 10, y-axis from 0 to 12.]
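To make the comparison between (8.52) and (8.54) concrete, the sketch below fits the Table 8.5 data both ways and then measures each fit with the original error function (8.50). It is an illustration only: the helper names `fit_line`, `back_transform`, and `sse` are assumptions of this sketch, and the weighted formula is simply the solution of the $2 \times 2$ normal equations with weights $w_i$, where $w_i = 1$ corresponds to (8.53) and $w_i = 1/Y_i^2$ corresponds to minimizing $E_S$.

```python
def fit_line(X, Y, w=None):
    """Minimize sum w_i (V1 + V2*X_i - Y_i)^2 and return (V1, V2)."""
    if w is None:
        w = [1.0] * len(X)
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, X))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, X))
    t0 = sum(wi * yi for wi, yi in zip(w, Y))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, X, Y))
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det

def back_transform(V1, V2):
    """Recover (v1, v2) from V1 = 1/v1 and V2 = v2/v1."""
    return 1.0 / V1, V2 / V1

x = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]      # Table 8.5
y = [0.5, 1.8, 3.4, 6.1, 8.7, 11.1]
X = [1.0 / xi for xi in x]
Y = [1.0 / yi for yi in y]

def sse(v1, v2):
    """Original least squares error (8.50) for the Michaelis-Menten model."""
    return sum((v1 * xi / (v2 + xi) - yi) ** 2 for xi, yi in zip(x, y))

v_plain = back_transform(*fit_line(X, Y))                            # uses (8.52)
v_rel = back_transform(*fit_line(X, Y, [1.0 / Yi ** 2 for Yi in Y])) # uses (8.54)
print("(8.52):", v_plain, "E =", sse(*v_plain))
print("(8.54):", v_rel, "E =", sse(*v_rel))
```

Neither result is the minimizer of (8.50) itself, but comparing the two values of `sse` shows how much the choice of error function in the transformed problem matters.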

Using the solution for the minimum given in Exercise 8.17, the resulting fit is shown in Figure 8.14. As is clear, the fit is definitely better than what was obtained earlier.

Example 8.11 This example considers fitting data using the model function

$$g(x) = v_1 x^{v_2}, \tag{8.55}$$

which is known as a power law function, and also as an allometric function. It is assumed that the data are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, where the $x_i$'s and $y_i$'s are positive. There are a couple of options with this function, as explained below. Both of these options are explored in the exercises.

Option 1: Transform to Linear Regression. To transform into a linear regression problem, let $y = v_1 x^{v_2}$. Taking the log of this equation, then $\ln(y) = \ln(v_1) + v_2 \ln(x)$. Setting $Y = \ln(y)$ and $X = \ln(x)$, then the transformed model function is $G(X) = V_1 + V_2 X$, where $V_1 = \ln v_1$ and $V_2 = v_2$. The transformed data points $(X_i, Y_i)$ in this case are $X_i = \ln x_i$ and $Y_i = \ln y_i$. With this, $V_1$ and $V_2$ are determined using (8.53), or using the formula in Exercise 8.17. Once they are computed, then $v_1 = e^{V_1}$ and $v_2 = V_2$.

Option 2: Transform to Reduced Non-Linear Regression. It is possible to obtain a simplified non-linear regression problem using this model function. To explain, taking the first partials of $E(v_1, v_2) = \sum (v_1 x_i^{v_2} - y_i)^2$, and setting them to zero, one finds from the first equation that

$$v_1 = \frac{\sum y_i x_i^{v_2}}{\sum x_i^{2 v_2}}.$$

Substituting this into the second equation, one ends up with an equation of the form $F(v_2) = 0$. This is easy to solve using the secant method. The benefit of this is that this minimizes the original error function, and not the transformed error function used for linear regression.

The transformation to a linear regression problem needs more discussion. First, it requires that the change of variables is defined for the data under consideration. Second, the connection between the original and transformed minimization problems is tenuous. As explained in Section 8.5, different error functions usually produce different minimizers. Indeed, this is evident in the different solutions shown in Figures 8.13 and 8.14. If you are intent on determining the minimizer for (8.50), you can use the solution from the linear regression problem as the starting point when using one of the descent methods considered in the next chapter.

8.7 Fitting IVPs to Data Problems in mechanics, electrodynamics, and chemical kinetics often involve solving differential equations. The complication is that the coefficients in the equations are unknown, and they are usually determined by fitting the solution to experimental data. This results in a regression problem which includes solving the differential equations numerically. What follows is an approach for solving such problems.

8.7.1 Logistic Equation

To give an example of this type of problem, it is often assumed that a population $y(t)$ of a species satisfies the logistic equation

$$\frac{dy}{dt} = \alpha y (\beta - y), \quad \text{for } 0 < t, \tag{8.56}$$

where $y(0) = a$. Note that although it is possible to solve this problem in closed form, in the discussion to follow this will not be used because in most problems such a solution is not available.

Consider the data shown in Figure 8.15. Suppose it is assumed that (8.56) applies to this problem. The goal is then to determine $\alpha$ and $\beta$ in (8.56) from the data. So, suppose the data are $(t_1, y_1), (t_2, y_2), \cdots, (t_n, y_n)$. The error function to be used is

$$E(\alpha, \beta) = \sum_{i=1}^{n} [y(t_i) - y_i]^2. \tag{8.57}$$

To use our minimization methods we need to be able to evaluate this function for given values of $\alpha$ and $\beta$. There are various strategies for doing this, and we will consider two: a direct method, and an interpolation method.

Direct Method: For this one solves (8.56) numerically, and makes sure that the solver calculates the solution at all of the $t_i$'s in the data set. If there is a large number of points, as is the case in Figure 8.15, this can add considerable computing time to the procedure. Nevertheless, because this type of data fitting is so common, many IVP solvers have the ability to control the time points. For example, in MATLAB this is accomplished using the tspan argument of ode45.

Interpolation Method: The idea is to only use a small number of time steps when computing the numerical solution of the IVP. These points are then connected using an interpolation function, producing an approximation for $y(t)$ over the entire interval. The procedure consists of the following steps:

(1) Solve the IVP using a coarse time step. It is assumed that the time step required for an accurate numerical solution is significantly larger than the distance between the time points in the data set. In the example below, RK4 is used with a time step of 0.2, while the average distance between the data time points is 0.02.

(2) With the computed values of the solution of the IVP, use interpolation to produce a function that approximates $y(t)$ over the interval. In the example below, a clamped cubic spline is used. Note that this requires the value of $y'(t)$ at the two endpoints. However, given that we know $y(0)$, and we have just computed $y$ at the right endpoint, we can compute $y'(t)$ at the endpoints using the differential equation (the merits of this are discussed in Section 7.7).

(3) Evaluate the interpolation function at the $t_i$'s in the data set, and from this calculate the error in (8.57).

[Figure 8.15: Data and curves from Raghavan et al. [2012].]

[Figure 8.16: Synthetic data for the logistic equation, shown with the blue dots. The red curve is the solution of the logistic equation using the parameter values obtained from the data fit. Axes: t-axis from 0 to 2, y-axis from 1 to 3.5.]

Whichever method is used to evaluate $E(\alpha, \beta)$, it is also necessary to decide on what method to use to find the minimum. Several minimization algorithms are presented in the next chapter. In the examples to follow the Nelder-Mead procedure, which is described in Section 9.6, is used. It is a multivariable version of the secant method in the sense that it locates the minimum without requiring the computation of a derivative. This is available in MATLAB using the command fminsearch.

Example 8.12 To demonstrate the procedure we will use what is called synthetic data. This is obtained by selecting $\alpha = 2$ and $\beta = 3$, and then evaluating the solution at 79 random points in the interval $0 < t < 2$. Each value $y_i$ is then randomly perturbed a small amount, so the data resembles what is typically seen in experiments. An example of such a data set is shown in Figure 8.16. This data is then fit using the interpolation method. To do this, RK4 with 6 time steps is used to produce the points needed to determine the clamped cubic spline. Using the Nelder-Mead algorithm one finds that, after 60 iteration steps, $\alpha = 1.93$ and $\beta = 3.04$. Using these values the IVP was solved numerically, and the resulting curve is given in Figure 8.16. Note that if you use the direct method to fit the parameters, then the computing time for this example increases by a factor of about 5, and the final solution is approximately the same.

Although there are several steps in the minimization process, the method is straightforward. It is essential to point out that even though it worked very well in the example, the procedure can easily fail. One complication is that it can be unpredictable what values the algorithm might use for $\alpha$ and $\beta$ when searching for the minimum. For example, it might try to use unphysical values for the parameters, or values that cause the solution to behave in an unstable manner. In effect, this means that for many problems there are constraints on the minimization process, and one needs to build this into the algorithm. To illustrate, for the logistic equation, it is known that $\alpha$ must be positive. If the minimizer attempts to use a negative value this must be overridden, and an alternative put in its place.
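As an illustration of the direct method, the following sketch evaluates $E(\alpha, \beta)$ in (8.57) by taking one RK4 step between consecutive data times. Several things here are assumptions made for the sketch rather than details from the text: the initial value $a = 1$, the use of equally spaced (rather than random) data times, noise-free synthetic data generated from the closed-form logistic solution, and the function names. The interpolation step and the Nelder-Mead search are not included; a minimizer would simply call `error` repeatedly.

```python
import math

def rk4_logistic(alpha, beta, a, times):
    """RK4 values of y' = alpha*y*(beta - y), y(0) = a, at increasing times."""
    f = lambda y: alpha * y * (beta - y)
    t, y, out = 0.0, a, []
    for t_next in times:
        h = t_next - t                    # one RK4 step per data interval
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t = t_next
        out.append(y)
    return out

def error(alpha, beta, a, times, data):
    """E(alpha, beta) from (8.57), evaluated with the direct method."""
    ys = rk4_logistic(alpha, beta, a, times)
    return sum((yc - yd) ** 2 for yc, yd in zip(ys, data))

# Synthetic, noise-free data for alpha = 2, beta = 3 (a = 1 is assumed here).
a, alpha_true, beta_true = 1.0, 2.0, 3.0
times = [0.025 * i for i in range(1, 80)]
data = [beta_true * a / (a + (beta_true - a) * math.exp(-alpha_true * beta_true * t))
        for t in times]

print(error(2.0, 3.0, a, times, data))    # near zero at the true parameters
print(error(1.5, 3.5, a, times, data))    # noticeably larger elsewhere
```

With real data the first value would not be close to zero, but the same function can be handed to a derivative-free minimizer.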

8.7.2 FitzHugh-Nagumo Equations

In computational neuroscience, the FitzHugh-Nagumo equations are used as a model for the potential in the nerve. The equations in this case are

$$v' = c \left( v - \frac{1}{3} v^3 + w \right), \tag{8.58}$$

$$w' = -\frac{1}{c} (v - a - b w). \tag{8.59}$$

Synthetic data obtained from these equations in the case when $v(0) = -1$, $w(0) = 1$, $a = b = 0.2$ and $c = 3$ are shown in Figure 8.17. The error function in this case is

$$E(a, b, c) = \sum_{i=1}^{n} \Big( [v(t_i) - v_i]^2 + [w(t_i) - w_i]^2 \Big), \tag{8.60}$$

where $(t_i, v_i)$ and $(t_i, w_i)$ are the data, and $v(t)$ and $w(t)$ are the solution of the FitzHugh-Nagumo equations. The objective is to use a fitting method to find the values of $a$, $b$, and $c$. We will use the interpolation method, which means that to evaluate the error function for a particular $(a, b, c)$, the FitzHugh-Nagumo equations are first solved numerically using a relatively coarse time step. In this particular example, the RK4 method is used with $k \approx 0.34$. This solution is then used to construct a clamped cubic spline interpolation function that can be evaluated to obtain values for $v$ and $w$ at the various $t_i$'s appearing in the error function. Finally, the Nelder-Mead algorithm is used to find the minimum. The computed values are $a = 0.22$, $b = 0.37$, and $c = 2.7$. The resulting solution curves are shown in Figure 8.17.

[Figure 8.17: Synthetic data and corresponding numerical solution of the FitzHugh-Nagumo equations in (8.58), (8.59). Curves for $v$ and $w$ are plotted against the t-axis, from 0 to 20.]

8.7.3 Parting Comments

We used the blunt approach to find the coefficients of the IVPs, which means we just coded everything and then computed the answer. A better way would be to first analyze the problem, determine its basic mathematical properties, and then use a computational approach if necessary. As an example, for the logistic equation (8.56) a simple analysis shows that $y = \beta$ is a steady state which is asymptotically stable if the parameters are positive. What this means is that it is possible just to look at the data in Figures 8.15 and 8.16 and produce a reasonably accurate value for $\beta$ (it is the value the solution approaches as $t \to \infty$). One can also get an estimate of $\alpha$ by noting that at $t = 0$, $y'(0) = \alpha a (\beta - a)$. From the estimate of $\beta$, and using the approximation $y'(0) \approx (y(t_1) - y(0))/t_1 = (y_1 - a)/t_1$, one can obtain an estimate for $\alpha$. This information can then be used to produce reasonably good guesses for the minimization algorithm.

Numerous methods have been proposed for finding material parameters from data. For example, Varah [1982] uses splines but fits them to the data, and then uses the spline to find the parameters. This is known as data smoothing, and more recent work on this can be found in Ramsay et al. [2007]. Other possibilities are discussed in Tönsing et al. [2019], Dattner [2020], and Matsuda and Miyatake [2021].
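The starting guesses described above are simple to compute. The sketch below is one possible reading of that recipe, under assumptions that are not in the text: the data are sorted in time, the last data value is used as the estimate of $\beta$, the initial value is $a = 1$, and the test data are noise-free values from the closed-form logistic solution with $\alpha = 2$ and $\beta = 3$. The function name `initial_guess` is hypothetical.

```python
import math

def initial_guess(times, data, a):
    """Rough (alpha, beta) from the data, using the steady-state argument."""
    beta = data[-1]                       # value the data levels off at
    slope0 = (data[0] - a) / times[0]     # y'(0) ~ (y_1 - a)/t_1
    alpha = slope0 / (a * (beta - a))     # from y'(0) = alpha*a*(beta - a)
    return alpha, beta

a = 1.0                                   # assumed initial value
times = [0.025 * i for i in range(1, 80)]
data = [3.0 * a / (a + (3.0 - a) * math.exp(-2.0 * 3.0 * t)) for t in times]

alpha0, beta0 = initial_guess(times, data, a)
print(alpha0, beta0)
```

With noisy data a single point is a fragile estimate, so averaging the last few values for $\beta$, or fitting a line through the first few points for the slope, would likely be more robust.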

Exercises

Section 8.3

8.1. Suppose the data set is (−1, 0), (−1, 2), (0, 2), (1, −1).
(a) Fit a line to the data, and then on the same axes sketch the line and the data points.
(b) Fit a parabola to the data, and then on the same axes sketch the parabola and the data points.

8.2. Do Exercise 8.1 but use the data set (0, −3), (−1, 1), (−1, 3), (1, 1).

8.3. Fit the following functions to the data: $(-1, y_1)$, $(0, y_2)$, $(1, y_3)$. With this, sketch the resulting model function, and include the given data points.
(a) $g(x) = v_1 x + v_2 (x - 1)^3$; $y_1 = 7$, $y_2 = 1$, $y_3 = 1$
(b) $g(x) = v_1 (x + 1)^2 + v_2 (x - 1)^2$; $y_1 = -1/2$, $y_2 = 1$, $y_3 = 0$
(c) $g(x) = v_1 \dfrac{1}{x + 2} + v_2 \dfrac{1}{x - 2}$; $y_1 = 4/3$, $y_2 = 1$, $y_3 = 4/3$
(d) $g(x) = v_1 x + v_2 (1 - x^2)$; $y_1 = 1$, $y_2 = 2$, $y_3 = -1$

8.4. Find the least squares solution of the following:
(a) $x + y = 1$, $\;2x - y = 2$, $\;-x + 2y = 5$
(b) $x - 2y = 1$, $\;x + y = 4$, $\;-2x + y = 7$
(c) $-2x + y - z = -3$, $\;x - y + z = 2$, $\;x - z = 3$, $\;y - 2z = 1$
(d) $x = 1$, $\;2x = -1$, $\;-x = 1$, $\;x = -5$

8.5. For this exercise you must pick one of the following matrices:

$$A_1 = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & 1 \end{pmatrix}, \quad A_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 2 & 1 \end{pmatrix}, \quad A_3 = \begin{pmatrix} 1 \\ -2 \\ -1 \end{pmatrix}.$$

(a) Using (8.25), find $A^+$.
(b) Using the formula in Exercise 8.7(a), find $A^+$.
(c) It is said that $A^+$ is a left inverse for $A$, but it is not necessarily a right inverse. By calculating $A^+ A$ and $A A^+$, explain what this means.

8.6. This problem concerns the constant that best approximates the data $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. So, the model function is $g(x) = v$.
(a) Find the value of $v$ that minimizes $E(v) = \|g - y\|_2^2$.
(b) Find the value of $v$ that minimizes $E(v) = \|g - y\|_\infty$. Suggestion: first answer this for $n = 2$, and then $n = 3$. From this, explain what the answer is for general $n$.

8.7. In this exercise formulas for the Moore-Penrose pseudo-inverse are derived. Assume that the columns of the $n \times m$ matrix $A$ are independent.
(a) Using (8.22), show that $A^+ = R_m^{-1} G^T$, where $R_m$ is given in (8.23).

(b) Using the SVD, $A = U \Sigma V^T$. Use this to show that $A^+ = V \Sigma^{-T} U^T$, where $\Sigma^{-T}$ is the $m \times n$ diagonal matrix with diagonals $1/\sigma_1, 1/\sigma_2, \cdots, 1/\sigma_m$.

8.8. This exercise considers what happens when there is only one parameter when data fitting using linear least squares. Assume here that $f(x)$ is a given function.
(a) Suppose the model function is $g(x) = v f(x)$. Using least squares error, what is the resulting formula for $v$? In doing this, assume that there is at least one data point with $f(x_i)$ nonzero.
(b) Suppose the model function is $g(x) = v + f(x)$. Using least squares error, what is the resulting formula for $v$?
(c) Suppose that $f(x) = x^3$, and the data are: (−1, 1), (0, 2), (2, 3). Sketch the model function from part (a), and include the data points in the sketch.
(d) Do part (c) but use the model function from part (b).

8.9. This exercise concerns fitting the data: (−1, 8), (0, 4), (1, 0), (2, 2), (3, 10) using the model function (8.12) with $m = 3$.
(a) Plot (by hand) the data. Use this to select a three parameter $(v_1, v_2, v_3)$ model function of the form given in (8.12). Make sure to explain why it is an appropriate choice for this data set.
(b) Compute $v_1, v_2, v_3$ using (8.25). Also, report on the value of $\kappa_\infty(A^T A)$. Finally, plot the resulting model function and data set on the same axes.
(c) Compute $v_1, v_2, v_3$ using the formula in Exercise 8.7(a). Also, report on the value of $\kappa_\infty(R_m)$. Finally, plot the resulting model function and data set on the same axes.

8.10. This exercise concerns fitting the data: (−4, 2.1), (−3, 2.2), (−2, 2.5), (−1, 3), (0, 6), (1, 9), (2, 24) using the model function (8.12) with $m = 2$.
(a) Plot (by hand) the data. Use this to select a two parameter $(v_1, v_2)$ model function of the form given in (8.12). Make sure to explain why it is an appropriate choice for this data set.
(b) Compute $v_1, v_2$ using (8.25). Also, report on the value of $\kappa_\infty(A^T A)$. Finally, plot the resulting model function and data set on the same axes.
(c) Compute $v_1, v_2$ using the formula in Exercise 8.7(a). Also, report on the value of $\kappa_\infty(R_m)$. Finally, plot the resulting model function and data set on the same axes.

8.11. In the case when the model function is $g(x) = v$, (8.29) takes the form $E_R(v) = \sum_{i=1}^{n} (v - y_i)^2 + \lambda v^2$. Also, let $E(v) = \sum_{i=1}^{n} (v - y_i)^2$.
(a) By minimizing $E_R$, show that $v = v_M$, where $v_M = \alpha y_M$, $y_M$ is the average of the $y_i$'s, and $\alpha = 1/(1 + \lambda/n)$.
(b) Show that the minimum value of $E(v)$ occurs when $v = y_M$.

(c) The regularization parameter $\lambda$ is usually chosen to be fairly small, with the objective that $E(v_M)$ is close to the minimum value of $E(v)$. With this in mind, suppose that the goal is that $|E(v_M) - E(y_M)| \approx 10^{-6}$. Explain why the resulting value of $\lambda$ depends on the number of data points. Hint: Show that $E(v_M) = E(y_M) + n(\alpha - 1)^2 y_M^2$.

8.12. This exercise explores how regularization can be used with the QR approach.
(a) Starting with (8.29), modify the argument used to obtain (8.21) to show that $E_R(v) = \|R_m v - y_m\|_2^2 + \lambda \|v\|_2^2 + \|y_{n-m}\|_2^2$.
(b) Show that the minimizer of $E_R(v)$ is determined by solving $\left( R_m^T R_m + \lambda I \right) v = R_m^T y_m$.
(c) Show that the equation in part (b) reduces to (8.22) when $\lambda = 0$.

8.13. Suppose that the error function is $E_b(v) = \|Av - y\|_2^2 + \lambda \|v - b\|_2^2$.
(a) In this version of regularization, what is a reason, or what is the objective, for including $b$ in this way?
(b) Find the minimizer.

8.14. This exercise involves Tikhonov regularization, which uses the error function $E_T(v) = \|Av - y\|_2^2 + \lambda \|Dv\|_2^2$, where $D$ is a given $n \times m$ matrix. Note that this reduces to ridge regularization if $D = I$.
(a) Show that the minimizer of $E_T(v)$ is found by solving $(A^T A + \lambda D^T D) v = A^T y$.
(b) Tikhonov regularization can be used to smooth the curve shown in Figure 8.18. The points on this curve are $(x_i, y_i)$, for $i = 1, 2, \cdots, n$, where the $x_i$'s are equally spaced. Now, the curvature at $x_i$ is approximately $(y_{i+1} - 2y_i + y_{i-1})/h^2$, where $h$ is the $x_i$ spacing (see Section 7.2.1). Less noisy curves are associated with smaller values of the curvature, which means we want to reduce the value of $|y_{i+1} - 2y_i + y_{i-1}|/h^2$. The objective is to find $v$ that is close to $y$ but which also has curvature values that are not too big. Explain why this can be achieved by finding the minimizer of $E_T(v)$, where $A = I$ is $n \times n$, and $D$ is the $(n - 2) \times n$ matrix given as

$$D = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix}.$$

[Figure 8.18: Noisy curve that is smoothed using Tikhonov regularization in Exercise 8.14. Axes: x-axis from 0 to 1, y-axis from −1 to 1.]

Note that the $h^4$ term has been absorbed into the $\lambda$. For part (c) you need curve data, which is either provided or you generate. The curve in Figure 8.18 was obtained using the MATLAB commands: x=linspace(0,1,n); y=exp(-10*(x-0.6).^2).*cos(6*pi*x); y=y+0.4*(rand(1,n)-0.5);.
(c) On the same axes, plot the noisy and smoothed versions of the curve. What value of $\lambda$ did you use and why did you make this choice? In answering this question you should include two additional plots: one using $\lambda/20$ and one using $20\lambda$ in place of $\lambda$. Why is your $\lambda$ so much larger than what was used in Figure 8.8?

8.15. (a) The normal equation has a unique solution if the columns of $A$ are independent. Explain why this requires that there are at least $m$ distinct (i.e., different) $x_i$'s in the data set.
(b) In the case when the number of data points is the same as the number of parameters, show that least squares reduces to interpolation. This is assuming, of course, that the columns of $A$ are independent.
(c) Explain why the normal equation (8.17) does not have a unique solution, and $A^+$ is not defined, if one of the fitting functions $\phi_j(x)$ in (8.12) is zero at all of the $x_i$'s.
(d) Suppose that all of the fitting functions $\phi_j(x)$ in (8.12) are zero at $x_1$. Explain why the resulting solution for $v$ does not depend on $y_1$.

8.16. (a) Show that the system of normal equations (8.10) can be written as

$$v_1 + v_2 x_M = y_M,$$

$$v_1 + v_2 \left( x_M + \frac{1}{n x_M} \sum_{i=1}^{n} (x_i - x_M)^2 \right) = y_M + \frac{1}{n x_M} \sum_{i=1}^{n} (x_i - x_M)(y_i - y_M),$$

where $x_M = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $y_M = \frac{1}{n} \sum_{i=1}^{n} y_i$.
(b) By solving the equations in part (a), show that the resulting model function can be written as $g(x) = y_M + m(x - x_M)$, where the slope is

$$m = \frac{\sum_{i=1}^{n}(x_i - x_M)(y_i - y_M)}{\sum_{i=1}^{n}(x_i - x_M)^2}.$$
The numerator and denominator are proportional to the covariance and the square of the x-variable standard deviation, respectively.

8.17. This exercise considers a least squares problem using the relative error. As usual, the data are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. The error function in this case is
$$E_S = \sum_{i=1}^{n}\left(\frac{g(x_i) - y_i}{y_i}\right)^2.$$
It is assumed here that the $y_i$'s are nonzero.
(a) In the case when $g(x) = v_1 + v_2 x$, show that the minimum occurs when
$$\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{ad - b^2}\begin{pmatrix} a & -b \\ -b & d \end{pmatrix}\begin{pmatrix} \sum 1/y_i \\ \sum x_i/y_i \end{pmatrix},$$
where $a = \sum x_i^2/y_i^2$, $b = \sum x_i/y_i^2$, and $d = \sum 1/y_i^2$.
(b) Calculate $v_1$ and $v_2$ for the data in Table 8.1. On the same axes, plot the resulting line, the data points, and the line found using (8.11).

8.18. This exercise considers the least squares problem using the general linear model function $g(x) = v_1 p(x) + v_2 q(x)$. As usual, the data are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, where n > 2.
(a) Using the error function $E(v_1, v_2) = \sum_{i=1}^{n}[g(x_i) - y_i]^2$, show that
$$\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = \frac{1}{\sum p_i^2 \sum q_i^2 - \left(\sum p_i q_i\right)^2}\begin{pmatrix} \sum q_i^2 & -\sum p_i q_i \\ -\sum p_i q_i & \sum p_i^2 \end{pmatrix}\begin{pmatrix} \sum p_i y_i \\ \sum q_i y_i \end{pmatrix},$$
where $p_i = p(x_i)$ and $q_i = q(x_i)$.
(b) Show that the solution in part (a) reduces to the one given in (8.11) in the case when p(x) = 1 and q(x) = x.
(c) Suppose the model function is $g(x) = v_1 x + v_2 x^2$ and the data are given in Table 8.6. Find $v_1$ and $v_2$.
(d) Suppose the model function is $g(x) = v_1 \sin(x) + v_2 \sin(2x)$ and the data are given in Table 8.6. Explain why it is not possible to determine $v_1$ and $v_2$.

Table 8.6 Data for Exercise 8.18.

x_i    −π    0    π
y_i     1    0    2


(e) The question is, what requirement is needed to prevent the problem in part (d), or more generally what is needed so the solution in part (a) is defined. Setting $\mathbf{p} = (p_1, p_2, \cdots, p_n)$ and $\mathbf{q} = (q_1, q_2, \cdots, q_n)$, explain why the needed condition is that the angle θ between p and q is not 0 or π. Note that if either p or q is the zero vector then θ = 0.
(f) What is the matrix A, as defined in (8.14), for this problem? Explain why the condition derived in part (e) is equivalent to the requirement that the columns of A are independent.
(g) What are p and q for part (d)?
(h) If n = 3 and p = (1, −1, 1), give two nonzero examples for q where the solution in part (a) is not defined.

Section 8.4

8.19. In $\mathbb{R}^3$, find an orthogonal matrix that reflects vectors through the given plane: (a) x − 3y + 4z = 0, (b) 4x − y − 2z = 0, (c) x = 0.

8.20. Use the Householder method to find a QR factorization for the following:
(a) $\begin{pmatrix} 0 & 0 \\ 2 & -2 \end{pmatrix}$  (b) $\begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix}$
(c) $\begin{pmatrix} 1 & -1 & 1 \\ 1 & 0 & 1 \\ 1 & -1 & 0 \end{pmatrix}$  (d) $\begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 0 \\ 2 & -1 & 1 \end{pmatrix}$
(e) $\begin{pmatrix} 0 & 1 \\ 0 & 2 \\ 1 & -1 \end{pmatrix}$  (f) $\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$

8.21. How many QR factorizations are there of
$$A = \begin{pmatrix} 2 & 0 \\ 1 & -1 \end{pmatrix}?$$

8.22. The following are true or false. If true, provide an explanation why it is true, and if it is false provide an example demonstrating this along with an explanation why it shows the statement is incorrect.
(a) If Q is orthogonal then so is −Q.
(b) If Q is orthogonal then so is I + Q.
(c) If Q is orthogonal then so is $Q^2$.
(d) If Q and A are n × n, and Q is orthogonal, then $\|QA\|_2 = \|A\|_2$.

8.23. Using the fact that if Q is an n × n orthogonal matrix then $\|Q\mathbf{x}\|_2 = \|\mathbf{x}\|_2$, for any $\mathbf{x} \in \mathbb{R}^n$, show the following:

(a) $\kappa_2(Q) = 1$.
(b) For any n × n invertible matrix A, $\kappa_2(AQ) = \kappa_2(A)$.
(c) For any n × n invertible matrix A, $\kappa_2(QA) = \kappa_2(A)$.

8.24. Suppose
$$Q = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
is an orthogonal matrix.
(a) Explain why it is always possible to find θ and φ so that a = cos θ, c = sin θ, and b = cos φ, d = sin φ.
(b) Using the result from part (a), show that the only two possibilities for Q are
$$Q = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \quad\text{or}\quad Q = \begin{pmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{pmatrix}.$$
For the record, the one on the left is a rotation and the one on the right is a reflection.

8.25. Find the connection(s) between the numbers a, b, c, and d so the following is a Householder matrix:
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}.$$

8.26. This problem concerns some of the properties of the Householder matrix in (8.37).
(a) Explain why H is symmetric.
(b) Using (8.37), show that $H^2 = I$.
(c) Find Hv.
(d) If v is a normal vector then so is αv for any nonzero number α. Show that the Householder matrix obtained using αv is exactly the same as the one obtained using v.

8.27. If A is an invertible n × n matrix, then $A^T A$ is symmetric and positive definite. Because of this, it is possible to write the Cholesky factorization of the product as $A^T A = LDL^T$, where L is a unit lower triangular matrix and D is diagonal with positive entries along the diagonal (see Exercise 3.14). Show that a QR factorization of A has $Q = AL^{-T}D^{-1/2}$ and $R = D^{1/2}L^T$. To do this you need to show that the matrices are orthogonal and upper triangular, respectively, and they multiply to give A.

8.28. This problem explores using QR instead of LU when solving Ax = b. In what follows assume that A is an n × n invertible matrix.
(a) Assuming the QR factorization of A is known, using an explanation similar to that used in Section 3.1, show that Ax = b can be solved by first computing $\mathbf{y} = Q^T\mathbf{b}$, and then solving Rx = y. In this problem this is called the QR approach.

Figure 8.19 Example data set to be fitted by elliptic curve in Exercise 8.30.

(b) Explain why R is invertible if A is invertible. Also, using the fact that ||Qx||2 = ||x||2 , ∀ x, show that κ2 (R) = κ2 (A). (c) Use the QR approach to solve, by hand, Ax = b when A is given in (8.44) and b = (−3, 1, 1)T . (d) Use the QR approach to solve the matrix equation in (3.11), for n = 4, 8, 12, 16, 160, 1600, and then report the error as in Table 3.3. Comment on any differences you observe with the LU results. (e) Use the QR approach to solve the Vandermonde matrix equation in (3.12), for n = 4, 8, 12, 16, 160, and then report the error as in Table 3.4. Comment on any differences you observe with the LU results. Section 8.5 8.29. Assume the data are (x1 , y1 ), (x2 , y2 ), · · · , (xn , yn ), with y1 < y2 < · · · < yn . Also, the model function is a constant, so g(x) = v1 . (a) For n = 3, find v1 that minimizes ||g − y||1 . (b) For n = 3, find v1 that minimizes||g − y||∞ . (c) How do your answers from (a) and (b) change for general n, with n ≥ 3? (d) In data analysis, words such as mean, median, harmonic mean, midrange, and root mean square are used. Do any of these apply to your answers in parts (a)-(c)? (e) Answer part (d) for Exercise 8.6. 8.30. This exercise explores fitting an ellipse to a data set, and an example of such data is given in Figure 8.19. The formula for the ellipse is

$$\left(\frac{x}{a}\right)^2 + \left(\frac{y}{b}\right)^2 = 1,$$

where a and b are positive and to be determined from the curve fitting. In what follows you are to find an error function and from this find a and b.
(a) For a given a and b, write down an expression that can be used as a measure of the error between a data point $(x_i, y_i)$ and the curve. The requirements on this expression are: it is non-negative and only zero if $(x_i, y_i)$ is on the ellipse.
(b) Using your expression from part (a), sum over the data points to produce an error function E(a, b).
(c) Minimize E(a, b) and find explicit formulas for a and b. Note that if your error function does not result in you being able to find explicit formulas, then you need to redo parts (a) and (b) so you can.

Section 8.6

8.31. Given the model function and the $(x_i, y_i)$ data values, do the following:
(a) Transform the model function into one of the form $G(X) = V_1 + V_2 X$, and give the corresponding $(X_i, Y_i)$ data values (similar to Table 8.5).
(b) Calculate $V_1$, $V_2$ and the corresponding values of $v_1$, $v_2$ using the error function (8.52), and also using (8.54).
(c) Plot the $(x_i, y_i)$ data and the two resulting model functions on the same axes for 0 ≤ x ≤ 2. Which is the better error function, (8.52) or (8.54)?

The model functions and data are:
(a) $g(x) = \ln(v_1 + v_2 x^2)$; data: (0, −1), (1, 1), (2, 2)
(b) $g(x) = 1/\left(v_1 + v_2 \sin(\pi x/4)\right)^2$; data: (0, 1), (1, 4), (2, 9)
(c) $g(x) = v_1 10^{v_2 x}$; data: (0, 2), (1, 5), (2, 12)
(d) $g(x) = (x + v_1)/(x + v_2)$; data: (0, 5), (1, 3), (2, 2)

8.32. The elliptical path of a comet is described using the equation
$$r = \frac{p}{1 + \varepsilon \cos\theta},$$
where r is the radial distance to the Sun, θ is the angular position, ε is the eccentricity of the orbit, and p is the semi-latus rectum. In this exercise you are to use data for the comet 27P/Crommelin, which is given in Table 8.7.

Table 8.7 Data for the comet 27P/Crommelin, used in Exercise 8.32.

θ_i    0.00    1.57    3.14     4.71    6.28
r_i    0.62    1.23    16.14    1.35    0.62

(a) Writing the model function as r = g(θ), and using least squares, the error function is
$$E(p, \varepsilon) = \sum_{i=1}^{n}[g(\theta_i) - r_i]^2.$$
What two equations need to be solved to find the value(s) of p and ε that minimize this function?
(b) By writing the model function as
$$\frac{1}{r} = \frac{1 + \varepsilon \cos\theta}{p},$$
explain how the nonlinear regression problem can be transformed into one with the model function $R = V_1 + V_2 \cos(\theta)$. Also, what happens to the data values?
(c) Writing the model function in part (b) as R = G(θ), and using the least squares error function
$$E(V_1, V_2) = \sum_{i=1}^{n}[G(\theta_i) - R_i]^2,$$
compute $V_1$ and $V_2$. Using these values, determine p and ε.
(d) Redo part (c) but use the relative least squares error function
$$E_S(V_1, V_2) = \sum_{i=1}^{n}\left(\frac{G(\theta_i) - R_i}{R_i}\right)^2.$$
(e) Does part (c) or does part (d) produce the better answer? You need to provide an explanation for your conclusion, using a quantified comparison, graphs, and/or other information.

8.33. This exercise explores using the power law function (8.55) with the data in Table 8.8. What is given, for each animal, is its typical mass and its maximum relative speed. The latter is the animal's maximum speed divided by its body length. The body length does not include the tail and is measured in meters.
(a) Taking x to be the mass, fit the power law to the data in Table 8.8 by transforming it into a problem in linear regression. Compute the parameters using (8.52). With this, plot the data and power law curve using a log-log plot. Finally, compute the value of the nonlinear error function $E = \sum_{i=1}^{n}(v_1 x_i^{v_2} - y_i)^2$.
(b) Do part (a) but, instead of (8.52), use (8.54).
(c) What was the running speed of a Tyrannosaurus rex? Assume here that its tail was approximately 1/3 of its total body length. Explain which regression method you used to answer this question, and why you made this choice.

Table 8.8 Data for Exercise 8.33 adapted from Iriarte-Díaz [2002].

Animal           Mass (kg)    Relative Speed (1/s)
canyon mouse     1.37e−02     39.1
chipmunk         5.10e−02     42.9
red squirrel     2.20e−01     20.5
snowshoe hare    1.50e+00     35.8
red fox          4.80e+00     28.7
human            7.00e+01     7.9
reindeer         1.00e+02     12.7
wildebeest       3.00e+02     11.0
giraffe          1.08e+03     3.8
rhinoceros       2.00e+03     1.8
elephant         6.00e+03     1.4

8.34. The computing times for three matrix algorithms are given in Table 8.9, which is adapted from Table 4.12. The assumption is that the computing time T can be modeled using the power law function $T = \alpha N^{\beta}$. The goal of this exercise is to find α and β. Note that you do not need to know anything about how the matrix methods work to do this exercise.
(a) By transforming this into a problem in linear regression, fit the model function to the LU times, and then plot the model function and data on the same axes.
(b) For larger matrices, the flop count for LU is $\frac{2}{3}n^3$. Is there any connection between the values in this formula and your values from part (a)? Should there be?
(c) Show that to minimize the least-squares error $E = \sum [T(N_i) - T_i]^2$ one gets that
$$\alpha = \frac{\sum T_i N_i^{\beta}}{\sum N_i^{2\beta}}.$$
(d) The result in part (c) means E(α, β) can be written as a function of β only, in other words E(α, β) = G(β). It is not unreasonable to expect that the computing time is a reflection of the flops used to calculate the factorization. If so, then β should be a positive integer. What positive integer value of β minimizes G(β)? What is the resulting value for α?

Table 8.9 Data for Exercise 8.34. Computing time, in seconds, for a LU, QR, and SVD factorization of a random N × N matrix using MATLAB.

N       LU        QR        SVD
200     0.0003    0.0006    0.0050
400     0.0011    0.0021    0.0192
600     0.0021    0.0046    0.0392
800     0.0037    0.0085    0.0769
1000    0.0061    0.0143    0.1248
2000    0.0280    0.0890    0.9859
4000    0.1585    0.5242    5.6501

Also, using these values, plot the model function and data on the same axes. Make sure to explain how you find β.
(e) Redo (a) for the QR times.
(f) Redo (a) for the SVD times.

Section 8.7

The exercises to follow require synthetic data, which is either provided or you generate. For the latter, suppose that by using either the exact solution, or a numerical solution, you determine values $\mathbf{y} = (y_1, y_2, \cdots, y_n)^T$ of the solution at the time points $t_1, t_2, \cdots, t_n$. By using the MATLAB commands: q=a*(2*rand(n,1)-1); yd=(1+q).*y; you can produce synthetic data. In doing this, the synthetic data values will satisfy $(1 - a)\mathbf{y} \le \mathbf{y}_d \le (1 + a)\mathbf{y}$. For example, in Figure 8.16, a = 0.1 while in Figure 8.17, a = 0.2.

8.35. This problem concerns the Bernoulli equation
$$y' + y^3 = \frac{y}{a + t}, \quad\text{for } t > 0,$$
where y(0) = 1. The exact solution is
$$y = \frac{a + t}{\sqrt{\beta + \frac{2}{3}(a + t)^3}},$$
where $\beta = a^2(1 - 2a/3)$.
(a) Taking a = 0.01, generate synthetic data for this problem using 400 equally spaced time points over the interval 0 < t ≤ 2.
(b) Using either the data from part (a), or a data set provided, find a. Make sure to explain the method you use to solve this problem. Also, plot the data and numerical solution using these parameter values.

8.36. In the description of the dynamics of excitons in semiconductor physics, one finds the following problem
$$n' = \gamma - \alpha n^2 x,$$
$$x' = \alpha n^2 x - \frac{x}{1 + x},$$
where n(0) = 1 and x(0) = 1. Also, α and γ are constants that satisfy 0 < γ < 1 and α > 0. Note, x(t) and n(t) are densities and are therefore non-negative.
(a) Taking α = 0.1 and γ = 0.2, generate synthetic data for n and x using 500 equally spaced time points over the interval 0 < t ≤ 60.
(b) Using either the data from part (a), or a data set provided, find α and γ. Also, plot the data and numerical solution using these parameter values. If you modify the minimization code make sure to explain why you did so, and exactly what you changed.

Additional Questions

8.37. Typically, in reporting experimental data, at any given $x_i$, the mean $y_i$ and standard deviation $\sigma_i$ are given. Examples are shown in Figures 5.1 and 8.12. The objective of this exercise is to derive a regression procedure that produces a predicted value y that: i) of foremost importance, falls within the interval $y_i - \sigma_i < y < y_i + \sigma_i$, and ii) of secondary importance, has $y \approx y_i$.

(a) Setting $E_i = \left((y - y_i)/\sigma_i\right)^p$, on the same axes, sketch $E_i$ as a function of y for p = 2, 4, 8. Use this to explain why taking a larger p has the effect of giving more weight to objective (i).
(b) Taking p = 2, y = a + bx, and the error function $E = \sum_{i=1}^{n} E_i$, find a and b that minimize E.
(c) Suppose one gets better, and presumably more expensive, experimental equipment so the standard deviations are all reduced by a factor of, say, 10. Assuming the same $y_i$ values are obtained, explain why the answer in part (b) is unchanged.

8.38. This exercise considers using a cubic spline as the model function when using least squares. An example of this is shown in Figure 8.20. Assume the data points are $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$. These are going to be fitted with the cubic spline function
$$s(x) = \sum_{j=0}^{m+1} a_j B_j(x),$$
where $B_j(x)$ is given in (5.23). The nodes for the B-splines are assumed to be $\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_m$, where the $\bar{x}_j$'s are equally spaced between $\bar{x}_1$ and $\bar{x}_m$.

Figure 8.20 Temperature data over a two year period, and the cubic spline fit using least squares as considered in Exercise 8.38 [NCEI, 2015]. (Vertical axis: Temperature; horizontal axis: Date.)

Note that in (5.23), h is the spacing between the $\bar{x}_j$ points. In this exercise, m is taken to be given, and then least squares is used to determine the $a_j$'s. Also, the reason for including j = 0 and j = m + 1 in the sum is explained in Section 5.4.1.
(a) Show that to obtain the minimum of the error function
$$E(a_0, a_1, \cdots, a_{m+1}) = \sum_{i=1}^{n}[s(x_i) - y_i]^2$$
one needs to solve Ca = d, where $\mathbf{a} = (a_0, a_1, \cdots, a_{m+1})^T$, $C = BB^T$, $\mathbf{d} = B\mathbf{y}$, $\mathbf{y} = (y_1, y_2, \cdots, y_n)^T$, and B is a (m + 2) × n matrix. In particular, the $(\ell, k)$ entry of B is $B_{\ell-1}(x_k)$.
(b) What value should you take for $\bar{x}_1$? What about $\bar{x}_m$?
(c) Taking m = 3, compute the $a_j$'s using the data in Table 8.10. With this, plot s(x) and the data on the same axes, for 1 ≤ x ≤ 715. Also, compute $\|C\|_\infty$ and comment on whether the matrix is ill-conditioned.
(d) Redo part (c), but take m = 4, m = 5, m = 6, and m = 13.
(e) Based on your results from parts (c) and (d), for a data set with n points, what do you recommend to use for m? In answering this, keep in mind that the model function should be capable of reproducing the more significant trends in the data, as well as provide a reasonable approximation over the entire interval.

Table 8.10 Temperature data for Exercise 8.38(c) [NCEI, 2015]. Note, x is measured in days, with x = 1 corresponding to January 1, and x = 715 corresponding to December 15 of the following year.

x    1      48    94     144    193    242    286    334    382    430    474    532    592    651    715
y    -5.6   -5    10.6   17.2   25     27.2   17.2   6.1    -4.3   -3.3   23.3   22.8   31.7   25.6   12.8

Chapter 9

Optimization: Descent Methods

The problem central to this chapter, and the preceding one, is easy to state: given a function F(x), find the point $\mathbf{x}_m \in \mathbb{R}^n$ where F achieves its minimum value. This can be written as
$$F(\mathbf{x}_m) = \min_{\mathbf{x} \in \mathbb{R}^n} F(\mathbf{x}). \tag{9.1}$$

The point xm is called the minimizer or optimal solution. The function F is referred to as the objective function, or the error function, or the cost function (depending on the application). Also, because the minimization is over all of Rn , this is a problem in unconstrained optimization.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_9

9.1 Introduction

To illustrate the types of problems to be considered, Newton's method was used in Section 3.10 to solve
$$x^2 + 4y^2 = 1, \qquad 4x^2 + y^2 = 1.$$
Another way to solve it is to rewrite it as a minimization problem. This can be done, for example, by taking
$$F(x, y) = (x^2 + 4y^2 - 1)^2 + (4x^2 + y^2 - 1)^2. \tag{9.2}$$
With this, F(x, y) is never negative, and it is only equal to zero when the original equations are satisfied. The surface plot of F(x, y), and its associated contour plot, are given in Figure 9.1. The four local minimum points for F(x, y) correspond to the four intersection points seen in Figure 3.5. Although the minima correspond exactly with the four solutions, the function F(x, y) has multiple critical points. In particular ∇F = 0 at the four minimum points, at the local maximum at (x, y) = (0, 0), and at the four saddle points between the minimum points. What this means is that any method of finding the minima that involves trying to find where ∇F = 0 will need a way of deciding if the point is a minimum or not.

Figure 9.1 Top: Function given in (9.2). Bottom: Contour curves for the function.

It is assumed in what follows that the function to be minimized is continuous, and it has one or more minima. For those who like their functions to have a unique minimum, additional assumptions are necessary. One often made is that the function is strictly convex. The function in Figure 9.1 does not have this property because it is possible to find a straight line connecting two points on the surface that intersects the surface other than at the two original points. In contrast, the surface in Figure 9.4 is strictly convex. Convexity will play a role in Section 9.3 but not in the later sections where more general nonlinear problems are considered. The methods developed in those sections are designed to find local minima. The question of how to find the global minimum of a function will not be considered, and more comments will be made about this in Section 9.5.1.
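The claim that ∇F vanishes at more than just the minima can be checked numerically. The following is a minimal Python sketch, not taken from the text: the gradient formula is obtained by applying the chain rule to (9.2), and the axis point (√(5/17), 0) is an assumption worked out here by setting y = 0 and solving ∂F/∂x = 0, which gives one of the critical points lying between two minima.

```python
import math

def F(x, y):
    """Objective function (9.2)."""
    return (x**2 + 4*y**2 - 1)**2 + (4*x**2 + y**2 - 1)**2

def grad_F(x, y):
    """Gradient of (9.2), from the chain rule."""
    u = x**2 + 4*y**2 - 1
    w = 4*x**2 + y**2 - 1
    return (4*x*u + 16*x*w, 16*y*u + 4*y*w)

s = 1 / math.sqrt(5)     # both original equations hold when x^2 = y^2 = 1/5
t = math.sqrt(5 / 17)    # critical point on the x-axis (derived here, not in the text)

points = {
    "minimum": (s, s),
    "local maximum": (0.0, 0.0),
    "point between minima": (t, 0.0),
}

for name, (x, y) in points.items():
    gx, gy = grad_F(x, y)
    print(f"{name:22s} F = {F(x, y):.4f}   |grad F| = {math.hypot(gx, gy):.1e}")
```

All three points have a gradient that is zero to rounding error, yet only the first has F = 0, which is why a test beyond ∇F = 0 is needed to identify a minimum.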

9.2 Descent Methods

The idea underlying a descent method is simple: if you are standing somewhere on the surface shown in Figure 9.1, and want to go downhill, you pick a direction of descent and then start walking in that direction. This simple idea can be used to locate a local minimum, but two recurring decisions must be made: which particular downhill direction to use, and how far to walk in this direction before stopping and turning to walk in a different descent direction.

The basic algorithm consists of picking a starting position $\mathbf{x}_1$ and then constructing a sequence $\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4, \ldots$ that (hopefully) converges to $\mathbf{x}_m$. In doing this it is required that the value of F(x) decreases with each step. To get this to happen, suppose we are located at $\mathbf{x}_k$. You first need to identify a direction of descent $\mathbf{d}_k$ and you then need to "walk" a distance $\alpha_k$ in that direction. In mathematical terms we have the following formula
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k, \quad\text{for } k = 1, 2, 3, \ldots. \tag{9.3}$$
This is illustrated in Figure 9.2.

Figure 9.2 In a descent method you move from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ along a direction of descent $\mathbf{d}_k$. The ∗ designates the location of the minimizer $\mathbf{x}_m$.

Assuming, for the moment, that we know $\mathbf{d}_k$ then the question arises as to how far we should walk in this direction. In other words, what is the value of $\alpha_k$? In optimization, determining $\alpha_k$ is known as the line search problem. Given $\mathbf{d}_k$, then $\alpha_k$ should make $F(\mathbf{x}_k + \alpha \mathbf{d}_k)$ as small as possible. In other words, we want to find α that minimizes
$$q(\alpha) = F(\mathbf{x}_k + \alpha \mathbf{d}_k). \tag{9.4}$$
This can be done by solving $q'(\alpha) = 0$. This equation, after using the chain rule, can be written as
$$\nabla F(\mathbf{x}_k + \alpha \mathbf{d}_k) \bullet \mathbf{d}_k = 0. \tag{9.5}$$
It needs to be stated that finding the minimum of q(α) does not, necessarily, find the minimizer $\mathbf{x}_m$. Rather, it is the minimum of F(x) in a particular direction. In this sense the descent method replaces the multidimensional minimization in (9.1) with a sequence of one-dimensional minimization problems.
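To make the line search concrete, the sketch below applies (9.4) and (9.5) to the function in (9.2). It is only an illustration under stated assumptions: the starting point (1, 0.5), the choice of the negative gradient as the direction, the bracket [0, 0.01], and the use of bisection to solve q'(α) = 0 are all choices made here, not taken from the text.

```python
def F(x, y):
    """Objective function (9.2)."""
    return (x**2 + 4*y**2 - 1)**2 + (4*x**2 + y**2 - 1)**2

def grad_F(x, y):
    """Gradient of (9.2), from the chain rule."""
    u = x**2 + 4*y**2 - 1
    w = 4*x**2 + y**2 - 1
    return (4*x*u + 16*x*w, 16*y*u + 4*y*w)

xk = (1.0, 0.5)                   # current point (an arbitrary choice)
g = grad_F(*xk)
dk = (-g[0], -g[1])               # a downhill direction: the negative gradient

def q(alpha):
    """q(alpha) = F(x_k + alpha d_k), as in (9.4)."""
    return F(xk[0] + alpha*dk[0], xk[1] + alpha*dk[1])

def dq(alpha):
    """q'(alpha) = grad F(x_k + alpha d_k) . d_k, the left side of (9.5)."""
    gx, gy = grad_F(xk[0] + alpha*dk[0], xk[1] + alpha*dk[1])
    return gx*dk[0] + gy*dk[1]

# Check the chain-rule formula against a centered difference of q at alpha = 0.
h = 1e-6
fd = (q(h) - q(-h)) / (2*h)

# Solve q'(alpha) = 0 by bisection on a bracket where q' changes sign.
lo, hi = 0.0, 0.01                # bracket found by inspection for this example
assert dq(lo) < 0 < dq(hi)
for _ in range(60):
    mid = 0.5*(lo + hi)
    if dq(mid) < 0:
        lo = mid
    else:
        hi = mid
alpha_k = 0.5*(lo + hi)

print("q'(0): formula", dq(0.0), " finite difference", fd)
print("alpha_k =", alpha_k, " F before =", q(0.0), " F after =", q(alpha_k))
```

The value of F drops substantially but does not reach zero, which illustrates the remark above: one line search minimizes F along a single direction and generally does not land on the minimizer.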

9.2.1 Descent Directions

In selecting a direction of descent you might think that you should simply pick one that points towards the minimizer (unlike what is shown in Figure 9.2). The fact is that at $\mathbf{x}_k$ you have no idea where $\mathbf{x}_m$ is located, and all you are able to do is determine if a direction is downhill or uphill. What is derived below is a test that can be used to verify that $\mathbf{d}_k$ is a direction of descent. The discussion to follow assumes that $\mathbf{x} \in \mathbb{R}^2$, but the formulas derived will hold for $\mathbf{x} \in \mathbb{R}^n$.

Recall from calculus that given a function y = F(x), if $\nabla F(\mathbf{x}_k) \ne \mathbf{0}$, then the direction of steepest descent at $\mathbf{x}_k$ is $-\nabla F(\mathbf{x}_k)$. Moreover, $-\nabla F(\mathbf{x}_k)$ is perpendicular to the level curve containing $\mathbf{x}_k$ (see Figure 9.3). To be a descent direction, $\mathbf{d}_k$ must be on the same side of the level curve as $-\nabla F(\mathbf{x}_k)$. Said another way, the angle θ between $\mathbf{d}_k$ and $-\nabla F(\mathbf{x}_k)$ must be less than 90°. Now, according to the law of cosines,
$$\|\mathbf{d}_k\|_2 \, \|\nabla F(\mathbf{x}_k)\|_2 \cos\theta = -\mathbf{d}_k \bullet \nabla F(\mathbf{x}_k).$$
Since the left-hand side of this equation is positive (cos θ > 0 when θ is less than 90°), it follows that the requirement for $\mathbf{d}_k$ to be a descent direction at $\mathbf{x}_k$ is
$$\mathbf{d}_k \bullet \nabla F(\mathbf{x}_k) < 0. \tag{9.6}$$

Figure 9.3 Angle θ between a direction of descent $\mathbf{d}_k$ and the direction of steepest descent (the black arrow).

This leaves open the question of what to pick for $\mathbf{d}_k$, and several possibilities are considered in this chapter. To illustrate how to use descent methods we will use them to solve Ax = b. After that they will be used to solve nonlinear equations. Throughout the development it is assumed that $F \in C^1(\mathbb{R}^n)$, which means that the first partial derivatives of F(x) are defined and continuous everywhere.
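The test in (9.6) is a single dot product, so it is easy to try out. The Python sketch below is an illustration only: the function names, the point (1, 0.5), and the candidate directions are assumptions chosen here, and the function used is the one in (9.2).

```python
import math

def grad_F(x, y):
    """Gradient of the function in (9.2), from the chain rule."""
    u = x**2 + 4*y**2 - 1
    w = 4*x**2 + y**2 - 1
    return (4*x*u + 16*x*w, 16*y*u + 4*y*w)

def dot(u, v):
    return sum(ui*vi for ui, vi in zip(u, v))

def is_descent_direction(d, grad):
    """Test (9.6): d is a descent direction when d . grad F < 0."""
    return dot(d, grad) < 0

def angle_to_steepest_descent(d, grad):
    """Angle, in degrees, between d and -grad F."""
    c = -dot(d, grad) / (math.sqrt(dot(d, d)) * math.sqrt(dot(grad, grad)))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

grad = grad_F(1.0, 0.5)                       # equals (56, 14.5) at this point
candidates = [
    (-grad[0], -grad[1]),                     # steepest descent, angle 0
    (-1.0, 3.0),                              # downhill, but far from steepest
    (1.0, -1.0),                              # uphill
    (grad[1], -grad[0]),                      # tangent to the level curve
]
results = [is_descent_direction(d, grad) for d in candidates]
for d, ok in zip(candidates, results):
    print(d, "descent" if ok else "not descent",
          f"angle = {angle_to_steepest_descent(d, grad):.1f} deg")
```

The tangent direction fails the test because (9.6) is a strict inequality: an angle of exactly 90° is not a descent direction.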

9.3 Solving Linear Systems

It is easy to rewrite a linear system Ax = b as a minimization problem. For example, in the previous chapter we solved the equation by minimizing $F(\mathbf{x}) = \|A\mathbf{x} - \mathbf{b}\|_2^2$. However, for certain matrices there is another possibility. To explain, we start with the n = 1 case, where the equation is simply ax = b. We want to find a function F(x) that has a minimum exactly when ax = b. Recall that at a minimum $F'(x) = 0$. If ax − b = 0 corresponds to the equation $F'(x) = 0$ then we need $F'(x) = ax - b$. Integrating gives us $F(x) = \frac{1}{2}ax^2 - bx$ (the additive constant is not needed). For this to correspond to a minimization problem the quadratic must open upward, which means we have the positivity requirement a > 0.

To see what happens when there are more variables, in the case when n = 2, Ax = b takes the form
$$a_{11}x_1 + a_{12}x_2 = b_1,$$
$$a_{21}x_1 + a_{22}x_2 = b_2.$$
We want these to correspond to the equations $\frac{\partial F}{\partial x_1} = 0$, $\frac{\partial F}{\partial x_2} = 0$. Consequently,
$$\frac{\partial F}{\partial x_1} = a_{11}x_1 + a_{12}x_2 - b_1, \tag{9.7}$$
$$\frac{\partial F}{\partial x_2} = a_{21}x_1 + a_{22}x_2 - b_2. \tag{9.8}$$
The first thing to note is that the consistency condition
$$\frac{\partial^2 F}{\partial x_1 \partial x_2} = \frac{\partial^2 F}{\partial x_2 \partial x_1}$$
requires that $a_{12} = a_{21}$. In other words, our construction requires that the matrix A be symmetric. Assuming this holds, then integrating (9.7) and (9.8) it follows that
$$F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b} \bullet \mathbf{x}. \tag{9.9}$$
To guarantee that this has a minimum there is a positivity requirement on A, namely, it must be positive definite. Although the above derivation was for n = 2, it holds for general n. To summarize this discussion, we have the following result.

Theorem 9.1 If $F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b} \bullet \mathbf{x}$, where A is symmetric and positive definite, then the minimum of F(x) occurs at the same x that is the solution of Ax = b.

Example 9.1 If
$$A = \begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix} \quad\text{and}\quad \mathbf{b} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \tag{9.10}$$
then (9.9) reduces to
$$F(x, y) = \frac{1}{2}\begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} - \begin{pmatrix} 1 \\ -1 \end{pmatrix} \bullet \begin{pmatrix} x \\ y \end{pmatrix} = \frac{3}{2}x^2 - 2xy + 2y^2 - x + y.$$
This surface is shown in Figure 9.4. It is an elliptic paraboloid that opens upwards, and it therefore has a unique minimizer. In fact, for general n, the conditions in Theorem 9.1 mean that F(x) is a multidimensional version of an elliptic paraboloid. ■
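Example 9.1 can be verified with exact rational arithmetic. The sketch below is an illustration written for this purpose, not part of the text: it evaluates F both from the matrix form (9.9) and from the expanded polynomial, and checks that the solution of Ax = b, which works out by hand to (1/4, −1/8), gives a smaller value of F than nearby points. The helper names are assumptions.

```python
from fractions import Fraction

A = [[3, -2], [-2, 4]]
b = [1, -1]

def dot(u, v):
    return sum(ui*vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def F_matrix_form(x, y):
    """F from (9.9): (1/2) x^T A x - b . x."""
    v = [x, y]
    return Fraction(1, 2)*dot(v, matvec(A, v)) - dot(b, v)

def F_expanded(x, y):
    """The expanded polynomial given in Example 9.1."""
    return Fraction(3, 2)*x*x - 2*x*y + 2*y*y - x + y

xm = [Fraction(1, 4), Fraction(-1, 8)]        # solution of Ax = b (computed by hand)
residual = [bi - yi for bi, yi in zip(b, matvec(A, xm))]

samples = [(Fraction(1), Fraction(2)), (Fraction(-3, 7), Fraction(5, 2)), (xm[0], xm[1])]
forms_agree = all(F_matrix_form(x, y) == F_expanded(x, y) for x, y in samples)

# Compare F at the solution with F at four nearby points.
eps = Fraction(1, 100)
nearby = [(xm[0] + dx, xm[1] + dy) for dx, dy in [(eps, 0), (-eps, 0), (0, eps), (0, -eps)]]

print("residual b - A xm:", residual)
print("F(xm) =", F_expanded(*xm))
print("smaller than at nearby points:", all(F_expanded(x, y) > F_expanded(*xm) for x, y in nearby))
```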

Figure 9.4 Plot of F(x, y) in the case when A and b are given in (9.10).

9.3.1 Basic Descent Algorithm for Ax = b

As explained earlier, to solve the line search problem you must solve (9.5). In the case when F(x) is given in (9.9),
$$\nabla F = A\mathbf{x} - \mathbf{b}. \tag{9.11}$$
Consequently, (9.5) reduces to $[A(\mathbf{x}_k + \alpha \mathbf{d}_k) - \mathbf{b}] \bullet \mathbf{d}_k = 0$. Solving this for α yields
$$\alpha_k = \frac{\mathbf{d}_k \bullet \mathbf{r}_k}{\mathbf{d}_k \bullet \mathbf{q}_k}, \tag{9.12}$$
where
$$\mathbf{q}_k = A\mathbf{d}_k, \tag{9.13}$$
and
$$\mathbf{r}_k = \mathbf{b} - A\mathbf{x}_k. \tag{9.14}$$
The vector $\mathbf{r}_k$ is the residual at $\mathbf{x}_k$ (see Section 3.6). Moreover, from (9.11), it follows that $\mathbf{r}_k$ is also the direction of steepest descent at $\mathbf{x}_k$. This means, from (9.6), that for $\mathbf{d}_k$ to be a direction of descent it must satisfy
$$\mathbf{d}_k \bullet \mathbf{r}_k > 0. \tag{9.15}$$
Another observation worth mentioning is that, given $\mathbf{r}_1 = \mathbf{b} - A\mathbf{x}_1$, then for the other values of k,
$$\mathbf{r}_k = \mathbf{b} - A\mathbf{x}_k = \mathbf{b} - A(\mathbf{x}_{k-1} + \alpha_{k-1}\mathbf{d}_{k-1}) = \mathbf{r}_{k-1} - \alpha_{k-1}\mathbf{q}_{k-1}. \tag{9.16}$$
This makes computing the residual a bit easier as it avoids a matrix multiplication.

To summarize the above discussion, the descent method used to solve Ax = b, when A is symmetric and positive definite, is: after picking $\mathbf{x}_1$, then
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{d}_k, \quad\text{for } k = 1, 2, 3, \ldots, \tag{9.17}$$
where $\mathbf{d}_k$ is any vector that satisfies (9.15), and
$$\alpha_k = \frac{\mathbf{d}_k \bullet \mathbf{r}_k}{\mathbf{d}_k \bullet \mathbf{q}_k}. \tag{9.18}$$
In the above formula, $\mathbf{q}_k$ is given in (9.13) and the residual $\mathbf{r}_k$ is calculated using (9.16) for k > 1. The only thing left to do is specify the direction of descent $\mathbf{d}_k$.
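One step of this basic descent method can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `descent_step`, the starting point (1, 2), and the direction d = (1, 0) are choices made here, with A and b taken from (9.10). Exact fractions are used so the identities can be checked without rounding.

```python
from fractions import Fraction

def dot(u, v):
    return sum(ui*vi for ui, vi in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def descent_step(A, x, r, d):
    """One step of (9.17): returns x_{k+1}, r_{k+1}, and alpha_k."""
    if dot(d, r) <= 0:
        raise ValueError("d is not a descent direction, see (9.15)")
    q = matvec(A, d)                                   # (9.13)
    alpha = dot(d, r) / dot(d, q)                      # (9.18)
    x_new = [xi + alpha*di for xi, di in zip(x, d)]    # (9.17)
    r_new = [ri - alpha*qi for ri, qi in zip(r, q)]    # (9.16), no product with A
    return x_new, r_new, alpha

A = [[Fraction(3), Fraction(-2)], [Fraction(-2), Fraction(4)]]
b = [Fraction(1), Fraction(-1)]
x1 = [Fraction(1), Fraction(2)]
r1 = [bi - yi for bi, yi in zip(b, matvec(A, x1))]     # r_1 = b - A x_1 = (2, -7)

d1 = [Fraction(1), Fraction(0)]                        # satisfies d . r = 2 > 0
x2, r2, alpha1 = descent_step(A, x1, r1, d1)
r2_direct = [bi - yi for bi, yi in zip(b, matvec(A, x2))]

print("alpha_1 =", alpha1, " x_2 =", x2)
print("r_2 from (9.16):", r2, " r_2 from b - A x_2:", r2_direct)
print("d_1 . r_2 =", dot(d1, r2))
```

The last line prints zero: with the exact line search value of α, the new residual is orthogonal to the direction just used, which is simply (9.5) restated in terms of the residual.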

9.3.2 Method of Steepest Descents for Ax = b

We are going to consider what happens when we use the steepest descent direction to solve the minimization problem for Ax = b. In other words, we will take $\mathbf{d}_k = \mathbf{r}_k$ in (9.17) and (9.18). Not surprisingly, this gives rise to what is known as the method of steepest descents (SDM). The resulting algorithm is: after picking $\mathbf{x}_1$, then
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{r}_k, \quad\text{for } k = 1, 2, 3, \ldots, \tag{9.19}$$
where $\mathbf{r}_k = \mathbf{b} - A\mathbf{x}_k$,
$$\alpha_k = \frac{\mathbf{r}_k \bullet \mathbf{r}_k}{\mathbf{r}_k \bullet \mathbf{q}_k}, \tag{9.20}$$
and $\mathbf{q}_k = A\mathbf{r}_k$.

The vectors $\mathbf{x}_{k+1}$ computed using (9.19) continue getting closer to the solution $\mathbf{x}_m$, and we need a way to determine when we are close enough. One possibility is to use the iterative error (see Section 1.5), which means that we specify an error tolerance tol and then stop computing when $\|\mathbf{x}_{k+1} - \mathbf{x}_k\| \le tol$. Another possibility is to use the residual $\mathbf{r}_k$. If we are fortunate enough to have $\mathbf{r}_k = \mathbf{0}$, then the problem is solved and $\mathbf{x}_m = \mathbf{x}_k$. Since $\|\mathbf{r}_k\| = 0$ only when $\mathbf{x}_k$ is the solution, having $\|\mathbf{r}_k\|$ small can be used as a stopping condition for the iteration. This works as long as A is not ill-conditioned, and we will return to this topic later.

Because the only significant operations involved with SDM are vector and matrix multiplication, it is very easy to code. Moreover, it is capable of taking maximum advantage of the sparseness of A. For example, to calculate $\mathbf{q}_k = A\mathbf{d}_k$ we need to store, and use, only the nonzero entries in A. The benefits of doing this are discussed in Section 9.3.4.

How effective is SDM in solving matrix equations? Let's find out by trying a couple of examples.

Example 9.2 In the two examples below the stopping condition is $\|\mathbf{x}_{k+1} - \mathbf{x}_k\|_\infty < 10^{-4}$. Once the SDM has stopped, then the error $\|\mathbf{x}_m - \mathbf{x}_k\|_\infty$, which is the difference between the exact and SDM solution, is computed to check to see how well the method has worked.

(a)
$$\begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
First note that the matrix is symmetric and, from Theorem 3.4, it is positive definite. Taking $\mathbf{x}_1 = (1, 2)^T$, to determine $\mathbf{x}_2$, note that
$$\mathbf{r}_1 = \mathbf{b} - A\mathbf{x}_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix} - \begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 \\ -7 \end{pmatrix},$$
$$\mathbf{q}_1 = A\mathbf{r}_1 = \begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}\begin{pmatrix} 2 \\ -7 \end{pmatrix} = \begin{pmatrix} 20 \\ -32 \end{pmatrix},$$
and
$$\alpha_1 = \frac{\mathbf{r}_1 \bullet \mathbf{r}_1}{\mathbf{r}_1 \bullet \mathbf{q}_1} = \frac{53}{264}.$$
With this,
$$\mathbf{x}_2 = \mathbf{x}_1 + \alpha_1 \mathbf{r}_1 = \begin{pmatrix} 185/132 \\ 157/264 \end{pmatrix} \approx \begin{pmatrix} 1.4 \\ 0.6 \end{pmatrix}.$$

The remaining $\mathbf{x}_k$'s are computed using MATLAB, and the resulting steps are shown in the contour plot on the left in Figure 9.5. There are two important observations to be made here. First, each step is orthogonal to the contour it starts from. This is expected given that this is the method of steepest descents. The second observation is that successive steps are perpendicular to each other. For example, the line from $\mathbf{x}_1$ to $\mathbf{x}_2$ is perpendicular to the line from $\mathbf{x}_2$ to $\mathbf{x}_3$, which in turn is perpendicular to the one from $\mathbf{x}_3$ to $\mathbf{x}_4$. As will be explained later, this is not a good thing. Finally, for this example, the method takes 14 iteration steps and the resulting error is $4 \times 10^{-5}$.

(b)
$$\begin{pmatrix} 5 & 4.99 \\ 4.99 & 5 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
Taking $\mathbf{x}_1 = (1, 1)^T$, the first few steps produced with the SDM are shown in the contour plot on the right in Figure 9.5. The contours are so elongated in this example that the elliptical curves around the minimum are not evident (as they are in the contour plot to the left). The path the method takes results in $\mathbf{x}_2$ being very close to $\mathbf{x}_1$, $\mathbf{x}_4$ being very close to $\mathbf{x}_3$, etc. In the end, the method takes a whopping 200 iteration steps and the resulting error is a mediocre $9 \times 10^{-3}$. ■

Figure 9.5 Contours of F(x, y) and the first few steps taken by the method of steepest descent in Example 9.2(a), on the left, and Example 9.2(b), on the right.

As seen in Figure 9.5, the SDM has more trouble when the contours are elongated ellipses as compared to the more circular ones. As it turns out, the shape of each elliptical contour is determined, approximately, by the condition number $\kappa_2(A)$. Specifically, if a and b are the semi-major and semi-minor axes, respectively, for any particular ellipse seen in these plots, then $\kappa_2(A) = (a/b)^2$. The proof of this comes from the spectral decomposition theorem given in Section 4.6. What this means is that you can get an idea of the condition number of A by just looking at the contour plot. For example, for the plot on the left in Figure 9.5, a ≈ 2b, so $\kappa_2(A) \approx 4$. The actual value is $\kappa_2(A) = 3.866\cdots$. For the plot on the right, a ≈ 30b, so $\kappa_2(A) \approx 900$. The actual value is $\kappa_2(A) = 999$.

In reviewing the results from the above examples one might wonder why anyone would ever consider using the SDM because the method can be painfully slow to converge. This is not because the matrices are ill-conditioned. Although the condition number for the second matrix is not particularly small, it is not so large that we would anticipate a problem with round-off error. The difficulty the SDM has arises because of the alternating perpendicular paths it uses to locate the minimum.

There is a final technical comment to make about the contour plots in Figure 9.5. In each, the x and y scales are the same. This is required to be able to see the orthogonality in the path directions. Equal scaling is also used in Figure 9.3 so the tangency of the descent direction is evident (see Exercise 9.3). This equal scaling is done, or assumed, in all contour plots in this chapter (including the exercises).

9.3.3 Conjugate Gradient Method for Ax = b

To avoid the difficulties the SDM has we need to modify the direction of descent. We will still include the direction of steepest descent, but we will add


Pick: $x_1$ and tol > 0
Let: $r_1 = b - Ax_1$
     $d_1 = r_1$
Loop: For $k = 1, 2, 3, \ldots, n$
        $q_k = A d_k$
        $\alpha_k = \dfrac{r_k \bullet r_k}{d_k \bullet q_k}$
        $x_{k+1} = x_k + \alpha_k d_k$
        If $\|\alpha_k d_k\|_\infty <$ tol then stop
        $r_{k+1} = r_k - \alpha_k q_k$
        $\beta_k = \dfrac{r_{k+1} \bullet r_{k+1}}{r_k \bullet r_k}$
        $d_{k+1} = r_{k+1} + \beta_k d_k$
End

Table 9.1 Outline of the conjugate gradient method (CGM) used to solve the linear system Ax = b, where A is an n × n symmetric positive definite matrix.
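The steps in Table 9.1 translate almost line for line into code. What follows is a minimal sketch in plain Python, not a tuned implementation: the helper names `dot`, `matvec` and `conjugate_gradient` are illustrative choices, the matrix is stored as a dense list of rows, and the test for a zero residual is an added guard that reflects the assumption that the residual is nonzero when $\beta_k$ is computed.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product, with A stored as a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x1, tol=1e-10):
    """Follow the steps of Table 9.1; return the final x and the step count."""
    n = len(b)
    x = list(x1)
    r = [bi - qi for bi, qi in zip(b, matvec(A, x))]  # r1 = b - A x1
    d = list(r)                                        # d1 = r1
    rr = dot(r, r)
    steps = 0
    for _ in range(n):
        if rr == 0.0:  # added guard: zero residual means x already solves Ax = b
            break
        q = matvec(A, d)
        alpha = rr / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        steps += 1
        if max(abs(alpha * di) for di in d) < tol:  # stopping test from Table 9.1
            break
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
    return x, steps


# A 2 x 2 symmetric positive definite test system
A = [[3.0, -2.0], [-2.0, 4.0]]
b = [1.0, -1.0]
x, steps = conjugate_gradient(A, b, [1.0, 2.0])
print(x, steps)
```

For this system the exact solution is $(1/4, -1/8)^T$, so the printed vector should agree with it to within round-off.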

in a term to adjust the direction slightly to prevent what happens in Figure 9.5. Our choice will be to start off with $d_1 = r_1$ but from then on take
$$d_k = r_k + \beta_{k-1} d_{k-1}, \quad \text{for } k = 2, 3, \ldots. \tag{9.21}$$

So, the question is, how to pick $\beta_{k-1}$? The choice is based on the observation that for the SDM the direction of steepest descent at $x_k$ is orthogonal to the steepest descent direction at $x_{k-1}$. In other words, $r_k \bullet r_{k-1} = 0$. This is evident in Figure 9.5 because of the right angle the path makes at each $x_k$. In this two-dimensional setting, because there are only two nonzero perpendicular directions, the SDM keeps repeating its descent directions. In particular, the direction it uses at $x_1$ is the same one it uses at $x_3$, $x_5$, $x_7$, etc. This is the primary culprit in slowing down the SDM and we will use $\beta_{k-1}$ to stop this.

The way we will prevent $r_k$ from being in the same direction as $r_{k-2}$ is to select $\beta_{k-1}$ so that $r_k$ is actually orthogonal to $r_{k-2}$. To determine how this can be accomplished, consider the first three points produced by the algorithm: $x_1$, $x_2$, and $x_3$. The residuals at these points are: $r_1$, $r_2$, and $r_3$. Since $r_3 = r_2 - \alpha_2 A d_2$ and $r_1 \bullet r_2 = 0$, then $r_3 \bullet r_1 = -\alpha_2\, r_1 \bullet A d_2$. So, to have $r_3 \bullet r_1 = 0$ we need $r_1 \bullet A d_2 = 0$. Given that $d_2 = r_2 + \beta_1 r_1$, then for $r_3$ to be orthogonal to $r_1$ it is required that
$$\beta_1 = \frac{r_2 \bullet r_2}{r_1 \bullet r_1}.$$
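The step from $r_1 \bullet A d_2 = 0$ to the formula for $\beta_1$ is stated without the intermediate algebra. A sketch of the missing steps is given below; it uses only $d_1 = r_1$, the update $r_2 = r_1 - \alpha_1 A r_1$ (so $\alpha_1 \neq 0$ is assumed), the symmetry of A, and $r_1 \bullet r_2 = 0$.

```latex
\begin{align*}
0 &= r_1 \bullet A d_2
   = r_1 \bullet A r_2 + \beta_1\, r_1 \bullet A r_1 , \\
A r_1 &= \frac{1}{\alpha_1}\,(r_1 - r_2)
   \quad \text{(from } r_2 = r_1 - \alpha_1 A r_1 \text{)} , \\
r_1 \bullet A r_2 &= r_2 \bullet A r_1
   = \frac{1}{\alpha_1}\,(r_2 \bullet r_1 - r_2 \bullet r_2)
   = -\frac{r_2 \bullet r_2}{\alpha_1} , \\
r_1 \bullet A r_1 &= \frac{1}{\alpha_1}\,(r_1 \bullet r_1 - r_1 \bullet r_2)
   = \frac{r_1 \bullet r_1}{\alpha_1} , \\
\beta_1 &= -\frac{r_1 \bullet A r_2}{r_1 \bullet A r_1}
   = \frac{r_2 \bullet r_2}{r_1 \bullet r_1} .
\end{align*}
```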


Continuing in this way yields
$$\beta_k = \frac{r_{k+1} \bullet r_{k+1}}{r_k \bullet r_k}, \quad \text{for } k = 1, 2, 3, \ldots. \tag{9.22}$$

Another outcome of the above analysis is that $d_k \bullet r_k = r_k \bullet r_k$. The resulting algorithm is given in Table 9.1. It is assumed here that $r_k \bullet r_k \neq 0$, but if this were not the case we would have solved the equation exactly at the previous step and there would be no need to calculate $\beta_k$. This choice of $\beta_k$, along with the descent direction in (9.21), produces what is known as the conjugate gradient method (CGM). In terms of operations per iteration step it is not much more than the SDM: calculating $\beta_k$ and then constructing $d_k$ adds about 4n flops per step.

Finite Termination Property

The CGM has a remarkable property that has profound consequences for how well it works. At each step, three vectors are computed that are of importance, and these are the position ($x_{k+1}$), the residual ($r_{k+1}$) and the direction of descent ($d_{k+1}$). As explained earlier, the descent direction $d_2$ is selected so that the residual $r_3$ is orthogonal to the previous residuals, $r_1$ and $r_2$. Similarly, $d_3$ is selected so $r_4$ is orthogonal to $r_1$, $r_2$ and $r_3$. The result is that the residuals $r_1, r_2, r_3, \ldots$ are mutually orthogonal, which means that $r_i \bullet r_j = 0$, $\forall\, i \neq j$. In n dimensions, the only way for n + 1 vectors to be mutually orthogonal is that one of them is the zero vector. If a residual is zero, then we have found the exact solution. Therefore, the conjugate gradient method produces the exact solution in no more than n steps! This is known as the finite termination property. In contrast, iterative methods such as the SDM, Newton's method, or the power method, are not guaranteed to stop and only approach the solution in the limit (if they converge at all). This discussion is summarized in the following theorem.

Theorem 9.2 No matter what is selected for $x_1$, the conjugate gradient method, when used to solve Ax = b, where A is an n × n symmetric positive definite matrix, will find the exact solution in m steps, where m ≤ n.

The proof of this can be found in Nocedal and Wright [2006]. It should also be restated that this result assumes that exact arithmetic is used.

Example 9.3

(a)
$$\begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$


Taking $x_1 = (1, 2)^T$, all of the steps produced with the CGM are shown in the contour plot on the left in Figure 9.6. Given the stopping condition in Table 9.1, the procedure takes three steps since it must find $x_4$ to realize that $x_3$ is the solution. It should also be noted that $x_1$ and $x_2$ are the same as for the SDM shown in Figure 9.5. This is expected because the CGM uses the steepest descent direction at the first step, so if the SDM and CGM are started at the same point then they will calculate the same value for $x_2$. However, unlike what is obtained using the SDM, the descent direction at $x_2$ used by the CGM is not perpendicular to the level curve.

(b)
$$\begin{pmatrix} 5 & 4.99 \\ 4.99 & 5 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

Taking $x_1 = (1, 1)^T$, all of the steps produced with the CGM are shown in the contour plot on the right in Figure 9.6. It is found that the method stops after calculating $x_4$, with the resulting iterative error being about $6 \times 10^{-12}$.

In comparing the search paths in Figures 9.5 and 9.6, the contrast between the SDM and CGM is striking. The CGM is far superior to the SDM. However, the reality is that round-off generally prevents the CGM from producing the exact result. This is not unexpected, nor a criticism of the method, because as with all floating-point computations we can only get as close as round-off permits. The issue with the CGM is that it is built on the assumption of the exact orthogonality of the residuals at each step. With round-off, however, this will not happen. As the above examples show, this is often not

Figure 9.6 Contours of F(x, y) and the steps taken by the conjugate gradient method in Example 9.3(a), on the left, and Example 9.3(b), on the right. (Horizontal axis: x-axis; vertical axis: y-axis.)


an issue. However, there are matrices, even ones that are not ill-conditioned, that the CGM has trouble with. This is discussed in more detail in Sections 9.3.5 and 9.3.6.

The CGM was derived using a geometric-based explanation of how $d_k$ is selected. In more theoretical treatments, the starting assumption is that the search directions should be chosen to satisfy
$$d_k^T A d_i = 0, \quad \text{for } i = 1, 2, \ldots, k - 1. \tag{9.23}$$
Why you would want to do this is not obvious, but the reasons become evident as the derivation proceeds. Vectors satisfying (9.23) are said to be conjugate, or A-orthogonal. This comes from the fact that $(u, x)_A \equiv u^T A x$ is an inner product, and (9.23) is simply requiring that $(d_k, d_i)_A = 0$. This inner product also gives rise to the vector norm $\|x\|_A \equiv \sqrt{x^T A x}$ (see Exercise 3.38).

Example 9.4: Solving Laplace's Equation

One of the more important applications that produces large positive definite matrix equations involves solving Laplace's equation $\nabla^2 u = 0$. So, to investigate the effectiveness of the CGM for larger systems we will solve Ax = b, where A is the matrix coming from solving Laplace's equation on a rectangular domain. The matrix is illustrated schematically in Figure 9.7. It is symmetric and contains zeros everywhere except along 5 diagonals. The main diagonal entries are $a_{ii} = 4$ and all the other nonzero entries in the matrix are −1. The upper-most diagonal starts at $a_{1p}$, and the lower-most diagonal starts at $a_{p1}$ (in Figure 9.7, p = 6 and n = 15). Also, the zeros appearing in the super-diagonal and the sub-diagonal occur every p rows. It is not difficult to show that the matrix is positive definite (see Exercise 4.44). To check on the accuracy of the CGM we need to know the exact solution $x_m$. This will be done by specifying $x_m$ and then setting $b = Ax_m$. This

Figure 9.7 The x’s designate the positions of the nonzero entries in the symmetric positive definite matrix for Laplace’s equation example.


way we can observe the accuracy, and rate of convergence, of the CGM as it progresses. Also, there are various ways to measure error, and for this example we will consider the following:

Error: $E_k = \|x_k - x_m\|_\infty$,
Iterative Error: $I_k = \|x_k - x_{k-1}\|_\infty$,
Residual: $R_k = \|r_k\|_\infty$.

The results when $p = 10^2$ and $n = 10^4$ are shown in Figure 9.8. It shows that the residual and iterative errors are effectively equivalent in their estimation of $E_k$. Except at the start, both underestimate the error and neither decreases monotonically. Another observation is that there are two stages in the convergence. For about the first 80 steps the error reduction is modest. In fact, it is hard to tell if the error $E_k$ decreases at all at the beginning. Shortly after that things change, and the CGM converges relatively quickly. Another important point to make about Figure 9.8 concerns the total number of steps. According to Theorem 9.2, for this example it could take up to $n = 10^4$ steps for the CGM to find the solution. Even so, the CGM has produced a reasonably accurate solution of the matrix equation in significantly fewer than that.

A major competitor to the CGM for solving matrix equations, like the one in the previous example, is the Cholesky method. How they compare, and how they can be used together, will be discussed in Section 9.3.6.

9.3.4 Large Sparse Matrices

We now consider problems in which the conjugate gradient method (CGM) plays an important role. Specifically, we consider the case when A, besides

Figure 9.8 Error obtained at each iteration step of the CGM to solve an equation involving the matrix illustrated in Figure 9.7. In this calculation, $p = 10^2$ and $n = 10^4$. (Log-log plot of the error, iterative error, and residual versus the number of iteration steps.)


being symmetric and positive definite, is also sparse. There is not a precise definition of sparse, but it roughly means that it is large and there are so many zeros in the matrix that it is worth trying to find a solution method that takes advantage of this. For example, the matrix in Figure 9.7, or the ones in (9.24) and (9.25), qualify as sparse for larger values of n. The direct solvers we have considered, which include LU and the QR approach, are limited in their ability to take advantage of sparsity. In contrast, the CGM can take maximum advantage of sparsity because the only recurring step in the CGM is computing $Ad_k$. It is relatively easy to limit this multiplication to only the nonzero elements of A.

The key for using sparsity is to only store the nonzero entries of A. However, it is also necessary to store their locations. There are different ways to do this, and the one considered here is called a row compressed format. This means you go through the matrix, one row at a time, and put the nonzero entries in a vector. Actually, because we also need to know their locations, we will construct a 3 × m matrix C, where m is the number of nonzero entries in A. The first row of C contains the nonzero entries in A, the second row contains their column number in A, and the third row contains their row number. As should be evident, this is only worthwhile if m is not very big. The breakeven point is when $3m = n^2$, but given the time overhead to construct and then use C in the CGM, m should be significantly smaller than $n^2/3$.

Example 9.5: Row Compression

If
$$A = \begin{pmatrix} -1 & 2 \\ 0 & 7 \end{pmatrix},$$
then in row compressed format
$$C = \begin{pmatrix} -1 & 2 & 7 \\ 1 & 2 & 2 \\ 1 & 1 & 2 \end{pmatrix}.$$

In an effort to minimize the storage requirements, it is typical to not include the third row, so C is 2 × m. In place of the third row an n-vector c is used that identifies where the first entry from each row in A is located. So, $c_i$ gives the column in C where the ith row in A starts. In the above example, $c = (1, 3)^T$. It is not difficult to rewrite the CGM, and other algorithms, to use compressed storage. However, in MATLAB it is very easy as you just use the command: C=sparse(A). MATLAB will then use sparsity in such commands as q=C*d or qr(C). For those matrices A that are so large that they cannot

Figure 9.9 Ratio of the computing times using the CGM to solve Ax = b, with and without taking advantage of sparsity, for the matrix illustrated in Figure 9.7. (Vertical axis: sparse/non-sparse ratio, log scale; horizontal axis: n.)

be stored in active memory, the matrix C can be entered directly into the code. If you do and are using MATLAB, it uses a column compression format that differs slightly from the procedure described above.

Example 9.6: Benefits of Using Sparsity

The matrix used in Laplace's equation example, as illustrated in Figure 9.7, qualifies as a sparse matrix for large values of n. The ratio of the computing times required when using the CGM with and without sparseness is shown in Figure 9.9 as a function of the size of the matrix. For the larger matrices, taking advantage of sparsity results in the calculation being about 100 times faster. It is also of interest to note that if the equation is solved using MATLAB's backslash command, using sparse(A)\b is about 100 times faster than using A\b for larger matrices. Finally, it should be pointed out that the number of iteration steps for the sparse and non-sparse solvers are exactly the same.

When using the CGM, not all sparse matrices benefit from sparsity. An example is the tridiagonal matrix in (9.25). For such matrices the Thomas algorithm, given in Table 3.6, is a direct solver that makes optimal use of the sparsity. More generally, the tridiagonal matrix is an example of a banded matrix. This means that the nonzero entries are confined to a band along the diagonal. If the band contains few zeros, like the one in (9.25), it is generally better to solve the equation using what is called a banded Cholesky factorization. All this means is that you rewrite the code for the Cholesky factorization so it only uses the banded portion of the matrix. On the other hand, the matrix in Figure 9.7 is banded, but the band contains a significant number of zeros. For such matrices the CGM is competitive and can be significantly faster than using a banded Cholesky factorization.
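To make the row compressed format of Example 9.5 concrete, the following is a small sketch in plain Python. The function names `compress_rows`, `row_starts` and `sparse_matvec` are hypothetical, the three rows of C are kept as three separate lists, and indices are 1-based to match the text. It also assumes every row of A has at least one nonzero entry, which holds for a positive definite matrix because its diagonal entries are positive.

```python
def compress_rows(A):
    """Return the three rows of C: nonzero values, column numbers, row numbers."""
    vals, cols, rows = [], [], []
    for i, row in enumerate(A, start=1):        # go through A one row at a time
        for j, a in enumerate(row, start=1):
            if a != 0:
                vals.append(a)
                cols.append(j)
                rows.append(i)
    return vals, cols, rows


def row_starts(rows, n):
    """The n-vector c: c[i-1] is the column of C where row i of A starts."""
    c = [0] * n
    for col_in_C, i in enumerate(rows, start=1):
        if c[i - 1] == 0:                        # first entry seen for row i
            c[i - 1] = col_in_C
    return c


def sparse_matvec(C, d, n):
    """Compute q = A d using only the stored nonzero entries."""
    vals, cols, rows = C
    q = [0.0] * n
    for a, j, i in zip(vals, cols, rows):
        q[i - 1] += a * d[j - 1]
    return q


A = [[-1, 2], [0, 7]]                 # the matrix in Example 9.5
C = compress_rows(A)
c = row_starts(C[2], 2)
q = sparse_matvec(C, [1.0, 1.0], 2)
print(C, c, q)
```

The work in `sparse_matvec` is proportional to m, the number of nonzero entries, rather than to $n^2$, and this product is the only place where A enters each step of the CGM.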


9.3.5 Rate of Convergence

The CGM is classified as an iterative solver, and it can be used when A is symmetric and positive definite. A competitor to the CGM is the Cholesky factorization method (see Section 3.7.1), and it is classified as a direct solver. As you will see, these two methods are compared often in the following pages, and in some situations are combined to work together.

How many steps the CGM requires on an n × n matrix varies considerably with the properties of the matrix. This is evident by trying it on the three matrices:
$$A_1 = \begin{pmatrix} 1 & & & & \\ & 2 & & & \\ & & 3 & & \\ & & & \ddots & \\ & & & & n \end{pmatrix}, \quad A_2 = \begin{pmatrix} 3 & & & & 1 \\ & 3 & & & \\ & & 3 & & \\ & & & \ddots & \\ 1 & & & & 3 \end{pmatrix}, \tag{9.24}$$
$$A_3 = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & \ddots & \ddots & \ddots \\ & & & -1 & 2 \end{pmatrix}. \tag{9.25}$$
What is shown in Figure 9.10 is the error $\|x_k - x_e\|_\infty$, where $x_k$ is the solution at the kth step using the CGM and $x_e$ is the exact solution. Note that all three matrices in this computation are 1000 × 1000, and they are not ill-conditioned. In looking at Figure 9.10, it is seen that the CGM is not the fastest for the diagonal matrix $A_1$ but, instead, it is fastest for $A_2$. In fact, what is not shown in this figure is that the error for $x_3$ is about $7 \times 10^{-16}$. It is worth recalling at this point that $A_2$ has only three distinct eigenvalues (see Exercise 4.12). You might think that this indicates that there is a connection between the number of distinct eigenvalues and the number of steps before termination, and you would be correct. The theorem stating this result is below.

Theorem 9.3 If A is a symmetric positive definite matrix with m distinct eigenvalues, then the conjugate gradient method, when using exact arithmetic, will find the exact solution in no more than m steps.

It is not typical that you know the eigenvalues of the matrix, but if the CGM takes dramatically fewer steps than expected to find the solution, the above theorem provides a possible reason why. For those interested, a proof of the above theorem can be found in Nocedal and Wright [2006].
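Theorem 9.3 is easy to check numerically. The sketch below is an illustration under stated assumptions: it takes $A_2$ to have 3 on the diagonal and 1 in the (1, n) and (n, 1) positions, so its eigenvalues are 2, 3 and 4; it uses n = 50 instead of 1000; it starts from the zero vector; and it stops when the residual is small, which is a different test from the one in Table 9.1. The names `apply_A2` and `cg_steps` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def apply_A2(v):
    """Multiply by A2 without storing it: 3 on the diagonal, 1 in the two corners."""
    q = [3.0 * vi for vi in v]
    q[0] += v[-1]
    q[-1] += v[0]
    return q


def cg_steps(apply_A, b, tol=1e-9):
    """Run the CGM from x1 = 0 and return (x, steps taken until ||r|| < tol)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # r1 = b because x1 = 0
    d = list(r)
    rr = dot(r, r)
    for k in range(1, n + 1):
        q = apply_A(d)
        alpha = rr / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if max(abs(ri) for ri in r) < tol:
            return x, k
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x, n


n = 50
b = [float(i) for i in range(1, n + 1)]
x, steps = cg_steps(apply_A2, b)
print(steps)
```

With this right-hand side, b has a nonzero component in each of the three eigenspaces, so the step count matches the number of distinct eigenvalues even though the matrix is 50 × 50.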


Figure 9.10 Error $\|x_k - x_e\|_\infty$ using the CGM on three different 1000 × 1000 matrices. (Log-log plot versus the iteration step k, with one curve each for $A_1$, $A_2$, and $A_3$.)

This brings us back to the error curves for $A_1$ and $A_3$ in Figure 9.10. Note that both matrices have n distinct eigenvalues. Also, both curves show almost no improvement in the error when the CGM starts out. For $A_1$, once the error starts to drop, the rate of convergence keeps improving and the error drops fairly rapidly. This drop, which often happens when using the CGM, is known as the superlinear convergence phase of the procedure. The matrix $A_3$, in contrast, is a tough nut, and the CGM is forced to take the maximum number of steps to be able to find the solution. If you look at the curve for $A_3$, the error is $10^{-4}$ or greater until it makes a final, dramatic, drop down to $10^{-12}$ at the very last step.

For the iterative methods considered in this text, such as Newton's method and the power method, the rate of convergence is easy to determine. This is not the case with the CGM. The rate is dependent on how the eigenvalues of the matrix are distributed. There are, in fact, numerous theorems that provide upper bounds for the rate of convergence, and they generally involve the eigenvalues in some way. However, because they are based on the assumption of exact arithmetic, they generally provide poor estimates of the convergence rate seen when computing the solution. As a case in point, they provide little insight into what happens in the next example.

Example 9.7: Poor Convergence

As an example of when the CGM runs into trouble, let A be a random, symmetric, 100 × 100 matrix with 100 eigenvalues from $\lambda_{100} = 1$ up to $\lambda_1 = 10^6$. Using this in the CGM algorithm yields the error curves shown in Figure 9.11. The error $\|x_k - x_e\|_\infty$ shows almost no improvement after 100 steps. Even going a factor of 10 more steps than the theoretical limit for the finite termination property, the error is a modest $10^{-5}$. Another observation is that using the iterative error for the stopping condition is problematic given the rather significant oscillations in its value. The question is, what is it about this matrix that causes the CGM problems? It is not the condition number,


Figure 9.11 The error $\|x_k - x_e\|_\infty$ and iterative error $\|x_k - x_{k-1}\|_\infty$ using the CGM on a 100 × 100 matrix for which the method is very slow to converge. (Log-log plot versus the iteration step k.)

which for this matrix is $\kappa_2(A) = 10^6$. This is large but not what would be labeled ill-conditioned. The principal culprit is the distribution of the eigenvalues. To construct this matrix, the formula $A = QDQ^T$, where Q is a random n × n orthogonal matrix, was used. The matrix D is diagonal, with the diagonals logarithmically distributed from $\lambda_n = 1$ up to $\lambda_1 = 10^6$. When an evenly spaced, or a random, distribution is used, the poor convergence seen in Figure 9.11 does not occur.

Another factor in determining the convergence rate is b. The CGM involves a structured search for the minimum of $F(x) = \frac{1}{2} x^T A x - x \bullet b$. Consequently, b must play a role in this, albeit for many matrices it is likely a small one. To illustrate, for the error curves in Figure 9.10, the entries in b were randomly selected between ±1. If, instead, one takes $b = (1, 1, \ldots, 1)^T$, then the results for $A_1$ and $A_2$ show little change. The curve for $A_3$, however, does change, and the CGM now finds the solution using about 500 iteration steps. This is a factor of two better than what is shown in Figure 9.10. So, for this particular matrix the rate of convergence has some sensitivity to b.

Those interested in learning about what is known about the rate of convergence of the CGM should consult Nocedal and Wright [2006], Trefethen and Bau [1997], and Beckermann and Kuijlaars [2001].

9.3.6 Preconditioning

The conjugate gradient method (CGM) relies on the construction of a sequence of mutually orthogonal directions. This makes it more sensitive than a direct solver, such as Cholesky, to the condition of the matrix and floating-point errors in general. What is considered here is whether it is possible to find a way to modify the equation to be solved, and in the process improve the convergence of the CGM. The method used for this, what is


called preconditioning, is not limited to the CGM and is often used with other solvers.

We begin with an example, which is the symmetric and positive definite matrix
$$A = \begin{pmatrix} 1 & 3 - s \\ 3 - s & 9 \end{pmatrix}, \tag{9.26}$$
where $s = 10^{-10}$. The condition number for this matrix is $\kappa_\infty(A) \approx 2 \times 10^{11}$, which means that it is close to being ill-conditioned. However, taking
$$B = \begin{pmatrix} 9 & s - 3 \\ s - 3 & 1 \end{pmatrix},$$
then the product $A^* = BA$ has $\kappa_\infty(A^*) = 1$, and this qualifies as a well-conditioned matrix. So, the question arises as to whether it might be possible to improve the convergence of the CGM by multiplying the equation by a well-chosen matrix B.

To see whether this can work we simply multiply Ax = b by B and obtain $A^* x = b^*$, where $A^* = BA$ and $b^* = Bb$. Unfortunately, $A^*$ is not necessarily symmetric unless we severely restrict B. The needed remedy is to note that
$$BAx = BA\left(B^T B^{-T}\right)x = \left(BAB^T\right)\left(B^{-T}x\right),$$
where $B^{-T} = \left(B^T\right)^{-1}$. Therefore, we can transform Ax = b into $A^* x^* = b^*$, where $A^* = BAB^T$, $x^* = B^{-T}x$, and $b^* = Bb$. In this case, $A^*$ is symmetric and it is also positive definite if B is invertible.

At the moment we do not have a clue what to select for B, but assuming that good choices are available, we need to determine what happens to the steps making up the CGM. The equation to be solved is $A^* x^* = b^*$. If you write out the CGM algorithm for $A^* x^* = b^*$ and then express the variables in terms of the original A, x, b, one obtains the algorithm in Table 9.2, where $M = B^{-1}B^{-T}$. This is known as the preconditioned conjugate gradient method (PCGM) and M is the preconditioner. In terms of flops, the only significant difference between CGM and PCGM is the need to solve Mz = r at each iteration step. It is interesting that B appears in this procedure only through the matrix M. Therefore to use this algorithm we need only the preconditioner M and not the matrix B.

(0) pick a symmetric positive definite matrix M
(1) picking $x_1$, set $r_1 = b - Ax_1$, solve $Mz_1 = r_1$ and then let $d_1 = z_1$
(2) for k = 1, 2, 3, ...
        $x_{k+1} = x_k + \alpha_k d_k$
        $r_{k+1} = r_k - \alpha_k q_k$
        $Mz_{k+1} = r_{k+1}$    % solve for $z_{k+1}$
        $d_{k+1} = z_{k+1} + \beta_k d_k$
    where
        $q_k = Ad_k$
        $\alpha_k = \dfrac{z_k \bullet r_k}{d_k \bullet q_k}$
        $\beta_k = \dfrac{r_{k+1} \bullet z_{k+1}}{r_k \bullet z_k}$

Table 9.2 Outline of the preconditioned conjugate gradient method (PCGM) used to solve the linear system Ax = b, where A is an n × n symmetric positive definite matrix.

Guidelines for Picking M

There are a few requirements to keep in mind in selecting M. First, it must be symmetric and positive definite. Second, M should approximate A. To explain, one can think of Mz = r as a reduced version of Ax = b. The expectation is that the better M approximates A, the fewer the iteration steps that will be needed. For example, if M = A then the PCGM produces the exact solution in one step. The third requirement is that solving Mz = r must not add significant computational time or storage to the iteration step.

As a first attempt, because positive definite matrices are often diagonally dominant, a possible choice is to take M to be the diagonal matrix with its diagonal entries equal to those from A. This is an example of what is known as a Jacobi preconditioner. Another possibility is to take the tridiagonal entries of A. Unfortunately, neither of these preconditioners works that well.
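A compact sketch of the PCGM in Table 9.2 is given below in plain Python. It is an illustration only: the preconditioner is passed in as a function `solve_M` that returns the solution of Mz = r, the Jacobi choice is used because it is the simplest one mentioned above, and the stopping test on $\|\alpha_k d_k\|_\infty$ is borrowed from Table 9.1 since Table 9.2 does not list one. The names `pcg`, `jacobi_solve`, `tol` and `max_steps` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product, with A stored as a list of rows."""
    return [dot(row, v) for row in A]


def jacobi_solve(A):
    """Return a function solving M z = r for M = diag(A) (Jacobi preconditioner)."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def pcg(A, b, x1, solve_M, tol=1e-12, max_steps=None):
    """Preconditioned conjugate gradient method, following Table 9.2."""
    if max_steps is None:
        max_steps = len(b)
    x = list(x1)
    r = [bi - qi for bi, qi in zip(b, matvec(A, x))]  # r1 = b - A x1
    z = solve_M(r)                                     # solve M z1 = r1
    d = list(z)                                        # d1 = z1
    rz = dot(r, z)
    for _ in range(max_steps):
        if rz == 0.0:  # added guard: zero residual, nothing left to do
            break
        q = matvec(A, d)
        alpha = rz / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        if max(abs(alpha * di) for di in d) < tol:     # assumed stopping test
            break
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = solve_M(r)                                 # solve M z_{k+1} = r_{k+1}
        rz_new = dot(r, z)
        beta = rz_new / rz
        d = [zi + beta * di for zi, di in zip(z, d)]
        rz = rz_new
    return x


A = [[36.0, -2.0], [-2.0, 16.0]]
b = [1.0, 1.0]
x = pcg(A, b, [0.0, 0.0], jacobi_solve(A))
print(x)
```

Passing the preconditioner as a solve routine, rather than as a matrix, reflects the observation that only the solution of Mz = r is needed at each step.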

Incomplete Cholesky Factorization

To find a preconditioner that does work, recall that the direct approach to solve Ax = b uses a Cholesky factorization. This means that the matrix is factored as $A = U^T U$, where U is upper triangular with positive diagonals. So, if we can derive an approximation for U that is upper triangular with positive diagonals, we will have a preconditioner that satisfies the three requirements listed earlier. In particular, it will be a symmetric and positive definite approximation of A, and Mz = r will be easy to solve since M is already factored into upper and lower triangular matrices. It is also important that whatever is chosen for U, that M has approximately the same


sparsity pattern as A. If so, then M is said to be an incomplete Cholesky factorization of A. To derive an incomplete Cholesky factorization, write the matrix as
$$A = L + D + L^T, \tag{9.27}$$
where D is diagonal and L is strictly lower triangular (so it has zeros on the diagonal). For the Jacobi preconditioner mentioned earlier, it was reasoned that since positive definite matrices are often diagonally dominant then the entries in L are relatively small compared to the diagonal entries so $A \approx D$. If this is done, then $U \approx D^{1/2}$, where $D^{1/2}$ is the diagonal matrix with diagonals $\sqrt{a_{ii}}$. This choice does not work that well, but it can be improved by including a contribution from L. As shown in Exercise 9.41, if you include L in the approximation, then $U \approx D^{1/2} + D^{-1/2}L^T$. The result is that
$$M = \left(D^{1/2} + LD^{-1/2}\right)\left(D^{1/2} + D^{-1/2}L^T\right). \tag{9.28}$$
Besides being an incomplete Cholesky factorization preconditioner, this is also an example of what is called a symmetric successive overrelaxation (SSOR) preconditioner. For computational purposes, it is better to write the preconditioner as
$$M = \bar{L}\,\bar{U}, \tag{9.29}$$
where $\bar{L} = D + L$ and $\bar{U} = I + D^{-1}L^T$. With this, solving Mz = r can be reduced to solving the two triangular problems: $\bar{L}y = r$ and $\bar{U}z = y$. More significantly, the triangular problems have the same sparsity properties as the original matrix. As a consequence of this, this preconditioner could easily be incorporated into the implementation of the sparse CGM described earlier.

Example 9.8

If
$$A = \begin{pmatrix} 36 & -2 \\ -2 & 16 \end{pmatrix},$$
then (9.27) becomes
$$A = \begin{pmatrix} 0 & 0 \\ -2 & 0 \end{pmatrix} + \begin{pmatrix} 36 & 0 \\ 0 & 16 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ -2 & 0 \end{pmatrix}^T.$$
Since
$$D^{1/2} = \begin{pmatrix} 6 & 0 \\ 0 & 4 \end{pmatrix} \quad \text{and} \quad D^{-1/2} = \begin{pmatrix} 1/6 & 0 \\ 0 & 1/4 \end{pmatrix},$$
then
$$D^{1/2} + D^{-1/2}L^T = \begin{pmatrix} 6 & -1/3 \\ 0 & 4 \end{pmatrix},$$
and (9.28) becomes
$$M = \begin{pmatrix} 6 & 0 \\ -1/3 & 4 \end{pmatrix} \begin{pmatrix} 6 & -1/3 \\ 0 & 4 \end{pmatrix}.$$
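The factored form of the preconditioner, with a lower triangular factor D + L and an upper triangular factor $I + D^{-1}L^T$, can be checked on the matrix of Example 9.8 with a few lines of plain Python. This is a sketch for small dense matrices; the names `ssor_factors`, `solve_M` and `matmul`, and the variables `L_bar` and `U_bar`, are hypothetical, and a practical version would work with the sparse storage instead.

```python
def ssor_factors(A):
    """Return L_bar = D + L and U_bar = I + D^{-1} L^T for A = L + D + L^T."""
    n = len(A)
    L_bar = [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    U_bar = [[1.0 if j == i else (A[i][j] / A[i][i] if j > i else 0.0)
              for j in range(n)] for i in range(n)]
    return L_bar, U_bar


def solve_M(L_bar, U_bar, r):
    """Solve M z = r, M = L_bar U_bar, by forward and then back substitution."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):                       # L_bar y = r
        s = sum(L_bar[i][j] * y[j] for j in range(i))
        y[i] = (r[i] - s) / L_bar[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):             # U_bar z = y (unit diagonal)
        s = sum(U_bar[i][j] * z[j] for j in range(i + 1, n))
        z[i] = y[i] - s
    return z


def matmul(P, Q):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


A = [[36.0, -2.0], [-2.0, 16.0]]             # matrix from Example 9.8
L_bar, U_bar = ssor_factors(A)
M = matmul(L_bar, U_bar)                     # differs from A only in the (2,2) entry
z = solve_M(L_bar, U_bar, [36.0, -2.0])      # right-hand side is the first column of M
print(M, z)
```

Multiplying out either form of M for this example gives the same matrix, which agrees with A except that the (2,2) entry is 16 + 1/9 rather than 16.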



Example 9.9

We start simple and look to see how the PCGM does with the tridiagonal matrix in (9.25) when n = 1000. The results are shown in Figure 9.12, where M is given in (9.29). As is evident, the PCGM results in a significant improvement in both the error and the iterative error. The computing times for the two methods are comparable.

Example 9.10

As a second example, we consider the type of matrix that is used for Figure 9.11. In this case, the matrix is a random 100 × 100 matrix with 100 eigenvalues from $\lambda_{100} = 1$ up to $\lambda_1 = 10^4$. The comparisons in the error for the PCGM, with M given in (9.29), and the CGM are shown in Figure 9.13.

Figure 9.12 The error $\|x_k - x_e\|_\infty$ and iterative error $\|x_k - x_{k-1}\|_\infty$ using the PCGM, and the CGM, on the 1000 × 1000 tridiagonal matrix given in (9.25). (Two panels, error and iterative error on a log scale versus iteration steps, each with curves for the CGM and the PCGM.)


As in the last example, the number of iteration steps is reduced using the PCGM. What this particular preconditioner does not do is reduce the magnitude of the oscillations in the iterative error or result in the finite termination property holding. As with the first example, the computing times for the two methods are comparable.

Based solely on error reduction per step, the preconditioned version of the CGM is impressive. However, based on the objective of obtaining an accurate solution in the shortest amount of computing time, preconditioning is of little benefit in Example 9.9 but is of benefit in Example 9.10. The fact is that the idea underlying using a preconditioner is very intriguing, but it is difficult to find a preconditioner that works well on a wide class of problems. It is true that for certain types of matrices good preconditioners have been found, but finding them is a challenge and beyond the scope of this text. Those interested in investigating this further should consult Golub and Van Loan [2013] and Benzi [2002].

Figure 9.13 The error and iterative error for a 100 × 100 matrix. (Two panels, error and iterative error on a log scale versus iteration steps, each with curves for the CGM and the PCGM.)

9.3.7 Parting Comments

When solving a matrix equation involving a symmetric and positive definite matrix you can use the CGM or the Cholesky factorization method. As mentioned earlier, the CGM is classified as an iterative solver, while Cholesky is a direct solver. The prevailing opinion is that direct solvers are better if they can be used, because they are capable of obtaining the maximum accuracy and are not subject to the convergence assumptions needed for an iterative solver. More specifically, if A is dense and not particularly large it is better to use Cholesky. Also, if A is sparse and banded, as long as the nonzero entries are confined to a very narrow band, then the banded version of Cholesky should be used. An exception to this is when the matrix is tridiagonal, in which case the Thomas algorithm should be used. Finally, if A is sparse and large then the CGM should be used, and it is usually a good idea to use a preconditioner in such cases. Certainly if you have additional information about the matrix, your choice could easily change. As an example, if you know that the matrix only has one or two distinct eigenvalues, then the CGM would become a viable option.

9.4 General Nonlinear Problem

We now take up the problem of minimizing a general nonlinear function, and not necessarily a quadratic form as considered in the previous section. To be specific, we are going to consider how to numerically find a point $x_m \in \mathbb{R}^n$ at which a given function F(x) achieves a local minimum. In doing this, $\nabla F(x)$ is used to determine descent directions, and so it is assumed that the function is smooth enough that this is possible. As outlined in Section 9.2, the basic descent algorithm is: after picking a starting position $x_1$, then
$$x_{k+1} = x_k + \alpha_k d_k, \quad \text{for } k = 1, 2, 3, \ldots. \tag{9.30}$$

At each step this requires picking a direction of descent, as well as solving the line search problem to determine αk .

9.4.1 Descent Direction

As shown in Section 9.2.1, to be a direction of descent it is required that $d_k \bullet g_k < 0$, where $g_k$ is the gradient vector at $x_k$ and it is given as $g_k = \nabla F(x_k)$. As for possible descent directions, we have our two earlier favorites.

(9.31)


i) Steepest Descent Choice: $\mathbf{d}_k = -\mathbf{g}_k$
ii) Conjugate Gradient Choice: $\mathbf{d}_k = -\mathbf{g}_k + \beta_{k-1}\mathbf{d}_{k-1}$, where $\beta_0 = 0$.

Unlike the situation in Section 9.3.3, it is not obvious what to take for $\beta_{k-1}$. The usual approach is to just pick something, and what follows are some of the more well-known choices.

Fletcher-Reeves: This involves using the same formula as in Section 9.3.3, which means that

$$\beta_k = \frac{\mathbf{g}_{k+1} \bullet \mathbf{g}_{k+1}}{\mathbf{g}_k \bullet \mathbf{g}_k}. \tag{9.32}$$

The rationale for making this choice is that very near a minimizer it is typical that $F(\mathbf{x})$ is approximately a quadratic (this comes from Taylor's Theorem).

Polak-Ribière: The choice in this case is

$$\beta_k = \frac{\mathbf{g}_{k+1} \bullet (\mathbf{g}_{k+1} - \mathbf{g}_k)}{\mathbf{g}_k \bullet \mathbf{g}_k}. \tag{9.33}$$

When $F(\mathbf{x})$ is the quadratic form given in (9.9), $\mathbf{g}_{k+1} \bullet \mathbf{g}_k = 0$, so this formula reduces to what was used in Section 9.3.3. A complication with this choice is that $\alpha_k$ must be computed fairly accurately to guarantee that $\mathbf{d}_k$ is a direction of descent. The usual fix for this is to simply take $\beta_k = 0$ if $\mathbf{d}_k \bullet \mathbf{g}_k > 0$, which restarts the search from that point. Even with this complication, it is often stated that numerical testing shows (9.33) to be a better choice than using (9.32), and this will be seen in the examples to follow.

There have been numerous proposals for how to select $\beta_k$. If you are interested in what they are, then Hager and Zhang [2006] and Andrei [2020] should be consulted.

There are certainly other possibilities for selecting the descent directions. For example, if $\mathbf{B}$ is a symmetric and positive definite matrix then one can take

$$\mathbf{d}_k = -\mathbf{B}\mathbf{g}_k. \tag{9.34}$$

Note that the steepest descent method is obtained if $\mathbf{B} = \mathbf{I}$. Another possibility is to take $\mathbf{B}^{-1}$ to be the Hessian of $F(\mathbf{x})$ at $\mathbf{x}_k$. This produces Newton's method for finding the minimum (assuming you also take $\alpha_k = 1$). To explain why, the calculus solution for finding the minimum is to solve $\nabla F = \mathbf{0}$. As shown in Section 3.10, if Newton's method is used to solve this then you need to calculate the Jacobian of $\nabla F$, and this is called the Hessian (see (9.44)). For this to qualify for a descent method it is required that the Hessian be positive definite. Also, the matrix $\mathbf{B}$ in this case depends on $\mathbf{x}_k$, which means the LU factorization, or whatever method is used to solve the matrix equation, must be redone at each iteration step. However, this idea of combining Newton with a descent method leads to a very effective procedure as explained in Section 9.5.
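The two formulas for $\beta_k$, together with the restart safeguard, are short enough to write out. What follows is a minimal Python sketch rather than the code used for the book's examples. The names `dot`, `beta_fr`, `beta_pr`, and `cg_direction` are hypothetical, vectors are stored as plain lists, and the restart is triggered when $\mathbf{d}_k \bullet \mathbf{g}_k \ge 0$ (the equality case is included as an assumption, since such a direction is not one of descent either).

```python
def dot(u, v):
    """Dot product of two equal-length vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))

def beta_fr(g_new, g_old):
    """Fletcher-Reeves choice, formula (9.32)."""
    return dot(g_new, g_new) / dot(g_old, g_old)

def beta_pr(g_new, g_old):
    """Polak-Ribiere choice, formula (9.33)."""
    diff = [a - b for a, b in zip(g_new, g_old)]
    return dot(g_new, diff) / dot(g_old, g_old)

def cg_direction(g_new, g_old, d_old, beta_rule=beta_pr):
    """Return d = -g_new + beta * d_old.

    If the result is not a descent direction, restart by taking
    beta = 0, which gives the steepest descent direction.
    """
    beta = beta_rule(g_new, g_old)
    d = [-g + beta * p for g, p in zip(g_new, d_old)]
    if dot(d, g_new) >= 0:
        d = [-g for g in g_new]
    return d
```

When consecutive gradients are orthogonal, as happens for the quadratic form with an exact line search, the two rules return the same value. They differ once the gradients are no longer orthogonal.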

9.4.2 Line Search Problem

Determining the value of $\alpha_k$ is known as the line search problem. The basic objective is to find a value of $\alpha$ that reduces the value of $q(\alpha) = F(\mathbf{x}_k + \alpha \mathbf{d}_k)$. It is usually not worthwhile to calculate an accurate value of $\alpha$ that minimizes $q(\alpha)$. This is because this is only one step in the iteration process, and will be mostly forgotten in the later iteration steps. What is needed is to find a value for $\alpha_k$ that reduces the function enough so the descent method continues to make good progress towards the solution. With this in mind, most of the methods that are used start by sampling, which means that they have a method for picking a small number of $\alpha$'s at which they evaluate $q(\alpha)$. They then use this information to compute an approximation for $\alpha_k$. How well this works depends on how clever they are in picking the $\alpha$'s.

To illustrate, we consider what is currently one of the preferred methods. The first step is to determine the line tangent to $q(\alpha)$ at $\alpha = 0$, and this is

$$q(\alpha) \approx q(0) + \alpha q'(0) = F_k + m_k \alpha,$$

where $m_k = \mathbf{g}_k \bullet \mathbf{d}_k$ is the slope of the line and $F_k = F(\mathbf{x}_k)$. Also note that the slope is negative because $\mathbf{d}_k$ is a descent direction. This tangent line, which is a linear function of $\alpha$, is shown in Figure 9.14. A second line is also shown, and it is given as

$$Q(\alpha) = F_k + \gamma m_k \alpha, \tag{9.35}$$

where $0 < \gamma < 1$ is specified. So, $Q(\alpha)$ is obtained from the tangent line, but it has a slope $\gamma m_k$ that is not as steep. A consequence of this is that near $\alpha = 0$ the curve $q(\alpha)$ is above the tangent line but below $Q(\alpha)$. As $\alpha$ increases, $q(\alpha)$ will eventually start to increase and, assuming convexity, intersect $Q(\alpha)$. Letting $\bar{\alpha}$ denote the value of $\alpha$ where this occurs, the algorithm to be described attempts to find a value of $\alpha$ that is closer to $\bar{\alpha}$ than to $\alpha = 0$. It does this by simply guessing a value $a_1$ for what $\alpha$ might be. We want $a_1$ big enough that we keep making progress towards the minimum. However, once we get close to the minimum, a large value of $a_1$ can undo a lot of the progress we have made. The test used to decide if $a_1$ is too big is to check if $q(a_1) > Q(a_1)$. If it is then a smaller value is tried, and the one used is $a_2 = \tau a_1$, where $0 < \tau < 1$. This is continued until one finds an $a_i$ that satisfies $q(a_i) \le Q(a_i)$, and this is the value used as the approximate solution of the line search problem.

Figure 9.14 The lower dashed line is tangent to $q(\alpha)$ at $\alpha = 0$, and the upper dashed line is $Q(\alpha)$, which is given in (9.35). The goal of Armijo's method is to find a value of $\alpha$ that is close to $\bar{\alpha}$.

The procedure described above is known as Armijo's method, and it consists of the following steps.

Step 0: Pick $\gamma$ that satisfies $0 < \gamma < 1$ (e.g., $\gamma = \frac{1}{20}$), and pick $\tau$ that satisfies $0 < \tau < 1$ (e.g., $\tau = \frac{1}{2}$).
Step 1: Pick $a_1 > 0$ (e.g., $a_1 = 1$).
Step 2: If $F(\mathbf{x}_k + a_i \mathbf{d}_k) < F_k + a_i \gamma\, \mathbf{g}_k \bullet \mathbf{d}_k$ then stop, otherwise let $a_{i+1} = \tau a_i$ and repeat Step 2.

At termination, one takes $\alpha_k = a_i$.

To provide some insight into the parameters used here, note that if $\gamma$ is close to zero then $Q(\alpha)$ is close to being horizontal. In contrast, if $\gamma$ is chosen close to one then $Q(\alpha)$ gets closer to the tangent line shown in Figure 9.14. Because the goal is to try to make sufficient progress in the descent direction, the values usually suggested for $\gamma$ are small, typically satisfying $10^{-3} \le \gamma \le 10^{-1}$. The value of $\tau$ determines how fast one reduces the value of $a_i$. Again, the goal is to not end up with a point near zero, and so the suggested values for $\tau$ usually satisfy $\frac{1}{2} \le \tau \le \frac{3}{4}$. Finally, $a_1$ is simply how far in the descent direction one starts the process.

One of the distinctive features of Armijo's method is that it uses a backtracking sampling method, which means that it starts with the farthest point, determined by $a_1$, and then tries points that get progressively closer to $\alpha = 0$. This is done to try to maximize the step size in this particular direction.

As a final comment, in the description of how an approximate solution of the line search problem is obtained, the issue of not wanting $a_i$ either too big or too small came up (what you might call the Goldilocks problem). There is a more mathematical formulation of these requirements, and they are based on what are known as the Wolfe conditions. Those who are interested in learning more about this should consult Nocedal and Wright [2006].
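Steps 0 to 2 of Armijo's method translate almost line for line into code. The following Python sketch is one way to write it, under stated assumptions: the function name `armijo` and its argument names are hypothetical, vectors are plain lists, and the cap `max_backtracks` is an added safeguard (not part of the steps above) so the loop cannot run forever when rounding error prevents any decrease.

```python
def armijo(F, x, d, g, gamma=1/20, tau=1/2, a1=1.0, max_backtracks=60):
    """Backtracking (Armijo) line search along the descent direction d.

    F : function to minimize, called with a list
    x : current point x_k
    d : descent direction d_k
    g : gradient g_k at x
    Returns the accepted step a_i, or 0.0 if none is found within
    max_backtracks reductions (a safeguard added for this sketch).
    """
    Fk = F(x)
    m = sum(gi * di for gi, di in zip(g, d))      # slope m_k = g_k . d_k
    a = a1                                        # Step 1
    for _ in range(max_backtracks):
        trial = [xi + a * di for xi, di in zip(x, d)]
        if F(trial) < Fk + a * gamma * m:         # Step 2: q(a) below Q(a)
            return a
        a *= tau                                  # backtrack toward alpha = 0
    return 0.0
```

As a small check, take $F(x) = x^2$ at $x_k = 1$ with $d_k = -g_k = -2$. The first trial $a_1 = 1$ lands at $x = -1$, where $F$ has not decreased, so it is rejected. The second trial $a_2 = 1/2$ lands at $x = 0$ and is accepted.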

Figure 9.15 Top: Function given in (9.36). Middle: Descent paths, for two different starting points, using the SDM. Bottom: Descent paths, for two different starting points, using the CGM-PR.

9.4.3 Examples

In the examples that follow, CGM-PR is the conjugate gradient method using Polak-Ribière (9.33), while CGM-FR is the conjugate gradient method using Fletcher-Reeves (9.32). For the line search, $\gamma = 1/20$ and $\tau = 1/2$. A couple of modifications were made to the line search procedure. First, $\tau^2$ was used, instead of $\tau$, until $F(\mathbf{x}_k + a_i \mathbf{d}_k) < F_k$. Second, at the start $a_1 = 0.5$, but from then on $a_1 = 20a_i$, where $a_i$ is the last value used to determine $\mathbf{x}_{k-1}$. Finally, the stopping condition was $\|\mathbf{x}_k - \mathbf{x}_e\|_\infty < 10^{-6}$, where $\mathbf{x}_e$ is the exact solution.

Example 9.11 Consider the function

$$F(x, y) = (x - y)^4 + 8xy - x + y + 3. \tag{9.36}$$

This is plotted in Figure 9.15, and it is evident that there are two local minimum points and a saddle point in between them. Both the SDM and the CGM-PR were used, and the results are shown in Figure 9.15. What is seen is that the local minimum one finds depends on the location of the starting point. Between the two curves, the CGM-PR evaluates $F(x, y)$ 111 times, and the SDM evaluates it 70 times. Also, as it turns out, CGM-FR evaluates $F(x, y)$ 521 times (the curves obtained using this method are not shown in the figure).

Example 9.12 Consider the function

$$F(x, y) = 100(x^2 - y)^2 + (x - 1)^2. \tag{9.37}$$

This is known as the Rosenbrock function, and it is plotted in Figure 9.16. There is one minimum point and it is located at $(x, y) = (1, 1)$. Taking $\mathbf{x}_1 = (3/4, -3/4)^T$, the path taken by the CGM-PR is shown, as are the first 500 steps taken using the SDM. The CGM-PR finds the minimizer by taking 78 steps and, in the process, evaluates $F(x, y)$ 326 times. In contrast, the SDM requires 33128 function evaluations to locate the minimizer, while CGM-FR requires 2120 evaluations.

As is evident in the above examples, the CGM loses its finite termination property for general nonlinear functions. However, it is still capable of finding a relative minimum without much effort. The SDM, on the other hand, has many of the same difficulties seen for the quadratic case. On some problems it does just fine but it can be very slow on others, to the point that it is almost useless.

Figure 9.16 Top: Function given in (9.37). Middle: The first 500 steps using the SDM. Bottom: Descent path using the CGM-PR. The ∗ designates the location of the minimum.
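To see how the pieces of Sections 9.4.1 and 9.4.2 fit together, here is a minimal Python sketch of the descent iteration (9.30) applied to the Rosenbrock function (9.37). It is not the code behind the numbers reported above and will not reproduce them: it uses the plain Armijo steps with $a_1 = 1$ at every iteration, omits the two line search modifications described at the start of this section, and stops on a small gradient norm rather than on the distance to the exact solution. The names `descend`, `armijo`, `max_steps`, and `tol` are hypothetical.

```python
def F(p):
    x, y = p
    return 100 * (x * x - y) ** 2 + (x - 1) ** 2          # Rosenbrock (9.37)

def grad(p):
    x, y = p
    return [400 * x * (x * x - y) + 2 * (x - 1), -200 * (x * x - y)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def armijo(x, d, g, gamma=1/20, tau=1/2, a=1.0, max_backtracks=60):
    """Armijo backtracking; returns 0.0 if no acceptable step is found."""
    Fk, m = F(x), dot(g, d)
    for _ in range(max_backtracks):
        if F([xi + a * di for xi, di in zip(x, d)]) < Fk + a * gamma * m:
            return a
        a *= tau
    return 0.0

def descend(x, use_cg, max_steps=5000, tol=1e-8):
    """Descent iteration (9.30): SDM if use_cg is False, CGM-PR otherwise."""
    g = grad(x)
    d = [-gi for gi in g]                       # first step is steepest descent
    for _ in range(max_steps):
        if dot(g, g) ** 0.5 < tol:
            break
        a = armijo(x, d, g)
        if a == 0.0:                            # line search failed: give up
            break
        x = [xi + a * di for xi, di in zip(x, d)]
        g_new = grad(x)
        beta = 0.0
        if use_cg:                              # Polak-Ribiere (9.33)
            diff = [p - q for p, q in zip(g_new, g)]
            beta = dot(g_new, diff) / dot(g, g)
        d = [-p + beta * q for p, q in zip(g_new, d)]
        if dot(d, g_new) >= 0:                  # restart if not a descent direction
            d = [-p for p in g_new]
        g = g_new
    return x

x1 = [0.75, -0.75]
print(F(x1), F(descend(x1, use_cg=False)), F(descend(x1, use_cg=True)))
```

Because a step is accepted only when it lowers $F$, the function value never increases from one iterate to the next in either variant. How quickly each one closes in on $(1, 1)$ depends on the line search settings.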

9.5 Levenberg-Marquardt Method

We return to a comment made earlier that the descent algorithm (9.30) includes Newton's method if you take $\mathbf{B}^{-1}$ in (9.34) to be the Hessian. Two nice properties of Newton's method are its quadratic convergence and that it does not require a line search. Two not-so-nice properties are that it requires computing second derivatives and it requires a starting value close to the solution. What we are going to do is derive a method that comes close to eliminating the two not-so-nice properties, while maintaining the two nice properties. Be warned that the derivation is unlike others we have made in the sense that it is heuristic. Even so, the method derived is one of the most used for solving minimization problems.

The function to be minimized is assumed to have the form $F(x, y) = f(x, y)^2 + g(x, y)^2$. This can be written in vector form as

$$F(\mathbf{x}) = \|\mathbf{f}\|_2^2 = \mathbf{f} \bullet \mathbf{f}, \tag{9.38}$$

where $\mathbf{f} = (f(\mathbf{x}), g(\mathbf{x}))^T$. In this case,

$$\nabla F = \begin{pmatrix} \dfrac{\partial F}{\partial x} \\[2mm] \dfrac{\partial F}{\partial y} \end{pmatrix} = \begin{pmatrix} 2f\dfrac{\partial f}{\partial x} + 2g\dfrac{\partial g}{\partial x} \\[2mm] 2f\dfrac{\partial f}{\partial y} + 2g\dfrac{\partial g}{\partial y} \end{pmatrix} = 2\mathbf{J}^T\mathbf{f}, \tag{9.39}$$

where

$$\mathbf{J} = \begin{pmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \\[2mm] \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \end{pmatrix} \tag{9.40}$$

is the Jacobian matrix for $\mathbf{f}$.

Gauss-Newton Method

To find the minimizer, using the calculus approach, we need to solve $\nabla F = \mathbf{0}$. If Newton's method, as given in (3.41), is used then

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}_k^{-1}\mathbf{g}_k, \quad \text{for } k = 1, 2, 3, \cdots, \tag{9.41}$$

where $\mathbf{g} = \nabla F = 2\mathbf{J}^T\mathbf{f}$ and $\mathbf{H}$ is the Hessian matrix given as

$$\mathbf{H} = \begin{pmatrix} F_{xx} & F_{xy} \\ F_{xy} & F_{yy} \end{pmatrix}. \tag{9.42}$$

For example,

$$F_{xx} = 2f_x^2 + 2ff_{xx} + 2g_x^2 + 2gg_{xx}.$$

This brings us to the second assumption, which is that the second derivative terms in $F_{xx}$ are appreciably smaller than the first derivative terms. In other words, $F_{xx} \approx 2f_x^2 + 2g_x^2$. Applying the same assumption to the other second derivatives of $F$, then (9.41) is replaced with

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{G}_k^{-1}\mathbf{g}_k, \quad \text{for } k = 1, 2, 3, \cdots,$$

where

\begin{align}
\mathbf{G} &= \begin{pmatrix} 2f_x^2 + 2g_x^2 & 2f_yf_x + 2g_yg_x \\ 2f_yf_x + 2g_yg_x & 2f_y^2 + 2g_y^2 \end{pmatrix} \tag{9.43} \\
&= 2\mathbf{J}^T\mathbf{J}. \tag{9.44}
\end{align}

This is known as the Gauss-Newton method.

Levenberg-Marquardt Method

There is a serious flaw with Gauss-Newton, which is that it is not unusual for $\mathbf{G}_k$ to be non-invertible. The original fix for this, made by Levenberg, was to use shifting and replace $\mathbf{G}_k$ with $\mathbf{G}_k + \mu_k\mathbf{I}$. This idea comes from how regularization was used in Section 8.3.6 to fix the normal equation approach when solving the least squares problem. Levenberg's idea works, but it has the drawback that the choice for $\mu_k$ can easily make it so $\mathbf{G}_k$ has essentially no contribution to the sum $\mathbf{G}_k + \mu_k\mathbf{I}$, and this is the term that comes from Newton's method. A fix for this is to include a contribution of $\mathbf{G}_k$ in the shift. The choice, made by Marquardt, is to replace $\mathbf{G}_k$ with $\mathbf{G}_k + \mu_k\mathbf{D}_k$, where $\mathbf{D}_k$ is the diagonal matrix containing the diagonal entries in $\mathbf{G}_k$ (see Example 9.13).

The method resulting from the above discussion is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{S}_k^{-1}\mathbf{g}_k, \quad \text{for } k = 1, 2, 3, \cdots, \tag{9.45}$$

where

$$\mathbf{S}_k = \mathbf{G}_k + \mu_k\mathbf{D}_k, \quad \mathbf{D}_k = \text{diag}(\mathbf{G}_k), \tag{9.46}$$

$\text{diag}(\mathbf{G}_k)$ is the diagonal matrix obtained using the diagonals of $\mathbf{G}_k$, and $\mu_k$ is a positive constant known as the regularization parameter, or the damping parameter. This is called the Levenberg-Marquardt method (LMM). Even though the method was derived assuming $\mathbf{x} \in \mathbb{R}^2$ and $\mathbf{f} \in \mathbb{R}^2$, the formulas apply in the general case of when $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{f} \in \mathbb{R}^n$. The one difference is that the Jacobian $\mathbf{J}$ in (9.40) is $n \times m$.

Pick: $\mathbf{x}_1$, $\mu_1 > 0$, $\alpha > 1$, $\beta > 1$, and $tol > 0$
Loop: For $k = 1, 2, 3, \cdots$
    $\mathbf{g}_k = 2\mathbf{J}_k^T\mathbf{f}_k$
    $\mathbf{S}_k = 2\mathbf{J}_k^T\mathbf{J}_k + \mu_k\,\text{diag}(2\mathbf{J}_k^T\mathbf{J}_k)$
    $\mathbf{S}_k\mathbf{z} = \mathbf{g}_k$    % solve for $\mathbf{z}$
    $\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{z}$
    If $F(\mathbf{x}_{k+1}) > F(\mathbf{x}_k)$ then
        $\mu_{k+1} = \alpha\mu_k$, and $\mathbf{x}_{k+1} = \mathbf{x}_k$
    elseif $\|\mathbf{z}\| < tol$ then
        stop
    else
        $\mu_{k+1} = \mu_k/\beta$
    End
End

Table 9.3 The Levenberg-Marquardt method (LMM) for finding a minimizer of $F(\mathbf{x}) = \|\mathbf{f}\|_2^2$. Note that $\mathbf{J}_k$ is the Jacobian at $\mathbf{x}_k$.

The LMM is an interesting combination of two of our earlier methods. If you take $\mu_k = 0$ and restore the second derivative terms in $\mathbf{G}_k$, you get Newton's method. But, if you drop the $\mathbf{G}_k$ term in (9.46) you get the descent method (9.30) with $\mathbf{d}_k = -\mathbf{D}_k^{-1}\mathbf{g}_k$ and $\alpha_k = 1/\mu_k$. Note that to guarantee that it is a descent direction it is assumed that $\mathbf{D}_k$ is positive definite. This will hold if none of the diagonals of $\mathbf{D}_k$ are zero.

For the LMM the value of $\mu_k$ is not determined using a line search but it is specified, and often the specified value is changed as the iteration proceeds. Typically, in the beginning when you are presumably far from the minimizer, large values are used for $\mu_k$. Once you have closed in on the minimizer then positive values closer to zero are used. A complication with this is that even though the method moves in a direction of descent, since $\mu_k$ is being specified, it is possible to overshoot the minimum and end up with $F(\mathbf{x}_{k+1}) > F(\mathbf{x}_k)$. The fix when this happens is to stay at $\mathbf{x}_k$ and try another value for $\mu_k$. An often used strategy that combines all of these comments is to do something like the following:

    if $F(\mathbf{x}_{k+1}) > F(\mathbf{x}_k)$ then
        $\mathbf{x}_{k+1} = \mathbf{x}_k$ and $\mu_{k+1} = \alpha\mu_k$
    else
        $\mu_{k+1} = \mu_k/\beta$.

The resulting algorithm is outlined in Table 9.3.

The final assumption is that the method actually works. It is possible to prove it converges, and the convergence is quadratic once you get close to the minimizer, if you are careful how you select the regularization parameter $\mu_k$ and $\mathbf{J}$ has full rank [Bellavia and Morini, 2014]. Although this is useful to know, the fact is that the theory has limited applicability because of the uncertainty of how to pick $\mu_k$. What is certain is that the method has been found to work well on a wide variety of minimization problems. Arguably, it can be considered the standard method for solving small to moderately large problems. It is not used as much for very large values of $m$. The primary reason is that the resulting large matrices that the LMM uses at each iteration step increase the computing time significantly.

In the examples to follow, $\mu_1 = 10$, $\alpha = \beta = 2$, and $tol = 10^{-6}$. Also, the stopping condition is a bit different than what is given in Table 9.3. What is used is the relative iterative error of each component of $\mathbf{x}_{k+1}$. So, if $x$ and $z$ are the $i$th entries in $\mathbf{x}_{k+1}$ and $\mathbf{z}$, respectively, then, for each $i$, the requirement is that $|x - z|/|x| < tol$. There are two reasons for this. One is to be able to obtain a given number of correct digits for each entry. Second, the quantity $\|\mathbf{z}\|_2$ is not defined in the last two examples and this is because the physical dimensions of the variables are incompatible. As an illustration, in Example 9.15, $a$ and $b$ have dimensions of length, while $\phi$ is an angle.

Example 9.13 The Rosenbrock function is given in (9.37), and it can be written as $F(x, y) = f^2 + g^2$, where $f = 10(x^2 - y)$ and $g = x - 1$. So, $\mathbf{f} = (10(x^2 - y), x - 1)^T$,

$$\mathbf{J} = \begin{pmatrix} 20x & -10 \\ 1 & 0 \end{pmatrix}, \quad \mathbf{G} = 2\mathbf{J}^T\mathbf{J} = 2\begin{pmatrix} 400x^2 + 1 & -200x \\ -200x & 100 \end{pmatrix},$$

and

$$\mathbf{S} = 2\begin{pmatrix} 400x^2 + 1 & -200x \\ -200x & 100 \end{pmatrix} + 2\mu\begin{pmatrix} 400x^2 + 1 & 0 \\ 0 & 100 \end{pmatrix}.$$

Taking $\mathbf{x}_1 = (3/4, -3/4)^T$, the resulting path produced by the LMM is shown in Figure 9.17. In finding the minimizer the LMM evaluates $\mathbf{f}$ and $\mathbf{J}$, combined, 41 times, to locate the minimum. Consequently, for this example, the LMM and the CGM-PR are similar in terms of the work required to locate the minimizer.
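The loop in Table 9.3, applied to the residuals and Jacobian of Example 9.13, can be sketched in a few lines of Python. The sketch below makes several assumptions that are not in the text: it is restricted to two unknowns so the system $\mathbf{S}_k\mathbf{z} = \mathbf{g}_k$ can be solved with Cramer's rule, it uses the stopping test $\|\mathbf{z}\|_2 < tol$ from Table 9.3 rather than the relative test used for the book's examples, and the names `lmm`, `res`, `jac`, and `max_iter` are hypothetical. It also assumes $\mathbf{S}_k$ is nonsingular, and it is not expected to reproduce the evaluation count quoted above.

```python
import math

def lmm(res, jac, x, mu=10.0, alpha=2.0, beta=2.0, tol=1e-6, max_iter=500):
    """Levenberg-Marquardt (Table 9.3) for F = ||f||^2 with two unknowns.

    res(x) returns the residual list f, jac(x) the n x 2 Jacobian as rows.
    """
    def F(p):
        return sum(r * r for r in res(p))

    for _ in range(max_iter):
        f, J = res(x), jac(x)
        n = len(f)
        # g = 2 J^T f and G = 2 J^T J (symmetric 2 x 2)
        g = [2 * sum(J[i][c] * f[i] for i in range(n)) for c in range(2)]
        G11 = 2 * sum(J[i][0] ** 2 for i in range(n))
        G12 = 2 * sum(J[i][0] * J[i][1] for i in range(n))
        G22 = 2 * sum(J[i][1] ** 2 for i in range(n))
        # S = G + mu * diag(G)
        s11, s12, s22 = (1 + mu) * G11, G12, (1 + mu) * G22
        det = s11 * s22 - s12 * s12
        z = [(s22 * g[0] - s12 * g[1]) / det,     # Cramer's rule for S z = g
             (s11 * g[1] - s12 * g[0]) / det]
        x_new = [x[0] - z[0], x[1] - z[1]]
        if F(x_new) > F(x):
            mu *= alpha                           # reject step, increase damping
        elif math.hypot(z[0], z[1]) < tol:
            return x_new                          # converged
        else:
            x, mu = x_new, mu / beta              # accept step, reduce damping
    return x

# Rosenbrock residuals and Jacobian from Example 9.13
rosen_res = lambda p: [10 * (p[0] ** 2 - p[1]), p[0] - 1]
rosen_jac = lambda p: [[20 * p[0], -10.0], [1.0, 0.0]]
print(lmm(rosen_res, rosen_jac, [0.75, -0.75]))
```

For more than two unknowns the Cramer's rule lines would be replaced by a general linear solve, for example a Cholesky factorization, since $\mathbf{S}_k$ is symmetric and positive definite when the diagonal of $\mathbf{G}_k$ has no zeros.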

Figure 9.17 Descent path obtained when minimizing the Rosenbrock function using the LMM. The ∗ designates the location of the minimum.

It is worth pointing out that in the above example, the approximation used in (9.44) does not hold at every point along the descent path. Even so, the method still works. The reason is that even if (9.44) does not hold, $\mathbf{S}_k$ is still symmetric and positive definite. Consequently, the method is still moving in a descent direction.

Example 9.14 The LMM is tailor-made to solve nonlinear regression problems, and as an example, suppose the model function is $g(t) = v_1(1 - e^{-v_2t})$. Also, the data for the $[t_i, g_i]$ values are: [0.5, 0.05], [1, 0.1], and [2, 0.16]. In this case, $n = 3$ and $m = 2$. To use the formulas derived for the LMM, we take $x = v_1$, $y = v_2$, $f_i = x(1 - e^{-yt_i}) - g_i$, and $\mathbf{f} = (f_1, f_2, f_3)^T$. Also, note that

$$\mathbf{J} = \begin{pmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} \\[2mm] \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} \\[2mm] \dfrac{\partial f_3}{\partial x} & \dfrac{\partial f_3}{\partial y} \end{pmatrix},$$

where $\frac{\partial}{\partial x}f_i = 1 - e^{-yt_i}$ and $\frac{\partial}{\partial y}f_i = xt_ie^{-yt_i}$. Taking $\mathbf{x}_1 = (1, 1)^T$, the LMM takes 18 steps and finds that $v_1 = 0.2829$ and $v_2 = 0.4196$.
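When setting up a regression problem like this one, most mistakes are made in the Jacobian, so it is worth checking the formulas before running the LMM. The Python sketch below codes the residuals and Jacobian of Example 9.14, compares the Jacobian with a centered finite-difference approximation, and evaluates the gradient $2\mathbf{J}^T\mathbf{f}$ at the reported values of $v_1$ and $v_2$. The function names and the step size `eps` are choices made for this sketch. The finite-difference comparison is a common sanity check, not something taken from the text.

```python
import math

t_data = [0.5, 1.0, 2.0]
g_data = [0.05, 0.1, 0.16]

def residuals(v):
    """f_i = x (1 - exp(-y t_i)) - g_i, with x = v1 and y = v2."""
    x, y = v
    return [x * (1 - math.exp(-y * t)) - g for t, g in zip(t_data, g_data)]

def jacobian(v):
    """Rows are [df_i/dx, df_i/dy] from the formulas in Example 9.14."""
    x, y = v
    return [[1 - math.exp(-y * t), x * t * math.exp(-y * t)] for t in t_data]

def gradient(v):
    """g = 2 J^T f, the gradient of F = ||f||^2."""
    f, J = residuals(v), jacobian(v)
    return [2 * sum(J[i][c] * f[i] for i in range(3)) for c in range(2)]

def fd_jacobian(v, eps=1e-6):
    """Centered finite-difference approximation of J (used only as a check)."""
    cols = []
    for c in range(2):
        vp, vm = list(v), list(v)
        vp[c] += eps
        vm[c] -= eps
        fp, fm = residuals(vp), residuals(vm)
        cols.append([(a - b) / (2 * eps) for a, b in zip(fp, fm)])
    return [[cols[0][i], cols[1][i]] for i in range(3)]

J, J_fd = jacobian([1.0, 1.0]), fd_jacobian([1.0, 1.0])
print(max(abs(J[i][c] - J_fd[i][c]) for i in range(3) for c in range(2)))
print(gradient([0.2829, 0.4196]))   # should be close to the zero vector
```

The gradient at the reported solution is not exactly zero because $v_1$ and $v_2$ are given to only four digits, but both components should be small compared with the gradient at the starting point $(1, 1)^T$.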


Example 9.15 A parametric formula for an ellipse centered at the origin is

$$\mathbf{p}(\theta) = \big(a\cos\phi\cos\theta - b\sin\phi\sin\theta,\; b\cos\phi\sin\theta + a\sin\phi\cos\theta\big).$$

In this expression, $a$ and $b$ are the axes associated with the ellipse, and $\phi$ is the angle the ellipse makes with the x-axis. The values for these three parameters are determined by fitting $\mathbf{p}(\theta)$ to the data points shown in Figure 9.18. Now, given a data point $\mathbf{p}_i = (x_i, y_i)$, its $\theta_i$ value can be determined using the polar coordinate formula $\tan\theta_i = y_i/x_i$. With this, we take $f_i = \|\mathbf{p}(\theta_i) - \mathbf{p}_i\|_2^2 = (x(\theta_i) - x_i)^2 + (y(\theta_i) - y_i)^2$. This means that

$$F(a, b, \phi) = \sum_{i=1}^{n}\Big[\big(x(\theta_i) - x_i\big)^2 + \big(y(\theta_i) - y_i\big)^2\Big]^2.$$

So, $\mathbf{J}$ is an $n \times 3$ matrix with the $i$th row containing $\frac{\partial}{\partial a}f_i$, $\frac{\partial}{\partial b}f_i$, and $\frac{\partial}{\partial\phi}f_i$. For example,

$$\frac{\partial}{\partial a}f_i = 2\big(x(\theta_i) - x_i\big)\cos\phi\cos\theta_i + 2\big(y(\theta_i) - y_i\big)\sin\phi\cos\theta_i.$$

Taking $\mathbf{x}_1 = (1, 0.5, 0)^T$, the LMM takes 56 steps and finds that $a = 2.01$, $b = 1.03$, and $\phi = 1.04$. The resulting ellipse is shown in Figure 9.18.

Figure 9.18 Data to be fit by an ellipse, and the resulting ellipse obtained using the LMM.


9.5.1 Parting Comments

The methods considered here have many moving parts. How well each works depends on the parameter values used, as well as the particular function being minimized. The choices you make also depend on what you want. For example, you might select the parameters based on finding a local minimizer very quickly. Typically in such cases the method will lock in on a particular local minimum in very few steps. In contrast, you might want the method to have the opportunity to discover a local minimum that is lower than the others that are possible. In this case, the parameter values are selected to let the method look around a bit before committing to a particular minimum. One way to do this using a line search is to take a larger value of $a_1$, and for the LMM, to use a larger value for $\mu_1$. The cost of doing this is more work, but with the benefit of maybe locating a lower minimum value.

The methods we have considered are not designed to guarantee finding the global minimum of a function. Locating a global minimizer of a nonlinear multivariable function is typically very difficult. There are methods that are used for this, such as simulated annealing and evolutionary computational algorithms, but they are slow and have only limited success. A cruder approach, but used by many, is to simply use one of the methods considered here but solve the problem multiple times using a wide range of start locations. To get a sense of what's involved, to cover the region in Figure 9.15, you might use 5 points along the x-axis and the same number on the y-axis. So, you would solve the problem using 25 different starting points. If there are 10 variables then the number jumps to about $10^7$. In other words, this crude approach is limited to lower dimensional minimization problems.
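The growth in the number of starting points is easy to confirm. The short Python sketch below builds a uniform grid of start locations and counts them. The function name `start_points` and the box $[-1, 1]$ are assumptions made for the illustration (the box matches the axes in Figure 9.15).

```python
from itertools import product

def start_points(lo, hi, per_axis, dim):
    """Uniform grid of starting points on the box [lo, hi]^dim.

    The points are generated lazily, which matters once dim is large.
    """
    step = (hi - lo) / (per_axis - 1)
    axis = [lo + i * step for i in range(per_axis)]
    return product(axis, repeat=dim)

print(len(list(start_points(-1.0, 1.0, 5, 2))))   # 25 starts in two variables
print(5 ** 10)                                     # 9765625 starts in ten variables
```

Each of these starting points requires a complete minimization run, which is why the multi-start approach is only practical for a small number of variables.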

9.6 Minimization Without Differentiation

The minimization method we now consider does not use the gradient and is therefore applicable to functions that are just continuous. It is based on a simple geometrical argument and has been known since antiquity. However, the method described here is of more recent vintage, first proposed in 1965, and it is known as the Nelder-Mead algorithm [Nelder and Mead, 1965]. It is worth mentioning that according to Nelder, their method was not well received by some "professional optimizers" because of the lack of mathematical analysis [Nelder, 1979]. Another reason researchers did not take them seriously was where they worked, which was the National Vegetable Research Station. However, all is well now (almost) and it is one of the foundational methods in computational optimization.

The basic idea underlying the method can be illustrated using either of the surfaces shown in Figures 9.15 and 9.16. Given three points on either surface, it's possible to construct an approximate direction of descent by positioning yourself at the highest point and then looking in a direction that passes between the other two points. The goal is to find a point in this direction that is lower than the highest value. A description of the procedure is given in Table 9.4 for the case of finding a local minimum of $F(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^2$.

To explain the reasoning for the steps in Table 9.4, in Step 1 the points are relabeled so $\mathbf{x}_3$ produces the largest value of the function. In Step 2 the midpoint between the two lower points, the centroid, is used to determine a descent direction by letting $\mathbf{d} = \mathbf{x}_c - \mathbf{x}_3$. Note that it is likely that $\mathbf{d}$ is a descent direction but it does not have to be, and contingencies are included if this happens. With this, in Step 3, the procedure tentatively moves in this direction by using a reflection. If the value of the function at $\mathbf{x}_r$ is lower than all the other values then it will make a second step just to see if the function can be reduced even further (Step 4). However, if the value at $\mathbf{x}_r$ is not small enough then the procedure will try a contraction (Steps 5 or 6). If all this fails then there is a restart (Step 7), which is done by placing the new points near the point that currently produces the smallest value of the function. Note that, in each loop (from Step 1 back to Step 1), if a restart is not needed then the method evaluates the function $F$ no more than twice.

Like all iteration methods, the question is when to stop. One option is to use the distances between the three vertices, requiring them to be less than a given tolerance $tol$. Related to this question is, when finished, what value is used for the approximate minimizer? The simple choice is to take the vertex that produces the smallest value for $F(\mathbf{x})$. However, it often happens that the method eventually traps the minimizer in the triangle (see Example 1). For this reason it is worth checking on the value of the function at the centroid $(\mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3)/3$ of the triangle to see if it yields a smaller value than at the vertices.

Not every situation is covered by the above algorithm, and a complete version would need to explain, for example, what to do if $F_2 = F_3$, or if the three points calculated by this procedure become collinear. For those interested, the complete version is given in Lagarias et al. [1998] and Conn et al. [2009].

Examples are given below using the Nelder-Mead algorithm outlined above. It will be seen that it finds a local minimum fairly easily but the number of iteration steps is larger than what was needed when using the CGM. This is not surprising because some price is expected when using a method that works on a broader class of problems than the CGM (or any method that uses the gradient). However, the computational effort for Nelder-Mead is fairly low, usually requiring only 1 or 2 function evaluations per iteration.

Example 9.16 Consider the function

$$F(x, y) = 10x^2 + y^2. \tag{9.47}$$

Step 0: Pick $\alpha > 0$ (e.g., $\alpha = 1$) and $\beta$ with $0 < \beta < \alpha$ (e.g., $\beta = \frac{1}{2}$). Pick $0 < \delta < 1$ (e.g., $\delta = \frac{1}{2}$). Pick $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ that are not collinear.
Step 1: If necessary, relabel points so that $F_1 \le F_2 < F_3$, where $F_i = F(\mathbf{x}_i)$.
Step 2: Calculate the centroid of the lowest side: $\mathbf{x}_c = \frac{1}{2}(\mathbf{x}_1 + \mathbf{x}_2)$, and calculate an approximate direction of descent: $\mathbf{d} = \mathbf{x}_c - \mathbf{x}_3$.
Step 3: Reflection: Calculate $\mathbf{x}_r = \mathbf{x}_c + \alpha\mathbf{d}$, and $F_r = F(\mathbf{x}_r)$.
    If $F_1 \le F_r < F_2$, let $\mathbf{x}_3 = \mathbf{x}_r$ and return to Step 1.
    If $F_r < F_1$ then go to Step 4.
    If $F_2 \le F_r < F_3$ then go to Step 5.
    If $F_3 \le F_r$ then go to Step 6.
Step 4: Expansion: Calculate $\mathbf{x}_e = \mathbf{x}_c + 2\alpha\mathbf{d}$, and $F_e = F(\mathbf{x}_e)$. If $F_e < F_r$ then let $\mathbf{x}_3 = \mathbf{x}_e$, otherwise let $\mathbf{x}_3 = \mathbf{x}_r$. Return to Step 1.
Step 5: Outside Contraction: Calculate $\mathbf{x}_o = \mathbf{x}_c + \beta\mathbf{d}$. If $F(\mathbf{x}_o) \le F_r$ then let $\mathbf{x}_3 = \mathbf{x}_o$ and return to Step 1, otherwise go to Step 7.
Step 6: Inside Contraction: Calculate $\mathbf{x}_s = \mathbf{x}_3 + \beta\mathbf{d}$. If $F(\mathbf{x}_s) < F_3$ then let $\mathbf{x}_3 = \mathbf{x}_s$ and return to Step 1, otherwise go to Step 7.
Step 7: Restart: Replace each $\mathbf{x}_i$ with $\mathbf{x}_1 + \delta(\mathbf{x}_i - \mathbf{x}_1)$ and return to Step 1.

Table 9.4 Nelder-Mead algorithm for $\mathbf{x} \in \mathbb{R}^2$.

Figure 9.19 Triangles constructed using the Nelder-Mead algorithm to find the minimum of (9.47).

The resulting surface is a paraboloid, which has a minimum at $(x, y) = (0, 0)$. The first five triangles produced by Nelder-Mead are shown in Figure 9.19. It is worth identifying how the steps in the algorithm produced these triangles. To get the 2nd triangle, the first was reflected and then a double step was made. In other words, it comes from Step 4. The 3rd triangle comes from a single reflection, so Step 3 was used. In contrast, the 4th comes from an outside contraction, which is Step 5, and the same is true for the 5th triangle. Because the minimum point is now within the 5th triangle, it is expected that the majority of triangles produced by the method involve inside contractions (Step 6). If so then from this point on the method is reminiscent of the bisection method. This is because an inside contraction involves a single subdivision after which one picks which side the solution is on. Like the bisection method, this produces a dependable way to find the minimizer but it is not particularly fast in finding it. For this example, it takes the method 46 iteration steps to achieve an area error that is about $10^{-6}$.

Example 9.17 Consider the function

$$F(x, y) = (x - y)^4 + 8xy - x + y + 3. \tag{9.48}$$

This surface is shown in Figure 9.15, and the first six triangles used by Nelder-Mead are shown in Figure 9.20. It takes the method 46 iteration steps to achieve an error that is less than $10^{-6}$.
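The steps in Table 9.4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the two examples: the starting triangle is an arbitrary choice (the one used in the book is not given), ties such as $F_2 = F_3$ are simply left to the sort order, collinear points are not detected, and the stopping test uses the longest side of the triangle. The names `nelder_mead`, `pts`, and `max_iter` are hypothetical. The function `math.dist` requires Python 3.8 or later.

```python
import math

def nelder_mead(F, pts, alpha=1.0, beta=0.5, delta=0.5, tol=1e-6, max_iter=500):
    """Nelder-Mead for x in R^2, following Steps 1-7 of Table 9.4."""
    pts = [list(p) for p in pts]
    for _ in range(max_iter):
        pts.sort(key=F)                                   # Step 1: F1 <= F2 <= F3
        x1, x2, x3 = pts
        F1, F2, F3 = F(x1), F(x2), F(x3)
        if max(math.dist(x1, x2), math.dist(x1, x3), math.dist(x2, x3)) < tol:
            break
        xc = [(a + b) / 2 for a, b in zip(x1, x2)]        # Step 2: centroid
        d = [c - w for c, w in zip(xc, x3)]               # approximate descent direction
        step = lambda base, t: [b + t * di for b, di in zip(base, d)]
        xr = step(xc, alpha)                              # Step 3: reflection
        Fr = F(xr)
        if Fr < F1:                                       # Step 4: expansion
            xe = step(xc, 2 * alpha)
            new = xe if F(xe) < Fr else xr
        elif Fr < F2:                                     # accept the reflection
            new = xr
        elif Fr < F3:                                     # Step 5: outside contraction
            xo = step(xc, beta)
            new = xo if F(xo) <= Fr else None
        else:                                             # Step 6: inside contraction
            xs = step(x3, beta)
            new = xs if F(xs) < F3 else None
        if new is None:                                   # Step 7: restart toward x1
            pts = [[a + delta * (b - a) for a, b in zip(x1, p)] for p in pts]
        else:
            pts[2] = new
    return min(pts, key=F)

paraboloid = lambda p: 10 * p[0] ** 2 + p[1] ** 2          # function (9.47)
print(nelder_mead(paraboloid, [(1.0, 1.0), (1.0, 0.5), (0.5, 1.0)]))
```

For clarity the sketch re-evaluates $F$ at the vertices in every pass. An implementation concerned with cost would store the three function values, so that only the one or two new evaluations mentioned in the text are needed per iteration.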

Figure 9.20 Triangles constructed using the Nelder-Mead algorithm to find the minimum of (9.48).

9.7 Variational Problems

There are several well-known minimization principles in physics that can be used to characterize how something will behave. For example, Fermat's principle states that the path taken by a ray of light minimizes the travel time. Another is the principle of minimum potential energy which states that out of all possible admissible displacements an object can undergo, the one that is realized is the one that minimizes the potential energy of the system. The question considered here is how these can be solved using one or more of the numerical optimization methods we have considered earlier in this chapter. Note that most of these principles are used to derive what are called Euler-Lagrange equations using the calculus of variations. This is not done here, and the problems will be solved by considering only the minimization principle.

9.7.1 Example: Minimum Potential Energy

An elastic string occupies the interval $0 \le x \le 1$, and is held at its two endpoints. When a transverse load $w(x)$ is applied, the string deflects. Letting $u(x)$ be the vertical displacement of the string, then the potential energy is

$$V = \int_0^1 \left(\frac{1}{2}Tu_x^2 - wu\right)dx, \tag{9.49}$$

where $T$ is the tension (it is a positive constant). According to the principle of minimum potential energy, the displacement $u(x)$ is the function which minimizes $V$ [Weinstock, 1974]. Also, note that the string is being held at its ends, and so any possible solution of this problem is required to satisfy $u(0) = u(1) = 0$.

We will find the minimizing function by introducing grid points along the x-axis, and then use them to numerically determine $u_x$ as well as the value of the integral. To help make it clear how these approximations are used, the problem is written as

$$V = \int_0^1 f(x)\,dx, \tag{9.50}$$

where

$$f(x) = \frac{1}{2}Tu_x^2 - wu. \tag{9.51}$$

Suppose five points are used, and they are $x_0 = 0$, $x_1 = 1/4$, $x_2 = 1/2$, $x_3 = 3/4$, and $x_4 = 1$. We will use the composite trapezoidal rule to compute $V$, and so

$$V \approx h\left(\frac{1}{2}f_0 + f_1 + f_2 + f_3 + \frac{1}{2}f_4\right), \tag{9.52}$$

where $f_i = f(x_i)$. Centered, or symmetric, approximations will be used when possible to compute $f_i$. Specifically, using the first-order approximation in Table 7.1 (with $h = 1/4$)

\begin{align}
u_x(0)^2 &\approx \left(\frac{u(x_1) - u(x_0)}{h}\right)^2, \notag \\
u_x(x_i)^2 &\approx \frac{1}{2}\left[\left(\frac{u(x_{i+1}) - u(x_i)}{h}\right)^2 + \left(\frac{u(x_i) - u(x_{i-1})}{h}\right)^2\right], \quad \text{for } i = 1, 2, 3, \tag{9.53} \\
u_x(1)^2 &\approx \left(\frac{u(x_4) - u(x_3)}{h}\right)^2. \notag
\end{align}

The reasoning behind using a symmetric difference formula will be explained after the examples are complete. Combining (9.52) and (9.53), and using the boundary conditions $u_0 = u_4 = 0$, we have that

$$V \approx \frac{T}{2h}\left[u_1^2 + (u_2 - u_1)^2 + (u_3 - u_2)^2 + u_3^2\right] - h(w_1u_1 + w_2u_2 + w_3u_3), \tag{9.54}$$

where $w_i = w(x_i)$. It is worth observing that the above expression is a quadratic function of the $u_i$'s.

To find the values of $u_1$, $u_2$, and $u_3$ that minimize $V$ we can use the CGM, Nelder-Mead, or the calculus solution. The latter is informative, and so we calculate the three derivatives

$$\frac{\partial V}{\partial u_i} \quad \text{for } i = 1, 2, 3$$

and then set them to zero. This leads to the matrix equation

$$\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \alpha\begin{pmatrix} w_1 \\ w_2 \\ w_3 \end{pmatrix},$$

where $\alpha = h^2/T$. The above matrix is tridiagonal, and this also happens when more grid points are used. Consequently, the equation can be solved very quickly using the Thomas algorithm (see Section 3.8). A comparison between the resulting solution and the exact solution is shown in Figure 9.21 in the case of when $w = -x^4$ and $T = 1$. The exact solution is

$$u = \frac{1}{30T}x(x^5 - 1).$$

Not unexpectedly, the approximation closely matches the exact result.
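The tridiagonal system obtained from the calculus solution extends directly to any number of grid points, and it can be solved with the Thomas algorithm in a few lines. The Python sketch below does this and compares the result with the exact solution. Two assumptions should be noted. First, the load is taken to be $w = -x^4$, which is the sign for which the stated exact solution $u = x(x^5 - 1)/(30T)$ satisfies $Tu_{xx} + w = 0$ with $u(0) = u(1) = 0$. Second, the function names `thomas` and `string_deflection` and the argument `N` (the number of subintervals) are hypothetical.

```python
def thomas(a, b, c, r):
    """Solve a tridiagonal system.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), r: right-hand side.
    """
    n = len(b)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):                       # forward elimination
        den = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / den
        rp[i] = (r[i] - a[i] * rp[i - 1]) / den
    u = [0.0] * n
    u[-1] = rp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        u[i] = rp[i] - cp[i] * u[i + 1]
    return u

def string_deflection(N, w, T=1.0):
    """Minimize the discretized energy using N subintervals of [0, 1].

    Returns the interior grid points and the displacements u_1, ..., u_{N-1}.
    """
    h = 1.0 / N
    xs = [i * h for i in range(1, N)]
    n = N - 1
    rhs = [h * h / T * w(x) for x in xs]        # alpha * w_i with alpha = h^2 / T
    return xs, thomas([-1.0] * n, [2.0] * n, [-1.0] * n, rhs)

xs, u = string_deflection(21, lambda x: -x ** 4)     # 22 grid points, as in Figure 9.21
exact = [x * (x ** 5 - 1) / 30 for x in xs]
print(max(abs(ui - ei) for ui, ei in zip(u, exact)))
```

With `N = 4` this reproduces the $3 \times 3$ system given above. The printed value is the largest difference between the computed and exact displacements at the interior grid points.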

Figure 9.21 Numerical solution, the circles, for the function that minimizes (9.49) when using 22 grid points, and the exact solution, the solid (red) curve.

9.7.2 Example: Brachistochrone Problem

Suppose two points A and B in a vertical plane are given, with A higher than B. A mass is going to move from A to B, while subjected to gravity, and the problem is to find the path it must take so the travel time is minimized. If A is directly above B the answer is easy, the mass should simply drop straight down. So, in what follows it is assumed that the two points are not on a vertical line. In particular, it is assumed that A is the origin, and B is the point $(1, b)$, where $b > 0$. It is assumed that $y$ is positive in the downward direction. In this case, if the mass follows a path $y(x)$, then the travel time for that path is [Weinstock, 1974]

$$T = \int_0^1 f(x)\,dx, \tag{9.55}$$

where

$$f(x) = \sqrt{\frac{1}{2gy}\left[1 + \left(\frac{dy}{dx}\right)^2\right]}. \tag{9.56}$$

The objective is to find the path $y(x)$ that minimizes the value of this integral. In doing this it is required that $y(0) = 0$ and $y(1) = b$. This problem is one of the first examples studied in the calculus of variations, but its numerical solution is not so simple. This is evident in (9.56) because the denominator is zero at the left endpoint, and this complicates using a numerical method. However, the situation is actually worse than this. To explain, the solution is found to be a cycloid, and it is given parametrically as

$$x = A(\theta - \sin\theta), \quad y = A(1 - \cos\theta), \quad \text{for } 0 \le \theta \le \theta_M.$$

The values of $A$ and $\theta_M$ are found from the requirements that $x = 1$ and $y = b$ at $\theta = \theta_M$. From this it is possible to show that near the left endpoint, $y \approx m x^{2/3}$, where $m$ is a positive constant. This means that the $(y')^2$ term in the numerator in (9.56) is as bad as the $y$ term in the denominator. As a final comment, the numerical solution will be compared to the above exact solution. To make the comparison, we will take $\theta_M = \pi/2$, which means that $A = b = 2/(\pi - 2)$.

One way to deal with the singularity at the left endpoint is to change variables in the integral. The particular choice will use the inverse function for $y$. If $y = f(x)$, then $x = f^{-1}(y)$. This means that

$$dx = \left(\frac{d}{dy} f^{-1}\right) dy = \frac{dx}{dy}\, dy \quad \text{and} \quad \frac{dy}{dx} = 1 \Big/ \frac{dx}{dy}.$$

Substituting this into (9.55) we get that

$$T = \int_0^b F(y)\,dy, \tag{9.57}$$

where

$$F(y) = \sqrt{\frac{1}{2gy}\left[1 + \left(\frac{dx}{dy}\right)^2\right]}. \tag{9.58}$$
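For readers who want the intermediate algebra, the substitution leading from (9.55) and (9.56) to (9.57) and (9.58) can be written out as below. This is a sketch that assumes $dx/dy \ge 0$ along the path, which holds for the cycloid on $0 \le \theta \le \pi/2$ but is an added assumption rather than something stated in the text.

```latex
\begin{align*}
f(x)\,dx
  &= \sqrt{\frac{1}{2gy}\left[1 + \left(\frac{dy}{dx}\right)^{2}\right]}\;\frac{dx}{dy}\,dy \\
  &= \sqrt{\frac{1}{2gy}\left[1 + \left(\frac{dx}{dy}\right)^{-2}\right]\left(\frac{dx}{dy}\right)^{2}}\;dy \\
  &= \sqrt{\frac{1}{2gy}\left[\left(\frac{dx}{dy}\right)^{2} + 1\right]}\;dy
   = F(y)\,dy .
\end{align*}
```

The limits change from $0 \le x \le 1$ to $0 \le y \le b$ because $y(0) = 0$ and $y(1) = b$.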

Note that the denominator is still zero at $y = 0$ but the $\left(\frac{dx}{dy}\right)^2$ term is not singular at $y = 0$. This is because near the left end, $x \approx M y^{3/2}$, where $M$ is a positive constant. So, $\left(\frac{dx}{dy}\right)^2 \approx \left(\frac{3}{2} M\right)^2 y$ and this is a continuous function at $y = 0$.

We will minimize (9.57) in a manner similar to what was done in the last example, which means a set of grid points will be used to find approximations for the derivative and integral. The difference now is that the grid points are placed along the $y$-axis, and we solve for the values of the $x_i$'s. Also, to deal with the singularity at $y = 0$ we will separate it from the rest of the interval, and this will be done by writing

$$T = \int_0^{\delta} F(y)\,dy + \int_{\delta}^{b} F(y)\,dy, \tag{9.59}$$

where $\delta$ is small and positive. It should also be noted that the original requirements that $y(0) = 0$ and $y(1) = b$ are now written as $x(0) = 0$ and $x(b) = 1$.

The integral on the right in (9.59) has no singularity and the composite trapezoidal rule will be used to evaluate it. The grid points are $y_1 < y_2 < \cdots < y_{n+1}$, where $y_1 = \delta$, $y_{n+1} = b$, and $y_{i+1} - y_i = k$. The resulting approximation is

$$\int_{\delta}^{b} F(y)\,dy \approx k\left(\frac{1}{2}F_1 + F_2 + F_3 + \cdots + F_n + \frac{1}{2}F_{n+1}\right). \tag{9.60}$$

As before, centered, or symmetric, approximations will be used when possible to compute the derivative term in $F$. Specifically,

$$\left(\frac{dx}{dy}\right)^2 \approx \left(\frac{x(y_1) - x(y_2)}{k}\right)^2, \quad \text{for } i = 1,$$

$$\left(\frac{dx}{dy}\right)^2 \approx \frac{1}{2}\left[\left(\frac{x(y_{i+1}) - x(y_i)}{k}\right)^2 + \left(\frac{x(y_i) - x(y_{i-1})}{k}\right)^2\right], \quad \text{for } i = 2, 3, \cdots, n,$$

$$\left(\frac{dx}{dy}\right)^2 \approx \left(\frac{x(y_{n+1}) - x(y_n)}{k}\right)^2, \quad \text{for } i = n + 1.$$

Figure 9.22 Numerical solution, the circles, of the brachistochrone problem, and the exact solution, the solid (blue) curve. (Axes: y-axis versus x-axis, for 0 ≤ x ≤ 1.)

The left integral in (9.59) requires a bit more care. As explained in Section 6.1, one possibility is to pick a point $c_0$ in this interval and then use the approximation

$$\int_0^{\delta} F(y)\,dy \approx F(c_0)\,\delta.$$

We will do this for the “good” part of $F(y)$ but leave the term that is responsible for the singularity. Picking the midpoint, so $c_0 = \delta/2$, leads to the approximation

$$\int_0^{\delta} F(y)\,dy \approx \sqrt{\frac{1}{2g}\left[1 + (x'(c_0))^2\right]} \int_0^{\delta} \frac{1}{\sqrt{y}}\,dy = \sqrt{\frac{2\delta}{g}\left[1 + (x'(c_0))^2\right]},$$

where

$$x'(c_0) \approx \frac{x(y_1) - x(0)}{\delta} = \frac{x(y_1)}{\delta}.$$

A comparison between the resulting numerical solution obtained from minimizing (9.57) and the exact solution is shown in Figure 9.22. In this calculation, $n = 8$, $\delta = 10^{-2}$, $b = 2/(\pi - 2)$, and the minimizer was found using the Nelder-Mead algorithm as implemented in the MATLAB command fminsearch. The initial guess used to start the solver is simply a straight line connecting the two points. The resulting solution is reasonably accurate, and it can be improved by using larger values of $n$.
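The discretized travel time built from (9.59) and (9.60) can be sketched as a function of the unknown grid values. The code below is an illustration under stated assumptions, not the book's program: the names `travel_time`, `grid`, and `cycloid_x` are hypothetical, $g = 9.8$ is assumed, and no minimization is performed. It only evaluates the objective for two candidate paths, the straight-line initial guess and the exact cycloid sampled on the grid, using a finer grid than the $n = 8$ in the text so that the discretization error is small.

```python
import math


def travel_time(x, b, delta, g=9.8):
    """Approximate T in (9.59) from x[i] ~ x(y_{i+1}), i = 0, ..., n.

    The right integral uses the composite trapezoidal rule (9.60) with the
    symmetric derivative approximations; the left integral uses the midpoint
    treatment of the nonsingular factor. Note x[n] should equal 1.
    """
    n = len(x) - 1
    k = (b - delta) / n
    # Squared slopes between consecutive grid points.
    sq = [((x[i + 1] - x[i]) / k) ** 2 for i in range(n)]
    total = 0.0
    for i in range(n + 1):
        y = delta + i * k
        if i == 0:
            d2 = sq[0]                       # one-sided at y_1
        elif i == n:
            d2 = sq[n - 1]                   # one-sided at y_{n+1}
        else:
            d2 = 0.5 * (sq[i] + sq[i - 1])   # symmetric average
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.sqrt((1.0 + d2) / (2.0 * g * y))
    right = k * total
    left = math.sqrt(2.0 * delta / g * (1.0 + (x[0] / delta) ** 2))
    return left + right


def grid(n, b, delta):
    """Grid points y_1 = delta, ..., y_{n+1} = b."""
    k = (b - delta) / n
    return [delta + i * k for i in range(n + 1)]


def cycloid_x(y, A):
    """x on the cycloid x = A(theta - sin theta), y = A(1 - cos theta)."""
    theta = math.acos(max(-1.0, min(1.0, 1.0 - y / A)))
    return A * (theta - math.sin(theta))


b = 2.0 / (math.pi - 2.0)       # theta_M = pi/2, so A = b
delta, n = 1.0e-2, 200          # n = 200 is an assumed, finer grid
ys = grid(n, b, delta)
line = [y / b for y in ys]                   # straight-line initial guess
cycloid = [cycloid_x(y, b) for y in ys]      # exact solution on the grid
T_line = travel_time(line, b, delta)
T_cycloid = travel_time(cycloid, b, delta)
T_exact = (math.pi / 2.0) * math.sqrt(b / 9.8)   # theta_M * sqrt(A/g)
```

A derivative-free minimizer such as Nelder-Mead would be applied to `travel_time` as a function of `x[0], ..., x[n-1]`, with `x[n] = 1` held fixed; the comparison of `T_line` and `T_cycloid` simply confirms that the cycloid gives the smaller discrete travel time.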

There are other methods for computing the solution and one is to use a piecewise linear approximation for the path. What’s involved is outlined in Exercise 9.36. A somewhat different approach using a piecewise linear approximation, which goes back to Galileo, is worked out in Mungan and Lipscombe [2017].

9.7.3 Parting Comments

The problem of finding a function which minimizes an integral looks to be straightforward, in the sense that it involves using numerical methods we have derived earlier. The fact is, however, that these problems are more finicky than what we have encountered elsewhere. This is one of the reasons for the symmetric differences used in the numerical procedure. To explain, both examples involve an integrand that includes a term of the form $(f_x)^2$. Without thinking about it too hard, one might try using a second-order difference of the form

$$\left(\frac{df}{dx}\right)^2 \approx \left(\frac{f_{i+1} - f_{i-1}}{2h}\right)^2.$$

If you write this out for each grid point, you will notice that the values of $f$ at the even nodes do not depend on the values of $f$ at the odd nodes. This is a well-known problem in numerical fluid dynamics, but the solutions used for fluids are not directly applicable to the integrals considered here. This leads to dropping the idea of using a second-order approximation and reverting to just first-order differences such as

$$\frac{df}{dx} \approx \frac{f_{i+1} - f_i}{h} \quad \text{or} \quad \frac{df}{dx} \approx \frac{f_i - f_{i-1}}{h}.$$

In minimization problems, the boundary conditions play a critical role in the formula for the quantity being minimized as well as determining the function which is the minimizer. The forward and backward differences given above have a biased direction implicit in their formulation, and this causes problems when interacting with the boundary conditions. This is the reason for looking for symmetric formulas, and examples are

$$\left(\frac{df}{dx}\right)^2 \approx \left|\frac{f_{i+1} - f_i}{h}\right| \left|\frac{f_i - f_{i-1}}{h}\right|,$$

and

$$\left(\frac{df}{dx}\right)^2 \approx \frac{1}{2}\left[\left(\frac{f_{i+1} - f_i}{h}\right)^2 + \left(\frac{f_i - f_{i-1}}{h}\right)^2\right].$$

The second has the distinct advantage of being a smooth function of the fi ’s, and for this reason was used in the formulation. There have been numerous studies related to deriving numerical solutions that minimize integrals, and most discuss the difficulties encountered. For those interested, you might consult Dussault [2014], Levin et al. [2002], and Jameson and Vassberg [2001]. As a final comment, for the string example the calculus method was used to find the minimum but this was not done for the brachistochrone problem. The reason is that (9.54) is a quadratic function of the unknown ui ’s, so setting the derivatives to zero results in a linear problem. For the brachistochrone problem, T is a non-quadratic function of the unknown xi ’s, so the calculus method results in a system of nonlinear equations. Because of the difficulty of solving such a system, it is easier to just find the minimum directly, which was done using the Nelder-Mead algorithm.
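The even-odd decoupling described in these comments can be seen with a small experiment. The sketch below uses a sawtooth grid function $f_i = (-1)^i$, which is chosen here purely for illustration, and the helper names `centered_sq` and `symmetric_sq` are hypothetical. The second-order centered difference is blind to the oscillation, while the averaged symmetric formula penalizes it.

```python
def centered_sq(f, i, h):
    """Square of the second-order centered difference at node i."""
    return ((f[i + 1] - f[i - 1]) / (2.0 * h)) ** 2


def symmetric_sq(f, i, h):
    """Average of the squared forward and backward differences at node i."""
    forward = (f[i + 1] - f[i]) / h
    backward = (f[i] - f[i - 1]) / h
    return 0.5 * (forward ** 2 + backward ** 2)


h = 0.1
sawtooth = [(-1.0) ** i for i in range(11)]   # f_i = +1, -1, +1, ...
interior = range(1, 10)

# Each centered difference skips node i, so it sees f_{i+1} = f_{i-1}.
centered = [centered_sq(sawtooth, i, h) for i in interior]

# The symmetric average involves f_i itself and registers the jumps of size 2.
symmetric = [symmetric_sq(sawtooth, i, h) for i in interior]
```

Every entry of `centered` is zero, so an integrand built from it would treat this highly oscillatory function as if it were constant, whereas every entry of `symmetric` equals $(2/h)^2$.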

Exercises

Section 9.1

9.1. Rewrite the following as a minimization problem (you do not need to solve the problem).
(a) Solve: $(x - 2)^4 + (x - 2y)^2 = 5$, $x^2 - y = 0$.
(b) Solve: $x + y^2 + e^{x+y} = 3$, $2y + 3x = \dfrac{x + y}{x^2 + y^2 + 1}$.
(c) Find the distance between the surface $z = x^2 + 5y^4$ and the point $(1, -1, -2)$.
(d) The positions of two objects moving in the $x,y$-plane are: $(x_1, y_1) = (t + \sin(3t),\, t + 3\cos(3t))$, and $(x_2, y_2) = (4\sin t,\, t^2 - 3)$. How close do they come to bumping into each other?
(e) Two points $p$ and $q$ are on a surface $z = f(x, y)$. Find a curve on this surface which connects these two points, and which has the smallest length.
(f) Consider a region $0 \le x \le a$, $0 \le y \le b$. Three heaters are going to be placed in this region, at points $\mathbf{h}_1$, $\mathbf{h}_2$, and $\mathbf{h}_3$. The resulting temperature $T$ at any point $\mathbf{x}$ in the region is $T(\mathbf{x}) = T_1 F(\|\mathbf{x} - \mathbf{h}_1\|_2) + T_2 F(\|\mathbf{x} - \mathbf{h}_2\|_2) + T_3 F(\|\mathbf{x} - \mathbf{h}_3\|_2)$, where $F(z) = 1/(1 + z^2)$. Where should the heaters be placed in the region so the coldest point in the room is as hot as possible?

Section 9.2

9.2. In this problem take $F(x, y, z) = x^2 + (y + 1)^4 + 3z^2$.
(a) What is the direction of steepest descent at $(1, -2, 1)$?
(b) Which of the following, if any, is a direction of descent at $(1, -2, 1)$: (i) $(1, 0, 0)^T$, (ii) $(0, 1, 0)^T$, (iii) $(0, 0, 2)^T$?

9.3. This exercise considers using a descent method in $\mathbb{R}^2$.
(a) Assuming that $\nabla F(\mathbf{x}_{k+1}) \ne \mathbf{0}$, use the fact that $\alpha_k$ minimizes (9.4) to show that $\mathbf{d}_k$ is tangent to the level curve on which $\mathbf{x}_{k+1}$ is located. This is illustrated in Figure 9.2, which shows that the line from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ is tangent to the level curve containing $\mathbf{x}_{k+1}$.
(b) The location of $\mathbf{x}_1$ is shown in Figure 9.23. If $\mathbf{d}_1 = (1/2, 0)^T$, then use part (a) to locate $\mathbf{x}_2$.
(c) Redo part (b), but take $\mathbf{d}_1 = (0, 1/2)^T$.

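Figure 9.23 Contour plot for Exercise 9.3. The ∗ designates the location of the minimum. (Axes: x-axis and y-axis, each from -1 to 1.)

Section 9.3

9.4. Consider the equation

$$\begin{pmatrix} 2 & -1 \\ -1 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 7 \\ 0 \end{pmatrix}.$$

(a) What is the quadratic form associated with this equation? Write it out as a polynomial.
(b) Using the SDM, and taking $\mathbf{x}_1 = (1, 1)^T$, calculate $\mathbf{x}_2$.
(c) Using the CGM, and taking $\mathbf{x}_1 = (1, 1)^T$, calculate $\mathbf{x}_2$ and $\mathbf{x}_3$.

9.5. Consider the equation

$$\begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

(a) What is the quadratic form associated with this equation? Write it out as a polynomial.
(b) Using the SDM, and taking $\mathbf{x}_1 = (-1, 2)^T$, calculate $\mathbf{x}_2$.
(c) Using the CGM, and taking $\mathbf{x}_1 = (-1, 2)^T$, calculate $\mathbf{x}_2$ and $\mathbf{x}_3$.

9.6. Use the CGM to solve, by hand, $\mathbf{A}\mathbf{x} = \mathbf{b}$ or explain why it cannot, or should not, be used. Assume that $\mathbf{b} = (0, 0, 1)^T$ and $\mathbf{x}_1 = (0, 0, 0)^T$.

a) $\mathbf{A} = \begin{pmatrix} 10 & 1 & 1 \\ -1 & 1 & 1 \\ 2 & 1 & 1 \end{pmatrix}$

b) $\mathbf{A} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & -2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$

c) $\mathbf{A} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix}$

d) $\mathbf{A} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix}$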

9.7. Consider the function $F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b} \cdot \mathbf{x}$, where $\mathbf{A}$ is a symmetric and positive definite $2 \times 2$ matrix. The contour plot for $F$ is shown below in Figure 9.24. Make sure to clearly explain how you arrive at your answers.
(a) Shown in the figure is the starting point $\mathbf{x}_1$ and the point $\mathbf{x}_2$ calculated using the SDM. Determine where the next point $\mathbf{x}_3$ is located.
(b) If one uses the same starting point $\mathbf{x}_1$ for the CGM, where is $\mathbf{x}_2$? Where is $\mathbf{x}_3$?
(c) Locate a starting point $\mathbf{x}_1$ so $\mathbf{x}_2$ computed by the SDM is exactly at the minimum. The point $\mathbf{x}_1$ must be located on one of the four sides of the contour plot.
(d) Estimate $\kappa_2(\mathbf{A})$ from Figure 9.24.

Figure 9.24 Contour plot for Exercise 9.7. The ∗ designates the location of the minimum.

9.8. In $\mathbb{R}^3$, the residuals found when using the CGM are $\mathbf{r}_1 = (-1, 1, 0)^T$, $\mathbf{r}_2 = (a, 2, b)^T$, $\mathbf{r}_3 = (0, 0, 1)^T$, and $\mathbf{r}_4$.
(a) Find $a$ and $b$. Make sure to state what property of the CGM you are using to find their values.
(b) What is $\mathbf{r}_4$?

9.9. The starting point $\mathbf{x}_1$ for the SDM is shown in Figure 9.25. Locate on the graph where $\mathbf{x}_2$ and $\mathbf{x}_3$ are located. Also, provide an explanation of how you are able to determine their locations.

Figure 9.25 Contour plot for Exercise 9.9. The ∗ designates the location of the minimum.

9.10. This question considers if the SDM can ever do better than the CGM when solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is symmetric and positive definite. Assume exact arithmetic is used in all cases.
a) In $\mathbb{R}^2$, explain why no matter what the starting point is, the SDM cannot find the solution in fewer steps than the CGM.
b) In $\mathbb{R}^3$, is it possible for the SDM to find the solution in fewer steps than the CGM? If not, explain why, and if so then explain how this might occur.

9.11. A matrix $\mathbf{A}$ is negative definite if $-\mathbf{A}$ is positive definite. Explain why the algorithm in Table 9.1 can be used to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ if $\mathbf{A}$ is an $n \times n$ symmetric negative definite matrix.

9.12. In calculus direction vectors are usually required to have unit length, such as when defining a directional derivative. Suppose this is imposed on the descent algorithm. So, $\mathbf{d}_k$ in (9.17) and (9.18) would be replaced with $\mathbf{d}_k / a_k$ where $a_k = \|\mathbf{d}_k\|$. What effect does this have on the resulting value for $\mathbf{x}_{k+1}$? Does it make any difference what vector norm is used?

9.13. Taking $n = 4$, find the row compressed version of: a) $\mathbf{A}_1$ in (9.24), b) $\mathbf{A}_2$ in (9.24), and c) $\mathbf{A}_3$ in (9.25).

9.14. This problem considers solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is an $n \times n$ symmetric matrix that contains zeros except $a_{ii} = 2i$, $a_{i,i+1} = i$, and $a_{i,i-1} = i - 1$. Also, the exact solution is $\mathbf{x} = (1, 1, \cdots, 1)^T$, and so determine $\mathbf{b}$ using the formula $\mathbf{b} = \mathbf{A}\mathbf{x}$.
(a) What are $\mathbf{A}$ and $\mathbf{b}$ when $n = 5$?
(b) Explain why, for any $n$, $\mathbf{A}$ is positive definite.


(c) Using the SDM to solve the equation, plot the error, iterative error, and the residual error as a function of the iteration number (as is done in Figure 9.8). In doing this, take $n = 1000$, $tol = 10^{-8}$, and $\mathbf{x}_1 = (2, -2, 2, -2, \cdots)^T$.
(d) Using the CGM to solve the equation, plot the error, iterative error, and the residual error as a function of the iteration number (as is done in Figure 9.8). In doing this, take $n = 1000$, $tol = 10^{-8}$, and $\mathbf{x}_1 = (2, -2, 2, -2, \cdots)^T$.
(e) How sensitive is the CGM to the starting vector? Answer this by redoing part (d) but taking $\mathbf{x}_1 = \mathbf{0}$, $\mathbf{x}_1 = (2, 2, 2, 2, \cdots)^T$, $\mathbf{x}_1 = (1, 2, 3, 4, \cdots)^T$, and $\mathbf{x}_1$ chosen randomly (if you are using MATLAB, take x1 = rand(n,1)).
(f) How much faster is the CGM if sparsity is used? Answer this by comparing the sparse versus the non-sparse computing times when $n = 1000, 2000, 3000, 4000$. In doing this take $\mathbf{x}_1 = (2, -2, 2, -2, \cdots)^T$ and $tol = 10^{-8}$.

9.15. Do Exercise 9.14, but take $\mathbf{A}$ to be the $n \times n$ symmetric matrix that contains zeros except $a_{ii} = 2\ln(i + 1)$, $a_{i,i+2} = \ln(i + 1)$, and $a_{i,i-2} = \ln(i - 1)$.

9.16. This problem considers solving $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is the $n \times n$ symmetric matrix with $a_{ii} = \frac{1}{2}n(n + 1) + i(n - 2) + 1$ and $a_{ij} = i + j$ for $i \ne j$. Also, the exact solution is $\mathbf{x}_e = (1, 1, \cdots, 1)^T$, and so determine $\mathbf{b}$ using the formula $\mathbf{b} = \mathbf{A}\mathbf{x}_e$. In this problem take $n = 1000$, $tol = 10^{-6}$, and $\mathbf{x}_1 = (2, -2, 2, -2, \cdots)^T$.
(a) Write out the matrix in the case of when $n = 2$, $n = 3$, and $n = 4$, and explain why they are all positive definite. It is possible to prove that $\mathbf{A}$ is positive definite for all values of $n$ (you do not need to show this).
(b) Using the SDM to solve the equation, plot the relative error $E_k / \|\mathbf{x}_e\|_\infty$, the relative iterative error $I_k / \|\mathbf{x}_k\|_\infty$, and the relative residual $R_k / \|\mathbf{b}\|_\infty$ as a function of the iteration number.
(c) Redo part (b) but use the CGM.

Sections 9.3.5 and 9.3.6

9.17.
(a) Find a symmetric positive definite $2 \times 2$ matrix, that has no zero entries, for which the CGM converges in one step. This means, irrespective of what is chosen for $\mathbf{b}$ or for $\mathbf{x}_1$, that $\mathbf{x}_2$ is the exact solution.
(b) Write out the steps of the CGM to demonstrate the result. For this, take $\mathbf{b} = (1, 5)^T$ and $\mathbf{x}_1 = \mathbf{0}$.

9.18. Assume that $\mathbf{A}$ is a given $4 \times 4$ diagonal matrix and $\mathbf{x}_1$ is chosen randomly. Suppose every time the CGM is run that it converges in exactly three steps. Give an example of an $\mathbf{A}$ that will do this and explain why it works.


9.19. Explain what happens to the PCGM if one picks $\mathbf{M} = \mathbf{I}$.

9.20. Suppose the preconditioner given in (9.29) is used with the matrix $\mathbf{A}_1$ in (9.24). How many steps does the PCGM need in this case to solve the problem?

9.21. Find the preconditioner given in (9.29) for: a) $\mathbf{A}_2$ in (9.24), and b) $\mathbf{A}_3$ in (9.25).

Exercise 9.22 requires a sparse, symmetric, positive definite matrix $\mathbf{A}$ with $n \ge 4000$. One option is to download $\mathbf{A}$ from the SuiteSparse Matrix Collection. If you do this, use the Filter to determine the symmetric and positive definite matrices in the collection.

9.22. You are to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ using sparse CGM and sparse PCGM. The exact solution should be $\mathbf{x} = (1, 1, \cdots, 1)^T$, and so you should take $\mathbf{b} = \mathbf{A}\mathbf{x}$. The preconditioner should be (9.28). In coding the PCGM it is strongly recommended that you solve $\mathbf{M}\mathbf{z} = \mathbf{r}$ using $\mathbf{L}\mathbf{y} = \mathbf{r}$ and $\mathbf{U}\mathbf{z} = \mathbf{y}$.
(a) If you downloaded $\mathbf{A}$ from a collection of matrices, what is the name of the matrix you picked, the application it comes from, the total number of nonzero entries, and the value of $n$?
(b) Plot the sparsity pattern for your matrix. In MATLAB this is obtained using the command spy(A).
(c) For both sparse CGM and sparse PCGM, plot the error and the iterative error as a function of the iteration number. The stopping condition for each should be $\|\mathbf{x}_k - \mathbf{x}_{k-1}\|_\infty \le 10^{-10}$.

Section 9.4

9.23. Consider the problem of minimizing the following function $f(x, y) = 2x^2 + 2xy + y^2 - x - 2y$.
(a) If $\mathbf{x}_1 = (0, 0)$ then find $\mathbf{x}_2$ and $\mathbf{x}_3$ using the steepest descent method.
(b) If $\mathbf{x}_1 = (0, 0)$ then find $\mathbf{x}_2$ and $\mathbf{x}_3$ using the Polak-Ribière method.

9.24. The SDM is used to find the minimum of a function $F(\mathbf{x})$, and the contours for this function are shown in Figure 9.26.
(a) Shown in the plot is the second point $\mathbf{x}_2$ calculated using the SDM. If the first point $\mathbf{x}_1$ is on the red contour, where is $\mathbf{x}_1$ located? Make sure to explain how you arrive at your answer. Also, you only need to find one location for $\mathbf{x}_1$.
(b) Where should $\mathbf{x}_1$ be located on the red contour so that $\mathbf{x}_2$ is the exact solution?

Figure 9.26 Contour plot for Exercise 9.24. The ∗ designates the location of the minimum. (Axes: x-axis and y-axis, each from -4 to 4.)

9.25. Consider the function

$$F(x, y) = -x^4 + \frac{1}{6}x^6 - 10xy + y + y^4.$$

(a) Plot the surface and contour curves for this function for $-4 \le x \le 4$ and $-4 \le y \le 4$.
(b) Taking $\mathbf{x}_1 = (0, 0)$, use the SDM to find the minimum point for this function. Your answer should be correct to six significant digits. Make sure to state what stopping condition you used, and the number of iterations needed.
(c) Redo (b) but use the Polak-Ribière method.

9.26. Consider the function

$$F(x, y) = \frac{1}{10}(x + y)^4 + (x - 1)^2 + 4y^2.$$

(a) Plot the surface and contour curves for this function for $-2 \le x \le 2$ and $-2 \le y \le 2$.
(b) Taking $\mathbf{x}_1 = (0, 0)$, use the SDM to find the minimum point for this function. Your answer should be correct to six significant digits. Make sure to state what stopping condition you used, the initial point, and the number of iterations needed.
(c) Redo (b) but use the Polak-Ribière method.


9.27. For the conjugate gradient method, $\mathbf{d}_k = -\mathbf{g}_k + \beta_{k-1}\mathbf{d}_{k-1}$, where $\beta_0 = 0$. Assuming the line search problem for $\alpha_{k-1}$ is solved exactly, show that $\mathbf{d}_k$ is a direction of descent at $\mathbf{x}_k$ no matter what value is used for $\beta_{k-1}$. Assume that the $\mathbf{g}_k$'s are nonzero.

9.28. This problem explores using a descent method to solve eigenvalue problems. Assume that $\mathbf{A}$ is an $n \times n$ symmetric matrix with eigenvalues $\lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_1$. The associated Rayleigh quotient is

$$R(\mathbf{x}) = \frac{\mathbf{x}^T\mathbf{A}\mathbf{x}}{\mathbf{x}^T\mathbf{x}}.$$

It's possible to prove that $\lambda_n = \min R(\mathbf{x})$, where the minimum is over all nonzero $\mathbf{x} \in \mathbb{R}^n$. Moreover, if $\mathbf{x}$ minimizes $R(\mathbf{x})$, then it is an eigenvector for $\lambda_n$.
(a) Show that $\nabla R = 2\mathbf{h}/(\mathbf{x}^T\mathbf{x})$, where $\mathbf{h} = \mathbf{A}\mathbf{x} - R\mathbf{x}$. Note that (8.16) is helpful here.
(b) Use part (a) to show that if $\mathbf{x} \ne \mathbf{0}$ is a local minimizer of $R(\mathbf{x})$ then it is an eigenvector for $\mathbf{A}$ and $R(\mathbf{x})$ is the associated eigenvalue.
(c) Explain how it is also possible to compute $\lambda_1$ using the Rayleigh quotient.
(d) Let $n = 100$, and assume that the entries of $\mathbf{A}$ are $a_{ij} = i + j$. Use the Polak-Ribière method to compute $\lambda_n$, with a relative iterative error of no more than $10^{-6}$.
(e) Redo part (d) but compute $\lambda_1$.

Section 9.5

9.29. The equation $f(x) = 0$ can be solved using the LMM by taking $F(x) = [f(x)]^2$. The purpose of this exercise is to compare the two shifting methods considered in the derivation of the LMM, and how they compare with Newton's method as given in (2.7).
(a) Show that $J_k = f'_k$, $g_k = 2f_k f'_k$, and $G_k = 2(f'_k)^2$.
(b) Show that taking $S_k = G_k + \mu_k I$ results in

$$x_{k+1} = x_k - \frac{2f'(x_k)}{\mu_k + 2f'(x_k)^2}\, f(x_k).$$

(c) Show that taking $S_k = G_k + \mu_k D_k$ results in

$$x_{k+1} = x_k - \frac{1}{(1 + \mu_k) f'(x_k)}\, f(x_k).$$

(d) Assuming $x_k \to \bar{x}$, by taking the limit of the difference equations in parts (b) and (c), is it guaranteed that $f(\bar{x}) = 0$?

9.30. In this problem assume that $\mu_k > 0$, and $\mathbf{g}_k \ne \mathbf{0}$.


(a) Explain why the eigenvalues of $\mathbf{G}_k + \mu_k\mathbf{I}$ are positive. Consequently, $\mathbf{G}_k + \mu_k\mathbf{I}$ is invertible. In fact, since $\mathbf{G}_k + \mu_k\mathbf{I}$ is symmetric then it is positive definite.
(b) Use part (a) to show that $\mathbf{d}_k = -\left(\mathbf{G}_k + \mu_k\mathbf{I}\right)^{-1}\mathbf{g}_k$ is a direction of descent at $\mathbf{x}_k$.
(c) Using Weyl's inequality it is possible to show that $\mathbf{G}_k + \mu_k\mathbf{D}_k$ is positive definite if $\mathbf{D}_k$ is positive definite. What must be assumed about $f(\mathbf{x}_k)$ and $g(\mathbf{x}_k)$ so that $\mathbf{D}_k$ is positive definite?
(d) For the LMM, explain why $-\mathbf{S}_k^{-1}\mathbf{g}_k$ is a direction of descent at $\mathbf{x}_k$ if $\mathbf{D}_k$ is positive definite.

9.31. Consider the three nonlinear equations $x^2 + y^2 = 4$, $e^x + y = 1$, $x^3 + xy + y^3 = 1$. Given that there are three equations, and two unknowns, this is an example of an overdetermined system.
(a) Using the same approach as in Section 8.3.7, show how to rewrite the problem as minimizing a function of the form in (9.38). What are $F(x, y, z)$ and $\mathbf{J}$?
(b) Use the LMM to find the minimizer using the starting value $\mathbf{x}_1 = (1, 1)^T$. What is the resulting value of $F$?
(c) As in Figure 9.17, plot the steps that you obtained in part (b).
(d) Redo parts (b) and (c) using the starting values $\mathbf{x}_1 = (1, -1)^T$, and $\mathbf{x}_1 = (0, -1)^T$.
(e) Where is the minimizer located? Make sure to provide justification for your answer.

9.32. The objective of this problem is to use the LMM to fit the Michaelis-Menten model function $g(t) = v_1 t/(v_2 + t)$. The data for the $[t_i, g_i]$ values are: $[0.1, 0.03]$, $[0.8, 2.4]$, $[1.2, 2.8]$, $[1.7, 2.7]$, $[2, 3]$.
(a) What are $F(x, y)$ and $\mathbf{J}$?
(b) Plot the surface and contour curves for $F(x, y)$, for $0 \le x \le 5$ and $0 \le y \le 1$.
(c) Use the LMM to find $v_1$ and $v_2$ using the starting value $\mathbf{x}_1 = (1, 0)^T$. What is the resulting value of $F$?
(d) As in Figure 9.17, plot the steps that you obtained in part (c).
(e) Redo parts (c) and (d) using the starting values $\mathbf{x}_1 = (0.5, 1)^T$, and $\mathbf{x}_1 = (1, 1)^T$.
(f) What values of $v_1$ and $v_2$ minimize $F$? Make sure to provide justification for your answer.


9.33. The objective of this problem is to use the LMM to fit the log-modified model function $g(t) = (v_1 + v_2 t)^{v_3}$. The data for the $[t_i, g_i]$ values are: $[0.1, 0.05]$, $[0.4, 0.3]$, $[1, 2]$, $[1.3, 3]$, $[1.7, 5.4]$.
(a) What are $F(x, y, z)$ and $\mathbf{J}$?
(b) Use the LMM to find $v_1$, $v_2$, and $v_3$ using the starting value $\mathbf{x}_1 = (0.2, 1.2, 2.1)^T$. What is the resulting value of $F$?
(c) Redo part (b) using the starting values $\mathbf{x}_1 = (0, 1, 2)^T$, and $\mathbf{x}_1 = (0, 0.2, 3)^T$.
(d) What values of $v_1$ and $v_2$ minimize $F$? Make sure to provide justification for your answer.

9.34. The position $\mathbf{x} = (x, y)^T$ of an object is going to be determined using its distance from $n$ known locations. To explain, if the $i$th location is $\mathbf{a}_i$, and the distance from $\mathbf{x}$ to $\mathbf{a}_i$ is $\rho_i$, then the object's position is determined by minimizing

$$F(\mathbf{x}) = \sum_{i=1}^{n} f_i^2,$$

where $f_i = \|\mathbf{x} - \mathbf{a}_i\|_2^2 - \rho_i^2$. The data for the $[\mathbf{a}_i, \rho_i]$ values are: $[(2, 2)^T, 0.5]$, $[(2, 2.25)^T, 4]$, $[(2, 2.5)^T, 1]$, $[(3, 3)^T, 2]$, $[(2.1, 2.5)^T, 2.5]$.
(a) Plot the surface and contour curves for $F(\mathbf{x})$ for $0 \le x \le 6$ and $0 \le y \le 5$.
(b) Use the LMM to find the position of the object using the starting value $\mathbf{x}_1 = (1, 1)^T$. What is the resulting value of $F$?
(c) In a similar fashion as in Figure 9.17, plot the steps that you obtained in part (b).
(d) Redo parts (b) and (c) using the starting values $\mathbf{x}_1 = (1, 4)^T$, and $\mathbf{x}_1 = (3, 3)^T$.
(e) Where is the object located? Make sure to provide justification for your answer.

Section 9.6

9.35. This problem considers the various steps used in the Nelder-Mead method.
(a) For the top contour plot in Figure 9.27, explain which step from Table 9.4 was used to determine each triangle.
(b) For the bottom contour plot in Figure 9.27, explain which step from Table 9.4 was used to determine each triangle.

Section 9.7

9.36. This exercise involves a brachistochrone problem where the path is piecewise linear. Specifically, it is a two-segment path as shown in Figure 9.28 (left). The objective is to determine the location of $\mathbf{x}_1$ so the total travel time $T = t_1 + t_2$ is a minimum. Here $t_1$ and $t_2$ are the travel times for the first and second segment, respectively.

Figure 9.27 Plots used in Exercise 9.35.

(a) This problem concerns the single segment shown in Figure 9.28 (right). If $s(t)$ is the distance the particle has traveled along the segment then, from Newton's second law, $s'' = g\cos\theta$. The initial conditions are $s(0) = 0$ and $s'(0) = v_a$. Taking $\mathbf{x}_a = (x_a, y_a)^T$ and $\mathbf{x}_b = (x_b, y_b)^T$, show that the particle reaches $\mathbf{x}_b$ when

$$t = \frac{1}{g\cos\theta}\left(-v_a + \sqrt{v_a^2 + 2g(y_a - y_b)}\right),$$

where $\cos\theta = (y_a - y_b)/\|\mathbf{x}_a - \mathbf{x}_b\|_2$. In addition, show that the velocity $v$ of the particle when it reaches $\mathbf{x}_b$ is $v = \sqrt{v_a^2 + 2g(y_a - y_b)}$.
(b) For the two-segment path, it is assumed that the kinetic energy $K = \frac{1}{2}v^2$ of the particle is continuous. In particular, it is continuous at $\mathbf{x}_1$. Use this to show that if the particle starts the first segment with velocity $v_0$, then it starts the second segment with velocity $v_1 = \sqrt{v_0^2 + 2g(y_0 - y_1)}$.
(c) The formulas from parts (a) and (b) can be used to determine both $t_1$ and $t_2$ for the two segment path shown in Figure 9.28. It is assumed that $\mathbf{x}_0 = (0, y_0)^T$ and $\mathbf{x}_2 = (x_2, 0)^T$ are known, as is $g$. So, the total travel time $T = t_1 + t_2$ depends only on $x_1$ and $y_1$, where $\mathbf{x}_1 = (x_1, y_1)^T$.

Figure 9.28 Left: Two-segment brachistochrone problem. Right: Generic single segment used to derive the formula for the travel time.


Taking $y_0 = 3$, $x_2 = 4$, and $g = 9.8$, compute the $x_1$ and $y_1$ that minimize $T$. Make sure to identify the method used to locate the minimizer, the starting values used for the method, and the error tolerance used to stop the procedure. Also, is there any truth to the rumor that $T$ has a minimum value when $t_1 = t_2$?

9.37. This exercise considers the problem of finding the axial displacement $u(x)$ of an elastic bar, which is subject to a body force $w(x)$, as well as a constant force $F_r$ on the right end $x = 1$. At the left end, where $x = 0$, the bar is fixed, which means that $u(0) = 0$. The potential energy is

$$V = \int_0^1 \left(\frac{1}{2} D u_x^2 + w(x)u\right) dx - F_r u(1),$$

where $D$ is a positive constant.
(a) Find a function $F(x)$ so that $V = \int_0^1 F(x)\,dx$.
(b) Use the composite trapezoidal rule to obtain an approximation of the integral in part (a). In doing this, assume that the grid points are $x_0 = 0$, $x_1 = h$, $x_2 = 2h$, $\cdots$, $x_{n+1} = 1$, where $h = 1/(n + 1)$.
(c) As with the string example, use a centered second-order approximation for $u_x$ at $x_1, x_2, \cdots, x_n$, and a first-order approximation for $u_x$ at $x_0$ and $x_{n+1}$. Using these with the result from part (b), what is the resulting approximation for $V$? Note that like the string example, $u_0 = 0$, but unlike the string example, $u_{n+1}$ is an unknown in this problem.
(d) What is the resulting matrix equation that must be solved to find the minimum for the approximation for $V$ found in part (c)?
(e) Solve the matrix equation in part (d) in the case of when $n = 30$, $D = 1$, $F_r = 1$, and $w = x^4$. The exact solution in this case is $u = x(x^5 + 24)/30$. On the same axes, plot the exact solution and the numerical solution.
(f) The exact solution of the problem satisfies the boundary condition $Du'(1) = F_r$. So, the question is, does your numerical solution satisfy this condition? Using the first-order approximation you used in part (c), from the numerical solution calculate $u_x(1)$, using $n = 20, 40, 80, 160$. Are your answers consistent with the statement that the solution satisfies the above condition?

9.38. This exercise considers the problem of finding the transverse displacement $u(x)$ of an elastic beam, which is subject to a body force $w(x)$. The potential energy is

$$V = \int_0^1 \left(\frac{1}{2} EI\, u_{xx}^2 - w(x)u\right) dx,$$

where $EI$ is a positive constant. It is assumed that the beam has what are called simply supported ends. One consequence of this is that $u(0) = 0$ and $u(1) = 0$. The role of the boundary conditions will be discussed in more detail in part (e).


(a) Suppose the grid points are $x_0 = 0$, $x_1 = h$, $x_2 = 2h$, $\cdots$, $x_{n+1} = 1$, where $h = 1/(n + 1)$. Writing $V = \int_0^1 F(x)\,dx$, write down the composite trapezoidal approximation for $V$.
(b) Use a centered second-order approximation for $u_{xx}$ at $x_1, x_2, \cdots, x_n$. You can use a first-order approximation for $u_{xx}$ at $x_0$ and $x_{n+1}$. Using these with the result from part (a), what is the resulting approximation for $V$? In doing this, remember that $u_0 = u_{n+1} = 0$.
(c) What is the resulting matrix equation that must be solved to find the minimum for the approximation for $V$ found in part (b)?
(d) If $EI = 1$ and $w = x^4$, then the exact solution is

$$u = \frac{1}{1680}x^8 - \frac{1}{180}x^3 + \frac{5}{1008}x.$$

On the same axes, plot the exact solution and the numerical solution from part (c) when $n = 20$.
(e) A simply supported beam is required to satisfy the four boundary conditions: $u(0) = 0$, $u(1) = 0$, $u_{xx}(0) = 0$, and $u_{xx}(1) = 0$. The numerical solution was derived without any mention of the last two conditions. The reason is that they are natural boundary conditions, which means that if $u$ minimizes $V$, then it will automatically satisfy $u_{xx}(0) = 0$ and $u_{xx}(1) = 0$. In comparison, $u(0) = 0$ and $u(1) = 0$ are essential boundary conditions, which means that we must explicitly require this of our approximation (see part (b)). So, the question is, does your numerical solution satisfy (approximately) $u_{xx}(0) = 0$ and $u_{xx}(1) = 0$? Using the first-order approximations you used in part (b), from the numerical solution calculate $u_{xx}(0)$ and $u_{xx}(1)$, using $n = 20, 40, 80, 160$. Are your answers consistent with the statement that the minimizer satisfies $u_{xx}(0) = 0$ and $u_{xx}(1) = 0$?

9.39. This exercise considers computing the curve that produces a surface of revolution with minimum area. Given two points $(a, A)$ and $(b, B)$ in the plane, with $a < b$ and $A$ and $B$ positive, assume that $y(x)$ is a smooth, positive, function connecting these points. If $y(x)$ is rotated about the $x$-axis, the resulting surface has area

$$S = \int_a^b 2\pi y \sqrt{1 + (y')^2}\, dx.$$

It should be noted that the curve is required to satisfy $y(a) = A$ and $y(b) = B$. Also, unlike the brachistochrone problem, $y(x)$ does not have a singularity (at either end).
(a) Suppose the grid points are $x_0 = 0$, $x_1 = h$, $x_2 = 2h$, $\cdots$, $x_{n+1} = 1$, where $h = 1/(n + 1)$. Writing $S = \int_0^1 F(x)\,dx$, write down the composite trapezoidal approximation for $S$.


(b) Use a centered second-order approximation for $y'$ at $x_1, x_2, \cdots, x_n$, and a first-order approximation for $y'$ at $x_0$ and $x_{n+1}$. Using these with the result from part (a), what is the resulting approximation for $S$?
(c) The minimum of $S$ from part (b) can be found using a descent method. This requires a starting guess for $\mathbf{y} = (y_1, y_2, \cdots, y_n)^T$. One possibility is to approximate $y(x)$ with a linear function that satisfies $y(a) = A$ and $y(b) = B$. What is the resulting $\mathbf{y}$?
(d) Taking $a = 0$, $A = 1$, $b = 1$, and $B = 3$, plot the numerical solution for $y(x)$ for $n = 4$, $n = 9$, and $n = 19$. Make sure to explain how you computed the solution.
(e) In calculus it is shown that the curve producing the minimum area is $y = (1/\alpha)\cosh(\alpha x + \beta)$, where $\alpha$ and $\beta$ are determined from the requirements that $y(a) = A$ and $y(b) = B$. Determine the nonlinear equations that must be solved to find $\alpha$ and $\beta$. Discuss the numerical difficulties associated with solving these equations, compared to the numerical solution obtained in parts (a)-(c).

Note: The minimal surface problem has some interesting mathematical complications, and for more about this see Oprea [2007].

Additional Questions

9.40. This exercise examines the flop count of the CGM compared to the Cholesky factorization method. In what follows assume $n$ is large.
(a) Show that after determining $\mathbf{x}_{k-1}$, it takes approximately $2n^2$ flops to compute $\mathbf{x}_k$. Based on this, how many steps does it take to equal (approximately) the flops required for a Cholesky factorization solution?
(b) Suppose that $\mathbf{A}$ is sparse and has approximately $m$ nonzero entries in each row. Show that computing $\mathbf{x}_k$ requires about $2n(m + 8)$ flops.
(c) Based on the result from part (b), and assuming that the CGM reaches its theoretical limit for the number of steps needed to find the solution, how sparse must $\mathbf{A}$ be so the sparse CGM takes fewer flops than when using a Cholesky factorization?

9.41. This exercise provides an explanation of how (9.28) is obtained. The first step is to factor out the largest entry, in absolute value, in $\mathbf{L}$, and write it as $\mathbf{L} = \mu\mathbf{L}_1$, where the entries of $\mathbf{L}_1$ satisfy $|\ell_{ij}| \le 1$.
(a) Write down a symmetric $3 \times 3$ matrix $\mathbf{A}$ that contains no zeros or ones, and determine $\mathbf{D}$, $\mathbf{L}_1$, and $\mu$.
In (b)-(d), $\mathbf{A}$ is a general $n \times n$ symmetric and positive definite matrix, and has the Cholesky factorization $\mathbf{A} = \mathbf{U}^T\mathbf{U}$.
(b) The matrix $\mathbf{U}$ is a function of $\mu$, and assuming Taylor's theorem can be used, then $\mathbf{U} = \mathbf{U}_0 + \mu\mathbf{U}_1 + \mu^2\mathbf{U}_2 + \cdots$. Substitute this series, as well as $\mathbf{L} = \mu\mathbf{L}_1$, into the equation $\mathbf{L} + \mathbf{D} + \mathbf{L}^T = \mathbf{U}^T\mathbf{U}$. Then, by equating like powers of $\mu$, show that this leads to the conclusions that $\mathbf{U}_0 = \mathbf{D}^{1/2}$ and $\mathbf{U}_1 = \mathbf{D}^{-1/2}\mathbf{L}_1^T$.
(c) Assuming that $\mu \ll 1$, explain how the result in part (b) leads to the approximate Cholesky factorization in (9.28).
(d) Based on the assumptions used to derive (9.28) in parts (b) and (c), is it more likely that the preconditioner will work better on the tridiagonal matrix in (9.25) or the matrix $\mathbf{A}_2$ in (9.24)? Make sure to explain your reasoning. Also, assume they are the same size.

Chapter 10

Data Analysis

10.1 Introduction

In this chapter we consider a problem we have examined earlier, which is how to derive information from data. This was central to Chapter 5, when we derived interpolation formulas, and also in Chapter 8, where we investigated ways to use linear and nonlinear regression. In this chapter, three different situations are considered. The first two are examples illustrating the usefulness of the singular value decomposition (SVD) in data analysis. The SVD is explained in Section 4.6. All three methods make use of the regression material covered in Section 8.3.

10.2 Principal Component Analysis

The goal is to have a method that can be used to find connections between experimentally determined quantities, even though there is no obvious reason why they have to be connected. It is easiest to explain this using an example. In addition to introducing how the SVD can be used to find the connections, it will also show how the method overlaps with the regression material considered in Chapter 8.

10.2.1 Example: Word Length

The question considered is whether there is a connection between the length of a word and the number of words used in its definition in the dictionary. Some sample data are given in Table 10.1, which were obtained from the Merriam-Webster Dictionary. We are going to see if there is a linear relationship between these values.

word           L     W     X         Y       x        y
bag            3   223    −3    117.70    −0.6    0.766
across         6   117     0     11.70     0.0    0.076
if             2   101    −4     −4.30    −0.8   −0.028
insane         6    40     0    −65.30     0.0   −0.425
by             2   217    −4    111.70    −0.8    0.727
detective      9    41     3    −64.30     0.6   −0.418
relief         6   153     0     47.70     0.0    0.310
slope          5    87    −1    −18.30    −0.2   −0.119
scoundrel      9     4     3   −101.30     0.6   −0.659
look           4   212    −2    106.70    −0.4    0.694
neither        7    51     1    −54.30     0.2   −0.353
pretentious   11    70     5    −35.30     1.0   −0.230
solid          5   259    −1    153.70    −0.2    1.000
gone           4    90    −2    −15.30    −0.4   −0.100
fun            3    78    −3    −27.30    −0.6   −0.178
therefore      9    13     3    −92.30     0.6   −0.601
generality    10    22     4    −83.30     0.8   −0.542
month          5    57    −1    −48.30    −0.2   −0.314
blot           4   185    −2     79.70    −0.4    0.519
infectious    10    86     4    −19.30     0.8   −0.126

Table 10.1 Number, L, of letters in a word, and the number, W, of words appearing in its definition in a dictionary. Also, X, Y are the centered values, and x, y are the scaled values.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0_10

In particular, if L is the number of letters in a word, and W is the number of words in its definition, we want to fit

$$ W = \alpha L + \beta \tag{10.1} $$

to the data. This brings up the first observation about this problem, which is that writing the formula this way implies that L is the independent variable and W is the dependent variable. The idea of independent and dependent is almost arbitrary here, and we could just as well have written

$$ L = aW + b. \tag{10.2} $$

Mathematically these two formulas are equivalent, with α = 1/a and β = −b/a. However, as explained in Section 8.5, traditional linear least squares applied to (10.1) will not always produce equivalent coefficients to the values obtained when it is applied to (10.2). A way to avoid this is to use orthogonal regression, which uses the true distance between the point and line (see Figure 8.2). Not thinking too hard about this, and using the linear relationship in (10.1), one might claim that the error function is (see Section 8.2)

$$ E(\alpha, \beta) = \frac{1}{1 + \alpha^2} \sum_{i=1}^{n} (\alpha L_i + \beta - W_i)^2 . \tag{10.3} $$

There are two complications with this, and they are resolved as follows.

Center the Data
The first issue concerns finding the minimizer of the above error function. Even for this rather simple two variable problem, finding the minimum of E(α, β) requires solving a cubic. One way to avoid this is to center the data values using their mean, or average, values. In particular, letting

$$ \bar{L} = \frac{1}{n} \sum_{i=1}^{n} L_i \quad \text{and} \quad \bar{W} = \frac{1}{n} \sum_{i=1}^{n} W_i , \tag{10.4} $$

then the centered data values are computed using the formulas

$$ X_i = L_i - \bar{L} \quad \text{and} \quad Y_i = W_i - \bar{W} . \tag{10.5} $$

The resulting data values are given in Table 10.1. The significance of this step is that instead of the more general linear equation in (10.1), it is possible to now assume that

$$ Y = \alpha X . \tag{10.6} $$

One reason this is possible is that, for it to hold, it is necessary that $\sum Y_i = \alpha \sum X_i$. This applies to our centered data because $\sum Y_i = \sum X_i = 0$. Moreover, as will be shown shortly, using this model function it is possible to find the minimizer rather easily.

Scale the Data
The second complication with (10.3) concerns the possible differences in the dimensional units, and magnitudes, of the variables. The solution is to scale each variable using a characteristic value for that variable. Letting $X_c$ and $Y_c$ be characteristic values for X and Y, respectively, then the scaled data are obtained from the formulas

$$ x_i = X_i / X_c \quad \text{and} \quad y_i = Y_i / Y_c . \tag{10.7} $$

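There are various ways to pick $X_c$ and $Y_c$, and these will be considered later. For this example we will pick the largest entry, in absolute value, for each variable. From the X and Y columns of Table 10.1, this means we will take $X_c = 5$ and $Y_c = 153.70$. Based on the above adjustments, we will fit the line y = αx to the data values $(x_i, y_i)$, using the error function

$$ E(\alpha) = \frac{1}{1 + \alpha^2} \sum_{i=1}^{n} (\alpha x_i - y_i)^2 \tag{10.8} $$

$$ \phantom{E(\alpha)} = \frac{1}{1 + \alpha^2} \left( \alpha^2\, \mathbf{x} \cdot \mathbf{x} - 2\alpha\, \mathbf{x} \cdot \mathbf{y} + \mathbf{y} \cdot \mathbf{y} \right), \tag{10.9} $$

where $\mathbf{x}$ and $\mathbf{y}$ are the normalized data vectors. With this, $E'(\alpha) = 0$ reduces to $\alpha^2 + \lambda\alpha - 1 = 0$, where

$$ \lambda = \frac{\mathbf{x} \cdot \mathbf{x} - \mathbf{y} \cdot \mathbf{y}}{\mathbf{x} \cdot \mathbf{y}} . \tag{10.10} $$

The conclusion is that

$$ \alpha = \frac{1}{2} \left( -\lambda \pm \sqrt{\lambda^2 + 4} \right), \tag{10.11} $$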

where the + is used if $\mathbf{x} \cdot \mathbf{y} > 0$ and the − is used if $\mathbf{x} \cdot \mathbf{y} < 0$. It is also apparent that this conclusion requires that $\mathbf{x}$ and $\mathbf{y}$ not be orthogonal. Converting the answer back into the original variables used to introduce the example, we have the formula

$$ W = \bar{W} + m (L - \bar{L}), \tag{10.12} $$

where the slope is $m = \alpha Y_c / X_c$. The resulting linear fit to this data is shown in Figure 10.1.

Figure 10.1 Linear fit connecting the length of a word and the number of words in its definition. (Horizontal axis: Number of Letters; vertical axis: Number of Words.)

What is rather surprising is that the linear fit obtained by minimizing the modified error function (10.9) can also be found using the singular value decomposition (SVD) of the normalized data matrix. To show this, let P be the 20 × 2 normalized data set obtained from the $(x_i, y_i)$ values in Table 10.1. From the SVD, as given in Section 4.6.3,

$$ P = U \Sigma V^T , \tag{10.13} $$

where V is a 2 × 2 orthogonal matrix whose column vectors $\mathbf{v}_i$ are orthonormal eigenvectors for $P^T P$, and Σ is a 20 × 2 diagonal-like matrix containing the two singular values $\sigma_1$ and $\sigma_2$. The latter are calculated using $\sigma_i = \sqrt{\lambda_i}$, where $\lambda_i$ is an eigenvalue for $P^T P$ and labeled so that $\lambda_1 \ge \lambda_2$. Using the values given in Table 10.1, one finds that $\sigma_1 = 3.23$, $\sigma_2 = 1.38$, and

$$ V = \begin{pmatrix} 0.692 & 0.722 \\ -0.722 & 0.692 \end{pmatrix} . $$

The two column vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ from this matrix, along with the normalized data from Table 10.1, are shown in Figure 10.2. As is immediately evident, the first vector $\mathbf{v}_1$ points in the direction determined by the linear fit, while the second column points in a direction orthogonal to the first. In other words, by using the first column from the matrix V we can solve the linear fit problem. As will be shown next, this is not a coincidence.

Figure 10.2 Vectors v1 and v2 obtained using the SVD. Also shown are the normalized data from Table 10.1 and, by the solid (red) line, the linear fit determined by minimizing (10.9). (Horizontal axis: x-axis; vertical axis: y-axis.)
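The connection between the slope formula (10.11) and the first column of V can be checked numerically. The following is a minimal sketch, not the book's code: it uses only the first eight (L, W) pairs of Table 10.1, so its numbers are not expected to reproduce the values quoted above for the full data set, and the helper names `normalize` and `dot` are introduced here purely for illustration. For two variables, $P^T P$ is a 2 × 2 symmetric matrix, so its leading eigenvector can be written in closed form without calling an SVD routine.

```python
import math

# First eight (L, W) pairs from Table 10.1 (a subset, chosen only to keep the sketch short).
L = [3, 6, 2, 6, 2, 9, 6, 5]
W = [223, 117, 101, 40, 217, 41, 153, 87]


def normalize(values):
    """Center with the mean, then scale by the largest entry in absolute value."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    scale = max(abs(c) for c in centered)
    return [c / scale for c in centered], mean, scale


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, L_mean, Xc = normalize(L)
y, W_mean, Yc = normalize(W)

# Slope of the orthogonal-regression line y = alpha*x, from (10.10) and (10.11).
xx, yy, xy = dot(x, x), dot(y, y), dot(x, y)
lam = (xx - yy) / xy
sign = 1.0 if xy > 0 else -1.0
alpha = 0.5 * (-lam + sign * math.sqrt(lam**2 + 4))

# Largest eigenvalue of P^T P = [[xx, xy], [xy, yy]] and an (unnormalized)
# eigenvector for it; this is valid as long as xy is nonzero.
lam1 = 0.5 * (xx + yy) + math.sqrt(0.25 * (xx - yy) ** 2 + xy**2)
v1 = (xy, lam1 - xx)
slope_from_v1 = v1[1] / v1[0]

# Slope in the original variables, as in (10.12).
m = alpha * Yc / Xc
print(f"alpha = {alpha:.4f}, slope from v1 = {slope_from_v1:.4f}, m = {m:.3f}")
```

The two slopes agree to rounding error, which is the two-variable version of the statement that the first column of V gives the line of best fit.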


10.2.2 Derivation of Principal Component Approximation

It is assumed that there are m variables, or quantities, $q_1, q_2, \cdots, q_m$ under consideration. Also, there are n observations, or data vectors, for these variables. These will be denoted as $\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n$, where each $\mathbf{q}_i$ is an m-vector. These vectors are used to form the rows of the n × m data matrix, giving

$$ \begin{pmatrix} \leftarrow \mathbf{q}_1^T \rightarrow \\ \leftarrow \mathbf{q}_2^T \rightarrow \\ \vdots \\ \leftarrow \mathbf{q}_n^T \rightarrow \end{pmatrix} = \begin{pmatrix} q_{11} & q_{12} & \cdots & q_{1m} \\ q_{21} & q_{22} & \cdots & q_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nm} \end{pmatrix} . $$

So, the data in column j corresponds to the n observed values for the variable $q_j$.

Normalized Data
Before using a PCA, the data needs to be normalized. The first step is to center the values in each column using the mean value for that column. In vector form, the mean is

$$ \mathbf{q}_M = \frac{1}{n} \sum_{i=1}^{n} \mathbf{q}_i . $$

So, to center the data, the ith row is replaced with $(\mathbf{q}_i - \mathbf{q}_M)^T$. The second step is to scale the values in each column using a characteristic value for that column. If $\mathbf{c}_j$ is the jth column of the centered data matrix, and $S_j$ is the scaling factor for that column, then $\mathbf{c}_j$ is replaced with $\mathbf{p}_j = \mathbf{c}_j / S_j$. With this we then have the n × m normalized data matrix P, which has the form

$$ P = \begin{pmatrix} \leftarrow \mathbf{p}_1^T \rightarrow \\ \leftarrow \mathbf{p}_2^T \rightarrow \\ \vdots \\ \leftarrow \mathbf{p}_n^T \rightarrow \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m} \\ p_{21} & p_{22} & \cdots & p_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nm} \end{pmatrix} . \tag{10.14} $$

In the word length example, we used $S_j = \|\mathbf{c}_j\|_\infty$. However, there are various choices one can take for $S_j$, and these will be discussed later.
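The two normalization steps can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `normalize_data` is hypothetical, the data matrix is stored as a list of rows, the ∞-norm scaling from the word length example is used, and no column is constant (otherwise the scaling factor would be zero).

```python
def normalize_data(Q):
    """Center each column of Q with its mean, then scale it by its infinity norm.

    Q is a list of n rows, each with m entries. Returns (P, q_mean, S), where
    P is the normalized data matrix, q_mean the column means, and S the
    scaling factors. Assumes no column of Q is constant.
    """
    n, m = len(Q), len(Q[0])
    q_mean = [sum(row[j] for row in Q) / n for j in range(m)]
    C = [[row[j] - q_mean[j] for j in range(m)] for row in Q]      # centered data
    S = [max(abs(C[i][j]) for i in range(n)) for j in range(m)]    # S_j = ||c_j||_inf
    P = [[C[i][j] / S[j] for j in range(m)] for i in range(n)]
    return P, q_mean, S


# First four (L, W) rows of Table 10.1, used only as a small illustration.
Q = [[3, 223], [6, 117], [2, 101], [6, 40]]
P, q_mean, S = normalize_data(Q)
for row in P:
    print([round(value, 3) for value in row])
```

With this scaling every column of P has zero mean and its largest entry, in absolute value, equals one.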

Subspace Approximation
To explain what subspace approximation means, a data set involving three variables is shown in Figure 10.3. A typical question is, what line (i.e., what one-dimensional subspace) best fits this data? This means we want to find a direction vector $\mathbf{v}_1$ so that the line $\mathbf{p} = \alpha_1 \mathbf{v}_1$ best fits the data. Or, wanting a better approximation, you might want to determine the plane (i.e., the two-dimensional subspace) that best fits the data. In this case, you want to find direction vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ so $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2$ provides the best fit. For a PCA, given k, you want to find direction vectors $\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_k$ so that $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k$ best fits the normalized data P. Just as in the word example, true distance will be used when fitting the data.

Figure 10.3 Example data set in the case of three normalized variables p = (p1, p2, p3)T. The upper plot shows the best fit line. The lower plot is the data projected onto a plane that is perpendicular to the solid (red) line shown in the upper plot.


The formula for the error function used with a PCA is derived in the remainder of this section. A summary of the derivation is given in Section 10.2.3.

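Error Function
This brings us to the next step, which is to determine the formula for the error. So, suppose we pick the one-dimensional approximation $\mathbf{p} = \alpha_1 \mathbf{v}_1$. This is a line with direction vector $\mathbf{v}_1$ and parameter $\alpha_1$. The distance between this line and a data point $\mathbf{p}_i$ is, by definition, the distance between $\mathbf{p}_i$ and the point on the line that is closest to $\mathbf{p}_i$. By minimizing $\|\alpha_1 \mathbf{v}_1 - \mathbf{p}_i\|_2$, as a function of $\alpha_1$, one finds that the distance between the line and $\mathbf{p}_i$ is

$$ D_i = \sqrt{\mathbf{p}_i \cdot \mathbf{p}_i - (\mathbf{p}_i \cdot \mathbf{v}_1)^2} . $$

We are using the least squares error, and so the resulting error function using the entire data set is

$$ E_1(\mathbf{v}_1) = \sum_{i=1}^{n} D_i^2 = \sum_{i=1}^{n} \left[ \mathbf{p}_i \cdot \mathbf{p}_i - (\mathbf{p}_i \cdot \mathbf{v}_1)^2 \right] . \tag{10.15} $$

We want to find a vector $\mathbf{v}_1$ that minimizes this function under the constraint that $\|\mathbf{v}_1\|_2 = 1$. As a second example, suppose one picks the two-dimensional approximation $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2$. The distance between this subspace and a data point $\mathbf{p}_i$ is determined by finding $\alpha_1$ and $\alpha_2$ that minimize $\|\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 - \mathbf{p}_i\|_2$. After finding $\alpha_1$ and $\alpha_2$, the resulting least squares error for the entire data set is

$$ E_2(\mathbf{v}_1, \mathbf{v}_2) = \sum_{i=1}^{n} \left[ \mathbf{p}_i \cdot \mathbf{p}_i - (\mathbf{p}_i \cdot \mathbf{v}_1)^2 - (\mathbf{p}_i \cdot \mathbf{v}_2)^2 \right] . \tag{10.16} $$

It is required that $\|\mathbf{v}_1\|_2 = 1$, $\|\mathbf{v}_2\|_2 = 1$, and $\mathbf{v}_1 \cdot \mathbf{v}_2 = 0$. For general k, the error function is

$$ \begin{aligned} E_k(\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_k) &= \sum_{i=1}^{n} \Big[ \mathbf{p}_i \cdot \mathbf{p}_i - \sum_{j=1}^{k} (\mathbf{p}_i \cdot \mathbf{v}_j)^2 \Big] \\ &= \sum_{i=1}^{n} \mathbf{p}_i \cdot \mathbf{p}_i - \sum_{j=1}^{k} \|P \mathbf{v}_j\|_2^2 \\ &= \sum_{i=1}^{n} \mathbf{p}_i \cdot \mathbf{p}_i - \sum_{j=1}^{k} \mathbf{v}_j^T P^T P \mathbf{v}_j , \end{aligned} \tag{10.17} $$

where P is the n × m normalized data matrix (10.14) and the $\mathbf{v}_j$'s are orthonormal.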
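The distance formula used in (10.15) comes from a one-variable minimization, and the step from the first to the second line of (10.17) uses the fact that the ith entry of $P\mathbf{v}_j$ is $\mathbf{p}_i \cdot \mathbf{v}_j$. A short worked version, assuming $\|\mathbf{v}_1\|_2 = 1$ and writing $g$ for the squared distance (a symbol introduced here only for this sketch), is:

```latex
\begin{aligned}
g(\alpha_1) &= \|\alpha_1 \mathbf{v}_1 - \mathbf{p}_i\|_2^2
  = \alpha_1^2\, \mathbf{v}_1 \cdot \mathbf{v}_1
    - 2\alpha_1\, \mathbf{p}_i \cdot \mathbf{v}_1
    + \mathbf{p}_i \cdot \mathbf{p}_i , \\
g'(\alpha_1) &= 0 \quad \Longrightarrow \quad
  \alpha_1 = \mathbf{p}_i \cdot \mathbf{v}_1
  \qquad (\text{using } \mathbf{v}_1 \cdot \mathbf{v}_1 = 1), \\
D_i^2 &= g(\mathbf{p}_i \cdot \mathbf{v}_1)
  = \mathbf{p}_i \cdot \mathbf{p}_i - (\mathbf{p}_i \cdot \mathbf{v}_1)^2 , \\
\sum_{i=1}^{n} (\mathbf{p}_i \cdot \mathbf{v}_j)^2
  &= \|P \mathbf{v}_j\|_2^2
  = \mathbf{v}_j^T P^T P\, \mathbf{v}_j .
\end{aligned}
```

The same calculation with k orthonormal directions gives $\alpha_j = \mathbf{p}_i \cdot \mathbf{v}_j$ for each j, which is where the sum over j in (10.17) comes from.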

Critical Points
We now turn to the problem of minimizing the error function, subject to the constraint that the $\mathbf{v}_j$'s are orthonormal. This can be done using the method of Lagrange multipliers. For the one-dimensional approximation, the function to minimize is

$$ F(\mathbf{v}_1, \lambda_1) = E_1(\mathbf{v}_1) + \lambda_1 \left( \|\mathbf{v}_1\|_2^2 - 1 \right), \tag{10.18} $$

where $\lambda_1$ is a Lagrange multiplier. For general k the function is

$$ F(\mathbf{v}_1, \cdots, \mathbf{v}_k, \lambda_1, \cdots, \lambda_k) = E_k(\mathbf{v}_1, \cdots, \mathbf{v}_k) + \sum_{j=1}^{k} \lambda_j \left( \|\mathbf{v}_j\|_2^2 - 1 \right), \tag{10.19} $$

where the $\lambda_j$'s are Lagrange multipliers. Finding the minimizer is straightforward. To do this, note that (10.19) can be written as

$$ F = \sum_{i=1}^{n} \mathbf{p}_i \cdot \mathbf{p}_i + \sum_{j=1}^{k} \left[ \lambda_j (\mathbf{v}_j^T \mathbf{v}_j - 1) - \mathbf{v}_j^T P^T P \mathbf{v}_j \right] . $$

The first step is to differentiate F with respect to the components of $\mathbf{v}_j$, as well as $\lambda_j$. This can be done using (8.16). Setting the derivatives to zero one finds that the $\mathbf{v}_j$'s must satisfy

$$ \left( P^T P \right) \mathbf{v}_j = \lambda_j \mathbf{v}_j . \tag{10.20} $$

The conclusion is that each $\mathbf{v}_j$ must be an eigenvector for $P^T P$, and it has corresponding eigenvalue $\lambda_j$. Also, since this matrix is symmetric, the eigenvectors can be chosen to be orthogonal. It is important to recognize that (10.20) is the same eigenvalue problem used to determine the singular values, and the matrix V, for the singular value decomposition (see Section 4.6.1). Moreover, by taking the dot product of both sides of (10.20) with $\mathbf{v}_j$, and using $\|\mathbf{v}_j\|_2 = 1$, it follows that

$$ \mathbf{v}_j^T P^T P \mathbf{v}_j = \lambda_j . \tag{10.21} $$

Minimum Error
We now know that $\mathbf{v}_j$ must be an eigenvector for $P^T P$, but which one do we take? Substituting (10.21) into (10.15) we obtain

$$ E_1 = -\lambda_1 + \sum_{i=1}^{n} (\mathbf{p}_i \cdot \mathbf{p}_i) . \tag{10.22} $$

So, to minimize $E_1$ we should use the eigenvector $\mathbf{v}_1$ corresponding to the largest eigenvalue of $P^T P$. By definition, the largest eigenvalue is $\sigma_1^2$, where $\sigma_1$ is the first singular value for P, and $\mathbf{v}_1$ is the first column from V in the SVD for P (see Section 4.6.1). Using the same reasoning, from (10.17), for the general case, the minimum error is

$$ E_k = -\sigma_1^2 - \sigma_2^2 - \cdots - \sigma_k^2 + \sum_{i=1}^{n} (\mathbf{p}_i \cdot \mathbf{p}_i), \tag{10.23} $$

and this is obtained when $\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_k$ are taken to be the first k columns from V. One last fact we need is, since P is n × m with m ≤ n, that (see Exercise 10.6)

$$ \sum_{i=1}^{n} (\mathbf{p}_i \cdot \mathbf{p}_i) = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2 . $$

Consequently, (10.23) can be written as $E_k = \sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_m^2$. This means that the relative error is

$$ R(k) = \frac{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_m^2}{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2} . $$

In statistics this is called the relative residual variance.

10.2.3 Summary of the PCA

The above discussion is summarized in the following theorem.

Theorem 10.1 Let P be a normalized nonzero n × m data matrix, with m ≤ n. Also, let the SVD of this matrix be $P = U \Sigma V^T$. Given k, with 1 ≤ k < m, consider a linear fit of the data of the form $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k$. The smallest error is obtained when the vectors are chosen to be the first k columns of V, and the resulting relative error is

$$ R(k) = \frac{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_m^2}{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2} . \tag{10.24} $$

In the parlance of the subject, the columns of V are called the principal components. Now the more practical question, which is, so what? The easiest way to answer this is through an example.

Example 10.1
The word length data in Table 10.1 is expanded in Table 10.2 to include two new variables. One is the relative number of times the word appeared in print in 1908, and the other is the number of times for 2008 (both of these numbers are given in percentages). These were acquired using Google's ngram program [Michel et al., 2011]. The last four columns form the normalized data matrix P. Calculating the SVD of P one finds that

$$ V = \begin{pmatrix} 0.771 & 0.362 & 0.524 & 0.003 \\ -0.617 & 0.628 & 0.474 & 0.008 \\ -0.109 & -0.490 & 0.494 & 0.710 \\ -0.113 & -0.485 & 0.506 & -0.704 \end{pmatrix} . $$

Also, the mean values are $\mathbf{q}_M = (6, 105.3, 0.0119, 0.0131)^T$, and the scaling factors used on the centered data are $S_1 = 5$, $S_2 = 153.7$, $S_3 = 0.1641$, and $S_4 = 0.1599$, respectively.

Word           L     W    1908    2008     p1       p2       p3       p4
bag            3   223   0.002   0.003   −0.6    0.766   −0.062   −0.064
across         6   117   0.002   0.003    0.0    0.076   −0.062   −0.064
if             2   101   0.176   0.173   −0.8   −0.028    1.000    1.000
insane         6    40   0.002   0.003    0.0   −0.425   −0.062   −0.064
by             2   217   0.002   0.003   −0.8    0.727   −0.062   −0.064
detective      9    41   0.000   0.001    0.6   −0.418   −0.070   −0.076
relief         6   153   0.002   0.003    0.0    0.310   −0.062   −0.064
slope          5    87   0.002   0.003   −0.2   −0.119   −0.062   −0.064
scoundrel      9     4   0.002   0.003    0.6   −0.659   −0.062   −0.064
look           4   212   0.019   0.028   −0.4    0.694    0.043    0.093
neither        7    51   0.002   0.003    0.2   −0.353   −0.062   −0.064
pretentious   11    70   0.002   0.003    1.0   −0.230   −0.062   −0.064
solid          5   259   0.002   0.003   −0.2    1.000   −0.062   −0.064
gone           4    90   0.011   0.010   −0.4   −0.100   −0.006   −0.020
fun            3    78   0.001   0.003   −0.6   −0.178   −0.067   −0.063
therefore      9    13   0.002   0.003    0.6   −0.601   −0.062   −0.064
generality    10    22   0.002   0.003    0.8   −0.542   −0.062   −0.064
month          5    57   0.008   0.007   −0.2   −0.314   −0.026   −0.038
blot           4   185   0.002   0.003   −0.4    0.519   −0.062   −0.064
infectious    10    86   0.002   0.003    0.8   −0.126   −0.062   −0.064

Table 10.2 Extension of the data given in Table 10.1, where 1908 and 2008 refer to the relative use of the word in the respective year. The variables p1, p2, p3, and p4 are the respective normalized values of the data.

(1) $\mathbf{p} = \alpha_1 \mathbf{v}_1$
In component form, it is being assumed that

$$ p_1 = \alpha_1 v_{11}, \quad p_2 = \alpha_1 v_{21}, \quad p_3 = \alpha_1 v_{31}, \quad p_4 = \alpha_1 v_{41}, $$

where $\mathbf{v}_1 = (v_{11}, v_{21}, v_{31}, v_{41})^T$. So, given, say, the value of $p_1$ we can determine $\alpha_1$. From this we are then able to determine the predicted values for $p_2$, $p_3$, and $p_4$. To illustrate, consider the word "anything," which has 8 letters. So, $q_1 = L = 8$. Given the normalization used in Table 10.2, for this word

$$ p_1 = \frac{L - q_{1M}}{S_1} = 0.4 . $$

Since $v_{11} = 0.771$, it then follows that $\alpha_1 = p_1 / v_{11} = 0.519$. From this we obtain the values $p_2 = \alpha_1 v_{21} = -0.32$, $p_3 = -0.056$, and $p_4 = -0.059$. In original variables, the assumption that $\mathbf{p} = \alpha_1 \mathbf{v}_1$ leads us to the conclusions that "anything" takes $q_{2M} + S_2 p_2 \approx 56$ words in the dictionary, it appeared in $q_{3M} + S_3 p_3 \approx 0.003\%$ of the printed literature in 1908, and appeared in $q_{4M} + S_4 p_4 \approx 0.004\%$ in 2008.

(2) $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2$
Using the word "anything," then $q_1 = 8$ and, using Google's ngram program, it appeared in 0.0154% of the printed literature in 1908. As before, $p_1 = 0.4$. Also, $p_3 = (0.0154 - q_{3M})/S_3 = 0.0212$. From the first and third entries in the equation $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2$ we get that

$$ p_1 = \alpha_1 v_{11} + \alpha_2 v_{12}, \qquad p_3 = \alpha_1 v_{31} + \alpha_2 v_{32} . $$

Substituting in the values for the $p_i$'s and $v_{ij}$'s, and then solving, one finds that $\alpha_1 = 0.602$ and $\alpha_2 = -0.177$. We can now use this information to find approximations for the other two variables, using the formula $p_i = \alpha_1 v_{i1} + \alpha_2 v_{i2}$. The conclusions are that the word "anything" takes 31 words in the dictionary, and that it appeared in 0.016% of the printed literature in 2008. ■

As demonstrated in the previous example, using a PCA, one or more variables are used to predict the values of the others. Also, it does not make any difference which variables are taken to be "known" and which are "predicted". The exception to this statement occurs when there is a zero divisor. For example, when k = 1 in the previous example, if $v_{11} = 0$ then the value of $\alpha_1$ would need to be determined from the second or third component of $\mathbf{v}_1$. Note that it would be concluded in this case that $p_1 = 0$.

Another point to make is that you might have noticed that nothing was said about how accurate the predictions are. The reason is that there is not enough data in the example to expect to be able to make accurate predictions. This is not the case in the example considered later involving crime data. The question of accuracy will be addressed, as it often is using a PCA, by splitting the data set into a training set, which is used to find the principal directions, and then a testing set which is used to test the accuracy of the approximations.
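The arithmetic in Example 10.1 can be retraced with a short script. This is a sketch that starts from the rounded values of V, the means, and the scaling factors quoted in the example, so small differences in the last digit are to be expected; the helper name `unscale` is introduced here for illustration, and the 2 × 2 system in part (2) is solved with Cramer's rule simply because it is short.

```python
# Rounded values quoted in Example 10.1.
V = [
    [0.771, 0.362, 0.524, 0.003],
    [-0.617, 0.628, 0.474, 0.008],
    [-0.109, -0.490, 0.494, 0.710],
    [-0.113, -0.485, 0.506, -0.704],
]
q_mean = [6, 105.3, 0.0119, 0.0131]
S = [5, 153.7, 0.1641, 0.1599]


def unscale(p):
    """Convert normalized values p back to the original variables."""
    return [q_mean[j] + S[j] * p[j] for j in range(4)]


# (1) p = alpha1*v1: the number of letters fixes alpha1.
p1 = (8 - q_mean[0]) / S[0]
alpha1 = p1 / V[0][0]
q_k1 = unscale([alpha1 * V[i][0] for i in range(4)])

# (2) p = alpha1*v1 + alpha2*v2: p1 and p3 are known.
p3 = 0.0212                      # value quoted in the example
a, b = V[0][0], V[0][1]          # first row of the 2 x 2 system
c, d = V[2][0], V[2][1]          # third row
det = a * d - b * c
alpha1_2 = (p1 * d - b * p3) / det
alpha2_2 = (a * p3 - c * p1) / det
q_k2 = unscale([alpha1_2 * V[i][0] + alpha2_2 * V[i][1] for i in range(4)])

print(f"k = 1: alpha1 = {alpha1:.3f}, words = {q_k1[1]:.1f}")
print(f"k = 2: alpha1 = {alpha1_2:.3f}, alpha2 = {alpha2_2:.3f}, "
      f"words = {q_k2[1]:.1f}, 2008 use = {q_k2[3]:.4f}%")
```

Swapping which components are treated as known only changes which rows of V enter the small linear system, which is the point made after the example.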

10.2.4 Scaling Factors

One question left unanswered is how to select a characteristic value $S_j$ to scale the values in the jth column of the centered data matrix. A mathematical answer to this is to use a vector norm. If $\mathbf{c}_j$ is the jth column vector from the centered data matrix, then possibilities are

$$ S_j = \frac{1}{n} \|\mathbf{c}_j\|_1 , \quad S_j = \frac{1}{\sqrt{n}} \|\mathbf{c}_j\|_2 , \quad \text{or} \quad S_j = \|\mathbf{c}_j\|_\infty . $$

Each of these is dimensionally consistent with the dimensions of $\mathbf{c}_j$. The multiplicative factors for the first two are there to scale for the size of the vector. For example, if $\mathbf{c}_j = (c, c, \cdots, c)^T$, with c ≥ 0, then each of the above scaling factors yields $S_j = c$. Also, note that in the word length example, the ∞-norm choice was used for the scaling.

In applications, other choices are sometimes made. One is autoscaling, and the formula is $S_j = \|\mathbf{c}_j\|_2 / \sqrt{n - 1}$. This is used in statistical applications because it corresponds to the standard deviation for the column. Another choice that is sometimes used comes from the difference between the largest and smallest entries in the uncentered data column, which results in the formula $S_j = \max_i q_{ij} - \min_i q_{ij}$. This is called range scaling.
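To compare these choices on a single column, the following sketch computes all five scaling factors. The function name `scaling_factors` and the dictionary keys are hypothetical labels chosen for this illustration, and the sample column is the first five W values from Table 10.1.

```python
import math


def scaling_factors(q):
    """Candidate scaling factors S_j for one uncentered data column q."""
    n = len(q)
    mean = sum(q) / n
    c = [v - mean for v in q]                    # centered column c_j
    norm1 = sum(abs(v) for v in c)
    norm2 = math.sqrt(sum(v * v for v in c))
    norm_inf = max(abs(v) for v in c)
    return {
        "1-norm": norm1 / n,
        "2-norm": norm2 / math.sqrt(n),
        "inf-norm": norm_inf,
        "autoscale": norm2 / math.sqrt(n - 1),   # sample standard deviation
        "range": max(q) - min(q),                # uses the uncentered column
    }


W = [223, 117, 101, 40, 217]   # first five W values from Table 10.1
factors = scaling_factors(W)
for name, value in factors.items():
    print(f"{name:>9}: {value:.2f}")
```

For any column the three norm-based values are ordered, with the 1-norm choice the smallest and the ∞-norm choice the largest, so the choice can change how strongly a column with a few outliers is compressed.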


Figure 10.4 Example data set similar to the one shown in Figure 10.3, except now the three orthogonal vectors determined using a PCA are shown.

Whatever choice is made, it is essential that Sj have the same dimensions as the values appearing in the jth column. Not unexpectedly, the choice can have an effect on the final answer. This is because different scalings produce different error functions, and this will likely affect the value of the minimizer. A discussion of various ways to scale data, as well as other aspects of the normalization process, can be found in van den Berg et al. [2006] and Parente and Sutherland [2013].

10.2.5 Geometry and Data Approximation

The PCA was derived using a regression analysis, and the usefulness of this was demonstrated in the subsequent examples. However, it is worth considering what was done geometrically. For this we will use the data shown in Figure 10.4. The first step, which involved centering, is nothing more than finding the centroid of the data and then shifting the coordinate system so this is the origin. The regression step in a PCA is equivalent to rotating, and/or reflecting, the coordinate system, about the origin, so it is aligned with the data with the first coordinate vector pointing in the longest direction, the second coordinate pointing in the next longest direction, etc. This is illustrated in Figure 10.4.

Looking at the procedure geometrically helps identify some of the complications that might arise using a PCA. The one of most concern involves the scaling. This is done on each coordinate independently, which is equivalent to stretching (or contracting) the data points in the respective direction. Mathematically, it is possible to select scalings that dramatically distort the data. Consequently, it's important to select a scaling that reflects the data under consideration, and the vector norm-based choices discussed in Section 10.2.4 are reasonable choices for this.

A second complication arises when the data form a circular, or spherical, pattern rather than an elliptical, or ellipsoidal, pattern shown in Figure 10.3. In such cases there is no longest direction. The PCA procedure works in such cases, but how it is used needs to be modified to be effective. This situation arises when the singular values repeat. As an example, suppose that there are three variables, and the singular values satisfy $\sigma_1 = \sigma_2$ and $\sigma_2 > \sigma_3$. The error, as given in (10.22), would be the same if we pick $\mathbf{v}_1$ or $\mathbf{v}_2$, and this means that the line of best fit is not unique. In such a case, the one-dimensional approximation should be avoided and a two-dimensional approximation used instead.

10.2.6 Application: Crime Data

To illustrate how a PCA can be used, consider the major crime rates for the larger cities in the U.S. A portion of the data set is shown in Table 10.3. In total there are 105 cities (note that three cities in the original data set were removed because their data were incomplete). We will use this data and a PCA to see what connections there might be between the different crime rates. This will be done by splitting the data into two sets: one will be the training data and the other the testing data. The PCA will be carried out on the training set, and we will then see how well this does in predicting the data values for the cities in the testing set. The training set consists of 2/3 of the cities, randomly chosen, and the testing set consists of those left out. The normalization is determined using the training set, and for this example the 2-norm scaling given in Section 10.2.4 is used. For the training set, the singular values are found to be $\sigma_1 = 7.4$, $\sigma_2 = 2.0$, $\sigma_3 = 1.6$, $\sigma_4 = 1.2$, $\sigma_5 = 0.6$, $\sigma_6 = 0.48$, $\sigma_7 = 0.29$, and $\sigma_8 = 0.11$.

City               Population   Murder   Rape   Robbery   Assault   Burglary   Larceny   Vehicle Theft
New York, NY          8400907    46357    471       832     18597     142000     18780          112526
Los Angeles, CA       3848776    24070    312       903     12217      94240     18435           57414
Houston, TX           2273771    25593    287       823     11367     120933     29279           77058
Phoenix, AZ           1597397     8730    122       522      3757      65617     16281           39643
. . .                   . . .    . . .  . . .     . . .     . . .      . . .     . . .           . . .
Fremont, CA            202714      490      2        34       237       4978      1190            3237
Irving, TX             202447      604      4        34       352       8427      1913            5730
Yonkers, NY            202192      965      8        36       446       3145       620            2182

Table 10.3 Crime data rates for 105 of the largest cities in the U.S. in 2009 [U. S. Census Bureau, 2012]. The rates are per 100,000 population. Also given is the population of the city.

One-Dimensional Approximation
The first question is, can one of the crime rates be used to estimate the other rates (as well as the population)? According to Theorem 10.1, the line of best fit is $\mathbf{p} = \alpha \mathbf{v}_1$. The question we are considering is, given one of the rates (i.e., one of the $p_i$'s), what would we predict the other rates (and population) are. We will use the murder rate, which means we will select $p_2$. Since $p_2 = \alpha v_{21}$, then $\alpha = p_2 / v_{21}$. From the equation $\mathbf{p} = \alpha \mathbf{v}_1$, the best line fits for the other rates (and population) are $p_i = a_i p_2$, where $a_i = v_{i1} / v_{21}$. The resulting lines, along with the data from the testing set, are shown in Figure 10.5. It is evident that the murder rate can be used to predict the assault rate fairly accurately. The predictions for the other three rates are poor, as there is significantly more scatter in the data.

Figure 10.5 Using the murder rate to predict the rates of three other crimes and the population. The line comes from the training set, while the data comes from the testing set. (Four panels, each plotted against Murder: Vehicle, Assault, Robbery, and Population.)

It is possible to provide a more quantified evaluation of the model. If $\hat{\mathbf{y}}$ are the predicted values, $\mathbf{y}$ are the observed values from the testing set, and there are n data points in the testing set, then three often used measures of the fit are as follows:

$$ \begin{aligned} \text{mean squared prediction error (MSPE)} &\equiv \frac{1}{n} \|\hat{\mathbf{y}} - \mathbf{y}\|_2^2 \\ \text{mean absolute deviation (MAD)} &\equiv \frac{1}{n} \|\hat{\mathbf{y}} - \mathbf{y}\|_1 \\ \text{coefficient of determination } (R^2) &\equiv \left( \frac{\hat{\mathbf{y}} \cdot \mathbf{y}}{\|\hat{\mathbf{y}}\|_2\, \|\mathbf{y}\|_2} \right)^2 . \end{aligned} \tag{10.25} $$

              MSPE        MAD         R2
Population    2.70e−02    1.01e−01    0.605
Robbery       3.47e−02    1.36e−01    0.565
Assault       3.65e−03    3.95e−02    0.949
Vehicle       2.2e−02     8.7e−02     0.710

Table 10.4 Error measures given in (10.25) for the plots in Figure 10.5.

So, the better the fit, the closer MSPE and MAD are to zero, and the closer $R^2$ is to one. The values obtained from Figure 10.5 are given in Table 10.4. They reaffirm the earlier observations about the accuracy of the predictions. As a final note, there are various ways to define $R^2$. The one used here is taken from Greene [2018], and specialized to centered data.
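The three measures in (10.25) are simple to compute once the predicted and observed values are available. The sketch below is illustrative only: the function name `fit_measures` is an assumption, and the two short lists are made-up normalized values, not entries from the crime data.

```python
import math


def fit_measures(y_pred, y_obs):
    """Return (MSPE, MAD, R^2) as defined in (10.25)."""
    n = len(y_obs)
    diff = [a - b for a, b in zip(y_pred, y_obs)]
    mspe = sum(d * d for d in diff) / n
    mad = sum(abs(d) for d in diff) / n
    dot = sum(a * b for a, b in zip(y_pred, y_obs))
    norm_pred = math.sqrt(sum(a * a for a in y_pred))
    norm_obs = math.sqrt(sum(b * b for b in y_obs))
    r2 = (dot / (norm_pred * norm_obs)) ** 2
    return mspe, mad, r2


# Hypothetical normalized values, used only to exercise the formulas.
y_obs = [-0.2, 0.1, 0.4, -0.3]
y_pred = [-0.1, 0.1, 0.3, -0.3]
mspe, mad, r2 = fit_measures(y_pred, y_obs)
print(f"MSPE = {mspe:.4f}, MAD = {mad:.4f}, R^2 = {r2:.4f}")
```

Note that this form of $R^2$ is the squared cosine of the angle between the two vectors, so it measures alignment and is unchanged if the predictions are multiplied by a nonzero constant; MSPE and MAD are the measures that detect such a scaling error.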

Seven-Dimensional Approximation
We consider a second question, which is, given the values for six of the crime rates and the population, how well can we predict the remaining crime rate? According to Theorem 10.1, the best fit seven-dimensional plane is $\mathbf{p} = \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_7 \mathbf{v}_7$. This can be written out in matrix form as

$$ \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_8 \end{pmatrix} = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{17} \\ v_{21} & v_{22} & \cdots & v_{27} \\ \vdots & \vdots & \ddots & \vdots \\ v_{81} & v_{82} & \cdots & v_{87} \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_7 \end{pmatrix} . \tag{10.26} $$

We want to determine one of the $p_i$'s, given the values for the others. It makes the mathematical formulas easier to write down if the variable of interest is $p_1$. In other words, it is assumed that the original data matrix is either written down, or column swapped, so the first column is the variable of interest here. With this, and (10.26), we need to find the $\alpha_i$'s in terms of $p_2, p_3, \cdots, p_8$. This is done by solving the reduced problem

$$ \begin{pmatrix} p_2 \\ p_3 \\ \vdots \\ p_8 \end{pmatrix} = \begin{pmatrix} v_{21} & v_{22} & \cdots & v_{27} \\ v_{31} & v_{32} & \cdots & v_{37} \\ \vdots & \vdots & \ddots & \vdots \\ v_{81} & v_{82} & \cdots & v_{87} \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_7 \end{pmatrix} . $$

Writing the solution as $\alpha_i = \sum_{j=2}^{8} a_{ij} p_j$, then the first row from (10.26) can be written as

$$ p_1 = a_2 p_2 + a_3 p_3 + \cdots + a_8 p_8 , \tag{10.27} $$

where $a_i = \sum_{j=1}^{7} v_{1j} a_{ji}$. This is the formula we will use to predict the value of $p_1$ given the value of the other $p_i$'s. The resulting predictions for the vehicle theft rate, and the predictions for the murder rate, using the testing data are given in Figure 10.6. In these graphs, each symbol corresponds to the values for a city in the testing set. The training value for the city is the predicted value using (10.27). For example, to determine the predicted vehicle theft rate, the other rates (and population) for that city are used in (10.27) to calculate $p_1$. The second coordinate of the data points in Figure 10.6 is the known values from the testing data set (for that particular crime). So, the better the prediction, the closer the data points will be to the line, which is what would be obtained if the two values are equal.

Figure 10.6 The testing value versus the value predicted using the training set. The (red) line is what would be obtained if the testing value and training value are equal. (Two panels: Vehicle and Murder. Horizontal axis: Training; vertical axis: Testing.)

What we conclude from this analysis is that it is possible to accurately predict the vehicle theft rate, but not the murder rate. This naturally leads to the question as to why this happens. The answer to this involves complicated social and economic reasons, and is outside the purview of mathematics. What the mathematics has done is make the connections, or non-connections, evident. It is up to the disciplinary experts to explain why.
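The prediction step, solving the reduced problem for the $\alpha_i$'s and then evaluating the first row of (10.26), can be sketched as follows. This is a small illustration under stated assumptions rather than the computation behind Figure 10.6: it uses a made-up example with m = 3 variables and k = 2 directions instead of the 8 × 7 crime problem, the function names `solve` and `predict_first` are hypothetical, and a basic Gaussian elimination with partial pivoting stands in for whatever solver one would normally call.

```python
import math


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def predict_first(Vk, p_known):
    """Predict p_1 from p_2, ..., p_m, where Vk holds the first k = m - 1 columns of V."""
    alpha = solve(Vk[1:], p_known)                       # reduced problem (rows 2 to m)
    return sum(v * a for v, a in zip(Vk[0], alpha))      # first row of (10.26)


# Hypothetical example: two orthonormal directions in three dimensions.
s3, s2 = math.sqrt(3), math.sqrt(2)
Vk = [[1 / s3, 1 / s2], [1 / s3, -1 / s2], [1 / s3, 0.0]]
alpha_true = [2.0, 3.0]
p = [sum(v * a for v, a in zip(row, alpha_true)) for row in Vk]

p1_pred = predict_first(Vk, p[1:])
print(f"true p1 = {p[0]:.6f}, predicted p1 = {p1_pred:.6f}")
```

Because the test point here lies exactly in the span of the two directions, the prediction reproduces $p_1$ to rounding error; for real data the point only lies near that subspace, which is why the predicted and observed values in Figure 10.6 differ.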

10.2.7 Practical Issues

An issue that arises when using a PCA concerns the question of how many data points are needed. We begin with two variables, so $m = 2$. With this, the question is, how many points in the plane are needed so you can confidently determine the orientation of an elliptical pattern? The number is unclear, but it is certainly more than three. If it is assumed that 20 points are enough, then presumably for larger values of $m$ the number is $10m$. This is known as the “rule of 10,” and it is often used in experimental studies. Further discussion about such rules, including their limitations, can be found in Costello and Osborne [2005] and Mei et al. [2008].

Figure 10.7 Singular values, top graph, and the relative error (10.24), bottom graph, for the normalized crime data. Both are plotted as functions of $k$, for $k = 1, 2, \cdots, 8$.

A second issue relates to what value of $k$ to use. The singular values and the relative error (10.24) for the normalized crime data are given in Figure 10.7. This information can be used to decide on how many columns (components) to use for a PCA. However, in practice this is often done using questionable ad hoc assumptions. For example, one of the commonly used methods is known as the Guttman-Kaiser criterion. This states that you simply use those factors which have a corresponding singular value that satisfies $\sigma > 1$. There are others, such as the scree test, the broken stick test, etc. From a mathematical point of view, the test should involve the relative error $R(k)$ since it is provably connected with the accuracy of the data fit. Even so, as seen when using $k = 7$ with the crime data, a small approximation error does not guarantee an accurate prediction obtained using a PCA. Two likely contributors to this are the possible inapplicability of the linearity assumption underlying a PCA, and the incompleteness of the variables being considered in the fitting. Needless to say, there has been considerable discussion of this in the research literature, and some insights about this can be found in Peres-Neto et al. [2005] and Josse and Husson [2012].


10.2.8 Parting Comments The PCA is used in a wide variety of applications, and one reason is that it provides a relatively simple way to determine connections between variables in large data sets. As examples, it has been used in magnetic resonance imaging (MRI) of the brain [Jung et al., 2012, Andersen et al., 1999], for studying turbulent combustion [Parente and Sutherland, 2013], to determine connections between cancer and gene expression [Khan et al., 2001], for fraud detection for bodily injury claims in automobile insurance [Brockett et al., 2002], for assessing energy sustainability of rural communities [Doukas et al., 2012], and for cosmological tests of general relativity [Hojjati et al., 2012]. There are also variations on how to compute a PCA, or an approximate PCA. One of recent interest uses a randomized algorithm that can be used on very large data sets, and this is discussed in Rokhlin et al. [2010]. As popular as the PCA has been, it is based on a major assumption, which is that the variables are related through a linear equation. In many problems the dependence is nonlinear, which limits the use of the PCA. For example, suppose you are interested in the trajectory of a baseball, and your data variables are speed, position, and time. A PCA study would assume that these variables are linearly related, where a more appropriate assumption would be that the speed is determined by the ratio of the relative values for position and time, which is a nonlinear relationship. A significant effort has been invested in developing a nonlinear version of the PCA, and this falls into the more general problem of finding dimensionality reduction methods. There are numerous versions of this, including isomaps, diffusion maps, Laplacian eigenmaps, etc. A nice review of some of them, and how well they work, can be found in van der Maaten et al. [2009] and Burges [2010]. 
As a final comment, the PCA is used extensively in statistics, which is a subject not considered in this text. For those who might be interested in the statistical implications of the PCA, the word length example is based on a similar example considered in Abdi and Williams [2010]. They concentrate on the statistical applications, which include correlations, statistical inference, and confidence intervals, and their paper should be consulted to learn more about this.

10.3 Independent Component Analysis

To introduce the idea considered next, consider the situation of microphones recording sounds from multiple sources. Because the microphones are placed at different distances and angles from the sources, what they record will differ. The question is, from nothing more than what is recorded by each microphone, is it possible to separate out the independent sources (components) contributing to the sound record? This is an example of what is called a blind source separation problem.


As a simple example, suppose there are two sources, $s_1$ and $s_2$, and they are emitting the waves shown in Figure 10.8. There are also two microphones $x_1$ and $x_2$. It is assumed that the signals they record are $x_1 = s_1 + 2s_2$ and $x_2 = s_1 - 2s_2$, which produce the curves shown in Figure 10.9. So, the question is, given the data in Figure 10.9, how close can we come to reproducing the source data in Figure 10.8? To answer this, we will make several assumptions. The first is that the recorded, or observed, data is determined linearly from the source values. In the case of two sources, $s_1$ and $s_2$, and two recorders, $x_1$ and $x_2$, it is assumed
\[
\begin{aligned}
x_1 &= a_{11} s_1 + a_{12} s_2, \\
x_2 &= a_{21} s_1 + a_{22} s_2.
\end{aligned} \tag{10.28}
\]

Figure 10.8 Waves being emitted by sources $s_1$ and $s_2$, plotted as functions of $t$.

Figure 10.9 Waves recorded by microphones $x_1$ and $x_2$, plotted as functions of $t$.


Written this way, the problem is to determine the aij ’s and sj ’s from the measured values of the xi ’s. Given that there are six unknowns, and only two equations, it is apparent that the solution is not unique. Nevertheless, it is possible to derive useful information from this analysis, and this will be apparent later, after we complete the derivation of the method. A few comments are needed before continuing. First, for the time series data in Figure 10.9, it is assumed that the same coefficients aij are used at each time point (i.e., they do not change with t). Also, note that we are using the same number of recorders as sources, and this will play an important role later. Finally, our solution will come close to determining the sources, but it will not be able to determine which is which (i.e., which one is s1 and which is s2 ).
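To make the mixing assumption concrete, here is a small sketch of the two-microphone example. The source waves used below are hypothetical (the curves in Figure 10.8 are not specified by formula), and the sampling of $n = 6000$ points on $0 \le t \le 10$ is an assumption made only for illustration.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 6000)         # assumed sampling of the time axis

# Hypothetical source waves, chosen only for illustration
s1 = np.sin(2.0 * np.pi * t)
s2 = 2.0 * (t % 1.0) - 0.5               # a sawtooth

# The same coefficients a_ij are used at every time point, as in (10.28)
A = np.array([[1.0,  2.0],
              [1.0, -2.0]])
S = np.vstack([s1, s2])                  # 2 x n, one row per source
X = A @ S                                # 2 x n, one row per microphone
x1, x2 = X

# If A were known, the sources would follow from a 2 x 2 solve at each time
S_recovered = np.linalg.solve(A, X)
print(np.max(np.abs(S_recovered - S)))
```

The difficulty in the blind problem is that only `X` is available. The last two lines are included simply to show what is lost when `A` is not known.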

10.3.1 Derivation of the Method

The assumption is that there are $m$ sources and $m$ recorders. Also, at each $t = t_i$, the recorded and source signals satisfy
\[
\mathbf{x}_i = \mathbf{A}\mathbf{s}_i, \tag{10.29}
\]
where $\mathbf{A}$ is $m \times m$, is called the mixing matrix, and is assumed to be invertible. Also, $\mathbf{s}_i = (s_{i1}, s_{i2}, \cdots, s_{im})^T$ is the $m$-vector containing the values of the sources at time $t_i$, and $\mathbf{x}_i = (x_{i1}, x_{i2}, \cdots, x_{im})^T$ is the $m$-vector containing the observed values at time $t_i$. The objective is to find $\mathbf{A}$ and the $\mathbf{s}_i$'s using the $\mathbf{x}_i$'s. The first step, as usual, is to normalize the data.

Centering the Data

As was done for a PCA, the data will be centered by letting $\mathbf{x}_i^* = \mathbf{x}_i - \mathbf{x}_M$, where
\[
\mathbf{x}_M = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i .
\]
With this, (10.29) takes the form
\[
\mathbf{x}_i^* = \mathbf{A}\mathbf{s}_i^*, \tag{10.30}
\]
where $\mathbf{s}_i^* = \mathbf{s}_i - \mathbf{s}_M$ and $\mathbf{x}_M = \mathbf{A}\mathbf{s}_M$. Note that because $\mathbf{A}$ is invertible, it follows that
\[
\frac{1}{n}\sum_{i=1}^{n} \mathbf{s}_i^* = \mathbf{0}. \tag{10.31}
\]


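In other words, the centered sources have zero mean. The resulting $n \times m$ centered data matrices are $\mathbf{X}^*$ and $\mathbf{S}^*$, where the $i$th row of $\mathbf{X}^*$ contains $\mathbf{x}_i^*$ and the $i$th row of $\mathbf{S}^*$ contains $\mathbf{s}_i^*$. With this,
\[
(\mathbf{X}^*)^T \mathbf{X}^* = \sum_{i=1}^{n} \mathbf{x}_i^* (\mathbf{x}_i^*)^T, \tag{10.32}
\]
and
\[
(\mathbf{S}^*)^T \mathbf{S}^* = \sum_{i=1}^{n} \mathbf{s}_i^* (\mathbf{s}_i^*)^T. \tag{10.33}
\]
Both data matrices are assumed to have rank $m$ in what follows, which means that they have independent columns.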
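The centering step and the identity (10.32) are easy to check numerically. The sketch below uses randomly generated sources and a random mixing matrix, both hypothetical, with the data vectors stored as the rows of the data matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 3
A = rng.standard_normal((m, m))           # hypothetical invertible mixing matrix
S = rng.uniform(0.0, 1.0, size=(n, m))    # hypothetical sources, row i holds s_i
X = S @ A.T                               # row i holds x_i = A s_i

x_M = X.mean(axis=0)
s_M = S.mean(axis=0)
X_star = X - x_M                          # centered observations
S_star = S - s_M                          # centered sources

# (10.32): (X*)^T X* equals the sum of the outer products x*_i (x*_i)^T
outer_sum = sum(np.outer(x, x) for x in X_star)

print(np.allclose(x_M, A @ s_M))                  # x_M = A s_M
print(np.allclose(X_star, S_star @ A.T))          # (10.30)
print(np.allclose(X_star.T @ X_star, outer_sum))  # (10.32)
```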

White Source Data

As mentioned earlier, the solution of our problem is not unique, and one way to see this is that (10.30) can be rewritten as $\mathbf{x}^* = (\mathbf{A}\mathbf{C}^{-1})(\mathbf{C}\mathbf{s}^*)$, where $\mathbf{C}$ is an arbitrary $m \times m$ invertible matrix. What this means is that if $\mathbf{A}$ and $\mathbf{s}^*$ are claimed to be a solution, then so are $\mathbf{A}\mathbf{C}^{-1}$ and $\mathbf{C}\mathbf{s}^*$. Among the various solutions possible for this problem, we will look for one that scales so that
\[
\frac{1}{n}\sum_{i=1}^{n} \mathbf{s}_i^* (\mathbf{s}_i^*)^T = \mathbf{I}, \tag{10.34}
\]
where $\mathbf{I}$ is the $m \times m$ identity matrix. In statistics this is referred to as whitening the source data because it is said to remove the correlations in the input. Note that the solution is still not unique. But, instead of $\mathbf{C}$ being an arbitrary $m \times m$ invertible matrix, it is now restricted to an arbitrary orthogonal matrix (see Exercise 10.11). How to whiten a data set is explored in Exercise 10.12, and in what follows the method described in Exercise 10.12(a) is used. However, at least in this example, the solution is insensitive to which method is used.

Reducing the Problem

The objective is to show how to replace the problem of finding $\mathbf{A}$ with finding an orthogonal matrix $\mathbf{V}$. Although this might look to be a minor simplification, it goes a long way in reducing the problem we must solve. First, using (10.32) and (10.30),
\[
\frac{1}{n}(\mathbf{X}^*)^T \mathbf{X}^* = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{A}\mathbf{s}_i^*)(\mathbf{A}\mathbf{s}_i^*)^T .
\]


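So, from (10.34),
\[
\frac{1}{n}(\mathbf{X}^*)^T \mathbf{X}^* = \mathbf{A}\left( \frac{1}{n}\sum_{i=1}^{n} \mathbf{s}_i^* (\mathbf{s}_i^*)^T \right)\mathbf{A}^T = \mathbf{A}\mathbf{A}^T. \tag{10.35}
\]
Now, $(\mathbf{X}^*)^T \mathbf{X}^*$ is symmetric and positive definite, so it is possible to write
\[
\frac{1}{n}(\mathbf{X}^*)^T \mathbf{X}^* = \mathbf{Q}\mathbf{D}\mathbf{Q}^T, \tag{10.36}
\]
where $\mathbf{D}$ is a diagonal matrix with positive entries, and $\mathbf{Q}$ is an orthogonal matrix. It is also important to note that both $\mathbf{D}$ and $\mathbf{Q}$ are known, or determinable, from the data. As for $\mathbf{A}$, using the SVD, $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$, where $\mathbf{U}$ and $\mathbf{V}$ are $m \times m$ orthogonal matrices, and $\boldsymbol{\Sigma}$ is an $m \times m$ diagonal matrix. With this,
\[
\mathbf{A}\mathbf{A}^T = \mathbf{U}\boldsymbol{\Sigma}^2\mathbf{U}^T. \tag{10.37}
\]
Comparing (10.36) and (10.37), it is seen that we are able to determine two of the matrices in the SVD for $\mathbf{A}$. Namely, we can take $\mathbf{U} = \mathbf{Q}$ and $\boldsymbol{\Sigma} = \mathbf{D}^{1/2}$. In this last expression, since $\mathbf{D}$ is a diagonal matrix with diagonals $d_i$, then $\mathbf{D}^{1/2}$ is a diagonal matrix with diagonals $\sqrt{d_i}$.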
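The reduction can be checked on synthetic data. In the sketch below the sources are whitened with a Cholesky factor, which is just one convenient choice made here and not necessarily the method of Exercise 10.12(a). The mixing matrix is the one from the microphone example, and it is used only to verify the result (in a real problem it is unknown).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Hypothetical sources, centered and then whitened so that (10.34) holds exactly
S = rng.uniform(-1.0, 1.0, size=(n, 2))
S -= S.mean(axis=0)
L = np.linalg.cholesky(S.T @ S / n)       # S^T S / n = L L^T
S_star = S @ np.linalg.inv(L).T

A = np.array([[1.0,  2.0],
              [1.0, -2.0]])
X_star = S_star @ A.T                     # rows are x*_i = A s*_i

C = X_star.T @ X_star / n                 # equals A A^T by (10.35)
d, Q = np.linalg.eigh(C)                  # C = Q diag(d) Q^T, as in (10.36)

U = Q                                     # the choice U = Q
Sigma = np.diag(np.sqrt(d))               # the choice Sigma = D^{1/2}
Vt = np.diag(1.0 / np.sqrt(d)) @ Q.T @ A  # what V^T must then be

print(np.allclose(C, A @ A.T))
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # V is orthogonal
print(np.allclose(U @ Sigma @ Vt, A))     # A = Q D^{1/2} V^T
```

The point of the last check is that once $\mathbf{Q}$ and $\mathbf{D}$ are computed from the data, the only unknown left in $\mathbf{A}$ is an orthogonal matrix.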

10.3.2 Reduced Problem

What we have been able to establish, so far, is that the mixing matrix $\mathbf{A}$ in (10.29) has the form
\[
\mathbf{A} = \mathbf{Q}\mathbf{D}^{1/2}\mathbf{V}^T. \tag{10.38}
\]
In other words, the solution of the problem can be written as
\[
\mathbf{s}_i^* = \mathbf{V}\mathbf{y}_i^*, \tag{10.39}
\]
where
\[
\mathbf{y}_i^* = \mathbf{D}^{-1/2}\mathbf{Q}^T\mathbf{x}_i^*. \tag{10.40}
\]
In terms of the original variables, the solution is
\[
\mathbf{s}_i = \mathbf{W}\mathbf{x}_i, \tag{10.41}
\]
where $\mathbf{W} = \mathbf{V}\mathbf{D}^{-1/2}\mathbf{Q}^T$ is known as the unmixing matrix. What is left to do is determine the orthogonal matrix $\mathbf{V}$. To help explain how $\mathbf{V}$ is determined, consider the example shown in Figure 10.9. Because $m = 2$, $\mathbf{V}$ corresponds to either a rotation or a reflection (see Exercise 8.24). For a rotation, it is possible to write
\[
\mathbf{V} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \tag{10.42}
\]

where θ is the angle of rotation. To explain the usefulness of this, in Figure 10.10, the values for s∗i and yi∗ are plotted in their respective planes. According to (10.39), the problem of finding V means we are looking for an angle that will rotate the yi∗ data in Figure 10.10 so they are aligned with the s∗i data. It appears that a counterclockwise rotation of about 270◦ , or a clockwise rotation of about 90◦ , will align it with the s∗ region. It is also possible that V is a reflection. The reason is that the method being developed provides a way to compute s1 and s2 but it is not able to determine if the computed s1 is the original s1 or s2 . If it’s s2 then you must interchange the s1 or s2 labels, and this corresponds to a reflection. A second useful observation coming from Figure 10.10 is that if the data regions happen to be circular, then there is no rotation for the yi∗ ’s that will provide the best fit. In statistics, circular data patterns arise with Gaussian distributions, and avoiding these will play a central role in finding V.
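A few properties of the rotation in (10.42) can be confirmed with a short sketch. The helper name `rotation` and the particular reflection used below are illustrative choices, not part of the method.

```python
import numpy as np

def rotation(theta):
    """Rotation matrix V in (10.42); positive theta is counterclockwise."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

V_ccw = rotation(3.0 * np.pi / 2.0)      # 270 degrees counterclockwise
V_cw = rotation(-np.pi / 2.0)            # 90 degrees clockwise

# A reflection that interchanges the two labels: orthogonal, determinant -1
swap = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

print(np.allclose(V_ccw, V_cw))                    # same matrix
print(np.allclose(V_ccw @ V_ccw.T, np.eye(2)))     # orthogonal
print(np.linalg.det(V_ccw), np.linalg.det(swap))   # +1 versus -1
```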

Figure 10.10 The values of $\mathbf{s}_i^*$ from Figure 10.9, plotted in the $(s_1^*, s_2^*)$-plane, and the values for $\mathbf{y}_i^*$ determined using (10.40), plotted in the $(y_1^*, y_2^*)$-plane.

10.3.3 Contrast Function

The matrix $\mathbf{V}$ is found by introducing what is called a contrast function, which serves a similar purpose to the error function in regression. Before doing this, it is worth first determining what $\theta$ produces the best fit (so we will know if the contrast function is successful in finding it). A simple way to do this is to pick $\theta$ values from the interval $0 \le \theta \le 2\pi$ and then compare the $\mathbf{V}\mathbf{y}_i^*$ values with the $\mathbf{s}_i^*$'s. Some examples of this are shown in Figure 10.11. So, for example, the top row shows the $\mathbf{V}\mathbf{y}_i^*$ values when you take $\theta = \pi/4$. The best fit found in this way is shown in the lower row, which corresponds to $\theta = 3\pi/2$. You should also look at the second row. The $\mathbf{V}\mathbf{y}_i^*$ values are those from the lower row but multiplied by a minus sign. The method we will be deriving will consider these two solutions to be equivalent, and it will be up to you to decide which one to pick.

What makes the problem of determining $\theta$ so challenging is that we need to be able to select the correct red curves in Figure 10.11 without knowing the blue curves. To provide some insight into how this is done, if you look at the lower row, the red curves are simpler than those in rows 1 and 3. In this sense the contrast function should provide a measure of the simplicity of the curves. There is no unique, or perfect, way to do this. Also, the curves in row 2 are equivalent to those in row 4 based on their simplicity. Because of this, as mentioned above, our method will produce both sets as possible solutions.

Figure 10.11 The solid (red) curves are the values for $\mathbf{V}\mathbf{y}_i^*$ for four different values of $\theta$. The dashed (blue) curves are the values for the original sources $\mathbf{s}_i^*$. The left column shows $s_1^*$ and the right column shows $s_2^*$, each as a function of $t$. For each row, starting at the top, $\theta = \pi/4$, $\pi/2$, $7\pi/8$, and $3\pi/2$.

Kurtosis

The ICA method was originally developed as a statistical tool for data analysis, and the assumption central in the development is that the sources are independent. More precisely, they are described using independent probability distributions. This can be achieved if the distributions are non-Gaussian. As a measure of how Gaussian they are, the kurtosis is used. For a single distribution $s^*$ that satisfies (10.31) and (10.34), the kurtosis is computed using the formula
\[
\kappa = \frac{1}{n}\sum_{i=1}^{n} (s_i^*)^4 - 3.
\]

If $s^*$ is Gaussian, then $\kappa = 0$. For the ICA we want $s^*$ to be as non-Gaussian as possible, and so the goal is to find a solution which produces a kurtosis value that is as far away from zero as possible. In other words, we need to find a solution which maximizes $|\kappa|$ or $\kappa^2$. The complication for an ICA is that there are multiple sources, and this means the measure of non-Gaussianity must be generalized. One possibility for a contrast function is to simply add the functions together. Assuming there are $m$ sources, then the result is
\[
K = \sum_{j=1}^{m} |\kappa_j|, \tag{10.43}
\]
where
\[
\kappa_j = \frac{1}{n}\sum_{i=1}^{n} (s_{ij}^*)^4 - 3, \tag{10.44}
\]
and $\mathbf{s}_i^* = (s_{i1}^*, s_{i2}^*, \cdots, s_{im}^*)^T$. The requirement is that, after substituting (10.39) into this expression, $\mathbf{V}$ produces a local maximum for $K$. For the case when $m = 2$, (10.43) is
\[
K(\theta) = |\kappa_1| + |\kappa_2|, \tag{10.45}
\]
where
\[
\kappa_1 = \frac{1}{n}\sum_{i=1}^{n} (y_{i1}^* \cos\theta - y_{i2}^* \sin\theta)^4 - 3, \tag{10.46}
\]
\[
\kappa_2 = \frac{1}{n}\sum_{i=1}^{n} (y_{i1}^* \sin\theta + y_{i2}^* \cos\theta)^4 - 3, \tag{10.47}
\]

and $\mathbf{y}_i^* = (y_{i1}^*, y_{i2}^*)^T$. The connection between $\mathbf{y}_i^*$ and the source data is given in (10.39). The function $K(\theta)$ is plotted in Figure 10.12, using the data from Figure 10.9. There are four local maximum points, corresponding to the values $\theta_1 = 0.43\pi$, $\theta_2 = 0.95\pi$, $\theta_3 = 1.44\pi$, and $\theta_4 = 1.94\pi$. The corresponding solutions for the sources are shown in Figure 10.13. At this point, you (as the observer) decide on which one is the accepted answer. In this case, it is row 3, which means that $\theta_3 = 1.44\pi$. This is very close to the value we were expecting.

Figure 10.12 Function $K(\theta)$, given in (10.45), plotted over the interval $0 \le \theta \le 2\pi$, using the data from Figure 10.9.

It is worth looking at Figure 10.13 a little closer. The ICA has produced just two curves, one that resembles the original $s_1$, and the other resembling the original $s_2$. The $s_2$ version appears upside down (rows 1 and 4) and right-side up (rows 2 and 3). Similarly for the $s_1$ version. If $+$ indicates right-side up, and $-$ upside down, then the ICA produced (i.e., red) curves in Figure 10.13 have the pattern: row 1 $(-, -)$, row 2 $(+, -)$, row 3 $(+, +)$, row 4 $(-, +)$. Without having the known solutions (the dashed blue curves), we would not be able to determine which combination to take. However, for many applications, such as the image separation example considered below, it is easy to determine which one is preferable.

Figure 10.13 Solutions obtained from (10.39) for each of the four values for $\theta$ that maximize $K$. The left column shows $s_1^*$ and the right column shows $s_2^*$, each as a function of $t$. Row 1 is for $\theta = 0.43\pi$, row 2 is for $\theta = 0.95\pi$, row 3 is for $\theta = 1.44\pi$, and row 4 is for $\theta = 1.94\pi$.

Reduced Data Set

In the above example, $n = 6000$ and all of the $\mathbf{x}_i^*$ data points were used to find the mixing matrix $\mathbf{A}$. The computationally expensive step in the derivation involves computing $\sum \mathbf{x}_i^* (\mathbf{x}_i^*)^T$. It is possible to obtain a reasonably accurate approximation for this matrix using only a small subset of the data. To demonstrate this, first note that the order of the $\mathbf{y}_i^*$'s is not important. This is because, if they are relabeled in a random order, the value of $K$ is unchanged. So, suppose we randomly pick $N$ data points. Using these points, the kurtosis function (10.45) can be evaluated, similar to the situation shown in Figure 10.12. What is of interest here is the location $\theta_1$ of the first maximum, which is sufficient to determine the two curves making up the $(\pm, \pm)$ patterns mentioned earlier. The value of $\theta_1$, as a function of $N$, is shown in Figure 10.14. This shows that the values of $\theta_1$ stabilize once $N$ is 70 or larger.

There have been proposals for how to determine the minimum number of data points needed. For EEG signals, the assumption that the mixing matrix is time independent is applicable only over relatively short time periods, and the concern is whether enough data can be acquired to carry out an ICA. The rule of thumb often used is that one needs at least $km^2$ data points, where $k$ is a constant between 5 and 32 [Särelä and Vigário, 2003, Onton and Makeig, 2006, Korats et al., 2013]. The result in Figure 10.14 is consistent with this, which shows that $k$ should be 25 or larger for this example. However, for the image separation problem considered later, it appears that $k$ should be significantly larger.

An idea related to using a reduced sample size is known as downsampling. In this case, instead of using every data point, one uses every other point, or every third point, or every $K$th point. This results in using, approximately, $n/K$ data points. The difference here is that the recommendation is based on the number of data points, while the EEG rule is based on the number of unknowns in the mixing matrix.
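The contrast function (10.45), its local maxima, and the effect of using fewer data points can be explored with a short sketch. Everything about the data below is hypothetical: the whitened data are built from independent uniform sources rotated by a known angle, and the local maxima are located with a simple grid search, which is a crude stand-in for a proper maximization method.

```python
import numpy as np

def contrast(Y, theta):
    """K(theta) from (10.45)-(10.47); Y is n x 2 with rows y*_i."""
    c, s = np.cos(theta), np.sin(theta)
    kappa1 = np.mean((c * Y[:, 0] - s * Y[:, 1]) ** 4) - 3.0
    kappa2 = np.mean((s * Y[:, 0] + c * Y[:, 1]) ** 4) - 3.0
    return abs(kappa1) + abs(kappa2)

def local_maxima(Y, num=720):
    """Grid search for local maxima of K on 0 <= theta < 2*pi."""
    thetas = np.linspace(0.0, 2.0 * np.pi, num, endpoint=False)
    K = np.array([contrast(Y, th) for th in thetas])
    is_max = (K > np.roll(K, 1)) & (K > np.roll(K, -1))
    return thetas[is_max]

# Hypothetical whitened data: unit-variance uniform sources, known rotation
rng = np.random.default_rng(3)
n = 6000
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, 2))
theta_true = 0.3 * np.pi
c, s = np.cos(theta_true), np.sin(theta_true)
V_true = np.array([[c, -s], [s, c]])
Y = S @ V_true                     # rows satisfy V y_i = s_i

subset = rng.choice(n, size=200, replace=False)   # N = 200 random points
downsampled = Y[::3]                              # every third point

print(local_maxima(Y) / np.pi)
print(local_maxima(Y[subset]) / np.pi)
print(local_maxima(downsampled) / np.pi)
```

Because replacing $\theta$ with $\theta + \pi/2$ interchanges $\kappa_1$ and $\kappa_2$, the function $K$ has period $\pi/2$, which is why the maxima come in groups of four.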

Figure 10.14 Value of the optimal angle as a function of the number $N$ of random data points used. Shown are the values for a particular case, dashed (red) curve, and the average using 10 cases, solid (blue) curve.

Parting Comments

In the case of $m$ sources, the contrast function in (10.45) can be written as $K = \|\boldsymbol{\kappa}\|_1$, where $\boldsymbol{\kappa} = (\kappa_1, \kappa_2, \cdots, \kappa_m)^T$. It is also possible to use $K = \|\boldsymbol{\kappa}\|_2^2$ or $K = \|\boldsymbol{\kappa}\|_\infty$. When using $m$ sources, a rotation matrix is characterized using $\frac{1}{2}m(m-1)$ angles. These are determined by finding the locations of the local maximum points of the contrast function. So, an ICA requires solving a multidimensional maximization problem, and the methods considered in Chapter 9 can be used in this case. The fly in the ointment is the contrast function. The kurtosis-based functions considered here have the advantage of simplicity, but it is possible to find examples where they produce non-optimal answers (i.e., the solutions they produce are not the best possible). This has been the impetus for finding better contrast functions, as well as a more refined theoretical explanation of how they produce optimal solutions. Comprehensive introductions to the ICA method, which discuss different types of contrast functions, can be found in Comon and Jutten [2010] and Hyvärinen et al. [2001].

As stated earlier, we used an ICA to solve a blind source separation (BSS) problem. Because we assumed the number of sources $N$ and the number of recorders $M$ are equal it is called a determined BSS. There are also underdetermined BSS ($N > M$), overdetermined BSS ($N < M$), and undetermined BSS ($N$ is unknown) problems. How these are solved involves more than the ICA approach considered here. For those interested in this, you might look at Sawada et al. [2019] and Wildeboer et al. [2020].

10.3.4 Summary of ICA

Given an $n \times m$ data set $\mathbf{X}$, corresponding to $m$ sources and $n$ observations, an ICA consists of the following steps:

(1) Center the data in each column to produce the data set $\mathbf{X}^*$.
(2) Compute the matrix $\frac{1}{n}(\mathbf{X}^*)^T\mathbf{X}^*$ and then find its SVD, which can be written as $\frac{1}{n}(\mathbf{X}^*)^T\mathbf{X}^* = \mathbf{Q}\mathbf{D}\mathbf{Q}^T$.
(3) Compute $\mathbf{Y}^* = \mathbf{X}^*\mathbf{Q}\mathbf{D}^{-1/2}$.
(4) Pick a contrast function $K(\mathbf{S}^*)$. Setting $\mathbf{S}^* = \mathbf{Y}^*\mathbf{V}^T$, determine the $m \times m$ orthogonal matrices $\mathbf{V}$ that correspond to the local maxima of $K$.
(5) Selecting one of the $\mathbf{V}$'s, the unmixing matrix is $\mathbf{W} = \mathbf{V}\mathbf{D}^{-1/2}\mathbf{Q}^T$ and the corresponding source data are $\mathbf{S} = \mathbf{X}\mathbf{W}^T$.

To use this procedure it is required that the columns of $\mathbf{X}$ are independent. Also, note that the above procedure is written in matrix form. To connect this with the vector version considered earlier, the $i$th row of the data matrix $\mathbf{X}$ is obtained from the data vector $\mathbf{x}_i$. The same applies to $\mathbf{X}^*$, $\mathbf{Y}^*$, $\mathbf{S}^*$, and $\mathbf{S}$. In doing this, because the original column vectors are put into the matrices as row vectors, the above formulas appear as transposes of the ones given earlier. As an example, the equations $\mathbf{s}_i = \mathbf{W}\mathbf{x}_i$ are written above in matrix form as $\mathbf{S} = \mathbf{X}\mathbf{W}^T$.
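The five steps can be collected into a short routine for the case $m = 2$. This is only a sketch under several assumptions: $\mathbf{V}$ is restricted to the rotations (10.42), the contrast function is the kurtosis sum (10.43), the maximum is located with a grid search, and the symmetric matrix in step (2) is factored with an eigenvalue routine rather than a general SVD. The function name `ica_two_sources` and the test data are hypothetical.

```python
import numpy as np

def ica_two_sources(X, num=360):
    """Steps (1)-(5) for m = 2, using rotations and the kurtosis contrast."""
    n = X.shape[0]
    X_star = X - X.mean(axis=0)                     # step (1)
    d, Q = np.linalg.eigh(X_star.T @ X_star / n)    # step (2): Q D Q^T
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    Y_star = X_star @ Q @ D_inv_sqrt                # step (3)

    def K(theta):                                   # step (4), S* = Y* V^T
        c, s = np.cos(theta), np.sin(theta)
        V = np.array([[c, -s], [s, c]])
        S_star = Y_star @ V.T
        kappa = np.mean(S_star ** 4, axis=0) - 3.0
        return np.sum(np.abs(kappa)), V

    # K has period pi/2, so searching [0, pi/2) finds one representative
    thetas = np.linspace(0.0, np.pi / 2.0, num, endpoint=False)
    best = max(thetas, key=lambda th: K(th)[0])
    V = K(best)[1]

    W = V @ D_inv_sqrt @ Q.T                        # step (5)
    return X @ W.T, W

# Hypothetical test: independent uniform sources and a known mixing matrix
rng = np.random.default_rng(4)
n = 20000
S_true = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, 2))
A = np.array([[1.0,  2.0],
              [1.0, -2.0]])
X = S_true @ A.T

S_est, W = ica_two_sources(X)
corr = np.corrcoef(S_est.T, S_true.T)[:2, 2:]       # estimated versus true
print(np.round(corr, 3))
```

Each estimated source should correlate strongly with exactly one true source, with either sign, which reflects the sign and labeling ambiguity discussed earlier.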

10.3.5 Application: Image Steganography

An interesting use of ICA involves embedding one image inside another. As an example, the two images in the first row in Figure 10.15 were mixed to produce the two images in the second row. To explain how this was done, each of the original grayscale images consists of an $M \times N$ array of integers, each with a value from 0 to 255. This number represents the intensity of gray for that pixel, with 0 corresponding to black and 255 corresponding to white. Letting the pixel matrices for the two source images be $\mathbf{S}_1$ and $\mathbf{S}_2$, then the pixel matrices for the two mixed images are $\mathbf{X}_1 = 0.75\mathbf{S}_1 + 0.25\mathbf{S}_2$ and $\mathbf{X}_2 = 0.9\mathbf{S}_1 + 0.1\mathbf{S}_2$. It is easy to find values so the fox is very difficult, if not impossible, to see in the mixed images. The values used here are chosen so you can still see the fox in the image on the right but not in the one on the left.

Figure 10.15 Original images, first row, and the images obtained after they are mixed, second row. Although it might be hard to see, the fox is in both of the images on the second row.

Hiding images inside another underlies what is called image steganography. This is done, for example, for digital watermarking. The reason is that it enables you to prove ownership of a digital image since you are the only one able to extract the hidden image. Another, perhaps more nefarious, use is to be able to transport or transmit information without being detected. This has been done, or at least claimed to have been done, in several recent cyberespionage cases.

There are $MN$ pixels in each image, and they are labeled from $i = 1$ up to $i = MN$. The order they are labeled is not important, but the order must be exactly the same in all four images. In the example considered here column labeling is used. This means that those in the first column are labeled from 1 to $M$, those in the second column are labeled $M + 1$ to $2M$, etc. For the $i$th pixel, the source vector is $\mathbf{s}_i = (s_{i1}, s_{i2})^T$, where $s_{i1}$ and $s_{i2}$ are the values for the pixel from image $\mathbf{S}_1$ and $\mathbf{S}_2$, respectively. Similarly for the mixed vector $\mathbf{x}_i = (x_{i1}, x_{i2})^T$. The mixing matrix used is therefore
\[
\mathbf{A} = \begin{pmatrix} 0.75 & 0.25 \\ 0.9 & 0.1 \end{pmatrix}.
\]
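The column labeling and the mixing step can be sketched as follows. The tiny random arrays are hypothetical stand-ins for the actual images, and Fortran-order flattening is used here as one way to realize column labeling.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 4, 6                                   # tiny stand-ins for the images
S1 = rng.integers(0, 256, size=(M, N))        # hypothetical grayscale pixels
S2 = rng.integers(0, 256, size=(M, N))

A = np.array([[0.75, 0.25],
              [0.90, 0.10]])

# Column labeling: pixel i runs down the first column, then the second, etc.
S = np.column_stack([S1.flatten(order="F"),
                     S2.flatten(order="F")])  # row i is s_i^T
X = S @ A.T                                   # row i is x_i^T = (A s_i)^T

# Back to M x N pixel matrices, using the same labeling
X1 = X[:, 0].reshape((M, N), order="F")
X2 = X[:, 1].reshape((M, N), order="F")

print(np.allclose(X1, 0.75 * S1 + 0.25 * S2))
print(np.allclose(X2, 0.90 * S1 + 0.10 * S2))
```

The mixed values are in general not integers, so some rounding is needed if they are to be stored as grayscale images.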

Also, the two original images are $720 \times 1280$, so there are $n = MN = 921{,}600$ pixels. The problem is now mathematically equivalent to the microphone example, which means the formulas we derived earlier are directly applicable to the data vectors for the images. In particular, the solution is given in (10.41), and the contrast function using the kurtosis is given in (10.43). As with the earlier example, the ICA finds multiple solutions. After looking at them, it was decided that the best one is the one shown in Figure 10.16. If you are wondering what the other possible solutions look like, one is shown in Figure 10.17. The fox image is the same but the black and white pixels in the flower image have been reversed. This is equivalent to the observation made about the solutions in Figure 10.13 differing by a minus sign. Selecting Figure 10.16 over Figure 10.17 is a judgement call based on the appearance of the two mixed images (row 2 in Figure 10.15).

Figure 10.16 Result of using ICA to unmix the two images.

Figure 10.17 Another solution found by the ICA.

How you react to the images depends on what you were planning to do with them. If you had hoped to have them framed and mounted on your wall, then you are possibly disappointed. The reason is that the separation is incomplete (e.g., there is a residual image of the flowers in the fox image). If, on the other hand, you intended to use the embedded fox image as a digital watermark, then the separated images are acceptable. This is also true for some of the primary applications of ICA, and particular examples include facial recognition [Bartlett et al., 2002], finding defects in fabrics [Zhang and Cheng, 2014], identifying 3D shapes of the heart [Wu et al., 2014], detection of Down syndrome in newborns [Zhao et al., 2014], and methods of watermarking images [Nguyen and Patra, 2008].

Returning to the quality of the separated images, it does not appear to be possible to get a better result by using a different contrast function. Contributing to the quality of the images is the fact that an ICA is inherently lossy because floating point transformations are being applied to integer arrays. There have been several proposals on ways to measure how well ICA has separated the sources, what are called performance measures [Chen and Bickel, 2006, Nordhausen and Oja, 2018]. There has also been work on what is called robustification of ICA. It is not difficult to find examples where outliers in the data, produced by noise or other artifacts, can seriously degrade the quality of the separation. Removing such data is called robustification, and this is discussed more in Virta and Nordhausen [2019] and Brys et al. [2005].

10.4 Data Based Modeling

We will now consider how to use data to model discrete dynamical systems. This could include, for example, tracking the hourly values of a collection of stocks in the stock market, the daily temperatures in a collection of cities, or the number of new cases of an infectious disease. There are sophisticated mathematical models for all of these, but we are going to investigate how it might be possible to derive this information directly from experimental observations.

To help introduce the ideas, suppose we have measured the temperatures in $m$ different cities on day $t_i$, giving us a data $m$-vector $\mathbf{x}_i = \mathbf{x}(t_i)$. With this we want to predict the values $\mathbf{y}_i$ for some other quantity. For example, $\mathbf{y}_i$ might be the number of ice cream cones that were sold in each of the cities on day $t_i$, or maybe it is the temperatures in each of the cities 2 days later. The modeling assumption is that
\[
\mathbf{y}_i = \mathbf{A}\mathbf{x}_i + \mathbf{b}. \tag{10.48}
\]

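A critical observation to make here is that it is assumed that $\mathbf{A}$ and $\mathbf{b}$ do not depend on $t_i$. The rationale behind (10.48) comes from Taylor's theorem, and this will be discussed later. The first task is to explain how to determine $\mathbf{A}$ and $\mathbf{b}$ from the data. It is assumed that measurements have been made at $t_1, t_2, \cdots, t_n$, giving us $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$. These are normalized in exactly the same way as was done for the PCA. Expressing this in matrix form, the normalized $\mathbf{x}$'s are
\[
\mathbf{x}_i^* = \mathbf{S}_x^{-1}(\mathbf{x}_i - \mathbf{x}_M),
\]
where $\mathbf{x}_M = \frac{1}{n}\sum \mathbf{x}_j$ is the mean, and $\mathbf{S}_x$ is an $m \times m$ diagonal matrix containing the scaling factors for the components of $\mathbf{x}$. Similarly, the normalized $\mathbf{y}$'s are $\mathbf{y}_i^* = \mathbf{S}_y^{-1}(\mathbf{y}_i - \mathbf{y}_M)$. Note that by summing (10.48) it follows that $\mathbf{b} = \mathbf{y}_M - \mathbf{A}\mathbf{x}_M$. Using the scaled variables, the assumption (10.48) can be written as
\[
\mathbf{y}_i^* = \mathbf{P}\mathbf{x}_i^*, \tag{10.49}
\]
where $\mathbf{P} = \mathbf{S}_y^{-1}\mathbf{A}\mathbf{S}_x$ is called the propagator matrix. The $m^2$ entries in $\mathbf{P}$ can be found using least squares. So, suppose $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$, along with $\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_n$, have been measured. After normalizing the data, the resulting error function is
\[
E(\mathbf{P}) = \sum_{j=1}^{n} \|\mathbf{P}\mathbf{x}_j^* - \mathbf{y}_j^*\|_2^2
= \sum_{j=1}^{n} \left( \mathbf{P}\mathbf{x}_j^* \bullet \mathbf{P}\mathbf{x}_j^* - 2\,\mathbf{y}_j^* \bullet \mathbf{P}\mathbf{x}_j^* + \mathbf{y}_j^* \bullet \mathbf{y}_j^* \right). \tag{10.50}
\]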
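The error function (10.50) and its minimizer can be explored numerically before working through the algebra. In the sketch below the data are generated from a hypothetical $\mathbf{A}$ and $\mathbf{b}$, the scaling factors are taken to be the standard deviations of the components (an assumption, since the scaling is not specified here), and the minimizer is found with a library least squares routine rather than a formula derived in the text.

```python
import numpy as np

def error(P, X_star, Y_star):
    """E(P) in (10.50); rows of X_star, Y_star hold x*_j and y*_j."""
    residual = X_star @ P.T - Y_star
    return np.sum(residual ** 2)

rng = np.random.default_rng(6)
n, m = 40, 3
A_true = rng.standard_normal((m, m))          # hypothetical model
b_true = rng.standard_normal(m)
X = rng.standard_normal((n, m)) * np.array([1.0, 10.0, 0.1])
Y = X @ A_true.T + b_true                     # rows satisfy (10.48)

x_M, y_M = X.mean(axis=0), Y.mean(axis=0)
S_x = np.diag(X.std(axis=0))                  # assumed scaling factors
S_y = np.diag(Y.std(axis=0))
X_star = (X - x_M) @ np.linalg.inv(S_x)       # rows are normalized x's
Y_star = (Y - y_M) @ np.linalg.inv(S_y)

# Least squares solution of X* P^T = Y*, which minimizes E(P)
P = np.linalg.lstsq(X_star, Y_star, rcond=None)[0].T

A = S_y @ P @ np.linalg.inv(S_x)              # invert P = S_y^{-1} A S_x
b = y_M - A @ x_M

print(error(P, X_star, Y_star), error(P + 0.01, X_star, Y_star))
print(np.allclose(A, A_true), np.allclose(b, b_true))
```

Because the synthetic data satisfy (10.48) exactly, the minimum value of $E$ is at the level of rounding error. With measured data it would not be.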

The next step is to differentiate E(P) with respect to the entries in P and then set them each to zero. There are m2 such equations, and it is a straightforward calculation to find the solution. In preparation for writing the answer down,

510

10 Data Analysis

we introduce the n × m normalized data matrices X∗ and Y∗ . The ith row of X∗ is (x∗i )T , and the ith row of Y∗ is (yi∗ )T . With this, minimizing E reduces to solving (10.51) P(X∗ )T X∗ = (Y∗ )T X∗ . Assuming that the columns of X∗ are independent, then solving the above  −1 equation yields P = (Y∗ )T X∗ (X∗ )T X∗ . It is possible to express this solution using the Moore-Penrose pseudo-inverse (see Section 8.3.4), giving the formula T  (10.52) P = (X∗ )+ Y∗ . As explained in Section 8.3, the formula for the pseudo-inverse given in (8.25) often runs into trouble with ill-conditioning. For computational reasons, it is better to compute P using the SVD. To explain how this is done, suppose that X∗ = UΣVT where U and VT are orthogonal matrices and Σ is a diagonal matrix as in (4.38). With this, (10.51) becomes PVS2 VT = (Y∗ )T UΣVT . Assuming that the columns of X∗ are independent then it has only positive singular values. In this case, solving for P you get that (10.53) P = (Y∗ )T UΣ−1 VT , where Σ−1 is the n × m diagonal matrix with diagonals 1/σ1 , 1/σ2 , · · · , 1/σm . Once P is known, returning to the original (unnormalized) variables,

where

yi = Axi + b,

(10.54)

A = Sy PS−1 x ,

(10.55)

and b = yM − AxM . Example 10.2 : Markov Chain We are going to consider a variation of the car rental example considered in Section 4.5.3. There are four locations and the associated directed graph for how cars move between these locations is shown in Figure 10.18. If y0 = (a, b, c, d)T is the vector for the number of cars at each location at the start, then in the next week the distribution of cars is given as y1 = Ay0 , where the adjacency matrix has the form ⎞ ⎛ a11 a12 0 0 ⎜a 0 ⎟ ⎟ ⎜ 21 a22 0 (10.56) A=⎜ ⎟. ⎝ a31 0 a33 a34 ⎠ 0 0 a43 a44 In this matrix, a11 is the fraction of cars that stay at location A, a21 is the fraction of cars that move from A to B, a12 is the fraction of cars that move


from B to A, etc. Therefore, the $a_{ij}$'s are positive and each column sums to one. Also, in a similar manner, after $k$ weeks, the distribution vector is $\mathbf{y}_k = \mathbf{A}\mathbf{y}_{k-1}$. To determine $\mathbf{A}$ using (10.55) and (10.53), note that in this example $\mathbf{x}_1 = \mathbf{y}_0$, $\mathbf{x}_2 = \mathbf{y}_1$, $\mathbf{x}_3 = \mathbf{y}_2$, $\cdots$, $\mathbf{x}_n = \mathbf{y}_{n-1}$. Also, because of the requirement that the columns of $\mathbf{X}^*$ are independent, it is required that $n \ge 4$.

To generate the data, the following values were used: $a_{11} = \frac{1}{4} - \delta$, $a_{21} = \frac{3}{4} - \delta$, $a_{31} = 2\delta$, $a_{12} = a_{34} = \frac{1}{3}$, $a_{22} = a_{44} = \frac{2}{3}$, $a_{33} = \frac{1}{4}$, and $a_{43} = \frac{3}{4}$. Also, $\mathbf{y}_0 = (100, 0, 0, 0)^T$ and $\delta = \frac{1}{20}$. After generating the $\mathbf{y}_k$'s using the formula $\mathbf{y}_k = \mathbf{A}\mathbf{y}_{k-1}$, a small amount of noise was applied to each $\mathbf{y}_k$, typical of what might be expected in an experiment. Finally, scaling is not used in this example, and so $\mathbf{S}_x = \mathbf{S}_y = \mathbf{I}$. Consequently, $\mathbf{A} = \mathbf{P}$. The exact value of $\mathbf{A}$, to two digits, and the computed value $\mathbf{A}_c$ using (10.53) are

$$
\mathbf{A} = \begin{pmatrix} 0.20 & 0.33 & 0.00 & 0.00 \\ 0.70 & 0.67 & 0.00 & 0.00 \\ 0.10 & 0.00 & 0.25 & 0.33 \\ 0.00 & 0.00 & 0.75 & 0.67 \end{pmatrix}, \qquad
\mathbf{A}_c = \begin{pmatrix} 0.20 & 0.30 & 0.00 & -0.00 \\ 0.78 & 0.67 & 0.00 & 0.00 \\ 0.11 & -0.00 & 0.25 & 0.32 \\ -0.00 & -0.00 & 0.78 & 0.67 \end{pmatrix}.
$$

In this calculation $n = 5$ and $\mathbf{A}_c$ is given using fixed-point notation with two digits (so, a number such as $-0.001$ will be shown as $-0.00$). To check on whether using more data points improves the accuracy, the relative error $||\mathbf{A}_c - \mathbf{A}||_\infty / ||\mathbf{A}||_\infty$ is shown in Figure 10.19 as a function of the total number of time steps $n$ used in the calculation. The error seen in this plot is almost entirely due to the noise applied to the generated data. For example, if the exact values of the $\mathbf{y}_k$'s are used, then the error drops to the level of machine epsilon. Another semi-useful conclusion coming from Figure 10.19 is that the accuracy of $\mathbf{A}_c$ does not improve by collecting more data.
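As a rough illustration of how (10.51) can be used, the following Python sketch regenerates noise-free data for the car rental matrix of Example 10.2 and solves the normal equations for $\mathbf{P}$. Several choices here are assumptions made for the illustration and are not from the text: the helper names `transpose`, `matmul`, and `solve`, the use of exact rational arithmetic, the use of the raw vectors without centering or scaling, and solving (10.51) directly instead of going through the SVD formula (10.53).

```python
from fractions import Fraction as Fr

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(G, R):
    """Solve G Z = R by Gauss-Jordan elimination (exact when entries are Fractions)."""
    n = len(G)
    M = [list(g) + list(r) for g, r in zip(G, R)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)  # first nonzero pivot
        M[c], M[piv] = M[piv], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

# Adjacency matrix (10.56) with the values listed in Example 10.2.
d = Fr(1, 20)
Z0 = Fr(0)
A = [[Fr(1, 4) - d, Fr(1, 3), Z0,       Z0],
     [Fr(3, 4) - d, Fr(2, 3), Z0,       Z0],
     [2 * d,        Z0,       Fr(1, 4), Fr(1, 3)],
     [Z0,           Z0,       Fr(3, 4), Fr(2, 3)]]

# Generate y_0, ..., y_n with y_k = A y_{k-1}; no noise is added in this sketch.
n = 5
ys = [[Fr(100), Z0, Z0, Z0]]
for _ in range(n):
    ys.append([sum(a * y for a, y in zip(row, ys[-1])) for row in A])

# Rows of X are x_i^T = y_{i-1}^T and rows of Y are y_i^T.
# Assumption: raw vectors are used. For exact data every y_k sums to 100,
# so centered columns would be linearly dependent.
X, Y = ys[:n], ys[1:]

G = matmul(transpose(X), X)   # X^T X (symmetric)
R = matmul(transpose(Y), X)   # Y^T X
# P G = R is equivalent to G P^T = R^T because G is symmetric.
P = transpose(solve(G, transpose(R)))

for row in P:
    print("  ".join(f"{float(v):6.3f}" for v in row))
```

With exact data the recovered matrix coincides with $\mathbf{A}$, which is in line with the remark that the error in $\mathbf{A}_c$ comes from the noise. For noisy floating-point data, the SVD route (10.53) is the one the text recommends.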

Figure 10.18 Directed graph for the car rental example.

10.4.1 Dynamic Modes

It is possible to derive a modal representation of the solution. To explain what this means consider the temperature example used earlier. So, let $\mathbf{x}(t)$ be the vector containing the temperatures at $m$ locations. Also, assume that the temperatures are measured at $t_1, t_2, \cdots, t_{n+1}$, giving us $\mathbf{x}_1 = \mathbf{x}(t_1)$, $\mathbf{x}_2 = \mathbf{x}(t_2)$, $\cdots$, $\mathbf{x}_{n+1} = \mathbf{x}(t_{n+1})$. It is assumed that the $t_i$'s are equally spaced, so $t_i = (i-1)\Delta t$. We will be using $\mathbf{x}_i$ to predict the temperature at the next time point $t_{i+1}$, so in this formulation (10.48) applies with $\mathbf{y}_i = \mathbf{x}_{i+1}$. As before, it is being assumed that $\mathbf{A}$ does not depend on $t_i$. However, it very likely does depend on $\Delta t$ and that is why the time points are taken to be equally spaced. Scaling is not used here, so $\mathbf{A} = \mathbf{P}$.

Assuming that $\mathbf{P}$ is not defective, then it has independent eigenvectors $\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_m$ with respective eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_m$. It is not assumed that the matrix is symmetric, so the eigenvalues can be complex-valued and the eigenvectors are not necessarily orthogonal. Given the starting vector $\mathbf{x}_1$, its normalized version is $\mathbf{x}_1^*$. Using the eigenvectors as a basis, then $\mathbf{x}_1^* = \alpha_1\mathbf{p}_1 + \alpha_2\mathbf{p}_2 + \cdots + \alpha_m\mathbf{p}_m$. This means that $\mathbf{x}_2^* = \mathbf{P}\mathbf{x}_1^* = \sum_{j=1}^{m}\alpha_j\lambda_j\mathbf{p}_j$. More generally, $\mathbf{x}_i^* = \mathbf{P}\mathbf{x}_{i-1}^* = \sum_{j=1}^{m}\alpha_j\lambda_j^{i-1}\mathbf{p}_j$. Taking $e^{\kappa_j} = |\lambda_j|$, then $e^{\omega_j t_i} = |\lambda_j|^{i-1}$, where $\omega_j = \kappa_j/\Delta t$. With this,

$$
\mathbf{x}_i^* = \sum_{j=1}^{m}\bar{\alpha}_j\,\mathbf{p}_j\,e^{\omega_j t_i}, \tag{10.57}
$$

where $\bar{\alpha}_j = \alpha_j$ if $\alpha_j \ge 0$, otherwise $\bar{\alpha}_j = -\alpha_j$. This shows that the solution is the superposition of the dynamic modes $\mathbf{p}_j e^{\omega_j t}$ coming from the propagator matrix. This is analogous to how the vibration of a guitar string can be written as a superposition of the natural frequencies of the string. The significance of this will be illustrated in the next two examples.

Example 10.3 : Reduced Mode Approximation

For the car rental problem in Example 10.2, the eigenvalues of $\mathbf{P}$ are $\lambda_1 = 1$, $\lambda_2 = 0.9698$, $\lambda_3 = -0.1031$, and $\lambda_4 = -0.0833$. The fact that $\lambda_3$ and $\lambda_4$ are an order of magnitude smaller than the other two eigenvalues brings up the possibility of using the reduced mode approximation

$$
\mathbf{x}_i^* \approx \sum_{j=1}^{M}\bar{\alpha}_j\,\mathbf{p}_j\,e^{\omega_j t_i}, \tag{10.58}
$$

where $1 \le M < m$. To investigate the accuracy of this approximation, we will take $\Delta t = 1$, so $t_1 = 0$, $t_2 = 1$, $t_3 = 2$, etc. The resulting relative error in a reduced mode approximation is shown in Figure 10.20. Not unexpectedly, given the fact that $\lambda_1$ and $\lambda_2$ are about the same, the $M = 1$ approximation is inaccurate. Similarly, since $\lambda_3$ and $\lambda_4$ are relatively small compared to $\lambda_1$ and $\lambda_2$, the $M = 2$ mode approximation provides a fairly accurate approximation of the solution.

Figure 10.19 The relative error for the adjacency matrix computed for the rental car example. (Plot: relative error versus $n$, for $n$ from 5 to 40.)

The above example demonstrates the viability of using a reduced mode approximation. This is useful as the value of $m$ in applications can be enormous, and avoiding having to compute all of the eigenvalues of the propagator matrix can significantly speed up the calculation.

Another observation to make about (10.57) is how much it resembles the solution of a linear differential equation. If you recall, the general solution of $\mathbf{x}'(t) = \mathbf{B}\mathbf{x}$ can be written as

$$
\mathbf{x}(t) = \sum_{j=1}^{m}\beta_j\,\mathbf{b}_j\,e^{r_j t}, \tag{10.59}
$$

where $r_1, r_2, \cdots, r_m$ are the eigenvalues of the $m \times m$ matrix $\mathbf{B}$, with respective eigenvectors $\mathbf{b}_1, \mathbf{b}_2, \cdots, \mathbf{b}_m$. The question arises that if you run an experiment and obtain the $\mathbf{x}_i$ values, can you use this to determine the governing differential equation? More specifically, assuming the differential equation is $\mathbf{x}'(t) = \mathbf{B}\mathbf{x}$, can you use $\mathbf{P}$ to determine $\mathbf{B}$?

One way to answer this is to recall that in Chapter 7, differential equations are solved by first approximating them with a finite difference equation. As explained in Section 7.6.2, to solve $\mathbf{x}' = \mathbf{B}\mathbf{x}$ using Euler's method you get $\mathbf{x}_{i+1} = (\mathbf{I} + k\mathbf{B})\mathbf{x}_i$, where $k = \Delta t$. Similarly, using backward Euler you get $\mathbf{x}_{i+1} = (\mathbf{I} - k\mathbf{B})^{-1}\mathbf{x}_i$. Both of these approximations, as well as the other methods listed in Table 7.6, result in a linear difference equation of the form given in (10.48), where $\mathbf{y}_i = \mathbf{x}_{i+1}$, $\mathbf{x}_i = \mathbf{x}_i$, and $\mathbf{b} = \mathbf{0}$.

Figure 10.20 The relative error when using the approximation (10.58). (Plot: relative error, on a log scale, versus $t$ from 1 to 10, for $M = 1$, $M = 2$, and $M = 3$.)

Example 10.4 : Backward Euler Approximation

For the backward Euler method, $\mathbf{P} = (\mathbf{I} - k\mathbf{B})^{-1}$. Solving this for $\mathbf{B}$ gives

$$
\mathbf{B} = \frac{1}{k}\left(\mathbf{I} - \mathbf{P}^{-1}\right). \tag{10.60}
$$

Consequently, since $\mathbf{P}^{-1}\mathbf{p}_j = \mathbf{p}_j/\lambda_j$, then $\mathbf{p}_j$ is an eigenvector for $\mathbf{B}$ with eigenvalue $(1 - 1/\lambda_j)/k$. To check on the accuracy of this approximation, we will take $\mathbf{B}$ to be a random $4 \times 4$ matrix that has eigenvalues $1$, $1/2$, $-1/2$, $-1$. Using the initial condition $\mathbf{x}(0) = (100, 0, 0, 0)^T$, the $\mathbf{x}_i$'s are then calculated using the exact solution (10.59). After computing $\mathbf{P}$ using (10.53), the resulting approximation $\mathbf{B}_c$ is computed using (10.60). What is shown in Figure 10.21 is the relative error $||\mathbf{B} - \mathbf{B}_c||_\infty / ||\mathbf{B}||_\infty$ as a function of the time-step $k$ used to generate the data. The relative error in the eigenvalues, which is not shown, shows a similar improvement as $k$ is decreased. The conclusion is, at least for this example, that the differential equation can be accurately determined using a data model approach.

Figure 10.21 The relative error when using the approximation (10.60) as a function of the time-step $k = \Delta t$ used to generate the data. (Plot: relative error versus $k$, both on log scales.)

The examples that were considered show the effectiveness of the data modeling assumption (10.48). However, this is because the original problems, both the Markov chain and the differential equation, are linear. When the original problem is nonlinear the situation becomes more challenging. For certain nonlinear differential equations it is possible to adapt the linear model using Koopman theory, and this is illustrated in Exercise 10.20. How linear data analysis might be applied to a nonlinear problem is illustrated in the next example.
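The relation (10.60), and the eigenvalue mapping $(1 - 1/\lambda_j)/k$ that goes with it, can be explored with a small Python sketch. It is only an illustration under stated assumptions: the $2 \times 2$ matrix $\mathbf{B} = \mathbf{S}\mathbf{D}\mathbf{S}^{-1}$, the matrix `S`, the eigenvalues in `r`, and all function names are hypothetical choices, and the propagator is taken to be $\mathbf{P} = e^{k\mathbf{B}}$, which is what noise-free data generated from the exact solution would satisfy, instead of being computed from data with (10.53).

```python
import math

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def inf_norm(M):
    """Matrix infinity norm (maximum absolute row sum)."""
    return max(sum(abs(v) for v in row) for row in M)

# Hypothetical test problem: B = S D S^{-1} with eigenvalues r_1 = 1, r_2 = -1/2.
S = [[1.0, 1.0], [1.0, 2.0]]
r = [1.0, -0.5]

def from_eigs(vals):
    """Build S diag(vals) S^{-1}."""
    D = [[vals[0], 0.0], [0.0, vals[1]]]
    return matmul(matmul(S, D), inv2(S))

B = from_eigs(r)

def recovered_B(k):
    # Assumption: exact data satisfy x_{i+1} = e^{kB} x_i, so P = S diag(e^{k r_j}) S^{-1}.
    P = from_eigs([math.exp(k * rj) for rj in r])
    Pinv = inv2(P)
    # Formula (10.60): B_c = (I - P^{-1}) / k
    return [[((1.0 if i == j else 0.0) - Pinv[i][j]) / k for j in range(2)]
            for i in range(2)]

def rel_error(k):
    """Relative error ||B - B_c||_inf / ||B||_inf."""
    Bc = recovered_B(k)
    diff = [[B[i][j] - Bc[i][j] for j in range(2)] for i in range(2)]
    return inf_norm(diff) / inf_norm(B)

def mapped_eig(k, rj):
    """Eigenvalue of B_c predicted by (1 - 1/lambda_j)/k with lambda_j = e^{k r_j}."""
    return (1.0 - 1.0 / math.exp(k * rj)) / k

for k in (1.0, 0.1, 0.01, 0.001):
    eigs = ", ".join(f"{mapped_eig(k, rj):+.4f}" for rj in r)
    print(f"k = {k:<6}  relative error = {rel_error(k):.2e}  eigenvalues of B_c: {eigs}")
```

In this sketch the error shrinks as $k$ is reduced, which is the qualitative behavior described for Figure 10.21. The decay rate seen here is specific to the backward Euler relation and to this small test matrix.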


|        | Albany | Allegany | Bronx | ··· | Wyoming | Yates |
|--------|--------|----------|-------|-----|---------|-------|
| 4/2/20 | 13     | 3        | 736   | ··· | 4       | 1     |
| 4/3/20 | 14     | 2        | 1743  | ··· | 1       | 0     |
| 4/4/20 | 26     | 2        | 1229  | ··· | 3       | 0     |
| ⋮      | ⋮      | ⋮        | ⋮     |     | ⋮       | ⋮     |
| 4/1/21 | 83     | 12       | 815   | ··· | 16      | 2     |

Table 10.5 Number of new COVID-19 cases in the counties in New York from April 2, 2020 to April 1, 2021 [USAFacts, 2021].

10.4.2 Example: COVID-19

In the effort to limit the spread of coronavirus disease 2019 (COVID-19), the number of daily cases was tracked on a county by county basis across the United States. The question considered here is how well the assumption (10.54) describes what was observed. The data used are for the 62 counties of New York, and a portion of the data set is shown in Table 10.5. Using the notation introduced earlier, $m = 62$, $n = 358$, and $\mathbf{x}_i$ consists of the 62 COVID values listed in the $i$th row of the data matrix. What is of interest is whether (10.48) can predict the number of cases $d$ days later. So, we are asking how close the predicted value $\hat{\mathbf{y}}_i$, as determined using (10.48), is to the observed value $\mathbf{y}_i = \mathbf{x}_{i+d}$. In what follows, the propagator matrix will be denoted $\mathbf{P}_d$, indicating the fact that it depends on the particular choice for $d$.

The first question to consider is whether the linearity assumption (10.54) is an accurate approximation to what is observed. Taking $d = 7$, and calculating $\mathbf{P}_7$ using (10.53), the resulting comparison between observed and predicted values for three counties is shown in Figure 10.22. Similar results are obtained for the other counties in the state. Moreover, the accuracy seen in Figure 10.22 is also obtained when $d = 1, 2, \cdots, 6$.

It is of interest to compare the model with other possibilities. For example, there is an old adage that the most accurate forecast for the weather tomorrow is what the weather is today. This is referred to as a persistence forecast, and mathematically it is equivalent to assuming that $\mathbf{P} = \mathbf{I}$ in (10.49). A second model that is used is based on the observation that given $\mathbf{x}^*$, once $\mathbf{P}_1$ is determined, then the next day forecast is $\mathbf{P}_1\mathbf{x}^*$. Taking this a step further, it is not unreasonable to think that $\mathbf{P}_1(\mathbf{P}_1\mathbf{x}^*) = \mathbf{P}_1^2\mathbf{x}^*$ provides a prediction for the second day, and a weekly forecast can be obtained using $\mathbf{P}_1^7\mathbf{x}^*$. This is referred to as an iterative forecast.
The error obtained using these three models, for Erie County, is given in Table 10.6 as a function of $d$. In addition to concluding that the linear model provides the more accurate forecast, it should also be noted that, not unexpectedly, the error in the iterative forecast increases with $d$.

Figure 10.22 The number of new COVID cases, solid (red) curve, and the values predicted using (10.54), dashed (blue) curve, for three different counties when using $\mathbf{P}_7$. (Panels: Albany, Erie, Ulster; horizontal axis: date, from April 1 to Jan 1.)
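To make the three forecasts concrete, here is a minimal Python sketch for the scalar case $m = 1$, where the propagator is a single number. Everything specific in it is an assumption made for illustration: the synthetic logistic-growth series stands in for real data, and the names `fit_P`, `avg_error`, and `results` are hypothetical. The fit is the $m = 1$ version of (10.51) applied to centered, unscaled data, and the error is measured as $||\mathbf{y} - \hat{\mathbf{y}}||_2/n$.

```python
import math

def fit_P(x, y):
    """Scalar least-squares propagator for centered data (the m = 1 case of (10.51))."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    xs = [v - xm for v in x]
    ys = [v - ym for v in y]
    P = sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)
    return P, xs, ys

def avg_error(P, xs, ys):
    """Average error ||y* - P x*||_2 / n for centered data."""
    return math.sqrt(sum((b - P * a) ** 2 for a, b in zip(xs, ys))) / len(xs)

# Synthetic, nonlinear series (logistic growth); not data from the text.
series = [1.0]
for _ in range(59):
    s = series[-1]
    series.append(s + 0.2 * s * (1.0 - s / 100.0))

# One-day propagator, reused by the iterative forecast.
P1, _, _ = fit_P(series[:-1], series[1:])

results = []
for d in range(1, 8):
    x, y = series[:-d], series[d:]                 # pairs (x_i, y_i = x_{i+d})
    Pd, xs, ys = fit_P(x, y)
    linear = avg_error(Pd, xs, ys)                 # direct d-day model
    persistence = avg_error(1.0, xs, ys)           # P = 1
    iterative = avg_error(P1 ** d, xs, ys)         # P = P_1^d
    results.append((d, linear, persistence, iterative))

print(" d   linear     persistence  iterative")
for d, lin, pers, it in results:
    print(f"{d:2d}   {lin:.3e}  {pers:.3e}    {it:.3e}")
```

Because the directly fitted $P_d$ minimizes the squared error over all scalar propagators, its error can never exceed that of the persistence or iterative choices on the same centered data. How large the gap is depends entirely on the data, so the numbers from this sketch say nothing about the values in Table 10.6.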

10.4.3 Propagation Paths

With the establishment of (10.48) as a viable approximation of the time evolution of the number of COVID cases, we can investigate what might be learned from it. In particular, we have propagator matrices $\mathbf{P}_1, \mathbf{P}_2, \cdots, \mathbf{P}_7$. So, it is of interest to explore how $\mathbf{P}$ changes with $d$, and whether it is possible to derive useful, or insightful, information from this. The question considered is how the cases in a particular county depend on the number of cases in the other counties, and how this dependence changes with time. Taking Albany County, which corresponds to the first entry in the data vectors, (10.49) can be written as

$$
\widehat{(\text{Albany})}_{j+d} = p_{11}(\text{Albany})_j + p_{12}(\text{Allegany})_j + \cdots + p_{1m}(\text{Yates})_j,
$$

where $\widehat{(\text{Albany})}_{j+d}$ is the predicted value for $(\text{Albany})_{j+d}$. So, a larger positive value of $p_{1k}$ means that the $k$th county has a more significant contribution to the increase in the Albany cases over the time interval of $d$ days. A similar observation applies to when $p_{1k}$ is negative and the decrease in the Albany cases. This is visualized in Figure 10.23 by colorizing the counties based on the value of $p_{1k}$. The redder the county the more positive its respective $p_{1k}$ value, and the bluer the county the more negative its respective $p_{1k}$ value. It is seen that for $d = 1$, increases are most strongly connected to the number of cases in Albany and Westchester counties (the latter is in the southern part of the state). However, over a week the situation changes, and Albany County is now blue, and Suffolk County (which is in the southern part of the state) now has the stronger connection to increases in Albany.

The data-based model, and its consequences shown in Figure 10.23, raises more questions than it answers. For example, why is the number of new cases in Albany County more impactful on a day-to-day basis, but just the opposite on a week-to-week basis? Similarly, why are counties that are 120 to 220 miles away connected to what is happening in Albany County? Answering these questions requires a consideration of the characteristics of the disease (e.g., how long an individual is infectious), and similarities or differences in how individuals in a county interact compared to other counties. There is also the ever present concern that one or more of these observations are due simply to a data artifact.

| d | Linear  | Persistence | Iterative |
|---|---------|-------------|-----------|
| 1 | 3.3e−02 | 1.8         | 1.0       |
| 2 | 3.5e−02 | 1.8         | 1.2       |
| 3 | 3.7e−02 | 1.8         | 1.3       |
| 4 | 3.7e−02 | 1.8         | 1.4       |
| 5 | 3.5e−02 | 1.9         | 1.6       |
| 6 | 3.4e−02 | 1.8         | 1.8       |
| 7 | 3.4e−02 | 1.8         | 2.0       |

Table 10.6 The average error $||\mathbf{y} - \hat{\mathbf{y}}||_2/n$ for Erie County, where $\mathbf{y}$ and $\hat{\mathbf{y}}$ are the normalized actual and predicted values, respectively. Linear refers to using (10.49), and the error values for the persistence and iterative forecasts are relative to the respective error value for the linear model.


Figure 10.23 The redder the county the more it contributes to increases in the COVID cases in Albany County for d = 1 and for d = 7. The bluer the county the more it contributes to decreases in the Albany County cases. Albany County is outlined in bold in both maps.

10.4.4 Parting Comments

The central assumption that has been made is linearity, as expressed in (10.49) and (10.54). For the COVID-19 example, whatever number of days separate $\mathbf{x}$ and $\mathbf{y}$, a lot can happen and whatever this is, it likely changes over the time interval the data are collected. For example, there are changing weather conditions, changing policies related to public gatherings, movement between counties, and a host of other time dependent and potentially nonlinear effects. Nevertheless, we found that the assumptions work reasonably well. One could argue that the reason is that (10.54) is basically a two-term Taylor series, about the mean, and it should therefore serve as a reasonable approximation so long as the nonlinear effects do not have the ability to strongly affect the values over the given time interval. This argument is often given to justify, or explain, why (10.54) is used. For this reason, in the meteorological literature (10.54) is referred to as a tangent linear model.

Given the way the method was explained, you might assume that it requires that the $\mathbf{x}_j$'s are ordered in some way (i.e., that $\mathbf{x}_j$ comes before $\mathbf{x}_{j+1}$), and that the $\mathbf{y}_j$'s are values for the $\mathbf{x}_j$'s in the future. Neither is required, and all that is needed is that (10.49) holds.

There has been recent research into whether it's possible to find fundamental components of $\mathbf{P}$ that can be used to explain how the system evolves. These come under the classification of modal decomposition methods, and if you are interested in this you might consult Schmid [2010], Tu et al. [2014], and Zhang et al. [2019].
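The two-term Taylor series argument mentioned in this section can be sketched as follows. This is an informal illustration and not a derivation from the text: it assumes the data are produced by some smooth map $\mathbf{y} = \mathbf{F}(\mathbf{x})$, and the symbols $\mathbf{F}$ and $\mathbf{J}$ (the Jacobian of $\mathbf{F}$) are introduced here only for this purpose.

```latex
% Expanding the hypothetical map F about the mean x_M and keeping two terms:
\mathbf{y} = \mathbf{F}(\mathbf{x})
  \approx \mathbf{F}(\mathbf{x}_M) + \mathbf{J}(\mathbf{x}_M)\,(\mathbf{x} - \mathbf{x}_M).
% Since b = y_M - A x_M, the linear model (10.54) can be rewritten as
\mathbf{y} = \mathbf{y}_M + \mathbf{A}\,(\mathbf{x} - \mathbf{x}_M).
% Matching the two expressions term by term suggests
\mathbf{A} \approx \mathbf{J}(\mathbf{x}_M), \qquad \mathbf{y}_M \approx \mathbf{F}(\mathbf{x}_M).
```

Under this reading, $\mathbf{A}$ plays the role of the Jacobian evaluated at the mean, which is one way to understand the name tangent linear model. The identification of $\mathbf{y}_M$ with $\mathbf{F}(\mathbf{x}_M)$ holds only to the extent that the neglected quadratic terms are small over the range of the data.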

| x  | y  | z  |
|----|----|----|
| 2  | 1  | 2  |
| 1  | −1 | 1  |
| −1 | 1  | −1 |
| −2 | −1 | −2 |

Table 10.7 Data for Exercise 10.2

Exercises

Section 10.2

10.1 This problem uses the word length data in Table 10.7.
(a) What is the predicted value for the number of words in the dictionary for the word “linear”?
(b) What is the predicted value for the number of words in the dictionary for the word “bug”?
(c) Given that in 1908 the word “linear” appeared in 0.0009% of the printed literature, what is the predicted value for 2008?
(d) Given that in 1908 the word “bug” appeared in 0.0002% of the printed literature, what is the predicted value for 2008?

10.2 This problem applies a PCA to the data given in Table 10.7.
(a) Find the normalized data matrix $\mathbf{P}$ using the $\infty$-norm scaling.
(b) Find the SVD for the normalized data matrix from part (a).
(c) Suppose $p_1 = 2.5$. What do you predict the values are for $p_2$ and $p_3$?
(d) Suppose $p_1 = 2.5$ and $p_2 = 0.5$. What do you predict the value is for $p_3$?
(e) Explain why the linear fit $\mathbf{p} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2$ determined using a PCA is a plane that interpolates all of the normalized data points.

10.3 Suppose the data matrix is

$$
\mathbf{X} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.
$$

(a) What is the normalized data matrix $\mathbf{P}$ used for a PCA? Use the $\infty$-norm for the scaling.
(b) From the SVD for $\mathbf{P}$, it is found that

$$
\mathbf{V} = \begin{pmatrix} 0.51 & -0.22 & 0.83 \\ 0.31 & 0.95 & 0.06 \\ -0.80 & 0.23 & 0.55 \end{pmatrix}.
$$

Taking $k = 1$, if $p_1 = 2$, then what is $p_2$ and $p_3$?
(c) Taking $k = 2$, if $p_1 = 2$ and $p_2 = 1$, then what is $p_3$?

10.4 Assume that $\mathbf{P}$ is a normalized $10 \times 3$ data matrix.
(a) How many variables are there?
(b) Suppose that the singular values are $\sigma_1 = 6$, $\sigma_2 = 0$, and $\sigma_3 = 0$. Explain why the normalized data values must lie on a straight line that goes through the origin. Does this happen if the singular values are $\sigma_1 = 600$, $\sigma_2 = 0$, and $\sigma_3 = 0$?
(c) Suppose that the singular values are $\sigma_1 = 6$, $\sigma_2 = 2$, and $\sigma_3 = 0$. Explain why the normalized data values must lie on a plane that passes through the origin. Does this happen if the singular values are $\sigma_1 = 6$, $\sigma_2 = 6$, and $\sigma_3 = 0$?

10.5
(a) What happens if a PCA is applied to the data matrix $5\mathbf{I}$?
(b) Is it possible for a normalized data matrix to have the form $a\mathbf{I}$, for some constant $a$?

10.6 Assume that $\mathbf{A}$ is an $n \times m$ matrix, with $m \le n$, and let $(\mathbf{a}_i)^T$ be the vector corresponding to the $i$th row from $\mathbf{A}$. Also, assume $\mathbf{A}$ has singular values $\sigma_1, \sigma_2, \cdots, \sigma_m$.
(a) Show that $\sum_{i=1}^{n}\mathbf{a}_i \bullet \mathbf{a}_i = \mathrm{tr}(\mathbf{A}^T\mathbf{A})$.
(b) Use part (a) to show that $\sum_{i=1}^{n}\mathbf{a}_i \bullet \mathbf{a}_i = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2$. Hint: $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$.
(c) The Frobenius norm of $\mathbf{A}$ is defined as

$$
||\mathbf{A}||_F \equiv \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} a_{ij}^2}\,.
$$

Show that $||\mathbf{A}||_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2$.
(d) Letting $\mathbf{A}_k$ be given as in (4.54), show that

$$
\frac{||\mathbf{A} - \mathbf{A}_k||_F^2}{||\mathbf{A}||_F^2} = \frac{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_m^2}{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_m^2}.
$$

Note that the Frobenius norm is not associated with a vector norm, which makes it, according to Chapter 3, an unnatural norm. The result in part (d) is an Eckart-Young theorem but using the Frobenius norm. It shows that there is a connection between the PCA and the low-rank approximations considered in Section 4.6. This is not pursued in this text, but you can find out more about this in Vidal et al. [2016].


10.7 One of the criticisms of a PCA is that the answer depends on the scaling. This is evident in (10.12) as the value of the slope $m$ can change significantly depending on what scaling values are used. This problem explores a regression procedure where this does not happen, and therefore can be classified as being scale invariant. It is assumed that the data $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$ have been centered, which means that $\sum x_i = 0$ and $\sum y_i = 0$. It is also assumed that the model function is $y = \alpha x$.
(a) The usual (vertical) least squares error is

$$
E_v(\alpha) = \sum_{i=1}^{n}(\alpha x_i - y_i)^2.
$$

Show that the minimum occurs when $\alpha = \mathbf{x} \bullet \mathbf{y}/\mathbf{x} \bullet \mathbf{x}$.
(b) Suppose the data are scaled as $X_i = x_i/S_x$ and $Y_i = y_i/S_y$, and the model function is $Y = \bar{\alpha}X$. Explain why $\bar{\alpha} = \alpha S_x/S_y$. Show that by minimizing

$$
\sum_{i=1}^{n}(\bar{\alpha}X_i - Y_i)^2,
$$

and then converting back to the original variables, one gets the same value for $\alpha$ as obtained in part (a). Because of this, the (vertical) least squares error is scale invariant.
(c) Redo (a) and (b) but use the (horizontal) least squares error

$$
E_h(\alpha) = \sum_{i=1}^{n}(x_i - y_i/\alpha)^2.
$$

Make sure to explain why this is scale invariant. Also, explain why the resulting value for $\alpha$ is, in general, different than the value obtained in part (a).
(d) Consider the error function $E(\alpha) = E_v(\alpha)E_h(\alpha)$. Minimize this and show that

$$
\alpha = \pm\sqrt{\frac{\mathbf{y} \bullet \mathbf{y}}{\mathbf{x} \bullet \mathbf{x}}}\,.
$$

Explain how the choice for the $\pm$ is determined by the sign of $\mathbf{x} \bullet \mathbf{y}$.
(e) An important property for PCA is that irrespective of whether you write $y = \alpha x$ or $x = ay$, the regression procedure yields $a = 1/\alpha$. As shown in part (c), this does not happen when using the usual least squares error. Explain why, using the error function in part (d), the resulting value for $\alpha$ does not depend on whether you start out assuming $y = \alpha x$ or $x = ay$.
(f) Explain why the error function in part (d) is scale invariant.


10.8 An often used model function in nonlinear regression is $g(x) = v_1 x^{v_2}$. As shown in Section 8.6, by using the log, the model function can be written as $G(X) = V_1 + V_2 X$, where $V_1 = \log v_1$ and $V_2 = v_2$. Also, a data point $(x_i, y_i)$ transforms to $(X_i, Y_i)$, where $X_i = \log x_i$ and $Y_i = \log y_i$. The data considered in this exercise is given in Table 10.8, which was also considered in Exercise 8.33.
(a) Apply the same normalization to the transformed $(X_i, Y_i)$ data as was used in the word length example. This means you center the $X$- and $Y$-values, and then scale the values using the maximum entry, in absolute value. Label the normalized variables as $p$ and $q$. Also, make sure to give the values of $\bar{X}$ and $\bar{Y}$, as well as the values used to scale the centered values.
(b) Assuming $q = \alpha p$, and using the true distance, then the error function is

$$
E(\alpha) = \frac{1}{1 + \alpha^2}\sum_{i=1}^{n}(\alpha p_i - q_i)^2.
$$

The minimum of $E$ occurs when $\alpha$ is given in (10.11). Calculate $\alpha$ using your normalized data from part (a).
(c) In this problem, instead of (10.12), we have $Y = \bar{Y} + m(X - \bar{X})$. Using the common log, show that $v_2 = m$ and $v_1 = 10^{\bar{Y} - m\bar{X}}$.
(d) Using part (c), plot the (original) data and power law curve using a log-log plot.
(e) Based on your result from part (c), what was the running speed of a Tyrannosaurus rex?

| Animal        | Mass (kg) | Relative Speed (1/s) |
|---------------|-----------|----------------------|
| canyon mouse  | 1.37e−02  | 39.1                 |
| chipmunk      | 5.10e−02  | 42.9                 |
| red squirrel  | 2.20e−01  | 20.5                 |
| snowshoe hare | 1.50e+00  | 35.8                 |
| red fox       | 4.80e+00  | 28.7                 |
| human         | 7.00e+01  | 7.9                  |
| reindeer      | 1.00e+02  | 12.7                 |
| wildebeest    | 3.00e+02  | 11.0                 |
| giraffe       | 1.08e+03  | 3.8                  |
| rhinoceros    | 2.00e+03  | 1.8                  |
| elephant      | 6.00e+03  | 1.4                  |

Table 10.8 Data for Exercise 10.8 adapted from [Iriarte-Díaz, 2002].

10.9 In using the PCA, suppose that the data set contains two variables, and the normalized versions are labeled as $p_1$ and $p_2$. Also, after calculating the SVD, one obtains the matrix

$$
\mathbf{V} = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix}.
$$

In this problem the PCA assumption is that $\mathbf{p} = \alpha_1\mathbf{v}_1$.
(a) Assuming $v_{11} \ne 0$, show that the assumption $\mathbf{p} = \alpha_1\mathbf{v}_1$ can be written as $p_2 = \alpha p_1$, where $\alpha = v_{21}/v_{11}$.
(b) Letting $q_1$ and $q_2$ be the original, unnormalized, data variables, show that the fitted line can be written as $q_2 = M_2 + a(q_1 - M_1)$, where $M_1$ and $M_2$ are the means of $q_1$ and $q_2$, respectively. Also, $a = \alpha S_2/S_1$, where $S_1$ and $S_2$ are the scale factors used for the respective variables.

10.10 Suppose that the data set contains three variables, and the normalized versions are labeled as $p_1$, $p_2$, and $p_3$. Also, assume that the PCA assumption is $\mathbf{p} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2$, where $\mathbf{v}_1 = (v_{11}, v_{21}, v_{31})^T$ and $\mathbf{v}_2 = (v_{12}, v_{22}, v_{32})^T$ are the first two columns in the matrix $\mathbf{V}$.
(a) Show that the PCA assumption can be written as $p_3 = \alpha p_1 + \beta p_2$, where $\alpha = (v_{31}v_{22} - v_{32}v_{21})/D$, $\beta = (v_{32}v_{11} - v_{31}v_{12})/D$, and $D = v_{11}v_{22} - v_{12}v_{21}$. This assumes, of course, that $D \ne 0$.
(b) Using the result from part (a), show that the PCA assumption, when expressed in terms of the original, unnormalized, data variables, can be written as $q_3 = M_3 + a(q_1 - M_1) + b(q_2 - M_2)$, where $M_1$, $M_2$, and $M_3$ are the means of $q_1$, $q_2$, and $q_3$, respectively. Also, $a = \alpha S_3/S_1$ and $b = \beta S_3/S_2$, where $S_1$, $S_2$, and $S_3$ are the scale factors used for the respective variables.


Section 10.3

10.11 As originally stated, if $\mathbf{A}$ and $\mathbf{s}^*$ is a solution, then $\mathbf{A}\mathbf{C}^{-1}$ and $\mathbf{C}\mathbf{s}^*$ is a solution for any $m \times m$ invertible matrix $\mathbf{C}$. Show that by assuming (10.34) holds, the statement is revised to: $\mathbf{A}\mathbf{C}^{-1}$ and $\mathbf{C}\mathbf{s}^*$ is a solution for any $m \times m$ orthogonal matrix $\mathbf{C}$.

10.12 This problem considers the process of whitening a data set with two sources. Let $\mathbf{s}_1, \mathbf{s}_2, \cdots, \mathbf{s}_n$ be the source data vectors, with $\mathbf{s}_i = (s_{i1}, s_{i2})^T$. Also, let $\mathbf{S}$ be the $n \times 2$ source data matrix, so its $i$th row is $\mathbf{s}_i^T$. It is assumed that $\mathrm{rank}(\mathbf{S}) = 2$.
(a) From the SVD, $\mathbf{S} = \mathbf{U}_s\boldsymbol{\Sigma}_s\mathbf{V}_s^T$. Letting $\mathbf{s}_i^* = \sqrt{n}\,\boldsymbol{\Sigma}_s^{-1}\mathbf{V}_s^T\mathbf{s}_i$, show that the $\mathbf{s}_i^*$'s satisfy (10.34).
(b) Suppose it's possible to find an invertible $2 \times 2$ matrix $\mathbf{Z}$ with the property that

$$
\mathbf{Z}\mathbf{Z}^T = \frac{1}{n}\sum_{i=1}^{n}\mathbf{s}_i(\mathbf{s}_i)^T.
$$

Setting $\mathbf{s}_i^* = \mathbf{Z}^{-1}\mathbf{s}_i$, show that the $\mathbf{s}_i^*$'s satisfy (10.34). This shows that any matrix $\mathbf{Z}$ with the above property can be used to whiten the data set.
(c) Because the matrix is symmetric and positive definite the SVD can be written as $\frac{1}{n}\sum_{i=1}^{n}\mathbf{s}_i(\mathbf{s}_i)^T = \mathbf{V}\boldsymbol{\Sigma}\mathbf{V}^T$. Show that one possible choice is $\mathbf{Z} = \mathbf{V}\boldsymbol{\Sigma}^{1/2}\mathbf{V}^T$, where $\boldsymbol{\Sigma}^{1/2}$ is the $2 \times 2$ diagonal matrix with diagonals $\sqrt{\sigma_1}$ and $\sqrt{\sigma_2}$. Also, explain why this can be interpreted as taking $\mathbf{Z} = \sqrt{\frac{1}{n}\sum \mathbf{s}_i(\mathbf{s}_i)^T}$.
(d) Explain how it is possible to use a Cholesky factorization to whiten the data.
(e) Whiten one of the data sets in Table 10.9 using the method from part (a). Note that you should first center the data, so (10.31) is satisfied.
(f) Whiten one of the data sets in Table 10.9 using the method from part (c). Note that you should first center the data, so (10.31) is satisfied.

10.13 Shown in Table 10.10 are the pixel values for two images (they contain 4 pixels). Suppose that the images are mixed as follows: $\frac{1}{4}S_1 + \frac{3}{4}S_2$ and $\frac{3}{4}S_1 + \frac{1}{4}S_2$.
(a) What are $\mathbf{s}_i$, and $\mathbf{x}_i$, for $i = 1, 2, 3, 4$?
(b) What is the mixing matrix?
(c) What are $\mathbf{s}_i^*$, and $\mathbf{x}_i^*$, for $i = 1, 2, 3, 4$?
(d) Find $\mathbf{y}_i^*$, for $i = 1, 2, 3, 4$.

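Table 10.9 Data for Exercise 10.12. Two source data matrices $\mathbf{S}$, labeled i) and ii); surviving entries in source order: 3, 1, −1, and 0, 2, −2, 1.

Table 10.10 Data for Exercise 10.13. Pixel values for the two images $S_1$ and $S_2$; surviving entries in source order: 1, 0, 0, 1, 1 (following the label $S_1$) and 0, 1, 0, 1 (following the label $S_2$).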

10.14 This problem works through an example similar to the one used in the text. Suppose there are two sources: $s_1(t) = \sin(\pi t)$ and $s_2(t) = \sin(2\pi t - 1) + t(1 - t/10)$. They are mixed as follows: $x_1(t) = s_1 + 2s_2$, and $x_2(t) = 2s_1 - 2s_2$. Also, assume that $n = 2000$ and that the $t_i$'s are uniformly spaced for $0 \le t \le 10$. So, $t_i = (i-1)\Delta t$ for $i = 1, 2, \cdots, 2000$, where $\Delta t = 10/1999$.
(a) Plot $s_1(t)$, $s_2(t)$, $x_1(t)$, and $x_2(t)$ for $0 \le t \le 10$ as in Figures 10.8 and 10.9.
(b) From the SVD, $\mathbf{S}^* = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$. The whitened source values are then $\mathbf{s}_i^* = \sqrt{n}\,\boldsymbol{\Sigma}^{-1}\mathbf{V}^T\mathbf{s}_i^*$. Plot the resulting values of $(s_1^*, s_2^*)$ as in Figure 10.10. In a similar manner plot the values of $(y_1^*, y_2^*)$.
(c) $\mathbf{V}$ is a rotation, or it is a reflection. From your plots in part (b), which is it?
(d) Plot $K(\theta)$ in a similar manner as in Figure 10.12. What are the $\theta$ values for the local maximum points for $K$?
(e) For each $\theta$ that corresponds to a local maximum of $K$, plot the computed values for $s_1^*$ and $s_2^*$. Which one is the preferred choice?

Section 10.4

10.15
(a) In the derivation of (10.52) the assumption was made that the columns of $\mathbf{X}^*$ are independent. Explain why this means that $m \le n$.
(b) If one instead uses (10.53), is it required that $m \le n$?

10.16 Show that the error function (10.50) can be written using the Frobenius norm (Exercise 10.6), so that $E(\mathbf{P}) = ||\mathbf{X}^*\mathbf{P}^T - \mathbf{Y}^*||_F^2$.


10.17 The directed graph for a car rental company is shown in Figure 10.24. The weight for each of the unmarked arrows is 19/20. This exercise repeats the steps taken in Examples 10.2 and 10.3.
(a) After reviewing the example in Section 4.5.3, write down the adjacency matrix $\mathbf{A}$ for this problem.
(b) Taking $\mathbf{y}_0 = (100, 0, 0, 0)^T$, compute $\mathbf{y}_i$, for $i = 1, 2, \cdots, n$ when $n = 5$. With this, compute $\mathbf{P}$ using (10.53). Also, compute the relative error $||\mathbf{A}_c - \mathbf{A}||_\infty / ||\mathbf{A}||_\infty$.
(c) Redo part (b) but take $n = 10$. Are there any significant differences in the computed values?
(d) Compute, and then plot, the relative error when using the approximation (10.58) as a function of $t$ (similar to what is shown in Figure 10.20). Explain any significant differences between the modal approximations.
(e) Redo part (b) but include noise in the data, which amounts to about 10% of the original. If you are using MATLAB then each $\mathbf{y}_i$ is modified as: q=0.1*(2*rand(n,1)-1); y=(1+q).*y;. Comment on how the noise affects the accuracy of $\mathbf{A}_c$.

Figure 10.24 Directed graph for Exercise 10.17.

10.18 As was done in Example 10.4, for the given IVP solver you are to determine $\mathbf{B}$ in terms of $\mathbf{P}$, and also determine how the eigenvalues and eigenvectors of $\mathbf{B}$ and $\mathbf{P}$ are related.
(a) Using Euler's method.
(b) Using the trapezoidal method.
(c) Using Heun's method.

10.19 After selecting a method from Exercise 10.18, redo Example 10.4. This should result in a plot similar to Figure 10.21. Comment on any differences between your result and those for Example 10.4. To code $\mathbf{B}$ you should use the formula $\mathbf{B} = \mathbf{S}\mathbf{D}\mathbf{S}^{-1}$, where $\mathbf{S}$ is a randomly generated $4 \times 4$ matrix and $\mathbf{D}$ is a diagonal matrix with the diagonal entries being the specified eigenvalues.

10.20 This problem investigates how you can use linear data analysis to derive a nonlinear dynamical system $\mathbf{x}'(t) = \mathbf{f}(\mathbf{x})$.
(a) Suppose that $\mathbf{x} = (x_1, x_2)^T$ and $\mathbf{f} = (-4x_1, 3x_2 + 5x_1^2)^T$. Explain why, unlike what occurs in Exercise 10.18, Euler's method does not lead to an equation of the form $\mathbf{x}_{i+1} = \mathbf{P}\mathbf{x}_i$.
(b) In part (a), suppose a third variable $x_3 = x_1^2$ is introduced. Write down the differential equation for $\mathbf{x} = (x_1, x_2, x_3)^T$, and show that it can be written as $\mathbf{x}' = \mathbf{B}\mathbf{x}$. Assuming Euler's method is used, what is $\mathbf{P}$?
(c) For the problem in part (a), suppose that $\mathbf{x}(t)$ is measured at 4 time points, giving the following values:

$$
\mathbf{x}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
\mathbf{x}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad
\mathbf{x}_3 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad
\mathbf{x}_4 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.
$$

Assuming the approach in part (b) is used, and adapting the formula in (10.52), then $\mathbf{P} = \left[(\mathbf{X})^{+}\mathbf{Y}\right]^T$. If $\mathbf{P}$ is $n \times m$, then what are the values of $n$ and $m$? What are $\mathbf{X}$ and $\mathbf{Y}$ equal to?
(d) Consider the differential equations

$$
x_1' = 2x_1 - 7x_2 + 4x_2^3, \qquad x_2' = 7x_2.
$$

Using the idea introduced in part (b), explain why to write this as a linear system you should introduce two new variables: $x_3$ and $x_4$. Show that the resulting equation can be written as $\mathbf{x}' = \mathbf{B}\mathbf{x}$. If the initial condition for the original nonlinear system is $x_1(0) = -2$ and $x_2(0) = 3$, then what is the initial condition for the linear system?
(e) Suppose, for the problem in part (d), that the second equation is changed to $x_2' = x_1$. How many new variables need to be introduced to obtain a linear system?

Comment: Writing a nonlinear equation as a higher dimensional linear equation as in parts (b)–(e) underlies Koopman theory in data analysis.

Appendix A

Taylor’s Theorem

One of the essential tools in the development of numerical methods is Taylor's theorem. The reason is simple: Taylor's theorem enables you to approximate a function with a polynomial, and polynomials are easy to work with. To start, we need the following definition.

Definition of $C^n(a, b)$. Given a non-negative integer $n$, and an interval $a < x < b$, stating that $f \in C^n(a, b)$ means that $f(x), f'(x), f''(x), \cdots, f^{(n)}(x)$ exist and are continuous functions on the interval $a < x < b$.

Note that this definition does not follow the usual convention for exponents. In particular, $f \in C(a, b)$ and $f \in C^0(a, b)$ are the same statement, which are both different than stating that $f \in C^1(a, b)$. If $f \in C(a, b)$, or equivalently if $f \in C^0(a, b)$, then the function is continuous on the interval. In contrast, $f \in C^1(a, b)$ means that $f(x)$ and $f'(x)$ are continuous on the interval. Also, to state that $f \in C^\infty(a, b)$ means $f(x)$ and all of its derivatives are defined and continuous for $a < x < b$.

Taylor's Theorem. Given a function $f(x)$, assume that $f \in C^{n+1}(a, b)$. In this case, if $x$ and $x + h$ are points in the interval $(a, b)$ then

$$
f(x + h) = f(x) + hf'(x) + \frac{1}{2}h^2 f''(x) + \cdots + \frac{1}{n!}h^n f^{(n)}(x) + R_{n+1}, \tag{A.1}
$$

where the remainder is

$$
R_{n+1} = \frac{1}{(n+1)!}h^{n+1} f^{(n+1)}(\eta), \tag{A.2}
$$

and $\eta$ is a point between $x$ and $x + h$.

The result in (A.1) is known as Taylor's theorem with remainder. The mystery point $\eta$ in (A.2) is not known other than it is somewhere in the given interval.

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0

Writing out the first few cases we have that
\begin{align*}
f(x + h) &= f(x) + h f'(\eta), \\
f(x + h) &= f(x) + h f'(x) + \frac{1}{2} h^2 f''(\eta), \\
f(x + h) &= f(x) + h f'(x) + \frac{1}{2} h^2 f''(x) + \frac{1}{6} h^3 f'''(\eta).
\end{align*}
For small values of $h$, these give rise to the following approximations:
\begin{align}
f(x + h) &\approx f(x), \tag{A.3} \\
f(x + h) &\approx f(x) + h f'(x), \tag{A.4} \\
f(x + h) &\approx f(x) + h f'(x) + \frac{1}{2} h^2 f''(x). \tag{A.5}
\end{align}

As a function of $h$, (A.3) is a constant approximation, (A.4) is a linear approximation, and (A.5) is a quadratic approximation.

There are various ways to write a Taylor expansion. One is as stated in the above theorem, which is
\[
f(x + h) = f(x) + h f'(x) + \frac{1}{2} h^2 f''(x) + \cdots + \frac{1}{n!} h^n f^{(n)}(x) + \cdots .
\]
The assumption here is that $h$ is close to zero. Another way to write the expansion is as
\[
f(x) = f(x_0) + (x - x_0) f'(x_0) + \frac{1}{2} (x - x_0)^2 f''(x_0) + \cdots + \frac{1}{n!} (x - x_0)^n f^{(n)}(x_0) + \cdots .
\]
In this case it is assumed that $x$ is close to $x_0$. This gives rise to the linear approximation
\[
f(x) \approx f(x_0) + (x - x_0) f'(x_0), \tag{A.6}
\]
the quadratic approximation
\[
f(x) \approx f(x_0) + (x - x_0) f'(x_0) + \frac{1}{2} (x - x_0)^2 f''(x_0), \tag{A.7}
\]
and the cubic approximation
\[
f(x) \approx f(x_0) + (x - x_0) f'(x_0) + \frac{1}{2} (x - x_0)^2 f''(x_0) + \frac{1}{6} (x - x_0)^3 f'''(x_0). \tag{A.8}
\]
It's certainly possible to write down higher order approximations, but they are not needed in this text.

In the case of a multivariable function $f(\mathbf{x})$, the quadratic approximation in (A.7) takes the form
\[
f(\mathbf{x}) \approx f(\mathbf{x}_0) + (\mathbf{x} - \mathbf{x}_0) \cdot \nabla f(\mathbf{x}_0) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^T \mathbf{H}_0 (\mathbf{x} - \mathbf{x}_0) + \cdots,
\]
where $\mathbf{H}_0 = \mathbf{H}(\mathbf{x}_0)$, and $\mathbf{H}(\mathbf{x})$ is the Hessian of $f(\mathbf{x})$.
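The behavior of the approximations (A.3)–(A.5) can be checked numerically. The following is a minimal sketch, not taken from the text: the function name `taylor_approximations` and the test function $f(x) = e^x$ at $x = 1$ are illustrative assumptions. From the remainder formula (A.2), halving $h$ should reduce the three errors by factors of roughly 2, 4, and 8.

```python
import math

def taylor_approximations(f, df, d2f, x, h):
    """Return the constant, linear, and quadratic approximations of f(x + h)."""
    constant = f(x)                              # (A.3)
    linear = constant + h * df(x)                # (A.4)
    quadratic = linear + 0.5 * h**2 * d2f(x)     # (A.5)
    return constant, linear, quadratic

# Illustrative choice (an assumption): f(x) = exp(x), so f' = f'' = exp as well.
x = 1.0
errors = {}
for h in (0.1, 0.05, 0.025):
    exact = math.exp(x + h)
    approximations = taylor_approximations(math.exp, math.exp, math.exp, x, h)
    errors[h] = [abs(exact - a) for a in approximations]
    print(h, errors[h])

# Error ratios when h is halved from 0.1 to 0.05, one per approximation.
ratios = [errors[0.1][k] / errors[0.05][k] for k in range(3)]
print(ratios)
```

The ratios are only approximately 2, 4, and 8 because the point $\eta$ in the remainder changes with $h$.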

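A.1 Useful Examples for x Near Zero

\[
f(x) = f(0) + x f'(0) + \frac{1}{2} x^2 f''(0) + \frac{1}{6} x^3 f'''(0) + \cdots
\]

Power Functions

\begin{align*}
(a + x)^{\gamma} &= a^{\gamma} + \gamma x a^{\gamma - 1} + \frac{1}{2} \gamma(\gamma - 1) x^2 a^{\gamma - 2} + \frac{1}{6} \gamma(\gamma - 1)(\gamma - 2) x^3 a^{\gamma - 3} + \cdots \\
\frac{1}{1 - x} &= 1 + x + x^2 + x^3 + \cdots \\
\frac{1}{(1 - x)^2} &= 1 + 2x + 3x^2 + 4x^3 + \cdots \\
\sqrt{1 + x} &= 1 + \frac{1}{2} x - \frac{1}{8} x^2 + \frac{1}{16} x^3 + \cdots \\
\frac{1}{\sqrt{1 - x}} &= 1 + \frac{1}{2} x + \frac{3}{8} x^2 + \frac{5}{16} x^3 + \cdots
\end{align*}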

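Trig Functions

\begin{align*}
\sin(x) &= x - \frac{1}{3!} x^3 + \frac{1}{5!} x^5 + \cdots \\
\arcsin(x) &= x + \frac{1}{6} x^3 + \frac{3}{40} x^5 + \cdots \\
\cos(x) &= 1 - \frac{1}{2} x^2 + \frac{1}{4!} x^4 + \cdots \\
\arccos(x) &= \frac{\pi}{2} - x - \frac{1}{6} x^3 - \frac{3}{40} x^5 + \cdots \\
\tan(x) &= x + \frac{1}{3} x^3 + \frac{2}{15} x^5 + \cdots \\
\arctan(x) &= x - \frac{1}{3} x^3 + \frac{1}{5} x^5 + \cdots \\
\cot(x) &= \frac{1}{x} - \frac{1}{3} x - \frac{1}{45} x^3 + \cdots \\
\operatorname{arccot}(x) &= \frac{\pi}{2} - x + \frac{1}{3} x^3 - \frac{1}{5} x^5 + \cdots \\
\sin(a + x) &= \sin(a) + x \cos(a) - \frac{1}{2} x^2 \sin(a) + \cdots \\
\cos(a + x) &= \cos(a) - x \sin(a) - \frac{1}{2} x^2 \cos(a) + \cdots \\
\tan(a + x) &= \tan(a) + x \sec^2(a) + x^2 \tan(a) \sec^2(a) + \cdots
\end{align*}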

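Exponential and Log Functions

\begin{align*}
e^x &= 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + \cdots \\
a^x &= e^{x \ln(a)} = 1 + x \ln(a) + \frac{1}{2} [x \ln(a)]^2 + \frac{1}{6} [x \ln(a)]^3 + \cdots \\
\ln(a + x) &= \ln(a) + \frac{x}{a} - \frac{1}{2} \left( \frac{x}{a} \right)^2 + \frac{1}{3} \left( \frac{x}{a} \right)^3 + \cdots
\end{align*}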

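Hyperbolic Functions

\begin{align*}
\sinh(x) &= x + \frac{1}{6} x^3 + \frac{1}{120} x^5 + \cdots \\
\operatorname{arcsinh}(x) &= x - \frac{1}{6} x^3 + \frac{3}{40} x^5 + \cdots \\
\cosh(x) &= 1 + \frac{1}{2} x^2 + \frac{1}{24} x^4 + \cdots \\
\operatorname{arccosh}(1 + x) &= \sqrt{2x} \left( 1 - \frac{1}{12} x + \frac{3}{160} x^2 + \cdots \right) \\
\tanh(x) &= x - \frac{1}{3} x^3 + \frac{2}{15} x^5 + \cdots \\
\operatorname{arctanh}(x) &= x + \frac{1}{3} x^3 + \frac{1}{5} x^5 + \cdots
\end{align*}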
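Entries in a table such as this one are easy to spot-check numerically. The sketch below is illustrative and not part of the text: the helper `polyval`, the dictionary `table`, and the evaluation point $x = 0.1$ are all assumptions. Each entry holds the coefficients of the polynomial part of an expansion, in increasing powers of $x$, and the residual is compared with $x^6$, a bound that is looser than the first omitted term for every series selected here.

```python
import math

def polyval(coeffs, x):
    """Evaluate c[0] + c[1]*x + c[2]*x**2 + ... using Horner's rule."""
    total = 0.0
    for c in reversed(coeffs):
        total = total * x + c
    return total

# Coefficients of x**0, x**1, ..., for a selection of the expansions.
table = {
    "sin":     (math.sin,   [0, 1, 0, -1/6, 0, 1/120]),
    "arcsin":  (math.asin,  [0, 1, 0, 1/6, 0, 3/40]),
    "cos":     (math.cos,   [1, 0, -1/2, 0, 1/24]),
    "tan":     (math.tan,   [0, 1, 0, 1/3, 0, 2/15]),
    "arctan":  (math.atan,  [0, 1, 0, -1/3, 0, 1/5]),
    "sinh":    (math.sinh,  [0, 1, 0, 1/6, 0, 1/120]),
    "arcsinh": (math.asinh, [0, 1, 0, -1/6, 0, 3/40]),
    "cosh":    (math.cosh,  [1, 0, 1/2, 0, 1/24]),
    "tanh":    (math.tanh,  [0, 1, 0, -1/3, 0, 2/15]),
    "arctanh": (math.atanh, [0, 1, 0, 1/3, 0, 1/5]),
}

x = 0.1
residuals = {}
for name, (func, coeffs) in table.items():
    residuals[name] = abs(func(x) - polyval(coeffs, x))
    print(f"{name:8s} residual = {residuals[name]:.2e}  (x**6 = {x**6:.0e})")
```

A residual that is much larger than $x^6$ would indicate that one of the listed coefficients is wrong.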

A.2 Order Symbol and Truncation Error

As a typical example of how we will use Taylor's theorem, we start with
\[
\sin(x) = x - \frac{1}{3!} x^3 + \frac{1}{5!} x^5 - \frac{1}{7!} x^7 + \cdots
\]
From this we have the approximations $\sin(x) \approx x$, and
\[
\sin(x) \approx x - \frac{1}{3!} x^3 .
\]
It is useful to have a way to indicate how the next term in the series depends on $x$. The big-O notation is used for this, and we write
\[
\sin(x) = x + O(x^3), \tag{A.9}
\]
and
\[
\sin(x) = x - \frac{1}{3!} x^3 + O(x^5). \tag{A.10}
\]

In this text, the part of the series that is dropped when deriving an approximation is often designated as $\tau$. Given where it comes from, $\tau$ is referred to as the truncation error. Using the above example, we will sometimes write (A.9) as $\sin(x) = x + \tau$, where $\tau = O(x^3)$. Similarly, (A.10) can be written as
\[
\sin(x) = x - \frac{1}{3!} x^3 + \tau,
\]
where $\tau = O(x^5)$.

We will occasionally need to know how big-O terms combine. The rules that cover many of the situations we will come across, for $x$ close to zero, are the following:
• If $n < m$ then $O(x^n) + O(x^m) = O(x^n)$.
• For any nonzero constant $\alpha$, $O(\alpha x^n) = \alpha O(x^n) = O(x^n)$.

As an example of how they are used, if $f(x) = 1 + 2x + O(x^3)$ and $g(x) = -4 + 3x + O(x^4)$ then
\[
f + 2g = 1 + 2x + O(x^3) + 2[-4 + 3x + O(x^4)] = -7 + 8x + O(x^3).
\]
For the same reason, $-2f + 6g = -26 + 14x + O(x^3)$.

The last topic concerns two ways that the truncation error can be written. These come from writing the Taylor series using the remainder term, or else writing it out as a series. For example, one can write the series version
\[
\sin(x) = x - \frac{1}{3!} x^3 + \frac{1}{5!} x^5 - \frac{1}{7!} x^7 + \cdots
\]
as $\sin(x) = x + \tau$, where
\[
\tau = -\frac{1}{3!} x^3 + \frac{1}{5!} x^5 - \frac{1}{7!} x^7 + \cdots \tag{A.11}
\]
In contrast, the remainder form, coming from (A.1) and (A.2), is $\sin(x) = x + \tau$, where
\[
\tau = -\frac{1}{3!} x^3 \cos(\eta), \tag{A.12}
\]
and $\eta$ is a point between zero and $x$. For small $x$, both versions result in the conclusion that $\tau = O(x^3)$.
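The statement $\tau = O(x^3)$ can be illustrated with a short computation. This is a sketch under stated assumptions: the function name `truncation_error` and the sample values of $x$ are chosen only for illustration. According to (A.11), the ratio $\tau / x^3$ should approach $-1/3! = -1/6$ as $x \to 0$, and according to (A.12), $|\tau| \le x^3/3!$ because $|\cos(\eta)| \le 1$.

```python
import math

def truncation_error(x):
    """tau in sin(x) = x + tau."""
    return math.sin(x) - x

samples = (0.4, 0.2, 0.1, 0.05)      # illustrative values of x
ratios = []
within_bound = []
for x in samples:
    tau = truncation_error(x)
    ratios.append(tau / x**3)                    # tends to -1/6 by (A.11)
    within_bound.append(abs(tau) <= x**3 / 6)    # bound implied by (A.12)
    print(f"x = {x:5.2f}  tau = {tau: .3e}  tau/x^3 = {ratios[-1]: .6f}")

print(within_bound)
```

Note that for much smaller values of $x$ the subtraction $\sin(x) - x$ loses significant digits in floating-point arithmetic, so the computed ratio eventually stops improving.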

Appendix B

Vector and Matrix Summary

The following is a brief summary of notation and facts from matrix algebra that are relevant to the contents of this textbook. Some of this material is in the text, and is included here for completeness.

Vectors: bold lower case letters $\mathbf{v}$, $\mathbf{x}$, etc. All vectors are assumed to be written in column format (see below).

Matrices: bold upper case letters $\mathbf{A}$, $\mathbf{R}$, etc. (see below).
\[
\mathbf{A} = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \cdots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix},
\qquad
\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}
\]

Transpose: designated using a superscript $T$
\[
\mathbf{v}^T = \begin{pmatrix} v_1 & v_2 & \cdots & v_n \end{pmatrix},
\qquad
\mathbf{A}^T = \begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{m2} \\
\vdots & \vdots & \cdots & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{mn}
\end{pmatrix}
\]
Important properties: $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$, $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$, $(\mathbf{A}^T)^T = \mathbf{A}$, and $(\mathbf{A}\mathbf{v})^T = \mathbf{v}^T \mathbf{A}^T$.

Inverse Matrix: if $\mathbf{A}$ is square (and invertible), then the inverse is denoted as $\mathbf{A}^{-1}$ and it has the property that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$.
Important properties: $(\mathbf{A}^{-1})^{-1} = \mathbf{A}$, $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$, $(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$, $(\mathbf{A}^n)^{-1} = (\mathbf{A}^{-1})^n$.

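Vector Products

Inner (or, Dot) product: assuming $\mathbf{v}, \mathbf{w} \in \mathbb{R}^n$, then
\[
\mathbf{v} \bullet \mathbf{w} \equiv \sum_{i=1}^{n} v_i w_i .
\]
This can be written in vector form as $\mathbf{v} \bullet \mathbf{w} = \mathbf{v}^T \mathbf{w}$ or as $\mathbf{v} \bullet \mathbf{w} = \mathbf{w}^T \mathbf{v}$.
Important properties: $(\mathbf{A}\mathbf{x}) \bullet \mathbf{y} = \mathbf{x}^T \mathbf{A}^T \mathbf{y} = \mathbf{x} \bullet (\mathbf{A}^T \mathbf{y})$

Outer product: assuming $\mathbf{v} \in \mathbb{R}^n$ and $\mathbf{w} \in \mathbb{R}^m$, then
\[
\mathbf{v}\mathbf{w}^T \equiv \begin{pmatrix}
\uparrow & \uparrow & & \uparrow \\
w_1 \mathbf{v} & w_2 \mathbf{v} & \cdots & w_m \mathbf{v} \\
\downarrow & \downarrow & & \downarrow
\end{pmatrix}.
\]
Important properties: a) $\mathbf{v}\mathbf{w}^T$ is an $n \times m$ matrix; b) if $\mathbf{v}$ and $\mathbf{w}$ are nonzero, then $\mathbf{v}\mathbf{w}^T$ has rank one; c) $(\mathbf{v}\mathbf{w}^T)^T = \mathbf{w}\mathbf{v}^T$; and d) $\mathbf{v}\mathbf{v}^T$ is a symmetric matrix.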
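The inner and outer product properties listed above can be verified on a small example. The following sketch uses plain Python lists, with matrices stored as lists of rows; the helper names (`dot`, `outer`, `transpose`, `matvec`) and the particular vectors are illustrative assumptions, not notation from the text. Integer data are used so that every comparison is exact.

```python
def dot(v, w):
    """Inner product v . w for vectors of equal length."""
    return sum(vi * wi for vi, wi in zip(v, w))

def outer(v, w):
    """Outer product v w^T as a list of rows; entry (i, j) is v[i] * w[j]."""
    return [[vi * wj for wj in w] for vi in v]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [dot(row, x) for row in A]

v = [1, 2, 3]                 # v in R^3
w = [4, -1]                   # w in R^2
M = outer(v, w)               # a 3 x 2 matrix
print(M)

# Column j of v w^T is w[j] * v.
columns = transpose(M)
print(columns[0] == [w[0] * vi for vi in v], columns[1] == [w[1] * vi for vi in v])

# Property c): (v w^T)^T = w v^T.  Property d): v v^T is symmetric.
print(transpose(M) == outer(w, v), outer(v, v) == transpose(outer(v, v)))

# (A x) . y = x . (A^T y) for a 3 x 2 matrix A, x in R^2, y in R^3.
A = [[1, 2], [3, 4], [5, 6]]
x = [2, 5]
y = [1, 0, -1]
lhs = dot(matvec(A, x), y)
rhs = dot(x, matvec(transpose(A), y))
print(lhs, rhs)
```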

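Useful Matrix Theorems

Orthogonal Matrix Theorem. By definition, $\mathbf{Q}$ is orthogonal if, and only if, $\mathbf{Q}^{-1} = \mathbf{Q}^T$.
(1) $\mathbf{Q}$ is orthogonal if, and only if, its rows are orthonormal vectors.
(2) $\mathbf{Q}$ is orthogonal if, and only if, its columns are orthonormal vectors.
(3) $\mathbf{Q}$ is orthogonal if, and only if, $\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$, $\forall\, \mathbf{x}$.
(4) $\mathbf{Q}$ is orthogonal if, and only if, $\mathbf{Q}\mathbf{x} \bullet \mathbf{Q}\mathbf{y} = \mathbf{x} \bullet \mathbf{y}$, $\forall\, \mathbf{x}, \mathbf{y}$.
(5) If $\mathbf{Q}_1$ and $\mathbf{Q}_2$ are orthogonal $n \times n$ matrices then $\mathbf{Q}_1 \mathbf{Q}_2$ is orthogonal.
(6) If $\mathbf{Q}$ is orthogonal, then $\mathbf{Q}^T$ is orthogonal.
(7) If $\mathbf{Q}$ is orthogonal, then either $\det(\mathbf{Q}) = 1$ or $\det(\mathbf{Q}) = -1$.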
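Items (3), (4), and (7) of the Orthogonal Matrix Theorem can be illustrated with the two standard $2 \times 2$ examples, a rotation and a reflection. The sketch below is an illustration only: the angle `theta`, the test vectors, and the helper names are assumptions made for the example.

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def norm2(x):
    return math.sqrt(dot(x, x))

def det2(A):
    """Determinant of a 2 x 2 matrix stored as a list of rows."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

theta = 0.7                                   # illustrative angle
c, s = math.cos(theta), math.sin(theta)
Q_rot = [[c, -s], [s, c]]                     # rotation:   det = +1
Q_ref = [[c, s], [s, -c]]                     # reflection: det = -1

x = [3.0, -4.0]
y = [1.0, 2.0]
for name, Q in (("rotation", Q_rot), ("reflection", Q_ref)):
    Qx, Qy = matvec(Q, x), matvec(Q, y)
    print(name,
          "| norms:", norm2(Qx), norm2(x),        # item (3)
          "| dots:", dot(Qx, Qy), dot(x, y),      # item (4)
          "| det:", det2(Q))                      # item (7)
```

The printed norms and dot products agree only up to rounding error, which is why a tolerance is needed when testing orthogonality in floating-point arithmetic.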

Positive Definite Matrix Theorem. Assume $\mathbf{A}$ is a symmetric matrix. By definition, $\mathbf{A}$ is positive definite if, and only if, $\mathbf{x}^T \mathbf{A} \mathbf{x} > 0$, $\forall\, \mathbf{x} \neq \mathbf{0}$.
(1) $\mathbf{A}$ is positive definite if, and only if, all of its eigenvalues are positive.
(2) $\mathbf{A}$ is not positive definite if any diagonal entry is negative or zero, or if the largest entry, in absolute value, is off the diagonal.
(3) $\mathbf{A}$ is positive definite if it is strictly diagonally dominant and its diagonal entries are all positive.
(4) If $\mathbf{A}$ is positive definite then it is invertible and $\mathbf{A}^{-1}$ is positive definite.
(5) If $\mathbf{A}$ is positive definite then $\mathbf{A}^T$ and $\mathbf{A}^k$, for any positive integer $k$, are positive definite.
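Items (2) and (3) give inexpensive tests that can be tried before anything more costly. The sketch below is a minimal illustration with hypothetical function names. It assumes the input is symmetric and stored as a list of rows. The function `cholesky_succeeds` is a swapped-in technique rather than part of this theorem: it relies on the standard fact that a symmetric matrix has a Cholesky factorization $\mathbf{A} = \mathbf{L}\mathbf{L}^T$ with positive diagonal exactly when it is positive definite, and in floating-point arithmetic it can misclassify matrices that are very close to singular.

```python
import math

def quick_reject(A):
    """Item (2): True if symmetric A is ruled out as positive definite."""
    n = len(A)
    if any(A[i][i] <= 0 for i in range(n)):
        return True
    largest_diag = max(A[i][i] for i in range(n))
    largest_off = max((abs(A[i][j]) for i in range(n) for j in range(n) if i != j),
                      default=0.0)
    return largest_off > largest_diag

def strictly_dominant_positive(A):
    """Item (3): strictly diagonally dominant with positive diagonal entries."""
    n = len(A)
    return all(A[i][i] > sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

def cholesky_succeeds(A):
    """Attempt A = L L^T; stop as soon as a pivot is not positive."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if pivot <= 0:
            return False
        L[j][j] = math.sqrt(pivot)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return True

examples = {
    "tridiagonal": [[2, -1, 0], [-1, 2, -1], [0, -1, 2]],   # positive definite
    "dominant":    [[4, 1], [1, 3]],                        # positive definite by (3)
    "indefinite":  [[1, 2], [2, 1]],                        # rejected by (2)
}
for name, A in examples.items():
    print(name, quick_reject(A), strictly_dominant_positive(A), cholesky_succeeds(A))
```

The tridiagonal example shows that the conditions in (2) and (3) are one-directional: it passes neither quick test, yet it is positive definite.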

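Symmetric Eigenvalue Theorem. If $\mathbf{A}$ is a symmetric $n \times n$ matrix, then the following hold:
(1) Its eigenvalues are real numbers.
(2) If $\mathbf{x}_i$ and $\mathbf{x}_j$ are eigenvectors for different eigenvalues, then $\mathbf{x}_i \bullet \mathbf{x}_j = 0$.
(3) It is possible to find a set of orthonormal basis vectors $\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_n$, where each $\mathbf{u}_i$ is an eigenvector for $\mathbf{A}$.

Modified Eigenvalue Theorem. Suppose the eigenvalues of an $n \times n$ matrix $\mathbf{A}$ are $\lambda_1, \lambda_2, \cdots, \lambda_n$.
(1) Setting $\mathbf{B} = \mathbf{A} + \omega \mathbf{I}$, where $\omega$ is a constant, then the eigenvalues of $\mathbf{B}$ are $\lambda_1 + \omega, \lambda_2 + \omega, \cdots, \lambda_n + \omega$. Also, if $\mathbf{x}_i$ is an eigenvector for $\mathbf{A}$ that corresponds to $\lambda_i$, then $\mathbf{x}_i$ is an eigenvector for $\mathbf{B}$ that corresponds to $\lambda_i + \omega$.
(2) If $\mathbf{A}$ is invertible, then the eigenvalues of $\mathbf{A}^{-1}$ are $1/\lambda_1, 1/\lambda_2, \cdots, 1/\lambda_n$. Also, if $\mathbf{x}_i$ is an eigenvector for $\mathbf{A}$ that corresponds to $\lambda_i$, then $\mathbf{x}_i$ is an eigenvector for $\mathbf{A}^{-1}$ that corresponds to $1/\lambda_i$.

Spectral Decomposition Theorem. If $\mathbf{A}$ is a symmetric matrix, then it is possible to factor $\mathbf{A}$ as $\mathbf{A} = \mathbf{Q}\mathbf{D}\mathbf{Q}^T$, where $\mathbf{D}$ is a diagonal matrix and $\mathbf{Q}$ is an orthogonal matrix. Letting $\mathbf{u}_1, \mathbf{u}_2, \cdots, \mathbf{u}_n$ be orthonormal eigenvectors for $\mathbf{A}$, with corresponding eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$, then
\[
\mathbf{D} = \begin{pmatrix}
\lambda_1 & & & \\
& \lambda_2 & & \\
& & \ddots & \\
& & & \lambda_n
\end{pmatrix},
\]
and the $j$th column of $\mathbf{Q}$ is $\mathbf{u}_j$. This factorization can be written as the outer product expansion
\[
\mathbf{A} = \lambda_1 \mathbf{u}_1 \mathbf{u}_1^T + \lambda_2 \mathbf{u}_2 \mathbf{u}_2^T + \cdots + \lambda_n \mathbf{u}_n \mathbf{u}_n^T .
\]

Outer Product Theorem.
(1) If $\mathbf{A}$ is $n \times p$ and $\mathbf{B}$ is $p \times m$, then
\[
\mathbf{A}\mathbf{B} = \mathbf{a}_1 \mathbf{b}_1^T + \mathbf{a}_2 \mathbf{b}_2^T + \cdots + \mathbf{a}_p \mathbf{b}_p^T ,
\]
where $\mathbf{a}_j$ is the $j$th column from $\mathbf{A}$ and $\mathbf{b}_j$ is the $j$th column from $\mathbf{B}^T$.
(2) If $\mathbf{v}, \mathbf{w} \in \mathbb{R}^n$ with $\mathbf{v} \bullet \mathbf{w} \neq 0$, then $\mathbf{A} = \mathbf{v}\mathbf{w}^T$ has exactly two distinct eigenvalues: $\lambda_1 = \mathbf{v} \bullet \mathbf{w}$, and $\lambda_2 = 0$. For $\lambda_1$, $\mathbf{v}$ is an eigenvector (and there are no other linearly independent eigenvectors for $\lambda_1$). The eigenvalue $\lambda_2$ has a geometric multiplicity of $n - 1$ and each of its eigenvectors is orthogonal to $\mathbf{w}$.
(3) Assume that $\mathbf{A}$ is $n \times n$, and it can be written as
\[
\mathbf{A} = \lambda_1 \mathbf{u}_1 \mathbf{u}_1^T + \lambda_2 \mathbf{u}_2 \mathbf{u}_2^T + \cdots + \lambda_k \mathbf{u}_k \mathbf{u}_k^T ,
\]
where the $\lambda_i$'s are nonzero. If the $\mathbf{u}_j$'s are orthonormal, then each $\mathbf{u}_j$ is an eigenvector for $\mathbf{A}$ with corresponding eigenvalue $\lambda_j$. If $k < n$, then $\lambda = 0$ is also an eigenvalue for $\mathbf{A}$, and it has a geometric multiplicity of $n - k$.

$\mathbf{A}^T\mathbf{A}$ Theorem. Assume $\mathbf{A}$ is an $m \times n$ matrix.
(1) $\mathbf{A}^T\mathbf{A}$ is an $n \times n$ symmetric matrix with non-negative eigenvalues.
(2) If the columns of $\mathbf{A}$ are linearly independent, then $\mathbf{A}^T\mathbf{A}$ is positive definite.
(3) Letting $N(\bullet)$ and $R(\bullet)$ denote the null space and range, respectively, then $N(\mathbf{A}^T\mathbf{A}) = N(\mathbf{A})$. Consequently, $\operatorname{rank}(\mathbf{A}^T\mathbf{A}) = \operatorname{rank}(\mathbf{A})$.
(4) If $\lambda$ is a nonzero eigenvalue of $\mathbf{A}^T\mathbf{A}$ with eigenvector $\mathbf{x}$, then $\lambda$ is an eigenvalue of $\mathbf{A}\mathbf{A}^T$ with eigenvector $\mathbf{A}\mathbf{x}$. If $\lambda = 0$ is an eigenvalue of $\mathbf{A}^T\mathbf{A}$ with eigenvector $\mathbf{x}$, and if $\mathbf{A}\mathbf{x} = \mathbf{0}$, then $\lambda = 0$ is an eigenvalue of $\mathbf{A}\mathbf{A}^T$ with eigenvector $\mathbf{A}\mathbf{x}$.
(5) $\|\mathbf{A}^T\mathbf{A}\|_2 = \|\mathbf{A}\mathbf{A}^T\|_2 = \|\mathbf{A}\|_2^2$, and $\operatorname{tr}(\mathbf{A}^T\mathbf{A}) = \operatorname{tr}(\mathbf{A}\mathbf{A}^T) = \|\mathbf{A}\|_F^2$. Also, if $\mathbf{A}$ is invertible, then $\kappa_2(\mathbf{A}^T\mathbf{A}) = [\kappa_2(\mathbf{A})]^2$.
An $\mathbf{A}\mathbf{A}^T$ theorem is obtained from the above by writing it as $(\mathbf{A}^T)^T(\mathbf{A}^T)$.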
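Several of the statements above can be checked by hand for a symmetric $2 \times 2$ matrix, where the eigenpairs are available in closed form. The following is a sketch under stated assumptions: the function `sym_eig_2x2`, the sample entries, and the shift `omega` are illustrative, and the eigenvector formula requires a nonzero off-diagonal entry. It checks the outer product expansion from the Spectral Decomposition Theorem, the orthogonality of the eigenvectors, and both parts of the Modified Eigenvalue Theorem.

```python
import math

def sym_eig_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # First row of (A - lam I) u = 0 is (a - lam) u1 + b u2 = 0, so u = (b, lam - a).
        length = math.hypot(b, lam - a)
        pairs.append((lam, (b / length, (lam - a) / length)))
    return pairs

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

a, b, d = 2.0, 1.0, 3.0                       # illustrative symmetric matrix
A = [[a, b], [b, d]]
(lam1, u1), (lam2, u2) = sym_eig_2x2(a, b, d)

# Outer product expansion: A = lam1 u1 u1^T + lam2 u2 u2^T.
rebuilt = [[lam1 * u1[i] * u1[j] + lam2 * u2[i] * u2[j] for j in range(2)]
           for i in range(2)]

# Eigenvectors for different eigenvalues are orthogonal.
orthogonality = u1[0] * u2[0] + u1[1] * u2[1]

# Shift: B = A + omega I has eigenvalue lam1 + omega with the same eigenvector u1.
omega = 0.5
B = [[a + omega, b], [b, d + omega]]
shift_residual = max(abs(Bu - (lam1 + omega) * u) for Bu, u in zip(matvec(B, u1), u1))

# Inverse: A^{-1} has eigenvalue 1/lam2 with the same eigenvector u2.
det = a * d - b * b
A_inv = [[d / det, -b / det], [-b / det, a / det]]
inverse_residual = max(abs(w - u / lam2) for w, u in zip(matvec(A_inv, u2), u2))

print(lam1, lam2)
print(rebuilt)
print(orthogonality, shift_residual, inverse_residual)
```

All three residuals should be at the level of rounding error, and `rebuilt` should reproduce the entries of `A` to the same accuracy.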


546

References

J. B. Rosser, C. Lanczos, M. R. Hestenes, and W. Karush. Separation of close eigenvalues of a real symmetric matrix. J. Res. Bur. Stand., 47(4):291–297, 1951. D. Salomon. Curves and Surfaces for Computer Graphics. Springer, 2006. J. S¨ arel¨ a and R. Vig´ ario. Overlearning in marginal distribution-based ICA: Analysis and solutions. J. Mach. Learn. Res., 4:1447–1469, 2003. 10.1162/jmlr.2003.4.7-8.1447. H. Sawada, N. Ono, H. Kameoka, D. Kitamura, and H. Saruwatari. A review of blind source separation methods: two converging routes to ILRMA originating from ICA and NMF. APSIPA Transactions on Signal and Information Processing, 8:e12, 2019. 10.1017/ATSIP.2019.5. P. J. Schmid. Dynamic mode decomposition of numerical and experimental data. J. Fluid Mech., 656:5–28, 2010. 10.1017/S0022112010001217. K. L. Schutt and J. B. Moseley. The phosphatase inhibitor Sds23 promotes symmetric spindle positioning in fission yeast. Cytoskeleton, 77(12):544–557, 2020. https://doi.org/10.1002/cm.21648. S. Singh. The Simpsons and Their Mathematical Secrets. Bloomsbury, 2013. J. A. Smith, L. Wilson, O. Azarenko, X. Zhu, B. M. Lewis, B. A. Littlefield, and M. A. Jordan. Eribulin binds at microtubule ends to a single site on tubulin to suppress dynamic instability. Biochemistry, 49(6):1331–1337, 2010. V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13(4):354–356, 1969. 10.1007/BF02165411. A. Stuart and A. R. Humphries. Dynamical Systems and Numerical Analysis. Cambridge University Press, Cambridge, UK, 1998. E. S¨ uli and D. F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003. E. Talvila and M. Wiersma. Optimal error estimates for corrected trapezoidal rules, 2012. C. T¨ onsing, J. Timmer, and C. Kreutz. Optimal Paths Between Parameter Estimates in Nonlinear ODE Systems Using the Nudged Elastic Band Method. Frontiers in Physics, 7:149, 2019. 10.3389/fphy.2019.00149. J. F. Traub. Iterative Methods for the Solution of Equations. 
AMS Chelsea Publishing Series. Chelsea, second edition, 1982. ISBN 978-0-8284-0312-2. L. N. Trefethen. Is Gauss quadrature better than Clenshaw-Curtis? SIAM Review, 50(1):67–87, 2008. 10.1137/060659831. L. N. Trefethen. Approximation Theory and Approximation Practice. SIAM, Philadelphia, PA, 2012. ISBN 1611972396, 9781611972399. L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, 1997. ISBN 0898713617. L. N. Trefethen and J. A. C. Weideman. The exponentially convergent trapezoidal rule. SIAM Review, 56(3):385–458, 2014. 10.1137/130932132. J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. Comp. Dyn., 1(2):391–421, 2014. 10.3934/jcd.2014.1.391. U. S. Census Bureau. Crime rates by type: Selected large cities. In Statistical Abstract of the United States: 2012. Washington, DC, 131st edition, 2012. USAFacts. US Coronavirus Cases and Deaths, Dec 2021. URL https://usafacts.org/ visualizations/coronavirus-covid-19-spread-map/. R. A van den Berg, H. C. J. Hoefsloot, J. A. Westerhuis, A. K. Smilde, and M. van der Werf. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7(1):142, 2006. 10.1186/1471-2164-7-142. L. J. P. van der Maaten, E. O. Postma, and H. J. van den Herik. Dimensionality Reduction: A Comparative Review. Technical report, Tilburg University, 2009. J. Varah. A spline least squares method for numerical parameter estimation in differential equations. SIAM J. Sci. Stat. Comput., 3(1):28–46, 1982. 10.1137/0903003. R. S. Varga. Gerˇsgorin and His Circles. Springer Series in Computational Mathematics. Springer, 2004. ISBN 9783642177989.

References

547

R. Vidal, Y. Ma, and S. S. Sastry. Generalized Principal Component Analysis. Springer, 2016. ISBN 0387878106. J. Virta and K. Nordhausen. Estimating the number of signals using principal component analysis. Stat, 8(1):e231, 2019. 10.1002/sta4.231. N. Voglis. Waves derived from galactic orbits. In Galaxies and Chaos, volume 626 of Lecture Notes in Physics, pages 56–74. Springer Berlin / Heidelberg, 2003. J. Waldvogel. Towards a general error theory of the trapezoidal rule. In W. Gautschi, G. Mastroianni, and T. M. Rassias, editors, Approximation and Computation, volume 42 of Springer Optimization and Its Applications, pages 267–282. Springer New York, 2011. 10.1007/978-1-4419-6594-3 17. J. Wallis. Operum Mathematicorum. Number v. 2. 1699. D. S. Watkins. The QR algorithm revisited. SIAM Rev., 50(1):133–145, 2008. 10.1137/060659454. J. A. C. Weideman. Numerical integration of periodic functions: A few examples. Amer. Math. Monthly, 109(1):pp. 21–36, 2002. ISSN 00029890. R. Weinstock. Calculus of Variations: With Applications to Physics and Engineering. Dover, 1974. ISBN 9780486630694. F. M. White. Viscous Fluid Flow. McGraw Hill Series in Mechanical Engineering. McGrawHill, 3d edition, 2005. R. R. Wildeboer, F. Sammali, R. J. G. van Sloun, Y. Huang, P. Chen, M. Bruce, C. Rabotti, S. Shulepov, G. Salomon, B. C. Schoot, H. Wijkstra, and M. Mischi. Blind source separation for clutter and noise suppression in ultrasound imaging: Review for different applications. IEEE Trans. Ultrason. Ferroelectr. Freq. Control, 67(8):1497–1512, 2020. 10.1109/TUFFC.2020.2975483. G. Wilkins and M. Gu. A modified Brent’s method for finding zeros of functions. Numer. Math., 123(1):177–188, 2013. 10.1007/s00211-012-0480-x. J. Wu, K. G. Brigham, M. A. Simon, and J. C. Brigham. An implementation of independent component analysis for 3D statistical shape analysis. Biomed. Signal Processing Control, 13(0):345 – 356, 2014. 10.1016/j.bspc.2014.06.003. T. J. Ypma. 
Historical development of the Newton-Raphson method. SIAM Rev., 37(4):531–551, 1995. 10.1137/1037125. H. Zenil. A Computable Universe: Understanding and Exploring Nature as Computation. World Scientific Pub, 2012. H. Zhang and Z. Cheng. The performance evaluation of classic ICA algorithms for blind separation of fabric defects. J. Fiber Bioeng. Inform., 7(3):377–386, 2014. 10.3993/jfbi09201407. H. Zhang, C. W. Rowley, E. A. Deem, and L. N. Cattafesta. Online dynamic mode decomposition for time-varying systems. SIAM J. Appl. Dyn. Syst., 18(3):1586–1609, 2019. 10.1137/18M1192329. Q. Zhao, K. Okada, K. Rosenbaum, L. Kehoe, D. J. Zand, R. Sze, M. Summar, and M. G. Linguraru. Digital facial dysmorphology for genetic screening: Hierarchical constrained local model using ICA. Medical Image Analysis, 18(5):699 – 710, 2014. 10.1016/j.media.2014.04.002.

Index

Symbols B_i(x), 222 C[a, b], 37 C^n(a, b), 529 O(x^n), 532 ε, 5, 6 ||x||_1, 89 ||x||_2, 88 ||x||_∞, 88

A A-stable, 316 Adams method, 325 adaptive quadrature, 283 adjacency matrix, 163, 194, 195 benzene, 195 Hückel Hamiltonian matrices, 195 Markov chain, 166, 511 naphthalene, 195 airfoil cross-section, 246 Akima algorithm, 239 allometric function, 390 amplification factor, 317, 319, 325, 351 Armijo’s method, 439 asymptotically stable, 304 augmented matrix, 80

B backtracking sampling method, 439 backward Euler method, 319, 514 barycentric Lagrange formula, 252 Bernoulli equation, 353, 407

BLAS, 21 blind source separation problem, 495 brachistochrone problem, 455 Brent’s method, 54 bungee cord, 31

C causality, 344 chapeau function, 215 characteristic equation, 129 Chebyshev interpolation, 232 exponential convergence, 234, 238 interpolation error, 234 used for integration, 283 used for solving nonlinear equations, 54 Chebyshev points, 232 Chebyshev polynomial, 235 Cholesky factorization, 104, 105 incomplete, 433 normal equation, 366 Clenshaw-Curtis quadrature, 283 coefficient of determination, 492 Colebrook equation, 63 compensated summation, 13, 14 integration, 287 complex Taylor series expansion, 342 composite Hermite rule, 269 condition number, 93, 94 diagonal matrix, 98 elliptical contours, 420 conjugate gradient method, 421 finite termination property, 422

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. H. Holmes, Introduction to Scientific Computing and Data Analysis, Texts in Computational Science and Engineering 13, https://doi.org/10.1007/978-3-031-22430-0


consistent approximation, 313 contrast function, 500 correct significant digits, 94 corrected trapezoidal rule, 269 covariance, 400 crime rates, 489 Crout factorization, 76, 82 cubic B-splines, 220, 223, 269 derivative, 252 integrals, 252 least squares, 408 cubic spline clamped, 218, 223, 252, 393 natural, 218, 223, 224 not-a-knot, 218, 223, 224 periodic end conditions, 248 cubic spline interpolation, 217 integration of data, 275 integration rule, 269 interpolation error, 230 minimum curvature, 226

D Dahlquist’s Equivalence Theorem, 322 descent direction, 414 steepest, 436 dominant eigenvalue, 134 Doolittle factorization, 76, 82 dot product, 130 double precision, 5, 6 downsampling, 504 drag on sphere, 63

E eigenvalue problem, 129 elastic bar minimum potential energy, 471 elastic beam minimum potential energy, 471 elastic string elastic foundation, 193 minimum potential energy, 453 elliptic integral first kind, 67 second kind, 67 epoch, 23 error, 18 for IVP solver, 321 iterative, 18 relative, 18 vector, 93

error function, 67, 294 Euler’s constant, 13 Euler-Maclaurin formula, 13 exascale computers, 114 explicit method, 313 F Fermi-Pasta-Ulam (FPU) chain, 132 Fibonacci sequence, 52 finite termination property, 422 FitzHugh-Nagumo equations, 394 Fletcher-Reeves method, 437 floating-point numbers, 5 flop, 10, 87, 422 Freudenstein equation, 60 Fritsch-Butland procedure, 239 Frobenius norm, 520, 525 Fundamental Theorem of Calculus, 324 G Gauss-Kronrod rules, 287 Gaussian elimination, 80 Gaussian quadrature 1-point, 279 2-point, 280 exponential convergence, 282 Gershgorin circles, 160 Gershgorin theorem, 160 Goldilocks, 376 Golub-Reinsch algorithm, 179 Golub-Welsch algorithm, 281 Gram matrices, 171 Gram-Schmidt, 151 modified, 152 Gram-Schmidt procedure, 148 least squares, 369 Great Internet Mersenne Prime Search, 12 H Halley’s method, 68 Hamming norm, 126 hat function, 215 Hermite interpolation, 239, 245 Hessian, 437, 530 Hilbert matrix, 124 Homer Simpson, 26 Horner’s method, 15, 209 I ideal gas law, 65

IEEE-754, 5 ill-conditioned, 94 image compression, 180 implicit method, 319 incomplete Cholesky factorization, 433 independent component analysis contrast function, 500 downsampling, 504 EEG rule, 503 kurtosis, 502 mixing matrix, 497 unmixing matrix, 499 whitening source data, 498 induced matrix norm, 92 Inf, 9 interpolation cubic splines, 217, 393 global polynomial, 207, 227 Lagrange, 209 piecewise linear, 213, 228 piecewise quadratic, 239, 253 inverse iteration, 143 inverse orthogonal iteration, 162 irreducible, 160, 168 iterative error, 18, 137, 418 IVP solvers A-stable, 316 Adams method, 325 backward Euler, 319 Euler’s method, 313 Heun’s method, 329 leapfrog method, 320 RK2, 327 RK4, 330, 337, 343, 392, 394 that conserve energy, 357 trapezoidal method, 325, 337, 340 velocity Verlet method, 341

J Jacobi preconditioner, 432 Jacobian, 109, 437

K Kermack-McKendrick model, 335 Koopman theory, 515, 527 Kurtosis, 502

L Lagrange interpolation, 209 interpolation error, 226 Lagrange multipliers, 483

lasso regularization, 374 law of mass action, 335 leapfrog method, 320, 326 least squares cubic splines, 408 nonlinear, 387 normal equation, 364, 366 QR approach, 368 least squares solution, 372 Legendre polynomial, 281 Leslie matrix, 197 Levenberg-Marquardt method, 445 line search problem, 413, 438 Lobatto quadrature, 300, 331 IVP solver, 331 logistic equation, 304, 313, 325, 329, 360, 391 Lorentzian function, 283 loss of significance, 10 LU factorization, 74, 82 computing time, 155, 407 Crout, 76, 82 Doolittle, 76, 79, 82 faster than, 113 flops, 87 pivoting, 80 tri-diagonal matrix, 106

M machine epsilon, 5, 6 magic matrix, 123 mantissa, 5 Markov chain, 511 Mars, 356 matrix condition number, 94 defective, 131 dense, 108 diagonally dominant, 102 exchange, 201 Gram, 171 ill-conditioned, 94 inverse of 2 × 2, 95 irreducible, 160, 168, 169, 196, 198 lower triangular, 74, 82 negative definite, 463 normal, 119 orthogonal, 154 positive definite, 100, 126, 416, 422 rotation, 499, 500 sparse, 107 storage requirement, 86

strictly diagonally dominant, 102 strictly lower triangular, 433 trace, 159 tri-diagonal, 83, 105, 455 unit lower triangular, 82, 116, 402 unit upper triangular, 82 upper triangular, 74, 80, 82 matrix factorization LU, 82 LDL^T, 116 QP, 202 QR, 154 QDQ^T, 169, 537 UL, 116 UΣV^T, 175 U^T U, 104 ZZ^T, 524 matrix norm, 90 mean absolute deviation, 492 mean squared prediction error, 492 Mersenne prime, 12, 24 method of false position, 57 Michaelis-Menten function, 360 Michaelis-Menten model, 62 numerical solution, 352 midpoint rule, 324 minimum potential energy, 454 mixing matrix, 497, 499 model function allometric, 390 asymptotic regression, 360 logistic, 360 Michaelis-Menten, 360, 387 power law, 390 modified Lagrange formula, 212 Mooney-Rivlin model, 31, 295 Moore-Penrose pseudo-inverse, 199, 370, 510 QR, 396 SVD, 396 mutually orthogonal, 422

N NaN, 9 National Vegetable Research Station, 449 Nelder-Mead algorithm, 54, 393, 449 Newton’s method, 39, 109 discoverer, 54 finding implicit functions, 47 finding inverse functions, 66 minimization, 437

nonlinear system, 108 Newton’s second law, 336 ngram program, v, 485 nodes, 207 norm ∞-norm, 88, 91 1-norm, 89, 91 2-norm, 88 dimensionally consistent, 89, 487 energy, 122 Euclidean, 88 Frobenius, 520 matrix, 90 vector, 88 normal equation, 364, 366 ill-conditioned, 367 normal modes, 132 numerical differentiation, 310 complex Taylor series expansion, 342 optimal step size, 309

O objective function, 359, 411 one-step method, 320 order of convergence, 37, 53 orthogonal iteration, 149 orthogonal matrix, 154, 169, 537 Orthogonal Matrix Theorem, 536 proper, 202 orthogonal regression, 361, 477 oscillators chain, 132 coupled, 194, 356 outer product, 178 overflow negative, 9 positive, 9

P PageRank algorithm, 167 partial pivoting, 80 Pascal matrix, 124 pendulum equation, 339, 341 permutation matrix, 79 Perron-Frobenius theorem, 168 persistence forecast, 516 perturbation matrix, 111 Picard-Lindelöf theorem, 305 piecewise linear interpolation, 213 integration rule, 261 interpolation error, 228

pivoting, 77 Planck’s law of blackbody radiation, 295 Polak-Ribière method, 437, 441 polar decomposition, 202 population interpolation, 247 positive definite matrix, 100 Positive Definite Matrix Theorem, 536 power law fluid, 346 power law function, 390 power method, 134, 135 precision, 277 arbitrary, 12 double, 5, 6 of integration rule, 277 quadruple, 6 single, 6 solving matrix equations, 99 preconditioner Jacobi, 432 symmetric successive overrelaxation, 433 prime numbers, 12 principal component analysis autoscaling, 487 centering data, 477, 480 error function, 478, 482 Guttman-Kaiser criterion, 494 how many components, 494 how many points, 493 range scaling, 487 relative residual variance, 484 scaling data, 477, 480 Q QR factorization, 154 computing time, 155, 407 least squares, 368 QR method, 156 divide and conquer, 158 implicitly shifted, 158 R Radau quadrature, 299 radioactive decay, 303 Rayleigh quotient descent method, 467 Rayleigh’s quotient, 130 Rayleigh’s quotient iteration, 145 regression error function, 359, 361

fitting circle, 386 fitting ellipse, 403 lasso, 374 model function, 360 nonlinear, 387 objective function, 359 regularization, 373 relative error, 389 ridge, 374 Tikhonov, 398 relative error, 18 matrices, 180 relative iterative error, 18, 41 Remez algorithm, 235 residual, 93, 417, 425 Reynolds number, 63 ridge regularization, 374, 398 robustification, 508 Rolle’s theorem, 227 Romberg integration, 272, 293 Rosenbrock function, 441, 446 Rosser matrix, 121, 189 round to nearest, 8 row sums, 160 Runge’s function, 213, 237 Runge–Kutta methods, 327 Heun, 329 Lobatto quadrature, 331 order conditions, 330, 334, 345 RK2, 327, 328 RK4, 393

S secant method, 50 order of convergence, 52, 53 Simpson’s rule, 266, 324 adaptive, 284 error, 266 singular value decomposition computing time, 155, 407 flops, 179 ICA, 499 image compression, 180 PCA, 478 principal axes, 173 summary, 175 singular values, 170, 172, 175 geometric multiplicity, 175 SIR model, 335 sparse matrix, 107 square-and-multiply algorithm, 30 stability

IVPs, 316 standard deviation, 400 steady-state solution, 304 steepest descent direction, 418, 437 method, 418, 437 Strassen’s method, 113 strictly convex, 412 strongly connected, 168 subnormal floats, 9 superlinear convergence, 53 symmetric successive overrelaxation preconditioner, 433 symplectic method, 342 synthetic data, 393, 407 T tangent linear model, 519 taxi-cabs, 114 Taylor’s theorem, vii, 529 tent function, 215 terminal velocity, 63 testing data, 375, 489 Thomas algorithm, 106 Tikhonov regularization, 398 trace, 159 training data, 374, 489 transpose notation, 75 trapezoidal method, 325 trapezoidal rule, 262, 324 error, 263 periodic function, 288, 291 tri-diagonal matrix, 83, 105 eigenvalues, 122

eigenvectors, 204 invertibility, 106 LU factorization, 106 positive definite, 122 truncation error, 533 IVPs, 313, 325 quadrature approach, 325 two-step method, 320 Tyrannosaurus rex, 405, 522

U unimodular matrix, 119 unit lower triangular matrix, 82 unit upper triangular matrix, 82 unmixing matrix, 499

V van der Waals equation of state, 65 Vandermonde matrix, 86, 99, 152, 208, 367 vector norm, 88 velocity Verlet, 341

W whitening source data, 498, 524 Wolfe conditions, 439

Y Yogi Berra, 111 Young’s modulus, 345