Nonlinear Lp-Norm Estimation 0824781252, 9780824781255

Complete with valuable FORTRAN programs that help solve nondifferentiable nonlinear Lp and Linf-norm estimation problems


English Pages 320 [317] Year 1989


Table of contents :
Cover
Half Title
Title Page
Copyright Page
PREFACE
Table of Contents
1: Lp-NORM ESTIMATION IN LINEAR REGRESSION
1.1 The history of curve fitting problems
1.2 The linear Lp-norm estimation problem
1.2.1 Formulation
1.2.2 Algorithms
1.2.3 L2-estimation
1.2.4 L1-estimation
1.2.5 L∞-estimation
1.3 The choice of p
1.4 Statistical Properties of Linear Lp-norm Estimators
1.5 Confidence intervals for β
1.5.1 Case 1 < p < ∞
1.5.2 Case p = 1
1.6 Example
Conclusion
Appendix 1A: Gauss-Markoff Theorem
Additional notes
Bibliography: Chapter 1
2: THE NONLINEAR L1-NORM ESTIMATION PROBLEM
2.1 The nonlinear L1-norm estimation problem
2.2 Optimality conditions for L1-norm estimation problems
2.2.1 Examples
2.3 Algorithms for solving the nonlinear L1- norm estimation problem
2.3.1 Differentiable unconstrained minimization
2.3.2 Type I Algorithms
2.3.2.1 The Anderson-Osborne-Watson algorithm
2.3.2.2 The Anderson-Osborne-Levenberg-Marquardt algorithm
2.3.2.3 The McLean and Watson algorithms
2.3.3 Type II Algorithms
2.3.3.1 The Murray and Overton algorithm
2.3.3.2 The Bartels and Conn algorithm
2.3.4 Type III methods
2.3.4.1 The Hald and Madsen algorithm
2.3.5 Type IV Algorithms
2.3.5.1 The El-Attar-Vidyasagar-Dutta algorithm
2.3.5.2 The Tishler and Zang algorithm
Conclusion
Appendix 2A: Local and global minima
Appendix 2B: One-dimensional line search algorithms
Appendix 2C: The BFGS approach in unconstrained minimization
Appendix 2D: Levenberg-Marquardt approach in nonlinear estimation
Appendix 2E: Rates of convergence
Appendix 2F: Linear Algebra
Appendix 2G: Extrapolation procedures to enhance convergence
Appendix 2H: FORTRAN Programs
Additional notes
Bibliography: Chapter 2
3: THE NONLINEAR L∞-NORM ESTIMATION PROBLEM
3.1 The nonlinear L∞-norm estimation problem
3.2 L∞-norm optimality conditions
3.3 The nonlinear minimax problem
3.4 Examples
3.5 Algorithms for solving nonlinear L∞-norm estimation problems
3.5.1 Type I Algorithms
3.5.1.1 The Anderson-Osborne-Watson algorithm
3.5.1.2 The Madsen algorithm
3.5.1.3 The Anderson-Osborne-Levenberg-Marquardt algorithm
3.5.2 Type II Algorithms
3.5.2.1 The Murray and Overton algorithm
3.5.2.2 The Han algorithm
3.5.3 Type III Algorithms
3.5.3.1 The Watson algorithm
3.5.3.2 The Hald and Madsen algorithm
3.5.4 Type IV algorithms
3.5.4.1 The Charalambous acceleration algorithm
3.5.4.2 The Zang algorithm
Appendix 3A: FORTRAN Programs
Additional notes
Bibliography: Chapter 3
4: THE NONLINEAR Lp-NORM ESTIMATION PROBLEM
4.1 The nonlinear Lp-norm estimation problem (1 < p < ∞)
4.2 Optimality conditions for the nonlinear Lp-norm problem
4.3 Algorithms for nonlinear Lp-norm estimation problems
4.3.1 Watson’s algorithm
4.3.2 A first-order gradient algorithm
4.3.2.1 Examples
4.3.3 The mixture method for large residual and ill-conditioned problems
4.3.3.1 Examples
Conclusion
Appendix 4A: Cholesky decomposition of symmetric matrices
Appendix 4B: The singular value decomposition (SVD) of a matrix
Appendix 4C: Fletcher’s line search algorithm
Additional notes
Bibliography: Chapter 4
5: STATISTICAL ASPECTS OF Lp-NORM ESTIMATORS
5.1 Nonlinear least squares
5.1.1 Statistical inference
5.1.1.1 Confidence intervals and joint confidence regions
5.1.1.2 Hypothesis testing
5.1.1.3 Bias
5.2 Nonlinear L1-norm estimation
5.2.1 Statistical inference
5.2.1.1 Confidence intervals
5.2.1.2 Hypothesis testing
5.2.1.3 Bias and consistency
5.3 Nonlinear Lp-norm estimation
5.3.1 Asymptotic distribution of Lp-norm estimators (additive errors)
5.3.2 Statistical inference
5.3.2.1 Confidence intervals
5.3.2.2 Hypothesis testing
5.3.2.3 Bias
5.4 The choice of the exponent p
5.4.1 The simulation study
5.4.2 Simulation results and selection of p
5.4.3 The efficiency of Lp-norm estimators for varying values of p
5.4.4 The relationship between p and the sample kurtosis
5.5 The asymptotic distribution of p
5.6 The adaptive algorithm for Lp-norm estimation
5.7 Critique of the model and regression diagnostics
5.7.1 Regression diagnostics (nonlinear least squares)
5.7.2 Regression diagnostics (nonlinear Lp-norm case)
Conclusion
Appendix 5A: Random deviate generation
Appendix 5B: Tables
Appendix 5C: ω2p for symmetric error distributions
Appendix 5D: Var(pˆ) for symmetric error distributions
Appendix 5E: The Cramér-von Mises goodness-of-fit test
Additional notes
Bibliography: Chapter 5
6: APPLICATION OF Lp-NORM ESTIMATION
6.1 Compartmental models, bioavailability studies in Pharmacology
6.1.1 Mathematical models of drug bioavailability
6.1.2 Outlying observations in pharmacokinetic models
6.2 Oxygen saturation in respiratory physiology
6.2.1 Problem background
6.2.2 The data
6.2.3 Modelling justification
6.2.4 Problem solution
6.3 Production models in Economics
6.3.1 The two-step approach
6.3.1.1 Two-step nonlinear least squares
6.3.1.2 Two-step adaptive Lp-norm estimation
6.3.2 The global least squares approach
Concluding remark
Bibliography: Chapter 6
AUTHOR INDEX
SUBJECT INDEX


STATISTICS: Textbooks and Monographs

A Series Edited by
D. B. Owen, Coordinating Editor
Department of Statistics
Southern Methodist University
Dallas, Texas

R. G. Cornell, Associate Editor for Biostatistics, University of Michigan
W. J. Kennedy, Associate Editor for Statistical Computing, Iowa State University
A. M. Kshirsagar, Associate Editor for Multivariate Analysis and Experimental Design, University of Michigan
E. G. Schilling, Associate Editor for Statistical Quality Control, Rochester Institute of Technology

Vol. 1: The Generalized Jackknife Statistic, H. L. Gray and W. R. Schucany
Vol. 2: Multivariate Analysis, Anant M. Kshirsagar
Vol. 3: Statistics and Society, Walter T. Federer
Vol. 4: Multivariate Analysis: A Selected and Abstracted Bibliography, 1957-1972, Kocherlakota Subrahmaniam and Kathleen Subrahmaniam (out of print)
Vol. 5: Design of Experiments: A Realistic Approach, Virgil L. Anderson and Robert A. McLean
Vol. 6: Statistical and Mathematical Aspects of Pollution Problems, John W. Pratt
Vol. 7: Introduction to Probability and Statistics (in two parts), Part I: Probability; Part II: Statistics, Narayan C. Giri
Vol. 8: Statistical Theory of the Analysis of Experimental Designs, J. Ogawa
Vol. 9: Statistical Techniques in Simulation (in two parts), Jack P. C. Kleijnen
Vol. 10: Data Quality Control and Editing, Joseph I. Naus (out of print)
Vol. 11: Cost of Living Index Numbers: Practice, Precision, and Theory, Kali S. Banerjee
Vol. 12: Weighing Designs: For Chemistry, Medicine, Economics, Operations Research, Statistics, Kali S. Banerjee
Vol. 13: The Search for Oil: Some Statistical Methods and Techniques, edited by D. B. Owen
Vol. 14: Sample Size Choice: Charts for Experiments with Linear Models, Robert E. Odeh and Martin Fox
Vol. 15: Statistical Methods for Engineers and Scientists, Robert M. Bethea, Benjamin S. Duran, and Thomas L. Boullion
Vol. 16: Statistical Quality Control Methods, Irving W. Burr
Vol. 17: On the History of Statistics and Probability, edited by D. B. Owen
Vol. 18: Econometrics, Peter Schmidt
Vol. 19: Sufficient Statistics: Selected Contributions, Vasant S. Huzurbazar (edited by Anant M. Kshirsagar)
Vol. 20: Handbook of Statistical Distributions, Jagdish K. Patel, C. H. Kapadia, and D. B. Owen
Vol. 21: Case Studies in Sample Design, A. C. Rosander
Vol. 22: Pocket Book of Statistical Tables, compiled by R. E. Odeh, D. B. Owen, Z. W. Birnbaum, and L. Fisher
Vol. 23: The Information in Contingency Tables, D. V. Gokhale and Solomon Kullback
Vol. 24: Statistical Analysis of Reliability and Life-Testing Models: Theory and Methods, Lee J. Bain
Vol. 25: Elementary Statistical Quality Control, Irving W. Burr
Vol. 26: An Introduction to Probability and Statistics Using BASIC, Richard A. Groeneveld
Vol. 27: Basic Applied Statistics, B. L. Raktoe and J. J. Hubert
Vol. 28: A Primer in Probability, Kathleen Subrahmaniam
Vol. 29: Random Processes: A First Look, R. Syski
Vol. 30: Regression Methods: A Tool for Data Analysis, Rudolf J. Freund and Paul D. Minton
Vol. 31: Randomization Tests, Eugene S. Edgington
Vol. 32: Tables for Normal Tolerance Limits, Sampling Plans, and Screening, Robert E. Odeh and D. B. Owen
Vol. 33: Statistical Computing, William J. Kennedy, Jr. and James E. Gentle
Vol. 34: Regression Analysis and Its Application: A Data-Oriented Approach, Richard F. Gunst and Robert L. Mason
Vol. 35: Scientific Strategies to Save Your Life, I. D. J. Bross
Vol. 36: Statistics in the Pharmaceutical Industry, edited by C. Ralph Buncher and Jia-Yeong Tsay
Vol. 37: Sampling from a Finite Population, J. Hajek
Vol. 38: Statistical Modeling Techniques, S. S. Shapiro
Vol. 39: Statistical Theory and Inference in Research, T. A. Bancroft and C.-P. Han
Vol. 40: Handbook of the Normal Distribution, Jagdish K. Patel and Campbell B. Read
Vol. 41: Recent Advances in Regression Methods, Hrishikesh D. Vinod and Aman Ullah
Vol. 42: Acceptance Sampling in Quality Control, Edward G. Schilling
Vol. 43: The Randomized Clinical Trial and Therapeutic Decisions, edited by Niels Tygstrup, John M. Lachin, and Erik Juhl
Vol. 44: Regression Analysis of Survival Data in Cancer Chemotherapy, Walter H. Carter, Jr., Galen L. Wampler, and Donald M. Stablein
Vol. 45: A Course in Linear Models, Anant M. Kshirsagar
Vol. 46: Clinical Trials: Issues and Approaches, edited by Stanley H. Shapiro and Thomas H. Louis
Vol. 47: Statistical Analysis of DNA Sequence Data, edited by B. S. Weir
Vol. 48: Nonlinear Regression Modeling: A Unified Practical Approach, David A. Ratkowsky
Vol. 49: Attribute Sampling Plans, Tables of Tests and Confidence Limits for Proportions, Robert E. Odeh and D. B. Owen
Vol. 50: Experimental Design, Statistical Models, and Genetic Statistics, edited by Klaus Hinkelmann
Vol. 51: Statistical Methods for Cancer Studies, edited by Richard G. Cornell
Vol. 52: Practical Statistical Sampling for Auditors, Arthur J. Wilburn
Vol. 53: Statistical Signal Processing, edited by Edward J. Wegman and James G. Smith
Vol. 54: Self-Organizing Methods in Modeling: GMDH Type Algorithms, edited by Stanley J. Farlow
Vol. 55: Applied Factorial and Fractional Designs, Robert A. McLean and Virgil L. Anderson
Vol. 56: Design of Experiments: Ranking and Selection, edited by Thomas J. Santner and Ajit C. Tamhane
Vol. 57: Statistical Methods for Engineers and Scientists, Second Edition, Revised and Expanded, Robert M. Bethea, Benjamin S. Duran, and Thomas L. Boullion
Vol. 58: Ensemble Modeling: Inference from Small-Scale Properties to Large-Scale Systems, Alan E. Gelfand and Crayton C. Walker
Vol. 59: Computer Modeling for Business and Industry, Bruce L. Bowerman and Richard T. O'Connell
Vol. 60: Bayesian Analysis of Linear Models, Lyle D. Broemeling
Vol. 61: Methodological Issues for Health Care Surveys, Brenda Cox and Steven Cohen
Vol. 62: Applied Regression Analysis and Experimental Design, Richard J. Brook and Gregory C. Arnold
Vol. 63: Statpal: A Statistical Package for Microcomputers - PC-DOS Version for the IBM PC and Compatibles, Bruce J. Chalmer and David G. Whitmore
Vol. 64: Statpal: A Statistical Package for Microcomputers - Apple Version for the II, II+, and IIe, David G. Whitmore and Bruce J. Chalmer
Vol. 65: Nonparametric Statistical Inference, Second Edition, Revised and Expanded, Jean Dickinson Gibbons
Vol. 66: Design and Analysis of Experiments, Roger G. Petersen
Vol. 67: Statistical Methods for Pharmaceutical Research Planning, Sten W. Bergman and John C. Gittins
Vol. 68: Goodness-of-Fit Techniques, edited by Ralph B. D'Agostino and Michael A. Stephens
Vol. 69: Statistical Methods in Discrimination Litigation, edited by D. H. Kaye and Mikel Aickin
Vol. 70: Truncated and Censored Samples from Normal Populations, Helmut Schneider
Vol. 71: Robust Inference, M. L. Tiku, W. Y. Tan, and N. Balakrishnan
Vol. 72: Statistical Image Processing and Graphics, edited by Edward J. Wegman and Douglas J. DePriest
Vol. 73: Assignment Methods in Combinatorial Data Analysis, Lawrence J. Hubert
Vol. 74: Econometrics and Structural Change, Lyle D. Broemeling and Hiroki Tsurumi
Vol. 75: Multivariate Interpretation of Clinical Laboratory Data, Adelin Albert and Eugene K. Harris
Vol. 76: Statistical Tools for Simulation Practitioners, Jack P. C. Kleijnen
Vol. 77: Randomization Tests, Second Edition, Eugene S. Edgington
Vol. 78: A Folio of Distributions: A Collection of Theoretical Quantile-Quantile Plots, Edward B. Fowlkes
Vol. 79: Applied Categorical Data Analysis, Daniel H. Freeman, Jr.
Vol. 80: Seemingly Unrelated Regression Equations Models: Estimation and Inference, Virendra K. Srivastava and David E. A. Giles
Vol. 81: Response Surfaces: Designs and Analyses, Andre I. Khuri and John A. Cornell
Vol. 82: Nonlinear Parameter Estimation: An Integrated System in BASIC, John C. Nash and Mary Walker-Smith
Vol. 83: Cancer Modeling, edited by James R. Thompson and Barry W. Brown
Vol. 84: Mixture Models: Inference and Applications to Clustering, Geoffrey J. McLachlan and Kaye E. Basford
Vol. 85: Randomized Response: Theory and Techniques, Arijit Chaudhuri and Rahul Mukerjee
Vol. 86: Biopharmaceutical Statistics for Drug Development, edited by Karl E. Peace
Vol. 87: Parts per Million Values for Estimating Quality Levels, Robert E. Odeh and D. B. Owen
Vol. 88: Lognormal Distributions: Theory and Applications, edited by Edwin L. Crow and Kunio Shimizu
Vol. 89: Properties of Estimators for the Gamma Distribution, K. O. Bowman and L. R. Shenton
Vol. 90: Spline Smoothing and Nonparametric Regression, Randall L. Eubank
Vol. 91: Linear Least Squares Computations, R. W. Farebrother
Vol. 92: Exploring Statistics, Damaraju Raghavarao
Vol. 93: Applied Time Series Analysis for Business and Economic Forecasting, Sufi M. Nazem
Vol. 94: Bayesian Analysis of Time Series and Dynamic Models, edited by James C. Spall
Vol. 95: The Inverse Gaussian Distribution: Theory, Methodology, and Applications, Raj S. Chhikara and J. Leroy Folks
Vol. 96: Parameter Estimation in Reliability and Life Span Models, A. Clifford Cohen and Betty Jones Whitten
Vol. 97: Pooled Cross-Sectional and Time Series Data Analysis, Terry E. Dielman
Vol. 98: Random Processes: A First Look, Second Edition, Revised and Expanded, R. Syski
Vol. 99: Generalized Poisson Distributions: Properties and Applications, P. C. Consul
Vol. 100: Nonlinear Lp-Norm Estimation, Rene Gonin and Arthur H. Money
Vol. 101: Model Discrimination for Nonlinear Regression Models, Dale S. Borowiak
Vol. 102: Applied Regression Analysis in Econometrics, Howard E. Doran
Vol. 103: Continued Fractions in Statistical Applications, K. O. Bowman and L. R. Shenton
Vol. 104: Statistical Methodology in the Pharmaceutical Sciences, Donald A. Berry

ADDITIONAL VOLUMES IN PREPARATION

Nonlinear Lp-Norm Estimation RENE GONIN Institute for Biostatistics Medical Research Council Cape Town, South Africa

ARTHUR H. MONEY Henley The Management College Henley-on-Thames England

MARCEL DEKKER, INC.

New York and Basel

ISBN 13: 978-0-8247-8125-5

This book is printed on acid-free paper. Copyright © 1989 by MARCEL DEKKER, INC. All Rights Reserved. Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher. MARCEL DEKKER, INC. 270 Madison Avenue, New York, New York 10016. Current printing (last digit): 10 9 8 7 6 5 4 3 2 1. PRINTED IN THE UNITED STATES OF AMERICA

Preface

We hope this book will be of use both to the statistician and to the applied mathematician. It is primarily of a computational nature and therefore belongs to the field of statistical computing. The book is self-contained and permits flexibility in the manner in which it can be read. The reader should be able to grasp the underlying computational principles without having had a formal course in statistical computing. Familiarity with linear programming is, however, essential. The book is organized into two parts, namely, numerical solution procedures and statistical analysis.

The original outline for this book arose out of the first author’s doctoral thesis. Subsequently the material on nonlinear L1-norm and L∞-norm estimation problems was incorporated. Our main objective is the treatment of nonlinear Lp-norm estimation, which subsumes nonlinear least squares estimation. We realized that an introduction to linear Lp-norm estimation was needed as a preamble to the nonlinear case. Linear Lp-norm estimation is vast enough to warrant a complete text on its own. Limited space permits only a brief review of this problem in Chapter 1. We also felt it important to bring to the statistician’s notice useful procedures known to the numerical analyst. We hope that this book will be successful in bridging the gap between the statistician and the numerical analyst.

We have included some FORTRAN programs to assist the reader in solving the nondifferentiable nonlinear L1- and L∞-norm estimation problems. For the continuously differentiable nonlinear Lp-norm problem any general unconstrained minimization routine can be used. This is not necessarily the most efficient avenue of solution.

Sections of the manuscript were used by the first author in a course on nonlinear parameter estimation at the University of Michigan in Ann Arbor during 1988.

We would like to express our gratitude to the Medical Research Council in Cape Town for the use of their computer facilities. This entailed not only computational aspects but also the typesetting of the book by means of TeX. The first author would also like to thank the Medical Research Council for financial support related to this project. The second author acknowledges the continued financial support of the University of Cape Town and the Human Sciences Research Council.


We would like to express our sincere appreciation to Jan Pretorius, Director of the Institute for Biostatistics, for creating the environment for pursuing this project. His continued interest and encouragement was an important factor in completing this task.

We would especially like to thank Professor Vince Sposito of Iowa State University for his careful scrutiny of the manuscript and his many valuable comments. Also, many thanks to Professor Cas Troskie of the University of Cape Town for perusing the manuscript. For the classification of the L1-norm and L∞-norm methods we are indebted to Dr. Alistair Watson of Dundee University. Also, our gratitude to Professor Mike Osborne of the Australian National University, who put into perspective for us the numerical methods of nonlinear L1-norm estimation.

We would like to thank Doctors Joyce Gonin and Sinclair Wynchank for patiently editing the manuscript. Their efforts certainly improved the clarity of the subject matter. Finally, thanks to our spouses, Joyce and Gillian, for their fortitude in coping with our impatience and idiosyncrasies as p tended to ∞.

Rene Gonin
Arthur H. Money


1 Lp-Norm Estimation in Linear Regression

The fundamental problem of fitting equations to a number of observations has for centuries been of interest to scientists. For example, curve fitting models were used by early astronomers to predict the orbits of celestial bodies. Concerned with the calculation of the distance of a new star from the earth, Galileo† makes the following statement through the interlocutor Salviati in his Dialogue of the two chief world systems: “Then these observers being capable, and having erred for all that, and the errors needing to be corrected in order for us to get the best possible information from their observations, it will be appropriate for us to apply the minimum amendments and smallest corrections that we can — just enough to remove the observations from impossibility and restore them to possibility.” This statement heralded the beginning of the theory of errors in which the model is estimated from a number of inconsistent observations.

1.1. The history of curve fitting problems

One of the first methods for smoothing random errors was based on averages and was known as the Principle of the Arithmetic Mean: suppose we wish to fit a linear regression model $y = \alpha + \beta x$ to the observations $(x_i, y_i)$, $i = 1, \ldots, n$. The estimates of $\alpha$ and $\beta$ are determined as follows: Slopes between all possible pairs of points,

$$b_{ij} = \frac{y_j - y_i}{x_j - x_i} \quad \text{with } x_j \neq x_i \text{ for } i = 1, \ldots, n-1; \; j = i+1, \ldots, n$$

are first calculated and the corresponding intercepts $a_{ij}$ in each case are then calculated by substitution. The averages of the intercepts and slopes, denoted by $a$ and $b$ respectively, are then taken as estimates of $\alpha$ and $\beta$.

† Galilei, Galileo (1629). Dialogue concerning the two chief world systems, Ptolemaic and Copernican. (English translation by Stillman Drake. Revised 2nd edition, UCLA Press 1967.)

Cotes (1722) noted in certain models that only the response variable (observations $y$) is subject to measurement errors and suggested a procedure based on weighted arithmetic means with weights proportional to $|x|$. In the model $y = \beta x + e$, the slope $\beta$ is estimated by $b = \bar{y}/\bar{x}$, the ratio of the two means, which in turn is equivalent to the zero sum residuals condition:

$$\sum_{i=1}^{n} (y_i - b x_i) = 0 \qquad (1.1)$$

This is the same as stipulating that the ideal line must pass through the centroid $(\bar{x}, \bar{y})$ of the data.

Euler (1749) and Mayer (1750) independently derived the so-called Method of Averages for fitting a straight line to observed data. The observations are subdivided on some subjective basis into as many subsets as there are coefficients. The grouping is made according to the value of one of the explanatory variables. Those with the largest values of this variable are grouped together and so on. Condition (1.1) is then applied to each observation of the subset.

Boscovich (1757) considered the model $y = \alpha + \beta x + e$ and proposed that two criteria be met when fitting the best straight line‡. The estimates $a$ and $b$ of the parameters $\alpha$ and $\beta$ are determined such that:

Criterion 1: $\sum_{i=1}^{n} |y_i - a - b x_i|$ is a minimum.

Criterion 2: The sum of the positive and negative residuals in the y-variable are equal, i.e., $\sum_{i=1}^{n} (y_i - a - b x_i) = 0$.

Criterion 1 is the objective of Least Absolute Value (LAV) estimation (also referred to as L1-norm estimation), while Criterion 2 ensures that the fitted line passes through the centroid of the data. Boscovich’s solution procedure is based on geometric principles. Solution procedures were derived by Simpson (1760) and Laplace (1793, 1799).

‡ L1-norm estimation has in fact been traced back to Galileo (1629); see e.g. Ronchetti (1987).
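The two oldest averaging rules in this section are simple enough to try by hand. The following is only a minimal sketch, not code from the book: the function names `arithmetic_mean_fit` and `cotes_slope` and the data are hypothetical, chosen so that the points lie exactly on y = 1 + 2x. It averages all pairwise slopes and intercepts (Principle of the Arithmetic Mean) and then checks that the ratio-of-means slope of Cotes satisfies the zero sum residuals condition (1.1).

```python
from itertools import combinations


def arithmetic_mean_fit(xs, ys):
    """Average all pairwise slopes b_ij and intercepts a_ij (hypothetical helper)."""
    slopes, intercepts = [], []
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        if xj == xi:  # b_ij is undefined when x_j = x_i, so skip the pair
            continue
        b_ij = (yj - yi) / (xj - xi)
        a_ij = yi - b_ij * xi  # intercept by substitution into y = a + b x
        slopes.append(b_ij)
        intercepts.append(a_ij)
    return sum(intercepts) / len(intercepts), sum(slopes) / len(slopes)


def cotes_slope(xs, ys):
    """Slope of y = b x estimated as the ratio of the two means (hypothetical helper)."""
    return (sum(ys) / len(ys)) / (sum(xs) / len(xs))


# Illustrative data only: four points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = arithmetic_mean_fit(xs, ys)

# Condition (1.1): residuals of the ratio-of-means line sum to zero (up to rounding).
b_cotes = cotes_slope(xs, ys)
residual_sum = sum(y - b_cotes * x for x, y in zip(xs, ys))
print(a, b, b_cotes, residual_sum)
```

On exactly linear data every pairwise slope and intercept agrees, so the averages reproduce the line; with noisy data the pairwise slopes over short x-distances can dominate the average, which is one reason later authors moved to explicit error criteria.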


Laplace (1786) also used the Boscovich principles to test the adequacy of the relationship $y = \alpha + \beta x$ for the data $(x_i, y_i)$. He described an algebraic procedure for determining the coefficients $\alpha$ and $\beta$.

Gauss (1806) developed the method of minimizing the squared observation errors in his works on celestial mechanics, which subsequently became known as the Method of Least Squares (L2-norm estimation). Although Gauss had used least squares since 1795, Legendre (1805) was the first to publish the method. He derived the normal equations algebraically without using calculus. He claimed that least squares was superior to other existing methods but gave no proof. In 1809 Gauss derived the normal (Gaussian) law of error which states that the arithmetic mean of the observations of an unknown variable $x$ will be the most probable. Gauss (1820) succinctly writes: “Determining a magnitude by observation can justly be compared to a game in which there is a danger of loss but no hope of gain ... Evidently the loss in the game cannot be compared directly to the error which has been committed, for then a positive error would represent loss and a negative error a gain. The magnitude of the loss must on the contrary be evaluated by a function of the error of which the value is always positive ... it seems natural to choose the simplest (function), which is, beyond contradiction, the square of the error”.

Laplace (1818) examined the distributional properties of the estimator, $b$, of the parameter $\beta$ in the simple regression model $y = \beta x + e$ when L1-norm estimation is used. He assumed that all the errors $e$ had the same symmetric distribution about zero and derived the density function $f(e)$ of the errors $e$. He also showed that the slope $b$ was asymptotically normally distributed with mean zero and variance $\{4 f(0)^2 \sum_{i=1}^{n} x_i^2\}^{-1}$. This is a well known result for the sample median in the location model $y = \beta + e$.

Cauchy (1824) examined the fitting of a straight line $y = \alpha + \beta x$ to data and proposed minimizing the maximum absolute residual. This he achieved by means of an iterative procedure. Chebychev (1854) in his work on the approximation of functions also proposed the estimation of parameters by means of minimizing the maximum absolute difference between the observed function and the estimated function. This minimax procedure later became known as Chebychev approximation or L∞-norm approximation.
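The three criteria met so far (least absolute values, least squares, minimax) are easiest to compare in the location model y = β + e mentioned above, where each has a closed-form minimizer: a sample median for L1, the arithmetic mean for L2, and the midrange for L∞. The sketch below is an illustration under that assumption only; the function name `lp_location_estimates` and the sample are hypothetical.

```python
from statistics import mean, median


def lp_location_estimates(ys):
    """Return the L1, L2 and L-infinity estimates of beta in y = beta + e."""
    l1 = median(ys)                  # minimizes sum of |y_i - beta|
    l2 = mean(ys)                    # minimizes sum of (y_i - beta)^2
    linf = (min(ys) + max(ys)) / 2   # minimizes max |y_i - beta| (midrange)
    return l1, l2, linf


# Illustrative sample with one outlying observation.
sample = [1.0, 2.0, 3.0, 4.0, 20.0]
l1, l2, linf = lp_location_estimates(sample)
print(l1, l2, linf)  # 3.0 6.0 10.5
```

The single outlier leaves the median untouched, pulls the mean upward, and moves the midrange the most, which previews the concern with outliers raised by Edgeworth.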


Edgeworth (1883) questioned the universal use of the normal law of errors and also examined the problem of outliers. This problem was considered further by Doolittle (1884). Edgeworth (1887a,b) used the first Boscovich principle and abandoned the zero sum residuals condition which forces the line through the centroid of the data. He, in fact, provided the first workable method for estimating these parameters. He also considered cases where least squares is inappropriate, i.e., where the error distributions are unknown or contaminated (normal) distributions.

The aforementioned authors proposed complex solution procedures for these problems. This possibly accounts for the apparent lack of interest in applying LAV estimation in the period 1884 to 1920. Mild interest was revived again in the period 1920 to 1950 by Edgeworth (1923), Rhodes (1930), Singleton (1940) and Harris (1950). It was only after the appearance of the papers by Charnes et al. (1955) and Wagner (1959, 1962) that interest in L1 and L∞ estimation was again stimulated. These authors described the linear programming (LP) formulation to the L1- and L∞-norm problems respectively. Progress, however, was hampered during this period since computers were not advanced enough to handle very large LP problems.

In general, the parameters of the equations may be estimated by minimizing the sum of the pth powers of the absolute deviations of the estimated values from the observed values of the response variable. This procedure is referred to as Lp-norm estimation. In the rest of this chapter we give a review of important developments in linear Lp-norm estimation.

1.2. The linear Lp-norm estimation problem

Consider the general linear model

$$y = X\beta + e \qquad (1.2)$$

where
$y$ is an n-vector of observable random variables,
$X$ is an $n \times k$ matrix of known regressor variables,
$\beta$ is a k-vector of unknown parameters, and
$e$ is an n-vector of unobservable errors.


1.2.1. Formulation

The linear Lp-norm estimation problem is then defined as: Find the parameter vector $b = (b_1, \ldots, b_k)'$ which minimizes

$$ \sum_{i=1}^{n} |y_i - x_i b|^p = \sum_{i=1}^{n} |e_i|^p \tag{1.3} $$

with $y_i$ the response and $x_i = (x_{i1}, \ldots, x_{ik})$ the explanatory variables respectively, and $x_{i1} = 1$ for all i. Hence $b_1$ is the estimate of the intercept, b is the Lp-norm estimate of β and $e = (e_1, \ldots, e_n)'$ is the n-vector of residuals. Charnes et al. (1955) showed that problem (1.3) (with p = 1) can be formulated as a linear programming problem by considering the unrestricted error term to be the difference between two nonnegative variables. Let

$$ e_i = u_i - v_i, \quad \text{where } u_i, v_i \ge 0, $$

represent the positive and negative deviations respectively. The general Lp-norm problem then becomes:

$$ \begin{aligned} \text{Minimize}\quad & \sum_{i=1}^{n} (u_i + v_i)^p \\ \text{subject to}\quad & x_i b + u_i - v_i = y_i, \quad u_i, v_i \ge 0, \quad i = 1, \ldots, n \\ & b \ \text{unconstrained} \end{aligned} \tag{1.4} $$

Only in the cases p = 1 and p → ∞ can linear programming procedures be used. When p = 2 the normal equations of least squares are solved. For other values of p unconstrained minimization procedures are used.

1.2.2. Algorithms

Barrodale and Roberts (1970) showed that the linear Lp-norm estimation problem can be formulated as a nonlinear programming (NLP) problem in which the objective function is concave for 0 < p < 1 and convex for 1 < p < ∞ with the constraints being linear. They suggested for p > 1 the use of the convex simplex or Newton's method. For the case 0 < p < 1 a modification of the simplex method for linear programming was proposed. Ekblom (1973) rewrites problem (1.3) as the perturbation problem: Find the parameter vector b which minimizes

$$ \sum_{i=1}^{n} \left[ (y_i - x_i b)^2 + c^2 \right]^{p/2} \qquad (\text{where } c \text{ is finite}) \tag{1.5} $$
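The effect of the perturbation constant c in (1.5) can be seen numerically. The following is only a sketch: the data, the trial vector b and the function names are assumptions made for illustration and are not taken from Ekblom (1973). It evaluates the objectives of (1.3) and (1.5) at the same b and records how the gap between them shrinks as c decreases.

```python
def lp_objective(y, X, b, p):
    """Objective of problem (1.3): sum of |y_i - x_i b|^p."""
    return sum(abs(yi - sum(xij * bj for xij, bj in zip(xi, b))) ** p
               for yi, xi in zip(y, X))


def perturbed_objective(y, X, b, p, c):
    """Objective of problem (1.5): sum of [(y_i - x_i b)^2 + c^2]^(p/2)."""
    total = 0.0
    for yi, xi in zip(y, X):
        r = yi - sum(xij * bj for xij, bj in zip(xi, b))
        total += (r * r + c * c) ** (p / 2.0)
    return total


# Hypothetical data: intercept column plus one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.1, 0.9, 2.3, 2.8]
b = [0.0, 1.0]          # trial parameter vector, not an estimate
p = 1.0

exact = lp_objective(y, X, b, p)
# Gap between the smooth objective (1.5) and the original objective (1.3)
gaps = [perturbed_objective(y, X, b, p, c) - exact
        for c in (1.0, 0.1, 0.01, 0.001)]
```

For c ≠ 0 each term of (1.5) is smooth in b, which is what allows a Newton-type method to be applied; the gaps above are positive and decrease with c.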

The author used the modified (damped) Newton method to solve problem (1.5). The advantage is that the Hessian (matrix of 2nd-order derivatives) of the perturbed problem remains positive definite as long as c ≠ 0. A decrease is therefore assured at every iteration. Ekblom then showed that the limiting solution as c → 0 is the solution to the original problem (1.3). More recently Ekblom (1987) suggested that problem (1.5) with p = 1 be used for L1-norm estimation.

Fischer (1981) considered problem (1.3) with 1 < p < 2. Setting $r_i = x_i b - y_i$ for i = 1, ..., n, he transformed (1.3) to the following “linearly constrained” problem: Find the parameters b which minimize

$$ \sum_{i=1}^{n} |r_i|^p \quad \text{subject to} \quad x_i b - r_i = y_i, \quad i = 1, \ldots, n $$

This problem, known as the primal [see Zangwill (1969: Chapter 2)], can be formulated as:

$$ \min_{b} \max_{\lambda} L(b, \lambda) = \sum_{i=1}^{n} \left[ |r_i|^p + \lambda_i (x_i b - r_i - y_i) \right] $$

where λ is the vector of undetermined Lagrange multipliers. The dual problem is then:

$$ \max_{\lambda} \min_{b} L(b, \lambda) $$

Fischer also indicated how a constrained version of Newton's method can be used to solve the dual problem. The solution of the primal problem corresponds to the iteratively reweighted least squares algorithm of Merle and Spath (1973). Fischer showed that his algorithm converges globally at a superlinear rate. He claims that his proof is not as involved as the one by Wolfe (1979) who in any event only succeeded in proving that the method is locally convergent at a linear rate. Fischer also provided some numerical results which he compared to those found by Merle and Spath. He used the number of iterations as the criterion for efficient convergence instead of the number of function evaluations performed. This can be misleading since an iteration is generally dependent on the structure of the algorithm and efficiency of the programme code. The numerical results indicate that Fischer's algorithm is more efficient than the algorithm by Merle and Spath (1973).

1.2.3. L2-estimation

The least squares problem can be formulated as a quadratic programming problem as in (1.3) but this is unnecessarily cumbersome. However, the solution to the normal equations is easily obtained by using Gaussian elimination [see Dahlquist and Bjorck (1974)]. In this case the estimates of the β's are linear combinations of the elements of y. Specifically we have

$$ b = (X'X)^{-1} X'y \tag{1.6} $$

Note that when p ≠ 2 the estimate b cannot be expressed as an explicit function of X and y. The Gauss-Markoff theorem indicates that the least squares estimator is the best linear unbiased estimator (BLUE), see Appendix 1A. It is important to note that this theorem refers only to the class of linear estimators. It does not consider nonlinear alternatives (e.g., values of p ≠ 2).
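Equation (1.6) can be illustrated by forming the normal equations and solving them by Gaussian elimination. The sketch below is not one of the book's programs: the function name and the small data set are assumptions, and partial pivoting is our own choice for numerical safety.

```python
def normal_equations(X, y):
    """Solve (X'X) b = X'y by Gaussian elimination with partial pivoting."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Forward elimination
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular to working precision")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = A[r][k] - sum(A[r][c] * b[c] for c in range(r + 1, k))
        b[r] = s / A[r][r]
    return b


# Hypothetical data lying exactly on y = 1 + 2x
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
b = normal_equations(X, y)
```

Because the data lie exactly on a line, the least squares estimate reproduces the intercept and slope of that line.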

The statistical tractability of least squares estimators has made least squares the most popular method of estimation. Typically the least squares estimator is a kind of sample mean estimator [Kiountouzis (1973)]. This can be seen by considering the case k = 1 (location model). Hence (1.1) becomes $y = \beta_1 + e$, with least squares estimate $b_1 = \bar{y}$. This mean-estimator can be undesirable when the error distribution has long tails [Blattberg and Sargent (1971)]. An obvious alternative would be the median-type estimator. This in fact coincides with the case p = 1 which will now be discussed.

1.2.4. L1-estimation

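The L1-norm estimation problem can be formulated as

$$ \begin{aligned} \min\quad & \sum_{i=1}^{n} (u_i + v_i) \\ \text{subject to}\quad & x_i b + u_i - v_i = y_i, \quad u_i, v_i \ge 0, \quad i = 1, \ldots, n \\ & b_j \ \text{unconstrained}, \quad j = 1, \ldots, k \end{aligned} \tag{1.7} $$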

It therefore follows that the L1-norm problem can be formulated as an LP problem in 2n + k variables and n constraints. As with any linear programming problem a dual linear programming problem exists. Denote the dual variables by $f_i$. The dual of (1.7) is formulated as follows [see Wagner (1959)]:

$$ \begin{aligned} \max\quad & \sum_{i=1}^{n} f_i y_i \\ \text{subject to}\quad & -1 \le f_i \le 1, \quad i = 1, \ldots, n \\ & \sum_{i=1}^{n} f_i x_{ij} = 0, \quad j = 2, \ldots, k \end{aligned} $$

Wagner (1959) also showed that by setting $w_i = f_i + 1$ the bounded variable dual problem may be formulated in terms of the nonnegative (upper bounded) variables $w_i$:

$$ \begin{aligned} \max\quad & \sum_{i=1}^{n} w_i y_i \\ \text{subject to}\quad & 0 \le w_i \le 2, \quad i = 1, \ldots, n \\ & \sum_{i=1}^{n} w_i x_{ij} = \sum_{i=1}^{n} x_{ij} \end{aligned} $$

Barrodale and Roberts (1973) managed to reduce the total number of primal simplex iterations by means of a process bypassing several neighbouring simplex vertices in a single iteration. Subroutine L1 may be found in Barrodale and Roberts (1974). Armstrong et al. (1979) modified the Barrodale and Roberts algorithm by using the revised simplex algorithm of linear programming which reduces storage requirements and uses an LU decomposition to maintain the current basis.

A number of interesting geometrical properties of the solution to the L1-norm problem have been derived. We shall list a few of these. The interested reader is referred to Appa and Smith (1973) and Gentle et al. (1977) for more detail.

1) If the matrix X is of full column rank, i.e., r(X) = k, then there exists an optimal hyperplane passing through k of the points. If r(X) = r < k there exists an optimal hyperplane which will pass through r of the observations.
2) Let n+ be the number of observations with positive deviations, n− the corresponding number with negative deviations and n* the maximum number of observations that lie on any hyperplane. Then |n+ − n−| ≤ n*.
3) For n odd every L1 hyperplane passes through at least one point.
4) For n odd there is at least one observation which lies on every L1 hyperplane.

One point that remains to be considered is that of optimality.
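Property 1 suggests a simple, if inefficient, way of finding an L1 line in simple regression (k = 2): examine only the lines through pairs of observations. The sketch below does this by enumeration. It is not the simplex-based approach of Barrodale and Roberts; the function name and the data (four collinear points and one outlier) are assumptions for illustration.

```python
from itertools import combinations


def l1_line_by_enumeration(xs, ys):
    """Best L1 line among all lines through two data points (k = 2)."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, not of the form y = a + b x
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        total = sum(abs(y - intercept - slope * x) for x, y in zip(xs, ys))
        if best is None or total < best[0]:
            best = (total, intercept, slope)
    return best


# Hypothetical data: four points on y = x and one outlier at x = 4
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 20.0]
total, intercept, slope = l1_line_by_enumeration(xs, ys)
# Number of observations lying on the fitted line
on_line = sum(1 for x, y in zip(xs, ys)
              if abs(y - intercept - slope * x) < 1e-9)
```

The enumeration costs of the order of n² objective evaluations, so it is useful only as a check on small problems. In this data set the outlier does not pull the fitted line away from the four collinear points.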

Optimality conditions for the nonlinear L1-norm problem are derived in Chapter 2 (see §2.2). The linear case is treated as a corollary of the nonlinear case.

Corollary 1.1 Let $A = \{ i \mid y_i - x_i b = 0 \}$, $I = \{1, 2, \ldots, n\} - A$ be the sets of indices corresponding to the active and inactive functions respectively. Then a necessary and sufficient condition for parameters b to be a global L1-norm solution is the existence of multipliers -1
0 Let

and

and

0

j =

2, . . . , k

I = 1,. . ., n

A number of geometrical properties of the L∞-norm solution have been derived by Appa and Smith (1973).

1) For problem (1.10) there exists one optimal hyperplane which is vertically equidistant from at least k + 1 observations.
2) Any k + 1 observations determining the optimal hyperplane must lie in the convex hull of n observations.

Barrodale and Phillips (1975) proposed a dual method for solving the primal formulation of the L∞-norm estimation problem. A FORTRAN code CHEB is also given by these authors. Armstrong and Kung (1980) also proposed a dual method for this problem. By maintaining a reduced basis the algorithm bypasses certain simplex vertices.

Optimality conditions for the L∞-norm problem also follow from the nonlinear case (§3.2 in Chapter 3) and are given in the following corollary:

Corollary 1.2

Let $A = \{ i \mid |y_i - x_i b| = d_\infty \}$, $I = \{1, 2, \ldots, n\} - A$ be the sets of indices corresponding to the active and inactive functions respectively. Then a necessary and sufficient condition for parameters b to be a global L∞-norm solution is the existence of nonnegative multipliers $\alpha_i$ such that

$$ \sum_{i=1}^{n} \alpha_i = \sum_{i \in A} \alpha_i = 1, \qquad \alpha_i = 0 \quad (i \in I), \qquad \sum_{i \in A} \alpha_i \, \mathrm{sign}(y_i - x_i b)\, x_i = 0 $$
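The conditions of Corollary 1.2 are easy to check in the location model (k = 1, $x_i$ = 1), where the L∞ estimate is the midrange. The sketch below assumes that $d_\infty$ denotes the largest absolute residual, and it uses equal multipliers on the active observations, a choice that happens to work for this hypothetical data set. The function names are our own.

```python
def chebyshev_location(y):
    """Midrange: minimizes max |y_i - b| in the location model."""
    b = (min(y) + max(y)) / 2.0
    d = max(abs(yi - b) for yi in y)   # assumed meaning of d_infinity
    return b, d


def sign(v):
    return (v > 0) - (v < 0)


y = [1.0, 2.0, 4.0, 9.0]               # hypothetical observations
b, d_inf = chebyshev_location(y)

# Active set: observations whose absolute residual equals d_infinity
active = [i for i, yi in enumerate(y) if abs(abs(yi - b) - d_inf) < 1e-12]

# Equal nonnegative multipliers summing to one (an assumption of this sketch)
alpha = {i: 1.0 / len(active) for i in active}

# Left-hand side of the last condition of the corollary, with x_i = 1
stationarity = sum(alpha[i] * sign(y[i] - b) * 1.0 for i in active)
```

Here the estimate is equidistant from k + 1 = 2 observations, the smallest and the largest, in agreement with the first geometrical property listed above.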

1.3. The choice of p

Forsythe (1972), in estimating the parameters α and β in the simple regression model $y = \alpha + \beta x$, investigated the use of Lp-norm estimation with 1 < p < 2. He argued that since the mean is sensitive to deviations from normality the Lp-norm estimator will be more robust than least squares in estimating the mean. This will be the case when outliers are present. He suggested the compromise use of p = 1.5 when contaminated (normal) or skewly distributed error distributions are encountered. The Davidon-Fletcher-Powell (DFP) method was used as a minimization technique.

In a simulation study Ekblom (1974) compared the Lp-norm estimators with the Huber M-estimators. He also considered the case p < 1. He concluded that the Huber estimator is superior to the Lp-norm estimator when the errors follow a contaminated normal distribution. For other error distributions (Laplace, Cauchy) he suggested that p = 1.25 be used. The proposal that p < 1 should be used for skewly distributed (χ²) errors is interesting and shows that the remark by Rice (1964), that problems where p < 1 are not of interest, is unjustified.

Harter (1977) suggested an adaptive scheme which relies on the kurtosis of the regression error distribution. He suggested L1-norm estimation if the kurtosis β₂ > 3.8, least squares if 2.2 < β₂ < 3.8 and L∞-norm estimation if β₂ < 2.2. This scheme has been extended by Barr (1980), Money et al. (1982), Sposito et al. (1983) and Gonin and Money (1985a,b) and will be discussed in Chapter 5.

Nyquist (1980, 1983) considered the statistical properties of linear Lp-norm estimators. He derived the asymptotic distribution of linear Lp-norm estimators and showed it to be normal for sufficiently small values of p. It is not stated, however, how small p should be. A procedure for selecting the optimal value of p based on the asymptotic variance is proposed which validates the empirical studies by Barr (1980) and Money et al. (1982). Money et al. derived an empirical relationship between the optimal value of p and the kurtosis of the error distribution. Sposito et al. (1983) derived a different empirical relationship which also relates the optimal value of p to the kurtosis of the error distribution. These authors also showed that the formula of Money et al. (1982) yields a reasonable value of p for error distributions with a finite range and suggested the use of their own formula for large sample sizes (n > 200) when it is known that 1 < p < 2. The following modification of Harter's rule was suggested: Use p = 1.5 (Forsythe) if 3 < β₂ < 6, least squares if 2.2 < β₂ < 3 and L∞-norm estimation if β₂ < 2.2. The abovementioned formulae will be the object of study in Chapter 5.

1.4. Statistical Properties of Linear Lp-norm Estimators

We introduce this section by stating the following theorem due to Nyquist (1980, 1983). See also Gonin and Money (1985a).

Theorem 1.1

Consider the linear model $y = X\beta + e$. Let b be the estimate of β chosen so that

$$ S_p(b) = \sum_{i=1}^{n} |y_i - x_i b|^p $$

is a minimum where 1 ≤ p < ∞. Assume that:

A1: The errors $e_i$ are independently, identically distributed with common distribution F.
A2: The L1 (and L∞)-norm estimators are unique (in general Lp-norm estimators are uniquely defined for 1 < p < ∞).
A3: The matrix $Q = \lim_{n \to \infty} X'X/n$ is positive definite with rank(Q) = k.
A4a: F is continuous with F'(0) > 0 when p = 1.
A4b: When 1 < p < ∞ the following expectations exist: $E\{|e_i|^{p-1}\}$, $E\{|e_i|^{p-2}\}$, $E\{|e_i|^{2p-2}\}$; and also $E\{|e_i|^{p-1}\,\mathrm{sign}(e_i)\} = 0$.

Under the above four assumptions $\sqrt{n}(b - \beta)$ is asymptotically normally distributed with mean 0 and variance $\omega_p^2 Q^{-1}$ where

Bassett and Koenker (1978) considered the case p = 1 and the following corollary results. [See also Dielman and Pfaffenberger (1982b)].

Corollary 1.3 For the linear model let b be the estimate of β such that

$$ S_1(b) = \sum_{i=1}^{n} |y_i - x_i b| $$

is a minimum. Under the assumptions A1, A2, A3 and A4a of Theorem 1.1 it follows for a random sample of size n that $\sqrt{n}(b - \beta)$ is asymptotically normally distributed with mean 0 and variance $\lambda^2 (X'X)^{-1}$ where $\lambda^2/n$ is the variance of the sample median of residuals.


1.5. Confidence intervals for β

1.5.1. Case 1 < p < ∞

In view of the asymptotic results of Nyquist (1980, 1983) we can construct the following confidence intervals for the components of β. Specifically the 100(1 − α)% confidence interval for $\beta_j$ is given by

$$ b_j \pm z_{\alpha/2} \sqrt{\omega_p^2 \, (X'X)^{-1}_{jj}} $$

where $(X'X)^{-1}_{jj}$ denotes the jth diagonal element of $(X'X)^{-1}$ and $z_{\alpha/2}$ denotes the appropriate percentile of the standard normal distribution.

A major drawback is that $\omega_p^2$ is unknown. However, we have performed a simulation study in which the sample moments of the residual distribution were used to estimate $\omega_p^2$. The estimate $\hat\omega_p^2$ was calculated as follows:

$$ \hat\omega_p^2 = \frac{m_{2p-2}}{[(p-1)\, m_{p-2}]^2} \quad \text{with} \quad m_r = \frac{1}{n} \sum_{i=1}^{n} |\hat e_i|^r $$

where $\hat e_i$ is the ith residual from the Lp-norm fit.
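The moment estimate and the resulting interval can be computed directly. The sketch below is illustrative only: the function names and the numerical inputs are assumptions, and it is not the code used in the simulation study. Note that for p < 2 the moment $m_{p-2}$ involves negative powers of the absolute residuals, so a residual that is exactly zero makes the estimate undefined.

```python
from statistics import NormalDist


def omega_hat_squared(residuals, p):
    """Moment estimate m_{2p-2} / [(p - 1) m_{p-2}]^2 for 1 < p < infinity.

    For p < 2 a zero residual raises ZeroDivisionError, because
    m_{p-2} then contains a negative power of zero.
    """
    n = len(residuals)

    def moment(r):
        return sum(abs(e) ** r for e in residuals) / n

    return moment(2 * p - 2) / ((p - 1) * moment(p - 2)) ** 2


def confidence_interval(b_j, omega_sq, xtx_inv_jj, alpha=0.05):
    """Interval b_j +/- z_{alpha/2} * sqrt(omega^2 * (X'X)^{-1}_{jj})."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * (omega_sq * xtx_inv_jj) ** 0.5
    return b_j - half_width, b_j + half_width


# Hypothetical residuals and a hypothetical diagonal element of (X'X)^{-1}
w2_ls = omega_hat_squared([1.0, -1.0, 2.0, -2.0], p=2.0)
w2_15 = omega_hat_squared([1.0, -1.0, 4.0, -4.0], p=1.5)
lower, upper = confidence_interval(8.0, w2_ls, 0.1)
```

For p = 2 the estimate reduces to the mean squared residual, which gives a quick sanity check on an implementation.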

The simulation was performed using a two parameter model of the form

$$ y_i = 10 + \beta_1 x_{i1} + \beta_2 x_{i2} $$

In this model the parameters $\beta_1$ and $\beta_2$ were fixed at 8 and −6 respectively. Values of $(x_{i1}, x_{i2})$ for i = 1, 2, ..., n (n = 30, 50, 100, 200, 400) were selected from a uniform distribution over the interval 0 to 20 and held fixed in all 500 samples. In all the runs a uniform error distribution with mean zero and variance 25 was used. The exact $\omega_p^2$ for the uniform error distribution is given by 3

0

)

> J

i — I , ... ,n

We have therefore succeeded in reducing the 2n inequality constraints to n equality constraints whilst increasing the number of variables from n + k to 2n + k variables. In a mathematical programming framework this formulation is attractive because of the reduced number of constraints. In deriving optimality conditions, however, the formulation (2.3) will be used. We now discuss optimality conditions for the nonlinear L1-norm estimation problem. The well-known Karush (1939), Kuhn and Tucker (1951) necessary conditions (feasibility, complementary slackness and stationarity) of NLP will be utilized in the derivation of the optimality conditions. Optimality conditions for the linear L1-norm problem (see Chapter 1) will be stated as a corollary.

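2.2. Optimality conditions for L1-norm estimation problems

The following concepts will be needed:

• Active function set

$$ A = \{ i \mid y_i - f_i(x_i, \theta) = 0 \} $$

• Inactive function set

$$ I = \{ i \mid y_i - f_i(x_i, \theta) \ne 0 \} = \{1, \ldots, n\} - A $$

• Active constraint sets

$$ K_1 = \{ i \mid y_i - f_i(x_i, \theta) = u_i \}, \qquad K_2 = \{ i \mid y_i - f_i(x_i, \theta) = -u_i \}, \qquad K = K_1 \cup K_2 $$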

C H A P T E R 2: The nonlinear L \ -n orm estim ation problem

• Feasible region

$$ F = \{ (u, \theta) \in \Re^n \times \Re^k \mid y_i - f_i(x_i, \theta) - u_i \le 0, \ -y_i + f_i(x_i, \theta) - u_i \le 0 \} $$

• The Lagrangian function for problem (2.3)

$$ L(u, \theta, \lambda) = \sum_{i=1}^{n} \{ (1 - \lambda_i - \lambda_{i+n}) u_i + (\lambda_i - \lambda_{i+n}) [y_i - f_i(x_i, \theta)] \} $$

We shall assume that at least one $u_i > 0$. The following condition or constraint qualification will also be used:

Definition 2.1 Let $e_i$ be an ith unit vector. A point $(u^*, \theta^*)$ such that the active constraint gradients evaluated at $(u^*, \theta^*)$

$$ \begin{pmatrix} -e_i \\ -\nabla f_i(x_i, \theta^*) \end{pmatrix} \ i \in K_1 \quad \text{and} \quad \begin{pmatrix} -e_i \\ \nabla f_i(x_i, \theta^*) \end{pmatrix} \ i \in K_2 $$

are linearly independent is termed a regular point of the constraint set.

The Karush-Kuhn-Tucker (KKT) theorem will now be stated.

Theorem 2.1 Suppose the functions $f_i(x_i, \theta)$ are continuously differentiable with respect to θ and that $(u^*, \theta^*)$ is a regular local minimum point of problem (2.3). Then the following conditions are necessary for a local minimum point:

(a) (Feasibility). $(u^*, \theta^*) \in F$

(b) (Complementary slackness). There exist multipliers $\lambda_i^*, \lambda_{i+n}^* \ge 0$, i = 1, ..., n, such that

$$ \lambda_i^* \left( y_i - f_i(x_i, \theta^*) - u_i^* \right) = 0, \qquad \lambda_{i+n}^* \left( y_i - f_i(x_i, \theta^*) + u_i^* \right) = 0 $$

(c) (Stationarity).

$$ \nabla L(u^*, \theta^*, \lambda^*) = \begin{pmatrix} \nabla_u L \\ \nabla_\theta L \end{pmatrix} = 0 $$

For notational convenience denote $h_i(\theta) = y_i - f_i(x_i, \theta)$, $h(\theta) = [h_1(\theta), \ldots, h_n(\theta)]'$, $y = (y_1, \ldots, y_n)'$ and $f(x, \theta) = [f_1(x_1, \theta), \ldots, f_n(x_n, \theta)]'$. Then the active function set is $A = \{ i \mid h_i(\theta^*) = 0 \}$ and inactive function set is $I = \{ i \mid h_i(\theta^*) \ne 0 \}$. The KKT conditions can be simplified and will be stated as the following necessity theorem.


Theorem 2.2 Suppose $h_i(\theta)$ is continuously differentiable with respect to θ and that θ* is a regular point of (2.3) and a local minimum point of (2.2). Then a necessary condition for θ* to be a local minimum point is the existence of multipliers $-1 \le \alpha_i \le 1$ for all i ∈ A such that

$$ \sum_{i \in I} \mathrm{sign}[h_i(\theta^*)] \nabla h_i(\theta^*) + \sum_{i \in A} \alpha_i \nabla h_i(\theta^*) = 0 $$

Proof Let $I_1 = \{ i \mid h_i(\theta^*) > 0 \}$ and $I_2 = \{ i \mid h_i(\theta^*) < 0 \}$ be the inactive function sets. KKT conditions (b) and (c) state that

A ,*{M tf*) - « tj,„ + n >— 0 and

At* + At-+n = 1 it follows that —1 < a t < 1 and the proof is complete.

|

Remarks
(1) If the functions $h_i(\theta)$ are convex or linear then the optimality condition of Theorem 2.2 will be necessary as well as sufficient.
(2) At the optimal solution the sets $K_1 \cup K_2 = \{1, \ldots, n\}$. Hence all the constraints are active at optimality. This fact is used in the Murray and Overton (1981) algorithm (see §2.3.3.1).
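The multiplier condition of Theorem 2.2 can be verified numerically on a small problem. The sketch below assumes a one-parameter model $f(x, \theta) = e^{\theta x}$ and three hypothetical data points, (0,0), (1,1) and (2,4). It handles only the case of a single active function, for which the multiplier can be solved for directly; the function name and tolerance are our own.

```python
import math


def multiplier_check(xs, ys, theta, tol=1e-6):
    """Multiplier condition for the scalar model f(x, theta) = exp(theta * x).

    Returns (alpha, ok) when exactly one function is active, otherwise None.
    """
    h = [y - math.exp(theta * x) for x, y in zip(xs, ys)]
    grad_h = [-x * math.exp(theta * x) for x in xs]  # dh_i / dtheta
    active = [i for i, hi in enumerate(h) if abs(hi) <= tol]
    inactive = [i for i in range(len(h)) if i not in active]
    # Contribution of the inactive functions: sum of sign(h_i) * grad h_i
    fixed = sum(math.copysign(1.0, h[i]) * grad_h[i] for i in inactive)
    if len(active) != 1 or grad_h[active[0]] == 0.0:
        return None  # this sketch only treats one active function
    alpha = -fixed / grad_h[active[0]]
    return alpha, -1.0 <= alpha <= 1.0


xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
result = multiplier_check(xs, ys, math.log(2.0))   # candidate theta* = ln 2
no_active = multiplier_check(xs, ys, 1.0)          # no active function here
```

At θ = ln 2 the third function is active and the required multiplier lies inside [−1, 1], so the necessary condition holds at that point.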


In linear LAV estimation the model $y = X\theta + e$ is fitted where X is an n × k matrix and y, θ and e are n-, k- and n-vectors respectively. Denote the ith row of X by $x_i = (x_{i1}, \ldots, x_{ik})$, then $h_i(\theta) = y_i - x_i \theta$. Define $A = \{ i \mid y_i - x_i \theta^* = 0 \}$ and $I = \{ i \mid y_i - x_i \theta^* \ne 0 \}$.

Corollary 2.1 In linear LAV estimation a necessary and sufficient condition for θ* to be a global LAV solution point is the existence of multipliers $-1 \le \alpha_i \le 1$ for all i ∈ A such that

$$ \sum_{i \in I} \mathrm{sign}(y_i - x_i \theta^*)\, x_i' + \sum_{i \in A} \alpha_i x_i' = 0 $$

The following Necessity Theorem is stated in terms of directional derivatives d (see Appendix 2B).

Theorem 2.3 Suppose the functions $h_i(\theta)$ are continuously differentiable with respect to θ. Then a necessary condition for θ* to be a local minimum point of problem (2.2) is that

$$ \sum_{i \in I} d'\, \mathrm{sign}[h_i(\theta^*)] \nabla h_i(\theta^*) + \sum_{i \in A} |d' \nabla h_i(\theta^*)| \ge 0 \quad \text{for all } d \in \Re^k $$

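The following theorem constitutes sufficiency conditions for optimality.

Theorem 2.4 Suppose the functions $h_i(\theta)$ are twice continuously differentiable with respect to θ. Then θ* is an isolated (strong) minimum point of $S_1(\theta)$ if there exist multipliers $-1 \le \alpha_i \le 1$ for all i ∈ A such that

$$ \sum_{i \in I} \mathrm{sign}[h_i(\theta^*)] \nabla h_i(\theta^*) + \sum_{i \in A} \alpha_i \nabla h_i(\theta^*) = 0 $$

and for every i ∈ A and d ≠ 0 satisfying:

$$ d' \nabla h_i(\theta^*) \ \begin{cases} = 0 & \text{if } |\alpha_i| \ne 1 \\ \ge 0 & \text{if } \alpha_i = 1 \end{cases} $$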

0

Note that this is a one-dimensional minimization or exact line search problem. In practice this is seldom used; instead inexact line search procedures are used. Only a sufficient decrease in f(θ) is therefore sought:

$$ f(\theta^j + \gamma_j \delta\theta^j) < f(\theta^j) - \epsilon \qquad (\epsilon > 0) $$

In Appendix 2B we discuss some line search procedures. The abovementioned methods assume that the objective function f(θ) is continuously differentiable. Since the objective functions $S_1(\theta)$ and $S_\infty(\theta)$ are not differentiable they need special treatment.

Section 2.8: Algorithm s fo r solving the nonlinear L i-n o r m estim ation problem


2.3.2. Type I Algorithms

We now consider the SLP methods.

2.3.2.1 The Anderson-Osborne-Watson algorithm.

This method is due to Osborne and Watson (1971) and Anderson and Osborne (1977a). The underlying philosophy behind this method is not difficult. At each step a first-order Taylor series (linear approximation) of $f_i(x_i, \theta)$ is used. The method is therefore essentially a generalization of the Gauss-Newton method for nonlinear least squares extended to solve L1-norm estimation problems.

The algorithm

Step 0: Given an initial estimate θ¹ of the optimal θ*, choose μ = 0.1 and λ = 0.0001 and set j = 1.

Step 1: Calculate δθ to minimize

$$ \sum_{i=1}^{n} \left| y_i - f_i(x_i, \theta^j) - \nabla f_i(x_i, \theta^j)' \delta\theta \right| $$

Let the minimum be $\bar S^j$ with $\delta\theta = \delta\theta^j$.

Step 2: Choose $\gamma_j$ as the largest number in the set {1, μ, μ², ...} for which

$$ \frac{S_1(\theta^j) - S_1(\theta^j + \gamma\, \delta\theta^j)}{\gamma \left( S_1(\theta^j) - \bar S^j \right)} \ge \lambda $$

Let the minimum be $S^{j+1}$ with $\gamma = \gamma_j$.

Step 3: Set $\theta^{j+1} = \theta^j + \gamma_j \delta\theta^j$ and return to Step 1. Repeat until certain convergence criteria are met.
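Steps 0 to 3 can be sketched for a model with a single parameter. In that case the Step 1 subproblem is a convex piecewise-linear function of one variable, so its minimum is attained at a breakpoint and no LP code is needed. This shortcut is an assumption of the sketch and is not how subroutine L1 or program ANDOSBWAT work. The data (0,0), (1,1), (2,4), the model $y = e^{\theta x}$, the function names and the stopping tolerance are likewise chosen for illustration.

```python
import math

# Illustrative data and model (assumed for this sketch)
XS = [0.0, 1.0, 2.0]
YS = [0.0, 1.0, 4.0]


def model(x, theta):
    return math.exp(theta * x)


def model_grad(x, theta):
    return x * math.exp(theta * x)


def s1(theta):
    """L1 objective: sum of absolute residuals."""
    return sum(abs(y - model(x, theta)) for x, y in zip(XS, YS))


def linearized(theta, step):
    """Objective of the Step 1 subproblem for a trial step."""
    return sum(abs(y - model(x, theta) - model_grad(x, theta) * step)
               for x, y in zip(XS, YS))


def solve_subproblem(theta):
    """Step 1 for one parameter: try every breakpoint of the subproblem."""
    breakpoints = [(y - model(x, theta)) / model_grad(x, theta)
                   for x, y in zip(XS, YS) if model_grad(x, theta) != 0.0]
    step = min(breakpoints, key=lambda d: linearized(theta, d))
    return step, linearized(theta, step)


def anderson_osborne_watson(theta, mu=0.1, lam=1e-4, tol=1e-12, max_iter=50):
    history = []
    for _ in range(max_iter):
        step, s_bar = solve_subproblem(theta)          # Step 1
        predicted = s1(theta) - s_bar
        history.append((theta, step, s_bar))
        if predicted <= tol:                           # assumed stopping rule
            break
        gamma = 1.0                                    # Step 2
        for _ in range(30):
            actual = s1(theta) - s1(theta + gamma * step)
            if actual / (gamma * predicted) >= lam:
                break
            gamma *= mu
        else:
            break  # no acceptable step length found
        theta += gamma * step                          # Step 3
    return theta, history


theta_star, history = anderson_osborne_watson(1.0)
```

The list `history` holds, for each iteration, the current estimate, the computed step and the subproblem minimum, so the first entry can be compared with a hand calculation started from θ¹ = 1.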

Remarks
(1) The original nonlinear problem is reduced to a sequence of linear L1-norm regression problems, each of which can be solved efficiently as a standard LP problem using the Barrodale and Roberts (1973, 1974) or Armstrong et al. (1979) algorithms.
(2) The inexact line search in Step 2 ensures convergence and is due to Anderson and Osborne (1977a) (see Appendix 2B). Osborne and Watson (1971) originally proposed an exact line search: Calculate γ > 0 to minimize $\sum_{i=1}^{n} |y_i - f_i(x_i, \theta^j + \gamma\, \delta\theta^j)|$.

Example 2.3 We shall illustrate one step of the algorithm on the following simple example. Given the data points (0,0), (1,1), (2,4) we wish to fit the curve $y = e^{\theta x}$ to the three points. The objective function is plotted in Fig. 2.3.

Figure 2.3 $S_1(\theta)$ vs θ.

The solution is θ* = ln 2 = 0.69315. Dennis and Schnabel (1983:226) also used this example to illustrate the slow convergence of the Gauss-Newton method when a least squares fit is performed. We start with θ¹ = 1.

Step 1: We have to solve the problem: $\min \sum_{i=1}^{3} |y_i - e^{\theta^1 x_i} - x_i e^{\theta^1 x_i} \delta\theta|$ for δθ. This is equivalent to solving the linear programming problem:

$$ \min \sum_{i=1}^{3} (u_i + v_i) $$

subject to the 3 constraints

$$ \begin{aligned} u_1 - v_1 &= 1 \\ u_2 - v_2 - 2.7183\,\delta\theta &= 1.7183 \\ u_3 - v_3 - 14.7781\,\delta\theta &= 3.3891 \end{aligned} $$

with $u_i, v_i \ge 0$ and δθ unconstrained in sign. The solution to this LP problem yields δθ¹ = −0.2293, u₁ = 1, u₂ = 1.0949, u₃ = v₁ = v₂ = v₃ = 0. The minimum $\bar S^1$ = 2.0949. The number of simplex iterations using subroutine L1 of Barrodale and Roberts (1974) was 1.

Step 2: The inexact line search proceeds as follows: Select γ = 1, calculate

$$ \frac{S_1(\theta^1) - S_1(\theta^1 + \gamma\, \delta\theta^1)}{\gamma \left( S_1(\theta^1) - \bar S^1 \right)} = \frac{6.1073 - 2.8321}{6.1073 - 2.0949} = 0.816 > 0.0001 $$

Hence γ₁ = 1 is selected. The approximate value of S² = 2.8321.

Step 3: Set θ² = θ¹ + γ₁δθ¹ = 0.7707. The algorithm proceeds in this fashion to the optimal solution θ* = 0.693147 with $S_1(\theta^*)$ = 2. The residuals are −1, −1 and 0. Thus the curve goes through one point exactly as is to be expected. The total number of iterations was 4 and the total number of simplex iterations also equalled 4. The line search step γ = 1 was chosen at every iteration (quadratic convergence).

Under the usual smoothness assumptions and the assumption that the Jacobian matrix is of full column rank, the following convergence properties were derived:

Convergence properties

Assuming an exact minimum is located in Step 2, then:
(1) $\bar S^j \le S^j$ for all j; if $\bar S^j < S^j$ then $S^{j+1} < S^j$ for all j.
(2) At a limit point of the algorithm $\bar S^j = S^j$.
(3) If $\bar S^j = S^j$ and if $\delta\theta^j$ is an isolated (unique) local minimum in Step 1 then a limit point of the sequence $\{\theta^j\}$ is a stationary point of (2.2).


If the inexact line search is performed in Step 2, then:
(4) If $S_1(\theta^j) \ne \bar S^j$ and given that 0 < λ < 1, then there exists a $\gamma_j \in \{1, \mu, \mu^2, \ldots\}$ with 0 < μ < 1 such that

$$ \frac{S_1(\theta^j) - S_1(\theta^j + \gamma_j \delta\theta^j)}{\gamma_j \left( S_1(\theta^j) - \bar S^j \right)} \ge \lambda $$

(5) If the algorithm does not terminate in a finite number of iterations then the sequence $\{S_1(\theta^j)\}$ converges to $\bar S^j$ as j → ∞.
(6) Any limit point of $\{\theta^j\}$ is a stationary point of $S_1(\theta)$, i.e., a point θ* such that $S_1(\theta^*) = \bar S^*$.

Convergence rate

Anderson and Osborne (1977a) show that if the points $\theta^j$ all lie in a compact (closed and bounded) set and a so-called “multiplier” condition holds, then the ultimate convergence rate is quadratic. Cromme (1978) has shown that the condition of strong uniqueness is sufficient for ensuring quadratic convergence of Gauss-Newton methods.

Definition 2.3 θ* is strongly unique if ∃ a γ > 0 such that

$$ S_1(\theta) \ge S_1(\theta^*) + \gamma \| \theta - \theta^* \| $$

for all θ in a neighbourhood of θ*.

Jittorntrum and Osborne (1978) show that strong uniqueness is implied by the multiplier condition. It is therefore a weaker condition, however, not the weakest condition that ensures quadratic convergence.

• The FORTRAN program ANDOSBWAT with an example problem is listed in Appendix 2H. Convergence occurred at iteration 8 ($S^8 = \bar S^8$ = 0.001563). The intermediate results are in complete agreement with those found by Anderson and Osborne (1977a).

2.3.2.2 The Anderson-Osborne-Levenberg-Marquardt algorithm.

Anderson and Osborne (1977a, 1977b) couched the L1- and L∞-norm problems within the framework of so-called polyhedral norms (see Appendix 2F). A discussion of these norms can be found in Fletcher (1981a: Chapter 14). The polyhedral norm approximation problem is defined as:


Find parameters θ which minimize

$$ \| f(\theta) \|_B $$

where B is a matrix as defined in Appendix 2F. Step 1 of the previous algorithm can therefore be written in polyhedral norm notation as:

Step 1: Calculate δθ and a scalar u by solving the problem:

$$ \text{minimize } u \quad \text{subject to} \quad B\left( y - f(x, \theta^j) - \nabla f(x, \theta^j)' \delta\theta \right) \le u e $$

Let the minimum be $\bar S^j$ with $\delta\theta = \delta\theta^j$.

This type of algorithm (Gauss-Newton) will be very effective if the number of active functions at θ* is k. In general this will not necessarily be the case and as a consequence SLP methods will converge slowly if at all. To overcome this problem Anderson and Osborne (1977b) derived a Levenberg-Marquardt algorithm (Appendix 2D). The inexact line search was replaced by a strategy for modifying μ which thereby imposes a bound on the change in θ. The size as well as the direction of δθ is therefore controlled.

The algorithm

Step 0: Given an initial estimate θ¹ of the optimal θ*, select μ₀ = μ₁ = 1 and set j = 1.

Step 1: Calculate δθ by minimizing

80 0

fijl

Let the minimum be S 3 (fij) with 80 = 803. Step 2: Modify fjtj > 0 as follows: Set / = 0, Ti = 0, over := false , /ij = /i;-_ i and 8 = 0.01, 2(a): If / < 1 or Ti < 1 - 8 go to Step 2(c).

10, t = 0.2.


Set / := / + 1 and calculate T - T I P

„l,_

T‘ - T V

^

S,( » + » ' )

-

S l(„ ,

-

2(b): If Ti < 8 and over = true set //+1 = \{fLj + fij) If Ti < 8 and over = false set /z*.+1 = ^filIf Ti < 8 set fLj = (ilIf / > 1 and Ti > 1 - 8 set over := true , /^+1 = ^ (/x^ + /^) and /2; = 2(c): If 7} < 6 or [/ > 1 and Ti > I - 8 ] return to Step 2(b). Otherwise if / = 1 set /iy+i = r/i*If / ^ 1 set /i;+ i = nlj Step 3: set 0J+1 = 0i -f 803 and return to Step 1. Repeat until certain convergence criteria are met.

Remarks {

( 1) When fij > 0 then the matrix I

) V j1

I is of rank k.

)

(2) McLean and Watson (1980) observe that Step 1 can be time consuming since it may involve the solving of many linear programming problems. They suggest an alternative form of the linear subproblem which admits an efficient numerical solution.
(3) Convergence properties analogous to those stated previously can also be found in Anderson and Osborne (1977b). Jittorntrum and Osborne (1980) give necessary and sufficient conditions for quadratic convergence of Gauss-Newton methods when the objective function is a polyhedral norm.

2.3.2.3 The McLean and Watson algorithms.

McLean and Watson (1980) provide two numerically robust and efficient methods that are based on Levenberg-Marquardt like approaches, and are therefore superlinearly convergent. For these algorithms the functions $f_i(x_i, \theta)$ are assumed to be at least once continuously differentiable in an arbitrary neighbourhood of a stationary point of (2.2). Their alternative to the previous approach (§2.3.2.2) is similar to Madsen's (1975) proposal for the nonlinear L∞-norm problem.

In Gauss-Newton methods $f_i(x_i, \theta^j + d)$ is approximated by a first-order Taylor series $f_i(x_i, \theta^j) + \nabla f_i(x_i, \theta^j)' d$. This approximation will only be accurate for “small” d. Hence McLean and Watson (1980) bound the direction by stipulating:

$$ \max_i |d_i| \le \lambda $$

l 0 and 0 < a < j

set

= I-

Step 1: Calculate d3( X j ) by solving:

n min £ | y i - /■(*,-,O1) -

t= l

subject to

max \d{\
∞. (4) The limit points of the iterates $\{\theta^j\}$ are stationary points of $S_1(\theta)$.

Program MCLEANWAT, which is an implementation of Algorithm 1, is listed in Appendix 2H.


Example 2.4 The following example due to Jennrich and Sampson (1968) was used:

$$ \min \sum_{i=1}^{10} \left| y_i - \left( \exp(i\theta_1) + \exp(i\theta_2) \right) \right| $$

where $y_i = 2 + 2i$. The contour lines for the objective function are given in Fig. 2.4.

Figure 2.4 Contours of $S_1(\theta)$ = 33, 35, 37.

The optimal solution is θ* = (0.255843, 0.255843) and $S_1(\theta^*)$ = 32.091941.
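The reported optimum can be checked by evaluating the objective of Example 2.4 directly. The sketch below only evaluates the objective and the residuals at the stated θ*. It does not perform the minimization, and the function name is our own.

```python
import math


def s1(theta1, theta2):
    """L1 objective of Example 2.4 with y_i = 2 + 2i, i = 1, ..., 10."""
    return sum(abs((2 + 2 * i) - (math.exp(i * theta1) + math.exp(i * theta2)))
               for i in range(1, 11))


value = s1(0.255843, 0.255843)

# Residuals at the reported solution, where both parameters are equal
residuals = [(2 + 2 * i) - 2 * math.exp(i * 0.255843) for i in range(1, 11)]
smallest = min(range(10), key=lambda i: abs(residuals[i]))
```

At this point the residual for i = 9 is essentially zero, so the fitted curve passes through one observation, as in Example 2.3.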


General comments: Type I algorithms

• If the conditions for quadratic convergence of GN methods hold then the bound on $\delta\theta^j$ becomes inactive and all Type I methods will be identical.
• If $y_i = f_i(x_i, \theta)$ for all i then the convergence rate is quadratic for any norm (L1, Lp and L∞).
• If $y_i \ne f_i(x_i, \theta)$ for some i then convergence is either slow or quadratic for L1 problems. For Lp-norms the rate will be linear.

2.3.3. Type II Algorithms In the event that strong uniqueness does not prevail, Gauss-Newton methods may converge slowly and hence 2nd-derivative information has to be used. This is achieved by sequential quadratic programming (SQ P). Algorithms of this type are complex, and are therefore stated without worked examples. The interested reader is encouraged to read the relevant papers. Again for notational convenience denote y. - A K ', 0 ) by hi(9).

2.3.3.1 The Murray and Overton algorithm. The Murray and Overton (1981) algorithm is an application and adaptation of the modified Newton algorithm for linearly constrained NLP problems [Gill and Murray (1974)] to solve $L_1$-norm problems. We give a brief outline of the approach, following the description given in Overton (1982). In order to obtain a superlinearly convergent algorithm, second-order derivative information must be utilized. The method utilizes the structure of (2.3) and essentially reduces the problem back to an optimization problem in $k$ variables (parameters). If we solve problem (2.2) directly, then $S_1(\theta)$ may be used as a merit function. Recall (§3.2) that the objective function in (2.3) cannot be used as a merit function. In the algorithm a descent direction is determined from a quadratic program based on a projected Lagrangian function. Two types of Lagrange multiplier estimates (1st- and 2nd-order) are calculated to form the Lagrangian function. A line search is then performed using this direction in order to reduce the merit function.
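The last step of the outline, a line search on the merit function $S_1(\theta)$, can be illustrated with plain backtracking. This is a sketch only: Murray and Overton's line search is more elaborate, the direction below is chosen by hand rather than obtained from their quadratic program, and the test problem (the Jennrich and Sampson objective with $y_i = 2 + 2i$) and all function names are assumptions made for illustration.

```python
import math


def s1(theta):
    """Merit function: sum_i |y_i - (exp(i*theta1) + exp(i*theta2))|, y_i = 2 + 2i."""
    return sum(
        abs(2.0 + 2.0 * i - (math.exp(i * theta[0]) + math.exp(i * theta[1])))
        for i in range(1, 11)
    )


def backtracking(merit, theta, d, alpha=1.0, shrink=0.5, max_halvings=30):
    """Shrink the step until the merit function decreases; return (alpha, new_theta)."""
    base = merit(theta)
    for _ in range(max_halvings):
        trial = [t + alpha * di for t, di in zip(theta, d)]
        if merit(trial) < base:  # simple decrease test, no sufficient-decrease rule
            return alpha, trial
        alpha *= shrink
    return 0.0, list(theta)  # no decrease found: stay at the current point


theta0 = [0.3, 0.3]
direction = [-0.1, -0.1]  # hand-picked direction, for illustration only
alpha, theta1 = backtracking(s1, theta0, direction)
```

With these inputs the full step overshoots and is rejected, and the halved step is accepted. The point of using $S_1(\theta)$ as the merit function is that every accepted step reduces the quantity actually being minimized.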


We shall partly adopt the Murray and Overton notation in this section.

• Active function set: the constraints are assumed to have been ordered accordingly without loss of generality. Thus at any point $\theta \ne \theta^*$ we define $A$ as the set of functions which we think will be zero at the solution $\theta^*$.

• Let

$$\bar h(\theta) = [h_{t+1}(\theta), \dots, h_n(\theta)]', \qquad \hat h(\theta) = [h_1(\theta), \dots, h_t(\theta)]'$$

• Active constraints:

$$u_i - \mathrm{sign}[h_i(\theta)]\, h_i(\theta), \qquad i = t+1, \dots, n$$

$$u_i - h_i(\theta), \quad u_i + h_i(\theta), \qquad i = 1, \dots, t$$

• The matrices of gradients:

$$\hat V_{k \times t} = (\nabla h_1(\theta), \dots, \nabla h_t(\theta)), \qquad \bar V_{k \times (n-t)} = (\nabla h_{t+1}(\theta), \dots, \nabla h_n(\theta))$$

$$\hat \Sigma_{t \times t} = \mathrm{diag}\{ \mathrm{sign}[h_i(\theta)] : i = 1, \dots, t \}, \qquad \bar \Sigma_{(n-t) \times (n-t)} = \mathrm{diag}\{ \mathrm{sign}[h_i(\theta)] : i = t+1, \dots, n \}$$

After a further re-ordering of the active constraints, the active constraints are given in matrix notation by

$$c(\theta) = \begin{pmatrix} \bar u - \bar \Sigma \bar h \\ \hat u - \hat h \\ \hat u + \hat h \end{pmatrix}, \qquad \nabla c(\theta)_{(k+n) \times (t+n)} = \begin{pmatrix} -\bar V \bar \Sigma & -\hat V & \hat V \\ I_{n-t} & 0 & 0 \\ 0 & I_t & I_t \end{pmatrix}$$
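The bookkeeping behind this notation can be sketched in a few lines. The code below assumes a simple tolerance test to guess the active function set; this rule, the tolerance value and the function name `partition_residuals` are illustrative assumptions, not Murray and Overton's active set strategy. It returns the indices of the functions treated as zero, the signs of the remaining functions (the diagonal of the corresponding sign matrix), and the resulting number of active constraints, $n + t$: one for each nonzero function and two for each function in the active set.

```python
def partition_residuals(h, tol=1e-8):
    """Split residuals h_i into an active set (|h_i| <= tol) and the rest.

    Indices are 0-based. Returns (active, inactive, signs, n_active_constraints),
    where signs[i] is sign(h_i) for each inactive index i.
    """
    active = [i for i, value in enumerate(h) if abs(value) <= tol]
    inactive = [i for i, value in enumerate(h) if abs(value) > tol]
    signs = {i: (1.0 if h[i] > 0 else -1.0) for i in inactive}
    # Inactive functions contribute one active constraint each
    # (u_i - sign(h_i) h_i = 0); active functions contribute two
    # (u_i - h_i = 0 and u_i + h_i = 0).
    n_active_constraints = len(inactive) + 2 * len(active)
    return active, inactive, signs, n_active_constraints


# Illustrative residuals (not from the text): the third one is zero.
h = [1.4, -3.8, 0.0, 2.5]
active, inactive, signs, m = partition_residuals(h)
```

Here $n = 4$ and $t = 1$, so five constraints are active, which matches the $t + n$ columns of the constraint gradient matrix.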


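The gradient of the objective function of (2.3) with respect to $(\theta, u)$ is given by

$$\begin{pmatrix} 0 \\ \bar e \\ \hat e \end{pmatrix}$$

where $\bar e$ and $\hat e$ are $(n - t)$- and $t$-unit vectors respectively.

• Orthogonal decomposition of the $k \times t$ matrix $\hat V$ of active function gradients (Appendix 2F):

$\hat V = [\, Y(\theta) \;\; Z(\theta) \,] \begin{pmatrix} R(\theta) \\ 0 \end{pmatrix}$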

=