Complete with valuable FORTRAN programs that help solve nondifferentiable nonlinear Lp- and L∞-norm estimation problems
English Pages 320 [317] Year 1989
Table of contents:
Cover
Half Title
Title Page
Copyright Page
PREFACE
Table of Contents
1: Lp-NORM ESTIMATION IN LINEAR REGRESSION
1.1 The history of curve fitting problems
1.2 The linear Lp-norm estimation problem
1.2.1 Formulation
1.2.2 Algorithms
1.2.3 L2-estimation
1.2.4 L1-estimation
1.2.5 L∞-estimation
1.3 The choice of p
1.4 Statistical Properties of Linear Lp-norm Estimators
1.5 Confidence intervals for β
1.5.1 Case 1 < p < ∞
1.5.2 Case p = 1
1.6 Example
Conclusion
Appendix 1A: Gauss-Markoff Theorem
Additional notes
Bibliography: Chapter 1
2: THE NONLINEAR L1-NORM ESTIMATION PROBLEM
2.1 The nonlinear L1-norm estimation problem
2.2 Optimality conditions for L1-norm estimation problems
2.2.1 Examples
2.3 Algorithms for solving the nonlinear L1-norm estimation problem
2.3.1 Differentiable unconstrained minimization
2.3.2 Type I Algorithms
2.3.2.1 The Anderson-Osborne-Watson algorithm
2.3.2.2 The Anderson-Osborne-Levenberg-Marquardt algorithm
2.3.2.3 The McLean and Watson algorithms
2.3.3 Type II Algorithms
2.3.3.1 The Murray and Overton algorithm
2.3.3.2 The Bartels and Conn algorithm
2.3.4 Type III methods
2.3.4.1 The Hald and Madsen algorithm
2.3.5 Type IV Algorithms
2.3.5.1 The El-Attar-Vidyasagar-Dutta algorithm
2.3.5.2 The Tishler and Zang algorithm
Conclusion
Appendix 2A: Local and global minima
Appendix 2B: One-dimensional line search algorithms
Appendix 2C: The BFGS approach in unconstrained minimization
Appendix 2D: Levenberg-Marquardt approach in nonlinear estimation
Appendix 2E: Rates of convergence
Appendix 2F: Linear Algebra
Appendix 2G: Extrapolation procedures to enhance convergence
Appendix 2H: FORTRAN Programs
Additional notes
Bibliography: Chapter 2
3: THE NONLINEAR L∞-NORM ESTIMATION PROBLEM
3.1 The nonlinear L∞-norm estimation problem
3.2 L∞-norm optimality conditions
3.3 The nonlinear minimax problem
3.4 Examples
3.5 Algorithms for solving nonlinear L∞-norm estimation problems
3.5.1 Type I Algorithms
3.5.1.1 The Anderson-Osborne-Watson algorithm
3.5.1.2 The Madsen algorithm
3.5.1.3 The Anderson-Osborne-Levenberg-Marquardt algorithm
3.5.2 Type II Algorithms
3.5.2.1 The Murray and Overton algorithm
3.5.2.2 The Han algorithm
3.5.3 Type III Algorithms
3.5.3.1 The Watson algorithm
3.5.3.2 The Hald and Madsen algorithm
3.5.4 Type IV algorithms
3.5.4.1 The Charalambous acceleration algorithm
3.5.4.2 The Zang algorithm
Appendix 3A: FORTRAN Programs
Additional notes
Bibliography: Chapter 3
4: THE NONLINEAR Lp-NORM ESTIMATION PROBLEM
4.1 The nonlinear Lp-norm estimation problem (1 < p < ∞)
4.2 Optimality conditions for the nonlinear Lp-norm problem
4.3 Algorithms for nonlinear Lp-norm estimation problems
4.3.1 Watson's algorithm
4.3.2 A first-order gradient algorithm
4.3.2.1 Examples
4.3.3 The mixture method for large residual and ill-conditioned problems
4.3.3.1 Examples
Conclusion
Appendix 4A: Cholesky decomposition of symmetric matrices
Appendix 4B: The singular value decomposition (SVD) of a matrix
Appendix 4C: Fletcher’s line search algorithm
Additional notes
Bibliography: Chapter 4
5: STATISTICAL ASPECTS OF Lp-NORM ESTIMATORS
5.1 Nonlinear least squares
5.1.1 Statistical inference
5.1.1.1 Confidence intervals and joint confidence regions
5.1.1.2 Hypothesis testing
5.1.1.3 Bias
5.2 Nonlinear L1-norm estimation
5.2.1 Statistical inference
5.2.1.1 Confidence intervals
5.2.1.2 Hypothesis testing
5.2.1.3 Bias and consistency
5.3 Nonlinear Lp-norm estimation
5.3.1 Asymptotic distribution of Lp-norm estimators (additive errors)
5.3.2 Statistical inference
5.3.2.1 Confidence intervals
5.3.2.2 Hypothesis testing
5.3.2.3 Bias
5.4 The choice of the exponent p
5.4.1 The simulation study
5.4.2 Simulation results and selection of p
5.4.3 The efficiency of Lp-norm estimators for varying values of p
5.4.4 The relationship between p and the sample kurtosis
5.5 The asymptotic distribution of p̂
5.6 The adaptive algorithm for Lp-norm estimation
5.7 Critique of the model and regression diagnostics
5.7.1 Regression diagnostics (nonlinear least squares)
5.7.2 Regression diagnostics (nonlinear Lp-norm case)
Conclusion
Appendix 5A: Random deviate generation
Appendix 5B: Tables
Appendix 5C: ω²_p for symmetric error distributions
Appendix 5D: Var(p̂) for symmetric error distributions
Appendix 5E: The Cramér-von Mises goodness-of-fit test
Additional notes
Bibliography: Chapter 5
6: APPLICATION OF Lp-NORM ESTIMATION
6.1 Compartmental models, bioavailability studies in Pharmacology
6.1.1 Mathematical models of drug bioavailability
6.1.2 Outlying observations in pharmacokinetic models
6.2 Oxygen saturation in respiratory physiology
6.2.1 Problem background
6.2.2 The data
6.2.3 Modelling justification
6.2.4 Problem solution
6.3 Production models in Economics
6.3.1 The two-step approach
6.3.1.1 Two-step nonlinear least squares
6.3.1.2 Two-step adaptive Lp-norm estimation
6.3.2 The global least squares approach
Concluding remark
Bibliography: Chapter 6
AUTHOR INDEX
SUBJECT INDEX
STATISTICS: Textbooks and Monographs
A Series Edited by D. B. Owen, Coordinating Editor
Department of Statistics, Southern Methodist University, Dallas, Texas

R. G. Cornell, Associate Editor for Biostatistics, University of Michigan
W. J. Kennedy, Associate Editor for Statistical Computing, Iowa State University
A. M. Kshirsagar, Associate Editor for Multivariate Analysis and Experimental Design, University of Michigan
E. G. Schilling, Associate Editor for Statistical Quality Control, Rochester Institute of Technology
Vol. 100: Nonlinear Lp-Norm Estimation, Rene Gonin and Arthur H. Money
Nonlinear Lp-Norm Estimation

RENE GONIN
Institute for Biostatistics
Medical Research Council
Cape Town, South Africa

ARTHUR H. MONEY
Henley The Management College
Henley-on-Thames
England

MARCEL DEKKER, INC.
New York and Basel

ISBN-13: 978-0-8247-8125-5

This book is printed on acid-free paper.

Copyright © 1989 by MARCEL DEKKER, INC. All Rights Reserved

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

MARCEL DEKKER, INC.
270 Madison Avenue, New York, New York 10016

Current printing (last digit):
10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA
Preface
We hope this book will be of use both to the statistician and to the applied mathematician. It is primarily of a computational nature and therefore belongs to the field of statistical computing. The book is self-contained and permits flexibility in the manner in which it can be read. The reader should be able to grasp the underlying computational principles without having had a formal course in statistical computing. Familiarity with linear programming is, however, essential. The book is organized into two parts, namely, numerical solution procedures and statistical analysis.

The original outline for this book arose out of the first author's doctoral thesis. Subsequently the material on nonlinear L1-norm and L∞-norm estimation problems was incorporated. Our main objective is the treatment of nonlinear Lp-norm estimation, which subsumes nonlinear least squares estimation. We realized that an introduction to linear Lp-norm estimation was needed as a preamble to the nonlinear case. Linear Lp-norm estimation is vast enough to warrant a complete text on its own. Limited space permits only a brief review of this problem in Chapter 1. We also felt it important to bring to the statistician's notice useful procedures known to the numerical analyst. We hope that this book will be successful in bridging the gap between the statistician and the numerical analyst.

We have included some FORTRAN programs to assist the reader in solving the nondifferentiable nonlinear L1- and L∞-norm estimation problems. For the continuously differentiable nonlinear Lp-norm problem any general unconstrained minimization routine can be used. This is not necessarily the most efficient avenue of solution.

Sections of the manuscript were used by the first author in a course on nonlinear parameter estimation at the University of Michigan in Ann Arbor during 1988.

We would like to express our gratitude to the Medical Research Council in Cape Town for the use of their computer facilities. This entailed not only computational aspects but also the typesetting of the book by means of TeX. The first author would also like to thank the Medical Research Council for financial support related to this project. The second author acknowledges the continued financial support of the University of Cape Town and the Human Sciences Research Council.

We would like to express our sincere appreciation to Jan Pretorius, Director of the Institute for Biostatistics, for creating the environment for pursuing this project. His continued interest and encouragement was an important factor in completing this task.

We would especially like to thank Professor Vince Sposito of Iowa State University for his careful scrutiny of the manuscript and his many valuable comments. Also, many thanks to Professor Cas Troskie of the University of Cape Town for perusing the manuscript. For the classification of the L1-norm and L∞-norm methods we are indebted to Dr. Alistair Watson of Dundee University. Also, our gratitude to Professor Mike Osborne of the Australian National University, who put into perspective for us the numerical methods of nonlinear L1-norm estimation.

We would like to thank Doctors Joyce Gonin and Sinclair Wynchank for patiently editing the manuscript. Their efforts certainly improved the clarity of the subject matter.

Finally, thanks to our spouses, Joyce and Gillian, for their fortitude in coping with our impatience and idiosyncrasies as p tended to ∞.

Rene Gonin
Arthur H. Money
Nonlinear Lp-Norm Estimation
1 Lp-Norm Estimation in Linear Regression
The fundamental problem of fitting equations to a number of observations has for centuries been of interest to scientists. For example, curve fitting models were used by early astronomers to predict the orbits of celestial bodies. Concerned with the calculation of the distance of a new star from the earth, Galileo† makes the following statement through the interlocutor Salviati in his Dialogue of the two chief world systems:

"Then these observers being capable, and having erred for all that, and the errors needing to be corrected in order for us to get the best possible information from their observations, it will be appropriate for us to apply the minimum amendments and smallest corrections that we can — just enough to remove the observations from impossibility and restore them to possibility."

This statement heralded the beginning of the theory of errors in which the model is estimated from a number of inconsistent observations.

1.1. The history of curve fitting problems
One of the first methods for smoothing random errors was based on averages and was known as the Principle of the Arithmetic Mean: suppose we wish to fit a linear regression model y = α + βx to the observations (x_i, y_i), i = 1, ..., n. The estimates of α and β are determined as follows. Slopes between all possible pairs of points,

    b_ij = (y_j − y_i) / (x_j − x_i),   x_j ≠ x_i,   for i = 1, ..., n − 1;  j = i + 1, ..., n,

are first calculated, and the corresponding intercepts a_ij in each case are then calculated by substitution. The averages of the intercepts and slopes, denoted by a and b respectively, are then taken as estimates of α and β.

† Galilei, Galileo (1629). Dialogue concerning the two chief world systems, Ptolemaic and Copernican. (English translation by Stillman Drake. Revised 2nd edition, UCLA Press, 1967.)

Cotes (1722) noted in certain models that only the response variable (observations y) is subject to measurement errors and suggested a procedure based on weighted arithmetic means with weights proportional to x. In the model y = βx + ε, the slope β is estimated by b = ȳ/x̄, the ratio of the two means, which in turn is equivalent to the zero sum residuals condition:

    Σ_{i=1}^{n} (y_i − b x_i) = 0                                        (1.1)
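The two early estimators just described are simple enough to sketch in a few lines. The code below is our own illustration, not from the book (which uses FORTRAN); the function names are hypothetical. It fits a line by the Principle of the Arithmetic Mean and checks that Cotes' ratio-of-means slope satisfies the zero-sum-residuals condition (1.1).

```python
from itertools import combinations

def arithmetic_mean_fit(xs, ys):
    """Principle of the Arithmetic Mean: average pairwise slopes, then
    average the intercepts obtained by substitution."""
    slopes, intercepts = [], []
    for i, j in combinations(range(len(xs)), 2):
        if xs[j] == xs[i]:
            continue  # b_ij is undefined when x_j = x_i
        b_ij = (ys[j] - ys[i]) / (xs[j] - xs[i])
        slopes.append(b_ij)
        intercepts.append(ys[i] - b_ij * xs[i])
    return sum(intercepts) / len(intercepts), sum(slopes) / len(slopes)

def cotes_slope(xs, ys):
    """Cotes (1722): b = ybar / xbar for the no-intercept model y = beta*x."""
    return (sum(ys) / len(ys)) / (sum(xs) / len(xs))

# On exactly collinear data the arithmetic-mean method recovers the line.
a, b = arithmetic_mean_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Cotes' slope satisfies the zero-sum-residuals condition (1.1).
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
b_cotes = cotes_slope(xs, ys)
residual_sum = sum(y - b_cotes * x for x, y in zip(xs, ys))
```

Note that the zero-sum condition holds by construction: Σ(y_i − b x_i) = n(ȳ − b x̄) = 0 when b = ȳ/x̄.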
This is the same as stipulating that the ideal line must pass through the centroid (x̄, ȳ) of the data.

Euler (1749) and Mayer (1750) independently derived the so-called Method of Averages for fitting a straight line to observed data. The observations are subdivided on some subjective basis into as many subsets as there are coefficients. The grouping is made according to the value of one of the explanatory variables. Those with the largest values of this variable are grouped together, and so on. Condition (1.1) is then applied to each observation of the subset.

Boscovich (1757) considered the model y = α + βx + ε and proposed that two criteria be met when fitting the best straight line†. The estimates a and b of the parameters α and β are determined such that:

Criterion 1: Σ_{i=1}^{n} |y_i − a − b x_i| is a minimum.

Criterion 2: The sum of the positive and negative residuals in the y-variable are equal, i.e., Σ_{i=1}^{n} (y_i − a − b x_i) = 0.

Criterion 1 is the objective of Least Absolute Value (LAV) estimation (also referred to as L1-norm estimation), while Criterion 2 ensures that the fitted line passes through the centroid of the data. Boscovich's solution procedure is based on geometric principles. Solution procedures were derived by Simpson (1760) and Laplace (1793, 1799).

† L1-norm estimation has in fact been traced back to Galileo (1629), see e.g., Ronchetti (1987).
Laplace (1786) also used the Boscovich principles to test the adequacy of the relationship y = α + βx for the data (x_i, y_i). He described an algebraic procedure for determining the coefficients α and β.

Gauss (1806) developed the method of minimizing the squared observation errors in his works on celestial mechanics, which subsequently became known as the Method of Least Squares (L2-norm estimation). Although Gauss had used least squares since 1795, Legendre (1805) was the first to publish the method. He derived the normal equations algebraically without using calculus. He claimed that least squares was superior to other existing methods but gave no proof. In 1809 Gauss derived the normal (Gaussian) law of error which states that the arithmetic mean of the observations of an unknown variable x will be the most probable. Gauss (1820) succinctly writes:

"Determining a magnitude by observation can justly be compared to a game in which there is a danger of loss but no hope of gain ... Evidently the loss in the game cannot be compared directly to the error which has been committed, for then a positive error would represent loss and a negative error a gain. The magnitude of the loss must on the contrary be evaluated by a function of the error of which the value is always positive ... it seems natural to choose the simplest (function), which is, beyond contradiction, the square of the error."

Laplace (1818) examined the distributional properties of the estimator, b, of the parameter β in the simple regression model y = βx + ε when L1-norm estimation is used. He assumed that all the errors ε had the same symmetric distribution about zero and derived the density function f(ε) of the errors ε. He also showed that the slope b was asymptotically normally distributed with mean zero and variance {4 f(0)² Σ_{i=1}^{n} x_i²}⁻¹. This is a well known result for the sample median in the location model y = β + ε.

Cauchy (1824) examined the fitting of a straight line y = α + βx to data and proposed minimizing the maximum absolute residual. This he achieved by means of an iterative procedure. Chebychev (1854) in his work on the approximation of functions also proposed the estimation of parameters by means of minimizing the maximum absolute difference between the observed function and the estimated function. This minimax procedure later became known as Chebychev approximation or L∞-norm approximation.
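The minimax criterion of Cauchy and Chebychev can be illustrated with a deliberately naive sketch of our own (practical algorithms use exchange methods or, as noted later in this chapter, linear programming): minimise the maximum absolute residual over a coarse grid of candidate lines.

```python
def minimax_objective(a, b, xs, ys):
    """Largest absolute residual of the line y = a + b*x."""
    return max(abs(y - a - b * x) for x, y in zip(xs, ys))

def chebychev_grid_fit(xs, ys, a_grid, b_grid):
    """Brute-force search for the line minimising the maximum residual."""
    return min(((a, b) for a in a_grid for b in b_grid),
               key=lambda ab: minimax_objective(ab[0], ab[1], xs, ys))

grid = [i / 10.0 for i in range(-30, 31)]  # hypothetical coarse grid
# Collinear data y = 1 + x lies on a grid point, so the search finds it.
a_inf, b_inf = chebychev_grid_fit([0.0, 1.0, 2.0], [1.0, 2.0, 3.0], grid, grid)
```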
Edgeworth (1883) questioned the universal use of the normal law of errors and also examined the problem of outliers.
This problem was considered further by
Doolittle (1884). Edgeworth ( 1887a,b) used the first Boscovich principle and aban doned the zero sum residuals condition which forces the line through the centroid of the data. He, in fact, provided the first workable method for estimating these pa rameters. He also considered cases where least squares is inappropriate, i.e., where the error distributions are unknown or contaminated (normal) distributions. The aforementioned authors proposed complex solution procedures for these problems. This possibly accounts for the apparent lack of interest in applying LAV estimation in the period 1884 to 1920. Mild interest was revived again in the period 1920 to 1950 by Edgeworth (1923), Rhodes (1930), Singleton (1940) and Harris (1950). It was only after the appearance of the papers by Charnes et al. (1955) and Wagner (1959,1962) that interest in L\ and L
estimation was again stimulated.
These authors described the linear programming (L P ) formulation to the L i and Loonorm problems respectively. Progress, however, was hampered during this pe riod since computers were not advanced enough to handle very large LP problems. In general, the parameters of the equations may be estimated by minimizing the sum of the pth powers of the absolute deviations of the estimated values from the observed values of the response variable. This procedure is referred to as Lpnorm estimation. In the rest of this chapter we give a review of important developments in linear Lpnorm estimation. 1.2. T h e lin ear Lpn orm estim a tion p ro b lem Consider the general linear model
y = Xβ + e    (1.2)

where y is an n-vector of observable random variables, X is an n × k matrix of known regressor variables, β is a k-vector of unknown parameters, and e is an n-vector of unobservable errors.
1.2.1. Formulation

The linear Lp-norm estimation problem is then defined as: find the parameter vector b = (b_1, ..., b_k)' which minimizes

  S_p(b) = Σ_{i=1}^n |e_i|^p = Σ_{i=1}^n |y_i − x_i'b|^p    (1.3)

with y_i the response and x_i = (x_{i1}, ..., x_{ik})' the explanatory variables respectively, and x_{i1} = 1 for all i. Hence b_1 is the estimate of the intercept, b is the Lp-norm estimate of β and e = (e_1, ..., e_n)' is the n-vector of residuals.

Charnes et al. (1955) showed that problem (1.3) (with p = 1) can be formulated as a linear programming problem by considering the unrestricted error term to be the difference between two nonnegative variables. Let
  e_i = u_i − v_i,   where u_i, v_i ≥ 0,

represent the positive and negative deviations respectively. The general Lp-norm problem then becomes:

  Minimize  Σ_{i=1}^n (u_i + v_i)^p    (1.4)

  subject to  x_i'b + u_i − v_i = y_i,  u_i, v_i ≥ 0,  i = 1, ..., n
              b unconstrained

Only in the cases p = 1 and p → ∞ can linear programming procedures be used. When p = 2 the normal equations of least squares are solved. For other values of p unconstrained minimization procedures are used.

1.2.2. Algorithms
Barrodale and Roberts (1970) showed that the linear Lp-norm estimation problem can be formulated as a nonlinear programming (NLP) problem in which the objective function is concave for 0 < p < 1 and convex for 1 < p < ∞, with the constraints being linear. They suggested for p > 1 the use of the convex simplex or Newton's method. For the case 0 < p < 1 a modification of the simplex method for linear programming was proposed. Ekblom (1973) rewrites problem (1.3) as the perturbation problem:
Find the parameter vector b which minimizes

  Σ_{i=1}^n [(y_i − x_i'b)² + c²]^{p/2}    (where c is finite)    (1.5)

The author used the modified (damped) Newton method to solve problem (1.5). The advantage is that the Hessian (matrix of 2nd-order derivatives) of the perturbed problem remains positive definite as long as c ≠ 0. A decrease is therefore assured at every iteration. Ekblom then showed that the limiting solution as c → 0 is the solution to the original problem (1.3). More recently Ekblom (1987) suggested that problem (1.5) with p = 1 be used for L1-norm estimation.

Fischer (1981) considered problem (1.3) with 1 < p < 2. Setting r_i = x_i'b − y_i for i = 1, ..., n, he transformed (1.3) to the following "linearly constrained" problem: find the parameters b which minimize

  Σ_{i=1}^n |r_i|^p   subject to   x_i'b − r_i = y_i,  i = 1, ..., n

This problem, known as the primal [see Zangwill (1969: Chapter 2)], can be formulated as:

  min_b max_λ L(b, λ) = Σ_{i=1}^n [ |r_i|^p + λ_i(x_i'b − r_i − y_i) ]

where λ is the vector of undetermined Lagrange multipliers. The dual problem is then:

  max_λ min_b L(b, λ)

Fischer also indicated how a constrained version of Newton's method can be used to solve the dual problem. The solution of the primal problem corresponds to the iteratively reweighted least squares algorithm of Merle and Späth (1973). Fischer showed that his algorithm converges globally at a superlinear rate. He claims that his proof is not as involved as the one by Wolfe (1979), who in any event only succeeded in proving that the method is locally convergent at a linear rate. Fischer also provided some numerical results which he compared to those found by Merle and Späth. He used the number of iterations as the criterion for efficient convergence instead of the number of function evaluations performed.
This can be misleading since an iteration is generally dependent on the structure of the algorithm and the efficiency of the program code. The numerical results indicate that Fischer's algorithm is more efficient than the algorithm of Merle and Späth (1973).

1.2.3. L2-estimation
The least squares problem can be formulated as a quadratic programming problem as in (1.3), but this is unnecessarily cumbersome. However, the solution to the normal equations is easily obtained by using Gaussian elimination [see Dahlquist and Björck (1974)]. In this case the estimates of the β's are linear combinations of the elements of y. Specifically we have

  b = (X'X)⁻¹X'y    (1.6)
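The computation in (1.6) can be sketched as follows. This is a Python/NumPy sketch (an assumption of this illustration — the book's own programs are in FORTRAN), and it solves the normal equations directly rather than forming the inverse explicitly:

```python
import numpy as np

def ols_estimate(X, y):
    """Least squares estimate b = (X'X)^{-1} X'y from (1.6),
    computed by solving the normal equations (Gaussian elimination)."""
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)

# Small illustration with an intercept column (x_{i1} = 1 for all i).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 20, 50)])
beta = np.array([2.0, 0.5])
y = X @ beta + rng.normal(0, 1, 50)

b = ols_estimate(X, y)
```

For ill-conditioned X a QR-based solver is numerically preferable; the normal-equations route is shown only because it mirrors (1.6).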
Note that when p ≠ 2 the estimate b cannot be expressed as an explicit function of X and y. The Gauss-Markoff theorem indicates that the least squares estimator is the best linear unbiased estimator (BLUE), see Appendix 1A. It is important to note that this theorem refers only to the class of linear estimators. It does not consider nonlinear alternatives (e.g., values of p ≠ 2). The statistical tractability of least squares estimators has made least squares the most popular method of estimation.

Typically the least squares estimator is a kind of sample mean estimator [Kiountouzis (1973)]. This can be seen by considering the case k = 1 (location model). Hence (1.1) becomes y = β₁ + e, with least squares estimate b₁ = ȳ. This mean estimator can be undesirable when the error distribution has long tails [Blattberg and Sargent (1971)]. An obvious alternative would be the median-type estimator. This in fact coincides with the case p = 1 which will now be discussed.

1.2.4. L1-estimation
The L1-norm estimation problem can be formulated as

  min  Σ_{i=1}^n (u_i + v_i)    (1.7)

  subject to  x_i'b + u_i − v_i = y_i,  i = 1, ..., n
              u_i, v_i ≥ 0,  i = 1, ..., n
              b_j unconstrained,  j = 1, ..., k

It therefore follows that the L1-norm problem can be formulated as an LP problem in 2n + k variables and n constraints. As with any linear programming problem a dual linear programming problem exists. Denote the dual variables by f_i. The dual of (1.7) is formulated as follows [see Wagner (1959)]:

  max  Σ_{i=1}^n f_i y_i

  subject to  −1 ≤ f_i ≤ 1,  i = 1, ..., n
              Σ_{i=1}^n f_i x_{ij} = 0,  j = 1, ..., k

Wagner (1959) also showed that by setting w_i = f_i + 1 the bounded-variable dual problem may be formulated in terms of the nonnegative (upper bounded) variables w_i:

  max  Σ_{i=1}^n w_i y_i

  subject to  0 ≤ w_i ≤ 2,  i = 1, ..., n
              Σ_{i=1}^n w_i x_{ij} = Σ_{i=1}^n x_{ij},  j = 1, ..., k
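The primal formulation (1.7), with its 2n + k variables and n equality constraints, can be passed directly to a general-purpose LP solver. A minimal sketch using SciPy's linprog (an assumption of this illustration — the book's own codes are FORTRAN implementations of special-purpose simplex variants such as Barrodale-Roberts):

```python
import numpy as np
from scipy.optimize import linprog

def lav_fit(X, y):
    """L1 (LAV) regression via the primal LP (1.7):
    variables (b, u, v); minimize sum(u_i + v_i)
    subject to X b + u - v = y, u, v >= 0, b free."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])      # cost on u and v only
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])           # X b + u - v = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

# Fit a line through points with one gross outlier; the LAV line ignores it.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([0.0, 1.0, 2.0, 3.0, 40.0])   # last observation is an outlier
b = lav_fit(X, y)                          # close to intercept 0, slope 1
```

The fitted hyperplane passes through k = 2 of the observations, in agreement with geometric property 1) below.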
Barrodale and Roberts (1973) managed to reduce the total number of primal simplex iterations by means of a process bypassing several neighbouring simplex vertices in a single iteration. Subroutine L1 may be found in Barrodale and Roberts (1974). Armstrong et al. (1979) modified the Barrodale and Roberts algorithm by using the revised simplex algorithm of linear programming, which reduces storage requirements and uses an LU decomposition to maintain the current basis.

A number of interesting geometrical properties of the solution to the L1-norm problem have been derived. We shall list a few of these. The interested reader is referred to Appa and Smith (1973) and Gentle et al. (1977) for more detail.

1) If the matrix X is of full column rank, i.e., r(X) = k, then there exists a hyperplane passing through k of the points. If r(X) = r < k there exists an optimal hyperplane which will pass through r of the observations.

2) Let n⁺ be the number of observations with positive deviations, n⁻ the corresponding number with negative deviations and n* the maximum number of observations that lie on any hyperplane. Then |n⁺ − n⁻| ≤ n*.

3) For n odd every L1 hyperplane passes through at least one point.

4) For n odd there is at least one observation which lies on every L1 hyperplane.

One point that remains to be considered is that of optimality.
Optimality conditions for the nonlinear L1-norm problem are derived in Chapter 2 (see §2.2). The linear case is treated as a corollary of the nonlinear case.

Corollary 1.1  Let A = {i | y_i − x_i'b = 0} and I = {1, 2, ..., n} − A be the sets of indices corresponding to the active and inactive functions respectively. Then a necessary and sufficient condition for parameters b to be a global L1-norm solution is the existence of multipliers −1 ≤ a_i ≤ 1 for all i ∈ A such that

  Σ_{i∈I} sign(y_i − x_i'b) x_i + Σ_{i∈A} a_i x_i = 0

1.2.5. L∞-estimation

The L∞-norm estimation problem is to find the parameter vector b which minimizes the maximum absolute deviation

  d∞ = max_i |y_i − x_i'b|

It can likewise be formulated as a linear programming problem in the variables b (unconstrained) and d ≥ 0:

  min d   subject to   x_i'b + d ≥ y_i  and  x_i'b − d ≤ y_i,  i = 1, ..., n
A number of geometrical properties of the L∞-norm solution have been derived by Appa and Smith (1973).

1) For problem (1.10) there exists one optimal hyperplane which is vertically equidistant from at least k + 1 observations.

2) Any k + 1 observations determining the optimal hyperplane must lie in the convex hull of the n observations.

Barrodale and Phillips (1975) proposed a dual method for solving the primal formulation of the L∞-norm estimation problem. A FORTRAN code CHEB is also given by these authors. Armstrong and Kung (1980) also proposed a dual method for this problem. By maintaining a reduced basis the algorithm bypasses certain simplex vertices.

Optimality conditions for the L∞-norm problem also follow from the nonlinear case (§3.2 in Chapter 3) and are given in the following corollary:

Corollary 1.2  Let A = {i | |y_i − x_i'b| = d∞} and I = {1, 2, ..., n} − A be the sets of indices corresponding to the active and inactive functions respectively. Then a necessary and sufficient condition for parameters b to be a global L∞-norm solution is the existence of nonnegative multipliers α_i such that

  Σ_{i∈A} α_i = 1,   α_i = 0 for i ∈ I,   Σ_{i∈A} α_i sign(y_i − x_i'b) x_i = 0
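A hedged sketch of the L∞-norm fit posed as an LP in the variables (b, d), here using SciPy's general-purpose linprog (an assumption of this illustration) rather than a special-purpose code such as CHEB:

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_fit(X, y):
    """L-infinity (Chebyshev) regression: minimize d = max_i |y_i - x_i'b|
    as an LP in (b, d) with constraints -d <= y_i - x_i'b <= d."""
    n, k = X.shape
    c = np.concatenate([np.zeros(k), [1.0]])               # minimize d
    A_ub = np.vstack([np.hstack([X, -np.ones((n, 1))]),    #  x_i'b - d <= y_i
                      np.hstack([-X, -np.ones((n, 1))])])  # -x_i'b - d <= -y_i
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * k + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k], res.x[k]

# The optimal line is vertically equidistant from at least k + 1 observations.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 1.0, 1.0, 3.0])
b, d_inf = chebyshev_fit(X, y)   # line y = -0.5 + x, max deviation 0.5
```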
1.3. The choice of p

Forsythe (1972), in estimating the parameters α and β in the simple regression model y = α + βx, investigated the use of Lp-norm estimation with 1 < p < 2. He argued that since the mean is sensitive to deviations from normality, the Lp-norm estimator will be more robust than least squares in estimating the mean. This will be the case when outliers are present. He suggested the compromise use of p = 1.5 when contaminated (normal) or skewly distributed error distributions are encountered. The Davidon-Fletcher-Powell (DFP) method was used as a minimization technique.

In a simulation study Ekblom (1974) compared the Lp-norm estimators with the Huber M-estimators. He also considered the case p < 1. He concluded that the Huber estimator is superior to the Lp-norm estimator when the errors follow a contaminated normal distribution. For other error distributions (Laplace, Cauchy) he suggested that p = 1.25 be used. The proposal that p < 1 should be used for skewly distributed (χ²) errors is interesting and shows that the remark by Rice (1964), that problems where p < 1 are not of interest, is unjustified.

Harter (1977) suggested an adaptive scheme which relies on the kurtosis β₂ of the regression error distribution. He suggested L1-norm estimation if β₂ > 3.8, least squares if 2.2 < β₂ < 3.8, and L∞-norm estimation if β₂ < 2.2. This scheme has been extended by Barr (1980), Money et al. (1982), Sposito et al. (1983) and Gonin and Money (1985a,b) and will be discussed in Chapter 5.

Nyquist (1980, 1983) considered the statistical properties of linear Lp-norm estimators. He derived the asymptotic distribution of linear Lp-norm estimators and showed it to be normal for sufficiently small values of p. It is not stated, however, how small p should be. A procedure for selecting the optimal value of p based on the asymptotic variance is proposed which validates the empirical studies by Barr (1980) and Money et al. (1982). Money et al. derived an empirical relationship between the optimal value of p and the kurtosis of the error distribution. Sposito et al. (1983) derived a different empirical relationship which also relates the optimal value of p to the kurtosis of the error distribution. These authors also showed that the formula of Money et al. (1982) yields a reasonable value of p for error distributions with a finite range, and suggested the use of their own formula for large sample sizes (n > 200) when it is known that 1 < p < 2. The following modification of Harter's rule was suggested: use p = 1.5 (Forsythe) if 3 < β₂ < 6, least squares if 2.2 < β₂ < 3, and L∞-norm estimation if β₂ < 2.2. The above-mentioned formulae will be the object of study in Chapter 5.

1.4. Statistical Properties of Linear Lp-norm Estimators

We introduce this section by stating the following theorem due to Nyquist (1980, 1983). See also Gonin and Money (1985a).

Theorem 1.1
Consider the linear model y = Xβ + e. Let b be the estimate of β chosen so that

  S_p(b) = Σ_{i=1}^n |y_i − x_i'b|^p

is a minimum, where 1 ≤ p < ∞. Assume that:

A1: The errors e_i are independently, identically distributed with common distribution function γ.
A2: The L1- (and L∞-) norm estimators are unique (in general Lp-norm estimators are uniquely defined for 1 < p < ∞).
A3: The matrix Q = lim_{n→∞} X'X/n is positive definite with rank(Q) = k.
A4a: γ is continuous with γ′(0) > 0 when p = 1.
A4b: When 1 < p < ∞ the following expectations exist: E{|e_i|^{p−1}}, E{|e_i|^{p−2}}, E{e_i^{2p−2}}; and also E{|e_i|^{p−1} sign(e_i)} = 0.

Under the above four assumptions √n(b − β) is asymptotically normally distributed with mean 0 and variance ω_p² Q⁻¹, where

  ω_p² = E{|e_i|^{2p−2}} / [(p − 1) E{|e_i|^{p−2}}]²
Bassett and Koenker (1978) considered the case p = 1 and the following corollary results. [See also Dielman and Pfaffenberger (1982b).]

Corollary 1.3  For the linear model let b be the estimate of β such that

  S₁(b) = Σ_{i=1}^n |y_i − x_i'b|

is a minimum. Under the assumptions A1, A2, A3 and A4a of Theorem 1.1 it follows for a random sample of size n that √n(b − β) is asymptotically normally distributed with mean 0 and variance λ²Q⁻¹, where λ²/n is the variance of the sample median of the residuals.
1.5. Confidence intervals for β

1.5.1. Case 1 < p < ∞

In view of the asymptotic results of Nyquist (1980, 1983) we can construct the following confidence intervals for the components of β. Specifically the 100(1 − α)% confidence interval for β_j is given by

  b_j ± z_{α/2} √( ω_p² (X'X)⁻¹_{jj} )

where (X'X)⁻¹_{jj} denotes the jth diagonal element of (X'X)⁻¹ and z_{α/2} denotes the appropriate percentile of the standard normal distribution.

A major drawback is that ω_p² is unknown. However, we have performed a simulation study in which the sample moments of the residual distribution were used to estimate ω_p². The estimate ω̂_p² was calculated as follows:

  ω̂_p² = m_{2p−2} / [(p − 1) m_{p−2}]²,   with  m_r = (1/n) Σ_{i=1}^n |e_i|^r

where e_i is the ith residual from the Lp-norm fit.
The simulation was performed using a two-parameter model of the form

  y_i = 10 + β₁x_{i1} + β₂x_{i2}

In this model the parameters β₁ and β₂ were fixed at 8 and −6 respectively. Values of (x_{i1}, x_{i2}) for i = 1, 2, ..., n (n = 30, 50, 100, 200, 400) were selected from a uniform distribution over the interval 0 to 20 and held fixed in all 500 samples. In all the runs a uniform error distribution with mean zero and variance 25 was used. The exact value of ω_p² for the uniform error distribution can be derived in closed form.
We have therefore succeeded in reducing the 2n inequality constraints to n equality constraints whilst increasing the number of variables from n + k to 2n + k. In a mathematical programming framework this formulation is attractive because of the reduced number of constraints. In deriving optimality conditions, however, the formulation (2.3) will be used. We now discuss optimality conditions for the nonlinear L1-norm estimation problem. The well-known Karush (1939), Kuhn and Tucker (1951) necessary conditions (feasibility, complementary slackness and stationarity) of NLP will be utilized in the derivation of the optimality conditions. Optimality conditions for the linear L1-norm problem (see Chapter 1) will be stated as a corollary.
2.2. Optimality conditions for L1-norm estimation problems

The following concepts will be needed:

• Active function set
  A = {i | y_i − f_i(x_i, θ) = 0}

• Inactive function set
  I = {i | y_i − f_i(x_i, θ) ≠ 0} = {1, ..., n} − A

• Active constraint sets
  K₁ = {i | y_i − f_i(x_i, θ) = u_i}
  K₂ = {i | −[y_i − f_i(x_i, θ)] = u_i}
  K = K₁ ∪ K₂

• Feasible region
  F = {(u, θ) ∈ ℝⁿ × ℝᵏ | y_i − f_i(x_i, θ) − u_i ≤ 0,  −y_i + f_i(x_i, θ) − u_i ≤ 0}

• The Lagrangian function for problem (2.3)
  L(u, θ, λ) = Σ_{i=1}^n { (1 − λ_i − λ_{i+n}) u_i + (λ_i − λ_{i+n}) [y_i − f_i(x_i, θ)] }

We shall assume that at least one u_i > 0. The following condition or constraint qualification will also be used:

Definition 2.1
Let e_i be the ith unit vector. A point (u*, θ*) such that the active constraint gradients evaluated at (u*, θ*),

  (−e_i', −∇'f_i(x_i, θ*))'  for i ∈ K₁   and   (−e_i', ∇'f_i(x_i, θ*))'  for i ∈ K₂,

are linearly independent is termed a regular point of the constraint set.

The Karush-Kuhn-Tucker (KKT) theorem will now be stated.

Theorem 2.1
Suppose the functions f_i(x_i, θ) are continuously differentiable with respect to θ and that (u*, θ*) is a regular local minimum point of problem (2.3). Then the following conditions are necessary for a local minimum point:

(a) (Feasibility). (u*, θ*) ∈ F

(b) (Complementary slackness). There exist multipliers λ_i*, λ_{i+n}* ≥ 0, i = 1, ..., n, such that

  λ_i* (y_i − f_i(x_i, θ*) − u_i*) = 0
  λ_{i+n}* (y_i − f_i(x_i, θ*) + u_i*) = 0

(c) (Stationarity). ∇L(u*, θ*, λ*) = ( ∇_u L , ∇_θ L )' = 0

For notational convenience denote h_i(θ) = y_i − f_i(x_i, θ), h(θ) = [h₁(θ), ..., h_n(θ)]', y = (y₁, ..., y_n)' and f(x, θ) = [f₁(x₁, θ), ..., f_n(x_n, θ)]'. Then the active function set is A = {i | h_i(θ*) = 0} and the inactive function set is I = {i | h_i(θ*) ≠ 0}. The KKT conditions can be simplified and will be stated as the following necessity theorem.
Theorem 2.2  Suppose h_i(θ) is continuously differentiable with respect to θ and that (u*, θ*) is a regular point of (2.3) and a local minimum point of (2.2). Then a necessary condition for θ* to be a local minimum point is the existence of multipliers −1 ≤ a_i ≤ 1 for all i ∈ A such that

  Σ_{i∈I} sign[h_i(θ*)] ∇h_i(θ*) + Σ_{i∈A} a_i ∇h_i(θ*) = 0

Proof  Let I₁ = {i | h_i(θ*) > 0} and I₂ = {i | h_i(θ*) < 0} be the inactive function sets. KKT conditions (b) and (c) state that

  λ_i* [h_i(θ*) − u_i*] = 0,   λ_{i+n}* [h_i(θ*) + u_i*] = 0   and   λ_i* + λ_{i+n}* = 1,

the last equality following from ∇_u L = 0, while ∇_θ L = 0 gives Σ_{i=1}^n (λ_i* − λ_{i+n}*) ∇h_i(θ*) = 0. For i ∈ I₁, u_i* = h_i(θ*) > 0 so that λ_{i+n}* = 0 and λ_i* = 1; similarly for i ∈ I₂, λ_i* = 0 and λ_{i+n}* = 1. Hence λ_i* − λ_{i+n}* = sign[h_i(θ*)] for i ∈ I. Setting a_i = λ_i* − λ_{i+n}* for i ∈ A, since λ_i* + λ_{i+n}* = 1 it follows that −1 ≤ a_i ≤ 1 and the proof is complete. ∎
Remarks

(1) If the functions h_i(θ) are convex or linear then the optimality condition of Theorem 2.2 will be necessary as well as sufficient.

(2) At the optimal solution K₁ ∪ K₂ = {1, ..., n}. Hence all the constraints are active at optimality. This fact is used in the Murray and Overton (1981) algorithm (see §2.3.3.1).
In linear LAV estimation the model y = Xθ + e is fitted, where X is an n × k matrix and y, θ and e are n-, k- and n-vectors respectively. Denote the ith row of X by x_i = (x_{i1}, ..., x_{ik}); then h_i(θ) = y_i − x_i'θ. Define A = {i | y_i − x_i'θ* = 0} and I = {i | y_i − x_i'θ* ≠ 0}.

Corollary 2.1  In linear LAV estimation a necessary and sufficient condition for θ* to be a global LAV solution point is the existence of multipliers −1 ≤ a_i ≤ 1 for all i ∈ A such that

  Σ_{i∈I} sign(y_i − x_i'θ*) x_i + Σ_{i∈A} a_i x_i = 0
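Corollary 2.1 can be checked numerically: given a candidate θ*, finding the multipliers a_i reduces to a small linear feasibility problem on the active set. A sketch (posing the multiplier search as a feasibility LP is an assumption of this illustration):

```python
import numpy as np
from scipy.optimize import linprog

def is_lav_optimal(X, y, theta, tol=1e-8):
    """Check Corollary 2.1: theta is a global LAV solution iff there exist
    multipliers a_i in [-1, 1] on the active set A with
    sum_{i in I} sign(r_i) x_i + sum_{i in A} a_i x_i = 0."""
    r = y - X @ theta
    active = np.abs(r) <= tol
    g = X[~active].T @ np.sign(r[~active])     # contribution of the inactive set I
    XA = X[active]
    if XA.shape[0] == 0:
        return bool(np.allclose(g, 0))
    # Feasibility LP: find a with XA' a = -g, -1 <= a_i <= 1.
    res = linprog(np.zeros(XA.shape[0]), A_eq=XA.T, b_eq=-g,
                  bounds=[(-1, 1)] * XA.shape[0], method="highs")
    return res.status == 0

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([0.0, 1.0, 2.0, 3.0, 40.0])
```

Here the line θ = (0, 1)' (which interpolates four of the five points) satisfies the condition, while an arbitrary line does not.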
The following Necessity Theorem is stated in terms of directional derivatives d (see Appendix 2B).

Theorem 2.3  Suppose the functions h_i(θ) are continuously differentiable with respect to θ. Then a necessary condition for θ* to be a local minimum point of problem (2.2) is that

  Σ_{i∈I} sign[h_i(θ*)] d'∇h_i(θ*) + Σ_{i∈A} |d'∇h_i(θ*)| ≥ 0   for all d ∈ ℝᵏ
The following theorem constitutes sufficiency conditions for optimality.

Theorem 2.4  Suppose the functions h_i(θ) are twice continuously differentiable with respect to θ. Then θ* is an isolated (strong) minimum point of S₁(θ) if there exist multipliers −1 ≤ a_i ≤ 1 for all i ∈ A such that

  Σ_{i∈I} sign[h_i(θ*)] ∇h_i(θ*) + Σ_{i∈A} a_i ∇h_i(θ*) = 0

and for every i ∈ A and d ≠ 0 satisfying

  d'∇h_i(θ*) = 0  if |a_i| ≠ 1,   a_i d'∇h_i(θ*) ≥ 0  if |a_i| = 1,

the curvature condition

  d' [ Σ_{i∈I} sign[h_i(θ*)] ∇²h_i(θ*) + Σ_{i∈A} a_i ∇²h_i(θ*) ] d > 0

holds.
Note that this is a one-dimensional minimization or exact line search problem. In practice this is seldom used; instead inexact line search procedures are used. Only a sufficient decrease in f(θ) is therefore sought:

  f(θʲ + γ_j δθʲ) ≤ f(θʲ) − ε    (ε > 0)

In Appendix 2B we discuss some line search procedures. The above-mentioned methods assume that the objective function f(θ) is continuously differentiable. Since the objective functions S₁(θ) and S∞(θ) are not differentiable they need special treatment.
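A generic sketch of such an inexact (backtracking) line search. The Armijo-style sufficient-decrease test and the constants used here are illustrative assumptions, not the specific procedures of Appendix 2B:

```python
import numpy as np

def backtracking_line_search(f, theta, d, mu=0.5, c=1e-4, max_halvings=30):
    """Try gamma = 1, mu, mu^2, ... until the trial point achieves a
    sufficient decrease in f (a generic Armijo-style test)."""
    f0 = f(theta)
    gamma = 1.0
    for _ in range(max_halvings):
        if f(theta + gamma * d) <= f0 - c * gamma * np.dot(d, d):
            return gamma
        gamma *= mu
    return 0.0   # no acceptable step found

# Minimize f(x) = sum(x^2) with crude steepest-descent steps.
f = lambda x: float(np.sum(x ** 2))
x = np.array([3.0, -2.0])
for _ in range(100):
    d = -2 * x                              # negative gradient direction
    x = x + backtracking_line_search(f, x, d) * d
```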
2.3.2. Type I Algorithms

We now consider the SLP methods.

2.3.2.1 The Anderson-Osborne-Watson algorithm.

This method is due to Osborne and Watson (1971) and Anderson and Osborne (1977a). The underlying philosophy behind this method is not difficult. At each step a first-order Taylor series (linear approximation) of f_i(x_i, θ) is used. The method is therefore essentially a generalization of the Gauss-Newton method for nonlinear least squares, extended to solve L1-norm estimation problems.

The algorithm

Step 0: Given an initial estimate θ¹ of the optimal θ*, choose μ = 0.1 and λ = 0.0001 and set j = 1.

Step 1: Calculate δθ to minimize

  Σ_{i=1}^n | y_i − f_i(x_i, θʲ) − ∇'f_i(x_i, θʲ) δθ |

Let the minimum be S̄ʲ with δθ = δθʲ.

Step 2: Choose γ_j as the largest number in the set {1, μ, μ², ...} for which

  [ S₁(θʲ) − S₁(θʲ + γ δθʲ) ] / [ γ (S₁(θʲ) − S̄ʲ) ] ≥ λ

Let the minimum be Sʲ⁺¹ with γ = γ_j.

Step 3: Set θʲ⁺¹ = θʲ + γ_j δθʲ and return to Step 1. Repeat until certain convergence criteria are met.
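The scheme above can be sketched as follows, with SciPy's linprog standing in for the Barrodale-Roberts LP code (an assumption — the book's own implementation ANDOSBWAT is in FORTRAN). It is exercised on the exponential fit y = e^{θx} through (0,0), (1,1), (2,4), for which θ* = ln 2:

```python
import numpy as np
from scipy.optimize import linprog

def l1_linear_subproblem(J, r):
    """Step 1: solve min_d sum_i |r_i - J_i' d| as an LP in (d, u, v)
    with J d + u - v = r and u, v >= 0."""
    n, k = J.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([J, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=r, bounds=bounds, method="highs")
    return res.x[:k], res.fun

def aow(model, jac, y, theta, mu=0.1, lam=1e-4, maxit=50, tol=1e-10):
    """Anderson-Osborne-Watson iteration for min_theta sum_i |y_i - f_i(theta)|."""
    S = lambda t: float(np.sum(np.abs(y - model(t))))
    for _ in range(maxit):
        d, Sbar = l1_linear_subproblem(jac(theta), y - model(theta))
        if S(theta) - Sbar <= tol:          # no further predicted decrease
            break
        gamma = 1.0                          # inexact line search: 1, mu, mu^2, ...
        for _ in range(60):
            if S(theta) - S(theta + gamma * d) >= lam * gamma * (S(theta) - Sbar):
                break
            gamma *= mu
        theta = theta + gamma * d
    return theta

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 4.0])
model = lambda t: np.exp(t[0] * x)
jac = lambda t: (x * np.exp(t[0] * x)).reshape(-1, 1)
theta = aow(model, jac, y, np.array([1.0]))
```

From θ¹ = 1 the first step reproduces Example 2.3 below (δθ¹ ≈ −0.2293, S̄¹ ≈ 2.0949) and the iterates converge to θ* = ln 2 with S₁(θ*) = 2.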
Remarks

(1) The original nonlinear problem is reduced to a sequence of linear L1-norm regression problems, each of which can be solved efficiently as a standard LP problem using the Barrodale and Roberts (1973, 1974) or Armstrong et al. (1979) algorithms.
(2) The inexact line search in Step 2 ensures convergence and is due to Anderson and Osborne (1977a) (see Appendix 2B). Osborne and Watson (1971) originally proposed an exact line search: calculate γ > 0 to minimize Σ_{i=1}^n |y_i − f_i(x_i, θʲ + γ δθʲ)|.

Example 2.3

We shall illustrate one step of the algorithm on the following simple example. Given the data points (0, 0), (1, 1), (2, 4) we wish to fit the curve y = e^{θx} to the three points. The objective function is plotted in Fig. 2.3.

Figure 2.3  S₁(θ) vs θ.
The solution is θ* = ln 2 = 0.69315. Dennis and Schnabel (1983: 226) also used this example to illustrate the slow convergence of the Gauss-Newton method when a least squares fit is performed. We start with θ¹ = 1.

Step 1: We have to solve the problem: min_{δθ} Σ_{i=1}^3 |y_i − e^{θ¹x_i} − x_i e^{θ¹x_i} δθ|. This is equivalent to solving the linear programming problem:

  min Σ_{i=1}^3 (u_i + v_i)

subject to the 3 constraints

  u₁ − v₁ = 1
  u₂ − v₂ − 2.71828 δθ = 1.71828
  u₃ − v₃ − 14.77811 δθ = 3.38906

with u_i, v_i ≥ 0 and δθ unconstrained in sign. The solution to this LP problem yields δθ¹ = −0.2293, u₁ = 1, u₂ = 1.0949, u₃ = v₁ = v₂ = v₃ = 0. The minimum is S̄¹ = 2.0949. The number of simplex iterations using subroutine L1 of Barrodale and Roberts (1974) was 1.

Step 2: The inexact line search proceeds as follows. Select γ = 1 and calculate

  [S₁(θ¹) − S₁(θ¹ + γ δθ¹)] / [γ (S₁(θ¹) − S̄¹)] = (6.1073 − 2.8321) / (6.1073 − 2.0949) = 0.816 ≥ 0.0001

Hence γ₁ = 1 is selected. The new objective value is S₁(θ²) = 2.8321.

Step 3: Set θ² = θ¹ + γ₁δθ¹ = 0.7707.

The algorithm proceeds in this fashion to the optimal solution θ* = 0.693147 with S₁(θ*) = 2. The residuals are −1, −1 and 0. Thus the curve goes through one point exactly, as is to be expected. The total number of iterations was 4 and the total number of simplex iterations also equalled 4. The line search step γ = 1 was chosen at every iteration (quadratic convergence).

Under the usual smoothness assumptions and the assumption that the Jacobian matrix
is of full column rank, the following convergence properties were derived.

Convergence properties

Assuming an exact minimum is located in Step 2, then:

(1) S̄ʲ ≤ S₁(θʲ) for all j; if S̄ʲ < S₁(θʲ) then S₁(θʲ⁺¹) < S₁(θʲ) for all j.

(2) At a limit point of the algorithm S̄ʲ = S₁(θʲ).

(3) If S̄ʲ = S₁(θʲ) and if δθʲ is an isolated (unique) local minimum in Step 1, then a limit point of the sequence {θʲ} is a stationary point of (2.2).
If the inexact line search is performed in Step 2, then:

(4) If S₁(θʲ) ≠ S̄ʲ and given that 0 < λ < 1, then there exists a γ_j ∈ {1, μ, μ², ...} with 0 < μ < 1 such that

  [S₁(θʲ) − S₁(θʲ + γ_j δθʲ)] / [γ_j (S₁(θʲ) − S̄ʲ)] ≥ λ

(5) If the algorithm does not terminate in a finite number of iterations then the sequence {S₁(θʲ)} converges to S̄ʲ as j → ∞.

(6) Any limit point of {θʲ} is a stationary point of S₁(θ), i.e., a point θ* at which the subproblem minimum S̄ equals S₁(θ*).

Convergence rate

Anderson and Osborne (1977a) show that if the points θʲ all lie in a compact (closed and bounded) set and a so-called "multiplier" condition holds, then the ultimate convergence rate is quadratic. Cromme (1978) has shown that the condition of strong uniqueness is sufficient for ensuring quadratic convergence of Gauss-Newton methods.

Definition 2.3  θ* is strongly unique if there exists a γ > 0 such that

  S₁(θ) ≥ S₁(θ*) + γ ‖θ − θ*‖

for all θ in a neighbourhood of θ*.

Jittorntrum and Osborne (1978) show that strong uniqueness is implied by the multiplier condition. It is therefore a weaker condition; however, it is not the weakest condition that ensures quadratic convergence.

• The FORTRAN program ANDOSBWAT with an example problem is listed in Appendix 2H. Convergence occurred at iteration 8 (S⁸ = S̄⁸ = 0.001563). The intermediate results are in complete agreement with those found by Anderson and Osborne (1977a).

2.3.2.2 The Anderson-Osborne-Levenberg-Marquardt algorithm.

Anderson and Osborne (1977a, 1977b) couched the L1- and L∞-norm problems within the framework of so-called polyhedral norms (see Appendix 2F). A discussion of these norms can be found in Fletcher (1981a: Chapter 14). The polyhedral norm approximation problem is defined as:
Find parameters θ which minimize ‖f(θ)‖_B, where B is a matrix as defined in Appendix 2F. Step 1 of the previous algorithm can therefore be written in polyhedral norm notation as:

Step 1: Calculate δθ and a scalar u by solving the problem:

  minimize u   subject to   B( y − f(x, θʲ) − ∇'f(x, θʲ) δθ ) ≤ u e

Let the minimum be S̄ʲ with δθ = δθʲ.

This type of algorithm (Gauss-Newton) will be very effective if the number of active functions at θ* is k. In general this will not necessarily be the case and as a consequence SLP methods will converge slowly, if at all. To overcome this problem Anderson and Osborne (1977b) derived a Levenberg-Marquardt algorithm (Appendix 2D). The inexact line search was replaced by a strategy for modifying μ which thereby imposes a bound on the change in θ. The size as well as the direction of δθ is therefore controlled.

The algorithm

Step 0: Given an initial estimate θ¹ of the optimal θ*, select μ₀ = μ₁ = 1 and set j = 1.

Step 1: Calculate δθ by minimizing the polyhedral norm of the augmented residual

  ( B( y − f(x, θʲ) − ∇'f(x, θʲ) δθ ) )
  (              μ_j δθ               )

Let the minimum be S̄ʲ(μ_j) with δθ = δθʲ.

Step 2: Modify μ_j > 0 as follows. Set l = 0, T₀ = 0, over := false, μ̄_j = μ_{j−1}, and fix the constants δ = 0.01, ρ = 10 and τ = 0.2. Repeatedly set l := l + 1, re-solve Step 1 with the current μ, and calculate the ratio of the actual to the predicted decrease,

  T_l = [ S₁(θʲ) − S₁(θʲ + δθʲ) ] / [ S₁(θʲ) − S̄ʲ(μ_j) ]

If T_l < δ (a poor step) μ is increased, by the factor ρ until an acceptable value has been bracketed (over := true) and by bisection, μ := ½(μ̄_j + μ_j), thereafter; if l = 1 and T_l ≥ 1 − δ (the linear model predicts the decrease well) μ is reduced, μ_{j+1} = τμ_j; otherwise μ_{j+1} = μ_j.

Step 3: Set θʲ⁺¹ = θʲ + δθʲ and return to Step 1. Repeat until certain convergence criteria are met.
Remarks

(1) When μ_j > 0 the augmented Jacobian [∇'f(x, θʲ)', μ_j I]' is of rank k.

(2) McLean and Watson (1980) observe that Step 1 can be time consuming since it may involve the solving of many linear programming problems. They suggest an alternative form of the linear subproblem which admits an efficient numerical solution.

(3) Convergence properties analogous to those stated previously can also be found in Anderson and Osborne (1977b). Jittorntrum and Osborne (1980) give necessary and sufficient conditions for quadratic convergence of Gauss-Newton methods when the objective function is a polyhedral norm.
2.3.2.3 The McLean and Watson algorithms.

McLean and Watson (1980) provide two numerically robust and efficient methods that are based on Levenberg-Marquardt-like approaches, and are therefore superlinearly convergent. For these algorithms the functions f_i(x_i, θ) are assumed to be at least once continuously differentiable in an arbitrary neighbourhood of a stationary point of (2.2). Their alternative to the previous approach (§2.3.2.2) is similar to Madsen's (1975) proposal for the nonlinear L∞-norm problem.

In Gauss-Newton methods f_i(x_i, θ) is approximated by a first-order Taylor series f_i(x_i, θʲ) + ∇'f_i(x_i, θʲ) d. This approximation will only be accurate for "small" d. Hence McLean and Watson (1980) bound the direction by stipulating

  max_{1≤i≤k} |d_i| ≤ Δ_j

Algorithm 1

Step 0: Given Δ₁ > 0 and 0 < σ < 1, set j = 1.

Step 1: Calculate dʲ(Δ_j) by solving

  min_d Σ_{i=1}^n | y_i − f_i(x_i, θʲ) − ∇'f_i(x_i, θʲ) d |   subject to   max_{1≤i≤k} |d_i| ≤ Δ_j

The remaining steps update θʲ and adjust the bound Δ_j according to the ratio of the actual to the predicted decrease in S₁. Among the convergence properties established: (4) the limit points of the iterates {θʲ} are stationary points of S₁(θ).

Program MCLEANWAT, which is an implementation of Algorithm 1, is listed in Appendix 2H.
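The bounded linear subproblem of Step 1 is itself an LP, since the box constraint max_i |d_i| ≤ Δ is linear. A sketch (the data and the choice of bound Δ below are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def bounded_l1_subproblem(J, r, delta):
    """McLean-Watson-style subproblem: min_d sum_i |r_i - J_i' d|
    subject to max_i |d_i| <= delta, solved as an LP in (d, u, v)."""
    n, k = J.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([J, np.eye(n), -np.eye(n)])      # J d + u - v = r
    bounds = [(-delta, delta)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=r, bounds=bounds, method="highs")
    return res.x[:k], res.fun

# With a large bound the step matches the unconstrained Gauss-Newton L1 step;
# shrinking delta clips the step componentwise and raises the attainable minimum.
J = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r = np.array([2.0, -3.0, 0.0])
d_big, v_big = bounded_l1_subproblem(J, r, 10.0)      # objective 1
d_small, v_small = bounded_l1_subproblem(J, r, 1.0)   # objective 3, d clipped
```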
Example 2.4

The following example due to Jennrich and Sampson (1968) was used:

  min_θ Σ_{i=1}^{10} | y_i − ( exp(iθ₁) + exp(iθ₂) ) |

where y_i = 2 + 2i. The contour lines for the objective function are given in Fig. 2.4.

Figure 2.4  Contours of S₁(θ) = 33, 35, 37.

The optimal solution is θ* = (0.255843, 0.255843) and S₁(θ*) = 32.091941.
General comments: Type I algorithms

• If the conditions for quadratic convergence of GN methods hold then the bound on δθʲ becomes inactive and all Type I methods will be identical.

• If y_i = f_i(x_i, θ*) for all i then the convergence rate is quadratic for any norm (L₁, L_p and L∞).

• If y_i ≠ f_i(x_i, θ*) for some i then convergence is either slow or quadratic for L₁ problems. For Lp-norms the rate will be linear.
2.3.3. Type II Algorithms

In the event that strong uniqueness does not prevail, Gauss-Newton methods may converge slowly and hence 2nd-derivative information has to be used. This is achieved by sequential quadratic programming (SQP). Algorithms of this type are complex, and are therefore stated without worked examples. The interested reader is encouraged to read the relevant papers. Again for notational convenience denote y_i − f_i(x_i, θ) by h_i(θ).

2.3.3.1 The Murray and Overton algorithm.

The Murray and Overton (1981) algorithm is an application and adaptation of the modified Newton algorithm for linearly constrained NLP problems [Gill and Murray (1974)] to solve L1-norm problems. We give a brief outline of the approach, following the description given in Overton (1982). In order to obtain a superlinearly convergent algorithm, second-order derivative information must be utilized. The method utilizes the structure of (2.3) and essentially reduces the problem back to an optimization problem in k variables (parameters). If we solve problem (2.2) directly, then S₁(θ) may be used as a merit function. Recall (§3.2) that the objective function in (2.3) cannot be used as a merit function. In the algorithm a descent direction is determined from a quadratic program based on a projected Lagrangian function. Two types of Lagrange multiplier estimates (1st- and 2nd-order) are calculated to form the Lagrangian function. A line search is then performed using this direction in order to reduce the merit function.
We shall partly adopt the Murray and Overton notation in this section; the constraints are assumed to have been ordered accordingly without loss of generality.

• Active function set
  At any point θ ≠ θ* we define A as the set of functions which we think will be zero at the solution θ*.

• Let
  ĥ(θ) = [h_{t+1}(θ), ..., h_n(θ)]'
  h̄(θ) = [h₁(θ), ..., h_t(θ)]'

• Active constraints
  u_i − sign[h_i(θ)] h_i(θ) = 0,  i = t+1, ..., n
  u_i − h_i(θ) = 0  and  u_i + h_i(θ) = 0,  i = 1, ..., t

• The matrices of gradients
  V̄_{k×t} = ( ∇h₁(θ), ..., ∇h_t(θ) )
  V̂_{k×(n−t)} = ( ∇h_{t+1}(θ), ..., ∇h_n(θ) )
  S̄_{t×t} = diag{ sign[h_i(θ)] : i = 1, ..., t }
  Ŝ_{(n−t)×(n−t)} = diag{ sign[h_i(θ)] : i = t+1, ..., n }

After a further reordering of the active constraints, the active constraints are given in matrix notation as
the system c(θ, u) = 0, whose gradient matrix ∇c(θ), of dimension (k + n) × (t + n), is built up in partitioned form from the blocks V̄S̄, V̂Ŝ and identity matrices.
The gradient of the objective function of (2.3) with respect to (θ, u) is given by

  ∇_{(θ,u)} Σ_{i=1}^n u_i = ( 0', ē', ê' )'

where ê and ē are the (n − t)- and t-vectors of ones respectively. Orthogonal (QR) decomposition of V̄ (Appendix 2F):

  V̄ = [ Y(θ)  Z(θ) ] ( R(θ) ; 0 )