Uncertain Dynamic Systems 0139355936, 9780139355936


English Pages [576] Year 1973


Uncertain Dynamic Systems: Modelling, Estimation, Hypothesis Testing, Identification and Control


FRED C. SCHWEPPE Department of Electrical Engineering Massachusetts Institute of Technology Cambridge, Massachusetts


Englewood Cliffs, New Jersey

Library of Congress Cataloging in Publication Data

SCHWEPPE, FRED

Uncertain dynamic systems. (Prentice-Hall series in electrical engineering) Bibliography: p. 1. System analysis. 2. Dynamics. 3. Estimation theory. 4. Statistical hypothesis testing. I. Title. QA402.S37 620'.72 72-11801 ISBN 0-13-935593-6

© 1973 by Prentice-Hall, Inc. Englewood Cliffs, New Jersey

All rights reserved. No part of this book may be reproduced in any form or by any means without permission in writing from the publisher.

10 9 8 7 6 5 4 3 2 1

Printed in the United States of America




New Delhi

To Cindy, Edmund, Carl, and Fritz "Dignity at all costs"





6.16 Curve Fitting†
6.17 Asymptotic Efficiency: Fisher Estimation†
6.18 Role of White Process Model
6.19 Discussion

7 Estimation: Continuous-Time Linear Dynamic Systems
7.1 Discrete to Continuous Time Limit: Bayesian
7.2 Bayesian: Filtering and Prediction
7.3 Weighted Least Squares
7.4 Fisher: Filtering
7.5 Unknown-but-Bounded: Filtering
7.6 Weighting Patterns
7.7 Implementation
7.8 Discussion

8 Application of Linear Model Estimation Theory
8.1 Development of Models
8.2 Practical Filter Design
8.3 Sensor, Signal, Measurement, etc., Design
8.4 Summary Discussion

PART III: Hypothesis Testing

9 Hypothesis Testing: Static Systems
9.1 Areas of Application
9.2 Philosophy of Presentation
9.3 Bayesian
9.4 Fisher
9.5 Weighted Least Squares
9.6 Unknown-but-Bounded
9.7 Discussion

10 Hypothesis Testing: Linear Dynamic Systems
10.1 Discrete-Time Gaussian Process
10.2 Continuous-Time Gaussian Process
10.3 Weighted Least Squares: Minimum Residual Decision Making
10.4 Unknown-but-Bounded
10.5 Sequential Hypothesis Testing
10.6 Finite Memory
10.7 Event Detection
10.8 Testing Whiteness of Residuals
10.9 Application of Hypothesis-Testing Theory
10.10 Discussion

PART IV: Nonlinear Models

11 Estimation: Static Nonlinear Systems
11.1 Bayesian
11.2 Fisher
11.3 Relationship Between Fisher and Bayesian†
11.4 Weighted Least Squares
11.5 Unknown-but-Bounded
11.6 Summary of Types of Estimators
11.7 Iterative Methods for Finding a Minimum
11.8 Parameterization†
11.9 Change of Variables
11.10 Stochastic Approximation†
11.11 Discussion

12 Performance of Estimators: Static Nonlinear Systems
12.1 Linearized Error Analysis (Sensitivity Analysis)
12.2 Linear Error Analysis: Maximum Likelihood Estimator
12.3 Lower Bound on Estimate Error Variance†
12.4 Ambiguity Functions
12.5 A Better Lower Bound†
12.6 Monte Carlo Methods
12.7 Discussion

13 Estimation: Discrete-Time Nonlinear Dynamic Systems
13.1 Propagation of Uncertainty
13.2 Bayesian
13.3 Fisher
13.4 Weighted Least Squares
13.5 Finite Memory†
13.6 Unknown-but-Bounded†
13.7 Performance Analysis
13.8 Discussion of Theoretical Development
13.9 Approximate Propagation of Uncertainty
13.10 Recursive Filters
13.11 Continuous-Time Models
13.12 Discussion

PART V: Survey Discussions

14 System Identification
14.1 Model Structures
14.2 State Augmentation: Nonlinear Estimation Theory
14.3 Maximum Likelihood: Gaussian Model, General Case
14.4 Maximum Likelihood: Gaussian Model, Special Cases
14.5 Maximum Likelihood: Error Analysis
14.6 Model Reference: Weighted Least Squares
14.7 Correlation Methods
14.8 Instrument Variables
14.9 Unknown-but-Bounded
14.10 Choice of Input
14.11 Model Reduction
14.12 Nonlinear Structures
14.13 Discussion

15 Stochastic Control
15.1 Types of Stochastic Control Problems
15.2 Open-Loop Control
15.3 Closed-Loop Control: General
15.4 Separation Theorem: General
15.5 Separation Theorem: Linear Model, Quadratic Criteria
15.6 Dual Control
15.7 Adaptive Control
15.8 Unknown-but-Bounded
15.9 Discussion

Appendix: Matrix Algebra
Appendix: Linear Difference and Differential Equations
Appendix: Vector and Matrix Gradient Functions
Appendix: Probability Theory
Appendix: Multivariate Gaussian (Normal) Distribution
Appendix: Stochastic Processes
Appendix: Set Theory
Appendix: Ellipsoids








This book is designed to help the reader make certain (that is, not uncertain) statements about dynamic systems when uncertainty exists in a system's input and output and in the nature of the system itself. Interest centers on methods applicable to large-scale, multivariable systems. The problems of and interactions among modeling, analysis, and design are considered. Emphasis is placed on techniques which can be implemented on a digital computer. Some new, previously unpublished results on unknown-but-bounded models are included, but most of the development is based on already existing and even classic concepts involving statistical-probabilistic models. The many different ideas are organized and presented so that their similarities and differences hopefully become clear. Emphasis is on applications rather than theory, so discussions of how to and how not to use the theory are considered to be as important as the derivation of equations. Concepts are first introduced for static (nondynamic) systems so that the basic ideas can be understood with a minimum of mathematical sophistication and complication. Extension to the complexities of dynamic systems then follows in a relatively straightforward fashion.

This book resulted from lecture notes for a control-system theory course in an electrical engineering curriculum. However, the material is widely applicable in many different areas of engineering as well as in economic and social systems. The book is suitable for independent study or can be used as a text for a one- or two-semester course.

The writing of the book was a part-time project spread over many years, and many people played a role in its evolution. Some of the earlier ideas were influenced by discussions with Harold Kushner, David Sakrison, and especially Leland Gardner. Various versions of the material have been presented in courses with the assistance of Edison Tse, Frank Galiana, J. Duncan Glover, Steve Hnylicza, and Robert Moore. Tom Kailath and Michael Athans provided critical and hence very useful comments which had a strong influence on the final draft. Barbara Smith, Helen Jackson, and Karen Keefe were extremely helpful in the many details associated with obtaining a final manuscript. Finally, but very definitely not least, Connie Spargo typed and retyped most of the manuscript. She demonstrated an uncanny ability to transform almost unreadable handwriting and rough sketches into clean, clear text, equations, and figures.

FRED C. SCHWEPPE

Cambridge, Massachusetts



1 Introduction

The three main topics of interest in this book are: 1. Modeling of uncertainty. 2. Analyzing the effects of uncertainty. 3. Designing systems to remove or compensate for uncertainty. This introductory chapter is intended to remove some of the reader's uncertainty as to what he will encounter in later chapters.



According to the dictionary, something is uncertain if it is indeterminate, indefinite, not reliable, untrustworthy, and not clearly defined. One of the main goals of this book is to make determinate, definite, reliable, trustworthy, clearly defined statements about uncertain dynamic systems. To make certain statements about an uncertain system, it is necessary to use precise, well-defined (i.e., certain) models for the uncertainty. Thus, in a very real sense, we are not interested in truly uncertain systems, but instead are interested in problems for which certain models are available. These certain models are chosen to represent what is considered to be uncertain about the system.

The following are examples of models for uncertainty. Possible models for a scalar x whose value is uncertain are

1. x is a random variable with specified probabilistic structure.
2. x is a random variable whose probabilistic structure contains uncertain parameters; for example, mean and variance.
3. x is a random variable whose probabilistic structure is unknown except for certain moments; for example, mean and variance.
4. x is completely unknown.
5. x is bounded; for example, $|x|$ is no greater than a specified bound.
6. x can take on only certain values; for example, x = 1, 2, or 3.

These six models can be combined into two basic types as follows:

1-3. x is a random variable whose probabilistic structure may be uncertain.
4-6. x belongs to some set.

Possible models for a K-dimensional vector x whose value is uncertain are

1. x is a random vector whose probabilistic structure may be uncertain.
2. x belongs to some set in a K-dimensional space; for example, x belongs to a closed set, x can take on only certain values, or x lies on some hypersurface.

Possible models for a scalar time function x(t) over 0 < t < T whose value is uncertain are

all w




w < -w 0

x(t)e-'"'' dt



5. x(t) is parameterized; for example, $x(t) = a_0 + a_1 t + a_2 t^2$, where the values of $a_0$, $a_1$, and $a_2$ are uncertain.

Models 2, 3, 4, and 5 can be viewed as methods of constraining x(t) to lie in some set in the space of all possible time functions over 0 < t < T. Similar types of models can be stated for uncertain vector time functions and functions of more than one independent variable.

The above examples illustrate the two main classes of uncertainty models that will be considered in this book:

1. Probabilistic models with possible uncertainty in the probability distributions.
2. Set-theoretic models (called unknown-but-bounded models).

Naturally not all possible combinations of such models will be studied. The actual subjects to be considered fall into four general categories:

1. Bayesian: Uncertainty is modeled by random variables and/or stochastic processes with either completely specified probability distributions or completely specified first and second moments (means and variances).
2. Fisher: Uncertainty is modeled by a combination of Bayesian models and completely unknown quantities.
3. Unknown-but-bounded: Uncertainty is modeled by set-theoretic models; the resulting quantities are called unknown-but-bounded processes.
4. Weighted least squares: "Reasonability" arguments are used in place of explicit models for the uncertainty.


1.2 Background Needed

The required prior background in probability theory is an understanding of the basic concepts of vector random variables; i.e., probability density, expectation, conditional density, and conditional expectation. However, few of the ideas (and none of the really basic concepts) require an understanding at a sophisticated mathematical level.

The book draws heavily on state space techniques for handling and analyzing deterministic dynamic systems. Thus the reader is assumed to have a prior understanding of such concepts and a thorough grasp of the basic linear algebra of vector and matrix theory and of methods for handling vector difference and differential equations.

Certain subjects to be covered are closely related to deterministic optimal control theory (calculus of variations, maximum principle, and dynamic programming). However, a prior background in these areas is not assumed of the reader.

This book evolved from lecture notes for a course in control-system theory and some of the discussions reflect this control-system theory orientation. However, the material is applicable to a much broader class of problems, and a reader without a control-system background should have little trouble in following the main ideas and developments.

1.3 Goals and Method of Approach

This book is not written to be an introduction, a survey, or a complete exposition. The book is written to take the reader from an understanding of certain basic ideas along a relatively narrow path which leads to a wide variety of advanced topics. The goal of the trip is to develop an understanding so the reader can successfully apply the ideas. The author personally feels that the route followed on this trip is limited primarily in the area of theory, not in the range of applications. An attempt is made to present the material so that the book can also be used as a reference by readers who already have some background and experience in the field.

The following pattern of presentation is continually repeated:

1. Concepts are first defined for the static case, i.e., a system defined by just a vector set of equations.
2. State space concepts are used to convert the static case into a discrete-time dynamic system.
3. A discrete to continuous time limit is used to yield results for continuous-time systems.

The theory for linear systems is presented before nonlinear systems are considered. This method of presentation has the disadvantage of being slow. However, four advantages to the chosen method of presentation are

1. Concepts are first introduced and discussed in simple situations (linear, static systems) where they can easily be understood.
2. No prior knowledge of stochastic processes is needed.
3. Many applications involve discrete-time systems and analysis, i.e., digital computers.
4. The linear theory provides the basis for many of the practical techniques for handling nonlinear problems.

The use of state space concepts greatly simplifies the development of models and analysis techniques. State space techniques also provide results in a form directly suitable for implementation on analog or digital computers. This is very important, as the types of problems encountered in modern applications rarely yield simple, closed-form, analytic results. In fact, the underlying philosophy of this book is that "solution" consists of an algorithm (equation) suitable for computer implementation. Discussions of computational problems are therefore included as an integral part of the overall presentation.

The examples presented as an aid to understanding usually involve a one- or two-dimensional time-invariant system. The exercises at the end of the chapters are designed to be done with the minimum amount of tedium, so they usually involve one- or two-dimensional time-invariant systems. However, these simple examples and problems should not be viewed as being typical of applications. The general theory is aimed toward developing techniques for solving complicated, high-dimension, and/or time-varying problems on a (usually digital) computer. If this were the "best of all possible worlds," the reader would have a computer console furnished with the book so the examples and problems could be more realistic.

This book assumes that the reader has a "reasonable degree" of mathematical sophistication. However, the material is presented in a discussion format rather than a theorem-proof format and results are often developed in an intuitive manner rather than as special cases of some more general, abstract mathematical concept. Two advantages to the chosen method of presentation are

1. In most (but not quite all) applications, intuitive understanding is more useful than general, abstract theorems.
2. A more sophisticated approach would either require the inclusion of background material or would restrict the book to a narrow class of readers.

The chosen method of presentation has the disadvantage that much excellent material available in the literature is written using more sophisticated mathematical formalization and terminology, and the book does not train the reader to handle such literature. The decision to write the book in its present form obviously reflects the author's personal interests.

1.4

The book is divided into five parts. Part I presents general discussions on system theory and types of mathematical models to be considered. Part II covers estimation theory for linear systems. Part III introduces certain aspects of decision theory (hypothesis testing), which is also viewed as a partial bridge between linear and nonlinear estimation theory. Part IV discusses estimation theory for nonlinear systems. Part V contains general discussions on system identification and stochastic control. The appendices contain background and supplementary material.

Except for Part V, the book is organized in a highly structured form. The heart of the structure is the basic sequence

Static system (vectors)
↓
Discrete-time dynamic system
↓
Continuous-time dynamic system

At each stage, consideration is given to the three basic uncertainty models (Bayesian, Fisher, unknown-but-bounded) and to the weighted-least-squares concept. This overall structure is evident in the table of contents.

The tight structural form does not force the reader to start at the beginning and go on a straight line to the end. On the contrary, the structure is intended to provide the reader with a choice of many possible routes through the book. The following general comments are intended to help one choose a route:

1. Chapter 2 contains a very general overview which is not essential to understanding the rest of the text.
2. The unknown-but-bounded model discussions can be viewed as a separate sequence and studied independently of the Bayesian, Fisher, weighted-least-squares sequence. Separation of the Bayesian, Fisher, and weighted least squares into individual sequences is not recommended.
3. The linear and nonlinear model developments can be combined in different ways; for example, Chapters 4, 9, 11, and 12 can be studied as a sequence on the static model.
4. The basic static, discrete-time, continuous-time sequence should not be broken.
5. The survey discussions of Part V can be factored into a sequence. For example, Chapter 14 (system identification) and Chapter 15 (stochastic control) can follow Chapter 6, with only selected reading in other chapters.

Figure 1.4.1 summarizes the main prerequisite structure existing between the various chapters. This prerequisite structure is related to general overall understanding of concepts. Sometimes reference is made to specific points in earlier nonprerequisite chapters, but such references can either be skipped as being only "side points" or else the referenced material can be studied "out of context."

As a final aid to choosing a route through this book, various sections and subsections are "daggered" (†) to indicate that they are not considered to be important to the main theme of the development.

The appendices play an ambiguous role which depends on the reader's background and interests. The appendices contain definitions of basic mathematical concepts. Hopefully the reader is already familiar with many of these ideas, but it is expected that most readers will have to go to the appendices at various times.

Figure 1.4.1. Prerequisite structure of chapters. (An asterisk implies the chapter is recommended but is not essential.)



The question of references presented a problem when writing this book as it is difficult to give credit where credit is due. Many of the basic principles can be traced back to the work on least squares by Gauss and Legendre. From this common base, the theory evolved through many different paths to its present state. Statisticians, economists, surveyors, natural scientists, and astronomers all developed various aspects, often independently of each other. The same basic results were often rederived many times for different applications. The engineers entered the picture with Wiener's work and proceeded with their own evolutionary cycle of deriving new results and rederiving old results which were thought to be new. The crediting problem is further complicated by the fact that the author has worked in this field for a relatively long time and many ideas have become so ingrained that he has forgotten where they came from originally.

In addition to giving credit, an ideal bibliography or set of references should prove to be a source of background reading and a guide to advanced studies. Unfortunately, this is also a difficult task. The basic concepts of uncertain systems apply and are used in many different fields, and the appropriate reference often depends on the reader's background. The actual bibliography and references found in this book are the result of the following ground rules:

1. Credit is definitely given to all material that the author consciously used (whether it be book, paper, unpublished report, or personal discussion).
2. Background references are limited to only selected books, although in general many others may be equally good.
3. Selected references to papers and books where further or more advanced developments can be found are given.

A complete listing of all related papers and books is not attempted. However, a failure to reference some book or paper definitely does not constitute a nonrecommendation.



Half the job of reading any paper or book is usually the "figuring out of what the notation means." As a start in this direction, some of the conventions to be used are now stated. A boldface symbol denotes a vector or matrix, i.e.,

$a, A$: scalars
$\mathbf{a}, \mathbf{A}$: vectors or matrices

Usually, boldface lowercase letters are vectors, while boldface capital letters are matrices. (This convention obviously is imprecise as a vector is also a matrix.) All vectors are column vectors unless explicitly stated otherwise. If $\mathbf{A}$ is a matrix, then

$\mathbf{A}^{-1}$ denotes inverse of $\mathbf{A}$ (assuming it exists)
$|\mathbf{A}|$ denotes determinant of $\mathbf{A}$
$\operatorname{tr} \mathbf{A}$ denotes trace of $\mathbf{A}$
$\mathbf{A}'$ denotes transpose of $\mathbf{A}$

The use of $|\mathbf{A}|$ for the determinant of $\mathbf{A}$ is somewhat ambiguous as $|A|$ is also used to denote the absolute value of the scalar $A$, but the meaning is always made clear in the text. A unit matrix is denoted by $\mathbf{I}$. The zero vector or matrix is denoted by $\mathbf{0}$.

The symbol $E$ always denotes expectation of a random variable or stochastic process. No notational convention is used to denote the difference between a random variable and a sample of a random variable. If the random vector $\mathbf{x}$ has a Gaussian (or normal) probability density with mean $\mathbf{m}$ and covariance matrix $\boldsymbol{\Gamma}$,

$E\{\mathbf{x}\} = \mathbf{m}, \qquad E\{[\mathbf{x} - \mathbf{m}][\mathbf{x} - \mathbf{m}]'\} = \boldsymbol{\Gamma}$

this is often expressed as

$\mathbf{x}$ is $N(\mathbf{m}, \boldsymbol{\Gamma})$

No particular convention is used in determining whether an English or Greek symbol is to be used. The following conventions are almost always followed (see also Fig. 2.1.1 of Chapter 2):

$\mathbf{u}$: known system input or input control
$\mathbf{w}$: uncertain disturbance "driving" the system
$\mathbf{v}$: uncertain disturbance corrupting the observations
$\mathbf{z}$: observations
$\mathbf{x}$: system state vector

Time is usually the only independent variable considered. For continuous-time problems, $t$ denotes time. In discrete-time problems, the integer $n$ denotes time, the "event," or the sample number. When necessary, $\Delta$ denotes time between events, and so $n\Delta$ denotes discrete time. An estimate of a variable is denoted by a superscript caret; i.e., $\hat{\mathbf{x}}$ is an estimate of $\mathbf{x}$. The error in an estimate is denoted by !i, and so


= x-


A superscript tilde is often used to denote a dummy or related variable. Thus $\tilde{\mathbf{x}}$ may denote a variable similar in character to but different from $\mathbf{x}$. Sets are denoted using notations such as $\Omega =$

[x: g(x) (na)x(na) + AG(na)w(na) z(na) = H(nA)x(na) + v(nA)


= 0





£(v(n,a)w'(n 2 A)} = 0, E(x(O)x'(O)} · 'V


= 0




= 0




Mathematical Models

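Bayesian, continuous time:

$\frac{d}{dt}\mathbf{x}(t) = \mathbf{F}(t)\mathbf{x}(t) + \mathbf{G}(t)\mathbf{w}(t)$

$\mathbf{z}(t) = \mathbf{H}(t)\mathbf{x}(t) + \mathbf{v}(t)$

$E\{\mathbf{x}(0)\} = \mathbf{0}, \qquad E\{\mathbf{x}(0)\mathbf{x}'(0)\} = \boldsymbol{\Psi}$

$E\{\mathbf{x}(0)\mathbf{v}'(t)\} = \mathbf{0}, \qquad E\{\mathbf{x}(0)\mathbf{w}'(t)\} = \mathbf{0}$

$E\{\mathbf{v}(t_1)\mathbf{w}'(t_2)\} = \mathbf{0}, \qquad \text{all } t_1, t_2$

$E\{\mathbf{w}(t_1)\mathbf{w}'(t_2)\} = \delta(t_1 - t_2)\mathbf{Q}(t_1)$

$E\{\mathbf{v}(t_1)\mathbf{v}'(t_2)\} = \delta(t_1 - t_2)\mathbf{R}(t_1)$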
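A continuous-time white process cannot be sampled directly, so a simulation has to pass through a discrete-time approximation. The following is a minimal sketch for a scalar, time-invariant case. It assumes, beyond what the model above specifies, that the uncertainties are Gaussian, that the discrete-time disturbances have variances Q/Δ and R/Δ, and that a simple one-step (Euler) update is adequate; the function name `simulate` and all numerical values are hypothetical.

```python
import numpy as np

def simulate(F, G, H, Q, R, Psi, delta, steps, paths, seed=0):
    """Simulate x(n*delta + delta) = x + delta*(F*x + G*w), z = H*x + v (scalar case).

    Assumptions of this sketch: Gaussian samples, and white processes
    approximated by independent samples with variances Q/delta and R/delta.
    """
    rng = np.random.default_rng(seed)
    # Initial state with E{x(0)} = 0 and E{x(0)^2} = Psi.
    x = rng.normal(0.0, np.sqrt(Psi), size=paths)
    for _ in range(steps):
        w = rng.normal(0.0, np.sqrt(Q / delta), size=paths)
        x = x + delta * (F * x + G * w)
    v = rng.normal(0.0, np.sqrt(R / delta), size=paths)
    z = H * x + v
    return x, z

# Hypothetical values: a stable first-order system driven by white noise.
x_final, z_final = simulate(F=-1.0, G=1.0, H=1.0, Q=2.0, R=0.1,
                            Psi=1.0, delta=0.01, steps=500, paths=4000)
print("sample variance of x:", x_final.var())
```

With these values the sample variance of the state should stay near G²Q/(2|F|) = 1, which is the steady-state variance of the continuous-time model; the small remaining difference comes from the step size and the finite number of sample paths.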

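Unknown-but-bounded, discrete time:

$\mathbf{x}(n\Delta + \Delta) = \boldsymbol{\Phi}(n\Delta)\mathbf{x}(n\Delta) + \Delta\mathbf{G}(n\Delta)\mathbf{w}(n\Delta)$

$\mathbf{z}(n\Delta) = \mathbf{H}(n\Delta)\mathbf{x}(n\Delta) + \mathbf{v}(n\Delta)$

$\mathbf{x}(0) \in \Omega_x(0), \qquad \mathbf{w}(n\Delta) \in \Omega_w(n\Delta), \qquad \mathbf{v}(n\Delta) \in \Omega_v(n\Delta)$

For ellipsoidal sets,

$\Omega_x(0) = \{\mathbf{x} : \mathbf{x}'\boldsymbol{\Psi}^{-1}\mathbf{x} \le 1\}$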
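An ellipsoidal set constraint is easy to test numerically. The sketch below assumes the ellipsoid is defined by the quadratic form x'Ψ⁻¹x ≤ 1 with Ψ positive definite; the function name `in_ellipsoid` and the numbers are hypothetical.

```python
import numpy as np

def in_ellipsoid(x, Psi):
    """Return True if x' Psi^{-1} x <= 1 (Psi assumed positive definite)."""
    x = np.asarray(x, dtype=float)
    # Solve Psi y = x rather than forming the inverse explicitly.
    y = np.linalg.solve(Psi, x)
    return float(x @ y) <= 1.0

# Hypothetical two-dimensional ellipsoid with semi-axes 2 and 1.
Psi = np.array([[4.0, 0.0],
                [0.0, 1.0]])
inside = in_ellipsoid([1.0, 0.5], Psi)    # quadratic form = 1/4 + 1/4
outside = in_ellipsoid([0.0, 1.5], Psi)   # quadratic form = 2.25
print(inside, outside)
```

Solving the linear system instead of inverting Ψ is a numerical convenience only; it does not change the set being described.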



$y(n + 1) + \sum_{m=1}^{M_1} \Phi_m(n)\,y(n - m + 1) = G(n)w(n)$

$z(n) = y(n)$

Generality of State Space Models


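Initial conditions are ignored for simplicity. Such an autoregressive model can be expressed in terms of a basic state space model such as (3.4.1). To illustrate, assume that $M_1 = 2$. Define

$\mathbf{x}(n) = \begin{bmatrix} x_1(n) \\ x_2(n) \end{bmatrix} = \begin{bmatrix} y(n - 1) \\ y(n) \end{bmatrix}$

$\boldsymbol{\Phi}(n) = \begin{bmatrix} 0 & I \\ -\Phi_2(n) & -\Phi_1(n) \end{bmatrix}, \qquad \mathbf{G}(n) = \begin{bmatrix} 0 \\ G(n) \end{bmatrix}, \qquad \mathbf{H} = \begin{bmatrix} 0 & I \end{bmatrix}$

Then the state space model

$\mathbf{x}(n + 1) = \boldsymbol{\Phi}(n)\mathbf{x}(n) + \mathbf{G}(n)w(n)$

$z(n) = \mathbf{H}\mathbf{x}(n)$

is completely equivalent to the original autoregressive model.

A continuous-time autoregressive model for $M_1 = 2$ is

$F_0(t)\,y(t) + F_1(t)\,\frac{d}{dt}y(t) + \frac{d^2}{dt^2}y(t) = G(t)w(t)$

$z(t) = y(t)$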

This can be expressed as a special case of (3.4.2) by choosing




G(t) = [ G(t)
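The equivalence claimed for the discrete-time autoregressive example with M₁ = 2 can be checked numerically. The sketch below assumes scalar, time-invariant coefficients, the sign convention y(n+1) + Φ₁y(n) + Φ₂y(n−1) = Gw(n), and the state x(n) = [y(n−1), y(n)]'; the coefficient values are hypothetical.

```python
import numpy as np

phi1, phi2, g = -0.5, 0.2, 1.0      # hypothetical scalar coefficients
rng = np.random.default_rng(1)
w = rng.normal(size=50)

# Direct autoregressive recursion with zero initial conditions:
# y(n+1) = -phi1*y(n) - phi2*y(n-1) + g*w(n)
y = np.zeros(len(w) + 1)
for n in range(len(w)):
    y_prev = y[n - 1] if n >= 1 else 0.0
    y[n + 1] = -phi1 * y[n] - phi2 * y_prev + g * w[n]

# State space form with x(n) = [y(n-1), y(n)]'.
Phi = np.array([[0.0, 1.0],
                [-phi2, -phi1]])
G = np.array([0.0, g])
H = np.array([0.0, 1.0])

x = np.zeros(2)
z = np.zeros(len(w) + 1)
for n in range(len(w)):
    z[n] = H @ x                 # observation z(n) = H x(n)
    x = Phi @ x + G * w[n]       # state update
z[-1] = H @ x

max_diff = np.max(np.abs(z - y))
print("largest difference between the two forms:", max_diff)
```

The two sequences agree to rounding error, which is all that "completely equivalent" means at the level of input-output behavior.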





A discrete-time moving-average model ($\Delta = 1$) is

$y(n) = \sum_{m=1}^{M_2} B_m(n)\,w(n - m)$

$z(n) = y(n)$

Such discrete-time moving-average models can also be expressed in terms of (3.4.1). To illustrate, assume that $M_2 = 3$. Then the following is an equivalent model:

M,, such continuous-time mixed models can be expressed in terms of the basic models of (3.4.2). It is seen from the preceding that the autoregressive-moving-average models and the basic models of (3.4.1) and (3.4.2) have much in common. In the vast majority of problems of interest, either basic formulation can be used, but exceptions to this general rule exist.


Uncertain Model Parameters and System Identification

The basic linear model forms defined in Section 3.1

1. Cover only systems defined by known $\mathbf{F}$, $\boldsymbol{\Phi}$, $\mathbf{G}$, and $\mathbf{H}$ matrices, i.e., allow for uncertainty only in the initial conditions, system input, and observation disturbances.
2. Require uncertainty models that are completely defined in terms of known $\boldsymbol{\Psi}$, $\mathbf{Q}$, and $\mathbf{R}$ matrices.

The assumption that the $\mathbf{F}$, $\boldsymbol{\Phi}$, $\mathbf{G}$, $\mathbf{H}$, $\boldsymbol{\Psi}$, $\mathbf{Q}$, and $\mathbf{R}$ matrices are known exactly is rarely (if ever) satisfied in practice. If this fact disturbs the reader, he should not despair too much, as there are three ways to handle the problem:

1. Ignore it and simply use an "engineering guess" of the model parameters (i.e., $\mathbf{F}$, $\boldsymbol{\Phi}$, etc., values).
2. Choose an approach which does not need the uncertain model parameters.
3. Use system identification.

The use of an engineering guess for the model parameters is the simplest and most common approach and it is often satisfactory (in the sense that the final result does the job). In later chapters, techniques for analyzing the effects of errors in the model parameters are discussed.

The weighted least squares estimation concept does not employ explicit models for the uncertainty in $\mathbf{w}$, $\mathbf{v}$, and $\mathbf{x}(0)$ and thus does not need values for the $\boldsymbol{\Psi}$, $\mathbf{R}$, and $\mathbf{Q}$ matrices. This approach requires one to choose other "similar" parameter values, but the logic underlying their choice is different.

System identification techniques are a more sophisticated approach wherein the observations are used to estimate the values of the model parameters as well as the state. It turns out that state augmentation concepts can often be used to convert a linear system with unknown (or uncertain) model parameters into a nonlinear system.




Consider the scalar system

$x_1(n + 1) = \Phi\,x_1(n) + w(n)$

$z(n) = x_1(n) + v(n)$

The value of $\Phi$ is uncertain. The equivalent two-dimensional nonlinear system

$x_1(n + 1) = x_2(n)\,x_1(n) + w(n)$

$x_2(n + 1) = x_2(n)$

$z(n) = x_1(n) + v(n)$

does not have uncertain model parameters. Thus the discussions of Part IV of this book on nonlinear systems can conceptually be applied to many linear systems with uncertain model parameters. However, Chapter 14 of Part V discusses system identification explicitly from a more fundamental point of view. The study of models with completely known model parameters (as specified in Section 3.1) is important because they must be clearly understood before real-world solutions are attempted.
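The state augmentation idea in the scalar example can be illustrated with a few lines of code. The sketch below runs the original linear recursion and the augmented nonlinear recursion on the same disturbance sequence and confirms that they produce the same x₁; the value of Φ, the initial state, and the function names are hypothetical.

```python
import numpy as np

def linear_step(x1, phi, w):
    """Original scalar system: x1(n+1) = phi * x1(n) + w(n)."""
    return phi * x1 + w

def augmented_step(x, w):
    """Augmented system: x1(n+1) = x2(n) * x1(n) + w(n), x2(n+1) = x2(n)."""
    x1, x2 = x
    return np.array([x2 * x1 + w, x2])

phi_true = 0.9                       # hypothetical value of the uncertain parameter
rng = np.random.default_rng(2)
w = rng.normal(size=20)

x1 = 1.0                             # state of the linear system
x = np.array([1.0, phi_true])        # augmented state [x1, x2] with x2 = phi
for wn in w:
    x1 = linear_step(x1, phi_true, wn)
    x = augmented_step(x, wn)

print(x1, x[0], x[1])
```

The parameter has simply become a second state variable with trivial dynamics. The price is the product x₂(n)x₁(n), which makes the augmented system nonlinear even though the original system is linear in x₁.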



The basic mathematical model forms of Section 3.1 are based on combining state space structures with white uncertain processes. The main point of this chapter is that an extremely wide range of problems can be put into this form. Sometimes this form is "natural"; in other cases the natural form has to be converted. A state space-white process model does not encompass all cases of possible interest, but it comes close, especially if approximations are allowed.

The role of this chapter relative to real-world problems and the rest of the book is summarized in Fig. 3.6.1. The figure is (hopefully) self-explanatory. A reader with an interest in a class of physical problems whose natural models are not close to the basic state space form may find the need to go through Step II of Fig. 3.6.1 (i.e., use ideas of this chapter) rather frustrating. The book is written in its present manner primarily because it is not directed toward any particular class of physical problems. In the author's opinion, the basic state space-white process model forms of Section 3.1 are by far the most powerful framework on which to develop both conceptual understanding and practical, useful results. The book's length could double or triple if separate developments were attempted for the other natural mathematical models of possible interest.

Figure 3.6.1 Possible steps in engineering solution. Boxes, in order: Real World Problem; "Natural" Mathematical Model; Basic State Space White Process Form; Results in State Space Form; Results in "Natural" Form; Real World Implementation. Step I: Up to Engineer. Step II: Discussed in this Chapter. Steps III, IV: Discussed in Rest of Book. Step IVa: Reverse Step II.

Various types of models have been discussed: Bayesian, Fisher, and unknown-but-bounded. The question "Which model is best?" is meaningless and has no answer. The choice of model depends on the particular problem being studied. Various aspects of choosing a model will be discussed in Chapter 8.

Historical Notes and References: Chapter 3

Most of the ideas of this chapter are in the "folk-lore" category; that is, they are well known by workers in the field, but it is effectively impossible to say who originated them. Book List 3 on Deterministic Systems contains background material on state space concepts, while Book List 4 on Stochastic Systems contains many of the stochastic modeling ideas discussed here. Autoregressive-moving average models are more common in statistical time series references, as found in Mathematical Statistics, Book List 2. Frequency domain models tend to be emphasized in the Communications references of Book List 5. The terms Bayesian model and Fisher model are used in this book to try to categorize types of ideas, but the terminology is by no means universally used. Book List 4 on Stochastic Systems tends to emphasize Bayesian models, while Book List 2 on Mathematical Statistics emphasizes Fisher models. The unknown-but-bounded models are taken from Schweppe [9] and related works (see discussion on Section 6.6).

Exercises 3.1. Consider the no-dynamics scalar model

= x + v(n) + b, E(xj = 0 . E(x2J =






E(v(n)J = 0

1, ... , 3


£[v 2 (n)l

11=1, ... ,3

a. Describe in words physical (or economic or social, etc.) situations involving observations z(n) on a constant state x with observation errors v(n) + b for the following models for the constant b: (1) b is a random variable:

E(bJ = m (2) The value of b is known.

(3) The value of b is completely unknown.

b. For each of the three cases of part a, express the model as a static case model:

$$z = Hx + v, \qquad z = \begin{bmatrix} z(1) \\ \vdots \\ z(3) \end{bmatrix}$$

c: known vector

$$E\{xx'\} = \psi, \qquad E\{v\} = 0$$

Specify H, x, v, $\psi$, R, and c.

3.2. Consider

$$z(t) = a_0 + a_1 t + a_2 t^2 + v(t)$$

v(t): white stochastic process
$a_0, a_1, a_2$: completely unknown

Express this model in the following two forms:

a. No-dynamics model:

$$z(t) = H(t)x + v(t)$$

b. Dynamic system with no input:

$$\frac{d}{dt}x(t) = Fx(t), \qquad z(t) = Hx(t) + v(t)$$

3.3. Let y(t), 0 < t < T, denote the uncertain position versus time of an intoxicated eagle flying in a gusty head wind. Assume that the eagle's path "looks" like a second-degree polynomial in time over short time spans; i.e.,

$$y(t) = a_0(t) + a_1(t)\,t + a_2(t)\,t^2,$$

where for small time intervals, r - x(t)

Figure 4.2.1 Block diagram for example.

4.2.3 $E\{x(t)w(t)\} = ?$†

Consider (4.2.1). A question that sometimes arises is, What is $E\{x(t)w(t)\}$? This question can yield various answers. This is one of the unfortunate consequences of the use of a continuous-time white stochastic process model. Assume that the continuous-time model is viewed as the limit of a discrete-time model. Consider the integrator of Section 4.2.1. Then two possible definitions and results are

$$E\{x(t)w(t)\} = \lim E\{x(n\Delta)w(n\Delta)\} = 0 \tag{4.2.8}$$

$$E\{x(t)w(t)\} = \lim E\left\{\tfrac{1}{2}\left[x(n\Delta + \Delta) + x(n\Delta)\right]w(n\Delta)\right\} = \frac{Q}{2} \tag{4.2.9}$$

There are other possible definitions which yield still different answers, but ·discussion is restricted to (4.2.8) and (4.2.9). Consider the example of Section 4.2.2. For the "physical" model of (4.2.5), it can be shown that (see Exercise 4.12) E[xc(t)c(t)} = 2 A(i+ A) ·~-·


Analysis of Linear Systems

or when $|A| \gg |F|$

Thus in the equivalent "white" model (4.2.7) it is reasonable to assume that E{x(t)w(t)} =


.as this yields E[x(t)w(t)} =


Hence in this sense the definition of (4.2.9) is more appropriate than (4.2.8). An alternative way to view the situation is to write x(t)



eF T, where N

X2 (N .1.) =


I: n=l

~x 1 (n~)w(n.1.)

x (NA), 2

N-> oo,




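Equation (4.2.11) can be manipulated into the form

$$x_2(N\Delta) = \frac{1}{2}\sum_{n=1}^{N}\left[x_1^2(n\Delta + \Delta) - x_1^2(n\Delta)\right] - \frac{1}{2}\sum_{n=1}^{N}\Delta^2 w^2(n\Delta)$$

so that

$$x_2(N\Delta) = \frac{1}{2}x_1^2(N\Delta + \Delta) - \frac{1}{2}\sum_{n=1}^{N}\Delta^2 w^2(n\Delta)$$

Thus in the limit as $N \to \infty$, $\Delta \to 0$, $N\Delta \to T$, there are two possible answers:

$$x_2(T) = \tfrac{1}{2}x_1^2(T) \quad \text{if } w(n\Delta) \text{ is well behaved} \tag{4.2.12}$$

$$x_2(T) = \tfrac{1}{2}x_1^2(T) - \tfrac{1}{2}QT \quad \text{if } w(n\Delta) \text{ is a white stochastic process and (4.2.10) is used} \tag{4.2.13}$$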


Is $x_2(T)$ given by (4.2.12) or (4.2.13)? If $w(n\Delta)$, n = 1, ..., goes in the limit to some well-behaved (physically realizable) w(t), then $\sum_{n=1}^{N}\Delta^2 w^2(n\Delta) \to 0$ and (4.2.12) is the result. Therefore if one wants answers which correspond to well-behaved processes, (4.2.12) is the best answer. Unfortunately, to obtain (4.2.12) from (4.2.11), it is necessary to ignore certain aspects of the mathematical model, and this does not satisfy our desire for mathematical consistency. However, (4.2.11) is only one of many ways to define $x_2(T)$. An equally reasonable definition is

$$x_2(N\Delta) = \sum_{n=1}^{N}\Delta\,\frac{x_1(n\Delta + \Delta) + x_1(n\Delta)}{2}\,w(n\Delta) \tag{4.2.14}$$

Now

$$x_1(n\Delta + \Delta) = x_1(n\Delta) + \Delta w(n\Delta)$$

so that after manipulation

$$x_2(N\Delta) = \sum_{n=1}^{N}\Delta\,x_1(n\Delta)w(n\Delta) + \frac{1}{2}\sum_{n=1}^{N}\Delta^2 w^2(n\Delta) = \frac{1}{2}\sum_{n=1}^{N}\left[x_1^2(n\Delta + \Delta) - x_1^2(n\Delta)\right]$$

so in the limit

$$x_2(T) \to \tfrac{1}{2}x_1^2(T)$$

Thus by using the definition of (4.2.14) instead of (4.2.11) it is possible to derive (4.2.12) without violating the mathematics. The problem of defining $x_2(T)$ is closely related to the problem of defining $E\{x(t)w(t)\}$. In particular, one can say that (4.2.8) and (4.2.11) are "similar" definitions and that (4.2.9) and (4.2.14) are "similar" definitions.
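The difference between the two discrete-time definitions can be checked numerically. The following is only a sketch under stated assumptions: the integrator $x_1(n\Delta + \Delta) = x_1(n\Delta) + \Delta w(n\Delta)$ is started from an assumed $x_1(0) = 0$, the sums run over n = 0, ..., N - 1 rather than n = 1, ..., N, and the discrete white process is taken to be Gaussian with variance $Q/\Delta$. The function name `stochastic_sums` is hypothetical.

```python
import random

def stochastic_sums(n_steps, delta, q, seed=0):
    """Simulate one sample path and return
    (x1 at the final time, left-point sum, averaged sum, 0.5 * sum of delta^2 w^2)."""
    rng = random.Random(seed)
    sigma = (q / delta) ** 0.5      # assumed: E{w^2(n*delta)} = Q / delta
    x1 = 0.0                        # assumed initial condition x1(0) = 0
    left_sum = 0.0                  # definition in the style of (4.2.11)
    averaged_sum = 0.0              # definition in the style of (4.2.14)
    correction = 0.0                # 0.5 * sum of delta^2 * w^2
    for _ in range(n_steps):
        w = rng.gauss(0.0, sigma)
        x1_next = x1 + delta * w
        left_sum += delta * x1 * w
        averaged_sum += delta * 0.5 * (x1_next + x1) * w
        correction += 0.5 * delta * delta * w * w
        x1 = x1_next
    return x1, left_sum, averaged_sum, correction

T, N, Q = 1.0, 20000, 2.0
x1_T, left, averaged, corr = stochastic_sums(N, T / N, Q)
print("averaged sum minus x1^2/2:", averaged - 0.5 * x1_T ** 2)
print("left sum minus (x1^2/2 - correction):", left - (0.5 * x1_T ** 2 - corr))
print("correction term:", corr, "compared with Q*T/2 =", 0.5 * Q * T)
```

On any sample path the averaged sum telescopes to $x_1^2/2$, while the left-point sum differs from it by the correction term, which stays close to $QT/2$ for a white input instead of vanishing.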

Continuous- Time White Stochastic Processes


The various definitions yield results which differ only by a constant, i.e., in the mean value of $x_2(T)$. Thus if the mean value is not important in the problem of interest, either definition can be used and the whole problem is only of academic interest. If the mean value is important, it is necessary to examine the nature of the actual process approximated by w(t) more closely and to remember that ideal integrators and ideal multipliers do not actually exist. In terms of frequency domain concepts, it may be necessary to worry about the bandwidth of the integrator and the multiplier relative to the bandwidth of w(t).

4.2.5 Mathematical Formalism†

In this book no attempt is made either to define or to manipulate continuous-time stochastic processes in a precise mathematical fashion. However, as an aid to the reader who wants to go to the literature, some mathematical formalism and associated jargon is briefly discussed. In a mathematical formulation, (4.2.1) would be written in the form [for scalar w(t)]

$$dx(t) = F(t)x(t)\,dt + G(t)\,db_w(t) \tag{4.2.15}$$

where

$$w(t) = \frac{db_w(t)}{dt}$$

$b_w(t)$ is called a Wiener process, or Brownian motion, or a process with independent increments. (See Appendix F.) It has been shown that care is needed in defining exactly what is meant by a solution, i.e., in defining $E\{x(t)w(t)\}$, or a stochastic integral, or a stochastic differential equation. At the present time there are two main methods of approach, often called the Ito calculus and the Stratonovich calculus. Roughly speaking, (4.2.8) and (4.2.11) correspond to the Ito calculus, while (4.2.9) and (4.2.14) correspond to the Stratonovich calculus. In general, the Stratonovich calculus is more physically oriented. However, the Ito calculus actually has some mathematical advantages, and once its specialized properties are understood, the Ito calculus is sometimes easier to use. It is possible to change the system models in a compensating fashion so that one can get Stratonovich-calculus-type results using Ito calculus. Any reader who is interested in doing mathematical research on continuous-time stochastic process problems must learn this mathematical formalism. Engineers who are interested in applying sophisticated results are also urged to "learn and understand the language" as it is used in most advanced papers and books. However, readers of all types must always remember that mathematical formalism does not replace engineering judgment with regard to when, where, or how the models should be used and what the results mean.

then c(t) is a physically realizable process. However,

Since a physically realizable process cannot have a derivative with infinite variance, c(t) is, in a very real sense, just as physically unrealizable as the white process w(t). The conclusion is that none of the continuous-time stochastic processes considered in this book are physically realizable. This does not mean the equations to be derived are useless. It merely reemphasizes the point that all mathematical models are only approximate representations of the real world.



The output of a dynamic system driven by a white unknown-but-bounded process is now discussed. Instead of considering positive semidefinite covariance matrices as in the Bayesian stochastic model, interest centers on sets (which are usually ellipsoids defined by positive semidefinite matrices). Comparison of the set-theoretic concepts to be discussed with the stochastic theory reveals that

1. The detailed mathematical manipulations are very different.
2. The final equations look similar.
3. The final equations can behave quite differently.

It must always be remembered that the two models involve fundamentally different sets of physical assumptions. It is recommended that Appendices G and H be read before proceeding, as the ideas therein are used extensively in the following.

4.3.1 Static Model

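Consider the static model

$$z = Hx + v, \qquad x \in \Omega_x, \qquad v \in \Omega_v$$

Let $\Omega_z$ denote the set that contains all the possible resulting z. Then

$$\Omega_z = \{z : z = Hx + v,\; x \in \Omega_x,\; v \in \Omega_v\}$$

Define

$$\Omega_{Hx} = \{y : y = Hx,\; x \in \Omega_x\}$$

Then

$$\Omega_z = \text{vector sum of } \Omega_{Hx} \text{ and } \Omega_v$$

Define

$s_z(\eta)$: support function of $\Omega_z$
$s_{Hx}(\eta)$: support function of $\Omega_{Hx}$

$$s_z(\eta) \le s_{z,b}(\eta), \quad \text{all } \eta$$

$$s_{z,b}(\eta) = \left[\eta'\left[(1 - \gamma)^{-1}R + \gamma^{-1}H\psi H'\right]\eta\right]^{1/2}$$

Thus the set $\Omega_{z,b}$ is a bounding ellipsoid for $\Omega_z$, where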

(na)x 1 -1- AG(nA)w,

X1 E




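Define

$$\Omega_x(n\Delta + \Delta \mid n\Delta) = \{x : x = \Phi(n\Delta)x_1,\; x_1 \in \Omega_x(n\Delta)\}$$

$$\Omega_{Gw}(n\Delta) = \{x : x = \Delta G(n\Delta)w,\; w \in \Omega_w(n\Delta)\}$$

$$\Omega_x(n\Delta + \Delta) = \text{vector sum of } \Omega_x(n\Delta + \Delta \mid n\Delta) \text{ and } \Omega_{Gw}(n\Delta)$$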


Define S .'(nA)1\]

Saw(n.)(1\) =


and (4.3.5) can be rewritten as S x(n(fl). The final result is Dx,b(nt.)


[x: xT- 1 (nt.)x





+ t.) =





T(O) ='I'

Note the structural similarity between the $\Gamma$ equation of (4.3.7) and (4.1.6) for the Bayesian model. For time-invariant stable systems, the set of reachable states $\Omega_x(n\Delta)$ will reach a steady-state value $\Omega_x(ss)$. The corresponding bounding ellipsoid $\Omega_{x,b}(ss)$ is given by (4.3.8).

EXAMPLE: Consider

$$x(n + 1) = \Phi x(n) + w(n), \qquad |\Phi| < 1, \qquad |w(n)| \le Q^{1/2}$$

Then from (4.3.8)

$$\Gamma_{ss} = \frac{(1 - \beta)Q}{\beta(1 - \beta - \Phi^2)}$$

Thus a choice of $\beta = 1 - \Phi^2$ yields $\Gamma_{ss} = \infty$, while $1 > \beta > 1 - \Phi^2$ yields a negative $\Gamma_{ss}$. This is not inconsistent with the theory; it merely says that only the $\beta$ such that $0 < \beta < 1 - \Phi^2$ are of any interest. The best value of $\beta$ is

$$\beta = 1 - |\Phi|$$

as this yields a minimum

$$\Gamma_{ss} = \frac{Q}{(1 - |\Phi|)^2}$$

which in this case actually defines the true set of reachable states.
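The scalar example can be checked with a few lines of code. This is a sketch only: it assumes the steady-state bound has the form $\Gamma_{ss} = (1 - \beta)Q / [\beta(1 - \beta - \Phi^2)]$, and the function names and the grid search are illustrative choices, not part of the text.

```python
def gamma_ss(beta, phi, q):
    """Assumed steady-state bound for x(n+1) = phi*x(n) + w(n), |w(n)| <= sqrt(q)."""
    return (1.0 - beta) * q / (beta * (1.0 - beta - phi * phi))

def best_beta_on_grid(phi, q, points=10000):
    """Search 0 < beta < 1 - phi^2 on a uniform grid for the smallest bound."""
    upper = 1.0 - phi * phi
    candidates = [upper * k / points for k in range(1, points)]
    return min(candidates, key=lambda b: gamma_ss(b, phi, q))

def worst_case_state(phi, q, steps=200):
    """Drive the system with the constant extreme input w(n) = sqrt(q)."""
    x = 0.0
    for _ in range(steps):
        x = abs(phi) * x + q ** 0.5
    return x

phi, q = 0.5, 1.0
print("bound at beta = 1 - |phi|:", gamma_ss(1.0 - abs(phi), phi, q))
print("closed form Q/(1-|phi|)^2:", q / (1.0 - abs(phi)) ** 2)
print("grid minimizer of beta:", best_beta_on_grid(phi, q))
print("largest reachable x squared:", worst_case_state(phi, q) ** 2)
```

The last line illustrates why the bound is tight in this scalar case: a constant extreme input drives $x^2$ toward the same value $Q/(1 - |\Phi|)^2$.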

As in Section 4.3.1, note the difference in computational problems between the general functional equation (4.3.6) and the matrix equations of (4.3.7), which give the bounding ellipsoid.

4.3.3 Continuous-Time Dynamic System

A continuous-time unknown-but-bounded process x(t) can be modeled as

$$\frac{d}{dt}x(t) = F(t)x(t) + G(t)w(t)$$

$$\Omega_x(0) = \{x : x'\psi^{-1}x \le 1\}$$

0. Then Py(w" w 2 ) = 2



The "total power" in y(t) can be defined as py =

or by Parseval's theorem,

i I:

y 2(t) dt



4.4.1 White Processes†

Scalar stochastic white processes, w(t), defined by

$$E\{w(t)\} = 0, \qquad E\{w(t)w(\tau)\} = \delta(t - \tau)Q$$

and scalar unknown-but-bounded white processes, w(t), defined by

$$|w(t)| \le Q^{1/2}$$

are discussed. Consider first a stochastic white process. Interest centers on expected values. It can be shown that

$$E\{G_w(\omega)\} = Q$$

so the average total power of w(t) is infinite. This is to be expected as w(t) has infinite variance; i.e.,

$$E\{w^2(t)\} = \delta(t - t)Q$$

Now consider an unknown-but-bounded white process. Interest centers on bounds. There does not exist a finite upper bound on $G_w(\omega)$, as w(t) can be a pure sine wave which yields an impulse. This, of course, does not imply that w(t) can have infinite total power. The total power of an unknown-but-bounded white process is obviously bounded by

$$P_w \le Q$$

The power contained in some narrow range of frequencies, $\omega_1 < \omega < \omega_2$, $|\omega_1 - \omega_2|$ small, is probably bounded by

$$P_w(\omega_1, \omega_2) \le \frac{Q}{2}, \qquad |\omega_1 - \omega_2| \text{ small}$$

as $Q/2$ is the power in $Q^{1/2}\sin(\omega_1 t)$.

4.4.2 Time-Invariant, Stable Dynamic Systems†

Consider a time-invariant, stable linear system with scalar input w(t) and scalar output y(t) defined by

$$\frac{d}{dt}x(t) = Fx(t) + Gw(t)$$

Frequency Domain Discussion


Let $A(j\omega)$ denote the frequency response of the system. Then

$$\mathcal{F}_y(j\omega) = A(j\omega)\mathcal{F}_w(j\omega)$$

$$G_y(\omega) = |A(j\omega)|^2 G_w(\omega)$$

Assume that w(t) is a white stochastic process. Then

$$E\{G_y(\omega)\} = |A(j\omega)|^2 E\{G_w(\omega)\} = |A(j\omega)|^2 Q$$

$$E\{P_y(\omega_1, \omega_2)\} = 2Q\int_{\omega_1}^{\omega_2}|A(j\omega)|^2\,d\omega$$

$$E\{P_y\} = Q\int_{-\infty}^{\infty}|A(j\omega)|^2\,d\omega$$

so that $E\{P_y\}$ is finite. Assume that w(t) is a white unknown-but-bounded process. Then $G_y(\omega)$ is unbounded, as it can contain impulses. For sufficiently small $|\omega_1 - \omega_2|$, it is probably true that

It also follows that $P_y$ is bounded, where the bound depends on the structure of the linear system.

4.4.3 Comparison†

An unknown-but-bounded process can be a sine wave. Thus care must be exercised when working with quantities such as $G_y(\omega)$ as they can contain impulses (i.e., infinities). White stochastic processes have, on the average, an infinite total power. The infinite power of a white stochastic process means that it can never exist in practice. These infinities, however, do not mean that the basic models are meaningless, as both white unknown-but-bounded and stochastic processes have only a finite amount of power in any finite (neither zero nor infinite) band of frequencies. The stochastic model enables one to calculate the average power, while the unknown-but-bounded model enables one to calculate an upper bound on the power in the band. The discussions were given in terms of continuous-time models, but similar results can be obtained for the discrete-time case. In discrete time the annoying complications of infinities do not arise.
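The contrast between infinite input power and finite output power can be illustrated numerically. The sketch below rests on assumptions that are not from the text: a hypothetical first-order system with frequency response $A(j\omega) = 1/(j\omega + a)$, a flat input spectrum of height Q, and normalization constants such as $2\pi$ left out. It integrates over the band $[-\omega_{max}, \omega_{max}]$ by the trapezoidal rule.

```python
import math

def gain_squared(omega, a):
    """|A(j*omega)|^2 for the assumed first-order response A = 1/(j*omega + a)."""
    return 1.0 / (omega * omega + a * a)

def band_integral(func, w_max, steps=200000):
    """Trapezoidal integral of func over [-w_max, w_max]."""
    h = 2.0 * w_max / steps
    total = 0.5 * (func(-w_max) + func(w_max))
    for k in range(1, steps):
        total += func(-w_max + k * h)
    return total * h

a, q = 2.0, 1.0
for w_max in (10.0, 100.0, 1000.0):
    input_power = q * 2.0 * w_max   # flat spectrum: grows without bound with the band
    output_power = q * band_integral(lambda w: gain_squared(w, a), w_max)
    print(w_max, input_power, output_power)
print("limit of the output integral, Q*pi/a:", q * math.pi / a)
```

The input column keeps growing as the band widens, while the output column settles toward $Q\pi/a$, which is the finite-power behavior described for the filtered process.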






Role of White Process Model

The basic dynamic models considered in this chapter are of the form

$$x(n + 1) = \Phi(n)x(n) + G(n)w(n)$$

$$\frac{d}{dt}x(t) = F(t)x(t) + Gw(t)$$

w(n) or w(t): white process (stochastic or unknown but bounded)

The whiteness of the input w(n) or w(t) is what made the analysis easy and made the equations [e.g., (4.1.6)] come out in such a convenient form. This emphasizes the importance of the ideas of Chapter 3 on how to convert "natural" models into the state space-white process model form. Consider a case where some physical system has an input which is an uncertain but nonwhite process. Since white processes have no time structure, one might be tempted to consider a white process as some sort of worst case. Thus in the interest of expediency, one might be tempted to simply replace the actual nonwhite input by a white input of the same size (variance or bound) in the hopes of obtaining a worst case analysis. This logic is valid for unknown-but-bounded models but does not work for stochastic models. For stochastic models it all depends on how the time structure of the nonwhite process "interacts" with the dynamics of the physical system itself.

EXAMPLE: Consider the zero mean Bayesian model

x 1 (n-\- 1) = I'I> 1x 1 (n) -1- w,(n) w,(n -1- 1) = I'I>,ws(n) -1- w(n) flt =

\.P, I < 1 where w,(n) is a stationary process, so from (4.1.8) E(w;(n)}

=C =


_Q I'I>;

Define x(n) = [xtCnrj w,(n)

r(n) = E(x(n)x'(n)}

Assume that x(n) is also a stationary process so



l l',







where from (4.1.8)

r = vr' + [~ Solving for

r 11


E(xf(n)) gives




+ 1x(nl'.)




+ I'.Gw(nt..)



Consider the corresponding $\Gamma(n\Delta)$ and $\Gamma(t)$ and show directly that $\Gamma(n\Delta) \to \Gamma(t)$ as $n \to \infty$, $\Delta \to 0$, $n\Delta \to t$.

4.7. Consider the scalar Bayesian model

$$\frac{d}{dt}x(t) = Fx(t) + w(t)$$

One way to calculate a differential equation which $\Gamma(t)$ satisfies is to use

$$\frac{d}{dt}\Gamma(t) = \frac{d}{dt}E\{x^2(t)\} = 2E\left\{x(t)\frac{d}{dt}x(t)\right\}$$

Carry out the details and compare the results with (4.1.11). Hint: Dirac delta functions are often defined so that

,, u(t -" S

1 -

t 2 )dt 2



= 2

4.8. Consider the unknown-but-bounded model x(n


1) = «l>x(n)

+ b + w(n)

x(O}= 0





EXAMPLE: Consider (;:) = (

R=G '.l.

'',; '

:I • I

; ~!i


~ )x + (:) 'lf=S

mx = 01

Then from (5.1.1)

1:=(1 5C =


r~rz 1


+ frzz

The error uncertainty model (the error covariance matrix $\Sigma$) is independent of z and can be evaluated before the observations are made. This is extremely important in practice as it enables one to do an error analysis to decide how accurate the estimate will be when (if) it is actually calculated. Such error analysis can be very useful in design.





R =

[~ R~J


m_, = 0

Assume that the "meter" which yields z 2 has not yet been designed so that the value of R 22 is still free. Assume that "system specifications" require that the final estimate have an error with variance less than or equal to 1_-. Since accurate meters cost money, it is reasonable to try to find the maximum value of R 22 that is acceptable. For the model,

:E = [!

+ Rz-i + -k)-t

so the maximum allowable R 22 can be obtained by setting I: = R22

± to


= ~

This design can be done before any actual observations are made. Obviously a meter with $R_{22} = 4$ will enable the specifications to be met only if the "best" estimator is used.

5.1.2 Gaussian

The discussions and results of Section 5.1.1 are valid for non-Gaussian as well as Gaussian models. However, if the model is specialized to the Gaussian case so x and v are jointly normal, then $\hat{x}$ given by (5.1.1) or (5.1.2) is also the conditional expectation estimate,

$$\hat{x} = E\{x \mid z\} \tag{5.1.8}$$

This can be proven by rewriting (E.8) of Appendix E as

$$E\{x \mid z\} = m_x + \Gamma_{xz}\Gamma_{zz}^{-1}[z - m_z]$$

$$m_z = E\{z\} = Hm_x$$

$$\Gamma_{zz} = E\{[z - m_z][z - m_z]'\}$$

$$\Gamma_{xz} = E\{[x - m_x][z - m_z]'\}$$

and then noting that

$$\Gamma_{xz} = \psi H', \qquad \Gamma_{zz} = H\psi H' + R$$

so that (5.1.2) yields (5.1.8). This conditional expectation interpretation in the Gaussian case is not needed until Part III, Chapter 9, and is discussed in more detail in Part IV, Chapter 11. The Gaussian case is discussed here mostly for later reference and completeness. The fact that the estimator (5.1.1) or (5.1.2) is optimum without a


Estimation: Static Linear Systems

Gaussian assumption is very important in practice. The Gaussian assumption primarily affects only the interpretation that can be given to the results. This situation is analogous to the discussions in Chapter 4 on $\Gamma(t)$ or $\Gamma(n)$, the covariance matrix of a stochastic process, x(t) or x(n). In the Gaussian case, knowledge of $\Sigma$ enables one to specify the probability distribution of $x - \hat{x}$, i.e.,

$$x - \hat{x} \text{ is } N(0, \Sigma)$$


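Fisher

5.2.1 General

Consider

$$z = Hx + v$$

x: completely unknown

$$E\{v\} = 0, \qquad E\{vv'\} = R$$

Then

The estimate

$$\hat{x} = \Sigma H'R^{-1}z$$

yields the minimum error covariance matrix for all $\hat{x} = Wz$ under the constraint $E\{\hat{x}\} = x$, where

$$\Sigma = E\{[x - \hat{x}][x - \hat{x}]'\} = [H'R^{-1}H]^{-1} \tag{5.2.1}$$

if $\Sigma$ exists.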


For $\Sigma$ to exist, it is necessary that the dimension of z not be smaller than the dimension of x; i.e., $K_2 \ge K_1$. A proof of (5.2.1) is as follows. Consider an estimator

$$\hat{x} = Az + A_0$$

where A and $A_0$ are free. The unbiased constraint

$$E\{\hat{x}\} = x$$

requires that

$$A_0 + AHx = x, \quad \text{all } x$$

Since $A_0$ and A cannot depend on x, it is necessary that

$$A_0 = 0, \qquad AHx = x, \quad \text{all } x \tag{5.2.2}$$

Define

$$\tilde{\Sigma} = E\{[x - Az][x - Az]'\}$$

for any A. Using (5.2.2),

$$\tilde{\Sigma} = ARA'$$

Let

$$A = \Sigma H'R^{-1} + C, \qquad C: \text{unspecified}, \qquad \Sigma = [H'R^{-1}H]^{-1} \tag{5.2.3}$$

Then

$$\tilde{\Sigma} = \Sigma + \Sigma H'C' + CH\Sigma + CRC'$$

Substitution of (5.2.3) into (5.2.2) gives

$$[\Sigma H'R^{-1} + C]Hx = x, \quad \text{all } x$$

so that the unbiased constraint requires $CH = 0$. Thus

$$\tilde{\Sigma} = \Sigma + CRC'$$

Since $CRC'$ is positive semidefinite, $\tilde{\Sigma}$ is minimized by choosing $C = 0$; i.e.,

$$A = \Sigma H'R^{-1}$$

which proves (5.2.1).

EXAMPLE:


Then from (5.2. 1)

L=(l+tl-t=-! ,>; = 4'z1 + fzz For the Fisher model, as in the Bayesian model, the fact that the error covariance matrix I: is independent of z and can be precomputed is of major practical importance. It is necessary to assume that K 2 > K" i.e., that the dimension of z is not smaller than the dimension of x. Mathematically, this condition is


necessary (not sufficient) if $\Sigma = [H'R^{-1}H]^{-1}$ is to exist. The condition is logical from a physical point of view, as x is assumed to be completely unknown. Thus the observation z can be viewed as defining $K_2$ equations in $K_1$ unknowns. If $K_1 > K_2$, a unique solution cannot be expected; i.e., there are many values of x for which $Hx = z$. For $K_1 = K_2$ (and assuming that H is invertible), $\hat{x} = H^{-1}z$, so that $\hat{x}$ is the unique vector for which $H\hat{x} = z$. In this case, the modeling of the uncertainty in v has no effect on the estimate but of course still affects the error covariance matrix $\Sigma$. In the case where $K_1 > K_2$ and $\Sigma$ does not exist, some authors like to invoke the so-called pseudoinverse. Loosely speaking, this approach says to pick the particular value $\hat{x}$ which satisfies $H\hat{x} = z$ and which is of minimum length; i.e., $\hat{x}'\hat{x}$ is minimum. Although the pseudoinverse has nice mathematical properties, the author feels that it cannot be glibly used without careful justification in each case. Any decision to choose a minimum length solution should be made consciously and not solely because it is convenient. Therefore pseudoinverses are not discussed further in this book. As in the Bayesian case, $\hat{x}$ is globally optimum in the sense that $\tilde{\Sigma} - \Sigma$ is positive semidefinite, where $\tilde{\Sigma}$ is the error covariance matrix of any other linear unbiased estimator. Thus, to estimate y,

$$y = Cx$$

one can use

$$\hat{y} = C\hat{x}, \qquad E\{\hat{y}\} = y, \qquad E\{[y - \hat{y}][y - \hat{y}]'\} = C\Sigma C'$$

5.2.2 Gaussian

The discussions and results of Section 5.2.1 are valid for non-Gaussian as well as Gaussian models. However, if the model is specialized to the Gaussian case where v is normal while x is still unknown, the estimate of (5.2.1) is also the maximum likelihood estimate. (See Exercise 5.9.) The maximum likelihood interpretation is not needed until Chapter 9 and is discussed in more detail in Chapter 11. It is mentioned here primarily for the sake of completeness.


Relationship Between Fisher and Bayesian

The estimators resulting from Bayesian and Fisher models are related in various ways, but it is important to emphasize that Bayesian and Fisher models of uncertainty are fundamentally different concepts which, in general, yield different results. This basic point is easy to forget, as, in the rest of this section, the close relationships between Bayesian and Fisher model estimators are discussed.



The estimators for the Bayesian model of Section 5.1 and the Fisher model of Section 5.2 are as follows. From (5.1.1)

$$\hat{x}_{\text{Bayesian}} = [\psi^{-1} + H'R^{-1}H]^{-1}[H'R^{-1}z + \psi^{-1}m_x] \tag{5.3.1}$$

$$\Sigma_{\text{Bayesian}} = [\psi^{-1} + H'R^{-1}H]^{-1}$$

while from (5.2.1)

$$\hat{x}_{\text{Fisher}} = [H'R^{-1}H]^{-1}H'R^{-1}z \tag{5.3.2}$$

$$\Sigma_{\text{Fisher}} = [H'R^{-1}H]^{-1}$$

In both the Bayesian and Fisher models, the $\hat{x}$ are unbiased but in different ways: $\hat{x}_{\text{Bayesian}}$ is unbiased in the sense that

$$E\{\hat{x}_{\text{Bayesian}}\} = m_x = E\{x\}$$

where E on the left-hand side denotes expectation over x and v, while E on the right-hand side denotes expectation over just x. This unbiased property arose as a natural consequence of the definition of the "best" estimate. On the other hand, $\hat{x}_{\text{Fisher}}$ is unbiased in the sense that

$$E\{\hat{x}_{\text{Fisher}}\} = x$$

for any value of x where the expectation is over just v. This unbiased property was directly imposed as an important part of the definition of the best estimate. Now consider how to go back and forth between Bayesian and Fisher results. In the Bayesian model, the random vector x is known (before any observations) to have a covariance matrix $\psi$. As $\psi \to \infty I$ (in some sense or other), the a priori distribution of x spreads out, and the a priori knowledge of the value of x becomes more and more uncertain. In the limit, when

$$\psi^{-1} = 0$$

(5.3.1) reduces to (5.3.2).

Thus the Fisher model estimator equations can be derived as a limiting case of the Bayesian estimator. This fact is used in later chapters. The reverse route from Fisher to Bayesian is done by introducing a new measurement. Assume that in a Fisher formulation two observations are made; the original one

$$z = Hx + v$$

and a new one given by

$$z_1 = x + v_1, \qquad E\{v_1\} = 0, \qquad E\{v_1 v_1'\} = \psi, \qquad E\{vv_1'\} = 0$$

These two observations can be combined into one by defining an augmented vector as follows:

$$\bar{z} = \begin{bmatrix} z \\ z_1 \end{bmatrix} = \bar{H}x + \bar{v}, \qquad \bar{H} = \begin{bmatrix} H \\ I \end{bmatrix}, \qquad \bar{v} = \begin{bmatrix} v \\ v_1 \end{bmatrix}, \qquad \bar{R} = E\{\bar{v}\bar{v}'\} = \begin{bmatrix} R & 0 \\ 0 & \psi \end{bmatrix}$$

From the Fisher logic of Section 5.2,

$$\hat{x} = [\bar{H}'\bar{R}^{-1}\bar{H}]^{-1}\bar{H}'\bar{R}^{-1}\bar{z}$$

or

$$\hat{x} = [\psi^{-1} + H'R^{-1}H]^{-1}[H'R^{-1}z + \psi^{-1}z_1] \tag{5.3.4}$$

Comparison of (5.3.4) with (5.3.1) shows that the Fisher model estimator with the extra observation $z_1$ equals the Bayesian model estimator if $z_1 = m_x$, the mean value of x (when x is a random vector). However, this equality of estimator equations does not mean that the Fisher and Bayesian models have somehow become equal. In the Bayesian model, $\psi$ is associated with the random vector x. In the Fisher model, $\psi$ is associated with measurement noise $v_1$ of the extra observation $z_1$ and x is still considered to be completely unknown. The point is that a priori measurements made on an unknown vector x can have the same effect on the estimate as the assumption that x is a random vector. An important property of the Bayesian model (5.3.1) is, from (5.1.6),

$$E\{[x - \hat{x}_{\text{Bayesian}}]z'\} = 0$$

However, for the Fisher model (5.3.2), a little manipulation shows that

$$E\{[x - \hat{x}_{\text{Fisher}}]z'\} = -[H'R^{-1}H]^{-1}H'$$

This illustrates once again that despite their interrelationships, Fisher and Bayesian models are different. One mathematical reason for this particular difference is that the proof of (5.1.6) uses (5.1.1), which contains $[H\psi H' + R]^{-1}$, and when $K_2 > K_1$ and $\psi = \infty I$, the inverse does not exist.
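The two routes between the models can be sketched numerically for a scalar x observed through several scalar measurements with a diagonal R. Everything specific here is an assumption for illustration: the function names, the numerical values, and the restriction to scalar x, which reduces each matrix inverse to a division.

```python
def fisher_estimate(h, r, z):
    """Scalar x, diagonal R: x_hat = [H'R^-1 H]^-1 H'R^-1 z and its error variance."""
    information = sum(hk * hk / rk for hk, rk in zip(h, r))
    sigma = 1.0 / information
    x_hat = sigma * sum(hk * zk / rk for hk, rk, zk in zip(h, r, z))
    return x_hat, sigma

def bayes_estimate(h, r, z, psi, m_x):
    """Scalar x with prior mean m_x and prior variance psi, in the form of (5.3.1)."""
    information = 1.0 / psi + sum(hk * hk / rk for hk, rk in zip(h, r))
    sigma = 1.0 / information
    x_hat = sigma * (sum(hk * zk / rk for hk, rk, zk in zip(h, r, z)) + m_x / psi)
    return x_hat, sigma

# Hypothetical numbers: two observations of the same scalar x.
h, r, z = [1.0, 1.0], [1.0, 3.0], [2.0, 4.0]
psi, m_x = 5.0, 0.0

print("Fisher:", fisher_estimate(h, r, z))
print("Bayesian:", bayes_estimate(h, r, z, psi, m_x))
# Route 1: a very large prior variance makes the Bayesian estimate approach the Fisher one.
print("Bayesian with huge psi:", bayes_estimate(h, r, z, 1e12, m_x))
# Route 2: a Fisher model with the extra observation z1 = m_x, noise variance psi.
print("Fisher with extra observation:", fisher_estimate(h + [1.0], r + [psi], z + [m_x]))
```

The last two printed pairs match the Fisher and Bayesian results respectively, which is the numerical counterpart of the limiting argument and of the augmented-observation argument.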





Weighted Least Squares

In a Bayesian version of (5.1) both x and v are modeled as random vectors. In a Fisher model, v is modeled as a random vector but x is considered to be completely unknown. In weighted-least-squares theory, it is important to emphasize that no model for the uncertainty in x or v is used. Weighted-least-squares estimation theory replaces modeling and optimality arguments by the intuitive judgment that given z, a "reasonable" estimate of x would be obtained by choosing the value of x that minimizes

$$J(x) = (z - Hx)'R^{-1}(z - Hx) \tag{5.4.1}$$

where $R^{-1}$ is now a positive definite weighting matrix chosen on the basis of engineering judgment. Let $\hat{x}$ denote the x that minimizes (5.4.1). Then

$$\hat{x} = (H'R^{-1}H)^{-1}H'R^{-1}z$$

$$J(\hat{x}) = z'[R^{-1} - R^{-1}H(H'R^{-1}H)^{-1}H'R^{-1}]z \tag{5.4.2}$$

To prove (5.4.2), manipulate (5.4.1) into the form

$$J(x) = z'[R^{-1} - R^{-1}H\Sigma H'R^{-1}]z + [x - \Sigma H'R^{-1}z]'\Sigma^{-1}[x - \Sigma H'R^{-1}z]$$

$$\Sigma = [H'R^{-1}H]^{-1}$$

Assume that $\Sigma$ exists. Since the second term is nonnegative and is the only term involving x, it is obvious that $\hat{x}$ is as in (5.4.2). A direct minimization of J(x) yields, of course, the same result; i.e.,

$$\frac{\partial J(x)}{\partial x} = -2H'R^{-1}[z - Hx] = 0$$

has a solution $\hat{x}$ given by (5.4.2), providing that $\Sigma$ exists. The estimator of (5.4.2) is identical in form to the Fisher estimator of (5.2.1). The difference between the two approaches is

Fisher: R is a covariance matrix of zero mean observation error v in the model $z = Hx + v$.

Weighted least squares: R is a positive definite weighting matrix in (5.4.1) chosen by judgment.
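As a small illustration of the weighted-least-squares recipe, the sketch below fits a straight line $z_k \approx x_0 + x_1 t_k$ with a diagonal weighting matrix. The two-parameter line model, the function names, and the data are assumptions chosen for the example; the weights play the role of the diagonal entries of $R^{-1}$ and the normal equations $H'R^{-1}H\hat{x} = H'R^{-1}z$ are solved in closed form.

```python
def wls_line_fit(t, z, weights):
    """Minimize sum_k weights[k] * (z[k] - x0 - x1*t[k])**2 via the 2x2 normal equations."""
    s = sum(weights)
    st = sum(w * tk for w, tk in zip(weights, t))
    stt = sum(w * tk * tk for w, tk in zip(weights, t))
    sz = sum(w * zk for w, zk in zip(weights, z))
    stz = sum(w * tk * zk for w, tk, zk in zip(weights, t, z))
    det = s * stt - st * st          # determinant of H'R^-1 H; must be nonzero
    x0 = (stt * sz - st * stz) / det
    x1 = (s * stz - st * sz) / det
    return x0, x1

def cost(x0, x1, t, z, weights):
    """The quadratic criterion J(x) for the line model."""
    return sum(w * (zk - x0 - x1 * tk) ** 2 for w, tk, zk in zip(weights, t, z))

# Hypothetical data and judgment-based weights.
t = [0.0, 1.0, 2.0, 3.0]
z = [1.1, 2.9, 5.2, 6.8]
weights = [1.0, 1.0, 0.5, 0.25]

x0_hat, x1_hat = wls_line_fit(t, z, weights)
print("estimate:", x0_hat, x1_hat)
print("J at the estimate:", cost(x0_hat, x1_hat, t, z, weights))
print("J at a perturbed point:", cost(x0_hat + 0.1, x1_hat - 0.1, t, z, weights))
```

Nothing in the code refers to a probability model; the weights are simply numbers supplied by the user, which is the point of the distinction drawn above.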


In many cases, this difference can be viewed as just a play on words. However, the importance of the difference can be illustrated by the following example. EXAMPLE: Assume that you are a consultant and that a prospective client comes

to you with a problem that you see can be formulated as a z = Hx + v estimation problem. Assume that the client is intelligent but completely ignorant of all probability and/or statistical concepts. You will probably make a sale (of your services) much easier if you use a weighted-least-squares argument, as the client can readily understand what you will do for him and why. You may lose him if you insist on using "big words" such as covariance matrix, random vector, and unbiased.



A weighted-least-squares estimator which has the same form as the Bayesian estimator of (5.1.1) can be obtained by redefining (5.4.1) as

$$J(x) = [z - Hx]'R^{-1}[z - Hx] + [x - m_x]'\psi^{-1}[x - m_x]$$

By definition, weighted-least-squares estimation theory requires no models for the uncertainty in x and v. This has the disadvantage that there is no way to evaluate the performance of the estimate, i.e., to analyze its behavior. In this sense, the weighted-least-squares logic does not provide a complete answer to the basic estimation problem specified in the introduction to this chapter. This is the penalty one pays for not "needing" models for the uncertainty in x and v. In the literature (including some of the author's own papers), the term weighted least squares is often used even when a Fisher or Bayesian model is being assumed. The approach being used in this book was chosen to emphasize that uncertainty models are not an essential part of the weighted-least-squares concept.

5.5 Weighting Patterns

The estimators of Sections 5.1-5.4 are of the form

$$\hat{x} = Wz + W_0 \tag{5.5.1}$$

where the matrix W and the vector $W_0$ specify the estimator. The matrix W is called the weighting pattern matrix. If x is scalar and if $W_0 = 0$,

$$\hat{x} = \sum_{k=1}^{K_2} W_k z_k$$

and the $W_k$, $k = 1, \ldots, K_2$, determine the relative weights (importance) that are given to each observation $z_k$ in obtaining $\hat{x}$. These weights are determined by the "size" of the uncertainties (R, $\psi$) and the structure of H. Consideration of the weighting pattern matrix W can aid in understanding how the estimator works. The author finds that the following informal discussion provides a motivation for why the basic estimator forms of the preceding sections are reasonable. Consider the Fisher model

$$z = Hx + v, \qquad x: \text{scalar}$$

$$H = \begin{bmatrix} H_1 \\ H_2 \end{bmatrix}, \qquad E\{vv'\} = \begin{bmatrix} R_{11} & 0 \\ 0 & R_{22} \end{bmatrix}$$



Unknown-but- Bounded


so

$$W = [H'R^{-1}H]^{-1}\begin{bmatrix} H_1 R_{11}^{-1} & H_2 R_{22}^{-1} \end{bmatrix}$$

In the special case $R_{11} = R_{22}$, W is proportional to $H'$, which, after a little thought, is a very reasonable answer. Similarly, in the special case $H_1 = H_2$, W is proportional to $[R_{11}^{-1} \;\; R_{22}^{-1}]$, which again makes a lot of sense. In general, the estimator matrix W is determined by the interaction of H and R.

EXAMPLE:

Consider the Fisher model z=Hx+v z = [;:]

H =





[:J [~ ~]


w = hl.r -1,-1 Thus although z 2 is a noisier observation than z, (R22 > R 11 ), z 2 receives more weight (W2 > W 1 ) because of the effect of H in determining W. In the above discussion of the Fisher model, x was a scalar and [H'R-'H]- 1 did not effect the ratio of W 1 to W 2 • In the case of vector x

the $[H'R^{-1}H]^{-1}$ term can play a more important role. In the weighted-least-squares approach, $R^{-1}$ was viewed as a weighting matrix. This weighting should not be confused with the weighting pattern matrix W of (5.5.1). The above discussions can also be modified for application to a Bayesian model.
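The interaction of H and R in the weighting pattern can be seen in a short computation. This sketch assumes a scalar x, a diagonal R, and hypothetical numbers chosen so that the second observation is noisier but has the larger gain; the function name is illustrative.

```python
def fisher_weights(h, r):
    """Weighting pattern W = [H'R^-1 H]^-1 H'R^-1 for scalar x and diagonal R."""
    sigma = 1.0 / sum(hk * hk / rk for hk, rk in zip(h, r))
    return [sigma * hk / rk for hk, rk in zip(h, r)]

# Hypothetical values: R22 > R11, but H2 > H1.
h = [1.0, 3.0]
r = [1.0, 2.0]
w = fisher_weights(h, r)
print("weights:", w)
print("W*H (should be 1 for an unbiased estimator):", sum(wk * hk for wk, hk in zip(w, h)))
```

With these numbers the noisier second observation still receives the larger weight, because each weight is proportional to $H_k / R_{kk}$ rather than to $1/R_{kk}$ alone.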




$$z = Hx + v, \qquad x \in \Omega_x, \qquad v \in \Omega_v$$

where $\Omega_x$ and $\Omega_v$ are sets in $K_1$- and $K_2$-dimensional spaces, respectively. The particular sets to be discussed are ellipsoids.



5.6.1 Arbitrary Sets

Since v must lie in a set, it follows that a given observation $z = z_{\text{actual}}$ combines with the set $\Omega_v$ in $K_2$ space to define a new set in $K_1$ space which must contain x. Thus even if x itself is not bounded a priori, the observation $z = z_{\text{actual}}$ specifies a set $\Omega_{x|z_{\text{actual}}}$ which must contain x; i.e., $x \in \Omega_{x|z_{\text{actual}}}$:

$$\Omega_{x|z} = \{x : z - Hx \in \Omega_v\}$$

Consider the two sets, $\Omega_x$ and $\Omega_{x|z}$. Each set contains x. Thus x must lie in their intersection. Let $\Omega_{\text{est}}$ denote this intersection. Then

$$\Omega_{\text{est}} = \Omega_x \cap \Omega_{x|z} \tag{5.6.1}$$

The estimation problem is now conceptually complete. The modeling of the uncertainties implies that, given the observation z, the value of x must lie within $\Omega_{\text{est}}$. Furthermore, $\Omega_{\text{est}}$ is the smallest set which must contain x and which can be calculated from the available information. Thus the set $\Omega_{\text{est}}$ is the "best" estimate set. Note that the estimate of the vector x is defined as a set, not as a vector as in Sections 5.1-5.4. An unknown-but-bounded model does not provide any specific way of determining which vector within $\Omega_{\text{est}}$ is the best estimate of x. Naturally, a reasonable choice for a vector estimate is to define an $\hat{x}$ as the center of $\Omega_{\text{est}}$, where the center can be defined in any convenient way. Such center-defined $\hat{x}$ will often be used in the following, but it must always be remembered that the basic logic leads to an estimate which is a set, not a vector.

EXAMPLE: Consider







nv =



). This can be a big help if an estimate $\hat{x}(N|N)$ is wanted only for some final value of N and not for n = 1, ..., N - 1, as only one matrix inversion at time N is needed (see Section 6.14).

6.3.5 Time-Invariant Models and Steady-State Behavior

The preceding equations apply to time-varying models. Simplifications result when the system models are time-invariant; i.e., when the

0, then $\Sigma(N|N)$ converges to a unique positive definite $\Sigma_\infty$ given by (6.3.7).

If (6.3.7) has a unique, positive definite solution, then $\Sigma_\infty$ must be independent of the initial error covariance matrix $\psi$. The following heuristic argument is presented instead of a formal

Bayesian: Interpretation. Alternative Forms, Properties


mathematical proof of (6.3.8). If $\Sigma_\infty$ is to be unique, it must be independent of the initial condition $\psi$. This occurs if the input uncertainty w(n) "excites" all the states, as then, for large N, the effect of w(n) has to dominate the initial uncertainty, x(0). The conditions of a controllable system and $Q > 0$ guarantee that w(n) excites all the states. If $R = 0$ (see Section 6.9) $\Sigma^{-1}(N|N)$ does not exist, while if the system is not observable, $\Sigma(N|N)$ may "blow up" (some components go to infinity). Hence the conditions of $R > 0$ and an observable system prevent $\Sigma(N|N)$ from going to zero or infinity for large N. Note that (6.3.8) says nothing about necessary conditions. Note also that the stability of the model $\Phi$ does not affect the argument.

EXAMPLE: Consider a scalar time-invariant model



z(n) = x(n)

$$E[v^2(n)] = R$$


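$$\hat{x}(N+1|N+1) = \hat{\Phi}\,\hat{x}(N|N) + K_\infty z(N+1) \qquad (6.3.10)$$

$$\hat{\Phi} = [I - K_\infty H]\Phi$$

$$K_\infty = \Sigma_\infty H'R^{-1}$$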

Equation (6.3.10) is a stable dynamic system if the eigenvalues of $\hat{\Phi}$ are all less than 1 in magnitude. It turns out for this time-invariant case that the filter (6.3.10) is stable if

a I
(N)x(N IN) +


$$\hat{x}(0|0) = m$$

where $\Sigma$ is as in (6.2.5). In a feedback form, the estimator is

$$\hat{x}(N+1|N+1) = \Phi(N)\hat{x}(N|N) + B(N)u(N) + K(N+1)\{z(N+1) - H(N+1)[\Phi(N)\hat{x}(N|N) + B(N)u(N)]\}$$

$$\hat{x}(0|0) = m$$

where the gain K is as in (6.3.1).

6.3.8 Effect of Initial Conditions

The initial conditions for the difference equation defining the estimator for the Bayesian model are

$$\hat{x}(0|0) = E[x(0)], \qquad \Sigma(0|0) = \Psi$$

One of the important aspects of the discussions of Sections 6.3.4 and 6.3.5 is that if the filter becomes a time-invariant system that is stable and/or if the filter is stable, the effects of the initial conditions decay with time.

6.3.9 Discussion of Bayesian

The error covariance matrix $\Sigma(n|n)$ is obtained by solving a matrix Riccati equation (discrete-time version) which does not depend on the observations. The estimate $\hat{x}(n|n)$ is obtained from a linear [in the observation z(n)] vector difference equation. This recursive form wherein $\hat{x}(N+1|N+1)$ is expressed in terms of $\hat{x}(N|N)$, $\Sigma(N+1|N+1)$, and z(N+1) [and $\Sigma(N+1|N+1)$ in terms of $\Sigma(N|N)$] is extremely useful as the past z(n), n = 1, ..., do not have to be stored. In fact, if the basic argument of Section 6.2.2 did not actually yield the optimum, the equations would still often be used because of their practical value, i.e., ease of implementation. The global optimality of $\hat{x}(N|N)$ has not been explicitly demonstrated, but it holds by direct analogy with Chapter 5. To be specific, $\hat{x}(N|N)$ yields the minimum error covariance matrix of any linear estimator in the sense that $\Sigma(N|N)$

$K_2$), $H'R^{-1}H$ is singular and $\Sigma(1|1)$ does not exist. This difficulty is solved by waiting until m observations have been made and forming a single vector from z(1) ··· z(m), which is then treated as one observation. If m is large enough, this yields a $\Sigma(m|m)$ which exists (usually the condition $mK_2 > K_1$ suffices). For n > m, (6.2.5), etc., can then be applied directly. Thus the only problem that arises when x(0) is completely unknown is the "start up" of the estimation equations. The details when $K_1 > K_2$ are messy but straightforward. The matrix manipulation ideas of Section 6.1 are helpful.

6.4.2 Unknown Input†

Consider (6.1), where x(0) and w(n), n = 1, ..., are completely unknown, while v(n) is a white stochastic process. Such a problem can make physical sense if the dimension of the observation z exceeds the dimension of the completely unknown input w(n). If a Fisher model with unknown input makes sense, a solution can be found from the Bayesian results by letting $Q(n) \to \infty I$. To illustrate how to proceed, consider the equation for $\Sigma(N+1|N)$ of (6.2.5). Using the matrix identity (A.3) in Appendix A, this can be rewritten as

$$\Sigma^{-1}(N+1|N) = [\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1} - [\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}G(N)\{Q^{-1}(N) + G'(N)[\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}G(N)\}^{-1}G'(N)[\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}$$

Letting $Q(N) \to \infty I$ then gives

$$\Sigma^{-1}(N+1|N) = [\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1} - [\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}G(N)\{G'(N)[\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}G(N)\}^{-1}G'(N)[\Phi(N)\Sigma(N|N)\Phi'(N)]^{-1}$$

Note that $\Sigma(N+1|N)$ does not exist since

$$G'(N)\Sigma^{-1}(N+1|N) = 0, \qquad \Sigma^{-1}(N+1|N)G(N) = 0$$

It is, of course, assumed here that the resulting $\Sigma(N+1|N+1)$ exists. If it does not exist, more manipulation is required. The estimate $\hat{x}(N|N)$ given by (6.4.1) is an unbiased, minimum variance estimate (in the sense of Section 5.2). Thus

$$E\{\hat{x}(N|N)\} = x(N)$$

$$E\{[x(N) - \hat{x}(N|N)][x(N) - \hat{x}(N|N)]'\} = \Sigma(N|N)$$

where the expectation is over v(n), n = 1, ..., N.


Estimation: Discrete-Time Linear Dynamic Systems

In certain applications it is sometimes desired to obtain an estimate of w(N) itself. Let $\hat{w}(N|N+1)$ denote the estimate of w(N) made from z(1) ··· z(N+1). It is obvious from (6.1) that z(1) ··· z(N) provides no information on w(N), as w(N) is completely unknown and is "first observed" by z(N+1). From (6.1), one can write

$$z(N+1) = H(N+1)[\Phi(N)x(N) + G(N)w(N)] + v(N+1)$$

Employing arguments such as those used in Section 6.2.2, write $\hat{x}(N|N)$ as $\hat{x}(N|N) = x(N) - \delta x(N|N)$. The estimate $\hat{w}(N|N+1)$ can be calculated by combining z(N+1) and $\hat{x}(N|N)$ into a single vector observation of the "unknown" vector formed by combining x(N) and w(N). The condition that the dimension of z exceeds the dimension of w can often occur. A prime example is the case where both the input and output of a system are observed, as discussed in Section 3.2.15.

6.4.3 Discussion of Fisher

Given the filtered estimate $\hat{x}(N|N)$, a "good" one-step prediction is given by

$$\hat{x}(N+1|N) = \Phi(N)\hat{x}(N|N)$$

The corresponding error covariance matrix $\Sigma(N+1|N)$, of course, does not exist when the input w(n) is unknown, as $\Sigma^{-1}(N+1|N)$ of (6.4.1) does not have an inverse. This is as expected from the nature of the model. For a time-invariant system model, a steady-state filter may result, just as in the Bayesian case. The discussions of Section 6.3.5 still apply, so they are not repeated. The derivation of the estimator for the Fisher model as a limit of the Bayesian model estimator is based on the concepts of Section 5.3. Section 5.3 also states that the properties of $\hat{x}(N|N)$ for a Fisher model are different from those of $\hat{x}(N|N)$ for a Bayesian model even in the limit. The use of the limit argument does lead to an interesting question. Can one simply use the Bayesian model equations with $\Psi$ and/or Q(n) chosen to be very large? In practice the answer may be yes. In fact it is the author's opinion (based on intuition, not experience) that the possibility of obtaining a Fisher model estimator by using Bayesian equations with large $\Psi$ and/or Q(n) should always be considered as an alternative to the explicit equations which were derived. With this approach, it will be necessary to determine "large enough" values for $\Psi$ and/or Q(n) by numerical studies wherein the values of $\Psi$ and Q(n) are increased until the filter's input-output behavior becomes independent of $\Psi$ and/or Q(n). A Gaussian assumption was not used. If the world is specialized to the Gaussian case, the estimates become maximum likelihood estimates (see Section 5.2.2).
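The numerical study suggested above (increase $\Psi$ until the filter output stops changing) can be sketched in a few lines. The following is only an illustration under stated assumptions: a hypothetical two-state time-invariant model with Q = 0 and unknown x(0), a Bayesian filter written in the standard feedback form with $\hat{x}(0|0) = 0$, and a batch weighted-least-squares solution standing in for the Fisher estimate. The function names and all numerical values are invented for the example and do not come from the text.

```python
import numpy as np

def bayes_filter(zs, Phi, H, R, Psi):
    """Bayesian filter with Q = 0, x_hat(0|0) = 0, Sigma(0|0) = Psi (illustrative)."""
    x = np.zeros(Phi.shape[0])
    S = Psi.copy()
    for z in zs:
        # one-step prediction (no driving input because Q = 0)
        x = Phi @ x
        S = Phi @ S @ Phi.T
        # measurement update in feedback form
        K = S @ H.T @ np.linalg.inv(H @ S @ H.T + R)
        x = x + K @ (z - H @ x)
        S = S - K @ H @ S
    return x

def fisher_batch(zs, Phi, H, R):
    """Fisher-type estimate of x(N): weighted least squares, x(0) completely unknown."""
    N = len(zs)
    Rinv = np.linalg.inv(R)
    Phi_inv = np.linalg.inv(Phi)
    M = np.zeros_like(Phi)
    b = np.zeros(Phi.shape[0])
    for n, z in enumerate(zs, start=1):
        # with Q = 0, x(n) = Phi^(n-N) x(N)
        A = H @ np.linalg.matrix_power(Phi_inv, N - n)
        M += A.T @ Rinv @ A
        b += A.T @ Rinv @ z
    return np.linalg.solve(M, b)

# Hypothetical position/velocity model; every number here is an assumption.
Phi = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
rng = np.random.default_rng(0)
x = np.array([1.0, -0.3])
zs = []
for _ in range(12):
    x = Phi @ x
    zs.append(H @ x + rng.normal(0.0, np.sqrt(0.5), size=1))

x_fisher = fisher_batch(zs, Phi, H, R)
scales = [1e0, 1e2, 1e4, 1e6]
gaps = [np.linalg.norm(bayes_filter(zs, Phi, H, R, c * np.eye(2)) - x_fisher)
        for c in scales]
for c, g in zip(scales, gaps):
    print(f"Psi = {c:g} I  ->  |x_bayes - x_fisher| = {g:.2e}")
```

In this sketch the gap shrinks as $\Psi$ grows, which is the sense in which a "large enough" $\Psi$ is found numerically. Very large values of $\Psi$ can eventually cause round-off trouble in the covariance update, so the search is best stopped once the output has settled.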


6.5 Weighted Least Squares: Filtering

The weighted-least-squares estimation logic of Section 5.4 is now applied to the dynamic case.

6.5.1 General Case

For the static model of Section 5.4, the estimator was defined by: Choose the value of x which minimizes

$$J(x) = [z - Hx]'R^{-1}[z - Hx]$$

where R is a positive definite matrix chosen by engineering judgement. For the dynamic model of (6.1), the corresponding estimator is defined by: Choose the x(N) and w(n), n = 0, ..., N - 1, which minimize

$$J[x(N), w(0) \cdots w(N-1)] = \sum_{n=1}^{N}[z(n) - H(n)x(n)]'R^{-1}(n)[z(n) - H(n)x(n)] + \sum_{n=0}^{N-1}w'(n)Q^{-1}(n)w(n) + x'(0)\Psi^{-1}x(0)$$

subject to the constraint that

$$x(n+1) = \Phi(n)x(n) + G(n)w(n)$$

where R(n), Q(n), and $\Psi$ are positive definite matrices chosen by engineering judgement. Let $\hat{x}(N|N)$ denote the resulting value of x(N). If one actually performs the minimization using Lagrange multipliers, the resulting equations for $\hat{x}(N|N)$ are the same as those of Sections 6.2 and 6.3.

6.5.2 Exponential Smoothing

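A special case of (6.5.1) is called exponential smoothing. For exponential smoothing, R(n), $\Psi$, and Q(n) in (6.5.1) are chosen to be

$$R^{-1}(n) = e^{-\alpha(N-n)}I, \quad n = 1, \ldots, N; \qquad Q(n) = 0; \qquad \Psi^{-1} = 0; \qquad \alpha > 0$$

so (6.5.1) becomes: Choose x(N) which minimizes

$$J[x(N)] = \sum_{n=1}^{N}e^{-\alpha(N-n)}[z(n) - Hx(n)]'[z(n) - Hx(n)]$$

subject to $x(n+1) = \Phi x(n)$. The minimizing $\hat{x}(N|N)$ can be computed recursively from

$$\hat{x}(n|n) = \tilde{\Sigma}(n|n)\left[e^{-\alpha}\Phi'^{-1}\tilde{\Sigma}^{-1}(n-1|n-1)\hat{x}(n-1|n-1) + H'z(n)\right]$$

$$\tilde{\Sigma}(n|n) = \left[e^{-\alpha}\Phi'^{-1}\tilde{\Sigma}^{-1}(n-1|n-1)\Phi^{-1} + H'H\right]^{-1}$$

$$\hat{x}(0|0) = 0$$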

which is in a form analogous to (6.2.5). After more manipulation a feedback form analogous to (6.3.1) can be obtained:

$$\hat{x}(N+1|N+1) = \Phi\hat{x}(N|N) + K(N+1)[z(N+1) - H\Phi\hat{x}(N|N)]$$

$$K(N+1) = \tilde{\Sigma}(N+1|N+1)H'$$

It should be emphasized that $\tilde{\Sigma}(N+1|N+1)$ in (6.5.5) cannot be interpreted as an error covariance matrix, as exponential smoothing is a weighted-least-squares formulation.

EXAMPLE: Consider a scalar case where (6.5.3) becomes


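$$J[x(N)] = \sum_{n=1}^{N}e^{-\alpha(N-n)}[z(n) - x(n)]^2$$

$$x(n+1) = \Phi x(n)$$

Then from (6.5.5)

$$\hat{x}(N+1|N+1) = \Phi\hat{x}(N|N) + K(N+1)[z(N+1) - \Phi\hat{x}(N|N)]$$

$$K(N+1) = \tilde{\Sigma}(N+1|N+1)$$

$$\tilde{\Sigma}(N+1|N+1) = \frac{\Phi^2\tilde{\Sigma}(N|N)}{\Phi^2\tilde{\Sigma}(N|N) + e^{-\alpha}}$$

Assume that steady state is reached so that

$$\tilde{\Sigma}(N+1|N+1) = \tilde{\Sigma}(N|N) = \tilde{\Sigma}_\infty$$

Then

$$\tilde{\Sigma}_\infty = \frac{\Phi^2 - e^{-\alpha}}{\Phi^2} \quad \text{or} \quad \tilde{\Sigma}_\infty = 0$$

If $e^{-\alpha} < \Phi^2$, then the nonzero $\tilde{\Sigma}_\infty$ will be obtained. It is interesting to compare this example with the example of Section 6.3.5. The motivation for using exponential smoothing is discussed in Chapter 8; see especially Section 8.2.3. The reader can actually read most of Chapter 8 after finishing this chapter, so a brief jump ahead might be appropriate if motivation is of any interest.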
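The scalar exponential-smoothing recursion of this example is short enough to check numerically. The sketch below is an illustration only: it assumes the scalar model x(n+1) = $\Phi$x(n) with H = 1, starts from zero information (the $\Psi^{-1} = 0$ choice), and compares the recursive estimate with a direct minimization of the exponentially weighted criterion. The function names and the test signal are made up for the example.

```python
import math

def exp_smoothing_recursive(zs, phi, alpha):
    """Feedback-form scalar exponential smoothing; returns (x_hat, sigma_tilde)."""
    x_hat = 0.0
    info = 0.0  # 1 / sigma_tilde, starting from Psi^{-1} = 0
    sigma = float("inf")
    for z in zs:
        # sigma_new = phi^2 sigma / (phi^2 sigma + e^{-alpha}), written in inverse form
        info = math.exp(-alpha) * info / phi**2 + 1.0
        sigma = 1.0 / info
        # gain K equals sigma_tilde in the scalar H = 1 case
        x_hat = phi * x_hat + sigma * (z - phi * x_hat)
    return x_hat, sigma

def exp_smoothing_batch(zs, phi, alpha):
    """Direct minimizer of sum_n e^{-alpha (N-n)} [z(n) - phi^(n-N) x(N)]^2."""
    N = len(zs)
    num = den = 0.0
    for n, z in enumerate(zs, start=1):
        w = math.exp(-alpha * (N - n))
        g = phi ** (n - N)
        num += w * g * z
        den += w * g * g
    return num / den

# Hypothetical values chosen only for illustration.
phi, alpha = 1.02, 0.2
zs = [phi**n + 0.1 * math.sin(n) for n in range(1, 201)]

x_rec, sigma_final = exp_smoothing_recursive(zs, phi, alpha)
x_batch = exp_smoothing_batch(zs, phi, alpha)
sigma_inf = (phi**2 - math.exp(-alpha)) / phi**2
print(x_rec, x_batch, sigma_final, sigma_inf)
```

Under these assumptions the recursion reproduces the batch minimizer, and the quantity playing the role of $\tilde{\Sigma}$ settles at the nonzero steady-state value because $e^{-\alpha} < \Phi^2$ for the chosen numbers.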



Note that $J[x(N)]$ of (6.5.3) could be generalized to a form more like (6.5.1), i.e., with $Q \neq 0$. This was not done because it complicates the mathematics and, in the author's opinion, has limited application.



6.6 Unknown-but-Bounded: Filtering

Unknown-but-bounded models are now considered for (6.1),

$$x(n+1) = \Phi(n)x(n) + G(n)w(n), \qquad z(n) = H(n)x(n) + v(n)$$

where w(n) and v(n) are white unknown-but-bounded processes and x(0) is an unknown-but-bounded vector such that

(N)x(NIN) + K(N + 1) x [z(N + I ) - H(N + l)fi>(N)~(NIN)} Eb(N +liN+ I)= f.(N +liN)- t(N + liN)H'(N + 1) x [R(N + 1) + H(N + l)f.(N + liN) X H'(N + l)}- 1H(N + J)t(N +liN) K(N + l) = 'f.b(N + 'E.b(N


+ l)H'(N +

l)R - 1 (N

+ I)

+ 11 N + 1)


t(N +liN)

o (N + 2



as in (6.6.4)

+ I) ft(N)

i(O) = 0 I:(O I 0)

= "'

The feedback implementation of (6.6.5) is illustrated in block diagram form in Fig. 6.6.3. Equations (6.6.4) and (6.6.5) are "like" the Bayesian model equations (6.2.5) and (6.3.1) except for the $\beta(N)$, $\rho(N+1)$, and $\delta^2(N+1)$ terms. Compare Fig. 6.6.3 with Fig. 6.3.3. The discussions in Sections 4.3.4 and 5.6.2 on the choice of the free parameters $\beta(N)$ and $\rho(N+1)$ apply equally here: The $\beta(N)$ and $\rho(N+1)$ can be prespecified or special logics can be developed so that they are computed as a function of $\hat{x}(N|N)$ and $\Sigma(N|N)$. The $\delta^2(N+1)$ term depends on the observations, so $\Sigma(N+1|N+1)$ is a function of the observations. For the Bayesian model, the corresponding $\Sigma(N+1|N+1)$ is independent of the observations. Thus the unknown-but-bounded estimator, viewed as a system with input z(N+1) and output $\hat{x}(N+1|N+1)$, is a time-varying nonlinear system. For the Bayesian model, the corresponding estimator is a time-varying linear system.

Figure 6.6.3 Unknown-but-bounded discrete-time filter. (Block diagram with blocks "Gain Calculation" and "Model of Dynamics"; labeled signals z(N+1), K(N+1), $\Sigma(N+1|N+1)$, $\hat{x}(N+1|N)$, and H(N+1).)

6.6.3 Bounding Ellipsoid: Special Cases†

The fact that in (6.6.4) or (6.6.5) the estimate $\hat{x}(N|N)$ is a nonlinear function of the observation z(1) ··· z(N + 1) is not in itself undesirable. However, the fact that $\Sigma(N|N)$ depends on the observation means that it cannot be precomputed, so all of (6.6.4) [or (6.6.5)] must be computed online. Furthermore, since $\Sigma(N|N)$ determines the "size" of the ellipsoid, it specifies "how well" the estimator is working, and the observation dependence means that error analysis cannot be done before the observations are made. Both these problems can be solved by considering a "bounding" bounding ellipsoid $\Omega_{x,b}(N|N)$ with $\hat{x}_b(N|N)$ and $\Sigma_b(N|N)$ as defined by

$$\Omega_{x,b}(N+1|N+1) = \{x : [x - \hat{x}_b(N+1|N+1)]'\,\Sigma_b^{-1}(N+1|N+1)\,[x - \hat{x}_b(N+1|N+1)] \le 1\}$$

with $\hat{x}_b(N+1|N+1)$ and $\Sigma_b(N+1|N+1)$ given by (6.6.4) or (6.6.5) with $\delta^2(N+1) = 0$.

It is felt to be obvious that

$$\Omega_{x,b}(N|N) \supset \Omega_x(N|N)$$

so that (6.6.6) defines an ellipsoid which always contains the state x(N). Equation (6.6.6) is a linear estimator in that the ellipsoid center $\hat{x}_b(N|N)$ is a linear function of the observations z(1) ··· z(N). More important, the matrix $\Sigma_b(N|N)$ which specifies the size of the ellipsoid can be precomputed by solving a matrix Riccati equation which is independent of the observations. The idea of setting $\delta^2(n+1) = 0$ is closely related to the ideas of Section 5.6. The $\Sigma_b(N|N)$ of (6.6.6) can also be used as a precomputable error analysis for the nonlinear estimate (6.6.4) or (6.6.5) as $\Sigma_b(N|N) \ge \Sigma(N|N)$ provided that the same values for $\rho(N+1)$ and $\beta(N)$ are used; i.e., the $\rho(N+1)$ and $\beta(N)$ have to be prespecified.








A special case of (6.6.4) or (6.6.5) of interest occurs when the $\beta(N)$ and $\rho(N+1)$ are chosen in a special way. Let $\bar{\beta}(N)$ and $\bar{\rho}(N+1)$ denote some prespecified set of values used to determine a bounding ellipsoid as in (6.6.6). Let $\beta(N)$ and $\rho(N+1)$ denote the values to be used in (6.6.4), where

iJ(N) = I -[I



+ - 1q(N + I)= [l -


~(~tN)]q(N) +

. p(N I) [I - PCN)][l - p(N PCN)][l -p(N

+ [z(N + 1) -



+ l)]q(N)


+ 1)4J(N)i.(N IN)]'

X [H(N + l)f(N + IIN)H'(N +I)+ R(N +

1)}:- 1

$$\times\,[z(N+1) - H(N+1)\Phi(N)\hat{x}(N|N)]$$

It turns out that for these $\beta(N)$ and $\rho(N+1)$

$$\Sigma(N+1|N+1) = [1 - q(N+1)]\Sigma_b(N+1|N+1), \qquad \hat{x}(N+1|N+1) = \hat{x}_b(N+1|N+1)$$

Therefore the ellipsoid

$$\tilde{\Omega}_{x,b}(N+1|N+1) = \{x : [x - \hat{x}_b(N+1|N+1)]'\,\tilde{\Sigma}_b^{-1}(N+1|N+1)\,[x - \hat{x}_b(N+1|N+1)] \le 1\} \qquad (6.6.7)$$

with $\hat{x}_b(N+1|N+1)$ and $\Sigma_b(N+1|N+1)$ given by (6.6.4) or (6.6.5) with $\delta^2(N+1) = 0$ and

$$\tilde{\Sigma}_b(N+1|N+1) = [1 - q(N+1)]\Sigma_b(N+1|N+1)$$

is such that

$$\Omega_{x,b}(N|N) \supset \tilde{\Omega}_{x,b}(N|N) \supset \Omega_x(N|N)$$

The ellipsoid $\tilde{\Omega}_{x,b}(N|N)$ of (6.6.7) has the same center as $\Omega_{x,b}(N|N)$ of (6.6.6) but is smaller (or the same). Hence (6.6.7) constitutes an improvement over (6.6.6) which does not cost much in the way of computation, as q(N) is relatively easy to obtain.

6.6.4 Time-Invariant Models and Steady-State Behavior

Consider a time-invariant model so that the $\Phi$, G, H, Q, and R matrices are constant. Even if constant values $\rho$ and $\beta$ are used for the free parameters $\rho(N+1)$ and $\beta(N)$, a steady-state concept corresponding to the $\Sigma_\infty$ of the Bayesian-Fisher model does not occur for the bounding ellipsoids of Section 6.6.2, as $\Sigma(N|N)$ depends on the observations and thus does not go to any steady-state value. Therefore it is still necessary to implement equations such as (6.6.4) and (6.6.5), which can be viewed as a time-invariant nonlinear system with input z(n), n = 1, ..., N, and output $\hat{x}(N|N)$. A linear time-invariant estimator can be obtained by removing the $\Sigma$ dependence on the observations by setting $\delta^2(N+1) = 0$ to get the bounding ellipsoid of Section 6.6.3. In this case, (6.6.6) yields

$$\hat{x}(N+1|N+1) = \Phi\hat{x}(N|N) + K_\infty\{z(N+1) - H\Phi\hat{x}(N|N)\}$$




+ H'R -•Hr• 1 p)E~(()' + GQG'

E~ =((I - p)E:'(IIO)




I R=-R



Q =


$R_0$, $\Omega_{est}(N)$ exists. For $R = R_0$, $\Omega_{est}(N)$ contains a single vector, and this vector is the vector that minimizes the max $|\delta(n)|$, n = 1, ..., N. Hence the min-max curve-fitting problem can be solved by estimation theory by varying R until the estimate set contains only a single vector. There exists a variety of variations on the above ideas.

6.16.3 Polynomial Curve Fitting

A polynomial curve-fitting problem involves use of (6.16.3) with the fitting functions chosen as powers of n, i.e., $n^k$.

The minimizing $a_k$ of (6.16.4) can be found by solving the Fisher model estimation problem

$$z(n) = y(n) + v(n), \qquad y(n) = H(n)x$$

$$H'(n) = \begin{bmatrix} 1 \\ n \\ n^2 \\ \vdots \\ n^K \end{bmatrix}$$

Let $\hat{x}(N|N)$ denote the Fisher estimate of x using z(1) ··· z(N). Then

$$\hat{x}(N|N) = \left[\sum_{n=1}^{N}H'(n)H(n)\right]^{-1}\sum_{n=1}^{N}H'(n)z(n)$$

$$\Sigma_x(N|N) = E\{[x - \hat{x}(N|N)][x - \hat{x}(N|N)]'\} = R\left[\sum_{n=1}^{N}H'(n)H(n)\right]^{-1}$$

$$\hat{y}(n|N) = H(n)\hat{x}(N|N)$$

$$\Sigma_y(n|N) = E\{[y(n) - \hat{y}(n|N)]^2\}$$
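The Fisher estimate above is an ordinary least-squares fit, so it can be sketched directly. The following is a minimal illustration, assuming a scalar noise variance R and rows H(n) = [1, n, ..., n^K]; the function name `poly_fisher_fit` and the sample data are hypothetical. The variance of the fitted curve is computed as $H(n)\Sigma_x(N|N)H'(n)$, which follows from $\hat{y}(n|N) = H(n)\hat{x}(N|N)$.

```python
import numpy as np

def poly_fisher_fit(z, K, R):
    """Least-squares (Fisher) fit of a degree-K polynomial to z(1), ..., z(N)."""
    N = len(z)
    n = np.arange(1, N + 1)
    # rows are H(n) = [1, n, n^2, ..., n^K]
    H = np.vander(n, K + 1, increasing=True).astype(float)
    info = H.T @ H                       # sum over n of H'(n) H(n)
    x_hat = np.linalg.solve(info, H.T @ z)
    cov_x = R * np.linalg.inv(info)      # Sigma_x(N|N)
    # error variance of the fitted curve at each n: H(n) Sigma_x H'(n)
    var_y = np.einsum("ij,jk,ik->i", H, cov_x, H)
    return x_hat, cov_x, var_y

# Noise-free hypothetical data from 2 - 0.5 n + 0.1 n^2, so the fit should recover it.
n = np.arange(1, 11)
z = 2.0 - 0.5 * n + 0.1 * n**2
x_hat, cov_x, var_y = poly_fisher_fit(z, K=2, R=1.0)
print(x_hat)
```

For high degrees or long records the matrix being inverted becomes poorly conditioned, which is one practical motivation for the orthonormal polynomials introduced next.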

The matrix inversion associated with $\Sigma_x(N|N)$ can be accomplished by introducing orthonormal polynomials $\gamma_k(n, N)$, k = 0, ..., K, n = 1, ..., N. These polynomials have the following properties:


J=k Such orthonormal polynomials can be generated by the following formulas: I (

r.(n, N) = k! N(N2 -

(k+ 1(n, N) = (2k


2k I )·' ;2 I) ... (N2 -F) ~k(n, N)

+ 1)(2n- k -

l)~k(n, N)


Curve Fitting




0, I, 2,

Yo(n, N)


I ) (N



Yz(n, N)




)';z[·n- N

y,(n,N) = ( N(Nz- 1).


-~~fNz- 4)Y!Tnz- (N +


+ (N + 1~N + 2)]

The behavior of $\gamma_0(n, N)$ through $\gamma_5(n, N)$ is indicated in Fig. 6.16.1 for the case where N is very large.

Figure 6.16.1 Orthonormal polynomials, $\gamma_k(n, N)$, for large N.

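Now return to the Fisher model estimation problem. The polynomial y(n) can be expressed as

$$y(n) = \sum_{k=0}^{K}b_k\gamma_k(n, N)$$

so

$$z(n) = \tilde{H}(n)\tilde{x} + v(n)$$

$$\tilde{H}'(n) = \begin{bmatrix} \gamma_0(n, N) \\ \vdots \\ \gamma_K(n, N) \end{bmatrix}$$

Because of the orthonormal properties,

$$\sum_{n=1}^{N}\tilde{H}'(n)\tilde{H}(n) = I$$

so that

$$\hat{\tilde{x}}(N|N) = \sum_{n=1}^{N}\tilde{H}'(n)z(n)$$

$$\Sigma_{\tilde{x}}(N|N) = RI$$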

These results in terms of $\tilde{x}$ (i.e., the $b_k$) can now be converted back to x (i.e., the $a_k$) and a wide variety of interesting equations follows. Plots of $\Sigma_y(n|N)$ for K = 0, ..., 5 are presented in Fig. 6.16.2 for the case of large N. Let $W_{K,K}(n, N)$ denote the weighting function for estimating the highest degree coefficient of a polynomial so that

$$\hat{a}_K(N) = \sum_{n=1}^{N}W_{K,K}(n, N)z(n)$$

The shape of these weighting functions can be seen from Fig. 6.16.1, as $W_{K,K}(n, N)$ is proportional to $\gamma_K(n, N)$.

Figure 6.16.2 Error variance of polynomial curve fit.

For many polynomial curve-fitting applications it is more convenient to consider polynomials referenced to the midpoint. Thus instead of (6.16.7), consider

$$z(n) = y(n) + v(n)$$

First degree:

$$y(n) = x_1 + x_2\left(n - \frac{N+1}{2}\right)$$

Second degree:

$$y(n) = x_1 + x_2\left(n - \frac{N+1}{2}\right) + \frac{x_3}{2}\left(n - \frac{N+1}{2}\right)^2$$

Let $\Sigma_x(N|N)$ denote the resulting error covariance matrix. Then

First degree:

$$\Sigma_x(N|N) = R\begin{bmatrix} \dfrac{1}{N} & 0 \\ 0 & \dfrac{12}{N(N^2-1)} \end{bmatrix}$$

Second degree:

$$\Sigma_x(N|N) = R\begin{bmatrix} \dfrac{3(3N^2-7)}{4N(N^2-4)} & 0 & \dfrac{-30}{N(N^2-4)} \\ 0 & \dfrac{12}{N(N^2-1)} & 0 \\ \dfrac{-30}{N(N^2-4)} & 0 & \dfrac{720}{N(N^2-1)(N^2-4)} \end{bmatrix}$$
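The closed-form covariance matrices for the midpoint-referenced polynomials can be checked against a direct numerical inversion. The sketch below is illustrative only: it assumes the midpoint is (N + 1)/2, that the second-degree term enters as $x_3/2$ times the squared offset, and a scalar R; the function names are hypothetical.

```python
import numpy as np

def midpoint_cov(N, degree, R=1.0):
    """Numerical R [sum H'(n) H(n)]^{-1} for polynomials referenced to the midpoint."""
    t = np.arange(1, N + 1) - (N + 1) / 2.0
    cols = [np.ones(N), t]
    if degree == 2:
        cols.append(t**2 / 2.0)   # x_3 multiplies half the squared offset
    H = np.column_stack(cols)
    return R * np.linalg.inv(H.T @ H)

def closed_form_first_degree(N, R=1.0):
    return R * np.diag([1.0 / N, 12.0 / (N * (N**2 - 1))])

def closed_form_second_degree(N, R=1.0):
    a = 3.0 * (3 * N**2 - 7) / (4.0 * N * (N**2 - 4))
    b = -30.0 / (N * (N**2 - 4))
    c = 12.0 / (N * (N**2 - 1))
    d = 720.0 / (N * (N**2 - 1) * (N**2 - 4))
    return R * np.array([[a, 0.0, b], [0.0, c, 0.0], [b, 0.0, d]])

N = 25
err1 = np.max(np.abs(midpoint_cov(N, 1) - closed_form_first_degree(N)))
err2 = np.max(np.abs(midpoint_cov(N, 2) - closed_form_second_degree(N)))
print(err1, err2)
```

Referencing the polynomial to the midpoint is what makes the odd and even coefficients uncorrelated, which shows up as the zero entries in both matrices.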



Many more similar results can be relatively easily derived using the orthonormal polynomials. The preceding equations result from a Fisher model. This Fisher formulation yields the coefficients which minimize the J of (6.16.4). However, care is needed in interpreting the Fisher model result in terms of the original curve-fitting problem. The curve-fitting problem is for arbitrary z(n), n = 1, ..., N, while the Fisher model assumes that z(n) is the sum of a polynomial and a white stochastic process of known variance R.

6.16.4 Adaptive Curve Fitting

Consider the polynomial curve-fitting problem. An adaptive curve-fitting problem is how to use the z(1) ··· z(N) to specify the degree of polynomial, K, to be used. At first thought, one might attempt to find the optimum value K by minimizing J of (6.16.4) over both the $a_k$ and the value of K. However, such a minimization yields K = N - 1, with the corresponding minimum J being zero. Thus something else must be done (unless K = N - 1 is actually desired). To find an optimum value for K (K < N), some modeling of z(n), n = 1, ..., N, is needed. One reasonable approach is as follows. Assume that z(n), n = 1, ..., N, is the sum of some smooth time function g(n), n = 1, ..., N, and a discrete-time white process v(n), n = 1, ..., N:

$$z(n) = g(n) + v(n), \qquad n = 1, \ldots, N$$

$$E\{v(n)\} = 0$$

Further assume that g(n) can be "closely approximated" by a $K_0$th degree polynomial, i.e.,

$$g(n) \approx \sum_{k=0}^{K_0}a_kn^k, \qquad n = 1, \ldots, N$$

but that the values of $K_0$ and the $a_k$ are not known. Then, the value of $K_0$ can be "estimated" by hypothesizing that $K_0$ actually takes on values, $K_0 = 0$, $K_0 = 1$, ..., and then using hypothesis testing techniques to choose one of the hypotheses. The needed hypothesis-testing theory is given in Chapters 9 and 10. When choosing the hypothesized value, one can start with $K_0 = 0$ and increase the hypothesized value of $K_0$ until a hypothesis is accepted. If an upper bound $K_{max}$ on the true value of $K_0$ is available, one can start with the hypothesis that $K_0 = K_{max}$ and "work down" until a hypothesis is rejected. The above concept can obviously be applied to fitting functions which are not polynomials. It can also be applied to problems such as deciding whether it is "better" to fit a given z(1) ··· z(N) by polynomials or Fourier series.

6.16.5 Stochastic Polynomials

The polynomial curve-fitting problem of Section 6.16.4 can be viewed as the Fisher estimation of the state of an undriven dynamic system (Q = 0). The estimation problem resulting when this same dynamic system is driven by a white stochastic process is now discussed. The resulting signal to be estimated is called a stochastic polynomial (for the lack of a better term). A second-degree polynomial curve-fitting problem can be formulated as the Fisher model

$$x(n+1) = \Phi x(n), \qquad z(n) = Hx(n) + v(n)$$

$$\Phi = \begin{bmatrix} 1 & 1 & \tfrac{1}{2} \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}, \qquad H = [1 \quad 0 \quad 0]$$

$$E\{v(n_1)v(n_2)\} = \begin{cases} R, & n_1 = n_2 \\ 0, & n_1 \neq n_2 \end{cases}$$

where

$$Hx(n) = x_1(0) + x_2(0)n + x_3(0)\frac{n^2}{2}$$

A stochastic polynomial curve-fitting problem results when (6.16.10) is modified to the Fisher model (unknown initial conditions)

$$x(n+1) = \Phi x(n) + Gw(n)$$

G~m .

$$E\{w(n_1)w(n_2)\} = \begin{cases} Q, & n_1 = n_2 \\ 0, & n_1 \neq n_2 \end{cases}$$

Some explicit numerical results for this problem obtained using a digital computer are now summarized.





Figure 6.16.3 Weighting function for estimating position $x_1(N)$, N = 25.

The studies were motivated by the problem of using position measurements (z(n)) made on a moving vehicle to estimate the vehicle's position, velocity, and acceleration, where

$$x(N) = \begin{bmatrix} x_1(N) \\ x_2(N) \\ x_3(N) \end{bmatrix} = \begin{bmatrix} \text{position} \\ \text{velocity} \\ \text{acceleration} \end{bmatrix}$$

With this interpretation, the undriven dynamic model of (6.16.10) is a constant acceleration model, while the driven dynamic model of (6.16.11) can be considered a random acceleration model. Let $\hat{x}(N|N)$ denote the estimate using z(1) ··· z(N). Then in terms of weighting patterns

$$\hat{x}_k(N|N) = \sum_{n=1}^{N}W_k(n, N)z(n), \qquad k = 1, 2, 3$$







Figure 6.16.4 Weighting function for estimating velocity $x_2(N)$, N = 25.

Figure 6.16.5 Error variance in position versus filter length.

Figure 6.16.6 Error variance in velocity versus filter length.

The position weighting pattern $W_1(n, N)$ and velocity weighting pattern $W_2(n, N)$ are given in Fig. 6.16.3 and 6.16.4 for N = 25 and various values of $\rho$, where $\rho = Q/R$. Let $\Sigma_k(N|N)$ denote the error variance of $\hat{x}_k(N|N)$, k = 1, 2, 3, so $\Sigma_1(N|N)$ is the error variance of the position estimate, etc. $\Sigma_1(N|N)$, $\Sigma_2(N|N)$, and $\Sigma_3(N|N)$ versus N are plotted in Fig. 6.16.5-6.16.7 for various values of $\rho$. In all these figures, $\rho = 0$ corresponds to results of Section 6.16.3 for a deterministic second-degree polynomial (K = 2). These results illustrate the large differences that can result when a relatively small amount of driving white process is introduced.



:(N)x(NIN) + K(N + l)[z(N + 1) - H(N + l] N 0 , and thus the state is determined exactly: X.(N IN) = x(N) for N > No• For N > N 0 , any gain K(N + 1) can be used. This is called a deadbeat observer. 6.18. Consider a Fisher model of the form z=Hx+v

Xt X12 x2 X23 X=




where H is such that

z N--} IN [XN-I,NJ XN

+ VN

x is unknown, and v is zero mean an

Consider the special case of a time-invariant model (constant F, G, H, Q, and R). In many cases $\Sigma(t|t)$ of (7.2.3) goes to some steady-state value $\Sigma_\infty$ for large t, where $\Sigma_\infty$ is the positive definite solution of

$$0 = F\Sigma_\infty + \Sigma_\infty F' - \Sigma_\infty H'R^{-1}H\Sigma_\infty + GQG'$$

Then for large t, or if $\Psi = \Sigma_\infty$, the estimator becomes a linear time-invariant system with input z(t) (7.2.7). The discussions of Section 6.3 on when $\Sigma_\infty$ exists and is unique also apply here. The discussion in Section 6.3 on stability of the filter also applies to (7.2.3).

EXAMPLE: Consider a scalar time-invariant model

$$\frac{d}{dt}x(t) = Fx(t) + w(t), \qquad z(t) = x(t) + v(t)$$

If steady-state error variances $\Sigma_\infty$ exist, they are given by nonnegative solutions to $0 = 2F\Sigma_\infty - \Sigma_\infty^2/R + Q$, i.e.,

$$\Sigma_\infty = FR \pm \sqrt{F^2R^2 + RQ}$$

Consider three cases.

Case 1: If Q = 1 and R = 1, F arbitrary, then

$$\Sigma_\infty = F + \sqrt{F^2 + 1}$$

and the steady-state filter is

$$\frac{d}{dt}\hat{x}(t|t) = -\sqrt{F^2 + 1}\,\hat{x}(t|t) + \left(F + \sqrt{F^2 + 1}\right)z(t)$$

Case 2: If Q = 0, R = 1, and F < 0, then $\Sigma_\infty = 0$ and the steady-state estimate is $\hat{x}(t|t) = 0$.

Case 3: If Q = 0, R = 1, and F > 0, then $\Sigma_\infty = 2F$ or $\Sigma_\infty = 0$, where the value to be used depends on $\Psi$. $\Sigma_\infty = 0$ results from $\Psi = 0$. The steady-state filter is stable for $\Sigma_\infty = 2F$.

The equations of this section are often called the Kalman filter or the Kalman-Bucy filter.
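The three cases of the scalar example can be reproduced by integrating the variance equation numerically. The sketch below is an illustration under assumptions: it takes the scalar model with G = H = 1, assumes the Riccati equation has the standard scalar form $d\Sigma/dt = 2F\Sigma - \Sigma^2/R + Q$ with $\Sigma(0) = \Psi$, and uses a plain Euler step. The function names and numerical values are hypothetical.

```python
import math

def riccati_steady_state(F, Q, R, psi, dt=1e-3, t_end=20.0):
    """Euler-integrate dSigma/dt = 2 F Sigma - Sigma^2 / R + Q from Sigma(0) = psi."""
    sigma = psi
    for _ in range(int(t_end / dt)):
        sigma += dt * (2.0 * F * sigma - sigma**2 / R + Q)
    return sigma

def closed_form(F, Q, R):
    """Nonnegative root F R + sqrt(F^2 R^2 + R Q) of the steady-state equation."""
    return F * R + math.sqrt(F**2 * R**2 + R * Q)

# Case 1: Q = 1, R = 1 (F = 0.5 chosen arbitrarily for illustration).
case1 = riccati_steady_state(F=0.5, Q=1.0, R=1.0, psi=1.0)
case1_exact = closed_form(0.5, 1.0, 1.0)

# Case 2: Q = 0, R = 1, F < 0: the variance decays to zero.
case2 = riccati_steady_state(F=-0.5, Q=0.0, R=1.0, psi=1.0)

# Case 3: Q = 0, R = 1, F > 0: the limit depends on the initial value psi.
case3_pos = riccati_steady_state(F=0.5, Q=0.0, R=1.0, psi=0.3)   # tends to 2F
case3_zero = riccati_steady_state(F=0.5, Q=0.0, R=1.0, psi=0.0)  # stays at 0

print(case1, case1_exact, case2, case3_pos, case3_zero)
```

Case 3 shows why the text says the value "depends on" the initial variance: any positive starting value is driven to 2F, while an exactly zero starting value never leaves zero.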


7.3 Weighted Least Squares

7.3.1 General Case

A discrete-time weighted-least-squares criterion was given in Section 6.5.1. Application of the discrete to continuous time limit to (6.5.1) yields the following continuous time criterion: Choose the x(T) and w(t), 0 J =

$$\Omega_{x,b}(t|t) \supset \Omega_x(t|t)$$

For a time-invariant model, a steady-state $\Sigma_\infty$ does not exist for (7.5.2) but does exist for the bounding bounding ellipsoid of (7.5.3).


Estimation: Continuous-Time Linear Dynamic Systems

7.6 Weighting Patterns

The estimators have been expressed in terms of differential equations. By analogy with the discrete-time case, these same estimators can often be expressed in terms of their weighting pattern. In terms of its weighting pattern W(t, T) an estimator is expressed as

$$\hat{x}(T|T) = \int_0^T W(t, T)z(t)\,dt + W_0(T)$$

Calculation of such weighting patterns is conceptually straightforward. The differential equation (7.2.3) for a continuous-time Bayesian filter can be rewritten as

$$\frac{d}{dt}\hat{x}(t|t) = \hat{F}(t)\hat{x}(t|t) + \hat{G}(t)z(t)$$

$$\hat{F}(t) = F(t) - \Sigma(t|t)H'(t)R^{-1}(t)H(t)$$

$$\hat{G}(t) = \Sigma(t|t)H'(t)R^{-1}(t)$$

Let $\hat{\Theta}(t_1, t_2)$ denote the fundamental matrix associated with $\hat{F}(t)$. Then it follows that the weighting pattern of (7.6.1) is given by

$$W(t, T) = \hat{\Theta}(T, t)\hat{G}(t)$$


Similar results are obtained for Fisher models, for finite memory filters, for smoothing problems, and for bounding bounding ellipsoids. A weighting pattern is often a useful quantity to "look at," as it provides a "feel" for how the filter behaves. Thus one may want to calculate a weighting pattern such as W(t, T) even though the actual implementation is done using differential equations.

EXAMPLE: Consider the Bayesian model

$$\frac{d}{dt}x(t) = Fx(t) + w(t), \qquad z(t) = x(t) + v(t)$$

with the corresponding steady-state filter given by

$$\frac{d}{dt}\hat{x}(t|t) = -\sqrt{F^2 + 1}\,\hat{x}(t|t) + \left(F + \sqrt{F^2 + 1}\right)z(t)$$

(see example of Section 7.2). Thus

$$\hat{x}(T|T) = \int_0^T W(t, T)z(t)\,dt, \qquad W(t, T) = \left(F + \sqrt{F^2 + 1}\right)e^{-\sqrt{F^2 + 1}\,(T - t)}$$

0. Assume that a record of the past behavior of y(t), say, from -T < t < 0 (where T is positive), is available. Four possible models for the future t > 0 behavior of y(t) are

1. Assume that y(t) is an uncertain constant x which is modeled as a random variable with


$$\text{Mean} = \frac{1}{T}\int_{-T}^{0}y(t)\,dt, \qquad \text{Variance} = \frac{1}{T}\int_{-T}^{0}[y(t) - \text{mean}]^2\,dt$$


2. Assume that y(t) is a white unknown-but-bounded process w(t) defined by Y,;"



.. 1'f{Yt(Zactuat)>O N o d ectswn rizactual) > 0 Note the resemblance between y1(z) of (9.6.4) and the Gaussian log likelihood function !;1(z) of (9:3.8). They both have the same z dependence, yet the decision rules applied to them are different. Consider JC / z = H 1x + v, j = 1, ... , M X E QJ,x = {x :X''If/ 1X < 1} V E

Q 1 ,v =

{v: v'R/ 1V

O Now· Q.-< does not exist if 1 - b 2 < 0 for any admissible p ;• 0 < p < 1. ·Thus comparison of &2 of (9.6.8) with y/z) of (9.6.7) shows that the statement (9.6.2) is true for bounding ellipsoids. 9.6.3 Set-Theoretic Distance Measuret

Assume that M = 2. Let D denote the distance between the two hypotheses, where D is to play the same role for unknown-but-bounded models as the divergence J and Bhattacharyya distance B do for stochastic models. One reasonable way to define D is

$$D = -\ln\frac{V_{int}}{(V_1V_2)^{1/2}}$$

$V_j$: volume of $\Omega_j$, j = 1, 2

$V_{int}$: volume of $\Omega_1 \cap \Omega_2$

Equation (9.6.9) is reasonable in the following sense: If $V_{int} = 0$, one can always choose the correct hypothesis and $V_{int} = 0$ implies that $D = \infty$. If $\Omega_1$ and $\Omega_2$ are identical sets, one can never make a decision and $\Omega_1$ identical to $\Omega_2$ implies that $V_1 = V_2 = V_{int}$, which implies that D = 0. The definition of (9.6.9) is, of course, not the only possibility.

9.6.4 Bounding Ellipsoid: Distance Measures†

The distance measure D of (9.6.9) is studied for



z E !! 1 j= 1,2 0 1 = (z :[z- m1]Tj 1 [z'- m 1] p>O The volume of an ellipsoid defined by r is proportional to some manipulation, D becomes 2D =

Ir 1'' 2 • Thus after

2ln!r'r, +(I- p)-'r 2! - Inl r I I - Jnj r


- Kln(l- [m,- m 2J'[r'r,

+ ( l - p)-'r

2 ]-


[m,- m 2 ]},

I>p>O This unknown-but-bounded distance D bears a close structural resemblance to the Gaussian Bhattacharyya distance B of (9.3.17).

9.7 Discussion

For Bayesian and Fisher models, theorems on the optimality of the use of log likelihood functions as decision functions are available. However, many types of decision functions are possible. Log likelihood decision functions are used in practice because they work and because often nothing better is available. There exists a lot of freedom in the choice of decision rule. The choice depends on the designer's judgment on which types of error are most important. The Bayesian and Fisher models can be confusing because of all the different symbols and formulas. However, in the Gaussian case, the basic decision mechanism is closely related to that of weighted least squares, and weighted least squares provides a very simple interpretation of how the decisions are made. In other words, the basic principle is: Given the observation $z = z_{actual}$, choose the hypothesis (model) which is closest to $z_{actual}$. The nature of unknown-but-bounded models automatically specifies the exact nature of the decision functions and rules. As in Part II there is a very strong structural resemblance between the equations resulting from the Gaussian models and the unknown-but-bounded ellipsoidal models. However, also as in Part II, the behavior of the resulting tests is radically different. This is as expected, as the models are fundamentally different.

Historical Notes and References : Chapter 9

Section 9.3, 9.4 These basic ideas are discussed in many places in many ways. The approach used here is patterned on the Mathematical Statistics references of Book List 2: Cramer [1] and Wilks [1] were actually used. The author was introduced to the use of the Bhattacharyya distance by discussions with T. Kailath; see Kailath [1]; Bhattacharyya [1]. Many different distance measures are compared in Kullback [1].

Section 9.6 To the author's knowledge, hypothesis testing for unknown-but-bounded models is a new and untried technique.

Exercises 9.1. Consider the Bayesian model

P1 = P1 =


Assume that the decision rule (9.3.2) is to be used. a. Evaluate the total probability of error, $P_e$. b. Evaluate the Bhattacharyya distance B and compute an upper bound for $P_e$ using (9.3.16). Comment: To do part (a) you will have to use tables of Gaussian probabilities, which are not in this book. Such tables are widely available.

9.2. Consider

$\mathcal{H}_j$: $z = z_j$, j = 1, 2

$z_j$ has probability density $p_j(z)$

$P_j$: a priori probability that $\mathcal{H}_j$ is true

Define

$\alpha_{j,k}$: probability that $\mathcal{H}_j$ is chosen when $\mathcal{H}_k$ is true

Consider the cost or risk function

$$c = \gamma_1P_2\alpha_{1,2} + \gamma_2P_1\alpha_{2,1}$$

where the $\gamma_j > 0$, j = 1, 2, are chosen to "weight" the different types of errors. Specify a decision function and a decision rule which minimizes c.



9.3. Consider the Bayesian model

JC 1 :z=zl>


z 1 has probability density p 1(z)

p 1 : a priori probability that JC 1 is true

Let P, denote the tota1 probability of error. a. Prove that


P, rj),j = 1, 2. Evaluate·

J[Pt (z)]'[p (z))1-• dz 2

9.4. Derive (9.3.17). 9.5. Consider

m1 =


r 1 = (rx 0

mz =



Evaluate the divergence J and the Bhattacharyya distance B. What happens to these distances as $\alpha \to 0$? Does this make sense? Comment: Consider the models with $\alpha = 0$. Given an observation, ask yourself how you would decide which model applies.


9.6. Consider the case where there is only one hypothesis:


z=zt z 1 : random variable with probability density p 1 (z)

e-z Pt(z) = {



z>O z't(nl!J.)[I;y,l(ntl!nl!J.- A)



+ ll- R (ntl)]-Il) (nll) 1




+I; l>~(ntl)[I;...,,(ntllnll••I

+ A- R (nA))1




l> 2 (ritl)

10.1.2 Equal Correlation Functions

An important special case occurs when the $y_j$ of (10.1.5) have equal correlation functions and the hypotheses differ only in the mean values. This leads to the matched filter concept of communication theory.

Discrete-Time Gaussian Process


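Consider

$$\mathcal{H}_j:\ z(n\Delta) = y_j(n\Delta) + v(n\Delta), \qquad j = 1, \ldots, M$$

$$y_j(n\Delta) = m_j(n\Delta) + \tilde{y}(n\Delta)$$

$m_j(n\Delta)$: deterministic (mean value), $j = 1, \ldots, M$
$\tilde{y}(n\Delta)$: zero mean Gaussian process
$v(n\Delta)$: zero mean white Gaussian process

$$E\{v(n_1\Delta)\,v'(n_2\Delta)\} = \begin{cases} \Delta^{-1}R(n_1\Delta), & n_1 = n_2 \\ 0, & n_1 \neq n_2 \end{cases}$$

$\tilde{y}$, $v$: independent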

Thus the hypotheses differ only with respect to the deterministic process, i.e., their mean values. The zero mean Gaussian processes, }"(n.1) + v(na), are the same for all hypotheses. Define the matrix 3'.[] as the linear operation on y(m-1).+ v(m-1), m = 1, ... , n - I, which yields ·y(n-11 n.1 - .1) so that j/n-1!!1.1 :.._ .1) = m/n-1)

+ 3'Jz(m.1)- mima), m ~

I, ... , n- 1]

li/n-1) = z(n.1) - :J.[z(ma), m = 1, ... , n - 1] - {m/na)- :J.[mima), m = I, ... ; n- I]} Note that S'N is independent of which hypothesis is true. Then manipulation of (10.1.6) yields for 3 and 5'r,I+2

Then application of the discrete to continuous time limit to (10.1.14) and (10.1.15) yields for the xj of (10.2.1) dJ(T) 2R ( T)----;rr- = Ly,, 1 z(TI T) +Ly,z 1 ,(TI T)

- Ly, 1(Ti T) - Ly, 2CTI T)

+ d~(T) + d~(T)


J(O) = 0

dB(T) 4R ( T)(Jf'


. Ly, 1 + 2 (TI T)- Ly,JTI T) - 1:y,2(TI T)

+ 1;df+z(T)


B(O) = 0

In the special case where the $y_j(t)$ have the same covariance functions,

$$8R(T)\frac{dB(T)}{dT} = R(T)\frac{dJ(T)}{dT} = d^2(T)$$

$$d(T) = m_1(T) - m_2(T) - \mathcal{F}_T[m_1(t) - m_2(t),\ 0 < t < T]$$

Once again more explicit equations can be obtained by using the dynamic system models and results of Chapter 7.

10.2.6 Discussion of Continuous Time

The discrete to continuous time limit operation is basically a simple operation, but some confusion arises as a consequence of using a white process model. The mathematics yields a well-defined answer in the limit. The confusion lies in the fact that different ways of taking the limit can yield different answers. (Problems do not exist in the case of equal correlation functions.) However, it is the author's opinion that the differences between the various answers [such as (10.2.5) with or without the bias term] are usually only of academic interest. If one wants to worry about the difference, it is also the author's opinion that (10.2.5) is closest "to the real world" as it is written.

Since the confusion arises because of the white observation process $v(t)$, one is tempted to redo the analysis for the case where $v(t) = 0$ [$R(t) = 0$]. Unfortunately this does not really help, as it leads to the singular detection problem, which means that, mathematically, perfect decisions can be made from arbitrarily short time intervals of observations. This phenomenon follows from the preceding equations, but the details are not given.

10.3 Weighted Least Squares: Minimum Residual Decision Making

Assume that z(t), 0 < hypotheses: X




1. In some applications the special cases of (6.6.6) or (6.6.7) (for discrete time) can be used to reduce "on-line" computation requirements. If (6.6.6) is used, $\delta^2(N + 1)$ is computed for use as a decision function but is not used to update the $\Sigma$ equations. If (6.6.7) is used, $q(N + 1)$ is used as the decision function. It is clear that (6.6.7) always provides a better (or at least not worse) decision function than (6.6.6). Similar discussions apply to (7.5.2), (7.5.3), and (7.5.4).

The distance measure $D$ definition of (9.6.9) should extend to discrete- and continuous-time systems in a straightforward fashion. However, to the author's knowledge, the details have never been worked out.

When comparing unknown-but-bounded and Gaussian model hypothesis testing, there are two important points. First, for unknown-but-bounded models, the filters of Chapters 6 and 7 provide the decision function directly, while for Gaussian models it is necessary to add a squarer and integrator. Second, for unknown-but-bounded models, no decision using $z(1) \cdots z(N)$ may be possible, but alternatively a perfect decision may be possible using only $z(1) \cdots z(n)$ for $n < N$. In this sense, unknown-but-bounded models automatically yield a sequential hypothesis-testing procedure, to be discussed in Section 10.5.

10.5 Sequential Hypothesis Testing

Consider the use of log likelihoods as decision functions for hypotheses involving stochastic models. Following the ideas of Chapter 9, one method to make a decision (in discrete time) is
1. Observe $z(1) \cdots z(N)$.
2. Calculate $\xi_j(N)$, $j = 1, \ldots, M$.
3. Make decision by choosing the "most likely" hypothesis.
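The three steps above can be sketched in a few lines. This is only an illustrative sketch under strong simplifying assumptions that are not from the text: the observations are taken to be independent scalar Gaussian samples with a common known variance, and the hypotheses differ only in a constant mean. The function names and the numerical data are hypothetical.

```python
import math

def log_likelihood(z, mean, var):
    """Log likelihood of independent Gaussian samples z with a common mean and variance."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (zn - mean) ** 2 / var
        for zn in z
    )

def most_likely_hypothesis(z, means, var):
    """Step 2: one log likelihood per hypothesis. Step 3: pick the largest."""
    xi = [log_likelihood(z, m, var) for m in means]
    best = max(range(len(means)), key=lambda j: xi[j])
    return best, xi

# Step 1: hypothetical observations z(1) ... z(N) with N = 5.
z = [0.9, 1.2, 0.7, 1.1, 1.3]
# Hypothetical mean values for M = 3 hypotheses (assumed, not from the text).
means = [0.0, 1.0, 2.0]

decision, xi = most_likely_hypothesis(z, means, var=1.0)
print("chosen hypothesis index:", decision)
print("log likelihoods:", [round(v, 3) for v in xi])
```

For dynamic models the log likelihoods would instead come from the filter residuals of Chapters 6 and 7; the selection step is unchanged.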

However, a more general decision making approach is
1. Observe $z(1)$.
2. Calculate $\xi_j(1)$, $j = 1, \ldots, M$.
3. (a) If one hypothesis is "clearly" true, choose it and stop observing.
   (b) Otherwise make another observation.
4. Calculate (x,)t 1



'(x,)R- 1 [z- h(x,)]


This gain is like the Newton-Raphson gain of (11.7.8) and (11.7.9) except that a term of (11.7.9) has been dropped. As with the Newton-Raphson, many simplifications and variations are possible. For example, a constant (or piecewise constant) gain $B$ can be used where $B$ is evaluated for some $x$. Also implementation in the form of (11.7.10) should be considered. Note that $B(i)$ of (11.7.11) is always positive definite (assuming that the inverse exists).

11.7.3 Simple Gains†

A very simple gain is

$$B(i) = b(i)\,I$$

where the scalars $b(i)$ are chosen in some clever fashion. They may be constant. The effect of using (11.7.12) was illustrated in Fig. 11.7.3. In the special case of (11.7.3) a simple gain is

$$B(i) = b\,D_i^{-1}$$

$$D_i = \text{diagonal}\{H^{(1)\prime}(x_i)\,R^{-1}H^{(1)}(x_i)\}$$

where diagonal means $D_i$ is $H^{(1)\prime}R^{-1}H^{(1)}$ with all off-diagonal terms set to zero. This is classified as a simple gain, as the matrix inversion is easy to perform.

11.7.4 Gradient Methods†

A simple gradient method (also called steepest descent) is as follows:
1. Evaluate $J^{(1)}(x_1)$ for the initial guess $x_1$.

Iterative Methods for Finding a Minimum


2. By a separate iteration or search procedure, find the value of $b$ such that $J(x_2)$ is minimum for $x_2 = x_1 - b\,J^{(1)}(x_1)$.
3. Evaluate J'(x)R-'H'(x,.)R- 1 H(I>(x 1)

Then if

$$B_i < 2S^{-1}(x_i)$$

the iteration (11.7.5) probably converges and an optimum gain is

$$B_i = S^{-1}(x_i)$$

This statement is admittedly vague, but the author has found it to be useful.

11.7.6 Relation to Linear Models†

Consider the linear model problem

$$J(x) = [z - Hx]'R^{-1}[z - Hx]$$

Let $\hat{x}$ minimize $J(x)$. From Chapter 5,

$$\hat{x} = [H'R^{-1}H]^{-1}H'R^{-1}z$$

However, another way to solve for $\hat{x}$ is by an iteration:

$$x_{i+1} = x_i + B_i H'R^{-1}[z - Hx_i] \qquad (11.7.19)$$
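The behavior of the iteration (11.7.19) for different constant gains can be checked numerically. The sketch below assumes a scalar $x$, two observations, and a diagonal $R$; the particular values of $H$, $R$, and $z$ are hypothetical choices made for illustration, not values taken from the text. It compares the gain $(H'R^{-1}H)^{-1}$, a smaller constant gain, and a gain larger than $2(H'R^{-1}H)^{-1}$.

```python
def iterate(z, h, r_diag, gain, x0, steps):
    """Run x_{i+1} = x_i + B * H' R^{-1} (z - H x_i) for scalar x and diagonal R."""
    x = x0
    for _ in range(steps):
        # H' R^{-1} (z - H x) written out component by component.
        correction = sum(hk * (zk - hk * x) / rk for zk, hk, rk in zip(z, h, r_diag))
        x = x + gain * correction
    return x

# Hypothetical model values (assumptions for this sketch).
h = [1.0, 1.0]        # H, a 2 x 1 matrix
r_diag = [1.0, 4.0]   # diagonal of R
z = [2.0, 3.0]        # observation

info = sum(hk * hk / rk for hk, rk in zip(h, r_diag))                  # H' R^{-1} H
x_hat = sum(hk * zk / rk for hk, zk, rk in zip(h, z, r_diag)) / info   # direct solution

one_step = iterate(z, h, r_diag, gain=1.0 / info, x0=-10.0, steps=1)
slow = iterate(z, h, r_diag, gain=0.4, x0=-10.0, steps=20)
diverged = iterate(z, h, r_diag, gain=2.0, x0=-10.0, steps=10)

print("direct solution:", x_hat)
print("gain (H'R^-1 H)^-1, one step:", one_step)
print("gain 0.4, twenty steps:", slow)
print("gain 2.0, ten steps:", diverged)
```

With these values the error is multiplied by $1 - B\,H'R^{-1}H$ at every step, so the iteration converges only when that factor has magnitude less than one.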

It follows from the ideas of Section 11.7.5 that

$$x_i \to \hat{x}$$

The iteration (11.7.19) converges in one step for any $x_0$ if

$$B_0 = (H'R^{-1}H)^{-1}$$

EXAMPLE: Consider the Fisher linear model


· z






vis N(O, R)

[~ ~]

R =

The maximum likelihood estimate x minimizes J(x) = (z- Hx)'R- 1 (z - Hx)

and is given directly by

x = 1zt + -}z 2 However, assume that an iterative solution is attempted:

x,+t =.X,-

B, aaJ(x)l . X



initial guess of .'i;




The best B,


(H'R- 1H)- 1

= 1 and yields


x 1 = 4z 1 + z 2 = 5

However, any constant B, = B, 0 < B


x1 will converge to a constant x~, where x~=(l- ~ s)x~ +B[4z, dz2]


If B


4z 1



x1 does not converge.




for any x 0


x0 :


Estimation: Static Nonlinear Systems

11.7.7 Stopping Rules†

Thus far only the basic iteration process has been considered. A related question is, What stopping rule should be used? In other words, how does one decide that $x_{i+1} - x_i$ is small enough to conclude that $x_i = \hat{x}$? Three obvious stopping rules are as follows:
1. Stop when all components of $x_{i+1} - x_i$ are "small enough."
2. Stop when $J(x_{i+1}) - J(x_i)$ is "small enough."
3. Stop when all components of $J^{(1)}(x_i)$ are "small enough."
The definition of "small enough" is, of course, a matter of engineering judgment. When deciding what is small enough, it is important to remember that $\hat{x} \neq x_{true}$, so there is no reason to iterate until $x_i = \hat{x}$. As long as $x_i$ is "relatively close" to $\hat{x}$, it is probably as good an estimate of $x_{true}$ as $\hat{x}$ itself. This statement is admittedly vague, but the underlying concept is important. In the special case when

$$z = h(x) + v$$

$v$ is $N(0, R)$
$x$: $K_1$ vector
$v$: $K_2$ vector

a different type of stopping rule can sometimes be used. In Section 9.4.4, it was discussed that for a Fisher model where $J(x) = [z - Hx]'R^{-1}[z - Hx]$ and $\hat{x}$ minimizes $J(x)$, then $J(\hat{x})$ is a chi-squared random variable with $K_2 - K_1$ degrees of freedom. If the error $x_{true} - \hat{x}$ is small enough, then this same property also applies approximately for the nonlinear model. Thus a stopping rule can be based directly on the size of $J(x_i)$ itself.

11.7.8 Discussion of Iterative Solutions

The basic iteration (11.7.2) is a powerful tool, but there is no guarantee that $x_i$ will converge to $\hat{x}$. As the simplest counterexample, consider a $J(x)$ which has two local minima (at $\hat{x}$ and $x_0$) as in Fig. 11.7.4. If the initial guess $x_1$ is close to $x_0$, $x_i$ may converge to $x_0$ rather than $\hat{x}$. Even if $x_1$ is close to $\hat{x}$, the gain sequence $B(i)$ may be such that $x_i$ converges to $x_0$. The basic iteration works well in many problems, but it is not perfect and the reader must never assume its success without further study or actual implementation.

The best gain sequence is the one which converges with the least amount of expended computer time. A simple gain sequence such as $B(i) = bI$ reduces the computation time per iteration but may require more iterations (if it converges at all) and hence more overall computation time than a more sophisticated approach. Unfortunately there is no general answer to the question of which gain sequence is best.

The analysis of Section 11.7.5 provides only a guide on how to choose gains. There is still no absolute guarantee of convergence.
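The dependence on the initial guess is easy to reproduce numerically. The following sketch uses a hypothetical scalar cost function with two local minima (chosen for illustration, not taken from Fig. 11.7.4), a constant simple gain $b$, and the first stopping rule of Section 11.7.7. The function names, the gain, and the tolerance are all assumptions.

```python
def cost(x):
    """Hypothetical J(x) with a global minimum near x = -1 and a local minimum near x = +1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def gradient(x):
    """First derivative of the hypothetical J(x)."""
    return 4.0 * x * (x * x - 1.0) + 0.3

def iterate(x, b=0.05, tol=1e-8, max_iter=10000):
    """Constant-gain iteration x_{i+1} = x_i - b * dJ/dx, stopped when the step is small."""
    for _ in range(max_iter):
        x_next = x - b * gradient(x)
        if abs(x_next - x) < tol:   # stopping rule 1: x_{i+1} - x_i "small enough"
            return x_next
        x = x_next
    return x

right = iterate(1.5)    # initial guess near the poorer local minimum
left = iterate(-1.5)    # initial guess near the global minimum

print("start  1.5 -> x =", round(right, 4), " J =", round(cost(right), 4))
print("start -1.5 -> x =", round(left, 4), " J =", round(cost(left), 4))
```

Both runs satisfy the stopping rule, yet only one of them reaches the smaller value of $J$, which is why a stopping rule by itself says nothing about whether the true minimum was found.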





Figure 11.7.4 A case where iterations may not converge to true minimum.



Iterative solutions as in Section 11.7 are one way to obtain an estimate from a function of $x$ such as $p(x \mid z)$, $p(z: x)$, or $J(x: z)$. Such iterative solutions are local in nature and yield only a single vector estimate $\hat{x}$ with no understanding of the overall uncertainty in $\hat{x}$, i.e., in the shape of $p(x \mid z)$, $p(z: x)$, or $J(x: z)$. A more global approach is to approximate the given function of $x$ by some collection of numbers which characterize the overall function. This is called parameterization, as a function of $x$ is being parameterized by a finite set of numbers. Three approaches are considered: expansions, quantizations, and moments.

Consider a Bayesian formulation where the estimator is based on the conditional probability density, $p(x \mid z)$. One convenient parameterization of $p(x \mid z)$ is to expand $p(x \mid z)$ in terms of functions of $x$ and assume that

$$p(x \mid z) \approx \sum_{m=1}^{M} a_m\,\theta_m(x) \qquad (11.8.1)$$

The $\theta_m(x)$, $m = 1, \ldots, M$, are predetermined functions of $x$. The $a_m$, $m = 1, \ldots, M$, are the parameters which characterize $p(x \mid z)$, where the $a_m$ depend on the observation $z$. One reasonable way to obtain the $a_m(z)$ is as the values which minimize (11.8.2). If one chooses the $\theta_m(x)$ to be orthonormal so that

$$\int \cdots \int \theta_{m_1}(x)\,\theta_{m_2}(x)\,dx_1 \cdots dx_{K_1} = \begin{cases} 1, & m_1 = m_2 \\ 0, & m_1 \neq m_2 \end{cases}$$

then the $a_m(z)$ which minimize (11.8.2) are given by

$$a_m(z) = \int \cdots \int p(x \mid z)\,\theta_m(x)\,dx_1 \cdots dx_{K_1}$$

Using (11.8.1), the conditional expectation estimator

$$\hat{x} = \int \cdots \int p(x \mid z)\,x\,dx_1 \cdots dx_{K_1}$$

can be approximated by

$$\hat{x} \approx \sum_{m=1}^{M} a_m(z)\,\xi_m, \qquad \xi_m = \int \cdots \int x\,\theta_m(x)\,dx_1 \cdots dx_{K_1}$$

The big problem is how to choose the $\theta_m(x)$, $m = 1, \ldots, M$, so that the $a_m(z)$ and $\xi_m$ can be easily evaluated and so that a small $M$ yields a good estimate.

Consider a Bayesian formulation. "Quantize" $x$ into $M$ values $x_1 \cdots x_M$ and assume that $p(x \mid z)$ is "represented" by

$$p(x_m \mid z), \qquad m = 1, \ldots, M$$

This can be viewed as a special case of (11.8.1). The conditional expectation estimator can be approximated by

$$\hat{x} \approx \frac{\sum_{m=1}^{M} p(x_m \mid z)\,x_m}{\sum_{m=1}^{M} p(x_m \mid z)}$$

The most probable estimator can be approximated by

$$\hat{x} \approx x_{\hat{m}}$$

where $x_{\hat{m}}$ is the value which yields the maximum $p(x_m \mid z)$, $m = 1, \ldots, M$.
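The quantization approach can be sketched for a scalar state. The sketch below assumes a Bayesian model $z = h(x) + v$ with $x$ being $N(0, \psi)$ and $v$ being $N(0, R)$, evaluates the unnormalized $p(x_m \mid z)$ on an evenly spaced grid, and forms the two approximate estimators. The grid limits, spacing, numerical values, and function names are assumptions made for illustration. A linear $h$ is included only because its conditional mean, $\psi z/(\psi + R)$, is known and gives a check on the sums.

```python
import math

def quantized_estimates(z, h, r, psi, grid):
    """Approximate the conditional expectation and most probable estimates on a grid.

    The weights are proportional to p(z | x_m) p(x_m); the normalizing constant
    cancels in the ratio of sums and does not affect the location of the maximum.
    """
    weights = [
        math.exp(-0.5 * (z - h(x)) ** 2 / r - 0.5 * x * x / psi) for x in grid
    ]
    total = sum(weights)
    cond_mean = sum(w * x for w, x in zip(weights, grid)) / total
    best = max(range(len(grid)), key=lambda m: weights[m])
    return cond_mean, grid[best]

# M = 801 quantization levels on [-4, 4] (an assumed grid).
grid = [-4.0 + 0.01 * m for m in range(801)]

# Linear check case: exact conditional mean is psi * z / (psi + R) = 0.8.
mean_lin, mode_lin = quantized_estimates(1.0, lambda x: x, 0.25, 1.0, grid)

# Hypothetical nonlinear case h(x) = x**3.
mean_cub, mode_cub = quantized_estimates(1.0, lambda x: x ** 3, 0.25, 1.0, grid)

print("linear:  mean", round(mean_lin, 4), " mode", round(mode_lin, 2))
print("cubic:   mean", round(mean_cub, 4), " mode", round(mode_cub, 2))
```

For a vector $x$ the same sums run over a multidimensional grid, so the number of levels grows rapidly with the dimension of $x$.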


The big problem here is that the needed value of $M$ can become overly large. Even for scalar $x$, $M$ may have to be too large to be practical.

A third method of parameterization or characterizing some function of $x$ is in terms of its moments; i.e., for a Bayesian formulation

$$\int \cdots \int x^m\,p(x \mid z)\,dx, \qquad m = 1, \ldots, M$$

where $x^1 = x$, $x^2 = xx'$, etc. Such an approach automatically yields the conditional expectation estimator as the first moment. The big problem here is the fact that it is often difficult to calculate the moments.

The above parameterization concepts extend to Fisher and weighted-least-squares formulations in a relatively straightforward fashion. In an unknown-but-bounded formulation where


= {x: J(x: z)

l(x: z), all x (11.8.5) Possible l(x : z) are

](x: z) = [x- x(z)]'l:-l(z)[x- x(z)]

Change of Variables


The big problem is, of course, to assure that (11.8.5) is satisfied at least for "most" $x$.

All the preceding parameterization techniques are conceptually straightforward and are simply standard methods of characterizing a function by a finite set of parameters. However, all the techniques have big problems associated with them, and in many cases none of them are practical. Nevertheless they should be considered, as they yield a global characterization rather than simply a single point estimate $\hat{x}$.


Change of Variables

A recurring theme of this book is the importance of the choice of the model. Many aspects of model choosing for linear problems discussed in Chapter 8 carry over to nonlinear problems and are not repeated here. However, the practical implementation problems of nonlinear estimation introduce some additional possibilities, in particular the use of a change of variables.

Assume that in terms of the "natural" state vector $x$ the model

$$z = h(x) + v$$

is "very" nonlinear. There may exist another state vector $y$ related by the one-to-one nonlinear transformation

$$x = \theta(y) \qquad (11.9.1)$$

for which the model is "more" linear in the sense that

$$z = h[\theta(y)] + v \approx Hy + v$$

In such a case, it would be reasonable to first calculate $\hat{y}$ and get $\hat{x}$ by

$$\hat{x} = \theta[\hat{y}] \qquad (11.9.2)$$

Use of (11.9.2) need not be optimal in any mathematical sense, but it is reasonable and that is usually the most important point. Another approach is to try to choose a nonlinear transform of the observations



so that



+ v]

(I 1.9.3)

can (hopefully) be viewed as approximately

z= where



v is some new uncertain vector and A.[h(x)]




(l 1.9.4)


Sometimes there is enough freedom in the choice of uncertainty models to justify the use of (11.9.4) instead of (11.9.3). Changes of variables can be helpful for iterative and parametric solutions even if they do not linearize the problem. For example, a change of variables may change the "shape" of $J(x)$ into a more suitable form. Unfortunately there seems to be no general way to choose the "right" change of variables. Su-

t V'---


a. Plot $p(z \mid x)$ versus $x$ for $z = 0.5$.
b. Assume that an observation $z = z_{actual}$, $z_{actual} = 0.5$ is made. What is the value for the estimate $\hat{x}$ for the following estimators?
(1) Conditional expectation.
(2) Most probable.
(3) Median.
(4) Minimax.

11.2. Consider the Bayesian model

$$z_1 = x^2 + v_1$$

$$z_2 = x + v_2$$

$x$ is $N(0, \psi)$
$v = [v_1, v_2]'$
$v$ is $N(0, R)$
$x$, $v$: independent

a. Let $\hat{x}$ denote the most probable estimate of $x$. $\hat{x}$ is given by the solution of a cubic equation. What is the equation?
b. A "reasonable" estimator which is easier to implement is

$$\hat{x} = W_1 (z_1)^{1/2} + W_2 z_2 + W_0$$

Discuss how you would calculate "optimum" values for $W_1$, $W_2$, and $W_0$.

11.3. Consider a random variable $z$ which can take on values $z = 0, 1, 2, \ldots$. Let $p(z)$ denote the probability function:

$$p(z) = \frac{\mu^z e^{-\mu}}{z!}$$

$z$ has a Poisson distribution. (Note that since $z$ assumes only discrete values, $p(z)$ is not a probability density.) Let $z(1) \cdots z(N)$ denote $N$ independent observations of $z$. Let $\hat{\mu}_N$ denote the maximum likelihood estimate of $\mu$ computed from $z(1) \cdots z(N)$. Show that

$$\hat{\mu}_N = \frac{1}{N}\sum_{n=1}^{N} z(n)$$

11.4. Consider the observations

$$z(t) = h(x, t) + v(t)$$

$x$: unknown
$v(t)$: continuous-time white Gaussian stochastic process

$$E\{v(t)\,v(\tau)\} = \delta(t - \tau)R$$

Let $\hat{x}$ denote a maximum-likelihood-type estimate of $x$ given $z(t)$, $0 < t < T$.
a. Does the likelihood function exist?
b. Give an argument which enables $\hat{x}$ to be expressed as the value of $x$ which minimizes some explicit function of $h$, $R$, and $z(t)$, $0 < t < T$. What is this function? Hint: Formulate the discrete-time version

$$z(n\Delta) = h(x, n\Delta) + v(n\Delta)$$

and use the discrete to continuous time limit.



Estimation: Static Nonlinear Systems

11.5. Consider the unknown-but-bounded model z = h(x) h(x)

= [./

xr + x~J

X 1





+ Xz

+ v!

12 Performance of Estimators: Static Nonlinear Systems

In Chapter 11, estimation techniques were discussed for the static case models

$$z = h(x, v) \qquad (12.1)$$

$$z = h(x) + v \qquad (12.2)$$

where $z$ is the observation (a $K_2$ vector), $x$ is the uncertain state to be estimated (a $K_1$ vector), $v$ is the uncertainty in the observation, and $h$ is a known vector function. The performance of estimators is now discussed for a Fisher formulation where $x$ is unknown and $v$ is a random vector. The discussions are restricted to Fisher models for simplicity. They can be extended to Bayesian models in various ways. One approach is to use the concepts of Section 11.3 and "view" a Bayesian model as a Fisher model with an extra observation.

Linearized error analysis for arbitrary estimators is discussed in Section 12.1 and similar results are given in Section 12.2 for Gaussian $v$ for maximum likelihood estimators. Lower bounds on the estimate error covariance matrix are discussed in Sections 12.3 and 12.5. Ambiguity functions are discussed in Section 12.4. Monte Carlo methods are briefly considered in Section 12.6. A final discussion is given in Section 12.7.




Linearized Error Analysis (Sensitivity Analysis)

Assume that an estimate $\hat{x}$ is obtained by a nonlinear operation on $z$ as follows:

$$\hat{x} = w(z) \qquad (12.1.1)$$

where $w$ is some $K_1$ vector nonlinear function of the $K_2$ vector $z$. This $w$ could be one of the nonlinear estimators of Chapter 11, but for the present it can be any $w$ (whose necessary derivatives exist). Only (12.2) when $z = h(x) + v$ is considered. Assume that

$$E\{v\} = 0, \qquad E\{vv'\} = R$$

A Gaussian assumption is not used. Let $x_t = x_{true}$ denote the true but unknown state. Let $b(x_t)$ denote the bias error in $\hat{x}$ so

$$x_t - E\{\hat{x}\} = b(x_t) \qquad (12.1.2)$$

where the expectation is over $v$. Substituting (12.2) into (12.1.1) gives

$$\hat{x} = w[h(x_t) + v]$$

Expanding $w$ in a Taylor series expansion about $h(x_t)$ gives

$$\hat{x} = w[h(x_t)] + W^{(1)}[h(x_t)]\,v + \cdots \qquad (12.1.3)$$

$$W^{(1)}[h(x_t)] = \left.\frac{\partial w(z)}{\partial z}\right|_{z = h(x_t)}$$

Now assume that the uncertainty $v$ is small enough so that (in some unspecified average sense) the higher-order terms in (12.1.3) can be neglected. Then from (12.1.3) and (12.1.2) and the fact that $v$ is zero mean

$$w[h(x_t)] = x_t - b(x_t)$$

so that (12.1.3) becomes, after neglecting the higher-order terms,

$$x_t - \hat{x} = -W^{(1)}[h(x_t)]\,v + b(x_t)$$

Thus for the estimator (12.1.1), the estimate error is

$$E\{(x_t - \hat{x})(x_t - \hat{x})'\} = \Sigma + b(x_t)\,b'(x_t)$$

$$\Sigma = W^{(1)}[h(x_t)]\,R\,W^{(1)\prime}[h(x_t)] \qquad (12.1.4)$$

Now (12.1.4) is not directly usable, as the partial derivatives are evaluated at the true and hence unknown value $x_t$. However, two ways to proceed are
1. If an estimate $\hat{x} = w(z_{actual})$ for some observation $z_{actual}$ has been made, assume that

$$W^{(1)}[h(x_t)] = W^{(1)}[h(\hat{x})]$$

Linearized Error Analysis (Sensitivity Analysis)


and then use



w[h(x 0 )]RW 11 1'[h(x 0 )]

as the covariance matrix of the error that would result if $x_t$ had the value $x_0$. Thus for small $v$, (12.1.4) can be used to get a "feel" for $\Sigma$, the covariance matrix of the errors.

EXAMPLE:

Consider $z = (x)^{1/2} + v$, $x$: unknown




£(v 2 }



A "reasonable" nonlinear estimator is $\hat{x} = w(z) = z^2$

For this estimator W(l)[h(x,)] = 2z \z~h.nalysis: Maximum Likelih


$$\left[\int f_1(z)\,f_2(z)\,p(z)\,dz\right]^2 \le \left[\int f_1^2(z)\,p(z)\,dz\right]\left[\int f_2^2(z)\,p(z)\,dz\right]$$


Using the Schwarz inequality on (12.3.1) gives

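Since

$$E\{(x_t - \hat{x})^2\} = \int (x_t - \hat{x})^2\,p(z: x_t)\,dz$$

it follows that a lower bound on the estimate error variance for any estimate $\hat{x}$ has been obtained; i.e.,

$$E\{(x_t - \hat{x})^2\} \ge \frac{[-1 + b^{(1)}(x_t)]^2}{\int [(\partial/\partial x)\ln p(z: x_t)]^2\,p(z: x_t)\,dz} \qquad (12.3.2)$$

$\hat{x}$ is any estimate

A different way of writing (12.3.2) is

$$E\{(x_t - \hat{x})^2\} \ge \frac{[-1 + b^{(1)}(x_t)]^2}{E\{[(\partial/\partial x)\ln p(z: x_t)]^2\}}$$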

If the necessary derivatives exist, it follows that

fx Ip(z: x) dz I [1x In p(z: x)Jp(z: :X) dz =

= 0



Performance of Estimators: Static Nonlinear Systems

and thus

az1 ax


p(z: x) dz


I az +f

ax [In p(z: x)]p(z: x) dz


[:x lnp(z: x)Jp(z: x)


= 0



axz[lnp(z: x,)]


az axz[lnp(z: x)]



Then (12.3.2) can be rewritten in an equivalent form: E[(x,- x)z} ~.


+ b'll(x,)]z


-f (a tax )[lnp(z: x,)]p(z: x,) dz 2




The case where x is a vector is now outlined. For simplicity assume that

x is unbiased so that

J[x -

x,]p(z: x,) dz



Taking the partial with respect to x, gives




x,l[a In~~: x,)Jp(z: x,) dz =I


Now multiply (!2.3.5) by two arbitrary vectors y 1 and y 2 to give


· Y11( A.



)l{[a lnp(z: ax

x,)]' Y }p( . )d Z. X,


_ Y11Y2

Z -

The Schwartz inequality gives (y'II:y l)(y~By2)



(Y 1 YY



E((i - x,)(x - x,)'}



J{[a In~~: x,)J[a In~~: x,)J}

( 12.3.6) p(z: x,) dz

Since y 1 and y 2 are arbitrary, they can be chosen such that (assuming the inverse exists) so that (12.3.6) becomes

Lower Bound on Estimate Error Variance


But since y 1 is still arbitrary, it can be concluded that

:r: > s-,


which is the matrix, unbiased version of ( 12.3.2). The matrix analogy to (12.3.4) is (12.3.8)

12.3.2 Gaussian Case†

Now consider the model

$$z = h(x) + v$$

$v$ is $N(0, R)$

Then

$$\ln p(z: x) = c - \tfrac{1}{2}[z - h(x)]'R^{-1}[z - h(x)]$$

$c$: constant

If x is a scalar,

a~:2 lnp(z:

x,) = -H'' 1(xJR-'H

E(x,- x)(x,- x)'} (H''I'(x,)R-tH 2 J[,f;, wkp(z: xk)/p(z: x,)J p(z: x,) dz K

[± k~l



Since (12.5.1) holds for arbitrary $w_k$ and $x_k$, $k = 1, \ldots, K$, it follows that the "tightest" bound is found by maximizing the right-hand side over $w_k$ and $x_k$. The result of this maximization is the Barankin bound. Use of a larger value of $K$ improves the bound.

To use (12.5.1), it is necessary to pick the "best" $x_k$ and $w_k$, $k = 1, \ldots, K$ (and a value for $K$), in some fashion. Brute force computer search is possible but not intellectually satisfying. One reasonable approach is to choose the $x_k$, $k = 1, \ldots, K$, and then do an optimization over the $w_k$ in some fashion. This can be effective, especially in the Gaussian case. The choice of $x_k$ values can be a problem, but placing the $x_k$ at the local maxima of the ambiguity function of Section 12.4 is not unreasonable.

The lower bound of (12.3.2) for unbiased estimates can be derived as the limit of the special case of (12.5.1) when c → 0, where

K=2 w,·=


x, = x, w2







Monte Carlo Methods

At this point, the reader who has really understood what has been said in this chapter is probably disappointed. Several different approaches were presented, but they have many shortcomings, such as requiring small errors and ignoring potentially very important phenomena such as biases. Ambiguity functions provide a general approach, but they become very unwieldy for large dimensional $x$. One way around these problems is to use Monte Carlo techniques. The Monte Carlo approach to error analysis is as follows:
1. Use a random number generator to obtain typical samples of $v$, say, $v_j$, $j = 1, \ldots, M$.
2. Pick a value of $x$ (say, $x_{true}$) and then compute the resulting $z$, say, $z_j$, $j = 1, \ldots, M$.
3. Process each $z_j$ through the estimator to get the resulting estimates, $\hat{x}_j$, $j = 1, \ldots, M$.
4. Look at the errors $x_{true} - \hat{x}_j$, $j = 1, \ldots, M$, and decide how well the estimator works.

This approach is simplicity itself and automatically includes the case of large errors and bias errors. One big disadvantage is the need to first implement the estimator itself. Also, the required computation time can be excessive, as the number of samples $M$ may have to be large. To be truly "scientific," an $M$ of several thousand might be needed but $M$ = 20 or 30 is often "safe" (although the author has often used $M$ = 2-5 and "crossed his fingers").
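The four steps can be sketched for the scalar example of Section 12.1, $z = x^{1/2} + v$ with the estimator $\hat{x} = z^2$. In this sketch $v$ is assumed Gaussian, and the values of $x_{true}$, the variance of $v$, the number of samples $M$, and the random seed are hypothetical. For this particular estimator the linearized analysis gives an error variance of $4 x_{true} E\{v^2\}$, and expanding $z^2$ shows that $E\{\hat{x}\} = x_{true} + E\{v^2\}$, so the sample statistics can be compared against both.

```python
import math
import random

def monte_carlo_errors(x_true, r, m, seed=0):
    """Return the M errors x_true - x_hat_j for z = sqrt(x) + v and x_hat = z**2."""
    rng = random.Random(seed)
    errors = []
    for _ in range(m):
        v = rng.gauss(0.0, math.sqrt(r))   # step 1: sample of v (Gaussian is an assumption)
        z = math.sqrt(x_true) + v          # step 2: resulting observation
        x_hat = z * z                      # step 3: pass z through the estimator
        errors.append(x_true - x_hat)      # step 4: error for this sample
    return errors

x_true, r, m = 4.0, 0.01, 20000            # hypothetical values
errors = monte_carlo_errors(x_true, r, m)

bias = sum(errors) / m
variance = sum((e - bias) ** 2 for e in errors) / (m - 1)
linearized = 4.0 * x_true * r              # W R W' with W = 2 sqrt(x_true)

print("sample bias error:", round(bias, 4), "(exact value is -R =", -r, ")")
print("sample error variance:", round(variance, 4))
print("linearized error variance:", linearized)
```

With small $v$ the sample variance stays close to the linearized value, while the bias term, which the linearized covariance ignores, shows up directly in the sample mean of the errors.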



It should be restated that the preceding discussions on a Fisher formulation can be extended to a Bayesian formulation if desired.

Historical Notes and References: Chapter 12


By straightforward, albeit somewhat hand-waving, linearization arguments for small uncertainty, the error covariance matrix $\Sigma$ for an arbitrary estimator, $\hat{x} = w(z)$, $z = h(x) + v$, was shown to be

$$\Sigma = W^{(1)} R\,W^{(1)\prime} \qquad (12.7.1)$$

while for the maximum likelihood estimator for Gaussian $v$,

$$\Sigma = [H^{(1)\prime} R^{-1} H^{(1)}]^{-1} \qquad (12.7.2)$$

These two formulas are useful tools. However, it is important to emphasize two of their shortcomings. First, they are local analyses which ignore the global picture such as the shape of the ambiguity function. Second, even if the local picture is sufficient, (12.7.1) and (12.7.2) give only the error variances. In many applications, bias errors are more important.

Equation (12.7.2) can be evaluated directly from the model ($h$ and $R$) without implementing the estimator. This makes it an extremely useful tool in doing initial "feasibility studies" to see whether a given $z = h(x) + v$ situation can yield acceptable results. Equation (12.7.2) often provides a good "feel" for the problem even if the errors are not Gaussian. In many applications the shortcomings of (12.7.2) are ignored under the philosophy that "some sort of answer is better than no answer at all." Such a philosophy can, of course, also lead to trouble.

The lower bound inequality of Section 12.3 provides a more mathematically precise approach to error analysis. For Gaussian errors (large or small) it enables (12.7.2) to be given an exact interpretation in terms of a lower bound for unbiased estimators. Unfortunately there is, in general, no guarantee that the lower bound given by the inequality can be achieved. Furthermore, although a bound including bias can be formulated, the evaluation of the bias errors remains very difficult. The lower bound of Section 12.5 provides a method to try to improve the lower bound of Section 12.3, but the needed calculations can easily become prohibitive.

The ambiguity function is a useful concept for visualizing some of the global aspects of the problem. It also provides another way of viewing the lower bound of Section 12.3. Monte Carlo techniques are the only completely general approach to error analysis.
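As an illustration of evaluating (12.7.2) from $h$ and $R$ alone, the sketch below treats a scalar $x$ with a two-component observation and a diagonal $R$. The model $z_1 = x^2 + v_1$, $z_2 = x + v_2$ follows the form used in the exercises, but the variances are hypothetical, and the central-difference derivative and function names are assumptions of this sketch rather than anything prescribed by the text.

```python
def derivative(h, x, eps=1e-5):
    """Central-difference approximation to H^(1)(x) = dh/dx for scalar x."""
    up = h(x + eps)
    down = h(x - eps)
    return [(u - d) / (2.0 * eps) for u, d in zip(up, down)]

def linearized_variance(h, r_diag, x):
    """Evaluate [H' R^{-1} H]^{-1} for scalar x and diagonal R."""
    h1 = derivative(h, x)
    information = sum(hk * hk / rk for hk, rk in zip(h1, r_diag))
    return 1.0 / information

def h(x):
    """Observation function with components x**2 and x."""
    return [x * x, x]

r_diag = [1.0, 4.0]   # hypothetical variances of v_1 and v_2

for x_t in (0.0, 0.5, 1.0, 2.0):
    print("x_t =", x_t, " error variance (12.7.2):",
          round(linearized_variance(h, r_diag, x_t), 4))
```

A table of this kind over a range of candidate $x_t$ is the sort of feasibility study described above: at $x_t = 0$ the first observation contributes nothing to the information, and the variance is set entirely by the second.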

Historical Notes and References: Chapter 12

Section 12.3 The lower bound is often called the Cramer-Rao lower bound or the information inequality and is sometimes viewed as part of the theory of efficient estimation. See Book List 2, Mathematical Statistics. Cramer [1] and Wilks [1] were used. This bound has been employed in a wide variety of situations, of which Sklar and Schweppe [1] and Schweppe [6] are examples of application to radar problems.

Section 12.4 The term ambiguity function seems to have originated in the radar context in Woodward [1].

Section 12.5 Barankin [1] is the original reference. The author used Hoeffesteder and McAulay [1]. Another type of generalization of the Cramer-Rao bound is discussed in Bhattacharyya [2].

Exercises 12.1. Consider a Fisher model (see also Exercise 11.2) Zt


+ = x + v2




v 1 , v 2 : zero mean independent Gaussian




E(viJ = 4 x: completely unknown

Let $x_t$ denote the true value of $x$.
a. Evaluate a lower bound on the error variance for any estimator and sketch its behavior as a function of $x_t$.
b. A maximum likelihood estimate of $x$ requires the solution of a cubic equation. Therefore it might be desired to use one of the following estimators instead: or Perform a linearized error analysis on these two estimators to determine the error variance assuming that the observation uncertainties are small. Sketch their behavior as a function of $x_t$.
c. Discuss how you might attempt to calculate the bias errors for the maximum likelihood estimator and for the estimators of part (b). Do not attempt to actually evaluate the bias errors.

12.2. Consider

$$z = \sqrt{x}\,(1 + v)$$

$x$: completely unknown
$v$ is $N(0, R)$

It is desired to estimate $x$ using the observation $z$.
a. A reasonable estimate to use is $\hat{x} = z^2$. Evaluate the bias error and the error variance for this estimate. To evaluate the error variance, you may use either exact or linearized techniques, but state which approach you are following.
b. Let $\hat{x}$ denote the maximum likelihood estimate. What is $\hat{x}$? Repeat the error analysis of part a for this $\hat{x}$.

12.3. Consider

$$z = h(x) + v$$

$v$ is $N[0, R(x)]$
$x$: completely unknown

Let $\hat{x}$ denote the maximum likelihood estimate. Calculate a lower bound (Cramer-Rao) on $\Sigma$, as in Section 12.3, assuming that $\hat{x}$ is unbiased.

12.4. Consider the observations z(f)




h(x, t)

+ v(t),



K 1 vector nonlinear function of the K 1 vector state x and time n and time n G: K 1 by K 3 matrix w, v: uncertain processes x(O): uncertain initial state h: K1. vector nonlinear function of the K 1 vector state


However, much of the analysis to follow is also valid for more general formulations such as

$$x(n + 1) = \phi[x(n), n] + G[x(n), n]\,w(n)$$

$$z(n) = h[x(n), n] + v(n) \qquad (13.3)$$

or

$$x(n + 1) = \phi[x(n), n, w(n)]$$

$$z(n) = h[x(n), n, v(n)]$$



In many applications there are also known inputs $u(n)$, so the model of interest might be

$$x(n + 1) = \phi[x(n), n] + G[x(n), n]\,w(n) + B(n)\,u(n)$$

or, still more generally,

$$x(n + 1) = \phi[x(n), u(n), n] + G[x(n), n]\,w(n)$$

Note, however, that since $u(n)$ is a known function of time, $\phi[x(n), u(n), n]$ can be viewed simply as $\phi[x(n), n]$. The $w(n)$ and $v(n)$ are modeled as discrete-time white processes. When a stochastic (Bayesian) model is used for the uncertainty, it is assumed that $v(n)$, $w(n)$, and $x(0)$ are random vectors with

$v(n_1)$, $v(n_2)$: independent, $n_1 \neq n_2$
$w(n_1)$, $w(n_2)$: independent, $n_1 \neq n_2$
$v(n_1)$, $w(n_2)$, $x(0)$: independent, all $n_1$ and $n_2$

The definition of a white process used in Section 3.1.2 involved "uncorrelated in time" processes rather than "independent in time" processes, so the $w(n)$ and $v(n)$ are special white processes. For probabilistic (Bayesian) models, the general theory is primarily concerned with the behavior of the probability densities and is mostly only an exercise in manipulating Bayes' rule. The only really difficult part is keeping the notation straight. The notation $p[v(n)]$, $p[w(n)]$, $p[x(n)]$, etc., is used to denote the probability densities of $v(n)$, $w(n)$, and $x(n)$, etc. It would be more precise to use the notation $p[x, n]$ instead of $p[x(n)]$, as it is the probability density of $x$ at time $n$ that is of interest, but the notation $p[x(n)]$ seems to yield the "simplest"-appearing equations.

This chapter is organized somewhat like Chapter 11; first theory, then practice. The first group of sections presents the theory. Section 13.1 discusses the propagation of uncertainty for both Bayesian and unknown-but-bounded models. Section 13.2 discusses Bayesian filtering and smoothing. Sections 13.3-13.7 discuss Fisher, weighted-least-squares, finite memory estimation, and unknown-but-bounded models and performance analysis. Section 13.8 summarizes the theoretical development. The second group of sections discusses schemes for obtaining practical results. Approximate propagation of uncertainty is discussed in Section 13.9. Recursive filters are considered in Section 13.10. Although this chapter is on discrete time models, the continuous time case is briefly discussed in Section 13.11.

13.1 Propagation of Uncertainty




Given uncertainty models for x(0) and w(n), n = 1, ..., it is desired to find the uncertainty model for the x(n) resulting from (13.4) (when no observations are made).

13.1.1 Bayesian

Consider x(n + 1) = φ[x(n), w(n), n]. Assume that x(0) and w(n), n = 1, 2, ..., are modeled as independent random vectors with known probability densities. It is desired to evaluate p[x(n + 1)] in terms of p[x(n)].

By definition of marginal distributions,

\[ p[x(n+1)] = \int p[x(n+1),\, x(n)]\, dx(n) \]

Using Bayes' rule,

\[ p[x(n+1)] = \int p[x(n+1) \mid x(n)]\; p[x(n)]\, dx(n) \tag{13.1.1} \]


This is the desired result, as it expresses p[x(n + 1)] in terms of p[x(n)] and other computable functions. If x(n + 1) = φ[x(n), n] + G(n)w(n) and G(n) is a nonsingular matrix, it follows from the independence of the w(n), n = 1, ..., that (see Appendix D.4)

\[ p[x(n+1) \mid x(n)] = \frac{1}{|G(n)|}\; p[w(n)] \Big|_{\,w(n) = G^{-1}(n)\{x(n+1) - \phi[x(n),\, n]\}} \]


where p[w(n)] is the probability density of w(n). For more general models, similar expressions can be written in various ways. Equation (13.1.1) is a recursive functional equation. p[x(n)] is a scalar function of the K_1 vector x. Except for special cases, this equation is not suitable for direct implementation on a computer. For example, to numerically calculate p[x(n + 1)], it is necessary to have already calculated and stored the value of p[x(n)] for all possible values of x at time n. In terms of state space concepts, (13.1.1) shows that p[x(n)] can be viewed as the "state" of the system. Since p[x(n)] is a function, this state can be considered to be an infinite dimensional vector. A Gaussian assumption on x(0) and w(n) enables more explicit equations to be written but, in general, does not alleviate the basic difficulty associated with solving functional equations. Of course, if the nonlinear model becomes linear, the results of Section 4.1.2 can be derived from (13.1.1).
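The storage burden described above can be made concrete with a small numerical sketch. The code below is only an illustration under assumptions that are not in the text: a scalar state, additive Gaussian w(n) with G = 1, and a uniform grid on which p[x(n)] is tabulated. The names `propagate_density`, `gaussian_pdf`, and `variance` are hypothetical helpers introduced here.

```python
import math

def gaussian_pdf(x, mean, var):
    """Scalar Gaussian density N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def propagate_density(p_x, grid, phi, q_var):
    """One step of (13.1.1) on a uniform grid.

    Assumed model (illustrative): x(n+1) = phi(x(n)) + w(n), w(n) ~ N(0, q_var),
    so p[x(n+1) | x(n)] is a Gaussian centered at phi(x(n)).
    """
    dx = grid[1] - grid[0]
    p_next = []
    for x_new in grid:
        # Riemann-sum approximation of the integral over x(n).
        total = 0.0
        for x_old, p_old in zip(grid, p_x):
            total += gaussian_pdf(x_new, phi(x_old), q_var) * p_old
        p_next.append(total * dx)
    return p_next

def variance(p_x, grid):
    """Variance of a density tabulated on a uniform grid."""
    dx = grid[1] - grid[0]
    mean = sum(x * p for x, p in zip(grid, p_x)) * dx
    return sum((x - mean) ** 2 * p for x, p in zip(grid, p_x)) * dx

# The whole function p[x(n)] must be stored: here 401 numbers for one scalar state.
grid = [-10.0 + 0.05 * i for i in range(401)]
p0 = [gaussian_pdf(x, 0.0, 1.0) for x in grid]
p1 = propagate_density(p0, grid, lambda x: 0.5 * x, 0.25)
```

With the linear choice φ(x) = 0.5x used in the last line, the tabulated result can be checked against the linear theory (variance 0.25 · 1 + 0.25). For a K_1 vector the grid grows exponentially with K_1, which is the practical difficulty the text refers to.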



Estimation: Discrete-Time Nonlinear Dynamic Systems

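EXAMPLE: Consider the linear, zero mean, Gaussian scalar model

\[ x(n+1) = \Phi\, x(n) + w(n) \]

\[ E\{x^2(0)\} = \Gamma(0), \qquad E\{w(n_1)\, w(n_2)\} = \begin{cases} Q & n_1 = n_2 \\ 0 & n_1 \neq n_2 \end{cases} \]

Since x(0) and w(0) are Gaussian,

\[ p[x(0)] = [2\pi\Gamma(0)]^{-1/2} \exp\left\{ -\frac{x^2(0)}{2\Gamma(0)} \right\} \]

\[ p[x(1) \mid x(0)] = [2\pi Q]^{-1/2} \exp\left\{ -\frac{[x(1) - \Phi x(0)]^2}{2Q} \right\} \]

Thus from (13.1.1)

\[ p[x(1)] = [4\pi^2 \Gamma(0) Q]^{-1/2} \int \exp\left\{ -\frac{x^2(0)}{2\Gamma(0)} - \frac{[x(1) - \Phi x(0)]^2}{2Q} \right\} dx(0) \]

or, after a lot of rearrangement,

\[ p[x(1)] = [4\pi^2 \Gamma(0) Q]^{-1/2} \int \exp\left\{ -\frac{Q + \Phi^2\Gamma(0)}{2\Gamma(0) Q} \left[ x(0) - \frac{x(1)\,\Phi\,\Gamma(0)}{Q + \Phi^2\Gamma(0)} \right]^2 - \frac{x^2(1)}{2[Q + \Phi^2\Gamma(0)]} \right\} dx(0) \]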

\[ \Gamma(n+1) = \Phi^{(1)}[x_{nom}(n)]\; \Gamma(n)\; \Phi^{(1)\prime}[x_{nom}(n)] + G[x_{nom}(n)]\; Q\; G'[x_{nom}(n)], \qquad \Gamma(0) = \Psi \]

The key assumption underlying (13.9.2) is that x(n) − x_nom(n) is small. A necessary condition for this is that the noise w(n) is small (or rather GQG' is small) in some sense. However, this is not sufficient, and care must always be exercised in the use of (13.9.2).

EXAMPLE:

\[ x(n+1) = \frac{x^2(n)}{4} + 1 + w(n) \]

\[ x(0) = 2, \qquad E\{w(n)\} = 0, \qquad E\{w(n_1)\, w(n_2)\} = \begin{cases} Q & n_1 = n_2 \\ 0 & n_1 \neq n_2 \end{cases} \]

Then x_nom(n) satisfies

\[ x_{nom}(n+1) = \frac{x_{nom}^2(n)}{4} + 1, \qquad x_{nom}(0) = 2 \]

so

\[ x_{nom}(n) = 2, \quad \text{all } n \]

The slope of the dynamics at the nominal is Φ^(1)[x_nom(n)] = x_nom(n)/2 = 1, so from (13.9.4)

\[ \Gamma(n+1) = \Gamma(n) + Q, \qquad \Gamma(0) = 0 \]

Thus the linearized result of (13.9.2) says that x(n) is N(2, Γ(n)), where Γ(n) is growing without bound. Hence no matter how small the value of the noise w(n) (how small Q is), it is clear that the linearized equation (13.9.1) becomes invalid for large time, as the assumption that x_nom(n) is close to x(n) becomes invalid.
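The unbounded growth in this example is easy to reproduce numerically. The following is a minimal sketch of the linearized recursion for this particular scalar model; the function name `linearized_moments` and its arguments are hypothetical, introduced only for illustration.

```python
def linearized_moments(n_steps, q_var, x0=2.0, gamma0=0.0):
    """Propagate the nominal trajectory and the linearized variance.

    Model of the example: x(n+1) = x(n)^2 / 4 + 1 + w(n), with E{w^2} = q_var.
    Returns a list of (x_nom(n), Gamma(n)) pairs for n = 0 ... n_steps.
    """
    phi = lambda x: x * x / 4.0 + 1.0      # nonlinear dynamics without noise
    dphi = lambda x: x / 2.0               # first derivative of phi
    x_nom, gamma = x0, gamma0
    history = [(x_nom, gamma)]
    for _ in range(n_steps):
        # Variance recursion uses the slope at the current nominal point.
        gamma = dphi(x_nom) ** 2 * gamma + q_var
        x_nom = phi(x_nom)
        history.append((x_nom, gamma))
    return history

history = linearized_moments(n_steps=100, q_var=0.01)
```

Here the nominal stays at 2 and the variance grows linearly as nQ, so even a small Q eventually makes the small-deviation assumption untenable.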

Approximate Propagation of Uncertainty


13.9.2 Second-Order Terms†

Equation (13.9.2) results from truncating a Taylor series expansion of φ(x) and keeping only the linear terms. The effect of keeping the second-order term is now considered. To keep the notation from becoming awkward, only the case of scalar x(n) is considered and G[x(n), n] is assumed to be G. Otherwise the model is as in Section 13.9.1. If the second-order term in the expansion of φ(x) is kept, (13.9.1) becomes, for scalar x,

\[ \delta_x(n+1) = \Phi^{(1)}[x_{nom}(n)]\, \delta_x(n) + \tfrac{1}{2}\, \Phi^{(2)}[x_{nom}(n)]\, \delta_x^2(n) + G\, w(n) \]

\[ \Phi^{(2)}[x_{nom}(n)] = \frac{\partial^2 \phi(x)}{\partial x^2} \bigg|_{x = x_{nom}(n)} \]

To further simplify the notation, Φ^(1) and Φ^(2) will be written instead of Φ^(1)[x_nom(n)] and Φ^(2)[x_nom(n)]. Define

\[ m_\delta(n) = E\{\delta_x(n)\}, \qquad \Gamma(n) = E\{[\delta_x(n) - m_\delta(n)]^2\} \]

so that

\[ E\{\delta_x^2(n)\} = \Gamma(n) + m_\delta^2(n) \]

Then from (13.9.3)

m_δ(n + 1) =
[x(n)] + B[x(n)]u(n).

Figure 13.10.1 is a "conventional" engineering solution. The difficult part of implementation is to choose the gains K(n), which determine how, and how much of, the errors z(n + 1) − h[x̂(n + 1), n + 1] are "fed back" to drive the model. If the gains are too low, the tracking system will be "sluggish," so that, if the initial model state x̂(0|0) is not "good," the tracking system will not "lock on" to the true system state x(n). If the gains are too high, the tracking system may overshoot, oscillate, and fail to track x(n) very closely. See the analogous discussion for linear models in Section 8.2.2.



13.10.2 Linearization About Nominal

The second rule (given a nonlinear problem, try to linearize it) can be applied in different ways. The following way is closely related to the ideas of Section 13.9.1. Choose some nominal trajectory, x_nom(n), n = 1, ..., through state space. Define δ_x(n) = x(n) − x_nom(n).

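\[ \Sigma(N+1 \mid N) = \Phi^{(1)}[\hat{x}(N \mid N), N]\; \Sigma(N \mid N)\; \Phi^{(1)\prime}[\hat{x}(N \mid N), N] + G[\hat{x}(N \mid N), N]\; Q(N)\; G'[\hat{x}(N \mid N), N] \]

\[ \Sigma(0 \mid 0) = \Psi, \qquad \hat{x}(0 \mid 0) = 0 \]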


This filter is of the form of Fig. 13.10.1, where the gains K(N + 1) are calculated using the linear theory of Chapter 6. To be very precise, the form of Fig. 13.10.1 corresponding to (13.10.3) should be modified to indicate the explicit dependence of K(N + 1) on x̂(N|N). Although (13.10.3) is the result of only a "minor modification" of (13.10.2), the two filters can exhibit completely different behaviors. One obvious disadvantage of (13.10.3) is that the gains depend on the estimate x̂(n|n) and thus cannot be precomputed. The full equation has to be solved in real time.

EXAMPLE: Consider


\[ x(n+1) = \left[ \frac{x^2(n)}{4} + 1 \right] + w(n) \]

\[ z(n) = x^3(n) + v(n) \]

Consider first the use of a nominal solution and (13.10.2). Let

\[ x_{nom}(n) = 2 \]

\[ \Phi^{(1)}[x_{nom}(n)] = \frac{\partial}{\partial x} \left[ \frac{x^2}{4} + 1 \right] \bigg|_{x=2} = 1 \]

\[ H^{(1)}[x_{nom}(n)] = \frac{\partial}{\partial x} \left[ x^3 \right] \bigg|_{x=2} = 12 \]

Then from (13.10.2)

\[ \hat{x}(N+1 \mid N+1) = \hat{\delta}_x(N+1 \mid N+1) + 2 \]

\[ \hat{\delta}_x(N+1 \mid N+1) = \hat{\delta}_x(N \mid N) + K(N+1)\{ z(N+1) - 8 - 12\, \hat{\delta}_x(N \mid N) \} \]

\[ K(N+1) = 12\, \Sigma(N+1 \mid N+1)\, R^{-1} \]

\[ \Sigma(N+1 \mid N+1) = \left[ 144 R^{-1} + (\Sigma(N \mid N) + Q)^{-1} \right]^{-1} \]

This is a nice simple result which yields a Σ(N|N) which becomes constant for large N. However, by the examples in Sections 13.9.1 and 13.9.2, the difference x(n) − x_nom(n) will not stay small, so for large N the above equations become meaningless even for very small Q and R. This situation can be improved by using (13.10.3) instead of (13.10.2) to give

\[ \hat{x}(N+1 \mid N+1) = \hat{x}(N+1 \mid N) + K(N+1)\{ z(N+1) - [\hat{x}(N+1 \mid N)]^3 \} \]

\[ \hat{x}(N+1 \mid N) = \frac{\hat{x}^2(N \mid N)}{4} + 1 \]

\[ K(N+1) = \Sigma(N+1 \mid N+1)\; 3[\hat{x}(N+1 \mid N)]^2\; R^{-1} \]

\[ \Sigma(N+1 \mid N+1) = \left\{ \frac{9[\hat{x}(N+1 \mid N)]^4}{R} + \left[ \frac{[\hat{x}(N \mid N)]^2\, \Sigma(N \mid N)}{4} + Q \right]^{-1} \right\}^{-1} \]

The equations are now more complex, but they should work "reasonably well" provided that Q and R are small enough. The corresponding equations for the linear model are (6.3.1), (6.2.5), and (6.2.7). Obviously (13.10.3) can be written in many different forms, just as in the linear case of Chapter 6.
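As an illustration only, one cycle of the relinearized filter for this scalar example might be coded as follows. The function name `ekf_step` and the numerical values in the usage lines are assumptions made for the sketch; the formulas are the scalar forms of the equations given above for (13.10.3).

```python
def ekf_step(x_hat, sigma, z_next, q_var, r_var):
    """One cycle of the relinearized filter for the example.

    Assumed model: x(n+1) = x(n)^2 / 4 + 1 + w(n),  z(n) = x(n)^3 + v(n),
    with E{w^2} = q_var and E{v^2} = r_var (sigma and the predicted
    variance must be positive).
    """
    # Prediction, linearized about the current estimate x_hat = x(N|N).
    x_pred = x_hat ** 2 / 4.0 + 1.0
    sigma_pred = (x_hat / 2.0) ** 2 * sigma + q_var
    # Measurement slope evaluated at the prediction x(N+1|N).
    h1 = 3.0 * x_pred ** 2
    sigma_new = 1.0 / (h1 ** 2 / r_var + 1.0 / sigma_pred)
    gain = sigma_new * h1 / r_var
    x_new = x_pred + gain * (z_next - x_pred ** 3)
    return x_new, sigma_new, gain

# Illustrative use: the gain is recomputed at every step from the latest estimate.
x_hat, sigma = 2.0, 0.1
for z in [8.0, 8.3, 7.9]:
    x_hat, sigma, gain = ekf_step(x_hat, sigma, z, q_var=1e-4, r_var=0.01)
```

Because `gain` depends on `x_hat`, it cannot be tabulated ahead of time, which is the real-time cost noted in the text.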



In application of (13.10.3), a trick that is sometimes useful is to increase Q by some amount. The idea is that the increased Q accounts in some way for the extra errors resulting from the linearization. The problem is, of course, to choose the "right" value for Q.

13.10.4 Linearization About New or Smoothed Estimate†

The filter (13.10.3) is straightforward to implement because x̂(N + 1|N + 1) is expressed as an explicit function of x̂(N|N) and z(N + 1). However, from an idealist point of view, it can be argued that there are more accurate approaches. One possibility is to replace

\[ H^{(1)}[\hat{x}(N+1 \mid N),\, N+1] \quad \text{by} \quad H^{(1)}[\hat{x}(N+1 \mid N+1),\, N+1] \]

i.e., when linearizing h[x(N + 1), N + 1], to linearize about x̂(N + 1|N + 1) instead of x̂(N + 1|N) = φ[x̂(N|N), N]. If this is done in (13.10.3), the result is a set of equations which implicitly define x̂(N + 1|N + 1) but which still have to be solved, usually in some iterative fashion. An extension of this idea is to replace

\[ \Phi^{(1)}[\hat{x}(N \mid N),\, N] \quad \text{by} \quad \Phi^{(1)}[\hat{x}(N \mid N+1),\, N] \]

where x̂(N|N + 1) is calculated using the linear model smoothing theory of Chapter 6. This again converts (13.10.3) into an implicit rather than explicit set of equations for x̂(N + 1|N + 1).

13.10.5 Fisher-Bayesian Type of Argument: Second-Order Terms†

The argument used in Section 6.2.2 to derive the optimum filter for a linear model was as follows:

1. Assume that x̂(N|N) and its statistical properties are given.
2. Calculate x̂(N + 1|N) and its statistics.
3. View x̂(N + 1|N) as a Fisher measurement on x(N + 1).
4. Combine the two observations x̂(N + 1|N) and z(N + 1) to get x̂(N + 1|N + 1).

This logic is now applied to the nonlinear case to yield the filter of (13.10.3) plus second-order terms. Techniques related to those of Section 13.9.2 are used. For simplicity consider the case where x is a scalar and the dynamic model is

\[ x(n+1) = \phi[x(n)] + G(n)\, w(n) \]

\[ z(n) = h[x(n)] + v(n) \]

Recursive Filters


Let x̂(n|m) be an estimate of x(n) calculated from z(1) ··· z(m). Define

\[ \delta_x(n \mid m) = x(n) - \hat{x}(n \mid m) \]

Let Σ(n|m) denote the variance of δ_x(n|m):

\[ \Sigma(n \mid m) = E\{ [\delta_x(n \mid m) - E\{\delta_x(n \mid m)\}]^2 \} \]

Consider

\[ \hat{x}(N+1 \mid N) = \phi[\hat{x}(N \mid N)] \]

A Taylor series expansion about x̂(N|N) yields

\[ \hat{x}(N+1 \mid N) = x(N+1) - \delta_x(N+1 \mid N) \]

δ_x(N + 1|N) =

Φ, B, etc. For example, one might choose α to contain the eigenvalues of Φ rather than explicit elements of the Φ matrix. In fact, an important part of the system identification problem is to choose an α of the most appropriate form. This problem of choosing α is not discussed much in this book, but its importance must be emphasized.

14.1.3 Time Invariance

The model (14.1.1) is a time-invariant system. Time-varying structures such as

\[ x(n+1) = \Phi(n)\, x(n) + G(n)\, w(n) + B(n)\, u(n) \]

are conceivable. Assume that the model degree is specified. Let α denote the vector of uncertain values of the Φ(n), etc., n = 1, ..., matrices. There are two cases of interest:

1. α is constant.
2. α(n) is time-varying.

Examples of these two cases are

Φ(n) = [


~ I)( I


= [

~ a 1(n)



a,~n) .~J


System Identification

The case where α is a constant is a relatively minor extension of (14.1.1). When α(n) is time-varying, the problem is more complicated. A completely unknown α(n), n = 1, ..., can sometimes be considered, or α(n) can be modeled as some stochastic or unknown-but-bounded process. A case of more practical importance occurs when α(n) is "slowly time-varying." In such a case, α(n) can be assumed to be constant over short time intervals. With luck, this time interval will be long enough to identify the parameter values using the model (14.1.1). The case of time-varying α(n) will be briefly discussed again in later sections.

14.1.4 Perfect Knowledge of Input u(n)

There are two input processes in (14.1.1), w(n) and u(n). It is assumed that the input u(n) is perfectly known. In some applications, the input u(n) is not known a priori but is observed; for example,

\[ z(n) = u(n) + v(n) \]

Such a situation can be put in the form of (14.1.1) using the ideas of Section 3.2 by associating u(n) with z(n) and incorporating v(n) into w(n).

14.1.5 White Uncertainty

The model (14.1.1) assumes that the input and observation uncertainties w(n) and v(n) are white processes. However, the ideas of Section 3.2 can be used to express more general cases in the form of (14.1.1). Similarly, nonzero mean stochastic processes or non-origin-centered set processes can be handled using the ideas of Section 3.2.

14.1.6 Identifiability

Given (14.1.1), it is not always possible to identify all the unknown system parameter values from knowledge of z(n) and u(n), n = 1, .... Consider Fig. 14.1.1. It is obvious that parameters associated with Φ_1 and Φ_4 cannot be identified. Similarly, parameters associated with Φ_3 can be identified only if the initial conditions x_3(0) allow it. If x_3(0) = 0, values associated with Φ_3 cannot be identified. In terms of the controllability and observability concepts discussed in Appendix B,

System Φ_1: not observable
System Φ_3: not controllable
System Φ_4: not observable, not controllable

Model Structures



x 1 (n)


where Φ is unknown so that

\[ x(n) = \Phi^n x(0) \]

An estimate of Φ could be obtained, but it is clear that the resulting model does not constitute a good fit to the observations, so the identification actually yields a meaningless model. A different approach would be to hypothesize the structure

\[ x(n+1) = \Phi\, x(n) + w(n) \]

Φ and the "size" of w(n): both unknown

In this case the "identified" value of Q would be large. Thus even though the identified model will still not behave like the observations, one is "told" by the identified model that there is a lot of uncertainty in the model. The best procedure would, of course, be to hypothesize a second-order structure which could have a sinusoidal output.

14.1.9 Discussion of Structure

The structure (14.1.1) is definitely not the most general possible model, but it covers a very wide range of cases of interest. It should be evident that a crucial part of any system identification problem lies in making the vector α of unknown parameters as small as possible. Care must also be taken to ensure that the unknown parameter α can actually be estimated (in some sense).


There is no simple set of rules which enables one to perform the crucial initial stage of hypothesizing the model structure. This is the same basic type of problem discussed in Section 8.1, and the same rules apply. One must understand the physical process as well as possible and then exercise good engineering judgment.



14.2 State Augmentation: Nonlinear Estimation Theory

System identification by state augmentation was briefly mentioned in Chapter 2. This point of view is now considered again. Consider the model (14.1.1). Assume that the model degree is known. Let α denote the vector of all unknown system parameters (unknown values of Φ, B, etc.). Then by using an augmented state vector x̄(n), the problem can be modeled as

\[ \bar{x}(n) = \begin{bmatrix} x(n) \\ \alpha(n) \end{bmatrix}, \qquad \alpha(n+1) = \alpha(n) \]

\[ \bar{x}(n+1) = \phi[\bar{x}(n),\, w(n),\, u(n)] \tag{14.2.1} \]

\[ z(n) = h[\bar{x}(n),\, v(n)] \]

EXAMPLE: Consider

\[ x(n+1) = \Phi\, x(n) + w(n) + u(n) \]

\[ z(n) = H x(n) + v(n) \]

Φ, H: unknown

Then the nonlinear model (14.2.1) is

\[ x(n+1) = \alpha_1(n)\, x(n) + w(n) + u(n) \]

\[ \alpha_1(n+1) = \alpha_1(n) \]

\[ \alpha_2(n+1) = \alpha_2(n) \]

\[ z(n) = \alpha_2(n)\, x(n) + v(n) \]
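To see why the augmented problem is nonlinear even though the original model is linear, it can help to write the example out as functions. This is only a sketch: the tuple layout (x, α_1, α_2) and the function names are choices made here, and the Jacobians are included because any of the linearized filters of Part IV would need them.

```python
def augmented_step(x_bar, w, u):
    """Augmented dynamics of the example: x_bar = (x, alpha1, alpha2)."""
    x, alpha1, alpha2 = x_bar
    # The product alpha1 * x of two state components makes the model nonlinear.
    return (alpha1 * x + w + u, alpha1, alpha2)

def augmented_measurement(x_bar, v):
    """Augmented observation z = alpha2 * x + v."""
    x, alpha1, alpha2 = x_bar
    return alpha2 * x + v

def augmented_jacobians(x_bar):
    """First derivatives of the augmented dynamics and observation."""
    x, alpha1, alpha2 = x_bar
    f_jac = [[alpha1, x, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
    h_jac = [alpha2, 0.0, x]
    return f_jac, h_jac

x_bar = (2.0, 0.5, 3.0)
next_x_bar = augmented_step(x_bar, w=0.0, u=1.0)
z = augmented_measurement(x_bar, v=0.0)
f_jac, h_jac = augmented_jacobians(x_bar)
```

The entries `x` in `f_jac` and `h_jac` show that the linearization depends on the current state, so the gains of a linearized filter cannot be precomputed.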

This state augmentation approach can be extended easily to the case of time-varying system parameters α(n) which are modeled as stochastic or unknown-but-bounded processes. For example, if α(n) is modeled as a stochastic process, (14.2.1) might be replaced by


x(n) =



+ I) =





+ Dwy(n)

where Θ, D, and L are specified matrices and w_γ(n) is a white stochastic process with specified covariance matrix.

The problem of using the z(1) ··· [and knowledge of the u(n), n = 1, ...] to identify α [or α(n)] can now be viewed as a nonlinear state estimation problem, and the discussions and concepts of Part IV can be applied. Thus in one sense, no further discussion of system identification is required, as the system identification problem has been transformed into a problem which has already been discussed extensively. The discussions are not terminated at this point for the simple reason that Part IV did not provide any one "best" way to solve a nonlinear state estimation problem. A major conclusion of Part IV was that the best way to proceed depends heavily on the explicit nature of the problem. System identification leads to very special types of nonlinear estimation problems, so specialized discussions are needed. In the sections to follow, the state augmentation approach is not emphasized, as the author feels that it is much more appropriate to approach the system identification problem directly. However, there are special cases where state augmentation works very well.

EXAMPLE: Consider x_1(n

+ 1)


x 1 (n)

z(n) = x 1 (n) u(n);



+ w(n) + bu(n)


1 ... : known input

b is only unknown (a = b)

Define x(n

+ 1) =

[Xt(n)J x (n)

x 2 (n

+ 1) =

x 2 (n)




so x(n

+ 1) =


u~)Jx(n) + [w~n)J

The desired estimator follows from Chapter 6. However in such special cases, the techniques of the following sections also work very well.




14.3 Maximum Likelihood: Gaussian Model, General Case

Consider a stochastic uncertainty version of (14.1.1), so z(n), n = 1, ..., is a stochastic process which is completely specified except for the values of the α vector. Such a problem formulation leads directly to the use of maximum likelihood concepts to try to find an estimate of α (see Section 11.2). To be more specific, define

\[ z_N = \begin{bmatrix} z(1) \\ \vdots \\ z(N) \end{bmatrix} \]

p(z_N: α): probability density of z_N for given α
ξ(N: α) = ln p(z_N: α): log likelihood function
α̂(N) = value of α which maximizes ξ(N: α) for particular z_N

Thus α̂(N) is the maximum likelihood estimate of the unknown system parameter α given the observations z(1) ··· z(N). Note that it is being tacitly assumed that there are no structural errors; i.e., there actually exists some unknown α for which the model corresponds exactly to z(n), n = 1, .... From (14.1.1) the assumed model is

\[ x(n+1) = \Phi\, x(n) + G\, w(n) + B\, u(n) \tag{14.3.1} \]

\[ z(n) = H x(n) + v(n) \]

x: K_1 vector; z: K_2 vector; w: K_3 vector; u: K_4 vector
x(0), w(n), v(n): zero mean independent Gaussian

\[ E\{x(0)\, x'(0)\} = \Psi \]

α: vector of unknown parameters (elements) of Φ, B, G, H, Q, R

The degree of the model (dimension of x) is for the moment assumed to be known (see Section 14.3.5). The equation for the log likelihood function ξ(N: α) can be written in different ways. The "filtering form" and "smoothing form" will be discussed.



14.3.1 Filtering Form of Likelihood Function

The filtering form of the equation for ξ(N: α) has already been derived in Section 10.1.1, but for ease of reference the derivation is outlined here (see Section 10.1.1 for more details). Using Bayes' rule

\[ \xi(N : \alpha) = \xi(N-1 : \alpha) + \ln p\{ z(N) \mid z_{N-1} : \alpha \} \]

Define

ẑ(n|n − 1: α): conditional expectation of z(n) given z(1) ··· z(n − 1) for specified α

\[ \delta_z(n : \alpha) = z(n) - \hat{z}(n \mid n-1 : \alpha) \]

\[ \Sigma_z(n \mid n-1 : \alpha) = E\{ \delta_z(n : \alpha)\, \delta_z'(n : \alpha) \} \]

where the expectation is taken assuming that z(1) ... is obtained from a system with specified α.

Using the Gaussian assumption

\[ 2\xi(N : \alpha) = \xi_{bias}(N : \alpha) + \xi_{observation}(N : \alpha) \tag{14.3.2} \]

\[ \xi_{bias}(N : \alpha) = -N K_2 \ln(2\pi) - \sum_{n=1}^{N} \ln |\Sigma_z(n \mid n-1 : \alpha)| \]

\[ \xi_{observation}(N : \alpha) = -\sum_{n=1}^{N} \delta_z'(n : \alpha)\; \Sigma_z^{-1}(n \mid n-1 : \alpha)\; \delta_z(n : \alpha) \]

The equations for ẑ and Σ_z are derived in Chapter 6. Expressing (6.2.7) and (6.3.1) in the present notation and including the effect of the input Bu(n) gives

\[ \hat{x}(n+1 \mid n+1 : \alpha) = \Phi\, \hat{x}(n \mid n : \alpha) + B u(n) + \Sigma(n+1 \mid n : \alpha)\, H'\, \Sigma_z^{-1}(n+1 \mid n : \alpha)\; \delta_z(n+1 : \alpha) \]

\[ \delta_z(n+1 : \alpha) = z(n+1) - \hat{z}(n+1 \mid n : \alpha) \]

\[ \hat{z}(n+1 \mid n : \alpha) = H[\Phi\, \hat{x}(n \mid n : \alpha) + B u(n)] \]

\[ \Sigma(n+1 \mid n+1 : \alpha) = \Sigma(n+1 \mid n : \alpha) - \Sigma(n+1 \mid n : \alpha)\, H'\, \Sigma_z^{-1}(n+1 \mid n : \alpha)\, H\, \Sigma(n+1 \mid n : \alpha) \]

\[ \Sigma(n+1 \mid n : \alpha) = \Phi\, \Sigma(n \mid n : \alpha)\, \Phi' + G Q G' \]

\[ \Sigma_z(n+1 \mid n : \alpha) = R + H\, \Sigma(n+1 \mid n : \alpha)\, H' \]

\[ \Sigma(0 \mid 0) = \Psi, \qquad \hat{x}(0 \mid 0) = 0 \]

(14.3.2) (continued)
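A minimal sketch of how the filtering form might be evaluated for a given α is shown below. It is restricted, as an assumption for brevity, to a scalar state and scalar observation (K_1 = K_2 = 1); the function name `two_log_likelihood` and the argument layout are hypothetical, and the list `z` is taken to hold z(1) ... z(N) while `u` holds u(0) ... u(N − 1).

```python
import math

def two_log_likelihood(z, u, phi, b, h, g, q, r, psi):
    """Return 2*xi(N: alpha) for a scalar model via the filter recursion.

    z[k] is z(k+1) and u[k] is u(k); alpha is the tuple of scalar
    parameters (phi, b, h, g, q, r); psi is the variance of x(0).
    """
    x_hat, sigma = 0.0, psi          # x(0|0) = 0, Sigma(0|0) = Psi
    total = 0.0
    for z_next, u_n in zip(z, u):
        # One-step prediction of the state and of the observation.
        x_pred = phi * x_hat + b * u_n
        sigma_pred = phi * sigma * phi + g * q * g
        sigma_z = r + h * sigma_pred * h
        delta = z_next - h * x_pred
        # Bias and observation terms of 2*xi for this time step.
        total += -math.log(2.0 * math.pi) - math.log(sigma_z) - delta ** 2 / sigma_z
        # Measurement update.
        gain = sigma_pred * h / sigma_z
        x_hat = x_pred + gain * delta
        sigma = sigma_pred - gain * h * sigma_pred
    return total

value = two_log_likelihood([2.0, 0.0], [0.0, 0.0],
                           phi=0.0, b=0.0, h=1.0, g=1.0, q=1.0, r=1.0, psi=0.0)
```

A maximum likelihood estimate could then be sought by evaluating this function over candidate values of α with any numerical search; the choice of search method is outside what the text specifies.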



The dependence of the Φ, B, H, Q, and R matrices on α is not explicitly indicated.

14.3.2 Smoothing Form for Likelihood Function

Equation (14.3.2) is called the filtering form because it uses the filter estimate x̂(n|n: α). The smoothing form for ξ(N: α) involves the smoothed estimate x̂(n|N: α). The smoothing form for the log likelihood function ξ(N: α) is given by

\[ 2\xi(N : \alpha) = \xi_{bias}(N : \alpha) + \xi_{observation}(N : \alpha) \]

ξ_bias(N: α): as in (14.3.2)

\[ \xi_{observation}(N : \alpha) = -\sum_{n=1}^{N} [z(n) - H \hat{x}(n \mid N : \alpha)]'\, R^{-1}\, [z(n) - H \hat{x}(n \mid N : \alpha)] - \sum_{n} [\hat{x}(n \mid N : \alpha) \]



- «, B.

Since there is no input or initial condition uncertainty, it follows that

\[ \hat{z}(n \mid n-1) = H x(n) \]

\[ x(n+1) = \Phi\, x(n) + B u(n), \qquad x(0) = 0 \]

\[ \delta_z(n : \alpha) = z(n) - H x(n) \]

\[ \Sigma_z(n \mid n-1 : \alpha) = R \]

Thus (14.3.2) becomes

\[ 2\xi(N : \alpha) = -N K_2 \ln(2\pi) - N \ln |R| - \mathrm{tr}\left\{ R^{-1} \sum_{n=1}^{N} \delta_z(n : \alpha)\, \delta_z'(n : \alpha) \right\} \]

Following the steps used in Section 14.4.1 yields

\[ \hat{R} = \frac{1}{N} \sum_{n=1}^{N} [z(n) - \hat{H} \hat{x}(n)][z(n) - \hat{H} \hat{x}(n)]' \tag{14.4.4} \]

\[ \hat{x}(n+1) = \hat{\Phi}\, \hat{x}(n) + \hat{B} u(n), \qquad \hat{x}(0) = 0 \]

Φ̂, B̂, and Ĥ are values which minimize

\[ \xi = \left| \sum_{n=1}^{N} [z(n) - H x(n)][z(n) - H x(n)]' \right| \]

Like (14.4.1), the result (14.4.4) is not overly surprising. Physically, it merely says to find the system parameter values (Φ̂, B̂, Ĥ) which yield an output Ĥx̂(n) which is closest to z(n) when the input is u(n). (See also Section 14.5.1.) Equation (14.4.4) assumes that R is unknown. In the case where R is known, (14.4.4) becomes

Φ̂, B̂, and Ĥ minimize

\[ \xi = \sum_{n=1}^{N} [z(n) - H x(n)]'\, R^{-1}\, [z(n) - H x(n)] \]
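The known-R criterion can be sketched for a scalar model as follows. Everything here is illustrative rather than prescribed by the text: the state is scalar, H is treated as known, only Φ and B are searched, and a coarse grid search stands in for whatever minimization method is actually used. The function names are hypothetical.

```python
def simulate_output(u, phi, b, h):
    """Noise-free output H x(n), n = 1 ... N, with x(0) = 0 and u = [u(0), ...]."""
    x, outputs = 0.0, []
    for u_n in u:
        x = phi * x + b * u_n
        outputs.append(h * x)
    return outputs

def output_error_cost(z, u, phi, b, h, r):
    """Weighted squared output error: sum over n of [z(n) - H x(n)]^2 / R."""
    model = simulate_output(u, phi, b, h)
    return sum((z_n - y_n) ** 2 / r for z_n, y_n in zip(z, model))

def grid_search(z, u, h, r, phi_grid, b_grid):
    """Return the (phi, b) pair on the grid with the smallest output error."""
    candidates = ((p, b) for p in phi_grid for b in b_grid)
    return min(candidates, key=lambda pb: output_error_cost(z, u, pb[0], pb[1], h, r))

u = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
z = simulate_output(u, phi=0.5, b=2.0, h=1.0)   # synthetic, noise-free data
best = grid_search(z, u, h=1.0, r=1.0,
                   phi_grid=[i / 10.0 for i in range(10)],
                   b_grid=[i / 2.0 for i in range(9)])
```

With noise-free synthetic data the search recovers the generating values exactly; with real observations the minimum would only be approximate and a finer search would be needed.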


Maximum Likelihood: Gaussian Model, Special Cases


14.4.4 Moving Average Model: No Input Uncertainty†

Consider the following moving average model with no input uncertainty

\[ y(n) = \sum_{m=0}^{M} a_m\, u(n-m) \]

\[ z(n) = y(n) + v(n) \]

This model is analogous to that of Section 14.4.3 except that a different dynamic system model is considered. The desired results follow by noting that the derivation of the log likelihood function ξ(N: α) in Section 14.3 did not assume a state space representation for the dynamic system. Thus the basic form of ξ(N: α) of (14.3.2) still applies. For the present case

\[ \hat{z}(n \mid n-1) = \sum_{m=0}^{M} a_m\, u(n-m) \]

\[ \delta_z(n : \alpha) = z(n) - \sum_{m=0}^{M} a_m\, u(n-m) \]

\[ \Sigma_z(n \mid n-1 : \alpha) = R \]

\[ 2\xi(N : \alpha) = -N \ln(2\pi) - N \ln R - \sum_{n=1}^{N} \frac{\delta_z^2(n : \alpha)}{R} \]

Thus

\[ \frac{\partial \xi(N : \alpha)}{\partial a_m} = \frac{1}{R} \sum_{n=1}^{N} u(n-m) \left[ z(n) - \sum_{k=0}^{M} a_k\, u(n-k) \right] \]

Setting these derivatives to zero yields a system of M + 1 linear equations in the a_m, m = 0 ··· M, which can be easily solved. A simple equation for the estimate of R also results.

14.4.5 Time Invariant Estimator

In all of the preceding special cases, the estimate ẑ(n|n − 1) was obvious by inspection and the estimation theory of Chapter 6 did not have to be used. A special case involving the estimators of Chapter 6 is now considered. The basic model (14.3.1) is a time invariant system; i.e., Φ, G, B, H, Q, and R are constant. However, the corresponding filter (estimator) may be time varying because of the finite observation interval (and the ideas of Section 14.3 generalize easily to time varying Φ, etc.). The special case to be considered now assumes that:

1. The basic model is such that the filter becomes a time invariant system for long observation intervals. (See Section 6.3.5.)
2. The observation interval is long enough so that the filter can be considered to be time invariant for the whole time interval.

Note that there is no assumption that x(n) is a stationary process. Under the above assumptions, (14.3.2) can be written as follows (see (6.3.6))

\[ 2\xi(N : \alpha) = \xi_{bias}(N : \alpha) + \xi_{observation}(N : \alpha) \tag{14.4.6} \]

\[ \xi_{bias}(N : \alpha) = -N K_2 \ln(2\pi) - N \ln |\Sigma_z| \]

\[ \xi_{observation}(N : \alpha) = -\sum_{n=1}^{N} \delta_z'(n : \alpha)\; \Sigma_z^{-1}\; \delta_z(n : \alpha), \qquad \delta_z(n : \alpha) = z(n) - H \hat{x}(n \mid n-1) \]

\[ \hat{x}(n+1 \mid n) = \Phi\, \hat{x}(n \mid n-1) + B u(n) + \mathcal{K}\, [z(n) - H \hat{x}(n \mid n-1)] \]

\[ \Sigma_\infty = \Sigma_\infty(1 \mid 0) - \Sigma_\infty(1 \mid 0)\, H'\, \Sigma_z^{-1}\, H\, \Sigma_\infty(1 \mid 0) \]

\[ \Sigma_\infty(1 \mid 0) = \Phi\, \Sigma_\infty\, \Phi' + G Q G' \]

\[ \Sigma_z = R + H\, \Sigma_\infty(1 \mid 0)\, H' \]

\[ \mathcal{K} = \Phi\, \Sigma_\infty\, H' R^{-1} \]

Note that the equations are written in terms of x̂(n + 1|n), rather than x̂(n|n). The vector of unknown parameters α, as defined in Section 14.1, contains the unknown values of the elements of Φ, H, B, R, G, and Q. However, from (14.4.6) it is seen that for this special case ξ(N: α) depends only on Φ, H, B, Σ_z, and 𝒦. Therefore define

a: unknown parameters in Φ, H, B, 𝒦

The estimation problem can now be stated as follows: Find the values of a, Σ_z which maximize ξ(N), where all the elements of 𝒦 and Σ_z are assumed to be unknown. The maximization of ξ(N) of (14.4.6) with respect to Σ_z can now be performed directly to yield (14.4.8)

a minimizes A




After manipulation (see Appendix C)

a¢(N:&)= -~ "''( .->""'-•c-)az(nln-1:&) L.J"',n.Q(; ...., a a-« n=l a~where difference equations for evaluating cata&)z(n 1 n - 1: &) follow easily from (14.4.6). A suggested iterative algorithm is (see Sec. It. 7) -

_·- B(.I o;,

o;l+ I -

B(i +I)=

+ I)a¢'(N:&)I aex


{1"1 [az(nl~; l:&)J l:;' (&)az(nl~;

1:&)} -•


14 4 9 · · )


In a general sense, a change of variables has been made so that the values of all the elements of 𝒦 and Σ_z are being estimated instead of the values of the unknown parameters in R and GQG'. Let k_5 and k_6 denote the number of unknown parameters in R, GQG' and in Σ_z, 𝒦, respectively. When k_5 > k_6, then R, GQG' are not completely identifiable (see Section 14.1.6). In such a case, consideration of Σ_z and 𝒦 as the unknowns has many advantages. When k_5 = k_6 it may be possible to use the estimates of 𝒦 and Σ_z to solve for unique estimates of the unknowns in R and GQG'. If k_5 < k_6, then the "change of variables" obviously introduces extra unknowns and its validity is questionable in a strict mathematical sense.

14.4.6 Discussion

The preceding discussions are not intended to tabulate all the special cases for which "nice" results can be obtained. Various extended versions of the special cases also yield simple results. On the other hand, certain apparently minor variations on the special cases can radically change the nature of the problem. In some special cases, "nice" algorithms also result when the minimization of ξ(N: α) is viewed as the minimization of ξ(N: α, x̂(n|n), n = 1, ..., N) or ξ(N: α, x̂(n|N), n = 1, ..., N) subject to the constraint that x̂(n|n) or x̂(n|N) satisfy the necessary filtering or smoothing equations, and the constraints are handled by introducing Lagrange multipliers (or costate variables or adjoint variables). The details of this approach are closely related to optimal control theory concepts.


14.5 Maximum Likelihood: Error Analysis†

In Sections 14.3 and 14.4, the evaluation of α̂(N), the maximum likelihood estimate of the unknown model parameters α from the observations z(1) ... z(N), was discussed. The corresponding error analysis question is now considered; i.e., the difference between α̂(N) and the true value of α is considered. The discussions apply to the model of (14.3.1) where the observations are assumed to be exactly represented by (14.3.1) for some value of α and specified model degree. The errors in the estimate α̂(N) arise only from the uncertainty in x(0), w(n), and v(n). It is important not to confuse the following error analyses with the testing step of the three step (hypothesize-estimate-test) approach to system identification.

In Chapter 12, error analysis for Fisher models of the form

\[ z = h(x, v) \]

x: completely unknown
v: random vector

was discussed. The system identification problem of (14.3.1) can be viewed as

\[ z_N = h_N[\alpha;\, v_N], \qquad z_N = \begin{bmatrix} z(1) \\ \vdots \\ z(N) \end{bmatrix} \]

v_N: contains x(0), w(n), v(n), n = 1, ..., N
α: completely unknown

Hence the concepts of Chapter 12 can be applied directly in the present case. In Chapter 12, a key concept is the ambiguity function (see Section 12.4), which is the average value of the log likelihood function

\[ \gamma(x_t, x) = \int [\ln p(z : x)]\; p(z : x_t)\, dz \]

The ambiguity function provides a global error analysis (i.e., an indication of whether multiple peak log likelihood functions are to be expected). Consideration of its curvature,

\[ \frac{\partial^2 \gamma(x_t : x)}{\partial x^2} \bigg|_{x = x_t} \]

enables one to evaluate a lower bound on the parameter estimate error covariance matrix (where the lower bound corresponds to that obtained from a linearized error analysis). Since such uses are discussed in Section 12.4, only the system identification ambiguity function itself is discussed here. In system identification, the vector of unknowns is α. Thus after including the N dependence in the notation, the ambiguity function is γ(N: α_t, α)

\[ \gamma(N : \alpha_t, \alpha) = \int \{\ln [p(z_N : \alpha)]\}\; p(z_N : \alpha_t)\, dz_N \]

or in terms of ξ(N: α)

\[ \gamma(N : \alpha_t, \alpha) = \int \xi(N : \alpha)\; p(z_N : \alpha_t)\, dz_N \tag{14.5.1} \]



or in still simpler notation

\[ \gamma(N : \alpha_t, \alpha) = E\{\xi(N : \alpha)\} \]

As noted in Section 12.4, the ambiguity function is closely related to the divergence J of hypothesis testing. Thus for the Gaussian case of (14.3.1) the equation for γ(N: α_t, α) is closely related to (10.1.14). Define (by analogy with Section 10.1.4)

ℱ_{N,α}: linear system which operates on the zero mean component of z(n), n = 1, ..., N − 1, to give the zero mean component of the conditional expectation of z(N), assuming that z(1) ··· z(N) is generated by (14.3.1) with given value of α

m_α(n): mean value of z(n) as obtained from (14.3.1) with given value of α; m_α(N) depends on u(n), n = 1, ..., N − 1

\[ m_{N,\alpha} = \begin{bmatrix} m_\alpha(1) \\ \vdots \\ m_\alpha(N) \end{bmatrix} \]

\[ \delta_z(N \mid N-1) = z(N) - m_{\alpha_t}(N) - \mathcal{F}_{N,\alpha}[z_{N-1} - m_{N-1,\alpha_t}] \]

\[ d(N \mid N-1) = m_{\alpha_t}(N) - m_\alpha(N) - \mathcal{F}_{N,\alpha}[m_{N-1,\alpha_t} - m_{N-1,\alpha}] \]

Then in (14.3.2)

\[ \hat{z}(N \mid N-1 : \alpha) = m_\alpha(N) + \mathcal{F}_{N,\alpha}[z_{N-1} - m_{N-1,\alpha}] \]

\[ \delta_z(N : \alpha) = z(N) - \hat{z}(N \mid N-1 : \alpha) = \delta_z(N \mid N-1) + d(N \mid N-1) \]

Define

\[ \Sigma_{z : \alpha_t, \alpha}(N \mid N-1) = E\{ \delta_z(N \mid N-1)\, \delta_z'(N \mid N-1) \} \]

Then from (14.3.2) and (14.5.1) the ambiguity function is given by

\[ 2\gamma(N : \alpha_t, \alpha) = -N K_2 \ln(2\pi) - \sum_{n=1}^{N} \ln |\Sigma_z(n \mid n-1 : \alpha)| - \sum_{n=1}^{N} \mathrm{tr}\left\{ \Sigma_z^{-1}(n \mid n-1 : \alpha)\; E\{ \delta_z(n : \alpha)\, \delta_z'(n : \alpha) \} \right\} \]

\[ E\{ \delta_z(n : \alpha)\, \delta_z'(n : \alpha) \} = d(n \mid n-1)\, d'(n \mid n-1) + \Sigma_{z : \alpha_t, \alpha}(n \mid n-1) \]



The Σ_z(n|n

\[ x(n+1) = \Phi\, x(n) + w(n) \]

\[ z(n) = x(n), \qquad n = 1 \ldots N \]

Q: known
α: contains elements of Φ

Define Γ̂_z and Γ̂_z(1) as in (14.7.2). Assume there is no value of α (i.e., Φ) which enables the following equations to be solved.











Then, it is necessary to define some scalar criterion which measures how close a given α (i.e., Φ) comes to satisfying (14.7.3). The optimum criterion follows from (14.4.2), which yields for large N that α̂ (i.e., Φ̂) minimizes

\[ \mathrm{tr}\left\{ Q^{-1} \left[ \hat{\Gamma}_z - \Phi\, \hat{\Gamma}_z'(1) - \hat{\Gamma}_z(1)\, \Phi' + \Phi\, \hat{\Gamma}_z\, \Phi' \right] \right\} \]

This criterion is not obvious without maximum likelihood theory. Correlation techniques do not provide a natural way to "test validity" in the three steps, hypothesize-estimate-test, of system identification. This is a major disadvantage.

14.8 Instrument Variables

Another approach to system identification is the use of "instrument variables". This approach uses a hypothesized structure containing stochastic processes. The concept is easiest to understand when the hypothesized model is expressed in the autoregressive-moving average form (see Section 14.1.7). In order to simplify the notation, only the following special case of (14.3.1) is considered.

\[ y(n) = \phi_1\, y(n-1) + \phi_2\, y(n-2) + f(n) \]

\[ z(n) = y(n) \]

f(n): moving average process

\[ f(n) = G_1\, w(n) + G_2\, w(n-1) \]

The values of φ_1, φ_2, G_1, and G_2 are unknown (which means Q can be assumed to be known). However, it is assumed that it is desired to estimate only φ_1 and φ_2. Let i_1(n) and i_2(n), n = 1, ..., denote two time functions, the "instrument variables". Consider the following equations

\[ A_1(0) = \phi_1 A_1(1) + \phi_2 A_1(2) + B_1 \tag{14.8.1} \]

\[ A_2(0) = \phi_1 A_2(1) + \phi_2 A_2(2) + B_2 \]

\[ A_j(k) = \sum_{n=1}^{N} i_j(n)\, z(n-k) \]

In (14.8.1), the A_j(k) are calculable quantities while the B_j are unknowns. Now suppose that the instrument variables i_j(n) are chosen such that

1. i_j(n) are correlated with z(n)
2. i_j(n) are uncorrelated with f(n)

(14.8.2)

If the conditions (14.8.2) are satisfied, it is reasonable to expect that for large N, the B_j terms in (14.8.1) can be ignored. Thus estimates φ̂_1 and φ̂_2 can be defined by

\[ A_1(0) = \hat{\phi}_1 A_1(1) + \hat{\phi}_2 A_1(2) \]

\[ A_2(0) = \hat{\phi}_1 A_2(1) + \hat{\phi}_2 A_2(2) \]

A 2 (0) = I for any n = 1 ... N where t5 2 (n) is defined in (6.6.4). .

14.10 Choice of Input†

The basic model (14.1.1) contains a known input u(n), n = 0, .... The choice of this input can have a major effect on how well the system identification can be performed. For example, assume that x(0) = 0 and that w(n) = 0. Obviously u(n) cannot be zero if any system identification is to be accomplished. It is also felt to be relatively obvious that a u(n), n = 0, ..., with large rapid variations should provide a better system identification than a low-amplitude, constant u(n). Therefore an interesting question is, What is the best choice of u(n), n = 0, ..., when the possible u(n) are constrained in some fashion?

In general this question is very difficult to answer, as the best u(n) depends on both the structure of Φ, B, etc., and their true but unknown values. For example, if Φ corresponds to a discrete-time oscillator of frequency ω, a good u(n) would have "a lot of energy at" the frequency ω. However, if ω is not known, it is difficult to design the u(n) before the system is identified.




If values of the system parameters are known approximately a priori, an approach to designing u(n) can be formulated. Let $\alpha_{nom}$ denote the nominal value of the unknown α. Assume that x(0), w(n), and v(n) are stochastic processes. Then using the Cramer-Rao inequality it is conceptually possible to evaluate a lower bound on the error variance of a maximum likelihood estimate (see Section 14.5) where this bound depends on u(n), n = 0, ..., N − 1, and on $\alpha_{nom}$. The determination of the u(n), n = 0, ..., which minimizes this bound can be viewed as a signal design problem which is related to the discussions of Sections 8.3 and 10.9.5. Unfortunately, the optimum input obtained this way may be sensitive to the assumed nominal values, $\alpha_{nom}$, and thus be of limited practical importance.
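The dependence of the Cramer-Rao bound on the input can be illustrated on a deliberately small case. The sketch below is an assumption-laden toy, not the general procedure: it takes a scalar model x(n+1) = a x(n) + b u(n) with x(0) = 0, no process noise, observations z(n) = x(n) + v(n) with white Gaussian v(n) of variance r, and a single unknown parameter a evaluated at a nominal value. The function name `cramer_rao_bound` and all numerical values are hypothetical.

```python
import math


def cramer_rao_bound(u, a_nom, b=1.0, r=1.0):
    """Lower bound on the error variance of an unbiased estimate of a.

    The Fisher information is the sum of squared sensitivities
    s(n) = dx(n)/da divided by the observation noise variance r.
    """
    x = 0.0      # state x(n), starting from x(0) = 0
    s = 0.0      # sensitivity dx(n)/da, starting from zero
    info = 0.0
    for u_n in u:
        # s(n+1) = x(n) + a s(n) must use x(n) before x is advanced
        s = x + a_nom * s
        x = a_nom * x + b * u_n
        info += s * s / r   # contribution of observation z(n+1)
    return math.inf if info == 0.0 else 1.0 / info


N = 50
bound_zero = cramer_rao_bound([0.0] * N, a_nom=0.9)
bound_small = cramer_rao_bound([0.1] * N, a_nom=0.9)
bound_large = cramer_rao_bound([1.0] * N, a_nom=0.9)
print(bound_zero, bound_small, bound_large)
```

A zero input gives no information at all (an infinite bound), and for this linear toy model scaling the input amplitude by ten lowers the bound by a factor of one hundred. Comparing different input shapes under the same amplitude constraint with this function only gives an answer for the assumed nominal value, which is the sensitivity problem noted in the text.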


Model Reduction†

The following discussion constitutes a change in direction toward a problem which is not really system identification but which can be approached using the ideas of system identification. Consider a system of equations

$$x(n+1) = \Phi x(n) + B u(n)$$

$$y(n) = H x(n)$$

x(n): $K_1$ vector

u(n), y(n): scalars

[x(n), w(n), u(n)]

z(n) = h[x(n), v(n)]

α: vector of unknown parameter values

Three ways to tackle such a problem are

1. Use equations of Section 13.2 to get a likelihood function which is then maximized over α.
2. Combine the linearized recursive filters of Section 13.10 with the maximum likelihood results of Sections 14.3 and 14.4.
3. Extend the model reference ideas of Section 14.6 to nonlinear structures in any reasonable way.

The first approach is general but impractical except in very special cases. The second approach can be very effective provided that the linearization is valid (small errors). The effectiveness of the third approach of course depends on the "reasonableness" of the engineer. As expected, the second and third approaches can yield the same results. As noted earlier in this chapter, a linear structure may yield acceptable results even if the basic system is nonlinear (but not "too" nonlinear). If Q is identified, its "size" may effectively compensate for the unmodeled nonlinearities.



14.13 Discussion

Only system identification techniques related to the concepts of earlier chapters were discussed. However, the techniques that were discussed are useful in an extremely wide variety of problems. The basic three-step procedure of system identification discussed here is:

1. Hypothesize mathematical structure with unknown parameters.
2. Estimate numerical values of α.
3. Test validity of model.

where many "iterations" with new hypothesized structures may be required. The first step, hypothesize, is up to the reader. The discussions of this chapter have concentrated on the second step, estimate. However, the third step, test, is equally important and must never be ignored. In Sections 14.2-14.8, five different approaches (state augmentation, maximum likelihood, model reference, correlation, and instrument variables) to the problem of estimating the unknown parameters (α) were discussed. All five are applicable to stochastic type input and observation uncertainties. Figure 14.13.1 summarizes the author's opinions on their usefulness.

State Augmentation

General approach but not tailored to specific needs of system identification. Not recommended except as possible method of tracking small time variations in system parameter values.

Maximum Likelihood: Gaussian case

Most powerful general approach. Disadvantage is possible computation requirements. Gaussian assumption can usually be ignored.

Model Reference: Weighted Least Squares

Conceptually simple. Choice of criterion can be difficult. Yields equations similar to those of maximum likelihood.

Correlation Methods

Conceptually simple. Choice of criterion can be difficult. Can yield easy to implement equations.

Instrument Variables

Yields easy to implement equations. Choice of instrument variables can be difficult.

Figure 14.13.1 Comparison of parameter estimation techniques for system identification.

When tackling a system identification problem, the author feels the following approach is widely applicable.

1. Make a Gaussian assumption and formulate a maximum likelihood approach. Implement it if possible (even if the Gaussian assumption is not valid).
2. If maximum likelihood is implemented, consider presenting it to others (in reports and presentations) in terms of model reference concepts. This makes the approach more understandable to those without specialized training.
3. If the computation requirements of maximum likelihood are unacceptable, consider correlation or instrument variable concepts or some simple model reference criterion which is hopefully related to the maximum likelihood equations.

Historical Notes and References: Chapter 14

System identification has been approached in many ways by many authors, and no attempt will be made to provide even a partially complete discussion of past work. However, the work of K. Astrom and his associates on maximum likelihood techniques deserves special mention as they pioneered theoretical developments and, even more important, actual applications. Many of the ideas of this chapter arose from discussions with F. Galiana, R. Moore, and R. Masiello. See Moore and Schweppe [1]; Galiana and Schweppe [1]; DeVille and Schweppe [1] for application to electric power system problems. The discussions of Section 14.3.3 on using the "smoothing form" were motivated by Bar-Shalom [1]. There is a close tie between system identification and econometric theory as in Wonnacott and Wonnacott [1]. The excellent survey paper by Astrom and Eykhoff [1] contains an extensive bibliography.

Exercises 14.1. Do Exercise 11.8. 14.2. Consider (14.4.3). Let b(N) and {J(N) denote the estimates with the dependence on the number of observations N explicitly indicated. Write a set of

nonlinear difference equations which expressb(N + 1), G(N + 1) in terms of b(N), O(N), z(N + 1) and other variables which also satisfy difference equations.

14.3. Consider the Bayesian, Gaussian model x(n





z(n) = Hx(n) x(O)



+ Bu(n) + v(n)



u(n): known

E[v(n)] = 0 E[v(n 1 )v(n 2 )]

= {:

Assume and R are known. Write explicit equations for the maximum likelihood estimate of a. H assuming that B is known b. B assuming that His known. 14.4. Consider the zero mean Bayesian model x(n


1) = x(n)

z(n) ='= x(n)

+ w(n)

n = l. .. N

+ v(n)

x(O) = 0 E[w(n 1 )w(n 2 )} = { ;

E[v(n 1 )v(n 2 )}



Assume N is large. a. Consider a model reference approach where

oc =

« and

& minimizes



= 2.:

[z(n) - ,

«,: true value of «P Hint: Since z(n) = ,z(n -


+ w(n

- 1)

+ v(n)


,v(n -


then N

ell, -


ell, L; v 2 (n - 1)

/!t 1

+ other terms

L; z 2 (n - 1) n=l

b. Consider an instrument. variable approach. Choose i(n) Is the resulting & biased? c. Consider a correlation approach. Define d',(m)




z(n -



N _ m n~n z(n)z(n -


Assume R and Q are known and write in terms of

JO, when good information becomes available.

This is not a counterexample in the mathematical sense to the separation theorem (version II in Fig. 15.4.1); it merely says that the criterion is such that the optimum control (in a mathematical sense) is not always a good control (in an intuitive sense). It should also be stated that this Section 15.3.3 example is pathological. It is discussed only because of the author's premise that nothing is perfect.

15.6 Dual Control†

Consider a nonlinear model such as (15.1). Assume that it is desired to obtain a state estimate, $\hat{x}(N|N)$. The nature of the input u(n) or $u(n: z_n)$, n = 0, ..., N − 1, can affect the accuracy of $\hat{x}(N|N)$. Hence it is possible to formulate a stochastic control problem where the criterion is to find the controls (inputs) which yield the best estimate. Note that for a linear model, the controls (inputs) do not affect the accuracy of the state estimate. Now consider a nonlinear model where it is desired to find the controls which minimize some general cost such as c of (15.1.2). In general terms, the cost will be reduced if the accuracy of the state estimates is increased. Thus in some applications, the optimum control can be viewed as a compromise between two conflicting desires: the desire to get an accurate state estimate and the desire to minimize the cost. This dual nature sometimes results in the use of the term dual control problem. The dual nature of the control action can become especially evident in the case of a system with no input uncertainty such as

$$x(n+1) = \phi[x(n)] + B u(n)$$

$$z(n) = h[x(n)] + v(n)$$

x(0): uncertain

and of a criterion of minimizing a cost which depends only on x(N) such as

$$c = E\{c_1[x(N)]\}$$

subject to a peak amplitude constraint on the control such as

$$u(n: z_n) \in \Omega_u$$

If N is sufficiently large, the initial controls $u(n: z_n)$, n = 0, ..., $N_0$, may be devoted primarily to improving the accuracy of the state estimate $\hat{x}(N_0|N_0)$, while the later controls $u(n: z_n)$, n = $N_0$ + 1, ..., N − 1, are devoted to trying to minimize $E\{c_1[x(N)]\}$. The dual nature of the control problem is often just a way of interpreting results and viewing the problem. However, a reasonable approach is sometimes to choose some $N_0$ and divide the design into two parts. Design $u(n: z_n)$, n = 0, ..., $N_0$, to yield the best state estimate and then design $u(n: z_n)$, n = $N_0$ + 1, ..., N − 1, to minimize the cost.

15.7 Adaptive Control†

In many control problems, various parameters of the system model may be uncertain. For example, in a linear model such as (15.5.1), the $N_0$, may be devoted primarily to yielding the best $\hat{\alpha}(N_0|N_0)$, while the $u(n: z_n)$, n = $N_0$ + 1, ..., N − 1, may concentrate on minimizing the cost. In practice, a reasonable control logic may choose u(n) or $u(n: z_n)$, n = 0, ..., $N_0$, solely on the basis of providing the best identification, although the discussions of Section 14.10 indicate the difficulties associated with this task. Chapter 14 discusses the problem of estimating α from a more basic point of view.

15.8


This chapter is devoted to stochastic models but it is appropriate to briefly discuss the problems of control when unknown-but-bounded models are used instead. For an unknown-but-bounded model, the range of possible criteria differs greatly from the ideas of Section 15.1. For example, suppose it is desired to make y(n) = Cx(n) follow some predetermined trajectory d(n), n = 1, ..., N. For a stochastic (Bayesian) model one might try to minimize

$$c = E\left\{\sum_{n=1}^{N} [y(n) - d(n)]' C_1 [y(n) - d(n)]\right\}$$

However, for an unknown-but-bounded model, one might define

$$\Omega_\delta(n) = \{\delta(n): \delta(n) = y(n) - d(n)\}$$

and then try to minimize the maximum "size" of $\Omega_\delta(n)$, n = 1, ..., N. Alternately one might be given a target tube which is the set of allowable deviations δ(n) = y(n) − d(n) and then choose the minimum "energy" or "fuel" control which keeps the actual deviation within the tube. In many applications, the criteria possible with unknown-but-bounded models are actually more natural than those of stochastic models. As should be expected, the mathematics associated with developing controls for unknown-but-bounded models is not like that of stochastic control. However, the basic concepts of open versus closed loop still apply and reasonable closed-loop controls can be developed using the basic ideas of Sections 15.3.1 and 15.3.2. Also as should be expected, truly optimum closed-loop controls for unknown-but-bounded models are like those for stochastic models in that they are in general very difficult to compute.

15.9


One important point of the preceding discussions is that there are many different stochastic control problems. When reading the literature on stochastic control, it is very important to study the problem statement carefully to decide exactly which problem is being considered and exactly what the controls depend on. Another important point is that although optimum stochastic control problems can be formulated with relative ease, the computation of the optimum control is usually very difficult. Successful controls for stochastic problems usually result from the designer's ability to use his engineering judgment to develop reasonable controls that work satisfactorily. At best, the mathematical optimization theory usually provides only a guide to how to proceed. It is the author's opinion that for most applications, attempts to tackle the complete optimum stochastic control problem are unrewarding. If the system to be controlled and the sensors which provide the observations are well designed, the uncertainties are often small enough that a state estimator followed by a reasonable zero memory controller (say, designed from a deterministic system) provides satisfactory performance which is close to optimum. If the uncertainties are large, it often appears that even an optimum control cannot yield satisfactory performance. It should be emphasized in conclusion that this chapter has definitely not discussed all aspects and approaches to stochastic control problems.

Historical Notes and References: Chapter 15

As in Chapter 14 on System Identification, the available literature is huge, and no attempt will be made to provide even a partial overview. Mendel and Gieseking [1] is a bibliography which contains many references to stochastic control. See also Book List 4 on Stochastic Systems.

Sections 15.4 and 15.5: Joseph and Tou [1] is one of the first papers on the "separation theorem." Witsenhausen [4] discusses the "separation theorem" in a very general context.

Section 15.6: See Feldbaum [1].

Section 15.8: Glover and Schweppe [1] discusses certain theoretical aspects, while Glover and Schweppe [2] discusses an application to electric power systems.




Our travels together are coming to an end. However, before our final parting it seems appropriate to briefly review the journey. A huge number of complicated-looking equations were presented. Most of them should be treated like the slides or photographs taken on a more conventional journey. They are nice to have, but one should remember only where they are kept and in general what they are about. The details should be forgotten as they can be looked up if needed. Many basic concepts were also presented. Unlike the explicit equations, the concepts should be remembered and be available on instant recall. A list of the most important concepts would include the following:

1. Methods of uncertainty modeling and the techniques for obtaining white process state space models from natural models.
2. Effect of choice of uncertainty model on analysis and interpretation of results.
3. Analysis of propagation of uncertainty through dynamic systems.
4. Tracking feedback form of filters.
5. Conversion of a filter into a decision function generator.
6. Use of hypothesis testing as a multipurpose tool.
7. Use of error (performance) analysis for estimation and hypothesis testing as a basic tool in the design of measurement and sensor systems.

The importance of engineering judgment could be added to the above list but it deserves special attention. Mathematical optimality received a lot of attention in the developments but in most applications, the mathematics must be combined with judgment both in the choice of models and the implementation of the results. Relatively sophisticated mathematics is an absolutely necessary ingredient in handling many complex problems but it is not sufficient. Unfortunately it is difficult to write a book on judgment so the author can only state and restate and re-restate that it is needed. Ideally a list of important concepts related to uncertainty not covered in this book should now be given. However, the list would be very long and would end the book on a negative note. Therefore the list will not be given. Parting is sweet sorrow. Go forth with certainty into a world of uncertainty.


These appendices contain background and supplemental material of a mathematical nature. The topics to be covered are matrix algebra, linear difference and differential equations, vector and matrix gradient functions, probability theory, Gaussian (normal) random vectors, stochastic processes, set theory, and ellipsoids .


Appendix A

Matrix Algebra


Partitioned Matrices

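Consider a K by K matrix A partitioned as follows:

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \qquad (A.1)$$

Assume that $A^{-1}$ exists and define

$$B = A^{-1}$$

where B is partitioned as in (A.1) to be

$$B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$

Then

$$B_{11} = A_{11}^{-1} + A_{11}^{-1} A_{12} B_{22} A_{21} A_{11}^{-1}$$

$$B_{12} = -A_{11}^{-1} A_{12} B_{22} \qquad (A.2)$$

$$B_{21} = -B_{22} A_{21} A_{11}^{-1}$$

where

$$B_{22}^{-1} = A_{22} - A_{21} A_{11}^{-1} A_{12}$$

This result can be proved by multiplying (A.1) by (A.2) and seeing that the unit matrix results.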


The $A^{-1} = B$ of (A.2) can obviously be rewritten in a different form by simply reordering the subscripts. Equating terms in these two expressions for B yields the following very useful matrix identity:

$$[A_{11} - A_{12} A_{22}^{-1} A_{21}]^{-1} = A_{11}^{-1} + A_{11}^{-1} A_{12} [A_{22} - A_{21} A_{11}^{-1} A_{12}]^{-1} A_{21} A_{11}^{-1}$$

with the special case

$$[I + A]^{-1} = I - [I + A^{-1}]^{-1}$$
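The matrix identity and its special case are easy to spot-check numerically. The sketch below uses only the standard library; the helper names (`matmul`, `combine`, `inverse`, `max_abs_diff`) and the particular blocks are illustrative assumptions, chosen so that every inverse involved exists.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def combine(a, b, sign=1.0):
    """Return a + sign * b, element by element."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc
                          for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Assumed example blocks: A11 is 2x2, A12 is 2x1, A21 is 1x2, A22 is 1x1.
A11 = [[4.0, 1.0], [1.0, 3.0]]
A12 = [[1.0], [2.0]]
A21 = [[1.0, 2.0]]
A22 = [[5.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]

A11_inv = inverse(A11)
# Left side: [A11 - A12 A22^-1 A21]^-1
lhs = inverse(combine(A11, matmul(matmul(A12, inverse(A22)), A21), -1.0))
# Right side: A11^-1 + A11^-1 A12 [A22 - A21 A11^-1 A12]^-1 A21 A11^-1
inner = inverse(combine(A22, matmul(matmul(A21, A11_inv), A12), -1.0))
correction = matmul(matmul(matmul(matmul(A11_inv, A12), inner), A21), A11_inv)
rhs = combine(A11_inv, correction)
identity_error = max_abs_diff(lhs, rhs)

# Special case: [I + A]^-1 = I - [I + A^-1]^-1, here with A = A11.
special_lhs = inverse(combine(I2, A11))
special_rhs = combine(I2, inverse(combine(I2, A11_inv)), -1.0)
special_error = max_abs_diff(special_lhs, special_rhs)
print(identity_error, special_error)
```

The practical appeal of the identity is visible in the block sizes: the left side inverts a 2 by 2 matrix directly, while the right side needs only the inverse of a 1 by 1 block once $A_{11}^{-1}$ is known.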


Two other useful results for partitioned matrices are

1.

2. $\operatorname{tr} A = \operatorname{tr} A_{11} + \operatorname{tr} A_{22}$

Positive Definite Matrices

In this book, the term positive definite matrix also implies that the matrix is symmetric. Let $\Gamma$ be a K by K positive definite matrix. Then

$$x'\Gamma x > 0 \qquad \text{all } x \neq 0 \qquad (A.7)$$

Let $\lambda_k$, k = 1, ..., K, denote the eigenvalues of $\Gamma$. Then

$$\lambda_k > 0 \qquad k = 1, \ldots, K \qquad (A.8)$$

Any symmetric matrix which satisfies (A.7) or (A.8) is positive definite. For a positive semidefinite matrix

$$x'\Gamma x \geq 0 \qquad \text{all } x \neq 0$$

$$\lambda_k \geq 0 \qquad k = 1, \ldots, K$$

For any K by K positive semidefinite matrix $\Gamma$, there exists a matrix C such that

$$CC' = I$$

$$C\Gamma C' = \operatorname{diag}(\lambda_1, \ldots, \lambda_K)$$

The K by K positive semidefinite matrix $\Gamma$ has rank $K_1$ if only $K_1$ of the $\lambda_k$, k = 1, ..., K, are greater than zero. In general, it is impossible to say that one matrix is larger than another matrix unless one defines some scalar measure of the matrix's "size." However, there is an important exception to this rule. Let $\Gamma_1$ and $\Gamma_2$ be two positive definite matrices. Then, by definition,

$$\Gamma_1 > \Gamma_2 \qquad (A.9)$$

means that $\Gamma_1 - \Gamma_2$ is positive definite.

Equation (A.9) implies that

$$\operatorname{tr} \Gamma_1 > \operatorname{tr} \Gamma_2$$

$$|\Gamma_1| > |\Gamma_2|$$

$$\Gamma_{1,kk} > \Gamma_{2,kk} \qquad \text{all } k$$

($\Gamma_{kk}$ is the kth main diagonal element of $\Gamma$.)

Equation (A.9) also implies that

$$C\Gamma_1 C' \geq C\Gamma_2 C'$$

for any matrix C. These ideas can be extended to the case of positive semidefinite matrices in an obvious fashion. Let $\Gamma_{1,kj}$ and $\Gamma_{2,kj}$ denote the elements of $\Gamma_1$ and $\Gamma_2$. It is important to note that $\Gamma_{1,kj} > \Gamma_{2,kj}$, all k, j, does not imply that $\Gamma_1 > \Gamma_2$.

EXAMPLE: Consider $\Gamma_1$ and $\Gamma_2$ such that $\Gamma_{1,kj} > \Gamma_{2,kj}$, all k, j, but $\Gamma_1 - \Gamma_2$ is not positive definite.

In a similar fashion, $\Gamma_1 > \Gamma_2$ does not imply that $\Gamma_{1,kj} > \Gamma_{2,kj}$, all k, j.

EXAMPLE: Consider $\Gamma_1$ and $\Gamma_2$ such that $\Gamma_1 - \Gamma_2$ is positive definite but the off-diagonal elements of $\Gamma_2$ are larger than those of $\Gamma_1$.

If (A.9) holds, then

If $\Gamma$ is partitioned,

Since

$$\operatorname{tr} \Gamma = \lambda_1 + \cdots + \lambda_K$$

$$|\Gamma| = \lambda_1 \cdots \lambda_K$$

it follows that for positive definite $\Gamma$

$$[|\Gamma|]^{1/K} \leq \frac{1}{K} \operatorname{tr} \Gamma$$
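The two cautions about comparing matrices element by element can be checked on small cases. The matrices below are hypothetical illustrations chosen for this sketch, not the ones printed in the book, and the helper names are assumptions. Positive definiteness of a symmetric 2 by 2 matrix is tested with Sylvester's criterion (positive leading element and positive determinant).

```python
def is_positive_definite_2x2(m):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return m[0][0] > 0 and m[0][0] * m[1][1] - m[0][1] * m[1][0] > 0


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def elementwise_greater(a, b):
    return all(x > y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Case 1: every element of g1 exceeds that of g2, yet g1 - g2 is indefinite.
g1 = [[3.0, 3.0], [3.0, 4.0]]
g2 = [[2.0, 0.0], [0.0, 2.0]]
case1_elementwise = elementwise_greater(g1, g2)
case1_ordered = is_positive_definite_2x2(subtract(g1, g2))

# Case 2: h1 - h2 is positive definite, yet the off-diagonal elements
# of h2 are larger than those of h1.
h1 = [[4.0, 0.0], [0.0, 4.0]]
h2 = [[2.0, 1.0], [1.0, 2.0]]
case2_elementwise = elementwise_greater(h1, h2)
case2_ordered = is_positive_definite_2x2(subtract(h1, h2))

print(case1_elementwise, case1_ordered)   # elementwise larger, not ordered
print(case2_elementwise, case2_ordered)   # ordered, not elementwise larger
```

In the second case the trace, determinant, and diagonal elements of h1 all exceed those of h2, as (A.9) requires; only the off-diagonal comparison fails.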






Linear Difference and Differential Equations

Time is considered to be the independent variable, although this is not necessary. The notational convention used is

t denotes continuous time

nΔ denotes discrete time

where n is an integer and Δ is the time between events. Thus writing

x(t) implies that x is a continuous-time process

x(nΔ) implies that x is a discrete-time process

In a similar fashion, $x(t)|_{t = n\Delta}$ is the value of the continuous-time process x(t) at t = nΔ.


Discrete Time

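Consider the linear, first-order, vector difference equation

$$x(n\Delta + \Delta) = \Phi(n\Delta) x(n\Delta) + \Delta G(n\Delta) w(n\Delta)$$

x(0): initial condition

This difference equation can be solved to give

$$x(N\Delta + \Delta) = \Theta[N\Delta + \Delta, 0] x(0) + \Delta \sum_{m=0}^{N} \Theta[N\Delta + \Delta, m\Delta + \Delta] G(m\Delta) w(m\Delta)$$

where $\Theta(N\Delta, m\Delta)$ is the transition matrix associated with $\Phi(n\Delta)$. $\Theta(n\Delta, m\Delta)$ satisfies the linear difference matrix equation

$$\Theta(n\Delta + \Delta, m\Delta) = \Phi(n\Delta) \Theta(n\Delta, m\Delta)$$

$$\Theta(m\Delta, m\Delta) = I$$

so that

$$\Theta(N\Delta, m\Delta) = \Phi(N\Delta - \Delta) \Phi(N\Delta - 2\Delta) \cdots \Phi(m\Delta)$$

$\Theta(n\Delta, m\Delta)$ has the property

$$\Theta(N\Delta, m\Delta) = \Theta(N\Delta, n\Delta) \Theta(n\Delta, m\Delta)$$

In the special case where $\Phi(n\Delta) = \Phi$, all n, $\Theta(N\Delta, m\Delta)$ depends only on N − m and can be considered to be a matrix function $\Theta[n\Delta]$ of one variable nΔ defined by

$$\Theta(n\Delta + \Delta) = \Phi \Theta(n\Delta)$$

$$\Theta(0) = I$$

In this special case

$$x(N\Delta + \Delta) = \Theta(N\Delta + \Delta) x(0) + \Delta \sum_{m=0}^{N} \Theta(N\Delta - m\Delta) G(m\Delta) w(m\Delta)$$

$\Theta(n\Delta)$ can be expressed as

$$\Theta(n\Delta) = \Phi^n$$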
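The time-invariant solution can be checked against direct recursion. This is a minimal sketch under assumed values: a 2 by 2 constant Φ, a constant vector G with a scalar input w, and an arbitrary short input sequence. The names `PHI`, `G`, `DELTA`, `step_recursively`, and `closed_form` are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def matpow(m, n):
    """Theta(n Delta) = Phi^n, built from Theta(0) = I."""
    result = [[float(i == j) for j in range(len(m))] for i in range(len(m))]
    for _ in range(n):
        result = matmul(m, result)
    return result


PHI = [[0.9, 0.1], [-0.2, 0.8]]   # assumed constant transition matrix
G = [0.0, 1.0]                    # assumed input vector (scalar w)
DELTA = 0.5                       # assumed time between events
x0 = [1.0, -1.0]
w = [1.0, -0.5, 0.25, 2.0, 0.0, -1.0]   # w(0), ..., w(N) with N = 5


def step_recursively(x_init, inputs):
    """Apply x(n+1) = Phi x(n) + Delta G w(n) once per input sample."""
    x = list(x_init)
    for w_n in inputs:
        phi_x = matvec(PHI, x)
        x = [phi_x[k] + DELTA * G[k] * w_n for k in range(len(x))]
    return x


def closed_form(x_init, inputs):
    """Theta(N+1) x(0) + Delta * sum over m of Theta(N - m) G w(m)."""
    n_last = len(inputs) - 1
    x = matvec(matpow(PHI, n_last + 1), x_init)
    for m, w_m in enumerate(inputs):
        term = matvec(matpow(PHI, n_last - m), [DELTA * g * w_m for g in G])
        x = [a + b for a, b in zip(x, term)]
    return x


x_recursive = step_recursively(x0, w)
x_closed = closed_form(x0, w)
solution_error = max(abs(a - b) for a, b in zip(x_recursive, x_closed))
print(x_recursive, x_closed, solution_error)
```

The recursion is the cheaper way to compute the state; the closed form is mainly useful for analysis, since it shows each input sample weighted by a power of Φ.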

Continuous Time

Consider the linear, first-order, vector differential equation

$$\frac{d}{dt} x(t) = F(t) x(t) + G(t) w(t)$$

x(0): initial condition.

This differential equation can be solved to give

$$x(t) = \Theta(t, t_0) x(t_0) + \int_{t_0}^{t} \Theta(t, \tau) G(\tau) w(\tau)\, d\tau$$

where $\Theta(t, t_0)$ is the fundamental or transition matrix associated with F(t). $\Theta(t, t_0)$ satisfies the linear matrix differential equation

$$\frac{d}{dt} \Theta(t, t_0) = F(t) \Theta(t, t_0)$$

$$\Theta(t_0, t_0) = I$$

$\Theta(t, t_0)$ has the properties

$$\Theta(t, t_0) = \Theta(t, t_1) \Theta(t_1, t_0)$$

$$\Theta^{-1}(t, t_0) = \Theta(t_0, t)$$

In the special case where F(t) = F, all t, $\Theta(t, t_0)$ depends only on $t - t_0$ and can be considered to be a matrix function $\Theta(t)$ of one variable, t, defined by

$$\frac{d}{dt} \Theta(t) = F \Theta(t)$$

$$\Theta(0) = I$$

In this special case

$$x(t) = \Theta(t) x(0) + \int_{0}^{t} \Theta(t - \tau) G(\tau) w(\tau)\, d\tau$$

or

$$x(t) = \Theta(t) x(0) + \Theta(t) \int_{0}^{t} \Theta^{-1}(\tau) G(\tau) w(\tau)\, d\tau$$

$\Theta(t)$ can be expressed in terms of an exponential matrix as

$$\Theta(t) = e^{Ft}$$

or similarly as the power series

$$\Theta(t) = I + Ft + \frac{1}{2!} F^2 t^2 + \frac{1}{3!} F^3 t^3 + \cdots$$
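The power series gives a direct, if naive, way to evaluate the time-invariant transition matrix. The sketch below truncates the series after a fixed number of terms, which is adequate only when the product of F and t is modest in size; the function name `transition_matrix` and the example F are assumptions. The example F was chosen because its square is −I, so the exact answer is a rotation matrix that can be compared with the series.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transition_matrix(f, t, terms=30):
    """Truncated series I + Ft + (1/2!) F^2 t^2 + ... for exp(Ft)."""
    n = len(f)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms):
        # term_k = term_{k-1} F t / k, so term_k = F^k t^k / k!
        term = [[v * t / k for v in row] for row in matmul(term, f)]
        result = [[r + v for r, v in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result


F = [[0.0, 1.0], [-1.0, 0.0]]   # assumed example: F^2 = -I
t = 1.3
theta = transition_matrix(F, t)
exact = [[math.cos(t), math.sin(t)], [-math.sin(t), math.cos(t)]]
series_error = max(abs(a - b) for ra, rb in zip(theta, exact)
                   for a, b in zip(ra, rb))

# Composition property: Theta(t1 + t2) = Theta(t1) Theta(t2)
product = matmul(transition_matrix(F, 0.5), transition_matrix(F, 0.8))
composition_error = max(abs(a - b) for ra, rb in zip(theta, product)
                        for a, b in zip(ra, rb))
print(series_error, composition_error)
```

For large values of t, or for F with widely spread eigenvalues, the raw series loses accuracy, and a dedicated matrix exponential routine would be the safer choice.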



Discrete to Continuous Time Limit

Time-invariant equations are considered for the sake of simplicity. Consider first the time-invariant, continuous-time differential equation

d~~t) x(O)

= Fx(t) =

+ Gcw,(t)



where the input w,(t) is given by w,(t) = w(nt.),







aJ(X) ax 1K, 507 ...



Vector and Matrix Gradient Functions

If f(x) is a $K_2$ vector function of a $K_1$ vector x, then the Jacobian matrix $\partial f(x)/\partial x$ is the $K_2$ by $K_1$ matrix whose element in row k and column j is $\partial f_k(x)/\partial x_j$.

Functions such as $\partial f(X)/\partial X$, $\partial F(x)/\partial x$, and $\partial F(X)/\partial X$ are also well-defined quantities but require tensor-type notations which will not be introduced here. EXAMPLES: If



[xzx xJJ x•

then dF(x) = [2x dx 1


2 ]


If f(x)

= xf


+ x1 + X1Xz


then a[(x)

+ Xz

(}X= (2xl


+ xtJ








then a[(X) =


[2xllx21- XzzXI2

xt, + 3x~l + 3]




f(x) =

xz [



+ xlz

+ X1X2



then ar(x) =








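The trace is a linear operator, so that

$$\frac{\partial}{\partial x} \operatorname{tr}[F(x)] = \operatorname{tr}\left[\frac{\partial F(x)}{\partial x}\right]$$

Using this fact, a variety of useful equations can be derived such as

$$\frac{\partial}{\partial X} \operatorname{tr}[AX] = A \qquad (C.1)$$

$$\frac{\partial}{\partial X} \operatorname{tr}[AX'] = A' \qquad (C.2)$$

$$\frac{\partial}{\partial X} \operatorname{tr}[AXBX] = BXA + AXB \qquad (C.3)$$

$$\frac{\partial}{\partial X} \operatorname{tr}[AXBX'] = BX'A + B'X'A' \qquad (C.4)$$

Many other formulas can be derived from the above by using the facts that

Consider

$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix}$$

Then

$$\operatorname{tr}[AX] = a_{11} x_{11} + a_{12} x_{21} + a_{21} x_{12} + a_{22} x_{22}$$

so

$$\frac{\partial \operatorname{tr}[AX]}{\partial X} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = A$$

Similarly,

$$\operatorname{tr}[XX] = x_{11}^2 + 2 x_{12} x_{21} + x_{22}^2$$

$$\operatorname{tr}[XX'] = x_{11}^2 + x_{12}^2 + x_{21}^2 + x_{22}^2$$

$$\frac{\partial \operatorname{tr}[XX]}{\partial X} = 2 \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} = 2X$$

$$\frac{\partial \operatorname{tr}[XX']}{\partial X} = 2 \begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \end{bmatrix} = 2X'$$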


Using a power series expansion of $e^X$, it can also be shown that

$$\frac{\partial}{\partial X} \operatorname{tr}[e^X] = e^X$$

A very useful equation results when f(x) = x'Ax:

$$\frac{\partial}{\partial x} x'Ax = x'A' + x'A$$

This can be viewed as a special case of (C.4) where the matrix X of (C.4) is the vector x and B = I, as tr[Axx'] = x'Ax.
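The gradient of the quadratic form can be confirmed by finite differences. The following is a small sketch with an assumed nonsymmetric A and an arbitrary test point; the function names are hypothetical. A nonsymmetric A is used on purpose, since for symmetric A the two terms x'A' and x'A coincide and the check would not distinguish them.

```python
def quadratic_form(a, x):
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_gradient(a, x):
    """Row vector x'A' + x'A; element k is sum_i x_i (a_ki + a_ik)."""
    n = len(x)
    return [sum(x[i] * (a[k][i] + a[i][k]) for i in range(n))
            for k in range(n)]


def numeric_gradient(a, x, step=1e-6):
    """Central differences, one coordinate at a time."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        down = list(x)
        up[k] += step
        down[k] -= step
        grad.append((quadratic_form(a, up) - quadratic_form(a, down))
                    / (2.0 * step))
    return grad


A = [[2.0, 1.0], [-3.0, 4.0]]   # assumed, deliberately nonsymmetric
x = [1.5, -0.5]
grad_analytic = analytic_gradient(A, x)
grad_numeric = numeric_gradient(A, x)
print(grad_analytic, grad_numeric)
```

For this A and x the quadratic form is $2x_1^2 - 2x_1x_2 + 4x_2^2$, so the gradient can also be verified by hand as $(4x_1 - 2x_2,\ -2x_1 + 8x_2) = (7, -7)$.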

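The determinant is a "nonlinear operator" but simple gradient equations often result. Consider a K by K matrix X(α) which is a function of the scalar α. Then assuming $X^{-1}(\alpha)$ exists,

$$\frac{\partial}{\partial \alpha} |X(\alpha)| = |X(\alpha)| \operatorname{tr}\left[X^{-1}(\alpha) \frac{\partial X(\alpha)}{\partial \alpha}\right] \qquad (C.6)$$

$$\frac{\partial}{\partial \alpha} \ln |X(\alpha)| = \operatorname{tr}\left[X^{-1}(\alpha) \frac{\partial X(\alpha)}{\partial \alpha}\right]$$

This leads to formulae such as

$$\frac{\partial}{\partial X} |X| = |X| X^{-1}$$

$$\frac{\partial}{\partial X} \ln |X| = X^{-1}$$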





Probability Theory

For simplicity it is always assumed that probability density functions exist, random variables have finite variances, etc. Consider a K-dimensional random vector x with elements $x_k$, k = 1, ..., K. Let P(x) denote the (cumulative) distribution function and p(x) denote the joint probability density function. Then

$$p(x) = p(x_1 \cdots x_K) = \frac{\partial^K P(x_1 \cdots x_K)}{\partial x_1 \cdots \partial x_K}$$

$$P(x) = P(x_1 \cdots x_K) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_K} p(u_1 \cdots u_K)\, du_1 \cdots du_K$$

Consider a two-dimensional random vector x where $P(x) = P(x_1, x_2)$ and $p(x) = p(x_1, x_2)$ are the distribution and density functions of x, respectively. Let $P(x_j)$ and $p(x_j)$, j = 1, 2, be the marginal distributions and densities, respectively. Then

$$P(x_1) = P(x_1, \infty) = \int_{-\infty}^{x_1} \int_{-\infty}^{\infty} p(u, v)\, dv\, du$$

$$p(x_1) = \int_{-\infty}^{\infty} p(x_1, v)\, dv$$

Extension to vectors is straightforward. For a partitioned vector x with components $x_1$ and $x_2$, the marginal density $p(x_1)$ is obtained by integrating $p(x_1, x_2)$ over every element of $x_2$ or, in simpler notation,

$$p(x_1) = \int p(x_1, x_2)\, dx_2 \qquad (D.1)$$

Consider a K 1-dimensional random vector x with density p(x) = p(x 1 • • • xx,). Let Y(x) denote some K 2 by K 3 matrix function of x. Let E{.Y(x)} denote the expectation ofY. Then E[y1.1(x)}

E[Y(x)} =






E[y"'_lx)} = =


[= ·· · [= hix)p(x) dx

1 • • •


JJ'k,j(x)p(x) dx

Expectation is a linear operation in the sense that




(0.2) (0.3)


Means, Covariances, Variances, Standard Deviations, Moments

Let m denote the mean of the random vector x. Then

$$m = E\{x\}$$

Let $\Gamma$ denote the covariance matrix of x. Then

$$\Gamma = E\{[x - m][x - m]'\}$$

$\Gamma$ is a positive semidefinite, symmetric matrix. The components of m are the first moments. The components of $\Gamma$ are the second moments about the mean. Consider a two-dimensional random vector x with

$$\Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12} & \Gamma_{22} \end{bmatrix}$$

$\Gamma_{11}$ is the variance of $x_1$, while $(\Gamma_{11})^{1/2}$ is the standard deviation of $x_1$. $\Gamma_{12}$ is the covariance or cross correlation between $x_1$ and $x_2$. The correlation coefficient between $x_1$ and $x_2$ is

$$\frac{\Gamma_{12}}{(\Gamma_{11} \Gamma_{22})^{1/2}}$$

If $\Gamma_{12} = 0$, then $x_1$ and $x_2$ are uncorrelated. Extension of these ideas to vector $x_1$ and $x_2$ is straightforward. The second moment of a random vector x is $E\{xx'\}$. It follows that

$$E\{xx'\} = \Gamma + mm'$$

$E\{x^2\}$ is also called the mean square value of x.


Independence, Uncorrelated, Orthogonal

If the random vectors $x_1 \cdots x_K$ are mutually independent, then

$$P(x) = P(x_1 \cdots x_K) = P(x_1) \cdots P(x_K)$$

$$p(x) = p(x_1 \cdots x_K) = p(x_1) \cdots p(x_K)$$

where $P(x_j)$, $p(x_j)$, j = 1, ..., K, are the marginal distributions and densities. For independent vectors $x_1$ and $x_2$

$$E\{Y_1(x_1) Y_2(x_2)\} = E\{Y_1(x_1)\} E\{Y_2(x_2)\}$$

Consider a partitioned vector x with components $x_1$ and $x_2$ and

$$E\{[x - m][x - m]'\} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix}$$

$x_1$ and $x_2$ are uncorrelated if $\Gamma_{12} = 0$.

If $x_1$ and $x_2$ are uncorrelated,

$x_1$ and $x_2$ are orthogonal if $E\{x_1 x_2'\} = 0$.

The following relationships exist between independence, uncorrelated and orthogonal random vectors:

independence implies uncorrelated

uncorrelated plus zero mean implies orthogonal

EXAMPLE: Assume that w is a Gaussian random variable with (see Appendix E)

$$E\{w\} = 0 \qquad E\{w^2\} = \psi$$

Then from Appendix E

$$E\{w^3\} = 0$$

Define

$$x_1 = w \qquad x_2 = w^2$$

It is obvious that $x_1$ and $x_2$ are not independent. However,

$$E\{x_1 x_2\} = E\{w^3\} = 0$$

so $x_1$ and $x_2$ are orthogonal. They are also uncorrelated, as

$$E\{x_1\} = 0 \qquad E\{x_2\} = \psi$$
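The example can be reproduced by simulation. This is a rough Monte Carlo sketch with an assumed variance ψ = 1 and a fixed random seed; the function name `sample_moments` is hypothetical, and the sample moments only approximate the expectations.

```python
import random


def sample_moments(n_samples=200000, psi=1.0, seed=0):
    """Sample means of x1 = w, x2 = w^2, and x1 x2 for Gaussian w."""
    rng = random.Random(seed)
    sigma = psi ** 0.5
    sum_x1 = sum_x2 = sum_x1x2 = 0.0
    for _ in range(n_samples):
        w = rng.gauss(0.0, sigma)
        x1, x2 = w, w * w          # x2 is a deterministic function of x1
        sum_x1 += x1
        sum_x2 += x2
        sum_x1x2 += x1 * x2
    mean_x1 = sum_x1 / n_samples
    mean_x2 = sum_x2 / n_samples
    mean_x1x2 = sum_x1x2 / n_samples
    covariance = mean_x1x2 - mean_x1 * mean_x2
    return mean_x1, mean_x2, mean_x1x2, covariance


mean_x1, mean_x2, mean_x1x2, covariance = sample_moments()
print(mean_x1, mean_x2, mean_x1x2, covariance)
```

The sample value of $E\{x_1 x_2\}$ and the sample covariance should both come out close to zero even though $x_2$ is completely determined by $x_1$, which is the point of the example: orthogonal and uncorrelated do not imply independent.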



Characteristic Functionst

Consider a random variable x with probability density p(x). The characteristic function c(η) of x is defined as

$$c(\eta) = \int_{-\infty}^{\infty} e^{i\eta x} p(x)\, dx = E\{e^{i\eta x}\}$$

where $i = \sqrt{-1}$. If w(1) ··· w(n) are independent random variables with characteristic functions $c_1(\eta) \cdots c_n(\eta)$, then

$$x = \sum_{m=1}^{n} w(m)$$

has characteristic function $c_x(\eta)$ given by

$$c_x(\eta) = \prod_{m=1}^{n} c_m(\eta)$$

For a vector x with p(x)

$$c(\eta) = \int e^{i\eta' x} p(x)\, dx$$

The characteristic function is also called the moment-generating function. Consider a scalar x. Then c(tJ) =

J~~ e'"xp(x)dx

.J~,. [ + l



+ (iq{) + · · ·Jp(x) dx

so that


ac(q) 1. = i afl .~o

azc(~) I






xp(x) dx

_!_ J~ x2p(x) dx 2






D.4 Transformation of Variables†

Consider a K-dimensional random vector x with probability density Px("J'). Consider the K random variables k= I, ... ,K

Assume that the transformation is one to one so that the inverse transformation can be written as k=I, ... ,K

where y has the elements h· k density of y. Then

I, ... , K. Let py(y) denote the probability


py(y) = FJx1(y) · · · xAy)]








ay =


Sums of Random Variables

Consider

$$y = x_1 + x_2$$

where $x_1$ and $x_2$ are random variables with probability density $p_x(x_1, x_2)$. Let $p_y(y)$ denote the probability density of y. Then

$$p_y(y) = \int_{-\infty}^{\infty} p_x(u, y - u)\, du$$

If $x_1$ and $x_2$ are independent so that

$$p_x(x_1, x_2) = p_{x_1}(x_1)\, p_{x_2}(x_2)$$

then $p_y(y)$ is given by the convolution

$$p_y(y) = \int_{-\infty}^{\infty} p_{x_1}(u)\, p_{x_2}(y - u)\, du$$
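The convolution formula can be evaluated numerically for a case with a known answer. The sketch below assumes two independent random variables, each uniform on [0, 1], whose sum has the triangular density y on [0, 1] and 2 − y on [1, 2]. The integral is approximated with a midpoint rule; the function names and the integration limits are assumptions of this illustration.

```python
def uniform_density(x):
    """Density of a random variable uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def density_of_sum(p1, p2, y, lower=-1.0, upper=2.0, steps=30000):
    """Midpoint-rule approximation of the integral of p1(u) p2(y - u) du."""
    du = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * du
        total += p1(u) * p2(y - u)
    return total * du


p_at_half = density_of_sum(uniform_density, uniform_density, 0.5)
p_at_one = density_of_sum(uniform_density, uniform_density, 1.0)
p_at_one_and_half = density_of_sum(uniform_density, uniform_density, 1.5)
p_outside = density_of_sum(uniform_density, uniform_density, 2.5)
print(p_at_half, p_at_one, p_at_one_and_half, p_outside)
```

The integration limits only need to cover the support of the first density; outside it the integrand is zero.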

Extension to vectors y, $x_1$, and $x_2$ and to sums of many random variables is straightforward.

D.6 Conditional Distribution: Bayes' Rule

Assume that x is a partitioned random vector where $p(x) = p(x_1, x_2)$ is the density and $p(x_1)$ and $p(x_2)$ are the marginal densities. Let $p(x_1|x_2)$ denote the conditional probability density of $x_1$ given $x_2$. Then the product rule is

$$p(x_1, x_2) = p(x_1|x_2)\, p(x_2) \qquad (D.6)$$

In (D.6), $x_2$ is sometimes viewed as a given constant value so that $p(x_1|x_2)$ is the probability density of $x_1$ when the random vector $x_2$ assumes that value. However, $x_2$ can also be viewed as a random vector so that $p(x_1|x_2)$ is a random function, i.e., a function of $x_1$ whose "shape" depends on the random vector $x_2$. With either interpretation,

$$\int p(x_1|x_2)\, dx_1 = 1 \qquad (D.7)$$

Using the product rule,

$$p(x_1|x_2)\, p(x_2) = p(x_2|x_1)\, p(x_1) \qquad (D.8)$$

Using the definition of marginal distributions and the product rule gives

$$p(x_2) = \int p(x_2|x_1)\, p(x_1)\, dx_1 \qquad (D.9)$$

Bayes' theorem then follows; i.e.,

$$p(x_1|x_2) = \frac{p(x_2|x_1)\, p(x_1)}{\int p(x_2|x_1)\, p(x_1)\, dx_1}$$

In this book, the product rule, Bayes' theorem, and other related equations are all called Bayes' rule.

D.7 Conditional Expectation

Consider a partitioned x with $p(x) = p(x_1, x_2)$. Let $y(x_1)$ be some function of $x_1$. Then the conditional expectation of $y(x_1)$ given $x_2$ is

$$E\{y(x_1)|x_2\} = \int y(x_1)\, p(x_1|x_2)\, dx_1$$

The conditional mean of $x_1$ given $x_2$ is

$$E\{x_1|x_2\} = \int x_1\, p(x_1|x_2)\, dx_1$$

$E\{x_1|x_2\}$ is usually called the conditional expectation of $x_1$ given $x_2$ rather than the conditional mean. The conditional covariance matrix of $x_1$ given $x_2$ is defined as

$$E\{[x_1 - E\{x_1|x_2\}][x_1 - E\{x_1|x_2\}]'\,|\,x_2\} = \int [x_1 - E\{x_1|x_2\}][x_1 - E\{x_1|x_2\}]'\, p(x_1|x_2)\, dx_1$$

It is very important to emphasize that $E\{x_1|x_2\}$, etc., are functions of $x_2$. In most cases, $x_2$ is viewed as a random vector so that the conditional mean, conditional variances, etc., are random quantities. When the conditional expectation is viewed as a random vector, it has many important properties. Some of these are now listed. Let Y(x) be some matrix function of x. Then

l E{E(Y(x,)l E(.Y,(x,)

+ Y (x 2


X 2}}

)lx 2 }




£(Y,(x,)IX 2 }


+ E(Y-Ix

,, E(HY(x 1)lx 2 } = HE(Y(x,)IX 2 }



(0.14) (0.15)


I E[Y(x2) Ix2}




I E[Hix J =H I



if x" x 2 are independent


If H- 1 exists, then

I E[Y(x


IHx 2 }



E(Y(x 1) IX 2 }

The following property is of special importance

I E{Y(x )[x 2

1 -

E(x,IX 2 }]}





as it states that the random vector x 1 - E[x 1 1x 2 } is orthogonal to any function of the random vector x 2 • Consider the conditional variance of (0.12). Manipulation of (0.12) using (0.21) yields .

E{[x 1 =


E(x,lx 2}][x,- Efx,lx 2 }]1 'x 2 } E(x,x',l x,}- E{(E(x,lx 2 }][E(x,lx 2 }l'}



Multivariate Gaussian (Norrnal). Distribution

Consider the K-dimensional random vector x with probability density p(x). x is a Gaussian or normal random vector if p(x) is of the form

$$p(x) = [(2\pi)^K |\Gamma|]^{-1/2} \exp[-\tfrac{1}{2} J(x)]$$

$$J(x) = (x - m)'\Gamma^{-1}(x - m) \qquad (E.1)$$

$$E\{x\} = m$$

$$E\{(x - m)(x - m)'\} = \Gamma$$

The K-dimensional vector m is the mean value vector, while the K by K matrix $\Gamma$ is the covariance matrix. For scalar x

$$p(x) = \frac{1}{(2\pi\Gamma)^{1/2}} \exp\left\{-\frac{(x - m)^2}{2\Gamma}\right\}$$

The shape of p(x) is illustrated in Fig. E.1. The terms Gaussian and normal are both used interchangeably in this book. The characteristic function c(η) of a Gaussian x is

$$c(\eta) = \exp[-\tfrac{1}{2} \eta'\Gamma\eta + i\eta'm] \qquad (E.2)$$

If the characteristic function is viewed as the definition of a Gaussian vector, $\Gamma$ can be a positive semidefinite matrix (i.e., $\Gamma^{-1}$ need not exist). The case where $\Gamma$ is only positive semidefinite corresponds to a degenerate Gaussian random vector. Another type of degenerate Gaussian random vector occurs when $\Gamma^{-1}$ is specified but is only positive semidefinite [i.e., $(\Gamma^{-1})^{-1}$ does not exist].

Figure E.1 Gaussian (normal) probability density.

The fact that x is a Gaussian (normal) random vector with mean m and covariance matrix $\Gamma$ is often indicated by writing

x is N(m, Γ)

Define the following partitioned vectors and matrices:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \qquad m = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} \qquad \Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix} \qquad (E.3)$$


where $x_j$, $m_j$ are $K_j$-dimensional vectors, $\Gamma_{jj}$ are $K_j$ by $K_j$ matrices, $j = 1, 2$, and $\Gamma_{12}$ is a $K_1$ by $K_2$ matrix. $K_1 + K_2 = K$. Let $p_j(x_j)$ denote the probability density of $x_j$.

Assume that a $K_1$-dimensional $x$ is $N(m, \Gamma)$. Define a $K_2$-dimensional random vector as a linear transformation of $x$:

$$y = Hx + h$$

Then

$$y \text{ is } N(Hm + h, H\Gamma H') \tag{E.4}$$
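The rule (E.4) amounts to two matrix products. The following is a minimal sketch, assuming NumPy; the function name `transform_gaussian` and all numerical values are made up for illustration.

```python
import numpy as np

def transform_gaussian(m, gamma, H, h):
    """Mean and covariance of y = H x + h when x is N(m, gamma), following (E.4)."""
    m = np.asarray(m, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    H = np.asarray(H, dtype=float)
    h = np.asarray(h, dtype=float)
    return H @ m + h, H @ gamma @ H.T

# Illustrative numbers: a 3-dimensional x mapped to a 2-dimensional y.
m = np.array([1.0, 0.0, -2.0])
gamma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 4.0]])
H = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, -1.0]])
h = np.array([0.5, 0.0])

mean_y, cov_y = transform_gaussian(m, gamma, H, h)
print(mean_y)   # mean of y: H m + h
print(cov_y)    # covariance of y: H Gamma H'
```

The sketch returns only the two parameters of the distribution of $y$; whether a density for $y$ exists still depends on whether $H\Gamma H'$ is invertible.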


One of many ways to derive (E.4) is to use the characteristic function. Let $c_y(\eta)$ denote the characteristic function of $y$. Then

$$c_y(\eta) = E\{\exp[i\eta' Hx + i\eta' h]\}$$

or from (E.2)

$$c_y(\eta) = \exp\{-\tfrac{1}{2}\eta' H\Gamma H'\eta + i\eta'[Hm + h]\}$$

which yields (E.4). If $K_1 \ge K_2$ and $(H\Gamma H')^{-1}$ exists, then $p(y)$ is determined as in (E.1) or (E.2). If $[H\Gamma H']^{-1}$ does not exist, $p(y)$ is determined in terms of the characteristic function as in (E.2). An important special case of (E.4) is

$$y = x_1 + x_2$$




If $x$ is $N(m, \Gamma)$, as defined in (E.3),

$$y \text{ is } N(m_1 + m_2, \Gamma_{11} + \Gamma_{12} + \Gamma_{12}' + \Gamma_{22})$$

Extension to sums of many Gaussian random vectors is straightforward. If $x_1$ and $x_2$ are uncorrelated or orthogonal and Gaussian, they are also independent. Consider a Gaussian $x$ partitioned as in (E.3). Then using (E.4),

$$x_j \text{ is } N(m_j, \Gamma_{jj}), \qquad j = 1, 2$$

This is obviously true if $x_1$ and $x_2$ are independent, but it is true in general.

E.1 Conditional Distributions

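By Bayes' rule, the conditional probability density $p(x_1 \mid x_2)$ is given by

$$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p_2(x_2)} = \frac{p(x)}{p_2(x_2)} \tag{E.5}$$

(E.1) and (E.5) give for the Gaussian case

$$p(x_1 \mid x_2) = \left[(2\pi)^{K_1} \frac{|\Gamma|}{|\Gamma_{22}|}\right]^{-1/2} \exp\{-\tfrac{1}{2} J_{1|2}(x)\} \tag{E.6}$$

where

$$J_{1|2}(x) = J(x) - J_2(x_2)$$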

Equation (E.6) can be manipulated after a lot of work into the following more useful form:

$$\begin{aligned}
p(x_1 \mid x_2) &= [(2\pi)^{K_1} |\Gamma_{1|2}|]^{-1/2} \exp\{-\tfrac{1}{2} J_{1|2}(x_1)\} \\
J_{1|2}(x_1) &= [x_1 - m_{1|2}]' \Gamma_{1|2}^{-1} [x_1 - m_{1|2}] \\
m_{1|2} &= m_1 + \Gamma_{12}\Gamma_{22}^{-1}(x_2 - m_2) \\
\Gamma_{1|2} &= \Gamma_{11} - \Gamma_{12}\Gamma_{22}^{-1}\Gamma_{12}'
\end{aligned} \tag{E.7}$$

If $x_2$ is considered to be a constant (i.e., not a random vector), then $p(x_1 \mid x_2)$ is the probability density of a $K_1$-dimensional normal vector with mean $m_{1|2}$ and covariance matrix $\Gamma_{1|2}$.
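The conditional mean and covariance of (E.7) can be computed directly from the partitioned quantities, and the result can be checked against the ratio of densities given by Bayes' rule. The following is a minimal sketch, assuming NumPy and a positive definite $\Gamma$; the helper names `gaussian_density` and `condition_gaussian` and the numerical values are illustrative assumptions.

```python
import numpy as np

def gaussian_density(x, m, gamma):
    """Gaussian density of (E.1) for a positive definite covariance gamma."""
    d = x - m
    j = d @ np.linalg.solve(gamma, d)
    return ((2.0 * np.pi) ** x.size * np.linalg.det(gamma)) ** -0.5 * np.exp(-0.5 * j)

def condition_gaussian(m, gamma, k1, x2):
    """Return m_{1|2} and Gamma_{1|2} of (E.7); x1 is the first k1 components of x."""
    m1, m2 = m[:k1], m[k1:]
    g11, g12, g22 = gamma[:k1, :k1], gamma[:k1, k1:], gamma[k1:, k1:]
    gain = np.linalg.solve(g22, g12.T).T      # Gamma_12 Gamma_22^{-1} (g22 is symmetric)
    m_cond = m1 + gain @ (x2 - m2)
    gamma_cond = g11 - gain @ g12.T
    return m_cond, gamma_cond

# Illustrative numbers: K = 3, K1 = 1, K2 = 2.
m = np.array([1.0, 0.0, -2.0])
gamma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 4.0]])
x1 = np.array([0.7])
x2 = np.array([0.4, -1.0])

m_cond, gamma_cond = condition_gaussian(m, gamma, 1, x2)

# Bayes' rule: p(x1 | x2) = p(x) / p2(x2), compared with the closed form
ratio = gaussian_density(np.concatenate([x1, x2]), m, gamma) / gaussian_density(x2, m[1:], gamma[1:, 1:])
closed_form = gaussian_density(x1, m_cond, gamma_cond)
print(ratio, closed_form)
```

Note that `gamma_cond` does not depend on the value of `x2`, while `m_cond` is a linear function of it.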





Conditional Expectation

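Consider the $K_1$-dimensional random vector $E\{x_1 \mid x_2\}$, which is the conditional expectation of $x_1$ given $x_2$. It follows from (E.7) that in the Gaussian case

$$E\{x_1 \mid x_2\} = m_{1|2} = m_1 + \Gamma_{12}\Gamma_{22}^{-1}(x_2 - m_2) \tag{E.8}$$

If $E\{x_1 \mid x_2\}$ is viewed as a random variable, i.e., as a function of the random $x_2$, then for a Gaussian $x$

$$E\{x_1 \mid x_2\} \text{ is } N(m_1, \Gamma_{12}\Gamma_{22}^{-1}\Gamma_{12}') \tag{E.9}$$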

The conditional covariance matrix is defined by (D.12). Substituting (E.7) and (E.8) yields

$$E\{[x_1 - E\{x_1 \mid x_2\}][x_1 - E\{x_1 \mid x_2\}]' \mid x_2\} = \Gamma_{1|2} = \Gamma_{11} - \Gamma_{12}\Gamma_{22}^{-1}\Gamma_{12}'$$

Thus in the Gaussian case, the conditional covariance matrix is not a function of $x_2$; i.e., it is not a random function. The general properties of the conditional expectation (D.13)-(D.21) also hold in the Gaussian case. Another useful property which can be proved for the Gaussian case is





if $x_2$ and $x_3$ are independent,

$$E\{x_1 \mid x_2, x_3\} - m_1 = E\{x_1 \mid x_2\} - m_1 + E\{x_1 \mid x_3\} - m_1$$
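The additivity property for independent $x_2$ and $x_3$ can be checked numerically. The following is a minimal sketch, assuming scalar, jointly Gaussian $x_1$, $x_2$, $x_3$ with made-up moments in which the covariance between $x_2$ and $x_3$ is zero; the helper `cond_mean_x1` is hypothetical and applies the Gaussian conditional mean formula $m_1 + \Gamma_{12}\Gamma_{22}^{-1}(x_2 - m_2)$ to whichever components are observed.

```python
import numpy as np

# Illustrative moments of (x1, x2, x3); x2 and x3 are uncorrelated, hence independent.
m = np.array([1.0, -1.0, 2.0])
gamma = np.array([[3.0, 1.0, 0.5],
                  [1.0, 2.0, 0.0],
                  [0.5, 0.0, 1.0]])

def cond_mean_x1(given_idx, values):
    """Conditional mean of x1 given the components listed in given_idx."""
    idx = list(given_idx)
    g12 = gamma[0, idx]                 # covariance of x1 with the observed components
    g22 = gamma[np.ix_(idx, idx)]       # covariance of the observed components
    return m[0] + g12 @ np.linalg.solve(g22, values - m[idx])

x2_obs, x3_obs = 0.3, 1.5
lhs = cond_mean_x1([1, 2], np.array([x2_obs, x3_obs])) - m[0]
rhs = (cond_mean_x1([1], np.array([x2_obs])) - m[0]) + (cond_mean_x1([2], np.array([x3_obs])) - m[0])
print(lhs, rhs)
```

If the zero entries of `gamma` are replaced by a nonzero covariance between $x_2$ and $x_3$ (keeping the matrix positive definite), the two sides in general no longer agree.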


E.3 Wide Sense and Strict Sense†

Many concepts of random vectors and stochastic processes can be formulated in either a wide sense or a strict sense. Consider two random vectors $x$ and $y$, where

x: not Gaussian
y: Gaussian

$$E\{x\} = E\{y\} = m$$

$$E\{(x - m)(x - m)'\} = E\{(y - m)(y - m)'\} = \Gamma$$

Then although $x$ and $y$ are different, they have the same means and covariance matrices. Now suppose that some property $P_w$ of $x$ can be expressed in terms of $m$ and $\Gamma$. Let $P_s$ be the corresponding property for the Gaussian $y$. Then $P_w$ is a wide sense property, while $P_s$ is a strict sense property. A strict sense property is "stronger" than a wide sense property. The following examples should help clarify the strict versus wide sense concept.

Higher-Order Moments

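Consider

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

$$E\{x\} = E\{y\} = 0$$

$$E\{xx'\} = E\{yy'\} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{12}' & \Gamma_{22} \end{bmatrix}$$

Suppose that $\Gamma_{12} = 0$. Then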

1. $y_1$ and $y_2$ are independent (a strict sense property).

2. $x_1$ and $x_2$ are uncorrelated or orthogonal (a wide sense property).

Both wide sense and strict sense conditional expectations can be defined. The definition of Section D.12 yields a strict sense conditional expectation. For the Gaussian $y$

$$E_{\text{strict}}\{y_1 \mid y_2\} = \Gamma_{12}\Gamma_{22}^{-1} y_2$$

For the non-Gaussian $x$, the wide sense conditional expectation is defined as

$$E_{\text{wide}}\{x_1 \mid x_2\} = \Gamma_{12}\Gamma_{22}^{-1} x_2$$

Thus Gaussian y: E> 30.

It can

Constant Probability Density Contours

Consider some K-dimensional random vector $x$ with probability density $p(x)$. The set

$$\Omega = \{x : p(x) = \text{constant}\}$$

is a K-dimensional surface defined by all $x$ with the same probability density. For a Gaussian random vector, it follows from (E.1) that the ellipsoid surface

$$\Omega = \{x : [x - m]'\Gamma^{-1}[x - m] = \text{constant}\}$$

is such a contour of constant probability density. (Extensive discussions on ellipsoids are given in Appendix G.) Consider the ellipsoid [x: [x- m]T- 1 [x- m]

Qc =