Diffuse Algorithms for Neural and Neuro-Fuzzy Networks: With Applications in Control Engineering and Signal Processing 0128126094, 9780128126097


English Pages 220 [219] Year 2017


Table of contents :
Front cover
Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. With Applications in Control Engineering and Signal Processing
Copyright
CONTENTS
List of Figures
List of Tables
Preface
1 Introduction
1.1 Separable Models of Plants and Training Problems Associated With Them
1.1.1 Separable Least Squares Method
1.1.2 Perceptron With One Hidden Layer
1.1.3 Radial Basis Neural Network
1.1.4 Neuro-Fuzzy Network
1.1.5 Plant Models With Time Delays
1.1.6 Systems With Partly Unknown Dynamics
1.1.7 Recurrent Neural Network
1.1.8 Neurocontrol
1.2 The Recursive Least Squares Algorithm With Diffuse and Soft Initializations
1.3 Diffuse Initialization of the Kalman Filter
2 Diffuse Algorithms for Estimating Parameters of Linear Regression
2.1 Problem Statement
2.2 Soft and Diffuse Initializations
2.3 Examples of Application
2.3.1 Identification of Nonlinear Dynamic Plants
2.3.2 Supervisory Control
2.3.3 Estimation With a Sliding Window
3 Statistical Analysis of Fluctuations of Least Squares Algorithm on Finite Time Interval
3.1 Problem Statement
3.2 Properties of Normalized Root Mean Square Estimation Error
3.3 Fluctuations of Estimates under Soft Initialization with Large Parameters
3.4 Fluctuations Under Diffuse Initialization
3.5 Fluctuations with Random Inputs
4 Diffuse Neural and Neuro-Fuzzy Networks Training Algorithms
4.1 Problem Statement
4.2 Training With the Use of Soft and Diffuse Initializations
4.3 Training in the Absence of a Priori Information About Parameters of the Output Layer
4.4 Convergence of Diffuse Training Algorithms
4.4.1 Finite Training Set
4.4.2 Infinite Training Set
4.5 Iterative Versions of Diffuse Training Algorithms
4.6 Diffuse Training Algorithm of Recurrent Neural Network
4.7 Analysis of Training Algorithms With Small Noise Measurements
4.8 Examples of Application
4.8.1 Identification of Nonlinear Static Plants
4.8.2 Identification of Nonlinear Dynamic Plants
4.8.3 Example of Classification Task
5 Diffuse Kalman Filter
5.1 Problem Statement
5.2 Estimation With Diffuse Initialization
5.3 Estimation in the Absence or Incomplete a Priori Information About Initial Conditions
5.4 Systems State Recovery in a Finite Number of Steps
5.5 Filtering With the Sliding Window
5.6 Diffuse Analog of the Extended Kalman Filter
5.7 Recurrent Neural Network Training
5.8 Systems With Partly Unknown Dynamics
6 Applications of Diffuse Algorithms
6.1 Identification of the Mobile Robot Dynamics
6.2 Modeling of Hysteretic Deformation by Neural Networks
6.3 Harmonics Tracking of Electric Power Networks
Glossary
Notations
Abbreviations
References
Index
Back cover


DIFFUSE ALGORITHMS FOR NEURAL AND NEURO-FUZZY NETWORKS With Applications in Control Engineering and Signal Processing

BORIS A. SKOROHOD

Butterworth-Heinemann is an imprint of Elsevier
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2017 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-812609-7

For information on all Butterworth-Heinemann publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Joe Hayton
Acquisition Editor: Sonnini R. Yura
Editorial Project Manager: Ashlie Jackman
Production Project Manager: Kiruthika Govindaraju
Cover Designer: Limbert Matthew
Typeset by MPS Limited, Chennai, India


LIST OF FIGURES

Figure 1.1 Functional scheme of the supervisory control system.
Figure 2.1 Dependence of the estimation error on $t$ obtained using the RLSM with the soft and diffuse initializations.
Figure 2.2 Dependencies of the plant output and the model output on time with ten neurons.
Figure 2.3 Dependencies of the plant output and the model output on time with five neurons.
Figure 2.4 Plants and the diffuse algorithm outputs with $\lambda = 0.98$.
Figure 2.5 The error estimation with $\lambda = 0.98$.
Figure 2.6 Plants and the diffuse algorithm outputs with $\lambda = 1$.
Figure 2.7 The error estimation with $\lambda = 1$.
Figure 2.8 Dependencies of the plant output and the reference signal on time.
Figure 2.9 Dependencies of the supervisory control and the PD controller on time.
Figure 2.10 Transient processes in a closed system with diffuse algorithm (the dotted curve) and the gradient method (the dash-dotted curve), the continuous curve is the reference signal $r_t$.
Figure 2.11 Estimation errors of $a$.
Figure 2.12 Estimation errors of $b$.
Figure 2.13 Estimation of the parameter $a$.
Figure 2.14 Estimation of the parameter $b$.
Figure 3.1 Realization of signal.
Figure 3.2 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 12$, $S_t = 19.4$ dB, $k = 40$, $R = 1.52$.
Figure 3.3 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 36$, $S_t = 19.4$ dB, $k = 40$, $R = 1.52$.
Figure 3.4 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $\mu$, $S_t = 19.4$ dB, $k = 40$, $R = 1.52$.
Figure 3.5 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 36$, $S_t = 19.4$ dB, $k = 100$, $k = 25$, $R = 1.52$.
Figure 3.6 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 0.022$, $S_t = 1$ dB, $k = 40$, $R = 12.52$.
Figure 3.7 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 0.072$, $S_t = 1$ dB, $k = 40$, $R = 12.52$.
Figure 3.8 Dependencies $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $\mu$, for $S_t = 1$ dB, $k = 40$, $R = 12.52$.
Figure 3.9 Dependencies $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$, $t = 11$, $S_t = 1$ dB.
Figure 3.10 Dependencies $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$, $t = 11$, $S_t = 10.5$ dB.
Figure 3.11 Dependencies $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$, $t = 11$, $S_t = 21.7$ dB.
Figure 3.12 Dependency $\beta_t(\mu)$ on $t$.
Figure 3.13 Dependency $\beta_t(\mu)$ on $t$ for $\mu = 107$, $S_t = 44$ dB (1), $S_t = 43.9$ dB (2).
Figure 4.1 Approximating dependencies for DTA1 (1) and DTA2 (2).
Figure 4.2 Dependencies of the plant and model outputs on the number of observations.
Figure 4.3 Dependencies of six membership functions on the number of observations before optimization.
Figure 4.4 Dependencies of the mean-square estimation errors on the number of iterations $M$.
Figure 4.5 Dependencies of six membership functions on the number of observations after optimization.
Figure 4.6 Dependencies of the plant output and the models outputs on the time.
Figure 4.7 Dependency of the plant input on the time.
Figure 4.8 Dependency of the plant output on the time.
Figure 4.9 Dependencies of the plant output, the model output on the time.
Figure 4.10 Dependencies of the plant output, the model output on the time with output layer estimation parameters only.
Figure 5.1 Dependencies of estimation errors on the time (curves 1 and 2, respectively).
Figure 5.2 Dependencies of the DEKF output (curve 1) and the output plant (curve 2) on the number observations in the training set.
Figure 5.3 Dependencies of the DEKF output (curve 1) and the output plant (curve 2) on the number observations in the testing set.
Figure 5.4 Dependencies of the DEKF output (curve 1) and the output plant (curve 2) on the number observations in the testing set with the input signal $u_t = \sin(2\pi/t)$.
Figure 5.5 Dependencies of $x_{1t}$ (curve 1), estimates of $x_{1t}$ for a known (curve 2) and unknown $f_{2t}$ (curve 3) on the number observations.
Figure 5.6 Dependencies of $x_{2t}$ (curve 1), estimates of $x_{2t}$ for a known (curve 2) and unknown $f_{2t}$ (curve 3) on the number observations.
Figure 6.1 Robotino exterior view.
Figure 6.2 Transient performance of engines.
Figure 6.3 Fragment input-output data for identification; $\omega_{out\,1}(t)$ is shown by a solid line, $\omega_{r1}(t)$ by a dashed line.
Figure 6.4 Fragment input-output data for identification; $\omega_{out\,2}(t)$ is shown by a solid line, $\omega_{r2}(t)$ by a dashed line.
Figure 6.5 Fragment input-output data for identification; $\omega_{out\,3}(t)$ is shown by a solid line, $\omega_{r3}(t)$ by a dashed line.
Figure 6.6 Spectral densities of the input signals.
Figure 6.7 Dependences of the angular speed of the first motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.8 Dependences of the angular speed of the second motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.9 Dependences of the angular speed of the third motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.10 Correlation functions of the residues and cross-correlation function inputs with the residues.
Figure 6.11 Dependences of the angular speed of the first motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.12 Dependences of the angular speed of the second motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.13 Dependences of the angular speed of the third motor (curve 1) and its prediction (curve 2) on the number of observations.
Figure 6.14 The dependence of the deformation (mm) from the applied force (kg).
Figure 6.15 The dependence of force $F(n)$ (kg) on the number of observations $n$.
Figure 6.16 Dependences of deformation $\delta$ (mm) (curve 1) and its approximation (curve 2) on the number of observations $n$ in the training set.
Figure 6.17 Histogram of residues on the training set.
Figure 6.18 Correlation residues with $F(t)$ on the training set.
Figure 6.19 Correlation residues on the training set.
Figure 6.20 Dependences of deformation $\delta$ (mm) (curve 1) and its approximation (curve 2) on the number of observations $n$ on the testing set.
Figure 6.21 Dependences of the deformation (curve 2) and approximating curve (curve 1) (mm) from the applied force (kg) on the test set.
Figure 6.22 Histogram of residues on the testing set.
Figure 6.23 The dependence of the current value on the number of observations.
Figure 6.24 Dependencies of harmonic amplitude estimates $A_1$ on the number of observations under the impulse noise action. The Widrow-Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).
Figure 6.25 Dependencies of harmonic amplitude estimates $A_{23}$ on the number of observations under the impulse noise action. The Widrow-Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).
Figure 6.26 The dependence of the signal value on the number of observations.
Figure 6.27 Dependencies of harmonic amplitude estimates $A_1$ on the number of observations while the varying load. The Widrow-Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).
Figure 6.28 Dependencies of harmonic amplitude estimates $A_{23}$ on the number of observations while the varying load. The Widrow-Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).
Figure 6.29 The dependence of the amplitude-frequency characteristics of the filter on the harmonic number.
Figure 6.30 Dependencies of harmonics amplitudes estimates $A_1$ (curve 1) and $B_1$ (curve 2) on the observations number under the impulse noise action.
Figure 6.31 Dependencies of harmonics amplitudes estimates $A_{23}$ (curve 1) and $B_{23}$ (curve 2) on the observations number under the impulse noise action.
Figure 6.32 The dependence of the estimation of the fundamental period on the observations number under the impulse noise action.

LIST OF TABLES

Table 2.1 Training errors
Table 3.1 Estimates of β tr12 obtained by means of the statistical simulation
Table 4.1 Training errors
Table 4.2 Training errors

PREFACE

This book considers the problem of training neural and neuro-fuzzy networks. Attention is concentrated on approaches that exploit a separable structure of plant models: nonlinear with respect to some unknown parameters and linear with respect to the others. Examples are a multilayer perceptron with a linear activation function at its output, a radial basis neural network, a Sugeno neuro-fuzzy network, or a recurrent neural network, all of which are widely used in applications related to the identification and control of nonlinear plants, time series forecasting, classification, and recognition.

Training of static neural and neuro-fuzzy networks can be regarded as the problem of minimizing a quality criterion with respect to the unknown parameters in the network description for a given training set. It is well known that this is a complex, multiextremal, often ill-conditioned nonlinear optimization problem. To solve it, various algorithms have been developed that are superior to the error backpropagation algorithm and its numerous modifications in convergence rate, approximation accuracy, and generalization ability. There are also algorithms that directly take into account the separable character of the network structure. Thus, in Ref. [1] the variable projection (VP) algorithm for static separable plant models is proposed. According to this algorithm, the initial optimization problem is transformed into a new problem that involves only the nonlinearly entering parameters. Under certain conditions the sets of stationary points of the two problems coincide, while the dimensionality decreases and, as a consequence, there is no need to select initial values for the linearly entering parameters.
Moreover, the new optimization problem is better conditioned [2-4], and if the same method is used for the initial and the transformed optimization problems, the VP algorithm always converges after a smaller number of iterations. At the same time, the VP algorithm can be implemented only in batch mode and, in addition, determining the partial derivatives of the modified criterion with respect to the parameters becomes considerably more complicated. A hybrid procedure for training a Sugeno fuzzy network is proposed in Refs. [5,6]; it is based on the successive use of the recursive least squares method (RLSM) to determine the linearly entering parameters and the gradient method for the nonlinear ones. The extreme learning machine (ELM) approach is developed in Refs. [7,8]. In this approach only the linearly entering parameters are trained, and the nonlinear ones are drawn at random without taking the training set into account. However, it is well known that this approach can give quite low accuracy when the training set is relatively small. It should also be noted that when the ELM and the hybrid algorithms use the RLS algorithm, initial values must be selected for the matrix that satisfies the Riccati equation. Moreover, since a priori information about the estimated parameters is absent, its elements are generally set proportional to a large parameter, which may lead to divergence even in the case of linear regression.

The purpose of this book is to present new approaches to training neural and neuro-fuzzy networks that have a separable structure. It is assumed that, in addition to the training set, a priori information is given only about the nonlinearly entering parameters. This information may be obtained from the distribution of a generating sample, a training set, or some linguistic information. For static separable models, the problem of minimizing a quadratic criterion that includes only that information is considered. Such a problem statement and the Gauss-Newton method (GNM) with linearization around the latest estimate lead to new online and offline training algorithms that are robust with respect to unknown a priori information about the linearly entering parameters. To be more precise, these parameters are interpreted as random variables with zero expectation and a covariance matrix proportional to an arbitrarily large parameter µ (soft constrained initialization). Asymptotic representations of the GNM as µ → ∞, which we call diffuse training algorithms (DTAs), are found. We explore the properties of the DTAs; in particular, their convergence for limited and unlimited sample sizes is studied.
The specifics of the problem are connected with the separable character of the observation model and with the fact that the nonlinearly entering parameters belong to some compact set, whereas the linearly entering parameters should be considered as arbitrary numbers. It is shown that the proposed DTAs have the following important characteristics:
1. Unlike their prototype, the GNM with a large but finite µ, the DTAs are robust with respect to round-off error accumulation.
2. As in Refs. [1-4], no choice of initial values for the linearly entering parameters is required, but at the same time there is no need to evaluate the partial derivatives of the projection matrix.
3. Online and offline regimes can be used.
4. The DTAs are compared with the ELM approach and the hybrid algorithm for training the Sugeno neuro-fuzzy network [6,7], and the presented modeling results show that the developed algorithms can surpass them in accuracy and convergence rate.

With a successful choice of a priori information for the nonlinear parameters, rapid convergence to one of the acceptable minimum points of the criterion can be expected. In this regard, the analysis of the DTAs' behavior at fixed values of the nonlinear parameters, when a separable model degenerates into a linear regression, is very important. We attribute this to the possible influence of the properties of the linear estimation problem on the DTAs. The behavior of the RLSM with soft and diffuse initialization on a finite time interval, including a transition stage, is considered. In particular, asymptotic expansions of the solution of the Riccati equation and of the gain rate in inverse powers of µ, as well as conditions for the absence of overshoot in the transition phase, are obtained. Recursive (diffuse) estimation algorithms obtained as µ → ∞ are proposed; they do not depend on the large parameter µ, which leads to the divergence of the RLSM.

The approach described above is generalized to the training problem for separable dynamic plant models: a state vector and numerical parameters are estimated simultaneously using the relations for the extended diffuse Kalman filter (DKF) obtained in this book. It is assumed that, in addition to the training set, a priori information is used only on the nonlinearly entering parameters and the initial state vector, which can be obtained from the distribution of a generating sample. The linearly entering parameters are interpreted as random variables with zero expectation and a covariance matrix proportional to an arbitrarily large parameter µ. The asymptotic relations as µ → ∞ that describe the extended KF (EKF) are called the diffuse extended KF.
The theoretical results are illustrated with numerical examples of identification, control, signal processing, and pattern recognition problems. It is shown that the DTAs may surpass the ELM and the hybrid algorithms in approximation accuracy and in the number of iterations required. In addition, the use of the developed algorithms in a variety of engineering applications, which the author has been interested in at different times, is also described. These are identification of a dynamic model of a mobile robot, neural network-based modeling of mechanical hysteresis deformations, and monitoring of the harmonic components of electric current.


The book includes six chapters. The first chapter presents an overview of the known models of objects and of results relating to the subject of the book.

The behavior of the RLSM on a finite interval is considered in Chapter 2, Diffuse Algorithms for Estimating Parameters of Linear Regression. It is assumed that the initial value of the matrix Riccati equation is proportional to a large positive parameter µ. Asymptotic expansions of the Riccati equation solution and of the RLSM gain rate in inverse powers of µ are obtained. The limit (diffuse) recursive algorithms obtained as µ → ∞ are proposed and explored; they do not depend on the large parameter µ, which leads to the divergence of the RLSM. The theoretical results are illustrated by examples of solving identification, control, and signal processing problems.

In Chapter 3, Statistical Analysis of Fluctuations of Least Squares Algorithm on Finite Time Interval, properties of the bias, the matrix of second-order moments, and the normalized average squared error of the RLSM on a finite time interval are studied. It is assumed that the initial condition of the Riccati equation is proportional to the positive parameter µ and that the time interval includes an initialization stage. Based on the results of Chapter 2, asymptotic expressions for these quantities in inverse powers of µ for the soft initialization and limit expressions for the diffuse initialization are obtained. It is shown that the normalized average squared error of estimation can take arbitrarily large but bounded values as µ → ∞. Conditions under which the overshoot does not exceed the initial value (conditions for the absence of overshoot) are expressed in terms of the signal/noise ratio.

Chapter 4, Diffuse Neural and Neuro-Fuzzy Networks Training Algorithms, deals with the problem of training multilayer neural and neuro-fuzzy networks with simultaneous estimation of the hidden and output layer parameters. The probable values of the hidden layer parameters and their possible deviations are assumed to be known. A priori information about the output layer weights is absent: in one initialization of the GNM they are assumed to be random variables with zero expectations and a covariance matrix proportional to a large parameter, and in the other they are either unknown constants or random variables with unknown statistical characteristics. Training algorithms based on the GNM with linearization about the latest estimate are proposed and studied. The theoretical results are illustrated with examples of pattern recognition and of identification of nonlinear static and dynamic plants.


The problem of estimating the state and the parameters of discrete dynamic plants when a priori statistical information about the initial conditions is absent or incomplete is considered in Chapter 5, Diffuse Kalman Filter. Diffuse analogues of the Kalman filter and the extended Kalman filter are obtained. As practical applications, the problems of constructing a filter with a sliding window, observers that restore the state in a finite time, recurrent neural network training, and state estimation of nonlinear systems with partly unknown dynamics are considered.

Chapter 6, Applications of Diffuse Algorithms, provides examples of the use of diffuse algorithms for solving problems with real data arising in various engineering applications: identification of a dynamic model of a mobile robot, modeling of mechanical hysteresis deformations on the basis of neural networks, and monitoring of the harmonic components of electric current.

The author expresses deep gratitude to Head of Department Y.B. Ratner from the Marine Hydrophysical Institute of RAS, Department of Marine Forecasts, and Professor S.A. Dubovik from Sevastopol State University, Department of Informatics and Control in Technical Systems, for valuable discussions, and to his wife Irina for help in preparing the manuscript.

Boris Skorohod, Sevastopol, Russia
January 2017

CHAPTER 1

Introduction

Contents
1.1 Separable Models of Plants and Training Problems Associated With Them
1.1.1 Separable Least Squares Method
1.1.2 Perceptron With One Hidden Layer
1.1.3 Radial Basis Neural Network
1.1.4 Neuro-Fuzzy Network
1.1.5 Plant Models With Time Delays
1.1.6 Systems With Partly Unknown Dynamics
1.1.7 Recurrent Neural Network
1.1.8 Neurocontrol
1.2 The Recursive Least Squares Algorithm With Diffuse and Soft Initializations
1.3 Diffuse Initialization of the Kalman Filter

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00001-9
© 2017 Elsevier Inc. All rights reserved.

1.1 SEPARABLE MODELS OF PLANTS AND TRAINING PROBLEMS ASSOCIATED WITH THEM

1.1.1 Separable Least Squares Method

Let us consider an observation model of the form

$$y_t = \Phi(z_t, \beta)\alpha, \quad t = 1, 2, \ldots, N, \tag{1.1}$$

where $z_t = (z_{1t}, z_{2t}, \ldots, z_{nt})^T \in R^n$ is a vector of inputs, $y_t = (y_{1t}, y_{2t}, \ldots, y_{mt})^T \in R^m$ is a vector of outputs, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_r)^T \in R^r$ and $\beta = (\beta_1, \beta_2, \ldots, \beta_l)^T \in R^l$ are vectors of unknown parameters, $\Phi(z_t, \beta)$ is an $m \times r$ matrix of given nonlinear functions, $R^l$ is the space of vectors of length $l$, $(\cdot)^T$ is the matrix transpose operation, and $N$ is a sample size. The vector output $y_t$ depends linearly on $\alpha$ and nonlinearly on $\beta$. This model is called the separable regression (SR) [1]. If the vector $\beta$ is known, then Eq. (1.1) is transformed into a linear regression. Let there be given the set of input-output pairs $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$, and the quality criterion

$$J_N(\alpha, \beta) = \frac{1}{2}\sum_{k=1}^{N} (y_k - \Phi(z_k, \beta)\alpha)^T (y_k - \Phi(z_k, \beta)\alpha) = \frac{1}{2}\|Y - F(\beta)\alpha\|^2 = \frac{1}{2} e^T(\alpha, \beta)\, e(\alpha, \beta), \tag{1.2}$$

where $Y = (y_1^T, \ldots, y_N^T)^T$, $F(\beta) = (\Phi^T(z_1, \beta), \ldots, \Phi^T(z_N, \beta))^T$, $e(\alpha, \beta) = Y - F(\beta)\alpha$, and $\|\cdot\|$ is the Euclidean vector norm. It is required to find the parameters $\alpha$ and $\beta$ that minimize $J_N(\alpha, \beta)$:

$$(\alpha^*, \beta^*) = \arg\min J_N(\alpha, \beta), \quad \alpha \in R^r, \ \beta \in R^l.$$

To solve this problem we can use any numerical optimization method, for example, the iterative Gauss-Newton method (GNM), which is oriented toward nonlinear problems with a quadratic quality criterion, or its modification, the Levenberg-Marquardt algorithm

$$x_{i+1} = x_i + [J^T(x_i) J(x_i) + \mu I_{r+l}]^{-1} J^T(x_i)\, e(x_i), \quad i = 1, 2, \ldots, M,$$

where $x = (\beta^T, \alpha^T)^T \in R^{l+r}$, $J(x_i)$ is the Jacobi matrix of residues $e(x) = Y - F(\beta)\alpha$, $I_m$ is the identity $m \times m$ matrix, $\mu > 0$ is a parameter, and $M$ is the number of iterations. However, methods that take into account the specifics of the problem (the linear dependence of the observed output on some of the parameters) can be more effective. In Ref. [1] a separable least squares method (LSM) for estimating the parameters in Eq. (1.1) is proposed. The idea of the method is as follows. For a given $\beta$, the value of the parameter $\alpha$ is defined as the solution of the linear problem by the LSM,

$$\alpha = F^+(\beta) Y, \tag{1.3}$$

where $(\cdot)^+$ denotes the pseudoinverse of the corresponding matrix and $\alpha$ is the solution with minimum norm. Substituting $\alpha$ into Eq. (1.2), we come to a new nonlinear optimization problem, but only with respect to $\beta$:

$$\min \tilde{J}_N(\beta) = \frac{1}{2}\min \|(I_{Nm} - F(\beta) F^+(\beta)) Y\|^2 = \frac{1}{2}\min \|P_F(\beta) Y\|^2, \quad \beta \in R^l, \tag{1.4}$$

where $P_F(\beta) = I_{Nm} - F(\beta) F^+(\beta)$ is the projection matrix. The value of $\alpha$ is determined by substituting the obtained optimal value of $\beta$ into Eq. (1.3). The following statement shows that, under certain conditions, the above two-step algorithm does minimize the initial performance criterion Eq. (1.2), while reducing the number of parameters from $r + l$ to $l$.


Theorem 1.1 ([1]). Assume that the matrix $F(\beta)$ has a constant rank over an open set $\Omega \subset R^l$.
1. If $\beta^* \in \Omega$ is a minimizer of $\tilde{J}_N(\beta)$ and $\alpha^* = F^+(\beta^*) Y$, then $(\alpha^*, \beta^*)$ is also a minimizer of $J_N(\alpha, \beta)$.
2. If $(\alpha^*, \beta^*)$ is a minimizer of $J_N(\alpha, \beta)$ for $\beta \in \Omega$, then $\beta^*$ is a minimizer of $\tilde{J}_N(\beta)$ in $\Omega$ and $\tilde{J}_N(\beta^*) = J_N(\alpha^*, \beta^*)$. Furthermore, if there is a unique $\alpha^*$ among the minimizing pairs of $J_N(\alpha, \beta)$, then $\alpha^*$ must satisfy $\alpha^* = F^+(\beta^*) Y$.

To determine $\beta$ from the minimum condition for $\tilde{J}_N(\beta)$, as in the minimization of $J_N(\alpha, \beta)$, any numerical method can be used. However, the criterion $\tilde{J}_N(\beta)$ has a much more complicated structure than $J_N(\alpha, \beta)$, so it is important to have an efficient procedure for finding the Jacobi matrix of the residues $e(\beta) = P_F(\beta) Y$. In Ref. [9] the following analytical representation for the columns of the Jacobi matrix was obtained:

$$\{J\}_j = -\left[P_F(\beta)\frac{\partial F(\beta)}{\partial \beta_j} F^-(\beta) + \left(P_F(\beta)\frac{\partial F(\beta)}{\partial \beta_j} F^-(\beta)\right)^T\right] Y, \tag{1.5}$$

where $F^-(\beta)$ is a generalized pseudoinverse matrix of $F(\beta)$, which satisfies the conditions

$$F(\beta) F^-(\beta) F(\beta) = F(\beta), \quad (F(\beta) F^-(\beta))^T = F(\beta) F^-(\beta).$$

Although the criterion J~N ðβÞ seems to be more complicated than JN ðα; βÞ, in Ref. [2] it is noted that the number of arithmetic operations required for the implementation of the separable LSM based on Eq. (1.5) is not more than in the GNM to be used for minimization of JN ðα; βÞ. In addition to the reduction of the dimension of the estimated parameter vector, the separable LSM decreases the conditionality matrix in the GNM in comparison with its prototype [4]. This may reduce the number of iterations required to obtain solutions with a given accuracy. Note also that using the LSM to minimize J~N ðβÞ it is not required to have a priori information about the incoming linearly parameters which are interpreted as unknown constants.
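The two-step idea of Eqs. (1.3) and (1.4) can be sketched on a toy problem. The block below is a minimal illustration under assumptions that are not in the source: a one-term model $y = \alpha \exp(-\beta t)$ with scalar $\alpha$ and $\beta$, noise-free data, and a golden-section search as the (arbitrarily chosen) scalar minimizer of the reduced criterion. The names `reduced_criterion` and `golden_section` are hypothetical.

```python
import math

def reduced_criterion(beta, t, y):
    """Reduced criterion (1.4) for the assumed model y = alpha * exp(-beta * t)."""
    f = [math.exp(-beta * tk) for tk in t]              # the single column F(beta)
    ff = sum(fk * fk for fk in f)
    alpha = sum(fk * yk for fk, yk in zip(f, y)) / ff   # Eq. (1.3): alpha = F^+(beta) Y
    j = 0.5 * sum((yk - alpha * fk) ** 2 for fk, yk in zip(f, y))
    return j, alpha

def golden_section(fun, lo, hi, tol=1e-10):
    """Minimize a unimodal scalar function on [lo, hi] (assumed search method)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if fun(c) < fun(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

# Synthetic noise-free data generated with alpha = 2.0, beta = 1.5
t = [0.1 * k for k in range(30)]
y = [2.0 * math.exp(-1.5 * tk) for tk in t]

# Step 1: minimize over beta only; step 2: recover alpha from Eq. (1.3)
beta_hat = golden_section(lambda b: reduced_criterion(b, t, y)[0], 0.5, 3.0)
alpha_hat = reduced_criterion(beta_hat, t, y)[1]
```

Only the nonlinear parameter is searched for; the linear one never appears as an optimization variable, which is the dimension reduction from $r + l$ to $l$ described above.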

1.1.2 Perceptron With One Hidden Layer

Consider a neural network (NN) with a nonlinear activation function (AF) in the hidden layer and a linear AF in the output layer [10]

$$y_{it} = \sum_{k=1}^{p} w_{ik}\,\sigma\Big(\sum_{j=1}^{q} a_{kj}z_{jt} + b_k\Big), \quad i = 1, 2, \ldots, m, \quad t = 1, 2, \ldots, N, \tag{1.6}$$

where $z_{jt}$, $j = 1, 2, \ldots, q$, $t = 1, 2, \ldots, N$ are inputs, $y_{it}$, $i = 1, 2, \ldots, m$, $t = 1, 2, \ldots, N$ are outputs, $a_{kj}$, $b_k$, $k = 1, 2, \ldots, p$, $j = 1, 2, \ldots, q$ are weights and biases of the hidden layer, $w_{ik}$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$ are weights of the output layer, and $\sigma(x)$ is the AF of the hidden layer. It is usually assumed that the AF is the sigmoid or the hyperbolic tangent function

$$\sigma(x) = (1 + \exp(-x))^{-1}, \quad \sigma(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)},$$

which are continuously differentiable and have simple expressions for the derivatives. In applications of NNs, one of the basic assumptions is that they can describe a sufficiently wide class of nonlinear systems. It is well known that if the AF is the sigmoid or the hyperbolic tangent function, the NN Eq. (1.6) can approximate continuous functions on compact sets with any accuracy, provided there is a sufficiently large number of neurons in the hidden layer [11,12]. Using vector-matrix notation, we obtain a more compact representation of the NN

$$y_t = W(\sigma(a_1 z_t), \ldots, \sigma(a_p z_t))^T, \quad t = 1, 2, \ldots, N, \tag{1.7}$$

where

$$y_t = (y_{1t}, y_{2t}, \ldots, y_{mt})^T \in R^m, \quad W = (w_1, w_2, \ldots, w_m)^T \in R^{m \times p}, \quad z_t = (z_{1t}, z_{2t}, \ldots, z_{qt}, 1)^T \in R^{q+1},$$

$$w_i = (w_{i1}, w_{i2}, \ldots, w_{ip})^T, \quad a_k = (a_{k1}, a_{k2}, \ldots, a_{kq}, b_k), \quad i = 1, 2, \ldots, m, \quad k = 1, 2, \ldots, p.$$

The NN Eq. (1.6) is a special case of the SR in which the weights of the output layer enter linearly and the weights and biases of the hidden layer enter nonlinearly. Introducing the notation

$$\alpha = (w_1^T, w_2^T, \ldots, w_m^T)^T \in R^{mp}, \quad \beta = (a_1^T, a_2^T, \ldots, a_p^T)^T \in R^{(q+1)p}, \quad \Phi(z_t, \beta) = I_m \otimes \Sigma(z_t, \beta), \tag{1.8}$$

where $\Phi(z_t, \beta)$ is an $m \times mp$ matrix, $\Sigma(z_t, \beta) = (\sigma(a_1 z_t), \sigma(a_2 z_t), \ldots, \sigma(a_p z_t))$, and $\otimes$ is the direct product of two matrices, we obtain the representation of the NN Eq. (1.6) in the form Eq. (1.1).

Given a training set $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$, the training of the NN Eq. (1.6) can be considered as the problem of minimizing the mean-square error with respect to the unknown parameters (weights and biases) in its description. The backpropagation algorithm (BPA) is successfully used to train Eq. (1.6). At the same time, the slow convergence inherent in the BPA is well known; it considerably complicates the use of the algorithm in complex problems or makes it practically impossible. Numerous publications devoted to NNs propose learning algorithms whose convergence rate and approximation accuracy exceed those of the BPA. In Refs. [13,14], Levenberg–Marquardt and quasi-Newton methods are applied, which use information on the matrix of second-order derivatives of the quality criterion. In Refs. [15–18], training algorithms are based on the extended Kalman filter (EKF) with an approximate covariance matrix of the estimation error. In Refs. [3,4,19–21], algorithms that take into account the separable structure of the network are presented for NNs with a linear AF in the output layer. According to the variable projection algorithm [1], the initial optimization problem is transformed into an equivalent problem with respect only to the parameters that enter nonlinearly (weights and biases of the hidden layer). In this case the problem becomes lower-dimensional and better conditioned, which makes it possible to decrease the number of iterations needed to obtain a solution. At the same time, this approach can be used only in batch mode, and even additive measurement errors of the NN outputs enter the transformed criterion nonlinearly. Moreover, determining the partial derivatives of the criterion with respect to the parameters becomes considerably more complicated. In Refs. [7,22], the extreme learning machine (ELM) algorithm is proposed, in which only the linearly entering parameters (weights of the output layer) are trained, and the nonlinear ones are chosen randomly without taking the training set into account. This decreases learning time but can lead to low approximation accuracy. An NN trained in this way is called functionally connected [23], and it can approximate any continuous function on compact sets provided that $m = 1$.
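The separable structure in Eqs. (1.7) and (1.8) can be made concrete with a short sketch. The code below is an illustration only: the function names, the sigmoid AF, and all numerical values are assumptions chosen for the example. It evaluates the row $\Sigma(z_t, \beta)$ and checks numerically that, for fixed hidden-layer parameters, the output depends linearly on the output-layer weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_row(z, hidden):
    """Sigma(z_t, beta): hidden-layer outputs; each row of `hidden` is a_k = (a_k1..a_kq, b_k)."""
    z_ext = list(z) + [1.0]                      # z_t extended with 1 to carry the bias
    return [sigmoid(sum(a * v for a, v in zip(a_k, z_ext))) for a_k in hidden]

def perceptron(z, hidden, w_out):
    """y_t = W * Sigma^T as in Eq. (1.7); linear in w_out when `hidden` is fixed."""
    s = hidden_row(z, hidden)
    return [sum(w * v for w, v in zip(w_i, s)) for w_i in w_out]

# Hypothetical network: q = 2 inputs, p = 2 hidden neurons, m = 1 output
hidden = [[0.5, -1.0, 0.1], [1.5, 0.3, -0.2]]
w1 = [[1.0, 2.0]]
w2 = [[-0.5, 0.25]]
z = [0.3, -0.7]

y1 = perceptron(z, hidden, w1)[0]
y2 = perceptron(z, hidden, w2)[0]
w_sum = [[w1[0][k] + w2[0][k] for k in range(2)]]
y_sum = perceptron(z, hidden, w_sum)[0]          # equals y1 + y2 up to rounding
```

This linearity is what the separable methods and the ELM exploit: once the hidden layer is fixed, finding the output weights is an ordinary linear least squares problem.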

1.1.3 Radial Basis Neural Network

For a radial basis neural network (RBNN), the relationships connecting its inputs and outputs are specified by the expressions [10]

$$y_{it} = \sum_{k=1}^{p} w_{ik}\,\phi(b_k\|z_t - a_k\|^2) + w_{i0}, \quad i = 1, 2, \ldots, m, \quad t = 1, 2, \ldots, N, \tag{1.9}$$

where $z_t = (z_{1t}, z_{2t}, \ldots, z_{nt})^T \in R^n$ is a vector of inputs, $y_{it}$, $i = 1, 2, \ldots, m$, $t = 1, 2, \ldots, N$ are outputs, $a_k \in R^n$, $b_k \in R^{+}$, $k = 1, 2, \ldots, p$ are centers and scale factors, respectively, $R^{+}$ is the set of positive real numbers, $w_{ik}$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$ are weights of the output layer, $w_{i0}$, $i = 1, 2, \ldots, m$ are biases, and $\phi(\cdot)$ is a basis function. If the basis function is the Gauss function

$$\phi(x) = \exp(-(x - a)^2/b),$$

then the RBNN has a universal approximating property [10]. Given a training set $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$, the training of the RBNN Eq. (1.9) can be considered as the problem of minimizing the mean-square error with respect to the unknown parameters (weights, centers, and scale factors) in its description. Using vector-matrix notation, we obtain the following more compact representation of the RBNN

$$y_t = W(1, \phi(b_1\|z_t - a_1\|^2), \ldots, \phi(b_p\|z_t - a_p\|^2))^T, \quad t = 1, 2, \ldots, N, \tag{1.10}$$

where $y_t = (y_{1t}, y_{2t}, \ldots, y_{mt})^T \in R^m$ is a vector of outputs, $W = (w_1, w_2, \ldots, w_m)^T \in R^{m \times (p+1)}$, and $w_i = (w_{i0}, w_{i1}, \ldots, w_{ip})^T$, $i = 1, 2, \ldots, m$. It is seen that the perceptron with one hidden layer and the RBNN have a similar structure: one hidden layer and one output layer. Moreover, the RBNN is an SR in which the linearly entering parameters $w_{ik}$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$ are weights of the output layer and the nonlinearly entering ones are the centers $a_k$ and scale factors $b_k$, $k = 1, 2, \ldots, p$. Introducing the notation

$$\alpha = (w_1^T, w_2^T, \ldots, w_m^T)^T \in R^{m(p+1)}, \quad \beta = (b_1, b_2, \ldots, b_p, a_1^T, a_2^T, \ldots, a_p^T)^T \in R^{(n+1)p},$$

$$\Phi(z_t, \beta) = I_m \otimes \Sigma(z_t, \beta), \quad \Sigma(z_t, \beta) = (1, \phi(b_1\|z_t - a_1\|^2), \ldots, \phi(b_p\|z_t - a_p\|^2)), \tag{1.11}$$

where $\Phi(z_t, \beta)$ is an $m \times m(p+1)$ matrix, we obtain the representation Eq. (1.9) in the form Eq. (1.1).

There are two approaches to training the RBNN [10,24]. In the first, the parameters are determined by methods of nonlinear parametric optimization, such as the Levenberg–Marquardt algorithm and the EKF, which require significant computational effort. The second approach is based on separate determination of the parameters: first the centers and scale factors are determined, and then the weights of the output layer. This leads to a substantial reduction in training time in comparison with the first approach. Indeed, if the centers and scale factors of the RBNN have already been found, determining the weights of the output layer reduces to a linear LSM problem. In Ref. [10] it is proposed to choose the centers randomly from the training set and to set the scale factors equal to the maximum distance between centers. In Ref. [22] it is proposed to choose the parameters of the hidden layer randomly from their domains of definition. It is shown that the RBNN trained in this way is a universal approximator under fairly general assumptions about the basis functions and for $m = 1$. However, speeding up the learning process may lead, as in the case of the two-layer perceptron, to divergence.
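The second (separate) approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Refs. [10,22]: the inputs are scalar, $\phi(x) = \exp(-x)$, all neurons share one scale factor set by the assumed rule $b = p/d_{max}^2$ (with $d_{max}$ the maximum distance between centers), and the output weights are found from the normal equations with a tiny ridge term added purely for numerical safety. The helper names `solve` and `features` are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def features(z, centers, b):
    """Sigma(z, beta) = (1, phi(b|z-a_1|^2), ..., phi(b|z-a_p|^2)) with phi(x) = exp(-x)."""
    return [1.0] + [math.exp(-b * (z - a) ** 2) for a in centers]

random.seed(0)
zs = [k / 20.0 for k in range(41)]               # scalar training inputs on [0, 2]
ys = [math.sin(3.0 * z) for z in zs]             # synthetic targets

# Stage 1: centers drawn at random from the training set, one shared scale factor
centers = random.sample(zs, 6)
d_max = max(centers) - min(centers)
b = len(centers) / d_max ** 2                    # assumed width rule, see lead-in

# Stage 2: linear least squares for the output weights (bias first)
rows = [features(z, centers, b) for z in zs]
p1 = len(centers) + 1
gram = [[sum(r[i] * r[j] for r in rows) for j in range(p1)] for i in range(p1)]
for i in range(p1):
    gram[i][i] += 1e-9                           # small ridge, numerical safeguard only
rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p1)]
w = solve(gram, rhs)
mse = sum((sum(wi * ri for wi, ri in zip(w, r)) - y) ** 2
          for r, y in zip(rows, ys)) / len(ys)
```

Because the hidden layer is frozen after stage 1, the quality of the fit depends on how well the random centers happen to cover the input range, which is the trade-off between training speed and accuracy mentioned above.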

1.1.4 Neuro-Fuzzy Network

One of the main methods of constructing fuzzy systems is the neuro-fuzzy approach [5]. Its basic idea is that, in addition to the expert knowledge stored in the knowledge base (rules, linguistic variables, membership functions), numerical information is used in the form of a set of input values and the corresponding output values. A criterion (usually quadratic) is introduced that measures the error between the experimental data and the outputs of the neuro-fuzzy system (NFS). Then the membership functions (MFs) and outputs are parameterized, and a nonlinear optimization problem with respect to the selected parameters is solved in some way. The motivation for using fuzzy rules jointly with experimental data is that, unlike the perceptron, they are easily interpreted and extended. At the same time, fuzzy models, like the multilayer perceptron, can approximate any continuous function defined on a compact set with arbitrary precision. The effectiveness of the resulting fuzzy system largely depends on the optimization algorithm. Let us consider in more detail models of the NFS and known algorithms for their training.


Let a knowledge base in the Sugeno form be specified:

$$\text{if } (z_{1t} \text{ is } A_{1j}) \text{ and if } (z_{2t} \text{ is } A_{2j}) \text{ and } \ldots \text{ if } (z_{nt} \text{ is } A_{nj}) \text{ then } y_t = a_j^T z_t + b_j, \quad j = 1, 2, \ldots, m, \tag{1.12}$$

where $A_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$ are fuzzy sets with parameterized MFs $\mu_{ij}(z_{it}, c_{ij})$; $c_{ij}$, $a_j$, $b_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$ are vectors of unknown parameters; $y_t$, $t = 1, 2, \ldots, N$ is a scalar output; and $z_t = (z_{1t}, z_{2t}, \ldots, z_{nt})^T \in R^n$ is a vector of inputs. Then the output determined by the $m$ rules Eq. (1.12) is given by the expression [25]

$$y_t = \frac{\sum_{j=1}^{m}\prod_{i=1}^{n}\mu_{ij}(z_{it}, c_{ij})\,(a_j^T z_t + b_j)}{\sum_{j=1}^{m}\prod_{i=1}^{n}\mu_{ij}(z_{it}, c_{ij})}, \quad t = 1, 2, \ldots, N, \tag{1.13}$$

which is an SR. In Eq. (1.13) the MF parameters enter nonlinearly and the consequent parameters enter linearly. The representation Eq. (1.1) follows from Eq. (1.13) with

$$\alpha = (a_1^T, a_2^T, \ldots, a_m^T, b_1, b_2, \ldots, b_m)^T \in R^{m(n+1)}, \quad \beta = (c_{11}^T, c_{21}^T, \ldots, c_{nm}^T)^T,$$

$$\Phi(z_t, \beta) = (q_1(\beta)z_t^T, q_2(\beta)z_t^T, \ldots, q_m(\beta)z_t^T, q_1(\beta), q_2(\beta), \ldots, q_m(\beta)),$$

$$q_j(\beta) = \frac{\prod_{i=1}^{n}\mu_{ij}(z_{it}, c_{ij})}{\sum_{s=1}^{m}\prod_{i=1}^{n}\mu_{is}(z_{it}, c_{is})}, \quad j = 1, 2, \ldots, m.$$

There are various algorithms for estimating the MF parameters and the parameters of the rule consequents of the NFS. In Refs. [5,6] the ANFIS system is presented. It includes the BPA and a hybrid learning procedure that combines the BPA and the recursive LSM (RLSM). In Ref. [26] the EKF is used for training with triangular MFs. Algorithms based on cluster analysis and evolutionary programming are developed in Refs. [5,27,28]. In many cases these approaches have serious problems: they either fail to converge or converge slowly and, moreover, only to a local minimum. It is known that convergence depends on the algorithm used and on the initial conditions selected for the parameters. At the same time, the structure of the NFS is such that a priori information about the consequent parameters is absent, and therefore it is not clear how to initialize the chosen algorithm reasonably.
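The output Eq. (1.13) can be evaluated with a few lines of code. The sketch below is illustrative only: the Gaussian shape of the MFs, the parameter layout `mf_params[j][i]` (rule `j`, input `i`), and all numerical values are assumptions made for the example, since the source leaves the MF parameterization open.

```python
import math

def gauss_mf(z, c):
    """Assumed Gaussian membership function with parameters c = (center, width)."""
    center, width = c
    return math.exp(-((z - center) / width) ** 2)

def firing_strengths(z, mf_params):
    """Normalized rule weights q_j(beta); mf_params[j][i] parameterizes the MF of input i in rule j."""
    raw = [math.prod(gauss_mf(zi, c) for zi, c in zip(z, rule)) for rule in mf_params]
    total = sum(raw)
    return [r / total for r in raw]

def sugeno_output(z, mf_params, a, b):
    """Eq. (1.13): weighted average of the linear consequents a_j^T z + b_j."""
    q = firing_strengths(z, mf_params)
    return sum(qj * (sum(ai * zi for ai, zi in zip(aj, z)) + bj)
               for qj, aj, bj in zip(q, a, b))

# Hypothetical system with m = 2 rules and n = 2 inputs
mf_params = [[(0.0, 1.0), (0.0, 1.0)], [(1.0, 0.5), (-1.0, 2.0)]]
z = [0.4, -0.2]
q = firing_strengths(z, mf_params)

# With identical consequents in every rule the output reduces to a^T z + b
y_same = sugeno_output(z, mf_params, [[1.0, 2.0], [1.0, 2.0]], [0.5, 0.5])
```

For fixed MF parameters the output is a linear function of the consequent parameters, so they play the role of $\alpha$ in Eq. (1.1), while the MF parameters play the role of $\beta$.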


1.1.5 Plant Models With Time Delays

One of the standard approaches to building models of dynamic systems is to interpret $z_t$ in Eq. (1.1) as a regressor defined on the values of inputs and outputs measured up to the current time [10]. Suppose that at times $t = 1, 2, \ldots, N$ the input $u_t \in R^n$ and output $y_t \in R^m$ values of some system are measured. Let us introduce the regressor vector

$$Z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-a}, u_{t-d}, u_{t-d-1}, \ldots, u_{t-d-b}),$$

where $a$, $b$, $d$ are some positive integers. A nonlinear autoregressive moving average model connecting the input $Z_t$ and the output $y_t$ of the system can be represented in the form

$$y_t = F(Z_t, \theta) + \xi_t, \quad t = 1, 2, \ldots, N, \tag{1.14}$$

where $F(\cdot, \cdot)$ is a given nonlinear function, $\theta$ is a vector of unknown parameters, and $\xi_t \in R^m$ is a random process with uncorrelated values, zero expectation, and covariance matrix $V_t = E[\xi_t\xi_t^T]$ that characterizes output measurement errors and approximation errors. The model Eq. (1.14) predicts the behavior of the system one step ahead. If more than one step is to be predicted, we can use Eq. (1.14) and $\hat y_t$ to obtain $\hat y_{t+1}$. This procedure can be repeated $h$ times to predict $h$ steps ahead. We can also directly build a model of the form

$$y_{t+h} = F(Z_t^h, \theta) + \xi_t, \quad t = 1, 2, \ldots, N, \tag{1.15}$$

to predict $h$ steps ahead, where

$$Z_t^h = (y_{t-1}, y_{t-2}, \ldots, y_{t-a}, u_{t+h-d}, u_{t+h-d-1}, \ldots, u_{t-d-b}).$$

The advantages and disadvantages of each of these approaches are well known [29]. We can use the NN or the NFS to choose $F(\cdot, \cdot)$ in Eq. (1.14), and we are interested in models of the form Eq. (1.14) that can be reduced to the SR Eq. (1.1).
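The first of the two approaches, repeated use of the one-step model Eq. (1.14), can be sketched as below. The example assumes the simplest regressor ($a = 1$, $b = 0$, $d = 1$, scalar signals) and a hypothetical linear one-step model; `predict_ahead` and `model` are illustrative names, not part of the source.

```python
def predict_ahead(one_step, y_last, u_seq, h):
    """Iterate a one-step predictor h times, feeding each prediction back as the past output.

    one_step(y_prev, u_prev) returns the predicted next output.
    y_last is the last measured output; u_seq holds the inputs for the next h steps.
    """
    y = y_last
    preds = []
    for k in range(h):
        y = one_step(y, u_seq[k])   # the prediction replaces the unavailable measurement
        preds.append(y)
    return preds

def model(y_prev, u_prev):
    """Hypothetical one-step model, used only for illustration."""
    return 0.5 * y_prev + u_prev

preds = predict_ahead(model, 1.0, [1.0, 1.0, 1.0], 3)
```

In the iterated scheme the prediction errors are fed back into the regressor and can accumulate with $h$, whereas the direct model Eq. (1.15) needs a separate model for each horizon.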

1.1.6 Systems With Partly Unknown Dynamics

Consider the case where the mathematical description of the studied system is given in the form of an incomplete state-space model. This happens, for example, when the right-hand sides are defined up to unknown parameters or when the right-hand sides of the prior model differ from the real ones. Let a system model have the form

$$x_{t+1} = A_t(\theta)x_t + B_t(\theta)u_t, \tag{1.16}$$

$$y_t = C_t(\theta)x_t + D_t(\theta)u_t, \quad t = 1, 2, \ldots, N, \tag{1.17}$$

where $x_t \in R^n$ is a state vector, $y_t \in R^m$ is a vector of measured outputs, $u_t \in R^l$ is a vector of measured inputs, $\theta \in R^r$ is a vector of unknown parameters, and $A_t(\theta)$, $B_t(\theta)$, $C_t(\theta)$, $D_t(\theta)$ are given matrix functions of corresponding dimensions. It is required to estimate $\theta$ using the observations $\{u_t, y_t\}$, $t = 1, 2, \ldots, N$. Let us show that this problem can be reduced to the estimation of the regression parameters $(x_1, \theta)$, where $x_1$ enters linearly and $\theta$ nonlinearly. Indeed, from Eq. (1.16) we find

$$x_t = H_{t,1}(\theta)x_1 + \sum_{i=1}^{t-1}H_{t,i+1}(\theta)B_i(\theta)u_i, \tag{1.18}$$

where the transition matrix $H_{t,s}(\theta)$ is the solution of the matrix equation $H_{t+1,s}(\theta) = A_t(\theta)H_{t,s}(\theta)$, $H_{s,s}(\theta) = I_n$, $t \ge s$, $t = 1, 2, \ldots, N$. Substituting Eq. (1.18) into the observation Eq. (1.17) gives

$$y_t = C_t(\theta)H_{t,1}(\theta)x_1 + C_t(\theta)\sum_{i=1}^{t-1}H_{t,i+1}(\theta)B_i(\theta)u_i + D_t(\theta)u_t = \Phi_t(\theta)x_1 + \phi_t(\theta), \quad t = 1, 2, \ldots, N.$$

Suppose that the quality criterion takes the form

$$J(x_1, \theta) = \frac{1}{2}\|Y - \Phi(\theta)x_1 - \phi(\theta)\|^2, \tag{1.19}$$

where

$$Y = (y_1^T, y_2^T, \ldots, y_N^T)^T, \quad \Phi(\theta) = (\Phi_1^T(\theta), \Phi_2^T(\theta), \ldots, \Phi_N^T(\theta))^T, \quad \phi(\theta) = (\phi_1^T(\theta), \phi_2^T(\theta), \ldots, \phi_N^T(\theta))^T.$$

To determine the vector $(x_1, \theta)$ by minimizing Eq. (1.19), we can use the idea of the separable LSM and find $x_1$ with the help of the LSM for a fixed $\theta$. As a result we get

$$x_1 = \Phi^{+}(\theta)(Y - \phi(\theta)).$$

Substituting this expression into Eq. (1.19) leads to the criterion

$$\tilde J(\theta) = \frac{1}{2}\|Y - \Phi(\theta)\Phi^{+}(\theta)(Y - \phi(\theta)) - \phi(\theta)\|^2$$

and to a new optimization problem with respect to $\theta$ only.
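A scalar special case, worked out here for illustration and not taken from the source, shows what the reduced criterion looks like. Assume the time-invariant system $x_{t+1} = \theta x_t$, $y_t = x_t$ with no input, so that $A_t = \theta$, $C_t = 1$, and $B_t = D_t = 0$. Then

$$H_{t,1}(\theta) = \theta^{t-1}, \qquad \Phi_t(\theta) = \theta^{t-1}, \qquad \phi_t(\theta) = 0,$$

$$x_1(\theta) = \Phi^{+}(\theta)Y = \frac{\sum_{t=1}^{N}\theta^{t-1}y_t}{\sum_{t=1}^{N}\theta^{2(t-1)}}, \qquad \tilde J(\theta) = \frac{1}{2}\sum_{t=1}^{N}\big(y_t - \theta^{t-1}x_1(\theta)\big)^2.$$

The denominator is never smaller than 1 (the $t = 1$ term), so $x_1(\theta)$ is well defined for every $\theta$, and the search runs over the single parameter $\theta$ instead of the pair $(x_1, \theta)$.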

In Refs. [30,31] conditions are obtained under which Theorem 1.1 can be applied to stationary systems Eq. (1.16). Note that this work uses a different parameterization, leading to an SR of standard form. In Ref. [31], a recursive algorithm for minimizing the criterion $\tilde J_N(\beta)$ is proposed.

Consider now a stochastic nonlinear discrete system of the form

$$x_{t+1} = f_t(x_t, u_t) + w_t, \tag{1.20}$$

$$y_t = h_t(x_t, u_t) + \xi_t, \quad t = 1, 2, \ldots, N, \tag{1.21}$$

where $x_t \in R^n$ is a state vector, $y_t \in R^m$ is a vector of measured outputs, $u_t \in R^l$ is a vector of measured inputs, $f_t(x_t, u_t)$, $h_t(x_t, u_t)$ are a priori defined vector functions, and $w_t$, $\xi_t$ are random perturbations with known statistical properties. Let the functions $f_t(x_t, u_t)$ and $h_t(x_t, u_t)$ differ from the actual ones,

$$\tilde f_t(x_t, u_t) = f_t(x_t, u_t) + \varepsilon_t, \quad \tilde h_t(x_t, u_t) = h_t(x_t, u_t) + \eta_t,$$

where $\varepsilon_t$, $\eta_t$ are some unknown functions. In Refs. [32–34], it is proposed to approximate the functions $\varepsilon_t$, $\eta_t$ with two multilayer perceptrons and to estimate their parameters simultaneously with the state $x_t$.

1.1.7 Recurrent Neural Network

Consider a recurrent neural network (RNN) of the form [10]

$$x_t = \begin{pmatrix}\sigma(a_1^T x_{t-1} + b_1^T z_t + d_1)\\ \vdots\\ \sigma(a_q^T x_{t-1} + b_q^T z_t + d_q)\end{pmatrix}, \tag{1.22}$$

$$y_t = \begin{pmatrix}c_1^T x_t\\ \vdots\\ c_m^T x_t\end{pmatrix}, \quad t = 1, 2, \ldots, N, \tag{1.23}$$

where $x_t \in R^q$ is a state vector, $z_t \in R^n$ is a vector of measured inputs, $y_t \in R^m$ is a vector of measured outputs, $a_i \in R^q$, $b_i \in R^n$, $d_i \in R^1$, $c_i \in R^q$, and $\sigma(x)$ is the sigmoid or the hyperbolic tangent function. In more compact form,

$$x_t = \sigma(Ax_{t-1} + Bz_t + d), \tag{1.24}$$

$$y_t = Cx_t, \quad t = 1, 2, \ldots, N, \tag{1.25}$$

where $A \in R^{q \times q}$, $B \in R^{q \times n}$, $d \in R^q$, $C \in R^{m \times q}$. It is required to estimate $A$, $B$, $C$, $d$ from the observations $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$.

We note two properties of the RNN that are important for us. First, it can approximate with arbitrary accuracy the solutions of nonlinear dynamical systems on finite intervals and compact subsets of the state space [35]. Second, the RNN Eqs. (1.22) and (1.23) belongs to the class of separable models. Indeed, $A$, $B$, $d$ enter its description nonlinearly and $C$ enters linearly. In Ref. [10], training algorithms for the RNN are given. They are generalizations of the BPA and the EKF.

1.1.8 Neurocontrol

Fig. 1.1 shows, as an example, a functional scheme of a supervisory control system [36] with inverse dynamics. We assume that the inverse model has the form of the SR

$$u_t^{s} = \Phi(Z_t, U_{t-1}, \beta)\alpha, \quad t = 1, 2, \ldots, N, \tag{1.26}$$

where $Z_t = (z_t, z_{t-1}, \ldots, z_{t-a})^T$, $U_{t-1} = (u_{t-1}^{s}, u_{t-2}^{s}, \ldots, u_{t-b}^{s})^T$. The desired trajectory is used as the input, and feedback is used to ensure the stability of the system. The parameters $\beta$, $\alpha$ can be estimated online or offline.

[Figure: block diagram with the blocks SC, PD, and Plant and the signals $z_t$, $e_t$, $u_t^{SC}$, $u_t^{PD}$, $y_t$.]

Figure 1.1 Functional scheme of the supervisory control system.


1.2 THE RECURSIVE LEAST SQUARES ALGORITHM WITH DIFFUSE AND SOFT INITIALIZATIONS

Assume that the observation model is of the form

$$y_t = C_t\alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{1.27}$$

where $y_t \in R^1$ is a measured output, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_r)^T \in R^r$ is a vector of unknown parameters, $C_t = (c_{1t}, c_{2t}, \ldots, c_{rt}) \in R^{1 \times r}$, $c_{it}$, $i = 1, 2, \ldots, r$ are regressors (inputs), $\xi_t \in R^1$ is a random process with uncorrelated values, zero expectation, and variance $R_t = E[\xi_t^2]$, and $N$ is the sample size. The regressors in Eq. (1.27) can be specified by deterministic or random functions. The autoregressive moving average model

$$y_t = a_1 y_{t-1} + a_2 y_{t-2} + \ldots + a_m y_{t-m} + b_1 z_{t-1} + b_2 z_{t-2} + \ldots + b_n z_{t-n} + \xi_t, \quad t = 1, 2, \ldots, N, \tag{1.28}$$

is one of the most important examples of the application of Eq. (1.27), with

$$C_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-m}, z_{t-1}, z_{t-2}, \ldots, z_{t-n}), \quad \alpha = (a_1, a_2, \ldots, a_m, b_1, b_2, \ldots, b_n)^T.$$

Suppose that a sample $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$ and the quadratic quality criterion that takes into account the information on Eq. (1.28) available up to time $t$ are given:

$$J_t(\alpha) = \sum_{k=1}^{t}\lambda^{t-k}(y_k - C_k\alpha)^2, \quad t = 1, 2, \ldots, N, \tag{1.29}$$

where $\lambda \in (0, 1]$ is a forgetting factor, a parameter that reduces the influence of past observations. We seek an estimate of the unknown parameter by minimizing Eq. (1.29) with the RLSM, which, as is well known, is determined by the relations [37,38]

$$\alpha_t = \alpha_{t-1} + K_t(y_t - C_t\alpha_{t-1}), \tag{1.30}$$

$$P_t = (P_{t-1} - P_{t-1}C_t^T(\lambda + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1})/\lambda, \tag{1.31}$$

$$K_t = P_tC_t^T, \quad t = 1, 2, \ldots, N, \tag{1.32}$$
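The relations (1.30)–(1.32) translate directly into code. The sketch below is a minimal pure-Python illustration for scalar output; it assumes an initial estimate of zero and an initial matrix $P_0 = \mu I_r$, where the user-chosen scalar `mu` and the function name `rls` are assumptions of the example rather than part of the relations themselves.

```python
def rls(regressors, outputs, mu, lam=1.0):
    """RLSM (1.30)-(1.32) for scalar output, started from alpha_0 = 0 and P_0 = mu * I."""
    r = len(regressors[0])
    alpha = [0.0] * r
    p = [[mu if i == j else 0.0 for j in range(r)] for i in range(r)]
    for c, y in zip(regressors, outputs):
        pc = [sum(p[i][j] * c[j] for j in range(r)) for i in range(r)]   # P_{t-1} C_t^T
        denom = lam + sum(c[i] * pc[i] for i in range(r))                # lam + C_t P_{t-1} C_t^T
        p = [[(p[i][j] - pc[i] * pc[j] / denom) / lam for j in range(r)]
             for i in range(r)]                                          # Eq. (1.31)
        k = [sum(p[i][j] * c[j] for j in range(r)) for i in range(r)]    # Eq. (1.32)
        err = y - sum(c[i] * alpha[i] for i in range(r))
        alpha = [alpha[i] + k[i] * err for i in range(r)]                # Eq. (1.30)
    return alpha, p

# Synthetic noise-free data generated with alpha = (2, -1) and C_t = (1, t)
regressors = [[1.0, float(t)] for t in range(1, 11)]
outputs = [2.0 - t for t in range(1, 11)]
alpha_hat, p_final = rls(regressors, outputs, mu=1e6)
```

Because the output is scalar, the update needs only matrix-vector products and one scalar division per step; no matrix inversion is performed.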

where $\alpha_t$ is the optimal estimate of $\alpha$ derived from all observations available up to the moment $t$. The RLSM is successfully applied in a variety of problems related to adaptive filtering, plant identification, recognition, and adaptive control, owing to its high rate of convergence to the optimal solution and its ability to work under varying conditions [38,39]. However, in order to use it, one must specify the initial conditions for the vector $\alpha_t$ and the matrix $P_t$ (that is, initialize the algorithm). There are several ways to specify them [38]. One variant is to determine them with the standard (nonrecursive) LSM from an initial block of the data $\{z_t, y_t\}$. The disadvantages of this approach in real-time processing are quite obvious: the need to store observations, additional matrix operations including matrix inversion, and the need to coordinate the operation of the RLSM with the obtained estimates $\alpha_0$ and $P_0$. In another approach (soft initialization) we set $\alpha_0 = 0$ and $P_0 = \mu I_r$, where $\mu > 0$ is a parameter selected by the user. The limiting case $\mu \to \infty$ of the soft initialization will be called the diffuse initialization of the RLSM. Simulations have shown [40,41] that the estimates of the RLSM are subject to strong fluctuations in the initialization period (the transition stage)

$$t \le t_{tr} = \min_t\Big\{t : \sum_{k=1}^{t}C_k^TC_k > 0,\ t = 1, 2, \ldots, N\Big\}$$

for arbitrarily small noise, any regression order, and large values of $\mu > 0$. In addition, large values of $\mu$ can lead to divergence of the algorithm for $t > t_{tr}$ [41]. This phenomenon has been studied in several publications [42–46]. In Ref. [42], a theoretical justification of this behavior in the transition phase is given for autoregressive moving average models. In Ref. [43], conditions are derived under which the value $\beta_t(\mu) = E[e_t^T(\mu)e_t(\mu)]/\|\alpha\|^2$ does not exceed 1 at the end of the transition phase (conditions for the absence of overshoot), where $e_t(\mu) = \alpha_t - \alpha$. In Ref. [46] the behavior of $\beta_t(\mu)$ for $\lambda = 1$ and diffuse initialization is studied using asymptotic expansions of the RLSM characteristics in inverse powers of $\mu$ that were obtained in Ref. [45]. They are based on the inversion formula for perturbed matrices that uses the matrix pseudoinverse [47]. Simulation results presented in Refs. [40,41] also show that a relatively small value of $\mu$ can increase the transition time. Thus, it is unclear how to choose $\mu$. Finally, the last of the known methods of setting the initial conditions for the RLSM is the accurate initialization algorithm for regression with scalar output obtained in Refs. [47–49]. The algorithm yields an RLSM estimate that coincides with the estimate of the nonrecursive least squares method.
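A scalar special case, worked out here as an illustration rather than taken from the source, shows the role of $\mu$. For $r = 1$, $\lambda = 1$, and the soft initialization $\alpha_0 = 0$, $P_0 = \mu$, the recursion (1.30)–(1.32) has the closed form

$$P_t = \Big(\frac{1}{\mu} + \sum_{k=1}^{t}C_k^2\Big)^{-1}, \qquad \alpha_t = \frac{\sum_{k=1}^{t}C_ky_k}{1/\mu + \sum_{k=1}^{t}C_k^2}.$$

As $\mu \to \infty$, for any $t$ with $\sum_{k=1}^{t}C_k^2 > 0$ the estimate tends to $\sum_{k=1}^{t}C_ky_k / \sum_{k=1}^{t}C_k^2$, the ordinary least squares estimate. As long as all regressors observed so far are zero, $P_t = \mu$, which grows without bound in the same limit. In this scalar case $t_{tr}$ is simply the first moment at which a nonzero regressor is observed.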

1.3 DIFFUSE INITIALIZATION OF THE KALMAN FILTER

Consider a linear discrete system of the form

$$x_{t+1} = A_tx_t + B_tw_t, \quad y_t = C_tx_t + D_t\xi_t, \quad t \in T = \{a, a+1, \ldots, N\}, \tag{1.33}$$

where $x_t \in R^n$ is a state vector, $y_t \in R^m$ is a vector of measured outputs, $w_t \in R^r$ and $\xi_t \in R^l$ are uncorrelated random processes with zero expectations and covariance matrices $E(w_tw_t^T) = I_r$, $E(\xi_t\xi_t^T) = I_l$, and $A_t$, $B_t$, $C_t$, $D_t$ are known matrices of appropriate dimensions. It is required to estimate $x_t$ in the absence of statistical information regarding arbitrary components of $x_0$.

A number of modifications of the KF oriented toward the solution of this problem have been proposed in the literature. In Ref. [50] the unknown initial conditions are interpreted as random variables with zero mean and a covariance matrix proportional to a large parameter $\mu$, and the use of the information filter, which recursively calculates the inverse of the covariance matrix of the estimation error (the information matrix), is discussed. However, this filter cannot be used in many situations, and working with the covariance matrix of the estimation error is preferable to working with its inverse [50, p. 1298]. In Ref. [50], it is proposed to interpret the unknown initial conditions as diffuse random variables with zero mathematical expectations and an arbitrarily large covariance matrix. In Refs. [17,50a–52], under certain assumptions about the process model and the diffuse structure of the initial vector, the problem was solved for $\mu \to \infty$. In these works, the characteristics of the KF are expressed explicitly as functions of $\mu$, and then their limits as $\mu \to \infty$ are found to obtain accurate solutions, namely the diffuse KF. In Ref. [53], the derivation of the diffuse KF is based on the use of an information filter. In Ref. [54], no assumptions about the statistical nature of the unknown initial conditions are used. It is noted there that this representation is logically justified, in contrast to the diffuse initialization, which deals with an infinite covariance matrix of the initial conditions. That paper proposes a recursive estimation algorithm for the case where no information about any of the initial conditions is available, for a stationary model with a nonsingular dynamics matrix; the algorithm is a more complicated version of the information filter. A feature of the algorithm is that an estimate of the state can be obtained only after a certain number of iterations. Its equivalence to the algorithm obtained in Ref. [53] in the case of diffuse initial conditions is shown. In Ref. [55], it is proposed to initialize the KF with the expectation and covariance matrix of the estimates obtained from an initial sample. However, obtaining such estimates with acceptable accuracy is problematic in some cases (e.g., in the presence of gaps in the observations, or when estimating parameters of nonlinear systems and nonstationary processes). As practical applications of filters with diffuse initialization, Refs. [50a–52] consider problems of econometrics (estimation, prediction, and smoothing of nonstationary time series), Refs. [53,54] consider the design of sliding-window filters robust to impulse disturbances, and Ref. [17] considers the construction of observers that reconstruct the state of a linear system in a finite number of steps and the training of NNs.

CHAPTER 2

Diffuse Algorithms for Estimating Parameters of Linear Regression

Contents
2.1 Problem Statement
2.2 Soft and Diffuse Initializations
2.3 Examples of Application
2.3.1 Identification of Nonlinear Dynamic Plants
2.3.2 Supervisory Control
2.3.3 Estimation With a Sliding Window

2.1 PROBLEM STATEMENT

Consider a linear observation model of the form

$$y_t = C_t\alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{2.1}$$

where $y_t = (y_{1t}, y_{2t}, \ldots, y_{mt})^T \in R^m$ is a vector of outputs, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_r)^T \in R^r$ is a vector of unknown parameters, $C_t = C_t(z_t)$ is an $m \times r$ matrix, $z_t = (z_{1t}, z_{2t}, \ldots, z_{nt})^T \in R^n$ is a vector of inputs, $\xi_t \in R^m$ is a random process with uncorrelated values, zero expectation, and covariance matrix $R_t$, and $N$ is the sample size. Suppose that the sample $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$ and the quadratic quality criterion that takes into account the information on Eq. (2.1) available by time $t$ are given:

$$J_t(\alpha) = \sum_{k=1}^{t}\lambda^{t-k}(y_k - C_k\alpha)^T(y_k - C_k\alpha) + \lambda^t(\alpha - \bar\alpha)^T\bar P^{-1}(\alpha - \bar\alpha)/\mu, \quad t = 1, 2, \ldots, N, \tag{2.2}$$

where $\bar\alpha \in R^r$ and $\bar P \in R^{r \times r}$, $\bar P > 0$, are an arbitrary vector and matrix, $\mu > 0$, and $\lambda \in (0, 1]$ is a forgetting factor.

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00002-0

© 2017 Elsevier Inc. All rights reserved.


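The vector of parameters $\alpha$ is found from the condition that $J_t(\alpha)$ is minimal,

$$\partial J_t(\alpha)/\partial\alpha = -2\sum_{k=1}^{t}\lambda^{t-k}C_k^T(y_k - C_k\alpha) + 2\lambda^t\bar P^{-1}(\alpha - \bar\alpha)/\mu = 0,$$

and the result must be updated after each new observation. Using the notation $\alpha_t$ for the optimal estimate, we obtain

$$M_t\alpha_t = q_t, \quad t = 1, 2, \ldots, N, \tag{2.3}$$

where

$$M_t = \sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k + \lambda^t\bar P^{-1}/\mu, \quad q_t = \sum_{k=1}^{t}\lambda^{t-k}C_k^Ty_k + \lambda^t\bar P^{-1}\bar\alpha/\mu.$$

Let us first find a recursive representation for $\alpha_t$. Since

$$M_t = \lambda M_{t-1} + C_t^TC_t, \quad M_0 = \bar P^{-1}/\mu, \qquad q_t = \lambda q_{t-1} + C_t^Ty_t, \quad q_0 = \bar P^{-1}\bar\alpha/\mu, \quad t = 1, 2, \ldots, N, \tag{2.4}$$

we have

$$\alpha_t = M_t^{-1}q_t = M_t^{-1}(\lambda q_{t-1} + C_t^Ty_t) = M_t^{-1}(\lambda M_{t-1}\alpha_{t-1} + C_t^Ty_t) = M_t^{-1}((M_t - C_t^TC_t)\alpha_{t-1} + C_t^Ty_t) = \alpha_{t-1} + K_t(y_t - C_t\alpha_{t-1}), \quad \alpha_0 = \bar\alpha, \quad t = 1, 2, \ldots, N, \tag{2.5}$$

where

$$K_t = M_t^{-1}C_t^T = P_tC_t^T. \tag{2.6}$$

In order to find a recursive representation for the matrix $P_t = M_t^{-1}$, we use the matrix identity

$$A^{-1} = (B^{-1} + CD^{-1}C^T)^{-1} = B - BC(D + C^TBC)^{-1}C^TB \tag{2.7}$$

with $B = M_{t-1}^{-1}$, $D = \lambda I_m$, $C = C_t^T$, that is, applied to $A = M_t/\lambda = M_{t-1} + C_t^T(\lambda I_m)^{-1}C_t$. This gives

$$P_t = (\lambda P_{t-1}^{-1} + C_t^TC_t)^{-1} = (P_{t-1} - P_{t-1}C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1})/\lambda, \quad P_0 = \bar P\mu, \quad t = 1, 2, \ldots, N. \tag{2.8}$$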
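The equivalence between the recursion and the normal equations (2.3) can be checked numerically. The sketch below does so for the scalar case $r = m = 1$ only, with arbitrary illustrative data; the function names and all numerical values are assumptions of the example.

```python
def soft_rls_scalar(c, y, alpha_bar, p_bar, mu, lam):
    """Recursion (2.5), (2.6), (2.8) for r = m = 1 with soft initialization."""
    alpha, p = alpha_bar, p_bar * mu                             # alpha_0 and P_0
    for ct, yt in zip(c, y):
        p = (p - p * ct * ct * p / (lam + ct * p * ct)) / lam    # Eq. (2.8)
        k = p * ct                                               # Eq. (2.6)
        alpha = alpha + k * (yt - ct * alpha)                    # Eq. (2.5)
    return alpha

def normal_equation_scalar(c, y, alpha_bar, p_bar, mu, lam):
    """Direct solution of M_t alpha_t = q_t, Eq. (2.3), for r = m = 1."""
    t = len(c)
    m = sum(lam ** (t - k) * ck * ck for k, ck in enumerate(c, start=1))
    m += lam ** t / (p_bar * mu)
    q = sum(lam ** (t - k) * ck * yk for k, (ck, yk) in enumerate(zip(c, y), start=1))
    q += lam ** t * alpha_bar / (p_bar * mu)
    return q / m

# Arbitrary illustrative data
c = [1.0, 0.5, -2.0, 1.5]
y = [0.9, 0.7, -2.1, 1.4]
a_rec = soft_rls_scalar(c, y, alpha_bar=0.3, p_bar=2.0, mu=10.0, lam=0.95)
a_dir = normal_equation_scalar(c, y, alpha_bar=0.3, p_bar=2.0, mu=10.0, lam=0.95)
```

The two values agree up to rounding for any $\mu > 0$; the recursive form is the one that can be updated as each new observation arrives.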

The relations Eqs. (2.4)–(2.6) and (2.8) describe the recursive least squares method (RLSM) with the quality criterion Eq. (2.2) and the soft initialization defined by $\bar\alpha$, $\bar P$, $\mu$. It is required to study the properties of the RLSM depending on $\bar\alpha$, $\bar P$ as $\mu \to \infty$ with the quality criterion Eq. (2.2) for $t$ belonging to a bounded set $T = \{1, 2, \ldots, N\}$.

Suppose now that in the quality criterion, instead of the forgetting factor, the covariance matrix of the observation noise is used, which is usually known in engineering applications [56]:

$$J_t(\alpha) = \sum_{k=1}^{t}(y_k - C_k\alpha)^TR_k^{-1}(y_k - C_k\alpha) + (\alpha - \bar\alpha)^T\bar P^{-1}(\alpha - \bar\alpha)/\mu, \quad t = 1, 2, \ldots, N. \tag{2.9}$$

Performing calculations analogous to the derivation of Eqs. (2.4)–(2.6) and (2.8), it is easy to obtain the following relations for a recursive estimation algorithm (the Kalman filter (KF) under a statistical interpretation of $\alpha$, $\bar\alpha$, $\bar P$):

$$\alpha_t = \alpha_{t-1} + K_t(y_t - C_t\alpha_{t-1}), \quad \alpha_0 = \bar\alpha, \quad t = 1, 2, \ldots, N, \tag{2.10}$$

where

$$K_t = M_t^{-1}C_t^TR_t^{-1} = P_tC_t^TR_t^{-1}, \tag{2.11}$$

$$M_t = M_{t-1} + C_t^TR_t^{-1}C_t, \quad M_0 = \bar P^{-1}/\mu, \tag{2.12}$$

$$P_t = P_{t-1} - P_{t-1}C_t^T(R_t + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1}, \quad P_0 = M_0^{-1} = \bar P\mu. \tag{2.13}$$

As in the case of the quality criterion Eq. (2.2), it is required to study the properties of the RLSM for $t$ belonging to a bounded set $T = \{1, 2, \ldots, N\}$ depending on $\bar\alpha$, $\bar P$ as $\mu \to \infty$. In this chapter it is assumed that $C_t$ is a given deterministic matrix function, $T = \{1, 2, \ldots, N\}$ is a bounded set, $\mu$ is a large positive parameter, and

$$N \ge t_{tr} = \min_t\Big\{t : \sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k > 0,\ t = 1, 2, \ldots, N\Big\}.$$

Asymptotic expansions of $P_t$, $K_t$ in inverse powers of $\mu$ will be obtained. They are based on the inversion formula for perturbed matrices that uses the matrix pseudoinverse. This makes it possible to present $P_t$, $K_t$ in analytical form as $\mu \to \infty$; to explain the reason for the divergence of the RLSM observed in simulations for large $\mu$; and to propose limiting diffuse estimation algorithms as $\mu \to \infty$ that are independent of the large parameter and thus avoid the divergence it causes.


2.2 SOFT AND DIFFUSE INITIALIZATIONS

Let us prove two auxiliary results which will be needed further.

Lemma 2.1. Suppose that $\Omega_t$ is an $n \times n$ matrix defined by the expression

$$\Omega_t = \Omega_0/\mu + \sum_{s=1}^{t} F_s^T F_s, \quad t = 1, 2, \ldots, N, \tag{2.14}$$

where $\Omega_0 > 0$ is an arbitrary $n \times n$ matrix and $F_t$, $t = 1, 2, \ldots, N$, are arbitrary $m \times n$ matrices. Then the matrix $\Omega_t^{-1}$ can be expanded in the power series

$$\Omega_t^{-1} = \Omega_0^{-1}(I_n - \Xi_t \Xi_t^+)\mu + \Xi_t^+ + \sum_{i=1}^{q} (-1)^i \Omega_0^{-1/2} (\Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2})^{i+1} \Omega_0^{-1/2} \mu^{-i} + O(\mu^{-q-1}), \tag{2.15}$$

which converges uniformly in $t \in T = \{1, 2, \ldots, N\}$ for bounded $T$ and sufficiently large values of $\mu$, where $\Xi_t = \sum_{s=1}^{t} F_s^T F_s$.

Proof. We have

$$\Omega_t = \Omega_0^{1/2} (I_n + \mu \Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2}) \Omega_0^{1/2}/\mu.$$

By means of the inversion formula [47] for perturbed matrices

$$(I_n + \mu V)^{-1} = (I_n - V V^+) + V^+ (V^+ + \mu I_n)^{-1}, \tag{2.16}$$

where $V$ is an arbitrary symmetric matrix, and setting $V = \Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2}$, we obtain

$$\Omega_t^{-1} = \Omega_0^{-1/2} (I_n + \mu \Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2})^{-1} \Omega_0^{-1/2} \mu$$

$$= \Omega_0^{-1/2} \big[ (I_n - \Omega_0^{-1/2} \Xi_t \Xi_t^+ \Omega_0^{1/2}) + \Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2} (\Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2} + \mu I_n)^{-1} \big] \Omega_0^{-1/2} \mu.$$

Let us use the spectral decomposition of the matrices $V$ and $V^+$ for the analysis of this expression:

$$\Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2} = T_t \Lambda_t T_t^T, \quad \Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2} = T_t \Lambda_t^+ T_t^T,$$

where $\Lambda_t = \mathrm{diag}(\lambda_t(1), \lambda_t(2), \ldots, \lambda_t(n))$, $\lambda_t(i)$, $t = 1, 2, \ldots, N$, $i = 1, 2, \ldots, n$, are the eigenvalues of the matrix $\Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2}$, $\Lambda_t^+ = \mathrm{diag}(\lambda_t^+(1), \lambda_t^+(2), \ldots, \lambda_t^+(n))$,

$$\lambda_t^+(i) = \begin{cases} 0, & \lambda_t(i) = 0, \\ 1/\lambda_t(i), & \lambda_t(i) \ne 0, \end{cases} \quad i = 1, 2, \ldots, n,$$

and $T_t$ is an orthogonal matrix whose columns are eigenvectors of the matrix $\Omega_0^{-1/2} \Xi_t \Omega_0^{-1/2}$. If $\mu > 1/\lambda_{\min}$ and $\lambda_t(i) \ne 0$, then the series

$$(\lambda_t^+(i) + \mu)^{-1} \mu = 1 + \sum_{j=1}^{q} (-1)^j [\lambda_t(i)\mu]^{-j} + O(\mu^{-q-1}), \quad i = 1, 2, \ldots, n,$$

converges uniformly in $t \in T = \{1, 2, \ldots, N\}$ for bounded $T$, where

$$\lambda_{\min} = \min\{\lambda_t(i) > 0, \ t = 1, 2, \ldots, N, \ i = 1, 2, \ldots, n\}.$$

As $(\lambda_t^+(i) + \mu)^{-1}\mu = 1$ for $\lambda_t(i) = 0$, the use of matrix notation gives

$$(\Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2} + \mu I_n)^{-1} \mu = T_t (\Lambda_t^+ + \mu I_n)^{-1} T_t^T \mu = I_n + \sum_{i=1}^{q} (-1)^i (\Omega_0^{1/2} \Xi_t^+ \Omega_0^{1/2})^i \mu^{-i} + O(\mu^{-q-1}), \quad \mu \to \infty.$$

This implies Eq. (2.15).

Lemma 2.2. For arbitrary $m \times n$ matrices $F_t$, $t = 1, 2, \ldots, N$, the following equalities hold:

$$(I_n - \Xi_t \Xi_t^+) F_s^T = 0, \quad s \le t, \quad t = 1, 2, \ldots, N, \tag{2.17}$$

where $\Xi_t = \sum_{k=1}^{t} F_k^T F_k$ is an $n \times n$ matrix.

Proof. Introduce the notation $\tilde F_t = (F_1^T, F_2^T, \ldots, F_t^T)$ for an $n \times mt$ matrix. Let the rank of this matrix be equal to $k_t$ and let $l_t(1), l_t(2), \ldots, l_t(k_t)$ be its arbitrary linearly independent columns. Using the skeletal decomposition [57] gives

$$\tilde F_t = L_t \Gamma_t,$$

where $L_t = (l_t(1), l_t(2), \ldots, l_t(k_t))$ and $\Gamma_t = (\Gamma_t(1), \Gamma_t(2), \ldots, \Gamma_t(t))$, with $\mathrm{rank}(L_t) = k_t$, $\mathrm{rank}(\Gamma_t) = k_t$, are $n \times k_t$ and $k_t \times mt$ matrices, respectively, and $\Gamma_t(i)$, $i = 1, 2, \ldots, t$, are some $k_t \times m$ matrices. Let us first show that

$$I_n - \Xi_t \Xi_t^+ = I_n - L_t (L_t^T L_t)^{-1} L_t^T.$$

We have

$$\Xi_t = \tilde F_t \tilde F_t^T = L_t \tilde\Gamma_t L_t^T,$$

where $\tilde\Gamma_t = \Gamma_t \Gamma_t^T$. As $\tilde\Gamma_t > 0$ (the Gram matrix constructed from the linearly independent rows of $\Gamma_t$), $\mathrm{rank}(L_t) = \mathrm{rank}(\tilde\Gamma_t L_t^T)$ and $L_t$ is a matrix of full column rank. Then [47]

$$\Xi_t^+ = (L_t \tilde\Gamma_t L_t^T)^+ = (L_t^T)^+ \tilde\Gamma_t^{-1} L_t^+, \quad L_t^+ = (L_t^T L_t)^{-1} L_t^T.$$

Thus

$$I_n - \Xi_t \Xi_t^+ = I_n - L_t \tilde\Gamma_t L_t^T (L_t \tilde\Gamma_t L_t^T)^+ = I_n - L_t \tilde\Gamma_t L_t^T (L_t^T)^+ \tilde\Gamma_t^{-1} L_t^+$$

$$= I_n - L_t \tilde\Gamma_t L_t^T L_t (L_t^T L_t)^{-1} \tilde\Gamma_t^{-1} (L_t^T L_t)^{-1} L_t^T = I_n - L_t (L_t^T L_t)^{-1} L_t^T.$$

Since $F_s^T = L_t \Gamma_t(s)$, we have

$$(I_n - \Xi_t \Xi_t^+) F_s^T = (I_n - L_t (L_t^T L_t)^{-1} L_t^T) L_t \Gamma_t(s) = 0, \quad s = 1, 2, \ldots, t, \quad t = 1, 2, \ldots, N.$$
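The inversion formula (2.16) and the identity (2.17) are easy to check numerically. The following is only a minimal sketch under stated assumptions: the matrices are random test data (not taken from the book), $V$ is chosen equal to $\Xi_t$ (symmetric and positive semidefinite), and the pseudoinverse cutoff `rcond=1e-10` is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t, mu = 5, 1, 3, 1e3

# Hypothetical data: t blocks F_s of size m x n with t*m < n, so Xi is singular.
F = [rng.standard_normal((m, n)) for _ in range(t)]
Xi = sum(Fs.T @ Fs for Fs in F)
Xi_pinv = np.linalg.pinv(Xi, rcond=1e-10)
I = np.eye(n)

# Eq. (2.16) with V = Xi: (I + mu V)^{-1} = (I - V V^+) + V^+ (V^+ + mu I)^{-1}.
lhs = np.linalg.inv(I + mu * Xi)
rhs = (I - Xi @ Xi_pinv) + Xi_pinv @ np.linalg.inv(Xi_pinv + mu * I)
err_216 = np.max(np.abs(lhs - rhs))

# Eq. (2.17): (I - Xi Xi^+) F_s^T = 0 for every s <= t.
err_217 = max(np.max(np.abs((I - Xi @ Xi_pinv) @ Fs.T)) for Fs in F)

print(err_216, err_217)
```

Both residuals should be at the level of rounding errors; the explicit cutoff matters because $\Xi_t$ is singular here and its numerically zero eigenvalues must not be inverted.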

We first consider the RLSM properties with the forgetting factor.

Theorem 2.1.
1. The matrices $P_t$ and $K_t$ in Eqs. (2.6) and (2.8) can be expanded in the power series

$$P_t = M_t^{-1} = \lambda^{-t} \bar P (I_r - W_t W_t^+) \mu + W_t^+ + \sum_{i=1}^{q} (-1)^i \lambda^{ti} \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} + O(\mu^{-q-1}), \tag{2.18}$$

$$K_t = \Big[ W_t^+ + \sum_{i=1}^{q} (-1)^i \lambda^{ti} \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} \Big] C_t^T + O(\mu^{-q-1}), \tag{2.19}$$

which converge uniformly in $t \in T = \{1, 2, \ldots, N\}$ for bounded $T$ and sufficiently large values of $\mu$, where

$$W_t = \lambda W_{t-1} + C_t^T C_t, \quad W_0 = 0_{r \times r}. \tag{2.20}$$

2. Suppose that the elements of the vector $\alpha$ are either unknown constants or random quantities whose statistical characteristics are unknown, and in the latter case $\alpha$ is not correlated with $\xi_t$, $t = 1, 2, \ldots, N$. Then for any $\varepsilon > 0$

$$P(\|\alpha_t - \tilde\alpha_t\| \ge \varepsilon) = O(\mu^{-q-1}), \quad \mu \to \infty, \quad t = 1, 2, \ldots, N, \tag{2.21}$$

where

$$\tilde\alpha_t = \tilde\alpha_{t-1} + \tilde K_t (y_t - C_t \tilde\alpha_{t-1}), \quad \tilde\alpha_0 = \bar\alpha, \tag{2.22}$$

$$\tilde K_t = \Big[ W_t^+ + \sum_{i=1}^{q} (-1)^i \lambda^{ti} \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} \Big] C_t^T. \tag{2.23}$$

Proof.
1. As

$$M_t = \sum_{k=1}^{t} \lambda^{t-k} C_k^T C_k + \lambda^t \bar P^{-1}/\mu = \lambda^t \Big( \bar P^{-1}/\mu + \sum_{k=1}^{t} \lambda^{-k} C_k^T C_k \Big), \tag{2.24}$$

then putting in Lemma 2.1

$$\Omega_t = M_t, \quad \Omega_0 = \bar P^{-1}, \quad F_t = \lambda^{-t/2} C_t,$$

we obtain Eq. (2.18). The representation Eq. (2.19) follows from Eq. (2.18), Lemma 2.2, and the equality

$$K_t = M_t^{-1} C_t^T = P_t C_t^T.$$

2. Introducing the notations $e_t = \alpha_t - \tilde\alpha_t$, $h_t = \alpha_t - \alpha$ and using Eqs. (2.5) and (2.22), we obtain

$$e_t = (I_r - \tilde K_t C_t) e_{t-1} - (K_t - \tilde K_t) C_t h_t + (K_t - \tilde K_t) \xi_t, \quad e_0 = 0,$$

$$h_t = (I_r - \tilde K_t C_t) h_{t-1} + \tilde K_t \xi_t, \quad h_0 = \bar\alpha - \alpha.$$

The matrix of second moments of the block vector $x_t = (e_t^T, h_t^T)^T$ satisfies the matrix difference equation

$$Q_t = A_t Q_{t-1} A_t^T + L_t$$

with the initial condition $Q_0 = 0$ if the elements of the vector $\alpha$ are unknown constants and $Q_0 = \mathrm{block\ diag}(0_{r \times r}, \bar Q_0)$ if these elements are random quantities, where

$$A_t = \begin{pmatrix} I_r - \tilde K_t C_t & -(K_t - \tilde K_t) C_t \\ 0 & I_r - \tilde K_t C_t \end{pmatrix}, \quad L_t = \begin{pmatrix} (K_t - \tilde K_t) R_t (K_t - \tilde K_t)^T & (K_t - \tilde K_t) R_t \tilde K_t^T \\ \tilde K_t R_t (K_t - \tilde K_t)^T & \tilde K_t R_t \tilde K_t^T \end{pmatrix},$$

$$\bar Q_0 = E[(\alpha - \bar\alpha)(\alpha - \bar\alpha)^T].$$

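Since $\|K_t - \tilde K_t\| = O(\mu^{-q-1})$, $t = 1, 2, \ldots, N$, $\mu \to \infty$, this implies

$$E(e_t^T e_t) = O(\mu^{-q-1}), \quad t = 1, 2, \ldots, N, \quad \mu \to \infty.$$

Using Markov's inequality gives, for any $\varepsilon > 0$,

$$P(\|e_t\| \ge \varepsilon) \le E(\|e_t\|^2)/\varepsilon^2 = O(\mu^{-q-1}), \quad \mu \to \infty, \quad t = 1, 2, \ldots, N.$$

Neglecting in Eqs. (2.18) and (2.19) the terms starting from the first order of smallness $O(\mu^{-1})$, we get

$$P_t = M_t^{-1} = \lambda^{-t} \bar P (I_r - W_t W_t^+) \mu + W_t^+ + O(\mu^{-1}), \tag{2.25}$$

$$\alpha_t^{dif} = \alpha_{t-1}^{dif} + K_t^{dif} (y_t - C_t \alpha_{t-1}^{dif}), \quad \alpha_0^{dif} = \bar\alpha, \tag{2.26}$$

where

$$K_t^{dif} = W_t^+ C_t^T, \tag{2.27}$$

$$W_t = \lambda W_{t-1} + C_t^T C_t, \quad W_0 = 0_{r \times r}, \quad t = 1, 2, \ldots, N. \tag{2.28}$$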
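The recursion (2.26)-(2.28) can be sketched in a few lines for scalar observations ($m = 1$). This is an illustrative sketch rather than a reference implementation: the function name `diffuse_rls`, the cutoff `rcond` used for the pseudoinverse, and the small test data are assumptions introduced here.

```python
import numpy as np

def diffuse_rls(C_rows, y, lam=1.0, alpha0=None, rcond=1e-10):
    """Diffuse recursion (2.26)-(2.28) for m = 1; returns all estimates alpha_t."""
    C_rows = np.asarray(C_rows, dtype=float)
    r = C_rows.shape[1]
    alpha = np.zeros(r) if alpha0 is None else np.asarray(alpha0, dtype=float).copy()
    W = np.zeros((r, r))                            # W_0 = 0
    history = []
    for c, yt in zip(C_rows, y):
        W = lam * W + np.outer(c, c)                # (2.28)
        K = np.linalg.pinv(W, rcond=rcond) @ c      # (2.27): K_t = W_t^+ C_t^T
        alpha = alpha + K * (yt - c @ alpha)        # (2.26)
        history.append(alpha.copy())
    return np.array(history)

# Hypothetical noise-free data with r = 3 unknown parameters.
C = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 0.0, 1.0]])
alpha_true = np.array([1.0, -2.0, 3.0])
est = diffuse_rls(C, C @ alpha_true)
print(est)
```

In this small example $W_t$ becomes nonsingular at the third step, and from that step on the estimate equals `alpha_true` up to rounding. No large parameter $\mu$ enters the computation.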

The relations Eqs. (2.25)-(2.28) will be called the diffuse estimation algorithm for the linear regression parameter in Eq. (2.1). Keeping in the expansions Eqs. (2.18) and (2.19) terms of higher order of smallness, it is possible to obtain different estimation algorithms. Thus, keeping the terms of order $O(\mu^{-1})$ gives

$$P_t = M_t^{-1} = \lambda^{-t} \bar P (I_r - W_t W_t^+) \mu + W_t^+ - \lambda^t (W_t^+)^2 \bar P^{-1} \mu^{-1} + O(\mu^{-2}), \tag{2.29}$$

$$\alpha_t^{\mu} = \alpha_{t-1}^{\mu} + K_t^{\mu} (y_t - C_t \alpha_{t-1}^{\mu}), \quad \alpha_0^{\mu} = \bar\alpha, \tag{2.30}$$

where

$$K_t^{\mu} = [W_t^+ - \lambda^t (W_t^+)^2 \bar P^{-1} \mu^{-1}] C_t^T. \tag{2.31}$$

Note that, in contrast to the diffuse algorithm, these relations depend on $\bar P$.

Consequence 2.1.1. The diffuse component $P_t^{dif} = \lambda^{-t} \bar P (I_r - W_t W_t^+) \mu$ is the term in the expansion of $P_t$ which is proportional to the large parameter, and it vanishes when $t \ge tr$, where $tr = \min_t \{t : \Pi_t > 0, \ t = 1, 2, \ldots, N\}$, $\Pi_t = \sum_{k=1}^{t} C_k^T C_k$. Indeed, we have

$$\Pi_t = \tilde C_t^T \tilde C_t, \quad W_t = \sum_{k=1}^{t} \lambda^{t-k} C_k^T C_k = \tilde C_t^T \Lambda_t \tilde C_t,$$

where $\tilde C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T \in R^{mt \times r}$, $\Lambda_t = \mathrm{block\ diag}(\lambda^{t-1} I_m, \lambda^{t-2} I_m, \ldots, I_m) \in R^{mt \times mt}$. As $\det \Lambda_t \ne 0$, then

$$\mathrm{rank}(\Pi_t) = \mathrm{rank}(W_t) = \mathrm{rank}(\tilde C_t)$$

and consequently $P_{tr}^{dif} = 0$.

Consequence 2.1.2. The matrix $K_t$ does not depend on the diffuse component, in contrast to the matrix $P_t$, and as a function of $\mu$ it is uniformly bounded in the norm for $t \in T$ as $\mu \to \infty$.

Consequence 2.1.3. Numerical implementation errors can result in the RLSM divergence for large values of $\mu$. Indeed, let $\delta W_t^+$ be the error connected with the calculation of the pseudoinverse of $W_t$. Then by Theorem 2.1

$$K_t = M_t^{-1} C_t^T = P_t C_t^T = [\lambda^{-t} \bar P (I_r - W_t (W_t^+ + \delta W_t^+)) \mu + O(1)] C_t^T = [-\lambda^{-t} \bar P W_t \, \delta W_t^+ \mu + O(1)] C_t^T, \quad \mu \to \infty.$$

Thus for $\delta W_t^+ \ne 0$ the matrix $K_t$ becomes dependent on the diffuse component even when $t \ge tr$. As the operation of pseudoinversion is, generally speaking, not continuous, the resulting change can be substantial and can lead to divergence. Moreover, in this case $K_t$ becomes proportional to the large parameter, so divergence is possible even if the continuity condition $\mathrm{rank}(W_t) = \mathrm{rank}(W_t + \delta W_t)$ is satisfied and $\delta W_t^+$ is arbitrarily small in norm. Note that the numerically implemented diffuse algorithm does not have these features, since its construction contains no diffuse components, i.e., no quantities proportional to the large parameter.

Let us illustrate, using a numerical example, the difference between the properties of the estimates of the regression parameter in Eq. (2.1) obtained by the RLSM with soft initialization for large but finite values of $\mu$ and by the diffuse algorithm.

Example 2.1. Consider the problem of estimating the parameter $\alpha$ in the observation model when $m = 1$, $r = 100$, and $R = 0.05^2$. Let the model inputs be generated by two units specified by the cascade connection of the gallery operator ('randsvd', 100) from the Matlab package with a 64-bit grid, and let the elements of the vector $\alpha$ be normally distributed with mean 0 and standard deviation $10^4$. Fig. 2.1 shows the dependence on $t$ of the estimation errors obtained using the RLSM with the soft and diffuse initializations,

$$q_t = \|\alpha - \alpha_t\|, \quad p_t = \|\alpha - \alpha_t^{dif}\|,$$

respectively. In both cases the forgetting factor is set equal to 1. Curves 1, 2, and 3 correspond to the soft initialization with $\mu = 10^6, 10^9, 10^{10}$, respectively, and curve 4 corresponds to the diffuse initialization for $\bar\alpha = 0$, $\bar P = I_r$. It is seen that for $\mu = 10^6$ the error $q_t$ cannot be reduced significantly. If $\mu = 10^9$, then the value of $q_t$ decreases, but it remains approximately two times larger than $p_t$ when $t > 100$. A further increase of $\mu$ leads to divergence of the RLSM, in agreement with Consequence 2.1.3.

Figure 2.1 Dependence of the estimation error on t obtained using the RLSM with the soft and diffuse initializations.

Let us establish some properties of the RLSM with the diffuse initialization. We first show that the estimates of the least squares method (LSM) coincide with the estimates obtained using the diffuse algorithm from the moment $tr$. Consider the problem of minimizing the weighted sum of squares

$$\tilde J_t(\alpha) = \sum_{k=1}^{t} \lambda^{t-k} (y_k - C_k \alpha)^T (y_k - C_k \alpha) = \lambda^t (Y_t - \tilde C_t \alpha)^T \tilde R_t^{-1} (Y_t - \tilde C_t \alpha), \quad t = 1, 2, \ldots, N, \tag{2.32}$$

where

$$\tilde C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T \in R^{mt \times r}, \quad Y_t = (y_1^T, y_2^T, \ldots, y_t^T)^T \in R^{mt}, \quad \tilde R_t = \mathrm{block\ diag}(\lambda I_m, \lambda^2 I_m, \ldots, \lambda^t I_m) \in R^{mt \times mt}.$$

The minimizing value of $\alpha$ must satisfy the system of equations

$$\partial \tilde J_t(\alpha)/\partial \alpha = -2 \lambda^t \tilde C_t^T \tilde R_t^{-1} (Y_t - \tilde C_t \alpha) = 0.$$

If $\mathrm{rank}(\tilde C_t) = r$, then its solution after $t$ observations is a linear unbiased estimate of the form

$$\alpha_t = (\tilde C_t^T \tilde R_t^{-1} \tilde C_t)^{-1} \tilde C_t^T \tilde R_t^{-1} Y_t. \tag{2.33}$$

We want to show that from the time $tr$ the estimates given by the expression Eq. (2.33) and by the RLSM with the diffuse initialization coincide. We will need some auxiliary statements.
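Before turning to the proof, the claimed coincidence can be observed numerically. The sketch below is only an illustration under assumptions made here: random regressors with $m = 1$, a forgetting factor $\lambda = 0.9$, zero initial estimate, and a pseudoinverse cutoff of $10^{-10}$. The batch estimate is computed directly from the weighted normal equations that correspond to Eq. (2.33).

```python
import numpy as np

rng = np.random.default_rng(1)
r, N, lam = 3, 20, 0.9
C = rng.standard_normal((N, r))              # rows C_t, hypothetical data
alpha_true = np.array([0.5, -1.0, 2.0])
y = C @ alpha_true + 0.05 * rng.standard_normal(N)

alpha = np.zeros(r)                          # initial estimate set to zero here
W = np.zeros((r, r))
max_gap = 0.0
for t in range(N):
    c = C[t]
    W = lam * W + np.outer(c, c)                             # (2.28)
    K = np.linalg.pinv(W, rcond=1e-10) @ c                   # (2.27)
    alpha = alpha + K * (y[t] - c @ alpha)                   # (2.26)
    if np.linalg.matrix_rank(C[: t + 1]) == r:               # t >= tr
        w = lam ** np.arange(t, -1, -1)                      # weights lam^(t-k)
        A = C[: t + 1]
        batch = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y[: t + 1]))
        max_gap = max(max_gap, np.max(np.abs(alpha - batch)))

print(max_gap)
```

The printed value is the largest componentwise difference between the recursive and the batch estimates over all $t \ge tr$; it should be at the level of rounding errors.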


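Lemma 2.3. For any $t = 2, 3, \ldots$ the following identities are valid:

$$(I_r - W_t^+ C_t^T C_t)(I_r - W_{t-1}^+ W_{t-1}) = (I_r - W_t^+ W_t)(I_r - W_{t-1}^+ W_{t-1}), \tag{2.34}$$

$$W_t^+ W_t W_s^+ W_s = W_s^+ W_s, \quad s = 1, 2, \ldots, t. \tag{2.35}$$

Proof. Let us prove Eq. (2.34). If $t \ge tr$, then the assertion is obvious. Let $t < tr$. We have

$$I_r - W_t^+ C_t^T C_t = I_r - \lambda^{-t} \Lambda_t^+ C_t^T C_t, \tag{2.36}$$

where $\Lambda_t$ is determined by the matrix equation

$$\Lambda_t = \Lambda_{t-1} + \lambda^{-t} C_t^T C_t, \quad \Lambda_0 = 0, \quad t = 1, 2, \ldots, N.$$

Transforming the left-hand side of Eq. (2.34) by means of Eq. (2.36), we obtain

$$(I_r - W_t^+ C_t^T C_t)(I_r - W_{t-1}^+ W_{t-1}) = (I_r - \lambda^{-t} \Lambda_t^+ C_t^T C_t)(I_r - \Lambda_{t-1}^+ \Lambda_{t-1})$$

$$= I_r - \lambda^{-t} \Lambda_t^+ C_t^T C_t - \Lambda_{t-1}^+ \Lambda_{t-1} + \lambda^{-t} \Lambda_t^+ C_t^T C_t \Lambda_{t-1}^+ \Lambda_{t-1}$$

$$= I_r - \Lambda_t^+ \Lambda_t + \Lambda_t^+ \Lambda_{t-1} - \Lambda_{t-1}^+ \Lambda_{t-1} + \Lambda_t^+ \Lambda_t \Lambda_{t-1}^+ \Lambda_{t-1} - \Lambda_t^+ \Lambda_{t-1} \Lambda_{t-1}^+ \Lambda_{t-1}$$

$$= (I_r - \Lambda_t^+ \Lambda_t) - (I_r - \Lambda_t^+ \Lambda_t) \Lambda_{t-1}^+ \Lambda_{t-1} + \Lambda_t^+ \Lambda_{t-1} (I_r - \Lambda_{t-1}^+ \Lambda_{t-1}).$$

Since the pseudoinverse $A^+$ of any matrix $A$ satisfies the condition $A A^+ A = A$ [47], we have

$$\Lambda_t^+ \Lambda_{t-1} (I_r - \Lambda_{t-1}^+ \Lambda_{t-1}) = 0$$

and therefore

$$(I_r - W_t^+ C_t^T C_t)(I_r - W_{t-1}^+ W_{t-1}) = (I_r - \Lambda_t^+ \Lambda_t) - (I_r - \Lambda_t^+ \Lambda_t) \Lambda_{t-1}^+ \Lambda_{t-1}$$

$$= (I_r - \Lambda_t^+ \Lambda_t)(I_r - \Lambda_{t-1}^+ \Lambda_{t-1}) = (I_r - W_t^+ W_t)(I_r - W_{t-1}^+ W_{t-1}).$$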

Let us prove Eq. (2.35). If $t \ge tr$, then the assertion is obvious. Suppose now that $t < tr$. Let $L_i$ be an $r \times k_i$ matrix of rank $k_i$ consisting of all linearly independent columns of the $r \times mi$ matrix

$$\bar C_i^T = (\lambda^{-1/2} C_1^T, \lambda^{-1} C_2^T, \ldots, \lambda^{-i/2} C_i^T)$$

for $i = 1, 2, \ldots, t$. The matrix $L_i$ is selected so that $L_i = (L_{i-1}, \Delta_i)$, where $\Delta_i$ is an $r \times (k_i - k_{i-1})$ matrix of rank $k_i - k_{i-1}$ composed of all linearly independent columns of the matrix $\lambda^{-i/2} C_i^T$ for each $i = 2, 3, \ldots, t$. The skeletal decomposition yields $\bar C_t^T = L_t \Gamma_t$, where $\Gamma_t = (\Gamma_t(1), \Gamma_t(2), \ldots, \Gamma_t(t))$ is a $k_t \times mt$ matrix of rank $k_t$ and $\Gamma_t(i)$, $i = 1, 2, \ldots, t$, are some $k_t \times m$ matrices. We have

$$\Lambda_t = \sum_{k=1}^{t} \lambda^{-k} C_k^T C_k = \bar C_t^T \bar C_t = L_t \tilde\Gamma_t L_t^T,$$

$$\Lambda_t^+ = (L_t \tilde\Gamma_t L_t^T)^+ = (L_t^T)^+ \tilde\Gamma_t^{-1} L_t^+, \quad L_t^+ = (L_t^T L_t)^{-1} L_t^T,$$

$$W_t^+ W_t = (L_t \tilde\Gamma_t L_t^T)^+ L_t \tilde\Gamma_t L_t^T = L_t (L_t^T L_t)^{-1} L_t^T, \tag{2.37}$$

where $\tilde\Gamma_t = \Gamma_t \Gamma_t^T > 0$ is the Gram matrix. As

$$(L_t^T L_t)^{-1} L_t^T L_s = I_{k_s, k_t}, \quad L_t I_{k_s, k_t} = L_s,$$

where $I_{k_s, k_t} = (e_1, e_2, \ldots, e_{k_s})$ and $e_i \in R^{k_t}$ is the $i$-th unit vector, then

$$W_t^+ W_t W_s^+ W_s = \Lambda_t^+ \Lambda_t \Lambda_s^+ \Lambda_s = L_t (L_t^T L_t)^{-1} L_t^T L_s (L_s^T L_s)^{-1} L_s^T = W_s^+ W_s.$$

Let us consider the system of equations for the expectation of the estimation error $e_t = E(\alpha - \alpha_t^{dif})$:

$$e_t = (I_r - K_t^{dif} C_t) e_{t-1}, \quad t = 1, 2, \ldots, N. \tag{2.38}$$

Let us establish the form of the normalized transition matrix

$$H_{t,s} = \begin{cases} A_t A_{t-1} \cdots A_{s+1}, & t = s + 1, s + 2, \ldots, \ s \ge 0, \\ I_r, & t = s \end{cases}$$

of this system, where

$$A_t = I_r - K_t^{dif} C_t = I_r - W_t^+ C_t^T C_t.$$

Lemma 2.4. For any $t = 1, 2, \ldots$ the following identities are valid:

$$H_{t,0} = (I_r - W_t^+ W_t)(I_r - W_{t-1}^+ W_{t-1}) \cdots (I_r - W_1^+ W_1) = I_r - W_t^+ W_t. \tag{2.39}$$

Proof. We have

$$H_{t,0} = A_t A_{t-1} \cdots A_1, \quad t = 1, 2, \ldots$$

Let us prove the first equality in Eq. (2.39) by induction. As $H_{1,0} = I_r - W_1^+ W_1$, it holds at $t = 1$. From Lemma 2.3 it follows that

$$H_{2,0} = (I_r - W_2^+ C_2^T C_2)(I_r - W_1^+ W_1) = (I_r - W_2^+ W_2)(I_r - W_1^+ W_1).$$

Thus the first equality in Eq. (2.39) is valid at $t = 2$. Suppose it is true at some $t$ and let us show that it also holds at $t + 1$. We have

$$H_{t+1,0} = (I_r - W_{t+1}^+ C_{t+1}^T C_{t+1}) H_{t,0} = (I_r - W_{t+1}^+ C_{t+1}^T C_{t+1})(I_r - W_t^+ W_t) q_{t-1},$$

where

$$q_{t-1} = (I_r - W_{t-1}^+ W_{t-1}) \cdots (I_r - W_1^+ W_1).$$

Using Lemma 2.3 gives

$$H_{t+1,0} = (I_r - W_{t+1}^+ W_{t+1}) q_t, \quad q_t = (I_r - W_t^+ W_t) q_{t-1},$$

which implies the first equality in Eq. (2.39). Let us now show that

$$H_{t,0} = I_r - W_t^+ W_t, \quad t = 1, 2, \ldots \tag{2.40}$$

We use induction once again. As $H_{1,0} = I_r - W_1^+ C_1^T C_1 = I_r - W_1^+ W_1$, Eq. (2.40) is satisfied at $t = 1$. Suppose it is true at some $t$ and show that it is true at $t + 1$. Using Lemma 2.3 and Eq. (2.35) gives

$$H_{t+1,0} = (I_r - W_{t+1}^+ C_{t+1}^T C_{t+1})(I_r - W_t^+ W_t) = (I_r - W_{t+1}^+ W_{t+1})(I_r - W_t^+ W_t)$$

$$= I_r - W_{t+1}^+ W_{t+1} - W_t^+ W_t + W_{t+1}^+ W_{t+1} W_t^+ W_t = I_r - W_{t+1}^+ W_{t+1}.$$
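The identity (2.40) can also be observed numerically by accumulating the product $A_t A_{t-1} \cdots A_1$ and comparing it with $I_r - W_t^+ W_t$. The sketch below uses random regressors with $m = 1$, $\lambda = 0.95$, and a pseudoinverse cutoff of $10^{-10}$; all of these are illustrative assumptions rather than values from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
r, N, lam = 4, 6, 0.95
C = rng.standard_normal((N, r))           # rows C_t, hypothetical data
I = np.eye(r)

W = np.zeros((r, r))
H = np.eye(r)                              # H_{0,0} = I_r
gaps = []
for t in range(N):
    c = C[t]
    W = lam * W + np.outer(c, c)
    W_pinv = np.linalg.pinv(W, rcond=1e-10)
    H = (I - W_pinv @ np.outer(c, c)) @ H            # H_{t,0} = A_t H_{t-1,0}
    gaps.append(np.max(np.abs(H - (I - W_pinv @ W))))

print(max(gaps), np.max(np.abs(H)))
```

The first printed number measures the deviation from Eq. (2.40) over all steps. The second shows that the transition matrix is numerically zero once $W_t$ has become nonsingular.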


Corollary. The solutions of the system Eq. (2.38) satisfy the condition $e_t = 0$ for any initial conditions and $t \ge tr$. That is, the parameter estimate obtained by the RLSM with diffuse initialization is unbiased when $t \ge tr$. Note also that in the absence of noise the RLSM with the diffuse initialization restores the vector of unknown parameters in a finite number of steps.

Lemma 2.5. For any $t \ge tr$

$$H_{t,s} = \lambda^{t-s} W_t^{-1} W_s, \quad s = 1, 2, \ldots, t - 1. \tag{2.41}$$

Proof. Let us use induction. At $s = t - 1$ we have

$$H_{t,t-1} = A_t = I_r - W_t^{-1} C_t^T C_t = W_t^{-1} (W_t - C_t^T C_t) = \lambda W_t^{-1} W_{t-1}.$$

Let $s = t - 2$; then

$$H_{t,t-2} = A_t A_{t-1} = (I_r - W_t^{-1} C_t^T C_t)(I_r - W_{t-1}^+ C_{t-1}^T C_{t-1}).$$

Transforming this expression, we get

$$H_{t,t-2} = W_t^{-1} (W_t - C_t^T C_t)(I_r - W_{t-1}^+ C_{t-1}^T C_{t-1}) = \lambda W_t^{-1} W_{t-1} (I_r - W_{t-1}^+ C_{t-1}^T C_{t-1}).$$

Let $L_t$ be a matrix of rank $k_t$ composed of all linearly independent columns of the matrix $\bar C_t^T = (\lambda^{(t-1)/2} C_1^T, \lambda^{(t-2)/2} C_2^T, \ldots, C_t^T)$. Using the skeletal decomposition

$$\bar C_t^T = L_t \Gamma_t, \quad C_t^T = L_t \Gamma_t(t),$$

where $L_t$ and $\Gamma_t = (\Gamma_t(1), \Gamma_t(2), \ldots, \Gamma_t(t))$ are some $r \times k_t$ and $k_t \times mt$ matrices of rank $k_t$ and $\Gamma_t(i)$, $i = 1, 2, \ldots, t$, are some $k_t \times m$ matrices, gives

$$W_t W_t^+ C_t^T = C_t^T.$$

It follows from this that

$$H_{t,t-2} = \lambda^2 W_t^{-1} W_{t-2}. \tag{2.42}$$

Thus Eq. (2.41) is valid at $s = t - 2$. Let Eq. (2.41) be satisfied at $s = t - i$. We show that it is also valid at $s = t - i - 1$:

$$H_{t,t-i-1} = (I_r - W_t^{-1} C_t^T C_t) \cdots (I_r - W_{t-i+1}^+ C_{t-i+1}^T C_{t-i+1})(I_r - W_{t-i}^+ C_{t-i}^T C_{t-i})$$

$$= \lambda^i W_t^{-1} W_{t-i} (I_r - W_{t-i}^+ C_{t-i}^T C_{t-i}) = \lambda^{i+1} W_t^{-1} W_{t-i-1}.$$

Theorem 2.2. The estimate of the RLSM with diffuse initialization is defined from the moment $tr$ by the expression

$$\alpha_t^{dif} = (\tilde C_t^T \tilde R_t^{-1} \tilde C_t)^{-1} \tilde C_t^T \tilde R_t^{-1} Y_t$$

for any initial vector $\bar\alpha$, where

$$\tilde C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T \in R^{mt \times r}, \quad Y_t = (y_1^T, y_2^T, \ldots, y_t^T)^T \in R^{mt}, \quad \tilde R_t = \mathrm{block\ diag}(\lambda^{t-1} I_m, \lambda^{t-2} I_m, \ldots, I_m) \in R^{mt \times mt}.$$

Proof. Iterating Eq. (2.26) and using Lemmas 2.4 and 2.5, we obtain

$$\alpha_t^{dif} = H_{t,0} \bar\alpha + \sum_{s=1}^{t} H_{t,s} K_s^{dif} y_s = \sum_{s=1}^{t} H_{t,s} K_s^{dif} y_s = W_t^{-1} \sum_{s=1}^{t} \lambda^{t-s} W_s W_s^+ C_s^T y_s$$

$$= W_t^{-1} \sum_{s=1}^{t} \lambda^{t-s} C_s^T y_s = (\tilde C_t^T \tilde R_t^{-1} \tilde C_t)^{-1} \tilde C_t^T \tilde R_t^{-1} Y_t.$$

We show now that the estimate of the diffuse algorithm coincides with the LSM estimate of minimum norm. Let us return to the problem of minimizing the weighted sum of squares Eq. (2.32). It is known [47] that if the rank of the matrix $\tilde C_t$ is less than $r$, then the problem does not have a unique solution, and the solution with minimal norm is given by the expression

$$\alpha_t = \tilde C_t^+ \tilde R_t^{-1/2} Y_t, \tag{2.43}$$

where

$$\tilde C_t = (C_1^T R_1^{-1/2}, C_2^T R_2^{-1/2}, \ldots, C_t^T R_t^{-1/2})^T \in R^{mt \times r}, \quad Y_t = (y_1^T, y_2^T, \ldots, y_t^T)^T \in R^{mt},$$

$$\tilde R_t = \mathrm{block\ diag}(\lambda I_m, \lambda^2 I_m, \ldots, \lambda^t I_m) \in R^{mt \times mt}, \quad R_i = \lambda^i I_m, \quad i = 1, 2, \ldots, t.$$


Let us establish a link between this estimate and the estimate of the RLSM with the diffuse initialization. We need the following auxiliary assertion.

Lemma 2.6. The following identity holds:

$$M_t^+ \tilde C_t^T = [(I_r - K_t C_t) M_{t-1}^+ \tilde C_{t-1}^T, \ K_t R_t^{1/2}], \quad t = 1, 2, \ldots, N, \tag{2.44}$$

where $M_t = \tilde C_t^T \tilde C_t$, $K_t = M_t^+ C_t^T R_t^{-1}$.

Proof. Since

$$M_t^+ \tilde C_t^T = M_t^+ (\tilde C_{t-1}^T, \ C_t^T R_t^{-1/2}),$$

the last $m$ columns on the left- and right-hand sides of Eq. (2.44) are the same. Let us show that

$$M_t^+ \tilde C_{t-1}^T = (I_r - K_t C_t) M_{t-1}^+ \tilde C_{t-1}^T.$$

Using the recursive formula

$$M_t = M_{t-1} + \lambda^{-t} C_t^T C_t, \quad M_0 = 0, \quad t = 1, 2, \ldots, N,$$

we represent this expression in the equivalent form

$$(M_t^+ - (I_r - K_t C_t) M_{t-1}^+) \tilde C_{t-1}^T = [M_t^+ - (I_r - M_t^+ M_t + M_t^+ M_{t-1}) M_{t-1}^+] \tilde C_{t-1}^T$$

$$= [M_t^+ (I_r - M_{t-1} M_{t-1}^+) - (I_r - M_t^+ M_t) M_{t-1}^+] \tilde C_{t-1}^T = 0. \tag{2.45}$$

We show first that

$$(I_r - M_t^+ M_t) M_{t-1}^+ = 0. \tag{2.46}$$

Suppose that the matrix $\tilde C_t^T$ has rank $k_t$ and its linearly independent columns $l_t(1), l_t(2), \ldots, l_t(k_t)$ are selected in the same way as the columns of the matrix $\bar C_t^T$ used in the proof of Lemma 2.3. The linear space $C(l_t(1), l_t(2), \ldots, l_t(k_t))$ spanned by them coincides with the space spanned by the columns of $M_t$. It follows from this that $C(M_{t-1}) \subseteq C(M_t)$. Using the skeletal decomposition of the matrices, we get

$$\tilde C_t^T = L_t \Gamma_t,$$

where $L_t = (l_t(1), l_t(2), \ldots, l_t(k_t))$, $\Gamma_t = (\Gamma_t(1), \Gamma_t(2), \ldots, \Gamma_t(t))$, $\mathrm{rank}(\Gamma_t) = k_t$. This implies $M_t = \tilde C_t^T \tilde C_t = L_t \tilde\Gamma_t L_t^T$, where $\tilde\Gamma_t = \Gamma_t \Gamma_t^T$.

As the matrix $L_t$ is of full column rank, $L_t^+ = (L_t^T L_t)^{-1} L_t^T$ and

$$M_t^+ = (L_t \tilde\Gamma_t L_t^T)^+ = (L_t^T)^+ \tilde\Gamma_t^{-1} L_t^+, \quad M_t^+ M_t = L_t (L_t^T L_t)^{-1} L_t^T.$$

Substituting these expressions in Eq. (2.46) gives

$$(I_r - L_t (L_t^T L_t)^{-1} L_t^T) L_{t-1} (L_{t-1}^T L_{t-1})^{-1} \tilde\Gamma_{t-1}^{-1} L_{t-1}^+ = 0.$$

But since

$$(I_r - L_t (L_t^T L_t)^{-1} L_t^T) L_{t-1} = 0,$$

Eq. (2.46) indeed holds. The identity

$$M_t^+ (I_r - M_{t-1} M_{t-1}^+) \tilde C_{t-1}^T = 0 \tag{2.47}$$

follows from the equality

$$(I_r - L_{t-1} (L_{t-1}^T L_{t-1})^{-1} L_{t-1}^T) L_{t-1} \Gamma(t - 1) = 0.$$

Theorem 2.3. For any $t = 1, 2, \ldots$ the solution to the optimization problem Eq. (2.32) given by Eq. (2.43) can be obtained by the RLSM with the diffuse initialization and the initial condition $\alpha_0 = 0$.

Proof. Let us show by induction that

$$\alpha_t = \alpha_t^{dif}, \quad t = 1, 2, \ldots, N. \tag{2.48}$$

If $t = 1$, then the assertion follows from the expressions

$$\alpha_1^{dif} = K_1^{dif} y_1 = (C_1^T C_1)^+ C_1^T y_1,$$

$$\alpha_1 = \tilde C_1^+ \tilde R_1^{-1/2} Y_1 = (R_1^{-1/2} C_1)^+ R_1^{-1/2} y_1 = (C_1^T C_1)^+ C_1^T y_1.$$

Assume that Eq. (2.48) holds at some $t - 1$. We show that it also holds at $t$. Since [47] $\tilde C_t^+ = M_t^+ \tilde C_t^T$, the use of Eq. (2.44) gives

$$\alpha_t = \tilde C_t^+ \tilde R_t^{-1/2} Y_t = [(I_r - K_t^{dif} C_t) M_{t-1}^+ \tilde C_{t-1}^T, \ K_t^{dif} R_t^{1/2}] \, (Y_{t-1}^T \tilde R_{t-1}^{-1/2}, \ y_t^T R_t^{-1/2})^T$$

$$= (I_r - K_t^{dif} C_t) M_{t-1}^+ \tilde C_{t-1}^T \tilde R_{t-1}^{-1/2} Y_{t-1} + K_t^{dif} y_t = (I_r - K_t^{dif} C_t) \alpha_{t-1}^{dif} + K_t^{dif} y_t = \alpha_{t-1}^{dif} + K_t^{dif} (y_t - C_t \alpha_{t-1}^{dif}).$$

The definition of $K_t^{dif}$ involves the calculation of the pseudoinverse of a matrix of high order. We show how to avoid this when $m = 1$ by calculating recursively the pseudoinverse of the matrix

$$\tilde C_t = (\lambda^{(t-1)/2} C_1^T, \lambda^{(t-2)/2} C_2^T, \ldots, C_t^T)^T$$

using the Greville formula [47]

$$(\tilde C_t^+)^T = \begin{pmatrix} (\tilde C_{t-1}^+)^T (I_r - C_t^T k_t^T) \\ k_t^T \end{pmatrix}, \tag{2.49}$$

where

$$k_t = \begin{cases} \dfrac{(I_r - \tilde C_{t-1}^T (\tilde C_{t-1}^+)^T) C_t^T}{\|(I_r - \tilde C_{t-1}^T (\tilde C_{t-1}^+)^T) C_t^T\|^2}, & (I_r - \tilde C_{t-1}^T (\tilde C_{t-1}^+)^T) C_t^T \ne 0, \\[2ex] \dfrac{\tilde C_{t-1}^+ (\tilde C_{t-1}^+)^T C_t^T}{1 + \|(\tilde C_{t-1}^+)^T C_t^T\|^2}, & (I_r - \tilde C_{t-1}^T (\tilde C_{t-1}^+)^T) C_t^T = 0. \end{cases} \tag{2.50}$$

It follows from Eq. (2.49) that $k_t = \tilde C_t^+ e_t = (\tilde C_t^T \tilde C_t)^+ C_t^T$, where $e_t$ is the last unit vector in $R^t$, and therefore $K_t^{dif} = k_t$.

Let us show that in the transition stage ($t \le tr$) under the condition

$$\mathrm{rank}(\tilde C_t) = mt, \quad \bar\alpha = 0 \tag{2.51}$$

the diffuse algorithm does not depend on the forgetting factor $\lambda$. Initially, we show that

$$H_{t,s} K_s^{dif} = \lambda^{t-s} W_t^+ C_s^T, \quad t > s, \tag{2.52}$$

where $H_{t,s}$ is the transition matrix of the system Eq. (2.38). Denote by $X_{t,s}$ the normalized transition matrix of the auxiliary system

$$x_t = (I_r - K_t C_t) x_{t-1}, \quad t = 1, 2, \ldots, N.$$

Let us establish the form of the matrix $X_{t,s}$. Using Eq. (2.4) we find

$$I_r = M_t^{-1} M_t = \lambda M_t^{-1} M_{t-1} + M_t^{-1} C_t^T C_t.$$

Since

$$I_r - M_t^{-1} C_t^T C_t = \lambda M_t^{-1} M_{t-1},$$

$X_{t,s}$ satisfies the system

$$X_{t,s} = \lambda M_t^{-1} M_{t-1} X_{t-1,s}. \tag{2.53}$$

It follows from this that

$$M_t X_{t,s} = \lambda M_{t-1} X_{t-1,s} = \lambda^2 M_{t-2} X_{t-2,s} = \ldots = \lambda^{t-s} M_s X_{s,s}, \quad X_{t,s} = \lambda^{t-s} M_t^{-1} M_s. \tag{2.54}$$

Using Lemmas 2.1 and 2.2 gives

$$H_{t,s} K_s^{dif} = \lim_{\mu \to \infty} X_{t,s} M_s^{-1} C_s^T = \lambda^{t-s} \lim_{\mu \to \infty} M_t^{-1} C_s^T = \lambda^{t-s} W_t^+ C_s^T, \tag{2.55}$$

$$H_{t,0} = \lim_{\mu \to \infty} X_{t,0} = \lambda^t \lim_{\mu \to \infty} M_t^{-1} M_0 = \bar P (I_r - W_t W_t^+). \tag{2.56}$$

Since

$$H_{t,0} = \bar P (I_r - W_t W_t^+) = \bar P^{1/2} (I_r - (\bar P^{1/2} W_t)(\bar P^{1/2} W_t)^+) \bar P^{1/2} = (I_r - W_t W_t^+) \bar P,$$

this implies that the system Eq. (2.38) has the normalized transition matrix Eq. (2.39). Note that in proving Lemma 2.4 we did not use the power series expansion Eq. (2.18), which, for example, does not hold in general for random inputs. For $t \le tr$ we have

$$\alpha_t^{dif} = H_{t,0} \bar\alpha + \sum_{s=1}^{t} H_{t,s} K_s^{dif} y_s = (I_r - W_t W_t^+) \bar\alpha + W_t^+ \sum_{s=1}^{t} \lambda^{t-s} C_s^T y_s. \tag{2.57}$$

As

$$W_t = \sum_{s=1}^{t} \lambda^{t-s} C_s^T C_s = \lambda^t \bar C_t^T \tilde R_t^{-1} \bar C_t = \lambda^t \tilde C_t^T \tilde C_t, \quad W_t^+ = \lambda^{-t} (\tilde C_t^T \tilde C_t)^+,$$

$$\sum_{s=1}^{t} \lambda^{t-s} C_s^T y_s = \lambda^t \tilde C_t^T \tilde R_t^{-1/2} Y_t, \quad \tilde C_t = \tilde R_t^{-1/2} \bar C_t,$$

where

$$\bar C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T, \quad \tilde C_t = (C_1^T R_1^{-1/2}, C_2^T R_2^{-1/2}, \ldots, C_t^T R_t^{-1/2})^T = \tilde R_t^{-1/2} \bar C_t \in R^{mt \times r},$$

$$Y_t = (y_1^T, y_2^T, \ldots, y_t^T)^T \in R^{mt}, \quad \tilde R_t = \mathrm{block\ diag}(\lambda I_m, \lambda^2 I_m, \ldots, \lambda^t I_m) \in R^{mt \times mt},$$

and for any matrix $A$ we have $A^+ = (A^T A)^+ A^T$ [47], the substitution of these expressions into Eq. (2.57) gives

$$\alpha_t^{dif} = (I_r - W_t W_t^+) \bar\alpha + \lambda^t W_t^+ \tilde C_t^T \tilde R_t^{-1/2} Y_t = (I_r - W_t W_t^+) \bar\alpha + (\tilde C_t^T \tilde C_t)^+ \tilde C_t^T \tilde R_t^{-1/2} Y_t = (I_r - W_t W_t^+) \bar\alpha + \bar C_t^+ Y_t.$$

This implies our assertion. Let us show that if $\mathrm{rank}(\tilde C_t) = mt$, then the LSM estimate of minimum norm Eq. (2.43) does not depend on the forgetting factor $\lambda$. Let $A$, $B$ be arbitrary $n \times p$, $p \times m$ matrices, respectively. It is known [47] that if $\mathrm{rank}(A) = \mathrm{rank}(B) = p$, then $(AB)^+ = B^+ A^+$. This implies

$$\tilde C_t^+ = (\tilde R_t^{-1/2} \bar C_t)^+ = \bar C_t^+ \tilde R_t^{1/2}.$$

Substitution of this expression in Eq. (2.43) gives $\alpha_t = \bar C_t^+ Y_t$.

To conclude this section, we present analogues of some of the results obtained, for the case when the matrix of intensities of the measurement noise $R_t$ is used in the quality criterion instead of the forgetting factor. In fact, they are simple consequences of the results above.
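The recursive computation of the gain by the Greville formula, Eqs. (2.49) and (2.50), can be sketched as follows for $m = 1$. The sketch is restricted to $\lambda = 1$, so that the stacked matrix simply grows by one row per step; the function name `greville_append_row`, the tolerance `tol` used to decide which branch of Eq. (2.50) applies, and the test rows are assumptions introduced for illustration.

```python
import numpy as np

def greville_append_row(B, B_pinv, c, tol=1e-10):
    """Append row c to B and update its pseudoinverse by Eqs. (2.49)-(2.50)."""
    d = B_pinv.T @ c                      # (B^+)^T C_t^T
    resid = c - B.T @ d                   # (I - B^T (B^+)^T) C_t^T
    if np.linalg.norm(resid) > tol * max(1.0, np.linalg.norm(c)):
        k = resid / (resid @ resid)       # first branch of (2.50)
    else:
        k = B_pinv @ d / (1.0 + d @ d)    # second branch of (2.50)
    # (2.49) transposed: new pinv = [(I - k C_t) B^+, k]
    new_pinv = np.hstack([B_pinv - np.outer(k, d), k[:, None]])
    return np.vstack([B, c]), new_pinv, k

# Hypothetical rows C_t; the third row depends linearly on the first two.
rows = np.array([[1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [2.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])
r = rows.shape[1]
B, B_pinv = np.zeros((0, r)), np.zeros((r, 0))
pinv_err, gain_err = 0.0, 0.0
for c in rows:
    B, B_pinv, k = greville_append_row(B, B_pinv, c)
    pinv_err = max(pinv_err, np.max(np.abs(B_pinv - np.linalg.pinv(B))))
    K_dif = np.linalg.pinv(B.T @ B, rcond=1e-10) @ c    # W_t^+ C_t^T
    gain_err = max(gain_err, np.max(np.abs(k - K_dif)))

print(pinv_err, gain_err)
```

Each step needs only matrix-vector products, and the vector `k` returned by the update agrees with the diffuse gain $W_t^+ C_t^T$ computed directly. For $\lambda \ne 1$ the previously stacked rows would additionally have to be rescaled by $\lambda^{1/2}$ at every step, which is not covered by this sketch.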

Theorem 2.4.
1. The matrices $P_t$ and $K_t$ in Eqs. (2.11) and (2.13) can be expanded in the power series

$$P_t = M_t^{-1} = \bar P (I_r - W_t W_t^+) \mu + W_t^+ + \sum_{i=1}^{q} (-1)^i \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} + O(\mu^{-q-1}), \tag{2.58}$$

$$K_t = \Big[ W_t^+ + \sum_{i=1}^{q} (-1)^i \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} \Big] C_t^T R_t^{-1} + O(\mu^{-q-1}), \tag{2.59}$$

which converge uniformly in $t \in T = \{1, 2, \ldots, N\}$ for bounded $T$ and sufficiently large values of $\mu$, where

$$W_t = W_{t-1} + C_t^T R_t^{-1} C_t, \quad W_0 = 0_{r \times r}. \tag{2.60}$$

2. For any $\varepsilon > 0$

$$P(\|\alpha_t - \tilde\alpha_t\| \ge \varepsilon) = O(\mu^{-q-1}), \quad \mu \to \infty, \quad t = 1, 2, \ldots, N, \tag{2.61}$$

where

$$\tilde\alpha_t = \tilde\alpha_{t-1} + \tilde K_t (y_t - C_t \tilde\alpha_{t-1}), \quad \tilde\alpha_0 = \bar\alpha,$$

$$\tilde K_t = \Big[ W_t^+ + \sum_{i=1}^{q} (-1)^i \bar P^{-1/2} (\bar P^{1/2} W_t^+ \bar P^{1/2})^{i+1} \bar P^{-1/2} \mu^{-i} \Big] C_t^T R_t^{-1}.$$

Proof.
1. As

$$M_t = \bar P^{-1}/\mu + \sum_{k=1}^{t} C_k^T R_k^{-1} C_k,$$

then putting in Lemma 2.1

$$\Omega_t = M_t, \quad \Omega_0 = \bar P^{-1}, \quad F_t = R_t^{-1/2} C_t,$$

we obtain Eq. (2.58). The representation Eq. (2.59) follows from Eq. (2.58), Lemma 2.2, and the equality

$$K_t = M_t^{-1} C_t^T R_t^{-1} = P_t C_t^T R_t^{-1}.$$

2. We omit the proof of Eq. (2.61), which is similar to the derivation of Eq. (2.21) in Theorem 2.1.

Neglecting in Eqs. (2.58) and (2.59) the terms beginning with the first order of smallness $O(\mu^{-1})$, we get

$$P_t = M_t^{-1} = \bar P (I_r - W_t W_t^+) \mu + W_t^+ + O(\mu^{-1}),$$

$$\alpha_t^{dif} = \alpha_{t-1}^{dif} + K_t^{dif} (y_t - C_t \alpha_{t-1}^{dif}), \quad \alpha_0^{dif} = \bar\alpha, \tag{2.62}$$

where

$$K_t^{dif} = W_t^+ C_t^T R_t^{-1}, \quad t = 1, 2, \ldots, N. \tag{2.63}$$

Note that analogues of Consequences 2.1.1-2.1.3 remain valid in this case. Consequence 2.1.3 can be extended as follows. The expression

$$K_t = [-\bar P W_t \, \delta W_t^+ \mu + O(1)] C_t^T R_t^{-1}, \quad \mu \to \infty,$$

implies that the effect of divergence may be increased by high-precision measurements (small values of $\|R_t\|$). Note also that under the condition Eq. (2.51), where

$$\tilde C_t = (C_1^T R_1^{-1/2}, C_2^T R_2^{-1/2}, \ldots, C_t^T R_t^{-1/2})^T \in R^{mt \times r},$$

the diffuse algorithm does not depend on $R_t$ when $t \le tr$.

Theorem 2.5.
1. For the transition matrix of the system

$$e_t = (I_r - K_t^{dif} C_t) e_{t-1} = (I_r - W_t^+ C_t^T R_t^{-1} C_t) e_{t-1}, \tag{2.64}$$

the following representations hold:

$$H_{t,0} = I_r - W_t^+ W_t, \quad t = 1, 2, \ldots, N, \tag{2.65}$$

$$H_{t,s} = W_t^{-1} W_s, \quad t \ge tr, \quad s = 1, 2, \ldots, t - 1. \tag{2.66}$$

2. The estimate $\alpha_t^{dif}$ coincides with the LSM estimate of minimum norm.

2.3 EXAMPLES OF APPLICATION

2.3.1 Identification of Nonlinear Dynamic Plants

Let us illustrate the use of the RLSM with the diffuse initialization for the construction of a functionally connected neural network (NN). Suppose that a plant is described by a nonlinear autoregressive moving average model of the form

$$y_t = \Phi(z_t, \beta)\alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{2.67}$$

where $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-a}, u_{t-d}, u_{t-d-1}, \ldots, u_{t-d-b}) \in R^{a+b}$ is a vector of inputs, $y_t \in R^1$ is a measured output, $\alpha \in R^r$, $\beta \in R^l$ are vectors of unknown parameters, $\Phi(z_t, \beta)$ is a known function, nonlinear with respect to $z_t$, $\beta$, $N$ is the training set size, $\xi_t \in R^1$ is a random process that has uncorrelated values, zero expectation, and variance $R_t = E[\xi_t^2]$, and $a, b, d > 0$ are some integers. It is required to find estimates $\beta_t$, $\alpha_t$ using the input-output pairs $\{z_i, y_i\}$, $i = 1, 2, \ldots, t$.

We use the diffuse algorithms from Section 2.2 and the approach based on functionally connected NNs. The values of $\beta$ are chosen as small random numbers, while the vector $\alpha$ is unknown and its estimate must be found from the input-output pairs observed up to time $t$. Thus the problem reduces to the estimation of $\alpha$ in the linear observation model

$$y_t = C_t \alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{2.68}$$

where $C_t = \Phi(z_t, \beta)$. If we use a perceptron with one hidden layer (1.6) and scalar output, then the elements of the matrix $C_t$ are defined by the expression

$$C_t = (\sigma(a_1 z_t + b_1), \sigma(a_2 z_t + b_2), \ldots, \sigma(a_r z_t + b_r)),$$

where $a_k = (a_{k1}, a_{k2}, \ldots, a_{k(a+b)})$ and $b_k$, $k = 1, 2, \ldots, r$, are weights and biases, respectively, and $\sigma(\cdot)$ is the activation function (AF). For the RBNN with one hidden layer (1.9) and scalar output, $C_t$ is determined by the expression

$$C_t = (1, \varphi(b_1 \|z_t - a_1\|^2), \varphi(b_2 \|z_t - a_2\|^2), \ldots, \varphi(b_r \|z_t - a_r\|^2)),$$

where $a_k$, $b_k$, $k = 1, 2, \ldots, r$, are centers and scale factors, respectively, and $\varphi(\cdot)$ is a basis function.

Example 2.2. Let a plant be described by the nonlinear difference equation [58]

$$y_t = y_{t-1} y_{t-2} (y_{t-1} + 2.5)/(1 + y_{t-1}^2 + y_{t-2}^2) + u_{t-1}. \tag{2.69}$$

A sample of $u_t$ from the uniform distribution on the interval $[-2, 2]$ is used for training, and the signal $u_t = \sin(2\pi t/250)$ is used for testing. The plant model is sought in the form

$$y_t = f(y_{t-1}, y_{t-2}, u_{t-1}), \tag{2.70}$$

where $f(\cdot)$ is the multilayer perceptron (1.6) with the sigmoid AF. Only the output layer weights are estimated. The weights and biases of the hidden layer are selected from the uniform distribution on the intervals $[-1, 1]$ and $[0, 1]$, respectively. The size of the training sample is $N = 2000$, the size of the testing sample is $N = 500$, $\lambda = 1$, $R_t = 1$.

Figs. 2.2 and 2.3 present the outputs of the plant and of the models with 10 and 5 neurons in the hidden layer (curves 1 and 2), respectively. We show some fragments of realizations on the testing samples. It can be seen that with 10 neurons in the hidden layer the outputs of the system and of its model are visually practically indistinguishable.

Figure 2.2 Dependencies of the plant output and the model output on time with ten neurons.

Figure 2.3 Dependencies of the plant output and the model output on time with five neurons.

Table 2.1 shows the values of the 90th percentile of the mean square error on the training and test sets with $N = 2000$ and $N = 500$, respectively. It is evident that if the number of hidden layer neurons and the training sample size are small, then the identification results obtained with the functionally connected models can be unsatisfactory.

Table 2.1 Training errors

Sizes of training     Number of neurons      Errors
and testing sets      in the hidden layer    Training set    Testing set
N = 2000              5                      0.58            0.6
                      10                     0.23            0.23
                      15                     0.14            0.08
                      20                     0.12            0.044
N = 500               4                      0.82            0.9
                      5                      0.57            0.6
                      6                      0.46            0.46

Example 2.3. Let us use the RBNN and the diffuse algorithm to identify the nonlinear plant described by the equation [59]

$$y_t = u_t^3 + \frac{y_{t-1}}{1 + y_{t-1}^2}. \tag{2.71}$$

As the basis function we choose the Gaussian function, with $z_{t1} = u_t$, $z_{t2} = y_t$, $r = 5$, $a_1 = (-1, -1)$, $a_2 = (-0.5, -0.5)$, $a_3 = (0, 0)$, $a_4 = (0.5, 0.5)$, $b_j = 3.0$, $j = 1, 2, \ldots, 5$, $u_t = \sin(Tt)$, $T = 0.001$. Figs. 2.4-2.7 show the plant outputs and the diffuse algorithm outputs, and the estimation error $e_t = y_t - C_t \alpha_{t-1}^{dif}$, with $\lambda = 0.98$ and $\lambda = 1$, respectively. It can be seen that the use of the forgetting factor can significantly reduce the error.

Figure 2.4 Plant and diffuse algorithm outputs with $\lambda = 0.98$.

Figure 2.5 The estimation error with $\lambda = 0.98$.

Figure 2.6 Plant and diffuse algorithm outputs with $\lambda = 1$.

Figure 2.7 The estimation error with $\lambda = 1$.
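The construction of the regressor $C_t$ for the RBNN and the estimation of the output weights can be sketched as follows. This is not a reproduction of Example 2.3: the Gaussian basis function is taken as $\varphi(s) = \exp(-s)$, only the four centers listed above are used, the input signal and the choice $z_t = (u_t, y_{t-1})$ are illustrative assumptions, and the weights are fitted by a batch minimum-norm least squares solve, which by Theorem 2.3 coincides with the diffuse estimate for $\lambda = 1$ and zero initial condition.

```python
import numpy as np

def rbf_row(z, centers, b):
    """Row C_t = (1, phi(b_1||z - a_1||^2), ...) with phi(s) = exp(-s) (assumed)."""
    sq_dist = np.sum((centers - np.asarray(z, dtype=float)) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-b * sq_dist)))

# The four centers listed in the text; scale factors b_j = 3.0.
centers = np.array([[-1.0, -1.0], [-0.5, -0.5], [0.0, 0.0], [0.5, 0.5]])
b = np.full(len(centers), 3.0)

N = 300
u = np.sin(0.05 * np.arange(N))          # illustrative input, not the book's signal
y = np.zeros(N)
rows = []
for t in range(1, N):
    y[t] = u[t] ** 3 + y[t - 1] / (1.0 + y[t - 1] ** 2)    # plant (2.71)
    rows.append(rbf_row((u[t], y[t - 1]), centers, b))     # z_t = (u_t, y_{t-1}), assumed

C = np.array(rows)
targets = y[1:]
alpha, *_ = np.linalg.lstsq(C, targets, rcond=None)        # minimum-norm LS weights
rmse = np.sqrt(np.mean((targets - C @ alpha) ** 2))
print(C.shape, rmse)
```

Because the first element of every row is the constant 1, the fitted model can never be worse on the training data than predicting the mean of the output.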

2.3.2 Supervisory Control A functional diagram of the control system is shown in Fig. 1.1. Suppose that the model of the plant is known and is described by a nonlinear model of the autoregressive moving average yt 5 Fðzt ; ut Þ 1 ξ t ; t 5 1; 2; :::; N ;

(2.72)

where FðU; UÞ is a given nonlinear function, yt AR1 is a plant output, ut AR1 is a control, ξt AR1 is a random process that has uncorrelated values, zero expectation, and variance Rt 5 E½ξ 2t , zt 5 ðyt21 ; yt22 ; :::; yt2a Þ. Supervisory control is sought in the form uct 5 Φðrt ; βÞα; t 5 1; 2; :::; N ;

(2.73)

44

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks

where $\alpha \in R^r$, $\beta \in R^l$ are vectors of unknown parameters, $r_t$ is a reference signal, $\Phi(r_t, \beta)$ is a given function that is nonlinear in $r_t$ and $\beta$, and $N$ is the training set size. The PD controller is given by

$$u_t^{pd} = k_p e_t + k_d \dot{e}_t,$$

where $e_t = r_t - y_t$ is the control error, $\dot{e}_t$ is the derivative of the control error, and $k_p$, $k_d$ are the PD parameters to be selected. As in the previous section we use an approach based on the functionally connected NN. We assume that the vector $\beta$ is specified and that the vector $\alpha$ is unknown and must be estimated. The control problem is formulated as follows. It is required to choose $\alpha$ by minimizing the quality criterion

$$J = \frac{1}{2}\sum_{i=1}^{t}(u_i^c - r_i)^2 = \frac{1}{2}\sum_{i=1}^{t}(\Phi(r_i, \beta)\alpha - r_i)^2$$

for each $t = 1, 2, \ldots, N$, and to choose the PD regulator coefficients so as to ensure the stability of the closed-loop system and the specified quality of the transient process. We illustrate the behavior of the diffuse estimation algorithm with a numerical example and compare it with the gradient algorithm.

Example 2.4. Let the plant be linear and described by the transfer function [59]

$$G(p) = \frac{1000}{p^3 + 50p^2 + 2000p}.$$

The functionally connected system is selected in the class of the RBNN with the Gaussian basic function

$$C_t = \Phi(\tilde r_t, \beta) = \left(1, \varphi(b_1\|r_t - a_1\|^2), \varphi(b_2\|r_t - a_2\|^2), \ldots, \varphi(b_5\|r_t - a_5\|^2)\right),$$

where $a = (-5, -3, 0, 3, 5)$, $b = (0.5, 0.5, 0.5, 0.5)$, $k_p = 30$, $k_d = 0.5$, $T = 0.001$ is the sampling step, $\alpha = 0.05$ is the learning rate, and $\eta = 0.3$ is the momentum coefficient in the gradient algorithm. Figs. 2.8 and 2.9 present, as functions of time, the plant output and the reference signal, and the supervisory control and the PD controller output (curves 1 and 2, respectively). The supervisory control values were clipped whenever their absolute value exceeded 10.


Figure 2.8 Dependencies of the plant output and the reference signal on time.

Figure 2.9 Dependencies of the supervisory control and the PD controller on time.

Figure 2.10 Transient processes in a closed system with diffuse algorithm (the dotted curve) and the gradient method (the dash-dotted curve), the continuous curve is the reference signal rt .

Fig. 2.10 shows the transient processes in the closed-loop system with the diffuse algorithm (the dotted curve) and the gradient method (the dash-dotted curve); the continuous curve is the reference signal $r_t$. The transient process in the control system with the diffuse supervisory control is substantially shorter than with the gradient algorithm, and the overshoot does not exceed 2%.


2.3.3 Estimation With a Sliding Window

One of the known approaches to making estimation algorithms robust with respect to perturbations is to use a sliding window [60–64]. In this case the system model is assumed to be adequate only on the sliding window interval rather than on the entire interval of observations. In these works, preventing divergence of the algorithm and improving the accuracy of estimation are proposed in two ways:
1. Reducing the impact of, or discarding, the data outside the current sliding window.
2. Testing hypotheses about the values of the parameters using estimates derived from past data and from the current sliding window, and reinitializing the estimation algorithm if necessary.

In both cases one can use the RLSM with diffuse initialization, which gives unbiased estimates after processing a finite number of observations. If the noise is absent, this algorithm can accurately restore the vector of unknown parameters in a finite number of steps, as opposed to the KF. Consider the intervals of observations $[t - M, t]$, $t = 1, 2, \ldots, N$ (sliding windows), where $M > tr$, and suppose that at the moment $t - M$ a priori information about the vector $\alpha_{t-M}$ is absent. Using Theorem 2.4, it is easy to write the following relations for estimating $\alpha_t$ at the moment $t$ from a sliding window of the latest $M$ observations:

$$\alpha_s = \alpha_{s-1} + K_s^{dif}(y_s - C_s\alpha_{s-1}), \quad \alpha_{t-M} = \alpha, \tag{2.74}$$

$$K_s^{dif} = W_s^{+}C_s^T R_s^{-1}, \tag{2.75}$$

where

$$W_s = W_{s-1} + C_s^T R_s^{-1} C_s, \quad W_{t-M} = 0, \quad s = t-M+1, t-M+2, \ldots, t, \quad t = M, M+1, \ldots \tag{2.76}$$
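The recursion in Eqs. (2.74)–(2.76) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `diffuse_window_estimate`, the zero starting vector, the constant noise variance `R`, and the test data are all assumptions made for the example. It processes one window and checks the property stated above, that noise-free data are reproduced exactly after finitely many steps.

```python
import numpy as np

def diffuse_window_estimate(C, y, R=1.0, rcond=1e-10):
    """Run Eqs. (2.74)-(2.76) over one window of observations.

    C : (M, r) array, row s is the regressor C_s
    y : (M,) array of scalar outputs
    R : measurement noise variance (assumed constant in this sketch)
    """
    M, r = C.shape
    alpha = np.zeros(r)          # starting value; zero is an assumption
    W = np.zeros((r, r))         # W_{t-M} = 0: no prior information
    for s in range(M):
        c = C[s]
        W = W + np.outer(c, c) / R                # Eq. (2.76)
        K = np.linalg.pinv(W, rcond) @ c / R      # Eq. (2.75), W^+ is the pseudoinverse
        alpha = alpha + K * (y[s] - c @ alpha)    # Eq. (2.74)
    return alpha

# Hypothetical noise-free data: 6 observations of a 3-parameter model.
rng = np.random.default_rng(0)
alpha_true = np.array([0.98, -0.005, 0.3])
C = rng.normal(size=(6, 3))
y = C @ alpha_true
alpha_hat = diffuse_window_estimate(C, y)
print(np.abs(alpha_hat - alpha_true).max())
```

For a sliding window, the same function would simply be called again on the latest $M$ rows at each time $t$, which corresponds to restarting from $W_{t-M} = 0$.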

Example 2.5. Let us illustrate the effect of using the RLSM with diffuse initialization in the sliding window regime under parametric uncertainty. Consider the linear autoregressive moving average model [64]

$$y_t + a_t y_{t-1} = b_t u_{t-1} + \xi_t, \quad t = 1, 2, \ldots, N, \tag{2.77}$$

where $a_t = a - \delta_t$, $b_t = b - \delta_t$, $a = 0.98$, $b = -0.005$, $R = 0.003$,

$$u_t = \sin(0.1t) + 50\cos(200t), \tag{2.78}$$

and $\delta_t$ is the impulse noise defined by

$$\delta_t = \begin{cases} 0.05, & 200 \le t \le 205, \\ 0, & \text{otherwise}. \end{cases}$$

The simulation results are shown in Figs. 2.11 and 2.12, where $e_1(t)$ and $e_2(t)$ are the parameter estimation errors of $a$ and $b$, respectively. Curves 1 and 2 are the estimates obtained with the initialization at $t = 0$ and the estimates obtained using a sliding window with the horizon $M = 50$, respectively. It can be seen that the sliding window estimates converge significantly faster than the estimates initialized at $t = 0$.

Figure 2.11 Estimation errors of a.

Figure 2.12 Estimation errors of b.


Let us show that the RLSM with diffuse initialization can be used together with procedures for detecting an abrupt change in the parameters of the plant model (a change-point). After the abrupt change is detected, the algorithm is reinitialized. To illustrate, we use the signal model Eq. (2.77) and assume that

$$a_t = \begin{cases} a, & t \le 200, \\ 2a, & t > 200, \end{cases} \qquad b_t = \begin{cases} b, & t \le 200, \\ 2b, & t > 200. \end{cases}$$

In this example, changes in the signal model are detected using the cumulative sum (CUSUM) statistical test [64], according to which the statistic

$$g_t = g_{t-1} + e_t - \nu, \quad g_0 = 0, \quad t = 1, 2, \ldots, N,$$

is calculated if $-h < g_t < h$; otherwise $g_t = 0$ and the alarm time is set to $t_a = t$. Here $h$ and $\nu$ are parameters selected by the user and $e_t = y_t - C_t\alpha_t$. The simulation results are shown in Figs. 2.13 and 2.14, where curves 1 and 2 are the parameter estimates obtained using CUSUM with $h = 100$, $\nu = 0.5$ and the parameter estimates with the initialization at $t = 0$, respectively.
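The reset rule described above can be written as a short loop. The sketch below is an illustration only: the function name `cusum_alarms` and the residual sequence are hypothetical, and the reinitialization of the estimator at each alarm is indicated by a comment rather than implemented.

```python
def cusum_alarms(residuals, h, nu):
    """Return the 1-based alarm times t_a produced by the reset rule in the text."""
    g = 0.0
    alarms = []
    for t, e in enumerate(residuals, start=1):
        g = g + e - nu                  # g_t = g_{t-1} + e_t - nu
        if not (-h < g < h):            # threshold crossed: change detected
            alarms.append(t)            # t_a = t
            g = 0.0                     # reset; the RLSM would be reinitialized here
    return alarms

# Hypothetical residuals: zero before a change, constant offset after it.
residuals = [0.0] * 5 + [1.0] * 7
alarms = cusum_alarms(residuals, h=2.5, nu=0.0)
print(alarms)  # [8, 11]
```

A larger threshold `h` delays the alarm but makes false alarms less likely, while `nu` sets the drift that is tolerated before the statistic starts to accumulate.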

Figure 2.13 Estimation of the parameter a.

Figure 2.14 Estimation of the parameter b.

CHAPTER 3

Statistical Analysis of Fluctuations of Least Squares Algorithm on Finite Time Interval

Contents
3.1 Problem Statement
3.2 Properties of Normalized Root Mean Square Estimation Error
3.3 Fluctuations of Estimates Under Soft Initialization With Large Parameters
3.4 Fluctuations Under Diffuse Initialization
3.5 Fluctuations With Random Inputs

3.1 PROBLEM STATEMENT

We restrict ourselves to the most frequently occurring applications of the linear model Eq. (2.1) with the scalar output ($m = 1$) and measurement noise of constant intensity ($R_t = R$). In this case, the recursive least-squares method (RLSM) with the forgetting factor and the initializations $\alpha_0 = 0$, $P_0 = I_r\mu$ is determined by the following expressions:

$$\alpha_t = \alpha_{t-1} + K_t(y_t - C_t\alpha_{t-1}), \quad \alpha_0 = 0, \tag{3.1}$$

$$K_t = M_t^{-1}C_t^T = P_tC_t^T, \tag{3.2}$$

$$M_t = \lambda M_{t-1} + C_t^TC_t, \quad M_0 = P_0^{-1} = I_r/\mu, \tag{3.3}$$

$$P_t = \left(P_{t-1} - P_{t-1}C_t^T(\lambda + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1}\right)/\lambda, \quad P_0 = I_r\mu, \quad t = 1, 2, \ldots, N. \tag{3.4}$$

The components of $\alpha$ in Eq. (2.1) are assumed to be unknown constants and $t$ takes values from the bounded set $t \in T = \{1, 2, \ldots, N\}$, where

$$N > tr = \min_t\left\{t : \sum_{k=1}^{t}C_k^TC_k > 0, \ t = 1, 2, \ldots, N\right\}.$$

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00003-2

© 2017 Elsevier Inc. All rights reserved.


To characterize the statistical properties of the RLSM we will use the bias vector, the matrix of second moments of the estimation error, and the normalized mean square estimation error, respectively:

$$a_t(\mu) = E[\alpha_t] - \alpha = E[e_t], \tag{3.5}$$

$$\Psi_t(\mu) = E[e_te_t^T], \tag{3.6}$$

$$\beta_t(\mu) = E[e_t^Te_t]/\|\alpha\|^2 = MSE_t(\mu)/\|\alpha\|^2 = \mathrm{trace}[\Psi_t(\mu)]/\|\alpha\|^2, \tag{3.7}$$

where $e_t = \alpha_t - \alpha$. As $e_0 = -\alpha$ and $\Psi_0(\mu) = \alpha\alpha^T$, we have $\beta_0(\mu) = 1$. The function $\beta_t(\mu)$ determines the value of the RLSM overshoot for $t \in T$ and fixed $\mu > 0$: if $\beta_t(\mu) \le 1$, the overshoot is absent, and it is observed if $\beta_t(\mu) > 1$. It is required to:
1. Study the behavior of $a_t(\mu)$, $\Psi_t(\mu)$, and $\beta_t(\mu)$ for $t \in T$ and $\mu > 0$.
2. Obtain conditions for the absence of the RLSM overshoot with the soft and the diffuse initializations.
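The recursions (3.1)–(3.4) can be sketched as follows. This is an illustrative implementation under assumptions, not code from the book: the function name `rls_soft` and the synthetic data are hypothetical. The sketch propagates both $M_t$ and $P_t$ so that the relation $P_t = M_t^{-1}$ implied by Eq. (3.2) can be checked numerically, together with the batch form $\alpha_N = M_N^{-1}\sum_k \lambda^{N-k}C_k^Ty_k$ that follows from $\alpha_0 = 0$.

```python
import numpy as np

def rls_soft(C, y, mu, lam=1.0):
    """RLSM with forgetting factor lam and soft initialization P_0 = mu * I."""
    N, r = C.shape
    alpha = np.zeros(r)                     # alpha_0 = 0
    P = mu * np.eye(r)                      # P_0 = I_r * mu
    M = np.eye(r) / mu                      # M_0 = I_r / mu
    for t in range(N):
        c = C[t]
        Pc = P @ c
        P = (P - np.outer(Pc, Pc) / (lam + c @ Pc)) / lam   # Eq. (3.4)
        M = lam * M + np.outer(c, c)                        # Eq. (3.3)
        K = P @ c                                           # Eq. (3.2)
        alpha = alpha + K * (y[t] - c @ alpha)              # Eq. (3.1)
    return alpha, P, M

# Hypothetical data for the check.
rng = np.random.default_rng(1)
N, r, lam, mu = 30, 3, 0.95, 10.0
C = rng.normal(size=(N, r))
y = C @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
alpha, P, M = rls_soft(C, y, mu, lam)

# Batch form of the same estimate.
w = lam ** (N - 1 - np.arange(N))           # lam^(N-k), k = 1..N
alpha_batch = np.linalg.solve(M, (C.T * w) @ y)
print(np.abs(P @ M - np.eye(r)).max(), np.abs(alpha - alpha_batch).max())
```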

3.2 PROPERTIES OF NORMALIZED ROOT MEAN SQUARE ESTIMATION ERROR

In this section we study the behavior of $a_t(\mu)$, $\Psi_t(\mu)$, and $\beta_t(\mu)$ for $t \in T$ and finite values of $\mu > 0$. Iterating the equation for the estimation error $e_t = \alpha_t - \alpha$,

$$e_t = (I_r - K_tC_t)e_{t-1} + K_t\xi_t, \quad e_0 = -\alpha, \quad t = 1, 2, \ldots, N, \tag{3.8}$$

gives

$$e_t = -X_{t,0}\alpha + \sum_{s=1}^{t}X_{t,s}K_s\xi_s, \quad t = 1, 2, \ldots, N, \tag{3.9}$$

where $X_{t,s}$ is the transition matrix of the homogeneous system

$$x_t = (I_r - K_tC_t)x_{t-1}, \quad t = 1, 2, \ldots, N. \tag{3.10}$$

Substitution of $X_{t,s} = \lambda^{t-s}M_t^{-1}M_s$, $K_s = M_s^{-1}C_s^T$ in Eq. (3.9) yields

$$e_t = -\lambda^tM_t^{-1}\alpha/\mu + \lambda^tM_t^{-1}\sum_{s=1}^{t}\lambda^{-s}C_s^T\xi_s, \quad t = 1, 2, \ldots, N. \tag{3.11}$$

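Thus

$$a_t(\mu) = E[\alpha_t] - \alpha = -\lambda^tM_t^{-1}\alpha/\mu, \tag{3.12}$$

where

$$M_t = \sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k + \lambda^tI_r/\mu, \quad t = 1, 2, \ldots, N.$$

Using Eqs. (3.11) and (3.12), we obtain

$$\Psi_t(\mu) = E[e_te_t^T] = \lambda^{2t}M_t^{-1}\alpha\alpha^TM_t^{-1}/\mu^2 + RM_t^{-1}\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-1} = a_t(\mu)a_t^T(\mu) + \Pi_t(\mu), \quad t = 1, 2, \ldots, N. \tag{3.13}$$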

It follows from Eqs. (3.12) and (3.13) that the matrix of the second moments of the estimation error $\Psi_t(\mu)$ can be represented as the sum of two components. The first is defined only by the bias and does not depend on the intensity of the measurement noise $R$. Let us show that the second term in Eq. (3.13) is the covariance matrix of $\alpha_t$. Introducing the notation $\bar\alpha_t = \alpha_t - E[\alpha_t]$ and iterating the system of equations

$$\bar\alpha_t = (I_r - K_tC_t)\bar\alpha_{t-1} + K_t\xi_t, \quad \bar\alpha_0 = 0, \quad t = 1, 2, \ldots, N,$$

we obtain

$$\bar\alpha_t = \lambda^tM_t^{-1}\sum_{s=1}^{t}\lambda^{-s}C_s^T\xi_s, \quad t = 1, 2, \ldots, N.$$

Therefore, the covariance matrix of the estimate $\alpha_t$ is given by the expression

$$\mathrm{cov}(\alpha_t) = E[\bar\alpha_t\bar\alpha_t^T] = M_t^{-1}\sum_{k=1}^{t}\sum_{j=1}^{t}\lambda^{2t-k-j}C_k^TE[\xi_k\xi_j^T]C_jM_t^{-1} = RM_t^{-1}\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-1} = \Pi_t(\mu).$$


Substitution of Eq. (3.13) into Eq. (3.7) gives

$$\begin{aligned}\beta_t(\mu) &= \mathrm{trace}\left[\lambda^{2t}M_t^{-1}\alpha\alpha^TM_t^{-1}/\mu^2 + RM_t^{-1}\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-1}\right]/\|\alpha\|^2\\ &= \lambda^{2t}\alpha^TM_t^{-2}\alpha/(\mu^2\|\alpha\|^2) + R\,\mathrm{trace}\left(M_t^{-1}\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-1}\right)/\|\alpha\|^2\\ &= \eta_t(\mu) + \gamma_t(\mu).\end{aligned} \tag{3.14}$$

It can be seen that the numerator of $\eta_t(\mu)$ is proportional to the square of the bias norm and the numerator of $\gamma_t(\mu)$ equals the sum of the diagonal elements of the covariance matrix of $\alpha_t$. Let us continue detailing the expression Eq. (3.7) for the normalized mean square estimation error. We have

$$\tilde M_t \le M_t = \sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k + \lambda^tI_r/\mu \le \bar M_t, \tag{3.15}$$

where

$$\tilde M_t = \lambda^t\left(\sum_{k=1}^{t}C_k^TC_k + I_r/\mu\right) = \lambda^t(\tilde C_t^T\tilde C_t + I_r/\mu),$$

$$\bar M_t = \sum_{k=1}^{t}C_k^TC_k + I_r/\mu = \tilde C_t^T\tilde C_t + I_r/\mu, \quad \tilde C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T.$$

Let $A$, $B$, $C$ be positive definite matrices such that $C \le A \le B$. Since then $B^{-1} \le A^{-1} \le C^{-1}$, we have

$$\bar M_t^{-1} \le M_t^{-1} \le \tilde M_t^{-1}.$$

Using the spectral decomposition

$$\tilde C_t^T\tilde C_t = P_t\Lambda_tP_t^T, \quad P_t^TP_t = I_r,$$

where $\Lambda_t = \mathrm{diag}(\lambda_t(1), \lambda_t(2), \ldots, \lambda_t(r))$ and $P_t = (p_t(1), p_t(2), \ldots, p_t(r))$ is the matrix whose columns are the eigenvectors of the matrix $\tilde C_t^T\tilde C_t$ corresponding to its eigenvalues $\lambda_t(i)$, $i = 1, 2, \ldots, r$, $t = 1, 2, \ldots, N$, we obtain

$$\bar M_t = P_t(\Lambda_t + I_r/\mu)P_t^T, \quad \tilde M_t = \lambda^tP_t(\Lambda_t + I_r/\mu)P_t^T,$$

$$\bar M_t^{-1} = P_t(\Lambda_t + I_r/\mu)^{-1}P_t^T, \quad \tilde M_t^{-1} = \lambda^{-t}P_t(\Lambda_t + I_r/\mu)^{-1}P_t^T.$$
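The ordering in Eq. (3.15) and the reversed ordering of the inverses can be checked numerically on a small example. The sketch below is only an illustration: the regressor matrix `C`, the values of `mu` and `lam`, and the helper names `window_matrices` and `is_psd` are assumptions introduced here.

```python
import numpy as np

def window_matrices(C, mu, lam):
    """Return (M_tilde, M, M_bar) of Eq. (3.15) for t = number of rows of C."""
    t, r = C.shape
    w = lam ** (t - 1 - np.arange(t))              # lam^(t-k), k = 1..t
    V = C.T @ C                                    # sum of C_k^T C_k
    M = (C.T * w) @ C + lam**t * np.eye(r) / mu
    M_tilde = lam**t * (V + np.eye(r) / mu)
    M_bar = V + np.eye(r) / mu
    return M_tilde, M, M_bar

def is_psd(A, tol=1e-10):
    """True if the symmetric part of A has no eigenvalue below -tol."""
    return np.linalg.eigvalsh((A + A.T) / 2).min() >= -tol

# Hypothetical regressors: four observations of a three-parameter model.
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])
M_tilde, M, M_bar = window_matrices(C, mu=10.0, lam=0.9)
inv = np.linalg.inv
print(is_psd(M - M_tilde), is_psd(M_bar - M))
print(is_psd(inv(M) - inv(M_bar)), is_psd(inv(M_tilde) - inv(M)))
```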

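Since $P_t$ is an orthogonal matrix, using the identity $\mathrm{trace}(AB) = \mathrm{trace}(BA)$ we find

$$\tilde\eta_t(\mu) \le \eta_t(\mu) \le \bar\eta_t(\mu), \tag{3.16}$$

where

$$\begin{aligned}\bar\eta_t(\mu) &= \lambda^{2t}\,\mathrm{trace}(\tilde M_t^{-1}\alpha\alpha^T\tilde M_t^{-1})/(\mu^2\|\alpha\|^2)\\ &= \mathrm{trace}\left(P_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\alpha\alpha^TP_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\right)/(\mu^2\|\alpha\|^2)\\ &= \mathrm{trace}\left((\Lambda_t + I_r/\mu)^{-1}P_t^T\alpha\alpha^TP_t(\Lambda_t + I_r/\mu)^{-1}\right)/(\mu^2\|\alpha\|^2)\\ &= \sum_{i=1}^{r}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^2}/\|\alpha\|^2,\end{aligned}$$

$$\begin{aligned}\tilde\eta_t(\mu) &= \lambda^{2t}\,\mathrm{trace}(\bar M_t^{-1}\alpha\alpha^T\bar M_t^{-1})/(\mu^2\|\alpha\|^2)\\ &= \lambda^{2t}\,\mathrm{trace}\left(P_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\alpha\alpha^TP_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\right)/(\mu^2\|\alpha\|^2)\\ &= \lambda^{2t}\,\mathrm{trace}\left((\Lambda_t + I_r/\mu)^{-1}P_t^T\alpha\alpha^TP_t(\Lambda_t + I_r/\mu)^{-1}\right)/(\mu^2\|\alpha\|^2)\\ &= \lambda^{2t}\sum_{i=1}^{r}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^2}/\|\alpha\|^2,\end{aligned}$$

$$\tilde\alpha(t) = (\tilde\alpha_1(t), \tilde\alpha_2(t), \ldots, \tilde\alpha_r(t))^T = P_t^T\alpha.$$

Suppose that $C_t$ is not a linear combination of $C_1, C_2, \ldots, C_{t-1}$ for $t = 2, 3, \ldots, tr$. Since in this case

$$\mathrm{rank}(\tilde C_t^T\tilde C_t) = \mathrm{rank}\left(\sum_{k=1}^{t}C_k^TC_k\right) = \mathrm{rank}(C_1^T, C_2^T, \ldots, C_t^T) = t,$$

$r - t$ of the eigenvalues of $\tilde C_t^T\tilde C_t$ are equal to zero. Assuming without loss of generality that $\lambda_t(i) \ne 0$, $i = 1, 2, \ldots, t$, and $\lambda_t(i) = 0$, $i = t+1, t+2, \ldots, r$, we obtain

$$\bar\eta_t(\mu) = \left[\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^2} + \sum_{i=t+1}^{r}\tilde\alpha_i^2(t)\right]/\|\alpha\|^2, \tag{3.17}$$

$$\tilde\eta_t(\mu) = \lambda^{2t}\left[\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^2} + \sum_{i=t+1}^{r}\tilde\alpha_i^2(t)\right]/\|\alpha\|^2, \quad t = 1, 2, \ldots, N. \tag{3.18}$$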

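Let us establish analogous expressions for $\gamma_t(\mu)$. We have

$$\lambda^{2t}\bar M_t^{-1}W_t\bar M_t^{-1} \le M_t^{-1}\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-1} \le \tilde M_t^{-1}W_t\tilde M_t^{-1},$$

where $W_t = \sum_{k=1}^{t}C_k^TC_k$. Whence it follows that

$$\tilde\gamma_t(\mu) \le \gamma_t(\mu) \le \bar\gamma_t(\mu), \tag{3.19}$$

where

$$\bar\gamma_t(\mu) = R\,\mathrm{trace}(\tilde M_t^{-1}W_t\tilde M_t^{-1})/\|\alpha\|^2, \quad \tilde\gamma_t(\mu) = R\lambda^{2t}\,\mathrm{trace}(\bar M_t^{-1}W_t\bar M_t^{-1})/\|\alpha\|^2.$$

We have also

$$\tilde M_t^{-1}W_t = \lambda^{-t}P_t(\Lambda_t + I_r/\mu)^{-1}P_t^TP_t\Lambda_tP_t^T = \lambda^{-t}P_t(\Lambda_t + I_r/\mu)^{-1}\Lambda_tP_t^T,$$

$$\begin{aligned}\tilde M_t^{-1}W_t\tilde M_t^{-1} &= \lambda^{-2t}P_t(\Lambda_t + I_r/\mu)^{-1}\Lambda_tP_t^TP_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\\ &= \lambda^{-2t}P_t(\Lambda_t + I_r/\mu)^{-2}\Lambda_tP_t^T = \lambda^{-2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^2}p_t(i)p_t^T(i),\end{aligned}$$

$$\begin{aligned}\bar M_t^{-1}W_t\bar M_t^{-1} &= P_t(\Lambda_t + I_r/\mu)^{-1}P_t^TP_t\Lambda_tP_t^TP_t(\Lambda_t + I_r/\mu)^{-1}P_t^T\\ &= P_t(\Lambda_t + I_r/\mu)^{-2}\Lambda_tP_t^T = \sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^2}p_t(i)p_t^T(i).\end{aligned}$$

Substituting the obtained expressions for $\tilde M_t^{-1}W_t\tilde M_t^{-1}$ and $\bar M_t^{-1}W_t\bar M_t^{-1}$ in Eq. (3.19) and taking into account the orthogonality of the columns of $P_t$ gives

$$\bar\gamma_t(\mu) = R\,\mathrm{trace}(\tilde M_t^{-1}W_t\tilde M_t^{-1})/\|\alpha\|^2 = R\lambda^{-2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^2}/\|\alpha\|^2, \tag{3.20}$$

$$\tilde\gamma_t(\mu) = R\lambda^{2t}\,\mathrm{trace}(\bar M_t^{-1}W_t\bar M_t^{-1})/\|\alpha\|^2 = R\lambda^{2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^2}/\|\alpha\|^2. \tag{3.21}$$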

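Theorem 3.1. $\eta_t(\mu)$, $\tilde\eta_t(\mu)$, $\bar\eta_t(\mu)$ are monotonically decreasing functions and $\gamma_t(\mu)$, $\tilde\gamma_t(\mu)$, $\bar\gamma_t(\mu)$ are monotonically increasing functions of $\mu$ for each fixed $t = 1, 2, \ldots, N$.

Proof. This assertion follows from the expressions

$$\begin{aligned}\partial\eta_t(\mu)/\partial\mu &= \frac{\partial}{\partial\mu}\left[\lambda^{2t}\alpha^TM_t^{-2}\alpha/(\mu^2\|\alpha\|^2)\right]\\ &= 2\lambda^{3t}\alpha^TM_t^{-3}\alpha/(\mu^4\|\alpha\|^2) - 2\lambda^{2t}\alpha^TM_t^{-2}\alpha/(\mu^3\|\alpha\|^2)\\ &= -2\lambda^{2t}\alpha^TM_t^{-3}\sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k\alpha/(\mu^3\|\alpha\|^2) < 0,\end{aligned} \tag{3.22}$$

$$\begin{aligned}\partial\bar\eta_t(\mu)/\partial\mu &= 2\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^3\mu^4}/\|\alpha\|^2 - 2\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^3}/\|\alpha\|^2\\ &= -2\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3\mu^3}/\|\alpha\|^2 < 0,\end{aligned} \tag{3.23}$$

$$\begin{aligned}\partial\tilde\eta_t(\mu)/\partial\mu &= 2\lambda^{2t}\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^3\mu^4}/\|\alpha\|^2 - 2\lambda^{2t}\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^3}/\|\alpha\|^2\\ &= -2\lambda^{2t}\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3\mu^3}/\|\alpha\|^2 < 0,\end{aligned} \tag{3.24}$$

$$\begin{aligned}\partial\gamma_t(\mu)/\partial\mu &= \frac{\partial}{\partial\mu}\left[R\,\mathrm{trace}\left(\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-2}\right)\right]/\|\alpha\|^2\\ &= R\,\mathrm{trace}\left[\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_k\,\partial M_t^{-2}/\partial\mu\right]/\|\alpha\|^2\\ &= 2R\lambda^t\,\mathrm{trace}\left(\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-3}\right)/(\mu^2\|\alpha\|^2) > 0,\end{aligned} \tag{3.25}$$

$$\partial\bar\gamma_t(\mu)/\partial\mu = 2R\lambda^{-2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3}/(\mu^2\|\alpha\|^2) > 0, \tag{3.26}$$

$$\partial\tilde\gamma_t(\mu)/\partial\mu = 2R\lambda^{2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3}/(\mu^2\|\alpha\|^2) > 0. \tag{3.27}$$

Let us illustrate the obtained results using a numerical example.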

Example 3.1. Consider the problem of estimating the bias and the amplitudes of harmonics in an alternating electric current signal described by the model

$$y_t = A_0 + \sum_{i=2j+1,\ j=0,1,\ldots,4}\left[A_i\sin(2\pi fi\Delta t) + B_i\cos(2\pi fi\Delta t)\right] + \xi_t, \quad t = 1, 2, \ldots, N,$$

where $f = 60$ Hz is the fundamental frequency, $r = 11$, $\xi_t$ is a noise, $A_0 = -0.078$, $A_1 = 2.54$, $B_1 = 4.25$, $A_3 = 2.13$, $B_3 = -0.35$, $A_5 = 0.42$, $B_5 = -1.39$, $A_7 = -0.72$, $B_7 = -0.67$, $A_9 = -0.4$, $B_9 = 0.019$, $\Delta = 1/(fk) = T/k$, and $k$ is the number of samples per period $T$ of the fundamental frequency. Even harmonics are absent in the model. Figs. 3.1–3.5 show the simulation results when the signal to noise ratio $S_t = \|C_t\|^2\|\alpha\|^2/R$ is equal to 19.4 dB. Fig. 3.1 shows one of the realizations of the signal, where $t$ is the sample number, $k = 40$, $R = 1.5^2$. Figs. 3.2 and 3.3 present the dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ (curves 1, 2, 3, respectively) on $t$ for $\mu = 12$ and $\mu = 36$, respectively, and Fig. 3.4 shows their dependence on $\mu$ for $t = 11$. It is seen that the maximum value of the overshoot (Fig. 3.3) is achieved at the end of the transitional stage at $t = 11$ and that its value becomes less than 1 a few cycles after the stage is completed.

Figure 3.1 Realization of signal.

Figure 3.2 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 12$, $S_t = 19.4$ dB, $k = 40$, $R = 1.5^2$.

Figure 3.3 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 36$, $S_t = 19.4$ dB, $k = 40$, $R = 1.5^2$.

Figure 3.4 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $\mu$, $S_t = 19.4$ dB, $k = 40$, $R = 1.5^2$.

The value $\mu = 12$ corresponds to the absence of overshoot, while for $\mu = 36$ the overshoot is equal to 1.4. Furthermore, it is seen that $\gamma_t(\mu)$ is a monotonically increasing function and $\eta_t(\mu)$ is a monotonically decreasing function of $\mu$. Fig. 3.5 shows the dependencies of $\beta_t(\mu)$ on $t$ for $k = 100$ and $k = 25$ with $\mu = 36$ (curves 1 and 2, respectively). It can be seen that the maximum value of the overshoot is not achieved at the end of the transitional stage $t = 11$, and that the overshoot lasts substantially longer than the transitional stage. Figs. 3.6–3.8 show similar results for the signal to noise ratio of 1 dB and $R = 12.5^2$. It is seen that, to avoid overshoot, the value of $\mu$ must be reduced significantly compared with the previous case.
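The decomposition (3.14) and the monotonicity stated in Theorem 3.1 can be evaluated directly from the regressors. The following sketch is illustrative only: the function names `eta_gamma` and `harmonic_regressors`, and the ordering of the parameters in `alpha` (bias first, then $A_i$, $B_i$ pairs), are assumptions made here, and no attempt is made to reproduce the exact curves of the figures.

```python
import numpy as np

def eta_gamma(C, alpha, R, mu, lam=1.0):
    """Return (eta_t, gamma_t) of Eq. (3.14) for t = number of rows of C."""
    t, r = C.shape
    w = lam ** (t - 1 - np.arange(t))               # lam^(t-k), k = 1..t
    M = (C.T * w) @ C + lam**t * np.eye(r) / mu     # M_t
    Minv = np.linalg.inv(M)
    S2 = (C.T * w**2) @ C                           # sum lam^(2(t-k)) C_k^T C_k
    n2 = alpha @ alpha                              # ||alpha||^2
    eta = lam**(2 * t) * alpha @ Minv @ Minv @ alpha / (mu**2 * n2)
    gamma = R * np.trace(Minv @ S2 @ Minv) / n2
    return eta, gamma

def harmonic_regressors(N, k=40, orders=(1, 3, 5, 7, 9)):
    """Rows C_t = (1, sin(2*pi*i*t/k), cos(2*pi*i*t/k), ...) for odd harmonics i."""
    t = np.arange(1, N + 1)
    cols = [np.ones(N)]
    for i in orders:
        cols += [np.sin(2 * np.pi * i * t / k), np.cos(2 * np.pi * i * t / k)]
    return np.column_stack(cols)

# Parameter values of Example 3.1, in the assumed order (A0, A1, B1, A3, B3, ...).
alpha = np.array([-0.078, 2.54, 4.25, 2.13, -0.35, 0.42,
                  -1.39, -0.72, -0.67, -0.4, 0.019])
C = harmonic_regressors(11)
e12, g12 = eta_gamma(C, alpha, R=1.5**2, mu=12.0)
e36, g36 = eta_gamma(C, alpha, R=1.5**2, mu=36.0)
print("mu=12: eta, gamma, beta =", e12, g12, e12 + g12)
print("mu=36: eta, gamma, beta =", e36, g36, e36 + g36)
```

Evaluating the function on a grid of $\mu$ values for fixed $t$ gives curves of the kind described for Fig. 3.4.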

Figure 3.5 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 36$, $S_t = 19.4$ dB, $k = 100$, $k = 25$, $R = 1.5^2$.

Figure 3.6 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 0.022$, $S_t = 1$ dB, $k = 40$, $R = 12.5^2$.

Figure 3.7 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $t$ for $\mu = 0.072$, $S_t = 1$ dB, $k = 40$, $R = 12.5^2$.

Figure 3.8 Dependencies of $\beta_t(\mu)$, $\gamma_t(\mu)$, $\eta_t(\mu)$ on $\mu$ for $S_t = 1$ dB, $k = 40$, $R = 12.5^2$.

Theorem 3.2 (the existence theorem of overshoot). There exist $\underline\mu > 0$ and $\bar\mu > 0$ such that $\mu < \underline\mu$ implies

$$\beta_t(\mu) < 1, \quad \tilde\beta_t(\mu) = \tilde\eta_t(\mu) + \tilde\gamma_t(\mu) < 1, \quad \bar\beta_t(\mu) = \bar\eta_t(\mu) + \bar\gamma_t(\mu) < 1, \quad t = 1, 2, \ldots, N,$$

and $\mu > \bar\mu$ implies

$$\beta_t(\mu) > 1, \quad \tilde\beta_t(\mu) = \tilde\eta_t(\mu) + \tilde\gamma_t(\mu) > 1, \quad \bar\beta_t(\mu) = \bar\eta_t(\mu) + \bar\gamma_t(\mu) > 1, \quad t = 1, 2, \ldots, N.$$

Proof. Let us first prove the statement for $\bar\beta_t(\mu)$. Since $\bar\beta_0(\mu) = 1$, it is necessary to show that there exist $\underline\mu > 0$, $\bar\mu > 0$ such that $\partial\bar\beta_t(\mu)/\partial\mu < 0$ for $\mu < \underline\mu$ and $\partial\bar\beta_t(\mu)/\partial\mu > 0$ for $\mu > \bar\mu$, $t = 1, 2, \ldots, tr$. Using Eqs. (3.17) and (3.20) gives

$$\begin{aligned}\partial\bar\beta_t(\mu)/\partial\mu &= \partial\bar\eta_t(\mu)/\partial\mu + \partial\bar\gamma_t(\mu)/\partial\mu\\ &= -2\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3\mu^3}/\|\alpha\|^2 + 2\lambda^{-2t}R\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3}/(\mu^2\|\alpha\|^2)\\ &= 2\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3}\left[\lambda^{-2t}R - \tilde\alpha_i^2(t)/\mu\right]/(\mu^2\|\alpha\|^2).\end{aligned}$$

Whence it follows that $\partial\bar\beta_t(\mu)/\partial\mu < 0$ and $\partial\bar\beta_t(\mu)/\partial\mu > 0$ if the values $\underline\mu$ and $\bar\mu$ are defined from the conditions

$$\underline\mu < \lambda^{2N}\alpha_{\min}^2/R, \quad \bar\mu > \lambda^2\alpha_{\max}^2/R, \tag{3.28}$$

where $\alpha_{\max}^2 = \max\{\tilde\alpha_i^2(t),\ i = 1, 2, \ldots, r\}$, $\alpha_{\min}^2 = \min\{\tilde\alpha_i^2(t),\ i = 1, 2, \ldots, r\}$, $t = 1, 2, \ldots, N$. Using Eqs. (3.18) and (3.21) gives

$$\begin{aligned}\partial\tilde\beta_t(\mu)/\partial\mu &= \partial\tilde\eta_t(\mu)/\partial\mu + \partial\tilde\gamma_t(\mu)/\partial\mu\\ &= -2\lambda^{2t}\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3\mu^3}/\|\alpha\|^2 + 2R\lambda^{2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3}/(\mu^2\|\alpha\|^2)\\ &= 2\lambda^{2t}\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^3\mu^3}\left[R\mu - \tilde\alpha_i^2(t)\right]/\|\alpha\|^2.\end{aligned}$$

Whence it follows that the choice of $\underline\mu$, $\bar\mu$ using the conditions

$$\underline\mu < \alpha_{\min}^2/R, \quad \bar\mu > \alpha_{\max}^2/R \tag{3.29}$$

guarantees in this case the validity of the inequality $\partial\tilde\beta_t(\mu)/\partial\mu < 0$ for $\mu < \underline\mu$ and of the inequality $\partial\tilde\beta_t(\mu)/\partial\mu > 0$ for $\mu > \bar\mu$ and any $t = 1, 2, \ldots, N$. Since $\tilde\beta_t(\mu) \le \beta_t(\mu) \le \bar\beta_t(\mu)$, for $\mu \le \underline\mu = \lambda^{2N}\alpha_{\min}^2/R$ the inequality $\beta_t(\mu) < 1$ is fulfilled and for $\mu \ge \bar\mu = \alpha_{\max}^2/R$ the inequality $\beta_t(\mu) > 1$ is fulfilled.

The values $\alpha$ and $R$ are unknown and, therefore, the conditions Eqs. (3.28) and (3.29) do not allow one to select the value of $\mu$ so that there is no overshoot. However, signal processing problems often specify the signal to noise ratio $S_t = \|C_t\|^2\|\alpha\|^2/R$ or its lower and upper bounds. We show now that $S_t$ can be used for the analysis of the overshoot. Let $A$, $B$ be nonnegative definite matrices. Then, since

$$\lambda_{\min}(A)B \le AB \le \lambda_{\max}(A)B,$$

where $\lambda_{\max}(A)$, $\lambda_{\min}(A)$ are the maximum and minimum eigenvalues of the matrix $A$, taking into account Eq. (3.13) we find

$$\Psi_t(\mu) \ge \lambda^{2t}\alpha\alpha^T/(\lambda_{t,\max}(M_t)\mu)^2 + \Pi_t(\mu),$$

$$\Psi_t(\mu) \le \lambda^{2t}\alpha\alpha^T/(\lambda_{t,\min}(M_t)\mu)^2 + \Pi_t(\mu),$$

where $\lambda_{t,\max}(M_t)$, $\lambda_{t,\min}(M_t)$ are the maximum and minimum eigenvalues of the matrix $M_t$. This implies the estimate

$$\beta_{t,\min}(\mu) \le \beta_t(\mu) \le \beta_{t,\max}(\mu), \quad t = 1, 2, \ldots, N, \tag{3.30}$$

where

$$\beta_{t,\min}(\mu) = \lambda^{2t}/(\lambda_{t,\max}(M_t)\mu)^2 + \pi_t(\mu), \quad \beta_{t,\max}(\mu) = \lambda^{2t}/(\lambda_{t,\min}(M_t)\mu)^2 + \pi_t(\mu),$$

$$\pi_t(\mu) = \|C_t\|^2\,\mathrm{trace}\left(\sum_{k=1}^{t}\lambda^{2(t-k)}C_k^TC_kM_t^{-2}\right)/S_t. \tag{3.31}$$

Example 3.2. We illustrate the use of Eq. (3.30) in the problem of estimating the harmonic amplitudes in the alternating current signal model from the previous example. Figs. 3.9–3.11 show the dependencies of $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$ on $\mu$ at $t = 11$ for low, medium, and large values of the signal to noise ratio, $S_t = 1$ dB, $S_t = 10.5$ dB, and $S_t = 21.7$ dB, respectively, obtained from the results of statistical modeling (curve 2) and calculated by means of Eq. (3.30) (curves 1 and 3, respectively).

Figure 3.9 Dependencies of $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$ on $\mu$, $t = 11$, $S_t = 1$ dB.

Figure 3.10 Dependencies of $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$ on $\mu$, $t = 11$, $S_t = 10.5$ dB.

Figure 3.11 Dependencies of $\beta_t(\mu)$, $\beta_{t,\min}(\mu)$, $\beta_{t,\max}(\mu)$ on $\mu$, $t = 11$, $S_t = 21.7$ dB.

Using the obtained results it is easy to evaluate the effect of the forgetting factor on the upper and lower bounds of $\beta_t(\mu)$ in the transitional stage. From Eq. (3.28) it is clear that introducing it into the quality criterion could require a substantial reduction of $\mu$ if the order of the model $r$ is quite high. From Eqs. (3.17) and (3.20) it follows that $\bar\eta_t(\mu)$ does not depend on $\lambda$ and that $\bar\gamma_t(\mu)$ increases in $t$ for $\lambda < 1$. The lower bounds of $\beta_t(\mu)$ show the opposite behavior: for $\lambda < 1$ they decrease as $t$ increases, as follows from Eqs. (3.18) and (3.21). However, these bounds can be rather rough. Indeed, in Section 2.2 it was shown that for the diffuse initialization the least squares method (LSM) estimate may not depend on $\lambda$. A more subtle result is contained in Theorem 3.3. Note also that from the above calculations for the case $\lambda = 1$ we have

$$\eta_t(\mu) = \left[\sum_{i=1}^{t}\frac{\tilde\alpha_i^2(t)}{(\lambda_t(i) + 1/\mu)^2\mu^2} + \sum_{i=t+1}^{r}\tilde\alpha_i^2(t)\right]/\|\alpha\|^2, \tag{3.32}$$

$$\gamma_t(\mu) = R\sum_{i=1}^{t}\frac{\lambda_t(i)}{(\lambda_t(i) + 1/\mu)^2}/\|\alpha\|^2. \tag{3.33}$$

Theorem 3.3. Let us introduce the notations $\eta_t^{1-\varepsilon}(\mu)$, $\gamma_t^{1-\varepsilon}(\mu)$ for $\eta_t(\mu)$, $\gamma_t(\mu)$, respectively, where $\lambda = 1 - \varepsilon > 0$, $\varepsilon \ge 0$. Then there is $\varepsilon^*$ such that for $\varepsilon \in (0, \varepsilon^*)$ the following inequalities hold:
1. $\eta_t^{1-\varepsilon}(\mu) \le \eta_t^1(\mu)$, $\gamma_t^{1-\varepsilon}(\mu) \ge \gamma_t^1(\mu)$, $t = 1, 2, \ldots, tr - 1$,
2. $\eta_t^{1-\varepsilon}(\mu) < \eta_t^1(\mu)$, $\gamma_t^{1-\varepsilon}(\mu) > \gamma_t^1(\mu)$, $t = tr, tr + 1, \ldots, N$,

where $N > tr = \min_t\left\{t : \sum_{k=1}^{t}C_k^TC_k > 0,\ t = 1, 2, \ldots, N\right\}$.

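Proof.
1. Since $\eta_t^{1-\varepsilon}(\mu)$, $\gamma_t^{1-\varepsilon}(\mu)$ have continuous partial derivatives with respect to $\varepsilon$ for $\varepsilon \in [0, 1)$, $\mu > 0$, $t = 1, 2, \ldots, N$, it is sufficient to show that

$$\frac{\partial}{\partial\varepsilon}\eta_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} \le 0, \quad \frac{\partial}{\partial\varepsilon}\gamma_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} \ge 0.$$

We have for $\eta_t^{1-\varepsilon}(\mu)$:

$$\eta_t^{1-\varepsilon}(\mu) = (1-\varepsilon)^{2t}\alpha^TM_t^{-2}\alpha/(\mu^2\|\alpha\|^2) = \alpha^T\hat M_t^{-2}\alpha/(\mu^2\|\alpha\|^2),$$

where $\hat M_t = \hat M_t(\varepsilon) = \sum_{k=1}^{t}(1-\varepsilon)^{-2k}C_k^TC_k + I_r/\mu$. Since

$$\frac{\partial}{\partial\varepsilon}\hat M_t^{-2} = -2\hat M_t^{-1}\frac{\partial\hat M_t}{\partial\varepsilon}\hat M_t^{-1} = -4\hat M_t^{-1}\sum_{k=1}^{t}k(1-\varepsilon)^{-2k-1}C_k^TC_k\hat M_t^{-1},$$

then

$$\frac{\partial}{\partial\varepsilon}\eta_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} = -4\alpha^TM_t^{-1}(0)\sum_{k=1}^{t}kC_k^TC_kM_t^{-1}(0)\alpha/(\mu^2\|\alpha\|^2) \le 0. \tag{3.34}$$

We have for $\gamma_t^{1-\varepsilon}(\mu)$:

$$\gamma_t^{1-\varepsilon}(\mu) = R\,\mathrm{trace}\left(M_t^{-1}\sum_{k=1}^{t}(1-\varepsilon)^{2(t-k)}C_k^TC_kM_t^{-1}\right)/\|\alpha\|^2 = R\,\mathrm{trace}\left(\hat M_t^{-1}\sum_{k=1}^{t}(1-\varepsilon)^{-2k}C_k^TC_k\hat M_t^{-1}\right)/\|\alpha\|^2.$$

We have

$$\frac{\partial}{\partial\varepsilon}\left(\hat M_t^{-1}\sum_{k=1}^{t}(1-\varepsilon)^{-2k}C_k^TC_k\hat M_t^{-1}\right)\Big|_{\varepsilon=0} = I_1 + I_2 + I_3,$$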


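where

$$\begin{aligned}I_1 &= \frac{\partial}{\partial\varepsilon}(\hat M_t^{-1})\sum_{k=1}^{t}(1-\varepsilon)^{-2k}C_k^TC_k\hat M_t^{-1}\Big|_{\varepsilon=0} = -\hat M_t^{-1}(0)\sum_{k=1}^{t}kC_k^TC_k\hat M_t^{-1}(0)\sum_{k=1}^{t}C_k^TC_k\hat M_t^{-1}(0)\\ &= -\hat M_t^{-1}(0)\left(\sum_{k=1}^{t}kC_k^TC_k - \sum_{k=1}^{t}kC_k^TC_k\hat M_t^{-1}(0)/\mu\right)\hat M_t^{-1}(0),\end{aligned}$$

$$I_2 = 2\hat M_t^{-1}\sum_{k=1}^{t}k(1-\varepsilon)^{-2k-1}C_k^TC_k\hat M_t^{-1}\Big|_{\varepsilon=0} = 2\hat M_t^{-1}(0)\sum_{k=1}^{t}kC_k^TC_k\hat M_t^{-1}(0),$$

$$I_3 = \hat M_t^{-1}\sum_{k=1}^{t}(1-\varepsilon)^{-2k}C_k^TC_k\frac{\partial}{\partial\varepsilon}(\hat M_t^{-1})\Big|_{\varepsilon=0} = I_1^T,$$

$$I_1 + I_2 + I_3 = \hat M_t^{-1}(0)\left(\hat M_t^{-1}(0)\sum_{k=1}^{t}kC_k^TC_k + \sum_{k=1}^{t}kC_k^TC_k\hat M_t^{-1}(0)\right)\hat M_t^{-1}(0)/\mu.$$

It follows from this that

$$\frac{\partial}{\partial\varepsilon}\gamma_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} = 2R\,\mathrm{trace}\left(\sum_{k=1}^{t}kC_k^TC_kM_t^{-3}(0)\right)/(\mu\|\alpha\|^2) \ge 0. \tag{3.35}$$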

2. The statement follows from the facts that $\sum_{k=1}^{t}kC_k^TC_k \ge \sum_{k=1}^{t}C_k^TC_k$ and that at $t = tr$ the inequalities Eqs. (3.34) and (3.35) become strict.

Note the differences and the similarities between the formulation and results of the considered problem and those of ridge regression [65]. The differences are:
1. The matrix $\sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k$ for all $t = 1, 2, \ldots, tr - 1$ is singular rather than ill-conditioned.
2. The choice of $\mu$ should ensure the condition $\beta_t(\mu) < 1$ for all $t = 1, 2, \ldots, tr$, rather than a reduction in the distance between the estimate and the unknown parameter for a fixed sample size.

The similarities are:
1. A priori $\alpha$ is not known, so it is unclear how to select $\mu$ before solving the estimation problem.
2. The function $\beta_t(\mu)$ describing the overshoot decomposes into two components, proportional to the square of the bias norm and to the sum of the diagonal elements of the covariance matrix of the estimate. Their behavior with respect to $\mu$ is the same as in ridge regression.
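The link to ridge regression can be made explicit for $\lambda = 1$. The short derivation below is a worked step added for illustration; it uses only Eqs. (3.1)–(3.3) with $\alpha_0 = 0$, and the identification of $1/\mu$ with the ridge parameter is the usual textbook correspondence rather than a statement taken from this section.

```latex
% From (3.1)-(3.3) with \lambda = 1: M_t \alpha_t = M_{t-1}\alpha_{t-1} + C_t^T y_t,
% so with \alpha_0 = 0 and M_0 = I_r/\mu,
\alpha_t = \Bigl(\sum_{k=1}^{t} C_k^T C_k + I_r/\mu\Bigr)^{-1} \sum_{k=1}^{t} C_k^T y_k
         = \arg\min_{\alpha}\Bigl\{\sum_{k=1}^{t} (y_k - C_k\alpha)^2 + \|\alpha\|^2/\mu\Bigr\}.
% This is the ridge estimator with ridge parameter 1/\mu:
% small \mu means strong shrinkage (large bias term \eta_t, small variance term \gamma_t),
% large \mu means weak shrinkage (small \eta_t, large \gamma_t), in line with Theorem 3.1.
```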

3.3 FLUCTUATIONS OF ESTIMATES UNDER SOFT INITIALIZATION WITH LARGE PARAMETERS

In this section we continue the study of the RLSM estimate fluctuations, assuming that $\mu$ is a large positive parameter. First we obtain asymptotic formulas derived from Eq. (2.18) and use them to analyze the RLSM estimate fluctuations. It will be convenient to formulate the results in the form of two separate theorems.

Theorem 3.4. The functions $a_t(\mu)$, $\eta_t(\mu)$, $\Pi_t(\mu)$, $\gamma_t(\mu)$ can be expanded in the power series

$$a_t(\mu) = E[e_t] = a_t^0 + a_t^1/\mu + \sum_{i=1}^{\infty}(-1)^{i+1}\lambda^{ti}(W_t^+)^{i+1}\mu^{-i-1}\alpha, \tag{3.36}$$

$$\eta_t(\mu) = \eta_t^0 + \lambda^{2t}\sum_{i=1}^{\infty}(-1)^i(i+1)\lambda^{ti}\alpha^T(W_t^+)^{i+2}\alpha/\|\alpha\|^2\mu^{-i-2}, \tag{3.37}$$

$$\Pi_t(\mu) = \Pi_t^0 + \Pi_t^1/\mu + R\sum_{i=1}^{\infty}(-1)^i(i+3)(W_t^+)^{i+3}\mu^{-i-2}, \quad \lambda = 1, \tag{3.38}$$

$$\gamma_t(\mu) = \gamma_t^0 + \gamma_t^1/\mu + R\sum_{i=1}^{\infty}(-1)^i(i+3)\,\mathrm{trace}((W_t^+)^{i+3})/\|\alpha\|^2\mu^{-i-2}, \quad \lambda = 1, \tag{3.39}$$

which converge uniformly in $t \in T = \{1, 2, \ldots, N\}$ for bounded $T$ and sufficiently large values of $\mu$, where

$$a_t^0 = -(I_r - W_tW_t^+)\alpha, \quad a_t^1 = \lambda^tW_t^+\alpha, \quad \eta_t^0 = \alpha^T(I_r - W_tW_t^+)\alpha/\|\alpha\|^2,$$

$$\Pi_t^0 = RW_t^+/\|\alpha\|^2, \quad \Pi_t^1 = -2R(W_t^+)^2/\|\alpha\|^2, \quad \gamma_t^0 = R\,\mathrm{trace}(W_t^+)/\|\alpha\|^2,$$

$$\gamma_t^1 = -2R\,\mathrm{trace}(W_t^+)^2/\|\alpha\|^2, \quad W_t = \sum_{k=1}^{t}\lambda^{t-k}C_k^TC_k.$$

Proof. The expansion Eq. (3.36) follows from Eqs. (2.12) and (2.18). Let us prove Eq. (3.37). Using Eq. (2.18) gives

$$M_t^{-1} = \lambda^{-t}(I_r - W_tW_t^+)\mu + \sum_{i=0}^{\infty}(-1)^i\lambda^{ti}(W_t^+)^{i+1}\mu^{-i} = I_1 + I_2,$$

$$M_t^{-2} = (I_1 + I_2)^2 = I_1^2 + I_1I_2 + I_2I_1 + I_2^2.$$

Let us simplify the terms in the right-hand side of the expression for $M_t^{-2}$. We use the following pseudoinversion properties:
1. The matrix $(I_r - W_tW_t^+)$ is idempotent and $W_t^+W_tW_t^+ = W_t^+$, $W_tW_t^+W_t = W_t$.
2. For any symmetric matrix $A$ the identity $AA^+ = A^+A$ is valid and therefore

$$W_t^+(I_r - W_tW_t^+) = 0, \quad (I_r - W_t^+W_t)W_t^+ = (I_r - W_tW_t^+)W_t^+ = 0.$$

It follows from this that

$$I_1^2 = \lambda^{-2t}(I_r - W_tW_t^+)\mu^2, \quad I_1I_2 = 0, \quad I_2I_1 = 0. \tag{3.40}$$

With the help of the Cauchy formula for the product of two series [66]

$$\left(\sum_{k=0}^{\infty}a_k\right)\left(\sum_{k=0}^{\infty}b_k\right) = \sum_{n=0}^{\infty}\sum_{k=0}^{n}a_kb_{n-k}, \tag{3.41}$$

setting $a_k = b_k = (-1)^k\lambda^{tk}(W_t^+)^{k+1}\mu^{-k}$, we find

$$I_2^2 = \sum_{i=0}^{\infty}(-1)^i(i+1)\lambda^{ti}(W_t^+)^{i+2}\mu^{-i}.$$

From this, taking into account Eq. (3.40), we obtain a power series that converges uniformly in $t = 1, 2, \ldots, N$ for sufficiently large values of $\mu$:

$$M_t^{-2} = \lambda^{-2t}(I_r - W_tW_t^+)\mu^2 + \sum_{i=0}^{\infty}(-1)^i(i+1)\lambda^{ti}(W_t^+)^{i+2}\mu^{-i}. \tag{3.42}$$

Substitution of Eq. (3.42) in $\eta_t(\mu) = \lambda^{2t}\alpha^TM_t^{-2}\alpha/(\mu^2\|\alpha\|^2)$ gives Eq. (3.37). Let us prove Eqs. (3.38) and (3.39). As $W_tW_t^+W_t = W_t$, then

$$M_t^{-1}W_t = (I_r - W_tW_t^+)W_t\mu + \sum_{i=0}^{\infty}(-1)^i(W_t^+)^{i+1}W_t\mu^{-i} = \sum_{i=0}^{\infty}(-1)^i(W_t^+)^{i+1}W_t\mu^{-i}.$$

Putting in Eq. (3.41) $a_k = (-1)^k(W_t^+)^{k+1}W_t\mu^{-k}$, $b_k = (-1)^k(W_t^+)^{k+1}\mu^{-k}$ and using the identity $W_t^+W_tW_t^+ = W_t^+$, we obtain a power series that converges uniformly in $t = 1, 2, \ldots, N$ for sufficiently large values of $\mu$:

$$M_t^{-1}W_tM_t^{-1} = \sum_{i=0}^{\infty}(-1)^i(i+1)(W_t^+)^{i+1}\mu^{-i} = W_t^+ - 2(W_t^+)^2/\mu + \sum_{i=0}^{\infty}(-1)^i(i+3)(W_t^+)^{i+3}\mu^{-i-2}, \tag{3.43}$$

which implies Eqs. (3.38) and (3.39).

Theorem 3.5. The asymptotic representation

$$\beta_t(\mu) = \eta_t^0 + \gamma_t^0 + \gamma_t^1/\mu + O(\mu^{-2}), \quad \mu \to \infty, \tag{3.44}$$

is valid for $t = 1, 2, \ldots, tr$, where

$$\eta_t^0 = \lambda^{-2t}\alpha^T(I_r - W_tW_t^+)\alpha/\|\alpha\|^2, \quad \gamma_t^0 = R\,\mathrm{trace}(V_t^+)/\|\alpha\|^2, \quad \gamma_t^1 = -2\lambda^tR\,\mathrm{trace}(W_t^+V_t^+)/\|\alpha\|^2,$$

$$tr = \min_t\{t : V_t > 0,\ t = 1, 2, \ldots, N\}, \quad V_t = \sum_{k=1}^{t}C_k^TC_k.$$

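Proof. Let us present Eq. (2.18) in the following form:

$$M_t^{-1} = \lambda^{-t}(I_r - W_tW_t^+)\mu + \Lambda_tW_t^+, \tag{3.45}$$

where

$$\Lambda_t = \sum_{i=0}^{\infty}(-1)^i\lambda^{ti}(W_t^+)^i\mu^{-i}.$$

Using this expression gives

$$M_t^{-1}\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \lambda^{-t}(I_r - W_tW_t^+)\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s\mu + \Lambda_tW_t^+\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s.$$

With the help of Lemma 2.2 we establish for $\Xi_t = W_t$, $F_t = \lambda^{-t/2}C_t$ that

$$(I_r - W_tW_t^+)C_s^T = 0, \quad s = 1, 2, \ldots, t.$$

We have

$$W_t = \sum_{s=1}^{t}\lambda^{t-s}C_s^TC_s = \lambda^t\bar C_t^TR_t^{-1}\bar C_t = \lambda^t\tilde C_t^T\tilde C_t, \quad W_t^+ = \lambda^{-t}(\tilde C_t^T\tilde C_t)^+,$$

$$\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \lambda^t\tilde C_t^TR_t^{-1/2}\tilde\xi_t,$$

where

$$\tilde C_t = (C_1^T\lambda^{-1/2}, C_2^T\lambda^{-1}, \ldots, C_t^T\lambda^{-t/2})^T \in R^{t\times r}, \quad \tilde\xi_t = (\xi_1, \xi_2, \ldots, \xi_t)^T \in R^t,$$

$$R_t = \mathrm{diag}(\lambda, \lambda^2, \ldots, \lambda^t) \in R^{t\times t}, \quad \bar C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T.$$

Whence it follows that

$$M_t^{-1}\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \Lambda_tW_t^+\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \Lambda_t\tilde C_t^+R_t^{-1/2}\tilde\xi_t = \Lambda_t\bar C_t^+\tilde\xi_t, \quad t = 1, 2, \ldots, r. \tag{3.46}$$

As

$$\bar C_t^+(\bar C_t^+)^T = (\bar C_t^T\bar C_t)^+, \quad \Lambda_t = I_r - \lambda^tW_t^+/\mu + O(1/\mu^2),$$

then

$$\begin{aligned}E\left[M_t^{-1}\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s\sum_{j=1}^{t}\lambda^{t-j}\xi_jC_jM_t^{-1}\right] &= R\Lambda_t\bar C_t^+(\bar C_t^+)^T\Lambda_t = R\Lambda_tV_t^+\Lambda_t\\ &= RV_t^+ - \lambda^tRW_t^+V_t^+/\mu - \lambda^tRV_t^+W_t^+/\mu + O(1/\mu^2);\end{aligned}$$

this implies Eq. (3.44).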
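The series for $M_t^{-1}$ used in the proofs of Theorems 3.4 and 3.5 can be checked numerically in a case where $W_t$ is singular. The sketch below is an illustration under assumptions: the two regressor rows, the values of `lam` and `mu`, and the truncation at 12 terms are chosen here for the example, and the series is taken in the form $\lambda^{-t}(I_r - W_tW_t^+)\mu + \sum_{i\ge 0}(-1)^i\lambda^{ti}(W_t^+)^{i+1}\mu^{-i}$ quoted in the proof of Theorem 3.4.

```python
import numpy as np

# Two observations of a four-parameter model: t = 2 < r = 4, so W_t is singular.
C = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
lam, mu = 0.9, 50.0
t, r = C.shape

w = lam ** (t - 1 - np.arange(t))               # lam^(t-k), k = 1..t
W = (C.T * w) @ C                               # W_t
M = W + lam**t * np.eye(r) / mu                 # M_t
Wp = np.linalg.pinv(W, 1e-10)                   # pseudoinverse W_t^+

# Leading (diffuse) term plus a truncated power series in 1/mu.
series = lam**(-t) * (np.eye(r) - W @ Wp) * mu
term = Wp.copy()                                # (W^+)^(i+1), starting at i = 0
for i in range(12):
    series += (-1)**i * lam**(t * i) * term / mu**i
    term = term @ Wp

err = np.abs(np.linalg.inv(M) - series).max()
print(err)
```

The term proportional to $\mu$ acts only on the null space of $W_t$, which is why it disappears once $W_t$ becomes nonsingular at $t = tr$.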


Let us establish some important consequences arising from Theorems 3.4 and 3.5. Keeping in Eq. (3.44) the terms of zero order with respect to $1/\mu$, we obtain

$$\beta_t(\mu) = \eta_t^0 + \gamma_t^0 + O(\mu^{-1}), \quad \mu \to \infty, \tag{3.47}$$

for $t = 1, 2, \ldots, N$, where

$$\eta_t^0 = \lambda^{-2t}\alpha^T(I_r - W_tW_t^+)\alpha/\|\alpha\|^2, \quad \gamma_t^0 = R\,\mathrm{trace}(V_t^+)/\|\alpha\|^2.$$

Since the matrix $I_r - W_tW_t^+$ is nonnegative definite and idempotent, from Eq. (3.47) we find, under the assumption that $C_t$ is not a linear combination of $C_1, \ldots, C_{t-1}$, $t = 2, 3, \ldots, tr - 1$,

$$\beta_t(\mu) \le 1 + \|C_t\|^2\,\mathrm{trace}(V_t^+)/S_t + O(\mu^{-1}) = 1 + \|C_t\|^2\sum_{i=1}^{t}\lambda_t^{-1}(i)/S_t + O(\mu^{-1}),$$

where $\lambda_t(i)$, $i = 1, 2, \ldots, r$, are the eigenvalues of the matrix $V_t$. This implies that the RLSM overshoot is bounded for arbitrarily large values of $\mu$ and $t = 1, 2, \ldots, tr$. As $\mu$ enters Eq. (3.3) singularly, this result is not obvious even though a finite interval of observation is considered. At the same time, eigenvalues close to zero can lead to arbitrarily large values of the overshoot, which can persist over times significantly exceeding $t = tr$.

Example 3.3. Consider the problem of the neural network (NN) training from Example 2.5. Fig. 3.12 shows the dependencies of $\beta_t$ on the sample number for the realizations of the NN with 10 neurons.

Figure 3.12 Dependency of $\beta_t(\mu)$ on $t$.

From Eq. (3.44) the necessary and sufficient condition for the absence of the overshoot in the zero approximation at the end of the transitional phase follows:

$$S_{tr} \ge \|C_{tr}\|^2\,\mathrm{trace}(V_{tr}^{-1}). \tag{3.48}$$

Example 3.4. We illustrate the use of the criterion Eq. (3.48) for the absence of the overshoot at the end point of the transitional phase for a large signal-to-noise ratio, using the signal model of Example 3.1. Consider two cases: $S_t = 43.9$ dB and $S_t = 44$ dB. Since the value of the right-hand side of Eq. (3.48) is equal to 43.9 dB, in the first case the overshoot should occur, and in the second case it should be absent for any $\mu > 0$. Fig. 3.13 shows the simulation results supporting the calculations when $\mu = 10^7$, in curves 1 and 2, respectively.

Introduce the notation $\beta_t^{1-\varepsilon}(\mu)$ for the function $\beta_t(\mu) = \eta_t^0 + \gamma_t^0 + \gamma_t^1/\mu$ that takes into account the first order of smallness with respect to $1/\mu$ for $\lambda = 1 - \varepsilon > 0$, $\varepsilon \ge 0$. Let us analyze the sensitivity of $\beta_t(\mu)$ to $\lambda$ at the end of the transitional phase. We have

$$\beta_t^{1-\varepsilon}(\mu) = \big[R\,\mathrm{trace}(V_{tr}^{-1}) - 2\lambda^{tr}R\,\mathrm{trace}(W_{tr}^{-1}V_{tr}^{-1})/\mu\big]/\|\alpha\|^2$$

$$= \Big[R\,\mathrm{trace}(V_{tr}^{-1}) - 2R\,\mathrm{trace}\Big(\Big(\sum_{k=1}^{tr}(1-\varepsilon)^{-k}C_k^TC_k\Big)^{-1}V_{tr}^{-1}\Big)/\mu\Big]/\|\alpha\|^2. \tag{3.49}$$

Figure 3.13 Dependency of $\beta_t(\mu)$ on $t$ for $\mu = 10^7$, $S_t = 44$ dB (1), $S_t = 43.9$ dB (2). The data tips in the plot mark $\beta_t = 1.263$ and $\beta_t = 0.9778$ at $t = 11$.

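It follows from this that

$$\frac{\partial}{\partial\varepsilon}\beta_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} = 2\|C_{tr}\|^2\,\mathrm{trace}\Big(\sum_{k=1}^{tr}kC_k^TC_kV_{tr}^{-3}\Big)/(S_{tr}\mu) > 0. \tag{3.50}$$

Taking into account that

$$V_{tr} \le \sum_{k=1}^{tr}kC_k^TC_k \le tr\,V_{tr},$$

the two-sided estimate

$$\chi \le \frac{\partial}{\partial\varepsilon}\beta_t^{1-\varepsilon}(\mu)\Big|_{\varepsilon=0} \le tr\,\chi \tag{3.51}$$

follows from Eq. (3.50), where $\chi = 2\|C_{tr}\|^2\,\mathrm{trace}(V_{tr}^{-2})/(S_{tr}\mu) > 0$.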

3.4 FLUCTUATIONS UNDER DIFFUSE INITIALIZATION

Let us study the fluctuation behavior of the diffuse algorithm, which is described by the expressions

$$\alpha_t^{dif} = \alpha_{t-1}^{dif} + K_t^{dif}(y_t - C_t\alpha_{t-1}^{dif}), \quad \alpha_0^{dif} = 0, \tag{3.52}$$

$$K_t^{dif} = W_t^+C_t^T, \tag{3.53}$$

$$W_t = \lambda W_{t-1} + C_t^TC_t, \quad W_0 = 0_{r\times r}, \quad t = 1, 2, \ldots, N. \tag{3.54}$$
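The recursion (3.52)-(3.54) can be tried out on a toy problem. The sketch below is only an illustration under stated assumptions: $r = 2$, scalar outputs, noise-free data, and a hand-written pseudoinverse for symmetric nonnegative definite $2\times 2$ matrices. The names `pinv_sym2` and `diffuse_rls` are hypothetical and do not come from the book.

```python
# Sketch of the diffuse recursion (3.52)-(3.54) for r = 2 and scalar outputs.
# pinv_sym2 and diffuse_rls are hypothetical helper names (assumptions).

def pinv_sym2(w, tol=1e-12):
    """Pseudoinverse of a symmetric nonnegative definite 2x2 matrix."""
    (a, b), (_, d) = w
    det, tr = a * d - b * b, a + d
    if tr <= tol:                       # zero matrix
        return [[0.0, 0.0], [0.0, 0.0]]
    if det > tol * tr * tr:             # full rank: ordinary inverse
        return [[d / det, -b / det], [-b / det, a / det]]
    s = tr * tr                         # rank one: W = sigma*u*u^T with sigma = trace(W)
    return [[a / s, b / s], [b / s, d / s]]


def diffuse_rls(rows, ys, lam=1.0):
    """Return the estimates alpha_1, ..., alpha_N for regressor rows C_t."""
    w = [[0.0, 0.0], [0.0, 0.0]]        # W_0 = 0
    alpha = [0.0, 0.0]                  # alpha_0 = 0
    history = []
    for c, y in zip(rows, ys):
        # W_t = lam * W_{t-1} + C_t^T C_t
        w = [[lam * w[i][j] + c[i] * c[j] for j in range(2)] for i in range(2)]
        wp = pinv_sym2(w)
        # K_t = W_t^+ C_t^T
        k = [wp[i][0] * c[0] + wp[i][1] * c[1] for i in range(2)]
        resid = y - (c[0] * alpha[0] + c[1] * alpha[1])
        alpha = [alpha[i] + k[i] * resid for i in range(2)]
        history.append(alpha)
    return history


true_alpha = [0.5, -1.5]
rows = [[1.0, 2.0], [3.0, -1.0], [2.0, 2.0]]
ys = [c[0] * true_alpha[0] + c[1] * true_alpha[1] for c in rows]  # noise-free data
estimates = diffuse_rls(rows, ys, lam=0.9)
```

With these data the first estimate is the minimum-norm fit to the first observation, and from the second step on (here $tr = 2$) the noise-free parameters are recovered up to rounding, without choosing any large initialization parameter.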

Theorem 3.6. The representations

$$a_t = E[e_t] = -(I_r - W_tW_t^+)\alpha, \tag{3.55}$$

$$\beta_t = E[e_t^Te_t]/\|\alpha\|^2 = \eta_t + \gamma_t, \quad t = 1, 2, \ldots, N, \tag{3.56}$$

are valid, where

$$e_t = \alpha_t^{dif} - \alpha, \quad \eta_t = \lambda^{-2t}\alpha^T(I_r - W_tW_t^+)\alpha/\|\alpha\|^2, \quad \gamma_t = R\,\mathrm{trace}(V_t^+)/\|\alpha\|^2.$$

Proof. Since the bias $a_t$ satisfies the system of equations

$$a_t = (I_r - K_t^{dif}C_t)a_{t-1}, \quad a_0 = -\alpha, \quad t = 1, 2, \ldots, N,$$

Lemma 2.4 gives the expression for its transition matrix

$$X_{t,0} = I_r - W_t^+W_t, \quad t = 1, 2, \ldots, N,$$

which implies Eq. (3.55). Iterating the system of equations for the estimation error $e_t = \alpha_t - \alpha$,

$$e_t = (I_r - K_t^{dif}C_t)e_{t-1} + K_t^{dif}\xi_t, \quad e_0 = -\alpha, \quad t = 1, 2, \ldots, N,$$

we find

$$e_t = -X_{t,0}\alpha + \sum_{s=1}^{t}X_{t,s}K_s^{dif}\xi_s, \quad t = 1, 2, \ldots, N, \tag{3.57}$$

where $X_{t,s}$ is the transition matrix of the homogeneous system

$$x_t = (I_r - K_t^{dif}C_t)x_{t-1}, \quad t = 1, 2, \ldots, N.$$

Substitution of $X_{t,0} = I_r - W_tW_t^+$, $X_{t,s}K_s^{dif} = \lambda^{t-s}W_t^+C_s^T$ in Eq. (3.57) gives

$$e_t = -(I_r - W_tW_t^+)\alpha + W_t^+\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s, \quad t = 1, 2, \ldots, N. \tag{3.58}$$

Since

$$W_t = \sum_{s=1}^{t}\lambda^{t-s}C_s^TC_s = \lambda^t\bar C_t^TR_t^{-1}\bar C_t = \lambda^t\tilde C_t^T\tilde C_t, \quad W_t^+ = \lambda^{-t}(\tilde C_t^T\tilde C_t)^+, \quad \sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \lambda^t\tilde C_t^TR_t^{-1/2}\tilde\xi_t,$$

where

$$\tilde C_t = (C_1^T\lambda^{-1/2}, C_2^T\lambda^{-1}, \ldots, C_t^T\lambda^{-t/2})^T \in R^{t\times r}, \quad R_t = \mathrm{diag}(\lambda, \lambda^2, \ldots, \lambda^t) \in R^{t\times t},$$

$$\tilde\xi_t = (\xi_1, \xi_2, \ldots, \xi_t)^T \in R^t, \quad \bar C_t = (C_1^T, C_2^T, \ldots, C_t^T)^T,$$

then

$$W_t^+\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s = \tilde C_t^+R_t^{-1/2}\tilde\xi_t = \bar C_t^+\tilde\xi_t,$$

$$E\Big[W_t^+\sum_{s=1}^{t}\lambda^{t-s}C_s^T\xi_s\sum_{j=1}^{t}\lambda^{t-j}\xi_jC_jW_t^+\Big] = R\bar C_t^+(\bar C_t^+)^T = RV_t^+, \quad t = 1, 2, \ldots, N.$$

This implies Eq. (3.56).

Consider the properties of $\beta_t$, $\gamma_t$, $\eta_t$ for the RLSM with the diffuse initialization.

Theorem 3.7. Let $C_t$ not be a linear combination of $C_1, \ldots, C_{t-1}$, $t = 2, 3, \ldots, tr$. Then:
1. $\gamma_t$ is a monotonically increasing function of $t$ for $t = 2, 3, \ldots, tr - 1$ and a monotonically decreasing function for $t = tr, tr + 1, \ldots, N$.
2. $\eta_t$ is a monotonically decreasing function of $t$ for $t = 2, 3, \ldots, tr - 1$ and it vanishes at $t = tr$.

Proof. 1. Let us denote

$$Q_t = \sum_{k=1}^{t}\lambda^{-k}C_k^TC_k = \sum_{k=1}^{t}\tilde C_k^T\tilde C_k,$$

where $\tilde C_t = \lambda^{-t/2}C_t$. We use the following recursive definition of $Q_t^+$ [47]:

$$Q_t^+ = Q_{t-1}^+ + \frac{1 + \tilde C_tQ_{t-1}^+\tilde C_t^T}{(\tilde C_t\Xi_{t-1}\tilde C_t^T)^2}(\Xi_{t-1}\tilde C_t^T)(\Xi_{t-1}\tilde C_t^T)^T - \frac{Q_{t-1}^+\tilde C_t^T(\Xi_{t-1}\tilde C_t^T)^T + (\Xi_{t-1}\tilde C_t^T)(Q_{t-1}^+\tilde C_t^T)^T}{\tilde C_t\Xi_{t-1}\tilde C_t^T}, \quad t = 2, 3, \ldots, tr, \tag{3.59}$$

with the initial condition

$$Q_1^+ = (\tilde C_1^T\tilde C_1)^+ = \tilde C_1^+(\tilde C_1^+)^T = \tilde C_1^T\tilde C_1/\|\tilde C_1\|^4,$$

where $\Xi_t = I_r - Q_tQ_t^+$. From Eq. (3.59) we find, taking into account that $Q_{t-1}^+\Xi_{t-1} = 0$,

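$$\mathrm{trace}(Q_t^+) = \mathrm{trace}(Q_{t-1}^+) + \frac{1 + \tilde C_tQ_{t-1}^+\tilde C_t^T}{\tilde C_t\Xi_{t-1}\tilde C_t^T} - \frac{2\tilde C_tQ_{t-1}^+\Xi_{t-1}\tilde C_t^T}{\tilde C_t\Xi_{t-1}\tilde C_t^T} = \mathrm{trace}(Q_{t-1}^+) + \frac{1 + \tilde C_tQ_{t-1}^+\tilde C_t^T}{\tilde C_t\Xi_{t-1}\tilde C_t^T}, \quad t = 2, 3, \ldots, tr. \tag{3.60}$$

Since

$$\gamma_t = R\,\mathrm{trace}(W_t^+)/\|\alpha\|^2 = \lambda^{-t}R\,\mathrm{trace}(Q_t^+)/\|\alpha\|^2,$$

then using Eq. (3.60), we obtain

$$\gamma_t = \lambda^{-1}\gamma_{t-1} + \lambda^{-t}R\,\frac{1 + \tilde C_tQ_{t-1}^+\tilde C_t^T}{\tilde C_t\Xi_{t-1}\tilde C_t^T}/\|\alpha\|^2, \quad t = 2, 3, \ldots, tr, \tag{3.61}$$

$$\gamma_1 = \lambda^{-1}R/\|C_1\|^2/\|\alpha\|^2. \tag{3.62}$$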

This implies that $\gamma_t$ is a monotonically increasing function for $t = 2, 3, \ldots, tr - 1$. For the values $t = tr + 1, tr + 2, \ldots, N$ we use the following recursive definition of $Q_t^{-1}$ [47]:

$$Q_t^{-1} = Q_{t-1}^{-1} - \frac{Q_{t-1}^{-1}\tilde C_t^T\tilde C_tQ_{t-1}^{-1}}{1 + \tilde C_tQ_{t-1}^{-1}\tilde C_t^T}$$

with the initial condition

$$Q_{tr}^{-1} = \Big(\sum_{k=1}^{tr}\lambda^{-k}C_k^TC_k\Big)^{-1}.$$

The assertion follows from the expressions

$$\gamma_t = \lambda^{-1}\gamma_{t-1} - \lambda^{-t}R\,\frac{\tilde C_tQ_{t-1}^{-2}\tilde C_t^T}{1 + \tilde C_tQ_{t-1}^{-1}\tilde C_t^T}/\|\alpha\|^2, \quad t = tr + 1, tr + 2, \ldots, N,$$

$$\gamma_{tr} = R\,\mathrm{trace}\Big(\Big(\sum_{k=1}^{tr}\lambda^{tr-k}C_k^TC_k\Big)^{-1}\Big)/\|\alpha\|^2.$$

2. Let $\lambda_t(i)$ be the eigenvalues of the matrix $W_t$ and $p_t(i)$ the corresponding eigenvectors, $i = 1, 2, \ldots, r$, $t = 2, 3, \ldots, tr$. Assume without loss of generality that $\lambda_t(i) = 0$ for $i = 1, 2, \ldots, r - t$ and $\lambda_t(i) > 0$ for $i = r - t + 1, r - t + 2, \ldots, r$. We have

$$W_t = \sum_{i=1}^{r}\lambda_t(i)p_t(i)p_t^T(i), \quad W_t^+ = \sum_{i=1}^{r}\lambda_t^+(i)p_t(i)p_t^T(i).$$

Since $p_t(i)$, $i = 1, 2, \ldots, r$ are orthogonal vectors of unit length for each $t = 1, 2, \ldots, tr - 1$, we have

$$W_tW_t^+ = \sum_{i=1}^{r}\sum_{j=1}^{r}\lambda_t(i)\lambda_t^+(j)p_t(i)p_t^T(i)p_t(j)p_t^T(j) = \sum_{i=1}^{r}\lambda_t(i)\lambda_t^+(i)p_t(i)p_t^T(i) = \sum_{i=r-t+1}^{r}p_t(i)p_t^T(i),$$

$$I_r - W_tW_t^+ = \sum_{i=1}^{r-t}p_t(i)p_t^T(i).$$

From this it follows that

$$\eta_t = \lambda^{-t}\sum_{i=1}^{r-t}(\alpha^Tp_t(i))^2/\|\alpha\|^2$$

and at the same time the assertion of the theorem.

Theorem 3.8. Let $\lambda = 1$, let $C_t$ not be a linear combination of $C_1, \ldots, C_{t-1}$, $t = 2, 3, \ldots, tr$, and let the signal-to-noise ratio $S_t$ satisfy the inequality

$$S_t < 1 + C_tW_{t-1}^+C_t^T, \quad t = 2, 3, \ldots, tr - 1. \tag{3.63}$$

Then $\beta_t$ is a monotonically increasing function of $t$ for $t = 2, 3, \ldots, tr$.

Proof. Using the relations

$$\Xi_t = \Xi_{t-1} - \frac{\Xi_{t-1}C_t^TC_t\Xi_{t-1}}{C_t\Xi_{t-1}C_t^T}, \quad \Xi_1 = I_r, \quad t = 2, 3, \ldots, tr$$

to determine the idempotent matrix $\Xi_t = I_r - W_tW_t^+$ [47], we obtain the following recursive representation of $\eta_t = \alpha^T(I_r - W_tW_t^+)\alpha/\|\alpha\|^2$:

$$\eta_t = \eta_{t-1} - \frac{(C_t\Xi_{t-1}\alpha)^2}{C_t\Xi_{t-1}C_t^T}/\|\alpha\|^2, \quad \eta_0 = 1, \quad t = 2, 3, \ldots, tr. \tag{3.64}$$
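A small numerical check of the recursion for $\Xi_t$ and $\eta_t$ may help. The sketch below is an illustration only: it assumes $\lambda = 1$ and linearly independent rows $C_t$, and it starts the recursion from $\Xi_0 = I_r$ (which corresponds to $W_0 = 0$) so that it is consistent with $\eta_0 = 1$. The function name `eta_sequence` and the test data are hypothetical.

```python
def eta_sequence(rows, alpha):
    """eta_t = alpha^T Xi_t alpha / ||alpha||^2, starting from Xi_0 = I_r (assumption)."""
    r = len(alpha)
    xi = [[1.0 if i == j else 0.0 for j in range(r)] for i in range(r)]
    norm2 = sum(a * a for a in alpha)
    etas = [1.0]                                    # eta_0 = 1
    for c in rows:
        # v = Xi_{t-1} C_t^T (Xi is symmetric, so C_t Xi_{t-1} = v^T)
        v = [sum(xi[i][j] * c[j] for j in range(r)) for i in range(r)]
        denom = sum(c[i] * v[i] for i in range(r))  # C_t Xi_{t-1} C_t^T
        xi = [[xi[i][j] - v[i] * v[j] / denom for j in range(r)] for i in range(r)]
        etas.append(sum(alpha[i] * xi[i][j] * alpha[j]
                        for i in range(r) for j in range(r)) / norm2)
    return etas


etas = eta_sequence([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]],
                    [1.0, 2.0, 2.0])
```

For these three independent rows in $R^3$ the sequence decreases step by step and reaches zero (up to rounding) after the third observation, in line with item 2 of Theorem 3.7.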

Since $\beta_t = \eta_t + \gamma_t$, we find from Eqs. (3.61), (3.62), and (3.64)

$$\beta_t = \beta_{t-1} + \frac{R(1 + C_tW_{t-1}^+C_t^T) - (C_t\Xi_{t-1}\alpha)^2}{\tilde C_t\Xi_{t-1}\tilde C_t^T}/\|\alpha\|^2, \quad t = 2, 3, \ldots, tr, \tag{3.65}$$

$$\beta_1 = 1 + R/\|C_1\|^2/\|\alpha\|^2. \tag{3.66}$$

This implies that $\beta_t$ is a monotonically increasing function for $t = 2, 3, \ldots, tr$ if the following condition is fulfilled:

$$(C_t\Xi_{t-1}\alpha)^2 \le R(1 + C_tW_{t-1}^+C_t^T), \quad t = 2, 3, \ldots, tr. \tag{3.67}$$

We show that

$$(C_t\Xi_{t-1}\alpha)^2 \le \|C_t\|^2\|\alpha\|^2. \tag{3.68}$$

Let $B, D \in R^{n\times n}$ be arbitrary symmetric matrices whose eigenvalues satisfy the condition

$$\lambda_i(B) \ge \lambda_i(D), \quad i = 1, 2, \ldots, n.$$

Then there is an orthogonal matrix $T \in R^{n\times n}$ such that [67]

$$T^TBT \ge D.$$

Let us denote $B = C_t^TC_t$, $D = \Xi_{t-1}C_t^TC_t\Xi_{t-1}$. Since the non-zero eigenvalues of $B$, $D$ are determined by the expressions

$$\lambda(B) = C_tC_t^T, \quad \lambda(D) = C_t\Xi_{t-1}^2C_t^T = C_t\Xi_{t-1}C_t^T,$$

we have $\lambda(B) \ge \lambda(D)$ and $T^TBT \ge D$. Thus

$$(C_t\Xi_{t-1}\alpha)^2 = \alpha^T\Xi_{t-1}C_t^TC_t\Xi_{t-1}\alpha \le \alpha^TT^TC_t^TC_tT\alpha \le C_tTT^TC_t^T\alpha^T\alpha = \|C_t\|^2\|\alpha\|^2.$$

Using Eq. (3.67) together with Eq. (3.68) gives the sufficient condition for $\beta_t$ to be a monotonically increasing function of $t$:

$$\|C_t\|^2\|\alpha\|^2 \le R(1 + C_tW_{t-1}^+C_t^T), \quad t = 1, 2, \ldots, tr.$$

But since $S_t = \|C_t\|^2\|\alpha\|^2/R$, this implies the assertion of the theorem.

3.5 FLUCTUATIONS WITH RANDOM INPUTS

Consider the fluctuations of the RLSM with random inputs $C_t$. We restrict ourselves to the case of the diffuse initialization. We have

$$e_t = \alpha_t - \alpha = -H_{t,0}\alpha + \sum_{s=1}^{t}H_{t,s}K_s^{dif}\xi_s, \quad t = 1, 2, \ldots, N,$$

where $H_{t,s}$ is the transition matrix of the system Eq. (3.10). Since $H_{t,0} = I_r - W_t^+W_t$, then assuming that $C_t$ does not depend on $\xi_t$ for any $t = 1, 2, \ldots, N$, we get for $t \ge tr$

$$E[e_t] = E\Big\{E\Big[-H_{t,0}\alpha + \sum_{s=1}^{t}H_{t,s}K_s^{dif}\xi_s \,\Big|\, C_s,\ s = 1, 2, \ldots, t\Big]\Big\} = 0,$$

where $E(\zeta|\upsilon)$ is the conditional expectation of a random variable $\zeta$ given another random variable $\upsilon$. Thus, as in the case of deterministic input signals, the bias in the diffuse algorithm is absent if $t \ge tr$. For the conditional matrix of second moments of the estimation error, with the help of Lemma 3.5 and the relations established in the proof of Theorem 3.6, we find

$$E[e_te_t^T \,|\, C_s,\ s = 1, 2, \ldots, t] = \lambda^{-t}(I_r - W_tW_t^+)\alpha\alpha^T(I_r - W_tW_t^+) + RV_t^+,$$

where $V_t = \sum_{k=1}^{t}C_k^TC_k$. Thus

$$E[e_t^Te_t \,|\, C_s,\ s = 1, 2, \ldots, t] = \lambda^{-t}\alpha^T(I_r - W_tW_t^+)\alpha + R\,\mathrm{trace}(V_t^+). \tag{3.69}$$

Assume that the input signals have the property of ergodicity, i.e., for sufficiently large $tr$ the approximation

$$E\Big(\sum_{k=1}^{t}C_k^TC_k\Big) \approx \Sigma = \mathrm{diag}(\sigma^2(1), \sigma^2(2), \ldots, \sigma^2(r)) \tag{3.70}$$

is possible, where $\sigma^2(i) = E[C_iC_i^T]$, $i = 1, 2, \ldots, r$. Then from Eq. (3.69) it follows that

$$\beta_{tr} = E\{E[e_{tr}^Te_{tr} \,|\, C_s,\ s = 1, 2, \ldots, tr]\}/\|\alpha\|^2 = R\sum_{i=1}^{r}1/\sigma^2(i)/\|\alpha\|^2 = \sum_{i=1}^{r}\sigma^2(i)\sum_{i=1}^{r}1/\sigma^2(i)/S, \tag{3.71}$$

where $S = \sum_{i=1}^{r}\sigma^2(i)\|\alpha\|^2/R$ is the signal-to-noise ratio. From this we find the condition for the absence of overshoot at the end of the transitional stage:

$$S \ge \sum_{i=1}^{r}\sigma^2(i)\sum_{i=1}^{r}1/\sigma^2(i). \tag{3.72}$$

For the case $\sigma^2(i) = \sigma^2$, $i = 1, 2, \ldots, r$ it has the particularly simple form $S \ge r^2$.

We now obtain an analogue of Eq. (3.72) under the assumption that $C_1^T, C_2^T, \ldots, C_N^T$ is a sequence of independent, identically distributed random vectors with the multivariate normal distribution $N(0, \Sigma)$, $\Sigma > 0$, $t = 1, 2, \ldots, N$. It is known that in this case

$$\Big(\sum_{k=1}^{t}C_k^TC_k\Big)^{-1}$$

has the inverse Wishart distribution [67] when $t \ge r + 2$ with probability 1 and

$$E\Big[\Big(\sum_{k=1}^{t}C_k^TC_k\Big)^{-1}\Big] = \Sigma^{-1}$$

[68]. Therefore, the use of Eq. (3.69) for $t = r + 2$ gives

$$\beta_{r+2} = E\{E[e_{r+2}^Te_{r+2} \,|\, C_s,\ s = 1, 2, \ldots, r + 2]\}/\|\alpha\|^2 = R\,\mathrm{trace}(\Sigma^{-1})/\|\alpha\|^2 = \mathrm{trace}(\Sigma)\,\mathrm{trace}(\Sigma^{-1})/S, \tag{3.73}$$

where $S = \|\alpha\|^2\,\mathrm{trace}(\Sigma)/R$ is the signal-to-noise ratio. From this we find the condition for the absence of overshoot at the end of the transitional stage:

$$S \ge \mathrm{trace}(\Sigma)\,\mathrm{trace}(\Sigma^{-1}). \tag{3.74}$$

It is seen that under the assumption that the inputs are uncorrelated this condition coincides with Eq. (3.72).
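The threshold in Eq. (3.72) is easy to evaluate numerically. The following sketch is an illustration only; the function names `overshoot_threshold` and `no_overshoot` and the sample variances are hypothetical.

```python
def overshoot_threshold(variances):
    """Right-hand side of Eq. (3.72): sum(sigma^2(i)) * sum(1 / sigma^2(i))."""
    return sum(variances) * sum(1.0 / v for v in variances)


def no_overshoot(snr, variances):
    """True when the signal-to-noise ratio S satisfies Eq. (3.72)."""
    return snr >= overshoot_threshold(variances)


equal = overshoot_threshold([2.0] * 5)       # equal variances, r = 5: threshold r**2
unequal = overshoot_threshold([1.0, 4.0])    # (1 + 4) * (1 + 1/4)
```

For equal variances the threshold reduces to $r^2$ regardless of the common value $\sigma^2$, while unequal variances raise it above $r^2$ (here 6.25 instead of 4 for $r = 2$).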

Example 3.5. We illustrate the condition for the absence of the overshoot at the moment $t = r + 2$, when with probability 1 the matrix $W_t$ is nonsingular. Consider the observation model Eq. (3.1) with random outcomes. We assume that

$$E[C_t(i)] = 0, \quad E[C_t^2(i)] = \sigma^2 = 1, \quad i = 1, 2, \ldots, r, \quad t = 1, 2, \ldots, r + 2, \quad \alpha_i = 0.1i, \quad i = 1, 2, \ldots, r.$$

The evaluation results obtained by statistical modeling of 500 realizations are shown in Table 3.1.

Table 3.1 Estimates of $\beta_{tr+2}$ obtained by means of the statistical simulation

    r    S      S, dB   β_{tr+2}   R^{1/2}  |  r    S      S, dB   β_{tr+2}   R^{1/2}
    5    122    20.9    0.19       0.15     |  5    18     13.6    1.58       0.39
    10   127    21      0.7        0.55     |  10   78.6   19      1.17       0.7
    20   434    26.4    0.86       1.15     |  20   340    25      1.26       1.3
    30   982    30      0.73       1.7      |  30   709    28.5    1.26       3.0
    40   1830   33.6    0.73       3.2      |  40   1310   31.5    1.33       3.6
    50   2552   34      0.76       3.9      |  50   1411   13.6    1.7        3.9

In the left half of the table the noise intensity was chosen so as to ensure fulfillment of the condition $S \ge r^2$, under which the overshoot is absent, and in the right half so that this condition is not fulfilled. It can be seen that the simulation matches the obtained theoretical results.
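The agreement claimed for Table 3.1 can be checked mechanically. The sketch below transcribes the $r$, $S$, and $\beta$ columns of the table and tests each row against the criterion $S \ge r^2$; treating $\beta \le 1$ as "no overshoot" is an assumption made here for the check, and the names `left`, `right`, and `consistent` are hypothetical.

```python
# (r, S, beta) transcribed from Table 3.1: left half first, then right half.
left = [(5, 122, 0.19), (10, 127, 0.7), (20, 434, 0.86),
        (30, 982, 0.73), (40, 1830, 0.73), (50, 2552, 0.76)]
right = [(5, 18, 1.58), (10, 78.6, 1.17), (20, 340, 1.26),
         (30, 709, 1.26), (40, 1310, 1.33), (50, 1411, 1.7)]


def consistent(rows):
    """True when 'S >= r^2' and 'beta <= 1' agree for every row."""
    return all((s >= r * r) == (beta <= 1.0) for r, s, beta in rows)


left_ok = consistent(left) and all(s >= r * r for r, s, _ in left)
right_ok = consistent(right) and all(s < r * r for r, s, _ in right)
```

Every row of the left half satisfies $S \ge r^2$ with $\beta \le 1$, and every row of the right half violates the condition with $\beta > 1$.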

CHAPTER 4

Diffuse Neural and Neuro-Fuzzy Networks Training Algorithms

Contents
4.1 Problem Statement
4.2 Training With the Use of Soft and Diffuse Initializations
4.3 Training in the Absence of a Priori Information About Parameters of the Output Layer
4.4 Convergence of Diffuse Training Algorithms
4.4.1 Finite Training Set
4.4.2 Infinite Training Set
4.5 Iterative Versions of Diffuse Training Algorithms
4.6 Diffuse Training Algorithm of Recurrent Neural Network
4.7 Analysis of Training Algorithms With Small Noise Measurements
4.8 Examples of Application
4.8.1 Identification of Nonlinear Static Plants
4.8.2 Identification of Nonlinear Dynamic Plants
4.8.3 Example of Classification Task

© 2017 Elsevier Inc. All rights reserved.

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00004-4

4.1 PROBLEM STATEMENT

Consider an observation model of the form

$$y_t = \Phi(z_t, \beta)\alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{4.1}$$

where $z_t \in R^n$ is a vector of inputs, $y_t \in R^m$ is a vector of outputs, $\alpha \in R^r$ and $\beta \in R^l$ are vectors of unknown parameters, $\Phi(z_t, \beta)$ is an $m \times r$ matrix of given nonlinear functions, $N$ is the training set size, and $\xi_t \in R^m$ is a random process which has uncorrelated values, zero expectation, and a covariance matrix $R_t$. We suppose that the following conditions are satisfied for the parameters $\beta$ and $\alpha$ appearing in the description Eq. (4.1).

A1: For the vector $\beta$, its probable value $\bar\beta$ is known and possible deviations from $\bar\beta$ are characterized by the function

$$\Gamma(\beta) = (\beta - \bar\beta)^T\bar P_\beta^{-1}(\beta - \bar\beta), \tag{4.2}$$

where $\bar P_\beta > 0$ is a given positive definite matrix.

A2: A priori information on elements of the vector $\alpha$ is absent and they are interpreted as random variables with

$$E(\alpha) = 0, \quad E(\alpha\alpha^T) = \mu\bar P_\alpha, \tag{4.3}$$

where $\mu > 0$ is a large parameter selected in the course of simulation and $\bar P_\alpha > 0$ is a positive definite $r \times r$ matrix independent of $\mu$.

A3: The vector $\alpha$ is not correlated with $\beta$ and $\xi_t$, $t = 1, 2, \ldots, N$.

It is assumed that $\bar\beta$ and $\bar P_\beta$ can be obtained from a training set, the distribution of a generating set, or some linguistic information. Let the training set $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$ be specified and let the quality criterion at a moment $t$ be defined by the expression

$$J_t(x) = \sum_{k=1}^{t}\lambda^{t-k}(y_k - \Phi(z_k, \beta)\alpha)^T(y_k - \Phi(z_k, \beta)\alpha) + \lambda^t\Gamma(\beta) + \lambda^t\alpha^T\bar P_\alpha^{-1}\alpha/\mu, \quad t = 1, 2, \ldots, N. \tag{4.4}$$

A vector of parameters $x = (\beta^T, \alpha^T)^T \in R^{l+r}$ is found under the condition that $J_t(x)$ is minimal, and the result must be updated after obtaining a new observation. We use the GN method with linearization in the neighborhood of the last estimate to solve this problem and study its behavior for a large $\mu$ (soft initialization) and as $\mu \to \infty$ (diffuse initialization).

Consider an alternative approach to the training problem in the absence of a priori information about $\alpha$. In this approach it is assumed that the unknown parameters in Eq. (4.1) satisfy the following conditions:

B1: For the vector $\beta$, its probable value $\bar\beta$ is known and possible deviations from $\bar\beta$ are characterized by the function Eq. (4.2).

B2: A priori information on elements of the vector $\alpha$ is absent and they can be either unknown constants or random quantities whose statistical characteristics are unknown.

Let the training set $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$ be specified and let the quality criterion at a moment $t$ be defined by the expression

$$J_t(x) = \sum_{k=1}^{t}\lambda^{t-k}(y_k - \Phi(z_k, \beta)\alpha)^T(y_k - \Phi(z_k, \beta)\alpha) + \lambda^t\Gamma(\beta). \tag{4.5}$$

A vector of parameters $x = (\beta^T, \alpha^T)^T \in R^{l+r}$ is found under the condition that $J_t(x)$ is minimal, and the result must be updated after obtaining a new observation. As above, we use the GN method with linearization in the neighborhood of the last estimate to solve this problem.
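The relation between the criteria (4.4) and (4.5) can be seen on a scalar toy model. The sketch below is an illustration only: it assumes $m = r = l = 1$ and a hypothetical basis function $\Phi(z, \beta) = e^{-\beta z}$, since the book leaves $\Phi$ generic; the function names and the data are likewise assumptions.

```python
import math


def phi(z, beta):
    # Hypothetical scalar basis function; the book leaves Phi generic.
    return math.exp(-beta * z)


def criterion(data, beta, alpha, lam, beta_bar, p_beta, p_alpha=None, mu=None):
    """J_t of Eq. (4.4); with mu=None the alpha penalty is dropped, giving Eq. (4.5)."""
    t = len(data)
    fit = sum(lam ** (t - k) * (y - phi(z, beta) * alpha) ** 2
              for k, (z, y) in enumerate(data, start=1))
    gamma = (beta - beta_bar) ** 2 / p_beta          # Gamma(beta), Eq. (4.2)
    penalty = 0.0 if mu is None else alpha ** 2 / (p_alpha * mu)
    return fit + lam ** t * (gamma + penalty)


data = [(0.0, 2.1), (0.5, 1.2), (1.0, 0.8)]          # made-up (z_k, y_k) pairs
j_b = criterion(data, 1.0, 2.0, 0.95, beta_bar=0.8, p_beta=0.5)
gaps = [criterion(data, 1.0, 2.0, 0.95, 0.8, 0.5, p_alpha=1.0, mu=mu) - j_b
        for mu in (1e2, 1e4, 1e6)]
```

The difference between the two criteria equals $\lambda^t\alpha^2/(\bar P_\alpha\mu)$ in this scalar case, so it shrinks in proportion to $1/\mu$ as $\mu$ grows.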

4.2 TRAINING WITH THE USE OF SOFT AND DIFFUSE INITIALIZATIONS

Let us assume that the unknown parameters in Eq. (4.1) satisfy conditions A1, A2, and A3 and that the quality criterion at a moment $t$ is defined by Eq. (4.4).

Lemma 4.1. The solution of the minimization problem (4.4) by the GN method with the soft initialization can be found recursively as follows:

$$x_t = x_{t-1} + K_t(y_t - h_t(x_{t-1})), \quad x_0 = (\bar\beta^T, 0_{1\times r})^T, \quad t = 1, 2, \ldots, N, \tag{4.6}$$

where $x_t = (\beta_t^T, \alpha_t^T)^T$, $h_t(x_{t-1}) = \Phi(z_t, \beta_{t-1})\alpha_{t-1}$,

$$K_t = ((K_t^\beta)^T, (K_t^\alpha)^T)^T = P_tC_t^T, \tag{4.7}$$

$$K_t^\beta = (S_t + V_tL_tV_t^T)(C_t^\beta)^T + V_tL_t(C_t^\alpha)^T, \tag{4.8}$$

$$K_t^\alpha = L_tV_t^T(C_t^\beta)^T + L_t(C_t^\alpha)^T, \tag{4.9}$$

$$P_t = \tilde S_t + \tilde V_tL_t\tilde V_t^T, \tag{4.10}$$

$$\tilde S_t = \begin{pmatrix} S_t & 0_{l\times r} \\ 0_{r\times l} & 0_{r\times r} \end{pmatrix}, \quad \tilde V_t = \begin{pmatrix} V_t \\ I_r \end{pmatrix}, \tag{4.11}$$

$$S_t = S_{t-1}/\lambda - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\beta S_{t-1}/\lambda, \quad S_0 = \bar P_\beta, \tag{4.12}$$

$$V_t = (I_l - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\beta)V_{t-1} - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\alpha, \quad V_0 = 0_{l\times r}, \tag{4.13}$$

$$L_t^{-1} = \lambda L_{t-1}^{-1} + \lambda(C_t^\beta V_{t-1} + C_t^\alpha)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}(C_t^\beta V_{t-1} + C_t^\alpha), \quad L_0^{-1} = \bar P_\alpha^{-1}/\mu, \tag{4.14}$$

$$C_t = (C_t^\beta, C_t^\alpha), \quad C_t^\beta = C_t^\beta(x_{t-1}) = \partial[\Phi(z_t, \beta_{t-1})\alpha_{t-1}]/\partial\beta_{t-1}, \quad C_t^\alpha = C_t^\alpha(x_{t-1}) = \Phi(z_t, \beta_{t-1}). \tag{4.15}$$

Proof. Consider the auxiliary linear observation model

$$y_t = C_t^\beta\beta + C_t^\alpha\alpha + \xi_t, \quad t = 1, 2, \ldots, N, \tag{4.16}$$

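and the optimization problem

$$(\alpha^*, \beta^*) = \arg\min J_t(\alpha, \beta), \quad \alpha \in R^r, \quad \beta \in R^l, \tag{4.17}$$

where

$$J_t(x) = \sum_{k=1}^{t}\lambda^{t-k}(y_k - C_k^\beta\beta - C_k^\alpha\alpha)^T(y_k - C_k^\beta\beta - C_k^\alpha\alpha) + \lambda^t\Gamma(\beta) + \lambda^t\alpha^T\bar P_\alpha^{-1}\alpha/\mu, \quad t = 1, 2, \ldots, N, \tag{4.18}$$

$y_t \in R^m$, $C_t^\beta \in R^{m\times l}$, $C_t^\alpha \in R^{m\times r}$, $\xi_t \in R^m$. Using the notation $x_t$ for $(\alpha^*, \beta^*)$, we find

$$x_t = x_{t-1} + K_t(y_t - C_tx_{t-1}), \quad x_0 = (\bar\beta^T, 0_{1\times r})^T, \quad t = 1, 2, \ldots, N, \tag{4.19}$$

where $C_t = (C_t^\beta, C_t^\alpha)$,

$$K_t = P_tC_t^T, \tag{4.20}$$

$$P_t = (P_{t-1} - P_{t-1}C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1})/\lambda, \quad P_0 = \mathrm{block\ diag}(\bar P_\beta, \bar P_\alpha\mu). \tag{4.21}$$

Let us show that the expressions (4.6)-(4.15) define the solution of the problems (4.17) and (4.18) for arbitrary $C_t^\beta$, $C_t^\alpha$, and $\mu > 0$. Consider two solutions $\tilde P_t$ and $\tilde S_t$ of the matrix equation

$$P_t = (P_{t-1} - P_{t-1}C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1})/\lambda, \tag{4.22}$$

with the initial conditions

$$\tilde P_0 = \mathrm{block\ diag}(\bar P_\beta, \bar P_\alpha\mu), \quad \tilde S_0 = \mathrm{block\ diag}(\bar P_\beta, 0_{r\times r}).$$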

Since

$$\tilde S_0C_1^T = \begin{pmatrix} \bar P_\beta(C_1^\beta)^T \\ 0 \end{pmatrix},$$

the first block representation in Eq. (4.11) is valid. Denote $A = I_{r+l}/\lambda^{1/2}$, $\bar C_t = C_t/\lambda^{1/2}$. Then Eq. (4.22) takes the form

$$P_t = AP_{t-1}A - AP_{t-1}\bar C_t^T(I_m + \bar C_tP_{t-1}\bar C_t^T)^{-1}\bar C_tP_{t-1}A. \tag{4.23}$$

The difference

$$Q_t = \tilde P_t - \tilde S_t, \quad t = 1, 2, \ldots, N$$

satisfies the matrix equation [69]

$$Q_t = \tilde A_tQ_{t-1}\tilde A_t^T - \tilde A_tQ_{t-1}\bar C_t^T(I_m + \bar C_t\tilde P_{t-1}\bar C_t^T)^{-1}\bar C_tQ_{t-1}\tilde A_t^T, \tag{4.24}$$

$$Q_0 = \tilde P_0 - \tilde S_0 = \mathrm{block\ diag}(0_{l\times l}, \bar P_\alpha\mu), \quad t = 1, 2, \ldots, N,$$

where $\tilde A_t = A - A\tilde S_{t-1}\bar C_t^T(I_m + \bar C_t\tilde S_{t-1}\bar C_t^T)^{-1}\bar C_t$. Returning to the original notation, we obtain

$$Q_t = (A_tQ_{t-1}A_t^T - A_tQ_{t-1}C_t^T(\lambda I_m + C_t\tilde P_{t-1}C_t^T)^{-1}C_tQ_{t-1}A_t^T)/\lambda, \tag{4.25}$$

$$Q_0 = \tilde P_0 - \tilde S_0 = \mathrm{block\ diag}(0_{l\times l}, \bar P_\alpha\mu), \quad t = 1, 2, \ldots, N,$$

where $A_t = I_{r+l} - \tilde S_{t-1}C_t^T(\lambda I_m + C_t\tilde S_{t-1}C_t^T)^{-1}C_t$. Let us show that

$$Q_t = \tilde V_tL_t\tilde V_t^T. \tag{4.26}$$

Substituting this expression in Eq. (4.25), we get

$$\tilde V_tL_t\tilde V_t^T = \big(A_t\tilde V_{t-1}L_{t-1}\tilde V_{t-1}^TA_t^T - A_t\tilde V_{t-1}L_{t-1}\tilde V_{t-1}^TC_t^T(\lambda I_m + C_t\tilde P_{t-1}C_t^T)^{-1}C_t\tilde V_{t-1}L_{t-1}\tilde V_{t-1}^TA_t^T\big)/\lambda. \tag{4.27}$$

This relation holds if $L_t$ and $\tilde V_t$ satisfy the matrix equations

$$\tilde V_t = A_t\tilde V_{t-1}, \quad \tilde V_0 = (0_{l\times r}, I_r)^T, \tag{4.28}$$

$$L_t = (L_{t-1} - L_{t-1}\tilde V_{t-1}^TC_t^T(\lambda I_m + C_t\tilde P_{t-1}C_t^T)^{-1}C_t\tilde V_{t-1}L_{t-1})/\lambda, \quad L_0 = \mu\bar P_\alpha. \tag{4.29}$$

Since

$$\tilde S_{t-1}C_t^T = \begin{pmatrix} S_{t-1}(C_t^\beta)^T \\ 0_{r\times m} \end{pmatrix}, \quad C_t\tilde S_{t-1}C_t^T = C_t^\beta S_{t-1}(C_t^\beta)^T,$$

$$A_t = I_{r+l} - \begin{pmatrix} S_{t-1}(C_t^\beta)^TN_t^{-1}C_t^\beta & S_{t-1}(C_t^\beta)^TN_t^{-1}C_t^\alpha \\ 0_{r\times l} & 0_{r\times r} \end{pmatrix},$$

where $N_t = \lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T$, the second expression in Eq. (4.11) follows from this and Eq. (4.28). Let us transform Eq. (4.29) using the matrix identity Eq. (2.7) for

$$B = L_{t-1}, \quad C = \tilde V_{t-1}^TC_t^T, \quad D = \lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T.$$

Taking into account that

$$\tilde P_t = Q_t + \tilde S_t = \tilde V_tL_t\tilde V_t^T + \tilde S_t,$$

we find

$$L_t^{-1} = (L_{t-1} - L_{t-1}\tilde V_{t-1}^TC_t^T(\lambda I_m + C_t\tilde P_{t-1}C_t^T)^{-1}C_t\tilde V_{t-1}L_{t-1})^{-1}\lambda$$

$$= (L_{t-1} - L_{t-1}\tilde V_{t-1}^TC_t^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T + C_t\tilde V_{t-1}L_{t-1}\tilde V_{t-1}^TC_t^T)^{-1}C_t\tilde V_{t-1}L_{t-1})^{-1}\lambda$$

$$= \lambda L_{t-1}^{-1} + \lambda\tilde V_{t-1}^TC_t^T(\lambda I_m + C_t\tilde S_{t-1}C_t^T)^{-1}C_t\tilde V_{t-1}.$$

But as $C_t\tilde V_{t-1} = C_t^\beta V_{t-1} + C_t^\alpha$, Eq. (4.14) follows from this.

Let us return to the solution of the original nonlinear minimization problem with the criterion Eq. (4.4). Linearization of the residual $e_t = y_t - \Phi(z_t, \beta)\alpha$ around the point $x_{t-1}$ gives

$$e_t = y_t - \Phi(z_t, \beta_{t-1})\alpha_{t-1} - \Phi(z_t, \beta_{t-1})(\alpha - \alpha_{t-1}) - \partial[\Phi(z_t, \beta_{t-1})\alpha_{t-1}]/\partial\beta_{t-1}\,(\beta - \beta_{t-1}) = \tilde y_t - C_tx,$$

where $x = (\beta^T, \alpha^T)^T$,

$$\tilde y_t = y_t + \partial[\Phi(z_t, \beta_{t-1})\alpha_{t-1}]/\partial\beta_{t-1}\,\beta_{t-1} = y_t - \Phi(z_t, \beta_{t-1})\alpha_{t-1} + C_tx_{t-1}.$$

Substitution of this expression into the criterion Eq. (4.4) leads to a linear optimization problem whose solution, as shown, is defined by the


expressions (4.19) and (4.7)(4.15). This implies the assertion of the lemma. Lemma 4.2. The matrices Pt and Kt in Eqs. (4.20) and (4.21) can be expanded in the power series T T Pt 5 λ2t V~ t P α ðIr 2 Wt Wt1 ÞV~ t μ 1 S~t 1 V~ t Wt1 V~ t q X T 21 1 ð21Þi λti V~ t ðWt1 P α Þi11 P α V~ t μ2i 1 Oðμ2q21 Þ;

(4.30)

i51 T 21 21 Kt 5 ½S~t 1 V~ t Wt1 P α ðIr 1P α Wt1 =μÞ21 V~ t P α CtT Rt21 q X T 21 5 Ktdif 1 ð21Þi λti V~ t ðWt1 P α Þi11 P α V~ t CtT Rt21 μ2i 1 Oðμ2q21 Þ i51

(4.31) which converge uniformly in tAT 5 f1; 2; . . .; N g for bounded T and sufficiently large values of μ, where Wt 5 λWt21 1 λðCtβ Vt21 1Ctα ÞT ðλIm 1Ctβ St21 ðCtβ ÞT Þ21 ðCtβ Vt21 1 Ctα Þ; W0 5 0r 3 r ; (4.32) T Ktdif 5 ðS~t 1 V~ t Wt1 V~ t ÞCtT :

(4.33)

Proof It follows from Eq. (4.14) that L21 t 5

t X k51

5λt

21

λt2k11 ðCtv Vt1Ctα ÞT ðλIm1Ctβ St21 ðCtβ ÞT Þ21 ðCtv Vt 1Ctα Þ1λt P α =μ ! t X 21 λ2k11 ðCt V~ t ÞT ðλIm1Ctβ St21 ðCtβ ÞT Þ21 Ct V~ t 1P α =μ : k51

By means of Lemma 2.1 with 21

Ωt 5 Lt21 ; Ω0 5 P α ; Ft 5 λ2ðt11Þ=2 ðλIm 1Ctβ St21 ðCtβ ÞT Þ21=2 Ct V~ t


we obtain the power series Lt 5 λ2t P α ðIr 2 Wt Wt1 Þμ 1 Wt1 q X 1=:2 21=:2 21=:2 1=2 1 ð21Þi P α λti ðP α Wt1 P α Þi11 P α μ2i 1 Oðμ2q21 Þ i51

which converges uniformly in tAT for bounded T and μ . 1=λmin , λmin 5 minfλt ðiÞ . 0; tAT ; i 5 1; 2; . . .; rg; λt ðiÞ; tAT ; i 5 1; 2; . . .; r; where λt ðiÞ, tAT ; i 5 1; 2; . . .; r 1=2 1=2 P α Wt P α .

are eigenvalues of the matrix T Substituting this representation in Pt 5 S~t 1 V~ t Lt V~ t gives

Eq. (4.30). Let us show now that T V~ t P α ðIr 2 Wt Wt1 ÞV~ t CtT 5 0:

(4.34)

Taking in account that Wt 5

t X

λt2k11 ðCkβ Vk21 1Ckα ÞT ðλIm 1Ckβ Sk21 ðCkβ ÞT Þ21 ðCkβ Vk21 1 Ckα Þ

k51

5

t X

λt2k11 ðCk V~ k21 ÞT ðλIm 1Ckβ Sk21 ðCkβ ÞT Þ21 Ck V~ k21

k51

and using Lemma 2.2 with Ft 5 λ2ðt11Þ=2 ðλIm 1Ctβ St21 ðCtβ ÞT Þ21=2 Ct V~ t21 yields P α ðIr 2 Wt Wt1 ÞðCt V~ t21 ÞT ðλIm 1Ctβ St21 ðCtβ ÞT Þ21=2 5 0: This implies T T V~ t P α ðIr 2 Wt Wt1 ÞV~ t ATt CtT 5 V~ t P α ðIr 2 Wt Wt1 ÞV~ t21 CtT T 2 V~ t P α ðIr 2 Wt Wt1 ÞV~ t21 CtT ðλIm 1Ct S~t21 CtT Þ21 Ct S~t21 CtT 5 0:

Theorem 4.1. The solution of the minimization problem (4.4) by the GN method with the diffuse initialization can be found recursively as follows:

$$x_t^{dif} = x_{t-1}^{dif} + K_t^{dif}(y_t - h_t(x_{t-1}^{dif})), \quad x_0^{dif} = (\bar\beta^T, 0_{1\times r})^T, \quad t = 1, 2, \ldots, N, \tag{4.35}$$

where $x_t^{dif} = ((\beta_t^{dif})^T, (\alpha_t^{dif})^T)^T$, $K_t^{dif} = ((K_t^\beta)^T, (K_t^\alpha)^T)^T$, $h_t(x_{t-1}) = \Phi(z_t, \beta_{t-1})\alpha_{t-1}$,

$$K_t^\beta = (S_t + V_tW_t^+V_t^T)(C_t^\beta)^T + V_tW_t^+(C_t^\alpha)^T, \tag{4.36}$$

$$K_t^\alpha = W_t^+V_t^T(C_t^\beta)^T + W_t^+(C_t^\alpha)^T, \tag{4.37}$$

$$S_t = S_{t-1}/\lambda - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\beta S_{t-1}/\lambda, \quad S_0 = \bar P_\beta, \tag{4.38}$$

$$V_t = (I_l - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\beta)V_{t-1} - S_{t-1}(C_t^\beta)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}C_t^\alpha, \quad V_0 = 0_{l\times r}, \tag{4.39}$$

$$W_t = \lambda W_{t-1} + \lambda(C_t^\beta V_{t-1} + C_t^\alpha)^T(\lambda I_m + C_t^\beta S_{t-1}(C_t^\beta)^T)^{-1}(C_t^\beta V_{t-1} + C_t^\alpha), \quad W_0 = 0_{r\times r}, \tag{4.40}$$

$$C_t^\beta = C_t^\beta(x_{t-1}) = \partial[\Phi(z_t, \beta_{t-1})\alpha_{t-1}]/\partial\beta_{t-1}, \quad C_t^\alpha = C_t^\alpha(x_{t-1}) = \Phi(z_t, \beta_{t-1}). \tag{4.41}$$

The assertion of this theorem follows from Lemmas 4.1 and 4.2 if we omit in Eq. (4.31) the terms starting with the first order of smallness $O(\mu^{-1})$. This algorithm will be called a diffuse training algorithm (DTA) for the model Eq. (4.1) with separable structure.

Let us formulate some consequences of the obtained results.

Consequence 4.1. The diffuse component $P_t^{dif} = \tilde V_t(I_r - W_tW_t^+)\tilde V_t^T$ is the term in the expansion of $P_t$ which is proportional to a large parameter, and it is equal to zero for $t \ge tr$, where $tr = \min_t\{t : W_t > 0,\ t = 1, 2, \ldots, N\}$.

Consequence 4.2. The matrix $K_t^{dif}$ does not depend on the diffuse component, as opposed to the matrix $P_t$, and as a function of $\mu$ it is uniformly bounded in $t \in T$ as $\mu \to \infty$.

Consequence 4.3. Numerical implementation errors can result in divergence of the GN method for large values of $\mu$. Indeed, let $\delta W_t^+$ be the error connected with the calculation of the pseudoinverse of $W_t$. Then by Lemma 4.2

$$K_t = (\tilde S_t + \tilde V_tL_t\tilde V_t^T)C_t^T = [\tilde S_t + \tilde V_t((I_r - W_t(W_t^+ + \delta W_t^+))\mu + O(1))\tilde V_t^T]C_t^T, \quad \mu \to \infty, \quad t = 1, 2, \ldots, N.$$

For $\delta W_t^+ \ne 0$ the matrix $K_t$ becomes dependent on the diffuse component. Moreover, this implies that even when $t \ge tr$ the passage to the representation

$$P_t = (P_{t-1} - P_{t-1}C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1})/\lambda$$

can turn out to be unjustified since the matrix $P_{tr}$ can still be ill-conditioned. The numerically implemented DTA does not have these features, because its construction contains no diffuse components, i.e., no quantities proportional to the large parameter. We also note one more important advantage of the DTA compared to the GN method with the soft initialization, namely, the selection of the value of $\mu$ is not required.

Now we derive the conditions under which the estimate $x = (\beta^T, \alpha^T)^T$, obtained by means of the GN method with a bounded value of $\mu$, asymptotically approaches $x_t^{dif}$ as $\mu \to \infty$. Let us consider a more compact representation of the training algorithms. We introduce the vector $u_t$, whose elements are the ordered column elements of $x_t$, $S_t$, $V_t$, and $L_t$, and which is defined by the system of equations

$$u_t = f_t(u_{t-1}, 1/\mu), \quad t \in T, \tag{4.42}$$

with the initial condition $u_0 = u$. By means of this notation the DTA can be described by the system of equations

$$u_t^{dif} = f_t(u_{t-1}^{dif}, 0), \quad t \in T, \tag{4.43}$$

with the initial condition $u_0^{dif} = u$, where $u_t^{dif}$ includes the ordered column elements of $x_t^{dif}$, $S_t^{dif}$, $V_t^{dif}$, and $W_t^{dif}$.

Theorem 4.2. Let the following conditions be satisfied:
1. $P(\|\xi_t\| < \infty) = 1$, $t \in T = \{1, 2, \ldots, N\}$, where $T$ is bounded.
2. The solution of the system Eq. (4.43) belongs to the area $\tilde U = \{t \in T,\ \|u_t^{dif}\| \le h\}$, where $h > 0$ is some number.
3. $\|\Phi(z_t, \beta)\| \le k_1$, $\|\partial\Phi(z_t, \beta)/\partial\beta\| \le k_2$, $\|\partial^2\Phi(z_t, \beta)/\partial\beta_i\partial\beta_j\| \le k_3$, $i, j = 1, 2, \ldots, l$ for $t \in T$ and $\|\beta\| \le h$, where $k_1$, $k_2$, and $k_3$ are some positive numbers.
4. $\mathrm{rank}(W_t)$ is constant for every fixed $t \in T$, $\|W_t\| \le h$ and $\mu > \bar\mu$, where $\bar\mu$ is some sufficiently large number.

Then with probability one in the area $U = \{t \in T,\ \|u_t\| \le h,\ \mu > \bar\mu\}$ the asymptotic representation

$$x_t = x_t^{dif} + O(1/\mu), \quad \mu \to \infty, \quad t \in T, \tag{4.44}$$

holds uniformly in $t \in T$, where

$$x_t^{dif} = x_{t-1}^{dif} + K_t^{dif}(y_t - h_t(x_{t-1}^{dif})), \quad x_0^{dif} = (\bar\beta^T, 0_{1\times r})^T. \tag{4.45}$$
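The $O(1/\mu)$ closeness in Eq. (4.44) can be observed in the simplest linear special case. The sketch below is an illustration only: it assumes a scalar linear regression ($r = 1$, no parameter $\beta$, $\lambda = 1$, unit noise variance), and the function names and data are hypothetical.

```python
def soft_rls(cs, ys, mu):
    """Scalar RLS with soft initialization P_0 = mu (lambda = 1)."""
    p, a = mu, 0.0
    for c, y in zip(cs, ys):
        p = p - p * c * c * p / (1.0 + c * p * c)   # covariance update
        a = a + p * c * (y - c * a)                 # gain K_t = P_t * c_t
    return a


def diffuse_rls_scalar(cs, ys):
    """Scalar diffuse counterpart: W_0 = 0, K_t = W_t^+ c_t."""
    w, a = 0.0, 0.0
    for c, y in zip(cs, ys):
        w = w + c * c
        k = c / w if w > 0.0 else 0.0               # scalar pseudoinverse
        a = a + k * (y - c * a)
    return a


cs, ys = [1.0, 2.0, -1.0], [2.2, 3.9, -2.1]         # made-up regressors and outputs
a_dif = diffuse_rls_scalar(cs, ys)
errors = {mu: abs(soft_rls(cs, ys, mu) - a_dif) for mu in (1e2, 1e3, 1e4)}
```

Each tenfold increase of $\mu$ reduces the gap between the soft and the diffuse estimates by roughly a factor of ten, which is the behavior that Eq. (4.44) describes for the general nonlinear case.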

Proof. Introducing a new variable $e_t = u_t - u_t^{dif}$, we obtain

$$e_t = e_{t-1} + f_t(u_{t-1}, 1/\mu) - f_t(u_{t-1}^{dif}, 0)$$

$$= e_{t-1} + f_t(e_{t-1} + u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 0)$$

$$= e_{t-1} + [f_t(e_{t-1} + u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 1/\mu)] + [f_t(u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 0)]$$

$$= e_{t-1} + \Psi_t(e_{t-1}, 1/\mu) + \phi_t(u_{t-1}^{dif}, 1/\mu), \quad e_0 = 0, \quad t \in T,$$

or equivalently

$$e_t = \sum_{i=0}^{t-1}\Psi_i(e_i, 1/\mu) + \tilde\phi_t(1/\mu), \quad e_0 = 0, \quad t \in T, \tag{4.46}$$

where $\tilde\phi_t(1/\mu) = \sum_{i=0}^{t-1}\phi_i(u_i^{dif}, 1/\mu)$.

Using the mean-value theorem in $\tilde U$ gives

$$f_t(e_{t-1} + u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 1/\mu) = \int_0^1\frac{\partial f_t(z, 1/\mu)}{\partial z}\Big|_{z = u_{t-1}^{dif} + \tau e_{t-1}}d\tau\ e_{t-1}, \tag{4.47}$$

$$f_t(u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 0) = \int_0^1\frac{\partial f_t(u_{t-1}^{dif}, z)}{\partial z}\Big|_{z = \tau/\mu}d\tau/\mu, \tag{4.48}$$

so long as the derivatives in these expressions exist.

It follows from the right-hand sides of Eqs. (4.12) and (4.13) that under conditions 1 and 2 of the theorem their partial derivatives with respect to $u_t$ are bounded in $U$. The existence and the boundedness of the partial derivatives of the functions in the right-hand sides of the systems Eqs. (4.6) and (4.45) depend on the properties of the matrix functions

$$K_t = P_tC_t^TR_t^{-1} = (\tilde S_t + \tilde V_tL_t\tilde V_t^T)C_t^T, \tag{4.49}$$

$$K_t^{dif} = (\tilde S_t^{dif} + \tilde V_t^{dif}(W_t^{dif})^+(\tilde V_t^{dif})^T)C_t^T. \tag{4.50}$$

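We have

$$L_t^{-1} = \lambda^t\Big(\sum_{k=1}^{t}\lambda^{-k+1}(C_k\tilde V_{k-1})^T(\lambda I_m + C_k^\beta S_{k-1}(C_k^\beta)^T)^{-1}C_k\tilde V_{k-1} + \bar P_\alpha^{-1}/\mu\Big)$$

$$= \lambda^t\bar P_\alpha^{-1/2}\Big(\sum_{k=1}^{t}\lambda^{-k+1}\bar P_\alpha^{1/2}(C_k\tilde V_{k-1})^T(\lambda I_m + C_k^\beta S_{k-1}(C_k^\beta)^T)^{-1}C_k\tilde V_{k-1}\bar P_\alpha^{1/2} + I_r/\mu\Big)\bar P_\alpha^{-1/2}$$

$$= \lambda^t\bar P_\alpha^{-1/2}(\Lambda_t + I_r/\mu)\bar P_\alpha^{-1/2}.$$

Using the identity [47]

$$(I_n + \mu V)^{-1} = (I_n - VV^+) + V^+(V^+ + \mu I_n)^{-1},$$

where $V$ is an arbitrary symmetric matrix, and assuming $V = \Lambda_t$, we obtain

$$L_t = \lambda^{-t}\bar P_\alpha^{1/2}(I_r + \mu\Lambda_t)^{-1}\bar P_\alpha^{1/2}\mu = \lambda^{-t}\bar P_\alpha^{1/2}(I_r - \Lambda_t\Lambda_t^+)\bar P_\alpha^{1/2}\mu + \lambda^{-t}\bar P_\alpha^{1/2}\Lambda_t^+(\Lambda_t^+ + \mu I_r)^{-1}\bar P_\alpha^{1/2}\mu. \tag{4.51}$$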
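The identity used to obtain Eq. (4.51) can be verified eigenvalue by eigenvalue. The sketch below is an illustration only: it assumes a diagonal nonnegative $V$, so that the matrix identity reduces to a scalar one for each eigenvalue $v \ge 0$; the function names are hypothetical.

```python
def lhs(v, mu):
    """Eigenvalue of (I + mu*V)^{-1} for an eigenvalue v >= 0 of V."""
    return 1.0 / (1.0 + mu * v)


def rhs(v, mu):
    """Eigenvalue of (I - V V^+) + V^+ (V^+ + mu*I)^{-1}."""
    v_plus = 1.0 / v if v != 0.0 else 0.0     # scalar pseudoinverse
    projector = 1.0 - v * v_plus              # I - V V^+
    return projector + v_plus / (v_plus + mu)


checks = [(v, mu, lhs(v, mu), rhs(v, mu))
          for v in (0.0, 0.5, 3.0) for mu in (1.0, 1e3, 1e6)]
```

For a zero eigenvalue both sides equal one for every $\mu$, which after multiplication by $\mu$ produces the term growing with $\mu$ in Eq. (4.51), while for a positive eigenvalue both sides decay like $1/(\mu v)$.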

We have by Eqs. (4.34) and (4.51) 1=2 1=2 T Kt 5 S~ t CtT 1 λ2t V~ t P α ðIr 1μΛt Þ21 P α V~ t CtT μ 1=2 21=2 21=2 1=2 5 S~ t CtT 1 P α Wt1 ðλt P α Wt1 P α =μ1Ir Þ21 P α CtT :

(4.52)

If condition 4 of the theorem is satisfied then the matrix function Wt1 ðWt Þis continuous and has a continuous derivative with respect to Wt in Uw 5 ftAT ; jjWt jj # hg [1] and therefore jjWt1 jj is bounded in this area. Since 21=2

jjðIr 1λt P α

21=2

Wt1 P α

=μÞ21 jj 5 jjTt diagðð11λt ð1Þ=μÞ21 ;

ð11λt ð2Þ=μÞ21 ; . . .; ð11λt ðrÞ=μÞ21 ÞTtT jj # !1=2 r X 22 2 # jjTt jj ð11λt ðiÞ=μÞ # rjjTt jj2 ; μ . μ; i51


where λt ðiÞ, Tt ðiÞtAT ; i 5 1; 2; . . .; r are eigenvalues and corresponding 21=2 21=2 eigenvectors of the matrix λt P α Wt1 P α , Tt 5 ðTt ð1Þ; Tt ð2Þ; . . .; Tt ðrÞÞ is an orthogonal matrix then the norm of the matrix function 21=2

ΨðWt ; μÞ 5 ðIr 1λt P α

21=2

Wt1 P α

=μÞ21

is bounded as well. The differentials of $\Psi = \Psi(W_t, \mu)$ with respect to $W_t^{+}$ and $\mu$ are defined by the expressions

$$d_{W^{+}} \Psi(W_t, \mu) = -\lambda^{t}\, \Psi\, \bar P_\alpha^{-1/2}\, dW_t^{+}\, \bar P_\alpha^{-1/2}\, \Psi / \mu,$$

$$d_{\mu} \Psi(W_t, \mu) = \lambda^{t}\, \Psi\, \bar P_\alpha^{-1/2} W_t^{+} \bar P_\alpha^{-1/2}\, \Psi / \mu^{2}\, d\mu,$$

which are continuous functions of $W_t$ and $\mu$ for $t \in T$, $\mu > \bar\mu$, and $\|W_t\| \le h$. Thus the derivatives in Eqs. (4.47) and (4.48) exist and are bounded in $\tilde U$, and therefore there are $c_1 > 0$, $c_2 > 0$ such that

$$\|\psi_t(e_{t-1}, 1/\mu)\| = \|f_t(e_{t-1} + u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 1/\mu)\| \le c_1 \|e_{t-1}\|, \tag{4.53}$$

$$\|\phi_t(u_{t-1}^{dif}, 1/\mu)\| = \|f_t(u_{t-1}^{dif}, 1/\mu) - f_t(u_{t-1}^{dif}, 0)\| \le c_2/\mu. \tag{4.54}$$

Iterating Eq. (4.46), we obtain

$$\|e_1\| \le b_1/\mu, \quad \|e_2\| \le c_1^{2} \|e_1\| + b_2/\mu \le (b_1 c_1^{2} + b_2)/\mu, \ \ldots,$$

$$\|e_t\| \le c_1^{t} \|e_1\| + c_2^{t} \|e_2\| + \cdots + c_{t-1}^{t} \|e_{t-1}\| + b_t/\mu = O(1/\mu),$$

where $c_i^{t}$, $i = 1, 2, \ldots, t-1$, and $b_i$, $i = 1, 2, \ldots, t$, are some positive numbers. Since $u_t^{dif} \in \tilde U$, it follows that $u_t = u_t^{dif} + e_t \in U$ for $\mu > \bar\mu$, and at the same time we have the asymptotic representation Eq. (4.44).

Let us assume now that in the quality criterion the covariance matrix of the observation noise $R_t$ is used instead of the forgetting factor:

$$J_t(x) = \sum_{k=1}^{t} (y_k - \Phi(z_k, \beta)\alpha)^T R_k^{-1} (y_k - \Phi(z_k, \beta)\alpha) + \Gamma(\beta) + \alpha^T \bar P_\alpha^{-1} \alpha/\mu, \quad t = 1, 2, \ldots, N. \tag{4.55}$$


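The solution in this case is not difficult to obtain using Theorem 4.1. Indeed, by the replacement

$$\tilde y_k = R_k^{-1/2} y_k, \quad \tilde\Phi(z_k, \beta) = R_k^{-1/2} \Phi(z_k, \beta), \tag{4.56}$$

Eq. (4.55) is transformed to Eq. (4.4) with $\lambda = 1$:

$$J_t(x) = \sum_{k=1}^{t} (\tilde y_k - \tilde\Phi(z_k, \beta)\alpha)^T (\tilde y_k - \tilde\Phi(z_k, \beta)\alpha) + \Gamma(\beta) + \alpha^T \bar P_\alpha^{-1} \alpha/\mu, \quad t = 1, 2, \ldots, N. \tag{4.57}$$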
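The step from Eq. (4.55) to Eq. (4.57) can be written out term by term. The short derivation below is an added illustration; it assumes that $R_k^{-1/2}$ denotes the symmetric positive-definite square root of $R_k^{-1}$:

```latex
\begin{aligned}
(\tilde y_k - \tilde\Phi(z_k,\beta)\alpha)^T (\tilde y_k - \tilde\Phi(z_k,\beta)\alpha)
&= (y_k - \Phi(z_k,\beta)\alpha)^T R_k^{-1/2} R_k^{-1/2} (y_k - \Phi(z_k,\beta)\alpha) \\
&= (y_k - \Phi(z_k,\beta)\alpha)^T R_k^{-1} (y_k - \Phi(z_k,\beta)\alpha).
\end{aligned}
```

Each weighted residual in Eq. (4.55) is thus an unweighted residual of the transformed data, which is why no forgetting factor remains ($\lambda = 1$).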

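From this the following assertion follows.

Theorem 4.3. The solution of the minimization problem (4.55) by the GN method with the diffuse initialization can be found recursively as follows:

$$x_t^{dif} = x_{t-1}^{dif} + K_t^{dif} (y_t - h_t(x_{t-1}^{dif})), \quad x_0^{dif} = (\bar\beta^T, 0_{1 \times r})^T, \quad t = 1, 2, \ldots, N, \tag{4.58}$$

where $x_t^{dif} = ((\beta_t^{dif})^T, (\alpha_t^{dif})^T)^T$, $K_t^{dif} = ((K_t^{\beta})^T, (K_t^{\alpha})^T)^T$, $h_t(x_{t-1}) = \Phi(z_t, \beta_{t-1}) \alpha_{t-1}$,

$$K_t^{\beta} = (S_t + V_t W_t^{+} V_t^T)(C_t^{\beta})^T R_t^{-1} + V_t W_t^{+} (C_t^{\alpha})^T R_t^{-1}, \tag{4.59}$$

$$K_t^{\alpha} = W_t^{+} V_t^T (C_t^{\beta})^T R_t^{-1} + W_t^{+} (C_t^{\alpha})^T R_t^{-1}, \tag{4.60}$$

$$S_t = S_{t-1} - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}, \quad S_0 = \bar P_\beta, \tag{4.61}$$

$$V_t = \big(I_l - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta}\big) V_{t-1} - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\alpha}, \quad V_0 = 0_{l \times r}, \tag{4.62}$$

$$W_t = W_{t-1} + (C_t^{\beta} V_{t-1} + C_t^{\alpha})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} (C_t^{\beta} V_{t-1} + C_t^{\alpha}), \quad W_0 = 0_{r \times r}, \tag{4.63}$$

$$C_t^{\beta} = C_t^{\beta}(x_{t-1}) = \partial[\Phi(z_t, \beta_{t-1}) \alpha_{t-1}]/\partial\beta_{t-1}, \quad C_t^{\alpha} = C_t^{\alpha}(x_{t-1}) = \Phi(z_t, \beta_{t-1}). \tag{4.64}$$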

A successful choice of a priori information for $\beta$ allows one to expect fast convergence of the DTA to one of the acceptable minimum points of the criterion. We first consider the following limiting case: $\bar P_\beta = 0_{l \times l}$. It follows from Theorem 4.1 that in Eq. (4.1) only the parameter $\alpha$ is then estimated, by the relations

$$\alpha_t^{dif} = \alpha_{t-1}^{dif} + K_t^{dif} (y_t - C_t^{\alpha} \alpha_{t-1}^{dif}), \quad \alpha_0^{dif} = 0_{r \times 1}, \quad t = 1, 2, \ldots, N, \tag{4.65}$$

where

$$K_t^{dif} = W_t^{+} (C_t^{\alpha})^T, \tag{4.66}$$

$$W_t = \lambda W_{t-1} + (C_t^{\alpha})^T C_t^{\alpha}, \quad W_0 = 0_{r \times r}. \tag{4.67}$$
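The recursion (4.65)–(4.67) can be sketched in a few lines for the scalar case $r = m = 1$. This is a minimal illustration and not the author's code: the function names and the test data are hypothetical, and the pseudoinverse of a $1 \times 1$ matrix is taken as $0$ for $W_t = 0$ and $1/W_t$ otherwise.

```python
def pinv_scalar(w):
    """Moore-Penrose pseudoinverse of a 1x1 matrix (hypothetical helper)."""
    return 0.0 if w == 0.0 else 1.0 / w


def diffuse_rls_scalar(c, y, lam=1.0):
    """Scalar version of (4.65)-(4.67); returns the estimate after each step."""
    alpha, w = 0.0, 0.0                          # alpha_0 = 0, W_0 = 0
    history = []
    for c_t, y_t in zip(c, y):
        w = lam * w + c_t * c_t                  # Eq. (4.67)
        k = pinv_scalar(w) * c_t                 # Eq. (4.66)
        alpha = alpha + k * (y_t - c_t * alpha)  # Eq. (4.65)
        history.append(alpha)
    return history


# Noise-free data: the first two regressors carry no information (c_t = 0).
alpha_true = 2.5
c = [0.0, 0.0, 2.0, 1.0, -3.0]
y = [c_t * alpha_true for c_t in c]
estimates = diffuse_rls_scalar(c, y, lam=0.9)
```

In this example the estimate stays at zero while $W_t = 0$ and equals the true value from the first step at which $W_t > 0$, regardless of the forgetting factor.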

If $\xi_t = 0$, then the system of equations for the estimation error $e_t = \alpha_t - \alpha$ has the form

$$e_t = (I_r - K_t C_t^{\alpha}) e_{t-1} = A_t e_{t-1}, \quad e_0 = -\alpha, \quad t = 1, 2, \ldots, N,$$

and its transition matrix is defined by the expression $H_{t,0} = I_r - W_t^{+} W_t$ (Lemma 2.4). Assume that there is $tr = \min_t \{t: \Pi_t > 0,\ t = 1, 2, \ldots, N\}$, where $\Pi_t = \sum_{k=1}^{t} (C_k^{\alpha})^T C_k^{\alpha}$. Then the estimation error vector $e_t$ vanishes for $t \ge tr$. If $\xi_t \ne 0$, this means that the DTA estimate is unbiased for $t \ge tr$.

The considered limiting case illustrates one more important distinctive feature of the DTA: the initial stage of training ($t \le tr$) can have an essential impact on the number of iterations necessary for the convergence of a nonlinear training problem. The training process with the DTA resembles the fast and slow movements that occur in dynamical systems. In our case $\beta_t^{dif}$ is a slow variable and $\alpha_t^{dif}$ is a fast one. Because of this, the point $((\alpha_t^{dif})^T, (\beta_t^{dif})^T)^T$ reaches a neighborhood of an acceptable minimum in a finite number of steps $t \ge tr$. This property of the DTA is considered in more detail in Section 4.4.2.

The situation in which the norm of $\bar P_\beta$ is different from zero but comparatively small seems to be more interesting. In this case different variants of simplified training algorithms can be constructed. Let, for example, $\bar P_\beta = O(\varepsilon)$, $\varepsilon \to 0$. Then $S_t = O(\varepsilon)$, $V_t = O(\varepsilon)$, $\varepsilon \to 0$, uniformly in $t$ for a bounded set $T$. Neglecting the terms $V_t W_t^{+} V_t^T (C_t^{\beta})^T$, $V_t W_t^{+} (C_t^{\alpha})^T$, $W_t^{+} V_t^T (C_t^{\beta})^T$ in Eqs. (4.36) and (4.37) and the terms $C_t^{\beta} S_{t-1} (C_t^{\beta})^T$, $C_t^{\beta} V_{t-1}$ in Eq. (4.40), we obtain a simplified version of the DTA from Theorem 4.1:

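$$K_t^{\beta} = S_t (C_t^{\beta})^T, \quad K_t^{\alpha} = W_t^{+} (C_t^{\alpha})^T, \tag{4.68}$$

$$S_t = S_{t-1}/\lambda - S_{t-1} (C_t^{\beta})^T (\lambda I_m + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}/\lambda, \quad S_0 = \bar P_\beta, \tag{4.69}$$

$$W_t = \lambda W_{t-1} + (C_t^{\alpha})^T C_t^{\alpha}, \quad W_0 = 0_{r \times r}. \tag{4.70}$$

Similarly, for the DTA from Theorem 4.3 we get

$$K_t^{\beta} = S_t (C_t^{\beta})^T R_t^{-1}, \quad K_t^{\alpha} = W_t^{+} (C_t^{\alpha})^T R_t^{-1}, \tag{4.71}$$

$$S_t = S_{t-1} - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}, \quad S_0 = \bar P_\beta, \tag{4.72}$$

$$W_t = W_{t-1} + (C_t^{\alpha})^T R_t^{-1} C_t^{\alpha}, \quad W_0 = 0_{r \times r}. \tag{4.73}$$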

We can arrive at the same relations in a different way. Indeed, solving the optimization problem by alternately fixing the parameters, we obtain relations (4.68)–(4.73). This explains the name: a two-stage training algorithm with the diffuse initialization.
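The two-stage relations (4.68)–(4.70) can be sketched for a toy scalar model. Everything specific below is an assumption made for illustration only: the model $y = \alpha\, e^{-\beta z}$ with $l = r = m = 1$, the function name `two_stage_dta`, and the data. The derivatives are evaluated at the previous estimate, as in Eq. (4.64).

```python
import math


def two_stage_dta(z, y, beta0, s0, lam=1.0):
    """Scalar two-stage recursion (4.68)-(4.70) for the toy model
    y = alpha * exp(-beta * z). Returns the final (beta, alpha)."""
    beta, alpha, s, w = beta0, 0.0, s0, 0.0      # alpha_0 = 0, W_0 = 0
    for z_t, y_t in zip(z, y):
        phi = math.exp(-beta * z_t)
        c_alpha = phi                            # d(phi * alpha) / d alpha
        c_beta = -alpha * z_t * phi              # d(phi * alpha) / d beta
        resid = y_t - phi * alpha
        # Eq. (4.69): update of S_t
        s = s / lam - s * c_beta * c_beta * s / (lam * (lam + c_beta * s * c_beta))
        # Eq. (4.70): update of W_t
        w = lam * w + c_alpha * c_alpha
        # Eq. (4.68): gains; scalar pseudoinverse of W_t
        k_beta = s * c_beta
        k_alpha = c_alpha / w if w > 0.0 else 0.0
        beta += k_beta * resid
        alpha += k_alpha * resid
    return beta, alpha


beta_true, alpha_true = 0.5, 3.0
z = [0.0, 1.0, 2.0, 3.0]
y = [alpha_true * math.exp(-beta_true * z_t) for z_t in z]

# Limiting case S_0 = 0 with beta_0 equal to the true value:
# beta stays fixed and alpha is recovered at the first step.
beta_fixed, alpha_fixed = two_stage_dta(z, y, beta0=0.5, s0=0.0)

# Small S_0 and a perturbed beta_0: both parameters are adjusted.
beta_est, alpha_est = two_stage_dta(z, y, beta0=0.4, s0=0.1)
```

The sketch makes no claim about convergence for a perturbed $\beta_0$; it only shows how the two gains in Eq. (4.68) act on a common residual.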

4.3 TRAINING IN THE ABSENCE OF A PRIORI INFORMATION ABOUT PARAMETERS OF THE OUTPUT LAYER

Let us assume now that the unknown parameters in Eq. (4.1) satisfy conditions B1 and B2 and that the quality criterion at a moment $t$ is defined by Eq. (4.5). We will need some auxiliary statements to obtain the main result of this section, namely, Theorem 4.4.

Lemma 4.3. The following identity is true:

$$\Lambda_t^{+} \tilde C_t^T = [(I_{l+r} - \lambda^{-t/2} K_t C_t) \Lambda_{t-1}^{+} \tilde C_{t-1}^T,\ K_t], \quad t = 1, 2, \ldots, N, \tag{4.74}$$

where $C_t \in R^{m \times (l+r)}$, $\tilde C_t = (\lambda^{-1/2} C_1^T, \lambda^{-1} C_2^T, \ldots, \lambda^{-t/2} C_t^T)^T \in R^{mt \times (l+r)}$, $\Lambda_t = (\tilde C_t^T \tilde C_t + M) \in R^{(l+r) \times (l+r)}$, $K_t = \lambda^{-t/2} \Lambda_t^{+} C_t^T \in R^{(l+r) \times m}$,

$$M = \begin{pmatrix} \bar P_\beta^{-1} & 0_{l \times r} \\ 0_{r \times l} & 0_{r \times r} \end{pmatrix} \in R^{(l+r) \times (l+r)}, \quad \bar P_\beta \in R^{l \times l},$$

and $0_{p \times q}$ is a $p \times q$ matrix with zero elements.

Proof. Since $\Lambda_t^{+} \tilde C_t^T = \Lambda_t^{+} (\tilde C_{t-1}^T, \lambda^{-t/2} C_t^T)$, the last $m$ columns on the left-hand and right-hand sides of identity Eq. (4.74) coincide. Let us show that

$$\Lambda_t^{+} \tilde C_{t-1}^T = (I_{l+r} - \lambda^{-t/2} K_t C_t) \Lambda_{t-1}^{+} \tilde C_{t-1}^T.$$

Using the recursive formula

$$\Lambda_t = \Lambda_{t-1} + \lambda^{-t} C_t^T C_t, \quad \Lambda_0 = M, \quad t = 1, 2, \ldots, N,$$

for the determination of the matrix $\Lambda_t$, we represent this expression in the equivalent form

$$\begin{aligned}
(\Lambda_t^{+} - (I_{l+r} - \lambda^{-t/2} K_t C_t) \Lambda_{t-1}^{+}) \tilde C_{t-1}^T
&= (\Lambda_t^{+} - (I_{r+l} - \Lambda_t^{+} \Lambda_t + \Lambda_t^{+} \Lambda_{t-1}) \Lambda_{t-1}^{+}) \tilde C_{t-1}^T \\
&= (\Lambda_t^{+} (I_{r+l} - \Lambda_{t-1} \Lambda_{t-1}^{+}) - (I_{r+l} - \Lambda_t^{+} \Lambda_t) \Lambda_{t-1}^{+}) \tilde C_{t-1}^T = 0.
\end{aligned} \tag{4.75}$$

Let us show that

$$(I_{l+r} - \Lambda_t^{+} \Lambda_t) \Lambda_{t-1}^{+} = 0, \tag{4.76}$$

$$\Lambda_t^{+} (I_{l+r} - \Lambda_{t-1} \Lambda_{t-1}^{+}) \tilde C_{t-1}^T = 0. \tag{4.77}$$

Let the matrix $Q_t = (M^{1/2}, \tilde C_t^T)$ have rank $q(t)$, let $l_t(1), l_t(2), \ldots, l_t(q(t))$ be any $q(t)$ of its linearly independent columns, and let $L_t = (l_t(1), l_t(2), \ldots, l_t(q(t)))$. The matrix $L_i$ is selected so that $L_i = (L_{i-1}, \Delta_i)$, where $\Delta_i$ is a matrix of rank $q(i) - q(i-1)$ composed of linearly independent columns of the matrix $\lambda^{-i/2} C_i^T$, for each $i = 2, 3, \ldots, t$. The linear space $C(Q_t)$ spanned by these columns coincides with the column space of $\Lambda_t$, which provides $C(\Lambda_{t-1}) \subseteq C(\Lambda_t)$. Using the skeleton decomposition of matrices, we get

$$(\tilde C_t^T, M^{1/2}) = L_t \Gamma_t,$$

where $\operatorname{rank}(\Gamma_t) = q(t)$.

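We have $\Lambda_t = L_t \tilde\Gamma_t L_t^T$, where $\tilde\Gamma_t = \Gamma_t \Gamma_t^T$. Since $L_t$ is a matrix of full column rank, we have $L_t^{+} = (L_t^T L_t)^{-1} L_t^T$ and

$$\Lambda_t^{+} = (L_t \tilde\Gamma_t L_t^T)^{+} = (L_t^T)^{+} \tilde\Gamma_t^{-1} L_t^{+}, \quad \Lambda_t^{+} \Lambda_t = L_t (L_t^T L_t)^{-1} L_t^T,$$

$$(I_{l+r} - L_t (L_t^T L_t)^{-1} L_t^T) L_{t-1} = 0,$$

which implies Eq. (4.76). Consider the equality Eq. (4.77). Since there is a matrix $\Gamma_t$ such that $\tilde C_{t-1}^T = L_{t-1} \Gamma_t$, we obtain

$$\Lambda_t^{+} (I_{l+r} - \Lambda_{t-1} \Lambda_{t-1}^{+}) \tilde C_{t-1}^T = \Lambda_t^{+} (I_{l+r} - L_{t-1} (L_{t-1}^T L_{t-1})^{-1} L_{t-1}^T) L_{t-1} \Gamma_t = 0, \quad t = 1, 2, \ldots, N.$$

Lemma 4.4. The asymptotic representation

$$(\varepsilon U_t + M_t)^{-1} = \frac{1}{\varepsilon} A_{-1} + A_0 + O(\varepsilon), \quad \varepsilon \to 0, \tag{4.78}$$

holds uniformly with respect to $t \in T = \{1, 2, \ldots, N\}$, where

$$A_0 = M_t^{+} = (\tilde C_t^T \tilde C_t + \lambda^{t} M)^{+}, \quad t = 1, 2, \ldots, N,$$

$$U_t = \begin{pmatrix} 0_{l \times l} & 0_{l \times r} \\ 0_{r \times l} & \lambda^{t} I_r \end{pmatrix} \in R^{(l+r) \times (l+r)}, \quad
M = \begin{pmatrix} \bar P_\beta^{-1} & 0_{l \times r} \\ 0_{r \times l} & 0_{r \times r} \end{pmatrix} \in R^{(l+r) \times (l+r)}, \quad \bar P_\beta \in R^{l \times l},$$

$C_t \in R^{m \times (l+r)}$, $\tilde C_t = (\lambda^{(t-1)/2} C_1^T, \lambda^{(t-2)/2} C_2^T, \ldots, C_t^T)^T \in R^{mt \times (l+r)}$, $\varepsilon > 0$ is a small parameter, and $T$ is a bounded set.

Proof. The existence of the asymptotic representation Eq. (4.78) follows from the expression [47]

$$(H^T H + \tfrac{1}{\varepsilon} G^T G)^{+} = (\bar H^T \bar H)^{+} + \varepsilon (I - \bar H^{+} H)(G^T G)^{+} (I - \bar H^{+} H)^T + O(\varepsilon^2),$$

where $H$ and $G$ are arbitrary matrices of corresponding dimensions and

$$\bar H = H (I - G^{+} G) = H (I - (G^T G)^{+} G^T G),$$

provided that $H = U_t^{1/2}$, $G = M_t^{1/2}$.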

We need to show that

$$M_t^{+} = (I_{l+r} - \bar H_t^{+} U_t^{1/2}) M_t^{+} (I_{l+r} - \bar H_t^{+} U_t^{1/2})^T,$$

where $\bar H_t = U_t^{1/2} (I_{l+r} - M_t^{+} M_t)$, or

$$\bar H_t^{+} U_t^{1/2} M_t^{+} = 0.$$

Let us first show that

$$\bar H_t^{+} = \lambda^{-t/2} (I_{l+r} - M_t^{+} M_t) U,$$

where

$$U = \begin{pmatrix} 0_{l \times l} & 0_{l \times r} \\ 0_{r \times l} & I_r \end{pmatrix}.$$

We use the following result [47]. Let $A$ and $B$ be arbitrary matrices. Then $(AB)^{+} = B^{+} A^{+}$ if and only if they satisfy the conditions

$$A^{+} A B B^T A^T = B B^T A^T, \quad B B^{+} A^T A B = A^T A B.$$

Putting $A = U$, $B = I_{l+r} - M_t^{+} M_t$ gives

$$(U - I_{l+r})(I_{l+r} - M_t^{+} M_t) U = 0, \tag{4.79}$$

$$M_t^{+} M_t U (I_{l+r} - M_t^{+} M_t) = 0. \tag{4.80}$$

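Let $T_t = (t_t(1), t_t(2), \ldots, t_t(l+r))$ be an orthogonal matrix such that $M_t = T_t \Lambda_t T_t^T$, where $\Lambda_t = \operatorname{diag}(\lambda_t(1), \lambda_t(2), \ldots, \lambda_t(l+r))$ and $\lambda_t(i)$, $i = 1, 2, \ldots, l+r$, are the eigenvalues of the matrix $M_t$. Let us determine the structure of the matrix $T_t$. We have

$$\operatorname{rank}(M_t) = \operatorname{rank}\left(\lambda^{t} M + \begin{pmatrix} (\tilde C_t^{\beta})^T \tilde C_t^{\beta} & (\tilde C_t^{\beta})^T \tilde C_t^{\alpha} \\ (\tilde C_t^{\alpha})^T \tilde C_t^{\beta} & (\tilde C_t^{\alpha})^T \tilde C_t^{\alpha} \end{pmatrix}\right) = \operatorname{rank}(\lambda^{t} M + (\tilde C_t^{\beta})^T \tilde C_t^{\beta}) + \operatorname{rank}(S_t),$$

where

$$S_t = (\tilde C_t^{\alpha})^T \tilde C_t^{\alpha} - (\tilde C_t^{\alpha})^T \tilde C_t^{\beta} (I_l + (\tilde C_t^{\beta})^T \tilde C_t^{\beta})^{-1} (\tilde C_t^{\beta})^T \tilde C_t^{\alpha},$$

$$\tilde C_t = (\tilde C_t^{\beta}, \tilde C_t^{\alpha}), \quad \tilde C_t^{\beta} \in R^{mt \times l}, \quad \tilde C_t^{\alpha} \in R^{mt \times r}.$$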

This implies that $q(t) = \operatorname{rank}(M_t) \ge l$. If $\operatorname{rank}(M_t) = l + r$, then the statement of the lemma is obviously true. Let $l \le q(t) < l + r$. The eigenvectors of the matrix $M_t$ (the columns of the matrix $T_t$) that correspond to zero eigenvalues are determined from the system $M_t x_t = 0$. Using the block representation $x_t = (x_t^T(1), x_t^T(2))^T$, we obtain

$$\begin{pmatrix} \lambda^{t} \bar P_\beta^{-1} + (\tilde C_t^{\beta})^T \tilde C_t^{\beta} & (\tilde C_t^{\beta})^T \tilde C_t^{\alpha} \\ (\tilde C_t^{\alpha})^T \tilde C_t^{\beta} & (\tilde C_t^{\alpha})^T \tilde C_t^{\alpha} \end{pmatrix} \begin{pmatrix} x_t(1) \\ x_t(2) \end{pmatrix} = 0$$

or, equivalently,

$$\lambda^{t} \bar P_\beta^{-1} x_t(1) + (\tilde C_t^{\beta})^T \tilde C_t x_t = 0, \quad (\tilde C_t^{\alpha})^T \tilde C_t x_t = 0.$$

The solution of this system is defined by the expressions

$$x_t(1) = 0, \quad x_t(2) = (I_r - (\tilde C_t^{\alpha})^{+} \tilde C_t^{\alpha}) f,$$

where $f \in R^r$ is an arbitrary vector. From this, without loss of generality, we assume that

$$\Lambda_t = \operatorname{diag}(\lambda_t(1), \lambda_t(2), \ldots, \lambda_t(q(t)), 0, 0, \ldots, 0), \quad \lambda_t(i) > 0, \quad i = 1, 2, \ldots, q(t),$$

and obtain

$$T_t = \begin{pmatrix} T_{1t} & 0_{l \times (l+r-q(t))} \\ T_{2t} & T_{3t} \end{pmatrix}, \quad T_{1t} \in R^{l \times q(t)}, \quad T_{2t} \in R^{r \times q(t)}, \quad T_{3t} \in R^{r \times (l+r-q(t))}.$$

Transforming the left-hand sides of Eqs. (4.79) and (4.80) with the help of the spectral decomposition $M_t^{+} = T_t \Lambda_t^{+} T_t^T$ gives

$$(U - I_{l+r})(I_{l+r} - M_t^{+} M_t) U = (U - I_{l+r}) T_t (I_{l+r} - \Lambda_t^{+} \Lambda_t) T_t^T U,$$

$$M_t^{+} M_t U (I_{l+r} - M_t^{+} M_t) = T_t \Lambda_t^{+} \Lambda_t T_t^T U T_t (I_{l+r} - \Lambda_t^{+} \Lambda_t) T_t^T.$$

In view of the structure of $T_t$ and the relation $T_{3t}^T T_{2t} = 0_{(l+r-q(t)) \times q(t)}$, Eqs. (4.79) and (4.80) follow from here. Similarly, it is verified that

$$\bar H_t^{+} U_t^{1/2} M_t^{+} = (I_{l+r} - M_t^{+} M_t) U M_t^{+} = T_t (I_{l+r} - \Lambda_t^{+} \Lambda_t) T_t^T U T_t \Lambda_t^{+} T_t^T = 0.$$
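A scalar illustration of Lemma 4.4 may help fix ideas (it is added here and is not part of the source). Take $l = 0$, $r = 1$, so that $U_t = \lambda^{t}$ and $M_t = m \ge 0$ are numbers, and expand $(\varepsilon U_t + M_t)^{-1}$ in the two possible cases:

```latex
\begin{aligned}
m > 0:&\quad (\varepsilon\lambda^{t} + m)^{-1} = \frac{1}{m} - \varepsilon\,\frac{\lambda^{t}}{m^{2}} + O(\varepsilon^{2}),
  &&A_{-1} = 0,\quad A_0 = \frac{1}{m} = m^{+},\\
m = 0:&\quad (\varepsilon\lambda^{t} + m)^{-1} = \frac{1}{\varepsilon}\,\lambda^{-t},
  &&A_{-1} = \lambda^{-t},\quad A_0 = 0 = m^{+}.
\end{aligned}
```

In both cases the term of order zero equals the pseudoinverse $M_t^{+}$, while the singular term $A_{-1}/\varepsilon$ appears only in directions in which $M_t$ carries no information.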

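Lemma 4.5. The following representation is true:

$$M_t^{+} = (\tilde C_t^T \tilde C_t + \lambda^{t} M)^{+} = \tilde S_t + \tilde V_t W_t^{+} \tilde V_t^T, \quad t = 1, 2, \ldots, N, \tag{4.81}$$

where

$$M = \begin{pmatrix} \bar P_\beta^{-1} & 0_{l \times r} \\ 0_{r \times l} & 0_{r \times r} \end{pmatrix} \in R^{(l+r) \times (l+r)}, \quad \bar P_\beta \in R^{l \times l}, \quad
U_t = \begin{pmatrix} 0_{l \times l} & 0_{l \times r} \\ 0_{r \times l} & \lambda^{t} I_r \end{pmatrix} \in R^{(l+r) \times (l+r)},$$

$$\tilde S_t = \begin{pmatrix} S_t & 0_{l \times r} \\ 0_{r \times l} & 0_{r \times r} \end{pmatrix}, \quad \tilde V_t = \begin{pmatrix} V_t \\ I_r \end{pmatrix}, \quad S_t \in R^{l \times l}, \quad V_t \in R^{l \times r}, \tag{4.82}$$

and $S_t$, $V_t$, $W_t$ are defined by the matrix equations (4.38)–(4.40) for arbitrary $C_t^{\beta} \in R^{m \times l}$, $C_t^{\alpha} \in R^{m \times r}$.

Proof. The matrix

$$P_t^{-1} = \varepsilon U_t + M_t = \varepsilon \begin{pmatrix} 0_{l \times l} & 0_{l \times r} \\ 0_{r \times l} & \lambda^{t} I_r \end{pmatrix} + M_t$$

satisfies the matrix difference equation

$$P_t^{-1} = \lambda P_{t-1}^{-1} + C_t^T C_t, \quad P_0^{-1} = \operatorname{block\,diag}(\bar P_\beta^{-1}, \varepsilon I_r), \quad t = 1, 2, \ldots, N,$$

where $C_t = (C_t^{\beta}, C_t^{\alpha})$. Let us use the matrix identity Eq. (2.7) to determine $P_t$. Putting $B = P_{t-1}$, $D = \lambda I_m$, $C = C_t^T$ in Eq. (2.7), we obtain

$$\begin{aligned}
P_t &= (\lambda P_{t-1}^{-1} + C_t^T C_t)^{-1} = (P_{t-1} - P_{t-1} C_t^T (\lambda I_m + C_t P_{t-1} C_t^T)^{-1} C_t P_{t-1})/\lambda \\
&= P_{t-1}/\lambda - \tfrac{1}{\lambda^{2}} P_{t-1} C_t^T (I_m + \tfrac{1}{\lambda} C_t P_{t-1} C_t^T)^{-1} C_t P_{t-1},
\end{aligned} \tag{4.83}$$

$$P_0 = \operatorname{block\,diag}(\bar P_\beta, I_r/\varepsilon), \quad t = 1, 2, \ldots, N.$$

It follows from Lemma 4.1 that $P_t = \tilde S_t + \tilde V_t L_t \tilde V_t^T$, where the matrices $\tilde S_t$, $\tilde V_t$, $L_t$ are defined by relations (4.11)–(4.14). Using Lemma 2.1, we find from Eq. (4.14) that

$$L_t = \tfrac{1}{\varepsilon} \lambda^{-t} \bar P_\alpha (I_r - W_t W_t^{+}) + W_t^{+} + O(\varepsilon), \quad \varepsilon \to 0.$$

Since

$$\begin{aligned}
P_t = (\varepsilon U_t + M_t)^{-1} &= Q_t + \tilde S_t = \tilde V_t L_t \tilde V_t^T + \tilde S_t \\
&= \tfrac{1}{\varepsilon} \lambda^{-t} \tilde V_t \bar P_\alpha (I_r - W_t W_t^{+}) \tilde V_t^T + \tilde V_t W_t^{+} \tilde V_t^T + \tilde S_t + O(\varepsilon), \quad \varepsilon \to 0,
\end{aligned} \tag{4.84}$$

Eq. (4.81) follows from Lemma 4.4.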

Consider the auxiliary linear observation model Eq. (4.16) and the optimization problem connected with it,

$$(\alpha^{*}, \beta^{*}) = \arg\min J_t(\alpha, \beta), \quad \alpha \in R^r, \quad \beta \in R^l, \tag{4.85}$$

where

$$J_t(x) = \sum_{k=1}^{t} \lambda^{t-k} (y_k - C_k^{\beta} \beta - C_k^{\alpha} \alpha)^T (y_k - C_k^{\beta} \beta - C_k^{\alpha} \alpha) + \lambda^{t} \Gamma(\beta), \quad t = 1, 2, \ldots, N, \tag{4.86}$$

$y_t \in R^m$, $C_t^{\beta} \in R^{m \times l}$, $C_t^{\alpha} \in R^{m \times r}$, $\xi_t \in R^m$.

Lemma 4.6. The solution of problems (4.85) and (4.86) can be represented in the recursive form

$$\beta_t = \beta_{t-1} + K_t^{\beta} (y_t - C_t^{\beta} \beta_{t-1} - C_t^{\alpha} \alpha_{t-1}), \quad \beta_0 = \bar\beta, \tag{4.87}$$

$$\alpha_t = \alpha_{t-1} + K_t^{\alpha} (y_t - C_t^{\beta} \beta_{t-1} - C_t^{\alpha} \alpha_{t-1}), \quad \alpha_0 = 0_{r \times 1}, \quad t = 1, 2, \ldots, N, \tag{4.88}$$

where

$$K_t^{\beta} = (S_t + V_t W_t^{+} V_t^T)(C_t^{\beta})^T + V_t W_t^{+} (C_t^{\alpha})^T, \tag{4.89}$$

$$K_t^{\alpha} = W_t^{+} V_t^T (C_t^{\beta})^T + W_t^{+} (C_t^{\alpha})^T. \tag{4.90}$$

Proof. Let $x = (\beta^T, \alpha^T)^T$ and pass to a more compact representation of the criterion:

$$J_t(x) = (Y_t - \tilde C_t x)^T (Y_t - \tilde C_t x) + \lambda^{t} (x - \bar x)^T M (x - \bar x), \quad t = 1, 2, \ldots, N, \tag{4.91}$$

where $\bar x = (\bar\beta^T, 0_{1 \times r})^T$, the matrix $M$ is defined in Lemma 4.3,

$$\tilde C_t = (\lambda^{(t-1)/2} C_1^T, \lambda^{(t-2)/2} C_2^T, \ldots, C_t^T)^T \in R^{mt \times (l+r)}, \quad Y_t = (\lambda^{(t-1)/2} y_1^T, \lambda^{(t-2)/2} y_2^T, \ldots, y_t^T)^T \in R^{mt \times 1}.$$

Using the replacement $z = x - \bar x$, we obtain

$$J_t(z + \bar x) = \|\tilde Y_t - \tilde C_t z\|^2 + \lambda^{t} z^T M z = \left\| \begin{pmatrix} \tilde Y_t \\ 0_{(l+r) \times 1} \end{pmatrix} - \begin{pmatrix} \tilde C_t \\ \lambda^{t/2} M^{1/2} \end{pmatrix} z \right\|^2,$$

where $\tilde Y_t = Y_t - \tilde C_t \bar x$. The stationary points of this problem are determined by the system of normal equations

$$\begin{pmatrix} \tilde C_t \\ \lambda^{t/2} M^{1/2} \end{pmatrix}^T \begin{pmatrix} \tilde C_t \\ \lambda^{t/2} M^{1/2} \end{pmatrix} z = \begin{pmatrix} \tilde C_t \\ \lambda^{t/2} M^{1/2} \end{pmatrix}^T \begin{pmatrix} \tilde Y_t \\ 0_{(l+r) \times 1} \end{pmatrix},$$

or, in the equivalent form,

$$(\tilde C_t^T \tilde C_t + \lambda^{t} M) z = \tilde C_t^T \tilde Y_t. \tag{4.92}$$

We will consider only the minimum norm solution, which is defined by the expression

$$z = (\tilde C_t^T \tilde C_t + \lambda^{t} M)^{+} \tilde C_t^T \tilde Y_t = M_t^{+} \tilde C_t^T \tilde Y_t. \tag{4.93}$$

Let us show that it can be found recursively. Denote $z_t = z$. Using Lemma 4.3 gives

$$\begin{aligned}
z_t &= (\tilde C_t^T \tilde C_t + \lambda^{t} M)^{+} \tilde C_t^T (\tilde Y_{t-1}^T, \tilde y_t^T)^T = [(I_{l+r} - K_t C_t) M_{t-1}^{+} \tilde C_{t-1}^T,\ K_t] (\tilde Y_{t-1}^T, \tilde y_t^T)^T \\
&= (I_{l+r} - K_t C_t) z_{t-1} + K_t \tilde y_t, \quad z_0 = 0, \quad t = 1, 2, \ldots, N,
\end{aligned}$$

where $K_t = M_t^{+} C_t^T = (\tilde C_t^T \tilde C_t + \lambda^{t} M)^{+} C_t^T$. Returning to the initial variable $x_t = z_t + \bar x$, we obtain

$$x_t = x_{t-1} + K_t (y_t - C_t x_{t-1}), \quad x_0 = (\bar\beta^T, 0_{1 \times r})^T, \quad t = 1, 2, \ldots, N. \tag{4.94}$$

Expressions (4.89) and (4.90) follow from Lemma 4.5.

Now we can formulate an algorithm for the solution of the nonlinear minimization problem with the criterion Eq. (4.5).

Theorem 4.4. Let the unknown parameters in Eq. (4.1) satisfy conditions B1 and B2. Then the solution to the problem of minimization of the criterion Eq. (4.5) by the GN method coincides with the DTA.


Proof. This assertion follows from Lemma 4.6.

Comparing the two statements of the parameter estimation problem for Eq. (4.1) given in Section 4.1, we note the following. When the minimization problem with the quality criterion Eq. (4.5) is solved, no statistical assumptions about the nature of the unknown initial conditions are used. This approach seems to be more logically justified than the diffuse initialization, in which one has to deal with an infinite covariance matrix of the initial conditions. However, it follows from Theorem 4.4 that these two settings lead to the same relations for parameter estimation.
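The coincidence of the two settings can be illustrated in the simplest linear case. The sketch below is an assumption-laden toy (a single linearly entering parameter, $\lambda = 1$, hypothetical function names and data): it compares the diffuse recursion, the batch minimum norm least squares solution, and a soft initialization with a large parameter $\mu$, using exact rational arithmetic.

```python
from fractions import Fraction as F


def diffuse_estimates(c, y):
    """Diffuse recursion: W_t = W_{t-1} + c_t^2, gain K_t = W_t^+ c_t."""
    alpha, w, out = F(0), F(0), []
    for c_t, y_t in zip(c, y):
        w += c_t * c_t
        k = c_t / w if w != 0 else F(0)      # scalar pseudoinverse
        alpha += k * (y_t - c_t * alpha)
        out.append(alpha)
    return out


def soft_estimates(c, y, mu):
    """Ordinary RLS started from the soft initialization P_0 = mu."""
    alpha, p, out = F(0), F(mu), []
    for c_t, y_t in zip(c, y):
        p = p - p * c_t * c_t * p / (1 + c_t * p * c_t)
        alpha += p * c_t * (y_t - c_t * alpha)
        out.append(alpha)
    return out


def batch_min_norm(c, y):
    """Minimum norm solution of the normal equation (sum c^2) a = sum c*y."""
    w = sum(ci * ci for ci in c)
    return sum(ci * yi for ci, yi in zip(c, y)) / w if w != 0 else F(0)


c = [F(0), F(1), F(2)]
y = [F(5), F(3), F(4)]                       # deliberately inconsistent (noisy) data
diffuse = diffuse_estimates(c, y)
batch = [batch_min_norm(c[:t], y[:t]) for t in range(1, len(c) + 1)]
soft = soft_estimates(c, y, mu=10**6)
```

Here the diffuse recursion reproduces the batch minimum norm solution at every step, and the soft initialization approaches it as $\mu$ grows.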

4.4 CONVERGENCE OF DIFFUSE TRAINING ALGORITHMS

The behavior of the DTA with increasing sample size is studied in this section. The specific feature of the problem is related to the separable character of the observation models and to the fact that the parameters entering nonlinearly belong to some compact set, whereas the parameters entering linearly should be considered as arbitrary numbers. We will suppose that the observations are generated by the model

$$y_t = \Phi(z_t, \beta^{*}) \alpha^{*} + \xi_t, \quad t = 1, 2, \ldots, \tag{4.95}$$

for some sequences $\xi_t \in R^m$, $z_t \in R^n$, where $\beta^{*} \in R^l$, $\alpha^{*} \in R^r$ are unknown parameters. It is assumed that the parameters in Eq. (4.95) are estimated by the DTAs obtained in Section 4.2, which are defined by the system of equations

$$x_t = x_{t-1} + K_t (y_t - h_t(x_{t-1})), \quad x_0 = (\bar\beta^T, 0_{1 \times r})^T, \quad t = 1, 2, \ldots, \tag{4.96}$$

where $x_t = (\beta_t^T, \alpha_t^T)^T$, $h_t(x_{t-1}) = \Phi(z_t, \beta_{t-1}) \alpha_{t-1}$, or in the equivalent form

$$\beta_t = \beta_{t-1} + K_t^{\beta} (y_t - h_t(x_{t-1})), \quad \beta_0 = \bar\beta, \tag{4.97}$$

$$\alpha_t = \alpha_{t-1} + K_t^{\alpha} (y_t - h_t(x_{t-1})), \quad \alpha_0 = 0_{r \times 1}, \quad t = 1, 2, \ldots, \tag{4.98}$$

where $K_t = ((K_t^{\beta})^T, (K_t^{\alpha})^T)^T$ is described by the expressions (4.36)–(4.41) or (4.59)–(4.64). If $\xi_t = 0$, then the system Eq. (4.96) has the solution $x_t = x^{*} = ((\beta^{*})^T, (\alpha^{*})^T)^T$. The vector function $\eta_t = K_t \xi_t$ can be interpreted as a disturbance acting on it. We want to find conditions under which $\|x_t - x^{*}\|$ becomes small with increasing $t$.

4.4.1 Finite Training Set

Along with the system Eq. (4.96), consider the system

$$z_t = z_{t-1} + K_t (h_t(x^{*}) - h_t(z_{t-1})), \quad z_0 = (\bar\beta^T, 0_{1 \times r})^T, \quad t = 1, 2, \ldots. \tag{4.99}$$

Definition 1. The solution $z_t = x^{*}$ of the system Eq. (4.99) is finitely stable under the action of the disturbance $\xi_t$ if for any $\varepsilon > 0$, $d > 0$ there are $\delta_1(A_d)$, $\delta_2(A_d)$, $\delta_3(A_d) > 0$ such that $\|\beta_0 - \beta^{*}\| < \delta_1(A_d)$, $\alpha_0 = 0$, $\|\xi_t\| < \delta_2(A_d)$, $\|S_0\| < \delta_3(A_d)$ imply $\|x_t - x^{*}\| < \varepsilon$ for all $\alpha^{*} \in A_d = \{\alpha: \|\alpha\| \le d\} \subset R^r$, $t \ge tr = \min\{t: W_t > 0,\ t \in T\}$, where $T = \{1, 2, \ldots, N\}$ is a bounded set.

Note that this definition formalizes the idea of fast and slow movements inherent to the DTA. We will need some auxiliary statements to obtain the main result of this section, namely, Theorem 4.5.

Lemma 4.7. The solution of the system of equations

$$x_t = A_t x_{t-1} + f_t, \quad x_0 = \bar x, \quad t = 1, 2, \ldots, \tag{4.100}$$

is defined by the expression

$$x_t = H_{t,0} x_0 + \sum_{s=1}^{t} H_{t,s} f_s, \quad t = 1, 2, \ldots, \tag{4.101}$$

where $x_t, f_t \in R^n$, $A_t \in R^{n \times n}$,

$$H_{t,s} = \begin{cases} A_t A_{t-1} \cdots A_{s+1}, & t = s+1, s+2, \ldots, \\ I_n, & t = s, \end{cases} \tag{4.102}$$

or, in another form,

$$H_{t,s} = A_t H_{t-1,s}, \quad t = s+1, s+2, \ldots, \quad H_{s,s} = I_n. \tag{4.103}$$

Proof. Iterating Eqs. (4.100) and (4.103) yields, respectively,

$$x_1 = A_1 x_0 + f_1, \quad x_2 = A_2 A_1 x_0 + A_2 f_1 + f_2, \ \ldots, \quad x_t = A_t A_{t-1} \cdots A_1 x_0 + A_t A_{t-1} \cdots A_2 f_1 + \cdots + f_t,$$

$$H_{s+1,s} = A_{s+1}, \quad H_{s+2,s} = A_{s+2} A_{s+1}, \ \ldots, \quad H_{t,s} = A_t A_{t-1} \cdots A_{s+1}.$$

This implies Eqs. (4.101) and (4.102).

Lemma 4.8. The solution of the system of equations

$$Z_{t-1,s} = B_t Z_{t,s} + F_{t,s}, \quad Z_{s,s} = I_n, \quad t = s-1, s-2, \ldots, \tag{4.104}$$

is defined by the expression

$$Z_{t,s} = Y_{t,s} + \sum_{j=t}^{s-1} Y_{t,j} F_{j+1,s}, \quad t = s-1, s-2, \ldots, \tag{4.105}$$

where $Z_{t,s}, F_{t,s}, B_t \in R^{n \times n}$,

$$Y_{t,s} = \begin{cases} B_{t+1} B_{t+2} \cdots B_s, & t = s-1, s-2, \ldots, \\ I_n, & t = s, \end{cases} \tag{4.106}$$

or, in another form,

$$Y_{t-1,s} = B_t Y_{t,s}, \quad t = s-1, s-2, \ldots, \quad Y_{s,s} = I_n. \tag{4.107}$$

Proof. Iterating Eqs. (4.104) and (4.107) yields, respectively,

$$Z_{s-1,s} = B_s + F_{s,s}, \quad Z_{s-2,s} = B_{s-1} B_s + B_{s-1} F_{s,s} + F_{s-1,s}, \ \ldots,$$

$$Z_{t,s} = B_{t+1} B_{t+2} \cdots B_s + B_{t+1} B_{t+2} \cdots B_{s-1} F_{s,s} + B_{t+1} B_{t+2} \cdots B_{s-2} F_{s-1,s} + \cdots + F_{t+1,s},$$

$$Y_{s-1,s} = B_s, \quad Y_{s-2,s} = B_{s-1} B_s, \ \ldots, \quad Y_{t,s} = B_{t+1} B_{t+2} \cdots B_s,$$

$$Z_{t,s} = Y_{t,s} + Y_{t,s-1} F_{s,s} + Y_{t,s-2} F_{s-1,s} + \cdots + F_{t+1,s} = Y_{t,s} + \sum_{j=t}^{s-1} Y_{t,j} F_{j+1,s}.$$

This implies Eqs. (4.105) and (4.106).

We first consider the convergence of the DTA from Theorem 4.3. Let us represent the expressions for the DTA from Theorem 4.3 in the following more compact form, which is convenient for further use:

$$K_t = (\tilde S_t + \tilde V_t \tilde W_t^{+} \tilde V_t^T) C_t^T R_t^{-1} = \tilde S_t C_t^T R_t^{-1} + K_t^{d}, \quad t = 1, 2, \ldots, \tag{4.108}$$

where

$$\tilde S_t = \tilde S_{t-1} - \tilde S_{t-1} C_t^T N_t^{-1} C_t \tilde S_{t-1}, \tag{4.109}$$

$$\tilde V_t = (I_{l+r} - \tilde S_{t-1} C_t^T N_t^{-1} C_t) \tilde V_{t-1} = A_{1t} \tilde V_{t-1}, \tag{4.110}$$

$$\tilde W_t = \tilde W_{t-1} + \tilde V_{t-1}^T C_t^T N_t^{-1} C_t \tilde V_{t-1}, \tag{4.111}$$

$$N_t = R_t + C_t \tilde S_{t-1} C_t^T, \tag{4.112}$$

$$\tilde S_t = \operatorname{block\,diag}(S_t, 0_{r \times r}), \quad \tilde V_t = (V_t^T, I_r)^T, \quad \tilde W_t = W_t, \quad C_t = (C_t^{\beta}, C_t^{\alpha}). \tag{4.113}$$

Lemma 4.9. Suppose that there is $tr = \min\{t: W_t > 0,\ t \in T\}$, $T = \{1, 2, \ldots, N\}$, and let $S_0 > 0$. Then for $t \ge tr$, $t \in T$, the following representations are true:

$$H_{t,s} = \Phi_{t,s} - \tilde V_t W_t^{-1} \sum_{i=s}^{t-1} \tilde V_i^T C_{i+1}^T N_{i+1}^{-1} C_{i+1} \Phi_{i,s}, \tag{4.114}$$

$$H_{t,s} K_s^{d} = \tilde V_t W_t^{-1} \tilde V_s^T C_s^T R_s^{-1}, \quad t = s+1, s+2, \ldots, \tag{4.115}$$

where $H_{t,s}$, $\Phi_{t,s}$ are the solutions of the systems of matrix equations

$$H_{t,s} = (I_{l+r} - K_t C_t) H_{t-1,s} = A_t H_{t-1,s}, \quad H_{s,s} = I_{l+r}, \tag{4.116}$$

$$\Phi_{t,s} = (I_{l+r} - \tilde S_t C_t^T R_t^{-1} C_t) \Phi_{t-1,s} = \tilde A_t \Phi_{t-1,s}, \quad \Phi_{s,s} = I_{l+r}, \tag{4.117}$$

and $K_t$, $\tilde V_t$, $W_t$, $\tilde S_t$, $N_t$ are defined by the expressions (4.108)–(4.113).

Proof. Consider two auxiliary systems of equations

$$G_{t-1,s} = A_t^T G_{t,s}, \quad G_{s,s} = I_{l+r}, \quad t = s-1, s-2, \ldots, \tag{4.118}$$

$$Z_{t-1,s} = \tilde A_t^T Z_{t,s} - C_t^T N_t^{-1} C_t \tilde V_{t-1} W_s^{-1} \tilde V_s^T, \quad Z_{s,s} = I_{l+r}, \quad t = s-1, s-2, \ldots, \quad s \ge tr. \tag{4.119}$$

Let us first show that

$$G_{t,s} = Z_{t,s}, \quad t = s-1, s-2, \ldots, \quad s \ge tr. \tag{4.120}$$

Putting in Eq. (4.104)

$$B_t = \tilde A_t^T, \quad F_{t,s} = -C_t^T N_t^{-1} C_t \tilde V_{t-1} W_s^{-1} \tilde V_s^T,$$

we obtain

$$Z_{t,s} = Y_{t,s} - \sum_{j=t}^{s-1} Y_{t,j} C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j W_s^{-1} \tilde V_s^T, \quad t = s-1, s-2, \ldots,$$

where

$$Y_{t-1,s} = \tilde A_t^T Y_{t,s}, \quad Y_{s,s} = I_{l+r}, \quad t = s-1, s-2, \ldots.$$

The use of Lemmas 4.7 and 4.8 gives

$$Y_{t,s} = \tilde A_{t+1}^T \tilde A_{t+2}^T \cdots \tilde A_s^T, \quad \Phi_{t,s} = \tilde A_t \tilde A_{t-1} \cdots \tilde A_{s+1}, \quad \Phi_{s,t}^T = \tilde A_{t+1}^T \tilde A_{t+2}^T \cdots \tilde A_s^T.$$

From this it follows that $Y_{t,s} = \Phi_{s,t}^T$. Thus

$$Z_{t,s} = \Phi_{s,t}^T - \sum_{j=t}^{s-1} \Phi_{j,t}^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j W_s^{-1} \tilde V_s^T, \quad t = s-1, s-2, \ldots. \tag{4.121}$$

We have

$$\begin{aligned}
G_{t-1,s} = A_t^T G_{t,s} &= (I_{l+r} - K_t C_t)^T G_{t,s} \\
&= \big((I_{l+r} - \tilde S_t C_t^T R_t^{-1} C_t) - \tilde V_t W_t^{+} \tilde V_t^T C_t^T R_t^{-1} C_t\big)^T G_{t,s} = (\tilde A_t - K_t^{d} C_t)^T G_{t,s}.
\end{aligned} \tag{4.122}$$

From the comparison of the right-hand sides of Eqs. (4.119) and (4.122) it follows that Eq. (4.120) is satisfied if

$$(K_t^{d})^T Z_{t,s} = N_t^{-1} C_t \tilde V_{t-1} W_s^{-1} \tilde V_s^T. \tag{4.123}$$

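Using the expressions $\tilde V_t = \Phi_{t,0} \tilde V_0$, $\Phi_{s,t} \Phi_{t,0} = \Phi_{s,0}$ gives

$$\begin{aligned}
(K_t^{d})^T Z_{t,s} &= R_t^{-1} C_t \tilde V_t W_t^{+} \tilde V_t^T \Big(\Phi_{s,t}^T - \sum_{j=t}^{s-1} \Phi_{j,t}^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j W_s^{-1} \tilde V_s^T\Big) \\
&= R_t^{-1} C_t \tilde V_t W_t^{+} \Big(I_r - \sum_{j=t}^{s-1} \tilde V_j^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j W_s^{-1}\Big) \tilde V_s^T.
\end{aligned}$$

Iterating Eq. (4.111), we find

$$W_t = \sum_{j=0}^{t-1} \tilde V_j^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j.$$

Since

$$\sum_{j=t}^{s-1} \tilde V_j^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j = W_s - W_t,$$

the left-hand side of Eq. (4.123) can be transformed to the form

$$\begin{aligned}
(K_t^{d})^T Z_{t,s} &= R_t^{-1} C_t \tilde V_t W_t^{+} \Big(W_s - \sum_{j=t}^{s-1} \tilde V_j^T C_{j+1}^T N_{j+1}^{-1} C_{j+1} \tilde V_j\Big) W_s^{-1} \tilde V_s^T \\
&= R_t^{-1} C_t \tilde V_t W_t^{+} W_t W_s^{-1} \tilde V_s^T.
\end{aligned} \tag{4.124}$$

Let us show that

$$R_t^{-1} C_t \tilde V_t = N_t^{-1} C_t \tilde V_{t-1}. \tag{4.125}$$

Note first that $\tilde A_t = A_{1t}$. Indeed, we have

$$\tilde A_t = I_{l+r} - \tilde S_t C_t^T R_t^{-1} C_t = \begin{pmatrix} I_l - S_t (C_t^{\beta})^T R_t^{-1} C_t^{\beta} & 0_{l \times r} \\ 0_{r \times l} & I_r \end{pmatrix}, \tag{4.126}$$

where

$$S_t = S_{t-1} - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}, \quad S_0 = \bar P_\beta.$$

Since

$$S_t^{-1} = S_{t-1}^{-1} + (C_t^{\beta})^T R_t^{-1} C_t^{\beta},$$

using the identity

$$P H^T (H P H^T + R)^{-1} = (P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}$$

with $P = S_{t-1}$, $H = C_t^{\beta}$, $R = R_t$ gives

$$S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} = S_t (C_t^{\beta})^T R_t^{-1}.$$

Taking into account Eq. (4.126), we obtain $\tilde A_t = A_{1t}$. We show now that

$$I_l - S_t (C_t^{\beta})^T R_t^{-1} C_t^{\beta} = (I_l + S_{t-1} (C_t^{\beta})^T R_t^{-1} C_t^{\beta})^{-1}. \tag{4.127}$$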

Let us transform the right-hand side of this expression using the identity Eq. (2.7). Putting $B = S_{t-1}$, $C = C_t^{\beta}$, $D = R_t$ in Eq. (2.7), we obtain

$$\begin{aligned}
(S_{t-1}^{-1} + (C_t^{\beta})^T R_t^{-1} C_t^{\beta})^{-1} S_{t-1}^{-1} &= \big(S_{t-1} - S_{t-1} (C_t^{\beta})^T (R_t + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}\big) S_{t-1}^{-1} \\
&= I_l - S_{t-1} (C_t^{\beta})^T N_t^{-1} C_t^{\beta} = I_l - S_t (C_t^{\beta})^T R_t^{-1} C_t^{\beta}.
\end{aligned}$$

Using Eqs. (4.110) and (4.127), we find

$$\tilde V_{t-1} = A_{1t}^{-1} \tilde V_t = (I_{l+r} + \tilde S_{t-1} C_t^T R_t^{-1} C_t) \tilde V_t. \tag{4.128}$$

If Eq. (4.125) holds, then we should have

$$\tilde V_t = \tilde V_{t-1} - \tilde S_{t-1} C_t^T N_t^{-1} C_t \tilde V_{t-1} = \tilde V_{t-1} - \tilde S_{t-1} C_t^T R_t^{-1} C_t \tilde V_t,$$

which coincides with Eq. (4.128). Thus we have shown that Eq. (4.125) holds. Substituting Eq. (4.125) into Eq. (4.124) gives

$$(K_t^{d})^T Z_{t,s} = N_t^{-1} C_t \tilde V_{t-1} W_t^{+} W_t W_s^{-1} \tilde V_s^T. \tag{4.129}$$

Let $\tilde C_t = (\tilde V_0^T C_1^T N_1^{-1/2}, \tilde V_1^T C_2^T N_2^{-1/2}, \ldots, \tilde V_{t-1}^T C_t^T N_t^{-1/2})$. Assume that the rank of this matrix is equal to $k(t)$ and that $l_t(1), l_t(2), \ldots, l_t(k(t))$ are any of its linearly independent columns. The use of the skeleton decomposition yields $\tilde C_t = L_t \Gamma_t$, where $L_t$, $\Gamma_t$ are $r \times k(t)$ and $k(t) \times mt$ matrices of rank $k(t)$, and $\Gamma_t(i)$, $i = 1, 2, \ldots, t$, are $k(t) \times (l+r)$ matrices. We have

$$W_t = \tilde C_t \tilde C_t^T = L_t \Gamma_t \Gamma_t^T L_t^T, \quad W_t^{+} = (L_t \Gamma_t \Gamma_t^T L_t^T)^{+} = (L_t^T)^{+} (\Gamma_t \Gamma_t^T)^{-1} L_t^{+}, \quad \tilde V_{t-1}^T C_t^T N_t^{-1/2} = L_t \tilde\Gamma_t,$$

where $L_t^{+} = (L_t^T L_t)^{-1} L_t^T$ and $\tilde\Gamma_t$ is some matrix. Substituting these expressions in Eq. (4.129), we obtain

$$(K_t^{d})^T Z_{t,s} = N_t^{-1} C_t \tilde V_{t-1} W_t^{+} W_t W_s^{-1} \tilde V_s^T = R_t^{-1} C_t \tilde V_t W_s^{-1} \tilde V_s^T. \tag{4.130}$$

This implies Eq. (4.120). Finally, the use of Lemmas 4.7 and 4.8 for $B_t = A_t^T$ gives

$$H_{t,s} = A_t A_{t-1} \cdots A_{s+1}, \quad G_{t,s} = B_{t+1} B_{t+2} \cdots B_s, \quad G_{s,t}^T = B_t^T B_{t-1}^T \cdots B_{s+1}^T.$$

Therefore $H_{t,s} = G_{s,t}^T = Z_{s,t}^T$ and we have Eq. (4.114). The expression (4.115) follows from Eq. (4.130).
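The matrix identity $P H^T (H P H^T + R)^{-1} = (P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}$ used in this proof can be checked on a small numerical instance. The values below are chosen arbitrarily for illustration and do not come from the source:

```latex
\begin{aligned}
&P = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix},\quad H = \begin{pmatrix} 1 & 2 \end{pmatrix},\quad R = 5, \\
&P H^T (H P H^T + R)^{-1} = \begin{pmatrix} 4 \\ 7 \end{pmatrix} (18 + 5)^{-1} = \frac{1}{23} \begin{pmatrix} 4 \\ 7 \end{pmatrix}, \\
&P^{-1} + H^T R^{-1} H = \frac{1}{5} \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix} + \frac{1}{5} \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix} = \frac{1}{5} \begin{pmatrix} 4 & 1 \\ 1 & 6 \end{pmatrix}, \\
&(P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} = \frac{5}{23} \begin{pmatrix} 6 & -1 \\ -1 & 4 \end{pmatrix} \cdot \frac{1}{5} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{1}{23} \begin{pmatrix} 4 \\ 7 \end{pmatrix}.
\end{aligned}
```

Both sides give the same gain vector, which is the relation $S_{t-1} (C_t^{\beta})^T N_t^{-1} = S_t (C_t^{\beta})^T R_t^{-1}$ for $P = S_{t-1}$, $H = C_t^{\beta}$, $R = R_t$.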

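Let us establish an analogous result for the DTA with the forgetting factor described by the system of Eqs. (4.35)–(4.41). We show that Lemma 4.9 can be used in this case. Indeed, consider the relations describing the GN method with linearization in the neighborhood of the last estimate and with soft initialization for training with the criterion Eq. (4.4):

$$x_t = x_{t-1} + K_t (y_t - h_t(x_{t-1})), \quad x_0 = (\bar\beta^T, 0_{1 \times r})^T, \quad t = 1, 2, \ldots, \tag{4.131}$$

where $h_t(x_{t-1}) = \Phi(z_t, \beta_{t-1}) \alpha_{t-1}$,

$$K_t = M_t^{-1} C_t^T = P_t C_t^T, \tag{4.132}$$

$$M_t = \lambda M_{t-1} + C_t^T C_t, \quad M_0 = \operatorname{block\,diag}(\bar P_\beta^{-1}, \bar P_\alpha^{-1}/\mu),$$

$$P_t = (P_{t-1} - P_{t-1} C_t^T (\lambda I_m + C_t P_{t-1} C_t^T)^{-1} C_t P_{t-1})/\lambda, \quad P_0 = \operatorname{block\,diag}(\bar P_\beta, \bar P_\alpha \mu), \tag{4.133}$$

$$C_t^{\beta} = C_t^{\beta}(x_{t-1}) = \partial[\Phi(z_t, \beta_{t-1}) \alpha_{t-1}]/\partial\beta_{t-1}, \quad C_t^{\alpha} = C_t^{\alpha}(x_{t-1}) = \Phi(z_t, \beta_{t-1}).$$

Transforming $K_t$ by the replacement $L_t = \lambda^{-t} M_t$, we get

$$K_t = \lambda^{-t} L_t^{-1} C_t^T, \tag{4.134}$$

where the matrices $L_t$ and $L_t^{-1}$ can be found recursively:

$$L_t = L_{t-1} + \lambda^{-t} C_t^T C_t, \quad L_0 = \operatorname{block\,diag}(\bar P_\beta^{-1}, \bar P_\alpha^{-1}/\mu)/\lambda,$$

$$L_t^{-1} = L_{t-1}^{-1} - L_{t-1}^{-1} C_t^T (\lambda^{t} I_m + C_t L_{t-1}^{-1} C_t^T)^{-1} C_t L_{t-1}^{-1}, \quad L_0^{-1} = \operatorname{block\,diag}(\bar P_\beta, \bar P_\alpha \mu) \lambda. \tag{4.135}$$

In view of Eqs. (4.134) and (4.135) and Theorem 4.3, we obtain

$$K_t^{\beta} = (S_t + V_t W_t^{+} V_t^T)(C_t^{\beta})^T \lambda^{-t} + V_t W_t^{+} (C_t^{\alpha})^T \lambda^{-t}, \tag{4.136}$$

$$K_t^{\alpha} = W_t^{+} V_t^T (C_t^{\beta})^T \lambda^{-t} + W_t^{+} (C_t^{\alpha})^T \lambda^{-t}, \tag{4.137}$$

$$S_t = S_{t-1} - S_{t-1} (C_t^{\beta})^T (\lambda^{t} I_m + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta} S_{t-1}, \quad S_0 = \bar P_\beta, \tag{4.138}$$

$$V_t = \big(I_l - S_{t-1} (C_t^{\beta})^T (\lambda^{t} I_m + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\beta}\big) V_{t-1} - S_{t-1} (C_t^{\beta})^T (\lambda^{t} I_m + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} C_t^{\alpha}, \quad V_0 = 0_{l \times r}, \tag{4.139}$$

$$W_t = W_{t-1} + (C_t^{\beta} V_{t-1} + C_t^{\alpha})^T (\lambda^{t} I_m + C_t^{\beta} S_{t-1} (C_t^{\beta})^T)^{-1} (C_t^{\beta} V_{t-1} + C_t^{\alpha}), \quad W_0 = 0_{r \times r}, \tag{4.140}$$

where $K_t = ((K_t^{\beta})^T, (K_t^{\alpha})^T)^T$.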

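Let us represent these expressions for the DTA in the following more compact form, which is convenient for further use:

$$K_t = (\tilde S_t + \tilde V_t \tilde W_t^{+} \tilde V_t^T) C_t^T \lambda^{-t} = \tilde S_t C_t^T \lambda^{-t} + K_t^{d}, \quad t = 1, 2, \ldots, \tag{4.141}$$

where

$$\tilde S_t = \tilde S_{t-1} - \tilde S_{t-1} C_t^T N_t^{-1} C_t \tilde S_{t-1}, \tag{4.142}$$

$$\tilde V_t = (I_{l+r} - \tilde S_{t-1} C_t^T N_t^{-1} C_t) \tilde V_{t-1} = A_{1t} \tilde V_{t-1}, \tag{4.143}$$

$$\tilde W_t = \tilde W_{t-1} + \tilde V_{t-1}^T C_t^T N_t^{-1} C_t \tilde V_{t-1}, \tag{4.144}$$

$$N_t = \lambda^{t} I_m + C_t \tilde S_{t-1} C_t^T. \tag{4.145}$$

Assume that $H_{t,s}$ is the matrix function determined by the system

$$H_{t,s} = (I_{l+r} - K_t C_t) H_{t-1,s} = A_t H_{t-1,s}, \quad t = s+1, s+2, \ldots, \quad H_{s,s} = I_{l+r}, \tag{4.146}$$

where

$$K_t = M_t^{-1} C_t^T = (\tilde S_t + \tilde V_t \tilde W_t^{+} \tilde V_t^T) C_t^T \lambda^{-t} = \tilde S_t C_t^T \lambda^{-t} + K_t^{d}. \tag{4.147}$$

It follows from Lemma 4.9 that

$$H_{t,s} = \Phi_{t,s} - \tilde V_t L_t^{-1} \sum_{i=s}^{t-1} \tilde V_i^T C_{i+1}^T N_{i+1}^{-1} C_{i+1} \Phi_{i,s}, \tag{4.148}$$

$$H_{t,s} K_s^{d} = \tilde V_t L_t^{-1} \tilde V_s^T C_s^T \lambda^{-s}, \tag{4.149}$$

where $\Phi_{t,s}$ is the solution of the system

$$\Phi_{t,s} = (I_{l+r} - \lambda^{-t} \tilde S_t C_t^T C_t) \Phi_{t-1,s} = \tilde A_t \Phi_{t-1,s}, \quad \Phi_{s,s} = I_{l+r}. \tag{4.150}$$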

We turn now to the analysis of the convergence of the DTA.

Theorem 4.5. Let the following conditions hold:
1. $P(\|\xi_t\| < \infty) = 1$ for $t \in T$, where $T = \{1, 2, \ldots, N\}$ is a bounded set.
2. $\Phi(z_t, \beta) \in C^{(0,2)}_{(z,\beta)}(Z \times A_\beta)$, $z_t \in Z \subset R^n$ for $t \in T$, $\beta \in A_\beta = \{\beta: \|\beta - \beta^{*}\| \le \delta_\beta\}$, where $\delta_\beta > 0$ is a number and $Z$ is an arbitrary compact set.
3. There is $tr = \min\{t: W_t > 0,\ t \in T\}$, and the rank of the matrix $W_t$ is constant for each fixed $t = 1, 2, \ldots, tr - 1$.

Then the solution $z_t = x^{*}$ of the system Eq. (4.99) is finitely stable under the action of the disturbance $\xi_t$. Moreover, for every $tf \ge tr$, $tf \in T$, there are $\rho_0 > 0$, $\chi > 0$, $\varsigma > 0$ such that for $t \in T_f = \{tf, tf + 1, \ldots, N\}$ and $\rho < \rho_0$ the following bound holds uniformly in $\alpha^{*} \in A_d$:

$$\|e_t\| = \|x_t - x^{*}\| \le \big(\chi \|e_0^{\beta}\| + \varsigma \max_{t \in \{1, 2, \ldots, tf\}} \|\xi_t\|\big)(1 + \nu(\rho))^{t-1}, \tag{4.151}$$

where $\rho = \max\{\|e_0^{\beta}\|, \|S_0\|, \max_{t \in \{1, 2, \ldots, tf\}} \|\xi_t\|\}$ and $\nu(\rho) = O(\rho)$ as $\rho \to 0$.

Proof. Consider first the DTA that takes into account the matrix $R_t$. Transform $h_t(x^{*})$ in Eq. (4.99) using the formula [70]

$$f(x + y) = f(x) + (\nabla f(x), y) + y^T \nabla^2 f(x + \theta y)\, y/2,$$

where $\nabla f(x)$ and $\nabla^2 f(x)$ are the gradient and the Hessian matrix of $f(x)$, respectively, and $0 \le \theta \le 1$. Let $f(x) = h_{it}(x)$, $x = x_{t-1}$, $y = x^{*} - x_{t-1} = -e_{t-1}$. Then

$$h_{it}(x^{*}) = h_{it}(x_{t-1}) - C_{it}(x_{t-1}) e_{t-1} + \phi_{it}(x_{t-1}, x^{*}), \quad i = 1, 2, \ldots, m, \tag{4.152}$$

where

$$C_t = C_t(x_{t-1}) = (C_{1t}^T, C_{2t}^T, \ldots, C_{mt}^T)^T, \quad \phi_{it} = \phi_{it}(x_{t-1}, x^{*}) = e_{t-1}^T \nabla^2 h_{it}(x_{t-1} - \theta e_{t-1})\, e_{t-1}/2, \quad i = 1, 2, \ldots, m.$$

Using Eq. (4.152) gives

$$e_t = (I_{l+r} - K_t C_t) e_{t-1} + K_t (\phi_t + \xi_t), \quad e_0 = x_0 - x^{*}, \quad t \in T, \tag{4.153}$$

where $\phi_t = (\phi_{1t}, \phi_{2t}, \ldots, \phi_{mt})^T$. It follows from this that

$$e_t = H_{t,0} e_0 + \sum_{s=1}^{t} H_{t,s} K_s (\phi_s + \xi_s), \tag{4.154}$$

where $H_{t,s}$ is a matrix determined by the system

$$H_{t,s} = (I_{l+r} - K_t C_t) H_{t-1,s}, \quad t > s, \quad H_{t,t} = I_{l+r}. \tag{4.155}$$
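The passage from the error recursion (4.153) to the closed form (4.154) is an instance of Lemma 4.7 and can be checked numerically. The sketch below is a scalar illustration with hypothetical data: `a[t-1]` plays the role of $I - K_t C_t$ and `g[t-1]` the role of $K_t(\phi_t + \xi_t)$; exact rational arithmetic is used so that both sides can be compared for equality.

```python
from fractions import Fraction as F


def transition(a, t, s):
    """H_{t,s} = a_t * a_{t-1} * ... * a_{s+1}, with H_{s,s} = 1 (Eq. 4.155)."""
    h = F(1)
    for k in range(s + 1, t + 1):
        h *= a[k - 1]                        # a is stored with a_1 at index 0
    return h


def iterate(a, g, e0):
    """Direct iteration of e_t = a_t * e_{t-1} + g_t (Eq. 4.153)."""
    e = e0
    for a_t, g_t in zip(a, g):
        e = a_t * e + g_t
    return e


def closed_form(a, g, e0):
    """e_t = H_{t,0} e_0 + sum_{s=1}^{t} H_{t,s} g_s (Eq. 4.154)."""
    t = len(a)
    return transition(a, t, 0) * e0 + sum(
        transition(a, t, s) * g[s - 1] for s in range(1, t + 1)
    )


a = [F(1, 2), F(-1, 3), F(2)]
g = [F(1), F(2), F(-1)]
e0 = F(3)
direct = iterate(a, g, e0)
unrolled = closed_form(a, g, e0)
```

Both computations give the same value, which is the scalar counterpart of the representation used in the rest of the proof.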

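Using condition 2 and Lemma 4.9 gives, for $t \ge tr$, $t \in T$,

$$\lim_{\rho \to 0} C_t = \lim_{\rho \to 0} C_t(x_{t-1}) = \big(\partial\Phi(z_t, \beta^{*})/\partial\beta^{*}\, \alpha^{*},\ \Phi(z_t, \beta^{*})\big), \tag{4.156}$$

$$\lim_{\rho \to 0} S_t = 0, \tag{4.157}$$

$$\begin{aligned}
\lim_{\rho \to 0} H_{t,s} &= \lim_{\rho \to 0} \Big[\Phi_{t,s} - \tilde V_t W_t^{-1} \sum_{i=s}^{t-1} \tilde V_i^T C_{i+1}^T N_{i+1}^{-1} C_{i+1} \Phi_{i,s}\Big] \\
&= \begin{pmatrix} I_l & 0_{l \times r} \\ 0_{r \times l} & I_r - (W_t^{*})^{-1} \sum_{j=s}^{t-1} \Phi^T(z_{j+1}, \beta^{*}) R_{j+1}^{-1} \Phi(z_{j+1}, \beta^{*}) \end{pmatrix},
\end{aligned} \tag{4.158}$$

$$\lim_{\rho \to 0} H_{t,s} K_s^{d} = \lim_{\rho \to 0} \tilde V_t W_t^{-1} \tilde V_s^T C_s^T R_s^{-1} = \begin{pmatrix} 0_{l \times m} \\ (W_t^{*})^{-1} \Phi^T(z_s, \beta^{*}) R_s^{-1} \end{pmatrix}, \tag{4.159}$$

$$\lim_{\rho \to 0} H_{t,s} \tilde S_s C_s^T R_s^{-1} = 0, \tag{4.160}$$

where $W_t^{*} = \sum_{j=0}^{t-1} \Phi^T(z_{j+1}, \beta^{*}) R_{j+1}^{-1} \Phi(z_{j+1}, \beta^{*})$.

It follows from Lemma 4.9 that for $t \ge tr$

$$H_{t,0} (0_{1 \times l}, (e_0^{\alpha})^T)^T = H_{t,0} \tilde V_0 e_0^{\alpha} = 0$$

and thus

$$H_{t,0} e_0 = H_{t,0} ((e_0^{\beta})^T, 0_{1 \times r})^T. \tag{4.161}$$

Therefore, for any $tf \ge tr$, $tf \in T$, there exist positive constants $\rho_0$, $\eta_1$, $\eta_2$ such that for $t \in T_f$, $\rho < \rho_0$, the following estimates are valid uniformly in $\alpha^{*} \in A_d$:

$$\|H_{t,0} e_0\| \le \eta_1 \|e_0^{\beta}\|, \quad \|H_{t,s} K_s\| \le \eta_2, \quad \|H_{t,s} K_s \xi_s\| \le \eta_2 \max_{s \in \{1, 2, \ldots, tf\}} \|\xi_s\|. \tag{4.162}$$

Using Eqs. (4.154) and (4.162) gives

$$\|e_t\| \le \eta_1 \|e_0^{\beta}\| + \eta_2 \sum_{s=1}^{t} \|\phi_s\| + \eta_2 N \max_{t \in \{1, 2, \ldots, tf\}} \|\xi_t\|, \quad t \in T_f. \tag{4.163}$$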

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks

114

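Let us now get the bounds for $\phi_{it}$, $i = 1,2,\dots,m$, $t \in \{1,2,\dots,t_f\}$. We have

$$\phi_{it} = \frac{1}{2}\left((e^{\beta}_{t-1})^T,\ (e^{\alpha}_{t-1})^T\right)\begin{pmatrix} H^1_{it} & H^2_{it}\\ (H^2_{it})^T & 0_{r\times r}\end{pmatrix}\left((e^{\beta}_{t-1})^T,\ (e^{\alpha}_{t-1})^T\right)^T = (e^{\beta}_{t-1})^T H^1_{it}e^{\beta}_{t-1}/2 + (e^{\beta}_{t-1})^T (H^2_{it})^T e^{\alpha}_{t-1}, \quad i = 1,2,\dots,m,$$

where

$$H^1_{it} = \frac{\partial^2\Phi_i(z_t,\beta_{t-1}-\theta e^{\beta}_{t-1})}{\partial\beta_{t-1}^2}(\alpha_{t-1}-\theta e^{\alpha}_{t-1}) = \frac{\partial^2\Phi_i(z_t,\beta^*-(1-\theta)e^{\beta}_{t-1})}{\partial\beta_{t-1}^2}(\alpha^*-(1-\theta)e^{\alpha}_{t-1}) \in R^{l\times l},$$

$$H^2_{it} = \frac{\partial\Phi_i(z_t,\beta_{t-1}-\theta e^{\beta}_{t-1})}{\partial\beta_{t-1}} = \frac{\partial\Phi_i(z_t,\beta^*-(1-\theta)e^{\beta}_{t-1})}{\partial\beta_{t-1}} \in R^{l\times r}.$$

Or in a more compact form

$$\phi_{it} = \zeta_{it}(x_{t-1},x^*)e^{\beta}_{t-1} + \psi_{it}(x_{t-1},x^*)e^{\alpha}_{t-1} = \left(\zeta_{it}(x_{t-1},x^*),\ \psi_{it}(x_{t-1},x^*)\right)e_{t-1}, \quad i = 1,2,\dots,m, \tag{4.164}$$

where

$$\zeta_{it}(x_{t-1},x^*) = (e^{\beta}_{t-1})^T\frac{\partial^2\Phi_i(z_t,\beta^*-(1-\theta)e^{\beta}_{t-1})}{\partial\beta_{t-1}^2}\,\alpha^*/2,$$

$$\psi_{it}(x_{t-1},x^*) = (e^{\beta}_{t-1})^T\left[\frac{\partial\Phi_i(z_t,\beta^*-(1-\theta)e^{\beta}_{t-1})}{\partial\beta_{t-1}} - \frac{\partial^2\Phi_i(z_t,\beta^*-(1-\theta)e^{\beta}_{t-1})}{\partial\beta_{t-1}^2}(1-\theta)/2\right]. \tag{4.165}$$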

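It follows from conditions 2 and 3 that the following limits exist for $t \in T$:

$$\lim_{\rho\to 0}\frac{\partial^2\Phi_i(z_t,\beta_{t-1}-\theta e^{\beta}_{t-1})}{\partial\beta_{t-1}^2}\,\alpha_{t-1} = \left.\frac{\partial^2\Phi_i(z_t,\beta)}{\partial\beta^2}\right|_{\beta=\beta^*}\alpha^*, \tag{4.166}$$

$$\lim_{\rho\to 0}\frac{\partial\Phi_i(z_t,\beta_{t-1}-\theta e^{\beta}_{t-1})}{\partial\beta_{t-1}} = \frac{\partial\Phi_i(z_t,\beta^*)}{\partial\beta^*}, \tag{4.167}$$

$$\lim_{\rho\to 0} e^{\beta}_t = 0. \tag{4.168}$$

Therefore, for any $\alpha^* \in A_d$ there exists $\rho_0 > 0$ such that uniformly in $\alpha^* \in A_d$ and $t \in \{1,2,\dots,t_f\}$

$$\|\phi_t\| \le \upsilon(\rho)\|e_{t-1}\|, \quad t \in T, \tag{4.169}$$

where $\upsilon(\rho) = O(\rho)$ as $\rho \to 0$. Thus

$$\|e_t\| \le \eta_1\|e_0^{\beta}\| + \eta_2\upsilon(\rho)\sum_{s=1}^{t}\|e_{s-1}\| + \eta_2 N\max_{t\in\{1,2,\dots,t_f\}}\|\xi_t\|, \quad t \in T_f. \tag{4.170}$$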

We use the Bellman lemma [71] to bound $\|e_t\|$: if

$$z_t \le z_0 + q\sum_{s=1}^{t-1} z_s,$$

where $z_t > 0$, $q > 0$, then

$$z_t \le z_0(1+q)^{t-1}, \quad t = 1,2,\dots,N.$$

Putting $z_t = \|e_t\|$, $z_0 = \eta_1\|e_0^{\beta}\| + \eta_2 N\max_{t\in\{1,2,\dots,t_f\}}\|\xi_t\|$, $q = \eta_2\upsilon(\rho)$, we obtain Eq. (4.151).

The proof for the DTA with the forgetting factor is carried out in the same way. In this case, instead of Lemma 4.9, the relations (4.148) and (4.149) are used to establish the existence of the limits

$$\lim_{\rho\to 0} H_{t,s} = \lim_{\rho\to 0}\left[\Phi_{t,s} - \tilde V_t L_t^{-1}\sum_{i=s}^{t-1}\tilde V_i^T C_{i+1}^T N_{i+1}^{-1} C_{i+1}\Phi_{i,s}\right] = \begin{pmatrix} I_l & 0_{l\times r}\\ 0_{r\times l} & I_r - (W_t)^{-1}\sum\limits_{j=s}^{t-1}\lambda^{-j-1}\Phi^T(z_{j+1},\beta^*)\Phi(z_{j+1},\beta^*)\end{pmatrix},$$

$$\lim_{\rho\to 0} H_{t,s}K_s^{d} = \lim_{\rho\to 0}\tilde V_t L_t^{-1}\tilde V_s^T C_s^T\lambda^{-s} = \begin{pmatrix} 0_{l\times m}\\ (W_t)^{-1}\Phi^T(z_s,\beta^*)\lambda^{-s}\end{pmatrix},$$

$$\lim_{\rho\to 0} H_{t,s}\tilde S_s C_s^T = 0,$$

and the bounds Eq. (4.162), where $W_t = \sum_{j=0}^{t-1}\lambda^{-j-1}\Phi^T(z_{j+1},\beta^*)\Phi(z_{j+1},\beta^*)$.

Consider the two-stage DTA described by the expressions (4.35) and (4.68)-(4.70). Let us show that Theorem 4.5 remains valid in this case. From Eqs. (4.97), (4.98), and (4.152) we obtain the equations

$$e^{\beta}_t = (I_l - K^{\beta}_t C^{\beta}_t)e^{\beta}_{t-1} + K^{\beta}_t(\phi_t + \xi_t), \quad e^{\beta}_0 = \beta - \beta^*,$$

$$e^{\alpha}_t = (I_r - K^{\alpha}_t C^{\alpha}_t)e^{\alpha}_{t-1} + K^{\alpha}_t(\phi_t + \xi_t), \quad e^{\alpha}_0 = -\alpha^*,$$

iterating which, we obtain

$$e^{\beta}_t = H^{\beta}_{t,0}e^{\beta}_0 + \sum_{s=1}^{t} H^{\beta}_{t,s}K^{\beta}_s(\phi_s + \xi_s),$$

$$e^{\alpha}_t = H^{\alpha}_{t,0}e^{\alpha}_0 + \sum_{s=1}^{t} H^{\alpha}_{t,s}K^{\alpha}_s(\phi_s + \xi_s),$$

where

$$H^{\beta}_{t,s} = (I_l - K^{\beta}_t C^{\beta}_t)H^{\beta}_{t-1,s}, \quad t > s, \quad H^{\beta}_{t,t} = I_l,$$

$$H^{\alpha}_{t,s} = (I_r - K^{\alpha}_t C^{\alpha}_t)H^{\alpha}_{t-1,s}, \quad t > s, \quad H^{\alpha}_{t,t} = I_r.$$

Using the expressions

$$H^{\beta}_{t,s} = \lambda^{t-s}S_t S_s^{-1}, \quad H^{\alpha}_{t,s}K^{\alpha}_t = \lambda^{t-s}W_t^{+}(C^{\alpha}_t)^T, \quad H^{\alpha}_{t,0} = lP(I_r - W_t W_t^{+})$$

and the bounds Eqs. (4.164)-(4.169), we find Eq. (4.170). This implies Eq. (4.151).
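The Bellman lemma used in this proof (a discrete Gronwall-type bound) can be checked numerically. The sketch below is only an illustration: the values of `z0`, `q` and the sequence length are arbitrary assumptions, and the function names are hypothetical. It builds the extremal sequence that satisfies $z_t \le z_0 + q\sum_{s=1}^{t-1} z_s$ with equality, plus a scaled-down sequence that satisfies it with slack, and compares both with $z_0(1+q)^{t-1}$.

```python
def bellman_bound(z0, q, t):
    """Right-hand side of the lemma: z0 * (1 + q) ** (t - 1)."""
    return z0 * (1.0 + q) ** (t - 1)


def recursive_sequence(z0, q, n, scale=1.0):
    """Return z_1..z_n with z_t = scale * (z0 + q * sum_{s<t} z_s).

    scale = 1 gives equality in the hypothesis of the lemma;
    0 < scale < 1 gives a sequence that satisfies it with slack.
    """
    z, running_sum = [], 0.0
    for _ in range(n):
        z_t = scale * (z0 + q * running_sum)
        z.append(z_t)
        running_sum += z_t
    return z


# Illustrative values (assumptions, not taken from the text).
z0, q, n = 0.5, 0.1, 20
extremal = recursive_sequence(z0, q, n)
slack = recursive_sequence(z0, q, n, scale=0.7)

# Gap between the bound and each sequence, for t = 1..n.
gap_extremal = [bellman_bound(z0, q, t) - extremal[t - 1] for t in range(1, n + 1)]
gap_slack = [bellman_bound(z0, q, t) - slack[t - 1] for t in range(1, n + 1)]
```

For the extremal sequence the gap vanishes up to rounding, which shows that the factor $(1+q)^{t-1}$ cannot be improved in general.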

4.4.2 Infinite Training Set

Let us obtain conditions for the DTA convergence as $t \to \infty$. The convergence is interpreted as stability of the solution $x_t = x^*$ of the system Eq. (4.99).

Definition 2. The solution $x_t = x^*$ of the system Eq. (4.99) is asymptotically stable if there exists $\delta(A_d) > 0$ such that the conditions $\|\beta_0 - \beta^*\| < \delta(A_d)$, $\alpha_0 = 0$ imply $\lim_{t\to\infty}\|x_t - x^*\| = 0$ for all $\alpha \in A_d = \{\alpha: \|\alpha - \alpha^*\| \le d\} \subset R^r$ and any given $d > 0$.

Definition 3. The solution $x_t = x^*$ of the system Eq. (4.99) is stable under the action of the disturbance $\xi_t$ if for any $\varepsilon > 0$ there exist $\delta_1(A_d) > 0$, $\delta_2(A_d) > 0$ such that the conditions $\|\beta_0 - \beta^*\| < \delta_1(A_d)$, $\alpha_0 = 0$, $\|\xi_t\| < \delta_2(A_d)$ imply $\|x_t - x^*\| < \varepsilon$ for $t \ge t_r$, all $\alpha \in A_d = \{\alpha: \|\alpha - \alpha^*\| \le d\} \subset R^r$ and any given $d > 0$.

Theorem 4.6. Suppose that:
1. The conditions of Theorem 4.5 are satisfied.
2. There are positive constants $k_1$, $k_2$ such that

$$\|\Phi(z_t,\beta)\| \le k_1, \quad \|\partial\Phi(z_t,\beta)/\partial\beta\| \le k_2$$

for $z_t \in Z \subset R^n$, $\beta \in A_\beta$, $t \in T$, where $A_\beta = \{\beta: \|\beta - \beta^*\| \le \delta_\beta\}$, $\delta_\beta > 0$ is a number, $Z$ is an arbitrary compact set, $T = \{1,2,\dots\}$.
3. For some integer $s$ and all $j$ there exist numbers $p_1$ and $p_2$ such that

$$0 < p_1 I_{r+l} \le \sum_{i=j}^{j+s} C_i^T C_i \le p_2 I_{r+l} < \infty \tag{4.171}$$

for any $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where $A_\alpha = \{\alpha: \|\alpha - \alpha^*\| \le \delta_\alpha\} \subset R^r$.

Then:
1. The solution $x_t = x^*$ of the system Eq. (4.99) with $K_t$ described by the expressions (4.36)-(4.41) is asymptotically stable and stable under the action of the disturbance $\xi_t$.
2. There are positive constants $\sigma$, $q_1$, $q_2$, $\tau$, $\rho_0$ such that for $t \in T$, $\rho < \rho_0$, uniformly in $\alpha \in A_d$,

$$\|e_t\|^2 \le (1-\lambda+\sigma)^t\, q_2/q_1\,\|e_{t_f}\|^2 + \tau/q_1\,\sup_{t\ge t_f}\|\xi_t\|^2/(\lambda-\sigma), \tag{4.172}$$

where $t_f \ge t_r = \min\{t: W_t > 0,\ t \in T\}$, $\rho = \max\{\|e_0^{\beta}\|, \|S_0\|, \max_{t\in\{1,2,\dots,t_f\}}\|\xi_t\|\}$, and $\|e_{t_f}\|$ satisfies Eq. (4.151).

Proof. It follows from Theorem 4.5 that for any $t_f \ge t_r$, $N > t_f$, $\delta_\beta > 0$, $\delta_\alpha > 0$ there exists $\rho_0 > 0$ such that if $\rho < \rho_0$ and $t \in T = \{t_f, t_f+1, \dots, N\}$ then $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$. Consider the Lyapunov function in the area $A_\beta \times A_\alpha$

$$V_t = e_t^T P_t^{-1} e_t,$$

where $P_t^{-1}$ is determined by the system

$$P_t^{-1} = \lambda P_{t-1}^{-1} + (C_t)^T C_t$$

for $t \ge t_f \in T$ with the initial condition $P_{t_f}^{-1} > 0$. Transforming $V_t$ with the help of the relation for the error Eq. (4.153), we obtain

$$\begin{aligned} V_t = e_t^T P_t^{-1} e_t ={}& e_{t-1}^T(I_{l+r}-K_tC_t)^T P_t^{-1}(I_{l+r}-K_tC_t)e_{t-1} + 2e_{t-1}^T(I_{l+r}-K_tC_t)^T P_t^{-1}K_t\phi_t \\ & - 2e_{t-1}^T(I_{l+r}-K_tC_t)^T P_t^{-1}K_t\xi_t + 2\xi_t^T K_t^T P_t^{-1}K_t\phi_t + \xi_t^T K_t^T P_t^{-1}K_t\xi_t + \phi_t^T K_t^T P_t^{-1}K_t\phi_t. \end{aligned} \tag{4.173}$$

Now we find bounds for the terms in Eq. (4.173). As $K_t = P_t C_t^T$ and $P_t$ satisfies the system

$$P_t = P_{t-1}/\lambda - P_{t-1}C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1}/\lambda$$

for $t \ge t_f \in T$ with the initial condition $P_{t_f} > 0$, then

$$\begin{aligned} e_{t-1}^T(I_{l+r}-K_tC_t)^T P_t^{-1}(I_{l+r}-K_tC_t)e_{t-1} &= e_{t-1}^T(I_{l+r}-P_tC_t^TC_t)^T P_t^{-1}(I_{l+r}-P_tC_t^TC_t)e_{t-1} \\ &= e_{t-1}^T(P_t^{-1}-C_t^TC_t)^T(I_{l+r}-P_tC_t^TC_t)e_{t-1} \\ &= \lambda e_{t-1}^T P_{t-1}^{-1}(I_{l+r}-P_tC_t^TC_t)e_{t-1} = \lambda^2 e_{t-1}^T P_{t-1}^{-1}P_tP_{t-1}^{-1}e_{t-1} \\ &= \lambda e_{t-1}^T\left[P_{t-1}^{-1} - C_t^T(\lambda I_m + C_tP_{t-1}C_t^T)^{-1}C_t\right]e_{t-1} \le \lambda e_{t-1}^T P_{t-1}^{-1}e_{t-1}. \end{aligned} \tag{4.174}$$

From Eqs. (4.164) to (4.168) it follows that for any $\nu > 0$ there exists $\delta > 0$ such that $\|\phi_t\| \le \nu\|e_{t-1}\|$ as long as $\|e_{t-1}\| \le \delta$, and from Ref. [72] the existence of $q_1 > 0$ and $q_2 > 0$ such that

$$q_1 I_{r+l} \le P_t^{-1} \le q_2 I_{r+l}.$$

Therefore, there are $\eta_1 > 0$, $\eta_2 > 0$, $\eta_3 > 0$ such that

$$\phi_t^T K_t^T P_t^{-1}K_t\phi_t \le \eta_1\nu^2 e_{t-1}^T e_{t-1}, \tag{4.175}$$

$$\xi_t^T K_t^T P_t^{-1}K_t\xi_t \le \eta_2\xi_t^T\xi_t, \tag{4.176}$$

$$2e_{t-1}^T(I_{l+r}-K_tC_t)^T P_t^{-1}K_t\phi_t \le \eta_3\nu e_{t-1}^T e_{t-1}. \tag{4.177}$$

Using Young's inequality [73]

$$2x^Ty \le x^TXx + y^TX^{-1}y$$

for $x = \xi_t$, $y = P_t^{-1}(I_{l+r}-K_tC_t)e_{t-1}$, $X = I_{l+r}/\vartheta$, we obtain

$$2\xi_t^T P_t^{-1}(I_{l+r}-K_tC_t)e_{t-1} \le \xi_t^T\xi_t/\vartheta + \vartheta e_{t-1}^T P_t^{-2}(I_{l+r}-K_tC_t)^2 e_{t-1} \le \xi_t^T\xi_t/\vartheta + \eta_4\vartheta e_{t-1}^T e_{t-1}, \quad \eta_4 > 0,$$

where $\vartheta > 0$ is a small number. Putting in Young's inequality $x = \xi_t$, $y = K_t^T P_t^{-1}K_t\phi_t$, $X = I_{l+r}$ and using Eq. (4.175), we obtain

$$2\xi_t^T K_t^T P_t^{-1}K_t\phi_t \le \xi_t^T\xi_t + \phi_t^T(K_t^T P_t^{-1}K_t)^2\phi_t \le \xi_t^T\xi_t + \eta_5\nu e_{t-1}^T e_{t-1}, \quad \eta_5 > 0. \tag{4.178}$$

Using these bounds gives

$$V_t \le \lambda e_{t-1}^T P_{t-1}^{-1}e_{t-1} + (\eta_1\nu^2 + \eta_3\nu + \eta_4\vartheta + \eta_5\nu)e_{t-1}^T e_{t-1} + (1 + \eta_2 + 1/\vartheta)\xi_t^T\xi_t \le (\lambda-\sigma)V_{t-1} + \eta_6\sup_{t\ge t_f}\|\xi_t\|^2$$

as long as $\|e_t\| \le \delta$ and $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where $\sigma = (\eta_1\nu^2 + \eta_3\nu + \eta_4\vartheta + \eta_5\nu)/q_1$, $\eta_6 = 1 + \eta_2 + 1/\vartheta$. Choosing $\vartheta$, $\nu$ so that $\lambda > \sigma$, we obtain

$$q_1\|e_t\|^2 \le V_t \le (1-\lambda+\sigma)^t V_{t_f} + \eta_6\sup_{t\ge t_f}\|\xi_t\|^2\sum_{i=0}^{\infty}(1-\lambda+\sigma)^i \le (1-\lambda+\sigma)^t q_2\|e_{t_f}\|^2 + \eta_6\sup_{t\ge t_f}\|\xi_t\|^2/(\lambda-\sigma).$$

It follows from this that

$$\|e_t\|^2 \le (1-\lambda+\sigma)^t\, q_2/q_1\,\|e_{t_f}\|^2 + \eta_6/q_1\,\sup_{t\ge t_f}\|\xi_t\|^2/(\lambda-\sigma). \tag{4.179}$$

We choose $\|e_{t_f}\|$ and restrictions on $\sup_{t\ge t_f}\|\xi_t\|^2$ to satisfy the condition

$$(1-\lambda+\sigma)^{t_f}\, q_2/q_1\,\|e_{t_f}\|^2 + \eta_6/q_1\,\sup_{t\ge t_f}\|\xi_t\|^2/(\lambda-\sigma) \le \delta^2.$$

From Theorem 4.5 it follows that this is always possible if $\rho$ is sufficiently small. With this choice the inequality Eq. (4.179) holds for all $t > t_f$. From Eq. (4.179) follows the stability of the solution $x_t = x^*$ of the system Eq. (4.99) under the action of the disturbance $\xi_t$ and, if $\xi_t = 0$, its asymptotic stability.

Theorem 4.7. Suppose that:
1. The conditions of Theorem 4.5 are satisfied.
2. There are positive constants $k_1$, $k_2$ such that

$$\|\Phi(z_t,\beta)\| \le k_1, \quad \|\partial\Phi(z_t,\beta)/\partial\beta\| \le k_2$$

for $z_t \in Z \subset R^n$, $\beta \in A_\beta$, $t \in T$, where $A_\beta = \{\beta: \|\beta - \beta^*\| \le \delta_\beta\}$, $\delta_\beta > 0$ is a number, $Z$ is a compact set, $T = \{1,2,\dots\}$.
3. $\lim_{t\to\infty}\sum_{i=t_f+1}^{t} C_i^T R_i^{-1} C_i/t = P_\infty^{-1}$ for $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where $A_\alpha = \{\alpha: \|\alpha - \alpha^*\| \le \delta_\alpha\}$, $\delta_\alpha > 0$ is a number, $t_f \ge t_r$.
4. There is $k_4 > 0$ such that $R_t \ge k_4 I_m$ for $t \in T$.

Then the solution $x_t = x^*$ of the system Eq. (4.99) with the matrix $K_t$ described by the expressions (4.59)-(4.64) is asymptotically stable and stable under the action of the disturbance $\xi_t$.

Proof. Let us first show that

$$P_t = \tilde S_t + \tilde V_t W_t^{-1}\tilde V_t^T > 0 \tag{4.180}$$

for $t \ge t_r \in T$, where

$$\tilde S_t = \begin{pmatrix} S_t & 0_{l\times r}\\ 0_{r\times l} & 0_{r\times r}\end{pmatrix}, \quad \tilde V_t = \begin{pmatrix} V_t\\ I_r\end{pmatrix}.$$

We represent Eq. (4.180) in the equivalent form

$$P_t = \begin{pmatrix} S_t + V_tW_t^{-1}V_t^T & V_tW_t^{-1}\\ W_t^{-1}V_t^T & W_t^{-1}\end{pmatrix} > 0.$$

As $S_t + V_tW_t^{-1}V_t^T > 0$ then [73]

$$|P_t| = |S_t + V_tW_t^{-1}V_t^T| \times |W_t^{-1} - W_t^{-1}V_t^T(S_t + V_tW_t^{-1}V_t^T)^{-1}V_tW_t^{-1}|.$$

Transforming the second factor in this expression using the identity Eq. (2.7) for $B = W_t^{-1}$, $C = V_t^T$, $D = S_t$, we obtain

$$|P_t| = |S_t + V_tW_t^{-1}V_t^T| \times |(W_t + V_t^TS_t^{-1}V_t)^{-1}|.$$

But since $|A^{-1}| = 1/|A|$ for any nonsingular square matrix $A$, then

$$|P_t| = |S_t + V_tW_t^{-1}V_t^T|/|W_t + V_t^TS_t^{-1}V_t| > 0.$$

From this it follows that $P_t$ satisfies the system

$$P_t = P_{t-1} - P_{t-1}C_t^T(R_t + C_tP_{t-1}C_t^T)^{-1}C_tP_{t-1} \tag{4.181}$$

for $t \ge t_f \in T$ with the initial condition $P_{t_r} > 0$. We have for the estimation error

$$e_t = H_{t,t_f}e_{t_f} + \sum_{s=t_f+1}^{t} H_{t,s}K_s(\phi_s + \xi_s),$$

where $H_{t,s}$ is the solution of the system Eq. (4.155), $\phi_t$ is described by Eq. (4.152), $t_f > t_r$. Since $H_{t,s} = P_tP_s^{-1}$, $K_t = P_tC_t^TR_t^{-1}$ for $t \ge s \ge t_r$, then

$$e_t = P_tP_{t_f}^{-1}e_{t_f} + P_t\sum_{s=t_f+1}^{t} C_s^TR_s^{-1}(\phi_s + \xi_s). \tag{4.182}$$

It follows from Theorem 4.5 that for any $t_f \ge t_r$, $N > t_f$, $\delta_\alpha > 0$ there exists $\rho_0 > 0$ such that if $\rho < \rho_0$ and $t \in T = \{t_f, t_f+1, \dots, N\}$ then $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where

$$\rho = \max\{\|e_0^{\beta}\|, \|S_0\|, \sup_{t\in T}\|\xi_t\|\}.$$

From Eqs. (4.164) to (4.168) it follows that for any $\nu > 0$ there exists $\delta > 0$ such that $\|\phi_t\| \le \nu\|e_t\|$ as long as $\|e_t\| \le \delta$ and at the same time

$$\|e_t\| \le \|P_t\|\,\|P_{t_f}^{-1}\|\,\|e_{t_f}\| + \eta_1\|P_t\|\left(\nu\sum_{s=t_f+1}^{t}\|e_{s-1}\| + \sup_{t\in T}\|\xi_t\|\right) \tag{4.183}$$

for $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where $\eta_1 = \sup_{t\in T}\|C_t^TR_t^{-1}\|$. Using Eq. (4.181) and condition 3 gives, for $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$,

$$\lim_{t\to\infty} P_t^{-1}/t = \lim_{t\to\infty}\left(P_{t_r}^{-1} + \sum_{i=t_r+1}^{t} C_i^TR_i^{-1}C_i\right)/t = P_\infty^{-1}. \tag{4.184}$$

Thus for sufficiently large $t_f \ge t_r$ there exists $\eta_2 > 0$ such that

$$\|P_t\| = \|P_\infty + \Delta P_t\|/t \le \eta_2/t \tag{4.185}$$

for any $(\beta_t,\alpha_t) \in A_\beta \times A_\alpha$, where $\Delta P_t \to 0$ as $t \to \infty$. Substituting Eq. (4.185) in Eq. (4.183) yields

$$\|e_t\| \le \eta_2\|P_{t_f}^{-1}\|\,\|e_{t_f}\|/t + \eta_1\eta_2\left(\nu/t\sum_{s=t_f+1}^{t}\|e_{s-1}\| + \sup_{t\ge t_f+1}\|\xi_t\|\right) = \eta_3/t + \eta_4/t\sum_{s=t_f+1}^{t}\|e_{s-1}\| + \eta_5, \tag{4.186}$$

where $\eta_3 = \eta_2\|P_{t_f}^{-1}\|\,\|e_{t_f}\|$, $\eta_4 = \eta_1\eta_2\nu$, $\eta_5 = \eta_1\eta_2\sup_{t\in T}\|\xi_t\|$. From the comparison principle [74] it follows that $\|e_t\| \le u_t$, where $u_t$ is the solution of the difference equation

$$u_t = \eta_3/t + \eta_4/t\sum_{s=t_f+1}^{t} u_{s-1} + \eta_5 \tag{4.187}$$

with the initial condition $u_{t_f} = \eta_3/t_f$. We have

$$u_t = \eta_3/t + \eta_4/t\sum_{s=t_f+1}^{t-1} u_{s-1} + \eta_4u_{t-1}/t + \eta_5, \tag{4.188}$$

$$u_{t-1} = \eta_3/(t-1) + \eta_4/(t-1)\sum_{s=t_f+1}^{t-1} u_{s-1} + \eta_5. \tag{4.189}$$

From Eq. (4.189) we find

$$\sum_{s=t_f+1}^{t-1} u_{s-1} = (t-1)/\eta_4\,\left(u_{t-1} - \eta_3/(t-1) - \eta_5\right).$$

Substituting this expression in Eq. (4.188) gives

$$u_t = \eta_3/t + (t-1)/t\,\left(u_{t-1} - \eta_3/(t-1) - \eta_5\right) + \eta_4u_{t-1}/t + \eta_5 = [1 - (1-\eta_4)/t]\,u_{t-1} + \eta_5/t. \tag{4.190}$$

It follows from this that

$$u_t = h_{t,t_f}\eta_3/t_f + \eta_5\sum_{s=t_f+1}^{t} h_{t,s}/s, \tag{4.191}$$

where $h_{t,s}$ is the transition function determined by the equation

$$h_{t,s} = [1 - (1-\eta_4)/t]\,h_{t-1,s}, \quad h_{s,s} = 1.$$

We have

$$h_{t,s} = \prod_{j=s+1}^{t}[1 - (1-\eta_4)/j] \le (1+q_s)(s/t)^{1-\eta_4} \tag{4.192}$$

for $\eta_4 < 1$ [75], where $q_s \to 0$ as $s \to \infty$. Substituting Eq. (4.192) in Eq. (4.191) yields

$$u_t \le \eta_3(1+q_{t_f})\,t_f^{-\eta_4}(1/t)^{1-\eta_4} + \eta_5\sum_{s=t_f+1}^{t}(1+q_s)(s/t)^{1-\eta_4}/s.$$

For the second term in this expression we obtain

$$\eta_5\sum_{s=t_f+1}^{t}(1+q_s)(s/t)^{1-\eta_4}/s \le \eta_5\sup_{s\ge t_f+1}(1+q_s)\,\frac{\sum_{s=t_f+1}^{t}s^{-\eta_4}}{t^{1-\eta_4}} \to \eta_5/(1-\eta_4)\,\sup_{s\ge t_f+1}(1+q_s), \quad t \to \infty.$$

Therefore

$$\|e_t\| \le \eta_3(1+q_{t_f})\,t_f^{-\eta_4}(1/t)^{1-\eta_4} + \eta_5/(1-\eta_4)\,\sup_{t\ge t_f+1}(1+q_t)$$

as long as $\|e_t\| \le \delta$.
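The behavior of the comparison sequence can be illustrated by iterating the scalar recursion Eq. (4.190) directly. The sketch below is a numerical illustration only; the values chosen for $\eta_3$, $\eta_4$, $\eta_5$, and $t_f$ are arbitrary assumptions with $\eta_4 < 1$, and the function name is hypothetical. The recursion $u_t = [1 - (1-\eta_4)/t]u_{t-1} + \eta_5/t$ has the fixed point $\eta_5/(1-\eta_4)$, and the deviation from it decays roughly like $(t_f/t)^{1-\eta_4}$.

```python
def comparison_sequence(eta3, eta4, eta5, t_f, t_end):
    """Iterate u_t = [1 - (1 - eta4)/t] * u_{t-1} + eta5 / t from u_{t_f} = eta3 / t_f."""
    u = eta3 / t_f
    for t in range(t_f + 1, t_end + 1):
        u = (1.0 - (1.0 - eta4) / t) * u + eta5 / t
    return u


# Illustrative constants (assumptions, not taken from the text).
eta3, eta4, eta5, t_f = 2.0, 0.3, 0.05, 10

limit = eta5 / (1.0 - eta4)                                          # fixed point of the recursion
u_far = comparison_sequence(eta3, eta4, eta5, t_f, 200000)           # with disturbance
u_far_no_noise = comparison_sequence(eta3, eta4, 0.0, t_f, 200000)   # eta5 = 0

print(limit, u_far, u_far_no_noise)
```

With $\eta_5 = 0$ (no disturbance) the sequence tends to zero, which corresponds to the asymptotic stability claim; with $\eta_5 > 0$ it settles near a level proportional to the disturbance bound.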

Thus, since

$$P_{t_f}^{-1} = \mathrm{block\ diag}(S_0^{-1}, 0_{r\times r}) + \sum_{i=1}^{t_f} C_i^TR_i^{-1}C_i,$$

the use of Eq. (4.151) gives

$$\begin{aligned} \|e_t\| &\le \eta_6\|P_{t_f}^{-1}\|\,\|e_{t_f}\|\,t_f^{-\eta_4}(1/t)^{1-\eta_4} + \eta_7\sup_{t\ge t_f+1}\|\xi_t\| \\ &\le \eta_6\,t_f^{-\eta_4}\left(\|S_0^{-1}\| + \Big\|\sum_{i=1}^{t_f} C_i^TR_i^{-1}C_i\Big\|\right)\vartheta(e_0^{\beta},\xi_t)\,(1+\tilde\nu\rho)^{t_f-1}(1/t)^{1-\eta_4} + \eta_7\sup_{t\ge t_f+1}\|\xi_t\|, \end{aligned} \tag{4.193}$$

where

$$\vartheta(e_0^{\beta},\xi_t) = \chi\|e_0^{\beta}\| + \varsigma\max_{t=1,2,\dots,t_f}\|\xi_t\|, \quad \upsilon(\rho) \le \tilde\nu\rho, \quad \tilde\nu > 0, \quad \rho \to 0,$$

$$\eta_6 = \eta_3(1+q_{t_f}), \quad \eta_7 = \eta_5/(1-\eta_4)\,\sup_{t\ge t_f+1}(1+q_t).$$

We assume without loss of generality that $\rho = \|S_0\|$, $t_f = \|S_0^{-1}\|^{1/\eta_4}$. Then from Eq. (4.193) it follows that there exist $\eta_8 > 0$ and $\eta_9 > 0$ such that if $\|e_0^{\beta}\| < \eta_8$, $\sup_{t\in T}\|\xi_t\| < \eta_9$ then $\|e_t\| \le \delta$ for all $t \in T$. Therefore, the solution $x_t = x^*$ of Eq. (4.99) is stable under the action of the disturbance $\xi_t$ and, if $\xi_t = 0$, it is asymptotically stable.

4.5 ITERATIVE VERSIONS OF DIFFUSE TRAINING ALGORITHMS

For the diffuse training algorithms we present their iterative modifications oriented towards batch processing, namely, for the DTA from Theorem 4.1:

$$x_t^i = x_{t-1}^i + K_t^i(y_t - h_t(x_{t-1}^i)), \quad t = 1,2,\dots,N, \quad i = 1,2,\dots,M, \tag{4.194}$$

where $K_t^i = ((K_t^{\beta i})^T, (K_t^{\alpha i})^T)^T$, $h_t(x_{t-1}) = \Phi(z_t,\beta_{t-1})\alpha_{t-1}$,

$$K_t^{\beta i} = (S_t^i + V_t^i(W_t^i)^{+}(V_t^i)^T)(C_t^{\beta})^T + V_t^i(W_t^i)^{+}(C_t^{\alpha})^T, \tag{4.195}$$

$$K_t^{\alpha i} = (W_t^i)^{+}(V_t^i)^T(C_t^{\beta})^T + (W_t^i)^{+}(C_t^{\alpha})^T, \tag{4.196}$$

$$S_t^i = S_{t-1}^i/\lambda - S_{t-1}^i(C_t^{\beta})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}C_t^{\beta}S_{t-1}^i/\lambda, \tag{4.197}$$

$$V_t^i = (I_l - S_{t-1}^i(C_t^{\beta})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}C_t^{\beta})V_{t-1}^i - S_{t-1}^i(C_t^{\beta})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}C_t^{\alpha}, \tag{4.198}$$

$$W_t^i = \lambda W_{t-1}^i + \lambda(C_t^{\beta}V_{t-1}^i + C_t^{\alpha})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}(C_t^{\beta}V_{t-1}^i + C_t^{\alpha}), \tag{4.199}$$

$$C_t^{\beta} = C_t^{\beta}(x_{t-1}^i) = \partial[\Phi(z_t,\beta_{t-1}^i)\alpha_{t-1}^i]/\partial\beta_{t-1}^i, \quad C_t^{\alpha} = C_t^{\alpha}(x_{t-1}^i) = \Phi(z_t,\beta_{t-1}^i), \tag{4.200}$$

with the initialization

$$x_0^1 = (\beta^T, 0_{1\times r})^T, \quad S_0^1 = P_\beta, \quad V_0^1 = 0_{l\times r}, \quad W_0^1 = 0_{r\times r}, \tag{4.201}$$

$$x_0^i = x_N^{i-1}, \quad S_0^i = S_N^{i-1}, \quad V_0^i = V_N^{i-1}, \quad W_0^i = W_N^{i-1}, \quad i = 2,3,\dots,M, \tag{4.202}$$

where $i$ is the iteration number and $M$ is the number of iterations; for the DTA Eqs. (4.68)-(4.73):

$$x_t^i = x_{t-1}^i + K_t^i(y_t - h_t(x_{t-1}^i)), \quad t = 1,2,\dots,N, \quad i = 1,2,\dots,M, \tag{4.203}$$

where $K_t^i = ((K_t^{\beta i})^T, (K_t^{\alpha i})^T)^T$, $h_t(x_{t-1}) = \Phi(z_t,\beta_{t-1})\alpha_{t-1}$,

$$K_t^{\beta i} = S_t^i(C_t^{\beta})^T, \quad K_t^{\alpha i} = (W_t^i)^{+}(C_t^{\alpha})^T, \tag{4.204}$$

$$S_t^i = S_{t-1}^i/\lambda - S_{t-1}^i(C_t^{\beta})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}C_t^{\beta}S_{t-1}^i/\lambda, \tag{4.205}$$

$$W_t^i = \lambda W_{t-1}^i + \lambda(C_t^{\alpha})^T(\lambda I_m + C_t^{\beta}S_{t-1}^i(C_t^{\beta})^T)^{-1}C_t^{\alpha} \tag{4.206}$$

with the initialization

$$x_0^1 = (\beta^T, 0_{1\times r})^T, \quad S_0^1 = P_\beta, \quad W_0^1 = 0_{r\times r}, \tag{4.207}$$

$$x_0^i = x_N^{i-1}, \quad S_0^i = S_N^{i-1}, \quad W_0^i = W_N^{i-1}, \quad i = 2,3,\dots,M. \tag{4.208}$$
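The recursion for $S_t^i$ in Eqs. (4.197) and (4.205) has the usual structure of a recursive least squares update with a forgetting factor: by the matrix inversion lemma it is equivalent to the information-form update $S_t^{-1} = \lambda S_{t-1}^{-1} + (C_t^{\beta})^T C_t^{\beta}$. The sketch below checks this numerically for a scalar output ($m = 1$), so that the inner inverse is a scalar division. The matrix, the row vector `c`, the value of `lam`, and all function names are illustrative assumptions.

```python
def covariance_update(S, c, lam):
    """S_t = S/lam - S c^T (lam + c S c^T)^(-1) c S / lam for a symmetric S and one row c."""
    n = len(c)
    Sc = [sum(S[i][j] * c[j] for j in range(n)) for i in range(n)]   # S c^T
    denom = lam + sum(c[i] * Sc[i] for i in range(n))                # lam + c S c^T (scalar)
    return [[(S[i][j] - Sc[i] * Sc[j] / denom) / lam for j in range(n)] for i in range(n)]


def information_update(S_inv, c, lam):
    """Information form: S_t^(-1) = lam * S_{t-1}^(-1) + c^T c."""
    n = len(c)
    return [[lam * S_inv[i][j] + c[i] * c[j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix product of two square lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


# Illustrative data (assumptions): a diagonal S_{t-1}, so its inverse is known exactly.
S_prev = [[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
S_prev_inv = [[0.5, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
c = [1.0, -2.0, 0.5]
lam = 0.98

S_new = covariance_update(S_prev, c, lam)
# If the two forms agree, this product is the identity matrix.
product = matmul(S_new, information_update(S_prev_inv, c, lam))
```

The covariance form avoids inverting an $l \times l$ matrix at every step, which is presumably why the algorithms are written in that form.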

Standard recommendations on the choice of $\lambda$ are as follows. At the initial stage of training, the value of $\lambda$ can be sufficiently small to provide a high convergence rate. It is then gradually increased to 1 to achieve the necessary accuracy of the obtained solution. One of the variants of choosing $\lambda$ which provides convergence is proposed in Ref. [76], namely,

$$0 \le 1 - (\lambda_i)^N \le c/i, \quad c > 0, \quad i = 1,2,\dots. \tag{4.209}$$

Based on this result, we give the convergence conditions for the iterative algorithm proposed in this section. Assume that (1) for some $L > 0$ the Lipschitz condition

$$\|\nabla g_t(x)g_t(x) - \nabla g_t(y)g_t(y)\| \le L\|x - y\|, \quad \forall x, y \in R^{l+r}, \quad t = 1,\dots,N,$$

is satisfied, where $g_t(x) = y_t - h_t(x)$ and $\nabla g_t(x)$ is the $(l+r)\times m$ matrix of gradients of $g_t(x)$; (2) the sequence of vectors $x_t^i$, $t = 1,\dots,N$, $i = 1,2,\dots$ is bounded; (3) there is a vector $x_t^i$ such that $W_t = W_t(x_t^i) > 0$.

Theorem 4.8. Under condition Eq. (4.209) and conditions (1)-(3), the sequence

$$J_t(x_N^i) = \sum_{k=1}^{N}(y_k - h_k(x_N^i))^T(y_k - h_k(x_N^i))$$

converges and each limit point of the sequence $\{x_N^i\}$ is a stationary point of $J_t(x)$.

The statement follows from Ref. [76, Proposition 2] if

$$P_t = P_t(x_t^{i-1}) = \begin{pmatrix} S_t^i + V_t^i(W_t^i)^{-1}(V_t^i)^T & V_t^i(W_t^i)^{-1}\\ (W_t^i)^{-1}(V_t^i)^T & (W_t^i)^{-1}\end{pmatrix} > 0$$

and from the representation

$$|P_t| = |S_t^i + V_t^i(W_t^i)^{-1}(V_t^i)^T|/|W_t^i + (V_t^i)^T(S_t^i)^{-1}V_t^i|,$$

which was established in the proof of Theorem 4.7.
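As an illustration of condition Eq. (4.209), the sketch below checks it for the schedule $\lambda_i = \max\{1 - 0.05/i,\ 0.99\}$, which is the forgetting parameter used later in Example 4.2. The training set size `N = 250` and the constant `c = 0.05 * N` are assumptions made for this check only, and the function names are hypothetical. For large $i$ the bound follows from the Bernoulli inequality $(1-x)^N \ge 1 - Nx$.

```python
def forgetting_factor(i, decay=0.05, floor=0.99):
    """Schedule lambda_i = max(1 - decay / i, floor); it tends to 1 as i grows."""
    return max(1.0 - decay / i, floor)


def satisfies_condition(lam, i, n_samples, c):
    """Check 0 <= 1 - lam ** N <= c / i for one iteration index i."""
    gap = 1.0 - lam ** n_samples
    return 0.0 <= gap <= c / i


N = 250          # assumed number of training samples
c = 0.05 * N     # assumed constant in the condition

checks = [satisfies_condition(forgetting_factor(i), i, N, c) for i in range(1, 2001)]
print(all(checks))
```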

4.6 DIFFUSE TRAINING ALGORITHM OF RECURRENT NEURAL NETWORK

Consider the recurrent neural network which is described by the state-space model

$$p_t = \begin{pmatrix}\sigma(a_1^Tp_{t-1} + b_1^Tz_t + d_1)\\ \dots\\ \sigma(a_q^Tp_{t-1} + b_q^Tz_t + d_q)\end{pmatrix} = \sigma(Ap_{t-1} + Bz_t + d), \tag{4.210}$$

$$y_t = \begin{pmatrix} c_1^Tp_t\\ \dots\\ c_m^Tp_t\end{pmatrix} + \xi_t = Cp_t + \xi_t, \quad t = 1,2,\dots,N, \tag{4.211}$$

where $p_t \in R^q$ is a state vector, $z_t \in R^n$ is a vector of inputs, $y_t \in R^m$ is a vector of outputs, $a_i \in R^q$, $b_i \in R^n$, $d_i \in R^1$, $c_i \in R^q$, $i = 1,2,\dots,q$ are vectors of unknown parameters, $\sigma(x)$ is the sigmoid function, $A = (a_1, a_2, \dots, a_q) \in R^{q\times q}$, $B = (b_1, b_2, \dots, b_n) \in R^{q\times n}$, $d = (d_1, d_2, \dots, d_q) \in R^q$, and $\xi_t \in R^m$ is a random process which has uncorrelated values, zero expectation and a covariance matrix $R_t$.

Let us turn to a system with a parameterization that is more convenient for us and includes Eqs. (4.210) and (4.211) as a special case:

$$p_t = \Psi(z_t, p_{t-1}, \beta), \tag{4.212}$$

$$y_t = \begin{pmatrix} p_t^T & 0 & \dots & 0\\ \dots & & & \dots\\ 0 & 0 & \dots & p_t^T\end{pmatrix}\alpha + \xi_t = \Phi(p_t)\alpha + \xi_t, \quad t = 1,2,\dots,N, \tag{4.213}$$

where

$$\alpha = (\alpha_1, \alpha_2, \dots, \alpha_{qm})^T = (c_1^T, c_2^T, \dots, c_m^T) \in R^{qm}, \quad \beta = (\beta_1, \beta_2, \dots, \beta_{qm+q})^T = (a_1^T, a_2^T, \dots, a_q^T, b_1^T, b_2^T, \dots, b_q^T, d^T) \in R^{qm+q}$$

are vectors of unknown parameters including elements of $A$, $B$, $C$, $d$, $\Psi(z_t, p_{t-1}, \beta) \in R^q$ is a vector of given nonlinear functions, $\Phi(p_t) \in R^k$, $k = q^2 + mq + q$.

We suppose that the unknown parameters in Eqs. (4.212) and (4.213) satisfy conditions A1, A2, and A3 and the quality criteria at a moment $t$ are defined by Eq. (4.4). From Eqs. (4.212) and (4.213) follow the relations for the matrix functions $C_t^{\alpha} \in R^{m\times qm}$, $C_t^{\beta} \in R^{m\times pm}$ included in the description of the DTA:

$$C_t^{\alpha} = \begin{pmatrix} p_t^T & 0 & \dots & 0\\ \dots & & & \dots\\ 0 & 0 & \dots & p_t^T\end{pmatrix}, \tag{4.214}$$

$$C_t^{\beta} = \frac{\partial\Phi(p_t)}{\partial\beta_{t-1}}\alpha_{t-1} = \frac{\partial\Phi(p_t)}{\partial p_t}\frac{\partial p_t}{\partial\beta_{t-1}}\alpha_{t-1}, \tag{4.215}$$

where

$$\frac{\partial p_t}{\partial\beta_{t-1}} = \frac{\partial\Psi(z_t, p_{t-1}, \beta_{t-1})}{\partial p_{t-1}}\frac{\partial p_{t-1}}{\partial\beta_{t-1}} + \frac{\partial\Psi(z_t, p_{t-1}, \beta_{t-1})}{\partial\beta_{t-1}}, \quad \frac{\partial p_0}{\partial\beta_0} = 0. \tag{4.216}$$

After completion of the training phase the system Eq. (4.210) can be used to predict the behavior of the output $y_t$. At the same time, the prediction should begin immediately after the completion of training, that is, from the moment $t = N + 1$ with the initial condition $p_N$. In Chapter 5, Diffuse Kalman Filter, a more efficient training algorithm free from this disadvantage and based on the diffuse Kalman filter will be proposed.
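The state recursion Eq. (4.210), the output Eq. (4.211) without noise, and the sensitivity recursion Eq. (4.216) can be sketched in a few lines. To keep the example short, the sensitivity is computed only with respect to the bias vector $d$ (a sub-block of $\beta$), for which $\partial\Psi/\partial p_{t-1} = D_tA$ and $\partial\Psi/\partial d = D_t$ with $D_t = \mathrm{diag}(p_t(1-p_t))$. All numerical values, the restriction to $d$, and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(A, B, d, p_prev, z):
    """One step of p_t = sigma(A p_{t-1} + B z_t + d)."""
    q = len(d)
    return [sigmoid(sum(A[i][j] * p_prev[j] for j in range(q))
                    + sum(B[i][j] * z[j] for j in range(len(z))) + d[i])
            for i in range(q)]


def simulate(A, B, d, C, inputs, p0):
    """Noise-free outputs y_t = C p_t and the final state."""
    p, outputs = list(p0), []
    for z in inputs:
        p = rnn_step(A, B, d, p, z)
        outputs.append([sum(row[j] * p[j] for j in range(len(p))) for row in C])
    return outputs, p


def state_sensitivity_to_bias(A, B, d, inputs, p0):
    """Recursion G_t = D_t (A G_{t-1} + I), G_0 = 0, where G_t = dp_t/dd."""
    q = len(d)
    p, G = list(p0), [[0.0] * q for _ in range(q)]
    for z in inputs:
        p = rnn_step(A, B, d, p, z)
        AG = [[sum(A[i][k] * G[k][j] for k in range(q)) for j in range(q)] for i in range(q)]
        G = [[p[i] * (1.0 - p[i]) * (AG[i][j] + (1.0 if i == j else 0.0))
              for j in range(q)] for i in range(q)]
    return p, G


# Illustrative network with q = 2 states, n = 1 input, m = 1 output (assumed values).
A = [[0.5, -0.3], [0.2, 0.4]]
B = [[1.0], [-0.5]]
d = [0.1, -0.2]
C = [[1.0, 2.0]]
inputs = [[math.sin(0.3 * t)] for t in range(1, 11)]
p0 = [0.0, 0.0]

outputs, p_final = simulate(A, B, d, C, inputs, p0)
p_check, G_final = state_sensitivity_to_bias(A, B, d, inputs, p0)
```

The sensitivity depends on the whole state history through $G_{t-1}$, which is why the recursion has to be run alongside the state equation during training.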

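4.7 ANALYSIS OF TRAINING ALGORITHMS WITH SMALL NOISE MEASUREMENTS

Let us consider the behavior of the proposed training algorithms under a small measurement noise whose intensity matrix $R_t$ enters the algorithms through the inverse matrix $N_t^{-1} = (C_t^{\beta}S_{t-1}(C_t^{\beta})^T + R_t)^{-1}$. Taking this into account, it is important to understand their behavior as $R_t \to 0$. We first consider the DTA described by expressions (4.58)-(4.64).

Theorem 4.9. For a given $N$, $R_t = \varepsilon I_m$, $\varepsilon \to 0$, $\varepsilon > 0$, and $S_0 > 0$

$$\tilde K_t = \lim_{\varepsilon\to 0} K_t^{dif} = \begin{pmatrix} Q_t^{+}(C_t^{\beta})^T\\ 0_{r\times l}\end{pmatrix} + \begin{pmatrix} G_t\\ I_r\end{pmatrix} L_t^{+}\begin{pmatrix} G_t\\ I_r\end{pmatrix}^T\begin{pmatrix}(C_t^{\beta})^T\\ (C_t^{\alpha})^T\end{pmatrix}, \tag{4.217}$$

where

$$Q_t = Q_{t-1} + (C_t^{\beta})^TC_t^{\beta}, \quad Q_0 = 0_{l\times l}, \tag{4.218}$$

$$G_t = (I_l - Q_t^{+}(C_t^{\beta})^TC_t^{\beta})G_{t-1} - Q_t^{+}(C_t^{\beta})^TC_t^{\alpha}, \quad G_0 = 0_{l\times r}, \tag{4.219}$$

$$L_t = L_{t-1} + (C_t^{\beta}G_{t-1} + C_t^{\alpha})^T\bar H_t(C_t^{\beta}G_{t-1} + C_t^{\alpha}), \quad L_0 = 0_{r\times r}, \tag{4.220}$$

$$\bar H_t = (I_l - \tilde H_t\tilde H_t^{+})(I_l + C_t^{\beta}Q_{t-1}^{+}(C_t^{\beta})^T)^{-1}, \tag{4.221}$$

$$\tilde H_t = (I_l + C_t^{\beta}Q_{t-1}^{+}(C_t^{\beta})^T)^{-1}C_t^{\beta}H_t(C_t^{\beta})^T, \tag{4.222}$$

$$H_t = S_0(I_l - Q_{t-1}Q_{t-1}^{+}), \quad t = 1,2,\dots,N. \tag{4.223}$$

Proof. Since

$$S_t^{-1} = S_0^{-1} + \sum_{k=1}^{t}(C_k^{\beta})^TC_k^{\beta}/\varepsilon = \left(S_0^{-1}\varepsilon + \sum_{k=1}^{t}(C_k^{\beta})^TC_k^{\beta}\right)/\varepsilon, \quad t = 1,2,\dots,N,$$

Lemma 2.1 implies

$$S_t = \left(S_0^{-1}\varepsilon + \sum_{k=1}^{t}(C_k^{\beta})^TC_k^{\beta}\right)^{-1}\varepsilon = S_0(I_l - Q_tQ_t^{+}) + Q_t^{+}\varepsilon + O(\varepsilon^2), \quad t = 1,2,\dots,N, \tag{4.224}$$

if we put $\Omega_1 = S_0^{-1}$, $F_t = C_t^{\beta}$. Using Lemma 2.2 and Eq. (2.24), we obtain

$$\lim_{\varepsilon\to 0} S_t(C_t^{\beta})^T/\varepsilon = Q_t^{+}(C_t^{\beta})^T.$$

We show first that

$$S_{t-1}(C_t^{\beta})^T(R_t + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1} = S_t(C_t^{\beta})^TR_t^{-1}.$$

Multiplying Eq. (4.61) on the right by $(C_t^{\beta})^TR_t^{-1}$ gives

$$\begin{aligned} S_t(C_t^{\beta})^TR_t^{-1} &= S_{t-1}(C_t^{\beta})^TR_t^{-1} - S_{t-1}(C_t^{\beta})^T(R_t + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1}C_t^{\beta}S_{t-1}(C_t^{\beta})^TR_t^{-1} \\ &= S_{t-1}(C_t^{\beta})^T\left(I_l - (R_t + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1}C_t^{\beta}S_{t-1}(C_t^{\beta})^T\right)R_t^{-1} \\ &= S_{t-1}(C_t^{\beta})^T(R_t + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1}. \end{aligned}$$

It follows that

$$\lim_{\varepsilon\to 0} S_{t-1}(C_t^{\beta})^T(I_m\varepsilon + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1} = \lim_{\varepsilon\to 0} S_t(C_t^{\beta})^T/\varepsilon = Q_t^{+}(C_t^{\beta})^T, \tag{4.225}$$

$$\lim_{\varepsilon\to 0} V_t = G_t. \tag{4.226}$$

Let us find the asymptotic representation for $N_t^{-1}$ as $\varepsilon \to 0$. We transform $N_t$ using Eq. (4.224) as follows:

$$\begin{aligned} N_t = I_m\varepsilon + C_t^{\beta}S_{t-1}(C_t^{\beta})^T &= I_m\varepsilon + C_t^{\beta}\left(S_0(I_l - Q_{t-1}Q_{t-1}^{+}) + Q_{t-1}^{+}\varepsilon + O(\varepsilon^2)\right)(C_t^{\beta})^T \\ &= C_t^{\beta}H_t(C_t^{\beta})^T + (I_m + C_t^{\beta}Q_{t-1}^{+}(C_t^{\beta})^T)\varepsilon + O(\varepsilon^2) \\ &= (I_m + C_t^{\beta}Q_{t-1}^{+}(C_t^{\beta})^T)(\tilde H_t + I_m\varepsilon) + O(\varepsilon^2). \end{aligned}$$

Using Lemma 2.1 yields

$$(\tilde H_t + I_m\varepsilon)^{-1} = (I_l - \tilde H_t\tilde H_t^{+})/\varepsilon + \tilde H_t^{+} + O(\varepsilon).$$

Therefore

$$N_t^{-1} = (I_l - \tilde H_t\tilde H_t^{+})(I_m + C_t^{\beta}Q_{t-1}^{+}(C_t^{\beta})^T)^{-1}/\varepsilon + O(1) = \bar H_t/\varepsilon + O(1).$$

We have

$$\begin{aligned} W_t &= W_{t-1} + (C_t^{\beta}V_{t-1} + C_t^{\alpha})^T(R_t + C_t^{\beta}S_{t-1}(C_t^{\beta})^T)^{-1}(C_t^{\beta}V_{t-1} + C_t^{\alpha}) \\ &= W_{t-1} + (C_t^{\beta}V_{t-1} + C_t^{\alpha})^TN_t^{-1}(C_t^{\beta}V_{t-1} + C_t^{\alpha}) \\ &= W_{t-1} + (C_t^{\beta}V_{t-1} + C_t^{\alpha})^T(\bar H_t + O(\varepsilon))(C_t^{\beta}V_{t-1} + C_t^{\alpha})/\varepsilon. \end{aligned}$$

This implies that

$$L_t = \lim_{\varepsilon\to 0} W_t\varepsilon = \sum_{k=1}^{t}(C_k^{\beta}G_{k-1} + C_k^{\alpha})^T\bar H_k(C_k^{\beta}G_{k-1} + C_k^{\alpha}),$$

$$\lim_{\varepsilon\to 0}\begin{pmatrix} V_t\\ I_r\end{pmatrix}W_t^{+}\begin{pmatrix} V_t\\ I_r\end{pmatrix}^T\begin{pmatrix}(C_t^{\beta})^T\\ (C_t^{\alpha})^T\end{pmatrix}R_t^{-1} = \begin{pmatrix} G_t\\ I_r\end{pmatrix}L_t^{+}\begin{pmatrix} G_t\\ I_r\end{pmatrix}^T\begin{pmatrix}(C_t^{\beta})^T\\ (C_t^{\alpha})^T\end{pmatrix}.$$

A similar result holds for the two-stage algorithm described by Eqs. (4.58) and (4.71)-(4.73).

Theorem 4.10. For a given $N$, $R_t = \varepsilon I_m$, $\varepsilon \to 0$, $\varepsilon > 0$, and $S_0 > 0$

$$\bar K_t^{\beta} = \lim_{\varepsilon\to 0} K_t^{dif,\beta} = Q_t^{+}(C_t^{\beta})^T, \quad \bar K_t^{\alpha} = \lim_{\varepsilon\to 0} K_t^{dif,\alpha} = L_t^{+}(C_t^{\alpha})^T, \tag{4.227}$$

where

$$Q_t = Q_{t-1} + (C_t^{\beta})^TC_t^{\beta}, \quad Q_0 = 0_{l\times l}, \tag{4.228}$$

$$L_t = L_{t-1} + (C_t^{\alpha})^T\bar H_tC_t^{\alpha}, \quad L_0 = 0_{r\times r}. \tag{4.229}$$

The statement is proved by the same scheme as the previous Theorem 4.9.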
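The small-noise expansion Eq. (4.224) can be checked numerically in a simple special case. The sketch below assumes $l = 2$, a single scalar measurement row `c` (so that $Q_1 = c^Tc$ has rank one and $Q_1^{+} = c^Tc/\|c\|^4$), and $S_0 = s_0 I$; these restrictions, the numerical values, and the function names are assumptions made to keep the example short and are not part of the theorem. It compares the exact $S_1 = (S_0^{-1} + Q_1/\varepsilon)^{-1}$ with $S_0(I - Q_1Q_1^{+}) + Q_1^{+}\varepsilon$ and inspects how the discrepancy scales with $\varepsilon$.

```python
def inverse_2x2(M):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def exact_S(s0, c, eps):
    """S_1 = (S0^(-1) + c^T c / eps)^(-1) with S0 = s0 * I (assumption)."""
    info = [[(1.0 / s0 if i == j else 0.0) + c[i] * c[j] / eps for j in range(2)]
            for i in range(2)]
    return inverse_2x2(info)


def first_order_S(s0, c, eps):
    """S0 (I - Q Q^+) + Q^+ eps for the rank-one Q = c^T c."""
    n2 = c[0] ** 2 + c[1] ** 2
    return [[s0 * ((1.0 if i == j else 0.0) - c[i] * c[j] / n2) + eps * c[i] * c[j] / n2 ** 2
             for j in range(2)] for i in range(2)]


def max_abs_error(s0, c, eps):
    E, F = exact_S(s0, c, eps), first_order_S(s0, c, eps)
    return max(abs(E[i][j] - F[i][j]) for i in range(2) for j in range(2))


s0, c = 2.0, [1.0, 2.0]          # illustrative values
err_coarse = max_abs_error(s0, c, 1e-3)
err_fine = max_abs_error(s0, c, 1e-4)
ratio = err_coarse / err_fine    # close to 100 if the remainder is O(eps^2)
```

Reducing $\varepsilon$ by a factor of 10 reduces the discrepancy by roughly a factor of 100, which is consistent with the $O(\varepsilon^2)$ remainder.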

4.8 EXAMPLES OF APPLICATION

4.8.1 Identification of Nonlinear Static Plants

Suppose that we want to describe the plant model with a scalar output using a multilayer perceptron with one hidden layer and a linear output activation function

$$y_t = \sum_{k=1}^{p} w_k\,\sigma\!\left(\sum_{j=1}^{q} a_{kj}z_{jt} + b_k\right) + \xi_t, \quad t = 1,2,\dots,N, \tag{4.230}$$

where $z_{jt}$, $j = 1,2,\dots,q$ are inputs of the plant, $y_t$ is the output of the plant, $a_{kj}$, $b_k$, $k = 1,2,\dots,p$, $j = 1,2,\dots,q$ are weights and biases of the hidden layer, respectively, $w_k$, $k = 1,2,\dots,p$ are weights of the output layer, $\sigma(x) = (1 + \exp(-x))^{-1}$ is the hidden layer activation function, and $\xi_t$ is a centered random process with uncorrelated values and variance $R_t$.

It is required to estimate the neural network weights and biases from observations of input-output pairs $\{z_t, y_t\}$, $t = 1,2,\dots,N$ using the DTA. Let us rewrite Eq. (4.230) in the form of Eq. (4.1), which is more suitable for us:

$$y_t = \Phi(z_t,\beta)\alpha + \xi_t, \quad t = 1,2,\dots,N, \tag{4.231}$$

where

$$\beta = (a_{11}, a_{12}, \dots, a_{1q}, b_1, a_{21}, a_{22}, \dots, a_{2q}, b_2, \dots, a_{p1}, a_{p2}, \dots, a_{pq}, b_p)^T \in R^{(q+1)p},$$

$$\alpha = (w_1, w_2, \dots, w_p)^T \in R^p, \quad \Phi(z_t,\beta)\alpha = \sum_{k=1}^{p} w_k\,\sigma\!\left(\sum_{j=1}^{q} a_{kj}z_{jt} + b_k\right).$$

To use the DTA we have to find

$$C_t^{\beta} = \partial[\Phi(z_t,\beta)\alpha]/\partial\beta, \quad C_t^{\alpha} = \Phi(z_t,\beta_{t-1}) \tag{4.232}$$

and select initial values $\beta_0$ and $S_0$.

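From Eq. (4.231) it follows that

$$C_t^{\alpha} = (\sigma(a_1z_t + b_1), \sigma(a_2z_t + b_2), \dots, \sigma(a_pz_t + b_p)) \in R^{1\times p}, \tag{4.233}$$

$$C_t^{\beta} = (\tilde a_{11}, \tilde a_{12}, \dots, \tilde a_{1q}, \tilde b_1, \tilde a_{21}, \tilde a_{22}, \dots, \tilde a_{2q}, \tilde b_2, \dots, \tilde a_{p1}, \tilde a_{p2}, \dots, \tilde a_{pq}, \tilde b_p)^T \in R^{(q+1)p}, \tag{4.234}$$

where

$$z_t = (z_{1t}, z_{2t}, \dots, z_{qt}, 1)^T \in R^{q+1}, \quad a_k = (a_{k1}, a_{k2}, \dots, a_{kq}, b_k), \quad k = 1,2,\dots,p,$$

$$\tilde a_{kj} = \frac{\partial}{\partial a_{kj}}\left[\sum_{k=1}^{p} w_k\,\sigma\!\left(\sum_{j=1}^{q} a_{kj}z_{jt} + b_k\right)\right] = w_k\frac{d}{dx}\sigma(x)\,z_{jt},$$

$$\tilde b_k = \frac{\partial}{\partial b_k}\left[\sum_{k=1}^{p} w_k\,\sigma\!\left(\sum_{j=1}^{q} a_{kj}z_{jt} + b_k\right)\right] = w_k\frac{d}{dx}\sigma(x),$$

$$x = \sum_{j=1}^{q} a_{kj}z_{jt} + b_k.$$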
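The matrix functions $C_t^{\alpha}$ and $C_t^{\beta}$ of the perceptron Eq. (4.230) can be sketched as follows. The code uses the identity $d\sigma/dx = \sigma(x)(1-\sigma(x))$ for the logistic sigmoid and orders the entries of $C_t^{\beta}$ as in Eq. (4.234). The weights, the input vector, and the function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_outputs(A, b, z):
    """C_t^alpha: outputs of the p hidden neurons, sigma(sum_j a_kj z_j + b_k)."""
    return [sigmoid(sum(a_kj * z_j for a_kj, z_j in zip(a_k, z)) + b_k)
            for a_k, b_k in zip(A, b)]


def model_output(w, A, b, z):
    """Phi(z, beta) alpha = sum_k w_k sigma(...)."""
    return sum(w_k * h_k for w_k, h_k in zip(w, hidden_outputs(A, b, z)))


def jacobian_beta(w, A, b, z):
    """C_t^beta ordered as (a_11..a_1q, b_1, ..., a_p1..a_pq, b_p)."""
    grads = []
    for w_k, h_k in zip(w, hidden_outputs(A, b, z)):
        slope = h_k * (1.0 - h_k)                    # d sigma / dx at neuron k
        grads.extend(w_k * slope * z_j for z_j in z)  # derivatives w.r.t. a_kj
        grads.append(w_k * slope)                     # derivative w.r.t. b_k
    return grads


# Illustrative network with p = 2 hidden neurons and q = 3 inputs (assumed values).
w = [0.7, -1.2]
A = [[0.3, -0.5, 0.1], [1.0, 0.2, -0.4]]
b = [0.1, -0.2]
z = [0.5, -1.0, 2.0]

c_alpha = hidden_outputs(A, b, z)
c_beta = jacobian_beta(w, A, b, z)
```

A finite-difference comparison against `model_output` is a convenient way to validate such derivative code before it is used inside a recursive training loop.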

We choose a small random vector $\beta$ with a covariance matrix $P_\beta$ to initialize $\beta_t$ and set $\beta_0 = \beta$, $S_0 = hP_\beta$, $h \in (0, 1]$.

Example 4.1. Let us illustrate the efficiency of the results obtained in this chapter by one of the standard examples used in testing algorithms [7]. Consider the problem of approximation of the function described by the expression

$$y(x) = \begin{cases}\sin(x)/x, & x \ne 0,\\ 1, & x = 0.\end{cases}$$

Training and testing sets $(x_i, y_i)$ include 2000 points each and values of $x_i$ are uniformly distributed on the interval $[-10, 10]$. Noise uniformly distributed on the interval $[-0.2, 0.2]$ is added to the training values $y_i$, and it is assumed that the testing set does not contain any noise. The sigmoid AF is used and the approximation accuracy of the ELM is compared with the accuracy of the diffuse algorithms with sequential processing. As a measure of accuracy, an estimate of the 90th percentile of the approximation error obtained from 500 samples is used. Weights and biases of the hidden layer are assumed to be random quantities uniformly distributed on the intervals $[-1, 1]$ and $[0, 1]$, respectively, $R_t = 0.16/12$. Introduce the following notations: DTA1 is described by the relations Eqs. (2.74)-(2.76), which estimate output layer parameters only (the diffuse version of ELM algorithms); DTA2 and DTA3 are described by the relations (4.58)-(4.64) and (4.58), (4.71)-(4.73), respectively. The modeling results are presented in Table 4.1. As is easily seen, the DTAs surpass the ELM in accuracy for both training and testing sets. DTA2 and DTA3 yield practically indistinguishable accuracy, but DTA3 surpasses DTA2 in speed by 32%. Fig. 4.1 presents the approximating dependencies 1 and 2 (ELM and DTA2, respectively) for five neurons of the hidden layer together with the training set used.

Table 4.1 Training errors

Number of neurons      Accuracy in training         Accuracy in testing
in the hidden layer    ELM      DTA1     DTA2       ELM      DTA1     DTA2
4                      0.315    0.169    0.174      0.295    0.125    0.13
5                      0.272    0.159    0.159      0.246    0.108    0.11
6                      0.205    0.148    0.147      0.17     0.092    0.092

[Figure: y(x) versus x on the interval from -10 to 10, curves 1 and 2.]
Figure 4.1 Approximating dependencies for DTA1 (1) and DTA2 (2).

Let us now assume that a plant model is approximated with the help of a zero-order neuro-fuzzy Sugeno network with Gaussian membership functions (MF)

$$y_t = \Phi(z_t,\beta)\alpha + \xi_t = \frac{\sum\limits_{i=1}^{q}\alpha_i\prod\limits_{j=1}^{n}\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij})}{\sum\limits_{i=1}^{q}\prod\limits_{j=1}^{n}\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij})} + \xi_t, \quad t = 1,2,\dots,N, \tag{4.235}$$

where

$$\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij}) = \exp(-(z_{jt} - m_{ij})^2/2\sigma_{ij}^2), \quad i = 1,2,\dots,q, \quad j = 1,2,\dots,n, \tag{4.236}$$

$$\alpha = (\alpha_1, \alpha_2, \dots, \alpha_q)^T \in R^q,$$

$$\beta = (m_{11}, m_{12}, \dots, m_{1n}, m_{21}, m_{22}, \dots, m_{2n}, \dots, m_{q1}, m_{q2}, \dots, m_{qn}, \sigma_{11}, \sigma_{12}, \dots, \sigma_{1n}, \sigma_{21}, \sigma_{22}, \dots, \sigma_{2n}, \dots, \sigma_{q1}, \sigma_{q2}, \dots, \sigma_{qn})^T \in R^{2nq}. \tag{4.237}$$

It is required to estimate $\alpha$ and $\beta$ from observations of input-output pairs $\{z_t, y_t\}$, $t = 1,2,\dots,N$ using the diffuse algorithms. To use these algorithms we have to find the expressions for the matrix functions $C_t^{\beta}$, $C_t^{\alpha}$ which are included in their description and select initial values $\beta_0$ and $S_0$. We have

$$\Phi(z_t,\beta)\alpha = \frac{\sum_{i=1}^{q}\alpha_ia_i(z_t,\beta)}{\sum_{i=1}^{q}a_i(z_t,\beta)}, \quad t = 1,2,\dots,N, \tag{4.238}$$

where $a_i(z_t,\beta) = \prod_{j=1}^{n}\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij})$. It follows from this that

$$C_t^{\alpha} = \frac{1}{\sum_{i=1}^{q}a_i(z_t,\beta)}(a_1(z_t,\beta), a_2(z_t,\beta), \dots, a_q(z_t,\beta)) \in R^{1\times q}, \tag{4.239}$$

$$\tilde m_{ij} = \frac{\partial}{\partial m_{ij}}\Phi(z_t,\beta)\alpha = \alpha_i(b(z_t,\beta) - 1)\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij})(z_{jt} - m_{ij})/\sigma_{ij}^2,$$

$$\tilde\sigma_{ij} = \frac{\partial}{\partial\sigma_{ij}}\Phi(z_t,\beta)\alpha = \alpha_i(b(z_t,\beta) - 1)\mu_{ij}(z_{jt}, m_{ij}, \sigma_{ij})(z_{jt} - m_{ij})^2/\sigma_{ij}^3, \quad i = 1,2,\dots,q, \quad j = 1,2,\dots,n,$$

$$C_t^{\beta} = (\tilde m_{11}, \tilde m_{12}, \dots, \tilde m_{1n}, \tilde m_{21}, \tilde m_{22}, \dots, \tilde m_{2n}, \dots, \tilde m_{q1}, \tilde m_{q2}, \dots, \tilde m_{qn}, \tilde\sigma_{11}, \tilde\sigma_{12}, \dots, \tilde\sigma_{1n}, \tilde\sigma_{21}, \tilde\sigma_{22}, \dots, \tilde\sigma_{2n}, \dots, \tilde\sigma_{q1}, \tilde\sigma_{q2}, \dots, \tilde\sigma_{qn})^T \in R^{2nq}, \tag{4.240}$$

where $b(z_t,\beta) = \sum_{i=1}^{q}\alpha_ia_i(z_t,\beta)$.

For the parameters of the MF there is often some expert knowledge connected with the specifics of the problem being solved. For example, for the Gaussians Eq. (4.236), the values $m_{ij}$ and $\sigma_{ij}$ can be specified before training. A natural requirement after optimization, namely to keep the original order of linguistic values $m_{i,j+1} > m_{ij}$ and $\sigma_{ij} > 0$, imposes a restriction on the size of their spread with respect to the selected values $m_{ij}$ and $\sigma_{ij}$. Taking this into account, the MF parameters before training are interpreted as random values with expectations $m_{ij}$ and $\sigma_{ij}$ and sufficiently small variances

$$S_\beta = E((\beta - m_\beta)(\beta - m_\beta)^T) = \mathrm{diag}(s_1, \dots, s_k),$$

where $s_i \in [0, \varepsilon]$ are uniformly distributed numbers and $\varepsilon > 0$ is a small parameter. Similar considerations are valid for any parameterized MF. In contrast to $\beta$, there is no a priori information about $\alpha$.

Example 4.2. Let us consider the problem of identification of a static object from input-output data [77] (represented by points in Fig. 4.2) with the help of the zero-order Sugeno fuzzy inference system. There are two peaks observed against the background of a decaying trend and broadband noise. Six membership functions and $n = 18$ unknown parameters were used; $q = 12$ of these parameters are in the description of the MFs. Under the assumption that the MFs are Gaussian, it is easy to determine the matrix $C_t$ from the expressions (4.239) and (4.240). The centers and widths of the MFs were specified using the standard methodology of initial (before optimization) arrangement of MF parameters [78] (Fig. 4.3), and the deviations from them were assumed to be equal to 5 and 0.01, respectively. Fig. 4.4 presents the dependencies of the mean-square estimation error RMSE(M) on the number of iterations (epochs) with the forgetting parameter $\lambda_i = \max\{1 - 0.05/i,\ 0.99\}$, $i = 1,2,\dots,M$. Curves 1, 2, and 3 are constructed with the help of the hybrid algorithm of the ANFIS system and the iterative DTAs described by the systems Eqs. (4.194)-(4.202) and (4.203)-(4.208), respectively. It is seen that the proposed algorithms considerably exceed the hybrid algorithm of the ANFIS system in convergence speed. It should be noted that the forgetting parameter exerts a considerable influence on the convergence rate of an algorithm. In fact, when $\lambda_i = 1$ and the number of iterations equals 40, we have RMSE(40) = 6.8. Figs. 4.3 and 4.5 show the membership functions obtained before and after optimization, respectively. Despite considerable changes in the MFs after optimization, the initial linguistic order is preserved in these figures.

[Figure: y(t) versus t for t from 0 to 250, curves 1, 2, 3.]
Figure 4.2 Dependencies of the plant and model outputs on the number of observations.

[Figure: mf versus t for t from 0 to 250, membership functions 1 to 6.]
Figure 4.3 Dependencies of six membership functions on the number of observations before optimization.

[Figure: RMSE(M) versus M on a logarithmic axis from 10^0 to 10^3, curves 1, 2, 3; marked points (M = 40, RMSE = 2.2) and (M = 166, RMSE = 3.244).]
Figure 4.4 Dependencies of the mean-square estimation errors on the number of iterations M.

[Figure: mf versus t for t from 0 to 250, membership functions 1 to 6.]
Figure 4.5 Dependencies of six membership functions on the number of observations after optimization.
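The forward pass of the zero-order Sugeno model Eqs. (4.235)-(4.236) and the row $C_t^{\alpha}$ of Eq. (4.239) can be sketched as follows. The rule centers, widths, consequents, and function names below are illustrative assumptions and are not the values used in Example 4.2; the derivatives with respect to the MF parameters are deliberately left out of this sketch.

```python
import math


def firing_strengths(z, centers, widths):
    """a_i = prod_j exp(-(z_j - m_ij)^2 / (2 sigma_ij^2)), one value per rule i."""
    return [math.exp(-sum((z_j - m_ij) ** 2 / (2.0 * s_ij ** 2)
                          for z_j, m_ij, s_ij in zip(z, m_i, s_i)))
            for m_i, s_i in zip(centers, widths)]


def c_alpha(z, centers, widths):
    """Normalized firing strengths a_i / sum_i a_i (the row C_t^alpha)."""
    a = firing_strengths(z, centers, widths)
    total = sum(a)
    return [a_i / total for a_i in a]


def sugeno_output(alpha, z, centers, widths):
    """Phi(z, beta) alpha: weighted average of the rule consequents alpha_i."""
    return sum(al * w for al, w in zip(alpha, c_alpha(z, centers, widths)))


# Illustrative model: q = 3 rules, n = 1 input (assumed values).
centers = [[0.0], [100.0], [200.0]]
widths = [[20.0], [20.0], [20.0]]
alpha = [10.0, 50.0, 20.0]

weights_mid = c_alpha([50.0], centers, widths)            # input halfway between rules 1 and 2
y_at_first_center = sugeno_output(alpha, [0.0], centers, widths)
y_constant = sugeno_output([3.0, 3.0, 3.0], [137.0], centers, widths)
```

Because the model is linear in `alpha` for fixed MF parameters, `c_alpha` plays the role of the regressor row for the output-layer estimate, while the MF parameters enter nonlinearly.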

4.8.2 Identification of Nonlinear Dynamic Plants

Example 4.3. Consider the problem of identification of the plant described by the nonlinear difference equation [58]

$$y_t = y_{t-1}y_{t-2}(y_{t-1} + 2.5)/(1 + y_{t-1}^2 + y_{t-2}^2) + u_{t-1}.$$

Values of the training set of the input $u_t$ are uniformly distributed over the interval $[-2, 2]$ and values of the test sample are specified by the expression $u_t = \sin(2\pi t/250)$. The model of the object is sought in the form of a nonlinear autoregressive moving average model

$$y_t = f(y_{t-1}, y_{t-2}, u_{t-1}),$$

where $f(\cdot)$ is a multilayer perceptron with one hidden layer and a linear output activation function.
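The data for this example can be generated by simulating the difference equation directly. The sketch below is an illustration under stated assumptions: zero initial conditions, a fixed random seed, set sizes of 2000 and 500, and hypothetical function names. It produces a training sequence with uniformly distributed input, a test sequence with sinusoidal input, and the regressor rows $(y_{t-1}, y_{t-2}, u_{t-1})$ with targets $y_t$.

```python
import math
import random


def plant_step(y_prev, y_prev2, u_prev):
    """y_t = y_{t-1} y_{t-2} (y_{t-1} + 2.5) / (1 + y_{t-1}^2 + y_{t-2}^2) + u_{t-1}."""
    return y_prev * y_prev2 * (y_prev + 2.5) / (1.0 + y_prev ** 2 + y_prev2 ** 2) + u_prev


def simulate_plant(u, y0=0.0, y_minus1=0.0):
    """Return y_1..y_T for inputs u_0..u_{T-1}; zero initial conditions are an assumption."""
    y_prev, y_prev2, out = y0, y_minus1, []
    for u_prev in u:
        y = plant_step(y_prev, y_prev2, u_prev)
        out.append(y)
        y_prev2, y_prev = y_prev, y
    return out


def regression_pairs(u, y):
    """Rows (y_{t-1}, y_{t-2}, u_{t-1}) with target y_t, built from the simulated sequences."""
    # y[k] holds y_{k+1}; the target y[k] for k >= 2 uses y[k-1], y[k-2] and the input u[k].
    return [((y[k - 1], y[k - 2], u[k]), y[k]) for k in range(2, len(y))]


rng = random.Random(0)                                       # fixed seed for reproducibility
train_u = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
test_u = [math.sin(2.0 * math.pi * t / 250.0) for t in range(500)]

train_y = simulate_plant(train_u)
test_y = simulate_plant(test_u)
train_pairs = regression_pairs(train_u, train_y)
```

Each pair can then be fed to a perceptron model $f(\cdot)$; the choice of zero initial conditions only affects the first few samples.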


In contrast to Section 2.3.1 we will simultaneously estimate all parameters included in the description of the perceptron. The initial values of the weights of the hidden layer and of the biases are selected from the uniform distribution on the intervals $[-1, 1]$ and $[0, 1]$, respectively. Let us introduce the following notations: DTA1 is described by the relations (4.35)-(4.41), and DTA2 is described by the relations Eqs. (2.30) and (2.31), which estimate the output layer parameters only (the diffuse version of ELM algorithms). Table 4.2 shows the values of the 90th percentile of the output approximation error for DTA1. For comparison, the same table gives the simulation results obtained when only the output layer weights are estimated (DTA2). Fig. 4.6 shows the outputs of the plant and of the models with five neurons in the hidden layer. Here curves 1 and 2 are the outputs of DTA2 and DTA1, respectively, and curve 3 is the output of the plant.

Table 4.2 Training errors

Sizes of training N      Number of neurons      Training          Testing
and testing M sets       in the hidden layer    DTA1    DTA2      DTA1    DTA2
N = 2000, M = 500         5                     0.13    0.57      0.06    0.59
                         10                     0.12    0.23      0.05    0.18
                         15                     0.13    0.14      0.05    0.08
                         20                     0.13    0.12      0.05    0.04
N = 500, M = 500          4                     0.16    0.83      0.11    0.83
                          5                     0.15    0.58      0.10    0.58
                          6                     0.14    0.48      0.10    0.46
                         10                     0.14    0.23      0.09    0.19

Figure 4.6 Dependencies of the plant output and the model outputs y(t) on time t (curves 1, 2, and 3).

Example 4.4. Consider the problem of identifying a two-tank system from input-output data [79]. A voltage (the input) is applied to a pump that fills the upper tank with fluid. From the upper tank the fluid flows through a hole into the lower tank. The fluid level in the lower tank defines the output of the system $y(t)$. The data contain 3000 values of $u(t)$ in volts and $y(t)$ in meters measured with a step of 0.2 s, which are shown in Figs. 4.7 and 4.8, respectively. The plant model is sought in the form of a multilayer perceptron with one hidden layer, a linear output activation function, and the vector of inputs

$$z_t = (y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4}, y_{t-5}, u_{t-3})^T.$$
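The input vector $z_t$ defined above can be assembled from recorded sequences as follows. This is a minimal sketch: the function name `build_regressors` and its parameters `ny` and `delay` are hypothetical and only encode the five output lags and the input delay of three steps from the text.

```python
def build_regressors(y, u, ny=5, delay=3):
    """Return (Z, targets), where Z[i] = (y_{t-1}, ..., y_{t-ny}, u_{t-delay}) and targets[i] = y_t."""
    start = max(ny, delay)  # first t for which all lagged values exist
    Z, targets = [], []
    for t in range(start, len(y)):
        z = [y[t - k] for k in range(1, ny + 1)] + [u[t - delay]]
        Z.append(z)
        targets.append(y[t])
    return Z, targets


# Tiny synthetic check of the indexing (not the tank data of [79]).
y_demo = list(range(10))
u_demo = [10 * i for i in range(10)]
Z_demo, t_demo = build_regressors(y_demo, u_demo)
```

Each row of `Z_demo` is one network input and the matching entry of `t_demo` is the desired output, so the pairs can be passed directly to a training routine.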

Figure 4.7 Dependency of the plant input u(t) on time t.

Figure 4.8 Dependency of the plant output y(t) on time t.

Figure 4.9 Dependencies of the plant output (curve 1) and the model output (curve 2) y(t) on time t.

Figure 4.10 Dependencies of the plant output (curve 1) and the model output (curve 2) y(t) on time t when only the output layer parameters are estimated.

The size of the training sample is $N = 1000$ and the size of the testing sample is $M = 1000$. The hidden layer weights and biases are selected as in the previous example. The forgetting parameter is determined by the expression $\lambda_k = \max\{1 - 0.05/k,\ 0.99\}$. Fig. 4.9 shows the output of the plant (curve 1) and of its model (curve 2) built using the DTA of Eqs. (4.194)-(4.202) with five neurons in the hidden layer and simultaneous training of all unknown parameters for 20 epochs. Fig. 4.10 shows the output of the plant (curve 1) and of the model (curve 2) built using the DTA of Theorem 2.1 with five neurons in the hidden layer and estimation of the output layer parameters only.


4.8.3 Example of Classification Task

Let us illustrate the use of the iterative DTA for classifying objects from the training sample $\{z_t, y_t\}$, $t = 1, 2, \ldots, N$, where $z_t = (z_{1t}, z_{2t}, \ldots, z_{nt})^T \in R^n$ is a vector of features characterizing a recognizable object and $y_t \in Y = \{1, 2, \ldots, m\}$, with $Y$ a finite set of class numbers. We use the following form of the RBNN [24]:

$$y_{it} = \sum_{k=1}^{p} w_{ik}\,\phi(\|z_t - a_k\|^2) + w_{i0}, \quad i = 1, 2, \ldots, m,\ t = 1, 2, \ldots, N, \tag{4.241}$$

where $a_k \in R^n$, $k = 1, 2, \ldots, p$ are centers, $w_{ik}$, $i = 1, 2, \ldots, m$, $k = 1, 2, \ldots, p$ are weights of the output layer, $w_{i0}$, $i = 1, 2, \ldots, m$ are biases, and

$$\phi(\|z_t - a_k\|^2) = (\|z_t - a_k\|^2 + 1)^{-1/2}$$

is a basis function. The network is trained so that its $k$-th output is equal to 1 and all the others are equal to 0 when the input vector $z_t$ belongs to the class with number $k$. Let us turn to the equivalent description

$$y_t = \Phi(z_t, \beta)\,\alpha, \quad t = 1, 2, \ldots, N, \tag{4.242}$$

where

$$\alpha = (w_{10}, w_{11}, \ldots, w_{1p}, w_{20}, w_{21}, \ldots, w_{2p}, \ldots, w_{m0}, w_{m1}, \ldots, w_{mp})^T \in R^{(p+1)m},$$

$$\beta = (a_1^T, a_2^T, \ldots, a_p^T)^T \in R^{np},$$

$$\Phi(z_t, \beta) = \mathrm{block\ diag}(\Phi_1(z_t, \beta), \Phi_2(z_t, \beta), \ldots, \Phi_m(z_t, \beta)) \in R^{m \times m(p+1)},$$

$$\Phi_i(z_t, \beta) = (1, \phi(\|z_t - a_1\|^2), \phi(\|z_t - a_2\|^2), \ldots, \phi(\|z_t - a_p\|^2)) \in R^{1 \times (p+1)}.$$

Whence follow the equations for the matrix functions included in the description of the DTA:

$$C_t^{\alpha} = \begin{pmatrix} 1 & \phi(\|z_t - a_1\|^2) & \phi(\|z_t - a_2\|^2) & \cdots & \phi(\|z_t - a_p\|^2) \\ 1 & \phi(\|z_t - a_1\|^2) & \phi(\|z_t - a_2\|^2) & \cdots & \phi(\|z_t - a_p\|^2) \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ 1 & \phi(\|z_t - a_1\|^2) & \phi(\|z_t - a_2\|^2) & \cdots & \phi(\|z_t - a_p\|^2) \end{pmatrix},$$

$$C_t^{\beta} = \begin{pmatrix} w_{11}\dfrac{\partial}{\partial a_1}\phi(\|z_t - a_1\|^2) & w_{12}\dfrac{\partial}{\partial a_2}\phi(\|z_t - a_2\|^2) & \cdots & w_{1p}\dfrac{\partial}{\partial a_p}\phi(\|z_t - a_p\|^2) \\ w_{21}\dfrac{\partial}{\partial a_1}\phi(\|z_t - a_1\|^2) & w_{22}\dfrac{\partial}{\partial a_2}\phi(\|z_t - a_2\|^2) & \cdots & w_{2p}\dfrac{\partial}{\partial a_p}\phi(\|z_t - a_p\|^2) \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1}\dfrac{\partial}{\partial a_1}\phi(\|z_t - a_1\|^2) & w_{m2}\dfrac{\partial}{\partial a_2}\phi(\|z_t - a_2\|^2) & \cdots & w_{mp}\dfrac{\partial}{\partial a_p}\phi(\|z_t - a_p\|^2) \end{pmatrix}.$$
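The network of Eq. (4.241) and the derivative entering $C_t^{\beta}$ can be sketched directly from the definitions. The code below is an illustration only: the function names are hypothetical, and the gradient formula $\partial\phi/\partial a_k = (\|z_t - a_k\|^2 + 1)^{-3/2}(z_t - a_k)$ is obtained here by differentiating the stated basis function, not quoted from the text.

```python
def phi(z, a):
    """Basis function (||z - a||^2 + 1)^(-1/2) of Eq. (4.241)."""
    r2 = sum((zi - ai) ** 2 for zi, ai in zip(z, a))
    return (r2 + 1.0) ** -0.5


def phi_grad_center(z, a):
    """Gradient of phi with respect to the center a (derived, see lead-in)."""
    r2 = sum((zi - ai) ** 2 for zi, ai in zip(z, a))
    scale = (r2 + 1.0) ** -1.5
    return [scale * (zi - ai) for zi, ai in zip(z, a)]


def rbnn_outputs(z, centers, weights, biases):
    """Outputs y_i = sum_k w_ik * phi(||z - a_k||^2) + w_i0, i = 1..m."""
    feats = [phi(z, a) for a in centers]
    return [b + sum(w * f for w, f in zip(row, feats))
            for row, b in zip(weights, biases)]


# Toy network with n = 3 features, p = 2 centers, m = 2 outputs (values are arbitrary).
out = rbnn_outputs([1.0, 1.0, 1.0],
                   centers=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                   weights=[[2.0, 1.0], [0.0, -1.0]],
                   biases=[0.5, 0.0])
```

For a trained classifier, the output with the largest value would typically be taken as the predicted class; that decision rule is an assumption of this note, since the text only specifies the 0/1 training targets.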

Example 4.5. Consider the classical problem of partitioning iris flowers into three classes using four features [80]. The experimental data contain 50 instances of each class (a $4 \times 150$ matrix), previously normalized according to the mean and variance estimated for the entire set. The samples were randomly divided into a training set and a testing set with 25 elements of each class. The training set is denoted $L_3 = (X_1\ X_2\ X_3)$, where $X_1, X_2, X_3 \in R^{4 \times 25}$ are matrices characterizing each class. It is required to develop an algorithm that determines to which class a particular iris pattern belongs. We use the RBNN Eq. (4.174) with $m = 3$, $p = 4$. Network training consisted of estimating the centers and the weights on the training set. We used two algorithms, namely, the GN method with a large parameter and the DTA in the iterative mode. The vectors $\beta$ and $\alpha$ are interpreted as random variables. A priori information about the centers is determined from the experimental data, and for the output layer it is absent. We put $E\beta = (v_1^T, v_2^T, v_3^T)^T$, where $v_1, v_2, v_3$ are the average values of each class, estimated from the samples $X_1, X_2, X_3$, respectively. For the matrix $E((\beta - m_\beta)(\beta - m_\beta)^T)$ we use the estimate of the feature covariance matrix over all three classes derived from the sample $L_3 = (X_1\ X_2\ X_3)$. The process of training and testing the network was repeated 100 times. As a result, it was found that the GN method with the matrix of the intensities of the measurement noise 20.6 I75 diverged regardless of the value of $\mu$ in all 100 realizations. The DTA with the same measurement noise intensity always converged, giving 89% of correct partition into classes with an average of 24 iterations.
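The preprocessing described in Example 4.5 (normalization by the mean and variance of the whole set, followed by per-class averages used as a priori centers) might look as follows. This is a sketch under assumptions: samples are stored as rows rather than as columns of a $4 \times 150$ matrix, the variance is the population variance, and the helper names are hypothetical.

```python
def standardize(rows):
    """Column-wise normalization using the mean and variance of the entire set."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5 for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def class_means(rows, labels):
    """Average feature vector of each class, keyed by class number."""
    sums, counts = {}, {}
    for r, c in zip(rows, labels):
        acc = sums.setdefault(c, [0.0] * len(r))
        for j, v in enumerate(r):
            acc[j] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


# Two-sample toy data set (not the iris data of [80]).
demo = standardize([[1.0, 10.0], [3.0, 30.0]])
means_demo = class_means(demo, [1, 2])
```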

CHAPTER 5

Diffuse Kalman Filter

Contents
5.1 Problem Statement
5.2 Estimation With Diffuse Initialization
5.3 Estimation in the Absence or Incomplete a Priori Information About Initial Conditions
5.4 Systems State Recovery in a Finite Number of Steps
5.5 Filtering With the Sliding Window
5.6 Diffuse Analog of the Extended Kalman Filter
5.7 Recurrent Neural Network Training
5.8 Systems With Partly Unknown Dynamics

5.1 PROBLEM STATEMENT

Let us consider a linear discrete system of the form

$$x_{t+1} = A_t x_t + B_t w_t, \quad y_t = C_t x_t + D_t \xi_t, \quad t \in T = \{a, a+1, \ldots\}, \tag{5.1}$$

where $x_t \in R^n$ is the state vector, $y_t \in R^m$ is the measured output vector, $w_t \in R^r$ and $\xi_t \in R^l$ are uncorrelated random processes with zero mean and the covariance matrices $E(w_t w_t^T) = I_r$, $E(\xi_t \xi_t^T) = I_l$, and $A_t, B_t, C_t, D_t$ are known matrices of appropriate dimensions. Let the initial state $x_a$ of the system Eq. (5.1) satisfy the following conditions:

A1. The random vector $x_a$ is not correlated with $w_t$ and $\xi_t$ for $t \in T$.

A2. Without loss of generality it is assumed that the state vector elements are arranged so that there is a priori information regarding its first $q \in \{1, \ldots, n-1\}$ components at the initial moment $t = a$:

$$m_{ai} = E[x_{ia}], \quad s_{aij} = E[(x_{ia} - m_{ai})(x_{ja} - m_{aj})], \quad i, j = 1, 2, \ldots, q. \tag{5.2}$$

A3. The random variables $x_{ia}$, $i = 1, 2, \ldots, q$, $q \ge 1$ are uncorrelated with $x_{ia}$, $i = q+1, q+2, \ldots, n$.


© 2017 Elsevier Inc. All rights reserved.

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00005-6


A4. A priori information regarding the remaining $n - q$ components of $x_a$ is absent, and they are treated as random variables with zero mean and a covariance matrix proportional to the large parameter $\mu > 0$, i.e.,

$$E(x_{ia}) = 0, \quad E(x_{ia} x_{ja}) = \mu \tilde{s}_{aij}, \quad i, j = q+1, q+2, \ldots, n, \tag{5.3}$$

where $\tilde{S}_a = \|\tilde{s}_{aij}\|_{q+1}^{n} > 0$.

A5. If a priori information regarding the entire vector $x_a$ is absent, then the conditions A1 and A4 are satisfied with $q = 0$.

It is required to find the limit relations for the Kalman filter (KF) as $\mu \to \infty$ and to study their properties. We will call these relations the diffuse KF (the DKF).

Consider an alternative approach to the problem of the system state estimation in the absence of a priori information about the initial conditions. This approach assumes that the following conditions are satisfied.

B1. Without loss of generality it is assumed that the state vector elements are arranged so that there is a priori information Eq. (5.2) regarding its first $q \in \{1, \ldots, n-1\}$ components at the initial moment $t = a$.

B2. A priori information on the remaining $n - q$ components of $x_a$ is absent, and they can be either unknown constants or random variables whose statistical characteristics are unknown.

B3. If $x_{ia}$, $i = 1, 2, \ldots, q$ are random variables, then they are uncorrelated with $x_{ia}$, $i = q+1, q+2, \ldots, n$, $w_t$ and $\xi_t$ for $t \in T$.

B4. If a priori information regarding the entire vector $x_a$ is absent, then the conditions B2 and B3 are satisfied with $q = 0$.

It is required to find conditions for the existence of the linear state estimate

$$\hat{x}_b = \Omega_{b-1,a}\, m_a + \sum_{t=a}^{b-1} Y_{b-1,t}\, y_t \tag{5.4}$$

from the measurements $y_a, y_{a+1}, \ldots, y_{b-1}$, $b \in \{a+1, a+2, \ldots\} \subset T$, and a recursive representation for the calculation of $\hat{x}_b$, where the matrices $Y_{b-1,t} \in R^{n \times m}$ and $\Omega_{b-1,a} \in R^{n \times q}$ do not depend on a priori information about the unknown vector $x_a$, and $m_a = (m_{a1}, \ldots, m_{aq})^T$. The estimate is sought under the conditions that it is unbiased,

$$E(\hat{x}_b) = E(x_b), \tag{5.5}$$

and minimizes the criterion

$$E((x_b - \hat{x}_b)^T (x_b - \hat{x}_b)). \tag{5.6}$$

5.2 ESTIMATION WITH DIFFUSE INITIALIZATION

Consider the problem of estimating the state vector of the system Eq. (5.1) under the assumption that the conditions A1-A5 are fulfilled. Under these assumptions, the linear unbiased state estimate of the system Eq. (5.1) with a minimum mean square error is determined from the system of equations (the KF) [81]

$$\hat{x}_{t+1} = A_t \hat{x}_t + K_t (y_t - C_t \hat{x}_t), \quad \hat{x}_a = (m_a^T, 0_{1 \times (n-q)})^T, \quad t \in T = \{a, a+1, \ldots\}, \tag{5.7}$$

where

$$K_t = A_t P_t C_t^T N_t^{-1}, \tag{5.8}$$

$$P_{t+1} = A_t P_t A_t^T - A_t P_t C_t^T N_t^{-1} C_t P_t A_t^T + V_{2t}, \quad P_a = \mathrm{block\ diag}(S_a, \mu \tilde{S}_a), \tag{5.9}$$

$$N_t = C_t P_t C_t^T + V_{1t}, \quad V_{1t} = D_t D_t^T, \quad V_{2t} = B_t B_t^T, \quad m_a = (m_{a1}, \ldots, m_{aq})^T,$$

$$S_a = \|s_{aij}\|_{1}^{q} \in R^{q \times q}, \quad \tilde{S}_a = \|\tilde{s}_{aij}\|_{q+1}^{n} \in R^{(n-q) \times (n-q)}.$$

We want to study the behavior of the KF for large values of $\mu$ and to find the limit relations as $\mu \to \infty$. We prove two auxiliary results which will be needed to solve this problem.

Lemma 5.1. The estimation error covariance matrix $P_t$ satisfies for $t \in T$ the relations

$$P_t = S_t + Q_t = S_t + R_t M_t^{-1} R_t^T, \tag{5.10}$$

where

$$S_{t+1} = A_t S_t A_t^T - A_t S_t C_t^T N_{1t}^{-1} C_t S_t A_t^T + V_{2t}, \quad S_t\big|_{t=a} = \mathrm{block\ diag}(S_a, 0_{(n-q) \times (n-q)}), \tag{5.11}$$

$$R_{t+1} = A_{1t} R_t, \quad A_{1t} = A_t - A_t S_t C_t^T N_{1t}^{-1} C_t, \quad R_a = (e_{q+1}, \ldots, e_n), \tag{5.12}$$

$$M_{t+1} = M_t + R_t^T C_t^T N_{1t}^{-1} C_t R_t, \quad M_a = \tilde{S}_a^{-1}/\mu, \tag{5.13}$$

$$Q_{t+1} = A_{1t} Q_t A_{1t}^T - A_{1t} Q_t C_t^T N_t^{-1} C_t Q_t A_{1t}^T, \quad Q_a = \mathrm{block\ diag}(0_{q \times q}, \mu \tilde{S}_a), \tag{5.14}$$

$$N_{1t} = C_t S_t C_t^T + V_{1t}, \quad N_t = C_t (S_t + Q_t) C_t^T + V_{1t},$$

$$S_t \in R^{n \times n}, \quad R_t \in R^{n \times (n-q)}, \quad M_t \in R^{(n-q) \times (n-q)}, \quad Q_t \in R^{n \times n}.$$

Proof. The first expression in Eq. (5.10) follows from the known property [69] of the discrete Riccati matrix equation:

$$Q_t = \tilde{P}_t - \bar{P}_t, \quad t \in T,$$

where $\tilde{P}_t$, $\bar{P}_t$ are arbitrary solutions of Eq. (5.9). Let us prove the second expression. We first show that

$$Q_t = R_t M_t^{-1} R_t^T. \tag{5.15}$$

Substituting this expression into Eq. (5.14) and using $R_{t+1} = A_{1t} R_t$ gives

$$R_{t+1} M_{t+1}^{-1} R_{t+1}^T = A_{1t} R_t M_t^{-1} R_t^T A_{1t}^T - A_{1t} R_t M_t^{-1} R_t^T C_t^T N_t^{-1} C_t R_t M_t^{-1} R_t^T A_{1t}^T.$$

This relation is valid under the condition that $M_t^{-1}$ satisfies the difference equation

$$M_{t+1}^{-1} = M_t^{-1} - M_t^{-1} R_t^T C_t^T N_t^{-1} C_t R_t M_t^{-1}, \quad M_a^{-1} = \tilde{S}_a \mu.$$

Transforming it using the identity Eq. (2.7) for $B = M_t^{-1}$, $C = C_t R_t$, $D = N_{1t}$, we obtain the expression

$$M_{t+1}^{-1} = (M_t + R_t^T C_t^T N_{1t}^{-1} C_t R_t)^{-1}, \tag{5.16}$$

which implies Eqs. (5.13) and (5.10).

Lemma 5.2. The KF gain matrix Eq. (5.8) satisfies the relation

$$K_t = A_t (S_t + Q_t) C_t^T N_t^{-1} = (A_t S_t + A_{1t} R_t M_{t+1}^{-1} R_t^T) C_t^T N_{1t}^{-1}, \quad t \in T. \tag{5.17}$$

Proof. The first expression in Eq. (5.17) follows from Lemma 5.1. Let us prove the second equality. We first show that

$$K_t = A_t P_t C_t^T N_t^{-1} = A_t S_t C_t^T N_{1t}^{-1} + A_{1t} Q_t C_t^T N_t^{-1}.$$

Since $P_t = S_t + Q_t$ (Lemma 5.1), it should hold that

$$A_t (S_t + Q_t) C_t^T = A_t S_t C_t^T N_{1t}^{-1} N_t + A_{1t} Q_t C_t^T.$$

Indeed, transforming the right side of this expression, we establish that

$$A_t S_t C_t^T N_{1t}^{-1} N_t + A_{1t} Q_t C_t^T = A_t S_t C_t^T (I_m + N_{1t}^{-1} C_t Q_t C_t^T) + (A_t - A_t S_t C_t^T N_{1t}^{-1} C_t) Q_t C_t^T = A_t (S_t + Q_t) C_t^T.$$

Using Eqs. (5.15) and (5.16) gives

$$\begin{aligned} A_{1t} Q_t C_t^T N_t^{-1} &= A_{1t} R_t M_t^{-1} R_t^T C_t^T N_t^{-1} \\ &= A_{1t} R_t M_{t+1}^{-1} (M_t + R_t^T C_t^T N_{1t}^{-1} C_t R_t) M_t^{-1} R_t^T C_t^T N_t^{-1} \\ &= A_{1t} R_t M_{t+1}^{-1} (R_t^T C_t^T + R_t^T C_t^T N_{1t}^{-1} C_t R_t M_t^{-1} R_t^T C_t^T) N_t^{-1} \\ &= A_{1t} R_t M_{t+1}^{-1} R_t^T C_t^T N_{1t}^{-1} (N_{1t} + C_t R_t M_t^{-1} R_t^T C_t^T) N_t^{-1} \\ &= A_{1t} R_t M_{t+1}^{-1} R_t^T C_t^T N_{1t}^{-1}. \end{aligned}$$

This implies Eq. (5.17).

Theorem 5.1. Let $T = \{a, a+1, \ldots, b\}$ be an arbitrary bounded set. Then:

1. The following asymptotic representations, uniform in $t \in T$, are valid:

$$P_t = R_t \tilde{S}_a (I_{n-q} - W_t W_t^{+}) R_t^T \mu + S_t + R_t W_t^{+} R_t^T + O(\mu^{-1}), \quad \mu \to \infty, \tag{5.18}$$

$$K_t = \begin{cases} 0, & C_t = 0, \\ K_t^{dif} + O(\mu^{-1}), & C_t \ne 0, \end{cases} \quad \mu \to \infty, \tag{5.19}$$

where

$$W_{t+1} = W_t + R_t^T C_t^T N_{1t}^{-1} C_t R_t, \quad W_a = 0_{(n-q) \times (n-q)}, \tag{5.20}$$

$$K_t^{dif} = (A_t S_t + A_{1t} R_t W_{t+1}^{+} R_t^T) C_t^T N_{1t}^{-1}, \tag{5.21}$$

$W_t^{+}$ denotes the pseudoinverse of $W_t$, and $S_t$ and $R_t$ are the solutions of Eqs. (5.11) and (5.12), respectively.

2. For any $\varepsilon > 0$ the following condition is satisfied:

$$P(\|\hat{x}_t - \hat{x}_t^{dif}\| \ge \varepsilon) = O(\mu^{-1}), \quad \mu \to \infty, \quad t \in T, \tag{5.22}$$

where

$$\hat{x}_{t+1}^{dif} = A_t \hat{x}_t^{dif} + K_t^{dif} (y_t - C_t \hat{x}_t^{dif}), \quad \hat{x}_a^{dif} = (m_a^T, 0_{1 \times (n-q)})^T. \tag{5.23}$$

Proof. 1. From Eq. (5.13) it follows that

$$M_t = \tilde{S}_a^{-1}/\mu + \sum_{s=a}^{t-1} R_s^T C_s^T N_{1s}^{-1} C_s R_s = \tilde{S}_a^{-1}/\mu + W_t.$$

Using Lemma 2.1 for $\Omega_t = M_t$, $\Omega_0 = \tilde{S}_a^{-1}$, $F_t = N_{1t}^{-1/2} C_t R_t$ we get

$$M_t^{-1} = \tilde{S}_a (I_{n-q} - W_t W_t^{+}) \mu + W_t^{+} + O(\mu^{-1}), \quad \mu \to \infty.$$

Substitution of this expression into Eq. (5.10) yields Eq. (5.18). From Eq. (5.8) it follows that $K_t = 0$ if $C_t = 0$. Let $C_t \ne 0$. The use of Lemma 2.2 gives

$$(I_{n-q} - W_{t+1} W_{t+1}^{+}) R_t^T C_t^T N_{1t}^{-1/2} = 0,$$

which together with the relation (5.17) implies Eq. (5.19).

2. Introducing the notations $e_t = \hat{x}_t - \hat{x}_t^{dif}$, $h_t = \hat{x}_t - x_t$ and using Eqs. (5.1) and (5.7), we obtain

$$e_t = (A_t - K_t^{dif} C_t) e_{t-1} - (K_t^{dif} - K_t) C_t h_t + (K_t^{dif} - K_t) B_t w_t, \quad e_0 = 0,$$

$$h_t = (A_t - K_t^{dif} C_t) h_{t-1} + K_t^{dif} B_t w_t, \quad h_a = ((m_a - \bar{x}_a)^T, -\tilde{x}_a^T)^T,$$

where $\bar{x}_a = (x_{1a}, x_{2a}, \ldots, x_{qa})^T$, $\tilde{x}_a = (x_{q+1,a}, x_{q+2,a}, \ldots, x_{na})^T$. The matrix of second moments $Q_t$ of the block vector $(e_t^T, h_t^T)^T$ satisfies the following matrix difference equation:

$$Q_t = \tilde{A}_t Q_{t-1} \tilde{A}_t^T + L_t, \quad Q_0 = \mathrm{block\ diag}(0_{q \times q}, \bar{Q}_0),$$

where

$$\tilde{A}_t = \begin{pmatrix} A_t - K_t^{dif} C_t & -(K_t^{dif} - K_t) C_t \\ 0 & A_t - K_t^{dif} C_t \end{pmatrix},$$

$$L_t = \begin{pmatrix} (K_t^{dif} - K_t) R_t (K_t^{dif} - K_t)^T & (K_t^{dif} - K_t) R_t (K_t^{dif})^T \\ K_t^{dif} R_t (K_t^{dif} - K_t)^T & K_t^{dif} R_t (K_t^{dif})^T \end{pmatrix},$$

and, in this part of the proof, $R_t = B_t B_t^T$, $\bar{Q}_0 = E(h_a h_a^T)$. Since $\|K_t - K_t^{dif}\| = O(\mu^{-1})$, $t \in T$, $\mu \to \infty$, this implies

$$E(e_t^T e_t) = O(\mu^{-1}), \quad t \in T, \quad \mu \to \infty,$$

where $O(\mu^k)$ is a function such that $O(\mu^k)/\mu^k$ is bounded as $\mu \to \infty$. Using Markov's inequality gives for any $\varepsilon > 0$

$$P(\|e_t\| \ge \varepsilon) \le E(\|e_t\|^2)/\varepsilon^2 = O(\mu^{-1}), \quad \mu \to \infty, \quad t \in T.$$

The relations (5.20), (5.21), and (5.23) will be called the diffuse KF (DKF).

Consequence 5.1. The diffuse component $P_t^{dif} = R_t \tilde{S}_a (I_{n-q} - W_t W_t^{+}) R_t^T \mu$ vanishes when $t \ge tr$, where $tr = \min_t \{t : W_t > 0,\ t = a, a+1, \ldots\}$.

Consequence 5.2. The matrix $K_t^{dif}$ does not depend on the diffuse component, as opposed to the matrix $P_t$, and as a function of $\mu$ it is uniformly bounded in norm for $t \in T$ as $\mu \to \infty$.

Consequence 5.3. Numerical implementation errors can result in the KF divergence for large values of $\mu$. Indeed, let $\delta W_t^{+}$ be the error connected with the calculation of the pseudoinverse of $W_t$. Then by Theorem 5.1

$$K_t = R_t \tilde{S}_a (I_{n-q} - W_t (W_t^{+} + \delta W_t^{+})) R_t^T C_t^T N_{1t}^{-1} \mu + O(1), \quad t \in T, \quad \mu \to \infty. \tag{5.24}$$

For $\delta W_t^{+} \ne 0$ the matrix $K_t$ becomes dependent on the diffuse component. Moreover, in this case $K_t$ becomes proportional to a large parameter, and so divergence is possible even if the continuity condition is satisfied and $\delta W_t^{+}$ is arbitrarily small in norm.

Let us consider the optimality properties of the DKF. We first prove the following auxiliary assertion.

Lemma 5.3. The following representations hold:

$$H_{t,s} = \Phi_{t,s} - R_t W_t^{-1} \sum_{i=s}^{t-1} R_i^T C_i^T N_{1i}^{-1} C_i \Phi_{i,s}, \quad t \ge tr, \quad s < t, \tag{5.25}$$

$$(H_{s,t+1} \tilde{K}_t^{dif})^T = N_{1t}^{-1} C_t R_t W_s^{-1} R_s^T = f_{t,s}, \tag{5.26}$$

where $H_{t,s}$, $\Phi_{t,s}$ are determined from the systems of equations

$$H_{t+1,s} = (A_t - K_t^{dif} C_t) H_{t,s} = \tilde{A}_t H_{t,s}, \quad H_{s,s} = I_n, \quad t \ge s, \tag{5.27}$$

$$\Phi_{t+1,s} = A_{1t} \Phi_{t,s}, \quad \Phi_{s,s} = I_n, \quad t > s. \tag{5.28}$$

Proof. Consider the auxiliary systems of equations

$$G_{t,s} = (A_t - K_t^{dif} C_t)^T G_{t+1,s} = \tilde{A}_t^T G_{t+1,s}, \quad G_{s,s} = I_n, \tag{5.29}$$

$$Z_{t,s} = A_{1t}^T Z_{t+1,s} - C_t^T N_{1t}^{-1} C_t R_t W_s^{-1} R_s^T, \quad Z_{s,s} = I_n, \quad s \ge tr. \tag{5.30}$$

We first show that

$$G_{t,s} = Z_{t,s}, \quad t = s-1, s-2, \ldots, \quad s \ge tr. \tag{5.31}$$

Iterating Eqs. (5.30) and (5.28), we obtain

$$Z_{t,s} = A_{1t}^T A_{1,t+1}^T \cdots A_{1,s-1}^T - A_{1t}^T A_{1,t+1}^T \cdots A_{1,s-2}^T C_{s-1}^T f_{s-1,s} - \cdots - C_t^T f_{t,s}, \quad t \ge tr, \quad t \ge s,$$

$$\Phi_{t,s} = A_{1,t-1} A_{1,t-2} \cdots A_{1s}, \quad \Phi_{s,t}^T = (A_{1,s-1} A_{1,s-2} \cdots A_{1t})^T = A_{1t}^T A_{1,t+1}^T \cdots A_{1,s-1}^T.$$

In view of these expressions $Z_{t,s}$ takes the form

$$Z_{t,s} = \Phi_{s,t}^T - \Phi_{s-1,t}^T C_{s-1}^T f_{s-1,s} - \cdots - C_t^T f_{t,s} = \Phi_{s,t}^T - \sum_{j=t}^{s-1} \Phi_{j,t}^T C_j^T N_{1j}^{-1} C_j R_j W_s^{-1} R_s^T, \quad t \ge tr, \quad t \ge s. \tag{5.32}$$

Represent Eq. (5.29) in the following equivalent form:

$$G_{t,s} = (A_t - K_t^{dif} C_t)^T G_{t+1,s} = (A_{1t} - \tilde{K}_t^{dif} C_t)^T G_{t+1,s}, \quad G_{s,s} = I_n, \tag{5.33}$$

where

$$\tilde{K}_t^{dif} = R_{t+1} W_{t+1}^{+} R_t^T C_t^T N_{1t}^{-1}.$$

From a comparison of the right-hand sides of Eqs. (5.30) and (5.33) it follows that Eq. (5.31) holds if

$$(\tilde{K}_t^{dif})^T Z_{t+1,s} = N_{1t}^{-1} C_t R_t W_s^{-1} R_s^T. \tag{5.34}$$

Substituting Eq. (5.32) into the left side of Eq. (5.34), we obtain

$$\begin{aligned} (\tilde{K}_t^{dif})^T Z_{t+1,s} &= N_{1t}^{-1} C_t R_t W_{t+1}^{+} R_{t+1}^T \Big( \Phi_{s,t+1}^T - \sum_{j=t+1}^{s-1} \Phi_{j,t+1}^T C_j^T N_{1j}^{-1} C_j R_j W_s^{-1} R_s^T \Big) \\ &= N_{1t}^{-1} C_t R_t W_{t+1}^{+} \Big( I_{n-q} - \sum_{j=t+1}^{s-1} R_j^T C_j^T N_{1j}^{-1} C_j R_j W_s^{-1} \Big) R_s^T \\ &= N_{1t}^{-1} C_t R_t W_{t+1}^{+} W_{t+1} W_s^{-1} R_s^T. \end{aligned} \tag{5.35}$$

In the derivation of Eq. (5.35) we used the identities

$$R_{t+1} = \Phi_{t+1,a} R_a, \quad \Phi_{s,t+1} \Phi_{t+1,a} = \Phi_{s,a}, \quad \Phi_{j,t+1} R_{t+1} = \Phi_{j,t+1} \Phi_{t+1,a} R_a = \Phi_{j,a} R_a = R_j.$$

Let us transform Eq. (5.35) using the decomposition $W_t = V_t V_t^T$, where

$$V_t = \big( R_a^T C_a^T N_{1a}^{-1/2},\ R_{a+1}^T C_{a+1}^T N_{1,a+1}^{-1/2},\ \ldots,\ R_t^T C_t^T N_{1t}^{-1/2} \big).$$

Let $l_{1t}, l_{2t}, \ldots, l_{k(t),t}$ be linearly independent columns of the matrix $V_t$. Using the skeletal decomposition yields

$$V_t = L_t \Gamma_t,$$

where $L_t = (l_{1t}, l_{2t}, \ldots, l_{k(t),t}) \in R_{k(t)}^{(n-q) \times k(t)}$, $\Gamma_t \in R_{k(t)}^{k(t) \times m(t-a)}$, and $R_r^{k \times l}$ is the set of $k \times l$ matrices of rank $r$. Since $\tilde{\Gamma}_t = \Gamma_t \Gamma_t^T > 0$ is the Gram matrix constructed from the linearly independent rows of the matrix $\Gamma_t$ and $\mathrm{rank}(L_t) = \mathrm{rank}(\tilde{\Gamma}_t L_t^T)$, we have

$$W_t = L_t \tilde{\Gamma}_t L_t^T, \quad W_t^{+} = (L_t \tilde{\Gamma}_t L_t^T)^{+} = (L_t^T)^{+} \tilde{\Gamma}_t^{-1} L_t^{+}.$$

The validity of Eqs. (5.25) and (5.34) follows from the expressions

$$\begin{aligned} N_{1t}^{-1/2} C_t R_t W_t^{+} W_t &= N_{1t}^{-1/2} C_t R_t (L_t^T)^{+} \tilde{\Gamma}_t^{-1} L_t^{+} L_t \tilde{\Gamma}_t L_t^T \\ &= N_{1t}^{-1/2} C_t R_t L_t (L_t^T L_t)^{-1} L_t^T \\ &= \Gamma_{tt}^T (L_t^T L_t)(L_t^T L_t)^{-1} L_t^T = N_{1t}^{-1/2} C_t R_t, \end{aligned}$$

$$G_{s,t} = \tilde{A}_s^T \tilde{A}_{s+1}^T \cdots \tilde{A}_{t-1}^T, \quad H_{t,s} = \tilde{A}_{t-1} \tilde{A}_{t-2} \cdots \tilde{A}_s, \quad H_{t,s}^T = G_{s,t},$$

and Eq. (5.26) follows from Eq. (5.34).

Theorem 5.2.

1. The estimate of the state of the system Eq. (5.1) obtained by the DKF is unbiased when $t \ge tr$.

2. The following relations hold:

$$E(\|x_t - \hat{x}_t\|^2) \ge E(\|x_t - \hat{x}_t^{dif}\|^2), \tag{5.36}$$

$$P_t^{dif} = E((x_t - \hat{x}_t^{dif})(x_t - \hat{x}_t^{dif})^T) = S_t + R_t W_t^{-1} R_t^T, \quad t \ge tr, \tag{5.37}$$

where $\hat{x}_t$ is an unbiased estimate of the state of the system Eq. (5.1) for $t \ge tr$ defined by the system of equations

$$\hat{x}_{t+1} = A_t \hat{x}_t + \tilde{K}_t (y_t - C_t \hat{x}_t), \quad \hat{x}_a = (m_a^T, 0_{1 \times (n-q)})^T, \quad t \in T = \{a, a+1, \ldots\} \tag{5.38}$$

with an arbitrary gain matrix $\tilde{K}_t \in R^{n \times m}$ that is independent of the unknown a priori information about the initial conditions, and $P_t^{dif}$ is the covariance matrix of the estimation error.

Proof. 1. Consider the system of equations

$$h_{t+1} = (A_t - K_t^{dif} C_t) h_t$$

with the initial condition $h_a = (0_{1 \times q}, \tilde{x}^T)^T$, where $h_t = E(\hat{x}_t^{dif} - x_t)$ and $\tilde{x} = (x_{a(q+1)}, x_{a(q+2)}, \ldots, x_{an})^T$ is an arbitrary vector.

Using Lemma 5.3, we establish that

$$h_t = H_{t,a} h_a = \Big( \Phi_{t,a} - R_t W_t^{-1} \sum_{i=a}^{t-1} R_i^T C_i^T N_{1i}^{-1} C_i \Phi_{i,a} \Big) R_a \tilde{x} = 0.$$

Whence it follows that the estimate $\hat{x}_t^{dif}$ is unbiased, i.e., $E(\hat{x}_t^{dif}) = E(x_t)$ when $t \ge tr$.

2. The covariance matrix of the estimation error $\tilde{P}_t = E((x_t - \hat{x}_t)(x_t - \hat{x}_t)^T)$ satisfies the matrix equation

$$\tilde{P}_{t+1} = (A_t - \tilde{K}_t C_t) \tilde{P}_t (A_t - \tilde{K}_t C_t)^T + \tilde{K}_t V_{1t} \tilde{K}_t^T + V_{2t} \tag{5.39}$$

with some unknown initial matrix $\tilde{P}_{tr}$. Completing the square on the right-hand side of Eq. (5.39), with $N_t = C_t \tilde{P}_t C_t^T + V_{1t}$, we get

$$\tilde{P}_{t+1} = A_t \tilde{P}_t A_t^T - A_t \tilde{P}_t C_t^T N_t^{-1} C_t \tilde{P}_t A_t^T + V_{2t} + (\tilde{K}_t - A_t \tilde{P}_t C_t^T N_t^{-1}) N_t (\tilde{K}_t - A_t \tilde{P}_t C_t^T N_t^{-1})^T. \tag{5.40}$$

It follows from this that if we put $\tilde{K}_t = A_t \tilde{P}_t C_t^T N_t^{-1}$, then

$$\tilde{P}_{t+1} = A_t \tilde{P}_t A_t^T - A_t \tilde{P}_t C_t^T N_t^{-1} C_t \tilde{P}_t A_t^T + V_{2t} \tag{5.41}$$

and in this case $\tilde{P}_t \le \bar{P}_t$, where $\bar{P}_t$ is an arbitrary solution of the matrix Eq. (5.39). From Lemma 5.1 it follows that the set of solutions of Eq. (5.41) includes the solution of the form

$$P_t^{dif} = S_t + Q_t = S_t + R_t W_t^{-1} R_t^T, \quad t > tr,$$

with the initial condition $P_{tr}^{dif} = S_{tr} + R_{tr} W_{tr}^{-1} R_{tr}^T$. Thus $P_t^{dif} \le \tilde{P}_t$.

Comment 5.2.1. Comparing the KF with a large parameter $\mu$, described by expressions (5.7) and (5.9), with the DKF, it is easy to see that the DKF is significantly more expensive computationally. Indeed, in the case of the DKF we need to iterate two additional matrix equations, Eqs. (5.11) and (5.12). In addition, there are a number of additional matrix operations related to the computation of the DKF gain. However, Consequence 5.3 shows that this more complex structure of the DKF is compensated by its resistance to the accumulation of errors, which ensures convergence of the estimates.
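The comparison in Comment 5.2.1 can be illustrated with a scalar sketch ($n = 1$, $q = 0$, constant $A$, $B$, $C$, $D$, $\tilde{S}_a = 1$). Everything named in the code (`kf_large_mu`, `diffuse_kf`, `pinv_scalar`, the numerical values, the seed) is an assumption made for illustration; the recursions follow Eqs. (5.7)-(5.9) and (5.11), (5.12), (5.20), (5.21), (5.23) specialized to scalars.

```python
import random


def pinv_scalar(w, tol=1e-12):
    """Pseudoinverse of a scalar: 1/w if w is nonzero, otherwise 0."""
    return 1.0 / w if abs(w) > tol else 0.0


def kf_large_mu(ys, A, B, C, D, mu):
    """Scalar KF, Eqs. (5.7)-(5.9), with P_a = mu (no prior information, q = 0)."""
    V1, V2 = D * D, B * B
    x_hat, P, est = 0.0, mu, []
    for y in ys:
        N = C * P * C + V1
        K = A * P * C / N
        x_hat = A * x_hat + K * (y - C * x_hat)
        P = A * P * A - (A * P * C) ** 2 / N + V2
        est.append(x_hat)
    return est


def diffuse_kf(ys, A, B, C, D):
    """Scalar DKF, Eqs. (5.11), (5.12), (5.20), (5.21), (5.23); mu does not appear."""
    V1, V2 = D * D, B * B
    x_hat, S, R, W, est = 0.0, 0.0, 1.0, 0.0, []
    for y in ys:
        N1 = C * S * C + V1
        A1 = A - A * S * C * C / N1
        W = W + R * C * C * R / N1                       # W_{t+1}
        K = (A * S + A1 * R * pinv_scalar(W) * R) * C / N1  # K_t^dif
        x_hat = A * x_hat + K * (y - C * x_hat)
        S = A * S * A - (A * S * C) ** 2 / N1 + V2       # S_{t+1}
        R = A1 * R                                       # R_{t+1}
        est.append(x_hat)
    return est


# Simulate the scalar system (5.1) with an initial state unknown to both filters.
random.seed(1)
A, B, C, D = 0.9, 0.5, 1.0, 1.0
x, ys = 5.0, []
for _ in range(30):
    ys.append(C * x + D * random.gauss(0.0, 1.0))
    x = A * x + B * random.gauss(0.0, 1.0)

est_kf = kf_large_mu(ys, A, B, C, D, mu=1e6)
est_dkf = diffuse_kf(ys, A, B, C, D)
max_gap = max(abs(p - d) for p, d in zip(est_kf, est_dkf))
```

In this sketch the two estimate sequences differ by a quantity that shrinks as `mu` grows, in line with Eq. (5.22), while the DKF carries the extra recursions for `S`, `R`, and `W` mentioned in the comment.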


5.3 ESTIMATION IN THE ABSENCE OR INCOMPLETE A PRIORI INFORMATION ABOUT INITIAL CONDITIONS

In this section we consider the problem of estimating the state vector of the system Eq. (5.1) under the assumption that the conditions B1-B4 are fulfilled. We first show that the problem of unbiased estimation of the state vector of Eq. (5.1) reduces to a special control problem. Based on this result, we find the existence conditions for an unbiased estimator and the expressions describing its operation. Consider an auxiliary linear system of the form

$$Z_t = A_t^T Z_{t+1} - C_t^T U_t, \quad Z_b = I_n, \quad t = b-1, b-2, \ldots, a, \tag{5.42}$$

where $Z_t \in R^{n \times n}$ and $U_t \in R^{m \times n}$ is a control.

Lemma 5.4. Suppose that there is a control $U_t$ bringing the system Eq. (5.42) to a state satisfying the condition

$$\Psi^T Z_a = 0,$$

where $\Psi = (e_{q+1}, \ldots, e_n)$, $e_i \in R^n$, and $e_i$ is the $i$-th unit vector, $i = 1, 2, \ldots, n$. Then there is an unbiased state estimate of the system Eq. (5.1) of the form Eq. (5.4), where

$$Y_{b-1,t} = U_t^T, \quad \Omega_{b-1,a} = Z_a^T (e_1, \ldots, e_q). \tag{5.43}$$

Proof. Using the identity

$$x_b = Z_a^T x_a + \sum_{t=a}^{b-1} (Z_{t+1}^T x_{t+1} - Z_t^T x_t), \tag{5.44}$$

Eqs. (5.1) and (5.42) give

$$\begin{aligned} x_b &= Z_a^T x_a + \sum_{t=a}^{b-1} \big[ Z_{t+1}^T (A_t x_t + B_t w_t) - (Z_{t+1}^T A_t - U_t^T C_t) x_t \big] \\ &= Z_a^T x_a + \sum_{t=a}^{b-1} (Z_{t+1}^T B_t w_t + U_t^T C_t x_t). \end{aligned} \tag{5.45}$$

Therefore, the expression for the estimation error at the moment $t = b$ can be presented in the form

$$\begin{aligned} e_b = x_b - \hat{x}_b &= Z_a^T x_a - \Omega_{b-1,a} m_a + \sum_{t=a}^{b-1} (Z_{t+1}^T B_t w_t + U_t^T C_t x_t - Y_{b-1,t} y_t) \\ &= Z_a^T x_a - Z_a^T (e_1, \ldots, e_q) m_a + \sum_{t=a}^{b-1} (Z_{t+1}^T B_t w_t - U_t^T D_t \xi_t) \\ &= Z_a^T q_1 + Z_a^T q_2 + \sum_{t=a}^{b-1} (Z_{t+1}^T B_t w_t - U_t^T D_t \xi_t), \end{aligned} \tag{5.46}$$

where

$$q_1 = (x_a(1) - m_{a1}, x_a(2) - m_{a2}, \ldots, x_a(q) - m_{aq}, 0, 0, \ldots, 0)^T \in R^n,$$

$$q_2 = (0, 0, \ldots, 0, x_a(q+1), x_a(q+2), \ldots, x_a(n))^T \in R^n.$$

Averaging the left- and right-hand sides of this expression and using the condition $\Psi^T Z_a = 0$ we get

$$E(e_b) = E(Z_a^T q_1) + E(Z_a^T q_2) = Z_a^T E(q_2) = Z_a^T \Psi \{E[x_a(q+1)], E[x_a(q+2)], \ldots, E[x_a(n)]\}^T = 0.$$

Lemma 5.5. Let

$$W_{a,b} = \sum_{j=a}^{b-1} \Psi^T \Phi_{j,a}^T C_j^T C_j \Phi_{j,a} \Psi > 0, \tag{5.47}$$

where the matrix $\Phi_{t,a}$ is determined by the system

$$\Phi_{t+1,a} = A_t \Phi_{t,a}, \quad \Phi_{a,a} = I_n, \quad t \in T.$$

Then:

1. There is an unbiased estimate of the state vector of the system Eq. (5.1) at the moment $t = b$.

2. If additionally for any nonzero vector $p \in R^{n-q}$

$$A_{b-1} \cdots A_a (0_{1 \times q}, p^T)^T \ne 0, \tag{5.48}$$

then the condition Eq. (5.47) is necessary and sufficient for the existence of an unbiased estimator of the state of the system Eq. (5.1) at the moment $t = b$.

Proof. 1. We show that when Eq. (5.47) holds there exists a control of the kind referred to in Lemma 5.4. Iterating Eq. (5.42), we obtain

$$Z_t = G_{t,b} Z_b - \sum_{j=t}^{b-1} G_{t,j} C_j^T U_j, \quad t = b-1, b-2, \ldots, a, \tag{5.49}$$

where the matrix $G_{t,s}$ is determined by the system

$$G_{t,s} = A_t^T G_{t+1,s}, \quad G_{s,s} = I_n, \quad s \ge t.$$

Using the boundary condition $\Psi^T Z_a = 0$, we find with the help of Eq. (5.49)

$$0 = \Psi^T G_{a,b} Z_b - \sum_{j=a}^{b-1} \Psi^T G_{a,j} C_j^T U_j. \tag{5.50}$$

We seek a solution of this system in the form

$$U_t = C_t G_{a,t}^T \Psi L,$$

where $L \in R^{(n-q) \times n}$ is a constant, unknown matrix. It follows from Eq. (5.50) that

$$L = \tilde{W}_{a,b}^{-1} \Psi^T G_{a,b} Z_b$$

provided that $\tilde{W}_{a,b} > 0$, where

$$\tilde{W}_{a,b} = \sum_{j=a}^{b-1} \Psi^T G_{a,j} C_j^T C_j G_{a,j}^T \Psi.$$

But as

$$G_{t,b} = A_t^T A_{t+1}^T \cdots A_{b-2}^T A_{b-1}^T, \quad \Phi_{b,t} = A_{b-1} A_{b-2} \cdots A_{t+1} A_t,$$

we have $G_{t,b} = \Phi_{b,t}^T$, $W_{a,b} = \tilde{W}_{a,b}$, and the control is given by the expression

$$U_t = C_t G_{a,t}^T \Psi \tilde{W}_{a,b}^{-1} \Psi^T G_{a,b} = C_t \Phi_{t,a} \Psi W_{a,b}^{-1} \Psi^T \Phi_{b,a}^T. \tag{5.51}$$

The statement now follows from Lemma 5.4.

2. Suppose that there are $Y_{b-1,t}$, $\Omega_{b-1,a}$ such that the estimate Eq. (5.4) is unbiased and Eq. (5.48) holds, but $W_{a,b}$ is singular. Then there is a vector $p \ne 0$, $p \in R^{n-q}$, such that $W_{a,b}\, p = 0$. Let us denote $\varsigma_t = C_t \Phi_{t,a} \Psi p$. As

$$\sum_{t=a}^{b-1} \varsigma_t^T \varsigma_t = p^T W_{a,b}\, p,$$

then $\varsigma_t = 0$ for any $t \in \{a, \ldots, b-1\}$. Let $x_a = (\eta^T, p^T)^T$, where $\eta \in R^q$ is any random vector with zero mean. Using Eqs. (5.1), (5.43) gives the relation

$$\begin{aligned} E(\hat{x}_b) &= \Omega_{b-1,a} E(\eta) + E\Big( \sum_{t=a}^{b-1} Y_{b-1,t} y_t \Big) = \sum_{t=a}^{b-1} Y_{b-1,t} E(y_t) = \sum_{t=a}^{b-1} U_t^T C_t \Phi_{t,a} E(x_a) \\ &= \sum_{t=a}^{b-1} U_t^T C_t \Phi_{t,a} \Psi p = \sum_{t=a}^{b-1} U_t^T \varsigma_t = 0, \end{aligned}$$

which is impossible if the condition Eq. (5.48) holds, since

$$E(x_b) = \Phi_{b,a} (0_{1 \times q}, p^T)^T = A_{b-1} A_{b-2} \cdots A_a (0_{1 \times q}, p^T)^T \ne 0.$$

Note that if the matrix $A_t$, $t \in T$ is nonsingular (for example, when the system Eq. (5.1) is obtained by sampling a continuous system), then Eq. (5.48) holds automatically. An unbiased estimate of the state of the system Eq. (5.1) is determined by the expression

$$\hat{x}_b = Z_a^T (m_a^T, 0_{1 \times (n-q)})^T + \sum_{t=a}^{b-1} \Phi_{b,a} \Psi W_{a,b}^{-1} \Psi^T \Phi_{t,a}^T C_t^T y_t,$$

where

$$Z_a = \Phi_{b,a}^T - \sum_{j=a}^{b-1} \Phi_{j,a}^T C_j^T C_j \Phi_{j,a} \Psi W_{a,b}^{-1} \Psi^T \Phi_{b,a}^T.$$

This assertion follows from Lemma 5.4 and the expressions (5.49) and (5.51). We will show that, under the assumptions made, the problem of determining the optimal estimate of the state of the system Eq. (5.1) (the unbiased estimate minimizing the criterion Eq. (5.6)) is a special dual optimal control problem. Based on this result, a recursive representation for estimating the state of the system Eq. (5.1) is found.
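The unbiased estimate above can be checked on a small case. The sketch below makes several assumptions for illustration: $q = 0$ (so $\Psi = I_n$ and $Z_a = 0$), $n = 2$, a scalar output, time-invariant $A$ and $C$, and noise-free measurements; the function name `unbiased_state_estimate` is hypothetical. Under these assumptions the formula reduces to $\hat{x}_b = \Phi_{b,a} W_{a,b}^{-1} \sum_t \Phi_{t,a}^T C^T y_t$.

```python
def unbiased_state_estimate(A, C, ys):
    """Estimate x_b from y_a, ..., y_{b-1} for a 2-state system with scalar output (q = 0)."""
    Phi = [[1.0, 0.0], [0.0, 1.0]]   # Phi_{a,a} = I_2
    W = [[0.0, 0.0], [0.0, 0.0]]     # Gramian W_{a,b} of Eq. (5.47) with Psi = I
    g = [0.0, 0.0]                   # running sum of Phi_{t,a}^T C^T y_t
    for y in ys:
        c = [sum(C[k] * Phi[k][j] for k in range(2)) for j in range(2)]  # row C * Phi_{t,a}
        for i in range(2):
            g[i] += c[i] * y
            for j in range(2):
                W[i][j] += c[i] * c[j]
        Phi = [[sum(A[i][k] * Phi[k][j] for k in range(2)) for j in range(2)]
               for i in range(2)]    # Phi_{t+1,a} = A * Phi_{t,a}
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    if det <= 1e-12:
        raise ValueError("W_{a,b} is not positive definite: condition (5.47) fails")
    xa = [(W[1][1] * g[0] - W[0][1] * g[1]) / det,
          (-W[1][0] * g[0] + W[0][0] * g[1]) / det]      # W^{-1} g
    return [sum(Phi[i][k] * xa[k] for k in range(2)) for i in range(2)]  # Phi_{b,a} * xa


# Noise-free toy data generated by x_a = (2, 3): y_a = 2, y_{a+1} = 5.
A_demo = [[1.0, 1.0], [0.0, 1.0]]
C_demo = [1.0, 0.0]
x_b_hat = unbiased_state_estimate(A_demo, C_demo, [2.0, 5.0])
```

With a single measurement the Gramian in this toy system is singular, so the function raises an error; this mirrors the role of condition (5.47) in Lemma 5.5.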

Let us consider the system Eq. (5.42) together with the quality criterion

$$J(U_t) = \operatorname{tr}\Big[ Z_a^T \tilde{S}_a Z_a + \sum_{t=a}^{b-1} (Z_{t+1}^T V_{2t} Z_{t+1} + U_t^T V_{1t} U_t) \Big], \tag{5.52}$$

where, in this section, $\tilde{S}_a = \mathrm{block\ diag}(S_a, 0_{(n-q) \times (n-q)})$. It is required to choose a control $U_t$ that minimizes the criterion Eq. (5.52) along the solutions of the system Eq. (5.42). From Lemma 5.4 it follows that if the conditions

$$\Psi^T Z_a = 0, \quad Y_{b-1,t} = U_t^T, \quad \Omega_{b-1,a} = Z_a^T (e_1, \ldots, e_q)$$

hold, then the state estimate of the system Eq. (5.1) at the time $t = b$ is unbiased, and the use of Eq. (5.46) gives

$$x_b - \hat{x}_b = Z_a^T (x_{1a} - m_{a1}, x_{2a} - m_{a2}, \ldots, x_{qa} - m_{aq}, 0, \ldots, 0)^T + \sum_{t=a}^{b-1} (Z_{t+1}^T B_t w_t - U_t^T D_t \xi_t).$$

Whence it follows that

$$E[(x_b - \hat{x}_b)(x_b - \hat{x}_b)^T] = Z_a^T \tilde{S}_a Z_a + \sum_{t=a}^{b-1} (Z_{t+1}^T V_{2t} Z_{t+1} + U_t^T V_{1t} U_t), \quad J(U_t) = E[(x_b - \hat{x}_b)^T (x_b - \hat{x}_b)]. \tag{5.53}$$

Thus, the unbiased state estimate of Eq. (5.1) minimizing the performance criterion Eq. (5.6) can be found from the solution of the formulated control problem, which yields the matrices $Y_{b-1,t}$, $\Omega_{b-1,a}$ appearing in the description of Eq. (5.4). We first prove some auxiliary statements.

Lemma 5.6. Let $W_{a,b} = \sum_{j=a}^{b-1} \Psi^T \Phi_{j,a}^T C_j^T C_j \Phi_{j,a} \Psi > 0$. Then the optimal program control and the minimum value of the quality criterion in the problem (5.42), (5.52) are determined by the expressions

$$U_t^p = N_t^{-1} C_t (S_t A_t^T Z_{t+1} + R_t M_{a,b}^{-1} R_b^T), \tag{5.54}$$

$$J_{min} = \operatorname{tr}(S_b + R_b M_{a,b}^{-1} R_b^T), \tag{5.55}$$

where

$$M_{a,b} = \sum_{t=a}^{b-1} R_t^T C_t^T N_t^{-1} C_t R_t, \tag{5.56}$$

$$Z_t = \Lambda_{b,t}^T - \sum_{j=t}^{b-1} \Lambda_{j,t}^T C_j^T N_j^{-1} C_j R_j M_{a,b}^{-1} R_b^T, \tag{5.57}$$

$$\Lambda_{t+1,s} = A_{1t} \Lambda_{t,s}, \quad \Lambda_{s,s} = I_n, \quad t \ge s, \tag{5.58}$$

$$R_{t+1} = A_{1t} R_t, \quad R_a = \Psi, \tag{5.59}$$

$$S_{t+1} = A_t S_t A_t^T - A_t S_t C_t^T N_t^{-1} C_t S_t A_t^T + V_{2t}, \quad S_t\big|_{t=a} = \tilde{S}_a, \tag{5.60}$$

$$A_{1t} = A_t - A_t S_t C_t^T N_t^{-1} C_t, \quad N_t = C_t S_t C_t^T + V_{1t}. \tag{5.61}$$

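Proof. Consider the identity

$$I = -\sum_{t=a}^{b-1} (Z_{t+1}^T S_{t+1} Z_{t+1} - Z_t^T S_t Z_t) + Z_b^T S_b Z_b - Z_a^T \tilde{S}_a Z_a = 0.$$

Transforming this with the help of Eqs. (5.42) and (5.60), we obtain

$$\begin{aligned} I &= -\sum_{t=a}^{b-1} \big[ Z_{t+1}^T (A_t S_t A_t^T - A_t S_t C_t^T N_t^{-1} C_t S_t A_t^T + V_{2t}) Z_{t+1} - (A_t^T Z_{t+1} - C_t^T U_t)^T S_t (A_t^T Z_{t+1} - C_t^T U_t) \big] + Z_b^T S_b Z_b - Z_a^T \tilde{S}_a Z_a \\ &= -\sum_{t=a}^{b-1} \big[ Z_{t+1}^T (-A_t S_t C_t^T N_t^{-1} C_t S_t A_t^T + V_{2t}) Z_{t+1} + Z_{t+1}^T A_t S_t C_t^T U_t + U_t^T C_t S_t A_t^T Z_{t+1} - U_t^T C_t S_t C_t^T U_t \big] + Z_b^T S_b Z_b - Z_a^T \tilde{S}_a Z_a = 0. \end{aligned}$$

Whence it follows that

$$\begin{aligned} J(U_t) &= \operatorname{tr}\Big[ Z_a^T \tilde{S}_a Z_a + \sum_{t=a}^{b-1} (Z_{t+1}^T V_{2t} Z_{t+1} + U_t^T V_{1t} U_t) + I \Big] \\ &= \operatorname{tr}\Big[ Z_b^T S_b Z_b + \sum_{t=a}^{b-1} (U_t - N_t^{-1} C_t S_t A_t^T Z_{t+1})^T N_t (U_t - N_t^{-1} C_t S_t A_t^T Z_{t+1}) \Big]. \end{aligned} \tag{5.62}$$

We will search for a control in the form

$$U_t = N_t^{-1} C_t S_t A_t^T Z_{t+1} + N_t^{-1/2} \tilde{U}_t. \tag{5.63}$$

Substituting this expression in Eqs. (5.42) and (5.62) gives

$$Z_t = (A_t^T - C_t^T N_t^{-1} C_t S_t A_t^T) Z_{t+1} - C_t^T N_t^{-1/2} \tilde{U}_t = A_{1t}^T Z_{t+1} - C_t^T N_t^{-1/2} \tilde{U}_t, \tag{5.64}$$

$$Z_b = I_n, \quad \Psi^T Z_a = 0, \quad t = b-1, b-2, \ldots, a, \tag{5.65}$$

$$J(\tilde{U}_t) = \operatorname{tr}\Big\{ \sum_{t=a}^{b-1} \tilde{U}_t^T \tilde{U}_t + S_b \Big\}. \tag{5.66}$$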

Let us find a program control which moves the system Eq. (5.64) to a state in which $\Psi^T Z_a = 0$. We have

$$Z_t = G_{t,b} Z_b - \sum_{j=t}^{b-1} G_{t,j} C_j^T N_j^{-1/2}\tilde U_j, \quad t = b-1, b-2, \ldots, a, \tag{5.67}$$

where

$$G_{t,s} = A_{1t}^T G_{t+1,s}, \quad G_{s,s} = I_n, \quad s \ge t.$$

From this, using the boundary condition $\Psi^T Z_a = 0$, we get

$$0 = \Psi^T G_{a,b} Z_b - \sum_{j=a}^{b-1} \Psi^T G_{a,j} C_j^T N_j^{-1/2}\tilde U_j. \tag{5.68}$$

We seek a solution of Eq. (5.68) in the form

$$\tilde U_t = N_t^{-1/2} C_t G_{a,t}^T \Psi L,$$

where $L \in R^{(n-q)\times n}$ is a constant unknown matrix. Substituting this expression in Eq. (5.68) and noting that $G_{a,j} = \Lambda_{j,a}^T$, we obtain

$$L = \Big(\sum_{j=a}^{b-1}\Psi^T G_{a,j} C_j^T N_j^{-1} C_j G_{a,j}^T \Psi\Big)^{-1}\Psi^T G_{a,b} Z_b = \Big(\sum_{j=a}^{b-1}\Psi^T \Lambda_{j,a}^T C_j^T N_j^{-1} C_j \Lambda_{j,a}\Psi\Big)^{-1}\Psi^T \Lambda_{b,a}^T Z_b = M_{a,b}^{-1}\Psi^T \Lambda_{b,a}^T Z_b,$$

if $M_{a,b} > 0$. Since $R_t = \Lambda_{t,a}\Psi$, it follows that

$$\tilde U_t = N_t^{-1/2} C_t R_t M_{a,b}^{-1} R_b^T Z_b \tag{5.69}$$

and the representation Eq. (5.54) for $U_t$. We show that $\tilde U_t$ really minimizes the criterion Eq. (5.53) on the solutions of the system Eq. (5.42). Suppose that $\hat U_t$ is an arbitrary control such that $\Psi^T Z_a = 0$. From Eq. (5.68) it follows that

$$\sum_{j=a}^{b-1}\Psi^T \Lambda_{j,a}^T C_j^T N_j^{-1/2}\big(\tilde U_j - \hat U_j\big) = 0.$$

Multiplying this expression on the left by the matrix $R_b M_{a,b}^{-1}$, we obtain the equality

$$\sum_{j=a}^{b-1}\tilde U_j^T\big(\tilde U_j - \hat U_j\big) = 0.$$

It follows from this that

$$\operatorname{Tr}\Big[\sum_{j=a}^{b-1}\big(\tilde U_j - \hat U_j\big)^T\big(\tilde U_j - \hat U_j\big)\Big] = \operatorname{Tr}\Big[\sum_{j=a}^{b-1}\big(\tilde U_j^T\tilde U_j - 2\tilde U_j^T\hat U_j + \hat U_j^T\hat U_j\big)\Big] = \operatorname{Tr}\Big(\sum_{j=a}^{b-1}\hat U_j^T\hat U_j\Big) - \operatorname{Tr}\Big(\sum_{j=a}^{b-1}\tilde U_j^T\tilde U_j\Big) \ge 0.$$

The expression (5.55) for the minimum value of the quality criterion follows from Eqs. (5.66) and (5.69).

Lemma 5.7. The condition

$$W_{a,b} = \sum_{j=a}^{b-1}\Psi^T \Phi_{j,a}^T C_j^T C_j \Phi_{j,a}\Psi > 0$$

is sufficient for the existence of the optimal control. If for any nonzero vector $p \in R^{n-q}$

$$A_{b-1}A_{b-2}\cdots A_a\,(0_{1\times q}, p^T)^T \ne 0,$$

then it is necessary and sufficient.


Proof. Sufficiency. Let us show that $W_{a,b} > 0$ implies $M_{a,b} > 0$. Let $W_{a,b} > 0$ but let $M_{a,b}$ be singular. In this case there is a nonzero vector $p \in R^{n-q}$ such that $M_{a,b}\,p = 0$ and at the same time $C_t R_t p = 0$, $t \in \{a, \ldots, b-1\}$. We show that this implies the equalities

$$C_t R_t p = C_t A_{t-1}\cdots A_a \Psi p = C_t \Phi_{t,a}\Psi p = 0, \quad t \in \{a, \ldots, b-1\}. \tag{5.70}$$

Since

$$C_a R_a p = C_a \Psi p = 0, \quad C_{a+1}R_{a+1}p = C_{a+1}A_{1a}\Psi p = C_{a+1}\big(A_a - A_a S_a C_a^T N_a^{-1} C_a\big)\Psi p = C_{a+1}A_a\Psi p = 0,$$

the equalities hold at $t = a$ and $t = a+1$. Suppose that these relations are valid at $a+2, \ldots, t$ and prove that they then hold at $t+1$. We have

$$C_{t+1}R_{t+1}p = C_{t+1}\big(A_t - A_t S_t C_t^T N_t^{-1} C_t\big)R_t p = C_{t+1}A_t R_t p = C_{t+1}A_t\big(A_{t-1} - A_{t-1}S_{t-1}C_{t-1}^T N_{t-1}^{-1}C_{t-1}\big)R_{t-1}p = C_{t+1}A_t A_{t-1}R_{t-1}p = \cdots = C_{t+1}\Phi_{t+1,a}\Psi p = 0,$$

which is impossible since

$$p^T W_{a,b}\,p = \sum_{j=a}^{b-1} p^T\Psi^T\Phi_{j,a}^T C_j^T C_j \Phi_{j,a}\Psi p > 0$$

and therefore $C_t\Phi_{t,a}\Psi p \ne 0$ for at least one $t \in \{a, \ldots, b-1\}$.

Necessity. Let the optimal control $U_t$ exist but let the matrix $W_{a,b}$ be singular. Then there is a nonzero vector $p \in R^{n-q}$ such that $W_{a,b}\,p = 0$ and at the same time

$$C_t\Phi_{t,a}\Psi p = 0, \quad t \in \{a, \ldots, b-1\}.$$

Using the boundary condition $\Psi^T Z_a = 0$ and the representation

$$Z_t = \Phi_{b,t}^T - \sum_{j=t}^{b-1}\Phi_{j,t}^T C_j^T U_j, \quad t = b-1, b-2, \ldots, a,$$

for the solutions of Eq. (5.42), we find

$$p^T\Psi^T\Phi_{b,a}^T = \sum_{j=a}^{b-1} p^T\Psi^T\Phi_{j,a}^T C_j^T U_j = 0,$$

which is impossible since

$$\Phi_{b,a}\Psi p = A_{b-1}\cdots A_a\,(0_q^T, p^T)^T \ne 0.$$

Let us denote by $U_t^p$ the optimal program control and by $U_t^f$ the optimal feedback control.

Lemma 5.8. Let $W_{a,b} > 0$. Then the optimal feedback control is determined by the expression

$$U_t^f = N_t^{-1}C_t\big(S_t A_t^T + R_t M_{a,t+1}^{+} R_{t+1}^T\big)Z_{t+1}, \quad t = b-1, b-2, \ldots, a. \tag{5.71}$$

Proof. Replace in Eq. (5.54) $b$ by $t+1$, $Z_b$ by $Z_{t+1}$, and $M_{a,b}^{-1}$ by $M_{a,t+1}^{+}$ to obtain the feedback control determined by Eq. (5.71). We show now that this control indeed solves our problem. Consider the system Eq. (5.42) under the action of the controls Eqs. (5.54) and (5.71). We have

$$Z_t^p = A_t^T Z_{t+1}^p - C_t^T U_t^p, \tag{5.72}$$

$$Z_t^f = \big(A_t^T - C_t^T G_t\big)Z_{t+1}^f, \tag{5.73}$$

where

$$G_t = N_t^{-1}C_t\big(S_t A_t^T + R_t M_{a,t+1}^{+}R_{t+1}^T\big) = G_{1t} + G_{2t}.$$

Let us show that $Z_t^p$ satisfies Eq. (5.73). Substituting $U_t^p$ in Eq. (5.72) gives

$$Z_t^p = A_{1t}^T Z_{t+1}^p - C_t^T N_t^{-1}C_t R_t M_{a,b}^{-1}R_b^T.$$

From whence we find

$$Z_t^p = \Lambda_{b,t}^T - \sum_{j=t}^{b-1}\Lambda_{j,t}^T C_j^T N_j^{-1}C_j R_j M_{a,b}^{-1}R_b^T.$$

We have

$$G_{2t}Z_{t+1}^p = N_t^{-1}C_t R_t M_{a,t+1}^{+}R_{t+1}^T\Big(\Lambda_{b,t+1}^T - \sum_{j=t+1}^{b-1}\Lambda_{j,t+1}^T C_j^T N_j^{-1}C_j R_j M_{a,b}^{-1}R_b^T\Big)$$

$$= N_t^{-1}C_t R_t M_{a,t+1}^{+}\Big(I - \sum_{j=t+1}^{b-1}R_j^T C_j^T N_j^{-1}C_j R_j M_{a,b}^{-1}\Big)R_b^T = N_t^{-1}C_t R_t M_{a,t+1}^{+}M_{a,t+1}M_{a,b}^{-1}R_b^T. \tag{5.74}$$


The expression (5.74) follows from the identities

$$R_{t+1} = \Lambda_{t+1,a}\Psi, \quad \Lambda_{b,t+1}\Lambda_{t+1,a} = \Lambda_{b,a}, \quad \Lambda_{j,t+1}R_{t+1} = \Lambda_{j,t+1}\Lambda_{t+1,a}\Psi = \Lambda_{j,a}\Psi = R_j.$$

Let us transform Eq. (5.74) using the decomposition $M_{a,t} = V_t V_t^T$, where

$$V_t = \big(R_a^T C_a^T N_a^{-1/2},\; R_{a+1}^T C_{a+1}^T N_{a+1}^{-1/2},\; \ldots,\; R_t^T C_t^T N_t^{-1/2}\big). \tag{5.75}$$

Let $l_{1t}, l_{2t}, \ldots, l_{k(t),t}$ be linearly independent columns of the matrix $V_t$. Using the skeletal decomposition yields

$$V_t = L_t\Gamma_t,$$

where $L_t = (l_{1t}, l_{2t}, \ldots, l_{k(t),t}) \in R_{k(t)}^{(n-q)\times k(t)}$, $\Gamma_t \in R_{k(t)}^{k(t)\times m(t-a)}$, and $R_r^{k\times l}$ is the set of $k\times l$ matrices of rank $r$. Since $\tilde\Gamma_t = \Gamma_t\Gamma_t^T > 0$ is the Gram matrix constructed from the linearly independent rows of the matrix $\Gamma_t$ and $\operatorname{rank}(L_t) = \operatorname{rank}(\tilde\Gamma_t L_t^T)$, then

$$M_{a,t} = L_t\tilde\Gamma_t L_t^T, \quad M_{a,t}^{+} = \big(L_t\tilde\Gamma_t L_t^T\big)^{+} = (L_t^T)^{+}\,\tilde\Gamma_t^{-1}L_t^{+}.$$

Substituting these expressions in Eq. (5.74) and using the representation $L_t^{+} = (L_t^T L_t)^{-1}L_t^T$, we find

$$G_{2t}Z_{t+1}^p = N_t^{-1}C_t R_t (L_t^T)^{+}\,\tilde\Gamma_t^{-1}L_t^{+}L_t\tilde\Gamma_t L_t^T M_{a,b}^{-1}R_b^T = N_t^{-1}C_t R_t L_t (L_t^T L_t)^{-1}L_t^T M_{a,b}^{-1}R_b^T. \tag{5.76}$$

From Eq. (5.75) it follows that $R_t^T C_t^T\tilde N_t = L_t\Gamma_{1t}$, where $\Gamma_{1t}$ is some rectangular matrix. Substituting this expression into Eq. (5.76) gives

$$G_{2t}Z_{t+1}^p = \tilde N_t\Gamma_{1t}^T L_t^T L_t (L_t^T L_t)^{-1}L_t^T M_{a,b}^{-1}R_b^T = N_t^{-1}C_t R_t M_{a,b}^{-1}R_b^T.$$

The assertion follows from this.

Theorem 5.3.
1. The state of the system

$$\hat x_{t+1} = A_t\hat x_t + K_t\big(y_t - C_t\hat x_t\big), \quad \hat x_a = (\tilde m_a^T, 0_{n-q}^T)^T, \quad t \in T, \tag{5.77}$$

for $t \ge t_{tr}$, $t_{tr} = \min_t\{t : W_{a,t} > 0,\; t = a, a+1, \ldots\}$, is the optimal estimate, where

$$K_t = \big(A_t S_t + A_{1t}R_t M_{a,t+1}^{+}R_t^T\big)C_t^T N_t^{-1}, \tag{5.78}$$

$$S_{t+1} = A_t S_t A_t^T - A_t S_t C_t^T N_t^{-1}C_t S_t A_t^T + V_{2t}, \quad S_a = \tilde S_a, \tag{5.79}$$

$$R_{t+1} = \big(A_t - A_t S_t C_t^T N_t^{-1}C_t\big)R_t, \quad R_a = (e_{q+1}, \ldots, e_n), \tag{5.80}$$

$$M_{a,t+1} = M_{a,t} + R_t^T C_t^T N_t^{-1}C_t R_t, \quad M_{a,a} = 0. \tag{5.81}$$

2. The estimation error covariance matrix for $t \ge t_{tr}$ is given by the expression

$$P_t = S_t + R_t M_{a,t}^{-1}R_t^T. \tag{5.82}$$
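The recursion of Theorem 5.3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `dkf_step` and `dkf_vs_least_squares`, the tolerance `rcond`, and the test system are assumptions introduced here. The demo takes the case $q = 0$ (no a priori information, so $R_a = I_n$, $S_a = 0$) with $V_{2t} = 0$, in which the estimate for $t \ge t_{tr}$ should coincide with the ordinary least squares estimate propagated to time $t$.

```python
import numpy as np

def dkf_step(x, S, R, M, y, A, C, V1, V2, rcond=1e-10):
    """One step of Eqs. (5.77)-(5.81); rcond is an assumed pseudoinverse cutoff."""
    N = C @ S @ C.T + V1                         # N_t, Eq. (5.61)
    N_inv = np.linalg.inv(N)
    A1 = A - A @ S @ C.T @ N_inv @ C             # A_{1t}, Eq. (5.61)
    M_next = M + R.T @ C.T @ N_inv @ C @ R       # Eq. (5.81)
    M_pinv = np.linalg.pinv(M_next, rcond=rcond)
    K = (A @ S + A1 @ R @ M_pinv @ R.T) @ C.T @ N_inv   # Eq. (5.78)
    x_next = A @ x + K @ (y - C @ x)             # Eq. (5.77)
    S_next = A @ S @ A.T - A @ S @ C.T @ N_inv @ C @ S @ A.T + V2  # Eq. (5.79)
    R_next = A1 @ R                              # Eq. (5.80)
    return x_next, S_next, R_next, M_next

def dkf_vs_least_squares(steps=6, seed=0):
    """Hypothetical demo: compare the filter with batch least squares (q = 0)."""
    rng = np.random.default_rng(seed)
    A = np.array([[0.9, 0.5], [-0.5, 0.9]])
    C = np.array([[1.0, 0.0]])
    n = 2
    x_true = np.array([1.0, -2.0])
    x_hat, S, R, M = np.zeros(n), np.zeros((n, n)), np.eye(n), np.zeros((n, n))
    rows, ys, Phi = [], [], np.eye(n)
    for _ in range(steps):
        y = C @ x_true + 0.1 * rng.standard_normal(1)   # noisy measurement
        rows.append(C @ Phi)
        ys.append(y)
        x_hat, S, R, M = dkf_step(x_hat, S, R, M, y, A, C,
                                  np.eye(1), np.zeros((n, n)))
        x_true, Phi = A @ x_true, A @ Phi
    # Batch least squares estimate of the initial state, propagated forward
    theta, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(ys), rcond=None)
    return x_hat, Phi @ theta
```

Calling `dkf_vs_least_squares()` returns the recursive estimate and the propagated batch estimate, which agree up to rounding once $M_{a,t}$ has become nonsingular. The covariance of Eq. (5.82) can then be formed from the returned `S`, `R`, and `M`.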

Proof. 1. Put in Eq. (5.4)

$$Y_{b-1,t} = (U_t^p)^T, \quad \Omega_{b-1,a} = Z_a^T(e_1, \ldots, e_q), \quad b \ge t_{tr},$$

where $U_t^p$ and $Z_a$ were determined earlier in Lemma 5.6. This implies the optimality of the estimate at the moment $t = b$ and the relation

$$\hat x_b = \sum_{t=a}^{b-1}(U_t^p)^T y_t + Z_a^T(\tilde m_a^T, 0_{n-q}^T)^T.$$

From Lemma 5.8 it follows that

$$U_t^p = U_t^f = N_t^{-1}C_t\big(S_t A_t^T + R_t M_{a,t+1}^{+}R_{t+1}^T\big)Z_{t+1}, \quad t = b-1, b-2, \ldots, a.$$

In view of this

$$\hat x_t = \sum_{s=a}^{t-1}(U_s^f)^T y_s + Z_a^T(\tilde m_a^T, 0_{n-q}^T)^T = \sum_{s=a}^{t-1}H_{s+1,t}^T K_s y_s + H_{a,t}^T(\tilde m_a^T, 0_{n-q}^T)^T, \tag{5.83}$$

where

$$H_{s,t} = \big(A_s^T - C_s^T K_s^T\big)H_{s+1,t}, \quad H_{t,t} = I_n, \quad s \le t, \; t \in T.$$

By Eq. (5.83) we find

$$\hat x_{t+1} - \hat x_t = K_t y_t + \sum_{s=a}^{t-1}\big(H_{s+1,t+1}^T - H_{s+1,t}^T\big)K_s y_s + \big(H_{a,t+1}^T - H_{a,t}^T\big)(\tilde m_a^T, 0_{n-q}^T)^T. \tag{5.84}$$

Using the identities

$$H_{s+1,t+1} = H_{s+1,t}H_{t,t+1}, \quad H_{a,t+1} = H_{a,t}H_{t,t+1}, \quad H_{t,t+1} = A_t^T - C_t^T K_t^T,$$

we transform this expression to the form which coincides with Eq. (5.77):

$$\hat x_{t+1} - \hat x_t = K_t y_t + \big(H_{t,t+1}^T - I_n\big)\sum_{s=a}^{t-1}H_{s+1,t}^T K_s y_s + \big(H_{t,t+1}^T - I_n\big)H_{a,t}^T(\tilde m_a^T, 0_{n-q}^T)^T$$

$$= K_t y_t + \big(A_t - K_t C_t - I_n\big)\Big[\sum_{s=a}^{t-1}H_{s+1,t}^T K_s y_s + H_{a,t}^T(\tilde m_a^T, 0_{n-q}^T)^T\Big] = K_t y_t + \big(A_t - K_t C_t - I_n\big)\hat x_t.$$

Putting $t = a$ in Eq. (5.83) gives $\hat x_a = H_{a,a}^T(\tilde m_a^T, 0_{n-q}^T)^T = (\tilde m_a^T, 0_{n-q}^T)^T$.

2. It follows from the expression $\operatorname{Tr}\big(S_t + R_t M_{a,t}^{-1}R_t^T\big) = E\big((x_t - \hat x_t)^T(x_t - \hat x_t)\big)$.

Comment 5.3.1. We have established the following methodologically important result. The diffuse initialization of the KF leads to relations identical to those obtained by solving a special optimization problem under the assumption that a priori information about the initial state of the system Eq. (5.1) is incomplete or absent.

5.4 SYSTEMS STATE RECOVERY IN A FINITE NUMBER OF STEPS

Consider the system Eq. (5.1) in the absence of random disturbances

$$x_{t+1} = A_t x_t, \quad y_t = C_t x_t, \quad t \in T = \{a, a+1, \ldots\}. \tag{5.85}$$

It is required to restore the vector $x_t$ using the observations $y_t$, $t \in T$. To solve this problem we use an observer based on the design of the proposed DKF

$$\hat x_{t+1}^{dif} = A_t\hat x_t^{dif} + K_t^{dif}\big(y_t - C_t\hat x_t^{dif}\big), \quad \hat x_a^{dif} = 0, \quad t \in T = \{a, a+1, \ldots\}, \tag{5.86}$$

where

$$K_t^{dif} = A_t\Phi_{t,a}W_{t+1}^{+}\Phi_{t,a}^T C_t^T, \quad W_{t+1} = W_t + \Phi_{t,a}^T C_t^T C_t\Phi_{t,a}, \quad W_a = 0_{(n-q)\times(n-q)}.$$

This observer restores the system state in a finite number of steps. Indeed, the recovery error $h_t = \hat x_t - x_t$ satisfies the system of equations $h_{t+1} = (A_t - K_t^{dif}C_t)h_t$ with arbitrary unknown initial conditions. Using Lemma 5.3 with $S_t = 0$, $V_{1t} = 0_{m\times m}$ gives

$$H_{t,s} = \Phi_{t,s} - \Phi_{t,s}W_t^{-1}\sum_{i=s}^{t-1}\Phi_{i,s}^T C_i^T C_i\Phi_{i,s},$$

where

$$\Phi_{t+1,s} = A_t\Phi_{t,s}, \quad \Phi_{s,s} = I_n, \quad t \ge s.$$

Whence it follows that $H_{t,a} = 0$ when $t \ge t_{tr}$ and therefore $h_t = 0$.

Consider the generalization to the case when initial conditions are specified only for the first $q$ components of the state vector of Eq. (5.85). Let

$$K_t^{dif} = A_t R_t W_{t+1}^{+}R_t^T C_t^T, \tag{5.87}$$

where

$$R_{t+1} = A_t R_t, \quad R_a = (e_{q+1}, \ldots, e_n), \tag{5.88}$$

$$W_{t+1} = W_t + R_t^T C_t^T C_t R_t, \quad W_a = 0_{(n-q)\times(n-q)}. \tag{5.89}$$

Using Lemma 5.3 with $S_t = 0$ gives

$$H_{t,s} = \Phi_{t,s} - R_t W_t^{-1}\sum_{i=s}^{t-1}R_i^T C_i^T C_i\Phi_{i,s}, \quad t \ge t_{tr}, \; s < t.$$

It follows from this that $h_t = 0$ when $t \ge t_{tr}$.
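For the case without a priori information, the step from the expression for $H_{t,s}$ to the conclusion $H_{t,a} = 0$ can be written out explicitly. The following worked equation is a sketch that uses only the recursion for $W_t$ with $W_a = 0$ and the assumption that $W_t$ is nonsingular for $t \ge t_{tr}$:

$$W_t = \sum_{i=a}^{t-1}\Phi_{i,a}^T C_i^T C_i\Phi_{i,a} \quad\Longrightarrow\quad H_{t,a} = \Phi_{t,a} - \Phi_{t,a}W_t^{-1}W_t = 0, \quad t \ge t_{tr},$$

so that $h_t = H_{t,a}h_a = 0$ for any initial error $h_a$.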

5.5 FILTERING WITH THE SLIDING WINDOW

It is well known that the KF may have unacceptable estimation accuracy if the system model is not adequate to the real process. In Refs. [60,82] a filter with limited memory that is robust with respect to perturbations was proposed. The idea is to use a system model that is adequate to the real process not over the entire interval of observation but only on a limited moving time interval (sliding window). The DKF with a sliding window considered below belongs to this type of algorithm. Besides the fact that it uses a model adequate within the sliding window, the DKF estimate has an important statistical property: unlike the KF, it gives an unbiased estimate in a finite number of steps. This means that for exact measurements it restores the system state in a finite number of steps.

Consider the observation intervals $[t-N, t]$, $t \in T$ (sliding windows), where $N > t_{tr}$, and assume that at the moment $t-N$ there is a priori information only about the first $q$ components of the vector $x_{t-N}$. Using the DKF, it is easy to write the following relationships for estimating the state of the system Eq. (5.1) at the moment $t$ based on a sliding window of the latest $N$ observations:

$$\hat x_{s+1}^{dif} = A_s\hat x_s^{dif} + K_s^{dif}\big(y_s - C_s\hat x_s^{dif}\big), \quad \hat x_{t-N}^{dif} = (m_a^T, 0_{1\times(n-q)})^T, \tag{5.90}$$

where

$$K_s^{dif} = \big(A_s S_s + A_{1s}R_s W_{t-N,s+1}^{+}R_s^T\big)C_s^T N_{1s}^{-1}, \tag{5.91}$$

$$S_{s+1} = A_s S_s A_s^T - A_s S_s C_s^T N_{1s}^{-1}C_s S_s A_s^T + V_{2s}, \quad S_{t-N} = \operatorname{block\,diag}\big(S_a, 0_{(n-q)\times(n-q)}\big), \tag{5.92}$$

$$R_{s+1} = A_{1s}R_s, \quad R_{t-N} = (e_{q+1}, \ldots, e_n), \tag{5.93}$$

$$A_{1s} = A_s - A_s S_s C_s^T N_{1s}^{-1}C_s, \quad N_{1s} = C_s S_s C_s^T + V_{1s}, \tag{5.94}$$

$$W_{t-N,s+1} = W_{t-N,s} + R_s^T C_s^T N_{1s}^{-1}C_s R_s, \quad W_{t-N,t-N} = 0, \tag{5.95}$$

$s = t-N, t-N+1, \ldots, t-1$, $t = a+N, a+N+1, \ldots$.

Let us illustrate the effect of using the DKF with a sliding window under parametric uncertainty.

Example 5.1. Given a sampled model of the aircraft engine [83]

$$x_{t+1} = Ax_t + Bw_t = \begin{pmatrix} 0.90305+\delta_t & 0 & 0.1107 \\ 0.0077 & 0.9802+\delta_t & -0.0173 \\ 0.0142 & 0 & 0.8953+0.1\delta_t \end{pmatrix}x_t + \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}w_t,$$

taking into account the delay in the measuring channel

$$y_t = Cx_{t-\tau} + D\xi_t = \begin{pmatrix} 1+0.1\delta_t & 0 & 0 \\ 0 & 1+0.1\delta_t & 0 \end{pmatrix}x_{t-\tau} + \begin{pmatrix} 1 \\ 1 \end{pmatrix}\xi_t,$$

where $w_t$ and $\xi_t$ are uncorrelated random processes with zero expectation and variances $E(w_t^2) = 0.02$, $E(\xi_t^2) = 0.02$, $\tau$ is a delay, and $\delta_t$ is an unknown disturbance caused by a temperature change of the turbine. We want to compare the KF and DKF estimation errors under the action of the unknown disturbance $\delta_t$ on the engine. Both filters are constructed without consideration of its action and it is assumed that

$$\delta_t = \begin{cases} 0.1, & t \in [50, 100], \\ 0, & t \notin [50, 100]. \end{cases}$$

We transform the model to the standard form without delay when $\tau = 1$:

$$z_{t+1} = \tilde A z_t + \tilde B w_t, \quad v_t = \tilde C z_t + \tilde D\tilde\xi_t,$$

where

$$z_t = \begin{pmatrix} x_t \\ Cx_{t-1} \end{pmatrix}, \quad \tilde A = \begin{pmatrix} A & 0 \\ C & 0 \end{pmatrix}, \quad \tilde B = \begin{pmatrix} B \\ 0 \\ 0 \end{pmatrix}, \quad \tilde C = \begin{cases} 0_{2\times 5}, & t = 0, \\ [0_{2\times 3}, I_2], & t \ge 1, \end{cases} \quad \tilde\xi_t = \begin{cases} 0, & t = 0, \\ \xi_t, & t \ge 1. \end{cases}$$

When constructing the KF we assume that the initial condition for the state vector is known exactly and $x_0 = 0$. The DKF is used in a sliding window mode, assuming no a priori information about the initial state vector of the plant. Since the dynamics matrix is singular and a priori information is absent, the condition $W_{a,t} > 0$ is only sufficient for the existence of an optimal estimate. It holds from the moment $t_{tr} = 3$. Fig. 5.1 shows the simulation results: the estimation errors $e(t) = \hat z_{1t} - z_{1t}$ of the KF and of the DKF with $N = 20$ (curves 1 and 2, respectively). It is seen that the errors of the DKF are significantly smaller than the KF errors on the interval where the perturbation acts. In addition, after the disturbance ends the DKF converges significantly faster than the KF.

Figure 5.1 Dependencies of the estimation errors e(t) on the time t for the KF and the DKF (curves 1 and 2, respectively).

We note that the presented results show that the DKF with a sliding window gives an unbiased estimate when $N > t_{tr}$. However, in many cases the statistical spread of the estimates is also of interest; it can be significant even, as was shown in Chapter 3, in the estimation of linear regression parameters. So the choice of $N$ should be accompanied by a correlation analysis of the algorithm used. In problems of state estimation for the system Eq. (5.1), such an analysis can be based on the expression (5.82).
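The sliding-window scheme of Eqs. (5.90)-(5.95) can be illustrated with a short NumPy sketch. It is written under simplifying assumptions that are not part of the text: a time-invariant pair `A`, `C`, no a priori information ($q = 0$, so each window starts from $\hat x = 0$, $S = 0$, $R = I_n$), and the filter is simply restarted for every window. The function names and the test system with an unmodeled state jump are hypothetical.

```python
import numpy as np

def window_estimate(ys, A, C, V1, V2, rcond=1e-10):
    """Run Eqs. (5.90)-(5.95) over one window (q = 0) and return the
    estimate of the state at the end of the window."""
    n = A.shape[0]
    x, S, R, W = np.zeros(n), np.zeros((n, n)), np.eye(n), np.zeros((n, n))
    for y in ys:
        N1 = C @ S @ C.T + V1                        # Eq. (5.94)
        N1_inv = np.linalg.inv(N1)
        A1 = A - A @ S @ C.T @ N1_inv @ C            # Eq. (5.94)
        W = W + R.T @ C.T @ N1_inv @ C @ R           # Eq. (5.95)
        W_pinv = np.linalg.pinv(W, rcond=rcond)
        K = (A @ S + A1 @ R @ W_pinv @ R.T) @ C.T @ N1_inv   # Eq. (5.91)
        x = A @ x + K @ (y - C @ x)                  # Eq. (5.90)
        S = A @ S @ A.T - A @ S @ C.T @ N1_inv @ C @ S @ A.T + V2  # Eq. (5.92)
        R = A1 @ R                                   # Eq. (5.93)
    return x

def sliding_window_dkf(ys, A, C, V1, V2, N):
    """Estimates of x_t for t = N, ..., len(ys), each from the last N outputs."""
    return [window_estimate(ys[t - N:t], A, C, V1, V2)
            for t in range(N, len(ys) + 1)]

def jump_demo(N=4):
    """Hypothetical noise-free data with an unmodeled jump between t = 9 and 10."""
    A = np.array([[0.9, 0.5], [-0.5, 0.9]])
    C = np.array([[1.0, 0.0]])
    x = np.array([1.0, -2.0])
    xs, ys = [x], []
    for t in range(20):
        ys.append(C @ x)
        x = A @ x
        if t == 9:
            x = x + np.array([3.0, 1.0])   # disturbance the model ignores
        xs.append(x)
    est = sliding_window_dkf(ys, A, C, np.eye(1), np.zeros((2, 2)), N)
    return xs, est
```

In `jump_demo` the estimate is wrong only while the window still contains the jump; as soon as the window lies entirely after it, the state is again restored exactly, which is the finite-memory behavior the section describes. Restarting the recursion for every window costs $O(N)$ filter steps per estimate; it is chosen here for clarity only.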

5.6 DIFFUSE ANALOG OF THE EXTENDED KALMAN FILTER

Consider a nonlinear dynamical system of the form

$$x_{t+1} = F_t(u_t, x_t) + w_t, \tag{5.96}$$

$$y_t = Z_t(x_t) + \xi_t, \quad t \in T = \{a, a+1, \ldots\}, \tag{5.97}$$

where $x_t \in R^n$ is the state vector, $y_t \in R^m$ is the measured output vector, $u_t \in R^n$ is the known input vector, $F_t(u_t, x_t)$, $Z_t(x_t)$ are the given functions, $w_t \in R^n$ and $\xi_t \in R^m$ are uncorrelated random processes with zero expectation and the covariances $E(w_t w_t^T) = Q_t$, $E(\xi_t\xi_t^T) = R_t$, and the random vector $x_a$ is not correlated with $w_t$ and $\xi_t$ for $t \in T$. Suppose that conditions A1-A5 or B1-B4 are satisfied and it is required to write the relations for the diffuse extended KF (DEKF) using Theorem 5.1 or 5.3.

Represent the relations for the DKF in the following form. Between observations:

$$x_{t+1}^{-} = A_t x_t^{+}, \tag{5.98}$$

$$S_{t+1}^{-} = A_t S_t^{+}A_t^T + V_{2t}, \tag{5.99}$$

$$R_{t+1} = A_t\big(I_n - S_t^{-}C_t^T N_t^{-1}C_t\big)R_t, \tag{5.100}$$

$$W_{t+1} = W_t + R_t^T C_t^T N_t^{-1}C_t R_t, \tag{5.101}$$

$$N_t = C_t S_t^{-}C_t^T + V_{1t}. \tag{5.102}$$

After receiving the observations:

$$K_t^{dif} = S_t^{-}C_t^T N_t^{-1} + \big(I_n - S_t^{-}C_t^T N_t^{-1}C_t\big)R_t W_{t+1}^{+}R_t^T C_t^T N_t^{-1}, \tag{5.103}$$

$$x_t^{+} = x_t^{-} + K_t^{dif}\big(y_t - C_t x_t^{-}\big), \tag{5.104}$$

$$S_t^{+} = \big(I_n - S_t^{-}C_t^T N_t^{-1}C_t\big)S_t^{-}. \tag{5.105}$$

The initial conditions for the filter are determined by the expressions

$$x_a^{-} = (\tilde m_a^T, 0_{n-q}^T)^T, \quad S_a^{-} = \operatorname{block\,diag}\big(S_a, 0_{(n-q)\times(n-q)}\big), \tag{5.106}$$

$$R_a = (e_{q+1}, \ldots, e_n), \quad W_a = 0_{(n-q)\times(n-q)}. \tag{5.107}$$

The DEKF is obtained by replacing Eq. (5.98) by

$$x_{t+1}^{-} = F_t(u_t, x_t^{+}) \tag{5.108}$$

and using the linearization

$$A_t = \partial F_t(u_t, x_t^{+})/\partial x_t^{+}, \quad C_t = \partial Z_t(x_t)/\partial x_t. \tag{5.109}$$
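When analytic derivatives are inconvenient, the matrices $A_t$ and $C_t$ of the linearization can be approximated numerically. The sketch below is an illustration only: the central-difference scheme, the step `eps`, and the function names are assumptions introduced here, and the state map of the plant from Example 5.2 (with a fixed input) serves merely as a test function whose Jacobian is known in closed form.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

def plant(x, u=0.5):
    """State map of the plant in Example 5.2; u is held fixed for the check."""
    x1, x2 = x
    d = 1 + x2**2
    return np.array([(x1 + 2 * x2) / d + u, x1 * x2 / d + u])

def plant_jacobian(x):
    """Closed-form Jacobian of `plant` with respect to the state."""
    x1, x2 = x
    d = 1 + x2**2
    return np.array([[1 / d, (2 * d - 2 * x2 * (x1 + 2 * x2)) / d**2],
                     [x2 / d, x1 * (1 - x2**2) / d**2]])
```

For example, `numerical_jacobian(plant, [0.3, -0.7])` can be compared with `plant_jacobian([0.3, -0.7])` before the numerical version is used inside a filter loop.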

5.7 RECURRENT NEURAL NETWORK TRAINING

Let the recurrent NN be described by Eqs. (4.210) and (4.211). Then the state space model at the training stage takes the form

$$p_{t+1} = \Psi(u_t, p_t, \beta_t) + w_t^p, \tag{5.110}$$

$$\beta_{t+1} = \beta_t + w_t^\beta, \tag{5.111}$$

$$\alpha_{t+1} = \alpha_t + w_t^\alpha, \tag{5.112}$$

$$y_t = \Phi(p_t)\alpha_t + \xi_t, \quad t = 1, 2, \ldots, N, \tag{5.113}$$

with some small random initial conditions $p_0$, $\beta_0$, where $\Psi(u_t, p_t, \beta_t)$, $\Phi(p_t)$, $p_t$, $\beta_t$, $\alpha_t$ are defined by Eqs. (4.212) and (4.213), and $w_t^p$, $w_t^\beta$, $w_t^\alpha$ are uncorrelated artificially added noises with

$$E(w_t^p) = 0, \; Q_p = E[w_t^p(w_t^p)^T]; \quad E(w_t^\beta) = 0, \; Q_\beta = E[w_t^\beta(w_t^\beta)^T]; \quad E(w_t^\alpha) = 0, \; Q_\alpha = E[w_t^\alpha(w_t^\alpha)^T].$$

At the prediction stage the state space model has the form

$$p_{t+1} = \Psi(u_t, p_t, \beta^*) + w_t^p, \tag{5.114}$$

$$y_t = \Phi(p_t)\alpha^* + \xi_t, \quad t = 1, 2, \ldots, N, \tag{5.115}$$

with small random initial conditions $p_0$, where $\beta^*$, $\alpha^*$ are the parameter estimates obtained by training. The matrix functions $A_t$, $C_t$ in Eq. (5.109) can be easily determined using the relations

$$\frac{\partial}{\partial p_t}\Psi_i(u_t, p_t, \beta_t) = \frac{d\sigma(x)}{dx}a_i, \quad \frac{\partial}{\partial u_t}\Psi_i(u_t, p_t, \beta_t) = \frac{d\sigma(x)}{dx}b_i, \quad \frac{\partial}{\partial d_i}\Psi_i(u_t, p_t, \beta_t) = \frac{d\sigma(x)}{dx}, \quad i = 1, 2, \ldots, q,$$

$$\frac{\partial}{\partial c_j}\Phi_j(p_t) = p_t, \quad j = 1, 2, \ldots, m,$$

where $x = a_i^T p_{t-1} + b_i^T u_t + d_i$.

Example 5.2. Let a plant be described by the system of equations [84,85]

$$x_{1,t+1} = \frac{x_{1t} + 2x_{2t}}{1 + x_{2t}^2} + u_t, \quad x_{2,t+1} = \frac{x_{1t}x_{2t}}{1 + x_{2t}^2} + u_t, \quad y_t = x_{1t} + x_{2t}, \quad t = 1, 2, \ldots, N. \tag{5.116}$$

To simulate the plant we used the recurrent NN Eqs. (4.210) and (4.211) with

$$q = 10, \; m = 1, \; N = 900, \; M = 100 \text{ (number of test points)}, \; Q = \operatorname{block\,diag}(Q_p, Q_\beta, Q_\alpha) = 0.001 I_{140}, \; R = 1, \quad p_0, \beta_0, u_t \sim 0.001\times\operatorname{rand}(-1, 1).$$

The generalization ability of the recurrent NN was additionally checked using the harmonic input signal

$$u_t = \sin(2\pi/t), \quad t = 1, 2, \ldots, N.$$

The simulation results are shown in Figs. 5.2–5.4.

Figure 5.2 Dependencies of the DEKF output (curve 1) and the plant output (curve 2) on the number of observations in the training set.

Figure 5.3 Dependencies of the DEKF output (curve 1) and the plant output (curve 2) on the number of observations in the testing set.

Figure 5.4 Dependencies of the DEKF output (curve 1) and the plant output (curve 2) on the number of observations in the testing set with the input signal $u_t = \sin(2\pi/t)$.

5.8 SYSTEMS WITH PARTLY UNKNOWN DYNAMICS

Consider the problem of estimating the state of the system Eq. (5.96) when its right-hand side contains some level of uncertainty:

$$x_{t+1} = F_t(u_t, x_t) + \varepsilon(u_t, x_t) + w_t, \tag{5.117}$$

$$y_t = Z_t(x_t) + \xi_t, \quad t = 1, 2, \ldots, N, \tag{5.118}$$

where $\varepsilon(u_t, x_t)$ is the difference between the complete and the approximate model of the plant. Assume that $\varepsilon(u_t, x_t)$ is a continuous function. We can use a perceptron with one hidden layer and a linear activation function in the output layer, the RBNN, or the Sugeno fuzzy system, described by the separable regression

$$\varepsilon(u_t, x_t) = \Phi(z_t, \beta)\alpha, \tag{5.119}$$

for the approximation of $\varepsilon(u_t, x_t)$ on compact sets, where $z_t = (x_t^T, u_t^T)^T$. Thus, the problem reduces to the simultaneous estimation of the state of Eq. (5.117) and the parameters (if they are not known) from the measured outputs using the DEKF.

Example 5.3. Let a plant be described by the equations

$$x_{1,t+1} = \frac{x_{1t} + 2x_{2t}}{1 + x_{2t}^2} + u_t + w_t = f_{1t}, \tag{5.120}$$

$$x_{2,t+1} = \frac{x_{1t}x_{2t}}{1 + x_{2t}^2} + u_t + w_t = f_{2t}, \tag{5.121}$$

$$y_t = x_{1t} + x_{2t} + \xi_t, \quad t = 1, 2, \ldots, N, \tag{5.122}$$

where $u_t = 10\sin(2\pi/t)$, $w_t \sim N(0, 0.7)$, $\xi_t \sim N(0, 0.1)$.

Figure 5.5 Dependencies of $x_{1t}$ (curve 1) and of its estimates for a known (curve 2) and an unknown $f_{2t}$ (curve 3) on the number of observations.

Figure 5.6 Dependencies of $x_{2t}$ (curve 1) and of its estimates for a known (curve 2) and an unknown $f_{2t}$ (curve 3) on the number of observations.

Suppose that $f_{2t}$ is an unknown function and that we use for its approximation a perceptron with one hidden layer

$$f_{2t} = \Phi(z_t, \beta)\alpha = \big(\sigma(a_1 z_t + b_1), \sigma(a_2 z_t + b_2), \ldots, \sigma(a_r z_t + b_r)\big)\alpha,$$

where $z_t = (x_{1t}, x_{2t}, u_t)^T$; $a_k$, $b_k$, $k = 1, 2, \ldots, r$, are weights and biases, respectively; and $\sigma(\cdot)$ is the sigmoid function. Let the weights and biases be selected from a uniform distribution,

$$a_k, b_k \sim 0.01\times\operatorname{rand}(-1, 1), \quad k = 1, 2, \ldots, r,$$

and let the parameter $\alpha$ be unknown. The simulation results for three neurons are shown in Figs. 5.5 and 5.6.
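The regressor $\Phi(z_t, \beta)$ of the separable model can be computed as in the following sketch. It covers only the hidden-layer features with the random initialization described above; the estimation of $\alpha$ by the DEKF is not shown. The function names, the seed, and the logistic form of the sigmoid are assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid (assumed form of the activation function)."""
    return 1.0 / (1.0 + np.exp(-v))

def init_hidden(r, dim, scale=0.01, seed=0):
    """Weights a_k (rows of a) and biases b_k drawn as scale * rand(-1, 1)."""
    rng = np.random.default_rng(seed)
    a = scale * rng.uniform(-1.0, 1.0, size=(r, dim))
    b = scale * rng.uniform(-1.0, 1.0, size=r)
    return a, b

def hidden_features(z, a, b):
    """Phi(z, beta) = (sigma(a_1 z + b_1), ..., sigma(a_r z + b_r))."""
    return sigmoid(a @ z + b)

def model_output(z, a, b, alpha):
    """Approximation f_2 = Phi(z, beta) alpha, linear in the unknown alpha."""
    return hidden_features(z, a, b) @ alpha
```

With three neurons and $z_t = (x_{1t}, x_{2t}, u_t)^T$, `init_hidden(3, 3)` gives a 3-by-3 weight matrix, and because the initial weights are small every feature starts close to 0.5.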

CHAPTER 6

Applications of Diffuse Algorithms

Contents
6.1 Identification of the Mobile Robot Dynamics
6.2 Modeling of Hysteretic Deformation by Neural Networks
6.3 Harmonics Tracking of Electric Power Networks

Diffuse Algorithms for Neural and Neuro-Fuzzy Networks. DOI: http://dx.doi.org/10.1016/B978-0-12-812609-7.00006-8
© 2017 Elsevier Inc. All rights reserved.

6.1 IDENTIFICATION OF THE MOBILE ROBOT DYNAMICS

The mobile robot (MR) Robotino shown in Fig. 6.1 is an autonomous mobile platform with three “omnidirectional” wheels [86]. The robot is driven by three DC motors whose axes are arranged at angles of 120° to each other. The rotation of each motor shaft is transmitted to the axis of the corresponding wheel via a gearbox with a gear ratio of 16:1. The robot has a set of commands that allow setting and measuring the angular speeds of the motor shaft rotation

$$\omega^r(t) = \big(\omega_1^r(t), \omega_2^r(t), \omega_3^r(t)\big)^T, \quad \omega^{out}(t) = \big(\omega_1^{out}(t), \omega_2^{out}(t), \omega_3^{out}(t)\big)^T.$$

The angular velocities are measured using an incremental tachometer. Robotino runs under an embedded Linux operating system, and autonomous operation is provided by batteries.

Figure 6.1 Robotino exterior view.

The choice of the control that moves the MR along a predetermined path is one of the major tasks arising in its design. Assume that the rotational speeds of the motor shafts perfectly follow the input signals, the sensors used have high accuracy, and there is no wheel slip. Then the solution of this task can be obtained on the basis of the kinematics equations alone. However, neglecting the influence of the dynamics may limit the ability of the MR in real situations. We will consider the problem of constructing mathematical models of the MR from experimental data linking the values of the given angular velocities $\omega^r(t)$ with their measured values $\omega^{out}(t)$. An external control program was written in C that, with the help of appropriate commands, sets the desired angular velocities of the robot $\omega^r(t)$ and receives the corresponding measurements of the speeds $\omega^{out}(t)$. According to the results of experimental studies it was found that the MR frequency bandwidth does not exceed 8 Hz. Transients for each engine, shown in Fig. 6.2, were obtained and the sampling rate was chosen with their help. It was also shown that the interference between the engines is weak, and therefore models of the MR can be built independently for each input and output.

Figure 6.2 Transient performance of engines.

Test signals representing Gaussian white noise were generated for each input. Noise intensities were chosen according to the “3 sigma” rule to keep the angular speeds of the motor shafts in the range $\omega_i^r(t) \in [-\omega_{max}^r, \omega_{max}^r]$, $i = 1, 2, 3$ (rev/min). Furthermore, the generated input signals were first passed through a Butterworth low-pass filter with a bandwidth equal to that of the MR, 8 Hz. Figs. 6.3–6.5 show fragments of the input and the corresponding output angular speeds of the motors. A difference between inputs and outputs is clearly visible; it is due to a delay, the dynamics of the plant, and the possible influence of nonlinearities.

Figure 6.3 Fragment of input-output data for identification; $\omega_1^{out}(t)$ is shown by a solid line, $\omega_1^r(t)$ by a dashed line.

Figure 6.4 Fragment of input-output data for identification; $\omega_2^{out}(t)$ is shown by a solid line, $\omega_2^r(t)$ by a dashed line.

Figure 6.5 Fragment of input-output data for identification; $\omega_3^{out}(t)$ is shown by a solid line, $\omega_3^r(t)$ by a dashed line.

Figure 6.6 Spectral densities of the input signals (Welch power spectral density estimate, dB/Hz versus frequency in Hz).

Identification of models was carried out using a sample of 1000 points for each input and output. The data obtained were divided into two parts: training (850 points) and testing (150 points). The first part was used to build models and the second to test them. To obtain measurements equally spaced in time, the input/output data were linearly interpolated with the step 0.0156 s, which coincides with the median of the distribution of the sampling intervals. Fig. 6.6 shows the spectral densities of the input signals. We begin with a description of the identification results using a linear autoregressive moving average model

$$y_t + a_1 y_{t-T} + \cdots + a_q y_{t-qT} = b_1 u_{t-kT} + \cdots + b_l u_{t-(k+l)T} + w_t,$$

where $u_t \in R$ and $y_t \in R$ are an input and an output, respectively, $w_t$ is a centered white noise, $a_i$, $i = 1, 2, \ldots, q$, $b_j$, $j = 1, 2, \ldots, l$, and $k$ are model parameters to be estimated from experimental data, and $T$ is the sampling period. This model describes the dependence between the given and measured values of the angular velocities of the motor.

Estimates of the unknown parameters $a_i$, $i = 1, 2, \ldots, q$, $b_j$, $j = 1, 2, \ldots, l$, were obtained by minimizing the sum of squared residuals between the measured values of the plant output $y(t_i)$ and the model output $\hat y(t_i)$, $i = 1, 2, \ldots, N$, for different values of $q$, $l$, $k$. The adequacy of the constructed models was tested by estimating the determination coefficients

$$R^2 = \left(1 - \frac{\sum_{i=1}^{N}\big(y(i) - \hat y(i)\big)^2}{\sum_{i=1}^{N}\big(y(i) - \tilde y\big)^2}\right)\times 100\%$$

for each output and by using the statistical properties of the residuals and their correlations with the inputs, where $\hat y(i)$ is the value predicted by the model and $\tilde y$ is the sample mean of the measured values. The accuracy of the data approximation was estimated by the root mean square error (RMSE)

$$\mathrm{RMSE} = \left(\frac{1}{N}\sum_{i=1}^{N}\big(y(i) - \hat y(i)\big)^2\right)^{1/2}.$$
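The least squares fit and the two quality measures above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the experiments: lags are counted in sampling periods, the input lags are taken as $k, k+1, \ldots, k+l-1$ (one lag per coefficient $b_j$), and all function names are hypothetical.

```python
import numpy as np

def arx_regressors(y, u, q, l, k):
    """Rows (-y[t-1], ..., -y[t-q], u[t-k], ..., u[t-k-l+1]) and targets y[t]."""
    start = max(q, k + l - 1)
    rows = [np.concatenate([-y[t - q:t][::-1],
                            u[t - k - l + 1:t - k + 1][::-1]])
            for t in range(start, len(y))]
    return np.array(rows), y[start:]

def fit_arx(y, u, q, l, k):
    """Least squares estimates (a_1..a_q) and (b_1..b_l)."""
    X, target = arx_regressors(y, u, q, l, k)
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return theta[:q], theta[q:]

def determination_coefficient(y, y_hat):
    """R^2 in percent, as defined above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return (1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)) * 100.0

def rmse(y, y_hat):
    """Root mean square error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```

A search over the candidate orders and delays then amounts to calling `fit_arx` for each triple `(q, l, k)` and comparing the resulting determination coefficients on the testing part of the data.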

The model orders and the delay were chosen so as to ensure the adequacy of the models, by comparing the determination coefficients of different models over the ranges of possible values $q = 1, 2, 3$; $l = 1, 2, 3$; $k = 1, 2, \ldots, 7$. Figs. 6.7–6.9 show the simulation results of the obtained models with the parameters $[q, l, k]$: $[3, 2, 7]$, $[3, 2, 5]$, $[1, 3, 6]$ for each of the motors. The RMSE of the one-step predictions for the motors were 109.4, 112.1, and 117.8 rev/min and the determination coefficients were 79.9%, 81.6%, and 78.5%, respectively. Fig. 6.10 shows the correlation functions of the residuals and the cross-correlation functions of the inputs with the residuals.

Figure 6.7 Dependences of the angular speed of the first motor (curve 1) and its prediction (curve 2) on the number of observations.

Figure 6.8 Dependences of the angular speed of the second motor (curve 1) and its prediction (curve 2) on the number of observations.

Figure 6.9 Dependences of the angular speed of the third motor (curve 1) and its prediction (curve 2) on the number of observations.

Let us try to reduce the obtained values of the RMSE. Suppose that the plant is described by the nonlinear autoregressive moving average model (1.14), constructed on the basis of a perceptron with one hidden layer, a sigmoid activation function in the hidden layer, and a linear one in the output, where $z_t = (y_{t-T}, \ldots, y_{t-qT}, u_{t-kT}, \ldots, u_{t-(k+l)T})^T$. For all three engines we put $q = 3$, $l = 2$, $k = 7$ and the number of neurons in the hidden layer equal to 10, and we estimate the weights and biases of the neural network (NN) from observations of input/output pairs using the iterative diffuse training algorithm (DTA). The forgetting factor is determined from the relationship $\lambda_i = \max\{1 - 0.05/i,\; 0.99\}$, $i = 1, 2, \ldots, M$, where $M = 10$ is the number of iterations.
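The forgetting factor schedule is easy to tabulate. The short sketch below simply evaluates the stated relationship; the function name is an assumption introduced for illustration.

```python
def forgetting_factors(M=10):
    """lambda_i = max(1 - 0.05 / i, 0.99) for i = 1, ..., M."""
    return [max(1.0 - 0.05 / i, 0.99) for i in range(1, M + 1)]
```

With $M = 10$ the factor stays at 0.99 for the first five iterations, because $1 - 0.05/i \le 0.99$ for $i \le 5$, and then rises slowly to 0.995 at the tenth iteration, so later iterations discount past data less.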

Figure 6.10 Correlation functions of the residuals and cross-correlation functions of the inputs with the residuals (one pair of panels, plotted against the lag, for each of the outputs y1, y2, y3 and inputs u1, u2, u3).

Figure 6.11 Dependences of the angular speed of the first motor (curve 1) and its prediction (curve 2) on the number of observations.

Figure 6.12 Dependences of the angular speed of the second motor (curve 1) and its prediction (curve 2) on the number of observations.

Figure 6.13 Dependences of the angular speed of the third motor (curve 1) and its prediction (curve 2) on the number of observations.

Figs. 6.11–6.13 show the simulation results on the test sample for the models obtained for each of the engines. Here the graphs numbered 1 are the experimental values of the angular velocities, and the graphs numbered 2 are the one-step predictions found with the help of the models on the testing set. The RMSE of the one-step predictions are significantly smaller than for the linear autoregressive moving average models: 40.9, 45.5, and 43.4 rev/min for the three motors, respectively. The increase in approximation accuracy is noticeable even in a visual comparison with the graphs shown in Figs. 6.7–6.9. The weights and the biases of the hidden layer are initially assumed to be uniformly distributed random variables on the intervals [−1, 1] and [0, 1], respectively.

6.2 MODELING OF HYSTERETIC DEFORMATION BY NEURAL NETWORKS

Hysteresis relationships between different variables appear in many engineering applications. However, detailed modeling of systems with hysteresis using physical laws is usually a difficult task, and the resulting models are often too complex to be used in practical applications. Therefore, various alternative approaches were proposed that do not result from a deep analysis of the physical behavior of the system but combine some physical understanding of hysteresis with input-output models [87]. As an example, numerous well-known and successful applications of methods for constructing hysteresis models using NNs and fuzzy logic in problems of ferromagnetism and mechanics are presented in [88,89]. Their application is based on one of the main manifestations of hysteresis: the dependence of the system output $y(t)$ at the moment $t$ on the previous and current values of the input $u(t_p)$, $t_p \le t$. It is shown that a dependence of this type can be simulated by a dynamic system with a specially selected state, input, and output, with the NN used to approximate its right-hand side. Note that the first results in which an NN was used to model the dependence of the deformation on the force in a concrete plate appear to be those presented in [90,91].

In this section we show that the iterative DTA can be used for NN training to simulate hysteresis. We use the experimental data shown in Fig. 6.14. Here $F(\delta)$ is a mechanical force (load) applied to a part and $\delta$ is the deformation caused by $F(\delta)$. In the experiments, four force levels equal to 0, 90, 140, and 250 kg were applied in random order. Figs. 6.15 and 6.16 show the dependence of the force and the corresponding deformation on the observation number. The data sampling frequency is 20 Hz. It is seen that the error in setting, for example, the 250 kg level can reach 23 kg. This explains the variability of the hysteresis curves shown in Fig. 6.14.

Figure 6.14 The dependence of the deformation $\delta$ (mm) on the applied force (kg).

Figure 6.15 The dependence of the force $F(n)$ (kg) on the observation number $n$. Data markers: $n = 632$, $F = 251.2$; $n = 2396$, $F = 228.3$.

Figure 6.16 Dependences of the deformation $\delta$ (mm) (curve 1) and its approximation (curve 2) on the observation number $n$ in the training set.

Observing the behavior of the curve in Fig. 6.14, it is easy to see that knowledge of the force value alone is not enough to identify the deformation uniquely. This can be corrected by introducing additional variables that take into account the history of the process (the change of the deformation under the influence of the varying force). These may be, for example, the force and deformation values at some time prior to the current one. In the more general case, the deformation at the moment $t$ can be defined by the difference equation

$$\delta(t) = f(\delta(t-\Delta), \ldots, \delta(t-a\Delta), F(t-d), \ldots, F(t-d-b\Delta)), \tag{6.1}$$

where $f$ is some function, $\Delta$ is the sampling interval, and $a$, $b$, $d$ are some positive integers. The difference Eq. (6.1) is a nonlinear autoregressive moving average model by means of which we try to describe the hysteresis phenomenon. Thus the problem of constructing a hysteresis model is reduced to the identification of a discrete dynamic system. We rely on the following common approach to its solution:

1. The function $f$ is sought in the class of parameterized dynamic dependencies

$$\delta(t) = \Phi(W, \delta(t-\Delta), \ldots, \delta(t-a\Delta), F(t), \ldots, F(t-b\Delta)) = \Phi(W, z(t)), \tag{6.2}$$

where $W$ is a vector of unknown parameters and $z(t) = (\delta(t-\Delta), \ldots, \delta(t-a\Delta), F(t), \ldots, F(t-b\Delta))^T$ is the input vector. The function $\Phi(W, z(t))$ should be a universal approximator, in the sense that any continuous function can be reproduced by $\Phi(W, z(t))$ with arbitrarily high accuracy.

2. The set of input/output data $\{F(t), \delta(t)\}$, $t = 0, \Delta, \ldots, N\Delta$, is divided into two groups: training, $t = 0, \Delta, \ldots, N_1\Delta$, and testing, $t = (N_1+1)\Delta, (N_1+2)\Delta, \ldots, N\Delta$ (the latter is used only to verify the adequacy of the model).

3. A permissible range of the parameters $a$, $b$, $d$ is fixed, and for each choice within this range $W$ is selected by minimizing the quadratic criterion

$$Q(W) = \frac{1}{N_1}\sum_{i=q}^{N_1} \bigl(\delta(i\Delta) - \Phi(W, z(i\Delta))\bigr)^2. \tag{6.3}$$

Then the values of the criterion for the various $a$, $b$ are compared, and the vector $W$ minimizing the criterion is chosen.

4. The model is validated by its accuracy in reproducing the testing data, by the correlation of the residuals, and by the correlation of the residuals with the inputs.
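The bookkeeping in steps 1–3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `regressor` and `criterion`, the toy data, and the linear stand-in for $\Phi$ are assumptions made for the example, and the lower summation index is taken as $q = \max(a, b)$ (the first sample with a complete history), which the text does not state explicitly.

```python
def regressor(delta, force, i, a, b):
    """z(i*Δ) of Eq. (6.2): a past deformations, then b + 1 force samples."""
    past_delta = [delta[i - k] for k in range(1, a + 1)]
    recent_force = [force[i - k] for k in range(b + 1)]
    return past_delta + recent_force


def criterion(phi, w, delta, force, a, b, n1):
    """Q(W) of Eq. (6.3) over the training samples i = q, ..., n1."""
    q = max(a, b)  # assumed: first index whose regressor is fully defined
    total = 0.0
    for i in range(q, n1 + 1):
        residual = delta[i] - phi(w, regressor(delta, force, i, a, b))
        total += residual ** 2
    return total / n1


def linear_phi(w, z):
    """Hypothetical stand-in for Φ(W, z); the text uses a neural network."""
    return sum(wi * zi for wi, zi in zip(w, z))


# Toy data generated by the stand-in model itself with a = b = 1.
force = [0.0, 1.0, 0.5, -1.0, 0.25, 2.0, 1.5, -0.5]
w_true = [0.5, 0.1, -0.2]
delta = [0.0]
for i in range(1, len(force)):
    delta.append(linear_phi(w_true, [delta[i - 1], force[i], force[i - 1]]))

n1 = 5  # samples 0..n1 form the training group, the rest are kept for testing
q_true = criterion(linear_phi, w_true, delta, force, 1, 1, n1)
q_wrong = criterion(linear_phi, [0.0, 0.0, 0.0], delta, force, 1, 1, n1)
```

Evaluating `criterion` for each candidate pair $(a, b)$ and keeping the smallest value corresponds to the comparison described in step 3.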


Figs. 6.17–6.22 show the results of modeling the hysteretic relationship shown in Fig. 6.14 by means of the nonlinear difference equation

$$\delta(t) = \Phi(W, \delta(t-\Delta), F(t), F(t-\Delta)), \quad t \in T = \{0, \Delta, \ldots, N\Delta\}. \tag{6.4}$$

Eq. (6.4) uniquely determines the deformation $\delta(t)$ at the points $t \in \{\Delta, 2\Delta, \ldots\}$ for the values $F(t)$, $t \in \{\Delta, 2\Delta, \ldots\}$, once the values $F(0)$, $F(-\Delta)$ and the deformation $\delta(-\Delta)$ are set.
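The recursion implied by Eq. (6.4) can be sketched as follows. This is only an illustration under stated assumptions: `simulate` and `toy_phi` are hypothetical names, and a linear function with made-up coefficients stands in for the trained network $\Phi$. The point is that each new deformation is computed from the previous model output and two force samples, so the initial values fix the whole trajectory.

```python
def simulate(phi, w, force, delta_init):
    """Iterate Eq. (6.4) forward in time.

    force[0] = F(-Δ), force[1] = F(0), force[2] = F(Δ), ...
    delta_init = δ(-Δ). Returns [δ(-Δ), δ(0), δ(Δ), ...].
    """
    delta = [delta_init]
    for k in range(1, len(force)):
        delta.append(phi(w, delta[-1], force[k], force[k - 1]))
    return delta


def toy_phi(w, delta_prev, f_now, f_prev):
    """Hypothetical linear stand-in for the trained network Φ."""
    return w[0] * delta_prev + w[1] * f_now + w[2] * f_prev


# The load rises from 0 to 2 and then returns to 0.
force = [0.0, 1.0, 2.0, 1.0, 0.0]
trajectory = simulate(toy_phi, [0.5, 1.0, 0.0], force, delta_init=0.0)
print(trajectory)  # [0.0, 1.0, 2.5, 2.25, 1.125]
```

In this toy run the force value 1.0 occurs twice, once while loading and once while unloading, and the model output differs (1.0 versus 2.25), which illustrates why the previous deformation is included in the regressor.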

Figure 6.17 Histogram of residuals on the training set.

Figure 6.18 Sample cross-correlation function of the residuals with $F(t)$ on the training set.

Figure 6.19 Sample autocorrelation function of the residuals on the training set.

Figure 6.20 Dependences of the deformation $\delta$ (mm) (curve 1) and its approximation (curve 2) on the observation number $n$ on the testing set.

Figure 6.21 Dependences of the deformation (curve 2) and the approximating curve (curve 1) (mm) on the applied force (kg) on the testing set.

Figure 6.22 Histogram of residuals on the testing set.
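The residual checks reported in Figs. 6.18 and 6.19 rely on sample correlation functions. A minimal sketch of such a computation is given below; the function names and the toy numbers are assumptions for illustration, and the $\pm 1.96/\sqrt{n}$ band is the usual large-sample approximation for a white sequence rather than something stated in the text.

```python
import math


def sample_cross_correlation(x, y, lag):
    """Normalised sample correlation between x(t + lag) and y(t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    scale = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if lag >= 0:
        pairs = zip(dx[lag:], dy[:n - lag])
    else:
        pairs = zip(dx[:n + lag], dy[-lag:])
    return sum(p * r for p, r in pairs) / scale


def confidence_bound(n):
    """Approximate 95% band for the correlations of a white sequence."""
    return 1.96 / math.sqrt(n)


residuals = [1.0, 2.0, 3.0, 4.0]  # toy numbers, not data from the book
r0 = sample_cross_correlation(residuals, residuals, 0)
r1 = sample_cross_correlation(residuals, residuals, 1)
```

Passing the same sequence twice gives the autocorrelation; passing the residuals and the force gives the cross-correlation with the input. For a training set of 1836 points, `confidence_bound` evaluates to about 0.046.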

To approximate the right-hand side of Eq. (6.4) we use a perceptron with one hidden layer of five neurons and a linear output activation function (AF). The training set contains 1836 points and the testing set 824. The forgetting factor is determined by the expression $\lambda_i = \max\{1 - 0.05/i, 0.99\}$, $i = 1, 2, \ldots, 100$. It is also of interest to ask how the use of such complex mathematical structures as NNs to construct the hysteresis curves shown in Fig. 6.14 can be justified. The simulation results show that a linear model of the form $\delta_m(t) = a_1\delta_m(t-\Delta) + b_1 F(t) + b_2 F(t-\Delta)$ cannot approximate even the points of the training set. A similar result was obtained with the state-space model $x(t+\Delta) = Ax(t) + BF(t) + \xi(t)$, $\delta(t) = Cx(t) + \zeta(t)$, with $A$, $B$, $C$ estimated by the subspace identification method.
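As a rough illustration of the two ingredients just described, the sketch below evaluates a one-hidden-layer perceptron with a linear output and tabulates the forgetting factor schedule. The sigmoid hidden activation, the output bias, the function names, and the placeholder weights are assumptions; the text fixes only the number of hidden neurons, the linear output, and the formula for $\lambda_i$.

```python
import math


def perceptron(hidden_w, hidden_b, out_w, out_b, z):
    """One hidden layer of sigmoid neurons followed by a linear output."""
    hidden = []
    for row, bias in zip(hidden_w, hidden_b):
        s = sum(w * x for w, x in zip(row, z)) + bias
        hidden.append(1.0 / (1.0 + math.exp(-s)))
    return sum(v * h for v, h in zip(out_w, hidden)) + out_b


def forgetting_factor(i):
    """λ_i = max{1 - 0.05 / i, 0.99} for iteration i = 1, 2, ..."""
    return max(1.0 - 0.05 / i, 0.99)


# Five hidden neurons, three inputs z = (δ(t-Δ), F(t), F(t-Δ)).
hidden_w = [[0.0, 0.0, 0.0] for _ in range(5)]  # placeholder weights
hidden_b = [0.0] * 5
out_w = [1.0] * 5
output = perceptron(hidden_w, hidden_b, out_w, 0.0, [0.1, 90.0, 0.0])

schedule = [forgetting_factor(i) for i in range(1, 101)]
```

With this formula the factor stays at 0.99 for the first five iterations and then grows toward 1, reaching 0.9995 at $i = 100$, so older data are discounted more strongly at the start of training.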

6.3 HARMONICS TRACKING OF ELECTRIC POWER NETWORKS

Estimating the amplitudes of the harmonics of a periodic signal is an extremely important problem in electric power networks. The reason is that the analysis of most electrical equipment is based on the assumption that the load current can be represented by a superposition of harmonics with frequencies that are multiples of the fundamental frequency. In addition, real-time algorithms for estimating the harmonic amplitudes are needed for harmonic correction. Recursive algorithms [92–94] are used to solve this problem. They are based on the following idea. Suppose that we know a priori the number of harmonics required to reproduce the observed load currents with a given accuracy. Then the estimates of the harmonic amplitudes (Fourier coefficients) are found by recursively minimizing the squared one-step prediction error between the load current and the model values (the Widrow–Hoff algorithm) or by using the RLSM. However, the rate of convergence of the Widrow–Hoff algorithm may be unsatisfactory, and the RLSM can diverge under perturbations and changing loads. Below we compare the Widrow–Hoff algorithm and the DTA with sliding window on numerical examples with real data.


Let the mathematical model of the signal have the form

$$y_t = A_0 + \sum_{i=1}^{M} [A_i \sin(2\pi f i t) + B_i \cos(2\pi f i t)] + \xi_t = C_t\alpha + \xi_t, \tag{6.5}$$

where $f = 60$ Hz is the fundamental frequency, $M = 23$, $\alpha = (A_0, A_1, \ldots, A_M, B_1, \ldots, B_M)^T$, and

$$C_t = (1, \sin(2\pi f t), \ldots, \sin(2\pi f M t), \cos(2\pi f t), \ldots, \cos(2\pi f M t)).$$

It is required to estimate recursively the harmonic amplitudes in Eq. (6.5) using the Widrow–Hoff algorithm

$$\alpha_t = \alpha_{t-1} + \eta C_t^T (y_t - C_t\alpha_{t-1}), \quad \alpha_0 = 0, \quad \eta = 0.005,$$

and the DTA with sliding window. The data are sampled with interval $T$ at 256 samples per period of the fundamental frequency, and the horizon $h$ of the DTA with sliding window is 300. Figs. 6.23–6.25 show a fragment of the realization with an impulse disturbance and the harmonic amplitude estimates obtained by the Widrow–Hoff algorithm (curves 1) and the DTA with sliding window (curves 2), where $y_t = y_{iT}$, $i = 1, 2, \ldots$, and A denotes ampere. It is evident that the DTA converges substantially faster than the Widrow–Hoff algorithm. Figs. 6.26–6.28 show a fragment of the realization with a varying load and the corresponding harmonic amplitude estimates produced by the Widrow–Hoff algorithm (curves 1) and the DTA with sliding window (curves 2). Here too the DTA converges substantially faster than the Widrow–Hoff algorithm.
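The Widrow–Hoff recursion for the model (6.5) can be sketched as follows. This is a toy run under stated assumptions, not a reproduction of the experiments: the signal is synthetic and noise-free, only $M = 3$ harmonics are used to keep the run short (the text uses $M = 23$), and the names `regressor` and `widrow_hoff_step` are hypothetical. The DTA with sliding window is not reproduced here.

```python
import math

F0 = 60.0                 # fundamental frequency, Hz
SAMPLES_PER_PERIOD = 256  # sampling rate relative to the fundamental
M = 3                     # harmonics in this toy run (the text uses M = 23)
ETA = 0.005               # step size from the text


def regressor(t):
    """C_t = (1, sin(2πft), ..., sin(2πfMt), cos(2πft), ..., cos(2πfMt))."""
    sines = [math.sin(2 * math.pi * F0 * i * t) for i in range(1, M + 1)]
    cosines = [math.cos(2 * math.pi * F0 * i * t) for i in range(1, M + 1)]
    return [1.0] + sines + cosines


def widrow_hoff_step(alpha, c, y, eta=ETA):
    """One update: alpha + eta * C_t^T * (y_t - C_t * alpha)."""
    error = y - sum(ci * ai for ci, ai in zip(c, alpha))
    return [ai + eta * ci * error for ai, ci in zip(alpha, c)]


# Synthetic signal (assumed): A0 = 1, A1 = 2, B3 = 0.5, all others zero.
alpha_true = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.5]
dt = 1.0 / (F0 * SAMPLES_PER_PERIOD)

alpha = [0.0] * (2 * M + 1)  # alpha_0 = 0
for n in range(1, 10001):
    c = regressor(n * dt)
    y = sum(ci * ai for ci, ai in zip(c, alpha_true))
    alpha = widrow_hoff_step(alpha, c, y)
```

With such a small step size the estimates approach the true amplitudes only over many periods of the fundamental, which is in line with the slow convergence of the Widrow–Hoff algorithm noted in the text.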

Figure 6.23 The dependence of the current value $I(iT)$ (A) on the observation number $i$.

Figure 6.24 Dependencies of the harmonic amplitude estimates $A_1$ on the observation number under the impulse noise action. The Widrow–Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).

Figure 6.25 Dependencies of the harmonic amplitude estimates $A_{23}$ on the observation number under the impulse noise action. The Widrow–Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).

Figure 6.26 The dependence of the signal value $I(iT)$ (A) on the observation number $i$.

Figure 6.27 Dependencies of the harmonic amplitude estimates $A_1$ on the observation number under the varying load. The Widrow–Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).

Figure 6.28 Dependencies of the harmonic amplitude estimates $A_{23}$ on the observation number under the varying load. The Widrow–Hoff algorithm (curve 1) and the DTA with sliding window (curve 2).

Figure 6.29 The dependence of the amplitude-frequency characteristic of the filter on the harmonic number.

Signals are preprocessed by a tenth-order low-pass elliptic filter with delay compensation. The amplitude-frequency characteristic of the filter is shown in Fig. 6.29.


The simultaneous estimation of the harmonic amplitudes and of changes in the fundamental frequency is also of considerable interest in monitoring electric currents in networks. Figs. 6.30–6.32 show the results of simultaneously estimating the harmonic amplitudes and the period of the fundamental frequency under the impulse noise of Fig. 6.23 using the nonlinear DTA.

Figure 6.30 Dependencies of the harmonic amplitude estimates $A_1$ (curve 1) and $B_1$ (curve 2) on the observation number under the impulse noise action.

Figure 6.31 Dependencies of the harmonic amplitude estimates $A_{23}$ (curve 1) and $B_{23}$ (curve 2) on the observation number under the impulse noise action.

Figure 6.32 The dependence of the estimate of the fundamental period on the observation number under the impulse noise action.

GLOSSARY

NOTATIONS
$A^T$ transpose of matrix $A$
$A^{-1}$ inverse of matrix $A$
$A^{+}$ pseudoinverse of matrix $A$
$A > 0$ positive definite matrix $A$
$A \ge 0$ positive semidefinite matrix $A$
$A = \mathrm{diag}(a_1, \ldots, a_n)$ diagonal matrix
$A = \mathrm{block\ diag}(A_1, \ldots, A_n)$ block-diagonal matrix
$\mathrm{trace}(A)$ trace of matrix $A$
$\det A$ determinant of matrix $A$
$0_{n \times m}$ zero $n \times m$ matrix
$I_n$ identity $n \times n$ matrix
$e_i$ $i$-th unit vector of dimension $n$
$R^n$ $n$-dimensional linear space over the field of real numbers
$R^{n \times m}$ set of $n \times m$ matrices
$\otimes$ direct product of two matrices
$\mathrm{rank}(A)$ rank of matrix $A$
$E(\xi)$ expectation of vector $\xi$
$\|\cdot\|$ Euclidean vector norm

ABBREVIATIONS
AF activation function
DEFK diffuse extended Kalman filter
DFK diffuse Kalman filter
DTA diffuse training algorithm
EKF extended Kalman filter
ELM extreme learning machine
GNM Gauss–Newton method
KF Kalman filter
LSM least-square method
MF membership function
NFS neuro-fuzzy system
NN neural network
RBNN radial basis NN
RLSM recursive least-square method
RNN recurrent NN
SR separable regression

REFERENCES

[1] Golub GH, Pereyra V. The differentiation of pseudoinverses and nonlinear least squares problems whose variables separate. SIAM J Numer Anal 1973;10:413–32.
[2] Golub GH, Pereyra V. Separable nonlinear least squares: the variable projection method and its applications. Inverse Probl 2003;19(2):R1–26.
[3] Pereyra V, Scherer G, Wong F. Variable projections neural network training. Math Comput Simul 2006;73(1–4):231–43.
[4] Sjoberg J, Viberg M. Separable non-linear least-squares minimization and possible improvements for neural net fitting. In: IEEE workshop in neural networks for signal processing, FL, USA; 1997. p. 345–354.
[5] Jang R. Fuzzy modeling using generalized neural networks and Kalman filter algorithm. In: Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), July 1991. p. 762–7.
[6] Jang R, Sun C, Mizutani E. Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Englewood Cliffs, NJ: Prentice Hall; 1997.
[7] Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing 2006;70:489–501.
[8] Rong HJ, Huang GB, Sundararajan N, Saratchandran P. Online sequential fuzzy extreme learning machine for function approximation and classification problems. IEEE Trans Syst Man Cybern B Cybern 2009;39(4):1067–72.
[9] Kaufman L. A variable projection method for solving separable nonlinear least squares problems. BIT 1975;15:49–57.
[10] Haykin S. Neural Networks and Learning Machines. 3rd ed. Englewood Cliffs, NJ: Prentice Hall; 2009. 916 p.
[11] Cybenko G. Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 1989;2:303–14.
[12] Funahashi K. On the approximate realization of continuous mappings by neural networks. Neural Net 1989;2:183–92.
[13] Battiti R. First and second order methods for learning: between steepest descent and Newton's method. Neural Comput 1992;4(2):141–66.
[14] Hagan MT, Menhaj M. Training multilayer networks with the Marquardt algorithm. IEEE Trans Neural Net 1994;5(6):989–93.
[15] Singhal S, Wu I. Training multilayer perceptrons with the extended Kalman filter. Adv Neural Inf Process Syst 1989;1:133–40.
[16] Iiguni Y, Sakai H, Tokumaru H. A real time learning algorithm for a multilayered neural network based on the extended. IEEE Trans Signal Process 1992;40(4):959–66.
[17] Skorohod B. Diffuse initialization of Kalman filter. J Autom Inf Sci 2011;43(4):20–34. Begell House Publishing Inc, USA.
[18] Skorohod B. Diffusion learning algorithms for feedforward neural networks. Cybernet Syst Anal May 2013;49(3). Springer New York.
[19] Kim C-T, Lee J-J. Training two-layered feedforward networks with variable projection method. IEEE Trans Neural Net February 2008;19(2).
[20] Kim C-T, Lee J-J, Kim H. Variable projection method and Levenberg–Marquardt algorithm for neural network training. In: IEEE Industrial Electronics, IECON 2006, 32nd Annual Conference, 6–10 Nov. 2006. p. 4492–4497.


[21] Skorohod B. Learning algorithms for neural networks and neuro-fuzzy systems with separable structures. Cybernet Syst Anal 2015;51(2):173–86. Springer New York.
[22] Huang GB, Wang DH, Lan Y. Extreme learning machines: a survey. Int J Mach Learn Cybernet 2011;2(2):107–22.
[23] Broomhead D, Lowe D. Multivariable functional interpolation and adaptive networks. Complex Syst 1988;2:321–55.
[24] Simon D. Training radial basis neural networks with the extended Kalman filter. Neurocomputing October 2002;48:455–75.
[25] Takagi T, Sugeno M. Fuzzy identification of systems and its application to modeling and control. IEEE Trans Syst Man Cybern 1985;SMC-15:116–32.
[26] Simon D. Training fuzzy systems with the extended Kalman filter. Fuzzy Sets Syst 2002;132(2):189–99.
[27] Goddard J, Parrazales R, Lopez I, de Luca A. Rule learning in fuzzy systems using evolutionary programs. IEEE Midwest Symp Circuits Syst 1996;703–9. Ames, Iowa.
[28] Magdalena L, Monasterio-Huelin F. Fuzzy logic controller with learning through the evolution of its knowledge base. Internat J Approx Reason 1997;16:335–58.
[29] Nelles O. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Berlin: Springer; 2001. 785 p.
[30] Bruls J, Chou CT, Haverkamp BRJ, Verhaegen M. Linear and non-linear system identification using separable least-squares. Eur J Control 1999;5:116–28.
[31] Edrissi H, Verhaegen M, Haverkamp B, Chou CT. Off- and on-line identification of discrete time using separable least squares. In: Proceedings of the 37th IEEE Conference on Decision & Control, Tampa, FL; 1998.
[32] Stubberud SC, Kramer KA, Geremia JA. Online sensor modeling using a neural Kalman filter. IEEE Trans Instrum Meas 2007;56(4):1451–8.
[33] Stubberud SC, Kramer KA, Geremia JA. System identification using the neural-extended Kalman filter for state-estimation and controller modification. In: Proceedings of the International Joint Conference on Neural Networks; 2008. p. 1352–1357.
[34] Kral L, Simandl M. Neural networks in local state estimation. In: Methods and Models in Automation and Robotics (MMAR), 17th International Conference, 27–30 Aug 2012. p. 250–255.
[35] Jin L, Nikiforuk P, Gupta M. Approximation of discrete-time state-space trajectories using dynamic recurrent neural networks. IEEE Trans Autom Control 1995;40(40), N.7.
[36] Hagan MT, Demuth HB, De Jesús O. An introduction to the use of neural networks in control systems. Int J Robust Nonlinear Control September 2002;12(11):959–85.
[37] Ljung L, Soderstrom T. Theory and Practice of Recursive Identification. Cambridge, MA: MIT Press; 1983.
[38] Haykin S. Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-Hall; 1991.
[39] Mosca E. Optimal, Predictive and Adaptive Control. Englewood Cliffs, NJ: Prentice Hall; 1995.
[40] Bellanger MG. Adaptive Digital Filters and Signal Analysis. New York: Marcel Dekker; 1987.
[41] Cioffi JM, Kailath T. Fast, recursive-least-squares transversal filters for adaptive filtering. IEEE Trans Acoust Speech Signal Process 1984;32(2):304–37.
[42] Hubing NE, Alexander ST. Statistical analysis of initialization methods for RLS adaptive filters. IEEE Trans Signal Process 1991;39(8):1793–804.
[43] Eom K-S, Park D-J. Analysis of overshoot phenomena in initialisation stage of RLS algorithm. Signal Process 1995;44:329–39. Elsevier.
[44] Moustakides GV. Study of the transient phase of the forgetting factor RLS. IEEE Trans Signal Process 1997;45:2468–76.


[45] Skorohod B. Asymptotic of linear recurrent regression under diffuse initialization. J Autom Inf Sci 2009;41(5):41–50. Begell House Publishing Inc, USA.
[46] Skorohod B. Oscillations of RLSM with diffuse initialization. Automation of processes and control: proceedings of SSTU, 146, p. 40–45, Sevastopol, 2014 (in Russian).
[47] Albert A. Regression and the Moore-Penrose Pseudoinverse. New York: Academic Press; 1972. 177 p.
[48] Albert A, Sittler R. A method for computing least squares estimators that keep up with the data. SIAM J Control 1965;3(3):384–417.
[49] Stoica P, Åhgren P. Exact initialization of the recursive least-squares algorithm. Int J Adapt Control Signal Process 2002;16(3):219–30.
[50] Ansley CF, Kohn R. Estimation, filtering and smoothing in state space models with incompletely specified initial conditions. Ann Statist 1985;13:1286–316.
[51] Koopman SJ. Exact initial Kalman filtering and smoothing for non-stationary time series models. J Am Stat Assoc 1997;92(440):1630–8.
[52] De Jong P. The diffuse Kalman filter. Ann Statist 1991;19:1073–83.
[53] Kwon WH, Kim PS, Park P. A receding horizon Kalman FIR filter for discrete time invariant systems. IEEE Trans Autom Control 1999;44(9):1787–91.
[54] Kwon WH, Kim PS, Han SH. A receding horizon unbiased FIR filter for discrete-time state space models. Automatica 2002;38(3):545–51.
[55] Harvey AC, Pierse RG. Estimating missing observations in economic time series. J Am Stat Assoc 1984;79:125–31.
[56] Bar-Shalom Y, Rong X, Kirubarajan T. Estimation with Applications to Tracking and Navigation. New York: John Wiley and Sons; 2001.
[57] Gantmaher FR. Matrix Theory. Moscow: Fizmathlit; 2004. 560 p.
[58] Narendra KS, Parthasarathy K. Identification and control of dynamical systems using neural networks. IEEE Trans Neural Net March 1990;1(1):4–27.
[59] Jinkun L. Radial Basis Function (RBF) Neural Network Control for Mechanical Systems. Heidelberg: Springer; 2013.
[60] Jazwinski AH. Limited memory optimal filtering. In: Proceedings 1968 Joint Automatic Control Conference, Ann Arbor, MI; 1968. p. 383–393.
[61] Hyun KW, Ho LK, Hee HS, Hoon LC. Receding horizon FIR parameter estimation for stochastic systems. Int Conf Control Autom Syst 2001;1193–6.
[62] Alessandri A, Baglietto B, Battistelli G. Receding-horizon estimation for discrete-time linear systems. IEEE Trans Autom Control 2003;48(3):473–8.
[63] Wang Z-o, Zhang J. A Kalman filter algorithm using a moving window with applications. Int J Syst Sci 1995;26(9):1465–78.
[64] Gustafsson F. Adaptive Filtering and Change Detection. Chichester: Wiley; 2001.
[65] Hoerl A, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55–67.
[66] Korn A, Korn T. Mathematical Handbook for Scientists and Engineers. New York: McGraw-Hill Book Co; 1961. 943 p.
[67] Seber G. A Matrix Handbook for Statisticians. Hoboken, NJ: John Wiley & Sons, Inc; 2007. 559 p.
[68] Cook D, Forzani L. On the mean and variance of the generalized inverse of a singular Wishart matrix. Electron J Stat 2011;5:146–58.
[69] Wimmer HR. Stabilizing and unmixed solutions of the discrete time algebraic Riccati equation. In: Proceedings of the Workshop on the Riccati Equation in Control, Systems, and Signals; 1989. p. 95–98.
[70] Polyak BT. Introduction to Optimization. Translations Series in Mathematics and Engineering. New York: Optimization Software Inc. Publications Division; 1987.


[71] Agarwal RP. Difference Equations and Inequalities: Theory, Methods, and Applications. Boca Raton, FL: CRC Press; 2000. 971 p.
[72] Johnstone RM, Johnson CR, Bitmead RR, Anderson BDO. Exponential convergence of recursive least squares with exponential forgetting factor. In: 21st IEEE Conference on Decision and Control; 1982. p. 994–997.
[73] Harville D. Matrix Algebra From a Statistician's Perspective. Berlin: Springer Science & Business Media; 2008. 634 p.
[74] Lakshmikantham V, Trigiante D. Theory of Difference Equations: Numerical Methods and Applications. 2nd ed. New York: Marcel Dekker; 2002. 300 p.
[75] Wasan MT. Stochastic Approximations. Cambridge: Cambridge University Press; 1969.
[76] Bertsekas DP. Incremental least squares methods and the extended Kalman filter. SIAM J Optim 1996;6:807–22.
[77] Curve Fitting Toolbox 3. The MathWorks, Inc.
[78] Fuzzy Logic Toolbox. The MathWorks, Inc.
[79] System Identification Toolbox. The MathWorks, Inc.
[80] Bezdek J, Keller J, Krishnapuram R, Kuncheva L, Pal N. Will the real Iris data please stand up? IEEE Trans Fuzzy Syst 1999;7:368–9.
[81] Anderson BDO, Moore J. Optimal Filtering. Englewood Cliffs, NJ: Prentice Hall; 1979.
[82] Jazwinski AH. Stochastic Processes and Filtering Theory. New York: Academic Press; 1970.
[83] Han SH, Kim PS, Kwon WH. Receding horizon FIR filter with estimated horizon initial state and its application to aircraft engine systems. In: Proceedings of the 1999 IEEE International Conference on Control Applications, Hawaii, August 22–27, 1999.
[84] Narendra KS, Parthasarathy K. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans Neural Net 1991;2(2):252–62.
[85] Sum J. Extended Kalman Filter Based Pruning Algorithms and Several Aspects of Neural Network Learning. PhD Dissertation. Hong Kong: Department of Computer Science and Engineering, Chinese University of Hong Kong; 1998.
[86] http://www.festo-didactic.com/int-en/learning-systems/education-and-research-robotsrobotino/robotino-workbook.htm.
[87] Mayergoyz ID. Mathematical Models of Hysteresis and Their Applications. Amsterdam: Elsevier Series in Electromagnetism; 2003.
[88] Ghaboussi J. Advances in neural networks in computational mechanics and engineering. In: Advances of Soft Computing in Engineering. Springer; 2010. p. 191. 470 p.
[89] Yun GJ, Ghaboussi J, Amr S. Modeling of hysteretic behavior of beam-column connections based on self-learning simulation. Report, Illinois, August 2007. 224 p.
[90] Ghaboussi J, Garrett JH, Wu X. Material modeling with neural networks. In: Proceedings of the International Conference on Numerical Methods in Engineering: Theory and Applications, Swansea; 1990.
[91] Ghaboussi J, Garrett JH, Wu X. Knowledge-based modeling of material behavior with neural networks. J Eng Mech Div 1991;117:132–53.
[92] Dash PK, Panda SK, Liew AC, Mishra B, Jena RK. A new approach to monitoring electric power quality. Elect Power Syst Res 1998;46:11–20.
[93] Rechka S, Ngandui E, Jianhong X, Sicard P. Analysis of harmonic detection algorithms and their application to active power filters for harmonics compensation and resonance damping. Can J Elect Comput Eng 2003;28:41–51.
[94] Dash PK, Swain DP, Liew AC, Rahman S. An adaptive linear combiner for on-line tracking of power system harmonics. In: IEEE Trans Power Appar Syst, 96WM 181-8 PWRS, January 21–25, 1996.
