Traditionally, neural networks and wavelet theory have been two separate disciplines, taught separately and practiced se
Pages [285] Year 2002
Table of contents :
PART A
MATHEMATICAL PRELIMINARIES
Sets
Functions
Sequences and Series
Complex Numbers
Linear Spaces
Matrices
Hilbert Spaces
Topology
Measure and Integral
Fourier Series
Exercises
WAVELETS
Introduction
Dilation and Translation
Inner Product
Haar Wavelet
Multiresolution Analysis
Continuous Wavelet Transform
Discrete Wavelet Transform
Fourier Transform
Discrete Fourier Transform
Discrete Fourier Transform of Finite Sequences
Convolution
Exercises
NEURAL NETWORKS
Introduction
Multilayer Perceptrons
Hebbian Learning
Competitive and Kohonen Networks
Recurrent Neural Networks
WAVELET NETWORKS
Introduction
What Are Wavelet Networks
Dyadic Wavelet Network
Theory of Wavelet Networks
Wavelet Network Structure
Multidimensional Wavelets
Learning in Wavelet Networks
Initialization of Wavelet Networks
Properties of Wavelet Networks
Scaling at Higher Dimensions
Exercises
PART B
RECURRENT LEARNING
Introduction
Recurrent Neural Networks
Recurrent Wavenets
Numerical Experiments
Concluding Remarks
Exercises
SEPARATING ORDER FROM DISORDER
Order Within Disorder
Wavelet Networks: Trading Advisors
Comparison Results
Conclusions
Exercises
RADIAL WAVELET NEURAL NETWORKS
Introduction
Data Description and Preparation
Classification Systems
Results
Conclusions
Exercises
PREDICTING CHAOTIC TIME SERIES
Introduction
Nonlinear Prediction
Wavelet Networks
Short-Term Prediction
Parameter-Varying Systems
Long-Term Prediction
Conclusions
Acknowledgements
Appendix
Exercises
CONCEPT LEARNING
An Overview
An Illustrative Example of Learning
Introduction
Preliminaries
Learning Algorithms
Summary
Exercises
BIBLIOGRAPHY
INDEX
FOUNDATIONS of WAVELET NETWORKS and APPLICATIONS
S. Sitharama Iyengar, E. C. Cho, Vir V. Phoha
CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton London New York Washington, D.C.
Library of Congress Cataloging-in-Publication Data
Iyengar, S. S. (Sundararaja S.)
Foundations of wavelet network and applications / S.S. Iyengar, E.C. Cho, V.V. Phoha.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-274-3
1. Neural networks (Computer science) 2. Wavelets (Mathematics) I. Cho, E. C. II. Phoha, Vir V. III. Title.
QA76.87 .I94 2002
006.3'2--dc21
200205919 CIP
Catalog record is available from the Library of Congress
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the authors and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.
All rights reserved. Authorization to photocopy items for internal or personal use, or the personal or internal use of specific clients, may be granted by CRC Press LLC, provided that $1.50 per page photocopied is paid directly to Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. The fee code for users of the Transactional Reporting Service is ISBN 1-58488-274-3/02/$0.00+$1.50. The fee is subject to change without notice. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.
Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.
Visit the CRC Press Web site at www.crcpress.com
© 2002 by Chapman & Hall/CRC CRC Press LLC
No claim to original U.S. Government works
International Standard Book Number 1-58488-274-3
Library of Congress Card Number 2002025919
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
Dedication
S.S. Iyengar would like to dedicate this book to Professor Hartmanis, a Turing Award winner, of Cornell for his pioneering contribution to complexity theory and to Puneeth, Veneeth, and Vijeth for their patience for my lectures. E.C. Cho would like to dedicate this book to his parents and his family. V.V. Phoha would like to dedicate this work to Professor William J.B. Oldham of Texas Tech University for his teachings and for his contributions to neural networks.
Acknowledgments
This book was made possible thanks to generous support from the Office of Naval Research Division of Electronics, DARPA-Sense IT, DOE-ORNL programs, Board of Regents-LEQSF, etc. Special acknowledgments go to the following individuals for their contributions to this book. Chapter 5 was contributed by Ricardo Riaza, Raquel Gomez Sanchez, and Pedro J. Zufiria. Chapters 6 and 7 were contributed by Javier Echauz and George Vachtsevanos. Chapter 8, contributed by Liangyue Cao, Yiguang Hong, Haiping Fang, and Guowei He, is reproduced with their permission. Chapter 9 was contributed by Rao and Iyengar. Thanks to Professor Rao for his technical expertise in fusion learning. We are grateful to the following individuals in the LSU Robotics Research and LSU Networking Laboratories who have helped in the publication of this book: Sridhar Karra, Kanthi Adapa, and Sumanth Yenduri. Professor Seiden's help in using LaTeX was very useful.
This book is an extension to our earlier (Prof. Iyengar's) book, Wavelet Analysis with Applications to Image Processing, CRC Press (1997), Boca Raton, Florida, which continues to be used in many universities across the world.
Preface
This text grew out of our lecture notes at Louisiana State University, Louisiana Tech University, and Kentucky State University and from many questions from our students that indicated the need for a comprehensive treatment of wavelet networks. Our two main purposes in writing this text are (1) to present a systematic treatment of wavelet networks based whenever possible on fundamental principles, and (2) to present a self-contained treatment of wavelet networks in a clear and concise manner.
Readers who sample the literature will note a recent vigorous growth in this area. Wavelet networks are relatively new; however, their applications and growth have come from many areas including wavelet theory, neural networks, statistics, computer science, pattern recognition, and communication, to name a few. In this text, neural networks and wavelet theory, traditionally taught in two different disciplines, are integrated under the common theme of wavelet networks.
In our teaching and in research, we found that the material required to teach wavelet networks in an organized and comprehensive manner was scattered in many research monographs; there is no book that presents a comprehensive treatment of material in one place. This book is an attempt to fill that void.
Approach
We have followed a pedagogical approach in this text. The book presents a rigorous treatment of both wavelets and neural networks, because the foundations of wavelet networks can be found in neural networks and wavelets. Wavelet networks combine the decomposition powers of wavelets with the universal approximation properties of neural networks, so a good understanding of wavelets and networks is essential to appreciate the universal approximation properties of wavelet networks.
Theory and techniques of wavelet networks are mostly mathematical in nature, but we have focused on providing insight and understanding rather than establishing rigorous mathematical foundations, so we have avoided detailed proofs of theorems. A deliberate attempt is made to provide a balance between theory and applications of wavelet networks. We present the material at a level that can be understood by senior undergraduate students in engineering and science. To follow this text no specific mathematical background is expected, with the exception of basic calculus. All the other requisite mathematical preliminaries are given in the first part.
Topics
The book has two parts. Part I deals with the foundational aspects of wavelet networks, and Part II presents applications of wavelet networks. A brief description of each part of the book follows.
Part I is divided into four chapters. Chapter 1 deals with mathematical preliminaries. Wavelet transforms have come to be an important tool for approximation analysis of nonlinear equations, with a wide and ever increasing range of applications, in recent years. The versatility of wavelet analytic techniques has forged new interdisciplinary bonds by offering common solutions to diverse problems and providing a platform to wavelet networks. Chapter 2 describes in detail wavelet analysis and construction of wavelets. Chapter 3 addresses neural networks, where the stress is on perceptron type networks and the back propagation learning algorithm with its various adaptations. Starting with a single McCulloch-Pitts neuron model to build intuition, we present perceptrons and gradient-descent learning. Hebb type learning is the basis of most artificial neural network models, and competition is a major learning principle, so we develop these concepts at some length. We also present other neural architectures, because it is our belief that the future lies in adapting and integrating wavelets to other neural architectures. Our presentation of other neural architectures gives the reader a firm foundation in the concepts, which we hope will generate new ideas. Chapter 4 is the capstone finale that deals with wavelet networks. In these final segments, we outline the theory underlying wavelet networks and present some basic applications.
Part II, contributed by many distinguished researchers in this area, is divided into five chapters. There is a great amount of work available that uses wavelet networks, much of which deserves inclusion here. Although we provide a review of current work, we have selected that material which, in our view, best represents important applications of wavelet networks. Chapter 5 (contributed by Ricardo Riaza, Raquel Gomez Sanchez, and Pedro J. Zufiria) presents recurrent learning in dynamical wavelet networks. This chapter is a bridge between Part I and Part II of the book. It could easily have been included in Part I, but we chose to include it in Part II because of the advanced nature of the concepts presented therein. Chapter 6 (contributed by Javier Echauz and George Vachtsevanos) presents the use of wavelet networks as trading advisors in the stock market; Chapter 7 (contributed by Javier Echauz and George Vachtsevanos) presents radial wavelet networks as classifiers in electroencephalographic drug detection; Chapter 8 (contributed by Liangyue Cao, Yiguang Hong, Haiping Fang, and Guowei He) presents the use and application of wavelet networks for prediction of chaotic time series including the short term and long term predictions, where a number of time series generated from chaotic systems such as the Mackey-Glass equation are tested. Chapter 9 (contributed by Nageswara S.V. Rao and S.S. Iyengar) presents concept learning and deals with approximation by wavelet networks.
The book includes a detailed bibliography arranged under chapter headings to help the reader easily locate references of interest. The bibliography contains additional references not explicitly cited in text, and will provide the reader with a rich resource for further exploration.
Future
It is our hope that this work will inspire more students to work in wavelet networks and open up new areas in research and applications. The literature is rich and growing in the applications of wavelet networks for engineering and economics. In the future, we would like to see it grow in application for social sciences and humanities. This book is a basic resource, not only for current applications, but for future opportunities as well.
About the Authors
S.S. Iyengar is the Chairman and Distinguished Research Master Award Winning Professor of the Computer Science Department at Louisiana State University. He has been involved with research in high-performance algorithms, data structures, sensor fusion, data mining, and intelligent systems since receiving his Ph.D. (in 1974 at Mississippi State University) and his M.S. from the Indian Institute of Science (1970). He has directed over 30 Ph.D. candidates, many of whom are faculty at major universities worldwide or scientists or engineers at national labs/industry around the world. He has served as a principal investigator on research projects supported by the Office of Naval Research (ONR), Defense Advanced Research Project Agency (DARPA), the National Aeronautics and Space Administration (NASA), the National Science Foundation (NSF), California Institute of Technology's Jet Propulsion Laboratory (JPL), the Department of Navy-NORDA, the Department of Energy (DOE), LEQSF-Board of Regents, and the U.S. Army Research Office. His publications include 11 books (5 authored or coauthored textbooks and 6 edited; Prentice-Hall, CRC Press, IEEE Computer Society Press, John Wiley & Sons, etc.) and over 270 research papers in refereed journals and conferences in areas of high-performance parallel and distributed algorithms and data structures for image processing and pattern recognition, and distributed data mining algorithms for biological databases. His books have been used at New Mexico State University, Purdue, University of Southern California, University of New Mexico, etc.
He was a visiting professor at the Jet Propulsion Laboratory-Cal Tech, Oak Ridge National Laboratory, the Indian Institute of Science, and at the University of Paris. Dr. Iyengar has served as an associate editor for the Institute of Electrical and Electronics Engineers and as guest editor for the IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on SMC, IEEE Transactions on Software Engineering, Journal of Theoretical Computer Science, Journal of Computer and Electrical Engineering, Journal of the Franklin Institute, Journal of the American Society of Information Science, International Journal of High Performance Computing Applications, etc. He has served on review panel committees for NSF, NASA, DOE-ORNL, the U.S. Army Research Office, etc. He has been on the prestigious National Institute of Health-NLM Review Committee, in the area of medical informatics, for 4 years. Dr. Iyengar is a series editor for NeuroComputing of Complex Systems for CRC Press.
In 1998 Dr. Iyengar was the winner of the IEEE Computer Society Technical Achievement Award for Outstanding Contributions to Data Structures and Algorithms in Image Processing and Sensor Fusion Problems. This is the most prestigious research award from the IEEE Computer Society. Dr. Iyengar was awarded the LSU Distinguished Faculty Award for Excellence in Research, the Hub Cotton Award for Faculty Excellence, and the LSU Tiger Athletic Foundation Teaching Award in 1996. He has been a consultant to several industrial and government organizations (JPL, NASA, etc.). In 1999, Professor Iyengar won the most prestigious research award, titled Distinguished Research Award, and a university medal for his research contributions in optimal algorithms for sensor fusion/image processing.
He is also a Fellow of Association of Computing Machinery (ACM), a Fellow of the IEEE, a Fellow of the American Association of Advancement of Science (AAAS), a Williams Evans Fellow, IEEE Distinguished Visitor, etc. He has been a Distinguished Lecturer for the IEEE Computer Society for the years 1995-1998, an ACM National Lecturer (1986-1992), and Member of the Distinguished SIAM Lecturer Program (2000-2002). He is also a member of the ACM accreditation committee for 2000-2002. He is a member of the New York Academy of Sciences. He has been the Program Chairman for many national/international conferences. He has given over 50 plenary talks and keynote lectures at numerous national and international conferences.
E.C. Cho, Professor in Mathematics, graduated with a Ph.D. in mathematics from Rutgers University in 1984. He has been working at Kentucky State University since 1989. His research interests include mathematical modeling and simulation, image analysis, algorithms, and optimization. He has extensively published in these areas. His papers have been published in Applied Mathematics Letters, Nieuw Archief voor Wiskunde, IEEE Transactions on Geoscience and Remote Sensing, Contemporary Mathematics, and Missouri Journal of Mathematical Sciences.
V.V. Phoha holds an M.S. and a Ph.D. in computer science from Texas Tech University, College of Engineering. He has taught and published extensively in neural networks, computer networks, and Internet security. He has over 10 years of experience in senior technical positions in industry and over 9 years of experience in academia as a university professor. He has won various distinctions including Outstanding Research Faculty and the Faculty Circle of Excellence award from Northeastern State University, Oklahoma. His writings have appeared in IEEE Transactions on Neural Networks, Communications of the ACM, Engineering Applications of Artificial Intelligence, and Mathematics and Computer Education, among others. As a student, he was twice awarded the President's Medal for academic distinction. At present, he is an Associate Professor and Program Chair of Computer Science in the College of Engineering and Science at Louisiana Tech University.
Formulae Notations and Abbreviations
The following is a glossary of some of the symbols and abbreviations used freely in this book.
$\forall$ $\exists$ $\in$ $\cup$ $\cap$ $\subset$ $\supset$ $\pm$ $\times$ $\circ$ $\Rightarrow$ [...]

[...] given $\varepsilon > 0$ there is $N$ such that $|a_n - a| < \varepsilon$ for every $n > N$. In other words, all $a_k$ except possibly the first $N$ terms lie in the open interval $(a - \varepsilon, a + \varepsilon)$, called the $\varepsilon$-neighborhood of $a$. For example, the sequence $k/(k+1)$ converges to 1. Given $\varepsilon > 0$,
$$\left| \frac{n}{n+1} - 1 \right| < \varepsilon \quad \forall n > N$$
where $N$ is an integer greater than $1/\varepsilon$.
A sequence that does not converge is said to be divergent. The sequence $a_k = k$ diverges to $\infty$, $a_k = \cos \pi k$ diverges (more precisely, oscillates between 1 and $-1$), and $a_k = (-2)^k$ diverges to $\pm\infty$.
A sequence $a_k$ in $\mathbf{R}$ satisfies the Cauchy criterion if
$$\forall \varepsilon > 0,\ \exists N \text{ such that } |a_n - a_m| < \varepsilon \quad \forall m, n > N$$
A Cauchy sequence is a sequence that satisfies the Cauchy criterion. A Cauchy sequence is bounded, that is, there is $B$ such that $|a_k| < B$ for every $k$. No divergent sequence in $\mathbf{R}$ is a Cauchy sequence. A convergent sequence is obviously a Cauchy sequence, and every Cauchy sequence in $\mathbf{R}$ converges to a limit in $\mathbf{R}$. However, whether a Cauchy sequence in $X$ converges to a limit in $X$ depends on $X$. A space $X$ is called complete if every Cauchy sequence in $X$ converges to a limit in $X$. $\mathbf{R}$ is complete, but $\mathbf{Q}$, the set of rational numbers, is not. In other words, there are Cauchy sequences of rational numbers that do not converge to rational numbers. For example, the sequence of rational numbers given by the recurrence relation
$$a_1 = 1, \qquad a_n = \frac{a_{n-1}}{2} + \frac{1}{a_{n-1}}$$
converges to $\sqrt{2} \notin \mathbf{Q}$. The sequence $1/k$ shows that the space of all positive real numbers is not complete.
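The recurrence above can be explored numerically. The following is a minimal sketch, not taken from the book: it uses Python's `fractions.Fraction` so that every term is an exact rational number, and the helper name `sqrt2_iterates` is an illustrative assumption.

```python
from fractions import Fraction


def sqrt2_iterates(n_terms):
    """Return the first n_terms of a_1 = 1, a_n = a_{n-1}/2 + 1/a_{n-1}.

    Every term is an exact rational number (a Fraction), so the whole
    sequence stays inside Q.
    """
    a = Fraction(1)
    terms = [a]
    for _ in range(n_terms - 1):
        a = a / 2 + 1 / a
        terms.append(a)
    return terms


terms = sqrt2_iterates(6)
for k, a in enumerate(terms, start=1):
    # a*a - 2 shrinks toward 0 but is never exactly 0 for a rational a
    print(k, a, float(a * a - 2))
```

Each printed term is rational, yet the squares approach 2, which illustrates why the limit cannot lie in $\mathbf{Q}$.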
From a sequence $(a_1, a_2, a_3, \ldots)$ of numbers we define a series $s_n = \sum_{k=1}^{n} a_k$:
$$s_1 = a_1$$
$$s_2 = a_1 + a_2$$
$$s_3 = a_1 + a_2 + a_3$$
$$s_n = s_{n-1} + a_n$$
From a given series $s_n$, we can recover the sequence
$$a_k = s_k - s_{k-1}$$
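$\lim_{n \to \infty} s_n = s$ means the series $s_n$ converges to a finite number $s$, that is, given $\varepsilon > 0$ there exists $N$ such that
$$|s - (a_1 + a_2 + \cdots + a_n)| < \varepsilon \quad \text{for every } n > N$$
$s$ is called the sum of the series. Each $s_n$ is called a partial sum, and $r_n = s - s_n$ the remainder.
Here are examples of well known and useful convergent series.
$$\sum_{k=0}^{\infty} \frac{1}{2^k} = 1 + \frac{1}{2} + \frac{1}{4} + \cdots = 2$$
$$\sum_{k=0}^{\infty} \frac{1}{k!} = \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \cdots = e$$
$$\sum_{k=0}^{\infty} \frac{(-1)^k}{k+1} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots = \ln 2$$
$$\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots = \frac{\pi}{4}$$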
It is necessary (but not sufficient) that $\lim a_k = 0$ for the series $\sum a_k$ to converge. For example, $\sum_{k=1}^{\infty} 1/k = \infty$, since
$$\sum_{k=1}^{n} \frac{1}{k} > \int_{1}^{n} \frac{1}{x}\,dx = \ln n$$
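The convergent examples and the divergent harmonic series can be checked with partial sums. This is a small sketch under the assumption that floating-point arithmetic is accurate enough for illustration; the helper names `partial_sum` and `harmonic` are hypothetical.

```python
import math


def partial_sum(term, n):
    """Sum term(k) for k = 0, 1, ..., n."""
    return sum(term(k) for k in range(n + 1))


# Partial sums of the four convergent series listed above
geometric = partial_sum(lambda k: 1 / 2**k, 50)                     # -> 2
exponential = partial_sum(lambda k: 1 / math.factorial(k), 20)      # -> e
alt_harmonic = partial_sum(lambda k: (-1) ** k / (k + 1), 100_000)  # -> ln 2
leibniz = partial_sum(lambda k: (-1) ** k / (2 * k + 1), 100_000)   # -> pi/4


def harmonic(n):
    """Partial sum 1 + 1/2 + ... + 1/n of the divergent harmonic series."""
    return sum(1 / k for k in range(1, n + 1))


for n in (10, 1_000, 100_000):
    # The partial sum always exceeds ln n, so it grows without bound
    print(n, harmonic(n), math.log(n))
```

The two alternating series converge slowly, so their partial sums agree with the limits only to about four or five decimal places even after 100,000 terms.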
According to the Cauchy criterion, a series $s_n = \sum_{k=1}^{n} a_k$ converges if and only if given $\varepsilon > 0$ there exists $N$ such that $m > N$ implies
$$\left| \sum_{i=m}^{m+k} a_i \right| < \varepsilon \quad \text{for any } k$$
A Taylor series is a function series, that is, a series in the space of polynomial functions. The Taylor series of an infinitely differentiable function $f(x)$ at $x = a$ is given by
$$f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n + \cdots$$
For example, the Taylor series of $f(x) = e^x$ at $x = 0$ is
$$1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots + \frac{x^n}{n!} + \cdots$$
The Taylor series of $f$ at $x = 0$ is called the Maclaurin series of $f$. A Laurent series is a function series that contains negative, 0, and positive powers of $x$. A Fourier series is also a function series, a sum of cosine and sine functions of various frequencies. It is of the form
$$a_0 + \sum_{k=1}^{\infty} a_k \cos kx + \sum_{k=1}^{\infty} b_k \sin kx$$
Fourier series will be discussed in detail later.
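As an illustration of the Taylor series of $e^x$, the sketch below sums the first $n + 1$ terms and compares the result with the library exponential. The function name `exp_taylor` is an assumption made for this example only.

```python
import math


def exp_taylor(x, n):
    """Partial sum 1 + x + x^2/2! + ... + x^n/n! of the Maclaurin series of e^x."""
    term = 1.0   # current term x^k / k!, starting with k = 0
    total = 1.0
    for k in range(1, n + 1):
        term *= x / k      # x^k/k! = (x^(k-1)/(k-1)!) * x/k
        total += term
    return total


for n in (1, 2, 5, 10, 20):
    print(n, exp_taylor(1.0, n), math.e)
```

Building each term from the previous one avoids computing large factorials and powers separately.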
1.4 Complex Numbers
Let $i = \sqrt{-1}$ so that $i^2 = -1$. A complex number $z$ is of the form
$$z = x + iy$$
where $x, y \in \mathbf{R}$. $x$ is called the real part of $z$, denoted by $\mathrm{Re}(z)$, and $y$ the imaginary part of $z$, denoted by $\mathrm{Im}(z)$. We can write $z$ as $z = \mathrm{Re}(z) + i\,\mathrm{Im}(z)$. Note that both the real part and the imaginary part of $z$ are real numbers.
The polar form of $z = x + yi$ is
$$z = r(\cos\theta + i\sin\theta)$$
where $r = \sqrt{x^2 + y^2}$. $r$ is called the absolute value (or magnitude) of $z$ and $\theta = \arctan(y/x)$ (for $x > 0$) the argument (or phase angle) of $z$. The Euler formula
$$e^{i\theta} = \cos\theta + i\sin\theta$$
can be understood in terms of the Taylor series
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$
$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots$$
$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$$
by substituting $i\theta$ for $x$ in $e^x$. Using the Euler formula, we can express $z = re^{i\theta}$. Writing $z$ in polar form is convenient when multiplying and taking powers of $z$,
$$z^k = r^k e^{ik\theta}$$
It also follows from the Euler formula that
$$\cos\theta = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right)$$
$$\sin\theta = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right)$$
The conjugate of $z = x + iy = re^{i\theta}$ is $\bar{z} = x - iy = re^{-i\theta}$.
Addition of complex numbers is defined by
$$(a + ib) + (c + id) = (a + c) + i(b + d)$$
and multiplication by
$$(a + ib)(c + id) = (ac - bd) + i(ad + bc)$$
In polar form, multiplication is done by multiplying the absolute values and adding the arguments (angles),
$$re^{i\theta}\, se^{i\phi} = rs\, e^{i(\theta + \phi)}$$
It follows that
$$\mathrm{Re}(z) = \frac{z + \bar{z}}{2}, \qquad \mathrm{Im}(z) = \frac{z - \bar{z}}{2i}$$
The length (absolute value, magnitude, or modulus) of $z$, denoted by $|z|$, is given by
$$|z|^2 = z\bar{z} = x^2 + y^2$$
The length satisfies the following properties,
$$|z + w| \le |z| + |w|, \qquad |zw| = |z||w|$$
The inverse of a nonzero $z$ is $z^{-1} = \bar{z}/|z|^2$, since $z\bar{z} = |z|^2$. For example, $(a + bi)^{-1} = (a - bi)/(a^2 + b^2)$.
The set of all complex numbers $z = x + iy$ is denoted by $\mathbf{C}$,
$$\mathbf{C} = \{x + iy : x, y \in \mathbf{R}\}$$
A complex number $z = x + iy$ can be viewed as a point (a vector) $(x, y)$ in the two-dimensional real plane $\mathbf{R}^2$. The points on the $x$ axis correspond to the pure real numbers and those on the $y$ axis to the pure imaginary numbers. On the plane, multiplication by $i$ can be viewed as the counterclockwise rotation by 90 degrees about the origin and by $-1$ as the rotation by 180 degrees. More generally, multiplication by $e^{i\theta}$ corresponds to the rotation by $\theta$ (in radians) and by $re^{i\theta}$ to the rotation and scaling by the factor $r$. Obviously,
$$e^{i\pi/2} = i, \qquad e^{i\pi} = -1$$
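The polar-form rules above can be tried out with Python's built-in complex type and the standard `cmath` module. This is an illustrative sketch; the sample values and the helper `rotate` are assumptions chosen for the example.

```python
import cmath
import math

z = 3 + 4j
r, theta = cmath.polar(z)          # r = |z|, theta = argument in radians
w = cmath.rect(2.0, math.pi / 6)   # the number 2 e^{i pi/6}

# Multiplying multiplies the absolute values and adds the arguments
product = z * w
print(abs(product), r * 2.0)
print(cmath.phase(product), theta + math.pi / 6)

# Inverse via the conjugate: 1/z = conj(z) / |z|^2
inverse = z.conjugate() / abs(z) ** 2
print(inverse, 1 / z)


def rotate(point, angle):
    """Rotate a point of the plane (stored as a complex number) by angle radians."""
    return point * cmath.exp(1j * angle)


print(rotate(1 + 0j, math.pi / 2))  # approximately i
print(cmath.exp(1j * math.pi))      # approximately -1
```

Because of floating-point rounding, the last two printed values match $i$ and $-1$ only up to an error on the order of $10^{-16}$.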
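Now a complex number $z = x + iy$ can be represented as a linear map from $\mathbf{R}^2$ to itself and hence as a $2 \times 2$ matrix with respect to the standard basis of $\mathbf{R}^2$. For example,
$$1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad i = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
and it follows that
$$a + ib = \begin{pmatrix} a & -b \\ b & a \end{pmatrix}, \qquad e^{i\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
The complex multiplication of $x + iy$ with $a + ib$ can be represented as
$$\begin{pmatrix} x & -y \\ y & x \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} xa - yb \\ ya + xb \end{pmatrix}$$
It also follows that the matrix multiplication corresponds to the complex number multiplication,
$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} c & -d \\ d & c \end{pmatrix} = \begin{pmatrix} ac - bd & -bc - ad \\ bc + ad & ac - bd \end{pmatrix}$$

1.5 Linear Spaces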
A vector is an object with magnitude and direction such as velocity or force in physics. A vector is usually represented by an $n$-tuple of real (or complex) numbers. We define a space of vectors as follows.
A linear space (also called a vector space) consists of a set $V$ whose elements are called vectors, a rule of adding vectors, and a rule of multiplying a real (or complex) number (called a scalar) with a vector, satisfying the following conditions.
1. $V$ is an abelian group under addition ($+$).
2. For $x, y$ in $V$ and numbers $c, d$,
$$c(x + y) = cx + cy$$
$$(c + d)x = cx + dx$$
$$c(dx) = (cd)x$$
$$1x = x$$
$V$ is an abelian group means
1. $x + y = y + x$ for every $x, y \in V$.
2. $(x + y) + z = x + (y + z)$ for every $x, y, z \in V$.
3. There exists a vector (called the zero vector) $0$ in $V$ such that $x + 0 = 0 + x = x$.
4. For every vector $x$ in $V$ there exists $-x$ in $V$ such that $x + (-x) = 0$.
If the numbers are complex numbers then the space is called a complex linear space (or a complex vector space).
A subspace of a vector space $V$ is a monomorphism $W \to V$. We can also define a subspace $W$ of a vector space $V$ as a subset of $V$ which is also a vector space by itself. This means
1. $0 \in W$.
2. If $u, v \in W$ and $a$ is a number then $au + v \in W$.
An intersection of subspaces of $V$ is also a subspace of $V$. If $K$ is a nonempty subset of $V$ then the subspace generated by $K$ is the intersection of all the subspaces of $V$ that contain $K$. For any subset $K$ of $V$ there is always a subspace containing $K$, for example, $V$, and the subspace generated by $K$ is the smallest subspace containing $K$.
Example 1: If $K$ consists of a single nonzero vector $u \in V$, then the subspace generated by $K = \{u\}$ is the subspace consisting of all the scalar multiples of $u$; in other words, it is a one-dimensional line containing $u$.
The direct sum of two vector spaces $V$ and $W$ is the direct sum of $V, W$ in the category of vector spaces. That is, it is a vector space, denoted by $V \oplus W$, and monomorphisms $i : V \to V \oplus W$ and $j : W \to V \oplus W$ such that for monomorphisms $f : V \to A$ and $g : W \to A$ there is a unique morphism $\alpha : V \oplus W \to A$ with $f = \alpha \circ i$ and $g = \alpha \circ j$.
The direct product of two vector spaces $V$ and $W$ is a vector space, also denoted by $V \oplus W$, and epimorphisms $p : V \oplus W \to V$ and $q : V \oplus W \to W$ such that for epimorphisms $f : A \to V$ and $g : A \to W$ there is a unique morphism $\alpha : A \to V \oplus W$ with $f = p \circ \alpha$ and $g = q \circ \alpha$.
Example 2: Let $\mathbf{R}$ be the set of all real numbers. The cartesian product of $n$ copies of $\mathbf{R}$,
$$\mathbf{R}^n = \{(x_1, \ldots, x_n) : x_i \in \mathbf{R},\ i = 1, \ldots, n\}$$
the set of all $n$-tuples $(x_1, \ldots, x_n)$ of real numbers, is an $n$-dimensional real linear space. The vectors are added and multiplied by scalars as follows,
$$(x_1, \ldots, x_n) \pm (y_1, \ldots, y_n) = (x_1 \pm y_1, \ldots, x_n \pm y_n)$$
$$c(x_1, \ldots, x_n) = (cx_1, \ldots, cx_n)$$
Similarly, the cartesian product of $n$ copies of $\mathbf{C}$ (the set of all complex numbers),
$$\mathbf{C}^n = \{(z_1, \ldots, z_n) : z_i \in \mathbf{C},\ i = 1, \ldots, n\}$$
is an $n$-dimensional complex linear space.
A set of vectors $\{v_1, \ldots, v_k\}$ is said to be linearly independent if [...]
Similarly, we define an $n$-dimensional row vector as an ordered list of $n$ numbers arranged in a row
$$v = (a_1, a_2, \ldots, a_n)$$
We will work mostly with column vectors since they are convenient under our notational convention of writing mappings on the left of the variable (as $f(x)$) and multiplying the matrix on the left of the vector (as $Av$).
An $m \times n$ matrix is an ordered list of $mn$ numbers arranged in a rectangular form
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}$$
$n \times n$ matrices are called square matrices. For example, a $2 \times 2$ matrix [...]
$a_{ij}$ stands for the element at the $i$th row and $j$th column. For example, $a_{11} = 1$ and $a_{22} = 5$ in the matrix above.
The transpose of a matrix $A$ is defined by interchanging the rows and columns of $A$ and is denoted by $A^t$. For example, [...]
The element of $A^t$ at the position $(ij)$ is the element of $A$ at the position $(ji)$. Obviously, the transpose of an $m \times n$ matrix
is an $n \times m$ matrix and $(A^t)^t = A$. If $A = A^t$ then $A$ is called symmetric. If $A = -A^t$ then $A$ is called antisymmetric (or skew-symmetric). Any square matrix $A$ can be written as a sum of a symmetric matrix and an antisymmetric matrix
$$A = \frac{1}{2}(A + A^t) + \frac{1}{2}(A - A^t)$$
Example 4:
$$\begin{pmatrix} 2 & 4 \\ 6 & 8 \end{pmatrix} = \begin{pmatrix} 2 & 5 \\ 5 & 8 \end{pmatrix} + \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
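The decomposition in Example 4 can be reproduced with a few lines of code. The sketch below stores a matrix as a list of rows; the helper names `transpose` and `symmetric_parts` are illustrative assumptions, not notation from the book.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def symmetric_parts(a):
    """Split a square matrix A into (A + A^t)/2 and (A - A^t)/2."""
    at = transpose(a)
    n = len(a)
    sym = [[(a[i][j] + at[i][j]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(a[i][j] - at[i][j]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


A = [[2, 4], [6, 8]]
S, K = symmetric_parts(A)
print(S)  # symmetric part
print(K)  # antisymmetric part
```

Adding the two returned matrices entry by entry recovers the original matrix `A`.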
An $n$-dimensional column vector can be viewed as an $n \times 1$ matrix, and an $n$-dimensional row vector can be viewed as a $1 \times n$ matrix. A column vector
$$\begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$$
will be frequently written as $(a_1, \ldots, a_n)^t$ to save space.
Vectors of the same size are added by adding each corresponding component. For example,
$$(1, 3) + (2, 4) = (3, 7)$$
Matrices of the same size are also added by adding each corresponding entry. For example,
$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} + \begin{pmatrix} 5 & 6 \\ 6 & 3 \end{pmatrix} = \begin{pmatrix} 6 & 8 \\ 9 & 7 \end{pmatrix}$$
Vectors and matrices are multiplied by a scalar by multiplying each component by the scalar. For example, $3(1, 2) = (3, 6)$ and
$$3 \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 6 \\ 9 & 6 \end{pmatrix}$$
$\Sigma$ (pronounced sigma) is used to denote the sum of a sequence of numbers
$$\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n$$
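A whole system of equations is expressed compactly by using sigma notation. For example
$$b_1 = w_{11}a_1 + w_{12}a_2 + \cdots + w_{1n}a_n$$
$$b_2 = w_{21}a_1 + w_{22}a_2 + \cdots + w_{2n}a_n$$
$$\vdots$$
$$b_m = w_{m1}a_1 + w_{m2}a_2 + \cdots + w_{mn}a_n$$
is written as
$$b_k = \sum_{i=1}^{n} w_{ki} a_i \quad \text{for } k = 1, \ldots, m$$
Let $A = (a_{ij})$ be a $p \times m$ matrix and $B = (b_{jk})$ an $m \times n$ matrix. The product $AB = C = (c_{ik})$ is a $p \times n$ matrix defined as follows:
$$c_{ik} = \sum_{j=1}^{m} a_{ij} b_{jk} \quad \text{for } i = 1, \ldots, p,\ k = 1, \ldots, n$$
Matrix multiplication is not commutative. $BA$ is not defined unless $p = n$. If it is defined, $BA$ is of size $m \times m$ while $AB$ is of size $n \times n$. If $m = n$ so that both $AB$ and $BA$ are of the same size $n \times n$, they are different in general.
Example 5:
$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 3 \\ 5 & 7 \end{pmatrix} = \begin{pmatrix} 11 & 17 \\ 23 & 37 \end{pmatrix}$$
$$\begin{pmatrix} 1 & 3 \\ 5 & 7 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 10 & 14 \\ 26 & 38 \end{pmatrix}$$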
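The product formula translates directly into code. The following sketch implements $c_{ik} = \sum_j a_{ij} b_{jk}$ for matrices stored as lists of rows and reruns Example 5; the function name `matmul` is an assumption for this illustration.

```python
def matmul(a, b):
    """Multiply a p x m matrix a by an m x n matrix b (lists of rows)."""
    p, m, n = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    # c[i][k] is the sum over j of a[i][j] * b[j][k]
    return [[sum(a[i][j] * b[j][k] for j in range(m)) for k in range(n)]
            for i in range(p)]


A = [[1, 2], [3, 4]]
B = [[1, 3], [5, 7]]
AB = matmul(A, B)
BA = matmul(B, A)
print(AB)
print(BA)
print(AB == BA)  # the two products differ
```

In practice a numerical library would be used for large matrices; the triple loop here is only meant to mirror the definition.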
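Taking the transpose does not commute with matrix multiplication, but
$$(AB)^t = B^t A^t$$
The $n \times n$ matrix $(\delta_{ij})$ is called the identity matrix of dimension $n$, denoted by $I_n$. $\delta$ is the Kronecker delta,
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$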
$I_n$ is the identity of matrix multiplication of $n \times n$ matrices,
$$A I_n = I_n A = A \quad \forall\ n \times n \text{ matrix } A$$
Let $A$ be an $n \times n$ matrix. If $AB = I_n$ for some $B$, then $BA = I_n$ and $B$ is unique. The matrix $B$ is called the inverse of $A$ and denoted by $A^{-1}$. Since $AB\,B^{-1}A^{-1} = I$,
$$(AB)^{-1} = B^{-1}A^{-1}$$
Since $I = I^t = (AA^{-1})^t = (A^{-1})^t A^t$,
$$(A^{-1})^t = (A^t)^{-1}$$
If $u = (u_1, \ldots, u_n)^t$ and $v = (v_1, \ldots, v_n)^t$ are column vectors in $\mathbf{R}^n$, then $u^t v = u_1 v_1 + \cdots + u_n v_n$ is the inner product of $u$ and $v$. The inner product defines the norm of the vector and the metric $\rho$ on the vector space.
$$\|u\| = (u^t u)^{1/2}$$
$$\rho(u, v) = \|u - v\| \quad \forall u \in \mathbf{R}^n,\ v \in \mathbf{R}^n$$
The Cauchy-Schwarz inequality states
$$|u^t v| \le \|u\|\,\|v\|$$
which implies, for nonzero $u$ and $v$,
$$-1 \le \frac{u^t v}{\|u\|\,\|v\|} \le 1$$
[...]
1. $\|v\| \ge 0$ for every $v \in H$ and $\|v\| = 0$ iff $v = 0$.
2. $\|av\| = |a|\,\|v\|$ for every $a \in \mathbf{C}$ and $v \in H$.
3. $\|u + v\| \le \|u\| + \|v\|$ for every $u, v \in H$.
The norm induces a metric $d$ on $H$ by
$$d(u, v) = \|u - v\|, \quad u, v \in H$$
Indeed $d$ is a metric on $H$, that is, $d$ satisfies
1. $d(u, v) \ge 0$ for every $u, v \in H$ and $d(u, v) = 0$ iff $u = v$.
2. $d(u, v) = d(v, u)$ for every $u, v \in H$.
3. $d(u, w) \le d(u, v) + d(v, w)$ for every $u, v, w \in H$.
$H$ is complete with respect to the metric $d$, which means every Cauchy sequence in $H$ converges to a limit in $H$. A sequence $a_n$ is a Cauchy sequence if for a given positive $\varepsilon$ there exists a number $N$ such that $d(a_m, a_n) < \varepsilon$ for every $m, n > N$. The sequence $a_n$ converges to a limit $a \in H$ if for a given positive $\varepsilon$ there exists a number $N$ such that $d(a_n, a) < \varepsilon$ for every $n > N$.
For example, the set of all rational numbers $\mathbf{Q}$ is not complete but $\mathbf{R}$ is. Let $a_n = 1 + 1/1! + 1/2! + \cdots + 1/n!$. Then $a_n$ is a Cauchy sequence in $\mathbf{Q}$ but it does not converge to any number in $\mathbf{Q}$; it converges to $e$ in $\mathbf{R}$.
We are interested in the particular Hilbert space called $L^2$ space or the space of $L^2$ functions. $L^2$ functions are also called square integrable functions.
$$L^2(\mathbf{R}) = \left\{ f : \int_{-\infty}^{\infty} |f(x)|^2\,dx < \infty \right\}$$
$L^2([a, b])$ is the space of square integrable functions on the interval $[a, b]$. The inner product on $L^2$ space is given by
$$(f, g) = \int f(x)\,\overline{g(x)}\,dx$$
and the norm $\|f\|$ is given by
$$\|f\|^2 = (f, f) = \int |f(x)|^2\,dx$$
Mathematical Preliminaries
Given a set of $L^2$ functions $\{f_1, \ldots, f_n\}$, the span is defined to be the subspace of $L^2$,
$$\left\{ f \in L^2 : f = \sum_{k=1}^{n} a_k f_k \text{ for some } a_k \right\}$$
1.8 Topology
We need some concepts from topology. In this section, we give only the definitions of open sets, closed sets, and metrics. We are mainly interested in subsets of Euclidean space $\mathbf{R}^n$. Let $a$ be a point in $\mathbf{R}^n$ and $r$ be a positive real number. The open ball of radius $r$ about the point $a$ is the set of all points in $\mathbf{R}^n$ within distance $r$ of $a$:
$$B_{a,r} = \{ x \in \mathbf{R}^n : |x - a| < r \}$$
A subset $S$ of $\mathbf{R}^n$ is called open if for every point $a$ in $S$ the points near $a$ are also in $S$. In other words, $S$ is open if the following condition holds: for any $a \in S$, there exists $r > 0$ such that $B_{a,r} \subset S$. For example, the whole set $\mathbf{R}^n$ is open and the open ball is open. Single point sets are not open; the complement of a single point set is open. Open sets have the following properties:
• The union of open sets is open.
• The intersection of finitely many open sets is open.
A subset $C$ of $\mathbf{R}^n$ is called closed if its complement is open. For example, the whole set $\mathbf{R}^n$ is closed and the closed ball $\bar{B}_{a,r} = \{ x \in \mathbf{R}^n : |x - a| \le r \}$ is closed. Single point sets are closed. By definition, the empty set $\emptyset$ and the whole set are both open and closed. We can extend this definition of "open" to a more general space $X$ where the notion of distance is defined. A metric space is a set $X$ with a function $d : X \times X \to \mathbf{R}$ such that
• $d(a, b) \ge 0$ for every $a, b \in X$ and $d(a, b) = 0$ if and only if $a = b$.
• $d(a, b) = d(b, a)$ for every $a, b \in X$.
• $d(a, c) \le d(a, b) + d(b, c)$ for every $a, b, c \in X$.
We define a subset $A$ of $X$ as open if for every $a \in A$ there is an $\epsilon > 0$ such that the $\epsilon$-ball $B_\epsilon = \{ x \in X : d(a, x) < \epsilon \}$ centered at $a$ is contained in $A$. We can define different metrics on $\mathbf{R}^2$ which induce the same topology. That is, the sets of open sets defined by these metrics are identical. Let $a = (a_1, a_2)$ and $b = (b_1, b_2)$. The usual Euclidean metric is defined by
$$d(a, b) = \sqrt{(b_1 - a_1)^2 + (b_2 - a_2)^2}$$
The following metrics
$$d(a, b) = \max\{ |b_2 - a_2|, |b_1 - a_1| \}$$
and
$$d(a, b) = |b_2 - a_2| + |b_1 - a_1|$$
define the same topology, that is, they all generate the same class of open sets in $\mathbf{R}^2$.
The concept of compactness is closely related to the concept of finiteness. A subset $A$ of a topological space $X$ is compact if every open cover of $A$, that is, a collection of open sets whose union contains $A$, has a finite subcover. In other words, we can always reduce any open cover of a compact set to a finite number of open sets that still cover the set. For subsets of $\mathbf{R}^n$, compactness is equivalent to being closed and bounded. (A subset $A$ of $\mathbf{R}^n$ is bounded if there is an open ball $B_r$ which contains $A$.) Any continuous function defined on a compact set has both a maximum and a minimum. For example, $f(x) = x^2$ does not have a maximum on $\mathbf{R}$, but the restriction of $f$ to the compact interval $[0, 2]$ has minimum 0 and maximum 4 on the interval.
The smallest closed subset outside which a function vanishes is called the support of the function. For example, the support of the sine function is $\mathbf{R}$ and the support of the Haar function is $[0, 1]$. A continuous function $f$ with compact support is bounded ($|f(x)| \le M$ for some $M$) and also has a maximum and a minimum.
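One way to see why the three metrics on $\mathbf{R}^2$ give the same open sets is that each is bounded by a constant multiple of the others, so every ball of one metric contains a ball of another. The sketch below checks the chain max-metric ≤ Euclidean ≤ sum-metric ≤ 2 · max-metric on random points; the function names and the sampling range are illustrative assumptions.

```python
import math
import random

def d_euclid(a, b):
    """Usual Euclidean metric on R^2."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def d_max(a, b):
    """Maximum of the coordinate differences."""
    return max(abs(b[0] - a[0]), abs(b[1] - a[1]))

def d_sum(a, b):
    """Sum of the coordinate differences."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

def comparable(a, b, tol=1e-12):
    """Check d_max <= d_euclid <= d_sum <= 2 * d_max for one pair of points."""
    m, e, s = d_max(a, b), d_euclid(a, b), d_sum(a, b)
    return m <= e + tol and e <= s + tol and s <= 2 * m + tol

random.seed(0)  # fixed seed so the run is reproducible

def random_point():
    return (random.uniform(-5, 5), random.uniform(-5, 5))

pairs = [(random_point(), random_point()) for _ in range(1000)]
all_comparable = all(comparable(a, b) for a, b in pairs)
```

A random check is of course not a proof; the inequalities themselves follow directly from the definitions.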
1.9 Measure and Integral
A measure space consists of a set $X$, a collection $M$ of subsets (called measurable sets), and an extended nonnegative real valued set function $\mu$ (a mapping that assigns a nonnegative value or $\infty$ to sets in $M$). They satisfy the following conditions.
1. $X$ and the empty set $\emptyset$ are in $M$ (they are measurable) and $\mu(\emptyset) = 0$.
2. If $A$ is in $M$ then $A^c = X - A$ is also in $M$.
3. If $A_1, A_2, \ldots$ are in $M$ then $\cup_i A_i$ is in $M$.
4. If $A_1, A_2, \ldots$ are in $M$ and mutually disjoint, then $\mu(\cup_i A_i) = \sum_i \mu(A_i)$, where the infinite sum either converges to a finite value or to $\infty$.
Intuitively, measure is a generalization of the notion of the size of a set: number of elements, length, area, or volume. For example, assigning the length $b - a$ to the interval $[a, b]$ is a set function defined on the set of bounded intervals, but this by itself is not a measure since it does not satisfy the additivity condition. For example, it is not defined for the union of two disjoint closed intervals. The following two examples are trivial and uninteresting.
Example 6: Let $X$ be any nonempty set. Let $M = \{X, \emptyset\}$, and $\mu(\emptyset) = 0$ and $\mu(X) = 1$.
Example 7: Let $X$ be any nonempty set. Let $M = \{A : A \subset X\}$, and $\mu(\emptyset) = 0$ and $\mu(A) = \infty$ for any nonempty subset $A$.
If we define $\mu^*$ on $P(\mathbf{R})$ by $\mu^*(A) = \inf \sum_i (b_i - a_i)$, where the infimum inf (or greatest lower bound) is taken over the collection of all unions of open intervals $(a_i, b_i)$ that cover $A$, then for intervals (either open or closed) $\mu^*$ measures their lengths; however, $\mu^*$ is not a measure. Instead, it satisfies
1. $\mu^*(\emptyset) = 0$.
2. If $A \subset B$ then $\mu^*(A) \le \mu^*(B)$.
3. $\mu^*(\cup_i A_i) \le \sum_i \mu^*(A_i)$.
The following theorem shows how to construct a measure space from a more general set function $\mu^*$ defined on every subset of the underlying set $X$.
Theorem 1 Let $X$ be any set and $P(X)$ be the power set (the set of all subsets of $X$). Let $\mu^*$ be a mapping from $P(X)$ to the extended nonnegative real numbers $\mathbf{R}^*$ ($\mathbf{R} \cup \{\infty\}$) satisfying the above conditions. Let $M$ be the set of all subsets $A$ of $X$ satisfying
$$\mu^*(E) = \mu^*(A \cap E) + \mu^*(A^c \cap E), \qquad \forall E \subset X$$
Define $\mu : M \to \mathbf{R}^*$ by $\mu(A) = \mu^*(A)$. Then $(X, M, \mu)$ is a measure space.
The measure defined on the set of real numbers $\mathbf{R}$ from $\mu^*$ by the theorem is called the Lebesgue measure on $\mathbf{R}$. Let $(X, M, \mu)$ be a measure space. Let $f : X \to \mathbf{R}$ be a simple function with $f(X) = \{r_1, \ldots, r_n\}$. We define the integral of $f$ over $X$ to be
$$r_1 \mu(f^{-1}(r_1)) + r_2 \mu(f^{-1}(r_2)) + \cdots + r_n \mu(f^{-1}(r_n))$$
It follows that the sets $f^{-1}(r_i)$ are mutually disjoint and the union $\cup_i f^{-1}(r_i)$ covers $X$. A real valued function $f : X \to \mathbf{R}$ is called measurable if the inverse image of any open interval is measurable, that is, $f^{-1}(I)$ is in $M$ for every open interval $I$ in $\mathbf{R}$. It follows that if $f$ is measurable, then $f^{-1}((-\infty, r))$ is measurable for every real number $r$. Since any open subset of $\mathbf{R}$ is (Lebesgue) measurable, and the inverse image of an open set under a continuous map is open, every continuous function $f : \mathbf{R} \to \mathbf{R}$ is measurable. Every measurable function is a limit of a sequence of simple functions. If $f \ge 0$ then the sequence of simple functions can be taken to be nonnegative and monotone increasing, that is, $0 \le f_1 \le f_2 \le \cdots$ and $\lim f_n = f$.
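The integral of a simple function can be illustrated on a small finite measure space, where every subset is measurable and the measure of a set is the sum of point weights. This is only a sketch: the set `X`, the weights, and the function `f` below are made-up examples, and the integral is computed as the sum of each value times the measure of its inverse image.

```python
def measure(subset, weight):
    """Measure of a subset of a finite set: the sum of its point weights."""
    return sum(weight[x] for x in subset)

def integrate_simple(f, X, weight):
    """Integral of a simple function f over X.

    Sums r * mu(f^{-1}(r)) over the finitely many values r taken by f.
    """
    total = 0.0
    for r in {f(x) for x in X}:
        preimage = {x for x in X if f(x) == r}
        total += r * measure(preimage, weight)
    return total

# Hypothetical finite measure space: four points with nonnegative weights.
X = {1, 2, 3, 4}
weight = {1: 0.5, 2: 0.5, 3: 1.0, 4: 2.0}

def f(x):
    """A simple function taking only the two values 3 and -1."""
    return 3.0 if x <= 2 else -1.0

integral = integrate_simple(f, X, weight)  # 3 * 1.0 + (-1) * 3.0
```

The inverse images `{1, 2}` and `{3, 4}` are disjoint and cover `X`, matching the remark in the text.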
1.10 Fourier Series
Let $f : \mathbf{R} \to \mathbf{R}$ be a periodic function of period $T$. The period $T$ is the smallest positive number with
$$f(t + T) = f(t), \qquad \forall t \in \mathbf{R}$$
The function $f$ is viewed as a function of time $t \in \mathbf{R}$. If $f$ has period $T$ seconds, then the frequency of $f$ is $1/T$ hertz (cycles per second). Frequency is the inverse of time. For example, $f(t) = \sin 2\pi t$ has period $T = 1$ and frequency 1 hertz. The frequency of $f(t) = \sin t$ is $1/(2\pi)$ hertz. If $f$ is a function of space rather than time, then we use the term "wave number", which is the inverse of space.
Suppose $f$ has period 1. Assume $f$ can be represented as a series of trigonometric functions $\cos 2\pi k x$ and $\sin 2\pi k x$, $k = 0, 1, 2, \ldots$:
$$f(x) = \sum_{k=0}^{\infty} a_k \cos 2\pi k x + \sum_{k=1}^{\infty} b_k \sin 2\pi k x$$
Since for $n, k = 1, 2, \ldots$,
$$\int_0^1 \sin 2\pi n x \cos 2\pi k x \, dx = 0$$
$$\int_0^1 \cos 2\pi n x \cos 2\pi k x \, dx = \int_0^1 \sin 2\pi n x \sin 2\pi k x \, dx = \frac{\delta_{nk}}{2}$$
where $\delta_{nk}$ is the Kronecker delta, we can compute $a_k$ and $b_k$ from
$$\int_0^1 f(x) \cos 2\pi k x \, dx = \frac{a_k}{2}, \qquad \int_0^1 f(x) \sin 2\pi k x \, dx = \frac{b_k}{2}$$
and
$$\int_0^1 f(x) \cos 2\pi 0 x \, dx = a_0$$
Redefining $a_0$ by the same formula as the other $a_k$, the Fourier series of $f$ is given by
$$\frac{1}{2} a_0 + \sum_{k=1}^{\infty} (a_k \cos 2\pi k x + b_k \sin 2\pi k x)$$
where
$$a_k = 2 \int_0^1 f(x) \cos 2\pi k x \, dx, \qquad b_k = 2 \int_0^1 f(x) \sin 2\pi k x \, dx$$
If the Fourier coefficients $a_k$ and $b_k$ at frequency $k$ are plotted as a point $(a_k, b_k)$ on the plane, the length $\sqrt{a_k^2 + b_k^2}$ is called the amplitude at frequency $k$, and the angle $\phi$ such that
$$\cos \phi = \frac{a_k}{\sqrt{a_k^2 + b_k^2}}, \qquad \sin \phi = \frac{b_k}{\sqrt{a_k^2 + b_k^2}}$$
is called the phase at frequency $k$.
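The coefficient formulas can be tried out numerically. The sketch below approximates $a_k$ and $b_k$ for a function of period 1 with a midpoint rule and then computes the amplitude and phase; the test signal, the sample count, and the function names are assumptions made for illustration.

```python
import math

def fourier_coefficients(f, k, samples=2000):
    """Approximate a_k = 2 * int_0^1 f(x) cos(2 pi k x) dx and the matching b_k.

    Uses the midpoint rule with `samples` equal subintervals of [0, 1].
    """
    h = 1.0 / samples
    a = b = 0.0
    for j in range(samples):
        x = (j + 0.5) * h
        a += f(x) * math.cos(2 * math.pi * k * x)
        b += f(x) * math.sin(2 * math.pi * k * x)
    return 2 * a * h, 2 * b * h

def amplitude_and_phase(a_k, b_k):
    """Length and angle of the point (a_k, b_k) in the plane."""
    return math.hypot(a_k, b_k), math.atan2(b_k, a_k)

def signal(x):
    """Hypothetical period-1 signal with a single component at frequency 2."""
    return 3.0 * math.cos(2 * math.pi * 2 * x) + 4.0 * math.sin(2 * math.pi * 2 * x)

a2, b2 = fourier_coefficients(signal, 2)
amp2, phase2 = amplitude_and_phase(a2, b2)
a1, b1 = fourier_coefficients(signal, 1)
```

For this signal the coefficients at frequency 2 come out close to 3 and 4, the amplitude close to 5, and the coefficients at frequency 1 close to 0, as the orthogonality relations predict.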
1.11 Exercises
1. Prove $f(A \cap B) \subset f(A) \cap f(B)$.
2. Find an example of $f(A \cap B) \ne f(A) \cap f(B)$.
3. Show $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$.
4. Find the inverse function of $f(x) = ax + b$, $a \ne 0$, defined on $\mathbf{R}$.
5. Find the inverse function of $f(x) = \sqrt{x}$ defined on $\mathbf{R}^+$.
6. Find the composite functions $f(g(x))$ and $g(f(x))$ of $f(x) = ax + b$ and $g(x) = cx + d$.
7. Let $A = \{a, b, c\}$. Find $A^2$ and $P(A)$.
8. Sketch the graphs of the Heaviside function, Box function, Gaussian function, exponential function, and hyperbolic tangent.
9. Find the real part and imaginary part of the following numbers: $z = a - bi$, $\bar{z}$, $1/z$, $z^2$, $i^n$ ($n = -3, -2, -1, 0, 1, 2, 3$).
10. Prove $|z + w|^2 + |z - w|^2 = 2(|z|^2 + |w|^2)$.
11. Find the first 10 terms of the sequence given by the recurrence equation $a_0 = 2$, $a_k = 2a_{k-1} - 1$.
12. Find the limits of the sequences if they converge: $a_k = (-1)^k / 2^k$, $a_k = 2^k$, $a_k = (\alpha k + \beta)/k$, $a_k = \alpha k / 2^k$, $a_k = \ln k / k$.
13. Give examples of Cauchy sequences of rational numbers that do not converge to rational numbers.
14. Find the Taylor series of $\cos x$ and $\sin x$ at $x = 0$.
15. Show that the Taylor series at $x = 0$ of a polynomial function $p(x)$ is $p(x)$.
16. Form all the subsets that can be formed from the set $A = \{1, 2, 3\}$.
17. Express the following using suitable symbols: (a) $A$ is a subset of $M$. (b) $N$ is a subset of $E$. (c) $\{P, Q\}$ is a subset of $\{P, Q, R\}$. (d) The set of odd numbers $\subset$ natural numbers. (e) The set of even numbers $\subset$ natural numbers.
18. If $A = \{P, Q\}$, write all the subsets of $A$.
19. Write all the subsets of $M = \{3, 5, 7\}$.
20. If $C = \{1, 2, 3, 4\}$, write all the subsets of set $C$ that contain: (a) One element (b) Two elements (c) Three elements
21. Write all the subsets of set $P = \{M, N, O, P\}$.
22. If $A = \{1, 3, 5\}$, $B = \{3, 5, 7, 9\}$, $C = \{4, 6, 8, 10\}$, find: (a) $A \cup B$ (b) $A \cup C$ (c) $B \cup C$ (d) $A \cap B$ (e) $A \cap C$ (f) $B \cap C$
23. If $P = \{x, y, z\}$, $Q = \{a, y, z\}$, $R = \{a, b, x\}$, find: (a) $P \cup Q$ (b) $P \cup R$ (c) $Q \cup R$ (d) $P \cap Q$ (e) $P \cap R$ (f) $Q \cap R$
24. Use examples of your own to show the truth of the following statements: (a) If two sets are equal, they have the same cardinal property, i.e., if $A = B$ then $n(A) = n(B)$. (b) If two sets are equivalent, they have the same cardinal property, i.e., if $A \leftrightarrow B$ then $n(A) = n(B)$.
25. Describe a special algebraic structure with examples to describe a computational structure of a system called a closed semiring system, defined as $(S, +, \cdot, 0, 1)$ where $S$ is a set of elements, $+$ and $\cdot$ are binary operations on $S$, satisfying the properties of monoid, commutative, distributive, finite sum, and $\cdot$ distributes over countably infinite sums as well as finite ones.
26. Show that $v_1 = (a, b)$ and $v_2 = (c, d)$ form an orthonormal basis of $\mathbf{R}^2$ if and only if the matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
is orthogonal, that is, $A A^T = I$.
27. Let $v_1, v_2, \ldots, v_n$ be an orthonormal basis of a vector space $V$. Show $v = \sum_i \langle v, v_i \rangle v_i$.
28. Prove the Cauchy-Schwarz inequality
$$|\langle u, v \rangle| \le \|u\| \, \|v\|$$
29. Show that the inverse image of 0 under a linear map $f$ between linear spaces is a subspace of the domain of the linear map.
30. Show that the image of $V$ under a linear map $f$ of $V$ into $W$ is a subspace of $W$, the codomain of $f$.
31. Let $v_1$ and $v_2$ be mutually orthogonal unit vectors in $\mathbf{R}^3$. Find a vector $v_3$ such that $v_1, v_2, v_3$ is an orthogonal basis of $\mathbf{R}^3$.
32. Prove $(1, 1)$ and $(1, -1)$ is an orthogonal basis of $\mathbf{R}^2$ and express any vector $(x, y)$ as a linear combination of $(1, 1)$ and $(1, -1)$.
33. Prove $\int_{-\infty}^{\infty} \frac{\sin^3 x}{x^3} \, dx = \frac{3\pi}{4}$
34. Let $A$ be a subinterval of $[0, 2\pi]$. Prove $\lim_{n \to \infty} \int_A \cos nx \, dx = 0$
35. Let $A$ be a subinterval of $[0, 2\pi]$. Prove $\lim_{n \to \infty} \int_A \sin nx \, dx = 0$
36. Prove $\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}$
37. Prove $\mu(\mathbf{Z}) = 0$ and $\mu(\mathbf{Q}) = 0$ where $\mu$ is the Lebesgue measure on $\mathbf{R}$, $\mathbf{Z}$ is the set of all integers, and $\mathbf{Q}$ is the set of all rational numbers.
38. Show $\int_{-\pi}^{\pi} \sin nx \, dx = 0$
39. Show $\int_{-\pi}^{\pi} \cos nx \, dx = 0$ for $n \ne 0$
40. Show $\int_{-\pi}^{\pi} \sin nx \cos mx \, dx = 0$ for $n \ne m$
41. Show $\int_{-\pi}^{\pi} \sin^2 nx \, dx = \pi$ for $n \ne 0$
42. Show $\int_{-\pi}^{\pi} \cos^2 nx \, dx = \pi$ for $n \ne 0$
43. Show that $\sin(nx)$ and $\sin(mx)$ are orthogonal in $L^2([0, 2\pi])$ if $n \ne m$.
44. Show that $\cos(nx)$ and $\cos(mx)$ are orthogonal in $L^2([0, 2\pi])$ if $n \ne m$.
45. Show that $\sin(nx)$ and $\cos(mx)$ are orthogonal in $L^2([0, 2\pi])$ if $n \ne m$.
46. Prove
is orthogonal in $L^2([-\pi, \pi])$.
47. Let $h(t) = e^{-2t}$ for $t \ge 0$ and $h(t) = 0$ for $t < 0$.
48. Find the $L^2$ norm of $h$. a. Find the $L^1$ norm of $h$. b. Prove $L^2(\mathbf{R})$ is not a subspace of $L^1(\mathbf{R})$. c. Prove $L^1(\mathbf{R})$ is not a subspace of $L^2(\mathbf{R})$.
49. Let $g(t) = f(t - a)$. Show $\hat{g}(\omega) = e^{-ia\omega} \hat{f}(\omega)$
50. Let $g(t) = f(t/b)$. Show $\hat{g}(\omega) = |b| \hat{f}(b\omega)$
51. Let $g(t) = f^{(k)}(t)$. Show $\hat{g}(\omega) = (i\omega)^k \hat{f}(\omega)$
52. Find the Fourier transform of $\chi_{[0,1]}(t)$.
53. Find the Fourier transform of the function $f(t) =$
Chapter 2
Wavelets
2.1 Introduction
Wavelets are functions generated by one function $w(x)$ called the analyzing wavelet (also called the mother wavelet). $w(x)$ is defined for a real variable $x$, but may take complex values. In other words, $w$ is a function from $\mathbf{R}$ into $\mathbf{C}$. The function $w$ has a finite $L^2$ norm, $\|w\|$, defined by
$$\|w\|^2 = \int_{-\infty}^{\infty} |w(x)|^2 \, dx$$
Since $w(x)$ may be complex valued, $|w(x)|^2$ is used instead of $w(x)^2$. The square of the $L^2$ norm, $\|w\|^2$, is called the energy of the function $w$. We may assume $\|w\| = 1$ by normalizing $w$ by
$$\frac{w(x)}{\|w\|}$$
The mother wavelet $w$ satisfies the following condition, called the compatibility condition
$$\int_{-\infty}^{\infty} \frac{|\hat{w}(\xi)|^2}{|\xi|} \, d\xi < \infty$$
where $\hat{w}$ is the Fourier transform of $w$. The compatibility condition implies
$$\int_{-\infty}^{\infty} w(x) \, dx = 0$$
if the integral exists. The sine function does not satisfy these conditions: $\sin(x)$ does not have finite $L^2$ norm. Since $\sin^2(x)$ has period $\pi$ and
$$\int_0^{\pi} \sin^2(x) \, dx = \int_0^{\pi} \frac{1 - \cos 2x}{2} \, dx = \frac{\pi}{2}$$
it follows that
$$\int_{-\infty}^{\infty} \sin^2(x) \, dx = \infty$$
The wavelets are generated from the analyzing wavelet by translation and dilation,
$$w_{ab}(x) = \frac{1}{\sqrt{a}} \, w\!\left(\frac{x - b}{a}\right)$$
where $a$ is a positive real number and $b$ is a real number. The factor $1/\sqrt{a}$ is multiplied to preserve the $L^2$ norm, that is, each $w_{ab}$ has norm 1 if $w$ has norm 1. For example, $w_{2,0}$ is the dilation of $w$ by the factor of 2, $w_{1,3}$ is a translation of $w$ by 3, and $w_{2,3}$ is the dilation of $w$ by the factor of 2 followed by translation by 3. The order of dilation then translation is important. If $w(x)$ is translated by $b$ and then dilated by $a$, it would be
$$w\!\left(\frac{x}{a} - b\right)$$
which, up to the normalizing factor $1/\sqrt{a}$, is equivalent to $w_{a,ab}(x)$,
$$w\!\left(\frac{x - ab}{a}\right)$$
that is, dilation by $a$ then translation by $ab$.
A certain (sufficiently large) class of functions can be represented as linear combinations of the wavelets. That is, the functions are expressed as a finite linear combination of the dilations and translations of a single function $w$, or as a limit of such finite linear combinations. Wavelet representations are more efficient than the traditional Fourier series representation in many situations where the signal is nonstationary, which means the signal changes its behaviour (frequency) with time (or space) or has local singularities. Also, a wavelet representation is more flexible, since we can choose the analyzing wavelet that is suitable for the signal being analyzed, whereas the Fourier representation has a fixed basis, namely the sine and cosine functions.
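The norm-preserving role of the factor $1/\sqrt{a}$, and the effect of the order of translation and dilation, can be checked numerically. The sketch below uses an unnormalized "Mexican hat" function as the analyzing wavelet; that choice, the integration window, and the helper names are assumptions for illustration rather than anything prescribed by the text.

```python
import math

def mexican_hat(x):
    """Unnormalized Mexican hat function, used here as an example wavelet."""
    return (1.0 - x * x) * math.exp(-x * x / 2.0)

def l2_norm(g, lo=-40.0, hi=40.0, n=40000):
    """Approximate the L2 norm of g on [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return math.sqrt(sum(abs(g(lo + (j + 0.5) * h)) ** 2 for j in range(n)) * h)

def wavelet_ab(w, a, b):
    """Dilate w by a > 0, then translate by b, with the 1/sqrt(a) factor."""
    if a <= 0:
        raise ValueError("a must be positive")
    return lambda x: w((x - b) / a) / math.sqrt(a)

def translate_then_dilate(w, a, b):
    """Translate w by b first, then dilate by a (no normalizing factor)."""
    return lambda x: w(x / a - b)

base_norm = l2_norm(mexican_hat)
dilated_norm = l2_norm(wavelet_ab(mexican_hat, 2.0, 3.0))

# Translating by b and then dilating by a matches dilation by a followed by
# translation by a*b, once the 1/sqrt(a) factor is put back in.
a, b, x0 = 2.0, 3.0, 1.7
swapped = translate_then_dilate(mexican_hat, a, b)(x0)
reference = math.sqrt(a) * wavelet_ab(mexican_hat, a, a * b)(x0)
```

Here `base_norm` and `dilated_norm` agree to within the accuracy of the quadrature, and `swapped` agrees with `reference`.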
2.2 Dilation and Translation
Let $f(x)$ be any real valued function defined on the real line $\mathbf{R}$. A new function $f_{ab}(x)$ defined by
$$f_{ab}(x) = f\!\left(\frac{x - b}{a}\right)$$
where $a \ne 0$, is the version of $f$ scaled (or dilated) by the factor $a$, then shifted (translated) by $b$. For example, the graph of $f_{2,0}(x)$ is the graph of $f(x)$ horizontally stretched by a factor of 2, and the graph of $f_{1,2}(x)$ is the graph of $f(x)$ shifted horizontally to the right by 2. Here the familiar sine function is used to show the scaling and shifting. Let $f(t) = \sin t$ be a function of the independent variable $t$ (say, time); the graphs of $f(\pi t) = \sin \pi t$, $f(2\pi t) = \sin 2\pi t$, and $f(3\pi t) = \sin 3\pi t$ are shown in Figure 2.1.
A function $f$ is called periodic if there is a positive number $T$ with the property that $f(t + T) = f(t)$, $\forall t \in \mathbf{R}$. The smallest positive number $T$ with this property is called the period of $f$. For example, $\sin t$ is of period $2\pi$ and $\sin 2\pi t$ is of period 1. In general, if $f(t)$ is of period $T$ then the dilation $f_a(t) = f(t/a)$ is of period $aT$ (for $a > 0$). Obviously, the period is invariant under translation. For example, $\sin(t/a)$ is of period $2a\pi$.
Figure 2.1: $\sin(\pi x)$, $\sin(2\pi x)$, and $\sin(3\pi x)$.
2.3 Inner Product
The inner product of two unit vectors in the $n$-dimensional Euclidean space $\mathbf{R}^n$ is the cosine of the angle between them. If $u$ and $v$ are two unit vectors in $\mathbf{R}^n$ and $\theta$ is the angle between $u$ and $v$,
$$\langle u, v \rangle = \cos \theta$$
The angle measured in radians can be viewed as the geodesic distance between $u$ and $v$ on the standard unit sphere $S^{n-1}$ in $\mathbf{R}^n$. The closer $u$ and $v$ are, the closer the inner product is to its maximum value 1. Let $\psi$ be a smooth and well localized function, that is, $\psi$ has derivatives of every order and there exist $N$, $C$, and $\gamma$ such that for $\alpha < N$ ...
$$\ldots = \sum_{j,k} c_{jk} \psi_{jk}(x)$$
where
$$c_{jk} = \int f(x) \psi_{jk}(x) \, dx$$
2.4 Haar Wavelet
Let $\phi$ be the characteristic function of the interval $[0, 1)$, ... The function $\phi_k$ defined above will now be denoted by $\phi_{j,k}(x)$ ...
... $\{\phi(x - k)\}_{k \in \mathbf{Z}}$ is an orthonormal basis for $V_0$, where $\phi$ is called the scaling function. The first condition means that any function in $L^2(\mathbf{R})$ can be approximated by functions in $V_j$ as $j$ increases. The third condition means the subspaces are related by a scaling law
$$g(x) \in V_j \iff g(2^{-j} x) \in V_0$$
The fourth condition means that each subspace is spanned by translates of a single function and the translates are orthogonal. It also follows that $\{2^{j/2} \phi(2^j x - k)\}_{k \in \mathbf{Z}}$ forms an orthonormal basis for $V_j$. Since $\phi \in V_0 \subset V_1$, we have
$$\phi(x) = \sum_k a_k \phi(2x - k)$$
... $\sum_k a_k = 2$ and
$$\sum_k a_k a_{2n+k} = 2\delta(n, 0)$$
of which
$$\sum_k |a_k|^2 = 2$$
is a special case. For the associated wavelet function given by
$$\psi(x) = \sum_k (-1)^k a_{1-k} \phi(2x - k)$$
the set $\{\psi(x - k)\}_k$ is an orthonormal basis for the orthogonal complement $W_0$ of $V_0$ in $V_1$. The family $\{\psi_{j,k} = 2^{j/2} \psi(2^j x - k)\}_{j,k \in \mathbf{Z}}$ ...
Figure 2.4: Original function.
Figure 2.5: Original function and its projection.
Figure 2.6: Original function and its projection to finer scale.
2.6 Continuous Wavelet Transform
Given a function $g$ of time $t$, we consider the dilation (scaling) of $g$ by $a$
$$g_a(t) = g(t/a)$$
and the translation (shifting) of $g$ by $b$
$$g^b(t) = g(t - b)$$
If $g$ is dilated first and then translated, it becomes
$$g_a^b(t) = g((t - b)/a)$$
If it is translated first and then dilated, it becomes
$$g(t/a - b)$$
For example, $\sin(2(t - \pi))$ is a dilation (a compression, in fact) by $1/2$ followed by a shift by $\pi$ of $\sin t$.
Figure 2.7: $\sin(\pi x)$, $\sin(2\pi x)$, and $\sin(3\pi x)$.
Let $\psi$ be a wavelet function. Recall that a wavelet is a function in $L^2(\mathbf{R})$ which satisfies certain conditions. A wavelet expansion uses the translated and dilated versions of one fixed wavelet function $\psi$. We use the notation
$$\psi_{ab}(t) = \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t - b}{a}\right)$$
where $a \in \mathbf{R}^+$ and $b \in \mathbf{R}$. In the continuous wavelet transform, the dilation parameter $a$ and the translation parameter $b$ vary in the continuum $\mathbf{R}$. Let $f$ be a function of time $t$. The continuous wavelet transform CWT maps $f$ into a function of scale $a$ and time $b$ given by
$$cwt(f)(a, b) = \langle f, \psi_{ab} \rangle = \int f(t) \, \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t - b}{a}\right) dt$$
Since the inner product is preserved by the Fourier transform (Parseval's theorem), we can also write
$$cwt(f)(a, b) = \langle \hat{f}, \hat{\psi}_{ab} \rangle$$
where, writing $\xi$ for the frequency variable,
$$\hat{\psi}_{ab}(\xi) = \sqrt{a} \, e^{-2\pi i \xi b} \, \hat{\psi}(a\xi)$$
The continuous wavelet transform is invertible,
$$f(t) = \frac{1}{C} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} cwt(f)(a, b) \, \psi_{ab}(t) / a^2 \, da \, db$$
where $C$ is the constant from the admissibility condition
$$C = \int_{-\infty}^{\infty} \frac{|\hat{\psi}(\xi)|^2}{|\xi|} \, d\xi$$
2.7 Discrete Wavelet Transform
Like the DFT (discrete Fourier transform), the DWT (discrete wavelet transform) is a linear transform that maps a $2^n$-dimensional vector (a vector in a $2^n$-dimensional Euclidean space) into a vector in the same space. The DWT is an orthogonal transformation (invertible, and the inverse transform is the transpose of the original transform). An orthogonal transform may be viewed as a rotation (or a rotation followed by a reflection) of the vector space. Such transforms preserve the length of the vector.
In the CWT, a wavelet $\psi$ is dilated and translated by any real values $a$ and $b$. In the DWT, however, a wavelet $\psi$ is dilated and translated by only discrete values. Often we use dilation by powers of 2 (called dyadic), $\psi(2^k t + l)$, where $k, l$ are integers. The DWT of $f$ is a function of scale $2^k$ and time $l$ given by
$$dwt(f)(2^k, l) = \int f(t) \, \psi(2^k t + l) \, dt$$
Orthogonal wavelets are discrete wavelets which lend themselves to very fast algorithms to compute the DWT.
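As a concrete instance of an orthogonal DWT on a $2^n$-dimensional vector, the sketch below implements the orthonormal Haar transform with repeated averaging and differencing. Using the Haar wavelet here is an assumption made for simplicity, and the function names and sample data are illustrative; the point is only to show that length is preserved and that the transform is invertible.

```python
import math

def haar_dwt(x):
    """Orthonormal Haar transform of a list whose length is a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = [float(v) for v in x]
    s = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Scaled pairwise sums (coarse part) and differences (detail part).
        avg = [(out[2 * i] + out[2 * i + 1]) / s for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / s for i in range(half)]
        out[:length] = avg + dif
        length = half
    return out

def haar_idwt(c):
    """Inverse of haar_dwt: undo the levels from coarsest to finest."""
    n = len(c)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = [float(v) for v in c]
    s = math.sqrt(2.0)
    length = 2
    while length <= n:
        half = length // 2
        rec = []
        for a, d in zip(out[:half], out[half:length]):
            rec.extend([(a + d) / s, (a - d) / s])
        out[:length] = rec
        length *= 2
    return out

def squared_length(v):
    """Squared Euclidean length of a vector."""
    return sum(t * t for t in v)

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # arbitrary example data
coeffs = haar_dwt(signal)
restored = haar_idwt(coeffs)
```

Each level costs work proportional to the current length, so the whole transform takes on the order of $N$ operations for a vector of length $N$, which is one sense in which such algorithms are fast.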
2.8 Fourier Transform
The Fourier series of a periodic function consists of sine and cosine functions whose frequencies are the integer multiples of the base frequency of the function. Let $f(x)$ be a real valued function with period $T$. We may take the period to be $T = 1$ by a change of variable, replacing $x$ by $Tx$. For example, $\sin t$ has period $2\pi$ and $\sin 2\pi t$ has period 1. We may assume $f(x)$ is a function defined on the interval $[0, 1]$ since the function value is determined by $f(n + x) = f(x)$ for $n \in \mathbf{Z}$. We can write the Fourier series of $f$ as follows,
$$a_0 + \sum_{k=1}^{\infty} a_k \cos 2\pi k x + b_k \sin 2\pi k x$$
or, using the exponential function $e^{i\theta} = \cos \theta + i \sin \theta$,
$$f(x) = \sum_{k=-\infty}^{\infty} c_k e^{2\pi i k x}$$
where
$$c_k = \int_0^1 f(x) e^{-2\pi i k x} \, dx$$
The coefficients $a_k$, $b_k$, or $c_k$ correspond to the components of $f$ with frequency $k$, that is, those with period $1/k$.
For nonperiodic functions, however, we need to consider coefficients for all possible frequencies $\xi$. If a function $f$ is not periodic but decreases fast enough at infinity, that is, $|f(x)| \to 0$ fast enough as $|x| \to \infty$ so that the following integrals exist, then we have a Fourier transform $\hat{f}$ given by the coefficients
$$a(\xi) = \int_{-\infty}^{\infty} f(x) \cos 2\pi \xi x \, dx, \qquad b(\xi) = \int_{-\infty}^{\infty} f(x) \sin 2\pi \xi x \, dx$$
It can be written as
$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i \xi x} \, dx$$
The transform is invertible, that is, we can compute $f(x)$ from $\hat{f}(\xi)$ by
$$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi) e^{2\pi i \xi x} \, d\xi$$
so that $\hat{\hat{f}}(x) = f(-x)$. The Fourier transform is linear, that is, $(af + bg)^{\wedge} = a\hat{f} + b\hat{g}$. The Fourier transform preserves the $L^2$ norm (Plancherel and Parseval formula),
$$\|\hat{f}\| = \|f\|$$
The Fourier transform may be viewed as a decomposition of a function of time into its frequency components, as a prism decomposes white light into rainbow colors. It maps a function of time $f(t)$ into a new function of frequency $\hat{f}(\xi)$. Consider the Dirac delta function, which is not really a function in the usual sense (it is a distribution) but is defined by
$$\int f(x) \delta(x) \, dx = f(0)$$
It follows that $\hat{\delta}(\xi) = 1$ for all $\xi$ since
$$\int_{-\infty}^{\infty} \delta(x) e^{-2\pi i \xi x} \, dx = e^0$$
This is interpreted to mean that the delta function contains the same amount of all the frequency components. The Fourier transform has the following scaling property: if $f_a$ is a scaled version of $f$, $f_a(x) = f(ax)$ with $a > 0$, then
$$\hat{f}_a(\xi) = (1/a) \hat{f}(\xi / a)$$
This means that if the width of the function is decreased (compressed), then its Fourier transform becomes wider (dilated) and its height shorter. The shifting property of the Fourier transform is: if $g$ is a shifted version of $f$, $g(x) = f(x - b)$, then
$$\hat{g}(\xi) = \hat{f}(\xi) e^{-2\pi i b \xi}$$
The shifting property means that the Fourier transform of a shifted version is the transform of the function multiplied by an exponential factor with a linear phase ($2\pi b \xi$).
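The scaling and shifting properties can be verified numerically for a function whose transform is known in closed form. The sketch below uses the Gaussian $e^{-\pi x^2}$, which is its own Fourier transform under the convention $\hat{f}(\xi) = \int f(x) e^{-2\pi i \xi x} dx$; the integration window, the sample count, and the parameter values are illustrative assumptions.

```python
import cmath
import math

def fourier_transform(f, xi, lo=-10.0, hi=10.0, n=4000):
    """Approximate the integral of f(x) exp(-2 pi i xi x) over [lo, hi].

    Midpoint rule; adequate here because the test functions decay very fast.
    """
    h = (hi - lo) / n
    total = 0j
    for j in range(n):
        x = lo + (j + 0.5) * h
        total += f(x) * cmath.exp(-2j * math.pi * xi * x)
    return total * h

def gaussian(x):
    """exp(-pi x^2), whose transform is exp(-pi xi^2)."""
    return math.exp(-math.pi * x * x)

xi, a, b = 0.7, 2.0, 1.5  # arbitrary frequency, scale, and shift

plain = fourier_transform(gaussian, xi)
scaled = fourier_transform(lambda x: gaussian(a * x), xi)
shifted = fourier_transform(lambda x: gaussian(x - b), xi)

# Values predicted by the closed form and the two properties.
expected_plain = math.exp(-math.pi * xi * xi)
expected_scaled = (1.0 / a) * math.exp(-math.pi * (xi / a) ** 2)
expected_shifted = expected_plain * cmath.exp(-2j * math.pi * b * xi)
```

Compressing the Gaussian by `a = 2` makes its transform twice as wide and half as tall, while the shift changes only the phase, not the magnitude.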
2.9 Discrete Fourier Transform
Let $a_t$, $t \in \mathbf{Z}$, be an infinite sequence of real or complex valued numbers such that
$$\sum_{t=-\infty}^{\infty} |a_t|^2 < \infty$$
Such sequences are called square summable and the set of all square summable sequences is denoted by $l^2$. The DFT (discrete Fourier transform) of $a_t$ is a complex valued function defined by
$$A(f) = \sum_{t=-\infty}^{\infty} a_t e^{-2\pi i f t}$$
where $f \in \mathbf{R}$ is called the frequency. It follows from the definition that $A(f + 1) = A(f)$, that is, $A$ is of period 1 and we need to consider $A$ only on, say, $[-1/2, 1/2]$. If $a_t$ is real valued, then $A(-f) = \overline{A(f)}$ and the absolute values satisfy $|A(-f)| = |A(f)|$. The inverse DFT is defined by
$$a_t = \int_{-1/2}^{1/2} A(f) e^{2\pi i f t} \, df$$
...
9. Find the Fourier transform of ...
... or < a threshold. Figure 3.1 shows an example of a McCulloch-Pitts model of a neuron.
Neural Networks
Figure 3.1: A McCulloch-Pitts neuron model.
Let $x_1, x_2, \ldots, x_n$ be a set of inputs and $w_1, w_2, \ldots, w_n$ denote the weights or synaptic strengths; then
$$Y = f\!\left(\sum_{i=1}^{n} w_i x_i\right) \qquad (3.1)$$
where $f(x) = 1$ if $x > 0$ and $f(x) = 0$ otherwise.
Here $f$ is called an activation function. Another form in which (3.1) can be written is
$$y = f\!\left(\sum_i w_i x_i - \theta\right) \qquad (3.2)$$
where $\theta$ is a threshold. The McCulloch-Pitts model makes many simplifying assumptions and does not appropriately reflect the behavior of a biological neuron. This model has been modified and extended to suit many applications. An important extension is the choice of an activation function. Some examples of activation functions are the sigmoidal function and the Gaussian function. The three most commonly used activation functions are given in Figure 3.2.
Figure 3.2: Example of three activation functions: tanh, $f = \tanh(ax)$; logistic, $f = 1/(1 + \exp(-ax))$; and threshold.
McCulloch and Pitts showed that with proper selection of weights, a synchronous arrangement of neurons can perform universal computations. Figure 3.3 shows step function implementations of AND, OR, and NOT using the McCulloch-Pitts neuron model. In this figure, $t$ is the threshold. For example, in the case of AND with $w_1 = w_2 = 1$ and $t = 1.5$: for $(x_1, x_2) = (0, 0), (0, 1), (1, 0)$ we have $w_1 x_1 + w_2 x_2 < 1.5$ and the output is 0, and for $(x_1, x_2) = (1, 1)$ the sum exceeds 1.5 and the output is 1.
Network Structure
There are many different types of network structures, but the main types are feedforward networks and recurrent networks.
Feedforward networks: Feedforward networks have unidirectional links, usually from input layers to output layers, and there are no cycles or feedback connections. Networks are arranged in layers and there are no links between units in the same layer, no backward links, and no links that skip a layer. A simple example of a three layer feedforward network is shown in Figure 3.4.
Figure 3.3: McCulloch-Pitts model of neurons for the elementary logic gates AND, OR, and NOT.
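The logic gates can be sketched with a threshold neuron in the form of equation (3.2). The AND weights and threshold (1, 1, and 1.5) follow the example in the text; the OR and NOT parameters below are assumptions chosen to give the right truth tables, since the figure itself is not reproduced here, and the function names are illustrative.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts style unit: output 1 if the weighted sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total - threshold > 0 else 0

def AND(x1, x2):
    # Weights and threshold as in the text: w1 = w2 = 1, t = 1.5.
    return mp_neuron((x1, x2), (1, 1), 1.5)

def OR(x1, x2):
    # Assumed parameters: w1 = w2 = 1, t = 0.5.
    return mp_neuron((x1, x2), (1, 1), 0.5)

def NOT(x):
    # Assumed parameters: w = -1, t = -0.5.
    return mp_neuron((x,), (-1,), -0.5)

and_table = [AND(a, b) for a in (0, 1) for b in (0, 1)]
or_table = [OR(a, b) for a in (0, 1) for b in (0, 1)]
not_table = [NOT(a) for a in (0, 1)]
```

Many other weight and threshold choices realize the same gates; only the position of the threshold relative to the possible weighted sums matters.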
Figure 3.4: A three layer feedforward network (Layer 1, Layer 2, Layer 3) with two inputs and one output.
In this example of a feedforward network, $(x_1, x_2)$ represents the input and the weight $w_{ij}^k$ represents the connection from PE $i$ in layer $k$ to PE $j$ in layer $k + 1$; for example, $w_{12}^1$ is the weight from PE 1 in layer 1 to PE 2 in layer 2.
There are no cycles or feedback connections in feedforward networks. The lack of cycles ensures that the computation can proceed smoothly. A feedforward network computes a function of the input that depends on the weight settings; it has no internal state other than the weights.
Recurrent networks: In recurrent networks, links can form arbitrary topologies and there may be arbitrary feedback connections. Recurrent networks can store internal state in the activation levels of the units because the activation may be fed back to the units that initiated it. Because of complex topologies and feedback connections, recurrent networks may take longer to stabilize, and hence learning, in general, is difficult. These networks can become unstable, oscillate, or become chaotic. These networks are characterized by higher order differential equations.
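The computation performed by a layered feedforward network can be sketched as a single pass from the input layer to the output layer. In the sketch below, `weights[k][i][j]` plays the role of the weight from PE $i$ in one layer to PE $j$ in the next (with 0-based indices); the logistic activation, the omission of thresholds, and the particular weight values are all assumptions for illustration.

```python
import math

def logistic(s):
    """Logistic activation, one common choice of smooth activation function."""
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, weights, activation=logistic):
    """Propagate input x through the layers, one weight matrix per layer pair.

    weights[k][i][j] is the weight from unit i in layer k to unit j in layer k + 1.
    """
    activations = list(x)
    for layer in weights:
        n_out = len(layer[0])
        activations = [
            activation(sum(activations[i] * layer[i][j] for i in range(len(layer))))
            for j in range(n_out)
        ]
    return activations

# Hypothetical three layer network: 2 inputs, 2 hidden units, 1 output.
weights = [
    [[1.0, -1.0], [1.0, -1.0]],  # layer 1 -> layer 2
    [[1.0], [1.0]],              # layer 2 -> layer 3
]

output = forward((0.0, 0.0), weights)
output_again = forward((0.0, 0.0), weights)
```

Because there is no feedback, the output depends only on the input and the weights: calling `forward` twice with the same arguments gives the same result.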
3.1.2 Optimal Network Structure
What should be the size of the network? What architecture should be chosen and how many units should there be in the network? How should we decide on the interconnection between the units? These are some of the open questions, and many attempts have been made to come up with a satisfactory answer to them. Here we briefly outline some of them.
To provide good generalization ability to the network, we need to embed as much knowledge of the problem as possible into the network. As an example, the number of components in the input vector can determine the number of input units, and the output classes desired may influence the number of output units.
Our goal is to design optimal architectures that take less time to train, have lower storage requirements, good generalization ability, and faster recognition or classification. Since fewer units translate to fewer connections between units, and hence to less training time and storage, we focus on having as few units as possible while keeping the desired level of recognition or classification accuracy. The desired level can be defined by a cost function that includes performance parameters for the network and the number of units.
There are two major approaches to come up with an optimal network structure: (1) build a larger network and prune the units with less important connections to reduce the network size, and (2) start with a smaller network and increase the number of units and layers until the desired level of performance is achieved. In the first category an important approach is to give each connection $w_{ij}$ a tendency to decay to zero unless the weight is reinforced. [See [37], [81], [82]]. This approach works well in removing unnecessary connections but not units. It is also possible to remove units that are not needed by making weight decay rates higher for units that have smaller outputs. [See [84], [85]]. In the second approach, a network starts small and units are added to achieve the desired level of performance. Marchand [86] proposes an algorithm that starts with a single hidden layer and adds units one by one. Fahlman and Lebiere [83] propose a cascade-correlation algorithm to build a hierarchy of hidden units. Mezard and Nadal [70] propose a tiling algorithm to create multilayer architectures.
3.1.3 Single Layer Perceptron
A single layer perceptron (see Figure 3.5) is used to represent a continuous function. It has a single node, which takes the sum of the inputs and subtracts the threshold. The output is hard limited. Here, $x_1, \ldots, x_n$ are the inputs, $w_1, \ldots, w_n$ are the weights associated with the links, and $y$ is the output
$$y = f\!\left(\sum_{i=1}^{n} w_i x_i - \theta\right)$$
For $n = 2$, using the hard limiter activation function, we get
$$w_1 x_1 + w_2 x_2 - \theta < 0 \Rightarrow y = -1, \qquad w_1 x_1 + w_2 x_2 - \theta > 0 \Rightarrow y = 1$$
The decision boundary is determined by setting $w_1 x_1 + w_2 x_2 - \theta = 0$. Solving for $x_2$ in terms of $x_1$ we get
$$x_2 = -(w_1 / w_2) \, x_1 + \theta / w_2 \qquad (3.3)$$
which is the equation of a straight line (see Figure 3.6). Thus for $n = 2$, a single layer perceptron separates the inputs into two regions separated by a straight line. For higher dimensions ($n > 2$), a perceptron separates the two regions by a hyperplane.
Figure 3.6: Decision surface corresponding to a straight line.
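The straight-line decision boundary for two inputs can be sketched directly from equation (3.3). The weights and threshold below are arbitrary illustrative values, the function names are not from the text, and assigning the output -1 to points lying exactly on the boundary is an assumption, since the text only specifies the two strict inequalities.

```python
def perceptron_output(x1, x2, w1, w2, theta):
    """Hard limited output: 1 above the decision line, -1 otherwise."""
    s = w1 * x1 + w2 * x2 - theta
    return 1 if s > 0 else -1  # boundary points mapped to -1 by assumption

def boundary_x2(x1, w1, w2, theta):
    """Solve w1*x1 + w2*x2 - theta = 0 for x2 (requires w2 != 0)."""
    if w2 == 0:
        raise ValueError("w2 must be nonzero to solve for x2")
    return -(w1 / w2) * x1 + theta / w2

# Hypothetical parameters: the line x2 = -0.5 * x1 + 1.
w1, w2, theta = 1.0, 2.0, 2.0

x1 = 2.0
x2_on_line = boundary_x2(x1, w1, w2, theta)
above = perceptron_output(x1, x2_on_line + 0.1, w1, w2, theta)
below = perceptron_output(x1, x2_on_line - 0.1, w1, w2, theta)
```

With a positive `w2`, points slightly above the line are classified as 1 and points slightly below as -1; a negative `w2` would flip the two sides.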
3.2 Multilayer Perceptrons
Multilayer perceptrons are the most popular class of multilayer feedforward networks, in which each computational unit employs either the thresholding function or the sigmoid function. Multilayer perceptrons can form arbitrarily complex decision boundaries and represent any Boolean function. Figure 3.7 shows the architecture of a multilayer perceptron.
3.2.1
Adaptation Procedure of Rosenblatt
Rosenblatt [71] gave an algorithm to adapt the weights. The weights and thresholds are initialized to small random values. Adaptation involves a gain term, $0 < \eta < 1$, that controls the adaptation rate. The gain must be chosen to satisfy conflicting requirements: fast adaptation to changes in the input, and averaging of past inputs to provide stable weight estimates.

Perceptron Convergence Procedure (Rosenblatt)

The procedure works as follows.
Figure 3.7: A multilayer perceptron with inputs $x_1, \ldots, x_n$, outputs $y_1, \ldots, y_n$, and desired values $d_1, \ldots, d_n$.
1. Initialize the weights $w_i(0)$ ($1 \le i \le N$) and the threshold $\theta$ to small random values. Here $t$ denotes the time (iteration), $y(t)$ is the actual output, and $d(t)$ is the desired output.
2. Starting at $t = 0$, while there is no convergence and there is more input, present the input $x_i(t)$, $i = 1, \ldots, N$, and the desired output $d(t)$. The actual output is given by

$$y(t) = f\left(\sum_{i=1}^{N} w_i(t)\, x_i(t) - \theta\right),$$

where $f$ denotes a step, sign, or sigmoidal nonlinearity. Adjust the weights according to the update rule

$$w_i(t+1) = w_i(t) + \eta\,[d(t) - y(t)]\,x_i(t).$$

Convergence is achieved when the weights do not change.

Perceptrons have the ability to generalize. An $M$-output perceptron can divide the pattern space into $M$ distinct regions. Figure 3.8 shows the decision regions for three different classes. Perceptrons can solve only separable problems. If the classes lie on opposite sides of some hyperplane, then the perceptron algorithm converges. However, if the distributions of the inputs overlap, or the inputs are not separable, then the discriminant functions may overlap and the perceptron cannot classify the input into separate
Figure 3.8: Decision surface for an output that has three different classes.
classes. Thus, a nonlinearly separable problem, such as the XOR problem, cannot be solved by a single perceptron. For example, for the two-input case of XOR, the decision boundary is given by (see equation 3.3):

$$x_2 = -(w_1/w_2)\,x_1 + \theta/w_2$$

As shown in Figure 3.9, there is no straight line that can separate the points (0,1) and (1,0) from (1,1) and (0,0).

Many modifications can be made to perceptrons. A few of them are: (1) replace the hard limiter nonlinearity by a sigmoid nonlinearity, (2) use the mean square error (MSE) between the desired and the actual output, and (3) update the weights backwards all the way to the first layer to minimize the error.
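The perceptron convergence procedure and its failure on XOR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the book's implementation: targets are coded as -1/+1 to match the hard limiter, weights start at zero instead of small random values so that the run is repeatable, and the threshold is adapted as if it were a weight on a constant input of -1 (the text gives the update rule for the weights only).

```python
def step(a):
    """Hard limiter with outputs in {-1, +1}."""
    return 1 if a > 0 else -1


def train_perceptron(samples, eta=0.5, max_epochs=100):
    """Perceptron convergence procedure.

    samples: list of ((x1, ..., xN), d) pairs with d in {-1, +1}.
    Returns (weights, theta, converged).
    """
    n = len(samples[0][0])
    w = [0.0] * n          # assumption: zeros instead of small random values
    theta = 0.0
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:
            y = step(sum(wi * xi for wi, xi in zip(w, x)) - theta)
            if y != d:
                # w_i(t+1) = w_i(t) + eta * [d(t) - y(t)] * x_i(t)
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                # assumption: threshold treated as a weight on a constant input of -1
                theta -= eta * (d - y)
                changed = True
        if not changed:
            return w, theta, True      # no weight changed during a full pass
    return w, theta, False


AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

print(train_perceptron(AND))   # separable: the procedure stops changing the weights
print(train_perceptron(XOR))   # not separable: the epoch limit is reached
```

For XOR every pass over the data must contain at least one misclassified point, so the loop only ends because of the epoch limit.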
3.2.2
Backpropagation Learning
The backpropagation (BP) learning algorithm has made multilayer perceptrons very popular among researchers. There has
Figure 3.9: The XOR problem cannot be solved by a single layer perceptron.
been a lot of recent activity in the research and use of BP algorithms. We briefly trace the history of the development of the BP algorithm here. Bryson and Ho [18] were perhaps the first to introduce the BP learning algorithm, in 1969; Werbos [97] and Parker [62], [66], [67] then presented the BP algorithm independently. In 1985, LeCun [52] presented a closely related approach. In its most basic form, the BP algorithm prescribes a gradient descent based weight update rule to learn a pair of input-output vectors in a feedforward network. The BP algorithm is a gradient descent algorithm that minimizes the MSE (mean square error) between the actual and the desired output of a multilayer feedforward perceptron. An architecture of a three-layer network is shown in Figure 3.10. In this figure, $x_1, \ldots, x_n$ are the inputs, $y_1$ and $y_2$ are the outputs of the network at time instant $t$ (iteration), and $d_1$ and $d_2$ are the desired outputs. The following outline of the BP learning algorithm is based on [72]:
Figure 3.10: An architecture for a neural network that uses the backpropagation update rule.
Step 1: Initialize weights and thresholds to small random numbers.

Step 2: Present the inputs $x_i(t)$, $i = 1, \ldots, N$, and the desired output $d(t)$.

Step 3: Calculate the actual output at the nodes of each layer, using

$$y_j(t) = f\left(\sum_i w_{ij}(t)\, x_i(t) - \theta_j\right)$$

for the $j$th node.

Step 4: Using a recursive algorithm starting from the output nodes and going back, adjust the weights using

$$w_{ij}(t+1) = w_{ij}(t) + \eta\, \delta_j\, x_i(t),$$

where $w_{ij}$ is the weight from the hidden node $i$ (or from an input node) to the node $j$, $x_i(t)$ is either the output of node $i$ or an input, $\eta$ is a gain term, and $\delta_j$ is an error term for the node $j$. If node $j$ is an output node, then

$$\delta_j = y_j (1 - y_j)(d_j - y_j),$$

and if node $j$ is an internal hidden node, then

$$\delta_j = x_j(t)\,(1 - x_j(t)) \sum_k \delta_k\, w_{jk}(t),$$

where $k$ runs over all nodes in the layers above node $j$.

Repeat by going to step 2. The network converges when the weights do not change.
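The error terms above can be exercised on a very small network. The following sketch applies one BP update to a hypothetical 2-2-1 sigmoid network. The starting weights, the single training pair, and the choice to leave the thresholds fixed are assumptions made to keep the example short.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hid, th_hid, w_out, th_out):
    """Outputs of the hidden nodes and of the single output node."""
    h = [sigmoid(sum(w_hid[i][j] * x[i] for i in range(len(x))) - th_hid[j])
         for j in range(len(th_hid))]
    y = sigmoid(sum(w_out[j] * h[j] for j in range(len(h))) - th_out)
    return h, y


def bp_deltas(h, y, d, w_out):
    """Error terms: the output node first, then the hidden nodes below it."""
    delta_out = y * (1.0 - y) * (d - y)
    delta_hid = [h[j] * (1.0 - h[j]) * delta_out * w_out[j] for j in range(len(h))]
    return delta_out, delta_hid


def bp_step(x, d, w_hid, th_hid, w_out, th_out, eta=0.5):
    """One update w_ij <- w_ij + eta * delta_j * x_i (thresholds left fixed for brevity)."""
    h, y = forward(x, w_hid, th_hid, w_out, th_out)
    delta_out, delta_hid = bp_deltas(h, y, d, w_out)
    new_w_out = [w_out[j] + eta * delta_out * h[j] for j in range(len(h))]
    new_w_hid = [[w_hid[i][j] + eta * delta_hid[j] * x[i] for j in range(len(h))]
                 for i in range(len(x))]
    return new_w_hid, new_w_out


# Hypothetical starting weights; w_hid[i][j] links input i to hidden node j.
w_hid = [[0.1, -0.2], [0.3, 0.4]]
th_hid = [0.05, -0.05]
w_out = [0.2, -0.3]
th_out = 0.1
x, d = [1.0, 0.0], 1.0

_, y_before = forward(x, w_hid, th_hid, w_out, th_out)
w_hid2, w_out2 = bp_step(x, d, w_hid, th_hid, w_out, th_out)
_, y_after = forward(x, w_hid2, th_hid, w_out2, th_out)
print((d - y_before) ** 2, (d - y_after) ** 2)   # squared error before and after one step
```

With the error taken as half the squared difference between desired and actual output, each product of an error term and its input is the negative of the corresponding partial derivative, which is why the update is a gradient descent step.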
3.3
Hebbian Learning
3.3.1
Hebb's Rule
A neurophysiologist, Donald Hebb [27], first proposed in 1949 a learning scheme for updating node connections that we now refer to as the Hebbian learning rule. Once a neuron is repeatedly excited by another neuron, the threshold of excitation of the latter is decreased, so the communication between these two neurons is facilitated by repeated excitation.

An important property of this rule is that learning is done locally, that is, the change in the weight connecting the two neurons depends only on the activities of the two neurons connected by it. These ideas can be extended to artificial systems. There are many different mathematical formulations to express Hebb's rule. The following formulation is the most commonly used.

According to Hebb's principle, when there is activity flowing from the $j$th PE to the $i$th PE, the weight $w_{ij}$ will increase (see Figure 3.11). He stated that information can be stored in connections, and he postulated the learning technique that has made fundamental contributions to neural network theory. If we denote the activation of the $j$th neuron (PE) by $x_j$ and the output of the $i$th neuron (PE) by $y_i$, then the change in the connection weight is given by

$$\Delta w_{ij} = \eta\, x_j\, y_i \qquad (3.4)$$
Figure 3.11: Connection between two neurons and the corresponding model. Here $x_j$ is the input to the neuron $j$.
where $\eta$ is a control parameter that regulates the weight change. Note that Hebb's rule is local to the weight and, unlike the other learning rules (perceptron, backpropagation), it involves no desired output; only the input needs to flow through the neural network. Rewriting equation 3.4, we get

$$w(t+1) = w(t) + \eta\, x(t)\, y(t)$$

Here $t$ denotes the iteration number and $\eta$ is the step size. If we have a linear PE, substituting $y = wx$ gives

$$w(t+1) = w(t)\,[1 + \eta\, x^2(t)]$$

Hence, the weight can increase without bound, making Hebbian learning intrinsically unstable, and this rule may produce very large positive or negative weights. In biological systems, natural nonlinearities, such as chemical depletion and limited dynamic range, limit the synaptic efficacy. To control the unlimited growth of weights, Oja [60] and Sanger [104] added normalization to Hebb's rule (see sections 3.3.3 and 3.3.4).
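The unbounded growth is easy to see numerically. The sketch below iterates the plain Hebbian update for a single linear PE; the function name, the starting weight, and the input sequence are illustrative assumptions.

```python
def hebbian_weight_trace(w0, inputs, eta=0.1):
    """Apply w(t+1) = w(t) + eta * x(t) * y(t) with a linear PE, y(t) = w(t) * x(t)."""
    w = w0
    trace = [w]
    for x in inputs:
        y = w * x
        w = w + eta * x * y
        trace.append(w)
    return trace


# Inputs alternate in sign, yet the weight magnitude still grows at every step.
trace = hebbian_weight_trace(w0=0.5, inputs=[1.0, -1.0] * 10)
print(trace[0], trace[-1])
```

Because the factor multiplying the weight is never smaller than one, the sign of the input does not matter: the magnitude of the weight can only grow.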
3.3.2
Hebbian Learning Applied to Multiple Input PEs
Figure 3.12 gives a pictorial representation of Hebbian learning applied to multiple PEs.
Figure 3.12: Hebbian learning applied to multiple PEs.
The output $y$ is given by

$$y = \sum_{i=1}^{n} w_i x_i$$

According to Hebb's rule, the weight vector is adapted as

$$\Delta W = \eta \begin{pmatrix} x_1 y \\ \vdots \\ x_n y \end{pmatrix} \qquad (3.5)$$

The output can be written in vector form as

$$y = W^T X$$

That is, the transpose of the weight vector is multiplied by the input using the inner product.
Figure 3.13: Angle $\theta$ between $X$ and $W$.
$$y = |W|\,|X| \cos\theta$$

Here $\theta$ is the angle between $X$ and $W$ (see Figure 3.13). Assuming normalized inputs and weights, a large $y$ means that $\theta$ is small, that is, $X$ is close to the direction of the weight vector. A small $y$ implies that the input is nearly perpendicular to $W$ ($\cos\theta = 0$, $\theta = 90°$), that is, $X$ and $W$ are far apart. So $|y|$ measures the similarity between the input $X$ and the weight $W$, using the inner product as the similarity measure.

So, during learning, the weights condense information about the data in their own values; hence the weights may also be interpreted as the long-term memory of the network.

Interpretation of Hebb's Rule as Correlation Learning

Recall that for a multiple input PE using Hebb's rule, the weight update is given as

$$\Delta W = \eta \begin{pmatrix} x_1 y \\ x_2 y \\ \vdots \\ x_n y \end{pmatrix} \qquad (3.6)$$

where $y = W^T X = X^T W$, $X^T = (x_1, \ldots, x_n)$, and

$$W = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} \qquad (3.7)$$

So, substituting $y$, we get

$$\Delta W(t) = \eta\, y(t)\, X(t) = \eta\, X(t)\, X^T(t)\, W(t) \qquad (3.8)$$
In batch mode, after iterating over the input data of $N$ patterns, the cumulative weight update is the sum of the products of the input with its transpose, multiplied by the original weight vector. In an online learning system, the weights are changed repeatedly, using a different input sample for each time step $t$. For batch mode, we can rewrite the above equation as

$$\Delta W = \eta \left[\sum_{t=1}^{N} X(t)\, X^T(t)\right] W(0) \qquad (3.9)$$

In this equation, $\sum X(t) X^T(t)$ is a sample approximation of the autocorrelation $R_x = E[X X^T]$. So Hebb's rule updates the weights with a sample estimate of the autocorrelation function:

$$\Delta W = \eta\, R_x\, W(0) \qquad (3.10)$$
Power Interpretation of Hebbian Learning

Given the data set $\{X(t),\ t = 1, \ldots, N\}$, we define the power at the output as

$$P = \frac{1}{N} \sum_{t=1}^{N} y^2(t) = W^T R_x W \qquad (3.11)$$

where

$$R_x = \frac{1}{N} \sum_{t=1}^{N} X(t)\, X^T(t) \qquad (3.12)$$

$P$ can be interpreted as a field in the space of weights. Equation 3.11 is quadratic in the weights and, since $R_x$ is positive definite, it represents an upward-facing parabola (a paraboloid when there is more than one weight; see Figure 3.14) that passes through the origin of the weight space.
Figure 3.14: Power interpretation of Hebbian learning.
If we take the gradient of $P$ with respect to the weights, we get

$$\nabla P = \frac{\partial P}{\partial W} = 2 R_x W \qquad (3.13)$$

which, up to a constant factor, has the same form as equation 3.10. In fact, a sample-by-sample adaptation of equation 3.9 is a stochastic version of this update. We also note that Hebbian learning will diverge unless we use some type of normalization.
3.3.3
Oja's Rule
Oja [60] proposed a normalization of Hebb's rule. We divide the updated weight by the norm of all the connected weights:

$$w_i(t+1) = \frac{w_i(t) + \eta\, y(t)\, x_i(t)}{\sqrt{\sum_i \left(w_i(t) + \eta\, y(t)\, x_i(t)\right)^2}} \qquad (3.14)$$
Oja gave the following approximation to this update rule:

$$w_i(t+1) = w_i(t) + \eta\, y(t)\,[x_i(t) - y(t)\, w_i(t)] = w_i(t)\,[1 - \eta\, y^2(t)] + \eta\, x_i(t)\, y(t)$$

Here, $\tilde{x}_i(t) = x_i(t) - y(t)\, w_i(t)$ is the normalized activity; the subtracted term acts as a "forgetting factor" proportional to the square of the output.
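A short numerical experiment shows the effect of the forgetting term. The sketch below applies Oja's approximate rule to a hypothetical four-point, zero-mean data set whose variance is largest along the direction (1, 1); the data, the starting weights, and the step size are assumptions chosen for illustration.

```python
import math


def oja_train(samples, w, eta=0.01, epochs=500):
    """Oja's approximate rule: w_i <- w_i + eta * y * (x_i - y * w_i), with y = sum_i w_i * x_i."""
    w = list(w)
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


# Toy zero-mean data: large spread along (1, 1), small spread along (1, -1).
samples = [(2.0, 2.0), (-2.0, -2.0), (0.5, -0.5), (-0.5, 0.5)]
w = oja_train(samples, w=[0.3, -0.1])
print(w, math.sqrt(sum(wi * wi for wi in w)))   # the norm settles near 1
```

Unlike the plain Hebbian update, the weight vector stays bounded: it lines up with the direction of largest variance and its norm approaches one.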
3.3.4
Sanger's Rule
Sanger [104] modified Oja's learning rule to implement the deflation method. A brief description of Sanger's rule follows. Suppose the network has $N$ inputs and $M$ outputs ($N > M$), given by

$$y_i(t) = \sum_{j=1}^{N} w_{ij}(t)\, x_j(t), \quad i = 1, \ldots, M$$

Using Sanger's rule, the weights are updated according to

$$\Delta w_{ij}(t) = \eta\, y_i(t) \left[x_j(t) - \sum_{k=1}^{i} w_{kj}(t)\, y_k(t)\right]$$

Note that the weight updates are not local: all previous network outputs are required to compute the update to the weight $w_{ij}$. After the system converges, this method implements the deflation method.
3.3.5
Principal Component Analysis (PCA)
It is generally desirable to replace a large feature set with one that is minimal but sufficient. One way to achieve this reduction of dimensionality is through principal component analysis. PCA, also called the Karhunen-Loève transform, reduces the dimensionality of data by seeking projections that best represent the data in the least squares sense; that is, it reduces the feature vector dimension while retaining most of the information in the features, by constructing a linear transformation matrix. The following steps illustrate how PCA may be performed on a data set.
1. First, the $n$-dimensional mean vector $m$ and the $n \times n$ covariance matrix are computed for the full data set. From this we then compute the eigenvectors and eigenvalues, and sort them in order of decreasing eigenvalue.
2. Choose the $k$ eigenvectors $e_1, e_2, \ldots, e_k$ that have the largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$. A small number $k$ of large eigenvalues implies that $k$ is the inherent dimensionality of the subspace governing the signal, and that the remaining $n - k$ dimensions contain mostly noise.
3. Form a matrix $[A]_{n \times k}$ whose columns are the $k$ eigenvectors. This representation consists of projecting the data onto the $k$-dimensional subspace according to

$$x' = A^T (x - m)$$

A three-layer linear neural network trained as an autoencoder can form such a representation. In Figure 3.15, the network is trained with the desired values equal to the input. After the network stabilizes, the output layer is removed and the middle layer is retained. The output of this layer, shown in a rectangular box in the figure, has much lower dimensionality than the input layer and represents the principal components of the input data.
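The three steps can be sketched for the simplest case of a single retained component. The example below is an illustration under stated assumptions: it uses power iteration (a swapped-in technique, not described in the text) to find the leading eigenvector instead of a full eigendecomposition, and the data set is a toy one lying exactly on a line.

```python
import math


def mean_vector(data):
    n = len(data[0])
    return [sum(row[i] for row in data) / len(data) for i in range(n)]


def covariance_matrix(data, m):
    n = len(m)
    cov = [[0.0] * n for _ in range(n)]
    for row in data:
        c = [row[i] - m[i] for i in range(n)]
        for i in range(n):
            for j in range(n):
                cov[i][j] += c[i] * c[j] / len(data)
    return cov


def leading_eigenvector(cov, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    n = len(cov)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        u = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(ui * ui for ui in u))
        v = [ui / norm for ui in u]
    return v


def project(x, m, e):
    """One-dimensional case of x' = A^T (x - m), where A has the single column e."""
    return sum(e[i] * (x[i] - m[i]) for i in range(len(m)))


data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]   # toy data lying on a line
m = mean_vector(data)
e1 = leading_eigenvector(covariance_matrix(data, m))
print(e1, [project(x, m, e1) for x in data])
```

Power iteration as written fails if the starting vector is orthogonal to the leading eigenvector or if the covariance matrix is zero; a library eigensolver would be the usual choice for real data.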
3.3.6
Anti-Hebbian Learning
The anti-Hebbian learning rule is given by

$$\Delta w_{ij} = -\eta\, x_j\, y_i$$

This rule seeks the minimum of the performance surface. A linear PE trained with anti-Hebbian learning decorrelates the output from the input. A negative weight update minimizes the output variance, so an anti-Hebbian linear network will produce zero output, because the weights seek the directions of data clusters with point
Figure 3.15: An autoencoder: a three-layer network to find principal components. In this MLP network, the input is also used as the desired output; the principal components, shown in the rectangular block, are obtained after the network stabilizes.
projection. That is, the network is performing a gradient descent in the power field.

Convergence

The weight update rule for anti-Hebbian learning is given by

$$w(t+1) = w(t)\,(1 - \eta\, x^2(t))$$

The equation can be rewritten as

$$w(t+1) = (1 - \eta \lambda)\, w(t)$$

where $\lambda$ is the eigenvalue of the autocorrelation function of the input. This equation is stable if $|1 - \eta\lambda| < 1$, that is, if $0 < \eta < 2/\lambda$.

[...] where $x_1$ and $x_2$ are random numbers in $[0, 1]$, and the output is represented by neurons, drawn as solid circles, whose coordinates are the weights. The output neurons retain the neighborhood relations of the input space in the sense that points $(x_1, x_2)$ that form a cluster are represented by adjacent neurons in the output space, which is arranged in a two-dimensional grid to form a vector quantizer. Input vectors are presented sequentially in time and, after enough input vectors have been presented, the weights specify clusters or vector centers. These clusters or vector centers sample the input space such that the point density function of the vector centers approximates the probability density function of the input vectors. This algorithm also organizes the weights so that topologically close nodes are sensitive to physically similar inputs. Output nodes are thus ordered in a natural fashion. Thus, this algorithm forms feature maps of the inputs. A description of this algorithm follows.

Kohonen's algorithm adjusts the weights from common input nodes to $N$ output nodes. Let $x_1, x_2, \ldots, x_n$ be the components of an input vector $x$, which defines a point in $n$-dimensional space. The output units $O_i$ are arranged in an array and are fully connected to the input via weights $w_{ij}$. A competitive learning rule is used to choose a winner unit $i^*$ such that

$$|w_{i^*} - x| \le |w_i - x| \quad \text{for all } i$$

Then Kohonen's rule is given by

$$\Delta w_i = \eta\, h(i, i^*)\,(x - w_i^{\text{old}}) \qquad (3.22)$$
Here $h(i, i^*)$ is the neighborhood function, such that $h(i, i^*) = 1$ if $i = i^*$ but falls off with the distance $|r_i - r_{i^*}|$ between units $i$ and $i^*$ in the output array. The winner and nearby units are updated appreciably more than those farther away. A typical choice for $h(i, i^*)$ is $e^{-|r_i - r_{i^*}|^2 / 2\sigma^2}$, where $\sigma$ is a parameter that is gradually decreased to contract the neighborhood; $\eta$ is decreased to ensure convergence. Kohonen's self-organizing maps can be used for projection of multivariate data, density approximation, and clustering. They have been successfully applied in the areas of speech recognition, image processing, robotics, and process control.

Selection of Parameters

Appropriate selection of parameters is very important for weight stabilization and for producing a topology-preserving mapping from a continuous input space to a discrete output space. The algorithm is divided into two phases. In the first phase, the neighborhood function is selected with the aim of allowing neurons that respond to similar inputs to be brought together. The neighborhood radius may be decreased linearly by $\sigma(t) = \sigma_m (1 - t/m)$, where $m$ denotes the number of initial iterations and $t$ is the time step. The learning rate is generally kept high ($\eta > 0.2$) to allow self-organization of the input space onto the output space.
$\eta$ can also be adjusted linearly by using

$$\eta(t) = \eta_m \left(1 - \frac{t}{N + L}\right) \qquad (3.23)$$
where $\eta_m$ denotes the initial learning rate and $L$ is used to select the final learning rate. In the second phase, the neurons are fine-tuned to adjust to the distribution of the input space. In this phase, the learning rate is kept small ($\eta < 0.05$) and $\sigma$ is chosen to select one neuron and its nearest neighbors in the neighborhood.

Selection of the number of output units determines the resolution of the output map. A large number of output units can give better resolution at the expense of more learning time. The ability of Kohonen's network to transform input relationships into spatial neighborhoods in the output units leads to interesting applications of these networks in many areas.
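A single update of Kohonen's rule with a Gaussian neighborhood can be sketched as follows. The one-dimensional layout of the output units (positions 0, 1, 2, ...), the function name, and the numeric values are assumptions made for the example.

```python
import math


def som_step(weights, x, eta, sigma):
    """One update of Kohonen's rule for units laid out on a line at positions 0, 1, 2, ...

    weights: list of weight vectors, one per output unit.
    Returns (index of the winner, updated weights).
    """
    def dist2(w):
        return sum((wk - xk) ** 2 for wk, xk in zip(w, x))

    winner = min(range(len(weights)), key=lambda i: dist2(weights[i]))
    updated = []
    for i, w in enumerate(weights):
        # Gaussian neighborhood: 1 at the winner, smaller for units farther away in the array
        h = math.exp(-((i - winner) ** 2) / (2.0 * sigma ** 2))
        updated.append([wk + eta * h * (xk - wk) for wk, xk in zip(w, x)])
    return winner, updated


weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner, weights = som_step(weights, x=[0.9, 0.8], eta=0.5, sigma=1.0)
print(winner, weights)
```

In a full training run the step would be repeated over many inputs while both the learning rate and the neighborhood width are decreased, following the two phases described above.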
3.4.3
Grossberg's Instar-Outstar Network
If we switch the roles of inputs with outputs, we get the equation
$$w_{ij}(t+1) = w_{ij}(t) + \eta\, x_j(t)\,(y_i(t) - w_{ij}(t)) \qquad (3.24)$$
A network based on the above equation is called an outstar network. Grossberg proposed a three-layered network consisting of instar and outstar networks (see Figure 3.19). Of the three layers, the first layer normalizes the input, the second layer is the competition layer, and the third layer is an outstar network. In the competitive layer, the winning neuron is the one whose weights are closest to the current input, and the outstar part of the network associates this winning neuron with a desired output (see equation 3.24 above). Grossberg developed this system for real-time operation.
Figure 3.19: An instar-outstar network.
3.4.4
Adaptive Resonance Theory (ART)
Carpenter and Grossberg proposed a self-organizing neural network architecture based on principles derived from biological systems. A basic problem that all autonomous systems that adapt in real time have to solve is called the stability-plasticity dilemma. They have to adapt to new events, or learn new events, and still remain stable in the face of irrelevant events. That is, the system should be capable of retaining old knowledge while continuing to learn new information, that is, to stay plastic.

In the Carpenter/Grossberg net, matching scores are computed using feedforward connections, and the maximum value is enhanced using lateral inhibition among the output nodes. This net is structurally similar to the Hamming net but differs in that feedback connections are provided from the output nodes to the input nodes. This net is completely described using nonlinear differential equations, includes extensive feedback, and has been shown to be stable. Mechanisms are also provided to turn off the output node with the maximum value and to compare exemplars to the input for the threshold test required by the leader algorithm. The leader algorithm selects the first input as the
exemplar for the first cluster. The next input is compared to the first cluster exemplar. It "follows the leader" and is clustered with the first if the distance to the first is less than a threshold. Otherwise it is the exemplar for a new cluster. This process is repeated for all the following inputs. The number of clusters thus grows with time and depends on both the threshold and the distance metric used to compare inputs to cluster exemplars. ART is based on a layer that is a competitive network of neurons operating on the "winner-take-all" principle. ART provides a computational mechanism in the form of an architecture in which top-down expectations fuse with bottom-up information to prevent loss of already learned knowledge and to include new knowledge in a globally self-consistent fashion.
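The leader algorithm itself is simple enough to sketch directly. The version below uses Euclidean distance and assigns an input to the first exemplar that is close enough; both choices are assumptions, since the text leaves the distance metric open and does not say how ties between several matching exemplars are resolved.

```python
import math


def leader_clustering(inputs, threshold):
    """Leader algorithm: join the first exemplar closer than the threshold, else start a new cluster."""
    exemplars = []
    labels = []
    for x in inputs:
        for k, e in enumerate(exemplars):
            if math.dist(x, e) < threshold:
                labels.append(k)
                break
        else:
            exemplars.append(x)          # x becomes the exemplar of a new cluster
            labels.append(len(exemplars) - 1)
    return exemplars, labels


points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (0.1, -0.1), (5.2, 4.9)]
exemplars, labels = leader_clustering(points, threshold=1.0)
print(exemplars)   # [(0.0, 0.0), (5.0, 5.0)]
print(labels)      # [0, 0, 1, 0, 1]
```

Lowering the threshold in this sketch produces more clusters, which mirrors the dependence on the threshold noted above.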
Figure 3.20: Architecture of the Carpenter and Grossberg model.
Learned expectations flow downward. Only partial connections are shown in Figure 3.20. The weights in both directions are adapted by a fusion of downward and upward information. The general architecture of the ART model is shown in Figure 3.21. Let $(x_1, \ldots, x_n)$ represent the input pattern, $u_{ij}$ the bottom-up weights from $F_1$ to $F_2$, and $d_{ij}$ the top-down weights.
Figure 3.21: Architecture of the ART model.
The input pattern $(x_1, \ldots, x_n)$ activates layer $F_1$. Corresponding to this input pattern, the winning node in layer $F_2$ is $y_j$, which receives the largest total signal from $F_1$. $F_2$ then in turn sends its learned expectation to $F_1$, and the top-down expectation and the input pattern are matched. A mismatch turns off the node $y_j$ and the process is repeated without $y_j$. This process is repeated until one of the following three conditions occurs: an output node ($F_2$) with an approximate match to the input pattern is found, a new node with no previously assigned weights is found, or the capacity of the network is exhausted.
An approximate match between the input pattern and the top-down expectation results in a consensus, or fusion, between the top-down expectation and the input, and both the top-down and the bottom-up weights are updated. Carpenter and Grossberg label this state the resonant state and, because learning occurs only in the resonant state, they named this theory the adaptive resonance theory. ART maintains the plasticity required to learn new patterns, while preventing the modification of patterns that have been learned previously. If the full capacity of the network is exhausted, that is, there is no resonance of the input pattern with the expectation of any output node and there is no uncommitted node, then no new learning is allowed. For a new node selected in $F_2$, no top-down adjustment of weights takes place. The algorithm works as follows:

Initialize $d_{ij}(0)$ to 1 and $u_{ij}(0) = 1/(1 + N)$. For a winning neuron $y_2$ in $F_2$, the bottom-up and top-down weights are $(u_{12}, u_{22}, \ldots, u_{n2})$ and $(d_{21}, d_{22}, \ldots, d_{2n})$ respectively.

(1) For a given input $(x_1, \ldots, x_n)$, compute $y_j = \sum_i u_{ij}(t)\, x_i$, where $x_i$ is either 0 or 1.

(2) Select the winning neuron using $y_{j^*} = \max_j \{y_j\}$.

Calculate

$$\theta = \frac{\|D \cdot X\|}{\|X\|} = \frac{\sum_{i=1}^{N} d_{ij}\, x_i}{\sum_{i=1}^{N} x_i}$$

If $\theta > \rho$ {Adapt weights} else { [...]

[...] which are also admissible mother wavelets, with $p$ zero-crossings symmetrically distributed with respect to zero. Due to their noncompact support, they have proven unsuitable for signal processing applications but, in return, the flexibility in the selection of the zero-crossings makes them suitable for the purposes of this work.

A neural network structure with a wavelet as the node activation function is called a wavelet neural network or wavenet. For a feedforward wavenet, the weighted superposition of node outputs is equivalent to the weighted superposition of the functions constituting the frame in (5.19); hence, the family of wavenets acts as a universal approximator. This property is also of interest in recurrent networks, as discussed below.
5.3.2
Recurrent Wavelet Networks
In recurrent neural networks, an appropriate superposition of wavelets in each node may be used to approximate the vector field $F$ defined in (5.1) or (5.3). Nevertheless, certain applications, such as the synthesis of associative memories, are mainly concerned with the asymptotic properties of the network. In particular, the design of systems with prefixed asymptotically stable equilibria may be addressed using an appropriate mother wavelet in each node, which may be enough to place the equilibria at the desired locations. This qualitative approach is presented in the following sections.

Dynamics of a Single Wavelet Neuron

Let us consider the dynamics of the following neuron:
$$\dot{u} = F(u, w) = -u/\tau + \omega\, \psi_p(d(u - t)) + \rho \qquad (5.23)$$
where $\psi_p$ is the wavelet function of order $p$ presented above. The equilibria of system (5.23) are given by the intersections of the weighted, scaled, and translated version of the mother wavelet $\psi_p(x)$ with the line $u/\tau - \rho$. Since $\psi_p(x)$ is an oscillating function, several equilibrium points are expected. Asymptotically stable equilibria require a negative value of the derivative of the vector field. From (5.23) we easily obtain that, at an equilibrium $u^*$, asymptotic stability is guaranteed if the slope of the wavelet at $d(u^* - t)$ is below the value $1/(\omega \tau d)$, provided $\omega d > 0$.

As stated before, the location of the equilibrium points depends on the values of the net parameters $d$, $t$, $\tau$, $\rho$, and $\omega$. Generally speaking, the oscillations of $F$ will be located around the value $u = t$. The offset parameter $\rho$ makes it possible to place the roots of $F$ near this oscillation region. In addition, increasing $d$ (that is, compressing $\psi_p$ along the $u$ axis) moves the equilibria closer to $t$, and vice versa. These considerations are helpful for providing good initial values in the network training.

The right-hand side of (5.23), that is, the dependence of $F(u, w)$ on $u$, with fixed parameter values $\tau = 1.0$, $\omega = 1.0$, $\rho = 0.0$, $d = 1.0$, and $t = 0.0$, is shown in Figure 5.1, where the mother wavelet $\psi_5$ is used. The field has five equilibrium points, symmetrically distributed around zero. Three of them are asymptotically stable (those with negative slope).

The oscillation region may be translated using the parameters $t$ and $\rho$: Figure 5.2 displays the vector field $F$ vs. $u$ with parameter values $\tau = 1.0$, $\omega = 1.0$, $\rho = 3.0$, $d = 1.0$, and $t = 3.0$. Now the zero-crossings are located around $u = 3.0$.

Finally, the effect of the scaling parameter $d$ is shown in Figure 5.3. The parameter values are $\tau = 1.0$, $\omega = 2.0$, $\rho = 0.0$, $d = 0.5$, and $t = 0.0$, with $\omega$ increased in a ratio proportional to $1/d$. The roots are now distributed over a larger region around the origin.
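The equilibrium analysis can be reproduced numerically for any oscillating activation. The sketch below is only an illustration: it substitutes the Mexican hat wavelet for the chapter's $\psi_p$ family (an assumption, so the number and position of the equilibria differ from Figures 5.1 to 5.3), locates the roots of $F$ by scanning for sign changes, and classifies each root by the sign of the slope of $F$. The helper names and the value $\omega = 4$ are hypothetical.

```python
import math


def mexican_hat(x):
    """Stand-in mother wavelet (an assumption; the chapter uses its own family psi_p)."""
    return (1.0 - x * x) * math.exp(-x * x / 2.0)


def field(u, tau=1.0, omega=4.0, d=1.0, t=0.0, rho=0.0):
    """Right-hand side of (5.23): F(u) = -u/tau + omega * psi(d * (u - t)) + rho."""
    return -u / tau + omega * mexican_hat(d * (u - t)) + rho


def equilibria(f, lo=-6.0, hi=6.0, steps=1200):
    """Locate roots of f by scanning for sign changes and refining each one by bisection."""
    roots = []
    h = (hi - lo) / steps
    for k in range(steps):
        a, b = lo + k * h, lo + (k + 1) * h
        if f(a) * f(b) < 0:
            for _ in range(60):
                mid = 0.5 * (a + b)
                if f(a) * f(mid) <= 0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
    return roots


def is_stable(f, u, eps=1e-6):
    """Asymptotically stable when the slope of the field at the equilibrium is negative."""
    return (f(u + eps) - f(u - eps)) / (2.0 * eps) < 0


for u_star in equilibria(field):
    print(round(u_star, 3), "stable" if is_stable(field, u_star) else "unstable")
```

Stable and unstable equilibria alternate along the axis, as expected for a scalar field, and changing `t`, `rho`, or `d` in the call to `field` shifts or rescales the region in which they appear.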
Multidimensional Networks

Analytical studies to characterize the dynamics of strongly nonlinear systems such as wavelet-based multidimensional networks are difficult to carry out. However, due to the nice superposition
Figure 5.1: $F$ vs. $u$ with $d = 1/\omega = 1.0$, $t = \rho = 0.0$.
Figure 5.2: $F$ vs. $u$ with $d = 1/\omega = 1.0$, $t = \rho = 3.0$.
Figure 5.3: $F$ vs. $u$ with $d = 1/\omega = 0.5$, $t = \rho = 0.0$.
properties of wavelets, a large number of equilibria are also to be expected in high-dimensional cases. Let us simply illustrate this point by considering a two-neuron network:

$$\dot{u}_1 = F_1(u, w) = -u_1/\tau_1 + \omega_{11}\, \psi_p(d_1(u_1 - t_1)) + \omega_{12}\, \psi_p(d_2(u_2 - t_2)) + \rho_1$$
$$\dot{u}_2 = F_2(u, w) = -u_2/\tau_2 + \omega_{21}\, \psi_p(d_1(u_1 - t_1)) + \omega_{22}\, \psi_p(d_2(u_2 - t_2)) + \rho_2$$
Let us assume that $\psi_5$ is used, with parameter values such that three asymptotically stable equilibria would exist in the one-dimensional case discussed above. A decoupled system ($\omega_{12} = \omega_{21} = 0$) would have nine asymptotically stable equilibria, regularly distributed in $\mathbb{R}^2$. The use of small nonzero values of $\omega_{12}$ and $\omega_{21}$ may make it possible to place these equilibria at prescribed positions, without the regular distribution associated with decoupled systems. This is illustrated in the following section.
5.4
Numerical Experiments
The considerations above suggest that the use of wavenets might lead to an easy and systematic design of neural systems with a large number of asymptotically stable equilibria. This behavior makes such networks particularly well suited to the design of associative memories. Also, the initial selection of wavelet parameters may be performed using the criteria mentioned above. In this section we attempt to illustrate these ideas through the numerical simulation of training processes in some simple instances of multiple pattern learning problems. The training scheme is based on the regularized version of recurrent backpropagation [154, 169, 172, 173, 174], presented in system (5.14)-(5.15) for single pattern problems. The analogue of system (5.14)-(5.15) for learning $q$ patterns $\{u_1^*, \ldots, u_q^*\}$ is based on the assumption that $q$ systems evolve in parallel, the training process being described by the equations

$$\epsilon\, \dot{u}_1 = F(u_1, w), \quad \ldots, \quad \epsilon\, \dot{u}_q = F(u_q, w) \quad [\ldots]$$

[...] $i = 0$ else [...] (6.1) (6.2)
where $i = (1 + i_a)^{1/4} - 1 = 1.23\%$ is the equivalent quarterly compound rate paid by the alternative investment class.

A relationship between input and output data can be represented by a linear combination of building blocks called basis functions. One possible choice of basis functions is a family of self-similar, finite-duration, oscillatory functions called wavelets [187]. A wavelet, literally meaning little wave, is derived by scaling, translating, and re-energizing a reference function called a mother wavelet. This technique has attractive features such as joint space-frequency localization, robustness against coefficient errors, and efficiency of function representation. When the wavelet expansion of functions is treated from the artificial neural network point of view, WNNs result. The networks used here are alternative types of elliptic basis function neural networks [190] and wavelet networks [191]. Their job is to implement the function in Equation 6.2 using the structure

$$y = \sum_{j} c_j\, \psi_{A_j, b_j}(x) + c_1^{lin} x_1 + \cdots + c_n^{lin} x_n + c_{n+1}^{lin}$$

as shown in Figure 6.2, where $\psi$ is a tapered cosine wavelet, $A_j$ is a symmetric, positive semidefinite squashing matrix of the $j$th wavelet node, and $b_j$ is a translation vector. The subscripted $\psi$ function in the above equation is multivariate, and is induced from the scalar $\psi$ function via the Mahalanobis distance between the input vector and $b_j$. The size of the network and the values of the parameters $A_j$, $b_j$, $c_j$, and $c_i^{lin}$ can be obtained from statistics, Levenberg-Marquardt, and genetic algorithms. Refer to Echauz and Vachtsevanos [188] for further details.
6.3
Comparison Results
The training procedures initially indicated that input variable 1 alone had better predictive power regarding $y_{k+1}$ than in combination
with the others. This variable is nine quarters back from the predicted variable. Furthermore, since the degree of predictabil ity is expected to be low in this problem, there is no point in training the networks for extended periods of time, as they would quickly overfit the training data and thus degrade their performance. From these considerations, WNNs were quickly obtained as twonode networks. For example, the WNN used for the last decision in the test period implements the function yk+l = ^0.03W(34.04(a;^0.09))^0.01W(25.90(a;^0.03))+0.26a;+0.03 where x is input variable 1, The presence of nonlinear nodes in the network accounts for only about 2% improvement over the linear component alone, but slight improvements are precisely what we set out to investigate in this problem. Results are summarized in Figure 6,3, The correct decision rate is the fraction of times the strategy correctly predicted that Uk+1 would be higher than i. The average ROR is the expected value of the effective annual rate of return, ^ 1(Q00 — 1 where FV is the final value of the investment. The Average final value is the expected final dollar value of the investment as reflected by the 13 tracks. The investment value at any point in time is the market value of shares held in the index fund plus the account balance, if any, in the alternative investment class. As a frame of reference, we note that a constantly bearish advisor would have yielded ROR = 5% and a 38,65% correct decision rate, A coin flipper would have been bound to make the correct decision 50% of the times, no m atter what the market did. The firstorder shift model (prediction of yield next quarter equal to yield today) would have had a 51,73% correct decision rate. 
At the other end of intelligent advice, an ideal system that made 100% correct decisions every time would have reaped ROR = 21.27% and final value $69,174.43. Figure 6.4 shows the empirical relationship between yield two years back (horizontal axis) and future yield (vertical axis) for all data involved in the 10-year test period. The map is
Strategy   Correct Decision Rate   Average ROR   Average Final Value
B&H        61.35%                  11.94%        $30,907.54
WNNs       63.27%                  12.07%        $31,348.10
Figure 6.3: Summary of average performance over 13 investment tracks.
obtained by sorting the inputs in ascending order and plotting the corresponding outputs. The bold curvy line represents the input-output relationship learned by the WNN that made the last decision in the test, which made use of all data except the last yield (the one it was required to predict). The dotted horizontal line is the mean value of output yields, equal to 3.52%. This is a prediction that, according to EMH, we cannot improve as a function of any past yield. Since this value is greater than $i$ = 1.23%, the EMH-implied decision is constantly bullish (B&H).
6.4
Conclusions
The simplicity of the WNN models suggests that there is little hidden determinism; the patterns of yields are largely random. Nevertheless, as shown, a small nonlinear relationship does seem to exist and should not be overlooked. This is reflected by the fact that the WNN map deviates slightly from a perfectly horizontal line, and furthermore, the line is not perfectly straight, suggesting that a small amount of nonlinearity is present. This is bad news in the sense that there is not much promise of superior returns, but also good news because the relevant models
Figure 6.4: Deterministic component in the yield relationship. Solid: empirical, bold: WNN, dotted: B&H.
for this application are simple in nature, avoiding many of their training complications. We have verified that B&H is indeed hard to outperform, but learning network strategies such as WNNs are able to at least match this EMH-implied limit. From a deterministic point of view, we can conclude that the use of WNNs may introduce a slight advantage (1 to 1.5% in relative performance) over B&H, thus separating a small amount of order in a sea of disorder. This could have significant implications for individual and institutional investors. The slight difference could have even stronger consequences for traders of stock index futures and options.
6.5
Exercises
1. Describe the algorithms to predict the future financial assets of a car manufacturing company using wavelet neural networks.
2. How realistic are wavelet neural networks as advisors for stock analysis? The discussion should include an example.
Chapter 7 Radial Wavelet Neural Networks
7.1
Introduction
Wavelet neural networks (WNNs) represent a new class of radial basis function (RBF) neural networks and wavelet networks [191, 192, 193, 194, 195]. The hallmark of WNNs is simply the fact that their neurons' nonlinear activation functions are neither completely local as in fuzzy systems and Gaussian RBFs, nor semi-infinitely receptive as in multilayer perceptrons, but represent a middle ground obtained from local oscillatory responses. Being closely related to Kosko's additive fuzzy systems [196] and to fuzzy systems of the Sugeno-Yasukawa type [197], radial WNNs (RWNNs) process uncertainty in much the same way that a fuzzy rule base does, with an added forward inhibition feature, and bear significantly less opacity than classical neural networks. The theory of wavelets can be exploited in understanding the universal approximation properties of RWNNs, and in providing initialization heuristics for fast training. RWNNs offer a good compromise between robust implementations resulting from the redundancy characteristic of nonorthogonal wavelets and neural systems, and efficient functional representations that build on the time-frequency localization property of wavelets. This chapter shows the application of these networks to the problem of antiepileptic drug detection from electroencephalographic signals. Such noninvasive techniques have pragmatic implications in areas of drug monitoring in clinical settings, and drug screening in the pharmaceutical industry.

Footnote: We gratefully acknowledge the contribution of Dr. Javier Echauz and Dr. George Vachtsevanos for this chapter. Permission of the authors was obtained to use this paper.
7.2
Data Description and Preparation
The EEG data used in this study were taken from the partition shown in Figure 7.1.
B = baseline (normal) condition; S1 = 1st drug session; S2 = 2nd drug session; A = verbal memory activated
Figure 7.1: EEG data format.
Each cell represents sixteen 1-second EEG epochs at the Pz/A1+A2 channel, for each of ten subjects, collected under the cognitive/drug condition indicated by the cell's row and column. Each subject had one drug applied on session S1 and a different one on S2. The drug on each session was either Dilantin, Valproate, or phenobarbital. It is generally agreed that not more than 1 second of EEG can be considered stationary [198]. In addition, the long epochs required for the computation of chaotic indicators [199] apply
only to the search for their exact values on mathematically or physically simple (effectively low-dimensional) systems. Since the EEG is a spatiotemporal average of millions of variables, the imposition of this requirement is extremely naive. Our use of chaotic indicators on 1-second epochs was found to be useful in EEG differentiation, where differences and not absolute magnitudes are meaningful [200]. The EEG database is therefore viable in this application, and has 16 epochs x 10 subjects x 3 conditions = 480 distinct epochs.
A raw epoch is not fed directly into the neural network or linear discriminant, as this imposes a large number of parameters attempting to resolve mostly irrelevant details. Numerical features are extracted that squash an epoch of 128 amplitudes to only one or a few numbers. In a previous feasibility study using only linear classifiers on these same EEGs, it was found that the extraction of a single good feature specific to a subject/task (e.g., fractal dimension for drug separation on subject 9) was significantly better than the use of several fixed features (e.g., 5 autoregressive coefficients for all). A good feature is found by searching in a library of several candidate features, and choosing the one with the highest K-factor [201],

$$K = \frac{|\mu_1 - \mu_2|}{\sqrt{\sigma_1^2 + \sigma_2^2}}$$

where $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of the feature under class $i$. The K-factor measures how far apart the means of a candidate feature are under class 1 and class 2, in units of their pooled spreads. The higher $K$, the more that candidate feature promises to discriminate between classes. Multiplied by a factor that depends only on the sample size, it gives the $t$-statistic for equal sample sizes. In general, the autoregressive coefficients and the relative power in the alpha and delta bands tended to provide the highest K-factors. The features found relevant in this study are:
AR1-AR5 = fifth-order autoregressive coefficients
$\sigma$ = signal standard deviation
$P_\delta$, $P_\theta$, $P_\alpha$, $P_\beta$ = relative power in the delta, theta, alpha, and beta frequency bands
$k_m$ = mean frequency index (proportional to mean frequency in Hz)
lac = lag of the first zero-crossing in the autocovariance sequence of the signal
PLE = principal Lyapunov exponent (chaotic indicator)
PI1, PI3 = absolute value of coefficients in fifth-order Pisarenko harmonic decomposition
Dc = correlation dimension (chaotic indicator)
From the database of raw EEG epochs, 30 feature data sets were obtained. Each data set has 32 samples of the best single feature for a given subject/task: half under one condition and half under the other condition (e.g., 16 values of $\sigma$ for subject 2 under baseline plus 16 values under drug session 1). Training a WNN amounts to nonparametric (no assumption about model form, such as a line) statistical regression, so there is a pronounced danger of overspecialization to the data used for training. This, in turn, yields less than optimal performance on different, unseen data sets. A simple solution is to reserve a portion of the available data for monitoring out-of-sample performance. Each of the 30 data sets was partitioned into 3 random subsets:
TRN = training set
VAL = validation set for training with simple cross-validation
TST = test set for final test on unseen data.
All subsets still have half the samples under one condition and half under the other, but their sizes were defined using a 50%-25%-25% composition, as shown in Figure 7.2. Finally, to equalize amplitude scales across features, and feature variability across conditions, the z-scores were computed and stored in each of the 30 TRN, VAL, and TST sets.
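The feature ranking and the class-balanced 50%-25%-25% partition described above can be sketched as follows. The K-factor normalization (difference of means divided by the root of the summed variances) is an assumption about the formula in [201], the feature values are synthetic, and `k_factor` and `stratified_split` are hypothetical helper names.

```python
import numpy as np


def k_factor(a, b):
    """|mean difference| in units of sqrt(var_a + var_b) (assumed K-factor form)."""
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    return abs(a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) + b.var(ddof=1))


def stratified_split(n_per_class=16, fractions=(0.5, 0.25), seed=0):
    """Index sets TRN/VAL/TST, each holding equally many samples per class.

    Samples 0..n_per_class-1 are class 1, the remainder class 2.
    """
    rng = np.random.default_rng(seed)
    n_trn = int(round(fractions[0] * n_per_class))
    n_val = int(round(fractions[1] * n_per_class))
    parts = {"TRN": [], "VAL": [], "TST": []}
    for cls in (0, 1):
        idx = rng.permutation(n_per_class) + cls * n_per_class
        parts["TRN"].extend(idx[:n_trn])
        parts["VAL"].extend(idx[n_trn:n_trn + n_val])
        parts["TST"].extend(idx[n_trn + n_val:])
    return {name: np.sort(np.array(ix)) for name, ix in parts.items()}


# Synthetic stand-in for one feature data set: 16 values per condition.
rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, 16)
drug = rng.normal(1.5, 1.0, 16)

K = k_factor(baseline, drug)
parts = stratified_split()
print(f"K = {K:.2f}", {name: len(ix) for name, ix in parts.items()})
```

Under this normalization and with $n$ samples per class, $K\sqrt{n}$ equals the magnitude of the unpooled two-sample $t$-statistic, which is one way to read the remark about equal sample sizes.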
7.3
Classification Systems
Figure 7.2: Data partition for each subject/task.

With two equiprobable classes, the classification accuracies of the WNNs in this report can be neither worse than 50% (the rate of a coin flipper or a constant-output classifier) nor better than 100%. But how can we tell if the added complexity involved in training these networks is justified? A benchmark measure of performance was established with a minimum distance classifier (MDC), which corresponds to traditional linear discriminant analysis.
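A minimum distance classifier is small enough to sketch in full: it stores one mean vector per class and assigns each sample to the class with the nearest mean in Euclidean distance. The class name and the toy one-dimensional data below are illustrative and not taken from the study.

```python
import numpy as np


class MinimumDistanceClassifier:
    """Assign each sample to the class whose mean feature vector is nearest."""

    def fit(self, X, y):
        X = np.asarray(X, float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean vector per class, stacked row-wise.
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # Distance from every sample to every class mean: (n_samples, n_classes).
        dist = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[dist.argmin(axis=1)]


# Toy single-feature example (one column), two classes.
X_trn = [[-1.0], [-0.8], [1.0], [1.2]]
y_trn = [0, 0, 1, 1]
mdc = MinimumDistanceClassifier().fit(X_trn, y_trn)
labels = mdc.predict([[-0.5], [0.7]])
print(labels)
```

With a single feature, the rule reduces to thresholding at the midpoint between the two class means, which is why it serves as a cheap benchmark for the trained networks.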
7.3.1
Minimum Distance Classifier
The decision rule adopted by an MDC is simply: assign x to the class whose mean feature vector is closest in Euclidean distance to x. Thus, a dichotomous decision is given in one dimension by comparing the distances $|x - \bar{x}_i|$ to the class means [...]

[...] $(x_i, x_{i+1}, \ldots, x_{i+d})$ to be an embedding if the $x_i$s are from a noninvertible system. Furthermore, it is also necessary to investigate whether the underlying dynamics can be reconstructed from time series generated by noninvertible systems by means of the time-delay method. However, it is quite possible to make predictions on such time series because they are generated by deterministic systems. So we still build a predictive model such as Eq. 8.2 from the time series generated by the Ushiki map (8.12). We choose embedding dimension 2 and time delay 1. The first 300 values are used to fit the predictive model. Based on this model, we make one-step predictions on the remaining 19,700 values. Results are shown in Fig. 8.5. As we expected, the predictions are made very successfully.
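The bookkeeping behind this experiment (delay vectors of dimension 2 with delay 1, a fit on the first 300 values, one-step targets for the rest) can be sketched as below. The Ushiki map itself is not reproduced here; a logistic-map series is used purely as a stand-in, and `delay_embed` is a hypothetical helper.

```python
import numpy as np


def delay_embed(series, dim, delay=1):
    """Return (X, y): row n of X is (x[n-dim*delay], ..., x[n-delay]), y is x[n]."""
    x = np.asarray(series, float)
    start = dim * delay
    cols = [x[start - k * delay: len(x) - k * delay] for k in range(dim, 0, -1)]
    return np.column_stack(cols), x[start:]


# Stand-in chaotic series of 20,000 values (logistic map, NOT the Ushiki map).
series = np.empty(20_000)
series[0] = 0.3
for n in range(len(series) - 1):
    series[n + 1] = 3.9 * series[n] * (1.0 - series[n])

dim, delay, n_fit_values = 2, 1, 300
X, y = delay_embed(series, dim, delay)

# Targets drawn from the first 300 values are used for fitting ...
n_fit = n_fit_values - dim * delay
X_fit, y_fit = X[:n_fit], y[:n_fit]
# ... and every later value is a one-step prediction target.
X_test, y_test = X[n_fit:], y[n_fit:]
print(X_fit.shape, X_test.shape)
```

Any regression model, a wavelet network included, would then be fitted on `(X_fit, y_fit)` and evaluated row by row on `X_test`, always feeding it true past values rather than its own predictions.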
8.5
Parameter-Varying Systems
In many realistic systems, the parameter values are always changing with time. In some such systems especially, we can clearly see the phenomenon of period-doubling bifurcations in time if the parameter values vary relatively slowly (otherwise, many periodic windows may be bypassed) (see [216]). Such bifurcations are obviously different from the usual period-doubling ones. In the usual case, dynamical behavior may develop sufficiently and all transients are allowed to die out at each parameter value. Here, however, dynamical behavior does not develop sufficiently and transients are not allowed to die out at each parameter value. We consider an Ikeda map [229] with one parameter as a variable:

$$\begin{cases} x_{n+1} = 1 + \mu_n (x_n \cos t - y_n \sin t) \\ y_{n+1} = \mu_n (x_n \sin t + y_n \cos t) \\ \mu_{n+1} = \mu_n + 10^{-4}(1 - 0.5 \sin n) \end{cases} \tag{8.13}$$

where $t = 0.8 - \dfrac{15}{1 + x_n^2 + y_n^2}$. In Eq. 8.13, the parameter variable $\mu_n$ increases monotonically with $n$, and the increasing step $\mu_{n+1} - \mu_n = 10^{-4}(1 - 0.5 \sin n)$ is explicitly related to time $n$. We iterate Eq. 8.13 with initial conditions $x_0 = 0.87$, $y_0 = -0.40$, and $\mu_0 = -0.34$ until $\mu_n$ increases to 0.7 (10,400 iterations). In fact, $\mu_n$ may increase to 1.1 or so, and Eq. 8.13 will diverge to infinity if $\mu_n$ is larger than this value. We record the x-component value of each iteration and show the time series $x_n$, $n = 0, 1, 2, \ldots, 10{,}400$
in Fig. 8.6, where one can clearly see the process of period-doubling bifurcations. In addition, there obviously exist tangent bifurcations as well. For such time series, can we predict the future values $x_n, x_{n+1}, \ldots$ based upon the past records $x_i$, $i < n$? We turn our attention to Eq. 8.13. It can be thought of as a three-dimensional discrete map with an explicit time variable $n$. Since its dynamical behavior is divergent (to infinity) as $n$ increases, no attractor exists in Eq. 8.13. As a result, the time series data $x_n$, $n = 0, 1, 2, \ldots, 10{,}400$ are all from the transients of Eq. 8.13. In this case, Takens' theorem and its extensions are at least in principle invalid. However, predictions are possible as we discussed in connection with the Ushiki map in Section 8.4. In practice, in a relatively short time interval there should exist a function $F$ such that
$$x_n \approx F(x_{n-d}, x_{n-(d-1)}, \ldots, x_{n-1}) \tag{8.14}$$
Based on this formula, we choose $d = 3$ and use the first 200 data points $x_0, x_1, \ldots, x_{199}$ to fit $F$ by using the technique of wavelet network. We make one-step predictions on the next 800 values. The results are shown in Fig. 8.7. We can say that predicted values correspond well with actual values.
On the other hand, instead of doing accurate value prediction, we can predict bifurcation structures from the time series above. Based on the fitted $F$ above, we make five-step predictions on the last 10,200 values. The prediction results are shown in Fig. 8.8. One can clearly see that predicted bifurcation structures are identical to actual ones. In fact, we can make fifteen-step predictions and can obtain the same bifurcation structures as the original. However, the bifurcation structures will become more and more vague (or indistinct) as the step of prediction increases. This is obviously related to the sensitive dependence on initial conditions and, of course, also to the fact that the parameter values are changing with time.
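The time series used in this section can be regenerated with a short script. This is a sketch under assumptions: the constants in the phase term $t$ (0.8 and 15) and the signs of the initial conditions are read from a partly illegible equation, and `ikeda_varying` is a hypothetical function name.

```python
import numpy as np


def ikeda_varying(n_iter=10_400, x0=0.87, y0=-0.40, mu0=-0.34):
    """Iterate the Ikeda map whose parameter mu drifts slowly with time n."""
    x = np.empty(n_iter + 1)
    y = np.empty(n_iter + 1)
    mu = np.empty(n_iter + 1)
    x[0], y[0], mu[0] = x0, y0, mu0
    for n in range(n_iter):
        # Phase term; the constants 0.8 and 15 are assumed from the source.
        t = 0.8 - 15.0 / (1.0 + x[n] ** 2 + y[n] ** 2)
        c, s = np.cos(t), np.sin(t)
        x[n + 1] = 1.0 + mu[n] * (x[n] * c - y[n] * s)
        y[n + 1] = mu[n] * (x[n] * s + y[n] * c)
        mu[n + 1] = mu[n] + 1e-4 * (1.0 - 0.5 * np.sin(n))
    return x, y, mu


x, y, mu = ikeda_varying()
print(f"final mu = {mu[-1]:.4f}, max |x| = {np.abs(x).max():.3f}")
```

Because the map rotates $(x_n, y_n)$ and scales it by $\mu_n$ before adding 1 to the first coordinate, the orbit stays bounded whenever $|\mu_n| < 1$, so the recorded transient remains finite over the whole run. Plotting `x` against `n` should show the drifting bifurcation structure described for Fig. 8.6.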
8.6
Long-Term Prediction
In this section, two examples are given to show that predicting attractors is possible from a scalar time series. We first build a predictive model from remarkably few data points using the technique of wavelet network. Then we predict the attractor by iterating the predictive model many times. The first example is to predict the Lorenz attractor from the scalar time series generated by Eq. 8.1. The second example is to predict the Ikeda attractor from the data sequence generated by the following equation,

$$\begin{cases} x_{n+1} = 1 + \mu (x_n \cos t - y_n \sin t) \\ y_{n+1} = \mu (x_n \sin t + y_n \cos t) \end{cases} \tag{8.15}$$

where $t = 0.8$ [...]
Similar work has been done in [204, 205], but the tested attractors are only generated by one- or two-dimensional maps. In particular, in [204] the tests of long-term predictions are not on a scalar time series but on a vector time series, where all state variables are assumed to be observable. In [205], although the tests are directly on scalar time series (with noise), the considered attractors are only generated by the logistic map and the Hénon map, both of which can be directly represented as one- and two-degree difference equations. Here we test the predictability of the Lorenz attractor generated by a continuous system, Eq. 8.1. We numerically integrate Eq. 8.1 with integration step 0.005, then record 1500 x-component values with sampling time 0.02 after all transients have diminished. These time series data are denoted by $x(n)$, $n = 1, 2, \ldots, 1500$. We use these values to fit a function $F$ such that

$$x(n) = F(x(n - 9), x(n - 6), x(n - 3))$$

We iterate the fitted $F$ 20,000 times with initial condition the last 9 values of the time series, i.e., $x(1492), x(1493), \ldots, x(1500)$. As a result, we obtain a predicted attractor as shown in the lower half of Fig. 8.9, and the actual attractor is shown in the upper
half of the same figure as a comparison, where the horizontal coordinate is $x(n)$ and the vertical coordinate is $x(n + 3)$. One can see that the geometric structures of the predicted attractor are almost identical to, but a little more complicated than, those of the actual. Furthermore, we calculate the correlation dimension of the predicted attractor. Shown in Fig. 8.10 is the plot of $\log(C(m, r))$ vs. $\log(r)$ with embedding dimensions $m$ from 2 to 12. As a comparison, correlation integrals of the actual attractor are shown in Fig. 8.11. Shown in Fig. 8.12 are the numerical estimates of the correlation dimensions of the actual and of the predicted attractor based on Fig. 8.10 and Fig. 8.11. One can see that the correlation dimension of the predicted attractor ($D_2 = 2.15$) is a little larger than that of the actual ($D_2 = 2.03$).
We next turn our attention to predicting the Ikeda attractor, whose geometric structures are more complicated than those of some Ikeda attractors tested by Casdagli [204]. We iterate Eq. 8.15 and record 100 x-component values after all transients have diminished. Based on these 100 values, with embedding dimension $d = 2$ and time delay $\tau = 1$, we fit a predictive model:

$$x(n) = F(x(n - 2), x(n - 1))$$

We predict the Ikeda attractor by iterating the fitted $F$ 20,000 times with initial condition the last two records, i.e., $x(99), x(100)$. Shown in the lower half of Fig. 8.13 is the predicted attractor, which is almost identical to the actual attractor shown in the upper half of the same figure.
From the two examples above, we can see that predicting attractors is possible based only on remarkably few data points. Obviously, more data points are needed to predict attractors generated by continuous systems than by discrete systems. This is because the data points used to fit a predictive model ought to contain several main short-period motions. Thus, for continuous systems, the number of data points needed depends strongly on the sampling time. Of course, how many data points are needed to predict the attractors is still unknown and should be further investigated.
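Long-term prediction differs from one-step prediction only in that the model is fed its own outputs. The loop below is a minimal sketch of that iteration for an arbitrary fitted model `F` and an arbitrary set of lags; `iterate_model` is a hypothetical helper, and the toy linear recurrence merely stands in for a fitted wavelet network.

```python
import numpy as np


def iterate_model(F, history, lags, n_steps):
    """Free-run a fitted model x(n) = F(x(n - lags[0]), x(n - lags[1]), ...).

    history must hold at least max(lags) past values, oldest first.
    """
    lags = tuple(lags)
    buf = [float(v) for v in history]
    if len(buf) < max(lags):
        raise ValueError("history is shorter than the largest lag")
    out = np.empty(n_steps)
    for k in range(n_steps):
        # buf[-lag] is x(n - lag) when the next value to produce is x(n).
        nxt = float(F(*(buf[-lag] for lag in lags)))
        buf.append(nxt)          # predictions become inputs for later steps
        out[k] = nxt
    return out


# Toy check with a known recurrence x(n) = x(n-2) + x(n-1).
toy = iterate_model(lambda a, b: a + b, history=[1.0, 1.0], lags=(2, 1), n_steps=5)
print(toy)
```

For the Lorenz example one would call it with `lags=(9, 6, 3)`, the last 9 recorded values as `history`, and `n_steps=20_000`; for the Ikeda example with `lags=(2, 1)` and the last two records. Plotting successive outputs against each other at the chosen delay gives the predicted attractor.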
8.7
Conclusions
In this chapter, wavelet networks are introduced to predict chaotic time series. The effectiveness of this technique has been confirmed by several numerical tests. First, short-term accurate value predictions are tested for time series from chaotic attractors. Second, unlike usual cases, the time series we test are generated by a dynamical system with one parameter changing with time. Our numerical results show that it is possible to make short-term accurate value predictions and do bifurcation structure predictions from such time series. Finally, based on remarkably few data points, predicting chaotic attractors is tested by iterating predictive models many times.
We do not make comparison tests of wavelet networks with other prediction techniques. Generally, for time series from some more complicated systems, wavelet networks may be superior due to the high resolving power of wavelet analysis. In addition, there always exist some points with bad predictions no matter what techniques are used. Wavelet networks may be able to reduce the number of such points as much as possible by adding some special wavelets at these points. Of course, how to find these points is the first step. We have had some ideas, but more investigations are still needed.
It should be noted that in this chapter, the first step for making any prediction is to build a predictive model from a part of the observations. Then, based on this predictive model, we make predictions on many points. If the predictive model is a good approximation to the true dynamical system, predictions may be made successfully. In fact, obtaining a good approximation to the actual system is much more difficult than making good predictions. In other words, forecasting is easy and modelling is difficult. Thus it is inconvenient to try to build an approximation to the true system and then make predictions.
Of course, if one wants to predict an attractor from a time series, building a good model is necessary. In practical applications, the predictive models should be able to be easily adjusted if new data
points are available. So the idea of making recursive predictions [8] is reasonable. We will make predictions by combining this idea with wavelet networks in another paper.
8.8
Appendix
The basic steps of our algorithm are as follows:
1. Input: scalar time series data; embedding dimension $d$; and time delay $\tau$.
2. Choose a mother wavelet function $\psi$ (see Eq. 8.11). In this chapter, the wavelet function $\psi$ is chosen as the product of the so-called "Mexican hat" functions [18]: $w(x) = (1 - x^2)e^{-x^2/2}$. For example, if the embedding dimension is $d$, we will choose $\psi(x) = w(x_1) w(x_2) \cdots w(x_d)$, where $x_i$, $i = 1, 2, \ldots, d$ are the $d$ components of the reconstructed vectors.
3. Initialization of the network Eq. 8.11. We do it in three steps.
i) Initialize $D_i$ and $t_i$. For the sake of simplicity, we consider only the two-dimensional case, i.e., the embedding dimension $d = 2$. For other dimensions, the analysis is completely the same. Assume that the input time series is $z(1), z(2), \ldots, z(M)$. Let $a = \min\{z(i), 1 \le i \le M\}$, $b = \max\{z(i), 1 \le i \le M\}$. Then set $t_1(i) = (b + a)/2$, $i = 1, 2$; $D_1 = \mathrm{diag}(d_1(1), d_1(2))$ with $d_1(1) = d_1(2) = 2/(b - a)$. The square $[a, b] \times [a, b]$ is divided into four parts by the lines $x = (b + a)/2$ and $y = (b + a)/2$. We count the number of the data points which fall into each subsquare. Let them be $n_1, n_2, n_3$, and $n_4$, respectively. If $n_1$ is greater than a given number $L$, we set $t_2(i)$, $i = 1, 2$ as the coordinates of the center of the corresponding subsquare, and
$D_2 = \mathrm{diag}(d_2(1), d_2(2))$ with $d_2(1) = d_2(2) = 2^2/(b - a)$. Otherwise, we throw away this subsquare and consider another subsquare. Similarly, we can consider other subsquares and will initialize $t_3(i)$, $i = 1, 2$, $D_3 = \mathrm{diag}(d_3(1), d_3(2))$ with $d_3(1) = d_3(2) = 2^2/(b - a)$, until the four subsquares all are considered. So we have initialized $t_2, D_2, \ldots, t_{k_1}, D_{k_1}$ ($k_1 \le 5$). Each subsquare with $t_i$ as its center ($2 \le i \le k_1$) is divided into four parts by the lines $x = t_i(1)$ and $y = t_i(2)$. In each subsquare, we repeat the same procedure, which will initialize $t_{k_1+1}, D_{k_1+1}, t_{k_1+2}, D_{k_1+2}, \ldots, t_{k_2}, D_{k_2}$ ($k_2 \le 1 + 2^2 + 2^4 = 21$). Continue to divide the sub-subsquares and initialize $t_{k_2+1}, D_{k_2+1}, t_{k_2+2}, D_{k_2+2}$, and so on. Generally, we require that the number of times of dividing is less than 3, i.e., the smallest subsquares are sub-subsquares. So we have two constraints: (1) a subsquare is useful if the number of its data points is greater than a given number $L$; (2) the number of divisions is less than 3. By these, we can obtain a finite number $N$, the number of wavelets used (see Eq. 8.11). In all our examples, $N$ is less than 10. In fact, $N$ should not be taken too large, since interference between different wavelets makes the results of the approximation unsatisfactory. We can keep $N$ less than 10 by adjusting the parameter $L$. The larger $L$ is, the smaller $N$ is.
ii) Initialize $R_i$ and $\bar{g}$: For each $i = 1, 2, \ldots, N$, $R_i = I$ (unit matrix), and $\bar{g} = \frac{1}{M} \sum_{k=1}^{M} z(k)$.
iii) Initialize $w_i$. In Eq. 8.11, we have initialized $D_i$, $t_i$, $R_i$, and $\bar{g}$ in the first two steps. We initialize $w_i$, $i = 1, 2, \ldots, N$ by the least squares method, i.e., choose the $w_i$s to minimize the least squares error:

$$E = \sum_{k=d\tau+1}^{M} \left[ z(k) - g\big(z(k - d\tau), z(k - (d-1)\tau), \ldots, z(k - \tau)\big) \right]^2 \tag{8.16}$$

(see Eqs. 8.2 and 8.11).
4. Adjust the parameters of our wavelet network Eq. 8.11 by a learning algorithm (for example, the backpropagation algorithm for neural network learning). The objective function to be minimized is Eq. 8.16. We say the parameters are "good" if the error $E$ is less than a given tolerance value.
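Steps 2 and 3 above can be sketched for the two-dimensional case as follows. Several things here are assumptions rather than the authors' code: the network is taken to have the form $g(x) = \sum_i w_i \psi(D_i(x - t_i)) + \bar{g}$ with $R_i = I$, the dilation at the third level is extrapolated as $2^3/(b-a)$, $\bar{g}$ starts at the series mean, and the data are a logistic-map stand-in. All function names are hypothetical.

```python
import numpy as np


def mexican_hat(u):
    """Scalar Mexican hat w(u) = (1 - u^2) exp(-u^2 / 2)."""
    return (1.0 - u ** 2) * np.exp(-0.5 * u ** 2)


def init_grid(X, a, b, L, max_level=2):
    """Centers t_i and dilations d_i from recursive quartering of [a, b]^2."""
    centers, dilations = [], []

    def visit(lo, hi, level):
        inside = np.all((X >= lo) & (X <= hi), axis=1)
        if level > 0 and inside.sum() <= L:
            return                      # too few points: discard this subsquare
        mid = 0.5 * (lo + hi)
        centers.append(mid)
        dilations.append(2.0 ** (level + 1) / (b - a))
        if level < max_level:
            for right in (False, True):
                for upper in (False, True):
                    pick = np.array([right, upper])
                    visit(np.where(pick, mid, lo), np.where(pick, hi, mid), level + 1)

    visit(np.array([a, a], float), np.array([b, b], float), 0)
    return np.array(centers), np.array(dilations)


def design_matrix(X, centers, dilations):
    """Column i holds psi(d_i (x - t_i)), psi being a product of Mexican hats."""
    U = (X[:, None, :] - centers[None, :, :]) * dilations[None, :, None]
    return np.prod(mexican_hat(U), axis=-1)


# Stand-in data: logistic-map series embedded with d = 2, tau = 1.
z = np.empty(300)
z[0] = 0.3
for n in range(len(z) - 1):
    z[n + 1] = 3.9 * z[n] * (1.0 - z[n])
X, target = np.column_stack([z[:-2], z[1:-1]]), z[2:]

a, b = z.min(), z.max()
centers, dilations = init_grid(X, a, b, L=40)
g_bar = z.mean()                                    # assumed initial value of g-bar
Phi = design_matrix(X, centers, dilations)
w, *_ = np.linalg.lstsq(Phi, target - g_bar, rcond=None)

E_init = np.sum((target - g_bar - Phi @ w) ** 2)    # Eq. 8.16 at the initial weights
E_mean = np.sum((target - g_bar) ** 2)              # constant model, for comparison
print(len(centers), E_init, E_mean)
```

Raising `L` prunes more subsquares and so reduces the number of wavelets, which mirrors the remark that $N$ is controlled through $L$. Step 4 would then refine all parameters by gradient descent on the same error, which this sketch does not attempt.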
8.9
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China. L.Y. Cao is grateful for support from the Laboratory of Scientific and Engineering Computing of the Chinese Academy of Sciences.
8.10
Figures
Figure 8.1: Predictions of time series generated by the Mackey-Glass equation. Diamonds represent the predicted values (x(n, pred.)), which are linked with a dashed line; crosses represent the actual values (x(n, true)), which are linked with a solid line. In this figure, only one point is plotted for each 200 points to show the comparison of predicted values with actual values.
Figure 8.2: Relative errors between predicted and actual values for time series generated by the Mackey-Glass equation.
Figure 8.3: Predictions of time series generated by the Lorenz equation. The interpretations of the symbols used in this figure are the same as in Figure 8.1.
Figure 8.4: Plot of $x_{n+1}$ vs. $x_n$ for $x_n$ generated by the Ushiki map. 20,000 data points are used in this figure.
Figure 8.5: Predictions of time series generated by the Ushiki map. The interpretations of the symbols used in this figure are the same as in Figure 8.1.
Figure 8.6: Time series generated by the Ikeda map, Eq. 8.13.
Figure 8.7: Predictions of time series generated by the Ikeda map, where the first 200 points belong to "fit.dat", represented by diamonds, which are used to fit the predictive model; "true.dat", represented by dots, are the actual values; and "pred.dat", represented by crosses, are the predicted values.
Figure 8.8: Five-step predictions of time series generated by the Ikeda map. The predicted bifurcation structures are identical to the actual ones.
Figure 8.9: Prediction of the Lorenz attractor. The predicted attractor is shown in the lower half of the figure. The actual attractor is shown in the upper half of the figure.
Figure 8.10: Log-log plots of the correlation integrals $C(m, r)$ vs. $r$ for a data set of 20,000 points from the predicted Lorenz attractor, $m$ = 2-12.
Figure 8.11: Log-log plots of the correlation integrals $C(m, r)$ vs. $r$ for a data set of 20,000 points from the actual Lorenz attractor, $m$ = 2-12.
Figure 8.12: Numerical estimates of correlation dimensions $D_2$ of the actual and of the predicted Lorenz attractor based on the results of correlation integrals shown in Figures 8.10-8.11, where the diamonds correspond to the actual attractor and the saturated value is 2.03; the crosses correspond to the predicted attractor and the saturated value is 2.15, $m$ = 2-12.
Figure 8.13: Prediction of the Ikeda attractor. The predicted attractor is shown in the lower half of the figure. The actual attractor is shown in the upper half of the figure.
8.11
Exercises
1. Think of a chaotic time series from the real world or from experiments. How do you judge if a scalar time series is chaotic or not?
2. Can you apply a genetic algorithm to predict chaotic time series? Describe an algorithm for this purpose.
3. Think of a nonlinear dynamical system; let one of its parameters vary over time; then generate a time series from this system with the time-varying parameter; apply wavelet networks to predict this time series.
Chapter 9 Concept Learning
9.1
Background
The classical stochastic approximation methods are shown to yield algorithms to solve several formulations of the concept learning problem defined on the domain $[0, 1]^d$. Under some smoothness conditions on the probability measure functions, simple algorithms to solve some Probably and Approximately Correct (PAC) learning problems are proposed based on wavelet networks. Conditions for the asymptotic convergence of these algorithms as well as conditions on the sizes of the samples required to ensure the error bounds are derived using martingale inequalities. This chapter describes a framework for concept learning using wavelet networks.
9.2
Introduction
The problem of inferring concepts, which are treated as abstract sets, from randomly generated positive and negative examples of the concept has been the focus of considerable research in the area of pattern recognition and machine learning. The paradigm of Probably and Approximately Correct (PAC) concept learning has attracted considerable attention over the past years since the pioneering works of Valiant [264]. This paradigm is also intimately related to the method of empirical risk minimization and its application to the pattern recognition problem extensively studied by Vapnik [265]. Several basic results in PAC learning are existential in nature: informally, one of the main results states that any hypothesis that minimizes the empirical error, based on a sufficiently large sample, will approximate the underlying concept with a high probability. The problem of computing such a hypothesis could be of varying complexity; however, algorithms that can handle significant classes of such computing problems will be of both practical and theoretical interest. In this chapter, we illustrate that classical stochastic approximation algorithms can be used to solve an important class of PAC learning problems, where the hypotheses can be approximated by wavelet networks and the class of underlying probability measures satisfies some mild smoothness properties.
The basic approach of this chapter is based on wavelet networks proposed by Zhang and Benveniste [152] and stochastic approximation algorithms, which have been the focus of extensive research for more than 40 years. The area of stochastic approximation has been well established since the pioneering works of Robbins and Monro [260] in 1951; comprehensive treatments of various works can be found in Wasan [266], Kushner and Clark [252], and Benveniste et al. [233]. The connection, in a concrete sense, between the neural network learning procedures and stochastic approximation algorithms has been pointed out recently by several researchers: Gerencser, Nedeljkovic and Milosavljevic, and Stankovic [262] [269] [277]. For example, White [267] showed that the popular backpropagation algorithm is an implementation of the Robbins-Monro-style algorithm for the problem of minimizing the mean-square error.
In fairly recent developments in the area of artificial neural networks, it has been established that neural networks consisting of a single hidden layer with a finite number of nodes can approximate arbitrary continuous mappings (for example, see Cybenko [240], Funahashi [246], White [267]). More recently, such density property of wavelet networks has been discussed by Zhang and Benveniste [270]; they also propose a learning algorithm and state
that its proof of convergence is an open problem. Rao et al. [259] establish that the PAC learning problem can be solved by using a network of nonpolynomial units by using stochastic approximation algorithms that incrementally update the connection weights of the network. In this chapter we apply the algorithm of Rao et al. [259] to the case of wavelet networks; we also show convergence of the proposed algorithm, under some mild smoothness conditions on measure functions, and derive some finite-sample results.
We consider the concept classes that can be closely approximated by fixed-size wavelet networks, and classes of distributions that satisfy some conditions on the gradient (detailed conditions are given in Section 9.3). We present a basic algorithm that uses the (fixed-size) network representation of a concept, whose weights are incrementally updated in response to its predicted classification of the next example. We estimate bounds on the precision and accuracy of the current estimate as a function of sample size and/or number of iterations. Then we present a variant of this algorithm that uses batches of examples in each update; in some cases better bounds for this algorithm can be obtained compared to the basic algorithm.
The organization of this chapter is as follows. Some preliminary facts on PAC learning, wavelet network approximation, and stochastic approximation algorithms are presented in Section 9.3. The algorithms for PAC learning are presented in Section 9.5.3; the basic and batching algorithms are presented in Sections 9.5 and 9.5.1, respectively.
9.3 Preliminaries

9.3.1 PAC Learning
We are given a set $X = [0, 1]^d$ called the domain, and $C \subseteq 2^X$ and $H \subseteq 2^X$ called the concept class and hypothesis class, respectively; members of $C$ and $H$ are measurable under a distribution $P_X$ on $X$. A concept is any $c \in C$ and a hypothesis is any $h \in H$. For $s \subseteq X$, an indicator function $1_s: X \to \{0, 1\}$ is defined such that for any $x \in X$, $1_s(x) = 1$ $(0)$ if and only if $x \in s$ $(x \notin s)$. A pair $(x, 1_c(x))$ is called an example of $c \in C$, and a set of $m$ such examples is called an $m$-sample of $c$. For $a, b \subseteq X$, we have $a \Delta b = (a \cap \bar{b}) \cup (\bar{a} \cap b)$. Then for any $s \in C \cup H \cup \{c \Delta h \mid c \in C, h \in H\}$ we have the actual measure of $s$ given by $\mu(s) = \int_s dP_X$.

In the framework of concept learning, a finite-sized sample, produced according to an unknown distribution $P_X$ using an unknown concept $c \in C$, is given. The concept class $C$ is said to be learnable [236] if, given a finite sample, a hypothesis $h \in H$, $C \subseteq H$, can be produced such that for any $0 < \epsilon, \delta < 1$, the following condition is satisfied:

$$P[\mu(h \Delta c) < \epsilon] > 1 - \delta \tag{9.1}$$

where $P$ is the product measure on the set of all independently and identically distributed $m$-samples. Note that $\mu(h \Delta c)$ is the probability that a randomly chosen test point $x \in X$ will be classified differently by $h$ and $c$, i.e., $\mu(h \Delta c) = \int_{1_c(x) \neq 1_h(x)} dP_X$. Here $\epsilon$ and $\delta$ are called the precision and confidence parameters, respectively, and the pair $(\epsilon, \delta)$ is called the performance pair. The condition of Equation 9.1 is referred to as the $(\epsilon, \delta)$-condition.
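As a small illustration of the quantity $\mu(h \Delta c)$ in Equation 9.1, the following sketch estimates it by sampling. It is only a sketch under stated assumptions: $P_X$ is taken to be uniform on $[0,1]^d$, and the concept and hypothesis (two half-spaces along the first coordinate) as well as the helper names `indicator` and `estimate_disagreement` are hypothetical choices made for this example.

```python
import random

def indicator(region, x):
    """Return 1 if x lies in the region (given as a predicate on points), else 0."""
    return 1 if region(x) else 0

def estimate_disagreement(c, h, d, m, rng):
    """Monte Carlo estimate of mu(h delta c), assuming P_X is uniform on [0, 1]^d."""
    mismatches = 0
    for _ in range(m):
        x = [rng.random() for _ in range(d)]
        # x lies in the symmetric difference exactly when the indicators differ
        if indicator(c, x) != indicator(h, x):
            mismatches += 1
    return mismatches / m

# Hypothetical concept and hypothesis: half-spaces along the first coordinate.
concept = lambda x: x[0] < 0.5
hypothesis = lambda x: x[0] < 0.6

rng = random.Random(0)
mu_hat = estimate_disagreement(concept, hypothesis, d=2, m=20000, rng=rng)
print(mu_hat)  # the exact value under the uniform distribution is 0.1
```

For this choice the symmetric difference is the strip $0.5 \leq x_1 < 0.6$, so the hypothesis would meet the precision requirement for any $\epsilon > 0.1$.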
9.3.2 Approximation by Wavelet Networks
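Consider the wavelet network of Zhang and Benveniste [270] of the form

$$h(w, x) = \sum_{i=1}^{M} a_i \psi(D_i x - t_i) + \bar{g} \tag{9.2}$$

where $a_i \in \mathbb{R}$, $\psi: \mathbb{R}^d \to \mathbb{R}$ is a wavelet function, $t_i \in \mathbb{R}^d$, $\bar{g} \in \mathbb{R}$, and $D_i$ is a $d \times d$ diagonal matrix with the diagonal entries given by $d_i \in \mathbb{R}^d$. Let $\psi_s: \mathbb{R} \to \mathbb{R}$ be a scalar wavelet in the Morlet-Grossmann sense [250], [242] in that its Fourier transform $\hat{\psi}_s$ satisfies the condition

$$C_{\psi_s} = \int_{\mathbb{R}} \frac{|\hat{\psi}_s(\omega)|^2}{|\omega|} \, d\omega < \infty \tag{9.3}$$

Theorem 1 Let $f$ be an indicator function on $[0, 1]^d$. For any $\epsilon > 0$, for some $M$, there exists

$$g(x) = h(w, x) = \sum_{i=1}^{M} a_i \psi(D_i x - t_i) + \bar{g} \tag{9.4}$$

and a set $D \subseteq [0, 1]^d$ with Lebesgue measure of at least $1 - \epsilon$ and $|g(x) - f(x)| < \epsilon$ for $x \in D$.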
Proof: The sums of the form $g(x)$ are shown to be dense in $L^2(\mathbb{R}^d)$ in [270]. The theorem follows by noting that the indicator functions on $[0, 1]^d$ belong to $L^2(\mathbb{R}^d)$ (see [240] for a detailed proof for networks of sigmoid functions; this proof can be adapted to the present case).
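To make the network form (9.2) concrete, the sketch below evaluates $h(w, x)$ for a given parameter set. Several choices here are assumptions for illustration and are not taken from the text: the scalar wavelet is a Mexican-hat function, the $d$-dimensional wavelet is formed as a product of scalar wavelets over the coordinates, and the names `mexican_hat`, `psi`, `wavelet_network`, and `dil` are hypothetical.

```python
import math

def mexican_hat(u):
    """Scalar Mexican-hat wavelet (an assumed choice of scalar wavelet)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)

def psi(z):
    """Assumed d-dimensional wavelet: product of scalar wavelets over coordinates."""
    out = 1.0
    for u in z:
        out *= mexican_hat(u)
    return out

def wavelet_network(x, a, dil, t, g_bar):
    """Evaluate h(w, x) = sum_i a_i * psi(D_i x - t_i) + g_bar with D_i = diag(dil[i])."""
    total = g_bar
    for a_i, d_i, t_i in zip(a, dil, t):
        # D_i is diagonal, so D_i x - t_i is computed coordinate by coordinate
        z = [d_ik * x_k - t_ik for d_ik, x_k, t_ik in zip(d_i, x, t_i)]
        total += a_i * psi(z)
    return total

# M = 2 wavelets on the unit square (d = 2); all parameter values are made up.
a = [1.0, -0.5]
dil = [[2.0, 2.0], [4.0, 1.0]]
t = [[1.0, 1.0], [2.0, 0.5]]
value = wavelet_network([0.5, 0.5], a, dil, t, g_bar=0.25)
```

Each evaluation loops over $M$ wavelets and $d$ coordinates, so its cost grows as $O(Md)$.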
9.3.3 Stochastic Approximation Algorithms
Some of the simplest forms of the stochastic algorithm take the following form:

$$w_{n+1} = w_n + \gamma_n s_n(w_n, \zeta_n) \tag{9.5}$$

where the real vector $w_n$ is an estimate of the parameter of interest at the $n$th step, $\{\gamma_n\}$ is a sequence of scalars, $\{\zeta_n\}$ is a sequence of random variables, and $s_n(w_n, \zeta_n)$ is a random variable called the update rule. For example, in solving $\min_w f(w)$, where gradient estimates of $f(\cdot)$ involve random error terms, $s_n(\cdot)$ could correspond to the noisy estimate of the gradient. The convergence conditions of this type of algorithm have been extensively studied using a variety of techniques (for example, see Borodin [236], Chen [237], Gerencsér [248], Kushner and Clark [252], Ljung and Söderström [254], and Wasan [266]).

Martingales have been employed extensively in the study of stochastic algorithms (Doob [243], Hall and Heyde [251]), and we provide some important (well-known) results that are subsequently used here (more complete and recent introductions can be found in Billingsley [234], Chen and Luo [238]). Let $\omega_1, \omega_2, \ldots, \omega_n$ be a sequence of random variables on the probability space $(\Omega, F, P)$, and let $F_1, F_2, \ldots, F_n$ be a sequence of $\sigma$-fields in $F$. The sequence $\{\omega_n\}$ is a submartingale if the following conditions hold:

1. $F_n \subseteq F_{n+1}$ and $\omega_n$ is $F_n$-measurable,
2. $E[|\omega_n|] < \infty$,
3. $E[\omega_{n+1} \mid F_n] \geq \omega_n$ with probability one.
Then $\{\omega_n\}$ is a supermartingale if it satisfies parts (1) and (2), and part (3) is changed to $E[\omega_{n+1} \mid F_n] \leq \omega_n$ with probability one. Also, $\{\omega_n\}$ is a martingale if it is both a supermartingale and a submartingale. The following theorem contains simple versions of the well-known Doob's convergence theorems.

Theorem 2 (i) Let $\{\omega_n\}$ be a submartingale. If $K = \sup_n E[|\omega_n|] < \infty$, then $\omega_n \to \omega$ with probability 1, where $\omega$ is a random variable satisfying $E[|\omega|] \leq K$. (ii) Let $\{\omega_n\}$ be a nonnegative supermartingale. Then $\omega_n \to \omega$ with probability 1, where $\omega$ is a random variable.
Lemma 1 (Robbins and Siegmund's Almost Supermartingale Lemma [261]) Let $Z_n$, $b_n$, $c_n$, and $d_n$ be finite nonnegative random variables, each $F_n$-measurable, and satisfy

$$E[Z_{n+1} \mid F_n] \leq (1 + b_n) Z_n + c_n - d_n$$

Then on the set $\{\sum_n b_n < \infty, \sum_n c_n < \infty\}$, we have $\sum_n d_n < \infty$ with probability 1 and $Z_n \to Z < \infty$ with probability 1.

The following results from Polyak [257] will be utilized in the next section.

Lemma 2 [257] For a sequence of real numbers $\{u_n\}$ such that $u_n \geq 0$, $0 \leq \nu_n \leq 1$, $0 \leq \eta_n \leq \eta$, and

$$u_{n+1} \leq (1 - \nu_n) u_n + \nu_n \eta_n \tag{9.6}$$

we have

$$u_n \leq u_0 \prod_{j=0}^{n-1} (1 - \nu_j) + \eta \Big( 1 - \prod_{j=0}^{n-1} (1 - \nu_j) \Big) \tag{9.7}$$

Corollary 1 [257] Under the hypothesis of Lemma 2, if $\eta_n \to 0$ and $\sum_{n=0}^{\infty} \nu_n = \infty$, then $u_n \to 0$.

Lemma 3 [257] Consider a sequence of random variables $\{\omega_n\}$, $\omega_n \geq 0$, $E[\omega_0] < \infty$, and

$$E[\omega_{n+1} \mid \omega_0, \omega_1, \omega_2, \ldots, \omega_n] \leq (1 - \nu_n) \omega_n + \nu_n \eta_n \tag{9.8}$$

where $0 \leq \nu_n \leq 1$, $\eta_n \geq 0$, [...], and $\rho_n \geq \rho > [\ldots]$ with $\rho_n = [\ldots]$. Then for every $\epsilon > 0$, there exists $c$ such that

$$P\Big[ \omega_n \leq (c + \epsilon) \prod_{j=0}^{n-1} (1 - \nu_j) \text{ for all } n \Big] \geq 1 - [\ldots] \tag{9.9}$$

In particular, if $\rho_n \geq \rho > 1$ for all $n$, then $c = E[\omega_0] + [\ldots]$.
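The recursion (9.5) that opens this subsection can be illustrated with a short simulation. This is a minimal sketch under assumptions that are not part of the text: the objective is $f(w) = (w - 3)^2 / 2$ in one dimension, the update rule is the negative of a gradient estimate corrupted by standard Gaussian noise, and the step size is $\gamma_n = 1/(n+1)$. The function name `stochastic_approximation` is hypothetical.

```python
import random

def stochastic_approximation(w0, steps, rng):
    """Iterate w_{n+1} = w_n + gamma_n * s_n(w_n, zeta_n) for f(w) = (w - 3)^2 / 2."""
    w = w0
    for n in range(steps):
        gamma_n = 1.0 / (n + 1)        # assumed step size: sum diverges, sum of squares converges
        zeta_n = rng.gauss(0.0, 1.0)   # random error in the gradient estimate
        s_n = -((w - 3.0) + zeta_n)    # update rule: noisy negative gradient of f at w
        w = w + gamma_n * s_n
    return w

rng = random.Random(1)
w_final = stochastic_approximation(w0=0.0, steps=20000, rng=rng)
print(w_final)  # settles near the minimizer w = 3
```

With this particular step size the iterate equals 3 minus the running average of the noise terms, which is why the noise is averaged out as $n$ grows.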
9.4 Learning Algorithms
We first develop conditions required to implement a stochastic approximation algorithm for the PAC learning problem using the formulations based on the results of Cybenko [240].

Condition 9.1 Let $X$ be the $d$-dimensional unit cube. For a fixed $M$, the set of functions of the form

$$g(x) = h(w, x) = \sum_{i=1}^{M} a_i \psi(D_i x - t_i) + \bar{g} \tag{9.10}$$

approximates the set of indicator functions $\{1_c(\cdot)\}_{c \in C}$ of the concept class $C$ in the sense of Theorem 1; i.e., for each $1_c(\cdot)$ and $\epsilon > 0$, there exists some $h(w, \cdot)$ such that $|1_c(x) - h(w, x)| < \epsilon$ for all $x \in D \subseteq X$ such that the Lebesgue measure of $D$ is at least $1 - \epsilon$. Here $N$ denotes the size of the parameter vector $w$ with components $a_i$, $t_i$, $\bar{g}$, and $d_i$.
9.4.1 Basic Algorithm
The basic structure of the algorithm is as follows:

$$w_{n+1} = w_n + \Gamma_n \big[ \, |h(w_n, x_n) - 1_c(x_n)| \, \big] \tag{9.11}$$

where $w_n, w_{n+1} \in \mathbb{R}^N$, each component of $\Gamma_n \in \mathbb{R}^N$ consists of the scalar $\gamma_n$ (the same in all components of $\Gamma_n$) called the step size, and $(x_n, 1_c(x_n))$ is the $n$th example. Subsequently, this algorithm will be referred to as the basic algorithm. The expression $[\, |h(w_n, x_n) - 1_c(x_n)| \,]$ is the update rule. Notice that each step of this algorithm can be implemented in $O(Md)$ time; hence this algorithm has a time complexity of $O(nMd)$ when terminated after the $n$th iteration.

In the context of neural networks, algorithms similar to 9.11 have been studied by Nedeljkovic [255], Finnoff [245], Darken and Moody [241], and Stankovic et al. [263] (also see Farago and Lugosi [244] for more complex algorithms); finite sample results of such algorithms have been obtained by Rao et al. [259]. More general algorithms similar to 9.11 for wavelet networks have been proposed by Zhang and Benveniste [270]; here, under some conditions, we illustrate both asymptotic convergence and finite sample results of the algorithm. These results follow mostly from the theorems of Rao et al. [259]; we, however, provide the critical steps for completeness.

Let $\mu(w) = \int |h(w, X) - 1_c(X)| \, dP_X$ denote the expected error made by the hypothesis. Here we choose to provide the $(\epsilon, \delta)$-condition for $\mu(w)$, and note that if $h(w, \cdot)$ takes values of 0 or 1 only, then this condition is identical to the $(\epsilon, \delta)$-condition of Equation 9.1. Notice that $\mu(w) \geq 0$, and $\inf_w \mu(w) = 0$ by Condition 9.1. Also $E[\, |h(w_n, x_n) - 1_c(x_n)| \mid w_n \,] = \mu(w_n)$.

The performance of stochastic algorithms of the type described in (9.11) can be characterized in terms of a Lyapunov function $V(w_n)$ as described in Polyak [258]; here, by the specific choice of the update rule, $\mu(w)$ plays the same role as $V(w)$. In order to ensure the convergence of the algorithm (9.11), we require the following conditions on the probability measure generated by $P_X$. Notice that Condition 9.1 applies to $C$ and $H$, and the following Conditions 9.2 and 9.3 apply to the class from which $P_X$ is chosen.

Condition 9.2 Let $\mu(w)$ be differentiable and its gradient satisfy the following Lipschitz condition. For all $u, v \in \mathbb{R}^N$, there exists a positive constant $L$ such that

$$\| \nabla \mu(u) - \nabla \mu(v) \| \leq L \| u - v \| \tag{9.12}$$
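Condition 9.3 There exists a scalar $\theta$ such that for any $w$ and the $N$-dimensional vector $\Psi(w) = (\mu(w), \ldots, \mu(w))^T$, we have

$$\nabla \mu(w)^T \Psi(w) \geq \theta \mu(w) \tag{9.13}$$

where $\nabla \mu(w) = \left( \frac{\partial \mu}{\partial w_1}, \ldots, \frac{\partial \mu}{\partial w_N} \right)^T$.

This condition implies $\nabla \mu(w)^T \mathbf{1} \geq \theta$ everywhere except at $\mu(w) = 0$, where $\mathbf{1}$ is a column vector of all 1s.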
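A single step of the basic algorithm (9.11) can be sketched as follows. The sketch follows the sign and the update rule as printed in (9.11) and makes no claim about convergence, which depends on Conditions 9.2 and 9.3. The one-dimensional input, the Mexican-hat wavelet, the ordering of the entries of $w$, and the names `psi`, `h`, and `basic_update` are assumptions made only for this illustration.

```python
import math

def psi(u):
    """Scalar Mexican-hat wavelet, an assumed stand-in for the wavelet in (9.2)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)

def h(w, x, M):
    """One-dimensional network; assumed layout w = [a_1..a_M, d_1..d_M, t_1..t_M, g_bar]."""
    a, dil, t, g_bar = w[:M], w[M:2 * M], w[2 * M:3 * M], w[3 * M]
    return sum(a[i] * psi(dil[i] * x - t[i]) for i in range(M)) + g_bar

def basic_update(w, x, label, gamma, M):
    """One step of (9.11) as printed: add gamma * |h(w, x) - 1_c(x)| to every component."""
    err = abs(h(w, x, M) - label)
    return [w_k + gamma * err for w_k in w]

M = 2
w0 = [1.0, -0.5, 2.0, 4.0, 1.0, 2.0, 0.25]   # made-up initial weights, N = 3M + 1
x_n, label_n = 0.5, 1                         # example (x_n, 1_c(x_n)) with x_n in c
w1 = basic_update(w0, x_n, label_n, gamma=0.1, M=M)
```

Because every component of $\Gamma_n$ equals $\gamma_n$, the update shifts all $N$ weights by the same amount; the dominant cost of a step is the evaluation of $h(w_n, x_n)$.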
We now state a basic version of our main result. Here the conditions on the sample sizes are expressed as functions of the step size $\gamma_n$, and two specific cases are then illustrated in the following corollary.

Theorem 3 Under Conditions 9.1 to 9.3 we have the following.

(i) (a) Under the condition $\sum_{j=0}^{\infty} \gamma_j^2 < \infty$, the sequence of random variables $\{\mu(w_n)\}$ generated by the basic algorithm converges to a finite random variable $\mu_w$.

(b) Under the conditions $\gamma_n \to 0$ as $n \to \infty$ and $\sum_{j=0}^{\infty} \gamma_j = \infty$, we have $E[\mu(w_n)] \to 0$ as $n \to \infty$.

(c) In addition to the conditions in (b), under the condition $\lim_{n \to \infty} [\ldots] > \gamma > 1$, we have $\mu(w_n) \to 0$ with probability one.

(ii) For $\gamma_n < [\ldots]$ for all $n$, we have [...]

$$P[\mu(w_n) < \epsilon \text{ for all } n] > 1 - \delta \tag{9.15}$$

for a sufficiently large sample of size $n$ given by [...]