Pattern Recognition in Practice IV: Multiple Paradigms, Comparative Studies and Hybrid Systems [1st Edition] 9781483297842

The era of detailed comparisons of the merits of techniques of pattern recognition and artificial intelligence, and of the integration of such techniques into flexible and powerful systems, has begun.


English Pages 586 [558] Year 1994


Table of contents:

Machine Intelligence and Pattern Recognition (Page ii)
Front Matter (Page iii)
Copyright page (Page iv)
Preface (Pages v-viii) - Edzard S. Gelsema, Laveen N. Kanal
Acknowledgements (Pages ix-x) - Edzard S. Gelsema, Laveen N. Kanal
Patterns in the role of knowledge representation (Pages 3-12) - T. Vámos
Application of evidence theory to k-NN pattern classification (Pages 13-24) - Thierry Denœux
Decision trees and domain knowledge in pattern recognition (Pages 25-36) - D.T. Morris, D. Kalles
Object recognition using hidden Markov models (Pages 37-44) - J. Hornegger, H. Niemann, D. Paulus, G. Schlottke
Inference of syntax for point sets (Pages 45-58) - Michael D. Alder
Recognising cubes in images (Pages 59-73) - Robert A. McLaughlin, Michael D. Alder
Syntactic pattern classification of moving objects in a domestic environment (Pages 75-89) - Gek Lim, Michael D. Alder, Christopher J.S. deSilva
Initializing the EM algorithm for use in Gaussian mixture modelling (Pages 91-105) - Patricia McKenzie, Michael Alder
Predicting REM in sleep EEG using a structural approach (Pages 107-117) - Ana L.N. Fred, Agostinho C. Rosa, José M.N. Leitão
Discussions Part I (Pages 119-126)
On the problem of restoring original structure of signals (images) corrupted by noise (Pages 129-140) - Victor L. Brailovsky, Yulia Kempner
Reflectance ratios: An extension of Land's retinex theory (Pages 141-152) - Shree K. Nayar, Ruud M. Bolle
A segmentation algorithm based on AI techniques (Pages 153-164) - C. Di Ruberto, N. Di Ruocco, S. Vitulano
Graph matching by discrete relaxation (Pages 165-176) - Richard Wilson, Edwin R. Hancock
Inexact matching using neural networks (Pages 177-184) - Jiansheng FENG, Michel LAUMY, Michel DHOME
Matching of Curvilinear Structures: Application to the Identification of Cortical Sulci on 3D magnetic resonance brain images (Pages 185-195) - S. Legoupil, H. Fawal, M. Desvignes, P. Allain, M. Revenu, D. Bloyet, J.M. Travere
Knowledge Based Image Analysis of Agricultural Fields in Remotely Sensed Images (Pages 197-211) - Nanno J. Mulder, Fang Luo
A texture classification experiment for SAR radar images (Pages 213-224) - A. Murni, N. Darwis, M. Mastur, D. Hardianto
Discussions Part II (Pages 225-230)
Spatio/temporal causal models (Pages 233-240) - John F. Lemmer
Potentials of Bayesian decision networks for planning under uncertainty (Pages 241-253) - Erica C. van de Stadt
Qualitative recognition using Bayesian reasoning (Pages 255-266) - Jianming Liang, Henrik I. Christensen, Finn V. Jensen
Learning characteristic rules in a target language (Pages 267-278) - Raj Bhatnagar
Discussions Part III (Pages 279-284)
Why do multilayer perceptrons have favorable small sample properties? (Pages 287-298) - Šarūnas Raudys
Using Boltzmann Machines for probability estimation: A general framework for neural network learning (Pages 299-312) - Hilbert J. Kappen
Symbolic approximation of feedforward neural networks (Pages 313-324) - Ishwar K. Sethi, Jae H. Yoo
Analytical approaches to the neural net architecture design (Pages 325-335) - W.J. Christmas, J. Kittler, M. Petrou
An Alternative Feedforward Approach to Neural Classification Problems (Pages 337-345) - R. Tebbs, T. Windeatt
Contribution analysis of multi-layer perceptrons. Estimation of the input sources' importance for the classification (Pages 347-358) - M. Egmont-Petersen, J.L. Talmon, E. Pelikan, F. Vogelsang
Neural networks – advantages and applications (Pages 359-365) - E. Oja
Relative effectiveness of neural networks for image noise suppression (Pages 367-378) - D. Greenhill, E.R. Davies
Discussions Part IV (Pages 379-388)
An experimental comparison of neural classifiers with ‘traditional’ classifiers (Pages 391-402) - W.F. Schmidt, D.F. Levelt, R.P.W. Duin
Comparative study of techniques for large-scale feature selection (Pages 403-413) - F.J. Ferri, P. Pudil, M. Hatef, J. Kittler
Neural nets and classification trees: A comparison in the domain of ECG analysis (Pages 415-423) - Jan L. Talmon, Willem R.M. Dassen, Vincent Karthaus
An empirical study of the performance of heuristic methods for clustering (Pages 425-436) - Subhada K. Mishra, Vijay V. Raghavan
A comparative study of different classifiers for handprinted character recognition (Pages 437-448) - K.M. Mohiuddin, Jianchang Mao
A Comparison of the Randomised Hough Transform and a Genetic Algorithm for Ellipse Extraction (Pages 449-460) - S. Procter, J. Illingworth
Discussions Part V (Pages 461-470)
Relative feature importance: A classifier-independent approach to feature selection (Pages 473-487) - Hilary J. Holz, Murray H. Loew
An intelligent planner for multisensory robot vision (Pages 489-500) - X.Y. Jiang, H. Bunke
Hybrid knowledge bases for real-time robotic reasoning (Pages 501-512) - John Horst, Ernest Kent, Hassan Rifky, V.S. Subrahmanian
Hybrid systems for constraint-based spatial reasoning (Pages 513-524) - Jo Ann Parikh
Detecting novel fault conditions with hidden Markov models and neural networks (Pages 525-536) - Padhraic Smyth
A handwriting recognition system based on multiple AI techniques (Pages 537-550) - P.E. Bramall, C.A. Higgins
A hybrid system to detect hand orientation in stereo images (Pages 551-562) - A. Drees, F. Kummert, E. Littmann, S. Posch, H. Ritter, G. Sagerer
Discussions Part VI (Pages 563-571)
List of Authors (Page 573)
List of Keywords (Pages 575-576)


Machine Intelligence and Pattern Recognition, Volume 16

Series Editors

L.N. KANAL and

A. ROSENFELD University of Maryland College Park, Maryland, U.S.A.

ELSEVIER Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo

Pattern Recognition in Practice IV: Multiple Paradigms, Comparative Studies and Hybrid Systems. Proceedings of an International Workshop held on Vlieland, The Netherlands, 1-3 June 1994


Edited by

Edzard S. GELSEMA Department of Medical Informatics Erasmus University Rotterdam, The Netherlands

Laveen N. KANAL Department of Computer Science University of Maryland College Park, MD, U.S.A.

1994

ELSEVIER Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo

ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands

ISBN: 0 444 81892 8. © 1994 Elsevier Science B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. This book is printed on acid-free paper. Printed in The Netherlands.

PREFACE

The series of conferences "Pattern Recognition in Practice" started in 1980. Subsequently, the conferences in 1985 and in 1988 firmly established these conferences as a well appreciated workshop-type series, quite different in character from the much larger international conferences on pattern recognition (ICPRs). As in the previous conferences, the main aim of PRP-IV was to stimulate exchanges between professionals developing pattern recognition methodologies and those who use pattern recognition techniques in their professional work. As in PRP-III, the scope included artificial intelligence techniques. Because ICPR-11 was organized in The Hague in 1992, PRP-IV did not materialize until 1994. The venue was changed to a more interesting spot in The Netherlands: the island of Vlieland. Judging from the many enthusiastic comments received during and after the event, the conference was a success, not least because of the scientific content. The 42 papers presented have been organized in this book into six parts:

1. Pattern Recognition
2. Signal- and Image Processing
3. Probabilistic Reasoning
4. Neural Networks
5. Comparative Studies
6. Hybrid Systems

Already in Part I, the integration of techniques from artificial intelligence and from pattern recognition becomes apparent. The opening paper by Vámos is not only a philosophical introduction to the theme but also describes large-scale projects in which the representation of knowledge is the central issue. The two following papers describe two non-parametric methods of classification: the k-nearest neighbour (k-NN) rule supplemented with evidence theory, and a decision tree building scheme incorporating domain knowledge such as feature cost and allowing incremental learning. Hornegger et al. compare the use of several types of Hidden Markov Models for 2-D object recognition in computer vision. Extensions to 3-D object recognition are briefly discussed. Part I continues with four papers describing an extension of grammatical inference to grammars for point sets. The process of UpWrite, an iterative sequence of mappings of a set of objects at one level of description to one single object at some higher level, is introduced. The process is derived from properties of assemblies of neurons and might therefore be a model of the functioning of the nervous system. Applications to the recognition of cubes in 2-D images, the classification of moving objects and Gaussian mixture modelling are described. The final paper in Part I deals with syntactic modelling of sleep EEGs. A system aiming at predicting the entrance into REM sleep on the basis of its grammatical structure is described.

The papers in Part II describe various methodological issues and applications of signal and image processing. The opening paper in this part, by Brailovsky, deals with the suppression of noise in signals and images. It is argued that in order to recover the original signal or image, the global structure must be taken into account. Results with piecewise linear signals are presented. The second paper addresses the problem of reflectance ratios. Region reflectance ratios are invariant to illumination and other acquisition parameters. This property is exploited to recognize objects from a single brightness image. The next paper describes an image segmentation algorithm based on a possible model of the human perceptive process and incorporating techniques from artificial intelligence. The segmentation is intended as a preprocessing step, aiming at facilitating human interpretation. The next three papers deal with various forms of graph matching in image processing. Wilson and Hancock describe a procedure using discrete relaxation in which the probability distributions of null matches and erroneous matches are modelled by a uniform and memoryless probability function. The next paper describes structural inexact matching using neural networks as an alternative to tree search methods. Results on an example of a synthetic image are presented. This paper is followed by one dealing with inexact and elastic matching of magnetic resonance brain images with an anatomical atlas for the identification of cortical sulci. The method is validated on the basis of 55 pairs of sulci. The final two papers in Part II both deal with remote sensing images. Mulder et al. describe a system in which knowledge about agricultural fields, including crop development, is represented in models. Costs of misclassification are defined in terms of economic costs for the user of the information. The last paper deals with the application of various filtering techniques for noise suppression while preserving texture information. Results on SAR images in an application of natural resource management are presented.

The papers in Part III discuss various methods of probabilistic reasoning. The first three use the Bayesian network paradigm. The paper by Lemmer deals with the problem of incorporating space and time restrictions into the Bayesian network formalism. Methods for both prediction and inference are presented. First a model is introduced which, however, is computationally intractable. Several assumptions are shown to lead to a model which is computationally feasible. Van de Stadt describes extensions to a Bayesian belief network to provide a framework for the planning of actions under uncertainty. The scheme is dynamic, adapting the planning process to the available information. In the third paper, probabilistic Bayesian reasoning is used in an object recognition system using a set of volumetric primitives called geons. The probabilities are used to focus the attention of the vision system. It is concluded that the method is superior to deterministic reasoning. The last paper, by Bhatnagar, describes a decision tree reasoning scheme in which the primitives in terms of which the target concepts are learnt are pre-specified by the agent performing the partitioning and learning task.

In Part IV, several aspects of neural networks are reviewed.
Whereas the literature in the last decades has abundantly described the construction and application of neural networks in various domains, the time has now come to reflect on the properties of such networks and on the question of the interpretation of weights in a trained network. Part IV opens with a thorough analysis by Raudys, giving at least three explanations for the intriguing properties which neural networks exhibit in the presence of small training sets. In the paper by Kappen, a general framework for neural network learning is presented. In this framework, supervised feed-forward learning, unsupervised learning and clustering may be viewed as special cases. The next four papers all try to counter the common view of neural networks as magic black boxes. The first paper of this quartet, by Sethi and Yoo, describes a backtracking tree search to convert the weights in a neural network into symbolic representations which may be used to understand the knowledge embodied in a trained network. Christmas, Kittler and Petrou present an analysis of object labelling and classification that allows interpretations of the weights, response functions and nodes in multilayer perceptrons to be derived. This may guide the choice of such parameters in the design phase of neural networks. The next paper presents an alternative feedforward scheme to train neural network classifiers. It can improve convergence and generalization properties when applied to binary problems. It uses digital circuit fault test theory to trace sensitive paths from the input to the output layer. The next paper presents a method of feature selection in the multi-layer perceptron framework. The partial derivatives of the outputs with respect to the inputs are used to determine each input attribute's contribution to the classification. The method is illustrated by an example of radiological bone lesion diagnosis. The last two papers mainly concentrate on applications of neural networks. Oja describes the use of various hardware and software implementations of practical production quality in various architectures. In addition, the main advantages and disadvantages of neural networks are summarized. Greenhill and Davies describe an application of neural networks to the problem of noise suppression in digital images. They indicate that the network can be adapted, by training, to various types of noise. This gives the ANN an advantage over more conventional types of filters.

Part V collects various comparative studies, which are becoming increasingly important in a field which can still be described as "a bag of tools for a bag of problems". The availability of numerous well known, standard data sets facilitates such comparisons. The first paper in this part, by Schmidt and Duin, presents a thorough comparison of the performance of neural networks and traditional classifiers, i.e., the nearest mean and the nearest neighbour classifier. It describes a duplication of the NETtalk experiment and discusses a comparative study on three other traditional data sets. The conclusion is that in most cases, the two traditional methods considered are superior. The next paper discusses the relative merits of classical sequential methods and genetic algorithm search. The purpose is to determine whether properties established for such methods in medium-scale experiments extend to problems of much higher dimensions. Talmon et al. describe a comparison of the performance of a neural net classifier with that of classification trees in a medical domain. Also, the knowledge encoded in the various classifiers is compared. Mishra and Raghavan compare a number of clustering algorithms with respect to computation cost and quality of the solution. The results on several data sets of various sizes and dimensionalities are presented.
The problem of comparison is shown to be complicated, due to the fact that most of the techniques require design parameter settings, influencing their behaviour. Mohiuddin and Mao compare four different classifiers for isolated handprinted character recognition, using the NIST database. A hybrid system in which the top three solutions obtained by a neural network are re-evaluated by a nearest template classifier is shown to have superior recognition accuracy. The final paper in Part V, by Procter and Illingworth, describes a comparison of two techniques, i.e. the Randomised Hough Transform and a Genetic Algorithm, applied to the problem of ellipse extraction. The paper aims at forming the basis for making a reasoned choice between the two methods. Differences exist mainly in computation cost, rather than in performance.

The final part of these Proceedings shows the extent to which the integration of pattern recognition and artificial intelligence, alluded to in the subtitle of the Proceedings of "Pattern Recognition in Practice III", has materialized. The first paper on Hybrid Systems, by Holz and Loew, introduces the concept of relative feature importance (RFI) as a measure of discriminative power. The measure is based on the structure of the data. Since directly computing RFI requires exhaustive search, they introduce a hybrid genetic/neural technique to estimate RFI. The following two papers deal with applications in robotics. The first, by Jiang and Bunke, describes a multisensory system intended to support the vision requirements of an intelligent robot. The system comprises a vision planner which may transform a vision request into concrete vision operations or which may develop alternative strategies. The second paper on robotics deals with the development of a multi-level architecture for real-time reasoning in the domain of mobile robots. Data and knowledge exist in the system in multiple representations. The hybrid knowledge base is a suitable representation to allow reasoning across levels in the hierarchy. The paper by Parikh discusses the application of the technique of constraint satisfaction to solve configuration and location problems in computer vision. The constraint problem is transformed into an energy minimization problem which is then solved by a hybrid neural/genetic technique. Smyth describes a hybrid scheme based on discriminative and generative models for the real-time detection of faults in complex dynamic systems. The scheme combines a neural network and a Hidden Markov Model and is applied to the monitoring of antenna pointing systems. A hybrid system for the on-line recognition of cursive script is described by Bramall and Higgins. The system consists of various knowledge sources, which act on the information available on a blackboard, using various artificial intelligence and pattern recognition techniques. The final paper describes a hybrid system combining a neural and a semantic network intended for the recognition of a hand and its 3-D orientation. The system is evaluated on a set of 300 stereo colour images of real world scenes.

The papers in this volume indicate that the era of detailed comparisons of the merits of techniques of pattern recognition and artificial intelligence and of the integration of such techniques into flexible and powerful systems has begun. These papers may give prospective users a feeling for the applicability of the various methods in their field of specialization. The rapid development in the pattern recognition field has motivated the conference chairmen to announce "Pattern Recognition in Practice V", to be held in 1997.

July 1994
Rotterdam, Edzard S. Gelsema
College Park, Laveen N. Kanal
Editors

ACKNOWLEDGEMENTS

This book is a collection of papers presented at the international conference "Pattern Recognition in Practice IV", held in Vlieland, The Netherlands, from June 1 to June 3, 1994. The conference was financially sponsored by the following companies, who are gratefully acknowledged for their support: Hewlett-Packard S.A., Medical Products Group Europe; Océ-van der Grinten N.V.; and IBM Nederland N.V.

The conference was organized at a somewhat remote spot. The logistics of the transportation of the participants from Amsterdam Schiphol airport to Vlieland was coordinated by Guus Beckers and Jifke Veenland, who also helped in maintaining the participant registration files. Teun Timmers coordinated the communications with hotel Seeduyn in Vlieland, where the conference was held. The active support from these three members of the Organizing Committee was much appreciated and is gratefully acknowledged.

The conference program was composed of invited contributions and of reactions to the call for papers. The tasks of speaker invitation and paper selection were carried out by the Program Committee consisting of, besides the editors, Eric Backer, Bob Duin and Anil Jain. Their efforts resulted in an attractive scientific program of high quality content.

During the conference, all discussions were recorded on tape and transcribed in the secretariat. The recording process was expertly carried out by Frits Vogels, assisted by Ben van den Boom. The process of transcription from tape to paper was done by Annett Bosch and Andre Redert and supervised by Guus Beckers, Jifke Veenland and Teun Timmers. Thanks to the dedicated efforts of all these individuals, these Proceedings contain an edited rendition of these discussions as a valuable addition to the original papers.

Many large and small duties which come with running a conference were diligently carried out by the head of the secretariat, Loes de Langen. Her continuous, enthusiastic dedication had an infectious effect on all in and around the secretariat. With her talent for anticipation and improvisation she prevented or solved many small or larger problems, bringing order to the nerve center of the conference.

Finally, as was expressed in the opening address, whereas the Organizing Committee had provided the external conditions for a successful conference, the attendees' contributions and active participation gave the conference its intrinsic scientific value. The draft manuscripts and revised versions were almost always delivered within the prescribed deadlines. We want to thank all participants for their cooperation.

The conference chairmen
Edzard S. Gelsema
Laveen N. Kanal

Pattern Recognition in Practice IV E.S. Gelsema and L.N. Kanal © 1994 Elsevier Science B.V. All rights reserved.

Patterns in the role of knowledge representation

T. Vámos

Computer and Automation Institute, Hungarian Academy of Sciences*

1. INTRODUCTION

Representation of knowledge in the brain is the focus of cognitive science. Several highly acknowledged schools share the view that this representation is close to the concept of patterns. We emphasize the word close, i.e. pattern is used partly in a metaphoric sense; the real processes are the subject of research and speculation. Though relevant achievements and well-proven facts are available, it is still debated whether a final understanding of the mental processes is possible, because of the extreme complexity of the brain and due to the philosophical paradoxes of self-reflection. The evidence is rather clear. The entire phylogenetic evolution of the neural system, even long before it developed the specially dedicated central organ of the brain, had been characterized by some memorizing abilities of coherent effects, relevant for survival, and similarly a storage of coherent responses, partly inherited, partly learnt. These coherent memory items are the archetypes of knowledge patterns, and the consequent responses have a pattern organization, too. All mobile creatures of nature own some motion pattern primitives which are characteristic for the species and have well-defined organizational patterns (muscle and nerve patterns in the higher-level animals) in their biological constitution. In this sense, patterns of life have a rather closed, predetermined character on one hand and are open-ended for variational complexity on the other.

The pattern reality and metaphor are especially vision-related in the human, vision being our richest source for knowledge acquisition. The concept of patterns, labelled by different names, accompanies the whole history of literature, philosophy and psychology, asserting the evidence of the validity of the idea. It was also proved that creative human thinking preserves the way of preverbal periods. Patterns - fragmental impressions - come together in a mystic procedure before any formulation in words and, further, in strong disciplinary rules of scientific reasoning. We could refer to volumes of bibliographies containing only titles on the subject, but for our practical considerations this short allusion should be enough to clarify why we tried to apply the pattern concept to knowledge representation, particularly to the representation of weakly structured or unstructured, i.e. soft, knowledge.

* Sponsoring agency: Hungarian National Science Foundation; research grant number: 2584

2. PATTERN REPRESENTATION

A pattern, in our representation sense, is a set. The set of impressions, related previous knowledge and random associations represents a usually infinite (or at least immense) multitude of information items. For any practical application, this should be delimited to a finite, ordered set. The order contains some relevant information, too: either a well-experienced order of acquisition of knowledge (e.g. the sequence of a medical investigation), or a certain prejudice related to the structure of the knowledge (e.g. the sequence of a prosecutor's questions). Components of these sets are stored in a database in any usual way; the concept of pattern in knowledge representation is close to the objects of object-oriented programming. A frame is a pattern too, but the regular frame representation has a well-defined structure. This is the reason why the subject of soft knowledge requires a less structured representation. The other intelligent forms of representation, semantic nets, have a more specialized structure for a certain kind of scenario. Connectionist representations, on the other end of consideration, omit all kinds of structuring and limit their scope of application to lower levels of complexity. Emphasizing the differences to our methods of representation, we do not forget that all of these have a certain view of pattern type, i.e. they endorse our argumentation by being related. The method used by us is not a panacea for representation but one of the possible solutions for highly complex, unclear knowledge.

The special way of representation used in our system contains three parts: a head, a body and a tail. The head is dedicated to overall information on the pattern. For most applications we have somewhat different forms for pattern standards and individual patterns. A standard is a prototype, a mean representation, like a textbook description of a disease, an average pattern of a certain legal case definition or precedence case, an average pattern of a certain region's economy, etc. Individual is the pattern which should be matched to the nearest standards, mostly depending on some viewpoint. The viewpoint is a special pattern, close to the standard ones, containing enforcing or neglecting factors of certain pattern components. The pattern can be the description of an object, of an action or of a situation. The head contains the identifier of the pattern; in the standard patterns, references to the algorithms by which distances to the standard are computed; in the individuals, the distances themselves. Standard patterns can contain estimations on the completeness of the description. A standard pattern can be used as an individual in a hierarchy and, vice versa, an individual can be taken as a standard for comparison.

The body contains the individual items of the patterns, e.g. data of a medical investigation and intervention, information items for any case of knowledge acquisition, economic data of a country, or steps and data of a certain technology. This means data identifiers; values (intensity, scaling, gradation); and uncertainty estimation related to the individual data or, in the standard case, to the method of data acquisition. The standard pattern attributes relevance values to the data and indicators for further valuation (e.g. ranges of values to be considered as normal or abnormal).

The tail is reserved for further relations.
Here are listed the related patterns, the relations to those (these relations are discussed later), and the estimated or calculated strength of these relations. As can be seen from the above representation form, this contains only very weak information on hypothesized structures; it is closer to a dedicated database format.

3. PATTERN RELATIONS

The relational aspect indicates the particularity of the concept in a more explicit sense. Conceptual abstraction and structures of logic were the most important achievements of human reasoning efforts, and this is still a valid statement in our times. This Aristotelian legacy was continued by recent efforts, extending logic into the nonmonotonic and intensional range, attaching methods and ideas of probability-uncertainty calculations to the structures of logic. Most representation forms of artificial intelligence followed this process: frames are direct implementations of these structures and in this way very general skeletons, but rather inflexible ones; semantic nets are more expressive, but simultaneously more limited to certain scenarios. An opposite effort, connectionism, tried to avoid the prejudice-like structures of symbolic representation, but in spite of several successful applications in pattern recognition type problems, it is limited in the realm of complexity, due to its neglect of the ordering power of structuring abstraction. Our position does not deny the relevance of each method used in proper cases. This standpoint is a more opportunistic, eclectic one. The theoretical background of this pragmatism is the developmental pattern view outlined in the introduction. According to this, the only existing schemes are the pattern-like organizations; abstractions of conceptual thinking, logic etc. are patterns themselves, patterns of pattern relations, pattern dynamics. We call these higher level patterns metapatterns. This way of thinking is not far from the connectionist ideas and contains some elements of the fuzzy view, too. Not by chance! All these representation ideas attempt to circumvent the problem of representation of an open world burdened by unlimited soft knowledge by means of treatable methods for a well-structured, closed world. The metapattern concept is more than a game with words; it opens the door for the mentioned pragmatic, eclectic procedures, taking all methods at their approximative competence, authorized by the knowledge of their abstraction-limited force. The first novelty of this view is its relativity standpoint, instead of the firm beliefs in the subject-independent power of one or the other reasoning method. The second novelty follows from the first: an endeavor to find some metapatterns, i.e. some pattern relations, which can express the usual relations in a more opportunistic way. These should not be too far from the usual reasoning relations and should be oriented to those kinds of knowledge which cannot be treated well by the usual methods that are well developed for better structured, formulated knowledge.

We define four relations: evocation, assessment, (re)organization and transformation. It cannot be proved that this quadruple of relations creates a complete field of relations and not the unity of them. The soft nature of pattern relations excludes the elegance of logic; this is a necessary sacrifice for the sincere treatment of soft knowledge, and a good reason for not abandoning the more beautiful procedures if the subject permits.

Evocation means a relation of simple coherence: a pattern is (usually) accompanied by another one, like two (or more) simultaneous phenomena, or a pattern is (usually) followed by another one (or more). This relation is the soft extension of IF...THEN-type rules and is similarly implemented in our system. The relation evokes all kinds of associations, chaining of situations as they come up in the mind, of course to a limited extent, avoiding complexity problems. The compromise is up to the user; limitation versus liberty of richness in associations is the pragmatic choice.

Assessment is a strengthening or weakening effect of a pattern, influencing the relevance values of the pattern concerned. Evocation is a simple chaining without change of the participant patterns; assessment expresses the influence of an environmental pattern. Any diagnostic or therapeutic pattern can be influenced by the presence of other malfunctions or symptoms; pharmaceuticals should get a different relevance. The same phenomenon is expected in any other pattern-represented case. The valuation of a country's economic performance should consider natural catastrophes, emergency situations, etc. A legal case should be judged with different rigor in different social environments. Design viewpoints change in the same way, too. We find it important to maintain some individual integrity of standard patterns and to treat these opportunistic relations in a properly opportunistic way. The implementation is simple: the relation refers mostly to a vector which serves as a multiplier of the relevance values in the body. Any other, more complex algorithm can be fired by the assessment relation, but in our experience this was never needed.

(Re)organization is a strange relation, because we foresee an organizational procedure in a representation where organization (structuring) has a minor role. This is true; however, some stronger coherencies can be interesting, especially for further man-machine interaction representation. A major output of the system is a visual display of similarities and dissimilarities and their relations. The coherence of consecutive items helps in the recognition of these relations. A nice example in our experience was the convincing similarity of illiteracy relations in Europe a century ago to current GNP/capita and telephone data. The implementation is done by an organization matrix with only 0 and 1 elements.

Transformation is a complete replacement of a pattern by another one as a consequence of a third pattern, the latter remaining unchanged. A legal case pattern should be completely replaced by another one due to relevant new evidence; a medical diagnosis due to relevant changes of symptoms; a design pattern due to a change in market or technological conditions. Implementation is similar to the evocation relation; the reference in the tail is changed by some rule.

If required, any further relations can be defined; the above ones are those which were used and needed in our practice. The listed relations differ in meaning and/or procedure; these differences are the basis of relation selection.
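To make the assessment mechanism concrete, the following minimal Python sketch (not taken from the paper; the component names and multiplier values are hypothetical) shows a relevance-multiplier vector acting on a pattern body, as described above.

```python
# Minimal sketch of the assessment relation: an environmental pattern supplies a
# multiplier vector that strengthens or weakens the relevance values in a pattern body.
def assess(body, multipliers):
    """body: list of (datum_identifier, value, relevance) triples;
    multipliers: one factor per component, as carried by the assessment relation."""
    return [(name, value, relevance * factor)
            for (name, value, relevance), factor in zip(body, multipliers)]

# Hypothetical example: a concurrent symptom weakens one component and strengthens another.
standard_body = [("fever", 38.5, 0.8), ("reflex_delay", 0.2, 0.5)]
assessed = assess(standard_body, [0.5, 1.4])   # relevances become 0.4 and 0.7
```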

4. METRIC

The key issue in pattern representation is metric. The metric is created by distances, stored in the heads of patterns and computed by the algorithms referred to at the same place. Selection of these algorithms is a free choice of the system manager; the system itself is open to any convenient definition. We use two kinds of algorithms.

The first distance algorithm is a combination of the values, relevances and uncertainties of the individual items listed in the pattern body. A linear combination proved to be sufficient in the medical project; in the legal project some nonlinearities were introduced by a regression formula of the values. The other algorithms use methods of code theory. The body values of the pattern, mostly in a standardized form, create a signature-type string. The order of the values is defined in the body; the string contains the values only. Other data (relevance, uncertainty) are neglected at this step. Hamming-type distances are calculated in two directions: the numbers of different and the numbers of identical signature codes. For designating regions of similarity, in most cases a coarse gradation into normal, abnormal and irrelevant values can be helpful. For better representation of coherent and typical irregularities, the signature analysis can detect relevant groups of pattern components. In medical diagnosis, syndromes are those symptom groups which have a typical coherence but whose biological background is not yet certain. Our organization relation is dedicated to this coherence representation, especially designed for human associations via graphics.

The resulting distances should create the decision space of the problem. In several cases this is not a metric space. The calculation of distances is based mostly on estimated values; even precise data should be taken with caution, and they are context dependent. Data of economy are typical examples; GNP data are estimated in several different ways. Cognitive psychology proved that human judgment is usually non-metric: especially, the triangulation condition (the triangle inequality) cannot be used, and relations are not transitive and are nonsymmetric. Our hypothesis connects this phenomenon to the contradictions of the Closed World Assumption and the Open World Reality. According to this theory, if all infinite data and relations were known, the metric of the decision space would be a perfect one. The pragmatic limitation to a closed, finite world neglects many relevant dimensions of this infinite perfection. The situation is like a two-dimensional view of a three-dimensional object, where points at far distances can coincide and short distances can be deformed to much longer ones. The irregularities of the metric are indications of hidden facts, tacit knowledge, and can be used for further search, either for better tuning of distances, i.e. tuning of relevance estimations, or for search of new data. The irregularities of the metric can be detected by checking the clusters which are calculated from the distance values. Clustering by the nearest neighbor algorithm (or by any other statistical method) is a function of the system; the revision of this clustering helps in iterative improvement of standard patterns and metrization. The improvement of metrization acts as a feedback of knowledge acquisition.
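The two kinds of distance algorithms can be sketched as follows. This is an illustrative Python fragment rather than the authors' implementation; the particular weighting of relevance and uncertainty, and the example data, are assumptions.

```python
def weighted_distance(individual, standard):
    """First algorithm: linear combination of value differences, weighted by the
    standard's relevance and discounted by the individual's uncertainty estimates."""
    total = 0.0
    for name, (value, uncertainty) in individual.items():
        std_value, relevance = standard[name]
        total += relevance * (1.0 - uncertainty) * abs(value - std_value)
    return total

def signature_distances(individual_codes, standard_codes):
    """Second algorithm: Hamming-type comparison of signature strings built from
    coarsely graded body values ('normal', 'abnormal', 'irrelevant'), counted in
    two directions: the number of differing and the number of identical codes."""
    differing = sum(a != b for a, b in zip(individual_codes, standard_codes))
    identical = sum(a == b for a, b in zip(individual_codes, standard_codes))
    return differing, identical

# Hypothetical usage
standard = {"fever": (37.0, 0.8), "reflex_delay": (0.0, 0.5)}
individual = {"fever": (38.5, 0.1), "reflex_delay": (0.2, 0.3)}
print(weighted_distance(individual, standard))   # 1.15
print(signature_distances("NAN", "NNN"))         # (1, 2)
```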

5. HUMAN INTERACTION

Pattern representation of knowledge has a natural corollary: supposed to have an analogy to the representation in the mind, it should be best matched to human interaction. Mining for knowledge is especially relevant in cases to be treated by pattern representation, because these are the problems which are mostly burdened by hidden knowledge, unclear relations and a lack of certain structures. Regular knowledge-based systems use human interaction methods for eliciting input knowledge; the output is expected to be a ready-made decision or at least a final recommendation - final, of course, from the system's point of view. In our problems of soft knowledge, human interaction is foreseen both at the input and the output.

Considering the most common human pattern representation, we try to apply as much sophisticated graphics as possible. We found that graphic representations of input and output have different roles in man-machine communication and should therefore preferably be designed in different ways. Input should evoke the standard patterns through a representation by sketches reminding the user of the patterns, thus supporting a fast selection of hypothesized standard patterns. This can be done partly by the system, combining icons (fixed images) and sketches for interaction. This latter point is important for system design: the most cumbersome, hated procedure for the user is filling long questionnaires with dull, not appropriate questions. Using questionnaires, this cannot be avoided. Each case being different from the others, this would require tailor-made questions, all others being indifferent, boring, provocative. This art of questioning is a special ability of a good investigator, medical practitioner or teacher; a certain hidden knowledge does the further work. The endeavor of graphical input representation is a modest imitation of the human process: by providing appropriate sketches of patterns, many default values are offered. The user, supported by the system, should select the best visual patterns and correct only those values which are different from the default (standard) ones. All kinds of graphical tools can be considered: coloring, shading, using special gauges, keyboard input, pencil operations. This art is now the subject of a special research project, a cooperation of computer science, visual art and cognitive psychology.

Output has somewhat different requirements. According to our experience, a visual analogy to the real patterns is not so much needed; rather, the representation of as many data as possible from the visualization and perception points of view. The user should get a pattern of similarities and differences. This was the objective of multidimensional analysis from its very early times. Different charts, Chernoff faces and other tools are available in most statistical representation software systems. No final decision has been taken until now; we experiment with different tools and different people.

Knowledge acquisition per se is done by the usual methods: interviews and long cooperation with the experts. In any kind of more complex, human experience-related subject this non-royal way is unavoidable. This was done for more than a decade with our medical project. Methods like Repertory Grid Analysis (RGA) were also applied. The emphasis is on the conceptual frame of the subject, creating a vocabulary of the expertise and guessing distance estimates among their poles (extreme values of occurrences) and triads of concepts, triangulating the conceptual field. This procedure is well adaptable for creating patterns and related values, as a direct input to the pattern body representation. The triangulation highlights the metric faults of the pattern space and provides a direct support for further mining of knowledge, as described in the section on metric. Interesting results were obtained by measuring time relations of perception, finding concepts and their poles, i.e. conceptual distances. The time needed for eliciting more and more hidden knowledge is a certain measure of remoteness in the memory, indicating how the mind's representation collects this non-surface information.

6. PROJECTS

The research has a history of more than a decade. Until now we have chosen three very different fields of experimentation. The reason was the investigation of the general applicability of the idea. The projects applied the same or similar software, which developed gradually but intensively during this long period, using, where possible, novel, available tools of commercial software technology. All three applications were selected to contain conventional rule-based knowledge, unclear, subjective, not well-structured knowledge, and the interwoven combination of the two kinds. This paper does not report on the projects in detail; only reference is made to them.

Medical project - early diagnosis and habilitation of babies born with brain injuries. In the first period of life, generally within the first three months, the development of the brain provides a flexibility for compensation of these defects; later the functions are fixed, and the dysfunctions as well. The internal relations of the developing brain cover several unclear problems. The patterns of diagnosis and therapy were developed together with the medical research group; the system was of much help in clarifying these problems and has been functioning in daily practice for several years.

Legal project - situation and decision on child custody after divorce. A problem ruled by law, but even more by sociological, psychological and pedagogical considerations. The project was the subject of a successful Ph.D. thesis and lasted for about four years. The objective at this phase was not a practical application but a deep analysis of feasibility, and it received a very positive response from the best legal experts of the field in the country. Our intention is to continue this project with the objective of a first legal advisor for laymen, before they invest much in legal procedures and human quarrels.

Economic project - started about a year ago, comparing different socio-cultural-historical background patterns of development in economy and economic policy in various countries. Several research groups in economics and the World Bank are involved.

7. CONCLUSION

The pattern view of knowledge is converted to a practical methodology of representing and analyzing uncertain, not well-structured, partly hidden knowledge. The given representation forms, software tools developed for the representation and the three extensive practical projects prove the feasibility of the ideas. A key issue is a more intensive man-machine interaction, to be supported by special graphical means.

APPENDIX

Schemes of representation

Pattern = Standard, Individual
Pattern = {Head, Body, Tail}
Standard Head = {identifier, algorithms, completeness}
Individual Head = {identifier, distances_to}
identifier = pattern name, individual code
algorithm = pointer to distance calculating algorithm
completeness = estimate of the pattern definition's completeness
distance_to = {identifier, distance}
Standard Body = {datum_identifier, value (value range), relevance, uncertainty}
Individual Body = {datum_identifier, value, uncertainty}
datum_identifier = name of pattern component
value = measured or estimated value, scaling, ordering of pattern component
relevance = relevance of the component in view of the standard pattern
uncertainty = estimated uncertainty of measurement, obtaining data, estimation, generally for the class denoted by the standard and specially for an individual case
Tail = {patterns, relations, strengths}
relations = {evoke, assess, (re)organize, transform}
evoke = {evoke_follow, evoke_together}
assess = {assess, viewpoint_pattern}
assess = {assess_strengthen, assess_weaken}
transform = {transform, by_pattern}
strength = strength of relation
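A minimal data-structure rendering of this scheme is sketched below in Python (requires Python 3.9+ for the built-in generic annotations). The field names follow the appendix; the dataclass layout and types are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class BodyItem:
    datum_identifier: str
    value: float                      # measured or estimated value (scaling, ordering)
    relevance: float = 1.0            # used by standard patterns
    uncertainty: float = 0.0          # estimated uncertainty of the datum

@dataclass
class Relation:
    kind: str                         # 'evoke', 'assess', '(re)organize' or 'transform'
    target: str                       # identifier of the related pattern
    strength: float = 1.0             # estimated or calculated strength of the relation

@dataclass
class Pattern:
    identifier: str
    is_standard: bool                 # standard vs. individual pattern
    body: list[BodyItem] = field(default_factory=list)
    tail: list[Relation] = field(default_factory=list)
    # head information (standard and individual heads are merged here for brevity)
    distance_algorithm: Optional[Callable] = None   # standard: pointer to distance algorithm
    completeness: Optional[float] = None            # standard: estimate of completeness
    distances_to: dict[str, float] = field(default_factory=dict)  # individual: computed distances
```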



Application of evidence theory to k-NN pattern classification

Thierry Denoeux

Université de Technologie de Compiègne, U.R.A. C.N.R.S. 817 Heudiasyc, BP 649, F-60206 Compiègne cedex, France*

In this paper, the problem of classifying an unseen pattern on the basis of its nearest neighbors in a data set is addressed from the point of view of the Dempster-Shafer (D-S) theory of evidence. Each neighbor of a sample to be classified is considered as an item of evidence that supports some hypothesis regarding the class membership of that pattern. The degree of support is defined as a function of the distance between the two vectors. The evidence of the k nearest neighbors is then pooled by means of Dempster's rule of combination. In the following, a brief reminder of the main concepts of D-S theory is first provided. The application to k-NN pattern classification is then presented.

1. D-S THEORY
The mathematical theory of evidence is a generalization of conventional Bayesian analysis in which probabilities can be assigned directly to subsets of the states of nature, as well as to individual states [2]. This approach is typically helpful in situations where only limited or weak information and data are available to support decision making. It has proved a useful formalism for handling uncertain associations between evidence and hypotheses in expert systems [1], and as such has undergone a rapid development in the AI area. Let Θ be a finite set of mutually exclusive and exhaustive hypotheses about some problem domain, called the frame of discernment. A basic probability assignment (BPA) is a function m from 2^Θ to [0,1] verifying:

m(∅) = 0    (1)

Σ_{A⊆Θ} m(A) = 1    (2)

The quantity m(A), called a basic probability number, can be interpreted as a measure of the belief that one is willing to commit exactly to A, and not to any of its subsets, given a certain piece of evidence. The function assigning to each subset A of Θ the sum of the basic probability numbers for all the subsets of A is called a belief function:

Bel(A) = Σ_{B⊆A} m(B)    (3)

*This work has been supported by EEC funded Esprit Project 6757 EMS (Environmental Monitoring System). Partners: Atlas Elektronik, Centre National de la Recherche Scientifique (U.R.A. 817 Heudiasyc), Lyonnaise des Eaux Dumez, Technical University of Munich.

Bel(A), also called the credibility of A, is interpreted as a measure of the total belief committed to A. The subsets A of Θ such that m(A) > 0 are called the focal elements of the belief function; their union forms its core. The quantity:

Pl(A) = 1 − Bel(Ā)    (4)

called the plausibility of A, defines to what extent one fails to doubt A, i.e. to what extent one finds A plausible. A situation of complete ignorance is represented by the vacuous or informationless belief function, for which Θ is the only focal element. Other particular kinds of belief functions are Bayesian belief functions, whose focal elements are singletons, and simple support functions, which have only one focal element in addition to Θ. A necessary and sufficient condition for a belief function to be Bayesian is that it verify the axiom of additivity: Bel(A∪B) = Bel(A) + Bel(B) for all subsets A and B of Θ such that A∩B = ∅. The concept of a belief function thus includes that of a probability distribution as a special case. It is also worth noting that a BPA can be viewed as specifying a set of compatible probability distributions P over 2^Θ satisfying:

Bel(A) ≤ P(A) ≤ Pl(A)    (5)

For that reason, Bel(A) and Pl(A) are sometimes called lower and upper probabilities, respectively. Two belief functions Bel_1 and Bel_2 induced by two independent sources of information can be combined using the so-called Dempster's rule, defined in the following manner. If the cores of Bel_1 and Bel_2 are not disjoint, and if m_1 and m_2 are the BPAs associated with Bel_1 and Bel_2, respectively, then a function

m_1 ⊕ m_2 : 2^Θ → [0,1]    (6)

can be defined by m_1 ⊕ m_2(∅) = 0 and:

m_1 ⊕ m_2(C) = Σ_{A∩B=C} m_1(A) m_2(B) / (1 − Σ_{A∩B=∅} m_1(A) m_2(B))    (7)

for all C ≠ ∅. m_1 ⊕ m_2 is a BPA whose associated belief function is called the orthogonal sum of Bel_1 and Bel_2 and noted Bel_1 ⊕ Bel_2. The ⊕ operation is commutative and associative, which ensures that the order of evidence gathering does not influence the aggregation process. Other interesting properties of Dempster's rule are the following [2]:

• Given two belief functions Bel_1 and Bel_2, if Bel_1 is vacuous, then Bel_1 ⊕ Bel_2 = Bel_2.

• If Bel_1 is Bayesian, and if Bel_1 ⊕ Bel_2 exists, then it is also Bayesian.

If Bel_2 is based on a BPA that concentrates all the mass on a particular subset B, and Bel_1 and Bel_2 are combinable, then, noting Pl_1 and Pl the plausibility functions associated with m_1 and m_1 ⊕ m_2, respectively, it can be shown [2] that:

Pl(A) = Pl_1(A∩B) / Pl_1(B)    (8)

If furthermore Bel_1 is Bayesian, then one recognizes in Equation 8 the definition of the conditional probability distribution P(·|B). Dempster's rule thus generalizes Bayes' rule of conditioning.
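The constructs above translate directly into a few lines of code. The following is a minimal illustrative sketch, not part of the paper: it assumes that a BPA is stored as a dictionary mapping focal elements (frozensets of hypotheses) to masses, and the frame and masses in the example are made up.

def bel(m, A):
    # Equation (3): total mass committed to subsets of A.
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    # Equation (4): Pl(A) = 1 - Bel(complement of A); equivalently the total
    # mass of the focal elements that intersect A.
    return sum(v for B, v in m.items() if B & A)

def dempster(m1, m2):
    # Equations (6)-(7): orthogonal sum of two BPAs.
    combined, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + a * b
            else:
                conflict += a * b
    if conflict >= 1.0:
        raise ValueError("cores are disjoint; the orthogonal sum does not exist")
    return {C: v / (1.0 - conflict) for C, v in combined.items()}

theta = frozenset({"w1", "w2", "w3"})
m1 = {frozenset({"w1"}): 0.6, theta: 0.4}            # simple support function
m2 = {frozenset({"w1", "w2"}): 0.5, theta: 0.5}
m = dempster(m1, m2)
print(bel(m, frozenset({"w1"})), pl(m, frozenset({"w1"})))

Combining either BPA with the vacuous one {theta: 1.0} returns the other unchanged, which reproduces the first property listed above.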


2. THE METHOD
2.1. Neighbors as evidence
Let us consider a collection of N P-dimensional training samples x^i, i = 1, ..., N, and a set C of M classes: C = {C_1, ..., C_M}. Each sample x^i will first be assumed to possess a label L^i ∈ {1, ..., M} indicating its class membership. Let x^s be an incoming sample to be classified using the information contained in the training set. Classifying x^s means deciding among a set of M hypotheses: x^s ∈ C_q, q = 1, ..., M. C is therefore the frame of discernment of the problem. Let us denote by Φ^s the set of the k nearest neighbors of x^s in the training set, according to some distance measure (e.g. the Euclidean one). For any x^t ∈ Φ^s, the fact that L^t = q can be regarded as a piece of evidence that increases our belief that x^s also belongs to C_q. Since this fact does not point to any other particular hypothesis, the rest of our belief must be assigned to the whole frame of discernment. This item of evidence can be represented by a BPA m^{s,t} verifying:

m^{s,t}({C_q}) = α    (9)

m^{s,t}(C) = 1 − α    (10)

with 0 < α < 1, and m^{s,t}(A) = 0 for any other A in 2^C. If x^t is far from x^s, as compared to the distances between neighboring points in C_q, the class of x^t will be considered as providing very little information regarding the class of x^s; in that case, α must therefore take on a small value. On the contrary, if x^t is close to x^s, one will be much more inclined to believe that x^t and x^s belong to the same class. As a consequence, it seems reasonable to postulate that α should be a decreasing function of d^{s,t}, the distance between x^s and x^t. In the limit, as the distance between x^s and x^t tends to infinity, one's belief concerning the class of x^s should no longer be affected by one's knowledge of the class of x^t. If we note:

α = α_0 φ(d^{s,t})    (11)

with 0 < α_0 < 1.

This advantage is particularly important for small sample sizes (N=30 and N=60). Although the choice of k does play a role in both methods, our algorithm also appears to be slightly more robust in that respect, at least for small numbers of training samples (see Figure 3).

Table 1
Results of the second experiment (Gaussian data, 1000 test samples) for the voting k-NN rule (k-NN) and our method (D-S): best error rates (means over 5 trials) with corresponding values of k (upper numbers) and average error rates integrated over the different values of k (lower numbers)

Classification rule    k-NN         D-S
N = 30 (k < 20)        0.341 (9)    0.303 (14)
                       0.376        0.336
N = 60                 0.308 (10)   0.282 (10)
                       0.343        0.317
N = 120                0.281 (7)    0.278 (7)
                       0.311        0.298
N = 180                0.283 (7)    0.275 (17)
                       0.299        0.289
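As a concrete illustration of the classification rule of Section 2, the following sketch (not from the paper) builds one simple support BPA per neighbor according to equations (9)-(11), pools them with Dempster's rule, and decides by maximum plausibility over the singletons, which is one of several possible decision rules. The choice φ(d) = exp(−γ d²) and the values of alpha0 and gamma are illustrative assumptions; the text only requires φ to be a decreasing function of the distance.

import numpy as np

def dempster(m1, m2):
    # Orthogonal sum of two BPAs stored as {frozenset: mass}; conflict stays
    # below 1 here because every BPA keeps some mass on the whole frame.
    combined, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + a * b
            else:
                conflict += a * b
    return {C: v / (1.0 - conflict) for C, v in combined.items()}

def evidential_knn(X, y, x, k=10, alpha0=0.95, gamma=0.1):
    """Classify x from its k nearest neighbours in (X, y) by pooling BPAs."""
    classes = sorted(set(y))
    frame = frozenset(classes)
    d = np.linalg.norm(X - x, axis=1)                     # Euclidean distances
    m = {frame: 1.0}                                       # vacuous BPA
    for t in np.argsort(d)[:k]:
        alpha = alpha0 * np.exp(-gamma * d[t] ** 2)        # eq. (11), assumed phi
        m_t = {frozenset({y[t]}): alpha, frame: 1.0 - alpha}   # eqs. (9)-(10)
        m = dempster(m, m_t)
    pl = {q: sum(v for B, v in m.items() if q in B) for q in classes}
    return max(pl, key=pl.get), m

# toy example with two Gaussian classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = [0] * 30 + [1] * 30
label, m = evidential_knn(X, y, np.array([1.8, 2.1]), k=9)
print(label)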


Figure 3. Mean classification error rates for the voting k-NN rule (-) and our method (- -) as a function of k (Gaussian data, N = 30)


Figure 4. Mean classification error rates for the voting k-NN rule (-) and our method (- -) as a function of k (Gaussian data, N = 60)


Figure 5. Mean classification error rates for the voting k-NN rule (-) and our method (- -) as a function of k (Gaussian data, N = 120)


Figure 6. Mean classification error rates for the voting k-NN rule (-) and our method (- -) as a function of k (Gaussian data, N = 180)


4. CONCLUSION
Based on the D-S theory of evidence, a new nonparametric algorithm for pattern classification has been proposed. This algorithm essentially consists in considering each of the k nearest neighbors of a pattern to be classified as an item of evidence concerning the class of that pattern. This evidence is conveniently represented by a BPA that assigns a probability mass to the subsets of the set of classes, with the constraint that the sum of all masses be equal to unity. The k BPAs resulting from the consideration of the k nearest neighbors are then pooled to form a new BPA by means of Dempster's rule of combination. Several advantages of this approach have been presented and discussed. Most of them stem from the flexibility with which different kinds of uncertainties can be represented in the framework of D-S theory. First of all, the possibility of modulating the influence of a neighboring learning sample as a function of the distance to the vector to be classified results in very good performance as compared to the conventional voting k-NN procedure. The method can easily be adapted so as to allow ambiguity and distance reject options, which can be interpreted in the theory in terms of lower and higher expected losses. Uncertainty in class labels can quite easily be taken into account by labeling each training sample with a BPA representing the accumulated evidence concerning the class membership of that sample. Lastly, the method provides as output a BPA that not only reflects the underlying uncertainty attached to the decision, but also allows combination with other classifiers whose outputs can be put in the same representational format.

REFERENCES
1. J. Gordon and E.H. Shortliffe, The Dempster-Shafer theory of evidence, in B.G. Buchanan and E.H. Shortliffe (eds.), Rule-Based Expert Systems, Addison-Wesley, Reading, MA, 1984.
2. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, NJ, 1976.
3. T. Denoeux, A k-nearest neighbor rule based on Dempster-Shafer theory, submitted to IEEE Transactions on Systems, Man and Cybernetics, 1993.
4. W.F. Caselton and W. Luo, Decision making with imprecise probabilities: Dempster-Shafer theory and application, Water Resources Research 28 (12), 1992, 3071-3081.
5. B. Dubuisson and M. Masson, A statistical decision rule with incomplete knowledge about classes, Pattern Recognition 26 (1), 1993, 155-165.
6. C.K. Chow, On optimum recognition error and reject tradeoff, IEEE Trans. Inform. Theory, IT-16 (1970), 41-46.
7. B. Tessem, Approximations for efficient computation in the theory of evidence, Artificial Intelligence 61 (1993), 315-329.


Decision trees and domain knowledge in pattern recognition
D.T. Morris and D. Kalles *
Department of Computation, UMIST, PO Box 88, Manchester, M60 1QD, UK
C25/MB, UMIST, PO Box 88, Manchester, M60 1QD, UK

On the set of all objects we define a set of features, each one with an inherent cost of computation and an added cost due to other feature measurements that must precede it. A decision tree is built taking into account this domain knowledge. The scheme is extended to accommodate incremental learning, which is required whenever new samples become available or the domain knowledge is revised.

1. INTRODUCTION
A decision tree is a model of the evaluation of a discrete function using a step-by-step computation [1]. In each step of the computation the value of a (discrete or continuous) variable is determined and the next action is chosen accordingly. Possible actions are the evaluation of some other variable, the output of the value of the function, or the remark that the space points represented by the particular variable-value combination are not included in the function's domain. Early research into decision trees was primarily justified by some notable advantages possessed by multistage classifiers over single-stage classifiers, such as [2-5]:

• Complex and global decisions may be made via a series of simpler and more localized decisions.
• Feature subset selection is context sensitive, in the sense that features are selected according to how well they are suited for a specific subtask.
• During classification, a pattern is subjected to only a portion of the tests available, thus improving performance in terms of efficiency.
• Problem domains with many features and multimodal classes pose practical limitations to the modeling and estimation of probabilistic distributions. By breaking down the original problem the decision process is simplified.

*D. Kalles gratefully acknowledges the support of the Science and Engineering Research Council and the NATO studentships commission of the Greek Government throughout his research. The authors also would like to acknowledge that David Slate kindly supplied the image files for the characters used in the experiments.

These properties have made decision trees attractive to researchers in the Machine Learning community, who have applied them to the construction of expert systems. The sometimes cumbersome elicitation of knowledge from the domain expert by the knowledge engineer and the ensuing problems of representing that knowledge in a usable form [6,7], the projected exponential increase [7] in demand for expert systems, and the acknowledged clarity and conciseness of tree-based methods [8] have boosted the use of decision tree induction as a means of organizing knowledge that is already available in the form of preclassified data as examples of a concept. Inductive learning has emerged as a powerful technique for the development of knowledge-based systems as it allows rapid, systematic and (fully or semi) automatic construction of knowledge bases [9]. This is despite the fact that there exist non-trivial methodological problems in the field of decision tree construction. Decision trees have been used in various fields, such as character recognition, medical diagnosis, speech recognition, taxonomy and identification (see also [1,3,5]). The rest of this paper consists of a brief presentation of approaches to the problem of inducing decision trees; we then discuss a technique which incorporates domain knowledge. An experimental verification is provided.

2. OVERVIEW OF DECISION TREE METHODOLOGIES
When a decision tree is viewed as a pattern recognition system the following properties are clearly desirable [5]:

• Minimal overhead for building and maintaining the decision tree.
• Ability to classify correctly the training sample and generalize adequately, so that unseen samples may be classified as accurately as possible.

Some researchers [3,10] attempted to solve the problem of finding optimal decision trees by formulating it in a dynamic programming context to ease the requirements for searching. Near-optimal methods were also put forward based on heuristic searching [11]. These were justified as, in 1976, Hyafil and Rivest [12] proved that minimizing the expected number of tests required to classify an unknown sample is an NP-complete problem. It is conjectured that most non-trivial decision tree optimization problems will also be NP-complete [5]. This motivates research into the development of suitable heuristic methods to tackle the problem sub-optimally but efficiently. Top-down induction of decision trees (TDIDT) has undoubtedly been the most favored direction in this research field. TDIDT algorithms recursively partition a data set, forming a hierarchy of splitting tests that are associated with the nodes of the decision tree. The thrust of the idea is that important tests are performed close to the root node. The identification of such suitable tests and of potential terminal nodes are recurring topics of interest. Influential work has been carried out by Quinlan [13], who proposed an information-based criterion to measure the goodness of splits among competing features (refer to [14] for a discussion of the relative merits of various feature selection policies). The combination of features to form new (more informative) ones has also been a topic which has attracted considerable interest [15,16].
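To make the TDIDT scheme just described concrete, the following is a minimal generic sketch, not the algorithm proposed in this paper: the split criterion is left as a user-supplied scoring function, and the names build_tree, classify and the data layout are our own assumptions.

from collections import Counter

def build_tree(patterns, features, goodness, min_size=1):
    """patterns: list of (feature_dict, class_label); goodness: split score, higher is better."""
    classes = [c for _, c in patterns]
    if len(set(classes)) == 1 or not features or len(patterns) <= min_size:
        return {"leaf": Counter(classes).most_common(1)[0][0]}
    best = max(features, key=lambda f: goodness(patterns, f))   # test placed at this node
    subsets = {}
    for fv, c in patterns:
        subsets.setdefault(fv[best], []).append((fv, c))
    children = {v: build_tree(sub, [f for f in features if f != best], goodness)
                for v, sub in subsets.items()}
    return {"test": best, "children": children,
            "default": Counter(classes).most_common(1)[0][0]}

def classify(tree, fv):
    while "test" in tree:
        tree = tree["children"].get(fv[tree["test"]], {"leaf": tree["default"]})
    return tree["leaf"]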


When no feature is good enough to partition the current set of patterns, a terminal node is created and (usually) labeled by the most frequent class among the patterns of that node. Probabilistic classification may also be used to report more than one possible result [17,18]. It is usually undesirable to obtain a very fine partitioning of the pattern space in terms of the training sample, as this will almost invariably fail to classify unseen patterns satisfactorily. Pruning may also be used to scale down the size of a decision tree [19-21]. Besides discriminating ability and error control, feature measurement costs may also be important. Kanal [2] noted that a priori structural knowledge concerning the relationships between objects of the application domain may be explicitly used to structure the skeleton of the decision tree. The exploitation of domain knowledge, either in terms of cost incorporation or domain modeling, is reflected in various research results [22-26]. Another aspect of decision tree induction is the ability to perform the learning task when the training sample is available in chunks of data and not as a whole. This is called incremental learning and has received considerable attention [27-29].

3. A FRAMEWORK FOR DECISION TREE CONSTRUCTION
A basic framework for constructing decision trees using domain knowledge will be presented. Domain knowledge is expressed in terms of relations between features. We shall then show how a decision tree can be used to output results of varying confidence and how it can be adapted to incremental learning.

3.1. Terminology
On the set of all objects we define a set, F, of functions, where each function maps an object to an ordered pair of the form (a, A), such that a ∈ A (A being a set of possible values of the function). a is termed a higher confidence result and (A − {a}) represents lower confidence results. A possible result of some application of f ∈ F is described by an F-condition, which is an ordered pair of the form (f, (a, b)), and is satisfied when a single computed value (an element of A) falls in the set {a, ..., b}. On the same domain we also define a set, G, of functions which serve as a pool of computed features from which other functions can draw results for their own algorithms. A G-condition is denoted by the name of the corresponding function and means that the function must be executed. Define H = F ∪ G, with F ∩ G = ∅. Define S_CF and S_CG as the sets of all F-conditions and G-conditions respectively. To each function h ∈ H we can attach a condition list, C_h, as an ordered pair of sets of the form (C_hF, C_hG), where C_hF ⊆ S_CF and C_hG ⊆ S_CG, meaning that unless all components of C_h are satisfied, computation of h may not advance. Let us note by h_0 →_C h_1 that function h_0 is involved in the condition list of function h_1, i.e. it appears in an F-condition or G-condition applied on h_1. Let us also introduce →*_C as the transitive closure of →_C. For each function h we compute the sets of functions on which h depends (R_F ⊆ F, R_G ⊆ G) and the sets of functions which depend on it (R'_F ⊆ F, R'_G ⊆ G). These sets represent the implementation of the →*_C relation. Based on the above, data collection consists of applying all allowed functions to each object in the training set. Each function has an inherent cost of computation associated with its algorithmic
complexity (internal cost) and an added cost due to the required computations of the functions appearing in its condition list (external cost). Let S_N be the set of nodes of the decision tree, where S_N = S_NT ∪ S_NP ∪ S_ND and S_NT, S_NP, S_ND are disjoint. S_NT consists of the terminal nodes. Each node in the set S_NP is a computation node associated with a function g ∈ G. Each node in the set S_ND is a decision node associated with a function f ∈ F. A configuration at a given node, N, with respect to its path from the root is described by a 3-tuple (H_N, L_N, P_N), where H_N is the set of all functions already computed so far, L_N is the set of functions f ∈ F that do not have any pending F-conditions, and P_N is the patterns' set that needs to be further split. Before advancing, note that the terms instance, pattern and object will be used synonymously. Note also that the above definition of functions is synonymous with the standard terms feature and attribute.

3.2. Building the decision tree
For each function f ∈ F introduce a dependence counter, called Depend[f], and initialize it to the magnitude of its R_F set. The decision to pick one function for the test node may involve decisions to establish intermediate computational nodes (note that the external cost is variable and subject to pending dependences). Let D_f, where f ∈ L_N, denote the whole set of functions which will be applied before the actual test is carried out. Clearly, f ∈ D_f. Set L_NP = L_N. Assume that a function f ∈ L_N has been selected for the current decision node (the selection criterion is irrelevant at present). This function, depending on its results, may create a series of m (possibly overlapping) sets of patterns P_N1, P_N2, ..., P_Nm, such that ∪_{i=1}^{m} P_Ni = P_N. Now, for all h in the set R'_F of functions depending on f, decrease their respective dependence counters by 1. If for any such h the condition Depend[h] = 0 holds, then set L_NP = L_NP ∪ {h}, as function h becomes a legal candidate for next-level tests. Set L_NP = L_NP − (D_f ∩ F), as no function f ∈ F may be used more than once in a subsequent test on any node descending from the current one. Finally, set H_NP = H_NP ∪ D_f. It is possible that |D_f| > 1 whenever a suitable function cannot be selected from the current L_N set. In this case we have to perform some lookahead search among functions which otherwise would not be defined. Dummy decision nodes may be created during this search. This procedure is applied recursively to the configurations (H_NP, L_NP, P_N1), ..., (H_NP, L_NP, P_Nm). Let us consider a function f ∈ F and a pattern P with f(P) = (a, A). At a decision node where f is tested, during the split phase, overlaps will occur if |A| > 1. Considering all the training sample and the patterns' sets resulting from each test, each pattern needs to be assigned a confidence value to show if it was the first choice for this set or not. By attaching to each pattern the decision nodes that have led to it, we can estimate its overall confidence value by checking at each node if it was a first-choice selection. If the pattern is assigned to a terminal node after k tests then its confidence measure is based on the k-tuple ((N_0, C_0), (N_1, C_1), ..., (N_{k-1}, C_{k-1})), where N_i is the test at a node at height i and C_i is the confidence value for that test (the possible outcomes are true and false, denoting high and low confidence, respectively).
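A minimal sketch of the dependence bookkeeping of Section 3.2 follows; it is an interpretation, not the authors' code. Each feature starts with a counter equal to the number of F-type features it depends on and becomes a legal candidate once those have all been computed along the current path. The dependency graph used in the example is hypothetical.

# depends_on[f] = set of F-functions that must be tested before f may be used
depends_on = {"f1": set(), "f2": {"f1"}, "f3": {"f1", "f2"}}
dependents = {f: {g for g, req in depends_on.items() if f in req} for f in depends_on}

depend = {f: len(req) for f, req in depends_on.items()}       # Depend[f]
legal = {f for f, n in depend.items() if n == 0}              # initial candidate set L_N

def select_for_node(f, depend, legal):
    """Place f in a decision node: every function depending on it gets one pending
    dependence fewer, and newly freed functions become legal candidates."""
    legal = set(legal) - {f}
    depend = dict(depend)
    for h in dependents[f]:
        depend[h] -= 1
        if depend[h] == 0:
            legal.add(h)
    return depend, legal

depend, legal = select_for_node("f1", depend, legal)
print(legal)          # {'f2'}  -- f3 still waits for f2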


In the above discussion, error nodes are assumed to correspond to descendants of decision nodes that are not associated with any pattern set.

3.3. Classification
Considering a pattern P_x which we want to classify, we start at the root node, perform the respective test and proceed to the descendant. For each descendant the same procedure is applied until a terminal node is reached. As a test pattern may have multiple values for some features, the classification process always maintains a frontier of plausible nodes that may lead to a result. Let us assume that one of the terminal nodes, where P_x is directed to, is associated with pattern P_t. P_x and P_t are associated via the same path and each one has its confidence measure. The level of support that P_t offers for the classification of P_x can be estimated by calculating how many decisions in the path are incompatible in confidence. That is, diff(P_x, P_t) = |{i : C_ix ≠ C_it}|. This metric assigns big scores to paths of low confidence (the majority rule may also be used). At the end of the classification process for a specific test pattern, the path with the highest confidence dictates the class label of the classified pattern. Note that when adequate features are not available it may be possible for a terminal node to accommodate patterns belonging to more than one class.

3.4. A framework for incremental learning
A decision tree reflects (based on its splitting policy) a partitioning of the current sample pattern space that is (sub-optimally) conducive to fast classification. This operating environment may cease to be representative of future classification problems. If the designer decides that performance enhancement is required, there should be a way of updating the current tree structure inflicting minimal overheads. This can be done by maintaining knowledge acquired up to that moment and selectively amending some parts of the tree so as to reflect the new environment, instead of trying to build a new tree and altogether ignoring the existing one. We first show why a decision tree may need to be revised.

3.4.1. Expansion
Consider a decision tree and assume that we want to incorporate pattern P_x into the tree structure. If P_x follows the path to an error node we need only change the label of that node to that of the class of P_x. If P_x is classified as (any) P_t, the recursive partitioning procedure described above is applied to the configuration (H_Nt, L_Nt, {P_x, P_t}) at node N_t. A similar procedure is used when node N_t accommodates more than one pattern. Changes of this type will raise the question of how satisfactory the original decision tree structure is, given the fact that new patterns continuously add more levels to it (slowing down the recognition process). It is inevitable that feature selection criteria based on distributions of values (information-based criteria, for example) will be liable to be affected. This happens because, at a given node where a feature was selected instead of some competing ones, the incrementally incorporated patterns passing through that node may swing the balance favouring one of the competing features. Implementing incremental learning for the above problem consists of detecting which nodes do not conform to the original design criteria and then changing them (and subtrees below them, if required) by selecting a more suitable feature.
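The path-confidence comparison of Section 3.3 amounts to counting, over the decisions on a shared path, how many confidence flags disagree. The sketch below is illustrative only, with made-up paths; it returns the class label of the stored pattern whose path agrees best with the test pattern.

def diff(conf_x, conf_t):
    """Number of decisions whose confidence flags (True = high) disagree (Section 3.3)."""
    return sum(1 for cx, ct in zip(conf_x, conf_t) if cx != ct)

def decide(test_confidences, candidates):
    """candidates: list of (class_label, stored_pattern_confidences) reached by the frontier."""
    label, _ = min(candidates, key=lambda lc: diff(test_confidences, lc[1]))
    return label

# hypothetical frontier: two terminal nodes reached, each with a stored pattern
print(decide([True, False, True],
             [("A", [True, True, True]), ("B", [True, False, True])]))   # -> "B"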


3.4.2. Domain knowledge revision
Introducing new dependences between functions may render invalid the compliance of the decision tree with the original design criteria in terms of priorities. It is also possible that during our inspection of the problem domain we have decided upon using features that are erroneously defined or inadequately specified. This can arise if we discover that a particular function may not be defined on a particular sample (while it was assumed to be defined up to that stage) or that its dependences have been wrongly estimated and some of them are missing or redundant. In this case existing dependences may need to be removed. A two-fold procedure is required to implement incremental learning for the domain revision problem: the new priorities of functions must be enforced by incorporating them into the tree so that no inconsistencies exist, and the priority network must be amended to reflect the new set of dependences among features.

3.4.3. An incremental learning technique
The problem of updating the network of dependences among functions can be formulated and elegantly solved as a (graph-theoretic) transitive closure problem. Utgoff [29] has proposed the ID5R algorithm for the incremental induction of decision trees and this algorithm has been chosen as the basis of the restructuring of the decision tree in our model. However, some fundamental changes have been made to cope with the various confidence scores used in the new model as well as the interaction of functions via the dependences. Restructuring is effected by examining the functions defined at the root node of the decision tree, picking the best one according to a criterion and pulling this function to that node, in case it is not already present there. This pull-up is a tree manipulation which preserves consistency with the observed training patterns. Counters need to be maintained at each node to indicate how the patterns have been assigned to the feature values. This is essential to prevent re-examination of the training samples every time a restructuring is needed. A transposition is a tree manipulation that consists of swapping the contents of nodes in a particular subtree in two consecutive levels. At the first level there is the root node of the subtree. At the second level, there exist several decision nodes (descendants of the subtree root), all of which contain the same decision function. The third level consists of all children of the second level nodes. To effect a transposition, one promotes the decision function which is used collectively at the second level to the first level, and generates a set of second level nodes with the decision function of the former first level as their decision function. The set of the former second level nodes is discarded. Succession pointers from the second level to the third level (which has not changed) must be adjusted accordingly, after the swapping. Beyond the third level no pointer adjustments need to be made (a schematic sketch of this manipulation is given after the algorithm outline below). The core of the algorithm is as follows:

• At the current node examine the used function and all non-used functions and determine which one to select.
• Pull-up this function to the root if it is not already there.
• Recursively examine subtrees below the node.
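The transposition described above can be sketched as a small tree manipulation. This is an illustration, not the authors' implementation; counter maintenance and confidence handling are omitted, and the Node class and example values are made up.

class Node:
    def __init__(self, test=None, children=None, leaf=None):
        self.test, self.children, self.leaf = test, children or {}, leaf

def transpose(root):
    """Promote the test used collectively at the second level to the root."""
    f_old = root.test
    f_new = next(iter(root.children.values())).test
    new_children = {}
    for u, child in root.children.items():           # child tests f_new
        for v, grand in child.children.items():      # third level is kept as is
            new_children.setdefault(v, Node(f_old))
            new_children[v].children[u] = grand
    return Node(f_new, new_children)

# tiny example: root tests colour, both children test shape
t = Node("colour", {
    "red":  Node("shape", {"round": Node(leaf="apple"), "long": Node(leaf="chilli")}),
    "blue": Node("shape", {"round": Node(leaf="berry"), "long": Node(leaf="fish")}),
})
t2 = transpose(t)
print(t2.test, t2.children["round"].test)          # shape colour
print(t2.children["round"].children["red"].leaf)   # apple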

Assume that at a given node, f_new has been selected as the testing function, f_old was the previous testing function, and f_new ≠ f_old. The pull-up operation consists of the following steps:

• Recursively pull-up f_new to the root of each immediate subtree. If f_new has not been used, create a dummy expansion using f_new as a testing function at the appropriate terminal nodes.
• Transpose the subtree.

In order that confidence scores be properly calculated, one must examine which patterns are being considered at the nodes affected by transpositions. The above steps, however, do not reflect more refined steps taken in order to safeguard the dependence constraints and the correct calculation of confidence scores. It can be proved that by swapping the confidence scores of each individual pattern at the third level of a transposition the consistency of confidences of paths can be guaranteed. Dependences among features can be safeguarded by recursively descending the tree and applying the mechanism described when the original decision tree is built. Pull-ups may be required in order to promote or demote functions. As stated above, in order to perform transpositions one must keep counters of the occurrences of patterns in every node with respect to the values that a function has returned for these patterns. Establishing the correct counters is not only a set handling procedure, because one must also detect the existence of patterns which have been directed to more than one node. It is possible that a transposition may bring about a redundant test to split (trivially) a pattern set consisting of one pattern only.

4. EXPERIMENTAL EVALUATION
4.1. Experimental environment
We shall now describe the heuristic used for feature selection. Assume that a feature f is examined using n patterns and that m discrete values are detected. If we choose to ignore class distributions within each "bucket", we can estimate how many bits of information (totalled over all available patterns) would be required for a complete classification of these patterns given their original partition by feature f. This amount of information is:

E_f = Σ_{i=1}^{m} n_i log_2 n_i    (1)

where n_i is the number of patterns in bucket i. A large value of this expression indicates that the feature considered is liable to produce an unbalanced distribution and will thus probably increase the total tree size and reduce classification speed. This information-based criterion was selected because of its low computational overhead.
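Equation (1) is straightforward to compute from the bucket sizes produced by a candidate feature; a minimal sketch follows (the feature values in the example are made up).

import math
from collections import Counter

def Ef(values):
    """E_f = sum over buckets of n_i * log2(n_i), equation (1); lower is better."""
    return sum(n * math.log2(n) for n in Counter(values).values())

# hypothetical feature values observed on ten training patterns
print(Ef(["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]))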

The above framework has been applied to a character recognition problem. 2366 characters (91 samples from each capital letter of the Latin alphabet) were processed and structural and geometrical features were extracted after a framework of dependences between features was established. Patterns were indexed as follows: A_0, ..., A_90, B_0, ..., Z_90. The set {A_i, ..., A_j, B_i, ..., Z_i, ..., Z_j} defines letter family f_ij. Training and testing sets were built in terms of letter families. A training set and its corresponding test set had a size ratio of about 2:1. Table 1 shows the details for the conducted experiments. These were selected so as to make use of most patterns for both the training and testing phases, to enhance the confidence in the results and to provide some insight into the scalability of the problem.

Table 1
Description of the experiments (letter families)

Experiment   Training set   Testing set
1            0-30           31-45
2            16-45          46-60
3            31-60          61-75
4            46-75          76-90
5            0-60           61-90
6            30-90          0-29

4.2. Classification accuracy
Table 2 describes the classes of experiments conducted to establish how the incorporation of lower confidence results (LCRs) affects accuracy. The STST descriptor stands for Single-Training-Single-Testing, meaning that lower confidence results are not considered at all during either training or testing. At the other extreme, MTMT stands for Multiple-Training-Multiple-Testing, meaning that overlap is allowed throughout training and a set of alternative decision paths is maintained during the classification of a test pattern. The classes denoted by the descriptors STMT and MTST are defined similarly.

Table 2
Description of LCR handling (✓ = LCRs suppressed)

         Suppression of LCRs during
         Training    Testing
STST     ✓           ✓
STMT     ✓
MTST                 ✓
MTMT

The classification experiment consisted of reporting the accuracy of the decision obtained when 1, 2, or 3 alternative results were allowed for a test pattern (a larger number of alternatives may incur prohibitive post-processing costs). Average results for the first four tests (experiment 30-15) and for the last two sets (experiment 60-30) are shown in Table 3. Individual results were within a ±3% range of the reported averages and their trend was consistent across all classes of any specific experiment.


Table 3
Accuracy with alternative results

           Experiment 30-15          Experiment 60-30
           1       2       3         1       2       3
STST     78.97   79.48   81.31     78.97   84.77   85.45
STMT     65.06   74.16   85.37     68.60   86.35   91.19
MTST     76.46   82.53   83.20     80.80   87.72   88.65
MTMT     75.73   86.15   88.04     80.95   91.38   92.50

4.3. Cost-sensitive learning
Each feature was assigned a cost (1-9 units) according to its approximate complexity. The criteria were weighted and averaged using a calibration parameter, x:

total-score = information-based-score^x × cost-score^(1−x)    (2)

Figure 1 shows the total cost of the trees produced as a proportion of the maximum cost, according to calibration. The results shown describe the 60-30 experiment. The curve for the 30-15 experiment is almost identical.

Figure 1. Evaluation of cost-merit combination criteria.

4.4. Evaluation of the results
The results above indicate that for the particular problem domain incorporating lower confidence results during training is superior to using "fuzzy" testing only. In character recognition, where dictionary look-up [30] may be a post-processing step, this approach seems particularly attractive. An increase in the average depth of the decision tree is the price to be paid for this improvement (due to lack of space we shall not discuss heuristics to tackle this problem). A surprising result, that merits investigation, was that the accuracy reported did not worsen when the majority rule was used for labeling the terminal nodes. The results also indicate that the size (and the total cost) of the decision tree is influenced mostly by the suitability of the splitting criterion, but the usage of the cost criterion is necessary for improvement (experiments where feature costs were fixed consistently produced more expensive trees). Accuracy was not affected in cost-related experiments. In particular, it fluctuated slightly around the figures reported in Table 3.

5. CONCLUSIONS
Important aspects of decision tree methodologies have been identified and a general model for inductive learning of decision trees has been described. The experimental results support the proposed method.

REFERENCES
1. B.M.E. Moret. Decision trees and diagrams. ACM Computing Surveys, 14:593-623, December 1982.
2. L.N. Kanal. Problem-solving models and search strategies for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):193-201, April 1979.
3. G.R. Dattatreya and L.N. Kanal. Decision trees in pattern recognition. In L.N. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition, volume 2, pages 189-239. North-Holland, Amsterdam, 1985.
4. Q.R. Wang and C.Y. Suen. Large tree classifier with heuristic search and global training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):91-102, January 1987.
5. S.R. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man and Cybernetics, 21(3):660-674, May/June 1991.
6. R.S. Michalski. Learning strategies and automatic knowledge acquisition. In L. Bolc, editor, Computational Models of Learning, pages 1-19. Springer Verlag, 1987.
7. J.R. Quinlan. Decision trees and multi-valued attributes. In J.E. Hayes, D. Michie, and J. Richards, editors, Machine Intelligence, volume 11, pages 305-318. Oxford University Press, Oxford, 1988.
8. J.R. Quinlan. Decision trees and decision-making. IEEE Transactions on Systems, Man and Cybernetics, 20(2):339-346, March/April 1990.
9. K. Tsujino, V.G. Dabija, and S. Nishida. Knowledge acquisition by constructive and interactive induction. In Proceedings of the 6th European Knowledge Acquisition Workshop, pages 152-170, 1992.
10. W.S. Meisel and D.A. Michalopoulos. A partitioning algorithm with applications in pattern recognition and optimization of decision trees. IEEE Transactions on Computers, 22(1):93-103, January 1973.
11. I.K. Sethi and B. Chatterjee. Efficient decision tree design for discrete variable pattern recognition problems. Pattern Recognition, 9(3):197-206, 1977.

12. L. Hyafil and R.L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15-17, May 1976.
13. J.R. Quinlan. Learning efficient classification procedures and their application to chess end games. In R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: an Artificial Intelligence Approach, pages 463-482. Tioga Publishing, Palo Alto, CA, 1983.
14. J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319-342, 1989.
15. R. Seshu, L.A. Rendell, and D. Tcheng. Managing constructive induction using optimisation and test incorporation. In Proceedings of the International Conference on Artificial Intelligence Applications, pages 191-197, Miami, FL, March 1989.
16. P. Utgoff and C.E. Brodley. An incremental method for finding multivariate splits for decision trees. In Proceedings of the 7th International Conference on Machine Learning, pages 58-65, Austin, TX, 1990.
17. R.L.P. Chang and T. Pavlidis. Fuzzy decision tree algorithms. IEEE Transactions on Systems, Man and Cybernetics, 7(1):28-35, January 1977.
18. J.R. Quinlan. Decision trees as probabilistic classifiers. In Proceedings of the 4th International Workshop on Machine Learning, pages 31-37, Irvine, CA, June 1987. Morgan Kaufmann.
19. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
20. T. Niblett. Constructing decision trees in noisy domains. In I. Bratko and N. Lavrac, editors, Progress in Machine Learning: Proceedings of the 2nd European Working Session on Learning, pages 67-78. Sigma Press, Bled, Yugoslavia, 1987.
21. J. Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243, 1989.
22. G.R. Dattatreya and V.V.S. Sarma. Decision tree design for pattern recognition including feature measurement cost. In Proceedings of the 5th International Conference on Pattern Recognition, volume 2, pages 1212-1214, Miami Beach, FL, December 1980.
23. M. Nunez. Decision tree induction using domain knowledge. In Proceedings of the Conference on Current Trends in Knowledge Acquisition, pages 276-288, Amsterdam, 1990.
24. M. Tan and J.C. Schlimmer. Two case studies in cost-sensitive concept acquisition. In Proceedings of the 8th National Conference on Artificial Intelligence, volume 2, pages 854-860, July - August 1990.
25. M. Manago and Y. Kodratoff. Induction of decision trees from complex structured data. In G. Piatetsky-Shapiro and W. Frawley, editors, Knowledge Discovery in Databases, pages 289-306. MIT Press, Menlo Park, CA, 1992.
26. L. Gaga, V. Moustakis, G. Charissis, and S. Orphanoudakis. IDDD: An inductive, domain dependent decision algorithm. In P.B. Brazdil, editor, Proceedings of the European Conference on Machine Learning, pages 408-413, Vienna, Austria, April 1993.
27. J.C. Schlimmer and D. Fisher. A case study of incremental concept induction. In Proceedings of the 5th National Conference on Artificial Intelligence, pages 496-501, Philadelphia, PA, August 1986. Morgan Kaufmann.
28. D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
29. P.E. Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161-186, 1989.
30. S. Kahan, T. Pavlidis, and H.S. Baird. On the recognition of printed characters of any font and size. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(2):274-288, March 1987.


Object recognition using hidden Markov models
J. Hornegger, H. Niemann, D. Paulus and G. Schlottke
Lehrstuhl für Mustererkennung (Informatik 5), Friedrich-Alexander-Universität Erlangen-Nürnberg, Martensstr. 3, D-91058 Erlangen, Germany

This contribution describes a statistical approach to the learning and classification of two-dimensional objects based on segmented grey-level images. The research concentrates on the application of Hidden Markov Models in the field of computer vision. For that purpose, the theory of Hidden Markov Models is briefly introduced, with emphasis on different types of stochastic automata. In the experiments we evaluate several types of Hidden Markov Models with respect to affine invariant geometric features. The implementation uses an object-oriented class hierarchy for different variants of Hidden Markov Models. The paper concludes with a discussion of Hidden Markov Models for 3-D computer vision purposes.

1. INTRODUCTION
For classification purposes, knowledge about the objects is necessary, which can be acquired and represented in various ways. One possibility is the explicit representation of knowledge for a particular problem domain [9]. In some cases, the knowledge base can then be generated automatically in a knowledge-acquisition phase using learning sets of images. Distortions and noise in the input data are inevitable and may cause problems for the algorithms. However, statistical learning algorithms exist which are robust with respect to variations of the input data. Consequently, a statistical approach for learning objects seems natural. In the area of speech analysis, the statistical approach has been very successful; stochastic automata - especially Hidden Markov Models (HMMs) - are an established tool for that purpose. The following paper is dedicated to the problem of learning 2-D objects by examples and the design of efficient recognition algorithms based on information extracted from training samples. The technique used is based on HMMs combined with affine invariant geometric features. Image segmentation and representation and training of the HMMs are implemented following the object-oriented programming paradigm. The experiments are based on four different object classes, where for each object class 50 images and the corresponding extracted features are used for estimating the model parameters. The contribution concludes with a discussion of the practical results and the consideration of a statistical approach to solve the 3-D object recognition problem from 2-D views.


2. STATISTICAL OBJECT RECOGNITION
Theoretical aspects of statistical pattern recognition and classification are well developed [5]. Current research in this field focuses on the investigation of efficient and robust algorithms for practical recognition systems [8]. Most classification algorithms are based on Bayesian decision theory, where the decision relies on the a posteriori probability of the classes. Let us assume that we have classes Ω_1, Ω_2, ..., Ω_K and observe a feature vector c; then we decide for class Ω_κ if

Ω_κ = argmax_λ P(Ω_λ | c).    (1)

A statistical classification system is expected to provide the capability of learning the statistical properties of classes, for instance the density functions, from training samples. Additionally, Bayesian classifiers should also allow an efficient computation of the a posteriori probabilities P(Ω_i | c) for each class Ω_i.

3. HIDDEN MARKOV MODELS
In the following section we will briefly introduce the basic concepts of HMMs and refer the interested reader to the literature for more details [11].

3.1. Definitions
Hidden Markov Models are widely used in the field of speech recognition. They are stochastic automata including states, transitions among states, and emission probabilities for elements of a given alphabet. An HMM with N states {S_1, ..., S_N} can thus be described by a triplet λ = (π, A, B), where π = (π_1, π_2, ..., π_N) is the vector of probabilities for the generation of a sequence of output elements to start at a special state. The state transition matrix A = (a_ij), 1 ≤ i, j ≤ N, includes the probabilities a_ij of changing from state S_i to state S_j. The third element is either a matrix B = (b_i(v_l)), 1 ≤ i ≤ N, 1 ≤ l ≤ L, including discrete probabilities for a finite output alphabet {v_1, v_2, ..., v_L}, or a vector of density functions for an infinite continuous output alphabet. Each HMM can generate sequences of output symbols. The name Hidden Markov Model is due to the fact that for an observed sequence of output symbols the underlying state sequence is unknown. Figure 1 shows two examples of HMMs of different topologies.

3.2. Algorithms
During the training phase, the parameters of an HMM λ have to be estimated such that for all observed learning sequences O_i (1 ≤ i ≤ L) the probability P(O_i | λ) that model λ generates O_i is maximized. In the recognition stage the decision rule, i.e. the a posteriori probability P(λ_j | O) for an observed feature sequence O, has to be computed for each HMM λ_j, in order to find out which HMM most likely created the feature sequence. The parameter estimation algorithm is unsupervised inasmuch as, due to the nature of HMMs, it is not known which state sequence has generated the sequence of output symbols. The computation of the parameters of the HMM λ is done iteratively using the Expectation Maximization algorithm (EM algorithm, [2]).


Figure 1. Ergodic and "left right" Hidden Markov Model; each state emits probabilistic output symbols.

For that purpose, the Kullback-Leibler quantity

Q(λ, λ̄) = Σ_{i=1}^{L} Σ_s P(s | O_i, λ) log P(s, O_i | λ̄)    (2)

is computed for an initial estimate of λ. Herein s = s_1 s_2 ... s_T varies over all possible state sequences which may have produced the output symbols of the i-th observation O_i = o_1 o_2 ... o_T (T may be different for every i). Q(λ, λ̄) is maximized with respect to the parameter set λ̄. After the maximization step the reestimated model parameters λ := λ̄ are substituted. Both steps have to be repeated until no change in parameters occurs, i.e. λ̄ = λ.

P(s, O | λ) = π_{s_1} ∏_{t=2}^{T} a_{s_{t-1} s_t} ∏_{t=1}^{T} b_{s_t}(o_t)    (3)

Since for an arbitrary observation O equation (3) holds, the learning formulas can be computed using numerical or combinatorial optimization techniques. For example, the computation of the zero crossings of the first derivatives with respect to the unknown parameters will yield the well known estimation formulas for HMMs with discrete, time independent probabilities [11]. The decision rule for recognition depends on the computation of

P(λ | O) = P(λ) P(O | λ) / P(O)    (4)

where the complexity of determining P(O | λ) is bounded by O(N²T) when the forward-backward algorithm [8] is used. The optimal state sequence for an observation O can be computed using the Viterbi algorithm [11].
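The O(N²T) evaluation of P(O | λ) mentioned above is the standard forward algorithm; the sketch below is a generic textbook version, not the authors' implementation, and assumes a discrete HMM given by π, A and B with the observation sequence encoded as symbol indices.

import numpy as np

def forward_probability(pi, A, B, obs):
    """P(O | lambda) for a discrete HMM: pi (N,), A (N,N), B (N,L), obs = list of symbol indices."""
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij * b_j(o_{t+1})
    return alpha.sum()

# toy two-state model with a three-symbol alphabet
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_probability(pi, A, B, [0, 1, 2]))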

4. OBJECT ORIENTED IMPLEMENTATION OF HMMS
With respect to the definition of HMMs given in the previous section, different variants can be distinguished. Types of HMMs are distinguished by the special form of the occurring statistical measures π, A and B and the topology of the stochastic automata. For instance, the transition probabilities a_ij can be time dependent; those HMMs are called non-stationary. The output alphabet of an HMM can be discrete or continuous; thus, the measure B represents either discrete probabilities or continuous density functions. For example, the statistical behavior of a state can be modelled by a Gaussian density or a mixture of Gaussian densities. Restrictions on possible transitions induce different topologies. Left right HMMs are used in speech recognition algorithms and satisfy the constraint that the state index increases with increasing time. Object-oriented programming currently seems to be the most promising tool for software management. The similarities in the deduced estimation formulas and classification algorithms suggest the use of polymorphism and inheritance, and the realization of HMM algorithms in class hierarchies. The algorithm for computing the a posteriori probabilities for an observed feature sequence and the optimal state sequence can be described in terms of the variables π, A, and B, independently of the topology or the special form of the statistical measures. Thus, the forward-backward algorithm and the Viterbi algorithm should be implemented in a higher level of the inheritance tree. Dependent on the properties of the output density or the statistical behavior of the transitions, the learning formulas have to be computed. These training algorithms have to be implemented in derived, more specialized classes. A hierarchy of HMM classes has been implemented and tested. An abstract base class HMM provides the interfaces for an abstract specification of training and classification algorithms, like the Viterbi algorithm. Two class subtrees are derived from this class; one is used for the implementation of discrete HMMs, which implement B as a matrix; the other subtree describes continuous HMMs and is further subdivided into classes for various densities in B. Left right HMMs are special cases of any of those classes. The whole class hierarchy is integrated in an object-oriented environment for image segmentation and analysis (ANIMALS, [10]).

5. AFFINE INVARIANT FEATURES
Applications of HMMs are restricted to those pattern recognition problems where ordered sequences of features can be constructed. The extracted features of a speech signal are, for instance, a priori time ordered and thus satisfy this central prerequisite. Since objects in an observed scene have several degrees of freedom, such as translation or rotation, a sequence of features associated with each object needs to be invariant with respect to this kind of transformation if the goal is to identify the objects in varying scenes using HMMs. In [4] affine invariant features are introduced which are based on simple contours of objects. A contour is called simple if there are no intersections of the contour with itself. A local form of a simple contour is defined as a closed polygon a, p_k, ..., p_l, b, a, which is part of the polygon approximation of the complete contour determined by the point sequence p_0, p_1, ..., p_{n-1}. Figure 2 shows a part of a simple contour and a local form and also indicates the possible locations for a and b.


Figure 2. Examples of a local form in different scaling, rotation, and position

Obviously, proportions of areas are affine invariant. For example, let c be the center of gravity of the local form. The quotient κ_1 of the area F_c of the triangle abc and the area F of the local form is obviously an affine invariant feature. Now the question arises how the points a and b have to be chosen for a given local form, and how many local forms have to be computed for each contour. For each quadruple (p_{k-1}, p_k, p_l, p_{l+1}) the points a and b are chosen such that the quotient κ_1 will be minimized. This process produces a large set of local forms. The selection of local forms out of this set is guided by the following criterion: all local forms whose area is greater than or equal to half of the complete contour's area are cancelled. Furthermore, triangular local forms are not admissible. The resulting set of local forms provides a set of affine invariant features, which is naturally ordered by the processing order of the polygon. Thus, we can associate with each contour a sequence of features, and HMMs can be used for training and classification purposes.

6. EXPERIMENTAL RESULTS
In the experiments we choose four different objects (see Figure 3). Using a training data set of 50 input images per object, different types of HMMs are trained from the sequences of extracted affine invariant features. We compute κ_1 for each contour. The classification results using 10 images of each object which are not included in the sample set are shown in Table 1. The type of HMMs used is the continuous version producing normally distributed output in each state. Each column shows the number of correctly classified objects. The last line summarizes the recognition rate related to the number of states. The increase of correct decisions using "left right" models is a remarkable result, which is due to the fact that ergodic models have more transitions and thus more parameters which have to be estimated. The higher the number of parameters is, the larger the sample set has to be for good estimates.
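The area-ratio feature κ_1 of Section 5 can be computed with two shoelace-formula evaluations. The sketch below is illustrative: it takes the local form as an already extracted closed polygon a, p_k, ..., p_l, b, and it uses the vertex mean as the centre of gravity, an assumption the text does not spell out.

def polygon_area(pts):
    """Absolute area of a closed polygon given as a list of (x, y) vertices (shoelace formula)."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def kappa1(local_form):
    """kappa_1 = area of triangle (a, b, c) / area of the local form, c = centre of gravity."""
    a, b = local_form[0], local_form[-1]
    cx = sum(x for x, _ in local_form) / len(local_form)
    cy = sum(y for _, y in local_form) / len(local_form)
    return polygon_area([a, b, (cx, cy)]) / polygon_area(local_form)

# hypothetical local form: a, two intermediate contour points, b
print(kappa1([(0.0, 0.0), (1.0, 1.0), (2.0, 1.2), (3.0, 0.0)]))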


Figure 3. The left image shows the original grey-level image (children's toys) and the right image the resulting closed polygons of the contour of each object.

Table 1
Continuous, ergodic (left) and "left right" (right) HMMs with Gaussian output densities

Ergodic HMM
                 number of states
object        3    4    5    6    7
monkey        9    8    9    8    8
giraffe       7    7    7    8    8
elephant      8    8    8    5    4
camel         5    5    5    3    5
rate in %    72   72   72   60   62

"Left right" HMM
                 number of states
object        3    4    5    6    7
monkey        9    8    9    8    8
giraffe       7    7    7    7    7
elephant      9    9    9    9    9
camel         5    5    5    5    6
rate in %    75   72   75   72   75

Conspicuously, in all cases in which the classification result is wrong, the correct object has the second highest a posteriori probability. Since the features associated with a contour of an object are real numbers, discrete HMMs cannot be used in a direct manner. Discrete feature values can be computed using vector quantization techniques [8]. The same experiment was carried out with the Expectation Maximization algorithm applied to Gaussian mixture densities and an unordered set of features. The overall recognition rate was approximately 93%. This means that the introduction of an ordering for the features has decreased the recognition rate rather than increased it. Non-stationary HMMs as introduced in [6] expect feature sequences of equal length for each observed scene. The sequences of features in our experiments have different sizes - from 1 up to 14 - and thus the mentioned type of HMM was not tested. All these experiments can easily be carried out using other types of invariant features. Applications to three-dimensional object recognition problems might for instance be the


use of geometric 3-D invariants like mean and Gaussian curvatures of surfaces in range images, which are viewpoint independent features.

7. SUMMARY AND CONCLUSIONS
This contribution shows that statistical methods are suitable for object recognition purposes. The experimental evaluation is based on two-dimensional object recognition problems without the computation of the object's location. Real images were used. However, the constraint that only invariant features can be used is profound, since apart from the classification of an object, the computation of its location is a further central problem of computer vision. Actually, this cannot be solved using the proposed approach with HMMs, not even in the discussed two-dimensional object recognition problem. One conceivable extension might be the introduction of parameterized output densities regarding the object's location. Nevertheless, this would cause maximization problems for parameter estimation which do not provide an analytical solution. The computation of the a posteriori probabilities will also be of higher complexity, because the search space is enlarged by the location parameters. Additionally, the introduced method is currently limited to images which include only one object against a homogeneous background. In order to use HMMs, the inclusion of structural information about the objects as an ordered sequence of features was required. However, only few invariant geometric features of the described type can be found in the objects, e.g. only three for the monkey object. This is not sufficient for a stable parameter estimation of the HMMs (e.g. 70 parameters for a 7-state HMM) and explains why the HMM experiments reveal lower recognition rates than expected. Further research on features is required; ideally they should be chosen in such a way that the length of the feature sequence of a given object is fixed. In this case, non-stationary HMMs can be used, which are shown to have higher recognition accuracy. For the recognition of three-dimensional objects from 2-D views HMMs are not suitable, because occlusion and the missing depth information lead to the result that there do not exist any geometric invariant features for 3-D objects in 2-D images. Consequently, the use of a view based approach is necessary, i.e. for each possible view of an object an HMM has to be introduced. The relation between the possible number of views of an object and the resulting recognition errors is discussed in [1]. We summarize that HMMs can be naturally implemented in an object-oriented programming environment, and provide high flexibility and programming comfort. Training algorithms can thereby be programmed in an abstract manner for several types of HMMs. Furthermore, we conclude that HMMs in their existing form cannot be used for solving the 3-D object recognition problem from two-dimensional images apart from the view based attempt. Thus, it seems to be indispensable to find a more appropriate statistical framework for building a Bayesian classifier for 3-D object recognition purposes. A first promising approach can be found in [3,7].

ACKNOWLEDGEMENT
Special thanks to E.G. Schukat-Talamazzini, who carried out the EM experiments for mixture density functions.

REFERENCES

1. T. M. Breuel. Geometric Aspects of Visual Object Recognition. PhD thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Massachusetts, 1992.
2. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
3. J. Denzler, R. Beß, J. Hornegger, H. Niemann, and D. Paulus. Learning, tracking and recognition of 3D objects. In V. Graefe, editor, International Conference on Intelligent Robots and Systems - Advanced Robotic Systems and Real World, page to appear, September 1994.
4. S. Frydrychowicz. Ein neues Verfahren zur Kontursegmentierung als Grundlage für einen maßstabs- und bewegungsinvarianten Strukturvergleich bei offenen, gekrümmten Kurven. In H. Burkhardt, K.H. Hoehne, and B. Neumann, editors, Mustererkennung 1989, Informatik Fachberichte Nr. 219, pages 240-247, Berlin Heidelberg, 1989. Springer.
5. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Boston, 1990.
6. Y. He and A. Kundu. 2-D Shape Classification Using Hidden Markov Models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(11):1172-1184, 1991.
7. J. Hornegger. A Bayesian approach to learn and classify 3-D objects from intensity images. In Proc. 12th International Conference on Pattern Recognition, Jerusalem, Israel, to appear October 1994. IEEE Computer Society Press.
8. X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov Models for Speech Recognition. Number 7 in Information Technology Series. Edinburgh University Press, Edinburgh, 1990.
9. H. Niemann, H. Brünig, R. Salzbrunn, and S. Schröder. Interpretation of industrial scenes by semantic networks. In Proc. IAPR Int. Workshop on Machine Vision Applications, pages 39-42, Tokyo, 1990.
10. D. W. R. Paulus. Objektorientierte und wissensbasierte Bildverarbeitung. Vieweg, Braunschweig, 1992.
11. L. R. Rabiner. Mathematical Foundations of Hidden Markov Models. In H. Niemann, M. Lang, and G. Sagerer, editors, Recent Advances in Speech Understanding and Dialog Systems, volume 46 of NATO ASI Series F, pages 183-205. Springer, Berlin, 1988.



Inference of syntax for point sets

Michael D. Alder*

Mathematics Department, Centre for Intelligent Information Processing Systems, The University of Western Australia, Nedlands, WA 6009, Australia

*[email protected]

This paper discusses a method for extracting structure in point sets in R^n. It may be applied repeatedly to such problems as decomposing images into constituent parts, as in the recognition of handwritten characters, where we treat a character as a point set in R^2, or for more general issues of image understanding, as when we recognise an image as made up out of cuboids, themselves made up out of faces, in turn made out of lines, in turn made out of pixels. It has been used for recognising hand printed Kanji characters, for finding hand drawn boxes in images, for finding faces in video-images, and for classifying images of flaws in metal. Details of the applications are discussed elsewhere; this paper discusses the theoretical issues.

1. INTRODUCTION

King Sun Fu [1] drew attention in a number of books and papers to the structured nature of many pattern recognition problems. His terminology was motivated by analogy with the structures found in natural languages, which are well known both to linguists and to students of formal language theory. In the case of natural language text, regarded as a long string of ASCII characters, the primitives at the lowest level are letters of the alphabet, but these are aggregated into words. Words may be regarded as a new set of primitive symbols, one level up from the base level of characters, and these may in turn be taken to be derived from a string of symbols each of which defines a part of speech. These in turn may be derived from a string of phrases, and the string of phrases from a primitive symbol called a sentence. Phrase structure or rewrite grammars formalise this idea of a hierarchy of alphabets and of languages, where a language for these purposes is simply an acceptable set of strings of symbols. We may talk, for example, of the language of PASCAL programs on the alphabet of ASCII characters. A recognition grammar for this language may be taken to be some algorithm which decides whether a particular string of ASCII characters is in fact a PASCAL program. The usual syntax diagrams for PASCAL constitute a grammar for the language; it is plain that there are intermediate levels of structure between the symbol for a program and the ASCII string which comprises the program.

By analogy with linguistic structures, Fu argued that images are also decomposable into components, sometimes over many levels, and that algorithms which decide what allowable arrangements of entities at one level give a legitimate entity at the next level up may be found.

46

Intuitively, it is tempting to decompose a line drawing of a cube into faces, the faces into line segments, and the line segments into pixels, and to argue that this is precisely parallel to what is done in decomposing a sentence into phrases, each phrase into parts of speech, each part of speech into words, and each word into a string of characters. Picture grammars to do this have been explored in somewhat restricted cases.

The case of formal languages with grammars which determine whether a string is admissible has been generalised profitably to the stochastic case, where a generative grammar will produce strings with different frequencies, and where a recognition grammar will assign to a string a probability or frequency with which the string occurs in the language. Examples are the n-grammars or language models used by IBM in their Automatic Speech Recognition work at the level of words, and Hidden Markov Models used at the level of phonemes [2].

A key issue of grammatical inference has also been identified by Fu; given a language sample, it is desired to infer a grammar which describes the strings found, and possibly which describes strings still to come in a larger language sample. In the case of stochastic grammars, this is well within the usual paradigms of probabilistic modelling, if not well explored. In the case of phonemes in speech, for example, Hidden Markov Models are used, whereas at the level of words in ASCII text, n-grams may be extracted and used to construct a stochastic grammar.

The program inaugurated by Fu required the selection, usually manually, of some suitable set of graphics primitives, which were then designated symbolically. This led to the risk of premature quantisation: two arrays of pixels which were very similar might still be assigned different symbols. This difficulty arises in Automatic Speech Recognition by the method of vector quantisation, followed by Hidden Markov Modelling of the strings derived from continuous trajectories in the speech space. A further issue arises: even very simple diagrams give rise to grammars which are not finite state and not even context free. This can present formidable difficulties at the inference stage.

In this paper I shall show that it is possible to transfer the ideas of formal language theory across to the case where the points of R^n are the symbols of the alphabet at the lowest level, and where generally there is a sequence of alphabets, each of them R^n for some n, giving rise to a set of pixels constituting an image at the lowest level. The inference problem may be accomplished in a manner precisely analogous to a solution for the string case known to Shannon, and indeed the analogy may be formalised via category theory, although I shall not do that here. The advantage of this approach is that, first, we have the familiar structures and ideas of formal language theory to guide us, and second, the whole has been transferred to a category where everything is continuous and no premature quantisation occurs. The parallels between the two categories, of strings on a finite alphabet and of finite subsets of R^n, extend to the stochastic case.

In order to focus the intuitions, consider the set of pixels comprising a handwritten Japanese character, as in fig.1. The images are taken from the ETL character database, CD-ROM C, formats ETL-8 and ETL-9, produced by the ICDAR '93 Conference Committee. To the eye, it is natural to aggregate what is a set of pixels into a much smaller number of strokes.


Figure 1. Japanese handwritten character.

Figure 2. Kanji handwritten character.

It is also natural, for the eye which has seen many more similar characters, to further aggregate the strokes into radicals, and the radicals into the character. This is more apparent in fig.2, a Kanji character. Conversely, the character is decomposed into radicals, the radicals into strokes, and the strokes into pixels. Not all possible sets of pixels will constitute a stroke, and not all arrangements of strokes a radical, and not all radicals a character. An algorithm which allowed one to decide if a given arrangement of primitives at one level was allowable as constituting an element at the next level up would be a continuous or topological grammar, because the elements at any level may be deformed continuously into others, in contrast to the conventional discrete grammars for symbols.

The problem which is addressed in this paper is that of finding a suitable representation of the elements at each level as points in a space, and then of inferring from a sample what elements are admissible, or more accurately of providing probabilities or probability densities for them. Such a general procedure would, in particular, allow, from a sample of characters, the inference of the existence of strokes, then by representing these strokes as points in a vector space, of inferring at a higher level the existence of radicals, and by a further application of precisely the same process, the existence of characters. Recognition of a character would proceed in precisely the same manner; ultimately a character becomes UpWritten to a point in a space which encodes the substructures out of which it is composed. Sufficiently close points will be recognised as the 'same' character, by the usual clustering methods.

In any classical pattern recognition problem, it is usual to find a way of coding an object, say a point set in the plane, as a single vector. We may characterise this as the measurement process, which is inevitably required to turn an object in some class into a point in a vector space labelled by the class. What we are doing here is to arrive at such a measurement process not by arbitrary fiat of a human operator, but by extracting structure from the image. In effect, we hope to automate the problem of deciding what properties of our entity to measure. We factorise the measurement procedure through sub-measurements of parts of the object, repeated as we rise through the various levels of description.


2. CHUNKING

The process of chunking has been known to psychologists for many years, and may be seen as the aggregation process which singles out a subset of the point set, in our case the pixels in the character, as constituting a single entity one level up. In deciding how to accomplish this process, it is useful to consider how we might aggregate characters in natural language text into words. Indeed, it is more than useful; it is an extension of an existing method into a different category, so there is an interesting principle involved.

Suppose we adapt the procedure of the IBM Yorktown Heights Speech Group, and construct trigrams of letters extracted from English text (the Speech Group constructs trigrams of words). We store these in the form of bigrams of letters together with the distribution over the alphabet of the final letter. Thus if we take the set of all trigrams beginning 'th', the frequencies of the letters which occur conditioned by these letters in that order in English text assign a high frequency to the letter e, a slightly lower one to a, reasonably high frequencies to i, o, u, y, r and whitespace, but very low or zero to h, z, A, $, and so on. Such a frequency distribution can be normalised to become a probability distribution giving the probability of getting each letter conditioned on two predecessors. Any such distribution has an entropy, measuring the extent to which we are uncertain about the possibilities; if each letter of say 128 possible letters can occur with equal probability, the entropy is a maximum of 7 bits. If only one letter is possible, as occurs in English text after 'eq', then the entropy is zero. The collection of trigrams becomes a second order Markov predictor for the stream of symbols.

We observe, with Shannon, that the entropy of such a predictor decreases as we go through a word in general, and then rises sharply after the whitespace character or punctuation symbol separating words. In short, we could segment written text into words even if all the punctuation and whitespaces were removed, by going through with a local predictor, and chunking the strings by introducing separators at the entropy maxima. We might have a few discrepancies, and we might find ourselves segmenting into stems and inflections in many languages; much would depend on where our threshold for a maximum was set. It is as well that this can be done for text, because the situation in continuous speech is that there are no separators between spoken words. It has been found that such chunking can be applied to decompose executable files into 'words', thus allowing compression of such files.

A similar principle may be applied to the problem of aggregation of point sets. What is required is an apparatus for predicting, from some level of aggregation, where other points may be expected to occur. If, for example, we have a line segment made up out of pixels, we need to be able to look in some neighbourhood of any one pixel, to deduce the local linear structure, and then to extrapolate it into a prediction of where other pixels may be found. Similarly for other distributions of pixels. This may readily be done by taking the pixels belonging to the set which lie within some radius of a given pixel (corresponding to the n of the n-gram), which I shall call the resolution radius. Now we compute the pixel count, the centroid, and the covariance matrix for those pixels within the resolution radius of the given pixel.
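As a concrete illustration of the computation just described, the following fragment (an illustrative sketch, not code from the paper) gathers the pixels within a chosen resolution radius of a reference pixel, computes their count, centroid and covariance matrix, and scores a candidate point by the negative logarithm of the fitted gaussian density; the radius value and the small regularising constant are arbitrary choices for the example.

import numpy as np

def local_moments(points, centre_pixel, resolution_radius):
    # Count, centroid and covariance of the points within the resolution radius.
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points - centre_pixel, axis=1)
    local = points[d <= resolution_radius]
    count = len(local)
    centroid = local.mean(axis=0)
    cov = np.cov(local.T, bias=True) + 1e-6 * np.eye(2)   # small ridge keeps C invertible
    return count, centroid, cov

def surprise(point, centroid, cov):
    # Negative log of the gaussian pdf at the point: half the squared Mahalanobis
    # distance plus a term that is constant for a fixed covariance matrix.
    diff = np.asarray(point, dtype=float) - centroid
    maha2 = diff @ np.linalg.solve(cov, diff)
    log_norm = 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(cov))
    return 0.5 * maha2 + log_norm

# Toy usage: pixels lying roughly along a line segment.
pixels = [(x, 0.05 * x) for x in range(20)]
n, m, C = local_moments(pixels, np.array([10.0, 0.5]), resolution_radius=4.0)
print(n, m)
print(surprise((14.0, 0.7), m, C), surprise((14.0, 6.0), m, C))

A candidate point lying along the local linear structure receives a low score, while one well off the predicted direction receives a high one; this is the 'surprise' used in what follows.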


We may, if we choose, regard this as the fitting to the data of a gaussian distribution, in which case it is clear that extra points are, so to speak, expected to arrive preferentially at locations where the pdf takes greater values. In the case of a line segment of pixels and a resolution radius large enough to include more than one pixel (and small enough not to include any pixels not part of the line segment), the predicted pixels will also lie along the line segment. And since we took account only of pixels within the resolution radius, and intend to repeat the procedure on the remaining pixels of the set, we may predict that other gaussian distributions may be expected along the major axis of the quadratic form defined by the centroid and covariance matrix already computed. This prediction will hold within the line segment and fail at the end points.

The amount of information supplied to the predictor by the data is measured by the logarithm of the reciprocal of the probability of the event. Using a continuous and noncompact pdf requires some amendment to this definition, so I shall provisionally define the surprise experienced by the predictor as the logarithm of the reciprocal of the value of the gaussian pdf at the centre of the closest quadratic form. This is simply the Mahalanobis distance, together with an additive term which is approximately constant. I also need to expect, in the case of the line segment, that there will be two such predictions, one on either side of the given form. If the adjacent form in any direction is remote or does not exist, the surprise will be large or infinite. I also need to amend my definition of surprise to take into account that the covariance matrix of the adjacent forms will have to be close to the given covariance matrix, or the predictor will also get a large surprise. It is easy to see how to measure this, so I shall avoid details. When the amount of surprise is low, the points are deemed to be part of a single entity at a higher level; where the surprise is high, the entity is deemed to have a boundary.

The images in figs. 3 and 4 show how the decomposition by quadratic forms, represented here as ellipses, allows us to collect pixels into strokes in a natural manner. Each ellipse may be regarded as a 1-gram predictor of its neighbours. If an ellipse of similar orientation is found in the expected position, not far from the major axis of the given ellipse, we regard the two ellipses as belonging to the same entity. As we proceed along a stroke in this way, we eventually come to an end of the stroke, or a sudden change of orientation. In this way, the character clearly decomposes into some number of strokes. The probability distribution for the likely locations of other ellipses does not have to be imposed a priori, but may be inferred from the data, just as in the case of natural language text.

This leads us to the principle that in investigating the syntax of point sets, we should first take a point, then find a description of its neighbours which contains information about the distribution of these neighbours in the form of a vector, and this vector must also constitute a local predictor for the values of neighbouring such vectors. Having obtained such a representation, we may then use it to aggregate the original data. The problem then becomes to describe these aggregates in terms of some other entity, one level higher up. These entities will need to be represented as vectors or points in some generally higher dimensional space, and the process iterated until only one point at the top level is produced.
This then constitutes a description of the original entity, and moreover the sequence of UpWrites constitutes a representation of the syntactic structure of the original point set. Conversely, by DownWriting the point at its highest level into possible sets which might have given rise to the higher level point, we recover objects which are syntactically equivalent to the given object. There may be many of these, just as there are many sentences of words.


It is important to see that the UpWrite process just described for point sets is the ordinary Shannon chunking performed on text strings, transferred to a different category of grammars. As has been remarked, the correspondence between topological grammars and string grammars may be formalised completely within the language of category theory, facilitating the study of the inference of structure quite generally.
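The string-category analogue can be made equally concrete. The sketch below is illustrative only; the toy corpus, the order of the predictor and the boundary rule are arbitrary choices, not those of the IBM group. It trains a second order Markov predictor over letters and introduces separators at local maxima of the conditional entropy, which is the Shannon chunking referred to above.

import math
from collections import defaultdict

def train_trigrams(text):
    # Bigram context -> frequency distribution over the following letter.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def entropy(dist):
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def chunk(text, counts, max_entropy=7.0):
    # Insert a boundary wherever the predictor's entropy reaches a local maximum.
    h = []
    for i in range(len(text)):
        ctx = text[max(0, i - 2):i]
        h.append(entropy(counts[ctx]) if ctx in counts else max_entropy)
    pieces, start = [], 0
    for i in range(1, len(text) - 1):
        if h[i] > h[i - 1] and h[i] >= h[i + 1]:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

corpus = ("the cat sat on the mat and the rat ran at the cat " * 50).replace(" ", "")
print(chunk("thecatsatonthemat", train_trigrams(corpus)))

On such a toy corpus the recovered chunks only roughly coincide with words, as the discussion above anticipates; the point is that the same predict-and-cut machinery transfers unchanged to point sets.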

Figure 3. Japanese handwritten character: first UpWrite.

Figure 4. Kanji handwritten character: first UpWrite.

3. LOW ORDER MOMENT DESCRIPTORS

We suppose that the entropic process described above has decomposed the original point set into subsets. They are not necessarily disjoint, since if we have for example a triangle, the vertex points may be naturally shared between the sets comprising the sides, or even assigned to separate categories. Next we have to code each such subset as a point in a generally higher dimensional space.

In order to represent some collection of points in any space, in such a way as to preserve shape information of the distribution, it is natural and easy to use low order moments. In particular, the first order moments give the centroid of the set, second order moments give, when taken relative to the centre and normalised, the covariance matrix for the data, and higher order moments yield higher order forms. The zeroth order moment may be taken to give a count of points in the case of finite sets.

The covariance matrix may usefully be regarded as providing us with a quadratic form on the space, which may be represented, as above, by a section, giving an ellipse in two dimensions, and an ellipsoid in higher dimensions. More precisely, it yields a basis for the space, or alternatively it can be regarded as supplying a local metric tensor. This makes it convenient for use as a local predictor. Higher order moments are sometimes convenient, but do not modify the ideas in any essential way. In dimension two, for example, as with the figures given above, a positive definite symmetric matrix contains three independent numbers, and the centre requires two more, so the form is a point in R^5.


If we add the zeroth order moment, we obtain a point in R^6. This dimension may be reduced by constraining the form in some ways; in the case of the given figures, it is feasible to fix the eigenvalues of the matrix and to store only the angle of the principal axis and the centre. Then each form is a point in R^3, and is still a local predictor of other pixels, and hence of other forms. For some data, it is possible to regard the covariance matrix as specifying a gaussian distribution in the usual way, in which case it returns directly a likelihood for other points. More generally, higher order forms may be regarded as providing density models for the data.

Pixels in images of characters are hardly natural contenders for being modelled by mixtures of gaussians, but this is a convenient way to treat some problems. Indeed, regarding a character as a set of pixels, using some ad hoc method of initialising gaussians, and then using the EM algorithm [3-5] to compute a maximum likelihood gaussian mixture model, has proved successful in extracting strokes from Japanese characters. Such decompositions have succeeded admirably in recognising such characters, with recognition rates largely insensitive to the number of characters presented. Recognition is better for the more complex characters, in general.

4. INFERENCE

Suppose, in order to make matters concrete, that we are given a set of linear images, that is to say of pixel arrays of straight line segments, and to make it even simpler, of triangles. The Continuous Syntactic Inference Engine is presumed to be presented with a list of points in R^2, the constituent pixels. It has no knowledge of line segments or of triangles. We wish it to be able to be 'trained', by simple exposure, to a sample of triangles, and then to take a test image which contains a triangle, and to identify it as an object it has seen before. Moreover, the same process must work for squares or any other geometrical shape built up from line segments; we impose no constraints at this stage except that (a) we restrict ourselves to objects built up from line segments, and (b) we go to only three levels: triangle (or whatever) at the top, line segments as components, and pixels as the primitives. No knowledge is built into the process except that it starts with primitives which are points in R^2 and that it is to stop at the third level of UpWrite. I shall illustrate the process in this case, giving an outline only, for the details will be described in Mr. McLaughlin's paper in these proceedings.

First we select a pixel at random, and a resolution radius which is sufficient to contain a few pixels. Very small triangles will not be recognised as triangles when they fall within the resolution radius. We take all the pixels within that radius and compute the centre and covariance matrix. All pixels which lie within the radius are deleted from the set of pixels, and the process is repeated until every pixel is removed. For graphical purposes, the covariance matrix C is represented by an ellipse which is the set of points {x ∈ R^2 : (x - m)^T C^{-1} (x - m) = 1}, where m is the centre. These are the points at one standard deviation from the centre, or Mahalanobis distance 1.

We now have a somewhat smaller set of points in R^5, the central moments of orders one and two for the original contents of the resolution radius. These points are also predictors of the original pixels, and hence of each other.


If two forms are such that the angles of the major axes differ by only a small amount, and the centre of each lies close to the major axis of the other, and if the ratio of the eigenvalues is sufficiently high, then they are deemed to belong to the same aggregate. If we 'colour' a form purple and there is a neighbouring form satisfying these conditions, it is also coloured purple. Forms not in the aggregate of purple forms eventually acquire some different colour. Having aggregated or chunked the forms in this way, our original data, consisting of an equilateral triangle, now consists of a set of points which lie close to some coloured form. We may expect that we shall obtain three distinct colours corresponding to the three edges of the triangle, say purple, green and blue. More are possible if the vertices are regarded as distinct, but the treatment is essentially unchanged.

We now assign a colour to each point of the data set, the colour of the quadratic form which models it. We simply compute the Mahalanobis distance from each form, and select the closest. Other more sophisticated procedures are of course possible. This, in the case of the triangle, will give us three colours of points, purple, green and blue. I shall call each colour set of points a chunk of the original image. Applying the process to text strings at the character level would make each chunk a word in the ordinary sense.

We now wish to describe the chunks; we do this by computing low order moments for each chunk regarded as a point set in R^2. If we were to use only first and second order moments, we should, in effect, be fitting an ellipsoid to the set of points on a line segment. We observe that this removes the dependence on the original choice of a resolution radius, since many such choices would produce the same chunk. In the case of shapes more complex than a straight line segment, it may be necessary to compute higher order moments than two in order to obtain a reasonably precise description of the shape of the chunk. The choice of what order of representation to use is a parameter which has to be determined by testing on the data; if data differ on details of shape which are not captured adequately by low order moments, then this will reveal itself at a later stage. If we elect to use moments of order zero, one and two, we have reduced the original data set to three points in a six dimensional space, each point containing a description of the size, location, shape and orientation of a line segment of the original triangle.

Finally we take a resolution radius in this space, sufficient to contain all three points. We then compute low order central moments for this set. Since we have only very few points, it is unnecessary to go to higher order than three in order to describe their disposition completely. The process of going from the original point set to the chunks is known as the chunking operation, and the process of characterising the chunks as points in R^6 (in this case) is known as the first UpWrite. The process is then repeated; here the chunking is trivial and all three points are in the same chunk, and the second UpWrite takes them into a single point, in a space of dimension 84 if we proceed to third order moments.

At the top level, it is easy to see that there is considerable degeneracy: any triangle of any size, shape, or orientation will be a point in some lower dimensional subspace of the space R^84. In general, this subspace will be a manifold which is not affine.
It is easy to calculate the dimension of this space; it may be thought of as the space obtained by letting the full affine group act on a single triangle in R^2, and hence has dimension six. Topologically, it is the space R^6 factored by the action of the symmetric group on three elements. Having seen lots of triangles, the process has generated a cluster in a high dimensional 'triangle' space.

If another object is given, it may be UpWritten in exactly the same way, and if it is a triangle, it will also be found in the cluster. If it is a square, it will not. It is easy to generate point sets for a variety of triangles, for a variety of squares, and then to determine for a new object whether it is a triangle or a square by any of the standard pattern classification methods in the top level space. The knowledge that the point sets lie on manifolds which characterise the objects may assist in doing this intelligently. Note that the process which has been applied to single triangles and squares to produce the manifolds for each may be carried one level higher, to create a point in a fourth level space characterising squareness or triangularity: the appropriate Lie group action may also be learnt by the same process. The dimension of the space of squares is of course only 4, and that information, relating to the symmetries of the object, may be preserved through the final UpWrite.
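The moment UpWrite which carries a chunk, or a set of chunk descriptors, up one level is easily sketched. The fragment below is an illustration only, not the program used for the experiments reported here; it writes a set of points in R^n up to the vector of its central moments of orders zero to three, which for n = 6 gives 1 + 6 + 21 + 56 = 84 numbers, that is, a single point in R^84.

import itertools
import numpy as np

def upwrite(points, max_order=3):
    X = np.asarray(points, dtype=float)
    n_points, dim = X.shape
    mean = X.mean(axis=0)
    D = X - mean
    features = [float(n_points)]                 # order 0: the count
    features.extend(mean)                        # order 1: the centroid
    for order in range(2, max_order + 1):        # orders 2, 3, ...: central moments
        for idx in itertools.combinations_with_replacement(range(dim), order):
            features.append(float(np.mean(np.prod(D[:, idx], axis=1))))
    return np.array(features)

# Three chunk descriptors in R^6 (for instance the three sides of a triangle)
# become a single point at the next level.
chunks = np.random.default_rng(0).normal(size=(3, 6))
print(upwrite(chunks).shape)     # (84,) for 6-dimensional inputs and order three

Repeating the same operation at each level is all that the UpWrite requires; only the dimension of the ambient space changes.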

5. GENERALISATION

The process described in the case of linear images of triangles is evidently of considerable generality. Training it on squares or other polygons requires no modification whatever. It has been used to recognise Kanji characters built up from line segments with no alteration at all. Modifications to deal with curved chunks amount to little more than extending the order of the predictors formed by the quadratic forms, and going to higher order in the UpWrite stage.

It is clear that the UpWrite process need not stop at this level; having given it parallelograms, one could then proceed to UpWrite each parallelogram to a single point, and given perspective views of a cube, it could use exactly the same process in order to 'learn' the structure of cubes, and recognise them when it has seen them before. Each cube view is decomposed into a triple of parallelograms, the faces. Again, this does not entail any new ideas, simply iterating the basic machinery already sketched for essentially plane figures. Thereafter we merely proceed down the various levels as before. A cube is simply a higher order object under this decomposition. Going to higher levels merely increases the computational load in principle, and so analysis of images composed of line segments merely needs a lot of images.

The restriction to clean images built up from line segments is also unnecessary, although it simplifies the exposition. The method has been applied successfully to frame differenced real time video images, and also to ultrasonic images of flaws in metals. Nor is it necessary that the object be an image. In the case of OCR it is possible to go up to the level of a single character by the above methods, and then to go to larger entities such as strings. Computational and memory requirements tend to exceed those of ordinary workstations rather rapidly, however.

6. INVARIANCE

There are many cases (identification and classification of aeroplanes, for example) where it is desirable to build into the process various kinds of invariance, for example rotational invariance. This may be accomplished conveniently by using a rotationally invariant set of moments. It is convenient although not essential to use orthogonal moments. In dimension 2 the well known Zernike moments can be used for this purpose. It is for this reason that central moments are preferred, as this allows translation invariance to be implemented by projection onto some linear subspace of the representation space.


It has been found generally that it is much more interesting not to use invariant moments, but to learn the invariance, as indicated above, by employing another level of UpWrite. If we take for example a set of aeroplane silhouettes and rotate them about the centre, then we obtain an embedding of the rotation group SO(2) in the top level space; if in addition we scale them between half size and one hundred and fifty percent, and also displace them, we have the four dimensional manifold SO(2) × [0.5, 1.5] × R^2 embedded in the top level space. If instead of an aeroplane we had chosen a square, the symmetries of the object under rotation would manifest themselves in the structure of the map from SO(2) into the space, which would no longer be an embedding. More may be done in this direction, involving a study of noisy manifolds and 'almost manifolds' and their dimension.

It is worth observing that at each level, constraints on the kinds of arrangements of 'atoms' at that level which occur in forming a particular class of 'molecules' are transformed into a requirement that the molecules lie in a subspace of the space of all possible concatenations of atoms. The subspace is not generally linear, but usually constitutes a manifold. Since the 'atomic arrangement' in a molecule is subject to noise, the manifold is noisy. In practice it is desirable to choose a representation of the quadratic forms which yields a linear or affine submanifold, which usually means it is prudent to diagonalise it and specify eigenvectors and eigenvalues rather than to simply list the matrix entries. It is feasible to model such manifolds locally by quadratic forms: this of course is no more than the process of chunking applied in higher dimensions.

7. NOISE

Noise at different levels can be very different. Randomly scattered pixels constitute noise at the lowest level, but randomly oriented line segments constitute noise at a higher level, and deformed triangles might constitute noise at a higher level again. Quite generally, the size of the resolution radius at each level, if chosen judiciously, can permit the exclusion of some of the noise: single pixels have singular moments and can be eliminated. It is the feature of variation or 'noise' at different levels which gives power to the hierarchical decomposition which is here termed syntactic. UpWriting in a single stage is of course precisely the commonly used moment method of pattern classification: typically it works well with characters drawn from a fixed font, and degenerates rapidly with large numbers of fonts or when characters are handprinted.

It is clear to the eye that much variation between versions of the 'same' character occurs at the level of strokes. This is perfectly intelligible if we think of the characters as assemblies of elements in a stroke space, where variation consists of modifying a stroke by perturbing it. But this is not at all the same as perturbing it at the pixel level. A flock of birds may wheel in the sky and become distorted, and yet remain a flock of birds, maintaining much of the relationship between individual members. Or we may make random changes to the positions of the birds. These are quite different kinds of variation, and it is important to have a language in which one can distinguish these things.

In the case where there is noise within the resolution radius, as for example when there are adventitious line segments intersecting a triangle, it is necessary to try removing points at one level to see if this yields an object which has been seen before at the next level up. This can be reduced to the problem of finding a close point in a manifold of noise free objects, as can the occlusion problem. The choice of a suitable resolution radius is at present accomplished in a somewhat arbitrary manner, and it would be desirable to study the matter further. There is a suspicion that the brain might perform similar operations to the above, concurrently at many different resolutions.

It is worth noting that most noise reduction techniques make assumptions about the noise, independence most critically, and then use an averaging process in order to extract the signal. There are exceptions, such as the use of Bayesian methods for finding a restricted class of signals such as sine waves or mixtures of sines in fixed ratios, or more recently chirp [6], and these produce very impressive results. The present suggestion for coping with noise is clearly more closely related to Bayesian methods than to standard filtering.

8. BINDING AND OCCLUSION

Suppose we are given a partial square after having trained the program described above on squares and triangles. Let us suppose that only three complete sides of a square are given and that the fourth is absent. We ask the question: can the process lead to completing the square? Such issues arise quite naturally even in images of line drawings of objects. Here it is variously named, but occurs in the case of occlusion of part of one object by another.

Now it is plain that the completion makes sense only relative to a body of data, and that it is the space of squares which is significant here. What we are asking, in geometric terms, is: what point on the manifold of squares is closest to the given point obtained by UpWriting three of the four edges? This raises the question of the right way to measure distances in the space. It would be foolish to simply choose a Euclidean metric if there were any better choices available. We can regard the approximation of a noisy manifold locally by quadratic forms as giving a local estimate of the metric in the whole space, and employ this in order to get reasonable estimates of how to measure distances in the space.

It has been found possible to use quite crude methods and successfully complete an object, relative to a class of objects on which it has been trained. It may happen that there are several possible completions, and all of these may be found, as well as a measure of which is more likely. In general this may involve higher level completions competing with lower level completions. The problem of choosing the 'best', or determining the plausible alternatives, is however reducible by these methods to a problem in geometry and statistics.
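A deliberately crude version of such a completion can be sketched as follows; it is an illustration with synthetic stand-in data, not the procedure used in the work reported here. The covariance of the UpWritten training cluster supplies a local metric, and the partial object is completed by returning the nearest training square under that metric.

import numpy as np

def nearest_completion(query, training_points):
    T = np.asarray(training_points, dtype=float)
    cov = np.cov(T.T) + 1e-6 * np.eye(T.shape[1])     # local metric from the cluster
    inv = np.linalg.inv(cov)
    d2 = [(query - t) @ inv @ (query - t) for t in T]
    best = int(np.argmin(d2))
    return T[best], float(d2[best])

rng = np.random.default_rng(1)
squares = rng.normal(size=(200, 21))               # stand-ins for UpWritten squares
partial = squares[0] + 0.3 * rng.normal(size=21)   # an occluded or corrupted UpWrite
completion, dist2 = nearest_completion(partial, squares)
print(dist2)

Replacing the single nearest neighbour by the several closest points, each with its distance, yields the set of plausible completions together with a measure of their relative likelihood.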

9. NEURAL MODELS

The work described above has been motivated by a number of concerns, not least the plainly quasi-linguistic features of much natural information processing. By the latter term, I mean the kind of information processing which occurs in the animal nervous system.


To give a not entirely whimsical example, when I learn to drive a car and the instructor tells me to turn left, I have to recall the necessity for slowing down, which in turn requires me to take my right foot off the accelerator, move it laterally onto the brake and push my foot down. I also have to look in the driving mirror, turn the traffic indicator on in the appropriate direction, and change gear down, itself a constellation of movements. Not to mention checking for road signs, pedestrians and finally turning the steering wheel. Later, after I have passed my test and had some practice, the whole operation of turning left has become a subroutine or procedure upon which I can call as a high level chunk. Of course, the action of putting my foot down on the brake was itself learned and reduces ultimately to some set of muscle tissues contracting under the control of neurons firing. So a great deal of chunking of elements at one level into 'higher level' objects occurs. Another example occurs when one tries to write a program, where the top level entity 'program' decomposes into procedures which themselves decompose, reducing in the end to the primitive operations of the programming language.

Another source of motivation for this was a discontent with the classical neural net models. The well known papers of Hubel and Wiesel [7] and Blakemore [8] describe the self tuning of neurons in the visual cortex of the kitten, and it seemed desirable to model this process. It was not a matter of learning to discriminate inputs, it was a matter of acquiring sensitivity to them. I shall outline the thinking which led to the syntactic theory described above, because it seems to be important that neurons may be natural devices for implementing the applications. Even if we cannot directly use such objects at our present level of technology, it may give insights into the way in which brains work.

In the experiments of Blakemore [8], a kitten was found to have, on first opening its eyes, neurons in the primary visual cortex which were not capable of strong discrimination between edges in different orientations, while after being exposed to a visual environment comprising vertical edges for a period of the order of hours, neurons which respond strongly to vertical edges are found, while few neurons respond to orientations very different from the vertical. Neural response is gauged by the spike frequency measured via an electrode probing the area, which is sufficiently fine to ensure that it is measuring the response of a single unit. Similarly, if the early visual experience is limited to horizontal bands, neurons are found which respond only to horizontal edges. And if the kitten is transferred between a vertical band environment and a horizontal band environment, neurons are found to be of two populations, responding in the main to vertical or to horizontal edges.

In dynamical terms, we may characterise each neuron in its final state by a number giving the orientation, in degrees, to which it responds most strongly, observing that the response peaks fairly sharply at the orientation of maximum sensitivity. And of course the datum presented, an edge, may also be characterised by its orientation. If we open up the circle to the line between -45° and 135° and plot the presented edges as points, we may suppose that these are largely centred on the points 0° and 90°. Initially, the neurons, also represented by points, are not on the line at all, since they do not have any preferred orientation. After presentation of data centred on the point at 90°, the neurons are 'drawn in' to the data. The data is attracting the neurons in a certain sense, in the space of neural states.
Similarly if the data is centred at zero. If, however, half the data is at 0° and half at 90°, some of the attracted neurons wind up at one attractor, the rest at the other. They do not, as they might for some conceivable laws of attraction, finish up all at 45°.

We may describe neural self tuning then, in dynamical terms, as the attraction to data points, and it was found possible to devise a family of algorithms known as the 'dog-rabbit strategy', which used only known properties of neurons and which could accomplish cluster finding in a satisfactory way by being presented with data sequentially. The strategy, which has been described in [9], works rather better in some applications than more conventional clustering algorithms.

The modelling of a neuron by such an algorithm requires a little more to make it physically realistic: there is not just a sharp centre, there is a tuned response, so it would be more realistic to model the neural state not just by its centre, but by a narrow distribution such as a gaussian distribution, which is interpreted as the strength of response of the neuron. In one dimension, the width of the response or the degree of tuning selectivity may be specified by one parameter, but if one supposes that the same phenomena occur in generally higher dimensional spaces of inputs and neural states, then the question arises as to whether the simple first order description of a neuron together with a single specificity parameter is reasonable. If one accepts the feasibility of local interactions between simultaneous stimulation of synapses, then much higher orders may be implementable by single neurons, and if we suppose some degree of time integration, then a short exposure of some real event to the sensory apparatus may generally be described as an aggregation of input data being presented to a neuron simultaneously, and this aggregation to be representable as a cluster or clusters of points in R^n for some n.

If we suppose that the neural response takes into account second order interactions between inputs, as is physiologically plausible, then it is not unreasonable to model it as something like a multivariate gaussian distribution, the value of the pdf being proportional to the strength of the response of the neuron; alternatively we can compute the loglikelihood for any finite data set presented simultaneously, for any values of the parameters of the distribution. In general, one might reasonably expect higher than second order terms to be implemented by a single neuron, but second order suffices to outline the ideas. It follows that a collection of neurons sharing the same input might be modelled as a mixture of gaussians describing the data set to second order. One would not want to feel too committed to the precise form of the gaussian distribution of course, but for want of more detailed information concerning the response function in higher dimensions, such an approximation is at least suggestive.

We therefore extend the idea of a neuron simply adapting to a single datum, as is demonstrated to occur in the Blakemore experiments and others, and conjecture that a family of them may describe the distribution of input data. It has proved straightforward to implement a dynamical approximation to a maximum likelihood gaussian mixture model of a sort which is consistent with what is known of neural capabilities. The result of some subset of a data set being processed by such a family is then a set of responses from the members of the family. It is natural to regard this as input to a subsequent layer of neurons, as occurs of course in the central nervous system. A neuron at the second layer 'sees' a set of neurons at the first layer responding, and can, just as in the first layer, respond to correlations between the inputs. It is this process which is being modelled by the UpWrite in the syntactic model.
The model described differs in a great many ways from the standard models of artificial neural nets, and is very much more powerful in its capacity to classify structured data.


The fact that categories of data as diverse as images and strings of symbols can be treated in a uniform manner adds some degree of credibility to the neural model. The abstraction involved in going directly to the syntactic process takes us some distance from the properties of real neurons, but allows us to focus on the class of operations which they can be expected to perform. There are obvious merits in this; nevertheless, the fact that the theory was derived from considerations of known neural behaviour and reasonable extensions of it gives some grounds for investing further time and energy in both the syntactic inference process and the neural model from which it was derived.
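A toy version of the response model outlined above may make the conjecture more concrete; it is illustrative only, with arbitrary units, weights and data. Each 'neuron' is treated as a multivariate gaussian response function, its response to an input is the value of the pdf there, and the log-likelihood of a simultaneously presented data set is computed as for a gaussian mixture model.

import numpy as np

def gaussian_pdf(x, mean, cov):
    d = len(mean)
    diff = x - mean
    maha2 = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha2) / norm

def responses(point, units):
    # Response of every unit (neuron) to a single input point.
    return np.array([w * gaussian_pdf(point, m, C) for w, m, C in units])

def log_likelihood(points, units):
    return float(sum(np.log(responses(p, units).sum()) for p in points))

# Two units tuned to different regions of a two dimensional input space.
units = [(0.5, np.array([0.0, 0.0]), np.eye(2)),
         (0.5, np.array([4.0, 4.0]), 0.5 * np.eye(2))]
data = np.array([[0.1, -0.2], [3.8, 4.1], [4.2, 3.9]])
print(responses(data[1], units))
print(log_likelihood(data, units))

The vector of responses is exactly what a second layer 'sees', and adjusting the means and covariances so as to increase the log-likelihood is what the dynamical approximation to the maximum likelihood mixture model referred to above must accomplish.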

10. SUMMARY AND CONCLUSIONS

This paper has presented a very general process for extending the concepts of grammatical inference to the case of continuous grammars for point sets in R^n. It consists of an iteration of a basic process which computes low order moments for a set of points within some resolution radius, using the resulting approximation to the characteristic function of the point set as a density estimator or predictor of where other points are likely to be found, and an entropic chunking to determine a higher order entity. This entity is then further represented, via some moment method, as a point in another space. Iteration of this process reduces the number of points until only one is found to specify the original object. This, in effect, automates the process of finding a suite of measurements to specify the object, by coding substructures and their relationships.

The method has been applied to a range of practical problems, some of which are described elsewhere in these proceedings. The genesis of the process from known properties of assemblies of neurons gives some hope that this is an abstraction of the functioning of the central nervous system.

REFERENCES

1. Fu, K.S., Syntactic Pattern Recognition and Applications, Prentice-Hall, New Jersey, 1982.
2. Jelinek, F., Self-Organized Language Modelling for Speech Recognition, IBM Research Report, Continuous Speech Recognition Group, IBM Thomas J. Watson Research Center, Yorktown Heights, N.Y. 10598.
3. Everitt and Hand, Finite Mixture Distributions, Chapman and Hall, 1991.
4. Titterington, Smith and Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley and Sons, 1985.
5. A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J. Roy. Stat. Soc., B, 39(1), (1977), pp 1-38.
6. DeSilva, C.J.S., Bayesian Detection of Chirp, Private Communication. Submitted to IEEE Trans. Sig. Proc.
7. Wiesel, T.N. and Hubel, D.H., Comparison of the effects of unilateral and bilateral eye closure on cortical unit responses in kittens, J. Neurophysiol. 28: pp 1029-1040, 1965.
8. Blakemore, C., Developmental Factors in the Formation of Feature Extracting Neurons, pp 105-114 in Feature Extraction by Neurons and Behaviour, Ed. G. Werner, MIT Press, 1975.
9. McKenzie, P. and Alder, M.D., Initializing the EM Algorithm for use in Gaussian Mixture Modelling, These Proceedings, Pattern Recognition in Practice IV, 1994.



Recognising cubes in images

Robert A. McLaughlin* and Michael D. Alder**

Centre for Intelligent Information Processing Systems, Department of Electrical Engineering, The University of Western Australia, Nedlands W.A., 6009, AUSTRALIA

This paper adopts a syntactic approach to a pattern recognition task. This entails decomposing an object into a set of simpler component objects. For example, a cube may be viewed as comprising a set of parallelograms. Each component object may in turn be decomposed into yet simpler sub-objects, and this repeated recursively until some set of primitives, in this case points in R^2 representing pixels, is reached. Recognition involves assembling a set of sub-objects into a single higher level entity. This paper outlines a method by which such sets may be automatically identified, provided that the sub-objects can be mapped to points in a space. Although this method is highly general, it is explored in the restricted context of recognising wire frame images of cubes. A computer program implementing the method outlined here has been found to successfully extract cubes from simple wire frame images.

1. INTRODUCTION

This paper describes research into a general method of pattern recognition. The aim has been to produce a system which makes very few assumptions about the objects being identified. To explore this, we have chosen the problem of identifying cubes in hand-drawn sketches, such as that of fig. 1. Whilst we have developed the recognition system in the context of this specific problem, an ever present consideration has been its use in a more general range of applications.

Consider the problem of recognising the cube and pyramid of fig. 1. Each consists of several obvious parts, such as squares and triangles. This has prompted us to adopt a syntactic approach to the recognition. In such an approach, an object is viewed as consisting of several sub-objects, these being combined according to some rules or a syntax. The syntax inherent in a cube, for example, specifies the ways in which the component squares and parallelograms must be combined. The syntactic approach to recognition was first explored by King Sun Fu [1].

In order to retain this method's wide applicability, it has been necessary to find a very general representation for the objects. We have chosen to represent each object as a point in space. Consider the following progression. We are presented with an image, such as that of fig. 1, and model the black pixels as a point set in R^2.

*[email protected]
**[email protected]


Figure 1. A typical image for recognition.

We then identify subsets such that the points in each subset form a line segment. Each line segment may be described by a list of numbers describing position, length and orientation, and this list of numbers may be interpreted as a single point in some space. The image then can be described as a set of such points. Note that we have extracted a more succinct and higher level description of the image. Both descriptions are in terms of points in space, the second being obtained by extracting those subsets of the first which form a higher level object (a line segment) and discarding those subsets which do not. We have discarded much information, such as the exact position of each pixel. However, for the purposes of recognition such information is unimportant.

Beginning now with the point set representing the line segments, we shall identify those subsets which form higher level objects, such as squares and triangles. Each of these entities will be represented as a single point in some space and the image will be described as a set of such points. This new point set again constitutes a higher level and more succinct description of the image. Information such as the exact position of each line segment has been discarded as irrelevant.

The process of mapping a set of objects to a single, higher level entity has been termed an UpWrite. In the current example, we have UpWritten a set of pixels to a line segment and then have UpWritten a set of line segments to a square. The next stage will be to UpWrite a set of squares, parallelograms and triangles to a set of cubes and pyramids.

To do this, we begin with the point set representing squares, parallelograms and triangles. As before, we identify each subset of the points which forms a higher level object such as a cube, and map this subset to a single point in some other space. At the completion of this process the cubes and pyramids will have been extracted.

Having outlined the basic method, what remains is to elucidate the method of mapping a set of elements, represented as points in space, to a single point in a different space. The explanation of this will constitute the bulk of this paper.

Note the importance of abstracting the UpWrite of several objects to a single entity as an operation mapping a set of points to a single point in a different space.


By doing so, and not building in assumptions about the real-world interpretation of these points, we have been able to arrive at a process which is useful in an extremely wide range of instances. To demonstrate this, we have applied conceptually identical methods to the cube recognition problem detailed in this paper, to differentiating between silhouettes of a jumbo jet and a fighter aircraft [2], and to distinguishing the shapes of left hands from right hands [3].

2. FINDING LINE SEGMENTS

Consider a black and white image, such as that shown in fig. 1, and model the black pixels as a point set in R^2. All subsequent operations will be performed on this point set. The task of extracting the line segments from the image shall be tackled in two stages. First, small clusters of points in R^2 shall be identified and grouped together. Then sets of these clusters which form a line segment will be grouped together.

The following algorithm is used to segment the image into clusters of points.

• Choose a point in R^2.

• Define a small neighbourhood around this point and tag all points in this neighbourhood as forming a subset of the image.

• Calculate the mean and covariance matrix of this subset.

• The mean [μx, μy]^T and covariance matrix C define a positive definite quadratic form, and the set E

E = { [x, y]^T : [x - μx, y - μy] C^{-1} [x - μx, y - μy]^T = 1 }

forms an ellipse. This ellipse may be defined by its mean, the size of its major and minor axes, and its orientation. Replace E with an ellipse E' which has identical mean and orientation but is of a standard size.

• Remove all points from the image which lie within the ellipse E' and record E' in a list of ellipses.

Repeat this process until no points remain. The effect of this algorithm is to model small subsets of the image with ellipses of a standard size. Results for a typical image are shown in fig. 2. Each ellipse indicates in which direction a small subset of the points is spread. A line segment will consist of a set of ellipses positioned head to tail in the same direction. Line segments are extracted as follows:

• Choose an ellipse.

• Search for an ellipse which lies at either end of this one and is orientated in the same direction.

• Repeat this process, looking for the next ellipse in the sequence, until none can be found. Then repeat the process, but working in the other direction.


Figure 2. Image decomposed into ellipses.

• Group the resulting ellipses into a subset and remove them from the set describing the image.

This process is repeated until no ellipses remain. Each subset of ellipses will cover a line segment. By examining the elements of the original point set in R^2 which constitute these ellipses, we can extract a description of the line segment, such as its centre (x, y), length l and angle θ. A sufficiently accurate approximation to this description can be obtained by examining the centres of the ellipses. The mean of these centres provides an estimate of the line segment's centre. The distance between the two outermost ellipses provides an estimate of the line segment's length. The orientation of the line of best fit through the ellipse centres provides an estimate of the angle of the line segment. This information can then be recorded as a point in R^4:

[x, y, l, θ]^T
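The decomposition into ellipses can be sketched as follows. This is an illustration of the algorithm described above, not the program referred to in this paper; the neighbourhood radius and the standard axis lengths are arbitrary, and for robustness this toy version also removes the gathered neighbourhood along with the standard ellipse at each step.

import numpy as np

def decompose_into_ellipses(pixels, radius=4.0, std_axes=(4.0, 1.0)):
    remaining = np.asarray(pixels, dtype=float)
    ellipses = []
    while len(remaining) > 0:
        seed = remaining[0]
        near = np.linalg.norm(remaining - seed, axis=1) <= radius
        subset = remaining[near]
        mean = subset.mean(axis=0)
        cov = np.cov(subset.T, bias=True) + 1e-6 * np.eye(2)
        vals, vecs = np.linalg.eigh(cov)
        angle = float(np.arctan2(vecs[1, -1], vecs[0, -1]))   # major axis direction
        ellipses.append((mean, angle, std_axes))              # standard-size ellipse E'
        # Remove the pixels lying inside E' (and the gathered neighbourhood).
        c, s = np.cos(angle), np.sin(angle)
        local = (remaining - mean) @ np.array([[c, -s], [s, c]])
        inside = (local[:, 0] / std_axes[0]) ** 2 + (local[:, 1] / std_axes[1]) ** 2 <= 1.0
        remaining = remaining[~(inside | near)]
    return ellipses

# Toy usage: pixels along a diagonal stroke decompose into a chain of ellipses.
stroke = [(x, float(x)) for x in range(30)]
for centre, theta, axes in decompose_into_ellipses(stroke):
    print(np.round(centre, 1), round(float(np.degrees(theta)), 1))

Chains of such ellipses, grouped head to tail as described above, then supply the centre, length and angle estimates, and hence the (x, y, l, θ) record for each line segment.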

There is, however, an inadequacy in this representation. It is desirable that there be no ambiguity as to which point in space a line segment is mapped to. However, under the present scheme, a horizontal line segment could validly be described as having an orientation of both 0 radians and π radians. A second inadequacy in mapping line segments to points in this way concerns the topology of the resulting points. It is desirable that similar line segments be mapped to similar points in space. However, a line segment with an orientation slightly less than π radians maps to a point quite distant from a line segment with an orientation that is a little more than 0 radians, yet both line segments are practically horizontal. We have therefore chosen to map a line segment with centre $(x, y)$, length $l$ and orientation $\theta$ to a point in $\mathbb{R}^5$ thus:


$$\begin{bmatrix} x \\ y \\ l \\ \sin(2\theta) \\ \cos(2\theta) \end{bmatrix}$$
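A minimal sketch of this mapping (illustrative only; the function name is ours):

    import math

    def upwrite_line_segment(x, y, l, theta):
        """Map a line segment with centre (x, y), length l and orientation
        theta (in [0, pi)) to a point in R^5.  Doubling the angle removes
        the 0/pi ambiguity and keeps nearly horizontal segments close."""
        return (x, y, l, math.sin(2.0 * theta), math.cos(2.0 * theta))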

3. FINDING SQUARES

The image is currently described as a set of points in $\mathbb{R}^5$, each point corresponding to a line segment. Our task is to identify entities constructed from line segments, such as squares. The basic procedure will be to take a subset of the line segments and decide whether they form the appropriate shape. We then repeat this process with other subsets of the line segments. Given a subset of the line segments, we require a method to describe this set as a single entity in order to decide whether it does form a shape such as a square. That is, we need a description of a set of points in $\mathbb{R}^5$. There are numerous ways to describe a set of points in space, a common one being to take moments of the set. We shall restrict ourselves to taking zeroth, first and second order moments of the set. Whilst these do not uniquely define the set, we have found that they contain sufficient information for our purposes. The zeroth order moment is the number of points in the set, and is specified by a single number. The first order moments are the mean of the set, and in $\mathbb{R}^5$ these give five numbers. The second order central moments give a measure of how the points are spread in space. This information is contained in a matrix commonly referred to as the covariance matrix. For a point set in $\mathbb{R}^5$, this is a $5\times5$ matrix. As the matrix is symmetric by construction, it contains fifteen distinct entries. The zeroth, first and second order moments will produce a list of (1 + 5 + 15 =) 21 numbers describing the point set. This can in turn be interpreted as a single point in $\mathbb{R}^{21}$. Thus an entity constructed from a set of line segments, such as a square or a triangle, is represented in the computer as a single point in $\mathbb{R}^{21}$. This process of taking a set of elements and converting them to a single entity was introduced earlier and termed an UpWrite. In the present case, we have UpWritten a set of line segments to a square. Later in this paper we shall UpWrite a set of squares and parallelograms to a cube. Generally, an UpWrite is a process which maps a set of points in some space to a single point in some other, usually higher dimensional space. Mapping a set of points in $\mathbb{R}^5$ to a single point in $\mathbb{R}^{21}$ is an example of this. A fuller explanation of the UpWrite process is to be found in [4]. As only certain sets of line segments form squares, it seems reasonable to expect that squares will map to only a portion of $\mathbb{R}^{21}$. As any square may be deformed into any other by a translation, scaling and rotation, and as the shape always remains a square during this process, we expect squares to map to a single, connected region of $\mathbb{R}^{21}$. Training the computer to recognise squares involves showing it a large number of squares and having it map each to the corresponding point in $\mathbb{R}^{21}$. These points constitute a sample of the region and, by modelling them, we model the region. We have chosen to model the points with a gaussian probability density function, defined by the formula below:

$$f(x) = \frac{1}{(2\pi)^{n/2}\,|C|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^T C^{-1} (x - \mu) \right)$$


where $n$ is the dimension of the space (= 21), $\mu$ is the twenty-one dimensional mean of the training set, and $C$ is the $21\times21$ covariance matrix of the training set. A justification for this choice is given in Section 5. Fitting a single gaussian pdf to the training set is an application of elementary statistics and merely requires that we calculate the mean and covariance matrix of the training set. This gaussian assigns a likelihood to every point in $\mathbb{R}^{21}$, generally a high likelihood where the training points are concentrated and a low likelihood elsewhere. By specifying some threshold value, we can define the rule that all points in $\mathbb{R}^{21}$ corresponding to a likelihood above this threshold represent squares. Such a rule can be used to recognise squares not in the training set. Note that at no time in this process have we made any assumptions about the shapes being recognised. Whilst we have specifically used the example of a square, this was only to clarify the explanation and we could equally well have chosen any other arbitrary shape formed from line segments. In short, we have not built this system around a specific shape. In fact, we have not assumed that the objects must be built up of line segments. As far as the computer was aware, the entities to be identified consisted of a point set in $\mathbb{R}^5$, and the real world interpretation of these points is immaterial to the recognition process. Thus the recognition system could be used unchanged with any objects which have a mapping to points in $\mathbb{R}^5$. As a further step of generality, the restriction to point sets in $\mathbb{R}^5$ is unnecessary. Point sets of any dimension may be described by low order moments, thus the method that we have used to map a set of points to a single point in some other space would remain conceptually unchanged if the dimension were altered. For example, we may be presented with an object whose parts can be represented as points in some space $\mathbb{R}^m$. By taking the zeroth, first and second order moments of a set of such points, we shall UpWrite the entity to a single point in $\mathbb{R}^{1+m+m(m+1)/2}$. The objects to be identified will map to one or more regions of this space, and each region can be modelled by a gaussian pdf. If the zeroth, first and second order moments do not contain sufficient information about the parts of the object, it may be necessary to take higher order moments of the set. The generality outlined here is further explored in [4].

4. FINDING CUBES

The previous section described how to find the squares in a set of line segments. Identifying the cubes formed in a set of squares and parallelograms is achieved using exactly the same method. From a set of training images, entities such as the squares and parallelograms of fig. 3 are extracted using the method outlined in the previous section. This will give a description of an image such as that of fig. 4 as a set of three points in $\mathbb{R}^{21}$, one point corresponding to the square and the other two to the parallelograms. Taking low order moments of this set allows us to UpWrite to a single point in $\mathbb{R}^{253}$ (the $1 + 21 + 231$ moments of a point set in $\mathbb{R}^{21}$). The training set of cubes was found to UpWrite to a single, connected region of $\mathbb{R}^{253}$, and this was modelled with a gaussian pdf. To extract the cube from the image of fig. 4, the line segments were first identified and mapped to points in $\mathbb{R}^5$. Those subsets of points in $\mathbb{R}^5$ which formed higher level


Figure 3. Components forming a cube.

Figure 4. A cube to be recognised.

objects such as a square were then identified and recorded as points in $\mathbb{R}^{21}$. Subsets of the resulting point set in $\mathbb{R}^{21}$ were then examined, and those which mapped to a point in $\mathbb{R}^{253}$ corresponding to a high likelihood were tagged as representing cubes. Note that the task of recognising cubes from a set of squares and parallelograms was achieved using the same UpWrite process utilised to extract squares from a set of line segments. There is no conceptual reason to bar us from forming objects from sets of cubes and recognising these by a re-application of the UpWrite process. The primary reason that such a system has not yet been developed is the prohibitive nature of the resulting computations. However, the highly parallel nature of many of the necessary computations suggests this not to be an insurmountable problem given appropriate hardware.
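The UpWrite used in the last two sections, the zeroth, first and second order moments of a point set, might be sketched as follows (an illustration under our own naming; for a point set in $\mathbb{R}^5$ it yields the 21 numbers described above):

    import numpy as np

    def moment_upwrite(points):
        """UpWrite a set of points (an (N, n) array) to a single vector:
        [count, mean (n values), upper triangle of the covariance matrix
        (n(n+1)/2 values)].  For n = 5 this gives a point in R^21."""
        points = np.asarray(points, dtype=float)
        count = points.shape[0]
        mean = points.mean(axis=0)
        cov = np.cov(points.T, bias=True)   # second order central moments
        iu = np.triu_indices(points.shape[1])
        return np.concatenate(([count], mean, cov[iu]))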

5. NOISE

The process outlined in this paper imposes a hierarchy upon a recognition problem. In the present instance, this hierarchy has pixels as its lowest level elements, then line segments, followed by squares and finally cubes at the apex. Such a construct enables great tolerance to noise, as different forms of noise may be removed at different levels of the hierarchy. Consider if the image of fig. 1 were speckled with black pixels, constituting white noise at the pixel level. As these pixels have no higher order structure, they would not UpWrite to line segments. Thus our description of the image in terms of line segments will have removed this noise. Next consider if each element in the image were randomly perturbed a little to the left, right, up or down. This would not present serious problems to the method used to extract line segments, as each pixel would still lie in approximately the same neighbourhood. Similarly, such perturbations would not notably alter the line segments extracted, as the changes would tend to cancel out over an entire line segment. Each time that we perform an UpWrite, we discard information. Much of the information removed describes either subsets of the image which lack higher order structure or details of the exact position of each pixel. By this process, lower level noise is automatically removed.


Next consider the analogous case of noise at the line segment level. Fig. 1 contains several examples of line segments which do not form part of a higher level structure. In UpWriting, such noise will be removed. That is, those line segments will not form part of a higher level description of the image. Alternatively, one could perturb each line segment a little, by altering its position, length and orientation. Doing so would be reflected in only small changes to the corresponding point set in $\mathbb{R}^5$, and these changes would not significantly alter the low order moments of particular subsets. Thus the line segments forming a square would, if altered slightly, UpWrite to essentially the same square. Whilst such noise is trivially dealt with at the line segment level, it would pose serious problems if tackled at the pixel level. Similar arguments apply in UpWriting from squares to cubes.

6. INVARIANCE AND MANIFOLDS

Many objects are invariant under some operations. For example, a square remains a square when it is translated. Many recognition systems build in certain invariances; in particular many build in translation, scaling and rotation invariance. We have consciously avoided building in these invariances in order to produce a more powerful system. To appreciate this, consider the more difficult forms of invariance. Specifically, consider a picture of a house. As one walks around the house, the picture will alter radically. Windows and doors which were hidden will become visible, and entire walls will disappear from sight. Each view of the house represents the same object, but a view from one side may bear little resemblance to a view from the other. Yet houses are invariant under rotation in three dimensions. That is, it does not matter which angle we view the house from, it is still the same house. Recognition systems currently available which build in the trivial invariances of translation, scaling and two dimensional rotation are of little use in handling these more complicated forms of invariance. More importantly, it is difficult to see how such systems could be extended to even begin tackling these problems. In order to solve these problems, we have attempted to develop a single method capable of handling all forms of invariance. This has been explored by finding a system which solves the problems of translation, scale and rotation invariance in a uniform manner. This method will be elucidated in the context of these invariances, and then its use with more complicated invariances will be commented on. Let us return to the problem of recognising a set of line segments as forming a square. We have explained earlier in this paper that such a set is represented as a point set in $\mathbb{R}^5$ and that, by taking the zeroth, first and second order moments of such a set, we can UpWrite to a single point in $\mathbb{R}^{21}$. Imagine now that the square were translated in the image. That is, each point in $\mathbb{R}^5$ were moved by an equal amount. The zeroth order moment of the set remains unaltered, as the number of points in the set has not changed. Similarly, the second order central moments remain unchanged, as the point set is neither more nor less spread in space. Only the first order moments of the set (i.e. the mean) have been altered, and even then it is only those co-ordinates which describe the x, y position of the line segments. By repeatedly translating the image of the square in arbitrary directions, the corresponding point in $\mathbb{R}^{21}$ will trace out a plane. Alternatively we could say that the points in $\mathbb{R}^{21}$ generated by the translated square lie in a two dimensional


Figure 5. Points generated by a translated square.

affine subspace of $\mathbb{R}^{21}$. A sample of points in this subspace is shown in fig. 5, which shows a view of $\mathbb{R}^{21}$ projected onto the two dimensional plane of this page. Given such a sample, it is easy to define the appropriate subspace. To define the space we need to specify the origin and a set of basis vectors. The mean of the sample points provides a convenient origin. The eigenvectors of the covariance matrix of these points provide an adequate set of orthogonal basis vectors. To see why, recall that the covariance matrix of a set of points describes how the points are spread in space. The eigenvectors of this matrix are a set of orthogonal vectors which are orientated in the directions of spread. They are a set of vectors which span the points, and hence can form a set of basis vectors for a space which contains the points. Training for translational invariance requires that we show the computer instances of a particular square translated. The computer then calculates the subspace containing all of the training points. During recognition, any image which maps to a point in this space is labelled as being a square. Note that whilst all instances of a translated square should map to a two dimensional affine subspace, because of noise they will not. In practice, all points will lie near a two dimensional affine subspace, but generally not on it. Thus an image is only required to map to a point in $\mathbb{R}^{21}$ that is near the subspace in order to be deemed a square. Note also that we have not built translational invariance into the recognition system. The system learns the invariance by example. Next consider what happens if the square were scaled but remained otherwise unchanged. Again the zeroth order moment of the corresponding point set in $\mathbb{R}^5$ would remain unaltered. As the average length of line segment changes, the first order moments of the point set will vary. Similarly, as the square increases or decreases, the points in $\mathbb{R}^5$ will become more or less spread in space and this will be reflected in the second order central moments. By scaling a particular square, we trace out a curve in $\mathbb{R}^{21}$, such as that shown in fig. 6. Because of the non-linear nature of the second order central moments, this curve will not be a line, but will be quadratic. As it may be parameterised by a single number (i.e. the size of the square), we could also view it as a one dimensional manifold. When teaching the computer scale invariance for a particular square, we map scaled versions of the square to points in $\mathbb{R}^{21}$ and hence obtain example points which lie on (or near, because of noise) this one dimensional manifold. Unfortunately, finding a parametric representation of this manifold given such a set of points is far from trivial. One possibility is to approximate the non-linear manifold with a linear or affine subspace. This is often


Figure 6. Curve generated by a scaled square.

Figure 7. Curve generated by a rotated square.

acceptable if one is only interested in a small section of the manifold. However, it is subject to the usual dangers associated with approximating a non-linear entity with a linear one. For our purposes, as the squares were only scaled over a certain range, we found it sufficient to use such an approximation. As in the case of translational invariance, the appropriate subspace was found by calculating the mean of a set of training points, and finding the eigenvectors of the covariance matrix of these points. The mean specified the origin of the space and the eigenvectors formed a set of orthogonal basis vectors. Had we wished to scale the images over a wider range, it may have proved necessary to model the manifold with several affine subspaces. This is rather similar to modelling a non-linear function with a piece-wise linear approximation. Finally, let us consider the operation of rotation. As with scaling, this will trace out a curve in $\mathbb{R}^{21}$. A view of the curve generated by rotating a square is shown in fig. 7. We know that rotating any object by 2π radians will leave the object unaltered, thus the curve will form a closed loop. To rephrase this, the curve will be the image of the circle $S^1$ under a continuous map. In the particular example of a noise-free square, a rotation of π/2 radians will leave the object unaltered. That is, if we rotate a square by a quarter turn, we return to the original square. Thus we would expect a rotation of 2π radians to trace around the curve four times. This is why the curve of fig. 7 resembles several circles bunched together. Had the image been noise free, these circles would have been indistinguishable.
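The subspace fitting recipe used above for the translation and scale invariances (the mean of the training points as origin, the eigenvectors of their covariance as a basis, later scaled by the square roots of the eigenvalues) might be sketched as follows; this is our own illustration, and the function names and the small eigenvalue floor are assumptions:

    import numpy as np

    def fit_affine_subspace(training_points, dim):
        """Fit an affine subspace to UpWritten training points (an (N, 21)
        array, say): the mean is the origin, and the top `dim` eigenvectors
        of the covariance matrix form the basis, each scaled by the square
        root of its eigenvalue so that spread is evened in all directions."""
        X = np.asarray(training_points, dtype=float)
        origin = X.mean(axis=0)
        w, v = np.linalg.eigh(np.cov(X.T))
        order = np.argsort(w)[::-1][:dim]
        basis = v[:, order]
        scales = np.sqrt(np.maximum(w[order], 1e-12))
        return origin, basis, scales

    def scaled_distance(x, origin, basis, scales):
        """Distance of x from the origin measured along the scaled basis
        vectors; thresholding this defines the hyperellipsoid neighbourhood."""
        coords = basis.T @ (np.asarray(x, dtype=float) - origin)
        return float(np.linalg.norm(coords / scales))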

As a side issue, note that the four identical circles generated by rotating a square highlight certain symmetries of the object. Indeed, by examining the curve or manifold generated by performing an operation (such as rotation) on an object, one may find the symmetries by noting where the curve or manifold traces back upon itself. As this could be detected automatically, it provides us with a mechanism whereby the computer could automatically identify symmetries in an object under any arbitrary operation. As in the case of scaling, all rotated instances of a particular object ideally lie on a one dimensional manifold. This manifold is markedly non-linear and any model of it which consists of a single linear or affine subspace is bound to be poor, but may suffice for certain recognition tasks. Training this system required that we show it a large number of squares at arbitrary positions and scales, and over a small range of rotations. As explained previously, allowing a square to be translated will generate points in $\mathbb{R}^{21}$ which lie on a two dimensional affine subspace, this being topologically equivalent to $\mathbb{R}^2$. By allowing the square to be scaled, we traced out a one dimensional manifold which we have approximated with a piece of a one dimensional affine subspace homeomorphic to the closed interval $[a, b]$. By allowing both of these operations, we generate points in $\mathbb{R}^{21}$ which lie on a manifold that is topologically equivalent to:

$$\underbrace{\mathbb{R}^2}_{\text{translation}} \times \underbrace{[0.5,\,1.5]}_{\text{scaling}}$$

Thus we have chosen to model the region of $\mathbb{R}^{21}$ which all translated and scaled squares map to with a three dimensional subspace. Let us refer to this subspace as V. As scaling is only being tolerated over a specified range, it was important that this be automatically accounted for by the recognition system. Thus only images which map to a subset of V are labelled as being squares. This subset is roughly the region of V which corresponds to the training images. One method of implementing this would be to specify the mean of the training set as the origin of the subspace V. The appropriate region could then be defined as all points in V which are within some threshold distance of the origin, where this distance is large enough to encompass the training set. Because of noise, squares rarely map into V, but lie near it. Note also that when we define the region corresponding to all squares as being all points in V which lie within some threshold distance of the origin of V, we are not using the Euclidean metric as defined on $\mathbb{R}^{21}$. The basis vectors of V will be the eigenvectors of the covariance matrix of the training set. Whilst these are orthogonal by construction, they need not be orthonormal. It makes more sense to scale each eigenvector by the square root of its corresponding eigenvalue. Then, if the training points are more spread in one dimension than in another, the scaling of the basis vectors of V effectively squeezes or stretches the space V (as appropriate) so as to even the spread in all dimensions. Alternatively, one could say that by setting a threshold distance from the origin, one is defining a neighbourhood about the origin. This neighbourhood takes the form of a hypersphere and all points within this hypersphere are considered to form squares. By scaling the basis vectors of V, we are distorting this neighbourhood into a hyperellipsoid so as to better model the region defined by the training set. This hyperellipsoid neighbourhood may also be implemented by considering the gaussian pdf defined by the mean and covariance


Figure 8. Face with component missing.

matrix of the training set. Some threshold value is set and any point corresponding to a likelihood above this threshold is judged to lie within the neighbourhood. Note that we are not suggesting that the training points are distributed according to a gaussian density function. We are instead using the gaussian as a means to model a section of a noisy three dimensional affine subspace. In Section 3 of this paper, we first stated that we would use a gaussian to model this region. The above explanation is the justification for this choice. Finally, learning more complicated forms of invariance can be done by minor modifications of the general principles discussed above. A cube image such as that of fig. 4 is described via the UpWrite process as a point in $\mathbb{R}^{253}$. If we rotate an object in three dimensions, the space of such distinct objects is the Lie group (manifold) SO(3) factored by the finite symmetry group of the object, G. The two dimensional image is a projection, and the space of such projections is a projection of SO(3)/G. This will not, in general, be a manifold, although it may be taken to be one almost everywhere. Such spaces can be described approximately by fitting quadratic forms locally.

7. OCCLUSION

Consider the image of fig. 8. For a human it is quite trivial to recognise this as an incomplete sketch of a face and to identify the missing element. The question arises as to how a computer could be programmed to arrive at this same conclusion. In order to avoid banality, we would require that it not be restricted to working only with faces. Furthermore, it shall be trained only on complete examples of the object, so that it does not merely create a database of all possible occlusions. Such a database would be subject to a serious combinatorial explosion and quickly become impractical. The method outlined in this paper lends itself to a solution for this problem. To explore this, we shall simplify the problem to that of completing a partial square. Finding the missing line segment of fig. 9 reduces to essentially the same task as identifying the missing feature of fig. 8. In both images, we are presented with an object consisting of parts. Given an incomplete set of these, we must identify the missing elements. We begin by taking the point set in $\mathbb{R}^5$ corresponding to the line segments of fig. 9, and UpWrite this to a single point $x$ in $\mathbb{R}^{21}$. As this image does not resemble a square, $x$ will not lie on the manifold defining squares. Our task is to decide if there is an element which, if added to the set of line segments, would move $x$ onto the manifold. Let us call the point which corresponds to the completed version of fig. 9 $p$. We will define


Figure 9. Square with component missing.

the metric on $\mathbb{R}^{21}$ such that the closest point on the manifold to $x$ is $p$. The problem is now reduced to the following: given a point $x$ in $\mathbb{R}^{21}$, find the closest point $p$ which lies on the manifold. This problem does not lend itself to an analytic solution, as we are not using the Euclidean metric on the space. We shall thus find a solution through a series of iteratively improved estimates based on the Euclidean metric. Given $x$, we shall obtain a first estimate of $p$ by simply projecting $x$ onto the affine subspace which is being used to model the manifold of squares. Call this projection $p_1$. Both $x$ and $p_1$, being points in $\mathbb{R}^{21}$, represent the results of a point set in $\mathbb{R}^5$ being substituted into 21 equations. The first of these corresponds to the zeroth order moment of the point set which, as stated before, is the number of elements in the set. $x$ will correspond to a value of 3 and $p$ to a value of 4. Thus we know that we need to add a single point to the set, this giving 5 unknowns. Solving the remaining 20 equations for the 5 unknowns gives an overspecified system of equations. Five of these, corresponding to the first order moments, are linear and the remaining 15 are not. We have chosen to solve these using the Newton-Raphson method [5] from an initial estimate supplied by the five linear equations. Solving these equations yields the co-ordinates of the point in $\mathbb{R}^5$ which, when added to the point set, UpWrites to the point $x_1$. $x_1$ is as close (according to the Euclidean metric) to $p_1$ as we can get by adding a single element to the point set. The overspecified system of equations will generally not have a solution. This can be interpreted as saying that there is no single line segment that, when added to the image, will cause it to UpWrite to $p_1$. Thus $x_1$ will not lie on the manifold of squares. However, we shall assume that it lies closer to the manifold than $x$. Reasoning that $x_1$ provides a better estimate of $p$ than $x$ does, we repeat the above process, but now find the projection of $x_1$ onto the manifold. Call this $p_2$. Solving the resulting set of 20 simultaneous equations, we obtain a second estimate of the co-ordinates of the point which should have been added to the set. Adding this to the point set generated by the original image, we UpWrite to a new estimate $x_2$ of the completed image. We then repeat the process, reasoning that $x_2$ provides a better estimate of $p$ than $x_1$ did. This sequence is represented graphically in fig. 10. The process is iterated $n$ times, halting when our estimate $x_n$ lies on or very near the manifold. If no such $x_n$ exists, then the image is judged not to represent an incomplete square. The occluded square of fig. 9 converges to the image of fig. 11. This process is generalisable to point sets of any dimension, and thus capable of recognising the incomplete face of fig. 8, as well as providing an estimate of the entity which


Figure 10. Graphical representation of convergence to occluded side.

Figure 11. Completed square.


should be added to the image.

8. CONCLUSION

This paper has explored a very general method of recognition. The method was tackled in the context of a specific problem, that of recognising hand-drawn images of cubes. A hierarchical approach was adopted, where sets of pixels were composed to form line segments, which in turn formed squares and finally cubes. Each of these entities was abstracted as a point in some space. It was recognised that each of these structures has an inherent syntax which may be used to identify when a set of objects forms a higher level entity. This syntax was automatically extracted from a set of training images. That is, the computer learnt which sets of line segments formed a square and which sets of squares and parallelograms formed a cube. The method described relies only upon the properties of the abstract points in space and does not make assumptions as to the physical interpretation of these points. This has prompted us to suggest that this method could be of use in any instance where the object for recognition may be viewed as being composed of a set of simpler entities, and these in turn abstracted as points in space. The method presented also lent itself to the learning of complex forms of invariance and to the recognition of occluded objects.

REFERENCES
1. Fu, K. S., Syntactic Pattern Recognition and Applications, Prentice-Hall, New Jersey, 1982.
2. McLaughlin, R. A., M. D. Alder, Recognising Aircraft: Automatic Extraction of Structure by Layers of Quadratic Neural Nets, Proceedings IEEE International Conference on Neural Networks, Orlando, Florida, June 28 - July 2, 1994.
3. McLaughlin, R. A., M. D. Alder, Inference of Structure: Hands, submitted for publication to Pattern Recognition Letters, Elsevier Science Publishers B.V.
4. Alder, M. D., Inference of Syntax for Point Sets, Pattern Recognition in Practice IV, Vlieland, The Netherlands, June 1 - June 3, 1994, Elsevier Science Publishers B.V.
5. Press, W. H., S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992.


Syntactic pattern classification of moving objects in a domestic environment

Gek Lim, Michael D. Alder and Christopher J. S. deSilva

Centre for Intelligent Information Processing Systems, The University of Western Australia, Nedlands W.A. 6009, AUSTRALIA

In this paper, we present a syntactic approach to classifying objects moving in a domestic environment, such as human beings and curtains blown by the wind, and external events such as moving tree branches. We use quadratic forms as our simple pattern primitives, and the description of the objects (or patterns) is based on the relationships between the forms. We call the description the UpWrite. The UpWrite process can be applied any number of times, to extract high level information such as strokes, and express it in a compact way suitable for classification.

1. INTRODUCTION

The syntactic approach to pattern recognition provides the ability to describe patterns or objects, in a two dimensional image, which contain structural information. The idea here is that an image is composed of a set of complex objects, where each object can be described in terms of simpler subobjects or parts, and each part can be described in terms of even simpler subparts, etc., where the simplest parts selected are called the primitives. For example, an image may be decomposed into objects, the objects into segments, the segments into strokes and the strokes into pixels. The recognition process suggested by Fu [1,2] is to use a grammar to describe the decomposition of the objects in terms of symbol strings. Instead of strings, other objects such as graphs or arrays may be used. A grammar is some algorithm which allows you to decide if a string is a legitimate member of the class comprising the language, and a language in this sense is any set of strings, such as the strings obtained by decomposing different examples of a particular class of objects. A grammatical inference engine is required to infer the grammar from a given training set of strings obtained from the class of objects under study. More generally, the idea of a grammar can be extended to that of a stochastic grammar and a stochastic language, where we assign frequencies or probabilities to the strings, and where a stochastic (recognition) grammar returns the probability of the string. Yet more generally, instead of dealing with a discrete, finite alphabet, we can treat the case of a topological grammar, in which the alphabet is infinite and carries some topological structure. We shall deal with a particular example of this latter case; in fact we deal with a particular class of topological stochastic grammars.


The choice of primitives, and of their relationships represented by composite operations, determines the practicality of such an approach. Traditionally, the selection of a set of primitives is greatly influenced by the nature of the data, the specific application and the technology available.

"... The following requirements usually serve as a guide for selecting pattern primitives. 1. The primitives should serve as basic pattern elements to provide a compact but adequate description of the data in terms of the specified structural relations (e.g., the concatenation relation). 2. The primitives should be easily extracted or recognized by existing nonsyntactic methods, since they are considered to be simple and compact patterns and their structural information not important." [1]

In this paper we use positive semi-definite, symmetric, quadratic forms as our primitives. These satisfy the above requirements and make it easy to extract higher level information about the objects based on the relationships between the forms. They also have a convenient representation in terms of matrices. The set of forms is not, of course, a discrete set, as in the cases studied by Fu, but may be thought of as a subset of $\mathbb{R}^n$, for some suitable $n$. The process of inferring higher level structure is known as the UpWrite, and the usual rewrite process is called the DownWrite. One of the attractive aspects of this method is that the UpWrite process can be applied as many times as required, depending on the type and specification of the problem. The higher the level of UpWrite, the more compact is the representation of the object. These processes will be described in some detail for our particular application rather than in general. It is not difficult to see how to generalise it to other cases.

2. METHOD

A syntactic pattern recognition system can be considered as consisting of three major parts, namely preprocessing, the UpWriting process and the classification (Figure 1). In this section we describe how we applied the above methodology to classifying moving objects such as human beings, blowing curtains and moving tree branches seen through windows.

Figure 1. Block diagram of a syntactic pattern recognition system: data input (images) → preprocessing → UpWriting process → classification → classified object.

2.1. Preprocessing

In this application, the functions of preprocessing include (i) detecting moving objects; and (ii) border tracing.

• Moving Object Detector. The function of the detector is to find moving objects, these being the objects we are interested in classifying. By applying differencing to two consecutive frames (Figure 2(a)), we obtain a resulting difference image. This is a binary image which shows the moving parts of the objects in white (Figure 2(b)). In this way, we filter out static background objects which are of no interest to us.

• Border Tracing. Next, we describe those moving parts of the objects using a border tracing algorithm [3,4]. What the algorithm does is to describe the isolated regions in the binary difference image by their starting point and the chaincode along the boundary (Figure 2(c)). Noisy images can be cleaned up by eliminating 'short' chaincodes.

2.2. The UpWrite Processes

We decided to use positive semi-definite, symmetric, quadratic forms as our primitives to extract structural information from the preprocessed image. There is a neural model which gives a rationale for this choice, but we do not elaborate on the issues here. First we segment the chaincode representation of the objects into sequences of quadratic forms (Figure 3). Each quadratic form may conveniently be visualised by drawing the ellipse given by the equation

$$\{ x \in \mathbb{R}^2 : (x - a)^T C^{-1} (x - a) = 1 \}$$

where $a$ is the centre and $C$ is the matrix giving the quadratic form. The effect of this is to smooth the noisy boundary contour and compress the representation of the objects. More importantly, the structural information in the object is not distorted and the ellipse representation can easily be made translation and rotation invariant. Forms are calculated by taking $K$ consecutive elements of the chaincode, for some suitable $K$, and computing the covariance matrix, $C$, for the set. $K$ must be chosen so that the quadratic forms are not degenerate, but not so large that the structure is lost. Our application requires two levels of UpWrite to extract higher level structural information. First, we extract the stroke information for the objects, and the second UpWrite extracts the stroke set information for each class of object.

2.2.1. Stroke Extraction

To extract strokes from the sequences of ellipses we first find sets of three consecutive ellipses, triples, that are close to each other and UpWrite them into a point in $\mathbb{R}^4$. The idea is to see different clusters of triples in the UpWrite space. We expect to find two different classes of triples. The first class contains smooth triples, where there is a constant change of angle from one ellipse to another, as shown in Figure 4(a). The other class contains chaotic (not smooth) triples, where there is much variation in the change of angle from one ellipse to another, as shown in Figure 4(b).
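As a rough sketch of how such forms might be computed from a traced boundary (our own illustration; the use of non-overlapping windows of length K is an assumption):

    import numpy as np

    def boundary_forms(boundary, K):
        """Model a traced boundary (an (M, 2) array of pixel coordinates) as
        a sequence of quadratic forms: for each window of K consecutive
        points, return the centre a and covariance matrix C, which together
        define the ellipse {x : (x - a)^T C^{-1} (x - a) = 1}."""
        boundary = np.asarray(boundary, dtype=float)
        forms = []
        for start in range(0, len(boundary) - K + 1, K):
            window = boundary[start:start + K]
            a = window.mean(axis=0)
            C = np.cov(window.T, bias=True)
            forms.append((a, C))
        return forms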


Figure 2. Examples of the three different objects (faces, curtains, tree branches): (a) original image, (b) difference image, (c) border of the difference image.

The smooth triples form a cluster in the UpWrite space, distinct from the chaotic triples. We need a rule to distinguish between the smooth and chaotic triples. We decided to generate the rule from the training data set via supervised learning. Triples from sequences of ellipses are displayed on the computer screen, and the human operator decides whether they are smooth or chaotic. Each of the triples is UpWritten to a vector $(\theta_1, \theta_2, \alpha_1, \alpha_2)$ in $\mathbb{R}^4$, as shown in Figure 5, where $\theta_1$ is the angle between the principal eigenvectors of ellipse e1 and ellipse e2, $\theta_2$ is the angle between the principal eigenvectors of ellipse e2 and ellipse e3, $\alpha_1$ is the angle between the principal eigenvector of ellipse e1 and the vector c2 - c1, and $\alpha_2$ is the angle between the principal eigenvector of ellipse e2 and the vector c3 - c2 (c$_i$ being the centre of ellipse e$_i$). We know that the clusters for the two different classes of triple UpWrite vectors are separable; in fact, the smooth triple UpWrite vectors are surrounded by the chaotic triple UpWrite vectors. This may be verified by looking at two dimensional projections of the triple UpWrite space, easily obtained with current graphics capabilities.


Figure 3. Fitting ellipses along the border of the difference image: (a) face, (b) curtain, (c) tree.

Figure 4. Examples of triples: (a) a smooth triple, (b) a chaotic triple.

We fit a gaussian distribution to the cluster of smooth triple UpWrite vectors, as shown in Figure 6. Any triple UpWrite vector close to the mean of the gaussian will be smooth; otherwise it will be considered chaotic. We use the Mahalanobis distance with a suitable threshold to make that decision. Once we have fitted a gaussian to the smooth triples in the UpWrite space $\mathbb{R}^4$, we go through all the sequences of ellipses to find the strokes as follows:

Figure 5. The components of the UpWrite vector $(\theta_1, \theta_2, \alpha_1, \alpha_2)$ in $\mathbb{R}^4$.

    stroke = 0;
    For ( s = 0; s < number of sequences; s++ ) {
        e = 0;
        While ( e <= length(Sequence(s)) - 3 ) {
            If ( triple(e, e+1, e+2) is smooth ) {
                stroke++;
                Repeat e++;
                Until ( triple(e, e+1, e+2) is not smooth or end of Sequence(s) );
            } else e++;
        }
    }  /* end of Sequence */
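The triple UpWrite and the smoothness test used inside this loop could be implemented along the following lines. This is our own sketch: the treatment of eigenvector sign (angles taken between undirected axes), the function names, and the gaussian parameters (assumed to come from the supervised training step described above) are our assumptions.

    import numpy as np

    def principal_axis(C):
        """Unit principal eigenvector of a 2x2 covariance matrix."""
        w, v = np.linalg.eigh(C)
        return v[:, np.argmax(w)]

    def triple_upwrite(e1, e2, e3):
        """UpWrite three consecutive ellipses, each given as a (centre,
        covariance) pair, to the vector (theta1, theta2, alpha1, alpha2)."""
        def angle_between(u, v):
            u = u / np.linalg.norm(u)
            v = v / np.linalg.norm(v)
            return float(np.arccos(np.clip(abs(u @ v), -1.0, 1.0)))
        (c1, C1), (c2, C2), (c3, C3) = e1, e2, e3
        p1, p2, p3 = principal_axis(C1), principal_axis(C2), principal_axis(C3)
        return np.array([angle_between(p1, p2), angle_between(p2, p3),
                         angle_between(p1, c2 - c1), angle_between(p2, c3 - c2)])

    def is_smooth(vec, mean, cov_inv, threshold):
        """Mahalanobis test against the gaussian fitted to smooth triples."""
        d = vec - mean
        return float(d @ cov_inv @ d) <= threshold ** 2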

Figure 6. Two dimensional projection of a gaussian distribution fitted to smooth triples in the UpWrite space $\mathbb{R}^4$.

For example, in Figure 7, triple(e1,e2,e3) is a smooth triple, so we mark it as the beginning of a stroke, Stroke1. The end of the stroke is encountered when a triple is chaotic. In this case, triple(e2,e3,e4) is also smooth, as are triple(e3,e4,e5) and triple(e4,e5,e6). But triple(e5,e6,e7) is chaotic, so the end of Stroke1 is triple(e4,e5,e6). The next stroke, Stroke2, starts from triple(e7,e8,e9), which is a smooth triple, and ends at the end of the sequence. From the sequence of ellipses in Figure 7, we extract two strokes: Stroke1, which consists of 6 ellipses, e1, ..., e6, and Stroke2, consisting of 5 ellipses, e7, ..., e11. We repeat this process for all the sequences of ellipses. It should be noted that we can justify the procedure of chunking adjacent ellipses (or forms) into strokes by an entropic criterion: each form may be regarded as a predictor of points, and when a triple is not smooth, the information supplied by the divergent ellipses is high. This is precisely analogous to the way in which characters in natural language text chunk into words, as was pointed out by Shannon.

Figure 7. Extraction of strokes from a sequence of ellipses: Stroke1 is from e1 to e6 and Stroke2 is from e7 to e11.

In Figure 8 we show the smooth triples that will eventually be represented by strokes and used in the next level of UpWrite, using the above method, based on the images in Figure 3. The reason for not showing any smooth triples from the tree branches class is that there aren't any. The structure of the tree branches is more complex and thus almost all of the triples are chaotic. We are thus able to filter out the tree branches object from the other two classes, curtains and faces.

2.2.2. Stroke Set Extraction

The main reason for this last level of UpWrite is that one frequently encounters the situation where two different objects are moving in the same image. If we stop at the stroke UpWrite level, the classification based on that level, if not incorrect, will be quite misleading. So, we try to find the stroke set for each class. We observed that most of the strokes for the curtains are long and thin and the strokes


for the faces are fat, as shown in Figure 9. This is because we have linear strokes for curtains and curved strokes for faces. Again, we use the training data set with supervised learning to generate the rule for deciding whether the strokes belong to the curtains or faces. First, we take ellipses from the previous UpWrite (Figure 8) that form a stroke, and we compute the centre and the covariance matrix for the stroke from the centres of the ellipses. Then we UpWrite each stroke into a vector $(E_2, E_2/E_1)$ in $\mathbb{R}^2$, where $E_1$ is the major eigenvalue of the stroke and $E_2$ is the minor eigenvalue of the stroke. Figure 10 shows the two different classes of strokes, + for the faces and · for the curtains, and they are obviously separable.

2.3. Classification

Once we have extracted the highest level of structural information of the objects, we proceed to the classification. We wanted to be able to tell when a human being has been moving in front of the camera, rather than respond to the moving curtains or tree branches seen through the windows. We classify via an elimination process. We know that the moving tree branches have more complex structure, which we can eliminate at the stroke UpWrite level, because we don't get any strokes at all. Then we are left with two different classes, curtains and faces. We eliminate all the strokes that belong to the curtains at the stroke set UpWrite level, and what we are left with are strokes of faces. For more complex classification tasks it would be necessary, in general, to go to higher levels of UpWrite, but they are not required in the present application.
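A sketch of this stroke-level UpWrite and the separating-line rule (illustrative only; the line coefficients a, b, c are hypothetical placeholders for values fitted to the training data):

    import numpy as np

    def stroke_features(ellipse_centres):
        """UpWrite a stroke (the centres of its ellipses, an (m, 2) array)
        to (E2, E2/E1), where E1 >= E2 are the eigenvalues of the covariance
        of the centres: long thin strokes give small ratios, fat strokes
        larger ones."""
        w = np.linalg.eigvalsh(np.cov(np.asarray(ellipse_centres, float).T))
        E2, E1 = sorted(w)
        return E2, E2 / max(E1, 1e-12)

    def classify_stroke(features, a, b, c):
        """Classify with a separating line a*E2 + b*(E2/E1) + c = 0 learned
        from training data: 'face' on one side, 'curtain' on the other."""
        E2, ratio = features
        return 'face' if a * E2 + b * ratio + c > 0 else 'curtain'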


Figure 8. Smooth triples of (a) a curtain and (b) a face, extracted from Figure 2.

3. RESULTS

The experiment was done on a PC with the Intel 486 chip and a frame grabber board. Images were collected at 256x256 resolution, with 128 grey scale levels, as shown in Figure 2(a). The experiment has been carried out with various values of the following parameters, to test the robustness of the proposed method:

• the difference threshold at the preprocessing level, $\tau$;
• in the first level of UpWrite, the number of consecutive elements of chaincode used to compute the covariance matrix, $K$;
• at the same level, the Mahalanobis distance threshold used to distinguish the smooth triples from the chaotic triples.

There was also the possibility of a dependency of the classification on the choice, by the human operator, of what constituted a smooth triple, at a higher level of what constituted a stroke, and on the source of data used to obtain these elements. Obviously, the results we get at any one stage will be affected by the parameters used in the previous stages. For example, different values of $\tau$ will produce different difference images showing the moving parts of the objects in white (Figure 2(b)). A high value of $\tau$ will only capture a small proportion of the moving parts, which is insufficient for the


Figure 9. Stroke representation for (a) a curtain and (b) a face.

purpose of classification. Experiments were conducted using threshold values of $\tau$ = 5, 10, 15, 20 and 25. The results presented below are based on $\tau$ = 5 and 10, because $\tau$ > 15 usually does not preserve enough information for further processing. Having decided on the value of $\tau$, the next parameter to consider is the number of consecutive elements, $K$, used to segment the chaincode representation of the objects. Again, a high value of $K$ will remove too much information. This parameter will affect the stroke information that we are extracting at the first UpWrite. In Section 2.2.1 we explained how to extract strokes from the sequences of ellipses: a stroke is made up of a series of smooth triples, and a triple is three consecutive ellipses. A high value of $K$ will result in fewer ellipses, which may not be enough to generate the triples needed to extract strokes. A low value of $K$, on the other hand, produces too many ellipses, which might require us to go to another level of UpWrite before we go to the stroke set UpWrite. We have tried $K$ = 12, 20 and 30. With $K$ = 20 and 30 there is not much difference in producing the strokes; $K$ = 12 might produce too many strokes, which at first glance makes the result look very noisy. Extraction of the strokes also depends on the classification of triples, which depends on the Mahalanobis distance threshold. First, we collected triples for training via supervised learning, as was described in Section 2.2.1; then we fitted a gaussian to the smooth triples in $\mathbb{R}^4$ (Figure 6). Intensive experiments have been done on training the smooth triples with different numbers of training sets, and it was found that the UpWrite was quite robust: the results are independent of the number of training data and of whether the data are trained on one person or more. The triples information is independent of the person in the image. Figure 11 shows the images of strokes generated with the following parameter values: $\tau$ = 5 and $K$ = 12, with triples trained on $K$ = 20, for Mahalanobis distance thresholds of 3, 4, 5, 6, 7, 8, 9 and 10. The system is also robust if different $K$ values are used when extracting strokes. Figure 12 shows the images of strokes with different $K$, where the triples have been trained with $K$ = 30. Section 2.3 explained the classification via an elimination process. We need to distinguish face strokes from curtain strokes. Figure 10 shows the strokes of the faces and curtains, and they are clearly separable.

Figure 10. The two different classes of strokes in $\mathbb{R}^2$: + for faces and · for curtains.

There are a number of methods to classify them, and we have chosen to present the results using a separating line. Training data for the smooth triples and the strokes have been collected and trained from the same person. 11 sequences of images have been collected of each of the 14 different


people, including a female with long hair. 18 sequences of images of the curtains and 9 sequences of images of the tree were also collected. Each sequence produced 4 differenced images; therefore, 11 sequences of 14 different people gave 616 faces, 18 sequences of curtains gave 72 curtains and 9 sequences of trees gave 36 tree branches. Classification of the triples is based on the data collected from one person with chaincode length $K$ = 30, and the classification of strokes is based on the separating line shown in Figure 10. Table 1 shows the confusion matrix with difference threshold $\tau$ = 5, for $K$ = 20 and 30. Table 2 shows the matrix with difference threshold $\tau$ = 10, for $K$ = 20 and 30.

Table 1
Testing data: difference threshold $\tau$ = 5

                     Chaincode length K=20       Chaincode length K=30
                   Curtains   Faces   Trees    Curtains   Faces   Trees
    Curtains         0.90      0.10    0.00      1.00      0.00    0.00
    Faces            0.08      0.92    0.00      0.06      0.94    0.00
    Tree branches    0.11      0.00    0.89      0.00      0.07    0.93

Table 2
Testing data: difference threshold $\tau$ = 10

                     Chaincode length K=20       Chaincode length K=30
                   Curtains   Faces   Trees    Curtains   Faces   Trees
    Curtains         0.92      0.08    0.00      0.96      0.04    0.00
    Faces            0.05      0.95    0.00      0.06      0.94    0.00
    Tree branches    0.00      0.08    0.92      0.00      0.03    0.97

There were errors where some of the curtains were classified as faces, and some of the tree branches were also misclassified as faces. It is believed that these errors would be corrected by sequential real time processing, or by going to one higher level of UpWrite.

4. ROBUSTNESS

From the results in Tables 1 and 2, the failures to pick up the moving faces are due to the differencing parameter $\tau$. In Table 1, where $\tau$ = 5 is the lowest value, the reasons we failed to pick up the faces are that either there is too much movement in the images or the images are slightly overexposed. Under such circumstances a higher $\tau$ value is more desirable, as shown in Figure 13, which shows the difference images obtained with different thresholds when the original images are slightly overexposed. In order to have a more robust system, so that we can improve on the recognition rate, we can start off with a low $\tau$ = 5 and increment it progressively. The terminating condition is that the white pixel count in the difference image is low. This is because this approach requires a certain amount of movement in the difference image, and this is


Figure 11. Strokes extracted with Mahalanobis distance thresholds of 3 to 10.

Figure 12. Strokes extracted with different chaincode lengths, K = 12, 20, 30.

Figure 13. A slightly overexposed image with different differencing thresholds, τ = 5, 10, 15.


represented by white pixels (Figure 2(b)). In this way, the system is also robust against changes in the lighting conditions.

5. CONCLUSION AND FURTHER WORK

We have described a method of extracting higher level structural information from objects based on the use of quadratic forms as primitives, and have also shown that such a representation of objects achieves promising classification rates with a simple classification technique. The method used is very general. We note that no prior information specific to the objects was put into the process; the human intervention consisted of choosing some parameters, such as the chaincode length, $K$, to be used in computing the forms, and the level of thresholding, $\tau$, employed in constructing the difference images from consecutive frames. The human operator also had to collect the training data for the triples and strokes, but note that the training data are collected from the images of one human being and used to classify different people, curtains and trees. In all other respects the process was completely general. It is to preserve generality that we were concerned with robustness under variation of the parameters and the training data. In this respect the method greatly exceeds conventional syntactic methods in generality. A fuller discussion of the theoretical issues involved in our implementation will appear elsewhere.

REFERENCES
1. Fu, K. S., Syntactic Methods in Pattern Recognition, Academic Press, New York and London, 1974.
2. Fu, K. S., Syntactic Pattern Recognition, Applications, Springer-Verlag, Berlin and New York, 1977.
3. Haig, T. D., Attikiouzel, Y. and Alder, M. D., Border Marriage: Matching of Contours of Serial Section, IEE Proc.-I, Vol. 138, No. 5, pp. 371-376, October 1991.
4. Haig, T. D., Attikiouzel, Y. and Alder, M. D., Border Following: New Definition Gives Improved Border, IEE Proc.-I, Vol. 139, No. 2, pp. 206-211, April 1992.


Initializing the EM algorithm for use in Gaussian mixture modelling

Patricia McKenzie and Michael Alder

Centre for Intelligent Information Processing Systems, The University of Western Australia, Nedlands W.A. 6009, Australia

We look at the EM algorithm used for Gaussian mixture modelling and problems with its initialization. We then consider a method of initializing the algorithm to cluster centres, called the Dog-Rabbit Strategy, and compare this to the k-means clustering algorithm. The Dog-Rabbit Strategy and the EM algorithm were run on data produced from an industrial pattern recognition problem and some results are discussed.

1. INTRODUCTION

Figure 1. An image produced by ultrasonic projection.

Figure 1 is an image obtained by projecting ultrasonic beams into metal containing a hole and collecting the reflections. The different categories of 'point' arise from beams having different angles of incidence. The present work arises from an attempt to classify such images in order to detect cracks and other anomalies. The first task was to characterise each such image as a vector. This was accomplished by computing the mean and covariance matrix for the set of points of each beam-angle;

this was occasionally degenerate but always defined. Then the relationships between the resulting quadratic forms were coded by selecting a particular centre and listing various distances of the other centres, and also by giving the several covariance matrix entries. Details may be found elsewhere [11]. The figure is coded by this process into a vector of fourteen real numbers in such a way that similar diagrams yield similar vectors, i.e. points of $\mathbb{R}^{14}$ which are close in the Euclidean metric. We shall refer to the process of turning an image such as Figure 1 into a point in a different space as the UpWrite process. Then determining if a hole contains a crack can be done by determining whether it belongs to the cluster of normal holes or not, by any of a variety of pattern classification methods. In the present case it was done by modelling the normal holes, regarded as points in $\mathbb{R}^{14}$, as a mixture of gaussian or normal distributions, and computing pdf values for particular points derived from an image. Our justification for treating the UpWritten data as appropriate for modelling as a mixture of Gaussians derived partly from the general consideration that it might be expected that there is a finite family of image types and some 'noise' associated with each type, partly from inspection of two dimensional projections of the data, as in Figure 2, and partly from a desire to keep the modelling machinery simple.

Figure 2. 500 UpWritten images projected from 14 dimensions onto two different planes.

A multivariate gaussian mixture with $k$ component gaussians can be defined as [2]:

$$g(x \mid \theta) = \sum_{i=1}^{k} w_i f_i(x \mid \mu_i, \Sigma_i) \qquad (1)$$

where $f_i(x \mid \mu_i, \Sigma_i)$ describes the $i$th component gaussian distribution, with mean an $n$-dimensional vector $\mu_i$ and covariance an $n \times n$ matrix $\Sigma_i$, and where $w_i$ is the weight of the $i$th component gaussian and $\sum_{i=1}^{k} w_i = 1$. Here $\theta$ is a vector containing the parameters $w_i$, $\mu_i$ and $\Sigma_i$, for $i = 1, \ldots, k$, of the mixture. The likelihood of a point $x$ is the value of $g(x \mid \theta)$ and can be used to give a measure of how 'close' to the gaussian mixture the point is. If each different type of normal occurrence

has a different gaussian mixture modelling it in the UpWrite space ($\mathbb{R}^{14}$), then when a new image is presented it can be UpWritten to a point in the UpWrite space and its likelihood of belonging to each of the gaussian mixtures can be calculated. These likelihoods can then be compared to each other, and also to a threshold, to produce a classification of either belonging to one of the corresponding categories or of 'anomaly' if the likelihood of the point belonging to every known category is small [2,10]. We then need to find a gaussian mixture to model the data belonging to each category. In the UpWrite space we have a set of $N$ data points of dimension $n$, $X = \{x_1, x_2, \ldots, x_N\}$, which can be modelled by a multivariate gaussian mixture with $k$ component gaussian distributions. The fit of a model to the data can be measured by the total likelihood of the data

$$L(X \mid \theta) = \prod_{j=1}^{N} g(x_j \mid \theta)$$

where $g(x \mid \theta)$ is defined in equation (1), or by the total log likelihood

$$\mathcal{L}(X \mid \theta) = \log L(X \mid \theta) = \sum_{j=1}^{N} \log g(x_j \mid \theta)$$
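As a concrete illustration (a sketch of ours, using SciPy's multivariate normal density for the component distributions $f_i$; not part of the original implementation), the total log likelihood can be evaluated as:

    import numpy as np
    from scipy.stats import multivariate_normal

    def total_log_likelihood(X, weights, means, covs):
        """Total log likelihood of the data X (an (N, n) array) under the
        gaussian mixture of equation (1)."""
        g = np.zeros(len(X))
        for w, mu, Sigma in zip(weights, means, covs):
            g += w * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        return float(np.sum(np.log(g)))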

The problem then is to find $\theta = \hat{\theta}$ which maximizes the total log likelihood of the data. The EM (Expectation Maximization) algorithm can be used to find an estimate of the maximum log likelihood and the corresponding estimate $\theta = \hat{\theta}$. The EM algorithm uses an iterative process to produce a sequence of estimates $\{\theta^m\}$ [1,10]. This can produce a local maximum for the total log likelihood, but varying initial conditions may produce sequences which converge to differing values of $\theta$ and differing total log likelihoods. The problem then is to find a method of choosing initial conditions that lead to a consistent estimate of $\theta$ and an optimal log likelihood. We consider the traditional methods of initializing the EM algorithm and the results they produce; we then look at a method of finding cluster centres called the 'Dog-Rabbit Strategy' as a method of initializing the algorithm, and compare this to the k-means clustering algorithm.

2. THE EM ALGORITHM

The EM algorithm, when used for gaussian mixture modelling, tries to find the parameters of a gaussian mixture distribution by using an iterative process to find the maximum likelihood estimate for the data. Each iteration involves two steps [1,10].

E step: Evaluate $E[\log g(Y \mid \theta) \mid X, \theta^m] = Q(\theta, \theta^m)$.
M step: Find $\theta = \theta^{m+1}$ to maximize $Q(\theta, \theta^m)$,

where $Y$ is a complete data set containing the incomplete data set $X$. This can be implemented by initializing the gaussian mixture defined in (1) to somewhere in the space of $\theta$s, traditionally with randomly chosen centres, identity covariance matrices and equal weights. Then by taking the partial derivatives of the log likelihood equation and setting these to zero we can derive the EM equations for a gaussian mixture [9,10]:

w_i^{m+1} = (1/N) Σ_{j=1}^{N} p_{ij}^m,

μ_i^{m+1} = Σ_{j=1}^{N} p_{ij}^m x_j / Σ_{j=1}^{N} p_{ij}^m,

Σ_i^{m+1} = Σ_{j=1}^{N} p_{ij}^m (x_j − μ_i^{m+1})(x_j − μ_i^{m+1})^T / Σ_{j=1}^{N} p_{ij}^m,

where p_{ij}^m = w_i^m f_i(x_j|μ_i^m, Σ_i^m) / g(x_j|θ^m) is the responsibility of the ith component gaussian for the jth data point under the current estimate θ^m. It can be shown that L(θ^{m+1}) ≥ L(θ^m) for each m [1], i.e. the EM algorithm produces a monotonically increasing sequence of likelihoods. This implies that if the algorithm converges it will reach a stationary point of the likelihood function, but it does not guarantee that the global maximum will be achieved [10]. For example, Figure 3 shows two different initial positions and the result of applying the EM algorithm to the data using these initial positions. In Figure 3 the component gaussians of the gaussian mixture are represented by ellipses drawn as the points one standard deviation from the mean; this set can be described as {x : (x − μ)^T Σ^{-1} (x − μ) = 1}, where μ is the mean and Σ is the covariance matrix of the component gaussian. In Figure 3(a) the starting positions were chosen as randomly selected data points, while in Figure 3(c) they were chosen close to the centres of the clusters. Note that the total log likelihood of the mixture in Figure 3(d), the result of running the algorithm with the cluster centres as initial positions, is higher than the total log likelihood for the mixture produced in Figure 3(b), where the algorithm was initialized randomly. Note also that the number of iterations needed before the sequence produced by the EM algorithm converged was greatly reduced by initializing the mixture to the cluster centres. The problem then can be formulated as: where should the initial component gaussians be placed to ensure that the EM algorithm achieves the desired maxima?
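To make the update concrete, here is a minimal numpy sketch of one EM iteration of the above form; it is our own illustrative implementation, not the authors' code, and the name em_step and the array layout (X of shape (N, n)) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration for a gaussian mixture: the E step computes the
    responsibilities p_ij, the M step re-estimates weights, means and
    covariances from them."""
    N, k = len(X), len(weights)
    # E step: p_ij = w_i f_i(x_j) / g(x_j)
    p = np.array([w * multivariate_normal.pdf(X, mean=m, cov=c)
                  for w, m, c in zip(weights, means, covs)])      # shape (k, N)
    p /= p.sum(axis=0, keepdims=True)
    # M step: re-estimate the mixture parameters
    new_weights = p.sum(axis=1) / N
    new_means = [p[i] @ X / p[i].sum() for i in range(k)]
    new_covs = []
    for i in range(k):
        d = X - new_means[i]
        new_covs.append((p[i][:, None] * d).T @ d / p[i].sum())
    return new_weights, new_means, new_covs
```

Iterating em_step until the total log likelihood stops increasing reproduces the monotone behaviour noted above.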

(a) Initialization. (b) Convergence after 69 iterations, total log likelihood = -4801.12. (c) Initialization. (d) Convergence after 6 iterations, total log likelihood = -4074.20.

Figure 3. Results of running the EM algorithm with two sets of initial conditions.

If we consider gaussian mixture modelling as trying to find the underlying distribution that generated the data, and the EM algorithm as a method of finding this underlying distribution, then a good starting position for the component gaussians may be near the means of the underlying component gaussians. We then need a method of finding these distribution centres.

4. CLUSTERING TECHNIQUES

The components of the underlying distribution that generated the data can be considered as clusters in the data, and so we need a method of finding cluster centres in the data. Therefore we look at clustering algorithms. One well known algorithm for finding clusters in the data is the k-means clustering algorithm [5,7]. This attempts to find k clusters in the data, where k is predefined, by an iterative process that involves moving the cluster centres to minimize the mean square error. This is done as follows (a minimal sketch of the full iteration is given after the step listing):

Step 0 Randomly select k data points as the initial starting locations of the cluster centres c_1, c_2, ..., c_k.

Step 1 Divide the data points into k disjoint sets S_1, S_2, ..., S_k such that a data point x_j

is an element of S_i if c_i is the closest cluster centre to x_j.
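The update and convergence steps that complete the iteration are the standard ones (recompute each centre as the mean of its set and repeat until the assignments stop changing); the sketch below is our own illustration of that standard procedure under Euclidean distance, and the name kmeans and the default arguments are invented.

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
    """Standard k-means: Step 0 random centres, Step 1 assign each point to the
    nearest centre, then move each centre to the mean of its set and repeat."""
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to the closest cluster centre
        labels = np.argmin(((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        new_centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centres[i] for i in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```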

gradient of d to be negative but d(0) = 0, which conditions cannot all be met. We also want d(D_j) → 0 as D_j → ∞, or remote data will continue to perturb any solution. We would also like 0 < d(D_j) < 2 for all D_j, so that although the dogs may jump over the rabbit they will always move closer to it. This leads us to a function similar to that shown in Figure 6, where the function starts at zero, increases rapidly and then tapers off towards zero.

Figure 6. A possible shape for the function d(D_j).

One possible form for d(D_j), given as equation (3), is a rational function of D_j with a single parameter r, where r is a number larger than 1. This function has the desired properties in that it ranges from zero to two (when r is close to one), has a slope larger than one near zero and tends to zero as D_j tends to infinity. Another factor to consider is that of fatigue. As the dogs approach the cluster centres we would like them to stay there. This is done by introducing a fatigue factor which can be increased when the closest dog is moved to within a certain distance of the presented rabbit. As the fatigue increases, the movement of the dog should be retarded, so that the dogs have a tendency to stay close to rabbits they have seen before. If we introduce a fatigue factor into the proportional distance function d(D_j), it becomes d(D_j, f_j), where f_j is the fatigue of dog c_j. d(D_j, f_j) should have all the properties of d(D_j) and also the property that as the fatigue f_j increases, d(D_j, f_j) decreases for all values of D_j. If we examine equation (3) we notice that as r increases the function value decreases while the desired properties of d(D_j) still hold. Therefore the dynamic for moving the dog c_j is defined by

c_j ← c_j + d(D_j, f_j)(x − c_j)     (4)

where f_j ≥ 1 is the fatigue of dog c_j and D_j is the Euclidean distance of dog c_j to rabbit x.

Since we want only one dog on each cluster of rabbits, we don't want to move all the dogs equally towards each presented rabbit. Therefore we inhibit the movement of all but the closest dog so that, if c_c is the closest dog, for j = 1, ..., k, j ≠ c, equation (4) becomes

c_j ← c_j + α(D_j) d(D_j, f_j)(x − c_j)

where α(D_j) is a function of D_j describing the inhibition of dog j. If the dog c_j is close to the rabbit but not the closest dog, then there may be two dogs attracted to the same cluster, and so c_j should not move far, implying that α(D_j) should be small, i.e. α(D_j) → 0 as D_j → 0. On the other hand, outlying dogs should be attracted towards the data, and so α(D_j) → 1 as D_j → ∞. This leads us to a possible form of α(D_j): a rational function of D_j with a single parameter λ > 0, where the value of λ determines the amount of inhibition. Our choice of particular functions, given only constraints on their qualitative behaviour, is determined by simplicity and speed of computation, so we have selected rational functions which have the fewest parameters needed.
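To make the dynamic concrete, here is a small sketch of one rabbit presentation. The particular rational forms chosen below for d and α are our own illustrative assumptions satisfying the stated qualitative constraints (they are not the paper's equation (3)), and the function names are invented. The inhibition factor of 25, the closeness threshold of 1 and the fatigue increment of 0.1 follow the values quoted in the experiments reported below.

```python
import numpy as np

# Illustrative choices only: any rational functions with the qualitative
# behaviour described in the text would do.
def d(D, f):
    """Proportional step: zero at D = 0, tends to zero for distant rabbits,
    damped by the fatigue f >= 1."""
    return 2.0 * D / (f * (1.0 + D * D))

def alpha(D, lam):
    """Lateral inhibition: -> 0 as D -> 0, -> 1 as D -> infinity."""
    return D * D / (D * D + lam)

def present_rabbit(dogs, fatigue, x, lam=25.0, close=1.0, fatigue_step=0.1):
    """One presentation of a rabbit x to the dogs (cluster centres)."""
    dist = np.linalg.norm(dogs - x, axis=1)
    c = int(np.argmin(dist))                      # the closest dog
    for j in range(len(dogs)):
        step = d(dist[j], fatigue[j]) * (x - dogs[j])
        if j != c:
            step *= alpha(dist[j], lam)           # inhibit all but the closest dog
        dogs[j] += step
    if dist[c] < close:                           # increase the fatigue of the closest dog
        fatigue[c] += fatigue_step
    return dogs, fatigue
```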

(Panels: Initialization / After Dog Rabbit Strategy, shown for two different starting positions.)

Figure 7. Running the Dog Rabbit Strategy with differing starting positions.

The Dog Rabbit Strategy can then be formulated as an algorithm:

Step 0 Initialize k dogs c_j, j = 1, ..., k, to random positions in a bounded region containing the data and initialize the fatigue of the dogs to f_j = 1.
Step 1 Select a random rabbit x from the data.
Step 2 Calculate the distance of each dog from the rabbit x and find the closest dog c_c.
Step 3 Move the dogs towards the rabbit according to the dynamic given above.

Step 4 If D_c < 1, increase the fatigue f_c of c_c, the nearest dog.
Step 5 Repeat steps 1 to 4 until the dogs are not moving much.

Here D_j is the Euclidean distance of dog j from the rabbit x, f_j is the fatigue of dog j and λ > 0 is the inhibition rate. Figure 7 shows the results of running the Dog Rabbit Strategy with two different initial starting positions. This can be compared to Figure 4, where the k-means clustering algorithm was run with similar data and starting positions. Note that here the Dog Rabbit Strategy has converged to similar final configurations, seen in Figure 7(b) and Figure 7(d), regardless of the differing initial configurations.

6. RESULTS

The Dog Rabbit Strategy and the k-means clustering algorithm were run on various data sets to compare their performance. Figure 8 shows the results of applying both algorithms to a data set containing distinct identical clusters, using random data points as the initial positions of the cluster centres in both algorithms. Here the Dog Rabbit Strategy (Figure 8(a)) has successfully placed the cluster centres on the data clusters, while the k-means clustering algorithm (Figure 8(b)) has placed several cluster centres on one data cluster and placed other cluster centres between two data clusters. Figure 9 shows the results of applying both algorithms to data with overlapping clusters, where the top left cluster has twice as many data points as the other two clusters. Here the Dog Rabbit Strategy positioned cluster centres closer to the intersection of data clusters (Figure 9(a)), where there is a concentration of points, while the k-means clustering algorithm has biased the positions of the centres towards the cluster containing the most data points. Figure 10 shows the results of applying both algorithms to a set of circular data. Here the k-means algorithm has spaced the cluster centres evenly about the circle (Figure 10(b)) while the Dog Rabbit Strategy (Figure 10(a)) has moved one of the cluster centres to the centre of gravity of the data.

Figure 8. Applying the Dog Rabbit Strategy (a) and the k-means clustering algorithm (b) to data with distinct clusters.

Figure 9. Applying the Dog Rabbit Strategy (a) and the k-means clustering algorithm (b) to data with overlapping clusters.

This is due to the lateral inhibition of the other cluster centres, which forces the extra cluster centre to continue moving towards the various presented data points without increasing its fatigue. 'Dogs' such as this can readily be pruned by monitoring the fatigue term f_j. The question also arises of the effect of having too few or too many cluster centres to describe the data. This problem also arises when using the EM algorithm, which assumes the number of clusters in the data is known. This is discussed elsewhere [12]. Figure 11 shows the results of applying the two algorithms with one cluster centre to data containing two clusters. Here the Dog Rabbit Strategy (Figure 11(a)) has placed the cluster centre on one of the data clusters, ignoring the other cluster, while the k-means algorithm (Figure 11(b)) has placed the cluster centre between the two clusters.

Figure 10. Applying the Dog Rabbit Strategy (a) and the k-means clustering algorithm (b) to circular data with random initial positions.

Figure 12 shows the results of applying the two algorithms with more cluster centres than necessary. Here both the Dog Rabbit Strategy (Figure 12(a)) and the k-means clustering algorithm (Figure 12(b)) have moved two cluster centres to one cluster in the data, modelling the other cluster with one cluster centre, although Figure 10 shows that this is not always the case. The Dog Rabbit Strategy was applied to various data sets to determine its consistency. On the data sets seen in Figures 7, 8 and 9, with a fatigue increase of 0.1 and an inhibition factor of 25, the Strategy converged to final configurations similar to those seen in the respective figures about 99% of the time. On the occasions when the algorithm did not converge to these configurations, two cluster centres modelled one data cluster. When run on the circular data seen in Figure 10, the Strategy sometimes converged to positions similar to those in Figure 10(a), using only seven cluster centres to model the data, otherwise converging to positions similar to those seen in Figure 10(b), where all eight cluster centres are used. An important factor in the convergence of the Dog Rabbit Strategy is the determination of when a cluster centre is close enough to a data point for its fatigue factor to be increased. In the data sets used in this paper, the data was contained inside a 20 by 20 square and two points were considered close if the Euclidean distance between them was less than one.

7. CONCLUSION AND SUMMARY

We note that when using the EM algorithm for gaussian mixture modelling, changes in the initial gaussian mixture g(x|θ^0) result in varying final gaussian mixtures. One method of overcoming this problem of varying final solutions is to run the EM algorithm several times and then select the best final mixture, but this is slow and there is no guarantee that the final mixture chosen is optimal. Therefore a method of selecting initial gaussian mixtures is needed. One approach to this is to find cluster centres in the

Figure 11. Applying the Dog Rabbit Strategy (a) and the k-means clustering algorithm (b) with fewer cluster centres than data clusters.

Figure 12. Applying the Dog Rabbit Strategy (a) and the k-means clustering algorithm (b) with more cluster centres than data clusters.

data and use these centres as the initial means in the component gaussians of the initial mixture. We examined the k-means clustering algorithm for this and found that it too is dependent upon initial conditions, leading to similar problems when used to initialize the EM algorithm. We then described the Dog Rabbit Strategy as a method of finding cluster centres for use as the means of initial component gaussians used in the EM algorithm and noted that this was much more consistent than the k-means clustering algorithm when presented with varying initial conditions. Data produced from ultrasonic scanning of metal containing circular holes and flaws in circular holes was preprocessed to produce points in R^^. The Dog Rabbit Strategy was used on the circular hole data to find fourteen dimensional means for a gaussian mixture containing four component gaussians. This mixture was then used as an initial mixture for the EM algorithm and the model produced was used to classify data into categories of normal hole or flawed metal by comparing the likelihood of the point with respect to the mixture to a threshold. The mixture was trained on 1000 data points and tested on 500 different circular hole data points and 30 flawed metal data points. 96% of the circular hole data was correctly classified and 100% of the flawed metal data was correctly

classified.

REFERENCES

1. A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J. Roy. Statist. Soc. B 39, 1977.
2. Everitt and Hand, Finite Mixture Distributions, Chapman and Hall, London, 1991.
3. K.S. Fu, Syntactic Pattern Recognition and Applications, Prentice Hall, 1982.
4. R. Gonzalez, Syntactic Pattern Recognition: An Introduction, Addison-Wesley Pub. Co., 1978.
5. R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, USA, 1988.
6. L.V. Kantorovich and G.P. Akilov, Functional Analysis, Pergamon Press, Oxford, 1982.
7. L. Lebart, A. Morineau and K.M. Warwick, Multivariate Descriptive Statistical Analysis, John Wiley and Sons, New York, 1984.
8. G. Lim, M. Alder and P. Hadingham, Adaptive Quadratic Neural Nets, Pattern Recognition Letters 13 (1992) pp. 325-329.
9. T. Taxt, N.L. Hjort and L. Eikvil, Statistical classification using a linear mixture of two multinormal probability densities, Pattern Recognition Letters 12 (1991) pp. 731-737.
10. D.M. Titterington, A.F.M. Smith and U.E. Makov, Statistical Analysis of Finite Mixture Distributions, John Wiley and Sons Ltd., Great Britain, 1985.
11. P.J. McKenzie and M.D. Alder, Syntactic Pattern Recognition by Quadratic Neural Nets. A Case Study: Rail Flaw Classification. IJCNN'93-Nagoya Conference, 1993.
12. P.J. McKenzie and M.D. Alder, Selecting the Optimal Number of Components for a Gaussian Mixture Model. ISIT '94, Trondheim, 1994.

Pattern Recognition in Practice IV E.S. Gelsema and L.N. Kanal © 1994 Elsevier Science B.V. All rights reserved.

Predicting REM in sleep EEG using a structural approach

Ana L. N. Fred, Agostinho C. Rosa and José M. N. Leitão

Instituto de Telecomunicações / Dep. Eng. Electrotécnica e de Computadores, Instituto Superior Técnico, Complexo I, Av. Rovisco Pais, 1096 Lisboa Codex, Portugal

This paper presents a quantitative modeling framework for automatic analysis, classification and prediction of sleep electroencephalographic (EEG) signals. A hierarchical hybrid pattern recognition system is proposed comprising, at a first level, feature extraction based on a stochastic model of sleep EEG and, at a higher level, syntactic models based on stochastic context-free grammars. The particular application considered is the prediction of entrance into the REM (Rapid Eye Movements) stage from the 2NREM (non-REM) stage.

1. INTRODUCTION

Sleep is a complex dynamic process whose function and mechanisms are not clearly known. Signals usually taken as indicators of sleep dynamics are, among others, the electroen­ cephalogram (EEG), the electrooculogram and the electromyogram. The global features of sleep dynamics, that is, its macrostructure, are usually com­ pressed into a sleep stages description, based on the Rechtschaffen and Kales (R&K) paradigm [1]. According to these criteria the sleeper is considered to stay in one of six sleep stages: wakefulness, REM (Rapid Eye Movements) stage, and four non REM stages (1, 2, 3 and 4 NREM). By visual inspection, the neurophysiologist attributes one stage to each page (corresponding to 20 or 30 seconds) of continuous polygraphic recording along the night, according to the R&K rules. In this way a graphical description of stage evolution, called the hypnogram, is produced. Figure 1 shows a typical example. Reference [2] deals with a syntactic approach to the study of sleep macrostructure. The temporal organization of sleep, as given by the hypnogram, is analysed and modeled in terms of stochastic grammars, inferred using Crespi Reghizzi's method [3] without a priori information. The application of the developed methodology to the comparison of a population of normals with a population of psychiatric patients was analysed in terms of structural modeling and discriminating power. From the structural point of view, the context-free grammars are a natural way of representation, being able to describe the tendencies of sleep cyclicity. Results show clear differences in the grammatical complexity associated with each population, and the capacity to discriminate between populations.


Figure 1: Hypnogram of a normal subject. Recording from 11pm (time instant 0) to 7am. Horizontal scale is in epochs of 15 seconds.

A further refinement of the previous model, by the introduction of a priori information in the process of grammar inference, is described in [4]. It consists of devising subpatterns by means of Solomonoff coding [5] and using these to constrain the production of structural samples. Results of the application of the refined model to the above problem reveal a significant improvement in discriminating capacity. This paper concerns the extension of the previous work on syntactic modeling to the context of sleep microstructure. The underlying common assumption is that sleep is a highly structured dynamic process, both in macro and microstructural views. In particular, within each stage, sleep EEG signals exhibit a characteristic organization whose building blocks are transient events (like K-complexes, spindles and sawtooth waves) with variable morphologies and random occurrence - composing the phasic activity - superimposed on the background activity with various dominant frequencies, namely delta (< 4 Hz), theta (4-8 Hz), alpha (8-12 Hz), sigma (12-16 Hz) and beta (16-30 Hz).

Figure 2 represents a single channel EEG recording showing two contiguous epochs of signal belonging to stages 2NREM and REM, where differences in the patterns associated with the two sleep stages are very clear. Motivated by some empirical evidence, the work summarized in [6] presents first results on the definition of features to be used in predicting the transition from stage 2NREM to REM. These preliminary results, supported by statistical evaluation of the occurrences of these features in both situations, indicate an increase in K-complexes den­ sity, in sawtooth-like activity and in the length of EEG desynchronization in periods preceding REM. In this paper we propose a hierarchical modeling system of sleep EEG activity with the purpose of testing the predictability hypothesis. The hybrid pattern recognition system comprises, at a first level, physiological meaningful feature extraction based on a stochastic model of sleep EEG. At a higher level, syntactic models based on stochastic context-free grammars are used to monitor long-term as well as short-term changes in patterns related to the presumed predictive ability. Section 2. introduces the hierarchical modeling framework for the study of sleep EEG microstructure. The purpose is to predict transition to REM, confirming the empirical evidence of certain trends of temporal organization of sleep. Our perspective is to look at


Figure 2: Single channel EEG record showing 2NREM stage preceding REM stage.


the EEG as an expression of a language [7, 8]. Sleep EEG signals are thus interpreted as sequences of symbols whose structure is modeled by stochastic context-free grammars (SCFGs). The methodology developed for the prediction of entrance into REM from stage 2NREM is detailed in section 3. In a final section, some results are presented.

2. MODELING FRAMEWORK

Figure 3 represents the modeling framework of sleep EEG microstructure.

[Block diagram: single-channel EEG records → stochastic sleep EEG model, estimation of model parameters and phasic events detection (level 1) → conversion into string description (level 2) → syntactic model (level 3).]

Figure 3: Hierarchical model of sleep EEG.

On a first level, single channel EEG signals are modeled by the stochastic system [7, 9] depicted at the left of figure 3. Several feedback loops, representing the dominant rhythmic activities, are excited by a sum of white Gaussian noise (v(t)) with impulse-like signals (p(t)). Each feedback loop consists of a band-pass filter H_i with variable gain K_i. The model, having a physiological interpretation, can recreate both the background activity and the phasic events by appropriate combination of loop gains and impulsive input. This formalization leads to a model based optimal estimation of background activity (represented by the gains K_i and the variance of the input noise σ²), and detection of phasic events - level 1. This forms the basis for the extraction of features, which are then used in a symbolic description of the original signals - level 2. The situations to be compared comprise signals from: a) phase 2NREM not followed by REM (baseline); b) phase 2NREM immediately preceding REM. Stochastic context-free grammars are used to describe the temporal organization of symbols within each situation (level 3). Figure 4 gives an example of a symbolic description of sleep EEG at level 2.
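For intuition only, the following toy forward simulation generates a signal of the kind the level-1 model describes: each dominant rhythm is obtained by band-pass filtering a common white-noise-plus-impulses input and weighting it by a loop gain. It is not the authors' estimator; the band edges, filter order, sampling rate and all parameter values are our own assumptions for the sketch.

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_eeg(gains, fs=100.0, seconds=30.0, impulse_rate=0.2,
                 rng=np.random.default_rng(1)):
    """Toy generator in the spirit of the level-1 model: each rhythm is a
    band-pass filtered copy of a white-noise-plus-impulses input, weighted
    by its loop gain K_i."""
    bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
             "sigma": (12, 16), "beta": (16, 30)}
    n = int(fs * seconds)
    v = rng.standard_normal(n)                      # white Gaussian noise v(t)
    p = np.zeros(n)                                 # impulse-like input p(t)
    p[rng.random(n) < impulse_rate / fs] = 10.0
    x = np.zeros(n)
    for name, (lo, hi) in bands.items():
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        x += gains.get(name, 0.0) * lfilter(b, a, v + p)
    return x
```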

a b b b a a b b b a a b b b d b c b d d d d d b b d d d b b

Figure 4: The stochastic sleep EEG model allows the estimation of dominant rhythmic activities and detection of phasic events. These are labeled with symbols, leading to a string description. In the picture, symbol "b" is associated with the detection of spindles (sigma activity occurring in bursts) and "c" is a K-complex.

3. METHODOLOGY

The proposed methodology, detailed along this section, is summarized in figure 5.

[Block diagram: for each of the two situations (2B, baseline; 2R, before REM), single-channel EEG records undergo estimation of model parameters and phasic events detection, conversion into string description, and grammar inference, yielding the grammars G_2B and G_2R; an arbitrary sample is converted into a string, recognized against both grammars, and a decision is made.]

Figure 5: Schematic description of the proposed methodology.

3.1. Data sets

Two data sets are produced by selection of variable length sequences of single channel EEG signals from stage 2NREM (see figure 5): 2R - immediately preceding REM; 2B - not preceding REM (baseline).

3.2. Estimation of EEG model parameters and detection of phasic events

The stochastic sleep EEG model previously referred to allows the estimation of the dominant background activity and the detection of phasic events, namely: 1. Estimation of model parameters, specifically the loop gains k_δ, k_θ, k_α, k_σ, thus characterizing the dominant rhythmic activities, and the variance of the driving white noise, σ². Given the observations (the raw signals in the data sets), the system parameters are estimated according to the maximum likelihood criterion [9]. 2. Detection of transient activity. The detection of phasic events implements a combination of Bayesian and statistical tests [9].

3.3. Conversion into string description

From the previous model a large set of EEG features, concerning background and transient activities, can be extracted. Each feature is labeled with a distinct symbol, thus providing a string description of the original data. Previous work and empirical evidence suggest a categorization of background activity and transient events as features more likely to embody predictive power.

3.4. Grammar inference

Following the hierarchical model presented in the previous section, a stochastic context-free grammar G = (V_N, V_T, P_s, σ) (where V_N and V_T are the sets of nonterminal and terminal symbols, respectively, P_s is the set of productions and σ is the start symbol) is inferred for each data set using Crespi-Reghizzi's method [10].
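As an illustration of the level-2 conversion of section 3.3, labelling each epoch with the Greek symbol of its dominant estimated rhythm, with extra symbols spliced in for detected phasic events, might look as in the sketch below; the function name, data layout and event labels are assumptions made for the example (the Example section later uses only the dominant-rhythm symbols).

```python
GREEK = {"delta": "δ", "theta": "θ", "alpha": "α", "sigma": "σ"}

def to_string(epoch_gains, events):
    """Level-2 conversion: for each epoch emit the symbol of the band with the
    largest estimated loop gain, then any labels of phasic events detected in
    that epoch (e.g. 'b' for a spindle burst, 'c' for a K-complex)."""
    symbols = []
    for t, gains in enumerate(epoch_gains):     # gains: dict band -> estimated k
        symbols.append(GREEK[max(gains, key=gains.get)])
        symbols.extend(events.get(t, []))       # events: dict epoch -> list of labels
    return " ".join(symbols)
```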

Table 1: Construction of structural descriptions. a) String sample. b) Structural description.

a) s_1 s_2 ... s_n

b) 1- No a priori information: [s_1 s_2 ... s_k [s_{k+1} ... [ ... ] ... ]]
   2- Detecting regularities using Solomonoff coding: [s_1 s_2 ... s_k [s_{k+1} ... s_{k+l}] ... [s_{n-m} ... s_n] ... ]

Structural samples are produced by (see table 1): 1- forcing a temporal alignment in the productions; 2- introducing a priori information by devising subpatterns in the

symbol sequences using Solomonoff's coding [8]. As a result, inferred grammars are of the type

A → a_1 ... a_p B | a_1 ... a_l,    A, B ∈ V_N, a_i ∈ V_T,

with recursive constructions. The estimation of the probabilities of the rules is based on the method of stochastic presentation, using a maximum likelihood criterion [10, 3].

3.5. Recognition / Prediction

The prediction of REM involves the parsing of arbitrary EEG string samples according to the various grammars. Given the grammars G_{2a} = (V_N, V_T, P_a, σ), a = R, B, and an arbitrary string x, the probabilities P_a(x|G_{2a}) = Σ_{y ∈ V_T* x} P(y|G_{2a}), i.e. the total probability that the SCFG generates strings having x as suffix, are computed using an iterative algorithm based on dynamic programming principles [8]. This algorithm explores the particular structure of the grammars (in a fashion similar to the one described in [11] for grammars in CNF form), has O(mn) time complexity, with m = |V_N| and n the length of x, and is described subsequently. The computed probabilities are used to predict REM occurrence according to a Bayesian decision criterion.

3.5.1. Calculation of suffix probabilities

Due to space limitations the derivation of the following expressions will not be presented. Let V_N be an ordered set with elements {σ, H_1, ..., H_{m-1}}, and consider the following notation:

Pr(H_i ⇒* H_l) = Σ_{γ ∈ V_T*} Pr(H_i ⇒ γ H_l)     (1)

Pr(H_i ⇒ w_1 ... w_n) = probability of all derivation trees with root in H_i that generate exactly w_1 ... w_n.     (2)

For the computation of the suffix probability Pr(σ ⇒ V_T* w_1 ... w_n), the definition of the (m − 1) × n matrix

P_u[i, j] = Pr(H_i ⇒ w_j ... w_n)     (3)

is required. This is iteratively computed in O(mn) time according to the following steps:

1. For all H_i ∈ V_N − {σ}:

P_u[i, n] = Pr(H_i → w_n)     (4)

2. For k = 1, ..., n − 1 and for all H_i ∈ V_N − {σ}:

P_u[i, n − k] = Pr(H_i → w_{n−k} ... w_n) + Σ_l Σ_{j=0}^{k−1} Pr(H_i → w_{n−k} ... w_{n−k+j} H_l) · P_u[l, n − k + j + 1]     (5)

The suffix probability is then obtained as follows:

1. Off-line calculation of the matrix

Q_R = V_R [I − V_R]^{-1}     (6)

where I is the identity matrix and V_R[i, j] = Pr(H_i → H_j).

2. On-line calculation of the probabilities of exact derivation P_u[i, j], i = 1, ..., m − 1, j = 2, ..., n, as detailed above.

3. For all H_i ∈ V_N − {σ} (see figure 6), compute Pr(H_i ⇒ V_T* w_1 ... w_n) as a double sum, over the nonterminals H_l and the positions k, of terms of the form Pr(H_i → ... w_{1+k} H_l) · P_u[l, 2 + k].     (7)

4. Finally,

Pr(σ ⇒ V_T* w_1 ... w_n) = Σ_i Pr(σ ⇒* H_i) · Pr(H_i ⇒ V_T* w_1 ... w_n)     (8)
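The decision stage itself is simple once the two suffix probabilities are available; a minimal sketch of the Bayesian comparison described in section 3.5 is given below. The function name is invented and the equal prior is an assumption for the example.

```python
import math

def predict_rem(logp_2R, logp_2B, prior_2R=0.5):
    """Bayesian decision between 'before REM' (2R) and 'baseline' (2B), given
    the log suffix probabilities of the observed string under the two grammars."""
    log_post_2R = logp_2R + math.log(prior_2R)
    log_post_2B = logp_2B + math.log(1.0 - prior_2R)
    return "2R (REM predicted)" if log_post_2R > log_post_2B else "2B (baseline)"
```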

Figure 6: Terms of expression (7).

4. EXAMPLE

The data sets consist of samples from a population of seven normal subjects. Baseline data is divided into two sets: set 1, consisting of 18 samples, gathers segments of EEG from stage 2NREM not immediately preceding REM, but not excluding the situation of a certain proximity; set 2 uses a more rigorous selection, consisting of 9 samples. For data concerning the period before REM, 28 samples were selected. As a first, simplified approach, the extracted features were the dominant rythmic activities, estimated according to the stochastic sleep EEG model described previously. A string description was obtained by labeling these activities with the associated greek symbols.

Table 2: Terminal symbols: δ, θ, α, σ. String length range: 80 - 461.

                      Set 1           Set 2
                      2B     2R       2B     2R
# Samples             18     28       9      28
# Inferred Rules      62     86       49     86
Pc (resubst.) (%)     5.55   7.14     0      0

Table 2 shows some of the results obtained when no a priori information was used in the inference process. The first two rows indicate, respectively, the number of samples in each data set and the number of rules of the corresponding inferred grammars. Two misses and one false prediction of REM occurred in the classification of the samples in set 1. The estimates of the probability of error based on the resubstitution method are 7.14% for the classification of signals preceding REM, and 5.55% for baseline data. The second sample set produced no incorrect classifications of the training data.

Figure 7: Graph associated with 2B grammar. Figures 7 and 8 provide graph descriptions of the inferred grammars for data set 2, showing different organization structures. On these graphs, nodes correspond to dominant rythms (except for the start node Σ and the end node T) and directed arcs correspond to transitions, having an associated probability; such probability is graded in four levels represented by different line patterns, as shown in the righthand side of the figures.

2R (26 samples). (Arc probability levels: 0.75-1.00, 0.50-0.75, 0.25-0.50, 0.00-0.25.)

Figure 8: Graph displaying the grammar associated with the 2R data.

5. CONCLUSIONS

A hierarchical model for sleep EEG microstructure was presented comprising basic tools for modeling and automatic analysis. A similar approach has been suggested by Sanderson et al. in [12]. The main differences reside in the models implementing the several levels of the hierarchy, in the degree of automation accomplished and in the purpose of the analysis. The design proposed here articulates syntactic modeling of the dominant features of the microstructure of sleep EEG with a stochastic sleep EEG model which is able to describe both background activity and transient activity. The association of these two components in a hybrid pattern recognition system leads to the description of signal dynamics in terms of stochastic rules over a set of symbols having physiological meaning. Special emphasis was put on the methodology used for the prediction of the entrance into REM sleep from stage 2NREM. An example was used mainly for illustration purposes, consisting of a simplified first approach in terms of feature selection. It should be noticed that, although much information was neglected in the presented symbolic description, still the preliminary results obtained point to a confirmation of the predictability hypothesis, revealing different grammatical structures associated with the two situations under study. Ongoing work involves a more careful selection of features for symbolic representation, including phasic events. Also, the use of larger data sets is being foreseen for evaluation purposes.

REFERENCES

[1] A. Rechtschaffen and A. Kales. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages of Human Subjects. U.S. Government Printing Office, Washington DC, 1968.

[2] A. L. N. Fred and J. M. N. Leitão. Use of Stochastic Grammars for Hypnogram Analysis. In Proc. of the 11th IAPR Int'l Conference on Pattern Recognition, pp. 242-245, 1992.
[3] M. G. Thomason. Syntactic pattern recognition: Stochastic languages. In Handbook of Pattern Recognition and Image Processing, chapter 5, pp. 119-142. Academic Press, 1986.
[4] A. L. N. Fred and J. M. N. Leitão. Solomonoff Coding as a Means of Introducing Prior Information in Syntactic Pattern Recognition. In Proc. of the 12th Int'l Conference on Pattern Recognition, 1994.
[5] R. J. Solomonoff. A formal theory of inductive inference, Part I and II. Information and Control, pp. 1-22, 224-254, 1964.
[6] R. Largo, J. M. N. Leitão and T. Paiva. Sleep EEG Patterns Preceding REM. Sleep Research, 20-A, p. 39, 1991 (abstract).
[7] J. M. N. Leitão and A. C. Rosa. Aspects of complexity in sleep analysis. In Complexity in Physics and Technology, pp. 249-262. World Scientific Publishing, 1992.
[8] A. L. N. Fred. Structural Pattern Recognition. An Application to Sleep Analysis. PhD thesis, Instituto Superior Técnico, 1994.
[9] A. C. Rosa, B. Kemp, T. Paiva, F. H. Lopes da Silva and H. A. C. Kamphuisen. A model based detector of vertex sharp waves and K complexes in sleep electroencephalogram. Electroenceph. Clin. Neurophysiol., 78, pp. 71-79, January 1991.
[10] K. S. Fu and T. L. Booth. Grammatical inference: Introduction and survey - part I and II. IEEE Trans. Pattern Anal. Machine Intell., PAMI-8:343-359, May 1986.
[11] F. Jelinek, J. D. Lafferty and R. L. Mercer. Basic Methods of Probabilistic Context Free Grammars. In Speech Recognition and Understanding, Pietro Laface and Renato De Mori (Eds.), Springer-Verlag, pp. 345-360, 1992.
[12] A. C. Sanderson, J. Segen and E. Richey. Hierarchical modeling of EEG signals. IEEE Trans. Pattern Analysis and Machine Intelligence, PAMI-2(5):405-415, September 1980.

Pattern Recognition in Practice IV E.S. Gelsema and L.N. Kanal © 1994 Elsevier Science B.V. All rights reserved.


Discussions Part I

Paper Vamos

Lemmer: You asserted that once you have complete knowledge then you are into a metric problem. I am wondering if you could elaborate on that and why you believe that.
Vamos: In our system we have a possibility of tuning. By tuning the relevance of some symptoms, which in the case of the Medical project is done by doctors, the diagnosis could be optimized. Of course, the very best experts are setting the values, in that way creating the system with us. This is one way of putting expert knowledge into the system.
Kanal: The statement that if you have good or perfect knowledge, then a metric exists, might be misunderstood, because there are many instances when you have good knowledge, but nevertheless symmetry and transitivity do not hold among the relations of interest. So you do not have a real metric in any strict sense of the word. In most human relations, symmetry and transitivity just do not hold.
Vamos: Yes, that is correct. If you take for instance the metric theory of Halmos or Haar, or the Neumann metric, in a strong mathematical sense these are not metrics. But I think that if in a closed world we would have complete knowledge, then there would be a mathematically correct metric and that might be used to improve knowledge.
Holz: You said that much of your work is inspired by human pattern manipulation and, in fact, your prototype comes from Eleanor Rosch. Rosch herself moved away from Prototype Theory in the late seventies and into Family Resemblance Theory as a more accurate model of how people form classes. I wonder why you choose to work with prototypes rather than follow the next step in her research, which was more of a true metric, although not in the strong sense.
Vamos: In our projects we always start with some hypotheses and a hypothesis is a kind of prototype. For the Economic project, we received all the databases of the World Bank and you feel lost in an ocean of data. You have to start with some prototypes. There is


something more: we did not discover the wheel all over again. We are aware of what is done in case-based reasoning and in many other branches of so-called Artificial Intelligence, a term that I do not like. On the other hand we think that we can do something in addition and that is why we work with experts in the cognitive sciences and in graphics. The main thing is having real-life, very big projects, where the complexity is high, not schoolboy excercises. This is the difference with the so-called Artificial Intelligence reports.

Paper Denreux Kappen: You gave us an example of how we can apply Dempster-Schafer (D-S) theory to model imperfect knowledge. I would imagine that in the normal probabilistic framework you can also model imperfect knowledge. Can you give an example which indicates the difference between D-S theory and a normal probabilistic framework? Denceux: This is a much debated question. Some claim that D-S theory can be defined independently of any probabilistic framework, while others prefer to interpret it in the framework of probability theory. For example, you can consider a belief function and a plausibility function as lower and upper bounds for a family of probability distributions. So, you can interpret D-S theory in view of Bayesian statistics. But, from the philosophical point of view it is still an open question. From the practical point of view, it can be shown that in some examples Bayesian analysis and the D-S approach lead to different results. As a tentative conclusion I would say that you can understand D-S theory in view of probability theory, but it is not necessary: in my view, D-S theory cannot simply be reduced to probability theory. Kappen: Is it best then to view D-S theory as an alternative to a probabilistic approach? Denceux: Yes, I think so, but there are bridges between the two approaches. Mulder: Looking at the definitions in D-S theory, it seems a redundant set to me because it is based on evidence and on counter-evidence. The whole problem is solved if in the standard pattern recognition procedure the concept of the unknown class is introduced. Then this whole artificial idea of counter-evidence becomes redundant. To take up the question about expert uncertainty, if an observer has uncertainty, the option of noopinion will solve the problem. Can you comment on my statement that D-S theory represents a redundant theory because people do not think of a full set of hypotheses including the unknown hypothesis?


Denoeux: I think that the concept of an unknown class that you mention is related to the open world assumption. In D-S theory we can also depart from the closed world assumption and introduce a set of unknown hypotheses. I do not think that simply introducing the open world assumption in probability theory, which may raise other theoretical problems, is the answer to all the questions addressed in D-S theory. Mulder: I consider a closed world, where one of the hypotheses is that the object belongs to an undefined or an unknown class. So, this is a closed world but some objects may be assigned the label of the complement set. For such objects the expert does not know the class or he is not interested in it. So, it is a closed world with a rest class, which makes the training set complete. Denoeux: Yes, it is an alternative. But you will never get any evidence pointing to the unknown class, by definition. So I do not see how you can rigorously treat this in a pure probabi­ listic framework. Mulder: The vocabulary changes. If I have defined the unknown class in a training set, I can treat it as a normal class in the classification phase. Normally, in the statistical approach there are no rejects. Data are transformed from measurement vectors into likelihood vectors. The whole idea of reject is another artifact which comes from human, fiizzy thinking. It does look axiomatic, but it is not. It is taking very fiizzy axioms and then applying rigid mathematics to them. Denoeux: D-S theory is an axiomatic approach, just as probability theory is. It must be understood as a model that you may find usefiil to solve certain practical problems. There is no fiindamental difference with probability theory at this level. Vamos: I do not want to go into the theoretical discussions. It was very well discussed in earlier publications of Kanal and many others. I want to stress some practical point. We had many difficulties with the D-S method, both computationally and practically. For a nontrivial problem you may get into severe complexity problems. And I would like to see real life experiments on complex patterns. Is it really more efficient in computation and in discrimination? Denoeux: You are right. D-S may lead to high computational complexity in some applications. We assign a probability number to each subset of the universe of hypotheses, so there may be an exponential increase of computations with the cardinality of the frame of discernment. However, in the framework which I have just proposed, the belief fimctions are simple. In that case, the computations are not heavy. Computation times only


increase linearly with the number of classes. If we have imperfect knowledge of class labels, we can work with arbitrary belief functions. But yet, the complexity depends only on the number of classes which, even in real applications (except in special cases) is generally not very high. So, although what you say is right for some applications, for instance in expert systems, for the applications I consider here, I do not think it is a real limitation. Jain: I have two questions about the performance of the k-NN rule. Your version of the nearest neighbour rule is similar to the distance-weighted NN rule which was published in the late seventies. It was shown that there is no guarantee that the distance-weighted rule will always perform better than the majority rule. You have shown by a few examples that your version of the k-NN rule performs better than the majority rule. Is this true in general? Denoeux: You are right that the method I am proposing has some similarity with the weighted kNN rule, but the methods differ in the way in which they take into account the distances to the NN. Of course, I do not claim that my method works better than the other methods in all cases, which is not even very plausible. All I have is empirical evidence that this method performs well in a number of cases, but I have as yet no theoretical justification for that. Jain: A related question is that one of the reasons why k-NN rules are attractive is that they have some asymptotic properties relating them to the Bayes rule. Is the same true for your decision rule? Denoeux: It would be interesting to apply the same kind of asymptotic analysis to this method, but this remains to be done. However, I think that this method works better than the plain kNN rule when you have few training vectors and in such a situation the asymptotic properties of the k-NN rule are not very useful. Lemmer: I would like to know how you are going to prove any sort of performance characteristics of this method. I understand that once you go beyond Bayesian belief functions, that you must give up the frequency interpretation of the numbers that result from the D-S rule. Denoeux: I have not gone very far in this direction. One approach could perhaps be to use the probability distribution of the distances from the k nearest neighbours and to analyze the finite sample properties of this rule, as done by Fukunaga vsdth the NN rules.


Paper Morris and Kalles

Talmon: You say that costs are less important than the discriminating power of your features. Intuitively I would say that this is true as long as the costs do not differ too much. Can you give any indication what the range of the costs was in your case?
Kalles: The costs that we used in the character recognition problem were handcrafted. For an algorithm extracting geometrical features, the cost was related to the number of cycles, the number of holes, the number of edges, etc. We defined 1.0 as the least expensive cost; a value of 9.0 would be assigned to a process with the highest complexity. While this is a handcrafted approach, I do believe that we would get similar results with real estimates.
Talmon: Have you compared the discriminating power of your costly features with that of your cheap features? I know for instance that in medicine, the more expensive tests are likely to have close to 100% predictability, but you don't want to submit all your patients to such tests. So, in that domain, the costs may be dramatically different and so is the discriminating power.
Kalles: I have not experimented in the medical domain. In the character recognition domain, the estimated costs do not differ by many orders of magnitude.

Paper Hornegger et al. Jain: I see no support for your conclusion that the HMM models are suitable for object recognition, Hornegger: I did not show all results. Jain: But you said there is about 75% classification accuracy. If I take some simple moment features or Fourier descriptors, I think I can do better than that. Hornegger: Of course, and you will get also close to 100% if you use the right Markov parameters. Jain: So, that is what I am saying. Why do you make the statement that HMM is suitable when the results don't support that?


Hornegger: As I mentioned, if you use different features from local forms, you will succeed. For example if you add an additional feature like the length of the feature sequence, you will get recognition rates of 99% for this trivial example as expected. Continuous hidden Markov models cause another problem. We assume that the output density function is modelled by a Gaussian density function and there the problem of initialization comes in.

Jain: My point is that the 2-D recognition problem has been very well studied and the large number of features already tried give good results. Are you trying to solve a specific recognition problem for which existing methods do not work? Hornegger: First we tried some statistical methods to object recognition problems and we started with well-founded hidden Markov models which are used in speech recognition. We studied these techniques for object recognition problems. As I said before, we finished the research with respect to hidden Markov models and chose an EM approach to allow for necessary functions like localization of objects. In the context of HMM you have no possibility to model background features. If you use an EM algorithm, you generally have the chance to model those background features.

Papers Alder, McLaughlin and Alder, Lim et al., McKenzie and Alder Van Dijck: You represent a cube as a point in a 253-dimensional space. In order to recognize objects that have not been in your training set but are similar, you need a distance in this space. Do you use a Euclidean distance, and if so, is it relevant to the problem? And if you use the Euclidean distance, does it satisfy triangular inequality, Pythagoras' theorem, etc.? Alder: In the case of the cubes, and other objects that we treated, for example aeroplanes, the descriptor at the top level can usually be accomplished quite well by simply taking a hypeφlane. When you compute the distance to a hyperplane, you can take the Euclidean distance. For example, in the case of aeroplanes, we scaled them between 60% and 140% and then we took one of the aeroplanes and scaled it down to about 15% and looked to see which hyperplane it was closest to. It was getting a good classification, well outside the usual range down to the limits of quantization errors. Your question is a very significant one, especially in cases of occlusion. If the system is trained on squares and triangles and then it is given three sides of a square, and we ask whether it is more likely to be a square or a triangle, then the metric becomes quite important. If you have trained on perfect squares and triangles, there are no grounds to prefer one over the other. If, however, you have given it many squares with some of the sides a bit on the short side, then a triangle may be inteφreted as a square with one side reduced to


nothing. You may use the data fitted locally by quadratic forms, and regard that as a local estimate of a metric tensor to do the calculation of the distance. Lemmer: I wonder if you can give us some insight into where the power comes from in Upwriting this into a higher dimensional space. Starting with a 512x512 image you are automatically in a 32K dimensional space and that does not do you any good. Does the magic lie in the fact that you are using ellipses? Alder: The answer is that if you have for example two slightly shifted images, the points in 512x512 space will have no relationship to each other. It is not that the ellipses have anything to do with it, it is in how the process proceeds from the bottom up to extract a higher order structure. So, whereas we might end up in a space of some awfiil dimension, you will be glad to hear that this trade-off of the number of data points versus the dimensionality does produce a profit. There is data compression involved in this. If the structures have been close together all the way along the line, the points resulting from our method will also be close together. So, if you are looking at pictures of aircraft, similar aircraft will end up close together. So there is a topological preservation involved, which is not easily extracted in general. McLaughlin: In this representation, the metric structure is preserved, so that pictures that look similar are mapped onto similar points in space. This has been the major motivation for developing the process. Bunke: One of the powerful concepts in the syntactic approach is recursion. I have not seen this concept in your approach and I wonder if it is in principle possible to accomodate it. Alder: No, nor should it be. Recursion is a very interesting business altogether because people use it, or so linguists say. For example in nursery rhymes, such as "The house that Jack built". If you look at recursion in natural languages, you seldom find it going to more than three or four levels. Recursion is also used in computer languages such as Pascal and C. Now this is interesting, because they run on finite state machines. When you get the "out of stack" message, it is telling you that you cannot do recursion on a computer. You can only simulate it to low levels. Now it is very interesting that both natural language and the computer languages that we find convenient, do allow us to simulate low levels of recursion. And the question remains as to why our brains should be organized so that recursion appears to be such a handy tool. But nevertheless I say that we do not have recursion. There is a background for the work that we have been describing in terms of a neural model. I have evaded that because it is a just as big new can of worms. In that can of worms, I have certain indications as to why brains simulate low level recursion. But this is a very complex issue.


Paper Fred et al. Bunke: A theoretical alternative to stochastic grammars is a hidden Markov model. Could you comment on that for your application? Fred: Formally there is an equivalence betv^een finite state grammars and hidden Markov models, but of course the same equivalence does not exist for context free grammars. In this example, without using any a priori information, I derived a structure which is very similar to finite state grammars. In this modelling, and in particular in using this type of inference methods based on structural samples we can introduce a priori information. So, if you have some way of detecting regularities in the patterns and introducing these in the process of inference, you wdll of course get other types of grammars which cannot be described by hidden Markov models, and contain much more information than those. This proved to be significant in our experience vsdth previous work on hypnogram classification. There was a great improvement in discrimination between the two situations of normal and psychiatric patients, just by using that a priori information.

Pattern Recognition in Practice IV E.S. Gelsema and L.N. Kanal © 1994 Elsevier Science B.V. All rights reserved.

On the problem of restoring original structure of signals (images) corrupted by noise

Victor L. Brailovsky (a) and Yulia Kempner (b)

(a) Departments of Computer Sciences and Mathematical Statistics, Tel-Aviv University, Ramat-Aviv, 69978, Israel
(b) Department of Computer Sciences, Tel-Aviv University, Ramat-Aviv, 69978, Israel

In the first part of this paper the problem of detecting the piece-wise-linear structure of 1D and 2D signals (images) corrupted by heavy Gaussian noise is considered. In many cases the attribute function (response function) is continuous at the boundary points (lines) and only a change of its derivatives indicates the transition from one region of smoothness to another. In the presence of heavy noise it is difficult to detect these changes with the help of a local operator, and an approach taking into account the global behaviour of the signal should be introduced. The suggested approach is based on a combination of a least square (LS) criterion, dynamic programming and a probabilistic estimate for model selection. In the second part of the paper signals corrupted by spikes are considered. In these cases the approaches based on the use of LS estimators are not efficient. It is shown that in the case of piece-wise-linear signals corrupted with such noise the potentialities of known methods of Robust Regression are restricted. A modification of the Hough Transform is introduced as a robust method for outliers detection. Following this procedure the outliers may be detected and excluded. The piece-wise-linear structure of 1D signals may be detected either by associating different linear pieces of the signal with the corresponding maxima in an accumulator array or (if the level of additional background noise is relatively high and the maxima are blurred) with the help of the above described methods applied to the signal cleaned of outliers. Results of experiments with 1D signals and 2D images are presented.

1. INTRODUCTION

The problem of detecting signal or image structure and its interpretation is a central topic in signal and image analysis, respectively. The solution is often based on a technique of image (signal) segmentation, i.e. on a decomposition of the signal into homogeneous regions. The notion of homogeneity has different meaning in different situations, but such a region is usually considered as an element of structure and different regions correspond to different elements of signal or image structure. There are two standard approaches to solving the image segmentation problem [1]: one based on edges and boundaries detection and another one based on direct restoring of homogeneous regions on the image. When the level of noise is high enough both of the

approaches, usually based on the use of local operators, turn out to be rather inefficient. For example, for the noisy range images that are considered in this paper, it is natural to define the homogeneous regions in such a way that within such a region the depth function behaves in a regular way (e.g. constant or linear) and within different regions it behaves differently. So, it is not unlikely that the depth function is continuous at the points of a boundary and only a change of its derivatives reflects the transition from one region to another. This change is very difficult to detect by local procedures in the presence of noise. Another approach to the segmentation problem is based on the use of global information and global constraints inherent in given types of problems. One of the ways to achieve it is to consider a global segmentation problem based on local interactions between neighboring pixels. This approach leads either to MRF models [2] or to some cooperative relaxation algorithms [3]. Another way to take into account the global information is based on regularization theory [4], that deals with a kind of trade-off between quality of approximation of an attribute function and quality of smoothness of the solution within the homogeneous regions (and/or form of the boundaries between them). We mentioned a broad class of different methods and here we can only note that the majority of them suffers from problems connected with convergence and from the necessity of fine adjustment of tuning parameters. In this article we consider another way of using global information, namely analysing the image row by row and column by column. Each line is a 1D signal corrupted by noise. In this work we consider the case when a real range image is the corruption of an underlying piece-wise-smooth image with an unknown number of flat patches. In the first part of the article we consider noise close to Gaussian white noise; in the second part some problems connected with the presence of noise in the form of spikes (outliers) will be considered. Therefore, the 1D signals that should be analysed in each row and column are piece-wise-linear signals corrupted by noise. The purpose of the analysis is to determine the number of regions of linearity (it is the notion of the homogeneous regions in this case), their location (boundaries) and to find a good approximation of the depth function within each of the regions of linearity. These problems will be considered in Sections 2-4. The problems of restoring the original structure of signals corrupted by spikes are considered in Section 5.

2. PIECE-WISE-LINEAR REGRESSION

2-1. Consider the array of n sites X = [1, 2, ..., n]. Let y_1, y_2, ..., y_n be the set of values of a response function. One assumes the response function has the form y_i = f(x_i) + ε_i, i = 1, 2, ..., n. Here f(x) stands for a function with the following piece-wise-smooth structure. The array X is divided into a number (k) of regions. Inside each of the regions the function has a simple form (constant, linear); one calls them the regions of smoothness. Meanwhile, inside different regions of smoothness, the dependence may be different and there is no requirement of continuity between the neighboring regions. The number of the regions k and the location of the change-points (knots) are unknown. ε_i refers to a random error (noise that corrupts the underlying signal f(x)). The objective is to discover the piece-wise-smooth structure of the underlying signal


f(x) with the help of the analysis of the corrupted signal y_i. One should find the number of regions of smoothness, the locations of the knots and obtain a good approximation of the signal inside each of the regions.

2-2. Assume for a moment that the number of regions of smoothness k and the locations of the knots x^1, x^2, ..., x^{k-1} are known. So, the k regions are

    D_1 = [1, x^1); \quad D_2 = [x^1, x^2); \quad \ldots; \quad D_k = [x^{k-1}, n].    (1)

The round bracket means that the corresponding knot does not belong to the region; the square bracket means the opposite. Define a linear regression for a region D_p (p = 1, 2, ..., k) as a linear function:

    y_{D_p}(x) = a_1 x + a_0.    (2)

The regression is linear in the coefficients a_i, and the estimates of the coefficients may be obtained from the least square principle:

    \sum_{x_i \in D_p} [y(x_i) - y_{D_p}(x_i)]^2 = \min.    (3)

2-3. If the locations of the knots x^1, x^2, ..., x^{k-1} are not known, one can obtain their estimates from the least square principle as well. Let some values of the knots be fixed. For p = 1, 2, ..., k one can calculate the estimates of the regression coefficients for the polynomials (2) and define the error of the piece-wise regression as

    E(x^1, x^2, \ldots, x^{k-1}) = \sum_{p=1}^{k} \sum_{x_i \in D_p} [y(x_i) - y_{D_p}(x_i)]^2.    (4)

The locations of the knots may be found as

    \min_{x^1, \ldots, x^{k-1}} E(x^1, x^2, \ldots, x^{k-1}) = RSS(k).    (5)

Technically, the minimum (5) may be obtained with the help of dynamic programming. Such an approach was applied to solving the piece-wise regression problem in [5].

2-4. With the help of this approach one can obtain the best solution according to the criterion (4), (5) for a given k. In fact one often does not know the value of k. However, in many cases one can assume that k ≤ k_m, where k_m is a known quantity. One can obtain the best solutions for k = 1, 2, ..., k_m, and the problem is how to select from these k_m solutions the one that corresponds to the underlying model. It is easy to see that the square error of approximation (4), (5) decreases while the number of regions of linearity grows. So, selecting the best solution according to the least square error will automatically lead to selecting the most complex solution with k = k_m. This phenomenon is called overfitting and it is discussed e.g. in [6]. The discussed problem is a typical example of a model selection problem; the most simple model corresponds to k = 1, the most complex one to k = k_m. Among other examples of the model selection problem one can mention selecting the best subset of regressors in Regression Analysis, selecting the optimal number of clusters in Cluster Analysis, many problems of signal and image segmentation and so on. Different approaches to solving this problem are discussed in [4], [6].
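A minimal sketch (Python; hypothetical helper names, not the authors' code) of how the minimum in (5) can be obtained by dynamic programming over candidate knot positions, assuming a simple least-squares fit per segment:

```python
import numpy as np

def segment_cost(y, i, j):
    """Least-squares residual of a single line fitted to y[i:j] (0-based, j exclusive)."""
    if j - i < 3:
        return 0.0  # too few points to leave a residual worth counting
    x = np.arange(i, j)
    a1, a0 = np.polyfit(x, y[i:j], 1)
    r = y[i:j] - (a1 * x + a0)
    return float(np.dot(r, r))

def best_piecewise_fit(y, k):
    """Return (RSS(k), interior knot list) for the best split of y into k linear regions, cf. (4)-(5)."""
    n = len(y)
    INF = float("inf")
    # dp[p][j]: best error using p regions to cover the first j sites; back[p][j]: start of last region
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = dp[p - 1][i] + segment_cost(y, i, j)
                if c < dp[p][j]:
                    dp[p][j], back[p][j] = c, i
    # recover knot positions by backtracking
    knots, j = [], n
    for p in range(k, 0, -1):
        j = back[p][j]
        knots.append(j)
    return dp[k][n], sorted(knots)[1:]  # drop the leading 0; the rest are the k-1 knots
```

This brute-force version is O(k n^2) fits; precomputing the per-segment costs would speed it up, but the recursion is the point of the sketch.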

In recent years the author developed a probabilistic approach to the model selection problem, based on the comparison of results obtained for a real problem, described by a set of experimental data, with those obtained for unstructured sets of data (e.g. pure noise). The above mentioned overfitting effect is connected with the fact that when one fits a model to data, one also fits it to all the irregularities and fluctuations that are characteristic for the given data set. We use the fact that, with the help of Monte-Carlo sampling from the source of unstructured data and subsequent analysis of the samples, one can obtain information concerning the possible scale of the fluctuations in sample sets of a given size for the problem under consideration, and thereby obtain a basis for the separation between fitting a real property of the underlying model and fitting the irregularities of a given set of experimental data. The approach was successfully applied to the problem of finding the best subset of regressors [7-9] and the best clustering [10]. The first application of this probabilistic approach to the problem of piece-wise-linear regression was reported in [11]. Here one describes a faster and more efficient algorithm, which was applied to the problem of 2D image analysis described in Section 1 [12].

3. MODEL SELECTION PROBLEM

3-1. Here one presents the probabilistic approach to the model selection problem as applied to the piece-wise-linear regression problem of Section 2. One defines the unstructured data set on the same array of n sites X. Now, instead of the values of the response function y_1, y_2, ..., y_n, one considers the values of an artificial function ξ_1, ξ_2, ..., ξ_n, which are independent random variables from N(0,1). For this data set one can find the best solutions for k = 1, 2, ..., k_m exactly as it was done for the response function in Section 2 (see (4), (5)). For each of the best solutions one obtains the value RSS_a(k) according to (5) and the value of the relative improvement of fit when one makes the transition from a simpler model to a more complex one, i.e. from k to k + 1:

    \frac{RSS_a(k) - RSS_a(k+1)}{RSS_a(k)}, \qquad k = 1, 2, \ldots, k_m - 1.    (6)

One can repeat this procedure with many samples of the artificial function; as a result, for each pair k, k+1 (k = 1, 2, ..., k_m - 1) one obtains the probability distribution of the test-statistic (6). Now one should fix a significance level α and, for each of the above mentioned probability distributions, find the percentage point θ_{k,k+1} such that the probability for an unstructured data set to obtain a relative improvement, i.e. a value of the test-statistic (6), larger than θ_{k,k+1} while making the transition from the model with complexity k to that with k + 1, is equal to α:

    \Pr\left\{ \frac{RSS_a(k) - RSS_a(k+1)}{RSS_a(k)} > \theta_{k,k+1} \right\} = \alpha.    (7)

So, at this stage one gets the table of percentage points θ_{k,k+1}, k = 1, 2, ..., k_m - 1.

3-2. Next, one should come back to the given data set and calculate the relative improvement of fit (6) while making the transition from the model with complexity k to

that with k + 1:

    t_{k,k+1} = \frac{RSS(k) - RSS(k+1)}{RSS(k)}.    (8)

Beginning with the most simple model (k = 1), one should check if

    t_{k,k+1} > \theta_{k,k+1}.    (9)

If (9) holds, i.e. the relative increase is larger than the threshold, it means that the improvement is significant and we can continue and compare the next increase t_{2,3} with the corresponding threshold θ_{2,3}, and so on. If at some stage it appears that (9) does not hold, it means that the corresponding improvement is not significant; one should stop the process and the last model achieved is the solution.

3-3. Let us note some properties of the procedure. Firstly, the only fitting parameter here is the significance level α, which is usual for many formulations of hypothesis testing. This parameter may easily be chosen for any problem (standard choice accepted in statistics, α = 0.01 or 0.05). Secondly, there are many problems in image and signal analysis in which one deals with different sets of data given in a standard format, as in the 2D image segmentation problem discussed in Section 1, where one should solve the same piece-wise regression problem for each row and column. In this case the calculation of the α-percentage points θ_{k,k+1} may be performed, regardless of the form of the signal (response function), as preprocessing, once for the whole set of problems. As a result one obtains a fast algorithm of image analysis.

3-4. To gain an impression of the efficiency of this algorithm, consider the following example. An ideal underlying signal f(x) is defined on an array of 50 sites and presented in Figure 1a. To obtain the response function y_i (i = 1, 2, ..., 50), one adds to each value of the underlying signal f(x_i) a random error ε_i, which in our case is obtained as follows. With the help of a random number generator one generates 50 independent values τ_i from N(0,1). The pattern of noise [τ_i], i = 1, ..., 50, is presented in Figure 1b. The random error is ε_i = c·τ_i; the set of experiments is performed for c = 0; 0.5; 0.75; 1.0; 1.1; 1.3; 1.5 for this pattern of noise. After adding the random error to each value of the underlying signal one obtains the values of the response function (corrupted signal) [y_i], i = 1, ..., 50. The response function for c = 0.75 is presented in Figure 1c and the same for c = 1.5 in Figure 1d.

For each response function one obtains the best solutions for k = 1, 2, 3, 4, 5, 6, 7 according to (5). To select the optimal solution one used the above described procedure with significance level α = 0.05 and, for comparison, the criterion based on the minimum cross-validation estimate. For the considered pattern of noise (Fig. 1) the cross-validation-based approach properly selects k = 4 (four regions of smoothness, see Fig. 1a) for c = 0.5 up to 1.1. For c = 1.3; 1.5 the cross-validation selects k = 3. For c = 0 it selects k = 5, 6, 7 as the optimal solutions. The above described probabilistic estimate selects the right solution for all values of c (from 0 up to 1.5 inclusively). Similar experiments with other patterns of noise and other ideal signals demonstrate approximately the same results. At the same time, as was mentioned above, the algorithm considered here is much faster than the cross-validation one (if the threshold points θ_{k,k+1} are obtained at the preprocessing stage).
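A rough sketch (Python; hypothetical function names, not the authors' code) of the threshold-based selection in (6)-(9), assuming the `best_piecewise_fit` helper from the previous sketch returns RSS(k):

```python
import numpy as np

def percentage_points(n, k_max, alpha=0.05, n_samples=500, rng=None):
    """Monte-Carlo alpha-percentage points theta_{k,k+1} from pure-noise signals, cf. (6)-(7)."""
    rng = rng or np.random.default_rng(0)
    stats = np.empty((n_samples, k_max - 1))
    for s in range(n_samples):
        xi = rng.standard_normal(n)                     # unstructured data: N(0,1) samples
        rss = [best_piecewise_fit(xi, k)[0] for k in range(1, k_max + 1)]
        stats[s] = [(rss[k - 1] - rss[k]) / rss[k - 1] for k in range(1, k_max)]
    return np.quantile(stats, 1.0 - alpha, axis=0)      # theta_{1,2}, ..., theta_{k_max-1,k_max}

def select_model(y, k_max, theta):
    """Increase k while the relative improvement (8) exceeds the threshold (9)."""
    rss = [best_piecewise_fit(y, k)[0] for k in range(1, k_max + 1)]
    k = 1
    while k < k_max:
        t = (rss[k - 1] - rss[k]) / rss[k - 1]
        if t <= theta[k - 1]:
            break
        k += 1
    return k
```

Because the thresholds depend only on the array length and k_max, `percentage_points` can be run once as preprocessing and reused for every row and column, which is the source of the speed-up claimed above.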

Fig. 1. Experiments with a 1D signal. a) Ideal underlying signal f(x_i). b) Pattern of noise τ_i. c) Response function y_i = f(x_i) + 0.75·τ_i. d) Response function y_i = f(x_i) + 1.5·τ_i.

4. APPLICATION TO IMAGE ANALYSIS

4-1. The version of the algorithm described in Sections 2-3 for finding the piece-wise-linear structure of 1D signals is fast enough to make possible its application to 2D image analysis (see Section 1). As a result of such an application, for each row and column one obtains the number of regions of smoothness and the location of border points. Due to the influence of noise, errors in determining the proper number of regions of smoothness as well as in the location of boundary points are possible. Algorithms to correct these errors were developed. They are based on properties of the original object that induced the considered 2D image.

• If one gets a certain number of regions of smoothness in a sequence of neighboring rows (columns), it is impossible to get in the middle of the sequence a different number of regions. If such a situation takes place it indicates an error that should be corrected according to a majority scheme.

• The homogeneous regions in 2D images and the boundary lines should be topologically connected; the latter should be continuous (in the considered case even linear) and closed. These considerations were used for a further correction of the boundary lines.

4-2. Due to the fact that, under the conditions considered here, each homogeneous region in a 2D image corresponds to a plane patch in the original object, our purpose is to restore the parameters of these planes. To find the parameters of the approximating planes it is sufficient to use only a certain group of pixels which with high probability belongs to the interior of a region (we call it the trustworthy interior of a region). Obtaining the trustworthy interior implies deleting a rather thick layer of pixels around the boundaries. After this, using the least square criterion, we find the parameters of the planes that deliver the best fit to the depth function within each of the regions. In the next stage one obtains the intersection lines of adjacent planes and their projections onto the plane of the image. This gives the information necessary for the final segmentation. The example presented in Fig. 2 demonstrates the results of the different stages of the algorithm.

5. ROBUST REGRESSION AND HOUGH TRANSFORM

5-1. In the work reported in Sections 1-4 we considered a signal (an image) corrupted by noise in the form of independent Gaussian random variables. However, signals are often corrupted by spikes (outliers). In these cases any approach based on the least squares (LS) of residuals is not efficient. For this reason the methods of robust statistics, which are not so easily affected by outliers, have attracted considerable attention [13]. A promising approach in applications of regression to image and signal analysis is based on minimization of the median of squares of residuals (LMS). This estimator is very robust; its breakdown point reaches 50% (meaning that up to 50% of the samples may be corrupted by outliers and the LMS algorithm still gives a reasonable regression). Meanwhile, the property of this algorithm of ignoring up to 50% of the data appears to be unsuitable in the case of piece-wise regression. Consider the ideal signal in Figure 1a and an interval [1, x_i].

Fig. 2. Experiments with a 2D range image of a truncated pyramid. a) Ideal range image of the truncated pyramid. b) Corrupted image. c) Border points found by the algorithm of Section 3. d) Border points corrected by the algorithm of Section 4-1. e) Borders after correction by the algorithm of Section 4-2. f) Restored range image of the truncated pyramid.

When x_i is just moved into the second region of smoothness (the left slope side of the triangle), but the number of points from the second region is less than 50% of the total number of points, the algorithm is insensitive to the change of the functional dependence and the results of the regression are exactly the same as those obtained when the interval is completely within the first region of smoothness. This insensitivity to the change of dependence prevents using the LMS algorithm (instead of the LS one) in a dynamic programming process similar to the one described in Section 2-3. In this example we did not even consider outliers; the approach does not work even when there are no outliers at all. Meanwhile, it is possible to use a less robust but more sensitive estimator, like the least sum of absolute values of residuals or M-estimators [13]. However, as a result of the greater sensitivity, these estimators are more affected by outliers and there are many problems connected with their use in the above mentioned dynamic programming process.

5-2. The Hough transform (HT) is a powerful robust method of structure identification which has gained wide-spread acceptance in image analysis [1]. Consider briefly the outline of the HT as applied to a piece-wise-linear signal. A straight line that corresponds to a linear part of a signal may be described by the equation

    y = ax + b,    (10)

where (x, y) are the coordinates of data points, a is the slope and b is the intercept of the line with the Y axis. In the classical HT each signal point (x, y) is mapped onto all points of the parametric space with coordinate axes a, b that are compatible with (10). These points form a locus in the form of the straight line

    b = -xa + y,    (11)

where -x is the slope and y is the intercept with the b axis. If one considers a number of points that belong to the line (10), all the lines (11) that represent these points in the parametric space intersect in the point whose coordinates a, b equal the values of the parameters in (10). If the parametric space is quantized and one counts the number of lines (11) intersecting a certain cell, one gets a maximum in the cell that corresponds to this point of intersection. Points of this kind that correspond to different pieces of a piece-wise-linear signal form different maxima in the accumulator array. If at a signal point one has a spike instead of a regular signal value, this spike does not contribute to the corresponding maximum. If, nevertheless, the maxima in the accumulator array are still significant, one can detect them and obtain information about the different pieces of the signal. If one identifies the data points that contribute to a certain maximum, one obtains the list of points that belong to a particular region of smoothness minus the outliers. In this way the HT-based approach may detect outliers.

5-3. In fact, the maxima that one obtains with the help of the classical HT are rather blurred, especially when the signal is corrupted by noise. An alternative way to find the parameters a, b that correspond to a linear part of the signal (10) is to take a pair of points lying on the straight line (10), to write for each of them the equation (10) and to obtain the values a, b as the solution. Such a pair of points increments only one counter in the accumulator array, the one that corresponds to this solution [1]. If one deals with piece-wise-linear signals, the way of choosing these pairs of points appears to be important. E.g., if one chooses the pairs randomly, the corresponding maxima in the accumulator array are still blurred. In [14] a special approach to selecting these pairs of points is suggested. For each signal element (x_i, y_i) one considers the pairs of points formed by this element together with other elements from a certain neighborhood of it along the X axis. Such a local approach to pair formation appeared to be the most effective in the problem under consideration, where the relations in a neighborhood are the most important for finding the piece-wise structure of a signal. Such an approach leads to much more distinctive maxima in the accumulator array.

In [14] these results are illustrated with an example of a signal of trapezoidal form. The total number of sample points (with equal spacing along the X axis) is equal to 50 (x = 1, 2, ..., 50). The left slope side of the trapezoid is located between the points (1,0) and (20,20), the upper base between (20,20) and (35,20), and the right slope side between (35,20) and (50,0). So, in the accumulator array there should be three maxima that correspond to these three lines. In this study the following model of signal contamination was considered. In a random way one selects m sample points (out of n = 50) and for each of them one substitutes the signal value by a random value uniformly distributed on the interval [-100, 100]. So, the percentage of outliers is (m/n)·100%. The result of applying the HT using the local approach to pair formation is as follows: even when the percentage of outliers is as large as 60%, one can still identify three correct and distinctive maxima in the accumulator array. If one performs such experiments (even with a smaller percentage of outliers) with the HT using the random approach to pair formation, the picture in the accumulator array is chaotic. The situation with the classical HT is similar.

5-4. If one gets the correct and distinctive maxima, their location determines the parameters of the lines corresponding to the different linear parts of a signal. In many cases this decomposition of a signal into regions of smoothness that correspond to these maxima may be quite satisfactory. It remains either to use the coordinates of the maxima as parameters of the regression lines within the corresponding regions of smoothness, or to perform regression within each of the regions by the LS method with the outliers excluded. What is more, to get a more precise result one can use the results of the HT for the selection of outliers only, and then use the algorithm described in Sections 2-3 for restoring the piece-wise structure of the signal. As was discussed in [14], for outlier detection one should introduce a measure of association of a given sample element with a given region of a maximum (a cell or a number of cells in the accumulator array). For each sample element one considers the same pairs of points that were used in the local HT, and counts the ones that contribute to a given maximum. As a result one obtains a number of scores that measures the association of a given sample element with a given maximum. The outliers usually get zero scores, even when the percentage of outliers is high (like 50%), and this is the basis of the outlier selection criterion.

5-5. Up to now we discussed the case when the signal was corrupted by outliers but the background noise in regular sample points was zero.
A more difficult problem arises when


there is a combination of a significant percentage of outliers together with a noticeable level of background noise. This leads to blurring of the maxima in the accumulator array and, if the level of the background noise increases, one faces merging of the domains of different maxima in the accumulator array. Such merging prevents proper identification of the different maxima and, as a result, of the different parts of the signal. Meanwhile, for each sample element one can calculate the measure of association with the whole region of the merged maxima, as it was done in the previous paragraph for a specific maximum. The outliers once more get zero scores and this is the basis for the outlier selection. After this one can use the algorithm described in Sections 2-3 for restoring the piece-wise structure of the signal.
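A minimal sketch (Python; the quantization steps and the "top cells are peaks" rule are simplifications assumed for illustration, not the authors' implementation) of the local pair-formation Hough transform of 5-3 and the zero-score outlier test of 5-4:

```python
import numpy as np
from collections import defaultdict

def local_pair_hough(y, window=5, a_step=0.25, b_step=2.0):
    """Accumulate (slope, intercept) votes from pairs of nearby samples; track each point's cells."""
    n = len(y)
    acc = defaultdict(int)                 # accumulator cell -> count
    cells_of_point = defaultdict(list)     # sample index -> cells its pairs voted for
    for i in range(n):
        for j in range(i + 1, min(n, i + 1 + window)):
            a = (y[j] - y[i]) / (j - i)    # slope through the pair (x = site index)
            b = y[i] - a * i               # intercept with the b axis
            cell = (round(a / a_step), round(b / b_step))
            acc[cell] += 1
            cells_of_point[i].append(cell)
            cells_of_point[j].append(cell)
    return acc, cells_of_point

def outlier_scores(acc, cells_of_point, top=3):
    """Score of a point = number of its pairs that fall into one of the strongest maxima."""
    peaks = set(sorted(acc, key=acc.get, reverse=True)[:top])
    return {i: sum(c in peaks for c in cells) for i, cells in cells_of_point.items()}

# Points whose score is zero are treated as outliers and excluded before re-fitting
# the piece-wise structure with the algorithm of Sections 2-3.
```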

6. SUMMARY

In this paper the problem of restoring the original structure of a signal (image) corrupted by noise is considered. The signal is assumed to be piece-wise linear, and with respect to the noise two situations were considered. When the level of noise is rather high and the noise is close to Gaussian, an approach based on a combination of the least square criterion, dynamic programming and a probabilistic estimate for model selection is suggested. This approach is realized in the form of a fast algorithm, which makes it possible to apply it to 2D image analysis. In Section 5 the case when the signal is corrupted by noise in the form of spikes is considered. A modification of the Hough Transform is introduced as a robust method for the detection of outliers (spikes). With the help of this approach one can restore the signal structure even when the percentage of outliers is rather high.

REFERENCES
1. D.H. Ballard and C.M. Brown, Computer Vision, Prentice-Hall, 1982.
2. R.C. Dubes and A.K. Jain, Random field models in image analysis, J. Appl. Statist., 16(2), 131-164, 1989.
3. R.A. Hummel and S.W. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. PAMI, 5, 267-287, 1983.
4. T. Poggio, V. Torre and C. Koch, Computational vision and regularization theory, Nature, 317, 314-319, 1985.
5. R. Bellman and R. Roth, Curve fitting by segmented straight lines, J. Am. Stat. Assoc., 64, 1074-1079, 1969.
6. J. Rissanen, Minimum-description-length principle, Encyclopedia of Statistical Sciences, vol. 5, 523-527, Wiley, N.Y., 1987.
7. V.L. Brailovsky, A predictive probabilistic estimate for selecting subsets of regressor variables, Ann. N.Y. Acad. Sci., 491, 233-244, 1987.
8. V.L. Brailovsky, On the use of a predictive probabilistic estimate for selecting best decision rules in the course of search, Proc. IEEE Comp. Society Conf. on Computer Vision and Pattern Recognition, Ann Arbor, MI, 469-477, 1988.
9. V.L. Brailovsky, Search for the best decision rules with the help of a probabilistic estimate, Ann. of Math. and AI, 4, 249-268, 1991.

10. V.L. Brailovsky, A probabilistic approach to clustering, Pattern Recognition Letters, 12, No 4, 193-198, 1991.
11. V.L. Brailovsky, Yu. Kempner, Application of piece-wise regression to detecting internal structure of signal, Pattern Recognition, 25, No 11, 1361-1370, 1992.
12. V.L. Brailovsky, Yu. Kempner, Restoring the original range image structure using probabilistic estimate, Proc. 10th Israeli Symposium on Art. Int., Comp. Vis. and Neur. Net., 389-396, 1993.
13. P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection, Wiley, N.Y., 1987.
14. V.L. Brailovsky, Robust regression, Hough transform and finding piece-wise structure of a signal corrupted by noise. Submitted to IEEE Trans. PAMI.



Reflectance ratios: An extension of Land's retinex theory

Shree K. Nayar^a and Ruud M. Bolle^b

^a Department of Computer Science, Columbia University, New York City, NY 10027, U.S.A.
^b Exploratory Computer Vision Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.

Neighboring points on a smoothly curved surface have similar surface orientations and illumination conditions. Therefore, their brightness values can be used to compute the ratio of their reflectance coefficients. We develop an algorithm that estimates a reflectance ratio for each region in an image with respect to its background. The region reflectance ratio represents a physical property that is invariant to illumination and imaging parameters and is an extension of Land's "Lightness" defined in his retinex theory. The ratio invariant is used to automatically build representations of objects and to recognize objects from a single brightness image of a scene.

1. INTRODUCTION

Object recognition has been an active area of machine vision research for the past two decades [1,2]. The traditional approach has been the use of geometry, that is, reconstruct scene geometry from an image and match that with geometric models of objects. Little attention has been given to other object properties for recognition. Besides geometry, objects may be characterized by reflectance properties. Clearly, the representation of an object using these properties is useful only if the recognition system is able to compute them from images. In this paper, we present a method for computing the reflectance of regions in a scene, with respect to their backgrounds, from a single image. This property is invariant to the intensity and direction of illumination. This photometric invariant, called the reflectance ratio, provides information for recognition tasks. The only assumption we make is that the light illuminating the scene is roughly the same in distribution during the learning and recognition phases.

Surface radiance, and hence image brightness, is the product of surface reflectance and illumination. So, it is impossible to separate the contributions of reflectance and illumination at a single image point if it is treated in isolation. The problem of computing the reflectance of regions in a scene was first addressed by Land [3-5]. Land constructed a set of ingenious experiments to show that humans are able to perceive the reflectance of scene regions even in the presence of non-uniform and unknown illumination. He developed the retinex theory, which suggests computational steps for recovering the reflectance of scene regions. Though it is not possible to determine the absolute reflectance of regions, their

142

relative reflectance (or "Lightness") can be computed. The retinex theory is based on the following assumptions:

• The scene is planar and similar in appearance to the paintings of the Dutch artist Mondrian (although they are closer in appearance to a van Doesburg). The resulting images are like the one shown in Figure 1(a).

• The effective illumination varies slowly and smoothly across the scene and is independent of the position of the viewer.

• The brightest scene patch is white (the "brightest-is-white" assumption) and the average scene reflectance in each color band is the same (the "gray-world" assumption).

Incidentally, these assumptions are not always explicitly stated in Land's work. Under these assumptions, image brightness values only change abruptly at region boundaries, since illumination variations are small throughout the scene. It is therefore possible to integrate out the effects of illumination. Integration only determines scene reflectance up to a constant; the constant is resolved by normalizing all reflectances to the signal from the brightest patch. However, proper normalization can only be achieved if the brightest image patch is "pure" saturated white. The gray-world assumption is needed for global consistency in terms of scene reflectance for all color channels. Realistic images typically do not obey these assumptions and also include shadows, occlusions, and noise.

Figure 1. Images of (a) a Mondrian scene; and (b) a realistic scene with three-dimensional objects.

Any one


of the latter can cause a region boundary to be missed or erroneously detected. Such errors can greatly affect the lightness values computed for all regions in the image.

We develop a scheme for computing the ratio of the reflectance of a region to its background and use these ratios for object recognition. This method is fundamentally different from histogram techniques (e.g. [6]), in that geometric information is embedded in the object representations. We also do not assume that the objects are flat and parallel to the image plane. The image is first segmented into regions of constant (but unknown) reflectance. Next, a reflectance ratio is computed for each region with respect to its background, using only points that lie close to the region's boundary. The reflectance ratio computed for any particular region is not affected by those computed for regions elsewhere in the image. Our derivation of the reflectance ratio is based on the analysis of regions on curved surfaces. Such regions are commonplace in realistic scenes like the one shown in Figure 1(b). For curved surfaces, image brightness variations result from both illumination variations and surface normal changes. For curved surfaces, our reflectance ratio invariant is valid when a region and its background have the same distribution (scattering) function but different reflectance coefficients (albedo). We use the reflectance ratio invariant to recognize and estimate the 3D poses of objects from a single image. The proposed approach is very effective for man-made objects that have printed characters and pictures. Each object is assumed to have a set of regions, each with constant reflectance.

2. REFLECTANCE RATIOS

The reflectance of a surface depends on its roughness and material properties. In general, incident light is scattered by a surface in different directions. This distribution of reflected light can be described as a function of the angle of incidence, the angle of emittance, and the wavelength of the incident light. Consider an infinitesimal surface patch with normal n, illuminated with monochromatic light of wavelength λ from the direction s, and viewed from the direction v. The reflectance of the surface element can be expressed as r(s, v, n, λ). Now consider an image of the surface patch. If the spectral distribution of the incident light is s(λ) and the spectral response of the sensor is e(λ), the image brightness value produced by the sensor is:

    I = \int s(\lambda)\, e(\lambda)\, r(s, v, n, \lambda)\, d\lambda    (1)

If we assume the surface patch is illuminated by "white" light and the spectral response of the sensor is constant within the visible-light spectrum, then s(λ) = s and e(λ) = e. We have:

    I = s\, e\, \rho\, R(s, v, n) = k\, \rho\, R(s, v, n)    (2)

where ρ R(s, v, n) is the integral of r(s, v, n, λ) over the visible-light spectrum and the constant k = s·e contains the information about the illuminating light and the sensor. We have decomposed the result into R(.), which represents the dependence of surface reflectance on the geometry of illumination and sensing, and ρ, which may be interpreted as the fraction of the incident light that is reflected in all directions by the surface (albedo). Incident light that is not reflected by the surface is absorbed and/or transmitted through the surface. Two surfaces with the same distribution function R(.) can have different reflectance coefficients ρ.
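A toy illustration (Python; the Lambertian choice R = max(0, n·s) is an assumption made only for this example, not part of the model above) of how the brightness in (2) scales with the albedo ρ while the geometric factor is shared:

```python
import numpy as np

def brightness(albedo, normal, light_dir, k=1.0):
    """Image brightness I = k * rho * R(s, v, n), with a Lambertian R = max(0, n . s) assumed."""
    n = normal / np.linalg.norm(normal)
    s = light_dir / np.linalg.norm(light_dir)
    return k * albedo * max(0.0, float(np.dot(n, s)))

# Two patches with the same geometry and illumination but different albedos:
I1 = brightness(0.9, np.array([0.0, 0.0, 1.0]), np.array([0.3, 0.2, 1.0]))
I2 = brightness(0.3, np.array([0.0, 0.0, 1.0]), np.array([0.3, 0.2, 1.0]))
print(I1 / I2)   # 3.0 = 0.9 / 0.3, independent of the shared geometric factor
```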

2.1. Reflectance Ratio of Neighboring Points

Consider two neighboring points on a surface (Figure 2). For a smooth continuous surface, the points may be assumed to have the same surface normal vectors (n_1 ≈ n_2). Further, the points have the same source and sensor directions. Hence, the brightness values I_1 and I_2 of the two points may be written as:

    I_1 = k\, \rho_1\, R_1(s, v, n); \qquad I_2 = k\, \rho_2\, R_2(s, v, n)    (3)

The main assumption made in computing the reflectance ratio is that the two points have the same scattering functions (R_1 = R_2 = R) but their reflectance coefficients ρ_1 and ρ_2 may differ. Then, the image brightness values produced by the points are:

    I_1 = k\, \rho_1\, R(s, v, n); \qquad I_2 = k\, \rho_2\, R(s, v, n)    (4)

The ratio of the reflectance coefficients of the points is:

    p = I_1 / I_2 = \rho_1 / \rho_2    (5)

Figure 2. Neighboring points on a surface.

Note that p is independent of the reflectance function, the illumination direction and intensity, and the surface normal of the points. It is a photometric invariant that is easy to compute and does not vary with the position and orientation of the surface with respect to the sensor and the source. Further, it represents an intrinsic surface property that can be effectively used for object recognition. We use a different definition for p to make it a well-behaved function of the brightness values I_1 and I_2:

    p = (I_1 - I_2) / (I_1 + I_2) = (\rho_1 - \rho_2) / (\rho_1 + \rho_2)    (6)

Now we have -1 < p < 1. We will use this definition of the reflectance ratio in the following sections. If multiple light sources are present, these can be replaced by one effective point source and illumination direction and the above derivation is the same [7,8].
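A minimal sketch (Python; hypothetical array layout, not the authors' two-raster-scan algorithm of [9]) of evaluating (6) for pixel pairs that straddle a region boundary and averaging them into a region ratio, as elaborated in Section 2.2 next:

```python
import numpy as np

def region_reflectance_ratio(image, boundary_pairs):
    """Average of p = (I1 - I2) / (I1 + I2) over pixel pairs straddling the region boundary.

    boundary_pairs: list of ((r1, c1), (r2, c2)) with the first pixel inside the region
    and the second a nearby background pixel on the other side of the boundary.
    """
    ratios = []
    for (r1, c1), (r2, c2) in boundary_pairs:
        i1, i2 = float(image[r1, c1]), float(image[r2, c2])
        if i1 + i2 > 0:                      # skip degenerate (completely dark) pairs
            ratios.append((i1 - i2) / (i1 + i2))
    return float(np.mean(ratios)) if ratios else 0.0
```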


2.2. Reflectance Ratio of a Region

We now consider a surface region that has a constant reflectance coefficient ρ_1 and is surrounded by a background region with constant reflectance coefficient ρ_2. We are interested in computing the reflectance ratio P(S) of the surface region S with respect to its background. The image brightness of the entire region cannot be assumed constant for two reasons. First, the surface may be curved and hence the surface normal can vary substantially over the region. Second, while the illumination may be assumed to be locally constant, it may vary over the region. These factors can cause brightness variations, or shading, over the region and its background as well. Figure 3(a) shows the image of a curved region and Figure 3(b) shows image brightness values varying along the boundary (white line) of the curved region. The reflectance ratio can be accurately estimated using neighboring (or nearby) points that lie on either side of the boundary between the region and the background. Figure 3(c) shows reflectance ratios computed along the boundary of the curved region. Note that while the image brightness varies along the boundary, the ratio remains nearly constant. A robust estimate of the region's reflectance ratio can be obtained as an average of the ratios computed along its boundary. The region ratio is also a photometric invariant; it is independent of the shape of the surface and the illumination conditions. It is computed using a single image of the scene and provides useful intrinsic properties of the surface regions in the scene.

In [9] we presented an algorithm that computes reflectance ratios of scene regions. The algorithm can be divided into two parts. First, a sequential labeling algorithm is used to segment the image into connected regions. The second phase involves the computation of a reflectance ratio for each of the segmented regions. The algorithm is computationally efficient in that the reflectance ratios of all scene regions are computed in just two raster scans of the image. Several experiments in [9] demonstrate the invariance of reflectance ratios.

3. RECOGNITION USING REFLECTANCE RATIOS

We now apply reflectance ratios to the problem of object recognition. The recognition methods presented here are effective for objects that have markings with different reflectance coefficients. Man-made objects with pictures and text printed on them are good examples of such objects. We consider recognition scenarios that differ in the assumptions made with respect to the constraints on the objects in the scene. Quick indexing schemes [10] are developed that use the reflectance ratio invariant to identify objects from a possibly large object-model database. The expected search time is independent of the size of the object-model database. We consider only the three-dimensional case; the two-dimensional case is discussed in [7,8]. For the 3-D case, a 3-D object can be in any arbitrary orientation and position in 3-D space. Here, the image formation model is assumed to be weak-perspective: orthographic projection followed by scaling [11]. This assumption is a good one if the objects are sufficiently far from the sensor. Unlike the 2-D case, for the 3-D object case the configuration of a set of projected object features in the image can vary dramatically. For the 2-D case, the image features are a similarity transformation or, at most, an affine transformation of the features on the object surface.
For 3-D objects, the "loose" projection of object features on the image plane makes the incorporation of geometric object properties more difficult.

Figure 3. (a) Image of a curved region; (b) image brightness values along the region boundary; (c) reflectance ratios computed along the boundary. The ratios are near constant while the brightness values vary.

3.1. Three-Dimensional Object Recognition

The three-dimensional case is more general than the two-dimensional scenario described (along with the 3-D case) in [7,8]. We use reflectance images to recognize objects that correspond to truly 3-D models of the objects. The 3-D models are obtained from registered range and brightness images.

Acquiring Object Models

The three-dimensional scenario allows for arbitrary rotations and translations of objects in the scene. Since our objective is to recover the three-dimensional pose of an object from a single brightness image, the object model must include the reflectance ratios of the objects as well as the three-dimensional coordinates of the centroid of each region. This is done using a range finder. We use the image sensor of the range finder to also obtain a brightness image of the object. The range and brightness images are therefore registered. The reflectance ratio algorithm is applied to the brightness image and the ratios (P_m) and centroids (x_m) (in the model image) of the object's regions are determined. Next, the range map is used to obtain the three-dimensional coordinates (X_m) of the points on the object surface that correspond to the region centroids in the image. We assume that, though the object surface may be curved, each constant reflectance region is small compared to the size of the object. Under this assumption, centroids of regions in the image correspond to centroids of the regions in the 3-D scene. Using the above approach, a ratio-centroid list L_A = ((X_1, P_1), (X_2, P_2), ..., (X_m, P_m), ...) is obtained for each object. Here, X_m, m = 1, ..., M are the 3-D centroids of the regions and P_m, m = 1, ..., M are the reflectance ratios obtained from the brightness image.

For object model building, a hash table is initialized. The indices in the hash table must be invariants that can be computed from a single brightness image of the scene, since 3-D object recognition is intended to be achieved from such a brightness image. In the three-dimensional case, there are no useful geometric invariants, such as the angles α and β in the two-dimensional case [7,8], that can be computed from the spatial arrangement of the region centroids [12]. This is because object rotation in the scene changes the relative configuration of the region centroids in the image. Thus, we rely on the photometric invariance of reflectance ratios for indexing into the hash table. We select three regions, i, j, and k on the object and use their reflectance ratios to obtain an index < P_i, P_j, P_k > (see the table below). Indices are formed using only those region triplets (i, j, k) whose centroids in 3-D space lie within a radius D_A. This is done for all combinations of triplets of regions that lie within a radius D_A.

INDEX:  < P_i, P_j, P_k >
ENTRY:  < M_l, (X_i, X_j, X_k), {(X_1, P_1), ..., (X_M, P_M)} >
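A rough sketch (Python; the quantization step, radius value and dictionary layout are assumptions made for illustration, not the authors' implementation) of how such a ratio-triplet index and its entries, whose fields are explained just below, might be organized:

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

Q = 0.05     # assumed quantization step for the ratios when forming hash keys
D_A = 50.0   # assumed radius (scene units) within which triplets are formed

def ratio_key(p_i, p_j, p_k):
    """Quantize a triplet of reflectance ratios into a discrete hash-table index."""
    return tuple(round(p / Q) for p in (p_i, p_j, p_k))

def build_model_database(objects):
    """objects: {object_id: list of (X, P)}, X a 3-D centroid (np.array), P a region ratio."""
    table = defaultdict(list)
    for obj_id, regions in objects.items():
        for i, j, k in combinations(range(len(regions)), 3):
            xs = [regions[t][0] for t in (i, j, k)]
            if max(np.linalg.norm(a - b) for a, b in combinations(xs, 2)) > D_A:
                continue                                   # keep only triplets within radius D_A
            key = ratio_key(regions[i][1], regions[j][1], regions[k][1])
            entry = (obj_id, tuple(xs), regions)           # triplet centroids + all (X_m, P_m)
            table[key].append(entry)
    return table
```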

Associated with each index in the hash table is an entry. In the entry, M_l is the


object identifier and (X_i, X_j, X_k) are the 3-D centroids of the three regions used in the index. The entry also includes the centroid-ratio pairs (X_m, P_m), m = 1, ..., M, to be used for object verification. The above procedure is applied to all sets of three regions in the list L_A. This process is repeated for all objects M_l, l = 1, ..., of interest to the recognition system. The resulting hash table represents the complete object-model database.

Recognition and Pose Estimation

Though model acquisition requires the use of both brightness and range images of each object, recognition and pose estimation are accomplished using a single brightness image. The reflectance ratio algorithm is applied to the scene image to obtain the list L_R = ((x_1, P_1), (x_2, P_2), ...). A set of three regions (i, j, k) that lie within a radius D_R is selected from the list L_R. The ratios of the three regions are used to form the index < P_i, P_j, P_k >. If this index does not have an entry in the hash table (described above), the next set of three regions is selected. If an entry does exist, we have a hypothesis for the object (say M_K). The entry includes the 3-D centroids of the regions (i, j, k) and a set of centroid-ratio pairs for other regions on the object M_K. Assuming the object hypothesis is correct, we have a correspondence between the image centroids (x_i, x_j, x_k) and the 3-D centroids (X_i, X_j, X_k) in the entry. Under the weak-perspective assumption, the transformation T from 3-D scene points to 2-D image points can be computed from the three corresponding pairs using the alignment technique proposed by Huttenlocher and Ullman [11]. In general, there exist two solutions to the transformation [11]:

    x' = T_{K1}(X) \quad and \quad x' = T_{K2}(X)    (7)

Weinshall [13] has shown that instead of computing these two transformations, the inverse of the Grammian of the points X_i, X_j, and X_k can be used to predict the image coordinates x'_m of a fourth 3-D point X_m in the entry. Again, two solutions to x'_m exist, but if the initial object hypothesis is correct, one of the two solutions is likely to be close to one of the centroids in the list L_R. Further, the reflectance ratio P_m (in the entry) and P_m (in the list L_R) must be similar. The point x'_m is not guaranteed to be in the list L_R since it may not be visible to the sensor or it may be occluded by other objects in the scene. In any event, for the object to be verified, one or more projections of the 3-D regions in the entry must match in location and ratio with regions in the list L_R. If so, the object M_K has been recognized and its pose is given by either T_{K1} or T_{K2}. At this point, all regions used as indices and those that are verified are removed from the list L_R. A new set of three regions is selected from the list and used to form the next index.

4. EXPERIMENTS

In this section, we present a three-dimensional recognition result that demonstrates that the simultaneous use of photometric and geometric invariants is a powerful approach to recognition. The robustness of reflectance ratios of regions and two-dimensional recognition results are experimentally shown in [7,8]. The recognition experiments were conducted on man-made objects with letters and pictures printed on them.

Figure 4. Model acquisition and object recognition results obtained for a three-dimensional recognition problem: (a) model acquisition (brightness image, range image); (b) recognition and pose estimation.

The printed regions have reflectance coefficients that depend

on the shade or color of the paint used to print them. The approach proposed here is particularly effective for such objects. The reflectance ratio algorithm produces a set of detected regions, each region represented by its centroid in the image and its reflectance ratio. This compact representation of objects is used to automatically acquire object models as well as to recognize them in unknown images. The recognition and pose estimation stages are efficient as they are based on the indexing scheme described in Section 3. All images used for model acquisition and object recognition were obtained under ambient lighting conditions.

Figure 4(a) shows model acquisition for a 3-D object. The range image was obtained using a light-stripe range finder. The vertices of the triangle displayed are the centroids of three regions whose reflectance ratios were used as indices in the hash table. Other nearby regions used for verification and pose estimation are indicated by their centroids (black squares). Recognition and pose estimation are done using a single brightness image of the scene. The scene shown in Figure 4(b) consists of several 3-D objects in different orientations and positions. It includes occlusions, shadows, and non-uniform illumination. The reflectance ratio algorithm was applied and a total of 18 constant reflectance regions were detected. The index triangle shown in the model image is found and verified in the scene image. The three index regions produce a hypothesis for the object and its pose. Other regions in the object model are used to verify this hypothesis using the alignment technique (Expression (7)). Again, some of the verification regions are not found in the scene image since they are occluded by other objects. Further, the actual and projected centroids do not overlap exactly since the assumption that the regions are small compared to the size of the object is violated.

5. DISCUSSION

We conclude with a brief discussion of the ideas and results presented in this paper. Some directions for future work are also mentioned.

• Physical Approach to Visual Perception: Most will agree that biological vision systems use not only geometric features but also physical attributes such as reflectance for perception. We are fairly adept at distinguishing a smooth surface from a rough one, plastic from metal, cotton from silk, or bronze from copper. Some can even tell artificial wood or metal from the real thing. Machine vision systems have relied primarily on geometry for high-level tasks such as recognition and navigation. In fact, in the past, models of reflectance have been used mainly for recovering scene geometry (shape from shading, for example). Perceptual algorithms too can benefit from the explicit use of non-geometric physical attributes. Reflectance, material, and roughness are examples of such attributes.

• Representing Physical Attributes: While arguing in favor of physical attributes, we are faced with several significant problems. One entails the representation of an object's physical attributes. In this paper, we have used a rather simplistic representation: region ratios and centers. To accommodate a larger class of objects, richer descriptions must be explored. It is imperative that the representation be able to handle multiple properties (e.g., shape and reflectance) simultaneously, and yet


be compact enough to be called a representation. The shape variations of an object may not be, in any way, correlated with its reflectance variations. For instance, a simple geometry such as a sphere may be highly textured. We may therefore need to represent geometry and reflectance at different resolutions. Further, all difficulties posed by single-attribute representations are also inherited by multi-attribute representations. For instance, one must decide a priori the level (or scale) at which shape and reflectance variations need to be described.

• Computing Reflectance from Images: Representation of physical properties is meaningful only if these properties can be computed from images. We presented an algorithm that computes the relative reflectance of scene regions from a single image. The algorithm may be viewed as an extension of Land's retinex theory to three-dimensional scenes. By using segmentation first, our algorithm overcomes several problems inherent to Land's global method. Fairly straightforward hardware implementations can be envisioned to obtain real-time reflectance estimates. Ideally, we would like to compute the absolute reflectance of a region. Using a single image, however, only relative reflectance estimates can be obtained. This reflectance ratio was shown to be invariant to a variety of illumination and imaging parameters. The use of a single image also precludes us from being able to handle specular reflections, faces on a polyhedron, or regions and backgrounds that have different scattering functions. An interesting extension would involve the use of multiple images of a scene obtained from different vantage points.

• Photometric Invariants: We demonstrated that reflectance ratios of regions can be used for robust recognition and pose estimation. This approach is of course effective only if the objects of interest have 4 or more (3 for hypothesis and at least 1 for verification) visible regions. Instead of using region ratios, it may be possible to use ratios of just neighboring points and their locations in the image. This would also avoid the need for scene segmentation prior to computing region ratios. The unresolved problem here is the selection of points, both from the scene image as well as from the object model, for matching and pose estimation. In the past, several other photometric invariants have been proposed for visual perception (see [14] for examples). These invariants do not directly represent physical properties such as reflectance but rather are functions derived from image brightness that are invariant to pose and illumination for a given shape and reflectance. They are clearly useful for recognition tasks. Some of the proposed invariants are based on high-order spatial derivatives of image brightness and hence suffer from noise sensitivity. However, improvements in imaging technology are being continually made and this problem is expected to fade with time.

• Integrating Recognition Techniques: Several recognition techniques have been proposed in the past, each developed with a particular class of objects in mind. In the case of polyhedra, geometric features such as lines and corners provide powerful constraints and invariants. For a smoothly curved object with uniform reflectance, the occluding boundaries and the shading within provide strong cues. As shown here, for objects with constant reflectance patches (surface markings), reflectance


ratios and their geometrical arrangement can be used. It is evident that a truly versatile recognition system cannot rely solely on any one of the above techniques. This is a natural consequence of the variety of objects that such a system would have to deal with. The challenge seems to lie in the integration of several recognition strategies into a single system. The broader objective of this paper has been to show that such an integrated system must also rely on physical properties in addition to geometry.

REFERENCES
1. P.J. Besl and R.C. Jain. Three-dimensional object recognition. ACM Computing Surveys, 17(1):75-145, 1985.
2. R.T. Chin and C.R. Dyer. Model-based recognition in robot vision. ACM Computing Surveys, 18(1), March 1986.
3. E.H. Land. The retinex. American Scientist, 52(2):247-264, June 1964.
4. E.H. Land. Recent advances in retinex theory and some implications for cortical computations: Color vision and the natural image. Proc. National Academy of Science, 80:5163-5169, August 1983.
5. E.H. Land. Recent advances in retinex theory. Vision Research, 26:7-21, 1986.
6. M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, pages 11-32, November 1991.
7. S.K. Nayar and R.M. Bolle. Reflectance ratio: A photometric invariant for object recognition. In Proc. International Conference on Computer Vision, pages 280-285, May 1993.
8. S.K. Nayar and R.M. Bolle. Reflectance based object recognition. Technical Report CUCS-055-92, Columbia University, March 1994.
9. S.K. Nayar and R.M. Bolle. Computing reflectance ratios from an image. Pattern Recognition, 26(10):1529-1542, October 1993.
10. A. Califano and R. Mohan. Multidimensional indexing for recognizing visual shapes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 28-34, June 1991.
11. D.P. Huttenlocher and S. Ullman. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195-212, November 1990.
12. J.B. Burns, R. Weiss, and E. Riseman. View variation of point-set and line segment features. In Proc. Image Understanding Workshop, pages 650-659, April 1990.
13. D. Weinshall. Model-based invariants for 3d vision. Technical Report RC 17705, IBM Thomas J. Watson Research Center, December 1991.
14. J.J. Koenderink and A.J. van Doorn. Photometric invariants related to solid shape. Optica Acta, 27(7):981-996, 1980.



A segmentation algorithm based on AI techniques*

C. Di Ruberto, N. Di Ruocco, S. Vitulano
Istituto di Medicina Interna - Clinica Medica "M. Aresu", Università di Cagliari - Via S. Giorgio, 12 - 09124 Cagliari, Italy. Fax +39-70-663651, E-Mail [email protected]

A possible model of the human perceptive process is presented in this paper. To realize the model, we have introduced a well known technique of Problem Solving. The most important roles in our model are played by the Evaluation Function and the Control Strategy. The Evaluation Function is, roughly speaking, related to the ratio between the entropy of one region or zone of the picture and the entropy of the entire picture. The Control Strategy determines the optimal path in the search tree so that the nodes of the optimal path have minimal entropy.

Keywords: human perception, segmentation, entropy, quadtree, clustering, Chebyshev distance, structure.

1. INTRODUCTION

In the stimulating world of Image Processing the most significant phase is certainly segmentation. Typically the term segmentation describes the process, both human and automatic, that identifies in a pictorial scene zones or regions showing some characteristics with respect to a certain evaluation function (e.f.). These characteristics could be, for example, the same colour or gray level, roughness, material, decorum, texture and so on. The evaluation function, as we intend it in this work, can be described in the following way: with respect to which function, regularity or use can one or more regions be considered homogeneous? So, for example, two regions of different nature but characterized by the same kind of manufacturing or decorum could be considered homogeneous by an artist but not by a technician. In image processing, researchers often utilize the following classes of characteristics:

* This work is partially supported by the Scientific Research Ministry and by the study conducted between IBM and the Image Processing Laboratory, University of Cagliari.


- Statistical: the gray levels of a region have to satisfy a particular statistical function;
- Structural: the gray level distribution of a region has to satisfy a mathematical structure, a rule or a fractal function [1-3].

In the Computer Vision world the whole segmentation process is often described, in the literature, by two opposite but related phases: merge and split. In the merge phase we try to realize the greatest possible organization of the picture, highlighting the transition zones (catastrophes) between different regions. We consider as regions all those zones of the picture whose entropy is negligible, reversible states, and for which it is possible to determine a statistical function or mathematical structure. On the contrary, for those regions (catastrophes) that present a high value of entropy, irreversible states, it is not possible to determine a statistical function or a mathematical structure but, perhaps, a catastrophe. We say that the zones of the picture where the entropy is high correspond to the contour or the silhouette of the object. The purpose of the split phase is to divide each region, identified by the merge phase, into subregions which can be described by means of different mathematical structures; for these subregions the entropy changes are infinitesimal. So, the whole segmentation process is governed, in our opinion, by the choice of the characteristics of the e.f. and their quantification. In fact a different choice of characteristics and the recourse to a variable threshold can produce different segmentations for the same pictorial scene.

2. THE PROPOSED TECHNIQUE
In the introduction we have explained the role of the components that determine the segmentation process: the type of peculiarities and their quantification. For the choice of the characteristics and their quantification we have resorted to a typical AI strategy: Problem Solving. This technique divides the whole segmentation problem into two subproblems: the merge and the split.
2.1 The Merge phase
The task of this phase is to obtain some information about the different regions of the picture: the number of existing regions, their areas and topological positions in the image domain, their features and their contours.
2.1.1 - The physical space of the problem
The image domain plays an important role; by means of this space it is possible to determine all the topological information of the extracted regions. This domain will be the physical space of the problem and will usually be represented by a square matrix of 512 x 512 pixels.
2.1.2 - The Quad-tree
It is necessary to utilize a hierarchical structure to follow the behaviour of the problem during its various phases; to this purpose we use the quad-tree [4], which is easily computed and stored, gives us topological information about the regions (shape and position in the image domain) and allows the implementation of back-tracking.
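As an illustration of the hierarchical structure just described, the following minimal sketch shows how a quad-tree node covering a square sub-domain of the image might be represented; the class and method names (QuadNode, split, pixels) are ours, not the authors' implementation.

```python
# Minimal quad-tree sketch: each node covers a square sub-domain of the image,
# can be split into four children, and keeps a link to its father for back-tracking.
class QuadNode:
    def __init__(self, image, x, y, size, parent=None):
        self.image = image          # 2-D list of gray levels
        self.x, self.y = x, y       # top-left corner of the sub-domain
        self.size = size            # side length (a power of two)
        self.parent = parent        # kept to allow back-tracking
        self.children = []

    def split(self):
        """Generate the four children covering the quadrants of this domain."""
        half = self.size // 2
        self.children = [QuadNode(self.image, self.x + dx, self.y + dy, half, self)
                         for dy in (0, half) for dx in (0, half)]
        return self.children

    def pixels(self):
        """Iterate over the gray levels inside this node's sub-domain."""
        for r in range(self.y, self.y + self.size):
            for c in range(self.x, self.x + self.size):
                yield self.image[r][c]

# The root of a 512 x 512 image domain would be QuadNode(img, 0, 0, 512).
```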


2.1.3 - The Evaluation Function
Gestalt theory supposes that the whole information needed to achieve the perceptive process is contained inside the examined picture [6]. Now we put in evidence a question: what and how much information is needed? The first problem is then to determine a particular analysis to measure such information. In the literature there are several important works that propose different kinds of analysis, such as histograms, fractal dimensions, transforms, etc. The second problem is to determine the quantity needed to correctly process the image. To achieve this goal, in this work, we utilize the image's histogram. In fact, the histogram can give some interesting information about the picture under investigation, such as entropy, gray levels, their distribution and so on. For example, a totally homogeneous image, consisting of a single gray level, has a histogram with only one peak; we can associate with such a peak a gaussian distribution with zero standard deviation, σ. So, the entropy of the whole picture is zero. In other words, the entire scene is homogeneous and is composed of only one region characterized by a single gray level; then we can state that the entropies of the region and of the pictorial scene are zero. On the contrary, if we consider a picture with only two gray levels, white and black for instance, its histogram has two peaks. If we associate a gaussian function with each of these peaks, its standard deviation is different from zero. The entropy of the whole picture is different from zero and it is related to σ1 + σ2, the sum of the standard deviations. Such an image could be a regular structure (chess-board) or a chaotic scene (black and white noise). In this picture two or more homogeneous regions could exist (with entropy equal to zero) but the entropy of the whole picture is different from zero. Typically the histogram of a real picture is very complicated to read and, although it gives us some information about the whole image, it does not allow us to pick out the regions contained inside the image. Now it is possible to make a remark: if the image is composed of two or more regions, each of these has small entropy, but this is not true for the whole picture. For this reason the chosen e.f. measures the entropy of each region with respect to the entropy of the whole image. Such an idea suggests that we introduce the following measures:
σ - standard deviation. With each relative maximum of the image's histogram we associate a gaussian distribution with its standard deviation. The picture entropy is closely related to the sum of the standard deviations associated with each peak.
τ - symmetry. The gray level distribution with respect to a relative maximum can be more or less symmetric. To take account of possible peak asymmetries we have introduced the symmetry correction factor, τ. So, τ is the number of shifts that the associated gaussian function has to carry out in order to obtain the best matching between the curve and the maximum itself. The number of possible shifts is limited inferiorly by -σ and superiorly by +σ.
c - cost. It is a measure of the computational cost for the localization of a region in the pictorial scene. If we set to unity the cost between a father node and a child node in the quad-tree, such a cost c will be a measure of the depth reached in the search tree to individuate a region.
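To make the three measures concrete, here is a rough sketch of how σ, τ and the cost c could be combined into e.f. = σ + τ + c for the pixels of one quad-tree node. The fitting of the gaussian and the matching score are simplifications we assume for illustration; they are not the authors' exact procedure, and the function names are hypothetical.

```python
import math

def histogram(gray_values, levels=256):
    h = [0] * levels
    for g in gray_values:
        h[g] += 1
    return h

def evaluation_function(gray_values, depth, levels=256):
    """Sketch of e.f. = sigma + tau + c for one quad-tree node.

    sigma: standard deviation of the gaussian associated with the dominant
           histogram peak (approximated here by the standard deviation of
           all gray levels in the node -- an assumption).
    tau:   shift in [-sigma, +sigma] that best matches the gaussian to the
           histogram peak (a crude absolute-difference score is used).
    c:     unit cost per father-child step, i.e. the depth in the search tree.
    """
    gray_values = list(gray_values)
    h = histogram(gray_values, levels)
    n = len(gray_values)
    peak = max(range(levels), key=lambda g: h[g])
    mean = sum(gray_values) / n
    sigma = math.sqrt(sum((g - mean) ** 2 for g in gray_values) / n)

    def mismatch(shift):
        s = max(sigma, 1e-6)
        gauss = [n * math.exp(-((g - (peak + shift)) ** 2) / (2 * s * s)) /
                 (s * math.sqrt(2 * math.pi)) for g in range(levels)]
        return sum(abs(h[g] - gauss[g]) for g in range(levels))

    shifts = range(-int(sigma), int(sigma) + 1)
    tau = abs(min(shifts, key=mismatch))
    return sigma + tau + depth
```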


2.1.4 - The Lists
We have resorted to some lists in order to store the quad-tree nodes visited during the various search phases; they are the F-Open list, the F-Closed list and the B-Closed list. The F-Open list contains the nodes generated along an optimal path but not yet expanded. They are ordered in the list with respect to the evaluation function values. The F-Closed list contains the nodes expanded along an optimal path. The B-Closed list contains the nodes whose associated domain covers a region already individuated.
2.1.5 - The Rules
The choice of the rules plays a significant role in the field of problem solving; in fact, it can influence the number of operations necessary to reach the problem solution. We have resorted to the rules shown in fig. 1. The rules R1-R10 are used by the forward strategy while the rules R11-R15 are used by the backward strategy, described in the next sections.
2.1.6 - The Forward strategy
The purpose of this phase is to highlight, inside the pictorial scene, those areas whose entropy is equal to zero or quasi-zero. In other words, the zones we individuate as regions are those whose evaluation function is stable or gradually decreasing. If a zone of the image consists of a single region to which corresponds a certain value of the evaluation function, under our hypotheses it is possible to state that in any subdomain of the region the evaluation function may assume values smaller than or equal to the evaluation function value of the whole region. So, we can state that the evaluation function value of a subdomain included in the region is smaller than or equal to the evaluation function value of a subdomain including the region itself. It is in the light of these considerations that we have chosen the hill-climbing strategy as the forward strategy. The flow-chart of the whole control strategy is illustrated in fig. 2; a schematic sketch of this loop is given after the rules of fig. 1.
2.1.7 - The Backward Strategy
In a region whose main characteristic is not a single gray level but rather a texture, a structure, a decorum, etc., there exists a subdomain that contains all the features of the whole region; we shall call such a subdomain the piece or partition element of the region. As we have seen, the quad-tree uses domains whose dimensions are always powers of two, but nothing assures us that the domain of the piece of the region satisfies this condition. Just because the piece of the region can have any dimension and any form, we have resorted to the backward strategy. We have said that along an optimal path the evaluation function assumes constant or decreasing values. Along such a path, when the dimensions of the subdomain relative to a level of the search tree are smaller than the dimensions of the domain of the piece, a sudden variation of the e.f. value occurs. Moreover, along an optimal path there can exist two or more nodes whose evaluation function has the same value. Let us suppose that ai is the node of an optimal path whose corresponding subdomain is the smallest one for that path. If the evaluation function values of all of its children are greater than or strongly different from the evaluation function value of ai, then this means that the dimensions of these nodes are smaller than the exact dimensions of the partition element of the region.

157

R1: Put the root into the F-Open list.
R2: Get the first node in the F-Open list.
R3: Compute the gray level with the highest frequency, the standard deviation σ of the gaussian curve associated with it and the τ value as the measure that maximizes the intersection between the gaussian curve and the histogram of the current node.
R4: Compute the evaluation function value of the current node as e.f. = σ + τ + c, where c is the cost along the optimal path.
R5: Generate the children of the current node.
R6: Order the children of the current node according to the intersection between the measures of the evaluation function of the current node and the father's one.
R7: Put the children of the expanded node into the F-Open list according to the order defined in the R6 rule.
R8: Put the current node (father) into the F-Open list.
R9: Apply the hill-climbing strategy: put on the ordinate axis the evaluation function values and on the abscissa axis the levels of the tree; compute the maximum value with the highest level.
R10: Compute the level with the smallest derivative.
R11: Cover the whole domain with a partition whose elements have the same dimensions of the domain relative to the level computed by the R10 rule.
R12: Label as similar all the elements (nodes) of the partition whose evaluation function value is equal to that of the piece computed by the R10 rule.
R13: Put all the nodes satisfying the R12 rule into the B-Closed and F-Closed lists.
R14: Remove the remaining nodes from the F-Open list.
R15: Compute the domain of the image not yet labelled.

fig. 1 - The list of the rules
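The forward loop driven by rules R1-R8 can be sketched as follows (this is the sketch referred to above). It reuses the hypothetical QuadNode and evaluation_function from the earlier sketches, keeps F-Open ordered by e.f. value, and stops when the first node of F-Open reaches the last level; the backward strategy (R9-R15) is not shown.

```python
def forward_strategy(root, evaluation_function, max_depth, b_closed=frozenset()):
    """Sketch of the forward (hill-climbing) phase driven by the F-Open list.
    Returns the visited path as (node, e.f.) pairs, for use by the backward phase."""
    f_open = [(evaluation_function(root.pixels(), depth=0), 0, root)]   # R1, R3-R4
    f_closed, path = [], []
    while f_open:
        f_open.sort(key=lambda item: item[0])          # keep F-Open ordered by e.f.
        ef, depth, node = f_open.pop(0)                # R2
        f_closed.append(node)
        path.append((node, ef))
        if depth == max_depth or node.size <= 1:       # first node on the last level
            break
        children = [c for c in node.split() if c not in b_closed]       # R5
        f_open.extend((evaluation_function(c.pixels(), depth + 1), depth + 1, c)
                      for c in children)               # R3-R4, R6-R8
    return path, f_closed
```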


fig. 2 - The flow chart of the Control strategy and rules (the loop: check whether the unlabelled image domain is empty; put the root into the F-Open list; get the first node of F-Open; compute the dominant gray level, σ, τ and e.f.; apply R3-R4 to each child not in the B-Closed list; sort the children and put them into F-Open; put the current node into the F-Closed list; when the first node of F-Open lies on the last level of the quad-tree, switch to the backward strategy).


fig. 3 - The flow chart of the backward strategy (apply the hill-climbing strategy of R9; compute the level with the least derivative, R10; cover the image domain with a partition whose elements have the dimensions obtained by R10, R11; label as similar the partition elements whose e.f. value equals that of the piece, R12; put these nodes into the B-Closed and F-Closed lists, R13; remove the remaining nodes from the F-Open list, R14).

The backward strategy considers the domain of one of its children and, increasing the dimensions of this domain by one step at a time, computes the corresponding evaluation function value. The process stops when the e.f. value of one of these domains is equal to that of the father node. This domain is the exact partition element, or piece, of the region. At this point the backward strategy partitions the whole domain of the image into elements whose dimensions are equal to those of the partition element of the region. For each element of the partition the backward strategy computes the evaluation function value and labels as equal those elements for which the e.f. assumes the same value. The strategy puts all the similar elements into the B-Closed list and updates the F-Open, F-Closed and B-Closed lists. It removes from the histogram the gray levels of the pixels belonging to the elements in the B-Closed list. It verifies whether the whole domain of the pictorial scene has been labelled and, if this is true, it stops the merge phase; otherwise the whole segmentation process is repeated for the unlabelled zones of the pictorial scene. The final result of the merging phase


is the segmentation of the pictorial scene into regions for which we know the dimensions of the partition element and its e.f. value. Fig. 3 shows the phases of the backward strategy.
2.2 The Split phase
The aim of the splitting phase is to distinguish the structures associated with each region, whatever their statistics. The merging phase divides the image into regions whose elements satisfy a predicate of uniformity. It associates with each region the dimensions of the partition element and its evaluation function, but the fact that inside the same region two elements satisfy its uniformity predicate does not imply that they have the same texture or structure. By the structure associated with an element we mean a description of the mutual relations among the pixels contained in the partition element. The structure itself is suitable for associating a "shape" with each partition element or piece, and later on we shall speak about the piece shape. Therefore the basic task of the split phase is to associate a structure with the region under investigation or, if the elements do not have the same shape, to split the region into more regions (splitting). The problem of shape recognition (the Clustering Problem) presents two fundamental aspects: the first consists in finding and identifying the common properties of all the elements belonging to a class; the second consists in establishing which class the shape belongs to, on the basis of the features it shows. In particular, a clustering method is generally seen as a process splitting a set of objects into classes according to some similarity measures. The first aspect, that is the extraction of the characteristics, concerns the problem of carrying out some measures on the objects to classify, and that of selecting the meaningful information, the "features", which will be used to represent the objects. The selection and the sorting of the characteristics is often based on two factors: the importance of the features in characterizing the objects and the contribution of the features to the recognition performance. Practically, the extraction process associates with each object pi a vector vi = (k1', k2', ..., kn') where the kj', for j = 1...n, constitute the coordinates of a point in the n-dimensional feature space. This process produces a description of all the objects in the feature space. The other aspect of a classification system is the one relative to the individuation of a decision rule that allows the classification of all the objects present in the region to process. The algorithm determining the different structures present in each region individuated by the merging phase is based on a guided iterative clustering method. At each iteration one class is individuated. Each class corresponds to a structure present in the region under investigation. The representative element of a class (the sample piece) is constituted by the piece that repeats most often in the region under investigation. Let I(x,y),

Ei = 0 if Vi < V0,  Ei = 1 if Vi = V0,  Ei = 2 if Vi > V0    (6)

Ei occupies the same position as pixel i and has N = 3 possible values, i.e. 0, 1, or 2. The label of pixel V0, which is called the texture unit number (NTU) of V0, is computed by

NTU = Σ_{i=1}^{8} Ei N^{i-1},   NTU ∈ {0, 1, 2, ..., (N^8 - 1)}    (7)

where i = 1, 2, ..., 8 refers to the relative position of the eight neighbours with respect to the central pixel V0. The texture spectrum is the occurrence frequency of all the texture unit numbers within a moving window (a window size of 30x30 was used), with the abscissa indicating the texture unit number NTU and the ordinate representing its occurrence frequency. If the Roberts edge detection operator is used, the integrated absolute difference between two spectra is taken as the difference between two elements of the edge detection operator:

R(i,j) = sqrt(Δ1² + Δ2²)    (8)

and

Δ1 = Σ_k |S_{i,j}(k) - S_{i+1,j+1}(k)|,   Δ2 = Σ_k |S_{i+1,j}(k) - S_{i,j+1}(k)|    (9)

where the sums run over all texture unit numbers k and S_{i,j}(k) denotes the kth element of the texture spectrum calculated from the window located at position (i,j). A simple thresholding technique was applied to the Roberts edge image; a typical result for a Lee filtered input image is shown in Figure 5 and for a median filtered input image in Figure 6. This texture edge detection was used to evaluate the best filtered input image: the filtered image which gave the best texture edge image was considered the best input image.
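A compact sketch of Eq. (6) and (7) follows: it codes the eight neighbours of a pixel relative to the central value V0 and accumulates the occurrence frequencies of the resulting texture unit numbers over a moving window. The function names and the particular neighbour ordering are our assumptions.

```python
def texture_unit_number(window3x3, N=3):
    """Texture unit number (NTU) of the central pixel of a 3x3 neighbourhood:
    each neighbour is coded 0, 1 or 2 by comparison with the central value V0
    (Eq. 6) and the eight codes are read as a base-N number (Eq. 7)."""
    v0 = window3x3[1][1]
    neighbours = [window3x3[r][c] for r in range(3) for c in range(3)
                  if not (r == 1 and c == 1)]
    ntu = 0
    for i, v in enumerate(neighbours):            # i = 0..7 corresponds to i = 1..8
        e = 0 if v < v0 else (1 if v == v0 else 2)
        ntu += e * (N ** i)
    return ntu                                    # in {0, ..., N**8 - 1}

def texture_spectrum(image, x0, y0, win=30, N=3):
    """Occurrence frequency of every NTU inside a win x win moving window
    whose top-left corner is at (x0, y0)."""
    spectrum = [0] * (N ** 8)
    for y in range(y0 + 1, y0 + win - 1):
        for x in range(x0 + 1, x0 + win - 1):
            patch = [row[x - 1:x + 2] for row in image[y - 1:y + 2]]
            spectrum[texture_unit_number(patch, N)] += 1
    return spectrum
```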

3. TEXTURE FEATURES

Texture has been recommended as a potential attribute for radar image interpretation. Several investigations have already been done, but an optimal set of texture features for radar image classification has not yet been validated. The following are the results of a literature survey done in the frame of this study, covering the use of texture analysis for radar images and for Brodatz images, which have been widely used for benchmark tests. Ulaby [25] found that combining image tone and co-occurrence matrix features, i.e. contrast and inverse difference moment, in a multi-dimensional classification model resulted in an overall classification accuracy exceeding 90% for SAR radar images. Schistad [9] used the co-occurrence matrix features which included angular second moment, entropy, correlation, cluster shade, inertia, and inverse difference moment, and the local statistics features which included the local mean and the local ratio between the standard deviation and the mean value, which perform quite well with approximately 89% correct classification. In her comparative texture classification experiment using SAR radar images, Schistad [9] also showed that the multiplicative autoregressive random-field model performs best with approximately 94% correct classification, and that the fractal features perform worst with approximately 75% correct classification. Frankot [26] also showed that the lognormal random field is a good model for SAR radar image synthesis. Wang [27] proposed a concept of texture units and used the texture unit features in Brodatz and radar image interpretation. The concept was applied to Brodatz images with around 97% correct classification, but its performance with radar images has not been reported. Multi-channel Gabor filtering based on a human vision model has been applied to Brodatz images by Jain [28] with 93% correct classification, and wavelet transformation was used by Unser [29] with 90% to 99% correct classification. Based on the previous works, several statistical features were selected for the first experiment; they included three features of texture units, seven features of the co-occurrence matrix, and two features of the local statistics. The random-field model, the Gabor filtering, and the wavelet transformation will be included in the second experiment. A visual judgement approach was used for selecting an optimal combination of features for radar image classification.


3.1. A concept of texture units
A concept of texture units was proposed by Wang [27] and described in Eq. (6) and (7). The image classification can be done using the integrated absolute difference between the texture spectrum of an observed pixel and the texture spectra of the texture models, where the minimum distance decision rule can be employed. The texture spectrum reveals textural information of an image in a primitive form; it is useful and necessary to extract textural features from it. These features can then be used more easily in practice for texture characterization and image classification. Three texture features were proposed by Wang and He and have been used in this study. They consist of Black-White Symmetry (BWS), Geometric Symmetry (GS), and Degree of Direction (DD):

BWS = [1 - Σ_{i=0}^{3279} |S(i) - S(3281+i)| / Σ_{i=0}^{6560} S(i)] x 100    (10)

GS = [1 - (1/4) Σ_{j=1}^{4} Σ_{i=0}^{6560} |S_j(i) - S_{j+4}(i)| / (2 Σ_{i=0}^{6560} S_j(i))] x 100    (11)

DD = [1 - (1/6) Σ_{m=1}^{3} Σ_{n=m+1}^{4} Σ_{i=0}^{6560} |S_m(i) - S_n(i)| / (2 Σ_{i=0}^{6560} S_m(i))] x 100    (12)

S(i) is the occurrence frequency of the texture unit numbered i, where i = 0, 1, 2, ..., 6560 are the possible values of NTU for N = 3 in Eq. (7), and S_j(i) is the occurrence frequency of the texture unit numbered i in the texture spectrum under the ordering way j, where i = 0, 1, 2, ..., 6560 and j = 1, 2, 3, ..., 8. An example of a classified image based on the GS and DD features with a window of size 30x30 is shown in Figure 7, and one based on the image tone, GS, and DD features in Figure 8.
3.2. Gray level co-occurrence matrix
Seven features of the co-occurrence matrix were used in this study. They included the energy, entropy, maximum probability, contrast, inverse difference moment, correlation, and homogeneity described by Haralick [30]:

Image:
0 0 1 1
0 0 1 1
0 2 2 2
2 2 3 3

Grey Tone table:
#(0,0) #(0,1) #(0,2) #(0,3)
#(1,0) #(1,1) #(1,2) #(1,3)
#(2,0) #(2,1) #(2,2) #(2,3)
#(3,0) #(3,1) #(3,2) #(3,3)

Ph(0 degree) =
4 2 1 0
2 4 0 0
1 0 6 1
0 0 1 2

Pv(90 degree) =
6 0 2 0
0 4 2 0
2 2 2 2
0 0 2 0

Prd(45 degree) =
4 1 0 0
1 2 2 0
0 2 4 1
0 0 1 0

Pld(135 degree) =
2 1 3 0
1 2 1 0
3 1 0 2
0 0 2 0
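The matrices above can be reproduced with the short sketch below, which counts symmetrically the pairs of grey levels separated by a given displacement; the function name is ours and the (row offset, column offset) displacement convention is an assumption.

```python
def cooccurrence(image, dr, dc, levels=4):
    """Grey-level co-occurrence counts for displacement (dr, dc), counted
    symmetrically (each pair contributes in both directions)."""
    P = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r][c]][image[r2][c2]] += 1
                P[image[r2][c2]][image[r][c]] += 1
    return P

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
P_H = cooccurrence(img, 0, 1)     # 0 degrees; reproduces Ph above
P_V = cooccurrence(img, 1, 0)     # 90 degrees; reproduces Pv above
P_RD = cooccurrence(img, -1, 1)   # 45 degrees
P_LD = cooccurrence(img, -1, -1)  # 135 degrees
```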

If P(i,j) is an element of the co-occurrence matrix, then:

Energy = Σ_i Σ_j P(i,j)²    (13)

Entropy = - Σ_i Σ_j P(i,j) log P(i,j)    (14)

Contrast = Σ_i Σ_j (i - j)² P(i,j)    (15)

Correlation = Σ_i Σ_j (i - μ_x)(j - μ_y) P(i,j) / (σ_x σ_y)    (16)

Inverse Difference Moment = Σ_i Σ_{j, i≠j} P(i,j) / (i - j)²    (17)

Homogeneity = Σ_i Σ_j P(i,j) / (1 + |i - j|)    (18)

Maximum Probability = max_{i,j} P(i,j)    (19)

where μ_x, μ_y, σ_x and σ_y are the means and standard deviations of the row and column marginals of P.
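A sketch of the seven features of Eq. (13)-(19), computed from a co-occurrence matrix normalised to sum to one; since the exact denominators of the inverse difference moment and homogeneity vary in the literature, the code follows the forms given above, and the function name is ours.

```python
import math

def glcm_features(P):
    """Texture features of Eq. (13)-(19) from a co-occurrence matrix P,
    normalised so that its entries sum to one."""
    n = len(P)
    total = float(sum(sum(row) for row in P))
    p = [[P[i][j] / total for j in range(n)] for i in range(n)]

    mu_i = sum(i * p[i][j] for i in range(n) for j in range(n))
    mu_j = sum(j * p[i][j] for i in range(n) for j in range(n))
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * p[i][j] for i in range(n) for j in range(n)))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * p[i][j] for i in range(n) for j in range(n)))
    denom = sd_i * sd_j

    return {
        'energy': sum(p[i][j] ** 2 for i in range(n) for j in range(n)),
        'entropy': -sum(p[i][j] * math.log(p[i][j]) for i in range(n) for j in range(n)
                        if p[i][j] > 0),
        'contrast': sum((i - j) ** 2 * p[i][j] for i in range(n) for j in range(n)),
        'correlation': (sum((i - mu_i) * (j - mu_j) * p[i][j]
                            for i in range(n) for j in range(n)) / denom) if denom else 0.0,
        'inverse_difference_moment': sum(p[i][j] / (i - j) ** 2
                                         for i in range(n) for j in range(n) if i != j),
        'homogeneity': sum(p[i][j] / (1.0 + abs(i - j)) for i in range(n) for j in range(n)),
        'maximum_probability': max(max(row) for row in p),
    }

# Example: glcm_features(P_H) for the 0-degree matrix of the worked example above.
```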

The features were computed on eight grey levels using a moving window of size 15x15. A classified image using the seven features of the co-occurrence matrix is shown in Figure 9.
3.3. Local statistics features
The local statistics features used here were proposed by Schistad [9]. They considered texture as local scene heterogeneity, and assumed that the speckle obeys a Gamma distribution. The same assumption was also used by Gastellu-Etchegorry [17]. Thus, the local mean and the local ratio between the standard deviation and the mean value can be used as texture indexes. The local statistics were computed within a moving window of size 9x9. An example of a classified image which used the two local statistics features is shown in Figure 10.
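The two local statistics indexes reduce to a few lines; a sketch assuming a 9x9 window lying fully inside the image, with a function name of our choosing.

```python
import math

def local_statistics(image, r0, c0, win=9):
    """Local mean and local (standard deviation / mean) ratio inside a
    win x win window centred at (r0, c0) -- the two texture indexes used above."""
    half = win // 2
    values = [image[r][c]
              for r in range(r0 - half, r0 + half + 1)
              for c in range(c0 - half, c0 + half + 1)]
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    ratio = math.sqrt(var) / mean if mean else 0.0
    return mean, ratio
```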

4. IMAGE CLASSIFICATION AND SEGMENTATION

Four object categories were originally planned to be used in the image classification, i.e. water (normally homogeneous regions with dark signature), dense-forest (smooth texture), secondary-forest (brighter and rougher texture), and bushes (bright response). Since the

perception of the secondary-forest and the tall-bushes is hard to differentiate, these two object categories were combined into the bushes class. Feature extraction was done using an unsupervised approach. Images of size 512x512 were used. Random sampling was used to obtain a training sample set of 2500 samples per image. The sample data were clustered into three clusters and a minimum distance classifier was used for the classification. Figures 7 to 10 show classified images based on different texture features. Visual judgement has been incorporated in the evaluation of the classification results, both to see whether or not linear features such as the river become thicker or are interrupted, and to select the potential features for image classification. Figure 11 shows a classified image based on the selected features, which could be considered as an optimal combination of features. The proposed combination of features should be validated further with larger data sets. The classified images should be cleaned from very small regions, and Figure 12 shows an example of a final classified image. Classification accuracy was measured using a supervised approach. Four 650-pixel sub-images were selected for each object category; for water, seven 50-pixel sub-images were used. A table of correct classification and classification errors was developed.
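The minimum distance decision rule used above amounts to a nearest-mean assignment; a minimal sketch, where the cluster means and the two-dimensional feature values in the example are made up for illustration.

```python
def minimum_distance_classify(feature_vector, class_means):
    """Assign the feature vector to the class whose mean is nearest
    (squared Euclidean distance)."""
    best_label, best_dist = None, float('inf')
    for label, mean in class_means.items():
        dist = sum((f - m) ** 2 for f, m in zip(feature_vector, mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical cluster means for the three classes used in this experiment:
means = {'water': [0.1, 0.2], 'dense-forest': [0.5, 0.4], 'bushes': [0.9, 0.7]}
print(minimum_distance_classify([0.45, 0.5], means))   # -> 'dense-forest'
```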

5. RESULTS AND DISCUSSION

5.1. Speckle noise elimination
The filtered images shown in Figures 2 to 4, which are the results of using the median filter, the adaptive Lee filter, and the lineament enhancement respectively, show a reasonably better visual appearance than the original one in Figure 1, in the sense that they reduce the speckle and enhance the textural information. The results of texture edge detection show that the Lee filter preserves edges best, followed by the median filter, as shown in Figures 5 and 6.
5.2. Feature selection
Figures 7 and 8 show classified images using the texture unit features. Based on the cases of this experiment, visually the geometric symmetry (GS) and the degree of direction (DD) feature images contribute the greatest variance, while the black-white symmetry (BWS) feature image generally has low variance. It is also important to select an optimal window size. It seems that the use of a window size of 30x30 favours the forest and the bushes classes but destroys the river class, and the contrary holds if a window size of 10x10 is used. In both cases the use of the image tone feature can improve the classification accuracy; an example is shown in Figures 7 and 8. For the image in Figure 8, the three-object classification error is 26.8% and the water classification error is 12.5%. Figure 9 shows a classified image using the seven features of the co-occurrence matrix, which is not a good result. Based on visual judgement, three of the feature images, namely the contrast, correlation, and entropy feature images, contribute similar information, and the other four feature images, namely the inverse difference moment, homogeneity, maximum probability, and energy feature images, also contribute another kind of similar information. Cases of the experiment using any two features from the two groups of features show that the


Figure 1. Original image.

Figure 2. Median filtered image.

Figure 3. Lee filtered image.

Figure 4. Lineament enhanced image.

Figure 5. Texture edges of Lee image.

Figure 6. Texture edges of median image.


Figure 7. Classified image based on 30x30 GS and DD features.

Figure 8. Classified image based on image tone, 30x30 GS and DD features.

Figure 9. Classified image based on seven features of co-occurrence matrix.

Figure 10. Classified image based on local mean and local standard deviation/mean.

Figure 11. Classified image based on GS, DD, contrast, inverse difference moment, local mean, and standard deviation/mean.

Figure 12. Classified image based on six optimal features and after single-pixel region elimination.

classification error is quite high, approximately 45%. Seven other recommended features of the co-occurrence matrix are going to be investigated in further experiments of this study. Figure 10 shows a classified image based on the two local statistics features. The classification result is quite good, with a classification error of approximately 30%. At this stage of the experiment, a combination of six features consisting of the geometric symmetry, degree of direction, contrast, inverse difference moment, local mean, and the local ratio between the standard deviation and the mean value was selected and considered as the optimal set of features; the results are shown in Figures 11 and 12. The optimal set of features performs relatively well, with approximately 5.5% three-object classification error and 10.5% water classification error.

6. CONCLUSIONS AND FUTURE WORK

6.1. Filtering module
Even though the experimental results show that the adaptive Lee filter gives the best filtering result, other potential filters such as the median filter, the mean filter, the lineament enhancement, the adaptive Frost filter, and the adaptive Gaussian smoothing filter will also be provided in this module of the radar image processing system. A texture edge detection is provided as a mechanism for evaluating the filtering results.
6.2. Feature extraction module
The texture unit features and the local statistics features show quite good results. The use of other features of the co-occurrence matrix should be explored in order to find relatively better features than the seven features used in this experiment, even though the use of two features of the co-occurrence matrix together with the two texture unit features and the two local statistics features already shows a good performance. Currently, the comparative experiment is being continued with the use of the angular second moment, cluster shade, inertia, cluster prominence, diagonal moment, high level moment, and low level moment of the co-occurrence matrix and the use of features derived from the multiplicative autoregressive random-field model, Gabor filtering, and wavelet transformation. Besides the visually-based judgement, a feature selection technique based on supervised learning is also applied.
6.3. Image classification module
The classification module consists of random sampling, supervised sampling, clustering, a minimum distance classifier, single-pixel elimination, and classification accuracy evaluation. Any other classifier can also be added to this module, such as a maximum-likelihood classifier or a multistage classifier.
6.4. Validation module
A benchmark test and its accuracy test will be arranged in this module. Standard reference areas for SAR radar images will also be provided in this module. The results obtained from this study must be validated further with larger data sets.

ACKNOWLEDGEMENT
We thank Bapak Mahmud A. Raimadoya, Bogor Institute of Agriculture, for supporting us with the raw SAR image data and for investigating the results. We also thank Prof. Anil K. Jain, Michigan State University, for allowing us to use the PRIP Lab software for the image classification.

REFERENCES
1. J. Rais and staff, Discussions, Bakosurtanal, Jakarta, (1990).
2. I2S, Dipix, Erdas, MDA, Software and Hardware Spec., Bidding Doc., IUC-CS-UI, Jakarta, (1987).
3. I. Samihardjo and B. Marsudi, The Application of Remote Sensing for Strategic Purposes, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, Jakarta, (1993).
4. J.P. Canton and J.P. Sempere, ERS/SPOT for Mapping: A search for complementarity, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, Jakarta, (1993).
5. R. Schumann, Status of the first European Remote Sensing Satellite ERS-1, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, Jakarta, (1993).
6. EC-ASEAN ERS-1 Project Newsletter, (1993).
7. M.A. Raimadoya and A. Murni, A research preproposal: COSMOS, Jakarta, (1993).
8. M.A. Raimadoya and K. Endo, The application of multisensor remote sensing in forest inventory, Seminar on Reforestation and Rehabilitation, Univ. of Mulawarman, Samarinda, (1989).
9. A.H. Schistad and A.K. Jain, Texture analysis in the presence of speckle noise, IEEE Geoscience and Remote Sensing Symposium, Houston, (1992).
10. V.S. Frost et al., An Adaptive Filter for Smoothing Noisy Radar Images, Proc. of IEEE, Vol. 69, No. 1, (1981).
11. B.N. Koopmans and E. Ricchetti, Optimal Geological Data Extraction from SPOT-Radar Synergism, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, Jakarta, (1993).
12. I. Scollar and B. Weidner, Image Enhancement using the Median and the Interquartile Distance, Comp. Vision, Graph., and Image Proc. 25, (1984).
13. Ruzbari, Perbandingan Perbaikan Citra SAR antara Filter Median dan Filter Frost, Fak. Teknik UI, (1990).
14. M.A. Raimadoya, Aplikasi Citra Radar, BPPT - I2S USA - Multimatra Prakarsa, Jakarta, (1993).
15. D.G. Goodenough et al., Adaptive Filtering and Image Segmentation for SAR Analysis, Mach. Proc. of Remotely Sens. Data Symp., (1984).
16. P. Mouginis-Mark et al., Spaceborne and Airborne Radar, Mach. Proc. of Rem. Sens. Data Symp., (1984).
17. J.P. Gastellu-Etchegorry et al., Tropical Vegetation Survey with optical and SAR data, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, Jakarta, (1993).
18. G.K. Moore and F.A. Waltz, Objective Procedures for Lineament Enhancement and Extraction, Photogram. Eng. and Rem. Sensing, 49, 5, (1983).
19. S.J. Wirjosoedirdjo et al., Speckle Noise Reduction of SAR Imagery, Bakosurtanal R.L., Jakarta, (1987).
20. J.S. Lee, Speckle analysis and smoothing of SAR images, Comput. Graph. Image Proc. 17, (1981).
21. J. Ton et al., Automatic road identification and labelling in Landsat TM images, Elsevier, (1988).
22. Y. Xiaohan et al., Image Segmentation Combining Region Growing and Edge Detection, IEEE, (1992).
23. D.C. He and L. Wang, Detecting texture edges from images, Pattern Recognition, 25, 6, (1992).
24. J. Harms et al., Integration of ERS-1 Data into the European Project of Agriculture Statistics, National Conf. on ERS1-Landsat-SPOT, BPPT-Bakosurtanal, (1993).
25. F.T. Ulaby et al., Textural Information in SAR, IEEE Trans. on Geoscience and Remote Sensing, Vol. GE-24, No. 2, (1986).
26. R.T. Frankot and R. Chellappa, Lognormal Random-Field Models and their Application to Radar Image Synthesis, IEEE Trans. on Geoscience and Remote Sensing, Vol. GE-25, No. 2, (1987).
27. L. Wang and D.C. He, A Statistical Approach for Texture, Photogram. Eng. Rem. Sens., Vol. 56, No. 1, (1990).
28. A.K. Jain and F. Farrokhnia, Unsupervised texture segmentation using Gabor filters, Pattern Recognition, 24, 12, (1991).
29. M. Unser, Texture Discrimination using Wavelets, Computer Vision and Pattern Recognition, (1993).
30. R.M. Haralick et al., Statistical Image Texture Analysis, Handbook of Pattern Recognition and Image Processing, Academic Press, (1986).



Discussions Part II

Paper Brailovsky and Kempner

Alder: In the case of Gaussian noise, have you compared your threshold method for selecting the number of segments with AIC, BIC, or stochastic complexity?

Brailovsky: If you mean the Minimal Description Length Principle, this is not so easy to compare. There are many parameters in that principle and it is unclear how they should be determined. Also, if you consider the Minimal Description Length approach, its formulation may be reduced to a trade-off between complexity and precision of the approximation, which in our case translates into a trade-off between the number of regions of linearity and the level of approximation. So, there are certain degrees of freedom in this approach. We are discussing with some people from the IBM Research Center the possibility to do such comparisons but it will not be straightforward. What was done is a comparison with the cross-validation estimate, which has no specific parameters. This comparison demonstrated that the method presented here works a little bit better. But besides that, the cross-validation based approach is impossible to apply in image analysis, because the analysis of each line takes a factor of a hundred times more computer time than the algorithm reported here.

Davies: I just wonder whether it is not slightly dangerous to cut off the number of segments K. You increase K till you get to the optimum. It may only temporarily get worse, but you cut off at that point. I am thinking of an analogy. If I were looking for circles in an image and in fact they were ellipses, but I didn't know, I would then need two extra parameters and not one: that would be a sort of trick case.

Brailovsky: I think you are right, but if you perform more complex algorithms, the situation would be even worse. What makes the situation easier here is that the 1-D analysis is not the final stage. During 2-D image analysis, using for example continuity considerations, I have the possibility to correct possible errors. So the errors in the 1-D analysis are not the final errors.

Alder: My question is, have you compared your methods with Markov Random Field based reconstructions?

Brailovsky: That is a good question. With Markov Random Fields you immediately meet the problem of what parameters to choose to have a good result. The problem of parameter specification for Markov Random Fields is a very crucial point. In our approach the situation is much easier. We need not define any so-called artificial parameters, only the significance level. Also, our algorithm is much faster.

Jain: I have a similar question about your 1-D treatment of 2-D images. You are right about the parameter complexity but now you lose the correlation information between the rows and columns. And since you are talking about image restoration, are you assuming any noise model, any kind of a blurring function? Can you handle a specified non-linear function which has been applied to the image?

Brailovsky: The source of my statistical comparison is the probability distribution which I am able to build. I take an unstructured signal and I add noise in the form of identically distributed independent Gaussians and so I create an artificial function. From this function, with the help of the algorithm, I can get these probability distributions and statistical bounds. If I have some information about other kinds of noise, I can represent this information when I create the family of unstructured signals. For example, if the noise has some known correlation properties, I can use this information to create a family of such signals and to obtain the new bounds, which take into account the nature of the noise. In addition to that, the least squares approach is ingrained in the method and as a result we are able to work with specific kinds of noise corrupting the signal. If you use an appropriate technique in the algorithm and an appropriate family of unstructured signals to construct the bounds, you can solve a broader class of problems.

Paper Di Ruberto et al. (presented by Vitulano)

Mulder: In your title you mention techniques of Artificial Intelligence. Can you indicate where the Artificial Intelligence comes in?

Vitulano: In my opinion, problem solving using an evaluation function and strategy control is a form of Artificial Intelligence.

Gerbrands: You gave many examples: images of watches, scissors, oranges, biomedical images, etc. When changing from one application to the other, do you have to change parameter settings and decision boundaries?

Vitulano: We do not change any setting. Everything is guided by the search strategy on the basis

of the characteristics of the histogram of the scene.

Gerbrands: That would indicate that you have defined the ultimate segmentation method which will segment any image for any problem. This may be true, depending on what you want to achieve. If you want to construct a more or less meaningful map which is then interpreted by a human being, this may work. But if you consider this as a preprocessing step prior to automatic measurements, you may run into problems.

Vitulano: Our purpose is perception and not automatic recognition.

Brailovsky: You have shown the result of segmentation in biomedical images. Do you have contact with medical professionals and if so, do they find these results useful in their practical work?

Vitulano: I work in a medical faculty and I have collaborated with medical professionals for five years in different parts of Italy. It is only one year that this program has been in routine use in different hospitals. The doctors can usually verify the results.

Talmon: You claim that with your method you can see the effects of therapy by comparing images taken at different times. Have you done any evaluation studies to verify that?

Vitulano: Medical professionals tell us that the changes are really due to therapy.

Talmon: Have you done any animal experiments in order to verify that what you see in the images really corresponds to what there is in the real tissue?

Vitulano: Yes, we did that to some extent, but even then it is difficult to see the correspondence between the image and the section of the organ.

Paper Wilson and Hancock

Bunke: What is the likelihood that the method converges to the correct solution? Did you find any cases where it converged to a wrong solution or where it did not converge at all? And the second question is: how did you verify? By manual inspection?

Hancock: We have looked at all the matches and worked out ground truth. The results are quoted with respect to ground truth and list how many matches are correct and which are incorrect. The other thing that we did is to construct a pattern space model. If you know the number of nodes and edges in the graphs that you are trying to match, you can actually work out a lower value for the error probability. We can also estimate the residual error from the Hamming distances in the final match and compare this with the theoretical value. They are not in perfect agreement but they are plausibly close. We feel that this modelling is going in the right direction; it looks as if this pattern space model tells something about how well we may expect to do.

Davies: I have been doing graph matching using maximal cliques and the Hough transform. In these cases, the 'null' votes you use are not needed. What happens is that one just votes positively where evidence exists. It would be useful to know why a "null" category has to exist in your case.

Hancock: As I said in the conclusions, this is our first attempt at this problem. Subsequent refinements, not reported in this conference, are aimed at using the fact that you are measuring the quality of inexact matches, using a meaningful objective function. You can accommodate lots of inconsistencies in the final match, but then you need some way of identifying the consistent portion. We have developed a technique for doing this. You can think of this as a variant of some of the classical constraint filtering techniques.

Paper Feng et al. (presented by Laumy)

Sethi: I have seen several papers doing exactly the same thing. What do you think is new or different in the work that you have presented compared to the work that has been reported in the literature in the past five years?

Laumy: First, it is only a beginning. In future work we are going to use 3-D constraints which are more powerful than 2-D constraints.

Paper Legoupil et al.

Mulder: I would like to raise the question of the certainty of the experts. Are you in a position to get statistics from a large number of experts?

Legoupil: The problem is that experts rarely agree on sulcus identification. A way to validate our

system would indeed be to confront experts between themselves and to compare this confrontation with our results. So first we would have to find confidence parameters from different experts. The second part would be to compare our results with specialists with respect to such statistical criteria. But so far we have not done that.

Kittler: How did you estimate the parameters of your compatibility coefficient function, in particular the sigmas?

Legoupil: They have been estimated on the basis of empirical evidence. We have tested 55 couples of sulci and the average values of the estimates allowed us to define a standard set of parameters sigma 1 and sigma 2. With this standard set, applied to the 55 couples of sulci, in 85% of the cases the sulci have exactly the same topology.

Paper Mulder and Luo

Egmont-Petersen: You had a comment on a pattern in the errors. These might be systematic errors. Do you utilize any such pattern to correct your algorithm? Might it be possible to incorporate systematic errors in your optimization algorithm?

Mulder: The process consists of two loops. There is a local loop to obtain the best estimation of the parameters on the given set of hypotheses. Then, at a higher level it is determined whether the minimum cost is acceptable. If it is, and if no regular patterns are observed, you can go on. If there is an error pattern, then the scientist will have to generate new hypotheses at a generic class level.

Smyth: How can you prevent getting many small segments as a result of the segmentation process?

Mulder: We do not explicitly control the number of segments. We have a prior probability of the size of the fields.

Brailovsky: There are various ways to incorporate prior knowledge. You said that you incorporated your knowledge about the geometrical structure of the image in the cost function. Can you comment on that?

Mulder: The geometric hypotheses and their parameters define the hypothesis per pixel. This results in a predicted class label per pixel or per area object. The other domain is

included by assuming a label at a pixel. I can predict the radiometric value in terms of conditional probability density functions. That is how both are combined.

Murni: In Indonesia, the agricultural fields do not always follow rectangular shapes. Is it possible to extend the method to other geometrical shapes? Maybe we can use fractal parameters?

Mulder: I am in the lucky circumstance that I know the area of application. The fractal idea is the worst one you can think of because even in Indonesia the fields are not fractal. The controlling factors of the shape are very often contour lines. Because many of the agricultural fields are irrigated rice fields, the field boundaries will follow contours. So we initiate the model, say the Digital Elevation Model (DEM), from which the contours can be extracted. In the first iteration we may have to use some manual input. Once we have defined the complete digital terrain model, we only need an additional model for the change with time. That is quite manageable. My general comment on fractal models would be that they are nice for presentation, but I have only seen one area where I could invert a model and that is with fractal plants. Fractal plant string re-writing provides a plant generation model. And with a 3-D model you can ray-trace the simulated image and see whether on certain features the generated tree is like the actual tree. I would not suggest using this model for agricultural field modelling.

Paper Murni et al.

Raudys: How many spectral bands did you use, one or more?

Murni: A single band. From that we derived a set of feature images.

Talmon: You mentioned that you used the adaptive Gaussian filter, but you did not show any results of that. Can you say a little more about this?

Murni: All the filtered images are almost the same. The speckle disappears, but we do not yet know which type of filter gives the best classification. I will try to find a way to evaluate the results without having to go through all the filter outputs. Since we look at textures, I used texture edge detection to see the boundaries between the texture regions.



Spatio/temporal causal models
John F. Lemmer
Rome Laboratory (RL/C3CA), 525 Brooks Road (Bldg. 3), Griffiss AFB, NY 13441, [email protected], (315) 330 3655

1. INTRODUCTION
Bayesian Networks (Dependency Graph Models) are frequently proposed as parsimonious models of causality, useful for a variety of different functions ranging from diagnosis to statistical analysis [Pearl, 1988] [Pearl, 1993]. In the physical world, events corresponding to causes and effects have both spatial and temporal limits. Causes are generally not everywhere effective for all time, nor are direct effects observable everywhere for all eternity. Yet these spatio/temporal limits are almost never explicitly modeled within the formalism of Bayesian Networks. This paper provides a formal model in which spatio/temporal limits are modeled within the framework of Bayesian Network models. This is done by extending the domain of the conditional probability tables associated with each node in the network to include functional relations regarding both space and time. Formally, this is straightforward, but severe feasibility problems must be addressed in order to make this proposal into a practical system. In this paper, feasible methods for both prediction and inference are presented. These methods are based upon assumptions of independence between categorical and spatio/temporal variables, and upon discrete models of space and time. These methods are shown to lead to a natural definition of both "report correlation" and "sensor fusion"¹ within the context of Bayesian Network Models. The methods proposed here offer an alternative to Action Networks [Goldszmidt and Darwiche, 1994]. Our methods code spatio/temporal information within the state of Bayesian Network nodes; Action Networks code this information by adding additional arcs and nodes to the basic Bayesian Network.

2. FORMAL MODEL
It is possible to embed notions of space and time within the standard Bayesian Network model of causality. This can be done simply by requiring the domain of the random variables in the model to include spatio/temporal information. Here we will formally define a specific method for including spatio/temporal variables. The model, as described in this section, will be computationally intractable and missing certain desirable dynamic characteristics. The formal model does not allow us to model that a single rain storm can cause more than one puddle, nor that a squeegee can terminate some of these puddles. In Section 3, however, we introduce a series of assumptions leading to an implementation which is both constructively and computationally feasible, i.e. it is reasonable both to build and to compute with the less general models. In addition, the implementation provides a means for including the dynamic characteristics missing from the formal model.

¹ Both these terms are in common usage in various types of Military Intelligence Systems, including target identification systems and Indications and Warnings systems.


Intractability of the formal model results from the huge domain implied by the cross product of all the variables associated with each node. One missing dynamic characteristic is an explicit representation of the notion that a cause, acting over a suitably large range of space and time, may produce multiple distinct instances of the same type of effect. Another missing characteristic is that certain causal events can terminate the effects (of other causes) which might otherwise have been expected to persist longer. Intractability will be overcome by invoking independence assumptions and by using a discrete rather than a continuous domain. Using a discrete version of the domain also provides a means for adding the desired dynamic characteristics.
In its standard form, a Causal Model is an acyclic dependency graph, D = (N, E), where N is a set of nodes, {Ni}, and E is a set of directed edges, E = {(Ni, Nj)}. Corresponding to N is a set of random variables, X = {Xi}, each element of which is in one-to-one correspondence with an element of N. Define the mother set of Xi, M(Xi), to be
M(Xi) = {Xj | (Nj, Ni) ∈ E}
Associate with each random variable a conditional probability function, pr(Xi | M(Xi)),

which we will denote by the shorthand prcd(Xi). By the assumption of conditional independence implied by the graph structure, we have that

pr(X) = Π_i prcd(Xi)
By requiring the Xi to be of a special form, the Causal Model becomes a Spatio/Temporal Causal Model. The form we require is that
Xi = (si, bi(si, t, u, v, w))

The variable si is a vector of state information for the event defined by the random variable, including its value (e.g. is raining), its (point) location, and (start) time. The function bi is a Boolean function of the state information and the global space-time coordinate system: bi computes whether the random event Xi is observable at the space-time point (t, u, v, w). (If the observability of the random variable is thought of as a predicate, b can be thought of as determining whether the predicate is true at a particular point in space-time.) As an example, if the event is "raining," si might include the centroid and start time of the storm. The function b tells where and when rain is being generated by this particular storm. This model inserts into the usual notion of an event the notion of observability in space and time. Thus the mathematics of the Causal Model remain the same, while it becomes possible to conceive of events which are spatially and temporally limited. The Xi can be made discrete in the same way that the state information in any Dependency Graph based Causal Model can be. The information in s is made discrete in the normal way: a set of categories naming the event type, and a set of space/time intervals. The function b can be made discrete by considering only some discrete set of functions to be candidates for b. Once discrete, the domain of the random variable can be thought of, as usual, as the set of all combinations of these discrete values. As an example of all this, consider further a "rain storm," RS, causing a "mud puddle," MP. Suppose that the domain of RS is the set of four discrete states


RS = {(s1, b1), (s1, b2), (s2, b1), (s2, b2)}
where s1 = [t1=0, u1=1, v1=0, w1=0] is one time and place for a storm and s2 = [t2=0, u2=0, v2=1, w2=0] is another. Suppose b1, describing the space-time extent of a small storm, is defined as
b1(sj, t, u, v, w) = true if sqrt((u-uj)² + (v-vj)² + (w-wj)²) < 3 and tj < t < tj + 5, and false otherwise,
and b2, describing a big storm, is given by an analogous expression with a larger spatial extent.

Figure 3 shows the structure of the decision tree with three strategies: 'wait', 'test' and 'treat'. Figure 4 depicts the structure of the decision network for the same decision problem. In Section 5.1 we discuss the decision network for the single test/single treat decision problem and the strategy generation. The decision tree approach for the single test/single treat decision problem is discussed in Section 5.2. In the single test/single treat decision problem there is one parameter that varies with the particular decision problem: the prevalence of the disease, P(d). To study the differences in (optimal) strategy between the generated strategies and the strategies supplied by the decision tree, for varying prevalence of the disease, two sensitivity analyses were conducted; they are presented in Section 5.3 and their results are compared in Section 5.4.
5.1. The single test/single treat decision network
The logic of the single test/single treat decision problem is reflected in the topology of the decision network shown in Figure 4. Consider the left part of the decision network: SE (side effects), TR (treat), and DC (disease continued). On the one hand, administering a drug could have the negative effect of side effects but, on the other hand, the positive effect of reducing the likelihood of the disease continuing. Further, the manifestation of side effects does not influence the positive effect of the drug; that is, SE and DC are conditionally independent given TR. Next, consider the middle part: DC, D and R (result). The probability distribution of the test result R depends on the prevalence of the disease. After a test result has been found it will change the clinical likelihood of this disease. The arc from D to R represents this dependency and is quantified with the sensitivity and the specificity of the test. The dashed arc from T to R indicates that a test result will become available after the test has been performed. We notice that this link is conceptually not part of the decision network.
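The topology just described can be written down directly; the sketch below only records the structure of Figure 4 as parent lists (the container names are ours), and is not the authors' implementation.

```python
# Structure of the single test/single treat decision network of Figure 4,
# recorded as parent lists. D = disease, R = test result, T = test decision,
# TR = treatment decision, SE = side effects, CP = complications, DC = disease continued.
decision_vertices = ['T', 'TR']
chance_parents = {
    'D':  [],            # disease, quantified by the prior prevalence P(d)
    'R':  ['D'],         # test result, quantified by sensitivity and specificity
    'SE': ['TR'],        # side effects of the drug
    'DC': ['TR', 'D'],   # disease continued
    'CP': ['T'],         # complications of the test
}
informational_arc = ('T', 'R')     # dashed arc: result available only after testing
utility_vertices = ['CP', 'SE', 'DC']
```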


Figure 3. Decision tree for the single test/single treat decision problem.

The strength of the causal relations is quantified by conditional probabilities:
P(SE | TR), P(DC | TR ∧ D), P(R | D), P(CP | T)

The prevalence of the disease is expressed by the prior probability P(d). The decision vertices, treatment and test, need no quantification; they are always instantiated. The calculation of the utility function requires a quantification of the risk incurred at the utility vertices. For each utility vertex we specify the risk of the individual vertex being in one of its states. The following utility vectors are specified: U(CP) = (0, 1), U(SE) = (0, 1), U(DC) = (0, 1).
5.1.1. Strategy generation
We argued that planning with decision networks is dynamic: only the next best action(s) is (are) planned and all available information about the decision problem is taken into account. In this section we discuss the generation of whole strategies in one iteration. There is no objection in principle to generating strategies in this manner; we notice, however, that in decision problems of realistic size it is not practical to do so. The reason we used strategy generation here is to enable the comparison. During the generation of


Figure 4. The single test/single treat decision network. The dashed arrow indicates the time lapse between the decision to test and the availability of the test result.

During the generation of whole strategies in one iteration, no information is fed into the decision network. Instead, the expected values are used to determine the preferred strategy, as in the decision tree approach.

5.2. The single test/single treat decision tree

Consider the decision tree depicted in Figure 3. The rectangular node labeled X represents the decision between the three strategies. The rectangular node labeled TR represents the decision to treat. The circular chance nodes SE, CP, D and DC represent side effects, complications, disease and disease continued, respectively.

The links that sprout from a decision represent the actions available at that particular decision point. The links that sprout from a chance node are quantified with the probability of the associated state. In cases where a node representing a test result precedes the chance node in the strategy, the probability of this chance node is updated with the effect the test result bears on the associated chance event. The calculations involved in this updating process are carried out separately from the decision tree. In the other cases, the probabilities are directly accessible from Table 1. Consider Figure 3: the link labeled (a) is quantified with P(¬d). The link labeled (b) is quantified with P(pos | ¬d)·P(¬d)/P(pos), that is, P(¬d) updated in the light of a positive test result. Likewise, the link labeled (c) is quantified with P(¬d) updated in the light of a negative test result. The dots (. . .) indicate that the right-hand side of the decision tree at the alternative branch is repeated.

The desirability of the outcome of a strategy is quantified by utilities. In the decision tree depicted in Figure 3, the outcome of, for example, the treat strategy is side effects, or disease continued, or not disease continued. Both side effects and disease continued have utility 0, expressing that they represent the worst outcome. The best outcome, no disease continued, is quantified with utility 1.

5.2.1. Computation of the optimal strategy

With all the information displayed in the decision tree of Figure 3, we can calculate the utility function for a strategy. This calculation process is called averaging out and folding back. The calculations start at the leaves of the tree. Working back toward the root, two operations are performed.

Averaging out is the calculation of the expected utility of a chance node. This is the vector product of the probabilities of the links sprouting from that chance node and

the numerical expression of the associated outcome. This expected utility represents the 'outcome' for the nodes to its left and is stored at the chance node, so that this process can be repeated recursively. For example, in the tree of Figure 3, the expected utility of the uppermost DC chance node is calculated as:

    EU(DC) = 0 · P(dc | d ∧ tr) + 1 · P(¬dc | d ∧ tr)

Folding back prunes the inferior choices. For every decision vertex, the action which maximizes the expected utility is chosen. This maximal expected utility is stored at the decision variable. All other actions available at this decision point, together with the associated subtrees, are pruned from the tree.
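The sketch below illustrates this averaging-out-and-folding-back computation on a toy tree; the tree structure, the probabilities and the utilities are illustrative placeholders rather than the tree of Figure 3.

```python
# Minimal sketch of "averaging out and folding back" on a decision tree.
# Chance nodes average the utilities of their children weighted by probability;
# decision nodes keep the action with maximal expected utility (pruning the rest).

def expected_utility(node):
    kind = node["kind"]
    if kind == "leaf":
        return node["utility"], None
    if kind == "chance":
        # Averaging out: probability-weighted sum over outcomes.
        eu = sum(p * expected_utility(child)[0] for p, child in node["branches"])
        return eu, None
    if kind == "decision":
        # Folding back: choose the action with maximal expected utility.
        best_action, best_eu = None, float("-inf")
        for action, child in node["actions"]:
            eu, _ = expected_utility(child)
            if eu > best_eu:
                best_action, best_eu = action, eu
        return best_eu, best_action
    raise ValueError(kind)

# Illustrative two-action problem: 'treat' leads to a chance node, 'wait' to a leaf.
tree = {"kind": "decision", "actions": [
    ("treat", {"kind": "chance", "branches": [
        (0.7, {"kind": "leaf", "utility": 1.0}),     # e.g. no disease continued
        (0.3, {"kind": "leaf", "utility": 0.0})]}),  # e.g. disease continued
    ("wait", {"kind": "leaf", "utility": 0.6})]}

print(expected_utility(tree))   # -> (0.7, 'treat')
```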

5.3. Analysis

In the sensitivity analyses, the numerical assessments of the probabilities shown in Table 1 were used. We notice that all variables are binary and that R = pos is abbreviated pos and likewise R = neg is abbreviated neg. In addition, we notice that the correctness and precision of the assumed probabilities do not influence the value of the comparison.

Table 1
The second column contains the probabilities assumed in the analyses. The third column contains the formulas.

Probability of                                        Assumed value (%)   Formula
no disease continued, with disease, treated                  67           P(¬dc | d ∧ tr)
no disease continued, with disease, untreated                29           P(¬dc | d ∧ ¬tr)
no disease continued, with non-disease, untreated            82           P(¬dc | ¬d ∧ ¬tr)
no disease continued, with non-disease, treated              82           P(¬dc | ¬d ∧ tr)
side effect from the drug                                     5           P(se | tr)
complications from the test                                   1           P(cp | t)
positive test result in disease (sensitivity)                90           P(pos | d)
negative test result in non-disease (specificity)            98           P(neg | ¬d)

Displaying the utility function per strategy for varying probabilities of disease yields the graphs shown in Figure 5 and Figure 6.
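A hedged sketch of how such a sweep can be computed is shown below for the treat strategy alone, using the Table 1 values and the decision-tree expression for F(treat) derived in Section 5.4; the wait and test strategies would be swept analogously.

```python
# Sweep the prevalence P(d) and evaluate the decision-tree utility of the 'treat'
# strategy, F(treat) = P(~se|tr) * ( P(~dc|d,tr)*P(d) + P(~dc|~d,tr)*P(~d) ),
# with the probabilities assumed in Table 1.
p_no_dc_d_tr = 0.67    # P(~dc | d, tr)
p_no_dc_nd_tr = 0.82   # P(~dc | ~d, tr)
p_se_tr = 0.05         # P(se | tr)

def f_treat(p_d):
    return (1.0 - p_se_tr) * (p_no_dc_d_tr * p_d + p_no_dc_nd_tr * (1.0 - p_d))

for p_d in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    print(f"P(d)={p_d:.1f}  F(treat)={f_treat(p_d):.3f}")
```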

5.4. Comparison

Visual inspection of the graphs shown in Figure 5 and Figure 6 reveals two major differences: a scaling factor between the y-axes, and a bend in the test strategy for the decision network. First, we will explain the bend. As mentioned earlier, in the decision network approach no fixed set of possible strategies needs to be specified.


Figure 5. Decision tree: for a clinical likelihood of disease < 0.03 the best strategy is to wait, for a clinical likelihood of disease in the range [0.03, 0.44] test is the optimal strategy, and for a clinical likelihood of disease > 0.44 therapy without testing is optimal.

The strategy to wait after a negative test result can, therefore, change into the strategy to treat after observing a negative test result. This occurs after the bend in the test strategy. After the bend, treatment will always be justified, and the line continues parallel to the treat strategy. As long as the bend occurs after the intersection with the treat line, the optimal strategy will not change. It can easily be proved that this is the case, provided that complications from the test do not occur spontaneously.

To explain the scaling, recall that in the decision network approach the utility function is calculated as the sum of the values of the utility vertices. Consider, for example, the treat strategy¹: V(treat) = P(¬se | tr) + P(¬cp | ¬t) + V^treat(DC). The utility function of the treat strategy in the decision tree approach is calculated as the probability of the best outcome, that is, F(treat) = P(¬se | tr) · (P(¬dc | d ∧ tr) · P(d) + P(¬dc | ¬d ∧ tr) · P(¬d)) (see the tree of Figure 3). The second factor of this expression equals V^treat(DC). In our example, both P(¬se | tr) and P(¬cp | t) are close to 1. The addition of P(¬se | tr) and P(¬cp | ¬t), therefore, causes the shift of approximately 2.

The minor differences between the intervals where the test strategy is optimal ([0.04, 0.47] for the decision network and [0.03, 0.44] for the decision tree) can be explained by the multiplication factor P(¬se | tr), and the multiplication factors P(¬cp | t), P(¬se | tr) and P(¬se | ¬tr) in the utility function of the test strategy.

¹The differences and similarities between the other strategies can be explained analogously.



Figure 6. Decision network: for a clinical likelihood of disease < 0.04 the best strategy is to wait, for a clinical likelihood of disease in the range [0.04, 0.47] test is the optimal strategy, and for a clinical likelihood of disease > 0.47 therapy without testing is optimal. For a clinical likelihood of 0.59 the 'test' strategy shows a bend. The line continues parallel to the 'treat' strategy with a shift of 0.01 due to the loss of utility of performing the test.

Generally, it can be concluded that the differences in optimal strategy are explained by the difference in quantification of the utility variables and outcomes in the decision network and the decision tree, respectively, combined with the difference in the calculation of the utility function. In the decision network approach, the contributions of the utility variables to the utility function are accounted for independently. The decision network is used to calculate the updated probability distributions of the utility variables needed to determine the utility function. In the decision tree formalism it is not possible to calculate these individual updated probability distributions. Instead, the utility function is calculated as the probability of the outcomes of a whole strategy. In these calculations, the contributions of risk-bearing situations are dependent.

6. DECISION NETWORK POTENTIALS

In this section we shed some light on the fundamental differences between decision tree based approaches and decision networks and discuss the merits of these differences.

6.1. Expected value - observed value

In decision networks, the planning of actions proceeds as an iterative process. Information that has become available in one iteration is used in the next iteration. In doing so, a reduction of the uncertainty about the variable(s) we are concerned with is achieved. In the example, disease continued is the variable of concern. The evaluation in the next iteration is based on the joint probability of the variables discerned, conditional on


the information gathered in the previous iterations, such as, for instance, test results or observed symptoms. Further, the dependency of utility functions in successive iterations is captured solely in the updated conditional probability distribution. This means that each iteration can be viewed as a separate decision problem of lower complexity. Summarizing, we can state that complex decision problems characterized by multiple tests, multiple treatments and multiple hypotheses are partitioned into separate, smaller decision problems. These smaller decision problems are both easier to comprehend and dedicated to the specific decision problem under consideration.

6.2. Switch between populations

The optimal strategy changes with the assumed population. A population is characterized by the probability distribution of the hypothesis variable(s). A change in this probability distribution, as a result of, for instance, the observation of a symptom indicative of the hypothesis, is accounted for immediately. The impact of the observation is propagated through the decision network, and the next iteration in the planning process will be based on the updated probability distribution, which provides for a flexible planning process. Speaking in terms of the three strategies of the example, incoming data could possibly cause a switch from one strategy to another.

The planning process can be viewed as a walk through the decision space spanned by the hypothesis variables plus the observable variables, without the need to determine the complete space beforehand. This walk is determined by the varying probability distributions of the hypothesis variables.

6.3. Individual weighting of utility variables

The decision network representation allows the assessment of utilities as a function of a selected group of variables, possibly containing one variable only. These groups are selected such that the assessment of utility vectors appears most natural to the human expert. Although the utilities associated with separate groups are assessed independently, we notice that they must be calibrated in such a way that equally serious situations are quantified equally.

We emphasize that the specification of local utilities provides for a more detailed quantification of the seriousness of consequences. In addition, the problem of utility assessment is made easier because the focus is local, and the representation (and construction) of domain knowledge, i.e. the causal relations, is not mixed with the representation of strategies.

7. CONCLUSION

The decision network formalism possesses advantages over decision trees, influence diagrams and the hybrid method. These are the parsimonious representation and the flexible, case-specific planning process.

The main difference in representation that allows for these advantages is the omission of a specification of the permissible sequences of actions. In addition, during decision network construction, attention remains focused on the issues of domain modeling and is not obscured by constructing strategies, whereas during evaluation, the logical structure of

the network remains untouched and can therefore serve other purposes, such as explanation or causal reasoning. The representation of a decision variable as a root vertex, reflecting the fact that the value of a decision does not probabilistically depend on other variables, prevents problems inherent to the propagation of the influence of actions to predecessors of the decision vertex. The conservative planning of actions, up to and including the first information-gathering action, makes it possible to account for new information as soon as it becomes available, thereby tuning the planning to the specific case at hand. This flexibility, however, demands increased computational complexity, because each iteration requires a separate evaluation.



Qualitative recognition using Bayesian reasoning*

Jianming Liang†, Henrik I. Christensen & Finn V. Jensen
Institute of Electronic Systems, Aalborg University, Fr. Bajers Vej 7, DK-9220 Aalborg East

1. INTRODUCTION

Object recognition has been an active area of research for more than three decades. Most of the research has been based on recovery of structural descriptions with associated metric features. The obtained description is typically matched against models of a-priori defined objects; a good description of such an approach has been provided by Grimson [6]. For generic object recognition the above-mentioned approach has several problems. Traditionally the approach is based on low-level features, which results in an indexing problem, as individual primitives have limited indexing power. It is necessary to adopt 'abstract' primitives to circumvent this problem. Use of explicit geometry is another problem, as it does not necessarily capture the overall geometric structure.

To enable categorisation of objects, Biederman has proposed a set of volumetric primitives that may enable recognition of man-made objects [2]. The proposed primitives are all variations of generalised cylinders where cross-section, axis, symmetry and sweeping rule are assigned qualitative characteristics to provide a total of 36 primitives, termed GEONS (GEometric IONS).

Several researchers have tried to use these ideas from psychology as a basis for construction of recognition systems [1,3,11]. A recent review of these methods is provided in [4]. In the systems mentioned above it is characteristic that the reasoning to a large extent is deterministic, in the sense that a predefined strategy is used in the recognition of specific objects. This implies that there is little or no adaptivity based on contextual information. The OPTICA system [3] does include probabilistic information which can guide processing, but no formal propagation mechanism is involved; consequently the strategies used do not have guaranteed properties in terms of optimality, convergence, adequacy, etc.

To investigate the utility of using well-established reasoning methods for control of the recognition process we have chosen to adopt the Bayesian formalism. Further, to enable a comparison to an established system, the OPTICA system, which includes probabilistic knowledge, is used as a framework for this investigation.

*This research has been sponsored by the Danish Research Councils as part of the PIFT programme, and by the CEC through the ESPRIT Basic Research Project EP-7108-VAP-II "Vision as Process".
†Permanent Address: North China Institute of Computing Technology, P. O. Box 619 Ext. 70, Beijing 100083, P. R. China, where his work has been supported by the national 8:5 program.

In section 2 the OPTICA system and its use for recognition is outlined, while section 3 outlines the Bayesian formalism. Having the basis available, section 4 presents the Bayesian network used as the basis for control of the interpretation process. In section 5 the algorithm used for partitioning of image information into clusters representing GEONS is presented. Finally, section 6 outlines a simple experiment in which the OPTICA system is compared to a system which uses Bayesian networks for control.

2. GEON BASED RECOGNITION

In the OPTICA system 10 of the 36 GEONS have been modelled. The primitives are shown in figure 1. In OPTICA, primitives are modelled at four different levels: volumetric primitives, aspects, faces and boundary groups. Aspects represent views from which the primitives have the same structural appearance; these are in turn broken down into a set of qualitatively different faces, represented by image regions. Each face is then described as a collection of contour segments denoted "boundary groups". In the OPTICA system the relation between 'items' at the different levels is encoded as conditional probabilities (two separate graphs encode downward and upward probabilities, respectively). The probabilities have been determined empirically. The resulting representation is termed an "aspect hierarchy".
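A minimal sketch of how such an aspect hierarchy might be stored is shown below; the nesting (primitive → aspect → face) follows the description above, but the entries and probabilities are invented placeholders rather than OPTICA's empirically determined tables.

```python
# Illustrative aspect hierarchy: downward conditional probabilities stored per level.
# P(aspect | primitive) and P(face | aspect); all names and numbers are placeholders.
p_aspect_given_primitive = {
    "block":    {"three-face": 0.6, "two-face": 0.3, "one-face": 0.1},
    "cylinder": {"body+ellipse": 0.7, "ellipse-only": 0.3},
}
p_face_given_aspect = {
    "three-face":   {"parallelogram": 1.0},
    "two-face":     {"parallelogram": 1.0},
    "one-face":     {"parallelogram": 1.0},
    "body+ellipse": {"cylinder-side": 0.5, "ellipse": 0.5},
    "ellipse-only": {"ellipse": 1.0},
}

def downward(primitive):
    """Expected face distribution implied by a primitive, via its aspects."""
    faces = {}
    for aspect, pa in p_aspect_given_primitive[primitive].items():
        for face, pf in p_face_given_aspect[aspect].items():
            faces[face] = faces.get(face, 0.0) + pa * pf
    return faces

print(downward("cylinder"))  # e.g. {'cylinder-side': 0.35, 'ellipse': 0.65}
```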

1. Block   2. Truncated Pyramid   3. Pyramid   4. Bent Block   5. Cylinder   6. Truncated Cone   7. Cone   8. Truncated Ellipsoid   9. Ellipsoid   10. Bent Cylinder

Figure 1. The 10 primitives modelled in OPTICA [3]

The processing in OPTICA is structured so that preprocessing performs a region-based segmentation (in [5] it has been shown that regions are superior to boundary groups as indexing primitives). After regions have been extracted, a Region Topology Graph (RTG), which encodes adjacency relations, is generated. Regions are then classified as belonging to a specific 'face' category, and the RTG is changed into a Face Topology Graph (FTG). The FTG is used as a basis for recognition and verification.

Recognition can proceed either as a top-down [5] or a bottom-up process [3]. In the top-down process, which is of particular interest here, recognition is carried out through use of a depth-first strategy down through the aspect hierarchy, in which the maximum probability at a given level is used to guide the selection of features at the next level, i.e., only the node with maximum probability is considered. After a hypothesis has been generated, through propagation of evidence along the selected nodes, the hypothesised primitive is verified through an analysis of the boundary primitives in combination with relational information between faces.

3. BAYESIAN NETWORKS

The language of Bayesian networks is used for modelling domains with inherent uncertainty in their impact structure (see e.g. [12]). A (discrete) Bayesian network is constructed over a universe of variables, each having a finite set of states. The universe is organised as a directed acyclic graph. The links in the graph model impact from one variable to another. The strength of the impacts is modelled through conditional probabilities.

As an example, suppose that there are two possible objects in a scene, namely blocks and cylinders. A block can be seen under three different aspects, and a cylinder also has three aspects. Note that they have the parallelogram aspect in common. Then a Bayesian model would be to have a variable Object with states block and cylinder, and a variable Aspect with five possible aspects as states. The type of object has an impact on the type of aspects, so a model for that situation would be as in figure 2.


Figure 2. A simple Bayesian network for the object-aspect relation.

The quantitative part of the model consists of two probability tables: the prior probabilities P(Object) for the distribution of blocks and cylinders, and the conditional probabilities P(Aspect | Object) giving the distribution of the aspects given the object type (a 2 × 5 table). To model the breaking down of aspects into 2D faces, the Bayesian network is extended with variables for faces (parallelogram, ellipses, and the side of a cylinder), and also with variables describing adjacency relations among the faces (see Figure 3).
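As a hedged numerical illustration of these two tables, the snippet below fills them with invented values and inverts P(Aspect | Object) with Bayes' rule after observing the shared parallelogram aspect.

```python
import numpy as np

# Prior over objects and the 2 x 5 conditional table P(aspect | object).
# All numbers are illustrative placeholders, not measured values.
objects = ["block", "cylinder"]
aspects = ["3-faces", "2-faces", "parallelogram", "ellipse+side", "ellipse"]

p_object = np.array([0.5, 0.5])
p_aspect_given_object = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # block: three aspects
    [0.0, 0.0, 0.3, 0.5, 0.2],   # cylinder: three aspects, parallelogram shared
])

# Marginal P(aspect) and posterior P(object | aspect = parallelogram).
p_aspect = p_object @ p_aspect_given_object
k = aspects.index("parallelogram")
posterior = p_object * p_aspect_given_object[:, k] / p_aspect[k]
print(dict(zip(objects, posterior)))   # -> {'block': 0.5, 'cylinder': 0.5}
```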


Figure 3. A model for blocks and cylinders broken down into faces.

The extra conditional probabilities to specify are P(Face_i | Aspect) and P(Adj | Face_i, Face_j, Aspect).

Various algorithms have been constructed for information processing in Bayesian networks (examples may be found in [13,8,14,9]). The basic information to obtain from a Bayesian network is probability distributions, P(A), for the states of each variable A. Initially, prior probabilities are provided. Evidence, e, can be entered into a network, and after propagation of e the new probability distributions P(A | e) are available. For example, evidence on a parallelogram can be entered into the network in Figure 3 by setting Face1-1 to the state parallelogram. The probability updating will then yield new probabilities for the other variables. This means that not only do we get new beliefs in Object and Aspect, but we also get new expectations of what else to find in the scene, namely new probabilities for the nodes Face and Adj.

Our philosophy behind using Bayesian networks in scene interpretation is to use the probabilities to control focus of attention. Consider the following situation. An image has by some low-level procedure been segmented into a low-level topology graph, which is a description of the image in terms of low-level primitives (faces, line segments, etc.) together with adjacency. The task now is to cluster the low-level primitives to come up with a coherent description of the image in terms of high-level primitives (e.g., GEONS). In Figure 4 a simple example is shown.

To this end a Bayesian network is used. It contains a variable with the possible high-level primitives as states; other variables contain the low-level primitives, and further variables may describe intermediate concepts like part-of or aspects. When information about a low-level primitive f is entered into the Bayesian network, the probability updating will yield new probabilities for the high-level primitives, but it will also create expectations for what kind of low-level primitives to find adjacent to f.


Figure 4. A topology graph for a scene with a block and a cylinder placed next to each other.

These expectations are now used in the topology graph for choosing a low-level primitive to cluster with f. It is entered into the Bayesian network as evidence, and the process continues until some criterion indicates that an optimal cluster has been found. The cluster is removed from the topology graph and the process is restarted on the reduced graph.

4. BAYESIAN NETWORK FOR RECOGNITION

4.1. Levels & Variables

The network has 5 levels of variables:

• Primitives: We use the ten GEONS from OPTICA (see figure 1). This level consists of one variable with these ten GEONS as states.

• Aspects: The ten GEONS may be viewed under various spatial orientations. This forms altogether forty different aspects. This level consists of one variable with the forty aspects as states. Note that two different GEONS may have aspects in common.

• Faces: Each aspect is decomposed into a set of faces. There are 18 types of faces, but some aspects may contain several faces of the same type. At this level there are 29 face variables, each with two states, "existent" and "nonexistent".

• Relations: The variables at this level describe the relations between faces, like adjacent or inside. The variables are attached pairwise to variables from the face level. In total there are 46 variables.

• Constraints: The variables at this level are introduced to take care of non-directed connections between variables at the relations level. They are of a technical nature and we shall not discuss them further in this paper.

The links in the network go from a higher level to a lower one (see figure 5).

4.2. Quantitative Relations

Quantitative relations in the network include two parts:

• Prior Probabilities: The ten primitives are assigned a prior distribution according to an estimate of their occurrence in the world. They may be considered a kind of evidence entered into the system in actual applications.

Figure 5. The Bayesian network which encodes the relation between features.


• Conditional Probabilities: The conditional probabilities of aspects given each primitive are adopted from the aspect hierarchy reported in [5]. Whenever a face is a member of a given aspect, the state "existent" in the face variable is set to 1, otherwise it is 0. If two faces are adjacent in an aspect, the "adjacent" state in their relation variable is established.

5. A CONTROL STRUCTURE FOR RECOGNITION

In section 2 it was mentioned that the final stage of the OPTICA preprocessing is a Face Topology Graph (FTG). For the recognition of objects it is necessary to partition the FTG into clusters corresponding to collections of faces representing objects. Several authors have addressed the problem of optimal partitioning of information in the context of reasoning under uncertainty [11,10]. The partitioning of the FTG may be considered such a problem for each of the clusters, and consequently established methods may be used.

In the clustering of faces, coherence may be used as a criterion function. Coherence is here interpreted as:

    P(f | E) > P(f),    (1)

where f denotes a face, while E represents faces/evidence already entered into a cluster. I.e., the probability of a face f should increase as evidence is entered, if it belongs to the cluster. This implies that P(f | E) − P(f) might be used as a measure for the selection of faces to be entered into a cluster. This measure does, however, not take the actual value of P(f) into account. Consequently a transformation which favours high P(f) is preferable. We have chosen to use the following value function for ranking the faces to be included in a cluster:

    V(f) = log [ (1 − P(f)) / (1 − P(f | E)) ]    (2)

For the selection of a face to be entered into a cluster, the function max_f V(f) is evaluated. Given the adjacency information encoded in the FTG it is possible to determine the set of faces to be evaluated. In Figure 6 a formal description of the algorithm is given. Here T1 and T2 are thresholds that ensure that the inclusion of faces is not continued if the coherence is low or if a high level of belief for a primitive has already been obtained.

For the selection of an initial face for clustering, the following criteria are used in the order listed:

1. minimal links;
2. larger area;
3. higher confidence in face type.

In the selection of faces to be added to a cluster, the criteria listed below are used.


While the FTG contains unmarked nodes do
    Select a face f_1 as the basis for a cluster
    Enter f_1 into the network and propagate
    Generate cluster C_0 = {f_1}
    Form the set T of adjacent faces
    While max_primitives P(primitive) < T_1 and T ≠ ∅ do
        f_i = argmax_{f ∈ T} V(f)
        If V(f_i) > T_2 do
            Enter f_i into the cluster (C_n = C_{n-1} ∪ {f_i}) and propagate
            Form the set T of adjacent faces
        done
    done
    Mark the nodes in C as processed
done

Figure 6. Algorithm for grouping of faces into clusters

1. higher value from the value function;
2. maximal links with the faces already included;
3. minimal links;
4. largest area;
5. higher confidence in face type.

In consequence, the face with the best possible match is included first, and when several faces have equal value, adjacency with the cluster is optimised.
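A compact, hedged sketch of this greedy selection is given below. The value function follows equation (2); the `posterior` argument stands in for the Bayesian-network propagation (not implemented here), and the full tie-breaking criteria above are reduced to the value function alone for brevity.

```python
import math

def value(p_f, p_f_given_e):
    """Ranking value of a face, V(f) = log[(1 - P(f)) / (1 - P(f|E))] (eq. 2).
    Assumes P(f|E) < 1."""
    return math.log((1.0 - p_f) / (1.0 - p_f_given_e))

def grow_cluster(seed, adjacency, prior, posterior, t_belief=0.9, t_value=0.0):
    """Greedily grow a cluster from `seed` until the best primitive belief exceeds
    t_belief or no adjacent face clears the value threshold t_value.

    posterior(evidence) is expected to return
    (max primitive belief, {face: P(face | evidence)}) after propagation."""
    cluster, evidence = [seed], {seed}
    while True:
        belief, p_face_given_e = posterior(evidence)
        if belief >= t_belief:
            break
        frontier = {g for f in cluster for g in adjacency.get(f, ()) if g not in evidence}
        if not frontier:
            break
        best = max(frontier, key=lambda g: value(prior[g], p_face_given_e[g]))
        if value(prior[best], p_face_given_e[best]) <= t_value:
            break
        cluster.append(best)
        evidence.add(best)
    return cluster
```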

6. EXPERIMENTS

Here we report only two experiments. With reference to Figure 1, the following prior distribution for the primitives is used: (0.2, 0.08, 0.15, 0.02, 0.2, 0.08, 0.15, 0.06, 0.04, 0.02).

6.1. Experiment 1

The image in Figure 7.a has been processed by the OPTICA low-level preprocessor, yielding the regions shown in Figure 7.b and the face topology graph in Figure 8. The numbers in a node indicate its type and the confidence in the type. The number 9 corresponds to a parallelogram face, 14 corresponds to the side of a truncated cone, while 1 is an ellipse. Due to the shadow on the block face close to the truncated cone, it has been divided into the three segments 7, 8 and 9.

Figure 7. Image used for experiment 1: (a) original image, (b) regions extracted.

Figure 8. The Face Topology Graph.

Figure 9. Image used for experiment 2: (a) original image, (b) regions extracted.

The procedure is initiated at the node with minimal links and larger area (node 1), and then includes the adjacent node with minimal links (node 2), given that the neighbors (nodes 2, 3) of node 1 have the same value from the value function. In the next stage, the procedure clusters (1, 2, 3) as a block. Next, it starts in node 11 and clusters (11, 5) as a truncated cone. Finally, it starts in node 4, clusters (4, 6, 9) as another block, and leaves nodes 7, 8, 10 uninterpreted.

For comparison, the same face topology graph has been processed by the original OPTICA recognition routines. In this method all possible interpretations are provided. To enable comparison, each face is only allowed to participate in a single interpretation, and the interpretations are ranked according to confidence. Using this method the OPTICA system will group faces (1, 2, 3) into a cluster representing a block. The cluster (4, 6) is also categorised as a block, while the cluster (5, 11) represents a truncated cone. The remaining two faces (resulting from shadows) are not clustered, but simply left as individual faces that cannot be classified.

6.2. Experiment 2

For the image in Figure 9.a, the preprocessor generates the regions in Figure 9.b, besides two regions for the background (e.g. regions 0 and 7), and the face topology graph in Figure 10. Note that the faces for the cylinder have been merged into the background, and that regions 9, 12 and 16 cannot be classified as any type of face.

The cluster finding procedure selects a node as the starting point for the first cluster (4, 5), which is classified as a truncated pyramid. Then it starts from node 1 and groups (1, 2, 3) as a block. Due to the complicated relation between those two blocks, not all the possible faces are extracted correctly. The procedure considers node 5 as a cluster. Next, it is initiated at node 14, and includes node 15 and node 13 as a cluster


Figure 10. The Face Topology Graph.

(another block). Finally, the procedure clusters nodes 10 and 11 as a cluster and leaves node 6 uninterpreted.

For Figure 9 the original OPTICA recognition routine has also been applied. From the face topology graph the following clusters are extracted: (1, 3), (13, 15) and (10, 11). All three clusters are categorized as representing blocks. For the cluster (10, 11) the result is the same as mentioned above. For the two other clusters only two faces are used in the classification, as inclusion of a third face would decrease the confidence factor considerably (below an a-priori defined threshold). It should be noted that faces 4 and 5 are not categorized by the original routine, which is mainly due to the fact that once clusters of faces have been determined they are subjected to an aspect verification routine which projects boundary information back into the image for verification of boundary relations between faces. In this stage the cluster (4, 5) is rejected based on inadequate information to support the truncated pyramid hypothesis.

7. CONCLUSION

Recently, considerable interest in object recognition has been aimed at the use of qualitative geometry, such as GEONS, where graph matching is critical to success. At an intermediate stage in the processing a topology graph is created, which encodes the set of faces and their adjacency relations. In earlier work on GEON based object recognition, the topology graph has been partitioned through use of a depth-first search strategy. In this paper Bayesian reasoning is introduced for the graph partitioning. A Bayesian model encodes probabilities about the visibility of different configurations of faces, and their appearance. In the partitioning, faces are introduced into clusters according to their discriminatory value. Once faces are introduced into a cluster, the corresponding face label is entered into the Bayesian network as evidence, and the model is updated. In consequence each cluster becomes optimal, but given the myopic selection criteria, global optimality is not guaranteed.

To demonstrate the utility of the described method, it has been implemented with the OPTICA system. Processing results for two natural

images demonstrate that the system is capable of finding the largest optimal cliques. For comparison, the results obtained with the original OPTICA system are also shown. The results indicate that the partitioning based on Bayesian reasoning is superior to the original method. It should, however, be noted that several problems in terms of handling of poor segmentation, shadows etc. still remain before robust recognition is possible.

8. Acknowledgement

The authors wish to thank Dr. Sven Dickinson for the permission to use the OPTICA software for the described research and for many fruitful discussions. HUGIN is a trademark of HUGIN International Inc., Denmark.

REFERENCES
1. R. Bergevin & M.D. Levine, Part Decomposition of Objects from Single View Line Drawings, CVGIP: Image Understanding, Vol. 55, No. 1, pp. 73-83.
2. I. Biederman, Matching Image Edges to Object Memory, In: Proc. 1st ICCV, London, June 1987, pp. 384-392.
3. S. Dickinson, A. Pentland & A. Rosenfeld, 3-D Shape Recovery using Distributed Aspect Matching, In: IEEE Trans. on PAMI, Vol. 14, No. 2, February 1992.
4. S. Dickinson, R. Bergevin, I. Biederman, J.O. Eklundh, R. Munck-Fairwood & A. Pentland, The Use of Geons for Generic Object Recognition, In: Proc. 11th IJCAI, Chambery, France, August 1993.
5. S. Dickinson, H.I. Christensen, J. Tsotsos and G. Olofsson, Active Object Recognition Integrating Attention and Viewpoint Control, ECCV-94, Stockholm, May 1994 (Submitted).
6. E. Grimson, Object Recognition by Computer, MIT Press, Boston, MA, 1990.
7. F.V. Jensen, S.L. Lauritzen & K.G. Olesen, Bayesian Updating in Causal Probabilistic Networks by Local Computations, Computational Statistics Quarterly, Vol. 4, 1990, pp. 269-282.
8. S.L. Lauritzen & D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their application to expert systems, Jour. Royal Statistical Society, Vol. B 50, 1988, pp. 157-224.
9. F.V. Jensen, K.G. Olesen and S.K. Andersen, An algebra of Bayesian belief universes for knowledge based systems, Networks, 20(5):637-659, 1990.
10. F.V. Jensen, H.I. Christensen & J. Nielsen, Bayesian methods for interpretation and control in multi-agent systems, In: SPIE Appl. of Artificial Intelligence X, Vol. 1708, Orlando, Fla., 1992, pp. 536-548.
11. R. Munck-Fairwood, Recognition of Generic Components Using Logic-Program Relations of Image Contours, Image & Vision Computing, Vol. 9, No. 2, 1991, pp. 113-122.
12. J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, 1988.
13. R.D. Shachter, Evaluation of Influence Diagrams, Operations Research, Vol. 34, No. 6, 1986, pp. 871-882.
14. G. Shafer & P.P. Shenoy, Probability propagation, Annals of Mathematics and Artificial Intelligence, 1990.



Learning characteristic rules in a target language

Raj Bhatnagar
Computer Science Department, University of Cincinnati, Cincinnati, OH 45221
email: [email protected]

In this paper we present a decision-tree type of construction algorithm for learning characteristic rules in terms of the preferred language of a learning situation. The preferred target language for learning is specified in terms of a set of relations, each of which represents a domain-specific concept. We present an optimality criterion for learning the characteristic rules which is different from that for learning the discrimination rules.

1. INTRODUCTION

One important aspect of any learning system is the input and output languages that it employs. At one extreme, learning systems use databases of attribute-value pairs as input, use conditional probability functions as the target language, and learn from the databases the Bayesian network representations [2]. The same databases can also be used to construct decision trees and learn the rules for discriminating among the various concept classes. The information-theoretic algorithms for learning the classification rules seek a partitioning of the dataset that results in minimum average conditional entropy. The boolean expressions of attribute-value tests in decision trees that help discriminate among the partitions constitute the learned discrimination rules. When the value of the target attribute is not uniform in a dataset partition we may write the learned rules as probabilistic rules of the form If (Y=y and Z=z) Then T=t with probability p. Learning systems based on the above ideas from information theory have been presented and discussed, among many other works, in [1, 3, 5]. At another extreme, the practitioners of formal-logic based reasoning systems seek object and structure descriptions and classification rules expressed in first-order logic. These rules are made up of relations (and predicates) corresponding to the notions prevalent in the domain. These domain-specific notions may be viewed as the preferred language of the domain.

Our objective in this paper is to learn the discrimination rules and also the characteristic rules for the most discriminable concept classes in terms of the specified preferred language of the domain. The learning system (FOIL) presented by Quinlan in [4] uses a database as input and various specified relations as the learning (output) language to learn complex discrimination rules. We seek to learn the characteristic rules of the optimally discriminable classes of a dataset. We seek our objectives by minimizing the cross-entropy measure for the partitions and the relations of the target language instead of minimizing the average conditional entropy, as is done in many other tree-construction algorithms.

In the following sections we first present an example database and then our algorithms for learning the discrimination and the characteristic rules in terms of the specified target language. We then discuss some differences in the algorithms for learning the description and the characteristic rules for a database. We also present and discuss a measure which we minimize for learning the optimal set of description rules from a database.

2. AN EXAMPLE

We consider an example database of 9-bit binary numbers and a tenth, target bit. A few records from this database are as follows:

[Table of sample records: each record lists a 9-bit number together with its target bit.]

We refer to the non-target attribute bits as b0 through b8, from the least significant to the most significant bit position. The tree-construction learning algorithms [1, 3, 4] would seek to split this database into a number of partitions in such a way that the following measure of average conditional entropy is minimized:

    Σ_b (n_b / n_t) × ( Σ_c −(n_bc / n_b) log (n_bc / n_b) )    (1)

where n_b is the number of tuples in branch b, n_t is the total number of tuples in all branches, c ranges over the possible values (classes) the target attribute can take, and n_bc is the number of tuples of class c in branch b. Considering the tenth bit as the target concept, we seek the attribute-value test which best partitions the database, that is, which reduces the average entropy of the partitions by the largest amount. One can further partition each of these partitions by selecting similar tests, until the target attribute for all the tuples in a branch has the same value. For each resulting partition we determine the discrimination rule by conjoining all the attribute-value tests on the path from the root of the tree to the leaf at which it resides.
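A direct transcription of expression (1) is sketched below (base-2 logarithm assumed; the paper does not fix the base); each branch of the partition is represented simply as a list of target-class labels.

```python
import math
from collections import Counter

def average_conditional_entropy(branches):
    """Expression (1): sum over branches b of (n_b / n_t) * H(target | branch b)."""
    n_t = sum(len(b) for b in branches)
    total = 0.0
    for b in branches:
        n_b = len(b)
        counts = Counter(b)
        h = -sum((n_bc / n_b) * math.log2(n_bc / n_b) for n_bc in counts.values())
        total += (n_b / n_t) * h
    return total

# A perfectly informative split gives pure branches and zero average entropy.
print(average_conditional_entropy([[1, 1, 1], [0, 0]]))      # 0.0
print(average_conditional_entropy([[1, 0, 1, 0], [1, 0]]))   # 1.0
```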

3. TARGET LANGUAGE

The discrimination rules learnt above are in the form of conjunctions of attribute-value tests. There are many situations in which we want to learn dependencies among some domain-specific concepts which are not the same as the attributes in terms of which the database cases are observed and recorded. These concepts may, however, be related to the recorded attributes by some specific relations. For example, consider a database of final board situations of the Tic-Tac-Toe game in which each record contains the marks placed in each of the nine places and a target attribute showing whether the x-player won, lost, or drew the game. We may partition the database and learn the discrimination rules for the target attribute in terms of conjunctions of place-mark pairs. However, one may want to learn the dependencies among the concepts lateral adjacency of two x's, diagonal adjacency of two x's, and the winning of the game by the x-player. The two adjacency concepts correspond to notions prevalent in the domain, but the database is not recorded in terms of them. The adjacency concepts, however, can be evaluated for each recorded case of the database. Another example is a marketing database, each record of which may consist of various attributes pertaining to an individual sale. We may, however, be interested in determining the dependencies in terms of concepts relating to the buyer profile and the brand characteristics, such as wealthy-purchaser and famous-brands. The relations representing the target language in terms of which we want to learn the dependencies may be fuzzy relations.

Some research has been done to infer the optimal set of derived attributes in terms of which the discrimination rules may then be learnt. In the problem that we are addressing, the domain-specific concepts are given and we must learn optimal rules in terms of this, possibly non-optimal, set of concept descriptions. We first consider the simplest type of target relations and will then examine the more complex relations. In the example of the 9-bit binary numbers mentioned above, the two domain-specific concepts in terms of which the dependencies are sought to be learnt are Value-greater-than-100 (shortened to vgt100) and Value-divisible-by-5 (shortened to vdb5). These two concepts can be represented as relations which, when completely enumerated, would look as follows:

    relation vgt100                 relation vdb5
    Tuple        Truth-value        Tuple        Truth-value
    000000000    0                  000000000    0
    000000001    0                  000000001    0
    010000000    1                  000000101    1
    010000001    1                  000000110    0
    100100101    1                  010100000    1
    etc.                            etc.
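The two relations need not be stored extensionally; as the hedged sketch below shows, they can be evaluated on demand for any 9-bit tuple, matching the truth values of a few of the nonzero tuples tabulated above.

```python
# Evaluate the two target-language relations for a 9-bit tuple given as a bit string.
def vgt100(tuple_bits):
    """Value-greater-than-100: true when the 9-bit number exceeds 100."""
    return int(tuple_bits, 2) > 100

def vdb5(tuple_bits):
    """Value-divisible-by-5: true when the 9-bit number is a multiple of 5."""
    return int(tuple_bits, 2) % 5 == 0

for t in ["000000001", "010000000", "000000101", "000000110", "100100101", "010100000"]:
    print(t, int(vgt100(t)), int(vdb5(t)))
```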

According to the learning algorithm discussed by Quinlan in [4], the possible test candidates for partitioning the database D are all the attributes (b0 through b8 here) and


all the relations (vgt100 and vdb5 here). The test selected for partitioning is the one that reduces the average disorder, as measured by expression (1), by the largest amount. This type of partitioning seeks to keep tuples with the same target-attribute value in the same partition. However, trying to learn rules in terms of the preferred language requires that each partition resemble, as closely as it can, one (or a function of a few) of the relations of the target language, instead of being pure from the perspective of the value of the target attribute. Our formulation of the learning task is described as follows.

4. DISCRIMINATION RULES IN TARGET LANGUAGE

Information theory provides a measure of the difference between the information contents of two different relations. This difference is given by the following expression, which is known as the Kullback-Leibler cross-entropy measure:

    Dis(R1, R2) = Σ_c (n_c / n) log ( (n_c / n) / (m_c / m) )    (2)

where R1 and R2 are the two relations, n_c etc. are the class counts (defined as for expression (1)) for relation R1 with total n, and m_c are the corresponding counts for relation R2 with total m. The value of this measure is zero only when the two relations R1 and R2 are identical, and otherwise it has a positive value. The higher the value, the more different the two relations are from each other. In our algorithm we seek to minimize this difference between the individual partitions of the dataset and the concepts (relations) of the target language, rather than minimizing the average conditional entropy computed from the perspective of the target attribute alone. In the following discussion, D refers to the database for which we are trying to learn the discrimination rules and the Ri's refer to the relations of the target language.

4.1. SINGLE RELATION IN LEARNING LANGUAGE

In this simplest and somewhat trivial case, the partitions of database D are sought to be discriminated from the perspective of a single domain-specific relation R. In the context of the above example, let us say the relation R is vgt100. The steps for partitioning the dataset at each level of the tree construction process are as follows:

1. Consider all attributes (b0 - b8) and the relation vgt100 as possible tests for partitioning the database.

2. For each possible test:

(a) Split the database into partitions D1, D2, ...

(b) For each partition Db determine the corresponding set of tuples from the relation R. This subset of tuples, Rb, contains exactly the same set of tuples as Db, and only the truth values of the tuples may be different from the corresponding target

271

attributes in Db. If the set of attributes in R is a subset of the attributes in D, we append the absent attributes, for all their possible values, to each relevant tuple of Rb.

(c) Considering the truth values of the Rb's as their target attribute, compute the average distance of the partitioning from the relation R as follows:

    Average-Distance = Σ_b (n_b / n_t) · Dis(Db, Rb)    (3)

3. Select that test for partitioning which results in the partitions having minimum average distance from the relation R.

In the above, we seek the partitioning which has minimum cross-entropy between the database partitions Db and the relations Rb, instead of the partitioning which has minimum absolute entropy as given by expression (1). The actual concept embedded in the example database is that the target attribute is "1" only when the value is greater than 100 and it is not a multiple of five. However, a few exceptions (tuples) are included in the database because our objective is to learn approximate rules for classes that are not completely homogeneous. For one particular database of the example, after partitioning, the discrimination rules for the two partitions look like:

1. If vgt100(tuple) then target = 1 (for 79% of cases)
2. If not vgt100(tuple) then target = 1 (for 18% of cases)

With only one relation in the target language the learning becomes a somewhat trivial exercise. A more reasonable learning exercise, using a larger target language, is described as follows.

4.2. MULTIPLE RELATIONS IN LEARNING LANGUAGE

In this case the desired output language of the learning process consists of a number of concepts, each specified by a relation on the attributes of the domain. In the context of the above example, we consider vgt100 and vdb5 as the two relations forming the target learning language. The test selection task at each step of the tree building exercise, according to our algorithm, proceeds as follows:

1. Consider all attributes (b0 - b8) and all the relations Ri (vgt100 and vdb5 for the example here) as possible tests for partitioning the database.

2. For each possible test perform the following steps:

(a) Split the database into partitions D1, ..., Dk.

(b) For each partition Db determine the corresponding set of tuples from each relation Ri. This subset of tuples, Rib, contains exactly the same set of tuples as Db, and only the truth values of the tuples may be different from the corresponding target attributes of Db. If the set of attributes in Ri is a subset of the attributes in D, we append the absent attributes, for all their possible values, to each relevant tuple of Ri.

(c) Consider the truth values of the Rib's as corresponding to the target attribute and compute the minimum distance a partition has from any of the Rib's as follows:

    Mdis_b = min_i ( Dis(Db, Rib) )    (4)

(d) Compute the weighted average of the minimum distances that each partition has from any of the relations by computing:

    Average-Distance = Σ_b (n_b / n_t) · Mdis_b    (5)

3. Select that test for partitioning which results in the partitions having minimum Average-Distance from the relations of the target language.

It should be noted here that different partitions may turn out to be closest to different relations Ri. This procedure helps us determine the test which keeps each resulting partition closest to one or the other concept (relation) of the target language. Using this procedure for the above example, our learning system learned the following discrimination rules for the largest partitions (containing more than 90% of all the database cases):

1. If b7 = 1 then target = 1 (for 96% of cases)
2. If b7 = 0 and vgt100(tuple) then target = 1 (for 99% of cases)
3. If b7 = 0 and vdb5(tuple) then target = 0 (for 99% of cases)

In the above example it just happens that the b7 attribute is closer to the concept attribute than any of the individual relations of the target language. Minimization of cross-entropy also corrects to some extent the shortsightedness of the otherwise greedy algorithm of selecting the optimal partitioning test at each stage. In effect, we are minimizing the average entropy not for a single attribute or relation, but for a conjunctive expression, one element of which is a relation and the other an attribute or a relation. And since at least one element of the conjunction is always a relation, the partitioning becomes more biased towards the specified language.

The discrimination-tree construction process yields the most distinguishable partitions of a database. The discrimination rules learnt in the process can be used for classifying new data but there is more knowledge to be gained by studying the characteristics shared by all the cases included in a single most discriminable partition of a database. The knowledge about segment # 2 and # 5 is sufficient to help discriminate it from other characters but a different type of knowledge is gained from inferring that most of the digit "2"'s have segments #1,3,4,5,7 ON. It gives us the structural knowledge of the class "2" in terms of the other attributes of the database. From a database of tic-tac-toe end-games one may learn the discrimination rule to separate the losing from the winning boards but the characteristic rules for the structures of winning boards are a useful and a different type of knowledge. In a typical decision tree the discrimination rule for a partition is obtained by performing a conjunction of the attribute-value tests performed on the path from the root of the tree upto the leaf node where the partition resides. The characteristic rules for the partitions may be constructed by forming a boolean expression from the attribute-value pairs shared by all or some cases included in the partition. The discrimination rules contain as Uttle knowledge as may be needed to classify a

new data point in its appropriate class. The tests included in the discrimination rule for a partition would certainly be a part of its characteristic rule, but there may be more attribute-value pairs and/or predicates that are shared by all or most of the cases of a partition. For learning the characteristic rules for the most discriminable categories of a domain, one would first partition the database using the same kind of partitioning as one would perform for learning the optimal discrimination rules. One can then examine the attribute values of each partition to learn characteristic rules of the type If only seg #1,3,4,5,7 are ON then it is "2", or If seg #2,4,5,6,7 are always ON and seg #1 may be ON then it is "6", or probabilistic rules if the partitions are not uniform in the value of the target attribute. We say that a probabilistic characteristic rule is of the form If Y=y Then ConceptClass = c With probability p. Here Y may be a conjunction of a number of attribute-value pairs or predicates related to the relations of the specified language. In a discrimination rule learnt by building a decision tree, the Y=y part is unique for each partition in the sense that it has a fixed structure (attribute/relation-value pairs to be tested). This rule can be determined by conjoining all the tests from the root of the decision tree up to the leaf where the partition resides. In a characteristic rule the part Y=y may not be such a unique expression. For example, in the characteristic rule for the digit "6" mentioned above, segment number one may or may not be ON for all cases of the same partition, even though all of them have the same value for the target attribute.

The extent of a match between a relation of the target language and a database partition may be determined only in terms of the above described cross-entropy measure Dis. To learn the optimal set of characteristic rules we do the following:

1. For each partition Db determine the closest expression Yb such that the cross-entropy measure Dis(Db, Yb) is minimized.
2. Find the probability p with which Yb determines the value of the target attribute X in the partition.

Now, the optimal set of characteristic rules learnt from a database of cases is described as that set of rules for which the value

P[Y,,]

1

X

Dis(D bl Y,,)

(6)

is maximized. Intuitively, it can be seen that we prefer those rules that on one hand are common, that is, have a high probability of occurrence P[Y_b], and on the other hand are very well described in terms of the specified target language for learning. This measure may be contrasted with other information-theoretic measures of goodness of rule-sets. The J-measure specified in [5] seeks to get the optimal discrimination rules. The above measure, which is very similar in spirit and structure, is directed towards seeking optimal characteristic rules in a specified target language.
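As a concrete illustration of the two steps and of the measure (6), the following minimal Python sketch (not the paper's implementation; all names are illustrative) computes P[Y_b] as the frequency with which a candidate expression Y_b holds over the cases considered, and the probability p with which Y_b determines the target value. The cross-entropy measure Dis of section 4.2 is not reproduced here and is assumed to be supplied as a callable.

    def occurrence_probability(partition, expr):
        # P[Y_b]: fraction of cases satisfying the candidate expression Y_b
        return sum(bool(expr(case)) for case in partition) / len(partition)

    def rule_probability(partition, expr, target_attr, target_value):
        # p in "If Y = y Then ConceptClass = c With probability p"
        covered = [case for case in partition if expr(case)]
        if not covered:
            return 0.0
        return sum(case[target_attr] == target_value for case in covered) / len(covered)

    def rule_set_value(partitions, exprs, dis):
        # The measure (6): sum over partitions b of P[Y_b] * 1/Dis(D_b, Y_b);
        # `dis` stands for the cross-entropy measure of section 4.2.
        return sum(occurrence_probability(D_b, Y_b) / max(dis(D_b, Y_b), 1e-9)
                   for D_b, Y_b in zip(partitions, exprs))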

Construction of optimal characteristic rules, such that the above measure is maximized, cannot be done by the short-sighted greedy method of building a decision tree. The tree-construction algorithm presented in the preceding section, which addresses the short-sightedness problem only to the extent that it looks at two relations at a time instead of a single test, is used by us for learning the set of characteristic rules.

5.1. ANOTHER EXAMPLE

We applied this characteristic rule learning algorithm to a database of Tic-Tac-Toe end-game board situations. Assuming that the player x starts the game, the target (10th) attribute is '1' whenever the x-player wins the game specified by the nine attributes specifying the board situation. For these nine attributes a value of x means the place is taken by the x-player, a value of o means the place is taken by the o-player, and a value of b means the place is blank. A few records of this database look as follows:

b b x b x o x o b    1
x x o x o b o o x    0
b b x b x o x b o    1
b b x b x b x o o    1
x x o x o o o b x    0
b b x b o x o b x    1
b b x b o x b o x    1
x x o x o o b x o    0
x x o x o b o x o    0

Decision trees with very good prediction performance have been built for the above database, which is one of the known databases used by many machine learning researchers. We sought to learn the characteristic rules for this game from the perspective of the language of the domain. That is, the board characteristics are better described in terms of adjacency of "x" marks. The two types of adjacency we have used are the lateral adjacency and the diagonal adjacency of two similar-valued attributes. Therefore the two relations that we specify as the target language are ladj(x,y) and dadj(x,y), denoting the lateral adjacency and the diagonal adjacency concepts. The characteristic rules obtained for the tic-tac-toe database for the x-win partitions are:

• a_i = x ∧ (ladj(a_1, a_2) ∧ ladj(a_2, a_3))

• a_i = x ∧ (dadj(a_1, a_2) ∧ dadj(a_2, a_3))

It turns out that the above expressions are true with a significant probability even in those partitions in which the x-player loses the game. This hindsight suggested to us that a more suitable set for describing the board situation would be the relations Horizontal adjacency, Vertical adjacency, 45deg adjacency, and 135deg adjacency.
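For illustration, the two relations can be realized on the nine board attributes a1, ..., a9 (numbered row by row on the 3x3 grid) as in the following Python sketch. This is not the paper's code, and reading "lateral" as horizontally or vertically adjacent and "diagonal" as diagonally adjacent is an assumption made here.

    def _pos(i):
        # attribute index 1..9 -> (row, column) on the 3x3 board
        return divmod(i - 1, 3)

    def ladj(i, j):
        # lateral adjacency of attributes a_i and a_j (assumed: cells share an edge)
        (r1, c1), (r2, c2) = _pos(i), _pos(j)
        return abs(r1 - r2) + abs(c1 - c2) == 1

    def dadj(i, j):
        # diagonal adjacency of attributes a_i and a_j (assumed: cells share a corner)
        (r1, c1), (r2, c2) = _pos(i), _pos(j)
        return abs(r1 - r2) == 1 and abs(c1 - c2) == 1

    record = "b b x b x o x o b".split()          # first record of the table above
    x_cells = [i for i in range(1, 10) if record[i - 1] == "x"]   # -> [3, 5, 7]
    print([(i, j) for i in x_cells for j in x_cells if i < j and dadj(i, j)])
    # -> [(3, 5), (5, 7)]: the x marks form a diagonally adjacent chain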

Some more heuristics for instantiating attributes and generalizing characteristic rules for partitions were employed in the above example. A partitioning of the same database by minimizing the traditional average conditional entropy results in good discrimination rules, but the individual partitions are less uniform from the perspective of the two relations ladj and dadj.

6. A GLOBAL PERSPECTIVE

As stated above, our algorithm in effect reduces the effects of the short-sightedness of the greedy algorithm for picking the optimal test at each stage of the learning process. This localization effect, however, is not completely eliminated. An attempt towards reducing the localization effect and making decisions from a global perspective is described as follows. We seek to construct new domain attributes equivalent to the complex expressions formed using the original domain attributes and the specified domain relations. The objective is to search the space of possible expressions for those candidates which, when used as the test-attributes in a decision tree, would minimize the average distance from the relations (as computed in the above section) of the target language. Let us say the domain's attributes are

    A = {a_1, a_2, ..., a_n}          (7)

and the relations in terms of which the structure of the domain is to be learned are given by the set R = (R_1, ..., R_k). We consider an algebra E defined by A, R, and the logical operators (∧, ∨, Not). The difficult learning task can now be defined as follows. In the space of all possible expressions of E, search for that expression e which minimizes the cross-entropy value as defined in section 4.2 above. In the above partitioning methods the expression e has remained restricted to single attributes and relations, and successive partitionings provide a conjunction of these singleton expressions to describe or discriminate the partitions. We use some very simple heuristics to discover those expressions that reduce the average cross entropy by amounts larger than what is achieved by any single attribute or relation. The use of these simple heuristics has yielded results that are better than the singleton attributes, even though as yet we have no way of provably constructing the optimal expressions. One heuristic employed by us, sketched below, is to determine those four attributes or functions that are the best tests for partitioning the database. We then determine the cross entropy for each pair formed from this set of four. The pair with the highest cross entropy is the one whose elements contain information most different from each other. We perform a disjunction of these two attributes and use it for partitioning the database. The average cross entropy of the partitioning performed by this composite attribute is always lower than or equal to that achieved by any singleton test used for partitioning.
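The pairing heuristic can be summarized as follows (illustrative Python only, not the implementation used in the paper). Both scoring functions are placeholders: partition_score(test) is assumed to return the average cross entropy of the partitioning induced by a test (lower is better), and pair_cross_entropy(t1, t2) the cross entropy between two tests (higher means more dissimilar).

    from itertools import combinations

    def composite_test(tests, partition_score, pair_cross_entropy):
        # 1. keep the four individually best tests
        best4 = sorted(tests, key=partition_score)[:4]
        # 2. among the six pairs, pick the most mutually dissimilar pair
        t1, t2 = max(combinations(best4, 2),
                     key=lambda pair: pair_cross_entropy(*pair))
        # 3. return their disjunction as a new composite test attribute
        return lambda case: t1(case) or t2(case)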

7. SOME EXTENSIONS

The above presented ideas for learning in a target language can be extended to learning in the following cases:

1. The target concept is specified not by an attribute of the database but by a function of the relations of the specified target language.

2. The specified target language contains fuzzy relations instead of crisp relations. We can then learn concepts in terms of more "precise" or more "imprecise" relations of the target language.

3. The concepts of the target language can be evaluated not for each individual tuple of the database but depend on the temporal or the aggregate aspects of the known database.

Characteristic rules are required in some situations because of the extra structural information they provide about each discriminable partition. In some domains we can assume that each discriminable partition of a database corresponds to a different underlying cause-effect phenomenon of the domain. This information can then be used for making cause-effect related specializations in models for probabilistic reasoning. For example, a Bayesian network consists of a number of nodes, each also storing with it a conditional probability function P[node | parents(node)]. This probability function is derived from the complete available database approximating the joint probability distribution of the domain. When we want to make assumptions in terms of a target language, we can partition the database so as to learn the characteristic rules in terms of the given target language. We can then select that partition whose characteristic rule is closest to the assumption to be made. The conditional probability function for the network node representing the target attribute can then be derived only from the selected partition and not the complete database.

8. CONCLUSION

We have presented an algorithm for learning the characteristic rules for the optimally discriminable classes of a database in terms of a specified target language. Most learning methods seek to partition the database such that the conditional entropy of the partitioning is minimized. We minimize the average cross entropy between the partitions and the expressions formed from the relations of the target learning language. This results in removing from the characteristic and discrimination rules most of the unwanted and irrelevant attributes, and generates rules, as much as possible, in terms of the primitives of the desired learning language. Learning characteristic rules in terms of a specified language is very useful from the point of view of understanding the deeper structure of the domain. Tic-Tac-Toe should be understood in terms of the adjacencies of the moves, and chess in terms of the legal moves of the chess game. Purely statistical dependencies involving only the recorded domain attributes, and not the domain-specific relations, do not shed much light on the structure of the domain.


The learning method presented in our paper helps us in learning the desired type of discrimination and characteristic rules for a domain.

ACKNOWLEDGMENT

This research was supported by National Science Foundation Grant Number IRI9308868.

References

[1] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[2] Gregory F. Cooper and Edward Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, vol. 9, pp. 309-347, 1992.

[3] J. R. Quinlan. Induction of Decision Trees. Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.

[4] J. R. Quinlan. Learning Logical Definitions from Relations. Machine Learning, vol. 5, pp. 239-260, 1990.

[5] Padhraic Smyth. Rule Induction Using Information Theory. In Knowledge Discovery in Databases, edited by Gregory Piatetsky-Shapiro and William J. Frawley, 1991, pp. 159-176.



Discussions Part III

Paper Lemmer

Egmont-Petersen: I would like to address the issue of evaluation of your model. Probably we can agree that war does not happen every day. So how can you evaluate your network?

Lemmer: This is a question that always comes up in military applications. You can't verify in the sense of verifying on patients in medical applications. These models are used as a backup to the people who are making these kinds of decisions. Having the decision processes represented as models is a good way for various intelligence analysts to discuss the problem amongst themselves. They may determine whether they agree or disagree with the probabilities or cause-effect relations given by the model, and they may suggest refinements accordingly. So, the models are just used as one more source of analysis. They have not been envisioned for the automatic button-pushing that goes with star-wars or anything like that.

Mulder: In remote sensing and geo-information systems you have two choices. You can approach a problem statistically, and I think that is what you have been doing, or you can approach it from the side of physical modelling and add statistics where you need it. An application which seems very similar to yours is that of crop development. The streetsweeper is replaced by a harvester. We got reasonably high success rates with a crop growth model. Primarily we have a growth model which tells us the prior probabilities of the changes with time. Then we add multi-sensor observations. Of course these sensors do not observe at the same time. So I think it may be worthwhile to compare the two approaches: starting with statistics and then adding modelling, versus first modelling and then adding statistics.

Lemmer: I agree with you, and I would like to know more about the work that you are doing.

Smyth: You mentioned modelling mine fields. What is the actual operational status of this and related applications? Can you see it being used routinely in the field, or are you still far away from actual use?

Lemmer: I would say we are still pretty far away. The motivation behind this was to try to develop different classes of mine detectors that could be tuned to various geological properties and moisture conditions and things like this in the field. The idea was to use this as a simulator to get practice without actually going out in the field. There was some thought initially that this would be actually embedded in a fieldable system and that one could use this kind of simulator to model the local environment in which the detector would operate. Then one could train the mine detector overnight to work in this particular area. But, shall we say, funding for this has dried up.

Paper Van de Stadt

Lemmer: Is there anything special about the medical diagnosis problem which allows you to put the decision variables always at the root, or is this a general characteristic, applicable to all decision nets?

Van de Stadt: I think this should be done in general, because if you allow decision variables to have parent nodes and you fix them to some state or action, then that can be in conflict with the probability distribution you have calculated for that decision variable. So I think that, in principle, decision variables do not probabilistically depend on other concepts in the domain because they are fixed by the decision maker.

Brailovsky: For these Bayesian networks, you say there are two problems: how to define the general structure, which variables influence other variables, and how to calculate the system of probabilities. You have in fact two possible sources: expert opinion and archives of medical information. If you have a big, good archive, you can use some statistical measures. Can you comment on how to build a Bayesian network for some practical medical problem?

Van de Stadt: That is a difficult question, also a very relevant one. There are two approaches, as you mentioned. You can use large databases with examples of the knowledge domain to learn the initial structure, but it turned out that you have to add expert knowledge to that. It also depends on the approach you take. You may want to build the network from the bottom up, starting with a physical model, trying to model the physiological processes which give rise to the disorders. In that case you probably start with textbooks and experts. If your objective is to model the heuristic expert knowledge, then you will start by interviewing experts to define the structure of your network. The structure is very important. If the structure is correct, it will have a good classification performance, even with very rough estimates of the probability distributions. Sensitivity analyses have been done in this area. Although constructing networks is a severe problem, there are networks constructed of about 2000 nodes that work in the medical domain.


Kanal: Do people really estimate discrete probabilities? I remember that in the domain of oil exploration, they would rather use kernel distributions for which they could estimate parameters very easily. Is that way of estimation used in medicine at all?

Van de Stadt: Yes, continuous distributions can be used instead of specifying a complete discrete table of conditional probabilities. You can also ask experts to identify the mean and variance per state. It becomes difficult in cases of multiple causes for one consequence. But even then it is possible to combine those specified distributions to compute the full table of conditional probabilities for every combination of parent states. So there are interaction models on a more abstract level available.

Kanal: Let me ask a follow-up question: were decision trees being used in the medical field before Bayesian networks became available?

Van de Stadt: Yes, they were used.

Mulder: Where in your modelling do you put the common sense information that if you do not consult a doctor or do not go to the hospital, you also may get healthy? So, if there is no input, the state may still change. How would you squeeze that into the Bayesian model?

Van de Stadt: In the probability tables you can also specify the probability of future development under the assumption that you do not act. In the tables you can specify that after some time the disease may disappear on its own.

Gelsema: Does that coincide with the 'wait' strategy?

Van de Stadt: Yes, it is in the tables.

Mulder: What I am missing in the 'wait' strategy is the change in state of the subject with time. In the field of engineering, a system goes from one state into another, depending on the input. In your case, if the input is zero, the system may still go from one state to another. So I would expect time nodes in the graph.

Van de Stadt: Yes, that is missing in this representation. If you do not input anything in a network, it remains in the same state. An approach to overcome that is to build time-slices.


Egmont-Petersen: I have a follow-up question with respect to time. In your network you have the nodes 'disease' and 'disease continued'. My question refers to your time representation. Does it not cause problems to the model to have in one node the disease at a certain time t and in another node the disease in a time slice t+Δt? Many systems cannot handle that.

Van de Stadt: In the figure it looks as if such nodes occur at the same time, but actually there are two layers in this network: in the upper layer your knowledge about the disease is represented. Then you can take some action, for instance start a treatment, and the effect is represented in the lower layer. If you want to do a next planning iteration, it will be based on the a posteriori probability distribution as calculated in the previous layer. In that way the information that has already been put into the system is propagated to the next time slice.

Lemmer: That is exactly the basis of what Pearl and his people are doing. The action net unfolds in time.

Van de Stadt: Yes, processes are cut up in time slices.

Talmon: I would like to make a comment on the modelling of time. I know that the people in Denmark who developed Hugin have used probabilistic causal networks to model processes, specifically in the domain of insulin treatment. They model the food intake and the insulin protocol of a patient and try to predict the next state of the patient. Thus they get estimates over time of how blood glucose develops. They try to develop plans for treatment. So, they build a model for each time slice and connect those to each other. I think there is a difference with your approach to treatment planning, since in your approach time is not explicitly represented. Does it depend on the type of problem you are dealing with what type of approach is best?

Van de Stadt: The two types of modelling are quite similar. When modelling a continuing process, as long as the model remains the same, the same concepts stay present over time. Only the probability distribution changes with time because new information becomes available from measurements on the patient. I think that from that point of view the modelling is the same as in the approach presented here. You can also incorporate the planning of actions in that model, in exactly the same way as I showed here.

Smyth: Are your decision networks similar to influence diagrams or are they the same?

Van de Stadt: No, they are not the same, they are similar. If you analyze influence diagrams, you end up with a decision tree.


So in the influence diagram the permissible sequences of actions are specified. The representation is more compact than the decision tree, which is one of the advantages. The evaluation is very similar and has been shown to be functionally equivalent. In the representation presented here, the evaluation differs on some points. It is more flexible, because you incorporate information which can change with time, influencing your strategy during the planning course.

Paper Liang et al.

Loew: Why do you use the Biederman Geons, rather than arbitrary objects?

Liang: The OPTICA system is inspired by Biederman's theory "Recognition by Components". The aspect hierarchy in OPTICA can model arbitrary objects besides Geons.

Loew: Now, to follow up on that, when Biederman proposed his recognition-by-components idea, it was these non-exact properties that gave this method its robustness: the notions of straightness of the axes or curviness and parallelness of edges and so on. I believe his point was that it was their qualitativeness which gave power to his recognition scheme. One did not have to accurately measure for instance the straightness of an axis, just roughly whether it was straight or curved. But I don't see any explicit use of those non-exact properties in your method.

Liang: The approach in OPTICA is different from other approaches such as PARVO, llM, etc. The developers of OPTICA did not strictly follow Biederman's theory.

Bunke: I think your main motivation was the clustering problem and I see at least two alternative approaches for that. The first is to use range images instead of grey level images. In this case the clustering problem is no longer exponential, but linear. The other alternative is based on edges, using junction labelling for cluster finding. Could you comment on this, giving a comparison between your approach and these two alternatives?

Liang: The method based on junctions has been used in PARVO, llM, etc. to infer primitives directly from contours. The method in OPTICA is to group contours in a region into a face, then infer primitives from faces via aspects. The advantage is that faces have much more indexing power than contours. I think that the junction principle can be implemented in the preprocessor of OPTICA. But anyway, OPTICA is not my work.


Paper Bhatnagar

Mulder: You talked about causal relationships. Bayesian theory combines probabilities of events in a symmetric way. What is the reason for a break in the symmetry; why would you call one thing the cause and another the effect?

Bhatnagar: I wrote here causal relationships in quotes. These are not exactly inferred causal relationships; rather, intuitively we call them causal relationships.

Mulder: What criterion do you have to make that decision? Is it the time dependence or is there any other general reason to break the symmetry?

Bhatnagar: We assume that, given all the attributes of the database, we have enough domain knowledge available to put the attributes in a causal partial order. Any variable in this sequence is causally affected only by those that precede it in the order that we know from the domain.

Mulder: So, is it a processing order or is the order due to physical constraints or to a model?

Bhatnagar: The order is due to physical knowledge about the domain.

Sethi: To me, the process that you are describing for extracting characteristic rules looks very similar to the processes that are used very often in pattern recognition, like prototype learning or clustering, or in terms of neural networks it would be learning vector quantization. So, have you looked at any of those techniques which might be suitable rather than trying to do a heuristic?

Bhatnagar: As far as partitioning is concerned it is very similar to many clustering methods that have been used. The thing that I am doing differently is taking cross-entropy as the distance measure and also some fuzzy distance measures and trying to optimize a different optimality function. I did not find cross-entropy and language primitives being used in the literature on clustering.



Why do multilayer perceptrons have favorable small sample properties?

Šarūnas Raudys
Institute of Mathematics and Informatics, Akademijos 4, Vilnius 2600, Lithuania
e-mail: [email protected]

There are several arguments which explain the good small sample properties of multilayer perceptrons. First, the hidden layer's weights of the multilayer ANN are common for all classes. Second, the traditional pattern error function used as a criterion in ANN training does not parametrize the feature space totally and takes into account only pattern vectors nearest to the decision boundary. Third, local nonparametric classifiers are sensitive only to intrinsic dimensionality and do not suffer from the dimensions that have a negligible variability of the data points.

1. INTRODUCTION

Two of the main practical questions which arise in pattern classifier design are how large the number of training samples should be, and how many features should be measured. According to recommendations by Jain [1], it is good practice in pattern recognition to take a number of training samples at least five to ten times the number of measurements. Other authors are even more pessimistic. For example, for the Parzen Window classifier, Duda and Hart [2, p. 95] claim that the number of samples grows exponentially with the feature space dimensionality. Like the Parzen Window classifier, complex multilayer artificial neural net classifiers allow one to obtain nonconvex nonlinear decision boundaries, so one can expect that the number of training samples required to train the ANN classifier should be very large. Estimates obtained from an evaluation of the ANN classifier's capacity, the VC dimension, give very large values [3,5]. Despite a number of comparatively high bounds formulated by statistical laws, a number of reported neural network applications claim to be successful even with surprisingly small training sets (see e.g. an enumeration of problems presented in [6-8]). We use the term "small training set" to define a case when the number of training samples is small in comparison with the complexity of the classification rule. Thus 2000 training pattern vectors can be sufficient to train a network with two inputs, two outputs and three hidden neurons. However, for a network with 10,000 weights to be determined this number can be too small. Therefore, we then have a small sample problem. Duin [6] explains successful applications of the ANN in small training sample cases by strong dependencies in the data, resulting in low intrinsic dimensionality. This agrees with observations for the Parzen Window classifier, a nonparametric local classification rule, for which the small sample properties are caused by the intrinsic dimensionality of the data in local areas of the multivariate feature space [9,10]. In addition to this, there are more peculiarities of the feedforward ANN classifiers which result in favorable small sample properties, and which can be explained by statistical arguments. The intention of this paper is to present and discuss once more some old theoretical results, which are very important for the analysis of multilayer perceptrons, and to present some new ones in order to better explain the ANN training abilities.

In section 2 we show that parameters common for all classes in statistical pattern classifiers asymptotically do not affect the generalization error. In section 3 we analyze peculiarities of a loss function used to train ANNs and its influence on small sample properties. In section 4 we discuss the influence of the intrinsic dimensionality.

2. PARAMETERS (WEIGHTS) COMMON FOR ALL CLASSES

Theoretical results. In parametric statistical classification, parameters of the class distribution density function asymptotically (when the number of training samples and dimensionality increase) sometimes do not affect the expected probability of misclassification (the generalization error in ANN terminology). A first step to prove this effect was made in 1972 [11]. For the Euclidean distance classifier E with a discriminant function (DF)

    g(X) = [X − ½(X̄(1) + X̄(2))]′ (X̄(1) − X̄(2))          (1)

and two spherically Gaussian classes π1 and π2 with densities N(X, μ1, I) and N(X, μ2, I), it was shown [11-13] that the expected probability of misclassification (a mean generalization error), asymptotically as the dimensionality p and the number of training samples N increase and the ratio N/p is kept constant,

    EP_N → Φ( −(δ/2) · [1 + 2p/(N δ²)]^(−1/2) ).          (2)

In the above equations X = (X1, X2, ..., Xp)′ is a p-variate vector to be classified, X̄(1) and X̄(2) are the sample maximum likelihood estimates of the mean vectors μ1, μ2, N(X, μ, Σ) is a multivariate Gaussian density with mean vector μ and covariance matrix Σ, and δ² is the squared Mahalanobis distance between the classes,

    δ² = Σ_{j=1}^{p} μ_j²,

where μ_1, ..., μ_p are the components of the vector μ = μ1 − μ2, N = N1 = N2 is the number of vectors from each class in the training set, and Φ(·) denotes the standard cumulative Gaussian distribution function. Note that the classifier makes a decision according to the sign of the discriminant function. Note that with an increasing number of training samples, the expected probability of misclassification (PMC) tends to its asymptotic value P∞ = Φ(−δ/2), called the asymptotic PMC. For a "diagonal" classifier D with a DF [11-13]

    g(X) = [X − ½(X̄(1) + X̄(2))]′ D⁻¹ (X̄(1) − X̄(2))          (3)

asymptotically as the dimensionality p and the number of training samples N increase and the ratio N/p is kept constant,

    EP_N → Φ( −(δ/2) · [1 + 2p/(N δ²) + δ²/(4(N−3))]^(−1/2) ).          (4)
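The small sample behaviour described by (2) can also be checked empirically. The following Monte Carlo sketch (written here for illustration; it is not part of the paper, and NumPy/SciPy are assumed to be available) estimates the expected PMC of the Euclidean distance classifier for two spherical Gaussian classes and compares it with the asymptotic value Φ(−δ/2).

    import numpy as np
    from scipy.stats import norm

    def edc_expected_error(p, N, delta, repetitions=200, seed=0):
        rng = np.random.default_rng(seed)
        mu1 = np.zeros(p)
        mu2 = np.zeros(p)
        mu2[0] = delta                        # Mahalanobis distance between classes
        errors = []
        for _ in range(repetitions):
            m1 = mu1 + rng.standard_normal(p) / np.sqrt(N)   # sample mean, class 1
            m2 = mu2 + rng.standard_normal(p) / np.sqrt(N)   # sample mean, class 2
            w = m1 - m2                                      # EDC weight vector
            t = w @ (m1 + m2) / 2                            # EDC threshold
            # exact conditional error probabilities for the two Gaussian classes
            e1 = norm.cdf(-(w @ mu1 - t) / np.linalg.norm(w))
            e2 = norm.cdf((w @ mu2 - t) / np.linalg.norm(w))
            errors.append((e1 + e2) / 2)
        return float(np.mean(errors))

    p, N, delta = 50, 25, 3.76                # Phi(-delta/2) is about 0.03
    print(edc_expected_error(p, N, delta), norm.cdf(-delta / 2))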

In order to design the Euclidean distance classifier one needs to estimate two p-variate sample mean vectors X̄(1) and X̄(2). While designing the "diagonal" classifier D, all p features are supposed to be independent. The variances of the features are supposed to be common for both classes, but different for each feature. Therefore, a diagonal p×p matrix D appears in the discriminant function. This matrix is composed of the sample estimates d_1, d_2, ..., d_p of the variances of each feature. Note that in the classifier D we have to estimate 3p unknown parameters of the distributions. Among them p parameters (all variances) are common to both classes. In the classifier E we have to estimate only 2p parameters. Expressions for the expected PMC were obtained assuming that for large dimensionalities the discriminant functions (1) and (3) have Gaussian distributions. A comparison of Equations (2) and (4) indicates that the estimation of the p components of the variance matrix D causes an additional term T(D) = δ²/(4(N−3)). It is interesting (and very important) to note that the term T(D) does not depend on the dimensionality p of the feature space. Here and below we use an assumption traditionally made in multivariate statistical analysis: in order to study small sample effects, the Mahalanobis distance δ² is supposed not to change with an increase in the number of variables. Therefore, in the high dimensional and large number of training samples case, the estimation of the variances common to both classes will not influence the expected PMC. A next important step in this direction was made by Deev [14]. He analyzed a discriminant function for Gaussian vectors having a block structured dependence between the variables. Let the p components of vector X be divided into h blocks x_1, x_2, ..., x_h with dimensionalities p_1, p_2, ..., p_h (Σ p_j = p). Inside each block the features are supposed to be dependent and the separate blocks are independent. Then the covariance matrix of vector X may be represented in a block-diagonal way:

    Σ = diag( Σ_1, Σ_2, ..., Σ_h )          (5)

and one can define a classifier with the discriminant function g(x) = [x_j − ½(x ...

... ⟨y_k⟩_x denotes the expectation value of the k-th output neuron averaged over the ensemble of permissible states, in the presence of stimulus x. Note that for large weights, Eq. (8) implements a winner-take-all mechanism: for the neuron for which h_k is maximal, the probability of firing is one, and for the others zero. Let training pattern x belong to class a(x) = 1, ..., m. The Kullback divergence is then

    d_FF = − Σ_x q(x) log p(y_{a(x)} | x).          (9)
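As a small illustration of the criterion in (9) (not from the paper), the sketch below computes d_FF for any model that returns the conditional probabilities p(y_k | x); with q(x) taken as the empirical frequency of the training patterns, (9) reduces to the mean negative log-probability assigned to the correct class.

    import math

    def d_ff(training_set, predict_proba):
        # training_set: iterable of (x, a_x) pairs, a_x the class index of pattern x;
        # predict_proba(x)[k] is assumed to return p(y_k | x) for the trained model.
        pairs = list(training_set)
        return -sum(math.log(predict_proba(x)[a_x]) for x, a_x in pairs) / len(pairs)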

The learning rules for u_ik, v_jk and w_ij are now given by gradient descent on d_FF:

    Δu_ik = η Σ_x q(x) x_i ( (y_{a(x)})_k − ⟨y_k⟩_x )

    Δv_jk = η Σ_x q(x) tanh(h_j + v_jk) ( (y_{a(x)})_k − ⟨y_k⟩_x )

    Δw_ij = η Σ_x q(x) x_i ( tanh(h_j + v_{j a(x)}) − ⟨tanh(h_j + g_j)⟩_x )

The expectation value ⟨y_k⟩_x is given by Eq. (8) and

    ⟨tanh(h_j + g_j)⟩_x = Σ_{k=1}^{m} tanh(h_j + v_jk) p(y_k | x)

and involves a sum over m terms only.
3.1. Some numerical results A numerical study was performed to compare the performance of the MLP and the BP [11]. The data consisted of 48.000 handwritten digits. The data were collected on data entry forms and were preprocessed (segmentation, filtering, normahzation and compres­ sion). The resulting digits were represented l)y 64 real numbers. Both MLPl and BP with 64 inputs and 10 outputs were trained on 40.000 digits and tested on 8.000 ). 5. Else /* Generate the left child. It isredundantto check if 5 + /? ^ Γ. */ 5.1. CaU BO 1EXTRACT( S, /?,ϋ: +1; X, VV, SOP).

End of Algorithm

Illustrative Example: To illustrate the above algorithm, let us consider mapping a perceptron whose weight vector w = [2, 2, -1, -2]' and Τ = 0.5. Converting negative weights to positive weights and updatingtiietiircshold,we obtain w = [2, 2, 1, 2]^ and Γ = 3.5. After sorting tiie weight vector becomes w = [2, 2, 2,1]' which leads to tiie search tree of Figure 3. In this figure, each node has two numbers. The first number is the sum of the selected weights tiius far and tiie second number is tiie sum of tiie remaining weights yet to be considered, i.e. the two numbers correspond to the two terms of the bounding function. The box nodes correspond to solution nodes andtiiemixed nodesrepresentdead nodes that were not pursued further. Thus tiie given perceptron is mapped to a Boolean function which is given as

E D E D X 1 X4

ÚJ}

X 2 X 4

Figure 3, The search tree for a perceptron with weight vector w = [2, 2, -1, -2]^ and Γ = 0.5.

2.2 Worst Case Complexity

The number of nodes generated by the symbolic mapping algorithm varies from one problem instance to another. In general, the number of nodes generated is relatively a small fimction of all possible nodes if some weights dominate therest;otherwise a large fraction of all possible nodes is generated. The following theorem describes the worst case behavior of tiie proposed symbolic m^pmg algoritiim. Theorem: The symbolic mapping algorithm using the backtracking search has the worst case complexity intiieorder oftiiecombination C(n, n/2) = W^(2" / Λ) = 0((2n I nf^T I n). Proof: The worst case isrealizedwhen all of tiie weights are identical andtiietiiresholdis just half the sum of all tiie weights. Thus, if we choose ceil{n 12) or more weights from tiie η weights, we satisfy the threshold function. The number of prime implicants is the combmation of taking ceil{n 12) out of Λ. Since the sum of the η +1 binomial coefficients is 2" and two terms equal one, we have C(n,n/2)^2V/i

(5)

318

Using the Stirling's approximation of the factorial

\ej

y

12«

\n

(6)

))

we have 1+ C(n,n/2) =

1 ^ 12^;

1ηΤ(^_ \2

(η/2)!(Λ/2)!

ln{nll)\

π η

3

^

12/1 + 4 ;

(7)

or In 2" C(n,n/2)^. — — \ π η

(8)

Hence, we have the worst case complexity C(n,n/2) = W{2'' / η) = 0((2η/π^^Ι"

Q.E.D.

/ n).

2 . 3 Case of Bipolar Inputs In many cases, bipolar binary inputs are used instead of unipolar binary inputs. In such cases also, the above algorithm can be used provided the first step involving weight conversion is modified accordin^y and the appropriate changes are made to the bouncSng and the solution functions. Using the linear transformation (9)

WjXj=i-Wj)i-'Xj)

we have the negative weight conversion rule: Xj:=

-Xj,Wj:=

-Wy,and Γ:= Τ

(10)

Thus, in this case, no adjustment to the threshold is necessary when we convert negative weights to positive weights. The boundmg function in this case is given by £vv.x.+ £

w^.^rwithXw.j:.-£w.rwitiiXw.ai.-£w. > 00 {I 0 if x = 0

f 2 (X ) =

¡ 0 if xiix>0 {O >0 1 if x = 0

and fi{x) and

(20)

The objective function for problems which do not have embedded optimization constraints is simply the negative of the sum of the associated variables. Thus, for Problems I and II, the objective function is given by Va,6 where the summation is taken over all associated variables. Minimizing this objective function, while satisfying the constraints, ensures that any logical proposition which cannot explicitly be inferred as true will receive a value of false. Embedded optimization constraints in rules affect both the form of the objective function and the form of the penalty function. A constraint is added to the penalty function to ensure that exactly one solution will be found by the neural network operator. For Problems III and IV, these constraints are formulated as 24

Σ

-

^α%

= 0

and

24

o,6c {1,2,3,4,5}

-

^

Κ,Β = 0

(21)

o,6e {1,2,3,4,5}

where J^* ¿, and i?*^ are the variables associated with bestJiouse and best jrestaurant^ite, respectively. The objective function for Problem III is modified to incorporate the embedded maximization constraint CLS follows.

Σ

f{y) = -

^ « . 6 + ^α,6 + Ha,b + /α,6 +

+ ^ α , 6 + ^ « , 6 + Η^,

(22)

a,6c {1,2,3,4,5}

Σ

{distance from ( α , b) to (2,2)) (1 - / f ^ J

a,fcc {1,2,3,4,5}

Similarly, for Problem IV the objective function is fiV) = -

+

Σ Σ

a,6c {1,2,3,4,5}

+

+

H,,f> + /a,6 + L,,b + ^ a , 6 + ^ a , 6

{distance from

+

(23)

(a, δ) to (5,5))(1 - i?*^,)

a,fcc {1,2,3,4,5}

3. H Y B R I D GENETIC A L G O R I T H M / N E U R A L NETWORK P R O C E D U R E The motivation for consideration of a hybrid genetic algorithm/ neural network ap­ proach is to supplement the abihty of neural network algorithms to perform efficient local searches with the ability of genetic algorithms to perform global search. In previous studies [4] using a neural network algorithm based on work by Wang [6], one of the ma­ jor drawbacks in solving the optimization problems resulting from identical and similar rule-based systems to those in the illustrative examples given in Figures 1 and 2 was the selection of parameters that would lead to convergence to an optimal solution. It was impossible to obtain convergence to optimal solutions for a set of illustrative examples without changing parameters for each problem. Finding a set of parameters that would lead to convergence proved to be a time-consuming and tedious process. In order to de­ velop an automated procedure, it would be necessary to have a parameter selection or

520

tuning algorithm, perhaps based on a genetic algorithm approach, or to pursue entirely different approaches for solution of nonlinear optimization problems. A synergistic genetic algorithm/neural network approach used by Shonkwiler and Miller [7] to solve nonlinearly constrained optimization problems, in particular, the A;-chque prob­ lem, appeared to incorporate some of the best features of the neural network procedure used in the first phase of our investigations within a genetic algorithm framework. For example, the authors used a time-varying penalty parameter, as in the Wang [6] approach, to shift attention from the objective function to the penalty function. Unhke the Wang neural network approach, no gradients were calculated. Instead, a synchronous proba­ bilistic updating rule which permitted variable values of 0, 0.5, and 1 was used by the authors.

ALGORITHM H: Initialize generation.counter to L Randomly generate a population of pop.size individuals. Evolve each of the individuals using Algorithm NN below. Repeat while generation.counter < max.generations Increment

generation.counter.

Randomly select no^matings pairs of distinct individuals from the previous generation and generate no.matings individuals by averaging the variables of the parents. Evolve each individual using Algorithm NN. Select pop.size individuals using the objective function to determine fitness for survival, Report all individuals that satisfy constraints (penalty function). ALGORITHM N N : Initialize iteration-counter to 1. Repeat Update each variable with probability 0.5 as follows: If the value is 1, change the value to 0 if this change results in a decrease in energy. If the value is 0, change the value to 1 if this change results in a decrease in energy. If the value is 0.5, change the value to 1 if this change results in a decrease in energy; otherwise, change the value to 0. Increment

generation.counter.

while (penalty function < 0.0001) and

iteration.counter
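For concreteness, a compact Python sketch of the two procedures is given below. It is an illustrative re-implementation under stated assumptions, not the authors' program: energy(v) is assumed to combine the objective and penalty functions, penalty(v) to measure constraint violation, the NN loop is run for a fixed number of iterations rather than on the truncated stopping condition above, and all parameter names are placeholders.

    import random

    def algorithm_nn(v, energy, max_iterations=200):
        # probabilistic update of 0 / 0.5 / 1 valued variables, kept only if the
        # energy decreases (a 0.5 falls back to 0 when the move to 1 is rejected)
        for _ in range(max_iterations):
            for i in range(len(v)):
                if random.random() >= 0.5:
                    continue                          # update each variable w.p. 0.5
                old, e_old = v[i], energy(v)
                v[i] = 1.0 if old != 1.0 else 0.0     # proposed move: 1->0, 0->1, 0.5->1
                if energy(v) >= e_old:
                    v[i] = 0.0 if old == 0.5 else old
        return v

    def algorithm_h(n_vars, energy, penalty, pop_size=20, max_generations=50,
                    no_matings=10, tol=1e-4):
        def random_individual():
            return [random.choice((0.0, 0.5, 1.0)) for _ in range(n_vars)]
        population = [algorithm_nn(random_individual(), energy)
                      for _ in range(pop_size)]
        for _ in range(max_generations):
            for _ in range(no_matings):
                p1, p2 = random.sample(population, 2)           # distinct parents
                child = [(a + b) / 2 for a, b in zip(p1, p2)]   # average the parents
                population.append(algorithm_nn(child, energy))
            population = sorted(population, key=energy)[:pop_size]  # fitness = energy here
        return [ind for ind in population if penalty(ind) < tol]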