Uncertainty in Artificial Intelligence [1st Edition] 9781483296555

This volume, like its predecessors, reflects the cutting edge of research on the automation of reasoning under uncertainty.


Table of contents:
Machine Intelligence and Pattern Recognition — page ii
Front Matter — page iii
Copyright page — page iv
Preface — Max Henrion (pages v-vi)
Reviewers — page xi
Program Committee — page xi
Contributors — pages xiii-xiv
Lp—A Logic for Statistical Information — Fahiem Bacchus (pages 3-14)
Representing Time in Causal Probabilistic Networks — Carlo Berzuini (pages 15-28)
Constructing the Pignistic Probability Function in a Context of Uncertainty — Philippe Smets (pages 29-39)
Can Uncertainty Management Be Realized In A Finite Totally Ordered Probability Algebra? — Yang Xiang, Michael P. Beddoes, David Poole (pages 41-57)
Defeasible Reasoning and Uncertainty: Comments — Benjamin N. Grosof (pages 61-66)
Uncertainty and Incompleteness: Breaking the Symmetry of Defeasible Reasoning — Piero P. Bonissone, David A. Cyrluk, James W. Goodwin, Jonathan Stillman (pages 67-85)
Deciding Consistency of Databases Containing Defeasible and Strict Information — Moisés Goldszmidt, Judea Pearl (pages 87-97)
Defeasible Decisions: What the Proposal is and Isn't — R.P. Loui (pages 99-116)
Conditioning on Disjunctive Knowledge: Simpson's Paradox in Default Logic — Eric Neufeld, J.D. Horton (pages 117-125)
An Introduction to Algorithms for Inference in Belief Nets — Max Henrion (pages 129-138)
d-Separation: From Theorems to Algorithms — Dan Geiger, Thomas Verma, Judea Pearl (pages 139-148)
Interval Influence Diagrams — Kenneth W. Fertig, John S. Breese (pages 149-161)
A Tractable Inference Algorithm for Diagnosing Multiple Diseases — David Heckerman (pages 163-171)
Evidence Absorption and Propagation Through Evidence Reversals — Ross D. Shachter (pages 173-190)
An Empirical Evaluation of a Randomized Algorithm for Probabilistic Inference — R. Martin Chavez, Gregory F. Cooper (pages 191-207)
Weighing and Integrating Evidence for Stochastic Simulation in Bayesian Networks — Robert Fung, Kuo-Chu Chang (pages 209-219)
Simulation Approaches to General Probabilistic Inference on Belief Networks — Ross D. Shachter, Mark A. Peot (pages 221-231)
Software tools for uncertain reasoning: An Introduction — Jack S. Breese (page 235)
Now that I Have a Good Theory of Uncertainty, What Else Do I Need? — Piero P. Bonissone (pages 237-253)
Knowledge Acquisition Techniques for Intelligent Decision Systems: Integrating Axotl and Aquinas in DDUCKS — Jeffrey M. Bradshaw, Stanley P. Covington, Peter J. Russo, John H. Boose (pages 255-270)
BaRT: A Bayesian Reasoning Tool for Knowledge Based Systems — Lashon B. Booker, Naveen Hota, Connie Loggia Ramsey (pages 271-282)
Assessment, criticism and improvement of imprecise subjective probabilities for a medical expert system — David J. Spiegelhalter, Rodney C.G. Franklin, Kate Bull (pages 285-294)
Automated construction of sparse Bayesian networks from unstructured probabilistic models and domain information — Sampath Srinivas, Stuart Russell, Alice Agogino (pages 295-308)
A Decision-Analytic Model for Using Scientific Data — Harold P. Lehmann (pages 309-318)
Verbal expressions for probability updates: How much more probable is "much more probable"? — Christopher Elsaesser, Max Henrion (pages 319-328)
Map Learning with Indistinguishable Locations — Kenneth Basye, Thomas Dean (pages 331-341)
Plan Recognition in Stories and in Life — Eugene Charniak, Robert Goldman (pages 343-351)
Hierarchical Evidence Accumulation in the Pseiki System and Experiments in Model-Driven Mobile Robot Navigation — A.C. Kak, K.M. Andress, C. Lopez-Abadia, M.S. Carroll, J.R. Lewis (pages 353-369)
Model-Based Influence Diagrams For Machine Vision — T.S. Levitt, J.M. Agosta, T.O. Binford (pages 371-388)
The Application of Dempster-Shafer Theory to a Logic-Based Visual Recognition System — Gregory M. Provan (pages 389-405)
Efficient Parallel Estimation for Markov Random Fields — Michael J. Swain, Lambert E. Wixson, Paul B. Chou (pages 407-419)
Comparing Approaches to Uncertain Reasoning: Discussion — System Condemnation Pays Off — Ward Edwards (pages 423-426)
A Probability Analysis of the Usefulness of Decision Aids — Paul E. Lehner, Theresa M. Mullin, Marvin S. Cohen (pages 427-436)
Inference Policies — Paul E. Lehner (pages 437-444)
Comparing Expert Systems Built Using Different Uncertain Inference Systems — David S. Vaughan, Bruce M. Perrin, Robert M. Yadrick, Peter D. Holden (pages 445-455)
Shootout-89, An Evaluation of Knowledge-based Weather Forecasting Systems — W.R. Moninger (pages 457-458)
Author index — page 459


Machine Intelligence and Pattern Recognition Volume 10 Series Editors L.N. KANAL and A. ROSENFELD University of Maryland College Park, Maryland, U.S.A.

NORTH-HOLLAND AMSTERDAM · NEW YORK · OXFORD · TOKYO

Uncertainty in Artificial Intelligence 5

Edited by

Max HENRION

Carnegie-Mellon University Pittsburgh, Pennsylvania, U.S.A. and Rockwell International Science Center Palo Alto, California, U.S.A.

Ross D. SHACHTER

Stanford University Stanford, California, U.S.A.

Laveen N. KANAL

University of Maryland College Park, Maryland, U.S.A.

John F. LEMMER

Knowledge Systems Concepts Rome, New York, U.S.A.

NORTH-HOLLAND AMSTERDAM · NEW YORK · OXFORD · TOKYO

ELSEVIER SCIENCE PUBLISHERS B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands Distributors for the United States and Canada: ELSEVIER SCIENCE PUBLISHING COMPANY, INC. 655 Avenue of the Americas New York, N.Y. 10010, U.S.A.

ISBN: 0 444 88738 5 (hardbound) ISBN: 0 444 88739 3 (paperback) ©ELSEVIER SCIENCE PUBLISHERS B.V, 1990 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science Publishers B.V./Physical Sciences and Engineering Division, P.O. Box 103, 1000 AC Amsterdam, The Netherlands. Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance Center Inc. (CCC), Salem, Massachusetts. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science Publishers B.V, unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. pp. 271-282: copyright not transferred. Printed in The Netherlands


Preface

This collection of papers, like its predecessors, reflects the cutting edge of research on the automation of reasoning under uncertainty. This volume contains a selection from the papers that were originally presented at the Fifth Workshop on Uncertainty in Artificial Intelligence, held on August 18th to 20th, 1989 at the University of Windsor in Ontario, Canada. The papers have been edited in the light of workshop discussions and, in some cases, expanded. It also includes written versions of commentaries by discussants on selected workshop sessions.

This fifth volume from the fifth annual workshop may be seen as marking a milestone for the field. The first Workshop on Uncertainty in Artificial Intelligence in 1985 brought together for the first time what was then something of a fringe group as far as mainstream artificial intelligence (AI) research was concerned. The early meetings focussed particularly on the fundamental issues of representing uncertainty, and they were the scene of vigorous and sometimes heated debates about the relative merits of the competing approaches. In more recent meetings these "religious wars" have much diminished in intensity. This is not because participants have all finally reached a consensus. Far from it! But perhaps rather because there is a recognition that arguments purely about the fundamentals, important though they may be, are not sufficient to settle the question of selecting one scheme over another for a particular application. More pragmatic criteria must also be considered. What are the computational demands of a scheme? How reliable is it? How easy is it to structure and encode human uncertain knowledge into the formalism? Can the model and reasoning be explained comprehensibly to users?

The primary goal is not simply to convince rival uncertainty theorists of the superiority of your approach over theirs. Indeed, if we believe Thomas Kuhn's characterization of the clash between scientific paradigms is applicable here, this ambition may often be unattainable anyway. Rather the goal is to provide AI practitioners and knowledge engineers with tools for uncertain reasoning that are not only principled but also practical for handling their real world problems. In this view the criterion of success of an approach is its effectiveness for application. The marketplace for ideas, like that for more tangible goods, is ultimately ruled more by consumers than producers.

This more pragmatic emphasis is much in evidence in this volume. While it does contain some interesting papers on fundamentals (Chapter I), and particularly on the relationships between uncertain and defeasible reasoning (Chapter II), the bulk of the papers address more practical issues. Recognizing that tractable probabilistic inference is critical for large scale applications of Bayesian belief nets, the papers in Chapter III explore the development of more efficient algorithms. The papers in Chapter IV discuss the embedding of uncertain inference schemes in general software tools for building knowledge-based systems. Chapter V contains papers exploring a range of important issues in knowledge acquisition, modelling and explanation. Chapter VI includes a wide variety of different approaches to uncertain inference applied to problems in vision and recognition, including natural language understanding. The final Chapter, VII, presents comparisons of uncertain inference schemes, both theoretical and empirical. Ward Edwards, in his provocative commentary on these papers, "System Condemnation Pays Off", argues for the key role of empirical comparisons in the development of the field.

The final panel discussion in the third workshop started by addressing the question "Why does mainstream AI research ignore uncertainty research?", which provoked the answer "Why does uncertainty research ignore mainstream AI research?" On the evidence from this volume and the increasing prevalence of papers addressing uncertain reasoning in general AI forums such as the AAAI and IJCAI conferences and the AI Journal, this jibe is no longer apt. The application of recently developed uncertain reasoning schemes to important problems in AI, including medical diagnosis, machine diagnosis, vision, robotics, and natural language understanding among others, demonstrates the increasing effectiveness of uncertainty researchers in addressing the needs of mainstream AI. As AI tools are applied to larger scale problems, the importance of the explicit treatment of uncertainty is becoming increasingly inescapable. Uncertainty research is also making important contributions to a range of fundamental theoretical issues including nonmonotonic and default reasoning, heuristic search, planning, and learning. While some participants in earlier workshops may feel a lingering nostalgia for the cut and thrust excitement that characterized the youth of the field a scant few years ago, we should welcome these developments as signs of the increasing maturity and broader impact of the field.

Max Henrion, Palo Alto, California


Reviewers

Gautam Biswas, Piero Bonissone, Paul Cohen, Gregory Cooper, Norm Dalkey, Michael Fehling, Matthew Ginsberg, David Heckerman, Henry Kyburg, Tod Levitt, Enrique Ruspini, Judea Pearl, David Spiegelhalter, Ben Wise, Jack Breese, Peter Cheeseman, Marvin Cohen, Bruce D'Ambrosio, Rina Dechter, Ken Fertig, Dan Geiger, Eric Horvitz, John Lemmer, Ron Loui, Ramesh Patil, Prakash Shenoy, Michael Wellman, Ron Yager

The editors give heartfelt thanks to the reviewers, who contributed generously of their time in refereeing the papers for the Workshop and so helped provide the basis for this volume.

Program Committee

Piero Bonissone, Peter Cheeseman, Paul Cohen, Laveen Kanal, Henry Kyburg, John Lemmer, Tod Levitt, Ramesh Patil, Judea Pearl, Enrique Ruspini, Glenn Shafer, Lotfi Zadeh

Program Chair: Max Henrion

General Chair: Ross Shachter



Contributors

Alice Agogino, University of California, Berkeley, CA 94720; John Mark Agosta, Stanford University, Stanford, CA 94305; K.M. Andress, Purdue University, W. Lafayette, IN 47907; Fahiem Bacchus, University of Waterloo, Waterloo, Ontario N2L 3G1; Kenneth Basye, Brown University, Providence, RI 02912; Michael P. Beddoes, University of B.C., Vancouver, B.C. V6T 1W5; R. Bellazzi, Università di Pavia, Pavia, Italy; Carlo Berzuini, Università di Pavia, Pavia, Italy; Thomas O. Binford, Stanford University, Stanford, CA 94305; Piero P. Bonissone, General Electric Corporation, Schenectady, NY 12301; Lashon Booker, Naval Research Laboratories, Washington D.C. 20375-5000; John S. Boose, Boeing Computer Services, Seattle, WA 98124; Jeffrey Bradshaw, Boeing Computer Services, Seattle, WA 98124; John S. Breese, Rockwell International, Palo Alto, CA 94301; Kate Bull, Hospital for Sick Children, London, England; M.S. Carroll, Purdue University, W. Lafayette, IN 47907; Kuo-Chu Chang, Advanced Decision Systems, Mountain View, CA 94040; Eugene Charniak, Brown University, Providence, RI 02912; R. Martin Chavez, Stanford University, Stanford, CA 94305; Paul B. Chou, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598; Marvin S. Cohen, Decision Science Consortium, Inc., Reston, VA 22091; Gregory F. Cooper, Stanford University, Stanford, CA 94305; Stanley Covington, Boeing Computer Services, Seattle, WA 98124; David A. Cyrluk, General Electric Corporation, Schenectady, NY 12301; Thomas Dean, Brown University, Providence, RI 02912; Ward Edwards, University of Southern California, Los Angeles, CA 90089; Christopher Elsaesser, Carnegie Mellon University, Pittsburgh, PA 15213; Kenneth W. Fertig, Rockwell International, Palo Alto, CA 94301; R.C.G. Franklin, Hospital for Sick Children, London, England; Robert Fung, Advanced Decision Systems, Mountain View, CA 94040; Dan Geiger, University of California, Los Angeles, CA 90024; Robert Goldman, Brown University, Providence, RI 02912; Moisés Goldszmidt, University of California, Los Angeles, CA 90024; James W. Goodwin, Knowledge Analysis, Belmont, MA; Benjamin Grosof, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598


David Heckerman, Stanford University, Stanford, CA 94305; Max Henrion, Rockwell International, Palo Alto, CA 94301 and Carnegie Mellon University, Pittsburgh, PA 15213; Peter D. Holden, McDonnell Douglas Corporation, St. Louis, MO 63166; J.D. Horton, University of Saskatchewan, Canada S7N 0W0 and University of New Brunswick, Fredericton, Canada E3B 5A3; Naveen Hota, JAYCOR, Vienna, VA 22180; Avi C. Kak, Purdue University, W. Lafayette, IN 47907; Harold P. Lehmann, Stanford University, Stanford, CA 94305; Paul E. Lehner, George Mason University, Fairfax, VA 22030; Tod S. Levitt, Advanced Decision Systems, Mountain View, CA 94040; J.R. Lewis, Purdue University, W. Lafayette, IN 47907; C. Lopez-Abadia, Purdue University, W. Lafayette, IN 47907; Ronald P. Loui, Washington University, St. Louis, MO 63130; William R. Moninger, National Oceanic and Atmospheric Administration, Boulder, CO 80303; Theresa M. Mullin, Decision Science Consortium, Inc., Reston, VA 22091; Eric Neufeld, University of Saskatchewan, Canada S7N 0W0 and University of New Brunswick, Fredericton, Canada E3B 5A3; Judea Pearl, University of California, Los Angeles, CA 90024; Mark Peot, Stanford University, Stanford, CA 94305 and Rockwell International, Palo Alto, CA 94301; Bruce Perrin, McDonnell Douglas Corporation, St. Louis, MO 63166; David Poole, University of British Columbia, Vancouver, B.C. V6T 1W5; Gregory M. Provan, University of British Columbia, Vancouver, B.C. V6T 1W5; Connie L. Ramsey, Naval Research Laboratories, Washington D.C. 20375-5000; Stuart Russell, University of California, Berkeley, CA 94720; Peter Russo, Boeing Computer Services, Seattle, WA 98124; Ross D. Shachter, Stanford University, Stanford, CA 94305; Philippe Smets, I.R.I.D.I.A., Université Libre de Bruxelles, Brussels, Belgium B-1050; David J. Spiegelhalter, MRC Biostatistics Unit, Cambridge, England; Sampath Srinivas, Rockwell International, Palo Alto, CA 94301; Jonathan Stillman, General Electric Corporation, Schenectady, NY 12301; Michael J. Swain, University of Rochester, Rochester, NY 14627; David S. Vaughan, McDonnell Douglas Corporation, St. Louis, MO 63166; Thomas Verma, University of California, Los Angeles, CA 90024; L.E. Wixson, University of Rochester, Rochester, NY 14627; Yang Xiang, University of British Columbia, Vancouver, B.C. V6T 1W5; Robert Yadrick, McDonnell Douglas Corporation, St. Louis, MO 63166

Uncertainty in Artificial Intelligence 5 M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer (Editors) © Elsevier Science Publishers B.V. (North-Holland), 1990


Lp—A Logic for Statistical Information

Fahiem Bacchus*
Department of Computer Science, University of Waterloo
Waterloo, Ontario, Canada N2L-3G1
[email protected]

1 Introduction

This extended abstract presents a logic, called Lp, that is capable of representing and reasoning with a wide variety of both qualitative and quantitative statistical information. The advantage of this logical formalism is that it offers a declarative representation of statistical knowledge; knowledge represented in this manner can be used for a variety of reasoning tasks. The logic differs from previous work in probability logics in that it uses a probability distribution over the domain of discourse, whereas most previous work (e.g., Nilsson [2], Scott et al. [3], Gaifman [4], Fagin et al. [5]) has investigated the attachment of probabilities to the sentences of the logic (also, see Halpern [6] and Bacchus [7] for further discussion of the differences).

The logic Lp possesses some further important features. First, Lp is a superset of first order logic, hence it can represent ordinary logical assertions. This means that Lp provides a mechanism for integrating statistical information and reasoning about uncertainty into systems based solely on logic. Second, Lp possesses transparent semantics, based on sets and probabilities of those sets. Hence, knowledge represented in Lp can be understood in terms of the simple primitive concepts of sets and probabilities. And finally, there is a sound proof theory that has wide coverage (the proof theory is complete for certain classes of models). The proof theory captures a sufficient range of valid inferences to subsume most previous probabilistic uncertainty reasoning systems. For example, the linear constraints like those generated by Nilsson's probabilistic entailment [2] can be generated by the proof theory, and the Bayesian inference underlying belief nets [8] can be performed. In addition, the proof theory integrates quantitative and qualitative reasoning as well as statistical and logical reasoning.

* Support for preparing this paper was provided through a grant from the University of Waterloo, and NSERC grant OGP0041848. Parts of this work have been previously reported at CSCSI-88 [1].

In the next section we briefly examine previous work in probability logics, comparing it to Lp. Then we present some of the varieties of statistical information that Lp is capable of expressing. After this we present, briefly, the syntax, semantics, and proof theory of the logic. We conclude with a few examples of knowledge representation and reasoning in Lp, pointing out the advantages of the declarative representation offered by Lp. We close with a brief discussion of probabilities as degrees of belief, indicating how such probabilities can be generated from statistical knowledge encoded in Lp. The reader who is interested in a more complete treatment should consult Bacchus [7].

2 Other Probability Logics

Previous work in probability logic has investigated the attachment of probabilities to sentences. To appreciate the difference between this and the expression of statistical information consider the two assertions: "The probability that Tweety can fly is 0.75," and "More than 75% of all birds can fly." The first statement is an expression of a degree of belief. It is expressing the internal state of some agent—an agent who believes the assertion "Tweety can fly" to degree 0.75. It is not an objective assertion about the state of the world (i.e., an assertion that is independent of any believers). In the world Tweety can either fly or not fly—there is no probability involved. The second statement, on the other hand, is making an objective assertion about the state of the world; i.e., in the world there is some percentage of birds that can fly and this percentage is either 75% or some other number.¹ This example shows that there is an essential difference between the attachment of a probability to a sentence and the expression of a statistical assertion. Probabilities attached to sentences, which have been the focus of previous work on probability logics, op. cit., are not capable of efficiently expressing statistical assertions (Bacchus [9]). There has been some work similar to Lp. This work is discussed in more detail in Bacchus [10].

3 Types of Statistical Knowledge

Statistical information can be categorized into many different types. The development of Lp was guided by a desire to represent as many different types of statistical knowledge as possible. The key consideration was the desire to represent qualitative statistical knowledge, i.e., not only the types of statistical knowledge used in statistics but also the types of "commonsense" statistical knowledge that would be useful in AI domains. The following is an incomplete list of some different types of statistical information that Lp is capable of expressing.

Relative: Statistical information may be strictly comparative, e.g., the assertion "More politicians are lawyers than engineers."

1. As stated it is clear that it is extremely unlikely that the actual percentage is exactly 75%. More likely it is in some interval around 75%. Lp is also capable of making such interval assertions.

Interval: We may know that the proportion is in a certain range, e.g., the assertion "Between 75% to 99% of all politicians are lawyers."

Functional: We may know that a certain statistic is functionally dependent on some other measurement, e.g., "The proportion of flying birds decreases as weight increases." This type of functional dependence in an uncertainty measure is prominent in the medical domain.

Independence: We may know that two properties are statistically independent of each other. Work by Pearl and his associates has demonstrated the importance of this kind of knowledge ([11, 12, 13]).

4 Syntax and Semantics

Lp is based on two fairly straightforward ideas. First, there is a probability distribution over the domain of discourse. This means that any set of domain individuals can be assigned a probability. Through the use of open formulas (i.e., formulas with free variables) we can assert that various sets of domain individuals possess certain probabilities. An open formula can be viewed, as in lambda abstraction, as specifying a set of domain individuals—the set of individuals which satisfy that formula. For example the open formula "Bird(x)" can be viewed as denoting the set of birds, i.e., the set of individuals that satisfy the formula. Sentences in Lp can be used to assert that the probability of this set (i.e., the measure of the set of individuals that satisfy the formula) possesses various properties. For example, the Lp sentence "[Bird(x)]_x > 0.9" asserts that the probability of the set of birds has the property that it is greater than 0.9.²

The second idea is to have a field of numbers in the semantics as a separate sort. With numbers as a separate sort the probabilities become individuals in the logic. That is, the probabilities become numeric terms³ and, by asserting that these terms stand in various numeric relationships with other terms, we can assert various qualitative relationships between these probabilities. In the above example, '[Bird(x)]_x' is a numeric term, and the sentence asserts that it stands in the 'greater-than' relation with the numeric term '0.9'. The existence of numbers as a separate sort also allows the use of 'measuring' functions, functions that map individuals to numbers. An example of such a function is 'Weight', which maps individuals to a number representing their weight (in some convenient units). The measuring functions greatly increase the expressiveness of the logic.⁴

2. This unconditional probability does not make much sense; it is through the use of conditional probabilities that meaningful statistical assertions can be made. For example, the Lp sentence "[Fly(x) | Bird(x)]_x > 0.9" makes an assertion about the relative probability of flying birds among birds, i.e., about the proportion of birds that fly.
3. This means that the probabilities are field-valued, not real-valued. There are technical difficulties with using the reals instead of a field of numbers. In particular, it is not possible to give a complete axiomatization of the reals without severely restricting the expressiveness of the logic. We can be assured, however, that the field of numbers will always contain the rational numbers, so the probabilities can be any rational number that we wish (in the range 0–1, of course).
4. These "measuring" functions are called random variables in statistics, but I avoid that terminology to eliminate possible confusion with the ordinary variables of Lp.


5 Syntax

We now present a more detailed picture of the syntax of Lp. This description should give the reader a better idea of the types of sentences that one can form in the language. We start with a set of constant, variable, function, and predicate symbols. The constants, variables, and predicates can be of two types, either field or object.⁵ The function symbols come in three different types: object, field, and measuring functions. The measuring functions will usually have special names like Weight or Size. Along with these symbols we also have a set of distinguished symbols, including the following field symbols: 1, 0 (constants), =, > (predicates), +, −, ×, and ÷⁶ (functions). The symbol = is also used to represent the object equality predicate. Also included is the logical connective '∧', the quantifier '∀', and the probability term formers '[', ']'.

5.1 Formulas

The major difference between the formulas of Lp and the formulas of first order logic is the manner in which terms are built up.

T0) A single object variable or constant is an o-term; a single field variable or constant is an f-term.
T1) If f is an n-ary object (field) function symbol and t1,...,tn are o-terms (f-terms) then f(t1...tn) is an o-term (f-term). If v is an n-ary measuring function symbol and t1,...,tn are o-terms then v(t1...tn) is an f-term.
T2) If α is a formula and x̄ is a vector of n object variables, (x1,...,xn), then [α]_x̄ is an f-term.⁷

The formulas of Lp are built up in the standard manner, with the added constraint that predicates can only apply to terms of the same type. The notable difference with first order logic is that f-terms can be generated from formulas by the probability term former. For example, from the formula "Have(y,x) ∧ Zoo(x)" the f-term "[Have(y,x) ∧ Zoo(x)]_x" can be generated. This term can then be used to generate new formulas of arbitrary complexity, e.g.,

    (∀yz) Rare(y) ∧ ¬Rare(z) ∧ Animal(y) ∧ Animal(z) →
        [Have(z,x) ∧ Zoo(x)]_x > [Have(y,x) ∧ Zoo(x)]_x.

In this formula some of the variables are universally quantified while the 'x' is bound by the probability term former. The intuitive content of this formula can be stated as follows: if there are two animals one of which is rare while the other is not, then the measure (probability) of the set of zoos which have the rare animal is less than the measure of the set of zoos which have the non-rare animal. Through standard definitions we add ∨, →, ∃, and the extended set of field inequality predicates, <, ≤, ≥, and ∈ (denoting membership in an interval). We use infix form for the predicate symbols = and > as well as for the function symbols +, ×, −, and ÷. Conditional probabilities are represented in Lp with the following abbreviation.

Definition 1

    [α | β]_x̄  =def  [α ∧ β]_x̄ ÷ [β]_x̄.

5. When there is a danger of confusion the field symbols will be written in a bold font.
6. The division function is added by extending the language through definition. See [10] for the technical details.
7. Note, x̄ does not have to include all of the free variables of α. If it does not we have a term with free variables which must be bound by other quantifiers or probability term formers to produce a sentence.

5.2 Semantic Model

This section outlines the semantic structure over which Lp is interpreted. As indicated above it consists of a two sorted domain (individuals and numbers) and a probability distribution over the set of individuals. What was not discussed was the need for a distribution over all vectors of individuals. This is necessary since the open formulas used to generate the probability terms may have more than one free variable. Hence one may need to examine the probability of a set of vectors of individuals which satisfy a given formula.

An Lp-Structure is defined to be the tuple M = (O, F, {μn | n = 1, 2, ...}), where:

a) O represents a finite set of individual objects (the domain of discourse).⁸
b) F represents a totally ordered field of numbers. The rationals, the reals, the complex numbers are all examples of fields of numbers. In fact, every field of numbers contains the rationals as a subfield (MacLane [14]).
c) {μn | n = 1, 2, ...} is a sequence of probability functions. Each μn is a set function whose domain includes the subsets of O^n defined by the formulas of Lp,⁹ whose range is F, and which satisfies the axioms of a probability function (i.e., μn(A) ≥ 0, μn(A ∪ B) = μn(A) + μn(B) if A ∩ B = ∅, and μn(O^n) = 1).

The sequence of probability functions is a sequence of product measures. That is, for any two sets A ⊆ O^n and B ⊆ O^m and their Cartesian product A × B ⊆ O^(n+m), if A ∈ domain(μn) and B ∈ domain(μm), then A × B ∈ domain(μn+m) and μn+m(A × B) = μn(A) × μm(B). The product measure ensures that the probability terms satisfy certain conditions of coherence. For example, the order of the variables cited in the probability terms makes no difference, e.g., [α]_(x,y) = [α]_(y,x). Another example is that the probability terms are unaffected by tautologies, e.g., [P(x) ∧ (R(y) ∨ ¬R(y))]_(x,y) = [P(x)]_x. It should be noted that this constraint on the probability functions is not equivalent to a restrictive assumption of independence, sometimes found in probabilistic inference engines (e.g., the independence assumptions of the Prospector system [15], see Johnson [16]). See [7] for a full discussion of the intuitive implications of using product measures.

8. We restrict ourselves to finite domains to avoid the difficulty of sigma additivity. This issue is dealt with in [10].
9. This set of subsets can be shown to be a field of subsets [7].


5.3 Semantics of Formulas

The formulas of Lp are interpreted with respect to the semantic structure in the same manner as first order formulas are interpreted with respect to first order structures. The only difference is that we have to provide an interpretation of the probability terms. As indicated above the probability terms denote the measure (probability) of the set of satisfying instances of the formula. In more detail: we define a correspondence, called an interpretation, between the formulas and the Lp-Structure M augmented by the truth values ⊤ and ⊥ (true and false). An interpretation maps all of the symbols to appropriate entities in the Lp-Structure, including giving an initial assignment to all of the variables. These assignments serve as the inductive basis for an interpretation of the formulas. This interpretation is built up in the same way as in first order logic, with the added consideration that universally quantified object variables range over O while universally quantified field variables range over F. The only thing which needs to be demonstrated is the semantic interpretation of the probability terms. Let σ be an interpretation of Lp. Let σ(x̄/ā), where ā = (a1,...,an) and x̄ = (x1,...,xn) are vectors of individuals and variables (of matching type), denote a new interpretation identical to σ except that (xi)^σ(x̄/ā) = ai, (i = 1,...,n). The probability terms are given the following semantic interpretation: for the f-term [α]_x̄,

    ([α]_x̄)^σ = μn({ā ∈ O^n : σ(x̄/ā)(α) = ⊤}).

In other words, the probability term denotes the probability of the set of satisfying instances of the formula. Since μn is a probability function which maps to the field of numbers F, it is clear that [α]_x̄ denotes an element of F under the interpretation σ; thus, it is a valid f-term.
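To make the semantics concrete, the following Python sketch evaluates probability terms in the special case of a finite domain with a uniform (rational-valued) measure, which is one simple instance of an Lp-structure. The domain, the predicates, and the uniform-measure choice are all invented for the illustration; they are not part of the paper.

```python
from fractions import Fraction

# A toy finite domain of discourse; the names and predicates are invented.
domain = ["tweety", "opus", "rex", "polly"]
bird = {"tweety", "opus", "polly"}
fly = {"tweety", "polly"}

def prob(formula):
    """[formula(x)]_x under the uniform measure: the fraction of domain
    individuals that satisfy the open formula."""
    satisfying = [a for a in domain if formula(a)]
    return Fraction(len(satisfying), len(domain))

def cond_prob(alpha, beta):
    """[alpha(x) | beta(x)]_x = [alpha(x) & beta(x)]_x / [beta(x)]_x (Definition 1)."""
    return prob(lambda a: alpha(a) and beta(a)) / prob(beta)

print(prob(lambda a: a in bird))                           # [Bird(x)]_x        -> 3/4
print(cond_prob(lambda a: a in fly, lambda a: a in bird))  # [Fly(x)|Bird(x)]_x -> 2/3
```

Note that the values returned are rationals, in keeping with the remark that the field of numbers always contains the rationals.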

6 Examples of Representation

We can now give an indication of the representational power of Lp. By considering the semantic interpretation of the formulas it should be reasonably clear that the formulas do in fact represent the gist of the stated English assertions.¹⁰

1. More politicians are lawyers than engineers.

    [Lawyer(x) | Politician(x)]_x > [Engineer(x) | Politician(x)]_x.

2. The proportion of flying birds decreases with weight. Here y is a field variable.

    ∀y([fly(x) | bird(x) ∧ weight(x) < y]_x ≥ [fly(x) | bird(x) ∧ weight(x) > y]_x).

3. Given R the property P is independent of Q. This is the canonical tri-functional expression of independence (see Pearl [11]).

    [P(x) ∧ Q(x) | R(x)]_x = [P(x) | R(x)]_x × [Q(x) | R(x)]_x.

Thus Lp can represent finely grained notions of independence at the object language level.

4. Quantitative notions from statistics, e.g., the height of adult male humans is normally distributed with mean 177cm and standard deviation 13cm:

    ∀yz([height(x) ∈ (y,z) | Adult_male(x)]_x = normal(y, z, 177, 13)).

Here normal is a field function which, given an interval (y,z)¹¹, a mean, and a standard deviation, returns the rational number approximation¹² of the integral of a normal distribution, with specified mean and standard deviation, over the given interval.

10. It should be noted that the aim is to give some illustrative examples, not to capture all of the nuances of the English assertions.
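As an illustration of what such sentences say about a model, the following sketch checks assertion 2 over a small invented population under the uniform measure; the birds, their weights, and their flying abilities are made up so that the sentence happens to hold in this toy model.

```python
from fractions import Fraction

# Invented toy population: (name, weight in grams, flies?)
birds = [("wren", 10, True), ("robin", 80, True), ("crow", 500, True),
         ("hen", 2500, False), ("emu", 40000, False)]

def proportion_flying(pred):
    """[fly(x) | bird(x) & pred(x)]_x, or None if the reference class is empty."""
    group = [b for b in birds if pred(b)]
    if not group:
        return None
    return Fraction(sum(1 for (_, _, flies) in group if flies), len(group))

# Check  Vy( [fly(x)|bird(x) & weight(x) < y]_x  >=  [fly(x)|bird(x) & weight(x) > y]_x )
# for y ranging over the weights occurring in the data (y is a field variable).
for y in sorted({w for (_, w, _) in birds}):
    light = proportion_flying(lambda b: b[1] < y)
    heavy = proportion_flying(lambda b: b[1] > y)
    if light is not None and heavy is not None:
        assert light >= heavy, (y, light, heavy)
        print(y, light, heavy)
```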

7 Deductive Proof Theory

This section outlines the deductive proof theory of Lp. The proof theory provides a specification for a wide class of valid inferences that can be made from a body of knowledge expressed in Lp. In particular, it provides a full specification for most probabilistic inferences, including Bayesian inference, all first order inferences, as well as inferences which follow from the combination of qualitative and quantitative as well as statistical and logical knowledge. The proof theory consists of a set of axioms and rules of inference, and can be shown to be sound. It can also be shown to be complete with respect to various classes of models.

The proof theory for Lp is similar to the proof theory for ordinary first order logic. The major change is that two new sets of axioms must be introduced, one to deal with the logic of the probability function, and another set to define the logic of the field F. The axioms include the axioms of first order logic (e.g., [17]) along with the axioms of a totally ordered field (MacLane [14]). There are also various axioms which specify the behavior of the probability terms. We give some examples of these axioms to give an indication of their form.

Some of the Probability Function Axioms

P1) ∀x1...∀xn α → [α]_x̄ = 1, where x̄ = (x1,...,xn), and every xi is an object variable.
P2) [α]_x̄ ≥ 0.

11. One would probably want to constrain the values of y and z further, for example, y < z.
12. A rational number approximation is returned since the numbers are from a totally ordered field, not necessarily the reals.


P3) [α]_x̄ + [¬α]_x̄ = 1.
P4) [α]_x̄ + [β]_x̄ ≥ [α ∨ β]_x̄.
P5) [α ∧ β]_x̄ = 0 → [α]_x̄ + [β]_x̄ = [α ∨ β]_x̄.

The first axiom simply says that if all individuals satisfy a given formula then the probability of this set is one (i.e., the probability summed over the entire domain is one). The other axioms state similar facts from the calculus of probabilities.

Rule of inference

The only rule of inference is modus ponens, i.e., from {α, α → β} infer β.

If we also have an axiom of finiteness (see Halpern [6]) then the above axioms and rule of inference comprise a sound and complete proof theory for the class of models we have defined here (i.e., models in which O is bounded in size and where the probabilities are field valued). Let Φ be a set of Lp sentences. We have:

Theorem 2 (Completeness) If Φ ⊨ α, then Φ ⊢ α.
Theorem 3 (Soundness) If Φ ⊢ α, then Φ ⊨ α.

Lemma 1 The following are provable¹³ in Lp:

a) ([α → β]_x̄ = 1 ∧ [β → α]_x̄ = 1) → [α]_x̄ = [β]_x̄.
b) [α ∨ β]_x̄ = [α]_x̄ + [β]_x̄ − [α ∧ β]_x̄.

The following gives an indication of the scope of the proof theory.

Example 1: Nilsson's Probabilistic Entailment

Nilsson [2] shows how probabilities attached to sentences in a logic are constrained by known probabilities, i.e., constrained by the probabilities attached to a base set of sentences. For example, if [P ∧ Q] = 0.5, then the values of [P] and [Q] are both constrained to be ≥ 0.5. Nilsson demonstrates how the implied constraints of a base set of sentences can be represented in a canonical manner, as a set of linear equations. These linear equations can be used to identify the strongest constraints on the probability of a new sentence, i.e., the tightest bounds on its probability. These constraints are, in Nilsson's terms, probabilistic entailments. These bounds are simply consequences of the laws of probability. The statistical terms in Lp also obey the laws of probability. Hence, although these statistical probabilities have a different meaning than probabilities attached to sentences, they obey similar kinds of linear constraints. In fact, the linear constraints investigated by Nilsson depend only on finite properties of probabilities, and since the proof theory of Lp is complete with respect to finite domains, all such linear constraints can be deduced from the proof theory of Lp.

13. That is, deducible directly from the axioms.


[Figure 1: A Bayes's Net]

For example, if we have {[P] = 0.6, [P → Q] = 0.8}, Nilsson's probabilistic entailment gives the conclusion 0.4 ≤ [Q] ≤ 0.8. These probabilities are to be interpreted as being probabilities attached to propositions, i.e., the propositions P, Q and P → Q. A statistical analogue of this example is the set of Lp sentences: {[P(x)]_x = 0.6, [P(x) → Q(x)]_x = 0.8}. These probabilities are to be interpreted as probabilities of sets of individuals, i.e., the measure of the set of P's, Q's and ¬P ∪ Q. From this knowledge it is easy to deduce the bounds [0.4, 0.8] on the probability term [Q(x)]_x.
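The bound computation in this example is an ordinary linear program over the probabilities of the four truth assignments of P and Q. The following is only a minimal sketch of that computation, not the paper's machinery, and it assumes SciPy's linprog is available; it recovers the interval [0.4, 0.8].

```python
import numpy as np
from scipy.optimize import linprog

# Possible worlds over P, Q:  (P,Q), (P,~Q), (~P,Q), (~P,~Q).
# Decision variables w[0..3] are the probabilities of these worlds (each >= 0 by default).
A_eq = np.array([
    [1, 1, 1, 1],   # probabilities sum to 1
    [1, 1, 0, 0],   # [P]      = w0 + w1            = 0.6
    [1, 0, 1, 1],   # [P -> Q] = [~P v Q] = w0+w2+w3 = 0.8
])
b_eq = np.array([1.0, 0.6, 0.8])
q = np.array([1, 0, 1, 0])          # [Q] = w0 + w2

lo = linprog(q, A_eq=A_eq, b_eq=b_eq).fun       # minimise [Q]
hi = -linprog(-q, A_eq=A_eq, b_eq=b_eq).fun     # maximise [Q]
print(round(lo, 3), round(hi, 3))               # 0.4 0.8
```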

Example 2: Bayesian Networks

Bayes's theorem is immediate in Lp.

Lemma 2 (Bayes's Theorem) The following is provable in Lp:

    [β | α]_x̄ = [α | β]_x̄ × [β]_x̄ / [α]_x̄.

Consider the Bayes's Net in Figure 1. If all of the variables X1–X4 are propositional (binary) variables one could write them as one place predicates in Lp. Hence, the open formula "X1(x)", for example, would denote the set of individuals with property X1. The Bayes's Net gives a graphical device for specifying a product form for the joint distribution of the variables Xi (Pearl [8]).¹⁴ In this case the distribution represented by the Bayes's Net in Figure 1 could also be specified by the Lp sentence

    [X1(x) ∧ X2(x) ∧ X3(x) ∧ X4(x)]_x =
        [X4(x) | X2(x) ∧ X3(x)]_x × [X3(x) | X1(x)]_x × [X2(x) | X1(x)]_x × [X1(x)]_x.

An equation of the same product form for any other truth assignment of the predicates, e.g.

    [X1(x) ∧ ¬X2(x) ∧ X3(x) ∧ X4(x)]_x =
        [X4(x) | ¬X2(x) ∧ X3(x)]_x × [X3(x) | X1(x)]_x × [¬X2(x) | X1(x)]_x × [X1(x)]_x,

will be satisfied by every probability distribution which satisfies the first equation. Furthermore, the proof depends only on finite properties of the probability function, i.e., only on properties true of the field valued probabilities used in the Lp-structure. Hence, by the completeness result, all such equations will be provable from Lp's proof theory. This means that the behavior of the Bayes's net is captured by the first Lp sentence. That is, the fact that this product decomposition holds for any truth assignment of the predicates Xi is captured by the proof theory.

In addition to the structural decomposition Bayes's nets must provide a quantification of the links. This means the conditional probabilities in the product must be specified. In this example if we add the Lp sentences {[X1(x)]_x = 0.5, [X2(x) | X1(x)]_x = 0.75, [X3(x) | X1(x)]_x = 0.4, [X4(x) | X2(x) ∧ X3(x)]_x = 0.3}, we can then determine the probability of the set of individuals that have some properties Xi given that these individuals possess some other properties, e.g., the values of terms like [X1(x) | X2(x) ∧ ¬X4(x)]_x. Again these probabilities will be semantically entailed by the product decomposition and by the link conditional probabilities. Thus, the new probability values will be provable from the proof theory.

Of course the proof theory has none of the computational advantages of the Bayes's net. However, what is important is that Lp gives a declarative representation of the net. The structure embedded in the net is represented in a form that can be reasoned with and can be easily changed. There is also the possibility of automatically compiling Bayes's net structures from declarative Lp sentences. Furthermore, the proof theory captures all of the Bayesian reasoning within its specification, and offers the possibility of integrating Bayes's net reasoning with more general logical and qualitative statistical reasoning. Hence the proof theory gives a unifying formalism in which both types of inferences could be understood.
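For a reader who wants to see the product decomposition and the link probabilities in action, the brute-force sketch below builds the joint distribution of the four binary variables from the factorization and answers a query such as [X1(x) | X2(x) ∧ ¬X4(x)]_x. The text only quotes the conditionals for positive parent configurations; the remaining table entries below are invented purely to complete the example, and brute-force summation is used precisely because, as noted above, the declarative representation has none of the computational advantages of the net itself.

```python
from itertools import product

# CPTs for the net X1 -> X2, X1 -> X3, {X2, X3} -> X4.
# Entries marked * are quoted in the text; all other entries are invented here.
p_x1 = 0.5                                                        # [X1(x)]_x = 0.5 *
p_x2 = {True: 0.75, False: 0.40}                                  # P(X2=T | X1): 0.75 *, 0.40 invented
p_x3 = {True: 0.40, False: 0.20}                                  # P(X3=T | X1): 0.40 *, 0.20 invented
p_x4 = {(True, True): 0.30, (True, False): 0.10,                  # P(X4=T | X2, X3): 0.30 *, rest invented
        (False, True): 0.15, (False, False): 0.05}

def joint(x1, x2, x3, x4):
    """Product decomposition: P(x1,x2,x3,x4) = P(x4|x2,x3) P(x3|x1) P(x2|x1) P(x1)."""
    def cond(table, key, value):
        p_true = table[key]
        return p_true if value else 1.0 - p_true
    return (cond(p_x4, (x2, x3), x4) * cond(p_x3, x1, x3) *
            cond(p_x2, x1, x2) * (p_x1 if x1 else 1.0 - p_x1))

def query(target_index, evidence):
    """P(X_target = True | evidence) by summing the joint over all worlds."""
    num = den = 0.0
    for world in product([True, False], repeat=4):
        if all(world[i] == v for i, v in evidence.items()):
            pr = joint(*world)
            den += pr
            if world[target_index]:
                num += pr
    return num / den

# e.g. the term [X1(x) | X2(x) & ~X4(x)]_x mentioned in the text:
print(query(0, {1: True, 3: False}))
```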

8 Degrees of Belief

Besides their use in expressing statistical information, probabilities have an important use in expressing degrees of belief. One can assert that prob[Fly(Tweety)] > 0.75, indicating that one's degree of belief in the assertion Fly(Tweety) is greater than 0.75. Interestingly, Lp cannot (easily) express such probabilities. It can be shown that the probability of any sentence (i.e., formula with no free variables) is either 1 or 0 in Lp. This fact is interesting because, as was demonstrated in [9], probability logics capable of assigning probabilities to sentences cannot (easily) represent statistical probabilities. Hence, these two types of probability logics have very different uses which coincide with their very different semantics. However, one advantage of a logic like Lp is that it can be used to generate statistically founded degrees of belief, via a system of direct inference (e.g., Kyburg [18], Pollock [19]).

Degree of belief probabilities generated in this manner have a number of advantages over purely subjective probabilities (Kyburg [20]); not the least of which is that they yield degrees of belief which are founded on empirical experience. A system of direct inference based on the use of Lp is presented in Bacchus [7]. The following simplified example should serve to illustrate the basic idea behind this system.

Example 3: Belief Formation

Say we know that we have the following Lp knowledge base:

    KB = { [Fly(x) | Bird(x)]_x > 0.9,
           Bird(Tweety) }

That is, we know that more than 90% of all birds fly, and that Tweety is a bird. Say that we want to generate a degree of belief about Fly(Tweety), i.e., Tweety's flying ability. We can accomplish this by considering what is known about Tweety (i.e., what is provable from our knowledge base), and then equating our degree of belief with the statistical probability term which results when we substitute a variable for the constant Tweety. This yields

    prob(Fly(Tweety) | Bird(Tweety)) = [Fly(x) | Bird(x)]_x,

which by our knowledge base is greater than 0.9. Semantically, this can be interpreted in the following manner: our degree of belief that Tweety can fly, given that all we know about Tweety is that he is a bird, is equal to the proportion of birds that can fly. The main complexities arise when we know other things about Tweety, e.g., when we know that Tweety is yellow as well as a bird.
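The basic substitution step of this example can be caricatured in a few lines of code. This is only a toy rendering of the idea, under the assumption that exactly one reference class applies to the individual; the direct inference system of Bacchus [7] is what actually handles the harder cases (e.g. a yellow bird).

```python
# Statistical knowledge: lower bounds on [Property(x) | ReferenceClass(x)]_x.
statistics = {("Fly", "Bird"): 0.9}          # "more than 90% of all birds fly"
facts = {"Tweety": {"Bird"}}                 # "Bird(Tweety)"

def degree_of_belief(individual, prop):
    """Equate belief in prop(individual) with the statistical term obtained by
    substituting a variable for the constant, using what is known about the
    individual as the reference class (single-class case only)."""
    for ref in facts.get(individual, set()):
        if (prop, ref) in statistics:
            return statistics[(prop, ref)]   # a lower bound on the degree of belief
    return None                              # no applicable statistics

print(degree_of_belief("Tweety", "Fly"))     # 0.9, i.e. prob(Fly(Tweety)) > 0.9
```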

Acknowledgments Thanks to the referees for some helpful criticisms.

References

[1] Fahiem Bacchus. Statistically founded degrees of belief. In CSCSI-88, Proceedings of the Canadian Artificial Intelligence Conference, pages 59-66, 1988.
[2] Nils J. Nilsson. Probabilistic logic. Artificial Intelligence, 28:71-87, 1986.
[3] Dana Scott and Peter Krauss. Assigning probabilities to logical formulas. In Jaakko Hintikka and Patrick Suppes, editors, Aspects of Inductive Logic. North-Holland, 1966.
[4] Haim Gaifman. Concerning measures in first order calculi. Israel Journal of Mathematics, 2:1-18, 1964.
[5] Ronald Fagin, Joseph Y. Halpern, and Nimrod Megiddo. A logic for reasoning about probabilities. Technical Report RJ 6190 4/88, IBM Research, Almaden Research Center, 650 Harry Road, San Jose, California, 95120-6099, 1988.
[6] Joseph Y. Halpern. An analysis of first-order logics of probability. In IJCAI-89, pages 1375-1381, 1989.
[7] Fahiem Bacchus. Representing and Reasoning With Probabilistic Knowledge (forthcoming). MIT Press, Cambridge, Massachusetts, 1990.
[8] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29:241-288, 1986.
[9] Fahiem Bacchus. On probability distributions over possible worlds. In Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 15-21, 1988.
[10] Fahiem Bacchus. Representing and Reasoning With Probabilistic Knowledge. PhD thesis, The University of Alberta, 1988. Available as University of Waterloo Research Report CS-88-31, Department of Computer Science, Waterloo, Ontario, Canada, N2L 3G1, pp. 1-135.
[11] Judea Pearl. On the logic of probabilistic dependencies. In AAAI-86, pages 339-343, 1986.
[12] Judea Pearl and Azaria Paz. On the logic of representing dependencies by graphs. In Proceedings of the Sixth Canadian Artificial Intelligence Conference, pages 94-98, 1986.
[13] Judea Pearl and Thomas Verma. The logic of representing dependencies by directed graphs. In AAAI-87, pages 374-379, 1987.
[14] S. MacLane and G. Birkhoff. Algebra. Macmillan, New York, 1968.
[15] Richard O. Duda, Peter E. Hart, and Nils J. Nilsson. Subjective Bayesian methods for rule-based inference systems. In Bonnie Lynn Webber and Nils J. Nilsson, editors, Readings in Artificial Intelligence, pages 192-199. Morgan Kaufmann, 1981.
[16] R. W. Johnson. Independence and Bayesian updating methods. Artificial Intelligence, 29:217-222, 1986.
[17] John Bell and Moshé Machover. A Course in Mathematical Logic. Elsevier, Netherlands, 1977.
[18] Henry E. Kyburg, Jr. The Logical Foundations of Statistical Inference. D. Reidel, 1974.
[19] John L. Pollock. Foundations for direct inference. Theory and Decision, 17:221-256, 1984.
[20] Henry E. Kyburg, Jr. and H. Smokler, editors. Studies in Subjective Probability. John Wiley and Sons, N.Y., 1964.

Uncertainty in Artificial Intelligence 5 M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer (Editors) © Elsevier Science Publishers B.V. (North-Holland), 1990


REPRESENTING TIME IN CAUSAL PROBABILISTIC NETWORKS

Carlo BERZUINI
Dipartimento di Informatica e Sistemistica, Università di Pavia
Via Abbiategrasso 209, 27100 Pavia, Italy
[email protected]

Abstract

In this paper we explore representations of temporal knowledge based upon the formalism of Causal Probabilistic Networks (CPNs). Two different "continuous-time" representations are proposed. In the first, the CPN includes variables representing "event-occurrence times", possibly on different time scales, and variables representing the "state" of the system at these times. In the second, the CPN describes the influences between random variables with values in {ℜ+ ∪ ∞} representing dates, i.e. time-points associated with the occurrence of relevant events. However, structuring a system of inter-related dates as a network where all links commit to a single specific notion of cause and effect is in general far from trivial and leads to severe difficulties. We claim that we should recognize explicitly different kinds of relation between dates, such as "cause", "inhibition", "competition", etc., and propose a method whereby these relations are coherently embedded in a CPN using additional auxiliary nodes corresponding to "instrumental" variables. Also discussed, though not covered in detail, is the topic concerning how the quantitative specifications to be inserted in a temporal CPN can be learned from specific data.

Keywords: temporal reasoning, causation, causal probabilistic networks, expert systems.

1 INTRODUCTION

Recent work in Artificial Intelligence (see, for example, Pearl, 1986, 1988, Shachter, 1988, Lauritzen & Spiegelhalter, 1988, and Cooper, 1989) has emphasized the use of Causal Probabilistic Networks (CPNs), in which an attempt is made to model explicitly the qualitative knowledge concerning causal mechanisms underlying the system of interest. These models are represented by directed acyclic graphs (DAGs) in which links represent direct influences, and absent links indicate conditional independence assumptions. Attached to the nodes of a CPN there are conditional probability distributions that provide the necessary quantitative specification for the network. By virtue of these specifications, the CPN becomes

Work facilitated by MURST grants, by CNR grant no.87.01829, and by EEC grant (AIM Project 1005)

a complete and consistent probabilistic model of the system of interest, allowing case-specific inferences. Quite surprisingly, an investigation of the problems encountered when this formalism is applied to a temporal ontology has been pursued by relatively few researchers (e.g. Dean & Kanazawa, 1988; Hanks, 1988; Cooper et al., 1988). Both Cooper and Dean & Kanazawa are primarily interested in discrete representations of time. In fact the nodes of their networks correspond to random variables describing the state of the system of interest on a fixed discrete grid of time-points. In many temporal reasoning applications such a representation is clearly the most suitable. There are, though, applications in which a continuous-time representation, in which events do not commit to prespecified times, is worthy of consideration. There are several potential reasons for this. For one, if the stochastic process underlying the represented temporal structure does not have a clear Markov structure, then the temporal grid representation may become clumsy. Moreover, adapting the grid spacing to the time constants involved in the dynamics of the represented system may be far from trivial, or easily lead to unmanageably extensive networks.

In this paper we explore two different continuous-time representations. In the first (section 3), the CPN includes variables representing "event-occurrence times", possibly on different time scales, and variables representing the "state" of the system at these times. The second proposed representation (section 4) is a CPN with nodes corresponding to random variables with values in {ℜ+ ∪ ∞} that represent dates, i.e. time-instants associated with the occurrence of relevant events. The value ∞ denotes "non-occurrence". However, structuring a system of inter-related dates as a network where all links commit to a single specific notion of cause and effect is in general far from trivial. For one, the need of representing relations of synchronization between relatively independent subsystems leads to dependency structures which are not DAG-isomorph. We claim that we should recognize explicitly different kinds of relation, such as "cause", "inhibition", "competition", etc., and propose a method whereby these relations are coherently embedded in a CPN using additional auxiliary nodes corresponding to "instrumental" variables. Also briefly discussed, though not covered in detail, is the topic concerning how the quantitative specifications to be inserted in a temporal CPN can be learned from specific data.

This paper does not deal with probability propagation issues. In principle, both "exact" techniques proposed in (Pearl, 1986, 1988) and in (Lauritzen & Spiegelhalter, 1988) are applicable to the temporal CPNs proposed in this paper, perhaps after suitable discretization of continuous variables involved. Approximate techniques proposed in (Pearl, 1987) and in (Henrion, 1988) may perform without requiring discretization.

2 THE DISTRIBUTION OF TIME

The basic ingredient of our representations is time, which we view as a positive real random variable T. The probability distribution of T can be specified in various equivalent mathematical forms (for a detailed exposition see Cox & Oakes, 1984). Let

    F(t) = P(T ≤ t),   t > 0,                                      (1)

be the distribution function of T, and

    S(t) = P(T > t) = 1 − F(t),   t > 0,                           (2)

the corresponding survival function. We suppose absolute continuity of F and use the special notation

    [[t]]                                                          (3)

for the corresponding density. The corresponding intensity, denoted

    [t]                                                            (4)

is defined as [t] = [[t]] / S(t), and can be interpreted as the probability that T belongs to some small time interval between t and t + Δ, given that it has not happened before t. That is, the following is true:

    lim (Δ → 0+)  pr(t ≤ T < t + Δ | t ≤ T) / Δ  =  [t]            (5)

If T is the time to the occurrence of a given event X, the intensity [t] may be interpreted as describing the propensity of X to occur just after t, given that it did not occur previously. Densities and intensities are mutually linked by the relationship:

    [[t]] = [t] exp( − ∫₀ᵗ [κ] dκ )                                (6)

from which it follows that iff the intensity is constant, [t] = ρ say, then the density is exponential with parameter ρ. A continuous density function may contain "discrete" components of the type:

    p δ_a(t),   where δ_a(t) = 0 for all t ≠ a and ∫ δ_a(t) dt = 1 over (−∞, +∞),   (7)

corresponding to an atom of probability p at time a. δ_a(.) is called the Dirac delta (impulse) function centered at a.
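Relation (6) is easy to check numerically. The sketch below recovers the exponential density from a constant intensity; the particular intensity value and the discretization step are arbitrary choices made only for the illustration and are not taken from the paper.

```python
import math

rho = 0.8                      # constant intensity [t] = rho (arbitrary value)
dt = 1e-3                      # discretization step for the numerical integral

def density_from_intensity(intensity, t):
    """Relation (6): [[t]] = [t] * exp(-integral_0^t [k] dk), integral by the midpoint rule."""
    steps = int(t / dt)
    integral = sum(intensity((i + 0.5) * dt) for i in range(steps)) * dt
    return intensity(t) * math.exp(-integral)

for t in (0.5, 1.0, 2.0):
    numeric = density_from_intensity(lambda s: rho, t)
    exact = rho * math.exp(-rho * t)         # exponential density with parameter rho
    print(t, round(numeric, 6), round(exact, 6))
```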

3 MARKED POINT PROCESS REPRESENTATION

We consider first the representational framework of a marked point process (MPP), based on the following random variables: {(Ti, Xi), i ≥ 0}, where T0 < T1 < T2 < ... are the times at which the relevant events occur and Xi is a description of the i-th occurred event, together with the waiting times {ΔTi, i ≥ 0}, where ΔTi = Ti+1 − Ti. We may represent the variables introduced above as nodes of a CPN. In particular, the "description" Xi for the i-th occurred event can be represented by a number of variables linked together into a portion of the global network. We illustrate this through the following (simplistic) example concerning post-transplant history: either transplant or a subsequent transfusion may have caused an accidental inoculation of either virus A or virus B. The inoculated virus, after a period of incubation, overgrows causing fever. Fever, however, may also develop due to other causes. The network in Fig. 1 is an MPP-CPN representation of the above example. In fact nodes A and B provide a description of the "inoculation-event", and node C a description of the subsequent "fever-event" occurring at time T. ΔT is the time interval between the two events. Let us consider the meaning of the variables involved in more detail. Variable A (liability) represents three mutually exclusive and exhaustive possibilities:

A = transplant : inoculation occurred at transplant
A = transfusion : inoculation occurred at transfusion
A = none : inoculation never occurred.

We consider the possibility of inoculation both at transplant and at transfusion irrelevant, on the assumption that a subsequent inoculation may not alter the infection process triggered by the first one. We define the basic time origin (BTO) as the time at transplant, unless A = "transfusion", in which case the BTO is taken to be the time at transfusion. Variable B describes the patient's initial state, defined as the patient's state just after the BTO. Node C describes the target state, defined as the state into which the patient goes when he/she leaves the initial state. The time at which the target state is entered is measured on two scales: T (target time) measures it from a fixed independent origin, say the beginning of the year in which the transplant was made, while ΔT (waiting time) measures it from the BTO. The "waiting time" ΔT may be interpreted as the time spent in the initial state prior to transition into the target state, and has quite different meanings and distributions depending on the configuration of B and C. For example, if B = "A incubating ..." and C = "fever+overgrowth", then ΔT represents how long virus A incubates, while if, say, B = "A incubating" and C = "no overgrowth", then ΔT represents how long it takes until fever develops from "other" causes.

Figure 1 Representation of post-transplant history (nodes A, B, C, ΔT and T, with states such as "A incubating, no fever", "B incubating, no fever", "no virus, no fever", "fever + overgrowth", "fever, no overgrowth")

Conditional independence assumptions represented by the graph should be carefully verified. For example, from the graph ΔT and A appear to be independent given B, but they appear to become dependent once the value of T becomes known. The latter dependency follows from the fact that, given T, ΔT is calculated from the value of A by solving one of the following:

T = ΔT + transfusion date   if A = "transfusion"   (8a)
T = ΔT + transplant date    otherwise              (8b)

The arrows from B and C to ΔT are placed to mean that the values of B and C are direct determinants of the distribution of ΔT. These arrows induce the following non-independencies:

(i) B is not independent of C given ΔT: if I know the time spent in the initial state, then learning which is that initial state helps me to predict the target state;
(ii) C is not independent of ΔT given B: if I know the initial state, then learning the time spent in it helps me to estimate the target state.

To illustrate statement (i) suppose, for example, that I know that ΔT = 3 months. Then knowing that, say, B = "no virus" may still be essential to make me conclude that C = "fever+overgrowth". To illustrate statement (ii) suppose, for example, that I know that B = "virus A incubating, no fever". Then knowing that ΔT is, say, lower than the incubation time for virus A is essential to make me conclude that C = "fever, no overgrowth". Statement (ii) becomes false whenever the intensity functions for variable ΔT corresponding to the various admissible configurations of B and C are parallel. Once established, the graph structure guides us in specifying a set of "local" conditional probabilities, which taken together yield an expression for the global joint density function:

[[A, B, C, ΔT, T]] = [[A]] × [[B | A]] × [[C | B]] × [[T | A, ΔT]] × [[ΔT | B, C]]   (9)

which shows that for each node of the CPN we need to obtain an assessment of the distribution of the hosted variable conditional on its direct parents in the graph (for the root node a prior is needed). In particular: [[A]] reflects the a priori belief on which event in the patient's history (transplant or transfusion or none) should be considered liable for causing the inoculation. [[B | A]] reflects the expert opinion concerning the chances that a specific type of virus has been inoculated, given that inoculation through transplant or transfusion has occurred. [[C | B]] refers to transition probabilities from the initial to the target state, i.e., for example, how likely it is that a patient after inoculation of a given virus develops fever before overgrowth, or vice versa. [[T | A, ΔT]] is the functional form (8). [[ΔT | B, C]] is a table of distributional specifications for ΔT with entries corresponding to possible configurations of B, C; the appropriate distribution being selected by a "switch" according to the relevant configuration. Just one possibility is to model these distributions parametrically with parameters depending on B, C. For example, one might adopt a two-parameter exponential specification:

[[ΔT]] for B=i, C=j  =  λ_ij exp{ −λ_ij (Δt − a_ij) },   Δt ≥ a_ij,  λ_ij > 0

where a_ij represents a virus-specific fixed incubation delay if C = "fever+overgrowth", and is zero otherwise. Other possibilities are the lognormal and the gamma distributions, or non-parametric modelling. This CPN can be used for various types of case-specific inference. In fact, once the conditional probabilities listed above have been provided, the joint density function (9) is completely specified, and queries can be answered by computing posterior marginals from it. For example, suppose that at time T = τ we observe fever onset in a transplanted patient, and that, based on known times at which the patient was transplanted and transfused, we want to assess the probability that the fever is a result of, say, virus A having been inoculated at transplant. Probabilistically speaking, we do this by computing from (9) the posterior [[A, B | T = τ]].
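The factorization (9) also supports forward sampling of the network. The sketch below is one possible reading of the example; every prior, conditional probability, rate and incubation delay in it is a made-up illustration rather than a value from the paper, and the two-parameter exponential is used for the waiting time as suggested above.

```python
import random

def sample_case(transplant_date=0.0, transfusion_date=30.0):
    # A: which event (if any) caused the inoculation -- hypothetical prior
    A = random.choices(["transplant", "transfusion", "none"], weights=[0.2, 0.1, 0.7])[0]
    bto = transfusion_date if A == "transfusion" else transplant_date   # basic time origin

    # B: initial state just after the BTO -- hypothetical conditional table
    if A == "none":
        B = "no virus"
    else:
        B = random.choices(["A incubating", "B incubating"], weights=[0.6, 0.4])[0]

    # C: target state entered on leaving the initial state -- hypothetical transition probabilities
    p_overgrowth = {"A incubating": 0.8, "B incubating": 0.5, "no virus": 0.0}[B]
    C = "fever+overgrowth" if random.random() < p_overgrowth else "fever, no overgrowth"

    # Waiting time: two-parameter exponential, with a virus-specific delay only when the
    # target state is overgrowth (cf. the specification of [[dT | B, C]] above).
    lam = {"A incubating": 0.10, "B incubating": 0.05, "no virus": 0.02}[B]
    delay = {"A incubating": 14.0, "B incubating": 21.0, "no virus": 0.0}[B]
    a = delay if C == "fever+overgrowth" else 0.0
    dT = a + random.expovariate(lam)

    T = bto + dT        # target time on the fixed calendar scale, eq. (8)
    return A, B, C, dT, T

print(sample_case())
```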

4 NETWORKS OF "DATES" We call the date of an event (assumed to occur instantaneously) a continuous random variable representing the time instant associated with the occurrence of the event, or of a given replication of it. We assume dates defined on:

{ℜ+ ∪ {∞}}   (10)

where ℜ+ is the set of non-negative real numbers, and m=2

its order of mobility is

k-2 *m+l-*m-l m=l

and

j=l

p(f|s&a) > p(f|s)   and   p(f|s&a) > p(f|a)

which the results obtained from the above-mentioned 24 legal FTOPAs do not fit. To illustrate how this happens, evaluate p(f|s&a) in model M8,4 with idempotent elements { e ^ e s ^ ^ s }:

p(f|s&a) = p(s|f&a) * p(f|a) / p(s|a) = e2 * e6 / e4 = e6 / e4 = e6

Pay attention to the solution in the last step. The result is no larger than p(f|a) = e6, due to denominator-indifference: we do not get extra evidence accumulation. • One of the very useful results provided by [0,1] numerical probability is that although p(f|s) = 0.48 and p(f|a) = 0.37 are moderate, when both smoke and alarm are observed p(f|s&a) = 0.98 is quite high, which is more intuitive than the case above. In Table 1, fire is the only event we know of which can cause both smoke and alarm with high certainty (p(s|f) = p(a|f) = e2). Thus, observing both simultaneously, we would expect a higher probability. But the remaining 8 legal FTOPAs give only an ambiguous p(f|s&a), spanning at least half of the total probability range.

Consider the evaluation of p(f|s&a) in model M8,4 with idempotent elements {e1, e4, e7, e8}.

Notice the solution in the last step. • In the deductive case, the situation is slightly better. Some models achieve the same tendency as [0,1] probability in deduction (e.g. p(s|t) < p(a|t)). Some achieve the same tendency with increased ambiguity. Others either produce identical ranges for different probabilities or do not reflect the correct trend. The slight improvement is attributable to the smaller number of operations required in deduction (only reasoning by cases, but not Bayes theorem, is involved). Since reasoning by cases needs the solution operation, it still creates denominator-indifference and generates ambiguity. Our experiment is systematic with respect to legal FTOPAs of the particular size 8. Although a set of arbitrarily chosen priors is used in this presentation, we have tried varying them in a non-systematic way, and the outcomes were basically the same.

6 Conclusion

The investigation is motivated by the search for finite totally ordered probability models under the theory of probabilistic logic [Aleliunas, 1988], with which to automate qualitative reasoning under uncertainty and to facilitate knowledge acquisition and explanation in expert system building. Under the theory of probabilistic logic, the general form of finite totally ordered probability algebras was derived, and the number of different models was deduced so that all the possible models could be explored systematically. Two major problems of these models were analyzed: denominator-indifference and ambiguity-generation. They are manifested during the processes of applying Bayes theorem and reasoning by cases. Changes in size, model and assignment of priors do not seem to solve the problems. All the models of size 8 have been implemented in a Prolog program and tested against a simple example. The results are consistent with the analysis. The investigation reveals that under the TPL axioms, finite probability models may have limited usefulness. The premise of a legal FTOPA is {TPL axioms, finite, totally ordered}. It is believed that the TPL axioms capture the requirements of general inference under uncertainty. "Totally ordered" seems to be necessary, and is not the real culprit here. Thus it is conjectured that a useful uncertainty management mechanism cannot be realized in a finite setting.

Acknowledgements This work was supported by Operating Grants A3290 and OGP0044121 from NSERC. Y. Xiang was awarded a University Fellowship during the term of this work. The authors would like to thank R. Aleliunas for helping us gain an understanding of his TPL.

References

[Aleliunas, 1986] R. Aleliunas, "Models of reasoning based on formal deductive probability theories," unpublished draft, 1986.
[Aleliunas, 1987] R. Aleliunas, "Mathematical models of reasoning - competence models of reasoning about propositions in English and their relationship to the concept of probability," Research Report CS-87-31, Univ. of Waterloo, 1987.
[Aleliunas, 1988] R. Aleliunas, "A new normative theory of probabilistic logic," Proc. CSCSI-88, pp. 67-74, 1988.
[Burris, 1981] S. Burris and H. P. Sankappanavar, A Course in Universal Algebra, Springer-Verlag, 1981.
[Kuczkowski, 1977] J. E. Kuczkowski and J. L. Gersting, Abstract Algebra, Marcel Dekker, 1977.
[Halpern, 1987] J. Y. Halpern and M. O. Rabin, "A logic to reason about likelihood," Artificial Intelligence, 32:379-405, 1987.
[Pearl, 1988] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[Pearl, 1989] J. Pearl, "Probabilistic semantics for nonmonotonic reasoning: A survey," to appear in Proceedings, First Intl. Conf. on Principles of Knowledge Representation and Reasoning, 1989.
[Poole, 1988] D. Poole and E. Neufeld, "Sound probabilistic inference in Prolog: an executable specification of influence diagrams," I Simposium Internacional de Inteligencia Artificial, Oct. 1988.

Appendix A: Derivation of p(fire|smoke&alarm)

p(f|s&a) = p(s|f&a) * p(f|a) / p(s|a)

where p(s|f&a) = p(s|f),

p(s|a) = p(s & (f ∨ ¬f) | a), expanded by reasoning by cases over f and ¬f using the solution operation i[·],

and p(f|a) = p(a|f) * p(f) / p(a)

where p(a|f) = p(a & ((f&t) ∨ (f&¬t) ∨ (¬f&t) ∨ (¬f&¬t)) | f) and p(a) = p(a & ((f&t) ∨ (f&¬t) ∨ (¬f&t) ∨ (¬f&¬t))), both again expanded by reasoning by cases; writing I1, I2, I3, I4 for the four case terms, p(a) = i[I1 * I2 * I3 * I4].

α is a formula called the prerequisite, β is a set of formulae called the justifications, and γ is a formula called the conclusion. Informally, a default denotes the statement "if the prerequisite is true, and the justifications are consistent with what is believed, then one may infer the conclusion." Defaults are written α : β / γ. If the conclusion of a default occurs in the justifications, the default is said to be semi-normal; if the conclusion is identical to the justifications the rule is said to be normal. A default is closed if it does not have any free occurrences of variables, and a default theory is closed if all of its rules are closed. The maximally consistent sets that can follow from a default theory are called extensions. An extension can be thought of informally as one way of "filling in the gaps about the world." Formally, an extension E of a closed set of wffs T is defined as the fixed point of an operator Γ, where Γ(T) is the smallest set satisfying: W ⊆ Γ(T); Γ(T) is deductively closed; and for each default d ∈ D, if the prerequisite is in Γ(T) and the negations of the justifications are not in T, then the conclusion is in Γ(T). Since the operator Γ is not necessarily monotonic, a default theory may not have any extensions. Normal default theories do not suffer from this, however (see [Reiter, 1980]), and always have at least one extension. One popular attempt to develop a computationally tractable approach to default logic is based on inheritance networks with exceptions. A thorough presentation of this can be found in [Touretzky, 1986]. The second class of approaches to nonmonotonic reasoning involves reasoning about the knowledge or beliefs of an agent. Responding to the weaknesses of NML, Moore [Moore, 1983] developed an epistemology-based logic called autoepistemic logic. Moore's logic differs from NML in two basic ways: the fixed points cannot contain both Lf and ¬f, and each fixed point is considered to correspond to one way that the world might be, in contrast to NML, where a proposition is believed only if it is present in all fixed points. Although Moore's original formulation was restricted to propositional reasoning, autoepistemic logic has recently been generalized by Levesque [Levesque, 1987]. Modifications to autoepistemic logic were also discussed in [Konolige, 1986].

The third prominent approach is based on minimization. This approach is embodied in many attempts at dealing with negative knowledge in databases. If one assumes complete knowledge of the domain under consideration, one can represent negative information implicitly. Thus a negative fact can be assumed if its positive counterpart is not entailed by the database. The assumption of complete domain knowledge is termed the Closed World Assumption (CWA) by Reiter, and there has been a significant amount of research in this area. One of the most studied of the minimality-based approaches is circumscription, which was first proposed by McCarthy in [McCarthy, 1980]. The notion we will describe here is called predicate circumscription. This was later extended by McCarthy to formula circumscription, discussion of which we omit here. Predicate circumscription is tied to the notion of minimal models. Informally, the idea behind circumscribing a predicate P in a formula F is to produce a formula which, when conjoined with F, forces P to be true for those atomic objects for which it is forced to be true by F, and false otherwise. This conjunction is called the circumscription of P in F (written CIRC[F; P]). Formally, this is defined to be:

CIRC[F; P] =def F ∧ (∀P' ((F(P') ∧ (∀x P'(x) ⊃ P(x))) ⊃ (∀x P(x) ⊃ P'(x))))

where F(P') is the formula F with all occurrences of P replaced with P'. Although this is a second-order formula, in many cases it reduces to a first-order formula. The most prominent work attempting to develop a usable computational framework for default reasoning is that involving truth maintenance systems. Among others, this approach is exemplified by the work of Doyle [Doyle, 1979], de Kleer [deKleer, 1986], Goodwin [Goodwin, 1987] and Brown [Brown, Gaucas, & Benanav, 1987]. The work described in this paper involves propagation of uncertainty measures through a network similar to the JTMS graph described by Doyle in [Doyle, 1979].

1.4 Proposed Approach In the rest of this paper we discuss our efforts to integrate defeasible reasoning (based on nonmonotonic rules) with plausible reasoning (based on monotonic rules with partial degrees of sufficiency and necessity). In our approach, uncertainty measures are propagated through a Doyle-JTMS graph whose labels are real-valued certainty measures. Unlike other default reasoning languages that only model the incompleteness of the information, our approach uses the presence of numerical certainty values to distinguish quantitatively the different admissible labelings and pick an optimal one. The key idea is to exploit the information on the monotonic edges carrying uncertainty measures. A preference function, based on these measures together with nonmonotonic information, is used to select the extension that is maximally consistent with the constraints imposed by the monotonic edges. Thus, instead of minimizing the cardinality of abnormality types [McCarthy, 1986] or of performing temporal minimizations [Shoham, 1986], we maximize an expectation function based on the uncertainty measure. This method breaks the symmetry of the (potentially) multiple extensions in each loop by selecting a most likely extension. This idea is currently being implemented in PRIMO (Plausible Reasoning MOdule), RUM's successor.

We will illustrate our approach through an example. For this purpose we will prevail upon Tweety, the much overworked flying emu. The example consists of the following default rules:

BIRD ∧ ¬□HOPS → FLIES
EMU ∧ ¬□FLIES → HOPS
FLEMU → EMU
EMU → BIRD
FLEMU → FLIES

The first rule states that unless it can be proven that a bird hops, assume that it flies. The second says that unless it can be proven that an emu flies, assume it hops. Given that FLEMU is false and EMU is true, there are two valid extensions for this set of rules: one in which FLIES is true and HOPS is false, and one in which FLIES is false and HOPS is true. As we develop this example we will show how our approach uses quantitative uncertainty measures to facilitate the choice between these two valid extensions. The following section defines PRIMO's rule-graph semantics and constraints. Section 3 describes the generation of admissible labelings (consistent extensions) and introduces an objective function to guide the selection of preferred extensions. Section 4 discusses optimization techniques (applicable on restricted classes of graphs) and heuristics (such as graph decomposition into strongly connected components), which can be used to generate acceptable approximations to the optimal solution. The conclusion section summarizes our results and defines an agenda of possible future research work.

2 Plausible Reasoning Module The decision procedure for a logic based on real-valued truth values may be much more computationally expensive than that for a boolean-valued logic. This is because in boolean-valued logic only one proof need be found. In real-valued logic all possible proofs must be explored in order to ensure that the certainty of a proposition has been maximized. RUM (Reasoning with Uncertainty Module), the predecessor to PRIMO, was designed as a monotonic expert system shell that handles uncertainty according to triangular norm calculi.1 It deals with the possible computational explosion by allowing only propositional acyclic2 quantitative Horn clauses. To avoid the computational problems associated with first-order reasoning, RUM only allows propositional rules. Although the user may write first-order rules, they must be fully instantiated at run time. Thus a single written rule may give rise to many rules at run time, all of which are propositional. RUM restricts its rules to Horn clauses; it deals with negative antecedents by treating P and ¬P independently. We denote the certainty of P as LB(P). The only time P and ¬P will interact is when LB(P) + LB(¬P) > 1 (both P and ¬P are believed). When this occurs a conflict handler tries to detect the source of inconsistency.3

1 Triangular norm calculi represent logical and as a real-valued function called a t-norm, and logical or as an s-conorm. For an introduction to them see [Bonissone, 1987b]. A succinct presentation can be found in [Bonissone, 1990].
2 Unless an idempotent t-norm is used, cyclic rules will cause all certainties in the cycle to converge to 0.
3 Note that the above constraint on LBs implies an upper bound on LB(P) of 1 − LB(¬P). In the literature this is denoted as UB(P). LB and UB are related just as support and plausibility in Dempster-Shafer theory, or □ and ◊ in modal logics.

Because of these restrictions, a simple linear-time algorithm exists for propagating certainty values through RUM rules. Resolution of inconsistency by the conflict handler, however, may require cost exponential in some subset of the rules. PRIMO (Plausible Reasoning MOdule) is the successor to RUM designed to perform nonmonotonic reasoning. PRIMO extends RUM by allowing nonmonotonic antecedents. PRIMO also allows nonmonotonic cycles, which represent conflicts between different defaults. We provide a formal overview of PRIMO below.

Definitions: A PRIMO specification is a triple (L, I, J). L is a set of ground literals, such that whenever l ∈ L, ¬l ∈ L. For l ∈ L, LB(l) ∈ [0, 1] is the amount of evidence confirming the truth of l. J is a set of justifications. Each justification is of the form:

⋀_i ma_i ∧ ⋀_j nma_j →s c

where c is the conclusion, s ∈ [0, 1] is the sufficiency of the justification (s indicates the degree of belief in the conclusion of the justification when all the antecedents are satisfied), the ma_i ∈ L are the monotonic antecedents of the justification, and the nma_j are the nonmonotonic antecedents of the justification. The nonmonotonic antecedents are of the form ¬[α]p, where p ∈ L, with the semantics:

LB(¬[α]p) = 1 if LB(p) < α, and 0 otherwise.

(The nonmonotonic antecedent ¬[α]p can be informally interpreted as "if we fail to prove proposition p to a degree of at least α.") The input literals I ⊆ L are a distinguished set of ground literals for which a certainty may be provided by outside sources (e.g. user input), as well as by justifications. The certainty of all other literals can only be affected by justifications. A PRIMO specification can also be viewed as an AND/OR graph, with justifications mapped onto AND nodes and literals mapped onto OR nodes.

Definition: A valid PRIMO graph is a PRIMO graph that does not contain any cycles consisting of only monotonic edges.

Definition: An admissible labeling of a PRIMO graph is an assignment of real numbers in [0, 1] to the arcs and nodes that satisfies the following conditions: 1. The label of each arc leaving a justification equals the t-norm of the arcs entering the justification and the sufficiency of the justification. 2. The label of each literal is the s-conorm of the labels of the arcs entering it.

A PRIMO graph may have zero, one, or many admissible labelings. An odd loop (a cycle traversing an odd number of nonmonotonic wires) is a necessary, but not sufficient, condition for a graph not to have any solutions. Every even cyclic graph has at least two solutions. In these respects PRIMO is like the Doyle JTMS [Doyle, 1979]. Proofs can be found in [Goodwin, 1988].
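A minimal sketch of these labeling rules follows, assuming min and max as the t-norm/s-conorm pair (triangular-norm calculi admit other choices; min/max is used here only for concreteness, and the data structures are ad hoc).

```python
def nm_label(lb_p, alpha):
    """Label contributed by a nonmonotonic antecedent -[alpha]p: 1 if LB(p) < alpha, else 0."""
    return 1.0 if lb_p < alpha else 0.0

def justification_label(sufficiency, mono_ants, nonmono_ants, lb):
    """t-norm (here: min) of all antecedent labels and the sufficiency s."""
    labels = [lb[m] for m in mono_ants]
    labels += [nm_label(lb[p], alpha) for (p, alpha) in nonmono_ants]
    return min(labels + [sufficiency])

def literal_label(incoming_justification_labels):
    """s-conorm (here: max) of the labels of the arcs entering the literal."""
    return max(incoming_justification_labels) if incoming_justification_labels else 0.0

# The first Tweety justification: BIRD and -[.2]HOPS -->.8 FLIES
lb = {"BIRD": 1.0, "HOPS": 0.0}
print(justification_label(0.8, ["BIRD"], [("HOPS", 0.2)], lb))   # 0.8
```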


2.1 PRIMO Example We use our example to illustrate the above definitions. The default rules given in the earlier example can be turned into PRIMO justifications by adding sufficiencies to them. For example:

BIRD ∧ ¬[.2]HOPS →.8 FLIES
EMU ∧ ¬[.2]FLIES →.9 HOPS
FLEMU →1 EMU
EMU →1 BIRD
FLEMU →1 FLIES

In the first justification, BIRD is a monotonic antecedent and ¬[.2]HOPS is a nonmonotonic antecedent. The sufficiency of the justification is .8. The first rule states that if it can be proven with certainty of at least .2 that HOPS is true then the certainty of FLIES is 0; otherwise the certainty of FLIES is the t-norm of .8 and the certainty of BIRD being true. The input literals for this example are BIRD, EMU, and FLEMU. The PRIMO graph corresponding to the above rules is shown in Figure 1.

Figure 1: PRIMO Rule Graph

If the user specifies that LB(BIRD) = LB(EMU) = 1, and LB(FLEMU) = 0, then there are two admissible labelings of the graph. One of these labelings is shown in Figure 2. The other can easily be obtained by changing the labeling of HOPS to 0 and FLIES to .8.

3 Finding Admissible Labelings In this section we discuss an approach to propagation of constraints which is used as a preliminary step in processing a PRIMO graph before resorting to exhaustive search.

3.1 Propagation of Bounds In PRIMO, propagation of bounds on LB's can be more effective than propagation of exact values alone. It may even trigger further propagation of exact values when bounds are propagated to a nonmonotonic antecedent whose value of α falls outside of them.


Figure 2: One Admissible Labeling

Thus bounds propagation can sometimes provide an exact solution where propagation of exact values alone would not. To propagate bounds, vertices are labeled with pairs of values representing lower and upper bounds on the exact LB of that vertex in any admissible labeling. These bounds are successively narrowed as propagation continues. For each vertex v we define LB-(v) and LB+(v), the lower and upper bounds on LB(v) at any given point during the computation, to be functions of the bounds then stored on the antecedents of v. LB- uses the lower bounds of monotonic antecedents and the upper bounds of nonmonotonic ones; LB+ uses the upper bounds of monotonic and the lower bounds of nonmonotonic antecedents. The actual function applied to these values is the same one used to compute LB itself for that vertex. The algorithm is then: 1. Initialize every input node to [k, 1], where k is the confidence given by the user, i.e. "at least k." Initialize every other vertex to [0, 1]. 2. While there exists any vertex v such that the label on v is not equal to [LB-(v), LB+(v)], relabel v with that value. It can be shown that this algorithm converges in polynomial time, yields the same result regardless of the order of propagation, and never assigns bounds to a vertex that exclude any value that vertex takes on in any admissible labeling. Proofs can be found in [Goodwin, 1988].
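The following is a sketch of that propagation loop under the same min/max t-norm/s-conorm assumption used earlier. The rule encoding is ad hoc, and the three observed inputs are treated here as exactly known, which is a simplification of the [k, 1] initialization described above; the loop reproduces the final bounds of the Tweety example.

```python
def eval_literal(lit, rules, bounds, use_lower):
    """Recompute one bound of `lit` from the current bounds of its antecedents."""
    incoming = []
    for mono, nonmono, s, concl in rules:
        if concl != lit:
            continue
        vals = [s]
        for m in mono:
            vals.append(bounds[m][0] if use_lower else bounds[m][1])
        for p, alpha in nonmono:
            ref = bounds[p][1] if use_lower else bounds[p][0]   # LB- uses the upper bound of nonmonotonic antecedents
            vals.append(1.0 if ref < alpha else 0.0)
        incoming.append(min(vals))                               # t-norm
    return max(incoming) if incoming else 0.0                    # s-conorm

def propagate(rules, bounds, inputs):
    changed = True
    while changed:        # converges for this example; the text proves polynomial-time convergence in general
        changed = False
        for lit in bounds:
            if lit in inputs:
                continue
            new = (eval_literal(lit, rules, bounds, True),
                   eval_literal(lit, rules, bounds, False))
            if new != bounds[lit]:
                bounds[lit], changed = new, True
    return bounds

rules = [(["BIRD"], [("HOPS", 0.2)], 0.8, "FLIES"),
         (["EMU"],  [("FLIES", 0.2)], 0.9, "HOPS"),
         (["FLEMU"], [], 1.0, "EMU"),
         (["EMU"],  [], 1.0, "BIRD"),
         (["FLEMU"], [], 1.0, "FLIES")]
inputs = {"BIRD", "EMU", "FLEMU"}
bounds = {"BIRD": (1.0, 1.0), "EMU": (1.0, 1.0), "FLEMU": (0.0, 0.0),
          "FLIES": (0.0, 1.0), "HOPS": (0.0, 1.0)}
print(propagate(rules, bounds, inputs))   # FLIES -> (0.0, 0.8), HOPS -> (0.0, 0.9)
```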

3.2 Example In this section we illustrate, through the example, the bounds propagation algorithm. Figure 3 shows the labeling of the graph after the initialization step. Figure 4 shows the final bounds obtained. The value of .8 for LB+ of FLIES is derived by using LB- of HOPS (0), which gives a certainty of 1 for the premises of the justification for FLIES; the t-norm of this with the sufficiency of the justification yields .8.


Figure 3: Initialization

Figure 4: Final Bounds

3.3 A Labeling Algorithm for PRIMO Definitions: A nonmonotonic antecedent is satisfied if LB+ < α, exceeded if LB- > α, and ambiguous if LB- < α < LB+. A labeled graph is stable if every vertex v is labeled [LB-(v), LB+(v)] (a graph is always stable after bounds have been propagated). In a stable graph, a starter dependency is an AND-vertex which has no unlabeled monotonic antecedents, no exceeded nonmonotonic antecedents, and at least one ambiguous nonmonotonic antecedent. A starter dependency must be unlabeled, with a zero LB- and a positive LB+. Because PRIMO nets contain no monotonic loops, a starter dependency always exists (unless the entire graph is labeled exactly) and can be found in time linear in the size of the graph. Because the only inputs left undetermined are nonmonotonic antecedents (i.e., thresholds), a starter dependency must be labeled exactly LB- or LB+ in any admissible labeling which may exist [Goodwin, 1988]. One can therefore find all admissible labelings of a stable graph in time exponential in the number of starter dependencies, simply by generating each of the 2^k ways to label each of k starter dependencies in the graph with its LB- or LB+, and testing each combination


for consistency. A straightforward algorithm to do this would search the space depth-first with backtracking. Each iteration would pick a starter dependency, force it to LB- or LB+, and propagate bounds again, continuing until either a solution is produced or an inconsistency is found, and then backtrack. Inconsistencies can only occur at a starter dependency, when either (1) the starter was earlier forced to LB- (i.e., zero) and positive support for it is derived, or (2) the starter is forced to LB+ (i.e., a positive value) and the last support for it becomes relabeled zero. Practical efficiency may be greatly enhanced if the starter dependency is always chosen from a minimal strongly connected component of the unlabeled part of the graph. Below we consider more sophisticated methods for searching this space.
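The exhaustive part of this procedure can be sketched as follows; the consistency check is left as a hypothetical callback standing in for re-propagation of bounds.

```python
from itertools import product

def admissible_labelings(starters, propagate_and_check):
    """starters: list of (name, lb_minus, lb_plus).
    Enumerates the 2**k ways of forcing each starter dependency to LB- or LB+,
    keeping only those forcings that the (hypothetical) check accepts."""
    options = [((name, lo), (name, hi)) for (name, lo, hi) in starters]
    for choice in product(*options):
        forcing = dict(choice)
        if propagate_and_check(forcing):      # returns False when an inconsistency is derived
            yield forcing

# Toy use: a single starter dependency and a trivially permissive check.
for labeling in admissible_labelings([("d1", 0.0, 0.8)], lambda forcing: True):
    print(labeling)
```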

3.4 Consistent and Preferred Extensions The discussion and algorithm given above indicate that in a stable graph the problem of deciding how to resolve the ambiguous nonmonotonic wires is a boolean decision. Thus we should be able to formulate this problem in propositional logic, the satisfying assignments of which would represent the various consistent extensions of the PRIMO specification. We now present an alternative algorithm, based on propositional satisfiability, for finding consistent extensions. We also show how this algorithm can be used to find an optimal extension. In general, a set of formulae will have many extensions. Given such a set of extensions, some may be preferable to others based on the cost associated with choosing truth values for certain nodes. That is, the LB of the ambiguous antecedents will be coerced to either LB- or LB+. We will prefer extensions in which the sum of the costs associated with choosing a truth value for each proposition is minimized. More formally, let ¬[α_i]p_i be the set of nonmonotonic premises from a PRIMO rule graph which are still ambiguous after the numeric bounds have been propagated, and let IC(p_i) = (LB-(p_i) + LB+(p_i)) / 2. IC(p_i) is a measure of the current approximation of the information content in p_i. An optimal admissible labeling is an admissible labeling that minimizes the objective function:

Σ_i  | IC(p_i) − FPV(p_i) |  +  | IC(¬p_i) − FPV(¬p_i) |

FPV(p_i), the final propositional value associated with p_i, will be either 0 or 1, depending on whether p_i is ultimately coerced to LB-(p_i) or LB+(p_i), respectively. Thus the objective function is a measure of the distance of our current numerical approximation to the final value chosen, which we want to minimize. Once we have made the commitment to coercing ambiguous values to either 0 or 1, solving the problem of finding extensions reduces to propositional satisfiability. Extending the problem we consider to that of weighted satisfiability gives us a means of finding a preferred extension. Weighted satisfiability is defined formally below. Let C be a weighted CNF formula ⋀_i C_i, where each clause C_i = ⋁_j p_j has a corresponding positive weight w_i. Let P be a truth assignment of the propositional variables p_i that appear in C. The weight of P is the sum of the weights of the clauses that are made false by P. The weighted satisfiability problem is to find the minimum-weight truth assignment.

The optimal admissible labeling problem can be encoded as the weighted satisfiability problem in the following way. Convert the propositional form of the given PRIMO graph into clausal form. Assign infinite weight to each of the resulting clauses. Define a function W which maps literals into the interval [0, 2] such that for a literal p,

W(p) = IC(p) + (1 − IC(¬p)).

Next, for each ambiguous nonmonotonic premise of the form ¬[α_i]p_i, generate two clauses: 1. (p_i), with weight W(p_i); 2. (¬p_i), with weight W(¬p_i). The first clause represents the cost of making p_i false, the second the cost of making ¬p_i false (equivalently, making p_i true). A typical case is illustrated in Figure 5.

Figure 5: Relationship between bounds and weights

It is easy to see that the original graph has an admissible labeling if, and only if, there is a finitely weighted truth assignment for the corresponding instance of weighted satisfiability, and that the weighted truth assignment corresponds to minimizing the objective function given above.

3.5 Example

We now complete our example by showing how the graph in Figure 4 can be transformed into a weighted satisfiability problem which yields the optimal extension for this example. From the graph we obtained the following values:

          LB-   LB+   IC     w
FLIES     0     0.8   0.4    1.4
¬FLIES    0     0     0      0.6
HOPS      0     0.9   0.45   1.45
¬HOPS     0     0     0      0.55

Figure 6: Evaluation of Weights

The weighted clauses, obtained from the structure of the graph and from the table above, are:

FLIES ∨ HOPS       ∞
¬FLIES ∨ ¬HOPS     ∞
FLIES              1.4
¬FLIES             0.6
HOPS               1.45
¬HOPS              0.55

There are two finitely weighted truth assignments for the above set of weighted clauses. They are FLIES = True, HOPS = False with weight 2.05; and FLIES = False, HOPS = True with weight 1.95. (Remember the weight of a truth assignment is the sum of the weights of the clauses made false by the truth assignment.) Thus the optimal labeling for our example gives LB(FLIES) = 0 and LB(HOPS) = .9.

We leave it to the reader to verify that starting with LB(EMU) = .8 instead of 1 would result in the optimal labeling where LB(FLIES) = .8 and LB(HOPS) = 0.
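These numbers can be reproduced directly from the propagated bounds. The sketch below recomputes IC, the weights W, and the weight of each truth assignment; the literal names and the "~" encoding of negation are ad hoc choices made only for this illustration.

```python
from itertools import product

bounds = {"FLIES": (0.0, 0.8), "HOPS": (0.0, 0.9),
          "~FLIES": (0.0, 0.0), "~HOPS": (0.0, 0.0)}
IC = {lit: (lo + hi) / 2 for lit, (lo, hi) in bounds.items()}

def W(lit):
    neg = lit[1:] if lit.startswith("~") else "~" + lit
    return IC[lit] + (1 - IC[neg])              # W(p) = IC(p) + (1 - IC(not p))

INF = float("inf")
clauses = [({"FLIES", "HOPS"}, INF), ({"~FLIES", "~HOPS"}, INF),
           ({"FLIES"}, W("FLIES")), ({"~FLIES"}, W("~FLIES")),
           ({"HOPS"}, W("HOPS")), ({"~HOPS"}, W("~HOPS"))]

def weight(assignment):
    """Sum of the weights of the clauses the assignment falsifies."""
    total = 0.0
    for clause, w in clauses:
        satisfied = any(lit.startswith("~") != assignment[lit.lstrip("~")] for lit in clause)
        if not satisfied:
            total += w
    return total

for flies, hops in product([True, False], repeat=2):
    print(flies, hops, round(weight({"FLIES": flies, "HOPS": hops}), 2))
# The two finite weights are 2.05 for (FLIES, not HOPS) and 1.95 for (not FLIES, HOPS),
# so the preferred extension sets LB(HOPS) = .9 and LB(FLIES) = 0, as in the text.
```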

4 Algorithms and Heuristics

In Section 3.4 we showed how the problem we are concerned with can be posed as one of weighted satisfiability. Since this problem is intractable in general, we must make compromises if our system is to perform reasonably on nontrivial instances. The alternatives we consider include constraining the classes of problems we will allow (Section 4.1) or sacrificing optimality of solutions (Section 4.2).

4.1 Nonserial Dynamic Programming One of the most interesting possibilities involves restricting our attention to classes of formulae which, while still intractable, have satisfiability algorithms which theoretically take much less than O(2^n) time, where n is the number of propositional variables. In [Ravi & Hunt, 1986], Hunt and Ravi describe a method based on nonserial dynamic programming and planar separators (see [Bertele & Brioschi, 1972] and [Lipton & Tarjan, 1980], respectively) which solves the satisfiability problem in O(2^√n) time for a subclass of propositional clauses that can be mapped in a natural way to planar graphs.4 In [Fernandez-Baca, 1988] Fernandez-Baca discusses an alternative construction for planar satisfiability and an extension

4 It is shown in [Lichtenstein, 1982] that the satisfiability problem for this class is NP-complete [Garey & Johnson, 1979]. Thus the existence of a polynomial-time decision procedure is highly unlikely.

to weighted satisfiability. He also presents a similar algorithm for another interesting class of problems, where the graph corresponding to the set of clauses has bounded bandwidth. Hunt [Hunt, 1988] has shown that similar results hold for a large class of problems that have graphs with bounded channel width. Each of these is in some sense a measure of the complexity of the clausal form of the problem. If this measure is much smaller than the number of variables in the problem, weighted satisfiability can be solved relatively quickly for large instances.

4.2 Heuristics Depending on the size of the graph and the deadline imposed on the system by the outside world, time to find an optimal extension may not be available. Under these circumstances, we need to use a heuristic that, without guaranteeing an optimal solution, will find a "satisficing" solution while exhibiting reasonable performance characteristics.5 The following heuristics can be applied to the PRIMO graph, after the propagation of bounds, or to the problem encoded in terms of weighted satisfiability. As initial conditions we assume a set of nodes P, which is a subset of the original set of nodes in the graph. Each element of P has an associated pair of lower and upper bounds. We sort the elements of P such that |IC(p_i) − 0.5| ≥ |IC(p_{i+1}) − 0.5|. By sorting the elements in P based on decreasing information content, we are trying first to coerce the labeling of those nodes for which we have the strongest constraints. We can now use a variety of search strategies, such as iteratively deepening hill-climbing search or beam search, to (locally) minimize the objective function defined in Section 3.4, subject to the consistency constraints dictated by the graph topology.
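The ordering step of this heuristic is a one-liner; the IC values below are purely illustrative.

```python
# Coerce first the ambiguous nodes whose information content is farthest from 0.5,
# i.e. those for which the propagated bounds constrain the final value most strongly.
ambiguous = {"p1": 0.45, "p2": 0.05, "p3": 0.9}    # node -> IC(node), illustrative values
order = sorted(ambiguous, key=lambda p: abs(ambiguous[p] - 0.5), reverse=True)
print(order)    # ['p2', 'p3', 'p1']
```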

4.3 Strongly Connected Components Thus far we have presented our algorithms as if they were to work on the entire PRIMO rule graph. Even the heuristic presented would bog down on rule graphs of realistic size. As a result, several optimizations are essential in practice, even though they do not affect the theoretical worst case complexity. The entire initial graph can be decomposed into strongly connected components (SCCs), which are attacked one at a time (using whatever algorithm or heuristic is deemed appropriate) "bottom up." This idea was first used for JTMSs in [Goodwin, 1987]. As in the JTMS, there is no guarantee that one can avoid backtracking: a low level SCC may have several solutions, and a higher SCC dependent upon it may become unsolvable if the wrong choice is made lower down. However, this strategy seems likely to be helpful in practice.

4.4 Compile Time Options Decomposing the initial graph into SCCs is one form of preprocessing that can be done at compile time in an attempt to facilitate faster run time processing. Much more processing could be done at compile time, at potentially great savings at run time. This section briefly summarizes various options we are exploring for dividing the task of generating optimal (good) admissible labelings between run time and compile time components.

5 As with any other heuristic, there is no guarantee that its worst-case performance can improve on that of an exhaustive search.

• Precompute all admissible labelings at compile time. At run time eliminate those labelings that are not currently valid, and choose the optimal labeling from those remaining.

• Precompute a subset of the admissible labelings at compile time. At run time eliminate those labelings that are not currently valid, and choose the optimal labeling from those remaining. If all the precomputed labelings have been eliminated, additional labelings must be generated.

• Precompute one default admissible labeling optimized according to static uncertainty and utility information. If this labeling is no longer valid at run time, additional labelings must be generated.

• Precompile the graph into some canonical form that will allow easier generation of admissible labelings at run time. Two possible such forms are prime implicants (similar to the ATMS [Reiter & deKleer, 1987]) and Gröbner bases [Kapur & Narendran, 1987]. At run time the compiled form can be used to generate an admissible labeling.

• At compile time check for planarity of the graph or, if not planar, determine the bandwidth (if practical). At run time use a dynamic programming algorithm to find an admissible labeling.

Although all the above options may result in great savings at run time, many of them share the problem of potentially incurring exponential time and space costs in the worst case. At present, we are beginning to experiment with these options in an attempt to determine which will work best in practice.

5 Conclusions

We have presented an approach that integrates nonmonotonic reasoning with the use of quantitative information as a criterion for model preference. This methodology represents a major departure from existing paradigms, which normally fail to account for one or the other. We have also identified several methods for coping with the inherent intractability involved in such reasoning. We feel that this is a promising approach, but this work is at a preliminary stage. As a result, there are a number of questions that we are considering now. We list some of them below.

• We have previously noted that there is some correspondence between the PRIMO rule graph and that of the JTMS. Their exact relationship (if indeed one exists) is not well understood and needs to be explored.

• The dynamic programming algorithms discussed in Section 4.1 may help us to deal with large problem instances under certain structural constraints on the allowed propositional formulae. The results discussed, however, are based on asymptotic bounds. We have begun to implement these algorithms, but we do not know at this point whether they will perform satisfactorily in practice. We also need to determine how well the heuristics we have described will perform.

• It may be advantageous to preprocess the graph prior to run time. For instance, breaking up the graph into SCCs may also allow us to do some precomputation at compile time. In addition to generating the SCCs, it might be possible to transform them into canonical forms, which would yield more efficient run-time algorithms.

References [Bertele & Brioschi, 1972] Umberto Bertele and Francesco Brioschi. Nonserial Dynamic Programming, Academic Press, New York, first edition, 1972. [Blair & Brown, 1988] Howard A. Blair, Jr. Allen L. Brown, and V.S. Subrahmanian. A logic programming semantics scheme, part i. Technical Report LPRG-TR-88-8, Syracuse University School of Computer and Information Science, 1988. [Bonissone, 1987a] Piero P. Bonissone. Plausible Reasoning: Coping with Uncertainty in Expert Systems. In Stuart Shapiro, editor, Encyclopedia of Artificial Intelligence, pages 854-863. John Wiley and Sons Co., New York, 1987. [Bonissone, 1987b] Piero P. Bonissone. Summarizing and Propagating Uncertain Information with Triangular Norms. International Journal of Approximate Reasoning, 1(1):71-101, January 1987. [Bonissone, 1990] Piero P. Bonissone. Now that I Have a Good Theory of Uncertainty, What Else Do I Need? In this volume. [Bonissone & Brown, 1986] Piero P. Bonissone and Allen L. Brown. Expanding the Horizons of Expert Systems. In Thomas Bernold, editor, Expert Systems and Knowledge Engineering, pages 267-288. North-Holland, Amsterdam, 1986. [Bonissone & Decker, 1986] Piero P. Bonissone and Keith S. Decker. Selecting Uncertainty Calculi and Granularity: An Experiment in Trading-off Precision and Complexity. In L. Kanal and J. Lemmer, editors, Uncertainty in Artificial Intelligence, pages 217247. North-Holland, Amsterdam, 1986. [Bonissone, Gans, & Decker, 1987] Piero P. Bonissone, Stephen Gans, and Keith S. Decker. RUM: A Layered Architecture for Reasoning with Uncertainty. In Proceedings of the 10th International Joint Conference on Artificial Intelligence, pages 891-898. AAAI, August 1987. [Bonissone & Wood, 1988] Piero P. Bonissone and Nancy C Wood. Plausible Reasoning in Dynamic Classification Problems. In Proceedings of the Validation and Testing of Knowledge-Based Systems Workshop. AAAI, August 1988. [Brown, Gaucas, & Benanav, 1987] Allen L. Brown Jr., Dale E. Gaucas, and Dan Benanav. An Algebraic Foundation for Truth maintenance. In Proceedings 10th International Joint Conference on Artificial Intelligence, pages 973-980. AAAI, August 1987. [Cohen & Grinberg, 1983] P.R. Cohen and M.R. Grinberg. A Framework for Heuristics Reasoning about Uncertainty. In Proceedings Eight International Joint Conference on Artificial Intelligence, pages 355-357. AAAI, August 1983.

[Cohen, 1985] P. Cohen. Heuristic Reasoning about Uncertainty: An Artificial Intelligence Approach. Pittman, Boston, Massachusetts, 1985. [D'Ambrosio, 1988] Bruce D'Ambrosio. A Hybrid Approach to Reasoning Under Uncertainty. International Journal of Approximate Reasoning, 2(1):29-45, January 1988. [deKleer, 1986] J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28:127-162, 1986. [Dempster, 1967] A.P. Dempster. Upper and lower probabilities induced by a multivalued mapping. Annals of Mathematical Statistics, 38:325-339, 1967. [Doyle, 1979] J. Doyle. A Truth-Maintenance System. Journal of Artificial Intelligence, 12:231-272, 1979. [Doyle, 1983] J. Doyle. Methodological Simplicity in Expert System Construction: The Case of Judgements and Reasoned Assumptions. The AI Magazine, 4(2):39-43, 1983. [Duda, Hart, & Nilsson, 1976] R.O. Duda, P.E. Hart, and N.J. Nilsson. Subjective Bayesian Methods for Rule-Based Inference Systems. In Proc. AFIPS 45, pages 1075-1082, New York, 1976. AFIPS Press. [Etherington, 1988] David W. Etherington. Reasoning with Incomplete Information. Morgan Kaufmann, San Mateo, CA, first edition, 1988. [Fernandez-Baca, 1988] David Fernandez-Baca. Nonserial Dynamic Programming Formulations of Satisfiability. Information Processing Letters, 27:323-326, 1988. [Garey & Johnson, 1979] M.R. Garey and D.S. Johnson. Computers and Intractability. W.H. Freeman, New York, 1979.

[Ginsberg, 1987] M.L. Ginsberg, editor. Readings in Nonmonotonic Reasoning. Morgan Kaufmann Publishers, Los Altos, California, 1987. [Goodwin, 1987] James W. Goodwin. A Theory and System for Non-monotonic Reasoning. PhD thesis, Linkoping University, 1987. [Goodwin, 1988] James W. Goodwin. RUM-inations, 1988. GE-AIP Working Paper. [Halpern & Moses, 1986] Halpern J.Y. and Y. Moses. A Guide to Modal Logics of Knowledge and Belief. In Proceedings of the 5th National Conference on Artificial Intelligence, pages 4480-490. AAAI, 1986. [Hunt, 1988] H. B. Hunt III. personal communication. [Kapur & Narendran, 1987] D. Kapur and P. Narendra. An equational approach to theorem proving in first-order predicate calculus. In IJCAI-87, pages 1146-1153. AAAI, August 1987. [Konolige, 1986] K. Konolige, editor. A Deduction Model of Belief. Morgan Kaufmann Publishers, Los Altos, California, 1986.

84 [Levesque, 1987] H. J. Levesque. All I Know: An Abridged Report. In Proceedings American Association for Artificial Intelligence, pages 426-431, 1987. [Lichstenstein, 1982] D. Lichtenstein. Planar formulae and their uses. SIAM J. Comput., ll(2):329-343, 1982. [Lipton & Tarjan, 1980] R. Lipton and R.E. Tarjan. Applications of a planar separator theorem. SIAMJ. Compute 9(3):615-627, 1980. [Lowrance, Garvey, & Strat, 1986] J.D. Lowrance, T.D. Garvey, and T.M. Strat. A Framework for Evidential-Reasoning Systems. In Proc. National Conference on Artificial Intelligence, pages 896-903, Menlo Park, California, 1986. AAAI. [McCarthy, 1977] J. McCarthy. Epistemological Problems of Artificial Intelligence. In Proceedings Fifth International Joint Conference on Artificial Intelligence, pages 1038— 1044, 1977. [McCarthy, 1980] J. McCarthy. Circumscription: A Non-Monotonic Inference Rule. Artificial Intelligence, 13:27-40, 1980. [McCarthy, 1986] J. McCarthy. Applications of Circumscription to Formalizing Commonsense Knowledge. Artificial Intelligence, 28:89-166, 1986. [McDermott, 1982] D.V. McDermott. Non-monotonic Logic II. Journal ofACM, 29:33-57, 1982. [McDermott & Doyle, 1980] D.V. McDermott and J. Doyle. Non-monotonic Logic I. Artificial Intelligence, 13:41-72, 1980. [Minsky, 1975] M. Minsky. A Framework for Representing Knowledge. In P. Winston, editor, The Psychology of Computer Vision, pages 211-277. McGraw-Hill, New York, 1975. [Moore, 1983] R. Moore. Semantical Considerations on Nonmonotonic Logic. In Proceedings Eigth International Joint Conference on Artificial Intelligence, pages 272-279, 1983. [Pearl, 1985] J. Pearl. How to Do with Probabilities What People Say You Can't. In Proceedings Second Conference on Artificial Intelligence Applications, pages 1-12. IEEE, December 1985. [Pearl, 1988] Judea Pearl. Evidential Reasoning Under Uncertainty. In Howard E. Shrobe, editor, Exploring Artificial Intelligence, pages 381-418. Morgan Kaufmann, San Mateo, CA, 1988. [Quinlan, 1983] J.R. Quinlan. Consistency and Plausible Reasoning. In Proceedings Eight International Joint Conference on Artificial Intelligence, pages 137-144. AAAI, August 1983. [Ravi & Hunt, 1986] S.S. Ravi and H.B. Hunt III. Applications of a Planar Separator Theorem to Counting Problems. Technical Report 86-19, Suny at Albany, Computer Science Dept., August 1986.

85 [Reiter, 1980] R. Reiter. A Logic for Default Reasoning. Artificial Intelligence, 13:81-132, 1980. [Reiter, 1988] Raymond Reiter. Nonmonotonic Reasoning. In Howard E. Shrobe, editor, Exploring Artificial Intelligence, pages 419-481. Morgan Kaufmann, San Mateo, CA, 1988. [Reiter & deKleer, 1987] R. Reiter and J. de Kleer. Foundations of Assumption-Based Truth Maintenance Systems: Preliminary Report. In Proceedings Sixth National Conference on Artificial Intelligence, pages 183-188. AAAI, July 1987. [Shafer, 1976] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, New Jersey, 1976. [Shoham, 1986] Yoav Shoham. Chronological Ignorance: Time, Nonmonotonicity, Necessity and Causal Theories. In Proceedings of the 5th National Conference on Artificial Intelligence, pages 389-393. AAAI, August 1986. [Shortliffe & Buchanan, 1975] E.H. Shortliffe and B. Buchanan. A Model of Inexact Reasoning in Medicine. Mathematical Biosciences, 23:351-379, 1975. [Touretzky, 1986] D. Touretzky. The Mathematics of Inheritance Systems. Pitman, London, 1986. [Zadeh, 1978] L.A. Zadeh. Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets and Systems, 1:3-28, 1978. [Zadeh, 1979a] L.A. Zadeh. A Theory of Approximate Reasoning. In P. Hayes, D. Michie, and L.I. Mikulich, editors, Machine Intelligence, pages 149-194. Halstead Press, New York, 1979. [Zadeh, 1979b] L.A. Zadeh. Fuzzy Sets and Information Granularity. In M.M. Gupta, R.K. Ragade, and R.R. Yager, editors, Advances in Fuzzy Set Theory and Applications, pages 3-18. Elsevier Science Publishers, 1979. [Zadeh, 1983] L.A. Zadeh. Linguistic Variables, Approximate Reasoning, and Dispositions. Medical Informatics, 8:173-186, 1983.



Deciding Consistency of Databases Containing Defeasible and Strict Information* Moisés Goldszmidt and Judea Pearl Cognitive Systems Lab. Computer Science Department University of California, Los Angeles Los Angeles, CA 90024 U.S.A.

Abstract We propose a norm of consistency for a mixed set of defeasible and strict sentences, based on a probabilistic semantics. This norm establishes a clear distinction between knowledge bases depicting exceptions and those containing outright contradictions. We then define a notion of entailment based also on probabilistic considerations and provide a characterization of the relation between consistency and entailment. We derive necessary and sufficient conditions for consistency, and provide a simple decision procedure for testing consistency and deciding whether a sentence is entailed by a database. Finally, it is shown that if all sentences are Horn clauses, consistency and entailment can be tested in polynomial time.

1 Introduction

There is a sharp difference between exceptions and outright contradictions. Two statements like "typically, penguins do not fly" and "red penguins can fly" can be accepted as a description of a world in which redness defines an abnormal type of penguin.

*This work was supported in part by National Science Foundation grant #IRI-88-21444 and Naval Research Laboratory grant #N00014-89-J-2007.

However, the statements "typically, birds fly" and "typically, birds do not fly" stand in outright contradiction to each other (unless birds are non-existent). Whatever interpretation we give to "typically", it is hard to imagine a world containing birds in which both statements can hold simultaneously. Yet, in spite of this clear distinction, there is hardly any comprehensive treatment of inconsistencies in existing proposals for non-monotonic reasoning. Delgrande (see [Delgrande 88]) points out that one of the previous sentences must be invalid. Lehmann and Magidor (see [Lehmann & Magidor 88]) conclude that the formula "birds" is inconsistent, but regard the two sentences as precluding the existence of birds. Surprisingly, such a conflicting pair of sentences can perfectly coexist in a circumscriptive ([McCarthy 86]) or a default ([Reiter 80]) formalism. Consider a database Δ containing the following sentences: "all birds fly", "typically, penguins are birds" and "typically, penguins don't fly". A circumscriptive theory consisting of the sentences in Δ plus the fact that Tweety is a penguin will render the conclusion that either Tweety is a flying penguin (and therefore is an exception to the rule "typically, penguins don't fly"), or Tweety is an exception to the rule "typically, penguins are birds" and Tweety does not fly. A formalization of the database in terms of a default theory will render similar conclusions for our penguin Tweety. Nevertheless, the above set of rules strikes our intuition as being inherently wrong: if all birds fly, there cannot be a nonempty class of objects (penguins) that are "typically birds" and yet "typically, don't fly". We cannot accept this database as merely depicting exceptions between classes of individuals; it appears to be more of a programming "bug" than a genuine description of some state of affairs. However, if we now change the first sentence of Δ from strict to defeasible (to read "typically, birds fly" instead of "all birds fly"), we are willing to cope with the apparent contradiction by considering the set of penguins as exceptional birds. This interpretation will remain satisfactory even if we made the second rule strict (to read "all penguins are birds"). Yet, if we further add to Δ the sentence "typically, birds are penguins", we are faced again with an intuitive inconsistency. This paper deals with the problem of formalizing, detecting and isolating such inconsistencies in knowledge bases containing both defeasible and strict information.1 We will interpret a defeasible sentence such as "typically, if φ then ψ" (written φ → ψ) as the conditional probability P(ψ|φ) > 1 − ε, where ε > 0.2 A strict sentence such as "if φ it must be the case that σ" (written φ ⇒ σ) will be interpreted as the conditional probability P(σ|φ) = 1. Our criterion for testing inconsistency translates to that of determining if there exists a probability distribution P that satisfies all these conditional probabilities for all ε > 0. Furthermore, to match our intuition that conditional sentences do not refer to empty classes, nor are they confirmed by merely "falsifying" their antecedents, we also require that P be proper, i.e., that it does not render any antecedent as totally impossible. We shall show that these two requirements properly capture our intuition regarding the consistency of conditional sentences.

1 The consistency of systems with only defeasible sentences is analyzed in [Adams 75] and [Pearl 87].
2 Intuitively we would like defeasible sentences to be interpreted as conditional probabilities with very high likelihood and ε to be an infinitesimal quantity. For more on probabilistic semantics for default reasoning the reader is referred to [Pearl 88].

We also define a notion of entailment in which plausible conclusions are guaranteed arbitrarily high probabilities in all proper probability assignments in which the defeasible premises have arbitrarily high probabilities and in which the strict premises have probabilities equal to one. A characterization of the relation between entailment and consistency is shown through the theorems of section 3. The paper is organized as follows: section 2 introduces notation and some preliminary definitions. Consistency and entailment are explored in section 3. An effective procedure for testing consistency and entailment is presented in section 4. Section 5 contains illustrative examples, and in section 6 we summarize the main results of the paper. Proofs can be found in [Goldszmidt & Pearl 90].

2 Notation and Preliminary Definitions

We will use ordinary letters from the alphabet (except d, s and x) as propositional variables. Let T be a language built up in the usual way from a finite set of propositional variables and the connectives "¬" and "∨" (the other connectives will be used as syntactic abbreviations), and let the Greek letters φ, ψ, σ stand for formulas of T. Let φ and ψ be two formulas in T. We will use a new binary connective "→" to construct a defeasible sentence φ → ψ, which may be interpreted as "if φ then typically ψ". The set of defeasible sentences will be denoted by D. Similarly, given φ. Finally, the material counterpart of a conditional sentence with antecedent φ and consequent ψ is defined as the formula φ ⊃ ψ (where "⊃" denotes material implication). Given a factual language T, a truth assignment for T is a function t, mapping the sentences in T to the set {1, 0} (1 for True and 0 for False), such that t respects the usual boolean connectives.5 A sentence x ∈ X with antecedent φ and consequent ψ will be verified by t if t(φ) = t(ψ) = 1. If t(φ) = 1 but t(ψ) = 0, the sentence x will be falsified by t. Finally, when t(φ) = 0, x will be considered as neither verified nor falsified.

3 In the domain of non-monotonie multiple inheritance networks, the interpretation for the defeasible sentence φ — ► φ would be "typically 0 there exists S > 0 such that for all probability assignments which are proper for df, P{d') > 1 — ε.

P G Vxts

Theorem 2 relates the notions of entailment and consistency: T h e o r e m 2 If X is p-consistent, inconsistent

X p-entails d! if and only if X U {~d'}

is substantively

Definition 7 and Theorem 3 below characterize the conditions under which conclusions are guaranteed not only very high likelihood but absolute certainty. We call this form of entailment strict p-entailment: Definition 7 (Strict p-entailment.) If X is p-consistent, (written X \=8 s') if:

then X strictly p-entails s'

1. There exists a non-empty set of probability distributions which are proper for X U {s'} and 2. For all ε > 0 such that for all probability assignments P £ Vxyt which are proper for X and s', we have P(s') = 1. T h e o r e m 3 If X = D U S is p-consistent, X strictly p-entails φ => φ if and only if there exists a subset S' of S such that S'U {φ —»· True] is p-consistent and φ => -*ψ is not tolerated byS'. Note that strict p-entailment subsumes p-entailment, i.e., if a conditional sentence is strictly-entailed then it is also p-entailed. Also, to test whether a conditional sentence is strictly p-entailed we need to check its status only with respect to the strict set in X. This

93 confirms the intuition that we can not deduce "hard" rules from "soft" ones. However, strict p-entailment is different than logical entailment because the requirements of substantive consistency and properness for the probability distributions distinguishes strict sentences from their material counterpart. For example, consider the database X = S = {True =Φ- - 6, since the antecedent a is always falsified. For completeness, we now present two more theorems relating consistency and entailment. Similar versions of these theorems, for the case of purely defeasible sentences, first appeared in [Adams 75]. They follow from previous theorems and definitions. T h e o r e m 4 If X does not p-entail d' and X U {d'} is suhstantively inconsistent, then for all ε > 0 there exists a probability assignment P € Vx%t which is proper for X and d' such that P{d') 6 ("typically, penguins are birds") 3. p —> - i / ("typically, penguins don't fly") Clearly none of the defeasible sentences in the example can be tolerated by the rest. If for example t(p) = t(b) = 1 (testing whether sentence (2) is tolerated), the assignment t(f) = 1 will falsify sentence (3), while the assignment t(f) = 0 will falsify sentence (1). A similar situation arises when we check if sentence (3) can be tolerated. Note that changing sentence (1) to be defeasible, renders the database consistent: 6 —► / is tolerated by sentences (2) and (3) through the truth assignment t(b) = t(f) = 1 and t(p) = 0, while the remaining sentences tolerate each other. If we further add to this modified database the sentence p A b —► / , we get an inconsistent set, thus showing (by Theorem 2) that p A b —► ->/ is pentailed, as expected ("typically penguins-birds don't fly"). The set will become inconsistent again by adding the sentence b —> p ("typically, birds are penguins"), in conformity the graphical criteria of [Pearl 87]. Example 2 On quakers and republicans. Consider the following set of sentences: 1. n —► r ("typically, Nixonites 7 are republicans") 2. n —> q ("typically, Nixonites are quakers") 3. q => p ("all quakers are pacifists") 4. r => -ip ("all republicans are non-pacifists") 5. p —► c ("typically, pacifists are persecuted")

6

The terms consistency and p-consistency will be used interchangeably. "Nixonites" are members of R. Nixon's family.

7

95 Sentence (5) is tolerated by all others, but the remaining sentences (l)-(4) are not conformable. Thus this set of sentences is inconsistent. Note that in this case Theorem 1 and the procedure outlined in the previous section not only provide a criteria to decide whether a database of defeasible and strict information is inconsistent, but also identify the offending set of sentences. We can modify the above set of sentences to be: 1. n => r ("all Nixonites are republicans") 2. n =>> q ("all Nixonites are quakers") 3· -ip is p-entailed in conformity with the intuition expressed in [Horty & Thomason 88].

96

6

Conclusions

The probabilistic interpretation of conditional sentences yields a consistency criterion which produces the expected results in examples were our intuition is strong. A tight relation between entailment and consistency was established and an effective procedure for testing both consistency and entailment was devised. One unique feature of our system is that conditional sentences are entailed in a natural manner, different from logical entailment, capturing the intuition that such sentences do not apply to impossible premises. For example, UI am poor", does not entail "If I were rich I would have a higher I.Q.". Thus, the special semantics presented here, avoids some of the classical paradoxes of material implication [Anderson h Belnap 75] and, hence, it brings mechanical and plausible reasoning closer together. Although our definition of p-entailment yields a rather conservative set of conclusions (e.g., one that does not permit chaining or contraposition), it constitutes a core of plausible consequences that should be common to every reasonable system of defeasible reasoning [Pearl 89]. Indeed, the notion of p-entailment was shown to be equivalent to that of preferential entailment ([Lehmann & Magidor 88]), whenever the sentences in X are purely defeasible and the underlying language is finite. Consequently, the decision procedure of Section 4 should also apply to preferential entailment. More powerful extensions of p-entailment are developed in [Goldszmidt h Pearl 90].

Acknowledgments Many of the proofs, techniques and notation are extensions of those presented in [Adams 75].

References [Adams 66]

Adams, E., Probability and The Logic of Conditionals, in Aspects of Inductive Logic, Hintikka J., and Suppes, P. eds. 1966, Amsterdam: North Holland.

[Adams 75]

Adams, E., The Logic of ConditionaL·, chapter II, Dordrecht, Netherlands: D. Reidel.

[Anderson & Belnap 75] Anderson, A. and N. Belnap, Entailment: The Logic of Relevance and Necessity, Vol. 1, Princeton University Press, Princeton N.J. 1975. [Delgrande 88]

Delgrande, J., An Approach to Default Reasoning Based on a Firs-Order Conditional Logic: Revised Report, Artificial Intelligence Journal, Volume 36, Number 1, pp. 63-90, August 1988.

[Dowling h Gallier 84] Dowling, W. and J. Gallier, Linear-Time Algorithms for Testing the Satisfiability of Propositional Horn Formulae, Journal of Logic Programming, 3:267-284, 1984.

97 [Goldszmidt & Pearl 90] Goldszmidt, M. and J. Pearl, On The Consistency of Defeasible Databases, technical report TR-122, Cognitive Systems Lab., UCLA, 1989, revised 1990, submitted to Artificial Intelligence Journal. [Goldszmidt, Morris & Pearl 90] Goldszmidt, M., P. Morris, and J. Pearl, The Maximum Entropy of Nonmonotonic Reasoning, to appear in Proceedings AAAI90, Boston 1990. [Horty h Thomason 88] Horty, J. F. and R. H. Thomason, Mixing Strict and Defeasible Inheritance, in Proceedings of AAAI-88, St. Paul, Minnesota. [Lehmann h Magidor 88] Lehmann, D. and M. Magidor, Rational Logics and their Models: A Study in Cumulative Logics, TR-8816 Dept. of Computer Science, Hebrew Univ., Jerusalem, Israel. [McCarthy 86]

McCarthy, J., Applications of Circumscription to Formalizing Commonsense Knowledge, Artificial Intelligence Journal, 13:27-39.

[Pearl 89]

Pearl, J., Probabilistic Semantics for Nonmonotonic Reasoning: A Survey, in Proceedings of the First Intl. Conf. on Principles of Knowledge Representation and Reasoning, Toronto, Canada, May 1989, pp. 505516.

[Pearl 88]

Pearl, J., Probabilistic Reasoning in Intelligent Systems: NetworL· of Plausible Inference, chapter 10, Morgan Kaufmann Publishers Inc.

[Pearl 87]

Pearl, J., Deciding Consistency in Inheritance Networks, Tech. Rep. (R-96) Cognitive Systems Lab., UCLA.

[Reiter 80]

Reiter, R., A Logic for Default Reasoning, Artificial Intelligence Journal, 13:81-132.

Uncertainty in Artificial Intelligence 5 M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer (Editors) © Elsevier Science Publishers B.V. (North-Holland), 1990

99

DEFEASIBLE DECISIONS: WHAT THE PROPOSAL IS AND ISN'T

R. P. Loui Department of Computer Science and Department of Philosophy Washington University St. Louis, MO 63130

In two recent papers, I have proposed a description of decision analysis that differs from the Bayesian picture painted by Savage, Jeffrey and other classic authors. Response to this view has been either overly enthusiastic or unduly pessimistic. In this paper I try to place the idea in its proper place, which must be somewhere in between. Looking at decision analysis as defeasible reasoning produces a framework in which planning and decision theory can be integrated, but work on the details has barely begun. It also produces a framework in which the meta-decision regress can be stopped in a reasonable way, but it does not allow us to ignore meta-level decisions. The heuristics for producing arguments that I have presented are only supposed to be suggestive; but they are not open to the egregious errors about which some have worried. And though the idea is familiar to those who have studied heuristic search, it is somewhat richer because the control of dialectic is more interesting than the deepening of search. 1

WHAT THE PROPOSAL IS

"Defeasible Spécification of Utilities" [13] and "Two Heuristic Functions for Decision" [14] proposed that decision analysis could profitably be conceived to be defeasible reasoning. Analyzing a decision in one decision tree or model is an argument for doing a particular act. The result of analysis with a different tree of model is another argument. That there can be multiple arguments suggests that there can be better arguments and lesser arguments. Thus, arguments for decisions must be defeasible. Response to this idea has been mixed, but often immoderate. This paper attempts to temper the reaction by saying what the proposal is and isn't.

100 1.1

Like Qualitative, Defeasible, Practical Reasoning

The proposed defeasible reasoning about decisions is the natural extension of philosophers' defeasible practical reasoning about action. The difference is that our arguments for actions are quantitative, often invoking expected utility calculations. In practical reasoning, reasoning about action is qualitative. If an act achieves a goal, that's a reason for performing that act. If an act achieves a goal but also invokes a penalty, and that penalty is more undesirable than the goal is desirable, that may be reason not to perform the act. Eveyone assumes that practical reasoning is defeasible in this way: that is, an argument for an action can be defeated by taking more into account in its deliberation. But this had never been formalized as defeasible reasoning, to my knowledge, because formalisms for defeasible reasoning are relatively new. Now that we have such formalisms, we can write the general schemata for practical reasoning quite simply: (a)(d). \a ACHIEVES dAd TDO al,

IS-DESi >—

(a)(d). ra ACHIEVES dAd r-.(DO a)l,

IS-UNDESi >—

where " >— " is a relation between sentences that corresponds roughly to our intuitive relation "is a reason for"; (x) is a meta-language quantifier. Axioms that govern such a relation are described in [11,12] and [22]. They are similar to axioms given by other authors ([5,6], [15], [1]). Reasons can be composed to form arguments. So prior to considering interference among contemplated actions, we might produce the argument: tt

— >—

"DO αλ A DO a 2 " >— "DO αλ L· a 2 ". But there may be other arguments that disagree with this argument, such as " α ι L· a2 ACHIEVES d3 A d3 IS-UNDES" "-.(DO ax h a2)'\

>—

In fact, we should be able to write our reasons in such a way that preference among arguments can be achieved with the specificity defeaters in defeasible inference, i.e., those rules that tell us to prefer one argument over another if it uses more information. Suppose I am reasoning about whether to rent an Alfa, though it incurs a big expense. An argument for renting an Alfa is based on the following reason:

101 "rent-the-Alfa ACHIEVES drove-Alfa A drove-Alfa IS-DES" "DO rent-the-Alfa".

>—

A different argument which comes into conflict is based on the reason: "rent-the-Alfa ACHIEVES incurred-big-expense A incurred-big-expense IS-UNDES" >— tt -i(DO rent-the-Alfaf. There is no reason to choose among these arguments, so they interfere and neither justifies its conclusion. Suppose further that taking into account the desirability of driving the Alfa and the undesirability of incurring big expense, I judge the combination to be undesirable. "drove-Alfa IS-DES Λ incurred-big-expense IS-UNDES" >— "(drove-Alfa L· incurred-big-expense) IS-UNDES". Then there is a third argument, based on the combined reasons: "drove-Alfa IS-DES Λ incurred-big-expense IS-UNDES" >— "(drove-Alfa h incurred-big-expense) IS-UNDES" "rent-the-Alfa ACHIEVES (drove-Alfa & incurred-big-expense) A (drove-Alfa L· incurred-big-expense) IS-UNDES" >— "n(DO rent-the-Alfa)". This argument disagrees with the first argument, which was in favor of renting the Alfa. But it takes into account all of the information that the first argument takes into account, and it does so in a way that cannot be counter-argued. So it is a superior argument; it defeats the first argument. All of this reasoning about action is defeasible. There may be other arguments, based on what else we notice that renting the Alfa achieves, and what we may know about their desirability in various contexts. As more consequences of action are inferred, more arguments can be presented. Eventually, defeat relations among those arguments are proved. At any time, based on the pattern of defeat relations among presented arguments, there is either an undefeated justification for taking a particular action, or there are interfering arguments whose conflict has not been resolved. In the latter case, we might fall back on our un-tutored inclination (e.^., to rent the Alfa). Sometimes we act for reasons; sometimes we act for very good reasons; sometimes we do not have the luxury of having unanimous reasons, or any reasons at all. Of course, this qualitative practical reasoning is a very weak way of analyzing tradeoffs. It does not take into account known risks of actions, that is, known probabilities of acts achieving various effects.

102 1.2

But Quantitative and Risk-Sensitive

What I have proposed is a quantitative version of this defeasible reasoning about decisions. An act achieves an effect with known probability, and we have independent reasons for the utilities of each of the resulting states. By weighing these independent utilities by their respective probabilities, we produce an argument for the utility of the act. With different independent reasons for the utilities of resulting states, we get different arguments. With different accounting of the possible results of an act, again, we produce different arguments. If we are clever, reasons can be written in an existing formalism for defeasible reasoning in such a way that those arguments that justify their conclusions are exactly those arguments that we would consider compelling among the multitude of potentially conflicting arguments. Suppose I consider the possibility that my department will reimburse me for renting the Alfa, and calculate its probability to be 0.4. Based on expense and access to the Alfa, I assess the utilities of the various resulting states and calculate an expected utility for renting the Alfa. If it is greater than the utility of renting the econo-car, it represents an argument for renting the Alfa. u(dept-pays; rent-the-Alfa 10 utils

BASED-ON expense; whether-drove-Alfa)

=

u(-idept-pays; rent-the-Alfa — 1 utils

BASED-ON expense] whether-drove-Alfa)

=

EXPECTED u(rent-the-Alfa BASED-ON expense; whether-dept-pays) = 3.4 utils

whether-drove-Alfa;

u(rent-econo-car) = 2 utils therefore, defeasibly,

rent-the-Alfa.

But if instead I consider expense, access to the Alfa, and the dissatisfaction of my department chairman, in the assessment of utilities, then I produce a different argument. u(dept-pays; rent-the-Alfa BASED-ON expense; how-chairman-reacts) = 8 utils u(-ydept-pays; rent-the-Alfa BASED-ON expense; how-chairman-reacts) = —4 utils

whether-drove-Alfa;

whether-drove-Alfa;

Expected u(rent-the-Alfa BASED-ON expense; whether-drove-Alfa; how-chairman-reacts; whether-dept-pays) = 0.8 utils u(rent-econo-car) = 2 utils

103 therefore, defeasibly, rent-econo-car. As before, there may be other arguments, based on other contingencies to be analyzed (e.g., whether I can fool the accounting secretary, whether it rains, etc.) and other factors that affect the independent reasons for valuing various states of the world (e.g., how my colleagues react, how my friends react, etc.). Normally, in decision analysis we require all such reasons to be taken into account in advance. For computational and foundational reasons, this is not so here. One way this reasoning can be formalized is with the following axiom schemata, which presume as default that there is a linear-additive structure to utility when exceptions are not known. Properties, such as P and Q, make basic contributions to the utility of a state in which they are known to hold. If the contribution of P is x and the contribution of Q is y, that provides a reason for taking the contribution of the conjunction to be the sum of x and y. Ax.l. (x)(y)(P)(Q). ïcontr(P) = x L· contr(Q) = y^ >— \contr(P & Q ) = x + yl. This is defeated if we know independently the contribution of P L· Q to be something other than the sum of the individual contributions. The function "contr" maps properties to utility contributions. Any information about this mapping, together with the knowledge that property P holds in state s, provides a reason for taking the utility of s to be the contribution of P . If P L· Q is known to hold in s, then taking u(s) to be the contribution of P L· Q will result in a better argument for what is the utility of s. Ax.2. (P)(s). ΓΓ(Ρ, 5)1 >— Tif(s) = contr(P^. Here, T ( P , s) says that P holds in s. Finally, if event E is known to have probability k in state s, then this provides a reason for taking the utility of s to be the weighted sum of the children's utilities. Ax.3. (E)(s)(k) . \T(prob(E) = ifc, 5)1 |— Γΐζ(θ) = u()k + ti()(l - ]fe)l. 1.3

Better Detailed Than Some Have Thought

Ax.l has been criticized on two counts: first, for permitting a utility pump (anonymous referee), and second, for assuming independence of contributions([24]). The first criticism is wrong if we add the obvious requirement that Ax.i. (P)(Q)

|— Γ(Ρ = Q)D contr(P) = contr(Q)l.

104 Then contr(Pi) = 10 does not provide reason for contr(Pi & Ρχ) = 20; the axioms governing the construction of arguments require that they be consistent. The second criticism is right, but empty: indeed, we assume independence of contributions. But that is a defeasible assumption, and when it is an incorrect assumption, we expect that it is made known as an explicit exception. The properties does-smoke and has-cancer typically co-occur. The individual contribution of does-smoke in a state might be —20, and the individual contribution of has-cancer in a state might be —50, but the joint contribution of does-smoke and has-cancer might be an exception to additivity, —60. This exception must be stated explicitly. Figures 1 through 5 show some arguments for utility valuations, and interations. In each case, the utility of state s is at issue. Here I am assuming Simari's system [22]· Figure 1 shows the most basic argument for the utility of s based on the contribution of P , which holds in s. The conclusion, u(s) = 5, is based on the theory below it. It rides over a bold horizontal line whenever it is justified. The theory consists of a set of defeasible rules, which are depicted as connected digraphs (the convention here is that arrows always go up). Sources must be given as evidence. Contingent sentences used in these graphs are underlined and are important in determining specificity of various arguments, sentences below the vertical line are setences given as evidence that are used to produce the conclusion, but not to activate any defeasible rules. Figure 2 shows a similar argument, where two defeasible rules are used instead of one. Multiple in-directed edges represent conjunction. Figure 3 shows two arguments that conflict. When there is defeat of one argument by another, the defeating argument has a large arrowhead. The defeating argument here is the most basic argument that uses the expected utility axiom (Ax. 3). Defeat is clear to see here because

T(P

L·R,E\s)L·T(PL·R,E\s) Γ ( Ρ , s).

|—

Figure 4 shows conflict among two arguments that determine the contributions of properties through defeasible arguments. Figure 5 shows an argument that uses both expected utility and defeasible reasoning about the contributions of properties. Since the contingent sentences are those underlined, specificity holds. Note that if the sentences concerning contr had been regarded as contingent, there would not have been specificity. 1.4

One Way To Integrate Planning and Decision Theory

The original motivation for the use of non-monotonie language was to find a concise specification of utility functions mapping descriptions of the world into the reals, when descriptions are collections of sentences in a first order predicate logical language. There is no way for planning research to exploit the existing ideas for decision-making under known risk if there is no practical way to represent the relative desirability of descriptions of the world.

105

«(») = 5

u(s) = eontr(P)

T(P,s)

contr(P) = 5

Figure 1.

u(s) = 12

> «(«) =

contr(PUQ)

Γ(Ρ. — ΏΟ(α,ι) This is an argument to do a\ based on a comparison with a2. Argument 2: T(X, αχ) >— u(ai) = contr(X) T(Z, a3) >— u(a3) = contr(Z) contr(X) = 3 contr(Y) = 5 u(ai) < u(a3) >— -.(DO(ai)). This is an argument that disagrees with the first argument. Argument 3: T(Z L· V, a3) >— u(a3) = contr(Z h V) contr(Z &,V) = 2. This argument counterargues argument 2 at the point where argument 2 contends that u(a3) = 5. It does not augment argument 1 in any way, but it defeats argument 2, thus reinstating argument 1. The third argument could conceivably be integrated into the first argument to produce a more comprehensive argument: an argument that defeats the second argument by itself. But that would be somewhat complex and is not required in the presence of the second argument's rebuttal. Why build into an argument a defense to every possible objection, when each objection can be rebutted as it arises? I do not think there are game tree search situations that correspond to this state of defeasible deliberation on decision. 2.2

Excluding Meta-Level Analysis

Prominent work on limited rationality is being done on decision-theoretic metareasoning (esp. [8], [20], [4]). This proposal seems to conflict with their approaches because it does not require meta-reasoning. But it does not preclude meta-reasoning, and sometimes such reasoning is useful. I do not have notation for representing nor axiom schemata for generating reasons and arguments of the following kind, but conceivably they could be produced. The first have to do with meta-reasoning that controls the attention at the object level: (α,ι > a2 in Mi) Λ (Mi is not worth expanding) is reason to do a\ now

113 or

(ai y a2 in Mi) A (do-now(ai) y expand(Mi) ) do-now(a,i).

>—

These would be reasons that say that there is no net perceived value of expanding the model. As the cited authors have pointed out, the preference to expand a model may be based on an expected utility computation. I would add that computation of such a utility will be defeasible: u(do-now(a\)) = 15 Λ u(expand(Mi))

= 10 |— do-now(ai) y

T(intended-act-succeeds, do-now(ai)) Λ contr (intended-act-succeeds) = 15 u(do- now(ai)) = 15 T(prob(find-better-act)

= .3, expand(Mi)),

expand(Mi)

>— etc.

As an aside, my discussions have not been about arguments for acting "now" as oppsed to acting "later." I have presumed that time simply expires, leaving an apparent best act at the moment. To produce arguments for action "now" would seem to re-open Hamlet's problem: given such an argument to act "now," do we take the time to seek a counter-argument? The second kind of meta-reasoning has to do with arbitrating among disagreeing arguments at the object level, when no defeat relation is known to hold: (a\ y Ö2 in Argl Λ a2 y ai in Arg2 Λ Argl is based on short-term-considerations Λ Arg2 is based on long-term-considerations) >— αχ >- a? So I do not see meta-reasoning as incompatible with this proposal. We have so little experience with mechanizing simple arguments at the object level, however, that the focus of attention remains there. 2.3

One Particular Heuristic Function

This proposal does not live or die with the multi-attribute suggestion for representing utility concisely. A second heuristic is exhibited in [14] based on [21]. In order to have utility expectations, all that is needed is some representation of utility on sentential descriptions of the world. Practical necessity demands that there be some regularity that can be exploited for compact representation. In order to achieve defeasibility in our deliberation about decisions, all that are needed are independent reasons for valuing a state based on lists of properties that can be proved to hold in that state. Those lists of properties are not complete, and reasons for valuations based on incomplete properties need not bear relation to valuations based on probabilities of the omitted properties. In fact, I expect that heuristics for utility will vary from individual to individual, and will depend on application.

114 2.4

Terrible Computation

A valid concern [16] is that I am substituting something whose effective computation is well understood (heuristic search) with something whose effective computation has yet to be achieved (a variety of non-monotonie reasoning). This is true to the extent that deliberation on decision is just heuristic search, and special cases of defeasible reasoning do not yield to special purpose, effective inference procedures. If the only dialectic envisioned is the succession of arguments based on successively deepened trees, all of which defeat their predecessors, then this defeasible reasoning is analogous to heuristic search. And it can be implemented without much ado. But defeasible reasoning about decision can be more interesting than that. Until we discover patterns of dialectic for decision that lead to special algorithms, we are stuck with the general framework for defeasible reasoning. This situation is not so bad: dialectic in defeasible reasoning has good prospects for being controlled reasonably well under resource limitation. 2.5

Necessarily Quantitative

Reasons need not be quantitative. Consider d\ is better than the usual risk is reason to do a\ or a.\ achieves my aspirations in this context is reason to do ax. Qualitative reasons make especially good sense at the meta-level: (the difference between a\ and a2 is small) and (a\ is robust) is reason to do a x. A reason suggested by Doyle as a tie-breaker is: can't choose between a\ and a2 is reason to do a\. Again, I have no schemata for generating these kinds of reasons, and no way to weigh arguments based on these reasons against arguments based on quantitative considerations. But I believe that a full theory of deliberation would include them, or be able to reduce them to quantitative reasons. 2.6

Complete

Finally, it should be admitted clearly that this proposal is not complete. The integration of planning techniques, the exploration of meta-reasoning and qualitative reasons, the production of reasons for acting now, control of search and dialectic, and experience with particular heuristics are all things to be done. All we can do at present is produce expected utility arguments at various levels of detail. We do, however, have a PROLOG-based implementation of the underlying defeasible reasoning system [23] and see no major obstacle in using the schemata Ax.l - Ax.3 with some help unifying terms within functions.

115 3

AN OPEN CONVERSATION WITH RAIFFA

There is a device through which to take the measure of this proposal's break from Bayesian tradition, and at the same time to see the inutitiveness of what is being proposed. Consider the following hypothetical conversation with the great decision theorist, Howard RaifFa. I phone him at his Harvard office to solicit his best decision analysis under resource limitation. Ron: I'm at the San Francisco airport. I have this decision problem whether to rent an Alfa. Can you help? RaifFa: Sure. I have this theory, you know. What are all the relevant distinctions among states? All the effects of events? All the available courses of action? Ron: You want me to list them all? I don't have time! Am I paying for this phone call? RaifFa: Yes, I see. Hmm. Ok. Confine your attention to the important ones. Ron: How important? At this point, there are two good responses. RaifFa 1: Well, let's make a model of the expected utility of omitting various considerations. RaifFa 2: Well, let's just start and see what comes to mind and refine the model later. The Bayesians want to think that the first answer is the only legitimate one. Meanwhile, it is the second answer that makes sense to us. What is the logic of decision analysis based on this second answer? This is the question that I have been attempting to answer. 4

REFERENCES

[1] Delgrande, J. "An approach to default reasoning based on a first-order conditional logic," Proceedings of AAAI, 1987. [2] D'Ambrosio, B. and Fehling, M. "Resource-bounded agents in an uncertain world," Proceedings of AAAI Spring Symposium, 1989. [3] Edwards, W. "Episodic decision analysis and dynamic decision theory," Proceedings of AAAI Spring Symposium, 1989. [4] Etzioni, O. "Tractable decision-analytic control," Proceedings of Principles of Knowledge Representation and Reasoning, Morgan-Kaufman 1989. [5] GefFner, H. "On the logic of defaults," Proceedings of AAAI, 1988.

116 [6] Geffner, H. "A framework for reasoning with defaults," in Knowledge Representation and Defeasible Reasoning, H. Kyburg, R. Loui, and G. Carlson eds., Kluwer 1989 (in press). [7] Hansson, O. and Mayer, A. "Decision-theoretic control of search in BPS," Proceedings of A A AI Spring Symposium, 1989. [8] Horvitz, E. "Reasoning under varying and uncertain resource constraints," Proceedings of AAAI, 1988. [9] Jeffrey, R. Logic of Decision, Princeton, 1965. [10] Kyburg, H. "Subjective probability: criticisms, reflections, and problems," Journal of Philosophical Logic 7, 1978. [11] Loui, R. "Defeat among arguments," Computational Intelligence 3, 1987. [12] Loui, R. "Defeat among arguments II," Washington U. Computer Science WUCS89-06, 1988. [13] Loui, R. "Defeasible specification of utilities," in Knowledge Representation and Defeasible Reasoning, H. Kyburg, R. Loui, and G. Carlson eds., Kluwer 1989 (in press). [14] Loui, R. "Two heuristic functions for decision," Proceedings of AAAI Spring Symposium 1989. [15] Nute, D. "Defeasible logic and the frame problem," in Knowledge Representation and Defeasible Reasoning, H. Kyburg, R. Loui, and G. Carlson eds., Kluwer 1989 (in press). [16] Pearl, J. Personal communication. 1988. [17] Pylyshyn, Z. The Robot's Dilemma: The Frame Problem in Artificial Intelligence, Ablex, 1987. [18] Savage, L. Foundations of Statistics, Dover, 1950. [19] Raiffa, H. Decision Analysis, Addison-Wesley, 1968. [20] Russell, S. and Wefald, E. "Principles of metareasoning," Proceedings of Principles of Knowledge Representation and Reasoning, Morgan-Kaufman, 1989. [21] Schubert, L. Personal communication. 1988. [22] Simari, G. "On the logic of defeasible reasoning," Washington U. Computer Science WUCS-89-12, 1989. [23] Simari, G. "A justification finder (user's manual)," Washington U. Computer Science WUCS-89-24, 1989. [24] Thomason, R. Personal communication. 1989.

Uncertainty in Artificial Intelligence 5 M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer (Editors) © Elsevier Science Publishers B.V. (North-Holland), 1990

117

CONDITIONING ON DISJUNCTIVE KNOWLEDGE: SIMPSON'S P A R A D O X IN DEFAULT LOGIC Eric Neufeldf* and J.D. HortonJ** fDepartment of Computational Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada, S7N 0W0 JSchool of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada E3B 5A3

Many writers have observed that default logics appear to contain the "lottery paradox" of probability theory. This arises when a default "proof by contradiction" lets us conclude that a typical X is not a Y where Y is an unusual subclass of X. We show that there is a similar problem with default "proof by cases" and construct a setting where we might draw a different conclusion knowing a disjunction than we would knowing any particular disjunct. Though Reiter's original formalism is capable of representing this distinction, other approaches are not. To represent and reason about this case, default logicians must specify how a "typical" individual is selected. The problem is closely related to Simpson's paradox of probability theory. If we accept a simple probabilistic account of defaults based on the notion that one proposition may favour or increase belief in another, the "multiple extension problem" for both conjunctive and disjunctive knowledge vanishes.

1. I N T R O D U C T I O N The idea that intelligence, artificial or otherwise, involves the ability to "jump" to "default" conclusions is an attractive one; if true, it would explain a lot of intelligent activity without the need for numeric probability distributions. The classic example is the "birds fly" problem. Given that some individual tweety is a bird, we "jump" to the conclusion she flies. When we later discover she is an em«, we retract that defeasible conclusion and decide instead that she doesn't fly. ♦Research supported by NSERC grants OGP0041937, EQP0041938 and OGP0099045. **Research supported by NSERC grant OGP0005376.

118 Reiter's follows:

(1980) formalism can represent this in two different ways. One way is as

The first default is read as follows: if bird is true, and it is consistent to assume fly, then infer fly. We say this representation is in prerequisite form, since every default has a prerequisite, following the terminology of Etherington (1987) With the aid of "theory comparators" such as "specificity" (Poole, 1985) or "inferential distance" (Touretzky, 1984), it is possible to conclude that if Polly is an emu, then Polly doesn't fly. If we only know Polly is a bird, then we conclude that Polly can fly, but we can conclude nothing else about Polly. We can also represent the knowledge as follows:

This is in consequent form. This representation is closely related to the system of Poole and his colleagues (Poole, Goebel L· Aleliunas, 1987; Poole, 1988) and it lets us consistently show that if we have observed a bird, that bird is not an emu using the contrapositive form of the second default. This doesn't seem unreasonable; we can give a default "proof by contradiction" that birds are (typically) not emus: if birds were emus, then birds wouldn't (typically) fly. But birds do (typically) fly; a contradiction. (Technically, we assume both consequent form defaults to derive ->emu from bird). Although no one argues that emus aren't rare, it has been observed that this leads to a questionable side effect sometimes called the "dingo paradox" (Neufeld, Poole L· Aleliunas, 1989), a variant of the lottery paradox of probability theory (Kyburg, 1988). Kyburg (1988) states that the nonmonotonic logic formalisms contain the lottery paradox; this is stated from the perspective of default logic in (Poole, 1989). In this paper, we question the idea of default "proof by cases". Intuition suggests that if a typically implies c and b typically implies c, then a V 6 typically implies c. Suppose, however, we give to "a typically implies 6" the probabilistic interpretation p(b\a) > p(b). This is the weakest probabilistic property we believe a default ought to have, whatever else a default may mean. We will say a favours b when this is true, following (Chung, 1942). Wellman (1987) pursues the same idea as "qualitative influence" in the realms of planning and diagnosis. It then becomes simple to construct a counterexample to the notion of default "proof by cases" and multiple extensions

119 arise for both conjunctive and disjunctive knowledge. We conclude with a discussion of the implications. 2. DOES A N E M U OR OSTRICH R U N ? Poole (1988) poses the following. Suppose emus (typically) run and ostriches (typically) run. We can write this in prerequisite form as ostrich : M run run emu : M run run Poole observes that if we know only that Polly is an emu or an ostrich, but do not know which, we cannot conclude that Polly runs. We simulate his system by rewriting the defaults in consequent form: : Mostrich —► run ostrich —> run : M emu —> run emu —► run which allows a default "proof by cases" of run given emu V ostrich. The idea that this should be the case appears also in Delgrande's (1987) logic NP which contains the axiom schema (a => c Λ b => c) D (a V b =* c) where =>· is Delgrande's "variable conditional" operator. Similarly, Geffner's system has the inference rule

(1988)

If Γ, # ' | ~ H and Γ, H" |~ H, then Γ, H' V H" ^ H, where |~ is Geffner's provability operator. Poole (1988) argues that prerequisite form gives "unintuitive" results. We agree with his particular example, but argue that the different representations point to a variation of the "multiple extension problem" (Hanks & McDermott, 1986). It is a premise of nonmonotonic logic that it should be possible to make a different inference from a Λ b than from either conjunct. We ask the same question about disjunctive knowledge: do we ever want to draw a different conclusion from a V b than from either disjunct? Certainly we can write down such a set of defaults; can we provide a semantic account for doing so? In the next section, we describe a probabilistically motivated counterexample to this intuition. If our motivation is correct, we believe at least one of the following must be true: 1. Reiter's formalism does not give "unintuitive" results, but rather, a default reasoner must know when it is making inferences on the basis of knowing only a disjunction rather than knowing one of the disjuncts.

120 2. Those developing default logics must provide a more rigorous account of what is meant by "typically". 3. The proper formalism for reasoning under uncertainty, even when numeric probability distributions are unavailable, is standard probability theory. 3. ARTS S T U D E N T S A N D SCIENCE S T U D E N T S Let Ci mean that an individual is a student in a class i\ let a mean the student is an arts student and let s mean the student is a science student. Ignoring the issue of prerequisite form, consider a set of defaults of the form C{ : Ms

s that intuitively means a student in class i is (typically) a science (s) student. Consider the following scenario. Suppose there are only two classes under consideration and both have three science students and two arts students. Because they are core courses, they have the same science students but different arts students. Finally, assume there is one more science student in the domain. Suppose we interpret this default to mean "favours", i.e., p(s|c t ) > p(s). The reader can easily verify that if such inequalities are a partial account of defaults, there is also a need to condition differently on the disjunction than on the disjuncts. (This turns out to be true if we interpret defaults to mean "most" (Bacchus, 1989)). There are some straightforward arguments against this. 3.1. "WE CHOOSE A S T U D E N T FROM THE CLASS FIRST" A reasonable argument is this: if we enter any class ct·, the typical student will be a science student. This argument does not allow us to represent the different ways we might select a typical student. If we enter ci, but don't know which class we are in, that class favours the conclusion that the typical student is a science student. But this is not the only way we might meet someone in one of those classes: if the students of the Ci have banded together to complain that the courses were too technical (for example), the typical member of such a group is an arts student, even though we know only that the student is a member of the disjunction of the classes. This is the heart of the problem: how is the "typical" student in c\ V c2 selected? Do we want to know whether favours(s,C\) V favours(s,C2) or favours(s, C\ V c2) is true? Note that Poole's "running emus" is a special case where conditioning on either disjunct yields the same probabilistic answers as conditioning on the disjunction. This is straightforward to prove since emu and ostrich are mutually exclusive: Proposition 1. Let a and b be mutually exclusive and separately favoured by c. Then a V 6 is favoured by c. Proof: From the premises, p{ab) — p(ab\c) = 0. From the disjunction rule p(a V b\c) = p(a\c) + p(b\c)

121 and ρ(α V 6) =ρ(α)+/>(&)· Both quantities on the right hand side of the first equality are greater than the respective quantities on the right hand side of the second and the desired inequality follows. D

3.2. "THE PROBABILITY IS CLOSE TO 1/2 A N D IS UNININTERESTING" We have been told that the probabilities involved are too close to 1/2 to be interesting. It is easy in the conjunctive case to construct sets a, 6 and c so that for arbitrary probability values Vi, v2 in the open unit interval, p(c\a) = p(c\b) = υλ and p(c\ab) = v2. To achieve a similar result for the disjunctive case, we need only create enough disjuncts. Returning to the "arts and science" example, suppose we want p(s|c t ) to be at least υλ and p(s\ V^=1 c t ) to be at most v2 with 0 < v2 < vx < 1. Assume there are k science students in every one of n classes, and there is one arts student in each class, and no arts strudent is in two of the c t . Choose k > vi/(l — vi) and n > k(l — v2)/v2 and we obtain the desired result. This means we that for any interpretation of defaults as high probabilities 1 — €, we can create a counterexample. 3.3. "THIS E X A M P L E IS CONTRIVED" It is just a matter of time before someone comes up with a better one: there were only eight years between the 1980 "nonmonotonic logic" special issue of the AI journal to the discovery of the lottery paradox in nonmonotonic logic by Kyburg (1988). The next section describes actual instances of the paradox. Pearl (1989) argues that these counterexamples are unimportant in most domains. This may be true, but we argue for testable and sound formalisms that eliminate unwanted inferences even if it means that certain apparently desirable inferences are lost. 4. DISCUSSION OF THE P A R A D O X We believe that this variation of the "multiple extension problem" is the appearance of Simpson's (1951) paradox of probability theory, (which some think should be attributed to Yule (1903)), which may be stated in a number of ways. Commonly (and perhaps most surprisingly) it is the situation that happens when the truth of c is known, whether true or false, then b makes a more probable, but if the truth of c is unknown then b makes a less probable. Formally, it is the fact that there is a consistent assignment of probability values so that for propositions a, b and c p{a\bc) > p(a\c) p(a\b->c) > p(a|->c) p(a\b) < p(a). Pearl (1989) describes the paradox as involving a "hypothetical" test of the effective-

122 ness of a drug on a population of males and females where the numbers are "contrived" so that the drug seems to have a positive effect on the population as a whole but an adverse effect on males and an adverse effect on females. We reply by pointing out that this situation occurs in real life; Wagner (1982) gives several examples. A commonly cited example from Cohen and Nagel (1934) appeared in 1934 in an actual comparison of tuberculosis deaths in New York City and Richmond, Virginia. Wagner gives several other examples: possibly the best known, though not complete, recent instance of the paradox was a study of sex bias in graduate admissions at UCB (Bickel, Hammel L· O'Connell, 1975) where data appeared to show admissions by college are fair though campus wide data indicated women needed higher marks to gain admission. (This occurs if most women apply to the most competitive colleges. Thus where Pearl might argue that the paradox is pedagogical and that it is reasonable to adopt "default proof by cases" as a "canon of plausible reasoning" , we argue that we should use statistics to improve on commonsense and improve our intuitions. Blyth (1973) states that these inequalities are closely related to the facts shown by Chung (1942) that for propositions a, b and c there is a consistent assignment of probabilities such that either p(a\b) > p(a) p(a\c) > p(a) p(a\bc) < p{a) or p(a\b) > p(a) p(a\c) > p(a) p(a\bV c) p(a) and p(a\c) > p(a). Multiplying by the prior of the antecedent and dividing by the prior of the consequent yields p(b\a) > p(b) and p(c|a) > p(c). From the disjunction rule p(bc\a) + p{b V c\a) = p(b\a) + p(c\a) > p(b)+p(c) = p(bc)+p(b\/c). Suppose a does not favour either be or bWc. By a similar manipulation, p(bc\a) < p(bc) and p(b V c\a) < p(b V c), and p(bc\a) + p(b V c\a) < p(bc) + p(b V c), a contradiction. □ In these examples, a may be conditioned on nine different antecedents: empty antecedent, 6, -16, c, -ic, 6c, δ-ic, -i6c, ->b->c. An ordering on these conditional probabilities must be constrained by the following:

123 1. for any a, b and c, p(a\bc) > p(a\c) > p(a\->bc) (the direction of t h e inequality may be reversed), 2. for any a, 6 and c, p(a\c) + p(b\c) = p(a6|c) + p(a V b\c), 3. if ρ(α|6) φ p(a) and p(a|c) ^ p(a), then some combination of outcomes of 6 and c must increase belief in a and some combination must decrease belief in a. T h e looseness of t h e constraints suggests t h a t there are m a n y ways to order t h e probabilities. Given so m a n y orderings, we should b e surprised if there were no surprises. 6.

CONCLUSIONS

Default logic and its variants were proposed as solutions to t h e problem of reasoning in uncertain domains when numeric probability distributions are unavailable. Few would disagree t h a t such formalisms are becoming awkward even for small problems. We show elsewhere t h a t a system based on probability and t h e ideas of favouring and of conditional independence seems to yield t h e expected answers to most of t h e problems in t h e nonmonotonic literature. See (Neufeld, Poole h Aleliunas, 1989) for details. T h e most i m p o r t a n t point relevant to this discussion is t h a t such a system does not in general favour a given b V c even though both disjuncts favour a. These results can be interpreted in a number of different ways: 1. More t h a n monotonicity is a problem for logic when reasoning in uncertain domains; t h e multiple extension problem must b e discussed for disjunctive knowledge. T h e question of how the "typical" individual in a V b is chosen must b e answered. This means t h a t t h e meaning of "typical" must b e specified and it will be interesting to see if this can be done without introducing a notion of randomness from probability theory. 2. Either it is sensible or not to draw different conclusions from disjunctions t h a n from disjuncts. This obviously varies from one domain to another, b u t t h e n a t u r e of this variation must be m a d e precise. Perhaps t h e differences between default logics correspond to statistical properties of different domains. 3. Probability theory tells us t h a t we are taking a chance of being wrong even when t h e odds are in our favour. T h e "arts and science" example shows us t h a t some default representations will tell us to "jump" to a conclusion when t h e odds are against us at the outset. We sum up with a quotation from Koopman (1940) on t h e same "The distinction between an asserted disjunction and a disjoined mental: ( U V Ü ) = 1 must never be confused with (u = 1) V (v = of this distinction has led to more difficulties in t h e foundations of often imagined."

foundational issue: assertion is funda1). T h e disregard probability t h a n is

124 ACKNOWLEDGEMENTS This work resulted in p a r t from years of discussion with David Poole, Romas Aleliunas, and other members of t h e Logic Programming and Artificial Intelligence Group at Waterloo. T h a n k s for Dr. W. Knight at t h e University of New Brunswick for references to Simpson's paradox. Thanks also to J i m Greer a t the University of Saskatchewan for discussions and spotting typos. During t h e refereeing process, an anonymous referee pointed out t h a t Ron Loui (1988) makes some similar observations. He also observes t h a t Simpson's paradox (proof by cases in default logic) is at odds with "specificity". We have shown here t h a t the problem is more widespread than thought if knowledge is represented in consequent form. We have also related the problem to several probabilistic formalisms and have suggested t h a t t h e method by which an individual is selected is at t h e heart of t h e problem.

REFERENCES Bacchus, F . (1989) . A modest, but semantically well-founded, inheritance reasoner. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 1104-1109. Bickel, P., Hammel, E., and O'Connell, J. (1975) . Sex bias in graduate admissions: D a t a from berkley. Science, 187:398-404. Blyth, C. (1973) . Simpson's paradox and mutually favourable events. Journal American Statistical Association, 68:746. Chung, K.-L. (1942). On mutually favorable events. Annals of Mathematical 13:338-349. Cohen, M. R. and Nagel, E. (1934) . An Introduction Harcourt, Brace and World, Inc., New York.

to Logic and Scientific

of the

Statistics,

Method.

Delgrande, J. P. (1987) . A first order logic for prototypical properties. Intelligence, 33:105-130.

Artificial

Etherington, D. (1987) . Formalizing nonmonotonic reasoning systems. Intelligence, 31:41-85.

Artificial

GefFner, H. (1988) . A logic for defaults. In Proceedings Conference on Artificial Intelligence, pages 449-454.

National

of the Seventh

Hanks, S. and McDermott, D. (1986) . Default reasoning, nonmonotonic logic and the frame problem. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 328-333. Koopman, B. (1940) . T h e bases of probability. Society, 46:763-774.

Bulletins

of the American

Math.

125 Kyburg, Jr., H. E. (1988) . Probabilistic inference and non-monotonie inference. In Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 229-236. Loui, R. P. (1988) . Theory and computation thesis, University of Rochester.

of uncertain

inference and decision.

PhD

Neufeld, E., Poole, D., and Aleliunas, R. (1989) . Probabilistic semantics and defaults, to appear, Volume 4 of Uncertainty in Artificial Intelligence, Shachter, Levitt, Lemmer and Kanal eds. Pearl, J. (1989) . Probabilistic semantics for nonmonotonic reasoning: a survey. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 505-516. Poole, D. (1989) . W h a t t h e lottery paradox tells us about default reasoning. In Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, pages 333-340. Poole, D. L. (1985) . O n t h e comparison of theories: Preferring t h e most specific explanation. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, pages 144-147. Poole, D. L. (1988) . A logical framework for default reasoning. Artificial 36:27-48.

Intelligence,

Poole, D. L., Goebel, R., and Aleliunas, R. (1987) . Theorist: a logical reasoning system for defaults and diagnosis. In Cercone, N. and McCalla, G., editors, The Knowledge Frontier: Essays in the Representation of Knowledge. Springer-Verlag, New York. Reiter, R. (1980) . A logic for default reasoning. Artificial

Intelligence,

13:81-132.

Simpson, E. (1951) . T h e interpretation of interaction in contingency tables. of the Royal Statistical Society B, 13:238-241.

Journal

Touretzky, D. S. (1984) . Implicit orderings of defaults in inheritance systems. Proceedings AAAI-SJ^, pages 322-325. Wagner, C. (1982) . Simpson's paradox in real life. The American 48.

Statistician,

In

36:46-

Wellman, M. P. (1987) . Probabilistic semantics for qualitative influences. In Proceedings of the Sixth National Conference of Artificial Intelligence, pages 660-664. Yule, G. U. (1903) . Notes on t h e theory of association of attributes in statistics. Biometrika, 2:121-134.

Uncertainty in Artificial Intelligence 5 M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer (Editors) © Elsevier Science Publishers B.V. (North-Holland), 1990

129

An Introduction to Algorithms for Inference in Belief Nets Max Henrion Carnegie Mellon University and, Rockwell Palo Alto Laboratory, 444 High St, Palo Alto, Ca 94301, USA. Abstract As belief nets are applied to represent larger and more complex knowledge bases, the development of more efficient inference algorithms is becoming increasingly urgent. A brief survey of different approaches is presented to provide a framework for understanding the following papers in this section. 1. Introduction Over the last few years the appeal of influence diagrams (Howard & Matheson, 1984) and belief nets (Pearl, 1986) for representing decision problems and uncertain knowledge has become increasingly apparent. These simple directed graphs provide both a principled formalism and a natural notation for people to encode and communicate uncertain beliefs. Belief nets also provide a basis for probabilistic inference, to calculate the changes in probabilistic belief as new evidence is obtained. An influence diagram, which is essentially a belief net with the addition of decision variable(s) and a value node, further supports calculation of the decisions or strategies that maximize expected value (or more generally utility). As these representations are increasingly applied to larger bodies of knowledge, the computational efficiency of inference algorithms that operate on them has become an issue of greater urgency. Accordingly there has been a proliferation of research seeking to develop new and more efficient approaches. Since the pace of developments has been so rapid and some of techniques may seem arcane, this article attempts to give a brief introduction to the area and to provide a framework to understand the differences and relationships between the various approaches. I shall try to outline the main classes of approach, but will not pretend the coverage is exhaustive. 2. Qualitative, real, and interval-valued belief representations Influence arcs in a belief network represent the existence of probabilistic dependence between variables. (More strictly, the absence of an arc between variables indicates their independence.) Thus the directed graph itself represents purely qualitative relationships. Typically these relationships are also quantified as probability distributions for each variable conditional on its predecessors (parents). Since it may sometimes be hard to obtain a complete point-valued probability distribution, researchers have also explored a variety of more general

130 (i.e. weaker) spécifications, both qualitative and numerical. Even the pure directed graph unadorned by numbers or anything else provides a basis for some potentially useful kinds of purely qualitative inference about dependence or relevance. Geiger, Verma and Pearl (this volume) show how it is possible to calculate all the conditional independencies implied by a belief network, in time linear in the number of edges. A more specific, but still purely qualitative, representation is the qualitative probabilistic network (QPN) (Wellman, 1988). This spécifies the signs of probabilistic influences and synergies among variables and provides the basis for qualitative inferences about the directions of effects on belief and decisions of evidence. Directed networks have also be used as a framework for representing Dempster-Shafer belief functions to facilitate inference with this scheme (Shenoy, Shafer, and Mellouli, 1988; Dempster and Kong, 1988). Whether these should be interpreted as specifying bounds on families of probability distributions remains a matter of controversy. Fertig and Breese (this volume) introduce a different approach explicitly intended as a generalization of point-valued probabilities using a particular class of interval-valued probability distributions in which lower-bounds are specified for each probability. They present methods for node removal and arc reversal to support inference in belief nets with this representation. 3. Early approaches Conceptually the simplest approach to probabilistic inference in a belief network is to explicitly compute the joint distribution over all the variables as the product of all the prior and conditional distributions. This approach is implicit in the standard roll-back method of solving decision trees. By summing over the appropriate dimensions of the joint it is straightforward to obtain arbitrary marginals. Similarly, the conditional probability P(xl e) for any variable or combination of variables, JC, given any evidence, e, can be computed as the ratio of the two marginals, P(x&e)/P(e). Of course the snag with this brute force approach is that the size of the joint distribution and hence the computational effort is combinatorial in the number of variables. This is likely to be fatal for tractability if there are more than a dozen or so uncertain variables. The key to computational efficiency for inference in belief nets is to take advantage of conditional independence specified by the network topology, and so find ways of propagated the impact of new evidence locally without having to calculate the entire joint distribution explicitly.

Figure 1: Belief network showing simplified or "Idiot's" Bayes assumptions. The most popular alternative in early schemes for Bayesian diagnosis in medical applications assumes the hypotheses are mutually exclusive and exhaustive and evidence variables (findings) are conditionally independent given the hypothesis. Figure 1 shows a belief net of

this form. The single hypothesis node indicates alternative hypotheses are mutually exclusive and exhaustive (hence represented as a single n-valued variable). The absence of arcs between findings indicates they are conditionally independent given a hypothesis. This model, sometimes called simplified Bayes or "Idiot's Bayes", is highly tractable, both to build and for diagnostic inference. But of course the assumptions are not appropriate to many applications. Unfortunately its early popularity gave rise to the widespread misapprehension in the AI community that these assumptions are essential to any Bayesian scheme. 4. Exact methods Provided the network is a polytree, that is, singly connected (as in Figure 2), inference may be performed using an efficient algorithm based on constraint satisfaction for propagating the effect of new observations along the tree (Kim & Pearl, 1983). Each node in the network obtains messages from each of its parent and child nodes, representing all the evidence from the portion of the network lying beyond them. The single-connectedness guarantees that the information in each message to a node is independent and so local updating will work. This scheme's complexity is linear in the number of variables.

Figure 2: A polytree, or singly connected belief network. Unfortunately most real networks are multiply connected (as in Figure 3), so more complex methods are required. The approach developed by Olmsted (1983) and Shachter (1987; 1988)

Figure 3: A multiply connected belief network. applies a sequence of operators to the network which reverse the links, using Bayes' theorem, and take expectations over nodes to eliminate them. The process continues until the network is

reduced to just the node(s) whose probability is desired with the evidence nodes as immediate predecessors. The reduction scheme is guided by the network topology. The interval-valued probability representation proposed by Fertig and Breese (this volume) supports network inference using an analogous reduction scheme. Pearl (1986) presents an alternative approach to inference in multiply connected nets termed loop cutset conditioning. Selected variables are instantiated to cut open all loops (the loop cutset), the resulting singly connected network is solved using the Kim-Pearl method, and then the results of each instantiation are combined, weighted by their prior probabilities. Lauritzen and Spiegelhalter (1988) describe an approach based on a reformulation of the belief network. First they "moralize" the graph by adding arcs between all pairs of nodes that have a common successor (i.e. parents with a common child). They then triangulate it, adding arcs so that there are no undirected cycles of more than three nodes without an internal chord. They then identify cliques, that is all maximal sets of nodes that are completely interconnected. They prove that by this process any network can be transformed into a singly connected "hypernetwork" of cliques. They provide an algorithm for propagation of evidence within this polytree of cliques. Shachter (this volume) examines his reduction algorithms, the Pearl loop cutset conditioning method, and the Lauritzen and Spiegelhalter approach to discover their underlying relationship. He shows they are essentially identical when the network is a forest (disconnected set of trees). The computational complexity for these algorithms has not been completely analyzed in terms of the network topology, but all are liable to combinatoric problems if there are many intersecting cycles. More generally, Cooper (1987) has shown that the general problem of inference to obtain conditional probabilities in an arbitrary belief network (i.e. PIBNET) is NP-hard. This suggests it will be more profitable either to look at interesting special classes of network or to develop approximate or bounding methods, which trade off precision for computational efficiency. An example of the latter is bounded cutset conditioning (Horvitz, Suermondt and Cooper, 1989). This is a modification of loop cutset conditioning which examines only a subset of the possible cutset instantiations, and so produces bounds on the posterior probabilities at smaller computational expense than the complete version. Another approach, useful when rapid inference is needed, is to precompute, or compile, certain exact or approximate results ahead of time, so they can be accessed by simple table lookup when needed. Heckerman, Breese, and Cooper (1989) examine the tradeoffs between storage costs and computation time for a simple example of this approach. 5. Two level belief networks One special class of belief networks that has been of particular interest, especially for medical diagnosis, are two level belief networks (BN2). This is a generalization of the simplified Bayes scheme (Figure 1), in which multiple hypotheses (diseases) can occur together. In the simple case shown in figure 4, the hypotheses are marginally independent. Finding nodes are conditionally independent as before. Such networks are generally multiply connected, often with many intersecting cycles. It is also necessary to specify how the impacts of multiple diseases combine on a single finding.
The usual assumption is the noisy OR where the probability that a given disease will be sufficient to cause a finding is independent of the presence or impact on that finding of other diseases.
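As a concrete rendering of the noisy-OR assumption (a minimal sketch written for this introduction, not code from any of the papers; the causal strengths and leak probability below are invented), the probability of a finding is one minus the probability that every present disease, and the leak, independently fails to cause it.

#include <stdio.h>

/* Noisy-OR with a leak: disease i, if present, causes the finding with probability q[i];
   the leak accounts for causes outside the model (e.g. a false positive test). */
double noisy_or(const int *present, const double *q, int n, double leak)
{
    double p_no_cause = 1.0 - leak;           /* probability the leak does not fire      */
    for (int i = 0; i < n; i++)
        if (present[i])
            p_no_cause *= 1.0 - q[i];          /* disease i fails to cause the finding    */
    return 1.0 - p_no_cause;                   /* finding occurs if any cause fires       */
}

int main(void)
{
    int    present[3] = {1, 0, 1};             /* diseases 1 and 3 present */
    double q[3]       = {0.8, 0.6, 0.3};       /* illustrative causal strengths */
    printf("P(finding | diseases) = %f\n", noisy_or(present, q, 3, 0.05));
    return 0;
}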


Figure 4: Two level belief network with non-exclusive hypotheses. Cooper (1984) developed a branch-and-bound approach that performs search in the space of possible hypotheses for operating on multi-level networks of which BN2 is a special case. Peng and Reggia (1987a, b) developed an approach which identifies covering sets of diseases which can qualitatively explain the observed findings, and then computes their relative probabilities. Henrion (1990) describes TopN, a branch-and-bound algorithm to identify the n most probable diagnoses. These algorithms make use of the fact that, while a patient may have more than one disease, he or she rarely has more than four or five. They use heuristic search to identify those relatively few joint hypotheses that may account for most of the probability, employing heuristics to eliminate paths that are inadmissible, i.e. cannot lead to more probable hypotheses. TopN provides bounds on their posterior probabilities, which can be narrowed arbitrarily with additional computation. It generalizes the Peng and Reggia approach by allowing leaks, i.e. findings that can occur without any modelled cause, for example due to a false positive test. Heckerman's (this volume) Quickscore algorithm addresses this same two-level belief network with noisy-OR gates and leaks, but it provides an exact solution. While it cannot evade being computationally exponential in the number of findings, it uses an ingenious ordering of the computations that allows up to about 12 to 15 positive findings to be manageable at acceptable computational cost.

6. Stochastic simulation and Monte Carlo schemes A completely different line of attack for approximate inference in belief nets has been to employ simulation or Monte Carlo techniques. A framework for classifying these approaches, including three papers in this volume (in italics) are shown in Figure 5. In this sampling or simulation approach, posterior probabilities and other statistics are estimated from samples of instantiations of the network, that is cases in which each variable is assigned a particular deterministic value. The accuracy depends on the size of the sample (number of simulation runs). Bundy (1985) suggested a Monte Carlo approach for computing the probabilities of boolean combinations of correlated logical variables, which he called the incidence calculus. Henrion (1988) developed this approach for inference in belief nets, introducing a scheme termed probabilistic logic sampling. For each case in the sample, each source node and influence is represented as a truth value or truth table generated at random using the specified probabilities. Diagnostic inference is performed by estimating the probability of a hypothesis as the fraction of simulations that give rise to the observed set of evidence.

The key advantage of logic sampling over exact methods is that its complexity is linear in the number of nodes to achieve a given level of precision, irrespective of the degree of connectedness of the graph. Since the sample instantiations are independent (exchangeable), standard statistical methods can be used to estimate the precision of estimates as a function of the sample size. The drawback of simple logic sampling is that its complexity is exponential in the number of observed or evidence nodes, since it effectively throws away all instantiations inconsistent with the observed evidence.
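The sketch below illustrates probabilistic logic sampling with rejection on a small invented network (it is an illustration written for this introduction, not Henrion's implementation): each node is sampled forward from its parents, samples inconsistent with the evidence are discarded, and the query is estimated from the surviving samples. Discarding is exactly the source of the exponential cost in the number of evidence nodes noted above.

#include <stdio.h>
#include <stdlib.h>

static double pA[2]      = {0.7, 0.3};
static double pB_A[2][2] = {{0.9, 0.1}, {0.2, 0.8}};
static double pC_B[2][2] = {{0.8, 0.2}, {0.3, 0.7}};

/* Return 1 with probability p1 */
static int flip(double p1) { return ((double)rand() / RAND_MAX) < p1; }

int main(void)
{
    long kept = 0, hits = 0;
    for (long s = 0; s < 1000000; s++) {
        int a = flip(pA[1]);                  /* sample the source node               */
        int b = flip(pB_A[a][1]);             /* then each node given its parents      */
        int c = flip(pC_B[b][1]);
        if (c != 1) continue;                 /* reject samples inconsistent with e: C = 1 */
        kept++;
        if (a == 1) hits++;                   /* count samples consistent with the query   */
    }
    printf("P(A=1 | C=1) ~= %f  (from %ld accepted samples)\n",
           (double)hits / kept, kept);
    return 0;
}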

[Chart entries: Monte Carlo or stochastic simulation; forward propagation (incidence calculus: Bundy, 1985); arc reversal and node elimination (Shachter, 1986, 1988); evidential integration; probabilistic logic sampling (Henrion, 1986); likelihood weighting; importance sampling; Markov simulation (Gibbs sampling); stochastic simulation (Pearl, 1987); Markov sampling with restarting; Chin & Cooper, 1987; Fung & Chang; Shachter & Peot; Chavez & Cooper; Berzuini, Bellazzi & Quaglini, 1989.]
Figure 5: Monte Carlo simulation approaches to inference in belief nets. Italics refer to articles in this volume. NB: Dates refer to when first reported, generally in proceedings of earlier Workshops on Uncertainty and AI. Logic sampling is an example of a forward sampling approach. Each instantiation is created following the influence arrows. In response to the problems of logic sampling, Pearl (1987) developed a quite different stochastic sampling approach, which we may call Markov simulation (or Gibbs sampling). This method involves propagation in either direction along arcs, generally in a random sequence. First it computes the conditional distribution for each variable given all the neighbours in its Markov blanket. The Markov blanket consists of the variable's predecessors, successors, and successors' predecessors, and it shields the variable from the rest

135 of the network. All nodes are initialized at random with truth values. During simulation, each node may be updated using a truth value generated with the probability conditional on the current state of its neighbours. Nodes may be reinstantiated in random sequence. The probability of each node is estimated as the fraction of simulation cycles for which it is true. An advantage of this scheme is that it could be implemented as a network of paralleldistributed processors, exchanging messages with its neighbours. Unfortunately, it turns out to be liable to convergence problems when the network contains links that are near deterministic, i.e. close to 0 or 1 (Chin & Cooper, 1989; Shachter & Peot, this volume). Unlike logic sampling, successive cycles in Markov simulation schemes are not independent and the simulation can get trapped in particular states or sets of states. Chavez and Cooper (this volume), and Berzuini, Bellazzi and Quaglini (1989) independently explore hybrids of logic sampling and Markov simulation. Logic sampling is used to provide independent starting points for multiple Markov simulations, with the goal of combining the benefits and avoiding the problems of both. A difficulty for this approach is that the initialization method is not guaranteed to generate a random instantiation from the underlying distribution (conditional on observations). If it could do so efficiently, the problem would already be solved. If it does not, then the question arises of whether it may bias the results. Chavez and Cooper also present an interesting approach to calculate an upper bound for the convergence rate for Markov simulation. Two other papers together explore various enhancements to logic sampling. Fung and Chang (this volume) examine evidential integration, that is employing arc reversal to convert evidence nodes that are diagnostic (sinks) to sources, and so avoid the computational penalty of observed nodes. A similar approach was explored by Chin & Cooper (1989). This is not a general solution since the rearrangement process is itself liable to combinatoric problems, but partial evidential integration may play a useful role in some applications. The second extension, termed likelihood weighting or evidence weighting, explored by both Fung and Chang (this volume) and Shachter and Peot (this volume), seems more promising as a general approach. Instead of instantiating observed nodes and throwing away cases that are inconsistent with observations, they compute the joint likelihood of the observations conditioned on their unobserved predecessors, and use this to weight the sample. While the convergence properties of this approach seem hard to predict, the empirical results they present suggest it can be a significant improvement on the other simulation techniques. Shachter and Peot (this volume) further enhance their approach by importance sampling (see also Henrion, 1988). More probable hypotheses are sampled disproportionately often, and the resulting instantiations weighted inversely to prevent biasing the sample. They also describe ways to improve the process by changing the importance weights as better probability estimates are obtained. 7. Final remarks As illustrated by the papers in this section, research towards more efficient inference in large multiply connected belief networks is proceeding apace. We should not expect any single scheme to offer the best method in all circumstances. 
Certain classes of network, such as the two level networks discussed, are likely to be best handled by specially tailored algorithms. Different algorithms have different properties and work best for different applications. The characteristics of networks relevant to the choice of algorithm include the size (number of nodes and edges) of the network, its diameter, the existence and interconnectedness of loops

136 or conversely near-tree-decomposability, number of observed evidence nodes, and determinism or near-determinism of influences. Important characteristics of algorithms are robustness, the ability to estimate the error, either before or at least after computation, and the "anytime" property that it will give a useful answer anytime it is interrupted, with reduction in expected error after additional computation. There remain many critical and interesting open issues in algorithm design for belief nets, and important new developments can be expected. References Berzuini, C , R. Bellazzi and S. Quaglini, (1989) "Temporal reasoning with Probabilities", in Proceedings of Fifth Workshop on Uncertainty and AI, M. Henrion (ed.), Windsor, Ontario, ppl4-21. Bundy, A. (1985) "Incidence calculus: A mechanism for probabilistic reasoning", J. of Automated Reasoning, 1:263-83. Chavez, R. M. and G. F. Cooper, (this volume) "An Empirical Evaluation of a Randomized Algorithm for Probabilistic Inference". Chin, H.L. & Cooper, G.F. (1989) "Stochastic simulation of Bayesian belief networks",, in Uncertainty in Artiflcian Intelligence 2, L.N. Kanal, J. Lemmer & T.S. Levitt (Eds.), NorthHolland, Amsterdam, pp 129-148. Cooper, G.F. (1984) "NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge", STAN-CS-84-1031 (PhD Dissertation), Dept of Computer Science, Stanford University. Cooper, G.F. (1987) "Probabilistic inference using belief networks is NP-hard", Tech Report KSL-87-27, Knowledge Systems Lab, Stanford University. Dempster, A.P. and A. Kong (1988) "Uncertain evidence and artificial analysis", / . of Statistical Planning and Inference,20, 355-68. Fertig, K.W. and J.S. Breese, (this volume), "Interval Influence Diagrams". Fung, R. and K. Chang , (this volume), "Weighing and Integrating Evidence for Stochastic Simulation in Bayesian Networks". Geiger, D . , T. Verma and J. Pearl (this volume), "d-Separation: From Theorems to Algorithms". Gorry, G.A. and Barnett, G.O. (1968) "Experience with a model of sequential diagnosis", Computers and Biomédical Research, 1:490-507. Heckerman, D. (this volume), "A Tractable Inference Algorithm for Diagnosing Multiple Diseases". Heckerman, D., Breese, J. and Horvitz, E. (1989) "The Compilation of Decision Models", in Proceedings of Fifth Workshop on Uncertainty and AI, M. Henrion (ed.), Windsor, Ontario, ppl62-173. Henrion, M. (1987) "Uncertainty in Artificial Intelligence: Is probability epistemologically and heuristically adequate?", in Expert Judgment and Expert Systems, J.L. Mumpower (ed.) Springer-Verlag, Berlin, ppl05-130.

137 Henrion, M. (1988) "Propagation of uncertainty by probabilistic logic sampling in Bayes' networks", in Uncertainty in Artifician Intelligence, Vol 2, J. Lemmer & L.N. Kanal (Eds.)» North-Holland, Amsterdam, pp 149-164. Henrion, M. (1990) "Towards efficient inference in multiply connected belief networks" with discussion, Chapter 17 in Influence Diagrams, Belief Nets and Decision Analysis, R.M. Oliver and J.Q. Smith (eds.), Wiley: Chichester, pp385-407. Horvitz, E.J., Breese, J.S., & Henrion, M. (1988), "Decision theory in expert systems and artificial intelligence", International J. of Approximate Reasoning. 2, pp247-302. Horvitz, Eric J., Suermondt, HJ., & Cooper, G.F. (1989)"Bounded Conditioning: Flexible Inference for Decisions Under Scarce Resources", in Proceedings of Fifth Workshop on Uncertainty and AI, M. Henrion (ed.), Windsor, Ontario, pp 182-193. Howard. R.A., & Matheson, J.E. (1984) "Influence diagrams", in Readings in Decision Analysis, R.A. Howard & J.E. Matheson (eds.), Strategic Decisions Group, Menlo Park, Ca. Ch38,pp763-771. Kim, J.H. & Pearl, J. (1983) "A computational model for causal and diagnostic reasoning in inference engines", in Proc of8thIJCAI, IntJoint Conferences on AI, Karlsruhe, West Germany, 190-193. Lauritzen, S.L. & Spiegelhalter, DJ. (1988) "Local computations with probabilities on graphical structures and their applications to expert systems", /. Royal Statistical Society B, 50, No 2. Olmsted, S.M. (1983) On representing and solving decision problems, PhD Thesis, Engineering-Economic Systems Department, Stanford University, Stanford, California. Pearl, J. (1986) "Fusion, propagation, and structuring in belief networks", Artificial Intelligence, 29, pp241-88. Pearl, J. (1987) "Evidential reasoning using stochastic simulation of causal models", Artificial Intelligence, 32, pp247-57. Peng, Y. & Reggia, J.A. (1987a) "A probabilistic Causal Model for diagnostic problem solving - Part I: Integrating symbolic causal inference with numeric probabilistic inference", IEEE Trans, on Systems, Man, and Cybernetics, Vol SMC-17, No 2, Mar/Apr, ppl46-62. Peng, Y. & Reggia, J.A. (1987b) "A probabilistic Causal Model for diagnostic problem solving - Part 2: Diagnostic strategy", IEEE Trans, on Systems, Man, and Cybernetics:: Special issue for diagnosis, Vol SMC-17, No 3, May, pp395-406. Shachter, R.D. (1986) "Evaluating influence diagrams", Operations Research, 34, No 6pp871-882. Shachter, R.D. (1988) "Probabilistic inference", Operations Research, 36. Shachter, R. D. (this volume), "Evidence Absorption and Propagation Through Evidence Reversals". Shachter, R. D. and M. Peot (this volume), "Simulation Approaches to General Probabilistic Inference on Belief Networks".

138 Shenoy, P.P, G. Shafer, and K. Mellouli, (1988) "Propagation of belief functions: A distributed approach", in J.F. Lemmer and L. Kanal (eds.), Uncertainty and Artificial Intelligence 2, North Holland, Amsterdam. Wellman, M. (1988) Formulation of Tradeoffs in Planning under Uncertainty, PhD Thesis, MIT/LCS, TR-427, MIT, Lab for Computer Science, August.



d-SEPARATION: FROM THEOREMS TO ALGORITHMS Dan Geiger, Thomas Verma & Judea Pearl Cognitive Systems Laboratory, Computer Science Department, University of California, Los Angeles, CA 90024 ABSTRACT An efficient algorithm is developed that identifies all independencies implied by the topology of a Bayesian network. Its correctness and maximality stem from the soundness and completeness of d-separation with respect to probability theory. The algorithm runs in time O(|E|) where E is the number of edges in the network. 1. INTRODUCTION Bayesian networks encode properties of probability distributions using directed acyclic graphs (dags). Their usage is spread among many disciplines such as: Artificial Intelligence (Pearl 1988), Decision Analysis (Howard and Matheson 1981; Shachter 1988), Economics (Wold 1964), Genetics (Wright 1934), Philosophy (Glymour et al. 1987) and Statistics (Lauritzen and Spiegelhalter 1988; Smith 1987). A Bayesian network is a pair (D, P) where D is a dag and P is a probability distribution called the underlying distribution. Each node i in D corresponds to a variable X_i in P, a set of nodes J corresponds to a set of variables X_J, and x_i, x_J denote values drawn from the domain of X_i and from the (cross product) domain of X_J, respectively.(1) Each node in the network is regarded as a storage cell for the distribution P(x_i | x_π(i)) where X_π(i) is the set of variables that correspond to the parent nodes π(i) of i. The underlying distribution represented by a Bayesian network is composed via

P(x_1, ..., x_n) = ∏_i P(x_i | x_π(i))        (1)

(when i has no parents, then X_π(i) = ∅). The role of a Bayesian network is to record a state of knowledge P, to provide means for updating the knowledge as new information is accumulated and to facilitate query answering mechanisms for knowledge retrieval (Lauritzen and Spiegelhalter 1988; Pearl 1988). A standard query for a Bayesian network is to find the posterior distribution of a hypothesis variable X_i, given an evidence set X_J = x_J, i.e., to compute P(x_i | x_J) for each value of X_i and for a given combination of values of X_J. The answer (*) This work was partially supported by the National Science Foundation Grant #IRI-8610155 and Naval Research Laboratory Grant #N00014-89-J-2057. (1) Note that bold letters denote sets of variables.

to such queries can, in principle, be computed directly from equation (1) because this equation defines a full probability distribution. However, treating the underlying distribution as a large table instead of a composition of several small ones might be very inefficient both in time and space requirements, unless we exploit independence relationships encoded in the network. To better understand the improvements and limitations that more efficient algorithms can achieve, the following two problems must be examined: Given a variable X_k, a Bayesian network D and the task of computing P(x_i | x_J), determine, without resorting to numeric calculations: 1) whether the answer to the query is sensitive to the value of X_k, and 2) whether the answer to the query is sensitive to the parameters p_k = P(x_k | x_π(k)) stored at node k. The answer to both questions can be given in terms of conditional independence. The value of X_k does not affect this query if P(x_i | x_J) = P(x_i | x_J, x_k) for all values of x_i, x_k and x_J, or equivalently, if X_i and X_k are conditionally independent given X_J, denoted by I(X_i, X_J, X_k)_P. Similarly, whether the parameters p_k stored at node k would not affect the query P(x_i | x_J) also reduces to a simple test of conditional independence, I(X_i, X_J, π_k), where π_k is a (dummy) parent node of X_k representing the possible values of p_k. The main contribution of this paper is the development of an efficient algorithm that detects these independencies directly from the topology of the network, by merely examining the paths along which i, k and J are connected. The proposed algorithm is based on a graphical criterion, called d-separation, that associates the topology of the network with independencies encoded in the underlying distribution. The main property of d-separation is that it detects only genuine independencies of the underlying distribution (Verma and Pearl 1988) and that it cannot be sharpened to reveal additional independencies (Geiger and Pearl 1990). A second contribution of the paper is providing a unified approach to the solution of two distinct problems: sensitivity to parameter values and sensitivity to variable instantiations. 2. SOUNDNESS AND COMPLETENESS OF d-SEPARATION In this section we review the definition of d-separation, a graphical criterion that identifies conditional independencies in Bayesian networks. This criterion is both sound and complete (maximal), i.e., it identifies only independencies that hold in every distribution having the form (1), and all such independencies. A preliminary definition is needed. Definition: A trail in a dag is a sequence of links that form a path in the underlying undirected graph. A node β is called a head-to-head node with respect to a trail t if there are two consecutive links a → β and β ← c on t.

(iii) Find all unlabeled links v → w adjacent to at least one link u → v labeled i, such that (u → v, v → w) is a legal pair. If no such link exists, stop.

(iv)

Label each link v -> w found in Step (iii) with i +1 and the corresponding node w with R.

(v)

i := i+1, Goto Step (iii).
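A minimal sketch of the link-labeling traversal follows (one possible rendering in C, written for this presentation rather than taken from the paper; the example graph and the legal_pair() predicate are placeholders for the d-separation-induced test defined in Algorithm 2 below).

#include <stdio.h>
#include <string.h>

#define MAXE 64
#define MAXV 64
typedef struct { int u, v; } Link;

static Link links[MAXE];
static int  nlinks;
static int  label[MAXE];      /* 0 = unlabeled, otherwise the level i at which the link was labeled */
static int  reached[MAXV];    /* nodes labeled R */

/* Illustrative legality test; the real test depends on the set F of forbidden pairs. */
static int legal_pair(Link a, Link b) { return a.u != b.v; }

static void add_link(int u, int v) { links[nlinks].u = u; links[nlinks].v = v; nlinks++; }

static void algorithm1(const int source_links[], int nsrc)
{
    memset(label, 0, sizeof label);
    memset(reached, 0, sizeof reached);
    for (int s = 0; s < nsrc; s++) {              /* label the source links with 1 */
        label[source_links[s]] = 1;
        reached[links[source_links[s]].v] = 1;
    }
    for (int i = 1, grew = 1; grew; i++) {        /* steps (iii)-(v): expand level by level */
        grew = 0;
        for (int a = 0; a < nlinks; a++) {
            if (label[a] != i) continue;
            for (int b = 0; b < nlinks; b++)
                if (!label[b] && links[a].v == links[b].u &&
                    legal_pair(links[a], links[b])) {
                    label[b] = i + 1;             /* step (iv): label the new link and its head */
                    reached[links[b].v] = 1;
                    grew = 1;
                }
        }
    }
}

int main(void)
{
    add_link(0, 1); add_link(1, 2); add_link(2, 3); add_link(1, 3);
    int src[] = {0};                               /* start from the link 0 -> 1 */
    algorithm1(src, 1);
    for (int v = 0; v <= 3; v++)
        printf("node %d: %s\n", v, reached[v] ? "reachable (R)" : "not reached");
    return 0;
}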

The main difference between this algorithm and BFS, a change which has been proposed by (Gafni 1988), is the traversal of the graph according to a labeling of the links and not according to a labeling of nodes (as in the traditional BFS algorithm). This change is essential as the next example shows (Figure 2): Let F = {(a,c)}; then the path from 1 to 3 through links a, b and c is legal while the path not traversing b is not legal because (a,c) ∈ F. However, BFS, which labels nodes, would not traverse the link b since when link b is considered, its end point has already been labeled. Thus, BFS with node labeling would not reveal the legal path connecting nodes 1 and 3.


Figure 2

Lemma 3: Algorithm 1 labels with R all nodes that are reachable from s (and thus from J) via a legal path, and only these nodes are labeled with R. Proof: First, we show that if a node w_l is labeled with R, then there exists a legal path from s to w_l. Let w_{l-1} → w_l be a link through which w_l has been labeled. We induct on the label l of the link w_{l-1} → w_l. If l = 1 then w_l ∈ J and is therefore reachable from s. If l > 1, then by step (iii), there exists a link w_{l-2} → w_{l-1} labeled with l − 1 such that (w_{l-2} → w_{l-1}, w_{l-1} → w_l) is a legal pair. Repeatedly applying this argument for i = l ... 2 yields a legal path w_0 → w_1 → ... → w_l, where w_0 → w_1 is labeled with 1. However, the only links labeled 1 emanate from s, hence the above path is the required legal path from s to w_l. It remains to be shown that each node that is reachable from s via a legal path is labeled with R by the algorithm. Instead, we show that every link a → v_m that is reachable from s via a legal path (i.e., it participates in a legal path emanating from s) is eventually labeled by the algorithm.(2) The latter claim is stronger than the former because for every reachable node v_m there exists a reachable link a → v_m and, by Step (iv), whenever a → v_m is labeled with some integer, v_m is labeled with R. We continue by contradiction. Let l_m = v_{m-1} → v_m be the closest link to s via a legal path that remains unlabeled. Let p = s → v_1 → ... → v_{m-1} → v_m be the path emanating from s and terminating with the link l_m. (2) By labeled we mean that the algorithm attaches an integer label to that link and not an "undefined" label.

The portion of this path that reaches the link l_{m-1} = v_{m-2} → v_{m-1} is shorter than p. Thus, by the induction hypothesis, l_{m-1} is labeled by the algorithm. Hence, the link l_m is labeled as well (by the next application of step (iv)), contradicting our assumption that it remains unlabeled. □ The complexity of Algorithm 1 for a general F is O(|E| · |V|). In the worst case, each of the |V| nodes might be reached from |V| − 1 entry points and, for each entry, the remaining links may need to be examined afresh for reachability. Thus, in the worst case, a link may be examined |V| − 2 times before it is labeled, which leads to O(|E| · |V|) complexity. However, for the special case where F is induced by the d-separation condition, we shall later show that each link is examined only a constant number of times, and the complexity reduces to O(|E|). Next we employ Algorithm 1 to solve the problem of identifying the set of nodes that are d-separated from J by L. For this aim, we construct a directed graph D′ marked with a set of legal pairs such that a node v is reachable from J via an active trail (given L) in D iff v is reachable from J via a legal path in D′. The following observations are the basis of our algorithm. First, any link on a trail can be traversed both ways. Therefore, to ensure that every active trail in D corresponds to a legal (directed) path in D′, D′ must consist of all links of D in their forward and reverse direction. Second, constructing a single table that indicates, for each node, whether it is in L or has a descendent in L, facilitates a constant-time test for legal pairs in D′.

Algorithm 2

Input: A Bayesian network D = (V, E) and two disjoint sets of nodes J and L.

Data Structure: A list of incoming links (in-list) for each node v ∈ V.

Output: A set of nodes K where K = {α | I(J, L, α)_D}.

(i) Construct the following table: descendent[v] = true if v is or has a descendent in L, and false otherwise.

(ii) Construct a directed graph D′ = (V, E′) where E′ = E ∪ {(u → v) | (v → u) ∈ E}.

(iii) Using Algorithm 1, find the set of all nodes K′ which have a legal path from J in D′, where a pair of links (u → v, v → w) is legal iff u ≠ w and either 1) v is a head-to-head node on the trail u—v—w in D and descendent[v] = true, or 2) v is not a head-to-head node on the trail u—v—w in D and v ∉ L.

(iv) K = V − (K′ ∪ J ∪ L). Return K.

The correctness of this algorithm is established by the following argument.

Lemma 4: For every node α ∉ J ∪ L, α is reachable from J via a legal path in D′ iff there is an active trail by L from J to α in D. Proof: For α ∉ J ∪ L and x_0 ∈ J, if x_0 − x_1 − ... − α is an active trail (given L) in D, then the directed path x_0 → x_1 → ... → α is a legal path in D′, and vice versa. (We have eliminated the case α ∈ J ∪ L for technical convenience; the trail x_0 − x_1 − ... − α is neither active nor non-active because, by our definition, J and {α} must be disjoint.) □ Theorem 5: The set K returned by the algorithm is exactly {α | I(J, L, α)_D}. Proof: The set K′ constructed in Step (iii) contains all nodes reachable from J via a legal path in D′. Thus, by Lemma 4, K′ contains all nodes not in J ∪ L that are reachable from J via an active trail (given L) in D. However, I(J, L, α)_D holds iff α ∉ J ∪ L and α is not reachable from J (by an active trail given L); therefore, K = V − (K′ ∪ J ∪ L) is exactly the set {α | I(J, L, α)_D}. □ Next, we show that the complexity of the algorithm is O(|E|); we analyze the algorithm step by step. The first step is implemented as follows: Initially mark all nodes of L with true. Follow the incoming links of the nodes in L to their parents and then to their parents and so on. This way, each link is examined at most once, hence the entire step requires O(|E|) operations. The second step requires the construction of a list for each node that specifies all the links that emanate from v in D (out-list). The in-list and the out-list completely and explicitly specify the topology of D′. This step also requires O(|E|) steps. Using the two lists, the task of finding a legal pair in step (iii) of Algorithm 2 requires only constant time; if e_1 = u → v is labeled i then, depending upon the direction of u − v in D and whether v is or has a descendent in L, either all links of the out-list of v, or all links of the in-list of v, or both are selected. Thus, a constant number of operations per encountered link is performed. Hence, Step (iii) requires no more than O(|E|) operations, which is therefore the upper bound (assuming |E| ≥ |V|) for the entire algorithm. The above algorithm can also be employed to verify whether a specific statement I(J, L, K)_D holds in a dag D. Simply find the set K_max of all nodes that are d-separated from J given L and observe that I(J, L, K)_D holds in D iff K ⊆ K_max. In fact, for this task, Algorithm 2 can be slightly improved by forcing termination once the condition K ⊆ K_max has been detected. (Lauritzen et al. 1988) have recently proposed another algorithm for the same task. Their algorithm consists of the following steps. First, form a dag D′ by removing from D all nodes which are not ancestors of any node in J ∪ K ∪ L (and removing their incident links). Second, form an undirected graph G, called the moral graph, by stripping the directionality of the links of D′ and connecting any two nodes that have a common child (in D′) which is or has a descendent in L. Third, they show that I(J, L, K)_D holds iff all (undirected) paths between J and K in G are intercepted by L. The complexity of the moral graph algorithm is O(|V|²) because the moral graph G may contain up to |V|² links. Hence, checking separation in G could require O(|V|²) steps. Thus, our algorithm is a moderate improvement as it only requires O(|E|) steps. The gain is significant mainly in sparse graphs where |E| = O(|V|). We note that if the maximal number of parents of each node is bounded by a constant, then the two algorithms achieve

the same asymptotic behavior, i.e., linear in |E|. On the other hand, when the task is to find all nodes d-separated from J by L (not merely validating a given independence), then a brute force application of the moral graph algorithm requires O(|V|³) steps, because for each node not in J ∪ L the algorithm must construct a new moral graph. Hence, for this task, our algorithm offers a considerable improvement. The inference engine of Bayesian networks has also been used for decision analysis; an analyst consults an expert to elicit information about a decision problem, formulates the appropriate network and then, by an automated sequence of graphical and probabilistic manipulations, an optimal decision is obtained (Howard and Matheson 1981; Olmsted 1984; Shachter 1988). When such a network is constructed it is important to determine the information needed to answer a given query P(x_J | x_L) (where J ∪ L is an arbitrary set of nodes in the network), because some nodes might contain no relevant information to the decision problem and eliciting their numerical parameters is a waste of effort (Shachter 1988). Assuming that each node X_i stores the conditional distribution P(x_i | x_π(i)), the task is to identify the set M of nodes that must be consulted in the process of computing P(x_J | x_L) or, alternatively, the set of nodes that can be assigned arbitrary conditional distributions without affecting the quantity P(x_J | x_L). The required set can be identified by the d-separation criterion. We represent the parameters of the distribution P(x_i | x_π(i)) as a dummy parent p_i of node i. This is clearly a legitimate representation complying with the format of Eq. (1), since for every node X_i, P(x_i | x_π(i)) can also be written as P(x_i | x_π(i), p_i), so p_i can be regarded as a parent of X_i. From Theorem 1, all dummy nodes that are d-separated from J by L represent variables that are conditionally independent of J given L and so the information stored in these nodes can be ignored. Thus, the information required to compute P(x_J | x_L) resides in the set of dummy nodes which are not d-separated from J given L. Moreover, the completeness of d-separation further implies that M is minimal; no node in M can be exempted from processing on purely topological grounds (i.e., without considering the numerical values of the probabilities involved). The algorithm below summarizes these considerations:

Algorithm 3

Input: A Bayesian network, two sets of nodes J and L.

Output: A set of nodes M that contains sufficient information to compute P(x_J | x_L).

(i) Construct a dag D′ by augmenting D with a dummy node v′ for every node v in D and adding a link v′ → v.

(ii) Use Algorithm 2 to compute the set K′ of nodes not d-separated from J by L.

(iii) Let M be the set of all dummy nodes v′ that are included in K′.

We conclude with an example. Consider the network D of Figure 3 and a query P(x_3).

Figure 3

The computation of P(x_3) requires only multiplying the matrices P(x_3 | x_1) and P(x_1) and summing over the values of X_1. These two matrices are stored at the dummy nodes 1′ and 3′, which are the only dummy nodes not d-separated from node 3 (given ∅). Thus, Algorithm 3 reveals the fact that the parameters represented by nodes 2′ and 4′ (P(x_2), P(x_4 | x_1, x_2)) are not needed for the computation of P(x_3). Note, however, that knowing the value of X_4 might influence the computation of P(x_3), because X_3 and X_4 could be dependent. The value of X_2, on the other hand, never affects this computation because X_2 is independent of X_3. Note that the questions of the value of a node, or of the parameters stored with a node, influencing a given computation may have two different answers. For example, the value of x_4 might influence the computation of P(x_3), because X_3 and X_4 could be dependent, while the parameters stored at node X_4 never affect this computation. Algorithm 3, by representing parameters as dummy variables, reveals this fact. Shachter was the first to address the problem of identifying irrelevant parameters using transformations of arc-reversal and node-removal (Shachter 1988).(3) A revised algorithm of Shachter also detects irrelevant variables and it appears that the outcome of this algorithm is identical to ours (Shachter 1990). In our approach we maintain a clear distinction between the following two tasks: (1) declarative characterization of the independencies encoded in the network (i.e., the d-separation criterion) and (2) procedural implementation of the criterion defined in (1). Such separation facilitates a formal proof of the algorithm's soundness, completeness and optimality. In Shachter's treatment, task (1) is inseparable from (2). The axiomatic basis upon which our method is grounded also provides means for extending the graphical criteria to other notions of independence, such as relational and correlational dependencies (Geiger 1990). ACKNOWLEDGEMENT We thank Eli Gafni for his help in developing Algorithm 1, and Azaria Paz and Ross Shachter for many stimulating discussions.

(3) Shachter also considers deterministic variables, which we treat in (Geiger et al. 1990).

148 REFERENCES Even, S. 1979. Graph Algorithms, Computer Science Press. Gafni, E. 1988. Personal communication. Geiger, D. January 1990. "Graphoids: A Qualitative Framework for Probabilistic Inference," Ph.D dissertation, Computer Science Department. Technical report R-142, UCLA Cognitive Systems Laboratory. Geiger, D. and Pearl, J. 1990. "On The Logic of Causal Models," in, Uncertainty in Artificial Intelligence 4, Kanal, L. and Shachter, R. (eds), Amsterdam: North-Holland Publishing Co. Geiger, D., Verma, T.S. and Pearl, J. 1990. "Identifying independence in Bayesian networks," in, Networks, Sussex, England: John Wiley and Sons. Glymour, C, Scheines, R., Spirtes, P. and Kelly, K. 1987. Discovering Causal Structure, New York: Academic Press. Howard, R.A. and Matheson, J.E. 1981. "Influence Diagrams," in, Principles and Applications of Decision Analysis, Menlo Park, CA: Strategic Decisions Group. Lauritzen, S.L. and Spiegelhalter, D. 1989. "Local computations with probabilities on graphical structures and their applications to expert systems," in, / . Royal Statist. Soc, ser. B. Lauritzen, S.L., Dawid, A.P., Larsen, B.N. and Leimer, H.G.. October 1988. "Independence Properties of Directed Markov Fields," Technical Report R 88-32, Aalborg Universitetscenter, Aalborg Denmark. To appear in networks. Olmsted, S.M.. 1983. "On Representing and Solving Decision Problems," Ph.D. Thesis, EES Dept., Stanford University. Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann. Shachter, R.D. 1988. "Probabilistic Inference and Influence Diagrams," Operations Research, Vol. 36, pp. 589-604. Shachter, R.D. 1990. "An Ordered Examination of Influence Diagrams". To appear in Networks. Smith, J.Q. June 1987. "Influence Diagrams for Statistical Modeling," Technical report #117, department of Statistics, University of Warwick, Coventry, England. Verma, T.S. and Pearl, J. August 1990. "Causal Networks: Semantics and Expressiveness," in, Uncertainty in Artificial Intelligence 4, Kanal, L. and Shachter, R. (eds), Amsterdam: North-Holland Publishing Co. Wold, H. 1964. Econometric Model Building, Amsterdam: North-Holland Publishing Co. Wright, S. 1934. "The Method of Path Coefficients," Ann. Math. Statist., Vol. 5, pp. 161-215.



Interval Influence Diagrams Kenneth W. Fertig and John S. Breese Rockwell International Science Center Palo Alto Laboratory 444 High Street Palo Alto, CA 94301 We describe a mechanism for performing probabilistic reasoning in influence diagrams using interval rather than point valued probabilities. We derive the procedures for node removal (corresponding to conditional expectation) and arc reversal (corresponding to Bayesian conditioning) in influence diagrams where lower bounds on probabilities are stored at each node. The resulting bounds for the transformed diagram are shown to be optimal within the class of constraints on probability distributions that can be expressed exclusively as lower bounds on the component probabilities of the diagram. Sequences of these operations can be performed to answer probabilistic queries with indeterminacies in the input and for performing sensitivity analysis on an influence diagram. The storage requirements and computational complexity of this approach are comparable to those for point-valued probabilistic inference mechanisms, making the approach attractive for performing sensitivity analysis and where probability information is not available. Limited empirical data on an implementation of the methodology are provided.

1 Introduction

One of the most difficult tasks in constructing an influence diagram is development of conditional and marginal probabilities for each node in the network. In some instances probability information may not be readily available, and a reasoner wishes to determine what conclusions can be drawn with partial information on probabilities. In other cases, one may wish to assess the robustness of various conclusions to imprecision in the input. The subject of probability bounds has been a topic of interest for a number of years in artificial intelligence. Early users of Dempster-Shafer formalisms were originally motivated by the ability to specify bounds on probabilities [Ginsberg, 1985, Haddawy, 1986]. Inequality bounds have also been examined by those attempting to bridge between Dempster-Shafer theory and traditional probability theory [Grosof, 1986, Kyburg, 1987, Kyburg, 1988]. A number of other researchers have attempted to deal with bounds on probabilities within a traditional Bayesian framework [Cooper, 1984, Snow, 1986, Neapolitan and Kenevan, 1988, Pearl, 1988a]. In this paper we develop and demonstrate a means of incorporating imprecision in probability values by specifying lower bounds on input probabilities and using influence diagrams as a means of expressing conditional independence. A number of authors have developed systems which derive probabilistic conclusions, given general linear constraints on the inputs [Snow, 1986, White, 1986]. These systems typically use linear programming methods repeatedly to propagate constraints through a set of probabilistic calculations.

The characterization of constraints as lower bounds allows us to derive a relatively efficient procedure for probabilistic inference, based on successive transformations to the diagram, at the cost of some expressiveness. The implications of these transformations in terms of the sets of probability distributions admitted by the bounds are analyzed in detail.

2 Probabilistic Inference with Bounds on Probabilities

In standard probability theory, when x and y are random variables and completely specified probability distributions of the form p(x|y) and p(y) are available, one can calculate precisely the following quantities:

p(x) = ∫_y p(x|y) p(y)

p(y|x) = p(x|y) p(y) / ∫_y p(x|y) p(y)

using the standard formula for conditional marginalization and Bayes rule. In this paper we examine the case where precise probability distributions are replaced by lower bounds. In general, constraints on probability distributions can be considered subsets of the space, P, of all probability distributions. In [Fertig and Breese, 1990], we use this structure as the basis of the interval probability theory presented here. However, for the sake of simplicity, we will start with a special class of constraints for lower bounds.

Definition 1 (Lower Bound Constraint Function) A function b(x) is a lower bound constraint function if and only if ∀x, b(x) ≥ 0 and ∫_x b(x) ≤ 1.

Definition 2 (Constraint Set) Let P be the set of all probability distributions p(x) on a space X. The set C ⊆ P is the constraint set associated with a lower bound constraint function b(x) if and only if C = {p | p ∈ P, p(x) ≥ b(x) ∀x}.

Expressing constraints in terms of lower bound constraint functions for all values of x allows us to derive an upper bound for the probabilities for discrete random variables.

Theorem 1 (Upper Bounds) If x is a discrete random variable with a set of possible values {x_1, x_2, ..., x_n} and given a constraint function b(x_i) and associated constraint set C, then for every p ∈ C,

p(x_i) ≤ 1 − Σ_{j≠i} b(x_j) ≡ U(x_i),

so U(x_i) is an upper bound. For an arbitrary x_i define p*(z) (depending on x_i) to be a probability distribution:

p*(z) = U(x_i)   if z = x_i
p*(z) = b(z)     if z ≠ x_i

Thus p*(z) is a probability distribution over the values for x which achieves its upper bound at z = x_i and satisfies all lower bound constraints. □ Note that when the lower bounds for a variable sum to one, the lower bound equals the upper bound, and the interval for a probability value collapses to a single value. The definitions and theorem are analogous for the conditional case. The next two theorems provide the fundamental mechanisms for calculating bounds for new probability distributions based on bounds on the input probabilities in the form described above.

Theorem 2 (Marginalization) Given lower bound constraint functions b(x|y) and b(y) for all values of x and y and associated constraint sets C and D,

∀x, b(x) = b(x|y_s) U(y_s) + Σ_{y≠y_s} b(x|y) b(y)        (2)

(where y_s is such that b(x|y_s) ≤ b(x|y) for all y) is a sharp lower bound for p(x) ∈ C ∩ D.

Theorem 3 (Bayes) Given lower bound constraint functions b(x|y) and b(y) for all values of x and y and associated constraint sets C and D,

(where y_s is such that U(x|y_s) ≥ U(x|y_i) for all y_i ≠ y_s) is a sharp lower bound for p(y|x) ∈ C ∩ D. The proofs of these theorems are in [Fertig and Breese, 1990]. In the following section we show how these theorems are used in influence diagrams to perform corresponding influence diagram transformations.
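To illustrate how Theorems 1 and 2 are applied, the following sketch (written for this presentation and following the statements of the theorems as given above; the numerical lower bounds in main() are invented) computes the implied upper bound U(x_i) and the marginal lower bound b(x) for binary variables.

#include <stdio.h>

/* Upper bound of Theorem 1: U(x_i) = 1 - sum of the other lower bounds. */
static double upper_bound(const double *b, int n, int i)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        if (j != i)
            sum += b[j];
    return 1.0 - sum;
}

/* Lower bound of Theorem 2 for a fixed value x:
   b(x) = b(x|y_s) U(y_s) + sum over y != y_s of b(x|y) b(y),
   where y_s minimizes b(x|y). */
static double marginal_lower_bound(const double *b_x_given_y, const double *b_y, int ny)
{
    int ys = 0;
    for (int y = 1; y < ny; y++)
        if (b_x_given_y[y] < b_x_given_y[ys])
            ys = y;
    double result = b_x_given_y[ys] * upper_bound(b_y, ny, ys);
    for (int y = 0; y < ny; y++)
        if (y != ys)
            result += b_x_given_y[y] * b_y[y];
    return result;
}

int main(void)
{
    double b_y[2]   = {0.55, 0.30};     /* lower bounds on p(y)         */
    double b_x_y[2] = {0.60, 0.20};     /* lower bounds on p(x = 1 | y) */
    printf("U(y=0)   = %f\n", upper_bound(b_y, 2, 0));
    printf("b(x=1)  >= %f\n", marginal_lower_bound(b_x_y, b_y, 2));
    return 0;
}

Note that the storage and work are the same as for point-valued marginalization; only the choice of which term receives the upper bound U(y_s) distinguishes the interval computation.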

3 Interval Influence Diagrams

An influence diagram is an acyclic directed graph D = (N, A) of nodes N and arcs A [Howard and Matheson, 1981]. Associated with each node X ∈ N is a set of possible states S_X = {x_1, ..., x_n}. We will use the lower case x to indicate one of the possible values for a node. The predecessors of a node X, written Π_X, are those nodes with arcs directly to X. Associated with each node is a conditional probability distribution over the possible states of the node, given the possible states of its predecessors, written p(x | s_Π_X). In this expression s_Π_X is a member of the set of combined possible outcomes for the predecessors. The set of nodes, their outcomes, arcs, and conditional probability distributions completely define a probabilistic influence diagram or belief net.

Interval influence diagrams differ from the standard influence diagram formalism in that we specify lower bounds (as in Definition 1) on the conditional probability distributions associated with each node in the network: ∀X ∈ N, b(x | s_Π_X) ≤ p(x | s_Π_X).

N > 1/(4δα²)        (1)

guarantees the (α, δ) convergence criteria, where N is the total number of trials. Each trial corresponds to the choice of a joint instantiation for all the nodes in the belief network. We have predicated our analysis on the existence of a trial generator that accurately produces states of the network according to their true probabilities, contingent on the available evidence.

(2) In the case of belief networks, the size of the input is the length of a string that fully describes the nodes, their connections, and their conditional probabilities, represented in unary notation. Unary notation ensures a running time that is a polynomial in the problem size; as the probabilities approach 0 and 1, the unary representations assume unbounded size, and the algorithm's performance decreases dramatically. Note, in addition, that we have excluded deterministic relationships (probabilities equal to 0 or 1) from the analysis.

(3) The interval error α is the maximum difference between the true and the approximate probabilities, taken over all the probabilities in the network. The relative error ε is the maximum difference between the true and the approximate probabilities, divided by the true probability. In general, it is much more difficult to guarantee a fixed relative error, especially as the probabilities approach 0.

The original straight-simulation generator depends on the initial state (that is, it lacks ergodicity). Moreover, the straight-simulation generator offers no guarantees about its convergence properties. We must, therefore, turn our attention to the study of modified state generators. Given any belief network, we show how to construct a special Markov chain with the following two properties. First, states of the Markov chain correspond to joint instantiations of all the nodes in the network; the Markov chain associated with a network of n binary nodes, for example, has 2^n distinct states. Second, the stationary distribution of the Markov chain is identical to the joint posterior-probability distribution of the underlying belief network. In addition, the constructed Markov chain has the properties of ergodicity and time reversibility. Ergodic chains are, by definition, aperiodic (without cycles) and irreducible (with a nonzero transition probability between any pair of states). Time-reversible chains look the same whether the simulation flows forward or backward. Once again, [Chavez, 1989a] presents the details of the construction. In the limit of infinity, after the Markov chain has reached its stationary distribution, that chain generates states according to their true probabilities. Obviously, we cannot afford to let the chain reach equilibrium at infinity. In practice, we wish to know how well the chain has converged after we have let it run for only a finite number t of transitions. Define the relative pointwise distance (r.p.d.) after t transitions,

Δ(t) = max_{i,j ∈ [M]} |P_ij^(t) − π_j| / π_j

where P_ij^(t) denotes the t-step transition probability from state i to j (with a total of M states) and π_j denotes the stationary probability of state j. Let Π = min_{i∈[M]} π_i, the joint probability of the least likely joint state of all variables in the network. We wish to determine the number of transitions t needed to guarantee a deterministic upper bound on Δ(t). A Jerrum-Sinclair analysis of chain conductance (intuitively, the chain's tendency to flow around the state space [Jerrum and Sinclair, 1988]) and a combinatoric path-counting argument show that the BN-RAS generator requires

(log γ + log Π) / log(1 − p₀²/8)        (2)

transitions to guarantee a relative pointwise distance of γ, where p₀ is the smallest transition probability in the Markov chain. Combining the convergence analysis with the scoring strategy in relation (1), BN-RAS computes posterior-probability estimators Y that satisfy the constraint.

for (v = 0; v < this->nvalues; v++)
{
    /* compute P[this | parents] */
    prod = cprob(this);
    /* multiply by P[child | this], for each child */
    for (i = 0; i < this->nchildren; i++)
        prod *= cprob(matrix[this->children[i]]);
    sum += prod;
    this->dist[v] = sum;
}
normalize dist;
}

Figure 1: Pseudo-code that computes the Markov transition probabilities from a belief network.
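The product formed in the loop above is the usual Markov-blanket term. As background (standard material rather than a quotation from this paper), the transition distribution for a node X_i given the current values of all other variables is

P(X_i = v | all other variables) ∝ P(X_i = v | parents(X_i)) × ∏ over children C of X_i of P(x_C | parents(C)),

which is what the inner product computes for each candidate value v; the values are accumulated as a running sum (apparently for later sampling by choose()) and then normalized.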

/* Perform one transition */
do_transition()
{
    /* With probability 1/2, stay put (guarantees aperiodicity).  Use ACM algorithm. */
    p = drand_acm();
    if (p < 0.5)
        return;
    /* otherwise resample the selected node from its transition distribution */
    this->value = choose(this->dist);
}

/* Compute a trial */
next_trial()
{
    set all the nodes to uniform random values;
    for (d = 0; d < transitions; d++)
        do_transition();
}

estimate()
{
    compute number of trials, n;
    for (j = 1; j <= n; j++)
        ...
}

do_transition(i)
{
    matrix[i]->value = choose(matrix[i]->dist);
}

/* Compute a trial */
next_trial()
{
    static int i = 0;    /* Don't re-initialize the nodes */
    if (i == nnodes)
        i = 0;
    do_transition(i++);
}

Figure 3: The pseudo-code for straight simulation.


P(B | A) = .001    P(B | ¬A) = .999    P(A) = 1/2

Figure 4: This two-node network poses severe convergence problems for straight simulation.

reduce the efficiency of the approach? In other words, how does BN-RAS perform in comparison to straight simulation? We study each of those questions in turn, and display charts and graphs that illustrate our conclusions.

3. Results

In the present experiments, we study two belief networks: a simple two-node network (Figure 4) for which straight simulation is known to perform poorly [Chin and Cooper, 1989], and a much more complex network, DxNET, for alarm management in the intensive-care unit [Beinlich et al., 1988]. The two-node network specifically causes straight simulation to undergo pathological oscillation. DxNET, on the other hand, reflects an anesthesiologist's clinical expertise and judgmental knowledge. For our present purposes, we observe that, to guarantee ε = 0.1, γ = 0.1, and δ = 0.1 for the two-node example, we require 2,662,000 trials, with 316,911,596 transitions per trial; to guarantee an interval error α = 0.1 for the same network, we need only 1332 trials, with the same number of transitions per trial. For DxNET, the numbers prove even more formidable. The worst-case bounds require 13,307,782 trials, with 256,573,353,901 transitions per trial, to guarantee ε = 0.1, γ = 0.1, and δ = 0.1; to ensure that α = 0.1, we need only 1332 trials, but we still require 256,573,353,901 transitions per trial. Figure 5 illustrates the number of trials, based on relation (1), as a function of the interval error α for several values of the failure probability δ. With an error tolerance of α = 0.1, the algorithm requires less than 10,000 trials, for values of δ > 0.1. As the error tolerance shrinks, however, the number of trials increases quadratically. Note, however, that relation (1) specifies a distribution-free upper bound on the number of trials. Depending on the underlying probability distribution, fewer trials may suffice. Figure 6 illustrates the relationship between the number of transitions needed for sufficient mixing of the Markov chain, t, and the smallest transition probability, p₀. The transition probabilities vary as the product of conditional probabilities at each local node group. The belief networks that knowledge engineers build for realistic applications


[Legend: δ = 0.01, 0.02, 0.05, 0.1, 0.2, 0.25, 0.5]

Figure 5: This graph demonstrates the relationship between the number of trials N needed to guarantee an (α, δ) algorithm for interval error α and failure probability δ.

will typically require small transition probabilities. Such probabilities do not entail the approximation scheme's success or failure; rather, they suggest that the analytic bounds cannot guarantee efficient computation. Note, in particular, the logarithmic abscissa, and the relative unimportance of the γ error term. For belief networks in which the smallest transition probability p₀ > 0.1, we expect that BN-RAS will yield an acceptable, tractable computation for realistic values of α and δ. As p₀ approaches 0, however, the number of transitions needed to guarantee the bounds, as indicated in equation (4), approaches infinity. Clearly, the analytic bounds do not always yield an efficient algorithm, even though they do predict a running time that varies only linearly with the number of nodes. The conditional probabilities that lie close to 0 and 1 require unrealistically large values of t to approximate the stationary distribution with great certainty.
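As a rough illustration of this sensitivity, the helper below evaluates the transition bound of equation (2), as reproduced above, for a few values of p₀. It is a sketch written for this discussion rather than code from the BN-RAS implementation; the γ and Π values echo the two-node example.

#include <stdio.h>
#include <math.h>

/* Transitions required by equation (2): (log gamma + log Pi) / log(1 - p0*p0/8) */
static double transitions_needed(double gamma_rpd, double Pi, double p0)
{
    return (log(gamma_rpd) + log(Pi)) / log(1.0 - p0 * p0 / 8.0);
}

int main(void)
{
    double gamma_rpd = 0.1;       /* target relative pointwise distance */
    double Pi = 0.0005;           /* least likely joint state (two-node example) */
    double p0s[] = {0.1, 0.01, 0.001, 0.0005};
    /* p0 = 0.0005 approximately reproduces the ~3.2e8 transitions per trial
       quoted in the text for the two-node network. */
    for (int k = 0; k < 4; k++)
        printf("p0 = %-7g ->  about %.3g transitions\n",
               p0s[k], transitions_needed(gamma_rpd, Pi, p0s[k]));
    return 0;
}

The quadratic dependence on p₀ in the denominator is what makes near-deterministic links (probabilities close to 0 or 1) so damaging to the worst-case guarantee.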

3.1 DxNET Performance Measurements and Time Complexity

In this section, we study the performance of BN-RAS for the DxNET problem on a Sun-4 timesharing processor running SunOS, a version of 4.3bsd UNIX. We measured CPU time with the UNIX system call clock(), which returns the elapsed processor time in microseconds. Figures 7 and 8 demonstrate that the CPU time increases linearly with the number of trials and the number of transitions per trial, as expected. Those figures serve as nomograms for translating N and t into realistic CPU-consumption figures on the Sun-4.


[Figure 6 appears here: transitions required for mixing plotted against the smallest transition probability.]

Figure 6: This graph illustrates the crucial relationship between p0 and t, the number of transitions needed to guarantee an acceptable relative pointwise distance.

[Figure 7 appears here: Sun-4 CPU time versus number of trials N, with separate curves for t = 100, 200, 500, and 1000 transitions per trial.]

Figure 7: This graph illustrates that the computation time on a Sun-4 increases linearly with the number of trials.


[Figure 8 appears here: Sun-4 CPU time versus transitions per trial, with separate curves for N = 100, 200, 500, 1000, and 2000 trials.]

Figure 8: This graph shows the relationship between t, the number of transitions per trial, and the concomitant Sun-4 CPU usage.

Figure 9 illustrates the most crucial insight of this empirical study. The convergence depends not so much on the number of tabulated trials, but rather on the quality of those trials (as determined by the number of transitions per trial). In other words, if we had an ideal trial generator, we could expect very rapid convergence; inasmuch as the raw Markov chain reaches the stationary distribution only after many thousands of transitions, however, trial generation in DxNET poses the greatest difficulty. If we could somehow modify the Markov chain and thereby increase the rate at which it reaches the stationary distribution, or if we could compute an initial state from which the chain converges to the limit in just a few transitions, the rate of convergence of BN-RAS would greatly increase. The theoretical analysis suggests that the smallest transition probability in the Markov chain limits the rate of convergence in the worst case, as described in equation (4). For chains with large transition probabilities, we expect rapid convergence. For other networks, there is yet hope: BN-RAS and straight simulation, in contrast to the exact methods, require time linear in the number of nodes and outcomes, in the worst case. As the indegree of nodes grows, the size of cliques increases exponentially, and the Lauritzen-Spiegelhalter algorithm requires exponential time; as the loop cutset increases in size, Pearl's message-passing algorithm degrades exponentially [Suermondt and Cooper, 1988]. The analysis of BN-RAS, however, indicates that the latter remains insensitive to network topology in the worst case, and degrades only as the conductance falls. Two detailed graphs (Figures 10 and 11) make the point more cogently. Note the strong dependence of the error terms on t, and the absence of a close relationship between the error and N. These data suggest that the amount of computation required to guarantee a certain interval error depends most critically on the smallest transition probability in the network, and on very little else.

3.2 Comparison with Straight Simulation

BN-RAS generates t · N total transitions of the Markov chain, but then discards (t − 1) · N of those states and scores only N trials. In addition, the state generator shuffles the network into a random configuration at the beginning of each trial. We now compare the RAS to straight simulation [Pearl, 1987b, Pearl, 1987a], both for the two-node network and for the full DxNET. Figure 12 compares straight simulation to BN-RAS for the worst case, in which the Markov chain's behavior deteriorates because the conditional probabilities approach 0 and 1. By starting with a random configuration of the network and enumerating the transitions in a fixed order, straight simulation spends most of its time looping in one state; after many transitions, the simulation falls into the other state and stays there for many transitions. Until the simulation falls into the alternative state, that state remains invisible. Hence, for networks with low conductance, straight simulation can become mired in states that serve as sinks for the Markov chain's transition probabilities [Chin and Cooper, 1989]. BN-RAS, on the other hand, re-randomizes the chain between trials. We therefore expect the errors to converge more uniformly toward 0, without oscillating. Indeed, Figure 12 illustrates that BN-RAS converges almost immediately to the correct answer, and stays there. We observe, however, that randomization at the beginning of each trial is not, by any means, a consistently successful strategy for improving convergence. For the full DxNET, straight simulation and BN-RAS exhibit nearly identical convergence properties. The randomization step at the beginning of each trial in BN-RAS, and the temporally symmetric selection of transitions from an ergodic and time-reversible Markov chain, do not necessarily improve performance. Notice, however, that BN-RAS achieves the same performance as straight simulation, even though BN-RAS throws away the vast majority of its sampled states. Clearly, the generation of good trials (by performing many transitions) reduces error more dramatically than the scoring of many poor trials.
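To make the sink behavior concrete, here is a minimal sketch, not taken from the paper, of fixed-order Gibbs-style "straight simulation" on the two-node network of Figure 4, assuming no evidence so that both nodes are resampled. The chain spends long runs in one of the two high-probability joint states before flipping to the other, so a frequency estimate of P(A) wanders instead of settling quickly at its true value of 0.5.

```python
import random

# Two-node network of Figure 4: P(A) = 0.5, P(B|A) = 0.001, P(B|~A) = 0.999.
P_A = 0.5
P_B_GIVEN_A = {True: 0.001, False: 0.999}

def straight_simulation(sweeps, block=2000, seed=1):
    """Gibbs-style simulation: each sweep resamples A given B, then B given A."""
    rng = random.Random(seed)
    a = rng.random() < 0.5
    b = rng.random() < P_B_GIVEN_A[a]
    hits = 0
    for step in range(1, sweeps + 1):
        # Resample A given the current value of B (Bayes' rule over the two values of A).
        like_a = P_B_GIVEN_A[True] if b else 1.0 - P_B_GIVEN_A[True]
        like_not_a = P_B_GIVEN_A[False] if b else 1.0 - P_B_GIVEN_A[False]
        p_a = P_A * like_a / (P_A * like_a + (1.0 - P_A) * like_not_a)
        a = rng.random() < p_a
        # Resample B given the current value of A.
        b = rng.random() < P_B_GIVEN_A[a]
        hits += a
        if step % block == 0:
            print(f"P(A) estimated over sweeps 1..{step}: {hits / step:.3f}")

straight_simulation(20000)
# The running estimate drifts in long excursions toward 0 or 1 before slowly
# approaching the true marginal P(A) = 0.5, illustrating the sink behavior above.
```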

4. Discussion and Conclusions

Our investigations suggest that a precise analytic characterization of a randomized algorithm's properties can guide the search for more efficient approximations. We have shown, in particular, that the number of transitions per trial, and not the generation of a sufficient number of trials, constrains the precision of Monte Carlo approximations. Our results demonstrate that, for belief networks with transition probabilities bounded away from 0 and 1, randomized techniques offer acceptable performance. The mere generation of copious trials is not, however, likely to ensure success. A randomized algorithm that provides a priori convergence criteria, coupled with an extensive empirical analysis, can perform efficient probabilistic inference on large networks.


[Figure 9 appears here: DxNET average error versus time in seconds, with curves for N = 100, 200, 500, 1000, 2000, and 5000 trials.]

Figure 9: This graph illustrates an intriguing result: With just a few trials (on the order of 100) and many transitions per trial (on the order of 5,000 to 20,000), we can achieve rapid convergence of the average error toward 0.

[Figure 10 appears here: DxNET average error versus number of trials N, with curves for t = 100, 500, 1000, 5000, and 20000 transitions per trial.]

Figure 10: This graph plots the average error over all nodes against the number N of trials, for different values of t, the number of transitions per trial. Observe that t almost completely determines the convergence of the algorithm.

[Figure 11 appears here: DxNET average error versus transitions per trial, with curves for N = 100, 200, 500, 1000, 2000, and 5000 trials.]

Figure 11: This graph plots the average error against the number of transitions, for different N, the number of trials.

[Figure 12 appears here: average error versus time in seconds for BN-RAS and straight simulation on the two-node network.]

Figure 12: This graph compares the average errors for straight simulation and BN-RAS on the two-node belief network.

In addition, we have formally shown that the randomized algorithm requires time that is linear in the problem size, and polynomial in the error criteria (namely, the success probability δ and the interval error α). As the complexity of a belief network increases, randomized algorithms may offer the only tractable approach to probabilistic inference. We must, however, hasten to reiterate that the minimum transition probability p0 severely constrains the efficacy of randomized techniques. In addition, BN-RAS does not necessarily outperform straight simulation in raw computation. In contrast to straight simulation, however, BN-RAS offers a detailed convergence analysis and a priori bounds on running time. Finally, we outline a set of experiments in progress to characterize further the usefulness of randomized algorithms for probabilistic inference.

• We shall study the performance of the algorithm on networks of various topologies, with a particular emphasis on inference problems that cannot be solved efficiently by deterministic methods. Networks with large loop cutsets and large indegrees offer particularly severe tests of exact algorithms. We conjecture that, as long as the smallest transition probability stays the same, the RAS should remain insensitive to variations in topological structure.

• We shall study the algorithm's performance on networks of different sizes and roughly similar topology (of the same maximum indegree and cutset complexity), with the smallest transition probability held constant. We expect that the performance of BN-RAS should depend very little on the size of the network. Clearly, we must expend computational resources on the order of the network's area so as to propagate an inference from one end to the other; it seems reasonable to expect, however, that the transitions per trial will dominate the running time.

5. Acknowledgments

This work has been supported by grant IRI-8703710 from the National Science Foundation, grant P-25514-EL from the U.S. Army Research Office, Medical Scientist Training Program grant GM07365 from the National Institutes of Health, and grant LM-07033 from the National Library of Medicine. Computer facilities were provided by the SUMEX-AIM resource under grant RR-00785 from the National Institutes of Health.

References

[Beinlich et al., 1988] Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. (1988). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. Technical Report KSL-88-84, Medical Computer Science Group, Knowledge Systems Laboratory, Stanford University, Stanford, CA.
[Chavez, 1989a] Chavez, R. (1989a). A fully polynomial randomized approximation scheme for the Bayesian inferencing problem. Technical Report KSL-88-72, Knowledge Systems Laboratory, Stanford University, Stanford, CA.
[Chavez, 1989b] Chavez, R. (1989b). Randomized algorithms for probabilistic expert systems. PhD thesis, Knowledge Systems Laboratory, Stanford University, Stanford, CA. To appear.
[Chin and Cooper, 1989] Chin, H. L. and Cooper, G. F. (1989). Bayesian belief network inference using simulation. In Uncertainty in Artificial Intelligence 3, pages 129-148. North-Holland, Amsterdam.
[Cooper, 1987] Cooper, G. F. (1987). Probabilistic inference using belief networks is NP-hard. Technical Report KSL-87-27, Medical Computer Science Group, Knowledge Systems Laboratory, Stanford University, Stanford, CA.
[Garey and Johnson, 1979] Garey, M. R. and Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, NY.
[Jerrum and Sinclair, 1988] Jerrum, M. and Sinclair, A. (1988). Conductance and the rapid mixing property for Markov chains: The approximation of the permanent resolved. In Proceedings of the Twentieth ACM Symposium on Theory of Computing, pages 235-244.
[Karp and Luby, 1983] Karp, R. M. and Luby, M. (1983). Monte-Carlo algorithms for enumeration and reliability problems. In Proceedings of the Twenty-fourth IEEE Symposium on Foundations of Computer Science.
[Lauritzen and Spiegelhalter, 1987] Lauritzen, S. L. and Spiegelhalter, D. J. (1987). Fast manipulation of probabilities with local representations with applications to expert systems. Technical Report R-87-7, Institute of Electronic Systems, Aalborg University, Aalborg, Denmark.
[Pearl, 1986] Pearl, J. (1986). Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29:241-288.
[Pearl, 1987a] Pearl, J. (1987a). Addendum: Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 33:131.
[Pearl, 1987b] Pearl, J. (1987b). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32:245-257.
[Suermondt and Cooper, 1988] Suermondt, H. J. and Cooper, G. F. (1988). Updating probabilities in multiply connected belief networks. In Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 335-343, University of Minnesota, Minneapolis, MN. American Association for Artificial Intelligence.



Weighing and Integrating Evidence for Stochastic Simulation in Bayesian Networks

Robert Fung and Kuo-Chu Chang
Advanced Decision Systems
1500 Plymouth Street
Mountain View, California 94043-1230

Abstract

Stochastic simulation approaches perform probabilistic inference in Bayesian networks by estimating the probability of an event based on the frequency that the event occurs in a set of simulation trials. This paper describes the evidence weighting mechanism for augmenting the logic sampling stochastic simulation algorithm [Henrion, 1986]. Evidence weighting modifies the logic sampling algorithm by weighting each simulation trial by the likelihood of a network's evidence given the sampled state node values for that trial. We also describe an enhancement to the basic algorithm which uses the evidential integration technique [Chin and Cooper, 1987]. A comparison of the basic evidence weighting mechanism with the Markov blanket algorithm [Pearl, 1987], the logic sampling algorithm, and the evidence integration algorithm is presented. The comparison is aided by analyzing the performance of the algorithms in a simple example network.

1 Introduction

One of the newer approaches to inference in Bayesian networks is stochastic simulation. In this approach, the probability of an event of interest is estimated using the frequency that the event occurs in a set of simulation trials. This approach is approximate and has been recognized to be a valuable tool since it has been shown that exact probabilistic inference is NP-hard [Cooper, 1987]. Therefore, for networks which cannot be effectively addressed by exact methods [Kim and Pearl, 1985, Lauritzen and Spiegelhalter, 1988, Shachter, 1986], approximate inference schemes such as stochastic simulation are the only alternative for making inference computationally feasible. This paper describes a simple but promising mechanism, evidence weighting, for augmenting the logic sampling algorithm [Henrion, 1986]. Logic sampling has been shown to have one major drawback when dealing with evidence. The mechanism described in this paper has some capability to handle this problem without the introduction of other drawbacks (e.g., dealing with deterministic nodes).

When applying the logic sampling algorithm on networks with no evidence, sampling in each trial starts from the root nodes (i.e., nodes with no predecessors) and works down to the leaf nodes. The prior distribution of each root node is used to guide the choice of a sample value from the node's state space. Once a sample value for a node has been chosen, the sample value is "inserted" into the node's successors. The insertion of a value into a successor node "removes" the root node as a predecessor. Because Bayesian nets are acyclic, once any root node is sampled, it must leave the graph with at least one new "root" node (i.e., a node with no "uninstantiated" predecessors). This property insures that the process of sampling will continue until all nodes in a network are sampled. At the end of each trial, the count for each event of interest is updated. If the event occurs in the trial, then its count is incremented by a constant (e.g., 1). If the event does not occur in the trial then its count is not incremented. The probability of any event of interest represented in the network can be estimated based on its frequency of occurrence (e.g., the event's count divided by the number of trials).

The major disadvantage of the logic sampling approach is that it does not deal well with evidence. When there is evidence in a network, sampling proceeds as described above with the addition of an acceptance procedure. For each simulation run, the acceptance procedure checks whether the sampled values for the evidence nodes match the observed evidence. If the values do not match, the results of that simulation run are discarded. If the values do match, the trial is "valid" and the sampled values are used to update the count of each event of interest. In cases where there are either large numbers of observations or when the a priori probability of the observed evidence is low, the percentage of "valid" simulations is low and therefore the number of simulation runs needed may be quite large.

To overcome the disadvantage of the logic sampling algorithm, a "Markov blanket" approach [Pearl, 1987] has been proposed which modifies the algorithm by adding a pre-processing step to each simulation trial. This pre-processing step involves each node performing some local computation. This computation involves looking at its Markov blanket to determine a probability distribution for sampling. While the Markov blanket approach overcomes the major disadvantage of the logic sampling approach, by doing so, it has introduced some negative side effects. Most importantly, the convergence rate of this approach deteriorates when deterministic functions or highly dependent nodes are present in a network [Chin and Cooper, 1987]. Secondly, since the pre-processing step must take place for every node for every trial, the computation needed per trial for this approach is often greater than that for the simple logic-sampling approach.

Another approach which has been proposed to overcome the disadvantage of the logic sampling algorithm is "evidential integration" [Chin and Cooper, 1987]. Evidential integration is a pre-processing step in which the constraints imposed by a network's evidence are integrated into the network using the arc reversal operation.
Evidential integration transforms the network by an iterative application of Bayes' rule. The transformation can be expressed as follows:

P(X|E) = k P(E|X) P(X)    (1)

where X represents the states of the network, P(X) represents the a priori joint distribution of the network, P(E|X) represents the assessed likelihood of the evidence given all the states, P(X|E) represents the joint distribution of the new network in which the evidence has been integrated, and k is a normalization constant. The algorithm for evidential integration is as follows. For every evidence node, reverse arcs from predecessors which are not evidence nodes until the evidence node has no state node predecessors. Once a network has had this step performed, the logic sampling approach can be applied straightforwardly to estimate the posterior probability of any event of interest. This works because the evidential integration step creates the posterior distribution for the network. However, the evidential integration step may be expensive if the network is dense and heavily connected with the evidence.

In this paper, we present a mechanism called evidence weighting which keeps the advantages of the logic-sampling approach and removes its major disadvantage (i.e., dealing with evidence). In this algorithm, only the state nodes of a network are simulated in each trial. For each simulation trial, the likelihood of all the evidence given the sampled state values is used to increment the count of each event of interest. The estimated probability distribution is obtained by normalizing after all the simulation trials are complete. The primary drawback of the evidence weighting algorithm is that when evidence likelihoods are extremal, the algorithm reduces to logic sampling and may converge slowly. To handle this problem, we propose an extension of the evidence weighting method by incorporating the evidential integration mechanism. Examples and simulation results are given to illustrate the various algorithms.

This paper is organized as follows. Section 2 describes the evidence weighting method. Section 3 presents an extension of the evidence weighting method modified with the evidential integration mechanism. Section 4 presents the simulation results with the widely used example [Cooper, 1984]. A discussion of the algorithms through comparison with the Markov blanket simulation method [Pearl, 1987], the logic sampling algorithm, and the evidential integration algorithm is given in Section 5. Some concluding remarks are given in Section 6.

2 The Evidence Weighting Technique

In evidence weighting, the logic sampling algorithm is modified by considering a likelihood weight for each trial and sampling only state nodes. For each simulation trial, the likelihood of each piece of evidence given the sampled state node values is found. The product of these likelihoods is then used instead of a constant to increment the count of each event of interest. Referring back to equation (1), evidence weighting works by restricting sampling to the a priori distribution P(X) and then weighting each sample by the weight P(E|X), where P(E|X) can be easily obtained as:

P(E|X) = Π_{Ei ∈ E} P(Ei | C(Ei))    (2)

where C(Ei) are the direct predecessors of Ei. The justification of this procedure is straightforward. By sampling from the a priori distribution P(X) and weighting each sample by P(E|X), the posterior probability of an event of interest y, P(y|E), can be estimated as below:

P(y|E) ≈ k Σ_{i=1..N} P(E|xi) U(xi)    (3)

where xi is the realization of X in the i-th trial, U(xi) is 1 if xi ⊂ y and 0 otherwise, N is the total number of trials, and k is a normalization constant as in equation (1). The algorithm has what appears to be a substantial advantage over the logic sampling approach, since all trials are "valid" and contribute to the reduction of estimation error. However, if the likelihoods of all evidence nodes in a network are extreme (i.e., 1 or 0), this advantage will disappear and the evidence weighting mechanism will reduce to logic sampling.

Not all pieces of evidence will bear on all events of interest. The determination of which pieces of evidence bear on which events can result in the "sufficient information" [Shachter, 1986] needed to perform inference. This algorithm may converge more quickly (in trials) if the evidence weight for each event in the simulation trial only contains the likelihoods from pieces of evidence which bear on that event. This procedure may remove the "noise" created by the non-bearing evidence likelihoods. However, we have not studied the tradeoff between the extra computation required and the improvement in convergence rate, if any, to know in what situations this calculation will be useful.
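A minimal sketch of the evidence weighting estimator of equations (2)-(3), compared against the logic sampling acceptance procedure from Section 1, on a hypothetical two-node network A → B with B observed. The network and its numbers are invented for illustration and are not taken from the paper; every evidence-weighting trial contributes, whereas logic sampling discards trials whose sampled evidence value disagrees with the observation.

```python
import random

# Hypothetical network A -> B; we observe B = true and estimate P(A | B).
P_A = 0.3
P_B_GIVEN_A = {True: 0.9, False: 0.2}
TRUE_POSTERIOR = P_A * 0.9 / (P_A * 0.9 + (1 - P_A) * 0.2)   # = 0.6585...

def logic_sampling(trials, rng):
    """Forward sampling with the acceptance procedure: reject mismatching trials."""
    kept = count_a = 0
    for _ in range(trials):
        a = rng.random() < P_A
        b = rng.random() < P_B_GIVEN_A[a]
        if b:                        # evidence is B = true; otherwise discard the trial
            kept += 1
            count_a += a
    return (count_a / kept if kept else float("nan")), kept

def evidence_weighting(trials, rng):
    """Sample only the state node A; weight each trial by P(B = true | a)."""
    num = den = 0.0
    for _ in range(trials):
        a = rng.random() < P_A
        w = P_B_GIVEN_A[a]           # likelihood of the evidence given the sampled state
        num += w * a
        den += w                     # normalization, as in equation (3)
    return num / den, trials

rng = random.Random(0)
for name, fn in [("logic sampling", logic_sampling), ("evidence weighting", evidence_weighting)]:
    est, used = fn(5000, rng)
    print(f"{name:18s}: estimate {est:.3f} (true {TRUE_POSTERIOR:.3f}), trials used {used}")
```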

3 Evidence Weighting With Evidential Integration

As mentioned above, one major disadvantage of the evidence weighting technique is slow convergence in situations where the likelihoods of evidence are extremal. One way of handling this is to incorporate the evidential integration mechanism. As described in the introduction, evidential integration is a pre-processing step in which the evidence is integrated into the network using the arc reversal operation. By doing so, it may be possible to transform the network into one that will generate stochastic sample instances more efficiently [Chin and Cooper, 1987]. The idea is to convert the extremal likelihoods to likelihoods which are less extreme so that the random samples can be generated more uniformly among the different values. For example, given the network and the associated conditional probabilities shown in the left part of Figure 1, the convergence rate of the evidence weighting technique will be slow due to the extremal likelihood values and the conditional probabilities. Note that nodes A0 to AN are assumed to have binary values; their prior probabilities and the conditional probability of the B node given them are assumed to be uniform and are not given in the figure. By reversing only the arc between the evidence and its predecessor, we have integrated the evidence "partially" into the network. The resulting network and the new conditional probabilities are given in the right part of Figure 1. As can be seen, the evidence likelihoods now become much less extreme. Preliminary simulation results show that with the same evidence weighting technique, the convergence rate for the second network is at least about 100 times faster than for the first one. Note that in this example, we have only integrated the evidence "partially" into the network. This is to avoid the expensive arc reversal operation between the evidence and the B node. In practice, the "amount" of evidence integration would be determined by the trade-off between the fixed cost in arc reversal operations and the dynamic costs in the simulation under different convergence rates.


Figure 1: Evidential Integration Example. In the left-hand network, binary nodes A0, ..., AN are predecessors of node B, B is the predecessor of node C, and the evidence node EV has C as its only predecessor. The associated conditional probabilities are:

    P(C | B):          C0       C1       C2
        B0             0.99     0.001    0.009
        B1             0.001    0.004    0.995

    P(EV | C):         C0       C1       C2
                       0.001    0.999    0.0

In the right-hand network, obtained by reversing the arc between C and EV, the corresponding probabilities become:

    P(EV | B):         B0       B1
                       0.002    0.004

    P(C | B, EV):      C0       C1       C2
        B0             0.4977   0.5023   0.0
        B1             0.0003   0.9997   0.0
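A minimal sketch, not the authors' code, of the local Bayes'-rule computation that the arc reversal in Figure 1 performs. The inputs are read off the left-hand tables of Figure 1, and the outputs reproduce the right-hand tables.

```python
# Left-hand network quantities from Figure 1.
P_C_given_B = {                                   # P(C | B), indexed [b][c]
    "B0": {"C0": 0.99,  "C1": 0.001, "C2": 0.009},
    "B1": {"C0": 0.001, "C1": 0.004, "C2": 0.995},
}
P_EV_given_C = {"C0": 0.001, "C1": 0.999, "C2": 0.0}   # P(EV = observed | C)

# Reversing the arc between C and EV applies Bayes' rule locally:
#   P(EV | B)    = sum_c P(EV | c) P(c | B)
#   P(C | B, EV) = P(EV | C) P(C | B) / P(EV | B)
P_EV_given_B = {}
P_C_given_B_EV = {}
for b, dist in P_C_given_B.items():
    p_ev = sum(P_EV_given_C[c] * dist[c] for c in dist)
    P_EV_given_B[b] = p_ev
    P_C_given_B_EV[b] = {c: P_EV_given_C[c] * dist[c] / p_ev for c in dist}

print(P_EV_given_B)        # ~{'B0': 0.002, 'B1': 0.004}
print(P_C_given_B_EV)      # B0 -> ~(0.498, 0.502, 0.0); B1 -> ~(0.0003, 0.9997, 0.0)
```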

[Figure 2 appears here: the example network with nodes (A) metastatic cancer, (B) increased total serum calcium, (C) brain tumor, (D) coma, and (E) severe headache.]

Figure 2: Example Network

    P(A):        P(a) = 0.20
    P(B|A):      P(b|a) = 0.80      P(b|¬a) = 0.20
    P(C|A):      P(c|a) = 0.20      P(c|¬a) = 0.05
    P(D|B,C):    P(d|b,c) = 0.80    P(d|b,¬c) = 0.80    P(d|¬b,c) = 0.80    P(d|¬b,¬c) = 0.05
    P(E|C):      P(e|c) = 0.80      P(e|¬c) = 0.60

Table 1: Probability Distributions of Example Network

4 Example

In this section, the basic evidence weighting mechanism described in Section 2, along with the "Markov blanket" algorithm, the logic sampling algorithm, and the evidential integration algorithm, are simulated with the following simple example [Cooper, 1984]. Metastatic cancer is a possible cause of a brain tumor and is also an explanation for increased total serum calcium. In turn, either of these could possibly explain a patient falling into a coma. Severe headache is also possibly associated with a brain tumor. Assume that we observe that a particular patient has severe headaches but not a coma. Figure 2 shows a Bayesian network which represents these relationships. Table 1 characterizes the quantitative relationships between the variables in the network.

We used the four simulation algorithms on this network to produce some anecdotal evidence for the promise of the evidence weighting mechanism. The simulations were implemented in QUANTA, a Bayesian network research and development environment which is written in CommonLisp and runs on SUN workstations. The data was obtained for five different numbers of trials per simulation run: 100, 200, 500, 1000, and 2000 trials. For each setting and for each algorithm, 100 simulation runs were performed. As an assessment measure, we have used the average accumulated absolute error defined as below:

Error = (1/N) Σ_{i=1..N} Σ_x Σ_j | p̂i(xj) − p(xj) |    (4)

where x is a state node, p̂i(xj) is the estimated posterior probability of the i-th run for the state value xj, p(xj) is the true posterior probability of xj, and N is the total number of simulation runs. Other assessment measures have been suggested which emphasize different characteristics of the estimate.

Figures 3 through 7 show the results from estimating the posteriors with each of the four simulation algorithms. In Figure 3, the average accumulated absolute error over all the nodes in the network is plotted against the number of simulation trials. It can be seen from the figure that the errors decrease as the number of trials increases. In fact, the errors decrease approximately in proportion to the inverse of the square root of the number of trials (see Figure 4). Theoretically, this can easily be shown using the Central Limit Theorem. Figure 5 shows the relationship between the standard deviation of the estimates and the number of simulation trials per run, where the standard deviation is defined as the square root of the difference between the average squared error and the square of the average error. As expected, it can be seen that the deviations decrease as the number of simulation trials increases. Figure 6 shows the comparison of average run time per trial for each of the four algorithms. The effect of the fixed cost of evidential integration can be seen in the runs with smaller numbers of trials. It can also be seen that the "Markov blanket" approach has by far the highest cost per trial of any technique. This disparity will grow larger with the size of the network. It can be seen that the other three algorithms have very similar costs per trial. This is obvious since the algorithms have similar sampling mechanisms. Combining Figures 3 and 6, Figure 7 shows the estimation error versus run time for the four algorithms. It can be clearly seen that for this network the evidence weighting and evidential integration mechanisms are significantly more accurate for a given amount of computation time.
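Because the network of Figure 2 is tiny, the true posteriors p(xj) appearing in equation (4) can be computed exactly by enumerating the joint distribution. A minimal sketch using the Table 1 numbers, with evidence E = severe headache present and D = coma absent:

```python
from itertools import product

# Conditional probabilities from Table 1.
P_a = 0.20
P_b = {True: 0.80, False: 0.20}                     # P(b | a)
P_c = {True: 0.20, False: 0.05}                     # P(c | a)
P_d = {(True, True): 0.80, (True, False): 0.80,     # P(d | b, c)
       (False, True): 0.80, (False, False): 0.05}
P_e = {True: 0.80, False: 0.60}                     # P(e | c)

def joint(a, b, c, d, e):
    def f(p, x):
        return p if x else 1.0 - p
    return (f(P_a, a) * f(P_b[a], b) * f(P_c[a], c)
            * f(P_d[(b, c)], d) * f(P_e[c], e))

# Evidence: severe headache (E = true) but no coma (D = false).
posterior = {v: 0.0 for v in "abc"}
norm = 0.0
for a, b, c in product([True, False], repeat=3):
    p = joint(a, b, c, False, True)
    norm += p
    for name, val in zip("abc", (a, b, c)):
        if val:
            posterior[name] += p

for name in "abc":
    print(f"P({name} | e, ~d) = {posterior[name] / norm:.4f}")
```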

5 Discussion

In this section, the salient features of the evidence weighting mechanism are discussed. The major advantages are three-fold. First, the mechanism has the ability to handle deterministic functions, which are the major drawback of the Markov blanket approach. Since the mechanism samples only in a "causal" direction, the presence of deterministic functions will not affect its operation. Second, the evidence weighting mechanism is relatively simple. This not only allows it to be easily understood and implemented but also allows other mechanisms to be easily incorporated. And thirdly, since all the trials are used for calculating posteriors, the mechanism seems to have good convergence properties for a large class of networks. Although the evidential integration algorithm also has these advantages, it has the fixed cost of the evidential integration step as its major disadvantage. For certain networks, the arc reversal operations used in the evidential integration process may dramatically increase the required computation and memory.

The main disadvantage of the evidence weighting algorithm is that it will converge slowly in the situations where the likelihoods for evidence are extremal. In these situations, the algorithm reduces to the logic sampling algorithm. We have shown that by incorporating the evidential integration mechanism, these types of problems can be avoided. Another recent study [Shachter and Peot, 1989] shows that combining evidence weighting with other mechanisms (e.g., Markov blanket) may also be useful. It appears from these research results that evidence weighting is a promising algorithm. Since there seems to be a large number of performance analysis methods in use, further research in both theoretical and simulation methods would be useful. Some new techniques for convergence analysis of simulation methods have recently been put forward [Chavez, 1989]. Such analysis may be able to confirm the usefulness of the simulation mechanism proposed by this paper.


[Figure 3 appears here: average accumulated absolute error plotted against the number of trials for the four algorithms.]

Figure 3: Error versus Number of Trials

[Figure 4 appears here: log(error) versus log(number of trials) for logic sampling, Markov blanket, evidential integration, and evidence weighting.]

Figure 4: Log(Error) versus Log(Number of Trials)


[Figure 5 appears here: standard deviation of the estimates versus number of trials.]

Figure 5: Standard Deviation versus Number of Trials

[Figure 6 appears here: average run time per trial versus number of trials for Markov blanket, logic sampling, evidential integration, and evidence weighting.]

Figure 6: Average Run Time per Trial


[Figure 7 appears here: estimation error versus run time for Markov blanket, logic sampling, evidential integration, and evidence weighting.]

Figure 7: Error versus Run Time

6 Conclusions

In this paper, we have presented a simple but promising mechanism for stochastic simulation, evidence weighting. The use of this mechanism appears to have many advantages. It is able to deal with deterministic variables, and the cost of each sample run is relatively low. The basic evidence weighting mechanism has the drawback of converging to the logic sampling algorithm under certain conditions. We have proposed the combination of evidential integration and evidence weighting to avoid this drawback. These results are preliminary; further research is needed in the area of convergence analysis. Since probabilistic inference in Bayesian networks is computationally hard, we believe no one algorithm will be able to perform optimally in every situation (e.g., time constraints, accuracy goals, network topology). Instead, specialized algorithms are needed to fill "market" niches (e.g., singly-connected networks, non-exact inference), and intelligent meta-level control mechanisms are needed to match situations (e.g., network topology, network distributions and desired accuracy) with algorithms.

Acknowledgement: The authors wish to thank Ross Shachter, Mark Peot, and Shozo Mori for fruitful discussions and comments.

References

[Chavez, 1989] Chavez, R. M. (1989). Hypermedia and Randomized Algorithms for Probabilistic Expert Systems. PhD thesis, Stanford University, Stanford, California, dissertation proposal.
[Chin and Cooper, 1987] Chin, H. and Cooper, G. F. (1987). Stochastic simulation of Bayesian networks. In Proceedings of the Third Workshop on Uncertainty in Artificial Intelligence, Seattle, Washington.
[Cooper, 1984] Cooper, G. F. (1984). NESTOR: A computer-based medical diagnostic aid that integrates causal and probabilistic knowledge. PhD thesis, Stanford University, Stanford, California.
[Cooper, 1987] Cooper, G. F. (1987). Probabilistic inference using belief networks is NP-hard. Report KSL-87-27, Medical Computer Science Group, Stanford University.
[Henrion, 1986] Henrion, M. (1986). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Lemmer, J. and Kanal, L., editors, Uncertainty in Artificial Intelligence. Amsterdam: North-Holland.
[Kim and Pearl, 1985] Kim, J. H. and Pearl, J. (1985). A computational model for combined causal and diagnostic reasoning in inference systems. In Proceedings of the 8th International Joint Conference on Artificial Intelligence, Los Angeles, California.
[Lauritzen and Spiegelhalter, 1988] Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application in expert systems. Journal of the Royal Statistical Society B, 50.
[Pearl, 1987] Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 29.
[Shachter, 1986] Shachter, R. D. (1986). Intelligent probabilistic inference. In Kanal, L. and Lemmer, J., editors, Uncertainty in Artificial Intelligence. Amsterdam: North-Holland.
[Shachter and Peot, 1990] Shachter, R. D. and Peot, M. (1990). Simulation approaches to general probabilistic inference on belief networks. In this volume.



SIMULATION APPROACHES TO GENERAL PROBABILISTIC INFERENCE ON BELIEF NETWORKS

Ross D. Shachter
Department of Engineering-Economic Systems, Stanford University
Terman Engineering Center, Stanford, CA 94305-4025
[email protected]

and

Mark A. Peot
Department of Engineering-Economic Systems, Stanford University and Rockwell International Science Center, Palo Alto Laboratory
444 High Street, Suite 400, Palo Alto, CA 94301
[email protected]

A number of algorithms have been developed to solve probabilistic inference problems on belief networks. These algorithms can be divided into two main groups: exact techniques which exploit the conditional independence revealed when the graph structure is relatively sparse, and probabilistic sampling techniques which exploit the "conductance" of an embedded Markov chain when the conditional probabilities have non-extreme values. In this paper, we investigate a family of "forward" Monte Carlo sampling techniques similar to Logic Sampling [Henrion, 1988] which appear to perform well even in some multiply connected networks with extreme conditional probabilities, and thus would be generally applicable. We consider several enhancements which reduce the posterior variance using this approach and propose a framework and criteria for choosing when to use those enhancements.

1. Introduction

Bayesian belief networks or influence diagrams are an increasingly popular representation for reasoning under uncertainty. Although a number of algorithms have been developed to solve probabilistic inference problems on these networks, these prove to be intractable for many practical problems. For example, there are a variety of exact algorithms for general networks, using clique join trees [Lauritzen and Spiegelhalter, 1988], conditioning [Pearl, 1986] or arc reversal [Shachter, 1986]. All of these algorithms are sensitive to the connectedness of the graph, and even the first, which appears to be the fastest, quickly grows intractable for highly connected problems. This is not surprising, since the general problem is NP-hard [Cooper, 1990]. Alternatively, several Monte Carlo simulation algorithms [Chavez, 1990; Chavez and Cooper, 1990; Henrion, 1988; Pearl, 1987] promise polynomial growth in the size of the problem, but suffer from other limitations. Convergence rates for Logic Sampling [Henrion, 1988] degrade exponentially with the number of pieces of evidence. The performance of Markov chain algorithms such as [Chavez, 1990; Chavez and Cooper, 1990; Pearl, 1987] may degrade rapidly if there are conditional probabilities near zero [Chavez, 1990; Chin and Cooper, 1989].

The goal of this research is to develop simulation algorithms which are suitable for a broad range of problem structures, including problems with multiple connectedness, extreme probabilities and even deterministic logical functions. Most likely, these algorithms will not be superior for all problems, but they do seem promising for reasonable general purpose use. In particular, there are several enhancements which can be adaptively applied to improve their performance in a problem-sensitive manner. Best of all, the algorithms described in this paper lend themselves to simple parallel implementation, and can, like nearly all simulation algorithms, be interrupted at "anytime," yielding the best solution available so far.

2. The Algorithms

Let the nodes in a belief network be the set N = {1, ..., n}, corresponding to random variables XN = {X1, ..., Xn}. Of course, the network is an acyclic directed graph. Each node j has a set of parents C(j), corresponding to the conditioning variables XC(j) for the variable Xj. Similarly, S(j) is the set of children of node j, corresponding to the variables XS(j) which are conditioned by Xj. We assume that the observed evidence is XE = x*E, where E ⊂ N, and that we are primarily interested in the posterior marginal probabilities P{Xj | x*E} for all j ∉ E. We will use a lower case "x" to denote the particular value which variable X assumes.

½, and 'p=0' otherwise (if their true belief is ½ then they could quote any value for p as they will always expect to score ½). In this paper we shall turn the imprecise assessments into precise assessments to be judged by the Brier scoring rule. This apparent loss of information can be well justified when it comes to processing a single case. Spiegelhalter and Lauritzen (1990) have shown that if the conditional probabilities in a belief network are themselves considered as imprecise quantities with their own attached uncertainty, then in processing a case this imprecision should be ignored and a single 'mean' probability value used in order to derive appropriate evidence propagation on the case in hand. (We emphasise, however, that the evidence extracted from that case will in turn revise the conditional probability to be adopted in processing the next case). Furthermore, suppose we had some distribution over our subjective probability p for an event e, and had to choose a representative value P, knowing it was to be criticised according to the Brier score. Then it is straightforward to show that our expected contribution to the overall score for that question is minimised if we choose P to be the mean of our distribution. We therefore consider only the adequacy of the midpoints of the intervals provided for each of the events, although later we shall discuss how the imprecision becomes important if we allow the assessments to adapt as data accumulates.

If we consider Table 1, we see, for example, that the 'precise' probability vector p for the question 'grunting?' in aortic stenosis is simply (0.10, 0.90). However, a problem arises if the experts have given intervals for a set of responses to a question whose midpoints do not add to 1, as, for example, is the case for 'main problem?' in aortic stenosis and hypoplastic left heart (and in many other instances not shown here). In this situation the midpoints of the intervals have been simply rescaled to add to 1; for example, 'main problem?' in aortic stenosis has unadjusted probability vector (0, 0.925, 0.045, 0, 0), which is rescaled to (0, 0.954, 0.046, 0, 0).

We now consider the calculation of the Brier score, and use the single case of aortic stenosis presenting with asymptomatic murmur as an example. We have p = (0, 0.954, 0.046, 0, 0), e = (0, 0, 1, 0, 0) and hence B = ½ (0 + 0.954² + (1 − 0.046)² + 0 + 0) = 0.9101 for that question. These scores may be analyzed in many different ways. The mean score, on a total of 3944 questions asked, was 0.12, while the mean score within disease varied from 0.08 to 0.22. However a disease may score highly simply because it does not present in a clearcut way and hence many probabilities near 0.5 are given to events, which will inevitably lead to a poor score.
In order to measure the quality of the probability assessments, as opposed to the distinguishability of the disease, we clearly have to analyze the score more carefully.
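A minimal sketch of the Brier score computation just illustrated: rescale the interval midpoints so that they sum to 1, then score them against the observed response vector. The numbers are the 'main problem?' in aortic stenosis example from the text.

```python
def rescale(midpoints):
    """Rescale interval midpoints so that they sum to 1."""
    total = sum(midpoints)
    return [m / total for m in midpoints]

def brier(p, e):
    """Brier score B = 1/2 * sum_i (e_i - p_i)^2 for one question."""
    return 0.5 * sum((ei - pi) ** 2 for pi, ei in zip(p, e))

print([round(x, 3) for x in rescale([0.0, 0.925, 0.045, 0.0, 0.0])])
# -> [0.0, 0.954, 0.046, 0.0, 0.0], the rescaled vector used in the text

p = [0.0, 0.954, 0.046, 0.0, 0.0]   # rescaled midpoints
e = [0, 0, 1, 0, 0]                 # observed response: asymptomatic murmur (third category)
print(round(brier(p, e), 4))        # 0.9101
```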

4. Discrimination and Reliability

There has been extensive work on decompositions of scoring rules into terms that reflect different aspects of an assessor's skill; see, for example, Yates (1982) and DeGroot and Fienberg (1983). Here we shall consider simple decompositions into a term that expresses 'lack of discrimination' (how far away from 0 and 100 the assessments are) and 'lack of reliability' (how untrustworthy the judgements are). Good reliability means, for example, that if an assessor provides a probability of .7 for a response, then about 70% of the questions will result in that response. Reliability appears more important than discrimination in the context of describing accurately the presentation of a disease. Hilden et al (1978) provide a general method of obtaining a decomposition. Under the hypothesis that our experts provide perfectly reliable judgements, the Brier score for a single question on a single case has expectation

E0(B) = ½ E0[Σ (ei − pi)²] = ½ E0[1 − 2pr + Σ pi²] = ½ (1 − Σ pi²)

Deviations of B from this figure measure lack of reliability, and since E0(pr) = Σ pi², we denote as R the statistic

R = B − E0(B) = ½ [1 − 2pr + Σ pi² − 1 + Σ pi²] = Σ pi² − pr.

The mean of R will be positive if the assessor tends to make too extreme probability judgements (over-confidence), and be negative if his assessments are not extreme enough (diffidence).
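A minimal sketch continuing the aortic stenosis example above: R is simply Σ pi² − pr, and E0(B) = ½(1 − Σ pi²) is the discrimination term that remains of the Brier score.

```python
def decompose_brier(p, r):
    """Split the Brier score for one question into E0(B) (lack of discrimination)
    and R = B - E0(B) (lack of reliability).
    p: assessed probability vector; r: index of the observed response."""
    sum_sq = sum(pi ** 2 for pi in p)
    b = 0.5 * (1.0 - 2.0 * p[r] + sum_sq)     # Brier score for this question
    e0_b = 0.5 * (1.0 - sum_sq)               # expectation under perfect reliability
    return b, e0_b, sum_sq - p[r]             # R = B - E0(B) = sum pi^2 - pr

b, e0_b, r_stat = decompose_brier([0.0, 0.954, 0.046, 0.0, 0.0], r=2)
print(round(b, 4), round(e0_b, 4), round(r_stat, 4))   # 0.9101 0.0439 0.8662
```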

[Figure 1 appears here: scatterplot of mean discrimination score (vertical axis, optimal to poor) against mean reliability score (horizontal axis, optimal to poor), one point per disease.]

Figure 1. Mean discrimination and reliability scores of probability assessments, when grouped according to disease (see text for explanation of abbreviations).

Figures 1 and 2 show how the mean discrimination and reliability vary between the 27 diseases and the 24 questions that are asked. Disease assessments showing poor discrimination and good reliability, such as PFC (persistent foetal circulation), are characterised by having a varied presentation (probabilities not near 0 and 100) which is well-understood by the experts. Diseases such as AVSD+ (AV septal defect + outflow obstruction) and NUHD (non-urgent heart disease) are perceived as having a very consistent presentation (good discrimination) but this perception seems to be at fault. Questions such as 'heart failure?' display poor discrimination and excellent reliability, in that there is a wide variation in its incidence among diseases, but these incidences are known reasonably well.

[Figure 2 appears here: scatterplot of mean discrimination score against mean reliability score, one point per question; 'lung fields on chest X-ray?', 'heart failure?' and 'heart rate?' are labelled.]

Figure 2. Mean discrimination and reliability scores for probability assessments, when grouped according to question.

'Heart rate?' is a question with a categorical response in each disease, which is correctly known. However, 'lung fields on chest X-ray?' shows wide variation in response within diseases, and these variations in presentation are not well-estimated by the experts. Plots such as Figures 1 and 2 identify where further questioning of the experts may be necessary. A graphical representation of reliability is valuable and a number of suggestions have been made (Hilden et al, 1978). The simplest method is to group the probability assessments, for all responses to all questions in all diseases, into a small set of categories x1, ..., xk. If assessments are assumed only to be made at these discrete values, then we can count the total number nk of times a probability xk has been given to a prospective event, and the fraction fk of nk in which these events occurred. Figure 3 shows a plot of fk against xk, where assessments are considered in twelve groups: 0%, 1-10%, 11-20%, ..., 90-99%, 100%. This brings home the striking reliability of the assessors. One problem is that out of 3210 events that were given probability zero, 74 actually occurred. A possible solution to this problem is given in the discussion.
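A minimal sketch of the grouping behind Figure 3: bin each assessment, count nk per bin, and compute the observed fraction fk of events that occurred. The assessment/outcome pairs below are invented purely for illustration.

```python
from collections import defaultdict

# (assessed probability in %, event occurred?) -- invented illustrative data.
assessments = [(0, False), (5, False), (5, True), (50, True), (50, False),
               (95, True), (100, True), (100, True)]

def bin_of(p):
    """Map a percentage to one of twelve groups: 0, 1-10, 11-20, ..., 91-99, 100."""
    if p in (0, 100):
        return (p, p)
    lo = ((p - 1) // 10) * 10 + 1
    return (lo, 99 if lo == 91 else lo + 9)

counts = defaultdict(lambda: [0, 0])          # bin -> [n_k, number occurred]
for p, occurred in assessments:
    tally = counts[bin_of(p)]
    tally[0] += 1
    tally[1] += int(occurred)

for (lo, hi), (n_k, hits) in sorted(counts.items()):
    label = f"{lo}%" if lo == hi else f"{lo}-{hi}%"
    print(f"{label:>7}: n_k = {n_k}, f_k (observed fraction) = {hits / n_k:.2f}")
```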

[Figure 3 appears here: observed frequency fk plotted against the predicted probability of event xk, with the observed reliability curve lying close to the line of perfect reliability.]

Figure 3. Overall reliability of subjective probability assessments; for example, of 297 events given probability 50%, 131 (44%) actually occurred. (Midpoints of original consensus probability intervals have been used.)

5. Learning from experience

As mentioned in the introduction, the imprecision expressed in the ranges for the probability assessments is crucial in providing a mechanism for the experts' judgements (however good they may appear to be) to be tempered by experience. Essentially we assume there is some 'true' probability vector p appropriate to a particular question asked of a patient with a particular disease. The experts provide a probability distribution f(p) for the vector, and when data D is observed, the opinion about p is revised by Bayes theorem to be f(p|D) ∝ f(D|p) f(p). Spiegelhalter and Lauritzen (1990) discuss a number of different models for f(p), and we only consider the simplest here. It is well known that if f(p) is assumed to be a Dirichlet distribution (a beta distribution where questions have only two responses) then the prior opinion is equivalent to an 'imaginary' sample of cases that represents the experts' experience; Bayes theorem simply updates that imaginary sample with real cases. In effect the conditional probabilities are stored as fractions of whole numbers, rather than single numbers between 0 and 1. Let us illustrate using the assessments in Table 1 for 'grunting?' in hypoplastic left heart. The expert assessment of 30-40% can be thought of as summarising a distribution, and we shall, somewhat arbitrarily, assume that there is about a 2 to 1 chance that the true frequency lies in that interval. This interpretation is equivalent to assuming the interval is a one standard error interval calculated after observing n cases of the disease of which a fraction 0.35 displayed grunting, where (0.30, 0.40) = 0.35 ± √{0.35(1 − 0.35)/n} using standard binomial statistical theory. Hence n = 0.35 × 0.65/0.05² = 91. Thus we take our experts' opinion as equivalent to having observed 32 instances of 'grunting' among 91 cases of hypoplastic left heart (HLH).

The probability 32/91 = 0.352 will be used for the next case, but if this happens to be HLH with grunting, the fraction will increase to 33 out of 92. The probability 33/92 = 0.359 will then be used for the next case. We note that adaptation may be slow when the experts are initially reasonably confident, and from Table 2 we see that after a year's worth of 200 cases the fraction has only shifted to (32+7)/(91+19) = 0.355, reflecting the accurate initial assessment. If, in contrast, the experts acknowledge their uncertainty about the probability and provide a wide initial range, adaptation will be considerably faster. A certain arbitrariness exists in dealing with more than two responses in which some assessments are more precise than others. Currently we use the formula n = midpoint × (1 − midpoint)/(half the range of the interval)² for each response, and adopt the lowest n as the implicit sample size underlying the expert judgment, i.e. we fix on the most imprecise assessment. Table 3 shows the results of learning about the aortic stenosis judgements shown in Table 1.
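Before Table 3, here is a minimal sketch of the updating scheme just described: turn an interval into an implicit sample size via n = m(1 − m)/(half-range)², treat the assessment as imaginary counts, and let each observed case add one to the relevant count. The grunting-in-HLH numbers follow the text.

```python
def implicit_sample_size(midpoint, half_range):
    """One-standard-error interpretation: n = m(1 - m) / (half-range)^2."""
    return midpoint * (1.0 - midpoint) / half_range ** 2

# 'Grunting?' in hypoplastic left heart: interval 30-40% -> midpoint 0.35, half-range 0.05.
n = implicit_sample_size(0.35, 0.05)
print(round(n))                               # 91 imaginary cases

# The experts' opinion is treated as 32 instances of grunting among 91 cases.
counts = {"yes": 32, "no": 59}

# Beta/Dirichlet updating: each real case simply adds one to the relevant count.
counts["yes"] += 1                            # the next case is HLH with grunting
print(counts["yes"] / sum(counts.values()))   # 33/92 = 0.3587..., used for the next case
```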

Question         Feature                 Initial       Initial   Implicit   Observed   Combined   Combined
                                         judgements    probs     sample     sample     sample     probs

Main problem?    Cyanosis                0-0%          0.0        0.0        1           1.0       0.014
                 Heart failure           90-95%        0.954     65.8        2          67.8       0.929
                 Asymptomatic murmur     2-7%          0.046      3.2        1           4.2       0.058
                 Arrhythmia              0-0%          0.0        0.0        0           0.0       0.0
                 Other                   0-0%          0.0        0.0        0           0.0       0.0
                 (total)                               1.000     69.0        4          73.0       1.000

Grunting?        Yes                     5-15%         0.10       3.6        0           3.6       0.090
                 No                      85-95%        0.90      32.4        4          36.4       0.910
                 (total)                               1.00      36.0        4          40.0       1.000

Table 3. Initial judgements transformed to point probabilities, and then to implicit samples. When combined with the observed data the revised conditional probabilities are obtained.

We note that fractional implicit samples are quite reasonable. Once again, the major problem is the confidence expressed by 0-0% probability ranges. If, as in Table 3, other options are given less extreme judgements, then this procedure will revise an initial 0-0% interval. However, if all possible answers to a question are given 0-0% or 100-100% assessments, then the implicit sample size is, strictly speaking, infinity and no learning can take place. We are currently experimenting with a range of 'large' sample sizes to use in place of infinity.

6. Discussion

We have briefly described a procedure for criticising and improving upon imprecise subjective probability assessments. Many issues remain open for further investigation. From a practical viewpoint, the pilot study has shown that reliable probability assessments can be obtained from experts, although there is a tendency for them to be too extreme in their judgements. Since the observations have been obtained over the telephone, we might expect additional variation to be presented which could explain this over-confidence, and our next task is to contrast these results with data obtained at the specialist hospital. Our results contrast with those of Leaper et al (1972), who criticised the use of clinicians' probability assessments for a simple network representing acute abdominal pain, although they did not explicitly identify the kind of errors made by the clinicians and only compared final diagnostic performance. One explanation is the different clinical domain, in that compared with acute abdominal pain the precise presentation of babies with congenital heart disease has been carefully studied. From a technical perspective we need to investigate appropriate means of dealing with 'zero' assessments (currently our opinion is that they all should be treated as 0-4% on the basis of our empirical discovery of a 2% error rate). We are also examining alternative means of transforming intervals to implicit samples, so that the adequacy of our 'one standard error interval' assumption can be tested. Strictly, the correct means of scoring these imprecise assessments is to introduce the data in single cases, updating probabilities sequentially, and use the revised judgements for each new case. This needs to be investigated for a number of different scoring rules, and tools developed for early automatic identification of poor initial assessments. This 'prequential' approach (predictive sequential) has been pioneered by Dawid (1984), and promises to form the basis for automatic monitoring of the performance of predictive expert systems.

Acknowledgments

We are grateful to Philip Dawid for valuable discussions, and to the British Heart Foundation and the Science and Engineering Research Council for support.

References

Dawid A P (1984). Statistical theory - the prequential approach. J. R. Stat. Soc. A, 147, 277-305.
Dawid A P (1986). Probability forecasting, Encyclopedia of Statistical Sciences, Vol 7, (eds Kotz and Johnson), J Wiley: New York, pp. 210-218.
DeGroot M H & Fienberg S E (1983). The comparison and evaluation of forecasters. Statistician, 32, 12-22.
Franklin R C G, Spiegelhalter D J, Macartney F and Bull K (1989). Combining clinical judgements and statistical data in expert systems: over the telephone management decisions for critical congenital heart disease in the first month of life. Intl. J. Clinical Monitoring and Computing, 6, 157-166.
Hilden J, Habbema J D F & Bjerregaard B (1978). The measurement of performance in probabilistic diagnosis, III - methods based on continuous functions of the diagnostic probabilities. Methods of Information in Medicine, 17, 238-346.
Lauritzen S L & Spiegelhalter D J (1988). Local computation with probabilities on graphical structures, and their application to expert systems (with discussion). J. Roy. Stat. Soc. B, 50, 157-244.
Leaper D J, Horrocks J C, Staniland J R and de Dombal F T (1972). Computer-assisted diagnosis of abdominal pain using 'estimates' provided by clinicians. British Medical Journal, 4, 350-354.
Murphy A H & Winkler R L (1977). Reliability of subjective probability forecasts of precipitation and temperature. Applied Statistics, 26, 41-47.
Murphy A H & Winkler R L (1984). Probability forecasting in meteorology, J. Amer. Statist. Assoc., 79, 489-500.

Pearl J (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, California.
Shapiro A R (1977). The evaluation of clinical predictions. A method and initial application, New England Journal of Medicine, 296, 1509-1514.
Spiegelhalter D J and Lauritzen S L (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, (to appear).
Yates J F (1982). External correspondence: Decomposition of the mean probability score. Organisational Behaviour and Human Performance, 30, 132-156.
Zagoria R J & Reggia J A (1983). Transferability of medical decision support systems based on Bayesian classification, Medical Decision Making, 3, 501-510.



Automated construction of sparse Bayesian networks from unstructured probabilistic models and domain information

Sampath Srinivas
Rockwell Science Center, Palo Alto Laboratory
444 High Street
Palo Alto, CA 94301

Stuart Russell
Computer Science Division
University of California
Berkeley, CA 94720

Alice Agogino
Department of Mechanical Engineering
University of California
Berkeley, CA 94720

Abstract

An algorithm for automated construction of a sparse Bayesian network given an unstructured probabilistic model and causal domain information from an expert has been developed and implemented. The goal is to obtain a network that explicitly reveals as much information regarding conditional independence as possible. The network is built incrementally, adding one node at a time. The expert's information and a greedy heuristic that tries to keep the number of arcs added at each step to a minimum are used to guide the search for the next node to add. The probabilistic model is a predicate that can answer queries about independencies in the domain. In practice the model can be implemented in various ways. For example, the model could be a statistical independence test operating on empirical data or a deductive prover operating on a set of independence statements about the domain.


1 Introduction

Bayes' belief networks (influence diagrams with only chance nodes) are usually constructed by knowledge engineers working with experts in the domain of interest. There are several problems with this approach. The expert may have only partial knowledge about the domain. In addition, there is a "knowledge acquisition bottleneck" problem when trying to build a knowledge base in this manner. It would be desirable to automate this modeling process such that belief networks could be constructed from partial domain information that may be volunteered by the expert and empirical data from the domain. Readers are referred to [Pearl, 1988], [Howard & Matheson, 1981], [Shachter, 1986], [Breese, Horvitz & Henrion, 1988] and [Rege & Agogino, 1988] for details on Bayes' networks and influence diagrams.

1.1 A view of the general induction problem

The problem of inducing a Bayesian network from empirical data and domain information can be viewed as consisting of two subproblems:

1. How does one construct a dependency model for the variables in the domain? A dependency model is a set of statements of the form "X is independent of Y given Z", written as I(X, Z, Y), where X, Y and Z are disjoint sets of variables in the model [Pearl, 1988]. Thus, a predicate that can assign a truth value to statements of the form I(X, Z, Y) is a dependency model.

2. Given a predicate of the form described above, how does one structure a sparse Bayesian network to represent the dependency model? There are various possible Bayesian network representations for the same dependency model. The problem is to construct a comprehensible and computationally efficient one.

The empirical data, for example, may be a collection of observations, each observation being a list of attribute-value pairs (variable-value pairs) that represents a "snapshot" of the joint state of the domain variables. The domain information may consist of statements that can be used to infer facts about the dependence or independence relations among the variables in the domain (see Sec 1.2). The solution to the first subproblem calls for a tractable statistical test for testing independence. The solution to the second subproblem requires building a structured model from an unstructured one. The work described in this paper concentrates on this structuring problem.

1.2 Problem statement

In the context of the previous section, a more precise statement of the problem we are solving here is as follows. We are given:

• A black box that can answer questions of the type "Is X independent of Y given Z?", where X, Y and Z are sets of variables in the domain. This could, for example, be a statistical test operating on empirical data or a deductive prover that knows the basic probability model axioms and operates on a declarative set of independence statements.

• Some partial expert information about the domain. The expert may make the following kinds of statements:

  - Declare that a variable is a hypothesis variable. Operationally, declaring that a variable A is a hypothesis means that in the expert's view, A is a root node of a belief network representation of the domain.

  - Declare that a variable is an evidence variable. Declaring a variable A to be an evidence node means that the expert views A as a leaf node in the belief network representation.

  - Declare that a variable A is a cause of a variable B, or equivalently, that a variable B is caused by variable A. Causality statements are interpreted as follows: saying A is a cause of B declares that the expert views A as a direct predecessor of B in the belief network representation (see [Pearl, 1988]).

  - Make explicit independence declarations of the form I(X, Z, Y), where X, Y and Z are sets of domain variables.

Our goal is to build a sparse Bayesian network for the domain given the information above. In a model it is usually easy for an expert to identify some 'primary' causes and some observables. The flow of causality in a causal model is from these primary causes to the observables. For example, in the medical domain, these primary causes are diseases and the observables are symptoms. In a model of a machine the primary causes would be possible faults and the observables would be sensors. Hypothesis variables correspond to primary causes and evidence variables to observables.
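As a concrete illustration of these two inputs, the sketch below shows how the independence black box and the expert's declarations might be represented. It is not the authors' implementation (which was written in Common Lisp); the Python class and method names are invented for exposition, and the oracle here is backed by an explicit set of declared independence statements rather than a statistical test or a prover.

```python
# Illustrative sketch only: how the two inputs above might look in code.
# All class and method names are invented for exposition.

class IndependenceOracle:
    """Black box answering "Is X independent of Y given Z?".

    Here it is backed by an explicit set of declared independence
    statements I(X, Z, Y); in practice it could instead wrap a
    statistical test on empirical data or a deductive prover.
    """

    def __init__(self, statements):
        # Each statement is a triple (X, Z, Y) of iterables of variables.
        self.statements = {(frozenset(x), frozenset(z), frozenset(y))
                           for x, z, y in statements}

    def independent(self, x, z, y):
        x, z, y = frozenset(x), frozenset(z), frozenset(y)
        return (x, z, y) in self.statements or (y, z, x) in self.statements


class ExpertInformation:
    """Partial domain information volunteered by the expert."""

    def __init__(self):
        self.hypotheses = set()   # declared root (hypothesis) variables
        self.evidence = set()     # declared leaf (evidence) variables
        self.causes = set()       # pairs (a, b): "a is a cause of b"

    def declare_hypothesis(self, v):
        self.hypotheses.add(v)

    def declare_evidence(self, v):
        self.evidence.add(v)

    def declare_cause(self, a, b):
        self.causes.add((a, b))
```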

2 Bayesian networks

The work described in this paper makes use of the terminology and results found in Pearl [Pearl, 1988]. A brief summary of the relevant material follows. Probabilistic models comprise a class of dependency models. Every independence statement in a probabilistic model satisfies certain independence axioms (see [Pearl, 1988] for details). A belief network is a representation of a dependency model in the form of a directed acyclic graph (DAG). Given three disjoint node sets X, Y and Z in a directed acyclic graph, X is said to be d-separated from Y by Z if there is no adjacency path from X to Y that is active. An adjacency path follows arcs from a node in X to a node in Y without regard to the directionality of the arcs. An adjacency path from X to Y is active if (1) every node in the path that has converging arrows is in Z or has a descendant in Z, and (2) every other node is outside Z. A converging-arrows node in an adjacency path is a direct successor (in the DAG) to its neighbours in the path. We represent the statement "X is d-separated from Y by Z" as D(X, Z, Y).
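The D(X, Z, Y) test can be implemented in several ways. The hedged sketch below (not from the paper) uses the standard ancestral-graph/moralization construction, which is equivalent to checking for an active adjacency path but avoids enumerating paths; the graph encoding and the small usage example at the end, which anticipates the temperature-sensor model discussed below, are illustrative.

```python
# Hedged sketch: a D(X, Z, Y) test via the ancestral-graph / moralization
# construction. `dag` maps each node to the set of its parents.

from itertools import combinations

def ancestors_of(dag, nodes):
    """All nodes with a directed path into `nodes`, including `nodes`."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for parent in dag[stack.pop()]:
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, x, y, z):
    """True iff X and Y are d-separated by Z in the DAG."""
    x, y, z = set(x), set(y), set(z)
    relevant = ancestors_of(dag, x | y | z)
    # Moralize the ancestral subgraph: link co-parents, drop arc directions.
    neighbours = {n: set() for n in relevant}
    for child in relevant:
        for parent in dag[child]:
            neighbours[parent].add(child)
            neighbours[child].add(parent)
        for p, q in combinations(dag[child], 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # Delete Z, then see whether X can still reach Y.
    frontier, seen = list(x - z), set(x - z)
    while frontier:
        node = frontier.pop()
        if node in y:
            return False
        for m in neighbours[node] - z:
            if m not in seen:
                seen.add(m)
                frontier.append(m)
    return True

# Tiny usage example: two sensors T1, T2 depending on a temperature T.
dag = {"T": set(), "T1": {"T"}, "T2": {"T"}}
print(d_separated(dag, {"T1"}, {"T2"}, {"T"}))   # True
print(d_separated(dag, {"T1"}, {"T2"}, set()))   # False
```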


A DAG is called an independency map (I-map) if every d-separation in the graph implies the corresponding independence in the underlying dependency model that the DAG is attempting to represent, i.e.:

D(X, Z, Y) ⇒ I(X, Z, Y)    (1)

A belief network is called a dependency map (D-map) if every non-d-separation in the graph implies a corresponding non-independence in the underlying dependency model, i.e.:

¬D(X, Z, Y) ⇒ ¬I(X, Z, Y)    (2)

or, equivalently:

I(X, Z, Y) ⇒ D(X, Z, Y)    (3)

If a DAG is both an I-map and a D-map it is called a perfect map of the dependency model. DAGs cannot represent all the possible kinds of independencies in dependency models. In other words, for many dependency models, there is no DAG that is a perfect map. Therefore, if a DAG representation is used for such a model, one that shows a maximum amount of useful information about the model should be chosen. A DAG that shows no spurious independencies while explicitly displaying as many of the model's independencies as possible is a reasonable compromise. Such a DAG is called a minimal I-map. Deletion of any edge of a minimal I-map makes it cease to be an I-map. It is to be noted that a dependency model may have many different minimal I-maps. Let M be a dependency model and d = X_1, X_2, ..., X_N be an ordering defined on the variables of the model. The boundary stratum B_i of the node X_i is defined to be a minimal subset of {X_1, X_2, ..., X_{i-1}} such that I({X_i}, B_i, {X_1, X_2, ..., X_{i-1}} - B_i). The DAG created by designating the nodes corresponding to the variables in B_i as the parents of the node corresponding to X_i, for all i, is called a boundary DAG of M relative to the node ordering d. If M is a probabilistic model then a boundary DAG of M is a minimal I-map. A Bayesian network is a minimal I-map of a probabilistic model. The result above is a solution to the problem of building a Bayesian network for a given probabilistic model and node ordering d. The form of the Bayesian network depends strongly on the order of introduction d of the variables. In the boundary DAG algorithm the particular ordering of nodes chosen can make a large difference in the number of arcs in the resulting Bayesian network. Though the resulting belief network is guaranteed to be a minimal I-map, this does not imply that it is sparse. For example, take the networks in Figure 1 (this example is adapted from [Rege & Agogino, 1988]). Fig 1a is a perfect map of a probability model that describes the dependence of two temperature sensors T1 and T2 on an (unobservable) process temperature T. If the boundary DAG algorithm is used to build a Bayesian network from the underlying probability model with node ordering d = {T, T1, T2}, the resulting Bayesian network is the same as the perfect map Fig 1a. If the node ordering d = {T2, T1, T} is used instead, we get Fig 1b. Though this network is a minimal I-map, it is fully connected and carries no information on conditional independence. Fig 1a can be viewed as a


causal network in which the fact that the hypothesis node T makes the evidence nodes T1 and T2 conditionally independent is explicitly recorded. A belief network has to be sparse if it is to be comprehensible to the user and inference using the network is to be computationally tractable. Using the boundary DAG algorithm as a point of departure, our solution to the problem of building a belief network from a probabilistic model and expert information attempts to build a sparse Bayesian network by choosing an appropriate ordering of the nodes.

3 The construction algorithm

The boundary DAG method [Pearl, 1988] is a simple way of building a Bayesian network for a set of variables. The recursive statement of the algorithm for building a Bayesian network of k + 1 variables is as follows:

Given: A Bayesian network K consisting of k variables and a variable X_{k+1} to be added to the Bayesian network.

Algorithm: Using the independence predicate, find the smallest subset P of the variables in K such that I({X_{k+1}}, P, K - P). Designate the variables in P as the predecessors of X_{k+1}.

We could adapt this algorithm to build a sparse Bayesian network if we could choose X_{k+1} in a coherent way from all the variables which have not yet been added to the network. This is like a search problem. If there are n nodes in all, there are n - k nodes left to add at each recursive step of the algorithm. The problem is to find the best one to add. Ideally what we would like to do is to find the most sparse minimal I-map, i.e., among the n! possible minimal I-maps (Bayesian networks) possible for n nodes using the boundary DAG algorithm (one for each ordering of nodes), find the one that has the least number of arcs. This is possible, in principle, with a complete search. The complexity of such a procedure is prohibitive. The algorithm we have implemented for choosing node X_{k+1} uses a priority heuristic based on the expert's information and a greedy sub-heuristic to make the choice. The priority heuristic ensures that hypothesis nodes are added to the network before evidence nodes and that cause nodes are added before effect nodes, thus guiding the algorithm towards a sparse causal network. If there is not enough information to make a choice based on priority, the node which adds the least number of arcs to the existing network is chosen as node X_{k+1}. The actual implementation of the priority heuristic is as follows. The expert's information is first 'compiled' into a DAG. Cause-of and caused-by relations are translated into appropriate directed links between nodes in the DAG. Nodes which are declared to be hypothesis or evidence nodes are annotated as such. This DAG is distinct from the belief network being constructed by the algorithm and is used solely for the purpose of making priority decisions. Priority is a function that defines a partial ordering among the nodes in the expert information DAG. The relative priority between two nodes A and B is decided as follows: If A is a hypothesis node and B is not a hypothesis node, A has higher priority.

[Figure 1: Different minimal I-maps for the same probability model. (a) The perfect map; (b) the fully connected minimal I-map obtained with the reversed node ordering.]

If A is an evidence node and B is not an evidence node, A has lower priority. If A is an ancestor of B in the expert information DAG it has higher priority. If A is a descendant of B then A has lower priority. If none of the above cases apply, the priority ranking of A and B is the same.

The recursive algorithm used for building a sparse Bayesian network of k + 1 variables is now stated as follows:

Given: A Bayesian network K with k nodes and n - k candidate nodes which are yet to be added to the network.

Algorithm:

1. Order the candidates using the priority ordering. If there is a unique candidate with highest priority, choose it as the winner, i.e., the next node that will be added to the network.

2. If there is a set of candidates with the (same) highest priority, find the boundary stratum of each of the candidates and choose the candidate with the smallest boundary stratum as the winner. This is a greedy heuristic that tries to minimize the number of arcs added.

3. If there is still a tie, choose any candidate as the winner from the remaining candidates.

4. Make the winner X_{k+1}, the (k+1)th node in the network. Find the winner's boundary stratum if it has not been found already. Assign the boundary stratum of the winner as predecessors of X_{k+1} in the belief network being constructed.

The boundary stratum of a candidate C is found by generating all possible subsets S of the nodes in the existing diagram K in increasing order of size until a subset S_C is found which satisfies I(C, S_C, K - S_C). The order of generation of the subsets guarantees that S_C is the smallest subset of K that satisfies the above independence condition. In other words, S_C is the boundary stratum. The algorithm recurses until all the nodes have been added to the network.
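A compact sketch of this selection loop is given below. It is not the authors' Lisp implementation: `independent(x, z, y)` stands for the black-box predicate, `priority(a, b)` for the expert-information partial order (positive if a outranks b, negative if b outranks a, zero otherwise), and the optional `max_parents` bound anticipates the sparseness assumption discussed in Section 3.1. All names are illustrative.

```python
# Hedged sketch of the node-selection loop described above.

from itertools import combinations

def boundary_stratum(candidate, network, independent, max_parents=None):
    """Smallest S within `network` with I({candidate}, S, network - S)."""
    nodes = list(network)
    limit = len(nodes) if max_parents is None else min(max_parents, len(nodes))
    for size in range(limit + 1):
        for subset in combinations(nodes, size):
            s = set(subset)
            if independent({candidate}, s, set(nodes) - s):
                return s
    return set(nodes)        # fall back to the fully connected case

def build_network(variables, independent, priority, max_parents=None):
    """Return a parent map {node: set_of_parents}, adding one node at a time."""
    parents, remaining = {}, set(variables)
    while remaining:
        # Step 1: keep only candidates of maximal priority.
        best = [c for c in remaining
                if all(priority(c, other) >= 0
                       for other in remaining if other != c)]
        if not best:
            best = list(remaining)            # priority gives no guidance
        # Step 2: greedy tie-break on boundary-stratum size.
        strata = {c: boundary_stratum(c, parents, independent, max_parents)
                  for c in best}
        winner = min(best, key=lambda c: len(strata[c]))
        # Steps 3-4: install the winner with its boundary stratum as parents.
        parents[winner] = strata[winner]
        remaining.remove(winner)
    return parents
```

Run on the temperature-sensor model of Figure 1 with a priority function built from "T is a hypothesis; T1 and T2 are evidence", the loop adds T first and so recovers the perfect map of Fig 1a.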

3.1 Complexity

The algorithm outlined above requires (n - k)·2^k independence checks when adding the (k+1)th node. The total number of independence checks required is O(2^{n+1}). Using the contrapositive form of the decomposition axiom for probabilistic models [Pearl, 1988], it can be shown that once a particular subset S of a partial belief network K has been found not to be a boundary stratum for a candidate node C, it will not be found to be a boundary stratum for the candidate C even if the belief network K is augmented with some new nodes. This allows us to reduce the total number of independence checks to O(2^n). Nevertheless, the algorithm is still exponential. Despite this, it should be noted that once a boundary stratum S_C for a candidate node C has been found, there is no need to check all the subsets of K which are larger than S_C. The point of having the expert information available is to guide the algorithm towards a sparse belief network. If this is indeed the case, the S_C's are small and the algorithm runs in far less than exponential time. For example, if we operationalize our sparseness hypothesis as the assumption that the maximum number of predecessors that a node can have is p, then at each step of the construction algorithm we need to check only those subsets of nodes of the existing network which are of size less than or equal to p to find the boundary stratum of a candidate node. The overall complexity in such a case is polynomial (O(n^{p+2})). Indeed, tests with the implemented system show that the algorithm takes far less time and generates results that are more desirable (belief nets with fewer arcs) as the amount of expert information available increases. In the trivial extreme, it is possible for the expert to basically "give the system the answer" by giving enough information to allow the total ordering of the nodes. In such a case the system serves to verify the expert's intuitions rather than to fill in the gaps in the expert's model.

4 Results

The belief network construction algorithm described above has been implemented in Common Lisp on a Symbolics workstation. The system is an experimental module in a comprehensive belief network and influence diagram evaluation and analysis package called IDEAL [Srinivas & Breese, 1988]. During testing of the system the underlying probability model has been implemented as a belief network and the independence test has been implemented as d-separation in this belief network. Though this may seem a little strange at first, it should be borne in mind that we are concentrating on the construction of a sparse belief network given a probability model and an independence predicate that operates on the model (Sec 1.1). The exact implementation of the model and the test do not have any effect on the results of the algorithm. In addition, there is a benefit to testing the system this way. The topology of the underlying belief network (i.e., the belief network that represents the underlying probability model) gives us a standard against which our rebuilt network can be compared. The best possible rebuilt network will be identical to the underlying network since, in that case, it is a perfect map of the underlying model. All other possible rebuilt networks will be minimal I-maps of the underlying belief network but will not necessarily show all the independencies embodied in the underlying network. During the construction of the expert information DAG, obvious contradictions in the expert's information are detected by the system and have to be corrected by the user. For example, the expert may specify a set of cause-of relations that lead to a cycle in the DAG. Another example of an error is specifying that a hypothesis node is caused by some other node. The system also verifies the expert's information as it builds the network and warns the user when it finds deviations between the model it is building and the expert's information. For example, the expert may have declared that variable A is a cause of variable B while the system may find that the boundary stratum of B does not contain A.

Fig 2 is an example of an underlying network, the expert information (as a partially specified DAG) and the rebuilt network. The expert information consists of (1) the identities of all the evidence nodes (Y3, Y2 and Y1) and hypothesis nodes (U2 and U1), and (2) knowledge of the existence of some arcs (see Fig 2b). The rebuilt network is similar to the underlying network except for the arcs among the subset of nodes W1, W2, U1 and V. It is interesting to note that the rebuilt network can be obtained from the original by two sequential arc reversals: reversal of arc V → W2 followed by reversal of arc V → W1 (see [Shachter, 1986] for details on arc reversals). If the arc V → W2 is added to the expert information DAG then the rebuilt network is identical to the underlying network (Fig 3). Fig 4 is a slightly larger example. Here we have attempted a crude calibration of the sensitivity of the system to the amount of causal expert information available. The underlying belief network has 26 nodes and 36 arcs. The expert information initially consists of (1) the identities of all the evidence and hypothesis nodes and (2) knowledge of the existence of all the arcs (i.e., the expert knowledge consists of causal statements that describe all the arcs in the underlying diagram). The system builds the network with this information. The knowledge of the existence of some random arc in the underlying model is then deleted from the expert information DAG. The system builds the network again. This delete/build cycle is repeated many times. Figure 5a shows the variation in the number of arcs in the rebuilt network versus the number of arcs in the expert information DAG. Taking the number of arcs in the rebuilt network to be a rough indicator of the quality of the rebuilt network, we see that the quality of the model improves as the amount of expert information available increases. Figure 5b shows the amount of time required to build the model against the number of arcs in the expert information DAG. The time required decreases as the expert information available increases.

5 Discussion and further work

As expected, the belief network constructed by the system depends strongly on the amount of expert information available. We are at present trying to characterize what types of information are critical to building good models. Our experience with the system shows that the identification of hypothesis and evidence nodes by the expert seems very important if a reasonable model is to be built. If this system is to be applied to induce a belief network from empirical data it is imperative that an inexpensive and fairly accurate statistical independence test be used. Well characterized conditional independence tests involving larger numbers of variables may not be tractable. It may be necessary, therefore, to make use of appropriate approximation techniques or less formal tests that may be more tractable. An additional and more basic 'problem' with statistical independence tests on empirical data is that they can never be exact. This fact can be regarded as a characteristic of the induction problem. In this regard, it is interesting to note that a Bayesian network and d-separation provide a sound and complete scheme to deduce, in polynomial time, every independence statement that is implied by the Bayesian network [Pearl, 1988].


[Figure 2: An example. (a) Underlying network; (b) expert information; (c) rebuilt network.]


[Figure 3: Rebuilt network with additional expert information. (a) Expert information; (b) rebuilt network.]


Figure 4: A larger example

This property made a Bayesian network and d-separation an attractive scheme for representing the underlying model and independence test during testing of the system. The system, as it is implemented now, guarantees that the network constructed is a Bayesian network, i.e., a minimal I-map. The expert information is used merely to guide the search for the next node. The actual boundary stratum of the next node is found from scratch using the independence test. Thus, in a sense, the system "trusts" the underlying model more than the expert. Substantial computational gains could be achieved if the system started out looking for the boundary stratum of a node under the assumption that the causes that the expert has declared for the node are necessarily members of the boundary stratum of the node. However, using this approach would no longer guarantee that the belief network constructed is a minimal I-map of the underlying model.

6 Acknowledgements

We would like to thank Jack Breese and Ken Fertig for their invaluable help and advice.


[Figure 5: Sensitivity to amount of expert information. (a) Number of arcs in the rebuilt network versus number of arcs in the expert information DAG; (b) time taken to build the network (seconds) versus number of arcs in the expert information DAG.]


References

[Breese, Horvitz & Henrion, 1988] Breese, J. S., Horvitz, E. J. and Henrion, M. (1988) Decision Theory in Expert Systems and Artificial Intelligence. Research Report 3, Rockwell International Science Center, Palo Alto Laboratory.
[Howard & Matheson, 1981] Howard, R. A. and Matheson, J. E. (1981) Influence diagrams. The Principles and Applications of Decision Analysis 2. Strategic Decisions Group, Menlo Park, CA.
[Pearl, 1988] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, Calif.
[Rege & Agogino, 1988] Rege, A. and Agogino, A. M. (1988) Topological framework for representing and solving probabilistic inference problems in expert systems. IEEE Transactions on Systems, Man and Cybernetics, 18 (3).
[Shachter, 1986] Shachter, R. D. (1986) Evaluating influence diagrams. Operations Research 34 (6), 871-882.
[Srinivas & Breese, 1988] Srinivas, S. and Breese, J. S. (1988) IDEAL: Influence Diagram Evaluation and Analysis in Lisp. Internal Report, Rockwell International Science Center, Palo Alto Laboratory.



A Decision-Analytic Model for Using Scientific Data

Harold P. Lehmann
Section on Medical Informatics, MSOB x215, Stanford University, Stanford, CA 94305-5479
(415) 723-2954, [email protected]

Many Artificial Intelligence systems depend on the agent's updating its beliefs about the world on the basis of experience. Experiments constitute one type of experience, so scientific methodology offers a natural environment for examining the issues attendant to using this class of evidence. This paper presents a framework which structures the process of using scientific data from research reports for the purpose of making decisions, using decision analysis as the basis for the structure, and using medical research as the general scientific domain. The structure extends the basic influence diagram for updating belief in an object domain parameter of interest by expanding the parameter into four parts: those of the patient, the population, the study sample, and the effective study sample. The structure uses biases to perform the transformation of one parameter into another, so that, for instance, selection biases, in concert with the population parameter, yield the study sample parameter. The influence diagram structure provides decision theoretic justification for practices of good clinical research, such as randomized assignment and blindfolding of care providers. The model covers most research designs used in medicine: case-control studies, cohort studies, and controlled clinical trials, and provides an architecture to separate clearly between statistical knowledge and domain knowledge. The proposed general model can be the basis for clinical epidemiological advisory systems, when coupled with heuristic pruning of irrelevant biases; of statistical workstations, when the computational machinery for calculation of posterior distributions is added; and of meta-analytic reviews, when multiple studies may impact on a single population parameter.

1 THE PROBLEM

Decision-analytic models have been applied to different areas in artificial intelligence, including diagnosis (Heckerman, Horvitz, & Nathwani, 1989), learning (Star, 1987; Buntine, 1987), vision (Levitt, et al., 1988), and control of inference (Horvitz, 1989; Breese & Fehling, 1988). An activity common to these models is the agent's updating his belief in relevant propositions on the basis of evidence. The models usually leave implicit the decision maker's belief in the method by which the data were obtained. To be done properly, updating must include this belief in each of many possible models of observation of the data. We shall show that, in certain contexts,

the space of all possible observational contexts can be parameterized to be both assessable and computable. Scientific research is one such context; specific biases can be used to provide the necessary parameterization. Parameterization of the observational contexts is domain-dependent at different levels of the meaning of domain. We use the following hierarchy. The topmost level is the general field of scientific or systematic observation. The second level is the field of research aimed at discerning causal relations, as opposed, say, to exploratory descriptions. A third level is that of medical research, which eliminates from consideration a number of destructive experimental designs. A fourth level is the class of study, including,

for instance, balanced-design, case-control, cohort, or randomized studies. A fifth level is the object (domain of interest), such as cardiology. Our goal is to offer a structure that allows parameterization of all studies within the fourth domain. A successful parameterization at this level should enable us to handle a wide variety of medical (object) domains, while the structure will probably be effective at higher levels as well. We may visualize the problem using the influence diagrams (Howard & Matheson, 1981) of Figure 1. Figure 1a shows that the decision maker makes his decision, D, knowing the data, x, at the time of the decision, and thinking about the effect of the parameter, θ, on relevant outcomes, Ω, which affect the value, V, of the decision maker. The calculation of expected utility for the decision requires the prior probability distribution for the parameter θ, in addition to the likelihood distribution, P(x | θ). This formulation, however, leaves out the possibility that the data were obtained in different ways, which is equivalent to its preventing the decision maker from modeling his uncertainty about the likelihood distribution. In Figure 1b, we include the observational context explicitly, operationalized (or parameterized) in terms of a bias parameter, φ, and we make explicit the types of data, y, that bear on φ, separately from those that bear on θ. To use this structure, we need the prior over φ and the likelihood distribution, P(x | θ, φ). The latter distribution contains our knowledge about the mechanism by which the data are obtained in a particular experimental context.

[Figure 1. General influence-diagram models for belief updating. (a) Using primary data to update belief in the parameter of interest. (b) Parameterizing the observational context and using additional data to update belief in the observational model. Legend: θ, parameter of interest; x, data pertaining to parameter of interest; D, decision alternatives; Ω, outcomes; V, value; φ, bias parameter; y, data pertaining to bias parameter.]

As an example of the importance of the observational context, consider the following problem. I have two coins, one of which I may bring to a gambling event. I would, of course, prefer the coin that gives me the greatest odds of winning me the most money. The coins are apparently identical, but they may have different chances of landing heads. I flip each coin 100 times, asking an assistant to give me the coins alternatingly. This process gives me two lists of tosses and outcomes, one for each coin. Most utility functions would have me choose the coin

with the higher chance (posterior probability) of heads as the coin to take (unless that proportion is too high, in which case I risk being found out as a cheater), based on the binomial likelihood: for i = 1, 2, the posterior probability of θ_i (the chance of coin i falling heads) given x_i (the observed number of heads from flipping coin i) is proportional* to θ_i^{x_i} (1 - θ_i)^{100 - x_i}, and this might be your advice to me. But what if I told you that my assistant is three years old? You might reconsider your advice, since the assumption that each list pertains to a separate coin is now probably invalid. We should introduce two new parameters, φ_i, i = 1, 2, which represent the proportion of the time our young assistant gave me coin i but told me it was the other one. The correct model for the likelihood function for x_i now is l(x_i | θ'_i) = θ'_i^{x_i} (1 - θ'_i)^{100 - x_i}, where θ'_i, the effective parameter, is a function of θ_1, θ_2, φ_1, and φ_2; specifically, θ'_1 = θ_1(1 - φ_1) + θ_2 φ_2. To find the posterior distribution of θ_i given the data, we must integrate out the observational parameters.

* This is the likelihood function, the likelihood distribution with those factors dependent on the parameter, θ, only, excluded.
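The last sentence can be made concrete with a small numerical sketch. The grid resolution, the uniform priors, the simplification φ_1 = φ_2 = φ, and the observed counts below are all illustrative choices, not values taken from the text.

```python
# Hedged numerical sketch of "integrate out the observational parameters"
# for the two-coin example. Priors, grid, and counts are illustrative.

import numpy as np

N = 100
x1, x2 = 62, 48                       # hypothetical observed heads
grid = np.linspace(0.005, 0.995, 99)  # shared grid for theta1, theta2, phi

t1, t2, phi = np.meshgrid(grid, grid, grid, indexing="ij")

# Effective parameters under the mislabelling model (phi1 = phi2 = phi here).
e1 = t1 * (1 - phi) + t2 * phi
e2 = t2 * (1 - phi) + t1 * phi

# Joint is proportional to uniform priors times the two binomial likelihoods.
joint = e1**x1 * (1 - e1)**(N - x1) * e2**x2 * (1 - e2)**(N - x2)

# Marginal posterior of theta1: sum out theta2 and phi, then normalise.
post_t1 = joint.sum(axis=(1, 2))
post_t1 /= post_t1.sum()

print("Posterior mean of theta1:", float((grid * post_t1).sum()))
```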

2 A MODEL FOR THE USE OF REPORTED SCIENTIFIC DATA

In developing a model to deal with systematic observations, we narrow our attention to medical research study designs for a number of reasons. First, providing data for high-stakes decisions relating to individual situations is a major purpose of medical research, as evidenced by the large number of clinical studies. Second, while our discussion will remain general, it will be focused by our considering a narrower domain than all scientific research, thereby clarifying the model. Third, the concept of bias in medical research is well established (Sackett, 1979), giving us a baseline against which to judge the external validity of our model.

2.1 Details of the General Model

Our general model, depicted in Figure 2, is an expansion of Figure 1, based on Shachter, et al. (1989). It consists of a central framework with peripheral adjustments. The framework allows the representation of a number of modeling decisions, separate from the domain issue of which therapy to implement. We will point out these modeling decisions as they come up. One approach for making modeling decisions, in general, is for the physician to explore each potential model resulting from one modeling decision, and integrate expected value across these models, weighted by a prior probability on each model. A second approach is for the physician to perform a sensitivity analysis across some or all models, and either continue with a single best model, or integrate across models, depending on the outcome of the analysis. We now circumnavigate our general model, Figure 2. We have made explicit that

the physician is concerned about a particular Outcome of Interest, such as mortality, morbidity (undesired, nonfatal outcomes), or cost. The fact that he uses certain parameters to help him assess the likelihood of that outcome occurring is a result of a separate decision as to the probabilistic model he chooses to use. We have split the general Parameter of Interest, θ, from Figure 1 into four separate parameters: the Patient Parameter, θ_pt, the Population Parameter, θ_pop, the Sample Parameter, θ_sample, and the Effective Sample Parameter, θ'_sample. By separating θ_pt from θ_pop, we allow the physician to take two types of modeling actions. One is for the physician to use the same data for many different clinical situations, if he knows how to derive the distribution for θ_pt from Hyperparameters updated by θ_pop in a hierarchical Bayes framework (Berger, 1985), which may require traditional statistical models. A second type of modeling action is for him to effect a meta-analysis (L'Abbé, 1987), using multiple sources of data (several different papers) to update his belief in θ_pop. By separating θ_sample from θ_pop, we allow for the physician to generalize the study from θ_sample, the belief in the parameter for the perfectly-done study, to θ_pop, the belief in the parameter for the ideal population. The physician could calculate θ_sample exactly from θ_pop if he knew how the Selection from the general population led to the patient pool sampled in the study and by what Protocol Design the patients were assigned their respective interventions (which is why θ_sample is a deterministic node). The last parameter, θ'_sample, contains the physician's belief in the parameter of the study as performed. If the physician knew exactly how θ_sample was modified by the experimental Protocol Implementation, he could calculate exactly what the parameter was to which the experimental outcome refers. The Data node of Figure 1 has been split into an Experimental Outcome node, representing the actual results from the study as performed; an Observed Data node,

representing the data recorded by the experimenters; and a Reported Data node, representing the data as reported in the written article and the only data available to the reader. The likelihood of the experimental outcome given the Effective Sample Parameter is the probability distribution that depends on the Probabilistic Model the physician chooses for modeling the outcome of the experiment. For mortality studies, as an example, this model is usually binomial or constant hazard. If the physician knew the Measurement Reliability of the recording instruments, say the true sensitivity and specificity, and the true model relating the reliability to the actual and the observed data, he could calculate exactly the Observed Data from the Experimental Outcome. Similarly, if he knew exactly the Credibility of the reporters, he could calculate the Reported Data from the Observed Data. Researchers often report in their papers only their estimate of the parameter, rather than all the observed data. This estimate is a Statistical Function of the observed data, such as a mean or a regression coefficient. Statistical decision theory (Berger, 1985) concerns itself with the choice of that function. We note that, without at least the observed data, we lose information that could potentially alter our decision. The sufficient statistics proposed by statistical decision theory, then, appear to be context dependent,

and our model makes explicit just this context.

[Figure 2. Influence-diagram model for using research data in the general medical decision-making problem. Nodes include Selection, Protocol Design, Protocol Implementation, the Population Parameter, Hyperparameters, the Patient Parameter θ_pt, the Sample Parameter, the Effective Sample Parameter θ'_sample, Experimental Outcome, Measurement Reliability, Credibility, Reported Data, Medical Decision, and Value.]

2.2 An Example of Navigating the Model

To illustrate how we might apply the model, we consider a 55 year old white woman who has just had a heart attack and who has been brought into the hospital almost immediately after symptoms of chest pain, nausea, and sweating set in. Her physician, besides needing to stabilize her acute cardiovascular status, wants to prevent worsening of her general cardiac condition. The doctor knows that a drug, metoprolol, is considered possibly able to do so. He is primarily concerned with minimizing the patient's chance of death, and thereby maximizing the heart attack victim's life expectancy. There are some known side effects of the medication. Should he use the drug? Let us go through an abbreviated analysis using a study (Hjalmarson, et al., 1980) that bears on this question. As discussed before, we leave the choice of the value model to be implicit. The Outcome of Interest is mortality. We make the modeling decision to use the patient's probability of death as the Patient Parameter, which we will assume to be constant over time (constant hazard model). For the modeling decision of the choice of referent Population, we have at least two choices on the basis of cardiological domain knowledge: middle-aged women and middle-aged adults. If we choose the population of both sexes, there will be a larger number of studies, each with a large sample size, that we can bring to bear on this problem, which, in concert, may affect the Hyperparameters as much as if we used the subgroup of women only. Clearly, there is a modeling decision trade-off between the specificity of the data and the amount of data available. For the purposes of this paper, we shall take the modeling action of using only the combined population of middle-aged adults. The Sample in the study consists of all heart-attack victims from south Sweden in the late 1970s. This characterization represents Selection from our population on ethnic grounds, but not on the basis of referral, diagnostic purity, or diagnostic access biases (Sackett, 1979). The Protocol Design is reported to have been that of a

double-blinded and randomized clinical trial, in which the assignment of a patient to a given treatment is independent of the patient's baseline status. There is evidence to support the claim that the Protocol Implementation was identical to the design. For instance, the compositions of the metoprolol and placebo groups are similar with respect to relevant characteristics, based on reported baseline data, corroborating the implementation of randomization. The number of withdrawals from the study on the basis of side-effects is also similar between the two groups, suggesting that if there were some unblindfolding of the care providers such that the treating physicians became aware of the true treatment assignments, its degree was the same in both groups. Estimating the actual degree of withdrawal explicitly is important for calculating a posterior distribution on θ'_sample. Withdrawal refers to a patient's not receiving the treatment to which he was assigned. This estimation adds a bias parameter to be inferred, which, in turn, results in our considering a space of observational models larger than we would be considering without including the withdrawal bias. The withdrawal bias parameter in this study models the fact that the effective sample parameter for metoprolol was a result of mixing the treatment group with a third group of patients receiving no treatment, that is, a group with the baseline mortality risk (the group of patients withdrawn). The new parameter to be inferred is the degree of mixing, φ. Shachter and colleagues (1989) offer a mathematical form for the effective sample parameter:

θ'_sample, metoprolol = (1 - φ) · θ_sample, metoprolol + φ · θ_sample, baseline,

showing that θ'_sample is a function of θ_sample and of φ, as represented by the deterministic node in Figure 2. We note the similarity to our coin problem in Section 1, where now φ_1 = φ_2 = φ. In the study, the reported overall withdrawal rate is 19.1% in both groups, and we could use this to update our belief in φ. Continuing around Figure 2, we find that the investigators use the binomial Probabilistic Model, in keeping with our

definition of the parameter of interest. Measurement Reliability depends on the sensitivity and the specificity of the sensing mechanisms. For mortality studies, the sensitivity, P(labeling patient as "dead" | patient deceased), and specificity, P(labeling the patient as "alive" | patient is alive), depend on patients who have dropped out of the study. The authors assure us that the mortality status of each patient entered into the study was assessed, regardless of subsequent treatment status, and their credentials are such that we consider them to have high Credibility. Finally, the authors report both mortality rates and life tables as their Statistical Functions of the data.

3 USING BIASES TO PARAMETERIZE THE SPACE OF CLINICAL STUDIES

We now have the basic architecture for parameterizing the space of observational models through expanding the peripheral nodes of Figure 2. We gave the details of one instance in the mixture model resulting from patients withdrawn from the study. Our present concern is simply to locate all potential biases in a single model. We wish, therefore, for as comprehensive as possible a set of biases that spans the space of all studies. To show this set, we need to examine more closely the top line of Figure 2: the relationships among θ_pop, θ_sample, and θ'_sample. In causal studies, each of these parameters is a probability (e.g., mortality risk), or a parameter of a probability distribution, and hence represents a component of a local node group of an influence diagram. The most general form of such a local node group is given in Figure 3: the node Patient's Event is dependent on the nodes Patient's Baseline State and Patient's Exposure to an Agent of Interest. For instance, θ_sample is defined by the probability distribution of the events observed in the sample patients, given the baseline states of the sample patients and the agents to which those patients were exposed. Similarly for the population and for the effective sample. The language comes from Feinstein (1985), who uses this local node group to account for the widest possible range of initial states (e.g., healthy, diseased), agents (e.g., environmental exposure, medical therapy, process intervention), and events (e.g., mortality, pain, contraction of disease), thereby allowing for the same structure to be used in analyzing the entire class of comparative studies.

[Figure 3. The general decomposition of a parameter of interest, which parameterizes the indicated probability distribution. After Feinstein (1985).]

Thus, to explore the relationship between specific biases and the parameter in which we are interested, we need to examine the primary belief network of which the parameter is a component, and to locate the dependencies of the parameter on those biases. Then, the relationship between two parameters can be worked out at the level of these primary local node groups. In Figure 4, we present a fragment of such an expanded belief network, showing the interplay between different classes of bias and the relationship between the Sample Parameter and the Effective Sample Parameter. The citation sources for the biases are indicated. As an example, we note that Classification Error is a class of bias upon which the assessment of Effective Sample Initial State is dependent. There is a large list of specific biases which fall into this category, such as Previous Opinion Bias and Diagnostic Suspicion Bias. These biases lead to a difference between the Sample Initial State and the Effective Sample Initial State; this difference leads, in turn, to a difference between the Sample Parameter and the Effective Sample Parameter. This fragment, then, is the mechanistic interpretation (or expansion) of the arc in Figure 2 between the Sample Parameter and the Effective Sample Parameter. We note that the model incorporates a wide variety of methodological elements, such as misclassification, misassignment, and conduct of the study (e.g., blindfolding), all into a single model. This analysis of mechanism allows us to preserve (or define) the semantics of the biases, while providing a numerical environment in which to use them.

[Figure 4. Primary influence diagram, defining the Effective Sample Parameter and its relationship to the Sample Parameter. Thickly drawn nodes and arcs indicate the local node groups of Figure 3. Footnote symbols indicate citation source for biases: *Sackett (1979), †Feinstein (1985), ¶Lehmann (1988).]
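For a feel of the magnitudes involved, the mixture form for the withdrawal bias (Section 2.2) can be evaluated at a point estimate. The withdrawal fraction below uses the 19.1% figure reported in the study; the two mortality risks are invented for illustration and are not values from Hjalmarson et al.

```python
# Illustrative only: the effective sample parameter as a mixture, evaluated
# at a point estimate. The risks below are made-up numbers.

phi = 0.191                      # estimated withdrawal (mixing) fraction
theta_metoprolol = 0.06          # hypothetical risk under metoprolol
theta_baseline = 0.09            # hypothetical baseline (untreated) risk

theta_effective = (1 - phi) * theta_metoprolol + phi * theta_baseline
print(round(theta_effective, 4))   # 0.0657: diluted toward the baseline risk
```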


4 PREVIOUS ATTEMPTS TO MODEL THE USE OF SCIENTIFIC DATA

Rennels (1986) uses AI heuristics in constructing a system, ROUNDSMAN, that offers therapeutic suggestions for a particular patient, on the basis of articles in its knowledge base. The primary heuristic is the calculation of the "distance" of a paper to a particular domain decision. For the program to function, a domain expert must preprocess each article; the user enters the defining characteristics for the patient. Problematic in Rennels' approach is that all types of biases and adjustments are combined into a single score, without an underlying structure like the one we have developed. Furthermore, by not considering the probabilistic model for the data, ROUNDSMAN is unable to give an integrated numeric solution to the question of what the physician should do on the basis of the data. Our model is in the same spirit as Eddy's confidence profile method (Eddy, 1989). His approach is motivated more by the desire to perform meta-analyses than the need to analyze in depth a single article. His "chains" are concerned with the causal connections between parameters and outcomes of interest. He deals with biases as correction factors, similar to our peripheral nodes in Figure 2. The goal of the REFEREE project (Lehmann, 1988) is to create a system that aids a user in evaluating the methodological quality of a paper. The REFEREE team first implemented that goal in a rule-based program. Because of semantic and representational deficiencies, they redefined the goal to be the replicability of the conclusion of the study, and implemented the new goal in a probabilistic program (Lehmann, 1988) and in a multiattribute-value-based program (Klein, et al., 1990). However, even with these normative bases, it is not clear what should be done with the output of these systems. The approach we have taken in this paper is an answer to that question. The field of clinical epidemiology is "concerned with studying groups of people to achieve the background evidence needed for clinical decisions in patient care" (Feinstein 1985, p. 1). Besides bringing epidemiological and statistical techniques to the medical bedside, clinical epidemiologists

are concerned with physicians' coherent reading of the clinical research literature. Sackett, Haynes, and Tugwell (1985) have published a sequential algorithm for using the literature: if the article fails any step along the path, it is eliminated from consideration. L'Abbé, Detsky, and O'Rourke (1987) describe sequential steps to be taken for using an article in a meta-analysis. Their measure of quality is based on the rating-sheet approach of Chalmers and colleagues (1981). Our approach, in contrast, is global in that an article is implicitly discarded only after the probability distribution, P(θ_pt | Reported Data), is determined; a distribution close to the prior distribution over θ_pt suggests that the biases and adjustments have washed out any meaningfulness from the reported data. The primary advantage of the global approach is that it does not explicitly throw out data, unlike the sequential algorithm. Although a study may be designed and executed poorly or may be somewhat irrelevant, its data may still have an important bearing on the calculation of expected value. The disadvantage of the global approach is its computational complexity. Finally, the model in this paper should be compared with the usual machine learning paradigm of having a computer induce categories from examples, or learn from experience. In that paradigm, the machine assumes no knowledge about how the data are generated, except, perhaps, the degree of supervision. In contrast, our model presupposes much information about the data generation process and attempts to provide a framework for representing just that type of knowledge.

5 USES OF THE MODEL

The model may be used in several ways. At the least, it may help a reader to organize the study of research results. This use leads to a qualitative analysis that, alone, may result in all the insights the user may want. A quantitative approach helps a user to organize the clinical epidemiological concerns and provides the basis for a decision support system and statistical workstation, where the resulting probability distributions may be displayed and studied, and where distributions over probability models may be considered as well (Lehmann, 1990).


In using the general model to construct decision-support systems, we must construct specific models of how evidence in the study report updates our belief in a model. Extending the general model results in our building a large knowledge base of research designs; the general model serves as the core for the methodological and statistical knowledge base (Figure 5). Domain knowledge is incorporated as new conditions on the biases (new predecessors), or new sources of evidence for the different biases (new descendants). The resulting specific model is then instantiated for the given paper, updated probabilistically (Shachter, 1988), and the treatment recommendation is made through maximizing utility. Barlow, Irony, and Shor (1988) give an example of following this process in choosing an experimental design. We have presented an architecture for a large knowledge base of science methodology. The details of this knowledge base have implications for designing

scientifically (rationally) functioning computer agents which collect data about the real world and have knowledge about the process by which they do so. The model we have presented eliminates most theoretical problems with previous approaches and provides a framework for future work. Extensions of our ideas include generalizing to other evidential domains in the hierarchy of domains presented at the beginning of this paper. We also need to make more explicit the various modeling decisions the reader undertakes in performing the type of analysis we have been discussing. The work we present here can be followed now by a practical instantiation (Lehmann, 1990).

[Figure 5. General architecture for a decision support system. The central knowledge base contains methodological and statistical knowledge of biases and of the decision process (as in Figures 1, 2, and 3). The upper knowledge base contains domain-specific knowledge regarding conditioning knowledge for specific biases, while the lower one contains domain-specific knowledge about what sorts of information from a study are evidential for specific biases.]

ACKNOWLEDGMENTS

Thanks to Greg Cooper and to Ross D. Shachter for help in formulating these ideas. Sources of funding include Grant LM-04136 from the National Library of Medicine, and NIH grant RR-00785 for the SUMEX-AIM computing facilities.

REFERENCES

Barlow, Richard E., Irony, Telba Z., and Shor, S. W. W. Informative sampling methods: The influence of experimental design on decision. Conference on Influence Diagrams for Decision Analysis, Inference, and Prediction, May 9-11, 1988. Engineering Systems Research Center, Bechtel Engineering Center, University of California at Berkeley (1988).
Berger, James O. Statistical Decision Theory and Bayesian Analysis (New York: Springer-Verlag, 1985).
Breese, John S., and Fehling, Michael R. Decision-theoretic control of problem solving: Principles and architecture. Proceedings of the Fourth Workshop in Uncertainty in Artificial Intelligence, University of Minnesota, Minneapolis, August 19-21, 1988, pp. 30-37 (1988).
Buntine, Wray. Decision tree induction systems: A Bayesian analysis. In L. N. Kanal, J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 3. Amsterdam: North-Holland, pp. 109-127 (1989).
Chalmers, Thomas C., Smith, Harry, Jr., Blackburn, Bradley, et al. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials, 2:31-49 (1981).
Eddy, David M. The confidence profile method: A Bayesian method for assessing health technologies. Operations Research (1989).
Feinstein, Alvan R. Clinical Epidemiology: The Architecture of Clinical Research (Philadelphia: W. B. Saunders Company, 1985).
Heckerman, David E., Horvitz, Eric J., and Nathwani, Bharat N. Update on the Pathfinder project. Proceedings of the Thirteenth Annual Symposium on Computer Applications in Medical Care, Washington, D.C., November 5-8, 1989, pp. 203-207 (1989).
Hjalmarson, A., Herlitz, J., Malek, I., et al. Effect on mortality of metoprolol in acute myocardial infarction. Lancet 2(8251):823-827 (1980).
Horvitz, E. J. Reasoning about beliefs and actions under computational resource constraints. In L. N. Kanal, J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 3. Amsterdam: North-Holland, pp. 301-324 (1989).
Howard, Ronald A., and Matheson, James E. Readings on the Principles and Applications of Decision Analysis (Menlo Park, CA: Strategic Decisions Group, 1981).
Klein, David A., Lehmann, Harold P., Shortliffe, Edward H. Computer-based evaluation of randomized clinical trials: A value-theoretic approach. Submitted to the Fourteenth Annual Symposium on Computer Applications in Medical Care (1990).
L'Abbé, Kristan A., Detsky, Allan S., O'Rourke, Keith. Meta-analysis in clinical research. Annals of Internal Medicine, 107:224-233 (1987).
Lehmann, Harold P. Knowledge acquisition for probabilistic expert systems. Proceedings of the Twelfth Annual Symposium on Computer Applications in Medical Care, Washington, D.C., November 6-9, 1988, pp. 73-77 (1988).
Lehmann, Harold P., Shortliffe, Edward H. THOMAS: Building Bayesian statistical expert systems to aid in clinical decision making. Submitted to the Fourteenth Annual Symposium on Computer Applications in Medical Care (1990).
Levitt, Tod S., Binford, Thomas O., Ettinger, Gil J., and Gelband, Patrice. Utility-based control for computer vision. Proceedings of the Fourth Workshop in Uncertainty in Artificial Intelligence, University of Minnesota, Minneapolis, August 19-21, 1988, pp. 245-256 (1988).
Rennels, Glenn D. A Computational Model of Reasoning from the Clinical Literature. Ph.D. Thesis, Division of Medical Informatics, Stanford University, Stanford, CA (1986).
Sackett, David L. Bias in analytic research. Journal of Chronic Diseases, 32:51-63 (1979).
Sackett, David L., Haynes, R. Brian, and Tugwell, Peter. Clinical Epidemiology (Boston: Little, Brown and Company, 1985).
Self, Matthew, and Cheeseman, Peter. Bayesian prediction for artificial intelligence. Proceedings of the Third Workshop in Uncertainty in Artificial Intelligence, University of Washington, Seattle, July 10-12, 1987, pp. 61-69 (1987).
Shachter, Ross D. Probabilistic inference and influence diagrams. Operations Research 36(4):589-604 (1988).
Shachter, Ross D., Eddy, David M., and Hasselblad, Victor. An influence diagram approach to medical technology assessment. To appear in R. M. Oliver and J. Q. Smith (Eds.), Influence Diagrams, Belief Networks, and Decision Analysis (New York: John Wiley & Sons, in press).
Star, Spencer. Theory-based inductive learning: An integration of symbolic and quantitative methods. Proceedings of the Third Workshop in Uncertainty in Artificial Intelligence, University of Washington, Seattle, July 10-12, 1987, pp. 237-248 (1987).



Verbal expressions for probability updates: How much more probable is "much more probable"?
Christopher Elsaesser and Max Henrion
Department of Engineering and Public Policy, Carnegie Mellon University

Abstract: Bayesian inference systems should be able to explain their reasoning to users, translating from numerical to natural language. Previous empirical work has investigated the correspondence between absolute probabilities and linguistic phrases. This study extends that work to the correspondence between changes in probabilities (updates) and relative probability phrases, such as "much more likely" or "a little less likely." Subjects selected such phrases to best describe numerical probability updates. We examined three hypotheses about the correspondence, and found the most descriptively accurate of these three to be that each such phrase corresponds to a fixed difference in probability (rather than a fixed ratio of probabilities or of odds). The empirically derived phrase selection function uses eight phrases and achieved 72% accuracy in correspondence with the subjects' actual usage.

1. Introduction

A key characteristic for the acceptance of expert systems and other computer-based decision support systems is that they should be able to explain their reasoning in terms comprehensible to their users. Teach and Shortliffe (1984) found that physicians rated explanation an essential requirement for the acceptance of medical expert systems. Bayesian probabilistic inference has often been criticized as alien to human reasoning and therefore particularly hard to explain. There have been, however, recent attempts to refute that accusation by the development of practical and effective systems for explaining Bayesian inference (Elsaesser, 1987; Norton, 1986; Spiegelhalter, 1985). The work reported here is part of such an attempt. Bayesian reasoning usually uses numerical probabilities, but most people express a preference for using natural language phrases, such as "probable", "very unlikely", "almost certain", and so forth. There has long been interest in developing empirical mappings between numbers and such probability phrases (e.g., Lichtenstein and Newman, 1967; Johnson, 1973; Beyth-Marom, 1982; Zimmer, 1983; Zimmer, 1985; Wallsten et al., 1985). Most of these studies simply ask subjects to give the numerical probabilities they consider closest in meaning to selected phrases. This work has found considerable consistency in the ranking of phrases between people, but moderate variability in the numbers assigned by different people. It has also found a significant effect of context on the numerical meaning assigned.


Provided careful note is taken of the interperson and intercontext variabilities (vagueness), we may use such mappings from numbers to phrases to generate explanations automatically in probabilistic decision aids. The converse mapping from phrases to numbers may also be used as an aid to the elicitation of expert opinions expressed as verbal probabilities. Sensitivity analysis should be used to check that this vagueness does not contribute unduly to vagueness in conclusions. Hitherto all such work has been on absolute degrees of belief rather than changes in degrees of belief or probability updates. Since Bayesian inference is primarily about changes in probability, this seems an important lacuna. We have therefore chosen it as the focus of the study reported here. Specifically, our goal is to develop a mapping (a phrase selection function) that gives the relative probability phrase that best expresses a given change in probability. Examples of relative probability phrases are "a little more likely" and "a great deal less likely". An example use might be as follows: suppose the prior probability of Proposition A is 0.5 and evidence is presented that causes a revision to a posterior probability of 0.05. A Bayesian system might explain this thus: "In light of the evidence, A is a great deal less likely."

2. Hypotheses about Phrase Selection Functions

A relative probability phrase selection function gives a phrase for any change from a prior probability p1 to a posterior probability p2. It is a mapping from the unit square with dimensions p1 and p2 into a set of relative probability phrases:

f: [0,1] x [0,1] -> {relative probability phrases}

A phrase selection function effectively partitions the unit square into regions corresponding to specific relative probability phrases. (Figures 1, 2, and 3 show examples.) Our objective is to describe the phrase selection function, for a fixed set of relative probability phrases, that best fits our subjects' actual usage. We should expect the partitions between regions to be monotonically increasing if the ordering of phrases is clear, but the actual shape of the partitioning curves is open to question. Three alternative models seemed intuitively appealing:

H1 Constant probability ratio: This phrase selection function, illustrated in Figure 1, is characterized by regions whose partition lines have p2 proportional to p1:

p2 = c_i p1

Note that H1 exhibits range effects in probabilities. (A similar model was proposed by Oden (1977), but he was concerned with "relative belief" rather than relative probabilities.) One can draw an analogy between the constant ratio model and Fechner's law in psychophysics, which implies that a subjectively constant increment in the magnitude of a quantity is proportional to its absolute magnitude.


H2 Constant probability difference: This model is characterized by partition lines of the form:

p2 = p1 + c_i

Figure 2 allows a visual comparison with the other models. The constant probability difference model does not exhibit range effects in probability.
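The constant-difference model can be made concrete with a small sketch. The following is a minimal, hedged illustration of a phrase selection function of the H2 form; the phrase set and the cutpoints are invented for illustration and are not the empirically derived values reported by this study.

```python
# Minimal sketch of a constant-probability-difference phrase selection
# function (model H2).  The eight phrases and the cutpoints are purely
# illustrative; they are NOT the values derived from the subjects' data.
ILLUSTRATIVE_CUTPOINTS = [
    (-0.40, "a great deal less likely"),
    (-0.15, "much less likely"),
    (-0.03, "a little less likely"),
    (0.03,  "about as likely"),
    (0.15,  "a little more likely"),
    (0.30,  "somewhat more likely"),
    (0.50,  "much more likely"),
]

def select_phrase(p1: float, p2: float) -> str:
    """Map a probability update (p1 -> p2) to a relative probability phrase.

    Under H2 the phrase depends only on the difference p2 - p1, so the
    partition lines in the unit square are p2 = p1 + c_i.
    """
    diff = p2 - p1
    for cutpoint, phrase in ILLUSTRATIVE_CUTPOINTS:
        if diff <= cutpoint:
            return phrase
    return "a great deal more likely"

# Example from the introduction: prior 0.5 revised to posterior 0.05.
print(select_phrase(0.5, 0.05))   # -> "a great deal less likely"
```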


... for any δ > 0, constructs a map to answer global queries such that the answer provided in response to any given query is correct with probability 1 − δ.

2 Spatial Modeling

We model the world, for the purposes of studying map learning, as a graph with labels on the edges at each vertex. In practice, a graph will be induced from a set of measurements by identifying a set of distinctive locations in the world, and by noting their

connectivity. For example, we might model a city by considering intersections of streets to be distinguished locations, and this will induce a grid-like graph. Kuipers [1988] develops a mapping based on locations distinguished by sensed features like those found in buildings (see Figure 1). Figure 2 shows a portion of a building and the graph that might be induced from it. Levitt [1987] develops a mapping based on locations in the world distinguished by the visibility of landmarks at a distance. In general, different mappings result in graphs with different characteristics, but there are some properties common to most mappings. For example, if the mapping is built for the purpose of navigating on a surface, the graph induced will almost certainly be planar and cyclic. In what follows, we will always assume that the graphs induced are connected, undirected, and of bounded degree; any other properties will be explicitly noted. Following [Aleliunas et al., 1979], a graph model consists of a graph, G = (V, E), a set L of labels, and a labeling, φ : V × E → L, where we may assume that L has a null element ⊥ which is the label of any pair (v ∈ V, e ∈ E) where e is not an edge from v. We will frequently use the word direction to refer to an edge and its associated label from a given vertex. With this notation, we can describe a path in the graph as a sequence of labels indicating the edges to be taken at each vertex. If the graph is a regular tessellation, we may assume that the labeling of the edges at each vertex is consistent, i.e., there is a global scheme for labeling the edges and the labels conform to this scheme at every vertex. For example, in a grid tessellation, it is natural to label the edges at each vertex as North, South, East, and West. In general, we do not require a labeling scheme that is globally consistent. You can think of the labels on edges emanating from a given vertex as local directions. Such local directions might correspond to the robot having a compass that is locally consistent but globally inaccurate, or local directions might correspond to locally distinctive features visible from intersections in learning the map of a city. The robot's activities include moving about in the world and sensing the environment. To model these activities we introduce functions that model the robot's sensors and effectors. A movement function is a function from V × L to V. The intuition behind this function is that for any location, one may specify a desired edge to traverse, and the function gives the location reached when the move is executed. A sensor function is a function from V to some range of interest. One important sensor function maps vertices to the number of out edges, that is, the degree of the vertex. Another useful function maps vertices to the power set of the set of all labels, 2^L, giving the possible directions to take from that vertex. We can also partition the set of vertices into some number of equivalence classes and use a function which maps vertices into these classes. We refer to this as a recognition sensor, since it allows the robot to recognize locations. The intuition behind this is that, in some cases, the local properties of locations (i.e., the properties that can be discerned of a particular location while situated in that location) enable us to tell them apart. To model uncertainty, we introduce probabilistic forms of these functions. We now develop and explore two kinds of uncertainty that arise in map learning.
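As a concrete illustration of the graph model just described, the following is a minimal sketch; the class, its method names, and the 2x2 grid example are hypothetical and not part of the original formulation.

```python
# Minimal sketch of the labeled-graph world model described above.
# The grid example and class names are illustrative, not the authors' code.
from typing import Dict, Hashable, Set, Tuple

Vertex = Hashable
Label = str

class GraphModel:
    def __init__(self) -> None:
        # (v, l) -> vertex reached by leaving v along the edge labeled l.
        self.edges: Dict[Tuple[Vertex, Label], Vertex] = {}
        # Recognition sensor: maps each vertex to its equivalence class.
        self.vertex_class: Dict[Vertex, str] = {}

    def add_edge(self, u: Vertex, lu: Label, v: Vertex, lv: Label) -> None:
        """Add an undirected edge, labeled lu at u and lv at v."""
        self.edges[(u, lu)] = v
        self.edges[(v, lv)] = u

    def move(self, v: Vertex, label: Label) -> Vertex:
        """Deterministic movement function V x L -> V."""
        return self.edges[(v, label)]

    def out_labels(self, v: Vertex) -> Set[Label]:
        """Sensor returning the directions (labels) available at v."""
        return {l for (u, l) in self.edges if u == v}

    def degree(self, v: Vertex) -> int:
        return len(self.out_labels(v))

    def recognize(self, v: Vertex) -> str:
        """Recognition sensor: the class of v, e.g. 'L-junction', 'T-junction'."""
        return self.vertex_class[v]

# A 2x2 grid with globally consistent labels North/South/East/West.
g = GraphModel()
g.add_edge((0, 0), "North", (0, 1), "South")
g.add_edge((0, 0), "East", (1, 0), "West")
g.add_edge((0, 1), "East", (1, 1), "West")
g.add_edge((1, 0), "North", (1, 1), "South")
assert g.move((0, 0), "North") == (0, 1)
```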

334

2.1 Modeling Uncertainty

First, there may be uncertainty in the movement of the robot. In particular, the robot may occasionally move in an unintended direction. We refer to this as directional uncertainty, and we model this type of uncertainty by introducing a probabilistic movement function from V × L to V. The intuition behind this function is that for any location, one may specify a desired edge to traverse, and the function gives the location reached when the move is executed. For example, if G is a grid with the labeling given above, and we associate the vertices of G with points (i, j) in the plane, we might define a movement function as follows:

μ((i, j), l) =
  (i, j+1)  with probability 70%  if l is North
  (i+1, j)  with probability 10%  if l is North
  (i−1, j)  with probability 10%  if l is North
  (i, j−1)  with probability 10%  if l is North
  ...

where the "..." indicate the distribution governing movement in the other three directions. The probabilities associated with each direction sum to 1. In this paper, we will assume that movement in the intended direction takes place with probability better than chance. A second source of uncertainty involves recognizing locations that have been seen before. The robot's sensors have some error, and this can cause error in the recognition of places previously visited; the robot might either fail to recognize some previously visited location, or it might err by mistaking some new location for one seen in the past. We refer to this type of uncertainty as recognition uncertainty, and model it by partitioning the set of vertices into equivalence classes. We assume that the robot is unable to distinguish between elements of a given class using only its sensors. In this case the recognition function maps vertices to subsets which are the elements of the partition of the set of vertices. For example, a robot that explores the interior of buildings might use sonar as its primary sensor and use hallway junctions as its distinguished locations. In this case, the robot might be able to distinguish an L junction from a T junction, but might be unable to distinguish between two T junctions. In general, expanding the sensor capabilities of the robot will result in better discrimination of locations, i.e., more equivalence classes, but perfect discrimination will likely be either impractical or impossible. Some locations, however, may be sufficiently distinct that they are distinguishable from all others even with fairly simple sensors. In the model, these locations appear as singleton sets in the partition. We refer to these locations as landmarks. We use the term "landmark" advisedly; our landmarks have only some of the usual properties. Specifically, our landmarks are locations that we occupy, not things seen at a distance. They are landmarks because the "view" from them is unique. In the following, we make the rather strong assumption that, not only can the robot name the equivalence classes, but it can also determine if a given location is a member of an equivalence class that contains exactly one member (i.e., the robot can identify landmarks).
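The two uncertainty models just described can be sketched as follows; the function names, the default α = 0.7, and the uniform spread of the error mass over the other edges are illustrative assumptions, not details given by the authors.

```python
# Minimal sketch of directional uncertainty (move in the intended direction
# with probability alpha > 0.5, otherwise along some other out-edge) and of
# recognition uncertainty (a sensor reporting only an equivalence class,
# plus a landmark test).  Names and defaults are illustrative.
import random
from typing import Dict, Hashable, List, Tuple

Vertex = Hashable
Label = str

def noisy_move(edges: Dict[Tuple[Vertex, Label], Vertex],
               v: Vertex, intended: Label, alpha: float = 0.7) -> Vertex:
    """Probabilistic movement function: with probability alpha the robot
    traverses the intended edge; otherwise it leaves v along one of the
    other edges, chosen uniformly here for simplicity."""
    labels: List[Label] = [l for (u, l) in edges if u == v]
    others = [l for l in labels if l != intended]
    if others and random.random() > alpha:
        intended = random.choice(others)
    return edges[(v, intended)]

def is_landmark(vertex_class: Dict[Vertex, str], v: Vertex) -> bool:
    """v is a landmark iff its sensed equivalence class is a singleton,
    i.e. no other vertex shares its class."""
    cls = vertex_class[v]
    return sum(1 for c in vertex_class.values() if c == cls) == 1
```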


3 Map Learning

For our purposes, a map is a data structure that facilitates queries concerning connectivity, both local and global. Answers to queries involving global connectivity will generally rely on information concerning local connectivity, and hence we regard the fundamental unit of information to be a connection between two nearby locations (i.e., an edge between two vertices in the induced undirected graph). We say that a graph has been learned completely if for every location we know all of its neighbors and the directions in which they lie (i.e., we know every triple of the form (u, l, v) where u and v are vertices and l is the label at u of an edge in G from u to v). We assume that the information used to construct the map will come from exploring the environment, and we identify two different procedures involved in learning maps: exploration and assimilation. Exploration involves moving about in the world gathering information, and assimilation involves using that information to construct a useful representation of space. Exploration and assimilation are generally handled in parallel, with assimilation performed incrementally as new information becomes available during exploration. The problem that we are concerned with in this paper involves both recognition and directional uncertainty with general undirected graphs. In the following, we show that a form of Valiant's probably approximately correct learning is possible when applied to learning maps under certain forms of these conditions. At any point in time, the robot is facing in a direction defined by the label of a particular edge/vertex pair: the vertex being the location of the robot and the edge being one of the edges emanating from that vertex. We assume that the robot can turn to face in the direction of any of the edges emanating from the robot's location. Directional uncertainty arises when the robot attempts to move in the direction it is pointing. Let α > 0.5 be the probability that the robot moves in the direction it is currently pointing. More than 50% of the time, the robot ends up at the other end of the edge defining its current direction, but some percentage of the time it ends up at the other end of some other edge emanating from its starting vertex. With regard to recognition uncertainty, we assume that the locations in the world are of two kinds, those that can be distinguished, and all others. That is, there is some set of landmarks, in the sense explained above, and all other locations are indistinguishable. We model this situation using a partitioning W of V and assuming that we have a sensor function which maps V to W. W consists of some number of singletons and the set of all indistinguishable elements. We further assume that a second sensor function allows us to determine whether the current location is or is not a landmark. For convenience, we define D to be the subset of V consisting of all and only landmark vertices, and I to be the subset of V consisting of all and only non-landmark vertices. We refer to this kind of graph as a landmark graph. We define the landmark distribution parameter, r, to be the maximum distance from any vertex in I to its nearest landmark (if r = 0, then I is empty and all vertices are landmarks). We say that a procedure learns the local connectivity within radius r of some v ∈ D if it can provide the shortest path between v and any other vertex in D within a radius r of v.
We say that a procedure learns the global connectivity of a graph G within a constant factor if, for any two vertices u and v in D, it can provide a path between u and v whose length is within a constant factor of the length of the shortest path between u and v in G. The path will be constructed from


paths found between locally connected landmarks (see Figure 3).

Figure 3: A path found between landmarks A and D.

In the following, we assume that the probability of the robot guessing that it did traverse a path p, given that it actually did traverse p, is γ, that γ > 1/2 + ε where ε is positive, and that the robot knows these two facts. The answers to these guesses might be arrived at by various means. First, some monitoring of the robot's movement mechanisms could provide an indication of the quality of the traversal. Any a priori information about the path could be used to provide the answer, and some information regarding features seen in the previous exploration steps might be useful here as well. We begin by showing that the multiplicative error incurred in trying to answer global path queries can be kept low if the local error can be kept low, that the transition from a local uncertainty measure to a global uncertainty measure does not increase the complexity by more than a polynomial factor, and that it is possible to build a procedure that directs exploration and map building so as to answer global path queries that are accurate and within a small constant factor of optimal with high probability.

Lemma 1 Let G be a landmark graph with distribution parameter r, and let c be some integer > 2. Given a procedure that, for any δ_l > 0, learns the local connectivity within cr of any landmark in G in time polynomial in 1/δ_l with probability 1 − δ_l, there is a procedure that learns the global connectivity of G with probability 1 − δ_g for any δ_g > 0 in time polynomial in 1/δ_g and the size of the graph. Any global path returned as a result will be at most a constant factor (determined by c) times the length of the optimal path.

Proof: Let m be the length of the longest answer we might have to provide to a global query. Then the probability of correctness for any global answer obeys

P(correct answer) ≥ (1 − δ_l)^m

A simple expansion gives

(1 − δ_l)^m = 1 − m·δ_l + E ≥ 1 − m·δ_l

because E ≥ 0. Thus, ensuring that every δ_l = δ_g/m will ensure that P(correct answer) ≥ 1 − δ_g.
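As a purely illustrative numerical check of this error budget (the figures are invented for the example and are not from the paper):

```latex
% Illustrative numbers only: with a global error budget \delta_g = 0.05 and
% longest answer length m = 20, the required per-landmark local guarantee is
\delta_\ell = \frac{\delta_g}{m} = \frac{0.05}{20} = 0.0025,
\qquad
(1-\delta_\ell)^{m} = (0.9975)^{20} \approx 0.951 \;\ge\; 1-\delta_g = 0.95 .
```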

We use the local procedure on every distinguishable vertex in the graph and the resulting representation is sufficient to provide a path between any two distinguishable vertices. Note that we do not have to know |V| in order to calculate δ_l, only the length of the longest answer expected. The proof that the resulting paths are within a constant factor of optimal appears in [Basye et al., 1989].

Lemma 2 There exists a procedure that, for any δ_l > 0, learns the local connectivity within cr of a vertex in any landmark graph with probability 1 − δ_l in time polynomial in 1/δ_l, 1/ε, and the size of G, and exponential in r.

Proof: The learning algorithm can be broken down into three steps: a landmark identification step in which the robot finds and identifies a set of landmarks, a candidate selection step in which the robot finds a set of candidates for paths in G connecting landmarks, and a candidate filtering step in which the robot determines which of those candidates actually correspond to paths in G. In order to prove the lemma, landmark identification has to succeed in identifying all landmarks in G with high probability, candidate selection has to find all paths (or at least all of the shortest paths) between landmarks with high probability, and candidate filtering has to determine which of the candidates correspond to paths in G with high probability. Let 1 − δ_i, 1 − δ_c, and 1 − δ_f correspond, respectively, to the probabilities that the three steps succeed in performing their associated tasks. We will consider each of the three steps in turn. The first step is easy. The robot identifies all the landmarks in G with probability 1 − δ_i by making a random walk whose length is polynomial in 1/δ_i and the size of G. A more sophisticated exploration might be possible, but a random walk suffices for polynomial-time performance. Having identified a set of landmarks, the robot has to try all paths of length r or less starting from each identified landmark. If d is the maximum degree of any vertex in G, then there can be as many as d^r paths of length r or less starting from any vertex in G. This means that an exhaustive search will be exponential in r. Since we expect that r will generally be small, this "local" exponential factor should not be critical. For each landmark, the robot tries some number of paths of length r trying to connect other landmarks within a radius r. Again, a simple coin-flipping algorithm will do for our purposes. Starting from a landmark A, the robot chooses randomly some direction to follow, it records that direction, and then attempts to follow that direction. It continues in this manner until it has taken r steps. If it encounters one or more landmarks (other than A), then it records the set of directions attempted as a candidate path. The resulting candidates look like:

A(out_0), (in_1)X(out_1), ..., (in_{k-1})X(out_{k-1}), (in_k)B

where B is the landmark observed on a path starting from A, and the notation (in)X(out) indicates that the robot observed itself entering a vertex of type X on the arc labeled in and observed itself attempting to leave on the arc labeled out. The probability that the robot will traverse a particular path of length r on any given attempt is (α/d)^r. The probability that the robot will traverse the path of length r that it attempts to traverse is α^r. Since the robot records only those paths it attempts, it has to make enough attempts so that with high probability it records all the paths. The probability that the robot will record any given r-length path on n attempts starting at A is:

1 − [1 − (α/d)^r]^n

In order to ensure that we record all such paths with probability 1 − δ_c, we have to ensure that:

1 − [1 − (α/d)^r]^n ≥ 1 − δ_c

Solving for n, we see that the robot will have to make a number of attempts polynomial in 1/δ_c and exponential in r. Candidate filtering now proceeds as follows for each candidate path. The robot attempts to traverse the path, and, if it succeeds, it guesses whether or not it did so correctly. A traversal of the path that was correct indicates that the path really is in G. With directional uncertainty, it is possible that although the traversal started and ended at the right locations and seemed to take the right direction at each step, the path actually traversed is not the one that was attempted. This results in a "false positive" observation for the path in question. The purpose of the guess after a traversal is to distinguish false positives from correct traversals. For each traversal that succeeds, we record the answer to the guess, and we keep track of the number of positive and negative answers. After n traversals and guesses, if the path really is in G, we expect the number of positive answers to be near nγ. We use n/2 as the threshold, and include only paths with more than n/2 positive answers in our representation. By making n sufficiently large, we can assure that this filtering accepts all and only real paths with the desired probability, 1 − δ_f. We now consider the relation between n and δ_f. The entire filtering step will succeed with global probability 1 − δ_f if we ensure that each path is correctly filtered with some local probability, which we will call δ_lf. An argument similar to the one used in the proof of Lemma 1 shows that the local probability is polynomial in the global probability.
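A minimal sketch of these two calculations follows; it assumes the inequality above (recording a path requires attempting it and traversing it correctly) and uses illustrative parameter values, so it should be read as a reading of the bounds rather than the authors' algorithm.

```python
# Minimal sketch of the candidate-selection and candidate-filtering steps
# sketched above.  The closed-form solve for n simply inverts the inequality
# 1 - [1 - (alpha/d)^r]^n >= 1 - delta_c; names and numbers are illustrative.
import math

def attempts_for_candidate_selection(alpha: float, d: int, r: int,
                                     delta_c: float) -> int:
    """Smallest n with 1 - (1 - (alpha/d)**r)**n >= 1 - delta_c."""
    p = (alpha / d) ** r           # chance one attempt records a given path
    return math.ceil(math.log(delta_c) / math.log(1.0 - p))

def filter_candidate(guesses: list[bool]) -> bool:
    """Keep a candidate path only if more than half of the post-traversal
    guesses say the traversal was correct (the n/2 majority threshold)."""
    n = len(guesses)
    return sum(guesses) > n / 2

print(attempts_for_candidate_selection(alpha=0.7, d=4, r=2, delta_c=0.05))
```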

Fig. 3: The architecture of PSEIKI (Grouper, Splitter, and Merger knowledge sources, together with the Scheduler and the Monitor).

levels of the right panel. In edge-based processing, the perceived image resides at the two lowest levels of the right panel, and, in region-based processing, there is also initial information at the face level. A uniform symbolic representation is used for all items that reside on the blackboard, regardless of panel or level. This symbolic representation consists of a data record containing many fields, one each for an identity tag, the name of the panel, the name of the level, etc. There is also a field that contains the identities of the children of the element, and a value field where information is stored on edge strength, average gray level in a region, etc. There also exists a set of fields for storing information on the parameters used for evidence accumulation. PSEIKI has four main knowledge sources (KS) that it uses to establish correspondences between the elements on the model side and the elements on the data side: Labeler, Grouper, Splitter, and Merger. The Grouper KS determines which data elements at a given level of the hierarchy should be grouped together to form a data element at a higher level of abstraction. Grouping proceeds in a data-driven fashion in response to goals that call for the establishment of nodes corresponding to the nodes on the model side. To explain, assume that the information shown in Fig. 4 resides on the blackboard. In this case, the nodes FA, FB, and FC will reside at the face level of the model panel. The node FA will point to the nodes EA, EB, EC, and ED at the edge level, and so on. Of course, there will also be, on the model panel, a node at the object level pointing to the nodes FA, FB, and FC at the face level. In this case, the Scheduler, to be discussed later, posts goals that seek to establish data-panel nodes corresponding to the model nodes. For example, a goal will be posted to form data nodes, each node being a different hypothesis, for the model node FA. To respond to this goal, the Scheduler will examine all the knowledge source activation records (KSAR's) that try to invoke the Grouper KS for those orphan data elements whose current labels correspond to one of the edges in FA. (As soon as the Monitor sees a data element without a parent, it sets up a KSAR that seeks to invoke the Grouper KS.) The Scheduler will

select that KSAR whose data edge has the strongest attachment with any of the edges in FA on the basis of the belief values. The data edge corresponding to such a KSAR then becomes a seed element for forming a grouping. In other words, the Scheduler uses this KSAR to fire the Grouper KS, which 'grows' the seed into an aggregation of data elements on the basis of relational considerations. For example, the Grouper KS will group E1 with E3 because the geometrical relationship between E1 and E3 is believed to be the same as between their current model labels. Using such considerations, for the example shown in the figure, the Grouper KS will propose the grouping {E1, E2, E3, E5, E7, E6, E4} under the face node F1, and consider F1 as a tentative correspondent of the model node FA. This grouping will subsequently be examined for internal consistency by the Labeler KS for the purpose of computing our revised belief in each of the labels for the data edges in the grouping and in using FA as a label for F1. The first action by the Labeler KS, which takes place before the Grouper KS does any groupings at all, is to construct an initial frame of discernment (FOD) for each data element on the basis of physical proximity; meaning that initially all the model elements within a certain distance of the data element (at the same level of abstraction) will be placed in the FOD for the data element. Note that since the camera used is calibrated, the comparison between the model and the data takes place in the same space - we could call it the image space, as opposed to the 3D space of the scene. The second major action of the Labeler KS, which takes place on a recurring basis, can be described as follows: Given a grouping from the model side and a tentative grouping on the data side, as supplied by the Grouper KS, the job of the Labeler KS is to estimate the degree of belief that can be placed in various possible associations between the data elements and their labels from the model side. These belief values are computed by revising the initial beliefs on the basis of the extent to which the data elements satisfy the relational constraints generated by their currently most-believed model labels. The revised beliefs are then propagated up the hierarchy, as discussed in the next section. While the Grouper KS aggregates data elements at one level of abstraction for representation by a node at the next higher level, the function of the Merger KS is to aggregate data elements at one level of abstraction so that the aggregation can be treated as a single element at the same level. In other words, while the Grouper KS may group together a set of edges into a face, the Merger KS will try to group a series of short edges into a longer edge. The Splitter KS performs the opposite action of the Merger KS; it splits a single element on the blackboard into multiple elements at the same level. The overall flow of control is controlled by the Monitor and the Scheduler, acting in concert. The Monitor uses OPS demons to run in the background, its task being to watch out for the data conditions that are needed for triggering the various KS's. For example, if there is a data element without a parent, it is the Monitor's job to become aware of that fact and synthesize a KSAR that is a record of the identity of the data element and the KS which can be triggered by that element. Initially, when the KSAR's are first created, they are marked as pending.
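The KSAR bookkeeping just described, together with the scheduling policy elaborated in the next paragraph, can be sketched as follows; the data structures and the priority ordering are illustrative simplifications, not PSEIKI's actual OPS-based implementation.

```python
# Minimal sketch of Monitor/Scheduler bookkeeping: the Monitor turns
# blackboard conditions (e.g., an orphan data element) into pending KSAR's,
# and the Scheduler picks one to fire.  Illustration only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KSAR:
    knowledge_source: str      # "Labeler", "Grouper", "Splitter", or "Merger"
    data_element: str          # identity tag of the triggering data element
    belief: float = 0.0        # attachment strength, used for tie-breaking
    status: str = "pending"

@dataclass
class Scheduler:
    ksars: List[KSAR] = field(default_factory=list)
    # Splitter/Merger first, since they correct misformed groups.
    priority = {"Splitter": 0, "Merger": 1, "Grouper": 2, "Labeler": 3}

    def post(self, ksar: KSAR) -> None:
        self.ksars.append(ksar)

    def select(self) -> Optional[KSAR]:
        pending = [k for k in self.ksars if k.status == "pending"]
        if not pending:
            return None
        chosen = min(pending,
                     key=lambda k: (self.priority[k.knowledge_source],
                                    -k.belief))
        chosen.status = "active"
        return chosen
```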
When no KS is active, the Scheduler examines all the pending KSAR's and selects one according to prespecified policies. For example, the status of a KSAR that tries to invoke the Merger or the Splitter KS is immediately changed to active. It seems intuitively reasonable to fire these KS's first because they seek to correct any misformed groups. More precisely, the operation of the Scheduler can be broken into three phases. In the first phase, the initialization phase, which uses extensive backchaining, the Scheduler operates in a completely model-driven fashion for the establishment of nodes on the data side

corresponding to the supplied nodes on the model side. If the Scheduler cannot find data correspondents of the model nodes, it posts goals for their creation. In other words, the Scheduler examines the model panel from top to bottom and checks whether there exists a certain pre-specified number, n_G, of data correspondents of each model node. If the number of data nodes corresponding to a model node is fewer than n_G, the Scheduler posts goals for the deficit. If this model-driven search is being carried out at a level that is populated with data nodes, then the Scheduler must initiate action to search through those data nodes for possible correspondents for the model node. This is done by activating the KSAR's that seek to invoke the Labeler KS for computing the initial belief values for the data elements, using only proximity considerations as discussed in the next section, and retaining up to n_G data nodes that acquire the largest probability mass with respect to the model node. (Note that when the data elements are first deposited on the right panel of the blackboard, KSAR's for invoking the Labeler and the Grouper KS's for these data elements are automatically created; these KSAR's have pending status at the time of their creation.) For example, if the contents of the two panels of the blackboard are as shown in Fig. 4, the Scheduler will backchain downwards through the model panel, starting with the scene node. At the edge level, it will discover data on the right panel. The Scheduler will therefore activate the KSAR's that seek to compute the initial belief values for these edges. After the initial belief values are computed, the Scheduler will retain n_G data edges for each model edge. As was mentioned before, if the number of data nodes corresponding to a model node is fewer than n_G, the Scheduler posts a goal for the deficit. For example, for the case of Fig. 4, the Scheduler will recognize that initially there will not be any data nodes corresponding to the object level model node for the cube, so the Scheduler will post a goal for the establishment of n_G object level data nodes for the cube. These n_G nodes, after they are instantiated, will presumably lead to different and competing hypotheses (different groupings) at the object level. In the same vein, the Scheduler will post a goal for the establishment of n_G competing nodes that would correspond to the node FA. Since for the example under consideration there exist data nodes at the edge level, the goals set up by the Scheduler would

Fig. 4: We have used this example to explain many of the points regarding evidence accumulation in PSEIKI. The left frame depicts the information residing in the model panel, and the right frame the information in the data panel.

be somewhat different. For example, the data edges E4 and E5 may have edge EC as their labels. Therefore, the goal posted by the Scheduler will only require the establishment of n_G − 2 additional data nodes corresponding to the model edge EC. Of course, if no additional data edges can be found that can take the label EC, the Scheduler will make do with just 2. This process is akin to using a depth bound for finding a solution in a search graph. It should be clear that in the initialization phase, the operation of the Scheduler combines top-down model-driven search for grouping and labeling with bottom-up data-driven requests for finding parents for ungrouped data elements and for computing beliefs in the possible labels for the new groupings. Combinatorial explosions are controlled by putting an upper bound on the number of competing hypotheses that can be entertained in the model-driven search. It is important to note that the number of competing hypotheses for any model node is not limited to n_G. To explain, assume that the Grouper KS has grouped the edges {E1, E2, E3, E5, E7, E6, E4}. Since the Splitter KS is given a high priority by the Scheduler, most likely this KS will fire next and probably discover that in the group formed the data edge E3 is competing with the data edge E5. Therefore, the Splitter KS will split the group into two groups, {E1, E2, E5, E7, E6, E4} and {E1, E2, E3, E7, E6, E4}. In other words, because of the action of the Splitter KS, there can be a geometrical multiplication of the hypotheses formed by the Grouper KS. For these reasons, it becomes necessary to give a small value to n_G. For most of our experiments, n_G is set to 3. Since our explanation above was based on Fig. 4, the reader is probably wondering how the Grouper KS might construct n_G different and competing data groupings corresponding to, say, the model node FA. After the first grouping is constructed by the procedure already discussed, the Scheduler will discover that it still does not have n_G groupings corresponding to the model node FA. As before, the Scheduler will examine all the pending KSAR's that seek to invoke the Grouper KS on data elements whose labels come from the edges in FA. Of these, the KSAR associated with the data edge that attaches most strongly, on the basis of the belief values, with one of the edges in FA, is selected for firing the Grouper KS, the data edge serving as a seed. (Note that the KSAR selected for the second grouping will not be the same as for the first grouping, since the KSAR used earlier is no longer pending.) After the second grouping is formed, it is compared with the first. If the two are identical, it is discarded. This process is continued until as many groupings can be formed as possible, with the total number not exceeding n_G − 1 at the last attempt. When the n_G-th grouping is formed, it is possible that owing to the action of the Splitter KS we may end up with more than n_G groupings. Any time a new group is formed, a KSAR is created that seeks to invoke the Labeler KS for computing the initial belief values to be assigned to the data node corresponding to this grouping. For example, suppose the Grouper KS has formed the grouping {E6, E9, E10, E12, E13, E11} under the face node F2 to correspond to the model face node FB. Subsequently, the Labeler KS will construct a frame of discernment for F2, consisting of all the model faces that have any overlap with F2. In our example, this frame of discernment for F2 could be {FA, FB, FC}.
The label assigned to F2 will then consist of that model face label which gets the most mass using the formulas shown in the next section. It might seem incongruous to the reader that while the model face FB was used for constructing the grouping F2, we should now permit the latter to acquire a different label. While in practice such a transfer of labels is not very likely, such a possibility has to be left open for the sake of a homogeneous computational procedure. At the end of the initialization phase, the system has deposited on the data panel a

number of competing nodes for each node on the model side. In practice, if the expectation map and the perceived image are sufficiently dissimilar, there will exist model nodes with no correspondents on the data side. At the same time, especially if the image pre-processor is producing many parallel lines for each real edge in the scene, there will exist many competing nodes, possibly exceeding n_G, on the data side for each node on the model side. It is important to note that the labels generated for the data nodes in the initialization phase of the Scheduler only involved proximity considerations. Relational considerations are taken into account in the phase discussed next. The second phase of the Scheduler is the updating phase. Unlike in the first phase, during the updating phase the Scheduler makes no use of the contents of the model panel. On the other hand, the Scheduler traverses the data panel from top to bottom and invokes relational considerations through the Labeler KS to revise the belief values in the association of the data nodes and their labels. Of course, the Labeler KS must access the model information to figure out the geometrical relationships between the different model nodes, so that these relationships can be compared with those between the corresponding data nodes. To explain, let's go back to the example of Fig. 4. During the initialization phase, when the Labeler assigns initial beliefs to the nodes in the data panel, it also creates KSAR's for updating these belief values; however, these KSAR's are not attended to by the Scheduler until the updating phase. For the example of Fig. 4, the Scheduler will first look at the KSAR corresponding to the object level nodes in the data panel. Consider the object level node made of the face grouping {F1, F2, F3}. The KSAR that calls for revising the beliefs associated with this object level node will in fact apply the Labeler KS to the face grouping {F1, F2, F3}, using relational considerations such as similarity of the transformations between F1 and F2 on the one hand and FA and FB on the other, assuming that FA and FB are the current labels for F1 and F2, respectively. Similarly, when the KSAR for updating the belief values associated with the face F2 is processed, the result is the application of the Labeler KS to the edges {E6, E9, E10, E12, E13, E11} for belief revision on the basis of relational considerations. Finally, during the last phase, the propagation phase, the belief revision takes place by propagating the belief values up the data panel hierarchy. Although the operation of the Scheduler was presented as consisting of three separate phases, temporally speaking the boundaries between the phases are not as tight as might be construed by the reader from our discussion so far. For example, if in the middle of the updating phase the labels of two faces become identical and if these faces satisfy certain additional criteria, such as sharing a common boundary, the Merger KS will merge the two faces into a single grouping. When the Merger KS creates this new grouping, it will also post KSAR's for invoking the Labeler KS for initial belief value computations. This is one example of how computations typical of the initialization phase may have to be carried out during another phase. Another example would be when a data node changes its label during the process of belief revision in the update phase. Note that a data node takes that label for which it has the largest probability mass in the frame of discernment.
The process of updating beliefs on the basis of relational considerations can lower the belief in the currently held label for a data node vis-a-vis the other labels in the frame of discernment. When that happens, the data node will change its label, and that would trigger the formation of KSAR's of the updating and initialization variety. For example, for the case of Fig. 4, suppose during the update phase the label for the data edge E1 changes from EA to EC. This would trigger the formation of a KSAR for updating the belief in the new label EC. Similarly, if F1's label were to change from FA to, say, FC during the update or the propagation phases, that would launch a KSAR that we refer to as the "labeling KSAR with re-labeling action." Assuming that at the

instant F1's label changed, its children were {E1, E2, E3, E5, E7, E6, E4}, the re-labeling action consists of first eliminating any previous bpa's (basic probability assignment functions) and frames of discernment for all of these Ej's, and then using the edges in FC as the new frame of discernment for each Ej.

3. ACCUMULATION OF EVIDENCE

Evidence accumulation in PSEIKI is carried out by the Labeler KS, which invokes different procedures for each of the three phases of the Scheduler. As opposed to being formalistic, practically all our explanation in this section will be with the help of simple examples. A more formal exposition can be found in [Andress and Kak 1989].

Initialization: Recall that in the initialization phase, the Labeler KS is called upon to examine different possible associations between the data nodes and the model nodes, the model node candidates for such associations being determined solely on the basis of their physical proximity to the data nodes. Let's say that during the initialization phase, an initial bpa function is sought for the data edge E1 in Fig. 4. The Labeler KS will pool together all the model edges whose centers of mass are within a radius r_max of the center of mass of E1 and call this pool the frame of discernment for figuring out the labels for E1. Let's say that this FOD, denoted by Θ_initial, consists of

Θ_initial = {EA, EC, EE}    (1)

To accumulate belief over this FOD, we use the metaphor that each model edge in the FOD is an expert and tells us, by using similarity and dissimilarity metrics, how much belief it places in its similarity to the data edge E1. In other words, the expert EA gives us the following information:

m_E1({EA}) = similarity_metric(E1, EA)
m_E1({¬EA}) = dissimilarity_metric(E1, EA)
m_E1(Θ) = 1 − m_E1({EA}) − m_E1({¬EA})    (2)

As the reader may recall, the bpa shown constitutes what Barnett calls a simple evidence function [Barnett 1981]. The similarity metrics currently being used in PSEIKI are presented in [Andress and Kak 1989]. For the example under discussion, the "experts" EC and EE will yield the following two simple evidence functions:

m_E1({EC}) = similarity_metric(E1, EC)
m_E1({¬EC}) = dissimilarity_metric(E1, EC)
m_E1(Θ) = 1 − m_E1({EC}) − m_E1({¬EC})    (3)

and

m_E1({EE}) = similarity_metric(E1, EE)
m_E1({¬EE}) = dissimilarity_metric(E1, EE)
m_E1(Θ) = 1 − m_E1({EE}) − m_E1({¬EE})    (4)

The Labeler KS combines these simple evidence functions using Barnett's algorithm. The accumulated belief is computed only for the singleton propositions in Θ_initial. The singleton proposition with the largest mass is then called the current label for the data edge E1. Assume, for the sake of discussion, that at this point the Labeler has declared EA to be the current label for E1. It is most important to note that the three bpa's shown above are not discarded at this point. During the update phase, when beliefs are being revised on the basis of relational considerations, the updating bpa's are combined with the bpa's shown above. Also, the FOD for the data nodes is expanded to include additional labels representing the model correspondents of those data groupings in which the data node is currently participating. After the updating bpa's are combined with the initial bpa's shown above, it is entirely possible that the largest probability mass will be accrued for a singleton proposition that is different from the currently held label for E1. When this happens, the label for E1 will automatically change to the one for which the probability mass is now maximum. Procedurally, the computation of initial beliefs and labels at all levels of the blackboard is identical to that outlined above; only the similarity and dissimilarity metrics used are different.

Belief Revision: To explain with an example the process of belief revision on the basis of relational considerations, let's assume that the Grouper has advanced {E1, E2, E5, E7, E6, E4} as a possible grouping under the face level node F1, and that the current label for F1 is FA. (Operationally, the procedure for finding the current label for face F1 is identical to the one described above under Initialization. The Labeler constructs an initial FOD for F1 on the basis of physical proximity, uses face similarity metrics to generate a set of simple evidence functions for the singleton propositions in this FOD, and finally sets F1's label to the singleton proposition with the largest probability mass.) Let's now focus on the data edge E1 from the grouping and explain what happens during the update phase of the Scheduler. First note that as soon as the F1 grouping {E1, E2, E5, E7, E6, E4} is formed, the FOD for edge E1 is enlarged by adding to E1's initial FOD the model edge set corresponding to the face node FA, since FA is the current label for F1. This FOD enlargement is carried out for each Ej ∈ F1. In other words, as soon as the grouping F1 comes into existence, the following new FOD is formed for E1:

Θ_revised = {EA, EC, EE, EB, ED}    (5)

which is obtained by taking the union of the initial FOD for E1 and the members of the grouping corresponding to the label FA for the face node F1. Therefore, when the Labeler is invoked with a KSAR seeking to update the belief value for a face node, such as node F1, the Labeler understands the request to mean that beliefs should be revised for all the children of F1 on the basis of their geometrical interrelationships vis-a-vis the corresponding relationships on the model side. The metaphor used for updating the beliefs associated with E1 on the basis of its belonging to the grouping F1 is that all the other edge elements in F1 are "experts" in figuring out their geometrical relationships to the edge E1 and comparing these relationships to those satisfied by their labels.

To elaborate, let's say that we want to compute the contribution that E5 will make to revising our belief in the assertion that E1's label is EA. To estimate this contribution, we will set up the following bpa:

m_update:E5→E1({EA}) = m_E5({EX}) · relational_similarity_metric(E5, E1; EX, EA)
m_update:E5→E1({¬EA}) = m_E5({EX}) · relational_dissimilarity_metric(E5, E1; EX, EA)
m_update:E5→E1(Θ_revised) = 1 − m_update:E5→E1({EA}) − m_update:E5→E1({¬EA})    (6)

where EX is E5's current label; X could, for example, be B. The relational similarity and dissimilarity metrics give us measures of similarity and dissimilarity of the geometrical relationship between the first two arguments and the geometrical relationship between the last two arguments. For example, relational_similarity_metric(E5, E1; EX, EA) figures out the rigid body transformation between E5 and E1, figures out the rigid body transformation between EX and EA, compares the two transformations, and then returns a measure of similarity between them. Further details on these rigid body transformations and their comparisons can be found in [Andress and Kak 1989]. For the example under consideration, for each index i in the set {2, 5, 7, 6, 4}, this updating process will generate the following simple evidence function:

m_update:Ei→E1({EA})
m_update:Ei→E1({¬EA})
m_update:Ei→E1(Θ_revised)    (7)

When these simple evidence functions are combined using Dempster's rule with the simple evidence functions generated during the initialization phase, we obtain our revised belief in the various possible labels for the data edge E1. For the example under consideration, note that any belief in the proposition ¬EA will lend support to labels other than EA in the FOD of E1. (Since the focal elements of all the update bpa's for, say, the data edge E1 are the same, these in the current example being {EA}, {¬EA} and Θ_revised, Dempster's rule for combining the bpa's possesses a simple and fast implementation without involving any set enumeration; a sketch of this special case appears below.) As was mentioned before, if during such belief revision the largest mass is accrued for a singleton proposition that is not the current label, then the current label would change and correspond to the singleton proposition. The reader should note that the request to update the beliefs associated with the data node F1 caused the beliefs and labels associated with F1's children to be altered. In other words, during the update phase, a KSAR ostensibly wanting to update the beliefs associated with the nodes at one level of abstraction actually causes the updating to occur at a lower level of abstraction. Although making for a cumbersome explanation of the belief revision process, there is an important operational reason for this. The belief revision process relies on relational considerations, involving mutual relationships amongst the members of a data grouping vis-a-vis the corresponding relationships amongst the labels in a model grouping.
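The fast combination mentioned in the parenthetical above can be sketched as follows. This is an illustrative rendering of Dempster's rule specialized to bpa's whose only focal elements are {A}, {¬A}, and Θ; it is not PSEIKI's code, and the numbers in the example are invented.

```python
# Minimal sketch: when every bpa has the same three focal elements
# ({A}, {not A}, Theta), Dempster's rule reduces to a few multiplications
# and one normalization, so no set enumeration is needed.
from typing import List, Tuple

Bpa = Tuple[float, float, float]   # masses on ({A}, {not A}, Theta)

def combine(m1: Bpa, m2: Bpa) -> Bpa:
    a1, n1, t1 = m1
    a2, n2, t2 = m2
    conflict = a1 * n2 + n1 * a2            # mass sent to the empty set
    k = 1.0 - conflict                      # normalization constant
    a = (a1 * a2 + a1 * t2 + t1 * a2) / k
    n = (n1 * n2 + n1 * t2 + t1 * n2) / k
    t = (t1 * t2) / k
    return (a, n, t)

def combine_all(bpas: List[Bpa]) -> Bpa:
    result = (0.0, 0.0, 1.0)                # the vacuous bpa
    for m in bpas:
        result = combine(result, m)
    return result

# e.g., three "experts" weighing the label EA for a data edge:
print(combine_all([(0.6, 0.1, 0.3), (0.5, 0.2, 0.3), (0.1, 0.4, 0.5)]))
```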

Since the grouping information at one level of abstraction can only be determined by examining the nodes at the next higher level of abstraction, hence the reason for using the update KSAR's for, say, the face level nodes to actually update the beliefs associated with the nodes at the lower level, the edges. Before leaving the subject of belief updating, we would like to mention very quickly that when an edge like E6, which is common to two faces, is first grouped into, say, face F1, its FOD is expanded by taking a union of its initial FOD and the edges that are grouped under the current label of F1. If we assume that the initial FOD for E6 was

Θ_initial = {EC, ED, EF}

and if we further assume that the current label for F1 is FA, then upon the formation of the first grouping, E6's FOD is revised to

Θ_revised = {EF, EA, EB, ED, EC}

Now, when E6 is grouped again under the face node F2, E6's FOD gets further revised to become

Θ_revised = {EF, EA, EB, ED, EC, EG, EH}

which is the union of the previous FOD and the group of model edges under FB, assuming that FB is the current label for F2. Since the bulk of the grouping takes place during the initialization phase of the Scheduler, for such an edge the FOD used for belief revision using relational considerations would in most cases correspond to the latter version.

Propagation: During this phase, the Labeler "pushes" the belief values up the abstraction hierarchy residing in the data panel. The rationale on which we have based PSEIKI's belief propagation up the data hierarchy satisfies the intuitive argument that any evidence confirming a data element's label should also provide evidence that its parent's label is correct. Continuing with the previous example, note that the request to update the beliefs associated with the face node F2 actually caused the beliefs associated with F2's children to be revised on the basis of relational considerations. During the propagation phase, we want the revised beliefs associated with F2's children to say something about the beliefs associated with F2 itself. To explain how we propagate the beliefs upwards, let's consider the nature of the bpa obtained by combining all the update bpa's for the data edge E1:

m_update:E1 = m_update:E2→E1 ⊕ m_update:E4→E1 ⊕ m_update:E5→E1 ⊕ m_update:E6→E1 ⊕ m_update:E7→E1    (8)

If, at the time of computation of the individual update bpa's here, the label for E1 was EA, then the focal elements of the function m_update:E1 are only {EA}, {¬EA} and Θ_revised. Clearly, the update probability mass as given by m_update:E1({EA}) arises from the consistency of E1's label with its siblings' labels, all these labels being derived from the children of the face node FA; this probability mass can therefore be considered as a weighted vote of confidence that face F1's label is FA. Similarly, the update probability mass given by m_update:E1({¬EA}) arises from the inconsistency of E1's label with the labels of its siblings, the labels again being derived from the children of FA; this mass can therefore be considered as a weighted vote of no confidence in the assertion that F1's label is FA. In a similar vein, m_update:E1(Θ_revised) may be considered as a measure of ignorance about F1's current label, from the standpoint of the "expert" E1, ignorance in light of the labels currently assigned to E1 and its siblings. On the basis of this rationale, we can construct the

following bpa for updating the beliefs associated with the face node F1:

m_update:E1→F1({FA}) = m_update:E1({EA})
m_update:E1→F1({¬FA}) = m_update:E1({¬EA})
m_update:E1→F1(Θ_face) = m_update:E1(Θ_revised)    (9)

Since m_update:E1 is a valid bpa, having been obtained by the combination shown in Eq. (8), it follows that m_update:E1→F1 must also be a valid bpa. A bpa like the one shown above could be generated for each child node of the face node F1. Out of all such bpa's, we retain the one which exhibits the strongest attachment to its model label. To explain, say F1's children are {Ei | i = 1, 2, 4, 5, 6, 7} (see Fig. 4). Corresponding to each of these edges, there will exist a composite updating bpa m_update:Ei obtained via the combination in Eq. (8). For constructing the propagating bpa, m_update:F1, we now use only that m_update:Ei which has the largest belief for Ei's current label. Let FA be the current label for the face node F1; then the probability masses associated with the propagating bpa thus obtained will be denoted by

m_update:F1({FA})
m_update:F1({¬FA})
m_update:F1(Θ_face)    (10)

This bpa is combined with the currently stored simple evidence functions for the face node F1. Of course, if as a result of this combination the probability mass assigned to the label FA is no longer the maximum, the label of F1 will be changed, which, as was mentioned before, invokes the initialization type computations once again.

4. EDGE-BASED vs. REGION-BASED OPERATION

Note that PSEIKI can be operated in two different modes: the edge-based mode and the region-based mode. In the edge-based mode, edges extracted from the perceived image are input into the two lowest levels of the data panel of the blackboard in Fig. 3. In the region-based mode, the perceived image is segmented into regions of nearly constant gray levels and the result input into the three lowest levels of the data panel. Note that in the region-based mode, there is no presumption that the regions would correspond to the faces in the expectation map. In fact, in most imagery, because of glare from surfaces and other artifacts, each face in the expectation map will get broken into many regions in the data, and there can also be regions in the data that straddle two or more faces in the expectation map. However, as we have noticed in our experiments, in many cases the Merger KS is able to merge together some of the regions that correspond to the same face. Of course, it is not necessary for such merging to be perfect, since for experiments in mobile robot self-location we do not need 1-1 correspondence between the perceived image and the expectation map everywhere. Thanks to the rigid-body constraints, the scene to model correspondence need only be established at a few locations to calculate the position and the orientation of the mobile robot with precision, as long as these locations satisfy certain geometrical constraints. More on this subject later.

5. ARE INDEPENDENCE CONDITIONS SATISFIED?

The following question is frequently raised regarding evidence accumulation in PSEIKI: Have we satisfied the necessary condition for the application of Dempster's rule, the condition that says that all evidence must come from disparate sources, i.e., the sources of evidence must be independent? Superficially it may seem that PSEIKI violates this condition since, when we compute an update bpa, as for example shown in Eqs. (6), we multiply the relational metrics by an initial bpa. Despite the aforementioned use of initial belief functions in the updating process, a closer examination reveals that the independence requirements are not being violated. To explain, let's first state that by independence we mean lack of predictability. Therefore, the question of independence reduces to whether an updating bpa, like m_update:E5 in Eqs. (6), can be predicted to any extent from a knowledge of one of the initial bpa's, for example m_E5. We believe such a prediction cannot be made for two reasons: 1) a product of a predictable entity with an unpredictable entity is still unpredictable; and 2) the relational metric values that enter the formation of m_update:E5 are not within the purview of the "expert" giving us m_E5 on the basis of non-relational and merely geometric similarity of a data node with those model nodes that are in spatial proximity to the data node. Another way of saying the same thing would be that since PSEIKI may be called upon to "match" any image with the expectation map, it has no prior knowledge that the structure (meaning the relationships between different data entities) extracted in the supplied image bears any similarity to the structure in the expectation map. In general, it must be assumed that the data elements can be in any relationships -- and, therefore, unpredictable relationships -- vis-a-vis the relationships between their currently believed model correspondents. Therefore, it will not be possible to predict a probability mass distribution obtained from relational considerations from a probability mass distribution obtained from just element-to-element similarity considerations. Similar reasoning will show that for different values of the index i the updating bpa's in Eq. (7) are mutually independent. Hence, we can claim that the independence requirements are not being violated for the application of Dempster's rule.

6. ROBOT SELF-LOCATION USING PSEIKI

Fig. 5a shows a line drawing representation of an expected scene that was used by PSEIKI for the interpretation of the image shown in Fig. 5b during a navigation experiment. The expectation map of Fig. 5a, obtained by rendering the CSG representation of the hallways using the calibration parameters of the camera on the robot and the position of the robot as supplied by odometry, was supplied to PSEIKI as a four level hierarchy of abstractions: vertices, edges, faces and scene. This abstraction hierarchy was produced by modifying a CSG based geometric modeler developed in the CADLAB at Purdue [Mashburn 1987]. Edges and regions were extracted from the camera image of Fig. 5b and input into the three lowest levels of the data panel of the blackboard. When PSEIKI terminates its processing, at the end of the belief propagation phase, there will generally be multiple nodes at the scene level, each with a different degree of belief associated with it.
The autonomous navigation module controlling the mobile robot selects the scene node on the data panel that has the largest probability mass associated with it, and then works its way down the data panel to extract the edges associated with that scene node. For example, for the expectation map of Fig. 5a and the image of Fig. 5b, all the data

Fig. 5: While (a) shows the expectation map, (b) shows the actual camera image for an exercise in self-location. Mis-registration between the two is evident. Shown in (c) are the data edges from the image of (b) corresponding to the scene node with the largest probability mass associated with it.

edges corresponding to the scene level node with the largest probability mass are shown in Fig. 5c. From the data edges thus extracted, the navigation module retains those that have labels with probability masses exceeding some high threshold, usually 0.9. These edges and their labels are then used for self-location. The actual calculation of the robot's location is carried out by keeping track of two coordinate systems: the world coordinate system, represented by W3, in which the hallways are modeled, and the robot coordinate system, represented by R3, which translates and turns with the motions of the robot. The camera is calibrated in R3; the calibration parameters make it possible to calculate the line of sight in R3 to any pixel in the image. The problem of robot self-location is to compute the position and the orientation of R3 with respect to W3. In our work, we have assumed that the origin of R3 always stays in the xy-plane of W3 and

that the z-axes of both coordinate systems are parallel and designate the vertical. By the orientation of R3 we mean the angular rotation of the xy-plane of R3 with respect to the xy-plane of W3. We have shown in [Lopez-Abadia and Kak 1989] how the edge correspondences between the camera image and the expectation map, as returned by PSEIKI, can be used to solve the problem of self-location. To summarize the procedure described in [Lopez-Abadia and Kak 1989], we decompose the problem of robot self-location into two sub-problems: the problem of finding the orientation of R3 and the problem of finding the coordinates of the origin of R3 in the xy-plane of W3. As shown in [Lopez-Abadia and Kak 1989], the orientation of the robot can be found from a single edge in the camera image provided that edge has been identified as one of the horizontal edges by PSEIKI. If more than one horizontal edge is available, a weighted average is taken of the orientation results produced by the different edges, the weighting being a function of the belief values associated with the edges. A couple of different approaches are used simultaneously to compute the coordinates of the origin of R3 in W3. One of these approaches makes use of the fact that it is possible to compute the perpendicular distance of the origin of R3 from a single edge in the image if that edge has been identified as a horizontal edge. Therefore, if PSEIKI can show matches for two non-parallel horizontal lines in the model, the world coordinates of the origin of R3 are easily computed. The second approach is capable of using any two image edges, provided they correspond to two non-parallel lines in the model, for computing the location of the origin of R3. Again, the results produced by both these approaches, for all possible pairs of edges satisfying the necessary conditions, are averaged using weights that depend upon the beliefs associated with the edges.

7. CONCLUDING REMARKS

A unique feature of PSEIKI is that it enforces model-generated constraints at different levels of abstraction, and it does so under a very flexible flow of control made possible by the blackboard implementation. Could we have used, say, a relaxation-based approach for the same end purpose, that of labeling the edges or regions in an image with labels from a model? We believe not, for the following reason: to parallel PSEIKI's competence at reasoning under uncertainty, one would have to use probabilistic relaxation. However, we do not believe it would be an easy feat to make the implementation of probabilistic relaxation as model-driven as is PSEIKI. In other words, it would be hard to incorporate in probabilistic relaxation as much model knowledge as we can in PSEIKI. To give the reader an idea of the size of PSEIKI, the OPS 83 code for the blackboard is about 10,000 lines long, with another 10,000 lines of C code for many functions called by the rules and for the image pre-processor. Currently, on a SUN3 workstation, it takes PSEIKI about 15 minutes to process one image, which, as was mentioned before, is one of the reasons why the mobile robot does not attempt self-location continuously, but only on a need basis, the frequency of need depending on the quality of the odometry.

8. REFERENCES

[Andress and Kak 1988] K. M. Andress and A. C. Kak, "Evidence Accumulation and Flow of Control in a Hierarchical Spatial Reasoning System," The AI Magazine, Vol. 9, No. 2, 75-95, 1988.
[Andress and Kak 1989] K. M. Andress and A. C. Kak, The PSEIKI Report -- Version 3, School of Electrical Engineering, Purdue University, Technical Report TR-EE 89-35, 1989.
[Barnett 1981] J. A. Barnett, "Computational Methods for a Mathematical Theory of Evidence," Proceedings IJCAI, 868-875, 1981.
[Blask 1989] S. G. Blask, IHSE: Interactive Hierarchical Scene Editor, RVL Memo #11, Robot Vision Lab, EE Building, Purdue University, 1989.
[Kak, Andress and Lopez-Abadia 1989] A. C. Kak, K. M. Andress and C. Lopez-Abadia, "Mobile Robot Self-Location with the PSEIKI System," AAAI Spring Symposium Series Workshop on Robot Navigation, Stanford, CA, 1989.
[Lopez-Abadia and Kak 1989] C. Lopez-Abadia and A. C. Kak, Vision Guided Mobile Robot Navigation, School of Electrical Engineering, Purdue University, Technical Report TR-EE 89-34, 1989.
[Mashburn 1987] T. Mashburn, A Polygonal Solid Modeling Package, M.S. Thesis, School of Mechanical Engineering, Purdue University, 1987.
[Mortenson 1985] M. E. Mortenson, Geometric Modeling, John Wiley & Sons, New York, 1985.


Model-Based Influence Diagrams For Machine Vision

T. S. Levitt,* J. M. Agosta,+ T. O. Binford+

*Advanced Decision Systems, Mountain View, California
+Stanford University, Stanford, California

Abstract

We show an approach to automated control of machine vision systems based on incremental creation and evaluation of a particular family of influence diagrams that represent hypotheses of imagery interpretation and possible subsequent processing decisions. In our approach, model-based machine vision techniques are integrated with hierarchical Bayesian inference to provide a framework for representing and matching instances of objects and relationships in imagery and for accruing probabilities to rank order conflicting scene interpretations. We extend a result of Tatman and Shachter to show that the sequence of processing decisions derived from evaluating the diagrams at each stage is the same as the sequence that would have been derived by evaluating the final influence diagram that contains all random variables created during the run of the vision system.

1 Introduction

Levitt and Binford [Levitt et al.-88], [Binford et al.-89], presented an approach to performing automated visual interpretation from imagery. The objective is to infer the content and structure of visual scenes of physical objects and their relationships. Inference for machine vision is an errorful process because the evidence provided in an image does not map in a one to one fashion into the space of possible object models. Evidence in support or denial of a given object is always partial and sometimes incorrect due to obscuration, occlusion, noise and/or compounding of errorful interpretation algorithms. On the other hand, there is typically an abundance of evidence [Lowe-86]. In our approach, three dimensional model-based machine vision techniques are integrated with hierarchical Bayesian inference to provide a framework for representing and matching instances of objects and relationships in imagery, and for accruing probabilities to rank order conflicting scene interpretations. In particular, the system design approach uses probabilistic inference as a fundamental, integrated methodology in a system for reasoning with geometry, material and sensor modeling.

Our objective is to be capable of interpreting observed objects using a very large visual memory of object models. Nevatia [Nevatia-74] demonstrated efficient hypothesis generation, selecting subclasses of similar objects from a structured visual memory by shape indexing using coarse, stick-figure, structural descriptions. Ettinger [Ettinger-88] has demonstrated the reduction in processing complexity available from hierarchical model-based search and matching. In hierarchical vision system representation, objects are recursively broken up into sub-parts. The geometric and functional relations between sub-parts in turn define the objects that they comprise. Taken together, the models form an interlocking network of orthogonal part-of and is-a hierarchies. Besides their shape, geometrical decomposition, material and surface markings, in our approach, object models hold knowledge about the image processing and aggregation operations that can be used to gather evidence supporting or denying their existence in imagery. Thus, relations or constraints between object sub-parts, such as the angle at which two geometric primitives meet in forming a joint in a plumbing fixture, are modeled explicitly as procedures that are attached to the node in the model representing the relation. Thus model nodes index into executable actions representing image evidence gathering operations, image feature aggregation procedures, and 3D volume from 2D surface inference. In Binford and Levitt's previous work, the model structuring was guided by the desire to achieve conditional independence between multiple children (i.e. sub-parts) of the same parent (super-part, or mechanical joint). This structuring allowed Pearl's parallel probability propagation algorithm [Pearl-86] to be applied. Similarly, the concept of value of information was applied to hierarchical object models to enable a partially parallelized algorithm for decision-theoretic system control. That is, the Bayes net was incrementally built by searching the model space to match evidence extracted from imagery. At each cycle, the model space dictated what evidence gathering or net-instantiating actions could be taken, and a decision theoretic model was used to choose the best set of actions to execute. However, the requirement to force conditional independence can lead to poor approximations to reality in object modeling [Agosta-89]. Further, the authors did not prove the coherence or optimality of the decision making process that guided system control. In this paper we make first steps toward formalizing the approach developed by Binford and Levitt. We set up the problem in an influence diagram framework in order to use their underlying theory in the formalization. Image processing evidence, feature grouping operations used to generate hypotheses about imagery interpretation, and the hypotheses themselves are represented in the influence diagram formalism. We want to capture the processes of searching a model database to choose system processing actions that generalize (i.e. generate higher level object hypotheses from lower level ones), search (i.e. predict and look elsewhere in an image for object parts based on what has already been observed) and refine (i.e. gather more evidence in support or denial of instantiated hypotheses). The behavior of machine vision system processing is represented as dynamic, incremental creation of influence diagrams.
Matches of image evidence and inferences against object models are used to direct the creation of new random variables representing hypotheses of additional details of imagery interpretation. Dynamic instantiation of hypotheses is formally realized as a sequence of influence diagrams, each of whose random

variables and influence relations is a superset of the previous. The optimal system control can be viewed as the optimal policy for decision making based on the diagram that is the "limit" of the sequence. We extend a result of Tatman and Shachter [Tatman-86], [Tatman and Shachter-89] to show that the sequence of processing decisions derived from evaluating the diagrams at each stage is the same as the sequence that would have been derived by evaluating the final influence diagram that contains all the random variables created during the run of the vision system. In the following, we first review our approach to inference and control, in sections 2 and 3. In section 4, we represent results of the basic image understanding strategies of aggregation, search and refinement in influence diagram formalisms. In section 5, we sketch a proof of the soundness of control of a vision system by incremental creation and evaluation of influence diagrams.

2 Model-Based Reasoning for Machine Vision

We take the point of view that machine vision is the process of predicting and accumulating evidence in support or denial of runtime generated hypotheses of instances of a priori models of physical objects and their photometric, geometric, and functional relationships. Therefore, in our approach, any machine vision system architecture must include a database of models of objects and relationships, methods for acquiring evidence of instances of models occurring in the world, and techniques for matching that evidence against the models to arrive at interpretations of the imaged world. Basic image evidence for objects and relationships includes structures extracted from images such as edges, vertices and regions. In non-ranging imagery, these are one or two dimensional structures. Physical objects, on the other hand, are three dimensional. The inference process from image evidence to 3D interpretation of an imaged scene tends to break up into a natural hierarchy of representation and processing, [Binford-80]. Processing in a machine vision system has two basic components: image processing to transform the image data to other representations that are believed to have significance for interpretation; and aggregation operations over the processed data to generate the relations that are the basis for interpretation. For example, we might run an edge operator on an image to transform the data into a form where imaged object boundaries are likely to have high values, while interior surfaces of objects are likely to have low values. We then threshold and run an edge linking operator on this edge image (another image processing operator) to produce a representation where connected sets of pixels are likely to be object boundaries. Now we search for pairs of edges that are roughly parallel and an appropriate distance apart to possibly be opposite sides of the projected image of an object we have modeled. This search "aggregates" the boundaries into pairs that may have significance for object recognition. Aggregation and segmentation operations are fundamental in data reduction. We show how the concept of aggregation in bottom-up reasoning can be the basis for generating hypotheses of object existence and type. Aggregation applies constraints from our understanding of geometry and image formation. The aggregation operators also correspond to the transformations between levels in the object recognition hierarchies. Sub-parts are grouped together at one level by relationships that correspond to a single node at the next higher level. Therefore, grouping operators dictate the "out-degree" of

a hypothesis at one hierarchy level with its children at the level below. Control of a machine vision system consists of selecting and executing image processing and grouping operations, searching the object model network to match groups to models, instantiating hypotheses of possible observed objects or object parts, accruing the evidence to infer image interpretations, and deciding when interpretation is sufficient or complete.
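As an illustration of the search for pairs of roughly parallel edges an appropriate distance apart described above, the sketch below groups candidate edge pairs; the edge representation (orientation plus midpoint) and the thresholds are assumptions made for this example, not values from the paper.

    import math

    def parallel_pairs(edges, max_angle_diff=0.1, min_sep=5.0, max_sep=50.0):
        """Group roughly parallel edges an appropriate distance apart, as
        candidate opposite sides of the projected image of a modeled object.
        Each edge is (theta, (x, y)): orientation in radians and midpoint."""
        pairs = []
        for i in range(len(edges)):
            for j in range(i + 1, len(edges)):
                (t1, (x1, y1)), (t2, (x2, y2)) = edges[i], edges[j]
                d = abs(t1 - t2) % math.pi          # angular difference
                d = min(d, math.pi - d)             # wrapped to [0, pi/2]
                sep = math.hypot(x2 - x1, y2 - y1)
                if d <= max_angle_diff and min_sep <= sep <= max_sep:
                    pairs.append((i, j))
        return pairs

    edges = [(0.00, (0, 0)), (0.02, (0, 20)), (1.50, (40, 0))]
    print(parallel_pairs(edges))   # [(0, 1)]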

3 Sequential Control for Machine Vision Inference

Presented with an image, the first task for a machine vision system is to run some basic image processing and grouping operators to obtain evidence that can be used to find local areas of the image where objects may be present. This initial figure-from-ground reasoning can be viewed as bottom-up model matching to models that are at the coarsest level of the is-a hierarchy, i.e. the "object/not-object" level. Having initialized the processing on this image, basic hypotheses, such as "ribbon/not-ribbon", can be instantiated by matching surface models. After initialization, a paradigm of sequential control for machine vision is as follows (a minimal code sketch of this loop is given below):

0. Check to see if we are done. If not, continue.
1. Create a list of all executable evidence gathering and grouping actions by concatenating the actions listed in each model node that corresponds to an instantiated hypothesis.
2. Select an action to execute.
3. Action execution results in either new hypotheses being instantiated or more evidence being attached to an existing hypothesis.
4. Propagate evidence to accomplish inference for image interpretation, and go to (0).

From our model-based point of view, an action associated with a model node that corresponds to an instantiated hypothesis has one of the following effects: refining, searching or aggregation. In the following, we explain these actions. In the next section, we show a method of representing the effects of these actions in an influence diagram formalism. Refining a hypothesis is either gathering more evidence in direct support of it by searching for sub-parts or relationships on the part-of hierarchy below the model corresponding to the hypothesis, or instantiating multiple competing hypotheses at a finer level of the is-a hierarchy that are refined interpretations of the hypothesized object. For example, given a hypothesized screwdriver handle, in refinement we might look for grooves in the hypothesized screwdriver handle. Searching from a hypothesis is both predicting the location of other object parts or relationships on the same hierarchy level and executing procedures to gather evidence in support or denial of their existence. In searching for the screwdriver handle, we might look for the blade of the screwdriver, predicting it to be affixed to one end or the other of the handle. Aggregation corresponds to moving up the part-of hierarchy to instantiate hypotheses that include the current hypothesis as a sub-part or sub-relationship. Having hypothesized the screwdriver handle and the screwdriver blade, we can aggregate sub-parts to hypothesize the existence of the whole screwdriver.
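A minimal sketch of the control loop above, assuming hypothetical interfaces for model nodes and actions (the names done, select_action, execute and propagate are placeholders, not the system's actual routines):

    def control_loop(instantiated, done, select_action, propagate):
        """Sequential control paradigm, steps 0-4.  `instantiated` is a set of
        hypothesis nodes; each node's model node lists executable actions."""
        while not done(instantiated):                          # step 0
            actions = [a for h in instantiated                 # step 1
                       for a in h.model_node.actions]
            action = select_action(actions)                    # step 2
            new_hyps, new_evidence = action.execute()          # step 3
            instantiated |= set(new_hyps)
            propagate(instantiated, new_evidence)              # step 4
        return instantiated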

In summary, as we spawn hypotheses dynamically at runtime, hypothesis instantiation is guided by a priori models of objects, the evidence of their components, and their relationships. System control alternates between examination of the instantiated hypotheses, comparing them against the models, and choosing what actions to take to grow the instantiated hypothesis space, which is equivalent to seeing more structure in the world. The possible actions are also stored in the model space, either explicitly as lists of functions that gather evidence (e.g. infer-specularity, find-edges, etc.) or as functions that aggregate object components or other evidence nodes. Thus, inference proceeds by choosing actions from the model space that create new hypotheses and relationships between them. It follows that all possible chains of inference that the system can perform are implicitly specified a priori in the model-base. This feature clearly distinguishes inference from control. Control chooses actions and allocates them over available processors, and returns results to the inference. Inference uses the existing hypothesis space and the current results of actions (i.e., collected evidence), generates hypotheses and relationships, propagates probabilities, and accumulates the selectable actions for examination by control. In this approach, it is impossible for the system to reason circularly, as all instantiated chains of inference must be supported by evidence in a manner consistent with the model-base.

4 Model Guided Influence Diagram Construction

The influence diagram formalism with which we build the model-base allows three kinds of nodes: probability nodes, value nodes and decision nodes. Probability nodes are the same as in belief nets [Pearl-86]. Value nodes and decision nodes represent the value and decision functions from which a sequential stochastic decision procedure may be constructed. The diagram consists of a network showing the relations among the nodes. Solution techniques exist to solve for the decision functions (the optimal policies) when given a complete diagram. Formulating the model-base as an influence diagram allows existing solution techniques [Shachter-86] to be exploited for evaluation of the interpretation process. The step of generating new hypotheses dynamically upward, from the evidence and hypothesis at the current stage, adds structure to the influence diagram. Expanding the network and then re-evaluating it introduces a new operation that is not equivalent to any evaluation step for influence diagrams. In an aggregation step, a hypothesis is created to represent a part composed of a set of sub-parts at the lower level. For example, in the domain of low level image constructs, such as lines and vertices, aggregation by higher level parts determines a segmentation of the areas of the image into projected surfaces. This concept of segmentation differs from "segmentation" used in image processing in several ways. First, a common process of aggregation is used throughout the part-of hierarchy; there is no unique segmentation operator. Second, the segmentation need not be complete; the aggregation operator may only distinguish the most salient features. The notion of segmentation as "partitioning a region into segments" no longer applies. Finally, because the refinement step allows the prediction by higher level hypotheses of lower level features that have not yet been hypothesized, the segmentation may be extended by interpretations from above. Hypothesis generation is implemented by aggregation operators. Combining all features at a level by all applicable aggregation operators is a combinatorially demanding step.

To avoid this complexity, the adjacency of features is exploited. Features that are aggregated belong to objects that are connected in space. This does not necessarily mean that the features appear next to each other in the image; rather they are near each other in object space. Exploiting this constraint limits the hypotheses generated to a small number of all possible sets of features. Aggregation operators are derived from the models of parts in terms of the measured parameters of their sub-parts. From a physical model of the part, a functional relation among parameters is derived that distinguishes the presence of the part. In general, the aggregation operator calculates a score, based on distance and "congruence" between a part's sub-parts. Aggregation hypotheses may be sorted so that "coarse" sub-parts are considered before "fine" ones, to further restrict the set of hypotheses generated. As described, this score is a deterministic function of the parameters of the features to be aggregated; see figure 1.


Figure 1: Deterministic Aggregation Process

The distribution of the aggregation function is conditioned by the hypothesis. It is described by a likelihood, p{s|h}, the probability of the score given the hypothesis; see figure 2. From the model of the appearance of the object, a stochastic model of the distribution of the aggregation score can be derived for the cases that the hypothesis does or does not exist. This likelihood distribution is the probabilistic aspect of the aggregation node that allows the hypothesis probability to be inferred from the sub-part parameters. This formulation is valuable because it shows how the recognition process may be formalized as distributions within a probability net. Consider a search for projected-surface boundaries to identify the surfaces that compose them. In this instance, suppose the projected-surface boundaries are adjacent parallel lines. To aggregate projected-surface boundaries, we derive a scoring function based on both the parallelism and proximity of line boundaries. In searching for projected-surface boundaries, the model generation may disregard most potential boundaries of lines by physical arguments without resort to calculating the aggregation function. Those boundaries for which the scoring rule succeeds spawn a parent node containing a surface hypothesis. This is how the aggregation operator participates in the aggregation process.

Figure 2: Hypothesis Generation from the Aggregation Process

A sub-part may be a member of the sets of several aggregation operators. Further rules are then applied to determine whether hypotheses so formed exclude each other, are independent or are necessarily co-incident. The range of exclusion through co-incidence may be captured in the derivation of the likelihood distributions of a sub-part as it is conditioned on more than one hypothesis. In general, the diagrams of figures 1 and 2 are solved by first substituting in the deterministic scoring functions and then applying Bayes' rule. To derive a general form for the aggregation operator influence diagram, imagine the aggregation operator as a parent to the part nodes. In Pearl's solution method, the parent receives a lambda message that is a function of the parameters in each of the sub-part nodes. This message contains the aggregation function. Because the aggregation operator expresses a relation among the parts, it may not be factorizable as it would be if the sub-part nodes were conditionally independent; hence the dependency expressed by the aggregation node among the part nodes. If we consider the aggregation node's clique to involve both the high level hypothesis and the sub-part nodes, then an additional set of arcs appears from the hypothesis to its sub-parts. This is clear when Bayes' rule is written out for the posterior distribution of the hypothesis:

    p{h | s l1 l2} = p{s|h} p{l1|s h} p{l2|s h l1} p{h} / p{s l1 l2}

The aggregation operator likelihood appears multiplied by a set of other factors. The additional terms, like p{l1|h}, we term "existence" likelihoods. They are the arcs to the sub-parts, li, from the hypothesis, h. Their interpretation is: given h is observed (or is not observable), does the sub-part appear? Most often these are certainty relations. If there is no obscuration, existence of h implies appearance of its composite features, and vice versa. Thus they may express observability relations where h exists but not all of its features are observed.
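As a small numeric illustration of how the score likelihood p{s|h} enters Bayes' rule, the sketch below computes a posterior for a single hypothesis from an aggregation score. The prior and the likelihood shapes are invented for the example, and the existence likelihoods discussed above are omitted for brevity.

    def posterior_h(score, p_h, lik_score_given_h, lik_score_given_not_h):
        """P(h | s) from the aggregation-score likelihoods and a prior on h."""
        num = lik_score_given_h(score) * p_h
        den = num + lik_score_given_not_h(score) * (1.0 - p_h)
        return num / den

    # Illustrative likelihoods: high scores are much more probable when the
    # hypothesised part is actually present.
    p = posterior_h(
        score=0.9,
        p_h=0.2,
        lik_score_given_h=lambda s: s,            # p{s|h}
        lik_score_given_not_h=lambda s: 1.0 - s,  # p{s|~h}
    )
    print(round(p, 3))   # 0.692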

To further clarify, think of each feature node's state space as the range of parameters that describe it, plus one point: that the node is not observed. The probability that the node appears is the integral of all the probability mass over the range of parameters. Thus each part can be envisioned as two probabilistic nodes: one a dichotomy, either the part is known to exist or it is not; the other a distribution over parameters that describe the location and shape, both dependent on the existence node. The aggregation function, pictured in figure 3A, expresses a relation between composite sub-part parameters and the existence of the parent. Figure 3B shows the aggregation node that encapsulates the deterministic aggregation function but has a probabilistic relationship to the hypothesis node. Figure 3C shows the aggregation abstraction, corresponding to figure 3B, that we will use for the remainder of the paper. The additional terms in Bayes' rule suggest direct relations between the existence node of the parent and the appearance of sub-part features. These additional terms may be thought of as the membership relations in the part-of network. The relation between the parameters of the sub-parts and the parent's parameters poses an additional inference problem, much along the lines of traditional statistical inference of estimating a set of model parameters from uncertain data. This method emphasizes the use of measured and inferred values to determine the existence of features; we are converting parameters, not existence probabilities, as we move up the network. The method concentrates on the classification aspect rather than the estimation and localization aspects. The hope is that once a set of stable, high-level hypotheses are generated, the more difficult part of recognition has been solved, and accurate estimation can follow using the data classification generated by what is effectively an "interpretation driven" segmentation process. Estimation can be thought of as a "value to value" process. It might well be necessary to carry this out concurrently if accurate values are required. Alternately, evidence may enter the network directly at higher levels. Neither possibility presents a problem to the algorithm.
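A sketch of this two-node view of a feature, under the assumption that the parameter distribution is represented as a small discrete table; the class and field names are illustrative only.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class FeatureNode:
        """A part envisioned as two probabilistic nodes: a dichotomy on
        whether the part is known to exist, and a distribution over the
        parameters describing location and shape, conditional on existence.
        'Not observed' is the extra point in the state space."""
        p_exists: float
        param_dist: Dict[Tuple[float, float], float] = field(default_factory=dict)

        def p_observed(self) -> float:
            # probability that the node appears: the sum (integral) of the
            # probability mass over the range of parameters
            return self.p_exists * sum(self.param_dist.values())

    node = FeatureNode(p_exists=0.8,
                       param_dist={(10.0, 4.0): 0.6, (12.0, 3.5): 0.3})
    print(round(node.p_observed(), 2))   # 0.72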

5 Dynamic Instantiation for Sequential Control

In this section, we present a way to formalize the control problem for inference up the machine vision hierarchy. We show how control over the hierarchy can be expressed as a dynamic program by an influence diagram formulation. At this level of generality, we can abstract out the structure at each level and coalesce all hypotheses at one level of the hierarchy into one node. These hypothesis nodes form a chain from the top level (the object) hypothesis to the lowest level. Each level has corresponding aggregation and, possibly, evidence nodes for the aggregation process at that level. This high level structure lets us show that, for purposes of control, the levels of the hierarchy can be considered as stages of a dynamic programming problem. Thus, each level has the structure shown in figure 3C. Each stage in the dynamic program is constructed from the aggregation operators at one level of the hierarchy. We add decision and sub-value nodes to the influence diagram to represent control in a dynamic program. In the following, we use e_i to represent the i-th set of observations (i.e., evidence from image processing operators), a_i to represent the i-th aggregation score, h_i to represent hypotheses about physical objects, d_i to represent processing decisions, and v_i to represent control costs. The v node represents the values assigned to the top-level hypotheses.
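The stage notation can be summarized in a small sketch; the structure below simply records e_i, a_i, h_i, d_i and v_i per stage and applies the additive-cost assumption stated below, with names chosen for illustration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Stage:
        """One hierarchy level viewed as a dynamic-programming stage:
        evidence e_i, aggregation score a_i, hypothesis node h_i,
        processing decision d_i and its calculation cost v_i."""
        evidence: object
        aggregation: object
        hypothesis: object
        decision: object
        cost: float          # v_i

    def total_value(top_level_value: float, stages: List[Stage]) -> float:
        """Top-level value v minus the additive stage costs v_i."""
        return top_level_value - sum(s.cost for s in stages)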

Figure 3: Aggregation: Influence Diagram Representation

The process starts at the bottom of the diagram with the first aggregation forming the first set of hypotheses from the original evidence. The evidence may guide the choice of aggregations, which we show by the decision d_0, with a knowledge arc from e_0. An example would be to choose an edge linking aggregation operator as a_1, where e_0 are edges found in an image and h_1 are hypothesized object boundaries. This first stage is shown in figure 4.

Figure 4: Aggregation Processing First Level Influence Diagram

The final decision, d_1, selects the object hypothesis with the highest value. It will float to the top stage as we add more stages. The top level value v depends on the object hypotheses. Intermediate hypotheses do not contribute to the value. Stage decisions only affect the costs of calculation, v_i, which are additive as the dynamic programming formulation requires. It may be interesting to consider what the computational gains are from a value function that is separable by object hypotheses; such a value function is not considered here. Next, the system makes a decision of which processing action to take at the superior stage. For example, if we add the decision at d_1 to match boundaries into parallel sets with aggregation operator a_2 and so generate projected-surface hypotheses, h_2, we have the diagram shown in figure 5. Here d_2 is, as described, the choice-of-object decision. We can continue to iterate the diagram building process to add another aggregation stage, as shown in figure 6. It is clear how the sequence of diagrams proceeds as we continue to generalize upward to complete the part-of hierarchy. If we look at the sequence of influence diagrams from initialization to object recognition, then we can regard the final diagram as if it had been built before evaluation took place. The distributions within the nodes will differ depending on the solution to the diagram. It follows that if we show that the evaluation method is sound in terms of legal influence diagram operations, then we have a formal framework with which to develop an optimal recognition scheme and, in particular, a value based method of control. These results are an application of work by Tatman and Shachter [Tatman-85], [Tatman and Shachter-90] on sub-value nodes and dynamic programming techniques represented in influence diagram form. Tatman shows that optimal policies for diagrams such


Figure 5: Aggregation Processing Second Level Influence Diagram

Figure 6: Aggregation Processing Third Level Influence Diagram

as those above can be obtained by influence diagram techniques that are equivalent to dynamic programming methods and, like these methods, increase linearly in complexity with the number of stages. In particular, Tatman's influence diagram realization of Bellman's Principle of Optimality [Bellman-57] states that in a diagram with stage decision variables.

A propositional literal is a propositional symbol or its negation. x = {x1, ¬x1, ..., xn} is a set of propositional literals. A clause is a finite disjunction of propositional literals, with no repeated literals. X = {X1, ..., Xl} is a set of input clauses.¹ A Horn clause is a clause with at most one unnegated literal. For example, a Horn clause Xi can be written as ¬x1 ∨ ¬x2 ∨ ¬x3 ∨ ... ∨ ¬xk ∨ x, k ≥ 0. A prime implicate of a set X of clauses is a clause π (often called π(X) to denote the set X of clauses for which this is a prime implicate) such that (1) X ⊨ π, and (2) for no proper subset π' of π does X ⊨ π'. We denote the set of prime implicates with respect to X by Π(X). ξj is the j-th support clause for x with respect to X (often called ξj(x,X)) iff (1) X ⊭ ξj, (2) x ∪ ξj does not contain a complementary pair of literals (i.e. both xj and ¬xj), and (3) X ⊨ x ∪ ξj. The set of support for a literal x is the disjunction of the support clauses for x, i.e. ξ(x,X) = ⋁_i ξi(x,X). We call the conjunction of the Xi's a Boolean expression F, i.e. F = ⋀_i Xi.
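The support-clause conditions can be checked directly, if inefficiently, by enumerating truth assignments. The sketch below is a brute-force illustration of the definition above (literals are (symbol, sign) pairs and clauses are sets of literals); it is exponential in the number of variables and is not how an ATMS would compute supports.

    from itertools import product

    def entails(X, clause):
        """X |= clause, checked by enumerating all assignments over the
        variables appearing in X and the clause (illustration only)."""
        vars_ = sorted({v for c in list(X) + [clause] for (v, _) in c})
        for values in product([False, True], repeat=len(vars_)):
            assign = dict(zip(vars_, values))
            def sat(c):
                return any(assign[v] == sign for (v, sign) in c)
            if all(sat(c) for c in X) and not sat(clause):
                return False
        return True

    def is_support_clause(xi, x_literal, X):
        """xi supports the literal x w.r.t. X iff (1) X does not entail xi,
        (2) xi together with x contains no complementary pair of literals,
        and (3) X entails xi union {x}."""
        combined = xi | {x_literal}
        no_complement = all((v, not s) not in combined for (v, s) in combined)
        return (not entails(X, xi)) and no_complement and entails(X, combined)

    # X = {A1 => x1}, i.e. the clause ~A1 v x1; then {~A1} supports x1.
    X = [frozenset({('A1', False), ('x1', True)})]
    print(is_support_clause(frozenset({('A1', False)}), ('x1', True), X))   # True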

4 DEMPSTER SHAFER THEORY FORMULATION IN LOGIC-BASED TERMS

Shafer [Shafer, 1976] implicitly defined a correspondence between set-theoretic notions relevant to subsets of Θ and logical notions. More precisely, as described on page 37 of [Shafer, 1976], if θ1 and θ2 are two subsets of Θ and x1 and x2 are the corresponding logical propositions, then we have the correspondence shown in Table 1.

Table 1: Correspondence of set theoretic and logic theoretic notions

SET THEORETIC    LOGIC THEORETIC
θ1 ∩ θ2          x1 ∧ x2
θ1 ∪ θ2          x1 ∨ x2
θ1 ⊆ θ2          x1 ⇒ x2
θ1 = θ̄2          x1 = ¬x2

In Table 1,

θ1 = θ̄2 means that θ1 is the set-theoretic complement of θ2. In this paper we summarize this correspondence,² comparing and contrasting the manipulation of DS Belief functions with certain logic-theoretic manipulations. Certain aspects of DS theory which do not occur in logic require extensions to traditional logic. These include

¹We often represent a clause not as a disjunction of literals (e.g. ¬x1 ∨ x2) but as an implication (x1 ⇒ x2). This is done to unambiguously identify which side of the implication the literals are on.
²Described fully in [Provan, 1990a].

• Two arbitrary propositions (e.g. θi and θj) in DS theory can be defined (external to the logic) as being contradictory.³ This is equivalent to two arbitrary logical clauses (e.g. Xi and Xj) being contradictory.

• DS theory can be used to pool multiple bodies of evidence. Since Dempster's rule is commutative, this pooling can be done dynamically, and in any order. Logical resolution is typically considered not to be a dynamic process, in that the set of clauses to be resolved typically does not change dynamically. In other words, logic traditionally assumes a fixed set of clauses.

We now show the changes necessary to define arbitrary contradictions and to update a database consisting of propositional logic clauses.⁴

4.1 Symbolic Belief Function Computation

We now show the correspondence of set theoretic notions and propositional clauses, of symbolic Belief functions and minimal support clauses, and of the Belief function update rule ⊕ and support clause updating. We start out by defining a set of DS Theory focal propositions Θ = {θ1, ..., θn} and corresponding propositional logic propositions (or literals) x = {x1, ..., xn}. To each focal proposition there is a function ρ : Θ → [0,1] (or corresponding ρ : x → [0,1]) which assigns mass to the proposition. We define a set of clauses X = {X1, ..., Xm} which denote the provability relations underlying 2^Θ. First, we define what evaluating the mass assigned to a support clause means:

Definition: The mass assigned to a support clause ξ(x,X) is given by

    ρ(ξ(x, X)) = ∏_i ρ(x_i).          (3)

For example, for a support clause ξ(x7, X) = x2 ∨ x4, we have ρ(ξ(x7, X)) = ρ(x2) · ρ(x4).

The support clause for a literal is equivalent to a symbolic representation of the Belief assigned to that literal:

Lemma 1 The belief assigned to a proposition θ (which has corresponding logical clause x_k), Bel(θ) = Σ_{θ_i ⊆ θ} ρ(θ_i), is equivalent to a symbolic representation of the Belief assigned to that literal, i.e. it can be computed from the minimal support clauses for x_k:

    Bel(x) = Σ_i ρ(ξ_i(x, X)).

Given a fixed database, i.e. a fixed set X of clauses, the Belief assigned to any literal or subset of literals can be symbolically computed from the set of support for the literal or subset of literals. DS Theory can also be used in the case of pooling several bodies of evidence, which is equivalent to changing the fixed set of clauses X. Belief function updating is necessary in pooling bodies of evidence.

³More precisely, they can be defined to be mutually exclusive.
⁴We will use the fact that since F is also computed by Π, one only needs to maintain Π, and can update Π and "ignore" F.

In a logical framework, a database can be incrementally updated by support clause updating. For example, if the database is updated by a clause x5 ∧ x7 ⇒ x_new, where x5, x7 ∈ x and x_new ∉ x, then the set of support for x_new can be incrementally computed from the sets of support for x5 and x7. Thus, if we have x5 ∧ x7 ⇒ x_new, and x5 and x7 have support sets {{x1, x2}, {x2, x3}} and {{x1}, {x4, x6}} respectively, then x_new is assigned the support set {{x1, x2}, {x2, x3, x4, x6}} by taking a set union of the support sets for x5 and x7. (See [de Kleer, 1986, Provan, 1988b] for a full description of such updating using an ATMS.)

We now show the correspondence between Belief function updating and support clause updating. In DS Theory, Belief function updating is done according to Dempster's Rule of Combination (equation 2), and is summarized as Bel(θ) = ⊕_i Bel(θ_i). Support clause updating must be done to compute the support clause for a newly-introduced literal x if there are support clauses ξ(x1, X), ..., ξ(xk, X) such that ⋀_i x_i ⇒ x. The correspondence between DS and logical updating is given by:

Lemma 2 Dempster's rule for Belief combination, i.e. Bel(θ) = ⊕_i Bel(θ_i), corresponds to computing the Belief measure assigned to x_j (the clause corresponding to θ). Bel(x_j) is the measure assigned to a disjoint form of ξ(x_j, X),

    Bel(x) = ⋀_i ξ(x_i, X),

where ξ(x_i, X) is the support set for x_i with respect to X such that ⋀_i x_i ⇒ x.
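A sketch of the incremental updating just described: when a clause x1 ∧ ... ∧ xk ⇒ x introduces a new literal, its support set can be obtained by unioning one environment from each antecedent's support set and keeping only the minimal (unsubsumed) results. The function name and the use of frozensets are assumptions of this illustration; the example reproduces the support set quoted above.

    from itertools import product

    def combine_supports(antecedent_supports):
        """Support (label) set for x when x1 ^ ... ^ xk => x: union one
        environment from each antecedent's support set, then keep only the
        minimal (unsubsumed) environments."""
        combos = {frozenset().union(*combo)
                  for combo in product(*antecedent_supports)}
        return {env for env in combos
                if not any(other < env for other in combos)}

    s_x5 = [{'x1', 'x2'}, {'x2', 'x3'}]
    s_x7 = [{'x1'}, {'x4', 'x6'}]
    print(combine_supports([s_x5, s_x7]))
    # {frozenset({'x1', 'x2'}), frozenset({'x2', 'x3', 'x4', 'x6'})}  (order may vary)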

A DNF Boolean formula is disjoint if each pair of conjunctive clauses is disjoint. A pair of conjunctive clauses are disjoint if, for each variable common to the clauses, say x_j, one clause contains the variable and the other contains the negated variable ¬x_j. This provides only symbolic Boolean expressions for the Belief functions, and these must be evaluated. In general, a Boolean expression is not necessarily disjoint (that is, it is not the case that each pair of disjuncts is disjoint), and a disjoint expression is necessary for the evaluation of the correct Belief assignment. The Boolean expression must be expanded if it is not disjoint, a process which corresponds to what is known in the literature as a Network Reliability computation.
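As a brute-force illustration of the evaluation problem: the probability of a (not necessarily disjoint) DNF over independent positive components can be computed by inclusion-exclusion, which is precisely the naive computation that the disjoint-form (sum-of-disjoint-products) methods of the network reliability literature avoid. The helper below is exponential in the number of disjuncts and is for illustration only.

    from itertools import combinations

    def dnf_probability(disjuncts, p):
        """P(D1 v ... v Dn) by inclusion-exclusion, where each disjunct is a
        set of independent positive components with probabilities p[name]."""
        total = 0.0
        for k in range(1, len(disjuncts) + 1):
            for subset in combinations(disjuncts, k):
                vars_ = set().union(*subset)
                term = 1.0
                for v in vars_:
                    term *= p[v]
                total += (-1) ** (k + 1) * term
        return total

    p = {'A2': 0.8, 'A5': 0.8, 'A6': 0.6}
    print(round(dnf_probability([{'A2', 'A6'}, {'A5', 'A6'}], p), 3))   # 0.576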

4.2 Network Reliability Computation

Network Reliability describes a set of techniques for analyzing computer and communication networks [Agrawal and Barlow, 1984]. The network reliability problem computes the probability that the network (or a portion of the network) is functioning. The input to a network reliability algorithm is (1) a Boolean expression F (which describes a network in which each literal represents a network component) and (2) a [0,1] assignment of weights to Boolean variables, which corresponds to the probability that the component x is functioning. Network reliability is a restriction of DS theory to Bayesian Belief functions. In logical terms, each clause consists of two literals, and hence a network reliability problem is an instance of 2SAT, the SATISFIABILITY problem with two literals per clause. If we frame this problem in graph theoretic terms, the weighted Boolean expression corresponds to a weighted graph. For a general DS theory problem, we have a weighted

hypergraph. Hence, the network reliability problem in graph theoretic terms corresponds to computing the probability that a set of vertices can communicate with one another (i.e. the probability that a path (or set of paths) exists between the specified vertices). The set of support for a proposition x corresponds to the set of paths in the graph to x (for F expressed in DNF), or the cutsets which disconnect x from the graph (for F expressed in CNF). Hence it is obvious that network reliability measures and Bayesian Belief functions compute exactly the same thing: both compute the probability that a (proof) path to a proposition exists in a graph. This correspondence between computing DS Belief functions and computing network reliability is useful because the latter problem has been carefully studied for many years, and results derived in the network reliability literature can be used for DS theory computations. Numeric assignments of Belief can be calculated as given by:

Lemma 3 The DS Belief assigned to a literal x_i can be computed using an ATMS by converting the weighted ATMS label set to its graphical representation and computing the probability assigned to all disjoint proof paths for x_i, which are defined by the label assigned to x_i.

Hence calculating DS belief functions for an underlying Boolean expression F is identical to computing the network reliability for the graph corresponding to F. Several methods have been developed for computing network reliability. These methods and their applicability to DS Belief function computation are described in [Provan, 1990a, Provan, 1988a].

5 ATMS-BASED IMPLEMENTATION OF DEMPSTER SHAFER THEORY

We call an ATMS-based implementation of Dempster Shafer theory an extended ATMS.⁵ In describing this implementation, we need to introduce some ATMS terminology. The ATMS is a database management system which, given a set X of propositional clauses, computes a set of support (called a label, C(x)) for each database literal x. C(x) consists of a set of sets of assumptions, which are a distinguished subset of the database literals. The assumptions, which we denote by A = {A1, ..., Al}, are the primitive data representation of the ATMS. The labels for literals thus summarize "proofs" in terms of a Boolean expression consisting of assumptions only. In logical terms, an ATMS label is a disjunction of conjunctions of assumptions, and is a restriction of the support set (defined earlier) to assumptions. The ATMS-based implementations assign mass only to assumptions. Additionally, they are restricted to Horn clauses, as the ATMS slows considerably with non-Horn clauses. The ATMS records contradictions in terms of a conjunction of assumptions called a nogood. By ensuring null intersections of all labels with the set of nogoods, the ATMS maintains a consistent assignment of labels to database literals. The ATMS can incrementally update the database labeling following the introduction of new clauses.

⁵Similar implementations have been done by Laskey and Lehner [Laskey and Lehner, 1988] and d'Ambrosio [D'Ambrosio, 1987].

It does this by storing the entire label and nogood set to avoid computing them every time they are needed. Belief can be assigned only to non-contradictory subsets. In probabilistic terms, this corresponds to conditioning on non-contradictory evidence. Conditioning in DS theory is expressed by Dempster's Rule of Conditioning:

Lemma 4 If Bel and Bel' are two combinable Belief functions,⁶ let Bel(·|θ2) denote Bel ⊕ Bel'. Then

    Bel(θ1|θ2) = (Bel(θ1 ∪ θ̄2) − Bel(θ̄2)) / (1 − Bel(θ̄2))          (4)

for all θ1 ⊆ Θ. There is an analog in the ATMS to Dempster's rule of Conditioning.

Lemma 5 If we call the set of nogoods Φ, then the ATMS's symbolic representation of equation 4 is, for all x ∈ x,

It is immediately obvious that the ATMS can be used to compute the symbolic representation of Belief functions as described earlier. We give a brief description of the algorithm, and refer the reader to the relevant papers (Provan [1988a], Laskey and Lehner [1988] and d'Ambrosio [1987]).

ATMS-based Belief Function Algorithm
1. Compute a Boolean expression from the label: C(x) = ⋁_i E_i, where each environment E_i = ⋀_k A_k.
2. Account for nogoods, using equation 5.
3. Convert the Boolean expression (5) into a disjoint form (a Network Reliability computation).
4. Substitute mass functions for the A_i's to calculate the mass function for x.

Example 1: Consider the following example with nogoods. The set of clauses, represented both as implications and in traditional clausal form, is

    A1 ⇒ x1                 ¬A1 ∨ x1
    A2 ⇒ x4                 ¬A2 ∨ x4
    x1 ∧ A3 ⇒ x2            ¬x1 ∨ ¬A3 ∨ x2
    x2 ∧ A4 ⇒ x3            ¬x2 ∨ ¬A4 ∨ x3
    x1 ∧ A5 ⇒ x4            ¬x1 ∨ ¬A5 ∨ x4
    x4 ∧ A6 ⇒ x5            ¬x4 ∨ ¬A6 ∨ x5
    x2 ∧ x4 ∧ A7 ⇒ x5       ¬x2 ∨ ¬x4 ∨ ¬A7 ∨ x5

⁶cf. [Shafer, 1976], p. 67 for a definition of conditions for combinability.

The masses assigned to the assumptions are:

ASSUMPTION   MASS
A1           1.0
A2           .8
A3           .5
A4           .7
A5           .8
A6           .6
A7           .9
A8           .4

The label sets the ATMS assigns to the literals are:

LITERAL   LABEL SET
x1        {A1}
x2        {A1, A3}
x3        {A1, A3, A4}
x4        {{A2}, {A1, A5}}
x5        {{A2, A6}, {A1, A5, A6}, {A1, A2, A3, A7}, {A1, A3, A5, A7}}

The computation of the Belief expressions for (and hence Belief assigned to) these labels is trivial except for the expression for x5, which we now show:

    Bel(x5) = ρ({{A2, A6}, {A1, A5, A6}, {A1, A2, A3, A7}, {A1, A3, A5, A7}})
            = ρ((A2 ∧ A6) ∨ (A1 ∧ A5 ∧ A6) ∨ (A1 ∧ A2 ∧ A3 ∧ A7) ∨ (A1 ∧ A3 ∧ A5 ∧ A7))
            = ρ(A2)ρ(A6) + ρ(A1)ρ(A5)ρ(A6) + ρ(A1)ρ(A2)ρ(A3)ρ(A7) + ρ(A1)ρ(A3)ρ(A5)ρ(A7)
              − ρ(A1)ρ(A2)ρ(A5)ρ(A6) − ρ(A1)ρ(A2)ρ(A3)ρ(A5)ρ(A7) − ρ(A1)ρ(A2)ρ(A3)ρ(A6)ρ(A7)
              − ρ(A1)ρ(A3)ρ(A5)ρ(A6)ρ(A7) + ρ(A1)ρ(A2)ρ(A3)ρ(A5)ρ(A6)ρ(A7)
            = 0.7488.

The Belief assigned to the literals is:

LITERAL   BELIEF
x1        1.0
x2        .5
x3        .35
x4        .96
x5        .75

As mentioned earlier, the ATMS can dynamically update the label sets assigned to literals following the introduction of new clauses. This means that the Belief assignments to literals can also be dynamically updated. In a logical framework, the label set for a set x of literals can be incrementally updated by support clause updating. For example, if the database is updated by a clause x5 ∧ x7 ⇒ x8 such that x5 and x7 have already been assigned label sets and x8 has not, then the label set for x8 can be computed from the label sets for x5 and x7 as follows.

If x5 and x7 have label sets {{x1, x2}, {x2, x3}} and {{x1}, {x4, x6}} respectively, then x8 is assigned the label set {{x1, x2}, {x2, x3, x4, x6}} by taking a combination of the label sets for x5 and x7. Support clause updating is equivalent to pooling evidence for the antecedents to determine the Belief assigned to the consequent.

Example 2: Consider the introduction of a new clause x2 ∧ A8 ⇒ x6, such that ρ(A8) = 0.4. Suppose we are given the information that x4 and x6 are contradictory, so that a nogood Φ is formed:

    ξ(Φ) = C(x4) ∧ C(x6) = {{A2}, {A1, A5}} ∧ {A1, A3, A8} = {{A1, A2, A3, A8}, {A1, A3, A5, A8}}

The new assignment of Belief to literals is:

LITERAL   BELIEF
Φ         .192
x2        .5
x3        .41
x4        .96
x5        .57
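A minimal end-to-end sketch of the ATMS-based Belief Function Algorithm above, evaluated here by exhaustive enumeration over assumption truth assignments rather than by an explicit disjoint-form conversion; nogood handling (step 2) is omitted, and the function names are illustrative. Run on the label set and masses of Example 1, it reproduces Bel(x5) ≈ 0.749.

    from itertools import product

    def label_belief(label, masses):
        """Bel(x) for an ATMS label (a set of environments, each a set of
        assumptions), evaluated by summing the probability of every
        assumption truth assignment in which some environment of the label
        holds.  Assumptions are independent with the given masses."""
        names = sorted(masses)
        bel = 0.0
        for values in product([False, True], repeat=len(names)):
            world = dict(zip(names, values))
            prob = 1.0
            for a in names:
                prob *= masses[a] if world[a] else 1.0 - masses[a]
            if any(all(world[a] for a in env) for env in label):
                bel += prob
        return bel

    masses = {'A1': 1.0, 'A2': 0.8, 'A3': 0.5, 'A5': 0.8, 'A6': 0.6, 'A7': 0.9}
    label_x5 = [{'A2', 'A6'}, {'A1', 'A5', 'A6'},
                {'A1', 'A2', 'A3', 'A7'}, {'A1', 'A3', 'A5', 'A7'}]
    print(round(label_belief(label_x5, masses), 4))   # 0.7488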

6 MODEL-BASED VISUAL RECOGNITION USING AN EXTENDED ATMS

Provan [Provan, 1987b, Provan, 1990b] describes a model-based visual recognition system called VICTORS⁷ based on an ATMS. VICTORS was designed to test the use of a logical representation for high level vision, and the use of an ATMS to propagate the set X of logical clauses and maintain consistency within X. VICTORS exhibits many novel features: it can simultaneously identify all occurrences of a given figure within a scene of randomly overlapping rectangles, subject to variable figure geometry, input data from multiple sources and incomplete figures. Moreover, it conducts sensitivity analyses of figures, updates figures given new input data without having to entirely recompute the new figures, and is robust given noise and occlusion. A sample image which VICTORS interprets is shown in Figure 1. Artificial input data was used, as real input data distracted from the primary objectives of studying the use of logic and of the ATMS in vision. The theory of depiction and image interpretation underlying VICTORS has been described in [Reiter and Mackworth, 1990]; VICTORS has been formalized logically in [Provan, 1990b]. However, the basic implementation of VICTORS suffers from a major deficiency, namely its inability to rank visual interpretations. This is due to the TMS assigning only binary "weights": each figure part hypothesis is either "believed" or "not believed". Since visual systems typically identify a single best interpretation, this is a major drawback. In addition, the inability to rank interpretations leads to system inefficiency, especially in images with some degree of ambiguity (cf. [Provan, 1987a]). This is because ambiguity leads to exploration of a large number of partial interpretations, several

⁷The acronym stands for Visual Constraint Recognition System.

Figure 1: Typical Set of Overlapping Rectangles input to VICTORS

of which are definitely implausible (although they pass the Boolean constraints), and should not be explored. The extension of the ATMS with Belief functions has enabled the weighting of interpretations, thus overcoming this deficiency. We briefly describe this basic implementation, and the assignment of weights in VICTORS, full descriptions of both of which are given in [Provan, 1990b].

6.1 Basic VICTORS Description

The problem VICTORS solves is as follows: given a set of n 2D randomly overlapping rectangles and a relational and geometric description of a figure, find the best figures, if any exist. We define the figure using a set of constraints over the overlap patterns of k < n rectangles. The type of figure identified, a puppet consisting of 7 or more parts, is shown in Figure 2. VICTORS can detect any type of object which can be decomposed into a set of parts each of which is representable as a rectangle; all it needs is a description of the object encoded as a set of constraints over a set of rectangles. The choice of a puppet as a figure for identification is not central to the operation of VICTORS or the issues it addresses. A puppet is one of many possible figures which fulfills the objectives of (1) being broken up naturally into multiple parts (ranging in the puppet from 7 up), and (2) having interpretations with the subparts taking on various configurations. Such an object model allows great variability in the degree of model complexity specified, and the ability to test the effect of that complexity on the size of search space generated. VICTORS consists of two main modules, a domain dependent Constraint Engine and a domain independent Reasoning Engine. The Constraint Engine uses a set of constraints for a given figure. A constraint is a set of filters, where each filter is a test of the geometric properties of a set of rectangles. Each constraint places restrictions on acceptable assignments of puppet parts to rectangles based on the overlap patterns of the rectangles. For example, one of the filters for a trunk is that there are at least 5 smaller rectangles overlapping it (which could be a neck and four limbs). We discuss some criteria defining a thigh in § 6.2.1. Based on the constraint set, the Reasoning Engine generates a set of TMS-clauses, where a TMS-clause is a logical clause which encodes a successful constraint. Each TMS-clause consists of assumptions and TMS-nodes, where a TMS-node is a rectangle/puppet

For example, a TMS-node could be C : trunk, and a TMS-clause A1 ∧ C : trunk ⇒ D : thigh, where A1 is an assumption. The TMS propagates the set of TMS-clauses to create a set of TMS-nodes. The TMS maintains consistency within this set of TMS-nodes subject to the TMS-clause set. A figure is identified from a consistent set of TMS-nodes which together define the figure.

We present an example to demonstrate the details of the operation of VICTORS in the simplest case of identifying an unambiguous puppet with all parts of the puppet present. As a rule, in figures displaying puppets, most extraneous rectangles are removed so that the points being stressed in the figures will be clearly evident. In general, scenes are much more cluttered. We refer the reader to [Provan, 1987b] and [Provan, 1990b] for descriptions of further system capabilities, such as identifying puppets with ambiguous interpretations, missing pieces, occluded pieces, puppets amid clutter, etc.

Example: Consider the simple task of finding 15-element puppets from a scene of randomly overlapping rectangles, as shown in Figure 2.

Figure 2: Process of Detecting Puppet. (a) initial rectangles; (b) puppet identified

First, the Constraint Engine assigns rectangles as seeds, where a seed is a rectangle/puppet-part hypothesis used to start the growth of puppet figures. In this case the seeds are A : head and C : trunk.

Second, starting from the seed rectangles, all subsequent assignments of puppet-part hypotheses to rectangles are made. Assignments are based on rectangle overlaps and the puppet topology. For example, an overlap with a rectangle identified as a head will produce only a head-neck TMS-clause, and not a head-thigh TMS-clause, because the head is attached only to the neck. Thus, in Figure 2, seed rectangle C propagates to all its overlaps, which can be several possible combinations of the limbs: neck, upper arms and thighs. For example, it propagates left-upper-arm to rectangles D and H, since either of these rectangles could eventually end up with that part assignment. D : left-upper-arm then propagates the part assignment left-forearm to rectangle J, which in turn propagates the part assignment left-hand to rectangle K.

Third, the TMS propagates the clauses in the TMS-clause set to produce a set of consistent TMS-nodes. Propagation proceeds as follows: From the TMS-clauses A : head (identified as a seed) and A : head ⇒ B : neck, the TMS infers B : neck. The TMS continues this propagation process, eliminating multiple and/or contradictory hypotheses (for example A : head and A : neck are contradictory hypotheses), until a globally consistent set of hypotheses is assigned. The Constraint Engine then takes the rectangle/puppet-part hypothesis set and interprets it as puppet figures. In this case, a full puppet is identified, as shown by shaded rectangles in Figure 2(b).

Note that each (partial) interpretation is associated with an assumption set. If assumptions are now shown in this example, for the partial interpretation derived from the clause set {A1 ⇒ A : head, A2 ⇒ C : trunk, A3 ∧ C : trunk ⇒ F : right-thigh, A4 ∧ A : head ⇒ B : neck}, the assumption set {A1, A2, A3, A4} is obtained for the partial puppet interpretation consisting of A : head, B : neck, C : trunk, F : right-thigh.
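To make the propagation step concrete, here is a minimal sketch of forward-chaining over Horn-style TMS-clauses from seed assumptions. It is illustrative only (not the ATMS implementation described above), and the clause set simply re-encodes the small example from the text.

```python
# Minimal sketch of propagating TMS-clauses (illustrative only; not the
# ATMS implementation described in the text). Each clause is written as
# (set of antecedents, consequent); assumptions appear as antecedents.
clauses = [
    ({"A1"}, "A:head"),                    # seed assumption A1 supports A:head
    ({"A2"}, "C:trunk"),                   # seed assumption A2 supports C:trunk
    ({"A3", "C:trunk"}, "F:right-thigh"),
    ({"A4", "A:head"}, "B:neck"),
]

def propagate(assumptions, clauses):
    """Forward-chain until no new TMS-nodes can be derived."""
    derived = set(assumptions)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in clauses:
            if antecedents <= derived and consequent not in derived:
                derived.add(consequent)
                changed = True
    return derived - set(assumptions)      # the derived node hypotheses

print(propagate({"A1", "A2", "A3", "A4"}, clauses))
# -> {'A:head', 'C:trunk', 'B:neck', 'F:right-thigh'}
```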

6.2 Uncertainty Representation in VICTORS

As mentioned earlier, the basic VICTORS implementation suffers from the inability to rank the interpretations, and outputs a set of interpretations with no way of choosing among them. Extending the ATMS with DS Belief functions enables this ranking to be done, as we now briefly explain. The ATMS is extended by assigning [0,1] weights to assumptions. In VICTORS, each assumption corresponds to the hypothesis of a rectangle representing a particular seed puppet part, such as A : head, or the hypothesis of a TMS-clause, such as A : head ⇒ B : neck. With the assumption explicitly represented we have A1 ⇒ A : head and A4 ∧ A : head ⇒ B : neck. In the process of generating an interpretation for an image, a sequence of assumptions is made, starting from seed assumptions and continuing to the extremities (hands, feet) of the puppet.

We now describe the assignment of weights to assumptions. Weight assignment does not require significantly more processing than is necessary with the traditional ATMS. This is because the rectangle data that already exists is used to define criteria for the "quality" of part acceptability. Thus, instead of testing a constraint that the overlap of rectangle C, identified as trunk, with rectangle D either qualifies D to be a thigh or not, a weight or probability with which the constraint could be true is calculated. Hence, we extend the basic VICTORS hypothesis (e.g. D satisfies a constraint to be a thigh given rectangle C is hypothesised as a trunk) to a weighted hypothesis. We have been studying the effectiveness of the simplest weight assignments, using more complicated assignments only when necessary. In the following section we present a weight assignment method which approximates more theoretically correct methods and which has been successful for simple input data.

6.2.1 Approximation Methods for Weight Assignments

Figure-part hypotheses (e.g. rectangle D being a thigh) are based on rectangle overlaps (e.g. the overlap of D with a rectangle C already assumed to be a trunk). Some of the filters which define the constraints governing hypotheses include: (1) angle of overlap; (2) relative area; (3) relative overlap area; and (4) axial ratio. Each filter is satisfied with a [0,1] degree of acceptability; 0 is unacceptable and 1 is perfectly acceptable. In general, there is a probability distribution φ over the filter's feasible range. The simplest approximation to φ is to define a subset of each filter's range with which the filter is satisfied with high probability, and the remaining subset with low probability.

For example, for the thigh, we have the following ranges.

angle of overlap: As shown in Figure 3(a), the total angular range within which an acceptable overlap occurs is [π, π/4]. We define a sub-range, namely [5π/4, 0], as an overlap acceptable with high probability, and the remaining sub-ranges, [0, π/4] and [π, 5π/4], as an overlap acceptable with low probability. These regions are shown in Figure 3(b). The angle of overlap α is computed to determine acceptability or unacceptability in basic VICTORS. In this extended system, all that is necessary in addition is to place this angle α in the high or low probability category.

Figure 3: Regions with probabilistic weights of acceptability. (a) Total angular range of extreme positions of the thigh with respect to the trunk; (b) ranges of low- and high-probability acceptability.

relative area: For acceptability of the trunk-thigh overlap, the ratio of the area of the thigh to the area of the trunk must fall within the bounds [0.6, 0.15]. The bounds [0.4, 0.25] define an overlap acceptable with high probability, and the bounds [0.6, 0.4], [0.25, 0.15] define an overlap acceptable with low probability.

relative overlap area: Similar to the relative area filter, there is a low and high probability ratio of overlap areas. For the thigh and trunk rectangles, this is given by Table 2.

Similar high and low probability assignments exist for the axial ratio and all other filters. Next, the weights for all the separate criteria must be merged to give an overall weight. Because the criteria correspond to different frames of reference, refinement (cf. [Shafer, 1976]) is necessary to map these disparate frames onto a common frame, so that the weights from each criterion can be combined. A rough approximation to this refinement process can be obtained as follows. A probability p1 is assigned to the high-probability value, and p2 is assigned to the low-probability value. The p_i's for a given assumption are multiplied together and normalized to ensure that the highest acceptability weight assigned is 1. In general, for n different criteria to be combined, the normalized measure is given by

    Q = ( ∏_{i=1}^{n} p_i ) / ( max(p_i) )^n

Table 2: Probability assignments to ratio of overlap areas

    THIGH FILTER TYPE                   RATIO
    Simple acceptability                [0.5, 0]
    High probability acceptability      [0.3, 0.1]
    Low probability acceptability       [0.5, 0.3], [0.1, 0]

    TRUNK FILTER TYPE                   RATIO
    Simple acceptability                [0.25, 0]
    High probability acceptability      [0.15, 0.05]
    Low probability acceptability       [0.25, 0.15], [0.05, 0]

Table 3: Weight assignments to thigh assumption

    PROBABILITY TYPES     WEIGHT
    4 high                1
    3 high, 1 low         0.625
    2 high, 2 low         0.4
    4 low                 0.15

For example, if four filters are used to define the constraint for thigh acceptability, and p1 = 0.8 and p2 = 0.5, the normalization constant is p1^4 = 0.8^4 = 0.4096. If we have 3 high-probability values and 1 low-probability value, the weight assigned is given by (0.8^3 × 0.5)/0.4096, which works out to 0.625. Some weights obtained based on different combinations of high- and low-probability criteria are given in Table 3.

The methods of assigning weights, and the values of the weights themselves, are somewhat arbitrary. What is needed is a theory of assigning weights, and of learning appropriate assignments. Lowe [Lowe, 1985], for example, discusses some criteria necessary for such a theory, and Binford et al. [Binford et al., 1989] propose a theory based on quasi-invariants. However, much more work needs to be done.
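As a concrete check of the normalization just described, the following small sketch reproduces the weights in Table 3. It reads max(p_i) in the formula as the highest value a criterion can be assigned (p1 = 0.8 here), which is how the worked normalization constant 0.8^4 = 0.4096 is obtained; the code is illustrative only.

```python
# Sketch of the normalized weight Q = prod(p_i) / p_max**n, where p_max is
# the largest probability a criterion can receive (0.8 in the text's example).
def combined_weight(ps, p_max):
    prod = 1.0
    for p in ps:
        prod *= p
    return prod / (p_max ** len(ps))

p_high, p_low = 0.8, 0.5
for n_high in (4, 3, 2, 0):
    ps = [p_high] * n_high + [p_low] * (4 - n_high)
    print(n_high, "high /", 4 - n_high, "low:", round(combined_weight(ps, p_high), 3))
# -> 1.0, 0.625, 0.391, 0.153, matching Table 3 (0.4 and 0.15 after rounding)
```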

6.3 Belief Function Combination

For this domain, an efficient implementation of Dempster's rule of combination is possible because the data is hierarchical (the puppet figure is defined hierarchically). Thus, using an algorithm based on the algorithm of Shafer and Logan [1987], Belief combination can be done in time O(k), where k is the number of elements of the puppet figure. In the general case of a non-hierarchical object model, no polynomial-time algorithm is possible without the use of approximation techniques [Provan, 1990a].

The frame of discernment consists of the set of all possible assignments of figure parts to rectangles. Thus, for a puppet consisting of only a head and trunk, and an input of rectangles A, B, C, Θ = {A : head, A : trunk, B : head, B : trunk, C : head, C : trunk}. For a k-part figure and n input rectangles, |Θ| = kn.

The filters in the constraint set ensure that the number of clauses generated, and also the number of weights assigned, is minimized. In addition, the maintenance of consistent labels ensures that contradictory figure hypotheses are never created, meaning that Dempster's rule can be applied to assign a Belief measure to each consistent figure hypothesis.
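For readers unfamiliar with the combination step, here is a small, generic sketch of Dempster's rule over an explicit frame of discernment. It is illustrative only: it does not use the hierarchical Shafer-Logan speed-up mentioned above, and the tiny frame (possible parts for a single rectangle) and mass values are invented.

```python
# Generic sketch of Dempster's rule of combination (illustrative only).
# A mass function maps focal elements (frozenset subsets of the frame) to mass.
def dempster_combine(m1, m2):
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: masses cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# Invented example: evidence about which puppet part rectangle A depicts.
frame = frozenset({"head", "trunk", "neck"})
m1 = {frozenset({"head"}): 0.6, frame: 0.4}           # one filter's evidence
m2 = {frozenset({"head", "neck"}): 0.7, frame: 0.3}   # another filter's evidence
for focal, mass in dempster_combine(m1, m2).items():
    print(set(focal), round(mass, 2))
# -> {'head'}: 0.60, {'head', 'neck'}: 0.28, whole frame: 0.12
```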

6.4 Results

Given the assignment of weights to assumptions, the DS Belief functions of interpretations are computed as described in previous sections. The use of DS Belief functions has enabled a ranking of interpretations, meaning that the best interpretation can be found. We show how this comes about with an example. Figure 4(a) shows an input image. Figure 4(b) shows some interpretations which can be discovered using VICTORS with a traditional ATMS. Figure 4(c) shows the best interpretation found by VICTORS with an extended ATMS.

Figure 4: Interpretations found by VICTORS with traditional and extended ATMS. (a) Input set of rectangles; (b) 3 of 26 possible interpretations in normal VICTORS; (c) best interpretation computed.

Additionally, we are studying different methods of using this ranking to prune the search space by exploring only the best partial interpretations. This has the potential of enhancing the efficiency of VICTORS.

Results to date indicate that even simple weight assignments prove useful in generating an ordering of partial interpretations equivalent to the theoretically accurate ordering. However, for more complicated input data these simple techniques are too inaccurate. Indeed, we anticipate that real, sensor-derived data will require sophisticated weight manipulation. Even so, there are domains in which simple weight assignments can provide the partial ordering necessary for directing search and improving the efficiency of the ATMS. Where appropriate, these computationally efficient approximations can replace the more computationally intensive DS representations.

6.5 Related Work

VICTORS is related to the system of Hutchinson et al. [Hutchinson et al., 1989] in that both systems use DS theory for model-based object recognition. Major differences include the use of 3D range data by [Hutchinson et al., 1989] in contrast to the synthetic data of VICTORS, and the use of DS theory to enforce relational constraints in [Hutchinson et al., 1989] as opposed to the use of logic in VICTORS. VICTORS is also related to the system of Binford et al. ([Binford et al., 1989], [Levitt et al., 1988]) in its use of an uncertainty calculus for model-based object recognition, except that [Levitt et al., 1988] uses a probability-based influence diagram representation.

7 DISCUSSION

The relation between DS Theory and propositional logic has been described. We have shown how the support clause C(Xi, X) gives a notion of a symbolic explanation for Xi. In the same way, a symbolic representation for a DS Belief function provides a notion of a symbolic explanation. Moreover, the numeric value of the Belief can be viewed as a numeric summary (or as the believability) of that explanation. In addition, just as a logical model describes which propositions are true in a given world, the DS Belief assigned to propositions describes the degree to which that set of propositions is true. Thus, to the extent to which logic and DS Theory overlap, DS Theory can acquire a logical semantics. Note that DS Theory has a different notion of contradiction from logic, in that two arbitrary propositions can be defined (external to the logic) as being contradictory.

We have described an application of an ATMS extended with DS Belief functions to visual interpretation. For domains in which the best interpretation is required and truth maintenance is important, such an approach appears promising.

ACKNOWLEDGEMENTS

I would like to thank Judea Pearl for helpful discussions, and Alex Kean for much help with the Logic-based DS Theory implementation.

References

[Agrawal and Barlow, 1984] A. Agrawal and R.E. Barlow. A Survey of Network Reliability and Domination Theory. Operations Research, 32(3):478-492, 1984.

[Binford et al., 1989] T.O. Binford, T. Levitt, and W. Mann. Bayesian Inference in Model-Based Machine Vision. In L. Kanal, T.S. Levitt, and J.F. Lemmer, editors, Uncertainty in Artificial Intelligence, Vol. 3. Elsevier Science Publishers B.V., Amsterdam, North-Holland, 1989.

[D'Ambrosio, 1987] B.D. D'Ambrosio. Combining Symbolic and Numeric Approaches to Uncertainty Management. In AAAI Uncertainty in Artificial Intelligence Workshop, pages 386-393. Morgan Kaufmann, 1987.

[de Kleer, 1986] J. de Kleer. An Assumption-based TMS. Artificial Intelligence Journal, 28:127-162, 1986.

[Dempster, 1968] A.P. Dempster. Upper and Lower Probability Bounds. J. Royal Statistical Society, 1968.

[Hutchinson et al., 1989] S. Hutchinson, R. Cromwell, and A. Kak. Applying Uncertainty Reasoning to Model Based Object Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 541-548, 1989.

[Laskey and Lehner, 1988] K. Blackmond Laskey and P.E. Lehner. Belief Maintenance: An Integrated Approach to Uncertainty Management. In Proceedings of the American Association for Artificial Intelligence, pages 210-214, 1988.

[Levitt et al., 1988] T. Levitt, T.O. Binford, G. Ettinger, and P. Gelband. Utility-Based Control for Computer Vision. In AAAI Workshop on Uncertainty in Artificial Intelligence, pages 245-256, 1988.

[Lowe, 1985] D. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic, 1985.

[Provan, 1987a] G. Provan. Complexity Analysis of Multiple-Context TMSs in Scene Representation. In Proceedings of the American Association for Artificial Intelligence, pages 173-177, 1987.

[Provan, 1987b] G. Provan. The Visual Constraint Recognition System (VICTORS): Exploring the Role of Reasoning in High Level Vision. In Proc. IEEE Workshop on Computer Vision, pages 170-175, 1987.

[Provan, 1988a] G. Provan. Solving Diagnostic Problems Using Extended Truth Maintenance Systems: Theory. Technical Report 88-10, University of British Columbia, Department of Computer Science, 1988.

[Provan, 1988b] G. Provan. The Computational Complexity of Truth Maintenance Systems. Technical Report 88-11, University of British Columbia, Department of Computer Science, 1988.

[Provan, 1990a] G. Provan. A Logic-based Analysis of Dempster-Shafer Theory. International Journal of Approximate Reasoning, to appear, 1990.

[Provan, 1990b] G. Provan. Model-based Object Recognition using an Extended ATMS. Technical Report, to appear, University of British Columbia, Department of Computer Science, 1990.

[Reiter and Mackworth, 1990] R. Reiter and A.K. Mackworth. A Logical Framework for Depiction and Image Interpretation. Artificial Intelligence Journal, 41:125-155, 1990.

[Shafer and Logan, 1987] G. Shafer and R. Logan. Implementing Dempster's Rule for Hierarchical Evidence. Artificial Intelligence Journal, 33:271-298, 1987.

[Shafer, 1976] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.


Efficient Parallel Estimation for Markov Random Fields

Michael J. Swain, Lambert E. Wixson and Paul B. Chou*
Computer Science Department, University of Rochester, Rochester, NY 14627

*Current address: IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

Abstract

We present a new, deterministic, distributed MAP estimation algorithm for Markov Random Fields called Local Highest Confidence First (Local HCF). The algorithm has been applied to segmentation problems in computer vision and its performance compared with stochastic algorithms. The experiments show that Local HCF finds better estimates than stochastic algorithms with much less computation.

1 Introduction

The problem of assigning labels from a fixed set to each member of a set of sites appears at all levels of computer vision. Recently, an optimization algorithm known as Highest Confidence First (HCF) (Chou, 1988) has been applied to labeling tasks in low-level vision. Examples of such tasks include edge detection, in which each inter-pixel site must be labeled as either edge or non-edge, and the integration of intensity and sparse depth data for the labeling of depth discontinuities and the generation of dense depth estimates. In these tasks, it often outperforms conventional optimization techniques such as simulated annealing (Geman and Geman, 1984), Monte Carlo sampling (Marroquin et al., 1985), and Iterative Conditional Modes (ICM) estimation (Besag, 1986). The HCF algorithm is serial, deterministic, and guaranteed to terminate.

We have developed a parallel version of HCF, called Local HCF, suitable for a SIMD architecture in which each processor must only communicate with a small number of its neighbors. Such an architecture would be capable of labeling an image in real-time.

Experiments have shown that Local HCF almost always performs better than HCF and much better than the techniques just mentioned.

In the next section, the labeling problem is discussed. Sections 3 and 4 review Markov Random Fields and Chou's HCF algorithm. Section 5 describes the Local HCF algorithm, and test results are presented in Section 6. Finally, we discuss future plans for this research.

2 Generating Most Probable Labelings

In probabilistic labeling, a priori knowledge of the frequency of various labelings and combinations of labelings can be combined with observations to find the a posteriori probabilities that each site should have a certain label. For complexity reasons, the interactions among the variables are usually modeled as Markov Random Fields, in which a variable interacts with a restricted number of other variables called neighbors. If a link is drawn between all neighboring variables the resulting graph is called the neighborhood graph. The problem is to find the labeling which has the highest probability given the input data. This is called the maximum a posteriori (MAP) labeling.¹ For a Markov Random Field, the MAP labeling can be found by locating the minimum of the Gibbs energy, which is a function of both a priori knowledge (expressed as energies associated with cliques in the neighborhood graph) and the input data.

A major problem with the probabilistic labeling approach is the exponential complexity of finding the exact MAP estimate. The methods mentioned in Section 1 have traditionally been used to find labelings whose energies are close to the global minimum of the energy function. Simulated annealing has been widely used for this purpose because of the elegant convergence proofs associated with the algorithm and its massively parallel nature (Geman and Geman, 1984). But in practice simulated annealing is slow and highly dependent on the cooling schedule and initial configuration. Marroquin's Monte Carlo MPM estimator has much better performance in practice, although its performance is also dependent on the initial configuration. A bad initial configuration slows or prevents the algorithm from reaching a good configuration (see Figure 7). Continuation methods have also been used for specific classes of computer vision reconstruction problems formulated as Markov Random Fields (Koch et al., 1986; Blake and Zisserman, 1987), but these cannot be applied to arbitrary MRF estimation problems. Although Koch's approach is efficient in analogue VLSI, it is slow to simulate on a grid of standard processors. The relaxation labeling technique of (Hummel and Zucker, 1983) could also be applied to the MRF estimation problem, but it does not guarantee generating a feasible solution, that is, one in the space of possible solutions.

¹(Marroquin et al., 1985) points out that the Maximizer of Posterior Marginals is more useful when the data is very noisy. This labeling minimizes the expected number of mislabeled sites. Marroquin uses a Monte Carlo procedure to compute this MPM labeling. Unless the Monte Carlo procedure is given a very good initial estimate, HCF produces better segmentations.

Faced with the problems posed by traditional optimization methods, Chou developed an algorithm called Highest Confidence First (HCF). Unlike the stochastic energy minimization procedures, the HCF algorithm is deterministic and guaranteed to terminate at a local minimum of the energy function. One drawback of HCF is that it is a serial algorithm. This discourages its real-time application to problems with large numbers of sites, and would also cast doubt on any hypothesized connection between HCF and biological plausibility. This paper presents a parallel adaptation of HCF, called Local HCF.

3 Markov Random Fields

A Markov Random Field is a collection of random variables S which has the following locality property:

    P(X_s = ω_s | X_r = ω_r, r ∈ S, r ≠ s) = P(X_s = ω_s | X_r = ω_r, r ∈ N_s, r ≠ s)

where N_s is known as the neighborhood of the random variable X_s. The MRF is associated with an undirected graph called the neighborhood graph in which the vertices represent the random variables. Vertices are adjacent in the neighborhood graph if the variables are neighbors in the MRF. Denote an assignment of labels to the random variables by ω. The Hammersley-Clifford theorem (Besag, 1974) shows that the joint distribution P(ω) can be expressed as a normalized product of positive values associated with the cliques of the neighborhood graph. This can be written:

    P(ω) = e^{-U(ω)} / Z,    where    U(ω) = Σ_{c ∈ C} V_c(ω)

and Z is a normalizing constant. The value U is referred to as the energy of the field; minimizing U is equivalent to maximizing P(ω). In this notation, the positivity of the clique values is enforced by the exponential term and the clique parameters V_c may take on either positive or negative values. Normally, the unary clique values are broken into separate components representing prior expectations P(ω_s) and likelihood values obtained from the observations P(O_s | ω_s). This is done using Bayes rule, which states

    P(ω_s | O_s) = P(O_s | ω_s) P(ω_s) / P(O_s).

If the V_c's are used to signify the prior expectations, then U is revised to read:

    U(ω) = Σ_{c ∈ C} V_c(ω) − Σ_{s ∈ S} log P(O_s | ω_s)        (1)

The denominator in Equation 1 is absorbed into the normalizing constant Z.
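As a concrete reading of equation (1), the following short sketch evaluates U(ω) for a toy chain-structured field. The pairwise potential and the per-site likelihood table are invented purely for illustration; the sketch is not tied to the experiments reported later.

```python
import math

# Toy evaluation of equation (1): U(w) = sum_c V_c(w) - sum_s log P(O_s | w_s),
# for a 4-site chain with binary labels. All numbers are invented.
def V_pair(a, b):
    # prior clique potential: favor equal neighboring labels
    return -1.0 if a == b else 1.0

likelihood = [            # P(O_s | label) for each site, label in {0, 1}
    {0: 0.8, 1: 0.2},
    {0: 0.6, 1: 0.4},
    {0: 0.3, 1: 0.7},
    {0: 0.1, 1: 0.9},
]

def energy(labels):
    prior = sum(V_pair(labels[i], labels[i + 1]) for i in range(len(labels) - 1))
    data = sum(math.log(likelihood[s][labels[s]]) for s in range(len(labels)))
    return prior - data

for w in [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]:
    print(w, round(energy(w), 3))
# Lower U corresponds to a more probable labeling under P(w) ∝ exp(-U(w)).
```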


    begin
        ω := (l_0, ..., l_0);
        top := Create_Heap(ω);
        while stability_top < 0 do
            s := top;
            Change_State(ω_s);
            Update_Stability(stability_s);
            Adjust_Heap(s);
            for r ∈ N_s do
                Update_Stability(stability_r);
                Adjust_Heap(r)
            end
        end
    end

Figure 1: The algorithm HCF

4 HCF

In the HCF algorithm all sites initially are specially labeled as "uncommitted", instead of starting with some specific labeling as with previous optimization methods. Cliques for which any member is uncommitted do not participate in the computation of the energy of the field. For each site, a stability measure is computed. The more negative the stability, the more confidence we have in changing its labeling. On each iteration, the site with minimum stability is selected and its label is changed to the one which creates the lowest energy. This in turn causes the stabilities of the site's neighbors to change. The process is repeated until all changes in the labeling would result in an increase in the energy, at which point the energy is at a local minimum of the energy function and the algorithm terminates. The algorithm is given in Figure 1.

The stability of a site is defined in terms of a quantity known as the augmented a posteriori local energy E, which is:

    E_s(l) = Σ_{c : s ∈ c} V'_c(ω') − log P(O_s | ω'_s)

where ω' is the configuration that agrees with ω everywhere except that ω'_s = l. Also, V'_c is 0 if ω_r = l_0, the uncommitted state, for any r in c; otherwise it is equal to V_c. The stability G of an uncommitted site s is the negative difference between the two lowest energy states that can be reached by changing its label:

    G_s(ω) = −min_{k ∈ L, k ≠ ω_min} (E_s(k) − E_s(ω_min))

In this expression ω_min = {k | E_s(k) is a minimum}. The stability of a committed site is the difference between it and the lowest energy state different from the current state ω_s:

    G_s(ω) = min_{k ∈ L, k ≠ ω_s} (E_s(k) − E_s(ω_s)).

5 Local HCF

The Local HCF algorithm is a simple extension of HCF: on each iteration, change the state of each site whose stability is negative and less than the stabilities of its neighbors. In a preprocessing phase, the sites are each given a distinct rank, and, if two stabilities are equal in value, the site with lower rank is considered to have lower stability. These state changes are done in parallel, as is the recalculation of the stabilities for each site. The algorithm terminates when no states are changed. Pseudocode for Local HCF is given in Figure 2, for which you should assume a processor is assigned to every element of the site data structure. The algorithm is written in a notation similar to C* (Rose and Steele, 1987), a programming language developed for the Connection Machine. In the algorithm, the operator &all returns the result of a global and operation.

For the low-level vision tasks which we have studied, the MRFs have uniform spatial connectivity and uniform clique potential functions. Thus, Local HCF applied to these tasks is well suited for a massively parallel SIMD approach which assigns a simple processor to each site. Each processor need only be able to examine the states and stabilities of its neighbors. The testing and updating of the labels of each site can then be executed in parallel. Such a neighborhood interconnection scheme is simple, cheap, and efficient.

Like HCF, Local HCF is deterministic and guaranteed to terminate. It will terminate because the energy of the system decreases on each iteration. We know this because (roughly) (a) at least one site changes state per iteration (there is always a site whose stability is a minimum), and (b) the energy change per iteration is equal to the sum of the stabilities of the sites which are changed. These stabilities are negative and the state changes will not interact with each other because none of the changed sites are neighbors. Therefore, the energy of the system always decreases. A rigorous proof of convergence is given in Appendix A.

Determinism and guaranteed termination are valuable features. Analysis of results is much easier; for each set of parameters, only one run is needed to evaluate the performance, as opposed to a sampling of runs, as with simulated annealing.
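To make the update rule concrete, here is a small, self-contained sketch of a Local HCF-style run on a toy one-dimensional field with binary labels and an explicit uncommitted state. The potentials, likelihoods, and sizes are invented, and this is only an illustration of the definitions above, not the authors' C*/Connection Machine implementation.

```python
import math

# Toy Local HCF-style labeling of a 1-D chain; labels are {0, 1}, and None
# marks an uncommitted site. All numbers are invented for illustration.
LABELS = (0, 1)
likelihood = [                       # P(O_s | label), made up
    {0: 0.9, 1: 0.1}, {0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6},
    {0: 0.2, 1: 0.8}, {0: 0.1, 1: 0.9},
]
N = len(likelihood)

def V_pair(a, b):
    """Pairwise clique potential; zero if either member is uncommitted (V'_c)."""
    if a is None or b is None:
        return 0.0
    return -1.0 if a == b else 1.0

def local_energy(labels, s, l):
    """Augmented a posteriori local energy E_s(l)."""
    e = -math.log(likelihood[s][l])
    if s > 0:
        e += V_pair(labels[s - 1], l)
    if s < N - 1:
        e += V_pair(l, labels[s + 1])
    return e

def stability(labels, s):
    if labels[s] is None:            # uncommitted: minus the gap between the
        es = sorted(local_energy(labels, s, l) for l in LABELS)   # two lowest
        return -(es[1] - es[0])
    cur = local_energy(labels, s, labels[s])
    return min(local_energy(labels, s, l) - cur for l in LABELS if l != labels[s])

def local_hcf(labels):
    rank = list(range(N))            # distinct ranks break stability ties
    while True:
        stab = [(stability(labels, s), rank[s]) for s in range(N)]
        to_change = [s for s in range(N)
                     if stab[s][0] < 0 and all(stab[s] < stab[n]
                                               for n in (s - 1, s + 1)
                                               if 0 <= n < N)]
        if not to_change:
            return labels
        new_labels = list(labels)    # "parallel" state changes from a snapshot
        for s in to_change:
            new_labels[s] = min(LABELS, key=lambda l: local_energy(labels, s, l))
        labels = new_labels

print(local_hcf([None] * N))         # e.g. [0, 0, 1, 1, 1]
```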

6 Test Results

We have chosen to use edge detection as our test domain. In this task, each site is either vertical or horizontal and appears between two pixels. The problem is to label each site as either an edge or a non-edge, based on the intensities of the pixels.

    site: parallel array[1..N_SITES] of record
        stability;
        i;            /* rank */
        change;
    end

    begin
        with site do in parallel do
        begin
            change := false;
            Update_Stability(stability);
            (nbhd_stability, k) := min_{n ∈ N[i]} (site[n].stability, site[n].i);
            if stability < 0 and (stability, i) < (nbhd_stability, k) then
            begin
                Change_State(state);
                change := true;
            end
            any_change := (&all change);
        until any_change = false;
        end
    end

Figure 2: The algorithm Local HCF

We have added an implementation of Local HCF to the simulator originally constructed by Chou, which allows us to compare the final labelings produced by Local HCF, HCF, and a variety of standard labeling techniques. The simulator runs on a Sun workstation. The input is produced by Sher's probabilistic edge detector (Sher, 1987) and consists of the log likelihood ratio for an edge at each site. The algorithms were tested on likelihood ratios from the checkerboard image, the "P" block image, and the "URCS" block image which appear in Figure 5. As a much harder test, the algorithms were also presented with noisy (corrupted) likelihood ratios obtained by using an incomplete edge model to find edges in the "URCS" image.

The clique energies were chosen in an ad hoc manner. They were chosen to encourage the growth of continuous line segments and to discourage abrupt breaks in line segments, close parallel lines (competitions) and sharp turns in line segments. "Encouragement" or "discouragement" is associated with a clique by assigning it a negative or positive energy,

respectively. To encode these relationships, a second-order neighborhood, in which each site is adjacent to eight others, is used. This neighborhood system is shown in Figure 3 and the clique values used are shown in Figure 4.

Figure 3: Neighborhoods for vertical and horizontal edge sites. Circles represent pixels, the thick line represents the site, and thin lines represent the neighbors of the site.


Figure 4: Clique energies. The parenthesized values were used for the corrupted edge data (see Figure 6).

The goodness of the result of applying one of the labeling algorithms can be determined qualitatively by simply looking at a picture of the segmentation, and quantitatively by examining the energy of the final configuration. Figure 6 shows the labelings produced by Local HCF on the four test cases, and Figures 8-10 compare the algorithms over time. Figure 7 shows energies of the final configurations yielded by thresholding the likelihood ratio of edge to non-edge (TLR), simulated annealing MAP estimation, Monte Carlo MPM estimation, ICM estimation (scan-line order), ICM (random order), HCF, and our results from Local HCF. The values of MAP, MPM, and the ICM's are the averages of the results from several runs. In almost every case, Local HCF found the labeling with the least energy. Each Local HCF run took 20-30 iterations (parallel state changes); we expect that the Connection Machine will carry out these labelings almost instantaneously.

We believe that Local HCF performs better than HCF because it is much less likely to propagate the results of local labelings globally across the image. The execution of HCF is often marked by one site s committing to a certain label, immediately followed by one of its neighbors s + 1 committing to a label which is compatible with the new label of s. This process is then repeated for a neighbor of s + 1, and its neighbor, and so on. In this manner, the effects of locally high confidence can get propagated too far. Local HCF does not tend to propagate information as far. Appendix B develops this argument in more detail.


Figure 5: Test images (8 bits/pixel). Checkerboard and P images are 50 x 50. URCS image is 100 x 124.

Figure 6: Edge labelings produced by Local HCF for above images. Rightmost labeling demonstrates Local HCF on noisy edge data from the URCS image.

    Method      Checkerboard image   P image   URCS image   Corrupted URCS image
    TLR         -3952                -572      4785         59719
    Annealing   -4282                -680      -349         -5303
    MPM         -4392                -723      -503         -5296
    ICM(s)      -4364                -693      -503         -4954
    ICM(r)      -4334                -715      -513         -3728
    HCF         -4392                -750      -380         -9635
    Local HCF   -4392                -720      -625         -9648

Figure 7: Energy Values. (The smaller the energy the closer the labeling is to the MAP estimate.)


Figure 8: Fraction of sites committed versus iterations of Local HCF. Most sites commit early in the computation.

7 Conclusions and Future Work

We have introduced a parallel labeling algorithm for Markov Random Fields which produces better labelings than traditional techniques at a much lower computational cost. Empirically, ten iterations on a locally connected parallel computer is sufficient to almost completely label an entire image; forty iterations finishes the task. In the future we intend to study its applicability in situations in which there are much larger numbers of labels, such as occurs in recognition problems (Cooper and Swain, 1988). We are also studying an extension of Pearl's method for determining clique parameters on chordal graphs (Pearl, 1988) to more general graphs.


Figure 9: Timecourse of parallel algorithms on the URCS image. * = Local HCF, + = Monte Carlo MPM, o = Simulated Annealing.


Figure 10: Timecourse of parallel algorithms on the corrupted URCS image. * = Local HCF, + = Monte Carlo MPM.

A Proof of Convergence for Local HCF

We prove that the algorithm terminates, and returns a feasible solution which is at a local minimum of the energy function. Define the ordered stability of a site to be a pair (a, b), where a is the stability and b is the rank of the site. Then (a, b) < (c, d) iff (1) a < c, or (2) a = c and b < d.

Lemma 1: For at least one site k, the ordered stability, denoted s_k, is a minimum in its neighborhood. That is: s_k < min_{n ∈ N[k]} s_n.

MAX{P(UC|IA & AC), P(UC|IA & AI)},

then it is easy to show that performance is maximal when P(IA) = 0, i.e., when the aid's advice is always accepted. In words, in order for an aid to be useful as an aid, it must be the case that either (a) the user has some ability to discriminate correct vs. incorrect advice, or (b) the incidental problem solving benefits provided by the aid are so strong that even when the aid's advice is ignored, the user's performance exceeds that of the aid's. The usefulness of an aid is therefore a well-defined function of three types of parameters: accuracy of advice [P(AC)], discrimination [P(AA|AC), P(AA|AI)], and incidental problem solving benefits. As noted earlier, it is generally unreasonable to attribute incidental benefits to the advisory component. We suggest therefore that the advisory component of a decision aid provides added value when the decision maker (by whatever means) is discriminating about when to accept the aid's advice.

To illustrate the benefits of discrimination, consider again the initial example where P(AC) ≈ .7, P(AA|AC) = P(AA|AI) = .5, and P(UC|IA & AC) = P(UC|IA & AI) = .4. Under this circumstance we derived P(UC|UD) = .55. Modify this example by allowing the user some discrimination in accepting the aid's advice, viz., P(AA|AC) = .7 and P(AA|AI) = .3. We now deduce P(UC|UD) = .66. A moderate ability to discriminate good from bad advice resulted in a moderate improvement in performance, more than compensating for the cost of evaluating and then ignoring the aid's advice. (Note that some of the improvement is attributable to the fact that the marginal P(AA) has also increased. This is reasonable -- if the aid is usually correct, then a discriminating user would usually accept its advice.) A more dramatic example is one where the user is very discriminating about when he or she accepts the aid's advice, but the aid's hit rate is lower than the user's. For example, P(AA|AC) = .9, P(AA|AI) = .1, and P(AC) = .55. We now get P(UC|UD) = .68.

Another variant on this analysis is the case where the user is a good predictor of his or her own capabilities, but does not have a good assessment of the aid's. This suggests a scenario where the user's first problem is to decide whether he or she could solve the problem unaided. If yes, then ignore the aid. If not, then accept the aid's advice. This leads to the following variant of Eq. 2:

    P(UC|UD) = P(IA|CU)*P(CU) + P(AC|AA & CU)*P(AA|CU)*P(CU) + P(AC|AA & IU)*P(AA|IU)*P(IU),

[Eq. 3]

where

    UD = Use Decision Aid
    CU = User Would Be Correct Unaided
    IU = User Would Be Incorrect Unaided
    AC = Aid Advice Correct
    AA = User Accepts Advice
    IA = User Ignores Advice.

Assume that the user can predict his or her own success 70% of the time, and that the probability that the aid is correct is independent of whether or not the user would be correct. We can then plug in the following values:

    P(UC|UD) = .7*.6 + .7*.3*.6 + .7*.7*.4 = .74.
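The following small sketch reproduces these worked numbers. Eq. 2 itself appears earlier in the paper; the decomposition used here is inferred from the plug-in computations in Section 3 (accepted-and-correct plus ignored-but-user-correct terms), and the function and parameter names are ours, not the paper's.

```python
# Sketch reproducing the worked P(UC|UD) numbers. The Eq. 2 decomposition used
# here is inferred from the plug-in computations in the text:
#   P(UC|UD) = P(AA|AC)P(AC)
#            + P(UC|IA&AC)P(IA|AC)P(AC) + P(UC|IA&AI)P(IA|AI)P(AI).
def p_correct_eq2(p_ac, p_aa_ac, p_aa_ai, p_uc_ia_ac, p_uc_ia_ai):
    p_ai = 1 - p_ac
    accepted_correct = p_aa_ac * p_ac
    ignored_but_correct = (p_uc_ia_ac * (1 - p_aa_ac) * p_ac +
                           p_uc_ia_ai * (1 - p_aa_ai) * p_ai)
    return accepted_correct + ignored_but_correct

def p_correct_eq3(p_cu, p_ia_cu, p_ac_aa_cu, p_aa_cu, p_ac_aa_iu, p_aa_iu):
    p_iu = 1 - p_cu
    return (p_ia_cu * p_cu +
            p_ac_aa_cu * p_aa_cu * p_cu +
            p_ac_aa_iu * p_aa_iu * p_iu)

print(round(p_correct_eq2(.7, .5, .5, .4, .4), 2))     # no discrimination -> 0.55
print(round(p_correct_eq2(.7, .7, .3, .4, .4), 2))     # some discrimination -> 0.66
print(round(p_correct_eq3(.6, .7, .7, .3, .7, .7), 2)) # Eq. 3 variant -> 0.74
```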

Finally, a user that is discriminating on both dimensions would likely perform even better.

As these quantitative examples suggest, significant benefits can be obtained by supporting a user's ability to be discriminating in his or her use of an aid. This suggests that a user must know enough about an aid's fallibilities (or her own) to be able to identify circumstances when the aid's advice (or her own judgment) should simply be ignored. Unfortunately, and not surprisingly, few decision aids are designed with the intent of helping a user easily identify contexts in which the aid is likely to be incorrect. Similarly, developers intent on transferring an aid to an operational environment are not likely to advertise the weaknesses of their product. Consequently, there is little reason to presume that the advisory component of many decision aids contributes much in the way of decision support (vs. automation).

Fortunately, however, there seem to be a variety of relatively simple ways to achieve this discrimination. The first is to promote an accurate mental model of the decision aid in the decision maker. In Lehner and Zirk (1987), for instance, it was found that even a rudimentary understanding of how an expert system works could lead to dramatic improvement in performance. Alternatively, one could embed within the decision aid itself some "metarules" for identifying contexts in which the aid's advice should probably be ignored. For instance, an aid based on a quantitative uncertainty calculus might be able to flag situations where there is significant higher order uncertainty (a significant amount of missing data, sensitivity analysis indicates the recommendation is not robust, etc.). Finally, one could promote in decision makers a better understanding of the human decision making process, perhaps making them aware of many of the common biases (e.g., hindsight bias) that lead decision makers to be overconfident in their assessment accuracy.

Before closing this section, it is worth noting that in our examples we assumed that after choosing to ignore the aid's advice the probability of generating the correct answer unaided is the same whether or not the rejected advice was correct. Presumably, however, the decision maker and the aid's algorithm are based on a common source of expertise. One might therefore expect a positive correlation between the two problem solvers. If, however, the two problem-solving approaches are positively correlated (e.g., when an expert system is designed to mimic human problem solving),

then the aided probability of success can easily decrease. To illustrate this, recall the example at the beginning of Section 3.0. In this example, we had P(AC) = .7, P(AA|AC) = .7, P(IA|AI) = .3, and P(UC|IA & AI) = P(UC|IA & AC) = .4. This resulted in P(UC|UD) = .68. Now suppose we added the assumption that the aid's algorithm is uniformly better than the unaided user. That is, for all problems where the unaided user would generate the correct answer, the algorithm would also get it right. This assumption modifies the above settings to P(UC|IA & AC) = .6/.7, and P(UC|IA & AI) = 0. Plugging these new numbers into Eq. 2 gives us

    P(UC|UD) = .7*.7 + (.6/.7)*.3*.7 + 0*.7*.3 = .67.

The effect of the nonindependency was small, but negative. More generally, P(correct|aided) decreases whenever the user is more likely to generate a correct answer in the same circumstances as the aid. For the classes of problems we explored, the effect of dependencies of this type was usually small. The impact of such dependencies was therefore ignored in our analysis. However, this result does suggest that the popular notion that decision aids should be designed to mimic human expert problem solving may be misguided.

4.0 DISCUSSION

The analysis in this paper can be summarized as follows. The usefulness of a decision aid depends heavily on the ability of the user to identify contexts in which the aid (or user unaided) is likely to be incorrect. Without this ability, attending to (i.e., considering, but not always accepting) the advice of a decision aid is counterproductive -- the decision maker would be better off either routinely accepting or routinely ignoring the aid's advice. The same result holds for any partial analyses that the aid might generate.

We do not claim that the probabilistic analysis presented above is realistic in the sense that it models all the subtleties of a user/decision aid interaction. We do, however, claim that it provides a reasonable characterization of the impact of the variables we are examining. In particular, since Eq. 2 and Eq. 3 are just tautological probability equations, it is logically impossible for a more complex model to correctly derive directional impacts different from those presented here.

Also of interest is the relationship of this analysis to empirical research examining the effectiveness of decision aids (see, for example, Adelman, in press; Sharda et al., 1988 for reviews). Empirical investigations have had mixed results -- some aids improve performance, some have little effect, and occasionally one finds an aid which decreases performance. Unfortunately, most empirical efforts to evaluate a decision aid evaluate the aid as a whole. They do not attempt to empirically discriminate the contribution of various components of an aid. In this paper we have examined the impact of various components analytically. Our analysis does not suggest that decision aids per se are ineffective, but only that it is inappropriate to attribute effectiveness to advisory support provided by the aid. While a particular decision aiding system may

be useful, that usefulness should be attributed to the fact that the aid also serves as an information source, and may also generate advice that is routinely accepted. It should not be assumed that the advisory component, which is the core of most decision aids, provides any useful decision support. Empirical research in this area is sorely needed.

REFERENCES

Adelman, L. Evaluating Decision Support Systems, in press.

Adelman, L., Donnell, M.L., Phelps, R.H., and Patterson, J.F. An iterative Bayesian decision aid: Toward improving the user-aid and user-organization interfaces. IEEE Transactions on Systems, Man, and Cybernetics, 1982, SMC-12(6), 733-742.

Adelman, L., Rook, F.W., and Lehner, P.E. User and R&D specialist evaluation of decision support systems: Development of a questionnaire and empirical results. IEEE Transactions on Systems, Man, and Cybernetics, 1985, SMC-15(3), 334-342.

Dawes, R. The robust beauty of improper linear models in decision making. American Psychologist, 1979, 34(7), 571-582.

Hayes-Roth, F., Waterman, D. and Lenat, D. Building Expert Systems, Reading, MA: Addison-Wesley, 1983.

Hogarth, R.M. Beyond discrete biases: Functional and dysfunctional aspects of judgment heuristics. Psychological Bulletin, 1981, 90(2).

Lehner, P.E., and Zirk, D.A. Cognitive factors in user/expert system interaction. Human Factors, 1987, 29(1), 97-109.

Sharda, R., Barr, S. and McDonnell, J. Decision Support System Effectiveness: A review and an empirical test. Management Science, 1988, 34(2), 139-159.

Sprague, R. and Carlson, E. Building Effective Decision Support Systems, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1982.


INFERENCE POLICIES

Paul E. Lehner
George Mason University
4400 University Drive
Fairfax, VA. 22030
[email protected]

Abstract - It is suggested that an AI inference system should reflect an inference policy that is tailored to the domain of problems to which it is applied -- and furthermore that an inference policy need not conform to any general theory of rational inference or induction. We note, for instance, that Bayesian reasoning about the probabilistic characteristics of an inference domain may result in the specification of a nonBayesian procedure for reasoning within the inference domain. In this paper, the idea of an inference policy is explored in some detail. To support this exploration, the characteristics of some standard and nonstandard inference policies are examined.

1.0 SATISFYING REQUIREMENTS

Consider the following, admittedly artificial, scenario. An inference system must be designed to support a human decision maker. The inference system has only two sources of evidence: degree of belief inputs from expert1 and expert2. For the domain in question, the judgments of both sources are believed to be reliable. That is, approximately X proportion of inferences believed to degree X are correct (Horwich, 1982). Furthermore, expert1's judgments are generally more extreme than expert2's. However, it is uncertain as to the extent to which the two agents' judgments are redundant or independent. Since the system must support a human decision maker, it is considered desirable that the inference system also be reliable. Reliability makes it easier for a user to determine circumstances when the aid's advice should be accepted, which often increases the accuracy of the user/machine combination (Lehner, et al., in press).

1. Preparation of this manuscript was supported in part by the Center for Excellence in Command, Control, Communications, and Intelligence at George Mason University. The Center's general research program is sponsored by the Virginia Center for Innovative Technology, MITRE Corporation, the Defense Communications Agency, CECOM, PRC/ATI, ASD(C3I), TRW, AFCEA, and AFCEA NOVA.
2. I would like to thank Dave Schum and two anonymous reviewers for their helpful comments on an earlier draft.

What sort of inference policy will satisfy these requirements? One policy which meets these requirements is simply to ignore expert2 and routinely accept the judgments of expert1. From the perspective of most quantitative theories of inference, this is a bad idea -- it routinely ignores obviously useful information. However, since the relationship between the judgments of the two experts is not known, it is not clear how one might merge the two sources of information and still maintain reliability. Consequently, while for individual problems within the inference domain the policy seems suboptimal, it may be appropriate for the inference domain as a whole.

More generally, inference policies should be designed to satisfy a set of requirements determined by examining the anticipated characteristics of an inference domain. Often, a standard inference policy (Bayesian, Shaferian, etc.) will satisfy the requirements. Other times, a nonstandard inference policy is needed. Below we examine some standard and nonstandard policies to illustrate this approach.

2.0 STANDARD INFERENCE POLICIES

We consider below some standard policies, and the types of domain requirements they satisfy. A standard inference policy is defined here as an inference procedure based on any theory of inference seriously considered in the inductive reasoning literature.

2.1 Bayesian Models

Proponents of the so-called Bayesian approach are generally characterized by their insistence that the only rational systems of belief values are point-valued probability models. There are several different lines of argument for this strong assertion. Two of the more popular ones are the Dutch book and scoring rule arguments. According to the Dutch book argument, an agent's belief values do not conform to the probability calculus iff there exists a Dutch book (no win, may lose) gamble that the agent would willingly play. Assuming that ideally rational agents do not accept such gambles, we must conclude that the belief values of such agents conform to the probability calculus. Another line is the scoring rule argument. If an agent wishes to minimize her error rate, and the scoring rule for measuring error is additive, then the expected error rate is minimal only if the agent's belief values are derived from the probability calculus (Lindley, 1982).

While these arguments may support the Bayesian view of rational induction, they do not support the notion that point-valued Bayesian models are necessarily a good inference policy. To show this, we need only point out (as illustrated in Section 1.0) that Bayesian reasoning about the characteristics of an inference domain may lead one to conclude that the best inference policy within a domain is nonBayesian. On the other hand, for inference domains that require point-valued estimates, and where minimizing error rate seems the appropriate goal, it is hard to imagine how a nonBayesian system could be appropriate.
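As a tiny numerical illustration of the scoring-rule argument, the sketch below checks that under a quadratic (Brier-style) error score the expected error for a single event with true probability p is minimized by reporting the belief b = p. The numbers are our own illustration, not taken from the cited sources.

```python
# Illustration of the scoring-rule argument: expected quadratic error
# E[(b - outcome)^2] = p(1-b)^2 + (1-p)b^2 is minimized at b = p.
def expected_error(p, b):
    return p * (1 - b) ** 2 + (1 - p) * b ** 2

p = 0.7
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda b: expected_error(p, b))
print(best, round(expected_error(p, best), 3))   # -> 0.7  0.21
```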

2.2 Interval Probability Models

The standard litany on expert system (ES) technology claims that ESs encode human expert knowledge. Consequently, a properly engineered ES should make the same inferences that a human expert would. In the knowledge engineering literature, it is considered desirable to base a knowledge base on multiple experts. Consequently, an ES should encode the common expertise of experts and generate belief values that conform with this common expertise. If experts disagree, then a point-valued system cannot possibly reflect this common expertise. On the other hand, it is arguable that interval probability systems do maintain common expertise. If each expert has a goal of minimizing error, then each expert's belief judgments should conform to the probability calculus. If the knowledge base is composed of interval probability statements that are consistent with each expert's judgments, then probability statements derivable from that knowledge base should also be consistent with the judgments of all the experts.

2.3 Nonmonotonic Reasoning Logics

Recently, AI researchers have developed a number of formal logics within which it is possible to make defeasible inferences -- categorical inferences that can be later retracted without introducing an inconsistency (Reiter, 1987). The original justification for this approach was that people often "jump to conclusions" in the context of deductively incomplete data. Probabilists have noted some fundamental problems with these defeasible logics, which can lead them to jump to highly improbable conclusions. Most of them, for instance, are subject to some form of the lottery paradox. As a theory of inference, defeasible logics leave much to be desired. Despite such problems, however, there are some domains where a defeasible logic may be an appropriate inference policy. Consider, for instance, a domain that satisfies the following criteria.

Intentionally Benign - Inferential cues are intentionally designed to support correct inferences, particularly when negative consequences may result from false inferences.

Reliable Feedback - If the agent acts in accordance with a false inference that may lead to a negative outcome, then the agent will receive feedback that the inference was false.

Opportunity to Backtrack - The agent will have an opportunity to backtrack decisions prior to the occurrence of significant negative consequences.

In a consistently benign environment, categorical inferences based on a defeasible logic seem an appropriate inference policy, even though the logic itself may be inappropriate as a theory of inference.

3.0 NONSTANDARD INFERENCE POLICIES

We define a nonstandard inference policy to be an inference procedure that does not correspond to any seriously considered theory of inference found in the literature. Obviously, the classification of an inference policy as nonstandard may change with the promotion of new theories. In this section, we examine some possible nonstandard policies.

3.1 Ratios of Possibilities

Logical probability theory notwithstanding, perhaps one of the most maligned concepts in inference theory is the idea that one can calculate a reasonable belief value for a proposition by deducing the ratio of possible states in which the proposition is true. To give a typical counterexample, if we accept the axiom A-->B, then Bel(B) = .67, since {A,B}, {~A,B} and {~A,~B} are the three possible states. Suppose, however, that A is "Rover is a brown dog." and B is "Rover is a dog." In that case, the axiom A-->B certainly does not add any evidence that should impact one's degree of belief that "Rover is a brown dog." Yet according to the Possibility Ratio approach it has a major impact.

Clearly there is no necessary connection between a ratio of possible states and the perceived probability of a proposition. Consequently, it is hard to imagine how a theory of inference can be based solely on possible world ratios. However, there may be domains where the simplistic ratio approach is an appropriate inference policy. This is because the procedure for enumerating possible states is rarely arbitrary. To see how this works, consider Laplace's rule of induction. This rule states that in a series of observations of some event A or ~A, after observing N occurrences of A and no instances of ~A, the inductive probability that A will occur on the next trial is (1+N)/(2+N). This rule of induction is a special case of Carnap's c* function, which in turn is one instance of a family of coherent induction functions (Carnap, 1952).

Now consider a truth table containing the sixteen possible states of four propositions: A, B, C and D. The proposition of interest is A. The other propositions are considered as candidates for a deterministic causal model for predicting A. Initially no causal connections are posited. Consequently, the possibility ratio of A, PR(A), is 1/2. After one observation of A we posit the causal rule B-->A. Now A will be contained in exactly 8 of the 12 remaining worlds, so PR(A) = 2/3. After event A occurs again, we add C-->B, giving us PR(A) = 3/4. Finally, after A occurs a third time, we add D-->C and get PR(A) = 4/5. Continuing this process we see that PR(A) = (N+1)/(N+2). Our learning mechanism replicated this rule of induction. More generally, any coherent rule of induction can be emulated with a causal learning mechanism in this way (Lehner, in press).

We now turn this around. If a causal learning scheme responds to new instances by seeking deterministic rules for predicting that instance, then one would expect a positive correlation between the relative frequency of an event and the proportion of possible states containing that event. The more often X occurs, the greater the number of factors perceived as causally leading to X, resulting in a greater proportion of logically possible states containing X.
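The possible-worlds construction just described is easy to reproduce. The sketch below enumerates the sixteen worlds over A, B, C, D, adds the deterministic rules one at a time, and prints PR(A); it is a plain illustration of the text's example rather than part of any cited system.

```python
from itertools import product

# Possibility ratios over possible worlds, reproducing the text's example:
# PR(A) goes 1/2, 2/3, 3/4, 4/5 as the rules B-->A, C-->B, D-->C are added.
props = ["A", "B", "C", "D"]
worlds = [dict(zip(props, vals)) for vals in product([True, False], repeat=len(props))]

def pr(prop, worlds):
    return sum(w[prop] for w in worlds) / len(worlds)

def add_rule(worlds, antecedent, consequent):
    """Keep only the worlds satisfying antecedent --> consequent."""
    return [w for w in worlds if (not w[antecedent]) or w[consequent]]

print(round(pr("A", worlds), 3))            # 0.5 with no rules
for ant, cons in [("B", "A"), ("C", "B"), ("D", "C")]:
    worlds = add_rule(worlds, ant, cons)
    print(round(pr("A", worlds), 3))        # 0.667, 0.75, 0.8 = (1+N)/(2+N)
```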

For some domains, therefore, ratios of possible states may provide a perfectly reasonable inference policy. Even though the causal learning mechanism may not explicitly take into account probabilistic considerations (e.g., as in most concept learning and explanation-based learning systems), there may be good reason to believe one can extract reasonable belief values from such systems.

3.2 Possibility and Probability

In Section 1.0, we discussed an inference domain where reliable judgments are required. Here we expand a little on this idea. Consider the following problem. An inference system must be developed that must service the information requirements of multiple decision systems. Each decision system will query the inference module as needed regarding the status (truth value or degree of support) of certain propositions. The specific propositions queried will vary in each context. Since the propositions to be queried cannot be predicted, it is decided that the inference system will maintain an up-to-date description of the current situation. That is, for some set of atomic propositions and their logically distinct combinations, the system should be able to report a belief value on request. Finally, it is considered important that the inference system be reliable. That is, for each set S_x (all sentences believed to degree x), the expected proportion of truths in S_x is x. The reason for this is simply that from one problem to the next, the elements of S_x that are queried are unpredictable (more or less "random"). Consequently, if the system is reliable then the expected proportion of truths of propositions reported with degree of belief x is x.

What type of inference policy would guarantee satisfying these requirements? As it turns out (Lehner, in press), provable reliability is achievable only if the system maintains (a) a set of possible states that contain the true state, (b) a set (possibly empty) of reliable probability statements that assigns point values to a partition of the possible states, and (c) belief values set equal to c(q|r_1)p(r_1) + ... + c(q|r_n)p(r_n), where r_1,...,r_n are sentences uniquely defining each partition, p(r_i) is the probability of r_i, and c(q|r_i) is the ratio of possible states in the r_i-partition that contain q. Furthermore, precise reliability (for each set S_x exactly x proportion are true) can always be achieved by ignoring all probability information and using only the possible states ratio.

This result has an interesting ramification. Reliability is always achievable, but only at the cost of ignoring some useful probability information. Reliability and accuracy trade off. Minimizing expected error requires conformance to the probability calculus, thereby giving up on reliability. On the other hand, reliability is only guaranteed if the inference system reports judgments that do not conform to the probability calculus. To illustrate, suppose an inference system knew p(A) = .8 and p(B) = .6, but had no information about p(A&B). As shown in Table 1, there are two sets of belief values that are provably reliable, and one that is precisely reliable.

To illustrate, suppose an inference system knew p(A)=.8 and p(B)=.6, but had no information about p(A&B). As shown in Table 1, there are two sets of belief values that are provably reliable, and one that is precisely reliable. From an inference policy perspective, therefore, the appropriateness of a belief calculus depends on the relative importance of reliability vs. accuracy.

TABLE 1. ILLUSTRATION OF THE TRADEOFF BETWEEN ACCURACY AND RELIABILITY
(expected error = p(A)[1-Bel(A)]^2 + [1-p(A)]Bel(A)^2 + p(B)[1-Bel(B)]^2 + [1-p(B)]Bel(B)^2)

    Belief Values:   A                 .8            .8          .5          .5
                     B                 .6            .5          .6          .5

    Expected Error
    (for A and B only)                .40           .41         .49         .50

    Reliability
    (for all sentences)        none guaranteed   provably    provably    precisely
                                                 reliable    reliable    reliable
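As a check on Table 1, the short Python sketch below (not part of the original paper) recomputes the expected-error row from the quadratic scoring rule given in the table caption, using p(A)=.8 and p(B)=.6.

```python
# Recompute the "Expected Error" row of Table 1 under the quadratic scoring rule.
def expected_error(bel_a, bel_b, p_a=0.8, p_b=0.6):
    return (p_a * (1 - bel_a) ** 2 + (1 - p_a) * bel_a ** 2
            + p_b * (1 - bel_b) ** 2 + (1 - p_b) * bel_b ** 2)

for bel_a, bel_b in [(0.8, 0.6), (0.8, 0.5), (0.5, 0.6), (0.5, 0.5)]:
    print(bel_a, bel_b, round(expected_error(bel_a, bel_b), 2))
# -> .40, .41, .49, .50: the coherent beliefs (.8, .6) minimize expected error,
#    while the uniform possible-states ratios (.5, .5) are precisely reliable
#    but least accurate.
```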

Note here how the characterization of an inference domain impacts the assessment of whether an inference policy is appropriate. The importance of provable reliability depends in part on the inability to anticipate which propositions will be queried. If we knew, for instance, that the inference domain was such that the only propositions that will be queried are those for which reliable probability information is available, then ignoring this probability information would make little sense.

3.3 Introspection and Probability

A concept endemic to nonmonotonic reasoning logics is the idea that negative introspection can provide evidential support for a hypothesis. For instance, in an autoepistemic logic, the sentence ~L~p-->p reads "If I cannot conclude ~p, then p is true," or equivalently, "If p were false, I'd know it." In everyday human affairs, this type of reasoning is quite common. It occurs whenever a person feels that he or she is knowledgeable on some topic ("My husband could not have been cheating on me," she said to the inspector, "for if he were, I would have known it."). It is also a characteristic of most conversations, where by convention it is assumed that all relevant information is communicated (Reiter, 1987).

Probability Models of Negative Introspection

From a probability perspective, evidence-from-introspection poses some interesting problems. In a probabilistic system, one could conceivably model default reasoning using an epsilon semantics (Pearl, 1988). That is, if x is an agent's belief threshold, then the agent believes p (i.e., Lp) if P(p|E) > x, where E is the current evidence. An epsilon-semantics translation of ~L~p-->p might be P(p | 'P(p) > (1-x)') = x. Let x = .9. Then, as long as the agent cannot deduce ~p with probability .9, that agent immediately concludes p with probability .9. This seems reasonable, if a little unusual. Suppose however that the agent decides to set a more conservative belief threshold, say x = .99. Now our agent concludes p with probability .99 whenever ~p cannot be deduced with .99 certainty. The more conservative the threshold, the less evidence needed for the agent to jump to a stronger conclusion. An epsilon semantics seems inappropriate here. Other self-referential approaches seem to have similar problems.

Given problems such as these, a probabilist might be tempted to suggest that belief models should not have probability values conditional on self-referential probability statements, but should only be conditional on the original evidence items. A statement such as ~L~p-->p could simply be interpreted as P(p|~E1,...,~En) = High, where the Ei are relevant evidence items which did not occur. However, this approach fails to account for the fact that people do seem to use negative introspection as a source of evidence. Consequently, it cannot be used to encode human expert judgment. Also, the number of nonoccurring evidence items can be quite large, if not infinite -- making the development of such models infeasible in practice and perhaps impossible in theory.

Probability Analysis of Negative Introspection

Whether or not it is possible to develop probability models of negative introspection is unrelated to the issue of whether or not a probability analysis of negative introspection is useful. A probability analysis of an introspection-based inference policy may be quite informative.

To illustrate, consider the default rule A:B|--B, which states that if proposition A is believed and it is consistent to believe B, then infer B. The autoepistemic logic equivalent of this rule is LA & ~L~B --> B. Presumably, when a knowledge engineer adds a default rule like this to a knowledge base she believes that, for the inference domain to which it will be applied, the default rule will usually generate a valid conclusion. As a result, whether or not an inference system implements a probability model, there is still a probabilistic justification for each default rule that is added to a knowledge base. Probabilistically, the standard justification for a rule such as this is simply that P(B|A) = High, while an alternative justification, based on the communication convention interpretation, might be P(B|LA&~L~B) = High.

Consider the following case. An inference system contains the default rules {A:B|--B, C:D|--D} and the material implications (D-->~B, C-->A). Upon learning C, two extensions result; one contains B and ~D, the other ~B and D. If the rules are interpreted in the standard way, then the first default rule can be shown to be provably irrelevant, since P(B|C) = P(B|C&A) < 1 - P(D|A&C) = 1 - P(D|C), where by provable irrelevance I simply mean that enough evidence has been acquired to make the posterior assessment of the probability of B independent of the value of P(B|A) in any fully specified probability model. If in fact the knowledge engineer had in mind the standard probability justifications for her default rules, then the default logic, by generating two extensions, is behaving in a manner inconsistent with the intentions of the knowledge engineer. Such a system does not reflect a satisfactory inference policy. On the other hand, if it is assumed that default rules reflect communication conventions, then the alternative form for the probability justifications more closely reflects the knowledge engineer's beliefs about the inference domain. In this case, P(B|LC&~L~B&LA) = P(B|LC&~L~B), from which nothing can be derived.

More generally, if negative introspection on categorical beliefs is viewed as a source of evidence for default conclusions, then no extension can be anomalous, in the sense that the probability justification for an applicable rule can never be shown to be provably irrelevant to a current problem. However, nonmonotonic logic theorists seem greatly concerned with the anomalous extension problem (Morris, 1988), suggesting therefore that nonmonotonic reasoning cannot be justified solely by the notion of communication conventions.

4.0 SUMMARY AND DISCUSSION

In this paper, an approach to inferencing under uncertainty was explored that calls for the specification of inference policies tailored to specific inference domains. Although the approach seems pluralistic, I claim no conflict with the Bayesian viewpoint that a rational/coherent system of belief values should conform to the probability calculus. As a scientist, I find the objective of minimizing the error rate of my theories very compelling. Furthermore, my theories involve the development of algorithms that I hope will usually work. Consequently, I feel compelled to reason probabilistically about the relative frequency with which applications of my theories will "work". However, in my (hopefully) coherent reasoning about inference domains I can envision domains where global non-additive objectives (e.g., global reliability) are desirable. Consequently, I see no reason why coherent reasoning about an inference domain should necessarily lead to a Bayesian inference policy as the preferred approach to inferencing within a domain.

REFERENCES

Carnap, R. The Continuum of Inductive Methods. University of Chicago Press: Chicago, Ill., 1952.

Horwich, P. Probability and Evidence. Cambridge University Press: Cambridge, U.K., 1982.

Lehner, P. "Probabilities and Reasoning about Possibilities," International Journal of Approximate Reasoning, in press.

Lehner, P., Mullin, T. and Cohen, M. "A Probability Analysis of the Usefulness of Decision Aids," Uncertainty in AI: Volume 5. North Holland, in press.

Lindley, D. "Scoring Rules and the Inevitability of Probability," International Statistical Review, 1982, 50, 1-26.

Morris, P. "The Anomalous Extension Problem in Default Reasoning," Artificial Intelligence, 1988, 35, 383-399.

Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

Reiter, R. "Nonmonotonic Reasoning," Annual Review of Computer Science, 1987, 2, 147-186.


COMPARING EXPERT SYSTEMS BUILT USING DIFFERENT UNCERTAIN INFERENCE SYSTEMS*

David S. Vaughan, Bruce M. Perrin, Robert M. Yadrick
McDonnell Douglas Missile Systems Company
P.O. Box 516, St. Louis, MO 63166

Peter D. Holden
McDonnell Aircraft Company
P.O. Box 516, St. Louis, MO 63166

* This research was conducted under the McDonnell Douglas Independent Research and Development program.

ABSTRACT

This study compares the inherent intuitiveness or usability of the most prominent methods for managing uncertainty in expert systems, including those of EMYCIN, PROSPECTOR, Dempster-Shafer theory, fuzzy set theory, simplified probability theory (assuming marginal independence), and linear regression using probability estimates. Participants in the study gained experience in a simple, hypothetical problem domain through a series of learning trials. They were then randomly assigned to develop an expert system using one of the six Uncertain Inference Systems (UISs) listed above. Performance of the resulting systems was then compared. The results indicate that the systems based on the PROSPECTOR and EMYCIN models were significantly less accurate for certain types of problems compared to systems based on the other UISs. Possible reasons for these differences are discussed.

1. INTRODUCTION

Several methods of managing uncertainty in expert systems, or Uncertain Inference Systems (UISs), have been proposed or developed over the past twenty years, including those devised for the MYCIN (Shortliffe & Buchanan, 1975) and PROSPECTOR (Duda et al., 1979) projects, methods based on fuzzy set theory (Zadeh, 1983) and Dempster-Shafer theory (Shafer, 1986), and approaches using simplified probability theory (e.g., assuming conditional independence; Pearl, 1985), among others. However, until recent years, little information, apart from theoretical or philosophical arguments, has been available to aid in the selection of a UIS for a particular application. Empirical studies comparing the algorithmic accuracy of UISs began to appear about four years ago (e.g., Yadrick et al., 1988; Perrin et al., 1987; Wise & Henrion, 1986), and later methodical improvements (Wise et al., 1989; Wise, 1989) resulted in identification of the relative limitations of several of these UISs under optimal conditions.

Many of these approximate techniques were justified by their developers, at least in part, by arguing that their approach was more compatible with human mental representations and/or reasoning, more intuitive, and thus, more practical to use. Heuristic parameters and ad hoc combining methods which result in algorithmic inaccuracy might be offset, the argument goes, if humans can readily and accurately apply the approach to a given problem.

Unfortunately, these claims of enhanced "usability" of the various UISs are virtually untested. Henrion and Cooley (1987) and Heckerman (1988) have reported single-expert case studies that compared UISs used to develop relatively large, complex applications. Such case studies provide useful insights into the system development process, but it is difficult to separate the effect of the usability of the approach from the numerous uncontrolled factors in an actual application. Some of these uncontrolled factors include 1) variation in factors other than the choice of a UIS, such as knowledge acquisition techniques, user interface, development environments, and the like, which are not an inherent feature of the UIS; 2) interference produced by requiring a single expert to provide multiple forms of parameter estimates, and expectations or preferences of the expert for one of these forms over another; and 3) differences in the expectations, or where multiple individuals were involved, abilities of the system developers. Additionally, it is difficult to generalize from the experience of a single individual to the broader issue of usability. Finally, it is hard to determine what can serve as a criterion of accuracy independent of the judgments of the application expert. In fairness to the authors of these studies, a comparison of the inherent usability of the UIS was not the primary focus of their work.

We found only a single study (Mitchell, Harp, and Simkin, 1989) which compared UISs under controlled learning and acquisition conditions. Unfortunately, participants in their study did not actually develop a system, but rather, estimated parameters which might be used in a system, such as an EMYCIN Certainty Factor (CF) for a change in belief in a conclusion given evidence. Thus, their participants were given no opportunity to observe the results of their parameter estimates, in terms of a system answer, or to refine their estimates. Additionally, their study did not involve uncertainty in the observations of evidence, e.g., assigning a CF to the observed evidence. This makes drawing conclusions for uncertainty management even more problematic.

The current study sought to control factors extraneous to inherent UIS differences as fully as practical, while providing all of the features necessary to build, test, and refine a working expert system. Differences in expertise were controlled by training the participants in a hypothetical problem domain to a criterion level of performance. Instructions to the participants were standardized and every effort was made to fully describe the essential features of the UIS without giving
information that might influence individual parameter values or rules. The entire procedure, apart from the instructions and a questionnaire, was automated; the system development environments for each of the UISs provided only simple editors and displays to minimize differences in knowledge acquisition and interface technologies. In short, the present study addressed the question of inherent usability of a UIS by examining how readily and accurately individuals could use the essential features of a given UIS.

2. METHOD

Participants in this study were 60 volunteers from McDonnell Douglas Corporation interested or involved in AI activities. They varied widely in their knowledge of AI and in their experience with expert systems, ranging from relatively experienced knowledge engineers to managers of AI projects to personnel just introduced to the technology. Ten of these volunteers were randomly assigned to work with each of the six different UISs examined in this study.

2.1 Learning Trials

To assure comparable levels of expertise, the participants in the study gained experience in solving a hypothetical diagnostic problem through a series of learning trials. Specifically, the problem was to diagnose whether a fictitious machine was or was not malfunctioning according to "temperature" and "pressure" readings. During the learning trials, participants saw a temperature and a pressure on a computer terminal and responded by typing an "M" for malfunction or a "W" for working, depending on which they believed most likely given the readings. They were then informed of the correct answer for that particular case, and a new temperature/pressure combination was presented to begin a new trial. This process continued until the participant achieved an average level of accuracy over 20 consecutive trials equal to about 85% correct, near-optimal performance given the inherent unpredictability of the outcome.

Problems for the learning (and test trials, which are described later) were generated in accordance with the probability of their occurrence, as indicated by the contingency table given in Table 1. For example, problems for which both temperature and pressure were high (above the nominally normal range) and the machine was malfunctioning were generated, on average, in 0.315 of the cases. The conditional probabilities of the conclusion (malfunction) given the various states of the evidence are as follows: P(C|~E1&~E2) = probability of malfunction given normal temperature and normal pressure = 0.1; P(C|~E1&E2) = P(C|E1&~E2) = 0.2; and P(C|E1&E2) = 0.9. The table is therefore conjunctive in nature, in that a malfunction is likely when both temperature and pressure are high (0.9) and unlikely otherwise (0.2 or less).

The actual temperature and pressure readings were generated after it had been determined which cell of the contingency table a particular problem was to represent. We sampled from normal distributions with different means depending on whether the problem represented a normal or a high reading. Means and standard deviations for the normal and high temperature and pressure distributions are given in Table 2.

TABLE 1. THE CONTINGENCY TABLE FROM WHICH PROBLEMS WERE GENERATED

    EVIDENCE                                              CONCLUSION
                                                  WORKING (~C)   MALFUNCTION (C)
    NORMAL TEMPERATURE (~E1) / NORMAL PRESSURE (~E2)   0.315          0.035
    NORMAL TEMPERATURE (~E1) / HIGH PRESSURE (E2)      0.120          0.030
    HIGH TEMPERATURE (E1) / NORMAL PRESSURE (~E2)      0.120          0.030
    HIGH TEMPERATURE (E1) / HIGH PRESSURE (E2)         0.035          0.315
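The conditional probabilities quoted in the text follow directly from the joint probabilities in Table 1; the short sketch below (ours, not part of the original paper) recomputes them.

```python
# Conditional probabilities of malfunction implied by the joint probabilities
# in Table 1: P(C | evidence cell) = P(cell, C) / [P(cell, ~C) + P(cell, C)].
joint = {  # (temperature_high, pressure_high): (P(working), P(malfunction))
    (False, False): (0.315, 0.035),
    (False, True):  (0.120, 0.030),
    (True,  False): (0.120, 0.030),
    (True,  True):  (0.035, 0.315),
}
for cell, (p_work, p_mal) in joint.items():
    print(cell, round(p_mal / (p_work + p_mal), 2))
# -> 0.1, 0.2, 0.2, 0.9: the table is conjunctive, since malfunction is likely
#    only when both readings are high.
```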

Thus, although problem presentation was controlled by a contingency table and sampling from normal distributions, there was nothing in the display or the instructions that suggested one interpretation of the problem was more appropriate than another. Problem displays and the instructions were intentionally vague on issues such as whether the abnormal temperature and pressure readings were symptoms or causes of the malfunction, whether the uncertainty was primarily in the readings (i.e., sensor unreliability) or in the relation between the readings and the conclusion, and what the exact likelihood of malfunction was when neither, either, or both readings were abnormal.

TABLE 2. MEANS AND STANDARD DEVIATIONS FOR NORMAL AND HIGH DISTRIBUTIONS

                           MEAN     STD. DEV.
    NORMAL TEMPERATURE     180.0       5.0
    HIGH TEMPERATURE       200.0       5.0
    NORMAL PRESSURE         70.0       3.0
    HIGH PRESSURE           82.0       3.0
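A minimal sketch of the problem-generation procedure just described, combining Table 1 and Table 2, is given below; the function and variable names are ours, not the authors'.

```python
import random

# First sample a cell of the contingency table (Table 1), then sample the
# temperature and pressure readings from the corresponding normal distributions
# (Table 2).
CELLS = [  # (temp_high, pressure_high, malfunction): probability
    ((False, False, False), 0.315), ((False, False, True), 0.035),
    ((False, True,  False), 0.120), ((False, True,  True), 0.030),
    ((True,  False, False), 0.120), ((True,  False, True), 0.030),
    ((True,  True,  False), 0.035), ((True,  True,  True), 0.315),
]
TEMP = {False: (180.0, 5.0), True: (200.0, 5.0)}    # (mean, std. dev.)
PRESS = {False: (70.0, 3.0), True: (82.0, 3.0)}

def generate_problem():
    (temp_high, press_high, malfunction), = random.choices(
        [c for c, _ in CELLS], weights=[p for _, p in CELLS], k=1)
    temperature = random.gauss(*TEMP[temp_high])
    pressure = random.gauss(*PRESS[press_high])
    return temperature, pressure, ("M" if malfunction else "W")

print(generate_problem())   # e.g. (181.7, 83.2, 'W')
```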

2.2 System Building & Tuning Trials

After they achieved the learning criterion, study participants used one of six UISs to build, test, and refine a system that captured their knowledge about the diagnosis problem. The study included UIS implementations based on the EMYCIN and PROSPECTOR models, simplified probability theory (assuming marginal independence of the evidence), linear regression using probabilities, fuzzy set theory, and Dempster-Shafer theory. Each UIS's relevant features were described and illustrated in written instructions, and every attempt was made to make the descriptions for the different UISs comparable in the amount and level of detail of the information. The following overview of the approaches we studied is only intended to provide an appreciation of some of the major differences between the UISs; for a full explication of the models, the interested reader should consult the references cited.

(a) EMYCIN & PROSPECTOR. Although different in computation and parameters, the EMYCIN (Shortliffe & Buchanan, 1975) and PROSPECTOR (Duda et al., 1979) UISs have similar rule formats. Both UISs allow simple rules, which relate one piece of evidence with the conclusion, and conjunctive ("AND") and disjunctive ("OR") rules, which relate a particular type of evidence combination to the conclusion. Each of these UISs requires estimation of one or more parameters for each rule (EMYCIN requires a single parameter, a CF, to specify the strength of the link between evidence and conclusion; PROSPECTOR requires two, a Logical Sufficiency and a Logical Necessity). CFs are estimated on a -1 to +1 scale, while the PROSPECTOR parameters are estimated on a +6 to -6 scale. Additionally, the participants using the PROSPECTOR UIS must estimate a prior odds for the evidence of each rule and the conclusion, on a scale of +4 to -4. Verbal descriptors of these scale values provided by the UIS developers were available to the study participants.

(b) Linear Regression & Simplified Probability (Independence). Equations (1) and (2) below define the regression and simplified probability (or independence) models, respectively. The parameters for both are probabilities, expressed on a zero-to-one scale (the procedure implemented actually requested proportions to express relations).

    P'(C) = a + b1*P'(E1) + b2*P'(E2)                                        (1)

    P'(C) = P'(~E1)*P'(~E2)*P(C|~E1&~E2) + P'(~E1)*P'(E2)*P(C|~E1&E2)
          + P'(E1)*P'(~E2)*P(C|E1&~E2) + P'(E1)*P'(E2)*P(C|E1&E2)            (2)

The regression UIS required an estimated proportion of cases that would have a malfunction when neither piece of evidence was present (both were in the normal range), corresponding to the intercept "a" in equation (1), and the proportion when each piece of evidence was present, corresponding to the weights "b1" and "b2". The independence UIS required an estimate of the proportions when neither, both, and either piece of evidence alone was present (the conditional probabilities in equation 2), for a total of four parameters. Nothing corresponding to rule selection is required under these approaches, as the form of the relations is specified by the models.
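The two models can be stated compactly in code. The sketch below (with illustrative parameter values of our own choosing, not those elicited in the study) evaluates equations (1) and (2) for given evidence probabilities.

```python
def regression_model(p_e1, p_e2, a, b1, b2):
    """Equation (1): linear regression on the evidence probabilities."""
    return a + b1 * p_e1 + b2 * p_e2

def independence_model(p_e1, p_e2, cond):
    """Equation (2): expected value of P(C|E1,E2), treating the two pieces of
    evidence as marginally independent; `cond` maps (e1, e2) -> P(C|e1,e2)."""
    return sum(cond[(e1, e2)]
               * (p_e1 if e1 else 1 - p_e1)
               * (p_e2 if e2 else 1 - p_e2)
               for e1 in (False, True) for e2 in (False, True))

# Illustrative values only: conditional probabilities as in the training domain,
# and evidence probabilities and regression weights invented for the example.
cond = {(False, False): 0.1, (False, True): 0.2, (True, False): 0.2, (True, True): 0.9}
print(independence_model(0.9, 0.8, cond))                 # belief in malfunction
print(regression_model(0.9, 0.8, a=0.1, b1=0.1, b2=0.1))
```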

(c) Fuzzy Set & Dempster-Shafer Theory. Both of these UISs could have been implemented in a variety of alternative ways. In each case, we chose one of the more simple, straightforward implementations, but one we believed was adequate for the problem. Certainly, additional empirical research into alternative implementations of these UISs is warranted. Our implementation of fuzzy set theory (Zadeh, 1983) used fuzzy membership functions and rule specifications based loosely on those described and illustrated by Bonissone and Decker (1986). Participants using this UIS described a membership function which mapped particular evidence values onto their rules. The membership function required eight parameters for each piece of evidence.

Four of these parameters indicated the high and low values of an interval over which the evidence was deemed definitely present (the bounded values of temperature or pressure were definitely "high") and the high and low points of an interval over which the evidence was deemed definitely absent. The remaining four parameters indicated the intervals over which the evidence was considered uncertain. Rules were either simple or conjunctive, and the strength of the rules was expressed as a probability.

Our implementation of Dempster-Shafer theory (Shafer, 1986) used simple support functions combined using Dempster's rule. The frame of discernment for the problem in this study was the set {working, malfunctioning}, and each reading (temperature or pressure) was taken to be compatible with one, but not both, of these conclusions. For example, a reading that supported the conclusion that the mechanism was malfunctioning did not support the conclusion that it was working. To estimate the support for a conclusion given a reading, participants were asked to estimate the readings which supported the following beliefs: Bel(working) = 0.999, Bel(malfunction) = 0.0; Bel(working) = 0.50, Bel(malfunction) = 0.0; Bel(working) = 0.0, Bel(malfunction) = 0.0; Bel(working) = 0.0, Bel(malfunction) = 0.50; and Bel(working) = 0.0, Bel(malfunction) = 0.999. These points defined a support function, and linear interpolation was used to estimate beliefs for evidence readings falling between those points supplied by the participants. Finally, Dempster's rule was used to combine beliefs.

Once the systems were developed, the participants tested them using temperature and pressure values of their own choosing. If the system answer, which was in a form appropriate to the UIS, disagreed with their own evaluation, simple editors were available to modify any of the UIS's features. Testing and refining continued iteratively until the participant was satisfied with the system's performance.

2.3 Test Trials

Finally, each system was used to diagnose a standard set of thirty test cases, and the participants completed a brief questionnaire describing their background and their impressions of the UIS they had used. Throughout the study, participants were encouraged to ask any questions they wished. Questions requesting clarification of procedures and the like were answered at any time to the participant's satisfaction. If a participant asked a question pertaining specifically to the problem domain or to implementing their solution in a given UIS, the experimenter explained that figuring such things out was part of the task at hand.
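Before turning to the results, here is a minimal sketch of the Dempster-Shafer scheme described in (c) above: each reading is turned into a simple support function by linear interpolation between anchor points, and the two readings are combined with Dempster's rule over the frame {working, malfunctioning}. The anchor readings below are invented for illustration only; they are not values supplied by any participant.

```python
import numpy as np

def support(reading, anchors):
    """anchors: list of (reading, Bel(working), Bel(malfunction)) points.
    Returns (m_working, m_malfunction, m_unknown) by linear interpolation."""
    xs, bel_w, bel_m = zip(*anchors)
    w = float(np.interp(reading, xs, bel_w))
    m = float(np.interp(reading, xs, bel_m))
    return w, m, 1.0 - w - m          # residual mass on the whole frame

def dempster(m1, m2):
    """Dempster's rule for two simple support functions on {working, malfunctioning}."""
    w1, f1, u1 = m1
    w2, f2, u2 = m2
    k = 1.0 - (w1 * f2 + f1 * w2)      # 1 - conflict
    w = (w1 * w2 + w1 * u2 + u1 * w2) / k
    f = (f1 * f2 + f1 * u2 + u1 * f2) / k
    return w, f, 1.0 - w - f

# Hypothetical anchor points for temperature and pressure readings:
temp_anchors = [(170, 0.999, 0.0), (180, 0.5, 0.0), (190, 0.0, 0.0),
                (195, 0.0, 0.5), (205, 0.0, 0.999)]
press_anchors = [(64, 0.999, 0.0), (70, 0.5, 0.0), (76, 0.0, 0.0),
                 (79, 0.0, 0.5), (85, 0.0, 0.999)]

belief = dempster(support(198.0, temp_anchors), support(81.0, press_anchors))
print("Bel(working) = %.3f, Bel(malfunction) = %.3f" % (belief[0], belief[1]))
```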

3. RESULTS AND DISCUSSION

Tests of mean differences suggest that the experimental procedure was effective in creating groups with roughly equivalent backgrounds and understanding of the problem domain. Analysis of variance (ANOVA) tests of the background characteristics, which included items on exposure to AI and expert systems methodologies, revealed no significant differences between the groups assigned to use the different UISs [maximum F(5,53) = 0.88, p < 0.50]. Likewise, we found no differences between groups in the number of trials to reach the learning criterion [F(5,54) = 0.21, p < 0.96].

We also tested for the clarity or ease of use of the UISs in two ways: 1) in the self-reported ratings of comprehensiveness, consistency, and ease of use of the UIS from the questionnaire; and 2) in the effort expended to tune a given UIS to the satisfaction of the participant. ANOVA tests indicated the groups were not significantly different in the self-reported ratings [maximum F(5,53) = 1.56, p < 0.19]. This finding may imply that the UISs are equivalent in terms of clarity. It is also possible, however, that the participants, who used only one of the UISs during the study, lacked any basis for comparison.

Differences in the average number of trials the participants spent in tuning the different UISs are given in Table 3. An ANOVA indicated that the groups differed significantly [F(5,54) = 2.63, p < 0.04], with participants who used the linear regression approach requiring the fewest trials, and those who used the PROSPECTOR UIS needing more than twice as many trials, on average. The difference in the number of tuning trials may reflect the relative clarity or ease of use of the approaches; however, it may simply be related to the number of distinct parameters that must be estimated. The linear regression approach requires the fewest parameters, only three, while participants using PROSPECTOR estimated an average of 13.0 parameters. Once this difference was taken into account using analysis of covariance, the groups did not differ significantly in the number of trials to tune their systems [F(5,53) = 0.42, p < 0.83].

TABLE 3. NUMBER OF TRIALS TO TUNE A GIVEN UNCERTAIN INFERENCE SYSTEM

    GROUP:            EMYCIN   PROSP.   INDEP.   LIN'R REG.   FUZZY SET   D-SHAFER
    NO. OF TRIALS:      9.5     19.3     11.3        8.1         18.5       10.0

We evaluated six different measures to assess the accuracy of the participants' systems on the final thirty test trials; however, all six revealed the same pattern of findings. For simplicity, we report results based on the proportion of correct diagnoses by the system compared to the answer used to generate the problem (i.e., the cell of the contingency table sampled). The ANOVA for this index showed a significant main effect for trials and a significant trials by UIS interaction. The summary ANOVA table for this analysis is given as Table 4.

The significant trials by UIS interaction [F(145,1566) = 1.58, p < 0.01] indicates substantial variation among the UISs in their accuracy given different types of trials. Further evaluation of the trials by UIS interaction revealed that, on average, systems developed using the six UISs were equally accurate given "consistent" evidence, that is, when both temperature and pressure readings were normal or both were high. However, performance of the different UISs varied widely when the evidence was "mixed" or conflicting, i.e., one of the readings was high while the other was

TABLE 4. SUMMARY ANOVA TABLE FOR PROPORTION CORRECT DIAGNOSES

    SOURCE                DF        SS       MS        F
    BETWEEN               59     52.53
      UIS                  5      5.41     1.08     1.24
      SUBJ(UIS)           54     47.12     0.87
    WITHIN              1740    269.47
      TRIALS              29     71.50     2.47    22.36*
      UIS * TRIALS       145     25.29     0.17     1.58*
      TRIALS * S(UIS)   1566    172.68     0.11
    TOTAL               1799    322.0

*P