Data Mining on Multimedia Data 3540003177, 9783540003175

Despite being a young field of research and development, data mining has proved to be a successful approach to extractin


English Pages 141 [137] Year 2002


Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

2558


Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo

Petra Perner

Data Mining on Multimedia Data


Series Editors Gerhard Goos, Karlsruhe University, Germany Juris Hartmanis, Cornell University, NY, USA Jan van Leeuwen, Utrecht University, The Netherlands Author Petra Perner Institute of Computer Vision and Applied Computer Sciences August-Bebel-Str. 16-20 04275 Leipzig Germany E-mail: [email protected] Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at .

CR Subject Classification (1998): H.2.8, I.2, H.5.1, I.4 ISSN 0302-9743 ISBN 3-540-00317-7 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP Berlin, Stefan Sossna e. K. Printed on acid-free paper SPIN: 10894540 06/3142 543210

Preface

The increasing use of computer technology in many areas of economic, scientific, and social life is resulting in large collections of digital data. The amount of data which is created on and stored in computers is growing from day to day. Electronic database facilities are common everywhere and can now be considered a standard technology. Beyond that arises the question: What is the next higher quality that can be derived from these electronic data pools? The next logical step would be the analysis of these data in order to derive useful information that is not easily accessible to humans. Here data mining comes into play. By using such technology it is possible to automatically derive new knowledge, new concepts, or knowledge structures from these digital data. It is a young discipline but has already seen enormous knowledge growth. Whereas the early work was done on numerical data, multimedia applications now drive the need to develop data mining methods that can work on all kinds of data, such as documents, images, and signals. The book is devoted to this requirement. It is written for students, experts from industry and medicine, and scientists who want to get into the field of mining multimedia data.

We describe the basic concepts as well as various applications in order to inspire people to use data mining technology in practice wherever it is applicable. In the first part of the book we introduce some basic concepts: how information pieces can be extracted from images and signals, and the way they have to be represented in the system. In the second part of the book we introduce the basic concepts of data mining in a fundamental way so that the reader gets a good understanding of these. The third part of the book examines real applications of how these concepts work on real data.

I would like to thank Prof. Atsushi Imiya from the Multimedia Technology Division at Chiba University, Japan for the time he spent on studying the manuscript for this book and his valuable comments on improving it. For his inspiring discussions on the topic of data mining of multimedia data and his constant interest in the progress of this manuscript I would like to thank Prof. George Nagy from Rensselaer Polytechnic Institute, Troy, USA.

October 2002

Petra Perner

Contents

1 Introduction
1.1 What Is Data Mining?
1.2 Some More Real-World Applications
1.3 Data Mining Methods – An Overview
1.3.1 Basic Problem Types
1.3.2 Prediction
1.3.2.1 Classification
1.3.2.2 Regression
1.3.3 Knowledge Discovery
1.3.3.1 Deviation Detection
1.3.3.2 Cluster Analysis
1.3.3.3 Visualization
1.3.3.4 Association Rules
1.3.3.5 Segmentation
1.4 Data Mining Viewed from the Data Side
1.5 Types of Data
1.6 Conclusion

2 Data Preparation
2.1 Data Cleaning
2.2 Handling Outlier
2.3 Handling Noisy Data
2.4 Missing Values Handling
2.5 Coding
2.6 Recognition of Correlated or Redundant Attributes
2.7 Abstraction
2.7.1 Attribute Construction
2.7.2 Images
2.7.3 Time Series
2.7.4 Web Data
2.8 Conclusions

3 Methods for Data Mining
3.1 Decision Tree Induction
3.1.1 Basic Principle
3.1.2 Terminology of Decision Tree
3.1.3 Subtasks and Design Criteria for Decision Tree Induction
3.1.4 Attribute Selection Criteria
3.1.4.1 Information Gain Criteria and Gain Ratio
3.1.4.2 Gini Function
3.1.5 Discretization of Attribute Values
3.1.5.1 Binary Discretization
3.1.5.2 Multi-interval Discretization
3.1.5.3 Discretization of Categorical or Symbolical Attributes
3.1.6 Pruning
3.1.7 Overview
3.1.8 Cost-Complexity Pruning
3.1.9 Some General Remarks
3.1.10 Summary
3.2 Case-Based Reasoning
3.2.1 Background
3.2.2 The Case-Based Reasoning Process
3.2.3 CBR Maintenance
3.2.4 Knowledge Containers in a CBR System
3.2.5 Design Consideration
3.2.6 Similarity
3.2.6.1 Formalization of Similarity
3.2.6.2 Similarity Measures
3.2.6.3 Similarity Measures for Images
3.2.7 Case Description
3.2.8 Organization of Case Base
3.2.9 Learning in a CBR System
3.2.9.1 Learning of New Cases and Forgetting of Old Cases
3.2.9.2 Learning of Prototypes
3.2.9.3 Learning of Higher Order Constructs
3.2.9.4 Learning of Similarity
3.2.10 Conclusions
3.3 Clustering
3.3.1 Introduction
3.3.2 General Comments
3.3.3 Distance Measures for Metrical Data
3.3.4 Using Numerical Distance Measures for Categorical Data
3.3.5 Distance Measure for Nominal Data
3.3.6 Contrast Rule
3.3.7 Agglomerate Clustering Methods
3.3.8 Partitioning Clustering
3.3.9 Graphs Clustering
3.3.10 Similarity Measure for Graphs
3.3.11 Hierarchical Clustering of Graphs
3.3.12 Conclusion
3.4 Conceptual Clustering
3.4.1 Introduction
3.4.2 Concept Hierarchy and Concept Description
3.4.3 Category Utility Function
3.4.4 Algorithmic Properties
3.4.5 Algorithm
3.4.6 Conceptual Clustering of Graphs
3.4.6.1 Notion of a Case and Similarity Measure
3.4.6.2 Evaluation Function
3.4.6.3 Prototype Learning
3.4.6.4 An Example of a Learned Concept Hierarchy
3.4.7 Conclusion
3.5 Evaluation of the Model
3.5.1 Error Rate, Correctness, and Quality
3.5.2 Sensitivity and Specificity
3.5.3 Test-and-Train
3.5.4 Random Sampling
3.5.5 Cross Validation
3.5.6 Conclusion
3.6 Feature Subset Selection
3.6.1 Introduction
3.6.2 Feature Subset Selection Algorithms
3.6.2.1 The Wrapper and the Filter Model for Feature Subset Selection
3.6.3 Feature Selection Done by Decision Tree Induction
3.6.4 Feature Subset Selection Done by Clustering
3.6.5 Contextual Merit Algorithm
3.6.6 Floating Search Method
3.6.7 Conclusion

4 Applications
4.1 Controlling the Parameters of an Algorithm/Model by Case-Based Reasoning
4.1.1 Modelling Concerns
4.1.2 Case-Based Reasoning Unit
4.1.3 Management of the Case Base
4.1.4 Case Structure and Case Base
4.1.4.1 Non-image Information
4.1.4.2 Image Information
4.1.5 Image Similarity Determination
4.1.5.1 Image Similarity Measure 1 (ISim_1)
4.1.5.2 Image Similarity Measure 2 (ISim_2)
4.1.5.3 Comparison of ISim_1 and ISim_2
4.1.6 Segmentation Algorithm and Segmentation Parameters
4.1.7 Similarity Determination
4.1.7.1 Overall Similarity
4.1.7.2 Similarity Measure for Non-image Information
4.1.7.3 Similarity Measure for Image Information
4.1.8 Knowledge Acquisition Aspect
4.1.9 Conclusion
4.2 Mining Images
4.2.1 Introduction
4.2.2 Preparing the Experiment
4.2.3 Image Mining Tool
4.2.4 The Application
4.2.5 Brainstorming and Image Catalogue
4.2.6 Interviewing Process
4.2.7 Setting Up the Automatic Image Analysis and Feature Extraction Procedure
4.2.7.1 Image Analysis
4.2.7.2 Feature Extraction
4.2.8 Collection of Image Descriptions into the Data Base
4.2.9 The Image Mining Experiment
4.2.10 Review
4.2.11 Using the Discovered Knowledge
4.2.12 Lessons Learned
4.2.13 Conclusions

5 Conclusion

Appendix
The IRIS Data Set

References

Index

1 Introduction

We are continually confronted with new phenomena. Sometimes it takes us years to build by hand a model that describes the observed phenomena and that allows us to predict new events. But more often there is an urgent need for such a model and the time to develop it is not given. These problems are known, for example, from medicine, where we want to know how to treat people with a certain disease so that they can recover quickly.

Years ago, in 1993, we started out with data mining for a medical problem called in-vitro fertilization therapy (IVF therapy). We will take this problem as an introductory example since it shows nicely how data mining can be applied and demonstrates the results that we can obtain from the data mining process [PeT97].

In-vitro fertilization therapy can help childless couples make their wish to have a baby come true. However, the success rate was very low in 1993. Although this therapy had already been in use for more than ten years, medical doctors had not been able to develop a clear model of the function and effect of the therapy. The main reason for that was seen in the complex interlocking of biological, clinical, and medical facts. Therefore, doctors started to build up a database in which each diagnostic parameter and all clinical information of a patient was recorded. This database contained parameters from ultrasonic images, such as the number and the size of follicles recorded on certain days of the woman's menstruation cycle, clinical data, and hormone data.

We used this database and analyzed it with decision tree induction. As a result we obtained a decision tree showing useful decision rules for the IVF therapy, see Figure 1. This model contains rules such as: IF hormone E2 at the third cycle day <= 67 AND number of follicles <= 16.5 AND hormone E2 at the 12th cycle day <= 2620 THEN Diagnosis_0. It described the diagnosis model in such a way that physicians could follow suit. The learnt decision tree had two functions for the physicians:

1. Exploratory Function
The learnt rules helped the experts to better understand the effects of the therapy. This is possible since the knowledge is made explicit for the expert by the representation of the decision tree. He can understand the rules by tracing down each path of the decision tree. The trust in this knowledge got higher when he found among the whole set of rules a few rules that he had already built up in the past. This knowledge gave him new impulses to think about the effects of the IVF therapy as well as to acquire new necessary information about the patient in order to improve the success rate of the therapy.

2. Prediction Function
After some experiments we came up with a good model for a sub-diagnosis task of IVF therapy. It is called diagnosis of over-stimulation syndrome. Physicians could use the learnt model in order to predict the development of an over-stimulation syndrome for new incoming patients.

The attentive reader will have noted that we could not revolutionize the whole IVF therapy. Our experiments at that time showed that the recent diagnostic measurements are not enough to characterize the whole IVF process. Nevertheless, the results were enough to stimulate new discussion and gave new impulses for the therapy.

P. Perner: Data Mining on Multimedia Data, LNCS 2558, pp. 1−11, 2002. © Springer-Verlag Berlin Heidelberg 2002

[Figure 1: decision tree for the IVF therapy. The root node (155 DS) splits on E2ZT3, the hormone E2 at cycle day 3 (<= 67: 74 DS, > 67: 81 DS); further nodes test ATZT12, E2ZT12, LHZT3, E2ZT6, and E2ZT3, and the leaves are labelled [0] or [1].]

Attributes and their descriptions:
E2 Day_3, 6, 9, 12: Hormone estradiol measured at days 3, 6, 9, and 12 of the woman's menstruation cycle
LHZ Day 3: Luteinizing hormone at cycle day 3

Fig. 1. Data and results of IVF therapy
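The single rule quoted in the text (E2 at day 3 <= 67, number of follicles <= 16.5, E2 at day 12 <= 2620, then Diagnosis_0) can be written as a small predicate. This is only a sketch of that one path of the tree: the function name, the parameter names, and the sample values are assumptions made for illustration, and the function returns None for every patient the rule does not cover rather than guessing at the other branches.

```python
from typing import Optional


def ivf_rule(e2_day3: float, follicles: float, e2_day12: float) -> Optional[str]:
    """Apply the one rule quoted in the text; names here are hypothetical.

    Returns "Diagnosis_0" when all three conditions hold, otherwise None
    (the rule simply does not fire; other paths of the tree are not modelled).
    """
    if e2_day3 <= 67 and follicles <= 16.5 and e2_day12 <= 2620:
        return "Diagnosis_0"
    return None


# Made-up measurements, used only to exercise the rule.
print(ivf_rule(e2_day3=50, follicles=12, e2_day12=2000))
print(ivf_rule(e2_day3=80, follicles=12, e2_day12=2000))
```

Writing the rule this way mirrors what the text calls tracing down one path of the decision tree: each condition corresponds to one test on the way from the root to a leaf.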

The application we have described before follows the empirical cycle of theory formation, see Figure 2. By collecting observations from our universe and analyzing these observations based on sound mathematical methods, we can come up with a theory that allows us to predict new events about the universe. That is not only applicable to our introductory example; it can be used for every application that meets the criteria of the empirical cycle.

[Figure 2: cycle diagram linking Universe, Observations, Analysis, Theory, and Prediction.]

Fig. 2. Empirical Cycle of Theory Formation
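As a toy illustration of the cycle (observations, analysis, theory, prediction), the sketch below learns a single threshold on one attribute from labelled observations and then uses it to predict the label of new values. This is a deliberately simple stand-in, not the decision tree induction method used for the IVF data; the function names and the observations are assumptions invented for the example.

```python
def learn_threshold(observations):
    """Analysis step: choose the threshold with the fewest training errors.

    observations: list of (value, label) pairs, labels are 0 or 1.
    The resulting "theory" is: predict 0 if value <= threshold, else 1.
    """
    candidates = sorted({value for value, _ in observations})
    best_threshold, best_errors = None, len(observations) + 1
    for t in candidates:
        errors = sum((0 if value <= t else 1) != label
                     for value, label in observations)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold


def predict(threshold, value):
    """Prediction step: apply the learned theory to a new observation."""
    return 0 if value <= threshold else 1


# Hypothetical observations collected from the "universe".
observations = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
threshold = learn_threshold(observations)
print(threshold, predict(threshold, 2.5), predict(threshold, 9.0))
```

New observations whose predictions turn out wrong would be added to the list and the threshold learned again, which closes the cycle.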

1.1 What Is Data Mining?

In our IVF example we discovered diagnosis knowledge from a set of data recorded from all patients based on Data Mining. By summarizing these data into a set of rules, our method helped humans to find the general meaning of the single data. Usually, humans build up such knowledge by experience over years. The task gets harder the more complex the problem is. The IVF example is described by only 36 parameters. Sometimes there can be more than 100 parameters, which humans can no longer keep track of. The innovative idea of data mining is that it provides methods and systems that can automatically find these general meanings based on large and complex, digitally recorded data. Data mining systems usually do not care about the number of parameters. They can work with 10 or even with several hundred parameters.

However, our introductory example also showed that it is not always possible to come up with generalizations even when data are available. There must be generalization capability within the data. Otherwise, it can already be useful to find patterns in the data and use them for exploration purposes. This is a much weaker approach but useful for the human knowledge discovery process.

For a formal definition of Data Mining we follow Gregory Piatetsky-Shapiro's definition:

"Data Mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases."

1.2 Some More Real-World Applications

Say you have generated a record in a database listing all car insurance policy holders by age, sex, marital status, and selected policy, for example; see Figure 3 and Figure 4. With Data Mining, you can use this database to generate knowledge telling you which market grouping buys which type of insurance policy. This knowledge allows you to adjust your product portfolio in order to fill gaps in demand or predict which product a new customer is most likely to opt for.

[Figure 3: table of customer records with columns including Sex, Age, Marital_Status, Children, and Purchase.]

Fig. 3. Excerpt of a customer database

[Figure 4: decision tree over 18 DS. The root splits on MARITAL_ST (single: 13 DS, married: 5 DS); the single branch splits on CHILDREN (<= 0: 11 DS, > 0: 2 DS); the branch without children splits on SEX (female: 5 DS, male: 6 DS).]

Fig. 4. Resulting decision tree for customer profiling
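The idea of finding out which market grouping buys which policy can be sketched with plain counting. The code below groups records by one attribute and reports the most frequent policy per group. It is simple frequency counting, not the decision tree induction behind Figure 4, and the records, the policy names, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: (sex, marital_status, children, policy).
records = [
    ("female", "single", 0, "policy_1"),
    ("female", "single", 0, "policy_1"),
    ("male", "single", 0, "policy_1"),
    ("male", "single", 1, "policy_1"),
    ("female", "married", 2, "policy_2"),
    ("male", "married", 1, "policy_2"),
    ("male", "married", 2, "policy_1"),
]


def policies_by_group(rows, key_index):
    """Count how often each policy was chosen within each attribute value."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[key_index]][row[-1]] += 1
    return groups


def most_likely_policy(groups, group):
    """Return the policy bought most often by the given group."""
    return groups[group].most_common(1)[0][0]


by_marital = policies_by_group(records, key_index=1)
print(dict(by_marital["single"]))
print(most_likely_policy(by_marital, "married"))
```

A decision tree goes one step further than this sketch: it chooses automatically which attribute to group by first and then refines each group with further attributes.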

Or a criminal investigation office generates a data pool of criminal offences containing information on the offender such as age, sex, behavior, and relationship between victim and offender. With Data Mining, these databases can be used to establish offender profiles, which are useful in criminal investigations for narrowing down lists of suspects.

Say a doctor has generated a database for a certain disease in which to store patients' data along with clinical and laboratory values and details of the nature of the disease. With Data Mining techniques, he can use this data collection to acquire the knowledge necessary to describe the disease. This knowledge can then be used to make prognoses for new patients or predict likely complications. Figure 5 illustrates this process for the prediction of the day of the infection after liver transplantation.

Fig. 5. Illustration of the data and the data mining process for mining the knowledge for the identification of the time of infection after liver transplantation

1 .3

D a ta M i n i n g M e th od s – A n O v e r v i e w

1 . 3 . 1 B a s i c P r ob l e m

T y p e s

Data Mining methods can be distinguished into two main categories of data mining problems: 1. Prediction and 2. Knowledge Discovery (see Figure 6). While prediction is the strongest goal, knowledge discovery is the weaker approach and usually prior to prediction. The over-stimulation syndrome recognition described in Section 1 belongs to predictive data mining. In this example, we mined our database for a set of rules that describes the diagnosis knowledge. The doctors use this knowledge for the prediction of the over-stimulation syndrome when a new patient comes in.

Fig. 6. Types of Data Mining Methods (Prediction: Classification, Regression; Knowledge Discovery: Deviation Detection, Clustering, Association Rules, Visualization)

For that kind of data mining, we need to know the classes or goals our system should predict. In most cases we might know these goals a-priori. However, there are other tasks where the goals are not known a-priori. In that case, we have to find out the classes based on methods such as clustering before we can go into predictive mining. Furthermore, the prediction methods can be distinguished into classification and regression, while knowledge discovery can be distinguished into deviation detection, clustering, mining association rules, and visualization. To categorize the actual problem into one of these problem types is the first necessary step when dealing with Data Mining. Therefore, we will give a short introduction to the different methods.

1.3.2 Prediction

1.3.2.1 Classification

Assume there is a set of observations from a particular domain. Among this set of data there is a subset of data labelled by class 1 and another subset of data labelled by class 2. Each data entry is described by some descriptive domain variables and the class label. We now want to find a mapping function that allows us to separate samples belonging to class 1 from those belonging to class 2. Furthermore, this function should allow us to predict the class membership of new, formerly unseen samples. Such problems belong to the problem type "classification". There can be more than two classes, but for simplicity we only consider the two-class problem. The mapping function can be learnt by decision tree or rule induction [WeK90], neural networks [Rze98][ShT02], statistical classification methods [CADKR02] or case-based reasoning [CrR02]. We will concentrate in this book on symbolical learning methods such as decision tree and rule induction and case-based reasoning.

1.3.2.2 Regression

Whereas classification determines the set membership of the samples, the answer of regression [RPD98][AtR00] is numerical. Suppose we have a CCD sensor. We give light of a certain luminous intensity to this sensor. Then this light is transformed into a gray value by the sensor, according to a transformation function. When we change the luminous intensity, we also change the gray value. That means the variability of the output variable will be explained based on the variability of one or more input variables.
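The CCD example can be made concrete with a small sketch. It assumes, purely for illustration, that the transformation function is linear and fits it by ordinary least squares; the measurement values and the names `fit_line`, `intensity` and `gray` are hypothetical and not taken from a real sensor.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope: change of gray value per unit of intensity
    b = mean_y - a * mean_x    # intercept: gray value at zero intensity
    return a, b


# Hypothetical calibration measurements (luminous intensity -> gray value).
intensity = [0.0, 1.0, 2.0, 3.0, 4.0]
gray = [10.0, 60.0, 110.0, 160.0, 210.0]

a, b = fit_line(intensity, gray)
predicted = a * 2.5 + b        # numerical answer for an unseen intensity
```

A real sensor characteristic may well be non-linear; in that case the same idea applies with a different model family.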

1.3.3 Knowledge Discovery

1.3.3.1 Deviation Detection

Real-world observations are random events. The determination of characteristic values, such as the quality of an industrial part, the influence of a medical treatment on a patient group or the detection of visually attentive regions in images, can be done based on statistical parameter tests. Methods for the estimation of unknown parameters, tests of hypotheses and the estimation of confidence intervals in linear models can be found in Koch [Koc02].

1.3.3.2 Cluster Analysis

A number of objects that are represented by an n-dimensional attribute vector should be grouped into meaningful groups. Objects that get grouped into one group should be as similar as possible. Objects from different groups should be as dissimilar as possible. The basis for this operation is a concept of similarity that allows us to measure the closeness of two data entries and to express the degree of their closeness. In Chapter 3, Sections 3.3.1 - 3.3.3, we will describe different similarity measures. Once groups have been found we can assign class labels to these groups and label each data entry in our database according to its group membership with the corresponding class label. Then we have a database which can serve as a basis for classification.
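The grouping-then-labelling idea can be sketched as follows. The sketch assumes the Euclidean distance as the (dis)similarity measure and a simple single-linkage merge with a distance threshold; both choices, the sample points and the threshold value are illustrative assumptions, not the methods prescribed in Chapter 3.

```python
import math


def euclidean(a, b):
    """Distance between two attribute vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage_groups(points, threshold):
    """Merge groups as long as two groups contain members closer than threshold."""
    groups = [[i] for i in range(len(points))]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                closest = min(euclidean(points[p], points[q])
                              for p in groups[i] for q in groups[j])
                if closest <= threshold:
                    groups[i] = groups[i] + groups[j]
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Hypothetical two-dimensional attribute vectors.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (0.9, 1.1)]
groups = single_linkage_groups(points, threshold=1.0)

# Assign a class label to every data entry according to its group membership.
labels = [None] * len(points)
for label, members in enumerate(groups):
    for index in members:
        labels[index] = label
```

The resulting `labels` list plays the role of the class attribute that makes the database usable for classification.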


1.3.3.3 Visualization

The famous remark "A picture is worth more than a thousand words." especially holds for the exploration of large data sets. Large amounts of numbers are not easy for humans to survey. The summarization of these data into a proper graphical representation may give humans a better insight into the data [EFP01]. For example, clusters are usually represented numerically. The dendrogram (see Figure 11) illustrates these groupings, and gives a human an understanding of the relations between the various groups and subgroups. A large set of rules is easier to understand when structured in a hierarchical fashion and graphically viewed, such as in the form of a decision tree.

1.3.3.4 Association Rules

Finding associations between different types of information which seem to have no semantic dependence can give useful insights into, e.g., customer behavior. Marketing managers have found that customers who buy oil will also buy vegetables. Such information can help to arrange a supermarket so that customers feel more attracted to shop there. Discovering which HTML documents are retrieved in connection with other HTML documents can give insight into the user profile of the website visitors. We can identify lesioned structures in brain MR images. The existence of a lesioned area may suggest the existence of another lesioned structure having a distinct spatial relation to the other structure. Counting the occurrences of such a pattern may give hints for the diagnosis. Methods on association rule mining can be found in Zhang et al. [ZhZ02] and Adamo [Ada01]. In [HGN02] the application of these methods to engineering data is described.

1.3.3.5 Segmentation

Suppose we have mined a marketing database for user profiles. In the next step, we want to set up a mailing action in order to advertise a certain product for which it is highly likely that it attracts this user group. Therefore, we have to select all addresses in our database that meet the desired user profile. By using the learnt rule as a query to the database we can segment our database into customers that do not meet the user profile and into those that meet the user profile. Or suppose we have mined a medical database for patient profiles and want to call in these patients for a specific medical test. Then, we have to select the names and addresses of all patients from our database that meet our patient profile. The separation of a database into only those data that meet a given profile is called segmentation.
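Segmentation with a learnt rule can be sketched as a simple filter over the database. The records, the attribute names (modelled on the customer database of Fig. 3) and the rule itself are hypothetical; in practice the rule would come out of the mining step.

```python
# Hypothetical customer records.
customers = [
    {"id": 1, "sex": "female", "marital_status": "single", "children": 0},
    {"id": 2, "sex": "male", "marital_status": "married", "children": 2},
    {"id": 3, "sex": "female", "marital_status": "married", "children": 0},
    {"id": 4, "sex": "male", "marital_status": "married", "children": 1},
]


def learnt_rule(record):
    """Illustrative profile only: married customers with at least one child."""
    return record["marital_status"] == "married" and record["children"] > 0


def segment(records, rule):
    """Split the records into those that meet the profile and those that do not."""
    meets = [r for r in records if rule(r)]
    misses = [r for r in records if not rule(r)]
    return meets, misses


mailing_list, others = segment(customers, learnt_rule)
```

With a relational database the same rule would typically be expressed as the selection condition of a query.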

1.4 Data Mining Viewed from the Data Side

We have discussed data mining from the problem-type perspective. We can also view Data Mining from the data-type dimension. Although mining text or images can be of the same problem type, special fields have been developed over time, such as text mining [Vis01], time series analysis [SHS00], image mining [Per01], or web mining [KMSS02][BlG02][PeF02]. The specific problem for these types of data lies in the preparation of the data for the mining process and the representation of these data.

Although the pixels of a 2D or 3D image are of numerical data type, it would not be wise to take the whole image matrix itself for the mining process. Usually, the original image might be distorted or corrupted by noise. By pre-processing the image and extracting higher-level information from the image matrix, the influence of noise and distortions will be reduced, as well as the amount of information that has to be handled. Beyond this, the extraction of higher-level information allows an understanding of the image content. The representation of an image can be done on different levels that are described in Chapter 2, Section 2.7.1.

The categorization of text into similar groups or the classification of text documents requires an understanding of the content of the documents. Therefore, the document has to go through different processing steps depending on the form of the available text. A printed document must be converted into a digital document, which requires digitalization of the document, recognition of the printed area and the characters, and grouping of the characters into words and sentences. A digital version must be parsed into words and all unnecessary formatting instructions must be removed. After all that we are still faced with the problem of the contextual word sense or the semantic similarity between different words. An application for text mining can be found in Visa et al. [VTVB02].

Fig. 7. Overview of Data Mining Methods viewed from the Data Side (Types of Data: Time Series, e.g. Sound, Traffic Data, Life Monitoring, Medical Data: Time Series Analysis; 2D and 3D Images: Image Mining; Video: Video Mining; Text, e.g. Handwritings, Documents: Text Mining; Web, e.g. Server Logs, Web Documents: Web Mining)

In time-series analysis the problem is to recognize events. That raises the question what is an event. Usually, changes from normal status will be detected by regression. However, a time series can also be converted into a symbolic curve description and this representation can be the basis for the mining process [SchG02]. The basis for web mining are the server logs or the web documents. The necessary information must be extracted from both data types by parsing these documents.

The final representation of multimedia data can be either of the type numerical or symbolical attributes, but more complex representations such as e.g. strings, graphs, and relational structures are also possible.

1.5 Types of Data

An overview about types of data is given in Figure 8. Attributes can be of numerical or categorical data type. Numerical variables are for example the temperature or the gray level of a pixel in an image. This variable can have distinct gray levels ranging from 0 to 255. Categorical data is one for which the measurement scale consists of a set of categories. For instance, the size of an object may be described as "small", "medium", and "big". There are different types of categorical variables. Categorical variables for which levels do not have a natural ordering are called nominal. Many categorical variables do have ordered levels. Such variables are called ordinal. For instance, the gray level may be expressed by categorical levels such as "black", "gray", and "white". It is clear that the levels "black" and "white" lie at opposite ends of the gray level scale whereas the level "gray" lies in between both levels. An interval variable is one that has numerical distances between any two levels of the scale. In the measurement hierarchy, interval variables are highest, ordinal variables are next, and nominal variables are lowest.

An attribute-based description alone might not be appropriate for multimedia applications. The global structure of a given object or a scene and the semantic information of the parts of the objects or the scene and their relation might require an attributed graph representation. We define an attributed graph as follows:

Definition 1:
W ... set of attribute values, e.g.: W = {"dark_grey", "left_behind", "directly_behind", ...}
A ... set of all attributes, e.g.: A = {shape, object area, spatial_relationship, ...}
b: A → W ... partial mapping, called attribute assignments
B ... set of all attribute assignments over A and W.

A graph G = (N, p, q) consists of
N ... finite set of nodes
p: N → B ... mapping of attributes to nodes
q: E → B ... mapping of attributes to edges, where E = (N x N)\I_N and I_N is the identity relation in N.

The nodes are for example the objects and the edges are the spatial relations between the objects. Each object has attributes which are associated to the corresponding node within the graph.
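One possible in-memory form of such an attributed graph is sketched below. The class name `AttributedGraph`, its method names and the example objects are assumptions made for illustration; only the structure (nodes N, node mapping p, edge mapping q, no edges from a node to itself) follows Definition 1.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    nodes: set = field(default_factory=set)          # N: finite set of nodes
    node_attrs: dict = field(default_factory=dict)   # p: N -> B
    edge_attrs: dict = field(default_factory=dict)   # q: E -> B

    def add_node(self, node, **assignments):
        """Register a node together with its attribute assignments."""
        self.nodes.add(node)
        self.node_attrs[node] = dict(assignments)

    def add_edge(self, source, target, **assignments):
        """Register an edge; E = (N x N) without the identity relation."""
        if source == target:
            raise ValueError("edges from a node to itself are excluded")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both end points must be nodes of the graph")
        self.edge_attrs[(source, target)] = dict(assignments)


# Hypothetical scene with two objects and one spatial relation.
scene = AttributedGraph()
scene.add_node("object_1", shape="round", object_area=120)
scene.add_node("object_2", shape="elongated", object_area=45)
scene.add_edge("object_1", "object_2", spatial_relationship="left_behind")
```

Storing the assignments as dictionaries mirrors the partial mapping b: an attribute that is not assigned is simply absent.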

Fig. 8. Overview Types of Data (node labels: numerical, categorical, string, graph, interval, ordinal, nominal, attributed graph)

1.6 Conclusion

In this chapter we have explained what data mining is and we gave an overview about the basic methods. The diversity of applications that we have described should give you an idea where it can be reasonable to apply data mining methods in order to get new insights into the application. It should inspire you to think about using data mining techniques even for your application. Besides the basic methods for data mining, we have viewed the field from the data side. When it comes to multimedia data such as images, video or audio, more complex data structures than attribute-value pair representations are often required, such as e.g. sequences or graphs. They require special algorithms for mining, which we will describe in Section 3.3.9 for graph clustering.

2 Data Preparation

Before going into our data mining experiment, we need to prepare the data in such a way that they are suitable for the data mining process. The operations for data preparation can be categorized as follows (see Fig. 9):

• Data cleaning
• Normalization
• Handling noisy, uncertain and untrustworthy information
• Missing value handling
• Transformation
• Data Coding
• Abstraction

Fig. 9. Data Preparation Operations (boxes: Standardization; Noisy, Uncertain, Untrustworthy Data Handling with Smoothing and Outlier Detection; Missing Value Handling; Transformation; Coding; Abstraction)

2.1 Data Cleaning

Most data mining tools require the data in a format such as shown in Table 1. It is a simple table sheet where the first line describes the attribute names and the class attribute and where the following lines contain the data entries describing the case number and the attribute values for each attribute of a case. It is important to note that the inputted data should follow the predefined names and types for the attributes. No subjective description by the person who collected the data should be inserted into the database, nor should any vocabulary be used other than that predefined in advance. Otherwise, we would have to remove this information in a data cleaning step. Since data cleaning is a time-consuming process and often double work, it is better to set up the initial database in such a way that it can immediately be used for data mining. Recent work on data warehouses [Mad01] takes this aspect into consideration.

P. Perner: Data Mining on Multimedia Data, LNCS 2558, pp. 13−22, 2002. © Springer-Verlag Berlin Heidelberg 2002

Table 1. Common Data Table

Case   F_1   F_2   ...   F_k
C_1    V11   V12   ...   V1k
C_2    V21   V22   ...   V2k
C_i    Vi1   Vi2   ...   Vik
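A check of the inputted data against the predefined names, types and vocabulary could look like the following sketch. The schema, the sample table and the function name `find_violations` are hypothetical; the layout follows Table 1, with a numerical attribute F_1 and a categorical attribute F_2.

```python
import csv
import io

# Hypothetical predefined schema: a Python type for numerical attributes,
# a set of allowed words for categorical attributes.
SCHEMA = {"F_1": float, "F_2": {"small", "medium", "big"}}

# Hypothetical raw table; the first line holds the attribute names.
RAW = """Case,F_1,F_2
C_1,3.5,small
C_2,abc,medium
C_3,7.1,rather large
"""


def find_violations(text, schema):
    """Return (case, attribute, value) for every entry that breaks the schema."""
    problems = []
    for row in csv.DictReader(io.StringIO(text)):
        for attribute, rule in schema.items():
            value = row[attribute]
            if isinstance(rule, set):
                valid = value in rule          # only predefined vocabulary
            else:
                try:
                    rule(value)                # value must parse as the type
                    valid = True
                except ValueError:
                    valid = False
            if not valid:
                problems.append((row["Case"], attribute, value))
    return problems


violations = find_violations(RAW, SCHEMA)
```

Running such a check at input time is one way to avoid a separate cleaning pass later.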

2.2 Handling Outlier

In almost all real world data, some can be found, which differ so much from the others as to indicate some abnormal source of error not contemplated in the theoretical discussions. The introduction of which into the investigations can only serve to perplex and mislead the inquirer. Univariate outliers can be recognized by using box plots [Car00][ZRC98]. Figure 10 shows the box plots for the feature_1 of the iris data set [Fis]. Each box represents the range of the feature values for one of the three classes. The median of each data sample is indicated by the black center line, and the first and third quartiles are the edges of the red area. The difference between the first and third quartile is known as the interquartile range (IQR). The black lines above and under the red boxes represent the area within 1.5 times the interquartile range. Points lying outside this range are plotted individually. These points represent potential outliers.

The problem gets much harder if multivariate outliers should be recognized. Such outliers can be detected by cluster analysis (see Chapter 3 for cluster analysis). Based on a proper similarity measure, the similarity of one sample to all the other samples is calculated and then visualized in a dendrogram by the single linkage method. Similar samples will form groups showing close relation to each other, while outliers will result in single links showing a clear distance to the other groupings. A deeper insight into the handling of multivariate outliers can be found in [BaT84][And84].
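The box plot criterion for univariate outliers can also be applied numerically. The sketch below assumes the common convention that the 1.5 times IQR range is measured from the first and third quartile; the feature values are made up and `iqr_outliers` is a hypothetical helper name. The quartiles come from `statistics.quantiles` of the Python standard library (Python 3.8 or later), whose default interpolation may differ slightly from that of a given plotting tool.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical values of one feature for one class.
feature_1 = [4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, 9.0]
outliers = iqr_outliers(feature_1)
```

Values flagged this way are only candidates; whether they are errors or rare but valid observations remains a domain decision.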

2.3 Handling Noisy Data

Real measurements will usually be affected (corrupted) by noise. There are many reasons for noisy data. It can be caused by the measurement device, the environment or by the person who collected the data. The data shown in Figure 11 are data from the IVF therapy. It shows the hormone status of a woman from day three until day fourteen of the woman's menstruation cycle. Taking the real measurements for learning the model will result in a prediction system with lower accuracy than that learnt from the smoothed data. By calculating the sliding mean value and using these data for learning we can improve the accuracy of the learnt model.

Fig. 10. Box plot of Iris Data Feature_1

Fig. 11. Data Smoothing (plotted series: Cycle 1, Cycle 2, Cycle 1*, Cycle 2*; x-axis: days 3 to 14; y-axis: 0 to 14000), together with the sliding mean

$$E_2[t] = \frac{1}{2n} \sum_{i=t-n}^{t+n} E_2[i]$$
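The sliding mean used for smoothing can be sketched in a few lines. Two points are assumptions of this sketch rather than statements of the text: the window is shortened at both ends of the series, and the sum is divided by the number of samples actually inside the window (2n + 1 in the interior), which differs slightly from the normalization factor printed with Fig. 11. The hormone values are invented.

```python
def sliding_mean(values, n):
    """Centered sliding mean over the samples t - n .. t + n."""
    smoothed = []
    for t in range(len(values)):
        start = max(0, t - n)                  # shorten the window at the left edge
        stop = min(len(values), t + n + 1)     # and at the right edge
        window = values[start:stop]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical noisy hormone measurements for consecutive days of one cycle.
raw = [1200.0, 1500.0, 1300.0, 2100.0, 1900.0, 2600.0]
smoothed = sliding_mean(raw, n=1)
```

A larger n gives a smoother curve but also flattens genuine changes, so the window size has to be chosen with the time scale of the process in mind.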

2.4 Missing Values Handling

There are many reasons for missing values in real world databases:

1. The value might not be measured (neglect) or simply not inputted in the database.
2. There might be some objective reason that the value could not be measured.
3. This can be because a patient did not want us to measure this value or a person did not want to answer the question in a questionnaire.

Now, we are faced with the problem: How to deal with missing values? The simplest strategy would be to eliminate the data set. This is not acceptable for small databases, even if only one value in a data entry is missing. Therefore, it might be better to insert a global constant for missing values. This would at least guarantee that the data set can be used. However, this strategy does not reflect the real domain distribution of the attribute value. This can only be achieved by considering the statistical properties of the samples. For the one-dimensional case, we can calculate the class conditional distribution of the attribute values of an attribute. The mean value of the class the sample belongs to can be inserted for the missing value. For the n-dimensional case, we can search for similar data tuples and insert the feature value of the most similar data tuple for the missing value. However, here we need to define a proper similarity measure in order to find close data entries. This problem is not trivial (see Chapter Case-Based Reasoning).
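For the one-dimensional case, inserting the class mean can be sketched as follows. The table layout, the attribute names and the function name `impute_class_mean` are hypothetical, and missing values are assumed to be marked with `None`.

```python
def impute_class_mean(rows, attribute, class_attribute):
    """Replace missing values of one attribute by the mean of the sample's class."""
    totals, counts = {}, {}
    for row in rows:
        value = row[attribute]
        if value is not None:
            label = row[class_attribute]
            totals[label] = totals.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1

    completed = []
    for row in rows:
        new_row = dict(row)                    # leave the original entry untouched
        label = new_row[class_attribute]
        # A class without any measured value cannot supply a mean; keep None then.
        if new_row[attribute] is None and label in counts:
            new_row[attribute] = totals[label] / counts[label]
        completed.append(new_row)
    return completed


# Hypothetical data entries with two missing values.
rows = [
    {"case": "C_1", "F_1": 2.0, "class": 1},
    {"case": "C_2", "F_1": None, "class": 1},
    {"case": "C_3", "F_1": 4.0, "class": 1},
    {"case": "C_4", "F_1": 10.0, "class": 2},
    {"case": "C_5", "F_1": None, "class": 2},
]
completed = impute_class_mean(rows, "F_1", "class")
```

The n-dimensional variant would replace the class mean by the value of the most similar complete data tuple, which requires the similarity measure mentioned above.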

2.5 Coding

A data mining tool or a data mining technique might require only numerical values regardless of whether the data type is numerical or symbolical. Then it is necessary to code the categorical data. The simplest form of coding would be to assign to each categorical value a distinct numerical value such that e.g. an attribute "color" gets for the attribute values "green"=1, "blue"=2, and "red"=3. More sophisticated coding techniques can be taken from the coding theory.
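The simplest form of coding can be sketched as a lookup table that is built in order of first appearance. The function name `build_code` is a hypothetical helper; the attribute values are those of the "color" example.

```python
def build_code(values):
    """Assign 1, 2, 3, ... to the distinct categorical values in order of appearance."""
    code = {}
    for value in values:
        if value not in code:
            code[value] = len(code) + 1
    return code


colors = ["green", "blue", "red", "blue", "green"]
code = build_code(colors)
encoded = [code[c] for c in colors]
```

Note that such integer codes suggest an ordering ("green" < "blue" < "red") that a nominal attribute does not have, so a method that interprets the codes as magnitudes may need a different coding scheme.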

2.6 Recognition of Correlated or Redundant Attributes

Selecting the right set of features for classification is one of the most important problems in designing a good classifier. Very often we do not know a-priori what the relevant features are for a particular classification task. One popular approach to address this issue is to collect as many features as we can prior to the learning and data-modeling phase. However, irrelevant or correlated features, if present, may degrade the performance of the classifier. In addition, large feature spaces can sometimes result in overly complex classification models that may not be easy to interpret.

In the emerging area of data mining applications, users of data mining tools are faced with the problem of data sets that are comprised of large numbers of features and instances. Such kinds of data sets are not easy to handle for mining. The mining process can be made easier to perform by focussing on a subset of relevant features while ignoring the other ones. This process is called feature subset selection (see Chapter 3.6). In the feature subset selection problem, a learning algorithm is faced with the problem of selecting some subset of features upon which to focus its attention.
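A first, simple screen for correlated numerical features is the pairwise Pearson correlation coefficient. The sketch below is an illustrative assumption and not the feature subset selection method of Chapter 3.6; the feature table, the threshold of 0.95 and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    denominator = math.sqrt(sxx * syy)
    return sxy / denominator if denominator > 0 else 0.0


def correlated_pairs(table, threshold=0.95):
    """Return the feature pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for first, second in combinations(sorted(table), 2):
        if abs(pearson(table[first], table[second])) >= threshold:
            pairs.append((first, second))
    return pairs


# Hypothetical feature table: f2 is an exact multiple of f1, f3 is unrelated.
table = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
redundant = correlated_pairs(table)
```

For each reported pair one of the two features is a candidate for removal; the coefficient only captures linear dependence, so other forms of redundancy go unnoticed.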

2.7 Abstraction

2.7.1 Attribute Construction

Between the attributes A_1, A_2, ..., A_k, ..., A_i, ..., A_n in the data table there might exist different relations depending on the domain. Instead of taking the basic attributes, it might be wise to explore these relations between the attributes based on domain knowledge before the data mining experiment and to construct new but more meaningful attributes A_new = A_i ° A_k. The result of the mining process will then have a higher explanation capability than a result based on the basic attributes. The relation ° can be any logical or numerical function. The initial data table of our IVF domain contained the attribute size_i for each i of the n follicles. The construction of a new attribute mean_follicle_size brought more meaningful results.

2.7.2 Images

Suppose we have a medical doctor who, for example, makes the lung of a patient visible by taking an X-ray. The resulting image enables him to inspect the lung for irregular tissue. He will decide whether a nodule is malignant or benign based on some morphological features of the nodule that appear in the image. He has built up this knowledge over years of practice. A nodule will be malignant if, e.g., the following rule is satisfied: if the structure inside the nodule is irregular and areas of calcification appear and there are sharp margins then the nodule is malignant.

An automatic image interpretation based on such a rule would only be possible after the image has passed through various processing steps. The image must automatically be segmented into objects and background, the objects must be labelled and described by features, and these features must be grouped into symbolic representations until the final result can be obtained in an interpretation step based on a rule such as the one described above. In contrast, mining an image database that contains only images and no image descriptions for this kind of knowledge would require the necessary information to be extracted automatically from the images. This is a contradiction. We do not know in advance the important features of a collection of images, nor do we know the way they are represented in the image. At present, we know low-level features such as blobs, regions, ribbons, lines, and edges, and we know how these features can be extracted from images, see Figure 12. However, features such as an "irregular structure inside the nodule" are not so-called low-level features. It is not even really clear how this feature is represented in an image. Therefore, there does not yet exist an algorithm that can extract this feature. On the basis of low-level features we can calculate some high-level features, but it is not possible to obtain all such features in this way. Therefore, we should also allow experts' descriptions to be entered into an image database. Besides that, we can describe an image by statistical properties, which might also be necessary information.

On the basis of this discussion we can identify different ways of representing the content of an image that belong to different abstraction levels. The higher the chosen abstraction level, the more useful is the information derived by data mining. We can describe an image

• by statistical properties, which is the lowest abstraction level,
• by low-level features and their statistical properties such as regions, blobs, ribbons, edges and lines, which is the next higher abstraction level,
• by high-level or symbolic features that can be obtained from low-level features, and
• finally by an expert's symbolic description, which is the highest abstraction level.

For the operations on images we refer the interested reader to the special literature on image processing. For image preprocessing and segmentation see [PeB99]. The extraction of low-level features is described in [Zam96]. Texture description is described in [Rao90]. Image statistics are described in [PeB99]. For an example on motion analysis see Imiya et al. [ImF99]. Examples for texture features and statistical features that can be used to describe the image content will be given in Chapter 4 based on two different applications.
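As an illustration of the lowest abstraction level, the following minimal sketch computes a few statistical properties of the gray levels of an image. The function name `gray_level_statistics`, the choice of mean, variance and skewness, and the small example image are assumptions made for this illustration; they are not the features used in the applications of Chapter 4.

```python
import math


def gray_level_statistics(image):
    """Return mean, variance and skewness of the gray levels of an image.

    `image` is a list of rows, each row being a list of gray levels.
    """
    pixels = [float(p) for row in image for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(variance)
    # Skewness is undefined for a constant image; report 0.0 in that case.
    if std == 0:
        skewness = 0.0
    else:
        skewness = sum(((p - mean) / std) ** 3 for p in pixels) / n
    return {"mean": mean, "variance": variance, "skewness": skewness}


# Hypothetical 4x4 image: a bright 2x2 object on a dark background.
image = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
stats = gray_level_statistics(image)
print(stats)
```

Each image then contributes one row of such numerical features to the data table; the positive skewness here reflects the few bright pixels of the object.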

2.7.3 Time Series

Time series analysis is often referred to in the literature as event recognition [SrG99][FaF99]. For that purpose regression is used. However, the analysis of time series can also be concerned with interpretation, such as scintigram analysis or noise analysis of technical objects. In medical processes doctors usually have to observe time series of several diagnostic parameters. Only the combination of the events in the different time series and their relation to each other can predict the occurrence of a disease or a dangerous status for patients. Such an analysis requires a temporal abstraction of the time series [Sha97][Sha99].

[Figure 12: An image passes through segmentation and object labelling and through the feature filters 1 to n. The resulting blobs, regions, ribbons and edges/lines are each described by geometrical and statistical properties, texture and color (low-level features), from which high-level features (symbolic terms, spatial relations) are calculated. Together with the pixel-statistical description of the image (automatic acquisition) and the expert's description (manual acquisition), these numerical features and symbolic descriptions form the image description that is stored for each of the images Image_1 to Image_N in the image mining database.]

Fig. 12. Different Types of Information that can be extracted from Images

Time series can be described by parameters from the frequency or time domain, see Figure 13. We can use Fourier coefficients and the cepstrum for the description of the time series. In the time domain we can describe a time series by curve segments of an n-th order interpolation function such as, e.g., lines and parabolas. These curve segments can be labelled by symbolic terms such as slope, peak, or valley; then we can symbolically interpret the line segments.
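The symbolic description in the time domain can be sketched as follows, assuming first-order interpolation (straight lines between neighbouring samples). The symbol names `rising`, `falling` and `flat`, the tolerance `eps` and the function names are choices made for this sketch only.

```python
def slope_symbol(y0, y1, eps):
    """Symbolic label of the line segment between two neighbouring samples."""
    diff = y1 - y0
    if diff > eps:
        return "rising"
    if diff < -eps:
        return "falling"
    return "flat"


def symbolic_description(series, eps=1e-9):
    """Return (segments, events) for a time series.

    segments: (symbol, start_index, end_index) for every run of equal slope
    events:   ("peak" | "valley", index) where a rising run meets a falling
              run or vice versa
    """
    slopes = [slope_symbol(a, b, eps) for a, b in zip(series, series[1:])]
    segments = []
    for i, symbol in enumerate(slopes):
        if segments and segments[-1][0] == symbol:
            segments[-1][2] = i + 1          # extend the current run
        else:
            segments.append([symbol, i, i + 1])
    events = []
    for (s1, _, end), (s2, _, _) in zip(segments, segments[1:]):
        if s1 == "rising" and s2 == "falling":
            events.append(("peak", end))
        elif s1 == "falling" and s2 == "rising":
            events.append(("valley", end))
    return [tuple(seg) for seg in segments], events


series = [1.0, 2.0, 3.0, 2.0, 1.0, 1.0, 2.5]   # hypothetical measurements
segments, events = symbolic_description(series)
print(segments)
print(events)
```

The resulting symbols, rather than the raw values, can then serve as attributes of the time series in the data table.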

2.7.4 Web Data

There are different types of data: user entry data, server logs, web documents and web meta data.

[Figure 13: Description of time series in the time domain and in the frequency domain, with the boxes Fourier Analysis, Cepstrum, Correlation Analysis, Interpolation and symbolical Description.]

Fig. 13. Description of Time Series

The user usually enters the user data himself when requested to register at a website or when he is answering a questionnaire on a website. This information is stored in a database which can be taken later on for data mining.

Web server logs are automatically generated by the server when a user is visiting a URL at a site. The server log registers the IP address of the visitor, the time when he enters the website, the time duration he spends on the requested URL, and the URL he is visiting. From this information the path the user takes on this website can be generated [CMS99]. Web server logs are important information for discovering the behavior of a user at the website. In the example given in Figure 15 a typical server log file is shown. Table 2 shows the code for the URL. Table 3 shows the path the user is taking on this website. The user has been visiting the website 4 times. A user session is considered to be closed when the user does not take a new action within 20 minutes. This is a rule of thumb that might not always be true. Since in our example the time between the first user access starting at 1:54 and the second one at 2:20 is longer than 20 minutes, we consider the first access and the second access as two sessions. However, it might be that the user was staying on this website for more than 20 minutes, since he is not entering the website by the main page.

The web documents contain information such as text, images, video or audio. They have a structure that allows us to recognize, e.g., the title of the page, the author, keywords and the main body. The formatting instructions must be removed in order to access the information that we want to mine on these sites. An example of an HTML document is given in Figure 14. The relevant information on this page is marked with grey color. Everything else is HTML code which is enclosed in brackets < >. The title of a page can be identified by searching the page for the code <title> to find the beginning of the title and for the code </title> to find the end of the title. Images can be identified by searching the web page for the file extensions .gif and .jpg.

Web meta data give us the topology of a website. This information is normally stored as a site-specific index table implemented as a directed graph.
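The 20-minute rule of thumb for closing a user session can be sketched as follows. The function name `split_into_sessions` is an assumption of this sketch, and the access times are the times of Table 3 read as hours and minutes on a hypothetical date, as the discussion above does.

```python
from datetime import datetime, timedelta


def split_into_sessions(accesses, timeout=timedelta(minutes=20)):
    """Group (user, time, url_code) accesses into sessions per user.

    A new session starts when the user took no action within `timeout`
    after his previous access.
    """
    sessions = {}
    for user, time, url in sorted(accesses, key=lambda a: (a[0], a[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and time - user_sessions[-1][-1][0] <= timeout:
            user_sessions[-1].append((time, url))
        else:
            user_sessions.append([(time, url)])
    return sessions


def at(hour, minute):
    # Hypothetical date; only the time of day matters for the example.
    return datetime(1999, 9, 1, hour, minute)


accesses = [
    ("USER_1", at(1, 54), "A"),
    ("USER_1", at(2, 20), "B"),
    ("USER_1", at(2, 22), "C"),
    ("USER_1", at(3, 11), "B"),
    ("USER_1", at(3, 43), "E"),
    ("USER_1", at(3, 44), "F"),
]
sessions = split_into_sessions(accesses)
paths = [[url for _, url in session] for session in sessions["USER_1"]]
print(paths)
```

Changing `timeout` shows how strongly the number of sessions depends on this rule of thumb.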


    <html> <head> <title>welcome to the homepage of Petra Perner</title>
    </head> <body bgcolor="#ccffcc" text="black" link="#666699">
    <td width="20" valign="top"><img src="../images/nix.gif" width="20" height="5"></td>
    <td width="423" valign="top" background="../images/hint.gif"><font face="Arial, Helvetica, Geneva" size="4" color="#666699">
    Welcome to the homepage of Petra Perner</b><br></font></br></br>
    <font face="Arial, Helvetica, Geneva" size="3" color="#666699">Industrial Conference Data Mining 24.7.-25.7.2001</font></br></br></br>
    <font face="Arial, Helvetica, Geneva" size="3" color="black">
    In connection with MLDM2001 there will be held an industrial conference on Data Mining.</br></br>
    Please visit our website http://www.data-mining-forum.de for more information.</br></br>
    List of Accepted Papers for MLDM is now available. Information on MLDM2001 you can find on this site under the link MLDM2001</br></br>
    </font></td></tr></table></div></body></html>

Fig. 14. Excerpt from an HTML Document
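Removing the formatting instructions and identifying the title and the images of a page can be sketched with the HTML parser of the Python standard library. The class name `PageInfoParser` and the shortened example page are assumptions of this sketch; a real page would need more careful handling.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collect the title, the image sources and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())


# Shortened, hypothetical variant of the page excerpt.
page = (
    "<html><head><title>welcome to the homepage of Petra Perner</title></head>"
    '<body><img src="../images/nix.gif">'
    '<font size="3">Industrial Conference Data Mining</font></body></html>'
)
parser = PageInfoParser()
parser.feed(page)
parser.close()
print(parser.title, parser.images, parser.text)
```

Using a parser instead of a plain text search for <title> also copes with attributes and with upper-case tags.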

    hs2-210.handshake.de - - [01/Sep/1999:00:01:54 +0100] "GET /support/ HTTP/1.0" - - "http://www.s1.de/index.html" "Mozilla/4.6 [en] (Win98; I)"
    Isis138.urz.uni-duesseldorf.de - - [01/Sep/1999:00:02:17 +0100] "GET /support/laserjet-support.html HTTP/1.0" - - "http://www.s4.de/support/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; QXW0330d)"
    hs2-210.handshake.de - - [01/Sep/1999:00:02:20 +0100] "GET /support/esc.html HTTP/1.0" - - "http://www.s1.de/support/" "Mozilla/4.6 [en] (Win98; I)"
    pC19F2927.dip.t-dialin.net - - [01/Sep/1999:00:02:21 +0100] "GET /support/ HTTP/1.0" - - "http://www.s1.de/" "MOZILLA/4.5 [de]C-CCK-MCD QXW03207 (WinNT; I)"
    hs2-210.handshake.de - - [01/Sep/1999:00:02:22 +0100] "GET /service/notfound.html HTTP/1.0" - - "http://www.s1.de/support/esc.html" "Mozilla/4.6 [en] (Win98; I)"
    hs2-210.handshake.de - - [01/Sep/1999:00:03:11 +0100] "GET /service/supportpack/index_content.html HTTP/1.0" - - "http://www.s1.de/support/" "Mozilla/4.6 [en] (Win98; I)"
    hs2-210.handshake.de - - [01/Sep/1999:00:03:43 +0100] "GET /service/supportpack/kontakt.html HTTP/1.0" - - "http://www.s1.de/service/supportpack/index_content.html" "Mozilla/4.6 [en] (Win98; I)"
    cache-dm03.proxy.aol.com - - [01/Sep/1999:00:03:57 +0100] "GET /support/ HTTP/1.0" - - "http://www.s1.de/" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 4.0; Windows 98; DigExt)"

Fig. 15. Excerpt from a Server Log file
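A log line of the kind shown in Figure 15 can be split into its fields with a regular expression. The pattern below is a sketch written for the layout of this excerpt (host, two placeholder fields, time stamp, request, referrer, browser); the names `LOG_PATTERN` and `parse_log_line` are assumptions, and other servers may write other log formats.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'.*?"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Example time stamp: 01/Sep/1999:00:01:54 +0100
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


line = (
    'hs2-210.handshake.de - - [01/Sep/1999:00:01:54 +0100] '
    '"GET /support/ HTTP/1.0" - - "http://www.s1.de/index.html" '
    '"Mozilla/4.6 [en] (Win98; I)"'
)
entry = parse_log_line(line)
print(entry["host"], entry["time"], entry["url"], entry["referrer"])
```

The referrer field is what allows the path of a user to be reconstructed: it names the page from which the requested URL was reached.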

Table 2. URL Address and Code for the Address

    URL Address                                         Code
    www.s1.de/index.html                                A
    www.s1.de/support/                                  B
    www.s1.de/support/esc.html                          C
    www.s1.de/support/service-notfound.html             D
    www.s1.de/service/supportpack/index_content.html    E
    www.s1.de/service/supportpack/kontakt.html          F

Table 3. User, Time and Path the User has taken on the Web-Site

    User Name    Time           Path
    USER_1       1:54           A
    USER_1       2:20 - 2:22    B → C
    USER_1       3:11           B
    USER_1       3:43 - 3:44    E → F

2.8 Conclusions

Useful results can only be obtained by data mining when the data are carefully prepared. Unnecessary data, noisy data or even correlated data highly affect the result of the data mining experiment. Their influence should be avoided by applying proper data preparation techniques. The raw data of a multimedia source such as images, video, or log file data cannot be used as they are. Usually these data need to be transformed to a proper abstraction level. For example, from an object in an image, features should be calculated that describe the properties of the object. Each image will then have an entry in the data table containing the features of the objects extracted from the image. How the image should be represented is often domain-dependent and requires a careful analysis of the domain. We will show by examples in Chapter 4 how this can be done for different abstraction levels.

3 Methods for Data Mining

3.1 Decision Tree Induction

3.1.1 Basic Principle

With decision tree induction we can automatically derive from a set of single observations a set of rules that generalizes these data (see Figure 16). The set of rules is represented as a decision tree. Decision trees recursively partition the solution space, based on the attribute splits, into subspaces until the final solutions are reached. The resulting hierarchical representation is very natural to the human problem solving process. During the construction of the decision tree, only those attributes that are most relevant for the classification problem are selected from the whole set of attributes. Once the decision tree has been learnt and the developer is satisfied with the quality of the learnt model, this model can be used to predict the outcome for new samples. This learning method is also called supervised learning, since the samples in the data collection have to be labelled by the class. Most decision tree induction algorithms allow the use of numerical attributes as well as categorical attributes. Therefore, the resulting classifier can make the decision based on both types of attributes.

[Figure 16: A data table in attribute-value pair representation (columns Class, SepalLeng, SepalWi, PetalLen, PetalWi, with samples of the classes Setosa and Versicolor) is passed to decision tree induction (data mining); the result is the decision tree that is also shown in Figure 18.]

Fig. 16. Basic Principle of Decision Tree Induction

P. Perner: Data Mining on Multimedia Data, LNCS 2558, pp. 23−89, 2002. © Springer-Verlag Berlin Heidelberg 2002

3.1.2 Terminology of Decision Tree

A decision tree is a directed acyclic graph consisting of edges and nodes, see Figure 17. The node with no entering edges is called the root node. The root node contains all class labels. Every node except the root node has exactly one entering edge. A node having no successor is called a leaf or terminal node. All other nodes are called internal nodes. The nodes of the tree contain the decision rules such as

IF attribute A ≤ value THEN D.

The decision rule is a function f that maps the attribute A to D. The sample set is split in each node into two subsets based on a constant value for the attribute. This constant is called cut-point. In case of a binary tree, the decision is either true or false. Geometrically, the test describes a partition orthogonal to one of the coordinates of the decision space. A terminal node should contain only samples of one class. If there is more than one class in the sample set we say there is class overlap. An internal node always contains more than one class in the assigned sample set.

A path in the tree is a sequence of edges of the form (v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n). We say the path is from v_1 to v_n and is of length n − 1. There is a unique path from the root to each node. The depth of a node v in a tree is the length of the path from the root to v. The height of a node v in a tree is the length of a longest path from v to a leaf. The height of a tree is the height of its root. The level of a node v in a tree is the height of the tree minus the depth of v.
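The notions of depth, height and level can be made concrete with a small sketch. The class `Node`, the function names and the example tree are assumptions introduced for this illustration; path lengths are counted in edges.

```python
class Node:
    """A tree node with a name and an ordered list of successors."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def height(node):
    """Length of a longest path from the node to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def depths(root):
    """Depth of every node: length of the unique path from the root."""
    result = {}
    pending = [(root, 0)]
    while pending:
        node, depth = pending.pop()
        result[node.name] = depth
        pending.extend((child, depth + 1) for child in node.children)
    return result


def levels(root):
    """Level of every node: height of the tree minus the depth of the node."""
    tree_height = height(root)
    return {name: tree_height - depth for name, depth in depths(root).items()}


# Hypothetical binary tree with two internal nodes below the root.
tree = Node("root", [
    Node("leaf_1"),
    Node("inner_1", [
        Node("inner_2", [Node("leaf_2"), Node("leaf_3")]),
        Node("leaf_4"),
    ]),
])
print(height(tree), depths(tree), levels(tree))
```

In this example the tree has height 3, the deepest leaves have depth 3 and level 0, and the root has level 3.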

[Figure 17: A decision tree with the root node, a node t_i annotated with C(t), F(t), D(t), edges labelled j and k, and a terminal node (class label).]

Fig. 17. Representation of a Decision Tree

A binary tree is an ordered tree such that each successor of a node is distinguished either as a left son or a right son; no node has more than one left son nor more than one right son. Otherwise it is a multivariate tree. Let us now consider the decision tree learnt from Fisher's Iris data set [Fisher]. This data set has three classes (1-Setosa, 2-Versicolor, 3-Virginica) with 50 observations for each class and four predictor variables (petal length, petal width, sepal length and sepal width). The learnt tree is shown in Figure 18. It is a binary tree. The average depth of the tree is (1 + 3 + 2)/3 = 2. The root node contains the attribute petal_length. Along a path the rules are combined by the AND operator. Following two paths from the root node we obtain, e.g., two rules such as:

Rule 1: IF petal_length <= 2.45 THEN Setosa
Rule 2: IF petal_length > 2.45 AND petal_length <= 4.9 AND petal_width > 1.65 THEN Virginica.

In the latter rule we can see that the attribute petal_length is used two times during the problem solving process. Each time a different cut-point on this attribute is used. This representation results from the binary tree building process, since only axis-parallel decision surfaces (see Figure 21) based on single cut-points are created. However, it only means that the values for the attribute should fall into the interval [2.45, 4.9] for the desired decision rule.

    --- 150 DS PETALLEN
        <= 2.45   50 DS [Setosa]
        >  2.45  100 DS PETALLEN
            <= 4.9   54 DS PETALWI
                <= 1.65   47 DS [Versicol]
                >  1.65    7 DS [Virginic]
            >  4.9   46 DS [Virginic]

Fig. 18. Decision Tree learnt from Iris Data Set
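The tree of Figure 18 can be written down as a nested data structure from which both the classification of a new sample and the rules along each path follow. The tuple layout (attribute, cut-point, left subtree, right subtree), the attribute names `petal_length` and `petal_width`, and the spelled-out class names are assumptions of this sketch.

```python
# Inner node: (attribute, cut_point, subtree for <=, subtree for >); leaf: class name.
TREE = (
    "petal_length", 2.45,
    "Setosa",
    (
        "petal_length", 4.9,
        ("petal_width", 1.65, "Versicolor", "Virginica"),
        "Virginica",
    ),
)


def classify(tree, sample):
    """Follow the path selected by the sample until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, cut, left, right = tree
        tree = left if sample[attribute] <= cut else right
    return tree


def rules(tree, conditions=()):
    """One rule per leaf; the tests along a path are combined by AND."""
    if isinstance(tree, str):
        return ["IF " + " AND ".join(conditions) + " THEN " + tree]
    attribute, cut, left, right = tree
    return (rules(left, conditions + (f"{attribute} <= {cut}",))
            + rules(right, conditions + (f"{attribute} > {cut}",)))


for rule in rules(TREE):
    print(rule)
print(classify(TREE, {"petal_length": 4.7, "petal_width": 1.4}))
```

The tree has four leaves and therefore yields four rules, two of which lead to the class Virginica.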

3.1.3 Subtasks and Design Criteria for Decision Tree Induction

The overall procedure of the decision tree building process is summarized in Figure 19. Decision trees recursively split the decision space into subspaces based on the decision rules in the nodes until the final stopping criterion is reached or the remaining sample set does not suggest further splitting. For this recursive splitting the tree building process must always pick, among all attributes, the attribute which shows the best result on the attribute selection criterion for the remaining sample set. Whereas for categorical attributes the partition of the attribute values is given a priori, the partition (also called attribute discretization) of the attribute values for numerical attributes must be determined.

[Figure 19: Flow chart of the procedure: do while the tree termination criterion failed; do for all features; if the feature is numerical (yes), apply the splitting procedure; then feature selection procedure, split examples, build tree.]

Fig. 19. Overall Tree Induction Procedure
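The procedure of Figure 19 can be sketched for numerical attributes as follows. This is a minimal illustration under stated assumptions, not one of the algorithms named in this chapter: the splitting procedure tries the midpoints between neighbouring attribute values, the selection criterion is the entropy-based information gain, and the tree stops when a node is pure or no split improves the criterion. The nine samples are the Setosa and Versicolor rows visible in the data table of Figure 16, restricted to the two petal attributes.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(samples, labels, attribute):
    """Splitting procedure: return (gain, cut) of the best binary split."""
    values = sorted(set(s[attribute] for s in samples))
    best = (0.0, None)
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        left = [l for s, l in zip(samples, labels) if s[attribute] <= cut]
        right = [l for s, l in zip(samples, labels) if s[attribute] > cut]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - remainder
        if gain > best[0]:
            best = (gain, cut)
    return best


def build_tree(samples, labels, attributes):
    # Termination criterion: all samples of the node belong to one class.
    if len(set(labels)) == 1:
        return labels[0]
    # Splitting procedure for all features, then feature selection.
    candidates = [(best_cut_point(samples, labels, a), a) for a in attributes]
    (gain, cut), attribute = max(candidates, key=lambda c: c[0][0])
    if cut is None:  # no split improves the criterion: majority class leaf
        return Counter(labels).most_common(1)[0][0]
    # Split the examples and build the subtrees.
    left = [i for i, s in enumerate(samples) if s[attribute] <= cut]
    right = [i for i, s in enumerate(samples) if s[attribute] > cut]
    return (attribute, cut,
            build_tree([samples[i] for i in left], [labels[i] for i in left], attributes),
            build_tree([samples[i] for i in right], [labels[i] for i in right], attributes))


rows = [(1.4, 0.2, "Setosa"), (1.4, 0.2, "Setosa"), (1.3, 0.2, "Setosa"),
        (1.5, 0.2, "Setosa"), (1.4, 0.2, "Setosa"), (4.7, 1.4, "Versicolor"),
        (4.5, 1.5, "Versicolor"), (4.9, 1.5, "Versicolor"), (4.0, 1.3, "Versicolor")]
samples = [{"petal_length": pl, "petal_width": pw} for pl, pw, _ in rows]
labels = [c for _, _, c in rows]
tree = build_tree(samples, labels, ["petal_length", "petal_width"])
print(tree)
```

With only these nine samples a single cut on petal_length separates the two classes; both attributes reach the same gain here, and the sketch simply keeps the first one.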

The discretization can be done before or during the tree building process [DLS95]. We will consider the case where the attribute discretization is done during the tree building process. The discretization must be carried out before the attribute selection process, since the selected partition on the attribute values of a numerical attribute highly influences the prediction power of that attribute. After the attribute selection criterion has been calculated for all attributes based on the remaining sample set, the resulting values are evaluated and the attribute with the best value for the attribute selection criterion is selected for further splitting of the sample set. Then the tree is extended by two or more further nodes. To each node is assigned the subset created by splitting on the attribute values, and the tree building process repeats. Attribute splits can be done:

• univariate on numerical or ordinal ordered attributes X such as X <= a,
• multivariate on categorical or discretized numerical attributes such as X ∈ A, or
• as a linear combination split on numerical attributes, ∑_i a_i X_i ≤ c.

The influence of the kind of attribute splits on the resulting decision surface for two attributes is shown in Figure 20. The axis-parallel decision surface results in a rule such as

IF F3 ≥ 4.9 THEN CLASS Virginica

while the linear decision surface results in a rule such as

IF −3.272 + 0.3254 * F3 + F4 ≥ 0 THEN CLASS Virginica.
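The three kinds of attribute splits can be written as small test functions. The function names, the dictionary representation of a sample and the two example samples are assumptions of this sketch; the linear rule for Virginica is the one given above, rewritten into the form ∑ a_i X_i ≤ c by multiplying it by −1.

```python
def univariate_split(sample, attribute, a):
    """X <= a on a numerical or ordinal attribute."""
    return sample[attribute] <= a


def multivariate_split(sample, attribute, subset):
    """X in A on a categorical or discretized attribute."""
    return sample[attribute] in subset


def linear_combination_split(sample, weights, c):
    """sum_i a_i * X_i <= c on numerical attributes."""
    return sum(a * sample[name] for name, a in weights.items()) <= c


def virginica_axis_parallel(sample):
    # IF F3 >= 4.9 THEN CLASS Virginica
    return sample["F3"] >= 4.9


def virginica_linear(sample):
    # IF -3.272 + 0.3254 * F3 + F4 >= 0 is equivalent to
    # -0.3254 * F3 - F4 <= -3.272
    return linear_combination_split(sample, {"F3": -0.3254, "F4": -1.0}, -3.272)


large_flower = {"F3": 6.0, "F4": 2.5, "color": "blue"}     # hypothetical samples
small_flower = {"F3": 4.0, "F4": 1.3, "color": "white"}
print(virginica_axis_parallel(large_flower), virginica_linear(large_flower))
print(virginica_axis_parallel(small_flower), virginica_linear(small_flower))
print(multivariate_split(large_flower, "color", {"blue", "violet"}))
```

The linear test involves two attributes and two coefficients, which shows why the resulting rule is harder to explain than the axis-parallel one.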

The latter decision surface discriminates better between the two classes than the axis-parallel one, see Figure 20. However, by looking at the rules we can see that the explanation capability of the tree will decrease in case of the linear decision surface.

[Figure 20: Plot of the decision space with Attribute F3 on the horizontal axis (0 to 8) and Attribute F4 on the vertical axis (0 to 3), showing the samples of Class Setosa, Class Versicolor and Class Virginica.]

Fig. 20. Axis-Parallel and linear Attribute Splits Graphically Viewed in Decision Space

[Figure 21: The decision space over Attribute F3 and Attribute F4 is split in a first and a second step into the regions of class 1, class 2 and class 3. The corresponding tree splits the 150 DS on F3 (<= 2.45: 50 DS [1]; > 2.45: 100 DS) and then on F4 (<= 1.8: 50 DS [2]; > 1.8: 50 DS [3]).]

Fig. 21. Demonstration of Recursively Splitting of Decision Space based on two Attributes of the IRIS Data Set

The induced decision tree tends to overfit to the data. This is typically caused by noise in the attribute values and class information present in the training set. The tree building process will produce subtrees that fit to this noise. This causes an increased error rate when classifying unseen cases. Pruning the tree, which means replacing subtrees with leaves, will help to avoid this problem.
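Replacing subtrees with leaves can be sketched as follows. The sketch is a simplified variant in the spirit of reduced error pruning, under the assumption that a separate set of labelled validation samples is available: a subtree is replaced by a leaf whenever the leaf makes no more errors on the validation samples reaching it than the subtree does. The tuple representation of the tree and all names are assumptions of this illustration.

```python
from collections import Counter


def classify(tree, sample):
    """Inner node: (attribute, cut, left, right); leaf: class name."""
    while not isinstance(tree, str):
        attribute, cut, left, right = tree
        tree = left if sample[attribute] <= cut else right
    return tree


def errors(tree, samples, labels):
    return sum(classify(tree, s) != l for s, l in zip(samples, labels))


def prune(tree, samples, labels):
    """Bottom-up pruning with the validation samples that reach each node."""
    if isinstance(tree, str) or not samples:
        return tree  # leaf, or no evidence to decide on: keep as it is
    attribute, cut, left, right = tree
    goes_left = [s[attribute] <= cut for s in samples]
    left = prune(left,
                 [s for s, g in zip(samples, goes_left) if g],
                 [l for l, g in zip(labels, goes_left) if g])
    right = prune(right,
                  [s for s, g in zip(samples, goes_left) if not g],
                  [l for l, g in zip(labels, goes_left) if not g])
    subtree = (attribute, cut, left, right)
    leaf = Counter(labels).most_common(1)[0][0]
    if errors(leaf, samples, labels) <= errors(subtree, samples, labels):
        return leaf
    return subtree


# Hypothetical tree whose inner split at x = 8 was fitted to a noisy sample.
tree = ("x", 5, "A", ("x", 8, "B", "A"))
validation = [({"x": 1}, "A"), ({"x": 3}, "A"), ({"x": 6}, "B"),
              ({"x": 7}, "B"), ({"x": 9}, "B")]
samples = [s for s, _ in validation]
labels = [l for _, l in validation]
pruned = prune(tree, samples, labels)
print(pruned)
```

The split at the root is kept because it is needed to separate the two classes, while the subtree fitted to noise collapses into a single leaf.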

Now, we can summarize the main subtasks of decision tree induction as follows:

• attribute selection (Information Gain [Qui86], χ²-Statistic [Ker92], Gini-Index [BFO84], Gain Ratio [Qui88], Distance measure-based selection criteria [Man91]),
• attribute discretization (Cut-Point [Qui86], Chi-Merge [Kerb92], MLDP [FaI93], LVQ-based discretization, Histogram-based discretization, and Hybrid Methods [PeT98]), and
• pruning (Cost-Complexity [BFO84], Reduced Error Reduction Pruning [Qui86], Confidence Interval Method [Qui87], Minimal Error Pruning [NiB81]).

Beyond that, decision tree induction algorithms can be distinguished by the way they access the data and into non-incremental and incremental algorithms. Some algorithms access the whole data set in the main memory of the computer. This is not feasible if the data set is very large. Large data sets of millions of data do not fit into the main memory of the computer. They must be accessed from disk or another storage device so that all these data can be mined. Accessing the data from external storage devices will cause long execution times. However, the user likes to get results fast, and even for exploration purposes he likes to carry out various experiments quickly and compare them to each other. Therefore, special algorithms have been developed that can work efficiently although using external storage devices. Incremental algorithms can update the tree according to the new data, while non-incremental algorithms go through the whole tree building process again based on the combined old data set and the new data. Some standard algorithms are: CART, ID3, C4.5, C5.0, Fuzzy C4.5, OC1, QUEST, CAL5.

3.1.4 Attribute Selection Criteria

Formally, we can describe the attribute selection problem as follows: Let Y be the full set of features, with cardinality k, and let n_i be the number of samples in the remaining sample set i. Let the feature selection criterion function for the attribute be represented by S(A, n_i). Without any loss of generality, let us consider a higher value of S to indicate a good attribute A. Formally, the problem of attribute selection is to find an attribute A based on our sample subset n_i that maximizes our criterion S so that

\[ S(A, n_i) = \max_{Z \subseteq Y,\; |Z| = 1} S(Z, n_i) \qquad (1) \]

Numerous attribute selection criteria are known. We will start with the most used criterion, called the information gain criterion.


3.1.4.1 Information Gain Criteria and Gain Ratio

Following the theory of the Shannon channel [Phi87], we consider the data set as the source and measure the impurity of the received data when transmitted via the channel. The transmission over the channel results in a partition of the data set into subsets based on splits on the attribute values $J$ of the attribute $A$. The aim should be to transmit the signal with the least loss of information. This can be described by the following criterion:

IF $I(A) = I(C) - I(C/J) = \text{Max}$ THEN Select Attribute $A$

where $I(A)$ is the entropy of the source, $I(C)$ is the entropy of the receiver or the expected entropy to generate the message $C_1, C_2, ..., C_m$, and $I(C/J)$ is the entropy lost when branching on the attribute values $J$ of attribute $A$.

For the calculation of this criterion we first consider the contingency table in Table 4, with $m$ the number of classes, $n$ the number of attribute values $J$, $N$ the number of examples, $L_i$ the number of examples with the attribute value $J_i$, $R_j$ the number of examples belonging to class $C_j$, and $x_{ij}$ the number of examples belonging to class $C_j$ and having attribute value $J_i$.

Now we can define the entropy of the class $C$ by:

$$ I(C) = -\sum_{j=1}^{m} \frac{R_j}{N} \operatorname{ld} \frac{R_j}{N} \tag{2} $$

where $\operatorname{ld}$ denotes the logarithm to base 2. The entropy of the class given the feature values is:

$$ I(C/J) = \sum_{i=1}^{n} \frac{L_i}{N} \sum_{j=1}^{m} -\frac{x_{ij}}{L_i} \operatorname{ld} \frac{x_{ij}}{L_i} = \frac{1}{N} \left( \sum_{i=1}^{n} L_i \operatorname{ld} L_i - \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} \operatorname{ld} x_{ij} \right) \tag{3} $$

The best feature is the one that achieves the lowest value of (3) or, equivalently, the highest value of the "mutual information" $I(C) - I(C/J)$. The main drawback of this measure is its sensitivity to the number of attribute values. In the extreme case, a feature that takes $N$ distinct values for the $N$ examples achieves complete discrimination between the different classes, giving $I(C/J) = 0$, even though the feature may consist of random noise and be useless for predicting the classes of future examples. Therefore, Quinlan [Qui88] introduced a normalization by the entropy of the attribute itself:


$$ G(A) = \frac{I(A)}{I(J)} \quad \text{with} \quad I(J) = -\sum_{i=1}^{n} \frac{L_i}{N} \operatorname{ld} \frac{L_i}{N} \tag{4} $$

Other normalizations have been proposed by Coppersmith et al. [CHH99] and Lopez de Mantaras [LoM91]. Comparative studies have been done by White and Liu [WhL94].

Table 4. Contingency Table for an Attribute

Class    Attribute values                              SUM
         J_1     J_2     ...   J_i     ...   J_n
C_1      x_11    x_21    ...   x_i1    ...   x_n1      R_1
C_2      x_12    x_22    ...   x_i2    ...   x_n2      R_2
...      ...     ...     ...   ...     ...   ...       ...
C_j      x_1j    x_2j    ...   x_ij    ...   x_nj      R_j
...      ...     ...     ...   ...     ...   ...       ...
C_m      x_1m    x_2m    ...   x_im    ...   x_nm      R_m
SUM      L_1     L_2     ...   L_i     ...   L_n       N
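The quantities in Equations (2) to (4) can be computed directly from a contingency table such as Table 4. The following is a minimal sketch, not the author's implementation: the function names `entropy` and `gain_and_gain_ratio` and the two small example tables are hypothetical and chosen only for illustration. The table is stored row-wise, so `table[j][i]` plays the role of $x_{ij}$.

```python
from math import log2

def entropy(counts):
    """Base-2 entropy of a sequence of non-negative counts (zero counts are skipped)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_and_gain_ratio(table):
    """table[j][i] = x_ij: number of examples of class C_j with attribute value J_i."""
    columns = list(zip(*table))                  # one column per attribute value J_i
    row_sums = [sum(row) for row in table]       # R_j
    col_sums = [sum(col) for col in columns]     # L_i
    n_total = sum(row_sums)                      # N
    i_c = entropy(row_sums)                      # Eq. (2)
    i_c_given_j = sum(                           # Eq. (3)
        l_i / n_total * entropy(col)
        for l_i, col in zip(col_sums, columns) if l_i > 0
    )
    gain = i_c - i_c_given_j                     # I(A) = I(C) - I(C/J)
    i_j = entropy(col_sums)                      # I(J) in Eq. (4)
    ratio = gain / i_j if i_j > 0 else 0.0       # G(A) = I(A) / I(J)
    return gain, ratio

# Hypothetical tables with two classes (rows) and two attribute values (columns).
perfect = [[5, 0], [0, 5]]   # the attribute separates the classes completely
useless = [[3, 3], [2, 2]]   # the attribute carries no class information
print(gain_and_gain_ratio(perfect))
print(gain_and_gain_ratio(useless))
```

For the first table the attribute separates the two classes completely, so the gain equals the class entropy; for the second table the class proportions are identical in both columns, so the gain vanishes up to rounding.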

3.1.4.2 Gini Function

This measure takes into account the impurity of the class distribution. The Gini function is defined as:

$$ G = 1 - \sum_{i=1}^{m} p_i^2 \tag{5} $$

The selection criterion is defined as:

IF $Gini(A) = G(C) - G(C/J) = \text{Max!}$ THEN Select Attribute $A$


The Gini function for the class is:

$$ G(C) = 1 - \sum_{j=1}^{m} \left( \frac{R_j}{N} \right)^2 \tag{6} $$

The Gini function of the class given the feature values is defined as:

$$ G(C/J) = \sum_{i=1}^{n} \frac{L_i}{N} G(J_i) \tag{7} $$

with

$$ G(J_i) = 1 - \sum_{j=1}^{m} \left( \frac{x_{ij}}{L_i} \right)^2 \tag{8} $$
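Equations (6) to (8) can be evaluated on the same kind of contingency table as the entropy criterion. The sketch below is only an illustration under that assumption; the names `gini` and `gini_gain` and the example tables are hypothetical.

```python
def gini(counts):
    """Gini function 1 - sum(p^2) for a sequence of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(table):
    """table[j][i] = x_ij: number of examples of class C_j with attribute value J_i."""
    columns = list(zip(*table))                 # one column per attribute value J_i
    row_sums = [sum(row) for row in table]      # R_j
    n_total = sum(row_sums)                     # N
    g_c = gini(row_sums)                        # Eq. (6)
    g_c_given_j = sum(                          # Eq. (7), with G(J_i) from Eq. (8)
        sum(col) / n_total * gini(col)
        for col in columns if sum(col) > 0
    )
    return g_c - g_c_given_j                    # Gini(A)

# Hypothetical tables with two classes (rows) and two attribute values (columns).
print(gini_gain([[5, 0], [0, 5]]))   # attribute separates the classes
print(gini_gain([[3, 3], [2, 2]]))   # attribute is uninformative
```

As with the entropy criterion, the attribute with the largest value of `gini_gain` would be selected.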



3.1.5 Discretization of Attribute Values

A numerical attribute may take any value on a continuous scale between its minimal value $x_1$ and its maximal value $x_2$. Branching on all these distinct attribute values does not lead to any generalization and would make the tree very sensitive to noise. Rather, we should find meaningful partitions of the numerical values into intervals. The intervals should abstract the data in such a way that they cover the range of attribute values belonging to one class and separate them from those belonging to other classes. Then we can treat the attribute as a discrete variable with $k+1$ intervals. This process is called discretization of attributes.

The points that split our attribute values into intervals are called cut-points. The $k$ cut-points always lie on the border between the distributions of two classes.

Discretization can be done before the decision tree building process or during decision tree learning [DLS95]. Here we want to consider discretization during the tree building process. We call these methods dynamic and local discretization methods. They are dynamic since they work during the tree building process on the created subsample sets, and they are local since they work on the recursively created subspaces. If we use the class label of each example, we consider the method a supervised discretization method. If we do not use the class label of the samples, we call it an unsupervised discretization method. We can partition the attribute values into two ($k=1$) or more intervals ($k>1$). Therefore, we distinguish between binary and multi-interval discretization methods, see Figure 23.

In Figure 22, we see the conditional histogram of the values of the attribute petal_length of the IRIS data set. In the binary case ($k=1$), the attribute values would be split at the cut-point 2.35 into an interval from 0 to 2.35 and a second interval from 2.36 to 7. If we do multi-interval discretization, we will find another cut-point at 4.8. That groups the values into 3 intervals ($k=2$): interval_1 from 0 to 2.35, interval_2 from 2.36 to 4.8, and interval_3 from 4.9 to 7.

We will also consider attribute discretization on categorical attributes. Many attribute values of a categorical attribute will lead to a partition of the sample set into many small subsample sets. This again will result in a quick stop of the tree building process. To avoid this problem, it might be wise to combine attribute values into a more abstract attribute value. We will call this process attribute aggregation. It is also possible to allow the user to combine attribute values interactively during the tree building process. We call this process manual abstraction of attribute values, see Figure 23.

[Figure: class-conditional histogram of petal length (x-axis from 1.0 to 6.7, y-axis "Mean" from 0 to 16) for Class 1 Setosa, Class 2 Versicolor, and Class 3 Virginica, with the peaks and the cut-points 2.35 and 4.8 marked.]

Fig. 22. Histogram of Attribute Petal Length and Cut-Points

Attribute Discretization
  Numerical Attributes
    Binary (Cut-Point Determination)
    Multi-Interval Discretization
  Categorical Attributes
    Manual Abstraction based on Domain Knowledge
    Automatic Aggregation

Fig. 23. Overview of Attribute Discretization

3.1.5.1 Binary Discretization

3.1.5.1.1 Binary Discretization Based on Entropy

Decision tree induction algorithms like ID3 and C4.5 use an entropy criterion for the separation of attribute values into two intervals. Each possible cut-point $T$ on the attribute range between $x_{min}$ and $x_{max}$ is tested, and the one that fulfills the following condition is chosen as cut-point $T_A$:

IF $I(A, T_A; S) = \text{Min!}$ THEN Select $T_A$ for $T$

with $S$ the subsample set, $A$ the attribute, and $T$ the cut-point that separates the samples into the subsets $S_1$ and $S_2$. $I(A, T; S)$ is the entropy for the separation of the sample set into the subsets $S_1$ and $S_2$:

$$ I(A, T; S) = \frac{|S_1|}{|S|} I(S_1) + \frac{|S_2|}{|S|} I(S_2) \tag{9} $$

$$ I(S) = -\sum_{j=1}^{m} p(C_j, S) \operatorname{ld} p(C_j, S) \tag{10} $$

where $p(C_j, S)$ is the proportion of samples in $S$ that belong to class $C_j$.
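The exhaustive search over cut-points described by Equations (9) and (10) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of ID3 or C4.5: candidate cut-points are placed at the midpoint between two neighbouring distinct attribute values, and the names `class_entropy` and `best_entropy_cut` as well as the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """I(S) of Eq. (10): base-2 entropy of the class labels in a sample set."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Return (cut_point, weighted_entropy) minimising Eq. (9) over all candidate cuts."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                              # no cut between identical values
        left, right = ys[:i], ys[i:]              # subsets S_1 and S_2
        score = (len(left) / n * class_entropy(left)
                 + len(right) / n * class_entropy(right))
        if score < best_score:
            # Midpoint between neighbouring values: an assumed convention.
            best_cut, best_score = (xs[i - 1] + xs[i]) / 2, score
    return best_cut, best_score

# Hypothetical one-attribute sample with two well separated classes.
values = [1.4, 1.5, 1.3, 4.5, 4.7, 5.1]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_entropy_cut(values, labels))
```

Each candidate requires a pass over the labels, which illustrates why testing every possible cut-point becomes costly for attributes with many distinct values.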

The calculation of the cut-point is usually a time-consuming process since each possible cut-point is tested against the selection criterion. Therefore, algorithms have been proposed that speed up the calculation of the right cut-point [Sei93].

3.1.5.1.2 Discretization Based on Inter- and Intra-Class Variance

To find the threshold we can also do unsupervised discretization. For that, we consider the problem as a clustering problem in a one-dimensional space. The ratio between the inter-class variance $s_B^2$ of the two subsets $S_1$ and $S_2$ and the intra-class variance $s_W^2$ in $S_1$ and $S_2$ is used as the criterion for finding the threshold:

$$ s_B^2 = P_0 (m_0 - m)^2 + P_1 (m_1 - m)^2 \quad \text{and} \quad s_W^2 = P_0 s_0^2 + P_1 s_1^2 \tag{11} $$

The variances of the two groups are defined as:

$$ s_0^2 = \sum_{i=x_1}^{T} (x_i - m_0)^2 \frac{h(x_i)}{N} \quad \text{and} \quad s_1^2 = \sum_{i=T}^{x_2} (x_i - m_1)^2 \frac{h(x_i)}{N} \tag{12} $$

with $N$ the number of all samples and $h(x_i)$ the frequency of attribute value $x_i$. $T$ is the threshold that will be tentatively moved over all attribute values. The values $m_0$ and $m_1$ are the mean values of the two groups, which give us:

$$ m = m_0 P_0 + m_1 P_1 \tag{13} $$

where $P_0$ and $P_1$ are the probabilities for the values of the subsets 1 and 2:

$$ P_0 = \sum_{i=x_1}^{T} \frac{h(x_i)}{N} \quad \text{and} \quad P_1 = \sum_{i=T}^{x_2} \frac{h(x_i)}{N} \tag{14} $$

The selection criterion is:

IF $\frac{s_B^2}{s_W^2} = \text{MAX!}$ THEN Select $T_A$ for $T$
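The threshold search defined by Equations (11) to (14) can be sketched as a single loop over the distinct attribute values. The sketch makes two assumptions that are not stated in the text: the two groups are taken to be disjoint (values up to and including $T$ versus values above $T$), and a zero intra-class variance is treated as an infinitely good ratio. The function name `variance_ratio_threshold` and the sample values are hypothetical.

```python
from collections import Counter

def variance_ratio_threshold(values):
    """Return (T, ratio) maximising s_B^2 / s_W^2 over all thresholds T."""
    hist = Counter(values)                       # h(x_i)
    xs = sorted(hist)
    n = len(values)                              # N
    m = sum(x * hist[x] for x in xs) / n         # overall mean, Eq. (13)
    best_t, best_ratio = None, float("-inf")
    for t in xs[:-1]:                            # keep both groups non-empty
        g0 = [x for x in xs if x <= t]           # assumption: group 0 includes T
        g1 = [x for x in xs if x > t]            # assumption: group 1 excludes T
        n0 = sum(hist[x] for x in g0)
        n1 = n - n0
        p0, p1 = n0 / n, n1 / n                  # Eq. (14)
        m0 = sum(x * hist[x] for x in g0) / n0
        m1 = sum(x * hist[x] for x in g1) / n1
        s0 = sum((x - m0) ** 2 * hist[x] for x in g0) / n   # Eq. (12)
        s1 = sum((x - m1) ** 2 * hist[x] for x in g1) / n
        s_b = p0 * (m0 - m) ** 2 + p1 * (m1 - m) ** 2       # Eq. (11)
        s_w = p0 * s0 + p1 * s1
        ratio = s_b / s_w if s_w > 0 else float("inf")      # assumed convention
        if ratio > best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, best_ratio

# Hypothetical attribute values forming two clusters.
print(variance_ratio_threshold([1, 1, 2, 2, 8, 9, 9]))
```

Because no class labels are used, the result depends only on how the attribute values cluster, which is what makes this variant unsupervised.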

3.1.5.2 Multi-interval Discretization

Binary interval discretization will result in binary decision trees. This might not always be the best way to model the problem. The resulting decision tree can be very bushy and the explanation capability might not be good. The error rate might increase since the approximation of the decision space based on binary decisions might not be advantageous and, therefore, leads to a higher approximation error. Depending on the data, it might be better to create decision trees having more than two intervals for numerical attributes.

For multi-interval discretization we have to solve two problems: 1. find multiple intervals and 2. decide on the sufficient number of intervals.

The determination of the number of intervals can be done statically or dynamically. In the latter case the number of intervals is calculated automatically during the learning process, whereas in the static case the number of intervals is given a priori by the user before the learning process. The discretization process then calculates intervals until it reaches the predefined number, regardless of whether the class distribution in the intervals is sufficient or not. This results in trees that always have the same number of attribute partitions in each node. All algorithms described above can be used for this discretization process. The difference from binary interval discretization is that the process does not stop after the first cut-point has been determined; it repeats until the given number of intervals is reached [PeT98].

During a dynamic discretization process the sufficient number of intervals is calculated automatically. The resulting decision tree will have different attribute partitions in each node depending on the class distribution of the attribute. For this process, we need a criterion that allows us to determine the optimal number of intervals.


3.1.5.2.1 Basic (Search Strategies) Algorithm

Generally, we would have to test every possible combination of the $k$ cut-points in order to find the best cut-points. This would be computationally expensive. Since we assume that cut-points always lie on the border of two distributions of $x$ given class $c$, we have a heuristic for our search strategy.

Discretization can be done bottom-up or top-down. In the bottom-up case, we start with a finite number of intervals. In the worst case, these intervals are equivalent to the original attribute values. They can also be selected by the user or estimated based on the maximum of the second-order probability distribution, which gives us a hint where the class borders are located. Starting from that, the algorithm merges intervals that meet the merging criterion until a stopping criterion is reached.

In the top-down case, the algorithm first selects two intervals and recursively refines these intervals until the stopping criterion is reached.

3.1.5.2.2 Determination of the Number of Intervals

In the simplest case the user specifies how many intervals should be calculated for a numerical attribute. This procedure becomes problematic when there is no evidence for the required number of intervals in the remaining data set. It will result in bushy decision trees or will stop the tree building process prematurely. It would be much better to calculate the number of intervals from the data.

Fayyad and Irani [FaI93] developed a stopping criterion based on the minimum description length principle. Based on this criterion the number of intervals is calculated for the remaining data set during decision tree induction. This discretization procedure is called MLD-based discretization.

Another criterion can use a cluster utility measure to determine the most suitable number of intervals [Per00].

3.1.5.2.3 Cluster Utility Criteria

Based on the inter-class variance and the intra-class variance we can create a cluster utility measure that allows us to determine the optimal number of intervals. We assume that inter-class variance and intra-class variance are the inter-interval variance and intra-interval variance.

Let $s_w^2$ be the intra-class variance and $s_B^2$ be the inter-class variance. Then we can define our utility criterion as follows:

$$ U = \frac{\sum_{k=1}^{n} \left( s_{wk}^2 - s_{bk}^2 \right)}{n} \tag{15} $$

The number of intervals $n$ is chosen for minimal $U$.


3.1.5.2.4 MLD Based Criteria

The MLD-based criterion was introduced by Fayyad and Irani [FaI92]. Discretization is done based on the information gain. The gain $I(A, T; S)$ (see Section ) is tested after each new interval against the MLD criterion:

$$ I(A, T; S) > \frac{\log_2 (N - 1)}{N} + \frac{\Delta(A, T; S)}{N} \tag{16} $$

where $N$ is the number of instances in the set $S$ and

$$ \Delta(A, T; S) = \log_2 (3^k - 2) - [k \cdot I(S) - k_1 \cdot I(S_1) - k_2 \cdot I(S_2)] \tag{17} $$

Here $k$, $k_1$, and $k_2$ denote the number of classes present in $S$, $S_1$, and $S_2$.

One of the main problems with this discretization criterion is that it is relatively expensive. It must be evaluated $N-1$ times for each attribute (with $N$ the number of attribute values). Typically, $N$ is very large. Therefore, it would be good to have an algorithm that uses some assumption in order to reduce the computation time.
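The acceptance test of Equations (16) and (17) for one candidate cut can be sketched as below. The sketch assumes, following Fayyad and Irani's formulation, that the tested quantity is the information gain of the cut, i.e. $I(S)$ minus the weighted entropy of Equation (9), and that $k$, $k_1$, $k_2$ count the classes present in $S$, $S_1$, $S_2$. The names `ent` and `mdl_accepts` and the label lists are hypothetical.

```python
from collections import Counter
from math import log2

def ent(labels):
    """Base-2 class entropy I(S) of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mdl_accepts(left, right):
    """True if splitting S = left + right passes the test of Eq. (16)."""
    s = left + right
    n = len(s)                                    # N, must be at least 2
    # Assumption: the left-hand side of Eq. (16) is the information gain of the cut.
    gain = ent(s) - len(left) / n * ent(left) - len(right) / n * ent(right)
    k, k1, k2 = len(set(s)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent(s) - k1 * ent(left) - k2 * ent(right))  # Eq. (17)
    return gain > (log2(n - 1) + delta) / n

# Hypothetical candidate cuts: one separates the classes, one does not.
print(mdl_accepts(["a"] * 10, ["b"] * 10))
print(mdl_accepts(["a", "b"] * 5, ["a", "b"] * 5))
```

A recursive discretizer would call such a test after finding the best cut of a segment and stop refining that segment as soon as the test fails.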

3.1.5.2.5 LVQ-Based Discretization

Vector quantization is also related to the notion of discretization [PeT98]. We used learning vector quantization (LVQ) [Koh95] for our experiment. LVQ is a supervised learning algorithm. This method attempts to define class regions in the input data space. Firstly, a number of codebook vectors $W_i$ labeled by a class are placed into the input space. Usually several codebook vectors are assigned to each class.

The learning algorithm is realized as follows: After an initialization of the neural net, each learning sample is presented one or several times to the net. The input vector $X$ is compared to all codebook vectors $W$ in order to find the closest codebook vector $W_c$. The learning algorithm tries to optimize the similarity between the codebook vectors and the learning samples by shifting the codebook vector in the direction of the input vector if the sample represents the same class as the closest codebook vector. If the codebook vector and the input vector have different classes, the codebook vector is shifted away from the input vector, so that the similarity between these two decreases. All other codebook vectors remain unchanged. The following equations represent this idea:

for equal classes:

$$ W_c(t+1) = W_c(t) + \alpha(t) \cdot [X(t) - W_c(t)] \tag{18} $$

for different classes:

$$ W_c(t+1) = W_c(t) - \alpha(t) \cdot [X(t) - W_c(t)] \tag{19} $$

for all others:

$$ W_j(t+1) = W_j(t) \tag{20} $$

This behavior of the algorithm can be employed for discretization. A potential cut-point might be in the middle of the learned codebook vectors of two different classes. Figure 24 shows this method based on one attribute of the IRIS domain. Since this algorithm tries to optimize the misclassification probability, we expect to get good results. However, the proper initialization of the codebook vectors and the choice of the learning rate $\alpha(t)$ are crucial problems.
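For a single numerical attribute, the update rules (18) to (20) and the derivation of cut-points from the learned codebook vectors can be sketched as follows. This is a minimal one-dimensional illustration: the linearly decaying learning rate, the initial codebook positions, and the names `lvq1_step`, `train_lvq1`, and `cut_points` are assumptions, not part of the described experiment.

```python
def lvq1_step(codebook, x, label, alpha):
    """Apply Eqs. (18)-(20) once. codebook is a list of [position, class_label]."""
    c = min(range(len(codebook)), key=lambda j: abs(codebook[j][0] - x))  # closest W_c
    w, w_label = codebook[c]
    sign = 1.0 if w_label == label else -1.0     # attract (18) or repel (19)
    codebook[c][0] = w + sign * alpha * (x - w)
    return c                                     # all other vectors stay unchanged (20)

def train_lvq1(codebook, samples, alpha0=0.1, epochs=20):
    """Present every (value, label) sample `epochs` times, updating the codebook in place."""
    steps = epochs * len(samples)
    t = 0
    for _ in range(epochs):
        for x, label in samples:
            alpha = alpha0 * (1.0 - t / steps)   # assumed linearly decaying alpha(t)
            lvq1_step(codebook, x, label, alpha)
            t += 1
    return codebook

def cut_points(codebook):
    """Midpoints between neighbouring codebook vectors that carry different classes."""
    ordered = sorted(codebook)
    return [(a[0] + b[0]) / 2 for a, b in zip(ordered, ordered[1:]) if a[1] != b[1]]

# Hypothetical one-attribute sample and initial codebook.
samples = [(1.0, "a"), (1.2, "a"), (4.8, "b"), (5.0, "b")]
codebook = train_lvq1([[2.0, "a"], [4.0, "b"]], samples)
print(codebook, cut_points(codebook))
```

The result depends on where the codebook vectors start and how fast the rate decays, which reflects the remark above that initialization and learning rate are the critical choices.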

[Figure: class distribution of petal length (x-axis from 1.0 to 6.7, y-axis "Mean" from 0 to 16) for Class 1 Setosa, Class 2 Versicolor, and Class 3 Virginica, with the positions of the codebook vectors marked along the petal length axis.]

Fig. 24. Class Distribution of an Attribute and Codebook Vectors

3.1.5.2.6 Histogram Based Discretization

A histogram-based method was first suggested by Wu et al. [WLS75]. They used this method in an interactive way during top-down decision tree building. By observing the histogram, the user selects the threshold that partitions the sample set into groups containing only samples of one class. In Perner et al. [PeT98] an automatic histogram-based method for feature discretization is described. The distribution $p(a \mid a \in C_k) P(C_k)$ of one attribute $a$ according to the classes $C_k$ is calculated. The curve of the distribution is approximated by a first-order polynomial, and the minimum square error method is used for calculating the coefficients:

$$ E = \sum_{i=1}^{n} (a_1 x_i + a_0 - y_i)^2 \tag{21} $$

$$ a_1 = \frac{\sum_{i=1}^{n} x_i \cdot i}{\sum_{i=1}^{n} i^2} \tag{22} $$


The cut-points are selected by finding two maxima of different classes situated next to each other. We used this method in two ways: First, we used the histogram-based discretization method as described before. Second, we used a combined discretization method based on the distribution $p(a \mid a \in S_k) P(S_k)$ and the entropy-based minimization criterion. We followed the corollary derived by Fayyad and Irani [FaI93], which says that the entropy-based discretization criterion for finding a binary partition for a continuous attribute will always partition the data on a boundary point in the sequence of the examples ordered by the value of that attribute. A boundary point partitions the examples into two sets having different classes. Taking this fact into account, we determine potential boundary points by finding the peaks of the distribution. If we found two peaks belonging to different classes, we used the entropy-based minimization criterion to find the exact cut-point between these two classes by evaluating each boundary point $K$ with $P_i \le K \le P_{i+1}$ between these two peaks.

[Figure: histogram of petal length (x-axis from 1.0 to 6.7, y-axis "Mean" from 0 to 16) for Class 1 Setosa, Class 2 Versicolor, and Class 3 Virginica, with the labelled peaks and the cut-points 2.35 and 4.8.]

Fig. 25. Examples sorted by attribute values for attribute A and labelled peaks

This method is not as time-consuming as the other ones. We wanted to see if this method can be an alternative to the methods described before and if we can find a hybrid version that combines the low computation time of the histogram-based method with the entropy minimization heuristic in the context of discretization.

3.1.5.2.7 Chi-Merge Discretization

The ChiMerge algorithm introduced by Kerber [Ker92] consists of an initialization step and a bottom-up merging process, where intervals are continuously merged until a termination condition is met. Kerber used the ChiMerge method statically. In our study we apply ChiMerge dynamically to discretization. The potential cut-points are investigated by testing two adjacent intervals with the $\chi^2$ independence test. The statistical test value is:

$$ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}} \tag{23} $$

where $m = 2$ (the intervals being compared), $k$ is the number of classes, $A_{ij}$ is the number of examples in the $i$-th interval and $j$-th class, $R_i = \sum_{j=1}^{k} A_{ij}$ is the number of examples in the $i$-th interval, $C_j = \sum_{i=1}^{m} A_{ij}$ is the number of examples in the $j$-th class, $N = \sum_{j=1}^{k} C_j$ is the total number of examples, and $E_{ij} = \frac{R_i \cdot C_j}{N}$ is the expected frequency.

Firstly, all boundary points are used as cut-points. In the second step, the $\chi^2$-value is computed for each pair of adjacent intervals. The two adjacent intervals with the lowest $\chi^2$-value are merged. This step is repeated continuously until all $\chi^2$-values exceed a given threshold. The value of the threshold is determined by selecting a desired significance level and then using a table or formula to obtain the corresponding $\chi^2$ value.
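The merging loop described above can be sketched in a few lines. This is an illustration of the bottom-up principle and not Kerber's reference implementation: intervals are represented by their lower bound and a list of class counts, classes that are absent from both intervals are simply skipped in Equation (23), and the names `chi2` and `chi_merge` as well as the example intervals are hypothetical.

```python
def chi2(row_a, row_b):
    """Chi-square value of Eq. (23) for two adjacent intervals given as class-count lists."""
    n = sum(row_a) + sum(row_b)                       # N
    col = [a + b for a, b in zip(row_a, row_b)]       # C_j
    stat = 0.0
    for row in (row_a, row_b):
        r_i = sum(row)                                # R_i
        for a_ij, c_j in zip(row, col):
            e_ij = r_i * c_j / n                      # E_ij
            if e_ij > 0:                              # simplification: skip empty classes
                stat += (a_ij - e_ij) ** 2 / e_ij
    return stat

def chi_merge(intervals, threshold):
    """intervals: list of (lower_bound, class_counts), sorted by lower_bound."""
    intervals = [(lo, list(counts)) for lo, counts in intervals]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:                      # all pairs differ significantly
            break
        lo, counts = intervals[i]
        merged = [a + b for a, b in zip(counts, intervals[i + 1][1])]
        intervals[i:i + 2] = [(lo, merged)]           # merge the most similar pair
    return intervals

# Hypothetical start intervals for two classes; 2.706 is the chi-square value
# for a 0.90 significance level with one degree of freedom.
start = [(1.0, [3, 0]), (2.0, [2, 0]), (4.0, [0, 4]), (5.0, [0, 1])]
print(chi_merge(start, threshold=2.706))
```

The remaining lower bounds of the returned intervals are the cut-points; with two classes the test has one degree of freedom, which is why the threshold in the example is taken from the corresponding row of a chi-square table.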

3.1.5.2.8 The Influence of Discretization Methods on the Resulting Decision Tree

Figures 26-29 show decision trees learnt based on different discretization methods. They show that the kind of discretization method influences the attribute selection. The attribute in the root node is the same for the decision tree based on Chi-Merge discretization (see Figure 26) and LVQ-based discretization (see Figure 28). The calculated intervals are roughly the same. Since the tree generation based on histogram discretization always requires two cut-points, and since there is no evidence for two cut-points in the remaining sample set, the learning process stops after the first level.

The two trees generated based on Chi-Merge discretization and on LVQ-based discretization also have the same attribute in the root. The intervals are selected slightly differently by the two methods. The tree in Figure 28 is the bushiest tree. However, the error rate (see Table 5) of this tree, calculated based on leave-one-out (see Chapter ), is not better than the error rate of the tree shown in Figure 27. Since the decision is based on more attributes (see Figure 28), the experts might like this tree much more than the tree shown in Figure 29.


150 DS, PETALWI
  <=0.8: 50 DS [Setosa]
  >0.8 & <=1.75: 54 DS ??? [Versicol]
  >1.75: 46 DS ??? [Virginic]

Fig. 26. Decision Tree based on Chi-Merge Discretization (k=3)

150 DS, PETALLEN
  <=2.45: 50 DS [Setosa]
  >2.45 & <=4.95: 54 DS ??? [Versicol]
  >4.95: 46 DS ??? [Virginic]

Fig. 27. Decision Tree based on Histogram based Discretization (k=3)

150 DS, PETALWI
  <=0.8375: 50 DS [Setosa]
  >0.8375 & <=1.6375: 52 DS, PETALLEN
    <=5.04864: 48 DS, SEPALWI
      <=2.275: 4 DS, PETALWI
        <=1.4375: 2 DS [Versicol]
        >1.4375: 2 DS ??? [Versicol]
      >2.275: 44 DS [Versicol]
    >5.04864: 4 DS, SEPALLENG
      <=6.1875: 2 DS ??? [Versicol]
      >6.1875: 2 DS [Virginic]
  >1.6375: 48 DS, PETALLEN
    <=5.025: 10 DS, SEPALWI
      <=2.9625: 6 DS [Virginic]
      >2.9625: 4 DS, SEPALLENG
        <=6.0875: 2 DS ??? [Versicol]
        >6.0875: 2 DS ??? [Versicol]
    >5.025: 38 DS [Virginic]

Fig. 28. Decision Tree based on LVQ based Discretization


--150 DS PETALLEN
    <=2.45 50 DS [Setosa]
    >2.45 & <=4.9 54 DS PETALWI
        <=1.65 47 DS [Versicol]
        >1.65 7 DS ??? [Virginic]
    >4.9 46 DS ??? [Virginic]

Fig. 29. Decision Tree based on MLD-Principle Discretization

Table 5. Error Rate for Decision Trees based on different Discretization Methods

Discretization Method    Chi-Merge    Histogram based Discr.    LVQ based Discr.    MLD-based Discr.

Error Rate Unpruned Tree 6

Pruned Tree 6

4 4

5.33 4

3.1.5.3 Discretization of Categorical or Symbolical Attributes

3.1.5.3.1 Manual Abstraction of Attribute Values

In contrast to numerical attributes, symbolical attributes may have a large number of attribute values. Branching on such an attribute partitions the data into small subsample sets, which often leads to an early stop of the tree building process or even to trees with low explanation capability. One way to avoid this problem is to construct meaningful abstractions on the attribute level at hand, based on a careful analysis of the attribute list [PBY96]. This has to be done in the preparation phase. The abstraction can only be done on the semantic level. An advantage is that the resulting interval can be named with a symbol that humans can understand.

3.1.5.3.2 Automatic Aggregation

However, it is also possible to abstract symbolical attribute values automatically during the tree building process, based on the class-attribute interdependence. The discretization process is then done bottom-up, starting from the initial attribute intervals. The process continues until the stopping criterion is reached.

3.1.6 Pruning

If the tree is allowed to grow to its maximum size, it is likely to become overfitted to the training data. Noise in the attribute values and in the class information will amplify this problem. The tree building process will produce subtrees that fit to noise. This unwarranted complexity causes an increased error rate when classifying unseen cases. This problem can be avoided by pruning the tree. Pruning means replacing subtrees by leaves based on some statistical criterion. This idea is illustrated in Figure 30 and Figure 31 on the IRIS data set. The unpruned tree is a large and bushy tree with an estimated error rate of 6.67%. Subtrees are replaced by leaves up to the second level of the tree. The resulting pruned tree is smaller, and the error rate, calculated with cross validation, becomes 4.67%.

Pruning methods can be categorized into pre-pruning and post-pruning methods. In pre-pruning, the tree growing process is stopped according to a stopping criterion before the tree reaches its maximal size. In contrast to that, in post-pruning, the tree is first developed to its maximum size and afterwards pruned back according to a pruning procedure.

--150 DS PETALLEN
    <=2.45 50 DS [Setosa]
    >2.45 100 DS PETALLEN
        <=4.9 54 DS PETALWI
            <=1.65 47 DS [Versicol]
            >1.65 7 DS SEPALLENG
                <=5.95 3 DS ??? [Virginic]
                >5.95 4 DS [Virginic]
        >4.9 46 DS PETALWI
            <=1.75 6 DS PETALLEN
                <=5.1 4 DS PETALWI
                    <=1.55 2 DS [Virginic]
                    >1.55 2 DS [Versicol]
                >5.1 2 DS [Virginic]
            >1.75 40 DS [Virginic]

Fig. 30. Unpruned Decision Tree for the IRIS Data Set


--150 DS PETALLEN
    <=2.45 50 DS [Setosa]
    >2.45 100 DS PETALLEN
        <=4.9 54 DS PETALWI
            <=1.65 47 DS [Versicol]
            >1.65 7 DS [Virginic]
        >4.9 46 DS [Virginic]

Fig. 31. Pruned Tree for the IRIS Data Set based on Minimal Error Pruning
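Read as nested tests, the pruned tree of Fig. 31 amounts to three comparisons. The following Python sketch only illustrates how such a tree classifies a sample. The function name `classify_iris` and the spelled-out class names are our assumptions; the thresholds are taken from the figure.

```python
def classify_iris(petal_length, petal_width):
    """Classify one IRIS sample with the pruned tree of Fig. 31.

    The class names are spelled out here; the figure abbreviates them.
    """
    if petal_length <= 2.45:
        return "Setosa"
    if petal_length <= 4.9:
        # Second-level test on the petal width.
        return "Versicolor" if petal_width <= 1.65 else "Virginica"
    return "Virginica"
```

Each leaf of the pruned tree corresponds to exactly one return statement, which is the kind of compact, readable decision structure that pruning aims at.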

3.1.7 Overview

Post-pruning methods can mainly be categorized into methods that use an independent pruning set and those that use no separate pruning set, see Figure 32. The latter can be further divided into methods that use traditional statistical measures, re-sampling methods like cross validation and bootstrapping, and code length motivated methods. Here we only want to consider cost-complexity pruning and confidence interval pruning, which belong to the methods with a separate pruning set. An overview of all methods can be found in Kuusisto [Kuu98].

Pruning Methods
    Uses Independent Pruning Set
    No Separate Pruning Set
        Statistical Methods
        Resampling Methods
        Codelength Motivated Pruning Methods

Fig. 32. General Overview about Pruning Methods

3.1.8 Cost-Complexity Pruning

The cost-complexity pruning method was introduced by Breiman et al. [BFO84]. The main idea is to keep a balance between the misclassification costs and the complexity of the subtree T, described by the number of leaves. Therefore, Breiman created a cost-complexity criterion as follows:

$$CP(T) = \frac{E(T)}{N(T)} + \alpha \cdot \mathit{Leaves}(T)$$

with $E(T)$ the number of misclassified samples of the subtree $T$, $N(T)$ the number of samples belonging to the subtree $T$, $\mathit{Leaves}(T)$ the number of leaves of the subtree $T$, and $\alpha$ a freely defined parameter, often called the complexity parameter. The subtree whose replacement causes the minimal costs is replaced by a leaf:

IF $\alpha = \frac{M}{N(T) \cdot (\mathit{Leaves}(T) - 1)} = \mathrm{MIN!}$ THEN Substitute_Subtree

The algorithm tentatively replaces each subtree by a leaf and carries out the replacement whose calculated value of $\alpha$ is minimal compared to the values of $\alpha$ for the other replacements. This results in a sequence of trees $T_0 < T_1 < \ldots < T_i < \ldots < T_n$, where $T_0$ is the original tree and $T_n$ is the root. The trees are evaluated on an independent data set. Among this set of tentative trees, the smallest tree that minimizes the misclassifications on the independent data set is selected as the final tree. This is called the 0-SE (0-standard error) selection method. Other approaches use a relaxed version, called the 1-SE method, in which the smallest tree whose error does not exceed $E_{min} + SE(E_{min})$ is selected. $E_{min}$ is the minimal number of errors yielded by a decision tree $T_i$, and $SE(E_{min})$ is the standard deviation of the empirical error estimated from the independent data set. $SE(E_{min})$ is calculated as follows:

$$SE(E_{min}) = \sqrt{\frac{E_{min} \cdot (N - E_{min})}{N}} \qquad (24)$$

with $N$ the number of test samples.

3.1.9 Some General Remarks

In the former sections, we have outlined methods for decision tree induction. However, some general remarks should help the user to better understand the results and the behavior of decision tree induction.

One main problem is the dependence of the attribute selection on the order of the attributes. If two attributes both show the best possible value of the selection criterion, the attribute that appears first in the data table will always be chosen. Whereas this may not influence the accuracy of the resulting model, the explanation capability might become worse. A trained expert might not find the attribute he usually uses. Therefore, his trust in the model will be affected. One way around this problem would be to let the user select which one of the attributes the tree should use. However, the method then acts in an interactive fashion and not automatically. In the case of large databases it might be preferable to neglect this problem.

Like other learning techniques, decision tree induction strongly depends on the sample distribution. If the class samples are not equally distributed, the induction process might rely on the distribution of the largest class. Usually, users ignore this problem. They run the experiment although one class might dominate in the sample set while others are only represented by a few examples. We have demonstrated the influence of the class distribution in the sample set on the IRIS data set (see Figures 33-35 and Table 3.1.3). It can be seen that for the first two examples the resulting decision tree is more or less the same as the original tree at the top level, but the lower levels have changed. If the class distribution gets even worse, the tree changes totally. However, the error rate calculated with leave-one-out stays in the range of the original tree.

--85 DS PETALLEN
    <=2.45 25 DS [Setosa]
    >2.45 60 DS PETALLEN
        <=5.05 50 DS SEPALLENG
            <=4.95 2 DS ??? [Versicol]
            >4.95 48 DS [Versicol]
        >5.05 10 DS SEPALLENG
            <=6.15 2 DS ??? [Versicol]
            >6.15 8 DS [Virginic]

Fig. 33. Decision Tree for the IRIS Data Set Distribution_1

--78 DS PETALLEN
    <=2.45 25 DS [Setosa]
    >2.45 53 DS PETALWI
        <=1.85 50 DS [Versicol]
        >1.85 3 DS [Virginic]

Fig. 34. DT IRIS Data Set Distribution_2

--54 DS PETALWI
    <=1.85 51 DS SEPALWI
        <=3.25 48 DS [Versicol]
        >3.25 3 DS ??? [Versicol]
    >1.85 3 DS [Virginic]

Fig. 35. DT IRIS Data Set Distribution_3


Table 6. Error Rate for different Sample Sizes

Class Distribution No.   Setosa   Versicolor   Virginic   Error Rate Unpruned   Error Rate Pruned
                         50       50           50         6.66                  4.66
1                        25       50           9          5.88                  5.88
2                        25       50           3          2.56                  2.56
3                        1        50           3          7.407                 5.55

A categorical attribute with n attribute values branches into n subsets when the attribute is used for splitting in a node. If the data are equally distributed in the data set, then an entry data set of size m will result in n subsets of size k=m/n. Clearly, the larger n is, the smaller is the size k of the subsets. As a result of using categorical attributes with many attribute values, the decision tree building process will stop very soon, since the remaining subsets will meet the stopping criterion very soon.
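Returning to the pruning criteria of Section 3.1.8, the cost-complexity measure, the parameter α and the 0-SE and 1-SE selection rules can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the original implementation: the function names are ours, M is assumed to be the number of additional misclassifications caused by replacing the subtree with a leaf (the text does not define M), and the candidate trees in the usage note are hypothetical numbers.

```python
import math


def cost_complexity(errors, n_samples, n_leaves, alpha):
    """CP(T) = E(T) / N(T) + alpha * Leaves(T)."""
    return errors / n_samples + alpha * n_leaves


def alpha_for_replacement(extra_errors, n_samples, n_leaves):
    """alpha = M / (N(T) * (Leaves(T) - 1)).

    Assumption: extra_errors (M) counts the additional misclassifications
    caused by replacing subtree T with a single leaf.
    """
    return extra_errors / (n_samples * (n_leaves - 1))


def standard_error(e_min, n_test):
    """SE(E_min) = sqrt(E_min * (N - E_min) / N)."""
    return math.sqrt(e_min * (n_test - e_min) / n_test)


def select_tree(candidates, n_test, k_se=1):
    """Pick the smallest tree with errors <= E_min + k_se * SE(E_min).

    candidates is a list of (n_leaves, errors_on_independent_set) pairs;
    k_se=0 gives the 0-SE rule, k_se=1 the 1-SE rule.
    """
    e_min = min(errors for _, errors in candidates)
    limit = e_min + k_se * standard_error(e_min, n_test)
    admissible = [c for c in candidates if c[1] <= limit]
    return min(admissible, key=lambda c: c[0])
```

For the hypothetical sequence `[(8, 7), (5, 6), (3, 8), (1, 50)]` evaluated on 150 independent samples, the 0-SE rule keeps the five-leaf tree with the minimal six errors, whereas the 1-SE rule accepts up to 8.4 errors and therefore prefers the smaller three-leaf tree.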

3.1.10 Summary

Decision tree induction is a powerful method for learning classification knowledge from examples. In contrast to rule induction, decision trees present the resulting knowledge in a hierarchical manner that suits human reasoning behavior. Nonetheless, decision trees can be converted into a set of rules.

We have given a sound description of decision tree induction methods that can learn binary and n-ary decision trees. We introduced the basic steps of decision tree learning and described the methods which have been developed for them. The material is prepared in such a way that the reader can follow the developments and their interrelationship. There is still room for new developments, and we hope we could inspire the reader to think about it.

3.2 Case-Based Reasoning

Decision trees are difficult to utilize in domains where generalized knowledge is lacking. But often there is a need for a prediction system even though there is not enough generalized knowledge. Such a system should a) solve problems using the already stored knowledge and b) capture new knowledge, making it immediately available to solve the next problem. To accomplish these tasks case-based reasoning is useful. Case-based reasoning explicitly uses past cases from the domain expert's successful or failing experiences.

Therefore, case-based reasoning can be seen as a method for problem solving as well as a method to capture new experiences and make them immediately available for problem solving. It can be seen as a learning and knowledge discovery approach, since it can capture from new experiences some general knowledge such as case classes, prototypes and some higher level concepts.

3.2.1 Background

Case-Based Reasoning is used when generalized knowledge is lacking. The method works on a set of cases formerly processed and stored in a case base. A new case is interpreted by searching for similar cases in the case base. Among this set of similar cases, the closest case with its associated result is selected and presented to the output.

To point out the differences between a CBR learning system and a symbolic learning system, which represents a learned concept explicitly, e.g. by formulas, rules or decision trees, we follow the notion of Wess et al. [WeG94]: "A case-based reasoning system describes a concept C implicitly by a pair (CB, sim). The relationship between the case base CB and the measure sim used for classification may be characterized by the equation:

Concept = Case Base + Measure of Similarity

This equation indicates in analogy to arithmetic that it is possible to represent a given concept C in multiple ways, i.e. there exist many pairs $C = (CB_1, sim_1), (CB_2, sim_2), \ldots, (CB_i, sim_i)$ for the same concept C. Furthermore, the equation gives a hint how a case-based learner can improve its classification ability. There are three possibilities to improve a case-based system. The system can
• store new cases in the case base CB,
• change the measure of similarity sim,
• or change CB and sim.

During the learning phase a case-based system gets a sequence of cases $X_1, X_2, \ldots, X_i$ with $X_i = (x_i, class(x_i))$ and builds a sequence of pairs $(CB_1, sim_1), (CB_2, sim_2), \ldots, (CB_i, sim_i)$ with $CB_i \subseteq \{X_1, X_2, \ldots, X_i\}$. The aim is to get in the limit a pair $(CB_n, sim_n)$ that needs no further change, i.e. $\exists n\, \forall m \geq n: (CB_n, sim_n) = (CB_m, sim_m)$, because it is a correct classifier for the target concept C."

3.2.2 The Case-Based Reasoning Process













The CBR process comprises six phases (see Figure 36):
• Current problem description
• Problem indexing
• Retrieval of similar cases
• Evaluation of candidate cases
• Modification of a selected case, if necessary
• Application to a current problem: human action.

The current problem is described by some keywords, attributes or any abstraction that allows the basic properties of a case to be described. Based on this description, the case base is indexed. Among a set of similar cases retrieved from the case base, the closest case is evaluated as a candidate case. If necessary, this case is modified so that it fits the current problem. The problem solution associated with the current case is applied to the current problem, and the result is observed by the user. If the user is not satisfied with the result, or if no similar case could be found in the case base, then the case base management process will start.

[Figure: process diagram with the components Case, Problem Description, Indexing, Case Retrieval, Case Evaluation, Case Selection, Action, Failure/Success Registration, Case Base Management, and Case Base]

Fig. 36. Case-Based Reasoning Process
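The retrieval and application phases of this process can be sketched as follows. The sketch rests on several assumptions of ours that are not part of the text: cases are described by numerical feature tuples, similarity is taken as 1/(1 + city-block distance), and the names `Case`, `retrieve`, `solve` and the threshold `min_sim` are purely illustrative. Case modification is left out.

```python
from dataclasses import dataclass


@dataclass
class Case:
    features: tuple  # numerical problem description (assumed representation)
    solution: str    # result stored with the case


def similarity(a, b):
    """Assumed measure: 1 / (1 + city-block distance), in (0, 1]."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def retrieve(case_base, problem):
    """Return (closest case, its similarity), or (None, 0.0) if empty."""
    if not case_base:
        return None, 0.0
    best = max(case_base, key=lambda c: similarity(c.features, problem))
    return best, similarity(best.features, problem)


def solve(case_base, problem, min_sim=0.5):
    """Apply the solution of the closest case, if it is similar enough.

    Returning None signals that no similar case exists, i.e. that the
    case base management process has to take over.
    """
    best, sim = retrieve(case_base, problem)
    if best is None or sim < min_sim:
        return None
    return best.solution
```

With a case base holding `Case((1.4, 0.2), "Setosa")` and `Case((4.7, 1.4), "Versicolor")`, the call `solve(case_base, (1.5, 0.3))` returns the solution of the first case, while a distant query such as `(10.0, 10.0)` falls below the threshold and yields `None`.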

3.2.3 CBR Maintenance

CBR management operates on new cases as well as on cases already stored in the case base. If a new case has to be stored in the case base, this means that there is no similar case in the case base. The system has recognized a gap in the case base. A new case has to be entered into the case base in order to close this gap. From the new case a predetermined case description has to be extracted, which should be formatted into the predefined case format. Afterwards the case is stored in the case base. Selective case registration means that no redundant cases will be stored in the case base and that similar cases will be grouped together or generalized by a case that applies to a wider range of problems. Generalization and selective case registration ensure that the case base will not grow too large and that the system can find similar cases fast.

It might also happen that too many cases are retrieved during the CBR reasoning process. Therefore, it might be wise to rethink the case description or to adapt the similarity measure. For the case description, more distinguishing attributes should be found that allow cases that do not apply to the current problem to be separated. The weights in the similarity measure might be updated in order to retrieve only a small set of similar cases.

CBR maintenance is a complex process and works over all knowledge containers (vocabulary, similarity, retrieval, case base) [Ric95] of a CBR system. Consequently, architectures and systems have been developed which support this process [HeW98][CJR01].

[Figure: diagram with the components Case Base, Selective Case Registration, Case Generalization, Updated Case Entering, New Case Entering, Case Refinement, Case Formation, and Domain Knowledge]

Fig. 37. CBR Maintenance
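Selective case registration can be illustrated with a short sketch. It is only one possible reading of the idea: a new case is treated as redundant if a stored case with the same solution is at least `redundancy_sim` similar. The similarity function, the threshold value and all names are our assumptions; grouping and generalization of cases are not covered.

```python
def similarity(a, b):
    """Assumed measure: 1 / (1 + city-block distance)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def register(case_base, features, solution, redundancy_sim=0.8):
    """Store (features, solution) only if it closes a gap in the case base.

    Returns True if the case was stored and False if it was rejected as
    redundant. case_base is a list of (features, solution) pairs.
    """
    for stored_features, stored_solution in case_base:
        if (stored_solution == solution
                and similarity(stored_features, features) >= redundancy_sim):
            return False  # a very similar case with the same result exists
    case_base.append((features, solution))
    return True
```

Raising `redundancy_sim` makes the case base grow faster but more fine-grained; lowering it keeps the case base small at the price of coarser coverage.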

3.2.4 Knowledge Containers in a CBR System

The notion of knowledge containers was introduced by Richter [Ric95]. It gives a helpful explanation model or view of CBR systems. A CBR system has four knowledge containers: the underlying vocabulary (or features), the similarity measure, the solution transformation, and the cases. The first three represent compiled knowledge, since this knowledge is more stable. The cases are interpreted knowledge. As a consequence, newly added cases can be used directly. This enables a CBR system to deal with dynamic knowledge. In addition, knowledge can be shifted from one container to another. For instance, in the beginning a simple vocabulary, a rough similarity measure, and no knowledge on solution transformation are used [Alt01]. However, a large number of cases are collected. Over time, the vocabulary can be refined and the similarity measure defined in higher accordance with the underlying domain. In addition, it may be possible to reduce the number of cases, because the improved knowledge within the other containers now enables the CBR system to better differentiate between the available cases.

3.2.5 Design Consideration

The main problems concerned with the development of a CBR system are:
• What makes up a case?
• What is an appropriate similarity measure for the problem?
• How to organize a large number of cases for efficient retrieval?
• How to acquire and refine a new case for entry in the case base?
• How to generalize specific cases to a case that is applicable to a wide range of situations?

3.2.6 Similarity

An important point in case-based reasoning is the determination of similarity between a case A and a case B. We need an evaluation function that gives us a measure for similarity between two cases. This evaluation function reduces each case to a numerical similarity measure.

3.2.6.1 Formalization of Similarity

The problem with similarity is that it has no meaning unless one specifies the kind of similarity [Smi89]. It seems advisable to require reflexivity from a similarity measure: an object is similar to itself. Symmetry should be another property of similarity. However, Bayer et al. [BHW92] show that these properties do not necessarily belong to similarity in colloquial use. Let us consider the statements "A is similar to B" or "A is the same as B". We notice that these statements are directed and that the roles of A and B cannot necessarily be exchanged. People say "A circle is like an ellipse." but not "An ellipse is like a circle.", or "The son looks like the father." but not "The father looks like the son." Therefore, symmetry is not necessarily a basic property of similarity. However, in the above examples it can be useful to define the similarity relation as symmetrical. Transitivity does not necessarily hold either. Let us consider the block world: a red ball and a red cube might be similar; a red cube and a blue cube are similar; but a red ball and a blue cube are dissimilar. However, a concrete similarity relation might be transitive.

Smith distinguishes five different kinds of similarity:
• Overall similarity
• Similarity
• Identity
• Partial similarity and
• Partial identity.

Overall similarity is a global relation that includes all other similarity relations. All colloquial similarity statements are subsumed here. Similarity and identity are relations that consider all properties of objects at once; no single part is left unconsidered. A red ball and a blue ball are similar; a red ball and a red car are dissimilar. The holistic relations similarity and identity differ in the degree of similarity. Identity describes objects that are not significantly different. All red balls of one production process are similar. Similarity contains identity and is more general. Partial similarity and partial identity compare the significant parts of objects. One aspect or attribute can be marked. Partial similarity and partial identity differ with respect to the degree of similarity. A red ball and a pink cube are partially similar, but a red ball and a red cube are partially identical.

The described similarity relations are connected in many respects. Identity and similarity are unspecified relations between whole objects. Partial identity and partial similarity are relations between single properties of objects. Identity and similarity are equivalence relations, which means that they are reflexive, symmetrical, and transitive. These properties do not hold for partial identity and partial similarity. From identity follow similarity and partial identity. From these follow partial similarity and general similarity.

Similarity and identity are two concepts that depend on the actual context. The context defines the essential attributes of the objects that are taken into consideration when similarity is determined. An object "red ball" may be similar to an object "red chair" because of the color red. However, the objects "ball" and "chair" are dissimilar. These attributes are either given a priori or are "salient" in the considered problem and need to be made explicit by a knowledge engineering step.
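The block-world example can be checked mechanically. In the following sketch objects are attribute dictionaries, and two relations are compared: agreement on the attributes named by a context, and agreement on at least one attribute. Both relations and all function names are illustrative assumptions of ours, not definitions taken from [Smi89].

```python
def similar_in_context(a, b, context):
    """a and b agree on every attribute named in the context."""
    return all(a[attr] == b[attr] for attr in context)


def shares_some_attribute(a, b):
    """Looser relation: a and b agree on at least one attribute."""
    return any(a[attr] == b[attr] for attr in a)


def is_transitive(relation, objects):
    """Check transitivity of a binary relation on a finite set of objects."""
    return all(relation(a, c)
               for a in objects for b in objects for c in objects
               if relation(a, b) and relation(b, c))


red_ball = {"color": "red", "shape": "ball"}
red_cube = {"color": "red", "shape": "cube"}
blue_cube = {"color": "blue", "shape": "cube"}
blocks = [red_ball, red_cube, blue_cube]
```

On these three blocks `shares_some_attribute` is reflexive and symmetrical but not transitive, since the red ball and the blue cube share nothing. Once the context is fixed to the single attribute color, the relation becomes an equivalence relation and therefore transitive.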

3.2.6.2 Similarity Measures

The calculation of similarity between the attributes must be meaningful. It makes no sense to compare two attributes that do not make a contribution to the considered similarity.

Since attributes can be numerical, categorical, or a combination of both, we need to pay attention to this when selecting the similarity measure. Not all similarity measures can be used for categorical attributes or can deal with numerical and categorical attributes at the same time. The variables should have the same scale level.

Similarity measures for these kinds of attributes will be described in Chapter 3.3.

3.2.6.3 Similarity Measures for Images

Images can be rotated, translated, different in scale, or may have different contrast and energy, but they might still be considered similar. In contrast to that, two images may be dissimilar because the object in one image is rotated by 180 degrees. The concept of invariance in image interpretation is closely related to that of similarity. A good similarity measure should take this into consideration.




The classical similarity measures do not allow this. Usually, the images or the features have to be pre-processed in order to be adapted to the scale, orientation, or shift. This is a further processing step which is expensive and needs some a priori information which is not always given. Filters such as matched filters, linear filters, Fourier or Wavelet filters are especially useful for invariance under translation and rotation, which has also been shown by [MNS00]. A lot of work has been done in the past to develop such filters for image interpretation. The best way to achieve scale invariance from an image is by means of invariant moments, which can also be invariant under rotation and other distortions. Some additional invariance can be obtained by normalization (which reduces the influence of energy). Depending on the image representation (see Figure 38) we can divide similarity measures into: pixel (iconic)-matrix based similarity measures, feature-based similarity measures (numerical, symbolical, or mixed type), and structural similarity measures.





Since a CBR image interpretation system also has to take into account non-image information, such as information about the environment or the objects, we need similarity measures which can combine non-image and image information. A first approach to this has been shown in [Per99].

[Figure 38: tree diagram. Node labels: Image Representation; Pixel (Iconic)-Matrix Based Similarity (binary, grey level, color); Chain Code/String Based Similarity; Feature-Based Similarity (numeric, symbolic); Structural Similarity (point set 2D/3D, attributed graph, spatial relation). The nodes carry the numbers of the associated similarity measures.]

Fig. 38. Image Representations and Similarity Measure

Systematic studies on image similarity have been done by Zamperoni et al. [ZaS95]. They studied how pixel-matrix based similarity measures behave under different real-world influences such as translation, noise (spikes, salt and pepper noise), different contrast, and so on. Image feature-based similarity measures have been studied from a broader perspective by Santini and Jain [SaJ99]. These are the only substantial works we are aware of. Otherwise, at every new conference on pattern recognition new similarity measures are proposed for specific purposes and for the different kinds of image representation, but methodological work is missing. A similarity measure for the comparison of binary images is proposed by Cortelazzo et al. [CDMZ96], and for gray-scale images by Wilson et al. [WBO97] and Moghaddam et al. [MNP96]. A landmark-based similarity approach for the comparison of shapes is proposed by van der Heiden and Vossepol [HeV96], and a shape similarity based on structural features in Mehrotra [Meh93]. A structural similarity measure can be found in Perner [Per98] and Bunke et al. [BuM94]. We hope that we could point out that images are a special type of information that needs special similarity measures and that a more methodological study should be done on that type of information.

3.2.7 Case Description

In the previous section we have seen that similarity is calculated over the essential attributes of a case. Only the most predictive attributes will guarantee that the most similar cases are found exactly. The attributes are summarized in the case description. The case description is either given a priori or needs to be acquired during a knowledge acquisition process. We use the repertory grid for knowledge acquisition. There are different opinions about the formal description of a case. Each system utilizes a different representation of a case. In multimedia applications we usually have to deal with different kinds of information for one case. For example, in an image interpretation system we have two main types of information concerned with image interpretation: image-related information and non-image-related information. Image-related information can be the 1D, 2D, or 3D images of the desired application. Non-image-related information can be information about the image acquisition, such as the type and parameters of the sensor, information about the objects, or the illumination of the scene. It depends on the type of application what type of information should be taken into consideration for the interpretation of the image. Therefore, we restrict ourselves to giving a definition and explain the case description on a specific application given in Sections 5.7 and 5.8. Formally, we understand a case as follows:

Definition 5.1 A case F is a triple (P, E, L) with a problem description P, an explanation of the solution E, and a problem solution L.
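The triple of Definition 5.1 maps directly onto a small record type. The following is only a minimal sketch: the field names and the example attribute values are assumptions chosen for illustration and are not taken from the applications in Sections 5.7 and 5.8.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Case:
    """A case F = (P, E, L) in the sense of Definition 5.1."""
    problem: Dict[str, Any]   # P: problem description (attribute -> value)
    explanation: str          # E: explanation of the solution
    solution: Any             # L: problem solution


# Hypothetical case that mixes image-related and non-image-related attributes.
case = Case(
    problem={"mean_grey_level": 112.4, "sensor_type": "CT", "object": "lung"},
    explanation="parameters chosen because of low contrast",
    solution={"threshold": 97},
)
```

Keeping the problem description as an attribute-value mapping makes it easy to hold numerical and categorical attributes side by side, which is the situation described above for multimedia cases.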

3.2.8 Organization of Case Base

Cases can be organized in a flat case base or in a hierarchical fashion, see Figure 39. In a flat organization, we have to calculate the similarity between the problem case and each case in the memory. It is clear that this will take time, particularly if the case base is very large. Systems with a flat case base organization usually run on a parallel machine to perform retrieval in a reasonable time and do not allow the case base to grow over a predefined limit. Maintenance is done by partitioning the case base into case clusters and by controlling the number and size of these clusters [Per98]. To speed up the retrieval process, a more sophisticated organization of the case base is necessary. This organization should allow separating the set of similar cases from those cases not similar to the recent problem at the earliest stage of the retrieval process. Therefore, we need to find a relation p that allows us to order our case base:

Definition A binary relation p on a set CB is called a partial order on CB if it is reflexive, antisymmetric, and transitive. In this case, the pair ⟨CB, p⟩ is called a partially ordered set or poset.

The relation can be chosen depending on the application. One common approach is to order the case base based on the similarity value. The set of cases can be reduced by the similarity measure to a set of similarity values. The relation <= over these similarity values gives us a partial order over these cases. The hierarchy consists of nodes and edges. Each node in this hierarchy contains a set of cases that do not exceed a specified similarity value. The edges show the similarity relation between the nodes. For nodes that are connected by an edge, the similarity relation holds. The relation between two successor nodes can be expressed as follows: let z be a node and x and y two successor nodes of z; then x subsumes z and y subsumes z. By tracing down the hierarchy, the space gets smaller and smaller until finally a node has no successor. This node contains a set of cases for which the similarity relation holds. Among these cases the closest case to the query case has to be found. Although we still have to carry out matching, the number of matches will have decreased through the hierarchical ordering.

The nodes can be represented by the prototypes of the set of cases assigned to the node. When classifying a query through the hierarchy, the query is only matched with the prototype. Depending on the outcome of the matching process, the query branches right or left of the node. Such a hierarchy can be created by hierarchical or conceptual clustering (see Sect. 3.4). Another approach uses a feature-based description of a case which makes up an n-dimensional query space. The query space is recursively partitioned into subspaces containing similar cases. This partition is done based on a test on the values of an attribute. The test on the attribute branches the case base into a left or right set of cases until a leaf node is reached. This leaf node still contains a set of cases among which the closest case has to be found. Such structures can be k-d trees [WAD93] and decision trees (see Section 3.2). There are also set-membership based organizations, such as semantic nets [GrA96] and object-oriented representations [BSt98].
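The prototype-guided descent described above can be sketched as follows. This is an illustrative sketch, not the retrieval procedure of any particular system: the `Node` class, the use of the Euclidean distance as dissimilarity, and the two-leaf example hierarchy are all assumptions. The sketch also counts the number of matches to show the saving compared with a flat case base.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


@dataclass
class Node:
    prototype: Vector                                    # represents the cases below this node
    cases: List[Vector] = field(default_factory=list)    # filled only in leaf nodes
    children: List["Node"] = field(default_factory=list)


def retrieve(root: Node, query: Vector):
    """Descend along the closest prototype, then match inside the leaf."""
    node, matches = root, 0
    while node.children:
        matches += len(node.children)                    # one match per child prototype
        node = min(node.children, key=lambda c: euclidean(c.prototype, query))
    matches += len(node.cases)                           # remaining matches in the leaf
    best = min(node.cases, key=lambda c: euclidean(c, query))
    return best, matches


# Hypothetical hierarchy with two leaves of three cases each.
left = Node(prototype=(1.0, 1.0), cases=[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
right = Node(prototype=(9.0, 9.0), cases=[(8.0, 8.0), (9.0, 9.0), (10.0, 10.0)])
root = Node(prototype=(5.0, 5.0), children=[left, right])

best, matches = retrieve(root, (8.2, 8.1))
```

In this toy example the query is matched against two prototypes and three cases, whereas a flat case base would require six matches. The saving grows with the depth of the hierarchy, at the price that the closest case is only guaranteed within the selected leaf.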


[Figure 39: tree diagram. Node labels: Organization of Case Base; Similarity-Value Based Ordering (Partitioning, Hierarchical, Hierarchical Clustering, Conceptional Clustering); Attribute-Value Based Ordering (k-d Trees, Decision Trees); Set-Membership Based Ordering (Semantic Nets, Object-Oriented Representations).]

Fig. 39. Organization of Case Base

3.2.9 Learning in a CBR System

CBR management is closely related to learning. It should improve the performance of the system.

Let x_i be a set of cases collected in a case base CB_n. The relation between each case in the case base can be expressed by the similarity value δ. The case base can be partitioned into case classes such that

$$CB = \bigcup_{i=1}^{n} C_i$$

where the intra-class similarity is high and the inter-class similarity is low. Cardinality of Set of Case or Case Class ..... The set of cases in each class can be represented by a representative which generally describes the cluster. This representative can be the prototype, the medoid, or an a priori selected case. The prototype implies that the representative is the mean of the cluster, which can easily be calculated from numerical data. The medoid is the case whose sum of distances to all other cases in the cluster is minimal. The relation between the case classes can be expressed by higher order constructs, expressed e.g. as super classes, that give us a hierarchical structure over the case base. There are different learning strategies that can take place in a CBR system:

1. Learning takes place if a new case x_{i+1} has to be stored in the case base such that CB_{n+1} = CB_n ∪ {x_{i+1}}. That means that the case base is incrementally updated according to the new case.
2. The case classes and/or the prototypes representing the classes can be learnt incrementally.
3. The relationship between the different cases or case classes can be updated according to the new case classes.
4. Learning of the similarity measure.
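The difference between the two computed representatives, the prototype (mean) and the medoid, can be made concrete with a few lines of code. This is a minimal sketch that assumes purely numerical cases and the Euclidean distance; the example class is hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def prototype(cases):
    """Mean of the case class, computed attribute by attribute."""
    n = len(cases)
    return tuple(sum(c[j] for c in cases) / n for j in range(len(cases[0])))


def medoid(cases):
    """The stored case whose summed distance to all other cases is minimal."""
    return min(cases, key=lambda c: sum(euclidean(c, other) for other in cases))


# Hypothetical case class with one outlying case.
case_class = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0), (10.0, 1.0)]

p = prototype(case_class)   # pulled towards the outlier
m = medoid(case_class)      # always one of the stored cases
```

The prototype is in general not an element of the case base, whereas the medoid always is. That matters when the representative has to carry a solution that was actually observed.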


3.2.9.1 Learning of New Cases and Forgetting of Old Cases

Learning of new cases means just adding cases to the case base upon some notification. Closely related to case adding is case deletion, or forgetting of cases which have shown low utility. This should control the size of the case base. There are approaches that keep the size of the case base constant and delete cases that have not shown good utility within a time window [BlP00]. The failure rate is used as the utility criterion. Given a period of observation of N cases, if the CBR component exhibits M failures in such a period, we define the failure rate as f_r = M / N. Other approaches try to estimate the "coverage" of each case in memory and use this estimate to guide the case memory revision process [SMc98]. The adaptability to the dynamics of a changing environment, which requires storing new cases in spite of the case base limit, is addressed in [SuT98]. Based on the intra-class similarity it is decided whether a case is removed from or stored in a cluster.

3.2.9.2 Learning of Prototypes

Learning of prototypes has been described in [Per99] for a flat organization of the case base and in [Per98] for a hierarchical representation of the case base. The prototype or the representative of a case class is the more general representation of a case class. A class of cases is a set of cases sharing similar properties. The set of cases does not exceed a boundary for the intra-class dissimilarity. Cases that are on the boundary of this hyperball have the maximal dissimilarity value. A prototype can be selected a priori by the domain user. This approach is preferable if the domain expert knows the properties of the prototype for sure. Alternatively, the prototype can be calculated by averaging over all cases in a case class, or the median of the cases is chosen. If only a few cases are available in a class and new cases are subsequently stored in the class, then it is preferable to incrementally update the prototype according to the new cases.
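For a prototype defined as the class mean, the incremental update does not require the stored cases: the running mean can be corrected with each new case. The sketch below assumes numerical attributes and a mean-based prototype; it is not the update scheme of [Per98] or [Per99], which is not detailed here.

```python
def update_prototype(proto, n, new_case):
    """Return the mean of n + 1 cases, given the mean `proto` of the first n cases."""
    updated = [m + (x - m) / (n + 1) for m, x in zip(proto, new_case)]
    return updated, n + 1


# Hypothetical class: two cases with mean (2, 4); a third case (5, 1) arrives.
proto, count = update_prototype([2.0, 4.0], 2, [5.0, 1.0])
```

Only the current prototype and the number of cases seen so far need to be stored per class, so the update cost is independent of the class size.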

3.2.9.3 Learning of Higher Order Constructs

The ordering of the different case classes gives an understanding of how these case classes are related to each other. For two case classes which are connected by an edge, the similarity relation holds. Case classes that are located at a higher position in the hierarchy apply to a wider range of problems than those located near the leaves of the hierarchy. By learning how these case classes are related to each other, higher order constructs are learnt [Per98].

3.2.9.4 Learning of Similarity

By introducing feature weights we can put special emphasis on some features in the similarity calculation. It is possible to introduce local and global feature weights. A feature weight for a specific attribute is called a local feature weight. A feature weight that averages over all local feature weights for a case is called a global feature weight. This can improve the accuracy of the CBR system. By updating these feature weights we can learn similarity [WAM97][BCS97].
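How local feature weights enter a similarity calculation can be illustrated with a weighted city-block distance. This is only a sketch under stated assumptions: the attributes are taken to be numerical and scaled to [0, 1], the weights are normalized by their sum, and the function names are hypothetical. The weight-learning procedures of [WAM97] and [BCS97] are not reproduced.

```python
def weighted_distance(x, y, local_weights):
    """Weighted city-block distance, normalized by the sum of the weights."""
    total = sum(local_weights)
    return sum(w * abs(a - b) for w, a, b in zip(local_weights, x, y)) / total


def global_weight(local_weights):
    """Global feature weight of a case: the average of its local feature weights."""
    return sum(local_weights) / len(local_weights)


x, y = [0.2, 0.9], [0.4, 0.1]

d_equal = weighted_distance(x, y, [1.0, 1.0])     # both features count equally
d_emphasis = weighted_distance(x, y, [3.0, 1.0])  # emphasis on the first feature
g = global_weight([3.0, 1.0])
```

With equal weights the large difference in the second feature dominates; shifting weight to the first feature lowers the distance. Learning similarity then amounts to adjusting such weights according to the retrieval results.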

3.2.10 Conclusions

Case-based reasoning can be used when generalized knowledge is lacking but a sufficient number of formerly solved cases is available. A new problem is solved by searching the case base for similar cases and applying the action of the closest case to the new problem. The retrieval component of CBR puts it in the same line with multimedia database techniques. More than that, the CBR methodology also includes the acquisition of new cases and learning about the knowledge containers. Therefore, it can be seen as an incremental learning and knowledge discovery approach. This property makes CBR very suitable for many applications. This chapter described the basic developments concerned with case-based reasoning. The theory and motives behind CBR techniques are described in depth in Aamodt and Plaza [AaP95]. An overview of recent CBR work can be found in [Alt01]. We will describe in Chapter 4.1 how case-based reasoning can be used for image segmentation.

3.3 Clustering

3.3.1 Introduction

A set of unordered observations, each represented by an n-dimensional feature vector, is partitioned into smaller, homogeneous, and practically useful classes C_1, C_2, ..., C_k such that, in a well-defined sense, similar observations belong to the same class and dissimilar observations belong to different classes. This definition implies that the resulting classes have a strong internal compactness and a maximal external isolation. Graphically this means that each cluster would be represented by a spherical shape in the n-dimensional feature space. However, real-world clusters may not follow this model assumption. Therefore, up-to-date clustering algorithms [GRS2001] do not rely on this assumption; rather, they try to discover the real shape of the natural groupings. Clustering methods can be distinguished into overlapping, partitioned, unsharp, and hierarchical methods, see Figure 40.


[Figure 40: tree diagram. Node labels: Clustering (unsupervised classification); overlapping; partitioned (with utility function, with distance function); unsharp; hierarchical (agglomerative, divisive).]

Fig. 40. Overview of Clustering Methods

Furthermore, we can distinguish between methods that optimize a utility function and those that use a distance function. The partitioning methods assign one and only one class label to each observation. It is clear that this is an ideal situation and cannot be assumed for all applications. Unsharp and overlapping clustering methods allow an observation to be assigned to more than one class. While unsharp clustering methods assign an observation with a membership value μ_j to the m classes (j = 1, 2, ..., m; μ_1 + μ_2 + ... + μ_j + ... + μ_m = 1), overlapping clustering methods allow an observation to be assigned to one or more classes but do not calculate a degree of membership. The hierarchical clustering methods lead to a sequence of partitions, the so-called hierarchy. The classes of this hierarchy are either disjoint or they satisfy the inclusion relation. For the description of the cluster quality we use the variance criterion:

$$v = \sum_{j=1}^{m} \mu_j \sum_{i=1}^{L} q_i \,(x_{ij} - x_k)^2 \rightarrow \text{Min!} \tag{25}$$

with μ the membership function, x_k the prototypes of the k classes, q_i the weight of the variables, and x_{ij} the variables. The membership function μ is one for all classes in the case of crisp clustering. The variance should be at its minimum when the clustering algorithm has achieved a good partition of the data. Clustering can be done based on the observations or on the attributes. Whereas the first approach gives us object classes, the latter is very helpful for discovering redundant attributes or even attribute groups that can be summarized into a more general attribute. The basis for clustering is a data table containing rows for the m observations and columns for the n attributes, giving the value of each attribute for each observation. Note that, in contrast to the data table 2.1 in Section 2, a class label need not be available.

3.3.2 General Comments

Before calculating the similarity between the numerous observations (or the attributes) we should make sure that the following points are satisfied:


- The calculation of the similarity and the comparison between the observations and their different attributes must be meaningful. It makes no sense to compare attributes of observations which do not contribute to the expected meaning.
- The homogeneity of the data matrix must be assumed.
- Attributes should have equal scale level.
- Metrical attributes should have a similar variance.
- Attribute weighting is only possible if the class structure will not be blurred.
- Observations containing mixed data types, such as numerical and categorical attributes, require special distance measures.

3.3.3 Distance Measures for Metrical Data

A distance d(x, y) between two vectors x and y is a function for which the following identity and symmetry conditions must hold:

$$d(x, y) \ge 0; \qquad d(x, y) = 0 \;\Rightarrow\; x = y \tag{26}$$

$$d(x, y) = d(y, x) \tag{27}$$

We call the distance d(x, y) a metric if the triangle inequality holds:

$$d(x, y) \le d(x, z) + d(z, y) \tag{28}$$

If we require the following condition instead of the triangle inequality, then we call d(x, y) an ultrametric:

$$d(x, y) \le \max\{d(x, z),\, d(z, y)\} \tag{29}$$

This metric plays an important role in hierarchical cluster analysis. A well-known distance measure is the Minkowski metric:

$$d^{(p)}_{ii'} = \left[ \sum_{j=1}^{J} \left| x_{ij} - x_{i'j} \right|^{p} \right]^{1/p} \tag{30}$$

The choice of the parameter p depends on the importance we give to the differences in the summation. If we choose p = 2, then we give special emphasis to big differences in the observations. The measure is invariant to translations and orthogonal linear transformations (rotation and reflection). It is called the Euclidean distance:

$$d_{ii'} = \sqrt{ \sum_{j=1}^{J} \left( x_{ij} - x_{i'j} \right)^{2} } \tag{31}$$

If we choose p = 1, the measure becomes less sensitive to outliers since big and small differences are treated equally. This measure is also called the City-Block metric:


$$d_{ii'} = \sum_{j=1}^{J} \left| x_{ij} - x_{i'j} \right| \tag{32}$$

If we choose p = ∞, we obtain the so-called Max Norm:

$$d_{ii'} = \max_{j} \left| x_{ij} - x_{i'j} \right| \tag{33}$$

This measure is useful if only the maximal distance between two variables among a set of variables is of importance, whereas the other distances do not contribute to the overall similarity.

The disadvantage of the measures described above is that they require the statistical independence of the attributes. Each attribute is considered to be independent and isolated. A high correlation between attributes can be considered as a multiple measurement of one attribute. That means the measures described above give this feature more weight than an uncorrelated attribute. The Mahalanobis distance

$$d^{2}_{ii'} = (x_i - x_{i'})^{\top} S^{-1} (x_i - x_{i'}) \tag{34}$$

takes into account the covariance matrix S of the attributes.
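The distance measures (30) to (34) can be written down compactly. The following sketch uses plain Python lists; the function names are assumptions, and the inverse covariance matrix is assumed to be supplied by the caller rather than estimated from data.

```python
def minkowski(x, y, p):
    """Minkowski metric (30); p = 1 city-block, p = 2 Euclidean, p = inf max norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def mahalanobis_squared(x, y, s_inv):
    """Squared Mahalanobis distance (34); s_inv is the inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * s_inv[r][c] * diff[c] for r in range(n) for c in range(n))


x, y = [0.0, 0.0], [3.0, 4.0]

city_block = minkowski(x, y, 1)
euclid = minkowski(x, y, 2)
max_norm = minkowski(x, y, float("inf"))

# Hypothetical inverse covariance: the first attribute has variance 4, the second 1.
d2 = mahalanobis_squared(x, y, [[0.25, 0.0], [0.0, 1.0]])
```

For the same pair of observations the three Minkowski variants differ only in how strongly the larger coordinate difference dominates. With the identity matrix as inverse covariance, the squared Mahalanobis distance reduces to the squared Euclidean distance.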

3.3.4 Using Numerical Distance Measures for Categorical Data

Although all these measures are designed for numerical data, they can be used to handle categorical features as well. Suppose we have an attribute color with green, red, and blue as attribute values. Suppose we have an observation 1 with red color and an observation 2 also with red color; then the two observations are identical. Therefore, the distance is zero. Suppose now we have an observation 3 with green color and we want to compare it to observation 2 with red color. The two attribute values are different; therefore, the distance takes the value one. If we want to express the degree of dissimilarity, then we have to assign levels of dissimilarity to all the different combinations of attribute values. If a ∈ A is an attribute and W_a ⊆ W is the set of all attribute values which can be assigned to a, then we can determine for each attribute a a mapping:

$$\mathrm{distance}_a : W_a \times W_a \rightarrow [0, 1]. \tag{35}$$

The normalization to a real interval is not absolutely necessary but advantageous for the comparison of attribute assignments. For example, let a be an attribute a = spatial_relationship and

W_a = {behind_right, behind_left, infront_right, ...}.


T h e n w e c o u ld d e fin e : d is ta n c e d is ta n c e d is ta n c e a a a

(b e h in d _ rig h t, b e h in d _ rig h t ) = 0 (b e h in d _ rig h t, in fro n t_ rig h t) = 0 .2 5 (b e h in d _ rig h t, b e h in d _ le ft ) = 0 .7 5 .

Based on such distance measures for attributes, we can define different variants of a distance measure as a mapping $\mathrm{distance}: B^2 \rightarrow R^+$ ($R^+$ ... set of positive real numbers) in the following way:

$$\mathrm{distance}(x,y) = \frac{1}{|D|} \sum_{a \in D} \mathrm{distance}_a(x(a), y(a))$$

with $D = \mathrm{domain}(x) \cap \mathrm{domain}(y)$.
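The averaged attribute distance can be illustrated with a minimal Python sketch. It is not part of the original method description: the function names, the dictionary-based lookup table, the symmetry of the table, and the 0/1 fallback for attributes without a table are assumptions made for illustration. Only the three example distances come from the text.

```python
# Dissimilarity levels for the attribute spatial_relationship, taken from
# the example values in the text; all other pairs are left undefined.
SPATIAL_RELATIONSHIP = {
    ("behind_right", "infront_right"): 0.25,
    ("behind_right", "behind_left"): 0.75,
}


def attribute_distance(table, u, v):
    """Return distance_a(u, v) from a lookup table (assumed symmetric)."""
    if u == v:
        return 0.0  # identical attribute values have distance zero
    if (u, v) in table:
        return table[(u, v)]
    return table[(v, u)]  # raises KeyError if the pair was never defined


def distance(x, y, tables):
    """Mean attribute distance over D = domain(x) ∩ domain(y).

    x and y map attribute names to values; tables maps an attribute name
    to its lookup table. Attributes without a table fall back to the
    simple 0/1 mismatch distance (an assumption of this sketch).
    """
    shared = set(x) & set(y)
    if not shared:
        raise ValueError("the observations share no attribute")
    total = 0.0
    for a in shared:
        if a in tables:
            total += attribute_distance(tables[a], x[a], y[a])
        else:
            total += 0.0 if x[a] == y[a] else 1.0
    return total / len(shared)
```

For two observations that differ in color (distance 1) and have the spatial relationships behind_right and behind_left (distance 0.75), the sketch returns (1 + 0.75) / 2 = 0.875. Attributes that occur in only one of the observations are ignored.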

3.3.5 Distance Measure for Nominal Data

For nominal attributes, special distance coefficients have been designed. The basis for the calculation of these distance coefficients is a contingency table, see Table 7. The value $N_{00}$ in this table is the frequency of observation pairs (i,j) that do not share the property, neither in the one observation nor in the other observation. The value $N_{01}$ is the frequency of observation pairs where one observation has the property and the other does not have this property.

Table 7. Contingency table

                                 Status of the Observation i
                                 0            1
Status of the Observation j  0   N00          N01
                             1   N10          N11

Given that, we can define different distance coefficients for nominal data:

$$d_{ij} = 1 - \frac{N_{11} + N_{00}}{N_{11} + N_{00} + 2(N_{10} + N_{01})} \qquad (36)$$

$$d_{ij} = 1 - \frac{N_{11}}{N_{11} + N_{10} + N_{01}} \qquad (37)$$

In the Rogers and Tanimoto coefficient (36), the similarity is increased by the value of $N_{00}$, and the non-correspondences $N_{10}$ and $N_{01}$ get double weight. In the Jaccard coefficient (37), the non-existence of a property, $N_{00}$, is not considered.
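The two coefficients can be computed directly from the contingency counts. The following Python sketch assumes the usual reading of the table, in which each observation is a binary vector over the same properties and the counts are taken over these properties. The function names and the handling of the degenerate case are assumptions of the sketch.

```python
def contingency(x, y):
    """Count N11, N10, N01, N00 for two binary property vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("need two non-empty vectors of equal length")
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(x, y):
        if a and b:
            n11 += 1      # property present in both observations
        elif a:
            n10 += 1      # present only in the first observation
        elif b:
            n01 += 1      # present only in the second observation
        else:
            n00 += 1      # absent in both observations
    return n11, n10, n01, n00


def rogers_tanimoto_distance(x, y):
    """Formula (36): mismatches get double weight, N00 counts as agreement."""
    n11, n10, n01, n00 = contingency(x, y)
    return 1.0 - (n11 + n00) / (n11 + n00 + 2 * (n10 + n01))


def jaccard_distance(x, y):
    """Formula (37): joint absence N00 is ignored."""
    n11, n10, n01, _ = contingency(x, y)
    denominator = n11 + n10 + n01
    if denominator == 0:
        # Assumption of this sketch: two observations without any
        # property are treated as identical.
        return 0.0
    return 1.0 - n11 / denominator
```

For x = (1, 1, 0, 0, 1) and y = (1, 0, 0, 1, 1) the counts are N11 = 2, N10 = 1, N01 = 1, N00 = 1, which gives a Rogers and Tanimoto distance of 4/7 and a Jaccard distance of 0.5.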


3.3.6 Contrast Rule

This measure has been developed by Tversky [Tve77]. It describes the similarity between a prototype A and a new example B as:

$$S(A,B) = \frac{D_i}{\alpha D_i + \beta E_i + \chi F_i}, \qquad \alpha = 1,\ \beta, \chi = 0.5 \qquad (38)$$

with $D_i$ the features that are common to both A and B; $E_i$ the features that belong to A but not to B; and $F_i$ the features that belong to B but not to A.

3.3.7 Agglomerate Clustering Methods

The distance between each pair of observations is calculated based on a proper distance function. The values are stored in a separate distance matrix. The algorithm starts with the minimal value of the distance in the distance matrix and combines the two corresponding classes, which in the initial phase are the two observations, into a new class. The distance of this new class to all remaining observations is calculated and the procedure repeats until all observations have been processed, see the algorithm in Figure 41. The calculation of the distance between the new class and all other classes is done based on the following formula:

$$d_{kk'} = \alpha_l\, d_{lk'} + \alpha_{l'}\, d_{l'k'} + \beta\, d_{ll'} + \gamma\, |d_{lk'} - d_{l'k'}| \qquad (39)$$

The coefficients α, β, and γ determine the way this fusion will be done, see Table 8. We can distinguish between single linkage, complete linkage, average linkage, simple linkage, median, centroid and Ward method [Muc92]. Table 8 gives the coefficients α, β, and γ for the single linkage, the complete linkage and the median method. In the single linkage method, the distance between the classes is defined as the minimal distance. This method is often used to recognize outliers in the data. The complete linkage method creates homogeneous but less separable clusters. Outliers stay unrecognized.

1. Find the minimal value in the distance matrix D = (d_kk'), k, k' = 1,2,...,K, k ≠ k'
2. Fusion of the two classes l and l' to a new class k
3. Calculation of the distance of the new class k to all other classes k' (k' ≠ l, k' ≠ l') according to d_kk'

Fig. 41. Algorithm of Agglomerative Clustering Method


Table 8. Parameters for the recursive formula of d_kk' for selected agglomerative clustering methods

Method              α_l     α_l'    β       γ
Single Linkage      1/2     1/2     0       -1/2
Complete Linkage    1/2     1/2     0       1/2
Median              1/2     1/2     -1/4    0
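The three steps of Figure 41 together with the recursive update (39) can be sketched in Python as follows. This is an illustrative sketch, not the author's implementation: the function name, the dictionary-based distance store and the format of the returned merge list are assumptions. The coefficients are the ones listed in Table 8.

```python
from itertools import combinations

# (alpha_l, alpha_l', beta, gamma) as listed in Table 8
COEFFICIENTS = {
    "single": (0.5, 0.5, 0.0, -0.5),
    "complete": (0.5, 0.5, 0.0, 0.5),
    "median": (0.5, 0.5, -0.25, 0.0),
}


def agglomerate(dist, method="single"):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the list of fusions as (members_l, members_l', distance).
    """
    a_l, a_l2, beta, gamma = COEFFICIENTS[method]
    clusters = {i: (i,) for i in range(len(dist))}   # class id -> members
    d = {frozenset((i, j)): float(dist[i][j])
         for i, j in combinations(clusters, 2)}
    merges = []
    next_id = len(dist)
    while len(clusters) > 1:
        # Step 1: minimal value in the current distance matrix
        pair = min(d, key=d.get)
        l, l2 = sorted(pair)
        d_ll2 = d.pop(pair)
        # Step 2: fusion of the classes l and l' into a new class
        merges.append((clusters[l], clusters[l2], d_ll2))
        members = clusters.pop(l) + clusters.pop(l2)
        # Step 3: distance of the new class to all other classes, formula (39)
        for k in clusters:
            d_lk = d.pop(frozenset((l, k)))
            d_l2k = d.pop(frozenset((l2, k)))
            d[frozenset((next_id, k))] = (a_l * d_lk + a_l2 * d_l2k
                                          + beta * d_ll2
                                          + gamma * abs(d_lk - d_l2k))
        clusters[next_id] = members
        next_id += 1
    return merges
```

With the single linkage coefficients the update reduces to the minimum of the two old distances, and with the complete linkage coefficients to the maximum, which can be checked on a small distance matrix.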

The result of the hierarchical cluster analysis can be graphically represented by a dendrogram, see Figure 42. The similarity values created for each new class induce a hierarchy over the classes. Along the y-axis of the dendrogram all observations are shown. The x-axis shows the similarity value from 0 to 1. Two observations that are grouped together into a new class are linked together in the dendrogram at the similarity value calculated for this new class. The dendrogram shows us visually to which group an observation belongs and the similarity relation among different groups. If we cut the dendrogram for classes in Figure 42 at the similarity level 8, then the dendrogram decomposes into two subgroups, group_1 = {C_2, C_6, C_3} and group_2 = {C_1, C_4, C_5}. Agglomerative clustering methods have the disadvantage that once classes have been fused, the process cannot be reversed. However, the dendrogram is a suitable representation for a large number of observations. It allows us to explore the similarity links between different observations and established groups. Together with a domain expert, we can draw useful conclusions for the application from this representation.

[Figure 42: data table of the classes Class_1 to Class_6 over the attributes A_1 to A_8, a dendrogram for the classes and a dendrogram for the attributes; the dendrogram axis is labeled Distance and runs from 0 to 10.]

Fig. 42. Data Table and Dendrogram of Classes and Attributes


3.3.8 Partitioning Clustering

Partitioning clustering methods start with an initial partition of the observations and optimize this partition according to a utility function or distance function. The most popular algorithm is the k-means clustering algorithm, see Figure 43. The objective is to find a partition that minimizes the variance criterion. If the membership function μ takes different values for the clusters, then fuzzy clustering is realized. The distance between each sample and the prototypes of the clusters is calculated and the sample is assigned to the cluster with the lowest value of the distance measure. Afterwards, the prototype of the cluster from which the sample has been removed and the prototype of the cluster into which the sample has been inserted are recalculated. This process repeats until the clusters are stable. A difficult problem is the initial decision about the number of clusters and the initial partition. Therefore, hierarchical clustering can be done in a pre-step in order to give a clue about the number of clusters and their structure. Finding the best partition is an optimization problem. A practical way to do this is to repeat the clustering with different initial partitions and to select the one which yields the minimal value of the variance criterion. The k-means clustering algorithm is a modification of the minimal distance clustering method.
In the minimal distance clustering method, the prototypes are only calculated after all observations have been visited, whereas in the k-means clustering method the prototypes are recalculated immediately after an observation has been assigned to another cluster. On average this yields a reduction of iteration steps but makes the method sensitive to the order of the observations. However, the k-means clustering method does not produce empty clusters.

Select initial partition of the observations
Assign or compute for each partition the prototype
DO WHILE stopping criteria is not reached
    FOR all observations i DO
        Compute distance for observation i to all prototypes
        Assign observation i to the cluster with minimal distance
        Compute new prototype of old cluster and new cluster
    END
END

Fig. 43. K-Means Algorithm
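The procedure of Figure 43 can be sketched in Python as follows. The sketch makes several assumptions that the text does not state: the prototype is the cluster mean, the distance is the squared Euclidean distance, the stopping criterion is a sweep without reassignment (bounded by the hypothetical parameter max_sweeps), and the last member of a cluster is never moved so that no cluster becomes empty.

```python
def k_means(points, assignment, k, max_sweeps=100):
    """k-means with immediate prototype update after each reassignment.

    points:     list of equally long numeric tuples
    assignment: initial partition, one cluster index (0..k-1) per point
    Returns the final assignment and the list of prototypes (means).
    """
    assignment = list(assignment)
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, assignment):
        counts[c] += 1
        for t in range(dim):
            sums[c][t] += p[t]
    if min(counts) == 0:
        raise ValueError("every cluster of the initial partition needs a member")

    def sq_dist(p, c):
        # squared Euclidean distance of p to the current prototype of c
        return sum((p[t] - sums[c][t] / counts[c]) ** 2 for t in range(dim))

    for _ in range(max_sweeps):
        changed = False
        for i, p in enumerate(points):
            old = assignment[i]
            if counts[old] == 1:
                continue  # assumption: keep the last member of a cluster
            # nearest prototype; ties are resolved in favor of the old cluster
            new = min(range(k), key=lambda c: (sq_dist(p, c), c != old))
            if new != old:
                # recompute the prototypes of the old and the new cluster
                counts[old] -= 1
                counts[new] += 1
                for t in range(dim):
                    sums[old][t] -= p[t]
                    sums[new][t] += p[t]
                assignment[i] = new
                changed = True
        if not changed:
            break
    prototypes = [tuple(s / counts[c] for s in sums[c]) for c in range(k)]
    return assignment, prototypes
```

Because the prototypes change inside the sweep, the result can depend on the order of the observations, which is the sensitivity mentioned above.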

3.3.9 Graphs Clustering

We may remember the data types described in Chapter 1. Objects or even image scenes might require a structural representation such as an attributed graph (see Definition 1). This kind of representation requires special similarity measures that can express differences in structure and in the attribute values assigned to the components of this structure. Beyond the definition of the similarity, special graph matching algorithms that can quickly compute the similarity function are required. In this chapter we will describe a similarity measure for graphs and how hierarchical clustering can be done for that type of information [Per98].

3.3.10 Similarity Measure for Graphs

We may define our problem of similarity as a problem of finding structural identity or similarity between two structures. If we are looking for structural identity, we need to determine isomorphism. That is a very strong requirement. We may relax this requirement by demanding partial isomorphism.
Based on partial isomorphism, we can introduce a partial order over the set of graphs:

Definition 2
Two graphs $G_1 = (N_1, p_1, q_1)$ and $G_2 = (N_2, p_2, q_2)$ are in the relation $G_1 \le G_2$ iff there exists a one-to-one mapping $f: N_1 \rightarrow N_2$ with
(1) $p_1(x) = p_2(f(x))$ for all $x \in N_1$
(2) $q_1(x,y) = q_2(f(x), f(y))$ for all $x, y \in N_1$, $x \ne y$.

If a graph $G_1$ is included in another graph $G_2$, then the number of nodes of graph $G_1$ is not higher than the number of nodes of $G_2$.
Now, consider an algorithm for determining the part isomorphism of two graphs. This task can be solved with an algorithm based on [Sch89]. The main approach is to find an overset of all possible correspondences f and then exclude non-promising cases. In the following, we assume that the number of nodes of $G_1$ is not greater than the number of nodes of $G_2$.
A technical aid is to assign to each node n a temporary attribute list K(n) of the attribute assignments of all the connected edges:

$$K(n) = (a \mid q(n,m) = a,\ m \in N \setminus \{n\}) \qquad (n \in N).$$

The order of the list elements has no meaning. Because all edges exist in a graph, the length of K(n) is equal to 2*(|N|-1).
For demonstration purposes, consider the example in Figures 44-47. The result would be:

K(X) = (bl, bl, br)
K(Y) = (br, br, bl).

In the worst case, the complexity of the algorithm is $O(|N|^3)$.
In the next step, we assign to each node of $G_1$ all nodes of $G_2$ that could be assigned by a mapping f. That means we calculate the following sets:


$$L(n) = \{\, m \mid m \in N_2,\ p_1(n) = p_2(m),\ K(n) \subseteq K(m) \,\}.$$

The inclusion $K(n) \subseteq K(m)$ means that the list K(n) is contained in the list K(m) without considering the order of the elements. If the list K(n) contains an attribute assignment multiple times, then the list K(m) also has to contain this attribute assignment multiple times.

For the example in Figure 44 and Figure 45 we get the following L-sets:

L(X) = {A}
L(Y) = {B1}
L(Z) = {C}
L(U) = {D, B2}.

We did not consider the attribute assignments of the nodes in this example.
Now the construction of the mapping f is prepared, and if there exists any mapping, then the following condition must hold:

$$f(n) \in L(n) \qquad (n \in N_1).$$

The first condition for the mapping f, regarding the attribute assignments of the nodes, holds because of the construction procedure of the L-sets. In case one set L(n) is empty, there is no partial isomorphism. If all sets are nonempty, a third step checks whether the attribute assignments of the edges match. If there is no match, then the corresponding L-set should be reduced:

for all nodes n1 of G1
    for all nodes n2 of L(n1)
        for all edges (n1, m1) of G1
            if for all nodes m2 of L(m1)
                q1(n1, m1) ≠ q2(n2, m2)
            then L(n1) := L(n1) \ {n2}

If the L-set of a node has been changed during this procedure, then the examinations already carried out should be repeated. That means that this procedure should be repeated until none of the L-sets has been changed. If the result of this step 3 is an empty L-set, then there is also no partial isomorphism. If all L-sets are nonempty, then some mappings f from $N_1$ to $N_2$ have been determined. If each L-set contains exactly one element, then there is only one mapping. In a final step, all mappings which are not of the one-to-one type should be excluded. For example, let us compare the representation of graph_1 and graph_2 in Figure 44 and Figure 45. In step 3, the L-set of pore_1 will not be reduced and we get two solutions, shown in Figure 45 and Figure 46:

N1      f1      f2
X       A       A
Y       B1      B1
Z       C       C
U       D       B2

If we compare the representation of graph_1 and graph_3, the L-set of pore_1 also contains two elements:

L(U) = {T, P}

However, in step 3 the element T will be excluded if the attribute assignments of the edges (U,Y) and (T,R) do not match when node U is examined.
If the L-set of a node has been changed during step 3, then the examinations already carried out should be repeated. That means that step 3 is repeated until there is no change in any L-set.

This algorithm has a total complexity of the order $O(|N_2|^3, |N_1|^3 \cdot |M|^3)$. $|M|$ represents the maximal number of elements in any L-set ($|M| \le |N_2|$).
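The steps described above (K-lists, L-sets, reduction in step 3, and the final exclusion of mappings that are not one-to-one) can be sketched in Python. The sketch is only an illustration under stated assumptions: a graph is represented as a tuple (nodes, p, q) with p a dictionary of node attributes and q a dictionary that assigns an attribute to every ordered pair of different nodes. All function names are hypothetical, a node m2 equal to n2 is treated as a non-match in step 3, and the final enumeration is a simple exhaustive search that is only meant for small graphs and does not reflect the complexity stated above.

```python
from collections import Counter
from itertools import product


def k_list(nodes, q, n):
    """K(n): multiset of the attribute assignments of all edges (n, m)."""
    return Counter(q[(n, m)] for m in nodes if m != n)


def included(small, big):
    """Multiset inclusion K(n) ⊆ K(m), ignoring the order of the elements."""
    return all(big[a] >= count for a, count in small.items())


def candidate_sets(g1, g2):
    """Build the L-sets and reduce them as described in step 3."""
    nodes1, p1, q1 = g1
    nodes2, p2, q2 = g2
    L = {
        n: {m for m in nodes2
            if p1[n] == p2[m]
            and included(k_list(nodes1, q1, n), k_list(nodes2, q2, m))}
        for n in nodes1
    }
    changed = True
    while changed:                      # repeat until no L-set changes
        changed = False
        for n1 in nodes1:
            for n2 in list(L[n1]):
                for m1 in nodes1:
                    if m1 == n1:
                        continue
                    # no node m2 of L(m1) gives a matching edge attribute
                    if all(m2 == n2 or q1[(n1, m1)] != q2[(n2, m2)]
                           for m2 in L[m1]):
                        L[n1].discard(n2)
                        changed = True
                        break
    return L


def one_to_one_mappings(g1, g2):
    """Enumerate all mappings f with f(n) in L(n) that are one-to-one
    and preserve the edge attributes."""
    nodes1, _, q1 = g1
    _, _, q2 = g2
    L = candidate_sets(g1, g2)
    result = []
    for choice in product(*(sorted(L[n]) for n in nodes1)):
        if len(set(choice)) < len(choice):
            continue                    # not one-to-one
        f = dict(zip(nodes1, choice))
        if all(q1[(x, y)] == q2[(f[x], f[y])]
               for x in nodes1 for y in nodes1 if x != y):
            result.append(f)
    return result
```

If any L-set is empty, the enumeration yields no mapping, which corresponds to the case that there is no partial isomorphism.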

[Figures 44 to 47: drawings of the attributed graphs graph_1 (nodes X, Y, Z, U), graph_2 and graph_3 with edge attributes Bl and Br, together with the resulting mappings.]

Fig. 44. Graph_1
Fig. 45. Graph_2 and Result
Fig. 46. Second Result
Fig. 47. Graph_3 and Result

Similarity between attributed graphs can be handled in many ways. We propose the following way for the measure of closeness.
In the definition of part isomorphism, we may relax the required correspondence of the attribute assignments of nodes and edges by introducing ranges of tolerance:

Definition 3
Two graphs $G_1 = (N_1, p_1, q_1)$ and $G_2 = (N_2, p_2, q_2)$ are in the relation $G_1 \le G_2$ iff there exists a one-to-one mapping $f: N_1 \rightarrow N_2$ and thresholds $C_1$, $C_2$ with
(1) $\mathrm{distance}(p_1(x), p_2(f(x))) \le C_1$ for all $x \in N_1$
(2) $\mathrm{distance}(q_1(x,y), q_2(f(x), f(y))) \le C_2$ for all $x, y \in N_1$, $x \ne y$.

Another way to handle similarity lies in the way the L-sets are defined, and particularly in the inclusion of K-lists:
Given a real constant C, $n \in N_1$ and $m \in N_2$. $K(n) \subseteq_C K(m)$ is true iff for each attribute assignment $b_1$ of the list K(n) an attribute assignment $b_2$ of K(m) exists, such that $\mathrm{distance}(b_1, b_2) \le C$. Each element of K(m) is to be assigned to a different element of the list K(n).
Obviously, it is possible to introduce a separate constant for each attribute. Depending on the application, the inclusion of the K-lists may be sharpened by a global threshold: If it is possible to establish a correspondence g according to the requirements mentioned above, then an additional condition should be fulfilled:

$$\sum_{(x,y) \in g} \mathrm{distance}(x,y) \le C_3 \qquad (C_3 \text{ - threshold constant}).$$

Then, for the L-set we get the following definition:

Definition 4
$$L(n) = \{\, m \mid m \in N_2,\ \mathrm{distance}(p_1(n), p_2(m)) \le C_1,\ K(n) \subseteq_C K(m) \,\}.$$

In step 3 of the algorithm for the determination of the one-to-one mapping, we should also consider the defined distance function for the comparison of the attribute assignments of the edges. This new calculation increases the total amount of effort, but the complexity of the algorithm is not changed.
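The tolerant inclusion of K-lists requires that each element of K(n) is paired with its own element of K(m). One way to check this, sketched here in Python, is a bipartite matching with augmenting paths. The choice of this matching technique, the function name and the parameters distance and c are assumptions of the sketch and are not taken from the text.

```python
def tolerant_inclusion(k_n, k_m, distance, c):
    """Check K(n) ⊆_C K(m): every element of k_n is paired with a
    different element of k_m whose distance is at most c."""
    match = {}  # index in k_m -> index in k_n currently paired with it

    def try_assign(i, seen):
        # try to pair element i of k_n, re-pairing earlier elements if needed
        for j, b2 in enumerate(k_m):
            if j in seen or distance(k_n[i], b2) > c:
                continue
            seen.add(j)
            if j not in match or try_assign(match[j], seen):
                match[j] = i
                return True
        return False

    return all(try_assign(i, set()) for i in range(len(k_n)))
```

For numerical edge attributes, distance could for instance be the absolute difference. The global threshold $C_3$ on the summed distances is not covered by this sketch.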

3.3.11 Hierarchical Clustering of Graphs

We have considered similarity based on partial isomorphism as an important relation between the graphs. This gives us an order relation over these graphs. It allows us to organize these graphs into a super graph that is a directed graph. The nodes of this super graph contain a set of graphs for which the predefined threshold for the similarity value holds, and the edges show the part isomorphism relation between these groups of similar graphs.
The super graph is defined as follows:

Definition 5
H is given, the set of all graphs.
A super graph is a tuple IB = (N, E, p), with
(1) $N \subseteq H$, the set of nodes, and
(2) $E \subseteq N^2$, the set of edges. This set should show the partial isomorphism in the set of nodes, meaning it should be valid: $x \le y \Rightarrow (x,y) \in E$ for all $x, y \in N$.
(3) $p: N \rightarrow B$, the mapping of cluster names to the super graph.

Because of the transitivity of part isomorphism, certain edges can be directly derived from other edges and do not need to be stored separately. A relaxation of item (2) in Definition 5 can reduce the required storage capacity.

Learning the Super Graph

From the set of graphs the super graph is learnt as follows:
Input is: Super graph IB = (N, E, p) and graph $x \in H$.


Output is: modified super graph IB' = (N', E', p') with $N' \subseteq N \cup \{x\}$, $E \subseteq E'$, $p \subseteq p'$.
At the beginning of the learning process, or of the process of construction of the super graph, N can be an empty set.
The attribute assignment function p' gives the values (p'(x), (dd)) as an output. This is an answer to the question: What is the name of the graph that is mirrored in the graph x?
The inclusion $N' \subseteq N \cup \{x\}$ says that the graph x may be isomorphic to one graph y contained in the database, so that $x \le y$ and also $y \le x$ hold. Then no new node is created, which means the super graph is not increased.

The algorithm for the construction of the modified super graph IB' can also use the circumstance that no graph is part isomorphic to another graph if it has more nodes than the second one.
As a technical aid for the algorithm, the sets $N_i$ are introduced. $N_i$ contains all graphs of the database IB with exactly i nodes. If the maximal number of nodes of the graphs contained in the database is k, then it is valid:

$$N = \bigcup_{i=1}^{k} N_i$$

The graph which has to be included in the database has l nodes (l > 0). When comparing the current graph with all graphs contained in the database, we can make use of the transitivity of part isomorphism to reduce the number of nodes that have to be compared.

Algorithm

E' := E;
Z := N;
for all y ∈ N_l
    if x ≤ y then [ IB' := IB; return ];
N' := N ∪ {x};
for all i with 0 < i < l;
    for all y ∈ N_i ∩ Z;
        if y ≤ x then [ Z := Z \ {u | u ≤ y, u ∈ Z};
                        E' := E' ∪ {(y,x)} ];
for all i with l < i ≤ k
    for all y ∈ N_i ∩ Z
        if x ≤ y then [ Z := Z \ {u | y ≤ u, u ∈ Z};
                        E' := E' ∪ {(x,y)} ];
p' := p ∪ {(x, (dd: unknown))};

If we use the concept for the determination of similarity, then we can use the algorithm of Section 3.3.5.1 without any changes. But we should notice that for each group of graphs that are approximately isomorphic, the graph that occurred first is stored in the case base. Therefore, it is better to calculate a prototype from every instance and each new instance of a group and to store this prototype for each node of the super graph in the database.

3.3.12 Conclusion

Clustering is also called unsupervised classification since the class labels are not known a priori. However, clustering only allows us to find similar groups based on a sound similarity measure. The method does not make the knowledge about the similarity explicit. The hierarchy shown in the dendrogram (see Figure 42) tells us about the similarity relation among the different subgroups and super groups but does not describe the knowledge that tells us what makes them similar. The clustering algorithms described before were developed for numerical attribute-based representations. Recently, work has been going on to develop k-means algorithms that can work on first-order representations and categorical attributes [GRB99][Hua98]. Similarity measures for categorical data are given in [Agr90]. For more information on similarity measures see [Boc74][Ull96][DuJ]. For clustering and its application to pattern recognition see [NaS93].

3.4 Conceptual Clustering

3.4.1 Introduction

Classical clustering methods only create clusters but do not explain why a cluster has been established. Conceptual clustering methods build clusters and explain why a set of objects forms a cluster. Thus, conceptual clustering is a type of learning by observations and it is a way of summarizing data in an understandable manner [Fis87]. In contrast to hierarchical clustering methods, conceptual clustering methods build the classification hierarchy not only based on merging two groups. The algorithmic properties are flexible enough to dynamically fit the hierarchy to the data. This allows incremental incorporation of new instances into the existing hierarchy and updating this hierarchy according to the new instance. Known conceptual clustering algorithms are Cluster/S [Mic83], Cobweb [Fish87], UNIMEM [Leb85], classit [GLF89] and conceptual clustering of graphs [Per98].

3.4.2 Concept Hierarchy and Concept Description

A concept hierarchy is a directed graph in which the root node represents the set of all input instances and the terminal nodes represent individual instances. Internal nodes stand for the sets of instances attached to those nodes and represent a super concept. The super concept can be represented by a generalized representation of this set of instances such as the prototype, the medoid or a user-selected instance. Therefore a concept C, called a class, in the concept hierarchy is represented by an abstract concept description and a list of pointers to each child concept $M(C) = \{C_1, C_2, ..., C_i, ..., C_n\}$, where $C_i$ is the child concept, called subclass of concept C.

3.4.3 Category Utility Function

Category utility can be viewed as a function of the similarity of the objects within the same class (intra-class similarity) and the similarity of objects in different classes (inter-class similarity). The goal should be to achieve high intra-class similarity while keeping the inter-class similarity low. The category utility function can be based on:

1. a probabilistic concept [Fis87] or
2. a similarity based concept [Per98].

The category utility function in COBWEB is defined based on the probabilistic concept. The category utility function is defined as the increase in the expected number of attribute values that can be correctly guessed

$$P(C_k) \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 \qquad (40)$$

given a partition $\{C_1, C_2, ..., C_n\}$ over the expected number of correct guesses with no such knowledge, $\sum_i \sum_j P(A_i = V_{ij})^2$. This gives the following criterion:

$$CU = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right] \qquad (41)$$

with n the number of categories in a partition. Averaging over the number of categories allows a comparison between partitions of different size.
The similarity-based utility approach requires a proper similarity measure for the description (see Section 3.2.6 and Chapter 3.3). The variance of observations is defined as:

$$S = S_b + S_w \qquad (42)$$

with $S_b$ the inter-class variance and $S_w$ the intra-class variance. A good partition of observations should minimize the intra-class variance while maximizing the inter-class variance. Thus, the utility function should be:

$$UF = \frac{1}{n} (S_b - S_w) \rightarrow MAX! \qquad (43)$$

3 .4 C o n c e p tu a l C lu s te r in g

w ith n th e c lu s te rin g s p ro to ty p e d b a se d re p re

n u o f e s se

7 3

m b e r o f c la s s e s . T h e n o rm a liz a tio n to n a llo w s to c o m p a re p o s s ib le d iffe re n t n u m b e r o f c la s s e s . T h e re p re s e n ta tio n o f e a c h c lu s te r is th e c rib e d w e a th e r b y th e a ttrib u te -b a s e d re p re s e n ta tio n o r b y th e g ra p h n ta tio n .
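The probabilistic criterion of Eq. (41) can be illustrated with a minimal sketch. It assumes objects are tuples of nominal attribute values, that every cluster is non-empty, and that all probabilities are estimated by relative frequencies; the function name `category_utility` and the toy data are hypothetical and not taken from COBWEB itself.

```python
from collections import Counter

def category_utility(partition):
    """Category utility (Eq. 41) for a partition of nominal-attribute objects.

    partition: list of non-empty clusters; each cluster is a list of objects,
    each object a tuple of attribute values (same length everywhere).
    """
    objects = [obj for cluster in partition for obj in cluster]
    n_objects = len(objects)
    n_attrs = len(objects[0])

    def expected_correct_guesses(items):
        # sum_i sum_j P(A_i = V_ij)^2, estimated by relative frequencies
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(obj[i] for obj in items)
            total += sum((c / len(items)) ** 2 for c in counts.values())
        return total

    baseline = expected_correct_guesses(objects)  # no class knowledge
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_objects  # P(C_k)
        cu += p_ck * (expected_correct_guesses(cluster) - baseline)
    return cu / len(partition)  # average over the n categories

# Hypothetical toy data: two attributes (color, shape).
toy = [("red", "round"), ("red", "round"), ("blue", "square"), ("blue", "square")]
good = [toy[:2], toy[2:]]                     # homogeneous clusters
bad = [[toy[0], toy[2]], [toy[1], toy[3]]]    # mixed clusters
print(category_utility(good), category_utility(bad))  # 0.5 0.0
```

The homogeneous partition scores higher because knowing the class makes every attribute value predictable, whereas the mixed partition gives no gain over guessing without class knowledge.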

3.4.4 Algorithmic Properties

The algorithm incrementally incorporates objects into the classification tree, where each node is a (either probabilistic or prototypical) concept that represents an object class. During the construction of the classification tree, the observation gets tentatively classified through the existing classification tree. In doing so, different possibilities for placing an observation into the tree are tried:

1. The object is placed into an existing class,
2. A new class is created,
3. Two existing classes are combined into a single class (see Figure 48), and
4. An existing node is split into two new classes (see Figure 49).

Depending on the value of the utility function for each of the four possibilities, the observation gets placed into one of these four possible places. The whole algorithm is shown in Figure 50.

[Figures: tree diagrams with nodes labeled P, A, B, and "New Node"]

Fig. 48. Node Merging

Fig. 49. Node Splitting

3.4.5 Algorithm

After we have defined our evaluation function and the different learning levels, we can describe our learning algorithm. We adapt the notion of Gennari et al. [GLF89] for concept learning to our problem. If a new instance has to be entered into the concept hierarchy, the instance is tentatively placed into the hierarchy by applying all the different learning operators described before. The operation that gives the highest evaluation score is chosen for incorporating the new instance (see Figure 51). The new instance is entered into the concept hierarchy, and the hierarchy is reorganized according to the selected learning operation.

Input:          Concept Hierarchy CB
                An unclassified instance G
Output:         Modified Concept Hierarchy CB´
Top-level call: Concept Hierarchy (top-node, G)
Variables:      A, B, C, and D are nodes in the hierarchy
                Y, K, S, and Z are partition scores

Concept Hierarchy (N, G)
  If N is a terminal node
  Then Create-new-terminals (N, G)
       Incorporate (N, G)
  Else
    For each child A of node N,
      Compute the score for placing G in A.
    Compute the scores for all other actions with G.
    Let B be the node with the highest score Y.
    Let D be the node with the second highest score.
    Let K be the score for placing G in a new node C.
    Let S be the score for merging B and D into one node.
    Let Z be the score for splitting B into its children.
    If Y is the best score
    Then Concept Hierarchy (B, G) (place G in case class B).
    Else if K is the best score
    Then input a new node C
    Else if S is the best score,
    Then let O be Merge (B, D, N)
         Concept Hierarchy (N, G)
    Else if Z is the best score,
    Then Split (B, N)
         Concept Hierarchy (N, G)

Operations over Concept Hierarchy
Variables: N, W, O, B, D, and E are nodes in the hierarchy.
           G is the new instance.

Incorporate (N, G)
  Update the prototype and the variance of class N

Create new terminals (N, G)
  Create a new child W of node N.
  Initialize prototype and variance

Merge (B, D, N)
  Make O a new child of N
  Remove B and D as children of N
  Add the instances of B and D and all children of B and D to the node O
  Compute prototype and variance from the instances of B and D

Split (B, N)
  Divide the instances of node B into two subsets according to the evaluation criteria
  Add children D and E to node N
  Insert the two subsets of instances into the corresponding nodes D and E
  Compute new prototype and variance for nodes D and E
  Add children to node D if the subgraph of the children is similar to the subgraph of node D
  Add children to node E if the subgraph of the children is similar to the subgraph of node E

Fig. 50. Algorithm
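The selection step of the algorithm, choosing the learning operation with the highest evaluation score, might be sketched as follows. The operation names and the tie-breaking order in `PREFERENCE` are assumptions of this sketch; the text only states (in Section 3.4.6.4) that insertion into an existing node is favored when the scores are equal.

```python
def choose_operation(scores):
    """Return the learning operation with the highest evaluation score.

    scores: dict mapping operation name -> score. Ties are resolved by the
    order in PREFERENCE (hypothetical; only the preference for inserting
    into an existing node is taken from the text).
    """
    PREFERENCE = ["insert_existing", "new_node", "merge", "split"]
    best = max(scores.values())
    for op in PREFERENCE:
        if op in scores and scores[op] == best:
            return op
    raise ValueError("no known operation in scores")

# Hypothetical scores for one tentative placement.
print(choose_operation({"insert_existing": 0.2, "new_node": 0.5,
                        "merge": 0.1, "split": 0.3}))  # new_node
```

Preferring the existing node on ties keeps the hierarchy from filling up with nodes that contain a single case; such a node can still be split later.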


3.4.6 Conceptual Clustering of Graphs

3.4.6.1 Notion of a Case and Similarity Measure

The basis for the development of our system is a set of cases $CB = \{G_1, G_2, \ldots, G_i, \ldots, G_n\}$, where each case is a 3-tuple $G_i = (N, p, q)$, which is a structural symbolic representation of an image, and a similarity measure [Per98] for structural representations. For the current image, an image graph is extracted by image analysis. This structure is used for indexing. The interpretation of a current image $S$ is done by case comparison: Given an image $S = (N_s, p_s, q_s)$, find a case $G_m$ in the case base $CB$ which is most similar to the current image. Output the case $G_m$ and the stored solution. For similarity determination between our image graphs, we chose part isomorphism:

Definition 1
Two graphs $G_1 = (N_1, p_1, q_1)$ and $G_2 = (N_2, p_2, q_2)$ are in the relation $G_1 \le G_2$ iff there exists a one-to-one mapping $f: N_1 \rightarrow N_2$ with
(1) $p_1(x) = p_2(f(x))$ for all $x \in N_1$
(2) $q_1(x, y) = q_2(f(x), f(y))$ for all $x, y \in N_1$, $x \ne y$.

If a graph $G_1$ is included in another graph $G_2$, then the number of nodes of graph $G_1$ is not higher than the number of nodes of $G_2$. In order to handle unsharp attributes and distortions in a graph representation, we relaxed the required correspondence of the attribute assignments of nodes and edges by introducing ranges of tolerance according to the semantic terms.
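Definition 1 can be checked directly for small graphs by trying every one-to-one mapping. The following is a brute-force sketch, assuming a graph is stored as a triple of a node list, a node-label dictionary `p`, and an edge-label dictionary `q` keyed by ordered node pairs; it uses exact label equality and does not model the tolerance ranges mentioned above. The labels in the example are hypothetical.

```python
from itertools import permutations

def is_part_isomorphic(g1, g2):
    """Check G1 <= G2 (Definition 1) by brute force over injective mappings.

    A graph is a triple (nodes, p, q): nodes is a list, p maps node -> label,
    q maps an ordered node pair (x, y) -> edge label (missing pair = no edge).
    """
    nodes1, p1, q1 = g1
    nodes2, p2, q2 = g2
    if len(nodes1) > len(nodes2):
        return False  # G1 cannot be included in a smaller graph
    for image in permutations(nodes2, len(nodes1)):
        f = dict(zip(nodes1, image))  # candidate one-to-one mapping
        if any(p1[x] != p2[f[x]] for x in nodes1):
            continue  # condition (1) violated
        if all(q1.get((x, y)) == q2.get((f[x], f[y]))
               for x in nodes1 for y in nodes1 if x != y):
            return True  # condition (2) holds for all pairs
    return False

# Hypothetical node and edge labels.
small = (["a", "b"],
         {"a": "pore", "b": "crack"},
         {("a", "b"): "left_of", ("b", "a"): "right_of"})
large = ([1, 2, 3],
         {1: "crack", 2: "pore", 3: "pore"},
         {(2, 1): "left_of", (1, 2): "right_of",
          (3, 1): "above", (1, 3): "below"})
print(is_part_isomorphic(small, large))  # True
print(is_part_isomorphic(large, small))  # False
```

The number of candidate mappings grows factorially with the graph size, so this form is only useful for illustrating the definition on very small graphs.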

3.4.6.2 Evaluation Function

When several distinct partitions are generated over the case base, a heuristic is used to evaluate these partitions. This function evaluates the global quality of a single partition and favors partitions that maximize the potential for inferring information. In doing this, it attempts to minimize the within-case-class variances and to maximize the between-case-class variances. The employment of an evaluation function saves us from the problem of defining a threshold for class similarity and inter-class similarity, on which the proper behavior of the learning algorithm would depend. This threshold is usually domain dependent and is not easy to define.

Given a partition $\{C_1, C_2, \ldots, C_m\}$, the partition which maximizes the difference between the between-case-class variance $s_B^{*2}$ and the within-case-class variance $s_W^{*2}$ is chosen as the right partition:

$$ SCORE = \frac{1}{m} \left( s_B^{*2} - s_W^{*2} \right) \Rightarrow \text{MAX!} \tag{44} $$

The normalization to $m$ (the number of case classes in the partition) is necessary to compare different partitions.

If $G_{pj}$ is the prototype of the $j$-th case class in the hierarchy at level $k$, $\bar{G}$ the mean graph of all cases in level $k$, and $G_{vj}^2$ the variance of the graphs in the partition $j$, then it follows for SCORE:

$$ SCORE = \frac{1}{m} \left[ \sum_{j=1}^{m} p_j \left( G_{pj} - \bar{G} \right)^2 - \sum_{j=1}^{m} p_j G_{vj}^2 \right] \tag{45} $$

with $p_j$ the relative frequency of cases in the partition $j$.
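The behavior of the SCORE criterion can be illustrated with a small sketch. As a stand-in for graphs it assumes that each case is a numeric feature vector, that the prototype is the class mean, and that the squared Euclidean distance plays the role of the graph dissimilarity; none of these substitutions is taken from the text, which works on attributed graphs.

```python
def score(partition):
    """SCORE in the spirit of Eq. (45) for a partition of numeric vectors.

    partition: list of non-empty case classes; each case is a list of floats.
    """
    cases = [c for cls in partition for c in cls]
    dim = len(cases[0])

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(cases)  # stand-in for the mean graph
    between = within = 0.0
    for cls in partition:
        p_j = len(cls) / len(cases)          # relative frequency of class j
        proto = mean(cls)                    # stand-in for the prototype
        variance = sum(sq_dist(c, proto) for c in cls) / len(cls)
        between += p_j * sq_dist(proto, overall)
        within += p_j * variance
    return (between - within) / len(partition)

# Hypothetical one-dimensional cases.
compact = [[[0.0], [1.0]], [[10.0], [11.0]]]
mixed = [[[0.0], [10.0]], [[1.0], [11.0]]]
print(score(compact), score(mixed))  # 12.375 -12.375
```

The partition that groups nearby cases obtains the higher score, so no similarity threshold has to be fixed in advance.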

3.4.6.3 Prototype Learning

When learning a class of cases, cases for which the evaluation measure holds are grouped together. The case that appeared first would be the representative of the class of these cases, and each new case would be compared to this case. Obviously, the case that appeared first might not always be a good case. Therefore, it is better to compute a prototype for the class of cases.

Definition 3
A graph $G_p = (N_p, p_p, q_p)$ is a prototype of a class of cases $C_i = \{G_1, G_2, \ldots, G_t(N_t, p_t, q_t)\}$ iff $G_p = C_i$ and if there is a one-to-one mapping $f: N_p \rightarrow N_i$ with

(1) $p_p(x_i) = \frac{1}{t} \sum_{n=1}^{t} p_n(f(x_i))$ for all $x_i \in N_p$ and

(2) $q_p(x_i, y_i) = \frac{1}{t} \sum_{n=1}^{t} q_n(f(x_i), f(y_i))$ for all $x_i, y_i \in N_p$.

In the same manner, we can calculate the variance of the graphs in one case class. The resulting prototype is not a case that happened in reality. Therefore, another strategy for calculating a prototype can be to calculate the median of the cases in a case class.

3.4.6.4 An Example of a Learned Concept Hierarchy

The experiment was made on images from a non-destructive testing domain. The system was given 40 images of one defect type having different subclasses, e.g. crack-like-defect-pressure-100 or crack-like-defect-pressure-50. The images were analyzed by the image analysis procedure described in Section 2, which generated an attributed graph from each image. These graphs were given to the system for learning.

The learning process is demonstrated in Figure 52 on the four cases shown in Figure 51. Case 1 is already contained in the case base, and case 2 is to be included in the case base. The learning algorithm first tests where in the hierarchy the case should be incorporated. The scores of all three options in question show the same result. In such a case, the algorithm favors inserting into an existing node. This prevents us from building a hierarchy with many nodes that include only one case. Later, the algorithm can split this node if necessary. When case 3 is given to the new case base, the algorithm favors opening a new node instead of splitting two nodes. The new case 4 is correctly inserted into the case class having the closest similarity value, see Table 9.

The system was given 40 cases for learning the case base. For evaluating the system, we considered the grouping of the cases into the hierarchy and into the case classes. Based on our domain knowledge, we could recognize that meaningful groupings were achieved by the learning algorithm.

Table 9. Dissimilarity between Image Graphs

      1    2            3            4
1     -    0.0253968    0.1120370    0.1977513
2          -            0.10674603   0.2041005
3                       -            0.1804232
4                                    -

[Figure: four images, labeled 1 to 4, with their graphs]

Fig. 51. Image with graph used for initial learning process
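How a new case could be assigned to the closest case class can be sketched with the dissimilarity values of Table 9. The sketch assumes that the dissimilarity between a case and a class is the mean of the pairwise dissimilarities to the class members; this averaging rule and the function names are assumptions, since the text does not spell out how the class-level value is formed.

```python
# Pairwise dissimilarities between the four image graphs (Table 9).
DISSIMILARITY = {
    (1, 2): 0.0253968,
    (1, 3): 0.1120370,
    (2, 3): 0.10674603,
    (1, 4): 0.1977513,
    (2, 4): 0.2041005,
    (3, 4): 0.1804232,
}

def dissimilarity(a, b):
    """Symmetric lookup; a case has dissimilarity 0 to itself."""
    return 0.0 if a == b else DISSIMILARITY[(min(a, b), max(a, b))]

def closest_class(case, classes):
    """Return the class with the lowest mean dissimilarity to case."""
    def mean_dissimilarity(members):
        return sum(dissimilarity(case, m) for m in members) / len(members)
    return min(classes, key=mean_dissimilarity)

# Case 4 against a case base that holds the classes {1, 2} and {3}.
print(closest_class(4, [(1, 2), (3,)]))  # (3,)
```

With these values, case 4 has a mean dissimilarity of about 0.2009 to the class {1, 2} and 0.1804 to the class {3}, so the class {3} is the closest one under the assumed averaging rule.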

[Figure: for each inserted case (Case 2, Case 3, Case 4), the panels "Insert to existing node", "New Node", and "Refinement" show the candidate hierarchy together with its values SB, SW, and SCORE, followed by the resulting case base]

Fig. 52. Demonstration of the Learning Process


3.4.7 Conclusion

Conceptual clustering is a useful data mining technique. Conceptual clustering methods build clusters and explain why a set of objects forms a cluster. We have shown how it can be used for clustering attributed graphs and what the output looks like. Conceptual clustering methods are used, for example, to segment images or to classify satellite images.

3.5 Evaluation of the Model

Our model is learnt from a finite set of observations from the domain that represents only a cutout of the whole universe. Because of this restriction, the question arises: How good is our learnt model? We are interested in two descriptive properties:

1. Representation capability and
2. Generalization capability.

Representation capability describes how well the model fits the data from which it was learnt, while generalization capability describes the ability of the model to generalize from these data so that it can predict the outcome for unknown samples. Both criteria can be expressed by the error rate.

From statistics we know that we can only calculate an estimate of the error rate, because of the limited size of the sample set and the problem of representing the problem domain by the sample set. The goal is to calculate an error rate which hopefully comes close to the true error rate. Different sampling strategies have been developed to ensure a good estimate of the error rate. These sampling strategies are:

- Test-and-train
- Random sampling
- Cross validation
- Bootstrapping.







3.5.1 Error Rate, Correctness, and Quality

The error rate $f_r$ can be calculated by:

$$ f_r = \frac{N_f}{N} \tag{46} $$

with $N_f$ the number of falsely classified samples and $N$ the whole number of samples.

More specific error rates can be obtained if we calculate the contingency table, see Table 10. The fields of the table contain the real class distribution within the data set and the class distribution after the samples have been classified, as well as the marginal distribution $c_{ij}$. The main diagonal holds the number of correctly classified samples. The last column shows the number of samples assigned to the class shown in column 1, and the last row shows the real class distribution. From this table, we can calculate parameters that describe the quality of the classifier.

Table 10. Contingency Table

                                Real Class Index
                           1       ...    i       ...    m       Sum
Assigned Class    1        c_11    ...    ...     ...    c_1m
Index             ...      ...     ...    ...     ...    ...
                  j        c_j1    ...    c_ji    ...    c_jm
                  ...      ...     ...    ...     ...    ...
                  m        c_m1    ...    ...     ...    c_mm
                  Sum

The correctness or accuracy $p$ is the number of correctly classified samples in relation to the number of samples:

$$ p = \frac{\sum_{i=1}^{m} c_{ii}}{\sum_{i=1}^{m} \sum_{j=1}^{m} c_{ij}} \tag{47} $$

This measure is the complement of the error rate, $p = 1 - f_r$.

The classification quality $p_{ki}$ is the number of correctly classified samples of one class $i$ in relation to all samples of class $i$, and the classification quality $p_{ti}$ is the number of correctly classified samples in relation to the number of correctly and falsely classified samples assigned to class $i$:

$$ p_{ki} = \frac{c_{ii}}{\sum_{j=1}^{m} c_{ji}} \tag{48} $$

$$ p_{ti} = \frac{c_{ii}}{\sum_{j=1}^{m} c_{ij}} \tag{49} $$

These measures allow us to study the behavior of a classifier with respect to a particular class. The overall error rate of a classifier may look good, but when looking at the classification quality $p_{ki}$ for a particular class we may find it not acceptable. Such applications are known, for example, from non-destructive testing, where it is important to detect a defect such as "crack" inside a component of a power plant with high accuracy, while detecting a defect "pore" correctly as a defect, but not as a defect "pore", is acceptable, since it is better to replace the component one time too often than to have overlooked a defect such as "crack".
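The measures of this section can be computed directly from a contingency table. The sketch below assumes the layout of Table 10, that is, rows index the assigned class and columns the real class, and that no class is empty; the counts in the example are hypothetical.

```python
def classifier_quality(c):
    """Accuracy, error rate, and per-class qualities from a contingency table.

    c[j][i] = number of samples of real class i assigned to class j
    (rows: assigned class, columns: real class).
    """
    m = len(c)
    total = sum(sum(row) for row in c)
    correct = sum(c[i][i] for i in range(m))
    accuracy = correct / total        # correctness p
    error_rate = 1.0 - accuracy       # error rate f_r
    # p_ki: correct samples of class i relative to all real samples of class i
    p_k = [c[i][i] / sum(c[j][i] for j in range(m)) for i in range(m)]
    # p_ti: correct samples of class i relative to all samples assigned to i
    p_t = [c[i][i] / sum(c[i][j] for j in range(m)) for i in range(m)]
    return accuracy, error_rate, p_k, p_t

# Hypothetical two-class example: index 0 = "crack", index 1 = "pore".
table = [[18, 10],   # assigned "crack": 18 real cracks, 10 real pores
         [2, 70]]    # assigned "pore":   2 real cracks, 70 real pores
accuracy, error_rate, p_k, p_t = classifier_quality(table)
print(accuracy, p_k, p_t)
```

In this example the overall accuracy is 0.88, while 90% of the real cracks are found (`p_k[0]`) and only about 64% of the samples assigned to "crack" are real cracks (`p_t[0]`), which is the kind of class-specific behavior the overall error rate hides.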

3.5.2 Sensitivity and Specificity

A specific expression of these measures is the sensitivity and the specificity. These measures are used in medicine. For explanation purposes, let us consider the problem displayed in Figure 53.

Suppose we have to decide whether an object has a defect inside or not. For this two-class problem, the decision of the prediction system about a presented sample with a defect is either true positive (TP) or false negative (FN). If we present a sample without a defect to the system, the decision can be either true negative (TN) or false positive (FP).

[Figure: a prediction system whose answer for a sample with a defect is TP or FN (TP + FN = 100%) and whose answer for a sample without a defect is TN or FP (TN + FP = 100%)]

Fig. 53. The four Cases of Diagnosis

Now, we can define the evaluation criteria called sensitivity and specificity. Let TP denote the true positive, TN the true negative, FP the false positive, and FN the false negative answers; then the sensitivity SN and the specificity SP are:

$$ SN = \frac{TP}{TP + FN} \tag{50} $$

$$ SP = \frac{TN}{TN + FP} \tag{51} $$

These two measures are a specific expression for a two-class problem. Sometimes it can be better to require a high sensitivity from the system instead of a high specificity. For a medical application such as the diagnosis of cancer, that would mean that the system can infer the negative diagnosis "no cancer" with high probability, while the decision for the positive diagnosis "cancer" needs some further revision. A high accuracy in sorting out negative cases makes more sense from the human, economic, and diagnostic point of view than vice versa.
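Equations (50) and (51) translate into a few lines of code. The sketch assumes boolean labels where `True` means "defect present", and that both kinds of samples occur at least once; the label lists are hypothetical.

```python
def sensitivity_specificity(actual, predicted):
    """Return (SN, SP) for boolean labels; True means 'defect present'."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)            # defect, detected
    fn = sum(a and not p for a, p in pairs)        # defect, missed
    tn = sum(not a and not p for a, p in pairs)    # no defect, none reported
    fp = sum(not a and p for a, p in pairs)        # no defect, false alarm
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical sample: 4 objects with a defect, 6 without.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, False, False, False, False, True, True]
sn, sp = sensitivity_specificity(actual, predicted)
print(sn, sp)
```

Here three of the four defects are found (SN = 0.75) and four of the six defect-free objects are recognized as such (SP = 2/3).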

3.5.3 Test-and-Train

If a large enough and representative sample set is available, the error rate can be calculated based on the test-and-train strategy. For this, the data set is split up into a design data set and a test data set. Test data set and design data set are disjoint. The model is learnt based on the design data set. The test data set contains only samples that have not been used for designing the model. If the model can predict the outcome correctly for most of these samples, then it shows good generalization capability. If we apply the design data set to the model once again and measure the resulting error rate, then we have estimated a measure for the representation capability of the model. This measure tells us how well the model fits the design data set. Usually we will not care so much about good representation capability; rather, we are interested in how well the model can predict the correct outcome for unseen samples.

3.5.4 Random Sampling

To overcome the influence of a single, particular split of the data, we can partition the data set into multiple training and test sets. Based on that, we can learn and evaluate multiple classifiers. The resulting error rate is then the average of all single error rates. A much more accurate estimator is the mean error rate. For this, we randomly select a test and a training set from the sample set n times and calculate the error rate each time. The resulting error rate is the mean error rate of all n classifiers obtained from the n test and training sets. The n test sets created in this way may not be disjoint from each other due to the random sampling.
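The random sampling strategy can be sketched as a loop over repeated random splits. The learner is passed in as two callables, `fit` and `predict`; these names, the 30% test fraction, and the nearest-mean toy classifier are assumptions made only to keep the example self-contained.

```python
import random

def random_sampling_error(samples, labels, fit, predict, n_rounds=10,
                          test_fraction=0.3, seed=0):
    """Mean error rate over n_rounds random test/train splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, int(round(test_fraction * len(samples))))
    rates = []
    for _ in range(n_rounds):
        rng.shuffle(indices)  # test sets of different rounds may overlap
        test, train = indices[:n_test], indices[n_test:]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, samples[i]) != labels[i] for i in test)
        rates.append(wrong / n_test)
    return sum(rates) / n_rounds

def fit_nearest_mean(xs, ys):
    """Toy learner: store the mean feature value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

# Hypothetical, well-separated one-dimensional data.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 2.1, 2.2, 2.3, 2.4, 2.5]
ys = ["ok"] * 5 + ["defect"] * 5
print(random_sampling_error(xs, ys, fit_nearest_mean, predict_nearest_mean))  # 0.0
```

Because the two classes in the toy data do not overlap, every split yields an error rate of zero; with real data the single rates scatter, and their mean is the reported estimate.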

3.5.5 Cross Validation

This sampling strategy is computationally expensive, but it is suitable for small sample sets in order to predict the error rate accurately and without bias. We can distinguish between the leave-one-out method and k-fold cross validation. In the k-fold technique, we subdivide the data into k roughly equal-sized parts and then repeat the modeling process k times, leaving one part out each time for validation purposes. If k equals the number of samples, so that each part contains a single sample, the method is called leave-one-out.
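The subdivision into k roughly equal-sized parts can be sketched as an index generator. The function name is hypothetical, and the sketch splits the samples in their given order; in practice the samples would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k roughly equal-sized folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]          # part left out
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(6, 3):
    print(train, test)

# Leave-one-out is the special case k = n_samples.
leave_one_out = list(k_fold_indices(4, 4))
```

Each sample is used exactly once for validation, so the k error rates can be averaged into one estimate without any test sample having been seen during the corresponding training run.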


3.5.6 Conclusion

We have described the commonly used methods and measures for evaluating a model. For information on bootstrapping, we refer to Efron [Efr82]. Receiver Operating Characteristic (ROC) analysis is described in Metz [Met78].

3.6 Feature Subset Selection

3.6.1 Introduction

Selecting the right set of features for classification is one of the most important problems in designing a good classifier. Very often we don't know a priori what the relevant features are for a particular classification task. One popular approach to address this issue is to collect as many features as we can prior to the learning and data-modeling phase. However, irrelevant or correlated features, if present, may degrade the performance of the classifier. In addition, large feature spaces can sometimes result in overly complex classification models that may not be easy to interpret.

In the emerging area of data mining applications, users of data mining tools are faced with the problem of data sets that are comprised of large numbers of features and instances. Such kinds of data sets are not easy to handle for mining. The mining process can be made easier to perform by focusing on a subset of relevant features while ignoring the other ones. In the feature subset selection problem, a learning algorithm is faced with the problem of selecting some subset of features upon which to focus its attention.

3.6.2 Feature Subset Selection Algorithms

Following Jain et al. [JaZ97], we can describe feature subset selection as follows: Let $Y$ be the original set of features, with cardinality $n$. Let $d$ represent the desired number of features in the selected subset $X$, $X \subseteq Y$. Let the feature selection criterion function for the set $X$ be represented by $J(X)$. Without any loss of generality, let us consider a higher value of $J$ to indicate a better feature subset. Formally, the problem of feature selection is to find a subset $X \subseteq Y$ such that $|X| = d$ and

$$ J(X) = \max_{Z \subseteq Y,\ |Z| = d} J(Z). $$

An exhaustive approach to this problem would require examining all $\binom{n}{d}$ possible $d$-subsets of the feature set $Y$.

Feature selection algorithms differ in the search strategy, the feature selection criteria, the way they add or delete an individual feature or a subset of features at each round of feature selection, and the overall model for feature selection.
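The exhaustive approach can be written down in a few lines, which also makes its cost visible. The criterion function `J` below, with its relevance values and redundancy penalty, is purely hypothetical and only serves to make the example runnable (`math.comb` requires Python 3.8 or later).

```python
from itertools import combinations
from math import comb

def exhaustive_selection(features, d, criterion):
    """Return the d-subset X of features that maximizes criterion J(X)."""
    return max(combinations(features, d), key=criterion)

# Hypothetical relevance of single features and one redundant pair.
relevance = {"area": 0.9, "perimeter": 0.8, "contrast": 0.5, "mean_gray": 0.2}

def J(subset):
    """Toy criterion: summed relevance minus a redundancy penalty."""
    value = sum(relevance[f] for f in subset)
    if "area" in subset and "perimeter" in subset:
        value -= 0.6  # the two features carry nearly the same information
    return value

best = exhaustive_selection(list(relevance), 2, J)
print(best, comb(len(relevance), 2))  # ('area', 'contrast') 6
```

For 4 features and d = 2 only 6 subsets have to be evaluated, but for 100 features and d = 10 the count `comb(100, 10)` already exceeds 10^13, which is why the search strategies discussed in this section matter.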

According to the quality criteria [NaS93] for feature selection, the models for feature selection can be divided into the filter model and the wrapper model [Cov77], [KoJ98].

3.6.2.1 The Wrapper and the Filter Model for Feature Subset Selection

The wrapper model (see Figure 54) attempts to identify the best feature subset for use with a particular algorithm, while the filter approach (see Figure 55) attempts to assess the merits of features from the data alone. The inclusion of the particular classifier in the feature selection process makes the wrapper approach more computationally expensive, and the resulting feature subset will only be appropriate for the classifier used, while the filter approach is classifier independent. However, both models require a search strategy that should come close to optimal. Various search strategies have been developed in order to reduce the computation time.

[Figure: Feature Set -> (Feature Subset Search, Feature Subset Evaluation, Learning Algorithm) -> Learning Algorithm]

Fig. 54. Wrapper Model

[Figure: Feature Set -> Feature Subset Selection Algorithm -> Learning Algorithm]

Fig. 55. Filter Model

Different filter approaches have been developed over time in traditional pattern recognition and artificial intelligence. Two well-known algorithms that have been developed within the artificial intelligence community are FOCUS [AlD94] and RELIEF [KiR92]. The FOCUS algorithm starts with an empty feature set and carries out an exhaustive search until it finds a minimal combination of features. It works on binary, noise-free data. The RELIEF algorithm assigns a relevance weight to each feature, which is meant to denote the relevance of the feature to the target concept. RELIEF samples instances randomly from the training set and updates the relevance values based on the difference between the selected instance and the two nearest instances of the same and opposite classes. Lui et al. [LuS96] proposed a probabilistic approach to the search problem of the optimal set of features based on the Las Vegas algorithm. Koller et al. [KolS96] proposed an approach based on using cross-entropy to minimize the amount of predictive information loss during feature elimination. It starts with the full feature set and eliminates features according to the selection criteria. It leads to suboptimal results. Both approaches should be easy to implement and they are efficient in handling large data sets. An overview of feature subset selection algorithms from the pattern recognition side is given in Jain et al. [JaZ97]. Beyond the exhaustive search, branch-and-bound feature selection can be used to find the optimal subset of features much more quickly than exhaustive search. One drawback is that the branch-and-bound procedure requires the feature selection criteria to be monotone. A suboptimal feature selection method that has been shown to give results that come close to optimal is the SFFS algorithm. Beyond that, some common learning algorithms have built-in feature selection, for example, C4.5. The feature selection in C4.5 may be viewed as a filter approach, just as the CM algorithm and the SFFS algorithm. The CM algorithm selects features according to their "obligation" to the class discrimination in the context of other features. In opposition to that, the SFFS algorithm selects features according to the statistical correlation between each feature and the class.
Besides that, both algorithms differ in the search strategy. While the SFFS algorithm is a traditional technique used in pattern recognition, the CM algorithm is a new algorithm developed by the data mining community.

3.6.3 Feature Selection Done by Decision Tree Induction

Determining the relative importance of a feature is one of the basic tasks during decision tree generation. The most frequently used criterion for feature selection is information-theoretic, such as the Shannon entropy measure $I$ for a data set. If we subdivide a data set using values of an attribute as separators, we obtain a number of subsets. For each of these subsets we can compute the information value $I_i$. $I_i$ will be smaller than $I$, and the difference $(I - I_i)$ is a measure of how well the attribute has discriminated between different classes. The attribute that maximizes this difference is selected. The measure can also be viewed as a class separability measure. The main drawback of the entropy measure is its sensitivity to the number of attribute values [WhL94]. Therefore C4.5 uses the gain ratio. However, this measure suffers the drawback that it may choose attributes with very low information content [LdMan91]. C4.5 [Qui93] uses a univariate feature selection strategy. At each level of the tree building process only one attribute, the attribute with the highest value for the selection criterion, is picked out of the set of all attributes.
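The selection step described above can be illustrated with a small sketch. It is not the C4.5 implementation; it assumes discrete attribute values, takes the information value of a split as the size-weighted average entropy of the resulting subsets, and uses hypothetical function names and toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(attribute_values, class_labels):
    """Entropy of the data set minus the weighted entropy of the subsets
    obtained by splitting on the attribute values (the difference I - I_i)."""
    total = len(class_labels)
    subsets = {}
    for value, label in zip(attribute_values, class_labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(class_labels) - remainder


def gain_ratio(attribute_values, class_labels):
    """Information gain normalized by the entropy of the split itself."""
    split_information = entropy(attribute_values)
    if split_information == 0:  # the attribute takes a single value only
        return 0.0
    return information_gain(attribute_values, class_labels) / split_information


# Toy data (assumption): one attribute separates the classes, the other does not.
classes = ["a", "a", "b", "b"]
attributes = {
    "separating": ["x", "x", "y", "y"],
    "uninformative": ["x", "y", "x", "y"],
}
best = max(attributes, key=lambda name: information_gain(attributes[name], classes))
```

In this toy data set the attribute named `separating` is chosen, because splitting on it leaves only pure subsets.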
Afterwards the sample set is split into sub-sample sets according to the values of this attribute, and the whole procedure is recursively repeated until only samples from one class are in the remaining sample set or until the remaining sample set has no discrimination power anymore and the tree building process stops. As we can see, feature selection is only done at the root node over the entire decision space. After this level, the sample set is split into sub-samples and only the most important feature in the remaining sub-sample set is selected. Geometrically it means that the search for good features is only done in orthogonal decision subspaces, which might not represent the real distributions, beginning after the root node. Thus, unlike statistical feature search strategies [Fuk90], this approach is not driven by the evaluation measure for the combinatorial feature subset; it is only driven by the best single feature. This might not lead to an optimal feature subset in terms of classification accuracy. Decision tree users and researchers have recognized the impact of applying a full set of features to a decision tree building process versus applying only a judiciously chosen subset. It is often the case that the latter produces decision trees with lower classification errors, particularly when the subset has been chosen by a domain expert. Our experiments were intended to evaluate the effect of using multivariate feature selection methods as pre-selection steps to a decision tree building process.

Fig. 56. Similarity between Features (dendrogram over the attributes)

3.6.4 Feature Subset Selection Done by Clustering

The problem of feature subset selection involves finding a "good" set of features under some objective function. We can consider our feature subset selection problem as a problem of finding the set of features which are most dissimilar to each other. Two features having a high similarity value are used in the same manner for the target concept. They are redundant and can be removed from the feature set. It can be shown in practice that this assumption holds for most of the applications.


To solve this task we need to select a proper similarity measure, calculate the similarity between the features and visualize the similarity relation between the features. In Figure 56 we can see a dendrogram that visualizes the similarity relation among various features. The dendrogram was calculated from a feature set describing lung nodules in an x-ray image [Per00]. We can see that Character of Lung Pleura and Within Lung Pleura are more or less used in the same manner since they are very similar to each other with a value of 0.1. From the dendrogram we can select a subset of features by choosing a value for the maximal similarity between the features and collecting, for each of the remaining groups of features in the dendrogram, a feature that should be inserted into the feature subset.

3.6.5 Contextual Merit Algorithm

The contextual merit (CM) algorithm [Hon96] employs a merit function based upon weighted distances between examples which takes into account complete feature correlations to the instance class. The motivation underlying this approach was to weight features based upon how well they discriminate instances that are close to each other in the Euclidean space and yet belong to different classes. By focusing upon these nearest instances, the context of other attributes is automatically taken into account.

To compute contextual merit, the distance $d^k_{rs}$ between the values $z_{kr}$ and $z_{ks}$ taken by feature $k$ for examples $r$ and $s$ is used as a basis. For symbolic features, the inter-example distance is 0 if $z_{kr} = z_{ks}$, and 1 otherwise. For numerical features, the inter-example distance is

$$d^k_{rs} = \min\left(\frac{|z_{kr} - z_{ks}|}{t_k},\ 1\right),$$

where $t_k$ is a threshold for feature $k$ (usually 1/2 of the magnitude of the range of the feature). The total distance between examples $r$ and $s$ is

$$D_{rs} = \sum_{k=1}^{N_f} d^k_{rs},$$

with $N_f$ the total number of features, and the contextual merit for a feature $f$ is

$$M_f = \sum_{r=1}^{N} \sum_{s \in C(r)} w^f_{rs}\, d^f_{rs},$$

where $N$ is the total number of examples, $C(r)$ is the set of examples not in the same class as example $r$, and $w^f_{rs}$ is a weight function chosen so that examples that are close together are given greater influence in determining the merit of each feature. In practice, it has been observed that $w^f_{rs} = 1/D_{rs}^2$ if $s$ is one of the $k$ nearest neighbors to $r$, and 0 otherwise, provides robust behavior as a weight function. Additionally, using $\log_2 |C(r)|$ as the value for $k$ has also exhibited robust behavior. This approach to computing and ordering features by their merits has been observed to be very robust across a wide range of examples.
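The merit computation can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of [Hon96]: the number of neighbors is rounded up from $\log_2 |C(r)|$, pairs with total distance 0 are skipped, a feature with zero range gets the threshold 1, and all function names and the toy data are hypothetical.

```python
import math


def feature_distance(a, b, threshold=None):
    """Component distance: 0/1 for symbolic values (threshold is None),
    min(|a - b| / t_k, 1) for numerical values."""
    if threshold is None:
        return 0.0 if a == b else 1.0
    return min(abs(a - b) / threshold, 1.0)


def half_range_thresholds(X):
    """t_k = half of the value range of numerical feature k (1.0 if the range is 0)."""
    columns = list(zip(*X))
    return [((max(c) - min(c)) / 2) or 1.0 for c in columns]


def contextual_merit(X, y, thresholds):
    """Return the merit M_f of every feature f for examples X with class labels y."""
    n_features = len(X[0])
    merits = [0.0] * n_features
    for r in range(len(X)):
        # C(r): all examples that are not in the class of example r
        counter_examples = [s for s in range(len(X)) if y[s] != y[r]]
        if not counter_examples:
            continue
        components = {
            s: [feature_distance(X[r][k], X[s][k], thresholds[k])
                for k in range(n_features)]
            for s in counter_examples
        }
        totals = {s: sum(components[s]) for s in counter_examples}  # D_rs
        # Assumption: k = ceil(log2 |C(r)|), but at least one neighbor
        k_nearest = max(1, math.ceil(math.log2(len(counter_examples))))
        nearest = sorted(counter_examples, key=lambda s: totals[s])[:k_nearest]
        for s in nearest:
            if totals[s] == 0:  # identical examples: no feature separates them
                continue
            weight = 1.0 / totals[s] ** 2
            for f in range(n_features):
                merits[f] += weight * components[s][f]
    return merits


# Toy data (assumption): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [0.9, 0.9]]
y = [0, 0, 1, 1]
merits = contextual_merit(X, y, half_range_thresholds(X))
```

Sorting the features by descending merit then gives the feature ranking; in the toy data the first feature receives the clearly larger merit.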

3.6.6 Floating Search Method

The feature subset selection algorithm described in the preceding section performs in the first step a greedy search over all examples and focuses afterwards on the computation of the $\log_2 |C(r)|$ nearest examples based upon a Euclidean space. The first step is a very time-consuming process, and the question arises whether the second step of the algorithm, where the real merits are calculated, will lead to a near-optimal solution. Various other search strategies have been developed to find the subset of features optimizing an adopted criterion. The well-known Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) are step-optimal only, since the best (the worst) feature is always added (discarded) in SFS and SBS, respectively. This results in nested feature subsets without any chance to correct the decision in later steps, causing the performance to be often far from optimal. The idea behind the previous methods, aimed at counteracting the nesting effect, can be implemented more efficiently by considering conditional inclusion and exclusion of features controlled by the value of the criterion itself. The Sequential Floating Forward Selection (SFFS) procedure consists of applying after each forward step a number of backward steps as long as the resulting subsets are better than the previously evaluated ones at that level. Consequently, there are no backward steps at all if the performance cannot be improved. The description of the algorithm can be found in [Pud94].
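The floating principle can be sketched as follows. This is a simplified illustration and not the exact procedure of [Pud94]: the criterion is an arbitrary user-supplied function (larger is better), the search runs up to `target_size + delta` features so that backward steps can revise smaller subsets, and the names `sffs`, `delta` and the toy criterion values are assumptions.

```python
def sffs(n_features, criterion, target_size, delta=1):
    """Sequential floating forward selection (sketch).

    criterion(subset) takes a list of feature indices and returns a score.
    Returns (score, subset) of the best subset found with target_size features.
    """
    max_size = min(n_features, target_size + delta)
    selected = []
    best = {}  # subset size -> (score, subset)

    def remember(subset):
        """Store the subset if it beats the best one of the same size."""
        score = criterion(subset)
        size = len(subset)
        if size not in best or score > best[size][0]:
            best[size] = (score, list(subset))
            return True
        return False

    while len(selected) < max_size:
        # Forward step: add the feature that gives the best enlarged subset.
        candidates = [f for f in range(n_features) if f not in selected]
        added = max(candidates, key=lambda f: criterion(selected + [f]))
        selected = selected + [added]
        remember(selected)
        # Conditional backward steps: drop the least significant feature as
        # long as this improves on the best subset of the smaller size.
        while len(selected) > 2:
            removed = max(selected,
                          key=lambda f: criterion([g for g in selected if g != f]))
            reduced = [g for g in selected if g != removed]
            if not remember(reduced):
                break
            selected = reduced
    return best[target_size]


# Toy criterion (assumption): plain forward selection would end with {0, 2},
# because feature 0 is the best single feature, although {1, 2} is better.
scores = {
    frozenset([0]): 3.0, frozenset([1]): 2.0, frozenset([2]): 1.0,
    frozenset([0, 1]): 4.0, frozenset([0, 2]): 4.5, frozenset([1, 2]): 6.0,
    frozenset([0, 1, 2]): 7.0,
}
score, subset = sffs(3, lambda s: scores[frozenset(s)], target_size=2)
```

In the toy example the backward step taken from the three-feature subset discards feature 0 and thereby corrects the earlier forward decision, which is exactly what the nested subsets of SFS cannot do.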
Here, we want to compare the performance of the former algorithm with the SFFS algorithm when the evaluation criterion is the Mahalanobis distance $d_{ij}$:

$$d_{ij} = (x_i - x_j)^{T} S^{-1} (x_i - x_j),$$

where $S$ is the pooled sample covariance matrix and $S^{-1}$ its inverse. The Mahalanobis distance incorporates the correlation between features and standardizes each feature to zero mean and unit variance.

3.6.7 Conclusion

Describing a problem by as many features as we can does not necessarily result in a classification model with the best classification accuracy. This problem is well known as the curse of dimensionality. Rather than taking the whole feature set for the construction of the model, it is better to select from this feature set a relevant subset of features before the construction of the model. This may lead to better classification accuracy. Besides that, a smaller feature set does not incur as many costs for the assessment/calculation of the features, which is important for image interpretation tasks [Per00], where the computational costs for extracting the features may be high and require special purpose hardware, or even for engineering tasks [KoH01]. However, feature subset selection may not lead to more compact models and the prediction accuracy may not increase dramatically. It has been shown in [Pern00] that the SFFS algorithm performs slightly better than the CM algorithm. Nevertheless, the algorithmic properties of the CM algorithm are better for handling large databases.

4 Applications

4.1 Controlling the Parameters of an Algorithm/Model by Case-Based Reasoning

4.1.1 Modelling Concerns

Designing a heuristic model that simulates a task from the objective reality is a tricky problem. As a basis for the model design, it is required to define and to describe the right cut-out of the domain space where the model should act in, see Figure 57. Afterwards, the domain space should be described by a representative sample set that allows us to calculate the model parameters and to evaluate the model.

Fig. 57. Model Development Process

The complexity of the problems usually makes it hard to completely define and to describe the right cut-out of the objective reality. The acquisition of a sufficiently large and representative sample set often takes a long time and requires having an acquisition unit (sensors, computers) installed at the domain.

P. Perner: Data Mining on Multimedia Data, LNCS 2558, pp. 91-116, 2002. © Springer-Verlag Berlin Heidelberg 2002


Examples of these problems in character recognition are given by Rice et al. [RNN99]. In optical character recognition, imaging defects (e.g., heavy print, light print, or stray marks) can occur and influence the recognition results. Rice et al. attempted to systematically overview the factors that influence the result of an optical character recognition system, and how different systems respond to them. However, it is not yet possible to observe all real-world influences, nor to provide a sufficiently large sample set for system development and testing. Therefore, one model will not work over all realizations of the objective reality, and a refinement of the model or an intelligent control of various models has to be done during the lifetime of the system [Per99]. We will describe how case-based reasoning can be used to overcome the modelling burden. We will describe it based on a task for image segmentation. However, the described strategy can be used for any other problem where model parameters should be selected based on the actual situation to ensure good quality output.

4.1.2 Case-Based Reasoning Unit

The case-based reasoning unit for image segmentation consists of a case base in which formerly processed cases are stored.
A case comprises image information, non-image information (e.g., image acquisition parameters, object characteristics, and so on), and image-segmentation parameters. The task is now to find the best segmentation for the current image by looking in the case base for similar cases. Similarity determination is based on both non-image information and image information. The evaluation unit will take the case with the highest similarity score for further processing. If there are two or more cases with the same similarity score, the case that appears first will be taken. After the closest case has been chosen, the image-segmentation parameters associated with the selected case will be given to the image-segmentation unit, and the current image will be segmented (see Figure 58). It is assumed that images having similar image characteristics will show similarly good segmentation results when the same segmentation parameters are applied to these images. The discussion that follows will assume the definition of regions based on constant local image features to be used for segmentation, and the classification of regions into two object classes (brain and liquor, i.e. cerebrospinal fluid) for labeling. In the approach used for brain/liquor determination, the volume data of one patient (a sequence of a maximum of 30 CT-image slices) is given to the CBR image-segmentation unit.
The CT images are stored in DICOM format. Each file consists of a header and the image matrix. The header contains stored information about the patient and the image acquisition. The images are processed, slice by slice, before the brain/liquor volume ratio can be calculated. First, each image is preprocessed in order to eliminate the non-interesting image details, like the skull and the head shell, from the image. Afterwards, the non-image information is extracted from the image file header (see Section 4.1.4.1). From the image matrix contained in the DICOM file, the statistical features describing the image characteristics are processed (see Section 4.1.4.2). This information, together with the non-image information, is given to the unit that determines the similarity. The similarity between the non-image information and the image information of the current case and the cases in the case base is calculated (see Section 4.1.5). The closest case is selected, and the segmentation parameters (see Section 4.1.6) are given to the segmentation unit. The segmentation unit takes the parameters, adjusts the segmentation algorithm and segments the image into brain and liquor areas. The resulting liquor area is displayed on screen to the user by red coloring over the area in the original image. This is done in order to give the user visual control of the result. These processing steps are done slice by slice. After each slice has been processed, the volume for brain and liquor is calculated. Finally, the brain/liquor volume ratio is computed and displayed to the user.
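The retrieval step of this unit can be sketched as follows. The sketch only illustrates the selection rule (highest similarity score, first case wins on ties); the `Case` structure, the attribute names and in particular `toy_similarity` are assumptions made for illustration and do not reproduce the similarity measure of Section 4.1.5.

```python
from dataclasses import dataclass


@dataclass
class Case:
    non_image: dict        # e.g. age group, sex, slice position
    image_features: dict   # statistical features of the image
    seg_params: dict       # stored solution: segmentation parameters


def toy_similarity(query, case):
    """Placeholder similarity in [0, 1] (assumption, not the book's measure)."""
    keys = list(query.non_image)
    matches = sum(query.non_image[k] == case.non_image.get(k) for k in keys)
    non_image_sim = matches / len(keys) if keys else 0.0
    diffs = [abs(query.image_features[k] - case.image_features[k])
             for k in query.image_features]
    image_sim = 1.0 / (1.0 + sum(diffs))
    return 0.5 * (non_image_sim + image_sim)


def retrieve(case_base, query, similarity):
    """Return (case, score) of the most similar case.

    The strict '>' keeps the first case when several cases share the score."""
    best_case, best_score = None, float("-inf")
    for case in case_base:
        score = similarity(query, case)
        if score > best_score:
            best_case, best_score = case, score
    return best_case, best_score


# Hypothetical case base with made-up attribute and parameter names.
case_base = [
    Case({"sex": "f", "position": "middle"}, {"mean": 80.0}, {"threshold": 40}),
    Case({"sex": "f", "position": "middle"}, {"mean": 80.0}, {"threshold": 55}),
    Case({"sex": "m", "position": "top"}, {"mean": 120.0}, {"threshold": 70}),
]
query = Case({"sex": "f", "position": "middle"}, {"mean": 82.0}, {})
closest, score = retrieve(case_base, query, toy_similarity)
```

The segmentation parameters of `closest` would then be handed to the segmentation unit; here the first two cases are equally similar, so the first one is returned.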

Fig. 58. Architecture of an Image Segmentation Unit based on Case-Based Reasoning

4.1.3 Management of the Case Base

The result of the segmentation process is observed by a user. He compares the original image with the labeled image on display. If he detects deviations of the marked areas in the segmented image from the object area in the original image that should be labeled, then he will evaluate the result as incorrect, and case-base management will start. This will also be done if no similar case is available in the case base. The evaluation procedure can also be done automatically [Zha97]. Once the user observes a bad result, he will tag the case as "bad case". The tag describes the user's critique in more detail. For the brain/liquor application it is necessary to know the following information for the modification phase: too much or too little brain area, too much or too little liquor area, and a similarity value less than a predefined value.


In an off-line phase, the best segmentation parameters for the image are determined, and the attributes that are necessary for similarity determination are calculated from the image. Both the segmentation parameters and the attributes calculated from the image are stored in the case base as a new case. In addition to that, the non-image information is extracted from the file header and stored together with the other information in the case base. During storage, case generalization will be done to ensure that the case base will not become too large. Case generalization will be done by grouping the segmentation parameters into several clusters. Each different combination of segmentation parameters will be a cluster. The cluster name will be stored in the case together with the other information. Generalization will be done over the values of the parameters describing a case. The unit for modifying the segmentation is shown in Figure 60.
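The grouping of cases by their segmentation parameters can be sketched as follows. This is only an illustration of the statement that each different parameter combination forms a cluster; the dictionary layout, the parameter names and the cluster naming scheme are assumptions.

```python
def assign_clusters(cases):
    """Attach a cluster name to every case; cases with the same combination
    of segmentation parameters end up in the same cluster."""
    cluster_names = {}
    for case in cases:
        # A sorted tuple of (name, value) pairs identifies the combination.
        key = tuple(sorted(case["seg_params"].items()))
        if key not in cluster_names:
            cluster_names[key] = f"cluster_{len(cluster_names) + 1}"
        case["cluster"] = cluster_names[key]
    return cluster_names


# Hypothetical cases with made-up parameter names.
cases = [
    {"id": 1, "seg_params": {"threshold": 40, "smoothing": 3}},
    {"id": 2, "seg_params": {"smoothing": 3, "threshold": 40}},
    {"id": 3, "seg_params": {"threshold": 55, "smoothing": 5}},
]
clusters = assign_clusters(cases)
```

Cases 1 and 2 share one cluster because their parameter combinations are identical, so generalization over the describing attributes can then be carried out per cluster.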

Fig. 59. Case Base Management (blocks: Input Image, CBR, Segmented Image, Evaluation (visual or automatic), Good? yes/no, S-Parameter, Case Attributes, Case Base, Selective Case Registration, Off-line Segmentation and Processing of Image Characteristics, Case Generalization)

4.1.4 Case Structure and Case Base

A case consists of non-image information, parameters describing the image characteristics itself, and the solution (the segmentation parameters).


Fig. 60. User Interface of the Modification Unit

4.1.4.1 Non-image Information

The non-image information necessary for this brain/liquor application will be described below. For other applications, different, appropriate non-image information will be contained in the case. For example, motion analysis [KuP99] involves the use of the camera position, the relative movement of the camera and the object category itself as non-image information. For brain/liquor determination in CT images, patient-specific parameters (like age and sex), slice thickness and number of slices are required. This information is contained in the header of the CT image file so that these parameters can be automatically accessed. Young patients have smaller liquor areas than old patients. The images therefore show different image characteristics. The anatomical structures (and therefore the image characteristics) also differ between women and men. The number of slices may vary from patient to patient because of this biological diversity, and so may the starting position of the slices. Therefore, the numerical values are mapped onto three intervals: bottom, middle and top slices. These intervals correspond to the segments of the head with different image characteristics (see Figure 61). The intervals can easily be calculated by dividing the number of slices by three. The remaining uncertainty in position can be ignored.
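The mapping of a slice number onto the three intervals can be sketched as follows. The function name is hypothetical, and the sketch assumes that slices are numbered from 1 starting at the bottom of the head, which the text does not state explicitly.

```python
def slice_segment(slice_index, n_slices):
    """Map a 1-based slice index to 'bottom', 'middle' or 'top'
    by dividing the number of slices by three."""
    if not 1 <= slice_index <= n_slices:
        raise ValueError("slice_index must lie between 1 and n_slices")
    third = n_slices / 3
    if slice_index <= third:
        return "bottom"
    if slice_index <= 2 * third:
        return "middle"
    return "top"


# Example for a patient with 30 slices.
segments = [slice_segment(i, 30) for i in (1, 10, 11, 20, 21, 30)]
```

For 30 slices this assigns slices 1 to 10 to the bottom, 11 to 20 to the middle and 21 to 30 to the top interval.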

Fig. 61. CT Images showing the different segments of the head

4.1.4.2 Image Information

The kind of image information used to describe a case is closely related to the kind of similarity measure used for similarity determination. There is a lot of work going on at present in developing new measures for comparing grey-scale images [ZSt95][WBO97] for various objectives like image retrieval and image evaluation. Before a decision was made to employ a particular similarity measure for this work, one of these new measures was evaluated against the measure already being used. The reason for choosing one particular similarity measure, as well as the appropriate image information to describe a case, will be briefly discussed below.


4.1.5 Image Similarity Determination

4.1.5.1 Image Similarity Measure 1 (ISim_1)

The similarity measure developed by Zamperoni and Starovoitov [ZSt95] can take the image matrix itself and calculate the similarity between two image matrices (see Figure 62). The input to the algorithm is the two images that are being compared. According to the specified distance function, the proximity matrix is calculated for one pixel at position r,s in image A to the pixel at the same position in image B, and to surrounding pixels within a predefined window. The same is done for the pixel at position r,s in image B. Then, clustering is performed, based on that matrix, in order to get the minimum distance between the compared pixels. Afterwards, the average of the two values is calculated. This is repeated until all the pixels of both images have been processed. From the average minimal pixel distance, the final dissimilarity for the whole image is calculated. Use of an appropriate window should make this measure invariant to scaling, rotation and translation, depending on the window size. For this kind of similarity determination, it is necessary to store the whole image matrix as the image information for each case. However, the similarity measure based on Zamperoni's work has some drawbacks, which will be discussed in Section 4.1.5.3.
[Figure 62: flowchart. Input: grey-level matrices of image A and image B; determination of the window size; begin with the first pixel, r = s = 1; computation of the proximity matrices d(a_rs, W_B) and d(b_rs, W_A) based on f_pp = d_city; search for the minima f_pi(a, B) = d_pi(a_rs, W_B) and f_pi(b, A) = d_pi(b_rs, W_A); f_ii = 1/2 (d_pi(a_rs, W_B) + d_pi(b_rs, W_A)); accumulation of the sum of f_ii(r, s)^2 while s and r are incremented up to N; output: D(A, B).]

Fig. 62. Flowchart for similarity calculation after Zamperoni et al.
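The loop described for ISim_1 can be illustrated with a small sketch. It is not the implementation of [ZSt95]: the pixel distance used here (grey-level difference plus city-block offset inside the window) and the final root-mean-square aggregation are assumptions made for illustration, and the names `min_window_distance` and `isim1_dissimilarity` are hypothetical.

```python
import math

def min_window_distance(value, r, s, other, half):
    """Smallest distance from one pixel value to the pixels of `other`
    inside a (2*half+1) x (2*half+1) window centred on position (r, s)."""
    rows, cols = len(other), len(other[0])
    best = math.inf
    for u in range(max(0, r - half), min(rows, r + half + 1)):
        for v in range(max(0, s - half), min(cols, s + half + 1)):
            # Assumed pixel distance: grey-level difference plus city-block offset.
            d = abs(value - other[u][v]) + abs(u - r) + abs(v - s)
            best = min(best, d)
    return best

def isim1_dissimilarity(a, b, half=1):
    """Average the two directed minimum distances per pixel and aggregate them."""
    rows, cols = len(a), len(a[0])
    total = 0.0
    for r in range(rows):
        for s in range(cols):
            f_ab = min_window_distance(a[r][s], r, s, b, half)
            f_ba = min_window_distance(b[r][s], r, s, a, half)
            f = 0.5 * (f_ab + f_ba)
            total += f * f
    # Assumed aggregation: root mean square over all pixels.
    return math.sqrt(total / (rows * cols))
```

With a window of half-width 0 the measure compares only pixels at identical positions; a larger window tolerates small shifts at the price of a cost that grows with the window area, which is the trade-off discussed in Section 4.1.5.3.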


4 Applications

4.1.5.2 Image Similarity Measure 2 (ISim_2)

A simpler approach is to calculate features from an image that will describe the statistical properties of the image. These features are statistical measures of the gray level, like mean, variance, skewness, kurtosis, variation coefficient, energy, entropy, and centroid [DrS82] (see Table 11). The similarity between two images is calculated on the basis of these features.

Table 11. Statistical Measures for the description of the image
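As an illustration of the feature set named above, the following sketch computes the measures from the gray-level distribution of an image given as a list of rows. The formulas are the common first-order histogram definitions and are an assumption here, since the table body is not reproduced; in particular, the centroid is taken as the intensity-weighted centre of the image. The function name `gray_level_features` is hypothetical.

```python
import math
from collections import Counter

def gray_level_features(image):
    """First-order statistical features of the gray levels of `image`."""
    pixels = [g for row in image for g in row]
    n = len(pixels)
    probs = {g: c / n for g, c in Counter(pixels).items()}
    mean = sum(g * p for g, p in probs.items())
    variance = sum((g - mean) ** 2 * p for g, p in probs.items())
    std = math.sqrt(variance)
    # Skewness and kurtosis are undefined for a constant image; 0.0 is used then.
    skewness = sum((g - mean) ** 3 * p for g, p in probs.items()) / std ** 3 if std > 0 else 0.0
    kurtosis = sum((g - mean) ** 4 * p for g, p in probs.items()) / std ** 4 if std > 0 else 0.0
    total = sum(pixels)
    if total > 0:
        # Assumed definition: intensity-weighted centroid (row, column).
        centroid = (
            sum(r * g for r, row in enumerate(image) for g in row) / total,
            sum(c * g for row in image for c, g in enumerate(row)) / total,
        )
    else:
        centroid = ((len(image) - 1) / 2, (len(image[0]) - 1) / 2)
    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "variation_coefficient": std / mean if mean else 0.0,
        "energy": sum(p * p for p in probs.values()),
        "entropy": -sum(p * math.log2(p) for p in probs.values()),
        "centroid": centroid,
    }
```

Only this short feature vector has to be stored with a case, which is the reason given later for preferring the second measure.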

4.1.5.3 Comparison of ISim_1 and ISim_2

Investigations were undertaken into the behaviour of the two similarity measures. The similarities between the image slices of one patient are calculated, based first on IM1 and then on IM2. The single-linkage method is then used to create a dendrogram, which graphically shows the similarity relation between the slices. The dendrogram (see Figure 64) based on IM2 shows two clusters: one for the slices in the middle and at the top of the head, and one for the slices at the bottom of the head. There is no clear cluster for the top and the middle slices. Nevertheless, the differences in the similarity values are big enough to make a distinction between these slices. The highest dissimilarity is recognized between the slices from the bottom, which happens because of the high complexity of the image structures in that region. The dendrogram based on IM1 shows a finer graduation between the various slices (see Figure 63). It can also distinguish better between the bottom, middle and top slices. However, when slices from different patients are compared, it shows some drawbacks, which are caused by rotation, scaling and translation. The invariant behavior of this measure is related to the window size. Compensating for these effects requires a large window size, which on the other hand causes high computation time (more than 3 minutes on a 4-node system based on PowerPC 604 and a window size of 30x30 pixels). This makes this measure unsuitable for the problem at hand. The similarity measure based on IM1 has limited invariance in the face of rotation, scaling and translation. Therefore, it was decided to use the similarity measure based on IM2. Moreover, in the case of IM1, it is necessary to store the whole image matrix as a case and calculate the similarity over the entire image matrix. The computational costs for the similarity calculation are very high, and so would be the required storage capacity. The lower sensitivity of IM2 to the different sectors of the brain can be compensated by introducing the slice number as non-image information, as discussed in Section 3.1.
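The single-linkage step used to build the dendrograms can be sketched as follows. This is a naive agglomerative procedure over a precomputed dissimilarity matrix, written for illustration only; the function name `single_linkage` and the returned merge format are assumptions.

```python
def single_linkage(dist):
    """Agglomerative clustering with the single-linkage rule.

    `dist` is a symmetric dissimilarity matrix (list of lists). The result is
    the merge sequence [(members_a, members_b, merge_distance), ...], which is
    the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges
```

Feeding the slice-to-slice dissimilarities of one patient into such a routine yields the merge heights from which dendrograms like those in Figures 63 and 64 are drawn.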

Fig. 63. Dendrogram for image similarity based on IM1

For the similarity measure based on IM2, it is only necessary to calculate features from the images before the cases can be stored in the case base. This calculation is of low computational cost. Each image is described by statistical measures of the gray level like: mean, variance, skewness, kurtosis, variation coefficient, energy, entropy, and centroid. This information, together with the non-image information and segmentation parameters, comprises a case.

4.1.6 Segmentation Algorithm and Segmentation Parameters

Fig. 64. Dendrogram for image similarity based on IM2

The gray level histogram is calculated from the original image. This histogram is smoothed by some numerical functions and heuristic rules [OPR78][Lee86] to find the cut points for the liquor (cerebrospinal fluid) and brain gray-level areas. The parameters of the function and the rules are stored with the cases, and given to the segmentation unit if the associated case is selected. The following steps are performed. The histogram is smoothed by a numerical function. There are two parameters to select: the complexity of the interpolation function and the interpolation width. Then the histogram is segmented into intervals, such that each begins with a valley, contains a peak and ends with a valley. The peak-to-shoulder ratio of each interval is tested first. An interval is merged with the neighbor sharing the higher of its two shoulders if the ratio of peak height to the height of its higher shoulder is greater than or equal to some threshold. Finally, the number of the remaining intervals is compared to a predefined number of intervals. If more than this have survived, the intervals with the highest peaks are selected. The number of intervals depends on the number of classes into which the image should be segmented. The thresholds are calculated and then applied to the image.
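The histogram-based thresholding steps can be illustrated with a simplified sketch. Several choices here are assumptions rather than the method of [OPR78][Lee86]: a moving average stands in for the interpolation function (its `width` playing the role of the interpolation width), the peak-to-shoulder merging test is omitted, and each threshold is placed at the lowest point between two neighbouring selected peaks. The names `smooth`, `find_thresholds` and `classify` are hypothetical.

```python
from bisect import bisect_right

def smooth(hist, width):
    """Moving-average smoothing; `width` is the (odd) window width."""
    half = width // 2
    smoothed = []
    for i in range(len(hist)):
        window = hist[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def find_thresholds(hist, n_classes, width=3):
    """Keep the n_classes highest peaks and cut at the valley between neighbours."""
    h = smooth(hist, width)
    peaks = [i for i in range(1, len(h) - 1) if h[i - 1] < h[i] >= h[i + 1]]
    strongest = sorted(sorted(peaks, key=lambda i: h[i], reverse=True)[:n_classes])
    return [
        min(range(left, right + 1), key=lambda i: h[i])
        for left, right in zip(strongest, strongest[1:])
    ]

def classify(gray_value, thresholds):
    """Class index of a gray value: the number of thresholds it reaches or exceeds."""
    return bisect_right(thresholds, gray_value)
```

In a case-based setting, `n_classes` and `width` would be among the segmentation parameters stored with a case and handed to the segmentation unit when that case is retrieved.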

4.1.7 Similarity Determination

4.1.7.1 Overall Similarity

Similarity comprises two parts: non-image similarity and image similarity. The final similarity is calculated by:

\[ Sim = \frac{1}{2}\,(Sim_N + Sim_I) = \frac{1}{2}\,\bigl(S(C_i, b) + 1 - dist_{AB}\bigr) \qquad (52) \]


It was decided that non-image and image similarity should have equal influence on the final similarity. Only when both similarities have a high value will the final similarity be high.

4.1.7.2 Similarity Measure for Non-image Information

Tversky's similarity measure is used for the non-image information [Tve77]. The similarity between a case C_i and a new case b presented to the system is:

\[ S(C_i, b) = \frac{|A_i|}{\alpha\,|A_i| + \beta\,|D_i| + \chi\,|E_i|}, \qquad \alpha = 1,\ \beta = \chi = 0.5 \qquad (53) \]

with A_i, the features that are common to both C_i and b; D_i, the features that belong to C_i but not to b; and E_i, the features that belong to b but not to C_i. Here |.| denotes the number of features in the respective set.
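Eq. (53) translates directly into set operations. In the sketch below a case is represented as a set of (attribute, value) pairs, which is an assumed encoding of the non-image information; the function name `tversky_similarity` and the example attributes are hypothetical.

```python
def tversky_similarity(case, new_case, alpha=1.0, beta=0.5, chi=0.5):
    """Similarity of Eq. (53) for two sets of (attribute, value) pairs."""
    common = len(case & new_case)         # |A_i|: features shared by both
    only_case = len(case - new_case)      # |D_i|: features only in the stored case
    only_new = len(new_case - case)       # |E_i|: features only in the new case
    denominator = alpha * common + beta * only_case + chi * only_new
    # Two empty descriptions have no evidence of similarity; 0.0 is an assumption.
    return common / denominator if denominator > 0 else 0.0

stored = {("sex", "f"), ("age", "60-70"), ("slice", "middle")}
query = {("sex", "f"), ("age", "60-70"), ("slice", "top")}
similarity = tversky_similarity(stored, query)
```

With the weights of Eq. (53), identical descriptions give a similarity of 1, and every mismatching feature lowers the value.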

4.1.7.3 Similarity Measure for Image Information

For the numerical data,

\[ dist_{AB} = \frac{1}{k} \sum_{i=1}^{k} w_i \left| \frac{C_{iA} - C_{i\min}}{C_{i\max} - C_{i\min}} - \frac{C_{iB} - C_{i\min}}{C_{i\max} - C_{i\min}} \right| \qquad (54) \]

is used, where C_{iA} and C_{iB} are the ith feature values of image A and B, respectively. C_{i min} is the minimum value of the ith numeric or symbolic feature, C_{i max} is the maximum value of the ith feature, and w_i is the weight attached to the ith feature with w_1 + w_2 + ... + w_i + ... + w_k = 1. For the first run, w_i is set to one. Further studies will deal with learning of feature weights.

4.1.8 Knowledge Acquisition Aspect

The case base has to be filled with a large enough set of cases. As described above, the header of the DICOM file contains the non-image information. This information can be automatically extracted from the file and stored in the case base. Likewise, the image information can be extracted from the image matrix contained in the DICOM file. The determination of the segmentation parameters can be done automatically by a specific evaluation procedure or under the control of a knowledge engineer. In our approach described in [Per99] it is done under the control of a knowledge engineer. However, this task is efficiently supported by the acquisition unit shown in Fig. 60. The histogram is smoothed and processed, step by step, according to the implemented segmentation algorithm under the control of the knowledge engineer. The knowledge engineer can control each segmentation parameter and preview the segmentation results on screen. Once the best segmentation result has been reached, the chosen segmentation parameters are stored, together with the other information, in the case base.
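For the cases collected in this way, the image distance of Eq. (54) and the overall similarity of Eq. (52) can be sketched as follows. The weights default to one, as stated for the first run; skipping features whose minimum and maximum coincide is an assumption added to avoid a division by zero, and the function names are hypothetical.

```python
def feature_distance(features_a, features_b, minima, maxima, weights=None):
    """Distance of Eq. (54) between two feature vectors of equal length k."""
    k = len(features_a)
    if weights is None:
        weights = [1.0] * k  # first run: every weight is set to one
    total = 0.0
    for a, b, low, high, w in zip(features_a, features_b, minima, maxima, weights):
        span = high - low
        if span == 0:
            continue  # assumed handling: a constant feature carries no information
        total += w * abs((a - low) / span - (b - low) / span)
    return total / k

def overall_similarity(sim_non_image, dist_ab):
    """Overall similarity of Eq. (52): equal influence of both parts."""
    return 0.5 * (sim_non_image + 1.0 - dist_ab)

dist = feature_distance([0, 10], [5, 10], minima=[0, 0], maxima=[10, 20])
sim = overall_similarity(0.8, dist)
```

Retrieval then amounts to evaluating `overall_similarity` against every stored case and selecting the case with the highest value, whose segmentation parameters are passed on to the segmentation unit.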

4.1.9 Conclusion

In this chapter we have described how case-based reasoning can be used for controlling the parameters of an algorithm or a model. We have described our approach based on a task for image segmentation. However, the described methodology is general enough to be used for other problems as well. The chapter should give the reader an idea of how such a system can be developed and what the advantages are. It should inspire readers to use this strategy for their own problems. We believe that case-based reasoning is a good strategy to overcome the modeling burden and that it can efficiently support all processes during the development and the lifetime of a system.

4.2 Mining Images

4.2.1 Introduction

Most of the recent work on image mining is devoted to knowledge discovery, such as clustering and mining association rules. This work deals with the problem of searching for regions of special visual attention or interesting patterns in a large set of images, e.g. in CT and MRI image sets [MDH99], [EYD00] or in satellite images [BuL00]. Usually, experienced experts have discovered this information. However, the amount of images being created by modern sensors makes it necessary to develop methods that can perform this task for the expert. Therefore, standard primitive features that are able to describe the visual changes in the image background are extracted from the images, and the significance of these features is tested by a sound statistical test [MDH99], [BuL00]. Clustering is applied in order to explore the images, seeking similar groups of spatially connected components [ZaH00] or similar groups of objects [EYD00]. Association rules are used for finding significant patterns in the images [BuL00]. The measurement of image features in these regions or patterns gives the basis for pattern recognition and image classification. Computer-vision research is concerned with creating proper models of objects and scenes, obtaining image features, and developing decision rules that allow one to analyze and interpret the observed images. Methods of image processing, segmentation, and feature measurement are successfully used for this purpose [KeP91], [SNS88], [Per98]. The mining process is done bottom-up. As many numerical features as possible are extracted from the images in order to achieve the final goal, the classification of the objects [PZJ01][FiB01]. However, such a numerical approach usually does not allow the user to understand the way in which the reasoning process has been done.

The second approach to pattern recognition and image classification is based on the symbolic description of images [BiC00] made by the expert [Per00]. This approach can present to the expert, in explicit form, the way in which the image has been interpreted. Experts having the domain knowledge usually prefer the second approach.

Here we describe our methodology for mining images. We usually start with a knowledge acquisition phase to understand what the expert is looking for in the images, and then we develop the proper feature extraction procedure, which gives the basis for the creation of the database. Afterwards we can start with the mining experiment.

4.2.2 Preparing the Experiment

The whole procedure for image mining is summarized in Figure 65. It is partially based on our methodology for image-knowledge engineering [Per94]. The process can be divided into five major steps: 1. brainstorming, 2. interviewing process, 3. collection of image descriptions into the database, 4. mining experiment, and 5. review.

Brainstorming is the process of understanding the problem domain and identifying the important knowledge pieces on which the knowledge-engineering process will focus. For the interviewing process we used our methodology for image-knowledge engineering described in [Per94] in order to elicit the basic attributes as well as their attribute values. Then the proper image processing and feature extraction algorithms are identified for the automatic extraction of the features and their values.

Based on these results we then collected into the database the image readings done by the expert and those done by the automatic image analysis and feature-extraction tool. The resulting database is the basis for our mining experiment. The error rate of the mining result was then determined based on sound statistical methods such as cross-validation. The error rate as well as the rules were then reviewed together with the expert and, depending on the quality of the results, the mining process stops or goes into a second trial, starting either at the top with eliciting new attributes or at a deeper level, e.g. with reading new images or incorporating new image-analysis and feature-extraction procedures. The incorporation of new image-analysis and feature-extraction procedures seems to be an interactive and iterative process at the moment, since it is not possible to provide ad hoc sufficient image analysis procedures for all image features and details appearing in the real world. The mining procedure stops as soon as the expert is satisfied with the results.
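The error-rate estimation by cross-validation mentioned above can be sketched generically. The sketch assumes that a learner is supplied as a pair of functions, `train(samples, labels)` and `predict(model, sample)`; these names, the shuffling seed, and the majority-class learner used as a stand-in are illustrative and not part of the described tool.

```python
import random
from collections import Counter

def cross_validation_error(samples, labels, train, predict, n_folds=5, seed=0):
    """Estimate the error rate by n-fold cross-validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        # Count misclassifications on the fold that was not used for training.
        errors += sum(predict(model, samples[i]) != labels[i] for i in fold)
    return errors / len(samples)

def train_majority(samples, labels):
    """Stand-in learner: remember the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model
```

In the procedure of Figure 65, the learner would be the decision-tree induction and the resulting error rate would be one of the items reviewed with the expert.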


[Figure 65: flowchart. Start with brainstorming (understand problem domain); collect prototype images (image catalogue); natural-language description of images (image catalogue with natural-language description); interview of the expert and structured interview (revised natural-language description and marked image details; circle object of interest or draw detail); extract attributes and attribute values (attribute list); select automatic feature descriptors and analysis (set of image analysis and feature extraction procedures); reading of images by the expert (expert's readings) and reading of image features by the image analysis tool (measurements of image features); collect into database; image mining experiment (mining result); review, leading either to the final selected attributes, attribute descriptors and rules (end) or back to finding new attributes or samples.]

Fig. 65. Procedure of the Image Mining Process


4.2.3 Image Mining Tool

Figure 66 shows a scheme of the tool for image mining. There are two parts in the tool: the unit for image analysis, feature extraction, and storage of image descriptions, and the unit for data mining. These two units communicate over a database of image descriptions, which is created in the frame of the image-processing unit. This database is the basis for the data-mining unit. The feature extraction unit has an open architecture so that new image feature extraction procedures can be implemented and used for further image mining applications.

An image from the image archive is selected by the expert and then displayed on a monitor. To perform image processing, the expert communicates with the computer. He determines whether the whole image or part of it has to be processed and outlines an area of interest (for example, a nodule region) with an overlay line. The expert can calculate some image features in the marked region (object contour, square, diameter, shape, and some texture features) [Zam96]. The expert evaluates or calculates image features and stores their values in a database of image features. Each entry in the database presents features of the object of interest. These features can be numerical (calculated on the image) and symbolic (determined by the expert as a result of image reading). In the latter case the expert evaluates object features according to the attribute list, which has to be specified in advance for object description. Then he feeds these values into the database.

When the expert has evaluated a sufficient number of images, the resulting database can be used for the mining process. The stored database can easily be loaded into the data mining tool Decision Master. The Decision Master carries out a decision-tree induction according to the methods described in Chapter 3.1. It allows one to learn a set of rules and basic features necessary for decision-making in the specified task. The induction process does not only act as a knowledge discovery process, it also works as a feature selector, discovering a subset of features that is the most relevant to the problem solution.

The developed tool allows choosing different kinds of methods for feature selection, feature discretization, pruning of the decision tree and evaluation of the error rate. It provides an entropy-based measure, a gini-index, gain-ratio and chi-square method for feature selection. The Decision Master provides the following methods for feature discretization: cut-point strategy, chi-merge discretization, MDLP-based discretization method and lvq-based method. These methods allow one to discretize the feature values into two or more intervals during the process of decision-tree building. Depending on the chosen method for attribute discretization, the result will be a binary or n-ary tree, which will lead to more accurate and compact trees. The Decision Master allows one to choose between cost-complexity pruning, error-reduction-based methods and pruning by confidence-interval prediction. The tool also provides functions for outlier detection. To evaluate the obtained error rate one can choose test-and-train and n-fold cross-validation.

The user selects the preferred method for each step of the decision tree induction process. After that the induction experiment can start on the acquired database. A resulting decision tree will be displayed to the user. He/she can evaluate the tree by checking the features in each node of the tree and comparing them with his/her domain knowledge.

Once the diagnosis knowledge has been learnt, the rules are provided either in txt-format for further use in an expert system or the expert can use the diagnosis component of the Decision Master for interactive work. It has a user-friendly interface and is set up in such a way that non-computer specialists can handle it very easily.

[Figure 66: block diagram. Image description tool: display, mark object/details, image analysis and feature extraction, expert's description, database storage. Interface file (dbf, oracle, etc.). Data mining tool Decision_Master: decision tree induction, evaluation, diagnosis component.]

Fig. 66. Architecture of the Image Mining Tool
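Three of the feature-selection measures listed for the tool (entropy-based gain, gain ratio, and gini index) can be illustrated for a single symbolic attribute. The sketch uses the textbook definitions of these measures and is not taken from the Decision Master; all function names and the example attribute values are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def gini(items):
    """Gini index of the value distribution in `items`."""
    n = len(items)
    return 1.0 - sum((c / n) ** 2 for c in Counter(items).values())

def partition(values, labels):
    """Group the class labels by attribute value."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

def gini_reduction(values, labels):
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in partition(values, labels))

classes = ["homogeneous", "homogeneous", "speckled", "speckled"]
texture = ["smooth", "smooth", "dotted", "dotted"]    # separates the classes
staining = ["weak", "strong", "weak", "strong"]       # carries no class information
```

An attribute with a higher score is preferred as the test in a tree node, which is how the induction process doubles as a feature selector.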

4.2.4 The Application

We will describe the usage of the image mining tool based on the task of HEp-2 cell classification. HEp-2 cells are used for the identification of antinuclear autoantibodies (ANA). They allow the recognition of over 30 different nuclear and cytoplasmic patterns, which are given by upwards of 100 different autoantibodies. The identification of these patterns has up to now been done manually by a human inspecting the slides with the help of a microscope. The lack of automation of this technique has resulted in the development of alternative techniques based on chemical reactions, which do not have the discrimination power of ANA testing. An automatic system would pave the way for a wider use of ANA testing.

Recently, the various HEp-2 cell images occurring in medical practice are being collected into a database at the university hospital of Leipzig. The images were taken by a digital image-acquisition unit consisting of a microscope AXIOSKOP 2 from Carl Zeiss Jena, coupled with a color CCD camera Polaroid DPC. The digitized images were of 8-bit photometric resolution for each color channel with a per-pixel spatial resolution of 0.25 μm. Each image was stored as a color image on the hard disk of the PC but is transformed into a gray-level image before being used for automatic image analysis.

The scope of our work was to mine these images for the proper classification knowledge so that it can be used in medical practice for diagnosis or for teaching novices. Besides that, it should give us the basis for the development of an automatic image diagnosis system. Our experiment was supported by an immunologist who is an expert in the field and acts as a specialist to other laboratories in case of diagnostically complex cases.

4.2.5 Brainstorming and Image Catalogue

First, we started with a brainstorming process that helped us to understand the expert's domain and to identify the basic pieces of knowledge. We could identify mainly four pieces of knowledge: the HEp-2 cell atlas [BStJ95], the expert, the slide preparation, and a book describing the basic parts of a cell and their appearance. Then the expert collected prototype images for each of the six classes appearing most frequently in his daily practice. The expert wrote down a natural-language description for each of these images. As a result we obtained an image catalogue that has a prototype image for each class and, associated with each image, a natural-language description by the expert (see Figure 67).

4.2.6 Interviewing Process

Based on these image descriptions we started our interviewing process. First, we only tried to understand the meaning of the expert's description in terms of image features. We let him circle the interesting object in the image to understand the meaning of the description. After having done this, we went into a structured interviewing process asking for specific details such as: "Why do you think this object is fine-speckled and the other one is not? Please describe the difference between these two." It helped us to verify the expert's description and to make the object features more distinct. Finally, we could extract from the natural-language description the basic vocabulary (attributes and attribute values, see Table 12) and associate a meaning with each attribute. In a last step we reviewed the chosen attributes and the attribute values with the expert and reached a common agreement on the chosen terms. The result was an attribute list which is the basis for the description of object details in the images. Furthermore, we identified, from the whole set of feature descriptors our image-analysis tool provides, the set of feature descriptors which might be useful for the objective measurement of image features. In our case we found that describing the cells by their boundary and calculating the size and the contour of the cell might be appropriate. The different appearances of the nuclei of the cells might be sufficiently described by the texture descriptor of our image-analysis tool.

4.2.7 Setting Up the Automatic Image Analysis and Feature Extraction Procedure

After having understood what the expert is looking for in an image, we have the basis for the development of the image analysis and feature extraction procedure. However, we still do not know what the necessary features are and what the classification rules are. This has to be figured out by the following data mining process. We are now at the point where we can prepare the images for the mining process. Based on the image analysis and feature extraction procedure we can extract from the images a data table relevant for the data mining experiment.

[Figure 67: image catalogue with the columns Class, Image, and Description. Class names shown include Fine Speckled, Homogeneous Nuclear, and Centromere; class codes shown include 100000, 200000, 320200, and 500000. Example descriptions by the expert: "A uniform diffuse fluorescence of the entire nucleus of interphase cells. The surrounding cytoplasm is negative." and "Nuclei weak uniform or fine granular, poor distinction from background".]

Fig. 67. Image Catalogue and Expert's Description

4.2.7.1 Image Analysis

The color image has been transformed into a gray level image. Histogram equalization was done to eliminate the influence of the different staining [PeB99]. Automatic thresholding has been performed by the algorithm of Otsu [Ots78]. The algorithm can localize the cells with their cytoplasmatic structure very well, but not the nuclear envelope itself. We then applied morphological filters like dilation and erosion to the image in order to get a binary mask for cutting out the cells from the image. Overlapping cells have not been considered for further analysis. They are eliminated based on a simple heuristic: each object with an area bigger than 2 times the mean area was removed from the image. For each cell in the image, the area $A_{cell}$ and the features described in the next section are calculated. Note that the image $f(x,y)$ considered for further calculation now contains only one cell.
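The thresholding step and the overlap heuristic described above can be sketched as follows. This is a minimal illustration, not the implementation of the image-analysis tool: the function names `otsu_threshold` and `keep_single_cells` are hypothetical, the input is assumed to be an 8-bit gray level image held in a NumPy array, and the object areas are assumed to come from a separate labeling step that is not shown.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes the between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)           # probability of the dark class up to level k
    mu = np.cumsum(prob * levels)     # cumulative first moment up to level k
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0         # levels where one class is empty
    return int(np.argmax(sigma_b))


def keep_single_cells(areas: list[float], factor: float = 2.0) -> list[int]:
    """Indices of objects whose area does not exceed factor * mean area."""
    mean_area = sum(areas) / len(areas)
    return [i for i, area in enumerate(areas) if area <= factor * mean_area]
```

Pixels with a gray level above the returned threshold would form the foreground mask; objects rejected by `keep_single_cells` correspond to the presumably overlapping cells that the heuristic removes.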

Table 12. Attribute List and Attribute Value Names

Attribute          Code     Attribute value
Interphase Cells   0        Undefined
                   1        Fine speckled
                   2        Homogeneous
                   3        Coarse speckled
                   4        Dense fine speckled fluorescence
Nucleoli           0        Undefined
                   1        Dark area
                   2        Fluorescence
Background         0        Undefined
                   1        Dark
                   2        Fluorescence
Chromosomes        0        Undefined
                   1        Fluorescence
                   2        Dark
Cytoplasm          0        Undefined
                   1        Speckled fluorescence
Classes            100000   Homogeneous
                   100320   Homogeneous fine speckled
                   200000   Nuclear
                   320000   Fine speckled
                   320200   Fine speckled nuclear

4.2.7.2 Feature Extraction

For the description of the properties of the objects, we chose a texture feature descriptor that is flexible enough to describe complex and very different textures. The texture descriptor is based on random sets. It is also called the Boolean model [Mat75]. A detailed description of the theory can be found in Stoyan et al. [StKM87]. The Boolean model allows us to model and simulate a huge variety of textures, such as crystals, leaves, etc. The texture model $X$ is obtained by taking various realizations of compact random sets, implanting them in Poisson points in $R^n$, and taking the supremum. The functional moment $Q(B)$ of $X$, after Booleanization, is calculated as:

$$P(B \subset X^{c}) = Q(B) = \exp\bigl(-\theta\,\mathrm{Mes}(X' \oplus \check{B})\bigr) \quad \forall B \in \kappa \qquad (55)$$

where $\kappa$ is the set of the compact random sets of $R^n$, $\theta$ is the density of the process, and $\mathrm{Mes}(X' \oplus \check{B})$ is an average measure that characterizes the geometric properties of the remaining set of objects after dilation. Relation (55) is the fundamental formula of the model. It completely characterizes the texture model. $Q(B)$ does not depend on the location of $B$; thus it is stationary. One can also prove that it is ergodic; thus we can speak about the measure for a specific portion of the space without referring to the particular portion of the space. Formula (55) shows that the texture model depends on two parameters:

• the density $\theta$ of the process and
• a measure $\mathrm{Mes}(X' \oplus \check{B})$ that characterizes the objects. In the 1-dimensional space it is the average length of the lines, and in the 2-dimensional space $\mathrm{Mes}(X' \oplus \check{B})$ is the average measure of the area and the perimeter of the objects under the assumption of convex shapes.

We consider the 2-dimensional case and developed a proper texture descriptor. Suppose now that we have a texture image with 8-bit gray levels. Then we can consider the texture image as the superposition of various Boolean models, each of which takes a different gray level value on the scale from 0 to 255 for the objects within the bit-plane. To reduce the dimensionality of the resulting feature vector, the gray levels ranging from 0 to 255 are quantized into 12 intervals $t$. Each image $f(x,y)$ containing only one cell is classified according to the gray level into these classes, with $t \in \{0, 1, \ldots, 11\}$. For each class a binary image is calculated containing the value "1" for pixels with a gray level value falling into the gray level interval of class $t$ and the value "0" for all other pixels. The resulting bit-plane $f(x,y,t)$ can now be considered as a realization of the Boolean model. The quantization of the gray levels into 12 intervals was done equidistantly. In the following we call the image $f(x,y,t)$ the class image. Object labeling is done in the class images with the contour following method [PeB99]. Afterwards, features from the bit-plane and from these objects are calculated. The first one is the density of the class image $t$, which is the number of pixels in the class image labeled by "1" divided by the area of the cell.

If all pixels of a cell are labeled by "1", then the density is one. If no pixel in a cell is labeled, then the density is zero. From the objects in the class image $t$ the area, a simple shape factor, and the length of the contour are calculated. According to the model, not a single feature of each object is taken for classification, but the mean and the variance of each feature are calculated over all the objects in the class image $t$. We also calculate the frequency of the object size in each class image $t$. The list of features and their calculation is shown in Table 13.
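The construction of the class images and of the density feature can be illustrated with a short sketch. It assumes an 8-bit cell image and a boolean cell mask as NumPy arrays; the names `class_images` and `densities` are hypothetical, and the integer mapping from gray level to interval index is one possible way to realize twelve equally spaced intervals (it places the gray levels 0 to 21 in interval 0).

```python
import numpy as np

N_CLASSES = 12  # number of equally spaced gray level intervals


def class_images(cell: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return a (12, H, W) boolean stack f(x, y, t) for one 8-bit cell image."""
    # Map each gray level g in 0..255 to an interval index t in 0..11.
    t_index = (cell.astype(np.int64) * N_CLASSES) // 256
    # A pixel is "1" in class image t if it lies in the cell and in interval t.
    return np.stack([(t_index == t) & mask for t in range(N_CLASSES)])


def densities(planes: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Density per class image: pixels labeled 1 divided by the cell area."""
    cell_area = mask.sum()
    return planes.reshape(N_CLASSES, -1).sum(axis=1) / cell_area
```

Because every cell pixel falls into exactly one interval, the twelve densities of a cell sum to one. The object features (area, shape factor, contour length) would be computed per class image after object labeling, which is not shown here.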


4.2.8 Collection of Image Descriptions into the Data Base

Now we could start to collect a data base of image descriptions based on the attributes and attribute values the expert had specified as well as on feature measurements calculated with the help of the image-analysis tool. For our experiment we used a data set of 110 images. The data set contained 6 classes, each equally distributed. For each class we had 20 images. The expert used the image-analysis tool shown in Figure 66 and displayed, one after another, each image from our data base. He watched the images on the display, described the image content on the basis of our attribute list, and fed the attribute values into the data base. As a result we obtained a data base based on the expert's image readings. Next, the images were processed by our automatic image analysis and feature extraction procedure. The calculated values for the features were automatically recorded in the data base. An excerpt of the data base containing the expert's image readings and the automatically calculated image features is shown in Figure 68.

Table 13. Features

[Figure 68: excerpt of the image data base. Recoverable column headings: Class, Contour (Kontur), Shape Factor (Form), Area, MEAN, VAR, SKEW, CURT, VC, ENERGY, NUCLEOLI, Background (Hintergrund), CYTOPLA (Zytopla), CHROMO. The rows shown belong to the classes 100000, 100320, 320200, and 200000.]

Fig. 68. Excerpt of the Image Database

4.2.9 The Image Mining Experiment

The collected data set was then given to the data-mining tool Decision-Master. The decision-tree induction algorithm that showed the best results on this data set is based on the entropy criterion for the attribute selection, a cut-point strategy for the attribute discretization, and minimal error-reduction pruning. We carried out three experiments. First, we learnt a decision tree based only on the image reading by the expert, then we learnt a decision tree based only on the automatically calculated image features, and finally, we learnt a decision tree based on a data base containing both feature descriptions. The resulting decision tree for the expert's reading is shown in Figure 69 and the resulting decision tree for the expert's reading together with the measured image features is shown in Figure 70. We do not show the tree for the measured image features, since the tree is too complex. The error rate was evaluated by leave-one-out cross-validation.
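To make the attribute-selection step more concrete, the following sketch selects a symbolic attribute by an entropy criterion. It assumes the plain information-gain form of the criterion, which may differ in detail from what Decision-Master implements; the function names are hypothetical. Ties are resolved in favor of the attribute that appears first in the attribute list, which is the behavior the text attributes to the algorithm in Section 4.2.12.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one symbolic attribute."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def select_attribute(rows, labels, attributes):
    """Pick the attribute with the highest gain; earlier list entries win ties."""
    best, best_gain = None, float("-inf")
    for attribute in attributes:
        gain = information_gain(rows, labels, attribute)
        if gain > best_gain:  # strict ">" keeps the first of equal candidates
            best, best_gain = attribute, gain
    return best
```

Here each row is assumed to be a dictionary that maps attribute names (for example "interphase" or "chromosomes") to the attribute value codes of Table 12. Reordering the attribute list changes the outcome only when two attributes reach exactly the same gain.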

Table 14. Error Rate of the Expert and the Decision Tree

Method                                               Error Rate
Expert                                               23.6%
Decision Tree based on Expert's Reading              16.6%
Decision Tree based on calculated Image Features     25.0%
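The leave-one-out estimate behind such error rates can be sketched in a few lines. The sketch is generic: `train` and `predict` stand for any learner, and the majority-class learner below is only a hypothetical stand-in so that the example is self-contained; it is not the decision-tree induction used in the experiment.

```python
from collections import Counter


def leave_one_out_error(samples, labels, train, predict):
    """Fraction of samples misclassified when each one is held out once."""
    errors = 0
    for i in range(len(samples)):
        # Train on all samples except the i-th one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        # Test on the single held-out sample.
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)


def train_majority(train_x, train_y):
    """Hypothetical stand-in learner: remember the most frequent class."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, sample):
    """Predict the remembered class regardless of the sample."""
    return model
```

With n samples the procedure trains n models, each tested on exactly one unseen sample, which makes it a natural choice for small data sets.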


--- 263 DS INTERPHASE
    = 0: 43 DS [200000]
    = 1: 93 DS CHROMOSOME
         = 0: 8 DS [100000]
         = 1: 15 DS [100000]
         = 2: 70 DS [100320]
    = 2: 47 DS [100000]
    = 3: 48 DS [320000]
    = 4: 32 DS [320200]

Fig. 69. Decision Tree obtained from Expert's Image Reading

[Figure 70: decision tree with the root node DENS_0 (321 DS), which splits at 0.00015 into the subtrees STACONT_2 (240 DS) and DENS_1 (81 DS). Further inner nodes test DENS_1, DENS_2, DENS_4, DENS_6, DENS_9, STAAREA_1, STAAREA_2, STAAREA_4, STAAREA_5, STAAREA_7, STAAREA_9, and STACONT_3. The leaves are labeled with the classes 100000, 100320, 200000, 320000, 320200, and 500000.]

Fig. 70. Decision Tree obtained from automatically calculated Features

4.2.10 Review

The performance of the human expert, with an error rate of 23.6%, was very poor (see Table 14). The expert was often not able to make a decision when he did not see mitotic cells in the image. When the mitotic cells were absent in the image, he could not decide the class. In that case he was not cooperative and did not read the image features. This behavior leads to the conclusion that he does not have a well-developed understanding of the appearance of the image features, nor does his decision-making strategy really rely on a complex image interpretation strategy. The resulting decision tree based on the expert's reading, shown in Figure 69, supports this observation. The decision-making strategy of the expert is based on only two image features, the interphase_cells and the chromosomes. This tree has an error rate of 16.6%. However, for the "not_decided" samples in our data base this tree could not be of any help, since the expert did not read the image features in that case. The tree based on the calculated image features shows an error rate of 25%. This performance is not as good as the performance of the expert, but the tree can also make a decision for the "not_decided" samples. We believe that the description of the different patterns of the cells by a texture descriptor is the right way to proceed. The Boolean model is flexible enough, and we believe that by further developing this texture model for our HEp-2 cell problem we will find better discriminating features, which will lead to a better performance of the learnt tree.

Although the features calculated based on the Boolean model are of numerical type, they also provide us with some explanation capability for the decision-making process. The subtree shown on the right side of Figure 70 shows us, for example, that the most important feature is dens_0. That means that if there exist some objects in the class image_0, which refers to low gray levels (0-21 increments), the class 500000 and partially the class 200000 can be separated from all the other classes. In other words, a small number of dark spots inside the cell refers to class 500000 and class 200000. The discriminating feature between class 500000 and class 200000 is the standard deviation of the object contour in class image_3. Small contours of dark objects in class image_3 refer to class 200000, whereas big contours refer to class 500000. It is interesting to note that the features fine_speckled and fluorescent nucleoli are not the most discriminating features. The classifier based on the calculated image features uses other features and therefore leads to a new and deeper understanding of the problem.

4.2.11 Using the Discovered Knowledge

The achieved results helped the experts to understand their decision-making strategy better. It is evident now that the full spectrum of human visual reasoning is not exhausted for this inspection task. Instead of having the physicians develop a sophisticated reasoning strategy, the problem was given back to the developers and providers of the freeze-cut HEp-2 cells. They implemented special cells into the substrate, for example the so-called mitosis. Although mitoses give higher confidence in the decision, another problem still exists. These cells appear somewhere in the slides. The producers can only guarantee that a certain percentage of mitoses appears in each slide. It is highly likely that under the microscope the special cells are not visible, since none of these cells lies in the chosen region of interest. Then the slide must be manually shifted under the microscope until a better region of interest is found. Besides that, each company has a different strategy for setting up its substrate. A high percentage of mitoses in the substrate is a special feature of the products of one company. Another company has another special feature. However, the real HEp-2 cells appear equally in the images. Therefore our effort goes in two directions:

1. Setting up a more sophisticated image catalogue for teaching physicians and
2. Further development of our texture feature extractor.

The image catalogue should explain in more detail the different features and feature values of the vocabulary, so that the expert is more certain in reading the image features even when mitosis is absent in the image. This should help the experts to understand their inspection problem better and to use a standard vocabulary with a clearly defined and commonly agreed meaning. The further development of our texture feature extractor should lead to better distinguishing features, so that the performance of the classifier is improved. Besides that, the explanation capability of the feature descriptor can be used for determining better symbolic features as the basis of the image catalogue.

The recent results were used to build a first version of an automatic image analysis and classification system for the classification of HEp-2 cell patterns. The image analysis procedures and the procedures for the image descriptors of the image features contained in the decision tree were combined into an algorithm. The rules of the decision tree shown in Figure 70 were implemented in the program, and now the expert can use the system in his daily practice as a decision support system.

4.2.12 Lessons Learned

We have found out that our methodology of data mining allows a user to learn the decision model and the relevant diagnostic features. A user can independently use such a methodology of data mining in practice. He can easily perform different experiments until he is satisfied with the result. By doing that he can explore his application and find out the connection between different knowledge pieces. However, some problems should be taken into account for the future system design. As we have already pointed out in a previous experiment [PBY96], an expert tends to specify symbolic attributes with a large number of attribute values. For example, in this experiment the expert specified for the attribute "margin" fifteen attribute values such as "non-sharp", "sharp", "non-smooth", "smooth", and so on. A large number of attribute values will result in small sub-sample sets soon after the tree-building process has started. It will result in a fast termination of the tree-building process. This is also true for small sample sets, which are usual in medicine. Therefore, a careful analysis of the attribute list should be done after the physician has specified it. During the process of building the tree, the algorithm picks the attribute with the best value of the attribute-selection criterion. If two attributes both have the same value, the one that appears first in the attribute list will be chosen. That might not always be the attribute the expert would choose himself. To avoid this problem, we think that in this case we should allow the expert to choose manually the attribute that he/she prefers. We expect that this procedure will bring the resulting decision model closer to the expert's one.

The described method of image mining has already been established in practice. It runs at the University hospitals in Leipzig and Halle and at the Veterinary department of the University in Halle, where the method is used for the analysis of sheep follicles based on a texture descriptor, the evaluation of imaging effects of radiopaque material for lymph-nodule analysis, mining knowledge for IVF therapy, transplantation medicine, and the diagnosis of breast carcinoma in MR images. In all these tasks we did not have a well-trained expert. These were new tasks, and reliable decision knowledge has not been built up in practice yet. The physicians did the experiments by themselves. They were very happy with the obtained results, since the learnt rules gave them a deeper understanding of their problems and helped to predict new cases. It helped the physicians to explore their data and inspired them to think about new, improved ways of diagnosis.

4.2.13 Conclusions

There are a lot of applications around where images are involved, and the number of these applications will increase in the future. Most of these applications require a human expert in order to discover the interesting details in the images or in order to interpret the images. It is a challenge for the future to develop methods and systems that can effectively support a human in this task. These methods should allow us to better understand human visual reasoning.

5 Conclusion

In this book we have given an overview of data mining for multimedia data. The book does not describe all aspects of data mining. We have focused on methods which are, from our point of view, most important for mining multimedia data. These methods include decision-tree induction, case-based reasoning, clustering, conceptual clustering, and feature-subset selection. The applications we have described in this book are among the first image mining applications solved. The described methodology for image mining can be applied successfully to other applications as well. The data-mining tool Decision Master can be obtained from ibai research (http://www.ibai-research.de/book_data_mining). The material described in this book can be used for a course on data mining on multimedia data. The foils for the course will also be available on this webpage. The field of multimedia data mining is an emerging field and still at the beginning of its development. There are still a lot of open problems which have to be solved in the future. The main problem for image and video mining is how to extract from the images the necessary information which should be used for mining.

Besides that, new methods for structural data are required, as well as methods which can combine different data sources such as image, audio, and text, or such as in e-commerce, where web data, logfile data, and marketing data should be combined for the mining process. In the future we can expect a lot of new and exciting developments. The Pattern Recognition Community is intensely engaged in these problems on the theoretical side and has taken up the problem of mining images, texts, videos, and web documents, which has already led to some substantial contributions. The activities are coordinated by Technical Committee 17 Data Mining and Machine Learning of the International Association of Pattern Recognition (IAPR) (http://www.ibai-research.de/tc3). New developments will be presented on a biannual basis at the International Conference on Machine Learning and Data Mining in Pattern Recognition MLDM (http://www.ibai-research.de/evlink5) held in Leipzig, which is one of the most important forums on that topic. We hope we could inspire you by this book to deal with this interesting field.

P. Perner: Data Mining on Multimedia Data, LNCS 2558, p. 117, 2002. © Springer-Verlag Berlin Heidelberg 2002

Appendix

The IRIS Data Set

The iris data set is comprised of 3 classes. The data set contains 50 samples per class. The three classes are different types of flowers such as Setosa (class 1), Versicolor (class 2), and Virginica (class 3). These classes are described by four variables such as petal length and width, and sepal length and width.

The data set can be obtained from ftp.ics.uci.edu/pub/machine-learning-databases.


Index

Abstraction for 17
  Images 17
  Time Series 18
  Web Data 19
Algorithm 73
Algorithmic Properties 73
Application 91
Association Rules 8
Attribute Construction 17
Attribute Selection Criteria
  Gain Ratio 29
  Gini Function 30
  Information Gain 29
Attribute Splits 26
  Linear Split 26
  Multivariate 26
  Univariate 26
Automatic Aggregation of Attributes 41
Brain/Liquor Ratio 93
Brainstorming 107
Case Base Organization 53, 94
Case Description 53, 94
Case-Based Reasoning (CBR) 47
  Background
  Design Aspects 50
  Image Segmentation 91
  Knowledge Containers 49
  Maintenance 48, 93, 94
  Process 48
Category Utility Function 72
CBR Learning 55
Centroid 98
Classification
Cluster Analysis
Clustering 57
  Agglomerate Clustering
  Graphs Clustering 64
  Hierarchical Clustering
    of Attributes 64
    of Classes 63
    of Graphs 69
  Partitioning Clustering
Coefficient 98
Collection of Image Description
Comparison of Image Similarity Measures 98
Concept Description 71
Concept Hierarchy 69, 76
Conceptual Clustering 69
  Evaluation Function 75
  of Graphs 75
Contrast Rule 62
Controlling Model Parameter
CT Image 94
Data 10
  Categorical 10
  Graph-based 10
  Numerical 10
  Structural 10
  Types 10
Data Mining
  Definition
  Overview from the Data Side
  Methods
Data Mining Tool 117
Data Preparation 13
  Coding 16
  Correlated Attributes 16
  Data Cleaning 13
  Data Smoothing 15
  Handling Noisy Data 14
  Handling Outlier 14
  Missing Value Handling 16
  Redundant Attributes 16
Decision Tree Induction
  Design 25
  Learning Tasks 25
  Principle 23
  Terminology 24
Dendrogram 63, 99
Deviation Detection 7
Discovered Knowledge 114
Discretization of Attribute Values 31, 41
  Based on Intra- and Interclass 33
  Binary Discretization 32
  Categorical Attributes 41
  Entropy-based 32
  Numerical Attributes 31
Distance Measure 59
  for Categorical Data 60
  for Metrical Data 59
  for Nominal Data 61
Empirical Cycle of Theory Formation
Energy 98
Entropy 98
Evaluation of the Model 79
  Error Rate
  Correctness 79
  Quality 79
  Sensitivity 81
  Specificity 81
  Test-and-Train 82
  Random Sampling
  Cross Validation
Expert Description 108
Feature Extraction 97
Feature Subset Selection 83
  Algorithms 83
  by Clustering 86
  by Decision Tree Induction 85
  Contextual Merit Algorithm 87
  Filter Model 84
  Floating Search Method
  Wrapper Model
Graph Matching
Image Analysis 108
Image Catalogue 107
Image Data Base 111
Image Information 96
Image Mining Experiment 112
Image Mining Tool 106
Image Segmentation 93, 99
Image Similarity Determination
Influence of Discretization 39
Interviewing Process 116
In-vitro Fertilization Therapy
Knowledge Acquisition Aspect
Knowledge Discovery 45
Kurtosis 98
Learning 55
  Cases 56
  High-Order Constructs
  of Similarity 56
  Prototypes 56, 76
  the Super Graph 69
Lifetime
Manual Abstraction 41
Mean 98
Mining 9
  Image 9, 102
  Text 9
  Video 9
  Web 9
Model Development Process
Modelling 91
Multi-interval Discretization
  Chi-Merge Discretization
  Histogram-based Discretization
  LVQ-Based Discretization 36
  MLD Based Criteria
  Number of Intervals 35
  Search Strategies 35
  Utility Criteria 35
Non-image Information
Overfitting
Prediction Function
Preparation of Experiment 103
Pruning 42
  Cost-Complexity Pruning
  Overview
Real World Application
Regression 7
Review 113
Segmentation 8
Segmentation Parameter 94, 99
Similarity 33, 50
  for Image Information 96, 101
  for Non-image Information 95, 101
  Formalization of 50
  Measure 51
  Overall Similarity 100
Similarity Measure for Graphs 65
Skewness 98
Texture Feature Extractor 109, 111
Time Series Analysis 9
Variance 98
Variation 98
Visualization 8
Volumetry 93