Model-Based Clustering and Classification for Data Science: With Applications in R 110849420X, 9781108494205


English Pages xviii+428 [447] Year 2019

Model-based Clustering and Classification for Data Science

Cluster analysis consists of methods for finding groups in data automatically. Most methods have been heuristic and leave open such central questions as: How many clusters are there? Which clustering method should I use? How should I handle outliers? Classification involves assigning new observations to groups given previously classified observations, and also has open questions about parameter tuning, robustness and uncertainty assessment. This book frames cluster analysis and classification in terms of statistical models, thus yielding principled estimation, testing and prediction methods, and soundly-based answers to the central questions. It develops the basic ideas of model-based clustering and classification in an accessible but rigorous way, using extensive real-world data examples and providing R code for many methods, and describes modern developments for high-dimensional data and for networks. It explains recent methodological advances, such as Bayesian regularization methods, non-Gaussian model-based clustering, cluster merging, variable selection, semi-supervised classification, robust classification, clustering of functional data, text and images, and co-clustering. Written for advanced undergraduates and beginning graduate students in data science, as well as researchers and practitioners, it assumes basic knowledge of multivariate calculus, linear algebra, probability and statistics.

Charles Bouveyron is Professor of Statistics at Université Côte d’Azur and the Chair of Excellence in Data Science at Inria Sophia-Antipolis. He has published extensively on model-based clustering, particularly for networks and high-dimensional data.

Gilles Celeux is Director of Research Emeritus at Inria Saclay Île-de-France. He is one of the founding researchers in model-based clustering, having published extensively in the area for 35 years.

T. Brendan Murphy is Professor of Statistics at University College Dublin. His research interests include model-based clustering, classification, network modeling and latent variable modeling.

Adrian E. Raftery is Professor of Statistics and Sociology at University of Washington, Seattle. He was one of the founding researchers in model-based clustering, having published in the area since 1984.

CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS

Editorial Board
Z. Ghahramani (Department of Engineering, University of Cambridge)
R. Gill (Mathematical Institute, Leiden University)
F. P. Kelly (Department of Pure Mathematics and Mathematical Statistics, University of Cambridge)
B. D. Ripley (Department of Statistics, University of Oxford)
S. Ross (Department of Industrial and Systems Engineering, University of Southern California)
M. Stein (Department of Statistics, University of Chicago)

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

A complete list of books in the series can be found at www.cambridge.org/statistics. Recent titles include the following:

30. Brownian Motion, by Peter Mörters and Yuval Peres
31. Probability (Fourth Edition), by Rick Durrett
33. Stochastic Processes, by Richard F. Bass
34. Regression for Categorical Data, by Gerhard Tutz
35. Exercises in Probability (Second Edition), by Loïc Chaumont and Marc Yor
36. Statistical Principles for the Design of Experiments, by R. Mead, S. G. Gilmour and A. Mead
37. Quantum Stochastics, by Mou-Hsiung Chang
38. Nonparametric Estimation under Shape Constraints, by Piet Groeneboom and Geurt Jongbloed
39. Large Sample Covariance Matrices and High-Dimensional Data Analysis, by Jianfeng Yao, Shurong Zheng and Zhidong Bai
40. Mathematical Foundations of Infinite-Dimensional Statistical Models, by Evarist Giné and Richard Nickl
41. Confidence, Likelihood, Probability, by Tore Schweder and Nils Lid Hjort
42. Probability on Trees and Networks, by Russell Lyons and Yuval Peres
43. Random Graphs and Complex Networks (Volume 1), by Remco van der Hofstad
44. Fundamentals of Nonparametric Bayesian Inference, by Subhashis Ghosal and Aad van der Vaart
45. Long-Range Dependence and Self-Similarity, by Vladas Pipiras and Murad S. Taqqu
46. Predictive Statistics, by Bertrand S. Clarke and Jennifer L. Clarke
47. High-Dimensional Probability, by Roman Vershynin
48. High-Dimensional Statistics, by Martin J. Wainwright
49. Probability: Theory and Examples (Fifth Edition), by Rick Durrett

Model-Based Clustering and Classification for Data Science
With Applications in R

Charles Bouveyron
Université Côte d’Azur

Gilles Celeux
Inria Saclay Île-de-France

T. Brendan Murphy
University College Dublin

Adrian E. Raftery
University of Washington

University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108494205
DOI: 10.1017/9781108644181

© Charles Bouveyron, Gilles Celeux, T. Brendan Murphy and Adrian E. Raftery 2019

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2019
Printed in Singapore by Markono Print Media Pte Ltd

A catalogue record for this publication is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Names: Bouveyron, Charles, 1979– author. | Celeux, Gilles, author. | Murphy, T. Brendan, 1972– author. | Raftery, Adrian E., author.
Title: Model-based clustering and classification for data science : with applications in R / Charles Bouveyron, Université Côte d’Azur, Gilles Celeux, Inria Saclay Île-de-France, T. Brendan Murphy, University College Dublin, Adrian E. Raftery, University of Washington.
Description: Cambridge ; New York, NY : Cambridge University Press, 2019. | Series: Cambridge series in statistical and probabilistic mathematics | Includes bibliographical references and index.
Identifiers: LCCN 2019014257 | ISBN 9781108494205 (hardback)
Subjects: LCSH: Cluster analysis. | Mathematical statistics. | Statistics–Classification. | R (Computer program language)
Classification: LCC QA278.55 .M63 2019 | DDC 519.5/3–dc23
LC record available at https://lccn.loc.gov/2019014257

ISBN 978-1-108-49420-5 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

To Nathalie, Alexis, Romain and Nathan
Charles

To Maïlys and Maya
Gilles

To Trish, Áine and Emer
Brendan

To Hana, Isolde and Finn
Adrian

Contents

Preface

1 Introduction
  1.1 Cluster Analysis
  1.2 Classification
  1.3 Examples
  1.4 Software
  1.5 Organization of the Book
  1.6 Bibliographic Notes

2 Model-based Clustering: Basic Ideas
  2.1 Finite Mixture Models
  2.2 Geometrically Constrained Multivariate Normal Mixture Models
  2.3 Estimation by Maximum Likelihood
  2.4 Initializing the EM Algorithm
  2.5 Examples with Known Number of Clusters
  2.6 Choosing the Number of Clusters and the Clustering Model
  2.7 Illustrative Analyses
  2.8 Who Invented Model-based Clustering?
  2.9 Bibliographic Notes

3 Dealing with Difficulties
  3.1 Outliers
  3.2 Dealing with Degeneracies: Bayesian Regularization
  3.3 Non-Gaussian Mixture Components and Merging
  3.4 Bibliographic Notes

4 Model-based Classification
  4.1 Classification in the Probabilistic Framework
  4.2 Parameter Estimation
  4.3 Parsimonious Classification Models
  4.4 Multinomial Classification
  4.5 Variable Selection
  4.6 Mixture Discriminant Analysis
  4.7 Model Assessment and Selection

5 Semi-supervised Clustering and Classification
  5.1 Semi-supervised Classification
  5.2 Semi-supervised Clustering
  5.3 Supervised Classification with Uncertain Labels
  5.4 Novelty Detection: Supervised Classification with Unobserved Classes
  5.5 Bibliographic Notes

6 Discrete Data Clustering
  6.1 Example Data
  6.2 The Latent Class Model for Categorical Data
  6.3 Model-based Clustering for Ordinal and Mixed Type Data
  6.4 Model-based Clustering of Count Data
  6.5 Bibliographic Notes

7 Variable Selection
  7.1 Continuous Variable Selection for Model-based Clustering
  7.2 Continuous Variable Regularization for Model-based Clustering
  7.3 Continuous Variable Selection for Model-based Classification
  7.4 Categorical Variable Selection Methods for Model-based Clustering
  7.5 Bibliographic Notes

8 High-dimensional Data
  8.1 From Multivariate to High-dimensional Data
  8.2 The Curse of Dimensionality
  8.3 Earlier Approaches for Dealing with High-dimensional Data
  8.4 Subspace Methods for Clustering and Classification
  8.5 Bibliographic Notes

9 Non-Gaussian Model-based Clustering
  9.1 Multivariate t-Distribution
  9.2 Skew-normal Distribution
  9.3 Skew-t Distribution
  9.4 Box–Cox Transformed Mixtures
  9.5 Generalized Hyperbolic Distribution
  9.6 Example: Old Faithful Data
  9.7 Example: Flow Cytometry
  9.8 Bibliographic Notes

10 Network Data
  10.1 Introduction
  10.2 Example Data
  10.3 Stochastic Block Model
  10.4 Mixed Membership Stochastic Block Model
  10.5 Latent Space Models
  10.6 Stochastic Topic Block Model
  10.7 Bibliographic Notes

11 Model-based Clustering with Covariates
  11.1 Examples
  11.2 Mixture of Experts Model
  11.3 Model Assessment
  11.4 Software
  11.5 Results
  11.6 Discussion
  11.7 Bibliographic Notes

12 Other Topics
  12.1 Model-based Clustering of Functional Data
  12.2 Model-based Clustering of Texts
  12.3 Model-based Clustering for Image Analysis
  12.4 Model-based Co-clustering
  12.5 Bibliographic Notes

List of R Packages
Bibliography
Author Index
Subject Index

Expanded Contents

Preface

1 Introduction
  1.1 Cluster Analysis
    1.1.1 From Grouping to Clustering
    1.1.2 Model-based Clustering
  1.2 Classification
    1.2.1 From Taxonomy to Machine Learning
    1.2.2 Model-based Discriminant Analysis
  1.3 Examples
  1.4 Software
  1.5 Organization of the Book
  1.6 Bibliographic Notes

2 Model-based Clustering: Basic Ideas
  2.1 Finite Mixture Models
  2.2 Geometrically Constrained Multivariate Normal Mixture Models
  2.3 Estimation by Maximum Likelihood
  2.4 Initializing the EM Algorithm
    2.4.1 Initialization by Hierarchical Model-based Clustering
    2.4.2 Initialization Using the smallEM Strategy
  2.5 Examples with Known Number of Clusters
  2.6 Choosing the Number of Clusters and the Clustering Model
  2.7 Illustrative Analyses
    2.7.1 Wine Varieties
    2.7.2 Craniometric Analysis
  2.8 Who Invented Model-based Clustering?
  2.9 Bibliographic Notes

3 Dealing with Difficulties
  3.1 Outliers
    3.1.1 Outliers in Model-based Clustering
    3.1.2 Mixture Modeling with a Uniform Component for Outliers
    3.1.3 Trimming Data with tclust
  3.2 Dealing with Degeneracies: Bayesian Regularization
  3.3 Non-Gaussian Mixture Components and Merging
  3.4 Bibliographic Notes

4 Model-based Classification
  4.1 Classification in the Probabilistic Framework
    4.1.1 Generative or Predictive Approach
    4.1.2 An Introductory Example
  4.2 Parameter Estimation
  4.3 Parsimonious Classification Models
    4.3.1 Gaussian Classification with EDDA
    4.3.2 Regularized Discriminant Analysis
  4.4 Multinomial Classification
    4.4.1 The Conditional Independence Model
    4.4.2 An Illustration
  4.5 Variable Selection
  4.6 Mixture Discriminant Analysis
  4.7 Model Assessment and Selection
    4.7.1 The Cross-validated Error Rate
    4.7.2 Model Selection and Assessing the Error Rate
    4.7.3 Penalized Log-likelihood Criteria

5 Semi-supervised Clustering and Classification
  5.1 Semi-supervised Classification
    5.1.1 Estimating the Model Parameters through the EM Algorithm
    5.1.2 A First Experimental Comparison
    5.1.3 Model Selection Criteria for Semi-supervised Classification
  5.2 Semi-supervised Clustering
    5.2.1 Incorporating Must-link Constraints
    5.2.2 Incorporating Cannot-link Constraints
  5.3 Supervised Classification with Uncertain Labels
    5.3.1 The Label Noise Problem
    5.3.2 A Model-based Approach for the Binary Case
    5.3.3 A Model-based Approach for the Multi-class Case
  5.4 Novelty Detection: Supervised Classification with Unobserved Classes
    5.4.1 A Transductive Model-based Approach
    5.4.2 An Inductive Model-based Approach
  5.5 Bibliographic Notes

6 Discrete Data Clustering
  6.1 Example Data
  6.2 The Latent Class Model for Categorical Data
    6.2.1 Maximum Likelihood Estimation
    6.2.2 Parsimonious Latent Class Models
    6.2.3 The Latent Class Model as a Cluster Analysis Tool
    6.2.4 Model Selection
    6.2.5 Illustration on the Carcinoma Data Set
    6.2.6 Illustration on the Credit Data Set
    6.2.7 Bayesian Inference
  6.3 Model-based Clustering for Ordinal and Mixed Type Data
    6.3.1 Ordinal Data
    6.3.2 Mixed Data
    6.3.3 The ClustMD Model
    6.3.4 Illustration of ClustMD: Prostate Cancer Data
  6.4 Model-based Clustering of Count Data
    6.4.1 Poisson Mixture Model
    6.4.2 Illustration: Vélib Data Set
  6.5 Bibliographic Notes

7 Variable Selection
  7.1 Continuous Variable Selection for Model-based Clustering
    7.1.1 Clustering and Noisy Variables Approach
    7.1.2 Clustering, Redundant and Noisy Variables Approach
    7.1.3 Numerical Experiments
  7.2 Continuous Variable Regularization for Model-based Clustering
    7.2.1 Combining Regularization and Variable Selection
  7.3 Continuous Variable Selection for Model-based Classification
  7.4 Categorical Variable Selection Methods for Model-based Clustering
    7.4.1 Stepwise Procedures
    7.4.2 A Bayesian Procedure
    7.4.3 An Illustration
  7.5 Bibliographic Notes

8 High-dimensional Data
  8.1 From Multivariate to High-dimensional Data
  8.2 The Curse of Dimensionality
    8.2.1 The Curse of Dimensionality in Model-based Clustering and Classification
    8.2.2 The Blessing of Dimensionality in Model-based Clustering and Classification
  8.3 Earlier Approaches for Dealing with High-dimensional Data
    8.3.1 Unsupervised Dimension Reduction
    8.3.2 The Dangers of Unsupervised Dimension Reduction
    8.3.3 Supervised Dimension Reduction for Classification
    8.3.4 Regularization
    8.3.5 Constrained Models
  8.4 Subspace Methods for Clustering and Classification
    8.4.1 Mixture of Factor Analyzers (MFA)
    8.4.2 Extensions of the MFA Model
    8.4.3 Parsimonious Gaussian Mixture Models (PGMM)
    8.4.4 Mixture of High-dimensional GMMs (HD-GMM)
    8.4.5 The Discriminative Latent Mixture (DLM) Models
    8.4.6 Variable Selection by Penalization of the Loadings
  8.5 Bibliographic Notes

9 Non-Gaussian Model-based Clustering
  9.1 Multivariate t-Distribution
  9.2 Skew-normal Distribution
  9.3 Skew-t Distribution
    9.3.1 Restricted Skew-t Distribution
    9.3.2 Unrestricted Skew-t Distribution
  9.4 Box–Cox Transformed Mixtures
  9.5 Generalized Hyperbolic Distribution
  9.6 Example: Old Faithful Data
  9.7 Example: Flow Cytometry
  9.8 Bibliographic Notes

10 Network Data
  10.1 Introduction
  10.2 Example Data
    10.2.1 Sampson’s Monk Data
    10.2.2 Zachary’s Karate Club
    10.2.3 AIDS Blogs
    10.2.4 French Political Blogs
    10.2.5 Lazega Lawyers
  10.3 Stochastic Block Model
    10.3.1 Inference
    10.3.2 Application
  10.4 Mixed Membership Stochastic Block Model
    10.4.1 Inference
    10.4.2 Application
  10.5 Latent Space Models
    10.5.1 The Distance Model and the Projection Model
    10.5.2 The Latent Position Cluster Model
    10.5.3 The Sender and Receiver Random Effects
    10.5.4 The Mixture of Experts Latent Position Cluster Model
    10.5.5 Inference
    10.5.6 Application
  10.6 Stochastic Topic Block Model
    10.6.1 Context and Notation
    10.6.2 The STBM Model
    10.6.3 Links with Other Models and Inference
    10.6.4 Application: Enron E-mail Network
  10.7 Bibliographic Notes

11 Model-based Clustering with Covariates
  11.1 Examples
    11.1.1 CO2 and Gross National Product
    11.1.2 Australian Institute of Sport (AIS)
    11.1.3 Italian Wine
  11.2 Mixture of Experts Model
    11.2.1 Inference
  11.3 Model Assessment
  11.4 Software
    11.4.1 flexmix
    11.4.2 mixtools
    11.4.3 MoEClust
    11.4.4 Other
  11.5 Results
    11.5.1 CO2 and GNP Data
    11.5.2 Australian Institute of Sport
    11.5.3 Italian Wine
  11.6 Discussion
  11.7 Bibliographic Notes

12 Other Topics
  12.1 Model-based Clustering of Functional Data
    12.1.1 Model-based Approaches for Functional Clustering
    12.1.2 The fclust Method
    12.1.3 The funFEM Method
    12.1.4 The funHDDC Method for Multivariate Functional Data
  12.2 Model-based Clustering of Texts
    12.2.1 Statistical Models for Texts
    12.2.2 Latent Dirichlet Allocation
    12.2.3 Application to Text Clustering
  12.3 Model-based Clustering for Image Analysis
    12.3.1 Image Segmentation
    12.3.2 Image Denoising
    12.3.3 Inpainting Damaged Images
  12.4 Model-based Co-clustering
    12.4.1 The Latent Block Model
    12.4.2 Estimating LBM Parameters
    12.4.3 Model Selection
    12.4.4 An Illustration
  12.5 Bibliographic Notes

List of R Packages
Bibliography
Author Index
Subject Index

Preface

About this book

The century that is ours is shaping up to be the century of the data revolution. Our numerical world is creating masses of data every day and the volume of generated data is estimated to be doubling every two years. This wealth of available data offers hope for exploitation that may lead to great advances in areas such as health, science, transportation and defense. However, manipulating, analyzing and extracting information from those data is made difficult by the volume and nature (high-dimensional data, networks, time series, etc.) of modern data. Within the broad field of statistical and machine learning, model-based techniques for clustering and classification have a central position for anyone interested in exploiting those data.

This textbook focuses on the recent developments in model-based clustering and classification while providing a comprehensive introduction to the field. It is aimed at advanced undergraduates, graduates or first-year Ph.D. students in data science, as well as researchers and practitioners. It assumes no previous knowledge of clustering and classification concepts. A basic knowledge of multivariate calculus, linear algebra and probability and statistics is needed.

The book is supported by extensive examples on data, with 72 listings of code mobilizing more than 30 software packages, that can be run by the reader. The chosen language for codes is the R software, which is one of the most popular languages for data science. It is an excellent tool for data science since the most recent statistical learning techniques are provided on the R platform (named CRAN). Using R is probably the best way to be directly connected to current research in statistics and data science through the packages provided by researchers. The book is accompanied by a dedicated R package (the MBCbook package) that can be directly downloaded from CRAN within the R software or at the following address: https://cran.r-project.org/package=MBCbook. We also encourage the reader to visit the book website for the latest information: http://math.unice.fr/~cbouveyr/MBCbook/.

This book could be used as one of the texts for a graduate or advanced undergraduate course in multivariate analysis or machine learning. Chapters 1 and 2, and optionally a selection of later chapters, could be used for this purpose. The book as a whole could also be used as the main text for a one-quarter or one-semester course in cluster analysis or unsupervised learning, focusing on the model-based approach.

Acknowledgements

This book is a truly collaborative effort, and the four authors have contributed equally. Each of us has contributed to each of the chapters.

We would like to thank Chris Fraley for initially developing the mclust software and later R package, starting in 1991. This software was of extraordinary quality from the beginning, and without it this whole field would never have developed as it did. Luca Scrucca took over the package in 2007, and has enhanced it in many ways, so we also owe a lot to his work. We would also like to thank the developers and maintainers of Rmixmod software: Florent Langrognet, Rémi Lebret, Christian Poli, Serge Iovleff, Anwuli Echenim and Benjamin Auder.

The authors would also like to thank the participants in the Working Group on Model-based Clustering, which has been gathering every year in the third week of July since 1994, first in Seattle and then since 2007 in different venues around Europe and North America. This is an extraordinary group of people from many countries, whose energy, interactions and intellectual generosity have inspired us every year and driven the field forward. The book owes a great deal to their insights.

Charles Bouveyron would like to thank in particular Stéphane Girard, Julien Jacques and Pierre Latouche, for very fruitful and friendly collaborations. Charles Bouveyron also thanks his coauthors on this topic for all the enjoyable collaborations: Laurent Bergé, Camille Brunet-Saumard, Etienne Côme, Marco Corneli, Julie Delon, Mathieu Fauvel, Antoine Houdard, Pierre-Alexandre Mattei, Cordelia Schmid, Amandine Schmutz and Rawya Zreik. He would like also to warmly thank his family, Nathalie, Alexis, Romain and Nathan, for their love and everyday support in the writing of this book.
Gilles Celeux would like to thank his old and dear friends Jean Diebolt and Gérard Govaert for the long and intensive collaborations. He also thanks his coauthors in the area Jean-Patrick Baudry, Halima Bensmail, Christophe Biernacki, Guillaume Bouchard, Vincent Brault, Stéphane Chrétien, Florence Forbes, Raphaël Gottardo, Christine Keribin, Jean-Michel Marin, Marie-Laure Martin-Magniette, Cathy Maugis-Rabusseau, Abdallah Mkhadri, Nathalie Peyrard, Andrea Rau, Christian P. Robert, Gilda Soromenho and Vincent Vandewalle for nice and fruitful collaborations. Finally, he would like to thank Maïlys and Maya for their love.

Brendan Murphy would like to thank John Hartigan for introducing him to clustering. He would like to thank Claire Gormley, Paul McNicholas, Luca Scrucca and Michael Fop with whom he has collaborated extensively on model-based clustering projects over a number of years. He also would like to thank his students and coauthors for enjoyable collaborations on a wide range of model-based clustering and classification projects: Marco Alfò, Francesco Bartolucci, Nema Dean, Silvia D’Angelo, Gerard Downey, Bailey Fosdick, Nial Friel, Marie Galligan, Isabella Gollini, Sen Hu, Neil Hurley, Dimitris Karlis, Donal Martin, Tyler McCormick, Aaron McDaid, Damien McParland, Keefe Murphy, Tin Lok James Ng, Adrian O’Hagan, Niamh Russell, Michael Salter-Townshend, Lucy Small, Deirdre Toher, Ted Westling, Arthur White and Jason Wyse. Finally, he would like to thank his family, Trish, Áine and Emer for their love and support.

Adrian Raftery thanks Fionn Murtagh, with whom he first encountered model-based clustering and wrote his first paper in the area in 1984, Chris Fraley for a long and very fruitful collaboration, and Luca Scrucca for another very successful collaboration. He warmly thanks his Ph.D. students who have worked with him on model-based clustering, namely Jeff Banfield, Russ Steele, Raphael Gottardo, Nema Dean, Derek Stanford and William Chad Young for their collaboration and all that he learned from them. He also thanks his other coauthors in the area, Jogesh Babu, Jean-Patrick Baudry, Halima Bensmail, Roger Bumgarner, Simon Byers, Jon Campbell, Abhijit Dasgupta, Mary Emond, Eric Feigelson, Florence Forbes, Diane Georgian-Smith, Ken Lo, Alejandro Murua, Nathalie Peyrard, Christian Robert, Larry Ruzzo, Jean-Luc Starck, Ka Yee Yeung, Naisyin Wang and Ron Wehrens for excellent collaborations. Raftery would like to thank the Office of Naval Research and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD grants R01 HD054511 and R01 HD070936) for sustained research support without which this work could not have been carried out. He wrote part of the book during a fellowship year at the Center for Advanced Study in the Behavioral Sciences (CASBS) at Stanford University in 2017–2018, which provided an ideal environment for the sustained thinking needed to complete a project of this kind. Finally he would like to thank his wife, Hana Ševčíková, for her love and support through this project.

1 Introduction

Cluster analysis and classification are two important tasks that arise constantly in everyday life. As humans, our brain naturally clusters and classifies animals, objects or even ideas thousands of times a day, without fatigue. The emergence of science has led to many data sets with clustering structure that cannot be easily detected by the human brain, and so require automated algorithms. Also, with the advent of the “Data Age,” clustering and classification tasks are often repeated large numbers of times, and so need to be automated even if the human brain could carry them out.

This has led to a range of clustering and classification algorithms over the past century. Initially these were mostly heuristic, and developed without much reference to the statistical theory that was emerging in parallel. In the 1960s, it was realized that cluster analysis could be put on a principled statistical basis by framing the clustering task as one of inference for a finite mixture model. This has allowed cluster analysis to benefit from the inferential framework of statistics, and provide principled and reproducible answers to questions such as: How many clusters are there? What is the best clustering algorithm? How should we deal with outliers?

In this book, we describe and review the model-based approach to cluster analysis which has emerged in the past half-century, and is now an active research field. We describe the basic ideas, and aim to show the advantages of thinking in this way, as well as to review recent developments, particularly for newer types of data such as high-dimensional data, network data, textual data and image data.

1.1 Cluster Analysis

The goal of cluster analysis is to find meaningful groups in data. Typically, in the data these groups will be internally cohesive and separated from one another. The purpose is to find groups whose members have something in common that they do not share with members of other groups.
1.1.1 From Grouping to Clustering

The grouping of objects according to things they have in common goes back at least to the beginning of language. A noun (such as “hammer”) refers to any one of a set of different individual objects that have characteristics in common. As Greene (1909) remarked, “naming is classifying.” Plato was among the first to formalize this with his Theory of Forms, defining a Form as an abstract unchanging object or idea, of which there may be many instances in practice. For example, in Plato’s Cratylus dialogue, he has Socrates giving the example of a blacksmith’s tool, such as a hammer. There are many hammers in the world, but just one Platonic Form of “hammerness” which is the essence of all of them.

Aristotle, in his History of Animals, classified animals into groups based on their characteristics. Unlike Plato, he drew heavily on empirical observations. His student Theophrastus did something similar for plants in his Enquiry Into Plants.

An early and highly influential example of the systematic grouping of objects based on measured empirical characteristics is the system of biological classification or taxonomy of Linnaeus (1735), applied to plants by Linnaeus (1753) and to animals by Linnaeus (1758). For example, he divided plants into 24 classes, including flowers with one stamen (Monandria), flowers with two stamens (Diandria) and flowerless plants (Cryptogamia). Linnaeus’ methods were based on data but were subjective. Adanson (1757, 1763) developed less subjective methods using multiple characteristics of organisms.

Cluster analysis is something more: the search for groups in quantitative data using systematic numerical methods. Perhaps the earliest methods that satisfy this description were developed in anthropology, and mainly consisted of defining quantitative measures of difference and similarity between objects (Czekanowski, 1909, 1911; Driver and Kroeber, 1932). Most of the early clustering methods were based on measures of similarity between objects, and Czekanowski (1909) seems to have been the first to define such a measure for clustering purposes. Then development shifted to psychology, where Zubin (1938) proposed a method for rearranging a correlation matrix to yield clusters.
Stephenson (1936) proposed the use of factor analysis to identify clusters of people, while, in what seems to have been the first book on cluster analysis, Tryon (1939) proposed a method for clustering variables similar to what is now called multiple group factor analysis. Cattell (1944) also introduced several algorithmic and graphical clustering methods. In the 1950s, development shifted again to biological taxonomy, the original problem addressed by the ancient Greeks and the eighteenth-century scientists interested in classification. It was in this context that the single link hierarchical agglomerative clustering method (Sneath, 1957), the average link method and the complete link method (Sokal and Michener, 1958) were proposed. These are sometimes thought of as marking the beginning of cluster analysis, but in fact they came after a half-century of previous, though relatively slow development. They did mark the takeoff of the area, though. The development of computational power and the publication of the important book of Sokal and Sneath (1963) led to a rapid expansion of the use and methodology of cluster analysis, which has not stopped in the past 60 years. Many of the ensuing developments in the 1970s and 1980s were driven by applications in market research as well as biological taxonomy. From the 1990s


there was a further explosion of interest fueled by new types of data and questions, often involving much larger data sets than before. These include finding groups of genes or people using genetic microarray data, finding groups and patterns in retail barcode data, finding groups of users and websites from Internet use data, and automatic document clustering for technical documents and websites. Another major area of application has been image analysis. This includes medical image segmentation, for example for finding tumors in digital medical images such as X-rays, CAT scans, MRI scans and PET scans. In these applications, a cluster is typically a set of pixels in the image. Another application is image compression, using methods such as color image quantization, where a cluster would correspond to a set of color levels. For a history of cluster analysis to 1988, see Blashfield and Aldenderfer (1988).

1.1.2 Model-based Clustering

Most of the earlier clustering methods were algorithmic and heuristic. The majority were based on a matrix of measures of similarity between objects, which were in turn derived from the objects’ measured characteristics. The purpose was to divide or partition the data into groups such that objects in the same group were similar, and were dissimilar from objects in other groups. A range of automatic algorithms for doing this was proposed, starting in the 1930s. These developments took place largely in isolation from mainstream statistics, much of which was based on a probability distribution for the data. At the same time, they left several practical questions unresolved: Which of the many available clustering methods should we use? How many clusters should we use? How should we treat outliers, or objects that do not fall into any group? How sure are we of a clustering partition, and how should we assess uncertainty about it?
The mainstream statistical approach of specifying a probability model for the full data set has the potential to answer these questions. The main statistical model for clustering is a finite mixture model, in which each group is modeled by its own probability distribution. The first successful method of this kind was developed in sociology in the early 1950s for multivariate discrete data, where multiple characteristics are measured for each object, and each characteristic can take one of several values. Data of this kind are common in the social sciences, and are typical, for example, of surveys. The model proposed was called the latent class model, and it specified that within each group the characteristics were statistically independent (Lazarsfeld, 1950a,c). We discuss this model and its development in Chapter 6. The dominant model for clustering continuous-valued data is the mixture of multivariate normal distributions. This seems to have been first mentioned by Wolfe (1963) in his Master’s thesis at Berkeley. John Wolfe subsequently developed the first real software for estimating this model, called NORMIX, and also developed related theory (Wolfe, 1965, 1967, 1970), so he has a real claim to be called the inventor of model-based clustering for continuous data. Wolfe proposed estimating the model by maximum likelihood using the EM algorithm,


which is striking since he did so ten years before the article of Dempster et al. (1977) that popularized the EM algorithm. This remains the most used estimation approach in model-based clustering. We outline the early history of model-based clustering in Section 2.9, after we have introduced the main ideas.

Basing cluster analysis on a probability model has several advantages. In essence, this brings cluster analysis within the range of standard statistical methodology and makes it possible to carry out inference in a principled way. It turns out that many of the previous heuristic methods correspond approximately to particular clustering models, and so model-based clustering can provide a way of choosing between clustering methods, and encompasses many of them in its framework. In our experience, when a clustering method does not correspond to any probability model, it tends not to work very well. Conversely, understanding what probability model a clustering method corresponds to can give one an idea of when and why it will work well or badly. It also provides a principled way to choose the number of clusters. In fact, the choice of clustering model and of number of clusters can be reduced to a single model selection problem. It turns out that there is a trade-off between these choices. Often, if a simpler clustering model is chosen, more clusters are needed to represent the data adequately. Basing cluster analysis on a probability model also leads to a way of assessing uncertainty about the clustering. In addition, it provides a systematic way of dealing with outliers by expanding the model to account for them.

1.2 Classification

The problem of classification (also called discriminant analysis) involves classifying objects into classes when there is already information about the nature of the classes. This information often comes from a data set of objects that have already been classified by experts or by other means.
Classification aims to determine which class new objects belong to, and develops automatic algorithms for doing so. Typically this involves assigning new observations to the class whose objects they most closely resemble in some sense. Classification is said to be a “supervised” problem in the sense that it requires the supervision of experts to provide some examples of the classes. Clustering, in contrast, aims to divide a set of objects into groups without any examples of the “true” classes, and so is said to be an “unsupervised” problem.

1.2.1 From Taxonomy to Machine Learning

The history of classification is closely related to that of clustering. Indeed, the practical interest of taxonomies of animals or plants is to use them to recognize samples in the field. For centuries, the task of classification was carried out by humans, such as biologists, botanists or doctors, who learned to assign new observations to specific species or diseases. Until the twentieth century, this was done without automatic algorithms.


The first statistical method for classification is due to Ronald Fisher in his famous work on discriminant analysis (Fisher, 1936). Fisher asked what linear combination of several features best discriminates between two or more populations. He applied his methodology, known nowadays as Fisher’s discriminant analysis or linear discriminant analysis, to a data set on irises that he had obtained from the botanist Edgar Anderson (Anderson, 1935). In a following article (Fisher, 1938), he established the links between his discriminant analysis method and several existing methods, in particular analysis of variance (ANOVA), Hotelling’s T-squared distribution (Hotelling, 1931) and the Mahalanobis generalized distance (Mahalanobis, 1930). In his 1936 paper, Fisher also acknowledged the use of a similar approach, without formalization, in craniometry for quantifying sex differences in measurements of the mandible. Discriminant analysis rapidly expanded to other application fields, including medical diagnosis, fault detection, fraud detection, handwriting recognition, spam detection and computer vision. Fisher’s linear discriminant analysis provided good solutions for many applications, but other applications required the development of specific methods. Among the key methods for classification, logistic regression (Cox, 1958) extended the usual linear regression model to the case of a categorical dependent variable and thus made it possible to do binary classification. Logistic regression had a great success in medicine, marketing, political science and economics. It remains a routine method in many companies, for instance for mortgage default prediction within banks or for click-through rate prediction in marketing companies. Another key early classification method was the perceptron (Rosenblatt, 1958). Originally designed as a machine for image recognition, the perceptron algorithm is supposed to mimic the behavior of neurons for making a decision. 
Although the first attempts were promising, the perceptron appeared not to be able to recognize many classes without adding several layers. The perceptron is recognized as one of the first artificial neural networks, which have recently revolutionized the classification field, partly because of the massive increase in computing capabilities. In particular, convolutional neural networks (LeCun et al., 1998) use a variation of multilayer perceptrons and display impressive results in specific cases. Before the emergence of convolutional neural networks and deep learning, support vector machines also advanced classification performance at the end of the 1990s. The original support vector machine (SVM) algorithm (Cortes and Vapnik, 1995) was invented in 1963, but it did not see its first implementation until 1992, thanks to the “kernel trick” (Boser et al., 1992). SVMs are a family of classifiers, defined by the choice of a kernel, which map the original data through a nonlinear projection into a high-dimensional space, where they are linearly separable by a hyperplane. One of the reasons for the popularity of SVMs was their ability to handle data of various types thanks to the notion of kernel. As we will see in this book, statistical methods were able to follow the different revolutions in the performance of supervised classification. In addition, some of


the older methods remain reference methods because they perform well with low complexity.
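The mapping idea described above for SVMs can be illustrated with a small sketch in R. It is not an SVM: it uses simulated data and an explicit, hand-chosen feature map (the squared radius added as a third coordinate) in place of a kernel, and the threshold of 4 is an assumption that suits this particular simulation. Two classes lying on concentric circles cannot be separated by a straight line in the original two dimensions, but a plane separates them after the mapping.

```r
# Simulated data (an assumption for illustration): two noisy concentric
# circles of radius 1 and 3, which no straight line in (x1, x2) can separate.
set.seed(1)
n <- 100
theta <- runif(2 * n, 0, 2 * pi)
r <- c(rep(1, n), rep(3, n)) + rnorm(2 * n, sd = 0.1)
x <- cbind(x1 = r * cos(theta), x2 = r * sin(theta))
y <- rep(c(-1, 1), each = n)   # -1 = inner circle, 1 = outer circle

# Explicit nonlinear map into three dimensions: append the squared radius.
phi <- cbind(x, z = rowSums(x^2))

# In the mapped space the plane z = 4 acts as a separating hyperplane.
pred <- ifelse(phi[, "z"] > 4, 1, -1)
accuracy <- mean(pred == y)
accuracy
```

A kernel method would obtain the same effect without computing the mapped coordinates explicitly, by working only with inner products in the mapped space.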

1.2.2 Model-based Discriminant Analysis

Fisher discriminant analysis (FDA; Fisher, 1936) was the first classification method. Although Fisher did not describe his methodology within a statistical modeling framework, it is possible to recast FDA as a model-based method. Assuming normal distributions for the classes with a common covariance matrix yields a classification rule which is based on Fisher’s discriminant function. This classification method, named linear discriminant analysis (LDA), also provides a way to calculate the optimal threshold to discriminate between the classes within Fisher’s discriminant subspace (Fukunaga, 1999). An early work considering class-conditional distributions in the case of discriminant analysis is due to Welch (1939). He gave the first version of a classification rule in the case of two classes with normal distributions, using either Bayes’ theorem (if the prior probabilities of the classes are known) or the Neyman–Pearson lemma (if these prior probabilities have to be estimated). Wald (1939, 1949) developed the theory of decision functions, which offers a sound statistical framework for further work in classification. Wald (1944) considered the problem of assigning an individual to one of two groups under normal distributions with a common covariance matrix, the solution of which involves Fisher’s discriminant function. Von Mises (1945) addressed the problem of minimizing the classification error in the case of several classes and proposed a general solution to it. Rao (1948, 1952, 1954) extended this to consider the estimation of a classification rule from samples. See Das Gupta (1973) and McLachlan (1992) for reviews of the earlier development of this area.

Once the theory of statistical classification had been well established, researchers had to face new characteristics of the data, such as high-dimensional data, low sample sizes, partially supervised data and non-normality.
Regarding high-dimensional data, McLachlan (1976) realized the importance of variable selection to avoid the curse of dimensionality in discriminant analysis. Banfield and Raftery (1993) and Bensmail and Celeux (1996) proposed alternative approaches using constrained Gaussian models. Regarding partially supervised classification, McLachlan and Ganesalingam (1982) considered the use of unlabeled data to update a classification rule in order to reduce the classification error. Regarding non-normality, Celeux and Mkhadri (1992) proposed a regularized discriminant analysis technique for high-dimensional discrete data, while Hastie and Tibshirani (1996) considered the classification of non-normal data using mixtures of Gaussians. Specific methods for classification with categorical data are presented in Chapter 4. These topics are developed in Chapters 4, 5 and 8.
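As a minimal sketch of the model-based view of linear discriminant analysis described above, the following base R code fits the common-covariance normal model to the iris data set shipped with R and assigns each flower to the class with the largest linear discriminant score. The variable names are ours, the observed class proportions are used as prior probabilities (an assumption), and the error rate is computed on the training data, so it is optimistic.

```r
# Iris measurements and species labels (datasets package, loaded by default).
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)
n <- nrow(X)
K <- length(classes)

# Estimates under the model: class proportions, class means (K x p) and
# a covariance matrix common to all classes (pooled within-class estimate).
priors <- as.numeric(table(y)) / n
means <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
pooled <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  crossprod(sweep(Xk, 2, colMeans(Xk)))   # within-class sum of squares
})) / (n - K)

# Linear discriminant score of observation x for class k:
#   x' S^{-1} mu_k - 0.5 * mu_k' S^{-1} mu_k + log(pi_k)
Sinv <- solve(pooled)
const <- -0.5 * diag(means %*% Sinv %*% t(means)) + log(priors)
scores <- sweep(X %*% Sinv %*% t(means), 2, const, "+")

# Assign each observation to the class with the highest score.
pred <- factor(classes[max.col(scores, ties.method = "first")], levels = classes)
table(truth = y, predicted = pred)
train_error <- mean(pred != y)
train_error
```

Because the scores are linear in x, the boundaries between classes are hyperplanes; allowing a separate covariance matrix for each class would make them quadratic instead.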


1.3 Examples

We now briefly describe some examples of cluster analysis and discriminant analysis.

Example 1: Old Faithful geyser data
The Old Faithful geyser in Yellowstone National Park, Wyoming erupts every 35–120 minutes for about one to five minutes. It is useful for rangers to be able to predict the time to the next eruption. The time to the next eruption and its duration are related, in that the longer an eruption lasts, the longer the time until the next one. The data we will consider in this book consist of observations on 272 eruptions (Azzalini and Bowman, 1990). Data on two variables were measured: the time from one eruption to the next one, and the duration of the eruption. These data are often used to illustrate clustering methods.

Example 2: Diagnosing type of diabetes
Figure 1.1 shows measurements made on 145 subjects with the goal of diagnosing diabetes and, for diabetic patients, the type of diabetes present. The data consist of the area under a plasma glucose curve (glucose area), the area under a plasma insulin curve (insulin area) and the steady-state plasma glucose response (SSPG). The subjects were subsequently clinically classified into three groups: chemical diabetes (Type 1), overt diabetes (Type 2), and normal (non-diabetic). The goal of our analysis is either to develop a method for grouping patients into clusters corresponding to diagnostic categories, or to learn a classification rule able to predict the status of a new patient. These data were described and analyzed by Reaven and Miller (1979).

Example 3: Breast cancer diagnosis
In order to diagnose breast cancer, a fine needle aspirate of a breast mass was collected, and a digitized image of it was produced.
The cells present in the image were identified, and for each cell nucleus the following characteristics were measured from the digital image: (a) radius (mean of distances from center to points on the perimeter); (b) texture (standard deviation of gray-scale values); (c) perimeter; (d) area; (e) smoothness (local variation in radius lengths); (f) compactness (perimeter² / area − 1); (g) concavity (severity of concave portions of the contour); (h) concave points (number of concave portions of the contour); (i) symmetry; (j) fractal dimension. The mean, standard deviation and extreme values of the 10 characteristics across cell nuclei were then calculated, yielding 30 features of the image (Street et al., 1993; Mangasarian et al., 1995). A pairs plot of three of the 30 variables is shown in Figure 1.2. These were selected as representing a substantial amount of the variability in the data, and in fact are the variables with the highest loadings on each of the first three principal components of the data, based on the correlation matrix. This looks like a more challenging clustering/classification problem than the first two examples (Old Faithful data and diabetes diagnosis data), where clustering

Figure 1.1 Diabetes data pairs plot: three measurements on 145 patients. Source: Reaven and Miller (1979).
Panel variables: glucose, insulin, sspg.

was apparent visually from the plots of the data. Here it is hard to discern clustering in Figure 1.2. However, we will see in Chapter 2 that in this higher dimensional setting, model-based clustering can detect clusters that agree well with clinical criteria.

Example 4: Wine varieties
Classifying food and drink on the basis of characteristics is an important use of cluster analysis. We will illustrate this with a data set giving up to 27 physical and chemical measurements on 178 wine samples (Forina et al., 1986). The goal of an analysis like this is to partition the samples into types of wine, and potentially also by year of production. In this case we know the right answer: there are three types of wine, and the year in which each sample was produced is also known.

Figure 1.2 Pairs plot of three of the 30 measurements on breast cancer diagnosis images.
Panel variables: points1, texture2, concavity1.

Thus we can assess how well various clustering methods perform. We will see in Chapter 2 that model-based clustering is successful at this task.

Example 5: Craniometric analysis
Here the task is to classify skulls according to the populations from which they came, using cranial measurements. We will analyze data with 57 cranial measurements on 2,524 skulls. As we will see in Chapter 2, a major issue is determining the number of clusters, or populations.

Example 6: Identifying minefields
We consider the problem of detecting surface-laid minefields on the basis of an image from a reconnaissance aircraft. After processing, such an image is reduced to a list of objects, some of which may be mines and some of which may be “clutter”

Figure 1.3 Minefield data. Left: observed data. Right: true classification into mines and clutter.

or noise, such as other metal objects or rocks. The objects are small and can be represented by points without losing much information. The analyst’s task is to determine whether or not minefields are present, and where they are. A typical data set is shown in Figure 1.3.¹ The true classification of the data between mines and clutter is shown in the right panel of Figure 1.3. These data are available as the chevron data set in the mclust R package. This problem is challenging because the clutter points form over two-thirds of the data and are not separated from the mines spatially, but rather by their density.

¹ Actual minefield data were not available, but the data in Figure 1.3 were simulated according to specifications developed at the Naval Coastal Systems Station, Panama City, Florida, to represent minefield data encountered in practice (Muise and Smith, 1992).

Example 7: Vélib data
This data set has been extracted from the Vélib large-scale bike sharing system of Paris, through the open-data API provided by the operator JCDecaux. The real-time data are available at https://developer.jcdecaux.com/ (with an API key). The data set consists of information (occupancy, number of broken docks, ...) about bike stations collected on the Paris bike sharing system over five weeks, between February 24 and March 30, 2014. Figure 1.4 presents a map of the Vélib stations in Paris (left panel) and loading profiles of some Vélib stations (right panel). The red dots correspond to the stations for which the loading profiles are displayed on the right panel. The information can be analyzed in different ways, depending on the objective or the data type. For instance, the data were first used in Bouveyron et al. (2015) in the context of functional time series clustering, in order to recover the temporal pattern of use of the bike stations. This data set will be analyzed in this book

Figure 1.4 Map of the Vélib stations in Paris (left panel) and loading profiles of some Vélib stations (right panel). The red dots correspond to the stations for which the loading profiles are displayed on the right panel.

alternatively as count data (6), as functional data and as multivariate functional data (12). These data are available as the velib data set in the funFEM package.

Example 8: Chemometrics data
High-dimensional data are increasingly common in scientific fields. The NIR data set (Devos et al., 2009) is one example, from a problem in chemometrics of discriminating between three types of textiles. The 202 textile samples are analyzed here with a near-infrared spectrometer that produces spectra with 2,800 wavelengths. Figure 1.5 presents the textile samples as 2,800-dimensional spectra, where the colors indicate the textile types. This is a typical situation in chemometrics, biology and medical imaging, where the number of samples is lower than the number of recorded variables. This situation is sometimes called “ultra-high dimensionality”. It is difficult for most clustering and classification techniques, as will be discussed in Chapter 8. Nevertheless, when using appropriate model-based techniques, it is possible to exploit the blessing of high-dimensional spaces to efficiently discriminate between the different textile types. The data set is available in the MBCbook package.

Example 9: Zachary’s karate club network data
Zachary’s karate club data set (Zachary, 1977) consists of the friendship network of 34 members of a university-based karate club. It was collected following observations over a three-year period in which a factional division caused the members of the club to formally separate into two organizations. While friendships within the club arose organically, a financial dispute regarding the pay of part-time instructor Mr. Hi tested these ties, with two political factions developing. Key to the dispute were two members of the network, Mr. Hi and club president John A. The dispute eventually led to the dismissal of Mr. Hi by John A., and Mr. Hi’s


Figure 1.5 Some textile samples of the three-class NIR data set.

supporters resigned from the karate club and established a new club, headed by Mr. Hi. The data set exhibits many of the phenomena observed in social networks, in particular clustering, or community structure. The friendship network is shown in Figure 1.6, with the locations of Mr. Hi and John A. highlighted.

1.4 Software

We will give examples of software code to implement our analyses throughout the book. Fortunately, model-based clustering is well served by good software, mostly in the form of R packages. We will primarily use the R packages mclust (Scrucca et al., 2016) and Rmixmod (Langrognet et al., 2016), each of which carries out general model-based clustering and classification and has a rich array of capabilities. The capabilities of these two packages overlap to some extent, but not completely. An advantage of using R is that it allows one to easily use several different packages in the same analysis. We will also give examples using several other R packages that provide additional model-based clustering and classification capabilities. These include FlexMix (Leisch, 2004), fpc (Hennig, 2015a), prabclus (Hennig and Hausdorf, 2015), pgmm (McNicholas et al., 2018), tclust (Iscar et al., 2017), clustMD (McParland and Gormley, 2017) and HDclassif (Bergé et al., 2016).
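Before running the examples, it can help to check which of these packages are present in the local R library. The following is a minimal sketch, assuming the packages are distributed under the names shown (FlexMix is assumed here to be installed as "flexmix"); the variable names are illustrative only, and current availability on CRAN is not guaranteed.

```r
# Packages named in this section. The install names are an assumption:
# in particular, FlexMix is assumed to be distributed as "flexmix".
pkgs <- c("mclust", "Rmixmod", "flexmix", "fpc", "prabclus",
          "pgmm", "tclust", "clustMD", "HDclassif")

# requireNamespace() reports TRUE/FALSE without attaching the package,
# so this check does not alter the search path.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Packages that would still need to be installed
missing_pkgs <- pkgs[!available]

# install.packages(missing_pkgs)  # uncomment to install, if still hosted
```

Because the check only loads namespaces, it is safe to run repeatedly, and `missing_pkgs` can be passed to `install.packages()` once the list has been reviewed.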


Figure 1.6 The friendship network of Zachary’s karate club. The two key members in the dispute within the club, Mr. Hi and John A., are labeled and colored differently.

1.5 Organization of the Book

In Chapter 2 we introduce the basic ideas of model-based clustering. In Chapter 3 we discuss some common difficulties with the framework, namely outliers, degeneracies and non-Gaussian mixture components, and we describe some initial, relatively simple strategies for overcoming them. In Chapter 4 we describe model-based approaches to classification. This differs from clustering in that a training set with known labels or cluster memberships is available. In Chapter 5 we extend this to discuss semi-supervised classification, in which unlabeled data are used as part of the training data. Chapter 6 is devoted to model-based clustering involving discrete data, starting with the latent class model, which was the earliest instance of model-based clustering of any kind. We also consider ordinal data, and data of mixed type, i.e. that include both discrete and continuous variables. In Chapter 7 we consider the selection of variables for clustering, for both continuous and discrete data. This is important because if variables are used that are not useful for clustering, the performance of algorithms can be degraded. In Chapter 8 we describe methods for model-based clustering for high-dimensional data. Standard model-based clustering methods can be used in principle regardless of data dimension, but if the dimension is high their performance can


decline. We describe a range of dimension reduction, regularization and subspace methods. Chapter 9 describes ways of clustering data where the component distributions are non-Gaussian by modeling them explicitly, in contrast with the component merging methods of Chapter 3. In Chapter 10 we describe model-based approaches to clustering nodes in networks, with a focus on social network data. Chapter 11 treats methods for model-based clustering with covariates, with a focus on the mixture of experts model. Finally, in Chapter 12, we describe model-based clustering methods for a range of less standard kinds of data. These include functional data, textual data and images. They also include data in which both the rows and the columns of the data matrix may have clusters and we want to detect them both in the same analysis. The methods we describe fall under the heading of model-based co-clustering.

1.6 Bibliographic Notes

A number of books have been written on the topic of cluster analysis and mixture modeling and these books are of particular importance in the area of model-based clustering. Hartigan (1975) wrote an early monograph on cluster analysis, which included detailed chapters on k-means clustering and Gaussian model-based clustering. Everitt and Hand (1981) wrote a short monograph on finite mixture models including Gaussian, exponential, Poisson and binomial mixture models. Titterington et al. (1985) is a detailed monograph on mixture models and related topics. McLachlan and Basford (1988) is a detailed monograph on finite mixture modeling and clustering. Everitt (1993) is a textbook on cluster analysis which covers a broad range of approaches. Lindsay (1995) gives an in-depth overview of theoretical aspects of finite mixture modeling. Gordon (1999) gives a thorough overview of clustering methods, including model-based approaches. McLachlan and Peel (2000) is an extensive book on finite mixture modeling and clustering. Frühwirth-Schnatter (2006) is a detailed monograph on finite mixtures and Markov models with a particular focus on Bayesian modeling. McNicholas (2016a) gives a thorough overview of model-based clustering approaches developed from a range of different finite mixtures. Hennig et al. (2015) is an edited volume on the topic of cluster analysis including chapters on model-based clustering. Mengersen et al. (2011) and Celeux et al. (2018a) are edited volumes on the topic of mixture modeling which include chapters on model-based clustering and applications.

2 Model-based Clustering: Basic Ideas

In this chapter we review the basic ideas of model-based clustering. Model-based clustering is a principled approach to cluster analysis, based on a probability model and using standard methods of statistical inference. The probability model on which it is based is a finite mixture of multivariate distributions, and we describe this in Section 2.1, with particular emphasis on the most used family of models, namely mixtures of multivariate normal distributions. In Section 2.3 we describe maximum likelihood estimation for this model, with an emphasis on the Expectation-Maximization (EM) algorithm. The EM algorithm is guaranteed to converge to a local optimum of the likelihood function, but not to a global maximum, and so the choice of starting point can be important. We describe some approaches to this in Section 2.4. In Section 2.5, examples with known numbers of clusters are presented. Two persistent questions in the practical use of cluster analysis are: how many clusters are there? and, which clustering method should be used? It turns out that determining the number of clusters can be viewed as a model choice problem, and the choice of clustering method is often at least approximately related to the choice of probability model. Thus both of these questions are answered simultaneously in the model-based clustering framework by the choice of an appropriate probability model, and we discuss model selection in Section 2.6. We will work through some examples in Section 2.7.
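To make the remark about starting points concrete, the following is a minimal base-R sketch of an EM iteration for a two-component univariate normal mixture, applied to the Old Faithful waiting times (the `faithful` data frame shipped with R). It is an illustration under stated assumptions, not the algorithm as implemented in mclust or Rmixmod: the function `em_two_normals()`, its arguments and the starting values are hypothetical choices made for this example.

```r
# Hypothetical helper: EM for a two-component univariate normal mixture.
# mu, sigma and tau are length-2 vectors of starting values.
em_two_normals <- function(y, mu, sigma, tau, max_iter = 500, tol = 1e-8) {
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # Weighted component densities tau_g * N(y_i | mu_g, sigma_g^2)
    dens <- cbind(tau[1] * dnorm(y, mu[1], sigma[1]),
                  tau[2] * dnorm(y, mu[2], sigma[2]))
    loglik <- sum(log(rowSums(dens)))
    # Stop when the log-likelihood gain falls below the tolerance
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik

    # E-step: conditional probability that observation i came from component g
    z <- dens / rowSums(dens)

    # M-step: weighted updates of proportions, means and standard deviations
    n_g <- colSums(z)
    tau <- n_g / length(y)
    mu <- colSums(z * y) / n_g
    sigma <- sqrt(colSums(z * outer(y, mu, "-")^2) / n_g)
  }
  list(mu = mu, sigma = sigma, tau = tau, loglik = loglik, iterations = iter)
}

# Starting values chosen by eye for illustration only
fit <- em_two_normals(faithful$waiting,
                      mu = c(50, 80), sigma = c(5, 5), tau = c(0.5, 0.5))
```

Rerunning `em_two_normals()` from several different starting values and comparing the returned `loglik` is a simple way to see whether the algorithm has settled on different local optima.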

2.1 Finite Mixture Models

Suppose we have data that consist of $n$ multivariate observations, $y_1, \ldots, y_n$, each of dimension $d$, so that $y_i = (y_{i,1}, \ldots, y_{i,d})$. A finite mixture model represents the probability distribution or density function of one multivariate observation, $y_i$, as a finite mixture or weighted average of $G$ probability density functions, called mixture components:

$$p(y_i) = \sum_{g=1}^{G} \tau_g f_g(y_i \mid \theta_g). \tag{2.1}$$

In Equation (2.1), $\tau_g$ is the probability that an observation was generated by the $g$th component, with $\tau_g \geq 0$ for $g = 1, \ldots, G$, and $\sum_{g=1}^{G} \tau_g = 1$, while $f_g(\cdot \mid \theta_g)$ is the density of the $g$th component given the values of its parameters $\theta_g$.
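Equation (2.1) can be evaluated directly for the univariate normal case. The sketch below is a base-R illustration, assuming normal components; `dmixnorm()` is a hypothetical helper written for this example (it is not a function from mclust or Rmixmod), and the parameter values are made up.

```r
# Hypothetical helper: density of a finite univariate normal mixture,
# i.e. the sum over g of tau_g * N(y | mu_g, sigma_g^2).
dmixnorm <- function(y, tau, mu, sigma) {
  stopifnot(all(tau >= 0), abs(sum(tau) - 1) < 1e-8,
            length(tau) == length(mu), length(mu) == length(sigma))
  # One column per component, each weighted by its mixing proportion
  comp <- sapply(seq_along(tau),
                 function(g) tau[g] * dnorm(y, mu[g], sigma[g]))
  # matrix() keeps the row layout even when y has length one
  rowSums(matrix(comp, nrow = length(y)))
}

# Made-up parameter values, for illustration only
tau <- c(0.3, 0.7)
mu <- c(0, 4)
sigma <- c(1, 1.5)

# A valid density should integrate to (approximately) one
total_mass <- integrate(dmixnorm, -Inf, Inf,
                        tau = tau, mu = mu, sigma = sigma)$value
```

The constraints on the mixing proportions are checked inside the function, so an invalid set of weights fails early instead of silently returning something that is not a density.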


[Plot: "Density for 1-dim 2-component normal mixture model"; x-axis: y; y-axis: Probability Density; legend: Mixture, Component 1, Component 2]

Figure 2.1 Probability density function for a one-dimensional univariate finite normal mixture with two mixture components. The individual component densities multiplied by their mixture probabilities are shown in red and blue, respectively, and the resulting overall mixture density (which is the sum of the red and blue curves) is the black curve. The dots show a sample of size 272 simulated from the density, with the colors indicating the mixture component from which they were generated.

Most commonly, $f_g$ is a multivariate normal distribution. In the univariate case, when $y_i$ is one-dimensional, $f_g(y_i \mid \theta_g)$ is a $N(\mu_g, \sigma_g^2)$ density function, and $\theta_g = (\mu_g, \sigma_g)$, consisting of the mean and standard deviation for the $g$th mixture component. Figure 2.1 shows an example of the density function for a univariate finite normal mixture model with two mixture components, together with a sample simulated from it. The model parameters were selected by estimating such a model for the Waiting Time data from the Old Faithful data set in Example 1, yielding $\mu_1 = 54.7$, $\mu_2 = 80.1$, $\sigma_1 = 5.91$, $\sigma_2 = 5.87$, $\tau_1 = 0.362$ and $\tau_2 = 0.638$. The R code to produce Figure 2.1 is shown in Listing 2.1. The density is clearly bimodal, and the density is lower in the middle between the two humps, so there is some separation between the two mixture components.
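Simulating from such a mixture takes two steps: draw a component label with probabilities $\tau_g$, then draw the observation from that component. The following base-R sketch uses the parameter values quoted above and the sample size of 272 given in the caption of Figure 2.1; the seed and the variable names are arbitrary choices for this illustration, so the simulated points will not match those in the figure.

```r
set.seed(1)  # arbitrary seed, for reproducibility of this sketch only

# Parameter values quoted in the text for the Old Faithful waiting times
tau <- c(0.362, 0.638)
mu <- c(54.7, 80.1)
sigma <- c(5.91, 5.87)
n <- 272

# Step 1: draw the component label of each observation
g <- sample(1:2, size = n, replace = TRUE, prob = tau)

# Step 2: draw each observation from its own component
y <- rnorm(n, mean = mu[g], sd = sigma[g])

# Mixture density, used to check the dip between the two humps
dens <- function(x) {
  tau[1] * dnorm(x, mu[1], sigma[1]) + tau[2] * dnorm(x, mu[2], sigma[2])
}
midpoint <- mean(mu)
dip <- dens(midpoint) < min(dens(mu))  # TRUE when the density is bimodal here
```

Comparing the density at the midpoint of the two means with its value at each mean is a quick numerical check of the separation described above.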


Listing 2.1: R code for Figure 2.1
# Estimate parameters of 1d finite normal mixture model
# with 2 components for Waiting time observations from
# the Old Faithful data set:
library(mclust)
waiting